<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-case-studies/blob/master/neural_machine_translation_for_hindi_english.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural Machine Translation for Hindi-English: Sequence to sequence learning

This article walks through building a basic Neural Machine Translation (NMT) model using sequence-to-sequence learning with LSTMs (Long Short-Term Memory networks). While there are numerous papers and blog posts on NMT, this is another attempt at highlighting some of the intuition behind it, together with a step-by-step guide that can be reused for similar NLP tasks. I have tried to keep a balance between technical and non-technical detail. I hope it helps. So let's get started!

References:

https://medium.com/analytics-vidhya/neural-machine-translation-for-hindi-english-sequence-to-sequence-learning-1298655e334a

https://github.com/richaranjan23/My_Projects/tree/master/MSc_dissertation

http://www.cfilt.iitb.ac.in/~moses/iitb_en_hi_parallel/iitb_corpus_download/

## Setup

In [2]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd

import string
import codecs
import re
import h5py
from string import digits

from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.bleu_score import SmoothingFunction
from sklearn.model_selection import train_test_split

from tensorflow.keras.layers import Input, LSTM, Embedding, Dense
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import plot_model, get_file

import matplotlib.pyplot as plt
%matplotlib inline

TensorFlow 2.x selected.


## The Corpus — Pre-processing

The languages chosen for this translation model are Hindi and English, and the parallel corpus was obtained from the [IIT Bombay Hindi-English Parallel Corpus](http://www.cfilt.iitb.ac.in/~moses/iitb_en_hi_parallel/iitb_corpus_download/). This is a Hindi-English parallel corpus containing 1,492,827 pairs of sentences. Note that the copy downloaded in this notebook loads as 1,561,840 lines per language (see `lines.shape` further down).

### Download the dataset

In [3]:
# ref: https://stackoverflow.com/questions/56976078/how-do-i-load-images-dataset-using-tf-keras-utils-get-file
dataset = tf.keras.utils.get_file(
    fname="parallel.tgz", 
    origin="http://www.cfilt.iitb.ac.in/~moses/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz", 
    extract=True,
)

Downloading data from http://www.cfilt.iitb.ac.in/~moses/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz


In [4]:
# check the downloaded data, which is saved in the ~/.keras/datasets directory
! ls ~/.keras/datasets

parallel  parallel.tgz


In [0]:
# copy the downloaded data into the current directory
! cp -r ~/.keras/datasets .

### Original data load

In [0]:
with open('datasets/parallel/IITB.en-hi.en', encoding='utf-8', errors='ignore') as f:
  eng_sentence = f.read().split('\n')[:-1]  # drop the empty string after the final newline
with open('datasets/parallel/IITB.en-hi.hi', encoding='utf-8', errors='ignore') as f:
  hin_sentence = f.read().split('\n')[:-1]

In [7]:
# prepare a dataframe holding both sets of sentences
lines = pd.DataFrame(columns=['eng', 'hindi'])
lines.eng = eng_sentence
lines.hindi = hin_sentence
lines.head()

| | eng | hindi |
|---|---|---|
| 0 | Give your application an accessibility workout | अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें |
| 1 | Accerciser Accessibility Explorer | एक्सेर्साइसर पहुंचनीयता अन्वेषक |
| 2 | The default plugin layout for the bottom panel | निचले पटल के लिए डिफोल्ट प्लग-इन खाका |
| 3 | The default plugin layout for the top panel | ऊपरी पटल के लिए डिफोल्ट प्लग-इन खाका |
| 4 | A list of plugins that are disabled by default | उन प्लग-इनों की सूची जिन्हें डिफोल्ट रूप से नि... |


In [8]:
lines.shape

(1561840, 2)

### Original data cleaning

In [9]:
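# truncate the dataset to the first 1,000,000 pairs;
# .copy() avoids SettingWithCopyWarning when the columns are reassigned later
lines = lines[:1000000].copy()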
lines.shape

(1000000, 2)

In [0]:
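# clean hindi text (raw strings avoid invalid-escape warnings for \s, \( and \))
hindi_clean = []
for line in lines.hindi:
  nstr = re.sub(r'\s+_\s+[A-Za-z]', '', line)  # whitespace, underscore, whitespace, Latin letter
  nstr = re.sub(r'[()]', '', nstr)             # parentheses
  nstr = re.sub(r'%.*?[A-Za-z]', '', nstr)     # from '%' up to the next Latin letter (e.g. %s, %d)
  nstr = re.sub(r'[A-Za-z]', '', nstr)         # any remaining Latin letters
  nstr = nstr.strip()
  nstr = re.sub(r'\s+', ' ', nstr)             # collapse runs of whitespace
  hindi_clean.append(nstr)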
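
To see what these substitutions do to a single line, the same steps can be wrapped in a small function and run on a couple of strings. This is only a sketch: `clean_hindi` is a hypothetical helper name and both inputs are made up rather than taken from the corpus.

```python
import re

def clean_hindi(line):
    """Apply the same sequence of substitutions as the Hindi cleaning loop."""
    line = re.sub(r'\s+_\s+[A-Za-z]', '', line)  # whitespace, underscore, whitespace, Latin letter
    line = re.sub(r'[()]', '', line)             # parentheses
    line = re.sub(r'%.*?[A-Za-z]', '', line)     # '%' up to the next Latin letter
    line = re.sub(r'[A-Za-z]', '', line)         # remaining Latin letters
    return re.sub(r'\s+', ' ', line.strip())     # trim and collapse whitespace

# hypothetical inputs, not corpus lines
print(clean_hindi('फाइल  (%s) खोलें Open'))  # format specifier, parentheses and Latin word are dropped
print(clean_hindi('50% छूट abc'))            # everything from '%' to the first Latin letter is dropped
```

The second call illustrates a side effect worth knowing about: because `%.*?[A-Za-z]` runs lazily from `%` to the next Latin letter, any Hindi text in between is deleted as well, so only `50` survives.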

In [0]:
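# clean english text (raw strings avoid invalid-escape warnings for \s)
eng_clean = []
for line in lines.eng:
  estr = re.sub(r'\s+_\s', '', line)           # whitespace, underscore, whitespace
  estr = re.sub(r'[^\x00-\x7F]+', ' ', estr)   # replace non-ASCII runs with a space
  estr = estr.strip()
  estr = re.sub(r'\s+', ' ', estr)             # collapse runs of whitespace
  eng_clean.append(estr)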

In [12]:
len(hindi_clean), len(eng_clean)

(1000000, 1000000)

In [0]:
# update dataframe hindi/english column with cleaned data
lines.hindi = hindi_clean
lines.eng = eng_clean

# make lower case
lines.hindi = lines.hindi.apply(lambda x: x.lower())
lines.eng = lines.eng.apply(lambda x: x.lower())

In [0]:
# remove punctuation
exclude = set(string.punctuation)
lines.hindi = lines.hindi.apply(lambda x: ''.join(ch for ch in x if ch not in exclude))
lines.eng = lines.eng.apply(lambda x: ''.join(ch for ch in x if ch not in exclude))

In [0]:
# remove digits from text
remove_digits = str.maketrans('', '', digits)
lines.hindi = lines.hindi.apply(lambda x: x.translate(remove_digits))
lines.eng = lines.eng.apply(lambda x: x.translate(remove_digits))
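
`string.digits` only covers the ASCII digits `0`-`9`, so Devanagari digits pass through this step unchanged (the `नोकिया २६१०` row in a later sample shows one such case). If those should be removed too, one possible extension is to add the Devanagari digit range U+0966 to U+096F to the translation table. This is a sketch, not part of the original pipeline; `remove_all_digits` and the sample string are assumptions made for illustration.

```python
from string import digits

# Devanagari digits ० to ९ occupy code points U+0966..U+096F
devanagari_digits = ''.join(chr(cp) for cp in range(0x0966, 0x0970))

# translation table that deletes both ASCII and Devanagari digits
remove_all_digits = str.maketrans('', '', digits + devanagari_digits)

sample = 'नोकिया २६१० model 2610'  # hypothetical input
cleaned = sample.translate(remove_all_digits)
print(repr(cleaned))
```

Applying it to the dataframe would follow the same pattern as above, for example `lines.hindi.apply(lambda x: x.translate(remove_all_digits))`.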

In [16]:
lines.sample(10)

| | eng | hindi |
|---|---|---|
| 47722 | voice mails | भ्वाएस मेल |
| 606662 | always blind carbon copy bcc to | इन्हें हमेशा अंध कार्बन प्रति भेजें |
| 661521 | he s carrying a cell phone because most incomi... | उसके पास मोबाईल है क्योकि आने वाली कॉल मुफ्त है |
| 949615 | somatopathy | शारीरिकरोग |
| 439410 | exyernal links | बाहरी कड़ियाँ |
| 829693 | open | उघरना |
| 175023 | rain sleet | |
| 883403 | final stage | उपसंहार |
| 783705 | mistake | भ्रांति |
| 618962 | your ubuntu release is not supported anymore | आपका उबुन्टू प्रकाशन अब समर्थित नहीं है |


In [17]:
lines.sample(20)

| | eng | hindi |
|---|---|---|
| 281431 | they said oppressed we have been ere thou came... | उन्होंने कहा तुम्हारे आने से पहले भी हम सताए ग... |
| 21336 | create corresponding header file | बनाएँ शीर्ष टिप्पणी फ़ाइल |
| 6381 | preview | प्रिंट पूर्व दर्शन |
| 394258 | he said i will by no means send him with you u... | उसने कहा मैं उसे तुम्हारे साथ कदापि नहीं भेज स... |
| 455893 | in theory american policymakers can break this... | सिद्धान्ततः अमेरिका के रणनीतिकार इस परिपाटी को... |
| 271761 | they said build him a building and cast him in... | वे बोले उनके लिए एक मकान अर्थात अग्निकुंड तैया... |
| 174175 | seconds | सेकण्ड |
| 74165 | no support for authentication type s | सत्यापन प्रकार के लिये कोई समर्थन नहीं |
| 612337 | general enquiries about policy on noise from a... | हवाई जहाजों से शोर पर नीति के बारे में सामान्य... |
| 424972 | the unbelievers spend their wealth to hinder m... | इसमें शक़ नहीं कि ये कुफ्फार अपने माल महज़ इस ... |


## Filter short sentences

In [0]:
short_sent_df = pd.DataFrame(columns=['eng', 'hindi'])  

In [0]:
eng_list = []
hindi_list = []

# keep only the pairs in which both sentences have at most 30 words
for eng1, hindi1 in zip(lines.eng, lines.hindi):

  # trim and collapse whitespace on both sides
  eng1 = re.sub(r'\s+', ' ', eng1.strip())
  hindi1 = re.sub(r'\s+', ' ', hindi1.strip())

  # str.split() with no argument ignores empty tokens,
  # so the list length is the number of non-empty words
  ctr1 = len(hindi1.split())
  ctr2 = len(eng1.split())

  if ctr1 <= 30 and ctr2 <= 30:
    eng_list.append(eng1)
    hindi_list.append(hindi1)

In [20]:
len(eng_list), len(hindi_list)

(892277, 892277)

In [21]:
short_sent_df.eng = eng_list
short_sent_df.hindi = hindi_list

short_sent_df.shape

(892277, 2)

In [22]:
short_sent_df.sample(5)

| | eng | hindi |
|---|---|---|
| 365299 | nokia | नोकिया २६१० |
| 585066 | sortie | उड़ान |
| 116461 | the reported error was quot quot the message h... | रिपोर्ट की गई त्रुटि थी सबसे अधिक संभावना वाली... |
| 743321 | permanent | अमिट |
| 638601 | unknowable | अलेखा |


In [23]:
short_sent_df.sample(10)

| | eng | hindi |
|---|---|---|
| 680976 | selection | इन्तख़ाब |
| 14222 | faq | |
| 625910 | bodily structure | शारीरीय संरचना |
| 637139 | unmatched | असाधारण |
| 534087 | of the foreign office in that section | उस अनुभाग में विदेश कार्यालय में |
| 834841 | pine cystoid nematode | चीड़ पुटिकाभ सूत्रकृमि |
| 227657 | bristles | à¤¸à¥à¤à¥à¤² |
| 601310 | but you go and look to the center of the galaxy | अब अगर आप आकाशगंगा के केन्द्र को देखें |
| 470820 | there were about freedom fighters sent to cell... | लगभग स्वतंत्रतासेनानी सेलूलर जेल में भेजे गये |
| 621945 | uninquiring | जिज्ञासाहीन |


## Generate Tokens

In [24]:
short_sent_df.eng = short_sent_df.eng.apply(lambda x: 'START_ ' + x + ' _END')
short_sent_df.sample(5)

| | eng | hindi |
|---|---|---|
| 618302 | START_ seasonable _END | समयोचित |
| 840235 | START_ rumination _END | रोमन्थन निगले हुए भोजन का पुनर्चर्वण |
| 737552 | START_ derelict _END | जर्जर |
| 140044 | START_ authorizing session _END | प्राधिकृत करें |
| 583752 | START_ we connect events and emotions _END | हम घटनाओं और भावनाओं से जुड़ते हैं |
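
The `START_` and `_END` markers are added to the English side, which is the target language here. In a typical sequence-to-sequence setup trained with teacher forcing, such markers let the decoder input and the decoder target be the same sentence shifted by one token. The sketch below shows that shift on one of the sampled sentences; it assumes this is how the markers are used later, which this section does not state explicitly.

```python
# one target sentence as it looks after the markers are added
sentence = 'START_ we connect events and emotions _END'
tokens = sentence.split(' ')

# the decoder reads everything except the final marker ...
decoder_input = tokens[:-1]
# ... and is asked to predict the same sequence shifted left by one step
decoder_target = tokens[1:]

for seen, expected in zip(decoder_input, decoder_target):
    print(f'{seen:>10} -> {expected}')
```

At inference time the same markers give a natural way to start decoding (feed `START_`) and to stop (when `_END` is produced).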


In [25]:
# collect the vocabulary of each language; a set ignores duplicates on its own
all_eng_words = set()
for eng in short_sent_df.eng:
    all_eng_words.update(eng.split(' '))

all_hindi_words = set()
for hindi in short_sent_df.hindi:
    all_hindi_words.update(hindi.split(' '))

len(all_eng_words), len(all_hindi_words)

(109472, 177258)

In [29]:
length_list = []
for l in short_sent_df.hindi:
  length_list.append(len(l.split(' ')))
np.max(length_list)

30

In [30]:
length_list = []
for l in short_sent_df.eng:
  length_list.append(len(l.split(' ')))
np.max(length_list)

32
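
The two maxima agree with the earlier steps: the filter keeps sentences of at most 30 words, and the English side then gains the two markers `START_` and `_END`, giving 30 + 2 = 32. The sketch below reproduces that arithmetic on made-up sentences; `max_tokens` and `MAX_WORDS` are hypothetical names introduced only for this illustration.

```python
def max_tokens(sentences):
    """Largest number of space-separated tokens in any of the sentences."""
    return max(len(s.split(' ')) for s in sentences)

MAX_WORDS = 30  # cut-off used by the sentence-length filter

# hypothetical sentences that sit exactly at the cut-off
longest_hindi = ' '.join(['शब्द'] * MAX_WORDS)
longest_eng = 'START_ ' + ' '.join(['word'] * MAX_WORDS) + ' _END'

max_len_hindi = max_tokens([longest_hindi, 'उड़ान'])
max_len_eng = max_tokens([longest_eng, 'START_ sortie _END'])
print(max_len_hindi, max_len_eng)
```

These two values are the sequence lengths the encoder and decoder inputs would need to be padded to if fixed-size arrays are used.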