<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-case-studies/blob/master/neural_machine_translation_for_hindi_english.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural Machine Translation for Hindi-English: Sequence to sequence learning

This article walks through building a basic Neural Machine Translation (NMT) model with the sequence-to-sequence learning approach, using LSTM (Long Short-Term Memory) networks. There are already many papers and blog posts on NMT; this one tries to highlight some of the intuition behind the approach and to serve as a step-by-step guide for similar NLP tasks. I have tried to keep a balance between technical and non-technical detail. I hope it helps. So let's get started!

Reference:

https://medium.com/analytics-vidhya/neural-machine-translation-for-hindi-english-sequence-to-sequence-learning-1298655e334a

https://github.com/richaranjan23/My_Projects/tree/master/MSc_dissertation

http://www.cfilt.iitb.ac.in/~moses/iitb_en_hi_parallel/iitb_corpus_download/

## Setup

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd

import string
import codecs
import re
import h5py
from string import digits

from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.bleu_score import SmoothingFunction
from sklearn.model_selection import train_test_split

from tensorflow.keras.layers import Input, LSTM, Embedding, Dense
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import plot_model, get_file

import matplotlib.pyplot as plt
%matplotlib inline

## The Corpus — Pre-processing

The languages chosen for this translation model are Hindi and English, and the parallel corpus was obtained from the [IIT Bombay Hindi-English Parallel Corpus](http://www.cfilt.iitb.ac.in/~moses/iitb_en_hi_parallel/iitb_corpus_download/). It is a Hindi-English parallel corpus containing 1,492,827 pairs of sentences.

### Download the dataset

In [4]:
# ref: https://stackoverflow.com/questions/56976078/how-do-i-load-images-dataset-using-tf-keras-utils-get-file
dataset = tf.keras.utils.get_file(
    fname="parallel.tgz", 
    origin="http://www.cfilt.iitb.ac.in/~moses/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz", 
    extract=True,
)

Downloading data from http://www.cfilt.iitb.ac.in/~moses/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz


In [5]:
# check the downloaded data, which get_file stores in the ~/.keras/datasets directory
! ls ~/.keras/datasets

parallel  parallel.tgz


In [0]:
# copy the downloaded data into the current directory
! cp -r ~/.keras/datasets .

### Original data load

In [0]:
with open('datasets/parallel/IITB.en-hi.en', encoding='utf-8', errors='ignore') as f:
  eng_sentence = f.read().split('\n')[:-1]
with open('datasets/parallel/IITB.en-hi.hi', encoding='utf-8', errors='ignore') as f:
  hin_sentence = f.read().split('\n')[:-1]
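
The two files are aligned purely by line number, so it can be worth confirming that both sides yield the same number of lines before pairing them. The following is a minimal sketch of such a check, not part of the original notebook: `read_parallel` is a hypothetical helper, and the toy files stand in for `IITB.en-hi.en` and `IITB.en-hi.hi`. It splits on `'\n'` only, as the cell above does, because `str.splitlines()` also breaks on other Unicode line separators and could shift the alignment.

```python
import os
import tempfile


def read_parallel(src_path, tgt_path):
    """Read two line-aligned files and return a list of (source, target) pairs.

    Hypothetical helper for illustration; raises ValueError if the line counts differ.
    """
    sides = []
    for path in (src_path, tgt_path):
        with open(path, encoding='utf-8', errors='ignore') as f:
            rows = f.read().split('\n')
        # Drop the empty string left behind by a trailing newline.
        if rows and rows[-1] == '':
            rows.pop()
        sides.append(rows)
    src, tgt = sides
    if len(src) != len(tgt):
        raise ValueError(f'misaligned corpus: {len(src)} vs {len(tgt)} lines')
    return list(zip(src, tgt))


# Toy files standing in for the real corpus files (assumption for the demo).
with tempfile.TemporaryDirectory() as tmp:
    en_path = os.path.join(tmp, 'toy.en')
    hi_path = os.path.join(tmp, 'toy.hi')
    with open(en_path, 'w', encoding='utf-8') as f:
        f.write('login\nappeal\n')
    with open(hi_path, 'w', encoding='utf-8') as f:
        f.write('लॉगिन करें\nमिन्नत\n')
    pairs = read_parallel(en_path, hi_path)

print(len(pairs), 'aligned pairs')
```

Raising on a mismatch, rather than silently truncating to the shorter side, makes a corrupted download visible before any training starts.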

In [13]:
# build a dataframe holding the English and Hindi sentences side by side
lines = pd.DataFrame(columns=['eng', 'hindi'])
lines.eng = eng_sentence
lines.hindi = hin_sentence
lines.head()

Unnamed: 0,eng,hindi
0,Give your application an accessibility workout,अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें
1,Accerciser Accessibility Explorer,एक्सेर्साइसर पहुंचनीयता अन्वेषक
2,The default plugin layout for the bottom panel,निचले पटल के लिए डिफोल्ट प्लग-इन खाका
3,The default plugin layout for the top panel,ऊपरी पटल के लिए डिफोल्ट प्लग-इन खाका
4,A list of plugins that are disabled by default,उन प्लग-इनों की सूची जिन्हें डिफोल्ट रूप से नि...


In [14]:
lines.shape

(1561840, 2)

### Original data cleaning

In [15]:
# truncate dataset to the first 1,000,000 pairs; copy so later column assignments do not act on a slice
lines = lines[:1000000].copy()
lines.shape

(1000000, 2)

In [0]:
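# clean hindi text
hindi_clean = []
for line in lines.hindi:
  nstr = re.sub(r'\s+_\s+[A-Za-z]', '', line)  # ' _ ' followed by a Latin letter
  nstr = re.sub(r'\(*\)*', '', nstr)           # parentheses
  nstr = re.sub(r'%.*?[A-Za-z]', '', nstr)     # from '%' up to the next Latin letter (e.g. %s, %d)
  nstr = re.sub(r'[A-Za-z]', '', nstr)         # any remaining Latin letters
  nstr = nstr.strip()
  # NOTE: this deletes every whitespace character, so the words of each sentence are
  # joined together (see the sample output further down); use ' ' instead of '' to keep them apart
  nstr = re.sub(r'\s+', '', nstr)
  hindi_clean.append(nstr)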
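
The last substitution in the cell above removes all whitespace, which is why the Hindi column in the later sample has no spaces between words. If word boundaries are needed downstream (for example for word-level tokenization), a variant that collapses whitespace to a single space might look like the sketch below. `clean_hindi` is a hypothetical helper, not part of the original notebook, and the sample string is made up for the demo.

```python
import re


def clean_hindi(line):
    """Strip Latin-script residue from a Hindi line while keeping word boundaries."""
    text = re.sub(r'\s+_\s+[A-Za-z]', '', line)  # ' _ ' followed by a Latin letter
    text = re.sub(r'[()]', '', text)             # parentheses
    text = re.sub(r'%.*?[A-Za-z]', '', text)     # from '%' up to the next Latin letter
    text = re.sub(r'[A-Za-z]', '', text)         # any remaining Latin letters
    # Collapse whitespace runs to one space instead of deleting them.
    return re.sub(r'\s+', ' ', text).strip()


# Made-up input mixing Hindi words with Latin residue (assumption for the demo).
sample = 'निचले  पटल के लिए (panel) %s खाका'
cleaned = clean_hindi(sample)
print(cleaned.split(' '))
```

The only behavioural change from the notebook's loop is the replacement string in the final `re.sub`; stripping after the collapse also removes any space left at the ends by the earlier deletions.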

In [0]:
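# clean english text
eng_clean = []
for line in lines.eng:
  estr = re.sub(r'\s+_\s', '', line)          # whitespace, underscore, whitespace
  estr = re.sub(r'[^\x00-\x7F]+', ' ', estr)  # runs of non-ASCII characters become a space
  estr = estr.strip()
  estr = re.sub(r'\s+', ' ', estr)            # collapse whitespace runs to a single space
  eng_clean.append(estr)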

In [18]:
len(hindi_clean), len(eng_clean)

(1000000, 1000000)

In [0]:
# update dataframe hindi/english column with cleaned data
lines.hindi = hindi_clean
lines.eng = eng_clean

# make lower case (has no effect on Devanagari, which has no letter case)
lines.hindi = lines.hindi.apply(lambda x: x.lower())
lines.eng = lines.eng.apply(lambda x: x.lower())

In [0]:
# remove ASCII punctuation (string.punctuation does not include the Devanagari danda '।')
exclude = set(string.punctuation)
lines.hindi = lines.hindi.apply(lambda x: ''.join(ch for ch in x if ch not in exclude))
lines.eng = lines.eng.apply(lambda x: ''.join(ch for ch in x if ch not in exclude))

In [0]:
# remove ASCII digits from text (string.digits does not cover Devanagari digits)
remove_digits = str.maketrans('', '', digits)
lines.hindi = lines.hindi.apply(lambda x: x.translate(remove_digits))
lines.eng = lines.eng.apply(lambda x: x.translate(remove_digits))
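
Both `string.punctuation` and `string.digits` are ASCII-only, so Hindi-specific characters such as the danda `।` and the Devanagari digits `०` to `९` pass through the two cells above. One possible way to catch them, sketched here under the assumption that all punctuation and all decimal digits should go, is to filter on Unicode categories. `strip_punct_and_digits` is a hypothetical helper, not part of the original notebook.

```python
import string
import unicodedata

ASCII_PUNCT = set(string.punctuation)


def strip_punct_and_digits(text):
    """Drop ASCII punctuation, Unicode punctuation (P*) and decimal digits (Nd)."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        # string.punctuation also contains symbols such as '+' and '=' whose
        # Unicode category is not P*, so both tests are needed.
        if ch in ASCII_PUNCT or category.startswith('P') or category == 'Nd':
            continue
        kept.append(ch)
    return ''.join(kept)


print(repr(strip_punct_and_digits('है। १२')))
print(repr(strip_punct_and_digits('a+b=3!')))
```

Combining marks (the vowel signs attached to Devanagari consonants) fall in the `M*` categories and are kept, so the words themselves are not altered.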

In [22]:
lines.sample(10)

Unnamed: 0,eng,hindi
149110,kgrab was unable to save the image to,केग्रैबछविकोयहाँसहेजनेमेंअक्षम
518661,similarly the imitation of the englishmen s se...,इसीप्रकारअंग्रेजीकेआत्मविश्वासऔरआत्मबलकीनकलबिन...
165856,successfully moved to trash,
85540,put personalized signatures at the top of replies,उत्तरोंकीशीर्षपरव्यक्तिगतहस्ताक्षररखें
997697,the differences between subspecies are less di...,उपजातियोंमेंविशेषअंतरनहींपाएजातेहैं।
980240,liniment is a medicinal fluid rubbed into the ...,विलेपनएकऔषधीयद्रवहैजिसेदर्दअथवारुक्षतासेराहतपा...
3940,reserve left,अतिरिक्तबाकीः
715120,appeal,मिन्नत
97626,login,लॉगिनकरें
860480,placid,मस्तीख़ोर


## Filter short sentences