# **Neural Machine Translator for English to Hindi**

---



###This will be a Seq2Seq (sequence-to-sequence) architecture without an attention mechanism.

###I have tried several variations; the notebook is split into two main parts:

- ### **Unidirectional LSTM Model**
- ### **Bidirectional LSTM Model**




In [0]:
!git clone https://github.com/svtsanoj/NMT-English-Hindi.git # for dataset

Cloning into 'NMT-English-Hindi'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), done.


In [0]:
import pandas as pd
import tensorflow as tf
import re
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
import numpy as np
from keras.layers import Input, LSTM, Embedding, Dense, Bidirectional
from keras.models import Model
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


In [0]:
filepath = "/content/NMT-English-Hindi/Hi-En-Parallel_Corpus.xlsx"

dataframe = pd.read_excel(filepath)
dataframe.to_csv('csvfile.csv', encoding='utf-8', index = False)

In [0]:
df = pd.read_csv("/content/csvfile.csv")
df.columns = ["en", "hi"]
print(df["hi"][1], df['en'][1])

मई आपको ऐसे ही एक बच्चे के बारे में बताना चाहूंगी, I'd like to tell you about one such child,


##Data Preprocessing
I first converted the corpus from Excel format to a pandas DataFrame to explore the dataset more easily. Later on I convert it into lists.

A simple regex substitution is used here.
I collected all the punctuation characters and put them in the pattern for removal. Hindi also has its own digits in addition to 0-9; because these digits occupy consecutive code points in Unicode, the range **०-९** matches all of them.

Note that each input should be converted to a string first, otherwise cells of a different type in the dataset may raise an error.
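To make the digit range concrete, here is a small check, written as a sketch: the sample string and the variable names are made up for illustration and do not come from the corpus. The Devanagari digits ० to ९ occupy the consecutive code points U+0966 to U+096F, which is why `०-९` works as a range inside a character class.

```python
import re

# Devanagari digits sit at consecutive Unicode code points.
devanagari_digits = "०१२३४५६७८९"
code_points = [ord(ch) for ch in devanagari_digits]

# Illustrative sample (not from the corpus): both digit systems are removed.
sample = "कमरा १२३ room 45"
cleaned = re.sub(r"[0-9०-९]", "", sample)

# Collapse the whitespace left behind, as the preprocessing step does.
collapsed = re.sub(r"\s+", " ", cleaned).strip()

print(code_points[0] == 0x0966, code_points[-1] == 0x096F)
print(collapsed)
```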

In [0]:
def preprocess(hi, en):

  for i in range(len(hi)):
    # Remove all punctuation and digits.
    # Hindi has its own digit characters, so we remove those as well.
    hi[i] = re.sub(r"[!\"#$%&'()*+,-./:;<=>?@[\]^_`{|}~0-9०-९]", "", str(hi[i])).lower() # use str(input) because some cells may have a different dtype, which raised an error for me
    en[i] = re.sub(r"[!\"#$%&'()*+,-./:;<=>?@[\]^_`{|}~0-9०-९]", "", str(en[i])).lower()
    
    hi[i] = re.sub(r"\s+", " ", hi[i]) # replace runs of whitespace with a single space
    en[i] = re.sub(r"\s+", " ", en[i])

    hi[i] = "<sos> " + str(hi[i]) + " <eos>" # add the start- and end-of-sentence tags to the target language
                                              # the input language does not require these tags
  return zip(en,hi)

df = pd.DataFrame(preprocess(df["hi"][:], df["en"][:]))
df.columns = ["en", "hi"]
df

Unnamed: 0,en,hi
0,politicians do not have permission to do what ...,<sos> राजनीतिज्ञों के पास जो कार्य करना चाहिए ...
1,id like to tell you about one such child,<sos> मई आपको ऐसे ही एक बच्चे के बारे में बतान...
2,this percentage is even greater than the perce...,<sos> यह प्रतिशत भारत में हिन्दुओं प्रतिशत से ...
3,what we really mean is that theyre bad at not ...,<sos> हम ये नहीं कहना चाहते कि वो ध्यान नहीं द...
4,the ending portion of these vedas is called up...,<sos> इन्हीं वेदों का अंतिम भाग उपनिषद कहलाता ...
...,...,...
127602,examples of art deco construction can be found...,<sos> आर्ट डेको शैली के निर्माण मैरीन ड्राइव औ...
127603,and put it in our cheeks,<sos> और अपने गालों में डाल लेते हैं। <eos>
127604,as for the other derivatives of sulphur the co...,<sos> जहां तक गंधक के अन्य उत्पादों का प्रश्न ...
127605,its complicated functioning is defined thus in...,<sos> zरचनाप्रकिया को उसने एक पहेली में यों बा...


##Building Dataset

**Note:** As I am not using an attention mechanism in this assignment, the length of the sequences is a factor in the accuracy we can obtain.

So we initially restrict each sentence to a maximum of 15 words.

In [0]:
X, y = [], [] # input and output data
maxlen = 15   # maximum length of each sentence
for i in range(len(df["hi"])):
  temp_hi = df["hi"][i].split()
  temp_en = df["en"][i].split()
  if(len(temp_en)<=maxlen and len(temp_hi)<=maxlen):
    X.append(temp_en)
    y.append(temp_hi)

len(X) ,X[0], y[0]

(66853,
 ['politicians',
  'do',
  'not',
  'have',
  'permission',
  'to',
  'do',
  'what',
  'needs',
  'to',
  'be',
  'done'],
 ['<sos>',
  'राजनीतिज्ञों',
  'के',
  'पास',
  'जो',
  'कार्य',
  'करना',
  'चाहिए',
  'वह',
  'करने',
  'कि',
  'अनुमति',
  'नहीं',
  'है',
  '<eos>'])

##Tokenization

We have to encode our dataset as integers so that each word is represented by a unique token.

Keras provides a good tokenizer, so I have used it, and I have also padded the sequences with a Keras function.

Initially I made the mistake of converting the arrays to float32, which gave really poor performance in the evaluation, so I have kept them as the default int32; the type is changed later in the batching function.
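As a rough picture of what tokenizing and padding do, here is a pure-Python sketch. It is not the Keras `Tokenizer` itself: the helper names `build_word_index` and `encode_and_pad` and the toy sentences are hypothetical. It only mirrors the idea that more frequent words get smaller indices starting at 1, with 0 reserved for padding.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer, most frequent first, starting at 1 (0 is kept for padding)."""
    counts = Counter(word for sentence in sentences for word in sentence)
    # Ties are broken alphabetically here, which is an arbitrary choice for this sketch.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ranked, start=1)}

def encode_and_pad(sentences, word_index, length):
    """Replace words by their indices and right-pad with 0 up to `length`."""
    encoded = []
    for sentence in sentences:
        ids = [word_index[w] for w in sentence][:length]
        encoded.append(ids + [0] * (length - len(ids)))
    return encoded

toy = [["i", "like", "tea"], ["i", "like", "green", "tea"], ["tea"]]
toy_index = build_word_index(toy)
toy_padded = encode_and_pad(toy, toy_index, length=4)
print(toy_index)
print(toy_padded)
```

Every row ends up with the same length, which is what lets the sentences be stacked into one integer array.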

In [0]:
hi_tok , en_tok = Tokenizer(), Tokenizer()

en_tok.fit_on_texts(X)
hi_tok.fit_on_texts(y)
X_vector = []
y_vector = []

X_vector = en_tok.texts_to_sequences(X)       # convert each word to its integer index
y_vector = hi_tok.texts_to_sequences(y)

X_vector = tf.keras.preprocessing.sequence.pad_sequences(X_vector, padding='post')  # zero-pad at the end so all sequences share the length of the longest one (at most maxlen)
y_vector = tf.keras.preprocessing.sequence.pad_sequences(y_vector, padding='post')

X_vec_to_word = dict(map(reversed, en_tok.word_index.items()))    # inverse dictionary to map the indexes back to words
y_vec_to_word = dict(map(reversed, hi_tok.word_index.items()))

hi_vocab = len(hi_tok.word_index.items())+1   # +1 because index 0 is reserved for padding
en_vocab = len(en_tok.word_index.items())     # note: without +1 the highest English index lies outside an Embedding of this size

# X_vector = X_vector.astype("float32")
# y_vector = y_vector.astype("float32")

print(en_vocab, hi_vocab)
X_vector, X_test, y_vector, y_test = train_test_split(X_vector, y_vector, test_size=0.2)
len(X_vector), X_vector, y_vector

31420 32740


(53482, array([[27640,     0,     0, ...,     0,     0,     0],
        [    3,    13,   214, ...,     0,     0,     0],
        [    8,    15,   521, ...,     0,     0,     0],
        ...,
        [  615,    31,   779, ...,     0,     0,     0],
        [    8,    79,  2288, ...,     0,     0,     0],
        [ 2091,  3837,  1095, ...,     0,     0,     0]], dtype=int32), array([[    1,   293,  7288, ...,     0,     0,     0],
        [    1,    50,  2962, ...,     0,     0,     0],
        [    1,    26,    23, ...,     0,     0,     0],
        ...,
        [    1,  6417,     8, ..., 17226,     2,     0],
        [    1,   297,  3338, ...,     0,     0,     0],
        [    1,  6400,  1391, ...,     0,     0,     0]], dtype=int32))

##Batching for fitting the Model on System

This was an **important** lesson for me, as I had only recently started using `yield`. One of the problems we face is excessive RAM consumption, because some variables take up a lot of space.

Our model produces a categorical (one-hot) output, which is a large 3D matrix.

Feeding data to the model in batches does not by itself solve the RAM problem if we merely slice values from a big matrix that is already stored in memory. If we even try to allocate

- np.zeros((vocabulary_len,maxlen , len(X)))

the session crashes from running out of RAM. A generator built with `yield` lets us return one subset of the data at a time, so only the memory for one batch of `batch_size` samples is needed.
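A back-of-the-envelope calculation shows the scale of the problem. The sketch below uses figures printed elsewhere in this notebook (53482 training sentences, a maximum length of 15, a Hindi vocabulary of 32740) and assumes 4 bytes per float32 value. The helper names `one_hot_bytes` and `lazy_batches` are illustrative only.

```python
def one_hot_bytes(n_sentences, seq_len, vocab_size, bytes_per_value=4):
    """Size in bytes of a dense float32 one-hot tensor of shape (n_sentences, seq_len, vocab_size)."""
    return n_sentences * seq_len * vocab_size * bytes_per_value

# Whole training set versus a single batch of 128 sentences.
full_size = one_hot_bytes(53482, 15, 32740)
batch_size_bytes = one_hot_bytes(128, 15, 32740)
print(round(full_size / 1e9, 1), "GB for the whole training set")
print(round(batch_size_bytes / 1e6, 1), "MB for one batch of 128")

def lazy_batches(items, batch_size):
    """Yield one slice at a time; nothing is built until the consumer asks for it."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

first = next(lazy_batches(list(range(10)), batch_size=4))
print(first)
```

The full tensor comes to roughly 105 GB, while one batch needs about 251 MB, which is why the one-hot targets are built inside the generator rather than up front.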

Initial batching function I created:

```
batch_size = 256

def batches(X, y, batch_size= 256):
  while(True):
    data = tf.data.Dataset.from_tensor_slices((X, y))
    data = data.batch(batch_size, drop_remainder=True)
    for i, (inp, out) in enumerate(data):
      y_onehot = to_categorical(out[:,1:], num_classes=hi_vocab)
      y_onehot[:,:,0] = 0
      y_onehot = np.insert(y_onehot,-1, np.zeros(hi_vocab),1)

    decoder_input = np.zeros((batch_size, maxlen),dtype='float32')
    encoder_input = np.zeros((batch_size, maxlen),dtype='float32')
    end_token_value = 2.0000e+00

    for i in range(len(out)):
      for j in range(len(out[i])):
        if(int(out[i, j])!=int(end_token_value)):
          decoder_input[i, j] = out[i, j]
        encoder_input[i, j] = inp[i, j]
      yield([encoder_input, decoder_input], y_onehot)
```
But this is not very optimized, so I have used the batching function from this [Reference](https://towardsdatascience.com/word-level-english-to-marathi-neural-machine-translation-using-seq2seq-encoder-decoder-lstm-model-1a913f2dc4a7)

###**NOTE** 
####I have used the above reference in several places in the code.



In [0]:
#FROM REFERENCE
def generate_batch(X = X_vector, y = y_vector, batch_size = 256):
  while(True):
    for j in range(0, len(X), batch_size):
      encoder_input_data = np.zeros((batch_size, maxlen),dtype='float32')
      decoder_input_data = np.zeros((batch_size, maxlen),dtype='float32')
      decoder_target_data = np.zeros((batch_size, maxlen, hi_vocab),dtype='float32')
      
      for i, (input_text, target_text) in enumerate(zip(X[j:j+batch_size], y[j:j+batch_size])):
        
        for t, token in enumerate(input_text):
          encoder_input_data[i, t] = token # encoder input seq

        for t, token in enumerate(target_text):
          if t<len(target_text)-1:
            decoder_input_data[i, t] = token # decoder input seq
          if t>0: #do not include start tag in onehot vector
            decoder_target_data[i, t - 1, token] = 1.
      yield([encoder_input_data, decoder_input_data], decoder_target_data)

In [0]:
# en_tok.word_index.items()
# batch_size = 256
# for i, (input_text, target_text) in enumerate(zip(X_vector[0:0+batch_size], y_vector[0:0+batch_size])):
#     for t, word in enumerate(input_text):
#       print(i ,t ,word)
#     break
  

In [0]:
latent_dim = 256
num_encoder_tokens = en_vocab
num_decoder_tokens = hi_vocab
embedding_dims = 512





#**Unidirectional LSTM Model**


---



##Encoder-Decoder (Teacher Forcing)

The Seq2Seq architecture mainly involves two components: one LSTM cell in the encoder and one in the decoder. Each LSTM cell is unrolled over as many timesteps as the input sequence has.

**LSTM overview:** With `return_state=True`, each LSTM layer **returns 3 tensors**: its **output**, the final **hidden state**, and the final **cell state**. Each has 512 dimensions, because the number of LSTM units is set to `embedding_dims`, the same 512 used for the output of the Embedding layer.

So the context vector that the encoder passes to the decoder is simply the pair of hidden state and cell state vectors.

**Teacher forcing:** During training, we give the decoder the correct output at each timestep to reduce training effort and stop errors from propagating. So we initially feed the \<SOS\> tag to the decoder, plus the cell and hidden states from the encoder.

At each timestep, whatever the decoder's output may be, we still feed it the right word as the next input, which is why it is called teacher forcing.
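The offset between what the decoder is given and what it is asked to predict can be sketched in plain Python. This is only an illustration: the helper `teacher_forcing_pairs` and the toy sentence are hypothetical, and the batching function in this notebook builds a similar offset using padded integer arrays rather than word lists.

```python
def teacher_forcing_pairs(target_tokens):
    """Return (decoder_input, decoder_target): the same sentence offset by one step."""
    decoder_input = target_tokens[:-1]   # starts with <sos>, drops the final <eos>
    decoder_target = target_tokens[1:]   # drops <sos>, ends with <eos>
    return decoder_input, decoder_target

tokens = ["<sos>", "यह", "समाचार", "है", "<eos>"]
dec_in, dec_out = teacher_forcing_pairs(tokens)

# At every step the decoder sees the correct previous word, not its own guess.
for step, (given, expected) in enumerate(zip(dec_in, dec_out)):
    print(step, given, "->", expected)
```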

**Coding note:** We have to set `return_state=True` so that the LSTM layer returns the hidden and cell states along with the output.

In [0]:
# Modified from Reference

# Encoder
encoder_inputs = Input(shape=(None,))
enc_emb =  Embedding(en_vocab, embedding_dims, mask_zero = True)(encoder_inputs)
encoder_lstm = LSTM(embedding_dims, return_state=True, return_sequences=True)
encoder_outputs, state_h, state_c= encoder_lstm(enc_emb)

encoder_states = [state_h, state_c]

# Decoder
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(hi_vocab, embedding_dims, mask_zero = True)
dec_emb = dec_emb_layer(decoder_inputs)

decoder_lstm = LSTM(embedding_dims, return_sequences=True, return_state=True)
decoder_outputs, _, _  = decoder_lstm(dec_emb, initial_state=encoder_states)


decoder_dense = Dense(hi_vocab, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

In [0]:
model.summary()
encoder_outputs, state_h, state_c,

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 512)    16087040    input_1[0][0]                    
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, None, 512)    16762880    input_2[0][0]                    
____________________________________________________________________________________________

(<tf.Tensor 'lstm_1/transpose_2:0' shape=(None, None, 512) dtype=float32>,
 <tf.Tensor 'lstm_1/while:5' shape=(None, 512) dtype=float32>,
 <tf.Tensor 'lstm_1/while:6' shape=(None, 512) dtype=float32>)

In [0]:
train_samples = len(X_vector)
val_samples = len(X_test)
batch_size = 128          #tried 128, 256
epochs = 45               # tried 20,30, 45

In [0]:
model.fit_generator(generator = generate_batch(X_vector, y_vector, batch_size = batch_size),
                    steps_per_epoch = train_samples//batch_size,
                    epochs=epochs,
                    validation_data = generate_batch(X_test, y_test, batch_size = batch_size),
                    validation_steps = val_samples//batch_size)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/45
...
Epoch 45/45


<keras.callbacks.callbacks.History at 0x7f6beb757b00>

In [0]:
model.save("unidirectional4.h5")

In [0]:
model = tf.keras.models.load_model("/content/unidirectional4.h5")

## Testing the model

The inputs fed to the decoder during testing differ from those used in training. As mentioned earlier, with teacher forcing we input the correct translated word at each timestep, but during testing we feed only the \<SOS\> token together with the encoder's cell and hidden states (the same as before). After that, the word predicted at each timestep becomes the decoder input for the next timestep.

Whenever an \<EOS\> tag is produced, the model stops feeding data to the next timestep.
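The control flow of this decoding loop can be sketched without any model. In the sketch below, `greedy_decode`, `toy_step` and `toy_transitions` are hypothetical stand-ins: `toy_step` plays the role of one decoder timestep, and the length limit is counted in words, which is an assumption of this sketch.

```python
def greedy_decode(step_fn, initial_state, start_token="<sos>", end_token="<eos>", max_words=15):
    """Feed each predicted token back in until end_token appears or max_words is reached."""
    token, state = start_token, initial_state
    words = []
    while len(words) < max_words:
        token, state = step_fn(token, state)   # one decoder timestep
        if token == end_token:
            break
        words.append(token)
    return words

# Hypothetical stand-in for the trained decoder: a fixed lookup table.
toy_transitions = {"<sos>": "समाचार", "समाचार": "<eos>"}

def toy_step(token, state):
    """Return the next token and an updated state (here just a step counter)."""
    return toy_transitions[token], state + 1

decoded = greedy_decode(toy_step, initial_state=0)
print(decoded)
```

The `max_words` guard matters because a model that never emits the end tag would otherwise loop forever.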

In [0]:
#from Reference

# Encoder Model same as before
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h1 = Input(shape=(embedding_dims,))
decoder_state_input_c1 = Input(shape=(embedding_dims,))
decoder_states_inputs = [decoder_state_input_h1, decoder_state_input_c1]

dec_emb2= dec_emb_layer(decoder_inputs)

# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2) # A dense softmax layer to generate prob dist. over the target vocabulary

# decoder model
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs2] + decoder_states2)


In [0]:
# Modified From Reference

def decode_sequence(input_seq):
    
  states_value = encoder_model.predict(input_seq)
  
  target_seq = np.zeros((1,1))
  target_seq[0, 0] = hi_tok.word_index['<sos>']

  stop_condition = False
  decoded_sentence = ''
  while not stop_condition:
    output_tokens, h1, c1 = decoder_model.predict([target_seq] + states_value)

    # Sample a token
    sampled_token_index = np.argmax(output_tokens[0, -1, :])
    sampled_char = y_vec_to_word[sampled_token_index]
    decoded_sentence += ' '+sampled_char

    # Exit condition: stop on the <eos> token, or once the decoded
    # string exceeds 30 characters (len() counts characters, not words).
    if (sampled_char == '<eos>' or
        len(decoded_sentence) > 30):
        stop_condition = True

    # Update the target sequence.
    target_seq = np.zeros((1,1))
    target_seq[0, 0] = sampled_token_index

    # Update states
    states_value = [h1, c1]

  return decoded_sentence

##BLEU Evaluation

The BLEU metric scores a translation by the n-gram overlap between the model's output sentence and the reference. It therefore also depends on the quality of the available reference texts.
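The building block of BLEU is the clipped (modified) n-gram precision. The sketch below computes it by hand on a standard toy example; the function names are illustrative, and this is not the NLTK implementation, which additionally combines the precisions for several n-gram orders and applies a brevity penalty.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped precision: a candidate n-gram counts at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

reference = "the cat is on the mat".split()
candidate = "the the the the the the the".split()
p1 = modified_precision(reference, candidate, 1)
p2 = modified_precision(reference, candidate, 2)
print(p1, p2)
```

Here "the" occurs twice in the reference, so only 2 of the 7 candidate unigrams are credited, and no bigram matches at all. Short sentences with no higher-order matches are the situation in which NLTK suggests using a `SmoothingFunction`.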

In [0]:
from nltk.translate.bleu_score import sentence_bleu

def bleu(actual, predicted):
  actual = actual.split()
  predicted = predicted.split()

  bleu_score = sentence_bleu([actual], predicted)
  return bleu_score


##Output using Test Data
We iterate over the test dataset and print the translation of the first 15 sentences.

In [0]:
inp = X_test
output = y_test
for j in range(0,15):
  sent= ''
  input_seq=np.zeros((1,15))
  for i in range(len(inp[j])):
    if(inp[j][i] in en_tok.word_index.values()):
      input_seq[0, i] = inp[j][i]
    # padding = [0.]*(15-len(X_test[j]))
  # input_seq += padding


  decoded_sentence = decode_sequence(input_seq=input_seq)
  if(decoded_sentence[-5:]=="<eos>"):
    decoded_sentence = decoded_sentence[:-5]
  bleuScore = bleu(hi_tok.sequences_to_texts([output[j]])[0][5:-5], decoded_sentence) # [5:-5] is to remove <sos> and <eos> tags


  # print(input_seq)
  print('\nInput :',en_tok.sequences_to_texts([inp[j]]))          # converts the integer represented sentence to words
  print('Correct Hindi:', hi_tok.sequences_to_texts([output[j]]))
  print('Predicted Hindi:', decoded_sentence, "\nBLEU Score : ", bleuScore)


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().



Input : ['it was the beginning of a much higher degree']
Correct Hindi: ['<sos> यह शुरुआत थी एक बहुत उपरी स्तर <eos>']
Predicted Hindi:  ये बहुत महान था ।  
BLEU Score :  0.4482700320176827

Input : ['taj mahal is unique model of mugal architecture']
Correct Hindi: ['<sos> ताज महल मुगल वास्तुकला का उत्कृष्ट नमूना है। <eos>']
Predicted Hindi:  ताजमहल महल का एक प्रान्त हैं शहर 
BLEU Score :  0.6337834876616586


Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().



Input : ['creation of surdan in poetry dictionary']
Correct Hindi: ['<sos> सूरदास की रचनाएँ कविताकोश में <eos>']
Predicted Hindi:  हरिवंश राय बच्चन की रचनाएँ  
BLEU Score :  0.5623413251903491

Input : ['i always loved political cartoons']
Correct Hindi: ['<sos> मुझे राजनीतिक कार्टून हमेशा बहुत पसंद थे <eos>']
Predicted Hindi:  मुझे हमेशा एक खबर हटाई जा रहा है 
BLEU Score :  0.7071067811865476

Input : ['in fact in the united states the teaching system has worked fairly well']
Correct Hindi: ['<sos> वास्तव में यूनाइटेड स्टेट्स में शिक्षण प्रणाली ने बहुत अच्छा काम किया है। <eos>']
Predicted Hindi:  वास्तव में केवल अमरीका में एक संस्थान 
BLEU Score :  0.21938936848339244

Input : ['news']
Correct Hindi: ['<sos> समाचार <eos>']
Predicted Hindi:  समाचार  
BLEU Score :  1.0

Input : ['most young people will never try sniffing']
Correct Hindi: ['<sos> अधिकतर युवा लोग सॉल्वैंट लोसूँघने की आज़माइश कभी नहीं करेंगे <eos>']
Predicted Hindi:  बहुत से लोग ढेर सारा अच्छे नहीं 
BLEU Score :  0.476273

# **Bidirectional LSTM layer :**  



---

The [Google's NMT '16](https://arxiv.org/pdf/1609.08144.pdf) paper elaborates on deep LSTM networks and how the model improves as the number of layers increases, up to a certain limit.

Similar to GNMT, I have created a model with one bidirectional LSTM in the encoder followed by a normal LSTM layer,

and 2 LSTM layers on the decoder side; a Bi-LSTM cannot be used in the decoder because we predict the output sequentially.



In [0]:
latent_dim = 512
num_encoder_tokens = en_vocab
num_decoder_tokens = hi_vocab
embedding_dims = 512

##Encoder-Decoder

**Architecture Diagram:**
![alt text](https://github.com/svtsanoj/NMT-English-Hindi/blob/master/NMT.jpeg?raw=true)

As you can see, the bidirectional LSTM is the first layer after the usual embedding layer.

With `return_state=True`, a bidirectional LSTM returns 5 values, unlike a normal LSTM, which returns 3; the extra 2 are the hidden and cell state of the reversed LSTM inside the Bi-LSTM.

So if you want to try this with just one bidirectional layer in the encoder and no further LSTM, you have to make sure that all 4 states reach the decoder. They cannot be passed directly, because a single LSTM layer accepts only 2 states (hidden and cell) as its initial state, so here we use an additional layer instead.
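One way to see that the Bi-LSTM really contains two separate LSTMs is to count its weights. The sketch below uses the standard LSTM parameter formula with a bias term; the helper name `lstm_params` is illustrative. The result can be compared with the `bidirectional_1` row of the model summary in this notebook, which reports 4198400 parameters and an output width of 1024.

```python
def lstm_params(input_dim, units):
    """Trainable weights of one LSTM layer: 4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

embedding_dims = 512                       # width of the embedding fed into the encoder
one_direction = lstm_params(input_dim=embedding_dims, units=512)
bidirectional = 2 * one_direction          # forward LSTM + backward LSTM
bi_output_width = 2 * 512                  # outputs of both directions are concatenated
print(one_direction, bidirectional, bi_output_width)
```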

In [0]:
# Encoder

encoder_inputs = Input(shape=(None,))

enc_emb =  Embedding(en_vocab, embedding_dims, mask_zero = True)(encoder_inputs)

encoder_lstm1 = Bidirectional(LSTM(embedding_dims, return_sequences=True))(enc_emb)

encoder_lstm2 = LSTM(embedding_dims, return_state=True, return_sequences=True)

encoder_outputs, state_h, state_c= encoder_lstm2(encoder_lstm1)

encoder_states = [state_h, state_c]

# Decoder
decoder_inputs = Input(shape=(None,))

dec_emb_layer =Embedding(hi_vocab, embedding_dims, mask_zero = True)    #do not declare as Embedding(..)(input) as this layer will be called later
dec_emb = dec_emb_layer(decoder_inputs)                                  

decoder_lstm1 =  LSTM(embedding_dims, return_sequences=True)
dec_lstm1 = decoder_lstm1(dec_emb, initial_state=encoder_states)

decoder_lstm2 =  LSTM(embedding_dims, return_sequences=True, return_state=True)

decoder_outputs, _, _  = decoder_lstm2(dec_lstm1)


decoder_dense = Dense(hi_vocab, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

model.summary()
encoder_outputs, state_h, state_c,

Model: "model_4"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_5 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, None, 512)    16087040    input_5[0][0]                    
__________________________________________________________________________________________________
input_6 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) (None, None, 1024)   4198400     embedding_3[0][0]                
____________________________________________________________________________________________

(<tf.Tensor 'lstm_4/transpose_2:0' shape=(None, None, 512) dtype=float32>,
 <tf.Tensor 'lstm_4/while:5' shape=(None, 512) dtype=float32>,
 <tf.Tensor 'lstm_4/while:6' shape=(None, 512) dtype=float32>)

In [0]:
train_samples = len(X_vector)
val_samples = len(X_test)
batch_size = 128
epochs = 45       #25 , 45

model.fit_generator(generator = generate_batch(X_vector, y_vector, batch_size = batch_size),
                    steps_per_epoch = train_samples//batch_size,
                    epochs=epochs,
                    validation_data = generate_batch(X_test, y_test, batch_size = batch_size),
                    validation_steps = val_samples//batch_size)

Epoch 1/45
...
Epoch 45/45


<keras.callbacks.callbacks.History at 0x7fa8053fccf8>

In [0]:
model.save("bidirectional5.h5")

In [0]:
model = tf.keras.models.load_model("/content/bidirectional5.h5")

##Test the model

In [0]:
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h1 = Input(shape=(embedding_dims,))
decoder_state_input_c1 = Input(shape=(embedding_dims,))
decoder_states_inputs = [decoder_state_input_h1, decoder_state_input_c1]

dec_emb2= dec_emb_layer(decoder_inputs)

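# To predict the next word in the sequence, set the initial states to the states from the previous time step
# Note: the states passed in initialise decoder_lstm1, while the states returned below come from decoder_lstm2,
# and decoder_lstm2 restarts from a zero state on every call; this differs from how the layers ran during training.
dec_lstm3 = decoder_lstm1(dec_emb2, initial_state=decoder_states_inputs)
decoder_outputs2, state_h2, state_c2 = decoder_lstm2(dec_lstm3)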
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2) # A dense softmax layer to generate prob dist. over the target vocabulary

# decoder model
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs2] + decoder_states2)

In [0]:
def decode_sequence(input_seq):
    
    states_value = encoder_model.predict(input_seq)
    
    target_seq = np.zeros((1,1))    # Generate empty target sequence of length 1.
    target_seq[0, 0] = hi_tok.word_index['<sos>']   # Populate the first character of target sequence with the start character.

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h1, c1 = decoder_model.predict([target_seq] + states_value)

        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = y_vec_to_word[sampled_token_index]
        decoded_sentence += ' '+sampled_char

        if (sampled_char == '<eos>' or
           len(decoded_sentence) > 30):
            stop_condition = True

        target_seq = np.zeros((1,1))     # Update the target sequence
        target_seq[0, 0] = sampled_token_index

        states_value = [h1, c1]   # Update states

    return decoded_sentence

##Output using Test Data

In [0]:
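inp = X_vector      # note: after train_test_split these hold the training split; use X_test / y_test for held-out data
output = y_vector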

for j in range(0,20):
  sent= ''
  input_seq=np.zeros((1,15))
  for i in range(len(inp[j])):
    if(inp[j][i] in en_tok.word_index.values()):
      input_seq[0, i] = inp[j][i]
    # padding = [0.]*(15-len(X_test[j]))
  # input_seq += padding

  decoded_sentence = decode_sequence(input_seq=input_seq)
  if(decoded_sentence[-5:]=="<eos>"):
    decoded_sentence = decoded_sentence[:-5]
  bleuScore = bleu(hi_tok.sequences_to_texts([output[j]])[0][5:-5], decoded_sentence) # [5:-5] is to remove <sos> and <eos> tags

  # print(input_seq)
  print('\nInput :',en_tok.sequences_to_texts([inp[j]]))
  print('Correct Hindi:', hi_tok.sequences_to_texts([output[j]]))
  print('Predicted Hindi:', decoded_sentence, "\nBLEU Score : ", bleuScore)


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().



Input : ['and learn that as a foreign language']
Correct Hindi: ['<sos> और उसे विदेशी भाषा के रूप में सिखा <eos>']
Predicted Hindi:  और में डिग्री असलियत से देखिये 
BLEU Score :  0.5444460596606694

Input : ['electronic communication']
Correct Hindi: ['<sos> इलेक्ट्रॉनिक संचार <eos>']
Predicted Hindi:  तथा यही जहां  
BLEU Score :  0

Input : ['where do we come up with these ideas']
Correct Hindi: ['<sos> ऐसे विचार हमे कहाँ से आते है <eos>']
Predicted Hindi:  हम  
BLEU Score :  0

Input : ['rajya pal is the leader of the statechief minister']
Correct Hindi: ['<sos> राज्यपाल राज्य का प्रमुख है। <eos>']
Predicted Hindi:  राज्यपाल  
BLEU Score :  0.01831563888873418

Input : ['and theres kind of a bifurcation']
Correct Hindi: ['<sos> और यहाँ एक प्रकार का द्वि विभाजन है <eos>']
Predicted Hindi:  और उसका इसने के उसने एक अचानक कीजिये 
BLEU Score :  0.7071067811865476

Input : ['and the fresher the clay the better the impression']
Correct Hindi: ['<sos> और मिट्टीं जितनी ही ताजा होगी प्रभाव भी

Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


In [0]:
def evaluate(inp_sentence):
  # Apply the same cleaning as in preprocess()
  inp_sentence = re.sub(r"[!\"#$%&'()*+,-./:;<=>?@[\]^_`{|}~0-9०-९]", "", str(inp_sentence)).lower()
  inp_sentence = re.sub(r"\s+", " ", inp_sentence).strip()
  
  input_seq=np.zeros((1,15))
  for i, word in enumerate(inp_sentence.split()[:15]):   # keep at most 15 words, the maxlen used in training
    if(word in en_tok.word_index):
      input_seq[0, i] = en_tok.word_index[word]          # unknown words stay 0 (padding)

  decoded_sentence = decode_sequence(input_seq=input_seq)

  print('Input :', inp_sentence)
  print('Predicted :', decoded_sentence)
  return decoded_sentence

#**Inference**

The models seem to perform quite well even on the test data, although the bidirectional model does not produce the expected output. Initially, when trained for only 25 epochs, the results were not good.

We can observe that the output sequences are shorter than the average input length. Adding attention would likely improve the translation quality significantly, especially on long input sentences.

References:

[1] [Google NMT 2016](https://arxiv.org/pdf/1609.08144.pdf)

[2] [Sequence to sequence translation blog](https://towardsdatascience.com/word-level-english-to-marathi-neural-machine-translation-using-seq2seq-encoder-decoder-lstm-model-1a913f2dc4a7)

[3] [Keras blog](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html)

[4]  [NLTK BLEU](https://www.nltk.org/api/nltk.translate.html#nltk.translate.bleu_score.SmoothingFunction)

[5] [Intro to BLEU](https://machinelearningmastery.com/calculate-bleu-score-for-text-python/)


