# Neural Machine Translation (English -> French)

## What We Know

* We use a many-to-many seq2seq model to generate the output text.
* The model follows an encoder-decoder architecture.

## New terms used in this model
### Teacher Forcing
* A training technique for recurrent decoders: at each step the decoder receives the ground-truth previous word as input, rather than its own previous prediction, and is trained to predict the next word.
* Example sentence: "I love cats"
* Step 1: add start and end tokens. The start token gives the decoder an initial input, and the end token lets the model learn where the sentence finishes.
* Therefore: "[start] i love cats [end]"
* Input : Prediction
* [start]  :  i
* [start], i  : love
* [start], i, love  : cats
* [start], i, love, cats: [end]
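
The input/prediction table above can be generated mechanically. The following is a minimal sketch in plain Python; the helper name `teacher_forcing_pairs` is hypothetical and is not part of the notebook.

```python
def teacher_forcing_pairs(sentence, start="[start]", end="[end]"):
    """Return (decoder input prefix, expected next token) pairs for one sentence."""
    tokens = [start] + sentence.lower().split() + [end]
    # At step t the decoder sees the ground-truth tokens up to t
    # and is trained to predict token t + 1.
    return [(tokens[: t + 1], tokens[t + 1]) for t in range(len(tokens) - 1)]


for prefix, target in teacher_forcing_pairs("I love cats"):
    print(prefix, "->", target)
```

Each printed row corresponds to one row of the table: the prefix is what the decoder is given, and the target is what it should predict.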

* Related techniques: greedy search and beam search are decoding strategies used at inference time, and attention is an extension of the encoder-decoder architecture.

## Encoder-Decoder Workflow in NMT (Neural Machine Translation)

##### Note: 
* With `return_state=True`, a Keras LSTM layer returns 3 values:
1. Output for the next layer
2. Hidden state
3. Cell state
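
To make the three return values concrete, here is a minimal NumPy sketch of an LSTM forward pass. It is an illustration under assumptions, not the Keras implementation: the function `lstm_forward`, the random weights, and the gate ordering are all choices made for this sketch. It shows that, when only the last step is returned, the layer output and the final hidden state are the same vector, while the cell state is a separate one.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_forward(inputs, W, U, b, units):
    """Run a single-layer LSTM over inputs of shape (timesteps, features)."""
    h = np.zeros(units)  # hidden state
    c = np.zeros(units)  # cell state
    for x_t in inputs:
        z = x_t @ W + h @ U + b      # shape (4 * units,)
        i, f, g, o = np.split(z, 4)  # input, forget, candidate, output (order is an assumption)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    output = h  # without per-step outputs, the layer output is the last hidden state
    return output, h, c


rng = np.random.default_rng(0)
timesteps, features, units = 5, 3, 4
W = rng.normal(size=(features, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

output, state_h, state_c = lstm_forward(
    rng.normal(size=(timesteps, features)), W, U, b, units
)
print(output.shape, state_h.shape, state_c.shape)
```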

##### Procedure: 
* Encoder LSTM outputs: we keep only the state outputs (hidden state and cell state) of the encoder LSTM layer, since they summarize the input sentence.
* These encoder states are used to initialize the decoder LSTM. The [start] token is provided as the first decoder input, as required by teacher forcing.
* The output of the decoder LSTM layer is passed through a Dense layer to predict the output word.

##### Workflow:
1. Encoder side:
* input -> Encoder LSTM -> encoder states
2. Decoder side:
* encoder states + [start] -> Decoder LSTM -> word + decoder states
* word + decoder states -> Decoder LSTM -> word2 + decoder states 2
* word2 + decoder states 2 -> Decoder LSTM -> word3 + decoder states 3
* ... and so on
* The process stops when [end] token is predicted
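
The decoder-side loop above can be sketched without any trained model. In this minimal example, `greedy_decode` and `toy_decoder_step` are hypothetical names: the toy step function stands in for the decoder LSTM plus Dense layer and simply walks through a fixed sentence, using its state as a position counter.

```python
def greedy_decode(encoder_states, decoder_step, start="[start]", end="[end]", max_len=12):
    """Feed each predicted word back into the decoder until `end` or max_len."""
    word, states, result = start, encoder_states, []
    for _ in range(max_len):
        word, states = decoder_step(word, states)
        if word == end:
            break
        result.append(word)
    return result


# Hypothetical stand-in for the trained decoder (not a real model).
def toy_decoder_step(word, state):
    sentence = ["je", "suis", "prêt", "[end]"]
    return sentence[state], state + 1


print(greedy_decode(0, toy_decoder_step))
```

The `max_len` cap is a safeguard so the loop still terminates if the end token is never predicted.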

##### Note2: 
* The model is used in 2 stages:
1. Training, with English input and French output.
2. Inference, where we rebuild the model from the trained layers to take English words and predict French words.

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import tensorflow as tf
import pandas as pd
import numpy as np

In [None]:
lines = pd.read_table('/content/drive/My Drive/Colab Notebooks/data/french.txt', names=['english', 'french'])
lines = lines[:8000]
lines.sample(5)

Unnamed: 0,english,french
1852,Go with Tom.,Allez avec Tom.
7231,I hate to iron.,Je déteste repasser.
4681,He drank beer.,Il a bu de la bière.
678,I'm yours.,Je suis à toi.
3908,She is sharp.,Elle est affûtée.


In [None]:
lines.shape

(8000, 2)

In [None]:
lines.english = lines.english.apply(lambda x: x.lower())
lines.french = lines.french.apply(lambda x: x.lower())

In [None]:
import re
lines.english = lines.english.apply(lambda x: re.sub("'", '', x)).apply(lambda x: re.sub(",", 'COMMA', x))
lines.french = lines.french.apply(lambda x: re.sub("'", '', x)).apply(lambda x: re.sub(",", 'COMMA', x))

In [None]:
import string
exclude = set(string.punctuation)
lines.english = lines.english.apply(lambda x: ''.join(ch for ch in x if ch not in exclude))
lines.french = lines.french.apply(lambda x: ''.join(ch for ch in x if ch not in exclude))

In [None]:
from string import digits
remove_digits = str.maketrans('', '', digits)
lines.english = lines.english.apply(lambda x: x.translate(remove_digits))
lines.french = lines.french.apply(lambda x: x.translate(remove_digits))
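
The cleaning cells above can also be collected into one helper, which makes it easier to apply the same steps to new sentences at inference time. This is a sketch only; `clean_sentence` is a hypothetical name, and it assumes the steps should run in the same order as in the notebook.

```python
import re
import string

# Translation table that deletes ASCII punctuation and digits in one pass.
_PUNCT_AND_DIGITS = str.maketrans("", "", string.punctuation + string.digits)


def clean_sentence(text):
    """Apply the notebook's cleaning steps to one sentence, in the same order."""
    text = text.lower()
    text = re.sub("'", "", text)       # drop apostrophes: "i'm" -> "im"
    text = re.sub(",", "COMMA", text)  # keep commas as a literal marker
    return text.translate(_PUNCT_AND_DIGITS)


print(clean_sentence("I'm yours."))
print(clean_sentence("Je suis à toi."))
```

With pandas it could be applied as `lines.english.apply(clean_sentence)`. Because no spaces are inserted around the marker, a comma that directly follows a word is glued to it (for example `noCOMMA`).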

In [None]:
lines.sample(5)

Unnamed: 0,english,french
2920,anybody hurt,qui que ce soit estil blessé
2753,were ready,nous sommes prêtes
2006,i got fined,on ma collé une amende
2325,im touched,je suis touchée
2502,she hit him,elle le frappa


In [None]:
# applying start and end tokens in french sentences
lines.french = lines.french.apply(lambda x: 'START_' + ' ' + x + ' ' + '_END')
lines.head()

Unnamed: 0,english,french
0,go,START_ va _END
1,run,START_ cours _END
2,run,START_ courez _END
3,wow,START_ ça alors _END
4,fire,START_ au feu _END


In [None]:
# collecting all unique english words to create a vocabulary
all_english_words = set()
for eng in lines.english:
  all_english_words.update(eng.split())

# collecting all unique french words to create a vocabulary
all_french_words = set()
for fre in lines.french:
  all_french_words.update(fre.split())

In [None]:
# printing the vocabulary size (number of unique words) of each language
print('length of english words: ', len(all_english_words))
print('length of french words: ', len(all_french_words))

length of english words:  1882
length of french words:  3987


In [None]:
# getting maximum sentence length of english sentences
length_list = []
for l in lines.english:
  length_list.append(len(l.split(' ')))

max_input_length = np.max(length_list)
print('max_input_length: ', max_input_length)

max_input_length:  5


In [None]:
# getting maximum sentence length of french sentences
length_list = []
for l in lines.french:
  length_list.append(len(l.split(' ')))

max_output_length = np.max(length_list)
print('max_output_length: ', max_output_length)

max_output_length:  12


In [None]:
# making a list of all input and output words and sorting them out
input_words = sorted(list(all_english_words))
output_words = sorted(list(all_french_words))
print('all input words: ', input_words)
print('all output words: ', output_words)

# getting the total number of tokens (unique words) for input and output
num_encoder_tokens = len(all_english_words)
num_decoder_tokens = len(all_french_words)
print('encoder tokens: ', num_encoder_tokens)
print('decoder tokens: ', num_decoder_tokens)

all input words:  ['COMMA', 'a', 'abandon', 'aboard', 'about', 'above', 'absent', 'accept', 'ached', 'act', 'active', 'actor', 'adaptable', 'addicted', 'admire', 'adopted', 'adorable', 'adult', 'adults', 'advice', 'afraid', 'after', 'afternoon', 'again', 'against', 'age', 'agent', 'agree', 'agreed', 'agrees', 'ahead', 'aim', 'aint', 'air', 'alert', 'alive', 'all', 'allow', 'almost', 'alone', 'along', 'already', 'also', 'always', 'am', 'amazed', 'ambitious', 'ambush', 'american', 'amnesia', 'amuse', 'amused', 'an', 'and', 'angry', 'animals', 'annoy', 'annoying', 'another', 'answer', 'answers', 'anxious', 'any', 'anybody', 'anyone', 'anything', 'anyway', 'apart', 'apologize', 'apologized', 'apples', 'approve', 'approves', 'arabic', 'arabs', 'are', 'arent', 'argue', 'arguing', 'arm', 'armed', 'around', 'arrested', 'arrived', 'art', 'artist', 'as', 'asap', 'ashamed', 'asian', 'aside', 'ask', 'asked', 'asleep', 'assume', 'astonished', 'at', 'ate', 'attack', 'attacked', 'attend', 'attentive'

In [None]:
# mapping each word to an integer index, since the model works with integers rather than strings
input_token_index = dict([(word,i) for i,word in enumerate(input_words)])
output_token_index = dict([(word,i) for i,word in enumerate(output_words)])

print('input token index: ', input_token_index)
print('output token index: ', output_token_index)

input token index:  {'COMMA': 0, 'a': 1, 'abandon': 2, 'aboard': 3, 'about': 4, 'above': 5, 'absent': 6, 'accept': 7, 'ached': 8, 'act': 9, 'active': 10, 'actor': 11, 'adaptable': 12, 'addicted': 13, 'admire': 14, 'adopted': 15, 'adorable': 16, 'adult': 17, 'adults': 18, 'advice': 19, 'afraid': 20, 'after': 21, 'afternoon': 22, 'again': 23, 'against': 24, 'age': 25, 'agent': 26, 'agree': 27, 'agreed': 28, 'agrees': 29, 'ahead': 30, 'aim': 31, 'aint': 32, 'air': 33, 'alert': 34, 'alive': 35, 'all': 36, 'allow': 37, 'almost': 38, 'alone': 39, 'along': 40, 'already': 41, 'also': 42, 'always': 43, 'am': 44, 'amazed': 45, 'ambitious': 46, 'ambush': 47, 'american': 48, 'amnesia': 49, 'amuse': 50, 'amused': 51, 'an': 52, 'and': 53, 'angry': 54, 'animals': 55, 'annoy': 56, 'annoying': 57, 'another': 58, 'answer': 59, 'answers': 60, 'anxious': 61, 'any': 62, 'anybody': 63, 'anyone': 64, 'anything': 65, 'anyway': 66, 'apart': 67, 'apologize': 68, 'apologized': 69, 'apples': 70, 'approve': 71, 'a
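
One thing to note in the mapping above: `'COMMA'` receives index 0, which is also the value that unused positions keep in a zero-initialized array, so padding and a real word share the same id. A common way around this is to reserve index 0 for padding. The sketch below shows the idea; `build_token_index` and the `<pad>` token are assumptions for illustration and are not used elsewhere in the notebook.

```python
def build_token_index(words, pad_token="<pad>"):
    """Map sorted unique words to integers, keeping index 0 free for padding."""
    index = {pad_token: 0}
    for i, word in enumerate(sorted(set(words)), start=1):
        index[word] = i
    return index


toy_index = build_token_index(["run", "go", "COMMA", "go"])
print(toy_index)
```

With this scheme the vocabulary size passed to the Embedding and Dense layers would grow by one, and options such as `mask_zero=True` on the Keras Embedding layer become possible.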

In [None]:
# creating arrays of input and output data
encoder_input_data = np.zeros((len(lines.english), max_input_length), dtype='float32')
decoder_input_data = np.zeros((len(lines.french), max_output_length), dtype='float32')

# one-hot encoding the target data, since the softmax Dense layer outputs a probability for every word in the output vocabulary
decoder_target_data = np.zeros((len(lines.french), max_output_length, num_decoder_tokens))

In [None]:
print(encoder_input_data.shape)
print(decoder_input_data.shape)
print(decoder_target_data.shape)

(8000, 5)
(8000, 12)
(8000, 12, 3987)
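
The one-hot target array is by far the largest of the three. As a rough check, the arithmetic below estimates its size and compares it with storing integer word ids instead. This is a sketch under assumptions: integer targets would require switching the loss to Keras' `sparse_categorical_crossentropy`, which the notebook does not do.

```python
import numpy as np

num_samples, max_output_length, num_decoder_tokens = 8000, 12, 3987

# np.zeros defaults to float64 (8 bytes per element) when no dtype is given.
one_hot_bytes = num_samples * max_output_length * num_decoder_tokens * 8
sparse_bytes = num_samples * max_output_length * np.dtype("int32").itemsize
print(f"one-hot targets: {one_hot_bytes / 1e9:.2f} GB")
print(f"integer targets: {sparse_bytes / 1e6:.2f} MB")

# Toy conversion between the two representations (vocabulary of 5 words).
ids = np.array([[2, 4, 1]])                # shape (samples, timesteps)
one_hot = np.eye(5, dtype="float32")[ids]  # shape (samples, timesteps, vocab)
print(one_hot.shape, one_hot.argmax(axis=-1))
```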


In [None]:
# putting all the integer values in input, output data and target data
for i,(input_text, output_text) in enumerate(zip(lines.english, lines.french)):
  for t, word in enumerate(input_text.split()):
    encoder_input_data[i,t] = input_token_index[word]
  for t,word in enumerate(output_text.split()):
    decoder_input_data[i,t] = output_token_index[word]
    # the decoder target is the decoder input shifted one step ahead, so it does not include the START_ token
    if t > 0:
      decoder_target_data[i,t-1,output_token_index[word]] = 1

In [None]:
print("encoder input data: ", encoder_input_data[1])
print('decoder input data: ', decoder_input_data[1])
print('decoder target data: ',decoder_target_data[1])
print('shape of sample decoder target data: ', decoder_target_data[1].shape)

encoder input data:  [1352.    0.    0.    0.    0.]
decoder input data:  [  0. 769.   1.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
decoder target data:  [[0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
shape of sample decoder target data:  (12, 3987)


In [None]:
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model

In [None]:
# setting hyperparameters
embedding_size = 120
lstm_dim = 324

In [None]:
# building model for training stage
#encoder model

encoder_inputs = Input(shape=(None,))
en_x = Embedding(num_encoder_tokens, embedding_size)(encoder_inputs)
encoder = LSTM(lstm_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(en_x)
encoder_states = [state_h, state_c]

In [None]:
# decoder model

decoder_inputs = Input(shape=(None,))
final_dex = Embedding(num_decoder_tokens, embedding_size)(decoder_inputs)

decoder_lstm = LSTM(lstm_dim, return_sequences=True, return_state=True)

decoder_outputs, _, _ = decoder_lstm(final_dex, initial_state=encoder_states)

decoder_dense = Dense(num_decoder_tokens, activation='softmax') 

decoder_outputs = decoder_dense(decoder_outputs)


In [None]:
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [None]:
model.compile(optimizer='rmsprop',
              loss = 'categorical_crossentropy',
              metrics=['accuracy'])
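
For reference, the `categorical_crossentropy` loss compares each one-hot target row with the softmax probabilities predicted for that timestep. The NumPy sketch below illustrates the formula on a toy vocabulary of three words; it is a simplified illustration, not the Keras implementation, and the clipping constant `eps` is an assumption.

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(-(y_true * np.log(y_pred)).sum(axis=-1).mean())


# Two timesteps over a toy vocabulary of three words.
y_true = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
y_pred = np.array([[0.1, 0.8, 0.1], [0.3, 0.3, 0.4]])
loss = categorical_crossentropy(y_true, y_pred)
print(round(loss, 4))
```

Only the probability assigned to the correct word contributes at each step, so the loss falls as the model puts more mass on the right word.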

In [None]:
model.summary()

Model: "model_9"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_13 (InputLayer)           [(None, None)]       0                                            
__________________________________________________________________________________________________
input_14 (InputLayer)           [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding_9 (Embedding)         (None, None, 120)    225840      input_13[0][0]                   
__________________________________________________________________________________________________
embedding_10 (Embedding)        (None, None, 120)    478440      input_14[0][0]                   
____________________________________________________________________________________________

In [None]:
r = model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=64, epochs=30, validation_split=0.10)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


In [None]:
#Inference Stage

#encoder model
encoder_model = Model(encoder_inputs, encoder_states)
encoder_model.summary()

Model: "model_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_13 (InputLayer)        [(None, None)]            0         
_________________________________________________________________
embedding_9 (Embedding)      (None, None, 120)         225840    
_________________________________________________________________
lstm_6 (LSTM)                [(None, 324), (None, 324) 576720    
Total params: 802,560
Trainable params: 802,560
Non-trainable params: 0
_________________________________________________________________


In [None]:
#decoder model
decoder_state_input_h = Input(shape=(lstm_dim,))
decoder_state_input_c = Input(shape=(lstm_dim,))
decoder_state_inputs = [decoder_state_input_h, decoder_state_input_c]

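# reuse the embedding output from the training graph so that inference uses the trained weights;
# creating a new Embedding layer here would start from untrained, random weights
final_dex2 = final_dex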

decoder_outputs2, state_h2, state_c2 = decoder_lstm(final_dex2, initial_state=decoder_state_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2)

decoder_model = Model([decoder_inputs] + decoder_state_inputs, [decoder_outputs2] + decoder_states2)

In [None]:
# reversing the word index dictionary to get words from index values
reverse_input_char_index = dict((i,char) for char, i in input_token_index.items())
reverse_output_char_index = dict((i,char) for char, i in output_token_index.items())
print(reverse_input_char_index)
print(reverse_output_char_index)

{0: 'COMMA', 1: 'a', 2: 'abandon', 3: 'aboard', 4: 'about', 5: 'above', 6: 'absent', 7: 'accept', 8: 'ached', 9: 'act', 10: 'active', 11: 'actor', 12: 'adaptable', 13: 'addicted', 14: 'admire', 15: 'adopted', 16: 'adorable', 17: 'adult', 18: 'adults', 19: 'advice', 20: 'afraid', 21: 'after', 22: 'afternoon', 23: 'again', 24: 'against', 25: 'age', 26: 'agent', 27: 'agree', 28: 'agreed', 29: 'agrees', 30: 'ahead', 31: 'aim', 32: 'aint', 33: 'air', 34: 'alert', 35: 'alive', 36: 'all', 37: 'allow', 38: 'almost', 39: 'alone', 40: 'along', 41: 'already', 42: 'also', 43: 'always', 44: 'am', 45: 'amazed', 46: 'ambitious', 47: 'ambush', 48: 'american', 49: 'amnesia', 50: 'amuse', 51: 'amused', 52: 'an', 53: 'and', 54: 'angry', 55: 'animals', 56: 'annoy', 57: 'annoying', 58: 'another', 59: 'answer', 60: 'answers', 61: 'anxious', 62: 'any', 63: 'anybody', 64: 'anyone', 65: 'anything', 66: 'anyway', 67: 'apart', 68: 'apologize', 69: 'apologized', 70: 'apples', 71: 'approve', 72: 'approves', 73: 'a

In [None]:
# function to predict translation (greedy decoding, one word at a time)
def decode_seq(input_seq):
  # encode the input sentence into the initial decoder states [h, c]
  state_values = encoder_model.predict(input_seq)

  # start the target sequence with the START_ token
  target_seq = np.zeros((1,1))
  target_seq[0,0] = output_token_index['START_']

  stop_condition = False
  decoded_sentence = ''

  while not stop_condition:
    output_tokens, h, c = decoder_model.predict([target_seq] + state_values)

    # pick the most probable next word
    sampled_token_index = np.argmax(output_tokens[0,-1,:])
    sampled_char = reverse_output_char_index[sampled_token_index]

    decoded_sentence += ' ' + sampled_char

    # stop at the _END token or once the decoded string exceeds 52 characters
    if(sampled_char == '_END' or len(decoded_sentence) > 52):
      stop_condition = True

    # feed the predicted word and the updated states back into the decoder
    target_seq = np.zeros((1,1))
    target_seq[0,0] = sampled_token_index

    state_values = [h,c]

  return decoded_sentence

In [None]:
# testing the model on a few samples from the training data
for seq_index in [1234, 4356, 4565, 34, 2345, 7656]:
  input_seq = encoder_input_data[seq_index:seq_index+1]
  decoded_sentence = decode_seq(input_seq)
  print('----')
  print('Input_sentence: ', lines.english[seq_index:seq_index+1])
  print('decoded sentence: ', decoded_sentence)

----
Input_sentence:  1234    i wrote it
Name: english, dtype: object
decoded sentence:   je ça ça ça _END
----
Input_sentence:  4356    youre funny
Name: english, dtype: object
decoded sentence:   es êtes êtes êtes êtes _END
----
Input_sentence:  4565    do you get it
Name: english, dtype: object
decoded sentence:   vous vous _END
----
Input_sentence:  34    got it
Name: english, dtype: object
decoded sentence:   compris pigé nourri nourri jarrive _END
----
Input_sentence:  2345    is tom well
Name: english, dtype: object
decoded sentence:   tom tom _END
----
Input_sentence:  7656    im interested
Name: english, dtype: object
decoded sentence:   je _END
