# LSTM Language Translation

Agenda:
1. LSTM
2. Seq2Seq
3. Hindi-English text data
	- Clean up data
	- Convert words to list
	- Separate Encoder and Decoder input
4. Build the sequence to the input data
	- Encoder and Decoder Tokenizer
	- Padding Sequence
	- Build Decoder output data layer
5. Build Training Model
	- Build Encoder Layer
	- Build Decoder Layer
	- Build Model with Both Encoder and Decoder
	- Train the Model
6. Build model for Prediction
	- Build Encoder Model to Predict Encoder States
	- Build Decoder to Predict
7. Predict Output
	- It is the Hindi sentence translated from English
8. Save the Model
9. Conclusion

**Recurrent Neural Networks (RNNs)** are a powerful and robust type of neural network. What sets them apart from feed-forward networks is an internal memory: a hidden state that is carried from one time step to the next.
Because of this internal memory, RNNs can retain important information about the input they have received, which helps them predict what comes next. An RNN adds the immediate past to the present, so at every step it has two inputs: the current input and the recent past. This matters because the order of a sequence carries crucial information about what is coming next, which is why RNNs suit sequence tasks that order-agnostic models handle poorly.
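
As a rough illustration of "two inputs, the present and the recent past", a single step of a vanilla RNN can be sketched in NumPy. The weight names (`W_x`, `W_h`, `b`), the layer sizes and the tanh activation are assumptions made for this sketch; they are not taken from the model built later in this notebook.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla RNN step: combine the current input with the previous hidden state."""
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3                 # illustrative sizes (assumption)
W_x = rng.normal(size=(input_dim, hidden_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

sequence = rng.normal(size=(5, input_dim))   # 5 time steps of random input
h = np.zeros(hidden_dim)                     # empty memory before the first step
for x_t in sequence:
    h = rnn_step(x_t, h, W_x, W_h, b)        # h carries the past forward

print(h.shape)
```

The hidden state `h` is the only thing passed from step to step, so everything the network "remembers" has to fit in that vector.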

RNNs have two well-known issues: exploding gradients and vanishing gradients. LSTMs were designed to mitigate the vanishing-gradient problem; exploding gradients are usually handled with gradient clipping.
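
A small numeric sketch can make both problems concrete. It assumes, purely for illustration, that the gradient is scaled by a constant factor at every time step (0.9 or 1.1 over 50 steps). The helper `clip_by_norm` is a hypothetical name for a simple norm-based clipping rule, a common remedy for exploding gradients.

```python
import numpy as np

steps = 50
# Assumed per-step gradient scaling factors, chosen only for illustration.
shrinking = 0.9 ** steps   # factor below 1: the gradient vanishes
growing = 1.1 ** steps     # factor above 1: the gradient explodes

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

# A gradient of norm 50 is scaled down to norm 5; its direction is unchanged.
clipped = clip_by_norm(np.array([30.0, 40.0]), max_norm=5.0)
print(shrinking, growing, clipped)
```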

**LSTM:** Long Short-Term Memory networks are an extension of RNNs that enables them to remember their inputs over a long period of time. An LSTM keeps its information in a memory cell that works much like the memory of a computer: the LSTM can read, write and delete information from it.

In an LSTM you have three gates: input, forget and output gate. These gates determine whether or not to let new input in (input gate), delete the information because it isn’t important (forget gate) or to let it impact the output at the current time step (output gate).
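
The three gates can be sketched as a single LSTM step in NumPy. This is a minimal sketch of the standard LSTM equations, not the Keras implementation: the stacked parameter layout (`W`, `U`, `b` holding input, forget, output and candidate blocks in that order) and the sizes are assumptions made for readability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step with stacked parameters for the input (i), forget (f)
    and output (o) gates plus the candidate values (g)."""
    z = x_t @ W + h_prev @ U + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c_t = f * c_prev + i * g      # forget part of the old memory, write new input
    h_t = o * np.tanh(c_t)        # output gate decides what the step exposes
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units = 4, 3           # illustrative sizes (assumption)
W = rng.normal(size=(input_dim, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)
```

The pair `(h, c)` corresponds to the `state_h` and `state_c` values that the encoder hands to the decoder later in this notebook.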

**Seq2Seq Model:** A Seq2Seq model can be separated into two sub-models. The first is called the Encoder and the second the Decoder. The Encoder takes raw input text, just like any other RNN architecture, and at the end outputs a neural representation of it (here, the final LSTM states). That representation becomes the input for the Decoder.
The Decoder reads the Encoder's output and generates a different output sequence (the Hindi translation in this case).

In [1]:
import numpy as np
np.random.seed(42)

## 1. Prepare data
Data for this exercise can be downloaded from http://www.manythings.org/anki/

### 1.1 Download and extract sentence pairs

In [0]:
!wget http://www.manythings.org/anki/hin-eng.zip --quiet

In [2]:
import zipfile
import io

In [3]:
zf = zipfile.ZipFile('C:/Users/s.mudalapuram/Documents/PythonMe/data/hin-eng.zip', 'r')


In [4]:
data = ''
with zf.open('hin.txt') as readfile:
  for line in io.TextIOWrapper(readfile, 'utf-8'):
    data += line

In [5]:
data

'Wow!\tवाह!\nHelp!\tबचाओ!\nJump.\tउछलो.\nJump.\tकूदो.\nJump.\tछलांग.\nHello!\tनमस्ते।\nHello!\tनमस्कार।\nCheers!\tवाह-वाह!\nCheers!\tचियर्स!\nGot it?\tसमझे कि नहीं?\nI\'m OK.\tमैं ठीक हूँ।\nAwesome!\tबहुत बढ़िया!\nCome in.\tअंदर आ जाओ।\nGet out!\tबाहर निकल जाओ!\nGo away!\tचले जाओ!\nGoodbye!\tख़ुदा हाफ़िज़।\nPerfect!\tउत्तम!\nPerfect!\tसही!\nWelcome.\tआपका स्वागत है।\nWelcome.\tस्वागतम्।\nHave fun.\tमज़े करना।\nHave fun.\tमौज करना।\nHave fun.\tमज़े करो।\nI forgot.\tमैं भूल गया।\nI forgot.\tमैं भूल गई।\nI\'ll pay.\tमैं पैसे दूंगा।\nI\'m fine.\tमैं ठीक हूँ।\nI\'m full.\tमेरा पेट भर गया है।\nLet\'s go!\tचलो चलें!\nAnswer me.\tमुझे जवाब दो।\nBirds fly.\tपंछी उड़ते हैं।\nExcuse me.\tमाफ़ कीजिए।\nFantastic!\tबहुत ख़ूब!\nI fainted.\tमैं बेहोश हो गया।\nI fear so.\tखेद की बात है, लेकिन वैसा ही है।\nI laughed.\tमैं हँसा।\nI\'m bored.\tमैं बोर हो रहा हूँ।\nI\'m broke.\tमेरा दीवालिया हो चुका है।\nI\'m tired.\tमैं थक गया हूँ।\nIt\'s cold.\tठंड हो रही है।\nWho knows?\tकौन जाने?\nWho knows?\tकिसको पता

In [6]:
data =  data.split('\n')

In [7]:
len(data)

2868

In [8]:
data[100:105]

["I don't know.\tमुझे नहीं मालूम।",
 'I have a car.\tमेरे पास एक गाड़ी है।',
 'I have a dog.\tमेरे पास एक कुत्ता है।',
 'I understand.\tमैं समझता हूँ।',
 "I'm a doctor.\tमैं डॉक्टर हूँ।"]

Review the data

### 1.2 Separate out Encoder and Decoder input data

In [9]:
encoder_text = []

In [10]:
decoder_text = []

In [11]:
for line in data:
    try:
        in_txt, out_txt = line.split('\t')
    except ValueError:
        continue #skip lines that do not have exactly two tab-separated fields

    encoder_text.append(in_txt)

    # Wrap the target sentence in '<start>' and '<end>' markers
    # so the decoder knows where a sentence begins and ends
    decoder_text.append('<start> ' + out_txt + ' <end>')
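
The same split can be tried on a small in-memory sample, which makes it easy to see which lines are dropped. The two sentence pairs are copied from the data shown above; the names `sample_lines`, `encoder_sample` and `decoder_sample` are hypothetical and exist only in this sketch.

```python
sample_lines = [
    "I have a car.\tमेरे पास एक गाड़ी है।",
    "I understand.\tमैं समझता हूँ।",
    "",  # trailing empty line, as produced by split('\n')
]

encoder_sample, decoder_sample = [], []
for line in sample_lines:
    parts = line.split('\t')
    if len(parts) != 2:
        continue  # skip lines without exactly one tab separator
    in_txt, out_txt = parts
    encoder_sample.append(in_txt)
    decoder_sample.append('<start> ' + out_txt + ' <end>')

print(encoder_sample)
print(decoder_sample)
```

The empty trailing line is the reason `len(data)` is 2868 while only 2867 sentence pairs reach the padded arrays later on.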

In [12]:
decoder_text[100:105]

['<start> मुझे नहीं मालूम। <end>',
 '<start> मेरे पास एक गाड़ी है। <end>',
 '<start> मेरे पास एक कुत्ता है। <end>',
 '<start> मैं समझता हूँ। <end>',
 '<start> मैं डॉक्टर हूँ। <end>']

In [13]:
encoder_text[100:105]

["I don't know.",
 'I have a car.',
 'I have a dog.',
 'I understand.',
 "I'm a doctor."]

### 1.3 Build Sequences for Encoder and Decoder Input

In [14]:
from tensorflow.keras.preprocessing.text import Tokenizer

Encoder tokenizer

In [15]:
encoder_t = Tokenizer()

In [16]:
encoder_t.fit_on_texts(encoder_text)

In [17]:
encoder_seq = encoder_t.texts_to_sequences(encoder_text)

In [18]:
max_encoder_seq_length = max([len(txt) for txt in encoder_seq])

In [19]:
max_encoder_seq_length

22

In [20]:
encoder_vocab_size = len(encoder_t.word_index)

In [21]:
encoder_vocab_size

2404

In [22]:
encoder_text[100:105]

["I don't know.",
 'I have a car.',
 'I have a dog.',
 'I understand.',
 "I'm a doctor."]

In [23]:
encoder_seq[100:105]

[[2, 28, 43], [2, 12, 6, 100], [2, 12, 6, 130], [2, 213], [38, 6, 153]]
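
To see roughly what the tokenizer does, here is a simplified pure-Python stand-in. It is a sketch, not the Keras implementation: it lowercases, replaces filtered punctuation with spaces, splits on whitespace and numbers words from 1 in order of decreasing frequency. The `FILTERS` string is assumed to mirror the Keras default, and `simple_tokenize` and `build_word_index` are hypothetical helper names.

```python
from collections import Counter

# Assumed to mirror the default Keras filter set (note: no apostrophe).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokenize(text, filters=FILTERS):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in simple_tokenize(text))
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

texts = ["I have a car.", "I have a dog.", "I understand."]
word_index = build_word_index(texts)
sequences = [[word_index[w] for w in simple_tokenize(t)] for t in texts]

print(word_index)
print(sequences)
```

Index 0 is never assigned to a word, which is why the model later uses `vocab_size + 1` as the embedding input size: 0 is reserved for padding.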

Decoder tokenizer

In [24]:
decoder_t = Tokenizer(filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')

In [25]:
decoder_t.fit_on_texts(decoder_text)

In [26]:
decoder_seq = decoder_t.texts_to_sequences(decoder_text)

In [27]:
max_decoder_seq_length = max([len(txt) for txt in decoder_seq])

In [28]:
max_decoder_seq_length

27

In [29]:
decoder_vocab_size = len(decoder_t.word_index)

In [30]:
decoder_vocab_size

3009

In [31]:
decoder_text[100:105]

['<start> मुझे नहीं मालूम। <end>',
 '<start> मेरे पास एक गाड़ी है। <end>',
 '<start> मेरे पास एक कुत्ता है। <end>',
 '<start> मैं समझता हूँ। <end>',
 '<start> मैं डॉक्टर हूँ। <end>']

In [32]:
decoder_seq[100:105]

[[1, 12, 5, 1499, 2],
 [1, 28, 40, 21, 106, 3, 2],
 [1, 28, 40, 21, 208, 3, 2],
 [1, 6, 779, 19, 2],
 [1, 6, 186, 19, 2]]

### 1.4 Padding Sequences

In [33]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [34]:
encoder_input_data = pad_sequences(encoder_seq, maxlen=max_encoder_seq_length, padding='pre')

In [35]:
decoder_input_data = pad_sequences(decoder_seq, maxlen=max_decoder_seq_length, padding='post')

In [36]:
encoder_input_data.shape

(2867, 22)

In [37]:
decoder_input_data.shape

(2867, 27)
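
The difference between `padding='pre'` (used for the encoder) and `padding='post'` (used for the decoder) is easiest to see on a tiny example. The `pad` function below is a simplified NumPy sketch, not `pad_sequences` itself; in particular its truncation rule (keep the first `maxlen` tokens) is an assumption that never triggers here, because `maxlen` is at least the longest sequence.

```python
import numpy as np

def pad(seqs, maxlen, padding='pre'):
    """Zero-pad integer sequences to a fixed length, before or after the tokens."""
    out = np.zeros((len(seqs), maxlen), dtype=int)
    for row, seq in enumerate(seqs):
        seq = seq[:maxlen]            # simplification, see the note above
        if not seq:
            continue
        if padding == 'pre':
            out[row, -len(seq):] = seq
        else:
            out[row, :len(seq)] = seq
    return out

seqs = [[2, 28, 43], [2, 213]]        # two encoder sequences from the output above
pre_padded = pad(seqs, maxlen=5, padding='pre')
post_padded = pad(seqs, maxlen=5, padding='post')

print(pre_padded)
print(post_padded)
```

Post-padding keeps `<start>` in the first position of every decoder sequence, which the shift in the next section relies on.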

Integer to Word converter for Decoder data

In [38]:
decoder_t.word_index

{'<start>': 1,
 '<end>': 2,
 'है।': 3,
 'में': 4,
 'नहीं': 5,
 'मैं': 6,
 'वह': 7,
 'से': 8,
 'क्या': 9,
 'है': 10,
 'के': 11,
 'मुझे': 12,
 'को': 13,
 'हैं।': 14,
 'की': 15,
 'हो': 16,
 'बहुत': 17,
 'का': 18,
 'हूँ।': 19,
 'उसने': 20,
 'एक': 21,
 'पर': 22,
 'तुम': 23,
 'था।': 24,
 'यह': 25,
 'कर': 26,
 'लिए': 27,
 'मेरे': 28,
 'कि': 29,
 'उसे': 30,
 'और': 31,
 'इस': 32,
 'हैं': 33,
 'भी': 34,
 'अपने': 35,
 'मैंने': 36,
 'मेरी': 37,
 'रहा': 38,
 'ने': 39,
 'पास': 40,
 'तो': 41,
 'करने': 42,
 'कल': 43,
 'ही': 44,
 'हम': 45,
 'अपनी': 46,
 'काम': 47,
 'गया।': 48,
 'उसकी': 49,
 'उसके': 50,
 'कोई': 51,
 'तुम्हें': 52,
 'तुम्हारे': 53,
 'था': 54,
 'उस': 55,
 'मेरा': 56,
 'साथ': 57,
 'घर': 58,
 'करना': 59,
 'सकते': 60,
 'थी।': 61,
 'गया': 62,
 'समय': 63,
 'बात': 64,
 'आप': 65,
 'पता': 66,
 'रही': 67,
 'कुछ': 68,
 'अभी': 69,
 'बजे': 70,
 'यहाँ': 71,
 'उसको': 72,
 'आज': 73,
 'लगता': 74,
 'तक': 75,
 'तुम्हे': 76,
 'चाहिए।': 77,
 'किताब': 78,
 'मुझसे': 79,
 'अच्छा': 80,
 'अपना': 81,
 'दिया।': 82,

In [39]:
int_to_word_decoder = dict((i,c) for c, i in decoder_t.word_index.items())

In [40]:
int_to_word_decoder

{1: '<start>',
 2: '<end>',
 3: 'है।',
 4: 'में',
 5: 'नहीं',
 6: 'मैं',
 7: 'वह',
 8: 'से',
 9: 'क्या',
 10: 'है',
 11: 'के',
 12: 'मुझे',
 13: 'को',
 14: 'हैं।',
 15: 'की',
 16: 'हो',
 17: 'बहुत',
 18: 'का',
 19: 'हूँ।',
 20: 'उसने',
 21: 'एक',
 22: 'पर',
 23: 'तुम',
 24: 'था।',
 25: 'यह',
 26: 'कर',
 27: 'लिए',
 28: 'मेरे',
 29: 'कि',
 30: 'उसे',
 31: 'और',
 32: 'इस',
 33: 'हैं',
 34: 'भी',
 35: 'अपने',
 36: 'मैंने',
 37: 'मेरी',
 38: 'रहा',
 39: 'ने',
 40: 'पास',
 41: 'तो',
 42: 'करने',
 43: 'कल',
 44: 'ही',
 45: 'हम',
 46: 'अपनी',
 47: 'काम',
 48: 'गया।',
 49: 'उसकी',
 50: 'उसके',
 51: 'कोई',
 52: 'तुम्हें',
 53: 'तुम्हारे',
 54: 'था',
 55: 'उस',
 56: 'मेरा',
 57: 'साथ',
 58: 'घर',
 59: 'करना',
 60: 'सकते',
 61: 'थी।',
 62: 'गया',
 63: 'समय',
 64: 'बात',
 65: 'आप',
 66: 'पता',
 67: 'रही',
 68: 'कुछ',
 69: 'अभी',
 70: 'बजे',
 71: 'यहाँ',
 72: 'उसको',
 73: 'आज',
 74: 'लगता',
 75: 'तक',
 76: 'तुम्हे',
 77: 'चाहिए।',
 78: 'किताब',
 79: 'मुझसे',
 80: 'अच्छा',
 81: 'अपना',
 82: 'दिया।',

### 1.5 Building Decoder Output data

In [41]:
decoder_target_data = np.zeros((decoder_input_data.shape[0], decoder_input_data.shape[1]))

In [42]:
for i in range(decoder_input_data.shape[0]):
    for j in range(1,decoder_input_data.shape[1]):
        decoder_target_data[i][j-1] = decoder_input_data[i][j]        

In [43]:
decoder_input_data[0]

array([  1, 767,   2,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0])

In [44]:
decoder_target_data[0]

array([767.,   2.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.])
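
The target is simply the decoder input shifted one position to the left, so that at every step the model is asked to predict the next word. The nested loop can also be written as a single NumPy slice assignment; the sketch below uses a small hypothetical array `decoder_in`, with rows taken from the sequences printed earlier, rather than the full data.

```python
import numpy as np

decoder_in = np.array([[1, 767, 2, 0, 0],
                       [1, 12, 5, 1499, 2]])

# Shift left by one step: target[t] is input[t + 1]; the last column stays 0.
decoder_target = np.zeros_like(decoder_in)
decoder_target[:, :-1] = decoder_in[:, 1:]

print(decoder_target)
```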

Convert target data into one-hot vectors

In [45]:
from tensorflow.keras.utils import to_categorical

In [46]:
decoder_target_one_hot = np.zeros((decoder_input_data.shape[0], 
                                   decoder_input_data.shape[1],
                                   len(decoder_t.word_index)+1))

In [47]:
for i in range(decoder_target_data.shape[0]):
    for j in range(decoder_target_data.shape[1]):
        decoder_target_one_hot[i][j] = to_categorical(decoder_target_data[i][j],
                                                      num_classes=len(
                                                          decoder_t.word_index)+1)    

In [48]:
decoder_target_one_hot.shape

(2867, 27, 3010)
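
One-hot encoding can also be done with NumPy indexing, and it is worth estimating how much memory the full array needs. The helper `one_hot` is a hypothetical name used for this sketch; the byte count assumes the float64 dtype that `np.zeros` uses by default and the shape printed above.

```python
import numpy as np

def one_hot(indices, num_classes):
    """Return an array with a trailing one-hot axis of size num_classes."""
    indices = np.asarray(indices, dtype=int)
    return np.eye(num_classes, dtype=np.float32)[indices]

small = one_hot([[3, 1, 0]], num_classes=4)   # shape (1, 3, 4)

# Memory estimate for a float64 array of shape (2867, 27, 3010)
n_bytes = 2867 * 27 * 3010 * 8
print(small.shape, round(n_bytes / 1e9, 2), "GB")
```

If that footprint is a problem, one option is to keep the integer targets and train with the `sparse_categorical_crossentropy` loss, which accepts class indices directly. The notebook does not do this, so treat it as a suggestion.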

## 2. Building the Training Model

In [49]:
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding
from tensorflow.keras.callbacks import ModelCheckpoint

Define config parameters

In [50]:
encoder_embedding_size = 50
decoder_embedding_size = 50
rnn_units = 256

### 2.1 Build Encoder layers

In [51]:
encoder_inputs = Input(shape=(None,))

In [52]:
encoder_embedding = Embedding(encoder_vocab_size+1, encoder_embedding_size)

In [53]:
encoder_embedding_output = encoder_embedding(encoder_inputs)

In [54]:
x, state_h, state_c = LSTM(rnn_units,return_state=True)(encoder_embedding_output)

In [55]:
encoder_states = [state_h, state_c]

### 2.2 Build Decoder layers

In [56]:
decoder_inputs = Input(shape=(None,))

In [57]:
decoder_embedding = Embedding(decoder_vocab_size + 1, decoder_embedding_size)

In [58]:
decoder_embedding_output = decoder_embedding(decoder_inputs)

In [59]:
decoder_rnn = LSTM(rnn_units, return_sequences=True, return_state=True)

In [60]:
#Initialize initial state with encoder_states
#Output will be all hidden sequences, last 'h' state and last 'c' state
x,_,_ = decoder_rnn(decoder_embedding_output, initial_state=encoder_states)

In [61]:
decoder_dense = Dense(decoder_vocab_size + 1, activation='softmax')

In [62]:
decoder_outputs = decoder_dense(x)

### 2.3 Build Model using both Encoder and Decoder layers

In [63]:
from tensorflow.keras.models import Model

In [64]:
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [65]:
model.compile(optimizer='adam', loss='categorical_crossentropy')
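
A quick parameter count helps to see where the model's capacity sits. The sketch below assumes the standard formulas for Embedding, LSTM and Dense layers with bias terms, and uses the vocabulary sizes and config values shown above; the helper function names are hypothetical.

```python
def embedding_params(vocab, dim):
    return vocab * dim

def lstm_params(input_dim, units):
    # 4 blocks (3 gates + candidate), each with input weights,
    # recurrent weights and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    return input_dim * units + units

# Values taken from the notebook
encoder_vocab, decoder_vocab = 2404, 3009
embedding_size, rnn_units = 50, 256

counts = {
    'encoder_embedding': embedding_params(encoder_vocab + 1, embedding_size),
    'encoder_lstm': lstm_params(embedding_size, rnn_units),
    'decoder_embedding': embedding_params(decoder_vocab + 1, embedding_size),
    'decoder_lstm': lstm_params(embedding_size, rnn_units),
    'decoder_dense': dense_params(rnn_units, decoder_vocab + 1),
}
total = sum(counts.values())

for name, n in counts.items():
    print(name, n)
print('total', total)
```

Under these assumptions the softmax layer over the Hindi vocabulary is the largest single block. `model.summary()` on the actual model would confirm the numbers.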

## 3. Train the model

In [67]:
model.fit([encoder_input_data, decoder_input_data], decoder_target_one_hot,
          batch_size=32,
          epochs=10, #Try  with 1000
          validation_split=0.2)

Train on 2293 samples, validate on 574 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras._impl.keras.callbacks.History at 0x15ff8ae4908>

In [0]:
#model.save('drive/AI-ML/models/seq2seq_eng_hin_training.h5')

## 4. Building Model for Prediction

### 4.1 Build the Encoder Model to predict Encoder States

In [68]:
encoder_model = Model(encoder_inputs, encoder_states)

### 4.2 Build the Decoder Model 

1. Define Input for both 'h' state and 'c' state initialization
2. Get RNN outputs along with h and c state
3. Define Decoder Output
4. Build Model

In [69]:
decoder_state_input_h = Input(shape=(rnn_units,))

In [70]:
decoder_state_input_c = Input(shape=(rnn_units,))

In [71]:
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

Get RNN outputs, state(s)

In [72]:
x = decoder_embedding(decoder_inputs)

In [73]:
#We will use the layer which we trained earlier
rnn_outputs, state_h, state_c = decoder_rnn(x, initial_state=decoder_states_inputs)

In [74]:
#Return the updated states so they can be fed back in at the next decoding step
decoder_states = [state_h, state_c]

Get Decoder output

In [75]:
decoder_outputs = decoder_dense(rnn_outputs)

Build Decoder Model

In [76]:
decoder_model = Model([decoder_inputs] + decoder_states_inputs,  #Model inputs
                     [decoder_outputs] + decoder_states)

## 5. Predicting Output

Build a prediction function

In [77]:
def decode_sentence(input_sequence):

    #Get the encoder state values
    decoder_initial_states_value = encoder_model.predict(input_sequence)

    #Build a sequence with '<start>' - starting sequence for Decoder
    target_seq = np.zeros((1,1))
    target_seq[0][0] = decoder_t.word_index['<start>']

    #Collect predicted words; they are joined into a sentence at the end
    predicted_words = []

    #Stop at '<end>' or once the sentence reaches the maximum length in words
    while len(predicted_words) < max_decoder_seq_length:

        predicted_outputs, h, c = decoder_model.predict([target_seq] +
                                                        decoder_initial_states_value)

        #Get the predicted output with highest probability
        predicted_output = int(np.argmax(predicted_outputs[0,-1,:]))

        #Get the predicted word from the predicted integer
        #(index 0 is padding and has no word, so treat it as end of sentence)
        predicted_word = int_to_word_decoder.get(predicted_output, '<end>')

        #Check if prediction should stop
        if predicted_word == '<end>':
            break

        #Update predicted sentence
        predicted_words.append(predicted_word)

        #Update target_seq to be the predicted word index
        target_seq[0][0] = predicted_output

        #Update initial states value for decoder
        decoder_initial_states_value = [h,c]

    return ' '.join(predicted_words)
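
The loop above is greedy decoding: at each step the most probable word is chosen and fed back as the next input. The idea can be tried without a trained model by swapping in a toy step function. Everything in this sketch (`toy_step`, `greedy_decode`, the tiny vocabulary and its transition table) is hypothetical and only stands in for `decoder_model.predict`.

```python
import numpy as np

START, END = 1, 2
index_to_word = {3: 'w3', 4: 'w4'}     # placeholder vocabulary (assumption)

def toy_step(token, state):
    """Stand-in for the decoder: returns next-token probabilities and a new state."""
    transitions = {1: 3, 3: 4, 4: 2}   # <start> -> w3 -> w4 -> <end>
    probs = np.full(5, 0.05)
    probs[transitions[token]] = 0.8
    return probs, state + 1            # the state here is just a step counter

def greedy_decode(step_fn, max_len=10):
    token, state, words = START, 0, []
    while len(words) < max_len:
        probs, state = step_fn(token, state)
        token = int(np.argmax(probs))  # pick the most probable next token
        if token == END:
            break
        words.append(index_to_word[token])
    return ' '.join(words)

print(greedy_decode(toy_step))
```

Greedy decoding commits to the single best word at every step. Alternatives such as beam search keep several candidates, but they are outside the scope of this notebook.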

Call Prediction function above

In [80]:
#Pick a random starting index and translate 10 consecutive sentences
start_num = np.random.randint(0, high=len(encoder_text) - 10)
print(start_num)

for i in range(start_num, start_num + 10):
    input_seq = encoder_input_data[i : i+1]
    predicted_sentence = decode_sentence(input_seq)
    print('--------')
    print ('Input sentence: ', encoder_text[i])
    print ('Predicted sentence: ', predicted_sentence )

263
--------
Input sentence:  I have to go now.
Predicted sentence:  मैं पास में है।
--------
Input sentence:  I live near here.
Predicted sentence:  मैं एक में से लिए है।
--------
Input sentence:  I love my mother.
Predicted sentence:  मैं पास में नहीं है।
--------
Input sentence:  I must leave now.
Predicted sentence:  मैं पास में है।
--------
Input sentence:  I'm really tired.
Predicted sentence:  मैं पास है
--------
Input sentence:  I'm very thirsty.
Predicted sentence:  मैं कहाँ है
--------
Input sentence:  It's April first.
Predicted sentence:  मैं कहाँ है
--------
Input sentence:  It's now my turn.
Predicted sentence:  मैं बहुत है
--------
Input sentence:  Let me try again.
Predicted sentence:  मैं कहाँ है
--------
Input sentence:  Nice to meet you.
Predicted sentence:  मैं कहाँ है


## 6. Save Prediction models and tokenizers 

In [None]:
#Save encoder and decoder model for Prediction
encoder_model.compile(optimizer='adam', loss='mse')
decoder_model.compile(optimizer='adam', loss='categorical_crossentropy')
encoder_model.save('drive/AI-ML/models/seq2seq_encoder_eng_hin.h5')
decoder_model.save('drive/AI-ML/models/seq2seq_decoder_eng_hin.h5')

In [0]:
#Save tokenizers
import pickle

with open('drive/AI-ML/models/encoder_tokenizer_eng','wb') as f:
    pickle.dump(encoder_t, f)
with open('drive/AI-ML/models/decoder_tokenizer_hin','wb') as f:
    pickle.dump(decoder_t, f)
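
Loading a tokenizer back is the mirror image of saving it. The round trip below is a sketch that uses a temporary directory and a plain dictionary (`toy_tokenizer`, a hypothetical stand-in) instead of the real `Tokenizer` objects, so it runs without TensorFlow or the `drive/` folder.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted tokenizer (assumption, for illustration only)
toy_tokenizer = {'word_index': {'<start>': 1, '<end>': 2}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'decoder_tokenizer_hin')

    # Save: the context manager closes the file even if dump() fails
    with open(path, 'wb') as f:
        pickle.dump(toy_tokenizer, f)

    # Load it back at prediction time
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == toy_tokenizer)
```

Only unpickle files you created yourself: `pickle.load` can execute arbitrary code from an untrusted file.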

## Conclusion
RNNs and LSTMs are well suited to sequence tasks such as text processing, speech processing and machine translation. However, they are difficult to train: their computation is memory-bandwidth-bound, which is a nightmare for hardware designers and ultimately limits the applicability of these neural network solutions.