<a href="https://colab.research.google.com/github/sheelaj123/Machine-Translation/blob/main/Encoder_Decoder_Machine_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Building an Encoder-Decoder LSTM Model for Machine Translation

##Step by step

In [1]:
!pip install keras



In [2]:
!pip3 install nltk



In [3]:
!pip3 install transformers



##Problem statement:

A collection of sample English words and their equivalent Hindi words is given.

The task is to train a model on these pairs using an LSTM architecture so that, given an English word, it outputs the translated Hindi word.

#Dataset:  
The “English_Hindi.txt” dataset consists of English words and their equivalent Hindi words, one tab-separated pair per line.

To build the model with LSTM layers, we first import the required libraries. TensorFlow must be installed before the rest of the code is run.

In [4]:
#Import the required libraries
import numpy as np
import keras
import tensorflow as tf
from keras.models import Model
from keras.layers import Input, LSTM, Dense


After importing the required libraries, we define the hyperparameters: the batch size for training, the latent dimensionality of the encoding space, and the number of samples to train on.

In [5]:
#Hyperparameters
batch_size = 64    #Batch size for training
latent_dim = 256   #Latent dimensionality of the encoding space
num_samples = 31   #Number of samples to train on


In the next section of the code, we vectorize the data: we read the input file, which contains English sentences and their corresponding Hindi sentences.

This step collects the input and target character sets that are used to convert the text sequences into feature vectors.

In [7]:
#Vectorize the data.
input_texts = []
target_texts = []
input_chars = set()
target_chars = set()
with open('English_Hindi.txt', 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')
for line in lines[: min(num_samples, len(lines) - 1)]:
    input_text, target_text = line.split('\t')
    #Use tab as the start-of-sequence character and
    #newline as the end-of-sequence character for the targets.
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    #Collect the distinct characters seen on each side.
    input_chars.update(input_text)
    target_chars.update(target_text)
input_chars = sorted(input_chars)
target_chars = sorted(target_chars)
num_encoder_tokens = len(input_chars)
num_decoder_tokens = len(target_chars)
max_encoder_seq_length = max(len(txt) for txt in input_texts)
max_decoder_seq_length = max(len(txt) for txt in target_texts)
#Print size
print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)


Number of samples: 31
Number of unique input tokens: 38
Number of unique output tokens: 51
Max sequence length for inputs: 10
Max sequence length for outputs: 21
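
To make the bookkeeping in this step concrete, here is a minimal sketch of the same character-level vocabulary building on two made-up pairs. The pairs and the `toy_*` names are illustrative assumptions and are not taken from “English_Hindi.txt”.

```python
# Illustrative pairs only; they are not taken from English_Hindi.txt.
toy_pairs = [("Go", "जाओ"), ("Hi", "नमस्ते")]

toy_inputs, toy_targets = [], []
for src, tgt in toy_pairs:
    toy_inputs.append(src)
    # '\t' marks the start of the target and '\n' marks its end.
    toy_targets.append("\t" + tgt + "\n")

# Each distinct Unicode code point becomes one token, so Hindi vowel
# signs are counted as tokens of their own.
toy_input_chars = sorted(set("".join(toy_inputs)))
toy_target_chars = sorted(set("".join(toy_targets)))

print(toy_input_chars)
print(len(toy_target_chars))
print(max(len(t) for t in toy_targets))
```

Because the start and end markers are added before the vocabulary is collected, they are part of the target character set and count toward the maximum target length.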


After feature extraction, we have everything needed to define the input data for the encoder and the decoder, and the target data for the decoder.

In [8]:
#Define data for encoder and decoder
input_token_id = {char: i for i, char in enumerate(input_chars)}
target_token_id = {char: i for i, char in enumerate(target_chars)}
#One-hot tensors of shape (samples, timesteps, vocabulary size)
encoder_in_data = np.zeros((len(input_texts), max_encoder_seq_length, num_encoder_tokens), dtype='float32')
decoder_in_data = np.zeros((len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype='float32')
decoder_target_data = np.zeros((len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype='float32')
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_in_data[i, t, input_token_id[char]] = 1.
    for t, char in enumerate(target_text):
        decoder_in_data[i, t, target_token_id[char]] = 1.
        if t > 0:
            #decoder_target_data is ahead of decoder_in_data by one timestep
            #and does not include the start character.
            decoder_target_data[i, t - 1, target_token_id[char]] = 1.
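
The one-timestep offset between the decoder input and the decoder target is easier to see on a single string. The sketch below repeats the loop above for one made-up target; `toy_target`, `dec_in` and `dec_out` are illustrative names, and the string is not taken from the dataset.

```python
import numpy as np

# Illustrative target string; '\t' is the start marker and '\n' the end marker.
toy_target = "\tजाओ\n"
toy_chars = sorted(set(toy_target))
toy_id = {ch: i for i, ch in enumerate(toy_chars)}

steps, vocab = len(toy_target), len(toy_chars)
dec_in = np.zeros((steps, vocab), dtype="float32")
dec_out = np.zeros((steps, vocab), dtype="float32")
for t, ch in enumerate(toy_target):
    dec_in[t, toy_id[ch]] = 1.0
    if t > 0:
        # The target at step t - 1 is the input character at step t.
        dec_out[t - 1, toy_id[ch]] = 1.0

# Every target row except the last equals the next input row.
print(np.array_equal(dec_out[:-1], dec_in[1:]))  # True
# The last target row stays all zeros because nothing follows '\n'.
print(dec_out[-1].sum())  # 0.0
```

During training, the decoder therefore sees the correct previous character at each step and is asked to predict the next one.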


In the next section of code, we define the input sequence for the encoder and process it. The encoder's final states, ‘encoder_states’, are then used as the initial state of the decoder.

In [9]:
#Define and process the input sequence
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
#We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]
#Set up the decoder, using `encoder_states` as its initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)


Now we define the final model, which takes ‘encoder_inputs’ and ‘decoder_inputs’ as inputs and produces ‘decoder_outputs’ as its output.

After defining the final model, we check it by printing its summary and the shapes of the data.

In [10]:
#Final model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)


In [11]:
#Model Summary
model.summary()


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, None, 38)]           0         []                            
                                                                                                  
 input_2 (InputLayer)        [(None, None, 51)]           0         []                            
                                                                                                  
 lstm (LSTM)                 [(None, 256),                302080    ['input_1[0][0]']             
                              (None, 256),                                                        
                              (None, 256)]                                                        
                                                                                              

In [12]:
#Model data Shape
print("encoder_in_data shape:",encoder_in_data.shape)
print("decoder_in_data shape:",decoder_in_data.shape)
print("decoder_target_data shape:",decoder_target_data.shape)


encoder_in_data shape: (31, 10, 38)
decoder_in_data shape: (31, 21, 51)
decoder_target_data shape: (31, 21, 51)
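
As a sanity check on the model summary above, the parameter count of an LSTM layer can be worked out by hand: each of the four gates has an input kernel, a recurrent kernel and a bias. The sketch below applies that standard formula; the helper names are illustrative. The encoder figure matches the 302080 shown in the summary. The decoder and dense rows are cut off in the summary, so their values here are computed from the formula, not read from the output.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output unit.
    return input_dim * units + units

latent_dim, num_encoder_tokens, num_decoder_tokens = 256, 38, 51

encoder_count = lstm_params(num_encoder_tokens, latent_dim)
decoder_count = lstm_params(num_decoder_tokens, latent_dim)
dense_count = dense_params(latent_dim, num_decoder_tokens)

print(encoder_count)  # 302080
print(decoder_count)  # 315392
print(dense_count)    # 13107
```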


Once the final model is ready, we compile and train it. In this example, the model is trained for only 50 epochs; training for more epochs may improve accuracy.

In [13]:
#Compiling and training the model
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
#Alternative optimizer:
#model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01), loss='categorical_crossentropy')
model.fit([encoder_in_data, decoder_in_data], decoder_target_data,
          batch_size=batch_size, epochs=50, validation_split=0.2)


Epoch 1/50
...
Epoch 50/50


<keras.src.callbacks.History at 0x7d91ec4f2e90>
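
It can help to work out how much training these settings actually amount to. The sketch below is plain arithmetic on the hyperparameters used above. It assumes that Keras holds out the last `validation_split` fraction of the samples and rounds the training share down, so treat the numbers as an estimate and not as logged output.

```python
import math

num_samples, batch_size, epochs, validation_split = 31, 64, 50, 0.2

# Assumption: the last fraction of the samples is held out for validation.
num_train = math.floor(num_samples * (1 - validation_split))
num_val = num_samples - num_train
# A batch size larger than the training set means one batch per epoch.
steps_per_epoch = math.ceil(num_train / batch_size)
total_updates = steps_per_epoch * epochs

print(num_train, num_val)              # 24 7
print(steps_per_epoch, total_updates)  # 1 50
```

Under these assumptions the model receives only about 50 weight updates on 24 samples, which is one reason to try more epochs, more data, or a smaller batch size.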

After training the model, we test it with separate sampling (inference) models that reuse the layers and weights of the trained model.

In [14]:
#Define sampling models
encoder_model = Model(encoder_inputs, encoder_states)
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)


In the code below, we define the function that decodes an input sequence.

The input sequence is first encoded into a context vector (the encoder's state vectors), which is passed to the decoder as its initial state together with a target sequence that contains only the start character.

The decoder then predicts one character at a time, and each predicted character is fed back as the next input until the end-of-sequence character is produced or the maximum length is reached.

In [15]:
reverse_input_char_index = dict((i, char) for char, i in input_token_id.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_id.items())
#Define Decode Sequence
def decode_sequence(input_seq):
    #Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    #Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    #Populate the first character of the target sequence with the start character.
    target_seq[0, 0, target_token_id['\t']] = 1.
    #Sampling loop for a batch of sequences
    #(to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        #Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char
        #Exit condition: either hit max length
        #or find stop character.
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True
        #Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.
        #Update states
        states_value = [h, c]
    return decoded_sentence
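
The control flow of this greedy decoding loop can be tried out without a trained model. In the sketch below, `toy_step` is a hypothetical stand-in for `decoder_model.predict` that always emits the same three characters, and the four-character vocabulary is made up. Only the loop structure and the stop condition mirror the function above.

```python
import numpy as np

vocab = ["\t", "\n", "a", "b"]  # hypothetical target vocabulary
token_id = {ch: i for i, ch in enumerate(vocab)}

def toy_step(prev_index, state):
    """Hypothetical stand-in for decoder_model.predict: returns scores and a new state."""
    script = "ab\n"  # the toy decoder always emits 'a', 'b', then the stop character
    scores = np.zeros(len(vocab))
    scores[token_id[script[min(state, len(script) - 1)]]] = 1.0
    return scores, state + 1

def greedy_decode(max_len):
    prev_index, state, decoded = token_id["\t"], 0, ""
    while True:
        scores, state = toy_step(prev_index, state)
        prev_index = int(np.argmax(scores))  # greedy: pick the highest-scoring token
        decoded += vocab[prev_index]
        # Stop on the end marker or once the output exceeds max_len characters.
        if vocab[prev_index] == "\n" or len(decoded) > max_len:
            return decoded

print(repr(greedy_decode(max_len=10)))  # 'ab\n'
print(repr(greedy_decode(max_len=1)))   # 'ab'
```

The second call shows a detail of the stop condition: because the length is checked with `>` after the character is appended, the result can be one character longer than the limit.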


Finally, we check the model by decoding input words into target words. In this case, the model translates the first 20 English training samples into Hindi.

In [16]:
for seq_index in range(20):
    input_seq = encoder_in_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)

-
Input sentence: Wow!
Decoded sentence: ममा 

-
Input sentence: Help!
Decoded sentence: ममा 

-
Input sentence: Jump.
Decoded sentence: माा 

-
Input sentence: Jump.
Decoded sentence: माा 

-
Input sentence: Jump.
Decoded sentence: माा 

-
Input sentence: Hello!
Decoded sentence: ममा 

-
Input sentence: Hello!
Decoded sentence: ममा 

-
Input sentence: Cheers!
Decoded sentence: ममा 

-
Input sentence: Cheers!
Decoded sentence: ममा 

-
Input sentence: Got it?
Decoded sentence: ममा 

-
Input sentence: I'm OK.
Decoded sentence: ममा 

-
Input sentence: Awesome!
Decoded sentence: ममा 

-
Input sentence: Come in.
Decoded sentence: ममा 

-
Input sentence: Get out!
Decoded sentence: ममा 

-
Input sentence: Go away!
Decoded sentence: ममा 

-
Input sentence: Goodbye!
Decoded sentence: ममा 

-
Input sentence: Perfect!
Decoded sentence: ममा 

-
Input sentence: Perfect!
Decoded sentence: ममा 

-
Input sentence: Welcome.
Decoded sentence: ममा 

-
Input sentence: Welcome.
Decoded sentence: ममा 
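
One quick way to read output like the above is to count how many distinct strings the decoder produced. The sketch below uses the inputs and decoded strings transcribed from the printed results, with trailing whitespace stripped. The interpretation is a suggestion only: when far fewer distinct outputs than distinct inputs appear, the decoder is probably relying very little on the encoder state, which would be consistent with an undertrained model.

```python
from collections import Counter

# Decoded strings transcribed from the output above, trailing whitespace stripped.
decoded = ["ममा"] * 2 + ["माा"] * 3 + ["ममा"] * 15
inputs = ["Wow!", "Help!", "Jump.", "Jump.", "Jump.", "Hello!", "Hello!",
          "Cheers!", "Cheers!", "Got it?", "I'm OK.", "Awesome!", "Come in.",
          "Get out!", "Go away!", "Goodbye!", "Perfect!", "Perfect!",
          "Welcome.", "Welcome."]

print(len(set(inputs)), "distinct inputs")
print(Counter(decoded).most_common())
```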


