# Neural Machine Translation

In this task we develop a neural machine translation (NMT) system to translate text from one language to another. For this, one need to chose the data to train the models, perform data processing and train a sequence2sequence neural model.


#### Author : Shubhajit Basak

**Import the required packages :**

In [1]:
import sys
import os
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

import nltk
nltk.download('punkt') # Uncomment if running from Google Colab
from nltk.translate.bleu_score import corpus_bleu


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
from keras.layers import Input, LSTM, Embedding, Dense
from keras.models import Model

Using TensorFlow backend.


## Section 1- Data Collection and Preprocessing 


There are few datasets to train an NMT system available from Tatoeba Project (http://www.manythings.org/anki/) or OPUS project (http://opus.nlpl.eu/).

*  Download a langauge pair (preferably European language) and **extract** the file(s) and upload it to colab
*  Create a list of lines by splitting the text file at every occurance of '\n'
*  Print number of sentences
*  Limit the amount of senteces to 10,000 lines (but more than 5,000 lines)
*  Split the data into train and test [You can split validation set here or while training use kerase validation_split option]
*  Print 100th sentence in original script[ not unicode] for source and target language


**Note -** I am using 15000 records instead of 10000 records for better performance. I will split it to train and test after preprocessing the data at task 2 Otherwise we have to preprocess train and test separately and will create duplicate code.

In [3]:
#your code goes here

#------------- Uncomment Following if running from Colab ------------ #
if not (os.path.isfile('./fra.txt')):
  !wget http://www.manythings.org/anki/fra-eng.zip
  !unzip fra-eng.zip

#your code goes here
filename = "./fra.txt"
with open(filename , 'r') as f:
    lines=f.read().split('\n')

lines = lines[0:15000]

--2019-04-18 21:12:20--  http://www.manythings.org/anki/fra-eng.zip
Resolving www.manythings.org (www.manythings.org)... 104.24.108.196, 104.24.109.196, 2606:4700:30::6818:6cc4, ...
Connecting to www.manythings.org (www.manythings.org)|104.24.108.196|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3423204 (3.3M) [application/zip]
Saving to: ‘fra-eng.zip’


2019-04-18 21:12:21 (4.91 MB/s) - ‘fra-eng.zip’ saved [3423204/3423204]

Archive:  fra-eng.zip
  inflating: _about.txt              
  inflating: fra.txt                 


In [4]:
print(len(lines))
print(lines[100])

15000
Come in.	Entrez !


**Task 2** 

* Add '\t' to denote begining of sentence and '\n'  or '<eos\>' to denote end of the sentence to the each target line.
* Preprocess (word tokenisation, lowecasing) the text.

In [0]:
# your code goes here
input_texts = list()
target_texts = list()
input_characters = set()
target_characters = set()


Adding the start tag as 'sos_' and end tag as '_eos' after lowering the text

In [0]:
for line in lines:
    input_text, target_text = line.split('\t')
    target_text = 'sos_ ' + target_text.lower() + ' _eos'
    input_texts.append(input_text.lower())
    target_texts.append(target_text)

Tokenize the data and creating the Source and Target Vocabulary

In [0]:
# Source Vocabulary
Source_vocabulary=set()
source_length_list=[]
for line in input_texts:
    source_length_list.append(len(nltk.word_tokenize(line)))
    for word in nltk.word_tokenize(line):
        if word not in Source_vocabulary:
            Source_vocabulary.add(word)
 

# Target Vocabulary
target_vocabulary=set()
target_length_list=[]
for line in target_texts:
    target_length_list.append(len(nltk.word_tokenize(line)))
    for word in nltk.word_tokenize(line):
        if word not in target_vocabulary:
            target_vocabulary.add(word)

num_source_tokens = len(Source_vocabulary)
num_target_tokens = len(target_vocabulary)

max_source_seq_length = np.max(source_length_list)
max_target_seq_length = np.max(target_length_list)

Split the data into train and test after preprocessing 

In [0]:
input_texts_train, input_texts_test, target_texts_train, target_texts_test = train_test_split(
    input_texts,target_texts, test_size=0.33, random_state=42)

In [9]:
print('Number of samples:', len(input_texts))
print('Number of unique source language tokens:', num_source_tokens)
print('Number of unique target language tokens:', num_target_tokens)
print('Max sequence length of source language:', max_source_seq_length)
print('Max sequence length of target language:', max_target_seq_length)
print("Source Vocabulary",Source_vocabulary)
print("Target Vocabulary",target_vocabulary)

Number of samples: 15000
Number of unique source language tokens: 2883
Number of unique target language tokens: 6056
Max sequence length of source language: 7
Max sequence length of target language: 14
Source Vocabulary {'call', 'buried', 'insects', 'spinach', 'struck', 'country', 'lid', 'mistaken', 'reptiles', 'cuff', 'see', 'nailed', 'whining', 'wow', 'tied', 'weddings', '99', 'doctors', 'hassle', 'balls', 'anytime', "n't", 'divorce', 'packing', 'ruthless', 'helped', 'over', 'tower', 'patrol', 'congratulations', 'cost', 'moved', 'losing', 'neighbors', 'week', 'laughing', 'history', 'group', 'elected', 'store', 'ponchos', 'hobo', 'naked', 'sunday', "''", 'suddenly', 'minute', 'resist', 'complain', 'bomb', 'immediately', 'safer', 'boss', 'arrested', 'bored', 'nearly', 'threw', 'news', 'often', 'uninsured', 'surprise', 'complex', 'tools', 'optimistic', 'ethical', 'succeeded', 'trout', 'glad', 'brave', 'laughs', 'climb', 'furious', 'driving', 'painful', 'letter', 'bet', 'word', 'changes'

** Task 3**

*  Assign each unique word to an integer value (5 marks).
*  Create word embedding for your vocabulary using pretrained Glove embeddigns (5 marks) (http://nlp.stanford.edu/data/glove.6B.zip) [see Lab 7]
* Print the first line of the embeddings (see below) 

In [0]:
# Assign unique token to words in the vocabulary
#your code goes here
input_token_index = dict([(word,i) for i, word in enumerate(Source_vocabulary)])
target_token_index= dict([(word,i) for i, word in enumerate(target_vocabulary)])

Download the Glove Embedding -

In [11]:
#------------- Uncomment Following if running from Colab ------------ #
if not (os.path.isfile('./glove.6B.50d.txt')):
  !wget http://nlp.stanford.edu/data/glove.6B.zip
  !unzip glove.6B.zip

--2019-04-18 21:12:32--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2019-04-18 21:12:32--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2019-04-18 21:13:54 (10.0 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


Create the embedding dictionary using the 50 feature dimensional glove embedding

In [12]:
embeddings_index = dict()
f = open('./glove.6B.50d.txt', 'r', encoding='utf8', errors='ignore')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 400000 word vectors.


In [13]:
embedding_matrix = np.zeros((num_source_tokens, 50))
for word,i in input_token_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    else:
        print(word)
        print("Not Found")

housesitting
Not Found
223-1374
Not Found
ok.
Not Found


In [14]:
embedding_matrix.shape

(2883, 50)

In [15]:
#print first line of embeddings here
print(embedding_matrix[0])

[ 0.098201    0.39923999  0.25696999 -0.085349    0.27175    -0.63637
 -0.62718999  0.25895    -0.53249002 -0.22927    -0.76257998  0.2173
  0.37017     0.082194    0.46015999  0.14439    -0.35332999 -0.62408
  0.1025     -0.58596998  0.16874     0.41938999  0.082275    0.48931
  0.62348002 -1.8434     -0.11815    -0.25465     0.38033    -0.41892999
  3.0158999   0.35014001 -1.26559997 -0.14951    -0.32056001 -0.72768998
  0.53979999 -1.25320005 -0.013795   -0.0048093   0.37452999  0.41136
 -0.12614     0.48701     0.47819999  0.35898    -0.1709      0.70283997
  0.32207     0.77503002]


## Section 2 Translation Model training



---



**Task 3**
* Provide code for the encoder using Keras LSTM 
* Provide code for the decoder using Keras LSTM 
* Train the sequence2sequence (encoder-decoder) model


Prepare the Training data as a matrix with the unique index:

In [0]:
# Create matrix with zeros for source and target data -
encoder_input_data_train = np.zeros(
    (len(input_texts_train), max_source_seq_length),
    dtype='float32')
decoder_input_data_train = np.zeros(
    (len(target_texts_train), max_target_seq_length),
    dtype='float32')

# Create matrix with zeros for target label
decoder_target_data_train = np.zeros(
    (len(target_texts_train), max_target_seq_length, num_target_tokens),
    dtype='float32')

In [0]:
# update the matrix data with the index from the unique index -
for i, (input_text, target_text) in enumerate(zip(input_texts_train, target_texts_train)):
    for t, word in enumerate(nltk.word_tokenize(input_text)):
        encoder_input_data_train[i, t] = input_token_index[word]
    for t, word in enumerate(nltk.word_tokenize(target_text)):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data_train[i, t] = target_token_index[word]
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            # will update as one hot vector at the labels 
            decoder_target_data_train[i, t - 1, target_token_index[word]] = 1.

In [18]:
print(encoder_input_data_train.shape)
print(decoder_input_data_train.shape)
print(decoder_target_data_train.shape)

(10050, 7)
(10050, 14)
(10050, 14, 6056)


In [0]:
# Embedding layer size 
embedding_size = 50

Encoder Layer :

In [20]:
# Encoder Input
encoder_inputs = Input(shape=(None,))
# Embedding layer (non-trainable) with the Glove Embedding weights
encoder_embedd=  Embedding(num_source_tokens, embedding_size , weights=[embedding_matrix],
                input_shape=(max_source_seq_length,), trainable=False)(encoder_inputs)
# LSTM Layer
encoder = LSTM(50, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_embedd)
encoder_states = [state_h, state_c] # Encoder States

Instructions for updating:
Colocations handled automatically by placer.


Decoder Layer :

In [0]:
# decoder code goes here

decoder_inputs = Input(shape=(None,))
# Decoder Embedding with Keras Default Embedding (will be trained)
decoder_embedd=  Embedding(num_target_tokens, embedding_size)
final_decoder= decoder_embedd(decoder_inputs)
# Decoder LSTM Layers
decoder_lstm = LSTM(50, return_sequences=True, return_state=True)

# Set up the decoder, using `encoder_states` as initial state.
decoder_outputs, _, _ = decoder_lstm(final_decoder,
                                     initial_state=encoder_states)

# Softmax & Dense Layer
decoder_dense = Dense(num_target_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)


Final Model -

In [22]:
model_seq= Model([encoder_inputs, decoder_inputs], decoder_outputs)
model_seq.compile(optimizer='rmsprop',loss='categorical_crossentropy', metrics=['acc'])
model_seq.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 50)     144150      input_1[0][0]                    
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, None, 50)     302800      input_2[0][0]                    
__________________________________________________________________________________________________
lstm_1 (LS

Train the model - 

In [23]:
if(os.path.isfile('./seq2seq_source_target.h5')):
    model_seq.load_weights('seq2seq_source_target.h5')
else:
    model_seq.fit([encoder_input_data_train, decoder_input_data_train], decoder_target_data_train,
          batch_size=128,
          epochs=30,
          validation_split=0.05)  
    model_seq.save('seq2seq_source_target.h5')

Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Train on 9547 samples, validate on 503 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


  '. They will not be included '


# Section 3 Testing

** Task 4 **

*   Use the trained model to translate the text from source into target language. Use the test/evaluation set (see Section 1) and perform an automatic evaluation with the BLEU metric. use the NLTK library to calculate BLEU.


Prepara Data for Test (Only the Source Data) same as training data -

In [0]:
encoder_input_data_test = np.zeros(
    (len(input_texts_test), max_source_seq_length),
    dtype='float32')


In [0]:
for i, (input_text, target_text) in enumerate(zip(input_texts_test, target_texts_test)):
    for t, word in enumerate(nltk.word_tokenize(input_text)):
        encoder_input_data_test[i, t] = input_token_index[word]


In [26]:
print(encoder_input_data_test.shape)


(4950, 7)


Create a dictionary to get the word from their index -

In [0]:
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())

To Predict we will create a model sames as training without the softmax layer:

* This will behave as transfer learning
* Will use the already trained embedding and the encoder and decoder states values

In [28]:
# Encoder Model
encoder_model = Model(encoder_inputs, encoder_states)
encoder_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, None)              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, None, 50)          144150    
_________________________________________________________________
lstm_1 (LSTM)                [(None, 50), (None, 50),  20200     
Total params: 164,350
Trainable params: 20,200
Non-trainable params: 144,150
_________________________________________________________________


In [0]:
# Decoder Model
# Decoder States
decoder_state_input_h = Input(shape=(50,))
decoder_state_input_c = Input(shape=(50,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

# Decoder Embedding with pre trained weights
final_dex2= decoder_embedd(decoder_inputs)

# Decoder LSTM with Pretrained weights
decoder_outputs2, state_h2, state_c2 = decoder_lstm(final_dex2, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
# Decoder Dense with Pretrained Weights
decoder_outputs2 = decoder_dense(decoder_outputs2)
# Decoder Final Model
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs2] + decoder_states2)

In [0]:
#Your code goes here

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1,1))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0] = target_token_index['sos_']

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '_eos' or
           len(decoded_sentence) > max_target_seq_length):
            decoded_sentence = 'sos_ '+ decoded_sentence + ' _eos'
            stop_condition = True
        else:
            decoded_sentence += ' '+sampled_char  

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1,1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]
        
        
    return decoded_sentence

Create list of Actual and predicted Target Values

In [0]:
actual_seq, predicted_seq = list(), list()
for seq_index in range(len(encoder_input_data_test)):
    input_seq = encoder_input_data_test[seq_index: seq_index + 1]
    decoded_sent = decode_sequence(input_seq)
    predicted_seq.append(nltk.word_tokenize(decoded_sent))
    actual_seq.append([nltk.word_tokenize(target_texts_test[seq_index])])

Calculate Cumulative BLEU Score :

In [32]:
print("BLEU Score on Test Set : ")
print('BLEU-1: %f' % corpus_bleu(actual_seq, predicted_seq, weights=(1.0, 0, 0, 0)))
print('BLEU-2: %f' % corpus_bleu(actual_seq, predicted_seq, weights=(0.5, 0.5, 0, 0)))
print('BLEU-3: %f' % corpus_bleu(actual_seq, predicted_seq, weights=(0.3, 0.3, 0.3, 0)))
print('BLEU-4: %f' % corpus_bleu(actual_seq, predicted_seq, weights=(0.25, 0.25, 0.25, 0.25)))

BLEU Score on Test Set : 
BLEU-1: 0.493255
BLEU-2: 0.308685
BLEU-3: 0.217218
BLEU-4: 0.105131


# Section 4 Attention

---



** Task 5 **sequence2sequence

* Extend the existing Seq2Seq model with an attention mechanism
* Create sequence2sequence model with attention
* Train the model with the same data from Section 1
* Translate the evaluation using the sequence2sequence attention model
* Evaluate the translations made with the sequence2sequence attention model and compare it with the model without attention using BLEU

Make the sequence to sequence with attention model :

* Here I am using global attention model proposed in Lonng et al’s paper

In [0]:
# Import additional libraries
from keras.layers import Activation, dot, concatenate
from keras.layers import TimeDistributed

In [0]:
# Define Encoder and Decoder Input
encoder_input = Input(shape=(max_source_seq_length,))
decoder_input = Input(shape=(max_target_seq_length,))

In [0]:
# Embedding (non-trainable) with Glove Embedding weight
encoder = Embedding(num_source_tokens, embedding_size , weights=[embedding_matrix] , 
                    input_length=max_source_seq_length , mask_zero=True , trainable=False)(encoder_input) #
# Encoder LSTM
encoder = LSTM(embedding_size, return_sequences=True, unroll=True)(encoder)
# Last word from encoder to fed to the decoder input
encoder_last = encoder[:,-1,:]

In [0]:
# Decoder Embedding with Normal Keras Embedding (will be trained)
decoder = Embedding(num_target_tokens, embedding_size, 
                    mask_zero=True, input_length=max_target_seq_length)(decoder_input) #, mask_zero=True
# Decoder LSTM
decoder = LSTM(embedding_size, return_sequences=True, unroll=True)(decoder, initial_state=[encoder_last, encoder_last])


In [0]:
# Attention dot product of the encoder and decoder weights
attention = dot([decoder, encoder], axes=[2, 2])
# Attention Softmax to get the optimum weight
attention = Activation('softmax', name='attention')(attention)

# Context dot product of pre attention weight with encoder
context = dot([attention, encoder], axes=[2,1])

In [0]:
# Decoder combined context 
decoder_combined_context = concatenate([context, decoder])

# Time Distributed Layer with activation and softmax
output = TimeDistributed(Dense(embedding_size, activation="tanh"))(decoder_combined_context)
output = TimeDistributed(Dense(num_target_tokens, activation="softmax"))(output)


Final Model

In [0]:
# Final Model
model_attention = Model(inputs=[encoder_input, decoder_input], outputs=[output])
model_attention.compile(optimizer='rmsprop', loss='categorical_crossentropy',metrics=['acc']) # 'adam'

In [40]:
model_attention.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_6 (InputLayer)            (None, 14)           0                                            
__________________________________________________________________________________________________
input_5 (InputLayer)            (None, 7)            0                                            
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, 14, 50)       302800      input_6[0][0]                    
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 7, 50)        144150      input_5[0][0]                    
__________________________________________________________________________________________________
lstm_4 (LS

Train the Model

In [41]:
if(os.path.isfile('./seq2seq_attention_source_target.h5')):
    model_attention.load_weights('seq2seq_attention_source_target.h5')
else:
    model_attention.fit([encoder_input_data_train, decoder_input_data_train], decoder_target_data_train,
              batch_size=128,
              epochs=50,
              validation_split=0.05)
    model_attention.save('seq2seq_attention_source_target.h5')

Train on 9547 samples, validate on 503 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


  '. They will not be included '


Function to predict from input :

In [0]:
def decode_sequence_attention(input_seq):
    decoded_sentence = ''
    # encoder input
    encoder_input = input_seq.reshape(1,max_source_seq_length)
    # create blank matrix for decoder data
    decoder_input = np.zeros(shape=(len(encoder_input), max_target_seq_length))
    # update the first element with the start index
    decoder_input[:,0] = target_token_index['sos_']
    # loop through the max length
    for i in range(1, max_target_seq_length):
        # Predict the index of the next word
        output = model_attention.predict([encoder_input, decoder_input]).argmax(axis=2)
        decoder_input[:,i] = output[:,i]
    # Reverse to get the sentence from the word index
    for d in decoder_input[:,0:][0]:
        if (reverse_target_char_index[d] == '_eos'):
            decoded_sentence += ' '+reverse_target_char_index[d]
            break
        decoded_sentence += ' '+reverse_target_char_index[d]
    return decoded_sentence.strip()

In [43]:
print(target_texts_train[10])
print(decode_sequence_attention(encoder_input_data_train[10]))

sos_ nous avons un plan. _eos
sos_ nous avons un plan . _eos


In [44]:
len(encoder_input_data_test)

4950

Create list of actual and predicted target

In [0]:
actual_attention, predicted_attention = list(), list()
for seq_index in range(len(encoder_input_data_test)):
    input_seq = encoder_input_data_test[seq_index: seq_index + 1]
    decoded_sent = decode_sequence_attention(input_seq)
    predicted_attention.append(nltk.word_tokenize(decoded_sent))
    actual_attention.append([nltk.word_tokenize(target_texts_test[seq_index])])

Get the Cumulative BLEU Score

In [46]:
print("BLEU Score on 1000 records of Training Set : ")
print('BLEU-1: %f' % corpus_bleu(actual_attention, predicted_attention, weights=(1.0, 0, 0, 0)))
print('BLEU-2: %f' % corpus_bleu(actual_attention, predicted_attention, weights=(0.5, 0.5, 0, 0)))
print('BLEU-3: %f' % corpus_bleu(actual_attention, predicted_attention, weights=(0.3, 0.3, 0.3, 0)))
print('BLEU-4: %f' % corpus_bleu(actual_attention, predicted_attention, weights=(0.25, 0.25, 0.25, 0.25)))

BLEU Score on 1000 records of Training Set : 
BLEU-1: 0.560153
BLEU-2: 0.413235
BLEU-3: 0.279184
BLEU-4: 0.142912


**Observation :**

* We have found the BLEU Score has been improved after applying the attention model
* This can further improved if we use local attention layer and train more epochs with more data

**Reference : **

1. Medium. (2019). Neural Machine Translation using word level seq2seq model. [online] Available at: https://medium.com/@dev.elect.iitd/neural-machine-translation-using-word-level-seq2seq-model-47538cba8cd7 [Accessed 18 Apr. 2019].
2. Tanakitrungruang, W. (2019). Attention-based Sequence-to-Sequence in Keras. [online] WT. Available at: https://wanasit.github.io/attention-based-sequence-to-sequence-in-keras.html [Accessed 18 Apr. 2019].
3. Brownlee, J. (2019). How to Use Word Embedding Layers for Deep Learning with Keras. [online] Machine Learning Mastery. Available at: https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/ [Accessed 18 Apr. 2019].
4. Brownlee, J. (2019). A Gentle Introduction to Calculating the BLEU Score for Text in Python. [online] Machine Learning Mastery. Available at: https://machinelearningmastery.com/calculate-bleu-score-for-text-python/ [Accessed 18 Apr. 2019].
5. Luong, M., Pham, H. and Manning, C. (2019). Effective Approaches to Attention-based Neural Machine Translation. [online] arXiv.org. Available at: https://arxiv.org/abs/1508.04025 [Accessed 18 Apr. 2019].