**Assignment 2** focuses on training a Neural Machine Translation (NMT) system with an attention model.

This is an **individual assignment** and the usual rules for plagiarism apply! With this you agree that: "In submitting this work I confirm that it is entirely my own. I acknowledge that I may be invited to an online interview if there is any concern in relation to the integrity of my exam."

**Write comments and documentation.**



## Section 1 - Data Collection and Preprocessing


---


**Task 1  (5 marks)**

---

There are a few datasets for training an NMT system available from the OPUS project (http://opus.nlpl.eu/).

*  Download a language pair (preferably European languages), **extract** the file(s) and upload them to CoLab
*  Create a list of lines for each language by splitting the source and target text files at every line break
*  Print the number of sentences
*  Limit the number of sentences to 10,000 lines (but more than 5,000 lines)
*  Split the data into train, development and test sets
*  Print the 100th sentence in its original script for the source and target language

In [None]:
## importing required libraries

import os, sys

from keras.models import Model
from keras.layers import Input, LSTM, GRU, Dense, Embedding, Dropout, concatenate, dot, Activation
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

In [None]:

eng_filename = "english.txt"
french_filename = "french.txt"

# reading data from each file where lines are split using "\n"
with open(eng_filename , 'r') as english:
    english_lines = english.read().split('\n')

with open(french_filename, 'r') as french:
    french_lines = french.read().split('\n')

print(len(english_lines))


english_lines, french_lines = np.array(english_lines), np.array(french_lines)

# dividing the data into train, dev and test sets
# a permutation uses every index exactly once, so the three sets cannot overlap
shuffled = np.random.permutation(len(english_lines))

train_x, dev_x, test_x = english_lines[shuffled[0:6000]],  english_lines[shuffled[6000:6500]],  english_lines[shuffled[6500:]]
train_y, dev_y, test_y = french_lines[shuffled[0:6000]],  french_lines[shuffled[6000:6500]],  french_lines[shuffled[6500:]]

print(f"100th line of source is {english_lines[100]}")
print(f"100th line of target is {french_lines[100]}")

7635
100th line of source is &Keep report file for sending later or copying to somewhere else
100th line of target is &Garder le fichier du rapport pour un envoi ultérieur ou pour le copier ailleurs
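
The capping, shuffling and splitting steps from Task 1 can also be written as one small helper. The following is only a sketch: the function name `split_parallel`, the fixed seed and the 80/10/10 ratios are assumptions made for illustration and are not prescribed by the assignment. It shuffles the sentence pairs without replacement, so no pair can end up in more than one set.

```python
import random

def split_parallel(src_lines, tgt_lines, limit=10000, ratios=(0.8, 0.1, 0.1), seed=0):
    """Cap, shuffle and split aligned sentence lists into train/dev/test."""
    assert len(src_lines) == len(tgt_lines), "source and target must be aligned"
    pairs = list(zip(src_lines, tgt_lines))[:limit]
    rng = random.Random(seed)
    rng.shuffle(pairs)  # a permutation: every pair is used exactly once
    n_train = int(len(pairs) * ratios[0])
    n_dev = int(len(pairs) * ratios[1])
    train = pairs[:n_train]
    dev = pairs[n_train:n_train + n_dev]
    test = pairs[n_train + n_dev:]
    return train, dev, test

# toy aligned corpus standing in for the OPUS files
src = [f"sentence {i}" for i in range(20)]
tgt = [f"phrase {i}" for i in range(20)]
train, dev, test = split_parallel(src, tgt)
print(len(train), len(dev), len(test))
```

Keeping source and target zipped together while shuffling guarantees that the alignment between the two languages survives the split.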


**Task 2 (5 marks)** 

---

* Add '<bof\>' to the beginning and '<eos\>' to the end of each target line to mark where the sentence starts and ends.
* Perform the pre-processing step of the text.
* Print the last 5 sentences of the preprocessed text.
* Print statistics on the selected dataset:
  * Number of samples
  * Number of unique source language tokens
  * Number of unique target language tokens
  * Max sequence length of source language
  * Max sequence length of target language
  * Source Vocabulary
  * Target Vocabulary





In [None]:
import re

source, target, target_input = list(), list(), list()
source_words, target_words = list(), list()

# taking first 30 chars from each line for faster processing
for i in range(len(train_x)):
    eng = train_x[i][:30]
    frc =  train_y[i][:30] + ' <eos>'
    output_sentence_input = '<bof> ' + train_y[i][:30]
    eng = re.sub(r'[^\w\s]', '', eng)
    frc = re.sub(r'[^\w\s]', '', frc)
    source.append(eng)
    target.append(frc)
    for e in eng.split(" "):
      source_words.append(e)
    for t in frc.split(" "):
      target_words.append(t)
    target_input.append(output_sentence_input)
    
    
max_source = max([len(txt.split(" ")) for txt in source])
max_target = max([len(txt.split(" ")) for txt in target])

vocab_source = len(source)
vocab_target = len(target)

print("Last 5 sentences are : \n")
print(target[-5:], "\n")

print("Statistics are : \n")
print(f"Number of samples {len(source)}")
print(f"Number of unique source language tokens is {len(set(source_words))}")
print(f"Number of unique target language tokens is {len(set(target_words))}")
print(f"Max sequence length of source language is {max_source}")
print(f"Max sequence length of target language is {max_target}")
print(f"Source Vocabulary is {len(source_words)}")
print(f"Target Vocabulary is {len(target_words)}")

Last 5 sentences are : 

['machines eos', 'Redémarrage dopenbsdinetd eos', 'Color name eos', 'Installer le micrologiciel eos', 'La clef avec lempreinte s n eos'] 

Statistics are : 

Number of samples 6000
Number of unique source language tokens is 3579
Number of unique target language tokens is 3943
Max sequence length of source language is 8
Max sequence length of target language is 10
Source Vocabulary is 19970
Target Vocabulary is 27643
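
The figures labelled "Source Vocabulary" and "Target Vocabulary" above come from `len(source_words)` and `len(target_words)`, so they count every token occurrence rather than distinct words. As a hedged illustration of how the token count, the vocabulary and the longest sentence relate to each other, here is a minimal sketch on a toy corpus; the helper name `corpus_stats` and the toy sentences are assumptions, not part of the notebook.

```python
def corpus_stats(sentences):
    """Collect basic statistics for whitespace-tokenised sentences."""
    tokens = [tok for sent in sentences for tok in sent.split()]
    vocabulary = sorted(set(tokens))  # distinct words only
    max_len = max((len(sent.split()) for sent in sentences), default=0)
    return {
        "samples": len(sentences),
        "tokens": len(tokens),              # every occurrence counted
        "unique_tokens": len(vocabulary),   # size of the vocabulary
        "max_len": max_len,
        "vocabulary": vocabulary,
    }

toy = ["color name", "install the firmware", "color depth"]
stats = corpus_stats(toy)
print(stats["tokens"], stats["unique_tokens"], stats["max_len"])
print(stats["vocabulary"])
```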


**Task 3 (5 marks)** 

---

*  Assign each unique word an integer value (5 marks).
*  Create word embeddings for your vocabulary using pre-trained embeddings, for example GloVe or fastText (10 marks) (https://nlp.stanford.edu/projects/glove/ , https://fasttext.cc/docs/en/english-vectors.html)
* Print the first line of the embeddings (see below) 

In [None]:
## initializing required values

BATCH_SIZE = 128
EPOCHS = 20
LSTM_NODES = 512
NUM_SENTENCES = 20000
MAX_SENTENCE_LENGTH = 10
MAX_NUM_WORDS = 20000
EMBEDDING_SIZE = 100

In [None]:
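## assigning each unique word (not each sentence) an integer, starting at 1 so that 0 stays free for padding
input_token_index = {word: i for i, word in enumerate(sorted(set(source_words) - {''}), start=1)}
target_token_index = {word: i for i, word in enumerate(sorted(set(target_words) - {''}), start=1)}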

In [None]:
## tokenizing the text using Tokenizer from keras

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# tokenizing input sentences
MAX_NUM_WORDS = 20000
source_tokenizer = Tokenizer(num_words=MAX_NUM_WORDS, filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n', oov_token="UNK")
source_tokenizer.fit_on_texts(source)
input_integer_seq = source_tokenizer.texts_to_sequences(source)

word2idx_inputs = source_tokenizer.word_index
print('Total unique words in the input: %s' % len(word2idx_inputs))

max_input_len = max(len(sen) for sen in input_integer_seq)
print("Length of longest sentence in input: %g" % max_input_len)

# tokenizing output sentences
target_tokenizer = Tokenizer(num_words=MAX_NUM_WORDS, filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')
target_tokenizer.fit_on_texts(target + target_input)
output_integer_seq = target_tokenizer.texts_to_sequences(target)
output_input_integer_seq = target_tokenizer.texts_to_sequences(target_input)

word2idx_outputs = target_tokenizer.word_index
print('Total unique words in the output: %s' % len(word2idx_outputs))

num_words_output = len(word2idx_outputs) + 1
max_out_len = max(len(sen) for sen in output_integer_seq)
print("Length of longest sentence in the output: %g" % max_out_len)

Total unique words in the input: 2755
Length of longest sentence in input: 8
Total unique words in the output: 3832
Length of longest sentence in the output: 10


In [None]:
# padding the sentences so all sentences are of equal length

encoder_input_sequences = pad_sequences(input_integer_seq, maxlen=MAX_SENTENCE_LENGTH)
print("encoder_input_sequences.shape:", encoder_input_sequences.shape)
print("encoder_input_sequences[172]:", encoder_input_sequences[172])

encoder_input_sequences.shape: (6000, 10)
encoder_input_sequences[172]: [  0   0   0   0   0  27 209 321  87   3]


In [None]:
# padding the sentences so all sentences are of equal length

decoder_input_sequences = pad_sequences(output_integer_seq, maxlen=max_out_len, padding='post')
print("decoder_input_sequences.shape:", decoder_input_sequences.shape)
print("decoder_input_sequences[172]:", decoder_input_sequences[172])

decoder_input_sequences.shape: (6000, 10)
decoder_input_sequences[172]: [ 38  23   3   5 169   9   1   0   0   0]
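
The two `pad_sequences` calls above pad on different sides: the encoder sequences use the default pre-padding, while the decoder sequences pass `padding='post'`. The sketch below mimics that behaviour in plain Python to make the difference visible; the helper `pad` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
def pad(seq, maxlen, padding="pre", value=0):
    """Pad a list of ints to exactly maxlen items (maxlen >= 1)."""
    seq = list(seq)[-maxlen:]  # too long: keep the last maxlen items, like truncating='pre'
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == "pre" else seq + fill

# the same token ids as in the two printed examples above
print(pad([27, 209, 321, 87, 3], 10))                       # zeros in front
print(pad([38, 23, 3, 5, 169, 9, 1], 10, padding="post"))  # zeros at the end
```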


In [None]:
# creating embedding from Glove

from numpy import asarray
from numpy import zeros

embeddings_dictionary = dict()

# each line holds a word followed by its vector components
with open(r'glove.6B.100d.txt', encoding="utf8") as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        vector_dimensions = asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions

In [None]:
# looking up each word's embedding in the GloVe dictionary

num_words = min(MAX_NUM_WORDS, len(source) + 1)
embedding_matrix = zeros((num_words, EMBEDDING_SIZE))
# rows must follow the tokenizer's indices, since those are what the Embedding layer receives
for word, index in word2idx_inputs.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None and index < num_words:
        embedding_matrix[index] = embedding_vector

In [None]:
print("Printing first line of embedding")
print(embedding_matrix[1])

Printing first line of embedding
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.]
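
A row of the embedding matrix stays at zero whenever no pre-trained vector was found for the key being looked up. To make that lookup explicit, the sketch below builds a small matrix in plain Python and records which words had no vector. The helper name `build_embedding_matrix`, the toy word index and the two-dimensional toy vectors are all assumptions made for illustration.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; row 0 is left for padding."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    missing = []
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is None:
            missing.append(word)  # no pre-trained vector: the row stays zero
        else:
            matrix[idx] = list(vec)
    return matrix, missing

# toy stand-ins for a tokenizer word index and a loaded GloVe dictionary
word_index = {"UNK": 1, "file": 2, "report": 3}
vectors = {"file": [0.1, 0.2], "report": [0.3, 0.4]}
matrix, missing = build_embedding_matrix(word_index, vectors, dim=2)
print(matrix)
print("words without a vector:", missing)
```

Printing the list of missing words is a cheap way to check how much of the vocabulary the pre-trained embeddings actually cover.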


In [None]:
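## creating one-hot encoding for the decoder targets
# decoder input: '<bof>' followed by the sentence; decoder target: the sentence followed by 'eos'
# feeding the same sequence as input and target would only teach the model to copy its input
decoder_input_sequences = pad_sequences(output_input_integer_seq, maxlen=max_out_len, padding='post', truncating='post')
decoder_target_sequences = pad_sequences(output_integer_seq, maxlen=max_out_len, padding='post', truncating='post')

decoder_targets_one_hot = np.zeros((
        len(target),
        max_out_len,
        num_words_output
    ),
    dtype='float32'
)

for i, d in enumerate(decoder_target_sequences):
    for t, word in enumerate(d):
        decoder_targets_one_hot[i, t, word] = 1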
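
With teacher forcing, the decoder input is the target sentence preceded by the start token, and the decoder target is the same sentence shifted one step to the left and closed by the end token. The sketch below shows that shift and the one-hot encoding on a toy example in plain Python; the helper names and the token ids (1 for the start token, 2 for the end token) are assumptions chosen for illustration.

```python
def teacher_forcing_pair(token_ids, bos_id, eos_id, maxlen, pad_id=0):
    """Build (decoder_input, decoder_target) for one sentence."""
    decoder_input = ([bos_id] + token_ids)[:maxlen]   # starts with the start token
    decoder_target = (token_ids + [eos_id])[:maxlen]  # same tokens, one step ahead
    def pad(seq):
        return seq + [pad_id] * (maxlen - len(seq))
    return pad(decoder_input), pad(decoder_target)

def one_hot(ids, vocab_size):
    """One row per time step with a single 1.0 at the token id."""
    return [[1.0 if j == i else 0.0 for j in range(vocab_size)] for i in ids]

dec_in, dec_out = teacher_forcing_pair([5, 7, 9], bos_id=1, eos_id=2, maxlen=5)
print(dec_in)   # [1, 5, 7, 9, 0]
print(dec_out)  # [5, 7, 9, 2, 0]
targets = one_hot(dec_out, vocab_size=10)
print(targets[3])  # the step whose target is the end token
```

At every position the target is the token that follows the corresponding input token, which is what lets the decoder learn next-word prediction.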

## Section 2 - Translation Model Training



---



**Task 4 (15 marks)**
* Provide code for the encoder using Keras LSTM (5 marks)
* Provide code for the decoder using Keras LSTM (5 marks)
* Train the sequence2sequence (encoder-decoder) model (5 marks) 


In [None]:
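## creating the embedding layer
# input_length must match the padded encoder length (MAX_SENTENCE_LENGTH), not the longest raw sentence
embedding_layer = Embedding(num_words, EMBEDDING_SIZE, weights=[embedding_matrix], input_length=MAX_SENTENCE_LENGTH)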

In [None]:
# encoder code goes here
# creating the encoder
encoder_inputs_placeholder = Input(shape=(MAX_SENTENCE_LENGTH,))
x = embedding_layer(encoder_inputs_placeholder)
# adding dropout to prevent overfitting
encoder = LSTM(LSTM_NODES, dropout=0.4, recurrent_dropout=0.4, return_state=True)
encoder_outputs, h, c = encoder(x)
state = [h, c]
encoder_states = state # keeping the encoder states for future

In [None]:
# decoder code goes here
# creating the decoder

decoder_inputs_placeholder = Input(shape=(max_out_len,))

decoder_embedding = Embedding(num_words_output, LSTM_NODES)
decoder_inputs_x = decoder_embedding(decoder_inputs_placeholder)

# adding dropout to prevent overfitting
decoder_lstm = LSTM(LSTM_NODES, dropout=0.4, recurrent_dropout=0.4, return_sequences=True, return_state=True)

decoder_outputs, _, _ = decoder_lstm(decoder_inputs_x, initial_state=encoder_states) # keeping decoder outputs for future

In [None]:
# creating the dense layer for final layer of model

decoder_dense = Dense(num_words_output, activation='softmax')(decoder_outputs)
# decoder_outputs = decoder_dense(decoder_outputs)

In [None]:
# creating the model 
model= Model([encoder_inputs_placeholder,
  decoder_inputs_placeholder], decoder_dense,
)
model.compile(optimizer='rmsprop',loss='categorical_crossentropy', metrics= ['accuracy'])
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 10)]         0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 10)]         0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 10, 100)      600100      input_1[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 10, 512)      1962496     input_2[0][0]                    
______________________________________________________________________________________________

In [None]:
# training the model using the values declared above
model.fit(
    [encoder_input_sequences, decoder_input_sequences],
    decoder_targets_one_hot,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_split=0.3,
)
model.save('seq2seq_source_target.h5')

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [None]:
# creating the encoder model with the inputs and the states
encoder_model = Model(encoder_inputs_placeholder, encoder_states)

In [None]:
# creating the decoder states

decoder_state_input_h = Input(shape=(LSTM_NODES,))
decoder_state_input_c = Input(shape=(LSTM_NODES,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_inputs_single = Input(shape=(1,))
decoder_inputs_single_x = decoder_embedding(decoder_inputs_single)



In [None]:
# creating the decoder model

decoder_outputs, h, c = decoder_lstm(decoder_inputs_single_x, initial_state=decoder_states_inputs)
decoder_states = [h, c]
# reusing the trained output layer (the last layer of the training model);
# a fresh Dense layer here would have random, untrained weights
decoder_outputs = model.layers[-1](decoder_outputs)


decoder_model = Model(
    [decoder_inputs_single] + decoder_states_inputs,
    [decoder_outputs] + decoder_states
)

## Section 3 - Testing

---

**Task 5 (20 marks)**

* Use the trained model to translate the text from the source into the target language (10 marks). 
* Use the test/evaluation set (see Section 1) and perform an automatic evaluation with the BLEU metric (10 marks). 
You can use the NLTK library to calculate BLEU.

In [None]:
# creating dictionary for index to words
indextoword_input = {v:k for k, v in word2idx_inputs.items()}
indextoword_target = {v:k for k, v in word2idx_outputs.items()}

In [None]:
""" Below method takes the input sequence and tries to translate the sequence to target language"""

def decode_sentence(input_seq):
    states_value = encoder_model.predict(input_seq)
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = word2idx_outputs['<bof>']
    eos = word2idx_outputs['eos']
    output_sentence = []

    for _ in range(max_out_len):
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        idx = np.argmax(output_tokens[0, 0, :])

        if eos == idx:
            break

        word = ''

        if idx > 0:
            word = indextoword_target[idx]
            output_sentence.append(word)

        target_seq[0, 0] = idx
        states_value = [h, c]

    return ' '.join(output_sentence)
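
The loop in `decode_sentence` is a greedy decoder: at each step it feeds the previously chosen token back in, takes the highest-scoring token, and stops at the end token or at the maximum length. The sketch below isolates that control flow from Keras. The `toy_step` function is a hypothetical stand-in for the decoder model, and the token ids are made up for illustration.

```python
def greedy_decode(step_fn, initial_state, bos_id, eos_id, max_len):
    """step_fn(prev_id, state) -> (scores, new_state); always pick the best-scoring token."""
    prev_id, state, output = bos_id, initial_state, []
    for _ in range(max_len):
        scores, state = step_fn(prev_id, state)
        prev_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if prev_id == eos_id:
            break
        output.append(prev_id)
    return output

# hypothetical decoder: always prefers token (prev_id + 1); id 4 acts as the end token
def toy_step(prev_id, state):
    scores = [0.0] * 5
    scores[min(prev_id + 1, 4)] = 1.0
    return scores, state

print(greedy_decode(toy_step, None, bos_id=1, eos_id=4, max_len=10))
```

Greedy search is the simplest choice; it commits to one token per step and never revisits earlier decisions.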

In [None]:
# taking test data and selecting first 30 chars
test_x_processed = list()

for i in range(len(test_x)):
  e = test_x[i][:30]
  test_x_processed.append(e)

In [None]:
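## preprocessing test data so it could be used to predict using the model
# the tokenizer is deliberately not re-fitted here: fitting on test data would reorder the
# word indices the model was trained on; unseen words map to the "UNK" token instead
input_integer_seq = source_tokenizer.texts_to_sequences(test_x_processed)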

word2idx_inputs = source_tokenizer.word_index
print('Total unique words in the input: %s' % len(word2idx_inputs))

max_input_len = max(len(sen) for sen in input_integer_seq)
print("Length of longest sentence in input: %g" % max_input_len)

test_encoder_input_sequences = pad_sequences(input_integer_seq, maxlen=MAX_SENTENCE_LENGTH)


Total unique words in the input: 2982
Length of longest sentence in input: 9


In [None]:
# translating sequences from test using the model
normal_actual_sentence, normal_predicted_sentence = list(), list() 
for i in range(len(test_encoder_input_sequences)):
    input_seq = test_encoder_input_sequences[i:i+1]
    translation = decode_sentence(input_seq)
    normal_actual_sentence.append(test_y[i])
    normal_predicted_sentence.append(translation)

In [None]:
print(normal_actual_sentence[:3],"\n" ,normal_predicted_sentence[:3])

['Auto-frappes', 'Quitter', 'Le pair a été trouvé par échange de pairs (PEX)'] 
 ['lobtention review débog six six lautofrappe fot v fot fot', 'aut détails vérifier radioactif radioactif retour répétition lexécut répétition lexécut', 'cou mio communauté communauté raid téléchargement soutien luminécran os luminécran']


In [None]:
# calculating the bleu score
import nltk
# corpus_bleu expects tokenised text: one list of reference token lists per hypothesis
references = [[ref.lower().split()] for ref in normal_actual_sentence]
hypotheses = [hyp.split() for hyp in normal_predicted_sentence]
BLEUscore = nltk.translate.bleu_score.corpus_bleu(references, hypotheses, weights=[0.5])
print(round(BLEUscore, 2))
BLEUscore = nltk.translate.bleu_score.corpus_bleu(references, hypotheses, weights=[0.25])
print(round(BLEUscore, 2))
BLEUscore = nltk.translate.bleu_score.corpus_bleu(references, hypotheses, weights=[1])
print(round(BLEUscore, 2))

0.38
0.62
0.15
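
BLEU is defined over tokens, so both references and hypotheses have to be split into word lists before scoring; NLTK's `corpus_bleu` takes a list of reference lists (each reference a token list) and a list of hypothesis token lists. To show what a unigram score measures, the sketch below computes corpus-level BLEU-1 by hand for the simplified case of exactly one reference per hypothesis. The function name `corpus_bleu1` and the toy sentences are assumptions for illustration; it is not a replacement for the NLTK implementation.

```python
import math
from collections import Counter

def corpus_bleu1(references, hypotheses):
    """Corpus-level unigram BLEU with one tokenised reference per hypothesis."""
    clipped, total, ref_len = 0, 0, 0
    for ref, hyp in zip(references, hypotheses):
        ref_counts, hyp_counts = Counter(ref), Counter(hyp)
        # a hypothesis word only counts as often as it occurs in the reference
        clipped += sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())
        total += len(hyp)
        ref_len += len(ref)
    if total == 0 or clipped == 0:
        return 0.0
    # the brevity penalty punishes hypotheses shorter than their references
    brevity = 1.0 if total >= ref_len else math.exp(1 - ref_len / total)
    return brevity * clipped / total

refs = ["quitter".split(), "installer le micrologiciel".split()]
hyps = ["quitter".split(), "installer le logiciel".split()]
print(round(corpus_bleu1(refs, hyps), 2))
```

In this toy case three of the four hypothesis words appear in their references and the lengths match, so the score is simply the clipped precision.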


## Section 4 - Attention

---



**Task 5 (40 Marks)** Sequence2Sequence

* Extend the existing Seq2Seq model with an attention mechanism [Discussed in Class]
* Create sequence2sequence model with attention (15 marks)
* Train the model with the same data from Section 1 (10 marks)
* Translate the evaluation set using the sequence2sequence attention model (10 marks)
* Evaluate the translations made with the sequence2sequence attention model and compare it with the model without attention using BLEU (5 marks)

In [None]:
# encoder code goes here
# creating encoder for attention model
encoder_inputs_placeholder = Input(shape=(MAX_SENTENCE_LENGTH,))
x = embedding_layer(encoder_inputs_placeholder)
encoder = LSTM(LSTM_NODES, dropout=0.2, recurrent_dropout=0.2, return_state=True, return_sequences = True)
encoder_stack_h, h, c = encoder(x)
state = [h, c]
encoder_states = state

In [None]:
# creating decoder model
decoder_inputs_placeholder = Input(shape=(max_out_len,))

decoder_embedding = Embedding(num_words_output, LSTM_NODES)
decoder_inputs_x = decoder_embedding(decoder_inputs_placeholder)

decoder_lstm = LSTM(LSTM_NODES, dropout=0.2, recurrent_dropout=0.2, return_sequences=True, return_state=True)

decoder_stack_h, _, _ = decoder_lstm(decoder_inputs_x, initial_state=encoder_states)

In [None]:
# creating attention mechanism

attention = dot([decoder_stack_h, encoder_stack_h], axes=[2, 2])
attention = Activation('softmax')(attention)
context = dot([attention, encoder_stack_h], axes=[2,1])
decoder_outputs = concatenate([context, decoder_stack_h])
decoder_dense = Dense(num_words_output, activation='softmax')(decoder_outputs)
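
The attention cell above scores each encoder output against the decoder output with a dot product, normalises the scores with a softmax, and uses the resulting weights to form a weighted sum of the encoder outputs (the context vector). The sketch below reproduces those three steps for a single decoder step in plain Python. The function name `dot_attention` and the tiny two-dimensional vectors are assumptions made for illustration.

```python
import math

def dot_attention(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step using dot-product scores."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    weights = [x / norm for x in exps]  # softmax over the source positions
    dim = len(encoder_states[0])
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = dot_attention([2.0, 0.0], encoder_states)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The first and third encoder states align with the decoder state and therefore receive most of the weight; the context vector is then concatenated with the decoder output before the final softmax layer, as in the cell above.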

In [None]:
# creating the model

model= Model([encoder_inputs_placeholder,
  decoder_inputs_placeholder], decoder_dense,
)
model.compile(optimizer='rmsprop',loss='categorical_crossentropy', metrics= ['accuracy'])
model.summary()

In [None]:
# training the model
model.fit(
    [encoder_input_sequences, decoder_input_sequences],
    decoder_targets_one_hot,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_split=0.3,
)
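# saving under a separate (illustrative) file name so the model without attention is not overwritten
model.save('seq2seq_attention_source_target.h5')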

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [None]:
# creating the encoder model; it also returns the per-step encoder outputs that attention needs
attention_encoder_model = Model(encoder_inputs_placeholder, [encoder_stack_h] + encoder_states)

In [None]:
# creating decoder inputs: the previous LSTM states and the encoder outputs to attend over
decoder_state_input_h = Input(shape=(LSTM_NODES,))
decoder_state_input_c = Input(shape=(LSTM_NODES,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
encoder_outputs_input = Input(shape=(MAX_SENTENCE_LENGTH, LSTM_NODES))

In [None]:
decoder_inputs_single = Input(shape=(1,))
decoder_inputs_single_x = decoder_embedding(decoder_inputs_single)

In [None]:
# creating decoder model for attention

decoder_outputs, h, c = decoder_lstm(decoder_inputs_single_x, initial_state=decoder_states_inputs)
decoder_states = [h, c]
# repeating the attention computation from training for a single decoding step
attention_single = dot([decoder_outputs, encoder_outputs_input], axes=[2, 2])
attention_single = Activation('softmax')(attention_single)
context_single = dot([attention_single, encoder_outputs_input], axes=[2, 1])
decoder_combined = concatenate([context_single, decoder_outputs])
# reusing the trained output layer (the last layer of the training model);
# a fresh Dense layer here would have random, untrained weights
decoder_dense = model.layers[-1](decoder_combined)
attention_decoder_model = Model(
    [decoder_inputs_single, encoder_outputs_input] + decoder_states_inputs,
    [decoder_dense] + decoder_states
)

In [None]:
""" Below method takes an input sequence and translates it to target language"""

def decode_sentence(input_seq):
    # the encoder returns its per-step outputs (for attention) and its final states
    encoder_outputs_value, h, c = attention_encoder_model.predict(input_seq)
    states_value = [h, c]
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = word2idx_outputs['<bof>']
    eos = word2idx_outputs['eos']
    output_sentence = []

    for _ in range(max_out_len):
        output_tokens, h, c = attention_decoder_model.predict([target_seq, encoder_outputs_value] + states_value)
        idx = np.argmax(output_tokens[0, 0, :])

        if eos == idx:
            break

        word = ''

        if idx > 0:
            word = indextoword_target[idx]
            output_sentence.append(word)

        target_seq[0, 0] = idx
        states_value = [h, c]

    return ' '.join(output_sentence)

In [None]:

# translating sequences from test using the model
actual_sentence, predicted_sentence = list(), list() 
for i in range(len(test_encoder_input_sequences)):
    input_seq = test_encoder_input_sequences[i:i+1]
    translation = decode_sentence(input_seq)
    actual_sentence.append(test_y[i])
    predicted_sentence.append(translation)

In [None]:
# calculating bleu score using the attention model

import nltk
# corpus_bleu expects tokenised text: one list of reference token lists per hypothesis
references = [[ref.lower().split()] for ref in actual_sentence]
hypotheses = [hyp.split() for hyp in predicted_sentence]
BLEUscore = nltk.translate.bleu_score.corpus_bleu(references, hypotheses, weights=[0.5])
print(round(BLEUscore, 2))
BLEUscore = nltk.translate.bleu_score.corpus_bleu(references, hypotheses, weights=[0.25])
print(round(BLEUscore, 2))
BLEUscore = nltk.translate.bleu_score.corpus_bleu(references, hypotheses, weights=[1])
print(round(BLEUscore, 2))

0.38
0.61
0.14


The BLEU score has not changed much after implementing the attention model; in our case the attention model performs slightly worse. Possible reasons are too little training data or too few training epochs.

The model is overfitted in both cases. This could be one of the reasons for the poor results.

The BLEU score of the model without attention with weight = 1 is 0.15.

The BLEU score of the attention model with weight = 1 is 0.14.