# Ronan Murphy

## 15/04/2020

# Neural Machine Translation

Developed a neural machine translation (NMT) system to translate text from English to French. Choose french.txt data, as it contains english and french version of phrases, to train the models, perform data processing and train a sequence2sequence neural model.


## Section 1- Data Collection and Preprocessing 


---


In [None]:
#download language pair from (http://www.manythings.org/anki/)
from google.colab import files
files.upload()

In [0]:
import pandas as pd
import random
from sklearn.model_selection import train_test_split

#read the file and split into lines using pandas
filename="fra.txt"
dataset = pd.read_csv(filename, sep = '\t',header=None)
#remove the 2nd column containing copyright information
dataset = dataset.drop(dataset.columns[2], axis=1)

In [7]:
#convert dataframe back to a list
lines = dataset.values.tolist()
#print the number of sentences
print(len(lines))
#print the 100th sentence
print(lines[100])

#take a random sample of 10,000 from the original file
lines = random.sample(lines, 10000)

#split data into training and test when model is being built - section 4

175623
['Come in.', 'Entre.']




* Add '<bof\>' to denote beginning of sentence and '<eos\>' to denote the end of the sentence to each target line.
* Preprocess (word tokenisation, lowercasing) the text.

In [8]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [10]:
from nltk.tokenize import word_tokenize
import re
import string
import unicodedata
#create 3 arrays for input english, french input and french output
input_texts = []
target_inputs = []
target_outputs = []

#code to convert unicode characters to ascii
def unicode_to_ascii(s):
  return ''.join(c for c in unicodedata.normalize('NFD', s)
      if unicodedata.category(c) != 'Mn')

#preprocessing stage for each line in the dataset
for line in lines:
  #strip, lower and convert to ascii, removing punctuation from both english and french text
  input_text = unicode_to_ascii(line[0].lower().strip())
  input_text = input_text.translate(str.maketrans('', '', string.punctuation))
  
  target_text = unicode_to_ascii(line[1].lower().strip())
  target_text = target_text.translate(str.maketrans('', '', string.punctuation))
  
  #create different sets for input and output so model will know when data starts and ends
  target_input = "<bof>" + target_text
  target_output = target_text+"<eos>"
  
  #add each sentence to list
  input_texts.append(input_text)
  target_inputs.append(target_input)
  target_outputs.append(target_output)
  
#print length and 142nd value of each, all length 10,000 and sentence should match
print(len(input_texts))
print(len(target_outputs))
print(len(target_inputs))
print(input_texts[142])
print(target_outputs[142])
print(target_inputs[142])

10000
10000
10000
we came back to camp before dark
nous sommes revenues au camp avant la nuit<eos>
<bof>nous sommes revenues au camp avant la nuit


In [0]:
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
#convert sentences to word tokens and find length of vocab and max sentence length for input and output
input_tokenizer = Tokenizer()
input_tokenizer.fit_on_texts(input_texts)
Source_vocabulary = input_tokenizer.word_index.keys()

num_source_tokens = len(input_tokenizer.word_index)
max_source_seq_length = max(len(text_to_word_sequence(x)) for x in input_texts)


output_tokenizer = Tokenizer()
output_tokenizer.fit_on_texts(target_inputs + target_outputs)
target_vocabulary = output_tokenizer.word_index.keys()

num_target_tokens = len(output_tokenizer.word_index)
max_target_seq_length = max(len(text_to_word_sequence(y)) for y in target_inputs)


In [None]:
print('Number of samples:', len(input_texts))
print('Number of unique source language tokens:', num_source_tokens)
print('Number of unique target language tokens:', num_target_tokens)
print('Max sequence length of source language:', max_source_seq_length)
print('Max sequence length of target language:', max_target_seq_length)
print("Source Vocabulary",Source_vocabulary)
print("Target Vocabulary",target_vocabulary)

*  Assign each unique word an integer value 
*  Create word embedding for your vocabulary using pre-trained Glove embeddings 
* Print the first line of the embeddings 

In [0]:
#convert unique input words to interger values
num_words_output = num_target_tokens + 1
input_int = input_tokenizer.texts_to_sequences(input_texts)
target_output_int = output_tokenizer.texts_to_sequences(target_outputs)
target_input_int = output_tokenizer.texts_to_sequences(target_inputs)


In [14]:
#pad the sequences to the same length
encoder_input = pad_sequences(input_int, maxlen=max_source_seq_length)
print("endocder input shape", encoder_input.shape)

#add padding for encoder after the sentence integers
decoder_input = pad_sequences(target_input_int, maxlen=max_target_seq_length, padding='post')
print("decoder input shape", decoder_input.shape)


decoder_output = pad_sequences(target_output_int, maxlen=max_target_seq_length, padding='post')
print("decoder output shape", decoder_output.shape)

endocder input shape (10000, 34)
decoder input shape (10000, 34)
decoder output shape (10000, 34)


In [17]:
#upload glove txt file for 100 dimensional vectors for each token
from google.colab import files
files.upload()

Saving glove.6B.100d.txt to glove.6B.100d.txt


In [0]:
#word embeddings from integers to vectors using Glove
from numpy import array
from numpy import asarray
from numpy import zeros

embedding = dict()
glove_file="glove.6B.100d.txt"

g = open(glove_file , 'r')

#create a dictionary for embedding mapping each value to an array of similar vectors 
for l in g:
    vects = l.split()
    w = vects[0]
    vect_dim = asarray(vects[1:], dtype='float32')
    embedding[w] = vect_dim

g.close()

In [0]:
#create the weights for the embedding layer from the glove model
words = min(10000, num_source_tokens+1)
embed_mat = zeros((words, 100))
for value, position in input_tokenizer.word_index.items():
    embedding_vect = embedding.get(value)
    if embedding_vect is not None:
        embed_mat[position] = embedding_vect



In [0]:
from keras.layers import Embedding
#create embedding layer
layer_embedding = Embedding(words, 100, weights=[embed_mat], input_length=max_source_seq_length)

In [21]:
print(embed_mat.shape)

(4754, 100)


In [22]:
#print first line of embeddings
print(embedding['i'])
print(embed_mat[1])

[-0.046539   0.61966    0.56647   -0.46584   -1.189      0.44599
  0.066035   0.3191     0.14679   -0.22119    0.79239    0.29905
  0.16073    0.025324   0.18678   -0.31001   -0.28108    0.60515
 -1.0654     0.52476    0.064152   1.0358    -0.40779   -0.38011
  0.30801    0.59964   -0.26991   -0.76035    0.94222   -0.46919
 -0.18278    0.90652    0.79671    0.24825    0.25713    0.6232
 -0.44768    0.65357    0.76902   -0.51229   -0.44333   -0.21867
  0.3837    -1.1483    -0.94398   -0.15062    0.30012   -0.57806
  0.20175   -1.6591    -0.079195   0.026423   0.22051    0.99714
 -0.57539   -2.7266     0.31448    0.70522    1.4381     0.99126
  0.13976    1.3474    -1.1753     0.0039503  1.0298     0.064637
  0.90887    0.82872   -0.47003   -0.10575    0.5916    -0.4221
  0.57331   -0.54114    0.10768    0.39784   -0.048744   0.064596
 -0.61437   -0.286      0.5067    -0.49758   -0.8157     0.16408
 -1.963     -0.26693   -0.37593   -0.95847   -0.8584    -0.71577
 -0.32343   -0.43121    0

## Section 2 Translation Model training



---



* Provide code for the encoder & decoder using Keras LSTM 
* Train the sequence2sequence (encoder-decoder) model 


In [0]:
import numpy as np
from keras.models import Model
from keras.layers import Input, LSTM, GRU, Dense 

#create one hot target output layer fill with zeros
decoder_targets_one_hot=np.zeros((len(input_texts), max_target_seq_length, num_words_output),dtype='float32')


In [24]:
print(decoder_targets_one_hot.shape)

(10000, 34, 7717)


In [0]:
#add values to one hot layer
for i, d in enumerate(decoder_output):
    for t, word in enumerate(d):
        decoder_targets_one_hot[i, t, word] = 1

In [0]:
# encoder layer
#set the values for LSTM nodes , batch size and epochs for model
nodes_lstm = 256
Batch_size = 64
epochs = 10


#define shape and inputs to encoder embedding layer, record output states as input to next layer
encoder_inputs = Input(shape=(max_source_seq_length,))

x = layer_embedding(encoder_inputs)
encoder_lstm = LSTM(nodes_lstm, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(x)
encoder_states = [state_h, state_c]



In [0]:
#decoder layer
#define shape and input into decoder embedding layer
decoder_inputs = Input(shape=(max_target_seq_length,))
decoder_embedding = Embedding(num_words_output, nodes_lstm)
decoder_inputs_x = decoder_embedding(decoder_inputs)
 
#define decoder of lstm which will be output is inputted to final dense layer
decoder_lstm = LSTM(nodes_lstm, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs_x, initial_state=encoder_states)



In [0]:
#decoder dense output layer, use softmax as activation 
decoder_dense = Dense(num_words_output, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

In [None]:
#create the model with encoder, decoder and decoder dense layers
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

#loss calculated using categoircal crossentrophy as not binary data
#Root mean square used to optimise, comparing accuracy over epochs
model.compile(optimizer='rmsprop',loss='categorical_crossentropy',metrics=['accuracy'])

#print model summary
model.summary()

In [0]:
#train-test split 85:15 for training and test for each layer training data inputted into model fit
from sklearn.model_selection import train_test_split
encoder_train, encoder_test, decoder_train, decoder_test,decoder_targets_train, decoder_targets_test = train_test_split(encoder_input, decoder_input,decoder_targets_one_hot, test_size=0.15, random_state=42)

In [None]:
#fit the model with training data, batch size 64 over 20 epochs with 80:20 split train to validation data
model.fit([encoder_train, decoder_train], decoder_targets_train, batch_size=Batch_size, epochs=epochs,validation_split=0.2)

#save the model as a h5 file 
model.save('seq2seq_source_target.h5')

In [0]:
#update encoder and decoder models for testing to allow predictions word by word
#encoder stays the same
encoder_model = Model(encoder_inputs, encoder_states)

decoder_h = Input(shape=(nodes_lstm,))
decoder_c = Input(shape=(nodes_lstm,))
decoder_states = [decoder_h, decoder_c]

#shape is changed to single decoders rather than full phrase, can iterate through and translate sentence
single_dec = Input(shape=(1,))
single_dec_x = decoder_embedding(single_dec)
decoder_out, h, c = decoder_lstm(single_dec_x, initial_state=decoder_states)

decode = [h, c]
decoder_out = decoder_dense(decoder_out)

decoder_model = Model([single_dec] + decoder_states,[decoder_out] + decode)

## Section 3 Testing

---

* Use the trained model to translate the text from the source into the target language 
* Use the test/evaluation set (see Section 1) and perform an automatic evaluation with the BLEU metric 

In [0]:
#convert integers back to words for source and target
#create dictionaries to convert them back
word_input = {value:key for key, value in input_tokenizer.word_index.items()}
word_target = {value:key for key, value in output_tokenizer.word_index.items()}

#method used to translate, converting the integer values of each sentence back to the word vectors
def translator(input):
    states = encoder_model.predict(input)
    target = np.zeros((1, 1))
    
    target[0, 0] = output_tokenizer.word_index['bof']
    eos = output_tokenizer.word_index['eos']
    out_sen = []

    for _ in range(max_target_seq_length):
        out_tok, h, c = decoder_model.predict([target] + states)
        index = np.argmax(out_tok[0, 0, :])

        if eos == index:
            break

        word = ''

        if index > 0:
            word = word_target[index]
            out_sen.append(word)

        target[0, 0] = index
        states = [h, c]

    return ' '.join(out_sen)

In [None]:
#BLEU evalution of test dataset

import nltk
from statistics import mean 
accuracy_overall = []
#iterate through all test set data and predict french translation
#record the accuracy for each sentence and add to list, get the mean value of this list 
#returns the accuracy of test data
for sentence in range (len(encoder_test)):
  input_sequence = encoder_input[sentence:sentence+1]
  translate = translator(input_sequence)
  print('English', input_texts[sentence])
  print('French', translate)
  print("------------------------------------------------")


  actual = target_outputs[sentence].split()
  predict = translate.split()
  actual = actual[:-1]
  #print(actual,predict)

  #use BLEU to score the actual vs predicted sets
  accuracy = nltk.translate.bleu_score.sentence_bleu([actual],predict)
  accuracy_overall.append(accuracy)

#return the accuracy from the average across all translations
test_accuracy = round(mean(accuracy_overall),2)
print("Test Score is: ", test_accuracy)


# Section 4 Add Attention to Model

---



Sequence2Sequence

* Extend the existing Seq2Seq model with an attention mechanism 
* Create sequence2sequence model with attention 
* Train the model with the same data from Section 1 
* Translate the evaluation set using the sequence2sequence attention model 
* Evaluate the translations made with the sequence2sequence attention model and compare it with the model without attention using BLEU 

In [0]:
# Added keep encoder and decoder the same without Glove training
#Embedding and LSTM layer for each take outputs of encoder as input to decoder


#define shape and inputs to encoder embedding layer, record output states as input to next layer
encoder_inputs = Input(shape=(max_source_seq_length,))
embedding = Embedding(words, 100, weights=[embed_mat], mask_zero=True)
x = embedding(encoder_inputs)

encoder_lstm = LSTM(nodes_lstm, return_state=True)
encoder= encoder_lstm(x)
encoder_seq, state_h, state_c = encoder
encoder_states = [state_h, state_c]

In [0]:
# Decoder layer
#take outputs of encoder as inputs

decoder_inputs = Input(shape=(max_target_seq_length,))
decoder_embedding = Embedding(num_words_output, nodes_lstm)
decoder_embedded = decoder_embedding(decoder_inputs)

decoder_lstm = LSTM(nodes_lstm, return_sequences=True, return_state=True)
decoder= decoder_lstm(decoder_embedded, initial_state=encoder_states)



In [None]:
#Create attention layer
from keras.layers import Input, Permute, Activation, Embedding, Dense, LSTM, concatenate, dot, BatchNormalization
#use keras dot of the encoder and decoder with softmax activation to get attention input 
attention = Activation('softmax')(dot([decoder, encoder], axes=[2,2]))
#join the decoder and the attention layer to create dense output 
att = dot([attention, encoder], axes=[2,1]) 
join = Concatenate()([att, decoder])
#dense output layer with relu activaion function 
attention_dense = Dense(256, activation='relu')
attention_out = attention_dense(join)
#output of the layer 
out = Dense(num_words_output, activation="softmax")(attention_out)

#**** This code is giving an error which I couldnt solve, saying the input to the layer is of type list rather than symbolic tensor, 
#didnt have time to try another method***

#the remaining code is the approach used if I was able to create this layer

In [None]:
#create 2nd model with attention the encoder and decoder contain the main layers fed into a dense attention layer output
model2 = Model([encoder_inputs, decoder_inputs], out)
#using root mean squared with categoritcal crossentrophy loss function
model2.compile(optimizer='rmsprop', loss='categorical_crossentropy')


In [None]:
#train the model with the datasets split, split validation for 80:20 over 10 epochs with batch size 64 same as previous model 
model2.fit([encoder_train, decoder_train], out, batch_size=64, epochs=10, validation_split=0.2)
#save the model as a h5 file 
model2.save('seq2seq_attention.h5')

In [0]:
#calcualte the accuracy of the model and predict each sentence using the translate method previously made
#would have to update this method converting model to model2 
#the accuracy is then compared with BLEU adding accuracy of each translation to a list and getting mean score
accuracy_att = []
#iterate through all test set data and predict french translation
for sentence in range (len(encoder_test)):
  input_sequence = encoder_input[sentence:sentence+1]
  translate = translator_att(input_sequence)
  #print('English', input_texts[sentence])
  #print('French', translate)


  actual = target_outputs[sentence].split()
  predict = translate.split()
  actual = actual[:-1]
  #print(actual,predict)

  #use BLEU to score the actual vs predicted sets
  accuracy = nltk.translate.bleu_score.sentence_bleu([actual],predict)
  accuracy_overall.append(accuracy)

#return the accuracy from the average across all translations
test_accuracy_att = round(mean(accuracy_overall),2)
print("Test Score for is: ", test_accuracy)

**CONCLUSION**

The attention layer improved the the overall accuracy of the predictions as it defines the most relevant part of the English sentence to translate, assigning more weight to this. The accuracy for training recieved in model with attention was 89% as well as validation sets.

The accuracy for training recieved in part 1 was 84% as well as validation, if the epochs were increased to 20 there could have been some improvement seen here. Another possibility would be to write a method which stops training once losses begin to increase.