## Machine Translation 

We want to build an end-to-end machine translation pipeline that translates English text into French, comparing several neural network architectures.

In [46]:
import os
import collections
import numpy as np

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional, Flatten, LSTM
from keras.layers import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy

### Load dataset
We use a dataset with a small vocabulary so that the models can be trained on a local machine. The data is partially preprocessed: punctuation has been delimited with spaces and all text is lowercase.

In [28]:
def load_data(path):
    """
    Load dataset
    """
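    input_file = os.path.join(path)
    # Assumes the files are UTF-8 encoded, so that accented French characters are decoded correctly
    with open(input_file, "r", encoding="utf-8") as f:
        data = f.read()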

    return data.split('\n')

In [52]:
# Load English data
english_sentences = load_data('data/small_vocab_en')
# Load French data
french_sentences = load_data('data/small_vocab_fr')

In [122]:
for sample_i in [0,1300]:
    print('small_vocab_en Line {}:  {}'.format(sample_i + 1, english_sentences[sample_i]))
    print('small_vocab_fr Line {}:  {}'.format(sample_i + 1, french_sentences[sample_i]))

small_vocab_en Line 1:  new jersey is sometimes quiet during autumn , and it is snowy in april .
small_vocab_fr Line 1:  new jersey est parfois calme pendant l' automne , et il est neigeux en avril .
small_vocab_en Line 1301:  our least favorite fruit is the peach , but your least favorite is the lemon .
small_vocab_fr Line 1301:  notre moins fruit préféré est la pêche , mais votre moins préféré est le citron .


In [31]:
# Count how often each word occurs, to get the total and the unique word counts
english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
french_words_counter = collections.Counter([word for sentence in french_sentences for word in sentence.split()])

print('{} English words.'.format(sum(english_words_counter.values())))
print('{} unique English words.'.format(len(english_words_counter)))
print('10 Most common words in the English dataset:')
print('"' + '" "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')
print()
print('{} French words.'.format(sum(french_words_counter.values())))
print('{} unique French words.'.format(len(french_words_counter)))
print('10 Most common words in the French dataset:')
print('"' + '" "'.join(list(zip(*french_words_counter.most_common(10)))[0]) + '"')

1823250 English words.
227 unique English words.
10 Most common words in the English dataset:
"is" "," "." "in" "it" "during" "the" "but" "and" "sometimes"

1961295 French words.
355 unique French words.
10 Most common words in the French dataset:
"est" "." "," "en" "il" "les" "mais" "et" "la" "parfois"


### Preprocessing
1) **Tokenize words to ids**

2) **Padding** at the end of each sentence so that all sequences of a language have the same length. The English sequences are padded further to the French length before they are fed to the models.

In [54]:
def tokenize(x):
    """
    Tokenize x
    :param x: List of sentences/strings to be tokenized
    :return: Tuple of (tokenized x data, tokenizer used to tokenize x)
    """
    # Create the tokenizer
    x_tk = Tokenizer()
    
    # Update the internal vocabulary based on the list of texts. This creates the vocabulary index based on word frequency
    x_tk.fit_on_texts(x)
    # Transform each text into a sequence of integers
    return x_tk.texts_to_sequences(x), x_tk
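
To make the id assignment concrete, here is a minimal pure-Python sketch of a frequency-ranked word index. It is a simplified stand-in for illustration, not the Keras implementation: `build_word_index` and `texts_to_ids` are hypothetical helper names, the sentences are made up, and punctuation filtering is omitted. Ids start at 1, which leaves 0 free for padding.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer id, most frequent word first (ids start at 1)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(sentences, word_index):
    """Replace every known word by its id; unknown words are skipped."""
    return [[word_index[w] for w in s.lower().split() if w in word_index]
            for s in sentences]

# Made-up toy corpus, not taken from the dataset
sentences = ["the quick brown fox", "the lazy dog", "the quick dog"]
word_index = build_word_index(sentences)
ids = texts_to_ids(sentences, word_index)

print(word_index)
print(ids)
```

The most frequent word ("the") receives id 1, and every sentence becomes a list of ids of varying length, which is why padding is needed next.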

In [55]:
def pad(x, length=None):
    """
    Pad x
    :param x: List of sequences.
    :param length: Length to pad the sequence to.  If None, use length of longest sequence in x.
    :return: Padded numpy array of sequences
    """
    if length is None:
        length = max(len(el) for el in x)
    
    # padding='post' appends zeros at the end of each sequence
    return pad_sequences(x, maxlen=length, padding='post')
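
As an illustration of what post-padding does, the sketch below re-implements it on plain Python lists. `pad_post` is a hypothetical helper, not part of Keras. The truncation branch mirrors what we understand to be the `pad_sequences` default (`truncating='pre'`, which drops tokens from the start of an over-long sequence); treat that as an assumption to check against the Keras documentation.

```python
def pad_post(sequences, length=None, pad_id=0):
    """Right-pad every sequence with pad_id up to a common length."""
    if length is None:
        length = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Keep only the last `length` tokens if the sequence is too long
        kept = list(seq[max(len(seq) - length, 0):])
        padded.append(kept + [pad_id] * (length - len(kept)))
    return padded

print(pad_post([[1, 2, 3], [4]]))        # [[1, 2, 3], [4, 0, 0]]
print(pad_post([[1, 2, 3]], length=5))   # [[1, 2, 3, 0, 0]]
print(pad_post([[1, 2, 3]], length=2))   # [[2, 3]]
```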

In [57]:
def preprocess(x, y):
    """
    Preprocess x and y
    :param x: Feature List of sentences
    :param y: Label List of sentences
    :return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
    """
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)

    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions:
    # (batch_size, sequence_length, 1), with one integer word id per position
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk
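
The label layout used above can be illustrated with a small pure-Python sketch of sparse categorical cross-entropy for a single sentence. This is a simplified illustration, not the Keras function: `sparse_categorical_crossentropy_sketch` is a hypothetical name and the probabilities are made up. Each label is an integer id wrapped in a length-1 list (the trailing axis added by the reshape), and the loss is the mean negative log-probability assigned to the true id.

```python
import math

def sparse_categorical_crossentropy_sketch(labels, probabilities):
    """Mean negative log-probability of the true class over all time steps."""
    losses = []
    for (label,), probs in zip(labels, probabilities):
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)

# One sentence with two time steps and a 3-class "vocabulary" (0 = padding)
labels = [[2], [0]]                  # shape (sequence_length, 1): integer ids, not one-hot
probabilities = [[0.1, 0.2, 0.7],    # shape (sequence_length, num_classes)
                 [0.5, 0.25, 0.25]]

loss = sparse_categorical_crossentropy_sketch(labels, probabilities)
print(loss)
```

Note that the padding id 0 is treated as an ordinary class here, which is also why the models below use `french_vocab_size + 1` output units. Since no masking is applied, correctly predicted padding positions presumably contribute to the reported accuracy as well.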

In [59]:
# apply preprocessing to dataset
preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer =\
    preprocess(english_sentences, french_sentences)
    
max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)

print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)

Max English sentence length: 15
Max French sentence length: 21
English vocabulary size: 199
French vocabulary size: 344


### Post-processing
Convert predicted token ids back to French words.

In [36]:
def logits_to_text(logits, tokenizer):
    """
    Turn logits from a neural network into text using the tokenizer
    :param logits: Logits from a neural network
    :param tokenizer: Keras Tokenizer fit on the labels
    :return: String that represents the text of the logits
    """
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'

    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])

`logits_to_text` function loaded.
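
The decoding step can be sketched without NumPy or Keras as follows. `decode_predictions` is a hypothetical stand-in for `logits_to_text`, and the word index and scores are made up. Although the function above speaks of logits, the models in this notebook end in a softmax and therefore output probabilities; the argmax is the same in both cases.

```python
def decode_predictions(scores, word_index, pad_token="<PAD>"):
    """Pick the highest-scoring id at each time step and map it back to a word."""
    index_to_word = {idx: word for word, idx in word_index.items()}
    index_to_word[0] = pad_token  # id 0 is reserved for padding
    best_ids = [max(range(len(row)), key=row.__getitem__) for row in scores]
    return " ".join(index_to_word[i] for i in best_ids)

# Made-up example: 3 time steps, classes 0 (<PAD>), 1 ("est"), 2 ("il")
word_index = {"est": 1, "il": 2}
scores = [[0.1, 0.2, 0.7],
          [0.2, 0.7, 0.1],
          [0.9, 0.05, 0.05]]

text = decode_predictions(scores, word_index)
print(text)  # il est <PAD>
```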


### Modeling
We'll experiment with different model architectures:
- 1: a simple baseline (a single Dense layer, no recurrence)
- 2: RNN with an embedding layer
- 3: Bidirectional RNN with embedding

#### Model 1: Simple baseline

In [60]:
def simple_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build a basic model on x and y (a Dense layer applied at each sequence position; it contains no recurrent layer)
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    
    learning_rate = 0.1
    input_l = Input(shape=input_shape[1:])
    # Despite the variable name, this is not a recurrent layer: the Dense layer is applied independently at each
    # position of the padded sequence and maps the single token id to french_vocab_size + 1 scores
    # (the +1 accounts for the padding id 0)
    rnn = Dense(french_vocab_size + 1)(input_l)
    # The output layer is a softmax layer that, for each sequence position, gives a probability
    # distribution over the French words
    model = Model(input_l, Activation('softmax')(rnn))

    print(model.output_shape)
    print(model.summary())
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model

In [120]:
# Reshaping the input to work with the basic model
# input is padded to have same size as output (21)
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
# Code below reshapes input to be (batch_size, sequence_length, 1): one feature (the token id) per position
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

In [61]:
# Train the neural network
simple_rnn_model = simple_model(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size,
    french_vocab_size)

simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)

(None, 21, 345)
Model: "model_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_8 (InputLayer)         (None, 21, 1)             0         
_________________________________________________________________
dense_8 (Dense)              (None, 21, 345)           690       
_________________________________________________________________
activation_5 (Activation)    (None, 21, 345)           0         
Total params: 690
Trainable params: 690
Non-trainable params: 0
_________________________________________________________________
None
Train on 110288 samples, validate on 27573 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [121]:
# Print predictions of two sample sentences
print(english_sentences[0])
print(logits_to_text(simple_rnn_model.predict(tmp_x[np.newaxis, 0, :])[0], french_tokenizer))

print(english_sentences[1300])
print(logits_to_text(simple_rnn_model.predict(tmp_x[np.newaxis, 1300, :])[0], french_tokenizer))

new jersey is sometimes quiet during autumn , and it is snowy in april .
est est est est en est est est est est est est est <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
our least favorite fruit is the peach , but your least favorite is the lemon .
est est est est est est aime est est est est est est chaud <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>


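The model mostly predicts the most frequent words. This is not surprising: the Dense layer sees only a single token id at each position, so it has no access to the surrounding words. Playing a bit with the learning rate does not seem to help much.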
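
The limitation of Model 1 can be illustrated with a small sketch of a Dense layer that receives one scalar token id per position. The weights are random and the token ids are made up; `dense_argmax` is a hypothetical helper. With a single input feature, class `k` gets the score `w_k * token_id + b_k`, so the layer has two parameters per output class (which matches the 690 parameters for 345 classes in the summary above) and its prediction depends only on the token id, never on its neighbours.

```python
import random

def dense_argmax(token_id, weights, biases):
    """Predicted class of a Dense layer with a single scalar input feature."""
    scores = [w * token_id + b for w, b in zip(weights, biases)]
    # Softmax is monotonic, so the argmax of the scores equals the argmax of the probabilities
    return max(range(len(scores)), key=scores.__getitem__)

random.seed(0)
num_classes = 5
weights = [random.uniform(-1, 1) for _ in range(num_classes)]
biases = [random.uniform(-1, 1) for _ in range(num_classes)]

# Two made-up sentences that share the ids 7, 2 and 0 in different positions
sentence_a = [3, 7, 2, 0]
sentence_b = [9, 2, 7, 0]
pred_a = [dense_argmax(t, weights, biases) for t in sentence_a]
pred_b = [dense_argmax(t, weights, biases) for t in sentence_b]

print(pred_a, pred_b)
print("parameters:", len(weights) + len(biases))
```

Whatever the weights are, id 7 receives the same prediction in both sentences, regardless of what precedes or follows it.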


#### Model 2: RNN with word embeddings
We use an embedding layer to represent each word as a dense n-dimensional vector (with n = embedding size) instead of a bare integer id.

In [68]:
def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build an RNN model using word embedding on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """

    learning_rate= 1e-3
    input_l = Input(shape=input_shape[1:])
    # Add an Embedding layer to smartly encode the english inputs
    emb_size = 100
    emb_layer = Embedding(english_vocab_size+1, emb_size)(input_l)
    # using GRU with TimeDistributed vs Dense improves performance from 0.6 to 0.8!
    rnn = GRU(64, return_sequences=True)(emb_layer)
    logits = TimeDistributed(Dense(french_vocab_size+1, activation='softmax'))(rnn)

    model = Model(input_l, logits)
    model.compile(loss=sparse_categorical_crossentropy, 
                        optimizer=Adam(learning_rate), 
                        metrics=['accuracy'])
    print(model.summary())
    return model

In [115]:
# Reshaping the input to work with embeddings [input shape of embeddings should be (batch_size, sequence_length)]
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)

In [69]:
# Train the neural network
embedded_model = embed_model(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size,
    french_vocab_size)

embedded_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)

Model: "model_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_11 (InputLayer)        (None, 21)                0         
_________________________________________________________________
embedding_6 (Embedding)      (None, 21, 100)           20000     
_________________________________________________________________
gru_6 (GRU)                  (None, 21, 64)            31680     
_________________________________________________________________
time_distributed_6 (TimeDist (None, 21, 345)           22425     
Total params: 74,105
Trainable params: 74,105
Non-trainable params: 0
_________________________________________________________________
None
Train on 110288 samples, validate on 27573 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x7f399f2dbb00>

In [119]:
# Print predictions of two sample sentences
print(english_sentences[0])
print(logits_to_text(embedded_model.predict(tmp_x[np.newaxis, 0, :])[0], french_tokenizer))

print(english_sentences[1300])
print(logits_to_text(embedded_model.predict(tmp_x[np.newaxis, 1300, :])[0], french_tokenizer))

new jersey is sometimes quiet during autumn , and it is snowy in april .
new jersey est parfois calme en l' automne il il est neigeux en en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
our least favorite fruit is the peach , but your least favorite is the lemon .
notre fruit préféré moins est la pêche mais votre moins préféré est la citron <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>


Adding the Embedding layer and using a GRU with a TimeDistributed layer already gives quite good results on the two sample sentences, although some errors remain (for example "la citron" instead of "le citron").
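
As a sanity check on the model summary above, the parameter counts can be reproduced by hand. The helper functions below are hypothetical and only encode the standard formulas. The GRU formula assumes the variant with one bias vector per gate, which is what matches the summary printed here; Keras versions that default to `reset_after=True` use two bias vectors per gate and report a larger number.

```python
def embedding_params(vocab_size, emb_size):
    # One emb_size-dimensional vector per vocabulary entry
    return vocab_size * emb_size

def gru_params(input_dim, units):
    # Three weight sets (update gate, reset gate, candidate state),
    # each with input weights, recurrent weights and one bias vector
    return 3 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias vector
    return input_dim * units + units

english_vocab_size, french_vocab_size = 199, 344       # values printed during preprocessing

emb = embedding_params(english_vocab_size + 1, 100)    # 20000
gru = gru_params(100, 64)                              # 31680
out = dense_params(64, french_vocab_size + 1)          # 22425
total = emb + gru + out                                # 74105

print(emb, gru, out, total)
```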

#### Model 3: Embedding and Bidirectional RNNs
The model combines an embedding layer with a bidirectional RNN, so that it can see the following words and not only the preceding ones.

In [106]:
def model_final(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build a model that incorporates embedding, encoder-decoder, and bidirectional RNN on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    learning_rate= 1e-3
    input_l = Input(shape=input_shape[1:])
    emb_size = 100
   
    X = Embedding(input_dim=english_vocab_size+1, output_dim=emb_size)(input_l)
    # RepeatVector takes a 2D input (batch_size, features) and repeats it output_sequence_length times.
    # As we want to pass on all the info from the Embedding layer, we need to Flatten its output first
    X = Flatten()(X)
    emb_layer = RepeatVector(output_sequence_length)(X)

    # Bidirectional runs two RNNs of the given size, one reading the input in forward direction and one in
    # backward direction, and concatenates their outputs (2 * 64 = 128 units)
    rnn = Bidirectional(GRU(64, return_sequences=True))(emb_layer)
    logits = TimeDistributed(Dense(french_vocab_size+1, activation='softmax'))(rnn)

    model = Model(input_l, logits)

    model.compile(loss=sparse_categorical_crossentropy, 
                        optimizer=Adam(learning_rate), 
                        metrics=['accuracy'])
    print(model.summary())
    return model

In [117]:
# Reshaping the input to work with embeddings
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)

In [108]:
# Train the neural network
final_model = model_final(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size,
    french_vocab_size)

final_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)

Model: "model_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_13 (InputLayer)        (None, 21)                0         
_________________________________________________________________
embedding_8 (Embedding)      (None, 21, 100)           20000     
_________________________________________________________________
flatten_2 (Flatten)          (None, 2100)              0         
_________________________________________________________________
repeat_vector_2 (RepeatVecto (None, 21, 2100)          0         
_________________________________________________________________
bidirectional_2 (Bidirection (None, 21, 128)           831360    
_________________________________________________________________
time_distributed_8 (TimeDist (None, 21, 345)           44505     
Total params: 895,865
Trainable params: 895,865
Non-trainable params: 0
____________________________________________________

<keras.callbacks.callbacks.History at 0x7f399b780a58>

In [118]:
# Print predictions of two sample sentences
print(english_sentences[0])
print(logits_to_text(final_model.predict(tmp_x[np.newaxis, 0, :])[0], french_tokenizer))

print(english_sentences[1300])
print(logits_to_text(final_model.predict(tmp_x[np.newaxis, 1300, :])[0], french_tokenizer))

new jersey is sometimes quiet during autumn , and it is snowy in april .
new jersey est parfois calme pendant l'automne et il est est en en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
our least favorite fruit is the peach , but your least favorite is the lemon .
votre fruit préféré moins est la chaux mais votre moins préféré est le citron <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>


Using a bidirectional layer does not seem to improve model performance significantly. The long training time of this network, however, prevents further investigation of possible improvements from parameter tuning or changes to the network structure.
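
One likely contributor to the long training time is the width of the GRU input. The sketch below mimics `Flatten` followed by `RepeatVector` on plain Python lists (toy sizes, hypothetical helper names) and then recomputes the parameter count of the bidirectional layer, under the same one-bias-per-gate GRU assumption that matches the summary above.

```python
def flatten(matrix):
    """Concatenate all rows into one long vector, like Flatten on a single sample."""
    return [value for row in matrix for value in row]

def repeat_vector(vector, n):
    """Copy the same vector n times, like RepeatVector on a single sample."""
    return [list(vector) for _ in range(n)]

# Toy sizes: 3 input positions, embedding size 2, 4 output steps
embedded = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
flat = flatten(embedded)                 # length 3 * 2 = 6
decoder_input = repeat_vector(flat, 4)   # 4 identical rows of length 6

# Sizes used in the notebook
sequence_length, emb_size, units = 21, 100, 64
gru_input_dim = sequence_length * emb_size                                  # 2100
bidirectional_params = 2 * 3 * (units * (gru_input_dim + units) + units)    # 831360

print(len(flat), len(decoder_input), gru_input_dim, bidirectional_params)
```

Every output step receives the same 2100-dimensional vector, and almost all of the 831,360 parameters sit in the input weights of the two GRUs, compared with 31,680 GRU parameters in Model 2.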

Disclaimer: This notebook is inspired by a project that is part of the Udacity Natural Language Processing Nanodegree.