# Machine Translation - English to French

##### This project builds an end-to-end machine translation pipeline that takes English text and returns its French translation. The model is built from scratch using an encoder-decoder architecture with RNN layers.

## Modules

In [1]:
import collections
import numpy as np

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.models import Input
from keras.models import Sequential
from keras.layers import GRU, LSTM, InputLayer, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional, Dropout
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy

Using TensorFlow backend.


## Dataset
The most common datasets used for machine translation are from WMT (http://www.statmt.org/). However, training a neural network on them takes a long time. To train the model in a reasonable time, this project uses a dataset with a small vocabulary.

### Load Data
The English sentences are in `data/small_vocab_en` and the French sentences are in `data/small_vocab_fr`.

In [2]:
import os

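def load_data(path):
    """
    Load dataset
    """
    # Read as UTF-8 explicitly (assumed file encoding) so accented French
    # characters decode the same way on every platform.
    with open(path, "r", encoding="utf-8") as f:
        data = f.read()

    return data.split('\n')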

# Load English data
english_sentences = load_data('data/small_vocab_en')
# Load French data
french_sentences = load_data('data/small_vocab_fr')

print('Dataset Loaded')

Dataset Loaded


### Check Samples
Let's view the first two lines from each file (English and French).

In [3]:
for sample_i in range(2):
    print('English Sample Line {}:  {}'.format(sample_i + 1, english_sentences[sample_i]))
    print('French Sample Line {}:  {}'.format(sample_i + 1, french_sentences[sample_i]))

English Sample Line 1:  new jersey is sometimes quiet during autumn , and it is snowy in april .
French Sample Line 1:  new jersey est parfois calme pendant l' automne , et il est neigeux en avril .
English Sample Line 2:  the united states is usually chilly during july , and it is usually freezing in november .
French Sample Line 2:  les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .


The dataset has been partially preprocessed: for example, punctuation has been delimited with spaces and the text has been converted to lowercase. However, the text still requires more preprocessing.

### Stats of the Dataset
The dataset has a relatively small vocabulary.

In [4]:
english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
french_words_counter = collections.Counter([word for sentence in french_sentences for word in sentence.split()])

print('{} English words.'.format(len([word for sentence in english_sentences for word in sentence.split()])))
print('{} unique English words.'.format(len(english_words_counter)))
print('10 Most common words in the English dataset:')
print('"' + '" "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')
print()
print('{} French words.'.format(len([word for sentence in french_sentences for word in sentence.split()])))
print('{} unique French words.'.format(len(french_words_counter)))
print('10 Most common words in the French dataset:')
print('"' + '" "'.join(list(zip(*french_words_counter.most_common(10)))[0]) + '"')

1823250 English words.
227 unique English words.
10 Most common words in the English dataset:
"is" "," "." "in" "it" "during" "the" "but" "and" "sometimes"

1961295 French words.
355 unique French words.
10 Most common words in the French dataset:
"est" "." "," "en" "il" "les" "mais" "et" "la" "parfois"


## Preprocess the Data
The following preprocessing steps are needed:
1. Tokenize the words into ids.
2. Add padding to make all the sequences the same length.

### Tokenize

In [5]:
# Define a function that fits a language-specific tokenizer and tokenizes the sentences.

def tokenize(x):
    """
    Tokenize x
    :param x: List of sentences/strings to be tokenized
    :return: Tuple of (tokenized x data, tokenizer used to tokenize x)
    """
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(x)
    tokenized_x = tokenizer.texts_to_sequences(x)
    return tokenized_x, tokenizer
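
To make the id assignment concrete, here is a minimal pure-Python sketch of the idea behind `Tokenizer`: words are ranked by frequency and numbered from 1, which leaves 0 free for padding. The helper names `build_word_index` and `texts_to_ids` are hypothetical, and the sketch skips the punctuation filtering that the Keras `Tokenizer` applies by default.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an id, most frequent word first, starting at 1."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # most_common() sorts by count; ties keep their first-seen order
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def texts_to_ids(sentences, word_index):
    """Replace every known word with its id; unknown words are dropped."""
    return [[word_index[w] for w in s.lower().split() if w in word_index]
            for s in sentences]

toy_sentences = ["the cat sat", "the dog sat down"]
toy_word_index = build_word_index(toy_sentences)
toy_ids = texts_to_ids(toy_sentences, toy_word_index)
print(toy_word_index)
print(toy_ids)
```

Because id 0 is never assigned to a word, it can later be used as the padding value without colliding with the vocabulary.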

### Padding

In [6]:
# Define function to pad sentences to the same max length.

def pad(x, length=None):
    """
    Pad x
    :param x: List of sequences.
    :param length: Length to pad the sequence to.  If None, use length of longest sequence in x.
    :return: Padded numpy array of sequences
    """
    if length is None:
        length = max([len(sequence) for sequence in x])
    padded_x = pad_sequences(x, maxlen = length, padding='post')
    return padded_x
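
As an illustration of what `padding='post'` does, the following pure-Python sketch appends zeros to the end of each sequence until all have the same length. The name `pad_post` is hypothetical; the truncation of overly long sequences from the front mirrors the `pad_sequences` default (`truncating='pre'`), which this notebook never triggers because it pads to the longest sequence.

```python
def pad_post(sequences, length=None, pad_id=0):
    """Right-pad every sequence with pad_id up to a common length."""
    if length is None:
        length = max(len(seq) for seq in sequences)
    # seq[-length:] keeps the last `length` ids when a sequence is too long;
    # multiplying a list by a negative number yields an empty list.
    return [list(seq[-length:]) + [pad_id] * (length - len(seq))
            for seq in sequences]

toy_padded = pad_post([[1, 3, 2], [1, 4, 2, 5]])
toy_fixed = pad_post([[1, 2, 3]], length=5)
print(toy_padded)
print(toy_fixed)
```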

### Preprocess and Split Data into Train and Test Datasets
Create a preprocessing pipeline that tokenizes and pads the sentences.

In [7]:
def preprocess(x, y):
    """
    Preprocess x and y
    :param x: Input List of sentences
    :param y: Target (Label) List of sentences
    :return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
    """
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)

    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk
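
The labels stay as integer word ids because the sparse variant of categorical cross-entropy looks up the predicted probability at that id directly, instead of requiring one-hot vectors. A minimal sketch for a single time step, using a hypothetical helper name and made-up probabilities:

```python
import math

def sparse_crossentropy_step(probabilities, label_id):
    """Loss for one time step: negative log of the probability
    that the model assigned to the correct word id."""
    return -math.log(probabilities[label_id])

# Made-up softmax output over a 3-word vocabulary; the correct id is 1.
step_probabilities = [0.1, 0.7, 0.2]
step_loss = sparse_crossentropy_step(step_probabilities, 1)
print(round(step_loss, 4))
```

The loss shrinks toward 0 as the probability of the correct id approaches 1.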

In [8]:
# Preprocess all English and French sentences. Calculate the maximum length for each language.

preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer =\
    preprocess(english_sentences, french_sentences)
    
max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)

print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)

Max English sentence length: 15
Max French sentence length: 21
English vocabulary size: 199
French vocabulary size: 344


In [9]:
# Split the dataset into Train and Test datasets. 

from sklearn.model_selection import train_test_split

train_english, test_english, train_french, test_french = train_test_split(
    preproc_english_sentences, preproc_french_sentences, shuffle=False, test_size=0.20, random_state=42)

In [10]:
train_english.shape

(110288, 15)

In [11]:
train_french.shape

(110288, 21, 1)
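
With `shuffle=False`, the split amounts to slicing: the leading rows become the training set and the trailing 20% become the test set, and `random_state` has no effect. A rough pure-Python equivalent, with a hypothetical helper name and assuming the test size is rounded up as scikit-learn does for a fractional `test_size`:

```python
import math

def split_without_shuffle(x, y, test_size=0.20):
    """Slice x and y into leading train rows and trailing test rows."""
    n_test = math.ceil(len(x) * test_size)
    n_train = len(x) - n_test
    return x[:n_train], x[n_train:], y[:n_train], y[n_train:]

toy_x = list(range(10))
toy_y = [value * 10 for value in toy_x]
toy_train_x, toy_test_x, toy_train_y, toy_test_y = split_without_shuffle(toy_x, toy_y)
print(toy_train_x, toy_test_x)
```

Because the order is preserved, the test set consists of the last sentences in the files.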

## Model Architecture and Training

### Model Architecture

The model is an encoder-decoder RNN with an embedding layer and bidirectional layers. Both GRU and LSTM units were tried; LSTM was chosen because it produced better results, although it took a bit longer to train.

#### Hyperparameters
Several hyperparameters were tuned experimentally: embedding size, learning rate, number of epochs, and GRU vs. LSTM units. The LSTM layers also use the 'relu' activation function instead of the default 'tanh'.

In [14]:
def model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build a model that incorporates embedding, encoder-decoder, and bidirectional RNN on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    """

    embedding_dim = 300
    learning_rate = 2e-3
    
    model = Sequential()
    
    model.add(Embedding(english_vocab_size, embedding_dim, input_length=input_shape[1]))
    model.add(Bidirectional(LSTM(256, activation='relu', return_sequences=False))) 
    model.add(RepeatVector(output_sequence_length))
    
    model.add(Bidirectional(LSTM(256, activation='relu', return_sequences=True))) 
    model.add(TimeDistributed(Dense(french_vocab_size*5, activation='relu')))
    model.add(TimeDistributed(Dense(french_vocab_size, activation='softmax')))

    model.compile(loss=sparse_categorical_crossentropy,
      optimizer=Adam(learning_rate),
      metrics=['accuracy'])
    
    model.summary()  # summary() prints the table itself and returns None
    
    return model

### Create Callback Function for Saving Model Checkpoints

In [25]:
# from keras.callbacks import ModelCheckpoint

# # Include the epoch in the file name (uses `str.format`)
# checkpoint_path = "training/cp-{epoch:04d}.ckpt"
# checkpoint_dir = os.path.dirname(checkpoint_path)

# batch_size = 1024
# epochs=25

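# # Create a callback that saves the model's weights periodically.
# # Note: an integer `save_freq` is not measured in epochs (tf.keras counts batches,
# # or samples in some older releases), so `5*epochs` does not mean "every 5 epochs".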
# cp_callback = ModelCheckpoint(filepath=checkpoint_path, 
#                               verbose=1, 
#                               save_weights_only=True, 
#                               save_freq=5*epochs)

### Train the Model

In [32]:
# Train the final model, holding out 10% of the training data for validation.

# Pad the English sentences to the maximum French sentence length, which is longer, so that the input and output tensors have the same number of time steps.
tmp_x = pad(train_english, max_french_sequence_length)

# 1 is added to each vocabulary size so that index 0 can represent <PAD>.
final = model(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size+1,
    french_vocab_size+1)

# # Save the weights using the `checkpoint_path` format
# final.save_weights(checkpoint_path.format(epoch=0))
# final.fit(tmp_x, train_french, batch_size=1024, callbacks=[cp_callback], epochs=25, validation_split=0.1)

final.fit(tmp_x, train_french, batch_size=1024, epochs=25, validation_split=0.1)  

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 21, 300)           60000     
_________________________________________________________________
bidirectional_9 (Bidirection (None, 512)               1140736   
_________________________________________________________________
repeat_vector_5 (RepeatVecto (None, 21, 512)           0         
_________________________________________________________________
bidirectional_10 (Bidirectio (None, 21, 512)           1574912   
_________________________________________________________________
time_distributed_9 (TimeDist (None, 21, 1725)          884925    
_________________________________________________________________
time_distributed_10 (TimeDis (None, 21, 345)           595470    
Total params: 4,256,043
Trainable params: 4,256,043
Non-trainable params: 0
_________________________________________________________________
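
The parameter counts in the summary can be cross-checked by hand from the layer sizes. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are hypothetical, and the vocabulary sizes are the ones printed earlier plus 1 for `<PAD>`.

```python
def embedding_params(vocab_size, embedding_dim):
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # 4 weight sets (three gates plus the cell candidate), each with
    # input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def bidirectional_lstm_params(input_dim, units):
    # One forward and one backward LSTM of the same size
    return 2 * lstm_params(input_dim, units)

def dense_params(input_dim, output_dim):
    return input_dim * output_dim + output_dim

english_vocab = 199 + 1   # +1 for <PAD>
french_vocab = 344 + 1    # +1 for <PAD>

layer_params = [
    embedding_params(english_vocab, 300),          # embedding
    bidirectional_lstm_params(300, 256),           # encoder
    bidirectional_lstm_params(2 * 256, 256),       # decoder (input is 512 wide)
    dense_params(2 * 256, french_vocab * 5),       # hidden dense layer
    dense_params(french_vocab * 5, french_vocab),  # softmax output layer
]
total_params = sum(layer_params)
print(layer_params, total_params)
```

`RepeatVector` has no weights, which is why it shows 0 parameters in the summary.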


#### The model achieved over 98% accuracy on both the training and validation data!

### Save the Model

In [34]:
!mkdir -p saved_model
final.save('saved_model/my_model')

### Check the Output of the Trained Model.
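Training is done. For each time step, the model outputs a probability distribution over the French vocabulary (called logits in the code below). These outputs need to be converted to French words, which are then joined back into sentences.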

In [37]:
# Define a function that converts the model output into a French sentence.
# The network outputs softmax scores per time step (named logits here); argmax over them gives the word ids.
 
def logits_to_text(logits, tokenizer):
    """
    Turn logits from a neural network into text using the tokenizer
    :param logits: Logits from a neural network
    :param tokenizer: Keras Tokenizer fit on the labels
    :return: String that represents the text of the logits
    """
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'

    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])
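
The decoding step can be illustrated without the trained model. In this pure-Python sketch, each row stands for one time step of made-up scores over a tiny hypothetical vocabulary; the index of the largest score in each row is looked up in the id-to-word mapping.

```python
def decode_scores(score_rows, index_to_word):
    """Pick the highest-scoring id per time step and join the words."""
    ids = [max(range(len(row)), key=row.__getitem__) for row in score_rows]
    return ' '.join(index_to_word[i] for i in ids)

toy_index_to_word = {0: '<PAD>', 1: 'est', 2: 'calme'}
toy_scores = [
    [0.10, 0.80, 0.10],  # time step 1: id 1 wins
    [0.20, 0.10, 0.70],  # time step 2: id 2 wins
    [0.90, 0.05, 0.05],  # time step 3: id 0 (padding) wins
]
toy_text = decode_scores(toy_scores, toy_index_to_word)
print(toy_text)
```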

In [38]:
# The model's translation of the first sample of the dataset. 

text = logits_to_text(final.predict(tmp_x[:1])[0], french_tokenizer)
print(text)

new jersey est parfois calme pendant l' automne et il est neigeux en avril <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>


In [39]:
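# Compare with the target (label) sentence of the sample.
# The commas and period are missing from the prediction because the Keras Tokenizer's default filters strip punctuation.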

french_sentences[0]

"new jersey est parfois calme pendant l' automne , et il est neigeux en avril ."

## Prediction Using Test Dataset

In [46]:
def final_predictions(x, y, x_tk, y_tk):
    """
    Gets predictions using the final model
    :param x: Preprocessed English data
    :param y: Preprocessed French data
    :param x_tk: English tokenizer
    :param y_tk: French tokenizer
    """
    
    model = final
    sentences = pad(x, max_french_sequence_length)
    
    # Predict in fixed-size batches rather than passing the whole test set as one batch
    predictions = model.predict(sentences, batch_size=1024)
    

    return predictions

predictions = final_predictions(test_english, test_french, english_tokenizer, french_tokenizer)

### Compare Predictions to Labels

In [60]:
# The first sample of the test dataset.

# Predictions from the model are logits. Convert them to words.
pred = logits_to_text(predictions[0], french_tokenizer)
print('Prediction: \n{}'.format(pred))

# The French test sentences are stored as word ids. Convert them back to words.
id_to_word = {value: key for key, value in french_tokenizer.word_index.items()}
id_to_word[0] = '<PAD>'

label = ' '.join([id_to_word[np.max(x)] for x in test_french[0]])
print('Label: \n{}'.format(label))



Prediction: 
californie est agréable pendant l' hiver mais il est généralement chaud au printemps <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
Label: 
californie est agréable pendant l' hiver mais il est généralement chaud au printemps <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
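
A visual comparison works for a single sample; for more samples, a word-level score is easier to read. The following is a minimal sketch of such a score, with a hypothetical helper name, that compares prediction and label position by position and ignores positions where the label is padding.

```python
def token_accuracy(prediction, label, pad_token='<PAD>'):
    """Fraction of non-padding label positions predicted correctly."""
    pairs = [(p, l) for p, l in zip(prediction.split(), label.split())
             if l != pad_token]
    if not pairs:
        return 0.0
    return sum(p == l for p, l in pairs) / len(pairs)

sample_prediction = "californie est agréable pendant l' hiver mais il est généralement chaud au printemps <PAD> <PAD>"
sample_label = "californie est agréable pendant l' hiver mais il est généralement chaud au printemps <PAD> <PAD>"
sample_accuracy = token_accuracy(sample_prediction, sample_label)
print(sample_accuracy)
```

Unlike the accuracy metric reported during training, this score does not count padding positions, so it is not inflated by trailing `<PAD>` tokens.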


#### Looks pretty good!