# RSE Problem 4 - Machine Translation

The following is a simple end-to-end implementation of an LSTM encoder-decoder for translation from English to Spanish. I chose an architecture simple enough that the full solution can be executed in this notebook, from data preparation through model definition and training/validation to evaluation.

I've prepared a version that can be run in a Colab notebook [here](https://colab.research.google.com/drive/18eGiR2mBA69EOvQCC5uy3hzp3lo-ZaEx?usp=sharing) - I recommend using a GPU runtime if available; otherwise model training takes a good 20 minutes (as opposed to about 3 minutes with a GPU).

## Setup/Requirements

In [1]:
import pathlib
import random
import string
import re
from nltk.translate.bleu_score import corpus_bleu
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding, RepeatVector, TimeDistributed
from keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical, get_file

## Download Dataset

We begin by loading the training dataset - I took the code snippet below from an example in the Keras documentation to save time. (It also has the advantage of being English -> Spanish translations; I speak Spanish, so it made debugging easier!)

The end product is a list containing pairs of short English phrase strings and their translations in Spanish (examples in the output of the cell below). These are shuffled and then split into training/validation/test datasets to facilitate the rest of the training pipeline.

In [2]:
# Download and read English -> Spanish dataset
text_file = get_file(
    fname="spa-eng.zip",
    origin="http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip",
    extract=True,
)
text_file = pathlib.Path(text_file).parent / "spa-eng" / "spa.txt"
with open(text_file, encoding="utf-8") as f:
    lines = f.read().split("\n")[:-1]
# Use only the first 10,000 examples so the model can be trained relatively quickly.
# Each line holds an English phrase and its Spanish translation separated by a tab.
text_pairs = np.array([line.split("\t") for line in lines[:10000]])

print("=" * 80)
print("Examples of text pairs:\n")
for _ in range(5):
    print(random.choice(text_pairs))

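# Shuffle data and split into train/validation/test datasets.
# np.random.shuffle permutes the rows in place; random.shuffle is unsafe on a
# 2D NumPy array because its row swaps operate on views and can duplicate rows.
np.random.shuffle(text_pairs)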
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train = text_pairs[:num_train_samples]
val = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test = text_pairs[num_train_samples + num_val_samples :]

print()
print(f"{len(text_pairs)} total pairs")
print(f"{len(train)} training pairs")
print(f"{len(val)} validation pairs")
print(f"{len(test)} test pairs")
print("=" * 80)

Examples of text pairs:

["I'm after him." 'Voy tras él.']
['Close that door.' 'Cerrá esa puerta.']
['That a boy!' '¡Ése es mi chico!']
['Look at it.' 'Obsérvalo.']
["It won't work." 'No servirá.']

10000 total pairs
7000 training pairs
1500 validation pairs
1500 test pairs


## Data Prep

As with most machine learning pipelines, a considerable amount of the complexity is in the data preparation phase. Machine translation models can't "read" raw text: the text must first be converted into sequences of numbers. To do this, we fit a "tokenizer" for each language - a tokenizer takes the full vocabulary in the collection of English/Spanish phrases and associates each word with a unique integer. Each phrase is then represented by a vector of integers, one per word, padded with zeros up to the maximum phrase length. The labels or ground truths are then "one-hot encoded" - this converts each sequenced vector (length = maximum phrase length) into a 2-dimensional matrix (maximum phrase length x vocabulary size), where each row has a 1 in the column matching the index of the correct word at that position and zeros elsewhere.
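
To make the encoding steps concrete, here is a minimal sketch in plain NumPy. It is not the Keras implementation: the helper names (`fit_word_index`, `encode`, `one_hot`) are made up for illustration, and words are numbered in order of first appearance, whereas the Keras `Tokenizer` numbers them by frequency.

```python
import numpy as np

def fit_word_index(texts):
    """Assign each distinct word an integer id, starting at 1 (0 is reserved for padding)."""
    word_index = {}
    for text in texts:
        for word in text.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def encode(texts, word_index, maxlen):
    """Map words to ids and right-pad every sequence with zeros up to maxlen."""
    sequences = np.zeros((len(texts), maxlen), dtype=int)
    for row, text in enumerate(texts):
        ids = [word_index[word] for word in text.lower().split()]
        sequences[row, :len(ids)] = ids
    return sequences

def one_hot(sequences, vocab_size):
    """Turn (n, maxlen) integer ids into an (n, maxlen, vocab_size) array of 0s and 1s."""
    return np.eye(vocab_size)[sequences]

# Two toy phrases (hypothetical input, not the real dataset)
phrases = ["no es una broma", "es para ti"]
word_index = fit_word_index(phrases)
vocab_size = len(word_index) + 1  # +1 for the padding id 0
sequences = encode(phrases, word_index, maxlen=4)
targets = one_hot(sequences, vocab_size)

print(sequences)      # one row of word ids per phrase, zero-padded on the right
print(targets.shape)  # (number of phrases, max phrase length, vocabulary size)
```

Note that padded positions are one-hot encoded too: they get a 1 in column 0, which is why most rows in the real output further down start with a 1.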

Examples of outputs from each stage of the encoding process can be seen in the output below: 

In [3]:
# function to fit tokenizer for each language
def return_fitted_tokenizer(texts):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(texts)
    return tokenizer

# length (in whitespace-separated words) of the longest phrase
def calculate_max_length(texts):
    return max(len(text.split()) for text in texts)

# convert texts to integer sequences, zero-padded at the end up to maxlen
def sequence_texts(tokenizer, maxlen, texts):
    sequences = tokenizer.texts_to_sequences(texts)
    sequences = pad_sequences(sequences, maxlen=maxlen, padding='post')
    return sequences
 
### Prep English examples

# fit tokeniser and calculate vocab size and max sentence length
eng_tokenizer = return_fitted_tokenizer(text_pairs[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = calculate_max_length(text_pairs[:, 0])

# Encode English train/validation/test data
X_train = sequence_texts(eng_tokenizer, eng_length, train[:, 0])
X_val = sequence_texts(eng_tokenizer, eng_length, val[:, 0])
X_test = sequence_texts(eng_tokenizer, eng_length, test[:, 0])

### Prep Spanish examples
# same process as English examples, but sequenced examples are one-hot encoded

spa_tokenizer = return_fitted_tokenizer(text_pairs[:, 1])
spa_vocab_size = len(spa_tokenizer.word_index) + 1
spa_length = calculate_max_length(text_pairs[:, 1])

print("Example sentence:\n")
print(train[0,1])

Y_train = sequence_texts(spa_tokenizer, spa_length, train[:, 1])

print("\n... sequenced:\n")
print(Y_train[0])

Y_train = to_categorical(Y_train, num_classes=spa_vocab_size)

print("\n... and one-hot encoded:\n")
print(Y_train[0])

Y_val = sequence_texts(spa_tokenizer, spa_length, val[:, 1])
Y_val = to_categorical(Y_val, num_classes=spa_vocab_size)

Y_test = sequence_texts(spa_tokenizer, spa_length, test[:, 1])
Y_test = to_categorical(Y_test, num_classes=spa_vocab_size)


Example sentence:

Ve.

... sequenced:

[50  0  0  0  0  0  0  0]

... and one-hot encoded:

[[0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 ...
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]]


## Build model

I decided on a simple LSTM encoder/decoder framework, as it is an effective and well-established technique for machine translation and can be implemented in a short timeframe. I implemented the model using the Keras package, specifically its `Sequential` API, which allows models to be built as a simple stack of pre-implemented layers with a single input and output. I gave the model enough capacity to achieve reasonable performance on the given task, but not so many free parameters that it would take hours to train. The simplicity of this implementation can be seen below - the code implements an initial embedding layer, an LSTM encoder, a `RepeatVector` layer that feeds the encoded vector to every decoder timestep, an LSTM decoder, and a fully connected output layer, all in a few short lines of code:


In [4]:
# Define and compile Keras LSTM model
model = Sequential()
model.add(
    Embedding(eng_vocab_size, 
              256, 
              input_length=eng_length, 
              mask_zero=True, 
              name = "embedding_layer")
)
model.add(LSTM(256, name="encoder"))
model.add(RepeatVector(spa_length))
model.add(LSTM(256, return_sequences=True, name="decoder"))
model.add(TimeDistributed(
        Dense(spa_vocab_size, 
              activation='softmax', 
              name = "fully_connected_layer")
))
model.compile(optimizer='adam', loss='categorical_crossentropy')

print(model.summary())



Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_layer (Embedding)  (None, 5, 256)            404224    
_________________________________________________________________
encoder (LSTM)               (None, 256)               525312    
_________________________________________________________________
repeat_vector (RepeatVector) (None, 8, 256)            0         
_________________________________________________________________
decoder (LSTM)               (None, 8, 256)            525312    
_________________________________________________________________
time_distributed (TimeDistri (None, 8, 3182)           817774    
Total params: 2,272,622
Trainable params: 2,272,622
Non-trainable params: 0
_________________________________________________________________
None
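
As a sanity check, the parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper names are made up for illustration, and the English vocabulary size is inferred from the embedding parameter count rather than read from the tokenizer.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four weight blocks (input, forget and output gates plus the candidate update),
    # each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output
    return (input_dim + 1) * output_dim

units = 256
spa_vocab = 3182             # last dimension of the output layer in the summary
eng_vocab = 404224 // units  # inferred from the embedding layer's parameter count

counts = {
    "embedding_layer": embedding_params(eng_vocab, units),
    "encoder": lstm_params(units, units),
    "decoder": lstm_params(units, units),
    "time_distributed": dense_params(units, spa_vocab),
}
total = sum(counts.values())

for name, count in counts.items():
    print(f"{name}: {count}")
print(f"total: {total}")
```

The `RepeatVector` layer has no weights, so it contributes nothing to the total.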


## LSTM

An LSTM is a specific implementation of recurrent neural network or RNN. RNNs can take a variable length sequence, and encode that sequence into a vector representation one time step at a time. The "recurrent" naming of an RNN signifies that the network uses a recurring set of neurons and weights to encode each timestep in the sequence. Each recurring unit passes a "state" signal from one time step to the next - this signal is combined with the input at each timestep to calculate a new "state" signal to be passed along to the next timestep. 
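
The recurrence described above can be sketched in a few lines of NumPy. This is a plain (non-LSTM) recurrent unit with random, untrained weights, intended only to show how the same weights are reused at every timestep; the names and sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, state_dim = 3, 4

# One shared set of weights, reused at every timestep
W_x = rng.normal(scale=0.5, size=(state_dim, input_dim))  # input -> state
W_h = rng.normal(scale=0.5, size=(state_dim, state_dim))  # previous state -> state
b = np.zeros(state_dim)

def rnn_step(x_t, h_prev):
    """Combine the current input with the previous state to produce the new state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

def encode(sequence):
    """Run the recurrence over a sequence of any length; return the final state."""
    h = np.zeros(state_dim)
    for x_t in sequence:
        h = rnn_step(x_t, h)
    return h

short_sequence = rng.normal(size=(2, input_dim))  # 2 timesteps
long_sequence = rng.normal(size=(7, input_dim))   # 7 timesteps

# Sequences of different lengths are encoded into vectors of the same size
print(encode(short_sequence).shape, encode(long_sequence).shape)
```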

In the case of language translation, RNNs are structured in an "encoder" / "decoder" arrangement. The encoder takes the input text sequence and encodes it into a vector representation of the text, as described above. This vector representation isn't human readable/understandable, so a separate decoder network is used to translate the sentence into the target language. The decoder network uses the same recurrent process as the encoder, but takes the encoded signal as an input, and at each timestep outputs a set of probabilities that signify the likelihood that a word in the target language appears at that timestep in the sequence.

RNNs suffer significantly from exploding/vanishing gradients. Put simply, the longer the encoded/decoded sequence is, the more times the signal propagating through the network (and the gradient flowing back through it during training) is multiplied by the recurrent network weights - the result is an exploding signal (when the weights have magnitude > 1) or a vanishing signal (when the magnitude is < 1). LSTMs address this issue by allowing the network to "forget" certain aspects of the signal propagating from one timestep to the next. The architecture includes "gated" units, with sets of free parameters that allow each recurrent unit to determine:
 * the parts of the stored signal to keep, forget or update
 * the size of the updates that should be made
 * what parts of the signal to output to the next timestep

By allowing parts of the signal to propagate through the network unchanged, LSTMs largely mitigate the issue of vanishing gradients (exploding gradients are usually handled separately, e.g. with gradient clipping), and can be used to encode/decode much longer sequences than simpler recurrent architectures.
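
A minimal NumPy sketch of one LSTM timestep illustrates the gating. The weights here are set by hand (all zeros, with biases chosen to hold the forget gate open and the input gate closed), which is an artificial setup for illustration rather than anything a trained model would look like. It also shows, for comparison, what happens to a signal that is simply rescaled 50 times.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM timestep. W: (4*units, input_dim), U: (4*units, units), b: (4*units,)."""
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[:units])               # input gate: which parts of the state to update
    f = sigmoid(z[units:2 * units])      # forget gate: which parts of the state to keep
    g = np.tanh(z[2 * units:3 * units])  # candidate values: the size of the update
    o = sigmoid(z[3 * units:])           # output gate: what to expose to the next timestep
    c_t = f * c_prev + i * g             # new cell state
    h_t = o * np.tanh(c_t)               # new hidden state / output
    return h_t, c_t

input_dim, units, steps = 3, 4, 50
W = np.zeros((4 * units, input_dim))
U = np.zeros((4 * units, units))
b = np.zeros(4 * units)
b[:units] = -20.0          # input gate held (almost) fully closed
b[units:2 * units] = 20.0  # forget gate held (almost) fully open

c_start = np.array([0.5, -1.0, 2.0, 0.1])
h, c = np.zeros(units), c_start.copy()
for _ in range(steps):
    h, c = lstm_step(np.ones(input_dim), h, c, W, U, b)

# Plain repeated scaling for comparison: shrinks or grows geometrically
plain_shrink = 0.9 ** steps
plain_grow = 1.1 ** steps

print("cell state after 50 steps:", c)
print("0.9 applied 50 times:", plain_shrink)
print("1.1 applied 50 times:", plain_grow)
```

With the gates set this way the cell state is carried through all 50 steps essentially unchanged, whereas the plainly rescaled signal has either almost disappeared or blown up.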

## Model Training / Validation

The following code fragment implements the model training loop. As with many Python machine learning frameworks, Keras models are trained by calling the `fit()` method of the model instance, with a few simple hyperparameters that control the details of the training loop. At the end of each training epoch, the performance of the model is evaluated against the validation dataset - in a more complex pipeline, this could be used as a benchmark to choose between different model architectures (e.g. numbers of layers, dropout rates, optimisers) and hyperparameters (e.g. learning rates, number of training iterations, early stopping). Making these choices on the validation set rather than the test set avoids biasing the final model towards performance on the test dataset.


In [5]:
# Create model checkpoint that saves the model whenever validation loss improves
filename = 'model.h5'
checkpoint = ModelCheckpoint(filename, 
                             monitor='val_loss', 
                             verbose=0,
                             save_best_only=True, 
                             mode='min')

# Initiate training loop - 30 "epochs" or iterations
model.fit(X_train, Y_train, 
          epochs=30, 
          batch_size=64, 
          validation_data=(X_val, Y_val), 
          callbacks=[checkpoint], 
          verbose=2)

Epoch 1/30
110/110 - 12s - loss: 3.2057 - val_loss: 2.7223
Epoch 2/30
110/110 - 2s - loss: 2.1850 - val_loss: 2.5924
Epoch 3/30
110/110 - 2s - loss: 2.0513 - val_loss: 2.5094
Epoch 4/30
110/110 - 2s - loss: 1.9583 - val_loss: 2.4193
Epoch 5/30
110/110 - 3s - loss: 1.8769 - val_loss: 2.3843
Epoch 6/30
110/110 - 2s - loss: 1.7928 - val_loss: 2.3482
Epoch 7/30
110/110 - 2s - loss: 1.7117 - val_loss: 2.3002
Epoch 8/30
110/110 - 2s - loss: 1.6237 - val_loss: 2.2462
Epoch 9/30
110/110 - 2s - loss: 1.5310 - val_loss: 2.2046
Epoch 10/30
110/110 - 2s - loss: 1.4336 - val_loss: 2.1550
Epoch 11/30
110/110 - 2s - loss: 1.3358 - val_loss: 2.0933
Epoch 12/30
110/110 - 2s - loss: 1.2416 - val_loss: 2.0498
Epoch 13/30
110/110 - 2s - loss: 1.1473 - val_loss: 1.9934
Epoch 14/30
110/110 - 2s - loss: 1.0664 - val_loss: 1.9505
Epoch 15/30
110/110 - 2s - loss: 0.9813 - val_loss: 1.9009
Epoch 16/30
110/110 - 2s - loss: 0.9054 - val_loss: 1.8750
Epoch 17/30
110/110 - 2s - loss: 0.8392 - val_loss: 1.8383
Epoch

<keras.callbacks.History at 0x7fabc0169fd0>
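
The idea behind checkpointing on the best validation loss, and behind the early stopping mentioned earlier, can be sketched in plain Python. This is not the Keras implementation: the function names are made up for illustration, the first list holds the validation losses visible in the log above, and the second list is invented to show a case where stopping would trigger.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

def early_stop_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None if it never would."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Validation losses for epochs 1-17, copied from the training log above
logged = [2.7223, 2.5924, 2.5094, 2.4193, 2.3843, 2.3482, 2.3002, 2.2462, 2.2046,
          2.1550, 2.0933, 2.0498, 1.9934, 1.9505, 1.9009, 1.8750, 1.8383]

# Invented values (not from this model) where the loss stops improving
made_up = [2.0, 1.8, 1.9, 1.85, 1.95, 1.7]

print(best_epoch(logged), early_stop_epoch(logged))
print(early_stop_epoch(made_up, patience=3))
```

In the logged epochs the validation loss improves every time, so a best-only checkpoint would be overwritten at each epoch and a patience-based stop would not trigger.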

Below are some examples of output from the trained model. It performs relatively well given the short training cycle.

In [6]:
# Generate examples of translations from test data
def generate_eg_predictions():
    random_idxs = np.random.choice(100, 20)
    eng_sequences = X_test[random_idxs,:]
    preds = model.predict(eng_sequences)
    pred_sequences = np.argmax(preds, axis=2)
    eng_sentences = eng_tokenizer.sequences_to_texts(eng_sequences)
    pred_sentences = spa_tokenizer.sequences_to_texts(pred_sequences)
    ground_truthes = test[random_idxs, 1]
    sentences = zip(eng_sentences, pred_sentences, ground_truthes)
    for eng_sentence, pred_sentence, truth in sentences:
        print(f"{eng_sentence} => {pred_sentence}. Truth: {truth}")

generate_eg_predictions()

it's his => es le de. Truth: Es suyo.
answer tom => ¡respóndale a tomás. Truth: ¡Respóndanle a Tomás!
tom vanished => tom se ahogó. Truth: Tomás desapareció.
it's no joke => es un pez. Truth: No es una broma.
you go first => tú primero. Truth: Vosotros primero.
that's for you => no es es. Truth: Es para ti.
you were lucky => traed centrarte. Truth: La sacaste barata.
what a pity => ¡qué lástima. Truth: Qué lástima.
i need it asap => yo tengo lo. Truth: Lo necesito tan pronto como sea posible.
that's so lame => eso es un patético. Truth: Eso es tan patético.
i didn't walk => yo no voté. Truth: Yo no caminé.
i write poems => me siento mal. Truth: Escribo versos.
tom drove fast => ella fue. Truth: Tom condujo rápido.
i'm fair => estoy estoy. Truth: Soy justo.
stop singing => deja de. Truth: Deja de cantar.
come at once => ven enseguida. Truth: Ven enseguida.
quiet please => por favor silencio. Truth: Por favor, silencio.
we love it => lo encanta. Truth: Nos encanta.
we can meet => podemos

## Evaluating Model Performance

Once model selection and parameter tuning are complete, the final model(s) can be evaluated against the holdout test set. This is data that hasn't been used in model training/validation and so gives a (relatively) unbiased estimate of model performance on unseen data. A common scoring methodology in machine translation is the BLEU (Bilingual Evaluation Understudy) score - a value from 0 to 1 that measures the n-gram overlap between model translations and the ground truth translations.

The following code uses the NLTK implementation of the BLEU score to evaluate the model we've trained:



In [7]:
# Evaluate model performance against Bleu score
preds = model.predict(X_test)
pred_sequences = np.argmax(preds, axis=2)
pred_sentences = spa_tokenizer.sequences_to_texts(pred_sequences)
ground_truthes = test[:, 1]
bleu_score = corpus_bleu(ground_truthes, pred_sentences)
print("="*80)
print(f"Test Bleu score is {bleu_score}")
print("="*80)

Test Bleu score is 0.8566313698811789


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


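Note: The score above should not be taken at face value (the model doesn't perform nearly well enough to achieve above 0.7). `corpus_bleu` expects tokenized input: for each example, a list of reference translations, each given as a list of word tokens, and each hypothesis as a list of word tokens. The call above passes raw strings, so NLTK iterates over characters rather than words, which explains both the warning about 2-gram overlaps and the overinflated score. The translations are also very short, so higher-order n-gram precision would likely need smoothing even with word tokens. Unfortunately I didn't have the time to explore the alternatives available in the NLTK SmoothingFunction() implementation of BLEU, but I would certainly do so with more time. I thought it best to leave in as an example of how one might evaluate the performance of a model like this.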
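
For reference, a minimal sketch of how the inputs could be prepared for `corpus_bleu` is shown below. It has not been run against the model above: the three sentence pairs are copied from the example predictions earlier in the notebook, the `tokenize` helper is made up for illustration, and the choices of bigram weights and `method1` smoothing are assumptions that suit very short sentences rather than tuned settings.

```python
import re
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def tokenize(text):
    """Lowercase, strip punctuation and split into word tokens."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

# Ground truths and predictions taken from the example output above
truths = ["Ven enseguida.", "Por favor, silencio.", "Es suyo."]
predictions = ["ven enseguida", "por favor silencio", "es le de"]

# corpus_bleu expects, per example, a LIST of references, each a list of tokens...
list_of_references = [[tokenize(truth)] for truth in truths]
# ...and each hypothesis as a list of tokens
hypotheses = [tokenize(prediction) for prediction in predictions]

score = corpus_bleu(
    list_of_references,
    hypotheses,
    weights=(0.5, 0.5),  # unigrams and bigrams only, given how short the phrases are
    smoothing_function=SmoothingFunction().method1,
)
print(f"Word-level BLEU on three examples: {score:.3f}")
```

Applying the same `tokenize` step to both sides matters here, because the Keras tokenizer lowercases the predictions and strips most punctuation, while the ground truth strings keep both.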