# Machine translation

In this exercise, we will perform machine translation from English to French using sequence-to-sequence models.

## Loading the data

Let's first load the data. We will use a parallel corpus of English and French sentences from [this Udacity repo](), reading the first 20,000 lines of each file.

In [14]:
import pandas as pd 

n_rows = 20000
en_text = []
with open('ch3.small_vocab_en.txt', 'r', encoding='utf-8') as f:
    for ri,row in enumerate(f):
        en_text.append(row.strip())
        if ri>=n_rows-1: break
            
fr_text = []
with open('ch3.small_vocab_fr.txt', 'r', encoding='utf-8') as f:
    for ri, row in enumerate(f):
        fr_text.append(row.strip())
        if ri>=n_rows-1: break
            
en_ser = pd.Series(en_text)
fr_ser = pd.Series(fr_text)

## Printing some data

Let's print the first five sentence pairs to check that the English and French lines are aligned.

In [15]:
for i,(en,fr) in enumerate(zip(en_text[:5], fr_text[:5])):
    print('English: ', en)
    print('\tFrench: ',fr,'\n')

English:  new jersey is sometimes quiet during autumn , and it is snowy in april .
	French:  new jersey est parfois calme pendant l' automne , et il est neigeux en avril . 

English:  the united states is usually chilly during july , and it is usually freezing in november .
	French:  les états-unis est généralement froid en juillet , et il gèle habituellement en novembre . 

English:  california is usually quiet during march , and it is usually hot in june .
	French:  california est généralement calme en mars , et il est généralement chaud en juin . 

English:  the united states is sometimes mild during june , and it is cold in september .
	French:  les états-unis est parfois légère en juin , et il fait froid en septembre . 

English:  your least liked fruit is the grape , but my least liked is the apple .
	French:  votre moins aimé fruit est le raisin , mais mon moins aimé est la pomme . 



## Preprocessing French sentences 

The main preprocessing step is to add a `<sos>` (start of sentence) token to the beginning and an `<eos>` (end of sentence) token to the end of each French sentence.

In [16]:
fr_ser = fr_ser.apply(lambda x: ____)
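
As a rough illustration of this step, the sketch below wraps a sentence with the two markers in plain Python. The helper name `add_markers` is hypothetical and not part of the exercise; the output format follows the preprocessed French sentences printed later in this notebook.

```python
def add_markers(sentence, sos="<sos>", eos="<eos>"):
    """Wrap a sentence with start and end markers, separated by single spaces.

    Hypothetical helper for illustration only.
    """
    return "{} {} {}".format(sos, sentence.strip(), eos)

example = add_markers("la france est généralement sec au printemps .")
print(example)
```

A function of this kind could be passed to `fr_ser.apply(...)` in place of the lambda.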

## Splitting data to train, valid and test

Let's now split the data into training (60%), validation (20%) and test (20%) sets. What's the purpose of each set?

In [17]:
df = pd.concat([en_ser, fr_ser], axis=1)
df = df.rename(columns=dict(zip(df.columns.tolist(),['en','fr'])))
df = df.sample(frac=1.0, random_state=100) # Shuffling data as a precaution

# Splitting data to train/valid/test
train_df = df.sample(frac=0.6, random_state=100)
test_valid_df = df.drop(train_df.index)
valid_df = test_valid_df.sample(frac=0.5, random_state=100)
test_df = test_valid_df.drop(valid_df.index)
del test_valid_df

# Printing shapes
print("Train shape: {}".format(train_df.shape))
print("Valid shape: {}".format(valid_df.shape))
print("Test shape: {}".format(test_df.shape))

# Converting dataframes to lists
train_en_text = train_df["en"].tolist()
train_fr_text = train_df["fr"].tolist()
valid_en_text = valid_df["en"].tolist()
valid_fr_text = valid_df["fr"].tolist()
test_en_text = test_df["en"].tolist()
test_fr_text = test_df["fr"].tolist()

print(train_en_text[:2])
print(train_fr_text[:2])
print('\n')
print(test_en_text[:2])
print(test_fr_text[:2])

Train shape: (12000, 2)
Valid shape: (4000, 2)
Test shape: (4000, 2)
['my least liked fruit is the lime , but her least liked is the orange .', 'you dislike oranges , apples , and grapefruit .']
["<sos> mon fruit est moins aimé la chaux , mais elle est moins aimé l'orange . <eos>", "<sos> vous n'aimez pas les oranges , les pommes et le pamplemousse . <eos>"]


['new jersey is usually freezing during october , but it is never nice in july .', 'france is usually dry during spring , but it is wonderful in november .']
['<sos> new jersey est le gel habituellement en octobre , mais il est jamais agréable en juillet . <eos>', '<sos> la france est généralement sec au printemps , mais il est merveilleux en novembre . <eos>']
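
On the purpose of each set: the training set is used to fit the model weights, the validation set to monitor the model during training and compare design choices, and the test set to estimate performance on unseen data once at the end. The sketch below shows the same 60/20/20 split on plain indices, assuming a hypothetical helper `split_indices`; it does not reproduce the exact rows that `df.sample` selects above, only the idea that the three sets are disjoint and together cover the data.

```python
import random

def split_indices(n, train_frac=0.6, valid_frac=0.2, seed=100):
    """Shuffle range(n) and cut it into disjoint train/valid/test index lists.

    Hypothetical helper for illustration only.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(n * train_frac)
    n_valid = round(n * valid_frac)
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:]
    return train, valid, test

train_idx, valid_idx, test_idx = split_indices(20000)
print(len(train_idx), len(valid_idx), len(test_idx))
```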


In [18]:
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
import tensorflow.keras.backend as K

## Fitting a tokenizer and preprocessing

We will fit two tokenizers: one on the English corpus and one on the French corpus. We will also write a function called `preprocess_text(...)` that converts text to word ID sequences, pads them to a fixed length and optionally one-hot encodes them.

In [19]:
n_en_vocab, n_fr_vocab = 250, 250
en_len, fr_len = 100, 100
def create_and_fit_tokenizer(text, n_vocab):
    # Return a tokenizer (tok) with n_vocab words and <unk> OOV token
    # Fit on text
    return tok

en_tok = create_and_fit_tokenizer(train_en_text, n_en_vocab)
fr_tok = create_and_fit_tokenizer(train_fr_text, n_fr_vocab)

def preprocess_text(tok, text, pad_len, pad_type, truncate_type, n_vocab, onehot=False):
    seq = tok.texts_to_sequences(text)
    # Pad the sequence using the correct arguments
    pad_seq = ____
    if onehot:
        # Convert padded sequence to onehot encoded vectors using to_categorical
        pad_seq = ____
    return pad_seq

train_x = preprocess_text(en_tok, train_en_text, en_len, 'pre', 'post', n_en_vocab, onehot=True)
valid_x = preprocess_text(en_tok, valid_en_text, en_len, 'pre', 'post', n_en_vocab, onehot=True)
print(train_x.shape)
train_y = preprocess_text(fr_tok, train_fr_text, fr_len, 'post', 'post', n_fr_vocab, onehot=True)
valid_y = preprocess_text(fr_tok, valid_fr_text, fr_len, 'post', 'post', n_fr_vocab, onehot=True)
print(train_y.shape)

(12000, 100, 250)
(12000, 100, 250)
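
To make the padding arguments and the resulting shapes concrete, here is a minimal NumPy sketch of the two operations. The helpers `pad_ids` and `one_hot` are hypothetical stand-ins that mimic, as far as I understand them, the behaviour of `pad_sequences` and `to_categorical`; they are not the Keras implementations. Note how `'pre'` padding (used for the English inputs above) places zeros before the words, while `'post'` padding (used for the French targets) places them after.

```python
import numpy as np

def pad_ids(seqs, maxlen, padding="pre", truncating="post", value=0):
    """Pad or truncate lists of word IDs to a fixed length (hypothetical helper)."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for i, seq in enumerate(seqs):
        if not seq:
            continue  # an empty sequence stays all padding
        # 'post' truncation drops words from the end, 'pre' from the start
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
        if padding == "post":
            out[i, :len(seq)] = seq
        else:
            out[i, -len(seq):] = seq
    return out

def one_hot(ids, n_classes):
    """Map an integer array of shape (n, t) to one-hot vectors of shape (n, t, n_classes)."""
    return np.eye(n_classes, dtype="float32")[np.asarray(ids)]

toy = [[5, 2, 7], [3, 1]]  # made-up word IDs
pre = pad_ids(toy, maxlen=4, padding="pre")
post = pad_ids(toy, maxlen=4, padding="post")
print(pre)
print(post)
print(one_hot(post, n_classes=8).shape)
```

With 12,000 sentences, a length of 100 and 250 classes, the same operations give the `(12000, 100, 250)` shapes printed above.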


In [7]:
K.clear_session()
# Enable GPU memory growth. Without it, the following error was raised:
# > UnknownError:  [_Derived_]  Fail to find the dnn implementation.
# >  [[{{node CudnnRNN}}]]
# >  [[model/lstm/StatefulPartitionedCall]] [Op:__inference_distributed_function_5819]
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:  # Skip on CPU-only machines, where gpus[0] would raise an IndexError
    tf.config.experimental.set_memory_growth(gpus[0], True)

## Creating the Keras model

Let's now define a simple sequence-to-sequence model. The model has the following layers:

* Bidirectional LSTM layer with 50 nodes
* A repeat vector layer
* A decoder LSTM layer with 50 nodes
* A TimeDistributed Dense layer with `n_fr_vocab` nodes

We will use `categorical_crossentropy` as the loss and `adam` as the optimizer. Furthermore, we'll use `acc` as a metric.

In [8]:
# Write the encoder
____
____

# Write the decoder
____
____
____

# Define the model
model = ____(____, ____)

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 100, 250)]        0         
_________________________________________________________________
bidirectional (Bidirectional (None, 100)               120400    
_________________________________________________________________
repeat_vector (RepeatVector) (None, 100, 100)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 100, 50)           30200     
_________________________________________________________________
time_distributed (TimeDistri (None, 100, 250)          12750     
Total params: 163,350
Trainable params: 163,350
Non-trainable params: 0
_________________________________________________________________
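
The parameter counts in the summary can be checked by hand, which is a useful sanity check that the layers are wired as intended. The sketch below uses the standard LSTM and Dense parameter formulas; the function names are hypothetical, and the layer sizes are the ones shown in the summary above.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM layer: 4 gates, each with input, recurrent and bias weights."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Parameters of a Dense layer: one weight per input-output pair plus one bias per unit."""
    return input_dim * units + units

n_en_vocab, n_fr_vocab, units = 250, 250, 50

encoder = 2 * lstm_params(n_en_vocab, units)  # forward and backward LSTM
decoder = lstm_params(2 * units, units)       # input is the repeated 100-dim encoder output
output = dense_params(units, n_fr_vocab)      # weights shared across time steps
total = encoder + decoder + output
print(encoder, decoder, output, total)  # 120400 30200 12750 163350
```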


## Training the model

Let's now train the model for 10 epochs, using the validation data to monitor accuracy on sentences the model is not trained on.

In [9]:
# Fit the model on training data
____(____, ____, epochs=10, validation_data=(____, ____))

Train on 12000 samples, validate on 4000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x209bee691d0>

## Evaluating the model

Here we will evaluate the model on the test data and compute the test loss and test accuracy.

In [20]:
# Freeing up memory
try:
    del train_en_text, train_fr_text
except NameError:
    pass

test_x = preprocess_text(en_tok, test_en_text, en_len, 'pre', 'post', n_en_vocab, onehot=True)
test_y = preprocess_text(fr_tok, test_fr_text, fr_len, 'post', 'post', n_fr_vocab, onehot=True)

res = model.evaluate(test_x, test_y, verbose=0)
print("Test loss: {} & test accuracy: {}".format(res[0], res[1]))

Test loss: 0.32266148686408996 & test accuracy: 0.911454975605011
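
One caveat when reading this number: the French targets are padded to 100 positions and the padding positions are one-hot encoded like any other class, so they count towards the accuracy. Since the sentences are much shorter than 100 tokens, correctly predicting padding is likely to account for a large share of the score. The sketch below computes accuracy over non-padding positions only. It assumes that class 0 is the padding ID (as produced by zero padding); the function name `masked_accuracy` is hypothetical.

```python
import numpy as np

def masked_accuracy(y_true, y_pred, pad_id=0):
    """Token accuracy over positions whose true class is not pad_id.

    y_true, y_pred: arrays of shape (n, t, n_classes) holding one-hot targets
    and predicted class probabilities. Hypothetical helper for illustration.
    """
    true_ids = y_true.argmax(axis=-1)
    pred_ids = y_pred.argmax(axis=-1)
    mask = true_ids != pad_id
    if not mask.any():
        return float("nan")  # nothing but padding
    return float((true_ids[mask] == pred_ids[mask]).mean())

# Toy check: 1 sentence, 4 positions, 3 classes; the last two positions are padding
y_true = np.eye(3)[np.array([[1, 2, 0, 0]])]
y_pred = np.eye(3)[np.array([[1, 1, 0, 0]])]
masked = masked_accuracy(y_true, y_pred)
unmasked = float((y_true.argmax(-1) == y_pred.argmax(-1)).mean())
print(masked, unmasked)
```

On the real data, the same function could be applied to `test_y` and the output of `model.predict(test_x)`.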


## Building a model with teacher forcing

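This new model has two input layers: one for the encoder and another for the decoder. With teacher forcing, the decoder receives the target sentence itself as input, offset by one token, and learns to predict the next token at each position. The model has the following layers: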

* Bidirectional GRU layer with 50 nodes
* A decoder GRU layer with 50 nodes
* A TimeDistributed Dense layer with `n_fr_vocab` nodes

We will use `categorical_crossentropy` as the loss and `adam` as the optimizer. Furthermore, we'll use `acc` as a metric.
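
The offset between decoder input and decoder target is the core of teacher forcing, and it explains why the decoder input has length `fr_len-1`. The toy example below shows the slicing on plain word IDs (the IDs are made up); the same pattern appears on one-hot data in the evaluation cell further down, as `test_y[:,:-1,:]` and `test_y[:,1:,:]`.

```python
import numpy as np

# Toy batch of word IDs: 1 = <sos>, 2 = <eos>, 0 = padding (IDs are made up)
y = np.array([[1, 7, 8, 2, 0],
              [1, 9, 2, 0, 0]])

dec_in = y[:, :-1]     # what the decoder sees:  <sos> w1 w2 ...
dec_target = y[:, 1:]  # what it must predict:   w1 w2 ... <eos>

print(dec_in)
print(dec_target)
```

At every position the target is the token that follows the input token, so the decoder always conditions on the correct previous word rather than on its own earlier prediction.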

In [21]:
K.clear_session()
# Encoder
enc_inp = layers.Input(shape=(en_len, n_en_vocab))
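# With return_state=True, Bidirectional returns (output, forward state, backward state).
# In the summary below, the decoder's initial state is output index 2, i.e. enc_bwd.
enc_out, enc_fwd, enc_bwd = layers.Bidirectional(layers.GRU(50, return_state=True))(enc_inp)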
# Decoder
dec_inp = layers.Input(shape=(fr_len-1, n_fr_vocab))
# Define the rest of the decoder
____
____

model2 = models.Model(inputs=[enc_inp, dec_inp], outputs=dec_out)
model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model2.summary()

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            [(None, 100, 250)]   0                                            
__________________________________________________________________________________________________
input_5 (InputLayer)            [(None, 99, 250)]    0                                            
__________________________________________________________________________________________________
bidirectional_2 (Bidirectional) [(None, 100), (None, 90600       input_4[0][0]                    
__________________________________________________________________________________________________
gru_3 (GRU)                     (None, 99, 50)       45300       input_5[0][0]                    
                                                                 bidirectional_2[0][2]      

In [22]:
# Correctly fill the data that needs to be passed to model2
model2.fit(
    [____, train_y[____]], train_y[____], epochs=10, 
    validation_data=(____))

Train on 12000 samples, validate on 4000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x208ca983ef0>

In [24]:
res = model2.evaluate([test_x, test_y[:,:-1,:]], test_y[:,1:,:], verbose=0)
print("(Model 2) Test loss: {} & test accuracy: {}".format(res[0], res[1]))

(Model 2) Test loss: 0.1721156494617462 & test accuracy: 0.942570686340332
