# Homework 4: Build a Seq2seq model for machine translation.

### Name: Christopher Hittner

### Task: Translate English to Spanish and French

I pledge my honor that I have abided by the Stevens Honor System.

## 0. You will do the following:

1. Read and run my code.
2. Complete the code in Section 1.1 and Section 4.2.

    * Translating English to **German** is not acceptable! Try another language.
    
3. **Make improvements.** Directly modify the code in Section 3. Do at least one of the following. By doing more, you can earn up to 2 bonus points on the total.

    * Bi-LSTM instead of LSTM
    
    * Multi-task learning (e.g., both English to French and English to Spanish)
    
    * Attention
    
4. Evaluate the translation using the BLEU score. 

    * Optional. Up to 2 bonus scores to the total.
    
5. Convert the notebook to an .HTML file. 

    * The HTML file must contain the code and the output after execution.

6. Put the .HTML file in your own Github repo. 

7. Submit the link to the HTML file to Canvas

    * E.g., https://github.com/wangshusen/CS583A-2019Spring/blob/master/homework/HM4/seq2seq.html

#### Move to Drive Directory

Use this only when working on Colab.

In [None]:
from google.colab import drive
import os

# Mount my Google Drive
drive.mount('/content/drive')

# Go to the directory with the 583 data
os.chdir('/content/drive/My Drive/College/CS583')
!pwd
!ls

## 1. Data preparation

1. Download data (e.g., "deu-eng.zip") from http://www.manythings.org/anki/
2. Unzip the .ZIP file.
3. Put the .TXT file (e.g., "deu.txt") in the directory "./Data/".

### 1.1. Load and clean text

Data is loaded and extracted for several language pairs.


In [1]:
import re
import string
from unicodedata import normalize
import numpy

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text


# split a loaded document into sentences
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in  lines]
    return pairs

def clean_data(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            #line = normalize('NFD', line).encode('ascii', 'ignore')
            #line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [word.translate(table) for word in line]
            # remove non-printable chars from each token
            #line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return numpy.array(cleaned)
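
To make the cleaning steps concrete, here is a minimal standalone sketch of what happens to a single sentence. The helper name `clean_sentence` is hypothetical (it is not part of the notebook); it only mirrors the active steps of `clean_data` above: lowercase, strip ASCII punctuation, and drop tokens that are not purely alphabetic.

```python
import string

# Translation table that deletes ASCII punctuation, as in clean_data above.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_sentence(sentence):
    """Hypothetical helper: clean one sentence the way clean_data cleans each line."""
    # tokenize on white space, lowercase, and strip punctuation from each token
    words = [w.lower().translate(PUNCT_TABLE) for w in sentence.split()]
    # keep only tokens made entirely of letters (drops numbers and empty tokens)
    words = [w for w in words if w.isalpha()]
    return ' '.join(words)

print(clean_sentence("I'm not nervous at all."))
print(clean_sentence("Puis-je utiliser 2 feuilles ?"))
```

Note that apostrophes and hyphens are removed without inserting a space, which is why the samples printed later contain tokens such as `im` and `puisje`.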

The language names and training splits:

In [2]:
# e.g., filename = 'Data/deu.txt'
langs = ['fra', 'spa']

filenames = {
    l : f'Data/{l}.txt'
    for l in langs
}

# e.g., n_train = 20000
n_train = {
    ('eng', 'fra') : 80000,
    ('eng', 'spa') : 60000
}

print('Languages:', *langs)

Languages: fra spa


In [3]:
def load_pairs(fname):
    # load dataset
    doc = load_doc(fname)

    # split into Language1-Language2 pairs
    pairs = to_pairs(doc)
    
    return pairs

def clean_data_pairs(pairs):
    return clean_data(pairs)

# Load the dataset as pairs
pairs = {
    ('eng', l): load_pairs(filenames[l])
    for l in langs
}
    

data = {
    p: clean_data_pairs(pairs[p])
    for p in pairs
}

# clean sentences (training data)
clean_pairs = {
    p: data[p][:n_train[p]]
    for p in pairs
}

# dirty sentences (test data)
dirty_pairs = {
    p: data[p][n_train[p]:]
    for p in pairs
}

Shuffle the data, keeping identical inputs together.

In [4]:
import random

for p in clean_pairs:
    ps = clean_pairs[p]
    
    xs = list(set(x[0] for x in ps))
    ys = {}
    for x, y in ps:
        if x not in ys:
            ys[x] = [y]
        else:
            ys[x] += [y]
    
    # Shuffle the data.
    random.shuffle(xs)
    for x in xs:
        random.shuffle(ys[x])
    
    # Build the data again, but shuffled. Keep identical inputs together.
    clean_pairs[p] = numpy.array([(x,y) for x in xs for y in ys[x]])
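
The grouped shuffle above can be sketched on a toy list as follows. This is an illustrative rewrite under my own assumptions, not the cell itself: `shuffle_grouped` and the `seed` parameter are hypothetical names, and a `defaultdict` replaces the manual dictionary bookkeeping. One plausible benefit of keeping identical inputs adjacent is that a later positional train/validation split is less likely to scatter the same English sentence across both parts.

```python
import random
from collections import defaultdict

def shuffle_grouped(pairs, seed=None):
    """Shuffle (source, target) pairs while keeping identical sources adjacent."""
    rng = random.Random(seed)
    # group all targets by their source sentence
    groups = defaultdict(list)
    for src, tgt in pairs:
        groups[src].append(tgt)
    # shuffle the order of the distinct sources
    sources = list(groups)
    rng.shuffle(sources)
    shuffled = []
    for src in sources:
        # shuffle the targets that share this source
        targets = groups[src]
        rng.shuffle(targets)
        shuffled.extend((src, tgt) for tgt in targets)
    return shuffled

toy_pairs = [
    ('tom joined us', 'tom se unió a nosotros'),
    ('i plan to go there', 'pienso ir allí'),
    ('tom joined us', 'se nos unió tom'),
    ('she is not tall', 'ella no es alta'),
]
result = shuffle_grouped(toy_pairs, seed=0)
for src, tgt in result:
    print('[' + src + '] => [' + tgt + ']')
```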

In [5]:
for p in clean_pairs:
    print(p[0], '=>', p[1])
    print('=' * (len(p[0]) + len(p[1]) + 4))
    
    for i in range(3000, 3010):
        print('[' + clean_pairs[p][i, 0] + '] => [' + clean_pairs[p][i, 1] + ']')

eng => fra
[put your hat back on] => [remets ton chapeau]
[keep your coat on] => [garde ton manteau sur toi]
[he told me where to go] => [il ma dit où aller]
[may i use some paper] => [puisje utiliser un peu de papier]
[i am afraid to go] => [jai peur dy aller]
[i am afraid to go] => [jai peur de my rendre]
[im not nervous at all] => [je ne suis pas du tout nerveux]
[im not nervous at all] => [je ne suis pas du tout nerveuse]
[maybe next time] => [peutêtre la prochaine fois]
[it was frightening] => [cétait effrayant]
eng => spa
[tom didnt come did he] => [tomás no vino]
[i like the way tom sings] => [me agrada la forma en que canta tom]
[maybe its a trap] => [tal vez es una trampa]
[its a lot of work] => [es mucho trabajo]
[she is not tall] => [ella no es alta]
[tom and mary went outside] => [tom y mary salieron fuera]
[tom joined us] => [tom se unió a nosotras]
[tom joined us] => [se nos unió tom]
[tom joined us] => [tom se unió a nosotros]
[i plan to go there] => [pienso ir allí]


Now we can enforce properties of the dataset: each target sentence is wrapped in a start token (`\t`) and an end token (`\n`).

In [6]:
def generate_texts(pairs):
    input_texts = pairs[:,0]
    target_texts = ['\t' + text + '\n' for text in pairs[:, 1]]
    return input_texts, target_texts

input_texts = {}
target_texts = {}

for p in pairs:
    input_texts[p], target_texts[p] = generate_texts(clean_pairs[p])
    
    print(f'Length of input_texts[{p[0]}, {p[1]}]:  ' + str(input_texts[p].shape))
    # target_texts[p] is a plain list, so report its length as a 1-tuple
    print(f'Length of target_texts[{p[0]}, {p[1]}]: ' + str((len(target_texts[p]),)))

Length of input_texts[eng, fra]:  (80000,)
Length of target_texts[eng, fra]: (80000,)
Length of input_texts[eng, spa]:  (60000,)
Length of target_texts[eng, spa]: (60000,)


In [7]:
def maxlen(lines):
    return max(len(line) for line in lines)

# Build the maximum input and output lengths
max_encoder_seq_length = {l: 0 for l in set(a for a,b in pairs)}
max_decoder_seq_length = {l: 0 for l in set(b for a,b in pairs)}

for a,b in pairs:
    max_encoder_seq_length[a] = max(max_encoder_seq_length[a], maxlen(input_texts[a,b]))
    max_decoder_seq_length[b] = max(max_decoder_seq_length[b], maxlen(target_texts[a,b]))
    
print('Encoder data available for:', *max_encoder_seq_length)
print('Decoder data available for:', *max_decoder_seq_length)

for l in max_encoder_seq_length:
    print(f'max length of input  sentences in {l}: %d' % (max_encoder_seq_length[l]))   
for l in max_decoder_seq_length:
    print(f'max length of output sentences in {l}: %d' % (max_decoder_seq_length[l]))

Encoder data available for: eng
Decoder data available for: fra spa
max length of input  sentences in eng: 28
max length of output sentences in fra: 72
max length of output sentences in spa: 68


**Remark:** At this point, you have two lists of sentences: input_texts and target_texts.

## 2. Text processing

### 2.1. Convert texts to sequences

- Input: A list of $n$ sentences (with max length $t$).
- It is represented by a $n\times t$ matrix after the tokenization and zero-padding.
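
As a rough illustration of the two bullets above, the sketch below builds a character index and a zero-padded $n\times t$ matrix using only NumPy. It is not the Keras `Tokenizer`; the function names are hypothetical, ties between equally frequent characters are broken alphabetically (an assumption made for determinism), and overlong texts are simply cut at the end.

```python
from collections import Counter
import numpy as np

def build_char_index(texts):
    """Map each character to an index >= 1, most frequent first (0 is kept for padding)."""
    counts = Counter(ch for text in texts for ch in text)
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i + 1 for i, ch in enumerate(ordered)}

def encode_and_pad(texts, index, max_len):
    """Return an (n, max_len) int matrix, zero-padded on the right."""
    out = np.zeros((len(texts), max_len), dtype='int32')
    for row, text in enumerate(texts):
        ids = [index[ch] for ch in text][:max_len]
        out[row, :len(ids)] = ids
    return out

texts = ['go', 'good']
index = build_char_index(texts)
matrix = encode_and_pad(texts, index, max_len=4)
print(index)
print(matrix)
```

Reserving index 0 for padding is why the vocabulary size used later is the number of distinct characters plus one.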

In [8]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# encode and pad sequences
def text2sequences(max_len, lines):
    tokenizer = Tokenizer(char_level=True, filters='')
    tokenizer.fit_on_texts(lines)
    seqs_pad = encode_text(lines, tokenizer, max_len)
    return seqs_pad, tokenizer.word_index, tokenizer

def encode_text(texts, tokenizer, max_len):
    seqs = tokenizer.texts_to_sequences(texts)
    seqs_pad = pad_sequences(seqs, maxlen=max_len, padding='post')
    return seqs_pad

encoder_input_seq = {}
input_token_index = {}

decoder_input_seq = {}
target_token_index = {}


# Generate all of the tokenizers
encoder_tokenizer = {}
decoder_tokenizer = {}

for a,b in clean_pairs:
    inputs = input_texts[a,b]
    targets = target_texts[a,b]
    
    # Augment the encoding tokenizer
    if a not in encoder_tokenizer:
        encoder_tokenizer[a] = tok = Tokenizer(char_level=True, filters='')
    else:
        tok = encoder_tokenizer[a]
    # Fit on the inputs of every pair, including the first one seen for this language
    tok.fit_on_texts(inputs)
    
    # Augment the decoding tokenizer
    if b not in decoder_tokenizer:
        decoder_tokenizer[b] = tok = Tokenizer(char_level=True, filters='')
    else:
        tok = decoder_tokenizer[b]
    tok.fit_on_texts(targets)


print('Encoder tokenizer available for:', *encoder_tokenizer)
print('Decoder tokenizer available for:', *decoder_tokenizer)

# Generate the token_indices (map tokens to indices)
input_token_index = {l: encoder_tokenizer[l].word_index for l in encoder_tokenizer}
target_token_index = {l: decoder_tokenizer[l].word_index for l in decoder_tokenizer}

for l in input_token_index:
    print(f'input_token_index[{l}]:', input_token_index[l])
    
for l in target_token_index:
    print(f'target_token_index[{l}]:', target_token_index[l])

encoder_input_seq = {
    p: encode_text(input_texts[p], encoder_tokenizer[p[0]], max_encoder_seq_length[p[0]])
    for p in clean_pairs
}

decoder_input_seq = {
    p: encode_text(target_texts[p], decoder_tokenizer[p[1]], max_decoder_seq_length[p[1]])
    for p in clean_pairs
}

decoder_target_seq = {}
for l in clean_pairs:
    decoder_target_seq[l] = numpy.zeros(decoder_input_seq[l].shape)
    decoder_target_seq[l][:, 0:-1] = decoder_input_seq[l][:, 1:]

# Print the data shapes
for p in clean_pairs:
    print(f'shape of encoder_input_seq[{p[0]}, {p[1]}]: ' + str(encoder_input_seq[p].shape))
    print(f'shape of input_token_index[{p[0]}, {p[1]}]: ' + str(len(input_token_index[p[0]])))
    print(f'shape of decoder_input_seq[{p[0]}, {p[1]}]: ' + str(decoder_input_seq[p].shape))
    print(f'shape of target_token_index[{p[0]}, {p[1]}]: ' + str(len(target_token_index[p[1]])))

Using TensorFlow backend.


Encoder tokenizer available for: eng
Decoder tokenizer available for: fra spa
input_token_index[eng]: {' ': 1, 'e': 2, 't': 3, 'o': 4, 'i': 5, 'a': 6, 's': 7, 'h': 8, 'n': 9, 'r': 10, 'l': 11, 'd': 12, 'm': 13, 'y': 14, 'u': 15, 'w': 16, 'g': 17, 'c': 18, 'p': 19, 'k': 20, 'f': 21, 'b': 22, 'v': 23, 'j': 24, 'x': 25, 'z': 26, 'q': 27, 'é': 28}
target_token_index[fra]: {' ': 1, 'e': 2, 's': 3, 'a': 4, 't': 5, 'i': 6, 'u': 7, 'n': 8, 'o': 9, 'r': 10, '\t': 11, '\n': 12, 'l': 13, 'm': 14, 'p': 15, 'c': 16, 'd': 17, 'v': 18, 'é': 19, 'j': 20, 'q': 21, 'f': 22, 'b': 23, 'h': 24, 'g': 25, 'z': 26, 'x': 27, 'à': 28, 'ê': 29, 'è': 30, 'y': 31, 'ç': 32, 'ù': 33, 'ô': 34, 'î': 35, 'û': 36, 'â': 37, 'œ': 38, 'k': 39, 'ï': 40, 'w': 41, 'ë': 42}
target_token_index[spa]: {' ': 1, 'e': 2, 'a': 3, 'o': 4, 's': 5, 'n': 6, 'r': 7, 't': 8, '\t': 9, '\n': 10, 'l': 11, 'i': 12, 'u': 13, 'm': 14, 'd': 15, 'c': 16, 'p': 17, 'b': 18, 'v': 19, 'h': 20, 'g': 21, 'q': 22, 'é': 23, 'y': 24, 'á': 25, 'í': 26, 'ó':

In [9]:
num_encoder_tokens = {
    l: len(input_token_index[l]) + 1
    for l in encoder_tokenizer
}

num_decoder_tokens = {
    l: len(target_token_index[l]) + 1
    for l in decoder_tokenizer
}

for l in encoder_tokenizer:
    print(f'num_encoder_tokens[{l}]: ' + str(num_encoder_tokens[l]))
    
for l in decoder_tokenizer:
    print(f'num_decoder_tokens[{l}]: ' + str(num_decoder_tokens[l]))

num_encoder_tokens[eng]: 29
num_decoder_tokens[fra]: 43
num_decoder_tokens[spa]: 37


**Remark:** At this point, the input language and target language texts are converted to 2 matrices. 

- Both have n_train rows.
- Their numbers of columns are max_encoder_seq_length and max_decoder_seq_length, respectively.

The following cells print a sentence and its representation as a sequence.

In [10]:
for l in clean_pairs:
    print(l[0], '=>', l[1], ':', repr(target_texts[l][100]))

eng => fra : '\tcomme cest embarrassant\n'
eng => spa : '\tyo no sé su verdadero nombre\n'


In [11]:
for l in clean_pairs:
    print(l[0], '=>', l[1], ':', repr(decoder_input_seq[l][100, :]))

eng => fra : array([11, 16,  9, 14, 14,  2,  1, 16,  2,  3,  5,  1,  2, 14, 23,  4, 10,
       10,  4,  3,  3,  4,  8,  5, 12,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0], dtype=int32)
eng => spa : array([ 9, 24,  4,  1,  6,  4,  1,  5, 23,  1,  5, 13,  1, 19,  2,  7, 15,
        3, 15,  2,  7,  4,  1,  6,  4, 14, 18,  7,  2, 10,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
      dtype=int32)
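
The decoder target used for training is the decoder input shifted one step to the left, so that at each position the model is asked to predict the next character. A small sketch of that shift on toy sequences follows; the index values are taken from the French token index printed earlier (`'\t'` = 11, `'\n'` = 12, `'c'` = 16, `'o'` = 9, `'e'` = 2), and the two toy sentences are made up for illustration.

```python
import numpy as np

# '\tco\n' and '\te\n', zero-padded to length 6
decoder_input = np.array([[11, 16, 9, 12, 0, 0],
                          [11, 2, 12, 0, 0, 0]])

# Shift left by one step; the last column stays zero (padding)
decoder_target = np.zeros_like(decoder_input)
decoder_target[:, :-1] = decoder_input[:, 1:]

print(decoder_target)
```

The start token disappears from the target, and the end token `'\n'` becomes the last real symbol the decoder has to predict.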


### 2.2. One-hot encode

- Input: A list of $n$ sentences (with max length $t$).
- It is represented by a $n\times t$ matrix after the tokenization and zero-padding.
- It is represented by a $n\times t \times v$ tensor ($v$ is the number of unique chars plus one for the padding index) after the one-hot encoding.
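
A compact way to see the shape change is to index an identity matrix with the integer sequences. This is only a sketch with a hypothetical helper name (`onehot_sketch`), assuming a vocabulary size that already includes the padding index 0.

```python
import numpy as np

def onehot_sketch(seqs, vocab_size):
    """Turn an (n, t) integer matrix into an (n, t, vocab_size) one-hot tensor."""
    # Row k of the identity matrix is the one-hot vector for index k
    return np.eye(vocab_size, dtype='float32')[seqs]

seqs = np.array([[2, 1, 0, 0],
                 [2, 1, 1, 3]])
encoded = onehot_sketch(seqs, vocab_size=4)
print(encoded.shape)
print(encoded[0, 0])
```

Padding positions (index 0) are encoded as the first class rather than as all-zero rows, so they still contribute to a categorical cross-entropy loss unless they are masked.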

In [12]:
from keras.utils import to_categorical

# one hot encode target sequence
def onehot_encode(sequences, max_len, vocab_size):
    n = len(sequences)
    data = numpy.zeros((n, max_len, vocab_size))
    for i in range(n):
        data[i, :, :] = to_categorical(sequences[i], num_classes=vocab_size)
    return data

"""
encoder_input_data = {
    p: onehot_encode(encoder_input_seq[p], max_encoder_seq_length[p[0]], num_encoder_tokens[p[0]])
    for p in clean_pairs
}
decoder_input_data = {
    p: onehot_encode(decoder_input_seq[p], max_decoder_seq_length[p[1]], num_decoder_tokens[p[1]])
    for p in clean_pairs
}

for l in clean_pairs:
    print(f'size of encoder_input_data[{l[0]}, {l[1]}]:', encoder_input_data[l].shape)
    print(f'size of decoder_input_data[{l[0]}, {l[1]}]:', decoder_input_data[l].shape) 

"""

decoder_target_data = {}
    
for l in clean_pairs:
    decoder_target_data[l] = onehot_encode(decoder_target_seq[l], 
                                        max_decoder_seq_length[l[1]], 
                                        num_decoder_tokens[l[1]])   

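### 2.3. Data Batching

When training the models, we want every epoch to cover the whole dataset for every language pair (up to a final partial batch, which is dropped). Each batch is therefore drawn from a single translation direction, and the batches of all directions are shuffled together, so each training step fits one translator on one batch of its x-y pairs.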
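
The batching idea can be sketched without any model code: split each dataset into full batches of shuffled indices, then shuffle the batches across datasets. The function name `mixed_batches` and the `seed` parameter are hypothetical; the dataset sizes in the last lines assume the 80/20 split of 80000 and 60000 training pairs used in this notebook.

```python
import random

def mixed_batches(dataset_sizes, batch_size, seed=None):
    """Return a shuffled list of (dataset_key, index_list) batches."""
    rng = random.Random(seed)
    slices = []
    for key, n in dataset_sizes.items():
        # shuffle the sample indices of this dataset
        idxs = list(range(n))
        rng.shuffle(idxs)
        # cut into full batches; a trailing partial batch is dropped
        for end in range(batch_size, n + 1, batch_size):
            slices.append((key, idxs[end - batch_size:end]))
    # interleave the batches of all datasets at random
    rng.shuffle(slices)
    return slices

batches = mixed_batches({('eng', 'fra'): 10, ('eng', 'spa'): 7}, batch_size=3, seed=1)
for key, idxs in batches:
    print(key, idxs)

# Number of training batches per epoch for 64000 and 48000 samples with batch size 64
n_train_batches = sum(n // 64 for n in (64000, 48000))
print(n_train_batches)
```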

In [13]:
import random

class MultiDataGenerator:
    def __init__(self, Xs, Ys, batch_size=64):
        self.Xs = Xs
        self.Ys = Ys
        self.indices = None
        self.batch_size = batch_size

        self.X = None
        self.Y = None

    def __iter__(self):
        Xs = self.Xs
        Ys = self.Ys
        batch_size = self.batch_size

        slices = []
        
        X = []
        Y = []

        # Select random slices
        for i in Xs:
            ys = Ys[i]
            
            # Shuffle each dataset
            idxs = list(range(len(ys)))
            random.shuffle(idxs)

            # Select slices
            slices.extend([(i, idxs[j-batch_size:j])
                for j in range(batch_size, len(ys)+1, batch_size)])
            

        random.shuffle(slices)
            
        for i, s in slices:
            p = i
            X = [x[s] for x in Xs[i]]
            Y = Ys[i][s]
            
            yield p, X, Y

    def __len__(self):
        # The size is the number of possible batches
        return sum(len(self.Ys[p]) // self.batch_size for p in self.Ys)

In [14]:
Xs = {
    #p: [encoder_input_data[p], decoder_input_data[p]]
    p: [encoder_input_seq[p], decoder_input_seq[p]]
    for p in clean_pairs
}

Ys = {
    p: decoder_target_data[p]
    for p in clean_pairs
}

# Split the data
validation_split = 0.2
batch_size = 64

Xs_train = {
    p: [X[:int((1-validation_split)*len(X))] for X in Xs[p]]
    for p in Xs
}
Xs_valid = {
    p: [X[int((1-validation_split)*len(X)):] for X in Xs[p]]
    for p in Xs
}

Ys_train = {
    p: Ys[p][:int((1-validation_split)*len(Ys[p]))]
    for p in Ys
}
Ys_valid = {
    p: Ys[p][int((1-validation_split)*len(Ys[p])):]
    for p in Ys
}

gen_train = MultiDataGenerator(Xs_train, Ys_train, batch_size)
gen_valid = MultiDataGenerator(Xs_valid, Ys_valid, batch_size)

gen_train.__iter__()

print('Batch size:', batch_size)
print('Training batch count:', len(gen_train))
print('Validation batch count:', len(gen_valid))

Batch size: 64
Training batch count: 1750
Validation batch count: 437


## 3. Build the networks (for training)

- Build encoder, decoder, and connect the two modules to get "model". 

- Fit the model on the bilingual data to train the parameters in the encoder and decoder.

### 3.0. Miscellaneous layers

To build our models, special layers will be needed. These include:

* Attention context generation
* Bidirectional LSTM
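
Before the Keras version, here is a NumPy sketch of the attention computation, assuming a bilinear ("general") score between each decoder state and each encoder state, with the bias term omitted. The function names and the random toy tensors are illustrative only; `W` stands in for a learned weight matrix.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def general_attention(x, h, W):
    """x: (batch, t_dec, d) decoder states, h: (batch, t_enc, d) encoder states, W: (d, d)."""
    # score[b, i, j] = x[b, i] . (h[b, j] @ W)
    scores = x @ np.transpose(h @ W, (0, 2, 1))   # (batch, t_dec, t_enc)
    # normalize over encoder positions
    weights = softmax(scores, axis=2)
    # context[b, i] = weighted sum of encoder states
    context = weights @ h                          # (batch, t_dec, d)
    return weights, context

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))
h = rng.normal(size=(2, 7, 8))
W = rng.normal(size=(8, 8))
weights, context = general_attention(x, h, W)
print(weights.shape, context.shape)
```

Each row of `weights` sums to 1 over the encoder positions, and the context vector has the same width as the encoder states, so it can be concatenated with the decoder output.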

In [15]:
from keras.layers import Activation, Concatenate, Dense, Input, Lambda, Layer, LSTM, Permute, Softmax
from keras.models import Model
import keras.backend as K

def general_score(x, h):
    """ Attention score computed as such:
    score(x, h) = x^T Wh
    """
    dim = x.shape.as_list()[-1]
    
    # Weighted state vectors
    H = Dense(dim)(h)
    
    # Transpose the inputs
    hT = Permute((2,1))(H)

    # Dot product (yields score alpha)
    a = Lambda(lambda z: K.batch_dot(z[0], z[1]), name='attention')([x, hT])
    a = Softmax(axis=2)(a)
    
    return a

def Attention(x, h, score=general_score, concat_original=True):
    """ Attention using the weighted dot score.
    """
    a = score(x, h)
    
    # Compute the context for each position
    c = Lambda(lambda x: K.batch_dot(x[0], x[1]), name='context')([a, h])
    
    # Concatenate to the original input
    if concat_original:
        return Concatenate(axis=-1)([c, x])
    else:
        return c

def AttentionCell(num_decoder_tokens, latent_dim):
    # inputs of the decoder network
    x = Input(shape=(1, num_decoder_tokens + latent_dim))
    h = Input(shape=(latent_dim,))
    c = Input(shape=(latent_dim,))
    hs = Input(shape=(None, latent_dim))

    # set the LSTM layer
    decoder_lstm = LSTM(latent_dim, return_sequences=False,
                        return_state=True, dropout=0.5,
                        input_shape=x.shape[1:])
    y, _h, _c = decoder_lstm(x, initial_state=[h, c])

    # Compute attentional context
    context = Attention(y, hs, concat_original=False)

    return Model([x, h, c, hs], [y, _h, _c, context])

def AttentionRNN(vocab_size, latent_dim, length):
    """ Creates an unfolded RNN for training with Attention
    x      - The input sequence
    state  - The state input for the base RNN
    hs     - The encoder states for generating the attention vector
    length - The length of the RNN unfolding
    """
    x = Input(shape=(length, vocab_size))
    h = h0 = Input(shape=(latent_dim,))
    c = c0 = Input(shape=(latent_dim,))
    hs = Input(shape=(None, latent_dim))

    cell = AttentionCell(vocab_size, latent_dim)

    # Initial context vector
    ctx = zero = K.zeros((1, latent_dim))
    ctx = Lambda(lambda xs: K.tile(ctx, [K.shape(x)[0], 1]))(x)

    ys = []

    for i in range(length):
        # Get the current index
        z = Lambda(lambda z: z[:,i])(x)
        # Apply the context vector
        z = Concatenate(axis=-1)([z, ctx])
        # Expand to allow for application to the LSTM
        z = Lambda(lambda z: K.expand_dims(z, 1))(z)
        # Apply the attentive LSTM layer to one input.
        # Given state and encoder state, outputs new state and context
        y, h, c, ctx = cell([z, h, c, hs])

        # Save the output
        ys.append(y)

    # Concatenate all outputs
    ys = list(map(Lambda(lambda z: K.expand_dims(z, 1)), ys))
    y = Concatenate(axis=1)(ys)

    # Return the end result
    return Model([x, h0, c0, hs], [y, h, c, ctx]), cell


In [16]:
from keras.layers import Bidirectional, Concatenate, LSTM

def Bidirect(lstm, encoder_inputs):
    """ Handles the functional call for creating a Bidirectional LSTM all in one.
    """
    encoder_bilstm = Bidirectional(lstm)
    
    ys = encoder_bilstm(encoder_inputs)
    
    y = ys[0]
    hs = ys[1:]
    
    ys = [y]
    
    # Concatenate the states
    c = len(hs) // 2
    for i in range(c):
        h = Concatenate()([hs[i], hs[i+c]])
        ys.append(h)

    return ys
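
The state merging in `Bidirect` can be illustrated with plain arrays. The sketch assumes the state list is ordered as forward hidden, forward cell, backward hidden, backward cell, which is the order the pairing `hs[i]`, `hs[i+c]` relies on; the helper name and the constant-filled arrays are hypothetical.

```python
import numpy as np

def merge_bidirectional_states(states):
    """Pair each forward state with its backward counterpart and concatenate them."""
    half = len(states) // 2
    return [np.concatenate([states[i], states[i + half]], axis=-1)
            for i in range(half)]

batch, units = 4, 128            # units = latent_dim // 2 per direction
h_fwd = np.full((batch, units), 1.0)
c_fwd = np.full((batch, units), 2.0)
h_bwd = np.full((batch, units), 3.0)
c_bwd = np.full((batch, units), 4.0)

h, c = merge_bidirectional_states([h_fwd, c_fwd, h_bwd, c_bwd])
print(h.shape, c.shape)
```

Using half the latent dimension per direction means the concatenated states have exactly `latent_dim` entries, which matches the state size a decoder LSTM with `latent_dim` units expects.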
    

### 3.1. Encoder network

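- Input:  integer-encoded characters of the input language (a frozen identity `Embedding` layer turns them into one-hot vectors)

- Return: 

    -- output (all the hidden states   $h_1, \cdots , h_t$), kept here because the decoder's attention layer uses them
    
    -- the final hidden state  $h_t$
    
    -- the final conveyor belt $c_t$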

In [26]:
from keras.layers import Input, LSTM, GRU, Embedding
from keras.models import Model

import keras.backend as K

latent_dim = 256

def build_encoder(num_encoder_tokens, name='encoder'):
    # inputs of the encoder network
    encoder_inputs = Input(shape=(None,),  name=f'{name}_inputs', dtype='int32')
    
    
    inputs = Embedding(num_encoder_tokens, num_encoder_tokens,
                       embeddings_initializer='identity', trainable=False)(encoder_inputs)

    # set the LSTM layer
    encoder_lstm = LSTM(latent_dim // 2, return_state=True, return_sequences=True,
                        dropout=0.5, name='encoder_lstm')
    outputs = Bidirect(encoder_lstm, inputs) #encoder_lstm(encoder_inputs)

    # build the encoder network model
    encoder_model = Model(inputs=encoder_inputs, 
                          outputs=outputs,
                          name='encoder')

    return encoder_model

encoder_model = {
    l: build_encoder(num_encoder_tokens[l], name=f'encoder_{l}')
    for l in encoder_tokenizer
}

Print a summary and save each encoder network structure to a PDF file (one per source language).

In [None]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot, plot_model

for l in encoder_tokenizer:
    SVG(model_to_dot(encoder_model[l], show_shapes=False).create(prog='dot', format='svg'))

    plot_model(
        model=encoder_model[l], show_shapes=False,
        # l is a language code string such as 'eng', so use it whole
        to_file=f'encoder_{l}.pdf'
    )

In [18]:
for l in encoder_tokenizer:
    print(l)
    print('=' * len(l))
    encoder_model[l].summary()

eng
===
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_eng_inputs (InputLayer) (None, None)         0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 29)     841         encoder_eng_inputs[0][0]         
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) [(None, None, 256),  121344      embedding_1[0][0]                
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 256)          0           bidirectional_1[0][1]            
                                                                 bidirectional_1[0][2]            
To

### 3.2. Decoder network

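- Inputs:  

    -- integer-encoded characters of the target language (one-hot encoded by a frozen identity `Embedding` layer)
    
    -- all the encoder hidden states (used by the attention layer)
    
    -- The initial hidden state $h_t$ (the encoder's final hidden state)
    
    -- The initial conveyor belt $c_t$ (the encoder's final conveyor belt)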

- Return: 

    -- output (all the hidden states) $h_1, \cdots , h_t$

    -- the final hidden state  $h_t$ (discarded in the training and used in the prediction)
    
    -- the final conveyor belt $c_t$ (discarded in the training and used in the prediction)

In [28]:
from keras.layers import Dense
from keras.models import Model

def build_decoder(num_decoder_tokens, name='decoder'):
    # inputs of the decoder network
    decoder_input_s = Input(shape=(None, latent_dim), name=f'{name}_input_s')
    decoder_input_h = Input(shape=(latent_dim,), name=f'{name}_input_h')
    decoder_input_c = Input(shape=(latent_dim,), name=f'{name}_input_c')
    decoder_input_x = Input(shape=(None,), name='decoder_input_x')
    
    state_inputs = [
        decoder_input_h,
        decoder_input_c,
    ]
    
    input_x = Embedding(num_decoder_tokens, num_decoder_tokens,
                        embeddings_initializer='identity', trainable=False)(decoder_input_x)
    
    # set the LSTM layer
    decoder_lstm = LSTM(latent_dim, return_sequences=True, 
                        return_state=True, dropout=0.5, name='decoder_lstm')
    decoder_lstm_outputs = decoder_lstm(input_x, initial_state=state_inputs)
    
    decoder_lstm_outputs, state_outputs = decoder_lstm_outputs[0], decoder_lstm_outputs[1:]
    
    # set the attention layer
    attn_outputs = Attention(decoder_lstm_outputs, decoder_input_s)
    
    
    """
    decoder_attn, attn_cell = AttentionRNN(32, latent_dim, max_decoder_seq_length[name[8:]])
    attn_outputs, state_h, state_c, _ = decoder_attn([input_x, decoder_input_h, decoder_input_c, decoder_input_s])
    """
    
    # set the dense layer
    decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='decoder_dense')
    decoder_outputs = decoder_dense(attn_outputs)
    
    inputs = [decoder_input_x, decoder_input_s] + state_inputs

    outputs = [decoder_outputs] + state_outputs
    
    # build the decoder network model
    decoder_model = Model(
        inputs=inputs,
        outputs=outputs,         
        name='decoder')
    
    return decoder_model#, attn_cell

decoder_model = {
    l: build_decoder(num_decoder_tokens[l], name=f'decoder_{l}')
    for l in decoder_tokenizer
}

Print a summary and save each decoder network structure to "./decoder_&lt;lang&gt;.pdf" (e.g., "./decoder_fra.pdf")

In [None]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot, plot_model

for l in decoder_tokenizer:
    SVG(model_to_dot(decoder_model[l], show_shapes=False).create(prog='dot', format='svg'))

    plot_model(
        model=decoder_model[l], show_shapes=False,
        to_file=f'decoder_{l}.pdf'
    )

In [20]:
for l in decoder_tokenizer:
    print(l)
    print('=' * len(l))
    decoder_model[l].summary()

fra
===
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
decoder_input_x (InputLayer)    (None, None)         0                                            
__________________________________________________________________________________________________
decoder_fra_input_s (InputLayer (None, None, 256)    0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, None, 43)     1849        decoder_input_x[0][0]            
__________________________________________________________________________________________________
decoder_fra_input_h (InputLayer (None, 256)          0                                            
__________________________________________________________________________________________________
de

### 3.3. Connect the encoder and decoder

In [29]:
def build_translator(src, tgt, ename='encoder', dname='decoder'):
    encoder = encoder_model[src]
    decoder = decoder_model[tgt]
    
    # input layers
    encoder_input_x = Input(shape=(None,), name=f'{ename}_input_x')
    decoder_input_x = Input(shape=(None,), name=f'{dname}_input_x')

    # connect encoder to decoder
    encoder_final_states = encoder([encoder_input_x])
    
    x = [decoder_input_x] + encoder_final_states
    
    decoder_pred = decoder(x)[0]

    return Model(inputs=[encoder_input_x, decoder_input_x], 
                  outputs=decoder_pred, 
                  name=f'{src}2{tgt}_training')

model = {
    l: build_translator(l[0], l[1], ename=f'enc_{l[0]}_{l[1]}', dname=f'dec_{l[0]}_{l[1]}')
    for l in clean_pairs
}

In [None]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot, plot_model

for l in clean_pairs:
    SVG(model_to_dot(model[l], show_shapes=False).create(prog='dot', format='svg'))

    plot_model(
        model=model[l], show_shapes=False,
        to_file=f'model_training_{l[0]}_{l[1]}.pdf'
    )

In [22]:
for l in clean_pairs:
    print(l)
    print('=' * len(str(l)))
    model[l].summary()

('eng', 'fra')
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
enc_eng_fra_input_x (InputLayer (None, None)         0                                            
__________________________________________________________________________________________________
dec_eng_fra_input_x (InputLayer (None, None)         0                                            
__________________________________________________________________________________________________
encoder (Model)                 [(None, None, 256),  122185      enc_eng_fra_input_x[0][0]        
__________________________________________________________________________________________________
decoder (Model)                 [(None, None, 43), ( 320100      dec_eng_fra_input_x[0][0]        
                                                                 encoder[1][0]                

### 3.5. Fit the model on the bilingual dataset

- encoder_input_data: one-hot encode of the input language

- decoder_input_data: one-hot encode of the target language

- decoder_target_data: labels (left shift of decoder_input_data)

- tune the hyper-parameters

- stop when the validation loss stops decreasing.
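
One simple way to decide when the validation loss has stopped decreasing is a patience rule. The sketch below is an assumption on my part, not something the notebook implements: `should_stop`, `patience`, and `min_delta` are hypothetical names, and the first list reuses the French validation losses from the first six epochs of the training log.

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """Return True if the last `patience` epochs did not improve on the earlier best."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    # stop when the recent epochs fail to beat the earlier best by at least min_delta
    return recent_best > best_before - min_delta

still_improving = [0.662418, 0.554021, 0.498703, 0.459998, 0.429379, 0.413335]
plateaued = [0.5, 0.4, 0.41, 0.42, 0.43]

print(should_stop(still_improving))
print(should_stop(plateaued))
```

In a manual training loop, a check like this could be evaluated once per epoch, per language pair, on the recorded validation losses.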

In [23]:
for l in clean_pairs:
    #print(f'shape of encoder_input_data[{l[0]}, {l[1]}]:', str(encoder_input_data[l].shape))
    #print(f'shape of decoder_input_data[{l[0]}, {l[1]}]:', str(decoder_input_data[l].shape))
    print(f'shape of decoder_target_data[{l[0]}, {l[1]}]:', str(decoder_target_data[l].shape))

shape of decoder_target_data[eng, fra]: (80000, 72, 43)
shape of decoder_target_data[eng, spa]: (60000, 68, 37)


In [30]:
for l in clean_pairs:
    print('Compiling', l[0], 'to', l[1], 'translator')
    model[l].compile(optimizer='rmsprop', loss='categorical_crossentropy')

Compiling eng to fra translator
Compiling eng to spa translator


In [31]:
import sys, time

epochs = 48

history = {
    p: {
        'loss': [],
        'val_loss': []
    }
    for p in clean_pairs
}
    
for ep in range(epochs):
    print(f'Epoch {ep+1} of {epochs}...')
    
    loss = {}
    
    start_time = time.time()
    
    it = 0
    for p, X, Y in gen_train:
        it += 1
        l = model[p].train_on_batch(X, Y)
        
        # Save the loss for metric computation
        if p in loss:
            loss[p].append(l)
        else:
            loss[p] = [l]
        
        # Print iteration info
        if it % 10 == 0 or it == len(gen_train):
            # Print iteration number
            sys.stdout.write(f'\rIteration {it} of {len(gen_train)}:')
            
            # Print losses
            for q in clean_pairs:
                if q in loss:
                    l = sum(loss[q]) / len(loss[q])
                    sys.stdout.write(f' loss[{q[0]}, {q[1]}]: {l:0.6f}')
                
            # Flush stdout buffer (display the text)
            sys.stdout.flush()
    
    # Save training metrics
    for p in loss:
        history[p]['loss'].append(sum(loss[p]) / len(loss[p]))
    
    # Go to next line to print validation stats
    sys.stdout.write('\n')
    sys.stdout.flush()
    
    # Compute the validation loss (kept separate from the training-loss dict)
    for p in clean_pairs:
        val_loss = model[p].evaluate(Xs_valid[p], Ys_valid[p], verbose=False)
        print(f'val_loss[{p[0]}, {p[1]}]: {val_loss:0.6f}')
        
        # Save validation metrics
        history[p]['val_loss'].append(val_loss)
    
    print('Time elapsed:', time.time() - start_time, 'seconds')

    # Save the results
    for p in clean_pairs:
        model[p].save(f'seq2seq_{p[0]}_{p[1]}.h5')

Epoch 1 of 48...
Iteration 1750 of 1750: loss[eng, fra]: 0.879779 loss[eng, spa]: 0.879420
val_loss[eng, fra]: 0.662418
val_loss[eng, spa]: 0.685822
Time elapsed: 442.06355476379395 seconds




Epoch 2 of 48...
Iteration 1750 of 1750: loss[eng, fra]: 0.663258 loss[eng, spa]: 0.689995
val_loss[eng, fra]: 0.554021
val_loss[eng, spa]: 0.596848
Time elapsed: 402.1472773551941 seconds
Epoch 3 of 48...
Iteration 1750 of 1750: loss[eng, fra]: 0.584082 loss[eng, spa]: 0.622725
val_loss[eng, fra]: 0.498703
val_loss[eng, spa]: 0.541984
Time elapsed: 343.5735728740692 seconds
Epoch 4 of 48...
Iteration 1750 of 1750: loss[eng, fra]: 0.534586 loss[eng, spa]: 0.580182
val_loss[eng, fra]: 0.459998
val_loss[eng, spa]: 0.506861
Time elapsed: 388.8891043663025 seconds
Epoch 5 of 48...
Iteration 1750 of 1750: loss[eng, fra]: 0.500305 loss[eng, spa]: 0.548796
val_loss[eng, fra]: 0.429379
val_loss[eng, spa]: 0.481494
Time elapsed: 501.9429557323456 seconds
Epoch 6 of 48...
Iteration 1750 of 1750: loss[eng, fra]: 0.474958 loss[eng, spa]: 0.523998
val_loss[eng, fra]: 0.413335
val_loss[eng, spa]: 0.456932
Time elapsed: 460.7687494754791 seconds
Epoch 7 of 48...
Iteration 1750 of 1750: loss[eng, fra]

Iteration 1750 of 1750: loss[eng, fra]: 0.259718 loss[eng, spa]: 0.291701
val_loss[eng, fra]: 0.227235
val_loss[eng, spa]: 0.249005
Time elapsed: 332.70329427719116 seconds
Epoch 46 of 48...
Iteration 1750 of 1750: loss[eng, fra]: 0.258774 loss[eng, spa]: 0.289434
val_loss[eng, fra]: 0.225957
val_loss[eng, spa]: 0.247610
Time elapsed: 332.39294028282166 seconds
Epoch 47 of 48...
Iteration 1750 of 1750: loss[eng, fra]: 0.257140 loss[eng, spa]: 0.287414
val_loss[eng, fra]: 0.225161
val_loss[eng, spa]: 0.247508
Time elapsed: 332.52661085128784 seconds
Epoch 48 of 48...
Iteration 1750 of 1750: loss[eng, fra]: 0.255328 loss[eng, spa]: 0.285883
val_loss[eng, fra]: 0.224718
val_loss[eng, spa]: 0.246880
Time elapsed: 332.378333568573 seconds
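
The loop above always runs the full 48 epochs. One possible way to act on the "stop when the validation loss stops decreasing" guideline in a custom `train_on_batch` loop is a patience check on the recorded validation losses. The helper below is a sketch only: `should_stop`, `patience`, and `min_delta` are hypothetical names that do not appear in the notebook.

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """Return True if the last `patience` epochs did not improve on the
    best earlier validation loss by more than `min_delta`."""
    if len(val_losses) <= patience:
        # Not enough history yet to make a decision.
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta

# Validation losses for eng -> fra from epochs 45 to 48 in the log above.
fra_val_loss = [0.227235, 0.225957, 0.225161, 0.224718]

# The loss is still decreasing, so a plain patience check keeps training.
print(should_stop(fra_val_loss, patience=3))

# Requiring an improvement of at least 0.01 would stop here instead.
print(should_stop(fra_val_loss, patience=3, min_delta=0.01))
```

In the training loop, such a check could be called once per epoch on `history[p]['val_loss']` for each language pair, breaking out of the loop when every pair reports that it should stop.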


In [32]:
from matplotlib import pyplot as plt

for p in history:
    loss = history[p]['loss']
    val_loss = history[p]['val_loss']
    
    plt.plot(range(len(loss)), loss, 'bo', label='Training loss')
    plt.plot(range(len(val_loss)), val_loss, 'ro', label='Validation loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.title(f'Translation loss: {p[0]} to {p[1]}')
    plt.legend()
    plt.show()

<Figure size 640x480 with 1 Axes>

<Figure size 640x480 with 1 Axes>

## 4. Make predictions


### 4.1. Translate English to many

1. The encoder reads a sentence (source language) and outputs its final states, $h_t$ and $c_t$.
2. Take the [start] sign "\t" and the final states $h_t$ and $c_t$ as input and run the decoder.
3. Get the new states and the predicted probability distribution.
4. Sample a char from the predicted probability distribution.
5. Take the sampled char and the new states as input and repeat the process (stop upon reaching the [stop] sign "\n").
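
Step 4 can be sketched in isolation. The snippet below assumes a made-up three-character distribution and a hypothetical helper named `sample_with_temperature`; it is meant to show how a temperature reshapes a distribution before sampling, not to replace the notebook's decoding function.

```python
import numpy as np

def sample_with_temperature(probs, temperature=0.2, rng=None):
    """Rescale a probability vector by a temperature, then draw one index."""
    rng = np.random.default_rng() if rng is None else rng
    # Work in log space; the epsilon avoids log(0).
    logits = np.log(np.asarray(probs, dtype='float64') + 1e-16) / temperature
    # Subtract the maximum for numerical stability before exponentiating.
    logits -= logits.max()
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    index = int(rng.choice(len(scaled), p=scaled))
    return index, scaled

probs = [0.5, 0.3, 0.2]  # made-up decoder output over three characters
_, sharp = sample_with_temperature(probs, temperature=0.2)
_, flat = sample_with_temperature(probs, temperature=5.0)

print('temperature 0.2:', sharp.round(3))
print('temperature 5.0:', flat.round(3))
```

A temperature below 1 concentrates probability mass on the most likely character, so decoding approaches greedy argmax; a temperature above 1 flattens the distribution and produces more varied (and more error-prone) output.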

In [33]:
# Reverse-lookup token index to decode sequences back to something readable.
reverse_input_char_index = {
    l: dict((i, char) for char, i in input_token_index[l].items())
    for l in input_token_index
}
reverse_target_char_index = {
    l: dict((i, char) for char, i in target_token_index[l].items())
    for l in target_token_index
}

for l in reverse_input_char_index:
    print(f'reverse_input_char_index[{l}]:', reverse_input_char_index[l])
    
for l in reverse_target_char_index:
    print(f'reverse_target_char_index[{l}]:', reverse_target_char_index[l])

reverse_input_char_index[eng]: {1: ' ', 2: 'e', 3: 't', 4: 'o', 5: 'i', 6: 'a', 7: 's', 8: 'h', 9: 'n', 10: 'r', 11: 'l', 12: 'd', 13: 'm', 14: 'y', 15: 'u', 16: 'w', 17: 'g', 18: 'c', 19: 'p', 20: 'k', 21: 'f', 22: 'b', 23: 'v', 24: 'j', 25: 'x', 26: 'z', 27: 'q', 28: 'é'}
reverse_target_char_index[fra]: {1: ' ', 2: 'e', 3: 's', 4: 'a', 5: 't', 6: 'i', 7: 'u', 8: 'n', 9: 'o', 10: 'r', 11: '\t', 12: '\n', 13: 'l', 14: 'm', 15: 'p', 16: 'c', 17: 'd', 18: 'v', 19: 'é', 20: 'j', 21: 'q', 22: 'f', 23: 'b', 24: 'h', 25: 'g', 26: 'z', 27: 'x', 28: 'à', 29: 'ê', 30: 'è', 31: 'y', 32: 'ç', 33: 'ù', 34: 'ô', 35: 'î', 36: 'û', 37: 'â', 38: 'œ', 39: 'k', 40: 'ï', 41: 'w', 42: 'ë'}
reverse_target_char_index[spa]: {1: ' ', 2: 'e', 3: 'a', 4: 'o', 5: 's', 6: 'n', 7: 'r', 8: 't', 9: '\t', 10: '\n', 11: 'l', 12: 'i', 13: 'u', 14: 'm', 15: 'd', 16: 'c', 17: 'p', 18: 'b', 19: 'v', 20: 'h', 21: 'g', 22: 'q', 23: 'é', 24: 'y', 25: 'á', 26: 'í', 27: 'ó', 28: 'f', 29: 'j', 30: 'z', 31: 'ñ', 32: 'ú', 33: 'x'

In [34]:
def decode_sequence(input_seq, trans, temperature=0.2):
    src, tgt = trans
    
    states_value = encoder_model[src].predict(input_seq)
    
    num_dec_toks = num_decoder_tokens[tgt]

    target_seq = numpy.zeros((1, 1))
    target_seq[0, 0] = target_token_index[tgt]['\t']
    
    reverse_target_index = reverse_target_char_index[tgt]

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model[tgt].predict([target_seq] + states_value)

        # Get the predicted probability of each character at the last time step
        p = output_tokens[0, -1, :]
        
        # Apply temperature in log space (the epsilon avoids log(0))
        p = numpy.log(p + 1e-16) / temperature
        
        # Exponentiate and renormalize into a probability distribution
        p = numpy.exp(p.astype('float64'))
        p = p / numpy.sum(p)
        
        # Randomly draw one character from the distribution (one-hot result)
        p = numpy.random.multinomial(1, p, 1)
        
        # Recover the index of the sampled character from the one-hot draw
        sampled_token_index = numpy.argmax(p)
        
        if sampled_token_index in reverse_target_index:
            sampled_char = reverse_target_index[sampled_token_index]
            decoded_sentence += sampled_char
        else:
            sampled_char = ''

        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length[tgt]):
            stop_condition = True

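        # Feed the sampled character back in as the next decoder input
        target_seq = numpy.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Carry the updated decoder states into the next step
        states_value[1:3] = [h, c]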

    return decoded_sentence


In [35]:
for l in pairs:
    for seq_index in range(2100, 2120):
        # Take one sequence (part of the training set)
        # for trying out decoding.
        input_seq = encoder_input_seq[l][seq_index: seq_index + 1]
        decoded_sentence = decode_sequence(input_seq, l)
        print('-')
        print(f'{l[0]}:        ', input_texts[l][seq_index])
        print(f'{l[1]} (true): ', target_texts[l][seq_index][1:-1])
        print(f'{l[1]} (pred): ', decoded_sentence[0:-1])


-
eng:         murder is a wicked crime
fra (true):  le meurtre est un crime odieux
fra (pred):  la mure est un crime de criment
-
eng:         i heard about your problems
fra (true):  jai entendu parler de tes problèmes
fra (pred):  jentends des problèmes de tes problèmes
-
eng:         i heard about your problems
fra (true):  jai entendu parler de vos problèmes
fra (pred):  jai entendu des problèmes de votre problème
-
eng:         you should do it
fra (true):  tu devrais le faire
fra (pred):  tu devrais le faire
-
eng:         you should do it
fra (true):  vous devriez le faire
fra (pred):  tu devrais le faire
-
eng:         why would anybody kiss me
fra (true):  pourquoi quelquun membrasseraitil
fra (pred):  pourquoi qui que ce soit de temps cela marchez
-
eng:         its not so easy
fra (true):  ce nest pas si facile
fra (pred):  ce nest pas si facile
-
eng:         youre an idiot
fra (true):  tu es un idiot
fra (pred):  tu es un idiot
-
eng:         tom wants to come with us
fra

### 4.2. Translate an English sentence to the target language

1. Tokenization
2. One-hot encode
3. Translate
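
Steps 1 and 2 can be illustrated on their own. The sketch below builds a character index from a single sentence and uses hypothetical helpers (`build_index`, `encode_chars`, `one_hot`); the notebook's own `encode_text` and `encoder_tokenizer` are defined elsewhere and may behave differently.

```python
import numpy as np

def build_index(texts):
    """Map each character to a positive integer; 0 is reserved for padding."""
    chars = sorted(set(''.join(texts)))
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode_chars(text, token_index, max_len):
    """Tokenize at the character level, then pad with zeros on the right."""
    seq = [token_index[ch] for ch in text.lower() if ch in token_index][:max_len]
    return np.array(seq + [0] * (max_len - len(seq)), dtype='int64')

def one_hot(seq, num_tokens):
    """Turn an integer sequence into a (time steps, num_tokens) one-hot matrix."""
    return np.eye(num_tokens, dtype='float32')[seq]

sentence = 'why is that'
token_index = build_index([sentence])
num_tokens = len(token_index) + 1  # +1 for the padding index

seq = encode_chars(sentence, token_index, max_len=16)
x = one_hot(seq, num_tokens)

print(seq)
print(x.shape)
```

Note that in this sketch padded positions are encoded as a one in column 0; whether padding should be all zeros instead depends on how the model was trained.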

In [36]:
input_sentence = 'why is that'
print('source sentence is: ' + input_sentence)

def translate(t, a, b, temperature=0.2):
    # Tokenize the text
    input_sequence = encode_text([t], encoder_tokenizer[a], max_encoder_seq_length[a])
    
    # Decode the translation, forwarding the caller's temperature
    translated_sentence = decode_sequence(input_sequence, (a,b), temperature=temperature)
    
    return translated_sentence

for l in pairs:
    if l[0] != 'eng':
        continue
    
    # Evaluate and get the sentence
    translated_sentence = translate(input_sentence, l[0], l[1])
    print(f'translated sentence in {l[1]} is: ' + translated_sentence.strip())
    

source sentence is: why is that
translated sentence in fra is: pourquoi ça
translated sentence in spa is: qué es eso


## 5. Evaluate the translation using BLEU score

Reference: 
- https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
- https://en.wikipedia.org/wiki/BLEU
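
To make the metric concrete, here is a simplified, sentence-level BLEU written in plain Python. It is a sketch under stated assumptions: `ngram_counts`, `modified_precision`, and `simple_bleu` are hypothetical helpers, no smoothing is applied, and NLTK's `corpus_bleu` differs in that it pools n-gram counts over the whole corpus. The example sentences are taken from the decoding output in Section 4.1.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(references, candidate, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in any single reference."""
    cand = ngram_counts(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

def simple_bleu(references, candidate, max_n=4):
    """Geometric mean of 1..max_n precisions times the brevity penalty."""
    precisions = [modified_precision(references, candidate, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    c = len(candidate)
    # Use the reference whose length is closest to the candidate's.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)

refs = ['tu devrais le faire'.split(), 'vous devriez le faire'.split()]
print(simple_bleu(refs, 'tu devrais le faire'.split()))

ref = ['le meurtre est un crime odieux'.split()]
cand = 'la mure est un crime de criment'.split()
print(modified_precision(ref, cand, 1), simple_bleu(ref, cand))
```

The second example shows why unsmoothed scores on short sentences can be harsh: the candidate shares three of its seven words with the reference, yet it has no matching 4-gram, so its sentence-level score collapses to zero.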


In [None]:
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu

from tqdm import tqdm

for p in pairs:
    dpairs = dirty_pairs[p]
    
    # Compute the set of expected vs actual results
    expected = {}
    print(f'Building {p[0]} to {p[1]} reference')
    for src, tgt in dpairs:
        if src not in expected:
            expected[src] = [tgt.split()]
        else:
            expected[src] += [tgt.split()]
            
    # Get the expected strings
    references = [expected[s] for s in expected]
    print('Found', len(references), 'reference translations')
    
    # Translate the source strings
    print('Generating candidates')
    candidates = [translate(s, p[0], p[1]).split()
                  for s in tqdm(expected, desc=f'Translating from {p[0]} to {p[1]}')]
    
    # Compute the BLEU score
    print('Computing BLEU score')
    score = corpus_bleu(references, candidates)
    
    print(f'bleu_score[{p[0]}, {p[1]}]: {score}')


Building eng to fra reference
Generating candidates
Computing BLEU score
bleu_score[eng, fra]: 0.0429645604749982
Building eng to spa reference
Generating candidates
