# Homework 4: Build a seq2seq model for machine translation.

### Name: Andrew Chen

### Task: Translate English to French

## 0. You will do the following:

1. Read and run my code.
2. Complete the code in Section 1.1 and Section 4.2.

    * Translating English to **German** is not acceptable! Try another language.
    
3. **Make improvements.** Directly modify the code in Section 3. Do at least one of the following. Doing more earns up to 2 bonus points toward the total.

    * Bi-LSTM instead of LSTM
    
    * Multi-task learning (e.g., both English to French and English to Spanish)
    
    * Attention
    
4. Evaluate the translation using the BLEU score. 

    * Optional. Up to 2 bonus points toward the total.
    
5. Convert the notebook to an .HTML file. 

    * The HTML file must contain the code and the output after execution.

6. Put the .HTML file in your own Github repo. 

7. Submit the link to the HTML file to Canvas

    * E.g., https://github.com/wangshusen/CS583A-2019Spring/blob/master/homework/HM4/seq2seq.html
    


### Hint: 

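To implement ```Bi-LSTM```, you will need the following code to build the encoder; the decoder won't be much different. Note that the concatenated states have dimension 2 * latent_dim, so the decoder LSTM needs 2 * latent_dim units to accept them as its initial states.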

In [217]:
# from keras.layers import Bidirectional, Concatenate

# encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True, 
#                                   dropout=0.5, name='encoder_lstm'))
# _, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

# state_h = Concatenate()([forward_h, backward_h])
# state_c = Concatenate()([forward_c, backward_c])

### Hint: 

To implement multi-task training, you can refer to ```Section 7.1.3 Multi-output models``` of the textbook, ```Deep Learning with Python```.

## 1. Data preparation

1. Download data (e.g., "deu-eng.zip") from http://www.manythings.org/anki/
2. Unzip the .ZIP file.
3. Put the .TXT file (e.g., "deu.txt") in the directory "./Data/".

### 1.1. Load and clean text


In [258]:
import re
import string
from unicodedata import normalize
import numpy

# load doc into memory
def load_doc(filename):
    # open the file read-only; the with-block closes it automatically
    with open(filename, mode='rt', encoding='utf-8') as file:
        # read all text
        text = file.read()
    return text


# split a loaded document into sentences
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in  lines]
    return pairs

def clean_data(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [word.translate(table) for word in line]
            # remove non-printable chars from each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return numpy.array(cleaned)

#### Fill the following blanks:

In [259]:
# e.g., filename = 'Data/deu.txt'
filename = "Data/fra.txt"

# e.g., n_train = 20000
n_train = 30000

rand_indices = numpy.random.permutation(n_train)
train_indices = rand_indices[0:20000]
valid_indices = rand_indices[20000:25000]
test_indices = rand_indices[25000:30000]

In [260]:
# load dataset
doc = load_doc(filename)

# split into Language1-Language2 pairs
pairs = to_pairs(doc)

# clean sentences
clean_pairs = clean_data(pairs)[train_indices, :]

In [261]:
for i in range(3000, 3010):
    print('[' + clean_pairs[i, 0] + '] => [' + clean_pairs[i, 1] + ']')

[he squinted] => [il a louche]
[does it offend you] => [cela vous offensetil]
[did he go there] => [y estil alle]
[be prepared] => [soyez pretes]
[she baked me a cake] => [elle me prepara un gateau]
[i feel it now] => [je le sens maintenant]
[she is darkskinned] => [elle a la peau noire]
[lets not watch tv] => [ne regardons pas la television]
[youre talkative] => [tu es bavard]
[i hate fanatics] => [je deteste les fanatiques]
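
The pairs above have lost their accents, apostrophes, and hyphens (for example `offensetil`, `darkskinned`). The sketch below applies the same steps as `clean_data` to a single sentence so the effect of each step is easier to follow. The helper name `clean_line` is hypothetical and is not part of the notebook.

```python
import re
import string
from unicodedata import normalize

def clean_line(line):
    """Apply the same steps as clean_data to a single sentence."""
    # decompose accented characters, then drop the non-ASCII combining marks
    line = normalize('NFD', line).encode('ascii', 'ignore').decode('utf-8')
    # lowercase each whitespace-separated token
    words = [word.lower() for word in line.split()]
    # delete punctuation (including apostrophes and hyphens) inside tokens
    table = str.maketrans('', '', string.punctuation)
    words = [word.translate(table) for word in words]
    # delete any remaining non-printable characters
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    words = [re_print.sub('', word) for word in words]
    # keep purely alphabetic tokens only
    words = [word for word in words if word.isalpha()]
    return ' '.join(words)

print(clean_line("Ça goûtait sucré !"))    # ca goutait sucre
print(clean_line("Let's not watch TV."))   # lets not watch tv
```

Because punctuation is deleted rather than replaced by a space, contractions and hyphenated words collapse into a single token.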


In [262]:
input_texts = clean_pairs[:, 0]
target_texts = ['\t' + text + '\n' for text in clean_pairs[:, 1]]

print('Length of input_texts:  ' + str(input_texts.shape))
print('Length of target_texts: ' + str(numpy.shape(target_texts)))

Length of input_texts:  (20000,)
Length of target_texts: (20000,)


In [263]:
max_encoder_seq_length = max(len(line) for line in input_texts)
max_decoder_seq_length = max(len(line) for line in target_texts)

print('max length of input  sentences: %d' % (max_encoder_seq_length))
print('max length of target sentences: %d' % (max_decoder_seq_length))

max length of input  sentences: 19
max length of target sentences: 56


**Remark:** At this point, you have two lists of sentences: input_texts and target_texts.

## 2. Text processing

### 2.1. Convert texts to sequences

- Input: A list of $n$ sentences (with max length $t$).
- It is represented by an $n\times t$ matrix after the tokenization and zero-padding.
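
As a rough illustration of the two bullets above, here is a pure-Python sketch of character-level tokenization followed by zero-padding on the right. It is not the Keras implementation: the function names are hypothetical, ties between equally frequent characters are assumed to be broken by first appearance, and overly long lines are assumed to be cut at the end.

```python
from collections import Counter

def build_char_index(lines):
    """Map each character to a positive integer; 0 is reserved for padding."""
    counts = Counter(ch for line in lines for ch in line)
    # most frequent characters get the smallest indices
    # (assumption: ties are broken by order of first appearance)
    return {ch: i + 1 for i, (ch, _) in enumerate(counts.most_common())}

def to_padded_sequences(lines, char_index, max_len):
    """Turn sentences into an n x max_len matrix, zero-padded on the right."""
    rows = []
    for line in lines:
        # assumption: lines longer than max_len are cut at the end
        seq = [char_index[ch] for ch in line][:max_len]
        rows.append(seq + [0] * (max_len - len(seq)))
    return rows

lines = ['go on', 'no']
index = build_char_index(lines)
matrix = to_padded_sequences(lines, index, max_len=5)
print(index)    # {'o': 1, 'n': 2, 'g': 3, ' ': 4}
print(matrix)   # [[3, 1, 4, 1, 2], [2, 1, 0, 0, 0]]
```

The matrix has $n = 2$ rows and $t = 5$ columns; the shorter sentence is filled with zeros.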

In [264]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# encode and pad sequences
def text2sequences(max_len, lines):
    tokenizer = Tokenizer(char_level=True, filters='')
    tokenizer.fit_on_texts(lines)
    seqs = tokenizer.texts_to_sequences(lines)
    seqs_pad = pad_sequences(seqs, maxlen=max_len, padding='post')
    return seqs_pad, tokenizer.word_index


encoder_input_seq, input_token_index = text2sequences(max_encoder_seq_length, 
                                                      input_texts)
decoder_input_seq, target_token_index = text2sequences(max_decoder_seq_length, 
                                                       target_texts)

print('shape of encoder_input_seq: ' + str(encoder_input_seq.shape))
print('shape of input_token_index: ' + str(len(input_token_index)))
print('shape of decoder_input_seq: ' + str(decoder_input_seq.shape))
print('shape of target_token_index: ' + str(len(target_token_index)))

shape of encoder_input_seq: (20000, 19)
shape of input_token_index: 27
shape of decoder_input_seq: (20000, 56)
shape of target_token_index: 29


In [265]:
num_encoder_tokens = len(input_token_index) + 1
num_decoder_tokens = len(target_token_index) + 1

print('num_encoder_tokens: ' + str(num_encoder_tokens))
print('num_decoder_tokens: ' + str(num_decoder_tokens))

num_encoder_tokens: 28
num_decoder_tokens: 30


**Remark:** At this point, the input-language and target-language texts have been converted to two matrices.

- Both have one row per training sentence (20000 here, the size of train_indices).
- Their numbers of columns are max_encoder_seq_length and max_decoder_seq_length, respectively.

The following cells print a sentence and its representation as a sequence.

In [266]:
target_texts[100]

'\tjaurai besoin de votre aide\n'

In [267]:
decoder_input_seq[100, :]

array([10, 19,  4,  8, 12,  4,  5,  2, 21,  1,  3,  9,  5,  7,  2, 18,  1,
        2, 17,  9,  6, 12,  1,  2,  4,  5, 18,  1, 11,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0])

### 2.2. One-hot encode

- Input: A list of $n$ sentences (with max length $t$).
- It is represented by an $n\times t$ matrix after the tokenization and zero-padding.
- It is represented by an $n\times t \times v$ tensor ($v$ is the vocabulary size: the number of unique chars plus one for the padding index 0) after the one-hot encoding.
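
To make the shapes concrete, the following toy sketch expands a small $n\times t$ index matrix into an $n\times t\times v$ nested list. It uses plain Python lists instead of `to_categorical`, and the function name `onehot` is hypothetical.

```python
def onehot(matrix, vocab_size):
    """Expand an n x t matrix of indices into an n x t x vocab_size tensor."""
    tensor = []
    for row in matrix:
        encoded_row = []
        for index in row:
            vector = [0] * vocab_size
            vector[index] = 1      # padding index 0 also gets its own slot
            encoded_row.append(vector)
        tensor.append(encoded_row)
    return tensor

matrix = [[3, 1, 2], [2, 1, 0]]          # n = 2 sentences, t = 3 positions
tensor = onehot(matrix, vocab_size=4)    # v = 4: three characters plus padding
print(len(tensor), len(tensor[0]), len(tensor[0][0]))   # 2 3 4
print(tensor[1])   # [[0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]]
```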

In [268]:
from keras.utils import to_categorical

# one hot encode target sequence
def onehot_encode(sequences, max_len, vocab_size):
    n = len(sequences)
    data = numpy.zeros((n, max_len, vocab_size))
    for i in range(n):
        data[i, :, :] = to_categorical(sequences[i], num_classes=vocab_size)
    return data

encoder_input_data = onehot_encode(encoder_input_seq, max_encoder_seq_length, num_encoder_tokens)
decoder_input_data = onehot_encode(decoder_input_seq, max_decoder_seq_length, num_decoder_tokens)

decoder_target_seq = numpy.zeros(decoder_input_seq.shape)
decoder_target_seq[:, 0:-1] = decoder_input_seq[:, 1:]
decoder_target_data = onehot_encode(decoder_target_seq, 
                                    max_decoder_seq_length, 
                                    num_decoder_tokens)

print(encoder_input_data.shape)
print(decoder_input_data.shape)

(20000, 19, 28)
(20000, 56, 30)


## 3. Build the networks (for training)

- Build encoder, decoder, and connect the two modules to get "model". 

- Fit the model on the bilingual data to train the parameters in the encoder and decoder.

### 3.1. Encoder network

- Input:  one-hot encode of the input language

- Return: 

    -- output (all the hidden states   $h_1, \cdots , h_t$) are always discarded
    
    -- the final hidden state  $h_t$
    
    -- the final conveyor belt $c_t$

In [269]:
from keras.layers import Input, LSTM
from keras.models import Model

latent_dim = 256

# inputs of the encoder network
encoder_inputs = Input(shape=(None, num_encoder_tokens), 
                       name='encoder_inputs')

# set the LSTM layer
encoder_lstm = LSTM(latent_dim, return_state=True, 
                    dropout=0.5, name='encoder_lstm')
_, state_h, state_c = encoder_lstm(encoder_inputs)

# build the encoder network model
encoder_model = Model(inputs=encoder_inputs, 
                      outputs=[state_h, state_c],
                      name='encoder')

Print a summary and save the encoder network structure to "./encoder.pdf"

In [270]:
# from IPython.display import SVG
# from keras.utils.vis_utils import model_to_dot, plot_model

# SVG(model_to_dot(encoder_model, show_shapes=False).create(prog='dot', format='svg'))

# plot_model(
#     model=encoder_model, show_shapes=False,
#     to_file='encoder.pdf'
# )

encoder_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
encoder_inputs (InputLayer)  (None, None, 28)          0         
_________________________________________________________________
encoder_lstm (LSTM)          [(None, 256), (None, 256) 291840    
Total params: 291,840
Trainable params: 291,840
Non-trainable params: 0
_________________________________________________________________
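
The parameter count in the summary can be checked by hand. An LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias. The helper below is a small sketch of that arithmetic; its name is hypothetical.

```python
def lstm_param_count(input_dim, units):
    """Number of trainable parameters in one LSTM layer.

    Each of the 4 gates has an input kernel (input_dim x units),
    a recurrent kernel (units x units), and a bias (units).
    """
    return 4 * (units * (input_dim + units) + units)

# encoder: 28 input tokens, latent_dim = 256
print(lstm_param_count(input_dim=28, units=256))   # 291840
```

With `input_dim=30` the same formula gives 293888, which matches the decoder LSTM in Section 3.2.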


### 3.2. Decoder network

- Inputs:  

    -- one-hot encode of the target language
    
    -- The initial hidden state $h_t$ 
    
    -- The initial conveyor belt $c_t$ 

- Return: 

    -- output (all the hidden states) $h_1, \cdots , h_t$

    -- the final hidden state  $h_t$ (discarded in the training and used in the prediction)
    
    -- the final conveyor belt $c_t$ (discarded in the training and used in the prediction)

In [271]:
from keras.layers import Input, LSTM, Dense
from keras.models import Model

# inputs of the decoder network
decoder_input_h = Input(shape=(latent_dim,), name='decoder_input_h')
decoder_input_c = Input(shape=(latent_dim,), name='decoder_input_c')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# set the LSTM layer
decoder_lstm = LSTM(latent_dim, return_sequences=True, 
                    return_state=True, dropout=0.5, name='decoder_lstm')
decoder_lstm_outputs, state_h, state_c = decoder_lstm(decoder_input_x, 
                                                      initial_state=[decoder_input_h, decoder_input_c])

# set the dense layer
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_lstm_outputs)

# build the decoder network model
decoder_model = Model(inputs=[decoder_input_x, decoder_input_h, decoder_input_c],
                      outputs=[decoder_outputs, state_h, state_c],
                      name='decoder')

Print a summary and save the decoder network structure to "./decoder.pdf"

In [272]:
# from IPython.display import SVG
# from keras.utils.vis_utils import model_to_dot, plot_model

#SVG(model_to_dot(decoder_model, show_shapes=False).create(prog='dot', format='svg'))

#plot_model(
#    model=decoder_model, show_shapes=False,
#    to_file='decoder.pdf'
#)

decoder_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
decoder_input_x (InputLayer)    (None, None, 30)     0                                            
__________________________________________________________________________________________________
decoder_input_h (InputLayer)    (None, 256)          0                                            
__________________________________________________________________________________________________
decoder_input_c (InputLayer)    (None, 256)          0                                            
__________________________________________________________________________________________________
decoder_lstm (LSTM)             [(None, None, 256),  293888      decoder_input_x[0][0]            
                                                                 decoder_input_h[0][0]            
          

### 3.3. Connect the encoder and decoder

In [273]:
# input layers
encoder_input_x = Input(shape=(None, num_encoder_tokens), name='encoder_input_x')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# connect encoder to decoder
encoder_final_states = encoder_model([encoder_input_x])
decoder_lstm_output, _, _ = decoder_lstm(decoder_input_x, initial_state=encoder_final_states)
decoder_pred = decoder_dense(decoder_lstm_output)

model = Model(inputs=[encoder_input_x, decoder_input_x], 
              outputs=decoder_pred, 
              name='model_training')

In [274]:
print(state_h)
print(decoder_input_h)

Tensor("decoder_lstm_16/while/Exit_2:0", shape=(?, 256), dtype=float32)
Tensor("decoder_input_h_8:0", shape=(?, 256), dtype=float32)


In [275]:
# from IPython.display import SVG
# from keras.utils.vis_utils import model_to_dot, plot_model

#SVG(model_to_dot(model, show_shapes=False).create(prog='dot', format='svg'))

#plot_model(
#    model=model, show_shapes=False,
#    to_file='model_training.pdf'
#)

model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_input_x (InputLayer)    (None, None, 28)     0                                            
__________________________________________________________________________________________________
decoder_input_x (InputLayer)    (None, None, 30)     0                                            
__________________________________________________________________________________________________
encoder (Model)                 [(None, 256), (None, 291840      encoder_input_x[0][0]            
__________________________________________________________________________________________________
decoder_lstm (LSTM)             [(None, None, 256),  293888      decoder_input_x[0][0]            
                                                                 encoder[1][0]                    
          

### 3.4. Fit the model on the bilingual dataset

- encoder_input_data: one-hot encode of the input language

- decoder_input_data: one-hot encode of the target language

- decoder_target_data: labels (left shift of decoder_input_data)

- tune the hyper-parameters

- stop when the validation loss stops decreasing.
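
The "left shift" in the list above means that the label at position $t$ is the decoder input at position $t+1$, so the decoder learns to predict the next character from the current one. The toy sketch below uses made-up indices and a hypothetical helper `left_shift`; the notebook does the same thing on whole matrices in Section 2.2.

```python
def left_shift(sequence, pad=0):
    """Drop the first token and pad the end, so target[t] == input[t + 1]."""
    return sequence[1:] + [pad]

# hypothetical indices: 1 = '\t' (start), 2 = '\n' (stop), 0 = padding
decoder_input = [1, 5, 7, 2, 0, 0]      # '\t', 'a', 'b', '\n', pad, pad
decoder_target = left_shift(decoder_input)
print(decoder_target)                    # [5, 7, 2, 0, 0, 0]

for x, y in zip(decoder_input, decoder_target):
    print('decoder sees %d -> should predict %d' % (x, y))
```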

In [276]:
print('shape of encoder_input_data' + str(encoder_input_data.shape))
print('shape of decoder_input_data' + str(decoder_input_data.shape))
print('shape of decoder_target_data' + str(decoder_target_data.shape))

shape of encoder_input_data(20000, 19, 28)
shape of decoder_input_data(20000, 56, 30)
shape of decoder_target_data(20000, 56, 30)


In [277]:
from keras.callbacks import EarlyStopping, ModelCheckpoint
callbacks = [EarlyStopping(patience=5, monitor='val_loss'),
             ModelCheckpoint(filepath='save/seq2seq_second.{epoch:02d}-{val_loss:.2f}.hdf5',
                             monitor='val_loss', verbose=0, mode='auto', period=5)]

In [278]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

In [279]:
history = model.fit([encoder_input_data, decoder_input_data],  # training data
          decoder_target_data,                       # labels (left shift of the target sequences)
          batch_size=64, epochs=300, validation_split=0.2, callbacks=callbacks)

model.save('seq2seq.h5')

Train on 16000 samples, validate on 4000 samples
Epoch 1/300
...
Epoch 91/300


## 4. Make predictions


### 4.1. Translate English to French

1. The encoder reads a sentence (source language) and outputs its final states, $h_t$ and $c_t$.
2. Take the [start] sign "\t" and the final states $h_t$ and $c_t$ as input and run the decoder.
3. Get the new states and the predicted probability distribution.
4. Sample a char from the predicted probability distribution.
5. Take the sampled char and the new states as input and repeat the process (stop upon reaching the [stop] sign "\n").
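
Step 4 can be done greedily (take the most likely char) or by sampling. As one possible sketch of sampling with a temperature, the function below rescales the predicted distribution and draws a single index from it. The function name and the toy distribution are assumptions for illustration; a temperature below 1 sharpens the distribution, and a temperature above 1 flattens it.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Draw an index from probs after rescaling by a temperature."""
    # rescale in log space, which is equivalent to p_i ** (1 / temperature)
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)                      # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    rescaled = [w / total for w in weights]
    # multinomial draw of a single index
    return rng.choices(range(len(probs)), weights=rescaled, k=1)[0]

probs = [0.1, 0.6, 0.3]                     # made-up decoder output for 3 chars
random.seed(0)
print([sample_with_temperature(probs, temperature=0.5) for _ in range(10)])
```

In `decode_sequence`, a call like this could stand in for the `numpy.argmax` line, with `output_tokens[0, -1, :]` as `probs`.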

In [280]:
# Reverse-lookup token index to decode sequences back to something readable.
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())

In [281]:
def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)

    target_seq = numpy.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_token_index['\t']] = 1.

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # this line of code is greedy selection
        # try to use multinomial sampling instead (with temperature)
        sampled_token_index = numpy.argmax(output_tokens[0, -1, :])
        
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        target_seq = numpy.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        states_value = [h, c]

    return decoded_sentence


In [282]:
for seq_index in range(2100, 2120):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('English:       ', input_texts[seq_index])
    print('French (true): ', target_texts[seq_index][1:-1])
    print('French (pred): ', decoded_sentence[0:-1])


-
English:        come on admit it
French (true):  allez admettezle
French (pred):  allez aider moi tout le monde
-
English:        i need to do this
French (true):  il me faut le faire
French (pred):  jai besoin de vous aller tout les deux
-
English:        you may go
French (true):  tu peux ten aller
French (pred):  vous pouvez vous aider
-
English:        i owe you a lunch
French (true):  je vous dois un dejeuner
French (pred):  je vous dois un peu de main
-
English:        i want a low table
French (true):  je veux une table basse
French (pred):  je veux un peu de main
-
English:        it tasted sweet
French (true):  ca goutait sucre
French (pred):  ca a pris un peu de main
-
English:        we found something
French (true):  on a trouve quelque chose
French (pred):  nous lavons vu un aller
-
English:        it was just hype
French (true):  cetait juste du battage publicitaire
French (pred):  cetait tres bien
-
English:        she drives me crazy
French (true):  elle me rend chevr

### 4.2. Translate an English sentence to the target language

1. Tokenization
2. One-hot encode
3. Translate

In [283]:
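input_sentence = 'why is that'

# 1. tokenization: reuse the vocabulary from Section 2.1 (input_token_index);
#    fitting a new Tokenizer on one sentence would assign different indices
input_sequences = [[input_token_index[ch] for ch in input_sentence if ch in input_token_index]]
input_sequences = pad_sequences(input_sequences, maxlen=max_encoder_seq_length, padding='post')
# 2. one-hot encode
input_onehot = onehot_encode(input_sequences, max_encoder_seq_length, num_encoder_tokens)
# 3. translate the encoded sentence, not the `input_seq` left over from Section 4.1
#    (the output recorded below came from decoding that leftover `input_seq`)
translated_sentence = decode_sequence(input_onehot)

print('source sentence is: ' + input_sentence)
print('translated sentence is: ' + translated_sentence)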

source sentence is: why is that
translated sentence is: estu serieux



## 5. Evaluate the translation using BLEU score

Reference: 
- https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
- https://en.wikipedia.org/wiki/BLEU


**Hint:** Randomly partition the dataset to training, validation, and test. Evaluate the BLEU score using the test set.
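
To show what the BLEU score measures, here is a simplified, single-reference sketch in pure Python: clipped n-gram precisions combined by a geometric mean and multiplied by a brevity penalty. It is an illustration only and differs from `nltk`'s `sentence_bleu`, which by default uses n-grams up to length 4. The function names and the default `max_n=2` are assumptions; the sentence pair is taken from the outputs in Section 4.1.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total > 0 else 0.0

def simple_bleu(reference, candidate, max_n=2):
    """Geometric mean of the 1..max_n precisions times the brevity penalty."""
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # penalize candidates that are shorter than the reference
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_mean)

reference = 'je veux une table basse'.split()    # true translation (Section 4.1)
candidate = 'je veux un peu de main'.split()     # model prediction (Section 4.1)
print(modified_precision(reference, candidate, 1))
print(simple_bleu(reference, candidate))
```

Here 2 of the 6 candidate words and 1 of the 5 candidate bigrams appear in the reference. Short sentences often share no 3-grams or 4-grams with the reference, which is one reason sentence-level BLEU with the default 4-gram weights is frequently 0 on this dataset.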

In [284]:
from nltk.translate.bleu_score import sentence_bleu

# reference from first BLEU link

# reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
# candidate = ['the', 'cat', 'is', 'on', 'cat']
# score = sentence_bleu(reference, candidate)
# print(score)


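# use the held-out test indices to evaluate BLEU
# Note: the output recorded below came from a run that decoded the leftover
# `input_seq` from Section 4.1 and used the English source as the BLEU
# reference, hence the identical translations and zero scores.
test_pairs = clean_data(pairs)[test_indices, :]

print(test_pairs[0])

n_bleu_tests = 10

score_total = 0

for i in range(n_bleu_tests):
    # encode the source sentence with the training vocabulary (input_token_index)
    input_sentence = test_pairs[i][0]
    input_sequences = [[input_token_index[ch] for ch in input_sentence if ch in input_token_index]]
    input_sequences = pad_sequences(input_sequences, maxlen=max_encoder_seq_length, padding='post')
    input_onehot = onehot_encode(input_sequences, max_encoder_seq_length, num_encoder_tokens)

    # get translation of this sentence
    translated_sentence = decode_sequence(input_onehot)
    
    # get translation's bleu score against the true French translation
    reference = [test_pairs[i][1].split()]
    candidate = translated_sentence.split()
    score = sentence_bleu(reference, candidate)
    score_total += score
    
    print('-')
    print('source sentence is: ' + input_sentence)
    print('translated sentence is: ' + translated_sentence)
    print('true translation:', test_pairs[i][1])
    print('score:', score)

print("\nAVERAGE SCORE:", score_total / n_bleu_tests)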

['i want to quit' 'je veux demissionner']
-
source sentence is: i want to quit
translated sentence is: estu serieux

true traslation: je veux demissionner
score: 0
-
source sentence is: its not a loan
translated sentence is: estu serieux

true traslation: ce nest pas un pret
score: 0
-
source sentence is: come back here
translated sentence is: estu serieux

true traslation: revenez ici
score: 0
-
source sentence is: were sunk
translated sentence is: estu serieux

true traslation: on est foutu
score: 0
-
source sentence is: were really good
translated sentence is: estu serieux

true traslation: nous sommes vraiment bons
score: 0
-
source sentence is: dont wait for me
translated sentence is: estu serieux

true traslation: ne mattendez pas
score: 0
-
source sentence is: turn it off
translated sentence is: estu serieux

true traslation: eteinsle
score: 0
-
source sentence is: i love the sun
translated sentence is: estu serieux

true traslation: jadore le soleil
score: 0
-
source sentence i