# Homework 4: Build a seq2seq model for machine translation.

### Name: Saeid Hosseinipoor

### Task: Translate English to Italian, Romanian, and Portuguese

## 0. You will do the following:

1. Read and run my code.
2. Complete the code in Section 1.1 and Section 4.2.

    * Translating English to **German** is not acceptable! Try another language.
    
3. **Make improvements.** Directly modify the code in Section 3. Do at least one of the following. Doing more earns up to 2 bonus points toward the total.

    * Bi-LSTM instead of LSTM
    
    * Multi-task learning (e.g., both English to French and English to Spanish)
    
    * Attention
    
4. Evaluate the translation using the BLEU score. 

    * Optional. Up to 2 bonus points toward the total.
    
5. Convert the notebook to an .HTML file. 

    * The HTML file must contain the code and the output after execution.

6. Put the .HTML file in your own Github repo. 

7. Submit the link to the HTML file to Canvas.

    * E.g., https://github.com/wangshusen/CS583A-2019Spring/blob/master/homework/HM4/seq2seq.html

#### Hint: To implement Bi-LSTM, you will need the following code to build the encoder; the decoder won't be much different.

In [1]:
# from keras.layers import LSTM, Bidirectional, Concatenate
# 
# encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True, 
#                                   dropout=0.5, name='encoder_lstm'))
# _, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

# state_h = Concatenate()([forward_h, backward_h])
# state_c = Concatenate()([forward_c, backward_c])

## 1. Data preparation

1. Download data (e.g., "deu-eng.zip") from http://www.manythings.org/anki/
2. Unzip the .ZIP file.
3. Put the .TXT file (e.g., "deu.txt") in the directory "./Data/".

### 1.1. Load and clean text


In [2]:
import re
import string
from unicodedata import normalize
import numpy

# load doc into memory
def load_doc(filename):
    # open the file as read only; the with-block closes it automatically
    with open(filename, mode='rt', encoding='utf-8') as file:
        # read all text
        text = file.read()
    return text


# split a loaded document into sentences
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in lines]
    return pairs

def clean_data(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [word.translate(table) for word in line]
            # remove non-printable chars from each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return numpy.array(cleaned)
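
To see what the cleaning does to one sentence, the steps of `clean_data` can be sketched on a single string. This is a minimal sketch: `clean_line` is a hypothetical helper that is not part of the notebook, and it leaves out the non-printable-character regex because the ASCII conversion and the `isalpha` filter already cover that case.

```python
import string
from unicodedata import normalize


def clean_line(line):
    """Apply the main cleaning steps of clean_data to a single sentence."""
    # strip accents: decompose the characters, then drop everything non-ASCII
    line = normalize('NFD', line).encode('ascii', 'ignore').decode('ascii')
    # lowercase each whitespace-separated token and remove its punctuation
    table = str.maketrans('', '', string.punctuation)
    words = [w.lower().translate(table) for w in line.split()]
    # keep purely alphabetic tokens only (drops numbers and empty strings)
    return ' '.join(w for w in words if w.isalpha())


print(clean_line("I'm flabby."))
print(clean_line("Voc\u00ea deve ir!"))
```

Accents and apostrophes disappear in this step, which is why the cleaned pairs printed further down read "im flabby" and "voce deve ir".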

#### Fill in the following blanks:

In [3]:
# e.g., filename = 'Data/deu.txt'
filename_it = 'Data/ita.txt'
filename_ron = 'Data/ron.txt'
filename_por = 'Data/por.txt'

# e.g., n_train = 20000
# Total number is 321433
n_train = 20000

In [4]:
# load dataset
doc_it = load_doc(filename_it)
doc_por = load_doc(filename_por)
doc_ron = load_doc(filename_ron)

# split into Language1-Language2 pairs
pairs_it = to_pairs(doc_it)
# print ("Total number of records: {:d}".format(len(pairs)))
pairs_ron = to_pairs(doc_ron)
pairs_por = to_pairs(doc_por)

# clean sentences
clean_pairs_it = clean_data(pairs_it)[0:n_train, :]
clean_pairs_por = clean_data(pairs_por)[0:n_train, :]
clean_pairs_ron = clean_data(pairs_ron)[0:n_train, :]

In [5]:
print ('English ====> Italian')
for i in range(3000, 3010):
    print('[' + clean_pairs_it[i, 0] + '] => [' + clean_pairs_it[i, 1] + ']')

print ('\nEnglish ====> Romanian')
for i in range(3000, 3010):
    print('[' + clean_pairs_ron[i, 0] + '] => [' + clean_pairs_ron[i, 1] + ']')

print ('\nEnglish ====> Portuguese')
for i in range(3000, 3010):
    print('[' + clean_pairs_por[i, 0] + '] => [' + clean_pairs_por[i, 1] + ']')

English ====> Italian
[im flabby] => [io sono fiacco]
[im flabby] => [sono fiacca]
[im flabby] => [io sono fiacca]
[im for it] => [sono a favore]
[im for it] => [io sono a favore]
[im frugal] => [sono parsimonioso]
[im greedy] => [sono avido]
[im greedy] => [io sono avido]
[im greedy] => [sono avida]
[im greedy] => [io sono avida]

English ====> Romanian
[we cant all be like tom] => [nu putem fi toti ca tom]
[we dont need your money] => [nu avem nevoie de banii tai]
[we dont want to do that] => [nu vrem sa facem asta]
[we have a house for rent] => [avem o casa de inchiriat]
[we have only two dollars] => [noi avem doar doi dolari]
[we have used up the coal] => [am folosit tot carbunele]
[we heard the bomb go off] => [am auzit bomba explodand]
[we must win at all costs] => [trebuie sa castigam cu orice pret]
[we use shared facilities] => [folosim facilitati comune]
[we want nothing from you] => [nu vrem nimic de la tine]

English ====> Portuguese
[you must go] => [voce deve ir]
[you must

In [6]:
input_texts_it = clean_pairs_it[:, 0]
target_texts_it = ['\t' + text + '\n' for text in clean_pairs_it[:, 1]]

input_texts_por = clean_pairs_por[:, 0]
target_texts_por = ['\t' + text + '\n' for text in clean_pairs_por[:, 1]]

input_texts_ron = clean_pairs_ron[:, 0]
target_texts_ron = ['\t' + text + '\n' for text in clean_pairs_ron[:, 1]]

input_texts_en = numpy.concatenate((input_texts_it, input_texts_por, input_texts_ron))

print('Length of Italian input_texts:  ' + str(input_texts_it.shape))
print('Length of Italian target_texts: ' + str(numpy.shape(target_texts_it)))
print('Length of Romanian input_texts:  ' + str(input_texts_ron.shape))
print('Length of Romanian target_texts: ' + str(numpy.shape(target_texts_ron)))
print('Length of Portuguese input_texts:  ' + str(input_texts_por.shape))
print('Length of Portuguese target_texts: ' + str(numpy.shape(target_texts_por)))
print ('Length of all input_texts: ' + str(input_texts_en.shape))

Length of Italian input_texts:  (20000,)
Length of Italian target_texts: (20000,)
Length of Romanian input_texts:  (8215,)
Length of Romanian target_texts: (8215,)
Length of Portuguese input_texts:  (20000,)
Length of Portuguese target_texts: (20000,)
Length of all input_texts: (48215,)


In [7]:
max_encoder_seq_length = max(len(line) for line in input_texts_en)
max_decoder_seq_length_it = max(len(line) for line in target_texts_it)
max_decoder_seq_length_por = max(len(line) for line in target_texts_por)
max_decoder_seq_length_ron = max(len(line) for line in target_texts_ron)

print('max length of English input  sentences: %d' % (max_encoder_seq_length))
print('max length of Italian target sentences: %d' % (max_decoder_seq_length_it))
print('max length of Romanian target sentences: %d' % (max_decoder_seq_length_ron))
print('max length of Portuguese target sentences: %d' % (max_decoder_seq_length_por))

max length of English input  sentences: 203
max length of Italian target sentences: 41
max length of Romanian target sentences: 232
max length of Portuguese target sentences: 43


**Remark:** At this point, each language pair has two lists of sentences: input_texts and target_texts (with the suffixes `_it`, `_ron`, and `_por`).

## 2. Text processing

### 2.1. Convert texts to sequences

- Input: A list of $n$ sentences (with max length $t$).
- It is represented by an $n\times t$ matrix after the tokenization and zero-padding.
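
As a toy illustration of the two bullets above, the following sketch builds a character index and a zero-padded matrix in plain Python. The helper names `build_char_index` and `encode_and_pad` are hypothetical and are not part of Keras. The frequency-based ordering is meant to mimic the Keras `Tokenizer`, but the tie-breaking rule used here is an assumption.

```python
def build_char_index(lines):
    """Map each character to a positive integer; 0 is reserved for padding."""
    counts = {}
    for line in lines:
        for ch in line:
            counts[ch] = counts.get(ch, 0) + 1
    # most frequent characters get the smallest indices (ties: first seen first)
    ordered = sorted(counts, key=lambda ch: -counts[ch])
    return {ch: i + 1 for i, ch in enumerate(ordered)}


def encode_and_pad(lines, char_index, max_len):
    """Turn each sentence into a list of indices, zero-padded at the end."""
    rows = []
    for line in lines:
        # the slice never truncates when max_len is the maximum sentence length
        seq = [char_index[ch] for ch in line][:max_len]
        rows.append(seq + [0] * (max_len - len(seq)))
    return rows


toy_lines = ["hi", "hello"]
toy_index = build_char_index(toy_lines)
toy_matrix = encode_and_pad(toy_lines, toy_index, max_len=5)
print(toy_index)
print(toy_matrix)
```

Here $n = 2$ and $t = 5$, so `toy_matrix` has two rows of five indices each, and the shorter sentence is filled with zeros on the right.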

In [8]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# encode and pad sequences
def text2sequences(max_len, lines):
    tokenizer = Tokenizer(char_level=True, filters='')
    tokenizer.fit_on_texts(lines)
    seqs = tokenizer.texts_to_sequences(lines)
    seqs_pad = pad_sequences(seqs, maxlen=max_len, padding='post')
    return seqs_pad, tokenizer.word_index


encoder_input_seq_it, input_token_index_it = text2sequences(max_encoder_seq_length, input_texts_it)
encoder_input_seq_ron, input_token_index_ron = text2sequences(max_encoder_seq_length, input_texts_ron)
encoder_input_seq_por, input_token_index_por = text2sequences(max_encoder_seq_length, input_texts_por)

decoder_input_seq_it, target_token_index_it = text2sequences(max_decoder_seq_length_it, target_texts_it)
decoder_input_seq_ron, target_token_index_ron = text2sequences(max_decoder_seq_length_ron, target_texts_ron)
decoder_input_seq_por, target_token_index_por = text2sequences(max_decoder_seq_length_por, target_texts_por)

print('shape of encoder_input_seq: ' + str(encoder_input_seq_it.shape))
print('shape of input_token_index (Italian): ' + str(len(input_token_index_it)))
print('shape of input_token_index (Portuguese): ' + str(len(input_token_index_por)))
print('shape of input_token_index (Romanian): ' + str(len(input_token_index_ron)))
print('shape of Italian decoder_input_seq: ' + str(decoder_input_seq_it.shape))
print('shape of Italian target_token_index: ' + str(len(target_token_index_it)))
print('shape of Romanian decoder_input_seq: ' + str(decoder_input_seq_ron.shape))
print('shape of Romanian target_token_index: ' + str(len(target_token_index_ron)))
print('shape of Portuguese decoder_input_seq: ' + str(decoder_input_seq_por.shape))
print('shape of Portuguese target_token_index: ' + str(len(target_token_index_por)))

Using TensorFlow backend.


shape of encoder_input_seq: (20000, 203)
shape of input_token_index (Italian): 27
shape of input_token_index (Portuguese): 27
shape of input_token_index (Romanian): 27
shape of Italian decoder_input_seq: (20000, 41)
shape of Italian target_token_index: 29
shape of Romanian decoder_input_seq: (8215, 232)
shape of Romanian target_token_index: 28
shape of Portuguese decoder_input_seq: (20000, 43)
shape of Portuguese target_token_index: 29


In [9]:
num_encoder_tokens = len(input_token_index_it) + 1
num_decoder_tokens_it = len(target_token_index_it) + 1
num_decoder_tokens_ron = len(target_token_index_ron) + 1
num_decoder_tokens_por = len(target_token_index_por) + 1

print('num_encoder_tokens: ' + str(num_encoder_tokens))
print('num_decoder_tokens_it: ' + str(num_decoder_tokens_it))
print('num_decoder_tokens_ron: ' + str(num_decoder_tokens_ron))
print('num_decoder_tokens_por: ' + str(num_decoder_tokens_por))

num_encoder_tokens: 28
num_decoder_tokens_it: 30
num_decoder_tokens_ron: 29
num_decoder_tokens_por: 30


**Remark:** At this point, the input-language and target-language texts of each language pair have been converted to two matrices. 

- Both have n_train rows (fewer for Romanian, which has only 8215 sentence pairs).
- Their numbers of columns are max_encoder_seq_length and max_decoder_seq_length, respectively.

The following cells print a sentence and its representation as a sequence.

In [10]:
target_texts_it[100]

'\tlavoro ai ferri\n'

In [11]:
decoder_input_seq_it[100, :]

array([ 6, 12,  3, 18,  1, 11,  1,  2,  3,  4,  2, 22,  5, 11, 11,  4,  7,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0], dtype=int32)

### 2.2. One-hot encode

- Input: A list of $n$ sentences (with max length $t$).
- It is represented by an $n\times t$ matrix after the tokenization and zero-padding.
- It is represented by an $n\times t \times v$ tensor ($v$ is the number of unique chars plus one for the padding index 0) after the one-hot encoding.
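
The step from the $n\times t$ matrix to the $n\times t\times v$ tensor can be sketched on a toy example with nested lists. `one_hot` is a hypothetical helper for illustration only; the notebook itself uses `to_categorical`.

```python
def one_hot(matrix, vocab_size):
    """Expand an n x t matrix of indices into an n x t x v nested list."""
    return [[[1.0 if k == idx else 0.0 for k in range(vocab_size)]
             for idx in row]
            for row in matrix]


toy_seqs = [[1, 3, 0], [2, 2, 1]]             # n = 2 sentences, t = 3 steps
toy_tensor = one_hot(toy_seqs, vocab_size=4)  # v = 4: three chars + padding index 0
print(len(toy_tensor), len(toy_tensor[0]), len(toy_tensor[0][0]))
print(toy_tensor[0])
```

Note that the padding index 0 gets a one-hot vector of its own (a 1 in position 0), which is why the vocabulary sizes in this notebook are the token-index sizes plus one.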

In [12]:
from keras.utils import to_categorical

# one hot encode target sequence
def onehot_encode(sequences, max_len, vocab_size):
    n = len(sequences)
    data = numpy.zeros((n, max_len, vocab_size))
    for i in range(n):
        data[i, :, :] = to_categorical(sequences[i], num_classes=vocab_size)
    return data

encoder_input_data_it = onehot_encode(encoder_input_seq_it, max_encoder_seq_length, num_encoder_tokens)
encoder_input_data_ron = onehot_encode(encoder_input_seq_ron, max_encoder_seq_length, num_encoder_tokens)
encoder_input_data_por = onehot_encode(encoder_input_seq_por, max_encoder_seq_length, num_encoder_tokens)

decoder_input_data_it = onehot_encode(decoder_input_seq_it, max_decoder_seq_length_it, num_decoder_tokens_it)
decoder_input_data_ron = onehot_encode(decoder_input_seq_ron, max_decoder_seq_length_ron, num_decoder_tokens_ron)
decoder_input_data_por = onehot_encode(decoder_input_seq_por, max_decoder_seq_length_por, num_decoder_tokens_por)

decoder_target_seq_it = numpy.zeros(decoder_input_seq_it.shape)
decoder_target_seq_it[:, 0:-1] = decoder_input_seq_it[:, 1:]
decoder_target_data_it = onehot_encode(decoder_target_seq_it, 
                                    max_decoder_seq_length_it, 
                                    num_decoder_tokens_it)

decoder_target_seq_ron = numpy.zeros(decoder_input_seq_ron.shape)
decoder_target_seq_ron[:, 0:-1] = decoder_input_seq_ron[:, 1:]
decoder_target_data_ron = onehot_encode(decoder_target_seq_ron, 
                                    max_decoder_seq_length_ron, 
                                    num_decoder_tokens_ron)

decoder_target_seq_por = numpy.zeros(decoder_input_seq_por.shape)
decoder_target_seq_por[:, 0:-1] = decoder_input_seq_por[:, 1:]
decoder_target_data_por = onehot_encode(decoder_target_seq_por, 
                                    max_decoder_seq_length_por, 
                                    num_decoder_tokens_por)

print(encoder_input_data_it.shape)
print(encoder_input_data_ron.shape)
print(encoder_input_data_por.shape)
print(decoder_input_data_it.shape)
print(decoder_input_data_ron.shape)
print(decoder_input_data_por.shape)

(20000, 203, 28)
(8215, 203, 28)
(20000, 203, 28)
(20000, 41, 30)
(8215, 232, 29)
(20000, 43, 30)
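
The decoder targets built above are the decoder inputs shifted left by one step, so that at every position the decoder is asked to predict the next character. A minimal sketch on a short toy row follows; the index values are illustrative, and `shift_left` is a hypothetical helper.

```python
def shift_left(seq):
    """Drop the first token and append a padding zero, keeping the length."""
    return seq[1:] + [0]


# illustrative indices: 6 = '\t' (start), 7 = '\n' (stop), 0 = padding
decoder_input = [6, 12, 3, 7, 0, 0]
decoder_target = shift_left(decoder_input)
print(decoder_input)
print(decoder_target)
```

At step 0 the decoder sees the start sign and should output the first real character, and at the step where it sees the last real character it should output the stop sign.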


## 3. Build the networks (for training)

- Build encoder, decoder, and connect the two modules to get "model". 

- Fit the model on the bilingual data to train the parameters in the encoder and decoder.

### 3.1. Encoder network

- Input:  one-hot encoding of the input language

- Return: 

    -- the outputs (the intermediate hidden states $h_1, \cdots , h_{t-1}$), which are always discarded
    
    -- the final hidden state  $h_t$
    
    -- the final conveyor belt (cell state) $c_t$

In [13]:
from keras.layers import Input, LSTM, Bidirectional, Concatenate
from keras.models import Model

latent_dim = 256

# inputs of the encoder network
encoder_inputs = Input(shape=(None, num_encoder_tokens), name='encoder_inputs')

# set the LSTM layer
# encoder_lstm = LSTM(latent_dim, return_state=True, 
#                     dropout=0.5, name='encoder_lstm')
# _, state_h, state_c = encoder_lstm(encoder_inputs)

encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True, dropout=0.5, name='encoder_lstm'))
_, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

# build the encoder network model
encoder_model = Model(inputs=encoder_inputs, 
                      outputs=[state_h, state_c],
                      name='encoder')





Print a summary and save the encoder network structure to "./encoder.pdf"

In [14]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot, plot_model

SVG(model_to_dot(encoder_model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=encoder_model, show_shapes=False,
    to_file='encoder.pdf'
)

encoder_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_inputs (InputLayer)     (None, None, 28)     0                                            
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) [(None, 512), (None, 583680      encoder_inputs[0][0]             
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 512)          0           bidirectional_1[0][1]            
                                                                 bidirectional_1[0][3]            
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (None, 512)          0           bidirectional_1[0][2]            
          

### 3.2. Decoder network

- Inputs:  

    -- one-hot encode of the target language
    
    -- the initial hidden state, i.e., the encoder's final $h_t$ 
    
    -- the initial conveyor belt, i.e., the encoder's final $c_t$ 

- Return: 

    -- output (all the hidden states) $h_1, \cdots , h_t$

    -- the final hidden state  $h_t$ (discarded in the training and used in the prediction)
    
    -- the final conveyor belt $c_t$ (discarded in the training and used in the prediction)

In [15]:
from keras.layers import Input, LSTM, Dense
from keras.models import Model

# inputs of the decoder network
decoder_input_h_it = Input(shape=(2*latent_dim,), name='decoder_input_h_it')
decoder_input_c_it = Input(shape=(2*latent_dim,), name='decoder_input_c_it')
decoder_input_x_it = Input(shape=(None, num_decoder_tokens_it), name='decoder_input_x_it')

# set the LSTM layer
decoder_lstm_it = LSTM(2*latent_dim, return_sequences=True, 
                    return_state=True, dropout=0.5, name='decoder_lstm_it')
decoder_lstm_outputs_it, state_h_it, state_c_it = decoder_lstm_it(decoder_input_x_it, 
                                                      initial_state=[decoder_input_h_it, decoder_input_c_it])

# set the dense layer
decoder_dense_it = Dense(num_decoder_tokens_it, activation='softmax', name='decoder_dense_it')
decoder_outputs_it = decoder_dense_it(decoder_lstm_outputs_it)

# build the decoder network model
decoder_model_it = Model(inputs=[decoder_input_x_it, decoder_input_h_it, decoder_input_c_it],
                      outputs=[decoder_outputs_it, state_h_it, state_c_it],
                      name='decoder_it')

Print a summary and save the decoder network structure to "./decoder_it.pdf"

In [16]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot, plot_model

SVG(model_to_dot(decoder_model_it, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=decoder_model_it, show_shapes=False,
    to_file='decoder_it.pdf'
)

decoder_model_it.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
decoder_input_x_it (InputLayer) (None, None, 30)     0                                            
__________________________________________________________________________________________________
decoder_input_h_it (InputLayer) (None, 512)          0                                            
__________________________________________________________________________________________________
decoder_input_c_it (InputLayer) (None, 512)          0                                            
__________________________________________________________________________________________________
decoder_lstm_it (LSTM)          [(None, None, 512),  1112064     decoder_input_x_it[0][0]         
                                                                 decoder_input_h_it[0][0]         
          

In [17]:
# inputs of the decoder network
decoder_input_h_ron = Input(shape=(2*latent_dim,), name='decoder_input_h_ron')
decoder_input_c_ron = Input(shape=(2*latent_dim,), name='decoder_input_c_ron')
decoder_input_x_ron = Input(shape=(None, num_decoder_tokens_ron), name='decoder_input_x_ron')

# set the LSTM layer
decoder_lstm_ron = LSTM(2*latent_dim, return_sequences=True, 
                    return_state=True, dropout=0.5, name='decoder_lstm_ron')
decoder_lstm_outputs_ron, state_h_ron, state_c_ron = decoder_lstm_ron(decoder_input_x_ron, 
                                                      initial_state=[decoder_input_h_ron, decoder_input_c_ron])

# set the dense layer
decoder_dense_ron = Dense(num_decoder_tokens_ron, activation='softmax', name='decoder_dense_ron')
decoder_outputs_ron = decoder_dense_ron(decoder_lstm_outputs_ron)

# build the decoder network model
decoder_model_ron = Model(inputs=[decoder_input_x_ron, decoder_input_h_ron, decoder_input_c_ron],
                      outputs=[decoder_outputs_ron, state_h_ron, state_c_ron],
                      name='decoder_ron')

In [18]:
# inputs of the decoder network
decoder_input_h_por = Input(shape=(2*latent_dim,), name='decoder_input_h_por')
decoder_input_c_por = Input(shape=(2*latent_dim,), name='decoder_input_c_por')
decoder_input_x_por = Input(shape=(None, num_decoder_tokens_por), name='decoder_input_x_por')

# set the LSTM layer
decoder_lstm_por = LSTM(2*latent_dim, return_sequences=True, 
                    return_state=True, dropout=0.5, name='decoder_lstm_por')
decoder_lstm_outputs_por, state_h_por, state_c_por = decoder_lstm_por(decoder_input_x_por, 
                                                      initial_state=[decoder_input_h_por, decoder_input_c_por])

# set the dense layer
decoder_dense_por = Dense(num_decoder_tokens_por, activation='softmax', name='decoder_dense_por')
decoder_outputs_por = decoder_dense_por(decoder_lstm_outputs_por)

# build the decoder network model
decoder_model_por = Model(inputs=[decoder_input_x_por, decoder_input_h_por, decoder_input_c_por],
                      outputs=[decoder_outputs_por, state_h_por, state_c_por],
                      name='decoder_por')
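
The parameter counts reported in the summaries above (583680 for the bidirectional encoder and 1112064 for the Italian decoder LSTM) can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with an input weight matrix, a recurrent weight matrix, and a bias vector. `lstm_param_count` is a hypothetical helper.

```python
def lstm_param_count(input_dim, units):
    """Weights of one LSTM layer: 4 gates, each with input, recurrent, and bias terms."""
    return 4 * (units * (input_dim + units) + units)


latent_dim = 256
# encoder: two LSTMs (forward and backward), each reading 28-dim one-hot inputs
encoder_params = 2 * lstm_param_count(input_dim=28, units=latent_dim)
# Italian decoder: one LSTM with 2 * latent_dim units reading 30-dim one-hot inputs
decoder_params = lstm_param_count(input_dim=30, units=2 * latent_dim)
print(encoder_params)  # 583680
print(decoder_params)  # 1112064
```

The decoder needs `2*latent_dim` units because its initial states are the concatenated forward and backward states of the encoder, each of size `latent_dim`.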

### 3.3. Connect the encoder and decoder

In [19]:
# input layers
encoder_input_x = Input(shape=(None, num_encoder_tokens), name='encoder_input_x')

decoder_input_x_it = Input(shape=(None, num_decoder_tokens_it), name='decoder_input_x_it')
decoder_input_x_ron = Input(shape=(None,num_decoder_tokens_ron), name='decoder_input_x_ron')
decoder_input_x_por = Input(shape=(None, num_decoder_tokens_por), name='decoder_input_x_por')

# connect encoder to decoder
encoder_final_states = encoder_model([encoder_input_x])

decoder_lstm_output_it, _, _ = decoder_lstm_it(decoder_input_x_it, initial_state=encoder_final_states)
decoder_pred_it = decoder_dense_it(decoder_lstm_output_it)

decoder_lstm_output_ron, _, _ = decoder_lstm_ron(decoder_input_x_ron, initial_state=encoder_final_states)
decoder_pred_ron = decoder_dense_ron(decoder_lstm_output_ron)

decoder_lstm_output_por, _, _ = decoder_lstm_por(decoder_input_x_por, initial_state=encoder_final_states)
decoder_pred_por = decoder_dense_por(decoder_lstm_output_por)

model_it = Model(inputs=[encoder_input_x, decoder_input_x_it], outputs=decoder_pred_it, name='model_training_it')
model_ron = Model(inputs=[encoder_input_x, decoder_input_x_ron], outputs=decoder_pred_ron, name='model_training_ron')
model_por = Model(inputs=[encoder_input_x, decoder_input_x_por], outputs=decoder_pred_por, name='model_training_por')

In [20]:
print(state_h_it)
print(decoder_input_h_it)

Tensor("decoder_lstm_it/while/Exit_2:0", shape=(?, 512), dtype=float32)
Tensor("decoder_input_h_it:0", shape=(?, 512), dtype=float32)


In [21]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot, plot_model

SVG(model_to_dot(model_it, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=model_it, show_shapes=False,
    to_file='model_training_it.pdf'
)

model_it.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_input_x (InputLayer)    (None, None, 28)     0                                            
__________________________________________________________________________________________________
decoder_input_x_it (InputLayer) (None, None, 30)     0                                            
__________________________________________________________________________________________________
encoder (Model)                 [(None, 512), (None, 583680      encoder_input_x[0][0]            
__________________________________________________________________________________________________
decoder_lstm_it (LSTM)          [(None, None, 512),  1112064     decoder_input_x_it[0][0]         
                                                                 encoder[1][0]                    
          

### 3.4. Fit the model on the bilingual dataset

- encoder_input_data: one-hot encoding of the input language

- decoder_input_data: one-hot encoding of the target language

- decoder_target_data: labels (left shift of decoder_input_data)

- Tune the hyper-parameters.

- Stop when the validation loss stops decreasing.
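
The last rule can be made precise with a patience counter: stop once the validation loss has failed to improve for a fixed number of consecutive epochs. The sketch below illustrates the rule in plain Python. `early_stopping_epoch` is a hypothetical helper and the loss values are made up. In Keras the same idea is available as the `EarlyStopping` callback (monitoring `val_loss` with a `patience` argument), which could be passed to `fit` through `callbacks`.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # new best validation loss: reset the patience counter
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


# made-up validation losses, for illustration only
toy_losses = [1.20, 0.95, 0.90, 0.91, 0.93, 0.92, 0.97]
stop_epoch = early_stopping_epoch(toy_losses, patience=3)
print(stop_epoch)
```

With these toy values the best loss is reached at epoch 3, and training stops three epochs later.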

In [22]:
print('shape of encoder_input_data' + str(encoder_input_data_it.shape))
print('shape of decoder_input_data' + str(decoder_input_data_it.shape))
print('shape of decoder_target_data' + str(decoder_target_data_it.shape))

shape of encoder_input_data(20000, 203, 28)
shape of decoder_input_data(20000, 41, 30)
shape of decoder_target_data(20000, 41, 30)


In [23]:
model_por.compile(optimizer='rmsprop', loss='categorical_crossentropy')

model_por.fit([encoder_input_data_por, decoder_input_data_por],  # training data
          decoder_target_data_por,                       # labels (left shift of the target sequences)
          batch_size=64, epochs=50, validation_split=0.2)

# model.save('seq2seq.h5')

Train on 16000 samples, validate on 4000 samples
Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7fdedffa77f0>

In [24]:
model_ron.compile(optimizer='rmsprop', loss='categorical_crossentropy')

model_ron.fit([encoder_input_data_ron, decoder_input_data_ron],  # training data
          decoder_target_data_ron,                       # labels (left shift of the target sequences)
          batch_size=64, epochs=50, validation_split=0.2)

# model.save('seq2seq.h5')

Train on 6572 samples, validate on 1643 samples
Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7fded90f0978>

In [25]:
model_it.compile(optimizer='rmsprop', loss='categorical_crossentropy')

model_it.fit([encoder_input_data_it, decoder_input_data_it],  # training data
          decoder_target_data_it,                       # labels (left shift of the target sequences)
          batch_size=64, epochs=50, validation_split=0.2)

model_it.save('seq2seq.h5')

Train on 16000 samples, validate on 4000 samples
Epoch 1/50
...
Epoch 50/50




## 4. Make predictions


### 4.1. Translate English to Italian

1. The encoder reads a sentence (source language) and outputs its final states, $h_t$ and $c_t$.
2. Take the [start] sign "\t" and the final states $h_t$ and $c_t$ as input and run the decoder.
3. Get the new states and the predicted probability distribution.
4. Sample a char from the predicted probability distribution.
5. Take the sampled char and the new states as input and repeat the process (stop when the [stop] sign "\n" is reached).

In [26]:
# Reverse-lookup token index to decode sequences back to something readable.
reverse_input_char_index = dict((i, char) for char, i in input_token_index_it.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index_it.items())

In [84]:
def decode_sequence(input_seq):
    # encode the source sentence into the initial decoder states [h, c]
    states_value = encoder_model.predict(input_seq)

    # start from the one-hot vector of the start sign '\t'
    target_seq = numpy.zeros((1, 1, num_decoder_tokens_it))
    target_seq[0, 0, target_token_index_it['\t']] = 1.

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model_it.predict([target_seq] + states_value)

        # this line of code is greedy selection
        # try to use multinomial sampling instead (with temperature)
        sampled_token_index = numpy.argmax(output_tokens[0, -1, :])
        
        if sampled_token_index == 0:
            # index 0 is the padding token and has no character: stop here
            stop_condition = True
        else:
            sampled_char = reverse_target_char_index[sampled_token_index]
            decoded_sentence += sampled_char
            # stop at the stop sign '\n' or when the sentence gets too long
            if (sampled_char == '\n' or
                    len(decoded_sentence) > max_decoder_seq_length_it):
                stop_condition = True

        # feed the sampled token and the new states back into the decoder
        target_seq = numpy.zeros((1, 1, num_decoder_tokens_it))
        target_seq[0, 0, sampled_token_index] = 1.

        states_value = [h, c]

    return decoded_sentence
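
The comment in `decode_sequence` suggests multinomial sampling with temperature as an alternative to greedy selection. A minimal sketch of that sampling step follows, using only the standard library. `sample_with_temperature` is a hypothetical helper, and the small epsilon inside the logarithm is an assumption added to avoid `log(0)`.

```python
import math
import random


def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from `probs` after rescaling by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # rescale the log-probabilities; low temperature sharpens the distribution
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    # subtract the maximum before exponentiating, for numerical stability
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    # choices() normalizes the weights internally
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


idx = sample_with_temperature([0.1, 0.7, 0.2], temperature=0.5,
                              rng=random.Random(0))
print(idx)
```

To try it in `decode_sequence`, the `numpy.argmax(...)` call could be replaced by `sample_with_temperature(output_tokens[0, -1, :], temperature)`. A temperature close to 0 behaves almost like greedy selection, while larger values give more varied and more error-prone output.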


In [48]:
from nltk.translate.bleu_score import sentence_bleu
import numpy as np

n = 2100, 2120
for seq_index in range(n[0], n[1]):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data_it[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    
    reference = [y[1] for y in clean_pairs_it if y[0] == input_texts_it[seq_index] ]
    candidate = decoded_sentence[0:-1]
    score = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
    
    print('-')
    print('English:        ', input_texts_it[seq_index])
    print('Italian (true): ', target_texts_it[seq_index][1:-1])
    print('Italian (pred): ', decoded_sentence[0:-1])
    print('BLEU score: ', score)


-
English:         we waited
Italian (true):  abbiamo aspettato
Italian (pred):  abbiamo aspettato
BLEU score:  1.0
-
English:         we waited
Italian (true):  noi abbiamo aspettato
Italian (pred):  abbiamo aspettato
BLEU score:  1.0
-
English:         we waited
Italian (true):  aspettammo
Italian (pred):  abbiamo aspettato
BLEU score:  1.0
-
English:         we waited
Italian (true):  noi aspettammo
Italian (pred):  abbiamo aspettato
BLEU score:  1.0
-
English:         we walked
Italian (true):  abbiamo camminato
Italian (pred):  abbiamo lavaro le pare
BLEU score:  0.5454545454545454
-
English:         we walked
Italian (true):  camminavamo
Italian (pred):  abbiamo lavaro le pare
BLEU score:  0.5454545454545454
-
English:         we walked
Italian (true):  camminammo
Italian (pred):  abbiamo lavaro le pare
BLEU score:  0.5454545454545454
-
English:         we yawned
Italian (true):  abbiamo sbadigliato
Italian (pred):  lo sbbgglliamo
BLEU score:  0.7857142857142857
-
English:       

### 4.2. Translate an English sentence to the target language

1. Tokenization
2. One-hot encode
3. Translate

In [49]:
input_sentence = 'why is that'

input_sequence, input_token_index = text2sequences(max_encoder_seq_length, input_sentence)

encoder_input_instance = onehot_encode(input_sequence, max_encoder_seq_length, num_encoder_tokens)

translated_sentence = decode_sequence(encoder_input_instance)

print('source sentence is: ' + input_sentence)
print('translated sentence is: ' + translated_sentence)

source sentence is: why is that
translated sentence is: lo sono una pioro
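
One thing to watch in the cell above: `text2sequences` fits a fresh tokenizer to whatever it is given, so calling it on a single string produces character indices that need not match the ones the encoder was trained with. A minimal sketch of an alternative, assuming the training-time index (for example `input_token_index_it`) is reused, is shown below. `encode_sentence` and the toy index are hypothetical.

```python
def encode_sentence(sentence, char_index, max_len, vocab_size):
    """One-hot encode one sentence with an existing character index (0 = padding)."""
    # characters missing from the index are skipped; that is an assumption of this sketch
    seq = [char_index[ch] for ch in sentence if ch in char_index][:max_len]
    seq = seq + [0] * (max_len - len(seq))
    # the outer list is the batch dimension: shape (1, max_len, vocab_size)
    return [[[1.0 if k == idx else 0.0 for k in range(vocab_size)]
             for idx in seq]]


# toy index; in the notebook this role would be played by input_token_index_it
toy_char_index = {' ': 1, 'a': 2, 'h': 3, 'i': 4, 's': 5, 't': 6, 'w': 7, 'y': 8}
instance = encode_sentence('why is that', toy_char_index, max_len=12, vocab_size=9)
print(len(instance), len(instance[0]), len(instance[0][0]))
```

Wrapped in `numpy.array`, such an instance has the same layout as one row of `encoder_input_data_it` and could be passed to `decode_sequence`.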



## 5. Evaluate the translation using BLEU score

References: 
- https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
- https://en.wikipedia.org/wiki/BLEU


In [89]:
# 1-gram individual BLEU
import datetime

All_scores = []
seq_index = 0
verbose = np.inf
while seq_index in range(int(0.8*n_train)):
    input_seq = encoder_input_data_it[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    
    reference = [y[1] for y in clean_pairs_it if y[0] == input_texts_it[seq_index] ]
    candidate = decoded_sentence[0:-1]
    score = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
    All_scores += [score]
    
#     print (seq_index)
    if seq_index > verbose:
        t = datetime.datetime.now()
        print ('{}: sequence {} is processed.'.format(t.time(), seq_index))
        verbose = seq_index + 1000
    seq_index += len(reference)

print ("Average BLEU score for training set: ", np.mean(All_scores))
    

Average BLEU score for training set:  0.7422382090580981


In [90]:
All_scores = []
seq_index = int(0.8*n_train)+1
verbose = np.inf
while seq_index in range(int(0.8*n_train)+1, n_train):
    input_seq = encoder_input_data_it[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    
    reference = [y[1] for y in clean_pairs_it if y[0] == input_texts_it[seq_index] ]
    candidate = decoded_sentence[0:-1]
    score = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
    All_scores += [score]
    
    if seq_index > verbose:
        t = datetime.datetime.now()
        print ('{}: sequence {} is processed.'.format(t.time(), seq_index))
        verbose = seq_index + 1000
    seq_index += len(reference)

print ("Average BLEU score for validation set: ", np.mean(All_scores))

Average BLEU score for validation set:  0.6343801973784912
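
The calls to `sentence_bleu` above pass plain strings, so the 1-grams are counted over characters rather than words. To show the difference, the sketch below implements unigram BLEU (clipped precision times brevity penalty) in plain Python and applies it to one example from Section 4.1 at both levels. `bleu1` is a hypothetical helper written for illustration and is not the NLTK implementation.

```python
import math
from collections import Counter


def bleu1(references, candidate):
    """Unigram BLEU: clipped unigram precision times the brevity penalty."""
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    # clip each candidate count by its largest count in any single reference
    max_ref_counts = Counter()
    for ref in references:
        for tok, cnt in Counter(ref).items():
            max_ref_counts[tok] = max(max_ref_counts[tok], cnt)
    clipped = sum(min(cnt, max_ref_counts[tok]) for tok, cnt in cand_counts.items())
    precision = clipped / len(candidate)
    # the brevity penalty uses the reference length closest to the candidate length
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda n: (abs(n - c), n))
    penalty = 1.0 if c > r else math.exp(1 - r / c)
    return precision * penalty


refs = ['abbiamo camminato', 'camminavamo', 'camminammo']
pred = 'abbiamo lavaro le pare'
char_score = bleu1([list(r) for r in refs], list(pred))
word_score = bleu1([r.split() for r in refs], pred.split())
print(char_score)
print(word_score)
```

The character-level value reproduces the 0.5454 reported for "we walked" in Section 4.1, while the word-level value is only 0.25 because a single word out of four matches. The averages above are therefore character-level scores and should not be compared directly with word-level BLEU scores.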
