# Language translation with encoder-decoder model

This notebook is based on the programming assignment "Machine translation" of the Udacity Natural Language Processing Nanodegree. 

This notebook implements a sequence-to-sequence model that converts sequences from one domain (e.g. English sentences) into sequences in another domain (e.g. the same sentences translated into French). The network is a GRU encoder-decoder model without attention.

In [1]:
%load_ext autoreload
%aimport helper, tests
%autoreload 1

In [2]:
import collections

import helper
import numpy as np
import project_tests as tests

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy

Using TensorFlow backend.


In [3]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 2916401904031470286
, name: "/gpu:0"
device_type: "GPU"
memory_limit: 357433344
locality {
  bus_id: 1
}
incarnation: 12616659151367246620
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:04.0"
]


# Use of English-French translation dataset

The `small_vocab_en` file contains English sentences with their French translations in the `small_vocab_fr` file. 

In [4]:
# Load English data
english_sentences = helper.load_data('data/small_vocab_en')
# Load French data
french_sentences = helper.load_data('data/small_vocab_fr')

print('Dataset Loaded')

Dataset Loaded


Each line of `small_vocab_en` contains an English sentence, and the same line of `small_vocab_fr` contains its French translation. The sentences have already been preprocessed: punctuation is delimited by spaces and all text has been converted to lowercase.

In [5]:
for sample_i in range(2):
    print('small_vocab_en Line {}:  {}'.format(sample_i + 1, english_sentences[sample_i]))
    print('small_vocab_fr Line {}:  {}'.format(sample_i + 1, french_sentences[sample_i]))

small_vocab_en Line 1:  new jersey is sometimes quiet during autumn , and it is snowy in april .
small_vocab_fr Line 1:  new jersey est parfois calme pendant l' automne , et il est neigeux en avril .
small_vocab_en Line 2:  the united states is usually chilly during july , and it is usually freezing in november .
small_vocab_fr Line 2:  les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .


### Complexity of the vocabulary

In [6]:
english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
french_words_counter = collections.Counter([word for sentence in french_sentences for word in sentence.split()])

print('{} English words.'.format(sum(english_words_counter.values())))
print('{} unique English words.'.format(len(english_words_counter)))
print('10 Most common words in the English dataset:')
print('"' + '" "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')
print()
print('{} French words.'.format(sum(french_words_counter.values())))
print('{} unique French words.'.format(len(french_words_counter)))
print('10 Most common words in the French dataset:')
print('"' + '" "'.join(list(zip(*french_words_counter.most_common(10)))[0]) + '"')

1823250 English words.
227 unique English words.
10 Most common words in the English dataset:
"is" "," "." "in" "it" "during" "the" "but" "and" "sometimes"

1961295 French words.
355 unique French words.
10 Most common words in the French dataset:
"est" "." "," "en" "il" "les" "mais" "et" "la" "parfois"


## Preprocess
We convert the text into sequences of integers using the following preprocessing steps:
1. Tokenize the words into ids.
2. Add padding to make all the sequences the same length.
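The two steps can be illustrated without Keras. The following is a minimal pure-Python sketch, not the notebook's implementation: the helper names `split_words`, `build_vocab`, `to_ids` and `pad_post` are hypothetical, and the filter list is an assumption modelled on the Keras `Tokenizer` defaults. Ids are assigned by descending word frequency, and id 0 is reserved for padding.

```python
from collections import Counter

# Characters replaced by spaces before splitting (assumed to mirror the
# Keras Tokenizer default filters; the apostrophe is kept).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(sentence):
    """Lowercase a sentence, blank out filtered characters, split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return sentence.lower().translate(table).split()

def build_vocab(sentences):
    """Map each word to an id; the most frequent word gets id 1."""
    counts = Counter(word for s in sentences for word in split_words(s))
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: i + 1 for i, (word, _) in enumerate(ranked)}

def to_ids(sentences, vocab):
    """Step 1: turn each sentence into a list of word ids."""
    return [[vocab[w] for w in split_words(s)] for s in sentences]

def pad_post(sequences, length=None):
    """Step 2: append zeros so every sequence has the same length."""
    if length is None:
        length = max(len(seq) for seq in sequences)
    # Sequences longer than `length` are cut at the end (a choice made for this sketch)
    return [seq[:length] + [0] * (length - len(seq)) for seq in sequences]

sentences = ['This is a short sentence .', 'A short one .']
vocab = build_vocab(sentences)
padded = pad_post(to_ids(sentences, vocab))
print(vocab)   # {'a': 1, 'short': 2, 'this': 3, 'is': 4, 'sentence': 5, 'one': 6}
print(padded)  # [[3, 4, 1, 2, 5], [1, 2, 6, 0, 0]]
```

Reserving id 0 for padding is why the model's embedding and output layers are sized with one more entry than the number of words in the vocabulary.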


### Tokenize
We turn each sentence into a sequence of word ids.

In [7]:
def tokenize(x):
    """Tokenize a list of sentences; return (sequences of word ids, fitted Tokenizer)."""
    x_tk = Tokenizer()
    x_tk.fit_on_texts(x)

    return x_tk.texts_to_sequences(x), x_tk

tests.test_tokenize(tokenize)

# Tokenize Example output
text_sentences = [
    'The quick brown fox jumps over the lazy dog .',
    'By Jove , my quick study of lexicography won a prize .',
    'This is a short sentence .']
text_tokenized, text_tokenizer = tokenize(text_sentences)
print(text_tokenizer.word_index)
print()
for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(sent))
    print('  Output: {}'.format(token_sent))

{'the': 1, 'quick': 2, 'a': 3, 'brown': 4, 'fox': 5, 'jumps': 6, 'over': 7, 'lazy': 8, 'dog': 9, 'by': 10, 'jove': 11, 'my': 12, 'study': 13, 'of': 14, 'lexicography': 15, 'won': 16, 'prize': 17, 'this': 18, 'is': 19, 'short': 20, 'sentence': 21}

Sequence 1 in x
  Input:  The quick brown fox jumps over the lazy dog .
  Output: [1, 2, 4, 5, 6, 7, 1, 8, 9]
Sequence 2 in x
  Input:  By Jove , my quick study of lexicography won a prize .
  Output: [10, 11, 12, 2, 13, 14, 15, 16, 3, 17]
Sequence 3 in x
  Input:  This is a short sentence .
  Output: [18, 19, 3, 20, 21]


### Padding
All sequences in a batch need to have the same length. Because sentences vary in length, we append padding (id 0) to the end of each sequence until they all match the longest one.

In [8]:
def pad(x, length=None):
    """Pad sequences at the end; if length is None, use the longest sequence in x."""
    if length is None:
        length = max(len(sentence) for sentence in x)
    return pad_sequences(x, maxlen=length, padding='post')

# Pad Tokenized output
test_pad = pad(text_tokenized)
for sample_i, (token_sent, pad_sent) in enumerate(zip(text_tokenized, test_pad)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(np.array(token_sent)))
    print('  Output: {}'.format(pad_sent))

Sequence 1 in x
  Input:  [1 2 4 5 6 7 1 8 9]
  Output: [1 2 4 5 6 7 1 8 9 0]
Sequence 2 in x
  Input:  [10 11 12  2 13 14 15 16  3 17]
  Output: [10 11 12  2 13 14 15 16  3 17]
Sequence 3 in x
  Input:  [18 19  3 20 21]
  Output: [18 19  3 20 21  0  0  0  0  0]


### Preprocess Pipeline

In [9]:
def preprocess(x, y):

    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)

    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk

preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer =\
    preprocess(english_sentences, french_sentences)
    
max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)

print('Data Preprocessed')
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)

Data Preprocessed
Max English sentence length: 15
Max French sentence length: 21
English vocabulary size: 199
French vocabulary size: 344


## Model
### Ids Back to Text

In [10]:
def logits_to_text(logits, tokenizer):
    """Turn the network's per-step output scores into text using the tokenizer."""
    index_to_words = {idx: word for word, idx in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'

    # Pick the highest-scoring word id at each time step and join the words
    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])

print('`logits_to_text` function loaded.')

`logits_to_text` function loaded.
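To make the conversion concrete, here is a small pure-Python sketch of the same idea on toy data. The names `argmax` and `ids_to_text`, the three-word `word_index` and the score matrix are illustrative assumptions, not part of the notebook. The scores hold one row per time step and one column per word id. The model in this notebook ends in a softmax, so its outputs are probabilities, but the argmax is the same either way.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def ids_to_text(ids, word_index):
    """Invert a word -> id mapping and join the words for the given ids."""
    index_to_word = {i: w for w, i in word_index.items()}
    index_to_word[0] = '<PAD>'  # id 0 is reserved for padding
    return ' '.join(index_to_word[i] for i in ids)

# Toy vocabulary and scores (hypothetical values)
word_index = {'il': 1, 'est': 2, 'calme': 3}
scores = [
    [0.1, 0.7, 0.1, 0.1],  # step 1 -> id 1
    [0.0, 0.2, 0.6, 0.2],  # step 2 -> id 2
    [0.1, 0.1, 0.2, 0.6],  # step 3 -> id 3
    [0.9, 0.0, 0.1, 0.0],  # step 4 -> id 0 (padding)
]
text = ids_to_text([argmax(row) for row in scores], word_index)
print(text)  # il est calme <PAD>
```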


### Architecture
The encoder-decoder model architecture that we implement is shown in the next figure.
<img src="images/seq2seq.png" style="width:180px;height:180px;">

In [11]:
learning_rate = 1e-3
embedding_size = 32
gru_units = 128

# encoder
encoder_input_shape = preproc_english_sentences.shape

encoder_input = Input(shape=(encoder_input_shape[1], ))
encoder_embedding = Embedding(english_vocab_size+1, embedding_size, input_length=encoder_input_shape[1])
encoder_outputs = encoder_embedding(encoder_input)

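# Bidirectional concatenates the forward and backward outputs by default,
# so the encoder output (and the decoder's initial state) has 2*gru_units entries
encoder_bidirectional_GRU = Bidirectional(GRU(gru_units, return_state = False))
encoder_outputs = encoder_bidirectional_GRU(encoder_outputs)
encoder_state = encoder_outputs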

# decoder
decoder_input = Input(shape=(None, ))
decoder_embedding = Embedding(french_vocab_size+1, embedding_size)
decoder_outputs = decoder_embedding(decoder_input)

decoder_GRU = GRU(2*gru_units, return_sequences = True, return_state = True)
decoder_outputs, _ = decoder_GRU(decoder_outputs, initial_state = encoder_state)

decoder_dense = TimeDistributed(Dense(french_vocab_size+1, activation='softmax'))
decoder_outputs = decoder_dense(decoder_outputs) 

encoder_decoder_model = Model(inputs = [encoder_input, decoder_input], outputs = decoder_outputs)
encoder_decoder_model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
encoder_decoder_model.summary()


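# Teacher forcing: the decoder input is the target shifted right by one step,
# so position 0 holds id 0, which serves as the start token
preproc_french_sentences_decinp = np.zeros(preproc_french_sentences.shape)
preproc_french_sentences_decinp[:,1:,:] = preproc_french_sentences[:,0:-1,:]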

num_epochs = 100
encoder_decoder_model.fit([preproc_english_sentences, preproc_french_sentences_decinp[:,:,0]], preproc_french_sentences, 
                           batch_size=1024, epochs= num_epochs, validation_split=0.2)
encoder_decoder_model.save_weights("encoder_decoder_model_"+str(num_epochs)+"_epochs_v2.h5")

print(logits_to_text(encoder_decoder_model.predict([preproc_english_sentences[:1], 
                                                   preproc_french_sentences_decinp[:1,:,0]])[0], french_tokenizer))

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 15)           0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 15, 32)       6400        input_1[0][0]                    
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, None, 32)     11040       input_2[0][0]                    
__________________________________________________________________________________________________
bidirectio
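
The training cell above builds the decoder input by shifting each target sequence one step to the right. The following pure-Python sketch shows the same shift on a single toy sequence. The function name `shift_right` and the example ids are illustrative assumptions. The notebook itself performs the shift with NumPy slicing over the whole array.

```python
def shift_right(sequence, start_id=0):
    """Decoder input: a start id followed by all but the last target id."""
    return [start_id] + sequence[:-1]

target = [5, 8, 2, 0, 0]             # padded target ids (toy values)
decoder_input = shift_right(target)
print(decoder_input)                 # [0, 5, 8, 2, 0]

# At step t the decoder sees decoder_input[t] and is trained to predict target[t]
pairs = list(zip(decoder_input, target))
print(pairs)  # [(0, 5), (5, 8), (8, 2), (2, 0), (0, 0)]
```

With this alignment the decoder always receives the correct previous word during training, whatever it predicted itself. Note that id 0 plays two roles here: start token and padding.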

In [12]:
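# Inference encoder: maps a source sequence to the initial decoder state
encoder_model = Model(encoder_input, encoder_state)

# Inference decoder: reuses the trained layers, but takes the previous state as an
# explicit input and returns the new state, so it can be run one step at a time
decoder_state_input = Input(shape=(2*gru_units,))
decoder_outputs = decoder_embedding(decoder_input)
decoder_outputs, decoder_state = decoder_GRU(decoder_outputs, initial_state = decoder_state_input)
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_input, decoder_state_input], [decoder_outputs, decoder_state])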

## Prediction 

In [13]:
def final_predictions(x, y, x_tk, y_tk, sentence):
    """Greedily translate `sentence` one word at a time with the inference models."""
    y_id_to_word = {value: key for key, value in y_tk.word_index.items()}
    y_id_to_word[0] = '<PAD>'

    # Encode the source sentence (every word must be in the English vocabulary)
    sentence = [x_tk.word_index[word] for word in sentence.split()]
    sentence = pad_sequences([sentence], maxlen=x.shape[-1], padding='post')

    # The encoder output is the initial decoder state; id 0 is the start token
    decoder_state_ = encoder_model.predict(sentence)
    target_seq = np.zeros((1, 1))

    decoded_sentence = ''
    # Decode up to the longest French sentence in the training data (y.shape[1])
    for _ in range(y.shape[1]):
        decoder_output_, decoder_state_ = decoder_model.predict([target_seq, decoder_state_])
        predicted_id = np.argmax(decoder_output_)
        decoded_sentence += y_id_to_word[predicted_id] + ' '

        # Feed the predicted word back in as the next decoder input
        target_seq = predicted_id.reshape(1, 1)

    return decoded_sentence

In [14]:
sentence = 'he saw a old yellow truck'
decoded_sentence = final_predictions(preproc_english_sentences, preproc_french_sentences, 
                                       english_tokenizer, french_tokenizer, sentence)

print('Sample 1:')
print(decoded_sentence)
print('Il a vu un vieux camion jaune')

sentence = 'new jersey is sometimes quiet during autumn and it is snowy in april'
decoded_sentence = final_predictions(preproc_english_sentences, preproc_french_sentences, 
                                       english_tokenizer, french_tokenizer, sentence)

print('Sample 2:')
print(decoded_sentence)
print('new jersey est parfois calme pendant l\' automne et il est neigeux en avril')


Sample 1:
il a vu un vieux camion jaune <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 
Il a vu un vieux camion jaune
Sample 2:
new jersey est parfois calme pendant l' automne et il est neigeux en avril <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 
new jersey est parfois calme pendant l' automne et il est neigeux en avril
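
The sample outputs above always run to the maximum target length, so they end in a run of `<PAD>` tokens. One possible refinement is to stop decoding at the first padding id. The following is a minimal sketch of that greedy loop. The names `greedy_decode` and `toy_step` are hypothetical, and `toy_step` is only a stand-in for a single call to `decoder_model.predict`.

```python
def greedy_decode(step, state, max_len, pad_id=0):
    """Feed the last predicted id back into `step`; stop at pad_id or max_len."""
    token, output = pad_id, []  # id 0 doubles as the start token, as in training
    for _ in range(max_len):
        scores, state = step(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if token == pad_id:
            break
        output.append(token)
    return output

def toy_step(token, state):
    """Toy decoder step: the 'state' is simply the list of ids still to emit."""
    next_id = state[0] if state else 0
    scores = [0.0] * 4
    scores[next_id] = 1.0
    return scores, state[1:]

print(greedy_decode(toy_step, [2, 3, 1], max_len=21))  # [2, 3, 1]
```

Stopping early also saves one decoder call per skipped padding position, which adds up when each step is a separate `predict` call.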
