This notebook builds the following models to practice machine translation from English to French:
<li> Simple RNN
<li> RNN with embedding
<li> Bidirectional RNN
<li> Encoder-decoder Model


*Reference*:<br> 
https://towardsdatascience.com/neural-machine-translation-with-python-c2f0a34f7dd<br>
https://github.com/susanli2016/NLP-with-Python

In [1]:
import collections
import utils
import numpy as np
import project_tests as tests

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy

Using TensorFlow backend.


**Load the [data](https://github.com/susanli2016/NLP-with-Python/tree/master/data)**: the *small_vocab_en* file contains English sentences, and the *small_vocab_fr* file contains the corresponding French translations.

In [2]:
english_sents = utils.load_data('data/small_vocab_en.txt')
french_sents = utils.load_data('data/small_vocab_fr.txt')
print('Dataset loaded')

Dataset loaded
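
`utils.load_data` comes from a helper module that is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming a UTF-8 text file with one sentence per line (the name `load_sentences` and its handling of the trailing newline are illustrative assumptions, not the repository's code):

```python
import os
import tempfile


def load_sentences(path):
    """Read a text file and return one sentence per line (hypothetical helper)."""
    with open(path, encoding='utf-8') as f:
        # splitlines() avoids the empty trailing entry that split('\n')
        # leaves when the file ends with a newline.
        return f.read().splitlines()


# Usage on a throwaway file, so the sketch runs without the real dataset.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('new jersey is sometimes quiet .\nparis is never cold .\n')
demo_sentences = load_sentences(tmp.name)
os.remove(tmp.name)
print(demo_sentences)
```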


In [3]:
for line_i in range(2):
    print('small_vocab_en line %d: %s' % (line_i+1, english_sents[line_i]))
    print('small_vocab_fr line %d: %s' % (line_i+1, french_sents[line_i]))
    print('----------------------------------------------')

small_vocab_en line 1: new jersey is sometimes quiet during autumn , and it is snowy in april .
small_vocab_fr line 1: new jersey est parfois calme pendant l' automne , et il est neigeux en avril .
----------------------------------------------
small_vocab_en line 2: the united states is usually chilly during july , and it is usually freezing in november .
small_vocab_fr line 2: les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .
----------------------------------------------


**Note: the complexity of the problem is determined by the complexity of the vocabulary.**
<br> The function below summarizes the vocabulary of the dataset.

In [9]:
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import Counter

In [35]:
def get_word_summary(list_of_sents, lang):
    word_counts = Counter()
    for sent in list_of_sents:
        for word in word_tokenize(sent):
            word_counts[word] += 1
    vocab = set(word_counts.keys())
    print('%d %s words' % (sum(word_counts.values()), lang))
    print('%d unique %s words' % (len(vocab), lang))
    print('10 most common words:', word_counts.most_common(10))
    return word_counts, vocab

In [36]:
eng_words, eng_vocab = get_word_summary(english_sents, 'english')

1831620 english words
200 unique english words
10 most common words: [('is', 205882), (',', 140897), ('.', 137049), ('in', 75525), ('it', 75377), ('during', 74933), ('the', 67628), ('but', 63987), ('and', 59850), ('sometimes', 37746)]


In [37]:
french_words, french_vocab = get_word_summary(french_sents, 'french')

2000741 french words
354 unique french words
10 most common words: [('est', 196809), ('.', 137048), (',', 123135), ('en', 105768), ('il', 84079), ('les', 65255), ('mais', 63987), ('et', 59851), ('la', 49861), ("'", 38017)]
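
To see what the summary measures without NLTK or the full dataset, the same counting can be sketched on a toy corpus. This version assumes whitespace tokenization via `str.split`, which is workable here because the dataset already separates punctuation with spaces; `word_tokenize` may split some tokens differently, so the counts are not guaranteed to match. The name `summarize_vocab` and the toy sentences are illustrative.

```python
from collections import Counter


def summarize_vocab(sentences, top_n=3):
    """Return (total tokens, distinct tokens, most frequent tokens)."""
    counts = Counter(word for sent in sentences for word in sent.split())
    return sum(counts.values()), len(counts), counts.most_common(top_n)


toy_sentences = [
    'paris is cold in april .',
    'paris is warm in july .',
]
total, unique, top = summarize_vocab(toy_sentences)
print(total, unique, top)
```

A vocabulary that is small relative to the token count, as in this dataset, means each word is seen many times during training.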


## Preprocessing
In this part, each sentence is converted to a sequence of integer word ids (a compact stand-in for one-hot vectors) and then padded so that all sequences have the same length.
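
The relationship between a word id and its one-hot vector can be sketched in plain Python: the id is simply the position of the single 1. The helper name `to_one_hot` and the toy vocabulary size are assumptions for illustration.

```python
def to_one_hot(word_id, vocab_size):
    """Return a list of length vocab_size with a single 1 at position word_id."""
    vec = [0] * vocab_size
    vec[word_id] = 1
    return vec


# Toy example: ids 1..4 for a four-word vocabulary, with 0 kept free for padding.
vocab_size = 5
sentence_ids = [3, 1, 4]
one_hot_rows = [to_one_hot(i, vocab_size) for i in sentence_ids]
for row in one_hot_rows:
    print(row)
```

Storing ids instead of full vectors keeps memory use low, and `sparse_categorical_crossentropy` (imported above) accepts integer targets directly.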

### The one-hot encoding part

In [65]:
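def keras_tokenizer(list_of_strings):
    """Fit a Tokenizer on the sentences; return the id sequences and the fitted tokenizer."""
    tk = Tokenizer()
    tk.fit_on_texts(list_of_strings)  # build the word -> id mapping
    return tk.texts_to_sequences(list_of_strings), tk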

In [71]:
eng_indexed_seq, eng_tokenizer = keras_tokenizer(english_sents)
print(eng_indexed_seq[:10])

[[17, 23, 1, 8, 67, 4, 39, 7, 3, 1, 55, 2, 44], [5, 20, 21, 1, 9, 62, 4, 43, 7, 3, 1, 9, 51, 2, 45], [22, 1, 9, 67, 4, 38, 7, 3, 1, 9, 68, 2, 34], [5, 20, 21, 1, 8, 64, 4, 34, 7, 3, 1, 57, 2, 42], [29, 12, 16, 13, 1, 5, 82, 6, 30, 12, 16, 1, 5, 83], [31, 11, 13, 1, 5, 84, 6, 30, 11, 1, 5, 82], [18, 1, 66, 4, 47, 6, 3, 1, 9, 62, 2, 43], [17, 23, 1, 60, 4, 35, 7, 3, 1, 10, 68, 2, 38], [49, 12, 16, 13, 1, 5, 85, 6, 30, 12, 16, 1, 5, 82], [5, 20, 21, 1, 8, 60, 4, 36, 7, 3, 1, 8, 56, 2, 45]]
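
The ids above reflect word frequency: Keras' `Tokenizer` lowercases the text, strips most punctuation by default, and numbers words from 1 in order of decreasing frequency, leaving 0 unused. A rough pure-Python approximation of that behavior follows. It is not the Keras implementation; `fit_word_index`, `texts_to_ids`, and the simplified punctuation set are assumptions.

```python
from collections import Counter

PUNCT = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'  # simplified filter set (assumption)


def _split(text):
    # Lowercase, replace filtered characters with spaces, split on whitespace.
    table = str.maketrans({ch: ' ' for ch in PUNCT})
    return text.lower().translate(table).split()


def fit_word_index(texts):
    """Map each word to an id, most frequent first, starting at 1."""
    counts = Counter(w for t in texts for w in _split(t))
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}


def texts_to_ids(texts, word_index):
    # Words missing from word_index are skipped.
    return [[word_index[w] for w in _split(t) if w in word_index] for t in texts]


toy_texts = ['Paris is cold .', 'Paris is warm , and it is dry .']
word_index = fit_word_index(toy_texts)
toy_ids = texts_to_ids(toy_texts, word_index)
print(word_index)
print(toy_ids)
```

In the toy corpus, `is` occurs most often and therefore receives id 1, just as it does in the notebook output.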


### The padding part

In [72]:
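def keras_padding(sequence, length=None):
    """Pad (or truncate) every sequence to a common length, adding zeros at the end."""
    if length is None:
        # Default to the longest sequence in the input.
        length = max(len(sent) for sent in sequence)
    return pad_sequences(sequence, maxlen=length, padding='post')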

In [73]:
keras_padding(eng_indexed_seq)

array([[17, 23,  1, ..., 44,  0,  0],
       [ 5, 20, 21, ..., 51,  2, 45],
       [22,  1,  9, ..., 34,  0,  0],
       ..., 
       [24,  1, 10, ..., 54,  0,  0],
       [ 5, 84,  1, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0]], dtype=int32)
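
Because the tokenizer never assigns id 0, zeros can safely mark padding. What `padding='post'` does can be sketched without Keras as follows. `pad_post` is a hypothetical stand-in, and its truncation rule (keep the last `length` ids) is meant to mirror the `pad_sequences` default `truncating='pre'`.

```python
def pad_post(sequences, length=None, value=0):
    """Right-pad each sequence with `value` up to a common length."""
    if length is None:
        length = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # If a sequence is too long, drop ids from the front.
        kept = seq[-length:] if length > 0 else []
        padded.append(list(kept) + [value] * (length - len(kept)))
    return padded


toy_ids = [[2, 1, 3], [2, 1, 4, 5, 6, 1, 7], []]
padded_ids = pad_post(toy_ids)
for row in padded_ids:
    print(row)
```

An empty sequence comes out as a row of zeros, which is one possible reason for the all-zero last row in the output above.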

### Pipeline