# Machine Translation Project

## Introduction
In this notebook, you will build a deep neural network that functions as part of an end-to-end machine translation pipeline. The main pipeline accepts German text as input and returns an English translation; earlier English-to-French practice models are kept at the end of the notebook.

- **Preprocess**: Convert text to sequences of integers.
- **Models**: Create models that accept a sequence of integers as input and return a probability distribution over possible translations. After learning about the basic types of neural networks often used for machine translation, you will carry out your own investigation and design your own model.
- **Prediction**: Run the model on German text.

In [14]:
import collections
import numpy as np
import re
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model, Sequential
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional, Dropout, LSTM, Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.callbacks import EarlyStopping

## Dataset
### Load Data (German-English)

In [2]:
import nltk
nltk.download('comtrans')
from nltk.corpus import comtrans
# function to retrieve the corpora
def retrieve_corpora(translated_sentences_l1_l2='alignment-de-en.txt'):
    print("Retrieving corpora: {}".format(translated_sentences_l1_l2))
    als = comtrans.aligned_sents(translated_sentences_l1_l2)
    sentences_l1 = [sent.words for sent in als]
    sentences_l2 = [sent.mots for sent in als]
    return sentences_l1, sentences_l2

sen_l1, sen_l2 = retrieve_corpora()

[nltk_data] Downloading package comtrans to /home/robyn/nltk_data...
[nltk_data]   Package comtrans is already up-to-date!


Retrieving corpora: alignment-de-en.txt


In [3]:
# German sentences (language 1)
german_sentences = sen_l1
# English sentences (language 2)
english_sentences = sen_l2

print('Dataset Loaded')

Dataset Loaded


### Sampling the Files

In [4]:
for sample_i in range(5):
    print('German sample {}:  {}'.format(sample_i + 1, german_sentences[sample_i]))
    print('English sample {}:  {}\n'.format(sample_i + 1, english_sentences[sample_i]))

German sample 1:  ['Wiederaufnahme', 'der', 'Sitzungsperiode']
English sample 1:  ['Resumption', 'of', 'the', 'session']

German sample 2:  ['Ich', 'erkläre', 'die', 'am', 'Freitag', ',', 'dem', '17.', 'Dezember', 'unterbrochene', 'Sitzungsperiode', 'des', 'Europäischen', 'Parlaments', 'für', 'wiederaufgenommen', ',', 'wünsche', 'Ihnen', 'nochmals', 'alles', 'Gute', 'zum', 'Jahreswechsel', 'und', 'hoffe', ',', 'daß', 'Sie', 'schöne', 'Ferien', 'hatten', '.']
English sample 2:  ['I', 'declare', 'resumed', 'the', 'session', 'of', 'the', 'European', 'Parliament', 'adjourned', 'on', 'Friday', '17', 'December', '1999', ',', 'and', 'I', 'would', 'like', 'once', 'again', 'to', 'wish', 'you', 'a', 'happy', 'new', 'year', 'in', 'the', 'hope', 'that', 'you', 'enjoyed', 'a', 'pleasant', 'festive', 'period', '.']

German sample 3:  ['Wie', 'Sie', 'feststellen', 'konnten', ',', 'ist', 'der', 'gefürchtete', '"', 'Millenium-Bug', '"', 'nicht', 'eingetreten', '.', 'Doch', 'sind', 'Bürger', 'einiger', 

The samples show that the sentences are already split into word and punctuation tokens. Case and any remaining attached punctuation are handled in the preprocessing steps below.

### Vocabulary
The complexity of the problem is determined by the complexity of the vocabulary.  A more complex vocabulary is a more complex problem.  Let's look at the complexity of the dataset we'll be working with.

In [5]:
german_words_counter = collections.Counter([word for sentence in german_sentences for word in sentence])
english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence])

print('{} German words.'.format(len([word for sentence in german_sentences for word in sentence])))
print('{} unique German words.'.format(len(german_words_counter)))
print('10 Most common words in the German dataset:')
print('"' + '" "'.join(list(zip(*german_words_counter.most_common(10)))[0]) + '"')
print()
print('{} English words.'.format(len([word for sentence in english_sentences for word in sentence])))
print('{} unique English words.'.format(len(english_words_counter)))
print('10 Most common words in the English dataset:')
print('"' + '" "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')

666937 German words.
36146 unique German words.
10 Most common words in the German dataset:
"," "." "die" "der" "und" "in" "zu" "den" "ist" "daß"

710091 English words.
19231 unique English words.
10 Most common words in the English dataset:
"the" "." "," "of" "to" "and" "in" "is" "a" "that"


For comparison, _Alice's Adventures in Wonderland_ contains 2,766 unique words out of 15,500 words in total.

## Preprocess
For this project, you won't feed raw text to your model. Instead, you'll convert the text into sequences of integers using the following preprocessing steps:
1. Tokenize the words into ids.
2. Add padding to make all the sequences the same length.

Time to start preprocessing the data...

### Clean
In this step, we clean up the tokens: lowercase them and split any attached punctuation into separate tokens. The `re` module does the splitting:

In [6]:
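# Split on punctuation and spaces; the capturing group keeps each delimiter as its own token
REGEX_SPLITTER = re.compile("([!?.,:;$\"')( ])")

def clean_sentence(sentence):
    """Lowercase each token and split punctuation into separate tokens."""
    clean_words = [REGEX_SPLITTER.split(word.lower()) for word in sentence]
    # Flatten and drop the empty strings that re.split leaves around delimiters
    return [w for words in clean_words for w in words if w]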
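
To see what the cleaning step does, here is a small self-contained sketch that applies the same regular expression to a few tokens. The input list and the name `clean_sentence_demo` are made up for illustration.

```python
import re

# Same pattern as clean_sentence: the capturing group keeps each delimiter as a token
splitter = re.compile("([!?.,:;$\"')( ])")

def clean_sentence_demo(sentence):
    # Lowercase every token, split off punctuation, and drop empty strings
    pieces = [splitter.split(word.lower()) for word in sentence]
    return [w for words in pieces for w in words if w]

sample = ['Freitag', ',', 'dem', '17.', 'Dezember']
cleaned = clean_sentence_demo(sample)
print(cleaned)
```

Note how `'17.'` becomes two tokens, `'17'` and `'.'`, so a cleaned sentence can be longer than the original token list.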

### Tokenize
For a neural network to predict on text data, the text first has to be turned into data the network can understand. Text like "dog" is a sequence of character encodings. Since a neural network is a series of multiplication and addition operations, the input needs to be numeric. We can map each word to a number, called a word id. A word-level model works on word ids and generates a prediction for each word.

Turn each sentence into a sequence of word ids using Keras's [`Tokenizer`](https://keras.io/preprocessing/text/#tokenizer) class. The `tokenize` function below is later applied to both `german_sentences` and `english_sentences` inside `preprocess`.

In [7]:
def tokenize(x):
    """
    Tokenize x
    :param x: List of sentences/strings to be tokenized
    :return: Tuple of (tokenized x data, tokenizer used to tokenize x)
    """
    # Fit the tokenizer on x, then map every token to its integer id
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(x)
    return tokenizer.texts_to_sequences(x), tokenizer
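
The idea of word ids can be sketched without Keras. The snippet below is a simplified stand-in for `Tokenizer`, not its actual implementation: it numbers words by descending frequency starting at 1, which leaves 0 free for padding. The helper name `build_word_index` and the toy sentences are assumptions for illustration.

```python
from collections import Counter

def build_word_index(sentences):
    # Count every token, then number the words from most to least frequent
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

sentences = [['the', 'cat', 'sat'], ['the', 'dog', 'sat', 'down']]
word_index = build_word_index(sentences)

# Replace each word with its id to get the integer sequences
sequences = [[word_index[w] for w in s] for s in sentences]
print(word_index)
print(sequences)
```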

### Filter Dataset Size
The next step is to filter out sentences that are too long to process. Since the goal is to train on a local machine, we limit ourselves to sentences of up to N tokens. Here N=20, so that the model can be trained within 24 hours. If you have a more powerful machine, feel free to raise that limit. To keep the function generic, there is also a lower bound, with a default of 0 so that even an empty token list passes.

The logic of the function is simple: if a sentence or its translation has more than N tokens (or fewer than the lower bound), the pair is removed from both languages:

In [8]:
# to minimize processing power, limit sentences to 20 word length -- can drop this later if proven to not take too long to train
def filter_sentence_length(sentences_l1, sentences_l2, min_len=0, max_len=20):
    filtered_sentences_l1 = []
    filtered_sentences_l2 = []
    for i in range(len(sentences_l1)):
        if min_len <= len(sentences_l1[i]) <= max_len and min_len <= len(sentences_l2[i]) <= max_len:
            filtered_sentences_l1.append(sentences_l1[i])
            filtered_sentences_l2.append(sentences_l2[i])
    return filtered_sentences_l1, filtered_sentences_l2
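
A compact variant of the same filter, written with `zip`, is sketched below. It also returns the indices of the pairs that survive, which is an addition not present in the function above; it can help keep the raw sentences aligned with the filtered data. The name `filter_with_indices` is hypothetical.

```python
def filter_with_indices(sentences_l1, sentences_l2, min_len=0, max_len=20):
    kept_l1, kept_l2, kept_idx = [], [], []
    for i, (s1, s2) in enumerate(zip(sentences_l1, sentences_l2)):
        # Keep the pair only if both sides fall inside [min_len, max_len]
        if min_len <= len(s1) <= max_len and min_len <= len(s2) <= max_len:
            kept_l1.append(s1)
            kept_l2.append(s2)
            kept_idx.append(i)
    return kept_l1, kept_l2, kept_idx

l1 = [['a'], ['a', 'b', 'c'], ['a', 'b']]
l2 = [['x', 'y'], ['x'], ['x', 'y', 'z', 'w']]
kept_l1, kept_l2, kept_idx = filter_with_indices(l1, l2, max_len=2)
print(kept_idx)
```

Only the first pair survives here: the second is too long on the first side and the third is too long on the second side.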

### Padding
When batching sequences of word ids together, each sequence needs to be the same length. Since sentences vary in length, we add padding to the end of the sequences to make them the same length.

Make sure all the German sequences have the same length and all the English sequences have the same length by adding padding to the **end** of each sequence using Keras's [`pad_sequences`](https://keras.io/preprocessing/sequence/#pad_sequences) function.

In [9]:
def pad(x, length=None):
    """
    Pad x
    :param x: List of sequences.
    :param length: Length to pad the sequence to.  If None, use length of longest sequence in x.
    :return: Padded numpy array of sequences
    """
    # Pad with zeros at the end ('post') up to `length`
    return pad_sequences(x, maxlen=length, padding='post')
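
What post-padding produces can be shown in plain Python. This is a simplified illustration, not `pad_sequences` itself: it only appends zeros up to the target length and does no truncation. `pad_post` is a hypothetical helper.

```python
def pad_post(sequences, length=None):
    # Default to the longest sequence, like length=None in pad() above
    if length is None:
        length = max(len(seq) for seq in sequences)
    # Append zeros (the padding id) to the end of each sequence
    return [list(seq) + [0] * (length - len(seq)) for seq in sequences]

padded = pad_post([[1, 3, 2], [1, 4, 2, 5], [7]])
print(padded)
```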

## Preprocess Pipeline
The focus of this project is the neural network architecture, so the preprocessing pipeline is provided. The `preprocess` function below cleans, tokenizes, filters, and pads both sets of sentences.

In [10]:
def preprocess(x, y):
    """
    Preprocess x and y
    :param x: Feature List of sentences (Language 1)
    :param y: Label List of sentences (Language 2)
    :return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
    """
    preprocess_x = [clean_sentence(s) for s in x]
    preprocess_y = [clean_sentence(s) for s in y]
    
    preprocess_x, x_tk = tokenize(preprocess_x)
    preprocess_y, y_tk = tokenize(preprocess_y)

    preprocess_x, preprocess_y = filter_sentence_length(preprocess_x, preprocess_y)
    
    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)
    
    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk

preproc_german_sentences, preproc_english_sentences, german_tokenizer, english_tokenizer = preprocess(german_sentences, english_sentences)
    
max_german_sequence_length = preproc_german_sentences.shape[1]
max_english_sequence_length = preproc_english_sentences.shape[1]
german_vocab_size = len(german_tokenizer.word_index)
english_vocab_size = len(english_tokenizer.word_index)

print('Data Preprocessed')
print("Max German sentence length:", max_german_sequence_length)
print("Max English sentence length:", max_english_sequence_length)
print("German vocabulary size:", german_vocab_size)
print("English vocabulary size:", english_vocab_size)

Data Preprocessed
Max German sentence length: 20
Max English sentence length: 20
German vocabulary size: 34213
English vocabulary size: 17343
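
The reshape at the end of `preprocess` adds a trailing axis of size 1 to the label array. A small NumPy sketch with made-up ids shows the effect on the shape:

```python
import numpy as np

# Two padded label sequences of length 4 (toy values)
labels = np.array([[1, 3, 2, 0],
                   [1, 4, 2, 5]])

# Same reshape as in preprocess(): (samples, timesteps) -> (samples, timesteps, 1)
labels_3d = labels.reshape(*labels.shape, 1)

print(labels.shape)     # (2, 4)
print(labels_3d.shape)  # (2, 4, 1)
```

Because word ids start at 1 and id 0 is used for padding, the models below are built with `len(tokenizer.word_index) + 1` as the vocabulary size.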


## Models
In this section, you will experiment with various neural network architectures.

### Ids Back to Text
The neural network outputs a distribution over word ids at each position, which isn't the final form we want. We want the English translation. The function `logits_to_text` bridges the gap between the logits from the neural network and the English text. You'll use this function to better understand the output of the neural network.

In [59]:
def logits_to_text(logits, tokenizer):
    """
    Turn logits from a neural network into text using the tokenizer
    :param logits: Logits from a neural network
    :param tokenizer: Keras Tokenizer fit on the labels
    :return: String that represents the text of the logits
    """
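    # Invert the tokenizer's word -> id mapping; id 0 is reserved for padding
    index_to_words = {idx: word for word, idx in tokenizer.word_index.items()}
    index_to_words[0] = ''

    # Pick the highest-scoring word id at each time step and join the words
    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])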

print('`logits_to_text` function loaded.')

`logits_to_text` function loaded.
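
A toy run makes the decoding step concrete. The sketch below uses a hand-written `word_index` dictionary in place of a fitted Keras tokenizer and a made-up score matrix; both are assumptions for illustration.

```python
import numpy as np

word_index = {'the': 1, 'session': 2, 'resumed': 3}  # hypothetical vocabulary
index_to_words = {idx: word for word, idx in word_index.items()}
index_to_words[0] = ''  # id 0 is padding

# One row per time step, one column per word id (0..3)
scores = np.array([[0.1, 0.7, 0.1, 0.1],
                   [0.1, 0.1, 0.6, 0.2],
                   [0.1, 0.2, 0.1, 0.6],
                   [0.9, 0.0, 0.1, 0.0]])

# argmax along axis 1 picks the highest-scoring word id at each time step
text = ' '.join(index_to_words[i] for i in np.argmax(scores, axis=1))
print(repr(text))
```

The last time step decodes to padding, which leaves a trailing space in the joined string. This is why the predictions printed later end in whitespace.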


## LSTM Model

In [29]:
def lstm_model(input_shape, output_sequence_length, german_vocab_size, english_vocab_size):
    """
    Build a stacked LSTM model with an embedding layer and a time-distributed softmax output
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence (not used by this architecture)
    :param german_vocab_size: Size of the German (input) vocabulary, including the padding id
    :param english_vocab_size: Size of the English (output) vocabulary, including the padding id
    :return: Keras model built, but not trained
    """   
    # Initialising the RNN
    model = Sequential()
    # Word embedding layer for the German input ids
    model.add(Embedding(german_vocab_size, 1024, input_length=input_shape[1], input_shape=input_shape[1:]))
    # First LSTM layer (dropout disabled)
    model.add(LSTM(units = 1024, return_sequences = True))
    #model.add(Dropout(0.2))
    # Second LSTM layer (dropout disabled)
    model.add(LSTM(units = 512, return_sequences = True))
    #model.add(Dropout(0.2))
    # Third LSTM layer (dropout disabled)
    model.add(LSTM(units = 512, return_sequences = True))
    #model.add(Dropout(0.2))
    # Fourth LSTM layer, followed by Dropout regularisation
    model.add(LSTM(units = 512, return_sequences = True))
    model.add(Dropout(0.2))
    # hidden dense layer
    model.add(TimeDistributed(Dense(512, activation='relu')))
    model.add(Dropout(0.5))
    # output dense layer
    model.add(TimeDistributed(Dense(english_vocab_size, activation='softmax')))
    # Compiling the RNN
    model.compile(optimizer = 'adam', loss = sparse_categorical_crossentropy, metrics = ['accuracy'])
    
    return model

In [32]:
model = lstm_model(preproc_german_sentences.shape, preproc_english_sentences.shape[1], 
                   len(german_tokenizer.word_index)+1, len(english_tokenizer.word_index)+1)
model.summary()
model.fit(preproc_german_sentences, preproc_english_sentences, batch_size=50, epochs=25, validation_split=0.2, verbose=1)
model.save('final_lstm_model_de-en')

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 20, 1024)          35035136  
_________________________________________________________________
lstm_26 (LSTM)               (None, 20, 1024)          8392704   
_________________________________________________________________
lstm_27 (LSTM)               (None, 20, 512)           3147776   
_________________________________________________________________
lstm_28 (LSTM)               (None, 20, 512)           2099200   
_________________________________________________________________
lstm_29 (LSTM)               (None, 20, 512)           2099200   
_________________________________________________________________
dropout_23 (Dropout)         (None, 20, 512)           0         
_________________________________________________________________
time_distributed_10 (TimeDis (None, 20, 512)          



INFO:tensorflow:Assets written to: final_lstm_model_de-en/assets
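
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding and LSTM layers (an LSTM has four gates, each with input weights, recurrent weights, and a bias) and reproduces the first rows of the table. The helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per word id
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates x (input weights + recurrent weights + bias)
    return 4 * (units * (input_dim + units) + units)

print(embedding_params(34213 + 1, 1024))  # embedding_7
print(lstm_params(1024, 1024))            # lstm_26
print(lstm_params(1024, 512))             # lstm_27
print(lstm_params(512, 512))              # lstm_28 and lstm_29
```

The embedding layer alone accounts for roughly 35 million parameters, far more than any single LSTM layer in this model.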




In [74]:
# Note: preprocess() drops sentence pairs longer than 20 tokens, so row i of
# preproc_german_sentences is generally not the pair at index i of the unfiltered
# german_sentences / english_sentences lists printed here.
for i in range(10,15):
    # Print prediction(s)
    print("Prediction:")
    print(logits_to_text(model.predict(preproc_german_sentences[i:i+1])[0], english_tokenizer))

    print("\nCorrect Translation:")
    print(' '.join(english_sentences[i]))

    print("\nOriginal text:")
    print(' '.join(german_sentences[i]))
    print('\n\n')

Prediction:
this is a the the the the the the the the the the the of .    

Correct Translation:
If the House agrees , I shall do as Mr Evans has suggested .

Original text:
Wenn das Haus damit einverstanden ist , werde ich dem Vorschlag von Herrn Evans folgen .



Prediction:
this is a the the the the .            

Correct Translation:
Madam President , on a point of order .

Original text:
Frau Präsidentin , zur Geschäftsordnung .



Prediction:
i , , , , , the the the the the the the the the the the the . .

Correct Translation:
I would like your advice about Rule 143 concerning inadmissibility .

Original text:
Könnten Sie mir eine Auskunft zu Artikel 143 im Zusammenhang mit der Unzulässigkeit geben ?



Prediction:
but , , the the the the the the the the . .       

Correct Translation:
My question relates to something that will come up on Thursday and which I will then raise again .

Original text:
Meine Frage betrifft eine Angelegenheit , die am Donnerstag zur Sprache kommen wi

#### BLEU Score
The Bilingual Evaluation Understudy (BLEU) score is a metric for comparing a generated sentence against one or more reference sentences. It was developed for evaluating the output of automatic machine translation systems. Scores range from 0 (no match) to 1 (perfect match).
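
To make the metric concrete, here is a simplified word-level BLEU in plain Python: clipped n-gram precisions up to bigrams, combined by a geometric mean and multiplied by a brevity penalty. It is a teaching sketch, not NLTK's `sentence_bleu`, which defaults to 4-grams, accepts several references, and handles zero counts differently. The function name `simple_bleu` is an assumption. Both arguments are lists of tokens, not raw strings.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(reference, hypothesis, max_n=2):
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each hypothesis n-gram count by its count in the reference
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if total == 0 or overlap == 0:
            return 0.0
        precisions.append(overlap / total)
    # Brevity penalty punishes hypotheses shorter than the reference
    if len(hypothesis) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = 'thank you , mr poettering .'.split()
exact = simple_bleu(reference, reference)
close = simple_bleu(reference, 'thank you , mr evans .'.split())
poor = simple_bleu(reference, 'the the the the . .'.split())
print(exact, close, poor)
```

An exact copy scores 1, a one-word substitution scores well below 1 because it breaks two bigrams, and a repetitive output with no matching bigram scores 0.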

In [71]:
from nltk.translate.bleu_score import sentence_bleu
for i in range(100,105):
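    original = ' '.join(german_sentences[i])
    references = ' '.join(english_sentences[i])
    candidates = logits_to_text(model.predict(preproc_german_sentences[i:i+1])[0], english_tokenizer)
    # Note: sentence_bleu expects a list of tokenized references and a tokenized hypothesis.
    # The raw strings passed here are iterated character by character; for word-level scores,
    # pass token lists instead, e.g. sentence_bleu([references.lower().split()], candidates.split()).
    score = sentence_bleu(references, candidates)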
    print('\nGER: ', original)
    print('ENG: ', references)
    print('Pred: ', candidates)
    print('BLEU score:', score)


GER:  Es gibt auch Beschlüsse gegen eine solche Steuer .
ENG:  Decisions have also been adopted against a tax of this kind .
Pred:  in is of the the the the of . .          
BLEU score: 1.2803018097424153e-231

GER:  Deswegen beantragt meine Fraktion , diesen Punkt von der Tagesordnung abzusetzen .
ENG:  That is why my Group moves that this item be taken off the agenda .
Pred:  i course , the , to . . . .          
BLEU score: 1.3135841289152546e-231

GER:  Vielen Dank , Herr Poettering .
ENG:  Thank you , Mr Poettering .
Pred:  the the of the the the the the the the the the the the the the the the . .
BLEU score: 9.72161026064145e-232

GER:  Wir kommen nun zu Herrn Wurtz , der gegen den Antrag spricht .
ENG:  We shall now hear Mr Wurtz speaking against this request .
Pred:  the is is is the the the the the the the the . .      
BLEU score: 1.0931616654031189e-231

GER:  Frau Präsidentin , ich möchte zunächst darauf hinweisen , daß das , was Herr Poettering da sagt , nicht ganz logisc

In [73]:
score = 0
for i in range(0,len(preproc_german_sentences)):
    references = ' '.join(english_sentences[i])
    candidates = logits_to_text(model.predict(preproc_german_sentences[i:i+1])[0], english_tokenizer)
    # Raw strings are scored character by character; pass token lists for word-level BLEU
    bleu = sentence_bleu(references, candidates)
    if bleu > score:
        score = bleu
print('highest bleu score:', score)

highest bleu score: 1.4773652796063933e-231


# Other Model Attempts (English to French practice)

### Model 1: RNN
A basic RNN model is a good baseline for sequence data. In this model, you'll build an RNN that translates English to French.

In [None]:
def simple_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build a basic RNN (a single GRU layer followed by time-distributed dense layers)
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # Hyperparameters
    learning_rate = 0.005
    
    # Build the layers
    model = Sequential()
    model.add(GRU(256, input_shape=input_shape[1:], return_sequences=True))
    model.add(TimeDistributed(Dense(1024, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size, activation='softmax'))) 

    # Compile model
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model


# Reshaping the input to work with a basic RNN
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

# Train the neural network
simple_rnn_model = simple_model(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size,
    french_vocab_size)

print(simple_rnn_model.summary())

simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)

# Print prediction(s)
print(logits_to_text(simple_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))

In [12]:
# Print prediction(s)
print("Prediction:")
print(logits_to_text(simple_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))

print("\nCorrect Translation:")
print(french_sentences[:1])

print("\nOriginal text:")
print(english_sentences[:1])

Prediction:
new jersey est parfois calme en mois de mai et il est neigeux en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>

Correct Translation:
["new jersey est parfois calme pendant l' automne , et il est neigeux en avril ."]

Original text:
['new jersey is sometimes quiet during autumn , and it is snowy in april .']


### Custom RNN Model
Use everything you learned from the previous models to create a single model that combines an embedding layer with a bidirectional RNN.

In [20]:
def model_final(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build a model that incorporates embedding, encoder-decoder, and bidirectional RNN
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # Architecture: embedding -> bidirectional GRU encoder -> RepeatVector -> bidirectional GRU decoder -> dense softmax

    # Hyperparameters
    learning_rate = 0.003
    
    # Build the layers    
    model = Sequential()
    # Embedding
    model.add(Embedding(english_vocab_size, 256, input_length=input_shape[1],
                         input_shape=input_shape[1:]))
    # Encoder
    model.add(Bidirectional(GRU(256)))
    model.add(RepeatVector(output_sequence_length))
    # Decoder
    model.add(Bidirectional(GRU(256, return_sequences=True)))
    model.add(TimeDistributed(Dense(1024, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size, activation='softmax')))
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model

print('Final Model Loaded')

Final Model Loaded
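
The encoder in `model_final` compresses the whole input sentence into one vector, and `RepeatVector` copies that vector once per output time step so the decoder has a sequence to read. The NumPy sketch below mimics only the shape transformation, with toy sizes; it is not the Keras layer itself.

```python
import numpy as np

batch, features, output_len = 2, 3, 4

# Encoder output: one summary vector per input sentence
encoded = np.arange(batch * features).reshape(batch, features)

# Repeat each vector output_len times along a new time axis
repeated = np.repeat(encoded[:, np.newaxis, :], output_len, axis=1)

print(encoded.shape)   # (2, 3)
print(repeated.shape)  # (2, 4, 3)
```

Every time step of `repeated` holds the same vector, so the decoder has to rely on its own recurrent state to produce a different word at each position.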


## Prediction

In [21]:
def final_predictions(x, y, x_tk, y_tk):
    """
    Trains the final model on the preprocessed data and saves it to disk
    :param x: Preprocessed English data
    :param y: Preprocessed French data
    :param x_tk: English tokenizer
    :param y_tk: French tokenizer
    """
    # Build the model; the vocabulary sizes include the padding id 0
    model = model_final(x.shape,y.shape[1],
                        len(x_tk.word_index)+1,
                        len(y_tk.word_index)+1)
    model.summary()
    model.fit(x, y, batch_size=1024, epochs=25, validation_split=0.2)
    model.save('rnn_model_fit_round2')

In [22]:
final_predictions(preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 20, 256)           4333056   
_________________________________________________________________
bidirectional_2 (Bidirection (None, 512)               789504    
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 20, 512)           0         
_________________________________________________________________
bidirectional_3 (Bidirection (None, 20, 512)           1182720   
_________________________________________________________________
time_distributed_2 (TimeDist (None, 20, 1024)          525312    
_________________________________________________________________
dropout_1 (Dropout)          (None, 20, 1024)          0         
_________________________________________________________________
time_distributed_3 (TimeDist (None, 20, 22719)        



INFO:tensorflow:Assets written to: rnn_model_fit_round2/assets




In [19]:
# 5 hours 20 min for round 1 --> rnn_model_fit
# 8 hours 30 min for round 2 --> rnn_model_fit_round2 ( increased size of same layers and re-run )

In [25]:
from keras.models import load_model
rnn_model = load_model('rnn_model_fit_round2')

# Reshaping the input to work with a basic RNN
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))


# Print prediction(s)
print("Prediction:")
print(logits_to_text(rnn_model.predict(tmp_x[:1])[0], french_tokenizer))

print("\nCorrect Translation:")
print(french_sentences[:1])

print("\nOriginal text:")
print(english_sentences[:1])

Prediction:
reprise de session <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>

Correct Translation:
[['Reprise', 'de', 'la', 'session']]

Original text:
[['Resumption', 'of', 'the', 'session']]


### LSTM Model 1

In [28]:
# Importing the Keras libraries and packages
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.layers import RepeatVector
from tensorflow.keras.callbacks import EarlyStopping

def lstm_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build a two-layer LSTM model with an embedding layer and a time-distributed softmax output
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """   
    # Initialising the RNN
    model = Sequential()
    # word embedding layer seen before
    model.add(Embedding(english_vocab_size, 256, input_length=input_shape[1], input_shape=input_shape[1:]))
    # Adding the first LSTM layer and some Dropout regularisation
    model.add(LSTM(units = 50, return_sequences = True))
    model.add(Dropout(0.2))
    # repeat vector layer -- repeats the input n times, how many? 
    #model.add(RepeatVector(10))
    # Adding a second LSTM layer and some Dropout regularisation
    model.add(LSTM(units = 50, return_sequences = True))
    model.add(Dropout(0.2))
    # output dense layer
    model.add(TimeDistributed(Dense(french_vocab_size, activation='softmax')))
    # Compiling the RNN
    model.compile(optimizer = 'adam', loss = sparse_categorical_crossentropy, metrics = ['accuracy'])
    
    return model

In [29]:
model = lstm_model(preproc_english_sentences.shape, preproc_french_sentences.shape[1], 
                   len(english_tokenizer.word_index)+1, len(french_tokenizer.word_index)+1)
model.summary()
model.fit(preproc_english_sentences, preproc_french_sentences, batch_size=1024, epochs=25, validation_split=0.2, callbacks=[EarlyStopping(monitor='val_loss', patience=3)], verbose=1)
model.save('lstm_model_fit')

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 20, 256)           4333056   
_________________________________________________________________
lstm_1 (LSTM)                (None, 20, 50)            61400     
_________________________________________________________________
dropout_3 (Dropout)          (None, 20, 50)            0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 20, 50)            20200     
_________________________________________________________________
dropout_4 (Dropout)          (None, 20, 50)            0         
_________________________________________________________________
time_distributed_4 (TimeDist (None, 20, 22719)         1158669   
Total params: 5,573,325
Trainable params: 5,573,325
Non-trainable params: 0
____________________________________________



INFO:tensorflow:Assets written to: lstm_model_fit/assets




In [30]:
# Reshaping the input to work with a basic RNN
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))


# Print prediction(s)
print("Prediction:")
print(logits_to_text(model.predict(tmp_x[:1])[0], french_tokenizer))

print("\nCorrect Translation:")
print(french_sentences[:1])

print("\nOriginal text:")
print(english_sentences[:1])

Prediction:
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>

Correct Translation:
[['Reprise', 'de', 'la', 'session']]

Original text:
[['Resumption', 'of', 'the', 'session']]


In [32]:
# Reshaping the input to work with a basic RNN
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))


# Print prediction(s)
print("Prediction:")
print(logits_to_text(model.predict(tmp_x[4:5])[0], french_tokenizer))

print("\nCorrect Translation:")
print(french_sentences[4:5])

print("\nOriginal text:")
print(english_sentences[4:5])

Prediction:
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>

Correct Translation:
[['(', 'Le', 'Parlement', ',', 'debout', ',', 'observe', 'une', 'minute', 'de', 'silence', ')']]

Original text:
[['(', 'The', 'House', 'rose', 'and', 'observed', 'a', 'minute', "'", 's', 'silence', ')']]


In [None]:
# 1024 batch, 20 epoch round 1 -- ~3 hours --> 'lstm_model_fit' 36% val_accuracy