# Experiment II : Recurrent Neural Networks

**Author:** Felipe Cortes Jaramillo

**Description:** Experiments implementing different recurrent neural network architectures for English-to-French machine translation.

**References:** https://github.com/tommytracey/AIND-Capstone/blob/master/machine_translation.ipynb

## Installing Packages and Downloading Data:

In [1]:
# Install Hugging Face Library for Datasets
#!pip install datasets pandas nltk clean-text

In [2]:
# Import needed libraries
import pandas as pd
import numpy as np
from datasets import load_dataset

import nltk
from cleantext import clean
from nltk.tokenize import word_tokenize

import tensorflow as tf
from keras.models import Sequential, Model
from keras.layers import Embedding, SimpleRNN, Dense, GRU, LSTM, Bidirectional, Dropout, Input
from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences
#from sklearn.model_selection import train_test_split

Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.


In [3]:
# Check if TensorFlow is able to detect the GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Set TensorFlow to use only one GPU
        tf.config.experimental.set_visible_devices(gpus[0], 'GPU')

        # Enable memory growth
        tf.config.experimental.set_memory_growth(gpus[0], True)

        print("Using GPU:", gpus[0])
    except RuntimeError as e:
        # Memory growth must be set at program startup
        print("RuntimeError:", e)
else:
    raise SystemError("GPU device not found")

Using GPU: PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')


In [4]:
# Extract dataset
translation_dataset = load_dataset('Nicolas-BZRD/Parallel_Global_Voices_English_French',
                                   split='train').to_pandas()
translation_dataset.head(5)

| | en | fr |
|---|---|---|
| 0 | Jamaica: “I am HIV” | Jamaïque : J’ai le VIH |
| 1 | It's widely acknowledged, in the Caribbean and... | Il est largement reconnu, dans les Caraïbes et... |
| 2 | For this woman, however, photographed in the s... | Pour cette femme, cependant, photographiée dan... |
| 3 | As Bacon writes on her blog: | Comme Bacon écrit sur son blog: |
| 4 | “When I asked to take her picture, I suggested... | “Quand je lui ai demandé de la prendre en phot... |


In [5]:
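# Work on a subset of the data (remove this line when finished)

df = translation_dataset.head(8060).copy()  # .copy() avoids pandas' SettingWithCopyWarning on later column assignments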

## Data Pre-processing:

In [6]:
# First step - Data Pre-processing

# nltk downloads
nltk.download('punkt')

# Define a cleaning function
def clean_text(text):
    return clean(text,
                 fix_unicode=True,               # fix various unicode errors
                 to_ascii=True,                  # transliterate to closest ASCII representation
                 lower=True,                     # lowercase text
                 no_line_breaks=False,           # fully strip line breaks as opposed to only normalizing them
                 no_urls=True,                   # replace all URLs with a special token
                 no_emails=True,                 # replace all email addresses with a special token
                 no_phone_numbers=True,          # replace all phone numbers with a special token
                 no_numbers=False,               # replace all numbers with a special token
                 no_digits=False,                # replace all digits with a special token
                 no_currency_symbols=True,       # replace all currency symbols with a special token
                 no_punct=True,                  # remove punctuations
                 replace_with_punct="",          # replace punctuations with this character
                 replace_with_url="<URL>",
                 replace_with_email="<EMAIL>",
                 replace_with_phone_number="<PHONE>",
                 replace_with_number="<NUMBER>",
                 replace_with_digit="<DIGIT>",
                 replace_with_currency_symbol="<CUR>",
                 lang="en")

# Handle missing data first, so the cleaning function only receives strings
df.dropna(subset=['en', 'fr'], inplace=True)

# Apply cleaning function to both English and French columns
df['en'] = df['en'].apply(clean_text)
df['fr'] = df['fr'].apply(clean_text)

# Tokenization
df['en_tokens'] = df['en'].apply(word_tokenize)
df['fr_tokens'] = df['fr'].apply(word_tokenize)

# Save the preprocessed data
df.to_csv('preprocessed_data.csv', index=False)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\pacho\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['en'] = df['en'].apply(clean_text)

In [7]:
# Second step - Data Transformation for Training

# Tokenization
tokenizer_en = Tokenizer()
tokenizer_en.fit_on_texts(df['en_tokens'])
tokenizer_fr = Tokenizer()
tokenizer_fr.fit_on_texts(df['fr_tokens'])

# Convert text to sequences
sequences_en = tokenizer_en.texts_to_sequences(df['en_tokens'])
sequences_fr = tokenizer_fr.texts_to_sequences(df['fr_tokens'])

# Padding sequences
max_len = max(max(len(s) for s in sequences_en), max(len(s) for s in sequences_fr))
sequences_en = pad_sequences(sequences_en, maxlen=max_len, padding='post')
sequences_fr = pad_sequences(sequences_fr, maxlen=max_len, padding='post')

# Splitting the data
split = int(len(sequences_en) * 0.8)
trainX, testX = sequences_en[:split], sequences_en[split:]
trainY, testY = sequences_fr[:split], sequences_fr[split:]
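
The padding and split steps above can be sketched on toy data without Keras. This is a minimal illustration, assuming no sequence exceeds the chosen length (true here, because the length is the maximum over both languages). `pad_post` and the `toy_*` names are hypothetical and not part of the notebook.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen` (no truncation needed here)."""
    return [list(seq) + [value] * (maxlen - len(seq)) for seq in sequences]

# Toy integer sequences standing in for tokenized sentences
toy_en = [[4, 7, 2], [9, 1], [3, 5, 6, 8], [2], [7, 7]]
toy_fr = [[5, 1], [8, 2, 6], [4], [9, 9, 3], [1, 2]]

# One shared length for both languages, as in the cell above
toy_max_len = max(max(len(s) for s in toy_en), max(len(s) for s in toy_fr))
padded_en = pad_post(toy_en, toy_max_len)
padded_fr = pad_post(toy_fr, toy_max_len)

# Sequential 80/20 split: the first 80% of rows train, the rest test (no shuffling)
toy_split = int(len(padded_en) * 0.8)
train_x, test_x = padded_en[:toy_split], padded_en[toy_split:]
train_y, test_y = padded_fr[:toy_split], padded_fr[toy_split:]

print(padded_en[1])                # [9, 1, 0, 0]
print(len(train_x), len(test_x))   # 4 1
```

Because the split is sequential rather than shuffled, the test set consists of the last rows of the subset, which may differ in topic from the training rows.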


In [8]:
# Third step - Reshape data for feeding into model (French words)
trainY = trainY.reshape(trainY.shape[0], trainY.shape[1], 1)
testY = testY.reshape(testY.shape[0], testY.shape[1], 1)
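
The reshape gives every timestep a single integer class id, which is the target format used with `sparse_categorical_crossentropy` in the models below. As a rough sketch of what that loss computes per sequence, assuming plain Python lists instead of tensors (the function here is a hypothetical re-implementation for illustration, not the Keras one):

```python
import math

def sparse_categorical_crossentropy(target_ids, probs):
    """Mean of -log p[target] over timesteps; targets are integer ids, not one-hot vectors."""
    losses = [-math.log(p[t]) for t, p in zip(target_ids, probs)]
    return sum(losses) / len(losses)

# Two timesteps, vocabulary of three tokens
toy_probs = [
    [0.1, 0.7, 0.2],  # model output at timestep 0
    [0.8, 0.1, 0.1],  # model output at timestep 1
]
toy_targets = [1, 0]  # integer class id per timestep

toy_loss = sparse_categorical_crossentropy(toy_targets, toy_probs)
print(round(toy_loss, 4))
```

The integer format avoids building one-hot vectors over a French vocabulary of more than 20,000 words for each of the 121 timesteps.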

In [9]:
# Fourth Step - Some relevant information after pre-processing and transforming
max_english_sequence_length = sequences_en.shape[1]
max_french_sequence_length = sequences_fr.shape[1]
english_vocab_size = len(tokenizer_en.word_index)
french_vocab_size = len(tokenizer_fr.word_index)

print('<--- Data Preprocessed Summary: --->')
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)

<--- Data Preprocessed Summary: --->
Max English sentence length: 121
Max French sentence length: 121
English vocabulary size: 16619
French vocabulary size: 20767
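
The models below pass `len(tokenizer.word_index) + 1` as the embedding input dimension, because the Keras `Tokenizer` starts word indices at 1 and index 0 is used for padding. A quick arithmetic check, assuming the 64-dimensional embeddings used throughout, reproduces the `Embedding` parameter counts that appear in the model summaries later in the notebook (`embedding_params` is a hypothetical helper):

```python
def embedding_params(vocab_size, embedding_dim):
    # +1 reserves row 0 for the padding index, which is never assigned to a word
    return (vocab_size + 1) * embedding_dim

english_embedding = embedding_params(16619, 64)
french_embedding = embedding_params(20767, 64)

print(english_embedding)  # 1063680
print(french_embedding)   # 1329152
```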


----------------------------------------------------------

## Model Approach 1: Simple RNN, GRU and LSTM

In [14]:
# Testing Module - Models

def translate_sequence(seq, tokenizer):
    """ Translates a sequence of integers back into text using the tokenizer. """
    words = [tokenizer.index_word.get(idx, '') for idx in seq]
    return ' '.join(words).strip()

def predict_and_compare(index, testX, model, tokenizer_en, tokenizer_fr):
    """ Predicts translation for a given index in the test set and compares with the ground truth (read from the global testY). """
    input_seq = testX[index:index+1]
    prediction = model.predict(input_seq)

    # Converting the prediction to a sequence of integers
    predicted_seq = np.argmax(prediction, axis=-1)[0]

    # Reverse tokenization (converting sequences back to words)
    input_text = translate_sequence(input_seq[0], tokenizer_en)
    predicted_text = translate_sequence(predicted_seq, tokenizer_fr)
    ground_truth_text = translate_sequence(testY[index].flatten(), tokenizer_fr)

    print("Input (English):", input_text)
    print("Predicted (French):", predicted_text)
    print("Ground Truth (French):", ground_truth_text)
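
The decoding performed by the two helpers above can be sketched on a toy vocabulary: take the most probable token id at each timestep, then map ids back to words. The names `argmax`, `decode` and `toy_index_word` are hypothetical. Unlike `translate_sequence`, this sketch drops padding positions before joining, which avoids the doubled spaces visible in some printed predictions.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

def decode(prob_rows, index_word):
    """Greedy per-timestep decoding: most probable id, then id -> word."""
    ids = [argmax(row) for row in prob_rows]
    words = [index_word.get(i, '') for i in ids]   # id 0 (padding) has no word
    return ' '.join(w for w in words if w)

toy_index_word = {1: 'le', 2: 'chat', 3: 'dort'}
toy_prediction = [
    [0.10, 0.60, 0.20, 0.10],  # -> id 1
    [0.10, 0.20, 0.60, 0.10],  # -> id 2
    [0.90, 0.05, 0.03, 0.02],  # -> id 0 (padding)
]

print(decode(toy_prediction, toy_index_word))  # le chat
```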

In [40]:
# Model Section - First Model RNN

def simple_rnn(tokenizer_en, tokenizer_fr):

  # Define structure of the model
  model = Sequential()
  model.add(Embedding(input_dim=len(tokenizer_en.word_index) + 1, output_dim=64, input_length=max_len))
  model.add(SimpleRNN(64, return_sequences=True))
  model.add(Dense(len(tokenizer_fr.word_index) + 1, activation='softmax'))

  # Compile the model
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
  return model
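
The `Embedding` layer above does not set `mask_zero=True`, so padded positions (id 0) count toward both the loss and the reported accuracy. With every sentence padded to 121 tokens, correctly predicting padding may dominate that number. A minimal sketch of the difference, using a hypothetical `token_accuracy` helper on plain lists:

```python
def token_accuracy(y_true, y_pred, pad_id=None):
    """Fraction of matching tokens; positions whose target equals pad_id are skipped."""
    correct = total = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        for t, p in zip(true_seq, pred_seq):
            if pad_id is not None and t == pad_id:
                continue
            total += 1
            correct += int(t == p)
    return correct / total if total else 0.0

toy_true = [[5, 8, 2, 0, 0, 0, 0, 0]]   # three real tokens, five padding tokens
toy_pred = [[5, 1, 1, 0, 0, 0, 0, 0]]   # only the first real token is right

raw_accuracy = token_accuracy(toy_true, toy_pred)
masked_accuracy = token_accuracy(toy_true, toy_pred, pad_id=0)

print(raw_accuracy)               # 0.75
print(round(masked_accuracy, 3))  # 0.333
```

Comparing the two values gives a more honest picture of translation quality than the unmasked metric alone.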

In [41]:
%%time

# Testing simple RNN

simple_rnn_instance = simple_rnn(tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)
simple_rnn_instance.fit(trainX, trainY, epochs=10, validation_data=(testX, testY), batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7d0229844070>

In [44]:
# Predicting with simple RNN (First 3 Test Samples)
for i in range(3):
    predict_and_compare(index=i, testX=testX, model=simple_rnn_instance, tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

Input (English): you spend only a portion of your year in morocco
Predicted (French): le que a a de de de de de
Ground Truth (French): vous ne passez quune partie de lannee au maroc
Input (English): what do you do the rest of the time
Predicted (French): le le le la la
Ground Truth (French): que faitesvous le reste du temps
Input (English): what do you do to relax
Predicted (French): le le le la de
Ground Truth (French): comment vous detendezvous
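
The ground-truth strings above contain forms such as `quune`, `lannee` and `faitesvous`. These come from the cleaning step: `no_punct=True` removes apostrophes and hyphens, and `to_ascii=True` strips accents. The sketch below approximates that behaviour with the standard library only. `approx_clean` is a hypothetical helper and not the `clean-text` implementation, which handles many more cases.

```python
import string
import unicodedata

def approx_clean(text: str) -> str:
    """Lowercase, strip accents, drop ASCII punctuation, collapse whitespace."""
    text = text.lower()
    # Decompose accented characters, then drop everything that is not ASCII
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Remove punctuation without inserting a replacement character
    no_punct = ascii_only.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())

print(approx_clean("qu'une partie de l'année"))  # quune partie de lannee
print(approx_clean("Que faites-vous ?"))         # que faitesvous
```

Because words are fused rather than split, tokens like `quune` enter the French vocabulary as separate entries, which may contribute to its larger size.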


In [45]:
# Model Section - First Model GRU

def simple_gru(tokenizer_en, tokenizer_fr):

  # Define structure of the model
  model = Sequential()
  model.add(Embedding(input_dim=len(tokenizer_en.word_index) + 1, output_dim=64, input_length=max_len))
  model.add(GRU(64, return_sequences=True))
  model.add(Dense(len(tokenizer_fr.word_index) + 1, activation='softmax'))

  # Compile the model
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
  return model

In [46]:
%%time
# Testing simple GRU

simple_gru_instance = simple_gru(tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)
simple_gru_instance.fit(trainX, trainY, epochs=10, validation_data=(testX, testY), batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Input (English): you spend only a portion of your year in morocco
Predicted (French): je  de de de de
Ground Truth (French): vous ne passez quune partie de lannee au maroc
Input (English): what do you do the rest of the time
Predicted (French): il de de
Ground Truth (French): que faitesvous le reste du temps
Input (English): what do you do to relax
Predicted (French): il de de
Ground Truth (French): comment vous detendezvous


In [None]:
# Predicting with simple GRU (First 3 Test Samples)
for i in range(3):
    predict_and_compare(index=i, testX=testX, model=simple_gru_instance, tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

In [47]:
# Model Section - First Model LSTM

def simple_lstm(tokenizer_en, tokenizer_fr):

  # Define structure of the model
  model = Sequential()
  model.add(Embedding(input_dim=len(tokenizer_en.word_index) + 1, output_dim=64, input_length=max_len))
  model.add(LSTM(64, return_sequences=True))
  model.add(Dense(len(tokenizer_fr.word_index) + 1, activation='softmax'))

  # Compile the model
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
  return model
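
The three models in this approach differ only in the recurrent layer. A rough comparison of their sizes, using the standard parameter-count formulas and assuming the default Keras `GRU` configuration (`reset_after=True`, which adds a second bias vector); the helper names are hypothetical:

```python
def simple_rnn_params(units, input_dim):
    # One weight block: input kernel + recurrent kernel + bias
    return units * (units + input_dim + 1)

def lstm_params(units, input_dim):
    # Four blocks: input, forget and output gates plus the candidate cell state
    return 4 * units * (units + input_dim + 1)

def gru_params(units, input_dim, reset_after=True):
    # Three blocks: update gate, reset gate, candidate state
    biases = 2 if reset_after else 1
    return 3 * units * (units + input_dim + biases)

# 64 units fed by 64-dimensional embeddings, as in the models above
print(simple_rnn_params(64, 64))  # 8256
print(gru_params(64, 64))         # 24960
print(lstm_params(64, 64))        # 33024
```

In all three models the recurrent layer is small compared with the embedding and output layers, whose sizes scale with the vocabulary.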

In [48]:
%%time
# Testing simple LSTM

simple_lstm_instance = simple_lstm(tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)
simple_lstm_instance.fit(trainX, trainY, epochs=10, validation_data=(testX, testY), batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Input (English): you spend only a portion of your year in morocco
Predicted (French): il  de de de de
Ground Truth (French): vous ne passez quune partie de lannee au maroc
Input (English): what do you do the rest of the time
Predicted (French): il
Ground Truth (French): que faitesvous le reste du temps
Input (English): what do you do to relax
Predicted (French): il
Ground Truth (French): comment vous detendezvous


In [None]:
# Predicting with simple LSTM (First 3 Test Samples)
for i in range(3):
    predict_and_compare(index=i, testX=testX, model=simple_lstm_instance, tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

## Model Approach 2: Bidirectional RNN, GRU and LSTM

In [50]:
# Model Section - Bidirectional RNN

def bd_rnn(tokenizer_en, tokenizer_fr):

  # Define structure of the model
  model = Sequential()
  model.add(Embedding(input_dim=len(tokenizer_en.word_index) + 1, output_dim=64, input_length=max_len))
  model.add(Bidirectional(SimpleRNN(64, return_sequences=True)))
  model.add(Dense(len(tokenizer_fr.word_index) + 1, activation='softmax'))

  # Compile the model
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
  return model
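
`Bidirectional` runs the wrapped layer once left-to-right and once right-to-left, then (with the default `merge_mode='concat'`) concatenates both outputs at each timestep, so the 64-unit layer above emits 128 features per token. A toy sketch of the idea, using a running sum in place of a real recurrent cell (all names are hypothetical):

```python
def run_rnn(inputs, step, initial=0.0):
    """Apply `step(state, x)` along the sequence and collect the state at every timestep."""
    state, outputs = initial, []
    for x in inputs:
        state = step(state, x)
        outputs.append(state)
    return outputs

def bidirectional(inputs, step):
    forward = run_rnn(inputs, step)
    # Process the reversed sequence, then flip back so timesteps line up
    backward = run_rnn(inputs[::-1], step)[::-1]
    return list(zip(forward, backward))

def running_sum(state, x):
    return state + x

toy_outputs = bidirectional([1.0, 2.0, 3.0], running_sum)
print(toy_outputs)  # [(1.0, 6.0), (3.0, 5.0), (6.0, 3.0)]
```

Each position now carries a summary of everything to its left and everything to its right, which matters for translation because word order differs between languages.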

In [51]:
%%time
# Testing Bi-Directional RNN

bd_rnn_instance = bd_rnn(tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)
bd_rnn_instance.fit(trainX, trainY, epochs=10, validation_data=(testX, testY), batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10




Input (English): you spend only a portion of your year in morocco
Predicted (French): il le le le le
Ground Truth (French): vous ne passez quune partie de lannee au maroc
Input (English): what do you do the rest of the time
Predicted (French): le le le le
Ground Truth (French): que faitesvous le reste du temps
Input (English): what do you do to relax
Predicted (French): le le le
Ground Truth (French): comment vous detendezvous


In [None]:
# Predicting with Bi-Directional RNN (First 3 Test Samples)
for i in range(3):
    predict_and_compare(index=i, testX=testX, model=bd_rnn_instance, tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

In [52]:
# Model Section - Bidirectional GRU

def bd_gru(tokenizer_en, tokenizer_fr):

  # Define structure of the model
  model = Sequential()
  model.add(Embedding(input_dim=len(tokenizer_en.word_index) + 1, output_dim=64, input_length=max_len))
  model.add(Bidirectional(GRU(64, return_sequences=True)))
  model.add(Dense(len(tokenizer_fr.word_index) + 1, activation='softmax'))

  # Compile the model
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
  return model

In [None]:
%%time
# Testing Bi-Directional GRU

bd_gru_instance = bd_gru(tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)
bd_gru_instance.fit(trainX, trainY, epochs=10, validation_data=(testX, testY), batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

In [None]:
# Predicting with Bi-Directional GRU (First 3 Test Samples)
for i in range(3):
    predict_and_compare(index=i, testX=testX, model=bd_gru_instance, tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

In [None]:
# Model Section - Bidirectional LSTM

def bd_lstm(tokenizer_en, tokenizer_fr):

  # Define structure of the model
  model = Sequential()
  model.add(Embedding(input_dim=len(tokenizer_en.word_index) + 1, output_dim=64, input_length=max_len))
  model.add(Bidirectional(LSTM(64, return_sequences=True)))
  model.add(Dense(len(tokenizer_fr.word_index) + 1, activation='softmax'))

  # Compile the model
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
  return model

In [None]:
%%time
# Testing Bi-Directional LSTM

bd_lstm_instance = bd_lstm(tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)
bd_lstm_instance.fit(trainX, trainY, epochs=10, validation_data=(testX, testY), batch_size=64)

In [None]:
# Predicting with Bi-Directional LSTM (First 3 Test Samples)
for i in range(3):
    predict_and_compare(index=i, testX=testX, model=bd_lstm_instance, tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

## Model Approach 3: Bidirectional RNN, GRU and LSTM - More Complex Architecture

In [None]:
def bd_comp_rnn(tokenizer_en, tokenizer_fr):

  # Define structure of the model
  model = Sequential()
  model.add(Embedding(input_dim=len(tokenizer_en.word_index) + 1, output_dim=64, input_length=max_len))
  model.add(Bidirectional(SimpleRNN(64, return_sequences=True)))
  model.add(Dropout(0.2))
  model.add(Bidirectional(SimpleRNN(32, return_sequences=True)))
  model.add(Dropout(0.2))
  model.add(Bidirectional(SimpleRNN(32, return_sequences=True)))
  model.add(Dropout(0.2))
  model.add(Dense(len(tokenizer_fr.word_index) + 1, activation='softmax'))

  # Compile the model
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

  return model
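
The `Dropout(0.2)` layers in this stack zero roughly 20% of activations during training and scale the survivors by `1 / (1 - 0.2)`, so the expected activation stays the same; at inference the layer passes values through unchanged. A minimal sketch of that behaviour on a plain list, with a hypothetical `dropout` helper and a seeded generator for repeatability:

```python
import random

def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training:
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)
toy_activations = [1.0] * 10

train_out = dropout(toy_activations, rate=0.2, rng=rng)
infer_out = dropout(toy_activations, rate=0.2, rng=rng, training=False)

print(train_out)   # a mix of 0.0 and 1.25
print(infer_out)   # unchanged
```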

In [None]:
%%time
# Testing Bi-Directional Complex RNN

bd_comp_rnn_instance = bd_comp_rnn(tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)
bd_comp_rnn_instance.fit(trainX, trainY, epochs=10, validation_data=(testX, testY), batch_size=64)

In [None]:
# Predicting Bi-Directional Complex RNN (First 3 Test Samples)
for i in range(3):
    predict_and_compare(index=i, testX=testX, model=bd_comp_rnn_instance, tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

In [None]:
def bd_comp_gru(tokenizer_en, tokenizer_fr):

  # Define structure of the model
  model = Sequential()
  model.add(Embedding(input_dim=len(tokenizer_en.word_index) + 1, output_dim=64, input_length=max_len))
  model.add(Bidirectional(GRU(64, return_sequences=True)))
  model.add(Dropout(0.2))
  model.add(Bidirectional(GRU(32, return_sequences=True)))
  model.add(Dropout(0.2))
  model.add(Bidirectional(GRU(32, return_sequences=True)))
  model.add(Dropout(0.2))
  model.add(Dense(len(tokenizer_fr.word_index) + 1, activation='softmax'))

  # Compile the model
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

  return model

In [None]:
%%time
# Testing Bi-Directional Complex GRU

bd_comp_gru_instance = bd_comp_gru(tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)
bd_comp_gru_instance.fit(trainX, trainY, epochs=10, validation_data=(testX, testY), batch_size=64)

In [None]:
# Predicting Bi-Directional Complex GRU (First 3 Test Samples)
for i in range(3):
    predict_and_compare(index=i, testX=testX, model=bd_comp_gru_instance, tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

In [27]:
def bd_comp_lstm(tokenizer_en, tokenizer_fr):

  # Define structure of the model
  model = Sequential()
  model.add(Embedding(input_dim=len(tokenizer_en.word_index) + 1, output_dim=64, input_length=max_len))
  model.add(Bidirectional(LSTM(64, return_sequences=True)))
  model.add(Dropout(0.2))
  model.add(Bidirectional(LSTM(32, return_sequences=True)))
  model.add(Dropout(0.2))
  model.add(Bidirectional(LSTM(32, return_sequences=True)))
  model.add(Dropout(0.2))
  model.add(Dense(len(tokenizer_fr.word_index) + 1, activation='softmax'))

  # Compile the model
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

  return model

In [28]:
%%time
# Testing Bi-Directional Complex LSTM

bd_comp_lstm_instance = bd_comp_lstm(tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)
bd_comp_lstm_instance.fit(trainX, trainY, epochs=10, validation_data=(testX, testY), batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: total: 31.4 s
Wall time: 2min 38s


<keras.callbacks.History at 0x2bdbd09de50>

In [29]:
# Predicting Bi-Directional Complex LSTM (First 3 Test Samples)
for i in range(3):
    predict_and_compare(index=i, testX=testX, model=bd_comp_lstm_instance, tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

Input (English): do you believe uganda has a free press
Predicted (French): le de de de de
Ground Truth (French): croyezvous que louganda a une presse libre
Input (English): a for the moment the traditional media are basically safe at least until the end of the commonwealth heads of government meeting in november
Predicted (French): le de de de de de de de de de de de de de de de de de de de
Ground Truth (French): a pour le moment les media traditionnels sont saufs au moins jusqua la fin de la reunion des chefs de gouvernement du commonwealth en novembre
Input (English): but the cracks have already started to appear with upcountry radio stations in places like the toro region in western uganda being harassed for all kinds of things
Predicted (French): le de de de de de de de de de de de de de de de de de de de de de de de de de de
Ground Truth (French): mais les failles ont deja commence a apparaitre les stations de radio en place dans la region de toro dans louest de louganda sont har

## Model Approach 4: Encoder-Decoder with RNN, GRU and LSTM

In [16]:
# The encoder-decoder models take two inputs, so they need their own testing function

def predict_and_compare_auto_en(index, testX, testY, model, tokenizer_en, tokenizer_fr):
    """ Predicts translation for a given index in the test set and compares with the ground truth. """
    input_seq_X = testX[index:index+1]
    input_seq_Y = testY[index:index+1]
    prediction = model.predict([input_seq_X, input_seq_Y])

    # Converting the prediction to a sequence of integers
    predicted_seq = np.argmax(prediction, axis=-1)[0]

    # Reverse tokenization (converting sequences back to words)
    input_text = translate_sequence(input_seq_X[0], tokenizer_en)
    predicted_text = translate_sequence(predicted_seq, tokenizer_fr)
    ground_truth_text = translate_sequence(testY[index].flatten(), tokenizer_fr)

    print("Input (English):", input_text)
    print("Predicted (French):", predicted_text)
    print("Ground Truth (French):", ground_truth_text)
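
The function above passes the ground-truth French sequence to the decoder at prediction time, and the training cells in this section feed the same French sequence as both decoder input and target. Under that setup the decoder can read each target token from its own input, so the printed predictions are not translations produced from the English sentence alone. A common alternative is teacher forcing with the decoder input shifted one step to the right. A minimal sketch, assuming a start-token id that the notebook's tokenizer does not currently define (`shift_right` and `START_ID` are hypothetical):

```python
START_ID = 1  # hypothetical id reserved for a <start> token

def shift_right(target_ids, start_id=START_ID):
    """Decoder input: the start token followed by the target without its last token."""
    return [start_id] + list(target_ids[:-1])

toy_target = [12, 7, 31, 0, 0]            # padded target sequence
toy_decoder_input = shift_right(toy_target)

print(toy_decoder_input)  # [1, 12, 7, 31, 0]
```

At timestep `t` the decoder then sees the token from position `t - 1` and has to predict position `t`, rather than copying its input.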

In [30]:
def encoder_decoder_rnn(tokenizer_en, tokenizer_fr):

  # Encoder
  encoder_inputs = Input(shape=(None,))
  enc_emb = Embedding(input_dim=len(tokenizer_en.word_index) + 1, output_dim=64)(encoder_inputs)
  encoder_rnn = SimpleRNN(64, return_state=True)
  encoder_outputs, state = encoder_rnn(enc_emb)

  # Decoder
  decoder_inputs = Input(shape=(None,))
  dec_emb_layer = Embedding(input_dim=len(tokenizer_fr.word_index) + 1, output_dim=64)
  dec_emb = dec_emb_layer(decoder_inputs)
  decoder_rnn = SimpleRNN(64, return_sequences=True, return_state=True)
  decoder_outputs, _= decoder_rnn(dec_emb, initial_state=state)
  decoder_dense = Dense(len(tokenizer_fr.word_index) + 1, activation='softmax')
  decoder_outputs = decoder_dense(decoder_outputs)

  # Build Final Model
  model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

  # Compile the model
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

  return model

In [31]:
%%time
# Testing Encoder-Decoder RNN

en_dec_rnn_instance = encoder_decoder_rnn(tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

# Summary of the model
en_dec_rnn_instance.summary()
en_dec_rnn_instance.fit([trainX, trainY], np.expand_dims(trainY, -1), epochs=10, validation_split=0.2, batch_size=64)

Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_7 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 input_8 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embedding_7 (Embedding)        (None, None, 64)     1063680     ['input_7[0][0]']                
                                                                                                  
 embedding_8 (Embedding)        (None, None, 64)     1329152     ['input_8[0][0]']                
                                                                                            

<keras.callbacks.History at 0x2be8dcafdc0>

In [32]:
# Predicting Encoder-Decoder RNN (First 3 Test Samples)
for i in range(3):
    predict_and_compare_auto_en(index=i, testX=testX, testY=testY, model=en_dec_rnn_instance, tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

Input (English): do you believe uganda has a free press
Predicted (French): voici que  a une
Ground Truth (French): croyezvous que louganda a une presse libre
Input (English): a for the moment the traditional media are basically safe at least until the end of the commonwealth heads of government meeting in november
Predicted (French): a les le region les region  sont  au ete the la monde de la region de region de ete du  de
Ground Truth (French): a pour le moment les media traditionnels sont saufs au moins jusqua la fin de la reunion des chefs de gouvernement du commonwealth en novembre
Input (English): but the cracks have already started to appear with upcountry radio stations in places like the toro region in western uganda being harassed for all kinds of things
Predicted (French): il les  ont pays  a  les region de ces de pays dans la region de  dans  de  sont  les ces  de ete
Ground Truth (French): mais les failles ont deja commence a apparaitre les stations de radio en place dans 
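
To translate without access to the ground truth, an encoder-decoder is usually decoded step by step: start from a start token, predict the next token, append it, and repeat until an end token or a length limit. The sketch below shows greedy decoding against a stand-in model. `next_token_probs`, the start and end ids, and `toy_model` are all hypothetical and not part of the notebook.

```python
def greedy_decode(next_token_probs, start_id, end_id, max_len):
    """Repeatedly append the most probable next token until `end_id` or `max_len`."""
    prefix = [start_id]
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        next_id = max(range(len(probs)), key=probs.__getitem__)
        if next_id == end_id:
            break
        prefix.append(next_id)
    return prefix[1:]  # drop the start token

def toy_model(prefix):
    # Deterministic stand-in for a trained decoder: emits 3, then 4, then the end token 2
    script = {1: 3, 3: 4, 4: 2}
    probs = [0.0] * 5
    probs[script[prefix[-1]]] = 1.0
    return probs

toy_decoded = greedy_decode(toy_model, start_id=1, end_id=2, max_len=10)
print(toy_decoded)  # [3, 4]
```

With a Keras model, `next_token_probs` would wrap a call to the decoder that returns the distribution for the last position of the prefix.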

In [23]:
def encoder_decoder_gru(tokenizer_en, tokenizer_fr):

  # Encoder
  encoder_inputs = Input(shape=(None,))
  enc_emb = Embedding(input_dim=len(tokenizer_en.word_index) + 1, output_dim=64)(encoder_inputs)
  encoder_gru = GRU(64, return_state=True)
  encoder_outputs, state = encoder_gru(enc_emb)

  # Decoder
  decoder_inputs = Input(shape=(None,))
  dec_emb_layer = Embedding(input_dim=len(tokenizer_fr.word_index) + 1, output_dim=64)
  dec_emb = dec_emb_layer(decoder_inputs)
  decoder_gru = GRU(64, return_sequences=True, return_state=True)
  decoder_outputs, _= decoder_gru(dec_emb, initial_state=state)
  decoder_dense = Dense(len(tokenizer_fr.word_index) + 1, activation='softmax')
  decoder_outputs = decoder_dense(decoder_outputs)

  # Build Final Model
  model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

  # Compile the model
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

  return model

In [24]:
%%time
# Testing Encoder-Decoder GRU

en_dec_gru_instance = encoder_decoder_gru(tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

# Summary of the model
en_dec_gru_instance.summary()
en_dec_gru_instance.fit([trainX, trainY], np.expand_dims(trainY, -1), epochs=10, validation_split=0.2, batch_size=64)

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_5 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 input_6 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embedding_4 (Embedding)        (None, None, 64)     1063680     ['input_5[0][0]']                
                                                                                                  
 embedding_5 (Embedding)        (None, None, 64)     1329152     ['input_6[0][0]']                
                                                                                            

<keras.callbacks.History at 0x2bd34c18670>

In [26]:
# Predicting Encoder-Decoder GRU (First 3 Test Samples)
for i in range(3):
    predict_and_compare_auto_en(index=i, testX=testX, testY=testY, model=en_dec_gru_instance, tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

Input (English): do you believe uganda has a free press
Predicted (French): blogueur que blogueur a une blogueur blogueur
Ground Truth (French): croyezvous que louganda a une presse libre
Input (English): a for the moment the traditional media are basically safe at least until the end of the commonwealth heads of government meeting in november
Predicted (French): a le le blogueur les blogueur blogueur un est un ete ces la ete de la blogueur et blogueur de du du de en ete
Ground Truth (French): a pour le moment les media traditionnels sont saufs au moins jusqua la fin de la reunion des chefs de gouvernement du commonwealth en novembre
Input (English): but the cracks have already started to appear with upcountry radio stations in places like the toro region in western uganda being harassed for all kinds of things
Predicted (French): les les blogueur les blogueur nous a ete les blogueur de blogueur en se pas la blogueur de de pas blogueur de ete un de la blog pas de leurs
Ground Truth (Fr

In [11]:
def encoder_decoder_lstm(tokenizer_en, tokenizer_fr):

  # Encoder
  encoder_inputs = Input(shape=(None,))
  enc_emb = Embedding(input_dim=len(tokenizer_en.word_index) + 1, output_dim=64)(encoder_inputs)
  encoder_lstm = LSTM(64, return_state=True)
  encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
  encoder_states = [state_h, state_c]

  # Decoder
  decoder_inputs = Input(shape=(None,))
  dec_emb_layer = Embedding(input_dim=len(tokenizer_fr.word_index) + 1, output_dim=64)
  dec_emb = dec_emb_layer(decoder_inputs)
  decoder_lstm = LSTM(64, return_sequences=True, return_state=True)
  decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)
  decoder_dense = Dense(len(tokenizer_fr.word_index) + 1, activation='softmax')
  decoder_outputs = decoder_dense(decoder_outputs)

  # Build Final Model
  model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

  # Compile the model
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

  return model
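
Unlike the SimpleRNN and GRU versions, the LSTM encoder hands two states to the decoder: the hidden state `state_h` and the cell state `state_c`. A scalar sketch of one LSTM step shows where both come from. The weight layout and all names here are hypothetical simplifications; a real layer uses matrices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, w):
    """One scalar LSTM step. `w` maps gate name -> (input weight, recurrent weight, bias)."""
    def gate(name, activation):
        w_x, w_h, b = w[name]
        return activation(w_x * x + w_h * h + b)

    i = gate("input", sigmoid)         # how much new information to write
    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # proposed new content

    c_new = f * c + i * g              # cell state: long-term memory
    h_new = o * math.tanh(c_new)       # hidden state: per-step output
    return h_new, c_new

# With all-zero weights every gate is 0.5 and the candidate is 0
zero_weights = {name: (0.0, 0.0, 0.0) for name in ("input", "forget", "output", "candidate")}
h1, c1 = lstm_step(x=1.0, h=0.0, c=2.0, w=zero_weights)
print(c1)  # 1.0
```

Both values are needed to resume the recurrence, which is why `initial_state=encoder_states` receives a two-element list in the decoder.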


In [12]:
%%time
# Testing Encoder-Decoder LSTM

en_dec_lstm = encoder_decoder_lstm(tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

# Summary of the model
en_dec_lstm.summary()
en_dec_lstm.fit([trainX, trainY], np.expand_dims(trainY, -1), epochs=10, validation_split=0.2, batch_size=64)

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embedding (Embedding)          (None, None, 64)     1063680     ['input_1[0][0]']                
                                                                                                  
 embedding_1 (Embedding)        (None, None, 64)     1329152     ['input_2[0][0]']                
                                                                                              

<keras.callbacks.History at 0x2bc86071160>

In [17]:
# Predicting Encoder-Decoder LSTM (First 3 Test Samples)
for i in range(3):
    predict_and_compare_auto_en(index=i, testX=testX, testY=testY, model=en_dec_lstm, tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

Input (English): do you believe uganda has a free press
Predicted (French): nous de leur de de
Ground Truth (French): croyezvous que louganda a une presse libre
Input (English): a for the moment the traditional media are basically safe at least until the end of the commonwealth heads of government meeting in november
Predicted (French): de de de ete de leur  de  de et  de  de de  de ete de de de  de
Ground Truth (French): a pour le moment les media traditionnels sont saufs au moins jusqua la fin de la reunion des chefs de gouvernement du commonwealth en novembre
Input (English): but the cracks have already started to appear with upcountry radio stations in places like the toro region in western uganda being harassed for all kinds of things
Predicted (French): le le leur de aux ete de  de leur de  de  de de  de  de  de  de  de   de
Ground Truth (French): mais les failles ont deja commence a apparaitre les stations de radio en place dans la region de toro dans louest de louganda sont har

## Model Approach 5: Encoder-Decoders with RNN, GRU and LSTM - More Complex (Stacked) Architecture

In [33]:
def encoder_decoder_comp_rnn(tokenizer_en, tokenizer_fr):

  # Encoder
  encoder_inputs = Input(shape=(None,))
  enc_emb = Embedding(input_dim=len(tokenizer_en.word_index) + 1, output_dim=64)(encoder_inputs)

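  # Stacking multiple RNN layers in the encoder.
  # Note: `state` is overwritten, so only the final state of the top layer
  # (encoder_rnn2) is kept; it later initialises both decoder layers.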
  encoder_rnn1 = SimpleRNN(64, return_sequences=True, return_state=True)
  encoder_outputs, state = encoder_rnn1(enc_emb)
  encoder_rnn2 = SimpleRNN(64, return_state=True)
  encoder_outputs, state = encoder_rnn2(encoder_outputs)

  # Decoder
  decoder_inputs = Input(shape=(None,))
  dec_emb_layer = Embedding(input_dim=len(tokenizer_fr.word_index) + 1, output_dim=64)
  dec_emb = dec_emb_layer(decoder_inputs)

  # Stacking multiple RNN layers in the decoder
  decoder_rnn1 = SimpleRNN(64, return_sequences=True, return_state=True)
  decoder_outputs, _ = decoder_rnn1(dec_emb, initial_state=state)
  decoder_rnn2 = SimpleRNN(64, return_sequences=True, return_state=True)
  decoder_outputs, _ = decoder_rnn2(decoder_outputs, initial_state=state)

  decoder_dense = Dense(len(tokenizer_fr.word_index) + 1, activation='softmax')
  decoder_outputs = decoder_dense(decoder_outputs)

  # Define the model
  model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

  # Compile the model
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

  return model

In [34]:
%%time
# Training Encoder-Decoder Complex RNN

encoder_decoder_comp_rnn_instance = encoder_decoder_comp_rnn(tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

# Summary of the model
encoder_decoder_comp_rnn_instance.summary()
encoder_decoder_comp_rnn_instance.fit([trainX, trainY], np.expand_dims(trainY, -1), epochs=10, validation_split=0.2, batch_size=64)

Model: "model_3"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_9 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embedding_9 (Embedding)        (None, None, 64)     1063680     ['input_9[0][0]']                
                                                                                                  
 input_10 (InputLayer)          [(None, None)]       0           []                               
                                                                                                  
 simple_rnn_3 (SimpleRNN)       [(None, None, 64),   8256        ['embedding_9[0][0]']            
                                 (None, 64)]                                                

<keras.callbacks.History at 0x2be9010f460>

In [35]:
# Testing Encoder-Decoder Complex RNN (First 3 Test Samples)
for i in range(3):
    predict_and_compare_auto_en(index=i, testX=testX, testY=testY, model=encoder_decoder_comp_rnn_instance, tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

Input (English): do you believe uganda has a free press
Predicted (French): the que sont de la ete sont
Ground Truth (French): croyezvous que louganda a une presse libre
Input (English): a for the moment the traditional media are basically safe at least until the end of the commonwealth heads of government meeting in november
Predicted (French): le la la ne de blogueur sont que que de ete que de blog de la blog de blogueur de ete que que de blogueur
Ground Truth (French): a pour le moment les media traditionnels sont saufs au moins jusqua la fin de la reunion des chefs de gouvernement du commonwealth en novembre
Input (English): but the cracks have already started to appear with upcountry radio stations in places like the toro region in western uganda being harassed for all kinds of things
Predicted (French): le la ne de ne sont de blog de blogueur de blog de ete de la ete de ete de ete de blogueur sont que de un que de blog
Ground Truth (French): mais les failles ont deja commence a a

In [36]:
def encoder_decoder_comp_gru(tokenizer_en, tokenizer_fr):

  # Encoder
  encoder_inputs = Input(shape=(None,))
  enc_emb = Embedding(input_dim=len(tokenizer_en.word_index) + 1, output_dim=64)(encoder_inputs)

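  # Stacking multiple GRU layers in the encoder.
  # Note: `state` is overwritten, so only the final state of the top layer
  # (encoder_gru2) is kept; it later initialises both decoder layers.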
  encoder_gru1 = GRU(64, return_sequences=True, return_state=True)
  encoder_outputs, state = encoder_gru1(enc_emb)
  encoder_gru2 = GRU(64, return_state=True)
  encoder_outputs, state = encoder_gru2(encoder_outputs)

  # Decoder
  decoder_inputs = Input(shape=(None,))
  dec_emb_layer = Embedding(input_dim=len(tokenizer_fr.word_index) + 1, output_dim=64)
  dec_emb = dec_emb_layer(decoder_inputs)

  # Stacking multiple GRU layers in the decoder
  decoder_gru1 = GRU(64, return_sequences=True, return_state=True)
  decoder_outputs, _ = decoder_gru1(dec_emb, initial_state=state)
  decoder_gru2 = GRU(64, return_sequences=True, return_state=True)
  decoder_outputs, _ = decoder_gru2(decoder_outputs, initial_state=state)

  decoder_dense = Dense(len(tokenizer_fr.word_index) + 1, activation='softmax')
  decoder_outputs = decoder_dense(decoder_outputs)

  # Define the model
  model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

  # Compile the model
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

  return model

In [37]:
%%time
# Training Encoder-Decoder Complex GRU

encoder_decoder_comp_gru_instance = encoder_decoder_comp_gru(tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

# Summary of the model
encoder_decoder_comp_gru_instance.summary()
encoder_decoder_comp_gru_instance.fit([trainX, trainY], np.expand_dims(trainY, -1), epochs=10, validation_split=0.2, batch_size=64)

Model: "model_4"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_11 (InputLayer)          [(None, None)]       0           []                               
                                                                                                  
 embedding_11 (Embedding)       (None, None, 64)     1063680     ['input_11[0][0]']               
                                                                                                  
 input_12 (InputLayer)          [(None, None)]       0           []                               
                                                                                                  
 gru_3 (GRU)                    [(None, None, 64),   24960       ['embedding_11[0][0]']           
                                 (None, 64)]                                                

<keras.callbacks.History at 0x2be95ac09a0>

In [38]:
# Testing Encoder-Decoder Complex GRU (First 3 Test Samples)
for i in range(3):
    predict_and_compare_auto_en(index=i, testX=testX, testY=testY, model=encoder_decoder_comp_gru_instance, tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

Input (English): do you believe uganda has a free press
Predicted (French): le de de de de de de
Ground Truth (French): croyezvous que louganda a une presse libre
Input (English): a for the moment the traditional media are basically safe at least until the end of the commonwealth heads of government meeting in november
Predicted (French): le le de de de de de de de de de de de de de de de de de de de de de de de
Ground Truth (French): a pour le moment les media traditionnels sont saufs au moins jusqua la fin de la reunion des chefs de gouvernement du commonwealth en novembre
Input (English): but the cracks have already started to appear with upcountry radio stations in places like the toro region in western uganda being harassed for all kinds of things
Predicted (French): le le de de de de de de de de de de de de de de de de de de de de de de de de de  de
Ground Truth (French): mais les failles ont deja commence a apparaitre les stations de radio en place dans la region de toro dans lo

In [39]:
def encoder_decoder_comp_lstm(tokenizer_en, tokenizer_fr):

    # Encoder
    encoder_inputs = Input(shape=(None,))
    enc_emb = Embedding(input_dim=len(tokenizer_en.word_index) + 1, output_dim=64)(encoder_inputs)

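    # Stacking multiple LSTM layers in the encoder.
    # Unlike the RNN and GRU variants, the final (h, c) states of each encoder
    # layer are kept and used to initialise the matching decoder layer.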
    encoder_lstm1 = LSTM(64, return_sequences=True, return_state=True)
    encoder_outputs, state_h1, state_c1 = encoder_lstm1(enc_emb)
    encoder_lstm2 = LSTM(64, return_state=True)
    encoder_outputs, state_h2, state_c2 = encoder_lstm2(encoder_outputs)

    # Decoder
    decoder_inputs = Input(shape=(None,))
    dec_emb_layer = Embedding(input_dim=len(tokenizer_fr.word_index) + 1, output_dim=64)
    dec_emb = dec_emb_layer(decoder_inputs)

    # Stacking multiple LSTM layers in the decoder
    decoder_lstm1 = LSTM(64, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm1(dec_emb, initial_state=[state_h1, state_c1])
    decoder_lstm2 = LSTM(64, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm2(decoder_outputs, initial_state=[state_h2, state_c2])

    decoder_dense = Dense(len(tokenizer_fr.word_index) + 1, activation='softmax')
    decoder_outputs = decoder_dense(decoder_outputs)

    # Define the model
    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

    # Compile the model
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    return model

In [40]:
%%time
# Training Encoder-Decoder Complex LSTM

encoder_decoder_comp_lstm_instance = encoder_decoder_comp_lstm(tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

# Summary of the model
encoder_decoder_comp_lstm_instance.summary()
encoder_decoder_comp_lstm_instance.fit([trainX, trainY], np.expand_dims(trainY, -1), epochs=10, validation_split=0.2, batch_size=64)

Model: "model_5"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_13 (InputLayer)          [(None, None)]       0           []                               
                                                                                                  
 input_14 (InputLayer)          [(None, None)]       0           []                               
                                                                                                  
 embedding_13 (Embedding)       (None, None, 64)     1063680     ['input_13[0][0]']               
                                                                                                  
 embedding_14 (Embedding)       (None, None, 64)     1329152     ['input_14[0][0]']               
                                                                                            

<keras.callbacks.History at 0x2be9fc932b0>

In [41]:
# Testing Encoder-Decoder Complex LSTM (First 3 Test Samples)
for i in range(3):
    predict_and_compare_auto_en(index=i, testX=testX, testY=testY, model=encoder_decoder_comp_lstm_instance, tokenizer_en=tokenizer_en, tokenizer_fr=tokenizer_fr)

Input (English): do you believe uganda has a free press
Predicted (French): il de  de de
Ground Truth (French): croyezvous que louganda a une presse libre
Input (English): a for the moment the traditional media are basically safe at least until the end of the commonwealth heads of government meeting in november
Predicted (French): de de de de de   de  de de de de  de de de de de de de de  de
Ground Truth (French): a pour le moment les media traditionnels sont saufs au moins jusqua la fin de la reunion des chefs de gouvernement du commonwealth en novembre
Input (English): but the cracks have already started to appear with upcountry radio stations in places like the toro region in western uganda being harassed for all kinds of things
Predicted (French): le de de de  de de  de  de  de de de de  de  de  de  de  de   de
Ground Truth (French): mais les failles ont deja commence a apparaitre les stations de radio en place dans la region de toro dans louest de louganda sont harcelees pour tout
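
**Quantifying the repetition:** the sample outputs above are dominated by a few frequent words such as `de` and `le`. As a rough, hedged way to put numbers on this, the sketch below computes two simple scores in plain Python for the first test sentence: the share of distinct tokens in a prediction, and a clipped unigram precision against the ground truth. The helper names `distinct_ratio` and `unigram_precision` are hypothetical, and these scores are only a crude stand-in for a proper metric such as BLEU.

```python
from collections import Counter

def distinct_ratio(sentence):
    """Fraction of tokens in the sentence that are distinct (1.0 = no repeats)."""
    tokens = sentence.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def unigram_precision(prediction, reference):
    """Share of predicted tokens that also occur in the reference (clipped counts)."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    # Each predicted token is credited at most as often as it appears in the reference
    matched = sum(min(count, ref_counts[token]) for token, count in pred_counts.items())
    return matched / total

# First test sentence, copied from the outputs printed in this section
reference = "croyezvous que louganda a une presse libre"
predictions = {
    "stacked RNN": "the que sont de la ete sont",
    "stacked GRU": "le de de de de de de",
    "stacked LSTM": "il de  de de",
}

for name, pred in predictions.items():
    print(f"{name}: distinct={distinct_ratio(pred):.2f}, "
          f"precision={unigram_precision(pred, reference):.2f}")
```

On this sentence the GRU output uses only two distinct tokens out of seven, and only the RNN output shares a word (`que`) with the ground truth.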

____________________________________________________________

In [None]:
# End of notebook