# Baseline Text to Text Translation : English to French

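This notebook trains a sequence-to-sequence (seq2seq) model for translation between English and French. This model will be our **baseline**, which we will then improve upon by adding attention and other features. Note that, as implemented below, the model takes French sentences as input and produces English sentences.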

---

## Import Required Libraries

We will start by importing the libraries we need for this project. You can install any missing libraries using the requirements.txt file provided or by running ``make install`` in the terminal.

In [1]:
from datasets import load_dataset

import numpy as np
import pandas as pd

from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, Bidirectional, RepeatVector, TimeDistributed
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint
from keras.models import load_model
from keras import optimizers

from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

import seaborn as sns
import matplotlib.pyplot as plt


We provide an in-depth analysis of the data in the ``exploratory_analysis.ipynb`` notebook, so we will not repeat any exploratory analysis here. Instead, we will focus on building our baseline model. Let's start by loading the dataset we will be using.

In [2]:
dataset = load_dataset("Nicolas-BZRD/Parallel_Global_Voices_English_French", split='train').to_pandas()
dataset.head()

Unnamed: 0,en,fr
0,Jamaica: “I am HIV”,Jamaïque : J’ai le VIH
1,"It's widely acknowledged, in the Caribbean and...","Il est largement reconnu, dans les Caraïbes et..."
2,"For this woman, however, photographed in the s...","Pour cette femme, cependant, photographiée dan..."
3,As Bacon writes on her blog:,Comme Bacon écrit sur son blog:
4,"“When I asked to take her picture, I suggested...",“Quand je lui ai demandé de la prendre en phot...


The actual data contains over 350,000 sentence-pairs. However, to speed up training for this notebook, we will only use a small portion of the data. 

In [3]:
# TODO : Use the whole dataset (but it's too big for my computer)
dataset = dataset.sample(n=1000, random_state=42)
print(dataset.shape)

# TODO : move inside the preprocessing step
en_len = dataset['en'].str.count(' ')
fr_len = dataset['fr'].str.count(' ')

MAX_LENGTH = 20

dataset = dataset[(en_len <= MAX_LENGTH) & (fr_len <= MAX_LENGTH)]
print(dataset.shape)

(1000, 2)
(563, 2)
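One detail of the filter above: ``str.count(' ')`` counts spaces, not words, so a sentence with exactly 20 spaces contains 21 whitespace-separated words. The sketch below illustrates this in plain Python; the helper name ``n_spaces_and_words`` is made up for this example and is not part of the notebook.

```python
def n_spaces_and_words(sentence):
    """Return (number of spaces, number of whitespace-separated words)."""
    return sentence.count(" "), len(sentence.split())

# 21 words joined by single spaces contain exactly 20 spaces
sentence = " ".join(["word"] * 21)
spaces, words = n_spaces_and_words(sentence)
print(spaces, words)  # 20 21
```

In other words, the filter keeps sentences of up to 21 words. Counting words directly (for example with ``str.split().str.len()``) would be one way to make the threshold match ``MAX_LENGTH`` exactly.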


## Text Pre-Processing

The text pre-processing steps will eventually be implemented in a class called ``TextPreprocessor``, which will clean and tokenize the text, convert it to integer sequences, and pad those sequences to a maximum length. That way we will be able to improve our models without copying and pasting the same code over and over again. In this notebook, the same steps are written as plain functions.

### Text to Sequence Conversion

To feed our data to a Seq2Seq model, we will have to convert both the input and the output sentences into integer sequences of fixed length. Check the exploratory data analysis notebook to see the distribution of the lengths of the sentences in the dataset. Based on that, we decided to fix the maximum length of each sentence to 20 since the average length of the sentences in the dataset is around 20.

We will use the ``Tokenizer`` class from the ``keras.preprocessing.text`` module to tokenize the text data and to convert the text to integer sequences. We will then use the ``pad_sequences`` function from the ``keras.preprocessing.sequence`` module to pad the sequences to the maximum length.
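To make the idea concrete, here is a minimal plain-Python sketch of what "text to padded integer sequence" means. It is not the Keras implementation: the function names ``fit_word_index`` and ``to_padded_sequence`` are hypothetical, punctuation is not stripped, and over-long sequences are cut at the end for simplicity (``pad_sequences`` truncates from the start by default).

```python
from collections import Counter

def fit_word_index(lines):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for line in lines for word in line.lower().split())
    # Index 0 is left unused so it can serve as the padding value.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequence(word_index, line, length):
    """Convert a line to integers, then pad with zeros at the end ('post')."""
    seq = [word_index[w] for w in line.lower().split() if w in word_index]
    seq = seq[:length]  # simplification: drop extra tokens from the end
    return seq + [0] * (length - len(seq))

lines = ["the cat sat", "the dog"]
word_index = fit_word_index(lines)
vocab_size = len(word_index) + 1  # +1 for the reserved padding index 0
print(word_index)
print(to_padded_sequence(word_index, "the dog sat", 5))  # [1, 4, 3, 0, 0]
```

The reserved index 0 is also why the vocabulary sizes computed below add 1 to the length of ``word_index``.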

In [4]:
max_sequence_length = 20

In [5]:
def tokenization(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

def encode_sequences(tokenizer, length, lines):
    seq = tokenizer.texts_to_sequences(lines)
    seq = pad_sequences(seq, maxlen=length, padding='post')
    return seq

def decode_sequences(tokenizer, sequence):
    text = tokenizer.sequences_to_texts([sequence])[0].replace('PAD', '').strip()
    return text

In [6]:
# Tokenize the English sentences
eng_tokenizer = tokenization(dataset["en"])
eng_vocab_size = len(eng_tokenizer.word_index) + 1

# Tokenize the French sentences
fr_tokenizer = tokenization(dataset["fr"])
fr_vocab_size = len(fr_tokenizer.word_index) + 1

In [7]:
print('English Vocabulary Size: %d' % eng_vocab_size)
print('French Vocabulary Size: %d' % fr_vocab_size)

English Vocabulary Size: 2562
French Vocabulary Size: 2814


## Model Building

We will now split the data into train and test sets for model training and evaluation, respectively. We will use the ``train_test_split`` function from the ``sklearn.model_selection`` module to split the data. We will use 20% of the data for testing and the rest for training. We will also set the ``random_state`` parameter to 42 to ensure reproducibility.

In [8]:
train_data, test_data = train_test_split(dataset, test_size=0.2, random_state=42)

It's time to encode the sentences. We will encode French sentences as the input sequences and English sentences as the target sequences. This will be done for both the train and test datasets.

In [9]:
# prepare training data
trainX = encode_sequences(fr_tokenizer, max_sequence_length, train_data["fr"])
trainY = encode_sequences(eng_tokenizer, max_sequence_length, train_data["en"])

# prepare test data
testX = encode_sequences(fr_tokenizer, max_sequence_length, test_data["fr"])
testY = encode_sequences(eng_tokenizer, max_sequence_length, test_data["en"])

Now comes the fun part, building the model. We will build a simple Seq2Seq model for text-to-text translation. 
The model follows a simple architecture:

- Input sequence is embedded using an Embedding layer.
- The embedded sequence is processed by an LSTM layer to capture context.
- Output sequence is generated by repeating and processing with another LSTM layer.
- The Dense layer produces a probability distribution over the output vocabulary for each timestep, enabling text generation.
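The list above can also be read as a sequence of tensor shapes. The sketch below traces them in plain Python, without building the model; ``trace_shapes`` is a hypothetical helper written for this illustration, the batch size of 32 is arbitrary, and the other values mirror the ones used in this notebook.

```python
def trace_shapes(batch, in_timesteps, out_timesteps, units, out_vocab):
    """Return the output shape after each layer of the baseline model."""
    return [
        # each input token id becomes a vector of size `units`
        ("Embedding", (batch, in_timesteps, units)),
        # the encoder keeps only its final state: one vector per sentence
        ("LSTM (last state only)", (batch, units)),
        # that vector is copied once per output timestep
        ("RepeatVector", (batch, out_timesteps, units)),
        # the decoder emits one vector per output timestep
        ("LSTM (return_sequences=True)", (batch, out_timesteps, units)),
        # Dense acts on the last axis: one distribution per timestep
        ("Dense softmax", (batch, out_timesteps, out_vocab)),
    ]

for name, shape in trace_shapes(batch=32, in_timesteps=20, out_timesteps=20,
                                units=64, out_vocab=2562):
    print(f"{name:30s} {shape}")
```

The whole input sentence is squeezed into a single vector of size ``units`` before decoding, which is the bottleneck that attention is meant to relieve.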

In [10]:
def build_model(in_vocab, out_vocab, in_timesteps, out_timesteps, units):
    model = Sequential()

    model.add(Embedding(in_vocab, units, input_length=in_timesteps, mask_zero=True))
    model.add(LSTM(units))
    model.add(RepeatVector(out_timesteps))
    
    model.add(LSTM(units, return_sequences=True))
    model.add(Dense(out_vocab, activation='softmax'))

    return model

We use the RMSprop optimizer for this model, as it is often a good choice for recurrent neural networks. We will experiment with other optimizers in the next notebook.

We will use the ``sparse_categorical_crossentropy`` loss since we have used integers to encode the target sequences. 

In [None]:
model = build_model(fr_vocab_size, eng_vocab_size, max_sequence_length, max_sequence_length, 64)
rms = optimizers.RMSprop(learning_rate=0.001)
model.compile(optimizer=rms, loss='sparse_categorical_crossentropy')

Using **sparse_categorical_crossentropy** also saves memory, because the target sequences can be used as they are instead of in one-hot encoded format. With an English vocabulary of 2,562 words, one-hot encoding would store 2,562 values per target token instead of a single integer, and on the full dataset this could consume our system's entire memory.
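Both points can be checked on toy numbers. The sketch below is plain Python rather than Keras code: it shows that the loss for an integer target equals the loss for the matching one-hot target, then compares the memory needed for the two target formats. The figure of 450 training rows is an assumption (roughly 80% of the 563 filtered rows), and the function names are made up for this example.

```python
import math

def sparse_cross_entropy(probs, target_index):
    """Loss for one timestep when the target is an integer class index."""
    return -math.log(probs[target_index])

def one_hot_cross_entropy(probs, one_hot):
    """Same loss when the target is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

probs = [0.1, 0.7, 0.2]  # toy softmax output over a 3-word vocabulary
a = sparse_cross_entropy(probs, 1)
b = one_hot_cross_entropy(probs, [0, 1, 0])
print(math.isclose(a, b))  # True

# Memory for the targets: assumed 450 rows, 20 timesteps, 2562-word vocabulary
rows, timesteps, vocab = 450, 20, 2562
int_bytes = rows * timesteps * 4              # int32 indices
one_hot_bytes = rows * timesteps * vocab * 4  # float32 one-hot vectors
print(int_bytes, one_hot_bytes)  # 36000 92232000
```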

It seems we are all set to start training our model. We will train it for **10 epochs** with a **batch size of 512**, holding out 20% of the training data for validation. We will experiment with these hyperparameters in the next notebook.
We will also use **ModelCheckpoint()** to save the model with the lowest validation loss.

In [None]:
filename = '../models/test.md'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

history = model.fit(trainX, trainY.reshape(trainY.shape[0], trainY.shape[1], 1), 
          epochs=10, batch_size=512, 
          validation_split = 0.2,
          callbacks=[checkpoint], verbose=1)
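It is worth working out how many samples each step of this pipeline actually sees. The arithmetic below is a sketch under two assumptions that are not verified in this notebook: that ``train_test_split`` rounds the test share up, and that ``validation_split`` holds out the last 20% of the training rows.

```python
import math

n_rows = 563                      # rows left after the length filter
n_test = math.ceil(0.2 * n_rows)  # assumption: test share is rounded up
n_train_total = n_rows - n_test

# assumption: the last 20% of the training rows are held out for validation
n_fit = math.floor(n_train_total * (1 - 0.2))
n_val = n_train_total - n_fit

batch_size = 512
updates_per_epoch = math.ceil(n_fit / batch_size)
print(n_test, n_fit, n_val, updates_per_epoch)  # 113 360 90 1
```

If those assumptions hold, the batch size is larger than the number of training samples, so each epoch performs a single weight update and 10 epochs amount to only 10 updates. A smaller batch size may be worth trying on a subset this small.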

In [11]:
model = load_model('../models/test.md')
predictions = model.predict(testX.reshape((testX.shape[0], testX.shape[1])))
predictions_classes = np.argmax(predictions, axis=-1)





## Evaluation of the Model

Let's compare the training loss and the validation loss. If the validation loss is much higher than the training loss, then the model might be overfitting. We will also evaluate the model on the test set to see how well it performs on unseen data.

**TODO:** 
- Add BLEU score
- Add attention (if we implemented the attention mechanism)
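For the BLEU item, the sketch below shows the idea behind the metric: clipped n-gram precision combined with a brevity penalty. It is a simplified sentence-level version with a single reference and no smoothing, written only for illustration. ``simple_bleu`` is a hypothetical name, and a maintained library implementation would be preferable for reported results.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(reference, candidate, max_n=4):
    """Sentence-level BLEU with one reference and no smoothing (sketch)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any empty precision gives 0
        log_precisions.append(math.log(overlap / total))
    # The brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

print(simple_bleu("where will we stand today", "where will we stand today"))  # 1.0
print(simple_bleu("where will we stand today", ""))  # 0.0
```

An empty prediction scores 0, and so does any sentence shorter than four tokens in this unsmoothed version, which is one reason smoothing is commonly applied to short sentences.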

In [14]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['train','validation'])
plt.show()


## Make Predictions

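Now that we have our model, let's make some predictions. We will create a function called ``translate`` which takes a sentence in French as input and returns the translated sentence in English, matching the direction the model was trained in. The function encodes the sentence with the French tokenizer, runs the trained model, picks the most likely word at each timestep, and decodes the result with the English tokenizer.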

In [23]:
def translate(sentence):
    sentence = encode_sequences(fr_tokenizer, max_sequence_length, [sentence])
    prediction = model.predict(sentence.reshape((sentence.shape[0], sentence.shape[1])))
    prediction = np.argmax(prediction, axis=-1)
    text = decode_sequences(eng_tokenizer, prediction[0])
    print(text)
    return text

But first, let's decode the predicted classes for the test set to check that decoding works.

In [19]:
preds_text = []

for seq in predictions_classes:
    predicted_text = decode_sequences(eng_tokenizer, seq)
    preds_text.append(predicted_text)

In [20]:
pred_df = pd.DataFrame({'actual' : test_data["en"], 'predicted' : preds_text})
pd.set_option('display.max_colwidth', 200)
pred_df.head(15)

Unnamed: 0,actual,predicted
334656,We are all struggling to live and we are always busy with our lives.,
134625,Here's how it went and how the Lebanese on line community reacted to this move.,
173850,Those publications are neither “handate messages” nor ” anti-peace”…,
72208,Japan: I want my husband dead,
192076,Where will we stand?,
149004,"Photo by Julio Albarrán, republished under a CC License. *",
222591,One of the many responses to the story is by Mungoma and reads:,
2727,"But, the car came back and ran over him killing him instantly.",
48725,Gameron writes [fa] that homosexuals face problems in Islamic countries where they can be executed.,
230044,I talked to Wong Tack during the break.,
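The empty ``predicted`` column is consistent with the model outputting index 0 (the padding index) at every timestep, since indices without a word are dropped during decoding. This is an interpretation rather than something verified in this notebook. The plain-Python sketch below mimics that decoding behaviour with a toy vocabulary; ``decode`` is a hypothetical stand-in for ``decode_sequences``.

```python
def decode(index_word, sequence):
    """Join the words for each index, skipping indices with no word (such as 0)."""
    return " ".join(index_word[i] for i in sequence if i in index_word)

index_word = {1: "where", 2: "will", 3: "we", 4: "stand"}  # toy vocabulary

print(repr(decode(index_word, [1, 2, 3, 4, 0, 0])))  # 'where will we stand'
print(repr(decode(index_word, [0, 0, 0, 0, 0, 0])))  # ''
```

Because most target positions are padding in short sentences, predicting 0 everywhere can already give a low loss, so inspecting ``predictions_classes`` directly would be a reasonable next check.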


Now let's make some predictions with the ``translate`` function.

In [28]:
translate("Bonjour")



