# Baseline Text to Text Translation : English to French

This notebook trains a sequence to sequence (seq2seq) model for English to French translation. This model will be our **baseline** model, which we will then improve upon by adding attention and other features.

---

## Import Required Libraries

We will start by importing the libraries we need for this project. You can install any missing libraries using the requirements.txt file provided or by running ``make install`` in the terminal.

In [None]:
%load_ext autoreload
%aimport utils.text_processing
%autoreload 1

In [None]:
from datasets import load_dataset

from utils.text_processing import TextProcessor

import numpy as np
import pandas as pd
import random

from keras.models import Model
from keras.layers import Input, Dense, LSTM, Embedding, Bidirectional, RepeatVector, TimeDistributed, BatchNormalization

from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint
from keras.models import load_model
from keras import optimizers

from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

import nltk
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu

import seaborn as sns
import matplotlib.pyplot as plt

pd.set_option('display.max_colwidth', 200)

### Verify access to the GPU
The following test applies only if you expect to be using a GPU, e.g., while running in a cloud environment with GPU support. Run the next cell, and verify that the device_type is "GPU".

In [None]:
import tensorflow as tf
print("cuda available: ", tf.config.list_physical_devices('GPU'))

We provide an in-depth analysis of the data in the ``exploratory_analysis.ipynb`` notebook, so we will not do any exploratory analysis here. Instead, we will focus on building our baseline model. Let's start by loading the dataset we will be using.

In [None]:
dataset = load_dataset("Nicolas-BZRD/Parallel_Global_Voices_English_French", split='train').to_pandas()
dataset.head()

The actual data contains over 350,000 sentence-pairs. However, to speed up training for this notebook, we will only use a small portion of the data. 

In [None]:
# TODO : Use the whole dataset (but it's too big for my computer)
dataset = dataset.sample(n=50000, random_state=42)
print(dataset.shape)

## Text Pre-Processing

The text pre-processing steps are implemented in a class called ``TextProcessor``, imported from ``utils.text_processing``. This class is used to clean and tokenize the text data. It can also be used to convert the text to sequences and pad the sequences to a maximum length. This way we will be able to improve our models without having to copy and paste the same code over and over again.

In [None]:
max_sequence_length = 20

In [None]:
# clean the english and french sentences
dataset['en'] = TextProcessor(dataset, 'en').transform()
dataset['fr'] = TextProcessor(dataset, 'fr').transform()

# keep only sentences with at most max_sequence_length words
dataset = dataset[dataset['en'].str.split().str.len() <= max_sequence_length]
dataset = dataset[dataset['fr'].str.split().str.len() <= max_sequence_length]

dataset.head(10)

### Text to Sequence Conversion

To feed our data to a Seq2Seq model, we will have to convert both the input and the output sentences into integer sequences of fixed length. Check the exploratory data analysis notebook to see the distribution of the lengths of the sentences in the dataset. Based on that, we decided to fix the maximum length of each sentence to 20 since the average length of the sentences in the dataset is around 20.

We will use the ``Tokenizer`` class from the ``keras.preprocessing.text`` module to tokenize the text data and convert it to integer sequences. We will then use the ``pad_sequences`` function from the ``keras.preprocessing.sequence`` module to pad (or truncate) the sequences to the maximum length.
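
To make the two steps concrete, here is a minimal pure-Python sketch of frequency-ranked word ids followed by post-truncation and post-padding. It is an illustration under assumptions, not the Keras implementation: `fit_vocab`, `encode`, and the toy sentences are hypothetical names introduced only for this example.

```python
from collections import Counter

def fit_vocab(lines, max_vocab_size=5000):
    """Rank words by frequency; ids start at 1 because 0 is reserved for padding."""
    counts = Counter(word for line in lines for word in line.lower().split())
    # Keep the (max_vocab_size - 1) most frequent words, similar in spirit to a num_words cap.
    most_common = counts.most_common(max_vocab_size - 1)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}

def encode(lines, vocab, length):
    """Map words to ids (dropping unknown words), then truncate and pad at the end."""
    encoded = []
    for line in lines:
        ids = [vocab[w] for w in line.lower().split() if w in vocab]
        ids = ids[:length]                    # like truncating='post'
        ids += [0] * (length - len(ids))      # like padding='post'
        encoded.append(ids)
    return encoded

# Hypothetical toy corpus, not taken from the dataset
toy_lines = ["the cat sat", "the dog sat on the mat"]
toy_vocab = fit_vocab(toy_lines)
toy_encoded = encode(toy_lines, toy_vocab, length=5)
print(toy_encoded)
```

Every encoded sentence ends up with exactly `length` ids: short sentences are filled with trailing zeros and long ones lose their final words.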

In [None]:
def tokenization(lines, max_vocab_size=5000):
    tokenizer = Tokenizer(filters=' ', num_words=max_vocab_size)
    tokenizer.fit_on_texts(lines)
    return tokenizer

def encode_sequences(tokenizer, length, lines):
    seq = tokenizer.texts_to_sequences(lines)
    seq = pad_sequences(seq, maxlen=length, padding='post', truncating='post')
    return seq

def decode_sequences(tokenizer, sequence):
    text = tokenizer.sequences_to_texts([sequence])[0]
    text = text.replace('<start>', '').replace('<end>', '').strip()
    return text

def get_most_common_words(tokenizer, n=10):
    word_counts = sorted(tokenizer.word_counts.items(), key=lambda x: x[1], reverse=True)
    return word_counts[:n]

In [None]:
# Tokenize the English sentences
eng_tokenizer = tokenization(dataset["en"])
eng_vocab_size = len(eng_tokenizer.word_index) + 1

# Tokenize the French sentences
fr_tokenizer = tokenization(dataset["fr"])
fr_vocab_size = len(fr_tokenizer.word_index) + 1

In [None]:
print('English Vocabulary Size: %d' % eng_vocab_size)
print('French Vocabulary Size: %d' % fr_vocab_size)

## Model Building

We will now split the data into train and test sets for model training and evaluation, respectively. We will use the ``train_test_split`` function from the ``sklearn.model_selection`` module, holding out 20% of the data for testing and keeping the rest for training. We also set the ``random_state`` parameter to 42 to ensure reproducibility.
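
As a rough illustration of what a seeded split does, the sketch below shuffles a copy of the rows with a fixed seed and cuts off a held-out share. It mirrors the ``test_size=0.2`` and ``random_state=42`` arguments of the ``train_test_split`` call, but it is not scikit-learn's algorithm and will not reproduce the same rows; `split_rows` is a hypothetical helper.

```python
import random

def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows with a fixed seed, then hold out the first test_size share."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-in for the sentence pairs
toy_rows = list(range(100))
toy_train, toy_test = split_rows(toy_rows)
print(len(toy_train), len(toy_test))
```

Because the seed is fixed, calling `split_rows` again on the same rows returns the same partition, which is the property the ``random_state`` parameter provides.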

In [None]:
train_data, test_data = train_test_split(dataset, test_size=0.2, random_state=42)

It's time to encode the sentences. We will encode French sentences as the input sequences and English sentences as the target sequences. This is done for both the train and test datasets.

In [None]:
# prepare training data
trainX = encode_sequences(fr_tokenizer, max_sequence_length, train_data["fr"])
trainY = encode_sequences(eng_tokenizer, max_sequence_length, train_data["en"])

# prepare test data
testX = encode_sequences(fr_tokenizer, max_sequence_length, test_data["fr"])
testY = encode_sequences(eng_tokenizer, max_sequence_length, test_data["en"])

In [None]:
trainX.shape, trainY.shape, testX.shape, testY.shape

In [None]:
# decode sample sequences from the training set
for i in range(1500):
    english = decode_sequences(eng_tokenizer, trainY[i, : ])
    print('English: ', english, len(english.split()))
    french = decode_sequences(fr_tokenizer, trainX[i, :])
    print('French: ', french , len(french.split()))
    print('---')

Now comes the fun part: building the model. We will build a simple Seq2Seq model for text-to-text translation.
The model follows a simple architecture:

- The input sequence is embedded using an Embedding layer.
- The embedded sequence is processed by a bidirectional LSTM layer that returns one output per timestep, so each position sees context from both directions.
- A TimeDistributed Dense layer with a softmax activation produces a probability distribution over the output vocabulary for each timestep, enabling text generation.

In [None]:
def build_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    embedding_size = 128
    
    french_input = Input(shape=input_shape[1:], name="input_layer")  # Embedding input (batch, seq_length)
    
    # the inputs are French token ids, so the embedding table is sized by the French vocabulary
    embeddings = Embedding(input_dim=french_vocab_size, output_dim=embedding_size, 
                           input_length=output_sequence_length, name="Embedding_layer")(french_input)
    
    # input shape to LSTM (batchsize, seq_length, embedding_dim) output shape: (batchsize, seq_length, units=126x2)
    x = Bidirectional(LSTM(126, return_sequences=True, activation="tanh"), name="Bidir_LSTM_layer")(embeddings)
    
    # the targets are English token ids, so the softmax is sized by the English vocabulary
    preds = TimeDistributed(Dense(english_vocab_size, activation="softmax"), name="Dense_layer")(x)
    model = Model(inputs=french_input, outputs=preds, name='Embedding_Bidir_LSTM')
       
    return model

<img src="../images/bidirectional.png"
    alt="rnn"
    style="text-align: center;" />
<br>
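
To sanity-check the output of ``model.summary()``, the shapes and parameter counts of the layers in ``build_model`` can be worked out by hand. The sketch below assumes the standard LSTM parameterization with a bias term and the values used in this notebook (sequence length 20, vocabulary cap 5000, embedding size 128, 126 LSTM units); `describe_model` is a hypothetical helper, not part of Keras.

```python
def describe_model(seq_len=20, vocab_in=5000, vocab_out=5000, emb=128, units=126):
    """Return (layer name, output shape without the batch axis, trainable parameters)."""
    # One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias
    lstm_params = 4 * ((emb + units) * units + units)
    return [
        ("Embedding", (seq_len, emb), vocab_in * emb),
        ("Bidirectional LSTM", (seq_len, 2 * units), 2 * lstm_params),
        ("TimeDistributed Dense", (seq_len, vocab_out), (2 * units + 1) * vocab_out),
    ]

layers = describe_model()
for name, shape, params in layers:
    print(f"{name:<22} {str(shape):<12} {params:,}")
print("total parameters:", f"{sum(p for _, _, p in layers):,}")
```

Under these assumptions the Dense layer holds more than half of the parameters, because it maps every timestep onto the full output vocabulary.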

We reshape ``trainX`` and ``testX`` to 2-dimensional tensors of shape (samples, sequence length), which is what the Embedding layer expects, and ``trainY`` and ``testY`` to 3-dimensional tensors of shape (samples, sequence length, 1), so that each timestep carries a single integer label. We will use ``trainX`` and ``trainY`` to train the model and ``testX`` and ``testY`` to evaluate it.

In [None]:
trainX = trainX.reshape((-1, max_sequence_length))
testX = testX.reshape((-1, max_sequence_length))

trainY = trainY.reshape((trainY.shape[0], trainY.shape[1], 1))
testY = testY.reshape((testY.shape[0], testY.shape[1], 1))

We are using RMSprop optimizer in this model as it is usually a good choice for recurrent neural networks. We will experiment with other optimizers in the next notebook.

We will use the ``sparse_categorical_crossentropy`` loss since we have used integers to encode the target sequences. 
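
For a single timestep, the sparse loss is simply the negative log-probability that the model assigns to the true word id. The following sketch, with hypothetical toy values, shows that it gives the same number as the one-hot formulation; it illustrates the definition and is not the Keras implementation.

```python
import math

def sparse_categorical_crossentropy(target_id, probs):
    """Loss for one timestep: negative log-probability of the true class id."""
    return -math.log(probs[target_id])

def categorical_crossentropy(one_hot, probs):
    """The same loss written against a one-hot target vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

toy_probs = [0.1, 0.7, 0.2]      # hypothetical softmax output over 3 words
toy_target = 1                   # integer-encoded true word
toy_one_hot = [0.0, 1.0, 0.0]    # the same target, one-hot encoded

sparse_loss = sparse_categorical_crossentropy(toy_target, toy_probs)
dense_loss = categorical_crossentropy(toy_one_hot, toy_probs)
print(sparse_loss, dense_loss)
```

Both calls agree because the one-hot vector zeroes out every term except the one for the true word.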

In [None]:
model = build_model(trainX.shape, max_sequence_length, 5000, 5000)

rms = optimizers.RMSprop(learning_rate=0.0001)
model.compile(optimizer=rms, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.summary()

Note that we have used **sparse_categorical_crossentropy** as the loss function because it allows us to use the target sequence as it is instead of one hot encoded format. One hot encoding the target sequences with such a huge vocabulary might consume our system's entire memory.
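
A quick back-of-the-envelope calculation shows the size of the difference. The sketch assumes 4 bytes per value and a hypothetical count of 30,000 training sentences (the real count depends on the length filter above); the sequence length of 20 and the vocabulary cap of 5000 come from this notebook.

```python
def one_hot_bytes(n_sentences, seq_len=20, vocab_size=5000, bytes_per_value=4):
    """Memory needed if every target word were stored as a one-hot vector."""
    return n_sentences * seq_len * vocab_size * bytes_per_value

def sparse_bytes(n_sentences, seq_len=20, bytes_per_value=4):
    """Memory needed when every target word is stored as a single integer."""
    return n_sentences * seq_len * bytes_per_value

n_sentences = 30_000  # hypothetical number of training sentences
print(f"one-hot targets: {one_hot_bytes(n_sentences) / 1e9:.1f} GB")
print(f"integer targets: {sparse_bytes(n_sentences) / 1e6:.1f} MB")
```

The ratio between the two is exactly the vocabulary size, so the saving grows with every word added to the output vocabulary.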

It seems we are all set to start training our model. We will train it for **20 epochs** with a **batch size of 64**, holding out 20% of the training data for validation. We will also experiment with the hyperparameters in the next notebook.
We will also use **ModelCheckpoint()** to save the best model with lowest validation loss.
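
The idea behind saving only the best model can be sketched in a few lines: keep the lowest validation loss seen so far and write the file only when a new epoch beats it. The function name and the loss values below are hypothetical, and the sketch only mimics the decision rule, not the callback itself.

```python
def best_checkpoint_epochs(val_losses):
    """Return the 1-based epochs at which a 'save best only' checkpoint would be written."""
    best = float("inf")
    saved = []
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best:          # lower is better, as with mode='min'
            best = val_loss
            saved.append(epoch)      # the file on disk would be overwritten here
    return saved

toy_val_losses = [2.9, 2.4, 2.5, 2.1, 2.3]   # hypothetical per-epoch validation losses
saved_epochs = best_checkpoint_epochs(toy_val_losses)
print(saved_epochs)
```

In this toy run the file left on disk after training is the one written at the last saved epoch, which is why the notebook reloads the model from the checkpoint file before evaluating it.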

In [None]:
filename = '../models/embedding_bidirectional.h5'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

history = model.fit(trainX, trainY, 
          epochs=20, batch_size=64,
          validation_split=0.2,
          callbacks=[checkpoint], verbose=1)

In [None]:
model = load_model('../models/embedding_bidirectional.h5')

## Evaluation of the Model

Let's compare the training loss and the validation loss. If the validation loss is much higher than the training loss, then the model might be overfitting. We will also evaluate the model on the test set to see how well it performs on unseen data.
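
Besides looking at the plot, the same comparison can be summarized numerically. The sketch below uses hypothetical loss curves; in the notebook, the two lists would be ``history.history['loss']`` and ``history.history['val_loss']``. `summarize_history` is a hypothetical helper.

```python
def summarize_history(train_loss, val_loss):
    """Report the epoch with the lowest validation loss and the final train/validation gap."""
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
    final_gap = val_loss[-1] - train_loss[-1]
    return {"best_epoch": best_epoch, "final_gap": final_gap}

# Hypothetical curves: training loss keeps falling while validation loss turns upward
toy_train_loss = [3.0, 2.5, 2.1, 1.8, 1.5]
toy_val_loss = [3.1, 2.7, 2.5, 2.6, 2.8]

summary = summarize_history(toy_train_loss, toy_val_loss)
print(summary)
```

A best epoch well before the last one, combined with a large positive gap, is the pattern that usually points to overfitting.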

In [None]:
# Plot evaluation results
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['train', 'validation'])
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Model Evaluation')
plt.show()

In [None]:
evaluation = model.evaluate(testX, testY)

print("Test Loss:", evaluation[0])
print("Test Accuracy:", evaluation[1])

In [None]:
def translate(sentence):
    sentence = encode_sequences(fr_tokenizer, max_sequence_length, [sentence])
    prediction = model.predict(sentence.reshape((sentence.shape[0], sentence.shape[1])))
    prediction = np.argmax(prediction, axis=-1)
    text = decode_sequences(eng_tokenizer, prediction[0])
    return text

In [None]:
def calculate_bleu_score(reference, candidate):
    reference = [reference.split()]
    candidate = candidate.split()
    return sentence_bleu(reference, candidate)

def evaluate_model_bleu_score(test_data):
    references = []
    candidates = []
    
    for _, row in test_data.iterrows():
        reference = row['en']
        candidate = translate(row['fr'])
        
        # corpus_bleu expects token lists: one list of tokenized references per tokenized candidate
        references.append([reference.split()])
        candidates.append(candidate.split())
    
    return corpus_bleu(references, candidates)


In [None]:
# # Calculate BLEU score for a single sentence
# reference_sentence = "Hello, how are you?"
# candidate_sentence = "Bonjour, comment ça va?"
# bleu_score = calculate_bleu_score(reference_sentence, candidate_sentence)
# print("BLEU score:", bleu_score)

# # Evaluate model BLEU score on test data
# test_bleu_score = evaluate_model_bleu_score(test_data)
# print("Model BLEU score on test data:", test_bleu_score)

## Make Predictions

Now that we have our model, let's make some predictions. We will use the ``translate`` function defined above, which takes a French sentence as input and returns the English translation predicted by the trained model.

But first, let's check the predicted classes on a small batch to see if everything works.
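
The decoding used in this notebook is greedy: at each timestep we take the most probable word id and map it back to a word, ignoring the padding id 0. A minimal pure-Python sketch of that idea, with a hypothetical three-entry vocabulary and made-up probabilities, looks like this:

```python
def greedy_decode(step_probs, index_word):
    """Pick the most probable word id at each timestep; id 0 (padding) is skipped."""
    words = []
    for probs in step_probs:
        best_id = max(range(len(probs)), key=probs.__getitem__)   # argmax over the vocabulary
        if best_id != 0:
            words.append(index_word[best_id])
    return " ".join(words)

toy_index_word = {1: "hello", 2: "world"}   # hypothetical vocabulary, id 0 is padding
toy_step_probs = [
    [0.10, 0.80, 0.10],   # most probable id: 1
    [0.20, 0.10, 0.70],   # most probable id: 2
    [0.90, 0.05, 0.05],   # most probable id: 0 (padding)
]

decoded = greedy_decode(toy_step_probs, toy_index_word)
print(decoded)
```

Each timestep is decoded independently of the others, which keeps the baseline simple but means the model cannot revise an earlier word based on a later one.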

In [None]:
size_to_predict = 20

# Make predictions on the subset
subset_to_predict = testX[:size_to_predict]
predictions = model.predict_on_batch(subset_to_predict)
predictions_classes = np.argmax(predictions, axis=-1)

# reshape the subset to predict and the testY to be able to decode them
reshapedX_subset = subset_to_predict.reshape((subset_to_predict.shape[0], subset_to_predict.shape[1]))
reshapedY_subset = testY[:size_to_predict].reshape((testY[:size_to_predict].shape[0], testY[:size_to_predict].shape[1]))

predicted_df = pd.DataFrame(columns=['french_sentence', 'actual_english_sentence', 'predicted_english_sentence'])

for i, seq in enumerate(predictions_classes):
    predicted_text = decode_sequences(eng_tokenizer, seq)
    original_french_sentence = decode_sequences(fr_tokenizer, reshapedX_subset[i])
    original_english_sentence = decode_sequences(eng_tokenizer, reshapedY_subset[i])
    
    predicted_df.loc[i] = [original_french_sentence, original_english_sentence, predicted_text]

In [None]:
predicted_df

Now let's make some predictions, with the ``translate`` function.

In [None]:
testX.shape

In [None]:
from tqdm import tqdm
data = []

references = []
candidates = []

for i in tqdm(range(3000)):
    testX_decoded = decode_sequences(fr_tokenizer, testX[i,])
    testY_decoded = decode_sequences(eng_tokenizer, testY[i, : ,0])
    candidate = translate(testX_decoded).replace('<end>', '').replace('<start>', '').strip()
    
    data.append({
        'Context': testX_decoded,
        'Reference': testY_decoded,
        'Candidate': candidate,
        'length': len(testX_decoded.split())
    })
    
    # corpus_bleu expects token lists, not raw strings
    references.append([testY_decoded.split()])
    candidates.append(candidate.split())

In [None]:
# split into small dataset based on the sentences length
length_ranges = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 30), (31, 40), (41, 60), (61, float('inf'))]

small_datasets = {}
for min_len, max_len in length_ranges:
    filtered_examples = [example for example in data if example['length'] >= min_len and example['length'] <= max_len]
    small_datasets[f'dataset_{min_len}_{max_len}'] = filtered_examples

samples_per_range = []
# use a loop variable that does not shadow the `dataset` DataFrame
for key, examples in small_datasets.items():
    samples_per_range.append(len(examples))
    print(f"{key}: {len(examples)} samples")

In [None]:
def compute_corpus_bleu(references, candidates):
    if len(references) != len(candidates):
        raise ValueError(f'The number of references and candidates must be the same: {len(references)} != {len(candidates)}')
    
    if len(references) == 0: return 0.0
    
    # corpus_bleu expects token lists: a list of tokenized references per tokenized candidate
    reference_tokens = [[ref.split()] for ref in references]
    candidate_tokens = [cand.split() for cand in candidates]
    return corpus_bleu(reference_tokens, candidate_tokens)
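
NLTK's ``corpus_bleu`` works on token lists, so passing raw strings makes it compare characters instead of words. The sketch below implements only the clipped n-gram precision at the core of BLEU (no brevity penalty and no averaging over n-gram orders) to show how much that can distort the score; `clipped_precision` and the toy sentences are hypothetical.

```python
from collections import Counter

def clipped_precision(reference, hypothesis, n=1):
    """Modified n-gram precision: hypothesis n-gram counts are clipped by the reference counts."""
    def ngrams(seq):
        return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    hyp, ref = ngrams(hypothesis), ngrams(reference)
    overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return overlap / max(sum(hyp.values()), 1)

ref_text, hyp_text = "the cat sat", "a hat"   # hypothetical sentences with no word in common

word_level = clipped_precision(ref_text.split(), hyp_text.split())
char_level = clipped_precision(ref_text, hyp_text)   # strings are iterated character by character
print(word_level, char_level)
```

The two sentences share no word, yet every character of the hypothesis also occurs in the reference, so the character-level score is far too optimistic. Splitting both sides into words before scoring avoids this.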

In [None]:
bleu_scores = []
# use a loop variable that does not shadow the `dataset` DataFrame
for key, examples in small_datasets.items():
    refs = [example['Reference'] for example in examples]
    cands = [example['Candidate'] for example in examples]
    
    corpus_bleu_score = compute_corpus_bleu(refs, cands)
    bleu_scores.append(corpus_bleu_score)
    
    print(f"{key}: {corpus_bleu_score:.4f}")

In [None]:
overall_bleu_score = corpus_bleu(references, candidates)
overall_bleu_score

In [None]:
import matplotlib.patches as mpatches

plt.figure(figsize=(15, 7))

colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']  # List of colors for each bar
bar_plot = plt.bar([f'{start}-{end}' for start, end in length_ranges], bleu_scores, color=colors, alpha=0.7, label='BLEU Score')

# Add "All" bar with legend
all_bar = plt.bar("All", overall_bleu_score, color='k', alpha=0.7)

# Create a dummy handle for the "All" bar
all_patch = mpatches.Patch(color='k', label=f'Sample = {len(candidates)}')
legend_labels = [f'Sample = {value}' for value in samples_per_range]

# Include the dummy handle in the legend
plt.legend(handles=[*bar_plot, all_patch], labels=legend_labels + [f'Sample = {len(candidates)}'], loc='upper right', title='Samples per range')

plt.xlabel('Word Count Range')
plt.ylabel('BLEU Score')

plt.title('BLEU Score and Number of Samples Based on Word Count Range')

plt.show()
