# Attention Mechanism for Text-to-Text Translation: French to English

This notebook trains a model for French to English translation. Using the attention mechanism, the model learns to align and translate French text to English text.

---

## Import Required Libraries

We will start by importing the libraries we need for this project. You can install any missing libraries using the requirements.txt file provided or by running ``make install`` in the terminal.

In [None]:
%load_ext autoreload
%aimport utils.text_processing
%autoreload 1

In [None]:
from datasets import load_dataset
from tqdm import tqdm

import pandas as pd
import numpy as np

from nltk.translate.bleu_score import sentence_bleu, corpus_bleu
from sklearn.model_selection import train_test_split
from string import digits
import tensorflow as tf

from keras.preprocessing.sequence import pad_sequences
from utils.text_processing import TextProcessor

from keras.preprocessing.text import Tokenizer

import time
import re
import os

import matplotlib.ticker as ticker
import matplotlib.pyplot as plt

%matplotlib inline
pd.set_option('display.max_colwidth', 200)

### Verify access to the GPU
The following test applies only if you expect to be using a GPU, e.g., while running in a cloud environment with GPU support. Run the next cell, and verify that the device_type is "GPU".

In [None]:
import tensorflow as tf
print("cuda available: ", tf.config.list_physical_devices('GPU'))

We provide an in-depth analysis of the data in the ``exploratory_analysis.ipynb`` notebook, so we will not do any exploratory analysis here. Instead, we will focus on building our baseline model. Let's start by loading the dataset we will be using.

In [None]:
dataset = load_dataset("Nicolas-BZRD/Parallel_Global_Voices_English_French", split='train').to_pandas()
dataset.head(10)

The full dataset contains over 350,000 sentence pairs. However, to speed up training for this notebook, we will only use a random sample of 50,000 pairs.

In [None]:
# TODO : Use the whole dataset (but it's too big for my computer)
dataset = dataset.sample(n=50000, random_state=42)
print(dataset.shape)

## Text Pre-Processing

The text pre-processing steps are implemented in a class called ``TextProcessor``, which cleans the text data. Keeping these steps in one class lets us improve our models without copying and pasting the same code over and over again.

In [None]:
max_sequence_length = 20

In [None]:
dataset['en'] = TextProcessor(dataset, 'en').transform()
dataset['fr'] = TextProcessor(dataset, 'fr').transform()

dataset.head(10)

In [None]:
# keep only sentences with at most max_sequence_length words
dataset = dataset[dataset['en'].str.split().str.len() <= max_sequence_length]
dataset = dataset[dataset['fr'].str.split().str.len() <= max_sequence_length]

In [None]:
dataset

### Text to Sequence Conversion

To feed our data to a Seq2Seq model, we will have to convert both the input and the output sentences into integer sequences of fixed length. Check the exploratory data analysis notebook to see the distribution of the lengths of the sentences in the dataset. Based on that, we decided to fix the maximum length of each sentence to 20 since the average length of the sentences in the dataset is around 20.

We will use the ``Tokenizer`` class from the ``keras.preprocessing.text`` module to tokenize the text and convert it to integer sequences. We will then use the ``pad_sequences`` function from the ``keras.preprocessing.sequence`` module to pad (or truncate) each sequence to the maximum length.
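
To make the two steps concrete, here is a minimal, framework-free sketch of what tokenizing and post-padding do. The helper names (`build_word_index`, `encode_and_pad`) and the toy sentences are assumptions made for illustration; the notebook itself relies on the Keras utilities, whose tie-breaking between equally frequent words may differ from this sketch.

```python
from collections import Counter

def build_word_index(lines):
    """Map each word to an integer ID, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for line in lines for word in line.split())
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ordered, start=1)}

def encode_and_pad(word_index, lines, length):
    """Convert sentences to ID lists, truncate to `length`, and right-pad with zeros."""
    encoded = []
    for line in lines:
        ids = [word_index[w] for w in line.split() if w in word_index]
        ids = ids[:length]                      # truncating='post'
        encoded.append(ids + [0] * (length - len(ids)))  # padding='post'
    return encoded

toy_lines = ["<start> the cat sleeps <end>", "<start> the dog <end>"]
toy_index = build_word_index(toy_lines)
toy_sequences = encode_and_pad(toy_index, toy_lines, length=6)
print(toy_index)
print(toy_sequences)
```

Every row has the same length, and the zeros at the end mark positions that carry no word.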

In [None]:
def tokenization(lines, max_vocab_size=100000):
    tokenizer = Tokenizer(filters=' ', num_words=max_vocab_size)
    tokenizer.fit_on_texts(lines)
    return tokenizer

def encode_sequences(tokenizer, length, lines):
    seq = tokenizer.texts_to_sequences(lines)
    seq = pad_sequences(seq, maxlen=length, padding='post', truncating='post')
    return seq

def decode_sequences(tokenizer, sequence):
    text = tokenizer.sequences_to_texts([sequence])[0]
    text = text.replace('<start>', '').replace('<end>', '').strip()
    return text

def get_most_common_words(tokenizer, n=10):
    word_counts = sorted(tokenizer.word_counts.items(), key=lambda x: x[1], reverse=True)
    return word_counts[:n]

In [None]:
max_vocab_size = 5000

In [None]:
# Tokenize the English sentences
eng_tokenizer = tokenization(dataset["en"], max_vocab_size=max_vocab_size)
eng_vocab_size = len(eng_tokenizer.word_index) + 1

# Tokenize the French sentences
fr_tokenizer = tokenization(dataset["fr"], max_vocab_size=max_vocab_size)
fr_vocab_size = len(fr_tokenizer.word_index) + 1

In [None]:
print('English Vocabulary Size: %d' % eng_vocab_size)
print('French Vocabulary Size: %d' % fr_vocab_size)

In [None]:
print("Most common words in English: ", get_most_common_words(eng_tokenizer))
print("Most common words in French: ", get_most_common_words(fr_tokenizer))

## Model Building

We will now split the data into training and test sets for model training and evaluation, respectively. We will use the ``train_test_split`` function from the ``sklearn.model_selection`` module to split the data. We will use 20% of the data for testing and the rest for training. We will also set the ``random_state`` parameter to 42 to ensure reproducibility.

In [None]:
train_data, test_data = train_test_split(dataset, test_size=0.2, random_state=42)

It's time to encode the sentences. We will encode the French sentences as the input sequences and the English sentences as the target sequences. We do this for both the training and test datasets.

In [None]:
# prepare training data (use only the training split so no test sentence leaks into training)
trainX = encode_sequences(fr_tokenizer, max_sequence_length, train_data["fr"])
trainY = encode_sequences(eng_tokenizer, max_sequence_length, train_data["en"])

# prepare test data
testX = encode_sequences(fr_tokenizer, max_sequence_length, test_data["fr"])
testY = encode_sequences(eng_tokenizer, max_sequence_length, test_data["en"])

In [None]:
trainX = trainX.reshape((-1, trainX.shape[1], 1))
testX = testX.reshape((-1, testX.shape[1], 1))

trainY = trainY.reshape((-1, trainY.shape[1], 1))
testY = testY.reshape((-1, testY.shape[1], 1))

In [None]:
testX.shape

In [None]:
trainX.shape

In [None]:
test_data

In [None]:
# decode sample sequences from the training set
for i in range(1500):
    english = decode_sequences(eng_tokenizer, trainY[i, : ,0])
    french = decode_sequences(fr_tokenizer, trainX[i, : ,0])
    print('English: ', english, len(english.split()))
    print('French: ', french , len(french.split()))
    print('---')

## Encoder-Decoder with Attention Mechanism

![Attention](../images/attention.jpg)
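
For reference, the additive (Bahdanau-style) attention computed by the ``BahdanauAttention`` layer in this notebook can be written as follows. The notation is our own: $h_{t-1}$ is the decoder hidden state passed in as the query, $\bar{h}_s$ is the encoder output at source position $s$ (the values), and $W_1$, $W_2$, $v$ stand for the ``W1``, ``W2`` and ``V`` dense layers, with their bias terms omitted for brevity.

$$
\begin{aligned}
e_{t,s} &= v^\top \tanh\left(W_1 h_{t-1} + W_2 \bar{h}_s\right) \\
\alpha_{t,s} &= \frac{\exp(e_{t,s})}{\sum_{s'} \exp(e_{t,s'})} \\
c_t &= \sum_{s} \alpha_{t,s}\, \bar{h}_s
\end{aligned}
$$

The weights $\alpha_{t,s}$ sum to one over the source positions, and the context vector $c_t$ is what the decoder concatenates with the embedded input token.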

In [None]:
class Encoder(tf.keras.Model):
    
    def __init__(self, vocab_size, embedding_dim, encoder_units, batch_size):
        super(Encoder, self).__init__()
        self.batch_size = batch_size
        self.encoder_units = encoder_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(encoder_units, 
                                      return_sequences=True,
                                      return_state=True,                                      
                                      recurrent_initializer='glorot_uniform')
    
    def call(self, x, hidden):
        # pass the input x to the embedding layer
        x = self.embedding(x)
        # pass the embedding and the hidden state to GRU
        output, state = self.gru(x, initial_state=hidden)
        return output, state
    
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_size, self.encoder_units))

In [None]:
class BahdanauAttention(tf.keras.layers.Layer):
    
    def __init__(self, units):
        
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query hidden state shape == (batch_size, hidden size)
        # query_with_time_axis shape == (batch_size, 1, hidden size)
        # values shape == (batch_size, max_len, hidden size)
        # we are doing this to broadcast addition along the time axis to calculate the score
        query_with_time_axis = tf.expand_dims(query, 1)

        # score shape == (batch_size, max_length, 1)
        # we get 1 at the last axis because we are applying score to self.V
        # the shape of the tensor before applying self.V is (batch_size, max_length, units)
        score = self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values)))

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
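
The last two steps of the layer (softmax over the time axis, then a weighted sum of the encoder outputs) can be checked by hand on a tiny example. The sketch below uses plain Python lists instead of tensors and takes the scores as given; `softmax`, `attend` and the toy numbers are illustrative assumptions, not values produced by the model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    """Turn one score per source position into weights and a context vector."""
    weights = softmax(scores)
    hidden_size = len(values[0])
    # Weighted sum over source positions, computed one hidden dimension at a time.
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(hidden_size)]
    return context, weights

# Three source positions with hidden size 2 (toy numbers).
toy_values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_scores = [2.0, 0.0, 0.0]
toy_context, toy_weights = attend(toy_scores, toy_values)
print(toy_weights)
print(toy_context)
```

The position with the highest score receives the largest weight, so the context vector is pulled towards its encoder output.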

In [None]:
class Decoder(tf.keras.Model):
    
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)

        # used for attention
        self.attention = BahdanauAttention(self.dec_units)

    def call(self, x, hidden, enc_output):
        
        # hidden state shape == (batch_size, hidden size)
        # enc_output shape == (batch_size, max_length, hidden_size)
        context_vector, attention_weights = self.attention(hidden, enc_output)

        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)

        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        # passing the concatenated vector to the GRU
        output, state = self.gru(x)

        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))
        
        # output shape == (batch_size, vocab)
        x = self.fc(output)

        return x, state, attention_weights

In [None]:
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)
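
The mask keeps padded positions (target ID 0) from contributing to the loss. Note that ``tf.reduce_mean`` still divides by the full batch size, so a batch with many padded targets at a given time step yields a smaller value. The plain-Python sketch below illustrates the difference between that average and an average over real tokens only; the function name and the numbers are assumptions made for illustration, not a change to the notebook's loss.

```python
def masked_mean_losses(per_example_loss, target_ids, pad_id=0):
    """Zero out losses at padded targets, then average in two different ways."""
    mask = [0.0 if t == pad_id else 1.0 for t in target_ids]
    masked = [loss * m for loss, m in zip(per_example_loss, mask)]
    # Same normalisation as reduce_mean: divide by the number of examples.
    mean_over_batch = sum(masked) / len(masked)
    # Alternative normalisation: divide by the number of non-padded targets.
    mean_over_real_tokens = sum(masked) / max(sum(mask), 1.0)
    return mean_over_batch, mean_over_real_tokens

toy_targets = [12, 7, 0, 0]         # two real tokens, two padded positions
toy_losses = [2.0, 1.0, 3.0, 5.0]   # per-example losses at one time step
batch_mean, token_mean = masked_mean_losses(toy_losses, toy_targets)
print(batch_mean, token_mean)
```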

In [None]:
@tf.function
def train_step(inp, targ, enc_hidden):
    loss = 0
    with tf.GradientTape() as tape:
        # enc_output: (batch_size, max_length, encoder_units), enc_hidden: (batch_size, encoder_units)
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden

        # dec_input: (batch_size, 1); the decoder emits English, so start from the English <start> token
        dec_input = tf.expand_dims([eng_tokenizer.word_index['<start>']] * BATCH_SIZE, 1)

        # Teacher forcing - feeding the target as the next input
        for t in range(1, targ.shape[1]):
            # passing enc_output to the decoder
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)

            loss += loss_function(targ[:, t], predictions)
            dec_input = tf.expand_dims(targ[:, t], 1) # using teacher forcing

    batch_loss = (loss / int(targ.shape[1]))

    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))

    return batch_loss
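
Teacher forcing means that at every step the decoder receives the ground-truth previous token rather than its own prediction. The sketch below lists the (decoder input, expected label) pairs that the loop in ``train_step`` walks through for a single target sequence; the token IDs are hypothetical.

```python
def teacher_forcing_pairs(target_ids):
    """Pair each decoder input with the label it should predict at that step."""
    pairs = []
    decoder_input = target_ids[0]           # the <start> token
    for t in range(1, len(target_ids)):
        label = target_ids[t]
        pairs.append((decoder_input, label))
        decoder_input = label               # feed the ground truth, not the prediction
    return pairs

# Hypothetical IDs: 1 = <start>, 2 = <end>, 0 = padding.
toy_target = [1, 8, 5, 2, 0]
print(teacher_forcing_pairs(toy_target))
```

The final pair has a padded label, which is exactly the case the mask in ``loss_function`` is there to neutralise.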

In [None]:
BATCH_SIZE = 64
embedding_dim = 256
units = 1024

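# Build an in-memory tf.data pipeline: drop the trailing feature axis, shuffle the examples, then batch them
dataset = tf.data.Dataset.from_tensor_slices((trainX.reshape((-1, trainX.shape[1])), trainY.reshape((-1, trainY.shape[1]))))
# a buffer as large as the training set gives a full shuffle (a buffer of BATCH_SIZE only shuffles locally)
dataset = dataset.shuffle(len(trainX))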
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
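
``drop_remainder=True`` matters here because ``Encoder.initialize_hidden_state`` creates a zero state for exactly ``BATCH_SIZE`` examples, so a smaller final batch would not match it. A minimal sketch of the batching behaviour in plain Python (the ``batches`` helper is an illustration, not part of ``tf.data``):

```python
def batches(items, batch_size, drop_remainder=True):
    """Split `items` into consecutive batches, optionally dropping a short final batch."""
    out = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    if drop_remainder and out and len(out[-1]) < batch_size:
        out.pop()
    return out

toy_items = list(range(10))
full_only = batches(toy_items, batch_size=4)
with_tail = batches(toy_items, batch_size=4, drop_remainder=False)
print(len(full_only), len(with_tail))
```

With the remainder dropped, the number of batches equals ``len(items) // batch_size``, which is how ``steps_per_epoch`` is computed in this notebook.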

In [None]:
encoder = Encoder(fr_vocab_size, embedding_dim=embedding_dim, encoder_units=units, batch_size=BATCH_SIZE)
decoder = Decoder(vocab_size=eng_vocab_size, embedding_dim=embedding_dim, dec_units=units, batch_sz=BATCH_SIZE)

In [None]:
checkpoint_dir = '../models/encode_decoder_attention_training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

In [None]:
EPOCHS = 10
steps_per_epoch = len(trainX) // BATCH_SIZE

for epoch in range(EPOCHS):
    start = time.time()

    enc_hidden = encoder.initialize_hidden_state()
    total_loss = 0
    for (batch, (inp, targ)) in tqdm(enumerate(dataset.take(steps_per_epoch)), total=steps_per_epoch):
        batch_loss = train_step(inp, targ, enc_hidden)
        total_loss += batch_loss

    # saving (checkpoint) the model every 2 epochs
    if (epoch + 1) % 2 == 0:
        checkpoint.save(file_prefix = checkpoint_prefix)

    print('Epoch {} Loss {:.4f}'.format(epoch + 1, total_loss / steps_per_epoch))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

### Make Predictions Using the Trained Model

In [None]:
def predict(sentences, source_sentence_tokenizer=fr_tokenizer, target_sentence_tokenizer=eng_tokenizer): 
    max_length = trainX.shape[1]
    attention_plot = np.zeros((max_length, max_length))

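    sentence = TextProcessor(sentences, 'fr').process(sentences)

    # encode the processed sentence the same way the training data was encoded
    # (texts_to_sequences skips words outside the tokenizer's vocabulary instead of raising a KeyError)
    inputs = source_sentence_tokenizer.texts_to_sequences([sentence])
    inputs = pad_sequences(inputs, maxlen=max_length, padding="post", truncating="post")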
    inputs = tf.convert_to_tensor(inputs)

    result = ""
    
    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)
    
    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([target_sentence_tokenizer.word_index['<start>']], 0)
    
    for t in range(max_length):
        predictions, dec_hidden, attention_weights = decoder(dec_input, dec_hidden, enc_out)

        # storing the attention weights to plot later on
        attention_weights =  tf.reshape(attention_weights,(-1, ))
        attention_plot [t] = attention_weights.numpy()
        
        predicted_id = tf.argmax(predictions[0]).numpy()
        result += target_sentence_tokenizer.index_word[predicted_id] + ' '

        if target_sentence_tokenizer.index_word[predicted_id] == '<end>':
            return result, sentence, attention_plot

        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence, attention_plot
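
The loop in ``predict`` is a greedy decoder: at each step it keeps only the highest-scoring token and feeds it back as the next input. The sketch below isolates that control flow. It leaves out the hidden state and attention weights that the real decoder carries, and ``toy_step`` is a hypothetical scorer standing in for the trained model.

```python
def greedy_decode(step_fn, start_id, end_id, max_length):
    """Repeatedly pick the highest-scoring token until <end> or max_length is reached."""
    output_ids = []
    current = start_id
    for _ in range(max_length):
        scores = step_fn(current)
        current = max(range(len(scores)), key=scores.__getitem__)  # argmax
        output_ids.append(current)
        if current == end_id:
            break
    return output_ids

def toy_step(token_id, vocab_size=5):
    """Hypothetical scorer: always favours the ID that follows the current one."""
    scores = [0.0] * vocab_size
    scores[(token_id + 1) % vocab_size] = 1.0
    return scores

decoded = greedy_decode(toy_step, start_id=0, end_id=3, max_length=10)
print(decoded)
```

Greedy decoding is simple and fast, but it can miss translations that a beam search would find because it never revisits an earlier choice.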

In [None]:
def plot_attention(attention, sentence, predicted_sentence):
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention, cmap="viridis")
    
    sentence = sentence.split()
    predicted_sentence = predicted_sentence.split()
    
    # Add <PAD> token if sentence or predicted_sentence is shorter than attention matrix
    if len(sentence) < attention.shape[1]:
        sentence += ["<PAD>"] * (attention.shape[1] - len(sentence))
        
    if len(predicted_sentence) < attention.shape[0]:
        predicted_sentence += ["<PAD>"] * (attention.shape[0] - len(predicted_sentence))
    
    fontdict = {"fontsize": 14}
    ax.set_xticklabels([' '] + sentence, fontdict=fontdict, rotation=90)
    ax.set_yticklabels([' '] + predicted_sentence, fontdict=fontdict)
    
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))
    
    plt.show()

In [None]:
def translate(sentence):
    predicted_text, sentence, attention_plot = predict(sentence)
        
    attention_plot = attention_plot[:len(predicted_text.split(' ')), :len(sentence.split(' '))]
    return predicted_text, attention_plot

In [None]:
random_id = np.random.randint(0, testX.shape[0])
sentence = ' '.join([fr_tokenizer.index_word[i[0]] for i in testX[random_id] if i[0] != 0][1:-1])

In [None]:
predicted_text, attention_plot = translate(sentence)

print('Input:        ', sentence)
print('Predicted:   ', predicted_text)
print('Ground Truth: ', decode_sequences(eng_tokenizer, testY[random_id, : ,0]))

In [None]:
plot_attention(attention_plot, sentence, predicted_text)

In [None]:
data = []

references = []
candidates = []

for i in tqdm(range(testX.shape[0]//2)):
    testX_decoded = decode_sequences(fr_tokenizer, testX[i, : ,0])
    testY_decoded = decode_sequences(eng_tokenizer, testY[i, : ,0])
    candidate = translate(testX_decoded)[0].replace('<end>', '').replace('<start>', '').strip()
    
    data.append({
        'Context': testX_decoded,
        'Reference': testY_decoded,
        'Candidate': candidate,
        'length': len(testX_decoded.split())
    })
    
    # corpus_bleu expects token lists: a list of tokenized references per tokenized candidate
    references.append([testY_decoded.split()])
    candidates.append(candidate.split())

In [None]:
# split the examples into smaller datasets based on source sentence length
length_ranges = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 30), (31, 40), (41, 60), (61, float('inf'))]

small_datasets = {}
for min_len, max_len in length_ranges:
    filtered_examples = [example for example in data if example['length'] >= min_len and example['length'] <= max_len]
    small_datasets[f'dataset_{min_len}_{max_len}'] = filtered_examples

samples_per_range = []
for key, subset in small_datasets.items():  # `subset` avoids overwriting the tf.data `dataset`
    samples_per_range.append(len(subset))
    print(f"{key}: {len(subset)} samples")

In [None]:
def compute_corpus_bleu(references, candidates):
    if len(references) != len(candidates):
        raise ValueError('The number of references and candidates must be the same :', len(references), len(candidates))
    
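    if len(references) == 0:
        return 0.0

    # corpus_bleu works on tokens, so split each sentence string into words
    reference_tokens = [[ref.split()] for ref in references]
    candidate_tokens = [cand.split() for cand in candidates]
    return corpus_bleu(reference_tokens, candidate_tokens)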
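
BLEU is built from clipped ("modified") n-gram precisions computed over word tokens. The sketch below implements that single building block for one reference and one candidate, to show why the inputs must be lists of words rather than raw strings. It is an illustration only: it omits the brevity penalty and the geometric mean over n-gram orders that the full metric applies, and the function names are our own.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of one tokenized candidate against one tokenized reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

toy_reference = "the cat is on the mat".split()
toy_candidate = "the the the cat".split()
p1 = modified_precision(toy_reference, toy_candidate, 1)
p2 = modified_precision(toy_reference, toy_candidate, 2)
print(p1, p2)
```

Clipping is what stops a candidate from scoring well simply by repeating a common word such as "the".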

In [None]:
bleu_scores = []
for key, subset in small_datasets.items():  # `subset` avoids overwriting the tf.data `dataset`
    refs = [example['Reference'] for example in subset]
    cands = [example['Candidate'] for example in subset]
    
    corpus_bleu_score = compute_corpus_bleu(refs, cands)
    bleu_scores.append(corpus_bleu_score)
    
    print(f"{key}: {corpus_bleu_score:.4f}")

In [None]:
overall_bleu_score = corpus_bleu(references, candidates)
overall_bleu_score

In [None]:
import matplotlib.patches as mpatches

plt.figure(figsize=(15, 7))

colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']  # List of colors for each bar
bar_plot = plt.bar([f'{start}-{end}' for start, end in length_ranges], bleu_scores, color=colors, alpha=0.7, label='BLEU Score')

# Add "All" bar with legend
all_bar = plt.bar("All", overall_bleu_score, color='k', alpha=0.7)

# Create a dummy handle for the "All" bar
all_patch = mpatches.Patch(color='k', label=f'Sample = {len(candidates)}')
legend_labels = [f'Sample = {value}' for value in samples_per_range]

# Include the dummy handle in the legend
plt.legend(handles=[*bar_plot, all_patch], labels=legend_labels + [f'Sample = {len(candidates)}'], loc='upper right', title='Samples per range')

plt.xlabel('Word Count Range')
plt.ylabel('BLEU Score')

plt.title('BLEU Score and Number of Samples Based on Word Count Range')

plt.show()
