# Attention Mechanism (Graded)

Welcome to your Attention (required) programming assignment! You will build a **Machine Translation** model using attention. You will use the [Bilingual Sentence Pairs](https://www.kaggle.com/datasets/alincijov/bilingual-sentence-pairs) dataset, which contains 81 distinct languages. The dataset is structured so that each sentence in a given language is paired with its English translation.

Your goal is to select a language of your liking from the dataset and build a machine translation model that translates between it and English!

**Instructions:**
* Do not modify any of the existing code.
* Only write code where prompted. For example, in some sections you will find one of the following markers:
```
# YOUR CODE GOES HERE
# YOUR CODE STARTS HERE
# TODO
```
Only modify those sections of the code.

**You will learn:**
* Data preprocessing for machine translation: cleaning the data, tokenizing it, and padding sequences to a uniform length.
* Incorporating an attention mechanism into a seq2seq model. This mechanism lets the decoder focus on the relevant parts of the input sequence when generating the output sequence, improving translation accuracy.
* Building a robust machine translation model.
* Running inference with the trained model to make predictions.

# Machine Translation using Seq2Seq Attention Mechanism

<img src="assets/attention-1.png" width=500>

1. **User Input:**
  * The user enters an English sentence, which serves as the input to the model.

2. **Encoder:**

  * The encoder processes the English sentence, converting it into a numerical representation (embeddings). This involves breaking down the sentence into individual words or tokens and representing each token as a dense vector.
  * The encoder typically uses a recurrent neural network (RNN) or a transformer to capture the sequential nature of the input. RNNs process the input sequence one word at a time, maintaining a hidden state that stores information about the previously seen words.

3. **Seq2Seq Model:**

  * This is the core of the model that connects the encoder and decoder.
  * The Seq2Seq model processes the encoded English sentence and passes it to the decoder.
  * It acts as a bridge between the two components, ensuring that the information from the encoder is effectively transmitted to the decoder.

4. **Decoder:**

  * The decoder generates the Hindi sentence, one word at a time. It uses the encoded English sentence as context and its own previous outputs to predict the next word.
  * The decoder often employs an attention mechanism to focus on relevant parts of the encoded English sentence while generating the Hindi output.
  * This allows the decoder to align the English words with their corresponding Hindi translations, ensuring that the generated Hindi sentence is coherent and accurate.

In [None]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import string
import random
import os
from helpers import *
from tests import *

from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm
tqdm.pandas()

**Before we get started:**

Download any .txt data from here: [Bilingual Sentence Pairs](https://www.kaggle.com/datasets/alincijov/bilingual-sentence-pairs)

In [None]:
# TODO

# Load dataset
dataset_path = 'hin.txt' #TODO: Path to your dataset


In [None]:
# TODO

with open(dataset_path, 'r', encoding='utf-8') as f:

    #TODO: Check the length of the dataset
    lines = f.readlines()
    num_lines = len(lines)

    print(f"Length of the dataset: {num_lines}")

In [None]:
# Printing a few lines from the dataset

print("Sample lines:")
for i in range(5):
    print(lines[i].strip())

# Data Preparation

The data is prepared by carrying out the following steps:


1. **Data cleaning:** Lowercasing, removing punctuation, excluding special characters, etc.
2. **Tokenizing and Padding:** Converting sentences into sequences of token ids and padding the input and target tensors to a uniform length.
3. **Train/Test Split:** Creating `tf.data.Dataset` objects for training and testing, then shuffling and batching them for efficient training.



## Data Cleaning

* **Lowercasing:** Converting all text to lowercase ensures that words are treated the same regardless of their capitalization.
* **Removing Punctuation:** Punctuation can introduce noise and is often not relevant to the meaning of the text.
* **Removing Special Characters/Apostrophes:** Removing special characters and unnecessary symbols eliminates additional noise.
* **Removing Extra Spaces:** Removing extra spaces and standardizing the spacing ensures that the text is uniformly formatted.
* **Adding Special Tokens:** Adding special tokens such as `<start>` and `<end>` helps the model identify the beginning and end of sequences, which is important for sequence-to-sequence tasks like machine translation.
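
To make the cleaning steps concrete, here is a minimal standalone sketch on toy strings. It is an illustration under stated assumptions, not the graded `preprocess` function: the name `clean_sentence` is hypothetical, and it only strips ASCII punctuation from `string.punctuation`, so language-specific marks (for example the Hindi danda `।`) would need to be added to the exclusion set separately.

```python
import re
import string

def clean_sentence(sent, add_tokens=False):
    """Illustrative cleaner (hypothetical helper, not the graded function)."""
    sent = sent.lower()                      # lowercase
    sent = sent.replace("'", "")             # drop apostrophes
    # Remove ASCII punctuation only; other scripts need extra characters
    sent = sent.translate(str.maketrans("", "", string.punctuation))
    sent = re.sub(r"\s+", " ", sent).strip() # collapse and trim whitespace
    if add_tokens:
        # Tokens are added last so '<' and '>' are not stripped as punctuation
        sent = f"<start> {sent} <end>"
    return sent

print(clean_sentence("  Hello,   World!  It's me. "))  # hello world its me
print(clean_sentence("Jump.", add_tokens=True))        # <start> jump <end>
```

Note the ordering: the special tokens go in after punctuation removal, because `<` and `>` are themselves punctuation characters.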

In [None]:
# TODO

def preprocess(sent, sp_tokens=False):
    '''
    Performs all text preprocessing steps such as removing
    unnecessary characters, adding special tokens, etc.

    sent (str): sentence to preprocess
    sp_tokens (bool): If True, special tokens '<start>' and '<end>'
                      will be added. Default False.

    returns (str): preprocessed sentence
    '''


    exclude = # TODO: create a set of punctuation characters to exclude (e.g. !@#$%)
    sent = #TODO: Lowercase all the words inside 'sent'
    sent = #TODO: Remove apostrophes (')
    sent = #TODO: Get rid of all the punctuation
    sent = #TODO: Remove leading and trailing spaces
    sent = #TODO: Replace multiple spaces with a single space

    if sp_tokens:
        sent = #TODO: Add '<start>' and '<end>' tokens to the sentence

    return sent

def preprocess_lines(lines):
    '''
    Preprocesses the given list of sentence pairs.

    args:
      lines: list of tuples, each containing (source_sentence, target_sentence).

    returns:
      Preprocessed list of sentence pairs.
    '''
    prep_lines = [
        [preprocess(i, sp_tokens=False),
         preprocess(j, sp_tokens=True)]
        for i, j in lines
    ]

    return prep_lines


In [None]:
#TODO

def load_dataset(path, reverse=False):
    '''
    Loads and optionally reverses the language pairs inside the dataset.

    args:
      path: str, path to the translation file.
      reverse: bool, optional, if True, reverse the language pairs. Default False.

    returns:
      lines (list): list of sentence pairs.
    '''
    lines = #TODO: Create a list of lists of language pairs

    if reverse:
        lines = #TODO: Reverse the positions of language pairs

    return lines

**Expected lines to return:**

  ```
  [['Wow!', 'वाह!'], ['Help!', 'बचाओ!'], ['Jump.', 'उछलो.'] ...... ]

  ```

***Note:*** *The second value inside the list of lists can be any chosen language that you selected for this task*

## Tokenizing and Padding

* **Tokenization:** Tokenization is essential to convert text data into a numerical format that can be processed by machine learning models.
* **Padding:** Padding ensures that all sequences in a batch have the same length, which is a requirement for many deep learning models, especially those using recurrent neural networks or transformers.
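
As a rough picture of what tokenization and padding produce, the sketch below builds a tiny vocabulary and right-pads the resulting id sequences in plain Python. The helpers `build_vocab` and `pad_post` are hypothetical stand-ins for illustration; the notebook itself relies on the `LanguageIndex` helper and Keras `pad_sequences`, whose behavior may differ (for instance, `pad_sequences` pads at the front unless `padding='post'` is passed). Id 0 is assumed to be reserved for padding, which matches the `mask_zero=True` embedding and the `y_true != 0` mask used later.

```python
def build_vocab(sentences):
    """Map each word to an integer id; id 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for sent in sentences:
        for word in sent.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def pad_post(seqs, max_len, pad_id=0):
    """Right-pad (or truncate) every sequence to exactly max_len ids."""
    return [seq[:max_len] + [pad_id] * (max_len - len(seq)) for seq in seqs]

sentences = ["i am here", "run", "i run"]
vocab = build_vocab(sentences)
ids = [[vocab[w] for w in s.split()] for s in sentences]
padded = pad_post(ids, max_len=max(len(s) for s in ids))
print(padded)  # [[1, 2, 3], [4, 0, 0], [1, 4, 0]]
```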

In [None]:

def create_tensors(prep_lines, inp_lang, tgt_lang):
    '''
    Converts preprocessed sentences into input and target tensors.

    args:
      prep_lines: list of preprocessed sentence pairs.
      inp_lang: LanguageIndex object for the input language.
      tgt_lang: LanguageIndex object for the target language.

    returns:
      input_tensor: list of input tensors.
      target_tensor: list of target tensors.
    '''
    input_tensor = [[inp_lang.word2idx[w] for w in inp.split(' ')]
                    for inp, tgt in prep_lines]

    target_tensor = [[tgt_lang.word2idx[w] for w in tgt.split(' ')]
                     for inp, tgt in prep_lines]

    return input_tensor, target_tensor

#TODO
def pad_tensors(input_tensor, target_tensor, max_len=None):
    '''
    Pads input and target tensors to the specified max length.

    args:
      input_tensor: list of input tensors.
      target_tensor: list of target tensors.
      max_len: int, optional, maximum length for padding. Default None.

    returns:
      input_tensor: padded input tensor.
      target_tensor: padded target tensor.
      max_length_inp: int, maximum length of input tensor.
      max_length_tgt: int, maximum length of target tensor.
    '''
    if max_len:
        input_tensor = #TODO: Create a list of lists of input_tensors with sequences <= max_len
        target_tensor = #TODO: Create a list of lists of target_tensors with sequences <= max_len
        max_length_inp = max_length_tgt = max_len
    else:
        max_length_inp = #TODO: Get the length of the longest sequence in the input_tensor.
        max_length_tgt = #TODO: Get the length of the longest sequence in the target_tensor.

    input_tensor =  #TODO: pad input_tensor sequences using keras 'pad_sequences' method.

    target_tensor = #TODO: pad target_tensor sequences using keras 'pad_sequences' method.

    return input_tensor, target_tensor, max_length_inp, max_length_tgt





## Train/Test Split

In [None]:
#TODO
def split_dataset(input_tensor, target_tensor, batch_size):
    '''
    Splits the dataset into training and validation sets.

    args:
      input_tensor: padded input tensor.
      target_tensor: padded target tensor.
      batch_size: int, batch size for training and validation sets.

    returns:
      train_dataset: tf.data.Dataset for training.
      test_dataset: tf.data.Dataset for validation.
    '''
    input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split( #TODO: Split the input and target tensor into train and test sets
        input_tensor,
        target_tensor,
        test_size=0.1,
        random_state=42)

    train_dataset = #TODO: Create a TF dataset object, shuffle it and divide it into batches

    test_dataset = #TODO: Create a TF dataset object, shuffle it and divide it into batches

    return train_dataset, test_dataset


In [None]:

def data_loader(path, batch_size, max_len=None, reverse=False):
    '''
    Combines all the above methods to load and preprocess the dataset.

    args:

      path: str, path to the translation file.
      batch_size: int, batch size for training and validation sets.
      max_len: int, optional, maximum length for padding. Default None.
      reverse: bool, optional, if True, reverse the language pairs. Default False.

    returns:

      train_dataset: tf.data.Dataset for training.
      test_dataset: tf.data.Dataset for validation.
      inp_lang: LanguageIndex object for input language.
      tgt_lang: LanguageIndex object for target language.
      max_length_inp: int, maximum length of input tensor.
      max_length_tgt: int, maximum length of target tensor.
    '''
    lines = #TODO: Load the dataset
    prep_lines = #TODO: Preprocess the input sequences
    inp_lang, tgt_lang = create_language_indices(prep_lines)
    input_tensor, target_tensor = create_tensors(prep_lines, inp_lang, tgt_lang)
    input_tensor, target_tensor, max_length_inp, max_length_tgt = #TODO: Pad the input and target tensors

    train_dataset, test_dataset = split_dataset(input_tensor, target_tensor, batch_size)

    return train_dataset, test_dataset, inp_lang, tgt_lang, max_length_inp, max_length_tgt


# Model training and Evaluation

### Encoder Network

The encoder is responsible for processing the input sequence (the English sentence). It produces one hidden state per input token together with a final LSTM state. The final state initializes the decoder, while the per-token hidden states are what the attention layer attends over when building the "context vector" for each decoding step.



In [None]:
#TODO

# Encoder
class Encoder(tf.keras.layers.Layer):

    def __init__(self, embed_dim, units, vocab_size):
        super().__init__()
        self.embedding = #TODO: Add an Embedding layer with mask_zero set to True
        self.rnn = #TODO: Add an LSTM layer with return_sequences and return_state set to True

    def call(self, x):
        # x => (batch_size, max_len)
        x = #TODO: Embed the input; x => (batch_size, s, embed_dim)
        enc_outputs = #TODO: Pass the embedded input through the RNN
        return enc_outputs[0], enc_outputs[1:]


### Attention

We use additive attention, also known as Bahdanau attention, which was introduced by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio in their 2015 paper, "Neural Machine Translation by Jointly Learning to Align and Translate."

Additive attention allows the model to automatically learn which parts of the input sequence are most relevant to each output word during decoding. The name "additive" comes from the fact that it combines (adds) the decoder's hidden state and the encoder's hidden states to compute attention scores.

The attention weights are calculated and then used to build a context vector, which captures the information from the source sequence that is relevant for generating the target sequence, as follows:

$$
e_{ij} = v_a^T \tanh(W_a s_{i-1} + U_a h_j)
$$

where,
* $e_{ij}$ is the alignment score between the i-th position of the target sequence and the j-th position of the source sequence.
* $v_a$, $W_a$, and $U_a$ are learned parameters (a vector and two matrices) used to calculate the alignment score.
* $s_{i-1}$ is the hidden state of the decoder at the previous time step.
* $h_j$ is the hidden state of the encoder at the j-th time step.


$$
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}
$$

where,
* $\alpha_{ij}$ is the attention weight that the i-th position of the target sequence places on the j-th position of the source sequence.
* The equation calculates the attention weight by normalizing the alignment scores using the softmax function.


$$
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
$$

where,
* $c_i$ is the context vector at the i-th time step of the decoder.
* The equation calculates the context vector by taking a weighted sum of the encoder hidden states, where the weights are the attention weights.
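
The three equations above can be traced for a single decoder step with a small pure-Python sketch. This is only an illustration on hand-picked numbers: the function name `additive_attention` and the toy parameters (identity matrices for $W_a$ and $U_a$, a vector of ones for $v_a$) are assumptions made for the example, whereas the notebook's `Attention` layer performs the same computation in batched form with learned dense layers.

```python
import math

def additive_attention(s_prev, enc_states, W_a, U_a, v_a):
    """Return (weights, context) for one decoder step of additive attention."""
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    # e_j = v_a^T tanh(W_a s_prev + U_a h_j) for every encoder state h_j
    Ws = matvec(W_a, s_prev)
    scores = []
    for h in enc_states:
        Uh = matvec(U_a, h)
        hidden = [math.tanh(a + b) for a, b in zip(Ws, Uh)]
        scores.append(sum(v * x for v, x in zip(v_a, hidden)))

    # Softmax over source positions (shifted by the max for stability)
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the encoder states
    dim = len(enc_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, enc_states))
               for d in range(dim)]
    return weights, context

identity = [[1.0, 0.0], [0.0, 1.0]]          # toy W_a and U_a
s_prev = [1.0, 0.0]                          # previous decoder state
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = additive_attention(
    s_prev, enc_states, identity, identity, v_a=[1.0, 1.0])
print(weights, context)
```

The weights always sum to 1, so the context vector is a convex combination of the encoder states.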

In [None]:

#TODO
class Attention(tf.keras.layers.Layer):

    def __init__(self, units):
        super().__init__()
        self.W_q = #TODO: define the query projection W_a (dense layer)
        self.W_k = #TODO: define the key projection U_a (dense layer)
        self.W_v = #TODO: define the scoring layer v_a (dense layer with one output unit)


    def call(self, query, key, value, mask=None):
        query, key = self.W_q(query), self.W_k(key)
        # query => (batch_size, t, units)
        # key => (batch_size, s, units)

        score = self.W_v(
            tf.math.tanh(
                tf.expand_dims(query, 2) + tf.expand_dims(key, 1)
            )
        )
        score = tf.squeeze(score, -1)
        # score => (batch_size, t, s)

        if mask is not None:
            score = tf.where(mask, score, -1e6)

        attention_weights = #TODO: attention weights shape (batch_size, t, s)

        context = #TODO: context vector shape (batch_size, t, units)

        return context, attention_weights

### Decoder Network

The decoder is responsible for generating the target sequence (language you chose) word by word, using the context vector produced by the encoder and its own internal state.
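
The word-by-word generation described above can be sketched as a greedy loop that feeds each predicted token back in as the next input. Everything here is a hypothetical stand-in: `greedy_decode` and `toy_step` are illustrative names, and `toy_step` replaces a real decoder with a rule that always favors the next token id. The notebook's own `predict_seq2seq` helper is not shown here and may work differently.

```python
def greedy_decode(step_fn, start_id, end_id, init_state, max_len):
    """Generate token ids one at a time until end_id or max_len is reached."""
    tokens, state, prev = [], init_state, start_id
    for _ in range(max_len):
        scores, state = step_fn(prev, state)
        # Greedy choice: take the highest-scoring token (argmax)
        prev = max(range(len(scores)), key=scores.__getitem__)
        if prev == end_id:
            break
        tokens.append(prev)
    return tokens

def toy_step(prev_id, state):
    """Hypothetical decoder step: always favors token prev_id + 1."""
    vocab_size = 5
    scores = [0.0] * vocab_size
    scores[min(prev_id + 1, vocab_size - 1)] = 1.0
    return scores, state + 1

decoded = greedy_decode(toy_step, start_id=0, end_id=4, init_state=0, max_len=10)
print(decoded)  # [1, 2, 3]
```

In a real decoder, `step_fn` would also receive the encoder outputs so that attention can be recomputed at every step.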

In [None]:
#TODO
class Decoder(tf.keras.layers.Layer):

    def __init__(self, embed_dim, units, vocab_size):
        super().__init__()

        # Embedding layer to convert tokens to vectors
        self.embedding = tf.keras.layers.Embedding(
            vocab_size, embed_dim, mask_zero=True)

        # RNN layer
        self.rnn = tf.keras.layers.LSTM(
            units, return_sequences=True, return_state=True)

        # Attention layer
        self.attention = Attention(units)

        # Final layer to output logits, we can use
        # argmax to know which output token is predicted.
        self.fc = tf.keras.layers.Dense(vocab_size)


    def call(self, x, enc_outputs, state, mask=None):
        x = #TODO: Pass x through the embedding layer
        # x => (batch_size, t, embed_dim)

        dec_outputs = #TODO: Run the RNN on x with initial_state set to state
        output = dec_outputs[0]
        state = dec_outputs[1:]
        # output   => (batch_size, t, units)
        # state[i] => (batch_size, units)

        context_vector, attention_weights = self.attention(
            query=output,
            key=enc_outputs,
            value=enc_outputs,
            mask=mask
        )
        # context_vector => (batch_size, t, units)
        # attention_weights => (batch_size, t, s)

        context_rnn_output = tf.concat(
            [context_vector, output], axis=-1)
        # context_rnn_output => (batch_size, t, 2*units)

        pred = self.fc(context_rnn_output)
        # pred => (batch_size, t, vocab_size)

        return pred, state, attention_weights

## Custom Loss Function

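The following custom loss function computes the cross-entropy loss between the predicted and actual target tokens while ignoring padding tokens, so the model is not penalized for whatever it predicts at padded positions. It returns the loss summed over non-padding tokens; the training loop later divides by the number of non-padding tokens. This helps the model focus on learning the actual translation task.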
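
To see the effect of the padding mask on concrete numbers, here is a small pure-Python sketch. The function `masked_cross_entropy` is a hypothetical illustration, not the notebook's `loss_fn`, and it assumes that token id 0 marks padding.

```python
import math

def masked_cross_entropy(logits, targets, pad_id=0):
    """Sum cross-entropy over non-padding targets; also return their count."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == pad_id:
            continue  # padded position: contributes nothing to the loss
        top = max(row)
        # log-sum-exp, shifted by the max for numerical stability
        log_z = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_z - row[target]
        count += 1
    return total, count

logits = [[0.0, 2.0, 0.0],   # target 1
          [0.0, 0.0, 2.0],   # target 2
          [5.0, 0.0, 0.0]]   # target 0 (padding, ignored)
targets = [1, 2, 0]
total, count = masked_cross_entropy(logits, targets)
print(total / count)  # mean loss per real (non-padding) token
```

Dividing by `count` rather than by the full sequence length keeps the loss comparable between batches with different amounts of padding.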

In [None]:
# Custom loss function:
def loss_fn(y_true, y_pred):
    # y_true => (batch_size, max_len)
    # y_pred => (batch_size, max_len, vocab_size)

    mask = tf.cast(y_true != 0, tf.float32)

    loss = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction='none')(y_true, y_pred)

    # masking the padding tokens
    loss = tf.reduce_sum(loss * mask)
    return loss

## Model Training

We will now run the entire training process of the machine translation model, from data preparation to model updates and performance tracking.

In [None]:
#TODO

# TODO: Define Hyperparameters
BATCH_SIZE = #Minimum 64
EMBEDDING_DIM =  #Minimum 50
UNITS = #Minimum 50
NUM_EPOCHS =

validate_data_loader(data_loader, path=dataset_path, batch_size=BATCH_SIZE)

train_dataset, test_dataset, inp_lang, tgt_lang, max_length_inp, max_length_tgt = #TODO: Load data using Dataloader

vocab_inp_size = len(inp_lang.word2idx) #Vocab size of English
vocab_tgt_size = len(tgt_lang.word2idx) #Vocab size of the language you chose

#TODO: Create an Adam optimizer and assign it to `trainer` (used in the training loop below)

# Creating instances of encoder and decoder
encoder = Encoder(EMBEDDING_DIM, UNITS, vocab_inp_size)
validate_encoder(encoder, vocab_inp_size, embed_dim=64, units=50)

decoder = Decoder(EMBEDDING_DIM, UNITS, vocab_tgt_size)

#Storing loss at each epoch to plot later
history = {
'loss': []
}

for epoch in range(NUM_EPOCHS):
    total_loss = 0.

    with tqdm(total=len(train_dataset)) as pbar:
        for batch, (x, y) in enumerate(train_dataset):
            inp_mask = tf.expand_dims(x != 0, axis=1)
            tgt_mask = tf.cast(y != 0, tf.float32)

            with tf.GradientTape() as tape:
                loss = tf.constant(0.0)
                enc_outputs, enc_states = encoder(x)
                dec_states = enc_states

                dec_input = tf.expand_dims(y[:, 0], axis=1)
                for t in range(1, y.shape[1]):  # iterate over target time steps
                    dec_outputs, dec_states, tmp_a = decoder(
                        dec_input, enc_outputs,
                        dec_states, inp_mask)

                    loss += loss_fn(
                        tf.expand_dims(y[:, t], axis=1), dec_outputs)
                    dec_input = tf.expand_dims(y[:, t], axis=1)

                loss = loss/tf.reduce_sum(tgt_mask)

            variables = (encoder.trainable_variables +
            decoder.trainable_variables)
            gradients = tape.gradient(loss, variables)
            trainer.apply_gradients(zip(gradients, variables))
            total_loss += loss.numpy()
            pbar.update(1)

    epoch_loss = total_loss/len(train_dataset)
    history['loss'].append(epoch_loss)
    print(f'Epoch: {epoch} | Loss: {epoch_loss}')







In [None]:
epochs = list(range(len(history['loss'])))
plt.plot(epochs, history['loss'])
plt.title('Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

# Inference

In [None]:
eng, hin = next(iter(train_dataset))

idx = -5
actual_seq = detokenize(hin[idx], tgt_lang.idx2word)
translation, att_wgts = predict_seq2seq(encoder, decoder, eng[idx], tgt_lang, max_length_tgt)

eng_sent = detokenize(eng[idx], inp_lang.idx2word)
print(f'English Sentence:      {eng_sent}')
print(f'\nPredicted Translation: {translation}')
print(f'Actual Translation:    {actual_seq}')

## Visualizing attention weights

In [None]:
import plotly
from plotly.offline import init_notebook_mode, iplot
plotly.offline.init_notebook_mode(connected=True)
import plotly.graph_objects as go


idx = -5
actual_seq = detokenize(hin[idx], tgt_lang.idx2word)
translation, att_wgts = predict_seq2seq(encoder, decoder, eng[idx], tgt_lang, max_length_tgt)

eng_sent = detokenize(eng[idx], inp_lang.idx2word)
print(f'English Sentence:      {eng_sent}')
print(f'Predicted Translation: {translation}')
print(f'Actual Translation:    {actual_seq}')


eng_tokens = eng_sent.strip().split(' ')
eng_tokens += ['<PAD>'] * (max_length_inp - len(eng_tokens))

trace = go.Heatmap(
    z=att_wgts, 
    x= eng_tokens, 
    y=translation.split(' '), 
    colorscale='Reds'
)
iplot([trace])