<a href="https://colab.research.google.com/github/sankarvinayak/DL-assignment-3/blob/main/DL_Assignment_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DA6401 Assignment 3
Use recurrent neural networks to build a transliteration system.

# Instructions
- The goal of this assignment is fourfold: (i) learn how to model sequence-to-sequence learning problems using Recurrent Neural Networks, (ii) compare different cells such as vanilla RNN, LSTM and GRU, (iii) understand how attention networks overcome the limitations of vanilla seq2seq models, and (iv) visualise the interactions between different components in an RNN-based model.
- We strongly recommend that you work on this assignment in a team of size 2. Both the members of the team are expected to work together (in a subsequent viva both members will be expected to answer questions, explain the code, etc).
- Collaborations and discussions with other groups are strictly prohibited.
- You must use Python (numpy and pandas) for your implementation.
- You can use any and all packages from Keras, PyTorch and TensorFlow.
- You can run the code in a Jupyter notebook on Colab by enabling GPUs.
- You have to generate the report in the same format as shown below using wandb.ai. You can start by cloning this report using the clone option above. Most of the plots that we have asked for below can be (automatically) generated using the apis provided by wandb.ai. You will upload a link to this report on gradescope.
- You also need to provide a link to your github code as shown below. Follow good software engineering practices and set up a github repo for the project on Day 1. Please do not write all code on your local machine and push everything to github on the last day. The commits in github should reflect how the code has evolved during the course of the assignment.
- You have to check moodle regularly for updates regarding the assignment.



# Problem Statement

In this assignment you will experiment with the [Dakshina dataset](https://github.com/google-research-datasets/dakshina) released by Google. This dataset contains pairs of the following form:

$x$      $y$

ajanabee अजनबी

i.e., a word in the native script and its corresponding transliteration in the Latin script (the way we type while chatting with our friends on WhatsApp etc.). Given many such $(x_i, y_i)_{i=1}^n$ pairs, your goal is to train a model $y = \hat{f}(x)$ which takes as input a romanized string (ghar) and produces the corresponding word in Devanagari (घर).

As you will realise, this is the problem of mapping a sequence of characters in one language to a sequence of characters in another language. Notice that this is a scaled-down version of the problem of translation, where the goal is to translate a sequence of **words** in one language to a sequence of words in another language (as opposed to a sequence of **characters** here).

Read these blogs to understand how to build neural sequence to sequence models: [blog1](https://keras.io/examples/nlp/lstm_seq2seq/), [blog2](https://machinelearningmastery.com/define-encoder-decoder-sequence-sequence-model-neural-machine-translation-keras/)




In [1]:
!wget https://storage.googleapis.com/gresearch/dakshina/dakshina_dataset_v1.0.tar
!tar -xf dakshina_dataset_v1.0.tar
!cp -r dakshina_dataset_v1.0/ml .
!rm -rf dakshina_dataset_v1.0
!rm -r dakshina_dataset_v1.0.tar

--2025-05-15 04:53:20--  https://storage.googleapis.com/gresearch/dakshina/dakshina_dataset_v1.0.tar
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.24.207, 142.251.10.207, 142.251.12.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.24.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2008340480 (1.9G) [application/x-tar]
Saving to: ‘dakshina_dataset_v1.0.tar’


2025-05-15 04:54:48 (21.8 MB/s) - ‘dakshina_dataset_v1.0.tar’ saved [2008340480/2008340480]



In [1]:
import pandas as pd

# Paths to your data files (adjust if needed)
path = "/content/ml/lexicons/ml.translit.sampled.train.tsv"
path_val = "/content/ml/lexicons/ml.translit.sampled.dev.tsv"
path_test = "/content/ml/lexicons/ml.translit.sampled.test.tsv"

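# Read the files. keep_default_na=False stops pandas from parsing strings such as
# "null" or "nan" as missing values; otherwise dropping missing entries from each
# column separately would break the pairing between native and romanized words.
df = pd.read_csv(path, sep='\t', header=None, keep_default_na=False)
df_val = pd.read_csv(path_val, sep='\t', header=None, keep_default_na=False)
df_test = pd.read_csv(path_test, sep='\t', header=None, keep_default_na=False)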

# Split into native (Malayalam) and romanized (English) columns
malayalam_words = df[0]
english_words = df[1]

malayalam_words_val = df_val[0]
english_words_val = df_val[1]

malayalam_words_test = df_test[0]
english_words_test = df_test[1]



In [2]:
malayalam_words,english_words

(0             അം
 1            അംഗ
 2            അംഗ
 3           അംഗം
 4           അംഗം
           ...   
 58377       ഹൗസ്
 58378       ഹർജി
 58379       ഹർജി
 58380    ഹർജിയിൽ
 58381    ഹർജിയിൽ
 Name: 0, Length: 58382, dtype: object,
 0              am
 1            amga
 2            anga
 3           amgam
 4           angam
            ...   
 58377       house
 58378       harje
 58379       harji
 58380    harjeyil
 58381    harjiyil
 Name: 1, Length: 58382, dtype: object)

In [3]:

english_words = english_words.dropna()
malayalam_words = malayalam_words.dropna()

english_words = english_words.astype(str)
malayalam_words = malayalam_words.astype(str)

english_chars = sorted(set("".join(english_words)))
malayalam_chars = sorted(set("".join(malayalam_words)))

max_len_eng = max(len(w) for w in pd.concat([english_words, english_words_val, english_words_test]).dropna().astype(str))
max_len_mal = max(len(w) for w in pd.concat([malayalam_words, malayalam_words_val, malayalam_words_test]).dropna().astype(str))

print("English characters:", english_chars)
print("Malayalam characters:", malayalam_chars)
print("Max English word length:", max_len_eng)
print("Max Malayalam word length:", max_len_mal)


English characters: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Malayalam characters: ['ം', 'ഃ', 'അ', 'ആ', 'ഇ', 'ഈ', 'ഉ', 'ഊ', 'ഋ', 'എ', 'ഏ', 'ഐ', 'ഒ', 'ഓ', 'ഔ', 'ക', 'ഖ', 'ഗ', 'ഘ', 'ങ', 'ച', 'ഛ', 'ജ', 'ഝ', 'ഞ', 'ട', 'ഠ', 'ഡ', 'ഢ', 'ണ', 'ത', 'ഥ', 'ദ', 'ധ', 'ന', 'പ', 'ഫ', 'ബ', 'ഭ', 'മ', 'യ', 'ര', 'റ', 'ല', 'ള', 'ഴ', 'വ', 'ശ', 'ഷ', 'സ', 'ഹ', 'ാ', 'ി', 'ീ', 'ു', 'ൂ', 'ൃ', 'െ', 'േ', 'ൈ', 'ൊ', 'ോ', '്', 'ൗ', 'ൺ', 'ൻ', 'ർ', 'ൽ', 'ൾ', '\u200c']
Max English word length: 32
Max Malayalam word length: 31


In [4]:

longest_malayalam_word = max(malayalam_words, key=len)
print("Longest Malayalam word:", longest_malayalam_word)
print("Length:", len(longest_malayalam_word))


Longest Malayalam word: ചൂണ്ടിക്കാണിക്കപ്പെട്ടിട്ടുണ്ട്
Length: 31


In [5]:
def word2vec(word, lang):
    """Encode a word as a fixed-length list of character indices.

    Index 0 is padding (and doubles as the end marker), indices
    1..len(chars) are characters, and len(chars) + 1 is the start token.
    """
    if lang == "english":
        chars, max_len = english_chars, max_len_eng
    elif lang == "malayalam":
        chars, max_len = malayalam_chars, max_len_mal
    else:
        return []

    vec = [len(chars) + 1]  # start token
    # Characters missing from the vocabulary are skipped
    vec.extend(chars.index(char) + 1 for char in word if char in chars)
    # Pad with 0 up to max_len + 1 (the +1 accounts for the start token)
    vec.extend([0] * (max_len + 1 - len(vec)))
    vec.append(0)  # trailing 0: end marker, shares the padding index
    return vec


In [6]:
sample_word = malayalam_words[10]
vec = word2vec(sample_word, "malayalam")
print("Malayalam word:", sample_word)
print("Tokenized vector:", vec)


Malayalam word: അംഗങ്ങളായി
Tokenized vector: [71, 3, 1, 18, 20, 63, 20, 45, 52, 41, 53, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [7]:
def ip_matrix_construct(words, lang):
    ans = []
    for word in words:
        ans.append(word2vec(word, lang))
    return ans


In [8]:
import torch
english_matrix = ip_matrix_construct(english_words.dropna().astype(str), "english")
malayalam_matrix = ip_matrix_construct(malayalam_words.dropna().astype(str), "malayalam")

# Convert to tensors
english_matrix = torch.tensor(english_matrix)
malayalam_matrix = torch.tensor(malayalam_matrix)

# For validation data
english_matrix_val = ip_matrix_construct(english_words_val.dropna().astype(str), "english")
malayalam_matrix_val = ip_matrix_construct(malayalam_words_val.dropna().astype(str), "malayalam")
english_matrix_val = torch.tensor(english_matrix_val)
malayalam_matrix_val = torch.tensor(malayalam_matrix_val)

# For test data
english_matrix_test = ip_matrix_construct(english_words_test.dropna().astype(str), "english")
malayalam_matrix_test = ip_matrix_construct(malayalam_words_test.dropna().astype(str), "malayalam")
english_matrix_test = torch.tensor(english_matrix_test)
malayalam_matrix_test = torch.tensor(malayalam_matrix_test)



# Question 1 (15 Marks)
Build an RNN-based seq2seq model which contains the following layers: (i) an input layer for character embeddings, (ii) one encoder RNN which sequentially encodes the input character sequence (Latin), and (iii) one decoder RNN which takes the last state of the encoder as input and produces one output character at a time (Devanagari).

The code should be flexible such that the dimension of the input character embeddings, the hidden states of the encoders and decoders, the cell (RNN, LSTM, GRU) and the number of layers in the encoder and decoder can be changed.

(a) What is the total number of computations done by your network? (assume that the input embedding size is $m$, encoder and decoder have 1 layer each, the hidden cell state is $k$ for both the encoder and decoder, the length of the input and output sequence is the same, i.e., $T$, the size of the vocabulary is the same for the source and target language, i.e., $V$)

(b) What is the total number of parameters in your network? (assume that the input embedding size is $m$, encoder and decoder have 1 layer each, the hidden cell state is $k$ for both the encoder and decoder and the length of the input and output sequence is the same, i.e., $T$, the size of the vocabulary is the same for the source and target language, i.e., $V$)
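
As a sanity check for part (b), the parameter count can be sketched in a few lines. This is a rough sketch rather than the expected answer: it assumes separate source and target embedding tables, a single linear output layer from $k$ to $V$, and PyTorch's convention of two bias vectors per gate. The helper name `count_seq2seq_params` is made up for illustration.

```python
# Number of gate blocks per cell type (PyTorch stacks them in one weight matrix).
GATES = {"RNN": 1, "GRU": 3, "LSTM": 4}


def count_seq2seq_params(m, k, V, cell="RNN"):
    """Count trainable parameters of a 1-layer encoder-decoder (illustrative helper).

    Assumptions: separate source/target embeddings of size V x m, hidden size k,
    two bias vectors per gate (PyTorch convention), one k -> V output layer.
    """
    g = GATES[cell]
    embeddings = 2 * V * m              # encoder + decoder embedding tables
    rnn = g * (k * m + k * k + 2 * k)   # input weights, recurrent weights, two biases
    output_layer = k * V + V            # weights + bias of the final projection
    return embeddings + 2 * rnn + output_layer


for cell in ("RNN", "GRU", "LSTM"):
    print(cell, count_seq2seq_params(m=64, k=128, V=72, cell=cell))
```

Note that $T$ does not appear in the count: the weights are shared across time steps, so the sequence length affects the number of computations but not the number of parameters.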



# Question 2 (10 Marks)

You will now train your model using any one language from the [Dakshina dataset](https://github.com/google-research-datasets/dakshina) (I suggest picking a language that you can read so that it is easy to analyse the errors). Use the standard train, dev and test sets from the folder dakshina_dataset_v1.0/hi/lexicons/ (replace hi with the language of your choice).

Using the sweep feature in wandb find the best hyperparameter configuration. Here are some suggestions but you are free to decide which hyperparameters you want to explore

- input embedding size: 16, 32, 64, 256, ...
- number of encoder layers: 1, 2, 3
- number of decoder layers: 1, 2, 3
- hidden layer size: 16, 32, 64, 256, ...
- cell type: RNN, GRU, LSTM
- dropout: 20%, 30% (btw, where will you add dropout? you should read up a bit on this)
- beam search in decoder with different beam sizes:

Based on your sweep please paste the following plots which are automatically generated by wandb:
- accuracy v/s created plot (I would like to see the number of experiments you ran to get the best configuration).
- parallel co-ordinates plot
- correlation summary table (to see the correlation of each hyperparameter with the loss/accuracy)

Also write down the hyperparameters and their values that you swept over. Smart strategies to reduce the number of runs while still achieving a high accuracy would be appreciated. Write down any unique strategy that you tried for efficiently searching the hyperparameters.
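
One way to cut down the number of full-length runs is to let the sweep stop unpromising runs early. The following is a minimal sketch, assuming wandb's Hyperband early termination and a metric named `val_accuracy` that is logged once per epoch; the variable name and the parameter values are placeholders, not a recommended search space.

```python
# Illustrative sweep configuration; names and values here are placeholders.
sweep_config_early_stop = {
    "method": "bayes",  # Bayesian search proposes configs based on earlier runs
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    # Hyperband compares runs at fixed checkpoints (the first one after min_iter
    # logged steps) and stops the weak ones before they run to completion.
    "early_terminate": {"type": "hyperband", "min_iter": 3},
    "parameters": {
        "hidden_size": {"values": [64, 128, 256]},
        "cell_type": {"values": ["RNN", "GRU", "LSTM"]},
        "dropout": {"values": [0.2, 0.3]},
    },
}
```

Early termination only helps if the training function logs the metric at every epoch, since the stopping decision is made from the logged history.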

In [73]:
import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.log_softmax in beam_search

class Encoder(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, num_layers, dropout, cell_type='LSTM', bidirectional=False):
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.cell_type = cell_type
        self.bidirectional = bidirectional

        self.embedding = nn.Embedding(input_size, embedding_size)
        self.dropout = nn.Dropout(dropout)

        if cell_type == 'GRU':
            self.rnn = nn.GRU(embedding_size, hidden_size, num_layers, dropout=dropout if num_layers > 1 else 0, bidirectional=bidirectional)
        elif cell_type == 'RNN':
            self.rnn = nn.RNN(embedding_size, hidden_size, num_layers, dropout=dropout if num_layers > 1 else 0, bidirectional=bidirectional)
        else:  # LSTM by default
            self.rnn = nn.LSTM(embedding_size, hidden_size, num_layers, dropout=dropout if num_layers > 1 else 0, bidirectional=bidirectional)

    def forward(self, x):
        # x shape: (seq_len, batch)
        embedded = self.dropout(self.embedding(x))
        # embedded shape: (seq_len, batch, embedding_size)

        if self.cell_type == 'LSTM':
            outputs, (hidden, cell) = self.rnn(embedded)
            return outputs, (hidden, cell)
        else:
            outputs, hidden = self.rnn(embedded)
            return outputs, hidden
class Decoder(nn.Module):
    def __init__(self, output_size, embedding_size, hidden_size, num_layers, dropout, cell_type='LSTM'):
        super(Decoder, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.cell_type = cell_type

        self.embedding = nn.Embedding(output_size, embedding_size)
        self.dropout = nn.Dropout(dropout)

        if cell_type == 'GRU':
            self.rnn = nn.GRU(embedding_size, hidden_size, num_layers, dropout=dropout if num_layers > 1 else 0)
        elif cell_type == 'RNN':
            self.rnn = nn.RNN(embedding_size, hidden_size, num_layers, dropout=dropout if num_layers > 1 else 0)
        else:  # LSTM
            self.rnn = nn.LSTM(embedding_size, hidden_size, num_layers, dropout=dropout if num_layers > 1 else 0)

        self.fc_out = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden):
        # x shape: (batch), because one timestep input
        x = x.unsqueeze(0)  # now (1, batch)
        embedded = self.dropout(self.embedding(x))
        # embedded shape: (1, batch, embedding_size)

        if self.cell_type == 'LSTM':
            output, (hidden, cell) = self.rnn(embedded, hidden)
            prediction = self.fc_out(output.squeeze(0))
            return prediction, (hidden, cell)
        else:
            output, hidden = self.rnn(embedded, hidden)
            prediction = self.fc_out(output.squeeze(0))
            return prediction, hidden
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src shape: (src_len, batch)
        # trg shape: (trg_len, batch)

        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.fc_out.out_features

        # tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)

        # encoder forward pass
        if self.encoder.cell_type == 'LSTM':
            encoder_outputs, (hidden, cell) = self.encoder(src)
        else:
            encoder_outputs, hidden = self.encoder(src)
            cell = None  # no cell in RNN/GRU

        # first input to the decoder is the <sos> token
        input = trg[0, :]

        for t in range(1, trg_len):
            if self.encoder.cell_type == 'LSTM':
                output, (hidden, cell) = self.decoder(input, (hidden, cell))
            else:
                output, hidden = self.decoder(input, hidden)

            outputs[t] = output

            teacher_force = torch.rand(1).item() < teacher_forcing_ratio
            top1 = output.argmax(1)

            input = trg[t] if teacher_force else top1

        return outputs
    def beam_search(self, src, sos_idx, eos_idx, max_len=30, beam_width=3):
        # src: (src_len, batch=1) - typically batch=1 for beam search inference
        self.eval()

        with torch.no_grad():
            # Encode source sentence
            if self.encoder.cell_type == 'LSTM':
                encoder_outputs, (hidden, cell) = self.encoder(src)
            else:
                encoder_outputs, hidden = self.encoder(src)
                cell = None

            # Initialize beam with sequences, scores, and hidden states
            sequences = [[ [sos_idx], 0.0, hidden, cell ]]  # list of [sequence, score, hidden, cell]

            for _ in range(max_len):
                all_candidates = []
                # Expand each sequence in the beam
                for seq, score, hidden_state, cell_state in sequences:
                    # If last token is EOS, add sequence as is
                    if seq[-1] == eos_idx:
                        all_candidates.append((seq, score, hidden_state, cell_state))
                        continue

                    input_token = torch.LongTensor([seq[-1]]).to(self.device)
                    if self.encoder.cell_type == 'LSTM':
                        output, (hidden_new, cell_new) = self.decoder(input_token, (hidden_state, cell_state))
                    else:
                        output, hidden_new = self.decoder(input_token, hidden_state)
                        cell_new = None

                    # Get log probabilities
                    log_probs = F.log_softmax(output, dim=1).squeeze(0)  # (vocab_size,)

                    # Get top beam_width tokens
                    top_log_probs, top_indices = torch.topk(log_probs, beam_width)

                    for i in range(beam_width):
                        candidate_seq = seq + [top_indices[i].item()]
                        candidate_score = score + top_log_probs[i].item()
                        all_candidates.append((candidate_seq, candidate_score, hidden_new, cell_new))

                # Order all candidates by score and select top beam_width
                ordered = sorted(all_candidates, key=lambda tup: tup[1], reverse=True)
                sequences = ordered[:beam_width]

                # Optional: break early if all sequences end with EOS
                if all(seq[-1] == eos_idx for seq, _, _, _ in sequences):
                    break

            # Return the highest scoring sequence
            best_seq = sequences[0][0]
            return best_seq


In [112]:
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

INPUT_DIM = len(english_chars) + 2  # +2 for the padding index (0) and the start token
OUTPUT_DIM = len(malayalam_chars) + 2

ENC_EMB_DIM = 64
DEC_EMB_DIM = 64
HID_DIM = 128
ENC_LAYERS = 2
DEC_LAYERS = 2  # must equal ENC_LAYERS: the encoder state initialises the decoder
ENC_DROPOUT = 0.3
DEC_DROPOUT = 0.3
CELL_TYPE = 'GRU'  # 'LSTM', 'GRU' or 'RNN'

encoder = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, ENC_LAYERS, ENC_DROPOUT, CELL_TYPE).to(device)
decoder = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, DEC_LAYERS, DEC_DROPOUT, CELL_TYPE).to(device)

model = Seq2Seq(encoder, decoder, device).to(device)

PAD_IDX = 0  # word2vec pads with index 0

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = optim.Adam(model.parameters(), lr=0.001)
BATCH_SIZE = 64
CLIP = 1  # maximum gradient norm used for clipping

# Assumption: the dataset objects are not defined in the cells shown here, so they
# are built from the index matrices above as (source, target) pairs.
train_dataset = TensorDataset(english_matrix, malayalam_matrix)
val_dataset = TensorDataset(english_matrix_val, malayalam_matrix_val)
test_dataset = TensorDataset(english_matrix_test, malayalam_matrix_test)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE)


In [113]:
def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    epoch_acc = 0
    total_tokens = 0

    for src, trg in iterator:
        src = src.transpose(0, 1).to(device)  # (seq_len, batch)
        trg = trg.transpose(0, 1).to(device)  # (seq_len, batch)

        optimizer.zero_grad()

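        # Seq2Seq.forward stores the prediction for trg[t] in output[t] and leaves
        # output[0] as zeros, so pass the full target and drop position 0 afterwards.
        output = model(src, trg)  # (seq_len, batch, output_dim)

        output_dim = output.shape[-1]
        output = output[1:].reshape(-1, output_dim)  # ( (seq_len-1)*batch, output_dim )
        trg_y = trg[1:, :].reshape(-1)               # ( (seq_len-1)*batch )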

        loss = criterion(output, trg_y)
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()

        epoch_loss += loss.item()

        # Calculate accuracy
        preds = output.argmax(1)                # predicted tokens
        non_pad_mask = trg_y != 0               # ignore padding
        correct = (preds == trg_y) & non_pad_mask

        epoch_acc += correct.sum().item()
        total_tokens += non_pad_mask.sum().item()

    accuracy = epoch_acc / total_tokens if total_tokens > 0 else 0
    return epoch_loss / len(iterator), accuracy
def evaluate(model, iterator, criterion):
    model.eval()

    epoch_loss = 0
    epoch_acc = 0
    total_tokens = 0

    with torch.no_grad():
        for src, trg in iterator:
            src = src.transpose(0, 1).to(model.device)  # (seq_len, batch)
            trg = trg.transpose(0, 1).to(model.device)

            output = model(src, trg, 0)  # turn off teacher forcing

            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)  # remove <sos>, flatten
            trg_y = trg[1:].reshape(-1)               # remove <sos>, flatten

            loss = criterion(output, trg_y)
            epoch_loss += loss.item()

            preds = output.argmax(1)
            non_pad_mask = trg_y != 0
            correct = (preds == trg_y) & non_pad_mask

            epoch_acc += correct.sum().item()
            total_tokens += non_pad_mask.sum().item()

    accuracy = epoch_acc / total_tokens if total_tokens > 0 else 0
    return epoch_loss / len(iterator), accuracy



In [None]:
train_loss, train_acc = train(model, train_loader, optimizer, criterion, CLIP)

valid_loss, valid_acc = evaluate(model, val_loader, criterion)

print(f"Train Loss: {train_loss:.3f} | Train Acc: {train_acc:.2%}")
print(f"Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc:.2%}")

In [103]:
sweep_config = {
  "method": "bayes",
  "metric": {"name": "val_accuracy", "goal": "maximize"},
  "parameters": {
    "embedding_size": {"values": [32, 64, 128]},
    "hidden_size": {"values": [64, 128]},
    "num_encoder_layers": {"values": [1, 2]},
    "num_decoder_layers": {"values": [1, 2]},
    "cell_type": {"values": ["RNN", "GRU", "LSTM"]},
    "dropout": {"values": [0.2, 0.3]},
    "beam_size": {"values": [1, 3, 5]},
    "learning_rate": {"values": [1e-3, 5e-4]}
  }
}


In [None]:
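import wandb

# wandb.init() is not called here: each sweep run should start its own run inside
# train_function, which is assumed to be defined elsewhere (it is not shown in
# these cells).
sweep_id = wandb.sweep(sweep_config, project="DL-Assignment3")
wandb.agent(sweep_id, function=train_function)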




# Question 3 (15 Marks)
Based on the above plots write down some insightful observations. For example,
- RNN based model takes longer time to converge than GRU or LSTM
- using smaller sizes for the hidden layer does not give good results
- dropout leads to better performance

(Note: I don't know if any of the above statements is true. I just wrote some random comments that came to my mind)

Of course, each inference should be backed by appropriate evidence.



# Question 4 (10 Marks)

You will now apply your best model on the test data (You shouldn't have used test data so far. All the above experiments should have been done using train and val data only).

(a) Use the best model from your sweep and report the accuracy on the test set (the output is correct only if it exactly matches the reference output).

(b) Provide sample inputs from the test data and predictions made by your best model (more marks for presenting this grid creatively). Also upload all the predictions on the test set in a folder **predictions_vanilla** on your github project.

(c) Comment on the errors made by your model (simple insightful bullet points)

- The model makes more errors on consonants than vowels
- The model makes more errors on longer sequences
- I am thinking confusion matrix but maybe it's just me!
- ...
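
For part (a), the exact-match criterion can be sketched as follows. This is a minimal sketch, assuming predictions and references are lists of character indices in which index 0 marks both padding and the end of the word (as in the tokenizer above); the helper names `strip_sequence` and `exact_match_accuracy` are hypothetical.

```python
def strip_sequence(indices, sos_idx, end_idx=0):
    """Drop start tokens and cut the sequence at the first end/padding index."""
    out = []
    for idx in indices:
        if idx == sos_idx:
            continue
        if idx == end_idx:
            break
        out.append(idx)
    return out


def exact_match_accuracy(predictions, references, sos_idx, end_idx=0):
    """Fraction of words whose predicted index sequence equals the reference exactly."""
    if not references:
        return 0.0
    correct = sum(
        strip_sequence(p, sos_idx, end_idx) == strip_sequence(r, sos_idx, end_idx)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


refs = [[71, 3, 1, 0, 0], [71, 18, 20, 63, 0]]
preds = [[71, 3, 1, 0, 0], [71, 18, 20, 45, 0]]  # second word differs in one character
print(exact_match_accuracy(preds, refs, sos_idx=71))  # 0.5
```

Unlike the token-level accuracy used during training, a single wrong character makes the whole word count as incorrect, so this number is usually noticeably lower.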



# Question 5 (20 Marks)

Now add an attention network to your basic sequence to sequence model and train the model again. For the sake of simplicity you can use a single layered encoder and a single layered decoder (if you want you can use multiple layers also). Please answer the following questions:

(a) Did you tune the hyperparameters again? If yes please paste appropriate plots below.

(b) Evaluate your best model on the test set and report the accuracy. Also upload all the predictions on the test set in a folder **predictions_attention** on your github project.

(c) Does the attention based model perform better than the vanilla model? If so, can you check some of the errors that this model corrected and note down your inferences (i.e., outputs which were predicted incorrectly by your best seq2seq model are predicted correctly by this model)

(d) In a 3 x 3 grid paste the attention heatmaps for 10 inputs from your test data (read up on what are attention heatmaps).
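
An attention heatmap plots, for each output character (row), the attention weights over the input characters (columns). The sketch below shows where one such row comes from, assuming simple dot-product scoring; the assignment does not prescribe a scoring function, and additive (Bahdanau-style) attention is an equally valid choice. The function name and the toy vectors are made up for illustration.

```python
import math


def attention(decoder_state, encoder_states):
    """Dot-product attention for one decoder step (illustrative, plain Python).

    Returns the attention weights over input positions and the context vector.
    """
    # Score each encoder state by its dot product with the decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    # Softmax, shifted by the maximum score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Context vector: weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one vector per input character
weights, context = attention([2.0, 0.0], encoder_states)
print([round(w, 3) for w in weights])  # one row of the heatmap
```

Collecting the `weights` list at every decoding step and stacking the rows gives the matrix that is drawn as the heatmap for one input word.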



# Question 6 (20 Marks)

This is a challenge question and most of you will find it hard.

I like the visualisation in the figure captioned "Connectivity" in this [article](https://distill.pub/2019/memorization-in-rnns/#appendix-autocomplete). Make a similar visualisation for your model. Please look at this [blog](https://towardsdatascience.com/visualising-lstm-activations-in-keras-b50206da96ff) for some starter code. The goal is to figure out the following: When the model is decoding the $i$-th character in the output which is the input character that it is looking at?

Have fun!




# Question 7 (10 Marks)
Paste a link to your github code for Part A

Example: https://github.com/&lt;user-id&gt;/da6401_assignment3/partA

- We will check for coding style, clarity in using functions and a README file with clear instructions on training and evaluating the model (the 10 marks will be based on this).

- We will also run a plagiarism check to ensure that the code is not copied (0 marks in the assignment if we find that the code is plagiarised).

- We will check the number of commits made by the two team members and then give marks accordingly. For example, if we see 70% of the commits were made by one team member then that member will get more marks in the assignment (**note that this contribution will decide the marks split for the entire assignment and not just this question**).

- We will also check if the training and test splits have been used properly. You will get 0 marks on the assignment if we find any cheating (e.g., adding test data to training data) to get higher accuracy.




# Question 8 (0 Marks)

Note that this question does not carry any marks and will not be graded. This is only for students who are looking for a challenge and want to get something more out of the course.

Your task is to finetune the GPT2 model to generate lyrics for English songs. You can refer to [this blog](https://towardsdatascience.com/natural-language-generation-part-2-gpt-2-and-huggingface-f3acb35bc86a) and follow the steps there. This blog shows how to finetune the GPT2 model to generate headlines for financial articles. Instead of headlines you will use lyrics so you may find the following datasets useful for training: [dataset1](https://data.world/datasets/lyrics), [dataset2](https://www.kaggle.com/paultimothymooney/poetry)

At test time you will give it a prompt: "I love Deep Learning" and it should complete the song based on this prompt :-) Paste the generated song in a block below!

### Self Declaration



I, Name_XXX (Roll no: XXYY), swear on my honour that I have written the code and the report by myself and have not copied it from the internet or other students.



