<a href="https://colab.research.google.com/github/sankarvinayak/DL-assignment-3/blob/main/DL_Assignment_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DA6401 Assignment 3
Use recurrent neural networks to build a transliteration system.

# Instructions
- The goal of this assignment is fourfold: (i) learn how to model sequence-to-sequence learning problems using recurrent neural networks, (ii) compare different cells such as vanilla RNN, LSTM and GRU, (iii) understand how attention networks overcome the limitations of vanilla seq2seq models, and (iv) visualise the interactions between different components in an RNN-based model.
- We strongly recommend that you work on this assignment in a team of size 2. Both members of the team are expected to work together (in a subsequent viva both members will be expected to answer questions, explain the code, etc.).
- Collaborations and discussions with other groups are strictly prohibited.
- You must use Python (numpy and pandas) for your implementation.
- You can use any and all packages from Keras, PyTorch and TensorFlow.
- You can run the code in a Jupyter notebook on Colab by enabling GPUs.
- You have to generate the report in the same format as shown below using wandb.ai. You can start by cloning this report using the clone option above. Most of the plots that we have asked for below can be (automatically) generated using the APIs provided by wandb.ai. You will upload a link to this report on Gradescope.
- You also need to provide a link to your GitHub code as shown below. Follow good software engineering practices and set up a GitHub repo for the project on Day 1. Please do not write all code on your local machine and push everything to GitHub on the last day. The commits in GitHub should reflect how the code has evolved during the course of the assignment.
- You have to check Moodle regularly for updates regarding the assignment.



# Problem Statement

In this assignment you will experiment with the [Dakshina dataset](https://github.com/google-research-datasets/dakshina) released by Google. This dataset contains pairs of the following form:

| $x$ | $y$ |
| --- | --- |
| ajanabee | अजनबी |

i.e., a word in the native script and its corresponding transliteration in the Latin script (the way we type while chatting with our friends on WhatsApp, etc.). Given many such pairs $\{(x_i, y_i)\}_{i=1}^n$, your goal is to train a model $y = \hat{f}(x)$ which takes as input a romanized string (ghar) and produces the corresponding word in Devanagari (घर).

As you would realise, this is the problem of mapping a sequence of characters in one language to a sequence of characters in another language. Notice that this is a scaled-down version of the problem of translation, where the goal is to translate a sequence of **words** in one language to a sequence of words in another language (as opposed to a sequence of **characters** here).

Read these blogs to understand how to build neural sequence to sequence models: [blog1](https://keras.io/examples/nlp/lstm_seq2seq/), [blog2](https://machinelearningmastery.com/define-encoder-decoder-sequence-sequence-model-neural-machine-translation-keras/)
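
To make the character-level framing concrete, the sketch below turns one (romanized, native) pair into integer sequences. The helper names, the `<pad>`, `<s>` and `</s>` symbols, and the way ids are assigned are illustrative assumptions; they are not part of the dataset or of any library.

```python
# Illustrative sketch: map characters to integer ids, reserving 0 for padding.
PAD, SOS, EOS = "<pad>", "<s>", "</s>"  # hypothetical special symbols


def build_vocab(words):
    """Assign an id to every character seen in `words`; ids 0-2 are reserved."""
    chars = sorted(set("".join(words)))
    itos = [PAD, SOS, EOS] + chars
    stoi = {ch: i for i, ch in enumerate(itos)}
    return stoi, itos


def encode(word, stoi, add_markers=False):
    """Convert a word to ids, optionally wrapping it in <s> ... </s>."""
    ids = [stoi[ch] for ch in word]
    return [stoi[SOS]] + ids + [stoi[EOS]] if add_markers else ids


def decode(ids, itos):
    """Convert ids back to text, skipping the special symbols."""
    return "".join(itos[i] for i in ids if itos[i] not in (PAD, SOS, EOS))


src_stoi, src_itos = build_vocab(["ghar"])
trg_stoi, trg_itos = build_vocab(["घर"])

x = encode("ghar", src_stoi)                   # encoder input
y = encode("घर", trg_stoi, add_markers=True)   # decoder target with <s> ... </s>
print(x, y, decode(y, trg_itos))
```

The encoder consumes `x` one id at a time, and the decoder is trained to emit `y` shifted by one position, so that `<s>` is its first input and `</s>` its last output.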




In [None]:
!wget https://storage.googleapis.com/gresearch/dakshina/dakshina_dataset_v1.0.tar
!tar -xf dakshina_dataset_v1.0.tar
!cp -r dakshina_dataset_v1.0/ml .
!rm -rf dakshina_dataset_v1.0
!rm -r dakshina_dataset_v1.0.tar
!cp ml/lexicons/* .

--2025-05-16 06:52:16--  https://storage.googleapis.com/gresearch/dakshina/dakshina_dataset_v1.0.tar
Resolving storage.googleapis.com (storage.googleapis.com)... 192.178.218.207, 172.253.62.207, 172.253.115.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|192.178.218.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2008340480 (1.9G) [application/x-tar]
Saving to: ‘dakshina_dataset_v1.0.tar’


2025-05-16 06:52:39 (85.3 MB/s) - ‘dakshina_dataset_v1.0.tar’ saved [2008340480/2008340480]



In [None]:
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tqdm import tqdm

# Paths to your data files (adjust if needed)
path = "/content/ml/lexicons/ml.translit.sampled.train.tsv"
path_val = "/content/ml/lexicons/ml.translit.sampled.dev.tsv"
path_test = "/content/ml/lexicons/ml.translit.sampled.test.tsv"

# Read the files
df = pd.read_csv(path, sep='\t', header=None)
df_val = pd.read_csv(path_val, sep='\t', header=None)
df_test = pd.read_csv(path_test, sep='\t', header=None)

# Split into native (Malayalam) and romanized (English) columns
malayalam_words = df[0]
english_words = df[1]

malayalam_words_val = df_val[0]
english_words_val = df_val[1]

malayalam_words_test = df_test[0]
english_words_test = df_test[1]



In [None]:
malayalam_words,english_words

(0             അം
 1            അംഗ
 2            അംഗ
 3           അംഗം
 4           അംഗം
           ...   
 58377       ഹൗസ്
 58378       ഹർജി
 58379       ഹർജി
 58380    ഹർജിയിൽ
 58381    ഹർജിയിൽ
 Name: 0, Length: 58382, dtype: object,
 0              am
 1            amga
 2            anga
 3           amgam
 4           angam
            ...   
 58377       house
 58378       harje
 58379       harji
 58380    harjeyil
 58381    harjiyil
 Name: 1, Length: 58382, dtype: object)

In [None]:

english_words = english_words.dropna()
malayalam_words = malayalam_words.dropna()

english_words = english_words.astype(str)
malayalam_words = malayalam_words.astype(str)

english_chars = sorted(set("".join(english_words)))
malayalam_chars = sorted(set("".join(malayalam_words)))

max_len_eng = max(len(w) for w in pd.concat([english_words, english_words_val, english_words_test]).dropna().astype(str))
max_len_mal = max(len(w) for w in pd.concat([malayalam_words, malayalam_words_val, malayalam_words_test]).dropna().astype(str))

print("English characters:", english_chars)
print("Malayalam characters:", malayalam_chars)
print("Max English word length:", max_len_eng)
print("Max Malayalam word length:", max_len_mal)


English characters: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Malayalam characters: ['ം', 'ഃ', 'അ', 'ആ', 'ഇ', 'ഈ', 'ഉ', 'ഊ', 'ഋ', 'എ', 'ഏ', 'ഐ', 'ഒ', 'ഓ', 'ഔ', 'ക', 'ഖ', 'ഗ', 'ഘ', 'ങ', 'ച', 'ഛ', 'ജ', 'ഝ', 'ഞ', 'ട', 'ഠ', 'ഡ', 'ഢ', 'ണ', 'ത', 'ഥ', 'ദ', 'ധ', 'ന', 'പ', 'ഫ', 'ബ', 'ഭ', 'മ', 'യ', 'ര', 'റ', 'ല', 'ള', 'ഴ', 'വ', 'ശ', 'ഷ', 'സ', 'ഹ', 'ാ', 'ി', 'ീ', 'ു', 'ൂ', 'ൃ', 'െ', 'േ', 'ൈ', 'ൊ', 'ോ', '്', 'ൗ', 'ൺ', 'ൻ', 'ർ', 'ൽ', 'ൾ', '\u200c']
Max English word length: 32
Max Malayalam word length: 31
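
One likely reason the `dropna()` calls are needed is that `pd.read_csv` by default converts certain strings (for example `nan`, `null` or `NA`) to missing values, so a romanization spelled exactly that way would be silently dropped. Whether the Malayalam lexicon actually contains such entries is not verified here. The sketch below only illustrates the effect on made-up rows, using the standard `csv` module, which keeps every field as a plain string; in pandas, passing `keep_default_na=False` to `read_csv` is the analogous option.

```python
import csv
import io

# Made-up rows in the three-column lexicon layout: native word, romanization, count.
sample_tsv = "നാൻ\tnan\t2\nഅംഗം\tangam\t3\n"

rows = list(csv.reader(io.StringIO(sample_tsv), delimiter="\t"))
romanized = [row[1] for row in rows]
print(romanized)  # the string "nan" is kept as ordinary text
```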



# Question 1 (15 Marks)
Build an RNN-based seq2seq model which contains the following layers: (i) input layer for character embeddings (ii) one encoder RNN which sequentially encodes the input character sequence (Latin) (iii) one decoder RNN which takes the last state of the encoder as input and produces one output character at a time (Devanagari).

The code should be flexible such that the dimension of the input character embeddings, the hidden states of the encoders and decoders, the cell (RNN, LSTM, GRU) and the number of layers in the encoder and decoder can be changed.

(a) What is the total number of computations done by your network? (assume that the input embedding size is $m$, encoder and decoder have 1 layer each, the hidden cell state is $k$ for both the encoder and decoder, the length of the input and output sequence is the same, i.e., $T$, the size of the vocabulary is the same for the source and target language, i.e., $V$)

(b) What is the total number of parameters in your network? (assume that the input embedding size is $m$, encoder and decoder have 1 layer each, the hidden cell state is $k$ for both the encoder and decoder and the length of the input and output sequence is the same, i.e., $T$, the size of the vocabulary is the same for the source and target language, i.e., $V$)
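
For part (b), one possible way to organise the count is sketched below. It assumes separate source and target embeddings ($V \times m$ each), no attention, and a linear output layer from $k$ to $V$ with a bias. A cell with $g$ gate blocks (1 for a vanilla RNN, 3 for a GRU, 4 for an LSTM) then contributes $g\,(km + k^2 + b\,k)$ parameters, where $b$ is the number of bias vectors per block (1 in the textbook formulation, 2 in PyTorch's `nn.RNN`, `nn.GRU` and `nn.LSTM`). The function name and arguments are hypothetical. Note that $T$ does not appear, because the weights are shared across time steps.

```python
GATES = {"rnn": 1, "gru": 3, "lstm": 4}  # gate blocks per cell type


def seq2seq_param_count(m, k, V, cell="rnn", n_bias=1):
    """Parameter count for a 1-layer encoder-decoder without attention."""
    g = GATES[cell]
    embedding = V * m                              # one embedding table
    recurrent = g * (k * m + k * k + n_bias * k)   # one recurrent layer
    output = k * V + V                             # final projection to the vocabulary
    return 2 * embedding + 2 * recurrent + output


print(seq2seq_param_count(m=16, k=32, V=50))                         # vanilla RNN, one bias
print(seq2seq_param_count(m=16, k=32, V=50, cell="lstm", n_bias=2))  # LSTM, PyTorch-style biases
```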



# Question 2 (10 Marks)

You will now train your model using any one language from the [Dakshina dataset](https://github.com/google-research-datasets/dakshina) (I would suggest picking a language that you can read so that it is easy to analyse the errors). Use the standard train, dev and test sets from the folder dakshina_dataset_v1.0/hi/lexicons/ (replace hi with the language of your choice).

Using the sweep feature in wandb find the best hyperparameter configuration. Here are some suggestions but you are free to decide which hyperparameters you want to explore

- input embedding size: 16, 32, 64, 256, ...
- number of encoder layers: 1, 2, 3
- number of decoder layers: 1, 2, 3
- hidden layer size: 16, 32, 64, 256, ...
- cell type: RNN, GRU, LSTM
- dropout: 20%, 30% (btw, where will you add dropout? you should read up a bit on this)
- beam search in decoder with different beam sizes:

Based on your sweep please paste the following plots which are automatically generated by wandb:
- accuracy v/s created plot (I would like to see the number of experiments you ran to get the best configuration).
- parallel co-ordinates plot
- correlation summary table (to see the correlation of each hyperparameter with the loss/accuracy)

Also write down the hyperparameters and their values that you swept over. Smart strategies to reduce the number of runs while still achieving a high accuracy would be appreciated. Write down any unique strategy that you tried for efficiently searching the hyperparameters.
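
The beam search item in the list above can be illustrated independently of any particular network. The sketch below is a minimal version that assumes a hypothetical `step_fn(prefix)` callback returning next-token log-probabilities; in a real model this would wrap one decoder step. It applies no length normalisation, which practical implementations often add.

```python
import math


def beam_search(step_fn, sos, eos, beam_size=3, max_len=10):
    """Return (sequence, log_prob) of the best hypothesis found by beam search.

    step_fn(prefix) must return a dict {token: log_probability} for the next
    token given the tuple `prefix`; it stands in for one decoder step.
    """
    beams = [((sos,), 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_fn(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            # Hypotheses that emitted <eos> leave the beam and are stored.
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda item: item[1])


# Toy distribution (made up) in which the greedy first choice is not optimal.
TOY = {
    ("<s>",): {"a": 0.6, "b": 0.4},
    ("<s>", "a"): {"x": 0.5, "y": 0.5},
    ("<s>", "b"): {"x": 0.9, "y": 0.1},
}


def toy_step(prefix):
    probs = TOY.get(prefix, {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


greedy, _ = beam_search(toy_step, "<s>", "</s>", beam_size=1)
wide, _ = beam_search(toy_step, "<s>", "</s>", beam_size=2)
print(greedy, wide)
```

With `beam_size=1` the search reduces to greedy decoding and commits to `a`; with `beam_size=2` it keeps `b` alive long enough to find the more probable sequence.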

In [None]:
class Attention(nn.Module):
    def __init__(self, enc_hidden_dim, dec_hidden_dim):
        super().__init__()
        self.attn = nn.Linear(enc_hidden_dim + dec_hidden_dim, dec_hidden_dim)
        self.v = nn.Linear(dec_hidden_dim, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        src_len = encoder_outputs.size(1)
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        attention = self.v(energy).squeeze(2)
        return torch.softmax(attention, dim=1)
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hidden_dim, n_layers, cell_type='lstm', dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        rnn_cls = {'lstm': nn.LSTM, 'gru': nn.GRU, 'rnn': nn.RNN}[cell_type.lower()]
        self.rnn = rnn_cls(emb_dim, hidden_dim, num_layers=n_layers,
                           dropout=dropout if n_layers > 1 else 0, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.cell_type = cell_type.lower()
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim

    def forward(self, src):
        embedded = self.dropout(self.embedding(src))
        outputs, hidden = self.rnn(embedded)
        return outputs, hidden
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hidden_dim, dec_hidden_dim, n_layers, cell_type='lstm', dropout=0.0, use_attention=True):
        super().__init__()
        self.embedding = nn.Embedding(output_dim, emb_dim)
        rnn_cls = {'lstm': nn.LSTM, 'gru': nn.GRU, 'rnn': nn.RNN}[cell_type.lower()]
        rnn_input_dim = emb_dim + enc_hidden_dim if use_attention else emb_dim
        self.rnn = rnn_cls(rnn_input_dim, dec_hidden_dim, num_layers=n_layers,
                           dropout=dropout if n_layers > 1 else 0, batch_first=True)
        self.use_attention = use_attention
        if use_attention:
            self.attention = Attention(enc_hidden_dim, dec_hidden_dim)
        self.fc_out = nn.Linear(dec_hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        self.cell_type = cell_type.lower()
        self.n_layers = n_layers
        self.dec_hidden_dim = dec_hidden_dim

    def forward(self, trg, hidden, encoder_outputs, teacher_forcing=True):
        batch_size = trg.size(0)
        trg_len = trg.size(1)
        outputs = []

        input_t = trg[:, 0]  # Start with <sos> token

        if self.cell_type == 'lstm':
            h, c = hidden
        else:
            h = hidden
            c = None

        for t in range(1, trg_len):
            emb_t = self.dropout(self.embedding(input_t)).unsqueeze(1)
            hidden_t = h[-1]

            if self.use_attention:
                attn_weights = self.attention(hidden_t, encoder_outputs)
                context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs)
                rnn_input = torch.cat((emb_t, context), dim=2)
            else:
                rnn_input = emb_t

            if self.cell_type == 'lstm':
                output, (h, c) = self.rnn(rnn_input, (h, c))
            else:
                output, h = self.rnn(rnn_input, h)

            pred = self.fc_out(output.squeeze(1))
            outputs.append(pred.unsqueeze(1))

            input_t = trg[:, t] if teacher_forcing else pred.argmax(1)

        outputs = torch.cat(outputs, dim=1)
        return outputs, (h, c) if self.cell_type == 'lstm' else h
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

        self.enc_n_layers = encoder.n_layers
        self.dec_n_layers = decoder.n_layers
        self.enc_hidden_dim = encoder.hidden_dim
        self.dec_hidden_dim = decoder.dec_hidden_dim
        self.cell_type = encoder.cell_type
        self.different_dims = self.enc_hidden_dim != self.dec_hidden_dim

        if self.different_dims:
            self.hidden_projection = nn.Linear(self.enc_hidden_dim, self.dec_hidden_dim)
            if self.cell_type == 'lstm':
                self.cell_projection = nn.Linear(self.enc_hidden_dim, self.dec_hidden_dim)

    def _adapt_hidden_state(self, encoder_hidden):
        if self.cell_type == 'lstm':
            h, c = encoder_hidden
            if self.enc_n_layers != self.dec_n_layers:
                if self.enc_n_layers > self.dec_n_layers:
                    h = h[-self.dec_n_layers:]
                    c = c[-self.dec_n_layers:]
                else:
                    h = torch.cat([h] + [h[-1:].clone()] * (self.dec_n_layers - self.enc_n_layers), dim=0)
                    c = torch.cat([c] + [c[-1:].clone()] * (self.dec_n_layers - self.enc_n_layers), dim=0)
            if self.different_dims:
                h = self.hidden_projection(h)
                c = self.cell_projection(c)
            return (h, c)
        else:
            h = encoder_hidden
            if self.enc_n_layers != self.dec_n_layers:
                if self.enc_n_layers > self.dec_n_layers:
                    h = h[-self.dec_n_layers:]
                else:
                    h = torch.cat([h] + [h[-1:].clone()] * (self.dec_n_layers - self.enc_n_layers), dim=0)
            if self.different_dims:
                h = self.hidden_projection(h)
            return h

    def forward(self, src, trg, teacher_forcing=True):
        encoder_outputs, encoder_hidden = self.encoder(src)
        decoder_hidden = self._adapt_hidden_state(encoder_hidden)
        outputs, _ = self.decoder(trg, decoder_hidden, encoder_outputs, teacher_forcing=teacher_forcing)
        return outputs
def calculate_sequence_accuracy(preds, trg):
    pred_tokens = preds.argmax(-1)
    match = ((pred_tokens == trg[:, 1:]) | (trg[:, 1:] == 0)).all(dim=1)
    return match.float().mean()
from tqdm import tqdm

def train(model, iterator, optimizer, criterion, device):
    model.train()
    epoch_loss = 0
    epoch_acc = 0
    pbar = tqdm(iterator, desc="Training", leave=False)
    for src, trg in pbar:
        src, trg = src.to(device), trg.to(device)
        optimizer.zero_grad()
        output = model(src, trg, teacher_forcing=True)
        loss = criterion(output.view(-1, output.shape[-1]), trg[:, 1:].reshape(-1))
        acc = calculate_sequence_accuracy(output, trg)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        pbar.set_postfix(loss=loss.item(), acc=acc.item())
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion, device):
    model.eval()
    epoch_loss = 0
    epoch_acc = 0
    with torch.no_grad():
        for src, trg in iterator:
            src, trg = src.to(device), trg.to(device)
            output = model(src, trg, teacher_forcing=False)
            loss = criterion(output.view(-1, output.shape[-1]), trg[:, 1:].reshape(-1))
            acc = calculate_sequence_accuracy(output, trg)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


In [None]:

def load_english_data():
    train_path = "ml.translit.sampled.train.tsv"
    dev_path = "ml.translit.sampled.dev.tsv"
    test_path = "ml.translit.sampled.test.tsv"

    train_data = pd.read_csv(train_path, sep='\t', header=None, names=['malayalam', 'english', 'count'])
    dev_data = pd.read_csv(dev_path, sep='\t', header=None, names=['malayalam', 'english', 'count'])
    test_data = pd.read_csv(test_path, sep='\t', header=None, names=['malayalam', 'english', 'count'])

    return train_data, dev_data, test_data

def preprocess_data(train_data, dev_data, test_data, max_sequence_length=20):
    train_data = train_data.dropna()
    dev_data = dev_data.dropna()
    test_data = test_data.dropna()

    all_src = pd.concat([train_data['english'], dev_data['english'], test_data['english']])
    all_trg = pd.concat([train_data['malayalam'], dev_data['malayalam'], test_data['malayalam']])

    src_tokenizer = Tokenizer(char_level=True, lower=False)
    src_tokenizer.fit_on_texts(all_src)

    trg_tokenizer = Tokenizer(char_level=True, lower=False)
    trg_tokenizer.fit_on_texts(all_trg)

    # Add SOS and EOS tokens
    sos_token = '<s>'
    eos_token = '</s>'
    trg_tokenizer.word_index[sos_token] = len(trg_tokenizer.word_index) + 1
    trg_tokenizer.word_index[eos_token] = len(trg_tokenizer.word_index) + 1

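    def process_sequences(texts, tokenizer, max_len):
        seq = tokenizer.texts_to_sequences(texts)
        # truncating='post' keeps the start of words longer than max_len
        # (the default, 'pre', would drop their first characters)
        return pad_sequences(seq, maxlen=max_len, padding='post', truncating='post')

    def process_target_sequences(texts, tokenizer, max_len):
        sos = tokenizer.word_index[sos_token]
        eos = tokenizer.word_index[eos_token]
        seq = tokenizer.texts_to_sequences(texts)
        seq = [[sos] + s + [eos] for s in seq]
        # truncating='post' guarantees that position 0 is always <s>, which the
        # decoder relies on; over-long targets lose their tail (including </s>)
        return pad_sequences(seq, maxlen=max_len+2, padding='post', truncating='post')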

    X_train = process_sequences(train_data['english'], src_tokenizer, max_sequence_length)
    y_train = process_target_sequences(train_data['malayalam'], trg_tokenizer, max_sequence_length)

    X_dev = process_sequences(dev_data['english'], src_tokenizer, max_sequence_length)
    y_dev = process_target_sequences(dev_data['malayalam'], trg_tokenizer, max_sequence_length)

    X_test = process_sequences(test_data['english'], src_tokenizer, max_sequence_length)
    y_test = process_target_sequences(test_data['malayalam'], trg_tokenizer, max_sequence_length)

    return X_train, y_train, X_dev, y_dev, X_test, y_test, src_tokenizer, trg_tokenizer

# -------------------------------
# 3. Dataset
# -------------------------------
class Seq2SeqDataset(Dataset):
    def __init__(self, src, trg):
        self.src = torch.LongTensor(src)
        self.trg = torch.LongTensor(trg)

    def __len__(self):
        return len(self.src)

    def __getitem__(self, idx):
        return self.src[idx], self.trg[idx]


In [None]:
import wandb

sweep_config = {
    'method': 'bayes',  # or 'random' to sample configurations randomly
    'metric': {
        'name': 'val_loss',
        'goal': 'minimize'
    },
    'parameters': {
        'batch_size': {
            'values': [64, 128, 256, 512]
        },
        'max_sequence_length': {
            'values': [20, 30, 40]
        },
        'emb_dim': {
            'values': [16, 32, 64, 256, 512]
        },
        'hidden_dim': {
            'values': [16, 32, 64, 256, 512]
        },
        'enc_layers': {
            'values': [1, 2, 3, 4, 5]
        },
        'dec_layers': {
            'values': [1, 2, 3, 4, 5]
        },
        'cell_type': {
            'values': ['rnn', 'gru', 'lstm']
        },
        'dropout': {
            'values': [0.2, 0.3, 0.5, 0.7]
        },
        'n_epochs': {
            'values': [10, 20, 30]
        },
        'lr': {
            'values': [0.001, 0.0001]
        }
    }
}

sweep_id = wandb.sweep(sweep_config, project="en-ml-transliteration-wo_att")
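
One way to cut the number of full-length runs, in line with the question's request for smart search strategies, is to let the sweep stop unpromising runs early. The fragment below is a sketch of a sweep configuration that uses W&B's Hyperband-based `early_terminate` option. The `min_iter` value and the reduced parameter grid are illustrative assumptions, not the settings used in this notebook.

```python
# Sketch only: the values below are illustrative, not the configuration used above.
sweep_config_early_stop = {
    'method': 'bayes',
    'metric': {'name': 'val_loss', 'goal': 'minimize'},
    # Hyperband-style early termination; min_iter sets the first bracket.
    'early_terminate': {'type': 'hyperband', 'min_iter': 3},
    'parameters': {
        'hidden_dim': {'values': [64, 256, 512]},
        'cell_type': {'values': ['rnn', 'gru', 'lstm']},
        'lr': {'values': [0.001, 0.0001]},
    },
}
```

Early termination relies on the metric being logged repeatedly during a run, for example once per epoch with `wandb.log`.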


In [None]:
import wandb
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

def train_seq2seq_model_wandb(config=None):
    # Initialize wandb
    with wandb.init(config=config):
        config = wandb.config

        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        # Load and preprocess data
        train_data, dev_data, test_data = load_english_data()
        X_train, y_train, X_dev, y_dev, X_test, y_test, src_tokenizer, trg_tokenizer = preprocess_data(
            train_data, dev_data, test_data, config.max_sequence_length)

        train_loader = DataLoader(Seq2SeqDataset(X_train, y_train), batch_size=config.batch_size, shuffle=True)
        dev_loader = DataLoader(Seq2SeqDataset(X_dev, y_dev), batch_size=config.batch_size)
        test_loader = DataLoader(Seq2SeqDataset(X_test, y_test), batch_size=config.batch_size)

        input_dim = len(src_tokenizer.word_index) + 1
        output_dim = len(trg_tokenizer.word_index) + 1

        encoder = Encoder(
            input_dim=input_dim,
            emb_dim=config.emb_dim,
            hidden_dim=config.hidden_dim,
            n_layers=config.enc_layers,
            cell_type=config.cell_type,
            dropout=config.dropout
        )
        decoder = Decoder(
            output_dim=output_dim,
            emb_dim=config.emb_dim,
            enc_hidden_dim=config.hidden_dim,
            dec_hidden_dim=config.hidden_dim,
            n_layers=config.dec_layers,
            cell_type=config.cell_type,
            dropout=config.dropout,
            use_attention=False
        )

        model = Seq2Seq(encoder, decoder).to(device)
        optimizer = optim.Adam(model.parameters(), lr=config.lr)
        criterion = nn.CrossEntropyLoss(ignore_index=0)


        for epoch in range(config.n_epochs):
            print(f"Epoch {epoch+1}/{config.n_epochs}")
            train_loss, train_acc = train(model, train_loader, optimizer, criterion, device)
            valid_loss, valid_acc = evaluate(model, dev_loader, criterion, device)

            print(f"Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.4f}")
            print(f"Val Loss: {valid_loss:.4f} | Val Acc: {valid_acc:.4f}")

            wandb.log({
                'epoch': epoch + 1,
                'train_loss': train_loss,
                'train_acc': train_acc,
                'val_loss': valid_loss,
                'val_acc': valid_acc
            })
        # No explicit wandb.finish() is needed: leaving the `with wandb.init(...)` block ends the run.


In [None]:

wandb.agent(sweep_id, function=train_seq2seq_model_wandb)



# Question 4 (10 Marks)

You will now apply your best model on the test data (You shouldn't have used test data so far. All the above experiments should have been done using train and val data only).

(a) Use the best model from your sweep and report the accuracy on the test set (the output is correct only if it exactly matches the reference output).

(b) Provide sample inputs from the test data and predictions made by your best model (more marks for presenting this grid creatively). Also upload all the predictions on the test set in a folder **predictions_vanilla** on your github project.

(c) Comment on the errors made by your model (simple insightful bullet points)

- The model makes more errors on consonants than vowels
- The model makes more errors on longer sequences
- I am thinking confusion matrix but maybe it's just me!
- ...
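
As a starting point for parts (a) and (c), the sketch below computes exact-match accuracy and breaks it down by target length. It assumes predictions are available as (source, target, prediction) string triples; the function name and the sample rows are made up for illustration and are not model output.

```python
from collections import defaultdict


def exact_match_report(rows):
    """rows: iterable of (source, target, prediction) strings.

    Returns overall exact-match accuracy and accuracy grouped by target
    length, where length counts Unicode code points rather than glyphs.
    """
    total = correct = 0
    by_len = defaultdict(lambda: [0, 0])  # length -> [correct, total]
    for _, target, prediction in rows:
        hit = int(prediction == target)
        total += 1
        correct += hit
        by_len[len(target)][0] += hit
        by_len[len(target)][1] += 1
    per_length = {n: c / t for n, (c, t) in sorted(by_len.items())}
    return correct / total, per_length


# Made-up rows for illustration only.
rows = [
    ("am", "അം", "അം"),
    ("amga", "അംഗ", "അംഗ"),
    ("angam", "അംഗം", "അങ്ങം"),
    ("house", "ഹൗസ്", "ഹൗസ്"),
]
accuracy, per_length = exact_match_report(rows)
print(accuracy, per_length)
```

A falling accuracy as length grows would support the "more errors on longer sequences" hypothesis listed above.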

In [None]:
def generate_and_save_predictions(model, dataset, src_tokenizer, trg_tokenizer, out_file_path, device, batch_size=32):
    """
    Generate predictions for a dataset and save them to a TSV file.

    `model.predict(src, trg_tokenizer, device)` is expected to return a
    (batch, seq_len) tensor of target token indices.
    """
    model.eval()
    dataloader = DataLoader(dataset, batch_size=batch_size)

    # Build the index -> character lookups once instead of searching the
    # vocabulary for every token. They are derived from word_index because
    # <s> and </s> were added to word_index only.
    src_idx_to_char = {idx: ch for ch, idx in src_tokenizer.word_index.items()}
    trg_idx_to_char = {idx: ch for ch, idx in trg_tokenizer.word_index.items()}
    eos_idx = trg_tokenizer.word_index['</s>']

    with open(out_file_path, 'w', encoding='utf-8') as f:
        f.write("Source\tTarget\tPrediction\n")

        for src, trg in tqdm(dataloader, desc="Generating predictions"):
            src, trg = src.to(device), trg.to(device)
            predictions = model.predict(src, trg_tokenizer, device)

            for i in range(len(src)):
                # Source: drop padding (index 0)
                src_text = ''.join(src_idx_to_char[idx] for idx in src[i].tolist() if idx != 0)

                # Target: skip <sos>, drop padding, stop at </s>
                trg_tokens = []
                for idx in trg[i][1:].tolist():
                    if idx == eos_idx:
                        break
                    if idx != 0:
                        trg_tokens.append(trg_idx_to_char[idx])
                trg_text = ''.join(trg_tokens)

                # Prediction: stop at the first padding or </s>
                pred_tokens = []
                for idx in predictions[i].tolist():
                    if idx == 0 or idx == eos_idx:
                        break
                    pred_tokens.append(trg_idx_to_char[idx])
                pred_text = ''.join(pred_tokens)

                f.write(f"{src_text}\t{trg_text}\t{pred_text}\n")

    print(f"Predictions saved to {out_file_path}")


In [None]:
def train_and_write(
    batch_size=64,
    max_sequence_length=30,
    emb_dim=32,
    enc_hidden_dim=512,
    dec_hidden_dim=512,
    enc_layers=2,
    dec_layers=2,
    dropout=0.2,
    cell_type='gru',
    n_epochs=30,
    lr=0.0001,
    use_attention=False,
    device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')
):
    print(f"Training with attention: {use_attention}")
    train_data, dev_data, test_data = load_english_data()
    X_train, y_train, X_dev, y_dev, X_test, y_test, src_tokenizer, trg_tokenizer = preprocess_data(
        train_data, dev_data, test_data, max_sequence_length)

    train_loader = DataLoader(Seq2SeqDataset(X_train, y_train), batch_size=batch_size, shuffle=True)
    dev_loader = DataLoader(Seq2SeqDataset(X_dev, y_dev), batch_size=batch_size)
    test_loader = DataLoader(Seq2SeqDataset(X_test, y_test), batch_size=batch_size)

    input_dim = len(src_tokenizer.word_index) + 1
    output_dim = len(trg_tokenizer.word_index) + 1

    encoder = Encoder(input_dim, emb_dim, enc_hidden_dim, enc_layers, cell_type, dropout)
    decoder = Decoder(output_dim, emb_dim, enc_hidden_dim, dec_hidden_dim, dec_layers, cell_type, dropout, use_attention)

    model = Seq2Seq(encoder, decoder).to(device)

    optimizer = optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss(ignore_index=0)

    best_valid_loss = float('inf')

    for epoch in range(n_epochs):
        print(f"Epoch {epoch+1}/{n_epochs}")
        train_loss, train_acc = train(model, train_loader, optimizer, criterion, device)
        valid_loss, valid_acc = evaluate(model, dev_loader, criterion, device)
        print(f"Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.4f}")
        print(f"Val Loss: {valid_loss:.4f} | Val Acc: {valid_acc:.4f}")

        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), 'best_seq2seq_model.pt')
            print("Model saved!")

    # Restore the checkpoint with the lowest validation loss before predicting
    model.load_state_dict(torch.load('best_seq2seq_model.pt', map_location=device))

    # Generate and save predictions for test set
    print("Generating predictions for test set...")
    test_dataset = Seq2SeqDataset(X_test, y_test)
    generate_and_save_predictions(model, test_dataset, src_tokenizer, trg_tokenizer, 'test_predictions.tsv', device)

    # Generate and save predictions for dev set
    print("Generating predictions for dev set...")
    dev_dataset = Seq2SeqDataset(X_dev, y_dev)
    generate_and_save_predictions(model, dev_dataset, src_tokenizer, trg_tokenizer, 'dev_predictions.tsv', device)

    return model, src_tokenizer, trg_tokenizer


In [None]:
model, src_tokenizer, trg_tokenizer = train_and_write(
    batch_size=64,
    max_sequence_length=30,
    emb_dim=32,
    enc_hidden_dim=512,
    dec_hidden_dim=512,
    enc_layers=2,  # Encoder and decoder depths can be set independently
    dec_layers=2,
    dropout=0.2,
    cell_type='gru',
    n_epochs=30,
    lr=0.0001,
    use_attention=False,  # Vanilla seq2seq for Question 4; no attention heatmaps are produced
)



# Question 5 (20 Marks)

Now add an attention network to your basic sequence to sequence model and train the model again. For the sake of simplicity you can use a single layered encoder and a single layered decoder (if you want you can use multiple layers also). Please answer the following questions:

(a) Did you tune the hyperparameters again? If yes please paste appropriate plots below.

(b) Evaluate your best model on the test set and report the accuracy. Also upload all the predictions on the test set in a folder **predictions_attention** on your github project.

(c) Does the attention based model perform better than the vanilla model? If so, can you check some of the errors that this model corrected and note down your inferences (i.e., outputs which were predicted incorrectly by your best seq2seq model are predicted correctly by this model)

(d) In a 3 x 3 grid paste the attention heatmaps for 10 inputs from your test data (read up on what are attention heatmaps).
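
For (c), one way to find the errors that the attention model corrected is to compare the two prediction files row by row. The sketch below assumes both files are tab-separated with the header `Source`, `Target`, `Prediction`, as written by `generate_and_save_predictions`. The helper names and the in-memory toy rows are hypothetical and stand in for the real files.

```python
import csv
import io


def read_predictions(handle):
    """Read a TSV with Source, Target, Prediction columns into a dict keyed by (source, target)."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {(row["Source"], row["Target"]): row["Prediction"] for row in reader}


def corrected_by_attention(vanilla, attention):
    """Return rows that the vanilla model got wrong and the attention model got right."""
    fixed = []
    for (source, target), vanilla_pred in vanilla.items():
        attention_pred = attention.get((source, target))
        if vanilla_pred != target and attention_pred == target:
            fixed.append((source, target, vanilla_pred))
    return fixed


# Hypothetical toy rows; in practice open the two prediction files instead
vanilla_tsv = "Source\tTarget\tPrediction\nsrc1\ttgtA\ttgtA\nsrc2\ttgtB\ttgtX\nsrc3\ttgtC\ttgtY\n"
attention_tsv = "Source\tTarget\tPrediction\nsrc1\ttgtA\ttgtA\nsrc2\ttgtB\ttgtB\nsrc3\ttgtC\ttgtZ\n"

vanilla = read_predictions(io.StringIO(vanilla_tsv))
attention = read_predictions(io.StringIO(attention_tsv))

for source, target, old_pred in corrected_by_attention(vanilla, attention):
    print(f"{source}: vanilla predicted {old_pred}, attention predicted {target}")
```

Rows are keyed by the (source, target) pair rather than by the source alone, because the same source word may appear with more than one reference in the data.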

In [None]:
import wandb
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

def train_seq2seq_model_wandb(config=None):
    with wandb.init(config=config):
        config = wandb.config

        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        train_data, dev_data, test_data = load_english_data()
        X_train, y_train, X_dev, y_dev, X_test, y_test, src_tokenizer, trg_tokenizer = preprocess_data(
            train_data, dev_data, test_data, config.max_sequence_length)

        train_loader = DataLoader(Seq2SeqDataset(X_train, y_train), batch_size=config.batch_size, shuffle=True)
        dev_loader = DataLoader(Seq2SeqDataset(X_dev, y_dev), batch_size=config.batch_size)
        test_loader = DataLoader(Seq2SeqDataset(X_test, y_test), batch_size=config.batch_size)

        input_dim = len(src_tokenizer.word_index) + 1
        output_dim = len(trg_tokenizer.word_index) + 1

        encoder = Encoder(
            input_dim=input_dim,
            emb_dim=config.emb_dim,
            hidden_dim=config.hidden_dim,
            n_layers=config.enc_layers,
            cell_type=config.cell_type,
            dropout=config.dropout
        )
        decoder = Decoder(
            output_dim=output_dim,
            emb_dim=config.emb_dim,
            enc_hidden_dim=config.hidden_dim,
            dec_hidden_dim=config.hidden_dim,
            n_layers=config.dec_layers,
            cell_type=config.cell_type,
            dropout=config.dropout,
            use_attention=True
        )

        model = Seq2Seq(encoder, decoder).to(device)
        optimizer = optim.Adam(model.parameters(), lr=config.lr)
        criterion = nn.CrossEntropyLoss(ignore_index=0)


        for epoch in range(config.n_epochs):
            print(f"Epoch {epoch+1}/{config.n_epochs}")
            train_loss, train_acc = train(model, train_loader, optimizer, criterion, device)
            valid_loss, valid_acc = evaluate(model, dev_loader, criterion, device)

            print(f"Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.4f}")
            print(f"Val Loss: {valid_loss:.4f} | Val Acc: {valid_acc:.4f}")

            wandb.log({
                'epoch': epoch + 1,
                'train_loss': train_loss,
                'train_acc': train_acc,
                'val_loss': valid_loss,
                'val_acc': valid_acc
            })
        wandb.finish()




# Question 6 (20 Marks)

This is a challenge question and most of you will find it hard.

I like the visualisation in the figure captioned "Connectivity" in this [article](https://distill.pub/2019/memorization-in-rnns/#appendix-autocomplete). Make a similar visualisation for your model. Please look at this [blog](https://towardsdatascience.com/visualising-lstm-activations-in-keras-b50206da96ff) for some starter code. The goal is to figure out the following: when the model is decoding the $i$-th character of the output, which input character is it looking at?

Have fun!
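
As a starting point, the question "which input character is the model looking at" can be answered numerically before drawing anything. The sketch below assumes you already have a matrix of per-step weights with one row per output character and one column per input character (for example attention weights or gradient magnitudes). The function name `most_attended_inputs` and the toy numbers are hypothetical.

```python
def most_attended_inputs(weights, input_chars, output_chars):
    """For each output character, return the input character with the largest weight."""
    alignment = []
    for i, row in enumerate(weights):
        # Index of the input position with the highest weight for output step i
        j = max(range(len(row)), key=lambda k: row[k])
        alignment.append((output_chars[i], input_chars[j], row[j]))
    return alignment


# Hypothetical weights: rows = output steps, columns = input characters
input_chars = ["a", "b", "c"]
output_chars = ["X", "Y"]
weights = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]

for out_ch, in_ch, weight in most_attended_inputs(weights, input_chars, output_chars):
    print(f"{out_ch} <- {in_ch} ({weight:.2f})")
```

The same matrix can then be used to colour the input characters for each decoding step, which is roughly what the "Connectivity" figure shows.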




# Question 7 (10 Marks)
Paste a link to your github code for Part A

Example: https://github.com/&lt;user-id&gt;/da6401_assignment3/partA

- We will check for coding style, clarity in using functions and a README file with clear instructions on training and evaluating the model (the 10 marks will be based on this).

- We will also run a plagiarism check to ensure that the code is not copied (0 marks in the assignment if we find that the code is plagiarised).

- We will check the number of commits made by the two team members and then give marks accordingly. For example, if we see 70% of the commits were made by one team member then that member will get more marks in the assignment (**note that this contribution will decide the marks split for the entire assignment and not just this question**).

- We will also check if the training and test splits have been used properly. You will get 0 marks on the assignment if we find any cheating (e.g., adding test data to training data) to get higher accuracy.




# Question 8 (0 Marks)

Note that this question does not carry any marks and will not be graded. This is only for students who are looking for a challenge and want to get something more out of the course.

Your task is to finetune the GPT2 model to generate lyrics for English songs. You can refer to [this blog](https://towardsdatascience.com/natural-language-generation-part-2-gpt-2-and-huggingface-f3acb35bc86a) and follow the steps there. This blog shows how to finetune the GPT2 model to generate headlines for financial articles. Instead of headlines you will use lyrics so you may find the following datasets useful for training: [dataset1](https://data.world/datasets/lyrics), [dataset2](https://www.kaggle.com/paultimothymooney/poetry)

At test time you will give it a prompt: "I love Deep Learning" and it should complete the song based on this prompt :-) Paste the generated song in a block below!

### Self Declaration



I, Name_XXX (Roll no: XXYY), swear on my honour that I have written the code and the report by myself and have not copied it from the internet or other students.



