# Introduction to Natural Language Processing: Assignment 3

In this exercise we'll practice training RNN & LSTM models as well as fine-tuning LLMs to predict one or more labels for a given text using Hugging Face and PyTorch.

- You can use any Python package you need.
- Please comment your code.
- Submissions are due on Tuesdays at 23:59 and are accepted **only** on eCampus: **Assignments >> Student Submissions >> Assignment 3 (Deadline: 10.12.2024, at 23:59)**

- Name the file appropriately: "Assignment_3_\<Your_Name\>.ipynb" and submit only the Jupyter Notebook file.

### Task 1 (15 points)

In this task you will implement text generation in torch with a multi-layer RNN and multi-layer LSTM.

a) Implement the missing methods of the dataset class to
1. load the dataset from the file `reddit-cleanjokes.csv` and split it into words [**(dataset link)**](https://raw.githubusercontent.com/amoudgl/short-jokes-dataset/master/data/reddit-cleanjokes.csv)
2. get a list of the unique words
3. implement the `__getitem__` method to iterate through the dataset. Hint: use `torch.tensor` to turn a list into a tensor.

Then instantiate the dataset with `sequence_length=4`
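
To make the relationship between input and target concrete, here is a minimal sketch using plain Python lists rather than tensors. It is not the required `__getitem__` implementation; the helper name `make_window` and the toy `token_ids` list are assumptions made for illustration. The target window is the input window shifted one position to the right.

```python
from typing import List, Tuple

def make_window(token_ids: List[int], index: int, sequence_length: int) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: return (input, target) windows starting at `index`."""
    inputs = token_ids[index : index + sequence_length]
    # The target is the same span shifted by one token.
    targets = token_ids[index + 1 : index + sequence_length + 1]
    return inputs, targets

token_ids = [10, 11, 12, 13, 14, 15]  # toy word indexes
inputs, targets = make_window(token_ids, index=0, sequence_length=4)
print(inputs)   # [10, 11, 12, 13]
print(targets)  # [11, 12, 13, 14]
```

With six tokens and `sequence_length=4` there are exactly two such windows (start indexes 0 and 1), which matches the `len(words_indexes) - sequence_length` rule used in `__len__`.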

In [2]:
import torch

from torch import Tensor
from typing import List
# hint: use these methods
from pandas import read_csv
from collections import Counter

class Dataset(torch.utils.data.Dataset):
    def __init__(
        self,
        sequence_length,
    ):
        self.sequence_length = sequence_length
        self.words = self.load_words()
        self.uniq_words = self.get_uniq_words()

        self.index_to_word = {index: word for index, word in enumerate(self.uniq_words)}
        self.word_to_index = {word: index for index, word in enumerate(self.uniq_words)}

        self.words_indexes = [self.word_to_index[w] for w in self.words]

    def load_words(self) -> List[str]:
        """Returns a list of all words in the dataset.
        Make sure to strip punctuation and lowercase the words"""
        # YOUR CODE HERE

    def get_uniq_words(self) -> List[str]:
        """Returns a list, containing each unique word in the dataset once"""
        # YOUR CODE HERE

    def __len__(self) -> int:
        """Returns the number of `self.sequence_length` length word spans in the dataset"""
        return len(self.words_indexes) - self.sequence_length

    def __getitem__(self, index) -> (Tensor, Tensor):
        """Returns a tuple of two torch.Tensors:
        an input sequence for the RNN/LSTM model and a target sequence.
        The tensors should be 1D and have length equal to self.sequence_length.
        Remember that the target should be shifted with respect to the input."""
        # YOUR CODE HERE

In [None]:
dataset = Dataset(4)
dataset.load_words()

b) Now, complete the implementation of the RNN model.

You'll need to use all the model components defined in `__init__`: [Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html), [RNN](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html), and the [Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) output layer. These are all subclasses of [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html).

Torch Modules are objects that hold the layer's weights and biases (called parameters, accessed by `model.parameters()`) and keep track of a bunch of metadata, like what device the weights are on or what precision they're stored at. Each Module can have parts that are themselves Modules. The easiest way to combine Modules is with [`torch.nn.Sequential`](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html). 

Every Module must have a `forward` method. This defines what the Module does with its input and returns as output. You can access `forward` simply by calling the Module, for example `output = self.rnn(input)`. This is the preferred way to write it.
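
As a small illustration of these points (unrelated to the assignment's model), the sketch below builds a toy network with `nn.Sequential`, lists the parameters it registers, and calls it directly. The layer sizes and the variable name `tiny` are arbitrary choices.

```python
import torch
from torch import nn

# A tiny two-layer network built with nn.Sequential (sizes are arbitrary).
tiny = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

# Parameters are registered automatically; each one has a name and a shape.
param_shapes = {name: tuple(p.shape) for name, p in tiny.named_parameters()}
print(param_shapes)

# Calling the module runs its `forward` method.
x = torch.randn(5, 4)
y = tiny(x)
print(y.shape)  # torch.Size([5, 2])
```

The `ReLU` layer has no weights, so only the two `Linear` layers show up in `param_shapes`.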

Hint: unlike the one we saw in the tutorial, this RNN has multiple layers. Think about what this means for the shape of the hidden state. You might not want to use Sequential as RNN has multiple inputs and outputs.
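
The following sketch shows how the hidden state of a multi-layer `nn.RNN` is shaped. It uses toy sizes that are not the ones from `RNNModel`, and it assumes the default `batch_first=False` layout, in which the first input dimension is treated as time and the second as the batch.

```python
import torch
from torch import nn

# Toy sizes chosen for illustration only.
seq_len, batch_size, input_size, hidden_size, num_layers = 5, 2, 8, 16, 3

rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers)

# With the default batch_first=False, the input is (seq_len, batch, input_size).
x = torch.randn(seq_len, batch_size, input_size)
# One hidden-state slice per layer: (num_layers, batch, hidden_size).
h0 = torch.zeros(num_layers, batch_size, hidden_size)

# The RNN takes two inputs and returns two outputs.
output, h_n = rnn(x, h0)
print(output.shape)  # torch.Size([5, 2, 16])
print(h_n.shape)     # torch.Size([3, 2, 16])
```

Because the layer takes and returns a pair of values, it does not fit the single-input, single-output chaining that `nn.Sequential` performs.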

In [None]:
import torch
from torch import nn

class RNNModel(nn.Module):
    def __init__(self, dataset):
        super(RNNModel, self).__init__()
        self.hidden_size = 128
        self.embedding_dim = 128
        self.num_layers = 3

        n_vocab = len(dataset.uniq_words)
        self.embedding = nn.Embedding(
            num_embeddings=n_vocab,
            embedding_dim=self.embedding_dim,
        )
        self.rnn = nn.RNN(
            input_size=self.embedding_dim,  # the RNN consumes the embedding vectors
            hidden_size=self.hidden_size,
            num_layers=self.num_layers,
            dropout=0.2,
        )
        self.fc = nn.Linear(self.hidden_size, n_vocab)

    def init_state(self, sequence_length: int) -> Tensor:
        """Returns the initial hidden state (all zeros), with the correct shape."""
        # YOUR CODE HERE
    
    def forward(self, inputs: Tensor, prev_hidden_state: Tensor) -> (Tensor, Tensor):
        """Compute the logits and next_hidden_state."""
        # YOUR CODE HERE
        return logits, next_hidden_state

In [None]:
model = RNNModel(dataset)

c) Write a function that counts the total number of parameters and total number of trainable parameters of a model.

Refer to the [torch documentation](https://pytorch.org/docs/stable/index.html).

In [None]:
def count_params(model: nn.Module) -> (int, int):
    # YOUR CODE HERE
    return n_params, n_trainable_params

d) Complete the training loop and train the model for 10 epochs. Store the training loss in a list. You will probably want to have an inner loop that loops over batches.

Hint: refer to the documentation of the `DataLoader`, `CrossEntropyLoss` and `Optimizer` classes. You might also need to use the `detach()` and `item()` methods to work with the loss tensor.
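
As a hedged illustration of the last part of the hint (not a training loop), the sketch below computes a cross-entropy loss on random toy data and shows what `item()` and `detach()` return. The shapes and the variable names are assumptions chosen for the example.

```python
import torch
from torch import nn

criterion = nn.CrossEntropyLoss()

# Toy data: 6 predictions over 10 classes, with integer class targets.
logits = torch.randn(6, 10, requires_grad=True)  # (N, num_classes)
targets = torch.randint(0, 10, (6,))             # (N,)

loss = criterion(logits, targets)

# item() gives a plain Python float that is safe to append to a list.
loss_value = loss.item()
# detach() gives a tensor that no longer tracks gradient history.
loss_detached = loss.detach()

print(type(loss_value), loss_detached.requires_grad)
```

Storing `loss_value` instead of `loss` avoids keeping the whole computation graph alive for every batch.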

In [None]:
import torch
import numpy as np
from torch import nn, optim
from torch.utils.data import DataLoader

def train_rnn(dataset, model, sequence_length, batch_size, max_epochs) -> List[float]:
    model.train()

    dataloader = DataLoader(dataset, batch_size=batch_size)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    loss_list = []

    for epoch in range(max_epochs):
        # YOUR CODE HERE

    return loss_list

In [None]:
train_loss_rnn = train_rnn(dataset, model, 4, 256, 10)

e) Complete the functions to generate output from the model using (1) argmax (greedy) decoding and (2) softmax decoding, i.e., sampling the next word from the softmax distribution. Generate some sample outputs with each and briefly discuss the results.

Hint: torch already has a built-in function for getting the softmax of a tensor, which you may use.
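
The difference between the two decoding strategies can be sketched without a model. The example below uses only the Python standard library and made-up logits over a four-word toy vocabulary; the helper names `softmax`, `pick_argmax` and `pick_sampled` are hypothetical and are not part of the assignment template.

```python
import math
import random
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_argmax(logits: List[float]) -> int:
    """Greedy decoding: always take the highest-scoring index."""
    return max(range(len(logits)), key=lambda i: logits[i])

def pick_sampled(logits: List[float], rng: random.Random) -> int:
    """Softmax decoding: sample an index according to its probability."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

vocab = ["why", "did", "the", "chicken"]   # toy vocabulary
logits = [0.1, 2.0, 1.5, -1.0]             # made-up scores for the next word
rng = random.Random(0)

print(vocab[pick_argmax(logits)])  # did
print([vocab[pick_sampled(logits, rng)] for _ in range(5)])
```

Greedy decoding is deterministic and returns the same word for the same logits every time, while sampling can return any word with non-zero probability.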

In [None]:
def predict_rnn_argmax(dataset: Dataset, model: nn.Module, text: str, next_words=100) -> List[str]:
    model.eval()

    # YOUR CODE HERE

    return words

def predict_rnn_softmax(dataset: Dataset, model: nn.Module, text: str, next_words=100) -> List[str]:
    model.eval()

    # YOUR CODE HERE

    return words

f) Implement `LSTMModel`, `train_lstm`, `predict_lstm_argmax` and `predict_lstm_softmax`. Train the model using the same settings and plot both training loss curves together. Briefly discuss the differences in the model architectures and performance. Which model performs better and what are possible causes? What are the limitations of the model?

Hint: use the `torch.nn.LSTM` class. You can do almost everything as with the RNN, but take into account that an LSTM carries **two** states: a hidden state and a cell state.

Hint: You might not necessarily see the LSTM perform better, even if your implementation is correct.
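
A minimal sketch of the two-state interface of `nn.LSTM`, using toy sizes that are not taken from the assignment and the default `batch_first=False` layout:

```python
import torch
from torch import nn

# Toy sizes chosen for illustration only.
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=3)

x = torch.randn(5, 2, 8)        # (seq_len, batch, input_size)
h0 = torch.zeros(3, 2, 16)      # hidden state: (num_layers, batch, hidden_size)
c0 = torch.zeros(3, 2, 16)      # cell state: same shape as the hidden state

# The states are passed in and returned as a tuple (hidden, cell).
output, (h_n, c_n) = lstm(x, (h0, c0))
print(output.shape, h_n.shape, c_n.shape)
```

Compared with `nn.RNN`, the only change in the calling convention is that the state is a pair of tensors rather than a single tensor.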

In [None]:
# Here comes your code

### Task 2 (2 points)

The goal of this task is to download a multi-label text classification dataset from the [Hugging Face Hub](https://huggingface.co/datasets) and load it.

a) Select the `Text Classification` tag on the left, then `multi-label-classification`, as well as the "1K<n<10K" tag, to find a relatively small dataset (e.g., sem_eval_2018_task_1 >> subtask5.english).

b) Load your dataset using `load_dataset` and check (print) the last data point in the validation set.

**Hint:** If you don't have access to a GPU, you can downsample the dataset.

In [None]:
# Here comes your code

### Task 3 (3 points)

a) Write a function `tokenize_data(dataset)` that takes the loaded dataset as input and returns the encoded dataset for both text and labels.


**Hints:**

1. You should tokenize the text using the BERT tokenizer `bert-base-uncased`
2. You also need to provide labels to the model as numbers. For multi-label text classification, this is a matrix of shape (batch_size, num_labels). This should be a tensor of floats rather than integers.
3. You can apply the function `tokenize_data(dataset)` to the dataset using `map()`. (You can check out the exercise!)
4. You should set the format of the data to PyTorch tensors using `encoded_dataset.set_format("torch")`. This will turn the training, validation and test sets into standard PyTorch datasets.
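
The label format described in hint 2 can be sketched in plain Python. The label names and the boolean-column layout below are assumptions for illustration; the columns of your chosen dataset may look different. Each example becomes one row of floats, so a batch forms a `(batch_size, num_labels)` matrix.

```python
from typing import Dict, List

# Hypothetical label columns; replace with the ones in your dataset.
label_names = ["anger", "joy", "sadness"]

def to_multi_hot(example: Dict[str, bool], label_names: List[str]) -> List[float]:
    """Turn per-label booleans into a list of floats (1.0 = label present)."""
    return [float(example[name]) for name in label_names]

batch = [
    {"anger": True, "joy": False, "sadness": True},
    {"anger": False, "joy": True, "sadness": False},
]
labels = [to_multi_hot(example, label_names) for example in batch]
print(labels)  # [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```

Floats are needed because the multi-label loss compares the labels with per-label probabilities rather than with class indexes.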

b) Print the `keys()` of the last data point in the validation set in the encoded dataset.

**Hint:** The output should be as follows:

`dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])`

In [None]:
def tokenize_data(dataset):
    # here comes your code
    return encoded_dataset

### Task 4 (15 points)

Implement and compare two different approaches for text classification on a multi-label dataset. One approach will use BERT for tokenization and classification, while the other will use an alternative method such as TF-IDF + SVM or Bag of Words (BoW) + logistic regression.

a) **BERT Approach:**


``1.`` Define a text classification model that includes a pre-trained base ``(bert-base-uncased)`` using ``AutoModelForSequenceClassification``.

**Hints:**

       
- Create two dictionaries that map labels to integers and vice versa for the ``id2label`` and ``label2id`` parameters of the `.from_pretrained` function.
        
- Set the `problem_type` to "multi_label_classification" to ensure the appropriate loss function is used.
        
``2.`` Train the BERT-based model using HuggingFace's Trainer API.

**Hints:**
- Utilize `TrainingArguments` and `Trainer` classes.

- While training, compute metrics using a ``compute_metrics`` function that returns a dictionary with the desired metric values.
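
The two label dictionaries, and the way raw model outputs are typically turned into multi-label predictions, can be sketched in plain Python. The label names, the logits and the 0.5 threshold are assumptions for illustration, and `logits_to_labels` is a hypothetical helper rather than part of any library.

```python
import math
from typing import List

labels = ["anger", "joy", "sadness"]  # hypothetical label names

# Mappings between integer ids and label names, in both directions.
id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}

def logits_to_labels(logits: List[float], threshold: float = 0.5) -> List[str]:
    """Apply a sigmoid to each logit and keep labels above the threshold."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [id2label[i] for i, p in enumerate(probs) if p >= threshold]

print(id2label)                             # {0: 'anger', 1: 'joy', 2: 'sadness'}
print(logits_to_labels([2.0, -1.0, 0.3]))   # ['anger', 'sadness']
```

Each label is decided independently, so a text can receive several labels or none, unlike in single-label classification where a softmax picks exactly one.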

b) **Alternative Approach:**


``1.`` Choose an alternative approach for tokenization and classification. For example, use TF-IDF or Bag of Words (BoW) for tokenization and a traditional classifier like SVM or logistic regression for classification.

**Hints:**

  - Use the scikit-learn library for the implementations.

``2.`` Train the alternative approach (model) on the same dataset you used for the BERT approach.

__Hints:__

  - Use appropriate training and evaluation procedures for the chosen alternative approach.
  
``3.`` Evaluate the performance of both models on the validation set using the metrics accuracy, F1-score, precision, and recall.
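
For multi-label data these metrics can be defined in more than one way. The sketch below is one possible reading, written in plain Python: micro-averaged precision, recall and F1 over all label decisions, plus subset accuracy (a sample counts as correct only if every label matches). The function name `multilabel_metrics` and the toy data are assumptions; in practice you would likely use the scikit-learn metric functions instead.

```python
from typing import Dict, List

def multilabel_metrics(y_true: List[List[int]], y_pred: List[List[int]]) -> Dict[str, float]:
    """Micro-averaged precision/recall/F1 and subset accuracy for 0/1 label matrices."""
    tp = fp = fn = exact = 0
    for t_row, p_row in zip(y_true, y_pred):
        exact += int(t_row == p_row)  # all labels of this sample correct
        for t, p in zip(t_row, p_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": exact / len(y_true), "precision": precision, "recall": recall, "f1": f1}

y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]  # toy gold labels
y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]  # toy predictions
metrics = multilabel_metrics(y_true, y_pred)
print(metrics)
```

On this toy data there are 4 true positives, 1 false positive and 1 false negative, so precision, recall and F1 are all 0.8, while only one of the three samples is predicted exactly, giving a subset accuracy of 1/3.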

c) **Discussion:**

 Discuss the strengths and weaknesses of each approach.

__Note:__ Feel free to explore variations and improvements for both approaches. Experiment with hyperparameters and preprocessing steps to enhance the models' performance.


In [None]:
# Here comes your code for BERT Approach

In [None]:
# Here comes your code for the alternative approach

#### Here comes your discussion