# Introduction

This exercise provides hands-on experience with three approaches in natural language processing: training an RNN model, prompting a pretrained language model, and fine-tuning a pretrained language model. The task is to classify emotions in text using the Emotion dataset hosted on Hugging Face:

https://huggingface.co/datasets/dair-ai/emotion

## Data set

* Utilize the Emotion dataset from Hugging Face.
* You will apply three approaches to classify emotions such as sadness, joy, love, anger, fear, and surprise from textual data.
* More details about the dataset can be found at the provided link.
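
As a small orientation aid, the sketch below maps integer labels to emotion names. It assumes the dataset's integer labels follow the order in which the emotions are listed above; this can be checked against `features['label'].names` once the dataset is loaded. The names `EMOTION_NAMES`, `id_to_name`, and `name_to_id` are illustrative helpers, not part of the dataset API.

```python
# Assumed label order: the order in which the emotions are listed above.
EMOTION_NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]

def id_to_name(label_id: int) -> str:
    """Return the emotion name for an integer label."""
    return EMOTION_NAMES[label_id]

def name_to_id(name: str) -> int:
    """Return the integer label for an emotion name."""
    return EMOTION_NAMES.index(name)

print(id_to_name(0), name_to_id("surprise"))
```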


## Three approaches

In this task, you will apply three distinct NLP approaches to classify emotions from textual data. Each approach should be executable within the Google Colab environment, allowing you to leverage its resources.

# 1. Train an RNN model

* Introduction: Recurrent Neural Networks (RNNs) are powerful for sequence modeling and have been extensively used in NLP for tasks like text classification.
* Task: Train an RNN to classify emotions.
* Details: Implement and train an RNN using PyTorch. The architecture should include an embedding layer, one or more RNN layers, and a dense output layer for classification.
* Model Flexibility: You are free to choose or modify any RNN architecture (e.g., LSTM, GRU) as long as it is compatible with Colab.
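
To make the idea of a recurrent layer concrete, here is a minimal sketch of a plain (Elman) RNN forward pass in NumPy. It is not the model used in this exercise, which relies on PyTorch's built-in layers; the function name `rnn_forward` and the toy sizes are assumptions chosen for illustration.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a plain (Elman) RNN over one sequence.

    inputs: (seq_len, input_dim); returns all hidden states (seq_len, hidden_dim).
    """
    hidden = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        # The new state mixes the current input with the previous state.
        hidden = np.tanh(W_x @ x_t + W_h @ hidden + b)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 5, 4, 3  # toy sizes, for illustration only
inputs = rng.normal(size=(seq_len, input_dim))
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

states = rnn_forward(inputs, W_x, W_h, b)
print(states.shape)  # (5, 3)
```

A classifier would feed one summary of these states (for example the final one) into a dense output layer.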


## Install dependency

In [None]:
%%capture
!pip install datasets

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from datasets import load_dataset
from tqdm import tqdm

## Load and Prepare the Dataset

In [None]:
# Load and tokenize the dataset
def load_and_preprocess_data():
    dataset = load_dataset('dair-ai/emotion')
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

    def tokenize_function(examples):
        return tokenizer(examples['text'], padding="max_length", truncation=True, max_length=128)

    tokenized_datasets = dataset.map(tokenize_function, batched=True)
    tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
    return tokenized_datasets, tokenizer

tokenized_datasets, tokenizer = load_and_preprocess_data()
print(tokenized_datasets)
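
The tokenizer call above pads or truncates every example to 128 token ids and returns an attention mask. The sketch below imitates that behavior in plain Python to show what the two outputs look like. It is a simplification: `pad_or_truncate` is a hypothetical helper, the padding id of 0 is an assumption, and the real tokenizer also handles special tokens when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    ids = list(token_ids)[:max_length]   # truncation: drop tokens past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)        # number of padding slots to add
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

# Arbitrary token ids, for illustration only.
ids, mask = pad_or_truncate([101, 7592, 102], max_length=5)
print(ids)   # [101, 7592, 102, 0, 0]
print(mask)  # [1, 1, 1, 0, 0]
```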

## Define the RNN Model


In [None]:
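# Define the RNN Classifier
class RNNClassifier(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, rnn_type="GRU", num_layers=2, bidirectional=True, dropout=0.5):
        super(RNNClassifier, self).__init__()
        self.bidirectional = bidirectional
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.dropout = nn.Dropout(dropout)
        # Select the recurrent layer named by rnn_type (raises KeyError for unknown names)
        rnn_cls = {"GRU": nn.GRU, "LSTM": nn.LSTM, "RNN": nn.RNN}[rnn_type]
        self.rnn = rnn_cls(embedding_dim, hidden_dim, num_layers=num_layers,
                           batch_first=True, bidirectional=bidirectional, dropout=dropout if num_layers > 1 else 0)
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)

    def forward(self, text):
        embedded = self.dropout(self.embedding(text))
        _, hidden = self.rnn(embedded)
        if isinstance(hidden, tuple):  # LSTM returns (h_n, c_n); keep h_n
            hidden = hidden[0]
        # Final hidden state of the top layer; concatenate both directions if bidirectional.
        # Note: padding positions are still processed, since no mask is passed in.
        hidden = torch.cat((hidden[-2], hidden[-1]), dim=1) if self.bidirectional else hidden[-1]
        return self.fc(hidden)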
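
Because the inputs are padded to a fixed length, the trailing positions of most sequences are padding tokens, so a readout that ignores the attention mask also summarizes padding. One possible way to read out only real tokens is sketched below in NumPy; the same indexing pattern works on PyTorch tensors. `last_real_step` is a hypothetical helper and assumes padding is on the right.

```python
import numpy as np

def last_real_step(outputs, attention_mask):
    """Pick, per sequence, the output at the last non-padding position.

    outputs: (batch, seq_len, features)
    attention_mask: (batch, seq_len) of 0/1, with padding on the right.
    """
    lengths = attention_mask.sum(axis=1)     # number of real tokens per sequence
    last_idx = np.maximum(lengths - 1, 0)    # guard against all-padding rows
    return outputs[np.arange(outputs.shape[0]), last_idx]

outputs = np.arange(2 * 4 * 3).reshape(2, 4, 3)  # toy (batch=2, seq=4, feat=3)
mask = np.array([[1, 1, 0, 0],
                 [1, 1, 1, 1]])
print(last_real_step(outputs, mask))
```

In this toy case the first sequence is read at position 1 and the second at position 3.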

## Model and training parameters

In [None]:
# Model and training parameters
# Reuse the tokenizer returned by load_and_preprocess_data() above
INPUT_DIM = tokenizer.vocab_size
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = len(tokenized_datasets['train'].features['label'].names)
model = RNNClassifier(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

# Data loaders
train_loader = DataLoader(tokenized_datasets['train'], batch_size=32, shuffle=True, pin_memory=True)
validation_loader = DataLoader(tokenized_datasets['validation'], batch_size=32, shuffle=False, pin_memory=True)
test_loader = DataLoader(tokenized_datasets['test'], batch_size=32, shuffle=False, pin_memory=True)

## Training and evaluation functions

In [None]:
def train_model(model, data_loader, optimizer, criterion):
    model.train()
    total_loss = 0
    for batch in tqdm(data_loader):
        optimizer.zero_grad()
        predictions = model(batch['input_ids'])
        loss = criterion(predictions, batch['label'])
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(data_loader)

def evaluate_model(model, data_loader, criterion):
    model.eval()
    total_loss = 0
    total_correct = 0
    with torch.no_grad():
        for batch in data_loader:
            predictions = model(batch['input_ids'])
            loss = criterion(predictions, batch['label'])
            total_loss += loss.item()
            preds = predictions.argmax(dim=1)
            total_correct += (preds == batch['label']).sum().item()
    avg_loss = total_loss / len(data_loader)
    accuracy = total_correct / len(data_loader.dataset)
    return avg_loss, accuracy

## Main training loop

In [None]:
num_epochs = 5
for epoch in range(num_epochs):
    train_loss = train_model(model, train_loader, optimizer, criterion)
    val_loss, val_accuracy = evaluate_model(model, validation_loader, criterion)
    print(f'Epoch {epoch+1}, Train Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}, Validation Accuracy: {val_accuracy:.4f}')

# Evaluation
test_loss, test_accuracy = evaluate_model(model, test_loader, criterion)
print(f'Test Loss: {test_loss:.4f}, Test Accuracy: {test_accuracy:.4f}')

# 2. (**M8 Group Exercise**) Prompting a pretrained (transformer-based) language model

* Introduction: Prompting involves adapting a pre-trained model to a specific task without extensive retraining, leveraging the model's existing knowledge.
* Task: Use zero-shot learning by prompting a pretrained language model.
* Details: Use a pretrained language model to generate predictions based on prompts. Craft three different prompts to evaluate how well the model can infer the correct emotion.
* Model Flexibility: Any pretrained model available via libraries like Hugging Face’s Transformers that runs on Google Colab can be used.
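
For a masked language model, a prompt is typically the input text followed by a template containing one mask slot. The sketch below shows only the string assembly. `build_prompt` is a hypothetical helper, and the example template is made up for illustration; it is not a suggested answer to the prompt-design task.

```python
def build_prompt(text, prefix, suffix, mask_token="[MASK]"):
    """Append a cloze-style template with a single mask slot to the input text."""
    return f"{text}{prefix}{mask_token}{suffix}"

# Hypothetical template and sentence, for illustration only.
example_prefix = " Overall, it was "
example_suffix = "."
print(build_prompt("The package arrived two weeks late.", example_prefix, example_suffix))
```

The mask token differs between models, which is why it is a parameter here rather than a hard-coded string.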


## Install dependency

In [None]:
# installation takes ~1 min
!pip install -U sentence-transformers
!pip install datasets
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from tqdm import tqdm
from sklearn.metrics import f1_score, accuracy_score

## Load dataset

In [None]:
from datasets import load_dataset

dataset = load_dataset('dair-ai/emotion')

## Load the pretrained model

In [None]:
# DistilBERT model: https://arxiv.org/abs/1910.01108
# many other models are available on Hugging Face: https://huggingface.co/models
from transformers import pipeline
unmasker = pipeline('fill-mask', model='distilbert-base-uncased')

## Prompt design

In [None]:
# prompt design
unmasked = []
##### Your implementation starts here #####
prefix =  # string
suffix =  # string
##### Your implementation ends here #####

## Test run

In [None]:
example = dataset['train']['text'][11]
print("The raw data is:\n", example)
prompt = example + prefix + '[MASK]' + suffix # [MASK] is the to-be-predicted token; defined by the model
print("The prompt is:\n", prompt)
pred = unmasker(prompt)
print("\nThe prediction is:")
print(pred)

## Prediction

In [None]:
# mask filling
for x in tqdm(dataset['test']['text'][0:100]): # Let's test only on first 100 data points for this coding exercise
  prompt = x + prefix + '[MASK]' + suffix # distilbert

  pred = unmasker(prompt) # this may take ~5 minutes to run on the entire dataset
  unmasked.append(pred[0]['token_str'])
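
The loop above keeps only the top-ranked token, which may not be an emotion word at all. As an alternative to the embedding-based back mapping used in this notebook, the sketch below scans the ranked candidates for the first one that is already a label word. `first_label_word` is a hypothetical helper, and the candidate list is made up to mimic the shape of the pipeline output (a list of dicts with a `token_str` key, best first).

```python
def first_label_word(candidates, label_words):
    """Return the highest-ranked candidate token that is a known label word.

    Falls back to the top candidate when none of them is a label word.
    """
    for cand in candidates:
        token = cand["token_str"].strip().lower()
        if token in label_words:
            return token
    return candidates[0]["token_str"]

# Made-up candidates that only mimic the output shape; not real model output.
fake_pred = [
    {"token_str": "great", "score": 0.4},
    {"token_str": "joy", "score": 0.3},
    {"token_str": "fear", "score": 0.1},
]
labels = {"sadness", "joy", "love", "anger", "fear", "surprise"}
print(first_label_word(fake_pred, labels))  # joy
```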

In [None]:
# sentence bert
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')

In [None]:
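prediction = []
# predefined vocabulary in class order ('shock' stands in for the 'surprise' class)
emotion = ['sadness', 'joy', 'love', 'anger', 'fear', 'shock']
for i in tqdm(range(len(unmasked))):
  z = unmasked[i]  # token predicted for the [MASK] slot
  x = dataset['test']['text'][i]

  # the six emotion words followed by the predicted token (index 6)
  word = emotion + [z]

  # alternative input (not used below): each candidate placed into the full prompt
  sentence = [x + prefix + w + suffix for w in word]

  # embed the words
  sentence_embeddings = model.encode(word)
  # back mapping: cosine similarity between the predicted token and each emotion word
  back_mapping = cosine_similarity(
      [sentence_embeddings[6]],
      sentence_embeddings[0:6]
  )

  # the most similar emotion word gives the predicted class index
  prediction.append(np.argmax(back_mapping))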
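
The back-mapping step reduces to "find the label vector with the highest cosine similarity to the query vector". A minimal NumPy sketch of that step follows; `nearest_label` is a hypothetical helper, and the 2-D vectors are toy stand-ins for real sentence embeddings.

```python
import numpy as np

def nearest_label(query_vec, label_vecs):
    """Return the index of the label vector with the highest cosine similarity."""
    query = query_vec / np.linalg.norm(query_vec)
    labels = label_vecs / np.linalg.norm(label_vecs, axis=1, keepdims=True)
    return int(np.argmax(labels @ query))  # dot product of unit vectors = cosine

# Toy 2-D vectors standing in for embeddings (illustrative values only).
label_vecs = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [-1.0, 0.0]])
print(nearest_label(np.array([0.2, 0.9]), label_vecs))  # 1
```

Cosine similarity ignores vector length, so only the direction of the embedding decides the predicted class.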

## Evaluation

In [None]:
label = dataset['test']['label'][0:len(unmasked)]
print(len(prediction))
print(len(label))

# class order: sadness, joy, love, anger, fear, surprise
# scikit-learn metrics take the true labels first, then the predictions
print('F1_macro: ', f1_score(label, prediction, average='macro'))
print('F1: ', f1_score(label, prediction, average=None))
print('Accuracy: ', accuracy_score(label, prediction))
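
To show what the macro-averaged F1 score measures, here is a small from-scratch sketch in plain Python. It is meant as an explanation, not a replacement for scikit-learn. The helper names are hypothetical, and the sketch assigns 0 to a class that never occurs in either list, which may differ from how a library treats absent classes.

```python
def per_class_f1(y_true, y_pred, num_classes):
    """F1 per class, using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, num_classes)
    return sum(scores) / len(scores)

# Toy labels, for illustration only.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0]
print(per_class_f1(y_true, y_pred, 3))  # [0.5, 0.8, 0.0]
print(macro_f1(y_true, y_pred, 3))
```

Because every class counts equally in the macro average, a rare class that is never predicted pulls the score down much more than it affects accuracy.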

# 3. Fine-tune a pretrained (transformer-based) language model

* Introduction: Fine-tuning adjusts the weights of a pretrained model specifically to the task at hand, improving performance by adapting the model's deep knowledge to your specific dataset.
* Task: Fine-tune a pretrained model on the Emotion dataset.
* Details: Choose a transformer model and fine-tune it using the training split of the Emotion dataset. Adjust the learning rate, batch size, and other hyperparameters as necessary.
* Model Flexibility: Any transformer-based model that is supported by the Google Colab environment can be used. Ensure the chosen model is manageable within the resource constraints of Colab.


## Load dataset

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the dataset
dataset = load_dataset('dair-ai/emotion')

## Preprocess the data

In [None]:
# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# Define the tokenization function
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True, max_length=128)

# Apply the tokenizer to the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

## Load the model

In [None]:
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=dataset['train'].features['label'].num_classes)

## Training config

In [None]:
!pip install accelerate -U # Note: restart the runtime/session after installing this package

In [None]:
from transformers import Trainer, TrainingArguments

# Set training arguments
# Several key components have been removed from the following code. Fill in the blanks (???) with the appropriate code to complete the script.
# Set the number of samples processed at a time during training to 16
# Set the number of samples processed at a time during evaluation to 64
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=???,
    per_device_eval_batch_size=???,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
)

# Initialize the Trainer
trainer = Trainer(
    model=???,
    args=???,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    compute_metrics=lambda p: {'accuracy': (p.predictions.argmax(-1) == p.label_ids).astype(float).mean()}  # compute accuracy
)
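
The `warmup_steps=500` argument means the learning rate ramps up over the first 500 optimizer steps. The sketch below illustrates a linear warmup followed by a linear decay. It assumes that the Trainer uses a linear schedule and a base learning rate of 5e-5 when none is specified; both are assumptions about the installed library version. The function name and the total of 3000 steps are hypothetical.

```python
def linear_warmup_decay(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Hypothetical run with 3000 optimizer steps in total.
for step in (0, 250, 500, 1750, 3000):
    lr = linear_warmup_decay(step, base_lr=5e-5, warmup_steps=500, total_steps=3000)
    print(step, lr)
```

If the total number of steps is small, a fixed warmup of 500 steps can take up a large share of training, which is worth keeping in mind when changing the batch size or the number of epochs.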

## Training

In [None]:
trainer.train()

## Evaluation

In [None]:
results = trainer.evaluate(tokenized_datasets['test'])
print(results)
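
Overall accuracy can hide weak classes. The sketch below computes accuracy per class from logits and labels with NumPy. It assumes arrays shaped like the `predictions` and `label_ids` fields that `trainer.predict(...)` returns; `per_class_accuracy` is a hypothetical helper, and the toy arrays are for illustration only.

```python
import numpy as np

def per_class_accuracy(logits, labels, num_classes):
    """Map each class that occurs in labels to the fraction predicted correctly."""
    preds = logits.argmax(axis=-1)
    result = {}
    for c in range(num_classes):
        in_class = labels == c
        if in_class.any():  # skip classes with no examples
            result[c] = float((preds[in_class] == c).mean())
    return result

# Toy logits for four examples and two classes.
logits = np.array([[2.0, 0.1], [0.3, 1.5], [1.2, 0.4], [0.2, 0.9]])
labels = np.array([0, 1, 1, 1])
print(per_class_accuracy(logits, labels, num_classes=2))
```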