<a href="https://colab.research.google.com/github/rajkstats/uplimit_nlp/blob/main/wk1_text_classification_rk_2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

> DUPLICATE THIS COLAB TO START WORKING ON IT. Using File > Save a copy to drive.

# Text Classification Project

### What are we building
We will build and compare two text classification models:
1. Single Layer NN: This model uses pre-trained Word2Vec embeddings averaged over each text, followed by a single linear layer neural network for classification.
2. Fine-tuned BERT: Leveraging the power of BERT, a transformer-based model, we will fine-tune it for our specific classification task. BERT's attention mechanism allows it to understand context and semantics better than traditional embeddings.

**Bonus**:
1. Explore different parameters, features and architectures.
2. Fine-tune GPT-4o mini via the API (made available for fine-tuning on July 23, 2024!)

###  Evaluation
We’ll be evaluating our models on the following metrics:

1. Accuracy: The ratio of the number of correctly classified instances to the total number of instances
2. **Bonus**: Given that this tutorial outlines a multi-class classification problem, visualize a [confusion matrix](https://miro.medium.com/v2/resize:fit:1400/1*yH2SM0DIUQlEiveK42NnBg.png) of N*N of actual class vs predicted class (N = number of classes). This analysis can also be used to calculate other useful evaluation metrics such as precision, recall and F1-scores.
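
As a rough illustration of how these metrics relate, here is a minimal sketch in plain Python that builds a confusion matrix from integer labels and derives accuracy, precision, recall and F1 from it. The function names and the toy labels are made up for this example; they are not part of the project code.

```python
def confusion_matrix(actual, predicted, num_classes):
    """Return an N x N matrix: rows are actual classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def accuracy(matrix):
    """Correct predictions sit on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def precision_recall_f1(matrix, cls):
    """Per-class precision, recall and F1 computed from the confusion matrix."""
    true_positives = matrix[cls][cls]
    predicted_as_cls = sum(row[cls] for row in matrix)  # column sum
    actually_cls = sum(matrix[cls])                     # row sum
    precision = true_positives / predicted_as_cls if predicted_as_cls else 0.0
    recall = true_positives / actually_cls if actually_cls else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

# Toy example with 3 classes (hypothetical labels)
actual = [0, 0, 1, 1, 2, 2, 2, 1]
predicted = [0, 1, 1, 1, 2, 0, 2, 2]
cm = confusion_matrix(actual, predicted, num_classes=3)
print(cm)
print(accuracy(cm))
print(precision_recall_f1(cm, cls=1))
```

With 32 emotion classes the matrix becomes 32 x 32, and the off-diagonal cells show which emotions the model confuses with each other.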


### Instructions

1. **Attach GPU/TPU**: Click Runtime -> Change Runtime Type -> Select T4 GPU (or TPU v2)
2. **Setup**: Simply execute the boilerplate code in each cell to download and parse the dataset. No coding required!
3. **Model 1**: Train and evaluate a model using pre-trained word2vec embeddings. This will serve as the baseline model.
4. **Model 2**: Train and evaluate a transformer-based model and compare it against the baseline performance metrics from Model 1.
5. **Bonus**: Outperform Model 2 by tweaking the hyperparameters

### Code Overview
- Dependencies: Python dependencies and loading the spacy model
- Project
  - Dataset: Download the conversation dataset and parse it into a pytorch Dataset
  - Trainer: Trainer function to help with multi-epoch training
  - Model 1: Simple Word2Vec + single linear layer model (baseline)
  - Model 2: Fine-tuning BERT
- Bonus



## Dependencies

In [1]:
!pip install torchmetrics lightning datasets -q
!python -m spacy download en_core_web_lg

In [4]:
import os
import subprocess

from sklearn.preprocessing import LabelEncoder
import torch
import torch.nn.functional as F
import torch.utils.data as tfdata
from torchmetrics.classification import MulticlassAccuracy
import spacy
from tqdm import tqdm
import numpy as np
import lightning as L
import pandas as pd
import warnings
from torch.nn.utils.rnn import pad_sequence
warnings.filterwarnings("ignore", ".*does not have many workers.*")

In [5]:
# Load the spaCy model
nlp = spacy.load('en_core_web_lg')

# Fix the random seed so that we get consistent results
torch.manual_seed(0)
np.random.seed(0)



# Classifier Project
✨ Let's Begin ✨

### Data Loading and Processing (Common to ALL Solutions)

#### Dataset

We’ll be using the EmpatheticDialogues dataset open-sourced by Facebook ([link](https://research.fb.com/publications/towards-empathetic-open-domain-conversation-models-a-new-benchmark-and-dataset/)). It can be downloaded as a tarball from the following [link](https://dl.fbaipublicfiles.com/parlai/empatheticdialogues/empatheticdialogues.tar.gz).

A sample row from the dataset:
```
conv_id,utterance_idx,context,prompt,speaker_idx,utterance,selfeval,tags
hit:12388_conv:24777,1,joyful,I felt overcome with emotions when Christmas came around as a kid,437,Christmas was the best time of year back in the day!,5|5|5_5|5|5, ''
```

Let's download and explore the dataset; the meaning of each column should become clear as we go.

[Dataset and Data loaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html): Code for processing data samples can get messy and hard to maintain; we ideally want our dataset code to be decoupled from our model training code for better readability and modularity. PyTorch provides two data primitives: torch.utils.data.DataLoader and torch.utils.data.Dataset that allow you to use pre-loaded datasets as well as your own data. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.

[LightningDataModule](https://pytorch-lightning.readthedocs.io/en/latest/extensions/datamodules.html#datamodules): A datamodule is a shareable, reusable class that encapsulates the five steps involved in data processing in PyTorch:

1. Download / tokenize / process.
2. Clean and (maybe) save to disk.
3. Load inside Dataset.
4. Apply transforms (rotate, tokenize, etc…).
5. Wrap inside a DataLoader.
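
To make the split between "storing samples" and "serving batches" concrete, here is a framework-free sketch. A map-style dataset only needs `__len__` and `__getitem__`, and a loader groups samples into batches and hands them to a collate function. `ToyDataset`, `iterate_batches` and the toy rows are hypothetical names used only for illustration; in the project, `torch.utils.data.Dataset` and `torch.utils.data.DataLoader` play these roles.

```python
class ToyDataset:
    """Map-style dataset: stores samples and applies an optional transform once."""

    def __init__(self, rows, transform=None):
        self.data = [transform(row) if transform else row for row in rows]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


def iterate_batches(dataset, batch_size, collate_fn):
    """Yield collated batches in order (a stand-in for a DataLoader without shuffling)."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        samples = [dataset[i] for i in range(start, stop)]
        yield collate_fn(samples)


def to_features(row):
    # Transform step: turn a raw row into model-ready features
    return {"n_chars": len(row["text"]), "label": row["label"]}


def collate(samples):
    # Collate step: turn a list of samples into one batch of columns
    return {
        "n_chars": [s["n_chars"] for s in samples],
        "label": [s["label"] for s in samples],
    }


rows = [
    {"text": "so happy", "label": 1},
    {"text": "so sad", "label": 0},
    {"text": "meh", "label": 0},
]
dataset = ToyDataset(rows, transform=to_features)
batches = list(iterate_batches(dataset, batch_size=2, collate_fn=collate))
print(batches)
```

The model code never sees raw rows, only collated batches, which is the decoupling the data primitives are meant to provide.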


In [6]:
###########
# DATASET #
###########

class EmpatheticDataset(tfdata.Dataset):
    def __init__(self, data_dir: str = 'classification_data', split: str = 'train', transform=None):
        self.data_dir = data_dir
        self.dataset_url = 'https://dl.fbaipublicfiles.com/parlai/empatheticdialogues/empatheticdialogues.tar.gz'
        self.directory_name = 'empatheticdialogues'
        self.transform = transform
        # Check if the dataset directory already exists to avoid re-downloading
        if not os.path.exists(self.data_dir):
            os.makedirs(self.data_dir, exist_ok=True)
            # Download the dataset using wget
            subprocess.run(['wget', '-q', self.dataset_url, '-O', 'empatheticdialogues.tar.gz'])
            # Extract the dataset
            subprocess.run(['tar', '-xvf', 'empatheticdialogues.tar.gz', '-C', self.data_dir])
            # Remove the tar file to clean up
            os.remove('empatheticdialogues.tar.gz')
        else:
            print("Dataset already downloaded and extracted.")
        train_data_url = f"{self.data_dir}/{self.directory_name}/train.csv"
        val_data_url = f"{self.data_dir}/{self.directory_name}/valid.csv"
        test_data_url = f"{self.data_dir}/{self.directory_name}/test.csv"

        ######################################
        ### START HERE
        ######################################
        ## Instantiate the label encoder and fit it with training context
          # Hint: Data is never perfect. We might want to skip some rows. Pandas allows us to do that easily.
          # See the on_bad_lines parameter here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
        self.label_encoder = LabelEncoder()
        self.label_encoder.fit(pd.read_csv(train_data_url, on_bad_lines='skip')['context'])

        if split == 'train':
            self.pd_data = pd.read_csv(train_data_url, on_bad_lines='skip')
        elif split == 'val':
            self.pd_data = pd.read_csv(val_data_url, on_bad_lines='skip')
        elif split == 'test':
            self.pd_data = pd.read_csv(test_data_url, on_bad_lines='skip')
        else:
            raise ValueError("Invalid split. Must be one of 'train', 'val', or 'test'.")

        # Add a new column for the index of the context
        self.pd_data['context_idx'] = self.label_encoder.transform(self.pd_data['context'])

        ######################################
        ### END HERE
        ######################################


        # Apply the transform function to the data
        self.data = [self.transform(item) if self.transform else item for _, item in tqdm(list(self.pd_data.iterrows()), "Applying dataset transform")]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

Let's poke around our dataset a little!

In [7]:
# Show the internal Pandas DataFrame within the EmpatheticDataset
sample_dataset = EmpatheticDataset()
sample_dataset.pd_data

Applying dataset transform: 100%|██████████| 76668/76668 [00:00<00:00, 2571563.71it/s]


Unnamed: 0,conv_id,utterance_idx,context,prompt,speaker_idx,utterance,selfeval,tags,context_idx
0,hit:0_conv:1,1,sentimental,I remember going to the fireworks with my best...,1,I remember going to see the fireworks with my ...,5|5|5_2|2|5,,28
1,hit:0_conv:1,2,sentimental,I remember going to the fireworks with my best...,0,Was this a friend you were in love with_comma_...,5|5|5_2|2|5,,28
2,hit:0_conv:1,3,sentimental,I remember going to the fireworks with my best...,1,This was a best friend. I miss her.,5|5|5_2|2|5,,28
3,hit:0_conv:1,4,sentimental,I remember going to the fireworks with my best...,0,Where has she gone?,5|5|5_2|2|5,,28
4,hit:0_conv:1,5,sentimental,I remember going to the fireworks with my best...,1,We no longer talk.,5|5|5_2|2|5,,28
...,...,...,...,...,...,...,...,...,...
76663,hit:12424_conv:24848,5,sentimental,I found some pictures of my grandma in the att...,389,Yeah reminds me of the good old days. I miss ...,5|5|5_5|5|5,,28
76664,hit:12424_conv:24849,1,surprised,I woke up this morning to my wife telling me s...,294,I woke up this morning to my wife telling me s...,5|5|5_5|5|5,,29
76665,hit:12424_conv:24849,2,surprised,I woke up this morning to my wife telling me s...,389,Oh hey that's awesome! That is awesome right?,5|5|5_5|5|5,,29
76666,hit:12424_conv:24849,3,surprised,I woke up this morning to my wife telling me s...,294,It is soooo awesome. We have been wanting a b...,5|5|5_5|5|5,,29


Let's explore the label encoder in the data module.  It should be able to convert the string labels to integers and vice versa.

In [8]:
print(f'Label encoder classes: {sample_dataset.label_encoder.classes_}')
print(f'Label for "sad": {sample_dataset.label_encoder.transform(["sad"])}')
print(f'Label for "hopeful": {sample_dataset.label_encoder.transform(["hopeful"])}')
print(f'Label for "angry": {sample_dataset.label_encoder.transform(["angry"])}')

Label encoder classes: ['afraid' 'angry' 'annoyed' 'anticipating' 'anxious' 'apprehensive'
 'ashamed' 'caring' 'confident' 'content' 'devastated' 'disappointed'
 'disgusted' 'embarrassed' 'excited' 'faithful' 'furious' 'grateful'
 'guilty' 'hopeful' 'impressed' 'jealous' 'joyful' 'lonely' 'nostalgic'
 'prepared' 'proud' 'sad' 'sentimental' 'surprised' 'terrified' 'trusting']
Label for "sad": [27]
Label for "hopeful": [19]
Label for "angry": [1]
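
The indices above follow alphabetical order because scikit-learn's `LabelEncoder` stores the sorted unique labels in `classes_` and uses each label's position as its integer code. A minimal plain-Python sketch of the same mapping, using only a handful of labels (so the indices differ from the 32-class output above):

```python
labels = ["sad", "hopeful", "angry", "sad", "afraid"]

# Sorted unique labels play the role of label_encoder.classes_
classes = sorted(set(labels))
to_index = {name: i for i, name in enumerate(classes)}

# Equivalent of transform(): string label -> integer
encoded = [to_index[name] for name in labels]

# Equivalent of inverse_transform(): integer -> string label
decoded = [classes[i] for i in encoded]

print(classes)
print(encoded)
print(decoded)
```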


### Transform
Now, let's create a transform to extract the context and the utterance from the dataset.

The columns we care about are:
1. "context": This is the emotion we're trying to predict (this has already been converted to a number using the dataset label encoder)
2. "prompt" and "utterance": We'll combine these sentences and use them as input

In [11]:
def transform_mean_vectors(item):
    # Combine 'prompt' and 'utterance' into a single string
    input_string = item['prompt'] + ' ' + item['utterance']

    ######################################
    ### START HERE
    ######################################

    # Create a Document from the input string using spaCy
    doc = nlp(input_string)

    # Compute the average vector representation of the input string
    x = np.mean([token.vector for token in doc if token.has_vector], axis=0)

    ######################################
    ### END HERE
    ######################################

    # Retrieve the context label
    y = item['context_idx']  # This column was added in the EmpatheticDataset class __init__ method

    return {
        'input_string': input_string,  # This is useful for visualization/debugging
        'sentiment': item['context'],  # This is useful for visualization/debugging
        'input_vector': x,
        'sentiment_idx': y
    }
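
The averaging step can be checked on toy vectors. The sketch below uses NumPy only and made-up 3-dimensional vectors instead of spaCy's 300-dimensional ones. It also adds a zero-vector fallback for inputs where no token has a vector, since `np.mean` over an empty list returns `nan`; that fallback is an assumption for illustration, not something the transform above does.

```python
import numpy as np

def mean_vector(token_vectors, dim):
    """Average token vectors; fall back to zeros when there is nothing to average."""
    if len(token_vectors) == 0:
        # Assumed fallback: an all-zero vector keeps the output shape consistent
        return np.zeros(dim, dtype=np.float32)
    return np.mean(np.asarray(token_vectors, dtype=np.float32), axis=0)

# Two toy "word vectors" of dimension 3 (hypothetical values)
toy_vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
pooled = mean_vector(toy_vectors, dim=3)
empty = mean_vector([], dim=3)

print(pooled)  # element-wise mean of the two vectors
print(empty)
```

Whatever the length of the input text, the result always has the same dimension as a single word vector, which is what lets a fixed-size linear layer consume it.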

### Collate
Next, we'll create a collate function that stacks the per-sample input vectors and emotion labels into batch tensors.

In [12]:
def collate_fn(batch):
    # A torch data loader will combine a list of samples into a batch
    # This function will be used to process the batch
    # batch is a list of the outputs from the __getitem__ method of the EmpatheticDataset class
    # Separate the batch into input and target, then convert to tensors

    input_tensors = torch.stack([torch.tensor(item['input_vector']) for item in batch])

    ############################
    ### START HERE
    ############################

    # Create the target tensors
    target_tensors = torch.stack([torch.tensor(item['sentiment_idx']) for item in batch])

    ############################
    ### END HERE
    ############################

    return {
        'input': input_tensors,
        'target': target_tensors,
        'input_string': [item['input_string'] for item in batch], # This is useful for visualization/debugging
        'sentiment': [item['sentiment'] for item in batch] # This is useful for visualization/debugging
    }

In [13]:
class EmpatheticDialoguesDataModule(L.LightningDataModule):
    def __init__(self, batch_size=32, collate_fn=None, transform=None):
        super().__init__()
        self.batch_size = batch_size
        self.collate_fn = collate_fn
        self.transform = transform

    def prepare_data(self):
        # This downloads the dataset and prepares it
        # It doesn't save anything to the data module, just prepares the dataset
        EmpatheticDataset()
        self.label_encoder = EmpatheticDataset().label_encoder

    def setup(self, stage):
        # Retrieve the dataset from disk
        self.train_dataset = EmpatheticDataset(split='train', transform=self.transform)
        self.val_dataset = EmpatheticDataset(split='val', transform=self.transform)
        self.test_dataset = EmpatheticDataset(split='test', transform=self.transform)

    def train_dataloader(self):
        return tfdata.DataLoader(self.train_dataset,
                               batch_size=self.batch_size,
                               collate_fn=self.collate_fn,
                               num_workers=0,
                               shuffle=True)

    def val_dataloader(self):
        return tfdata.DataLoader(self.val_dataset,
                               batch_size=self.batch_size,
                               collate_fn=self.collate_fn,
                               num_workers=0)

    def test_dataloader(self):
        return tfdata.DataLoader(self.test_dataset,
                               batch_size=self.batch_size,
                               collate_fn=self.collate_fn,
                               num_workers=0)

### Classifier Module

We've now created the DataLoader and Datasets we'll use throughout the project. It is time to write the training and testing code via a `LightningModule`.

[LightningModule](https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html): organizes your PyTorch code into 5 sections

1. Computations (init).
2. Train loop (training_step)
3. Validation loop (validation_step)
4. Test loop (test_step)
5. Optimizers (configure_optimizers)

In [14]:
class EmotionClassifier(L.LightningModule):
  def __init__(self, model, batch_size, learning_rate, num_classes):
      super().__init__()
      self.model = model
      self.batch_size = batch_size
      self.learning_rate = learning_rate
      self.accuracy = MulticlassAccuracy(num_classes=num_classes)

  def training_step(self, batch, batch_idx):
    x = batch["input"]
    y = batch["target"]
    y_hat = self.model(x)
    loss = F.cross_entropy(y_hat, y)
    self.log_dict(
      {'train_loss': loss},
      batch_size=self.batch_size,
      prog_bar=True
    )
    return loss

  def validation_step(self, batch, batch_nb):
    x = batch["input"]
    y = batch["target"]
    y_hat = self.model(x)
    val_loss = F.cross_entropy(y_hat, y)
    predictions = torch.argmax(y_hat, dim=1)
    self.log_dict(
        {
          'val_loss': val_loss,
          'val_accuracy': self.accuracy(predictions, y)
        },
        batch_size=self.batch_size,
        prog_bar=True
      )
    return val_loss

  def test_step(self, batch, batch_nb):
    ######################################
    ### START HERE
    ######################################
    # Write the test step of the EmotionClassifier

    x = batch["input"]
    y = batch["target"]
    y_hat = self.model(x)
    test_loss = F.cross_entropy(y_hat, y)
    predictions = torch.argmax(y_hat, dim=1)

    ######################################
    ### END HERE
    ######################################
    self.log_dict(
        {
          'test_loss': test_loss,
          'test_accuracy': self.accuracy(predictions, y)
        },
        batch_size=self.batch_size,
        prog_bar=True
      )
    return test_loss

  def predict_step(self, batch, batch_idx):
    y_hat = self.model(batch["input"])
    predictions = torch.argmax(y_hat, dim=1)
    return {'logits':y_hat, 'predictions': predictions, 'sentiment_labels': batch["sentiment"], 'input_string': batch['input_string']}

  def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
    return optimizer
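
To see what the loss and the prediction in each step compute, here is a small plain-Python sketch for a single sample. `F.cross_entropy` applies a softmax to the logits and takes the negative log-probability of the target class (averaged over the batch by default), while `torch.argmax` picks the class with the largest logit. The helper names and logit values below are made up for illustration.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target]

def argmax(values):
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [2.0, 0.5, -1.0]  # hypothetical scores for 3 classes
loss_if_correct = cross_entropy(logits, target=0)
loss_if_wrong = cross_entropy(logits, target=2)
prediction = argmax(logits)

# A model that scores all 32 classes equally has loss ln(32)
uniform_loss = cross_entropy([0.0] * 32, target=5)

print(prediction, loss_if_correct, loss_if_wrong, uniform_loss)
```

The loss is small when the target class already has the highest logit and grows as the target's logit falls behind the others.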

# Models

Are we building models yet? Finally the time has come to build our baseline model and then we'll work towards improving it.

### Model 1: Average word vector of the sentence -- Baseline
##### <font color='red'>Expected accuracy: ~29 - 32%</font>

Let's build our first simple word2vec based model we'll use as our baseline.

Here we've three key pieces:

1. *WordVectorClassificationModel*: A simple linear model with a single linear layer that maps the input word2vec dimensions (300) to the output classes (32), giving a very simple classifier.

In [15]:
class WordVectorClassificationModel(torch.nn.Module):
  def __init__(self, word_vec_dimension, num_classes):
    super().__init__()
    self.linear_layer = torch.nn.Linear(word_vec_dimension, num_classes)

  def forward(self, batch):
    """Projection from word_vec_dim to n_classes

    Batch is of shape (batch_size, word_vec_dim): one averaged vector per input
    """
    return self.linear_layer(batch)
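
As a sanity check on shapes and size, the sketch below reproduces what a linear layer computes, `x @ W.T + b` with `W` of shape `(num_classes, word_vec_dimension)`, using NumPy and random stand-in weights. The variable names mirror the model above, but the values are random and for illustration only.

```python
import numpy as np

word_vec_dimension, num_classes, batch_size = 300, 32, 4
rng = np.random.default_rng(0)

# Stand-ins for the layer's weight and bias (random, not trained)
weight = rng.normal(size=(num_classes, word_vec_dimension))
bias = np.zeros(num_classes)

# A batch of averaged word vectors: one 300-d vector per input text
batch = rng.normal(size=(batch_size, word_vec_dimension))

# Same computation as a linear layer: one score (logit) per class
logits = batch @ weight.T + bias

num_params = weight.size + bias.size
print(logits.shape)  # (batch_size, num_classes)
print(num_params)    # 300 * 32 weights + 32 biases
```

The parameter count works out to 300 * 32 + 32 = 9,632, which is the "9.6 K" that a Lightning model summary reports for this model.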

### Trainer
Now, let's use the Lightning [`Trainer`](https://lightning.ai/docs/pytorch/latest/common/trainer.html) to train our model.

In [16]:
def train(
    model,
    transform,
    collate_fn,
    batch_size=32,
    max_epochs=4,
    learning_rate=0.001,
):

    # Create a Lightning trainer
    trainer = L.Trainer(max_epochs=max_epochs, check_val_every_n_epoch=1)

    #############################
    ## START HERE
    #############################


    # Initialize our data module with the passed transform and collate function
    data_module = EmpatheticDialoguesDataModule(collate_fn=collate_fn,
                                                transform=transform,
                                                batch_size=batch_size)

    data_module.prepare_data()
    data_module.setup('fit')

    # Instantiate a new lightning module
    module = EmotionClassifier(model,
                               batch_size=batch_size,
                               learning_rate=learning_rate,
                               num_classes=data_module.label_encoder.classes_.size)

    # Train and validate the model
    trainer.fit(module, data_module.train_dataloader(), data_module.val_dataloader())

    # Test the model
    trainer.test(module,  data_module.test_dataloader())

    # Predict on the same test set to show some output
    output = trainer.predict(module, data_module.test_dataloader())

    #############################
    ## END HERE
    #############################

    # Un-collate the output
    uncollated_output = {}
    for batch_output in output:
        for k, v in batch_output.items():
            if k not in uncollated_output:
                uncollated_output[k] = []
            uncollated_output[k].extend(v)

    return uncollated_output, data_module.label_encoder

In [17]:
uncollated_output, label_encoder = train(
    model=WordVectorClassificationModel(word_vec_dimension=300, num_classes=32),
    transform=transform_mean_vectors,
    collate_fn=collate_fn,
    batch_size=256,
    max_epochs=4,
    learning_rate=0.001,
)

# Feel free to explore the uncollated_output dictionary to see the predictions
for _ in range(5):
    # Randomly select a sample from the test set
    i = np.random.randint(len(uncollated_output['input_string']))
    print(f'Input: {uncollated_output["input_string"][i]}')
    print(f'Predicted label: {label_encoder.inverse_transform([uncollated_output["predictions"][i].item()])}')
    print(f'Actual label: {uncollated_output["sentiment_labels"][i]}')
    print()

INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs


Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 76668/76668 [00:00<00:00, 2516779.36it/s]


Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 76668/76668 [00:00<00:00, 2416612.05it/s]


Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 76668/76668 [14:34<00:00, 87.65it/s]


Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 6318/6318 [01:17<00:00, 81.65it/s]


Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 5701/5701 [01:12<00:00, 78.73it/s]
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name     | Type                          | Params | Mode 
-------------------------------------------------------------------
0 | model    | WordVectorClassificationModel | 9.6 K  | train
1 | accuracy | MulticlassAccuracy            | 0      | train
-------------------------------------------------------------------
9.6 K     Trainable params
0         Non-trainable params
9.6 K     Total params
0.039     Total estimated model params size (MB)
3         Modules in train mode
0         Modules in eval mode

INFO: `Trainer.fit` stopped: `max_epochs=4` reached.

Input: hopeful: I know things will work out in God's perfect timing Yes I am. For everything we want for the baby to be gifted to us lol
Predicted label: ['hopeful']
Actual label: hopeful

Input: I lost my temper with my daughter the other day and was a bit ill with her. I felt so bad that I had raised my voice and hurt her feelings. My daughter caught me in a bad mood the other day and was being a bit argumentative. I lost my temper_comma_ yelled and really hurt her feelings. I felt bad that it happened.
Predicted label: ['guilty']
Actual label: ashamed

Input: While I didn't know him personally_comma_ I was quite distressed to hear of the passing of Anthony Bourdain. That a man with such talent who has given so much joy to so many people_comma_ should feel he has to take his life is just soul-destroying. It is. I read his book Kitchen Confidential. He's been battling addiction and demons all his life. I thought he'd found peace_comma_ but I guess not.
Predicted label: ['impressed']
A

🎉🎉🎉 WE HAVE OUR FIRST TEXT CLASSIFIER 🎉🎉🎉

Report the accuracy below. Document an example input, predicted label and actual label of an incorrect prediction from the model. Considering that the word2vec model is context-free, why do you think the model got this wrong?

<font color='red'>Accuracy:</font> 32.09%

<font color='red'>Input of incorrect prediction:</font> While I didn't know him personally_comma_ I was quite distressed to hear of the passing of Anthony Bourdain. That a man with such talent who has given so much joy to so many people_comma_ should feel he has to take his life is just soul-destroying. It is. I read his book Kitchen Confidential. He's been battling addiction and demons all his life. I thought he'd found peace_comma_ but I guess not.


<font color='red'>Predicted label:</font> impressed

<font color='red'>Actual label:</font> furious

<font color='red'>Why do you think the model got it wrong?</font> Predictions are based on word2vec embeddings, which are context-free: a word gets the same vector representation regardless of context. This input mentions both Anthony Bourdain's **accomplishments** and his **passing**. The model may have picked up on words like **talent** and **joy**, which are generally associated with positive emotions, and that may have led it to categorise the input as **impressed**.
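
One way to see the limitation of averaged context-free vectors is that they ignore word order entirely. In the sketch below, two sentences with opposite meanings get exactly the same averaged representation. The 2-dimensional vectors are made-up toy values, not real word2vec embeddings.

```python
# Hypothetical 2-d "word vectors"; each word has one fixed vector regardless of context
toy_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [0.5, 0.5],
}

def mean_embedding(sentence):
    """Average the fixed per-word vectors of a whitespace-tokenized sentence."""
    vectors = [toy_vectors[word] for word in sentence.split()]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

a = mean_embedding("dog bites man")
b = mean_embedding("man bites dog")

print(a)
print(b)
print(a == b)  # same bag of words, so the same averaged vector
```

A classifier that only sees this average cannot tell the two sentences apart, and in the same way it cannot tell whether "talent" and "joy" appear in a celebration or in a eulogy.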


### Assignment Model 2: Fine-tuning a Transformer
---- TO BE COMPLETED
##### <font color='red'>Expected accuracy: ~30-35%</font>

In this assignment, we will move from context-free word embeddings to contextualized embeddings produced by a transformer-based language model, BERT - [developed by Google in 2018](https://research.google/pubs/bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding/). BERT is an encoder-only transformer; the same transformer architecture, mostly in decoder-only form, underlies modern LLMs (e.g. GPT, Bard, Claude, Llama, Mistral).

We will fine-tune BERT to directly handle the entire input text. Here's how we will implement this:
1. Tokenization: Use BERT's tokenizer to split the text into tokens.
2. Input Formatting: Convert tokens to the format expected by BERT, including adding special tokens and padding/truncating to a fixed length.
3. Model Fine-tuning: Use BERT's pre-trained model and fine-tune it on our classification task.
4. Evaluation: Measure the model's performance and compare it to our baseline.
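
The input formatting in step 2 can be sketched without any library. The example below wraps tokens in `[CLS]` and `[SEP]`, truncates or pads to a fixed length, and builds the attention mask that marks real tokens versus padding. The whitespace tokenization, the tiny vocabulary and the `format_input` helper are assumptions for illustration; the real `BertTokenizer` uses WordPiece sub-word tokens and its own vocabulary ids.

```python
def format_input(tokens, vocab, max_length):
    """Add special tokens, truncate, pad, and build the attention mask."""
    body = tokens[: max_length - 2]  # leave room for [CLS] and [SEP]
    pieces = ["[CLS]"] + body + ["[SEP]"]
    input_ids = [vocab.get(token, vocab["[UNK]"]) for token in pieces]
    attention_mask = [1] * len(input_ids)  # 1 = real token

    padding = max_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding        # 0 = padding, ignored by attention
    return input_ids, attention_mask

# Hypothetical toy vocabulary
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "i": 4, "miss": 5, "her": 6}

# Short input: padded up to max_length
ids, mask = format_input("i miss her".split(), toy_vocab, max_length=8)
print(ids, mask)

# Long input: truncated down to max_length
long_ids, long_mask = format_input("i miss her so much today".split(), toy_vocab, max_length=5)
print(long_ids, long_mask)
```

Every sequence in a batch ends up with the same length, which is what allows the inputs to be stacked into a single tensor.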

Does this model perform better than our baseline? Why do you think that is?

In [18]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments

In [19]:
# Assuming the EmpatheticDataset class and sample_dataset are already defined
train_dataset_raw = EmpatheticDataset(split='train').pd_data
test_dataset_raw = EmpatheticDataset(split='test').pd_data


train_dataset_raw['input'] = train_dataset_raw['prompt'] + ' ' + train_dataset_raw['utterance']
test_dataset_raw['input'] = test_dataset_raw['prompt'] + ' ' + test_dataset_raw['utterance']

# Keep only the input text and label columns, renamed to 'text' and 'label'
train_df = train_dataset_raw[['input', 'context_idx']].rename(columns={'input': 'text', 'context_idx': 'label'})
test_df = test_dataset_raw[['input', 'context_idx']].rename(columns={'input': 'text', 'context_idx': 'label'})

Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 76668/76668 [00:00<00:00, 2463506.54it/s]


Dataset already downloaded and extracted.


Applying dataset transform: 100%|██████████| 5701/5701 [00:00<00:00, 1192307.51it/s]


Note: We downsample the dataset because transformers like BERT are generally larger and take longer to process. We recommend an 80/20 training to test ratio where the training data size is between 5000-8000.

In [20]:
#############################
## START HERE
#############################

# Downsample the training set to 6000 samples
train_df_sampled = train_df.sample(n=6000, random_state=42)

# Downsample the test set to 1500 samples (20% of the combined 7500, i.e. an 80/20 split)
test_df_sampled = test_df.sample(n=1500, random_state=42)

#############################
## END HERE
#############################

In [21]:
# Load tokenizer and model
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=len(sample_dataset.label_encoder.classes_))

# Tokenize function
def tokenize_function(text):
    return tokenizer(text, padding='max_length', truncation=True, return_tensors='pt')

# Tokenize the datasets
train_encodings = tokenize_function(train_df_sampled['text'].tolist())
test_encodings = tokenize_function(test_df_sampled['text'].tolist())

# Convert to PyTorch datasets
class TextDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

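    def __getitem__(self, idx):
        # The encodings are already tensors (return_tensors='pt'), so copy them with
        # clone().detach() instead of torch.tensor(), which raises a UserWarning on tensors
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item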

    def __len__(self):
        return len(self.labels)

train_labels = train_df_sampled['label'].tolist()
test_labels = test_df_sampled['label'].tolist()

train_dataset = TextDataset(train_encodings, train_labels)
test_dataset = TextDataset(test_encodings, test_labels)



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',  # renamed to eval_strategy in newer transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=1,
    weight_decay=0.01,
    report_to="none"
)



In [25]:
# !pip install evaluate
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred holds the raw logits and the reference labels
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)



This cell may take about 5-10 minutes to run on a GPU. While you wait, it is a good opportunity to learn about the Trainer API that the cell uses. Hugging Face has prepared a helpful video on it [here](https://www.youtube.com/watch?v=nvBXf7s7vTI).

Note: If you're able to train and evaluate the model, answer Question 1 below. If you're unable to train the model because Google is not providing you with a GPU or training is taking 10+ minutes, proceed to Question 2 instead.

In [26]:
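# Check whether a GPU is available. This check is informational only:
# Trainer moves the model and batches to the available device itself.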
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(torch.cuda.current_device())}")
else:
    device = torch.device("cpu")
    print("Using CPU")

trainer.train()
eval_results = trainer.evaluate()
print(f"Overall Accuracy: {eval_results['eval_accuracy']}")


Using GPU: Tesla T4




| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 3.184814 | 0.115333 |




Overall Accuracy: 0.11533333333333333
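
An accuracy figure is easier to interpret next to a simple reference point. The sketch below is not part of the original notebook: it computes the accuracy of always predicting the most frequent training label, using a hypothetical helper `majority_class_accuracy` and made-up toy labels. In the notebook, `train_labels` and `test_labels` could be passed in instead.

```python
from collections import Counter

def majority_class_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return majority_label, hits / len(test_labels)

# Toy labels for illustration only (not taken from the dataset)
toy_train = [0, 0, 0, 1, 2]
toy_test = [0, 1, 2, 0]

baseline_label, baseline_acc = majority_class_accuracy(toy_train, toy_test)
print(f"Majority label: {baseline_label}, baseline accuracy: {baseline_acc:.2f}")
```

If the fine-tuned model does not clearly beat this baseline, more training data or more epochs are probably needed before drawing conclusions from individual predictions.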


In [27]:
import random

# Run the model on the full sampled test set
predictions = trainer.predict(test_dataset)
# Take the highest-scoring class for each example
preds = torch.argmax(torch.tensor(predictions.predictions), dim=-1)

# Pick 5 random examples from the sampled test set to inspect
random_indices = random.sample(range(len(test_df_sampled)), 5)

for idx in random_indices:
    input_text = test_df_sampled.iloc[idx]['text']
    actual_label = test_df_sampled.iloc[idx]['label']
    predicted_label = preds[idx].item()
    print(f"Input Text: {input_text}")
    print(f"Actual Label: {sample_dataset.label_encoder.inverse_transform([actual_label])}")
    print(f"Predicted Label: {sample_dataset.label_encoder.inverse_transform([predicted_label])}")
    print("-" * 50)




Input Text: I feel sad when I see people squandering their potential or underestimating themselves. That's perfectly natural. You sound like the kind of person_comma_ though_comma_ that quickly regained their bearings. Disappointments like that usually come with some *good* lessons.
Actual Label: ['sad']
Predicted Label: ['nostalgic']
--------------------------------------------------
Input Text: One time a hurricane hit my city and it was so windy I couldn't sleep that night. I thought the wind would tear the roof off of the house It damaged the roof and the fence. We lost power for six days during the first hurricane I experienced
Actual Label: ['afraid']
Predicted Label: ['afraid']
--------------------------------------------------
Input Text: I took my car to my mechanic_comma_ who I thought I could trust. He quoted me $1000 in parts and labor. I took it to another mechanic and he fixed it on the spot for $100. You just can't trust people any more. My mechanic told me it was going 


## Question 1 (if able to train):


Report the accuracy below if you were able to run the model. Document an example input, predicted label, and actual label for a correct prediction from the model. Given that the transformer model is contextual, why do you think it got this example right? Do you think the earlier context-free model would have gotten it right?

<font color='red'>Accuracy:</font> 11.5%

<font color='red'>Input text of correct prediction:</font> One time a hurricane hit my city and it was so windy I couldn't sleep that night. I thought the wind would tear the roof off of the house It damaged the roof and the fence. We lost power for six days during the first hurricane I experienced.

<font color='red'>Predicted label:</font> afraid

<font color='red'>Actual label:</font> afraid

<font color='red'>Why do you think the model got this right? </font> BERT is designed to capture the context in which words are used. In this example, the model plausibly identifies the emotion "afraid" because it reads "hurricane," "windy," "tear the roof off," and "lost power" together as a description of a frightening event, rather than scoring each word in isolation.


<font color='red'> Do you think model 1 would have gotten this correct? Why or why not? </font> It is less likely to get this right. The word2vec model is context-free, and a word like "wind" also appears in pleasant situations, so its vector is not specifically associated with fear.
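
To make "context-free" concrete, here is a minimal sketch (not part of the original notebook) showing that an averaged-embedding representation ignores word order. The 3-dimensional vectors in `toy_vectors` and the helper `average_embedding` are hypothetical and for illustration only; real Word2Vec vectors are much larger.

```python
import numpy as np

# Hypothetical toy embeddings, chosen arbitrarily for illustration
toy_vectors = {
    "i": np.array([0.1, 0.0, 0.2]),
    "was": np.array([0.0, 0.1, 0.0]),
    "not": np.array([-0.3, 0.2, 0.1]),
    "happy": np.array([0.8, 0.1, -0.2]),
    "scared": np.array([-0.6, 0.9, 0.4]),
}

def average_embedding(text, vectors):
    """Average the vectors of all known tokens, as in the baseline model."""
    tokens = [t for t in text.lower().split() if t in vectors]
    dim = len(next(iter(vectors.values())))
    if not tokens:
        return np.zeros(dim)
    return np.mean([vectors[t] for t in tokens], axis=0)

# Same words, opposite meaning
a = average_embedding("I was not happy I was scared", toy_vectors)
b = average_embedding("I was not scared I was happy", toy_vectors)

order_invariant = np.allclose(a, b)
print(f"Identical averaged embeddings: {order_invariant}")
```

Because both sentences contain the same words, the averaged vectors match even though the emotions differ. A contextual model such as BERT produces different representations for the two sentences.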

## Question 2 (if unable to train):
Go to this pre-trained emotion text classifier model on HuggingFace:
https://huggingface.co/michellejieli/emotion_text_classifier

Take example inputs from model 1 that failed, insert them into the text classification field under the Inference API widget on the right side and click compute.
![picture](https://drive.google.com/uc?id=19H-djr9Vwyv3VQjOXweHwggWX3pmE9sE)

Is there an example where this text classifier detects the emotion better than the word2vec model? Share it below.
Note: the emotion labels for this model are slightly different. They are: joy, surprise, neutral, anger, fear, sadness, and disgust.


<font color='red'>Input text:</font>

<font color='red'>Predicted label:</font>

<font color='red'>Actual label:</font>

<font color='red'>Why do you think this model got this right/more right than the word2vec model?</font>


🎉 CONGRATS on finishing the assignment! Now is a good time to pause and reflect on how much progress we've made in understanding word vectors, reading some PyTorch code, and building our first model. But hey, don't stop here: there is a lot more to do and play with in the next sections.

# Bonus

Here are a few bonus assignments 🤓
- Visualize a [confusion matrix](https://torchmetrics.readthedocs.io/en/latest/references/functional.html#confusion-matrix-func) of N*N of actual class vs predicted class (N = number of classes) for both models. Does one model perform better than the other at predicting a certain class? Why do you think that is?
- Try decreasing and increasing the dataset size and changing the hyperparameters in the Trainer (learning rate, epochs, batch size, etc.). How does that impact the training time and accuracy of each model?
- Try fine-tuning GPT-4o mini using the following [documentation](https://platform.openai.com/docs/guides/fine-tuning). How does its performance compare to BERT?
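
As a starting point for the confusion matrix bonus, here is a minimal NumPy sketch (one possible approach, not the torchmetrics function linked above). The helpers `confusion_matrix` and `per_class_precision_recall` and the toy labels are hypothetical. In the notebook, something like `test_labels` and `preds.tolist()` could be passed in, assuming the number of classes is known.

```python
import numpy as np

def confusion_matrix(actual, predicted, num_classes):
    """Rows are actual classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for a, p in zip(actual, predicted):
        cm[a, p] += 1
    return cm

def per_class_precision_recall(cm):
    """Derive precision and recall for each class from the matrix."""
    tp = np.diag(cm).astype(float)
    predicted_totals = cm.sum(axis=0)  # column sums: times each class was predicted
    actual_totals = cm.sum(axis=1)     # row sums: times each class actually occurred
    precision = np.divide(tp, predicted_totals,
                          out=np.zeros_like(tp), where=predicted_totals > 0)
    recall = np.divide(tp, actual_totals,
                       out=np.zeros_like(tp), where=actual_totals > 0)
    return precision, recall

# Toy labels for illustration only (3 classes)
toy_actual = [0, 0, 1, 1, 2, 2]
toy_predicted = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(toy_actual, toy_predicted, num_classes=3)
precision, recall = per_class_precision_recall(cm)
print(cm)
print("Precision per class:", precision)
print("Recall per class:", recall)
```

Building one matrix per model and comparing them row by row shows which emotions each model confuses most often.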
