# Binary sentiment classification with BERT uncased model
For this project, I'm going to fine-tune [a pretrained BERT uncased model](https://huggingface.co/bert-base-uncased) shared on the Hugging Face Hub on the [IMDB movie reviews dataset](https://huggingface.co/datasets/imdb) and use the fine-tuned model to classify the sentiment of movie reviews.

Most of the functions used come from **Hugging Face's Transformers library**, and the deep learning framework used is **PyTorch**.

## 1. Download the IMDB movie reviews dataset and process the data

### 1.1 Download the dataset from the Hugging Face Hub
Downloading a dataset shared on the Hub is made very simple with the `Datasets` library, which downloads and caches the dataset.

In [None]:
# Download the IMDB dataset with load_dataset()
from datasets import load_dataset, Dataset, DatasetDict

raw_datasets = load_dataset('imdb')

The IMDB dataset has **three splits**: train, test, and unsupervised. For this project, I will only train and evaluate on the train and test splits.

Each split has **two fields**, text and label:
- Text contains movie reviews
- Label is a binary value, with 0 indicating a negative review and 1 indicating a positive review.

In [None]:
# IMDB dataset has 3 splits: train, test, unsupervised
# Each split only has two fields
# For this project, I only need train and test
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [None]:
# Keep 1000 records from the unsupervised dataset
# Use the fine-tuned model to classify the sentiment of these reviews later
# Randomly select 1000 records
import random
random.seed(42)
random_1000 = [random.randint(0, 49999) for _ in range(1000)]
unsupervised_reviews = Dataset.from_dict(raw_datasets['unsupervised'][random_1000])
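
Note that `random.randint()` draws indices with replacement, so the index list above may contain a few repeated rows. If distinct reviews are preferred, one option is `random.sample()`, sketched below. The sizes are assumptions that mirror the unsupervised split, and switching to this approach would select different rows and therefore change the numbers reported later in this notebook.

```python
import random

# Assumed sizes mirroring the unsupervised split: 50,000 rows, keep 1,000
num_rows, num_keep = 50_000, 1_000

# Use a local generator so the global random state is left untouched
rng = random.Random(42)

# sample() draws without replacement, so every index appears at most once
unique_indices = rng.sample(range(num_rows), num_keep)

print(len(unique_indices), len(set(unique_indices)))
```

The resulting list could be passed to the dataset in place of `random_1000`.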

In [None]:
# Keep only 500 records in training and testing datasets to make training and evaluating faster
# Randomly select 500 records
random.seed(42)
random_500 = [random.randint(0, 24999) for _ in range(500)]
train_dataset = Dataset.from_dict(raw_datasets['train'][random_500])
test_dataset = Dataset.from_dict(raw_datasets['test'][random_500])
raw_datasets = DatasetDict({"train": train_dataset, "test": test_dataset})

In [None]:
# Examine raw_datasets to ensure data has been properly sliced
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 500
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 500
    })
})

In [None]:
# Check what the data looks like
raw_datasets['train'][0]

{'text': 'Arguably this is a very good "sequel", better than the first live action film 101 Dalmatians. It has good dogs, good actors, good jokes and all right slapstick! <br /><br />Cruella DeVil, who has had some rather major therapy, is now a lover of dogs and very kind to them. Many, including Chloe Simon, owner of one of the dogs that Cruella once tried to kill, do not believe this. Others, like Kevin Shepherd (owner of 2nd Chance Dog Shelter) believe that she has changed. <br /><br />Meanwhile, Dipstick, with his mate, have given birth to three cute dalmatian puppies! Little Dipper, Domino and Oddball...<br /><br />Starring Eric Idle as Waddlesworth (the hilarious macaw), Glenn Close as Cruella herself and Gerard Depardieu as Le Pelt (another baddie, the name should give a clue), this is a good family film with excitement and lots more!! One downfall of this film is that is has a lot of painful slapstick, but not quite as excessive as the last film. This is also funnier than the 

### 1.2 Process data

#### 1.2.1 Tokenize text
I'm going to fine-tune a pretrained BERT uncased model. Because transformer models can't directly process text, I need to convert movie reviews from text into numerical representations that can be fed into the model.

To accomplish this, I will be using a **tokenizer**, which handles the following tasks:
- Splitting text into words, subwords, or symbols, referred to as tokens
- Mapping tokens to integers
- Incorporating additional inputs that might be beneficial for the model
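
To make these three tasks concrete, here is a toy sketch of a WordPiece-style tokenizer in plain Python. It is not the real BERT tokenizer: the vocabulary, the token IDs, and the helper names `wordpiece` and `toy_tokenize` are all made up for illustration.

```python
import re

# Made-up toy vocabulary; the IDs are illustrative, not the real BERT IDs
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "good": 4, "film": 5, "slap": 6, "##stick": 7, "!": 8}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword tokens."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no subword matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def toy_tokenize(text, vocab=TOY_VOCAB):
    # Task 1: lowercase, then split into words and punctuation symbols
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    tokens = ["[CLS]"]
    for word in words:
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return {
        "tokens": tokens,
        "input_ids": [vocab[t] for t in tokens],  # Task 2: tokens to integers
        "attention_mask": [1] * len(tokens),      # Task 3: an additional input
    }

encoded = toy_tokenize("Good slapstick film!")
print(encoded["tokens"])
print(encoded["input_ids"])
```

In this toy vocabulary, "slapstick" is not a single entry, so it is split into the subwords `slap` and `##stick`.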

Setting up the tokenizer is straightforward using the `AutoTokenizer` class and its `from_pretrained()` method. All I need to do is specify the **checkpoint name** of the pretrained BERT uncased model, and it will load the corresponding pretrained tokenizer. Using this tokenizer will ensure that text is processed in the same manner as during the model's pretraining.

In [None]:
from transformers import AutoTokenizer

# Checkpoint name of the pretrained BERT uncased model
checkpoint = 'bert-base-uncased'

# Load the corresponding pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Define a function to tokenize movie reviews with the following setting:
- `truncation=True` to ensure that input sequences do not exceed the length limit of the BERT uncased model.

Then, use the `Dataset.map()` method to apply this function to each element of the dataset with the option:
- `batched=True`, which can expedite tokenization by applying the function to multiple elements of the dataset simultaneously.
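
As a rough picture of what `batched=True` changes, the sketch below emulates the idea in plain Python: without batching the function receives one example at a time, and with batching it receives a dictionary of lists. `toy_map` and the two word-counting functions are hypothetical stand-ins written for this illustration, not part of the `Datasets` library.

```python
def toy_map(rows, fn, batched=False, batch_size=2):
    """Simplified, hypothetical stand-in for Dataset.map() on a list of dict rows."""
    if not batched:
        # fn sees a single example: {"text": "..."}
        return [{**row, **fn(row)} for row in rows]
    mapped = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # fn sees a dict of lists: {"text": ["...", "..."]}
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(batch)
        for i, row in enumerate(chunk):
            mapped.append({**row, **{k: v[i] for k, v in new_columns.items()}})
    return mapped

def count_words_single(example):
    return {"num_words": len(example["text"].split())}

def count_words_batched(batch):
    return {"num_words": [len(text.split()) for text in batch["text"]]}

rows = [{"text": "good film"}, {"text": "not good at all"}, {"text": "fine"}]
one_by_one = toy_map(rows, count_words_single)
in_batches = toy_map(rows, count_words_batched, batched=True)
print(in_batches)
```

Both calls produce the same rows. The speed-up in the real library comes from the tokenizer processing a whole list of texts in one call.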

In [None]:
def tokenize_function(example):
    return tokenizer(example['text'], truncation=True)

The tokenized dataset includes three new fields:
- **input_ids**: Integer IDs, one for each token (word, subword, or symbol) in a sequence, looked up in the tokenizer's vocabulary. They are the numerical representation of the text that the model actually reads.

- **token_type_ids**: Numbers indicating different sentences. This becomes more valuable when classifying pairs of sentences. For binary sentiment classification, these IDs are always set to 0.

- **attention_mask**: Tensors with the exact same shape as the input IDs tensor, consisting of 0s and 1s:
    - 1s indicate the corresponding tokens should be attended to
    - 0s indicate the corresponding tokens should not be attended to

In [None]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 500
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 500
    })
})

#### 1.2.2 Dynamic padding
**Dynamic padding** pads the sequences in each batch to the length of the longest sequence in that batch, rather than padding every sequence in the dataset to one global maximum length. When fine-tuning the model, I need to pass batches of sequences through the model in tensor format. Without padding, sequences in one batch may have varying lengths, preventing the batch from being converted into a tensor.

To address this issue, I need to use a **collate function**, which applies the appropriate amount of padding to sequences in a dataset that we want to batch together. The Transformers library offers a collate function through the `DataCollatorWithPadding` class. This class requires a tokenizer when instantiated, and the input should be in dictionary format.

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
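
To illustrate the idea behind the collator, here is a minimal sketch in plain Python. `pad_batch` is a hypothetical helper written for this example; it works on lists, whereas the real collator returns tensors, and `pad_id=0` is an assumed padding ID.

```python
def pad_batch(examples, pad_id=0):
    """Pad every sequence to the longest one in this batch only."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": []}
    for ex in examples:
        n_pad = max_len - len(ex["input_ids"])
        # Real tokens get mask 1, padding positions get mask 0
        batch["input_ids"].append(ex["input_ids"] + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * n_pad)
    return batch

# Two batches with different longest sequences end up with different widths
wide_batch = pad_batch([{"input_ids": [2, 4, 3]}, {"input_ids": [2, 4, 5, 8, 3]}])
narrow_batch = pad_batch([{"input_ids": [2, 3]}, {"input_ids": [2, 4, 3]}])

print(wide_batch["input_ids"])
print(narrow_batch["input_ids"])
```

The first batch is padded to length 5 and the second to length 3, so no batch carries more padding than it needs.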

## 2. Prepare for modeling

### 2.1 Process tokenized dataset
Three processing steps need to be applied to `tokenized_datasets`:
1. Remove columns that can't be fed to the model. Only the "text" column needs to be removed.

2. Rename the "label" column as "labels" because the model's `forward()` method expects the argument to be named "labels".

3. Convert the format of datasets from lists to tensors since the model can't process lists of data.

In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(['text'])
tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')
tokenized_datasets.set_format('torch')

### 2.2 Prepare DataLoaders
DataLoaders will be used to iterate over batches when fine-tuning the model.

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets['train'], shuffle=True, batch_size=16, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets['test'], batch_size=16, collate_fn=data_collator
)

In [None]:
# Inspect a batch in train_dataloader to verify its preparation
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'labels': torch.Size([16]),
 'input_ids': torch.Size([16, 512]),
 'token_type_ids': torch.Size([16, 512]),
 'attention_mask': torch.Size([16, 512])}

### 2.3 Set up optimizer
For the **optimizer**, I will use `AdamW`, which is also the default optimizer in the `Trainer` class. AdamW is similar to Adam but applies weight decay directly to the weights (decoupled weight decay) instead of folding it into the gradients.
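
The difference between the two can be sketched for a single scalar parameter. This is a simplified illustration, not PyTorch's implementation: `adam_family_step` is a hypothetical helper, and the hyperparameter defaults other than the learning rate of 5e-5 are assumptions chosen for the example.

```python
import math

def adam_family_step(p, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999,
                     eps=1e-8, weight_decay=0.01, decoupled=True):
    """One update of a single scalar parameter.

    decoupled=True  -> AdamW-style: decay is applied directly to the parameter
    decoupled=False -> Adam with an L2 penalty: decay is added to the gradient
    """
    if decoupled:
        p = p * (1 - lr * weight_decay)
    else:
        grad = grad + weight_decay * p
    # Exponential moving averages of the gradient and the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the first steps
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# With a zero gradient, only the weight decay term moves the parameter
p_adamw, _, _ = adam_family_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=True)
p_adam_l2, _, _ = adam_family_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=False)
print(p_adamw, p_adam_l2)
```

In the decoupled case the parameter shrinks by the small factor `1 - lr * weight_decay`. In the L2 case the decay passes through Adam's normalization, so the first step has a size of roughly `lr` regardless of how small the decay is.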

Before defining the optimizer, I need to load the pretrained BERT uncased model.

In [None]:
from transformers import AutoModelForSequenceClassification

# num_labels=2 since the label is binary, either negative or positive
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# torch must be imported before torch.optim.AdamW can be used
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Fine-tune the model for 3 epochs
num_epochs = 3

In [None]:
# Set up device-agnostic code to utilize the GPU if it's available
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

## 3. Modeling

### 3.1 Baseline model
The pretrained BERT uncased model will serve as the baseline model.

I need to calculate the classification accuracy of the baseline model and compare it to the fine-tuned model to assess the improvement. The Hugging Face `Evaluate` library offers access to numerous evaluation metrics and is user-friendly. I'll use it for evaluation.

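The classification accuracy of the baseline model is 0.476, which is close to chance level for two classes. This is expected: the classification head is newly initialized and has not been trained yet, so without fine-tuning, the BERT uncased model doesn't perform well.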
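
As a small worked example of how accuracy is derived from model outputs, the sketch below uses plain Python and made-up logits. `accuracy_from_logits` is a hypothetical helper for illustration; the notebook itself relies on the `Evaluate` library.

```python
def argmax(values):
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy_from_logits(logits, labels):
    predictions = [argmax(row) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return correct / len(labels)

# Made-up logits for four reviews: column 0 = negative, column 1 = positive
toy_logits = [[0.2, 1.1], [1.5, -0.3], [0.1, 0.4], [0.9, 0.8]]
toy_labels = [1, 0, 0, 1]

toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)
```

Here the predicted classes are 1, 0, 1, 0, so two of the four predictions match the labels.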

In [None]:
import evaluate
# Load accuracy metric
metric = evaluate.load('accuracy')
# Send model to GPU
model.to(device)
# Set model in evaluation mode, which disables dropout so predictions are deterministic
model.eval()
with torch.inference_mode():
    for batch in eval_dataloader:
        # Send data to GPU
        batch = {k: v.to(device) for k, v in batch.items()}
        # Forward pass
        outputs = model(**batch)
        # Make predictions
        predictions = torch.argmax(outputs['logits'], dim=-1)
        metric.add_batch(predictions=predictions, references=batch["labels"])

    # Compute and print accuracy
    print(f'Accuracy: {metric.compute()}')

Accuracy: {'accuracy': 0.476}


### 3.2 Fine-tune the pretrained BERT uncased model
I need to set up training and evaluation loops as follows:
- Training loops for fine-tuning the pretrained model

- Evaluation loops for assessing model performance

Use `tqdm()` to include progress bars in the loops, allowing me to monitor the training and evaluation progress.

The fine-tuned BERT uncased model achieves an accuracy of **0.772** on the testing dataset, a commendable performance given the limited training data of 500 movie reviews and 3 epochs.

In [None]:
torch.manual_seed(42)
from tqdm.auto import tqdm

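# One progress-bar tick per training batch across all epochs
num_training_steps = num_epochs * len(train_dataloader)
progress_bar = tqdm(range(num_training_steps))

# Send model to GPU
model.to(device)

for epoch in range(num_epochs):
    ### Training
    # Switch back to training mode at the start of every epoch,
    # because the evaluation step below leaves the model in evaluation mode
    model.train()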
    for batch in train_dataloader:
        # Send data to GPU
        batch = {k: v.to(device) for k, v in batch.items()}
        # Forward pass
        outputs = model(**batch)
        # Calculate loss
        loss = outputs.loss
        # Backward pass (backpropagation)
        loss.backward()
        # Step optimizer to update parameters
        optimizer.step()
        # Optimizer zero grad
        optimizer.zero_grad()
        progress_bar.update(1)

    ### Evaluation
    metric = evaluate.load('accuracy')
    # Set model in evaluation mode and calculate classification accuracy for every epoch
    model.eval()
    with torch.inference_mode():
        for batch in eval_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            predictions = torch.argmax(outputs['logits'], dim=-1)
            metric.add_batch(predictions=predictions, references=batch["labels"])

        print(f'Epoch: {epoch+1}, Accuracy: {metric.compute()}')

Epoch: 1, Accuracy: {'accuracy': 0.476}
Epoch: 2, Accuracy: {'accuracy': 0.476}
Epoch: 3, Accuracy: {'accuracy': 0.772}


In [None]:
# Save the fine-tuned model to Google Drive for use in the future
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Specify a path to save the model
save_directory = '/content/drive/My Drive/Colab Notebooks/Models'
model.save_pretrained(save_directory)

# Load saved model. Need to mount Google Drive again before loading
# model = AutoModelForSequenceClassification.from_pretrained(save_directory)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 4. Use the fine-tuned BERT uncased model for sentiment classification

### 4.1 Process data and prepare DataLoaders
Before using the model to classify sentiment, I need to **process movie reviews the same way** I did for the training and evaluation data:

1. Tokenize movie reviews
2. Process tokenized data
3. Prepare DataLoaders

In [None]:
# Tokenize movie reviews in unsupervised_reviews
tokenized_reviews = unsupervised_reviews.map(tokenize_function, batched=True)
tokenized_reviews

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1000
})

In [None]:
# Process tokenized data
tokenized_reviews = tokenized_reviews.remove_columns(['text', 'label'])
tokenized_reviews.set_format('torch')

In [None]:
# Prepare DataLoaders
pred_dataloader = DataLoader(
    tokenized_reviews, batch_size=16, collate_fn=data_collator
)

In [None]:
# Inspect a batch in pred_dataloader to verify its preparation
for batch in pred_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([16, 512]),
 'token_type_ids': torch.Size([16, 512]),
 'attention_mask': torch.Size([16, 512])}

### 4.2 Classify sentiment of movie reviews

In [None]:
# Send model to GPU
model.to(device)
# Set model in evaluation mode
model.eval()
# Create an empty integer tensor to store the predicted class labels
pred_tensor = torch.tensor([], dtype=torch.long, device=device)
with torch.inference_mode():
    for batch in pred_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        predictions = torch.argmax(outputs['logits'], dim=-1)
        pred_tensor = torch.cat((pred_tensor, predictions), dim=0)

### 4.3 Export results to a CSV file

In [None]:
# Create a DataFrame with movie reviews and predicted labels for easy reference
import pandas as pd
unsupervised_reviews_df = pd.DataFrame(unsupervised_reviews)
unsupervised_reviews_df['label'] = pred_tensor.tolist()
unsupervised_reviews_df.to_csv('/content/drive/My Drive/Colab Notebooks/Movie_Review_Sentiment_Classify.csv', index=False)