# IMDB Movie Review Sentiment Classification with BERT

In this notebook, you will perform sentiment analysis on the IMDB Movie Review Dataset using the BERT (Bidirectional Encoder Representations from Transformers) model.

Sentiment analysis is a natural language processing (NLP) task that involves determining the emotional tone behind a piece of text. The IMDB movie review dataset is a well-known dataset containing 50,000 movie reviews labeled as positive or negative. Your task is to train a model that accurately classifies the sentiment of a given movie review.

The input to your model will be text data from IMDB movie reviews. Each review is a string of words that will be tokenized with BERT's tokenizer. The expected output is a binary classification: each review will be labeled as positive (1) or negative (0). By using BERT, you aim to capture the contextual meaning of words in the reviews and thus improve the accuracy of the sentiment classification. This notebook will guide you through data preprocessing, model training, evaluation, and making predictions on new reviews with the BERT model.



![BERT](https://cdn-images-1.medium.com/max/1500/1*g1KBCVCITjrd9IJ7AyFqdw.png)

[IMDB Movie Review Sentiment Classification with BERT](#scrollTo=Nnpq7d-poO-h)

[What is BERT?](#scrollTo=_kAGrtJjoXaZ)

>[Key Features of BERT](#scrollTo=WXNeaIsJ7ZzC)

>>[1) Masked Language Modeling (MLM)](#scrollTo=PED6sVVtobm9)

>>[2) Next Sentence Prediction (NSP)](#scrollTo=LjkfD8-iohjS)

>>[Detailed Architecture of BERT Model](#scrollTo=jn8li8xE4PRI)

[1) Import Libraries and Set Up GPU](#scrollTo=g_J6BlEXotA0)

[2) Exploratory Data Analysis and Preprocessing](#scrollTo=lxG3xDYUjlp9)

>[IMDB Movie Review Dataset](#scrollTo=gvqE-YQupqjT)

[3) Initialize The Pre-trained BERT Model and Load The Tokenizer](#scrollTo=UYa5zHelfyJA)

>[BERT's Word Piece Tokenization](#scrollTo=BfcfpiKmOqni)

>[Special Tokens](#scrollTo=_tW94n1WNxoR)

[4) Tokenize the Dataset](#scrollTo=zsb_ThOpfmGI)

[5) Load The Pre-Trained BERT Model](#scrollTo=u__s-wGB2SfF)

[6) Train and Evaluate Model](#scrollTo=UxTaXPXGmZoL)

[7) Inference From Fine-Tuned Model](#scrollTo=wesC5b6JCmjr)



# What is BERT?

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based large language model developed by Google and released in 2018.

The original paper is available here: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)


### Key Features of BERT

> ### 1) Masked Language Modeling (MLM)

Masked Language Modeling is one of the key pre-training tasks used in BERT to enable the model to learn context-rich word representations.

In BERT's MLM, before a token sequence is given to the model, **15% of the tokens in each sequence are selected for prediction**, and most of them are replaced with a **[MASK]** token. The modified sequence is passed through the **bidirectional encoder**. The model then tries to predict the original tokens at the selected positions from the context provided by the remaining tokens in the sequence.

For more details, see the Hugging Face guide on [masked language modeling](https://huggingface.co/docs/transformers/tasks/masked_language_modeling).
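
The masking step can be sketched in plain Python. This is a minimal illustration, not BERT's actual pre-training code: it assumes the 80/10/10 rule described in the BERT paper (of the selected tokens, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged), and the function name `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels hold the original token at selected positions."""
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # position not selected for prediction
        labels[i] = tok                    # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return masked, labels

tokens = "the movie was surprisingly good and the acting was great".split()
toy_vocab = sorted(set(tokens))            # hypothetical vocabulary for illustration
masked, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

A higher `mask_prob` is used in the usage lines only so that a ten-token sentence is likely to show at least one selected position.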



![MLM](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/MLM.png)

> ### 2) Next Sentence Prediction (NSP)

Next Sentence Prediction is another pre-training task used in BERT to enable the model to understand the relationship between pairs of sentences.

In the NSP task, the model is fed positive and negative sentence pairs. In positive sentence pairs, the second sentence is the actual next sentence following the first sentence in the original text. In negative sentence pairs, the second sentence is a randomly selected sentence from the corpus that does not follow the first sentence.
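
The pair construction can be sketched as follows. This is an illustrative sketch only: it assumes an even 50/50 split between positive and negative pairs, as described in the BERT paper, and the helper name `make_nsp_pairs` and the example sentences are hypothetical.

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) examples from an ordered list of sentences."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive pair: the sentence that really follows sentence i.
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # Negative pair: a random sentence that is neither i nor its true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), 0))
    return pairs

document = [
    "The film opens in a small town.",
    "A stranger arrives on the evening train.",
    "Nobody knows why he came.",
    "The mystery unfolds slowly.",
]
for sentence_a, sentence_b, is_next in make_nsp_pairs(document, seed=3):
    print(is_next, "|", sentence_a, "->", sentence_b)
```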




![NSP](https://www.researchgate.net/publication/359791880/figure/fig4/AS:11431281093138241@1667103724679/Next-sentence-prediction-in-BERT-training.png)

> ### Detailed Architecture of BERT Model

BERT comes in two standard sizes, BERT_BASE and BERT_LARGE, which differ in the number of encoder layers, the number of attention heads, and the hidden (latent vector) size.



| Model      | Encoder Layers (L) | Attention Heads (A) | Hidden Size (H) |
|------------|---------------------|---------------------|------------------|
| BERT_BASE  | 12                  | 12                  | 768              |
| BERT_LARGE | 24                  | 16                  | 1024             |
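
One relationship that the table implies is the size of each attention head: the hidden size is split evenly across the heads. The short sketch below only does that arithmetic on the values from the table; the dictionary layout is an assumption made for illustration.

```python
# Values copied from the table above.
configs = {
    "BERT_BASE": {"layers": 12, "heads": 12, "hidden": 768},
    "BERT_LARGE": {"layers": 24, "heads": 16, "hidden": 1024},
}

# Each attention head works on hidden / heads dimensions.
head_sizes = {name: cfg["hidden"] // cfg["heads"] for name, cfg in configs.items()}

for name, size in head_sizes.items():
    print(f"{name}: {size} dimensions per attention head")
```

Both variants end up with 64 dimensions per head; BERT_LARGE gains capacity through more layers, more heads, and a wider hidden state.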


BERT's input representation is the element-wise sum of three embeddings:

* **Positional Embedding:** Encodes the index of each token in the input sequence. This tells the model where each token sits in the sequence, so it can learn the order of words and capture sequential dependencies.

* **Segment Embedding:** Indicates which sentence a token belongs to. For example, in a two-sentence input, one embedding is used for the tokens of the first sentence and a different one for the tokens of the second sentence. This helps the model distinguish which token belongs to which sentence.

* **Token Embedding:** Represents the tokens produced by the tokenizer. The tokenizer splits the text into tokens (word pieces in BERT's case; WordPiece tokenization is covered in a later section), and each token is converted into an embedding vector to be processed by the model.
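
The way the three embeddings combine can be sketched with tiny lookup tables. This is a toy illustration under stated assumptions: the tables are filled with random numbers, the hidden size is 4 instead of 768, and the token IDs are hypothetical. The real model also applies layer normalization and dropout after the sum.

```python
import random

rng = random.Random(0)
hidden = 4  # toy hidden size; BERT_BASE uses 768

def table(rows):
    """Create a toy embedding table with `rows` vectors of length `hidden`."""
    return [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(rows)]

token_table = table(10)     # one vector per vocabulary entry
segment_table = table(2)    # one vector per sentence (A or B)
position_table = table(8)   # one vector per position in the sequence

token_ids = [1, 5, 7, 2, 6, 2]      # hypothetical token IDs
segment_ids = [0, 0, 0, 0, 1, 1]    # first four tokens: sentence A, last two: sentence B

# For every position, look up the three vectors and add them element-wise.
input_vectors = [
    [t + s + p for t, s, p in zip(token_table[tok], segment_table[seg], position_table[pos])]
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids))
]

print(len(input_vectors), "vectors of size", len(input_vectors[0]))
```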


![BERT Inputs](https://miro.medium.com/v2/resize:fit:619/1*iJqlhZz-g6ZQJ53-rE9VvA.png)




---



## 1) Import Libraries and Set Up GPU

In [None]:
!pip install datasets transformers session-info -q

In [None]:
import os
import torch
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
from datasets import load_dataset
from transformers import BertTokenizer, BertTokenizerFast, AutoTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, Dataset, TensorDataset, RandomSampler, SequentialSampler
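from transformers import get_linear_schedule_with_warmup
from torch.optim import AdamW  # The AdamW class in transformers is deprecated in favor of the PyTorch implementation.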
from sklearn.metrics import f1_score, accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Check if CUDA is available and set the GPU device.

!nvidia-smi

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

## 2) Exploratory Data Analysis and Preprocessing

### IMDB Movie Review Dataset

The IMDB Movie Review dataset consists of 50,000 movie reviews from IMDB, each with a sentiment label indicating whether the review is positive or negative.

*Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011, June). Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (pp. 142-150). https://aclanthology.org/P11-1015*

Dataset: https://huggingface.co/datasets/stanfordnlp/imdb


In [None]:
# Import the IMDB Review dataset with Hugging Face's load_dataset function (https://huggingface.co/docs/datasets/loading). Then examine the dataset.

# YOUR CODE STARTS HERE
dataset = None

# YOUR CODE ENDS HERE
dataset

In [None]:
# Create a smaller training dataset with random selections from the dataset for faster training times.

# YOUR CODE STARTS HERE

small_train_dataset = None
small_test_dataset = None

# YOUR CODE ENDS HERE
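
The idea behind a reproducible random subset can be sketched with the standard library alone. This is only an illustration of the sampling logic: the helper name `sample_indices`, the subset size of 2,000, and the seed are assumptions, and 25,000 is the size of the IMDB training split. With the `datasets` library, the same effect is commonly obtained by chaining `shuffle(seed=...)` and `select(range(n))` on a split.

```python
import random

def sample_indices(num_rows, subset_size, seed=42):
    """Pick `subset_size` distinct row indices; a fixed seed makes the choice repeatable."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_rows), subset_size))

train_idx = sample_indices(num_rows=25_000, subset_size=2_000)
print(len(train_idx), "rows selected, first five:", train_idx[:5])
```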

In [None]:
small_train_dataset

In [None]:
small_test_dataset

## 3) Initialize The Pre-trained BERT Model and Load The Tokenizer

For detailed information on different BERT models: https://github.com/google-research/bert/blob/master/README.md

In [None]:
# Choose the BERT checkpoint (model_id) and create the tokenizer object with the Transformers AutoTokenizer class.

# YOUR CODE STARTS HERE

model_id = None
tokenizer = None

# YOUR CODE ENDS HERE

In [None]:
# Examine the tokenizer components

tokenizer

### BERT's Word Piece Tokenization

WordPiece tokenization is a subword tokenization algorithm used in natural language processing, most notably by BERT. It splits words that are not in the vocabulary into smaller known pieces, and marks every piece that continues a word with a `##` prefix.

For more details, see the [Hugging Face NLP course chapter on WordPiece tokenization](https://huggingface.co/learn/nlp-course/en/chapter6/6).
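
The splitting step can be sketched as a greedy longest-match-first search. This is a simplified illustration, not the tokenizer's real implementation: the function name `wordpiece_tokenize` and the toy vocabulary are hypothetical, and it handles one lowercase word at a time.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                          # try a shorter substring
        if piece is None:
            return [unk_token]                # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"un", "##believ", "##able", "movie", "##s", "good"}  # hypothetical vocabulary

print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("movies", toy_vocab))        # ['movie', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```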

In [None]:
# Tokenize an example sentence of your choice.

# YOUR CODE STARTS HERE

test_sentence = None
tokens = tokenizer.tokenize(test_sentence)

# YOUR CODE ENDS HERE

print(tokens)

In [None]:
# Pass the example sentence through the tokenizer.

# YOUR CODE STARTS HERE

output=tokenizer(test_sentence,
                 padding= None,  # Ensure that all sequences are the same length by adding padding.
                 truncation= None, # Truncate the sequences to the maximum length if necessary.
                 max_length= None) # Set the maximum length of the sequences.

# YOUR CODE ENDS HERE

token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(f'Sentence: {test_sentence}')
print(f'Tokens: {tokens}')
print(f'Token IDs: {token_ids}')
print(f'Test sentence with the special tokens: {tokenizer.decode(output["input_ids"])}')
print(f'Input IDs, Token Type IDs, Attention Mask: {output}')

### Special Tokens




> ##### [SEP] (Separator Token) --  It marks the end of one sequence and the beginning of another within the same input string.

In [None]:
tokenizer.sep_token, tokenizer.sep_token_id

('[SEP]', 102)

> ##### [CLS] (Classification Token) -- It marks the beginning of the sequence, and its final hidden state is used as the sequence representation for classification.

In [None]:
tokenizer.cls_token, tokenizer.cls_token_id

('[CLS]', 101)

> ##### [PAD] (Padding Token) -- It fills shorter sequences so that they match the length of the longest sequence in the batch.

In [None]:
tokenizer.pad_token, tokenizer.pad_token_id

('[PAD]', 0)

In [None]:
# Padding example.

texts = ["Hello world", "Hello world hello world hello world"]

encoded_inputs = tokenizer(texts,
                           padding=True,
                           return_tensors='pt')

print(encoded_inputs['input_ids'])
print(encoded_inputs['attention_mask'])
decoded_text_1 = tokenizer.decode(encoded_inputs['input_ids'][0])
decoded_text_2 = tokenizer.decode(encoded_inputs['input_ids'][1])
print(decoded_text_1)
print(decoded_text_2)
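
What the tokenizer does when `padding=True` can be sketched in plain Python. This is a minimal sketch, assuming the `[PAD]` ID of 0 shown above; the helper name `pad_batch` and the word IDs in the example batch are hypothetical, while 101 and 102 stand for `[CLS]` and `[SEP]`.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build the matching attention masks."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids]
    return input_ids, attention_mask

batch = [
    [101, 7592, 2088, 102],              # short sequence
    [101, 7592, 2088, 7592, 2088, 102],  # longer sequence
]
padded_ids, padded_mask = pad_batch(batch)
print(padded_ids)
print(padded_mask)
```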

> ##### [UNK] (Unknown Token) -- It represents words that are not found in the tokenizer's vocabulary.

In [None]:
tokenizer.unk_token, tokenizer.unk_token_id

('[UNK]', 100)

## 4) Tokenize the Dataset

You can utilize the `batch_encode_plus` method from the Transformers library for [batch encoding](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.BatchEncoding). This method is particularly useful for encoding multiple texts simultaneously, optimizing both the processing time and resource utilization. It handles padding and creates attention masks automatically, ensuring that the input sequences are properly formatted for the model.
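
When a fixed `max_length` is combined with truncation and padding, every review ends up with exactly the same number of token IDs. The sketch below illustrates that behavior for a single sequence. It is an assumption-laden illustration rather than the tokenizer's code: the helper name `encode_fixed_length` is hypothetical, and the special token IDs are the ones printed in the Special Tokens section.

```python
def encode_fixed_length(token_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Truncate or pad a list of token IDs so the result has exactly `max_length` entries."""
    body = token_ids[: max_length - 2]        # leave room for [CLS] and [SEP]
    ids = [cls_id] + body + [sep_id]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding

long_ids, long_mask = encode_fixed_length(list(range(1000, 1010)), max_length=6)
short_ids, short_mask = encode_fixed_length([7, 8], max_length=6)

print(long_ids, long_mask)    # the ten input IDs are cut down to four
print(short_ids, short_mask)  # the two input IDs are padded with zeros
```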


In [None]:
# YOUR CODE STARTS HERE

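encoded_data_train = tokenizer.batch_encode_plus(
    dataset['train']['text'],
    add_special_tokens = None,
    return_attention_mask = None,
    padding = None, # Padding strategy; the older pad_to_max_length flag is deprecated.
    truncation = None, # Set explicitly so that long reviews are cut to max_length.
    max_length = None,
    return_tensors = None
)

encoded_data_val = tokenizer.batch_encode_plus(
    dataset['test']['text'],
    add_special_tokens = None,
    return_attention_mask = None,
    padding = None, # Padding strategy; the older pad_to_max_length flag is deprecated.
    truncation = None, # Set explicitly so that long reviews are cut to max_length.
    max_length = None,
    return_tensors = None
)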

# YOUR CODE ENDS HERE

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


In [None]:
encoded_data_train

The encoded data for both the training and validation sets needs to be prepared for input into the model. Extract `input_ids` and `attention_mask` from the encoded training and validation data. The model needs them to tell which tokens are real text and which are padding. Then convert the labels in the dataset to tensors, so that the data is suitable for processing with PyTorch during model training and evaluation.

In [None]:
# YOUR CODE STARTS HERE

input_ids_train = None
attention_masks_train = None
labels_train = torch.tensor(None)

input_ids_val = None
attention_masks_val = None
labels_val = torch.tensor(None)

# YOUR CODE ENDS HERE

Use PyTorch's [`TensorDataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset) class to combine `input_ids`, `attention_masks`, and `labels` into a single dataset object for each split (training and validation).

In [None]:
# YOUR CODE STARTS HERE

dataset_train = TensorDataset(None,
                              None,
                              None)

dataset_val = TensorDataset(None,
                            None,
                            None)
# YOUR CODE ENDS HERE

## 5) Load The Pre-Trained BERT Model

Load the pre-trained model with [`BertForSequenceClassification.from_pretrained`](https://huggingface.co/transformers/v3.0.2/model_doc/bert.html#bertforsequenceclassification) and set its configuration.

In [None]:
# YOUR CODE STARTS HERE

model = BertForSequenceClassification.from_pretrained(
    None, # Specify the model checkpoint.
    num_labels = None, # Number of output classes for the classifier layer; consider the classes in the dataset.
    output_attentions = None, # Whether to also return attention weights; not needed for the final predictions.
    output_hidden_states = None # Whether to also return all hidden states; not needed for the final predictions.
)

# YOUR CODE ENDS HERE

model.to(device) # Move the model to the GPU. Check the model architecture.

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [None]:
# Set up data loaders for training and validation.

# YOUR CODE STARTS HERE

batch_size = None

dataloader_train = DataLoader(
    None,
    sampler = RandomSampler(None),
    batch_size = batch_size
)

dataloader_val = DataLoader(
    None,
    # Unlike the training loader, the validation loader uses a sequential sampler
    # that processes the data in its original order; evaluation does not need shuffling.
    sampler = SequentialSampler(None),
    batch_size = batch_size
)

# YOUR CODE ENDS HERE

In [None]:
# Configure the optimizer parameters for model training.

# YOUR CODE STARTS HERE

optimizer = AdamW(
    model.parameters(),
    lr = None,
    eps = None
)

# YOUR CODE ENDS HERE
optimizer # Check the optimizer components.

In [None]:
# Set up the learning rate scheduler.

# YOUR CODE STARTS HERE

epochs = None
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps = None,
    num_training_steps = None
)

# YOUR CODE ENDS HERE
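
The shape of a linear schedule with warmup can be sketched without any library. The function below returns the factor by which the base learning rate is multiplied at each step: it rises linearly from 0 to 1 during warmup and then falls linearly back to 0. The helper name `linear_warmup_factor` and the step counts in the usage lines are assumptions for illustration; the total number of training steps is typically the number of batches per epoch times the number of epochs.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear ramp up during warmup, linear decay afterwards."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

batches_per_epoch, num_epochs = 100, 3           # hypothetical values
total_steps = batches_per_epoch * num_epochs     # 300 optimizer steps in total
warmup_steps = 30                                # hypothetical warmup length

for step in (0, 15, 30, 165, 300):
    print(step, linear_warmup_factor(step, warmup_steps, total_steps))
```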

## 6) Train and Evaluate Model

Set the directory where the fine-tuned model will be saved.

In [None]:
# YOUR CODE STARTS HERE

models_out_dir = None
os.makedirs(models_out_dir, exist_ok=True)

# YOUR CODE ENDS HERE

The `evaluate` function measures the model's performance on the validation dataset. It collects the model's predictions and the true labels for every batch, and returns the average validation loss, the raw logits, the true labels, and the predicted class for each example.

The model is first put into eval mode, which disables dropout and other training-only behavior. The validation dataset is then processed in batches (small pieces of data) using `dataloader_val`.
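
The step from raw logits to class predictions and a metric can be sketched in plain Python. This is a small illustration with made-up numbers: the logits and labels are hypothetical, and the helper names `argmax_predictions` and `accuracy` are not part of the notebook, which uses NumPy and scikit-learn for the same purpose.

```python
def argmax_predictions(logits):
    """Pick the index of the largest logit in every row as the predicted class."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def accuracy(true_labels, predicted_labels):
    """Fraction of examples whose prediction matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

example_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.5, 1.7]]  # hypothetical values
example_labels = [0, 1, 1, 1]

example_preds = argmax_predictions(example_logits)
print(example_preds)                             # [0, 1, 0, 1]
print(accuracy(example_labels, example_preds))   # 0.75
```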

In [None]:
# Get the input_ids, attention_mask and labels values from the batch.

def evaluate(dataloader_val):
    model.eval()
    loss_val_total = 0
    predictions, true_vals = [], []

    for batch in tqdm(dataloader_val, desc="Evaluating", leave=False):
        batch = tuple(b.to(device) for b in batch)

        # YOUR CODE STARTS HERE

        inputs = {
            'input_ids': None,
            'attention_mask': None,
            'labels': None
        }

        # YOUR CODE ENDS HERE

        with torch.no_grad():
            outputs = model(**inputs)

        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)

    loss_val_avg = loss_val_total / len(dataloader_val)
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
    preds_flat = np.argmax(predictions, axis=1).flatten()

    return loss_val_avg, predictions, true_vals, preds_flat

In [None]:
# Start the training loop.

for epoch in tqdm(range(1, epochs + 1)):
    model.train()

    loss_train_total = 0
    progress_bar = tqdm(dataloader_train, desc="Epoch {:1d}".format(epoch), leave=False, disable=False)

    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }

        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total += loss.item()

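        loss.backward()
        # Clip gradients after the backward pass, once they have been computed.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

        # The loss is already averaged over the batch, so report it directly.
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})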


    loss_train_avg = loss_train_total / len(dataloader_train)
    tqdm.write(f'\nEpoch {epoch}')
    tqdm.write(f'Training loss: {loss_train_avg}')

    loss_val_avg, predictions, true_vals, preds_flat = evaluate(dataloader_val)
    f1 = f1_score(true_vals, preds_flat, average='weighted')
    tqdm.write(f'Validation Loss: {loss_val_avg}')
    tqdm.write(f'Validation F1 Score: {f1}')

    torch.save(model.state_dict(), '{}/BERT_ft_epoch{}.model'.format(models_out_dir, epoch))

## 7) Inference From Fine-Tuned Model

In [None]:
# Load the weights of the trained model.

# YOUR CODE STARTS HERE

model.load_state_dict(torch.load(None, map_location=device))  # map_location lets the checkpoint load on CPU-only machines too.
model.to(device)

# YOUR CODE ENDS HERE

Take a sample from the test dataset, tokenize it, and compare the ground truth with the model's prediction.

In [None]:
# YOUR CODE STARTS HERE

text = None
true_label = None

# YOUR CODE ENDS HERE
text

In [None]:
encoded_input = tokenizer.encode_plus(
    text,
    add_special_tokens=True,
    max_length=256,
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)

# YOUR CODE STARTS HERE

input_id = encoded_input['input_ids'].to(device) # Move input ids and attention masks to the same device as the model.
attention_mask = encoded_input['attention_mask'].to(device)

# YOUR CODE ENDS HERE

model.eval()
with torch.no_grad():
    output = model(input_ids=input_id, attention_mask=attention_mask)
    logits = output.logits
    prediction = torch.argmax(logits, dim=-1).item()

print(f"Text: {text}")
print(f"True Label: {true_label}")
print(f"Prediction: {prediction}")
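
The predicted class index can be turned into a readable label and a confidence value by applying a softmax to the logits. The sketch below shows the arithmetic in plain Python. The logits are made-up numbers, and the names `softmax` and `label_names` are hypothetical helpers; the mapping of 0 to negative and 1 to positive follows the labels described at the top of this notebook.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

label_names = {0: "negative", 1: "positive"}
example_logits = [-1.2, 2.3]                        # hypothetical model output for one review

probs = softmax(example_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(f"{label_names[predicted]} ({probs[predicted]:.2%})")
```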

In [None]:
import session_info
#session_info.show()
session_info.show(excludes=['pybind11_abseil'])