# Sentiment Classification with BERT

## Table of Contents
1. **Objective**
2. **Metadata**
3. **Dataset Overview**
4. **Concept Overview: What is BERT?**
5. **Why Use BERT?**
6. **How to Use BERT?**
7. **Advantages and Disadvantages of BERT**
8. **Other Use Cases of BERT**
9. **Exploratory Data Analysis (EDA)**
10. **Data Preprocessing**
11. **Implementation**
12. **Key Learnings**
13. **Conclusion**

---

## 1. Objective

The objective of this notebook is to perform sentiment classification on a given dataset using BERT (Bidirectional Encoder Representations from Transformers). We will explore the dataset, preprocess the data, implement a BERT-based model, and evaluate its performance. By the end of this notebook, you will have a clear understanding of how to use BERT for sentiment analysis and other NLP tasks.

---

## 2. Metadata

- **Notebook Author**: [Your Name]
- **Date**: [Current Date]
- **Language**: Python
- **Libraries Used**: 
  - `transformers` (Hugging Face)
  - `torch` (PyTorch)
  - `pandas`
  - `numpy`
  - `matplotlib`
  - `seaborn`
  - `scikit-learn`
  - `datasets` (Hugging Face)
  - `wandb` (Weights & Biases)
- **Dataset**: IMDb movie reviews (loaded with `load_dataset("imdb")`)

---

## 3. Dataset Overview

The dataset used in this notebook is the IMDb movie review dataset, which contains 50,000 reviews labeled with binary sentiment (positive or negative). It comes pre-split into a training set of 25,000 samples and a testing set of 25,000 samples. A further 50,000 unlabeled reviews are provided in an `unsupervised` split, which is not used for training or evaluation here.

- **Columns**:
  - `text`: The text data (e.g., movie reviews, tweets).
  - `label`: The sentiment label (e.g., 0 for negative, 1 for positive).

- **Dataset Source**: [Link to Dataset]

---

## 4. Concept Overview: What is BERT?

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language model developed by Google that is pre-trained on unlabeled text and then adapted to natural language processing (NLP) tasks. Unlike earlier models that read text in one direction (either left-to-right or right-to-left), BERT's encoder attends to the words on both sides of each token at once. Every token's representation therefore reflects its full left and right context, which gives the model a deeper understanding of the text.

### Key Features of BERT:
- **Bidirectional Context**: BERT considers the context from both sides of a word, which helps in understanding the meaning of words in a sentence more accurately.
- **Transformer Architecture**: BERT uses the transformer architecture, which relies on self-attention mechanisms to weigh the importance of different words in a sentence.
- **Pre-trained Models**: BERT is pre-trained on large corpora (e.g., Wikipedia, BookCorpus) and can be fine-tuned for specific NLP tasks like sentiment analysis, question answering, and more.
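
The self-attention mechanism mentioned above can be illustrated with a minimal sketch of single-head scaled dot-product attention. This is a toy version on plain Python lists, not BERT's actual implementation: the three 2-dimensional token vectors are made-up numbers, and the learned query/key/value projections are assumed to be the identity for simplicity.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    Every position attends to every position in the sequence (left and right),
    which is what makes the encoder bidirectional.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 3-token sequence with 2-dimensional embeddings (toy numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence; the rows sum to 1.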

---

## 5. Why Use BERT?

- **Strong Benchmark Performance**: When it was released in 2018, BERT set state-of-the-art results on benchmarks such as GLUE and SQuAD, and it remains a strong baseline for tasks including sentiment analysis, question answering, and named entity recognition.
- **Contextual Understanding**: BERT's bidirectional nature allows it to understand the context of words better than unidirectional models.
- **Transfer Learning**: BERT can be fine-tuned on specific tasks with relatively small datasets, making it highly versatile.

---

## 6. How to Use BERT?

1. **Pre-training**: BERT is pre-trained on large text corpora using two tasks:
   - **Masked Language Model (MLM)**: Randomly masks about 15% of the input tokens and trains the model to predict them from the surrounding context.
   - **Next Sentence Prediction (NSP)**: Predicts whether one sentence follows another in a document.

2. **Fine-tuning**: After pre-training, BERT can be fine-tuned on specific tasks (e.g., sentiment analysis) by adding a task-specific layer (e.g., a classification layer) and training on labeled data.
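
The MLM corruption step can be sketched as follows. This is a simplified illustration, not the original pre-training code: each token is selected independently with probability 0.15, and a selected token is replaced by `[MASK]` 80% of the time, by a random vocabulary token 10% of the time, and left unchanged 10% of the time. The function name `mask_tokens`, the whitespace tokenization, and the tiny vocabulary are assumptions made for the example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_inputs, labels) for masked language modeling.

    labels[i] holds the original token where a prediction is required
    and None everywhere else (those positions are ignored by the loss).
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                    # model must recover the original token
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)               # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels

# Toy sentence, split on whitespace instead of WordPiece for readability.
tokens = "the movie was surprisingly good and the acting was great".split()
vocab = sorted(set(tokens))
inputs, labels = mask_tokens(tokens, vocab, seed=0)
print(inputs)
print(labels)
```

During fine-tuning for sentiment classification no masking is applied; the pre-trained encoder simply receives the full review.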

---

## 7. Advantages and Disadvantages of BERT

### Advantages:
- **High Accuracy**: BERT achieves high accuracy on various NLP tasks; fine-tuned for two epochs in this notebook, it reaches about 94% accuracy on the IMDb test set.
- **Contextual Understanding**: BERT's bidirectional nature allows it to understand context better.
- **Versatility**: BERT can be fine-tuned for a wide range of NLP tasks.

### Disadvantages:
- **Computationally Expensive**: BERT requires significant computational resources for training and inference.
- **Large Model Size**: BERT models are large, which can be a limitation for deployment on resource-constrained devices.
- **Long Training Time**: Fine-tuning BERT on large datasets can be time-consuming.
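
To make the model-size point concrete, the parameter count of the model used in this notebook can be reproduced with simple arithmetic. The sketch below assumes the standard BERT-base configuration (12 layers, hidden size 768, feed-forward size 3072, vocabulary of 30,522 tokens, 512 positions) plus a 2-class classification head; the helper `linear` is a name introduced only for this example.

```python
# BERT-base hyperparameters (values from the bert-base-uncased configuration).
vocab_size, hidden, layers = 30522, 768, 12
intermediate, max_positions, type_vocab = 3072, 512, 2

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

# Token, position, and segment embeddings followed by a LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
# Query, key, value, and output projections, then a LayerNorm.
attention = 4 * linear(hidden, hidden) + layer_norm
# Two dense layers (expand, then project back), then a LayerNorm.
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
encoder = layers * (attention + feed_forward)
pooler = linear(hidden, hidden)   # dense layer applied to the [CLS] representation
classifier = linear(hidden, 2)    # newly initialized head for 2 sentiment labels

total = embeddings + encoder + pooler + classifier
print(f"{total:,} parameters, about {total * 4 / 1e6:.0f} MB in float32")
```

Roughly 110 million parameters at 4 bytes each is why the model needs several hundred megabytes of memory before activations and optimizer state are counted.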

---

## 8. Other Use Cases of BERT

- **Question Answering**: BERT can be used to build systems that answer questions based on a given context.
- **Named Entity Recognition (NER)**: BERT can identify and classify entities in text (e.g., names, dates, locations).
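- **Text Summarization**: BERT is an encoder-only model, so it suits extractive summarization (scoring and selecting sentences from the document) rather than generating new summary text on its own.
- **Machine Translation**: BERT has no decoder and is not a translation model by itself, but it can serve as the encoder or as a source of pre-trained representations inside an encoder-decoder translation system.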

---


In [3]:
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


# Install Required Libraries
- Install the `transformers` library to access the pre-trained BERT model and utilities.
- Install the `datasets` library to load the IMDb dataset conveniently.
- Install `wandb` (Weights and Biases) for tracking training and evaluation metrics.


In [4]:
# Load the IMDb dataset which is pre-split into 'train' and 'test' sets.
dataset = load_dataset("imdb")


Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

# Import Libraries and Load IMDb Dataset
- Import `load_dataset` from the `datasets` library to load the IMDb dataset.
- Import `AutoTokenizer` and `AutoModelForSequenceClassification` from `transformers` to use a pre-trained BERT model for sentiment classification.
- Load the IMDb dataset, which contains movie reviews labeled as positive or negative.


In [5]:
# Define the model name (we're using the base uncased version of BERT)
model_name = "bert-base-uncased"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the pre-trained BERT model for sequence classification with 2 output labels (positive/negative)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
# Define a function to tokenize the text data.
def tokenize_function(examples):
    # Tokenize the "text" field, applying truncation and padding.
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

# Apply the tokenizer to the entire dataset in a batched manner.
tokenized_datasets = dataset.map(tokenize_function, batched=True)


# Preprocess the Dataset
- Load the pre-trained tokenizer (`bert-base-uncased`) for BERT.
- Define a tokenization function to tokenize the text, applying truncation and padding to ensure uniform input length.
- Use the `map()` method to apply tokenization across the entire dataset.
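
What truncation and `max_length` padding do to a single example can be sketched without the tokenizer. The function below is an illustrative stand-in for the tokenizer's behavior, not its implementation: `encode_fixed_length` is a hypothetical helper, the word ids are made-up numbers, and the special-token ids (0 for `[PAD]`, 101 for `[CLS]`, 102 for `[SEP]`) are those of the `bert-base-uncased` vocabulary.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # special-token ids in the bert-base-uncased vocab

def encode_fixed_length(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, and pad the remainder."""
    # Leave room for the two special tokens when truncating.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)     # 1 = real token
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 = padding, ignored by attention
    }

# A short review is padded; a long one is truncated (ids are made-up numbers).
short = encode_fixed_length([2023, 3185], max_length=8)
long = encode_fixed_length(list(range(1000, 1020)), max_length=8)
print(short)
print(long)
```

Padding every review to 512 tokens keeps tensor shapes uniform but spends compute on padding for short reviews; padding each batch only to its longest sequence is a common alternative.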


In [7]:
# Remove the original text column since it is no longer needed
tokenized_datasets = tokenized_datasets.remove_columns(["text"])

# Set the format of the dataset to PyTorch tensors.
tokenized_datasets.set_format("torch")


In [8]:
train_dataset = tokenized_datasets["train"]
test_dataset = tokenized_datasets["test"]


# Prepare Datasets for Training
- Drop the raw `text` column and convert the tokenized dataset to PyTorch tensors.
- Select the predefined `train` and `test` splits of the tokenized IMDb dataset.
- Use these splits for model training and evaluation.


In [9]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Convert logits to predicted class labels
    predictions = np.argmax(logits, axis=-1)
    # Compute accuracy
    acc = accuracy_score(labels, predictions)
    # Compute precision, recall, and F1 score
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average="binary")
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}


# Define Metrics
- Use `accuracy_score` and `precision_recall_fscore_support` from `scikit-learn`.
- Define a `compute_metrics` function that converts logits to predicted classes with `argmax` and returns accuracy, precision, recall, and F1 against the true labels.
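
As a worked example of how these four metrics relate, the sketch below recomputes them from confusion-matrix counts. The counts are not printed anywhere in this notebook: they are back-computed from the epoch-2 results reported in the training output, under the assumption that the IMDb test split contains 12,500 positive and 12,500 negative reviews.

```python
# Confusion-matrix counts back-computed from the reported epoch-2 metrics,
# assuming 12,500 positive and 12,500 negative reviews in the test split.
tp, fn = 11_823, 677   # positive reviews classified correctly / incorrectly
tn, fp = 11_731, 769   # negative reviews classified correctly / incorrectly

accuracy = (tp + tn) / (tp + tn + fp + fn)            # share of all reviews classified correctly
precision = tp / (tp + fp)                            # of predicted positives, how many are right
recall = tp / (tp + fn)                               # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall

print(f"accuracy={accuracy:.5f} precision={precision:.6f} recall={recall:.5f} f1={f1:.6f}")
```

With `average="binary"`, precision, recall, and F1 are computed for the positive class (label 1) only, which is what the formulas above do.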


In [12]:
import wandb
from getpass import getpass
from transformers import TrainingArguments, Trainer

# Ask for the WandB token without echoing it into the notebook output
wandb_token = getpass("Enter your WandB API token: ")

# Login to WandB
wandb.login(key=wandb_token)

# Define training arguments, including WandB integration
training_args = TrainingArguments(
    output_dir="./results",               # Directory for model checkpoints and outputs
    evaluation_strategy="epoch",          # Evaluate at the end of each epoch (named eval_strategy in newer transformers releases)
    learning_rate=2e-5,                   # Learning rate for optimization
    per_device_train_batch_size=8,        # Batch size per device during training
    per_device_eval_batch_size=8,         # Batch size per device during evaluation
    num_train_epochs=2,                   # Number of training epochs
    weight_decay=0.01,                    # Weight decay for regularization
    logging_dir="./logs",                 # Directory for storing logs
    logging_steps=50,                     # Frequency of logging steps
    report_to="wandb",                    # Report metrics to WandB
    save_strategy="epoch",                # Save model checkpoint at the end of each epoch
)

# Initialize the Trainer with the model, training arguments, datasets, and evaluation metrics.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

# Start the training process, WandB will track the metrics automatically
trainer.train()


Enter your WandB API token:  [redacted]


wandb: Currently logged in as: sayan-ft252082 (sayan-ft252082-capgemini). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc




| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.2105 | 0.182105 | 0.93188 | 0.950288 | 0.91144 | 0.930459 |
| 2 | 0.1394 | 0.219952 | 0.94216 | 0.938929 | 0.94584 | 0.942372 |




TrainOutput(global_step=3126, training_loss=0.18700834946684247, metrics={'train_runtime': 3933.4131, 'train_samples_per_second': 12.712, 'train_steps_per_second': 0.795, 'total_flos': 1.3155552768e+16, 'train_loss': 0.18700834946684247, 'epoch': 2.0})

In [14]:
# Evaluate the model
eval_results = trainer.evaluate()
print("Evaluation Results:", eval_results)




Evaluation Results: {'eval_loss': 0.21995195746421814, 'eval_accuracy': 0.94216, 'eval_precision': 0.9389294790343075, 'eval_recall': 0.94584, 'eval_f1': 0.9423720707795312, 'eval_runtime': 477.203, 'eval_samples_per_second': 52.389, 'eval_steps_per_second': 3.275, 'epoch': 2.0}


# Evaluate the Model
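- Evaluate the model on the test dataset with the `evaluate()` method.
- The metrics returned by `compute_metrics` (accuracy, precision, recall, and F1) are logged and displayed; after two epochs the model reaches about 94.2% accuracy and an F1 score of about 0.942 on the IMDb test set.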


# Set Up WandB and Training Arguments
- Prompt the user to enter their WandB API token to log into their account.
- Define training arguments:
  - `output_dir`: Directory for saving checkpoints.
  - `evaluation_strategy`: Evaluate the model at the end of each epoch.
  - `report_to`: Send metrics to WandB.
  - `save_strategy`: Save model checkpoints at the end of each epoch.

# Initialize Trainer and Train Model
- Initialize the `Trainer` object with the model, training arguments, datasets, and metric computation function.
- Call the `train()` method to start training the model.
- During training, WandB will log metrics such as training loss and evaluation accuracy.
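
The number of optimizer steps reported by `train()` follows from the dataset size and batch size. The arithmetic below is a sketch under one assumption that the notebook does not state: that training ran on 2 devices, which is inferred from the logged `global_step=3126` and from the ratio of samples per second to steps per second (about 16 samples per step).

```python
import math

num_train_samples = 25_000     # size of the IMDb training split
per_device_batch_size = 8      # per_device_train_batch_size in TrainingArguments
num_epochs = 2                 # num_train_epochs in TrainingArguments
num_devices = 2                # assumption inferred from the logs, not stated in the notebook

# Samples consumed per optimizer step across all devices.
effective_batch_size = per_device_batch_size * num_devices
# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_train_samples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"effective batch size: {effective_batch_size}")
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
```

With `logging_steps=50`, training loss is therefore logged roughly 62 times over the run.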


In [19]:
# Define a function to classify the sentiment of a given text.
def classify_text(text):
    # Tokenize the input text with the same settings as during training.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding="max_length",
        max_length=512
    )
    
    # Move the inputs to the same device as the model (CPU or GPU).
    device = next(model.parameters()).device
    inputs = {key: value.to(device) for key, value in inputs.items()}
    
    # Switch to evaluation mode (disables dropout) and disable gradient calculation for inference.
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Extract logits and convert them to probabilities using softmax.
    logits = outputs.logits
    probabilities = torch.nn.functional.softmax(logits, dim=1)
    
    # Determine the predicted class: 0 for negative, 1 for positive.
    predicted_class = torch.argmax(probabilities, dim=1).item()
    return predicted_class, probabilities.cpu().numpy()

# Example usage: classify a sample text.
sample_text = "This movie was absolutely phenomenal! The storytelling was captivating, the characters were deeply relatable, and the visuals were stunning. The director's vision was brought to life with such finesse, leaving me inspired and in awe. Highly recommend it—an absolute must-watch!"
predicted_class, probabilities = classify_text(sample_text)
label = "Positive" if predicted_class == 1 else "Negative"

print(f"Input text: {sample_text}")
print(f"Predicted sentiment: {label}")
print(f"Probabilities: {probabilities}")


Input text: This movie was absolutely phenomenal! The storytelling was captivating, the characters were deeply relatable, and the visuals were stunning. The director's vision was brought to life with such finesse, leaving me inspired and in awe. Highly recommend it—an absolute must-watch!
Predicted sentiment: Positive
Probabilities: [[0.00154408 0.9984559 ]]


In [20]:
# Example usage: classify a sample text.
sample_text = "Unfortunately, this movie was a disappointment. The plot felt disjointed, the characters lacked depth, and the pacing dragged on far too long. Even the visuals couldn't make up for the weak storytelling. Not something I’d recommend"
predicted_class, probabilities = classify_text(sample_text)
label = "Positive" if predicted_class == 1 else "Negative"

print(f"Input text: {sample_text}")
print(f"Predicted sentiment: {label}")
print(f"Probabilities: {probabilities}")

Input text: Unfortunately, this movie was a disappointment. The plot felt disjointed, the characters lacked depth, and the pacing dragged on far too long. Even the visuals couldn't make up for the weak storytelling. Not something I’d recommend
Predicted sentiment: Negative
Probabilities: [[0.9989349  0.00106505]]


# Use the Model to Classify New Text
- Define a function `classify_text` to process a given text input:
  - Tokenize the input using the BERT tokenizer.
  - Pass the tokenized input to the model for inference.
  - Use softmax to convert logits into probabilities and determine the predicted class.
- Test the function with a sample text to verify sentiment classification.
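
The final step, turning the model's two logits into probabilities and a label, can be reproduced in isolation. This is a minimal sketch in plain Python: the logit values are hypothetical (the notebook prints only the resulting probabilities), and `predict` is a helper name introduced for this example.

```python
import math

LABELS = {0: "Negative", 1: "Positive"}

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                             # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return the predicted label and the class probabilities."""
    probs = softmax(logits)
    predicted_class = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return LABELS[predicted_class], probs

# Hypothetical logits for one review: [negative score, positive score].
label, probs = predict([-3.2, 3.3])
print(label, [round(p, 4) for p in probs])
```

Because softmax preserves the ordering of the logits, taking the argmax of the probabilities gives the same class as taking the argmax of the logits directly.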
