# Sentiment Analysis with IMDB

**Module 02 | Notebook 2 of 3**

In this notebook, we'll build a complete sentiment analysis system using the IMDB movie review dataset.

## Learning Objectives

By the end of this notebook, you will be able to:
1. Handle large text datasets efficiently
2. Implement proper train/validation splits
3. Use evaluation metrics (accuracy, F1, precision, recall)
4. Analyze model performance and errors

---

In [None]:
%%capture
!pip install transformers datasets accelerate evaluate scikit-learn matplotlib seaborn

In [None]:
import torch
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding
)
from datasets import load_dataset
import evaluate
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

---

## Load and Explore the Dataset

In [None]:
# Load IMDB dataset
dataset = load_dataset("imdb")

print("Dataset structure:")
print(dataset)
print(f"\nTrain examples: {len(dataset['train']):,}")
print(f"Test examples: {len(dataset['test']):,}")

In [None]:
# Explore the data
print("Label distribution:")
train_labels = dataset['train']['label']
pos_count = sum(train_labels)
neg_count = len(train_labels) - pos_count
print(f"  Positive: {pos_count:,} ({pos_count/len(train_labels):.1%})")
print(f"  Negative: {neg_count:,} ({neg_count/len(train_labels):.1%})")

In [None]:
# Analyze text lengths
text_lengths = [len(text.split()) for text in dataset['train']['text']]

print("\nText length statistics (words):")
print(f"  Min: {min(text_lengths)}")
print(f"  Max: {max(text_lengths)}")
print(f"  Mean: {np.mean(text_lengths):.0f}")
print(f"  Median: {np.median(text_lengths):.0f}")

# Plot distribution
plt.figure(figsize=(10, 4))
plt.hist(text_lengths, bins=50, edgecolor='black', alpha=0.7)
plt.axvline(x=256, color='r', linestyle='--', label='256 words (max_length below is 256 tokens)')
plt.xlabel('Number of Words')
plt.ylabel('Count')
plt.title('Distribution of Review Lengths')
plt.legend()
plt.xlim(0, 1000)
plt.show()
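The dashed line in the plot marks 256 words, while the tokenizer limit used later counts subword tokens, which usually outnumber words. As a rough way to judge how much text a cutoff would discard, the sketch below computes the share of reviews longer than a limit. The helper name `share_over_limit` and the toy lengths are hypothetical and are not taken from IMDB.

```python
def share_over_limit(lengths, limit):
    """Return the fraction of lengths strictly greater than limit."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n > limit) / len(lengths)

# Made-up word counts for illustration only (not taken from IMDB).
toy_lengths = [120, 230, 256, 300, 512, 90, 1024, 180]
print(f"Share over 256 words: {share_over_limit(toy_lengths, 256):.1%}")
```

In the notebook, the same helper could be called on `text_lengths`, keeping in mind that a word-based estimate understates how many reviews exceed a token-based limit.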

In [None]:
# Sample reviews
print("Sample Positive Review:")
print("-" * 50)
pos_example = dataset['train'].filter(lambda x: x['label'] == 1)[0]
print(pos_example['text'][:500] + "...")

print("\nSample Negative Review:")
print("-" * 50)
neg_example = dataset['train'].filter(lambda x: x['label'] == 0)[0]
print(neg_example['text'][:500] + "...")

---

## Data Preparation

For faster training in this demo, we'll use a subset. In production, use the full dataset.

In [None]:
# Create subsets for faster training (increase for better results)
train_size = 2000
test_size = 500

# Random sampling: IMDB is balanced, so a shuffled subset is roughly (not exactly) balanced
train_data = dataset['train'].shuffle(seed=42).select(range(train_size))
test_data = dataset['test'].shuffle(seed=42).select(range(test_size))

# Create validation split from training data
train_val_split = train_data.train_test_split(test_size=0.1, seed=42)
train_dataset = train_val_split['train']
val_dataset = train_val_split['test']

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"Test samples: {len(test_data)}")
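Shuffling and selecting draws a random subset, so the class counts are only approximately equal. If an exactly balanced subset is wanted, one option is to sample the same number of indices from each class. The sketch below shows this on a toy label list; `balanced_indices` is a hypothetical helper, not part of the `datasets` library.

```python
import random
from collections import defaultdict

def balanced_indices(labels, per_class, seed=42):
    """Pick `per_class` random indices for each distinct label value."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    chosen = []
    for label in sorted(by_label):
        if len(by_label[label]) < per_class:
            raise ValueError(f"label {label} has fewer than {per_class} examples")
        chosen.extend(rng.sample(by_label[label], per_class))

    rng.shuffle(chosen)  # avoid returning all of one class first
    return chosen

# Toy labels for illustration only.
toy_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
idx = balanced_indices(toy_labels, per_class=3)
print(sorted(idx), [toy_labels[i] for i in idx])
```

Assuming the labels are available as a list, the returned indices could be passed to `dataset['train'].select(...)` in place of `range(train_size)`.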

In [None]:
# Load tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding=False,  # pad per batch later via DataCollatorWithPadding
        truncation=True,
        max_length=256  # IMDB reviews can be long
    )

# Tokenize all datasets
train_tokenized = train_dataset.map(tokenize_function, batched=True)
val_tokenized = val_dataset.map(tokenize_function, batched=True)
test_tokenized = test_data.map(tokenize_function, batched=True)

print("Tokenization complete!")
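`DataCollatorWithPadding`, which the Trainer uses later, pads each training batch only up to the longest sequence in that batch. The sketch below is a simplified model of why this matters: it counts padding tokens when every sequence is padded to one global length versus when each batch is padded to its own maximum. The function name and the toy lengths are hypothetical.

```python
def pad_token_count(lengths, batch_size, dynamic):
    """Count the padding tokens needed for the given sequence lengths.

    dynamic=False pads every sequence to the longest one overall;
    dynamic=True pads each batch to the longest sequence in that batch.
    """
    total = 0
    global_max = max(lengths)
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = max(batch) if dynamic else global_max
        total += sum(target - n for n in batch)
    return total

# Toy token counts for illustration only.
toy_lengths = [12, 40, 256, 18, 22, 30, 64, 16]
static_pad = pad_token_count(toy_lengths, batch_size=4, dynamic=False)
dynamic_pad = pad_token_count(toy_lengths, batch_size=4, dynamic=True)
print(f"Static padding:  {static_pad} pad tokens")
print(f"Dynamic padding: {dynamic_pad} pad tokens")
```

Fewer padding tokens means less wasted computation per step, which is the main reason to leave padding to the collator rather than padding at tokenization time.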

---

## Model Training

In [None]:
# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1}
)

total_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {total_params:,}")

In [None]:
# Define comprehensive metrics
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
precision = evaluate.load("precision")
recall = evaluate.load("recall")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    preds = np.argmax(predictions, axis=1)
    
    return {
        'accuracy': accuracy.compute(predictions=preds, references=labels)['accuracy'],
        'f1': f1.compute(predictions=preds, references=labels)['f1'],
        'precision': precision.compute(predictions=preds, references=labels)['precision'],
        'recall': recall.compute(predictions=preds, references=labels)['recall']
    }
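To make the four metrics concrete, the sketch below computes them by hand from true/false positive and negative counts on a toy example. It assumes label 1 (positive) is the class of interest, which is believed to match the default binary behavior of the `evaluate` metrics used above; `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(preds, labels, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    tn = sum(1 for p, y in pairs if p != positive and y != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy predictions for illustration only: one false negative, one false positive.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_preds = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = binary_metrics(toy_preds, toy_labels)
print(metrics)
```

Precision asks how many predicted positives were correct, recall asks how many actual positives were found, and F1 is their harmonic mean.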

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./imdb_sentiment_model",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    logging_steps=25,
    fp16=torch.cuda.is_available(),
    report_to="none"
)

# Data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
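It can help to translate `warmup_ratio` and `logging_steps` into concrete step counts. The sketch below does the arithmetic for the settings above, assuming a single device, no gradient accumulation, and 1,800 training examples (2,000 minus the 10% validation split). Rounding the warmup step count up is an assumption about how the library converts the ratio; `training_steps` is a hypothetical helper.

```python
import math

def training_steps(num_examples, batch_size, epochs, warmup_ratio):
    """Return (steps_per_epoch, total_steps, warmup_steps).

    Assumes one device and no gradient accumulation; the final
    partial batch of each epoch counts as a full step.
    """
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return steps_per_epoch, total_steps, warmup_steps

per_epoch, total, warmup = training_steps(
    num_examples=1800, batch_size=16, epochs=3, warmup_ratio=0.1
)
print(f"Steps per epoch: {per_epoch}")
print(f"Total steps:     {total}")
print(f"Warmup steps:    {warmup}")
```

With `logging_steps=25`, a run of this length would log the training loss roughly a dozen times, which is enough to see whether the loss is trending down.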

In [None]:
# Train
print("Starting training...")
trainer.train()
print("\nTraining complete!")

---

## Evaluation and Analysis

In [None]:
# Evaluate on test set
test_results = trainer.evaluate(test_tokenized)

print("Test Set Results:")
print("=" * 40)
print(f"Accuracy:  {test_results['eval_accuracy']:.2%}")
print(f"F1 Score:  {test_results['eval_f1']:.2%}")
print(f"Precision: {test_results['eval_precision']:.2%}")
print(f"Recall:    {test_results['eval_recall']:.2%}")

In [None]:
# Get predictions for confusion matrix
predictions = trainer.predict(test_tokenized)
preds = np.argmax(predictions.predictions, axis=1)
true_labels = predictions.label_ids

# Confusion matrix
cm = confusion_matrix(true_labels, preds)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix')
plt.show()

# Classification report
print("\nClassification Report:")
print(classification_report(true_labels, preds, target_names=['Negative', 'Positive']))

---

## Error Analysis

Understanding where the model fails helps improve it.

In [None]:
# Find misclassified examples
test_texts = test_data['text']
misclassified_indices = np.where(preds != true_labels)[0]

print(f"Misclassified: {len(misclassified_indices)} out of {len(true_labels)} ({len(misclassified_indices)/len(true_labels):.1%})")
print("\n" + "=" * 60)
print("Sample Misclassified Reviews:")
print("=" * 60)

for idx in misclassified_indices[:3]:
    true_label = "Positive" if true_labels[idx] == 1 else "Negative"
    pred_label = "Positive" if preds[idx] == 1 else "Negative"
    
    print(f"\nTrue: {true_label} | Predicted: {pred_label}")
    print(f"Text: {test_texts[idx][:300]}...")
    print("-" * 60)
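Listing the first few errors is a start, but the most instructive mistakes are often the ones the model was most confident about. The sketch below converts logits to probabilities with a softmax and ranks misclassified examples by the confidence of the wrong prediction. The helper names and toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confident_errors(logit_rows, labels):
    """Return (index, predicted_label, confidence) for each error,
    sorted from most to least confident."""
    errors = []
    for idx, (row, label) in enumerate(zip(logit_rows, labels)):
        probs = softmax(row)
        pred = probs.index(max(probs))
        if pred != label:
            errors.append((idx, pred, probs[pred]))
    return sorted(errors, key=lambda e: e[2], reverse=True)

# Toy logits for illustration only: rows are [negative, positive] scores.
toy_logits = [[2.0, -1.0], [0.2, 0.1], [-3.0, 3.0], [1.0, 1.5]]
toy_labels = [0, 1, 0, 1]
ranked = confident_errors(toy_logits, toy_labels)
for idx, pred, conf in ranked:
    print(f"example {idx}: predicted {pred} with confidence {conf:.1%}")
```

In the notebook, `predictions.predictions.tolist()` and `true_labels` could be passed in, assuming `predictions.predictions` holds one row of raw logits per test example.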

---

## Interactive Testing

In [None]:
from transformers import pipeline

# Create inference pipeline
sentiment_classifier = pipeline(
    "sentiment-analysis",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

# Test examples
test_reviews = [
    "This movie is a masterpiece. The acting, direction, and story are all perfect.",
    "Absolute garbage. Waste of 2 hours of my life. Avoid at all costs.",
    "It was okay. Some good moments but overall forgettable.",
    "Not as good as the original, but still entertaining enough."
]

print("Live Predictions:")
print("=" * 60)
for review in test_reviews:
    result = sentiment_classifier(review)[0]
    emoji = "😊" if result['label'] == "POSITIVE" else "😠"
    print(f"{emoji} {result['label']} ({result['score']:.2%})")
    print(f"   {review[:60]}...\n")

---

## 🎯 Student Challenge

### Challenge: Improve Model Performance

Try these strategies to improve the model:

In [None]:
# TODO: Experiment with these improvements:

# 1. Use more training data (increase train_size)
# 2. Try a different base model (e.g., "bert-base-uncased", "roberta-base")
# 3. Adjust hyperparameters (learning rate, batch size, epochs)
# 4. Increase max_length for longer reviews

# Track your improvements:
# | Change | Accuracy | F1 |
# |--------|----------|----|
# | Baseline | ... | ... |

# Your solution:


---

## Key Takeaways

1. **Data exploration** is crucial before training
2. **Balanced datasets** make accuracy a meaningful metric and simplify training
3. **Multiple metrics** (accuracy, F1, precision, recall) give a more complete picture than accuracy alone
4. **Error analysis** helps identify model weaknesses
5. **Confusion matrices** visualize classification errors

---

## Next Steps

Continue to `03_summarization.ipynb` for sequence-to-sequence tasks!