In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/movie-review-analysis
%pip install -r requirements.txt

In [None]:

!pip freeze > requirements.txt

# IMDB Sentiment Analysis - Complete Interface

This notebook provides an interface for all project routines:
1. **Data Preprocessing**
2. **Statistics & Visualization**
3. **N-gram Models** (training & generation)
4. **RNN Models** (LSTM/GRU training & generation)
5. **Transformer Models** (training & evaluation)

---

## Overall Project Pipeline

```mermaid
graph TD
    A[IMDB Dataset CSV] --> B[TextPreprocessor]
    B --> C{Tokenization Type}
    C -->|Word| D[Word Tokenization]
    C -->|Subword| E[Subword Tokenization]
    
    D --> F[Statistics & Visualization]
    D --> G[N-gram Models]
    D --> H[RNN Models LSTM/GRU]
    E --> I[Transformer Models]
    
    F --> J[Word Clouds & Scattertext]
    G --> K[Text Generation & Perplexity]
    H --> L[Text Generation & Perplexity]
    I --> M[Sentiment Classification]
    
    M --> N[Evaluation Pipeline]
    N --> O[Metric-based]
    N --> P[Human Evaluation]
    N --> Q[LLM-as-a-Judge]
    
    O --> R[Results & Comparison]
    P --> R
    Q --> R
    
    style A fill:#e1f5ff
    style B fill:#fff4e1
    style J fill:#e8f5e9
    style K fill:#e8f5e9
    style L fill:#e8f5e9
    style M fill:#e8f5e9
    style R fill:#f3e5f5
```

---

## Setup & Configuration

### Project Architecture

```mermaid
graph LR
    subgraph Data Layer
        A[IMDB CSV] --> B[TextPreprocessor]
        B --> C[Train/Val/Test Splits]
    end
    
    subgraph Model Layer
        C --> D[NGramModel]
        C --> E[RNN LSTM/GRU]
        C --> F[Transformer ALBERT]
    end
    
    subgraph Evaluation Layer
        D --> G[Perplexity]
        E --> G
        F --> H[Multi-Perspective Eval]
        H --> I[Metrics]
        H --> J[Human]
        H --> K[LLM Judge]
    end
    
    subgraph Output Layer
        G --> L[out/ directory]
        I --> L
        J --> L
        K --> L
    end
    
    style A fill:#e1f5ff
    style D fill:#fff4e1
    style E fill:#fff4e1
    style F fill:#fff4e1
    style L fill:#e8f5e9
```

In [1]:
# Imports
import os
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

from src.data.preprocessing import TextPreprocessor, SENTIMENT_TO_ID
from src.data.stats import IMDBDataStats
from src.models.ngram import NGramModel
from src.models.nn import RNN, RNNDataModule, get_device
from src.models.transformer import finetune_minilm, TransformerDataset
from src.eval.eval_transformer import evaluate_transformer

%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')

    You are using a Python version 3.9 past its end of life. Google will update
    google-auth with critical bug fixes on a best-effort basis, but not
    with any other fixes or features. Please upgrade your Python version,
    and then update google-auth.
    

All support for the `google.generativeai` package has ended. It will no longer be receiving 
updates or bug fixes. Please switch to the `google.genai` package as soon as possible.
See README for more details:

https://github.com/google-gemini/deprecated-generative-ai-python/blob/main/README.md

  import google.generativeai as genai


In [2]:
import numpy as np
from typing import Tuple


def split_dataset_indices(
    dataset_length: int,
    train_ratio: float = 0.8,
    val_ratio: float = 0.1,
    test_ratio: float = 0.1,
    n_selected_test: int = 100,
    seed: int = 42,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """
    Split dataset indices into train, validation, test, and a selected test subset.

    Returns:
        train_idx, val_idx, test_idx, selected_test_idx
    """
    assert (
        abs(train_ratio + val_ratio + test_ratio - 1.0) < 1e-6
    ), "Ratios must sum to 1"

    rng = np.random.default_rng(seed)
    indices = rng.permutation(dataset_length)

    train_end = int(dataset_length * train_ratio)
    eval_end = train_end + int(dataset_length * val_ratio)

    train_idx = indices[:train_end]
    val_idx = indices[train_end:eval_end]
    test_idx = indices[eval_end:]

    # Select subset from test indices
    n_selected = min(n_selected_test, len(test_idx))
    selected_test_idx = rng.choice(test_idx, size=n_selected, replace=False)

    return train_idx, val_idx, test_idx, selected_test_idx

In [3]:
# Configuration
DATA_PATH = "dataset/imdb-dataset.csv"
SAMPLE_SIZE = 1000  # Reduce for faster experimentation
DEVICE = get_device()

train_idx, val_idx, test_idx, selected_test_idx = split_dataset_indices(SAMPLE_SIZE)

# Save the rows at selected_test_idx to a CSV file (reused by the evaluation cells)
selected_test_path = f"dataset/imdb-test-subsample-100_{SAMPLE_SIZE}.csv"
selected_test_df = pd.read_csv(DATA_PATH).iloc[selected_test_idx]
selected_test_df.to_csv(selected_test_path, index=False)

# selected_test_idx is a subset of test_idx
assert set(selected_test_idx).issubset(set(test_idx))

print(f"Using device: {DEVICE}")
print(f"Sample size: {SAMPLE_SIZE}")

Using device: mps
Sample size: 1000


---
## 1. Data Preprocessing

Load and preprocess the IMDB dataset with different tokenization strategies.

### Preprocessing Pipeline

```mermaid
graph LR
    A[Raw CSV] --> B[Load & Sample]
    B --> C[Clean HTML Tags]
    C --> D[Lowercase]
    D --> E[Expand Contractions]
    E --> F[Remove Punctuation]
    F --> G{Tokenization}
    G -->|Word| H[NLTK word_tokenize]
    G -->|Subword| I[HuggingFace Tokenizer]
    H --> J[Word Lists]
    I --> K[Token IDs]
    J --> L[Train/Val/Test Split]
    K --> L
    L --> M[Ready for Models]
    
    style A fill:#e1f5ff
    style G fill:#fff4e1
    style M fill:#e8f5e9
```
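
The cleaning steps in the diagram can be sketched in plain Python. This is only an illustration and not the `TextPreprocessor` implementation: the `CONTRACTIONS` table and the `clean_review` helper are hypothetical, and a whitespace split stands in for NLTK's `word_tokenize`.

```python
import re
import string

# Hypothetical, deliberately tiny contraction table (illustration only).
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}


def clean_review(text):
    """Apply the diagram's cleaning steps in order and return word tokens."""
    text = re.sub(r"<[^>]+>", " ", text)      # 1. strip HTML tags such as <br />
    text = text.lower()                       # 2. lowercase
    for short, full in CONTRACTIONS.items():  # 3. expand contractions
        text = text.replace(short, full)
    # 4. remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 5. whitespace split (assumed stand-in for NLTK word_tokenize)
    return text.split()


print(clean_review("It's great!<br />Don't miss it."))
```

Contractions are expanded before punctuation is removed; in the opposite order the apostrophes would already be gone and the lookup would never match.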

In [4]:
# Word-based tokenization (for stats, ngrams, RNNs)
preprocessor_word = TextPreprocessor(
    DATA_PATH,
    sample_size=SAMPLE_SIZE,
    tokenizer_type="word"
)

df_word = preprocessor_word.load_data(remove_stopwords=True)  # Stopwords removed for cleaner statistics
train_df, val_df, test_df = preprocessor_word.get_splits(train_index=train_idx, val_index=val_idx, test_index=test_idx)

print("\nDataset loaded successfully!")
display(df_word.head())

Loading data from dataset/imdb-dataset.csv...
Loaded 1000 reviews (positive: 476, negative: 524)

Dataset loaded successfully!


Unnamed: 0,review,sentiment,_words,_word_count,_char_len
0,really liked summerslam due look arena curtain...,positive,"[really, liked, summerslam, due, look, arena, ...",117,752
1,many television shows appeal quite many differ...,positive,"[many, television, shows, appeal, quite, many,...",191,1354
2,film quickly gets major chase scene ever incre...,negative,"[film, quickly, gets, major, chase, scene, eve...",64,424
3,jane austen would definitely approve one gwyne...,positive,"[jane, austen, would, definitely, approve, one...",61,431
4,expectations somewhat high went see movie thou...,negative,"[expectations, somewhat, high, went, see, movi...",164,1173


---
## 2. Statistics & Visualization

Compute statistics and visualize the dataset.
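
As a rough sketch of what these statistics mean, the snippet below computes a class distribution, vocabulary size, average word count, and most frequent words with `collections.Counter`. The toy `reviews` list is made up, and this is not the `IMDBDataStats` implementation, whose internals are not shown in this notebook.

```python
from collections import Counter

# Toy tokenized reviews (made-up data, not taken from the IMDB CSV).
reviews = [
    (["great", "movie", "great", "cast"], "positive"),
    (["boring", "movie"], "negative"),
    (["great", "fun"], "positive"),
]

# Number of reviews per sentiment label.
class_distribution = Counter(label for _, label in reviews)

# Word frequencies over the whole corpus; the vocabulary is the set of distinct words.
word_counts = Counter(word for words, _ in reviews for word in words)
vocabulary_size = len(word_counts)

# Mean number of tokens per review.
average_word_count = sum(len(words) for words, _ in reviews) / len(reviews)

print(dict(class_distribution))
print(vocabulary_size)
print(round(average_word_count, 2))
print(word_counts.most_common(2))
```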

In [None]:
# Compute statistics
stats_computer = IMDBDataStats(df_word)
stats = stats_computer.get_full_stats()

print("=" * 50)
print("DATASET STATISTICS")
print("=" * 50)

print(f"\nClass Distribution: {stats['class_distribution']}")
print(f"\nVocabulary Size: {stats['vocabulary_size']}")
print(f"\nAverage Word Count: {stats['average_word_count']}")
print(f"\nMost Frequent Words (Overall):")
for word, count in stats['most_frequent_words']['overall']:
    print(f"  {word}: {count}")

In [None]:
# Visualize word clouds
stats_computer.visualize_word_clouds(save_path="out/wordclouds.png")

In [None]:
# Generate scattertext visualization (interactive HTML)
stats_computer.visualize_scattertext(output_html="out/scattertext.html")
print("Open out/scattertext.html in your browser to view the interactive visualization.")

---
## 3. N-gram Language Models

Train bigram and trigram models, generate text, and compute perplexity.

### N-gram Pipeline

```mermaid
graph TD
    A[Tokenized Text] --> B[Build N-gram Counts]
    B --> C{N-gram Type}
    C -->|n=2| D[Bigram Model]
    C -->|n=3| E[Trigram Model]
    
    D --> F[Laplace Smoothing]
    E --> F
    
    F --> G[Probability Estimation]
    G --> H[Text Generation]
    G --> I[Perplexity Computation]
    
    H --> J[Sample Sentences]
    I --> K[Model Quality Metric]
    
    style A fill:#e1f5ff
    style F fill:#fff4e1
    style J fill:#e8f5e9
    style K fill:#e8f5e9
```
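
The counting, Laplace smoothing, and perplexity steps in the diagram can be sketched for the bigram case as follows. This is a minimal illustration under stated assumptions, not the `NGramModel` code: the helper names and the `<s>`/`</s>` padding tokens are hypothetical choices.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over sentences padded with <s> and </s>."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        vocab.update(tokens)
        for prev, curr in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, curr)] += 1
    return context_counts, bigram_counts, vocab


def laplace_prob(prev, curr, context_counts, bigram_counts, vocab_size):
    """Add-one estimate: (count(prev, curr) + 1) / (count(prev) + V)."""
    return (bigram_counts[(prev, curr)] + 1) / (context_counts[prev] + vocab_size)


def perplexity(sentences, context_counts, bigram_counts, vocab_size):
    """exp of the average negative log-probability per predicted token."""
    log_prob, n_tokens = 0.0, 0
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            p = laplace_prob(prev, curr, context_counts, bigram_counts, vocab_size)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


train = [["good", "movie"], ["good", "film"], ["bad", "movie"]]
counts, bigrams, vocab = train_bigram(train)
V = len(vocab)

print(laplace_prob("good", "movie", counts, bigrams, V))
print(perplexity([["good", "movie"]], counts, bigrams, V))  # word order seen in training
print(perplexity([["movie", "good"]], counts, bigrams, V))  # unseen word order
```

Smoothing keeps unseen bigrams from receiving zero probability, which would otherwise make the perplexity infinite. A word order that occurs in the training data should still score a lower perplexity than an unseen one.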

In [None]:
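# Reload the data with stopwords kept (language models need them), then re-split
# with the same indices so the training data stays separate from the val/test data
df_lm = preprocessor_word.load_data(remove_stopwords=False)
train_df, val_df, test_df = preprocessor_word.get_splits(train_index=train_idx, val_index=val_idx, test_index=test_idx)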

In [None]:
# Train Bigram Model
print("Training Bigram Model...")
bigram_model = NGramModel(n=2, laplace_smoothing=True)
bigram_model.train(train_df["_words"])

print("\n--- Bigram Generation Examples ---")
for i in range(3):
    print(f"{i+1}. {bigram_model.generate_sentence(max_length=20)}")

print("\nComputing perplexity...")
bigram_perplexity = bigram_model.compute_perplexity(test_df["_words"])
print(f"Bigram Test Perplexity: {bigram_perplexity:.2f}")

In [None]:
# Train Trigram Model
print("Training Trigram Model...")
trigram_model = NGramModel(n=3, laplace_smoothing=True)
trigram_model.train(train_df["_words"])

print("\n--- Trigram Generation Examples ---")
for i in range(3):
    print(f"{i+1}. {trigram_model.generate_sentence(max_length=20)}")

print("\nComputing perplexity...")
trigram_perplexity = trigram_model.compute_perplexity(test_df["_words"])
print(f"Trigram Test Perplexity: {trigram_perplexity:.2f}")

---
## 4. RNN Models (LSTM/GRU)

Train recurrent neural networks for language modeling and text generation.

### RNN Training Pipeline

```mermaid
graph TD
    A[Word Tokens] --> B[Build Vocabulary]
    B --> C[Token to ID Mapping]
    C --> D[Create DataLoader]
    D --> E[RNN Model]
    
    E --> I[Embedding Layer]
    I --> F{RNN Type}
    F -->|LSTM| G[LSTM Layers]
    F -->|GRU| H[GRU Layers]
    
    G --> J[Hidden States]
    H --> J
    J --> K[Output Layer]
    
    K --> L[Cross-Entropy Loss]
    L --> M[Backpropagation]
    M --> N{Converged?}
    N -->|No| E
    N -->|Yes| O[Trained Model]
    
    O --> P[Text Generation]
    O --> Q[Perplexity Evaluation]
    
    style A fill:#e1f5ff
    style E fill:#fff4e1
    style O fill:#e8f5e9
```
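
The first steps of the diagram (building a vocabulary and mapping tokens to IDs) can be sketched without PyTorch. The `build_vocab` and `encode` helpers below are hypothetical and are not the `RNNDataModule` methods. The special-token IDs are assumptions: `<pad>` = 0 lines up with `ignore_index=0` and `<s>` = 2 with the default used in the generation cell, while `<unk>` and `</s>` are guesses.

```python
from collections import Counter

# Assumed special tokens; only <pad>=0 and <s>=2 are suggested by the notebook.
SPECIALS = ["<pad>", "<unk>", "<s>", "</s>"]


def build_vocab(token_lists, min_freq=1):
    """Map every word seen at least min_freq times to an integer ID."""
    counts = Counter(word for tokens in token_lists for word in tokens)
    vocab = {token: idx for idx, token in enumerate(SPECIALS)}
    for word, freq in sorted(counts.items()):  # sorted for reproducible IDs
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab


def encode(tokens, vocab, max_seq_length):
    """Wrap in <s>/</s>, map unknown words to <unk>, then truncate and pad."""
    unk = vocab["<unk>"]
    ids = [vocab["<s>"]] + [vocab.get(t, unk) for t in tokens] + [vocab["</s>"]]
    ids = ids[:max_seq_length]
    return ids + [vocab["<pad>"]] * (max_seq_length - len(ids))


corpus = [["good", "movie"], ["good", "plot"], ["bad", "movie"]]
vocab = build_vocab(corpus, min_freq=2)
print(vocab)
print(encode(["good", "bad", "movie"], vocab, max_seq_length=8))
```

With `min_freq=2`, the words `bad` and `plot` fall below the threshold and are encoded as `<unk>`. A higher `min_freq` shrinks the output layer at the cost of more unknown tokens.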

In [None]:
# Build vocabulary
data_module = RNNDataModule()
data_module.build_vocab(train_df["_words"].tolist(), min_freq=5) # train_df with stopwords

# Create dataloaders
train_loader = data_module.get_dataloader(
    df=train_df,
    batch_size=8,
    shuffle=True,
    max_seq_length=128,
    device=DEVICE
)

val_loader = data_module.get_dataloader(
    df=val_df,
    batch_size=8,
    shuffle=False,
    max_seq_length=128,
    device=DEVICE
)

test_loader = data_module.get_dataloader(
    df=test_df,
    batch_size=8,
    shuffle=False,
    max_seq_length=128,
    device=DEVICE
)

print(f"Vocabulary size: {len(data_module.vocab)}")
print(f"Train batches: {len(train_loader)}")

In [None]:
# Train LSTM Model
print("Training LSTM Model...")

lstm_model = RNN(
    vocab_size=len(data_module.vocab),
    embedding_dim=128,
    hidden_dim=256,
    rnn_type='LSTM',
    device=DEVICE
)

criterion = torch.nn.CrossEntropyLoss(ignore_index=0)  # padding_idx=0
optimizer = torch.optim.AdamW(lstm_model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=2)

lstm_model.train_loop(
    train_loader,
    criterion,
    optimizer,
    scheduler,
    num_epochs=3,  # Reduce for demo
    accumulation_steps=4
)

# Save model
lstm_model.save_model("out/lstm_imdb_model.pth")
print("Model saved to out/lstm_imdb_model.pth")

In [None]:
# Generate text with LSTM
print("\n--- LSTM Generation Examples ---")
start_token_id = data_module.vocab.get("<s>", 2)

for i in range(3):
    generated_sequence = lstm_model.generate(
        start_token=start_token_id,
        max_length=30,
        temperature=0.8
    )
    generated_text = data_module.decode_sequence(generated_sequence)
    print(f"{i+1}. {generated_text}")

# Compute perplexity
lstm_perplexity = lstm_model.compute_perplexity(test_loader)
print(f"\nLSTM Test Perplexity: {lstm_perplexity:.2f}")

In [None]:
# Train GRU Model (optional)
print("Training GRU Model...")

gru_model = RNN(
    vocab_size=len(data_module.vocab),
    embedding_dim=128,
    hidden_dim=256,
    rnn_type='GRU',
    device=DEVICE
)

optimizer = torch.optim.AdamW(gru_model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=2)

gru_model.train_loop(
    train_loader,
    criterion,
    optimizer,
    scheduler,
    num_epochs=3,
    accumulation_steps=4
)

gru_model.save_model("out/gru_imdb_model.pth")
print("Model saved to out/gru_imdb_model.pth")

In [None]:
# Generate text with GRU
print("\n--- GRU Generation Examples ---")

for i in range(3):
    generated_sequence = gru_model.generate(
        start_token=start_token_id,
        max_length=30,
        temperature=0.8
    )
    generated_text = data_module.decode_sequence(generated_sequence)
    print(f"{i+1}. {generated_text}")

# Compute perplexity
gru_perplexity = gru_model.compute_perplexity(test_loader)
print(f"\nGRU Test Perplexity: {gru_perplexity:.2f}")

---
## 5. Transformer Models (Fine-tuning)

Fine-tune pre-trained transformers (ALBERT) on sentiment classification.

### Transformer Fine-tuning Pipeline

```mermaid
graph TD
    A[IMDB Dataset] --> B[Subword Tokenization]
    B --> C{Tokenizer Type}
    C -->|WordPiece| D[BERT/MiniLM Tokenizer]
    C -->|BPE| E[RoBERTa Tokenizer]
    C -->|Unigram| F[ALBERT Tokenizer]
    
    D --> G[Token IDs + Attention Masks]
    E --> G
    F --> G
    
    G --> H[Pre-trained Model]
    H --> I[Add Classification Head]
    I --> J[Fine-tuning]
    
    J --> K[Training Loop]
    K --> L[Gradient Accumulation]
    L --> M[AdamW Optimizer]
    M --> N{Epoch Complete?}
    N -->|No| K
    N -->|Yes| O[Save Model]
    
    O --> P[Evaluation]
    
    style A fill:#e1f5ff
    style C fill:#fff4e1
    style O fill:#e8f5e9
```
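
The gradient accumulation step in the diagram can be illustrated with a one-parameter model in plain Python. This is a toy sketch with made-up data, not the training code behind `finetune_minilm`: it shows that summing micro-batch gradients, each scaled by `1 / accumulation_steps`, yields the same update as one step on the full batch when all micro-batches have the same size.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Made-up regression data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w, lr = 0.0, 0.01
accumulation_steps, micro_batch_size = 4, 2

# One optimizer step on the full batch of 8 examples.
w_full = w - lr * grad_mse(w, xs, ys)

# The same step assembled from 4 micro-batches of 2 examples each.
accumulated = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    end = start + micro_batch_size
    # Scale each micro-batch gradient so the sum equals the full-batch mean.
    accumulated += grad_mse(w, xs[start:end], ys[start:end]) / accumulation_steps
w_accum = w - lr * accumulated

print(w_full, w_accum)
```

The equivalence is exact only for equally sized micro-batches and a loss averaged per example. With padded sequences and a per-token mean, the accumulated update is an approximation of the full-batch one.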

In [6]:

print("Fine-tuning Transformer...")

transformer_outputs = finetune_minilm(
    data_path=DATA_PATH,
    sample_size=SAMPLE_SIZE,
    output_dir="out/minilm_imdb_model",
    tokenization="unigram",  # Options: "wordpiece", "bpe", "unigram", "all"
    epochs=1,
    train_index=train_idx,
    val_index=val_idx,
    test_index=test_idx
)

print("\nTransformer training complete!")
print("Model saved to out/minilm_imdb_model")
print(transformer_outputs)

Fine-tuning Transformer...
Using device: mps

Vocabulary size for unigram tokenizer: 30000


Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Gradient checkpointing not supported for this model
Loading data from dataset/imdb-dataset.csv...
Loaded 1000 reviews (positive: 476, negative: 524)
Dataset: 800 samples, avg_len=223.6, max_len=320
Dataset: 100 samples, avg_len=232.3, max_len=320


Dataset: 100 samples, avg_len=222.7, max_len=320


UNIGRAM Test results: {'eval_loss': 0.5329983234405518, 'eval_runtime': 10.4234, 'eval_samples_per_second': 9.594, 'eval_steps_per_second': 0.192, 'epoch': 0.96}

Transformer training complete!
Model saved to out/minilm_imdb_model
{'unigram': {'trainer': <transformers.trainer.Trainer object at 0x31a758880>, 'model': AlbertForSequenceClassification(
  (albert): AlbertModel(
    (embeddings): AlbertEmbeddings(
      (word_embeddings): Embedding(30000, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0, inplace=False)
    )
    (encoder): AlbertTransformer(
      (embedding_hidden_mapping_in): Linear(in_features=128, out_features=768, bias=True)
      (albert_layer_groups): ModuleList(
        (0): AlbertLayerGroup(
          (albert_layers): ModuleList(
            (0): AlbertLayer(
              (full_layer_layer_norm): 

---
## 6. Transformer Evaluation

Evaluate the fine-tuned transformer from three perspectives:
- Standard metrics (accuracy, F1, confusion matrix, perplexity)
- Human evaluation (a CSV export for manual annotation)
- LLM-as-a-judge (optional, requires an API key)
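
As a sketch of how the standard metrics relate to one another, the snippet below derives accuracy and F1 from a 2x2 confusion matrix and perplexity from a cross-entropy loss. The `[[TN, FP], [FN, TP]]` layout (rows are gold labels, columns are predictions, class 1 is positive) and the reading of perplexity as `exp(eval_loss)` are assumptions; `metrics_from_confusion` is a hypothetical helper, not part of `evaluate_transformer`. The inputs are the unigram test confusion matrix and `eval_loss` printed in this notebook's outputs.

```python
import math


def metrics_from_confusion(cm):
    """Return (accuracy, F1) for cm = [[TN, FP], [FN, TP]] (assumed layout)."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# Values copied from the unigram test results shown in this notebook.
accuracy, f1 = metrics_from_confusion([[55, 3], [13, 29]])
perplexity = math.exp(0.5329983234405518)  # assumed: perplexity = exp(eval_loss)

print(f"Accuracy: {accuracy:.4f}  F1: {f1:.4f}  Perplexity: {perplexity:.4f}")
```

Under these assumptions the results agree with the reported unigram test metrics (accuracy 0.8400, F1 0.7838, perplexity 1.7040), which suggests the perplexity there is the exponential of the classification loss.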

### Three-Part Evaluation Pipeline

```mermaid
graph TD
    A[Trained Model] --> B[Random 100 Test Samples]
    
    B --> C[Part 1: Metric-based]
    B --> D[Part 2: Human Evaluation]
    B --> E[Part 3: LLM-as-a-Judge]
    
    C --> F[Accuracy]
    C --> G[F1 Score]
    C --> H[Perplexity]
    C --> I[Confusion Matrix]
    
    D --> J[CSV Export]
    J --> K[Manual Annotation]
    K --> L[Human Ratings]
    
    E --> M[Llama 3.1 8B]
    M --> N[LLM Judgments]
    
    F --> O[Agreement Analysis]
    G --> O
    H --> O
    I --> O
    L --> O
    N --> O
    
    O --> P[Compare All Evaluations]
    P --> Q[Disagreement Examples]
    P --> R[Final Report]
    
    style A fill:#e1f5ff
    style C fill:#fff4e1
    style D fill:#fff4e1
    style E fill:#fff4e1
    style R fill:#e8f5e9
```
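
For the agreement analysis step, one common statistic is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The notebook does not state which statistic `evaluate_transformer` reports, so treat this as an illustrative sketch; the `cohen_kappa` helper and the labels below are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    assert len(labels_a) == len(labels_b) and len(labels_a) > 0
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Probability that both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for illustration; not results from this notebook.
human = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
llm = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]

kappa = cohen_kappa(human, llm)
# Indices where the two sources disagree, for manual inspection.
disagreements = [i for i, (h, j) in enumerate(zip(human, llm)) if h != j]

print(kappa, disagreements)
```

Here the raw agreement is 6 of 8 items (0.75) and the chance agreement is 0.5, giving a kappa of 0.5. The listed indices are the candidates for the disagreement examples in the diagram.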

In [None]:
# Optional: set the Google Gemini API key for the LLM judge before running
# (no key is set in this cell)
import os

# Evaluate on the 100-review subsample of the test set selected in the configuration cell
print(selected_test_idx)

for tokenization in ["wordpiece"]:
    model_dir = f"out/minilm_imdb_model_{tokenization}"
    print(f"\nEvaluating Transformer model ({tokenization})...")
    evaluate_transformer(
        model_dir=model_dir,
        tokenizer_type=tokenization,
        train_index=train_idx,
        eval_index=val_idx,
        test_index=selected_test_idx,
        test_path=selected_test_path,
        data_path=DATA_PATH,
        sample_size=SAMPLE_SIZE,
        skip_human_eval=False,  # Set to False to generate CSV for human annotation
        skip_llm_judge=False,  # Set to False to use LLM judge
    )

[503 182 622 117 333 791 314 586 637 927 451 982 287 548 137 141  15 634
 609 675 526 300 752 286  10 212 275  57 195 648 102  69 611 821 967 317
 534 948 437 553 973 432 932 809 689  41 720 970  42  49 518 167 535 528
  63 910 366 706 651 272 313 621 943 617 908 628 669 641 702 490 828 290
 502 296 832 583 749   1 779 177 303 961 547 837 736 529 974 372 939 500
 543 672  76 491 735 555  34 113 410 687]

Evaluating Transformer model (wordpiece)...
Using device: mps

Loading model...
Transformer vocabulary size: 30522

Loading data...
Loading data from dataset/imdb-dataset.csv...
Loaded 1000 reviews (positive: 476, negative: 524)
Dataset: 800 samples, avg_len=221.2, max_len=320
Dataset: 100 samples, avg_len=230.0, max_len=320
Dataset: 100 samples, avg_len=220.8, max_len=320

PART 1: METRIC-BASED EVALUATION (Full Splits)

Train Metrics:
------------------------------
Accuracy: 0.6488
F1 Score: 0.7347
Perplexity: 1.9943
Confusion Matrix:
[[130 278]
 [  3 389]]

Validation Metrics:
-------

In [12]:
for tokenization in ["unigram"]:
    model_dir = "out/minilm_imdb_model"
    print(f"\nEvaluating Transformer model ({tokenization})...")
    evaluate_transformer(
        model_dir=model_dir,
        tokenizer_type=tokenization,
        train_index=train_idx,
        eval_index=val_idx,
        test_index=selected_test_idx,
        test_path=selected_test_path,
        data_path=DATA_PATH,
        sample_size=SAMPLE_SIZE,
        skip_human_eval=False,  # Set to False to generate CSV for human annotation
        skip_llm_judge=False,  # Set to False to use LLM judge
    )


Evaluating Transformer model (unigram)...
Using device: mps

Loading model...
Transformer vocabulary size: 30000

Loading data...
Loading data from dataset/imdb-dataset.csv...
Loaded 1000 reviews (positive: 476, negative: 524)
Dataset: 800 samples, avg_len=223.6, max_len=320
Dataset: 100 samples, avg_len=232.3, max_len=320
Dataset: 100 samples, avg_len=222.7, max_len=320

PART 1: METRIC-BASED EVALUATION (Full Splits)

Train Metrics:
------------------------------
Accuracy: 0.8125
F1 Score: 0.7869
Perplexity: 1.7115
Confusion Matrix:
[[373  35]
 [115 277]]

Validation Metrics:
------------------------------
Accuracy: 0.8200
F1 Score: 0.7568
Perplexity: 1.7154
Confusion Matrix:
[[54  4]
 [14 28]]

Test Metrics:
------------------------------
Accuracy: 0.8400
F1 Score: 0.7838
Perplexity: 1.7040
Confusion Matrix:
[[55  3]
 [13 29]]

PART 2: SUBSAMPLE (100 Test Instances)

Generating predictions for 100 samples...

Subsample Metrics (using gold labels):
  Accuracy: 0.9000
  F1 Score: 0.893

---
## Notes

- **Adjust `SAMPLE_SIZE`** in the configuration cell to control the dataset size
- **Training times** vary with model complexity and hardware
- **Transformer evaluation** supports LLM-as-a-judge through the free OpenRouter API
- The LSTM, GRU, and transformer models are saved to disk and can be reloaded

### File Outputs (saved in `out/` directory)
- `out/wordclouds.png` - Word cloud visualizations
- `out/scattertext.html` - Interactive term frequency visualization
- `out/lstm_imdb_model.pth` - Trained LSTM model
- `out/gru_imdb_model.pth` - Trained GRU model
- `out/minilm_imdb_model/` - Fine-tuned transformer model
- `out/evaluation_results.csv` - Evaluation results for manual annotation