# Movie Review Sentiment and Text Classification Analysis

Date: 12/12/2025

Team Members:
- Karrie Butcher
- Nicko Lomelin
- Thanh Tuan Pham

Dataset: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

The dataset is designed for determining the sentiment of movie reviews (labeled as 'positive' or 'negative') based on unstructured text data, specifically, raw word sequences representing narrative context, tone, and viewer opinion. The primary business value of building a predictive model on this data is to enable automated opinion mining and large-scale audience sentiment analysis. In a media streaming or film production context, an accurate model allows for the development of "intelligent" feedback systems capable of distinguishing between favorable audience engagement and critical backlash. This distinction allows for real-time aggregation of viewer satisfaction or automated content moderation, significantly reducing the manual effort required to analyze thousands of daily user reviews.

Pre-trained word embeddings (GloVe): https://www.kaggle.com/datasets/danielwillgeorge/glove6b100dtxt

Prediction Task: **Binary Sequence Classification (Many-to-One)**

We will investigate and compare at least two sequential network architectures: a Recurrent Neural Network (specifically an LSTM) and a Transformer. We will predict the sentiment (the target variable) as one of two distinct categories ('positive' or 'negative'). This investigation will involve utilizing **pre-trained GloVe embedding layers** to capture semantic meaning, tuning hyperparameters to improve generalization, and specifically examining the impact of stacking a second multi-headed self-attention layer within the Transformer architecture to better capture complex dependencies in the review text. Finally, we will extend our analysis by evaluating the performance of our best model when utilizing **ConceptNet Numberbatch embeddings** compared to our **baseline GloVe embeddings**.

---
## 1. Preparation

****
### **1.1 Data Preparation and Preprocessing**

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder

file_path = Path.home() / "Downloads" / "imdb-dataset.csv"

# load into pandas
df = pd.read_csv(file_path)

print("Initial shape:", df.shape)
df.head()
df.info()

#Clean Text Data
def clean_text(text):
    text = re.sub(r'<br\s*/?>', ' ', text) # Replace break tags with space
    text = re.sub(r'[^a-zA-Z\s]', '', text) # Remove special characters/punctuation
    return text.lower()

df['review_cleaned'] = df['review'].apply(clean_text)

# Define and prepare the class variable (label encoding)
# LabelEncoder sorts classes alphabetically: 'negative' -> 0, 'positive' -> 1
le = LabelEncoder()
df['sentiment_encoded'] = le.fit_transform(df['sentiment'])
y = df['sentiment_encoded'].values

# Tokenization and Sequence Padding
# We need to turn words into integers for the Embedding layer.

# Hyperparameters
MAX_VOCAB_SIZE = 20000  # Keep roughly the 20,000 most frequent words (Keras keeps indices below num_words; index 1 is the OOV token)
MAX_SEQUENCE_LENGTH = 300  # Cut off reviews after 300 tokens

tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE, oov_token="<OOV>")
tokenizer.fit_on_texts(df['review_cleaned'])
sequences = tokenizer.texts_to_sequences(df['review_cleaned'])

# Padding and Truncating sequences to force a specific length
X = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post', truncating='post')

print(f"Original Data Shape: {df.shape}")
print(f"Final X Shape (Input Matrix): {X.shape}")
print(f"Final y Shape (Target Vector): {y.shape}")
print(f"Vocabulary Size: {len(tokenizer.word_index)}")
print(f"Example Review (Integers): {X[0][:10]}...")
print(f"Example Sentiment: {y[0]} ({le.inverse_transform([y[0]])[0]})")

# Load Pre-trained Word Embeddings (GloVe)
glove_file = Path.home() / "Downloads" / "glove.6B.100d.txt" 

embeddings_index = {}
embedding_dim = 100 

print("Loading GloVe vectors...")
try:
    with open(glove_file, encoding="utf8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    print(f"Found {len(embeddings_index)} word vectors.")
except FileNotFoundError:
    print("GloVe file not found. Will skip pre-trained embeddings.")

# Embedding Matrix Creation
# This maps our specific vocabulary (from the Tokenizer) to the GloVe vectors
embedding_matrix = np.zeros((MAX_VOCAB_SIZE, embedding_dim))

# Rows for words missing from GloVe (and row 0, the padding index) stay all-zeros.
for word, i in tokenizer.word_index.items():
    if i < MAX_VOCAB_SIZE:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

print(f"Embedding Matrix Shape: {embedding_matrix.shape}") # embedding_matrix is ready to be passed to the Embedding layer (in Modeling)



Initial shape: (50000, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB
Original Data Shape: (50000, 4)
Final X Shape (Input Matrix): (50000, 300)
Final y Shape (Target Vector): (50000,)
Vocabulary Size: 163304
Example Review (Integers): [  28    5    2   76 1928   45 1056   12  101  144]...
Example Sentiment: 1 (positive)
Loading GloVe vectors...
Found 400000 word vectors.
Embedding Matrix Shape: (20000, 100)


Final Variable Structure:

X (Input Matrix): A matrix of shape (50000, 300) containing integer-encoded sequences.

y (Target Vector): A vector of shape (50000,) containing binary labels (0 for negative, 1 for positive).

embedding_matrix: A matrix of shape (20000, 100) containing pre-trained GloVe vectors.
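
Because words missing from GloVe keep an all-zero row in `embedding_matrix`, it can be useful to check how much of the capped vocabulary actually received a pre-trained vector. The following is a minimal sketch, not part of the original notebook: the helper name `embedding_coverage` and the toy dictionaries are hypothetical stand-ins for `tokenizer.word_index` and `embeddings_index`.

```python
import numpy as np

def embedding_coverage(word_index, embeddings_index, max_vocab):
    """Return (hits, total) for words whose integer index is below max_vocab."""
    kept = [word for word, idx in word_index.items() if idx < max_vocab]
    hits = sum(1 for word in kept if word in embeddings_index)
    return hits, len(kept)

# Toy stand-ins (hypothetical data, not the real tokenizer or GloVe file)
toy_word_index = {"<OOV>": 1, "the": 2, "movie": 3, "grrreat": 4, "plot": 5}
toy_embeddings = {
    "the": np.zeros(100),
    "movie": np.zeros(100),
    "plot": np.zeros(100),
}

hits, total = embedding_coverage(toy_word_index, toy_embeddings, max_vocab=5)
print(f"{hits}/{total} kept words have a pre-trained vector")
```

On the real data the call would presumably be `embedding_coverage(tokenizer.word_index, embeddings_index, MAX_VOCAB_SIZE)`; a low ratio would suggest that the cleaning step is producing many tokens GloVe does not know.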

### 1.1.2 Tokenization and Sequence Length Strategy

#### Methods of Tokenization: 
To transform the unstructured text into a format suitable for a sequential neural network, we utilized Frequency-Based Integer Encoding via the Keras Tokenizer.

#### Vocabulary Limit: 
We restricted our vocabulary to the top 20,000 most frequent words, out of the 163,304 unique tokens found in the cleaned reviews. This decision serves two purposes: it caps the embedding matrix at 20,000 rows instead of one row per unique token, and it filters out noise (such as typos or extremely obscure proper nouns) that could lead to overfitting.

#### OOV Handling: 
Any word that falls outside this vocabulary is replaced with an `<OOV>` (Out of Vocabulary) token. This ensures the model preserves the structure and length of a sentence even if certain words are unknown, rather than silently deleting them.
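
To make the interaction between the frequency ranking, the vocabulary cap, and the OOV token concrete, here is a simplified pure-Python sketch. It is an illustrative stand-in written for this explanation, not the Keras `Tokenizer` implementation: the function names `build_word_index` and `encode` are hypothetical, and the sketch assumes whitespace-separated, already-cleaned text.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 0 is left for padding, index 1 for OOV."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def encode(text, word_index, num_words, oov_token="<OOV>"):
    """Map each word to its index; unknown or too-rare words become OOV."""
    oov_id = word_index[oov_token]
    ids = []
    for word in text.split():
        idx = word_index.get(word)
        ids.append(idx if idx is not None and idx < num_words else oov_id)
    return ids

corpus = ["good good good movie movie bad"]
word_index = build_word_index(corpus)
encoded = encode("good movie bad unseen", word_index, num_words=4)
print(word_index)
print(encoded)
```

With `num_words=4`, only indices below 4 survive, so both the rare word `bad` and the never-seen word `unseen` are mapped to the OOV index while the sequence length stays unchanged.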

#### Encoding:
We chose **integer encoding** over one-hot encoding for the input sequences because one-hot encoding a vocabulary of 20,000 words for sequences of length 300 would produce an extremely sparse 300 x 20,000 matrix for every review (6,000,000 entries, of which at most 300 are non-zero). Integer encoding allows us to utilize an Embedding Layer to learn dense, semantic vector representations.
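
The equivalence behind this choice can be checked with a few lines of NumPy. The sketch below uses tiny, made-up sizes (6 words, 4 dimensions) and a random matrix `E` as a hypothetical embedding table; it shows that indexing rows by integer ID gives the same result as multiplying a one-hot matrix by `E`, without ever building the sparse matrix.

```python
import numpy as np

vocab_size, embed_dim = 6, 4                     # tiny stand-ins for 20,000 and 100
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))     # hypothetical embedding table

token_ids = np.array([2, 5, 0])                  # integer-encoded sequence (0 = padding)

# One-hot route: build a (seq_len, vocab_size) matrix, then multiply by E.
one_hot = np.eye(vocab_size)[token_ids]
via_one_hot = one_hot @ E

# Integer route: pick the rows directly, as an embedding lookup does.
via_lookup = E[token_ids]

same = np.allclose(via_one_hot, via_lookup)
print("Same vectors:", same)
print("One-hot entries per review at the real settings:", 300 * 20_000)
```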

**Decisions on Sequence Length:** Neural networks require input tensors of a fixed shape, but movie reviews vary in length. To standardize the input, we forced a specific sequence length of 300 words.

We chose 300 words because after analyzing the dataset, we saw that 300 words capture the majority of the sentiment-bearing content for the average review.
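
One way to run this kind of length analysis is sketched below. It is an assumption about how the check could be done, not the analysis the team actually ran: the helper name `length_summary` is hypothetical, the toy reviews are made up, and lengths are counted as whitespace-separated tokens.

```python
import numpy as np

def length_summary(texts, max_len=300):
    """Summarize review lengths and the share that fits within max_len tokens."""
    lengths = np.array([len(text.split()) for text in texts])
    return {
        "median": float(np.median(lengths)),
        "p90": float(np.percentile(lengths, 90)),
        "share_within_max_len": float(np.mean(lengths <= max_len)),
    }

# Toy reviews with 2, 10, and 400 tokens (hypothetical data)
toy_reviews = ["short review", "a " * 10, "word " * 400]
summary = length_summary(toy_reviews, max_len=300)
print(summary)
```

On the real data, `length_summary(df['review_cleaned'], MAX_SEQUENCE_LENGTH)` would report what fraction of reviews is kept whole at 300 tokens, which is the number that supports or challenges the cutoff.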

#### Padding & Truncating:

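**Padding:** Reviews shorter than 300 words are padded with zeros at the end (padding='post'). We chose 'post' padding so that the real tokens always start at position 0 and the padding follows the review in reading order. One caveat: recurrent networks (like LSTMs) process sequences chronologically, so the trailing zeros are only ignored if the padding index is masked (for example with `mask_zero=True` on the Keras Embedding layer); without masking, the LSTM keeps updating its state on padding after the real text has ended.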

**Truncating:** Reviews longer than 300 words are truncated at the end (truncating='post'). We assume that the most critical sentiment is often established in the introduction and body of the review, rather than strictly in the final words.
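
The combined effect of post-padding and post-truncating can be reproduced in plain NumPy. The function below is a simplified re-implementation written for illustration (the name `pad_post` is hypothetical); it is intended to mirror what `pad_sequences(..., padding='post', truncating='post')` does for integer sequences.

```python
import numpy as np

def pad_post(sequences, maxlen, pad_value=0):
    """Force every sequence to length maxlen, padding and truncating at the end."""
    out = np.full((len(sequences), maxlen), pad_value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[:maxlen]             # truncating='post': drop tokens beyond maxlen
        out[row, :len(kept)] = kept     # padding='post': remaining slots stay zero
    return out

padded = pad_post([[7, 8], [1, 2, 3, 4, 5]], maxlen=3)
print(padded)
```

The short sequence keeps its tokens at the front and gains a trailing zero, while the long one loses its last two tokens.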

****
### **1.2 Chosen Metric(s) and Justification**

#### Chosen Metrics: **F1-Score** and **Confusion Matrix**

#### Justification: 

We will use the **F1-Score** as our primary metric to mathematically balance the cost of misleading recommendations (Precision) against the cost of burying high-quality content (Recall), while utilizing a Confusion Matrix to diagnose whether our model is biased toward optimism or pessimism.

**Why Accuracy is Insufficient:**
While the IMDB dataset is balanced, relying on Accuracy alone is insufficient for a sentiment analysis business case. Accuracy treats all errors equally, but in a content recommendation or moderation context, the nature of the error matters. On this 50/50 dataset, a model that blindly predicts "Positive" for everything would score only about 50% accuracy, so a trivial model is easy to spot. The real limitation is that two models with the same accuracy can make very different mistakes (mostly false positives versus mostly false negatives), and Accuracy cannot tell them apart, even though each kind of mistake erodes user trust in a different way.

**Why F1-Score is Appropriate:**
The F1-Score is the harmonic mean of Precision and Recall. In our proposed business case of **Automated Opinion Mining for Streaming Services**, both types of errors carry significant, distinct costs:

* **Precision (Cost of False Positives):** If our model falsely classifies a Negative review as Positive (1), the recommendation algorithm might push a poorly received movie to users, claiming it has "great reviews." When users watch it and realize it is bad, they lose trust in the platform's "Smart Recommendations" (Cost: User Churn/Trust Loss).

* **Recall (Cost of False Negatives):** If our model fails to identify a Positive review (classifying it as Negative), a "hidden gem" or critically acclaimed indie film might get buried by the algorithm. The studio loses potential viral revenue because the system failed to recognize the audience's excitement (Cost: Missed Revenue Opportunity).

Since we need to balance the risk of annoying users (Precision) against the risk of burying good content (Recall), the F1-Score is the mathematically appropriate measure to ensure the model performs robustly in both directions.
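
As a worked example of how the three quantities relate, the sketch below computes Precision, Recall, and F1 by hand from toy labels. The labels are made up and the helper name `precision_recall_f1` is hypothetical; in practice `sklearn.metrics.f1_score` computes the same F1 value for the positive class.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))   # correctly flagged positives
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))   # negatives called positive
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))   # positives called negative
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return precision, recall, f1

# Toy labels (hypothetical): 1 = positive review, 0 = negative review
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Here one false positive and one false negative pull Precision and Recall down by the same amount, so all three values coincide at 0.75; the harmonic mean drops below the arithmetic mean as soon as the two diverge.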

**Why Confusion Matrix:**
Finally, we will visualize the results using a Confusion Matrix. This allows us to look "under the hood" to see whether the model has a directional bias, i.e., whether its errors are mostly false positives or mostly false negatives. The matrix itself only counts errors; to diagnose specific linguistic failures such as sarcasm (predicting "Positive" because words like "great" or "best" are used ironically in a negative review), we will read through the misclassified reviews in each off-diagonal cell.
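
A minimal sketch of this diagnosis is shown below, assuming binary labels where 1 means positive. The function `confusion_matrix_2x2` is a hypothetical helper using the same layout as `sklearn.metrics.confusion_matrix` (rows are true labels, columns are predicted labels), and the toy labels are made up.

```python
import numpy as np

def confusion_matrix_2x2(y_true, y_pred):
    """Rows are true labels (0, 1); columns are predicted labels (0, 1)."""
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy labels (hypothetical): 1 = positive review, 0 = negative review
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

cm = confusion_matrix_2x2(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("Leans optimistic" if fp > fn else "Leans pessimistic or balanced")

# Indices of false positives: the reviews to read when looking for sarcasm.
false_positive_idx = [i for i, (t, p) in enumerate(zip(y_true, y_pred))
                      if t == 0 and p == 1]
print(false_positive_idx)
```

In this toy case the model produces two false positives against one false negative, which is the pattern an over-optimistic classifier would show.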

***

### **1.3 Chosen Method for Dividing Data and Justification**

#### Chosen Method: Stratified Shuffle Split (80% Training / 20% Testing)

#### Justification:

We will use a **Stratified Shuffle Split**, allocating 80% of the data for training and 20% for final testing. During the model training phase (the 80% split), we will further reserve a portion (20% of the training data) as a **validation set** to monitor loss curves and implement early stopping.

**Why Stratified?**
Although the IMDB dataset is balanced (25,000 positive and 25,000 negative reviews), using a Stratified Shuffle Split is still the most robust approach. It preserves the **50/50 class distribution** (to within one sample) across the Training, Validation, and Testing sets, provided the validation set is also carved out with a stratified split. With a simple random shuffle, the class ratio of each subset is left to chance. On subsets of this size the drift is usually small, but any skew in the Validation set could mislead our hyperparameter tuning, and stratification removes that source of noise at essentially no cost.
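
The two-stage split described above could look like the following sketch, which applies scikit-learn's `StratifiedShuffleSplit` twice. The toy arrays `X_toy` and `y_toy` and the `random_state` value are assumptions made for illustration; in the notebook the real `X` and `y` would take their place.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-ins for X and y (hypothetical data, balanced like IMDB)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0, 1] * 50)

# Stage 1: 80% train+validation / 20% held-out test
outer = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_val_idx, test_idx = next(outer.split(X_toy, y_toy))

# Stage 2: reserve 20% of the training portion for validation
inner = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
tr_rel, val_rel = next(inner.split(X_toy[train_val_idx], y_toy[train_val_idx]))
train_idx, val_idx = train_val_idx[tr_rel], train_val_idx[val_rel]

for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    print(name, len(idx), "positive share:", y_toy[idx].mean())
```

Passing the validation arrays explicitly to `model.fit(..., validation_data=(X_val, y_val))` keeps this stratification; the Keras `validation_split` argument instead takes the last fraction of the arrays as provided, so it does not stratify on its own.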

**Why not 10-Fold Cross-Validation?**
While 10-Fold Cross-Validation provides a robust statistical estimate for lighter models, it is computationally prohibitive for Sequential Deep Learning architectures (like LSTMs and Transformers). Training a single sequential network on 40,000 text sequences of length 300 is computationally expensive; repeating this process 10 times is not feasible within the scope of this lab.

**Realistic Mirroring:**
This approach mirrors real-world deployment (the "Business Case"). In a production environment for a streaming service, the model is trained on historical data and then frozen to predict sentiment on **new, unseen reviews** as they are posted by users. By strictly holding out the Test set (and never using it for tuning), we simulate this stream of "future" data, ensuring our F1-Score reflects how the model will perform on actual user feedback rather than memorized training examples.

## 2. Modeling

## 3. Exceptional Work

## 4. Citation