
# Movie Review Sentiment Analysis and Rating Prediction

In this homework, you will:
1. Load the IMDB movie reviews dataset using the Hugging Face `datasets` library
2. Perform sentiment analysis
3. Build an ML model to predict movie ratings


In [1]:
# TODO: Install required packages
%pip install pandas numpy scikit-learn transformers torch datasets



In [2]:
import pandas as pd
import numpy as np
import re
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch


## Part 1: Load Dataset

Load the IMDB dataset using the Hugging Face `datasets` library.

In [3]:
dataset = load_dataset("imdb")
train_df = dataset["train"].to_pandas()
test_df = dataset["test"].to_pandas()
unsupervised_df = dataset["unsupervised"].to_pandas()

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

## Part 2: Data Preprocessing

Clean and prepare the text data

In [4]:
#Verify dfs contain html

train_html = train_df['text'].str.contains(r'<.*?>').any()
test_html = test_df['text'].str.contains(r'<.*?>').any()
unsupe_html = unsupervised_df['text'].str.contains(r'<.*?>').any()

print("train_df contains HTML tags:", train_html)
print("test_df contains HTML tags:", test_html)
print("unsupervised_df contains HTML tags:", unsupe_html)

train_df contains HTML tags: True
test_df contains HTML tags: True
unsupervised_df contains HTML tags: True


In [5]:
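# Verify dfs contain special characters (anything outside the set clean_text keeps)
special_pattern = r'[^a-zA-Z0-9\s.,!?\'"]'
train_char = train_df['text'].str.contains(special_pattern).any()
test_char = test_df['text'].str.contains(special_pattern).any()
unsupe_char = unsupervised_df['text'].str.contains(special_pattern).any()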

print("train_df contains special char:", train_char)
print("test_df contains special char:", test_char)
print("unsupervised_df contains special char:", unsupe_char)

train_df contains special char: True
test_df contains special char: True
unsupervised_df contains special char: True


In [6]:
def clean_text(text):
    """
    Cleans input text by:
    1. Removing HTML tags
    2. Removing special characters & extra spaces
    3. Converting to lowercase
    """
    # 1. Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # 2. Remove special characters (keep letters, digits, whitespace, and . , ! ? ' ")
    text = re.sub(r'[^a-zA-Z0-9\s.,!?\'"]', '', text)

    # 3. Convert to lowercase
    text = text.lower()

    # remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text
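
To see what each step of `clean_text` does, the sketch below applies the same four regular-expression steps to a made-up review string (the sample text is an assumption, not taken from the dataset). Because tags are replaced with an empty string, the words on either side of `<br />` end up joined, and hyphens, slashes, and parentheses are dropped.

```python
import re

# Hypothetical sample review, not taken from the IMDB data
sample = "Great film!<br /><br />A must-see (10/10)."

step1 = re.sub(r'<.*?>', '', sample)                   # strip HTML tags
step2 = re.sub(r'[^a-zA-Z0-9\s.,!?\'"]', '', step1)    # drop all other symbols
step3 = step2.lower()                                  # lowercase
cleaned = re.sub(r'\s+', ' ', step3).strip()           # collapse whitespace

print(step1)    # Great film!A must-see (10/10).
print(cleaned)  # great film!a mustsee 1010.
```

If merged words turn out to matter for later features, replacing tags with a space instead of an empty string is one possible variation.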


In [7]:
#Clean all 3 dfs

train_df['text']=train_df['text'].apply(clean_text)
test_df['text']=test_df['text'].apply(clean_text)
unsupervised_df['text']=unsupervised_df['text'].apply(clean_text)


In [8]:
#Verify changes worked

train_html = train_df['text'].str.contains(r'<.*?>').any()
test_html = test_df['text'].str.contains(r'<.*?>').any()
unsupe_html = unsupervised_df['text'].str.contains(r'<.*?>').any()
special_pattern = r'[^a-zA-Z0-9\s.,!?\'"]'
train_char = train_df['text'].str.contains(special_pattern).any()
test_char = test_df['text'].str.contains(special_pattern).any()
unsupe_char = unsupervised_df['text'].str.contains(special_pattern).any()


print("train_df contains HTML tags:", train_html)
print("test_df contains HTML tags:", test_html)
print("unsupervised_df contains HTML tags:", unsupe_html)
print("train_df contains special char:", train_char)
print("test_df contains special char:", test_char)
print("unsupervised_df contains special char:", unsupe_char)

train_df contains HTML tags: False
test_df contains HTML tags: False
unsupervised_df contains HTML tags: False
train_df contains special char: False
test_df contains special char: False
unsupervised_df contains special char: False


## Part 3: Advanced Sentiment Analysis

Go beyond binary classification - use a pre-trained model to get continuous sentiment scores

In [9]:
# TODO: Implement advanced sentiment analysis
# 1. Load a pre-trained model (hint: try 'distilbert-base-uncased-finetuned-sst-2-english')
# 2. Create a function to get continuous sentiment scores
# 3. Apply it to your cleaned text data
# Note: Original dataset has binary labels, but we want continuous scores!

In [10]:
# Load a pre-trained model (hint: try 'distilbert-base-uncased-finetuned-sst-2-english')
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

tok = AutoTokenizer.from_pretrained(model_name)
mdl = AutoModelForSequenceClassification.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
mdl = mdl.to(device).eval()

# figure out which index is POSITIVE/NEGATIVE
id2label = mdl.config.id2label
labels = [id2label[i] for i in range(mdl.config.num_labels)]
pos_idx = labels.index("POSITIVE") if "POSITIVE" in labels else 1
neg_idx = labels.index("NEGATIVE") if "NEGATIVE" in labels else 0

In [11]:
# Create a function to get continuous sentiment scores

@torch.no_grad()
def get_sentiment_scores(texts, batch_size=32, max_len=256):
    """
    Returns two arrays:
    - probs: probability that the text is positive (0..1)
    - score: the same probability rescaled to [-1, 1]
    """
    if not isinstance(texts, (list, tuple)):
        texts = list(texts)

    probs = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        enc = tok(batch,
                  padding=True,
                  truncation=True,
                  max_length=max_len,
                  return_tensors="pt").to(device)

        out = mdl(**enc).logits
        p = torch.softmax(out, dim=-1)[:, pos_idx].cpu().numpy()
        probs.extend(p.tolist())

    probs = np.array(probs)
    score = 2 * probs - 1     # convert 0–1 → -1–1
    return probs, score
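
The conversion from model logits to a continuous score can be checked by hand. The sketch below is a plain-Python illustration with made-up logits, assuming the `[NEGATIVE, POSITIVE]` label order; `positive_prob` is a hypothetical helper, not part of the notebook. It mirrors the softmax followed by the `2 * p - 1` rescaling.

```python
import math

def positive_prob(logits, pos_idx=1):
    """Softmax over one row of logits; return the probability at pos_idx."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[pos_idx] / sum(exps)

# Made-up logits in [NEGATIVE, POSITIVE] order (an assumption for illustration)
for logits in ([-2.0, 2.0], [0.0, 0.0], [3.0, -1.0]):
    p = positive_prob(logits)
    score = 2 * p - 1                            # rescale [0, 1] to [-1, 1]
    print(f"logits={logits}  prob_pos={p:.4f}  sent_score={score:+.4f}")
```

Equal logits give a probability of 0.5 and therefore a score of 0, which is the neutral point of the scale.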



In [12]:
# Apply it to your cleaned text data

train_df["prob_pos"], train_df["sent_score"] = get_sentiment_scores(train_df["text"])
test_df["prob_pos"],  test_df["sent_score"]  = get_sentiment_scores(test_df["text"])


In [13]:
train_df.head()

| | text | label | prob_pos | sent_score |
|---|---|---|---|---|
| 0 | i rented i am curiousyellow from my video stor... | 0 | 0.00995 | -0.980101 |
| 1 | "i am curious yellow" is a risible and pretent... | 0 | 0.000697 | -0.998607 |
| 2 | if only to avoid making this type of film in t... | 0 | 0.001154 | -0.997693 |
| 3 | this film was probably inspired by godard's ma... | 0 | 0.943966 | 0.887933 |
| 4 | oh, brother...after hearing about this ridicul... | 0 | 0.001831 | -0.996337 |


In [14]:
test_df.head()

| | text | label | prob_pos | sent_score |
|---|---|---|---|---|
| 0 | i love scifi and am willing to put up with a l... | 0 | 0.000852 | -0.998296 |
| 1 | worth the entertainment value of a rental, esp... | 0 | 0.942555 | 0.885111 |
| 2 | its a totally average film with a few semialri... | 0 | 0.000297 | -0.999407 |
| 3 | star rating saturday night friday night friday... | 0 | 0.003544 | -0.992913 |
| 4 | first off let me say, if you haven't enjoyed a... | 0 | 0.994985 | 0.98997 |


## Part 4: Feature Engineering

Create rich features for your model

In [15]:
#  Calculate text statistics:
#    - Length
#    - Word count
#    - Average word length
#    - Sentence count

def _safe_word_stats(text: str):
    if not isinstance(text, str):
        return 0, 0.0
    words = text.split()
    wc = len(words)
    awl = (sum(len(w) for w in words) / wc) if wc > 0 else 0.0
    return wc, awl

def _sent_count(text: str):
    if not isinstance(text, str) or not text.strip():
        return 0
    # count sentence enders . ! ?
    c = len(re.findall(r'[.!?]+', text))
    # if there’s text but no enders, treat as one sentence
    return max(1 if text.strip() else 0, c)

def add_text_features(df: pd.DataFrame):
    """
    adds:
      - char_len
      - word_count
      - avg_word_len
      - sent_count
      - words_per_sent  (extra, handy)
      - uses existing 'sent_score' if present (no change)
        (if only 'prob_pos' exists, also add sent_score = 2*prob_pos-1)
    """
    # ensure continuous score present
    if 'sent_score' not in df.columns and 'prob_pos' in df.columns:
        df['sent_score'] = 2*df['prob_pos'] - 1

    # character length
    df['char_len'] = df['text'].fillna('').str.len()

    # word_count + avg_word_len
    wc_awl = df['text'].apply(_safe_word_stats)
    df['word_count']   = wc_awl.apply(lambda x: x[0])
    df['avg_word_len'] = wc_awl.apply(lambda x: x[1])

    # sentence count
    df['sent_count'] = df['text'].apply(_sent_count)

    # extra: words per sentence (avoid div by zero)
    df['words_per_sent'] = np.where(df['sent_count'] > 0,
                                    df['word_count'] / df['sent_count'],
                                    0.0)
    return df

# apply to splits
train_df = add_text_features(train_df)
test_df  = add_text_features(test_df)

# quick peek
train_df.head()


| | text | label | prob_pos | sent_score | char_len | word_count | avg_word_len | sent_count | words_per_sent |
|---|---|---|---|---|---|---|---|---|---|
| 0 | i rented i am curiousyellow from my video stor... | 0 | 0.00995 | -0.980101 | 1599 | 282 | 4.673759 | 14 | 20.142857 |
| 1 | "i am curious yellow" is a risible and pretent... | 0 | 0.000697 | -0.998607 | 1284 | 214 | 5.004673 | 11 | 19.454545 |
| 2 | if only to avoid making this type of film in t... | 0 | 0.001154 | -0.997693 | 490 | 87 | 4.643678 | 5 | 17.4 |
| 3 | this film was probably inspired by godard's ma... | 0 | 0.943966 | 0.887933 | 676 | 114 | 4.938596 | 10 | 11.4 |
| 4 | oh, brother...after hearing about this ridicul... | 0 | 0.001831 | -0.996337 | 1716 | 293 | 4.860068 | 22 | 13.318182 |
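
As a quick sanity check on these statistics, the sketch below recomputes them in plain Python for a single made-up string (the sample text is an assumption, not from the dataset). It uses the same definitions as the helpers above: whitespace-split words, and runs of `.`, `!`, or `?` as sentence enders.

```python
import re

# Hypothetical cleaned review (lowercase, no HTML), not taken from the dataset
text = "great movie! really great... would watch again"

words = text.split()
word_count = len(words)
avg_word_len = sum(len(w) for w in words) / word_count   # attached punctuation counts toward length
sent_count = max(1, len(re.findall(r'[.!?]+', text)))    # a run like "..." counts once
words_per_sent = word_count / sent_count

print(len(text), word_count, round(avg_word_len, 2), sent_count, words_per_sent)
```

Two details are worth keeping in mind when reading the feature columns: a trailing clause with no closing punctuation is not counted as a sentence, and punctuation attached to a word inflates the average word length.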


In [16]:
# Additional test_df features

test_df['exclamation_count'] = test_df['text'].str.count('!')
test_df['question_count'] = test_df['text'].str.count('\\?')

In [17]:
# Additional train_df features

train_df['exclamation_count'] = train_df['text'].str.count('!')
train_df['question_count'] = train_df['text'].str.count('\\?')

## Part 5: Multi-Class Rating Prediction

Instead of binary classification, predict a 5-star rating!

In [18]:
# Create target variable
# Convert binary labels to 5-star ratings using your features
# Hint: Use sentiment scores and other features to estimate star rating

train = train_df.copy()
test  = test_df.copy()

# Log-transform the counts so very large values don't dominate
for df in [train, test]:
    df['word_count_norm'] = np.log1p(df['word_count'])
    df['sent_count_norm'] = np.log1p(df['sent_count'])
    df['exclamation_count_norm'] = np.log1p(df['exclamation_count'])


def giveRating(row):
    base = (row['sent_score'] + 1) * 2.5   # maps sent_score from [-1, 1] to [0, 5]; clipped to [1, 5] below

    # add/subtract small adjustments
    base += 0.10 * row['word_count_norm']          # longer reviews → slight boost
    base += 0.05 * row['exclamation_count_norm']   # excitement → boost
    base -= 0.10 * row['question_count']           # many questions → confusion → lower

    # clip to valid 1–5 range
    return np.clip(base, 1, 5)

train['rating'] = train.apply(giveRating, axis=1)
test['rating']  = test.apply(giveRating, axis=1)
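
The arithmetic behind the heuristic can be traced on plain numbers. The sketch below is a standalone restatement of the same formula (`give_rating` is a hypothetical name used only here), evaluated on the `sent_score` and `word_count` values from the first and fourth rows of `train.head()`, plus one made-up neutral case.

```python
import math

def give_rating(sent_score, word_count, exclamation_count, question_count):
    """Standalone version of the rating heuristic, taking plain numbers."""
    base = (sent_score + 1) * 2.5                    # [-1, 1] -> [0, 5]
    base += 0.10 * math.log1p(word_count)            # longer reviews: slight boost
    base += 0.05 * math.log1p(exclamation_count)     # exclamation marks: slight boost
    base -= 0.10 * question_count                    # question marks: slight penalty
    return min(max(base, 1.0), 5.0)                  # clip to [1, 5]

# Strongly negative and strongly positive scores saturate at the ends of the scale
print(give_rating(-0.980101, 282, 0, 0))       # 1.0
print(give_rating(0.887933, 114, 0, 0))        # 5.0
# A made-up neutral score lands mid-scale
print(round(give_rating(0.0, 100, 0, 0), 2))   # 2.96
```

Because the sentiment model tends to output probabilities near 0 or 1, many ratings are likely to be clipped to exactly 1 or 5, which is worth checking before treating the rating as a five-class target.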

In [19]:
train.head()

| | text | label | prob_pos | sent_score | char_len | word_count | avg_word_len | sent_count | words_per_sent | exclamation_count | question_count | word_count_norm | sent_count_norm | exclamation_count_norm | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | i rented i am curiousyellow from my video stor... | 0 | 0.00995 | -0.980101 | 1599 | 282 | 4.673759 | 14 | 20.142857 | 0 | 0 | 5.645447 | 2.70805 | 0.0 | 1.0 |
| 1 | "i am curious yellow" is a risible and pretent... | 0 | 0.000697 | -0.998607 | 1284 | 214 | 5.004673 | 11 | 19.454545 | 0 | 1 | 5.370638 | 2.484907 | 0.0 | 1.0 |
| 2 | if only to avoid making this type of film in t... | 0 | 0.001154 | -0.997693 | 490 | 87 | 4.643678 | 5 | 17.4 | 0 | 0 | 4.477337 | 1.791759 | 0.0 | 1.0 |
| 3 | this film was probably inspired by godard's ma... | 0 | 0.943966 | 0.887933 | 676 | 114 | 4.938596 | 10 | 11.4 | 0 | 0 | 4.744932 | 2.397895 | 0.0 | 5.0 |
| 4 | oh, brother...after hearing about this ridicul... | 0 | 0.001831 | -0.996337 | 1716 | 293 | 4.860068 | 22 | 13.318182 | 3 | 2 | 5.68358 | 3.135494 | 1.386294 | 1.0 |


In [20]:
test.head()

| | text | label | prob_pos | sent_score | char_len | word_count | avg_word_len | sent_count | words_per_sent | exclamation_count | question_count | word_count_norm | sent_count_norm | exclamation_count_norm | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | i love scifi and am willing to put up with a l... | 0 | 0.000852 | -0.998296 | 1368 | 230 | 4.952174 | 20 | 11.5 | 1 | 0 | 5.442418 | 3.044522 | 0.693147 | 1.0 |
| 1 | worth the entertainment value of a rental, esp... | 0 | 0.942555 | 0.885111 | 1242 | 210 | 4.919048 | 14 | 15.0 | 0 | 1 | 5.351858 | 2.70805 | 0.0 | 5.0 |
| 2 | its a totally average film with a few semialri... | 0 | 0.000297 | -0.999407 | 703 | 133 | 4.293233 | 4 | 33.25 | 0 | 0 | 4.89784 | 1.609438 | 0.0 | 1.0 |
| 3 | star rating saturday night friday night friday... | 0 | 0.003544 | -0.992913 | 1996 | 345 | 4.788406 | 14 | 24.642857 | 0 | 0 | 5.846439 | 2.70805 | 0.0 | 1.0 |
| 4 | first off let me say, if you haven't enjoyed a... | 0 | 0.994985 | 0.98997 | 668 | 134 | 3.992537 | 7 | 19.142857 | 1 | 0 | 4.905275 | 2.079442 | 0.693147 | 5.0 |


In [21]:
# TODO: Build and train your model
# 1. Split data into train and test sets
# 2. Choose a model suitable for multi-class classification
# 3. Train the model
# 4. Make predictions
# 5. Evaluate performance

In [37]:
# Split data into train and test sets

# Concatenate train and test to make one combined dataset
combined_df = pd.concat([train, test], axis=0, ignore_index=True)

# Split into train and test sets (e.g., 80% train, 20% test)
train_final_df, test_final_df = train_test_split(combined_df, test_size=0.2, random_state=42)
print("Train data:", train_final_df.shape)
print("Test data:", test_final_df.shape)


Train data: (40000, 15)
Test data: (10000, 15)


In [None]:
# Choose a model suitable for multi-class classification
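
The `rating` column built earlier is continuous, while a multi-class classifier needs discrete labels. One possible preparatory step, sketched below with made-up ratings, is to round each value to the nearest whole star and inspect the class counts. `to_star_class` is a hypothetical helper, and rounding halves upward is an assumption rather than something the assignment specifies.

```python
import math
from collections import Counter

def to_star_class(rating):
    """Round a continuous rating to the nearest whole star in 1..5 (halves round up)."""
    return int(min(5, max(1, math.floor(rating + 0.5))))

# Made-up continuous ratings, for illustration only
ratings = [1.0, 1.0, 1.4, 2.5, 3.2, 4.49, 4.5, 5.0, 5.0, 5.0]
stars = [to_star_class(r) for r in ratings]

print(stars)           # [1, 1, 1, 3, 3, 4, 5, 5, 5, 5]
print(Counter(stars))
```

A column built this way could then serve as the target for a multi-class model such as the `LogisticRegression` already imported. Since every rating in the `head()` outputs is either 1.0 or 5.0, checking the class counts on the real data first seems worthwhile.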


In [32]:
combined_df.head()
len(combined_df)

50000

## Part 6: Analysis

Analyze your results and suggest improvements

In [22]:
# TODO: Create visualizations and analyze:
# 1. Confusion matrix for multi-class predictions
# 2. Feature importance
# 3. Error analysis
# 4. Suggest improvements
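
For the confusion-matrix item, the sketch below shows how such a table is tallied, using made-up star labels and predictions rather than results from this notebook. `confusion_counts` is a hypothetical helper written in plain Python; scikit-learn's `confusion_matrix` in `sklearn.metrics` produces the same kind of table.

```python
def confusion_counts(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true classes, columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Made-up star labels and predictions, for illustration only
labels = [1, 2, 3, 4, 5]
y_true = [1, 1, 2, 3, 4, 5, 5, 5]
y_pred = [1, 2, 2, 3, 5, 5, 5, 4]

cm = confusion_counts(y_true, y_pred, labels)
for i, (label, row) in enumerate(zip(labels, cm)):
    total = sum(row)
    recall = row[i] / total if total else 0.0   # share of this true class predicted correctly
    print(label, row, f"recall={recall:.2f}")
```

Off-diagonal cells next to the diagonal (for example, true 4 predicted as 5) point to near-miss errors, which is a natural starting point for the error analysis.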