# Movie Review Sentiment Analysis and Rating Prediction

In this homework, you will:
1. Load the IMDB movie review dataset using the Hugging Face `datasets` library
2. Perform sentiment analysis
3. Build an ML model to predict movie ratings


In [1]:
# TODO: Install required packages
%pip install pandas numpy scikit-learn transformers torch datasets



In [18]:
# TODO: Import required libraries
import pandas as pd
import numpy as np
import re
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


## Part 1: Load Dataset

Load the IMDB dataset using the Hugging Face `datasets` library

In [11]:
dataset = load_dataset("imdb")
train_df = dataset["train"].to_pandas()
test_df = dataset["test"].to_pandas()
unsupervised_df = dataset["unsupervised"].to_pandas()

## Part 2: Data Preprocessing

Clean and prepare the text data

In [25]:
# Verify that the DataFrames contain HTML tags

train_html = train_df['text'].str.contains(r'<.*?>').any()
test_html = test_df['text'].str.contains(r'<.*?>').any()
unsupe_html = unsupervised_df['text'].str.contains(r'<.*?>').any()

print("train_df contains HTML tags:", train_html)
print("test_df contains HTML tags:", test_html)
print("unsupervised_df contains HTML tags:", unsupe_html)

train_df contains HTML tags: True
test_df contains HTML tags: True
unsupervised_df contains HTML tags: True


In [26]:
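# Verify that the DataFrames contain special characters, i.e. anything other
# than letters, digits, whitespace, and the punctuation kept by clean_text
train_char = train_df['text'].str.contains(r'[^a-zA-Z0-9\s.,!?\'"]').any()
test_char = test_df['text'].str.contains(r'[^a-zA-Z0-9\s.,!?\'"]').any()
unsupe_char = unsupervised_df['text'].str.contains(r'[^a-zA-Z0-9\s.,!?\'"]').any()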

print("train_df contains special char:", train_char)
print("test_df contains special char:", test_char)
print("unsupervised_df contains special char:", unsupe_char)

train_df contains special char: True
test_df contains special char: True
unsupervised_df contains special char: True


In [27]:
def clean_text(text):
    """
    Cleans input text by:
    1. Removing HTML tags
    2. Removing special characters & extra spaces
    3. Converting to lowercase
    """
    # 1. Remove HTML tags (replace each with a space so that words
    #    separated only by <br /> do not get merged together)
    text = re.sub(r'<.*?>', ' ', text)

    # 2. Remove special characters (keep letters, numbers, punctuation)
    text = re.sub(r'[^a-zA-Z0-9\s.,!?\'"]', '', text)

    # 3. Convert to lowercase
    text = text.lower()

    # remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text


In [30]:
# Clean all three DataFrames

train_df['text'] = train_df['text'].apply(clean_text)
test_df['text'] = test_df['text'].apply(clean_text)
unsupervised_df['text'] = unsupervised_df['text'].apply(clean_text)


In [32]:
# Verify that cleaning worked: neither HTML tags nor special characters should remain

train_html = train_df['text'].str.contains(r'<.*?>').any()
test_html = test_df['text'].str.contains(r'<.*?>').any()
unsupe_html = unsupervised_df['text'].str.contains(r'<.*?>').any()
train_char = train_df['text'].str.contains(r'[^a-zA-Z0-9\s.,!?\'"]').any()
test_char = test_df['text'].str.contains(r'[^a-zA-Z0-9\s.,!?\'"]').any()
unsupe_char = unsupervised_df['text'].str.contains(r'[^a-zA-Z0-9\s.,!?\'"]').any()


print("train_df contains HTML tags:", train_html)
print("test_df contains HTML tags:", test_html)
print("unsupervised_df contains HTML tags:", unsupe_html)
print("train_df contains special char:", train_char)
print("test_df contains special char:", test_char)
print("unsupervised_df contains special char:", unsupe_char)

train_df contains HTML tags: False
test_df contains HTML tags: False
unsupervised_df contains HTML tags: False
train_df contains special char: False
test_df contains special char: False
unsupervised_df contains special char: False


In [36]:
unsupervised_df.head()

|   | text | label |
|---|------|-------|
| 0 | this is just a precious little diamond. the pl... | -1 |
| 1 | when i say this is my favourite film of all ti... | -1 |
| 2 | i saw this movie because i am a huge fan of th... | -1 |
| 3 | being that the only foreign films i usually li... | -1 |
| 4 | after seeing point of no return a great movie ... | -1 |


## Part 3: Advanced Sentiment Analysis

Go beyond binary classification: use a pre-trained model to get continuous sentiment scores

In [None]:
# TODO: Implement advanced sentiment analysis
# 1. Load a pre-trained model (hint: try 'distilbert-base-uncased-finetuned-sst-2-english')
# 2. Create a function to get continuous sentiment scores
# 3. Apply it to your cleaned text data
# Note: Original dataset has binary labels, but we want continuous scores!
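
One possible way to approach the steps above is sketched here. It rests on a few assumptions: the helper names `load_classifier`, `to_positive_probability`, and `score_texts` are hypothetical, the model is the one suggested in the hint, and the "continuous score" is taken to be the model's probability of the positive class. The scoring function accepts any callable that returns dictionaries with `label` and `score` keys, which is the shape a Hugging Face `pipeline("sentiment-analysis")` produces.

```python
def load_classifier(model_name="distilbert-base-uncased-finetuned-sst-2-english"):
    """Load a Hugging Face sentiment pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily; only needed when loading
    return pipeline("sentiment-analysis", model=model_name)


def to_positive_probability(result):
    """Map one {'label', 'score'} result to P(positive) in [0, 1]."""
    score = float(result["score"])
    # The pipeline reports the confidence of the predicted label, so flip it
    # when the predicted label is NEGATIVE.
    return score if result["label"].upper() == "POSITIVE" else 1.0 - score


def score_texts(texts, classifier, batch_size=32):
    """Return one continuous sentiment score per text, processing in batches."""
    texts = list(texts)
    scores = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # truncation=True keeps long reviews within the model's 512-token limit
        results = classifier(batch, truncation=True)
        scores.extend(to_positive_probability(r) for r in results)
    return scores
```

With the real model this could be used as `clf = load_classifier()` followed by `train_df["sentiment_score"] = score_texts(train_df["text"], clf)`. Scoring all 25,000 training reviews on a CPU can be slow, so trying it on a sample first (for example `train_df.sample(1000, random_state=42)`) may be worthwhile.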

## Part 4: Feature Engineering

Create rich features for your model

In [None]:
# TODO: Create features
# 1. Use your continuous sentiment scores
# 2. Calculate text statistics:
#    - Length
#    - Word count
#    - Average word length
#    - Sentence count
# 3. Any other features you think might help!
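
The text statistics listed above could be computed along these lines. This is only a sketch: the function name `add_text_features` and the column names are hypothetical, words are split on whitespace, and sentences are approximated by counting runs of `.`, `!`, or `?`, which works on the cleaned text but is not a real sentence tokenizer.

```python
import pandas as pd


def add_text_features(df, text_col="text"):
    """Return a copy of df with simple text-statistic columns added."""
    out = df.copy()
    text = out[text_col].fillna("")
    words = text.str.split()  # whitespace tokenization

    out["char_count"] = text.str.len()
    out["word_count"] = words.str.len()

    # Average word length; defined as 0.0 for empty texts to avoid 0/0
    total_word_chars = words.apply(lambda ws: sum(len(w) for w in ws))
    avg = total_word_chars / out["word_count"]
    out["avg_word_length"] = avg.where(out["word_count"] > 0, 0.0)

    # Approximate sentence count: each run of ., ! or ? ends one sentence
    sentence_count = text.str.count(r"[.!?]+")
    # Non-empty text without any terminator still counts as one sentence
    no_terminator = (out["word_count"] > 0) & (sentence_count == 0)
    out["sentence_count"] = sentence_count.where(~no_terminator, 1)
    return out
```

Calling `train_df = add_text_features(train_df)` would then add the four columns next to the sentiment score from Part 3.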

## Part 5: Multi-Class Rating Prediction

Instead of binary classification, predict a 5-star rating!

In [None]:
# TODO: Create target variable
# Convert binary labels to 5-star ratings using your features
# Hint: Use sentiment scores and other features to estimate star rating
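
As one illustration of the hint, the continuous sentiment score can be binned into five star levels. The sketch below assumes the scores are probabilities in [0, 1] and uses equal-width bins; both the function name `scores_to_stars` and the bin edges are assumptions, not part of the assignment. Scores from a binary sentiment model tend to cluster near 0 and 1, so quantile-based bins (`pd.qcut`) may give a more balanced target.

```python
import pandas as pd


def scores_to_stars(scores):
    """Bin P(positive) scores in [0, 1] into integer ratings 1..5."""
    bins = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
    stars = pd.cut(
        pd.Series(scores, dtype=float),
        bins=bins,
        labels=[1, 2, 3, 4, 5],
        include_lowest=True,  # so that a score of exactly 0.0 maps to 1 star
    )
    return stars.astype(int)
```

For example, `train_df["stars"] = scores_to_stars(train_df["sentiment_score"])` would create the target column, assuming a `sentiment_score` column exists from Part 3.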

In [None]:
# TODO: Build and train your model
# 1. Split data into train and test sets
# 2. Choose a model suitable for multi-class classification
# 3. Train the model
# 4. Make predictions
# 5. Evaluate performance
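
The five steps above might look like the following sketch. A random forest is used here only as one reasonable multi-class choice; the function name `train_rating_model` is hypothetical, and the data is synthetic stand-in data so that the example runs on its own. Keep in mind that if the star target was derived from the sentiment score and that score is also a feature, the model will mostly relearn the binning rule.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split


def train_rating_model(X, y, test_size=0.2, random_state=42):
    """Split, fit a random forest, and return the model with held-out predictions."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    model = RandomForestClassifier(n_estimators=200, random_state=random_state)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return model, X_test, y_test, y_pred


# Synthetic stand-in data; replace with the real feature matrix and star target
rng = np.random.default_rng(0)
X_demo = rng.random((300, 3))                   # three made-up features in [0, 1)
y_demo = (X_demo[:, 0] * 5).astype(int) + 1     # stars 1..5 driven by feature 0

model, X_test, y_test, y_pred = train_rating_model(X_demo, y_demo)
print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=0))
```

The `stratify=y` argument keeps the star distribution similar in both splits, which matters when some ratings are rare.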

## Part 6: Analysis

Analyze your results and suggest improvements

In [None]:
# TODO: Create visualizations and analyze:
# 1. Confusion matrix for multi-class predictions
# 2. Feature importance
# 3. Error analysis
# 4. Suggest improvements
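
A starting point for the first three analysis items is sketched below. The helper names `confusion_table`, `rank_features`, and `worst_errors` are hypothetical, and the sketch returns tables rather than plots because plotting libraries are not in the install cell above.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix


def confusion_table(y_true, y_pred, labels=(1, 2, 3, 4, 5)):
    """Confusion matrix as a DataFrame: rows are true stars, columns are predicted."""
    labels = list(labels)
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    return pd.DataFrame(cm, index=labels, columns=labels)


def rank_features(importances, feature_names):
    """Sort feature importances (e.g. model.feature_importances_) from high to low."""
    return pd.Series(importances, index=feature_names).sort_values(ascending=False)


def worst_errors(texts, y_true, y_pred, n=10):
    """Return the n misclassified examples with the largest rating gap."""
    table = pd.DataFrame(
        {"text": list(texts), "true": list(y_true), "pred": list(y_pred)}
    )
    table["abs_error"] = (table["true"] - table["pred"]).abs()
    wrong = table[table["abs_error"] > 0]
    return wrong.sort_values("abs_error", ascending=False).head(n)
```

For a visual confusion matrix, `sklearn.metrics.ConfusionMatrixDisplay.from_predictions(y_test, y_pred)` is one option, though it requires `matplotlib` to be installed. Reading through the rows returned by `worst_errors` is often a quick way to spot patterns worth mentioning under suggested improvements.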