# Training a sentiment model on Letterboxd reviews

This notebook trains a simple sentiment classifier using:

- `cleaned_reviews.csv` as input
- TF-IDF features from `clean_text`
- Logistic Regression as the classifier
- Balanced training data across positive, neutral, and negative

The code from `sentiment_model.py` is reused.

In [8]:
import sys
import os

In [9]:
# Get the absolute path of the project root
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))


In [10]:
# Add root to Python path
if project_root not in sys.path:
    sys.path.append(project_root)

In [11]:
print("Project root added:", project_root)

Project root added: /Users/sanjaydilip/Desktop/Code/Projects/letterboxd movie sentiment api


# Loading cleaned dataset

In [1]:
import pandas as pd

In [2]:
clean_path = "/Users/sanjaydilip/Desktop/Code/Projects/letterboxd movie sentiment api/data/processed/cleaned_reviews.csv"
df_clean = pd.read_csv(clean_path)

In [3]:
df_clean.head()

Unnamed: 0,movie_title,release_year,review_date,review_date_parsed,reviewer_name,rating_raw,rating_numeric,sentiment_label,review_text,clean_text,comment_count,like_count
0,Aftersun (2022),2022,12-Jan-20,2020-01-12 00:00:00,Tuomas,★★★★½,4.5,positive,This review may contain spoilers.,this review may contain spoilers,130,"22,446 likes"
1,Joker (2019),2019,20-Dec-22,2022-12-20 00:00:00,Joao,★★★★★,5.0,positive,if you’ve never swam in the ocean then of co...,if you ve never swam in the ocean then of cour...,1.8K,"22,032 likes"
2,Puss in Boots: The Last Wish (2022),2022,15-Sep-22,2022-09-15 00:00:00,NicoPico,★½,1.5,negative,Puss in Boots: Into the Pussy-Verse,puss in boots into the pussy verse,62,"21,666 likes"
3,The Banshees of Inisherin (2022),2022,8-Apr-22,2022-04-08 00:00:00,Ella Kemp,★★★★★,5.0,positive,I will NOT leave my donkey outside when I’m sad,i will not leave my donkey outside when i m sad,,"21,609 likes"
4,Everything Everywhere All at Once (2022),2022,14-Aug-19,2019-08-14 00:00:00,CosmonautMarkie,★★½,2.5,negative,Watch it and have fun before film Twitter tell...,watch it and have fun before film twitter tell...,355,"20,688 likes"


# Inspecting label balance in the full cleaned data

### Sentiment label distribution in cleaned data

In [4]:
df_clean["sentiment_label"].value_counts()

sentiment_label
positive    2667
neutral      486
negative     407
Name: count, dtype: int64

In [5]:
df_clean["sentiment_label"].value_counts(normalize=True).round(3)

sentiment_label
positive    0.749
neutral     0.137
negative    0.114
Name: proportion, dtype: float64

In [6]:
df_clean["clean_length"] = df_clean["clean_text"].astype(str).str.len()
df_clean[df_clean["clean_length"] >= 5]["sentiment_label"].value_counts(normalize=True).round(3)

sentiment_label
positive    0.748
neutral     0.137
negative    0.115
Name: proportion, dtype: float64
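
The counts above also give a useful reference point: on the unbalanced data, a classifier that always predicts `positive` would already be correct about three times out of four, so raw accuracy alone says little. A minimal sketch of that arithmetic, using the counts printed by `value_counts()` (the variable names are illustrative and not part of the project code):

```python
from collections import Counter

# Label counts copied from the value_counts() output above.
label_counts = Counter({"positive": 2667, "neutral": 486, "negative": 407})

total = sum(label_counts.values())
majority_label, majority_count = label_counts.most_common(1)[0]

# Accuracy of a baseline that always predicts the most frequent label.
majority_baseline_accuracy = majority_count / total

print(majority_label, round(majority_baseline_accuracy, 3))
```

The baseline accuracy equals the `positive` proportion reported by `value_counts(normalize=True)`.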

## Importing training helpers from `sentiment_model`

The script `sentiment_model.py` contains:

- `load_training_data` – loads data and applies class balancing
- `train_sentiment_model` – trains Logistic Regression, prints a classification report, and saves the model and vectorizer

In [12]:
from src.sentiment_model import load_training_data, train_sentiment_model

# Inspecting the balanced training data

`load_training_data` applies:

- minimum length filter on `clean_text`
- downsampling of positive, neutral, and negative to the size of the smallest class
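
The body of `load_training_data` is not shown in this notebook, so the following is only a sketch of how the two steps above could look with pandas. The helper name `balance_by_downsampling`, the minimum length of 5 characters (borrowed from the length check earlier in this notebook), and the `random_state` are assumptions.

```python
import pandas as pd


def balance_by_downsampling(df, label_col="sentiment_label", text_col="clean_text",
                            min_length=5, random_state=42):
    """Hypothetical helper: filter short texts, then downsample each class."""
    # Keep only rows whose cleaned text has at least `min_length` characters.
    lengths = df[text_col].astype(str).str.len()
    filtered = df[lengths >= min_length]

    # Size of the smallest class after filtering.
    n_min = filtered[label_col].value_counts().min()

    # Draw n_min rows from every class without replacement.
    balanced = filtered.groupby(label_col).sample(n=n_min, random_state=random_state)
    return balanced.reset_index(drop=True)


# Toy data standing in for cleaned_reviews.csv.
toy = pd.DataFrame({
    "clean_text": ["loved it so much", "great film overall", "pretty good movie",
                   "it was okay i guess", "fine but forgettable",
                   "boring and far too long", "ok"],
    "sentiment_label": ["positive", "positive", "positive",
                        "neutral", "neutral", "negative", "negative"],
})

balanced = balance_by_downsampling(toy)
print(balanced["sentiment_label"].value_counts().to_dict())
```

In the toy data the two-character review is dropped first, so the smallest class has one row and every class is reduced to one row. Applying the filter before counting matters: the smallest class is measured on the filtered data, not on the raw counts.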

In [13]:
X, y = load_training_data()

In [14]:
print("Training samples after loading:", len(X))
print("Labels after loading:", len(y))

Training samples after loading: 1215
Labels after loading: 1215


In [15]:
import numpy as np

In [16]:
unique, counts = np.unique(y, return_counts=True)
dict(zip(unique, counts))

{'negative': np.int64(405),
 'neutral': np.int64(405),
 'positive': np.int64(405)}

# Training the model and viewing the metrics

### Training the sentiment model

This will:

- split into train and validation sets
- vectorize text with TF-IDF
- train Logistic Regression
- print a validation classification report
- save the model and vectorizer into `models/`
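
The steps above can be sketched with scikit-learn as follows. This is not the code in `sentiment_model.py`, which is not reproduced here: the toy corpus, the 20% validation split, the default `TfidfVectorizer` settings, and `max_iter=1000` are all assumptions chosen so the example runs on its own. Saving to `models/` is left out of this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Toy corpus standing in for the balanced training data (X, y).
texts = [
    "loved every second of it", "beautiful and moving film", "a wonderful experience",
    "absolutely great acting", "brilliant and heartfelt",
    "it was fine nothing special", "an okay way to pass time", "average in every way",
    "decent but forgettable", "neither good nor bad",
    "boring and badly written", "i regret watching this", "a terrible waste of time",
    "awful pacing and dull plot", "painfully bad dialogue",
]
labels = ["positive"] * 5 + ["neutral"] * 5 + ["negative"] * 5

# Stratified split so every class appears in both train and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Fit TF-IDF on the training texts only, then reuse it for validation.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

y_pred = model.predict(X_val_vec)
print(classification_report(y_val, y_pred, zero_division=0))
```

Fitting the vectorizer on the training split only keeps validation vocabulary and document frequencies out of the features, so the validation report is not inflated by leakage.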

In [17]:
vectorizer, model = train_sentiment_model()

Validation results:
              precision    recall  f1-score   support

    negative       0.37      0.43      0.40        81
     neutral       0.27      0.22      0.24        81
    positive       0.29      0.30      0.29        81

    accuracy                           0.32       243
   macro avg       0.31      0.32      0.31       243
weighted avg       0.31      0.32      0.31       243
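
To put the report in context, the summary rows can be recomputed from the per-class rows and compared with random guessing. The sketch below uses only the rounded numbers printed above, so the results are approximate.

```python
# Per-class validation metrics copied from the report above.
recall = {"negative": 0.43, "neutral": 0.22, "positive": 0.30}
f1 = {"negative": 0.40, "neutral": 0.24, "positive": 0.29}
support = {"negative": 81, "neutral": 81, "positive": 81}

total = sum(support.values())

# Accuracy is the support-weighted mean of per-class recall.
accuracy = sum(recall[c] * support[c] for c in support) / total

# Macro F1 is the unweighted mean of per-class F1.
macro_f1 = sum(f1.values()) / len(f1)

# Expected accuracy of uniform random guessing over the classes.
chance = 1 / len(support)

print(round(accuracy, 2), round(macro_f1, 2), round(chance, 2))
```

The recomputed accuracy and macro F1 match the report, and both sit at or slightly below the one-in-three rate expected from guessing uniformly among three classes.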



In [21]:
import os
model_dir = os.path.abspath("../models")
print("Model dir:", model_dir)
os.listdir(model_dir)

Model dir: /Users/sanjaydilip/Desktop/Code/Projects/letterboxd movie sentiment api/models


['sentiment_model.joblib', 'vectorizer.joblib']
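
The file names suggest the artifacts are written with `joblib`. Assuming that is the case, a save and reload round trip could look like the sketch below. The tiny model and the temporary directory are stand-ins so the example runs without the real `models/` folder.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model so the example runs on its own.
texts = ["loved it", "it was fine", "boring and bad"]
labels = ["positive", "neutral", "negative"]
vectorizer = TfidfVectorizer().fit(texts)
model = LogisticRegression(max_iter=1000).fit(vectorizer.transform(texts), labels)

# A temporary directory stands in for models/.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    joblib.dump(model, model_dir / "sentiment_model.joblib")
    joblib.dump(vectorizer, model_dir / "vectorizer.joblib")

    saved_files = sorted(p.name for p in model_dir.iterdir())
    loaded_model = joblib.load(model_dir / "sentiment_model.joblib")
    loaded_vectorizer = joblib.load(model_dir / "vectorizer.joblib")

# The reloaded pair should predict exactly like the originals.
sample = ["boring and bad"]
original_pred = model.predict(vectorizer.transform(sample))
reloaded_pred = loaded_model.predict(loaded_vectorizer.transform(sample))
print(saved_files, original_pred[0] == reloaded_pred[0])
```

The model and vectorizer have to be saved and loaded as a pair, because the model's coefficients are tied to the column order of the vectorizer's vocabulary.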

# Testing manual predictions

The trained model and vectorizer returned above are run on a few test sentences to check whether the predictions are sensible.

In [22]:
from src.sentiment_model import predict_sentiment_for_text

In [23]:
examples = [
    "This movie was absolutely beautiful, I loved every second.",
    "It was fine, nothing special but not terrible either.",
    "This was boring and badly written, I regret watching it."
]

In [24]:
for text in examples:
    result = predict_sentiment_for_text(text, vectorizer, model)
    print(text)
    print(result)
    print("-" * 60)

This movie was absolutely beautiful, I loved every second.
{'label': 'positive', 'probability': 0.5997199860511911, 'all_probs': {'negative': 0.2252270732449395, 'neutral': 0.17505294070386934, 'positive': 0.5997199860511911}}
------------------------------------------------------------
It was fine, nothing special but not terrible either.
{'label': 'positive', 'probability': 0.3504215970734989, 'all_probs': {'negative': 0.31348899183524287, 'neutral': 0.33608941109125823, 'positive': 0.3504215970734989}}
------------------------------------------------------------
This was boring and badly written, I regret watching it.
{'label': 'negative', 'probability': 0.5309795193366308, 'all_probs': {'negative': 0.5309795193366308, 'neutral': 0.22311707870571415, 'positive': 0.24590340195765503}}
------------------------------------------------------------
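
The dictionaries printed above contain a `label`, its `probability`, and `all_probs` for every class. A function producing that shape could be sketched as below. This is a hypothetical stand-in, not the actual `predict_sentiment_for_text`, which may also clean the input text the same way `clean_text` was produced; the sketch skips that step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def predict_sentiment_sketch(text, vectorizer, model):
    """Hypothetical stand-in for predict_sentiment_for_text."""
    # predict_proba returns one row per input; columns follow model.classes_.
    probs = model.predict_proba(vectorizer.transform([text]))[0]
    all_probs = {str(label): float(p) for label, p in zip(model.classes_, probs)}

    # The predicted label is the class with the highest probability.
    best_label = max(all_probs, key=all_probs.get)
    return {"label": best_label, "probability": all_probs[best_label], "all_probs": all_probs}


# Tiny stand-in model so the example runs on its own.
texts = ["loved it so much", "it was fine", "boring and badly written"]
labels = ["positive", "neutral", "negative"]
vectorizer = TfidfVectorizer().fit(texts)
model = LogisticRegression(max_iter=1000).fit(vectorizer.transform(texts), labels)

result = predict_sentiment_sketch("boring and badly written", vectorizer, model)
print(result)
```

Returning the full `all_probs` dictionary makes near-ties visible. In the notebook output above, the "It was fine" sentence is labelled positive with all three probabilities between roughly 0.31 and 0.35, which is a much weaker signal than the other two predictions.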


# Summary

- A simple TF-IDF + Logistic Regression model is trained on cleaned Letterboxd reviews.
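- Labels are weak because they are derived from star ratings, so performance is limited: validation accuracy is 0.32 on three balanced classes, close to the 0.33 expected from random guessing. The result is a working end-to-end classifier rather than an accurate one.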
- We balanced the classes so the model does not always predict positive.
- The model and vectorizer are saved in `models/` and will be loaded by the FastAPI service.

Next step: plug this model into the API (`api/main.py`) and expose endpoints for:
- single text analysis
- movie-level summaries
- movie comparisons