<a href="https://colab.research.google.com/github/rana-ta/Machine_Learning_Project/blob/main/ML_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Movie Genre Popularity from Popular Reviews
In this project, we use machine learning models to predict the Letterboxd ratings of a particular movie genre.


# 1. Data Understanding

### Dataset
The dataset used in this project is the *Letterboxd All Movie Data dataset*, which is available on Hugging Face.

With 847,209 movies, this dataset is large and diverse, spanning many genres and release years, which should help models generalize. Only a subset of movies carries a rating (108,834) or reviews (366,600), so the usable training set is considerably smaller.

The inclusion of both structured metadata (e.g., genres, release year) and unstructured text (reviews) allows us to capture both statistical patterns and audience sentiment, which is crucial for understanding long-term genre trends.



### Importing the Needed Packages

In [1]:
from datasets import load_dataset
import pandas as pd
import ast
import re

### Load the dataset

In [2]:
train_dataset = load_dataset("pkchwy/letterboxd-all-movie-data", split="train")

# Convert to a pandas DataFrame
df = pd.DataFrame(train_dataset)




### Preview the Dataset

In [3]:
# Display the first 5 rows of the dataset
df.head()

Unnamed: 0,url,title,year,directors,genres,cast,synopsis,rating,reviews,poster_url
0,https://letterboxd.com/film/come-and-see/,Come and See,1985,[Elem Klimov],"[War, Drama]","[Aleksei Kravchenko, Olga Mironova, Liubomiras...",The invasion of a village in Byelorussia by Ge...,4.62 out of 5,"[{'username': 'cameron fetter', 'review_text':...",https://a.ltrbxd.com/resized/film-poster/3/6/1...
1,https://letterboxd.com/film/seven-samurai/,Seven Samurai,1954,[Akira Kurosawa],"[Action, Drama]","[Toshirō Mifune, Takashi Shimura, Yoshio Inaba...",A samurai answers a village's request for prot...,4.61 out of 5,"[{'username': 'maria', 'review_text': 'too man...",https://a.ltrbxd.com/resized/film-poster/5/1/7...
2,https://letterboxd.com/film/high-and-low/,High and Low,1963,[Akira Kurosawa],"[Mystery, Thriller, Crime, Drama]","[Toshirō Mifune, Tatsuya Nakadai, Kyōko Kagawa...",In the midst of an attempt to take over his co...,4.60 out of 5,"[{'username': 'Karsten', 'review_text': 'every...",https://a.ltrbxd.com/resized/film-poster/4/4/5...
3,https://letterboxd.com/film/harakiri/,Harakiri,1962,[Masaki Kobayashi],"[History, Drama, Action]","[Tatsuya Nakadai, Akira Ishihama, Shima Iwashi...",Down-on-his-luck veteran Tsugumo Hanshirō ente...,4.69 out of 5,"[{'username': 'Ciara', 'review_text': 'Why is ...",https://a.ltrbxd.com/resized/film-poster/4/3/0...
4,https://letterboxd.com/film/12-angry-men/,12 Angry Men,1957,[Sidney Lumet],[Drama],"[Martin Balsam, John Fiedler, Lee J. Cobb, E.G...",The defense and the prosecution have rested an...,4.63 out of 5,"[{'username': 'amaya', 'review_text': '1. henr...",https://a.ltrbxd.com/resized/film-poster/5/1/7...


In [4]:
# Get Information about the Dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 847209 entries, 0 to 847208
Data columns (total 10 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   url         847209 non-null  object
 1   title       847209 non-null  object
 2   year        770792 non-null  object
 3   directors   732443 non-null  object
 4   genres      638857 non-null  object
 5   cast        595214 non-null  object
 6   synopsis    727899 non-null  object
 7   rating      108834 non-null  object
 8   reviews     366600 non-null  object
 9   poster_url  847209 non-null  object
dtypes: object(10)
memory usage: 64.6+ MB


In [5]:
# Check the Shape of the Dataset
df.shape

(847209, 10)

In [6]:
# Check Data Types
df.dtypes

Unnamed: 0,0
url,object
title,object
year,object
directors,object
genres,object
cast,object
synopsis,object
rating,object
reviews,object
poster_url,object


# 2. Data Preparation

### Missing Values

In [7]:
# Check missing values
df.isnull().sum()

Unnamed: 0,0
url,0
title,0
year,76417
directors,114766
genres,208352
cast,251995
synopsis,119310
rating,738375
reviews,480609
poster_url,0


### Dropping columns

We will keep the columns needed for this task, which are reviews, genres, year, and rating.

In [8]:
# Keep only the columns needed for the task; .copy() avoids a SettingWithCopyWarning on later edits
df = df[['reviews', 'genres', 'year', 'rating']].copy()
df.head()

Unnamed: 0,reviews,genres,year,rating
0,"[{'username': 'cameron fetter', 'review_text':...","[War, Drama]",1985,4.62 out of 5
1,"[{'username': 'maria', 'review_text': 'too man...","[Action, Drama]",1954,4.61 out of 5
2,"[{'username': 'Karsten', 'review_text': 'every...","[Mystery, Thriller, Crime, Drama]",1963,4.60 out of 5
3,"[{'username': 'Ciara', 'review_text': 'Why is ...","[History, Drama, Action]",1962,4.69 out of 5
4,"[{'username': 'amaya', 'review_text': '1. henr...",[Drama],1957,4.63 out of 5


In [9]:
# Drop rows with a missing value in any of the needed columns
df = df.dropna(subset=['reviews', 'genres', 'year', 'rating'])

In [10]:
df.isnull().sum()

Unnamed: 0,0
reviews,0
genres,0
year,0
rating,0
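
Because `dropna` with `subset` removes any row that is missing at least one of the listed columns, the sparsest column bounds the size of the cleaned data: `rating` has 108,834 non-null values in the `df.info()` output, so at most that many rows can survive. A minimal sketch on a made-up four-row frame (the `toy` values and the names `needed`, `complete`, and `cleaned` are illustrative, not taken from the notebook) shows how the complete rows could be counted before dropping:

```python
import pandas as pd

# Hypothetical toy data; None marks a missing value.
toy = pd.DataFrame({
    "reviews": [[{"review_text": "great"}], None, [{"review_text": "slow"}], [{"review_text": "fine"}]],
    "genres": [["War", "Drama"], ["Drama"], None, ["Action"]],
    "year": ["1985", "1957", "1963", "1954"],
    "rating": ["4.62 out of 5", "4.63 out of 5", "4.60 out of 5", "4.61 out of 5"],
})
needed = ["reviews", "genres", "year", "rating"]

# A row is complete only if every needed column is present.
complete = toy[needed].notna().all(axis=1)
n_complete = int(complete.sum())

# dropna(subset=...) keeps exactly those complete rows.
cleaned = toy.dropna(subset=needed)
```

Checking `n_complete` first makes it clear how much data the drop will cost before the original frame is overwritten.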


### Extracting the first Genre

For this task, we will keep only the first listed genre of each movie, which gives every movie a single genre label and makes prediction simpler.

In [11]:
df["genre"] = df["genres"].apply(lambda x: x[0] if isinstance(x, list) and len(x) > 0 else None)

df = df.drop(columns=["genres"])
df["genre"].head()

Unnamed: 0,genre
0,War
1,Action
2,Mystery
3,History
4,Drama
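
The same first-element extraction can probably be written without a lambda, using the pandas `.str` accessor, which also indexes into lists. The sketch below uses a made-up `genres` Series, not the notebook's data. One difference from the lambda above is that empty or missing entries come back as NaN rather than None, and pandas treats both as missing.

```python
import pandas as pd

# Hypothetical toy column of genre lists, including an empty list and a missing value.
genres = pd.Series([["War", "Drama"], [], None, ["Drama"]])

# .str[0] takes the first element of each list; empty or missing entries become NaN.
first_genre = genres.str[0]
```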


### Converting the Year column

We will convert the year column to an integer data type.

In [12]:
# Overwrite the existing column so that later cells reading "year" get integers
df['year'] = df['year'].str.extract(r'(\d+)', expand=False).astype(int)
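
A plain `astype(int)` raises an error if any value fails to parse. As a hedged alternative, `pd.to_numeric` with `errors="coerce"` turns unparseable entries into NaN, and the nullable `Int64` dtype keeps the years as integers. The `raw_year` values below are invented for illustration; whether the real column contains such entries after the earlier `dropna` is an assumption.

```python
import pandas as pd

# Hypothetical raw values: clean years, a missing value, and a non-numeric placeholder.
raw_year = pd.Series(["1985", "1954", None, "TBA"])

# errors="coerce" turns anything unparseable into NaN instead of raising.
year_num = pd.to_numeric(raw_year, errors="coerce")

# The nullable Int64 dtype stores whole years while still allowing missing entries.
year_int = year_num.astype("Int64")
```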

### Converting the Rating Column

We will convert the rating column from text (e.g., "4.61 out of 5") to a float.

In [13]:
df['rating'] = df['rating'].apply(lambda x: float(re.findall(r"[\d.]+", x)[0]) if isinstance(x, str) else x)
df['rating'].head()

Unnamed: 0,rating
0,4.62
1,4.61
2,4.6
3,4.69
4,4.63
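
The same parsing can be sketched in vectorized form with `Series.str.extract`, which avoids the per-row lambda and yields NaN for any value that does not start with a number. The `raw_rating` values are made up, and the anchored pattern assumes every rating follows the "<score> out of 5" format shown in the preview.

```python
import pandas as pd

# Hypothetical raw ratings in the "<score> out of 5" format, plus a missing value.
raw_rating = pd.Series(["4.62 out of 5", "4.60 out of 5", None])

# Capture the leading decimal number; rows that do not match become NaN.
rating_num = raw_rating.str.extract(r"^(\d+(?:\.\d+)?)", expand=False).astype(float)
```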


### Extract Review text

Each movie contains multiple reviews. We will put each review on its own row while keeping the same genre, year, and rating for that movie.

In [14]:
expanded_rows = []
for _, row in df.iterrows():
    if isinstance(row["reviews"], list):
        for review in row["reviews"]:
            expanded_rows.append({
                "review_text": review.get("review_text", ""),
                "genre": row["genre"],
                "year": row["year"],
                "rating": row["rating"],
            })

# Convert the expanded list into a new DataFrame
reviews_df = pd.DataFrame(expanded_rows)
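
Row-by-row iteration with `iterrows` tends to be slow on large frames. A possible alternative, sketched here on a made-up two-movie frame (`movies` and `reviews_long` are hypothetical names), is `DataFrame.explode`, which produces one row per list element and repeats the other columns. The filter step is there so the result matches the loop, since `explode` emits a NaN row for an empty list while the loop skips it.

```python
import pandas as pd

# Hypothetical toy frame in the same shape as df after the previous steps.
movies = pd.DataFrame({
    "reviews": [
        [{"username": "a", "review_text": "great"}, {"username": "b", "review_text": "bleak"}],
        [{"username": "c"}],  # review without a review_text key
    ],
    "genre": ["War", "Drama"],
    "year": [1985, 1957],
    "rating": [4.62, 4.63],
})

# One row per list element; the other columns are repeated for each review.
exploded = movies.explode("reviews", ignore_index=True)

# Keep only real review dicts (explode emits NaN for empty lists).
exploded = exploded[exploded["reviews"].apply(lambda r: isinstance(r, dict))].copy()

# Pull the text out of each dict, defaulting to "" as the loop does.
exploded["review_text"] = exploded["reviews"].apply(lambda r: r.get("review_text", ""))
reviews_long = exploded[["review_text", "genre", "year", "rating"]].reset_index(drop=True)
```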

### Preview & Save the Cleaned Dataset

In [16]:
reviews_df.head(10)

Unnamed: 0,review_text,genre,year,rating
0,as soon as this film ended i went online and e...,War,1985,4.62
1,Come and See is a film I find almost impossibl...,War,1985,4.62
2,What a horrible nightmare!,War,1985,4.62
3,(guy who's still buzzing from Spider-Man: Acro...,War,1985,4.62
4,apparently elem klimov wanted to name this fil...,War,1985,4.62
5,Francois Truffaut once said that it is impossi...,War,1985,4.62
6,this makes other WWII movieslook like a ride a...,War,1985,4.62
7,Playing this at grandma’s bingo party next week.,War,1985,4.62
8,"100%, no doubt in my mind, this is the best fi...",War,1985,4.62
9,An apocalyptic nightmare of pure brutalizing e...,War,1985,4.62


In [18]:
reviews_df.to_csv("lettedboxed_clean_dataset.csv", index=False, encoding="utf-8")