# Task 2: Movie Rating Prediction with Python

**Task:** Predict a movie's IMDb rating from its release year, vote count, director, and main genre.

**Dataset Source:**  
https://www.kaggle.com/datasets/adrianmcmahon/imdb-india-movies

(Dataset publicly available on Kaggle)

---

In [4]:
import pandas as pd

# Load the dataset; the file is read as ISO-8859-1 (Latin-1) instead of the default UTF-8
df = pd.read_csv("IMDb Movies India.csv", encoding="ISO-8859-1")
df.head()

| | Name | Year | Duration | Genre | Rating | Votes | Director | Actor 1 | Actor 2 | Actor 3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | | Drama | | | J.S. Randhawa | Manmauji | Birbal | Rajendra Bhatia |
| 1 | #Gadhvi (He thought he was Gandhi) | (2019) | 109 min | Drama | 7.0 | 8.0 | Gaurav Bakshi | Rasika Dugal | Vivek Ghamande | Arvind Jangid |
| 2 | #Homecoming | (2021) | 90 min | Drama, Musical | | | Soumyajit Majumdar | Sayani Gupta | Plabita Borthakur | Roy Angana |
| 3 | #Yaaram | (2019) | 110 min | Comedy, Romance | 4.4 | 35.0 | Ovais Khan | Prateik | Ishita Raj | Siddhant Kapoor |
| 4 | ...And Once Again | (2010) | 105 min | Drama | | | Amol Palekar | Rajat Kapoor | Rituparna Sengupta | Antara Mali |
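
Several of the rows above have no rating or vote count, so it helps to count missing values before cleaning. The following is a minimal sketch, not part of the original notebook: `sample` is a hypothetical stand-in frame copied from the rows shown above (with `Votes` assumed to be stored as text, since the preprocessing step uses `.str` on it). In the notebook itself the same calls would be made on `df`.

```python
import pandas as pd

# Hypothetical stand-in for df, copied from the first rows shown above.
# Assumption: Votes is stored as text in the raw file.
sample = pd.DataFrame(
    {
        "Name": [None, "#Gadhvi (He thought he was Gandhi)", "#Homecoming",
                 "#Yaaram", "...And Once Again"],
        "Year": [None, "(2019)", "(2021)", "(2019)", "(2010)"],
        "Duration": [None, "109 min", "90 min", "110 min", "105 min"],
        "Genre": ["Drama", "Drama", "Drama, Musical", "Comedy, Romance", "Drama"],
        "Rating": [None, 7.0, None, 4.4, None],
        "Votes": [None, "8", None, "35", None],
    }
)

# Count missing values per column to see what later dropna calls would remove.
missing = sample.isna().sum()
print(missing)

# Share of rows that have a usable target value.
rated_share = sample["Rating"].notna().mean()
print(f"Rows with a rating: {rated_share:.0%}")
```

Running the same two checks on the full `df` would show how many rows are lost when unrated movies are dropped.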


## Data Preprocessing

In [7]:
from sklearn.preprocessing import LabelEncoder

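# Keep only rows that have a rating (the prediction target)
df = df.dropna(subset=['Rating']).copy()

# Remove thousands separators from Votes, then convert to numbers;
# values that cannot be parsed become NaN
df['Votes'] = df['Votes'].str.replace(',', '', regex=False)
df['Votes'] = pd.to_numeric(df['Votes'], errors='coerce')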

# Extract main genre
df['MainGenre'] = df['Genre'].str.split(',').str[0]

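# Drop columns that are not used as features
df = df.drop(columns=['Name', 'Duration', 'Genre', 'Actor 1', 'Actor 2', 'Actor 3'])

# Drop rows that still have a missing value in any remaining column
df = df.dropna()

# Encode categorical columns as integer codes
# (Year is still a string such as "(2019)" at this point)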
for col in ['Year', 'Director', 'MainGenre']:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])

df.head()

| | Year | Rating | Votes | Director | MainGenre |
|---|---|---|---|---|---|
| 1 | 89 | 7.0 | 8 | 805 | 7 |
| 3 | 89 | 4.4 | 35 | 1730 | 4 |
| 5 | 67 | 4.7 | 827 | 1981 | 4 |
| 6 | 75 | 7.4 | 1086 | 2618 | 7 |
| 8 | 82 | 5.6 | 326 | 173 | 11 |
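
The cell above label-encodes `Year` and drops `Duration`, so the year ends up as a code such as 89 and the runtime is not used at all. One possible alternative, sketched here under assumptions and not taken from the original notebook, is to parse both columns into plain numbers. The frame `raw` and the column names `YearNum`, `DurationMin`, and `VotesNum` are hypothetical, and the value `"1,086"` is an assumed raw form of a vote count with a thousands separator.

```python
import pandas as pd

# Hypothetical stand-in rows in the raw string formats shown in the first output.
raw = pd.DataFrame(
    {
        "Year": ["(2019)", "(2019)", "(2010)"],
        "Duration": ["109 min", "110 min", None],
        "Votes": ["8", "35", "1,086"],
    }
)

# Pull the four-digit year out of strings like "(2019)".
raw["YearNum"] = raw["Year"].str.extract(r"(\d{4})", expand=False).astype(int)

# Strip the " min" suffix; missing durations stay NaN.
raw["DurationMin"] = pd.to_numeric(
    raw["Duration"].str.replace(" min", "", regex=False), errors="coerce"
)

# Same comma removal as in the preprocessing cell.
raw["VotesNum"] = pd.to_numeric(
    raw["Votes"].str.replace(",", "", regex=False), errors="coerce"
)

print(raw[["YearNum", "DurationMin", "VotesNum"]])
```

Numeric years keep their natural ordering and spacing, which a label code only preserves as a rank. Whether this changes the model's scores on this dataset would need to be checked by rerunning the training cell.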


## Model Training and Evaluation

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import math

# Features and target
X = df.drop(columns=['Rating'])
y = df['Rating']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
mae = mean_absolute_error(y_test, y_pred)
rmse = math.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("MAE:", mae)
print("RMSE:", rmse)
print("R2 Score:", r2)

MAE: 0.8697177436553634
RMSE: 1.142934267347938
R2 Score: 0.3203894070732036
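
To make the three scores above easier to read, the sketch below computes MAE, RMSE, and R2 directly from their definitions and compares a set of predictions against a constant baseline that always predicts the mean rating. It is an illustration only: `regression_metrics` is a hypothetical helper, `y_true` reuses the five ratings from the preprocessing output, and `y_hat` is made up rather than produced by the model.

```python
import numpy as np

# Five ratings from the preprocessing output; the predictions are invented
# for illustration and do not come from the trained model.
y_true = np.array([7.0, 4.4, 4.7, 7.4, 5.6])
y_hat = np.array([6.2, 5.0, 5.1, 6.8, 5.9])


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R2) computed from their definitions."""
    err = y_true - y_pred
    mae = np.mean(np.abs(err))                       # mean absolute error
    rmse = np.sqrt(np.mean(err ** 2))                # root mean squared error
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


mae, rmse, r2 = regression_metrics(y_true, y_hat)
print(f"predictions  MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")

# Baseline: always predict the mean of the evaluated ratings.
# Its R2 is 0 by construction, which is the reference point for R2.
baseline = np.full_like(y_true, y_true.mean())
b_mae, b_rmse, b_r2 = regression_metrics(y_true, baseline)
print(f"mean only    MAE={b_mae:.3f}  RMSE={b_rmse:.3f}  R2={b_r2:.3f}")
```

Read this way, the R2 of about 0.32 reported above means the model's squared error on the test set is roughly 32% lower than that of always predicting the mean test rating, and the MAE of about 0.87 is the typical miss in rating points.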
