# 🎬 Movie Rating Prediction with Python

This notebook demonstrates a regression-based approach to predicting IMDb ratings of Indian movies from features such as genre, director, runtime, votes, and release year.

Dataset Source: [Kaggle - IMDb Indian Movies](https://www.kaggle.com/datasets/adrianmcmahon/imdb-india-movies)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load the dataset
df = pd.read_csv('data/IMDb_Indian_Movies.csv')
df.head()

## 🔍 Data Exploration

In [None]:
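# A notebook cell only displays its last expression, so print each summary explicitly
df.info()
print(df.describe())
print(df.isnull().sum())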

## 🧹 Data Preprocessing

In [None]:
# Drop unnecessary columns and handle missing values
df.drop(columns=['Poster_Link', 'Overview'], inplace=True, errors='ignore')
df.dropna(inplace=True)

# Convert categorical features
df = pd.get_dummies(df, columns=['Genre', 'Certificate'], drop_first=True)

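# Convert votes to numeric; astype(str) keeps this safe if the column is already numeric,
# and errors='coerce' turns unparseable entries into NaN (dropped by the final dropna below)
df['Votes'] = pd.to_numeric(df['Votes'].astype(str).str.replace(',', '', regex=False), errors='coerce')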

# Extract year if in date format
df['Year'] = pd.to_datetime(df['Release Date'], errors='coerce').dt.year
df.drop(columns=['Release Date'], inplace=True, errors='ignore')
df.dropna(inplace=True)
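
The vote cleaning and categorical encoding above can be tried on a toy frame before running them on the full dataset. The sketch below uses made-up rows, and it assumes `Genre` may hold several comma-separated labels per movie, which the notebook does not confirm. Under that assumption, `Series.str.get_dummies` is one alternative to `pd.get_dummies` that yields one indicator column per individual genre rather than one per combination.

```python
import pandas as pd

# Hypothetical toy rows; the real dataset's values may be formatted differently.
toy = pd.DataFrame({
    "Genre": ["Drama, Romance", "Action", "Comedy, Drama"],
    "Votes": ["1,234", "56", "7,890"],
})

# Strip thousands separators, then parse; unparseable entries would become NaN.
toy["Votes"] = pd.to_numeric(
    toy["Votes"].astype(str).str.replace(",", "", regex=False),
    errors="coerce",
)

# One 0/1 column per individual genre label.
genre_flags = toy["Genre"].str.get_dummies(sep=", ")

# Replace the raw text column with the indicator columns.
toy_encoded = pd.concat([toy.drop(columns="Genre"), genre_flags], axis=1)
print(toy_encoded)
```

With multi-label genres, plain `pd.get_dummies` would treat "Drama, Romance" and "Comedy, Drama" as unrelated categories, so the shared "Drama" signal would be lost.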

## 🧪 Train-Test Split

In [None]:
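# Keep only numeric/boolean features; leftover text columns (such as a director name)
# cannot be passed to these models without further encoding
X = df.drop(columns='Rating').select_dtypes(include=['number', 'bool'])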
y = df['Rating']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## 🤖 Model Training & Evaluation

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
print("Linear Regression R2 Score:", r2_score(y_test, y_pred_lr))

In [None]:
rf = RandomForestRegressor(random_state=42)  # fixed seed so the score is reproducible
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest R2 Score:", r2_score(y_test, y_pred_rf))

In [None]:
xgb = XGBRegressor(random_state=42)  # fixed seed so the score is reproducible
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
print("XGBoost R2 Score:", r2_score(y_test, y_pred_xgb))

## 📊 Evaluation Metrics

In [None]:
def evaluate(y_true, y_pred, model_name):
    print(f"{model_name} MAE: {mean_absolute_error(y_true, y_pred):.2f}")
    # squared=False was deprecated and later removed from scikit-learn; take the root explicitly
    print(f"{model_name} RMSE: {np.sqrt(mean_squared_error(y_true, y_pred)):.2f}")
    print(f"{model_name} R2 Score: {r2_score(y_true, y_pred):.2f}")

evaluate(y_test, y_pred_lr, "Linear Regression")
evaluate(y_test, y_pred_rf, "Random Forest")
evaluate(y_test, y_pred_xgb, "XGBoost")
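
To make the three metrics concrete, here is a small worked example that computes MAE, RMSE, and R2 by hand with NumPy and checks them against scikit-learn. The ratings in `y_true_demo` and `y_pred_demo` are invented for illustration and are not results from the models above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up ratings, chosen so the arithmetic is easy to follow.
y_true_demo = np.array([6.0, 7.0, 8.0, 5.0])
y_pred_demo = np.array([6.5, 6.5, 7.0, 5.0])

errors = y_true_demo - y_pred_demo  # [-0.5, 0.5, 1.0, 0.0]

# MAE: mean of absolute errors = (0.5 + 0.5 + 1.0 + 0.0) / 4
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error = sqrt(1.5 / 4)
rmse = np.sqrt(np.mean(errors ** 2))

# R2: 1 - (residual sum of squares / total sum of squares) = 1 - 1.5 / 5.0
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")

# The hand-computed values should match scikit-learn's.
assert np.isclose(mae, mean_absolute_error(y_true_demo, y_pred_demo))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true_demo, y_pred_demo)))
assert np.isclose(r2, r2_score(y_true_demo, y_pred_demo))
```

MAE and RMSE are in rating points, so they are directly interpretable; RMSE penalizes the single 1.0-point miss more heavily than MAE does. R2 is unitless and compares the model against always predicting the mean rating.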

## ✅ Conclusion

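We evaluated three regression models: linear regression, random forest, and XGBoost. Tree-based ensembles such as XGBoost often score best on tabular data like this because they capture non-linear relationships and feature interactions that a linear model misses, but the metrics above should decide the ranking for this dataset. Hyperparameter tuning and cross-validation are natural next steps, since a single 80/20 split gives a noisy estimate of performance.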
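
As a starting point for those next steps, the sketch below combines k-fold cross-validation with a small grid search. It runs on synthetic data from `make_regression` as a stand-in for `X_train` and `y_train`, and the parameter grid is an arbitrary illustration, not a recommended setting for this dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in for the training data; swap in X_train / y_train in the notebook.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=42
)

# Shuffled 5-fold split with a fixed seed so the scores are reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validated R2 for an untuned baseline model.
baseline = RandomForestRegressor(n_estimators=50, random_state=42)
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=cv, scoring="r2")
print(f"Baseline R2: {baseline_scores.mean():.3f} +/- {baseline_scores.std():.3f}")

# Small illustrative grid; every combination is scored with the same 5 folds.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=cv,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, hence the negated MAE
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated MAE: {-search.best_score_:.3f}")
```

Tuning should be done on the training split only, with the held-out test set kept for a single final evaluation of `search.best_estimator_`.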