# **🎬 IMDb Indian Movies Rating Prediction**

## **Step 1: Import Libraries**
- `numpy` and `pandas` for numerical operations and data handling.
- `train_test_split` from `sklearn.model_selection` to split our dataset into training and testing sets.
- `mean_squared_error` and `r2_score` for model evaluation.
- `XGBRegressor` from XGBoost for training our regression model.
- `SentenceTransformer` for encoding textual data (loaded and saved in this notebook, but not used as a model feature in this version).
- `pickle` to save our models and encodings for future use.


In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor
from sentence_transformers import SentenceTransformer
import pickle



## **📊 Step 2: Load and Preprocess the Dataset**
- Load the dataset from the `data/IMDb Movies India.csv` file.
- Replace missing values with a single-space string (`' '`) so that no column contains `NaN`.
- Convert `Rating` column to numeric, dropping rows with missing ratings.

In [6]:
import pandas as pd
data = pd.read_csv("data/IMDb Movies India.csv", encoding='latin1')
data.fillna(' ', inplace=True)
data.head()

| | Name | Year | Duration | Genre | Rating | Votes | Director | Actor 1 | Actor 2 | Actor 3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | | Drama | | | J.S. Randhawa | Manmauji | Birbal | Rajendra Bhatia |
| 1 | #Gadhvi (He thought he was Gandhi) | (2019) | 109 min | Drama | 7.0 | 8.0 | Gaurav Bakshi | Rasika Dugal | Vivek Ghamande | Arvind Jangid |
| 2 | #Homecoming | (2021) | 90 min | Drama, Musical | | | Soumyajit Majumdar | Sayani Gupta | Plabita Borthakur | Roy Angana |
| 3 | #Yaaram | (2019) | 110 min | Comedy, Romance | 4.4 | 35.0 | Ovais Khan | Prateik | Ishita Raj | Siddhant Kapoor |
| 4 | ...And Once Again | (2010) | 105 min | Drama | | | Amol Palekar | Rajat Kapoor | Rituparna Sengupta | Antara Mali |


- Convert the `Rating` column to numeric with `errors='coerce'`, so the blank placeholders left by `fillna(' ')` become `NaN`, then drop those rows.

In [11]:
data['Rating'] = pd.to_numeric(data['Rating'], errors='coerce')
data = data.dropna(subset=['Rating'])
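The coercion step can be illustrated on a tiny column. This is a minimal sketch with made-up values (the `ratings` series is hypothetical, not taken from the dataset); it only shows how `errors='coerce'` and `dropna` interact.

```python
import pandas as pd

# Hypothetical toy column: one valid rating, one blank placeholder, one stray string
ratings = pd.Series(['7.0', ' ', 'n/a'])

# Entries that cannot be parsed as numbers become NaN instead of raising an error
numeric = pd.to_numeric(ratings, errors='coerce')

# Dropping NaN leaves only the rows that carried a real rating
cleaned = numeric.dropna()

print(numeric.tolist())
print(cleaned.tolist())
```

Only the first entry survives, which mirrors how movies without a rating are removed before training.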

## **🧑‍🎤 Step 3: Actor-based Feature Engineering**
- Combine `Actor 1`, `Actor 2`, and `Actor 3` into a list.

In [12]:
data['actors'] = data[['Actor 1', 'Actor 2', 'Actor 3']].values.tolist()

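- Explode the actor lists into one row per (movie, actor) pair and calculate each actor's average movie rating.
- Store the per-actor averages in a dictionary, then assign each movie the mean of its three actors' averages.
- Missing actor names were filled with `' '` in Step 2, so they are averaged together under that single key.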

In [13]:
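# One row per (movie, actor) pair, then the mean rating of each actor's movies
exploded = data.explode('actors')
avg_actor_rating = exploded.groupby('actors')['Rating'].mean().to_dict()
# A movie's feature is the mean of its actors' averages (0 for an unknown actor)
data['avg_actor_rating'] = data['actors'].apply(
    lambda actor_list: sum(avg_actor_rating.get(a, 0) for a in actor_list) / len(actor_list)
)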
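To make the explode-and-average logic concrete, here is a small worked example. The `toy` frame and the actor names `A`, `B`, `C` are hypothetical; the operations are the same ones used in the cell above.

```python
import pandas as pd

# Hypothetical toy data: three movies with two actors each
toy = pd.DataFrame({
    'Rating': [8.0, 6.0, 4.0],
    'actors': [['A', 'B'], ['A', 'C'], ['B', 'C']],
})

# One row per (movie, actor) pair, then the mean rating per actor
avg = toy.explode('actors').groupby('actors')['Rating'].mean().to_dict()
# A: (8 + 6) / 2 = 7.0, B: (8 + 4) / 2 = 6.0, C: (6 + 4) / 2 = 5.0

# Each movie gets the mean of its actors' averages
toy['avg_actor_rating'] = toy['actors'].apply(
    lambda names: sum(avg.get(n, 0) for n in names) / len(names)
)
# movie 0: (7 + 6) / 2 = 6.5, movie 1: (7 + 5) / 2 = 6.0, movie 2: (6 + 5) / 2 = 5.5

print(avg)
print(toy['avg_actor_rating'].tolist())
```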

## **🎬 Step 4: Feature Engineering - Director**
- Group by `Director` and compute their average movie rating.
- Map these average values to each movie as a new feature.

In [None]:
avg_director_rating = data.groupby('Director')['Rating'].mean().to_dict()
data['avg_director_rating'] = data['Director'].map(avg_director_rating)

In [21]:
data.head()

| | Name | Year | Duration | Genre | Rating | Votes | Director | Actor 1 | Actor 2 | Actor 3 | actors | avg_actor_rating | avg_director_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | #Gadhvi (He thought he was Gandhi) | (2019) | 109 min | Drama | 7.0 | 8 | Gaurav Bakshi | Rasika Dugal | Vivek Ghamande | Arvind Jangid | [Rasika Dugal, Vivek Ghamande, Arvind Jangid] | 6.855556 | 7.0 |
| 3 | #Yaaram | (2019) | 110 min | Comedy, Romance | 4.4 | 35 | Ovais Khan | Prateik | Ishita Raj | Siddhant Kapoor | [Prateik, Ishita Raj, Siddhant Kapoor] | 4.838889 | 4.4 |
| 5 | ...Aur Pyaar Ho Gaya | (1997) | 147 min | Comedy, Drama, Musical | 4.7 | 827 | Rahul Rawail | Bobby Deol | Aishwarya Rai Bachchan | Shammi Kapoor | [Bobby Deol, Aishwarya Rai Bachchan, Shammi Ka... | 5.752446 | 5.358824 |
| 6 | ...Yahaan | (2005) | 142 min | Drama, Romance, War | 7.4 | 1086 | Shoojit Sircar | Jimmy Sheirgill | Minissha Lamba | Yashpal Sharma | [Jimmy Sheirgill, Minissha Lamba, Yashpal Sharma] | 5.883036 | 7.5 |
| 8 | ?: A Question Mark | (2012) | 82 min | Horror, Mystery, Thriller | 5.6 | 326 | Allyson Patel | Yash Dave | Muntazir Ahmad | Kiran Bhatia | [Yash Dave, Muntazir Ahmad, Kiran Bhatia] | 5.662121 | 5.6 |


## **🔤 Step 5: Genre One-Hot Encoding**
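- Perform one-hot encoding on the `Genre` column with `pd.get_dummies`.
- Each distinct `Genre` string (for example `Comedy, Romance`) becomes its own binary column, so genre combinations are encoded as a whole rather than per individual genre.
- Load the `all-MiniLM-L6-v2` sentence-embedding model; it is not part of the feature matrix built in Step 6 and is only saved in Step 10.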

In [None]:
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
genre_ohe = pd.get_dummies(data['Genre'])
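`pd.get_dummies(data['Genre'])` treats each full `Genre` string as one category. If one column per individual genre is preferred, a possible alternative (not what this notebook uses) is `Series.str.get_dummies`, which splits on a separator first. The sketch below compares the two on a hypothetical three-row series, assuming genres are separated by `', '` as in the sample output.

```python
import pandas as pd

# Hypothetical toy genre column in the same "A, B" format as the dataset
genres = pd.Series(['Drama', 'Comedy, Romance', 'Drama, Romance'])

# Whole-string encoding: each distinct combination is its own column
whole = pd.get_dummies(genres)

# Multi-label encoding: split on ', ' and create one column per genre
per_genre = genres.str.get_dummies(sep=', ')

print(whole.columns.tolist())      # combinations as categories
print(per_genre.columns.tolist())  # individual genres as categories
print(per_genre.values.tolist())
```

With the multi-label form, a movie tagged `Drama, Romance` shares the `Drama` column with pure dramas, which may help when many combinations are rare.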

## **Step 6: Feature Matrix and Target Setup**
- Combine one-hot encoded genres, average actor rating, and average director rating into a single feature matrix `X`.
- Set the target variable `y` to the `Rating` column.

In [None]:
X = np.hstack((
    genre_ohe.values,
    data[['avg_actor_rating', 'avg_director_rating']].values
))

y = data['Rating']

## **🔀 Step 7: Train-Test Split**
Split the dataset into training and testing sets (80% train, 20% test) using `train_test_split`.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
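One caveat worth keeping in mind: the actor and director averages in Steps 3 and 4 are computed from all rows before this split, so a test movie's own rating contributes to its features (in the `data.head()` output, some rows have `avg_director_rating` equal to `Rating`). A possible alternative, sketched below on hypothetical toy data, is to split first and compute the averages from training rows only. The helper name `director_means_from_train` and the fallback to the overall training mean are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def director_means_from_train(train, test, col='Director', target='Rating'):
    """Return (train_feature, test_feature) built from training statistics only."""
    means = train.groupby(col)[target].mean()
    fallback = train[target].mean()  # assumption: used for directors unseen in training
    train_feat = train[col].map(means)
    test_feat = test[col].map(means).fillna(fallback)
    return train_feat, test_feat

# Hypothetical toy data; a positional slice stands in for train_test_split
toy = pd.DataFrame({
    'Director': ['D1', 'D1', 'D2', 'D3'],
    'Rating': [8.0, 6.0, 5.0, 9.0],
})
train, test = toy.iloc[:3], toy.iloc[3:]

train_feat, test_feat = director_means_from_train(train, test)
print(train_feat.tolist())  # D1 -> 7.0, D2 -> 5.0
print(test_feat.tolist())   # D3 is unseen, so it gets the training mean
```

Here the test movie's rating of 9.0 never enters its own feature, so the evaluation would reflect how the model behaves on directors it has not seen.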

## **🚀 Step 8: Model Training with XGBoost**
- Initialize the `XGBRegressor` with 300 trees, a learning rate of 0.03, a maximum tree depth of 7, and 80% row and column subsampling per tree.

In [26]:
xgb_model = XGBRegressor(
    n_estimators=300,
    learning_rate=0.03,
    max_depth=7,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

- Train the model on the training data (`X_train`, `y_train`), then predict ratings for the test set (`X_test`).

In [27]:
xgb_model.fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)

## **📈 Step 9: Model Evaluation**
- Compare the test-set predictions (`y_pred`) with the true ratings (`y_test`).
- Evaluate the model using the R² score and the mean squared error (MSE).
- Print both evaluation metrics.

In [28]:
# Evaluation on the held-out test set
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(f'R2 score: {r2:.4f}')
print(f'Mean squared error: {mse:.4f}')

R2 score: 0.7253
Mean squared error: 0.5107
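For readers who want to see what the two metrics measure, the sketch below recomputes them with plain NumPy on hypothetical values (the helper `mse_and_r2` is illustrative and not part of scikit-learn). Taking the square root of the MSE gives the error in rating points; for the MSE of 0.5107 reported above, that is roughly 0.71.

```python
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Compute mean squared error and R² from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    mse = ss_res / len(y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mse, r2

# Hypothetical true and predicted ratings
mse, r2 = mse_and_r2([3.0, 5.0, 7.0], [2.5, 5.0, 8.0])
rmse = np.sqrt(mse)  # same units as the ratings themselves

print(f'MSE: {mse:.4f}, RMSE: {rmse:.4f}, R2: {r2:.4f}')
```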


## **Step 10: Save Trained Models and Features**

In [29]:
artifacts = {
    'movie_rating_model.pkl': xgb_model,
    'sentence_model.pkl': sentence_model,
    'genre_columns_model.pkl': genre_ohe.columns.tolist(),
    'avg_actor_rating.pkl': avg_actor_rating,
    'avg_director_rating.pkl': avg_director_rating,
}
for path, obj in artifacts.items():
    with open(path, 'wb') as f:  # the context manager closes each file after writing
        pickle.dump(obj, f)
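The saved genre columns and lookup dictionaries are what a later script would need to rebuild a feature row in the same layout as `X`. The following is a rough sketch of that step, not code from the notebook: `build_features` is a hypothetical helper, the lookups are small stand-ins for the unpickled artifacts, and using `default=0.0` for an unseen director is an assumption (the training code only defines a default of 0 for unseen actors).

```python
import numpy as np

def build_features(genre, actors, director, genre_columns,
                   actor_avg, director_avg, default=0.0):
    """Assemble one row: genre one-hot, mean actor average, director average."""
    # One-hot over the saved genre columns (whole genre string, as in training)
    genre_vec = [1.0 if col == genre else 0.0 for col in genre_columns]
    # Mean of the actors' stored averages; unknown actors count as `default`
    actor_feat = sum(actor_avg.get(a, default) for a in actors) / len(actors)
    # Assumption: unknown directors also fall back to `default`
    director_feat = director_avg.get(director, default)
    return np.array([genre_vec + [actor_feat, director_feat]])

# Hypothetical stand-ins for the artifacts loaded with pickle.load
genre_columns = ['Comedy, Romance', 'Drama']
actor_avg = {'A': 6.0, 'B': 7.0}
director_avg = {'D': 6.5}

row = build_features('Drama', ['A', 'B', 'Unknown'], 'D',
                     genre_columns, actor_avg, director_avg)
print(row.shape)  # one row, len(genre_columns) + 2 columns
```

The resulting array has the shape expected by the trained model's `predict` method, provided `genre_columns` is the list saved in `genre_columns_model.pkl`.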