### Movie Rating Prediction 

> ### Importing libraries we need for data handling, visualization, and modeling

Before starting the analysis and model building, we load the Python libraries the project depends on: pandas and NumPy for data handling, seaborn and Matplotlib for visualization, and scikit-learn for modeling and evaluation.

In [8]:
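import pandas as pd                # tabular data handling
import numpy as np                 # numerical arrays
import seaborn as sns              # statistical plots
import matplotlib.pyplot as plt    # base plotting API
from sklearn.model_selection import train_test_split      # train/test splitting
from sklearn.linear_model import LinearRegression         # regression model
from sklearn.metrics import mean_squared_error, r2_score  # evaluation metrics
import warnings
warnings.filterwarnings('ignore')  # suppress all warnings in the notebook output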

> ### Loading the IMDb dataset with an explicit encoding

In [13]:
imdb_data = pd.read_csv(r"C:\Users\Lenovo\Documents\GitHub\TheUltimate pandas bootcamp\MovieMind-Rating-Prediction\Data\IMDb Movies India.csv", encoding='ISO-8859-1')
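To illustrate why the `encoding` argument matters, here is a minimal sketch using a small in-memory CSV rather than the real file. The sample row and the variable names (`raw_bytes`, `sample`, `utf8_ok`) are hypothetical; the point is only that bytes written as ISO-8859-1 may fail to decode as UTF-8.

```python
import io
import pandas as pd

# Hypothetical two-line sample; "é" is the single byte 0xE9 in ISO-8859-1,
# which is not a valid UTF-8 sequence on its own.
raw_bytes = "Name,Rating\nCafé,6.5\n".encode("ISO-8859-1")

# Attempt 1: decode as UTF-8 and record whether it works.
try:
    pd.read_csv(io.BytesIO(raw_bytes), encoding="utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Attempt 2: decode with the encoding the bytes were actually written in.
sample = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-1")

print("UTF-8 decoding succeeded:", utf8_ok)
print(sample)
```

If a file raises `UnicodeDecodeError` with the default encoding, trying `ISO-8859-1` as done above is a common first fallback, though it does not guarantee that every character is interpreted as the author intended.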

> ### Displaying the first few rows of the dataset

Let’s look at a few rows from the top and bottom of the dataset to get an initial sense of the data format, structure, and any obvious irregularities.

In [14]:
imdb_data.head()

| | Name | Year | Duration | Genre | Rating | Votes | Director | Actor 1 | Actor 2 | Actor 3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | Drama | NaN | NaN | J.S. Randhawa | Manmauji | Birbal | Rajendra Bhatia |
| 1 | #Gadhvi (He thought he was Gandhi) | (2019) | 109 min | Drama | 7.0 | 8.0 | Gaurav Bakshi | Rasika Dugal | Vivek Ghamande | Arvind Jangid |
| 2 | #Homecoming | (2021) | 90 min | Drama, Musical | NaN | NaN | Soumyajit Majumdar | Sayani Gupta | Plabita Borthakur | Roy Angana |
| 3 | #Yaaram | (2019) | 110 min | Comedy, Romance | 4.4 | 35.0 | Ovais Khan | Prateik | Ishita Raj | Siddhant Kapoor |
| 4 | ...And Once Again | (2010) | 105 min | Drama | NaN | NaN | Amol Palekar | Rajat Kapoor | Rituparna Sengupta | Antara Mali |


In [16]:
imdb_data.tail()

| | Name | Year | Duration | Genre | Rating | Votes | Director | Actor 1 | Actor 2 | Actor 3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 15504 | Zulm Ko Jala Doonga | (1988) | NaN | Action | 4.6 | 11.0 | Mahendra Shah | Naseeruddin Shah | Sumeet Saigal | Suparna Anand |
| 15505 | Zulmi | (1999) | 129 min | Action, Drama | 4.5 | 655.0 | Kuku Kohli | Akshay Kumar | Twinkle Khanna | Aruna Irani |
| 15506 | Zulmi Raj | (2005) | NaN | Action | NaN | NaN | Kiran Thej | Sangeeta Tiwari | NaN | NaN |
| 15507 | Zulmi Shikari | (1988) | NaN | Action | NaN | NaN | NaN | NaN | NaN | NaN |
| 15508 | Zulm-O-Sitam | (1998) | 130 min | Action, Drama | 6.2 | 20.0 | K.C. Bokadia | Dharmendra | Jaya Prada | Arjun Sarja |


Sometimes column names can contain extra spaces, unusual characters, or inconsistent formatting. Let’s check the column names to make sure they’re clean and easy to work with.

In [19]:
imdb_data.columns

Index(['Name', 'Year', 'Duration', 'Genre', 'Rating', 'Votes', 'Director',
       'Actor 1', 'Actor 2', 'Actor 3'],
      dtype='object')

In [20]:
# Stripping any leading/trailing spaces from column names
imdb_data.columns = imdb_data.columns.str.strip()

In [21]:
imdb_data.columns

Index(['Name', 'Year', 'Duration', 'Genre', 'Rating', 'Votes', 'Director',
       'Actor 1', 'Actor 2', 'Actor 3'],
      dtype='object')
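In this dataset the column index is identical before and after stripping, so the step is a no-op here. To show what it would catch, the sketch below applies the same `str.strip()` call to a hypothetical frame (`messy`) whose headers do carry stray spaces, and lists which names changed.

```python
import pandas as pd

# Hypothetical frame whose headers carry leading/trailing whitespace.
messy = pd.DataFrame({" Name": ["A"], "Year ": ["(2019)"], " Actor 1 ": ["B"]})

# Record the headers before and after stripping.
before = list(messy.columns)
messy.columns = messy.columns.str.strip()
after = list(messy.columns)

# Keep only the (old, new) pairs where stripping made a difference.
changed = [(old, new) for old, new in zip(before, after) if old != new]
print(changed)
```

Note that `str.strip()` only removes whitespace at the ends of each name; the interior space in `Actor 1` is preserved.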

Duplicate entries can skew our analysis and model accuracy. Let’s check if there are any duplicate rows.

In [22]:
duplicates = imdb_data.duplicated().sum()
print("Number of duplicate rows:", duplicates)

Number of duplicate rows: 6


> ### Dropping duplicate rows if any are found

In [23]:
imdb_data = imdb_data.drop_duplicates()
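As a small illustration of what counts as a duplicate, the sketch below runs the same two calls on a hypothetical four-row frame (`toy`). By default, `duplicated()` flags a row only when every column matches an earlier row, and `drop_duplicates()` keeps the first occurrence. The `reset_index(drop=True)` call is an optional extra that the notebook itself does not use.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical; rows 2 and 3 share a name
# but differ in year, so they are not duplicates of each other.
toy = pd.DataFrame({
    "Name": ["Film A", "Film A", "Film B", "Film B"],
    "Year": ["(2001)", "(2001)", "(2005)", "(2006)"],
})

# Count rows that exactly repeat an earlier row.
n_dupes = toy.duplicated().sum()

# Drop the repeats and renumber the index so it has no gaps.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(toy), len(deduped))
```

Without the index reset, the surviving rows keep their original labels, which is why the index in a deduplicated frame can end above its row count.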

Next, let’s inspect each column. The number of unique values and the data type show how the numeric and categorical columns are stored and which ones may need cleaning.

In [25]:
for column in imdb_data.columns:
    unique_values = imdb_data[column].nunique()
    data_type = imdb_data[column].dtype
    print(f"Column '{column}' has {unique_values} unique values and is of type {data_type}.")

Column 'Name' has 13838 unique values and is of type object.
Column 'Year' has 102 unique values and is of type object.
Column 'Duration' has 182 unique values and is of type object.
Column 'Genre' has 485 unique values and is of type object.
Column 'Rating' has 84 unique values and is of type float64.
Column 'Votes' has 2034 unique values and is of type object.
Column 'Director' has 5938 unique values and is of type object.
Column 'Actor 1' has 4718 unique values and is of type object.
Column 'Actor 2' has 4891 unique values and is of type object.
Column 'Actor 3' has 4820 unique values and is of type object.
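The output above shows `Year`, `Duration`, and `Votes` stored as `object` even though they represent numbers. One possible way to convert them is sketched below on a hypothetical frame (`toy`) that mimics the formats seen in `head()`, such as `(2019)` and `109 min`. The comma in `"1,086"` is an assumption about why `Votes` is not numeric; the real column may contain other patterns.

```python
import pandas as pd

# Hypothetical rows mimicking the formats seen in head().
# The thousands separator in Votes is an assumption, not confirmed by the data.
toy = pd.DataFrame({
    "Year": ["(2019)", "(2021)", None],
    "Duration": ["109 min", None, "90 min"],
    "Votes": ["8", "1,086", None],
})

# Extract the digits, then coerce anything unparseable to NaN.
toy["Year"] = pd.to_numeric(
    toy["Year"].str.extract(r"(\d{4})", expand=False), errors="coerce"
)
toy["Duration"] = pd.to_numeric(
    toy["Duration"].str.extract(r"(\d+)", expand=False), errors="coerce"
)
toy["Votes"] = pd.to_numeric(
    toy["Votes"].str.replace(",", "", regex=False), errors="coerce"
)

print(toy.dtypes)
```

Using `errors="coerce"` turns unexpected strings into `NaN` instead of raising, so it is worth comparing the missing-value count before and after conversion to see how many entries failed to parse.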
