### 🎬 **Project Title**: Movie Rating Prediction with Python

### **Objective**
The aim of this project is simple: **predict the rating of a movie** using its main features like **genre, director, and actors**. We’ll analyze historical movie data to understand what factors play the biggest role in determining movie ratings, and then build a **machine learning model** to predict these ratings accurately.

### **Why This Project?**
In today’s movie world, ratings carry real weight: they guide audiences on what to watch and help studios understand what’s working. By predicting movie ratings, we can see which aspects (like a movie’s genre or its cast) tend to lead to higher or lower ratings. This project will not only improve our data science skills but also give us insight into the movie industry’s rating trends.

### **Key Parts of the Project**

1. **Data Analysis**: First, we’ll **explore the movie data**, checking features like genre, director, and actors. This will give us a good understanding of what the data looks like and which features might be important for ratings.

2. **Data Preprocessing**: This is the step where we prepare the data to make it clean and useful. We’ll handle missing values, format the data correctly, and encode any text data (like the genre or director) so the machine learning model can understand it.

3. **Feature Engineering**: Here, we’ll refine the data further by creating new features if needed or by transforming the current features to make them more effective for our prediction model.

4. **Model Building with Regression**: Using machine learning, we’ll apply **regression techniques** – these are methods that help us make continuous predictions, like predicting ratings. We’ll test different models and choose the best one for accurate results.

5. **Insights and Analysis**: Finally, we’ll review our results to see what factors most affect movie ratings. This is where we’ll understand trends and patterns, like if certain genres score higher or if specific directors or actors influence ratings.
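To make these steps concrete, here is a minimal sketch of the encode, fit, and inspect flow on a handful of invented rows. It is not the model built later in this project: the toy ratings are made up, only the genre is used as a feature, and plain least squares from NumPy stands in for whichever regression technique ends up working best.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; they are NOT taken from the real dataset.
toy = pd.DataFrame({
    "Genre": ["Drama", "Drama, Musical", "Comedy, Romance",
              "Comedy", "Drama, Romance", "Comedy, Drama"],
    "Rating": [7.0, 6.5, 4.4, 5.0, 6.0, 5.8],
})

# Preprocessing / encoding: one 0/1 indicator column per genre label.
# A film tagged "Drama, Musical" gets a 1 in both the Drama and Musical columns.
X = toy["Genre"].str.get_dummies(sep=", ")

# Model building: ordinary least squares with an intercept column of ones.
A = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
coef, *_ = np.linalg.lstsq(A, toy["Rating"].to_numpy(), rcond=None)

# Insights: each coefficient is the fitted rating shift associated with a genre.
effects = pd.Series(coef[1:], index=X.columns)
predicted = A @ coef

print(effects)
print(predicted.round(2))
```

With real data we would also hold out a test set and compare several models, but the shape of the workflow stays the same.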

### **Expected Outcome**
By the end of this project, we’ll have a working model that can estimate movie ratings based on features like genre and cast. Plus, we’ll gain insight into what truly influences movie ratings, which is useful for anyone interested in data science, movies, or both!

#### Before jumping into the actual model building, let’s first understand our dataset, brothers and sisters! This step is super important because, just like in cricket, if you don’t understand the pitch (the dataset), you can’t play the game (build the model) well. So, let’s take this step by step:

In Python, libraries are like specialized toolkits. Each library has its own unique tools that make data analysis, visualization, and modeling easier. So, let’s import the key ones we’ll need for this project:

In [6]:
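import pandas as pd  # tabular data loading and manipulation
import numpy as np   # numerical arrays and math helpers
import warnings

# Silence all warnings to keep the notebook output tidy
warnings.filterwarnings('ignore')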

### 🛠 Why Importing Libraries First is Important
Importing libraries at the start helps us access all the functions we need without interruptions. It’s like setting up your workspace with all the tools in place – ready to tackle each step smoothly.

In [7]:
dataset_path = r'C:\Users\abhis\Documents\GitHub\walmart sales forecasting\MovieMind-Rating-Prediction\Data\IMDb Movies India.csv'
movie_data = pd.read_csv(dataset_path, encoding='ISO-8859-1')

Display the first 5 rows of the dataset to verify that the data loaded correctly

In [8]:
print("First 5 Rows of the Movie Dataset:")
print(movie_data.head())

First 5 Rows of the Movie Dataset:
                                 Name    Year Duration            Genre  \
0                                         NaN      NaN            Drama   
1  #Gadhvi (He thought he was Gandhi)  (2019)  109 min            Drama   
2                         #Homecoming  (2021)   90 min   Drama, Musical   
3                             #Yaaram  (2019)  110 min  Comedy, Romance   
4                   ...And Once Again  (2010)  105 min            Drama   

   Rating Votes            Director       Actor 1             Actor 2  \
0     NaN   NaN       J.S. Randhawa      Manmauji              Birbal   
1     7.0     8       Gaurav Bakshi  Rasika Dugal      Vivek Ghamande   
2     NaN   NaN  Soumyajit Majumdar  Sayani Gupta   Plabita Borthakur   
3     4.4    35          Ovais Khan       Prateik          Ishita Raj   
4     NaN   NaN        Amol Palekar  Rajat Kapoor  Rituparna Sengupta   

           Actor 3  
0  Rajendra Bhatia  
1    Arvind Jangid  
2       Roy 

Get an overview of the dataset's structure and data types

In [9]:
print("\nDataset Overview:")
print(movie_data.info())


Dataset Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15509 non-null  object 
 1   Year      14981 non-null  object 
 2   Duration  7240 non-null   object 
 3   Genre     13632 non-null  object 
 4   Rating    7919 non-null   float64
 5   Votes     7920 non-null   object 
 6   Director  14984 non-null  object 
 7   Actor 1   13892 non-null  object 
 8   Actor 2   13125 non-null  object 
 9   Actor 3   12365 non-null  object 
dtypes: float64(1), object(9)
memory usage: 1.2+ MB
None
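The non-null counts above show that only 7,919 of 15,509 rows have a `Rating`, and `Duration` is filled in for even fewer (7,240). A quick way to quantify this is sketched below on a five-row sample that mirrors the `head()` output. The decision to drop rows without a rating is an assumption made here for illustration (a model cannot learn from rows that lack the target), not necessarily the approach taken later.

```python
import numpy as np
import pandas as pd

# Five rows mirroring the head() output shown earlier (only a few columns kept).
sample = pd.DataFrame({
    "Year": [np.nan, "(2019)", "(2021)", "(2019)", "(2010)"],
    "Genre": ["Drama", "Drama", "Drama, Musical", "Comedy, Romance", "Drama"],
    "Rating": [np.nan, 7.0, np.nan, 4.4, np.nan],
})

# Share of missing values per column, as a fraction between 0 and 1.
missing_share = sample.isna().mean()
print(missing_share)

# Assumption: rows without a Rating cannot be used for training, so drop them.
labelled = sample.dropna(subset=["Rating"])
print(len(labelled), "rows have a rating")
```

On the full dataset, the same two calls would be applied to `movie_data` instead of `sample`.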


Show a statistical summary for numerical columns

In [10]:
print("\nStatistical Summary of Numerical Data:")
print(movie_data.describe())


Statistical Summary of Numerical Data:
            Rating
count  7919.000000
mean      5.841621
std       1.381777
min       1.100000
25%       4.900000
50%       6.000000
75%       6.800000
max      10.000000
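Notice that `describe()` only reports `Rating`. That is because `Year`, `Duration`, and `Votes` were loaded as text (`object`), with values like `(2019)` and `109 min`. One possible way to turn them into numbers is sketched below. The helper name `clean_numeric_columns` is hypothetical, and the last sample row (including the comma-separated vote count) is invented to show how a thousands separator could be handled.

```python
import numpy as np
import pandas as pd

# First five rows mirror the head() output shown earlier.
# The sixth row is hypothetical, added to show a vote count with a comma.
sample = pd.DataFrame({
    "Year": [np.nan, "(2019)", "(2021)", "(2019)", "(2010)", "(2005)"],
    "Duration": [np.nan, "109 min", "90 min", "110 min", "105 min", "142 min"],
    "Votes": [np.nan, "8", np.nan, "35", np.nan, "1,086"],
})

def clean_numeric_columns(df):
    """Return a copy with Year, Duration and Votes converted to numbers."""
    out = df.copy()
    # "(2019)" -> 2019.0: pull out the four-digit year.
    year = out["Year"].str.extract(r"(\d{4})", expand=False)
    out["Year"] = pd.to_numeric(year, errors="coerce")
    # "109 min" -> 109.0: keep the first run of digits.
    minutes = out["Duration"].str.extract(r"(\d+)", expand=False)
    out["Duration"] = pd.to_numeric(minutes, errors="coerce")
    # "1,086" -> 1086.0: strip commas; anything still non-numeric becomes NaN.
    votes = out["Votes"].str.replace(",", "", regex=False)
    out["Votes"] = pd.to_numeric(votes, errors="coerce")
    return out

cleaned = clean_numeric_columns(sample)
print(cleaned.dtypes)
```

After a conversion like this, `describe()` would summarize all three columns alongside `Rating`.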


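Note the `encoding` argument passed to `pd.read_csv()` above. Reading this CSV file with the default UTF-8 encoding can fail with a `UnicodeDecodeError`. Some files use encodings other than UTF-8, which causes this type of error in Pandas.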

### Explanation of the Fix

- **Encoding Change (`encoding='ISO-8859-1'`)**:
  - We specify `ISO-8859-1` encoding in `pd.read_csv()`. This encoding is often compatible with CSV files that contain special characters, particularly those from non-UTF-8 encoded sources.

This should resolve the UnicodeDecodeError, allowing the data to load without issues.
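To see why the encoding matters, here is a small standard-library sketch. The CSV bytes are invented for illustration, and `decode_with_fallback` is a hypothetical helper, not a Pandas function. It tries UTF-8 first and only falls back to ISO-8859-1 if decoding fails.

```python
# Invented sample: "é" is the single byte 0xE9 in ISO-8859-1,
# and that byte on its own is not valid UTF-8.
raw = "Name,Rating\nCafé,7.0\n".encode("iso-8859-1")

def decode_with_fallback(raw_bytes, encodings=("utf-8", "iso-8859-1")):
    """Try each encoding in order and return (text, encoding_used)."""
    for name in encodings:
        try:
            return raw_bytes.decode(name), name
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes; try the next
    raise ValueError("none of the candidate encodings could decode the data")

text, used = decode_with_fallback(raw)
print(used)
print(text)
```

One caveat: ISO-8859-1 maps every possible byte to some character, so it never raises an error. A successful load therefore does not prove the file was really written in that encoding, and it is worth checking a few names with accented characters after loading.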