# 🎬 Movie Success Prediction Project: Analysis Notebook

## 1. Project Setup and Data Ingestion

The goal of this project is to analyze the factors that contribute to a movie's financial and critical success, using a synthetic dataset of approximately one million movies. We first perform Exploratory Data Analysis (EDA) and then train a regression model to predict a movie's global box office revenue.

### Library Imports and Environment Setup
We import the standard Python data science libraries for data manipulation, numerical computing, and visualization.

In [83]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Configure plots to display inline
%matplotlib inline

### Data Loading and Inspection

In [84]:
file_path = 'movies_dataset.csv'
df = pd.read_csv(file_path)
    
print("Dataset loaded successfully.")
print(f"Shape: {df.shape[0]} rows (records) and {df.shape[1]} columns (features).\n")
df.head()



Dataset loaded successfully.
Shape: 999999 rows (records) and 17 columns (features).



|   | MovieID | Title | Genre | ReleaseYear | ReleaseDate | Country | BudgetUSD | US_BoxOfficeUSD | Global_BoxOfficeUSD | Opening_Day_SalesUSD | One_Week_SalesUSD | IMDbRating | RottenTomatoesScore | NumVotesIMDb | NumVotesRT | Director | LeadActor |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Might toward capital | Comedy | 2003 | 28-09-2003 | China | 6577427.79 | 6613685.82 | 15472035.66 | 1778530.85 | 3034053.32 | 6.2 | 58 | 7865 | 10596 | Kristina Moore | Brian Mccormick |
| 1 | 2 | He however experience | Comedy | 1988 | 14-02-1988 | USA | 1883810.1 | 1930949.15 | 3637731.12 | 247115.74 | 831828.84 | 5.2 | 44 | 1708 | 220 | Benjamin Hudson | Ashley Pena |
| 2 | 3 | Star responsibility politics | Comedy | 1971 | 02-11-1971 | USA | 2468079.29 | 4186694.69 | 7165111.24 | 878453.95 | 2171405.93 | 5.5 | 55 | 4678 | 7805 | Kayla Young | Alexander Haley |
| 3 | 4 | Exactly live | Comedy | 1998 | 06-08-1998 | USA | 1447311.46 | 2023683.92 | 4373820.26 | 570657.72 | 898886.01 | 7.3 | 87 | 2467 | 1751 | Michael Ross | Patrick Barnett |
| 4 | 5 | Focus improve especially | Documentary | 2021 | 17-12-2021 | India | 900915.86 | 2129629.1 | 3113017.38 | 361189.37 | 861775.91 | 6.1 | 67 | 5555 | 697 | Faith Franklin | Duane Fletcher DDS |
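
With roughly one million rows, the memory footprint of the string columns can matter. The sketch below shows one way to declare column types at load time. It assumes that `Genre` and `Country` have few distinct values, which this notebook has not verified, and it reads a two-row inline sample (`sample_csv`, a hypothetical stand-in) instead of `movies_dataset.csv`.

```python
import io
import pandas as pd

# Two rows copied from the preview above, with a subset of the columns.
# sample_csv is an illustrative stand-in for the real file.
sample_csv = io.StringIO(
    "MovieID,Title,Genre,ReleaseYear,Country,BudgetUSD,Global_BoxOfficeUSD\n"
    "1,Might toward capital,Comedy,2003,China,6577427.79,15472035.66\n"
    "2,He however experience,Comedy,1988,USA,1883810.1,3637731.12\n"
)

# Assumption: Genre and Country are low-cardinality, so 'category' is a good fit.
sample = pd.read_csv(
    sample_csv,
    dtype={"Genre": "category", "Country": "category"},
)

print(sample.dtypes)
# deep=True counts the actual string payloads, not just the object pointers
print(sample.memory_usage(deep=True))
```

The same `dtype` mapping could be passed to the `pd.read_csv(file_path)` call used in this notebook. Whether it saves memory depends on how many distinct genres and countries the full dataset contains.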


In [85]:
print("\n--- Data Information (Data Types and Non-Null Counts) ---")
df.info()


--- Data Information (Data Types and Non-Null Counts) ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999999 entries, 0 to 999998
Data columns (total 17 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   MovieID               999999 non-null  int64  
 1   Title                 999999 non-null  object 
 2   Genre                 999999 non-null  object 
 3   ReleaseYear           999999 non-null  int64  
 4   ReleaseDate           999999 non-null  object 
 5   Country               999999 non-null  object 
 6   BudgetUSD             999999 non-null  float64
 7   US_BoxOfficeUSD       999999 non-null  float64
 8   Global_BoxOfficeUSD   999999 non-null  float64
 9   Opening_Day_SalesUSD  999999 non-null  float64
 10  One_Week_SalesUSD     999999 non-null  float64
 11  IMDbRating            999999 non-null  float64
 12  RottenTomatoesScore   999999 non-null  int64  
 13  NumVotesIMDb          999999 non-null  int64 

In [86]:
df.isna().sum()

MovieID                 0
Title                   0
Genre                   0
ReleaseYear             0
ReleaseDate             0
Country                 0
BudgetUSD               0
US_BoxOfficeUSD         0
Global_BoxOfficeUSD     0
Opening_Day_SalesUSD    0
One_Week_SalesUSD       0
IMDbRating              0
RottenTomatoesScore     0
NumVotesIMDb            0
NumVotesRT              0
Director                0
LeadActor               0
dtype: int64

In [87]:
df.duplicated().sum()

np.int64(0)
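
The missing-value and duplicate checks above can be bundled into one reusable summary. The following is a minimal sketch: `quality_report` is a hypothetical helper name, and the `toy` frame is invented purely to exercise it.

```python
import pandas as pd

def quality_report(frame: pd.DataFrame) -> dict:
    """Count rows, missing cells, and fully duplicated rows (hypothetical helper)."""
    return {
        "rows": len(frame),
        "missing_cells": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
    }

# Invented toy data: one missing title and one exact duplicate row
toy = pd.DataFrame({
    "Title": ["A", "B", "B", None],
    "BudgetUSD": [1.0, 2.0, 2.0, 3.0],
})

report = quality_report(toy)
print(report)
```

Calling `quality_report(df)` on the movie dataset should report zero missing cells and zero duplicate rows, matching the two cell outputs above.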

### Data Type Correction and Feature Engineering

Since the initial checks found no missing values and no duplicate rows, we can skip imputation and deduplication. We now proceed with two preparation steps:

1.  **Date Conversion:** Parsing `ReleaseDate` from a day-first string (`DD-MM-YYYY`) into a datetime to validate it. The column is dropped afterwards, because the rest of the analysis uses only the existing `ReleaseYear` column.
2.  **Feature Engineering:** Creating the key financial feature, **Profit**, by subtracting `BudgetUSD` from `Global_BoxOfficeUSD`.
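
Both steps can be illustrated on a small frame. This sketch uses two rows taken from the preview earlier in the notebook. The explicit `format` argument is a suggestion rather than something the original cell used, and `toy` is a hypothetical name.

```python
import pandas as pd

# Values copied from rows 0 and 2 of the preview
toy = pd.DataFrame({
    "ReleaseDate": ["28-09-2003", "02-11-1971"],
    "BudgetUSD": [6577427.79, 2468079.29],
    "Global_BoxOfficeUSD": [15472035.66, 7165111.24],
})

# An explicit day-first format keeps 02-11-1971 from being read as February 11
toy["ReleaseDate"] = pd.to_datetime(toy["ReleaseDate"], format="%d-%m-%Y")

# Profit = global box office revenue minus production budget
toy["ProfitUSD"] = toy["Global_BoxOfficeUSD"] - toy["BudgetUSD"]

print(toy["ReleaseDate"].dt.month.tolist())  # [9, 11]
print(toy["ProfitUSD"].round(2).tolist())
```

Without an explicit format, a string such as `02-11-1971` is ambiguous between 2 November and 11 February, so stating the format removes the guesswork.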

In [88]:
# Dates are stored day-first (DD-MM-YYYY), so pass the format explicitly
df['ReleaseDate'] = pd.to_datetime(df['ReleaseDate'], format='%d-%m-%Y')
print("ReleaseDate corrected and validated as a datetime object.")

# Profit = global box office revenue minus production budget
df['ProfitUSD'] = df['Global_BoxOfficeUSD'] - df['BudgetUSD']
print("New feature 'ProfitUSD' calculated successfully.")

# MovieID is only an identifier; ReleaseDate is redundant with ReleaseYear
df.drop(columns=['MovieID', 'ReleaseDate'], inplace=True)
print("Dropped redundant columns: 'MovieID' and 'ReleaseDate'.")


ReleaseDate corrected and validated as a datetime object.
New feature 'ProfitUSD' calculated successfully.
Dropped redundant columns: 'MovieID' and 'ReleaseDate'.


In [89]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999999 entries, 0 to 999998
Data columns (total 16 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Title                 999999 non-null  object 
 1   Genre                 999999 non-null  object 
 2   ReleaseYear           999999 non-null  int64  
 3   Country               999999 non-null  object 
 4   BudgetUSD             999999 non-null  float64
 5   US_BoxOfficeUSD       999999 non-null  float64
 6   Global_BoxOfficeUSD   999999 non-null  float64
 7   Opening_Day_SalesUSD  999999 non-null  float64
 8   One_Week_SalesUSD     999999 non-null  float64
 9   IMDbRating            999999 non-null  float64
 10  RottenTomatoesScore   999999 non-null  int64  
 11  NumVotesIMDb          999999 non-null  int64  
 12  NumVotesRT            999999 non-null  int64  
 13  Director              999999 non-null  object 
 14  LeadActor             999999 non-null  object 
 15  