# **MOVIE ANALYSIS PROJECT**
- This project uses the `tn` and `tmdb` datasets to find out what makes a movie successful, by examining genres, revenues, popularity and votes.
The findings can help company heads decide which movies to produce.


In [159]:
#Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## **First DataFrame**

In [160]:
#Load the dataset
df1=pd.read_csv('tn.movie_budgets.csv')

# **Data Understanding**
- Shape, column names
- Missing values
- Sample rows
- Basic statistics

### **Explore the Dataset**


In [161]:
#View the bottom 5 rows
df1.tail()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
5777,78,"Dec 31, 2018",Red 11,"$7,000",$0,$0
5778,79,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495"
5779,80,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338"
5780,81,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0
5781,82,"Aug 5, 2005",My Date With Drew,"$1,100","$181,041","$181,041"


In [162]:
#Structure of the dataset
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In [163]:
#Statistical summary of numeric columns
df1.describe()

Unnamed: 0,id
count,5782.0
mean,50.372363
std,28.821076
min,1.0
25%,25.0
50%,50.0
75%,75.0
max,100.0


In [164]:
#Names of columns
df1.columns

Index(['id', 'release_date', 'movie', 'production_budget', 'domestic_gross',
       'worldwide_gross'],
      dtype='object')

In [165]:
#Shape of dataset
df1.shape

(5782, 6)

# **Data Preparation**
- Handling missing values

- Changing data types

- Renaming columns


### **Data Cleaning**

In [167]:
#Check missing values
df1.isnull()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
...,...,...,...,...,...,...
5777,False,False,False,False,False,False
5778,False,False,False,False,False,False
5779,False,False,False,False,False,False
5780,False,False,False,False,False,False
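
A cell-by-cell Boolean table is hard to scan at this size. As a minimal sketch on a small made-up frame (the `toy` data below is hypothetical, not taken from the dataset), `isnull().sum()` returns one count per column instead:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df1
toy = pd.DataFrame({
    "movie": ["A", "B", None],
    "production_budget": ["$1,000", np.nan, "$3,000"],
})

# One missing-value count per column instead of a True/False table
missing_counts = toy.isnull().sum()
print(missing_counts)
```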


In [168]:
#fill missing values
df1.fillna(0,inplace=True)

In [169]:
#Check data type
df1.dtypes

id                    int64
release_date         object
movie                object
production_budget    object
domestic_gross       object
worldwide_gross      object
dtype: object

In [170]:
#Converting to a real date
df1['release_date']=pd.to_datetime(df1['release_date'])

In [171]:
df1['release_date'].dtype

dtype('<M8[ns]')
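
The budget and gross columns are still `object` because their values carry a `$` sign and thousands separators. One possible way to turn them into numbers is sketched below; `money_to_number` is a hypothetical helper name, and the sample values follow the format shown in the `tail()` output above.

```python
import pandas as pd

def money_to_number(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' and convert to numeric; unparseable values become NaN."""
    cleaned = series.astype(str).str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Values in the same format as the money columns of df1
sample = pd.Series(["$7,000", "$0", "$240,495"])
numbers = money_to_number(sample)
print(numbers.tolist())
```

Calling `pd.to_numeric` directly on such strings with `errors='coerce'` would return `NaN` for each of them, so the stripping step has to come first.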

In [172]:
#Checking for duplicates
df1.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
5777    False
5778    False
5779    False
5780    False
5781    False
Length: 5782, dtype: bool
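
The Boolean Series above has to be read row by row. A short sketch of summarising it as a single number, using a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical frame in which the third row repeats the first
toy = pd.DataFrame({"movie": ["A", "B", "A"], "id": [1, 2, 1]})

# duplicated() marks repeats of an earlier row; sum() counts them
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates()
print(n_duplicates, len(deduped))
```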

In [173]:
#Remove duplicates
df1.drop_duplicates(inplace=True)

In [174]:
#Clean names
# Lowercase the names and remove whitespace and underscores (e.g. 'release_date' -> 'releasedate')
df1.columns = df1.columns.str.strip().str.lower().str.replace('_', '', regex=False)
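
The net effect of this cell is lowercase column names with surrounding whitespace and underscores removed, which is why later cells refer to names such as `productionbudget`. A small check on a made-up `Index` (the names in `cols` are illustrative):

```python
import pandas as pd

# Illustrative column names, including stray whitespace and mixed case
cols = pd.Index([" Release_Date ", "production_budget", "Unnamed: 0"])

# Trim, lowercase, then remove underscores as a literal (non-regex) replacement
cleaned = cols.str.strip().str.lower().str.replace("_", "", regex=False)
print(list(cleaned))
```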



In [175]:
#Save the cleaned dataset
df1.to_csv('tn.csv', index=False)

## **Second DataFrame**

In [176]:
#Load the dataset
df2=pd.read_csv('tmdb.movies.csv')

# **Data Understanding**
- Shape, column names
- Missing values
- Sample rows
- Basic statistics

### **Explore the Dataset**


In [177]:
#View top 5 rows
df2.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [178]:
#View bottom 5 rows
df2.tail()


Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
26512,26512,"[27, 18]",488143,en,Laboratory Conditions,0.6,2018-10-13,Laboratory Conditions,0.0,1
26513,26513,"[18, 53]",485975,en,_EXHIBIT_84xxx_,0.6,2018-05-01,_EXHIBIT_84xxx_,0.0,1
26514,26514,"[14, 28, 12]",381231,en,The Last One,0.6,2018-10-01,The Last One,0.0,1
26515,26515,"[10751, 12, 28]",366854,en,Trailer Made,0.6,2018-06-22,Trailer Made,0.0,1
26516,26516,"[53, 27]",309885,en,The Church,0.6,2018-10-05,The Church,0.0,1


In [179]:
#Structure of the dataset
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB
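
`genre_ids` is stored as `object`: each value is a string such as `"[12, 14, 10751]"`, not a list. Since genres are one of the factors this project sets out to examine, here is a hedged sketch of how those strings could be parsed and counted. The three rows are copied from the `head()` output above; the column name `genre_list` is an assumption made for illustration.

```python
import ast
import pandas as pd

toy = pd.DataFrame({
    "title": ["Harry Potter and the Deathly Hallows: Part 1", "Iron Man 2", "Inception"],
    "genre_ids": ["[12, 14, 10751]", "[12, 28, 878]", "[28, 878, 12]"],
})

# Parse each string into a real Python list of integers
toy["genre_list"] = toy["genre_ids"].apply(ast.literal_eval)

# One row per (movie, genre id), then count how often each id appears
genre_counts = toy.explode("genre_list")["genre_list"].astype(int).value_counts()
print(genre_counts)
```

The ids are TMDB genre codes; mapping them to genre names would need a lookup table that is not part of this notebook.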


In [None]:
#Statistical summary of numeric columns
df2.describe()

In [181]:
#Names of columns
df2.columns

Index(['Unnamed: 0', 'genre_ids', 'id', 'original_language', 'original_title',
       'popularity', 'release_date', 'title', 'vote_average', 'vote_count'],
      dtype='object')

In [None]:
#Shape of dataset
df2.shape

# **Data Preparation**
- Handling missing values

- Changing data types

- Renaming columns
- Concatenate `tn` and `tmdb`

### **Data Cleaning**

In [None]:
#Check missing values
df2.isnull().sum()

In [None]:
#Check data type
df2.dtypes

In [None]:
#Converting to a real date
df2['release_date']=pd.to_datetime(df2['release_date'])

In [None]:
#Checking for duplicates
df2.duplicated()

In [None]:
#Remove duplicates
df2.drop_duplicates(inplace=True)

In [None]:
#Clean names
# Lowercase the names and remove whitespace and underscores (e.g. 'vote_count' -> 'votecount')
df2.columns = df2.columns.str.strip().str.lower().str.replace('_', '', regex=False)


In [None]:
#Save the cleaned dataset
df2.to_csv('tmdb.csv', index=False)


## **Combined DataFrame**

In [None]:
#Load both datasets
tn = pd.read_csv('tn.csv')
tmdb = pd.read_csv('tmdb.csv')

In [None]:
#Combining datasets
to_concat = [tn, tmdb]
df3 = pd.concat(to_concat, ignore_index=True)
df3.to_csv("combined_movies.csv", index=False)
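
`pd.concat` stacks the two tables on top of each other, so every `tn` row has missing values in the `tmdb` columns and the reverse. The sketch below contrasts that with a merge on the movie title, which is an alternative and not what this notebook does. The gross figures are made up, and it assumes titles are spelled identically in both sources, which may not hold for the real data.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
tn_toy = pd.DataFrame({"movie": ["Inception", "Toy Story"],
                       "worldwide_gross": [100, 200]})
tmdb_toy = pd.DataFrame({"title": ["Inception", "Iron Man 2"],
                         "vote_count": [22186, 12368]})

# Stacking: 4 rows, with NaN wherever a column belongs to the other table
stacked = pd.concat([tn_toy, tmdb_toy], ignore_index=True)

# Merging: one row per movie found in both tables, with all columns filled
merged = tn_toy.merge(tmdb_toy, left_on="movie", right_on="title", how="inner")

print(stacked.shape, merged.shape)
```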


### **Data Cleaning**

In [None]:
df3.info()

In [None]:
#clean names
# Lowercase the names and remove whitespace and underscores
df3.columns = df3.columns.str.strip().str.lower().str.replace('_', '', regex=False)


In [None]:
#check missing values
df3.isnull()

In [None]:
# Percentage of missing values
df3.isnull().mean() * 100


In [None]:
#Fill missing values
df3.fillna(0, inplace=True)  # fills every column with 0, text columns included
df3.isnull()
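
Filling everything with `0` also puts the number 0 into text columns such as titles. A possible alternative is to fill by column type, sketched here on a made-up frame; the placeholder `"unknown"` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text column and one numeric column
toy = pd.DataFrame({"title": ["A", None], "popularity": [1.5, np.nan]})

# Split the columns by type
num_cols = toy.select_dtypes(include="number").columns
text_cols = toy.select_dtypes(exclude="number").columns

# Numbers get 0, text gets a readable placeholder
toy[num_cols] = toy[num_cols].fillna(0)
toy[text_cols] = toy[text_cols].fillna("unknown")
print(toy)
```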

In [None]:
#check duplicates
df3.duplicated().sum()

In [None]:
#remove duplicates
df3.drop_duplicates(inplace=True)
df3.duplicated().sum()

In [None]:
# Drop unnamed index columns or redundant metadata
df3.drop(columns=[col for col in df3.columns if 'unnamed' in col], inplace=True)


In [None]:
df3.head()

In [None]:
df3.tail()

In [None]:
df3.columns


# **Data Analysis**
Show filtering, grouping, sorting, and aggregations, e.g.:
- What are the top 5 most voted movies?
- What are the top 10 movies by worldwide profit?

## **Data Manipulation and Analysis**

In [None]:
#Create profit column

money_cols = ['domesticgross', 'worldwidegross', 'productionbudget']
for col in money_cols:
    # Strip '$' and ',' first; otherwise values like "$7,000" are coerced to NaN
    df3[col] = pd.to_numeric(df3[col].astype(str).str.replace(r'[$,]', '', regex=True), errors='coerce')

df3['domestic_profit'] = df3['domesticgross'] - df3['productionbudget']
df3['worldwide_profit'] = df3['worldwidegross'] - df3['productionbudget']

In [None]:
#Top 10 movies by worldwide profit
#Budget and gross figures come from `tn`, whose title column is 'movie'
df3.sort_values('worldwide_profit', ascending=False)[['movie', 'worldwide_profit']].head(10)
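
`DataFrame.nlargest` is a compact alternative to sorting when only the top N rows are needed. A minimal sketch with made-up profit values:

```python
import pandas as pd

# Hypothetical profit figures
toy = pd.DataFrame({
    "movie": ["A", "B", "C", "D"],
    "worldwide_profit": [5, 50, -10, 20],
})

# Keep only the 2 rows with the highest profit, highest first
top2 = toy.nlargest(2, "worldwide_profit")[["movie", "worldwide_profit"]]
print(top2)
```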


In [None]:
#top languages by movie count
df3['originallanguage'].value_counts().head(10)


In [None]:
#Top 5 rated movies among those with at least 100 votes
df3[pd.to_numeric(df3['votecount'], errors='coerce') >= 100].sort_values('voteaverage', ascending=False)[['title', 'voteaverage']].head()

In [None]:
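#Average budget and gross per release year
#Group by the year, not the full date, so each year forms one group
df3.groupby(pd.to_datetime(df3['releasedate'], errors='coerce').dt.year)[['productionbudget', 'worldwidegross']].mean()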
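
To make the per-year grouping concrete, here is a sketch on three made-up rows. Grouping on the extracted year puts both 2010 releases into the same group, whereas grouping on the full date would keep them apart.

```python
import pandas as pd

# Hypothetical rows: two releases in 2010 and one in 1995
toy = pd.DataFrame({
    "releasedate": ["2010-11-19", "2010-03-26", "1995-11-22"],
    "productionbudget": [100.0, 300.0, 50.0],
})

# Extract the year, then average the budget within each year
year = pd.to_datetime(toy["releasedate"], errors="coerce").dt.year
avg_budget_by_year = toy.groupby(year)["productionbudget"].mean()
print(avg_budget_by_year)
```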


In [None]:
#Popular and high rated movies
df3['popularity'] = pd.to_numeric(df3['popularity'], errors='coerce')
df3['voteaverage'] = pd.to_numeric(df3['voteaverage'], errors='coerce')
popular_high_rated = df3[(df3['popularity'] > 30) & (df3['voteaverage'] >= 8)]
popular_high_rated

In [None]:
#sorting data by popularity
df3.sort_values('popularity', ascending=False).head(10)


In [None]:
#Average vote per release year (last 20 years in the data)
df3.groupby(pd.to_datetime(df3['releasedate'], errors='coerce').dt.year)['voteaverage'].mean().sort_index().tail(20)


## **Data Visualization**

### 1. Horizontal Bar

It shows the five movies with the highest vote counts.

In [None]:
#Horizontal bar for the 5 most-voted movies
df3['votecount'] = pd.to_numeric(df3['votecount'], errors='coerce')

top5 = df3.sort_values(by='votecount', ascending=False).head(5)

plt.figure(figsize=(8, 5))
plt.barh(top5['originaltitle'], top5['votecount'], color='orange')
plt.title('Top 5 Movies by Vote Count')
plt.xlabel('Vote Count')
plt.ylabel('Movie Title')
plt.tight_layout()
plt.show()


### 2. Scatter Plot

It shows how a movie's average rating relates to the number of votes it received.

In [None]:
#Scatter plot of vote average against vote count
df_clean = df3[['votecount', 'voteaverage']].dropna()
df_sample = df_clean.tail(100)

plt.figure(figsize=(8, 5))
plt.scatter(df_sample['votecount'], df_sample['voteaverage'], alpha=0.6, color='green')
plt.title('Vote Count vs Vote Average')
plt.xlabel('Vote Count')
plt.ylabel('Vote Average')
plt.grid(True)
plt.tight_layout()
plt.show()

### 3. Box Plot

It shows the distribution of movie popularity in the combined DataFrame.

In [None]:
#Box plot for movie popularity
popularity_clean = df3['popularity'].dropna()

plt.figure(figsize=(6, 5))
plt.boxplot(popularity_clean)
plt.title('Box Plot of Movie Popularity')
plt.ylabel('Popularity')
plt.grid(True)
plt.tight_layout()
plt.show()
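
By default, matplotlib's box plot draws whiskers out to 1.5 times the interquartile range (IQR) and marks anything beyond them as an outlier. The same rule can be computed directly, as in this sketch on made-up popularity scores:

```python
import pandas as pd

# Hypothetical popularity scores with one very popular movie
popularity = pd.Series([0.6, 0.6, 1.0, 2.0, 3.0, 28.0])

# Quartiles and interquartile range
q1, q3 = popularity.quantile([0.25, 0.75])
iqr = q3 - q1

# Values above the upper fence are the points drawn above the top whisker
upper_fence = q3 + 1.5 * iqr
outliers = popularity[popularity > upper_fence]
print(outliers.tolist())
```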


### 4. Line Graph

It shows the average popularity of movies per release year.

In [None]:
#Line graph for movie popularity in each year
df3['releasedate'] = pd.to_datetime(df3['releasedate'], errors='coerce')
df3['release_year'] = df3['releasedate'].dt.year

pop_by_year = df3.groupby('release_year')['popularity'].mean().dropna().head(10)

plt.figure(figsize=(10, 5))
plt.plot(pop_by_year.index, pop_by_year.values, marker='o')
plt.title('Average Popularity Over Years')
plt.xlabel('Year')
plt.ylabel('Popularity')
plt.grid(True)
plt.tight_layout()
plt.show()



### 5. Histogram
It shows how movie popularity is distributed: the number of movies that fall in each popularity range.

In [None]:
#Histogram for movie popularity
plt.figure(figsize=(8, 5))
plt.hist(df3['popularity'].dropna(), bins=10, color='purple', edgecolor='black')
plt.title('Histogram of Movie Popularity')
plt.xlabel('Popularity')
plt.ylabel('Number of Movies')
plt.grid(True)
plt.tight_layout()
plt.show()


# **Business Understanding**
- The studio should focus on movie genres and identify which movies attract the most votes
- It is important to focus not only on movie popularity but also on attention-catching, trendy movies
- Plan each release for the right season
- Market and advertise each movie to build its popularity

# **Conclusion**
- Many movies are not popular
- Many popular movies do not have high ratings
- Popularity goes together with vote count
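
The link between popularity and vote count can be checked numerically with a correlation coefficient. The sketch below uses only the seven popularity and vote-count pairs visible in the `head()` and `tail()` outputs of the `tmdb` data, so it illustrates the method and is not a result for the full dataset.

```python
import pandas as pd

# Popularity and vote count pairs copied from the tmdb head() and tail() outputs
toy = pd.DataFrame({
    "popularity": [33.533, 28.734, 28.515, 28.005, 27.92, 0.6, 0.6],
    "votecount": [10788, 7610, 12368, 10174, 22186, 1, 1],
})

# Pearson correlation: values near +1 mean the two columns rise together
corr = toy["popularity"].corr(toy["votecount"])
print(round(corr, 2))
```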
  