# BOX OFFICE MOVIE PERFORMANCE ANALYSIS

## 1. BUSINESS UNDERSTANDING

### a.) Introduction 

Movies are a staple of entertainment and a showcase for creativity and art on our screens. This project uses datasets that describe movies in terms of genre, reviews, ratings, crew, gross revenue, production budget, and more. This information can be used to evaluate how successful movies are and to identify the most popular genres.

Data analysis allows businesses to perform comprehensive market research to identify target audiences, understand their preferences, and predict trends. Analyzing demographic data, consumer surveys, and historical box office data can help identify patterns and inform decision-making in terms of genre selection, casting choices, and marketing strategies.

Data analysis also allows businesses to evaluate the success of individual movies or franchises. By analyzing box office revenue, critical reviews, audience ratings, and other performance indicators, stakeholders can assess the profitability and popularity of specific movies. This evaluation aids in understanding which movies resonate with audiences, identifying potential areas of improvement, and making informed decisions regarding future projects.

### b.) Problem Statement 

Objective: The objective is to analyze the dataset and gain insights into the factors that influence box office performance. The analysis aims to provide actionable information that can be used to enhance decision-making processes within the film industry.

Data: The analysis will be based on a dataset consisting of information on movies, including variables such as gross revenue, production budget, genre, release date, audience ratings, and reviews. The dataset should be comprehensive, reliable, and up-to-date to ensure accurate analysis.

Stakeholders: The stakeholders involved in this analysis include movie producers, studios, distributors, exhibitors, and other industry professionals. The insights derived from the analysis will assist these stakeholders in understanding the key drivers of box office success and enable them to make strategic decisions related to movie production, marketing, and distribution.

Scope: The analysis will focus on understanding the relationship between various factors and box office performance. This includes exploring the impact of production budget, genre, release date, audience ratings, and reviews on the commercial success of movies. The analysis will consider a specific time period and may focus on specific geographical regions or markets.

Deliverables: The deliverables of this analysis will include insights and recommendations that can be used by stakeholders to enhance box office performance. These insights may involve identifying genres that tend to perform well, determining optimal release periods, understanding the influence of marketing efforts on ticket sales, and other actionable information that can guide decision-making processes.

By addressing the problem statement, this analysis aims to provide valuable insights into the factors that drive box office success in the film industry. The outcomes of the analysis will assist stakeholders in making data-informed decisions and maximizing the commercial performance of their movies.

### c.) Main Objective

The main objective is to gain insights into the factors that influence box office performance, to indicate what kinds of movies are likely to succeed based on past performance, and to provide actionable insights to stakeholders in the film industry. By analyzing a comprehensive dataset encompassing movie information, gross revenue, ratings, reviews, and related variables, the project aims to achieve the following:

Determine Influential Factors: Identify the factors that have the greatest impact on a movie's box office performance. This includes exploring variables such as production budget, genre, release date, audience ratings, critical reception, and marketing efforts.


### d.) Specific Objectives 

Understand Relationships and Trends: Analyze the dataset to uncover patterns, correlations, and trends between different variables and box office success. Examine how factors such as genre popularity, budget allocation, release timing, or audience sentiment affect the commercial performance of movies.

Provide Actionable Insights: Derive actionable insights that can guide decision-making processes in the film industry. Offer recommendations to stakeholders, including filmmakers, studios, distributors, and exhibitors, on how to enhance the box office performance of their movies based on the identified influential factors and observed trends.

Support Strategic Decision Making: Equip stakeholders with the necessary information to make informed decisions related to movie production, marketing strategies, distribution plans, and resource allocation. Enable them to optimize their decision-making processes based on data-driven insights and increase the chances of commercial success.

### e.) Experimental Design 

i. Data Collection
ii. Read and check the data
iii. Cleaning the data
iv. Exploratory Data Analysis
v. Data modelling and model performance evaluation
vi. Use the model to make predictions
vii. Conclusions and Recommendations
viii. Deploy the model

## 2. DATA UNDERSTANDING

IMDB: The IMDB data is located in a SQLite database file called "im.db.zip". The relevant tables for analysis are "movie_basics" and "movie_ratings". The IMDB data contains information about movies such as titles, genres, release dates, ratings, and more. [IMDB](https://www.imdb.com/)
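
The cells later in this notebook read CSV exports, but the tables could also be read straight from the SQLite file described above. The following is a minimal sketch of that route, assuming the archive has already been unzipped to a file such as `im.db`. The helper name `load_imdb_tables` and the tiny in-memory demo database (with made-up rows) are illustrative and not part of the original project.

```python
import sqlite3
import pandas as pd

def load_imdb_tables(conn):
    """Read the two IMDB tables named in the text into DataFrames."""
    basics = pd.read_sql_query("SELECT * FROM movie_basics", conn)
    ratings = pd.read_sql_query("SELECT * FROM movie_ratings", conn)
    return basics, ratings

# Demo on a throwaway in-memory database with made-up rows.
# For the real data, sqlite3.connect("im.db") would be used instead (path assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie_basics (tconst TEXT, primary_title TEXT, genres TEXT)")
conn.execute("CREATE TABLE movie_ratings (tconst TEXT, averagerating REAL, numvotes INTEGER)")
conn.execute("INSERT INTO movie_basics VALUES ('tt0000001', 'Example Movie', 'Drama')")
conn.execute("INSERT INTO movie_ratings VALUES ('tt0000001', 7.5, 100)")
conn.commit()

basics, ratings = load_imdb_tables(conn)
conn.close()
```

Passing an open connection rather than a path keeps the helper easy to try out on a small test database before pointing it at the full file.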

Box Office Mojo: The data from Box Office Mojo is stored in a compressed CSV file called "bom.movie_gross.csv.gz". This file contains information about the domestic and foreign box office gross for movies. [Box Office Mojo](https://www.boxofficemojo.com/)
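
The description above names a compressed file, while the notebook later reads an uncompressed `bom.movie_gross.csv`. If only the `.gz` file is available, `pd.read_csv` can read it directly, because it infers gzip compression from the file extension. The sketch below shows this on a small temporary file with made-up rows. The file name `example_gross.csv.gz` is hypothetical.

```python
import gzip
import os
import tempfile
import pandas as pd

# Write a tiny gzip-compressed CSV (made-up rows) to a temporary directory.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "example_gross.csv.gz")
with gzip.open(gz_path, "wt", newline="") as f:
    f.write("title,studio,domestic_gross\nExample A,XX,1000.0\nExample B,YY,2500.0\n")

# compression="infer" is the default; it is spelled out here for clarity.
demo = pd.read_csv(gz_path, compression="infer")
```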

### a.) Importing Data

In [20]:
# importing relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sqlite3

### b.) Reading Data 

In [21]:
#load the box office mojo gross csv file
bom_data = pd.read_csv("bom.movie_gross.csv")
bom_data.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [22]:
#load IMDB Movie title ratings csv file
movie_ratings = pd.read_csv("title.ratings.csv")
movie_ratings.head()

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [23]:
#load movie title basics file
movie_title = pd.read_csv("title.basics.csv")
movie_title.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [24]:
#load TMDB movies csv file
movies = pd.read_csv("tmdb.movies.csv")
movies.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [25]:
#load movie budgets csv file
movie_budget = pd.read_csv("tn.movie_budgets.csv")
movie_budget.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


### 2.1 DATA WRANGLING

Before joining the datasets, each one must be tidied: unnecessary columns and rows are removed so that only the information relevant to this analysis remains.
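
As one illustration of this kind of clean-up, the budget table previewed earlier stores money as text such as "$425,000,000" and dates as text such as "Dec 18, 2009". A minimal sketch of converting both follows. It uses a two-row stand-in for `movie_budget` (values copied from that preview), and the helper `money_to_int` is a hypothetical name.

```python
import pandas as pd

# Stand-in for movie_budget, using two rows from the preview shown earlier.
budget = pd.DataFrame({
    "release_date": ["Dec 18, 2009", "Jun 7, 2019"],
    "movie": ["Avatar", "Dark Phoenix"],
    "production_budget": ["$425,000,000", "$350,000,000"],
    "worldwide_gross": ["$2,776,345,279", "$149,762,350"],
})

def money_to_int(series):
    """Strip '$' and ',' from a text column and convert it to 64-bit integers."""
    cleaned = series.str.replace("$", "", regex=False).str.replace(",", "", regex=False)
    return cleaned.astype("int64")

# Convert the money columns from text to integers.
for col in ["production_budget", "worldwide_gross"]:
    budget[col] = money_to_int(budget[col])

# Parse dates written like "Dec 18, 2009" into datetime values.
budget["release_date"] = pd.to_datetime(budget["release_date"], format="%b %d, %Y")
```

The explicit `int64` matters here because worldwide gross figures above roughly 2.1 billion do not fit in a 32-bit integer.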

### 2.1.1 Indexing

In [26]:
# Merge the movie basics and ratings tables based on tconst
imdb_data = pd.merge(movie_title, movie_ratings, on='tconst', how='inner')
imdb_data.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",7.0,77
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",7.2,43
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,6.9,4517
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",6.1,13
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",6.5,119


In [27]:
# Merge the IMDb data with Box Office Mojo data based on movie titles
merged_data = pd.merge(imdb_data, bom_data, left_on='primary_title', right_on='title', how='inner')
merged_data.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes,title,studio,domestic_gross,foreign_gross,year
0,tt0315642,Wazir,Wazir,2016,103.0,"Action,Crime,Drama",7.1,15378,Wazir,Relbig.,1100000.0,,2016
1,tt0337692,On the Road,On the Road,2012,124.0,"Adventure,Drama,Romance",6.1,37886,On the Road,IFC,744000.0,8000000.0,2012
2,tt4339118,On the Road,On the Road,2014,89.0,Drama,6.0,6,On the Road,IFC,744000.0,8000000.0,2012
3,tt5647250,On the Road,On the Road,2016,121.0,Drama,5.7,127,On the Road,IFC,744000.0,8000000.0,2012
4,tt0359950,The Secret Life of Walter Mitty,The Secret Life of Walter Mitty,2013,114.0,"Adventure,Comedy,Drama",7.3,275300,The Secret Life of Walter Mitty,Fox,58200000.0,129900000.0,2013


In [28]:
# Remove irrelevant columns
merged_data = merged_data.drop(['tconst', 'original_title'], axis=1)
merged_data.head()

Unnamed: 0,primary_title,start_year,runtime_minutes,genres,averagerating,numvotes,title,studio,domestic_gross,foreign_gross,year
0,Wazir,2016,103.0,"Action,Crime,Drama",7.1,15378,Wazir,Relbig.,1100000.0,,2016
1,On the Road,2012,124.0,"Adventure,Drama,Romance",6.1,37886,On the Road,IFC,744000.0,8000000.0,2012
2,On the Road,2014,89.0,Drama,6.0,6,On the Road,IFC,744000.0,8000000.0,2012
3,On the Road,2016,121.0,Drama,5.7,127,On the Road,IFC,744000.0,8000000.0,2012
4,The Secret Life of Walter Mitty,2013,114.0,"Adventure,Comedy,Drama",7.3,275300,The Secret Life of Walter Mitty,Fox,58200000.0,129900000.0,2013
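
The preview above shows one side effect of joining on title alone: the single Box Office Mojo record for "On the Road" (2012) is matched to three different IMDb titles from 2012, 2014, and 2016. One possible way to reduce such false matches, sketched here on stand-in frames built from those rows, is to join on both title and release year. Treating `start_year` and `year` as equivalent is an assumption that may not always hold, for example when the two sources record different release years.

```python
import pandas as pd

# Stand-ins for imdb_data and bom_data, using the "On the Road" rows shown above.
imdb_demo = pd.DataFrame({
    "primary_title": ["On the Road", "On the Road", "On the Road"],
    "start_year": [2012, 2014, 2016],
    "averagerating": [6.1, 6.0, 5.7],
})
bom_demo = pd.DataFrame({
    "title": ["On the Road"],
    "year": [2012],
    "domestic_gross": [744000.0],
})

# Title-only join: one box-office record fans out to three IMDb rows.
by_title = pd.merge(imdb_demo, bom_demo,
                    left_on="primary_title", right_on="title", how="inner")

# Title + year join: only the 2012 IMDb row matches.
by_title_year = pd.merge(
    imdb_demo, bom_demo,
    left_on=["primary_title", "start_year"],
    right_on=["title", "year"],
    how="inner",
)
```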


Next, the merge is rebuilt starting from the Box Office Mojo data: the 'bom_data' dataset is merged with the 'movie_title' dataset, and the result is then merged with the 'movie_ratings' dataset. This produces a merged dataset that contains information about the studios, gross revenue, movie titles, and ratings.

In [50]:
merged_data = pd.merge(bom_data, movie_title, left_on='title', right_on='primary_title')
merged_data = pd.merge(merged_data, movie_ratings, on='tconst')
merged_data.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year,tconst,primary_title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes
0,Toy Story 3,BV,415000000.0,652000000,2010,tt0435761,Toy Story 3,Toy Story 3,2010,103.0,"Adventure,Animation,Comedy",8.3,682218
1,Inception,WB,292600000.0,535700000,2010,tt1375666,Inception,Inception,2010,148.0,"Action,Adventure,Sci-Fi",8.8,1841066
2,Shrek Forever After,P/DW,238700000.0,513900000,2010,tt0892791,Shrek Forever After,Shrek Forever After,2010,93.0,"Adventure,Animation,Comedy",6.3,167532
3,The Twilight Saga: Eclipse,Sum.,300500000.0,398000000,2010,tt1325004,The Twilight Saga: Eclipse,The Twilight Saga: Eclipse,2010,124.0,"Adventure,Drama,Fantasy",5.0,211733
4,Iron Man 2,Par.,312400000.0,311500000,2010,tt1228705,Iron Man 2,Iron Man 2,2010,124.0,"Action,Adventure,Sci-Fi",7.0,657690


These are the highest-grossing movies for each studio.

In [43]:
# Drop rows with a missing domestic gross so idxmax only sees valid values
merged_data_clean = merged_data.dropna(subset=['domestic_gross'])
# Index label of the top-grossing row within each studio
idx = merged_data_clean.groupby('studio')['domestic_gross'].idxmax()
# Look up studio, title and gross for those rows
highest_grossing_movies = merged_data_clean.loc[idx, ['studio', 'title', 'domestic_gross']]
highest_grossing_movies.head()

Unnamed: 0,studio,title,domestic_gross
754,3D,Sea Rex 3D: Journey to a Prehistoric World,6100000.0
580,A23,Revenge of the Electric Car,151000.0
2713,A24,Lady Bird,49000000.0
1327,ADC,A Royal Night Out,228000.0
1684,AF,Barbara,1000000.0


To compute the total gross revenue for each studio across all of its movies, the data first has to be grouped by studio.

In [60]:
# Group data by studio and calculate total gross revenue
studio_gross = merged_data.groupby('studio').agg({'domestic_gross': 'sum', 'foreign_gross': 'sum'})
studio_gross['domestic_gross'] = pd.to_numeric(studio_gross['domestic_gross'], errors='coerce')
studio_gross['foreign_gross'] = pd.to_numeric(studio_gross['foreign_gross'], errors='coerce')

In [62]:
studio_gross['total_gross'] = studio_gross['domestic_gross'] + studio_gross['foreign_gross']

# Sort studios by total gross revenue
studio_gross = studio_gross.sort_values('total_gross', ascending=False)

# Select the top 10 studios
top_10_studios = studio_gross.head(10)
top_10_studios

Unnamed: 0_level_0,domestic_gross,foreign_gross,total_gross
studio,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
RAtt.,293855200,7.300001e+272,7.300001e+272
LG/S,1543499999,1.613e+262,1.613e+262
SGem,1455100000,2.401e+243,2.401e+243
ORF,832500000,3.1300000000000004e+233,3.1300000000000004e+233
Rela.,980194000,5.4e+208,5.4e+208
STX,702600000,7.07e+169,7.07e+169
WGUSA,20128600,8.28e+148,8.28e+148
TriS,976600000,2.43e+144,2.43e+144
Strand,3126100,3.00001e+136,3.00001e+136
Sum.,878771000,3.98e+128,3.98e+128
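
The totals above (for example, 7.3e+272) are far too large to be real revenue figures. A likely cause, assumed here from the mixed number formatting in the earlier previews, is that `foreign_gross` is stored as text, so `sum` concatenates the strings before `pd.to_numeric` runs. A minimal sketch of converting to numbers before grouping follows, using a small made-up frame in place of `merged_data`. The presence of thousands separators in some values is also an assumption.

```python
import pandas as pd

# Made-up stand-in for merged_data; foreign_gross is text, as assumed above.
demo = pd.DataFrame({
    "studio": ["AA", "AA", "BB"],
    "domestic_gross": [100.0, 200.0, 50.0],
    "foreign_gross": ["300", "1,000", None],
})

# Convert text to numbers BEFORE aggregating; strip thousands separators first.
demo["foreign_gross"] = pd.to_numeric(
    demo["foreign_gross"].str.replace(",", "", regex=False), errors="coerce"
)

# Sum per studio (missing values are skipped), then add the two totals.
studio_totals = demo.groupby("studio")[["domestic_gross", "foreign_gross"]].sum()
studio_totals["total_gross"] = studio_totals["domestic_gross"] + studio_totals["foreign_gross"]
studio_totals = studio_totals.sort_values("total_gross", ascending=False)
```

With the conversion done first, each studio's total is an arithmetic sum of its movies' grosses, so the ranking reflects revenue rather than the length of a concatenated string.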


### 2.1.2 Formatting Datatypes

We need to remove rows with a missing domestic gross and convert the domestic_gross column into an integer.

In [63]:
# Remove rows with missing values in the 'domestic_gross' column
# (.copy() avoids a SettingWithCopyWarning on the assignment below)
merged_data = merged_data.dropna(subset=['domestic_gross']).copy()

# Convert 'domestic_gross' column to int
merged_data['domestic_gross'] = merged_data['domestic_gross'].astype(int)
merged_data.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year,tconst,primary_title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes
0,Toy Story 3,BV,415000000,652000000,2010,tt0435761,Toy Story 3,Toy Story 3,2010,103.0,"Adventure,Animation,Comedy",8.3,682218
1,Inception,WB,292600000,535700000,2010,tt1375666,Inception,Inception,2010,148.0,"Action,Adventure,Sci-Fi",8.8,1841066
2,Shrek Forever After,P/DW,238700000,513900000,2010,tt0892791,Shrek Forever After,Shrek Forever After,2010,93.0,"Adventure,Animation,Comedy",6.3,167532
3,The Twilight Saga: Eclipse,Sum.,300500000,398000000,2010,tt1325004,The Twilight Saga: Eclipse,The Twilight Saga: Eclipse,2010,124.0,"Adventure,Drama,Fantasy",5.0,211733
4,Iron Man 2,Par.,312400000,311500000,2010,tt1228705,Iron Man 2,Iron Man 2,2010,124.0,"Action,Adventure,Sci-Fi",7.0,657690


In [64]:
# Fill missing values in 'runtime_minutes' column with the mean value
mean_runtime = merged_data['runtime_minutes'].mean()
merged_data['runtime_minutes'] = merged_data['runtime_minutes'].fillna(mean_runtime)
mean_runtime

107.25693035835025
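
Mean imputation can be pulled up or down by unusually long or short runtimes. As a possible alternative, the sketch below fills gaps with the median and keeps a flag marking which rows were imputed. The frame is made up, and the column name `runtime_imputed` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up runtimes with one missing value and one unusually long film.
demo = pd.DataFrame({"runtime_minutes": [90.0, 100.0, np.nan, 110.0, 300.0]})

# Record which rows were missing before filling them.
demo["runtime_imputed"] = demo["runtime_minutes"].isna()

mean_runtime = demo["runtime_minutes"].mean()      # pulled upward by the 300-minute value
median_runtime = demo["runtime_minutes"].median()  # less affected by that value

# Fill the gap with the median instead of the mean.
demo["runtime_minutes"] = demo["runtime_minutes"].fillna(median_runtime)
```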