<a href="https://colab.research.google.com/github/yayra/tmdb_5000_movies/blob/main/Movie_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Analysis of movie box office success factors**

## **Introduction**
The dataset used for this analysis was sourced from [Kaggle's TMDB 5000 Movie Dataset](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata). The dataset contains movie data from 1916 to 2016.

#### **Assumptions made about the data during the analysis:**
* The values in the 'budget' and 'revenue' columns are expressed in USD, representing global box office revenue.
* The analysis of actors/actresses considers their appearance in a movie across all roles, not just the main role.

#### **Key Technologies Utilized:**
* **Pandas & Numpy:** For data analysis and manipulation
* **Abstract Syntax Tree (`ast`) module:** For parsing string representations of Python data structures
* **Plotly:** For interactive data visualization


## **Key questions addressed in this project**

1. Trends in Film Industry: Invested Budget and Box Office Revenue Over the Years
2. Analysis of the Top 10 Highest-Grossing Films
3. Top 10 Films by Budget and Number of Votes
4. The Most Successful Directors Based on Box Office Revenue
5. The Most Successful Actors Based on Box Office Performance and Number of Films Appeared
6. Insights into High-Grossing Genres
7. Analysis of the Evolution of Popular Genres Over Time
8. Seasonal Trends: Are Certain Genres More Popular in Specific Months?
9. Exploring the Relationship Between a Film's Budget, Revenue, Vote Count, and Ratings
10. Key Characteristics of Films with High Returns on Investment (ROI)

## **1. Import necessary libraries**

In [None]:
# For data manipulation and analysis
import pandas as pd
import numpy as np

# For safely parsing strings that represent lists of dictionaries
import ast

# For visualization
import plotly.express as px

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## **2. Data Exploration**

In [None]:
movies= pd.read_csv("/content/drive/MyDrive/tmdb_5000_movies.csv")
movies.head(3)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.44,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.08,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.38,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466


In [None]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [None]:
credits = pd.read_csv("/content/drive/MyDrive/tmdb_5000_credits.csv")
credits.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [None]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB


## **3. Data preprocessing**

In [None]:
# Keep only columns needed for analysis
movies_df = movies[['id', 'budget', 'genres', 'title', 'release_date', 'revenue', 'vote_average', 'vote_count']]
credits_df = credits[['movie_id', 'cast', 'crew']]

In [None]:
# Merging datasets
data = pd.merge(movies_df, credits_df, left_on = 'id', right_on = 'movie_id').drop('movie_id', axis = 1)
data.head()

Unnamed: 0,id,budget,genres,title,release_date,revenue,vote_average,vote_count,cast,crew
0,19995,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",Avatar,2009-12-10,2787965087,7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",Pirates of the Caribbean: At World's End,2007-05-19,961000000,6.9,4500,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",Spectre,2015-10-26,880674609,6.3,4466,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",The Dark Knight Rises,2012-07-16,1084939099,7.6,9106,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",John Carter,2012-03-07,284139100,6.1,2124,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [None]:
data.describe()

Unnamed: 0,id,budget,revenue,vote_average,vote_count
count,4803.0,4803.0,4803.0,4803.0,4803.0
mean,57165.48,29045039.88,82260638.65,6.09,690.22
std,88694.61,40722391.26,162857100.94,1.19,1234.59
min,5.0,0.0,0.0,0.0,0.0
25%,9014.5,790000.0,0.0,5.6,54.0
50%,14629.0,15000000.0,19170001.0,6.2,235.0
75%,58610.5,40000000.0,92917187.0,6.8,737.0
max,459488.0,380000000.0,2787965087.0,10.0,13752.0


* **Adding 'director' column**


In [None]:
# Extracting the 'director' information from the 'crew' column and adding it as a new column to the dataset.
# The 'crew' column contains string representations of lists with multiple dictionaries.
# To process this data, the strings need to be converted into actual Python lists using ast.literal_eval() from the ast module.
data['crew'][0]

In [None]:
print(ast.literal_eval(data['crew'][0]))

In [None]:
# Safely convert the strings in the 'crew' column to Python lists using ast.literal_eval().
data['crew'] = data['crew'].apply(ast.literal_eval)
print(data['crew'][0])

In [None]:
# Function to extract the name of the director from a dictionary embedded in a list
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']

# Adding a new 'director' column
data['director'] = data['crew'].apply(get_director)
data.head()

Unnamed: 0,id,budget,genres,title,release_date,revenue,vote_average,vote_count,cast,crew,director
0,19995,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",Avatar,2009-12-10,2787965087,7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{'credit_id': '52fe48009251416c750aca23', 'de...",James Cameron
1,285,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",Pirates of the Caribbean: At World's End,2007-05-19,961000000,6.9,4500,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{'credit_id': '52fe4232c3a36847f800b579', 'de...",Gore Verbinski
2,206647,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",Spectre,2015-10-26,880674609,6.3,4466,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{'credit_id': '54805967c3a36829b5002c41', 'de...",Sam Mendes
3,49026,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",The Dark Knight Rises,2012-07-16,1084939099,7.6,9106,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{'credit_id': '52fe4781c3a36847f81398c3', 'de...",Christopher Nolan
4,49529,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",John Carter,2012-03-07,284139100,6.1,2124,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{'credit_id': '52fe479ac3a36847f813eaa3', 'de...",Andrew Stanton
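
The parsing and director lookup can also be tried on a single toy value. This is a minimal sketch, not the notebook's own code: the `crew_raw` string and the `get_director_safe` helper are hypothetical, and `.get` is used on the assumption that some entries might lack a `job` key.

```python
import ast

# Hypothetical crew string in the same shape as the dataset's 'crew' column
crew_raw = '[{"job": "Editor", "name": "A. Editor"}, {"job": "Director", "name": "B. Director"}]'

def get_director_safe(crew_str):
    """Return the name of the first crew member whose job is 'Director', or None."""
    crew = ast.literal_eval(crew_str)  # string -> list of dictionaries
    for member in crew:
        if member.get("job") == "Director":
            return member.get("name")
    return None  # no director listed for this film

director = get_director_safe(crew_raw)
no_director = get_director_safe("[]")
print(director, no_director)
```

Films without a listed director come back as `None`, which is consistent with the non-null count of `director` being lower than the row count in `data.info()`.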


* **Adding 'cast_name' column**


In [None]:
data['cast'][0]

In [None]:
# Extracting cast names as a list from the 'cast' column, which contains string representations of lists with dictionaries.
# The 'name' key from each dictionary in the list is used to create a new list of cast names.
data['cast_name'] = data['cast'].apply(lambda x: [i['name'] for i in ast.literal_eval(x)])
data.head()

Unnamed: 0,id,budget,genres,title,release_date,revenue,vote_average,vote_count,cast,crew,director,cast_name
0,19995,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",Avatar,2009-12-10,2787965087,7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{'credit_id': '52fe48009251416c750aca23', 'de...",James Cameron,"[Sam Worthington, Zoe Saldana, Sigourney Weave..."
1,285,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",Pirates of the Caribbean: At World's End,2007-05-19,961000000,6.9,4500,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{'credit_id': '52fe4232c3a36847f800b579', 'de...",Gore Verbinski,"[Johnny Depp, Orlando Bloom, Keira Knightley, ..."
2,206647,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",Spectre,2015-10-26,880674609,6.3,4466,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{'credit_id': '54805967c3a36829b5002c41', 'de...",Sam Mendes,"[Daniel Craig, Christoph Waltz, Léa Seydoux, R..."
3,49026,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",The Dark Knight Rises,2012-07-16,1084939099,7.6,9106,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{'credit_id': '52fe4781c3a36847f81398c3', 'de...",Christopher Nolan,"[Christian Bale, Michael Caine, Gary Oldman, A..."
4,49529,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",John Carter,2012-03-07,284139100,6.1,2124,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{'credit_id': '52fe479ac3a36847f813eaa3', 'de...",Andrew Stanton,"[Taylor Kitsch, Lynn Collins, Samantha Morton,..."


* **Extract the main genre**

In [None]:
data['genres'][0]

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [None]:
data['genres'] = data['genres'].apply(ast.literal_eval)

In [None]:
#Function to extract the name of the genre from the first element of the list
def get_genres(x):
    if len(x) > 0:
        return x[0]['name']

data['main_genre'] = data['genres'].apply(get_genres)
data.head()

Unnamed: 0,id,budget,genres,title,release_date,revenue,vote_average,vote_count,cast,crew,director,cast_name,main_genre
0,19995,237000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",Avatar,2009-12-10,2787965087,7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{'credit_id': '52fe48009251416c750aca23', 'de...",James Cameron,"[Sam Worthington, Zoe Saldana, Sigourney Weave...",Action
1,285,300000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",Pirates of the Caribbean: At World's End,2007-05-19,961000000,6.9,4500,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{'credit_id': '52fe4232c3a36847f800b579', 'de...",Gore Verbinski,"[Johnny Depp, Orlando Bloom, Keira Knightley, ...",Adventure
2,206647,245000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",Spectre,2015-10-26,880674609,6.3,4466,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{'credit_id': '54805967c3a36829b5002c41', 'de...",Sam Mendes,"[Daniel Craig, Christoph Waltz, Léa Seydoux, R...",Action
3,49026,250000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",The Dark Knight Rises,2012-07-16,1084939099,7.6,9106,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{'credit_id': '52fe4781c3a36847f81398c3', 'de...",Christopher Nolan,"[Christian Bale, Michael Caine, Gary Oldman, A...",Action
4,49529,260000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",John Carter,2012-03-07,284139100,6.1,2124,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{'credit_id': '52fe479ac3a36847f813eaa3', 'de...",Andrew Stanton,"[Taylor Kitsch, Lynn Collins, Samantha Morton,...",Action


* **Convert datatypes**

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            4803 non-null   int64  
 1   budget        4803 non-null   int64  
 2   genres        4803 non-null   object 
 3   title         4803 non-null   object 
 4   release_date  4802 non-null   object 
 5   revenue       4803 non-null   int64  
 6   vote_average  4803 non-null   float64
 7   vote_count    4803 non-null   int64  
 8   cast          4803 non-null   object 
 9   crew          4803 non-null   object 
 10  director      4773 non-null   object 
 11  cast_name     4803 non-null   object 
 12  main_genre    4775 non-null   object 
dtypes: float64(1), int64(4), object(8)
memory usage: 487.9+ KB


In [None]:
# Convert 'release_date' into date format
data['release_date'] = pd.to_datetime(data['release_date'], format='%Y-%m-%d')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            4803 non-null   int64         
 1   budget        4803 non-null   int64         
 2   genres        4803 non-null   object        
 3   title         4803 non-null   object        
 4   release_date  4802 non-null   datetime64[ns]
 5   revenue       4803 non-null   int64         
 6   vote_average  4803 non-null   float64       
 7   vote_count    4803 non-null   int64         
 8   cast          4803 non-null   object        
 9   crew          4803 non-null   object        
 10  director      4773 non-null   object        
 11  cast_name     4803 non-null   object        
 12  main_genre    4775 non-null   object        
dtypes: datetime64[ns](1), float64(1), int64(4), object(7)
memory usage: 487.9+ KB


* **Adding 'year' and 'month' columns**

In [None]:
data['year'] = data['release_date'].dt.year
data['month'] = data['release_date'].dt.month
data.head()

Unnamed: 0,id,budget,genres,title,release_date,revenue,vote_average,vote_count,cast,crew,director,cast_name,main_genre,year,month
0,19995,237000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",Avatar,2009-12-10,2787965087,7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{'credit_id': '52fe48009251416c750aca23', 'de...",James Cameron,"[Sam Worthington, Zoe Saldana, Sigourney Weave...",Action,2009.0,12.0
1,285,300000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",Pirates of the Caribbean: At World's End,2007-05-19,961000000,6.9,4500,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{'credit_id': '52fe4232c3a36847f800b579', 'de...",Gore Verbinski,"[Johnny Depp, Orlando Bloom, Keira Knightley, ...",Adventure,2007.0,5.0
2,206647,245000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",Spectre,2015-10-26,880674609,6.3,4466,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{'credit_id': '54805967c3a36829b5002c41', 'de...",Sam Mendes,"[Daniel Craig, Christoph Waltz, Léa Seydoux, R...",Action,2015.0,10.0
3,49026,250000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",The Dark Knight Rises,2012-07-16,1084939099,7.6,9106,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{'credit_id': '52fe4781c3a36847f81398c3', 'de...",Christopher Nolan,"[Christian Bale, Michael Caine, Gary Oldman, A...",Action,2012.0,7.0
4,49529,260000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",John Carter,2012-03-07,284139100,6.1,2124,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{'credit_id': '52fe479ac3a36847f813eaa3', 'de...",Andrew Stanton,"[Taylor Kitsch, Lynn Collins, Samantha Morton,...",Action,2012.0,3.0
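
The `year` and `month` values appear as floats (for example `2009.0`) because `release_date` contains a missing value, and pandas falls back to a float dtype when a gap is present. A minimal sketch on toy dates, assuming a pandas version with nullable integer support, shows one way to keep integer years while preserving the gap:

```python
import pandas as pd

# Toy release dates; the None mimics the single missing value in the dataset
dates = pd.to_datetime(pd.Series(["2009-12-10", None, "2015-10-26"]), format="%Y-%m-%d")

year_float = dates.dt.year                # float dtype because of the missing date
year_int = dates.dt.year.astype("Int64")  # nullable integer keeps the gap as <NA>

print(year_float.dtype, year_int.dtype)
```

In this notebook the rows with missing values are dropped later, so a plain `astype(int)` after that step would be an alternative.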


* **Handling missing values**

In [None]:
#Check the NaN values in each column
data.isnull().sum()

Unnamed: 0,0
id,0
budget,0
genres,0
title,0
release_date,1
revenue,0
vote_average,0
vote_count,0
cast,0
crew,0


In [None]:
# Drop NaN values
data.dropna(inplace = True)
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4759 entries, 0 to 4802
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            4759 non-null   int64         
 1   budget        4759 non-null   int64         
 2   genres        4759 non-null   object        
 3   title         4759 non-null   object        
 4   release_date  4759 non-null   datetime64[ns]
 5   revenue       4759 non-null   int64         
 6   vote_average  4759 non-null   float64       
 7   vote_count    4759 non-null   int64         
 8   cast          4759 non-null   object        
 9   crew          4759 non-null   object        
 10  director      4759 non-null   object        
 11  cast_name     4759 non-null   object        
 12  main_genre    4759 non-null   object        
 13  year          4759 non-null   float64       
 14  month         4759 non-null   float64       
dtypes: datetime64[ns](1), float64(3), int64(4),

## **4. Exploratory Data Analysis (EDA)**


#### **4.1 Trends in Film Industry: Invested Budget and Box Office Revenue Over the Years**

In [None]:
revenue_by_year = data.groupby('year')[['revenue', 'budget']].sum().reset_index()
fig = px.line(revenue_by_year, x='year', y = ['revenue', 'budget'],
              title = 'Box Office Revenue Over the Years',
              width = 800, height = 600,
              color_discrete_sequence=px.colors.qualitative.Vivid,
              template = 'plotly_white',
              )
fig.show()

The budget invested in film production and the box office revenue both began to increase in the 1990s. However, the revenue experienced a significantly sharper upward trend.

#### **4.2 Analysis of the Top 10 Highest-Grossing Films**

In [None]:
revenue_by_year = data[['title', 'revenue', 'main_genre', 'year', 'director']].sort_values('revenue', ascending=False).head(10)

revenue_by_year

Unnamed: 0,title,revenue,main_genre,year,director
0,Avatar,2787965087,Action,2009.0,James Cameron
25,Titanic,1845034188,Drama,1997.0,James Cameron
16,The Avengers,1519557910,Science Fiction,2012.0,Joss Whedon
28,Jurassic World,1513528810,Action,2015.0,Colin Trevorrow
44,Furious 7,1506249360,Action,2015.0,James Wan
7,Avengers: Age of Ultron,1405403694,Action,2015.0,Joss Whedon
124,Frozen,1274219009,Animation,2013.0,Chris Buck
31,Iron Man 3,1215439994,Action,2013.0,Shane Black
546,Minions,1156730962,Family,2015.0,Kyle Balda
26,Captain America: Civil War,1153304495,Adventure,2016.0,Anthony Russo


In [None]:
# Create the plot with formatted labels
fig = px.bar(revenue_by_year,
              x='title',
              y='revenue',
              title='TOP 10 highest-grossing films',
              width=800,
              height=600,
              color_discrete_sequence=px.colors.qualitative.G10,
              template='plotly_white',
              color='main_genre',
              hover_name = 'year',
              labels = dict(revenue = 'Revenue (billion)', title = '')
              )

# Show the figure
fig.show()

#### **4.3 Top 10 Films by Budget and Number of Votes**

In [None]:
title_dic = {'budget':'budget', 'vote_count':'number of votes'}
for y in ['budget','vote_count']:
    top = data.groupby('title')[[y]].sum().reset_index().sort_values(y, ascending=False).head(10)
    fig = px.bar(data_frame=top, x='title', y=y,
                 width=800, height=600,
                 color_discrete_sequence=px.colors.qualitative.G10,
                 template='plotly_white',
                 title=f"TOP 10 films in terms of {title_dic[y]}",
                 labels = {'title': ' ', y: title_dic[y]}
                 )
    fig.show()

* Higher budgets and vote counts are often assumed to go hand in hand with higher box office revenue. The data in this dataset suggests that this assumption does not always hold. To test it, a correlation analysis of budget, vote count, and revenue is needed (performed in Section 4.9).
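
As a rough illustration of what such a correlation check looks like, the sketch below computes pairwise Pearson correlations with `DataFrame.corr`. The `toy` frame and its numbers are made up and only stand in for the notebook's `data`.

```python
import pandas as pd

# Toy values standing in for the notebook's `data` frame (not real figures)
toy = pd.DataFrame({
    "budget":     [10, 50, 120, 200, 260],
    "vote_count": [300, 150, 900, 400, 2000],
    "revenue":    [40, 30, 500, 180, 900],
})

# Pairwise Pearson correlation; method="spearman" is a rank-based alternative
corr = toy[["budget", "vote_count", "revenue"]].corr(method="pearson")
print(corr.round(2))
```

Because revenue is heavily skewed, a rank-based method such as Spearman may be worth comparing against Pearson.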

#### **4.4 The Most Successful Directors Based on Box Office Revenue**

In [None]:
top_director = data.groupby(['director'])['revenue'].sum().reset_index().sort_values(by='revenue', ascending=False).head(10)
fig = px.bar(top_director, x='director', y='revenue',
             title = 'Highest-grossing film directors',
             color_discrete_sequence=px.colors.qualitative.G10,
             template='plotly_white',
             width=800, height=500,
             labels = dict(director = '', revenue = 'Revenue(billion)'))
fig.show()

In [None]:
dir_rev = data[data['director'].isin(['Steven Spielberg', 'Peter Jackson','James Cameron'])][['title', 'year', 'director', 'revenue']].sort_values('director').reset_index(drop=True)
dir_rev

Unnamed: 0,title,year,director,revenue
0,Avatar,2009.0,James Cameron,2787965087
1,Aliens,1986.0,James Cameron,183316455
2,The Abyss,1989.0,James Cameron,90000098
3,The Terminator,1984.0,James Cameron,78371200
4,Titanic,1997.0,James Cameron,1845034188
5,True Lies,1994.0,James Cameron,378882411
6,Terminator 2: Judgment Day,1991.0,James Cameron,520000000
7,The Lord of the Rings: The Two Towers,2002.0,Peter Jackson,926287400
8,The Lord of the Rings: The Return of the King,2003.0,Peter Jackson,1118888979
9,The Lovely Bones,2009.0,Peter Jackson,93525586


In [None]:
dir_rev.groupby('director').size()

Unnamed: 0_level_0,0
director,Unnamed: 1_level_1
James Cameron,7
Peter Jackson,9
Steven Spielberg,27


Steven Spielberg, Peter Jackson, and James Cameron ranked as the top three most successful film directors in terms of box office revenue between 1975 and 2016. Films directed by Steven Spielberg generated over 9 billion USD, those by Peter Jackson earned over 6 billion USD, and James Cameron's films grossed approximately 5.9 billion USD during this period. Spielberg directed 27 films, Jackson directed 9, and Cameron directed 7. Notably, the two highest-grossing films, Avatar (2009) and Titanic (1997), were both directed by James Cameron.
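
Total revenue and film count per director can be obtained in one pass with named aggregation. The sketch below uses made-up directors and revenues, and the `summary` column names are illustrative assumptions rather than names used elsewhere in this notebook.

```python
import pandas as pd

# Toy rows; directors and revenues are made up for illustration
toy = pd.DataFrame({
    "director": ["A", "A", "B", "B", "B", "C"],
    "revenue":  [100, 300, 50, 60, 70, 500],
})

summary = (
    toy.groupby("director")
       .agg(total_revenue=("revenue", "sum"),
            film_count=("revenue", "count"),
            mean_revenue=("revenue", "mean"))
       .sort_values("total_revenue", ascending=False)
)
print(summary)
```

Looking at the mean next to the total helps separate directors with many films from directors with a few very large ones.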

#### **4.5 The Most Successful Actors Based on Box Office Performance and Number of Films Appeared**

In [None]:
#Extract cast names from the list and sum up the revenue of the films they appeared in.
revenue_cast = data[['revenue', 'cast_name']].explode('cast_name')
top_cast = revenue_cast.groupby('cast_name')[['revenue']].sum().reset_index().sort_values('revenue', ascending=False).head(10)
top_cast

Unnamed: 0,cast_name,revenue
47852,Stan Lee,17364063582
45548,Samuel L. Jackson,14806065788
16991,Frank Welker,11614837160
25610,John Ratzenberger,11038044745
20259,Hugo Weaving,10822190781
7838,Cate Blanchett,9726416776
20387,Ian McKellen,9710670395
23798,Jess Harnell,9633458775
37483,Morgan Freeman,9275477679
50724,Tom Cruise,8993387534


In [None]:
fig = px.bar(top_cast, x='cast_name', y='revenue',
             title = 'Highest-grossing actors (all roles)',
             width=800, height=500,
             color_discrete_sequence=px.colors.qualitative.Vivid,
             template='plotly_white',
             labels = dict(cast_name = '', revenue = 'Revenue(billion)'))
fig.show()

In [None]:
cast_film_count = revenue_cast['cast_name'].value_counts().reset_index().head(10)
fig = px.bar(cast_film_count , x='cast_name', y='count',
             title = 'TOP 10 actors with the most film appearances',
             width=800, height=500,
             color_discrete_sequence=px.colors.qualitative.Safe,
             template='plotly_white',
             labels = dict(cast_name = '', count = 'Film count'))
fig.show()

#### **4.6 Insights into High-Grossing Genres**

In [None]:
data['main_genre'].value_counts(normalize=True).reset_index().head(10)

Unnamed: 0,main_genre,proportion
0,Drama,0.25
1,Comedy,0.22
2,Action,0.16
3,Adventure,0.07
4,Horror,0.06
5,Crime,0.04
6,Thriller,0.04
7,Animation,0.03
8,Fantasy,0.02
9,Romance,0.02


The dataset comprises 20 genres, with Drama (25%), Comedy (22%), and Action (16%) being the most represented.

In [None]:
fig = px.box(data, x='revenue', title = 'Box plot to detect outliers in revenue column',
             width=700, height=500, hover_name = 'title',
             color_discrete_sequence=px.colors.qualitative.G10,
             template='plotly_white',
            )
fig.show()


In [None]:
# Plot the distribution of the raw revenue values (one observation per film)
fig = px.histogram(data, x='revenue', width=700, height=500,
                   title = 'Distribution of data in revenue column',
                   color_discrete_sequence=px.colors.qualitative.G10,
                   template='plotly_white',
                   labels = dict(revenue = 'Revenue', count = ''))
fig.show()

The revenue column contains 531 records with a value of 0, indicating a significant number of missing or invalid entries. Additionally, the data is not normally distributed. To address these issues, the interquartile range (IQR) method will be used to identify outliers and facilitate a more effective analysis of revenue across genres.
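
The IQR rule flags values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. A minimal sketch on made-up numbers follows; the `iqr_bounds` helper is hypothetical and not part of the notebook, and zeros are dropped first because they are treated here as missing.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy revenues (made-up numbers); the zero stands in for a missing entry
revenue = pd.Series([0, 10, 12, 14, 15, 18, 20, 22, 200])
nonzero = revenue[revenue > 0]

lower, upper = iqr_bounds(nonzero)
outliers = nonzero[(nonzero < lower) | (nonzero > upper)]
print(lower, upper, outliers.tolist())
```

Since revenue cannot be negative, a negative lower fence means that in practice only the upper fence identifies outliers.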

In [None]:
fig = px.box(data_frame = data, y = 'main_genre', x = 'revenue',
             color_discrete_sequence=px.colors.qualitative.Vivid,
             labels = dict(revenue = 'revenue', main_genre = ''),
             hover_name = 'title')
fig.show()

In [None]:
median_revenue = data.groupby('main_genre')[['revenue']].median().reset_index().sort_values(by='revenue', ascending = False)
fig = px.bar(median_revenue, x='main_genre', y='revenue',
             title = 'Median revenue by genre',
             width=800, height=500,
             color_discrete_sequence=px.colors.qualitative.Vivid,
             labels = dict(main_genre = '', revenue = 'Median revenue'))
fig.show()

In [None]:
#Drop rows with revenue equal to 0
genre_data = data[data['revenue']>0]

# Compute the 25th percentile value in 'revenue'
percentile25 = genre_data['revenue'].quantile(0.25)

# Compute the 75th percentile value in 'revenue'
percentile75 = genre_data['revenue'].quantile(0.75)

# Compute the interquartile range in 'revenue'
iqr = percentile75 - percentile25
print("IQR:", iqr)
# Define the upper and lower limits for non-outlier values in 'revenue'
upper_limit = percentile75 + 1.5*iqr
lower_limit = percentile25 -1.5*iqr
print("Lower limit:", lower_limit)
print("Upper limit:", upper_limit)

#Identify subset of data containing outliers in 'revenue'
outliers = genre_data[(genre_data['revenue']> upper_limit) | (genre_data['revenue'] < lower_limit)]

# Count how many rows in the data contain outliers in `revenue`
print("Number of rows in the data containing outliers in `revenue`:", len(outliers))
print("% of rows containing outliers:", round((len(outliers)/genre_data.shape[0])*100),'%')

IQR: 124795751.0
Lower limit: -171817675.0
Upper limit: 327365329.0
Number of rows in the data containing outliers in `revenue`: 299
% of rows containing outliers: 9 %


In [None]:
#The number of movies with high box office revenue outlier values within the genres.
outliers_genre_count = outliers['main_genre'].value_counts().reset_index().head(10)
fig = px.bar(outliers_genre_count, x='main_genre', y='count',
             title = 'Count of genres with significantly high box office revenues',
             color_discrete_sequence=px.colors.qualitative.Vivid,
             width=800, height=500,
             labels = dict(main_genre = '', count = 'Count of movies'))
fig.show()


The Drama and Action genres show a higher concentration of outliers at the upper end of the box office revenue range, reflecting some of the strongest individual performances. At the high end, the genres with the highest earnings are Adventure, Action, and Animation, in that order. Median box office revenue, however, is highest for Animation, followed by Adventure and Fantasy. Action ranks sixth by median revenue, and Drama does not make it into the top 10.

#### **4.7 Analysis of the Evolution of Popular Genres Over Time**

In [None]:
#Extract data starting from year 1990
after_1990 = data[data['year']>=1990]
#Sum up revenue by genre for every year starting from 1990
revenue_by_genre = after_1990.groupby(['main_genre', 'year'])[['revenue']].sum().reset_index()
#Create a barplot to detect the trend in the sum of revenue per genre over the years
fig = px.bar(revenue_by_genre, x='year', y = 'revenue',
             color = 'main_genre', width = 800, height = 500,
             title = 'Trends in box office revenue by genre (1990–2016)',
             color_discrete_sequence=px.colors.qualitative.Vivid,
             labels = dict(revenue='Revenue(billion)', year='')
)
fig.show()

The graph indicates that the **Action** genre experienced a steady increase in total box office revenue over the period from 2000 to 2015. In contrast, the **Adventure** genre began to rise after 2000 and exhibited a volatile revenue trend. The **Animation** genre showed a significant rise starting in 2009, followed by a period of sharp decline after 2013. Overall, box office revenue experienced a gradual decline across all genres by 2016. A deeper analysis would require additional data starting from 2016 to better understand this trend.

#### **4.8 Seasonal Trends: Are Certain Genres More Popular in Specific Months?**

In [None]:
revenue_genre_month = after_1990.groupby(['month', 'main_genre'])[['revenue']].sum().reset_index()
#create a barplot that visualizes the popularity of genres in a particular month
fig = px.bar(revenue_genre_month, x='month', y = 'revenue',
             color = 'main_genre', width = 800, height = 500,
             title = 'Box office revenue by genre and release month (1990–2016)',
             color_discrete_sequence=px.colors.qualitative.Vivid,
             labels = dict(revenue='Revenue(billion)', month='')
)
fig.show()

The analysis reveals that the **Action** genre achieved higher box office revenue when released during the spring and summer months, while the **Adventure** genre excelled in May and the late winter months. The **Animation** genre saw significant box office revenue with releases in June, whereas **Comedy** and **Drama** performed best with December releases.
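Summed revenue favours months with many releases, so a tall bar can reflect volume rather than per-film performance. The sketch below shows how release counts and medians could be reported alongside the totals. The figures are invented, and only the `month` and `revenue` column names are taken from `after_1990`.

```python
import pandas as pd

# Toy data with invented figures; only the column names mirror `after_1990`.
toy = pd.DataFrame({
    'month':   [6, 6, 6, 12],
    'revenue': [100, 200, 300, 500],
})

# Total revenue, median revenue, and number of releases per month
by_month = toy.groupby('month')['revenue'].agg(
    total='sum', median='median', films='count')
print(by_month)
```

In this toy example June has the larger total but December has the larger median, so the two measures can rank months differently.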

#### **4.9 Exploring the Relationship Between a Film's Budget, Revenue, Vote Count, and Ratings**

In [None]:
data.head(3)

Unnamed: 0,id,budget,genres,title,release_date,revenue,vote_average,vote_count,cast,crew,director,cast_name,main_genre,year,month
0,19995,237000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",Avatar,2009-12-10,2787965087,7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{'credit_id': '52fe48009251416c750aca23', 'de...",James Cameron,"[Sam Worthington, Zoe Saldana, Sigourney Weave...",Action,2009.0,12.0
1,285,300000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",Pirates of the Caribbean: At World's End,2007-05-19,961000000,6.9,4500,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{'credit_id': '52fe4232c3a36847f800b579', 'de...",Gore Verbinski,"[Johnny Depp, Orlando Bloom, Keira Knightley, ...",Adventure,2007.0,5.0
2,206647,245000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",Spectre,2015-10-26,880674609,6.3,4466,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{'credit_id': '54805967c3a36829b5002c41', 'de...",Sam Mendes,"[Daniel Craig, Christoph Waltz, Léa Seydoux, R...",Action,2015.0,10.0


In [None]:
#Calculate correlation coefficients between budget, revenue, vote count and ratings
cor = data[['budget', 'revenue', 'vote_count', 'vote_average' ]].corr()
fig = px.imshow(cor, text_auto = '.2f', color_continuous_scale = 'Sunsetdark',
                width = 700, height = 700)
fig.show()

In [None]:
# Examine in detail the relationship between revenue and the other variables
for x in ['budget', 'vote_count', 'vote_average']:
    fig = px.scatter(data_frame = data, x = x, y = 'revenue', hover_name = 'title',
                     size = 'revenue', color = 'revenue',
                     color_continuous_scale = px.colors.sequential.Sunsetdark,
                     width = 700, height = 600, trendline = 'ols')
    fig.show()

The analyses of the **Top 10 Box Office Revenue** (Section 4.2) and the **Top 10 Movies by Budget and Number of Votes** (Section 4.3) showed an intriguing pattern. Contrary to the general assumption that movies with the highest revenues would also receive the most votes, only *Avatar* and *The Avengers* from the Top 10 Box Office Revenue list were among the **Top 10 Most Voted Movies**. Additionally, none of the movies with the **Top 10 Highest Budgets** were included in the Top 10 Most Voted Movies.

A similar pattern emerged when comparing movies with the highest revenue and budgets. Only *Avengers: Age of Ultron* and *Captain America: Civil War*, both of which ranked in the **Top 10 by Budget**, were also featured in the **Top 10 Box Office Revenue** list. None of the other highest-budget movies appeared among the Top 10 Revenue-Generating Films.

These findings motivated a closer look at the relationships between revenue, budget, number of votes, and ratings. The correlation matrix revealed the following key insights:
 - A strong positive correlation between revenue and number of votes (r = 0.78).
 - A strong positive correlation between revenue and budget (r = 0.73).
 - A moderate correlation between budget and number of votes (r = 0.59).

These results are consistent with the general assumption that higher film budgets are associated with increased revenue and a larger number of votes, and that films with high box office revenue tend to collect more votes.

However, the exceptions observed among the **Top 10 Box Office Revenue** films do not align with these trends and require further investigation to identify the underlying factors. To address this, a separate correlation analysis restricted to the Top 100 Box Office Revenue films is conducted below.


In [None]:
#Calculate correlation coefficients between budget, revenue, vote count and ratings of TOP 100 Box office revenue films
fig = px.imshow(data[['budget', 'revenue', 'vote_count', 'vote_average' ]].sort_values(by= 'revenue', ascending = False).head(100).corr(),
                text_auto = '.2f', color_continuous_scale = 'YlOrBr',
                width = 700, height = 700)
fig.show()

The pattern seen in the charts of the **Top 10 Highest-Grossing Movies** (Section 4.2) and the **Top 10 Movies by Budget and Number of Votes** (Section 4.3) contrasts with the correlation analysis of all films in the dataset. For the **Top 100 Movies by Revenue**, the correlations are much weaker: revenue is only weakly correlated with budget (r = 0.31) and with number of votes (r = 0.38), while the correlation between budget and number of votes is negligible (r = 0.13). These findings suggest that the dynamics influencing top-performing movies may differ significantly from those observed across the broader dataset.
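One statistical effect that may contribute to the weaker Top 100 correlations is range restriction: selecting rows by extreme values of one variable narrows that variable's spread and tends to lower its correlation with other variables. The sketch below illustrates the effect on simulated data only. The sample size, the seed, the column names, and the target correlation of 0.73 (borrowed from the full-dataset matrix) are all assumptions, and nothing here was run on the film data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
n, rho = 5000, 0.73              # assumed sample size and population correlation

# Simulate two standard-normal variables with correlation close to rho
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
sim = pd.DataFrame({'budget_like': x, 'revenue_like': y})

# Correlation over all rows vs. over the 100 rows with the largest revenue_like
full_r = sim['budget_like'].corr(sim['revenue_like'])
top_100 = sim.nlargest(100, 'revenue_like')
top_r = top_100['budget_like'].corr(top_100['revenue_like'])

print(f'all rows: r = {full_r:.2f}, top 100: r = {top_r:.2f}')
```

Under these assumptions the top-100 correlation typically comes out well below the full-sample value, so a drop of this kind does not by itself establish that top-grossing films follow different dynamics.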

#### **4.10 The key characteristics of movies that achieve high returns on investment (ROI)**

In [None]:
# Calculate the number of rows with 0 value in a 'budget' column
data[data['budget']==0].shape[0]

1001

In [None]:
#Filter rows for budget not equal to 0
data_roi = data[data['budget']>0].copy()
#Calculate roi and add 'roi' column
data_roi.loc[:, 'roi'] = (data_roi['revenue']/data_roi['budget']).round(2)
data_roi.head(3)

Unnamed: 0,id,budget,genres,title,release_date,revenue,vote_average,vote_count,cast,crew,director,cast_name,main_genre,year,month,roi
0,19995,237000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",Avatar,2009-12-10,2787965087,7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{'credit_id': '52fe48009251416c750aca23', 'de...",James Cameron,"[Sam Worthington, Zoe Saldana, Sigourney Weave...",Action,2009.0,12.0,11.76
1,285,300000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",Pirates of the Caribbean: At World's End,2007-05-19,961000000,6.9,4500,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{'credit_id': '52fe4232c3a36847f800b579', 'de...",Gore Verbinski,"[Johnny Depp, Orlando Bloom, Keira Knightley, ...",Adventure,2007.0,5.0,3.2
2,206647,245000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",Spectre,2015-10-26,880674609,6.3,4466,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{'credit_id': '54805967c3a36829b5002c41', 'de...",Sam Mendes,"[Daniel Craig, Christoph Waltz, Léa Seydoux, R...",Action,2015.0,10.0,3.59
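
Note that `roi` as computed here is a revenue-to-budget multiple (1.0 means the film earned back exactly its budget), whereas ROI is often defined as net profit divided by cost, which is lower by exactly 1. A minimal sketch with the three rows shown above follows; the `net_roi` column is an illustrative addition and not part of the original analysis.

```python
import pandas as pd

# Budget and revenue of the three films displayed in the output above
films = pd.DataFrame({
    'title':   ['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre'],
    'budget':  [237000000, 300000000, 245000000],
    'revenue': [2787965087, 961000000, 880674609],
})

# Revenue multiple, as used in this notebook
films['roi'] = (films['revenue'] / films['budget']).round(2)
# Conventional net ROI: (revenue - budget) / budget (illustrative column)
films['net_roi'] = ((films['revenue'] - films['budget']) / films['budget']).round(2)
print(films[['title', 'roi', 'net_roi']])
```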


In [None]:
# Set pandas to display floats in fixed-point notation
pd.options.display.float_format = "{:.2f}".format

# Describe the 'roi' column
data_roi['roi'].describe()

Unnamed: 0,roi
count,3758.0
mean,2538.88
std,139608.55
min,0.0
25%,0.49
50%,1.87
75%,3.95
max,8500000.0


In [None]:
#Find outliers
data_roi[data_roi['roi']>1000]

Unnamed: 0,id,budget,genres,title,release_date,revenue,vote_average,vote_count,cast,crew,director,cast_name,main_genre,year,month,roi
3137,78383,10,"[{'id': 27, 'name': 'Horror'}, {'id': 53, 'nam...",Nurse 3-D,2013-09-28,10000000,4.9,119,"[{""cast_id"": 5, ""character"": ""Abby Russell"", ""...","[{'credit_id': '52fe499cc3a368484e1346b1', 'de...",Douglas Aarniokoski,"[Paz de la Huerta, Katrina Bowden, Kathleen Tu...",Horror,2013.0,9.0,1000000.0
4238,3082,1,"[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name...",Modern Times,1936-02-05,8500000,8.1,856,"[{""cast_id"": 8, ""character"": ""A factory worker...","[{'credit_id': '5621aeadc3a3680e1d00a09a', 'de...",Charlie Chaplin,"[Charlie Chaplin, Paulette Goddard, Henry Berg...",Drama,1936.0,2.0,8500000.0
4496,2667,60000,"[{'id': 27, 'name': 'Horror'}, {'id': 9648, 'n...",The Blair Witch Project,1999-07-14,248000000,6.3,1055,"[{""cast_id"": 41, ""character"": ""Mike"", ""credit_...","[{'credit_id': '52fe4364c3a36847f8050c01', 'de...",Daniel Myrick,"[Michael C. Williams, Heather Donahue, Joshua ...",Horror,1999.0,7.0,4133.33
4577,23827,15000,"[{'id': 27, 'name': 'Horror'}, {'id': 9648, 'n...",Paranormal Activity,2007-09-14,193355800,5.9,1316,"[{""cast_id"": 3, ""character"": ""Katie"", ""credit_...","[{'credit_id': '52fe4477c3a368484e024b01', 'de...",Oren Peli,"[Katie Featherston, Micah Sloat, Mark Fredrich...",Horror,2007.0,9.0,12890.39
4582,1435,218,"[{'id': 99, 'name': 'Documentary'}, {'id': 18,...",Tarnation,2003-10-19,1162014,7.5,22,"[{""cast_id"": 2, ""character"": ""Herself"", ""credi...","[{'credit_id': '52fe42f7c3a36847f8030443', 'de...",Jonathan Caouette,"[Renee Leblanc, Adolph Davis, Jonathan Caouett...",Documentary,2003.0,10.0,5330.34


The budget values recorded for *Nurse 3-D* (10 USD) and *Modern Times* (1 USD) appear to be corrupted, so the ROI computed for these films is unreliable. In contrast, *The Blair Witch Project* (1999), *Paranormal Activity* (2007), and *Tarnation* (2003) achieved exceptional box office revenue relative to their production budgets, with a return on investment of more than 1000x. These outliers show that some low-budget films can generate significant financial success, and the factors contributing to their profitability merit closer examination.
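
A simple way to surface such rows for manual review is to flag budgets below a plausibility threshold. The sketch below uses the five outlier rows listed above. The 100 USD cutoff and the `budget_suspect` column name are assumptions made for illustration, and a flag is only a prompt to check the source, not proof of an error.

```python
import pandas as pd

# Titles, budgets, and revenues copied from the outlier table above
outliers = pd.DataFrame({
    'title':   ['Nurse 3-D', 'Modern Times', 'The Blair Witch Project',
                'Paranormal Activity', 'Tarnation'],
    'budget':  [10, 1, 60000, 15000, 218],
    'revenue': [10000000, 8500000, 248000000, 193355800, 1162014],
})

MIN_PLAUSIBLE_BUDGET = 100  # hypothetical cutoff in USD

# Mark rows whose budget falls below the cutoff
outliers['budget_suspect'] = outliers['budget'] < MIN_PLAUSIBLE_BUDGET
print(outliers[['title', 'budget', 'budget_suspect']])
```

With this cutoff only the two rows with budgets of 10 and 1 are flagged; a different cutoff would change which rows are selected.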

In [None]:
#Drop rows with outlier ROI values (1000x or more)
data_roi = data_roi[data_roi['roi'] < 1000]
data_roi['roi'].value_counts().reset_index().sort_values('count', ascending=False).head(20)

Unnamed: 0,roi,count
0,0.0,548
1,0.62,19
2,0.01,17
3,1.1,15
4,0.02,14
5,0.26,13
6,0.28,13
7,0.52,13
12,0.49,12
15,0.86,12


In [None]:
roi_by_year = data_roi.groupby('year')[['roi']].mean().reset_index()
fig = px.line(roi_by_year, x='year', y = 'roi',
              title = 'Trends in ROI Over the Years',
              width = 800, height = 600,
              color_discrete_sequence=px.colors.qualitative.Vivid,
              template = 'plotly_white',
              labels = dict(year='', roi = 'Average ROI')
              )
fig.show()

In [None]:
roi_after_1990 = data_roi[data_roi['year']>=1990]
roi_after_1990['roi'].describe()

Unnamed: 0,roi
count,3306.0
mean,3.54
std,14.57
min,0.0
25%,0.44
50%,1.71
75%,3.49
max,439.62


In [None]:
data_roi['roi'].describe()

Unnamed: 0,roi
count,3753.0
mean,5.0
std,23.44
min,0.0
25%,0.49
50%,1.87
75%,3.94
max,700.0


In [None]:
fig = px.box(data_roi, x='roi',
            template = 'plotly_white',)
fig.show()

* **75%** of films have an ROI of **4x** or lower, with a median ROI of **1.9x**. The upper limit of the box plot (the end of the upper whisker) is about **9x**.

* Unlike the sharp increase observed in box office revenue and film budgets after 1990, the average ROI was volatile until 1980 before stabilizing. Post-1990, the average ROI was **3.5x**, while the median ROI was **1.7x**.
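
The upper limit of about 9x is consistent with the usual box plot convention in which the upper fence is Q3 + 1.5 × IQR. A short check using the quartiles reported by `describe()` for `data_roi['roi']` follows; the `upper_fence` helper and the toy series are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def upper_fence(values: pd.Series) -> float:
    """Return Q3 + 1.5 * IQR, the conventional box plot upper fence."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    return q3 + 1.5 * (q3 - q1)

# Quartiles of data_roi['roi'] as reported by describe()
q1, q3 = 0.49, 3.94
fence_from_quartiles = q3 + 1.5 * (q3 - q1)
print(round(fence_from_quartiles, 3))

# The same rule applied to a toy series with one extreme value
toy = pd.Series([0, 1, 2, 3, 4, 100])
toy_fence = upper_fence(toy)
print(toy_fence, toy[toy > toy_fence].tolist())
```

Plotly's box trace draws the whisker out to the most extreme data point that still lies within this fence, so the value shown in the chart can be slightly below the computed fence.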

In [None]:
roi_after_1990_median = roi_after_1990.groupby('main_genre')[['roi']].median().reset_index().sort_values('roi', ascending=False)

fig = px.bar(roi_after_1990_median,
              x= 'roi',
              y='main_genre',
              title='Median ROI by genre',
              width=800,
              height=600,
              color_discrete_sequence=px.colors.qualitative.Pastel,
              template='plotly_white',
              color='main_genre',
              hover_name = 'main_genre',
              labels = dict(roi = 'Median ROI', main_genre = '')
              )

# Show the figure
fig.show()

In [None]:
roi_after_1990_upper_limit_10 = roi_after_1990[roi_after_1990['roi'] <=10]

for y in ['revenue', 'budget']:
   fig = px.scatter(data_frame = roi_after_1990_upper_limit_10, x = 'roi', y = y, hover_name = 'title', size = y,
                    color = 'main_genre', color_discrete_sequence=px.colors.qualitative.Vivid,
                    template='plotly_white', width = 700, height = 600,
                    )
   fig.show()

The analysis reveals that **75%** of films have an ROI of **4x** or lower, with a median ROI of **1.9x** and an upper limit of **9x**. Unlike the sharp increase observed in box office revenue and film budgets after 1990, the average ROI was volatile until 1980 before stabilizing. After 1990, the average ROI leveled off at **3.5x**, while the median ROI was **1.7x**. The three film genres with the highest median ROI were *Horror* (2.57x), *Historical* (2.56x), and *Science Fiction* (2.55x).

## **5. Summary**


The analysis reveals several insightful trends in the film industry based on the dataset.
* **Film Budget and Box Office Revenue Trends:** The budget invested in film production and box office revenue both grew significantly starting in the 1990s, with revenue rising markedly more sharply than budget. Half of the top 10 highest-grossing films were Action films, and all of the top 10 were produced after 2000 except Titanic (Drama), which was produced in 1997.

* **Top Directors and Actors:** The top three film directors in terms of box office revenue were Steven Spielberg, Peter Jackson, and James Cameron. Steven Spielberg's films generated over 9 billion USD in total, Peter Jackson's films earned over 6 billion USD, and James Cameron's films grossed approximately 5.9 billion USD. Spielberg directed 27 films, Jackson directed 9, and Cameron directed 7. Notably, the two highest-grossing films, Avatar (2009) and Titanic (1997), were both directed by James Cameron. Among the top 10 actors with the most film appearances (all roles), only Samuel L. Jackson and Morgan Freeman also appeared in the list of the top 10 highest-grossing actors. This discrepancy indicates that a large number of film appearances does not necessarily translate into high box office revenue.

* **Genre Distribution and Box Office Trends:** The dataset comprises 20 genres, with Drama (25%), Comedy (22%), and Action (16%) being the most represented. Drama and Action show a higher concentration of outliers at the upper end of the box office revenue spectrum, indicating some of the highest individual performances. At the high end, Adventure, Action, and Animation were the top-earning genres. However, the median box office revenue was highest in Animation, Adventure, and Fantasy, with Action ranking sixth. The analysis of the evolution of popular genres over time shows a steady increase in total box office revenue for the Action genre from 2000 to 2015, while the Adventure genre rose after 2000 with a volatile trend. The Animation genre saw a significant rise starting in 2009, followed by a sharp decline after 2013. Overall, box office revenue declined gradually across all genres by 2016, and understanding this trend would require additional data from 2016 onward.

* **Seasonal Trends:** The analysis reveals that the Action genre achieved higher box office revenue when released during the spring and summer months, while the Adventure genre excelled in May and the late winter months. The Animation genre saw significant box office revenue with releases in June, whereas Comedy and Drama performed best with December releases.

* **Correlation Analysis:** The Top 10 Highest-Grossing Movies (Section 4.2) and the Top 10 Movies by Budget and Number of Votes (Section 4.3) showed a different pattern from the correlation analysis of all films in the dataset. For the Top 100 Movies by Revenue, revenue was only weakly correlated with budget (r = 0.31) and with number of votes (r = 0.38), while the correlation between budget and number of votes was negligible (r = 0.13). These findings suggest that the dynamics influencing top-performing movies may differ significantly from those observed across the broader dataset.

* **ROI Analysis:** In terms of return on investment (ROI), 75% of films had an ROI below 4x, with a median ROI of 1.9x and an upper limit of 9x. The average ROI was volatile until 1980 before stabilizing. Post-1990, the average ROI was 3.5x, with a median ROI of 1.7x. The top three film genres with relatively higher median ROI were Horror (2.57x), Historical (2.56x), and Science Fiction (2.55x). The analysis identified films like The Blair Witch Project (1999), Paranormal Activity (2007), and Tarnation (2003) as outliers with exceptional box office revenue relative to their production budgets, demonstrating an extraordinary ROI of more than 1000x. This highlights how some low-budget films can achieve significant financial success and underscores the importance of understanding the factors contributing to their profitability.
