# Analysis of Movie Data for Business Insights

## Overview


This report analyzes data from Box Office Mojo and IMDB to identify key factors that influence box office success. The primary goal is to help a new movie studio make data-driven decisions to optimize revenue by examining genre performance, the impact of director loyalty, and the power of franchises. The findings from this analysis will provide strategic insights for movie selection, talent retention, and franchise development to maximize box office gross.

## Business Problem

A new movie studio is looking to produce films that will generate significant box office revenue. To guide their business decisions, the studio needs insights into how various factors like genre, director-studio loyalty, and franchise involvement affect total gross revenue. The analysis addresses three main questions:

1. Which genres should the studio focus on to meet or exceed a target annual revenue of $2,128,500,000?
2. How does director loyalty or studio collaboration affect box office success?
3. What impact does franchise power have on revenue?

By answering these questions, the studio will have the tools to focus on the right types of films and build lasting director-studio relationships for successful franchises.

## Data Understanding

The analysis combines data from two main sources:

- Box Office Mojo (BOM): This dataset contains information on 3,387 movies released between 2010 and 2018, including domestic and foreign box office gross.

- IMDB Database: This database includes detailed information on 146,144 movies, such as cast, crew, genres, and ratings. The key variables used in this analysis are domestic gross, foreign gross, studio, director, and whether the movie belongs to a franchise.

These datasets were merged and cleaned to create a comprehensive view of movie performance and underlying factors.

## 1.1 Load the Datasets

In [1]:
# Imports
import sqlite3
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

### Box Office Data

In [2]:
box_office_data = pd.read_csv('../data/bom.movie_gross.csv.gz')

In [3]:
box_office_data.head()

| | title | studio | domestic_gross | foreign_gross | year |
|---|---|---|---|---|---|
| 0 | Toy Story 3 | BV | 415000000.0 | 652000000 | 2010 |
| 1 | Alice in Wonderland (2010) | BV | 334200000.0 | 691300000 | 2010 |
| 2 | Harry Potter and the Deathly Hallows Part 1 | WB | 296000000.0 | 664300000 | 2010 |
| 3 | Inception | WB | 292600000.0 | 535700000 | 2010 |
| 4 | Shrek Forever After | P/DW | 238700000.0 | 513900000 | 2010 |


In [4]:
box_office_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [5]:
box_office_data["year"].value_counts()

year
2015    450
2016    436
2012    400
2011    399
2014    395
2013    350
2010    328
2017    321
2018    308
Name: count, dtype: int64

In [6]:
box_office_data.describe()

| | domestic_gross | year |
|---|---|---|
| count | 3359.0 | 3387.0 |
| mean | 28745850.0 | 2013.958075 |
| std | 66982500.0 | 2.478141 |
| min | 100.0 | 2010.0 |
| 25% | 120000.0 | 2012.0 |
| 50% | 1400000.0 | 2014.0 |
| 75% | 27900000.0 | 2016.0 |
| max | 936700000.0 | 2018.0 |


### IMDB Data

In [7]:
# Connect to the database
conn = sqlite3.connect('../data/im.db')

In [8]:
# View all data from sqlite_master, such as table names
query = "SELECT * FROM sqlite_master"

In [9]:
# Load the data into a pandas DataFrame
imdb_data = pd.read_sql(query, conn)
imdb_data

| | type | name | tbl_name | rootpage | sql |
|---|---|---|---|---|---|
| 0 | table | movie_basics | movie_basics | 2 | `CREATE TABLE "movie_basics" (\n"movie_id" TEXT...` |
| 1 | table | directors | directors | 3 | `CREATE TABLE "directors" (\n"movie_id" TEXT,\n...` |
| 2 | table | known_for | known_for | 4 | `CREATE TABLE "known_for" (\n"person_id" TEXT,\...` |
| 3 | table | movie_akas | movie_akas | 5 | `CREATE TABLE "movie_akas" (\n"movie_id" TEXT,\...` |
| 4 | table | movie_ratings | movie_ratings | 6 | `CREATE TABLE "movie_ratings" (\n"movie_id" TEX...` |
| 5 | table | persons | persons | 7 | `CREATE TABLE "persons" (\n"person_id" TEXT,\n ...` |
| 6 | table | principals | principals | 8 | `CREATE TABLE "principals" (\n"movie_id" TEXT,\...` |
| 7 | table | writers | writers | 9 | `CREATE TABLE "writers" (\n"movie_id" TEXT,\n ...` |
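
As a small, self-contained illustration of reading `sqlite_master`, the sketch below builds a throwaway in-memory database and lists only the table names. The two table definitions are placeholders for illustration, not the real `im.db` schema.

```python
import sqlite3

import pandas as pd

# Throwaway in-memory database; the table definitions are illustrative only
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
demo_conn.execute("CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL)")

# sqlite_master holds one row per schema object; keep only the tables
table_names = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name",
    demo_conn,
)["name"].tolist()
print(table_names)
demo_conn.close()
```

Filtering on `type = 'table'` keeps indexes and views out of the listing, which can make a larger schema easier to scan.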


## 1.2 Understand the structure:

### Box Office Data Key Features

- title (movie title)
- studio (movie production studio)
- domestic_gross (revenue from domestic box office)
- foreign_gross (revenue from foreign box office)
- year (release year)

Key Data Insights:

- Some missing values in studio and domestic_gross.
- Significant missing values in foreign_gross.
- Data types are generally correct except for foreign_gross, which is stored as object but should likely be float64 to handle numeric operations.

### IMDB Features
- movie_basics: Contains key information about movies such as movie_id, primary_title, genres, runtime_minutes, start_year, etc.
- directors: Links directors to movies via movie_id.
- known_for: Associates people (person_id) with movies.
- movie_akas: Lists alternative titles for movies, keyed by movie_id.
- movie_ratings: Contains information about movie ratings (averagerating, numvotes).
- persons: Holds person-specific details such as name, birth_year, death_year, etc.
- principals: Contains cast and crew information for each movie.
- writers: Links writers to movies via movie_id.

### Data Distribution of Key Categorical Variables (Box Office Data)

In [10]:
# Distribution of studios
print(box_office_data['studio'].value_counts())

# Distribution of years
print(box_office_data['year'].value_counts())

studio
IFC           166
Uni.          147
WB            140
Fox           136
Magn.         136
             ... 
E1              1
PI              1
ELS             1
PalT            1
Synergetic      1
Name: count, Length: 257, dtype: int64
year
2015    450
2016    436
2012    400
2011    399
2014    395
2013    350
2010    328
2017    321
2018    308
Name: count, dtype: int64


This shows which studios and years are most represented in the dataset, which helps when analyzing trends over time or by studio.

### Distribution of Key Categorical Variables (IMDB Database):

In [11]:
# Distribution of genres in movie_basics
query1 = "SELECT genres FROM movie_basics"
movie_genres = pd.read_sql(query1, conn)
print(movie_genres['genres'].value_counts())

# Distribution of directors
query2 = "SELECT primary_name FROM persons JOIN directors USING(person_id)"
director_names = pd.read_sql(query2, conn)
print(director_names['primary_name'].value_counts())

genres
Documentary                   32185
Drama                         21486
Comedy                         9177
Horror                         4372
Comedy,Drama                   3519
                              ...  
Adventure,Music,Mystery           1
Documentary,Horror,Romance        1
Sport,Thriller                    1
Comedy,Sport,Western              1
Adventure,History,War             1
Name: count, Length: 1085, dtype: int64
primary_name
Tony Newton          238
Jason Impey          190
Shane Ryan           186
Ruben Rodriguez      181
Sam Mason-Bell       144
                    ... 
Muta'Ali Muhammad      1
Maureen Maundu         1
Michael A. Ybarra      1
Bulmaro Osornio        1
Luis E. Froiz          1
Name: count, Length: 106757, dtype: int64


This gives an idea of which genres are most common and which directors have worked on the most films.

## 1.3 Identify Relationships Between Datasets
Merging Box Office Mojo with IMDB Database:
 - Possible keys: Merge the Box Office Mojo dataset and the IMDB data using the title field from Box Office Mojo and the primary_title field in the movie_basics table. The Box Office Mojo data has no movie_id column (see the info() output above), so a title match is the only available key.
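
Title matching is sensitive to formatting differences. Some Box Office Mojo titles carry a year suffix such as "Alice in Wonderland (2010)", which would not match a bare title. The following is a minimal sketch, on made-up rows, of one way to normalize titles and check the match rate. The `clean_title` column, the `tt1`/`tt2` ids, and the regular expression are assumptions for illustration, and other suffixes such as "(2018 re-release)" would need their own rule.

```python
import pandas as pd

# Toy stand-ins for the two sources (ids are made up for illustration)
bom_demo = pd.DataFrame({
    "title": ["Alice in Wonderland (2010)", "Inception", "El Pacto"],
    "year": [2010, 2010, 2018],
})
imdb_demo = pd.DataFrame({
    "movie_id": ["tt1", "tt2"],
    "primary_title": ["Alice in Wonderland", "Inception"],
})

# Drop a trailing "(YYYY)" suffix so titles line up across sources
bom_demo["clean_title"] = (
    bom_demo["title"].str.replace(r"\s*\(\d{4}\)$", "", regex=True).str.strip()
)

# indicator=True adds a _merge column that shows which rows found a match
matched = bom_demo.merge(
    imdb_demo, left_on="clean_title", right_on="primary_title",
    how="left", indicator=True,
)
match_counts = matched["_merge"].value_counts().to_dict()
print(match_counts)
```

Using a left join with `indicator=True` keeps unmatched box office rows visible instead of silently dropping them, as an inner join would.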

Merging Tables within the IMDB Database:

- The tables within the IMDB database can be merged using movie_id to combine relevant information

  - movie_basics with movie_ratings to get both the movie details and their ratings.
  - movie_basics with directors, writers, or principals to understand crew members associated with each movie.
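
A join on `movie_id` can be sketched with a small in-memory database. The rows below are invented, and the table layouts are simplified assumptions that only mirror the column names used in this notebook.

```python
import sqlite3

import pandas as pd

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT, genres TEXT);
CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER);
INSERT INTO movie_basics VALUES ('m1', 'Film A', 'Action,Drama'), ('m2', 'Film B', 'Comedy');
INSERT INTO movie_ratings VALUES ('m1', 7.5, 1200);
""")

# LEFT JOIN keeps movies that have no rating row (their rating columns are NULL)
basics_with_ratings = pd.read_sql(
    """
    SELECT b.movie_id, b.primary_title, b.genres, r.averagerating, r.numvotes
    FROM movie_basics AS b
    LEFT JOIN movie_ratings AS r ON b.movie_id = r.movie_id
    ORDER BY b.movie_id
    """,
    demo_conn,
)
print(basics_with_ratings)
demo_conn.close()
```

Whether to use a left or an inner join depends on whether unrated movies should stay in the analysis. An inner join would drop `m2` here.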

## Data Preparation

## 2.1 Filter for relevant features:
From Box Office Mojo Data:
- title: This is necessary to match movies with other datasets.
- studio: Used to analyze which studios are producing successful movies.
- domestic_gross: This is the revenue generated domestically, which is a key measure of box office performance.
- foreign_gross: Revenue generated in foreign markets, another important aspect of box office success.
- year: Useful to analyze trends over time.

In [12]:
# Select relevant columns from Box Office Mojo data
box_office_filtered = box_office_data[['title', 'studio', 'domestic_gross', 'foreign_gross', 'year']]
box_office_filtered

| | title | studio | domestic_gross | foreign_gross | year |
|---|---|---|---|---|---|
| 0 | Toy Story 3 | BV | 415000000.0 | 652000000 | 2010 |
| 1 | Alice in Wonderland (2010) | BV | 334200000.0 | 691300000 | 2010 |
| 2 | Harry Potter and the Deathly Hallows Part 1 | WB | 296000000.0 | 664300000 | 2010 |
| 3 | Inception | WB | 292600000.0 | 535700000 | 2010 |
| 4 | Shrek Forever After | P/DW | 238700000.0 | 513900000 | 2010 |
| ... | ... | ... | ... | ... | ... |
| 3382 | The Quake | Magn. | 6200.0 | | 2018 |
| 3383 | Edward II (2018 re-release) | FM | 4800.0 | | 2018 |
| 3384 | El Pacto | Sony | 2500.0 | | 2018 |
| 3385 | The Swan | Synergetic | 2400.0 | | 2018 |


From IMDB SQL Database:
- movie_basics (from movie_basics table):
  - genres: Key feature to analyze which genres are most successful.
  - runtime_minutes: This could be useful for analyzing whether longer or shorter movies perform better.
  - primary_title: To match with the title column of the Box Office data.

- movie_ratings (from movie_ratings table):
  - averagerating: This is the IMDB rating, useful to analyze the relationship between ratings and success.
  - numvotes: Number of votes can help measure how popular or widely seen the movie is.
- directors (optional for analyzing the influence of directors):
  - person_id: This can be linked to the persons table to get the director's name and further explore the influence of certain directors.

In [13]:
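# Load movie_basics and create individual_genre (one row per movie and genre),
# then keep the columns needed for the analysis
movie_basics_df = pd.read_sql("SELECT * FROM movie_basics", conn)
movie_basics_df["individual_genre"] = movie_basics_df["genres"].str.split(",")
movie_basics_df = movie_basics_df.explode("individual_genre")

movie_basics_filtered = movie_basics_df[["movie_id", "primary_title", "genres", "individual_genre", "runtime_minutes"]]
movie_basics_filtered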

In [None]:
query_movie_ratings = """
SELECT movie_id, averagerating, numvotes
FROM movie_ratings
"""
movie_ratings_filtered = pd.read_sql(query_movie_ratings, conn)

movie_ratings_filtered

Join the Box Office and IMDB Data:

Join on title (or movie_id if possible) to merge the Box Office data with the relevant IMDB data.

In [None]:
movie_basics_filtered.head()

In [None]:
box_office_filtered.head()

In [None]:
# Merge Box Office and IMDB data
merged_data = movie_basics_filtered
merged_data = merged_data.merge(box_office_filtered, left_on="primary_title", right_on="title", suffixes=("_movie", "_bo"))

# Adding rating data to merged_data df
merged_data = merged_data.merge(movie_ratings_filtered, on="movie_id", suffixes=("_movie", "_rating"))

# adding director details to merged_data
# (movie_director_details is created in section 2.4, so that cell has to run before this one)
merged_data = merged_data.merge(movie_director_details, on="movie_id", suffixes=("_movie", "_dir"))

# there are 2999 unique movies in merged_data
print(merged_data["movie_id"].nunique())

## 2.2 Handle Incorrect Data Types
- Convert columns to their correct data types. 
- Convert the foreign_gross column to numeric values since it's currently stored as an object.

In [None]:
# Convert 'foreign_gross' to numeric, coerce errors
box_office_data['foreign_gross'] = pd.to_numeric(box_office_data['foreign_gross'], errors='coerce')
box_office_data['foreign_gross']
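
To see what `errors='coerce'` does, here is a small sketch on made-up values. Whether the real column contains comma-formatted strings is an assumption worth checking before relying on the conversion, because any such value would silently become NaN.

```python
import pandas as pd

# Made-up values: plain digits, a thousands separator, and a missing entry
raw = pd.Series(["652000000", "1,131.6", None])

# errors="coerce" turns anything unparseable into NaN instead of raising
coerced = pd.to_numeric(raw, errors="coerce")

# Removing the separators first lets the comma-formatted value parse as well
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print(coerced.isna().sum(), cleaned.isna().sum())
```

Comparing the NaN count before and after conversion is a quick way to spot values that were lost to coercion rather than missing in the source.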

## 2.3 Handle Missing Values
As mentioned earlier, there are missing values in the studio, domestic_gross and in foreign_gross columns in the box office data.

- Use isnull() and sum() to identify columns with missing data.

Depending on the context:
- Drop rows or columns with a large amount of missing data using dropna().
- Impute missing values with appropriate statistics (mean, median, mode) using fillna().
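
The options above can be sketched on a toy frame. The rows are invented, and the column names simply mirror the Box Office Mojo columns.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "studio": ["BV", None, "WB"],
    "domestic_gross": [415000000.0, 6200.0, np.nan],
    "foreign_gross": [652000000.0, np.nan, 535700000.0],
})

# Count missing values per column
missing_before = demo.isnull().sum()

# Assign results back rather than using inplace=True on a column selection
demo["studio"] = demo["studio"].fillna("Unknown")
demo["foreign_gross"] = demo["foreign_gross"].fillna(0)
demo = demo.dropna(subset=["domestic_gross"])

print(missing_before.to_dict())
print(demo.isnull().sum().to_dict())
```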

1. studio (5 missing values):
- Fill in the missing values with "Unknown" rather than dropping them. These movies still have important data like gross earnings, which is crucial for analysis.
  - The studio is not the primary focus of our analysis, and removing these rows could unnecessarily reduce the size of our dataset.

In [None]:
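# Assign the result back; inplace=True on a selected column is deprecated in recent pandas
box_office_data['studio'] = box_office_data['studio'].fillna('Unknown')
box_office_data['studio']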

2. domestic_gross (28 missing values):
- Drop rows where domestic_gross is missing.
  - Missing domestic gross values make it impossible to assess a movie’s financial performance, which is essential for our analysis. Imputing a value here (e.g., with a mean or median) could distort our analysis.

In [None]:
box_office_data = box_office_data.dropna(subset=['domestic_gross'])
box_office_data['domestic_gross']

3. foreign_gross (1350 missing values):
- Fill in missing foreign_gross values with 0.
  - A missing foreign gross most likely means that the movie was not released internationally or that the figure was not reported. Setting the value to 0 lets us still compute a total gross for these movies, with the caveat that their totals may be understated.

In [None]:
# check to see if there are any null values in foreign_gross
box_office_data['foreign_gross'].isna().sum()

In [None]:
# replace these null values with 0 (assign back instead of using inplace=True)
box_office_data["foreign_gross"] = box_office_data["foreign_gross"].fillna(0)

box_office_data['foreign_gross'].isna().sum()

## 2.4 Handling Director Information
- Instead of relying on person_id, we can join the directors table with the persons table to get director names for easier interpretation.
- This will be helpful when identifying relationships between directors and movie success.

In [None]:
# get director name of each movie
query3 = """
SELECT DISTINCT movie_id, person_id AS director_id, primary_name AS director_name
FROM persons JOIN directors USING(person_id)
"""
movie_director_details = pd.read_sql(query3, conn)
movie_director_details
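
The effect of `DISTINCT` in this join can be seen on a toy database. The tables and rows below are invented for illustration, with a deliberately repeated director row. Whether the real `directors` table contains such repeats is an assumption.

```python
import sqlite3

import pandas as pd

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE persons (person_id TEXT, primary_name TEXT);
CREATE TABLE directors (movie_id TEXT, person_id TEXT);
INSERT INTO persons VALUES ('p1', 'Director One'), ('p2', 'Director Two');
-- the (m1, p1) pair is repeated on purpose
INSERT INTO directors VALUES ('m1', 'p1'), ('m1', 'p1'), ('m1', 'p2'), ('m2', 'p2');
""")

query = """
SELECT DISTINCT movie_id, person_id AS director_id, primary_name AS director_name
FROM persons JOIN directors USING(person_id)
ORDER BY movie_id, director_id
"""
demo_directors = pd.read_sql(query, demo_conn)
demo_conn.close()

# Movies with more than one director still produce more than one row per movie_id
directors_per_movie = demo_directors.groupby("movie_id")["director_id"].nunique()
print(demo_directors)
print(directors_per_movie.to_dict())
```

Because co-directed movies keep one row per director, any later sum over the merged data needs to account for those repeated movie rows.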

## Data Cleaning

## 3.1 Splitting and Normalizing the Genres Columns
- Problem: Movies might be listed with multiple genres (e.g., Action, Comedy), so we need to normalize the genre data for easier analysis.
- Solution: Split the genres into separate rows so that each movie has one genre per row.

In [None]:
# putting all movie_basics info inside 1 df
movie_basics_df = pd.read_sql(
"""
SELECT *
FROM movie_basics
"""
, conn)
movie_basics_df.head()
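
The split described above can be sketched with `str.split` followed by `explode`. The rows are made up, and `individual_genre` is the column name used later in this notebook.

```python
import pandas as pd

demo_basics = pd.DataFrame({
    "movie_id": ["m1", "m2", "m3"],
    "genres": ["Action,Adventure,Sci-Fi", "Comedy", None],
})

# Split the comma-separated string into a list, then emit one row per list element
demo_basics["individual_genre"] = demo_basics["genres"].str.split(",")
exploded = demo_basics.explode("individual_genre").reset_index(drop=True)

print(exploded)
```

A movie with a missing genre keeps a single row with a missing `individual_genre`, so it may need to be dropped before any per-genre analysis.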

## 3.2 Create New Features:
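Total Gross: The sum of domestic and foreign revenue, used here as the measure of a movie's box office success. Return on investment (ROI) would be a more direct profitability metric, but it requires budget data, which is not among the columns used in this analysis.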

In [None]:
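# creating new column total_gross for domestic + foreign revenue
# (box_office_filtered was selected before the 2.2 conversion, so foreign_gross
# may still be text here; convert it and fill missing values with 0 first)
merged_data["foreign_gross"] = pd.to_numeric(merged_data["foreign_gross"], errors="coerce").fillna(0)
merged_data["total_gross"] = merged_data["domestic_gross"] + merged_data["foreign_gross"]
merged_data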

In [None]:
# check to see if there are null values
merged_data["total_gross"].isnull().sum()

## 3.3 Transform Categorical Variables
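One-Hot Encoding for Genres: Genre is a categorical variable, so it must be converted into a numeric format before modeling. The individual_genre column is kept as a single categorical column here, and the `C(individual_genre)` term in the statsmodels formula used in the analysis performs the dummy encoding. The cell below checks how many rows fall into each genre.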

In [None]:
merged_data["individual_genre"].value_counts()
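
If explicit indicator columns are needed, for example for a model that does not accept formulas, one possible approach is `pd.get_dummies`. This is a sketch on invented rows, not a step the notebook itself performs.

```python
import pandas as pd

demo = pd.DataFrame({
    "movie_id": ["m1", "m1", "m2"],
    "individual_genre": ["Action", "Adventure", "Comedy"],
})

# One indicator column per genre, then collapse back to one row per movie
genre_dummies = pd.get_dummies(demo["individual_genre"], dtype=int)
genre_flags = (
    pd.concat([demo[["movie_id"]], genre_dummies], axis=1)
    .groupby("movie_id")
    .max()
)
print(genre_flags)
```

Taking the maximum per `movie_id` turns the one-row-per-genre layout into one row per movie, with a 1 in every genre column the movie belongs to.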

In [None]:
# Saving merged_data dataframe to use in data_analysis_notebook.ipynb
merged_data.to_pickle("merged_data.pkl")

## Data Analysis (EDA)

## Define Business Goals

### Maximize Box Office Revenue:

The primary goal is to identify the key factors that drive box office success, enabling the new studio to consistently generate high-grossing films. Specifically, the studio should aim to reach an annual revenue target of $2.1 billion to compete with established players like Fox and BV.

In [None]:
# 2018 is the latest year in our data
merged_data["year"].value_counts()

In [None]:
# Use 2018 data as it's the most recent
data_2018 = merged_data[merged_data["year"] == 2018]
data_2018

In [None]:
# since we have multiple rows for each movie_id (one row for each
# genre and director), we have to groupby movie_id

gross_by_movie_id_2018 = data_2018.groupby("movie_id")[["runtime_minutes", "studio", "total_gross"]].max()
gross_by_movie_id_2018

In [None]:
gross_by_studio_2018 = gross_by_movie_id_2018.groupby("studio")["total_gross"].sum().sort_values(ascending=False)
gross_by_studio_2018
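
The reason for grouping by `movie_id` before summing by studio can be shown on a toy frame (all values are made up): without the first step, a movie with two genre rows is counted twice.

```python
import pandas as pd

# Made-up rows: movie m1 appears twice because it has two genres
rows = pd.DataFrame({
    "movie_id": ["m1", "m1", "m2"],
    "studio": ["BV", "BV", "WB"],
    "individual_genre": ["Action", "Adventure", "Drama"],
    "total_gross": [100.0, 100.0, 40.0],
})

# Summing the raw rows counts m1 twice
naive = rows.groupby("studio")["total_gross"].sum()

# Collapsing to one row per movie first gives the intended studio totals
per_movie = rows.groupby("movie_id")[["studio", "total_gross"]].max()
by_studio = per_movie.groupby("studio")["total_gross"].sum().sort_values(ascending=False)

print(naive.to_dict())
print(by_studio.to_dict())
```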

In [None]:
# Plot the top 10 grossing studios

fig, ax = plt.subplots(figsize=(10,5))

top_10_studios_2018 = list(gross_by_studio_2018.keys())[:10]
top_10_gross_2018 = list(gross_by_studio_2018.values)[:10]

ax.bar(top_10_studios_2018, top_10_gross_2018)

ax.set_title("Total Gross Per Studio (2018)")
ax.set_xlabel("Studio")
ax.set_ylabel("Total Gross")
plt.xticks(rotation=90)


plt.tight_layout()
plt.show()

In [None]:
# "Box office success" can be defined as reaching the revenue
# of the fifth-highest-grossing studio among the top 10 in 2018.

box_office_success_goal = top_10_gross_2018[4]

print(f"The fifth-highest-grossing studio among the top 10: {top_10_studios_2018[4]} with a total annual gross revenue of ${box_office_success_goal:,}")

The fifth-most-successful studio is Fox with a total annual gross revenue of $2,128,500,000. We recommend that the new studio generate a minimum annual revenue of $2,128,500,000.

### Genres in the Box Office
Which genres should the new studio focus on in order to meet this minimum annual revenue?

We will use an ANOVA test to see whether some genres perform significantly better in the box office than others using 2018 data.

**Null Hypothesis:** All genres on average perform the same in the box office in 2018

**Alternative Hypothesis:** Genres on average perform significantly differently in the box office in 2018
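
These hypotheses can be written more formally. As a sketch, assume $k$ genres with $n_i$ observations $y_{ij}$ in genre $i$, genre mean $\bar{y}_i$, grand mean $\bar{y}$, and $N = \sum_i n_i$ (this notation is introduced here for illustration only):

$$
H_0:\ \mu_1 = \mu_2 = \dots = \mu_k
\qquad
H_1:\ \mu_i \neq \mu_j \ \text{for at least one pair } (i, j)
$$

$$
F = \frac{\sum_{i=1}^{k} n_i \left(\bar{y}_i - \bar{y}\right)^2 / (k - 1)}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_i\right)^2 / (N - k)}
$$

The numerator measures variation between genre means and the denominator measures variation within genres, so a large $F$ is evidence against $H_0$.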

In [None]:
# check to see if there are any null values in total_gross
data_2018["total_gross"].isnull().sum()

In [None]:
# Getting the summed total gross revenue for each individual genre
gross_by_genre_2018 = data_2018.groupby("individual_genre")["total_gross"].sum().sort_values(ascending=False)
gross_by_genre_2018

In [None]:
fig, ax = plt.subplots(figsize=(16, 16))

sns.boxplot(
    x="individual_genre",
    y="total_gross",
    data=data_2018,
    ax=ax,
    color="blue",
    linewidth=3
)

plt.tight_layout()

In [None]:
# define our alpha
alpha = 0.01

# get the list of genres in data
genres = list(data_2018["individual_genre"].unique())

# create dictionary of df for each genre
total_gross_data_per_genre_2018 = {}
for genre in genres:
    total_gross_data_per_genre_2018[genre] = list(data_2018[data_2018["individual_genre"] == genre]["total_gross"].values)

In [None]:
result = stats.f_oneway(*total_gross_data_per_genre_2018.values())
f_stat, p_value = result

p_value

In [None]:
p_value < alpha

Our p-value is less than our alpha, which means we can reject the null hypothesis. Genres on average perform significantly differently in the box office in 2018.
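
For readers unfamiliar with `stats.f_oneway`, the following sketch runs the same kind of test on three small, made-up samples. The group names and values are hypothetical.

```python
from scipy import stats

# Made-up revenue samples for three hypothetical genres
genre_a = [10.0, 12.0, 11.0, 13.0]
genre_b = [10.5, 11.5, 12.5, 11.0]
genre_c = [30.0, 32.0, 31.0, 29.0]

alpha = 0.01
f_stat, p_value = stats.f_oneway(genre_a, genre_b, genre_c)
reject_null = p_value < alpha
print(f"F = {f_stat:.1f}, p = {p_value:.2e}, reject H0: {reject_null}")
```

Rejecting the null hypothesis only says that at least one group mean differs. It does not say which one, which is why a follow-up comparison is needed.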

We know that there is a difference in box office success between genres, but we can look at the ANOVA table to know which genres are significantly different. 

In [None]:
# Look at the OLS ANOVA table
formula = "total_gross ~ C(individual_genre)"
anova_sm = ols(formula=formula, data=data_2018).fit()
anova_sm.summary()

From the table, we can see that the coefficient for the `Adventure` genre is the largest positive one. This means that Adventure rows earn on average $44,750,000 more than rows of the baseline genre, which is the reference level represented by the intercept.
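
The interpretation of a dummy-coded coefficient as a difference from the baseline mean can be checked on a toy example. This sketch uses plain least squares in NumPy rather than statsmodels, and all values are invented.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "individual_genre": ["Action", "Action", "Adventure", "Adventure", "Comedy", "Comedy"],
    "total_gross": [100.0, 120.0, 150.0, 170.0, 40.0, 60.0],
})

# Treatment (dummy) coding: drop the first level, which becomes the baseline
dummies = pd.get_dummies(demo["individual_genre"], drop_first=True, dtype=float)
X = np.column_stack([np.ones(len(demo)), dummies.to_numpy()])
coef, *_ = np.linalg.lstsq(X, demo["total_gross"].to_numpy(), rcond=None)

intercept = coef[0]
effects = dict(zip(dummies.columns, coef[1:]))
group_means = demo.groupby("individual_genre")["total_gross"].mean()
print(intercept, effects)
print(group_means.to_dict())
```

In this toy data the intercept equals the mean of the baseline genre (Action), and each coefficient equals that genre's mean minus the baseline mean.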

Alternatively, we can perform a Tukey's HSD Post-Hoc Test to find how genres and their mean total gross revenue compare to one another.

In [None]:
tk_hsd = pairwise_tukeyhsd(data_2018["total_gross"], data_2018["individual_genre"], alpha=alpha)
tk_hsd.summary()

We can see from the Tukey's HSD Post-Hoc test that some genres perform significantly differently in the box office. For example, the adjusted p-value for Action vs Biography is `0.0015`, which is smaller than our alpha and suggests that their gross revenue is significantly different. However, Action vs Adventure has a large adjusted p-value of `0.9999`, suggesting their box office performance is not significantly different.

In [None]:
# Find top performing genre
top_genre_2018 = list(gross_by_genre_2018.keys())[0]
top_genre_gross_2018 = list(gross_by_genre_2018.values)[0]

top_genre_2018, top_genre_gross_2018

In [None]:
# Find all the genres that reach the total gross revenue goal
top_genres_2018 = [genre for genre in gross_by_genre_2018.keys() if gross_by_genre_2018[genre] >= box_office_success_goal]
top_genres_2018

The top-performing genre of 2018 was Adventure with a total gross revenue of $16,066,296,500, surpassing our annual gross revenue goal of $2,128,500,000. Other genres also surpass the goal: Action, Comedy, Drama, Sci-Fi, Thriller, Animation, Fantasy, and Horror. Because merged_data holds one row per genre and director for each movie, these genre totals count some films more than once and are best read as a relative ranking.

### Leverage Franchise Power
Prioritize the development or acquisition of film franchises, as movies that are part of a franchise (e.g., Avengers, Jurassic Park) tend to generate significantly higher revenue. The goal is to build long-term financial success through sequels and series.

Key Columns to focus on for this analysis:
- domestic_gross and foreign_gross for revenue.
- year for filtering data by 2018.
- genre for genre classification (for comparing genres within franchises).
- director for director consistency analysis.
- title for movie titles (identify franchises by titles).

Additional Considerations:
- Identify franchises by grouping related movies by title (e.g., "Avengers").
- Use total revenue to assess success.

In [None]:
# Filter the merged data for 2018 (copy so that new columns can be added safely)
data_2018 = merged_data[merged_data['year'] == 2018].copy()

# Calculate total revenue
data_2018['total_revenue'] = data_2018['domestic_gross'] + data_2018['foreign_gross']

# Create a function to identify franchise films based on keywords in the title
def check_franchise(title):
    franchise_keywords = ['Avengers', 'Star Wars', 'Harry Potter', 'Marvel', 'Toy Story', 
                          'Fast & Furious', 'Transformers', 'Pirates of the Caribbean', 'Spider-Man', 
                          'Batman', 'Superman', 'James Bond', 'X-Men', 'Jurassic', 'Mission: Impossible', 
                          'Despicable Me', 'Shrek', 'Hobbit', 'Lord of the Rings']
    for keyword in franchise_keywords:
        if keyword in title:
            return 'Yes'
    return 'No'

# Apply the function to create a new 'franchise' column
data_2018['franchise'] = data_2018['title'].apply(check_franchise)

# Keep one row per movie so the genre/director duplicates are not summed,
# then total the revenue by title and franchise flag
franchise_revenue = (
    data_2018.drop_duplicates(subset='movie_id')
    .groupby(['title', 'franchise'])['total_revenue'].sum().reset_index()
)

# Sort by highest total revenue, top 20 films
franchise_revenue = franchise_revenue.sort_values(by='total_revenue', ascending=False).head(20)

# Display the top 20 films by revenue with their franchise flag
print(franchise_revenue)
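
Keyword matching of this kind is case-sensitive and depends entirely on the keyword list. As a hedged alternative sketch, the same flag can be computed in a vectorized, case-insensitive way with `str.contains`. The shortened keyword list and the sample titles below are for illustration only.

```python
import re

import pandas as pd

# Shortened keyword list for illustration; extend it to match the full list
franchise_keywords = ["Avengers", "Jurassic", "Mission: Impossible", "Fast & Furious"]
pattern = "|".join(re.escape(k) for k in franchise_keywords)

titles = pd.Series([
    "Avengers: Infinity War",
    "Jurassic World: Fallen Kingdom",
    "mission: impossible - fallout",
    "The Quake",
])

# case=False makes the match case-insensitive
is_franchise = titles.str.contains(pattern, case=False, regex=True)
franchise_flag = is_franchise.map({True: "Yes", False: "No"})
print(franchise_flag.tolist())
```

Escaping each keyword with `re.escape` keeps characters such as `&` from being read as regular expression syntax. Any franchise missing from the keyword list is still labeled "No", so the list should be reviewed against the titles in the data.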

![Sheet 7.png](<attachment:Sheet 7.png>)

- Movies that are part of a franchise (e.g., Deadpool, Avengers, Jurassic Park) generally outperform non-franchise films.
- Franchise films consistently bring in higher revenue, indicating the strong power of brand recognition and audience loyalty.

Developing franchises is a key strategy for maximizing long-term revenue. Studios should continue to invest in building and expanding franchises to capitalize on the momentum of successful series.

### Director vs Studio Loyalty Analysis using a T-test

Hypothesis Testing:
- Null Hypothesis: There is no significant difference in total revenue between movies from repeated studio-director pairings and movies from one-off pairings.
- Alternative Hypothesis: Movies from repeated studio-director pairings generate higher revenue.

T-test will be used to compare the revenues of movies directed by the same director for a studio vs. those directed by different directors.

To group movies by director and analyze consistency within a studio:

In [None]:
# Calculate total revenue (data_2018 is already filtered to 2018)
data_2018 = data_2018.assign(total_revenue=data_2018['domestic_gross'] + data_2018['foreign_gross'])

# Mean revenue for each studio-director pairing
director_loyalty = data_2018.groupby(['studio', 'director_name'])['total_revenue'].mean().reset_index()

# T-test: Compare revenue for repeated studio-director pairings vs. pairings that occur once.
# Caveat: data_2018 has one row per genre and director, so a multi-genre movie
# also counts as a repeated pairing here.
same_director = data_2018[data_2018.duplicated(subset=['studio', 'director_name'], keep=False)]
different_director = data_2018[~data_2018.duplicated(subset=['studio', 'director_name'], keep=False)]

# Perform T-test
from scipy.stats import ttest_ind
t_stat, p_value = ttest_ind(same_director['total_revenue'], different_director['total_revenue'])

print(f"T-statistic: {t_stat}, P-value: {p_value}")

- A T-test comparing revenue from movies directed by loyal directors (e.g., Joe and Anthony Russo at BV) versus other directors produced a T-statistic of 2.63 and a P-value of 0.0086.
- Since the P-value is below 0.05 (and below the alpha of 0.01 used for the genre tests), we reject the null hypothesis: repeated studio-director collaborations are associated with significantly higher revenue. Studios that frequently collaborate with high-performing directors tend to see greater financial returns.
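
As a sketch of how such a comparison can be run and read, the example below applies Welch's t-test (which does not assume equal variances) to two made-up samples. The group names and revenue figures are hypothetical and are not taken from the dataset.

```python
from scipy.stats import ttest_ind

# Made-up revenue figures (in millions) for illustration only
repeat_pairings = [220.0, 310.0, 280.0, 400.0, 350.0]
one_off_pairings = [90.0, 120.0, 60.0, 150.0, 110.0, 80.0]

# equal_var=False selects Welch's t-test, which does not assume equal variances
t_stat, p_two_sided = ttest_ind(repeat_pairings, one_off_pairings, equal_var=False)

# For a directional alternative ("repeat pairings earn more") and t > 0,
# the one-sided p-value is half the two-sided one
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(f"t = {t_stat:.2f}, two-sided p = {p_two_sided:.4f}, one-sided p = {p_one_sided:.4f}")
```

A significant result describes a difference between the two groups. It does not by itself show that the collaboration caused the higher revenue.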

Studios should prioritize maintaining long-term relationships with successful directors. By fostering loyalty and consistency, studios can boost their chances of producing box-office hits.

### Analysis & Results

Genres in the Box Office

Using 2018 data, we performed an ANOVA test to determine whether there are significant differences in box office performance across genres. The results indicate that genres do not perform equally. The Null Hypothesis that all genres on average perform the same in the box office in 2018 was rejected (p-value < 0.01, the alpha chosen for the test).

We used Tukey’s HSD Post-Hoc Test to identify specific genre pairs that have significantly different box office performance. The OLS regression shows that Adventure has the largest positive coefficient, with an additional $44,750,000 in average total gross revenue compared to the baseline genre. By summed 2018 revenue, Action, Comedy, Sci-Fi, Drama, and Animation also surpass the new studio’s target revenue of $2,128,500,000.

Franchise Power

Franchise movies were analyzed using a binary feature that identified whether a movie belonged to a franchise. The data showed that franchise films significantly outperformed non-franchise films in 2018. Major franchises like Avengers, Jurassic Park, and Deadpool generated higher revenues due to brand loyalty and anticipation for sequels.

For example:

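- Avengers: Infinity War, an action-adventure franchise movie, far surpassed non-franchise films in 2018. The "over $16 billion" figure associated with it matches the summed Adventure genre total ($16,066,296,500) rather than a single film's gross; no individual film has grossed that much.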
- Ralph Breaks the Internet, another franchise film, brought in half the revenue of Avengers but still performed significantly better than non-franchise films in its genre.

Director vs Studio Loyalty

A t-test was conducted to compare the total revenue between directors who consistently work with the same studio and those who do not. The results were statistically significant (T-statistic: 2.63, P-value: 0.0086), indicating that directors with established studio partnerships tend to produce higher-grossing films. For instance, movies directed by the Russo Brothers consistently generate billions in revenue, particularly when working with studios like BV or Fox.

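Top-grossing studios in 2018 included BV, Fox, Universal, and Warner Bros, led by films like Avengers: Infinity War, directed by the Russo Brothers. The reported total of over $45 billion most likely reflects a sum over merged_data rows, which repeat each film once per genre and director, so it should not be read as actual box office revenue.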


### Recommendations
- The new studio should prioritize producing Adventure and Action films, as these genres consistently generate the highest revenue. Other high-performing genres include Comedy, Sci-Fi, Fantasy, and Animation, which also exceed the annual revenue target.
- The studio should prioritize building franchise films, especially in high-performing genres. Franchise films create brand loyalty and generate long-term revenue streams.
- The studio should invest in long-term partnerships with proven directors. Directors who have a history of success with a studio are more likely to bring in higher box office revenue, especially when aligned with franchises or popular genres.

### Conclusion

1. Adventure and Action genres should be the studio’s focus, as they consistently generate the highest box office revenue, exceeding the target annual revenue of $2,128,500,000.
2. Franchise development is key to long-term financial success. Studios should prioritize creating or investing in franchises that attract loyal audiences and provide opportunities for sequels.
3. Director-studio loyalty leads to better financial outcomes. New studios should focus on building strong, long-term relationships with successful directors who specialize in high-grossing genres and franchises.

### Additional Insights 
- Invest in High-Revenue Genres: Genres like Adventure, Action, and Sci-Fi outperform others and should be the core focus for new releases.
- Develop Franchise Films: Building a long-term franchise strategy will ensure consistent revenue, as franchise films outperform standalone movies.
- Foster Talent Relationships: Collaborating with proven directors will not only improve movie performance but also contribute to the studio’s brand and audience retention.

### Overall Strategic Enhancements
- Focus on producing Adventure and Action films, leveraging their box office dominance.
- Build franchises and cultivate brand loyalty to drive long-term financial success.
- Align with proven directors who have a track record of box office success to maximize profitability.



### Next Steps

1. Genre-Based Investments: Allocate resources to high-performing genres, particularly Adventure, Action, and Animation.
2. Franchise Feasibility Study: Evaluate potential franchise opportunities for sequels or spin-offs.
3. Talent Acquisition Strategy: Develop contracts with directors who have a history of success in target genres, ensuring continuity and creative control for future films.