# Data Analysis

## Importing Relevant Libraries

In [None]:
#importing relevant libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import ticker
import seaborn as sns
%matplotlib inline

## Opening Data
- Opening cleaned datasets 

In [None]:
movie_genres = pd.read_csv('zippedData/movie_genres_cleaned')
movie_genres.head(2)

In [None]:
film_pop_profit = pd.read_csv('zippedData/cleaned_popularity_profit_data')
film_pop_profit.head()

Before the analysis, we define the metrics used throughout:
- ROI: return on investment (ROI = revenue / cost - 1)
- Production budget: production cost
- Domestic gross: domestic revenue
- International gross: international revenue
- Worldwide gross: worldwide revenue
- Domestic ROI: domestic return on investment
- International ROI: international return on investment
- Worldwide ROI: worldwide return on investment

All of these financial metrics are calculated in USD.
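
As a quick check of the ROI definition above, here is a minimal sketch with a hypothetical film. The `roi` helper and the numbers are invented for illustration, and profit is assumed to be revenue minus production budget.

```python
def roi(revenue, cost):
    """Return on investment as defined above: revenue / cost - 1."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return revenue / cost - 1


# Hypothetical film (not from the dataset): 100M USD budget, 350M USD worldwide gross.
budget = 100_000_000
worldwide_gross = 350_000_000

# Assumption: profit is gross revenue minus production budget.
worldwide_profit = worldwide_gross - budget
worldwide_roi = roi(worldwide_gross, budget)

print(f"profit = {worldwide_profit:,} USD, ROI = {worldwide_roi:.2f}")
```

Under this definition, ROI equals profit divided by cost, so an ROI of 2 means the film earned back three times its budget.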

After cleaning the data, we explored which film genres perform best by calculating the mean worldwide profit for each genre.

## 1. 1st Recommendation analysis

In [None]:
# calculated mean profit grouped by genre 
mean_profit_genre = movie_genres.groupby('each_genre').mean(numeric_only = True).sort_values('worldwide_profit',ascending = False)

# formatted numeric columns for readability 
mean_profit_genre.style.format({'runtime_minutes': '{:,.2f}','production_budget': '{:,.2f}','domestic_gross': '{:,.2f}', 'worldwide_gross': '{:,.2f}', 'international_gross': '{:,.2f}',
       'domestic_profit': '{:,.2f}','international_profit': '{:,.2f}','worldwide_profit': '{:,.2f}','domestic_ROI': '{:.2f}','international_ROI': '{:.2f}', 'worldwide_ROI': '{:.2f}'})
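
The same formatting mapping is repeated in several cells of this notebook. One possible way to avoid the repetition is to build it once and reuse it. This is only a sketch: the names `thousands_cols`, `roi_cols` and `NUMERIC_FORMATS` are hypothetical, while the column names are taken from the formatting call above.

```python
# Columns shown with a thousands separator and two decimals.
thousands_cols = [
    'runtime_minutes', 'production_budget', 'domestic_gross',
    'worldwide_gross', 'international_gross', 'domestic_profit',
    'international_profit', 'worldwide_profit',
]
# ROI columns are ratios, so no thousands separator is used.
roi_cols = ['domestic_ROI', 'international_ROI', 'worldwide_ROI']

# One mapping from column name to format string.
NUMERIC_FORMATS = {
    **dict.fromkeys(thousands_cols, '{:,.2f}'),
    **dict.fromkeys(roi_cols, '{:.2f}'),
}

# The mapping could then be passed to a Styler, e.g. df.style.format(NUMERIC_FORMATS).
print(NUMERIC_FORMATS['worldwide_profit'].format(1234567.891))
```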
 

Using a bar chart to show film genres by their worldwide profit

In [None]:
# created bar charts mapping profit by genre 
x = mean_profit_genre.index
y1 = mean_profit_genre['worldwide_profit']

fig, ax1 = plt.subplots()

# Bar chart
ax1.bar(x, y1, color='orange')
ax1.set_ylabel('Worldwide profit (million USD)')

# format x-axis ticks & labels; show y-axis values in millions:
ax1.set_xticks(range(len(x)))
ax1.set_xticklabels(x, rotation=90)
ax1.yaxis.set_major_formatter(ticker.FuncFormatter(lambda value, pos: format(value/1000000,'1.0f')+'M'))

ax1.set_title('Worldwide profit by genre');

Based on the bar chart, the top four genres by mean worldwide profit are (1) Animation, (2) Musical, (3) Sci-Fi, and (4) Adventure.

In [None]:
# verifying that new dataframe contains the top four highest grossing genres
top4_genre_profit = mean_profit_genre.head(4)
top4_genre_profit.style.format({'runtime_minutes': '{:,.2f}','production_budget': '{:,.2f}','domestic_gross': '{:,.2f}', 'worldwide_gross': '{:,.2f}', 'international_gross': '{:,.2f}',
       'domestic_profit': '{:,.2f}','international_profit': '{:,.2f}','worldwide_profit': '{:,.2f}','domestic_ROI': '{:.2f}','international_ROI': '{:.2f}', 'worldwide_ROI': '{:.2f}'})

In [None]:
# Create a list of the top 4 genres by worldwide profit:
top4_genre_list = list(top4_genre_profit.index)
top4_genre_list

#### Recommendation 1 conclusion: focus on top 4 genres

## 2. 2nd recommendation analysis 

To make the analysis more concrete, we create a new dataframe that contains only the top four highest grossing film genres.

In [None]:
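# .copy() makes this an independent dataframe, so adding columns later does not raise SettingWithCopyWarning
top4_genre_df = movie_genres.loc[movie_genres['each_genre'].isin(top4_genre_list)].copy()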
# formatting numeric currency columns
top4_genre_df.head().style.format({'runtime_minutes': '{:,.2f}','production_budget': '{:,.2f}','domestic_gross': '{:,.2f}', 'worldwide_gross': '{:,.2f}', 'international_gross': '{:,.2f}',
       'domestic_profit': '{:,.2f}','international_profit': '{:,.2f}','worldwide_profit': '{:,.2f}','domestic_ROI': '{:.2f}','international_ROI': '{:.2f}', 'worldwide_ROI': '{:.2f}'})

#### Analyzing the Relationship Between Production Budget and Worldwide ROI

After identifying the highest grossing genres for films, we wanted to understand how to maximize worldwide profit. We then analyzed the relationship between worldwide ROI and production budget at specific intervals.
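
To illustrate how budgets map to intervals, here is a small sketch using `pd.cut` on invented budgets. The bin edges and labels match the ones used in this notebook; the `budgets` values are hypothetical.

```python
import pandas as pd

# Hypothetical budgets in USD (not from the dataset).
budgets = pd.Series([5_000_000, 20_000_000, 75_000_000, 250_000_000, 450_000_000])

edges = [0, 20_000_000, 50_000_000, 100_000_000,
         200_000_000, 300_000_000, 400_000_000, float('inf')]
labels = ['<20M', '20M-50M', '50M-100M', '100M-200M',
          '200M-300M', '300M-400M', '>400M']

# pd.cut uses right-closed intervals by default, so a budget of exactly
# 20M lands in the '<20M' group, i.e. the interval (0, 20M].
groups = pd.cut(budgets, bins=edges, labels=labels)
print(groups.tolist())
```

Because the intervals are closed on the right, a label such as `<20M` also covers budgets of exactly 20M USD; passing `right=False` to `pd.cut` would shift boundary values into the next group instead.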

In [None]:
# Create intervals of production budget for plotting purposes:
budget_interval = [0, 20000000, 50000000, 100000000, 200000000, 300000000, 400000000, float('inf')]
budget_labels = ['<20M', '20M-50M', '50M-100M', '100M-200M', '200M-300M', '300M-400M', '>400M']

# Create a new column of production_budget intervals:
top4_genre_df.loc[:, 'budget_group'] = pd.cut(top4_genre_df['production_budget'], bins=budget_interval, labels=budget_labels)

The violin plot below shows the distribution of worldwide ROI within each production budget group, pooling all four of the highest grossing film genres.

In [None]:
# mapping the distribution of worldwide ROI for highest grossing genres
# based on production budget group 
sns.violinplot(x = 'budget_group', y = 'worldwide_ROI', data = top4_genre_df)
plt.xticks(rotation=45)
plt.title('Relationship between worldwide ROI & budget group of top 4 genres')
plt.xlabel('Budget Group')
plt.ylabel('Worldwide ROI')
plt.show()

##### Analysis

Because the violin plot pools all four genres, no production budget interval stands out as being associated with a higher worldwide ROI. Further analysis of each genre is needed.

In addition, the average ROI of the top 4 genres is less than 2. To make a recommendation for Microsoft, we therefore use a worldwide ROI above 2 as the hypothesized target in the analysis that follows.
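
One way to make the per-genre comparison concrete, alongside the plots, is to tabulate the median worldwide ROI by genre and budget group. The sketch below uses an invented toy dataframe in place of `top4_genre_df`; only the column names come from this notebook.

```python
import pandas as pd

# Toy stand-in for top4_genre_df; all values are invented.
toy = pd.DataFrame({
    'each_genre': ['Animation', 'Animation', 'Animation',
                   'Sci-Fi', 'Sci-Fi', 'Sci-Fi'],
    'budget_group': ['50M-100M', '50M-100M', '<20M',
                     '<20M', '100M-200M', '100M-200M'],
    'worldwide_ROI': [2.0, 3.0, 0.5, 1.0, 2.5, 1.5],
})

# Median ROI per (genre, budget group), pivoted so each genre is one row.
median_roi = (
    toy.groupby(['each_genre', 'budget_group'])['worldwide_ROI']
       .median()
       .unstack('budget_group')
)
print(median_roi)
```

Cells that show `NaN` mark genre and budget combinations with no films, which is worth knowing before reading too much into a box plot.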

#### Animation Production Budget Groups & Worldwide ROI 

In [None]:
# creating a separate dataframe for animation films' production budget 
# and worldwide ROI 
animation_budget_df = top4_genre_df[top4_genre_df['each_genre']=="Animation"]
animation_budget_df.head()

In [None]:
# formatting numeric columns 
animation_budget_df.describe().style.format({'runtime_minutes': '{:,.2f}','production_budget': '{:,.2f}','domestic_gross': '{:,.2f}', 'worldwide_gross': '{:,.2f}', 'international_gross': '{:,.2f}',
       'domestic_profit': '{:,.2f}','international_profit': '{:,.2f}','worldwide_profit': '{:,.2f}','domestic_ROI': '{:.2f}','international_ROI': '{:.2f}', 'worldwide_ROI': '{:.2f}'})
 

In [None]:
# mapping out the distribution of worldwide ROI for animation films based 
# on their production budget group
sns.boxplot(x = 'budget_group', y = 'worldwide_ROI' , 
            data = animation_budget_df, orient = 'v')
plt.xticks(rotation=45)
plt.xlabel('Production budget group($)')
plt.ylabel('Worldwide ROI')
plt.title('Relationship between worldwide ROI & production budget group in Animation genre')
plt.show()

For animated films, production budgets between 50 million USD and 200 million USD had the highest worldwide ROI.

#### Musical Production Budget Groups & Worldwide ROI 

In [None]:
musical_budget_df = top4_genre_df[top4_genre_df['each_genre']=="Musical"]
musical_budget_df.head()

In [None]:
musical_budget_df.describe().style.format({'runtime_minutes': '{:,.2f}','production_budget': '{:,.2f}','domestic_gross': '{:,.2f}', 'worldwide_gross': '{:,.2f}', 'international_gross': '{:,.2f}',
       'domestic_profit': '{:,.2f}','international_profit': '{:,.2f}','worldwide_profit': '{:,.2f}','domestic_ROI': '{:.2f}','international_ROI': '{:.2f}', 'worldwide_ROI': '{:.2f}'})
 

In [None]:
sns.boxplot(x = 'budget_group', y = 'worldwide_ROI' , 
            data = musical_budget_df, orient = 'v')
plt.xticks(rotation=45)
plt.title('Relationship between worldwide ROI & production budget group in Musical genre')
plt.show()


For musical films, there is not enough data to identify whether different budget groups contribute to higher worldwide ROI.
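
A simple way to back up a "not enough data" statement is to count the films in each budget group. The following is a sketch on invented data; `toy_musical` and the `MIN_FILMS` threshold are hypothetical and would need to be chosen for the real dataset.

```python
import pandas as pd

# Toy stand-in for musical_budget_df; all values are invented.
toy_musical = pd.DataFrame({
    'budget_group': ['<20M', '<20M', '20M-50M', '50M-100M'],
    'worldwide_ROI': [0.8, 3.1, 1.2, 2.4],
})

# Hypothetical minimum number of films needed to compare groups.
MIN_FILMS = 5

# Number of films per budget group.
counts = toy_musical['budget_group'].value_counts()

# Groups with too few films to support a conclusion.
sparse_groups = sorted(counts[counts < MIN_FILMS].index)
print(counts)
print("Groups below threshold:", sparse_groups)
```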

#### Sci-Fi Production Budget Groups & Worldwide ROI 

In [None]:
scifi_budget_df = top4_genre_df[top4_genre_df['each_genre']=="Sci-Fi"]
scifi_budget_df.head()

In [None]:
sns.boxplot(x = 'budget_group', y = 'worldwide_ROI' , 
            data = scifi_budget_df, orient = 'v')
plt.xticks(rotation=45)
plt.title('Relationship between worldwide ROI & production budget group in Sci-Fi genre')
plt.show()


For Sci-Fi films, movies with budgets between 100M USD and 300M USD have the highest median ROI. However, films with budgets below 20M USD show a wider IQR, so a low-budget option that limits spending while still reaching a high ROI is also worth considering.
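
The median and IQR read off a box plot can also be computed directly. The sketch below does this per budget group on invented data; `toy_scifi` and the `iqr` helper are hypothetical names, and the values are chosen only to illustrate a higher median in one group and a wider spread in another.

```python
import pandas as pd

# Toy stand-in for scifi_budget_df; all values are invented.
toy_scifi = pd.DataFrame({
    'budget_group': ['<20M'] * 4 + ['100M-200M'] * 4,
    'worldwide_ROI': [-0.5, 0.5, 2.5, 6.5, 1.0, 2.0, 3.0, 4.0],
})


def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return values.quantile(0.75) - values.quantile(0.25)


# Median and IQR of worldwide ROI for each budget group.
spread = toy_scifi.groupby('budget_group')['worldwide_ROI'].agg(
    median='median', iqr=iqr
)
print(spread)
```

A wide IQR means outcomes vary a lot within the group, so a low-budget film may reach a high ROI but may also fall well short of it.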

#### Adventure Production Budget Groups & Worldwide ROI 

In [None]:
adventure_budget_df = top4_genre_df[top4_genre_df['each_genre']=="Adventure"]
adventure_budget_df.head()

In [None]:
sns.boxplot(x = 'budget_group', y = 'worldwide_ROI' , 
            data = adventure_budget_df, orient = 'v')
plt.xticks(rotation=45)
plt.title('Relationship between worldwide ROI & production budget group in Adventure genre')
plt.show()


For adventure films, movies with budgets between 100M USD and 300M USD have the highest median ROI. However, films with budgets below 20M USD show a wider IQR, so a low-budget option that limits spending while still reaching a high ROI is also worth considering.

#### Recommendation 2 conclusion: Target Animation with a production budget between 50M USD and 200M USD to achieve an ROI higher than 2

## 3. 3rd recommendation analysis

### Popularity

After identifying the highest grossing genres and their worldwide ROI by production budget group, we looked into the relationship between popularity and worldwide gross revenue.

In [None]:
# creating a new dataframe to compare the four highest grossing film genres 
# and their popularity
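# .copy() makes this an independent dataframe, so adding columns later does not raise SettingWithCopyWarning
top4_popularity_df = film_pop_profit[film_pop_profit['each_genre'].isin(top4_genre_list)].copy()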

In [None]:
# previewing new dataframe
top4_popularity_df.describe()

The average worldwide gross across all four genres is around 400M USD, so for the purposes of our analysis, we are targeting revenues of at least 500M USD. 

To make a more specific recommendation, we grouped popularity scores into intervals to determine at which popularity levels films reach a worldwide revenue of at least 500M USD.
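
As a numeric complement to the plots, the share of films reaching the 500M USD target can be computed for each popularity interval. This is a sketch on invented data: `toy`, `TARGET_GROSS` and `hits_target` are hypothetical names, while the interval edges and labels match those used in this notebook.

```python
import pandas as pd

# Toy stand-in for top4_popularity_df; all values are invented.
toy = pd.DataFrame({
    'popularity': [5, 15, 18, 25, 28, 35, 45, 50],
    'worldwide_gross': [50e6, 120e6, 600e6, 300e6, 700e6, 650e6, 900e6, 400e6],
})

TARGET_GROSS = 500_000_000  # revenue target stated in the text

# Same popularity intervals and labels as used in this notebook.
toy['popularity_group'] = pd.cut(
    toy['popularity'],
    bins=[0, 10, 20, 30, 40, float('inf')],
    labels=['<10', '10-20', '20-30', '30-40', '>40'],
)

# Flag films that reach the target, then take the mean of the flag per
# group, which is the fraction of films in that group hitting the target.
toy['hits_target'] = toy['worldwide_gross'] >= TARGET_GROSS
share = toy.groupby('popularity_group', observed=False)['hits_target'].mean()
print(share)
```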

In [None]:
# creating intervals for popularity scores 
interval_popularity = [0,10, 20, 30, 40, float('inf')]

# converting popularity column into interval ranges
top4_popularity_df.loc[:, 'popularity_group']  = pd.cut(top4_popularity_df['popularity'], bins=interval_popularity, labels=['<10','10-20', '20-30', '30-40','>40'])

In [None]:
# previewing new dataframe
top4_popularity_df.head(2)

#### Analyzing Popularity Groups and Worldwide Gross Revenue for 4 genres


In [None]:
sns.violinplot(x = 'popularity_group', y = 'worldwide_gross', data = top4_popularity_df)
plt.xticks(rotation=45)
plt.title('Relationship between popularity and worldwide gross revenue of top 4 genres')
plt.xlabel('Popularity group')
plt.ylabel('Worldwide gross revenue')
plt.show()

For the four highest grossing film genres, there is a slight positive relationship between popularity and worldwide gross revenue: more popular films tend to earn more. Further analysis is needed to understand how this differs across genres.
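
A visual impression of a "slight positive relationship" can be checked with a rank correlation. The sketch below computes Spearman's rho as the Pearson correlation of ranks, which avoids any dependency beyond pandas; the `toy` dataframe is invented and only the column names come from this notebook.

```python
import pandas as pd

# Toy stand-in for top4_popularity_df; all values are invented.
toy = pd.DataFrame({
    'popularity': [5, 12, 18, 25, 33, 41],
    'worldwide_gross': [40e6, 150e6, 90e6, 320e6, 610e6, 580e6],
})

# Spearman rank correlation, computed as the Pearson correlation of ranks.
# Values near +1 mean more popular films tend to gross more.
rho = toy['popularity'].rank().corr(toy['worldwide_gross'].rank())
print(f"Spearman rho = {rho:.3f}")
```

A rank-based measure is used here because gross revenue is heavily skewed, and a few blockbusters would otherwise dominate a plain Pearson correlation.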

#### Animation: Analyzing Popularity Groups and Worldwide Gross Revenue 

In [None]:
animation_popularity_df = top4_popularity_df[top4_popularity_df['each_genre']=="Animation"]
animation_popularity_df.head()

In [None]:
animation_popularity_df.describe().style.format({'runtime_minutes': '{:,.2f}','production_budget': '{:,.2f}','domestic_gross': '{:,.2f}', 'worldwide_gross': '{:,.2f}', 'international_gross': '{:,.2f}',
       'domestic_profit': '{:,.2f}','international_profit': '{:,.2f}','worldwide_profit': '{:,.2f}','domestic_ROI': '{:.2f}','international_ROI': '{:.2f}', 'worldwide_ROI': '{:.2f}'})
 

In [None]:
sns.violinplot(x = 'popularity_group', y = 'worldwide_gross', data = animation_popularity_df)
plt.xticks(rotation=45)
plt.title('Relationship between Popularity group & gross revenue in Animation genre')
plt.xlabel('Popularity group')
plt.ylabel('Worldwide gross revenue')
plt.show()

For animation films, those with popularity scores of at least 20 earn higher worldwide gross revenue. 

#### Musical: Analyzing Popularity Groups and Worldwide Gross Revenue 

In [None]:
musical_popularity_df = top4_popularity_df[top4_popularity_df['each_genre']=="Musical"]
musical_popularity_df.head()

In [None]:
sns.violinplot(x = 'popularity_group', y = 'worldwide_gross', data = musical_popularity_df)
plt.xticks(rotation=45)
plt.title('Relationship between Popularity group & gross revenue in Musical genre')
plt.xlabel('Popularity group')
plt.ylabel('Worldwide gross revenue')
plt.show()

For musical films, there is not enough information to determine whether there is a relationship between popularity and worldwide gross revenue.

#### Sci-Fi: Analyzing Popularity Groups and Worldwide Gross Revenue 

In [None]:
scifi_popularity_df = top4_popularity_df[top4_popularity_df['each_genre']=="Sci-Fi"]
scifi_popularity_df.head()

In [None]:
sns.violinplot(x = 'popularity_group', y = 'worldwide_gross', data = scifi_popularity_df)
plt.xticks(rotation=45)
plt.title('Relationship between Popularity group & gross revenue in Sci-Fi genre')
plt.xlabel('Popularity group')
plt.ylabel('Worldwide gross revenue')
plt.show()

For Sci-Fi films, there is a relationship between popularity and worldwide gross revenue; however, revenue increases only when a film has a popularity score of at least 30.

#### Adventure: Analyzing Popularity Groups and Worldwide Gross Revenue 

In [None]:
adventure_popularity_df = top4_popularity_df[top4_popularity_df['each_genre']=="Adventure"]
adventure_popularity_df.head()

In [None]:
sns.violinplot(x = 'popularity_group', y = 'worldwide_gross', data = adventure_popularity_df)
plt.xticks(rotation=45)
plt.title('Relationship between Popularity group & gross revenue in Adventure genre')
plt.xlabel('Popularity group')
plt.ylabel('Worldwide gross revenue')
plt.show()

Similar to Sci-Fi films, adventure films with a popularity score of at least 30 earn higher worldwide gross revenue. 

# Recommendations 

In conclusion, we would like to highlight that all financial metrics take a global view, because releasing movies across borders is inevitable nowadays and focusing on the domestic market alone would limit the potential profit. After a thorough analysis, we arrived at three recommendations for Microsoft regarding opening a movie studio:

First, target the top genres to generate the highest worldwide profit, especially the top 4: each earns roughly 250M USD on average, while the next-best genre earns only around 150M USD.

Second, Animation is the most profitable genre and still has plenty of room to maximize worldwide profit. For ROI to be greater than 2, the production budget should be between 50M USD and 200M USD.

Third, focus on Animation and target worldwide revenue greater than 500M USD by keeping the popularity score above 20 through a range of means, including marketing, reviews, and critical reception.