
## Final Project Submission

Please fill out:
* Student name: 
* Student pace: self paced / part time / full time
* Scheduled project review date/time: 
* Instructor name: 
* Blog post URL:


# Business Understanding



In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math
import seaborn as sns
%matplotlib inline
df = pd.read_csv('Data/data-clean.csv')
df.drop('Unnamed: 0',axis=1,inplace=True)

#doing this here and not during data cleaning because the list split doesn't survive being saved as .csv
df['genres'] = df['genres'].map(lambda x: x.split(","))
df['director'] = df['director'].map(lambda x: x.split(","))

display(df.head(2)), display(df.info()) ,display(df.isna().sum())

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,foreign_gross,profit,genres,director
0,2011-05-20,Pirates of the Caribbean: On Stranger Tides 2011,410600000,241063875,1045663875,804600000,635063875,"[Action, Adventure, Fantasy]",[Rob Marshall]
1,2019-06-07,Dark Phoenix 2019,350000000,42762350,149762350,107000000,-200237650,"[Action, Adventure, Sci-Fi]",[Simon Kinberg]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1483 entries, 0 to 1482
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   release_date       1483 non-null   object
 1   movie              1483 non-null   object
 2   production_budget  1483 non-null   int64 
 3   domestic_gross     1483 non-null   int64 
 4   worldwide_gross    1483 non-null   int64 
 5   foreign_gross      1483 non-null   int64 
 6   profit             1483 non-null   int64 
 7   genres             1483 non-null   object
 8   director           1483 non-null   object
dtypes: int64(5), object(4)
memory usage: 104.4+ KB


None

release_date         0
movie                0
production_budget    0
domestic_gross       0
worldwide_gross      0
foreign_gross        0
profit               0
genres               0
director             0
dtype: int64

(None, None, None)

### Business Recommendation 1: Which genres should we create films in? Which genres provide the highest average ROI?

Based on our business problem, we've decided to first conclude which genres these new films should be. We've chosen to do this by calculating the average return on investment for each genre and pulling the top 5 genres.

First, we created columns holding the domestic, foreign, and worldwide ROI for each movie title in our dataframe. We also exploded the genres for each movie, so each genre of each movie now has its own row. This makes it easier to compute a per-genre average.
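
As a minimal sketch of these two steps, the example below computes a percentage ROI and explodes a list-valued genre column on a toy dataframe. The titles and figures are made up for illustration; only the column names mirror the notebook.

```python
import pandas as pd

# Toy data: titles and figures are made up for illustration only
toy = pd.DataFrame({
    "movie": ["Film A", "Film B"],
    "production_budget": [100, 200],
    "worldwide_gross": [250, 150],
    "genres": [["Action", "Comedy"], ["Drama"]],
})

# ROI as a percentage of the production budget
toy["roi_worldwide"] = (
    (toy["worldwide_gross"] - toy["production_budget"])
    / toy["production_budget"] * 100
)

# One row per (movie, genre) pair; the movie's ROI is repeated for each genre
toy_exp = toy.explode("genres")
print(toy_exp[["movie", "genres", "roi_worldwide"]])
```

Note that a movie with several genres contributes its ROI to each of those genres after the explode.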

In [2]:
#drop duplicate movie titles
df = df.drop_duplicates(subset='movie', keep='first')

#create a new column that represents the roi for domestic gross
df['roi_domestic'] = (df['domestic_gross'] - df['production_budget']) / df['production_budget'] * 100

#create a new column that represents the roi for foreign gross
df['roi_foreign'] = (df['foreign_gross'] - df['production_budget']) / df['production_budget'] * 100

#create a new column that represents the roi for worldwide gross
df['roi_worldwide'] = (df['worldwide_gross'] - df['production_budget']) / df['production_budget'] * 100

df.head(2)

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,foreign_gross,profit,genres,director,roi_domestic,roi_foreign,roi_worldwide
0,2011-05-20,Pirates of the Caribbean: On Stranger Tides 2011,410600000,241063875,1045663875,804600000,635063875,"[Action, Adventure, Fantasy]",[Rob Marshall],-41.28985,95.957136,154.667286
1,2019-06-07,Dark Phoenix 2019,350000000,42762350,149762350,107000000,-200237650,"[Action, Adventure, Sci-Fi]",[Simon Kinberg],-87.782186,-69.428571,-57.210757


In [3]:
#filter the dataframe to only show movies released in the last ten years
#(the data has a release_date column, not release_year, so take the year from the date)
df = df.loc[pd.to_datetime(df['release_date']).dt.year >= 2011]
df.info()

In [None]:
#the genres for each movie are in a list. I need to separate the genres so I can perform further analysis
genres_exp = df.explode('genres')
genres_exp['genres'].unique()

Now that that's out of the way, we can find the median ROI for each genre. We've chosen the median because the data contains many outliers, which would make the mean a less accurate representation of typical ROI. We'll choose the top 5 genres based on their worldwide ROI, since this is a better indication of each film's total ROI.
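
To illustrate why the median is the safer summary here, the sketch below compares mean and median on a handful of hypothetical ROI values, one of which is an extreme outlier. The numbers are invented for the example and are not taken from the dataset.

```python
import numpy as np

# Hypothetical ROI values (%) for one genre; the last one is an extreme outlier
roi = np.array([-40.0, 10.0, 25.0, 60.0, 4000.0])

mean_roi = roi.mean()        # pulled far upward by the single outlier
median_roi = np.median(roi)  # middle value, unaffected by the outlier's size

print(f"mean = {mean_roi:.0f}%, median = {median_roi:.0f}%")
```

A single runaway hit moves the mean far above what most films in the sample returned, while the median stays at the middle film.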

In [None]:
#I created a new dataframe grouped by genres and took the median of each numeric column
filtered_df_median = genres_exp.groupby('genres').median(numeric_only=True)
#I am sorting the genres in descending order and only looking at the top 5 genres
filtered_df_median = filtered_df_median.sort_values(by=['roi_worldwide'], ascending = False)
filtered_df_median.head(5)

> Great, let's see those roi_worldwide averages compared visually


In [None]:
#creating a boxplot with the median and IQR for each genre
fig_dims = (20, 14)
fig, ax = plt.subplots(figsize=fig_dims)
sns.boxplot( y=genres_exp["genres"], 
            x=genres_exp["roi_worldwide"], 
            palette="Blues", 
            width=0.6, 
            #removing outliers
            showfliers = False,
            #adding a green triangle that shows the mean for each genre
            showmeans = True);

#creating labels for my boxplot
plt.xlabel("% ROI", size=14)
plt.ylabel("Movie Genres", size=14)
plt.title("Average Worldwide ROI By Genre", size=18)
plt.show()

> We want to clear some of this noise and look only at the top 5 performing genres based on median worldwide ROI.
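
One way to narrow the data to the top genres without naming them by hand is sketched below. It assumes an exploded dataframe with `genres` and `roi_worldwide` columns; the toy values and the choice of N = 2 are for illustration only.

```python
import pandas as pd

# Toy exploded data (one row per movie-genre pair); values are made up
toy_exp = pd.DataFrame({
    "genres": ["Horror", "Horror", "Drama", "Drama", "Western", "Western"],
    "roi_worldwide": [300.0, 100.0, 50.0, 10.0, -20.0, -60.0],
})

# Median ROI per genre, highest first
medians = (
    toy_exp.groupby("genres")["roi_worldwide"]
    .median()
    .sort_values(ascending=False)
)

# Names of the top-N genres (N = 2 here; the notebook uses 5)
top_genres = medians.head(2).index.tolist()

# Keep only the rows whose genre is in that list
top_rows = toy_exp[toy_exp["genres"].isin(top_genres)]
print(top_genres)
```

Deriving the list from the median table keeps the plot in step with the ranking if the underlying data changes.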

In [None]:
#creating an individual dataframe for each genre in my filtered_df_median dataframe
genres=genres_exp['genres'].unique().tolist()
musical = genres_exp.loc[genres_exp.genres=='Musical']
mystery = genres_exp.loc[genres_exp.genres=='Mystery']
scifi = genres_exp.loc[genres_exp.genres=='Sci-Fi']
adventure = genres_exp.loc[genres_exp.genres=='Adventure']
animation = genres_exp.loc[genres_exp.genres=='Animation']

In [None]:
#create a new dataframe with the top five genres by median worldwide roi
#(pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
top_5_ww = pd.concat([adventure, animation, musical, mystery, scifi])

In [None]:
#create a box plot with all 5 top genres comparing median and mean
fig_dims = (15, 10)
fig, ax = plt.subplots(figsize=fig_dims)
sns.boxplot( x=top_5_ww["genres"], 
            y=top_5_ww["roi_worldwide"], 
            palette="Blues", 
            width=0.6, 
            showfliers = False, 
            showmeans=True);


plt.ylabel("% ROI", size=14)
plt.xlabel("Movie Genres", size=14)
plt.title("Average Worldwide ROI For The Top 5 Genres", size=18)
ax.yaxis.grid(False) # Hide the horizontal gridlines
ax.xaxis.grid(False) # Hide the vertical gridlines

sns.despine(offset=10, trim=True)
plt.show()

In [None]:
#let's predict worst, base, and best case scenarios for each genre
#using the 25th, 50th, and 75th percentiles of worldwide roi
top_ww_frames = [('Adventure', adventure), ('Animation', animation), ('Musical', musical),
                 ('Mystery', mystery), ('Sci-fi', scifi)]
for name, frame in top_ww_frames:
    worst, base, best = np.percentile(frame['roi_worldwide'], [25, 50, 75])
    print(f'{name} ROI: \n  Worst Case = {int(worst)}% \n  Base Case = {int(base)}% \n  Best Case = {int(best)}%')
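
The worst, base, and best cases above are the 25th, 50th, and 75th percentiles of each genre's ROI. The sketch below shows that mapping on hypothetical values; the array is invented for the example.

```python
import numpy as np

# Hypothetical ROI values (%) for a single genre
roi = np.array([-50.0, 0.0, 50.0, 100.0, 150.0])

# 25th, 50th, and 75th percentiles in one call
worst, base, best = np.percentile(roi, [25, 50, 75])

print(f"Worst Case = {int(worst)}%, Base Case = {int(base)}%, Best Case = {int(best)}%")
```

Keep in mind that `int()` truncates toward zero rather than rounding, so a percentile of -25.9 would print as -25.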

We next decided to look at how each genre performed domestically, to see whether it was necessary to release films in foreign countries. We were also curious whether the top 5 genres in the USA were different from the top 5 genres worldwide.

In [None]:
fig_dims = (20, 14)
fig, ax = plt.subplots(figsize=fig_dims)
sns.boxplot( y=genres_exp["genres"], 
            x=genres_exp["roi_domestic"], 
            palette="Blues", 
            width=0.6, 
            showfliers = False, 
            showmeans=True);

plt.xlabel("% ROI", size=14)
plt.ylabel("Movie Genres", size=14)
plt.title("Average Domestic ROI By Genre", size=18)
plt.show()

In [None]:
#group by genre, take the median of each numeric column, and sort by domestic roi
filtered_df_median_dom = genres_exp.groupby('genres').median(numeric_only=True)
filtered_df_median_dom = filtered_df_median_dom.sort_values(by=['roi_domestic'], ascending = False)
filtered_df_median_dom.head(5)

> It looks like our top 5 genres in the USA are different from our top 5 genres in the world. Only two genres (mystery and animation) are in the top 5 for both domestic and worldwide ROI.
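
The overlap can be checked directly with set operations. The sketch below uses the two top-5 genre lists that the surrounding cells work with; the variable names are ours.

```python
# Top-5 genre lists as used in the surrounding cells
top5_worldwide = {"Adventure", "Animation", "Musical", "Mystery", "Sci-Fi"}
top5_domestic = {"Animation", "Comedy", "Music", "Mystery", "Romance"}

in_both = sorted(top5_worldwide & top5_domestic)
worldwide_only = sorted(top5_worldwide - top5_domestic)
domestic_only = sorted(top5_domestic - top5_worldwide)

print("Both:", in_both)
print("Worldwide only:", worldwide_only)
print("Domestic only:", domestic_only)
```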

We next wanted to see the worst, base, and best case ROI for the top 5 genres in the USA.

In [None]:
comedy = genres_exp.loc[genres_exp.genres=='Comedy']
music = genres_exp.loc[genres_exp.genres=='Music']
romance = genres_exp.loc[genres_exp.genres=='Romance']

In [None]:
top_5_df_domestic = pd.concat([animation, comedy])
top_5_df_domestic['genres'].unique()

In [None]:
top_5_df_domestic = pd.concat([top_5_df_domestic, music])
top_5_df_domestic['genres'].unique()

In [None]:
top_5_df_domestic = pd.concat([top_5_df_domestic, mystery])
top_5_df_domestic['genres'].unique()

In [None]:
top_5_df_domestic = pd.concat([top_5_df_domestic, romance])
top_5_df_domestic['genres'].unique()

In [None]:
#created a box plot with top 5 genres in roi_domestic based on median
fig_dims = (15, 10)
fig, ax = plt.subplots(figsize=fig_dims)
sns.boxplot( x=top_5_df_domestic["genres"], 
            y=top_5_df_domestic["roi_domestic"], 
            palette="Blues", 
            width=0.6, 
            showfliers = False, 
            showmeans=True);

plt.ylabel("% ROI", size=14)
plt.xlabel("Movie Genres", size=14)
plt.title("Average Domestic ROI For The Top 5 Genres", size=18)
#sns.stripplot(x="genres", y="roi_domestic", data=top_5_df_domestic)
ax.yaxis.grid(False) # Hide the horizontal gridlines
ax.xaxis.grid(False) # Hide the vertical gridlines
sns.despine(offset=10, trim=True)
plt.show()

In [None]:
#let's predict worst, base, and best case scenarios for each genre
#using the 25th, 50th, and 75th percentiles of domestic roi
top_dom_frames = [('Animation', animation), ('Comedy', comedy), ('Music', music),
                  ('Mystery', mystery), ('Romance', romance)]
for name, frame in top_dom_frames:
    worst, base, best = np.percentile(frame['roi_domestic'], [25, 50, 75])
    print(f'{name} ROI: \n  Worst Case = {int(worst)}% \n  Base Case = {int(base)}% \n  Best Case = {int(best)}%')

### Business Recommendation 1: Conclusions

From what we see in the graph above and based on worst, base, and best case scenarios for each genre, we should create films in the adventure and animation genres. Although their best case ROI predictions are not as high as those of the mystery and musical genres, their worst case ROI predictions are both far above 0%.

All movies, no matter the genre, should be released worldwide and not just domestically: the worst case domestic ROI prediction for each of the top 5 genres by domestic ROI is below -25%.

# Question 3:
### Does the average ROI by genre follow a similar trend when compared between domestic and foreign markets?



In [None]:
#drop dupes
df = df.drop_duplicates(subset='movie', keep='first')
df.shape

In [None]:
# creating Q3 dataframe
# add columns 'ROI_domestic', 'ROI_foreign', 'ROI_worldwide'
# note: these repeat the roi_* columns created earlier, under different names

Q3_df = df

Q3_df['ROI_domestic'] = ((Q3_df['domestic_gross'] - Q3_df['production_budget']) / Q3_df['production_budget']) * 100

Q3_df['ROI_foreign'] = ((Q3_df['foreign_gross'] - Q3_df['production_budget']) / Q3_df['production_budget']) * 100

Q3_df['ROI_worldwide'] = ((Q3_df['worldwide_gross'] - Q3_df['production_budget']) / Q3_df['production_budget']) * 100

# the data has a release_date column (not release_year), so take the year from it
Q3_df['year'] = pd.to_datetime(Q3_df['release_date']).dt.year

Q3_df = Q3_df.loc[(Q3_df['year'] > 2010)]

Q3_df.shape

In [None]:
#separate each genre of a movie into their own row
exploded_df = Q3_df.explode('genres')
exploded_df.head()

In [None]:
exploded_df['genres'].value_counts()

In [None]:
# determine top 5 genres (ROI-worldwide) by creating a new dataframe
top5_Q3_df = exploded_df[['genres', 'ROI_domestic', 'ROI_foreign','ROI_worldwide']].copy()
top5_Q3_df.head()

In [None]:
# group dataframe by genres and calculate median
grouped_by_genres = top5_Q3_df.groupby('genres').median()
grouped_by_genres

In [None]:
#get top genres sorted by WORLDWIDE ROI
roi_genres = grouped_by_genres.sort_values(by=['ROI_worldwide'], ascending = False)
roi_genres.head(10)

In [None]:
#isolate top 5 genres (ROI Worldwide) from original dataframe
# TODO: double-check the worldwide ROI calculation; Samantha's list differed (her Comedy vs. my Thriller)

mystery_df = exploded_df.loc[(exploded_df['genres'] == 'Mystery')]

animation_df = exploded_df.loc[(exploded_df['genres'] == 'Animation')]

musical_df = exploded_df.loc[(exploded_df['genres'] == 'Musical')]

scifi_df = exploded_df.loc[(exploded_df['genres'] == 'Sci-Fi')]

adventure_df = exploded_df.loc[(exploded_df['genres'] == 'Adventure')]



In [None]:
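#group by year and take the median of each numeric column
mys2 = mystery_df.groupby('year').median(numeric_only=True)

ani2 = animation_df.groupby('year').median(numeric_only=True)

mus2 = musical_df.groupby('year').median(numeric_only=True)

scifi2 = scifi_df.groupby('year').median(numeric_only=True)

adv2 = adventure_df.groupby('year').median(numeric_only=True)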

In [None]:
# plot the dataframes
mys2.loc[:,['ROI_domestic', 'ROI_foreign', 'ROI_worldwide']].plot();
plt.xlabel('Release Year')
plt.ylabel('ROI %')
plt.title('Mystery Genre')
plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1);


In [None]:
ani2.loc[:,['ROI_domestic', 'ROI_foreign', 'ROI_worldwide']].plot();
plt.xlabel('Release Year')
plt.ylabel('ROI %')
plt.title('Animation Genre')
plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1);

In [None]:
mus2.loc[:,['ROI_domestic', 'ROI_foreign', 'ROI_worldwide']].plot();
plt.xlabel('Release Year')
plt.ylabel('ROI %')
plt.title('Musical Genre')
plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1);

In [None]:
scifi2.loc[:,['ROI_domestic', 'ROI_foreign', 'ROI_worldwide']].plot();
plt.xlabel('Release Year')
plt.ylabel('ROI %')
plt.title('Sci-Fi Genre')
plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1);

In [None]:
adv2.loc[:,['ROI_domestic', 'ROI_foreign', 'ROI_worldwide']].plot();
plt.xlabel('Release Year')
plt.ylabel('ROI %')
plt.title('Adventure Genre')
plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1);

In [None]:
Q3_df.corr(numeric_only=True)

# Question 3 Conclusion
### Does the average ROI by genre follow a similar trend when compared between domestic and foreign markets?

#### The average ROI for the top 5 genres over the past decade shows that there is a bigger return on investment in the foreign market vs the domestic market for Adventure, Sci-Fi, Animation, and potentially Musical movies. The Mystery Genre's domestic and foreign ROI follow the same general positive trend over time.

#### Each genre's domestic ROI hovers around 0%, except for Mystery movies, which generally maintain a positive trajectory. There are a few outliers in the Musical and Mystery genres, but foreign ROI generally outperforms domestic ROI.

#### Based on this section of the analysis, Microsoft's movie studios should focus on producing Adventure, Sci-Fi, and Animation movies because of their greater foreign ROI potential.
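
As a rough sketch of how the domestic and foreign trends could be compared in a single table, the example below groups by genre and year together and computes the foreign-minus-domestic gap. The data is invented, and `foreign_minus_domestic` is a hypothetical column name introduced for illustration.

```python
import pandas as pd

# Toy exploded data; genres, years, and ROI values are made up
toy = pd.DataFrame({
    "genres": ["Adventure"] * 4 + ["Mystery"] * 4,
    "year": [2011, 2011, 2012, 2012] * 2,
    "ROI_domestic": [-10.0, 10.0, -20.0, 0.0, 50.0, 70.0, 80.0, 100.0],
    "ROI_foreign": [40.0, 60.0, 50.0, 90.0, 40.0, 60.0, 70.0, 90.0],
})

# Median ROI per genre and year, for both markets at once
by_year = toy.groupby(["genres", "year"])[["ROI_domestic", "ROI_foreign"]].median()

# Positive gap: foreign ROI exceeded domestic ROI for that genre and year
by_year["foreign_minus_domestic"] = by_year["ROI_foreign"] - by_year["ROI_domestic"]
print(by_year)
```

A gap that stays positive across years for a genre would support the reading that its foreign market is the stronger one.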



In [None]:
release_years = pd.to_datetime(df['release_date']).dt.year
print(f'This cleaned data includes movies from {release_years.min()} to {release_years.max()}.')

In [None]:
# Dropping unnecessary columns for the 3rd question (ignored if the column does not exist)
df.drop(['release_year'],axis=1,inplace=True,errors='ignore')

In [None]:
# Finding top 10 movies by 'worldwide_gross'
top10_worldwide = df.sort_values(by='worldwide_gross', ascending=False).head(10)
top10_worldwide = top10_worldwide.set_index('movie')
top10_worldwide

In [None]:
# Finding top 10 movies by 'domestic_gross'
#top10_domestic = df.sort_values(by='domestic_gross', ascending=False).head(10)
#top10_domestic = top10_domestic.set_index('movie')
#top10_domestic

In [None]:
#fig, ax = plt.subplots(ncols=2, figsize=(16, 5))
#ax.barh(top10_worldwide.index, top10_worldwide["worldwide_gross"])
#ax2.barh(top10_domestic.index, top10_domestic["domestic_gross"])

#ax.set_xlabel("Worldwide Gross")
#ax2.set_xlabel("Domestic Gross")
#ax.set_title("Top 10 Movies Worldwide 2011-2020")
#ax2.set_title("Top 10 Movies Domestic 2011-2020");
#ax.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)
#ax2.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1);

In [None]:
# Visualizing data (bar chart, stacked chart)
ax1 = top10_worldwide.plot(kind='barh')
#ax2 = top10_domestic.plot(kind='barh')        
ax1.set_xlabel("Worldwide Gross")
#ax2.set_ylabel("Domestic Gross")
ax1.set_title("Top 10 Movies Worldwide 2011-2020")
#ax2.set_title("Top 10 Movies Domestic 2011-2020")
ax1.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)
#ax2.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1);

In [None]:
ax1 = top10_worldwide.plot(kind='barh', stacked=True, title='Top 10 Movies Worldwide 2011-2020')
#ax2 = top10_domestic.plot(kind='barh', stacked=True, title='Top 10 Movies Domestic 2011-2020')
ax1.set_xlabel("Worldwide Gross")
#ax2.set_xlabel("Domestic Gross")
ax1.set_title("Top 10 Movies Worldwide 2011-2020")
#ax2.set_title("Top 10 Movies Domestic 2011-2020")
ax1.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)
#ax2.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1);

In [None]:
### Answering Q3: Is there any particular director/producer who appears frequently in the above findings?

In [None]:
# 'director' holds lists, so explode it to one director per row before counting
print(f"The directors who appear the most in 'the top 10 movies by worldwide gross' are {top10_worldwide['director'].explode().value_counts().head(2)}.")

In [None]:
### Finding correlation between directors' productivity and profit

In [None]:
# Exploding the 'director' column so each row holds an individual director
#df['director'] = df['director'].map(lambda x: x.split(","))
individual_df = df.explode('director')
individual_df

In [None]:
# Counting movies made by each director
individual_df['director'].value_counts()

In [None]:
# Creating a column that shows the number of movies made by the individual director
individual_df['#movies_by_director'] = individual_df.groupby(['director'])['movie'].transform('count')

# Dropping unnecessary columns and organizing the dataframe by 'director' and 'movie'
individual_df = individual_df.drop(['production_budget', 'domestic_gross', 'foreign_gross'], axis=1)
individual_df.groupby(by=['director', 'movie']).sum().head(20)

In [None]:
# Creating a column that shows the average profit made by the individual director
individual_df['avg_profit'] = individual_df.groupby(['director'])['profit'].transform('mean')
individual_df['avg_world_gross'] = individual_df.groupby(['director'])['worldwide_gross'].transform('mean')
individual_df.groupby(by=['director', 'movie']).sum().head(20)

In [None]:
# Checking correlation between directors' productivity and profit
individual_df.corr(numeric_only=True)['#movies_by_director'].sort_values()


In [None]:
# Scatter plot of the relationship between directors' productivity and profit
fig, ax = plt.subplots(figsize=(15,5))

ax.scatter(individual_df['#movies_by_director'], individual_df['avg_profit'], alpha=0.3, color="green")
ax.set_title("Productivity of a director vs. average profit")
ax.set_xlabel("Number of movies made by director")
ax.set_ylabel("Average profit");

In [None]:
# Scatter plot of the relationship between directors' productivity and worldwide gross
fig, ax = plt.subplots(figsize=(15,5))

ax.scatter(individual_df['#movies_by_director'], individual_df['avg_world_gross'], alpha=0.3, color="green")
ax.set_title("Productivity of a director vs. average worldwide gross")
ax.set_xlabel("Number of movies made by director")
ax.set_ylabel("Average worldwide gross");
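
Because the exploded dataframe has one row per movie-director pair, a director with many movies appears many times in the correlation and the scatter plots. One possible alternative, sketched below on invented data, is to collapse to one row per director first. The names `per_director`, `n_movies`, and `avg_profit` are ours and do not come from the notebook.

```python
import pandas as pd

# Toy data: one row per (movie, director) pair; names and profits are made up
toy = pd.DataFrame({
    "director": ["Dir A", "Dir A", "Dir A", "Dir B", "Dir C", "Dir C"],
    "movie": ["M1", "M2", "M3", "M4", "M5", "M6"],
    "profit": [10.0, 20.0, 30.0, 5.0, 40.0, 60.0],
})

# Collapse to one row per director so prolific directors are not counted repeatedly
per_director = toy.groupby("director").agg(
    n_movies=("movie", "count"),
    avg_profit=("profit", "mean"),
)

# Pearson correlation between number of movies and average profit
r = per_director["n_movies"].corr(per_director["avg_profit"])
print(per_director)
print(f"r = {r:.2f}")
```

Whether to weight by movie or by director depends on the question being asked; this sketch only shows the per-director variant.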