![example](images/director_shot.jpeg)

![example](images/movie_data_erd.jpeg)

# Project Title

**Authors:** Victor Kang
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Microsoft has been suffering from a severe case of FOMO (Fear-Of-Missing-Out) as they've watched many of their competitors succeed in opening up their own movie studios and creating original video content! Microsoft wants in on the action and is now opening their own movie studio! I've been tasked with providing critical data-based market research to ensure that Microsoft's first movie will be a global success!

***
Primary Objectives and Qualifications:
1. Explore and analyze what **types** of films are **currently** doing the **best at the box office**.
* The key terms here in bold must first be defined. For this Project, they shall be defined as follows:
>* **Types**: There are many ways to classify or categorize films, the most common being by Genre. We can also categorize movies by their budget range, i.e., big budget vs. small budget.
>* **Currently**: Because Microsoft asked for "currently", we know we should only consider modern movies in our upcoming analysis. Exactly how modern will be influenced by our available data. Specific Date Range To Be Determined!
>* **Best at the Box Office**: "Best" will be defined by the financial performance of movies at the Worldwide Box Office. To measure financial performance, we will explore the Worldwide Box Office Gross of movies and compare it to their Production Budgets to calculate the *Profit/Loss* and *Return-On-Investment* metrics.  

* In short, our first objective is to determine which Genres and Budget Ranges of modern movies have produced the highest profit and return-on-investment for their movie studios! 

2. Provide (3) actionable insights / concrete business recommendations based on the analysis. 
* We plan to provide budget range recommendations and how budgets could have a relationship to financial success.
* Genre recommendations, based on which genres have produced the highest profit and return-on-investment.
* Recommendations for cast and crew! Actor, Actress, Director, and Writer recommendations.

* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***
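
The *Profit/Loss* and *Return-On-Investment* metrics defined above can be sketched on a couple of made-up films. The figures and the helper name `add_financial_metrics` are assumptions for illustration only; they are not values from the project data.

```python
import pandas as pd

def add_financial_metrics(df, gross_col="worldwide_gross", budget_col="production_budget"):
    """Return a copy of df with profit_loss and roi (in percent) columns added."""
    out = df.copy()
    out["profit_loss"] = out[gross_col] - out[budget_col]
    out["roi"] = out["profit_loss"] / out[budget_col] * 100
    return out

# Made-up figures, purely for illustration
toy = pd.DataFrame({
    "movie": ["Film A", "Film B"],
    "production_budget": [100_000_000.0, 5_000_000.0],
    "worldwide_gross": [350_000_000.0, 40_000_000.0],
})
metrics = add_financial_metrics(toy)
print(metrics[["movie", "profit_loss", "roi"]])
```

In this toy case the small-budget film has the higher ROI but the lower absolute profit, which is why both metrics are looked at side by side.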

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [None]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3

%matplotlib inline

In [None]:
# Here you run your code to explore the data
conn = sqlite3.connect("zippedData/im.db")

In [None]:
df = pd.read_sql("""SELECT name FROM sqlite_master WHERE type = 'table';""", conn)
df

In [None]:
pd.read_sql("""
SELECT *
FROM movie_basics
;
""", conn)

In [None]:
pd.read_sql("""
SELECT *
FROM movie_basics
LEFT JOIN movie_ratings USING(movie_id)
;
""", conn)

In [None]:
pd.read_sql("""
SELECT *
FROM movie_ratings
;
""", conn)

In [None]:
pd.read_sql("""
SELECT *
FROM movie_basics
LEFT JOIN movie_ratings USING(movie_id)
WHERE movie_id IN 
    (SELECT movie_id
    FROM principals
    WHERE person_id = "nm0000129")
;
""", conn)

In [None]:
pd.read_sql("""
SELECT *
FROM movie_basics
WHERE movie_id IN 
    (SELECT movie_id
    FROM principals
    WHERE person_id = "nm0000129")
;
""", conn)

In [None]:
pd.read_sql("""
SELECT *
FROM persons;
""", conn)

In [None]:
principals = pd.read_sql("""
SELECT *
FROM principals;
""", conn)

In [None]:
cruiseknownfor = pd.read_sql("""
SELECT *
FROM known_for
WHERE person_id = "nm0000129";
""", conn)

In [None]:
cruiseknownfor["movie_id"]

In [None]:
pd.read_sql("""
SELECT *
FROM known_for
WHERE movie_id = "tt0325710";
""", conn)

In [None]:
pd.read_sql("""
SELECT *
FROM principals
WHERE person_id = "nm0000129";
""", conn)

In [None]:
pd.read_sql("""
SELECT *
FROM writers
WHERE person_id = "nm0000129";
""", conn)

In [None]:
pd.read_sql("""
SELECT *
FROM persons
WHERE primary_name LIKE "%cruise%";
""", conn)

In [None]:
pd.read_sql("""
SELECT *
FROM persons
WHERE primary_name LIKE "%jackman%";
""", conn)

In [None]:
#Tom Cruise person_id = "nm0000129"

In [None]:
pd.read_csv("zippedData/bom.movie_gross.csv.gz").head(50)

In [None]:
pd.read_csv("zippedData/tmdb.movies.csv.gz",index_col=0).head(25)

In [None]:
pd.read_csv("zippedData/tn.movie_budgets.csv.gz").head(50)

In [None]:
pd.read_csv("zippedData/rt.movie_info.tsv.gz", delimiter='\t' ).head(25)

In [None]:
pd.read_csv("zippedData/rt.reviews.tsv.gz", delimiter='\t', encoding='latin1').head(25)

In [None]:
pd.read_csv("zippedData/rt.reviews.tsv.gz", delimiter='\t', encoding='latin1')

In [None]:
pd.read_csv("zippedData/movie_gross_data.csv", index_col=0)

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to clean the data
thenumbers = pd.read_csv("zippedData/tn.movie_budgets.csv.gz")

In [None]:
thenumbers.info()

In [None]:
thenumbers.duplicated().values.sum() #No duplicates in dataset

In [None]:
thenumbers["production_budget"].apply(type).value_counts()

In [None]:
thenumbers["domestic_gross"].apply(type).value_counts()

In [None]:
thenumbers["worldwide_gross"].apply(type).value_counts()

In [None]:
thenumbers["production_budget"] = thenumbers["production_budget"].replace({r'\$': '', ',': ''}, regex=True).astype(float)
thenumbers["production_budget"].apply(type).value_counts()

In [None]:
thenumbers["domestic_gross"] = thenumbers["domestic_gross"].replace({r'\$': '', ',': ''}, regex=True).astype(float)
thenumbers["domestic_gross"].apply(type).value_counts()

In [None]:
thenumbers["worldwide_gross"] = thenumbers["worldwide_gross"].replace({r'\$': '', ',': ''}, regex=True).astype(float)
thenumbers["worldwide_gross"].apply(type).value_counts()
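
The three currency conversions above follow the same pattern, so it can be sketched once as a small helper. The name `currency_to_float` and the sample strings are assumptions for illustration; this is not the cell that was actually run.

```python
import pandas as pd

def currency_to_float(series):
    """Strip '$' and ',' from strings such as '$1,234' and return floats."""
    cleaned = series.astype(str).str.replace(r"[$,]", "", regex=True)
    # errors="coerce" turns anything unparseable into NaN instead of raising
    return pd.to_numeric(cleaned, errors="coerce").astype(float)

# Made-up sample strings in the same format as the budget columns
raw = pd.Series(["$425,000,000", "$0", "$1,234"])
budget = currency_to_float(raw)
print(budget)
```

Using a raw string for the pattern avoids Python's invalid-escape warning for `\$`.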

In [None]:
thenumbers["foreign_gross"] = thenumbers["worldwide_gross"] - thenumbers["domestic_gross"]

In [None]:
thenumbers["foreign_gross"].apply(type).value_counts()

In [None]:
thenumbers["profit_loss"] = thenumbers["worldwide_gross"] - thenumbers["production_budget"]

In [None]:
thenumbers["profit_loss"].describe()

In [None]:
thenumbers["roi"] = (thenumbers["profit_loss"] / thenumbers["production_budget"] ) *100

In [None]:
thenumbers.sort_values(by='profit_loss', ascending=False).head(50).describe()

In [None]:
thenumbers.sort_values(by='roi', ascending=False).head(50).describe()

In [None]:
top500moviesbypnl = thenumbers.sort_values(by='profit_loss', ascending=False).head(500)["movie"]

In [None]:
top500moviesbypnl

In [None]:
top500moviesbyroi = thenumbers.sort_values(by='roi', ascending=False).head(500)["movie"]

In [None]:
top500moviesbyroi

In [None]:
#topmoviespnlroi = pd.Series(np.intersect1d(top500moviesbypnl,top500moviesbyroi))

topmoviespnlroi = pd.Series(list(set(top500moviesbypnl).intersection(set(top500moviesbyroi))))

topmoviespnlroi
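
The "top by profit AND top by ROI" selection above can be sketched on a tiny made-up table. The column values and `top_n = 2` are assumptions chosen to keep the example readable (the notebook uses 500).

```python
import pandas as pd

# Made-up values, purely for illustration
toy = pd.DataFrame({
    "movie": ["A", "B", "C", "D"],
    "profit_loss": [900, 500, 50, 10],
    "roi": [120, 800, 950, 5],
})

top_n = 2
top_by_profit = set(toy.nlargest(top_n, "profit_loss")["movie"])
top_by_roi = set(toy.nlargest(top_n, "roi")["movie"])

# Keep only the movies that appear in both top-N lists
in_both = toy[toy["movie"].isin(top_by_profit & top_by_roi)]
print(in_both)
```

Note that matching on the title alone treats same-name movies as one movie, which is the problem the later cells have to work around.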

In [None]:
dfpnlroi = thenumbers[thenumbers["movie"].isin(topmoviespnlroi)]

In [None]:
dfpnlroi.sort_values(by='profit_loss', ascending=False)

In [None]:
dfpnlroi.loc[dfpnlroi['movie'] == 'Twilight']

In [None]:
samenamemovies = dfpnlroi[dfpnlroi.duplicated('movie', keep=False)]

In [None]:
samenamemovies.sort_values(by='profit_loss', ascending=False)

In [None]:
samenamemovies.sort_values(by='roi', ascending=False)

In [None]:
pd.plotting.scatter_matrix(samenamemovies, figsize=(15,15));

In [None]:
samenamemovies.describe()

In [None]:
dfpnlroi.describe()

In [None]:
thenumbers.sort_values(by='profit_loss', ascending=False).head(500).describe() 
#minimum of top 500 pnl is 2.037891e+08 (Paddington)

In [None]:
thenumbers.sort_values(by='roi', ascending=False).head(500)
#minimum of top 500 roi is 759.705987 (The Woman in Black)

In [None]:
samenamemovies2 = samenamemovies[samenamemovies["profit_loss"] > 203789100]
samenamemovies2.sort_values(by = "profit_loss", ascending = False)
#samenamemovies5 = samenamemovies2[samenamemovies2["roi"] > 759.705987]
#samenamemovies5

In [None]:
samenamemovies3 = samenamemovies[samenamemovies["roi"] > 759.705987]
samenamemovies3.sort_values(by = "roi", ascending = False)
#samenamemovies4 = samenamemovies3[samenamemovies3["profit_loss"] > 203789100]
#samenamemovies4

In [None]:
worthymovies = pd.concat([samenamemovies2, samenamemovies3]).drop_duplicates()
worthymovies

In [None]:
unworthymovies = samenamemovies[~samenamemovies.isin(worthymovies)].dropna()
unworthymovies

In [None]:
dfpnlroi

In [None]:
dfpnlroi2 = dfpnlroi[~dfpnlroi.isin(unworthymovies)].dropna()

In [None]:
dfpnlroi2.describe()

In [None]:
thenumbers.describe()

In [None]:
dfpnlroi2['release_date'] = pd.to_datetime(dfpnlroi2['release_date'])
dfpnlroi2.sort_values(by="release_date", ascending=False).head(60)

In [None]:
dfpnlroi2["release_date"].apply(type).value_counts()

In [None]:
dfpnlroi2["release_date"].min()

In [None]:
dfpnlroi2["release_date"].max()

In [None]:
dfpnlroi[dfpnlroi['movie'].str.contains('mission', case=False, regex=False)] #case-insensitive so 'Mission' matches

In [None]:
pd.read_csv("zippedData/tn.movie_budgets.csv.gz")

In [None]:
bomdf = pd.read_csv("zippedData/bom.movie_gross.csv.gz")

In [None]:
newbomdf = pd.read_csv("zippedData/movie_gross_data.csv", index_col=0)

In [None]:
bomdf.info()

In [None]:
bomdf.sort_values(by="year", ascending=False).head(60)

In [None]:
newbomdf[newbomdf['Release Group'].str.contains('Mission', regex=False)]

In [None]:
newbomdf.info()

In [None]:
newbomdf.sort_values(by="year", ascending=False).head(60)

In [None]:
budgets = pd.read_csv("zippedData/movies budgets.csv")

In [None]:
justbudgets = budgets[['title', 'budget', 'release_date']].copy() #just extract budgets (droplist is only defined further down)

In [None]:
justbudgets.drop(justbudgets[justbudgets['budget'] <= 1].index, inplace = True) #remove records with no budget info

In [None]:
justbudgets.drop_duplicates(inplace=True)

In [None]:
justbudgets["movie (year)"] = justbudgets['title'].astype(str) + " (" + justbudgets['release_date'].str[0:4] + ")"

In [None]:
justbudgets

In [None]:
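justbudgetsdup = justbudgets[justbudgets.duplicated(subset="movie (year)", keep=False)] #exact duplicate rows were already dropped, so look for repeated movie (year) keys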

In [None]:
justbudgetsdup.head(50)

In [None]:
#pd.merge(df1,df2,on ='Name', how ='left')

In [None]:
budgets.info()

In [None]:
droplist = list(budgets.columns)

In [None]:
droplist.remove('title')

In [None]:
droplist.remove('budget')

In [None]:
droplist.remove('release_date')

In [None]:
newbomdf.rename({'Release Group': 'title'}, axis=1, inplace=True)

In [None]:
newbomdf.duplicated().values.sum()

In [None]:
newbomdf

In [None]:
newbomdf["movie (year)"] = newbomdf['title'].astype(str) + " (" + newbomdf['year'].astype(str).str[0:4] + ")"

In [None]:
newbomdf[newbomdf['movie (year)'].str.contains('mission', case=False, regex=False)] #case-insensitive so 'Mission' matches

In [None]:
newbomdf

In [None]:
mergenewbombudgets = pd.merge(newbomdf,justbudgets,on ='movie (year)', how ='left')

In [None]:
mergenewbombudgets

In [None]:
mergenewbombudgets.dropna(inplace=True)

In [None]:
mergenewbombudgets

In [None]:
thenumbers #on my first pass I did not create a movie+year key, so movies sharing the same name needed extra handling
#this time I build the key ahead of time

In [None]:
thenumbers["movie (year)"] = thenumbers['movie'].astype(str) + " (" + thenumbers['release_date'].astype(str).str[-4:] + ")"
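
Why the `movie (year)` key helps can be sketched with two made-up films that share a title. All names and numbers below are assumptions for illustration, not rows from the datasets.

```python
import pandas as pd

# Made-up rows: two different films with the same title
budgets_toy = pd.DataFrame({
    "movie": ["Same Title", "Same Title"],
    "release_date": ["Mar 13, 2015", "Feb 15, 1950"],
    "production_budget": [100, 10],
})
gross_toy = pd.DataFrame({
    "title": ["Same Title"],
    "year": [2015],
    "Worldwide": [500],
})

# Merging on the title alone pairs the 2015 gross with both films
by_title = pd.merge(gross_toy, budgets_toy, left_on="title", right_on="movie", how="left")

# Building a "title (year)" key on both sides keeps only the matching release
budgets_toy["movie (year)"] = budgets_toy["movie"] + " (" + budgets_toy["release_date"].str[-4:] + ")"
gross_toy["movie (year)"] = gross_toy["title"] + " (" + gross_toy["year"].astype(str) + ")"
merged = pd.merge(gross_toy, budgets_toy, on="movie (year)", how="left")
print(len(by_title), len(merged))
```

The key is still not unique if two films share both title and release year, which is what the "imposter" checks later in the notebook deal with.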

In [None]:
thenumbers.head(50)

In [None]:
thenumbers[thenumbers['movie (year)'].str.contains('Pirates', regex=False)]

In [None]:
thenumbers['movie'] = thenumbers['movie'].str.replace("â","'")

In [None]:
mergenewbombudgets[mergenewbombudgets['movie (year)'].str.contains('Mission', regex=False)]

In [None]:
testing = pd.merge(mergenewbombudgets,thenumbers,on ='movie (year)', how ='outer')

In [None]:
testing[["movie (year)","budget","production_budget"]]

In [None]:
# Row-wise max that skips NaN, so a missing value in either column does not hide the other
testing['maxBudget'] = testing[["budget", "production_budget"]].max(axis=1)

In [None]:
testing['maxBudget'] = testing['maxBudget'].fillna(testing['production_budget'])
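
A note on combining the two budget columns: Python's built-in `max` is order-sensitive when one argument is NaN, so a NaN-aware row-wise maximum is a safer way to get the same result. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Built-in max keeps its first argument unless a later one compares greater,
# and every comparison with NaN is False
first_nan = max(float("nan"), 5.0)   # stays NaN
second_nan = max(5.0, float("nan"))  # stays 5.0

# Made-up budgets from two sources, with gaps in each
toy = pd.DataFrame({
    "budget": [100.0, np.nan, 30.0, np.nan],
    "production_budget": [90.0, 50.0, np.nan, np.nan],
})

# DataFrame.max skips NaN by default; a row is NaN only if both sources are missing
toy["maxBudget"] = toy[["budget", "production_budget"]].max(axis=1)
print(toy)
```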

In [None]:
testing.info()

In [None]:
testing[["movie (year)","budget","production_budget", 'maxBudget']].head(2000)

In [None]:
testing.head(4000)

In [None]:
testing[testing['movie (year)'].str.contains('Mission', regex=False)]

In [None]:
test = testing.drop(["Rank","%","%.1"], axis=1)

In [None]:
test[~test.isnull().any(axis=1)]

In [None]:
test[['Worldwide','worldwide_gross','movie (year)']]

In [None]:
test["Worldwide"] = test["Worldwide"].replace({r'\$': '', ',': ''}, regex=True).astype(float)
test["Worldwide"].apply(type).value_counts()

In [None]:
test["Domestic"] = test["Domestic"].replace({r'\$': '', ',': '', '-': '0'}, regex=True).astype(float)
test["Domestic"].apply(type).value_counts()

In [None]:
test["Foreign"] = test["Foreign"].replace({r'\$': '', ',': '', '-': '0'}, regex=True).astype(float)
test["Foreign"].apply(type).value_counts()

In [None]:
# Most up-to-date worldwide gross: row-wise max across both sources, skipping NaN
test['WW UTD'] = test[["Worldwide", "worldwide_gross"]].max(axis=1)

In [None]:
test[['Worldwide','worldwide_gross','WW UTD','movie (year)']].head(4800)

In [None]:
# Most up-to-date domestic gross: row-wise max across both sources, skipping NaN
test['DOM UTD'] = test[["Domestic", "domestic_gross"]].max(axis=1)

In [None]:
# Most up-to-date foreign gross: row-wise max across both sources, skipping NaN
test['FOR UTD'] = test[["Foreign", "foreign_gross"]].max(axis=1)

In [None]:
test["profit_loss"] = test["WW UTD"] - test["maxBudget"]

In [None]:
test["roi"] = (test["profit_loss"] / test["maxBudget"] ) *100

In [None]:
test["Year"] = test['movie (year)'].str[-5:-1].astype(int)
test.info()

In [None]:
test.drop("year2", axis=1, inplace=True, errors="ignore") #'year2' is not created anywhere above, so ignore it if it is missing

In [None]:
test.sort_values(by='maxBudget').head(500)

In [None]:
test.drop(test[test['maxBudget'] < 1000000].index, inplace = True) 
#dropping all records with budget data less than $1mil as its causing skews

In [None]:
test2 = test[test.columns[~test.isnull().any()]]

In [None]:
test3 = test2.sort_values(by="Year", ascending = False).head(5000) 
#take the 5000 most recent movies in the database, because we want "currently", which here equates to 1999 onward

In [None]:
test[test['movie (year)'].str.contains('Mission', regex=False)]

In [None]:
test3roi500 = test3.sort_values(by="roi", ascending = False).head(1000)['movie (year)']

In [None]:
test3pnl500 = test3.sort_values(by="profit_loss", ascending = False).head(500)['movie (year)']

In [None]:
test3[test3['movie (year)'] == "Edge of Tomorrow (2014)"]

In [None]:
test3[test3['movie (year)'] == "How to Train Your Dragon 2 (2014)"]

In [None]:
test3pnl500.tail(6)

In [None]:
test3[test3['movie (year)'] == "Crouching Tiger, Hidden Dragon (2000)"]

In [None]:
test3pnlroi500 = pd.Series(list(set(test3roi500).intersection(set(test3pnl500))))

In [None]:
test3pnlroi500.head(60)

In [None]:
test3pnlroi500[test3pnlroi500 == "Edge of Tomorrow (2014)"]

In [None]:
test4 = test3[test3["movie (year)"].isin(test3pnlroi500)]
test4

In [None]:
test4.describe()

In [None]:
test4[test4['movie (year)'].str.contains('Mission', regex=False)]

In [None]:
test4.info()

In [None]:
pd.set_option('display.max_rows', None)
test4

In [None]:
pd.reset_option('display.max_rows')

In [None]:
test4=test4.drop(2513)

In [None]:
test4.describe()

In [None]:
test4.sort_values(by="profit_loss", ascending=False).head(50)

In [None]:
test4.sort_values(by="roi", ascending=False).head(50)

In [None]:
test4["Profit/Loss ($, Millions)"] = round(test4['profit_loss']/1000000)

In [None]:
test4["RoI (%)"] = round(test4['roi'])

In [None]:
test4["Budget ($, Millions)"] = round(test4['maxBudget']/1000000)

In [None]:
test4["Total Box Office ($, Millions)"] = round(test4['WW UTD']/1000000)

In [None]:
test4

In [None]:
test5 = test4[["movie (year)","Budget ($, Millions)","Total Box Office ($, Millions)","Profit/Loss ($, Millions)","RoI (%)","Year"]]

In [None]:
test5.head(100)

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to model the data
pd.plotting.scatter_matrix(test5, figsize=(20,20));

In [None]:
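# pairplot creates its own figure, so a separate plt.figure call is not needed;
# palette only takes effect together with hue, so it is dropped here
sns.pairplot(test5)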

In [None]:
test5.describe()

In [None]:
test5.cov(numeric_only=True) #skip the text column 'movie (year)'

In [None]:
test5.corr(method='pearson', numeric_only=True) #skip the text column 'movie (year)'

In [None]:
principals = pd.read_sql("""
SELECT pr.movie_id, pr.person_id, primary_name, primary_title, start_year
FROM principals AS pr
JOIN persons AS pe ON pr.person_id=pe.person_id
JOIN movie_basics AS mb ON pr.movie_id=mb.movie_id
;
""", conn)


In [None]:
principals.loc[principals["primary_title"].str.contains('Top Gun'), 'start_year'] = 2022

In [None]:
principals["movie (year)"] = principals['primary_title'].astype(str) + " (" + principals['start_year'].astype(str).str[-4:] + ")"

In [None]:
principals["movie (year)"].value_counts()

In [None]:
#principals['start_year'] = principals[principals["primary_title"].str.contains('Top Gun')]['start_year'] == 2022

In [None]:
principals[principals["primary_title"].str.contains('Top Gun')].index

In [None]:
principals[principals["primary_title"].str.contains('Top Gun')]

In [None]:
principals[principals["movie (year)"].isin(test5["movie (year)"])]["movie (year)"].value_counts()

In [None]:
principals[principals["movie (year)"].isin(test5["movie (year)"])]["primary_name"].value_counts().head(50)

In [None]:
principals[principals["primary_name"].str.contains('Cruise')]['movie (year)'].isin(test5['movie (year)'])

In [None]:
pd.read_sql("""
SELECT pr.movie_id, pr.person_id, primary_name, primary_title, start_year
FROM principals AS pr
JOIN persons AS pe ON pr.person_id=pe.person_id
JOIN movie_basics AS mb ON pr.movie_id=mb.movie_id;
""", conn)


In [None]:
pd.read_sql("""
SELECT *
FROM persons
WHERE primary_profession LIKE "%writer%"
;
""", conn)


In [None]:
interesting = budgets[['title', 'genres','release_date','credits']]

In [None]:
dropper = list(budgets.columns)

In [None]:
dropper.remove('title')

In [None]:
dropper.remove('genres')

In [None]:
dropper.remove('release_date')

In [None]:
dropper.remove('credits')

In [None]:
interesting = budgets.drop(dropper,axis=1)

In [None]:
interesting['movie (year)'] = interesting['title'].astype(str) + " (" + interesting['release_date'].astype(str).str[0:4] + ")"
interesting2 = interesting.dropna().drop_duplicates()

In [None]:
interesting3 = interesting2[interesting2['movie (year)'].isin(test5['movie (year)'])].copy() #copy so new columns can be added safely

In [None]:
interesting3.head(60)

In [None]:
interesting3["genre_split"] = interesting3['genres'].str.split("-")

In [None]:
interesting4 = interesting3.explode("genre_split")
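
The split-then-explode step above can be sketched on two made-up rows. The titles and genre strings are assumptions for illustration; the `-` separator matches what the cells above use.

```python
import pandas as pd

# Made-up rows in the same shape as the genres column
toy = pd.DataFrame({
    "movie (year)": ["Film A (2010)", "Film B (2015)"],
    "genres": ["Action-Adventure", "Comedy"],
})

# Turn "Action-Adventure" into ["Action", "Adventure"] ...
toy["genre_split"] = toy["genres"].str.split("-")
# ... then give each genre its own row, repeating the other columns
exploded = toy.explode("genre_split")
counts = exploded["genre_split"].value_counts()
print(exploded)
```

One thing to watch for: a genre name that itself contains a hyphen would be split in two by this approach.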

In [None]:
interesting4

In [None]:
#split in string of column

In [None]:
newprincipals = pd.read_csv("zippedData/title.principals.tsv.gz", delimiter='\t' )

In [None]:
pd.read_csv("zippedData/title.principals.tsv.gz", delimiter='\t' )

In [None]:
newtitles = pd.read_csv("zippedData/title.basics.tsv.gz", delimiter='\t' )

In [None]:
movietitles = newtitles[newtitles['titleType'] == 'movie'].copy() #copy so the inplace drop below does not act on a slice

In [None]:
movietitles.drop(["isAdult","endYear",'runtimeMinutes'],axis=1,inplace = True)

In [None]:
movietitles = movietitles[movietitles.startYear != '\\N']

In [None]:
movietitles = movietitles[movietitles['startYear'].astype(int) > 1998]

In [None]:
movietitles['movie (year)'] = movietitles['primaryTitle'].astype(str) + " (" + movietitles['startYear'].astype(str) + ")"
movietitles = movietitles.dropna().drop_duplicates()
movietitles.head(50)

In [None]:
movietitles.drop(["titleType","primaryTitle",'originalTitle'],axis=1, inplace=True)

In [None]:
moviekey = movietitles.set_index('tconst')
moviekey

In [None]:
newprincipals

In [None]:
newprincipals['movie (year)'] = newprincipals['tconst'].map(moviekey['movie (year)'])
newprincipals
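
The lookup above works because `Series.map` accepts another Series and matches on its index. A minimal sketch with made-up ids (the `tt001`-style values and frame names are assumptions for illustration):

```python
import pandas as pd

# Made-up title table and cast table
titles = pd.DataFrame({
    "tconst": ["tt001", "tt002"],
    "movie (year)": ["Film A (2010)", "Film B (2015)"],
})
cast_rows = pd.DataFrame({
    "tconst": ["tt001", "tt001", "tt003"],
    "nconst": ["nm01", "nm02", "nm03"],
})

# Index the lookup by tconst, then map each cast row to its movie key
key = titles.set_index("tconst")["movie (year)"]
cast_rows["movie (year)"] = cast_rows["tconst"].map(key)
print(cast_rows)
```

Ids that are missing from the lookup (here `tt003`) come back as NaN, which is why many rows have no key after the 1999+ filter.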

In [None]:
newprincipals['movie (year)'].value_counts()

In [None]:
topprincipals = newprincipals[newprincipals['movie (year)'].isin(test5['movie (year)'])].copy() #copy so new columns can be added safely

In [None]:
topprincipals['movie (year)'].value_counts()

In [None]:
#this may be interesting... the principals table only shows 10 principals per movie, so the max value of value_counts should be 10 for any movie
#this shows there are duplicate named movies! More investigation required on those 4 movies!

In [None]:
topprincipals[topprincipals['movie (year)'].isin(
    ['Cinderella (2015)','Coco (2017)', 'Alice in Wonderland (2010)','Beauty and the Beast (2017)'])].tail(60)
#we see here that tconst tt5089556 is a duplicate bollywood  cinderella movie, DELETE
#tt11861230 is an imposter of B&B
#tt2049386 is an imposter of Alice
#tt7002100 is an imposter of Coco

In [None]:
imposters = ['tt5089556' , 'tt11861230' ,'tt2049386' , 'tt7002100']

In [None]:
topprincipals[~topprincipals['tconst'].isin(imposters)]['movie (year)'].value_counts()
#this worked cleaning and removing imposters!
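
Instead of spotting the imposters by eye from counts above 10, the same check can be sketched by counting distinct `tconst` values per `movie (year)` key. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows: "Film A (2015)" is claimed by two different title ids
toy = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt2", "tt3"],
    "movie (year)": ["Film A (2015)", "Film A (2015)", "Film A (2015)", "Film B (2017)"],
})

# A key that maps to more than one tconst needs a manual look
ids_per_key = toy.groupby("movie (year)")["tconst"].nunique()
ambiguous = ids_per_key[ids_per_key > 1].index.tolist()
print(toy[toy["movie (year)"].isin(ambiguous)])
```

This only flags the candidates; deciding which `tconst` is the real film still takes a manual check like the one above.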

In [None]:
newnames = pd.read_csv("zippedData/name.basics.tsv.gz", delimiter='\t' )

In [None]:
newnames.head()

In [None]:
newnames = newnames.drop_duplicates() #assign the result, otherwise the duplicates are not actually removed

In [None]:
namekey = newnames.set_index('nconst')

In [None]:
namekey

In [None]:
newnames[newnames['nconst'] == 'nm0000158']

In [None]:
topprincipals['primary_name'] = topprincipals['nconst'].map(namekey['primaryName'])
topprincipals

In [None]:
topprincipals['primary_name'].value_counts().head(50)

In [None]:
top2principals = topprincipals[topprincipals['ordering'] < 3]

In [None]:
top1principals = topprincipals[topprincipals['ordering'] < 2]

In [None]:
top1principals['primary_name'].value_counts().head(50)

In [None]:
topprincipals[topprincipals['primary_name'] == 'Mike Myers']

In [None]:
test5[test5['movie (year)'].str.contains("Austin")]

In [None]:
top1principals.head(50)

In [None]:
top2principals.head(50)

In [None]:
newcrew = pd.read_csv("zippedData/title.crew.tsv.gz", delimiter='\t' )

In [None]:
newcrew.info()

In [None]:
newcrew["director_split"] = newcrew['directors'].str.split(",")
newcrew2 = newcrew.explode("director_split")

In [None]:
directorkey = newcrew2[["tconst","director_split"]].copy() #copy so new columns can be added safely

In [None]:
directorkey

In [None]:
directorkey['movie (year)'] = directorkey['tconst'].map(moviekey['movie (year)'])
#directorkey['director'] = directorkey['nconst'].map(namekey['primaryName'])

In [None]:
directorkey.loc[directorkey['movie (year)'] == 'Ted Bundy Had a Son (2022)']

In [None]:
namekey

In [None]:
directorkey['director'] = directorkey['director_split'].map(namekey['primaryName'])

In [None]:
directorkey[directorkey['tconst'] == 'tt11861230']

In [None]:
topdirectors = directorkey[directorkey['movie (year)'].isin(test5['movie (year)'])]

In [None]:
topdirectors['movie (year)'].value_counts()

In [None]:
topdirectors['director'].isnull().sum()

In [None]:
topdirectors

In [None]:
topdirectors[topdirectors['director'].isnull()]

In [None]:
topdirectors.loc[2005836]

In [None]:
topdirectors.head(60)

In [None]:
topprincipals[topprincipals['category'] == 'director']['movie (year)'].value_counts()

In [None]:
topdirectors['director'].value_counts().head(60)

In [None]:
topdirectors['movie (year)'].value_counts()

In [None]:
topprincipals.loc[topprincipals[''].duplicated()].head(60)
#important, this is how you find who is a co-director vs main director
#nope not here

In [None]:
topdirectors.loc[topdirectors['movie (year)'].duplicated(keep=False)].head(60)

In [None]:
topprincipals[topprincipals['ordering']==5]
#this is where I noticed that ordering number 5 usually holds the main director (checked against category below)

In [None]:
topprincipals[topprincipals['ordering']==5]['category'].value_counts()

In [None]:
maindirectors = topprincipals[topprincipals['ordering']==5]

In [None]:
maindirectors[maindirectors['movie (year)'].duplicated(keep=False)]
#found errors in data

In [None]:
topprincipals[topprincipals['movie (year)'] == 'Cinderella (2015)']
#found that tconst tt5089556 is indian version of Cinderella (2015). DROP IT

In [None]:
cleantopprincipals = topprincipals[~topprincipals['tconst'].isin(imposters)]

In [None]:
cleantopprincipals[cleantopprincipals['ordering']==5]

In [None]:
maindirectors = cleantopprincipals[cleantopprincipals['ordering']==5]

In [None]:
maindirectors.duplicated().sum()

In [None]:
maindirectors['movie (year)'].value_counts()

In [None]:
maindirectors[maindirectors['category'] != 'director']
#found problem where movies have directors who also acted in the movie!

In [None]:
cleantopprincipals[cleantopprincipals['ordering']==5]['category'].value_counts()

In [None]:
cleantopprincipals[cleantopprincipals['movie (year)']=='Top Gun: Maverick (2022)']

In [None]:
onlymaindirectors = maindirectors[maindirectors['category'] == 'director']

In [None]:
onlymaindirectors['primary_name'].value_counts()

In [None]:
onlymaindirectors['movie (year)'].value_counts()

In [None]:
cleantopprincipals.head(60)

In [None]:
cleantopprincipals[cleantopprincipals['ordering']==7]['category'].value_counts()

In [None]:
cleantopprincipals[cleantopprincipals['ordering']==9]['category'].value_counts()

In [None]:
mainwriters = cleantopprincipals[cleantopprincipals['ordering'].isin([6,7])]

In [None]:
onlymainwriters = mainwriters[mainwriters['category'] == 'writer']

In [None]:
onlymainwriters['primary_name'].value_counts()

In [None]:
onlymainwriters['movie (year)'].value_counts()

In [None]:
onlymainwriters.head(60)

In [None]:
onlymainwriters = onlymainwriters.drop_duplicates(subset='movie (year)', keep='first') #keep one writer per movie

In [None]:
onlymainwriters['movie (year)'].value_counts()

In [None]:
onlymainwriters['primary_name'].value_counts()

In [None]:
top3principals = cleantopprincipals[cleantopprincipals['ordering'] < 4]

In [None]:
top3males = top3principals[top3principals['category'] == 'actor']

In [None]:
top3females = top3principals[top3principals['category'] == 'actress']

In [None]:
top3males['movie (year)'].value_counts()

In [None]:
top3females['movie (year)'].value_counts()

In [None]:
top3principals['category'].value_counts()

In [None]:
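# top2males was never defined above; assuming it means the actors among top2principals
top2males = top2principals[top2principals['category'] == 'actor']
top2males['primary_name'].value_counts().head(60)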

In [None]:
toplead = cleantopprincipals[cleantopprincipals['ordering'] < 2]

In [None]:
topmale = toplead[toplead['category'] == 'actor']

In [None]:
topfemale = toplead[toplead['category'] == 'actress']

In [None]:
topmale['primary_name'].value_counts().head(60)

In [None]:
topfemale['primary_name'].value_counts()

In [None]:
topfemale['movie (year)'].value_counts()

In [None]:
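# top4males / top4females were never defined above; assuming they follow the top3 pattern (ordering < 5)
top4principals = cleantopprincipals[cleantopprincipals['ordering'] < 5]
top4males = top4principals[top4principals['category'] == 'actor'].copy()
top4females = top4principals[top4principals['category'] == 'actress'].copy()
top4females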

In [None]:
top4females.drop_duplicates(subset='movie (year)', keep='first', inplace=True )

In [None]:
top4females['movie (year)'].value_counts()

In [None]:
top4males['movie (year)'].value_counts()

In [None]:
top4males.drop_duplicates(subset='movie (year)', keep='first', inplace=True )

In [None]:
top4males['primary_name'].value_counts()

In [None]:
top4females['primary_name'].value_counts()

In [None]:
%whos

In [None]:
test5

In [None]:
omdmerge = onlymaindirectors[["movie (year)","primary_name"]]

In [None]:
omdmerge = omdmerge.rename(columns ={'primary_name':'director'})


In [None]:
omdmerge.set_index('movie (year)')

In [None]:
test6 = pd.merge(test5, omdmerge, on ='movie (year)',how ='left')

In [None]:
test6.groupby('director').mean(numeric_only=True).sort_values(by = 'Total Box Office ($, Millions)',ascending=False).head(5)
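
A per-director average can be misleading when a director has only one film in the pool, so it may help to report the film count next to the mean. A minimal sketch with made-up directors and figures (all names and numbers are assumptions for illustration):

```python
import pandas as pd

# Made-up movies and director credits
movies = pd.DataFrame({
    "movie (year)": ["Film A (2010)", "Film B (2015)", "Film C (2018)"],
    "Total Box Office ($, Millions)": [300.0, 500.0, 100.0],
})
directors = pd.DataFrame({
    "movie (year)": ["Film A (2010)", "Film B (2015)", "Film C (2018)"],
    "director": ["Director X", "Director X", "Director Y"],
})

merged = pd.merge(movies, directors, on="movie (year)", how="left")

# Named aggregation: mean box office plus how many films it is based on
summary = (
    merged.groupby("director")
    .agg(
        mean_box_office=("Total Box Office ($, Millions)", "mean"),
        n_movies=("movie (year)", "count"),
    )
    .sort_values("mean_box_office", ascending=False)
)
print(summary)
```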

In [None]:
omwmerge = onlymainwriters[["movie (year)","primary_name"]].rename(columns ={'primary_name':'writer'})
omwmerge.set_index('movie (year)')

In [None]:
test7 = pd.merge(test6, omwmerge, on ='movie (year)',how ='left')

In [None]:
test7.groupby('writer').mean(numeric_only=True).sort_values(by = 'Total Box Office ($, Millions)',ascending=False).head(5)

In [None]:
malemerge = top4males[["movie (year)","primary_name"]].rename(columns ={'primary_name':'lead actor'})
malemerge.set_index('movie (year)')

In [None]:
femalemerge = top4females[["movie (year)","primary_name"]].rename(columns ={'primary_name':'lead actress'})
femalemerge.set_index('movie (year)')

In [None]:
test8 = pd.merge(test7, malemerge, on ='movie (year)',how ='left')

In [None]:
test9 = pd.merge(test8, femalemerge, on ='movie (year)',how ='left')

In [None]:
test9.loc[test9['lead actor'] == 'Leonardo DiCaprio']

In [None]:
test9.groupby('lead actor').mean(numeric_only=True).sort_values(by = 'Profit/Loss ($, Millions)',ascending=False).head(50)

In [None]:
test9.groupby('lead actor').mean(numeric_only=True).sort_values(by = 'RoI (%)',ascending=False).head(50)

In [None]:
test9['lead actor'].value_counts()

In [None]:
test9['lead actress'].value_counts().head(25)

In [None]:
test9[test9['lead actor'] == 'Tom Cruise']

In [None]:
test9.describe()

In [None]:
test9[test9["movie (year)"] == 'Crouching Tiger, Hidden Dragon (2000)']

In [None]:
interesting4[['movie (year)','genre_split']]

In [None]:
interesting3

In [None]:
genremerge = interesting3[["movie (year)","genre_split"]]
genremerge.set_index('movie (year)')
genremerge

In [None]:
testgenre = pd.merge(test5, genremerge, on ='movie (year)',how ='left')
testgenre

In [None]:
testgenre2 = testgenre.explode("genre_split")
testgenre2

In [None]:
testgenre2['genre_split'].value_counts()

In [None]:
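# genrecountdata was never defined above; assuming it holds the genre counts from the previous cell
genrecountdata = testgenre2['genre_split'].value_counts()
genrecountdata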

In [None]:
plt.figure(figsize=(18,20))
# Draw the seaborn barplot
sns.countplot(y='genre_split', data=testgenre2 )
# Set the barplot's title.
plt.title('Genre Counts of Top Movie Pool')
# Label the y-axis
plt.ylabel('Genre')
# Label the x-axis
plt.xlabel('Count')

In [None]:
testgenre2.groupby('genre_split').mean(numeric_only=True).sort_values(by = 'Profit/Loss ($, Millions)',ascending=False)

In [None]:
testgenre2.groupby('genre_split').mean(numeric_only=True).sort_values(by = 'RoI (%)',ascending=False)

In [None]:
testgenre3 = testgenre2.groupby('genre_split').mean(numeric_only=True).sort_values(by = 'Profit/Loss ($, Millions)',ascending=False)

In [None]:
testgenre4 = testgenre3.reset_index()

In [None]:
testgenre5 = testgenre2.groupby('genre_split').mean(numeric_only=True).sort_values(by = 'RoI (%)',ascending=False).reset_index()

In [None]:
testgenre4

In [None]:
test5

In [None]:
# not needed dont use

sns.set(style="whitegrid")
# Set the scale plot to be larger and easier to read.
sns.set_context("talk")
# Create a plot and size it appropriately for information shown
plt.figure(figsize=(15,12))
# Draw the seaborn barplot
sns.barplot(y = keys, x = values, alpha=0.8, palette="deep")
# Set the barplot's title.
plt.title('Top 100 grossing films by Genre')
# Label the y-axis
plt.ylabel('Genre', fontsize=20)
# Label the x-axis
plt.xlabel('Count', fontsize=20)

In [None]:
# Create a plot and set the appropriate size
plt.figure(figsize=(15, 12))
# Draw a seaborn scatterplot of total box office against production budget
# (palette is dropped because it only takes effect together with hue)
sns.scatterplot(x=test5['Budget ($, Millions)'],
                y=test5['Total Box Office ($, Millions)'], alpha=0.6)

#sns.regplot(x=test5['Budget ($, Millions)'], y=test5['Total Box Office ($, Millions)'], line_kws={"color":"r","alpha":0.7})

# Set title of plot
plt.title('Box Office VS Production Budgets')
# Set y-axis label and fontsize
plt.ylabel('Total Box Office ($, Millions)')

# Set x-axis label and fontsize
plt.xlabel('Budget ($, Millions)')

# Change x-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='x')
# Change y-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='y')
# Show the plot
plt.show()

In [None]:
# Create a plot and set the appropriate size
plt.figure(figsize=(15, 12))
# Draw a seaborn scatterplot of RoI against production budget.
# (palette is dropped: it has no effect without a hue variable)
sns.scatterplot(x=test5['Budget ($, Millions)'],
                y=test5['RoI (%)'], alpha=0.6)
# Set title of plot
plt.title('RoI VS Production Budgets')
# Set y-axis label and fontsize
plt.ylabel('RoI (%)')

# Set x-axis label and fontsize
plt.xlabel('Budget ($, Millions)')

# Change x-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='x')
# Change y-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='y')
# Show the plot
plt.show()

In [None]:
# Create a plot and set the appropriate size
plt.figure(figsize=(15, 12))
# Draw a seaborn scatterplot of profit/loss against production budget.
# (palette is dropped: it has no effect without a hue variable)
sns.scatterplot(x=test5['Budget ($, Millions)'],
                y=test5['Profit/Loss ($, Millions)'], alpha=0.6)
# Set title of plot
plt.title('Profit/Loss VS Production Budgets')
# Set y-axis label and fontsize
plt.ylabel('Profit/Loss ($, Millions)')

# Set x-axis label and fontsize
plt.xlabel('Budget ($, Millions)')

# Change x-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='x')
# Change y-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='y')
# Show the plot
plt.show()

In [None]:
# Create a plot and set the appropriate size
plt.figure(figsize=(15, 12))
# Draw a seaborn 2D histogram of profit/loss against production budget.
# (palette is dropped: it has no effect without a hue variable)
sns.histplot(x=test5['Budget ($, Millions)'],
             y=test5['Profit/Loss ($, Millions)'], alpha=0.6)
# Set title of plot
plt.title('Profit/Loss VS Production Budgets')
# Set y-axis label and fontsize
plt.ylabel('Profit/Loss ($, Millions)')

# Set x-axis label and fontsize
plt.xlabel('Budget ($, Millions)')

# Change x-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='x')
# Change y-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='y')
# Show the plot
plt.show()

In [None]:
# Create a plot and set the appropriate size
plt.figure(figsize=(15, 12))
# Draw a seaborn scatterplot of average profit/loss against average production
# budget, with one point per genre.
sns.scatterplot(x='Budget ($, Millions)', y='Profit/Loss ($, Millions)', data=testgenre4, hue='genre_split', s=150)
# Set title of plot
plt.title('Profit/Loss VS Production Budgets')
# Set y-axis label and fontsize
plt.ylabel('Profit/Loss ($, Millions)')

# Set x-axis label and fontsize
plt.xlabel('Budget ($, Millions)')

# Change x-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='x')
# Change y-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='y')
# Show the plot
plt.show()

In [None]:
# Create a plot and set the appropriate size
plt.figure(figsize=(15, 12))
# Draw a seaborn scatterplot of average RoI against average production budget,
# with one point per genre.
sns.scatterplot(x='Budget ($, Millions)', y='RoI (%)', data=testgenre5, hue='genre_split', s=150)
# Set title of plot
plt.title('RoI (%) VS Production Budgets')
# Set y-axis label and fontsize
plt.ylabel('RoI (%)')

# Set x-axis label and fontsize
plt.xlabel('Budget ($, Millions)')

# Change x-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='x')
# Change y-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='y')
# Show the plot
plt.show()

In [None]:
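# Average the numeric metrics per lead actor; numeric_only=True skips any text columns
malemetrics = test9.groupby('lead actor').mean(numeric_only=True)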

In [None]:
malemetrics.loc['Tom Cruise']

In [None]:
# Note: despite the "top10" name, this keeps the 14 most frequently billed lead actors
top10maleVC = test9['lead actor'].value_counts().head(14)

In [None]:
top10maleVClist = top10maleVC.index.tolist()

In [None]:
malemetrics.loc[top10maleVClist]

In [None]:
maledata = malemetrics.loc[top10maleVClist].reset_index()

In [None]:
# Create a plot and set the appropriate size
plt.figure(figsize=(15, 12))
# Draw a seaborn scatterplot of average profit/loss against average RoI,
# with one point per lead actor.
sns.scatterplot(y='Profit/Loss ($, Millions)', x='RoI (%)', data=maledata, hue='lead actor', s=150)
# Set title of plot
plt.title('Profit/Loss ($, Millions) VS RoI (%)')
# Set y-axis label and fontsize
plt.ylabel('Profit/Loss ($, Millions)')

# Set x-axis label and fontsize
plt.xlabel('RoI (%)')

# Change x-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='x')
# Change y-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='y')
# Show the plot
plt.show()

In [None]:
# Create dataframe slice for the boxplot; it must exist before the plot cell runs
maleboxdata = test9[test9['lead actor'].isin(top10maleVClist)]

In [None]:
maleboxdata

In [None]:
# Create a plot and define its size.
plt.figure(figsize=(20, 12))
# Draw a boxplot showing the profit/loss distribution for each lead actor.
sns.boxplot(x='lead actor',
            y='Profit/Loss ($, Millions)', data=maleboxdata)
# Set the plot title
plt.title('Distribution of Profit by Lead Actor')
# Set the x-axis label
plt.xlabel('Lead Actor')
# Set the y-axis label
plt.ylabel('Profit/Loss ($, Millions)')

# Show plot
plt.show()

In [None]:
# Take the 10 most frequently billed lead actresses from the top movie pool
top10femaleVC = test9['lead actress'].value_counts().head(10)

# Turn the series index into a list of names
top10femaleVClist = top10femaleVC.index.tolist()

# Create per-actress averages for the scatter plot (numeric columns only)
femalemetrics = test9.groupby('lead actress').mean(numeric_only=True)
femaledata = femalemetrics.loc[top10femaleVClist].reset_index()

#create dataframe slice for boxplot
femaleboxdata = test9[test9['lead actress'].isin(top10femaleVClist)]
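
The actor and actress cells follow the same steps: count appearances, keep the most frequent names, average their metrics for the scatter plot, and slice the original rows for the boxplot. As a rough sketch, those steps could be combined into one function. The name `top_n_group_data` and the toy rows below are hypothetical and exist only to show the shape of the result.

```python
import pandas as pd


def top_n_group_data(df, group_col, n=10):
    """Return (per-group means, row slice) for the n most frequent values of group_col."""
    # Names that appear most often in the column
    top_names = df[group_col].value_counts().head(n).index.tolist()
    # Averages of the numeric columns, restricted to those names (scatter plot input)
    means = (
        df.groupby(group_col)
        .mean(numeric_only=True)
        .loc[top_names]
        .reset_index()
    )
    # Original rows for those names (boxplot input)
    rows = df[df[group_col].isin(top_names)]
    return means, rows


# Toy rows for illustration only; the names and numbers are made up.
toy = pd.DataFrame({
    'lead actress': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Profit/Loss ($, Millions)': [10.0, 20.0, 30.0, 40.0, 60.0, 5.0],
    'RoI (%)': [50.0, 100.0, 150.0, 80.0, 120.0, 10.0],
})

means, rows = top_n_group_data(toy, 'lead actress', n=2)
print(means)
print(len(rows))
```

Ranking by appearance count favors prolific performers, so a name with one very profitable film would not show up here; that trade-off is worth keeping in mind when reading the plots.
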



In [None]:
# Create a plot and set the appropriate size
plt.figure(figsize=(15, 12))
# Draw a seaborn scatterplot of average profit/loss against average RoI,
# with one point per lead actress.
sns.scatterplot(y='Profit/Loss ($, Millions)', x='RoI (%)', data=femaledata, hue='lead actress', s=150)
# Set title of plot
plt.title('Profit/Loss ($, Millions) VS RoI (%)')
# Set y-axis label and fontsize
plt.ylabel('Profit/Loss ($, Millions)')

# Set x-axis label and fontsize
plt.xlabel('RoI (%)')

# Change x-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='x')
# Change y-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='y')
# Show the plot
plt.show()

In [None]:
# Create a plot and define its size.
plt.figure(figsize=(15, 12))
# Draw a boxplot showing the profit/loss distribution for each lead actress.
sns.boxplot(x='lead actress',
            y='Profit/Loss ($, Millions)', data=femaleboxdata)
# Set the plot title
plt.title('Distribution of Profit by Lead Actress')
# Set the x-axis Label and define fontsize
plt.xlabel('Lead Actress')
# Set the y-axis label and define fontsize
plt.ylabel('Profit/Loss ($, Millions)')

# Show plot
plt.show()

In [None]:
test9['director'].value_counts().head(40)

In [None]:
# Earlier approach (kept for reference): rank directors by number of films
#top10dirVC = test9['director'].value_counts().head(10)
#top10dirVClist = top10dirVC.index.tolist()

# Rank directors by average profit/loss (numeric columns only) and keep the top 10
dirmetrics = test9.groupby('director').mean(numeric_only=True).sort_values(by='Profit/Loss ($, Millions)', ascending=False).head(10)
dirdata = dirmetrics.reset_index()

# Turn the top 10 index into a list of director names
top10dirVClist = dirmetrics.index.tolist()

#create dataframe slice for boxplot
dirboxdata = test9[test9['director'].isin(top10dirVClist)]


In [None]:
# Create a plot and set the appropriate size
plt.figure(figsize=(15, 12))
# Draw a seaborn scatterplot of average profit/loss against average RoI,
# with one point per director.
sns.scatterplot(y='Profit/Loss ($, Millions)', x='RoI (%)', data=dirdata, hue='director', s=150)
# Set title of plot
plt.title('Profit/Loss ($, Millions) VS RoI (%)')
# Set y-axis label and fontsize
plt.ylabel('Profit/Loss ($, Millions)')

# Set x-axis label and fontsize
plt.xlabel('RoI (%)')

# Change x-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='x')
# Change y-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='y')
# Show the plot
plt.show()

In [None]:
# Create a plot and define its size.
plt.figure(figsize=(15, 12))
# Draw a boxplot showing the profit/loss distribution for each director.
sns.boxplot(x='director',
            y='Profit/Loss ($, Millions)', data=dirboxdata)
# Set the plot title
plt.title('Distribution of Profit by Director')
# Set the x-axis Label and define fontsize
plt.xlabel('Director')
# Set the y-axis label and define fontsize
plt.ylabel('Profit/Loss ($, Millions)')

# Show plot
plt.show()

In [None]:
# Sum the numeric metrics per writer for the bar plot and keep the top 10 by cumulative box office
wrimetrics = test9.groupby('writer').sum(numeric_only=True).sort_values(by='Total Box Office ($, Millions)', ascending=False).head(10)
wridata = wrimetrics.reset_index()

#take above top 10 series into list
top10wriVClist = wrimetrics.index.tolist()

#create dataframe slice for boxplot
wriboxdata = test9[test9['writer'].isin(top10wriVClist)]

In [None]:
test9['writer'].value_counts().head(40)

In [None]:
# Create a plot and set the appropriate size
plt.figure(figsize=(18, 12))
# Draw a seaborn barplot of cumulative box office for each of the top 10 writers.
sns.barplot(y='Total Box Office ($, Millions)', x='writer', data=wridata)
# Set title of plot
plt.title('Writer Cumulative Box Office')
# Set y-axis label and fontsize
plt.ylabel('Total Box Office ($, Millions)')

# Set x-axis label and fontsize
plt.xlabel('Writer')

# Change x-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='x')
# Change y-axis ticks from scientific notation to integers
#plt.ticklabel_format(style='plain', axis='y')
# Show the plot
plt.show()

In [None]:
# Create a plot and define its size.
plt.figure(figsize=(20, 12))
# Draw a boxplot showing the profit/loss distribution for each writer.
sns.boxplot(x='writer',
            y='Profit/Loss ($, Millions)', data=wriboxdata)
# Set the plot title
plt.title('Distribution of Profit by Writer')
# Set the x-axis Label and define fontsize
plt.xlabel('Writer')
# Set the y-axis label and define fontsize
plt.ylabel('Profit/Loss ($, Millions)')

# Show plot
plt.show()

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***

In [None]:
#Target Production Budget Sweet Spot
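
One possible way to look for a budget sweet spot is to bin movies into budget ranges and compare the median profit and RoI of each range. The sketch below assumes the same column names used earlier in this notebook; the bin edges, labels, and toy rows are hypothetical placeholders, not findings from the project data.

```python
import pandas as pd

# Toy rows for illustration only; these are not values from the project data.
toy = pd.DataFrame({
    'Budget ($, Millions)': [5.0, 20.0, 40.0, 90.0, 150.0, 250.0],
    'Profit/Loss ($, Millions)': [20.0, 30.0, 40.0, 108.0, 120.0, 150.0],
    'RoI (%)': [400.0, 150.0, 100.0, 120.0, 80.0, 60.0],
})

# Hypothetical budget ranges in millions of dollars; adjust to the real data.
bin_edges = [0, 25, 100, 300]
bin_labels = ['Under 25', '25 to 100', 'Over 100']

# Assign each movie to a budget range (each range includes its upper edge)
toy['budget_range'] = pd.cut(toy['Budget ($, Millions)'],
                             bins=bin_edges, labels=bin_labels)

# Median is used rather than mean so one blockbuster does not dominate a range
budget_summary = (
    toy.groupby('budget_range', observed=True)
    [['Profit/Loss ($, Millions)', 'RoI (%)']]
    .median()
)
print(budget_summary)
```

Profit and RoI can point in opposite directions across ranges, so any recommendation would need to state which metric it prioritizes.
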

In [None]:
#Recommended Cast and Crew

In [None]:
#Recommended Genre Type