![example](images/director_shot.jpeg)

# Project Title

**Author:** Sam Oliver
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve them.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [1]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
# Here you run your code to explore the data

# get preview list of the files
import glob, os
fpath = 'zippedData/'
os.listdir(fpath)

['bom.movie_gross.csv.gz',
 'imdb.name.basics.csv.gz',
 'imdb.title.akas.csv.gz',
 'imdb.title.basics.csv.gz',
 'imdb.title.crew.csv.gz',
 'imdb.title.principals.csv.gz',
 'imdb.title.ratings.csv.gz',
 'rt.movie_info.tsv.gz',
 'rt.reviews.tsv.gz',
 'tmdb.movies.csv.gz',
 'tn.movie_budgets.csv.gz']

In [3]:
# search string 
query = fpath+"*.gz"

file_list = glob.glob(query)
file_list

['zippedData\\bom.movie_gross.csv.gz',
 'zippedData\\imdb.name.basics.csv.gz',
 'zippedData\\imdb.title.akas.csv.gz',
 'zippedData\\imdb.title.basics.csv.gz',
 'zippedData\\imdb.title.crew.csv.gz',
 'zippedData\\imdb.title.principals.csv.gz',
 'zippedData\\imdb.title.ratings.csv.gz',
 'zippedData\\rt.movie_info.tsv.gz',
 'zippedData\\rt.reviews.tsv.gz',
 'zippedData\\tmdb.movies.csv.gz',
 'zippedData\\tn.movie_budgets.csv.gz']

In [4]:
tables = {}

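# loop through the file list and preview each file - note that the .tsv files
# need a tab separator and latin-1 encoding
for file in file_list:
    print('---'*20)
    # os.path.basename drops the folder prefix on any operating system
    file_name = os.path.basename(file).replace('.', '_')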
    print(file_name)
    
    if '.tsv.gz' in file:
        tmp_df = pd.read_csv(file, sep="\t",encoding="latin-1")
    else:
        tmp_df = pd.read_csv(file)
    display(tmp_df.head(),tmp_df.tail())
    tables[file_name] = tmp_df

------------------------------------------------------------
bom_movie_gross_csv_gz


Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018
3386,An Actor Prepares,Grav.,1700.0,,2018


------------------------------------------------------------
imdb_name_basics_csv_gz


Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
0,nm0061671,Mary Ellen Bauder,,,"miscellaneous,production_manager,producer","tt0837562,tt2398241,tt0844471,tt0118553"
1,nm0061865,Joseph Bauer,,,"composer,music_department,sound_department","tt0896534,tt6791238,tt0287072,tt1682940"
2,nm0062070,Bruce Baum,,,"miscellaneous,actor,writer","tt1470654,tt0363631,tt0104030,tt0102898"
3,nm0062195,Axel Baumann,,,"camera_department,cinematographer,art_department","tt0114371,tt2004304,tt1618448,tt1224387"
4,nm0062798,Pete Baxter,,,"production_designer,art_department,set_decorator","tt0452644,tt0452692,tt3458030,tt2178256"


Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
606643,nm9990381,Susan Grobes,,,actress,
606644,nm9990690,Joo Yeon So,,,actress,"tt9090932,tt8737130"
606645,nm9991320,Madeline Smith,,,actress,"tt8734436,tt9615610"
606646,nm9991786,Michelle Modigliani,,,producer,
606647,nm9993380,Pegasus Envoyé,,,"director,actor,writer",tt8743182


------------------------------------------------------------
imdb_title_akas_csv_gz


Unnamed: 0,title_id,ordering,title,region,language,types,attributes,is_original_title
0,tt0369610,10,Джурасик свят,BG,bg,,,0.0
1,tt0369610,11,Jurashikku warudo,JP,,imdbDisplay,,0.0
2,tt0369610,12,Jurassic World: O Mundo dos Dinossauros,BR,,imdbDisplay,,0.0
3,tt0369610,13,O Mundo dos Dinossauros,BR,,,short title,0.0
4,tt0369610,14,Jurassic World,FR,,imdbDisplay,,0.0


Unnamed: 0,title_id,ordering,title,region,language,types,attributes,is_original_title
331698,tt9827784,2,Sayonara kuchibiru,,,original,,1.0
331699,tt9827784,3,Farewell Song,XWW,en,imdbDisplay,,0.0
331700,tt9880178,1,La atención,,,original,,1.0
331701,tt9880178,2,La atención,ES,,,,0.0
331702,tt9880178,3,The Attention,XWW,en,imdbDisplay,,0.0


------------------------------------------------------------
imdb_title_basics_csv_gz


Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.0,Drama
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
146142,tt9916730,6 Gunn,6 Gunn,2017,116.0,
146143,tt9916754,Chico Albuquerque - Revelações,Chico Albuquerque - Revelações,2013,,Documentary


------------------------------------------------------------
imdb_title_crew_csv_gz


Unnamed: 0,tconst,directors,writers
0,tt0285252,nm0899854,nm0899854
1,tt0438973,,"nm0175726,nm1802864"
2,tt0462036,nm1940585,nm1940585
3,tt0835418,nm0151540,"nm0310087,nm0841532"
4,tt0878654,"nm0089502,nm2291498,nm2292011",nm0284943


Unnamed: 0,tconst,directors,writers
146139,tt8999974,nm10122357,nm10122357
146140,tt9001390,nm6711477,nm6711477
146141,tt9001494,"nm10123242,nm10123248",
146142,tt9004986,nm4993825,nm4993825
146143,tt9010172,,nm8352242


------------------------------------------------------------
imdb_title_principals_csv_gz


Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0111414,1,nm0246005,actor,,"[""The Man""]"
1,tt0111414,2,nm0398271,director,,
2,tt0111414,3,nm3739909,producer,producer,
3,tt0323808,10,nm0059247,editor,,
4,tt0323808,1,nm3579312,actress,,"[""Beth Boothby""]"


Unnamed: 0,tconst,ordering,nconst,category,job,characters
1028181,tt9692684,1,nm0186469,actor,,"[""Ebenezer Scrooge""]"
1028182,tt9692684,2,nm4929530,self,,"[""Herself"",""Regan""]"
1028183,tt9692684,3,nm10441594,director,,
1028184,tt9692684,4,nm6009913,writer,writer,
1028185,tt9692684,5,nm10441595,producer,producer,


------------------------------------------------------------
imdb_title_ratings_csv_gz


Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


Unnamed: 0,tconst,averagerating,numvotes
73851,tt9805820,8.1,25
73852,tt9844256,7.5,24
73853,tt9851050,4.7,14
73854,tt9886934,7.0,5
73855,tt9894098,6.3,128


------------------------------------------------------------
rt_movie_info_tsv_gz


Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
1555,1996,Forget terrorists or hijackers -- there's a ha...,R,Action and Adventure|Horror|Mystery and Suspense,,,"Aug 18, 2006","Jan 2, 2007",$,33886034.0,106 minutes,New Line Cinema
1556,1997,The popular Saturday Night Live sketch was exp...,PG,Comedy|Science Fiction and Fantasy,Steve Barron,Terry Turner|Tom Davis|Dan Aykroyd|Bonnie Turner,"Jul 23, 1993","Apr 17, 2001",,,88 minutes,Paramount Vantage
1557,1998,"Based on a novel by Richard Powell, when the l...",G,Classics|Comedy|Drama|Musical and Performing Arts,Gordon Douglas,,"Jan 1, 1962","May 11, 2004",,,111 minutes,
1558,1999,The Sandlot is a coming-of-age story about a g...,PG,Comedy|Drama|Kids and Family|Sports and Fitness,David Mickey Evans,David Mickey Evans|Robert Gunter,"Apr 1, 1993","Jan 29, 2002",,,101 minutes,
1559,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures


------------------------------------------------------------
rt_reviews_tsv_gz


Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
54427,2000,The real charm of this trifle is the deadpan c...,,fresh,Laura Sinagra,1,Village Voice,"September 24, 2002"
54428,2000,,1/5,rotten,Michael Szymanski,0,Zap2it.com,"September 21, 2005"
54429,2000,,2/5,rotten,Emanuel Levy,0,EmanuelLevy.Com,"July 17, 2005"
54430,2000,,2.5/5,rotten,Christopher Null,0,Filmcritic.com,"September 7, 2003"
54431,2000,,3/5,fresh,Nicolas Lacroix,0,Showbizz.net,"November 12, 2002"


------------------------------------------------------------
tmdb_movies_csv_gz


Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
26512,26512,"[27, 18]",488143,en,Laboratory Conditions,0.6,2018-10-13,Laboratory Conditions,0.0,1
26513,26513,"[18, 53]",485975,en,_EXHIBIT_84xxx_,0.6,2018-05-01,_EXHIBIT_84xxx_,0.0,1
26514,26514,"[14, 28, 12]",381231,en,The Last One,0.6,2018-10-01,The Last One,0.0,1
26515,26515,"[10751, 12, 28]",366854,en,Trailer Made,0.6,2018-06-22,Trailer Made,0.0,1
26516,26516,"[53, 27]",309885,en,The Church,0.6,2018-10-05,The Church,0.0,1


------------------------------------------------------------
tn_movie_budgets_csv_gz


Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
5777,78,"Dec 31, 2018",Red 11,"$7,000",$0,$0
5778,79,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495"
5779,80,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338"
5780,81,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0
5781,82,"Aug 5, 2005",My Date With Drew,"$1,100","$181,041","$181,041"
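The loading step can also be packaged as a small function. This is a minimal sketch, not part of the original notebook: the name `load_tables` is hypothetical, and it assumes `pathlib` is used for file names so the keys come out the same on Windows, macOS and Linux. pandas infers gzip compression from the `.gz` extension by default, so no explicit `compression` argument is needed.

```python
from pathlib import Path

import pandas as pd


def load_tables(data_dir):
    """Read every .gz file in data_dir into a dict of DataFrames.

    Hypothetical helper: keys are file names with dots replaced by
    underscores, e.g. 'bom_movie_gross_csv_gz'.
    """
    tables = {}
    for path in sorted(Path(data_dir).glob("*.gz")):
        # path.name is the file name without any directory prefix
        key = path.name.replace(".", "_")
        if path.name.endswith(".tsv.gz"):
            # tab-separated files, read with the same encoding as above
            tables[key] = pd.read_csv(path, sep="\t", encoding="latin-1")
        else:
            tables[key] = pd.read_csv(path)
    return tables
```

Calling `load_tables('zippedData/')` should then give a dictionary keyed the same way as `tables` above.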


In [5]:
# inspect some tables: imdb.title.basics, imdb.title.ratings, bom.movie_gross
# I will see if there are useful places to consolidate the tables into one
bom_movie_gross = tables['bom_movie_gross_csv_gz'].copy()
bom_movie_gross

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010
...,...,...,...,...,...
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018


In [6]:
# print out head and tail of title ratings df
imdb_title_ratings = tables['imdb_title_ratings_csv_gz'].copy()
imdb_title_ratings

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21
...,...,...,...
73851,tt9805820,8.1,25
73852,tt9844256,7.5,24
73853,tt9851050,4.7,14
73854,tt9886934,7.0,5


In [7]:
# print out title basics table
imdb_title_basics = tables['imdb_title_basics_csv_gz'].copy()
imdb_title_basics

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"
...,...,...,...,...,...,...
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.0,Drama
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
146142,tt9916730,6 Gunn,6 Gunn,2017,116.0,


In [8]:
# Where can I join these tables? What is tconst on the IMDB tables?
# tconst is a unique identifier for a particular film, so I'm going
# to go ahead and merge the two imdb tables together
imdb_table = pd.merge(imdb_title_ratings, imdb_title_basics, on="tconst")
imdb_table

Unnamed: 0,tconst,averagerating,numvotes,primary_title,original_title,start_year,runtime_minutes,genres
0,tt10356526,8.3,31,Laiye Je Yaarian,Laiye Je Yaarian,2019,117.0,Romance
1,tt10384606,8.9,559,Borderless,Borderless,2019,87.0,Documentary
2,tt1042974,6.4,20,Just Inès,Just Inès,2010,90.0,Drama
3,tt1043726,4.2,50352,The Legend of Hercules,The Legend of Hercules,2014,99.0,"Action,Adventure,Fantasy"
4,tt1060240,6.5,21,Até Onde?,Até Onde?,2011,73.0,"Mystery,Thriller"
...,...,...,...,...,...,...,...,...
73851,tt9805820,8.1,25,Caisa,Caisa,2018,84.0,Documentary
73852,tt9844256,7.5,24,Code Geass: Lelouch of the Rebellion - Glorifi...,Code Geass: Lelouch of the Rebellion Episode III,2018,120.0,"Action,Animation,Sci-Fi"
73853,tt9851050,4.7,14,Sisters,Sisters,2019,,"Action,Drama"
73854,tt9886934,7.0,5,The Projectionist,The Projectionist,2019,81.0,Documentary
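The merge on `tconst` relies on that column being unique in both tables. One way to check this assumption is sketched below on made-up rows (the IDs and titles are invented for illustration): `validate="one_to_one"` makes pandas raise an error if a key repeats, and an outer merge with `indicator=True` shows which rows an inner join would silently drop.

```python
import pandas as pd

# toy stand-ins for the ratings and basics tables (values are invented)
ratings = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "averagerating": [8.3, 6.4, 4.2],
})
basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt4"],
    "primary_title": ["Film A", "Film B", "Film D"],
})

# raises pandas.errors.MergeError if tconst repeats on either side
merged = pd.merge(ratings, basics, on="tconst", how="inner",
                  validate="one_to_one")

# the outer merge keeps every row and labels where it came from
audit = pd.merge(ratings, basics, on="tconst", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]
print(len(merged), len(unmatched))
```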


In [9]:
# now I want to merge imdb_table with bom on title. I will use original_title
# for imdb and title for bom. I am not going to merge on primary_title because
# two different films can share the same primary title, so a match on it may
# not refer to the same film. original_title is more likely to carry the
# complete title information

# first I will change the original_title in imdb to title so that I can merge
imdb_table_tmp = imdb_table.rename(columns={"original_title": "title"})

# merge bom and imdb on title
imdb_bom_table = pd.merge(imdb_table_tmp, bom_movie_gross, on="title")
imdb_bom_table

Unnamed: 0,tconst,averagerating,numvotes,primary_title,title,start_year,runtime_minutes,genres,studio,domestic_gross,foreign_gross,year
0,tt1043726,4.2,50352,The Legend of Hercules,The Legend of Hercules,2014,99.0,"Action,Adventure,Fantasy",LG/S,18800000.0,42400000,2014
1,tt1171222,5.1,8296,Baggage Claim,Baggage Claim,2013,96.0,Comedy,FoxS,21600000.0,887000,2013
2,tt1210166,7.6,326657,Moneyball,Moneyball,2011,133.0,"Biography,Drama,Sport",Sony,75600000.0,34600000,2011
3,tt1212419,6.5,87288,Hereafter,Hereafter,2010,129.0,"Drama,Fantasy,Romance",WB,32700000.0,72500000,2010
4,tt1229238,7.4,428142,Mission: Impossible - Ghost Protocol,Mission: Impossible - Ghost Protocol,2011,132.0,"Action,Adventure,Thriller",Par.,209400000.0,485300000,2011
...,...,...,...,...,...,...,...,...,...,...,...,...
2442,tt3142688,5.8,5841,Finding Fanny,Finding Fanny,2014,102.0,"Adventure,Comedy,Drama",FIP,616000.0,7100000,2014
2443,tt3399916,6.3,4185,The Dead Lands,The Dead Lands,2014,107.0,"Action,Adventure",Magn.,5200.0,,2015
2444,tt3748512,7.4,4977,Hitchcock/Truffaut,Hitchcock/Truffaut,2015,79.0,Documentary,Cohen,260000.0,,2015
2445,tt7008872,7.0,18768,Boy Erased,Boy Erased,2018,115.0,"Biography,Drama",Focus,6800000.0,5000000,2018
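Matching on title alone can pair a box-office row with the wrong film when two films share a title. A possible safeguard, sketched here on invented rows, is to match on title and release year together. This assumes `start_year` in the IMDB table and `year` in the BOM table refer to the same release year, which is not guaranteed (one of the rows above shows 2014 next to 2015).

```python
import pandas as pd

# invented rows: two different films share the title "Film A"
imdb = pd.DataFrame({
    "title": ["Film A", "Film A", "Film B"],
    "start_year": [2015, 2019, 2010],
    "averagerating": [6.0, 4.7, 8.8],
})
bom = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "year": [2015, 2010],
    "domestic_gross": [1000.0, 2000.0],
})

# title only: the 2019 "Film A" also picks up the 2015 gross
title_only = pd.merge(imdb, bom, on="title")

# title plus year: only rows that agree on both keys are kept
title_year = pd.merge(imdb, bom,
                      left_on=["title", "start_year"],
                      right_on=["title", "year"])
print(len(title_only), len(title_year))
```

The stricter key trades false matches for lost rows, so comparing the two row counts is a quick way to see how much is at stake.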


## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

In [10]:
# Here you run your code to clean the data
# I will remove tconst. tconst is not necessary anymore because it
# was only useful for joining the imdb tables. I can also drop primary title 
# because it is redundant to have both title columns... I am also going to 
# drop start year to remove redundancies with year

table1 = imdb_bom_table.drop(['primary_title', 'tconst', 'start_year'], axis=1)
table1

Unnamed: 0,averagerating,numvotes,title,runtime_minutes,genres,studio,domestic_gross,foreign_gross,year
0,4.2,50352,The Legend of Hercules,99.0,"Action,Adventure,Fantasy",LG/S,18800000.0,42400000,2014
1,5.1,8296,Baggage Claim,96.0,Comedy,FoxS,21600000.0,887000,2013
2,7.6,326657,Moneyball,133.0,"Biography,Drama,Sport",Sony,75600000.0,34600000,2011
3,6.5,87288,Hereafter,129.0,"Drama,Fantasy,Romance",WB,32700000.0,72500000,2010
4,7.4,428142,Mission: Impossible - Ghost Protocol,132.0,"Action,Adventure,Thriller",Par.,209400000.0,485300000,2011
...,...,...,...,...,...,...,...,...,...
2442,5.8,5841,Finding Fanny,102.0,"Adventure,Comedy,Drama",FIP,616000.0,7100000,2014
2443,6.3,4185,The Dead Lands,107.0,"Action,Adventure",Magn.,5200.0,,2015
2444,7.4,4977,Hitchcock/Truffaut,79.0,Documentary,Cohen,260000.0,,2015
2445,7.0,18768,Boy Erased,115.0,"Biography,Drama",Focus,6800000.0,5000000,2018


In [11]:
# I am now going to combine domestic and foreign gross to get total gross
# convert foreign gross to float type (strip the thousands separators first)
table1["foreign_gross"] = (
    table1["foreign_gross"].str.replace(",", "").astype(float)
)
table1['domestic_gross'] = table1['domestic_gross'].astype(float)

# I will convert NaN values in both domestic and foreign gross to 0
table1['domestic_gross'] = table1['domestic_gross'].fillna(0)
table1['foreign_gross'] = table1['foreign_gross'].fillna(0)

# time to add them together
total_gross = table1['domestic_gross'] + table1['foreign_gross']
table1['total_gross'] = total_gross
table1

Unnamed: 0,averagerating,numvotes,title,runtime_minutes,genres,studio,domestic_gross,foreign_gross,year,total_gross
0,4.2,50352,The Legend of Hercules,99.0,"Action,Adventure,Fantasy",LG/S,18800000.0,42400000.0,2014,61200000.0
1,5.1,8296,Baggage Claim,96.0,Comedy,FoxS,21600000.0,887000.0,2013,22487000.0
2,7.6,326657,Moneyball,133.0,"Biography,Drama,Sport",Sony,75600000.0,34600000.0,2011,110200000.0
3,6.5,87288,Hereafter,129.0,"Drama,Fantasy,Romance",WB,32700000.0,72500000.0,2010,105200000.0
4,7.4,428142,Mission: Impossible - Ghost Protocol,132.0,"Action,Adventure,Thriller",Par.,209400000.0,485300000.0,2011,694700000.0
...,...,...,...,...,...,...,...,...,...,...
2442,5.8,5841,Finding Fanny,102.0,"Adventure,Comedy,Drama",FIP,616000.0,7100000.0,2014,7716000.0
2443,6.3,4185,The Dead Lands,107.0,"Action,Adventure",Magn.,5200.0,0.0,2015,5200.0
2444,7.4,4977,Hitchcock/Truffaut,79.0,Documentary,Cohen,260000.0,0.0,2015,260000.0
2445,7.0,18768,Boy Erased,115.0,"Biography,Drama",Focus,6800000.0,5000000.0,2018,11800000.0
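Filling missing gross values with 0 makes the sum work, but afterwards a film with no reported foreign gross looks the same as one that earned nothing abroad. A minimal sketch of one way to keep that distinction, using invented numbers and a hypothetical `foreign_missing` flag column:

```python
import pandas as pd

# invented values; foreign_gross arrives as strings with thousands separators
df = pd.DataFrame({
    "domestic_gross": [1000.0, 2500.0, None],
    "foreign_gross": ["1,200", None, "300"],
})

# strip the separators, then coerce anything unparseable to NaN
df["foreign_gross"] = pd.to_numeric(
    df["foreign_gross"].str.replace(",", "", regex=False),
    errors="coerce",
)

# remember which rows had no foreign figure before filling with 0
df["foreign_missing"] = df["foreign_gross"].isna()

df["total_gross"] = df["domestic_gross"].fillna(0) + df["foreign_gross"].fillna(0)
print(df)
```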


In [12]:
# now remove domestic and foreign gross columns
table1 = table1.drop(['domestic_gross', 'foreign_gross'], axis=1)
table1

Unnamed: 0,averagerating,numvotes,title,runtime_minutes,genres,studio,year,total_gross
0,4.2,50352,The Legend of Hercules,99.0,"Action,Adventure,Fantasy",LG/S,2014,61200000.0
1,5.1,8296,Baggage Claim,96.0,Comedy,FoxS,2013,22487000.0
2,7.6,326657,Moneyball,133.0,"Biography,Drama,Sport",Sony,2011,110200000.0
3,6.5,87288,Hereafter,129.0,"Drama,Fantasy,Romance",WB,2010,105200000.0
4,7.4,428142,Mission: Impossible - Ghost Protocol,132.0,"Action,Adventure,Thriller",Par.,2011,694700000.0
...,...,...,...,...,...,...,...,...
2442,5.8,5841,Finding Fanny,102.0,"Adventure,Comedy,Drama",FIP,2014,7716000.0
2443,6.3,4185,The Dead Lands,107.0,"Action,Adventure",Magn.,2015,5200.0
2444,7.4,4977,Hitchcock/Truffaut,79.0,Documentary,Cohen,2015,260000.0
2445,7.0,18768,Boy Erased,115.0,"Biography,Drama",Focus,2018,11800000.0


In [15]:
# check for duplicates
table1.duplicated().sum()

0
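`duplicated()` with no arguments only flags rows that match in every column. The same film can still appear twice if, say, the ratings differ. A sketch of a key-based check on invented rows, assuming title and year together identify a film:

```python
import pandas as pd

# invented rows: "Film A" (2015) appears twice with different ratings
films = pd.DataFrame({
    "title": ["Film A", "Film A", "Film B"],
    "year": [2015, 2015, 2016],
    "averagerating": [6.1, 7.3, 5.0],
})

# no row is identical to another in every column
full_row_dupes = films.duplicated().sum()

# keep=False marks every member of a repeated (title, year) pair
key_dupes = films.duplicated(subset=["title", "year"], keep=False)
print(full_row_dupes, key_dupes.sum())
```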

In [16]:
# There aren't any fully duplicated rows, so that's great!
# Let's get more info - I'm skeptical about how useful year and studio will be
# but I think they should be kept for now...
table1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2447 entries, 0 to 2446
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   averagerating    2447 non-null   float64
 1   numvotes         2447 non-null   int64  
 2   title            2447 non-null   object 
 3   runtime_minutes  2402 non-null   float64
 4   genres           2443 non-null   object 
 5   studio           2444 non-null   object 
 6   year             2447 non-null   int64  
 7   total_gross      2447 non-null   float64
dtypes: float64(3), int64(2), object(3)
memory usage: 172.1+ KB


In [17]:
# so there are some places where there are null objects... Let's deal with 
# those! I first want to drop the places where genres are null
table1 = table1.dropna(axis=0, subset=['genres'])
table1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2443 entries, 0 to 2446
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   averagerating    2443 non-null   float64
 1   numvotes         2443 non-null   int64  
 2   title            2443 non-null   object 
 3   runtime_minutes  2400 non-null   float64
 4   genres           2443 non-null   object 
 5   studio           2440 non-null   object 
 6   year             2443 non-null   int64  
 7   total_gross      2443 non-null   float64
dtypes: float64(3), int64(2), object(3)
memory usage: 171.8+ KB


In [22]:
# I do think it is important to understand how runtime affects likeability and
# success of a particular film. Only 43 rows are missing a runtime, so let's
# just drop those so that every remaining row has a value.
table1 = table1.dropna(axis=0, subset=['runtime_minutes'])

# let's also check to see if there are still null elements in the studio col.
table1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2400 entries, 0 to 2446
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   averagerating    2400 non-null   float64
 1   numvotes         2400 non-null   int64  
 2   title            2400 non-null   object 
 3   runtime_minutes  2400 non-null   float64
 4   genres           2400 non-null   object 
 5   studio           2397 non-null   object 
 6   year             2400 non-null   int64  
 7   total_gross      2400 non-null   float64
dtypes: float64(3), int64(2), object(3)
memory usage: 168.8+ KB


In [24]:
# There are only 3 more rows that contain NaN values, but because these are
# studio values, I actually would like to change these null values to simply
# say 'Unknown'. I fill only the studio column so no other column is touched.
table1['studio'] = table1['studio'].fillna('Unknown')

# sanity check - make sure it worked
table1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2400 entries, 0 to 2446
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   averagerating    2400 non-null   float64
 1   numvotes         2400 non-null   int64  
 2   title            2400 non-null   object 
 3   runtime_minutes  2400 non-null   float64
 4   genres           2400 non-null   object 
 5   studio           2400 non-null   object 
 6   year             2400 non-null   int64  
 7   total_gross      2400 non-null   float64
dtypes: float64(3), int64(2), object(3)
memory usage: 168.8+ KB
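`info()` shows non-null counts, but `isna().sum()` gives the number of missing values per column directly. The sketch below uses invented rows to show that count, and also that filling a single column leaves gaps in other columns alone:

```python
import numpy as np
import pandas as pd

# invented rows with one gap in each column
df = pd.DataFrame({
    "studio": ["WB", None, "Sony"],
    "runtime_minutes": [99.0, np.nan, 120.0],
})

# missing values per column, before any filling
missing = df.isna().sum()

# fill only the text column; the numeric gap stays NaN
df["studio"] = df["studio"].fillna("Unknown")
print(missing)
print(df)
```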


In [26]:
# Year seems to be ONLY useful insofar as it affects total_gross
# so we might need to look out for the effects observed during the pandemic
# and we also should control for inflation - if possible.
# First, we need some information about the year column - let's get min & max
print(table1['year'].max())
print(table1['year'].min())

2018
2010


In [None]:
# so the min year value is 2010 and the max value is 2018. Therefore, the
# pandemic is not a concern for this data.
# After doing some research into inflation, it does not seem necessary to
# adjust for it, considering the low, steady inflation rates between 2010
# and 2018

## Data Analysis
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [14]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***