# Introduction - Data Cleaning Notebook

Goal: Analyze the available data to present a concise representation of the movie industry, analyze the success of movies based on various factors, and provide insights and reccomendations for the development of film projects.

Methodology:
    
1. Explore the data to determine what sorts of questions it can help answer.
2. Determine 3 relevant questions to examine. 
3. Clean the data. 
4. Restructure the data so that it can be analyzed.
5. Use data visualizations and statistical analysis to make inferences.
6. Develop a presentation of findings. 

# Step 1- Initial Data Exploration

First we'll unpack each file to get a better idea of what each dataframe contains.

In [1]:
import pandas as pd

# I'll be unpacking alot of files, so before I name them I'll get some basic info with a function.

def unpack_df(filename, tsv=False, encoding=False):
    if encoding==True:
        temp_df = pd.read_csv(filename, delimiter='\t', encoding='latin')
    elif tsv==True:
        temp_df = pd.read_csv(filename, delimiter='\t')
    else:
        temp_df = pd.read_csv(filename)
    print('Info from {}'.format(filename))
    display(temp_df.head())
    display(temp_df.tail())
    display(temp_df.info())

unpack_df('Data/bom.movie_gross.csv.gz')
unpack_df('Data/imdb.name.basics.csv.gz')
unpack_df('Data/imdb.title.akas.csv.gz')
unpack_df('Data/imdb.title.basics.csv.gz')
unpack_df('Data/imdb.title.crew.csv.gz')
unpack_df('Data/imdb.title.principals.csv.gz')
unpack_df('Data/imdb.title.ratings.csv.gz')
unpack_df('Data/rt.movie_info.tsv.gz', tsv=True)
unpack_df('Data/rt.reviews.tsv.gz', encoding=True)
unpack_df('Data/tmdb.movies.csv.gz')
unpack_df('Data/tn.movie_budgets.csv.gz')

Info from Data/bom.movie_gross.csv.gz


Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018
3386,An Actor Prepares,Grav.,1700.0,,2018


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
title             3387 non-null object
studio            3382 non-null object
domestic_gross    3359 non-null float64
foreign_gross     2037 non-null object
year              3387 non-null int64
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


None

Info from Data/imdb.name.basics.csv.gz


Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
0,nm0061671,Mary Ellen Bauder,,,"miscellaneous,production_manager,producer","tt0837562,tt2398241,tt0844471,tt0118553"
1,nm0061865,Joseph Bauer,,,"composer,music_department,sound_department","tt0896534,tt6791238,tt0287072,tt1682940"
2,nm0062070,Bruce Baum,,,"miscellaneous,actor,writer","tt1470654,tt0363631,tt0104030,tt0102898"
3,nm0062195,Axel Baumann,,,"camera_department,cinematographer,art_department","tt0114371,tt2004304,tt1618448,tt1224387"
4,nm0062798,Pete Baxter,,,"production_designer,art_department,set_decorator","tt0452644,tt0452692,tt3458030,tt2178256"


Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
606643,nm9990381,Susan Grobes,,,actress,
606644,nm9990690,Joo Yeon So,,,actress,"tt9090932,tt8737130"
606645,nm9991320,Madeline Smith,,,actress,"tt8734436,tt9615610"
606646,nm9991786,Michelle Modigliani,,,producer,
606647,nm9993380,Pegasus Envoyé,,,"director,actor,writer",tt8743182


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 606648 entries, 0 to 606647
Data columns (total 6 columns):
nconst                606648 non-null object
primary_name          606648 non-null object
birth_year            82736 non-null float64
death_year            6783 non-null float64
primary_profession    555308 non-null object
known_for_titles      576444 non-null object
dtypes: float64(2), object(4)
memory usage: 27.8+ MB


None

Info from Data/imdb.title.akas.csv.gz


Unnamed: 0,title_id,ordering,title,region,language,types,attributes,is_original_title
0,tt0369610,10,Джурасик свят,BG,bg,,,0.0
1,tt0369610,11,Jurashikku warudo,JP,,imdbDisplay,,0.0
2,tt0369610,12,Jurassic World: O Mundo dos Dinossauros,BR,,imdbDisplay,,0.0
3,tt0369610,13,O Mundo dos Dinossauros,BR,,,short title,0.0
4,tt0369610,14,Jurassic World,FR,,imdbDisplay,,0.0


Unnamed: 0,title_id,ordering,title,region,language,types,attributes,is_original_title
331698,tt9827784,2,Sayonara kuchibiru,,,original,,1.0
331699,tt9827784,3,Farewell Song,XWW,en,imdbDisplay,,0.0
331700,tt9880178,1,La atención,,,original,,1.0
331701,tt9880178,2,La atención,ES,,,,0.0
331702,tt9880178,3,The Attention,XWW,en,imdbDisplay,,0.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 331703 entries, 0 to 331702
Data columns (total 8 columns):
title_id             331703 non-null object
ordering             331703 non-null int64
title                331703 non-null object
region               278410 non-null object
language             41715 non-null object
types                168447 non-null object
attributes           14925 non-null object
is_original_title    331678 non-null float64
dtypes: float64(1), int64(1), object(6)
memory usage: 20.2+ MB


None

Info from Data/imdb.title.basics.csv.gz


Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.0,Drama
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
146142,tt9916730,6 Gunn,6 Gunn,2017,116.0,
146143,tt9916754,Chico Albuquerque - Revelações,Chico Albuquerque - Revelações,2013,,Documentary


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
tconst             146144 non-null object
primary_title      146144 non-null object
original_title     146123 non-null object
start_year         146144 non-null int64
runtime_minutes    114405 non-null float64
genres             140736 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


None

Info from Data/imdb.title.crew.csv.gz


Unnamed: 0,tconst,directors,writers
0,tt0285252,nm0899854,nm0899854
1,tt0438973,,"nm0175726,nm1802864"
2,tt0462036,nm1940585,nm1940585
3,tt0835418,nm0151540,"nm0310087,nm0841532"
4,tt0878654,"nm0089502,nm2291498,nm2292011",nm0284943


Unnamed: 0,tconst,directors,writers
146139,tt8999974,nm10122357,nm10122357
146140,tt9001390,nm6711477,nm6711477
146141,tt9001494,"nm10123242,nm10123248",
146142,tt9004986,nm4993825,nm4993825
146143,tt9010172,,nm8352242


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 3 columns):
tconst       146144 non-null object
directors    140417 non-null object
writers      110261 non-null object
dtypes: object(3)
memory usage: 3.3+ MB


None

Info from Data/imdb.title.principals.csv.gz


Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0111414,1,nm0246005,actor,,"[""The Man""]"
1,tt0111414,2,nm0398271,director,,
2,tt0111414,3,nm3739909,producer,producer,
3,tt0323808,10,nm0059247,editor,,
4,tt0323808,1,nm3579312,actress,,"[""Beth Boothby""]"


Unnamed: 0,tconst,ordering,nconst,category,job,characters
1028181,tt9692684,1,nm0186469,actor,,"[""Ebenezer Scrooge""]"
1028182,tt9692684,2,nm4929530,self,,"[""Herself"",""Regan""]"
1028183,tt9692684,3,nm10441594,director,,
1028184,tt9692684,4,nm6009913,writer,writer,
1028185,tt9692684,5,nm10441595,producer,producer,


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1028186 entries, 0 to 1028185
Data columns (total 6 columns):
tconst        1028186 non-null object
ordering      1028186 non-null int64
nconst        1028186 non-null object
category      1028186 non-null object
job           177684 non-null object
characters    393360 non-null object
dtypes: int64(1), object(5)
memory usage: 47.1+ MB


None

Info from Data/imdb.title.ratings.csv.gz


Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


Unnamed: 0,tconst,averagerating,numvotes
73851,tt9805820,8.1,25
73852,tt9844256,7.5,24
73853,tt9851050,4.7,14
73854,tt9886934,7.0,5
73855,tt9894098,6.3,128


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
tconst           73856 non-null object
averagerating    73856 non-null float64
numvotes         73856 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


None

Info from Data/rt.movie_info.tsv.gz


Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
1555,1996,Forget terrorists or hijackers -- there's a ha...,R,Action and Adventure|Horror|Mystery and Suspense,,,"Aug 18, 2006","Jan 2, 2007",$,33886034.0,106 minutes,New Line Cinema
1556,1997,The popular Saturday Night Live sketch was exp...,PG,Comedy|Science Fiction and Fantasy,Steve Barron,Terry Turner|Tom Davis|Dan Aykroyd|Bonnie Turner,"Jul 23, 1993","Apr 17, 2001",,,88 minutes,Paramount Vantage
1557,1998,"Based on a novel by Richard Powell, when the l...",G,Classics|Comedy|Drama|Musical and Performing Arts,Gordon Douglas,,"Jan 1, 1962","May 11, 2004",,,111 minutes,
1558,1999,The Sandlot is a coming-of-age story about a g...,PG,Comedy|Drama|Kids and Family|Sports and Fitness,David Mickey Evans,David Mickey Evans|Robert Gunter,"Apr 1, 1993","Jan 29, 2002",,,101 minutes,
1559,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
id              1560 non-null int64
synopsis        1498 non-null object
rating          1557 non-null object
genre           1552 non-null object
director        1361 non-null object
writer          1111 non-null object
theater_date    1201 non-null object
dvd_date        1201 non-null object
currency        340 non-null object
box_office      340 non-null object
runtime         1530 non-null object
studio          494 non-null object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB


None

Info from Data/rt.reviews.tsv.gz


Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
54427,2000,The real charm of this trifle is the deadpan c...,,fresh,Laura Sinagra,1,Village Voice,"September 24, 2002"
54428,2000,,1/5,rotten,Michael Szymanski,0,Zap2it.com,"September 21, 2005"
54429,2000,,2/5,rotten,Emanuel Levy,0,EmanuelLevy.Com,"July 17, 2005"
54430,2000,,2.5/5,rotten,Christopher Null,0,Filmcritic.com,"September 7, 2003"
54431,2000,,3/5,fresh,Nicolas Lacroix,0,Showbizz.net,"November 12, 2002"


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
id            54432 non-null int64
review        48869 non-null object
rating        40915 non-null object
fresh         54432 non-null object
critic        51710 non-null object
top_critic    54432 non-null int64
publisher     54123 non-null object
date          54432 non-null object
dtypes: int64(2), object(6)
memory usage: 3.3+ MB


None

Info from Data/tmdb.movies.csv.gz


Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
26512,26512,"[27, 18]",488143,en,Laboratory Conditions,0.6,2018-10-13,Laboratory Conditions,0.0,1
26513,26513,"[18, 53]",485975,en,_EXHIBIT_84xxx_,0.6,2018-05-01,_EXHIBIT_84xxx_,0.0,1
26514,26514,"[14, 28, 12]",381231,en,The Last One,0.6,2018-10-01,The Last One,0.0,1
26515,26515,"[10751, 12, 28]",366854,en,Trailer Made,0.6,2018-06-22,Trailer Made,0.0,1
26516,26516,"[53, 27]",309885,en,The Church,0.6,2018-10-05,The Church,0.0,1


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
Unnamed: 0           26517 non-null int64
genre_ids            26517 non-null object
id                   26517 non-null int64
original_language    26517 non-null object
original_title       26517 non-null object
popularity           26517 non-null float64
release_date         26517 non-null object
title                26517 non-null object
vote_average         26517 non-null float64
vote_count           26517 non-null int64
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


None

Info from Data/tn.movie_budgets.csv.gz


Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
5777,78,"Dec 31, 2018",Red 11,"$7,000",$0,$0
5778,79,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495"
5779,80,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338"
5780,81,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0
5781,82,"Aug 5, 2005",My Date With Drew,"$1,100","$181,041","$181,041"


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
id                   5782 non-null int64
release_date         5782 non-null object
movie                5782 non-null object
production_budget    5782 non-null object
domestic_gross       5782 non-null object
worldwide_gross      5782 non-null object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


None

Then, we can use the descriptions we just unpacked to make descriptive names for each dataframe.

In [2]:
#bom.movie_gross - gross profits domestic and international from 2010 to 2018
gross_profits = pd.read_csv('Data/bom.movie_gross.csv.gz')

#imdb.name.basics - professionals and the movies they are known for
imdb_crew = pd.read_csv('Data/imdb.name.basics.csv.gz')

#imdb.title.akas - appears to list alternate titles to movies
imdb_alt_titles = pd.read_csv('Data/imdb.title.akas.csv.gz')

#imdb.title.basics - movie titles, runtime, and genres
imdb_details = pd.read_csv('Data/imdb.title.basics.csv.gz')

#imdb.title.crew - directors and writers
imdb_creators = pd.read_csv('Data/imdb.title.crew.csv.gz')

#imdb.title.principals -  the principal people involved with the creation of the movie
imdb_principals = pd.read_csv('Data/imdb.title.principals.csv.gz')

#imdb.title.ratings - the ratings and number of votes for each movie
imdb_ratings = pd.read_csv('Data/imdb.title.ratings.csv.gz')

#rt.movie_info - synopsis, MPAA Rating, and other details, including box office
rt_info = pd.read_csv('Data/rt.movie_info.tsv.gz', delimiter='\t')

#rt.reviews.tsv.gz - reviews with matching ids with the other rt df, but no movie titles
rt_reviews = pd.read_csv('Data/rt.reviews.tsv.gz', delimiter='\t', encoding='latin')

#tmdb.movies - title, ratings, popularity score, release date
tmdb = pd.read_csv('Data/tmdb.movies.csv.gz')

#tn.movie_budgets - budgets and box office returns
budgets = pd.read_csv('Data/tn.movie_budgets.csv.gz')

#I'll name them, because having a way to identify them later will come in handy.
movie_data = [('gross_profits', gross_profits), ('imdb_crew', imdb_crew), ('imdb_alt_titles', imdb_alt_titles), 
              ('imdb_details', imdb_details), ('imdb_creators', imdb_creators), ('imdb_principals', imdb_principals),
              ('imdb_ratings', imdb_ratings), ('rt_info', rt_info), ('rt_reviews', rt_reviews),
              ('tmdb', tmdb), ('budgets', budgets)]

# Step 2- Determine Questions

The goal of this project is to reccomend a strategy to successfully launch a movie studio. Based on the data available we will first look at how best to measure success using the factors present in the data:
        
- Box Office Profit
- Audience Reviews
        
We will investigate the following questions:            

1. How much money should we invest?

2. What kinds of films should we produce?

3. Who should we hire to produce it?     

# Step 3- Data Cleaning

### Duplicates:

In [3]:
#Find and drop duplicates
for dftup in movie_data:
    print('Duplicates in {}:'.format(dftup[0]))
    display(dftup[1][dftup[1].duplicated()])
    print('\n')

Duplicates in gross_profits:


Unnamed: 0,title,studio,domestic_gross,foreign_gross,year




Duplicates in imdb_crew:


Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles




Duplicates in imdb_alt_titles:


Unnamed: 0,title_id,ordering,title,region,language,types,attributes,is_original_title




Duplicates in imdb_details:


Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres




Duplicates in imdb_creators:


Unnamed: 0,tconst,directors,writers




Duplicates in imdb_principals:


Unnamed: 0,tconst,ordering,nconst,category,job,characters




Duplicates in imdb_ratings:


Unnamed: 0,tconst,averagerating,numvotes




Duplicates in rt_info:


Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio




Duplicates in rt_reviews:


Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
8129,304,"Friends With Kids is a smart, witty and potty-...",,fresh,,0,Liverpool Echo,"June 29, 2012"
14575,581,,4.5/5,fresh,,0,Film Threat,"December 6, 2005"
26226,1055,,4/5,fresh,,0,Film Threat,"December 6, 2005"
35162,1368,,2/5,rotten,,0,Film Threat,"December 6, 2005"
35166,1368,,2/5,rotten,,0,Film Threat,"December 8, 2002"
40567,1535,,2/5,rotten,,0,Film Threat,"December 6, 2005"
42381,1598,"This tired, neutered action thriller won't cau...",2/5,rotten,,0,Empire Magazine,"November 14, 2008"
49487,1843,,0.5/5,rotten,,0,Film Threat,"December 6, 2005"
49492,1843,,0.5/5,rotten,,0,Film Threat,"December 8, 2002"




Duplicates in tmdb:


Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count




Duplicates in budgets:


Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross






In [4]:
rt_reviews.drop_duplicates(inplace=True)

rt_reviews[rt_reviews.duplicated()]

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date


### Null Values and Other Issues:

In [5]:
#Find and deal with null values

for dftup in movie_data:
    print('Total null values for {}:'.format(dftup[0]))
    display(dftup[1].isna().sum())
    print('\n\n')

Total null values for gross_profits:


title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64




Total null values for imdb_crew:


nconst                     0
primary_name               0
birth_year            523912
death_year            599865
primary_profession     51340
known_for_titles       30204
dtype: int64




Total null values for imdb_alt_titles:


title_id                  0
ordering                  0
title                     0
region                53293
language             289988
types                163256
attributes           316778
is_original_title        25
dtype: int64




Total null values for imdb_details:


tconst                 0
primary_title          0
original_title        21
start_year             0
runtime_minutes    31739
genres              5408
dtype: int64




Total null values for imdb_creators:


tconst           0
directors     5727
writers      35883
dtype: int64




Total null values for imdb_principals:


tconst             0
ordering           0
nconst             0
category           0
job           850502
characters    634826
dtype: int64




Total null values for imdb_ratings:


tconst           0
averagerating    0
numvotes         0
dtype: int64




Total null values for rt_info:


id                 0
synopsis          62
rating             3
genre              8
director         199
writer           449
theater_date     359
dvd_date         359
currency        1220
box_office      1220
runtime           30
studio          1066
dtype: int64




Total null values for rt_reviews:


id                0
review         5556
rating        13516
fresh             0
critic         2713
top_critic        0
publisher       309
date              0
dtype: int64




Total null values for tmdb:


Unnamed: 0           0
genre_ids            0
id                   0
original_language    0
original_title       0
popularity           0
release_date         0
title                0
vote_average         0
vote_count           0
dtype: int64




Total null values for budgets:


id                   0
release_date         0
movie                0
production_budget    0
domestic_gross       0
worldwide_gross      0
dtype: int64






##### gross_profits

In [6]:
# Try to find a worthy replacement value
gross_profits['domestic_gross'].median()

1400000.0

In [7]:
gross_profits[~gross_profits['foreign_gross'].isna()]['domestic_gross'].median()

16500000.0

In [8]:
# gross_profits['foreign_gross'].median()
# returns error

In [9]:
#Oh no! These commas are preventing me from getting accurate statistics. Based on a quick internet search, 
#these values are also stadardized to millions.
gross_profits[gross_profits['foreign_gross'].str.contains(',')==True]['foreign_gross'] 

1872    1,131.6
1873    1,019.4
1874    1,163.0
2760    1,010.0
3079    1,369.5
Name: foreign_gross, dtype: object

In [10]:
# fix the values so they can be added to the data

comma_values = [value for value in gross_profits[gross_profits['foreign_gross'].str.contains(',')==True]['foreign_gross']]
fixed_values = []
for value in comma_values:
    fixed_values.append(int(value.replace(',','').replace('.','')+'00000'))
fixed_values

[1131600000, 1019400000, 1163000000, 1010000000, 1369500000]

In [11]:
gross_profits.loc[gross_profits['foreign_gross'].str.contains(',')==True, 'foreign_gross'] = fixed_values

In [12]:
gross_profits[gross_profits['foreign_gross'].str.contains(',')==True]['foreign_gross']

Series([], Name: foreign_gross, dtype: object)

In [13]:
# median for foreign gross of films with a value.

gross_profits[~gross_profits['foreign_gross'].isna()]['foreign_gross'].median()

19000000.0

In [14]:
# median for domestic gross of films without a foreign box office value.
gross_profits[gross_profits['foreign_gross'].isna()]['domestic_gross'].median()

180000.0

In gross_profits about a third of all 'foreign_gross' are nan. 

In seeking the best replacement value:

- The median 'domestic_gross' overall: 1,400,000 

- The median 'domestic_gross' for films without 'foreign_gross': 180,000.  

- The median 'domestic_gross' for films with 'foreign_gross': 16,500,000 

- The median 'foreign_gross' for films with values: 19,000,000. 

Films with a foreign_gross value are significantly larger than those with a null. I would like to preserve the shape of the values, so I will use the ratio of 'domestic_gross' and 'foreign_gross' values to replace the nulls.  

In [15]:
#replace null values with proper value
proper = []

#using the ration of foreign gross over domestic gross in the sample of data where both exist
f_d_ratio = 19000000/16500000

# impute the new values to fill the null values
for value in gross_profits[gross_profits['foreign_gross'].isna()]['domestic_gross']:
    proper.append(round(value * f_d_ratio, 1))
gross_profits.loc[gross_profits['foreign_gross'].isna(), 'domestic_gross'] = proper

We will not need studio data for this analysis, so we'll drop that column.

In [16]:
# drop studio data
gross_profits.drop('studio',axis=1,inplace=True)
gross_profits.isna().sum()

title                0
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64

The remaining null values are for domestic gross, which is important, but there are so few values in the group we can just drop the rows. 

In [17]:
# drop the remaining null values
gross_profits.dropna(inplace=True)
gross_profits.isna().sum()

title             0
domestic_gross    0
foreign_gross     0
year              0
dtype: int64

##### imdb_crew

The main useful information from this dataframe is the name of crew, the primary profession, and the titles they are known for. Birth year is not important, so I'll start by dropping that column.

In [18]:
# drop birth year column
imdb_crew.drop('birth_year', axis=1, inplace=True)
imdb_crew.isna().sum()

nconst                     0
primary_name               0
death_year            599865
primary_profession     51340
known_for_titles       30204
dtype: int64

In addition, since we are using this data to determine crew that may impact the success of a film we won't need to know their death year. We can impute 'Working' or 'Passed' into the column 'Status' which we can use to determine if they are able to be hired in our production. 

In [19]:
imdb_crew.loc[~imdb_crew['death_year'].isna(), 'status'] = 'Passed'
imdb_crew.loc[imdb_crew['death_year'].isna(), 'status'] = 'Working'
imdb_crew.drop('death_year', axis=1, inplace=True)
imdb_crew.isna().sum()

nconst                    0
primary_name              0
primary_profession    51340
known_for_titles      30204
status                    0
dtype: int64

To replace the 'primary_profession' null values, we can impute the general term 'crew' because it may be helpful to know the impact these professionals had on movies even without knowing their specific role.

In [20]:
imdb_crew['primary_profession'].fillna(value='crew', inplace=True)
imdb_crew.isna().sum()

nconst                    0
primary_name              0
primary_profession        0
known_for_titles      30204
status                    0
dtype: int64

Without knowing the movies they worked on, data about crew will simply not be useful.

In [21]:
# drop all rows containing null values
imdb_crew.dropna(inplace=True)
imdb_crew.isna().sum()

nconst                0
primary_name          0
primary_profession    0
known_for_titles      0
status                0
dtype: int64

##### imdb_alt_titles
We won't be using this information for my analysis.

##### imdb_details

In [22]:
display(imdb_details.head())
display(imdb_details.isna().sum())

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


tconst                 0
primary_title          0
original_title        21
start_year             0
runtime_minutes    31739
genres              5408
dtype: int64

It appears the main useful information in this dataframe is the genres, with years and primary title as identifiers. For this analysis original_title will not be important information. For the films without genre data, we will simply have to drop it because its a relatively small slice of the dataframe.

In [23]:
imdb_details.drop('original_title', axis=1, inplace=True)
imdb_details.drop('runtime_minutes', axis=1, inplace=True)
imdb_details.dropna(inplace=True)
imdb_details.isna().sum()

tconst           0
primary_title    0
start_year       0
genres           0
dtype: int64

##### imdb_creators

In [24]:
imdb_creators.isna().sum()

tconst           0
directors     5727
writers      35883
dtype: int64

There are relatively few null values in this dataframe. We'll just drop them.

In [25]:
imdb_creators.dropna(inplace=True)
imdb_creators.isna().sum()

tconst       0
directors    0
writers      0
dtype: int64

##### imdb_principals

In [26]:
imdb_principals.isna().sum()

tconst             0
ordering           0
nconst             0
category           0
job           850502
characters    634826
dtype: int64

Which characters the actors played isn't relavant to our analysis.

In [27]:
imdb_principals.drop('characters', axis=1, inplace=True)

In [28]:
imdb_principals.head()

Unnamed: 0,tconst,ordering,nconst,category,job
0,tt0111414,1,nm0246005,actor,
1,tt0111414,2,nm0398271,director,
2,tt0111414,3,nm3739909,producer,producer
3,tt0323808,10,nm0059247,editor,
4,tt0323808,1,nm3579312,actress,


The category column is storing the data I want. We can drop job and ordering. 

In [29]:
imdb_principals.drop(['job','ordering'], axis=1, inplace=True)
imdb_principals.isna().sum()

tconst      0
nconst      0
category    0
dtype: int64

##### imdb_ratings
Contains no Nulls.

##### rt_info

In [30]:
rt_info.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [31]:
rt_info.isna().sum()

id                 0
synopsis          62
rating             3
genre              8
director         199
writer           449
theater_date     359
dvd_date         359
currency        1220
box_office      1220
runtime           30
studio          1066
dtype: int64

First, we'll drop the information we won't need from this dataframe. 

In [32]:
rt_info.drop(['dvd_date', 'currency', 'studio'], axis=1, inplace=True)

The rest of this information is not easy to replace, so we'll drop the rows that contain null.

In [33]:
rt_info.dropna(inplace=True)
rt_info.isna().sum()

id              0
synopsis        0
rating          0
genre           0
director        0
writer          0
theater_date    0
box_office      0
runtime         0
dtype: int64

##### rt_reviews

In [34]:
rt_reviews.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


The information we want from this dataframe is just the fresh rating. we'll drop the rest.

In [35]:
rt_reviews.drop(['review', 'rating', 'critic', 'top_critic', 'publisher', 'date'], axis=1, inplace=True)
rt_reviews.isna().sum()

id       0
fresh    0
dtype: int64

##### tmdb
Contains no Nulls.

##### budgets
This data needs to be converted to integers.

In [36]:
budgets.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [37]:
# we need to do this three times, so we can save space by defining a function
def fix_values(col):
    broken_values = [value for value in [value for value in budgets[col]]]
    fixed_values = []
    for value in broken_values:
        fixed_values.append(int(value.replace(',','').replace('$','')))
    budgets[col] = fixed_values

In [38]:
fix_values('production_budget')
fix_values('domestic_gross')
fix_values('worldwide_gross')
budgets.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,425000000,760507625,2776345279
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875
2,3,"Jun 7, 2019",Dark Phoenix,350000000,42762350,149762350
3,4,"May 1, 2015",Avengers: Age of Ultron,330600000,459005868,1403013963
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747


I will also alter the release date into a more usable format. 

In [39]:
# split the date into month and year
budgets['month'] = budgets['release_date'].map(lambda x: x.split()[0])
budgets['year'] = budgets['release_date'].map(lambda x: x.split()[2])

I'll also add a column called 'foreign_gross' and drop the ID column because it has no other reference points in my data.

In [40]:
budgets.drop('id', axis=1, inplace=True)

# calculate foreign gross by subtacting dom gross from world gross
budgets['foreign_gross']= budgets['worldwide_gross']-budgets['domestic_gross']
budgets.head()

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,month,year,foreign_gross
0,"Dec 18, 2009",Avatar,425000000,760507625,2776345279,Dec,2009,2015837654
1,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875,May,2011,804600000
2,"Jun 7, 2019",Dark Phoenix,350000000,42762350,149762350,Jun,2019,107000000
3,"May 1, 2015",Avengers: Age of Ultron,330600000,459005868,1403013963,May,2015,944008095
4,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747,Dec,2017,696540365


# Step 4
### Restructuring

Next, I'll review the data I have and attempt to organize it in ways that will be helpful in answering my questions. Then, I'll search for more data where appropriate.

In [41]:
for df_tup in movie_data:
    print('Info for {}:'.format(df_tup[0]))
    display(df_tup[1].head())
    display(df_tup[1].info())
    print('\n\n\n')

Info for gross_profits:


Unnamed: 0,title,domestic_gross,foreign_gross,year
0,Toy Story 3,415000000.0,652000000,2010
1,Alice in Wonderland (2010),334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,296000000.0,664300000,2010
3,Inception,292600000.0,535700000,2010
4,Shrek Forever After,238700000.0,513900000,2010


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2009 entries, 0 to 3353
Data columns (total 4 columns):
title             2009 non-null object
domestic_gross    2009 non-null float64
foreign_gross     2009 non-null object
year              2009 non-null int64
dtypes: float64(1), int64(1), object(2)
memory usage: 78.5+ KB


None





Info for imdb_crew:


Unnamed: 0,nconst,primary_name,primary_profession,known_for_titles,status
0,nm0061671,Mary Ellen Bauder,"miscellaneous,production_manager,producer","tt0837562,tt2398241,tt0844471,tt0118553",Working
1,nm0061865,Joseph Bauer,"composer,music_department,sound_department","tt0896534,tt6791238,tt0287072,tt1682940",Working
2,nm0062070,Bruce Baum,"miscellaneous,actor,writer","tt1470654,tt0363631,tt0104030,tt0102898",Working
3,nm0062195,Axel Baumann,"camera_department,cinematographer,art_department","tt0114371,tt2004304,tt1618448,tt1224387",Working
4,nm0062798,Pete Baxter,"production_designer,art_department,set_decorator","tt0452644,tt0452692,tt3458030,tt2178256",Working


<class 'pandas.core.frame.DataFrame'>
Int64Index: 576444 entries, 0 to 606647
Data columns (total 5 columns):
nconst                576444 non-null object
primary_name          576444 non-null object
primary_profession    576444 non-null object
known_for_titles      576444 non-null object
status                576444 non-null object
dtypes: object(5)
memory usage: 26.4+ MB


None





Info for imdb_alt_titles:


Unnamed: 0,title_id,ordering,title,region,language,types,attributes,is_original_title
0,tt0369610,10,Джурасик свят,BG,bg,,,0.0
1,tt0369610,11,Jurashikku warudo,JP,,imdbDisplay,,0.0
2,tt0369610,12,Jurassic World: O Mundo dos Dinossauros,BR,,imdbDisplay,,0.0
3,tt0369610,13,O Mundo dos Dinossauros,BR,,,short title,0.0
4,tt0369610,14,Jurassic World,FR,,imdbDisplay,,0.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 331703 entries, 0 to 331702
Data columns (total 8 columns):
title_id             331703 non-null object
ordering             331703 non-null int64
title                331703 non-null object
region               278410 non-null object
language             41715 non-null object
types                168447 non-null object
attributes           14925 non-null object
is_original_title    331678 non-null float64
dtypes: float64(1), int64(1), object(6)
memory usage: 20.2+ MB


None





Info for imdb_details:


Unnamed: 0,tconst,primary_title,start_year,genres
0,tt0063540,Sunghursh,2013,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,2019,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,2018,Drama
3,tt0069204,Sabse Bada Sukh,2018,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,2017,"Comedy,Drama,Fantasy"


<class 'pandas.core.frame.DataFrame'>
Int64Index: 140736 entries, 0 to 146143
Data columns (total 4 columns):
tconst           140736 non-null object
primary_title    140736 non-null object
start_year       140736 non-null int64
genres           140736 non-null object
dtypes: int64(1), object(3)
memory usage: 5.4+ MB


None





Info for imdb_creators:


Unnamed: 0,tconst,directors,writers
0,tt0285252,nm0899854,nm0899854
2,tt0462036,nm1940585,nm1940585
3,tt0835418,nm0151540,"nm0310087,nm0841532"
4,tt0878654,"nm0089502,nm2291498,nm2292011",nm0284943
6,tt0996958,nm2286991,"nm2286991,nm2651190"


<class 'pandas.core.frame.DataFrame'>
Int64Index: 109008 entries, 0 to 146142
Data columns (total 3 columns):
tconst       109008 non-null object
directors    109008 non-null object
writers      109008 non-null object
dtypes: object(3)
memory usage: 3.3+ MB


None





Info for imdb_principals:


Unnamed: 0,tconst,nconst,category
0,tt0111414,nm0246005,actor
1,tt0111414,nm0398271,director
2,tt0111414,nm3739909,producer
3,tt0323808,nm0059247,editor
4,tt0323808,nm3579312,actress


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1028186 entries, 0 to 1028185
Data columns (total 3 columns):
tconst      1028186 non-null object
nconst      1028186 non-null object
category    1028186 non-null object
dtypes: object(3)
memory usage: 23.5+ MB


None





Info for imdb_ratings:


Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
tconst           73856 non-null object
averagerating    73856 non-null float64
numvotes         73856 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


None





Info for rt_info:


Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,box_office,runtime
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012",600000,108 minutes
6,10,Some cast and crew from NBC's highly acclaimed...,PG-13,Comedy,Jake Kasdan,Mike White,"Jan 11, 2002",41032915,82 minutes
7,13,"Stewart Kane, an Irishman living in the Austra...",R,Drama,Ray Lawrence,Raymond Carver|Beatrix Christian,"Apr 27, 2006",224114,123 minutes
8,14,"""Love Ranch"" is a bittersweet love story that ...",R,Drama,Taylor Hackford,Mark Jacobson,"Jun 30, 2010",134904,117 minutes
15,22,Two-time Academy Award Winner Kevin Spacey giv...,R,Comedy|Drama|Mystery and Suspense,George Hickenlooper,Norman Snider,"Dec 17, 2010",1039869,108 minutes


<class 'pandas.core.frame.DataFrame'>
Int64Index: 258 entries, 1 to 1545
Data columns (total 9 columns):
id              258 non-null int64
synopsis        258 non-null object
rating          258 non-null object
genre           258 non-null object
director        258 non-null object
writer          258 non-null object
theater_date    258 non-null object
box_office      258 non-null object
runtime         258 non-null object
dtypes: int64(1), object(8)
memory usage: 20.2+ KB


None





Info for rt_reviews:


Unnamed: 0,id,fresh
0,3,fresh
1,3,rotten
2,3,fresh
3,3,fresh
4,3,fresh


<class 'pandas.core.frame.DataFrame'>
Int64Index: 54423 entries, 0 to 54431
Data columns (total 2 columns):
id       54423 non-null int64
fresh    54423 non-null object
dtypes: int64(1), object(1)
memory usage: 1.2+ MB


None





Info for tmdb:


Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
Unnamed: 0           26517 non-null int64
genre_ids            26517 non-null object
id                   26517 non-null int64
original_language    26517 non-null object
original_title       26517 non-null object
popularity           26517 non-null float64
release_date         26517 non-null object
title                26517 non-null object
vote_average         26517 non-null float64
vote_count           26517 non-null int64
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


None





Info for budgets:


Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,month,year,foreign_gross
0,"Dec 18, 2009",Avatar,425000000,760507625,2776345279,Dec,2009,2015837654
1,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875,May,2011,804600000
2,"Jun 7, 2019",Dark Phoenix,350000000,42762350,149762350,Jun,2019,107000000
3,"May 1, 2015",Avengers: Age of Ultron,330600000,459005868,1403013963,May,2015,944008095
4,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747,Dec,2017,696540365


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 8 columns):
release_date         5782 non-null object
movie                5782 non-null object
production_budget    5782 non-null int64
domestic_gross       5782 non-null int64
worldwide_gross      5782 non-null int64
month                5782 non-null object
year                 5782 non-null object
foreign_gross        5782 non-null int64
dtypes: int64(4), object(4)
memory usage: 361.5+ KB


None







### Unifying the Dataframes

The rotten tomatoes data doesn't have movie titles, which means that it will not be able to be joined with the other data. Because it is a much smaller dataset than those with imdb ids, we won't move forward with it.

As for the rest of the data, we will not need the 'imdb_creators' dataframe because that information is already in 'imdb_principals' and the 'gross_profits' dataframe does not contain production costs, so we won't use it either. 

Next we'll create a master dataframe in five steps:

1. Create the main dataframe by joing imdb_details with imdb_ratings

2. Impute the names from 'imdb_crew' into 'imdb_principals to create a 'roles' dataframe.

3. Join the 'roles' dataframe to the main dataframe using tconst as an index.

4. Join the budgets dataframe using title as an index.

5. Drop any columns that are redundant or unecessary
 
##### 1. Create movies

In [42]:
# create the main dataframe
movies = imdb_details.set_index('tconst').join(imdb_ratings.set_index('tconst'), how='inner')

# check that it looks right
movies.head()

Unnamed: 0_level_0,primary_title,start_year,genres,averagerating,numvotes
tconst,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
tt0063540,Sunghursh,2013,"Action,Crime,Drama",7.0,77
tt0066787,One Day Before the Rainy Season,2019,"Biography,Drama",7.2,43
tt0069049,The Other Side of the Wind,2018,Drama,6.9,4517
tt0069204,Sabse Bada Sukh,2018,"Comedy,Drama",6.1,13
tt0100275,The Wandering Soap Opera,2017,"Comedy,Drama,Fantasy",6.5,119


##### 2. Create roles 

In [43]:
# join the dataframes on nconst
roles = imdb_crew.set_index('nconst').join(imdb_principals.set_index('nconst'), how='inner')

# check that is looks right
roles.head()

Unnamed: 0_level_0,primary_name,primary_profession,known_for_titles,status,tconst,category
nconst,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
nm0000002,Lauren Bacall,"actress,soundtrack","tt0038355,tt0117057,tt0071877,tt0037382",Passed,tt1626811,self
nm0000002,Lauren Bacall,"actress,soundtrack","tt0038355,tt0117057,tt0071877,tt0037382",Passed,tt0858500,actress
nm0000002,Lauren Bacall,"actress,soundtrack","tt0038355,tt0117057,tt0071877,tt0037382",Passed,tt1368858,actress
nm0000002,Lauren Bacall,"actress,soundtrack","tt0038355,tt0117057,tt0071877,tt0037382",Passed,tt2053352,archive_footage
nm0000003,Brigitte Bardot,"actress,soundtrack,producer","tt0049189,tt0057345,tt0054452,tt0059956",Working,tt2004245,archive_footage


##### 3. Join roles to main dataframe

In [44]:
#reset movies index
movies.reset_index(inplace=True)

# join roles to movies
movies = movies.set_index('tconst').join(roles.set_index('tconst'), how='inner')

# check that it looks right
movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 616196 entries, tt0063540 to tt9916160
Data columns (total 10 columns):
primary_title         616196 non-null object
start_year            616196 non-null int64
genres                616196 non-null object
averagerating         616196 non-null float64
numvotes              616196 non-null int64
primary_name          616196 non-null object
primary_profession    616196 non-null object
known_for_titles      616196 non-null object
status                616196 non-null object
category              616196 non-null object
dtypes: float64(1), int64(2), object(7)
memory usage: 51.7+ MB


##### 4. Join budgets to main dataframe

In [45]:
# rename the index column
movies.rename(columns={'primary_title':'title'}, inplace=True)
budgets.rename(columns={'movie':'title'}, inplace=True)

# join the dataframes
movies = movies.set_index('title').join(budgets.set_index('title'), how='inner')
movies.reset_index(inplace=True)
display(movies.info())
display(movies.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26998 entries, 0 to 26997
Data columns (total 17 columns):
title                 26998 non-null object
start_year            26998 non-null int64
genres                26998 non-null object
averagerating         26998 non-null float64
numvotes              26998 non-null int64
primary_name          26998 non-null object
primary_profession    26998 non-null object
known_for_titles      26998 non-null object
status                26998 non-null object
category              26998 non-null object
release_date          26998 non-null object
production_budget     26998 non-null int64
domestic_gross        26998 non-null int64
worldwide_gross       26998 non-null int64
month                 26998 non-null object
year                  26998 non-null object
foreign_gross         26998 non-null int64
dtypes: float64(1), int64(6), object(10)
memory usage: 3.5+ MB


None

Unnamed: 0,title,start_year,genres,averagerating,numvotes,primary_name,primary_profession,known_for_titles,status,category,release_date,production_budget,domestic_gross,worldwide_gross,month,year,foreign_gross
0,#Horror,2015,"Crime,Drama,Horror",3.0,3092,Tara Subkoff,"actress,producer,director","tt3526286,tt0954947,tt0119822,tt0209958",Working,director,"Nov 20, 2015",1500000,0,0,Nov,2015,0
1,#Horror,2015,"Crime,Drama,Horror",3.0,3092,Brendan Walsh,"assistant_director,director,producer","tt1190689,tt4878612,tt1125849,tt5565334",Working,producer,"Nov 20, 2015",1500000,0,0,Nov,2015,0
2,#Horror,2015,"Crime,Drama,Horror",3.0,3092,Oren Segal,"manager,producer,miscellaneous","tt3526286,tt0462519,tt1283887,tt1772261",Working,producer,"Nov 20, 2015",1500000,0,0,Nov,2015,0
3,#Horror,2015,"Crime,Drama,Horror",3.0,3092,Haley Murphy,actress,"tt2357547,tt1767382,tt1990422,tt3526286",Working,actress,"Nov 20, 2015",1500000,0,0,Nov,2015,0
4,#Horror,2015,"Crime,Drama,Horror",3.0,3092,Jason Ludman,"producer,miscellaneous,production_manager","tt4041636,tt3778354,tt4779776,tt3526286",Working,producer,"Nov 20, 2015",1500000,0,0,Nov,2015,0


##### 5. Drop unecessary columns

In [47]:
movies.drop(['start_year', 'primary_profession',
             'known_for_titles', 'release_date'], axis=1, inplace=True)

# Conclusion

In [49]:
display(movies.info())
display(movies.head())

movies['title'].nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26998 entries, 0 to 26997
Data columns (total 13 columns):
title                26998 non-null object
genres               26998 non-null object
averagerating        26998 non-null float64
numvotes             26998 non-null int64
primary_name         26998 non-null object
status               26998 non-null object
category             26998 non-null object
production_budget    26998 non-null int64
domestic_gross       26998 non-null int64
worldwide_gross      26998 non-null int64
month                26998 non-null object
year                 26998 non-null object
foreign_gross        26998 non-null int64
dtypes: float64(1), int64(5), object(7)
memory usage: 2.7+ MB


None

Unnamed: 0,title,genres,averagerating,numvotes,primary_name,status,category,production_budget,domestic_gross,worldwide_gross,month,year,foreign_gross
0,#Horror,"Crime,Drama,Horror",3.0,3092,Tara Subkoff,Working,director,1500000,0,0,Nov,2015,0
1,#Horror,"Crime,Drama,Horror",3.0,3092,Brendan Walsh,Working,producer,1500000,0,0,Nov,2015,0
2,#Horror,"Crime,Drama,Horror",3.0,3092,Oren Segal,Working,producer,1500000,0,0,Nov,2015,0
3,#Horror,"Crime,Drama,Horror",3.0,3092,Haley Murphy,Working,actress,1500000,0,0,Nov,2015,0
4,#Horror,"Crime,Drama,Horror",3.0,3092,Jason Ludman,Working,producer,1500000,0,0,Nov,2015,0


2126

With 2126 films in the dataframe we'll be able to do significant analysis on film trends. The last step of the data cleaning process is saving the dataframe as a csv to use in the analysis notebook.

In [50]:
movies.to_csv('movies.csv')