In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [66]:
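#Import all source files into dataframes
import ast

movie_gross_df = pd.read_csv('data/bom.movie_gross.csv')
rt_info_df = pd.read_csv('data/rt.movie_info.tsv', sep='\t', index_col=0)
rt_reviews_df = pd.read_csv('data/rt.reviews.tsv', sep='\t', index_col=0)
#genre_ids is stored as text like "[12, 14, 10751]"; ast.literal_eval parses it into a list without executing arbitrary code the way eval would
tmdb_movies_df = pd.read_csv('data/tmdb.movies.csv', index_col=0, converters={'genre_ids': ast.literal_eval})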
budgets_df = pd.read_csv('data/tn.movie_budgets.csv', index_col=0)

# Movie Gross

In [19]:
#Inspect the beginning of the dataframe
movie_gross_df.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [18]:
#Inspect metadata
movie_gross_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [21]:
#Check how many years are covered by this dataset
movie_gross_df['year'].unique()

array([2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018], dtype=int64)

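Initial observations: This dataset includes both foreign and domestic gross for movies from 2010-2018. It will be useful for looking at what types of movies make the most money, and for looking at trends over time. The studio and domestic_gross columns have only a few missing values and should be easy to clean. The foreign_gross column is missing far more often and is stored as text (object dtype) rather than as a number.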

In [32]:
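#Fill null values in the studio column with 'Other'
#(assign the result back; calling fillna with inplace=True on a selected column is not reliable in recent pandas)
movie_gross_df['studio'] = movie_gross_df['studio'].fillna('Other')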

In [37]:
#Inspect data where domestic_gross is null
movie_gross_df.loc[movie_gross_df['domestic_gross'].isna()]

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
230,It's a Wonderful Afterlife,UTV,,1300000,2010
298,Celine: Through the Eyes of the World,Sony,,119000,2010
302,White Lion,Scre.,,99600,2010
306,Badmaash Company,Yash,,64400,2010
327,Aashayein (Wishes),Relbig.,,3800,2010
537,Force,FoxS,,4800000,2011
713,Empire of Silver,NeoC,,19000,2011
871,Solomon Kane,RTWC,,19600000,2012
928,The Tall Man,Imag.,,5200000,2012
933,Keith Lemon: The Film,Other,,4000000,2012


At a glance, I am reasonably sure that none of these movies are going to be glaring omissions, and since this is less than 1% of the data, I feel comfortable just dropping these rows entirely.
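To back up the "less than 1%" figure: info() reports 3359 non-null domestic_gross values out of 3387 rows, so 28 rows (roughly 0.8%) are missing. Here is a minimal sketch of how that share could be computed directly. The toy dataframe is a stand-in with made-up values, not the real data.

```python
import pandas as pd

# Toy stand-in for movie_gross_df; the values are illustrative only.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "domestic_gross": [1.0, None, 3.0, 4.0],
    "foreign_gross": ["10", None, None, "40"],
})

# isna() returns a boolean mask; the mean of a boolean column is the
# fraction of rows that are missing in that column.
missing_share = toy.isna().mean()
print(missing_share)
```

Running the same expression on the real dataframe would give one fraction per column, which makes it easier to decide which columns are safe to clean by dropping rows.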

In [42]:
#Drop offending rows
movie_gross_df.dropna(subset=['domestic_gross'], inplace=True)

In [43]:
#Inspect rows where foreign_gross is null
movie_gross_df.loc[movie_gross_df['foreign_gross'].isna()]

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
222,Flipped,WB,1800000.0,,2010
254,The Polar Express (IMAX re-issue 2010),WB,673000.0,,2010
267,Tiny Furniture,IFC,392000.0,,2010
269,Grease (Sing-a-Long re-issue),Par.,366000.0,,2010
280,Last Train Home,Zeit.,288000.0,,2010
...,...,...,...,...,...
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018


I am really not sure what to do with this data. The missing values cover about 40% of the dataset (1,350 of the 3,359 remaining rows), so dropping that many rows really isn't an option. My gut instinct says that these films did not get a foreign release at all, which is why the data is missing. Additionally, I'm not sure if I will even want to use the foreign_gross data. So for now, I am going to leave the data as-is, and reconsider this question later if it becomes relevant to our analysis.
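If foreign_gross does become relevant, it will need to be numeric first, since info() lists it as object. A minimal sketch of one way to do that follows. It assumes the column is text because some entries contain thousands separators, which I have not verified here. The toy values and the foreign_gross_num column name are made up for illustration.

```python
import pandas as pd

# Toy stand-in; the comma-formatted entry is an assumed example of why
# the column might have been read as text.
toy = pd.DataFrame({"foreign_gross": ["652000000", "1,300,000", None]})

# Strip thousands separators, then convert. errors="coerce" turns anything
# that still cannot be parsed into NaN instead of raising.
cleaned = toy["foreign_gross"].str.replace(",", "", regex=False)
toy["foreign_gross_num"] = pd.to_numeric(cleaned, errors="coerce")
print(toy)
```

Missing values stay missing after the conversion, so the question of whether they mean "no foreign release" can still be decided separately.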

In [44]:
movie_gross_df.info()
#Other than the foreign_gross column, we have dealt with all null values.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3359 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3359 non-null   object 
 1   studio          3359 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2009 non-null   object 
 4   year            3359 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 157.5+ KB


In [45]:
#Save cleaned data to a new csv file
movie_gross_df.to_csv('Movie Gross Info.csv', index=False)

# Rotten Tomatoes Info

In [59]:
#Inspect Data
rt_info_df.head()

id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [100]:
#Inspect metadata
rt_info_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            1560 non-null   int64         
 1   synopsis      1498 non-null   object        
 2   rating        1557 non-null   object        
 3   genre         1552 non-null   object        
 4   director      1361 non-null   object        
 5   writer        1111 non-null   object        
 6   theater_date  1201 non-null   datetime64[ns]
 7   dvd_date      1201 non-null   object        
 8   currency      340 non-null    object        
 9   box_office    340 non-null    object        
 10  runtime       1530 non-null   object        
 11  studio        494 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(10)
memory usage: 146.4+ KB


Well, every row has at least one null value, so that's fun. I can also see immediately that this data extends way beyond the 2010-2018 range of the previous dataset. Since my business question is about making recommendations for the future, I generally won't be interested in movies released before 2010, so I will begin by splitting this data into pre-2010 and 2010-onward dataframes. I want to retain the pre-2010 data for now because it might be interesting to recommend remakes of some popular older films. That idea rests on my personal feeling that established properties tend to do better at the box office, so I will check whether the data actually supports it before I do any more with the older films.

In [60]:
#Convert theater_date to datetime format
rt_info_df['theater_date'] = pd.to_datetime(rt_info_df['theater_date'])

In [61]:
#Create new dataframe with movies released in 2010 or later
rt_recent_info_df = rt_info_df.loc[rt_info_df['theater_date'] >= '2010-01-01']

In [62]:
#Create new dataframe with pre-2010 movies (rows with no theater_date end up in neither dataframe)
rt_old_info_df = rt_info_df.loc[rt_info_df['theater_date'] < '2010-01-01']

In [63]:
#Save pre-2010 data as a CSV in case I want it later
#(keep the index: it holds the movie id, the only key shared with the reviews data)
rt_old_info_df.to_csv('Rotten Tomatoes Info Pre 2010.csv')

In [64]:
#Save 2010-and-later data as a CSV, again keeping the id index
rt_recent_info_df.to_csv('Rotten Tomatoes Info Post 2010.csv')
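One thing to keep in mind about the date split above: info() shows only 1201 of 1560 rows with a theater_date, and a comparison against a missing date is always False, so the undated rows land in neither dataframe. A minimal sketch on toy rows (the ids and dates mirror the head() output, but the frame itself is a stand-in):

```python
import pandas as pd

# Toy stand-in for rt_info_df with one dated-old, one dated-recent,
# and one undated row.
toy = pd.DataFrame(
    {"theater_date": ["Oct 9, 1971", "Aug 17, 2012", None]},
    index=pd.Index([1, 3, 7], name="id"),
)
toy["theater_date"] = pd.to_datetime(toy["theater_date"])

cutoff = pd.Timestamp("2010-01-01")
recent = toy[toy["theater_date"] >= cutoff]
old = toy[toy["theater_date"] < cutoff]

# NaT compares False against any cutoff, so these rows need their own bucket.
undated = toy[toy["theater_date"].isna()]

print(len(recent), len(old), len(undated))
```

Whether the undated rows matter depends on the question being asked. Keeping them in a third dataframe at least makes the loss visible.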

The good news is that there is much less missing data in the post-2010 subset. However, it is extremely small (that could be a good thing or a bad thing, hard to tell at this point, but it is definitely much smaller than my gross data). I also just realized that this dataset is missing the actual movie titles. My guess is those come from the other Rotten Tomatoes dataset.
For now, I am going to ignore the null values in this dataset. I'm not sure yet which information will be the most valuable, so I don't want to remove or change this data until I am sure I need to.

# Rotten Tomatoes Reviews

In [67]:
#Inspect data
rt_reviews_df.head()

id,review,rating,fresh,critic,top_critic,publisher,date
3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [68]:
#Inspect metadata
rt_reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 54432 entries, 3 to 2000
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   review      48869 non-null  object
 1   rating      40915 non-null  object
 2   fresh       54432 non-null  object
 3   critic      51710 non-null  object
 4   top_critic  54432 non-null  int64 
 5   publisher   54123 non-null  object
 6   date        54432 non-null  object
dtypes: int64(1), object(6)
memory usage: 3.3+ MB


Interestingly, this Rotten Tomatoes dataset doesn't have movie titles either.
That may make the Rotten Tomatoes data difficult to use, since the only thing linking reviews to movies is the id. I'm not going to make any changes to this data for now, and we will see how my inquiries shake out.
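Even without titles, the two Rotten Tomatoes files could still be combined with each other. Here is a minimal sketch that assumes the id index refers to the same movie in both files. That looks plausible but is not confirmed by anything I have checked. The fresh_share name and the toy rows are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for rt_reviews_df and rt_info_df, both indexed by id.
reviews = pd.DataFrame(
    {"fresh": ["fresh", "rotten", "fresh", "fresh"]},
    index=pd.Index([3, 3, 3, 5], name="id"),
)
info = pd.DataFrame(
    {"genre": ["Drama|Science Fiction and Fantasy",
               "Drama|Musical and Performing Arts"]},
    index=pd.Index([3, 5], name="id"),
)

# Share of reviews marked "fresh" per movie id.
fresh_share = (
    reviews["fresh"].eq("fresh")
    .groupby(level="id")
    .mean()
    .rename("fresh_share")
)

# Attach the per-movie share to the info table by id.
joined = info.join(fresh_share, how="left")
print(joined)
```

This would give one review score per movie alongside genre and box office, but it still would not provide a title to match against the other datasets.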

# TMDB Data

In [69]:
#Inspect data
tmdb_movies_df.head()

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [70]:
#Inspect metadata
tmdb_movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26517 entries, 0 to 26516
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   genre_ids          26517 non-null  object 
 1   id                 26517 non-null  int64  
 2   original_language  26517 non-null  object 
 3   original_title     26517 non-null  object 
 4   popularity         26517 non-null  float64
 5   release_date       26517 non-null  object 
 6   title              26517 non-null  object 
 7   vote_average       26517 non-null  float64
 8   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 2.0+ MB


This data appears not to have any null values. I would be shocked if the data was this clean right out of the gate, so it's likely that there are placeholder values instead.
The genres are only id values, so I will need to pull the actual genre names out of TMDB if I want to use them. Considering TMDB apparently has an excellent API, I might be inclined to just grab any data I might want from there instead.
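A quick way to test the placeholder suspicion would be to count values that are technically non-null but carry no information. This is a rough sketch. Which values count as placeholders (empty genre lists, zero votes) is my assumption, and the toy rows are made up.

```python
import pandas as pd

# Toy stand-in for tmdb_movies_df; the second and third rows carry
# hypothetical placeholder-style values.
toy = pd.DataFrame({
    "genre_ids": [[12, 14, 10751], [], [28]],
    "vote_count": [10788, 1, 0],
})

# An empty list is not null, so info() would not flag it.
empty_genres = toy["genre_ids"].map(len).eq(0)

# A zero vote count means vote_average is not based on any votes.
no_votes = toy["vote_count"].eq(0)

print(empty_genres.sum(), no_votes.sum())
```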

# Budgets

In [71]:
#Inspect data
budgets_df.head()

id,release_date,movie,production_budget,domestic_gross,worldwide_gross
1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [72]:
#Inspect metadata
budgets_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5782 entries, 1 to 82
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   release_date       5782 non-null   object
 1   movie              5782 non-null   object
 2   production_budget  5782 non-null   object
 3   domestic_gross     5782 non-null   object
 4   worldwide_gross    5782 non-null   object
dtypes: object(5)
memory usage: 271.0+ KB


In [74]:
#Check date range (release_date is still a string here, so min/max compare alphabetically, not chronologically)
print(budgets_df['release_date'].min())
print(budgets_df['release_date'].max())

Apr 1, 1975
Sep 9, 2016


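Loving this dataset -- no missing values, and the structure is super straightforward. The catch is that every column is stored as text: the dollar amounts carry dollar signs and commas, and release_date is an unparsed string, so the min and max printed above are alphabetical rather than chronological (the first five rows already include a Jun 7, 2019 release, which is later than the printed max). I think the only data I am really wishing I had here is genres.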
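Since info() lists every budgets column as object, the dates and dollar amounts would need converting before any real analysis. A minimal sketch on three rows copied from the head() output (the toy frame is a stand-in for budgets_df):

```python
import pandas as pd

# Three rows taken from the head() output above.
toy = pd.DataFrame({
    "release_date": ["Dec 18, 2009", "Jun 7, 2019", "May 1, 2015"],
    "production_budget": ["$425,000,000", "$350,000,000", "$330,600,000"],
})

# On strings, min/max sort alphabetically by month abbreviation.
print(toy["release_date"].min(), toy["release_date"].max())

# Parsing first gives the chronological range.
dates = pd.to_datetime(toy["release_date"], format="%b %d, %Y")
print(dates.min(), dates.max())

# Strip "$" and "," and convert to integers. int64 is used explicitly
# because worldwide grosses can exceed the 32-bit integer range.
money = (
    toy["production_budget"]
    .str.replace(r"[$,]", "", regex=True)
    .astype("int64")
)
print(money.tolist())
```

The same string cleanup would apply to domestic_gross and worldwide_gross, since they use the same formatting.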

# Possible Questions from Looking at This Data

1. Compare budget to gross to see what types of movies tend to provide the best bang for your buck (great for a startup studio). Also look at movies that lost money and/or broke even to determine what types of movies should be avoided.
2. Analyze the top grossing movies--what do they have in common?
3. Is there a correlation between positive reviews and gross? Is there a correlation between positive reviews and ROI?
4. Netflix top ten
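For questions 1 and 3, I will need a working definition of ROI. A minimal sketch, assuming ROI means (worldwide gross - production budget) / production budget. This is my own working definition, and it ignores marketing and distribution costs. The two rows come from the budgets head() output, with the dollar strings already converted to integers.

```python
import pandas as pd

# Two rows from the budgets data, already converted to numbers.
toy = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix"],
    "production_budget": [425_000_000, 350_000_000],
    "worldwide_gross": [2_776_345_279, 149_762_350],
})

# Profit relative to what was spent. A negative value means the movie
# grossed less than its production budget.
toy["profit"] = toy["worldwide_gross"] - toy["production_budget"]
toy["roi"] = toy["profit"] / toy["production_budget"]

print(toy[["movie", "roi"]])
```

Under this definition an ROI of 0 is break-even on production budget alone, which gives a simple cutoff for the "lost money and/or broke even" part of question 1.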

# Adding Genres to TMDB data

In [75]:
#Load and inspect new genres data
genres = pd.read_csv('data/edited data/tmdb genres.csv')
genres

Unnamed: 0,id,name
0,28,Action
1,12,Adventure
2,16,Animation
3,35,Comedy
4,80,Crime
5,99,Documentary
6,18,Drama
7,10751,Family
8,14,Fantasy
9,36,History


In [41]:
#Expand list of genre ids into individual cells that can be merged with the genre table
df = tmdb_movies_df.explode('genre_ids')

In [45]:
#Review new table to make sure things happened the way I expected
df.head()

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,12,12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
0,14,12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
0,10751,12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,14,10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
1,12,10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610


In [76]:
#Merge dataframes and review new dataframe
with_genres = df.merge(genres, how='left', left_on='genre_ids', right_on='id')
with_genres.head()

Unnamed: 0,genre_ids,id_x,original_language,original_title,popularity,release_date,title,vote_average,vote_count,id_y,name
0,12,12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788,12.0,Adventure
1,14,12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788,14.0,Fantasy
2,10751,12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788,10751.0,Family
3,14,10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610,14.0,Fantasy
4,12,10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610,12.0,Adventure


In [51]:
#Pull out only the columns I think will be useful (copy so later edits don't touch with_genres or trigger a SettingWithCopyWarning)
clean_tmdb = with_genres.loc[:, ['title', 'name', 'release_date', 'popularity', 'vote_average', 'vote_count']].copy()

In [53]:
#Inspect metadata to find null values
clean_tmdb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47834 entries, 0 to 47833
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         47834 non-null  object 
 1   name          45355 non-null  object 
 2   release_date  47834 non-null  object 
 3   popularity    47834 non-null  float64
 4   vote_average  47834 non-null  float64
 5   vote_count    47834 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 3.6+ MB


In [55]:
#Fill null genre values with 'Other' (assign the result back instead of using inplace on a selected column)
clean_tmdb['name'] = clean_tmdb['name'].fillna('Other')

In [56]:
#Inspect metadata
clean_tmdb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47834 entries, 0 to 47833
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         47834 non-null  object 
 1   name          47834 non-null  object 
 2   release_date  47834 non-null  object 
 3   popularity    47834 non-null  float64
 4   vote_average  47834 non-null  float64
 5   vote_count    47834 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 3.6+ MB


In [57]:
#Export this dataframe to a CSV for later use
clean_tmdb.to_csv('TMDB Info With Genre Names.csv', index=False)

# Netflix Top Ten Data

In [2]:
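import ast

#Load the Netflix top ten list and the TMDB data pulled for those titles
netflix_df = pd.read_csv('Data/Edited Data/Netflix Top 10.csv')
#Parse the genre_ids text into lists with ast.literal_eval rather than eval
tmdb_df = pd.read_csv('Data/Edited Data/TMDB-Netflix Data.csv', converters={'genre_ids': ast.literal_eval})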

In [27]:
tmdb_df.head(30)

Unnamed: 0,first_air_date,genre_ids,id,media_type,name,origin_country,original_language,overview,popularity,vote_average,vote_count,release_date
0,11/1/2020,"[16, 10762]",114718,tv,CoComelon,"['GB', 'US']",en,"Cocomelon Kids Hits, Vol. 1",2.744,6.6,5,
1,10/23/2020,[18],87739,tv,The Queen's Gambit,['US'],en,"In a Kentucky orphanage in the 1950s, a young ...",100.144,8.7,1933,
2,12/25/2020,[18],91239,tv,Bridgerton,['US'],en,"Wealth, lust, and betrayal set in the backdrop...",99.582,8.2,1124,
3,5/2/2018,"[10759, 18]",77169,tv,Cobra Kai,['US'],en,This Karate Kid sequel series picks up 30 year...,329.624,8.1,3157,
4,2/24/2021,"[35, 18]",117581,tv,Ginny & Georgia,['US'],en,Angsty and awkward fifteen year old Ginny Mill...,122.521,8.1,553,
5,3/24/2021,"[18, 80, 9648]",120168,tv,Who Killed Sara?,['MX'],es,Hell-bent on exacting revenge and proving he w...,798.923,7.8,761,
6,9/8/2007,"[10751, 35, 18, 10762]",5371,tv,iCarly,['US'],en,"Watch Carly, Sam, and Freddie, as they try to ...",153.554,8.0,944,
7,,"[16, 12, 35, 10751, 878]",501929,movie,The Mitchells vs. the Machines,,en,"A quirky, dysfunctional family's road trip is ...",121.52,8.0,859,4/22/2021
8,2/3/2021,[18],87049,tv,Firefly Lane,['US'],en,"For decades, childhood best friends Kate and T...",29.649,7.9,71,
9,1/25/2016,"[80, 10765]",63174,tv,Lucifer,['US'],en,"Bored and unhappy as the Lord of Hell, Lucifer...",1642.45,8.5,9007,


In [7]:
combined_df = netflix_df.merge(tmdb_df, how='outer', left_on='Title', right_on='name')

In [8]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 171 entries, 0 to 170
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Title                 149 non-null    object 
 1   Type                  149 non-null    object 
 2   Netflix Release Date  149 non-null    object 
 3   Days in Top Ten       149 non-null    float64
 4   Viewership Score      149 non-null    float64
 5   first_air_date        67 non-null     object 
 6   genre_ids             149 non-null    object 
 7   id                    149 non-null    float64
 8   media_type            149 non-null    object 
 9   name                  149 non-null    object 
 10  origin_country        67 non-null     object 
 11  original_language     149 non-null    object 
 12  overview              148 non-null    object 
 13  popularity            149 non-null    float64
 14  vote_average          149 non-null    float64
 15  vote_count            1

In [14]:
combined_df.loc[combined_df['Title'].isna()]

Unnamed: 0,Title,Type,Netflix Release Date,Days in Top Ten,Viewership Score,first_air_date,genre_ids,id,media_type,name,origin_country,original_language,overview,popularity,vote_average,vote_count,release_date
149,,,,,,11/1/2020,"[16, 10762]",114718.0,tv,CoComelon,"['GB', 'US']",en,"Cocomelon Kids Hits, Vol. 1",2.744,6.6,5.0,
150,,,,,,,"[16, 12, 35, 10751, 878]",501929.0,movie,The Mitchells vs. the Machines,,en,"A quirky, dysfunctional family's road trip is ...",121.52,8.0,859.0,4/22/2021
151,,,,,,1/1/2020,[10764],97083.0,tv,The Circle,['US'],en,Status and strategy collide in this social exp...,27.78,8.9,14.0,
152,,,,,,,"[35, 80, 53]",601666.0,movie,I Care a Lot,,en,A court-appointed legal guardian defrauds her ...,98.503,6.7,1446.0,2/19/2021
153,,,,,,,"[99, 80, 18]",799555.0,movie,Operation Varsity Blues: The College Admission...,,en,An examination that goes beyond the celebrity-...,12.594,7.2,68.0,3/17/2021
154,,,,,,5/19/2013,"[10764, 99]",60910.0,tv,Life Below Zero,['US'],en,Viewers go deep into an Alaskan winter to meet...,11.969,7.6,26.0,
155,,,,,,2/28/2021,[99],119815.0,tv,Attenborough's Life in Colour,['GB'],en,Exploring the vital role colour plays in the d...,5.102,7.6,16.0,
156,,,,,,,"[28, 12, 878]",429617.0,movie,Spider-Man: Far From Home,,en,Peter Parker and his friends go on a summer tr...,208.031,7.5,9919.0,6/28/2019
157,,,,,,1/26/2021,"[16, 10762]",117162.0,tv,Go Dog Go,[],en,Handy and inventive pup Tag chases adventure w...,2.086,7.2,5.0,
158,,,,,,,"[10749, 35, 18]",614409.0,movie,To All the Boys: Always and Forever,,en,Senior year of high school takes center stage ...,71.23,7.9,1258.0,2/12/2021


In [13]:
combined_df.loc[combined_df['name'].isna()]

Unnamed: 0,Title,Type,Netflix Release Date,Days in Top Ten,Viewership Score,first_air_date,genre_ids,id,media_type,name,origin_country,original_language,overview,popularity,vote_average,vote_count,release_date
0,Cocomelon,TV Show,1-Jun-20,220.0,730.0,,,,,,,,,,,,
7,The Mitchells vs. The Machines,Movie,30-Apr-21,31.0,204.0,,,,,,,,,,,,
12,The Circle US,TV Show,1-Jan-20,25.0,154.0,,,,,,,,,,,,
25,I Care a Lot.,Movie,19-Feb-21,15.0,103.0,,,,,,,,,,,,
44,Operation Varsity Blues,Movie,17-Mar-21,11.0,67.0,,,,,,,,,,,,
57,Below Zero,Movie,29-Jan-21,6.0,54.0,,,,,,,,,,,,
59,Life in Color with David Attenborough,TV Show,22-Apr-21,9.0,54.0,,,,,,,,,,,,
61,Home,Movie,25-May-21,6.0,53.0,,,,,,,,,,,,
62,"Go, Dog, Go",TV Show,26-Jan-21,15.0,52.0,,,,,,,,,,,,
67,To All the Boys Always and Forever,Movie,12-Feb-21,7.0,47.0,,,,,,,,,,,,


Inspecting the two sets of rows that didn't merge correctly, it looks like the majority are due to small differences in data entry/naming conventions (for example, Cocomelon vs CoComelon, or I Care a Lot. vs I Care a Lot). Some look like they are due to an inconsistency with my API call: because I chose to select only the first search result in all cases, I sometimes ended up with the wrong data (for example, Octonauts vs Octonauts & The Ring of Fire). Because most of these movies/shows are lower on the list, I am comfortable losing most of this data, even though it represents a decent percentage of the data (22 of 149 titles, about 15%). I am going to manually update The Mitchells vs. The Machines and I Care a Lot, since those are the two movies with viewership scores > 100, and then re-join using an inner join.
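If I wanted to recover more of the mismatched rows later, one option would be to merge on a normalized title key instead of the raw strings. This is only a sketch. The normalize_title helper and the key column are hypothetical names of my own, and the rule (lowercase, strip punctuation) is an assumption that would also strip accented characters from non-English titles.

```python
import re
import pandas as pd

def normalize_title(title: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    lowered = title.lower()
    no_punct = re.sub(r"[^a-z0-9\s]", "", lowered)
    return re.sub(r"\s+", " ", no_punct).strip()

# Titles copied from the two unmatched-row outputs above.
netflix = pd.DataFrame({"Title": [
    "The Mitchells vs. The Machines", "I Care a Lot.",
    "Go, Dog, Go", "Below Zero",
]})
tmdb = pd.DataFrame({"name": [
    "The Mitchells vs. the Machines", "I Care a Lot",
    "Go Dog Go", "Life Below Zero",
]})

netflix["key"] = netflix["Title"].map(normalize_title)
tmdb["key"] = tmdb["name"].map(normalize_title)

# indicator=True labels each row as both, left_only, or right_only.
merged = netflix.merge(tmdb, on="key", how="outer", indicator=True)
print(merged["_merge"].value_counts())
```

Spelling and punctuation differences match under this key, while a genuinely wrong search result (Below Zero vs Life Below Zero) stays unmatched, which is the behavior I would want.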

In [32]:
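#Overwrite the two Netflix titles with the TMDB spelling so they match on merge
#(this relies on both dataframes listing the same title at row labels 7 and 25)
netflix_df.at[7, 'Title'] = tmdb_df.at[7, 'name']
netflix_df.at[25, 'Title'] = tmdb_df.at[25, 'name']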

In [33]:
combined_df = netflix_df.merge(tmdb_df, how='inner', left_on='Title', right_on='name')

In [34]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 129 entries, 0 to 128
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Title                 129 non-null    object 
 1   Type                  129 non-null    object 
 2   Netflix Release Date  129 non-null    object 
 3   Days in Top Ten       129 non-null    int64  
 4   Viewership Score      129 non-null    int64  
 5   first_air_date        59 non-null     object 
 6   genre_ids             129 non-null    object 
 7   id                    129 non-null    int64  
 8   media_type            129 non-null    object 
 9   name                  129 non-null    object 
 10  origin_country        59 non-null     object 
 11  original_language     129 non-null    object 
 12  overview              128 non-null    object 
 13  popularity            129 non-null    float64
 14  vote_average          129 non-null    float64
 15  vote_count            1

In [35]:
combined_df.loc[combined_df['overview'].isna()]

Unnamed: 0,Title,Type,Netflix Release Date,Days in Top Ten,Viewership Score,first_air_date,genre_ids,id,media_type,name,origin_country,original_language,overview,popularity,vote_average,vote_count,release_date
36,Jenni Rivera: Mariposa de Barrio,TV Show,1-Jan-21,21,73,6/27/2017,"[10766, 18]",83111,tv,Jenni Rivera: Mariposa de Barrio,[],es,,19.482,7.7,143,


In [36]:
combined_df.to_csv('Netflix Top Ten with Info.csv', index=False)