## Module 1 Project

Please fill out:
* Student name: Jennifer Wadkins
* Student pace: self paced
* Scheduled project review date/time: 
* Instructor name: Jeff Herman
* Blog post URL:



Questions I have:
Do I need to justify not using provided data?

### Importing our modules

We will be using the following libraries in this project:

pandas, numpy, matplotlib, datetime, json

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import json
%matplotlib inline

### Other preparation work

We also recommend installing the nbextensions "Table of Contents 2" and "Collapsible Headings" for easier navigation through this notebook.

GitHub here: https://github.com/Jupyter-contrib/jupyter_nbextensions_configurator

## Source 1 - The Numbers

### Exploring the Data

First we will look at our provided data set from "The Numbers" and see how it needs cleaning. When performing cleaning analysis on ALL datasets in this project, we initially want to know things like:
    * What is the shape of our imported data?
    * How many data entries?
    * What format is the data in?
    * How can we remove the most obvious redundancies (columns we just don't need, etc)
    * Are there missing/null values in the dataset that will need to be removed or imputed?
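
The checklist above can be gathered into one small helper. This is only a minimal sketch: `quick_overview` is a hypothetical name of ours (not part of pandas), and the toy dataframe stands in for the real files.

```python
import pandas as pd


def quick_overview(df):
    """Collect the first-pass facts we check on every dataset (hypothetical helper)."""
    return {
        "shape": df.shape,                           # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),   # storage format of each column
        "nulls": df.isnull().sum().to_dict(),        # missing values per column
    }


# toy data standing in for a real file
toy = pd.DataFrame({"movie": ["A", "B", None], "gross": ["$1", "$0", "$5"]})
overview = quick_overview(toy)
print(overview)
```

Note that the "$0" string in the toy data is not counted as a null, which is the same situation we run into with the real data.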

Before we work on this data set, we should check if we can get better/updated data from the source. We followed the Data link at "The Numbers" to https://www.opusdata.com/ and submitted a request for access to their data set. In the meantime we will continue to work with this data set as given.

In [2]:
# import movie budgets dataset from file
df1 = pd.read_csv('zippedData/tn.movie_budgets.csv')

In [3]:
# taking a look at what we've imported
df1.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [4]:
# what is the shape of our data?
df1.shape
# this data has 5782 entries

(5782, 6)

In [5]:
# what format is the data stored in?
df1.dtypes
# We have a lot of data format problems here. Everything but the id is stored as an object,
# including the monetary numbers and the date. We will fix these problems during data cleanup.

id                    int64
release_date         object
movie                object
production_budget    object
domestic_gross       object
worldwide_gross      object
dtype: object

In [6]:
# do we have any missing/null values?
df1.isnull().sum()
# since we know that all of our data is objects, we MAY actually have missing values. We won't be sure until later.
# for now let's look at the tail of the set and see if anything pops out.

id                   0
release_date         0
movie                0
production_budget    0
domestic_gross       0
worldwide_gross      0
dtype: int64

In [7]:
df1.tail()
# we do, in fact, see entries with a $0 for gross. These aren't showing up as null because
# they are actual entries rather than null values. We will need to remove or impute these entries after we convert these cells.

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
5777,78,"Dec 31, 2018",Red 11,"$7,000",$0,$0
5778,79,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495"
5779,80,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338"
5780,81,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0
5781,82,"Aug 5, 2005",My Date With Drew,"$1,100","$181,041","$181,041"


### Data Cleanup 

On the movie budgets dataset, we find the following things to clean up and resolve:
    * We have 5782 entries. We'll want to explore how/why movies were included in this dataset, as it's not a very large dataset compared to the number of movies released over time
    * all of the data in this set is objects. A lot of the data is numbers, so we need it to be in a numerical format
    * We have an id column, which can be used as our dataset index
    * Many entries with a $0 for gross. These aren't showing up as null in our initial EDA because they are actual entries of $0 not null values. We will need to remove these entries after we convert these cells.
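
To illustrate the last point, here is a small sketch on toy data (the values are ours, chosen for illustration) showing why a null check alone misses "$0" placeholders while a direct comparison finds them.

```python
import pandas as pd

# toy data: "$0" is a real string entry, so isnull() cannot flag it
toy = pd.DataFrame({"domestic_gross": ["$760,507,625", "$0", "$48,482", "$0"]})

null_count = int(toy["domestic_gross"].isnull().sum())          # counts true nulls only
placeholder_count = int((toy["domestic_gross"] == "$0").sum())  # counts "$0" strings
print(null_count, placeholder_count)
```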

We're going to clean up this dataset in the following way before moving on:

    1) set the id as the index
    2) convert the release date into a standard datetime
    3) convert all cost/gross fields into integers
    4) add 2 new columns for domestic net and worldwide net
    5) remove rows without information for budget OR gross, as we won't be able to use this data

In [8]:
# block of cleanup actions performing actions 1-4 listed above

# sets the id as the index, replacing the default integer index (note that ids repeat in this file, so the index is not unique)
df1.set_index('id', inplace=True)

# using pandas built-in datetime converter to change our release date column to standard format
df1['release_date'] = pd.to_datetime(df1['release_date'])


# write a function to convert the cost/gross object entries into proper numbers that we can use in calculation
def convert_numbers(x):
    '''Takes in a string formatted number that starts with $ and may include commas, and returns that 
    number as a whole integer that can be used in calculations'''
    x = x[1:]
    x = x.replace(',', '')
    x = int(x)
    return x

# run the function on each of our three cost/gross entries
for col in ['production_budget', 'domestic_gross', 'worldwide_gross']:
    df1[col] = df1[col].map(convert_numbers)

# add two new columns for domestic net and worldwide net
df1['domestic_net'] = df1['domestic_gross'] - df1['production_budget']
df1['worldwide_net'] = df1['worldwide_gross'] - df1['production_budget']

# check that the data now looks the way we want it
df1.tail()

id,release_date,movie,production_budget,domestic_gross,worldwide_gross,domestic_net,worldwide_net
78,2018-12-31,Red 11,7000,0,0,-7000,-7000
79,1999-04-02,Following,6000,48482,240495,42482,234495
80,2005-07-13,Return to the Land of Wonders,5000,1338,1338,-3662,-3662
81,2015-09-29,A Plague So Pleasant,1400,0,0,-1400,-1400
82,2005-08-05,My Date With Drew,1100,181041,181041,179941,179941
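
As an alternative to mapping a Python function over each value, the same conversion can be sketched with pandas string methods. This is an optional variation on toy data, not the approach used in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({"production_budget": ["$425,000,000", "$7,000", "$1,100"]})

toy["production_budget"] = (
    toy["production_budget"]
    .str.replace("$", "", regex=False)   # drop the leading dollar sign
    .str.replace(",", "", regex=False)   # drop thousands separators
    .astype("int64")                     # whole dollars
)
print(toy["production_budget"].tolist())
```

Passing `regex=False` matters here, because `$` has a special meaning in regular expressions.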


Now that we have corrected our numbers, we need to address the missing data that we identified before. We also want to figure out how the movies were selected for inclusion on this list, if possible, as it's clearly a small sample of all available released movies.

In [9]:
sum(df1['production_budget'] == 0)
# all of the movies have a production budget listed. Regardless, we can't get enough info about success without any gross, so
# we'll be dropping the rows that have a gross of 0 for domestic

0

In [10]:
sum(df1['domestic_gross'] == 0)
# 548 of our entries have no data for domestic_gross. We can't use these in calculations, and we're not going
# to impute them, so we are going to drop these rows from the dataset.

548

In [11]:
df1 = df1[df1['domestic_gross'] != 0].copy()
# dropping all rows where there is no domestic gross information
# (.copy() gives us an independent dataframe, so adding columns later is safe)
df1.tail()

id,release_date,movie,production_budget,domestic_gross,worldwide_gross,domestic_net,worldwide_net
76,2006-05-26,Cavite,7000,70071,71644,63071,64644
77,2004-12-31,The Mongol King,7000,900,900,-6100,-6100
79,1999-04-02,Following,6000,48482,240495,42482,234495
80,2005-07-13,Return to the Land of Wonders,5000,1338,1338,-3662,-3662
82,2005-08-05,My Date With Drew,1100,181041,181041,179941,179941


We're still not sure how movies were chosen for this particular dataset, but at least we've cleaned up the data. We no longer have any movies in the set without a budget, gross and net information. All of our dates are in a standard format, and all of our money entries are in an integer format so that we can do further calculations with them.

### EDA

We're now happy with our cleanup. Time to look deeper into the info our data gives us. Namely, what appears to be the stats that warranted inclusion on this list?

In [12]:
df1.sort_values('worldwide_net', ascending=False)
# our net ranges from positive to negative, so it's not just top grossing movies

id,release_date,movie,production_budget,domestic_gross,worldwide_gross,domestic_net,worldwide_net
1,2009-12-18,Avatar,425000000,760507625,2776345279,335507625,2351345279
43,1997-12-19,Titanic,200000000,659363944,2208208395,459363944,2008208395
7,2018-04-27,Avengers: Infinity War,300000000,678815482,2048134200,378815482,1748134200
6,2015-12-18,Star Wars Ep. VII: The Force Awakens,306000000,936662225,2053311220,630662225,1747311220
34,2015-06-12,Jurassic World,215000000,652270625,1648854864,437270625,1433854864
...,...,...,...,...,...,...,...
5,2002-08-16,The Adventures of Pluto Nash,100000000,4411102,7094995,-95588898,-92905005
53,2001-04-27,Town & Country,105000000,6712451,10364769,-98287549,-94635231
42,2019-06-14,Men in Black: International,110000000,3100000,3100000,-106900000,-106900000
94,2011-03-11,Mars Needs Moms,150000000,21392758,39549758,-128607242,-110450242


In [13]:
df1.sort_values('release_date', ascending=False)
# our release dates run the gamut from 1915 to 2019, so it's not just movies within the last x years

id,release_date,movie,production_budget,domestic_gross,worldwide_gross,domestic_net,worldwide_net
98,2019-06-14,Shaft,30000000,600000,600000,-29400000,-29400000
42,2019-06-14,Men in Black: International,110000000,3100000,3100000,-106900000,-106900000
3,2019-06-07,Dark Phoenix,350000000,42762350,149762350,-307237650,-200237650
35,2019-06-07,Late Night,4000000,246305,246305,-3753695,-3753695
81,2019-06-07,The Secret Life of Pets 2,80000000,63795655,113351496,-16204345,33351496
...,...,...,...,...,...,...,...
70,1925-12-30,Ben-Hur: A Tale of the Christ,3900000,9000000,9000000,5100000,5100000
7,1925-11-19,The Big Parade,245000,11000000,22000000,10755000,21755000
84,1920-09-17,Over the Hill to the Poorhouse,100000,3000000,3000000,2900000,2900000
15,1916-12-24,"20,000 Leagues Under the Sea",200000,8000000,8000000,7800000,7800000


In [14]:
df1.sort_values('production_budget')
# production budget was clearly not a minimum requirement, as our budgets range from only a few thousand dollars
# to over 400 million dollars

id,release_date,movie,production_budget,domestic_gross,worldwide_gross,domestic_net,worldwide_net
82,2005-08-05,My Date With Drew,1100,181041,181041,179941,179941
80,2005-07-13,Return to the Land of Wonders,5000,1338,1338,-3662,-3662
79,1999-04-02,Following,6000,48482,240495,42482,234495
74,1993-02-26,El Mariachi,7000,2040920,2041928,2033920,2034928
77,2004-12-31,The Mongol King,7000,900,900,-6100,-6100
...,...,...,...,...,...,...,...
5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747,303181382,999721747
4,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963,128405868,1072413963
3,2019-06-07,Dark Phoenix,350000000,42762350,149762350,-307237650,-200237650
2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875,-169536125,635063875


In [15]:
#pd.plotting.scatter_matrix(df1[['production_budget', 'domestic_net', 'worldwide_net']], figsize=(15,15));

In [16]:
#df1.plot('release_date', 'domestic_net', kind='scatter', figsize=(10, 10));

Now that we're looking at some visualizations, we realize that this data goes back further than we really need. We're not aiming for the full history of cinema - we're aiming to capitalize on current trends and provide current recommendations. With this in mind, we will drop all entries that are more than 20 years old.

In [17]:
# pd.datetime is deprecated, so we take today's date (at midnight) from pd.Timestamp instead
current_date = pd.Timestamp.now().normalize()

# age of each movie in years (365.2425 days is the average length of a calendar year)
df1['movie_age'] = (current_date - df1['release_date']).dt.days / 365.2425

#df1.drop(df1[(df1['movie_age'] >= 20)].index, inplace=True)

df1.sort_values('movie_age').tail()
#sum(df1['movie_age'] >= 20)


id,release_date,movie,production_budget,domestic_gross,worldwide_gross,domestic_net,worldwide_net,movie_age
70,1925-12-30,Ben-Hur: A Tale of the Christ,3900000,9000000,9000000,5100000,5100000,94.846575
7,1925-11-19,The Big Parade,245000,11000000,22000000,10755000,21755000,94.958829
84,1920-09-17,Over the Hill to the Poorhouse,100000,3000000,3000000,2900000,2900000,100.130735
15,1916-12-24,"20,000 Leagues Under the Sea",200000,8000000,8000000,7800000,7800000,103.862502
78,1915-02-08,The Birth of a Nation,110000,10000000,11000000,9890000,10890000,105.737969
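
The 20-year cutoff described above can also be expressed as a direct date comparison. This is a minimal sketch on toy data; the fixed `reference_date` is an assumption made so the example is reproducible, whereas the notebook itself works from today's date.

```python
import pandas as pd

toy = pd.DataFrame({
    "movie": ["Old", "Recent"],
    "release_date": pd.to_datetime(["1925-12-30", "2015-05-01"]),
})

# assumed reference date, for illustration only
reference_date = pd.Timestamp("2020-06-01")
cutoff = reference_date - pd.DateOffset(years=20)   # the date 20 calendar years earlier

recent = toy[toy["release_date"] >= cutoff]          # keep only movies released since the cutoff
print(recent["movie"].tolist())
```

Comparing against a cutoff date avoids converting ages into fractional years at all.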


## Source 2 - The Movie Database

Time to work with data from a different source. We're using information from TMDB - The Movie Database

### Exploring the Data

We're going to perform our cleanup analysis on this dataset, including:
    * What is the shape of our imported data?
    * How many data entries?
    * What format is the data in?
    * How can we remove the most obvious redundancies (columns we just don't need, etc)
    * Are there missing/null values in the dataset that will need to be removed or imputed?

In [2]:
# importing the movie database movies data set from file
df2 = pd.read_csv('zippedData/tmdb.movies.csv')

In [3]:
# taking a look at what we've imported
df2.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [4]:
# what is the shape of our data?
df2.shape
# this dataset has 26,517 movie entries

(26517, 10)

In [5]:
# what kind of data is stored?
df2.dtypes
# Most of the data in this set seems to be stored in the correct format already (numbers as numbers, etc)
# we'll change the date to a proper date/time

Unnamed: 0             int64
genre_ids             object
id                     int64
original_language     object
original_title        object
popularity           float64
release_date          object
title                 object
vote_average         float64
vote_count             int64
dtype: object

In [6]:
# do we have any missing/null values?
df2.isnull().sum()
# This dataset has no missing values. That doesn't mean there aren't categorical placeholders, and we will look into that further


Unnamed: 0           0
genre_ids            0
id                   0
original_language    0
original_title       0
popularity           0
release_date         0
title                0
vote_average         0
vote_count           0
dtype: int64

In [7]:
df2['vote_count'].value_counts()
# There are 6541 entries in this dataset with only 1 vote. We're going to look at these entries later and figure out what is
# unusual about them.

1       6541
2       3044
3       1757
4       1347
5        969
        ... 
2328       1
6538       1
489        1
2600       1
2049       1
Name: vote_count, Length: 1693, dtype: int64

In [8]:
df2.sort_values('popularity').head()
# while sorting on popularity, I also notice for the first time that a lot of the genre_ids on this low popularity list are absent


Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
13258,13258,[99],403294,en,9/11: Simulations,0.6,2014-07-04,9/11: Simulations,10.0,1
11010,11010,[],203325,en,Slaves Body,0.6,2013-06-25,Slaves Body,0.5,1
11011,11011,[99],186242,en,Re-Emerging: The Jews of Nigeria,0.6,2013-05-17,Re-Emerging: The Jews of Nigeria,0.5,2
11012,11012,[99],116868,en,Occupation: Fighter,0.6,2013-08-02,Occupation: Fighter,0.5,2
11013,11013,[99],85337,en,Wonders Are Many: The Making of Doctor Atomic,0.6,2013-08-07,Wonders Are Many: The Making of Doctor Atomic,0.5,2


In [9]:
df2.describe()
# One thing we can see in this dataset is that there are a LOT of movies with 5 or fewer votes. A full 50% of the dataset
# has 5 or fewer votes. The difference between our 75th percentile and the max goes from 28 to 22,000 votes!!
# We will look more into this and figure out the situation.

Unnamed: 0.1,Unnamed: 0,id,popularity,vote_average,vote_count
count,26517.0,26517.0,26517.0,26517.0,26517.0
mean,13258.0,295050.15326,3.130912,5.991281,194.224837
std,7654.94288,153661.615648,4.355229,1.852946,960.961095
min,0.0,27.0,0.6,0.0,1.0
25%,6629.0,157851.0,0.6,5.0,2.0
50%,13258.0,309581.0,1.374,6.0,5.0
75%,19887.0,419542.0,3.694,7.0,28.0
max,26516.0,608444.0,80.773,10.0,22186.0


In [10]:
df2[(df2['vote_count'] > 30)].count()
# we only have 6347 entries in this dataset with more than 30 user votes. I question the quality of this dataset.
# Overall, this might just not be great data, and since we have access to a TMDB API, we may decide to pull better data
# ourselves

Unnamed: 0           6347
genre_ids            6347
id                   6347
original_language    6347
original_title       6347
popularity           6347
release_date         6347
title                6347
vote_average         6347
vote_count           6347
dtype: int64
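
When we only need the number of rows that pass a threshold, a boolean mask gives it directly, along with the share of the dataset it represents. A small sketch on toy vote counts (the values are ours, for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"vote_count": [1, 2, 5, 28, 31, 22186]})

enough_votes = toy["vote_count"] > 30      # boolean mask, True where the row passes
n_kept = int(enough_votes.sum())           # number of rows that pass
share_kept = float(enough_votes.mean())    # fraction of rows that pass
print(n_kept, round(share_kept, 3))
```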

### Pulling Data from TMDB via API

Before we take time to clean up the data set we have, we're going to take a look at the TMDB API and see if we can pull more or better information first!

TMDB offers an API, so we're going to pull the data that we want to use from their site using an API key. Looking a little further into the other data sets we are using, we can see some interesting options with the TMDB API that we want to add to our data, including:
    * Movie genre list to match up with the genre-ids (under Genres)
    * A more up-to-date dataset in general, retrieved with some predetermined data refinement criteria
    * A list of the IMDB movie ids, which will be incredibly helpful for us to join this TMDB info with our IMDB info later in the notebook (under Movies -> Get External IDs)
    
We're accessing the API documentation for TMDB at https://developers.themoviedb.org/3/getting-started/introduction, after registering for an API key.



#### Discover Movie Data Set

The big workhorse API call for TMDB is in "Discover" located at https://developers.themoviedb.org/3/discover/movie-discover

In this section we can get back a data set that can, in some ways, be pre-cleaned. So we are going to determine how we plan to refine/clean our data set right now, and then figure out ways that we can pull data from TMDB that already fits the parameters we want.

Here are the data cleanup steps we are planning for our data set, some of which can be achieved while we grab the data:

    * Drop entries with fewer than 30 votes. Our client is looking for a blockbuster, not a bespoke production.
    * Drop entries with no genre specified. We'll want to use the genre to make recommendations.
    * Drop entries with 1.0 or less popularity, for the same reasons as votes
    * Drop movies older than 20 years
    * Set the index as the Unnamed column
    
The Discover API lets us pass the following useful parameters to fulfill some of our data refinement goals:
    * primary_release_date.gte lets us include movies that have a primary release date greater than or equal to the specified value
    * primary_release_date.lte lets us include movies that have a primary release date less than or equal to the specified value, which keeps our scope to 2019 or earlier for purposes of our case study. We're looking at movie production in a pre-covid world.
    * vote_count.gte lets us filter for movies with a vote count greater than or equal to the specified value
    * with_original_language lets us pull English-language films. Our client will be making films in English, and we will still have a full 10,000 returns

This will take care of a few of the things we wanted to clean up in our dataset.
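
The parameters listed above could be combined into a single request URL along these lines. This is a sketch only: `build_discover_url` is a hypothetical helper, and the date bounds and vote threshold are assumed values chosen to match the goals described here, not necessarily the exact ones used in our API notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.themoviedb.org/3/discover/movie"


def build_discover_url(api_key, page):
    """Build one Discover request URL (hypothetical helper; filter values are assumptions)."""
    params = {
        "api_key": api_key,
        "primary_release_date.gte": "2000-01-01",  # assumed lower bound (about 20 years back)
        "primary_release_date.lte": "2019-12-31",  # assumed upper bound (2019 or earlier)
        "vote_count.gte": 31,                       # assumed vote threshold
        "with_original_language": "en",
        "page": page,
    }
    return BASE_URL + "?" + urlencode(params)


url = build_discover_url("YOUR_API_KEY", page=1)
print(url)
```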
 
We're getting this and other API data in a separate notebook, because we don't want to make these API calls every time we run this notebook! We've pulled the data via the notebook called "tmdb_api_calls" and saved those as JSON files, and will now import our JSON files here for further processing.
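
The general shape of that separate notebook can be sketched as follows. The `fake_fetch` function is a stand-in we made up so the example runs without network access or an API key; the real notebook would make an actual request for each page.

```python
import json


def fake_fetch(page):
    """Stand-in for a real API request (hypothetical); returns one page of results."""
    return {"page": page, "results": [{"id": page * 10 + i} for i in range(2)]}


def collect_pages(fetch, n_pages):
    """Call `fetch` for pages 1..n_pages and key each response by its page number."""
    return {str(page): fetch(page) for page in range(1, n_pages + 1)}


pages = collect_pages(fake_fetch, n_pages=3)

# round-trip through JSON text, as if saving to and loading from a file
text = json.dumps(pages)
reloaded = json.loads(text)
print(list(reloaded.keys()))
```

Keying each response by its page number means the saved file is a dictionary of pages, with the movie entries one level down under `results`.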

In [69]:
# opening up our Discover dataset

with open('api_data/tmdb_movies.json', encoding='utf-8') as f:
    discover = json.load(f)

type(discover) # we've loaded our Discover dataset and it's a dictionary


dict

In [70]:
discover.keys() # checking the keys
# we ran our function to paginate in the API and as a result, our keys are each of the 500 calls we made to the api. We'll
# need to go a level lower to hit our data.

dict_keys(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130', '131', '132', '133', '134', '135', '136', '137', '138', '139', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '150', '151', '152', '153', '154', '155', '156', '157'

In [71]:
# what does the first level of our dictionary look like?
discover['1']
# This is page 1 of the results

discover['1']['results']
# these are the entries on page 1. Our plan now is to write a function to iterate through the pages, and concatenate the 
# results onto a pandas dataframe

[{'popularity': 1043.179,
  'vote_count': 12572,
  'video': False,
  'poster_path': '/gGEsBPAijhVUFoiNpgZXqRVWJt2.jpg',
  'id': 354912,
  'adult': False,
  'backdrop_path': '/askg3SMvhqEl4OL52YuvdtY40Yb.jpg',
  'original_language': 'en',
  'original_title': 'Coco',
  'genre_ids': [16, 10751, 35, 12, 14, 10402],
  'title': 'Coco',
  'vote_average': 8.2,
  'overview': "Despite his family’s baffling generations-old ban on music, Miguel dreams of becoming an accomplished musician like his idol, Ernesto de la Cruz. Desperate to prove his talent, Miguel finds himself in the stunning and colorful Land of the Dead following a mysterious chain of events. Along the way, he meets charming trickster Hector, and together, they set off on an extraordinary journey to unlock the real story behind Miguel's family history.",
  'release_date': '2017-10-27'},
 {'popularity': 437.037,
  'vote_count': 5113,
  'video': False,
  'poster_path': '/zfE0R94v1E8cuKAerbskfD3VfUt.jpg',
  'id': 474350,
  'adult': Fal

In [72]:
# loop through each page of our response JSON, make it into a dataframe, and concatenate them all into one big dataframe
page_frames = [pd.DataFrame(discover[x]['results']) for x in discover]
tmdb_discover = pd.concat(page_frames)

tmdb_discover #finished dataframe with all 10,000 entries

Unnamed: 0,popularity,vote_count,video,poster_path,id,adult,backdrop_path,original_language,original_title,genre_ids,title,vote_average,overview,release_date
0,1043.179,12572,False,/gGEsBPAijhVUFoiNpgZXqRVWJt2.jpg,354912,False,/askg3SMvhqEl4OL52YuvdtY40Yb.jpg,en,Coco,"[16, 10751, 35, 12, 14, 10402]",Coco,8.2,Despite his family’s baffling generations-old ...,2017-10-27
1,437.037,5113,False,/zfE0R94v1E8cuKAerbskfD3VfUt.jpg,474350,False,/8moTOzunF7p40oR5XhlDvJckOSW.jpg,en,It Chapter Two,"[27, 14]",It Chapter Two,6.9,27 years after overcoming the malevolent super...,2019-09-04
2,369.335,15356,False,/udDclJoHjfjb8Ekgsd4FDteOkCU.jpg,475557,False,/n6bUvigpRFqSwmPp1m2YADdbRBc.jpg,en,Joker,"[80, 53, 18]",Joker,8.2,"During the 1980s, a failed stand-up comedian i...",2019-10-02
3,288.743,6330,False,/qXsndsv3WOoxszmdlvTWeY688eK.jpg,330457,False,/xJWPZIYOEFIjZpBL7SVBGnzRYXp.jpg,en,Frozen II,"[16, 10751, 12, 35, 14]",Frozen II,7.3,"Elsa, Anna, Kristoff and Olaf head far into th...",2019-11-20
4,253.076,3714,False,/ubLbY97m8lYJ3Fykh7nfiwB5eth.jpg,316727,False,/craD86vySKvAkboyeXFnZwHrNA8.jpg,en,The Purge: Election Year,"[28, 27, 53]",The Purge: Election Year,6.4,Two years after choosing not to kill the man w...,2016-06-29
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15,4.985,32,False,/2urrmnf0Pw2u2F9vfhmpa8giaGL.jpg,12855,False,/oh1lR2T1ycTlMmbxG1Isq4KasZc.jpg,en,Blue State,"[35, 18, 10749]",Blue State,6.0,A disgruntled Democrat follows through on a dr...,2007-04-27
16,4.984,40,False,/tLwDdvsgQecx6SopQM9XEp8J1fY.jpg,41488,False,/af1veNOoA5Mkxxh3EEOGoKdcWwF.jpg,en,The Statement,"[18, 53]",The Statement,5.7,"The film is set in France in the 1990s, the Fr...",2003-12-12
17,4.984,32,False,/s576k2BjMnT7wKx3so0sRXZDzmK.jpg,27711,False,/tSjMGRDNzVj519bnChXvfuiYq7o.jpg,en,Killjoy,[27],Killjoy,2.7,"Deep in an inner city hell, a ghastly figure i...",2000-10-24
18,4.983,31,False,/jNlwEt7khyv6GPQfeBvSAodCOkO.jpg,521190,False,/1oSFFpjikWZw7qUyLOngSB0LeCx.jpg,en,The Beach House,"[18, 10749, 10770]",The Beach House,6.2,Cara Rudland thought she’d left her Southern r...,2018-04-28


One important thing that we notice here is that even after filtering for English-language films less than 20 years old with 31+ votes, along with our other preliminary data refinement, we have still hit the 10,000 return maximum using this API. Because of this, we are definitely going to use this data instead of the provided TMDB dataset: we know that we have at least 10,000 entries of quality data instead of around 6,000 (and actually fewer, as we never filtered the original set to the last 20 years).

### Data Cleanup

We're going to do the following work on this dataset to clean it up:
    * drop entries with no genre specified. We'll want to use the genre to make recommendations.
    * Set the id column as the index


We are NOT using the provided TMDB dataset from earlier in the notebook. We've found that we have higher quality data via our API pull, and will be using our tmdb_discover dataset and discarding our df2 dataset.

In [73]:
tmdb_discover.shape
# we have 10,000 entries which is the maximum that can be pulled via the API

(10000, 14)

In [74]:
tmdb_discover.dtypes
# we'll take a look at fixing the release_date format and converting that to a proper datetime. Everything else looks correct.

popularity           float64
vote_count             int64
video                   bool
poster_path           object
id                     int64
adult                   bool
backdrop_path         object
original_language     object
original_title        object
genre_ids             object
title                 object
vote_average         float64
overview              object
release_date          object
dtype: object

In [75]:
tmdb_discover.describe()
# we can see that we have meaningful data with a reasonable vote_count per entry and high popularity

Unnamed: 0,popularity,vote_count,id,vote_average
count,10000.0,10000.0,10000.0,10000.0
mean,17.014888,877.6173,182248.818,6.12231
std,21.837671,2057.028942,181291.150766,0.9461
min,4.983,26.0,12.0,1.5
25%,8.707,60.0,16994.0,5.5
50%,11.588,155.0,91323.5,6.2
75%,17.441,652.0,339410.75,6.8
max,1043.179,27489.0,704264.0,9.0


In [76]:
tmdb_discover[(tmdb_discover['vote_count'] > 30)].count()
# we got 2 returns with fewer than 31 votes (the minimum is 26), for some reason. Not sure why!

popularity           9998
vote_count           9998
video                9998
poster_path          9985
id                   9998
adult                9998
backdrop_path        9015
original_language    9998
original_title       9998
genre_ids            9998
title                9998
vote_average         9998
overview             9998
release_date         9998
dtype: int64

In [77]:
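tmdb_discover[tmdb_discover['genre_ids'].str.len() == 0].count()
# genre_ids holds Python lists here (loaded from JSON), so we test for empty lists rather than the string '[]'
# All of our entries have genre ids. That is very important for our recommendations!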

popularity           0
vote_count           0
video                0
poster_path          0
id                   0
adult                0
backdrop_path        0
original_language    0
original_title       0
genre_ids            0
title                0
vote_average         0
overview             0
release_date         0
dtype: int64
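
Because the Discover results were loaded from JSON, each genre_ids value is a Python list rather than a string. A small sketch on toy data (values are ours) shows how a string comparison and a length-based check behave on such a column.

```python
import pandas as pd

toy = pd.DataFrame({"title": ["A", "B", "C"], "genre_ids": [[27, 14], [], [18]]})

string_matches = int((toy["genre_ids"] == "[]").sum())       # a list never equals a string
empty_lists = int((toy["genre_ids"].str.len() == 0).sum())   # length-based check finds it

with_genre = toy[toy["genre_ids"].str.len() > 0]             # keep rows that have a genre
print(string_matches, empty_lists, with_genre["title"].tolist())
```

The string comparison is the right tool only when the column was read from a CSV, where each list arrives as text.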

In [78]:
tmdb_discover.columns
# we don't need all of these columns, so I need a reminder right here of what I want to drop

Index(['popularity', 'vote_count', 'video', 'poster_path', 'id', 'adult',
       'backdrop_path', 'original_language', 'original_title', 'genre_ids',
       'title', 'vote_average', 'overview', 'release_date'],
      dtype='object')

In [79]:
# cleaning up this dataset

# set our index equal to the first column
# we might go back later and drop this index column and make the title the index
#tmdb_discover.set_index('id', inplace=True)

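# Drop all entries with no genre id. genre_ids holds Python lists here, so we check for
# empty lists; a boolean mask also avoids dropping by label on our non-unique row index
tmdb_discover = tmdb_discover[tmdb_discover['genre_ids'].str.len() > 0].copy()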

# using pandas built-in datetime converter to change our release date column to standard format
tmdb_discover['release_date'] = pd.to_datetime(tmdb_discover['release_date'])

#drop columns by name
tmdb_discover.drop(columns=['video', 'poster_path', 'adult', 'backdrop_path', 'original_title', 'overview', 'original_language'], inplace=True)

In [80]:
tmdb_discover # confirming that we have cleaned up our data and have only the information we need to use


Unnamed: 0,popularity,vote_count,id,genre_ids,title,vote_average,release_date
0,1043.179,12572,354912,"[16, 10751, 35, 12, 14, 10402]",Coco,8.2,2017-10-27
1,437.037,5113,474350,"[27, 14]",It Chapter Two,6.9,2019-09-04
2,369.335,15356,475557,"[80, 53, 18]",Joker,8.2,2019-10-02
3,288.743,6330,330457,"[16, 10751, 12, 35, 14]",Frozen II,7.3,2019-11-20
4,253.076,3714,316727,"[28, 27, 53]",The Purge: Election Year,6.4,2016-06-29
...,...,...,...,...,...,...,...
15,4.985,32,12855,"[35, 18, 10749]",Blue State,6.0,2007-04-27
16,4.984,40,41488,"[18, 53]",The Statement,5.7,2003-12-12
17,4.984,32,27711,[27],Killjoy,2.7,2000-10-24
18,4.983,31,521190,"[18, 10749, 10770]",The Beach House,6.2,2018-04-28


We need this data set in order to make our API calls for the IMDB ID matchup, so we're going to export it to a csv that we can then import into our API production file.

In [84]:
tmdb_discover.to_csv('api_data/tmdb_discover.csv', index=False)
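
One thing to keep in mind when this CSV is read back in: list-valued columns such as genre_ids come back as text. The sketch below, on toy data with an in-memory buffer standing in for the file, shows one way to turn them back into lists with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({"id": [474350, 27711], "genre_ids": [[27, 14], [27]]})

buffer = io.StringIO()            # in-memory stand-in for the CSV file
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
first_raw = reloaded.loc[0, "genre_ids"]    # text such as "[27, 14]", not a list

# parse each text value back into a Python list
reloaded["genre_ids"] = reloaded["genre_ids"].apply(ast.literal_eval)
print(type(first_raw).__name__, reloaded.loc[0, "genre_ids"])
```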

### Movie Genres Data Set

TMDB allows for browser-based API calls, so we will use their browser system for the simpler calls by copying the text results into our source code editor and saving each as a JSON file.

First up, we use https://developers.themoviedb.org/3/genres/get-movie-list to get a JSON dictionary of movie genres.

In [62]:
# We saved the resulting web-based text return as a JSON using our source code editor, and now we load it
# (the with-block closes the file handle once the JSON has been read)
with open('api_data/tmdb_movie_genres.json') as f:
    data = json.load(f)
data

{'genres': [{'id': 28, 'name': 'Action'},
  {'id': 12, 'name': 'Adventure'},
  {'id': 16, 'name': 'Animation'},
  {'id': 35, 'name': 'Comedy'},
  {'id': 80, 'name': 'Crime'},
  {'id': 99, 'name': 'Documentary'},
  {'id': 18, 'name': 'Drama'},
  {'id': 10751, 'name': 'Family'},
  {'id': 14, 'name': 'Fantasy'},
  {'id': 36, 'name': 'History'},
  {'id': 27, 'name': 'Horror'},
  {'id': 10402, 'name': 'Music'},
  {'id': 9648, 'name': 'Mystery'},
  {'id': 10749, 'name': 'Romance'},
  {'id': 878, 'name': 'Science Fiction'},
  {'id': 10770, 'name': 'TV Movie'},
  {'id': 53, 'name': 'Thriller'},
  {'id': 10752, 'name': 'War'},
  {'id': 37, 'name': 'Western'}]}

In [63]:
tmdb_genres = pd.DataFrame.from_dict(data['genres']) # loading our JSON into a pandas dataframe

tmdb_genres.set_index('id', inplace=True) # Setting the genre id as our index

tmdb_genres # Looks as expected

id,name
28,Action
12,Adventure
16,Animation
35,Comedy
80,Crime
99,Documentary
18,Drama
10751,Family
14,Fantasy
36,History
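
A minimal sketch of how this lookup table could later translate the `genre_ids` column into readable names. It assumes `genre_ids` holds strings such as `"[27, 14]"` (as the `== '[]'` comparison in the cleanup cell suggests). The `genre_lookup` and `sample` frames, the `ids_to_names` helper and the `genre_names` column are hypothetical stand-ins for illustration, not part of the project data.

```python
import ast

import pandas as pd

# Hypothetical stand-in for tmdb_genres: genre id as index, name as the only column
genre_lookup = pd.DataFrame(
    {"name": ["Horror", "Fantasy", "Drama"]},
    index=pd.Index([27, 14, 18], name="id"),
)

# Hypothetical stand-in for tmdb_discover, with genre_ids stored as strings
sample = pd.DataFrame({
    "title": ["It Chapter Two", "Killjoy"],
    "genre_ids": ["[27, 14]", "[27]"],
})

id_to_name = genre_lookup["name"].to_dict()


def ids_to_names(raw):
    """Parse a string like '[27, 14]' and map each id to its genre name."""
    ids = ast.literal_eval(raw)
    # Unknown ids are kept as strings rather than silently dropped
    return [id_to_name.get(i, str(i)) for i in ids]


sample["genre_names"] = sample["genre_ids"].apply(ids_to_names)
print(sample[["title", "genre_names"]])
```

If the column already holds Python lists rather than strings, the `ast.literal_eval` step would not be needed.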


### IMDB ID Matchup Data Set

Our next goal is to get a list of the IMDB movie ids for each of the movie ids in our data set. The TMDB movie id is a parameter that must be passed to the API call to get an IMDB id return, so we won't be able to use the web interface for this action.

We've pulled the data via the notebook called "tmdb_api_calls" and saved it as a JSON file, and will now import that file here for further processing.



### EDA

EDA time

## Data Set 3 - Box Office Mojo

We're going to perform the same EDA that we have done on the previous datasets.

In [None]:
#Box Office Mojo movie gross
df3 = pd.read_csv('zippedData/bom.movie_gross.csv')
df3.head()

# what is the shape of our data?
df3.shape
# this dataset has 3387 movie entries

# what kind of data is stored?
df3.dtypes
# Most of this data is stored correctly, except foreign_gross. We will have to fix this column

# what are our columns?
df3.columns
# Not a lot of unclear data here

# do we have any missing/null values?
df3.isnull().sum()
# This dataset has a few missing values in domestic_gross and many in foreign_gross. We will definitely need to deal with
# domestic gross at least, as we need this information for our recommendations

df3['studio'].value_counts()
# There are some odd one-off studios listed here. We might not use these entries, as our client is looking to emulate
# the successful studios

df3

We're going to clean up this dataset in the following way before moving on:

    a) Get rid of bespoke productions by eliminating all entries that are a studio's only movie
    b) Get rid of all entries with no information on domestic gross
    c) Turn our foreign gross numbers into floats instead of objects


In [None]:
sum(df3['studio'].value_counts() == 1)

# getting rid of all entries with no information for domestic gross
df3.drop(df3[(df3['domestic_gross'].isnull())].index, inplace=True)

# turning our foreign_gross entries into floats
#df3['foreign_gross'] = df3['foreign_gross'].notnull().apply(lambda x: float(x))

# dropping studio counts of only 1
counts = df3['studio'].value_counts()
df3.drop(df3[df3['studio'].isin(counts[counts == 1].index)].index, inplace=True)
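
The commented-out conversion in this cell would not do what step (c) asks, because `.notnull()` returns booleans and the lambda would then turn those into 0.0 and 1.0. One possible alternative is sketched here, assuming the column holds numeric strings that may contain thousands separators. The sample values are made up for illustration.

```python
import pandas as pd

# Made-up sample values standing in for df3['foreign_gross']
sample = pd.DataFrame({"foreign_gross": ["652000000", "1,131.6", None]})

# Strip thousands separators, then convert; unparseable or missing values become NaN
cleaned = sample["foreign_gross"].str.replace(",", "", regex=False)
sample["foreign_gross"] = pd.to_numeric(cleaned, errors="coerce")

print(sample.dtypes)
```

Using `errors="coerce"` keeps the cell from failing on unexpected strings, at the cost of turning them into missing values, so it is worth comparing `isnull().sum()` before and after.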

In [None]:
df3

#df3['studio'].value_counts()

#temp = df3.loc[(df3['domestic_gross']<50000)]
#temp

df3['year'].min()
# The oldest movie on this list is from 2010.
# This might be acceptable, as we should strive to use more recent data for our recommendations
# to account for the current moviegoing climate

## Data Sets 4-9 IMDB

We're going to do our EDA on each of these datasets, exploring how they will interact with each other when we merge them. We'll determine what needs to be cleaned before vs after merging the datasets.

### Set 4 -  imdb user ratings per movie

In [None]:
#imdb user ratings per movie
df4 = pd.read_csv('zippedData/title.ratings.csv')

# taking a look at what we've imported
df4.head()
# this dataset is using the movie id and showing the average rating, and the number of votes

# what is the shape of our data?
df4.shape
# this dataset has 73,856 movie entries

# what kind of data is stored?
df4.dtypes
# The data in this set appears to be stored in the proper formats

# what are our columns?
df4.columns
# The 'tconst' will be found throughout our IMDB datasets. We will consider turning it into our index for all of the IMDB datasets.

# do we have any missing/null values?
df4.isnull().sum()
# This dataset has no missing values. That doesn't mean there aren't categorical placeholders, and we will look into that further

# how many entries with vote count 30 or less?
temp = df4.loc[(df4['numvotes'] <= 30)]
temp
#30553 entries with vote count 30 or less. We are going to drop all of these entries, but we will do this AFTER merging.

In [None]:
df4.set_index('tconst', inplace=True)

##### Conclusions for Dataset 4:

We made the unique "tconst" into our index.

### Set 5 - cast and crew per movie

In [None]:
#imdb primary cast and crew per movie
df5 = pd.read_csv('zippedData/title.principals.csv')

# taking a look at what we've imported
df5.head()
# this dataset is using the movie id and showing each credited person (nconst) with their category, job and characters

# what is the shape of our data?
df5.shape
# this dataset has 1,028,186 cast and crew entries

# what kind of data is stored?
df5.dtypes
# The data in this set appears to be stored in the proper formats

# what are our columns?
df5.columns
# The 'tconst' will be found throughout our IMDB datasets. We will check whether it is unique here before using it as an index.

# do we have any missing/null values?
df5.isnull().sum()
# This dataset has large numbers of missing values. We will inspect the data itself to determine if this is important.

df5.head()

In [None]:
# After inspecting the data, we can see that the "job" column is generally an extension of the "category" column 
# We will drop this column.
df5.drop(columns=['job'], inplace=True)

# We can also see that the "ordering" column is just for sorting the different jobs for each movie id
# we don't really need this column and will remove it as well
df5.drop(columns=['ordering'], inplace=True)

# lastly, we want all of our data to contribute to a recommendation, and while the actors themselves may be important,
# the characters they play do not seem particularly important. We will also drop the "characters" column
df5.drop(columns=['characters'], inplace=True)

df5.head()



##### Conclusions for Dataset 5:

This dataset had three unnecessary columns which were removed. We now have a cleaned list of the cast and crew for each movie id.

After studying this dataset, we see that the movie id (tconst) is not unique here, because each movie has one row per credited person. Because of this, we will not use tconst as the index for this dataset and will keep it as a regular column.
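
A quick way to confirm this kind of observation is sketched below. The `principals` frame and its ids are made-up placeholders shaped like the cast and crew table, not real IMDB values.

```python
import pandas as pd

# Made-up rows shaped like the cast and crew table: one row per person per movie
principals = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000001", "tt0000002"],
    "nconst": ["nm0000001", "nm0000002", "nm0000001"],
})

# Not unique in this sample, because a movie id repeats once per credited person
print(principals["tconst"].is_unique)

# Check whether the (tconst, nconst) pair repeats anywhere in this sample
print(principals.duplicated(subset=["tconst", "nconst"]).any())
```

On the real data the pair may still repeat when one person holds several roles on the same movie, so the second check is worth running before relying on it as a key.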

### Set 6 - director and writer assignments per movie

In [None]:
#IMDB directors and writers per movie
df6 = pd.read_csv('zippedData/title.crew.csv')
df6


This appears to give the same information as the previous dataset, but in a different format. Let's do a few comparisons and see if that is the case.

In [None]:
temp = df5.loc[df5['tconst'] == 'tt0417610']
temp
# our director is nm1145057 and our writer is nm0083201, let's check if it's the same in dataset 6

temp = df6.loc[df6['tconst'] == 'tt0417610']
temp
# at first glance it's not the same! But then we see that the director is also a writer.

# using this information, we'll have to decide if we want to value when a person is credited in multiple roles.

# let's check one more multi-role
temp = df5.loc[df5['tconst'] == 'tt0999913']
temp
# we have 1 director and 3 writers listed

temp = df6.loc[df6['tconst'] == 'tt0999913']
temp
# 1 director and 4 writers, where one of the writers is the director.

# Let's take a look at a listing from this dataset with no writer attached, in dataset 5
temp = df5.loc[df5['tconst'] == 'tt0879859']
temp
# there is indeed no writer attached to this movie according to dataset 5

#### Dataset 6 conclusions:

Based on what we are seeing here, we are NOT going to use this dataset. We'll use the other cast and crew dataset to get this same information already broken apart, rather than having to break apart this dataset.
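
For reference, a rough sketch of what "breaking apart" this dataset would involve. It assumes the `writers` column stores several person ids in one comma-separated string, which is an assumption about the file layout. The row and ids below are made-up placeholders.

```python
import pandas as pd

# Made-up row shaped like df6; comma-separated ids are an assumption about the file
crew = pd.DataFrame({
    "tconst": ["tt0000001"],
    "directors": ["nm0000001"],
    "writers": ["nm0000001,nm0000002"],
})

# Split the writers string into a list, then give each writer its own row
writers_long = (
    crew.assign(nconst=crew["writers"].str.split(","))
        .explode("nconst")[["tconst", "nconst"]]
        .reset_index(drop=True)
)
print(writers_long)
```

The same step would have to be repeated for `directors`, which is part of why the already-split cast and crew dataset is the simpler choice.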

### Set 7 - movie stats

In [37]:
#imdb stats per movie
df7 = pd.read_csv('zippedData/title.basics.csv')

# taking a look at what we've imported
df7.head()
# this dataset is using the movie id and finally we have the title of the movie, as well as the year, the runtime, and the genres

# what is the shape of our data?
df7.shape
# this dataset has 146,144 movie entries

# what kind of data is stored?
df7.dtypes
# The data in this set appears to be stored in the proper formats

# what are our columns?
df7.columns
# The 'tconst' is found throughout our IMDB datasets and is the movie identifier
# we will want to understand the distinction between primary_title and original_title

# do we have any missing/null values?
df7.isnull().sum()
# This dataset has some missing values. We will inspect the data itself to determine if this is important.
# there are no primary titles or years missing, which seems like the most important data to have

# let's look at where the primary title and original title don't match in order to understand more about that
temp = df7.loc[(df7['primary_title']) != (df7['original_title'])]
temp
# We can see from this that the original title is the movie's foreign language title. We will use the translated titles
# and drop this column

# Does this list include only movies, or does it also have shows? Let's take a look at runtime minutes
df7.sort_values('runtime_minutes', ascending=False).head()
# It's not clear if these are movies or shows

# We sort by start year to see what the range of release dates is for our info 
df7.sort_values('start_year')
# the IMDB dataset starts at 2010, and includes unreleased future movies! This is really going to skew our dataset!
# we are only going to use movies up to 2019.

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
9599,tt1566491,Brainiacs in La La Land,Brainiacs in La La Land,2010,,Comedy
43264,tt2578092,Fireplace for your Home: Crackling Fireplace w...,Fireplace for your Home: Crackling Fireplace w...,2010,61.0,Music
11550,tt1634300,Role/Play,Role/Play,2010,85.0,"Drama,Romance"
11551,tt1634332,Johan1,Johan Primero,2010,78.0,"Comedy,Drama,Romance"
11552,tt1634334,Hands Up,Les mains en l'air,2010,90.0,Drama
...,...,...,...,...,...,...
2948,tt10300396,Untitled Star Wars Film,Untitled Star Wars Film,2024,,
52213,tt3095356,Avatar 4,Avatar 4,2025,,"Action,Adventure,Fantasy"
2949,tt10300398,Untitled Star Wars Film,Untitled Star Wars Film,2026,,Fantasy
96592,tt5637536,Avatar 5,Avatar 5,2027,,"Action,Adventure,Fantasy"
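
The 2019 cutoff mentioned in the cell above is not applied anywhere yet. A minimal sketch of one way to apply it, using a few rows copied from the output above; the names `sample`, `LAST_YEAR` and `released` are hypothetical.

```python
import pandas as pd

# A few rows copied from the output above, standing in for df7
sample = pd.DataFrame({
    "primary_title": ["Role/Play", "Hands Up", "Avatar 4", "Avatar 5"],
    "start_year": [2010, 2010, 2025, 2027],
})

LAST_YEAR = 2019  # cutoff stated in the notes above

# Keep only rows with a start year on or before the cutoff
released = sample[sample["start_year"] <= LAST_YEAR].copy()
print(released)
```

Taking a `.copy()` avoids chained-assignment warnings if the filtered frame is modified later.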


In [None]:
df7.drop(columns=['original_title'], inplace=True)

df7.set_index('tconst', inplace=True)

df7

##### Dataset 7 conclusions:

This dataset seems nearly ready to use. We dropped the original_title column and decided to use the translated primary titles.

We set the index as the unique value tconst.

### Set 8 - alternate titles

In [None]:
#imdb alternate titles
df8 = pd.read_csv('zippedData/title.akas.csv')

# taking a look at what we've imported
df8


##### Dataset 8 conclusions:

It is immediately apparent that this dataset lists all of the alternate titles for each movie id.

We won't be using this dataset.

### Set 9 - detailed crew information

In [None]:
#imdb detailed crew information
df9 = pd.read_csv('zippedData/name.basics.csv')

# taking a look at what we've imported
df9.head()
# this dataset has the information about the cast and crew ids

# what is the shape of our data?
df9.shape
# this dataset has 606,648 people entries

# what kind of data is stored?
df9.dtypes
# The data in this set appears to be stored in the proper formats

# what are our columns?
df9.columns

# do we have any missing/null values?
df9.isnull().sum()
# This dataset has a lot of missing values for birth year, death year, profession, and known for.
# We don't need some of this information, including birth year, profession and known for
# We will keep death year to make sure we don't make any recommendations for cast/crew that is deceased

In [None]:
# the only info we need on people is if they are alive, so we will drop their year of birth
df9.drop(columns=['birth_year'], inplace=True)

# We don't need the specific professions of our players. We can see their role from dataset 5
df9.drop(columns=['primary_profession'], inplace=True)

# We're going to use other, more quantifiable metrics of popularity than the known for information
df9.drop(columns=['known_for_titles'], inplace=True)

# we will make the unique nconst the index
df9.set_index('nconst', inplace=True)

In [None]:
df9.head()
df9.sort_values('death_year').head()
# now we realize that we can have writers and composers that are long deceased. We are going to keep the death_year column.

##### Dataset 9 conclusions:

We got rid of some unnecessary columns (birth year, profession and "known for" titles), kept the death_year column, and made the unique nconst the index.
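
A hedged sketch of how the retained `death_year` column could be used to exclude deceased people before making recommendations. The `people` frame, its ids and the `primary_name` column are made-up placeholders for illustration.

```python
import numpy as np
import pandas as pd

# Made-up people standing in for df9 after cleanup (nconst as the index);
# the primary_name column is an assumption about what remains in the frame
people = pd.DataFrame(
    {"primary_name": ["Person A", "Person B"], "death_year": [np.nan, 1998.0]},
    index=pd.Index(["nm0000001", "nm0000002"], name="nconst"),
)

# A missing death_year is treated as "still living"; this is an assumption,
# since a missing value could also mean the year was never recorded
living = people[people["death_year"].isna()]
print(living)
```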

### IMDB data set observations/summaries

df4 - User ratings and votes for each movie id. Join on movie id (tconst).

df5 - Cast and crew for each movie id. Join on movie id tconst and person id nconst. Consider this join as a separate dataframe.

df6 - DO NOT USE. Redundant information with df5.

df7 - Movie title, year, runtime and genre for each movie id. Join on movie id (tconst).

df8 - DO NOT USE. Alternate titles.

df9 - Cast and crew info. Join on nconst.


In [None]:
# We are joining our df4 and df7 on the tconst which is the movie id
imdb_movies = df7.join(df4, how='left')
imdb_movies

#how many null values are there in the averagerating and numvotes categories?
imdb_movies.isnull().sum()

# we're not interested in any movies that aren't even popular enough to have ratings. We are dropping all movies
# with no rating entries, and all movies with 30 or fewer votes, just like our df2 cleanup
imdb_movies.drop(imdb_movies[imdb_movies['averagerating'].isnull()].index, inplace=True)
imdb_movies.drop(imdb_movies[imdb_movies['numvotes'] <= 30].index, inplace=True)

imdb_movies.sort_values('numvotes', ascending=False).head()
# We now have 43,303 entries

In [None]:
# we are joining our df5 and df9 to move the cast and crew information over to where they have performed

imdb_crew = df5.join(df9, on='nconst', how='inner')
# we lost a few hundred entries (out of over a million): an inner join drops the df5 rows whose nconst has no match in df9

imdb_crew
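
To see exactly which rows an inner join would drop, one option is a left merge with `indicator=True`, sketched here on made-up stand-ins for df5 and df9. All ids and the names `credits`, `people`, `checked` and `unmatched` are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for df5 (credits) and df9 (people, indexed by nconst)
credits = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000001", "tt0000002"],
    "nconst": ["nm0000001", "nm0000002", "nm0000003"],
})
people = pd.DataFrame(
    {"death_year": [None, 1998.0]},
    index=pd.Index(["nm0000001", "nm0000002"], name="nconst"),
)

# A left merge keeps every credit; the _merge column marks rows without a person record
checked = credits.merge(
    people, left_on="nconst", right_index=True, how="left", indicator=True
)
unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched), "credit rows have no matching person record")
```

Inspecting `unmatched` before switching to `how='inner'` makes it clear whether the dropped rows matter for the analysis.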


We've successfully turned our 6 IMDB datasets into 2 dataframes.

## Sets 10 and 11 - Rotten Tomatoes

In [None]:
df10 = pd.read_csv('zippedData/rt.reviews.tsv', sep='\t', encoding='Latin-1')
df10.tail()

# It's immediately apparent that these are the posted reviews for movies on Rotten Tomatoes, using the id of the movie

In [None]:
df11 = pd.read_csv('zippedData/rt.movie_info.tsv', sep='\t', encoding='Latin-1')
df11.tail()

# this is the information on the movies, by id. But it doesn't include the movie name!!

After checking out the Rotten Tomatoes/Fandango API usage, we see that they do not grant API access to individuals. We will have to scrape for more data if we want to use this data.