# Importing our data from TMDB using API calls

## "Discover" Data Set

First, we pull our large set of movie data from the Discover API with the following parameters:
    * include_adult false (excludes adult movies from the response)
    * include_video false (excludes short-format video from the response)
    * page (we request every page up to the maximum allowed, which is 500)
    * primary_release_date.gte 2000-01-01 (we want 20 years of movie data only)
    * primary_release_date.lte 2019-12-31 (this is a pre-COVID case study)
    * vote_count.gte 31 (a movie needs at least 31 votes to be included)
    * with_original_language en (our client will make movies in English, and this still leaves 10,000 results)
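
The parameters above can be collected in a dictionary and encoded into the query string rather than typed into one long URL. The following is a minimal sketch, not the notebook's own code: the helper name `build_discover_url` is hypothetical, and the `language` and `sort_by` values are taken from the request URL used later in this notebook.

```python
from urllib.parse import urlencode

# Base endpoint taken from the request URL used later in this notebook.
DISCOVER_URL = "https://api.themoviedb.org/3/discover/movie"


def build_discover_url(api_key, page):
    """Hypothetical helper: return the Discover URL for one page of results."""
    params = {
        "api_key": api_key,
        "language": "en-US",
        "sort_by": "popularity.desc",
        "include_adult": "false",
        "include_video": "false",
        "page": page,
        "primary_release_date.gte": "2000-01-01",
        "primary_release_date.lte": "2019-12-31",
        "vote_count.gte": 31,
        "with_original_language": "en",
    }
    return DISCOVER_URL + "?" + urlencode(params)


# "YOUR_KEY" is a placeholder, not a real key.
print(build_discover_url("YOUR_KEY", 1))
```

Keeping the parameters in one dictionary makes it easier to see which filter produced which part of the request.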
    
TMDB returns at most 10,000 results for a query, so we chose parameter values that keep only the most relevant and important entries for making recommendations.

Helpfully, TMDB sorts results by popularity, so if more than 10,000 movies match, we get only the 10,000 most popular entries for the 20 years from 2000 through 2019.

https://developers.themoviedb.org/3/discover/movie-discover
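
The 10,000 cap is consistent with the page limit: 500 pages at 20 results per page. The sketch below assumes that page size and the `total_pages` field of the Discover response; `pages_to_request` is a hypothetical helper that avoids asking for pages that do not exist when a query matches fewer than 10,000 movies.

```python
MAX_PAGES = 500          # page limit stated above
RESULTS_PER_PAGE = 20    # assumed TMDB page size


def pages_to_request(total_pages, max_pages=MAX_PAGES):
    """Return the page numbers worth requesting, never more than max_pages."""
    last_page = max(0, min(total_pages, max_pages))
    return range(1, last_page + 1)


print(MAX_PAGES * RESULTS_PER_PAGE)        # 10000
print(len(pages_to_request(1200)))         # 500
print(list(pages_to_request(3)))           # [1, 2, 3]
```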

First, we load the libraries we need to make these calls.

In [1]:
import json
import requests

Our API key is stored in a JSON file in our .secret folder. We read it here and save it to our `api_key` variable.

In [2]:
# Retrieve our API key and load it into a variable

def get_keys(path):
    '''takes in a path string, loads the json file, and returns the info inside'''
    with open(path) as f:
        return json.load(f)

keys = get_keys("/Users/Thren/Dropbox/Flatiron/module_01/.secret/tmdb_api.json")

api_key = keys['api_key']
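
The path above is specific to one machine. As a hedged alternative, the sketch below checks an environment variable first and falls back to the JSON file. The variable name `TMDB_API_KEY` and the helper `load_api_key` are assumptions for illustration, not part of the original notebook.

```python
import json
import os


def load_api_key(path, env_var="TMDB_API_KEY"):
    """Return the key from the environment if set, otherwise from the JSON file."""
    key = os.environ.get(env_var)
    if key:
        return key
    with open(path) as f:
        keys = json.load(f)
    if "api_key" not in keys:
        raise KeyError("expected an 'api_key' entry in {}".format(path))
    return keys["api_key"]
```

Calling `load_api_key` with the path to the key file would then replace the two assignments above.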

In [3]:
# Running our API call to the TMDB API, one request per page (1 through 500)

response = {}

for i in range(1, 501):
    url = 'https://api.themoviedb.org/3/discover/movie?api_key={}&language=en-US&sort_by=popularity.desc&include_adult=false&include_video=false&page={}&primary_release_date.gte=2000-01-01&primary_release_date.lte=2019-12-31&vote_count.gte=31&with_original_language=en'.format(api_key, i)
    response[i] = requests.get(url).json()
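
The loop above stores whatever each request returns, including any error payload. A sketch of a more defensive loop follows; it is an illustration rather than the notebook's method. `fetch_discover_pages` and its `get_page` argument are hypothetical names, and the `total_pages` field and the optional pause between requests are assumptions about how the API behaves.

```python
import time


def fetch_discover_pages(get_page, max_pages=500, pause=0.0):
    """Collect pages 1..max_pages, stopping early at the reported last page.

    get_page is any callable that takes a page number and returns the parsed
    JSON dictionary for that page (for example, a wrapper around requests.get).
    """
    pages = {}
    for page in range(1, max_pages + 1):
        payload = get_page(page)
        if "results" not in payload:
            # Error payloads carry no "results" list; stop instead of storing them.
            raise RuntimeError("page {} failed: {}".format(page, payload))
        pages[page] = payload
        if page >= payload.get("total_pages", max_pages):
            break
        time.sleep(pause)
    return pages


# Stand-in for the real API so the sketch runs without a key or network access.
def fake_get_page(page):
    return {"page": page, "results": [{"id": page}], "total_pages": 3}


print(sorted(fetch_discover_pages(fake_get_page)))  # [1, 2, 3]
```

In the notebook, `get_page` could wrap the `requests.get(url).json()` call from the cell above.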

In [4]:
# open tmdb_movies.json and save the collected responses to it
with open('api_data/tmdb_movies.json', 'w', encoding='utf-8') as f:
    json.dump(response, f, ensure_ascii=False, indent=4)
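
Each saved page is expected to hold its movies under a `results` key, and `json.dump` writes the integer page numbers as string keys. The sketch below flattens such a structure into one list of movie records under that assumed layout. `flatten_pages` is a hypothetical helper, and the two sample records reuse ids and titles shown in this notebook's output.

```python
import json


def flatten_pages(pages):
    """Return every movie dictionary from a {page_number: payload} mapping."""
    movies = []
    # Page keys are strings after a JSON round trip, so sort them numerically.
    for page in sorted(pages, key=int):
        movies.extend(pages[page].get("results", []))
    return movies


# Two tiny pages shaped like the API response, using titles from this notebook.
sample = {
    1: {"page": 1, "results": [{"id": 354912, "title": "Coco"}]},
    2: {"page": 2, "results": [{"id": 475557, "title": "Joker"}]},
}
round_tripped = json.loads(json.dumps(sample))  # keys become "1" and "2"
print([m["title"] for m in flatten_pages(round_tripped)])  # ['Coco', 'Joker']
```

The resulting list of dictionaries can be passed to `pd.DataFrame` when the data is cleaned in the main notebook.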

## IMDB ID Matchup

!!! STOP !!!
Don't run this portion of the notebook until AFTER the Discover data set created above has been cleaned and exported in the main student.ipynb notebook.

Then return to this notebook and run this section.

In [5]:
# load our exported csv which is the dataframe we prepared in our main student.ipynb file
import pandas as pd
data = pd.read_csv('api_data/tmdb_discover.csv')

data

Unnamed: 0,popularity,vote_count,id,genre_ids,title,vote_average,release_date
0,520.621,12639,354912,"[16, 10751, 35, 12, 14, 10402]",Coco,8.2,2017-10-27
1,330.357,15378,475557,"[80, 53, 18]",Joker,8.2,2019-10-02
2,288.149,5133,474350,"[27, 14]",It Chapter Two,6.9,2019-09-04
3,257.243,6344,330457,"[16, 10751, 12, 35, 14]",Frozen II,7.3,2019-11-20
4,216.184,4925,512200,"[12, 35, 14]",Jumanji: The Next Level,7.0,2019-12-04
...,...,...,...,...,...,...,...
9995,4.837,61,16048,"[35, 18, 10749]",All About Anna,3.2,2005-11-24
9996,4.836,67,24959,"[16, 878]",Program,6.9,2003-02-07
9997,4.830,72,127144,"[27, 14, 16]",Don't Hug Me I'm Scared,7.5,2011-07-25
9998,4.830,40,41488,"[18, 53]",The Statement,5.7,2003-12-12


In [40]:
#data = data.head(10)

In [6]:
# use the TMDB API to look up the IMDB id for each movie and store it in the dataframe
# the id column will hold strings such as 'tt2380307', so convert it from int first
data['id'] = data['id'].astype(object)

for ind in data.index:
    tmdb_id = data.loc[ind, 'id']
    url = 'https://api.themoviedb.org/3/movie/{}/external_ids?api_key={}'.format(tmdb_id, api_key)
    response = requests.get(url).json()
    # .loc assigns on the dataframe itself, avoiding chained indexing
    data.loc[ind, 'id'] = response['imdb_id']
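
Some movies may have no IMDB id on record, in which case the lookup can come back empty. The sketch below keeps the lookup separate from the dataframe so that missing ids stay visible. `lookup_imdb_ids` and `get_external_ids` are hypothetical names, and the assumption that a missing id appears as `None` or as an absent key is mine rather than the notebook's.

```python
def lookup_imdb_ids(tmdb_ids, get_external_ids):
    """Map each TMDB id to its IMDB id, or to None when none is returned.

    get_external_ids is any callable that takes a TMDB id and returns the
    parsed JSON from the external_ids endpoint.
    """
    mapping = {}
    for tmdb_id in tmdb_ids:
        payload = get_external_ids(tmdb_id)
        mapping[tmdb_id] = payload.get("imdb_id") or None
    return mapping


# Stand-in for the API; the two known pairs are taken from this notebook's output.
def fake_get_external_ids(tmdb_id):
    known = {354912: "tt2380307", 475557: "tt7286456"}
    return {"id": tmdb_id, "imdb_id": known.get(tmdb_id)}


# The id 1 stands in for a movie with no IMDB match.
ids = lookup_imdb_ids([354912, 475557, 1], fake_get_external_ids)
print(ids)
```

With the real endpoint, the mapping could be applied with `data['id'].map(ids)` and stored in a new column, which keeps the original TMDB ids available for comparison.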


In [7]:
data

Unnamed: 0,popularity,vote_count,id,genre_ids,title,vote_average,release_date
0,520.621,12639,tt2380307,"[16, 10751, 35, 12, 14, 10402]",Coco,8.2,2017-10-27
1,330.357,15378,tt7286456,"[80, 53, 18]",Joker,8.2,2019-10-02
2,288.149,5133,tt7349950,"[27, 14]",It Chapter Two,6.9,2019-09-04
3,257.243,6344,tt4520988,"[16, 10751, 12, 35, 14]",Frozen II,7.3,2019-11-20
4,216.184,4925,tt7975244,"[12, 35, 14]",Jumanji: The Next Level,7.0,2019-12-04
...,...,...,...,...,...,...,...
9995,4.837,61,tt0349080,"[35, 18, 10749]",All About Anna,3.2,2005-11-24
9996,4.836,67,tt0366178,"[16, 878]",Program,6.9,2003-02-07
9997,4.830,72,tt2501618,"[27, 14, 16]",Don't Hug Me I'm Scared,7.5,2011-07-25
9998,4.830,40,tt0340376,"[18, 53]",The Statement,5.7,2003-12-12


We have successfully converted all of our TMDB ids into IMDB ids!
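
A quick check can back up this claim. The IMDB ids in the output above have the form `tt` followed by digits, so the sketch below flags any value that does not match. The helper name `invalid_imdb_ids` and the assumption that every valid id has at least seven digits are mine.

```python
import re

# Pattern assumed from the ids shown above, e.g. "tt2380307".
IMDB_ID_PATTERN = re.compile(r"tt\d{7,}")


def invalid_imdb_ids(values):
    """Return the values that are not strings of the form tt + digits."""
    return [
        value for value in values
        if not (isinstance(value, str) and IMDB_ID_PATTERN.fullmatch(value))
    ]


print(invalid_imdb_ids(["tt2380307", "tt0349080", None, 354912]))  # [None, 354912]
```

Calling `invalid_imdb_ids(data['id'])` on the converted column should return an empty list if every lookup succeeded.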

Time to re-export this dataframe so we can load it back into the student notebook.

In [8]:
#exporting the csv
data.to_csv('api_data/tmdb_discover_converted.csv', index=False)