# API Calls

## Overview

This project analyzes box office and streaming data to provide strategic direction for a newly formed movie division. Our analysis explores financial data, film ratings, and viewership data to help guide the division in deciding what types of movies to create.

## Business Problem

Microsoft is interested in entering the content creation space, and we need to use data insights to help them figure out how to launch their studio.

## Notebook Overview

After investigating the provided data, I found there are still pieces of information I would like to collect to develop a more robust analysis. The TMDB data I received lists genres only by id, so I need the associated names for that information to be meaningful. Additionally, I previously used web scraping to collect some data from Netflix (see Web Scraping Notebook), and I would like to cross-reference that with TMDB's API to gather additional information about each title.

## Getting TMDB Genre Codes

In [1]:
import requests
import pandas as pd
from config import tmdb_key

The TMDB API has an endpoint that returns the current list of movie genres along with their ids.

In [4]:
#request the information from the server
genres = requests.get(f'https://api.themoviedb.org/3/genre/movie/list?api_key={tmdb_key}&language=en-US')

#parse the API response into a dataframe
genres = genres.json()
genres = genres['genres']
df = pd.DataFrame.from_dict(genres)
df

Unnamed: 0,id,name
0,28,Action
1,12,Adventure
2,16,Animation
3,35,Comedy
4,80,Crime
5,99,Documentary
6,18,Drama
7,10751,Family
8,14,Fantasy
9,36,History


In [9]:
#save this list to a CSV file
df.to_csv('TMDB Genres.csv', index=False)
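
With the genre table in hand, the ids can be translated back into names. Below is a minimal sketch of that lookup, assuming a `genre_ids` column that holds lists of ids like the one TMDB returns. The helper `genre_names` and the small sample frames are hypothetical and only reuse a few rows shown in this notebook.

```python
import pandas as pd

# A few rows of the genre table shown above (not the full list).
genres_df = pd.DataFrame({
    "id": [28, 12, 16, 35, 80, 10751],
    "name": ["Action", "Adventure", "Animation", "Comedy", "Crime", "Family"],
})

# Build a plain dict so each id can be looked up directly.
genre_lookup = {int(i): name for i, name in zip(genres_df["id"], genres_df["name"])}


def genre_names(ids, lookup):
    """Translate a list of genre ids into names, flagging ids not in the lookup."""
    return [lookup.get(i, f"unknown ({i})") for i in ids]


# Illustrative movie rows with TMDB-style lists of genre ids.
movies = pd.DataFrame({
    "title": ["Yes Day", "Thunder Force"],
    "genre_ids": [[35, 10751], [28, 12, 35]],
})
movies["genres"] = movies["genre_ids"].apply(lambda ids: genre_names(ids, genre_lookup))
print(movies[["title", "genres"]])
```

Ids missing from the lookup are labeled rather than dropped, so any gaps in the genre list stay visible.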

## Retrieve More Info About the Netflix Top Ten

I want to gather more information about the titles in the Netflix Top Ten list I scraped (see web scraping for additional info on that process and Netflix analysis for further analysis of this data). Using the TMDB API, I will feed each title into the API search function and return more information about each title.
(Please note that, for efficiency, I did a small amount of data cleaning in Excel, so this table is a cleaned-up version of what was scraped, predominantly to eliminate repeated titles.)

In [8]:
#load netflix data
netflix_df = pd.read_csv('Data/Edited Data/Netflix Top 10.csv')
netflix_df

Unnamed: 0,Title,Type,Netflix Release Date,Days in Top Ten,Viewership Score
0,Cocomelon,TV Show,1-Jun-20,220,730
1,The Queen's Gambit,TV Show,23-Oct-20,73,446
2,Bridgerton,TV Show,25-Dec-20,58,432
3,Cobra Kai,TV Show,28-Aug-20,53,370
4,Ginny & Georgia,TV Show,24-Feb-21,44,335
...,...,...,...,...,...
144,The Secret Life of Pets 2,Movie,27-Dec-19,1,2
145,Your Highness,Movie,1-May-21,1,2
146,Are We There Yet?,Movie,1-Feb-21,1,1
147,Step Up Revolution,Movie,1-Mar-21,1,1


In [9]:
#extract list of titles to search
queries = list(netflix_df['Title'])

#initialize list to store API call info
info = []

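#loop through each title and get the data I am looking for from the API
for query in queries:
    #pass the title via params so characters like '&' and spaces are URL-encoded
    request = requests.get('https://api.themoviedb.org/3/search/multi',
                           params={'api_key': tmdb_key, 'query': query})
    data = request.json()
    #keep only the first search result (assumes every search returns at least one)
    info.append(data['results'][0])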
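
The loop above relies on two things going right: the title has to survive being placed in a URL, and the search has to return at least one result. The sketch below shows one way to guard both using only the standard library. `build_search_url`, `pick_result`, `DEMO_KEY`, and the sample payload are hypothetical names and data for illustration; they are not part of the TMDB API or of this notebook.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.themoviedb.org/3/search/multi"


def build_search_url(title, api_key):
    """Return a search URL with the title safely percent-encoded."""
    return f"{SEARCH_URL}?{urlencode({'api_key': api_key, 'query': title})}"


def pick_result(payload, expected_type=None):
    """Return the first result, preferring one whose media_type matches.

    Returns None when the search came back empty.
    """
    results = payload.get("results", [])
    if not results:
        return None
    if expected_type is not None:
        for result in results:
            if result.get("media_type") == expected_type:
                return result
    return results[0]


# Hand-written stand-in for a parsed API response (not real TMDB output).
sample_payload = {"results": [
    {"media_type": "tv", "name": "Example Show"},
    {"media_type": "movie", "title": "Example Movie"},
]}

url = build_search_url("Ginny & Georgia", "DEMO_KEY")
chosen = pick_result(sample_payload, expected_type="movie")
empty = pick_result({"results": []})
print(url)
print(chosen, empty)
```

Preferring a result whose `media_type` matches the scraped `Type` column is one possible way to reduce wrong matches. Note that skipping titles with no results would change the row count, so the row labels would no longer line up with `netflix_df` automatically.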

In [29]:
#convert raw data into a dataframe
additional_data_df = pd.DataFrame.from_dict(info)
additional_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   backdrop_path      147 non-null    object 
 1   first_air_date     68 non-null     object 
 2   genre_ids          149 non-null    object 
 3   id                 149 non-null    int64  
 4   media_type         149 non-null    object 
 5   name               68 non-null     object 
 6   origin_country     68 non-null     object 
 7   original_language  149 non-null    object 
 8   original_name      68 non-null     object 
 9   overview           149 non-null    object 
 10  popularity         149 non-null    float64
 11  poster_path        148 non-null    object 
 12  vote_average       149 non-null    float64
 13  vote_count         149 non-null    int64  
 14  adult              81 non-null     object 
 15  original_title     81 non-null     object 
 16  release_date       80 non-

A quick look at this data shows that there are four different columns for title, and all four have a significant number of null values. Let's look at those columns to see what's going on.

In [16]:
additional_data_df.loc[:, ['name', 'original_name', 'original_title', 'title', 'media_type']]

Unnamed: 0,name,original_name,original_title,title,media_type
0,CoComelon,CoComelon,,,tv
1,The Queen's Gambit,The Queen's Gambit,,,tv
2,Bridgerton,Bridgerton,,,tv
3,Cobra Kai,Cobra Kai,,,tv
4,Ginny & Georgia,Ginny & Georgia,,,tv
...,...,...,...,...,...
144,,,The Secret Life of Pets 2,The Secret Life of Pets 2,movie
145,,,Your Highness,Your Highness,movie
146,,,Are We There Yet?,Are We There Yet?,movie
147,,,Step Up Revolution,Step Up Revolution,movie


It looks like the 'name' columns are for TV shows and the 'title' columns are for movies. Since this analysis is focused only on movies, I can drop the TV show rows and clean up the null values that way.

In [30]:
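#keep only the movie rows; .copy() gives an independent dataframe that is safe to modify later
additional_data_df = additional_data_df.loc[additional_data_df['media_type'] == 'movie'].copy()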

In [31]:
additional_data_df.head()

Unnamed: 0,backdrop_path,first_air_date,genre_ids,id,media_type,name,origin_country,original_language,original_name,overview,popularity,poster_path,vote_average,vote_count,adult,original_title,release_date,title,video
7,/qXMXmhsJeW28DYp5iOar9BGepVS.jpg,,"[16, 12, 35, 10751, 878]",501929,movie,,,en,,"A quirky, dysfunctional family's road trip is ...",117.264,/mI2Di7HmskQQ34kz0iau6J1vr70.jpg,8.0,898,False,The Mitchells vs. the Machines,2021-04-22,The Mitchells vs. the Machines,False
15,/y6FYAhaeFMj9Lsk8OdF2NeWQkbd.jpg,,"[28, 14, 10751, 35]",615677,movie,,,en,,When alien invaders capture Earth's superheroe...,236.653,/1S21HpcKY6uQ9UAw68aICmrJaq6.jpg,6.3,484,False,We Can Be Heroes,2020-12-25,We Can Be Heroes,False
17,/5aBJ2f3D1MhsY1psSX97shHcovO.jpg,,"[35, 10751]",638597,movie,,,en,,A mom and dad who usually say no decide to say...,155.542,/rejrD9ovTHJbfmpLM0mbEliEPV6.jpg,7.0,578,False,Yes Day,2021-03-12,Yes Day,False
22,/z7HLq35df6ZpRxdMAE0qE3Ge4SJ.jpg,,"[28, 12, 35]",615678,movie,,,en,,In a world where supervillains are commonplace...,622.463,/3mKMWP5OokB7QpcOMA1yl8BXFAF.jpg,5.8,628,False,Thunder Force,2021-04-09,Thunder Force,False
25,/iQbGg0hsIVTnU5x4Fsy0BeJ3tSE.jpg,,"[35, 80, 53]",601666,movie,,,en,,A court-appointed legal guardian defrauds her ...,114.29,/gKnhEsjNefpKnUdAkn7INzIFLSu.jpg,6.7,1458,False,I Care a Lot,2021-02-19,I Care a Lot,False


In [32]:
additional_data_df.drop(columns = 
                        ['backdrop_path', 'first_air_date', 'media_type', 'name', 'original_name', 
                         'poster_path', 'adult', 'original_title', 'video', 'origin_country'], inplace=True)

In [33]:
additional_data_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 81 entries, 7 to 148
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   genre_ids          81 non-null     object 
 1   id                 81 non-null     int64  
 2   original_language  81 non-null     object 
 3   overview           81 non-null     object 
 4   popularity         81 non-null     float64
 5   vote_average       81 non-null     float64
 6   vote_count         81 non-null     int64  
 7   release_date       80 non-null     object 
 8   title              81 non-null     object 
dtypes: float64(2), int64(2), object(5)
memory usage: 6.3+ KB


The data now looks to be in good shape for analysis. However, the genres are still stored as ids, so they will need to be read alongside the genre list saved earlier.

## Merge Dataframes

In [41]:
additional_data_df

Unnamed: 0,genre_ids,id,original_language,overview,popularity,vote_average,vote_count,release_date,title
7,"[16, 12, 35, 10751, 878]",501929,en,"A quirky, dysfunctional family's road trip is ...",117.264,8.0,898,2021-04-22,The Mitchells vs. the Machines
15,"[28, 14, 10751, 35]",615677,en,When alien invaders capture Earth's superheroe...,236.653,6.3,484,2020-12-25,We Can Be Heroes
17,"[35, 10751]",638597,en,A mom and dad who usually say no decide to say...,155.542,7.0,578,2021-03-12,Yes Day
22,"[28, 12, 35]",615678,en,In a world where supervillains are commonplace...,622.463,5.8,628,2021-04-09,Thunder Force
25,"[35, 80, 53]",601666,en,A court-appointed legal guardian defrauds her ...,114.290,6.7,1458,2021-02-19,I Care a Lot
...,...,...,...,...,...,...,...,...,...
144,"[12, 16, 35, 10751]",412117,en,Max the terrier must cope with some major life...,80.784,7.0,2152,2019-05-24,The Secret Life of Pets 2
145,"[35, 12, 14]",38319,en,"A fantasy movie about an arrogant, lazy prince...",11.433,5.4,978,2011-04-08,Your Highness
146,"[10751, 12, 35, 10749]",11637,en,"The fledgling romance between Nick, a playboy ...",16.062,5.5,670,2005-01-20,Are We There Yet?
147,"[10402, 18, 10749]",85446,en,Emily arrives in Miami with aspirations to bec...,25.199,6.9,1712,2012-07-26,Step Up Revolution


In [44]:
netflix_df

Unnamed: 0,Title,Type,Netflix Release Date,Days in Top Ten,Viewership Score
0,Cocomelon,TV Show,1-Jun-20,220,730
1,The Queen's Gambit,TV Show,23-Oct-20,73,446
2,Bridgerton,TV Show,25-Dec-20,58,432
3,Cobra Kai,TV Show,28-Aug-20,53,370
4,Ginny & Georgia,TV Show,24-Feb-21,44,335
...,...,...,...,...,...
144,The Secret Life of Pets 2,Movie,27-Dec-19,1,2
145,Your Highness,Movie,1-May-21,1,2
146,Are We There Yet?,Movie,1-Feb-21,1,1
147,Step Up Revolution,Movie,1-Mar-21,1,1


Our Netflix Top Ten data still contains TV shows, so I will drop those rows before merging the data.

In [54]:
netflix_df = netflix_df.loc[netflix_df['Type'] == 'Movie']

In [55]:
combined_df = additional_data_df.merge(netflix_df, how='outer', left_on='title', right_on='Title')

In [56]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96 entries, 0 to 95
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   genre_ids             81 non-null     object 
 1   id                    81 non-null     float64
 2   original_language     81 non-null     object 
 3   overview              81 non-null     object 
 4   popularity            81 non-null     float64
 5   vote_average          81 non-null     float64
 6   vote_count            81 non-null     float64
 7   release_date          80 non-null     object 
 8   title                 81 non-null     object 
 9   Title                 82 non-null     object 
 10  Type                  82 non-null     object 
 11  Netflix Release Date  82 non-null     object 
 12  Days in Top Ten       82 non-null     float64
 13  Viewership Score      82 non-null     float64
dtypes: float64(6), object(8)
memory usage: 11.2+ KB


In [63]:
combined_df.loc[(combined_df['title'].isna())]

Unnamed: 0,genre_ids,id,original_language,overview,popularity,vote_average,vote_count,release_date,title,Title,Type,Netflix Release Date,Days in Top Ten,Viewership Score
81,,,,,,,,,,The Mitchells vs. The Machines,Movie,30-Apr-21,31.0,204.0
82,,,,,,,,,,I Care a Lot.,Movie,19-Feb-21,15.0,103.0
83,,,,,,,,,,Operation Varsity Blues,Movie,17-Mar-21,11.0,67.0
84,,,,,,,,,,Below Zero,Movie,29-Jan-21,6.0,54.0
85,,,,,,,,,,Home,Movie,25-May-21,6.0,53.0
86,,,,,,,,,,To All the Boys Always and Forever,Movie,12-Feb-21,7.0,47.0
87,,,,,,,,,,Malcolm & Marie,Movie,5-Feb-21,4.0,28.0
88,,,,,,,,,,Biggie: I Got A Story to Tell,Movie,1-Mar-21,4.0,27.0
89,,,,,,,,,,Monster,Movie,7-May-21,5.0,16.0
90,,,,,,,,,,The Stand-In,Movie,10-Apr-21,2.0,9.0


In [64]:
combined_df.loc[(combined_df['Title'].isna())]

Unnamed: 0,genre_ids,id,original_language,overview,popularity,vote_average,vote_count,release_date,title,Title,Type,Netflix Release Date,Days in Top Ten,Viewership Score
0,"[16, 12, 35, 10751, 878]",501929.0,en,"A quirky, dysfunctional family's road trip is ...",117.264,8.0,898.0,2021-04-22,The Mitchells vs. the Machines,,,,,
4,"[35, 80, 53]",601666.0,en,A court-appointed legal guardian defrauds her ...,114.29,6.7,1458.0,2021-02-19,I Care a Lot,,,,,
14,"[99, 80, 18]",799555.0,en,An examination that goes beyond the celebrity-...,16.887,7.2,68.0,2021-03-17,Operation Varsity Blues: The College Admission...,,,,,
23,"[28, 12, 878]",429617.0,en,Peter Parker and his friends go on a summer tr...,252.284,7.5,9949.0,2019-06-28,Spider-Man: Far From Home,,,,,
24,[35],773655.0,en,2020: A year so [insert adjective of choice he...,10.884,6.6,434.0,2020-12-27,Death to 2020,,,,,
26,"[10749, 35, 18]",614409.0,en,Senior year of high school takes center stage ...,90.065,7.9,1261.0,2021-02-12,To All the Boys: Always and Forever,,,,,
40,"[99, 10402]",797309.0,en,"Christopher Wallace, AKA The Notorious B.I.G.,...",7.626,7.1,57.0,2021-03-01,Biggie: I Got a Story to Tell,,,,,
56,"[28, 14, 12]",458576.0,en,A portal transports Cpt. Artemis and an elite ...,663.254,7.0,1705.0,2020-12-03,Monster Hunter,,,,,
66,[35],633908.0,en,"When ordered to serve a year in rehab, actress...",7.726,5.6,69.0,2020-12-03,The Stand In,,,,,
68,"[28, 18, 53]",651589.0,es,A commentary on Spanish society dressed as a t...,17.526,6.5,108.0,2020-12-18,Sky High,,,,,
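
The two null checks above can also be done in one pass. As a minimal sketch, pandas' `merge` accepts `indicator=True`, which adds a `_merge` column recording whether each row came from the left frame, the right frame, or both. The small frames here are illustrative stand-ins built from a few titles shown above.

```python
import pandas as pd

# Illustrative subsets of the two tables (titles and ratings taken from the output above).
tmdb = pd.DataFrame({
    "title": ["Yes Day", "I Care a Lot", "Monster Hunter"],
    "vote_average": [7.0, 6.7, 7.0],
})
netflix = pd.DataFrame({
    "Title": ["Yes Day", "I Care a Lot.", "Monster"],
})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'.
merged = tmdb.merge(netflix, how="outer", left_on="title", right_on="Title", indicator=True)

# Count how many rows matched and how many did not.
counts = merged["_merge"].value_counts()

# Pull out every row that failed to match, from either side.
unmatched = merged.loc[merged["_merge"] != "both", ["title", "Title", "_merge"]]
print(counts)
print(unmatched)
```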


Inspecting the two sets of rows that didn't merge, most of the mismatches appear to come from small differences in data entry and naming conventions (for example, The Stand-In vs The Stand In). Others come from an inconsistency in my API call: because I chose to select only the first search result in all cases, I sometimes ended up with the wrong data (for example, Octonauts vs Octonauts & The Ring of Fire). Because most of these movies/shows are lower on the list, I am comfortable losing most of this data, even though it represents a decent percentage of the data. I am going to manually update The Mitchells vs. The Machines and I Care a Lot, since those are the two movies with viewership scores > 100, and then re-join using an inner join.
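
As an alternative to editing titles by hand, differences in capitalization and punctuation could be smoothed out with a normalized join key. This is only a sketch of that idea, not what this notebook does: `normalize_title` and the `key` column are hypothetical, and the sample titles are taken from the mismatched rows above.

```python
import re

import pandas as pd


def normalize_title(title):
    """Lowercase a title and replace runs of punctuation/whitespace with one space."""
    # Note: this simple pattern also strips non-ASCII letters.
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


tmdb = pd.DataFrame({"title": [
    "The Mitchells vs. the Machines",
    "I Care a Lot",
    "The Stand In",
    "Monster Hunter",
]})
netflix = pd.DataFrame({"Title": [
    "The Mitchells vs. The Machines",
    "I Care a Lot.",
    "The Stand-In",
    "Monster",
]})

# Build the same normalized key on both sides, then join on it.
tmdb["key"] = tmdb["title"].map(normalize_title)
netflix["key"] = netflix["Title"].map(normalize_title)
matched = tmdb.merge(netflix, on="key", how="inner")
print(matched[["title", "Title"]])
```

Titles that differ in actual words, such as Monster and Monster Hunter, still fail to match, which is appropriate when the search returned a different film.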

In [None]:
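#adjusting data to match: overwrite the TMDB title with the Netflix title
#(row labels 7 and 25 line up because the API results were stored in the same order as netflix_df)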
additional_data_df.at[7, 'title'] = netflix_df.at[7, 'Title']
additional_data_df.at[25, 'title'] = netflix_df.at[25, 'Title']

In [65]:
#re-join with an inner join
combined_df = additional_data_df.merge(netflix_df, how='inner', left_on='title', right_on='Title')

In [66]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67 entries, 0 to 66
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   genre_ids             67 non-null     object 
 1   id                    67 non-null     int64  
 2   original_language     67 non-null     object 
 3   overview              67 non-null     object 
 4   popularity            67 non-null     float64
 5   vote_average          67 non-null     float64
 6   vote_count            67 non-null     int64  
 7   release_date          66 non-null     object 
 8   title                 67 non-null     object 
 9   Title                 67 non-null     object 
 10  Type                  67 non-null     object 
 11  Netflix Release Date  67 non-null     object 
 12  Days in Top Ten       67 non-null     int64  
 13  Viewership Score      67 non-null     int64  
dtypes: float64(2), int64(4), object(8)
memory usage: 7.9+ KB


In [None]:
combined_df.to_csv('Netflix Top Ten with Info.csv', index=False)

## Next Steps

- Return to my data cleaning notebook to merge these two new CSVs with some of my existing data