# API Calls

## Project Overview

This project analyzes box office and streaming data to provide strategic direction for a newly formed movie division. Our analysis explores financial data, film ratings, and viewership data to help guide the division in deciding what types of movies to create.

## Notebook Overview

After investigating the provided data, there are pieces of information I would still like to collect to develop a more robust analysis. The TMDB data I received lists genres only by id, so I need the associated genre names for that information to be meaningful. Additionally, I previously used web scraping to collect some data from Netflix (see [Web Scraping](https://github.com/zelda4669/movie-data-analysis/blob/main/Data%20Cleaning%20Notebooks/Noelle%20-%20Web%20Scraping.ipynb)), and I would like to cross-reference that data with TMDB's API to gather additional information about each title.

## Getting TMDB Genre Codes

In [1]:
import requests
import pandas as pd
from config import tmdb_key

The TMDB API has a dedicated endpoint that returns the current list of movie genres along with their ids.

In [2]:
#request the information from the server
genres = requests.get(f'https://api.themoviedb.org/3/genre/movie/list?api_key={tmdb_key}&language=en-US')

#parse the API response into a dataframe
genres = genres.json()
genres = genres['genres']
df = pd.DataFrame.from_dict(genres)
df

Unnamed: 0,id,name
0,28,Action
1,12,Adventure
2,16,Animation
3,35,Comedy
4,80,Crime
5,99,Documentary
6,18,Drama
7,10751,Family
8,14,Fantasy
9,36,History


In [3]:
#save this list to a CSV file
df.to_csv('TMDB Genres.csv', index=False)

## Retrieve More Info About the Netflix Top Ten

I want to gather more information about the titles in the Netflix Top Ten list I scraped (see [Web Scraping](https://github.com/zelda4669/movie-data-analysis/blob/main/Data%20Cleaning%20Notebooks/Noelle%20-%20Web%20Scraping%20.ipynb) for additional info on that process and [Netflix Analysis](https://github.com/zelda4669/movie-data-analysis/blob/main/Netflix%20Top%20Ten.ipynb) for further analysis of this data). Using the TMDB API, I will pass each title to the API's search endpoint and collect the additional information it returns.
(Note that, for efficiency, I did a small amount of data cleaning in Excel, so this is a slightly cleaned version of the scraped dataset.)

In [4]:
#load netflix data
netflix_df = pd.read_csv('../Data/Edited Data/Netflix Top 10.csv')
netflix_df

Unnamed: 0,Title,Type,Netflix Release Date,Days in Top Ten,Viewership Score
0,Cocomelon,TV Show,1-Jun-20,220,730
1,The Queen's Gambit,TV Show,23-Oct-20,73,446
2,Bridgerton,TV Show,25-Dec-20,58,432
3,Cobra Kai,TV Show,28-Aug-20,53,370
4,Ginny & Georgia,TV Show,24-Feb-21,44,335
...,...,...,...,...,...
144,The Secret Life of Pets 2,Movie,27-Dec-19,1,2
145,Your Highness,Movie,1-May-21,1,2
146,Are We There Yet?,Movie,1-Feb-21,1,1
147,Step Up Revolution,Movie,1-Mar-21,1,1


In [5]:
#extract list of titles to search
queries = list(netflix_df['Title'])

#initialize list to store API call info
info = []

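#loop through each title and get the data I am looking for from the API
#passing the title via params lets requests URL-encode characters such as '&'
for query in queries:
    response = requests.get('https://api.themoviedb.org/3/search/multi',
                            params={'api_key': tmdb_key, 'query': query})
    data = response.json()
    #keep only the first search result for each title
    info.append(data['results'][0])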

The search endpoint returns a list of every title that matches the search term; I am choosing to keep only the first search result for each title. It's possible that this will pull in some incorrect matches, but it should be good enough for our purposes.
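As a possible refinement of the "first result" rule, the sketch below prefers a result whose title matches the query exactly (ignoring case and punctuation) and only falls back to the first result otherwise. The helper names `normalize_title` and `pick_best_result` are hypothetical, and the sample results are toy data shaped like the search response rather than real API output.

```python
import re


def normalize_title(title):
    """Lowercase a title and replace runs of non-alphanumerics with one space."""
    return re.sub(r'[^a-z0-9]+', ' ', title.lower()).strip()


def pick_best_result(query, results):
    """Return the first result whose title matches the query, else the first result.

    `results` is assumed to be a list of dicts with a 'title' key (movies)
    or a 'name' key (TV shows). Returns None when the list is empty.
    """
    if not results:
        return None
    target = normalize_title(query)
    for result in results:
        candidate = result.get('title') or result.get('name') or ''
        if normalize_title(candidate) == target:
            return result
    return results[0]


# Toy search results (illustrative only, not real API output)
sample_results = [
    {'title': 'The Croods: A New Age', 'media_type': 'movie'},
    {'title': 'The Croods', 'media_type': 'movie'},
]
best = pick_best_result('The Croods', sample_results)
print(best['title'])
```

This keeps the original behavior whenever no exact match exists, so it would only change rows where the first hit is a differently named title.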

In [6]:
#convert raw data into a dataframe
additional_data_df = pd.DataFrame.from_dict(info)
additional_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   backdrop_path      147 non-null    object 
 1   first_air_date     67 non-null     object 
 2   genre_ids          149 non-null    object 
 3   id                 149 non-null    int64  
 4   media_type         149 non-null    object 
 5   name               67 non-null     object 
 6   origin_country     67 non-null     object 
 7   original_language  149 non-null    object 
 8   original_name      67 non-null     object 
 9   overview           149 non-null    object 
 10  popularity         149 non-null    float64
 11  poster_path        149 non-null    object 
 12  vote_average       149 non-null    float64
 13  vote_count         149 non-null    int64  
 14  adult              82 non-null     object 
 15  original_title     82 non-null     object 
 16  release_date       82 non-

A quick look at this data shows that there are four different columns for title, and all four have a significant number of null values. Let's look at those columns to see what's going on.

In [7]:
additional_data_df.loc[:, ['name', 'original_name', 'original_title', 'title', 'media_type']]

Unnamed: 0,name,original_name,original_title,title,media_type
0,CoComelon,CoComelon,,,tv
1,The Queen's Gambit,The Queen's Gambit,,,tv
2,Bridgerton,Bridgerton,,,tv
3,Cobra Kai,Cobra Kai,,,tv
4,Ginny & Georgia,Ginny & Georgia,,,tv
...,...,...,...,...,...
144,,,The Secret Life of Pets 2,The Secret Life of Pets 2,movie
145,,,Your Highness,Your Highness,movie
146,,,Are We There Yet?,Are We There Yet?,movie
147,,,Step Up Revolution,Step Up Revolution,movie


It looks like the 'name' columns are for TV shows and the 'title' columns are for movies. Since this analysis is focused only on movies, I can drop the TV show rows and clean up the null values that way.
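One quick way to confirm this pattern, rather than eyeballing the table, is to count the non-null values of each column per media type. This is a minimal sketch on a toy dataframe (a few values copied from the table above); in the notebook the same `groupby` could be run on `additional_data_df`.

```python
import pandas as pd

# Toy rows mimicking the search results above
sample = pd.DataFrame({
    'name': ['CoComelon', 'Bridgerton', None, None],
    'title': [None, None, 'Your Highness', 'Step Up Revolution'],
    'media_type': ['tv', 'tv', 'movie', 'movie'],
})

# count() ignores nulls, so this shows which column is populated for each type
counts = sample.groupby('media_type')[['name', 'title']].count()
print(counts)
```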

In [8]:
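#keep only the movie rows; .copy() avoids a SettingWithCopyWarning when columns are dropped in place later
additional_data_df = additional_data_df.loc[additional_data_df['media_type'] == 'movie'].copy()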

In [9]:
additional_data_df.head()

Unnamed: 0,backdrop_path,first_air_date,genre_ids,id,media_type,name,origin_country,original_language,original_name,overview,popularity,poster_path,vote_average,vote_count,adult,original_title,release_date,title,video
7,/qXMXmhsJeW28DYp5iOar9BGepVS.jpg,,"[16, 12, 35, 10751, 878]",501929,movie,,,en,,"A quirky, dysfunctional family's road trip is ...",92.808,/mI2Di7HmskQQ34kz0iau6J1vr70.jpg,8.0,925,False,The Mitchells vs. the Machines,2021-04-22,The Mitchells vs. the Machines,False
15,/y6FYAhaeFMj9Lsk8OdF2NeWQkbd.jpg,,"[28, 14, 10751, 35]",615677,movie,,,en,,When alien invaders capture Earth's superheroe...,204.662,/1S21HpcKY6uQ9UAw68aICmrJaq6.jpg,6.3,484,False,We Can Be Heroes,2020-12-25,We Can Be Heroes,False
17,/5aBJ2f3D1MhsY1psSX97shHcovO.jpg,,"[35, 10751]",638597,movie,,,en,,A mom and dad who usually say no decide to say...,123.405,/rejrD9ovTHJbfmpLM0mbEliEPV6.jpg,7.0,581,False,Yes Day,2021-03-12,Yes Day,False
22,/z7HLq35df6ZpRxdMAE0qE3Ge4SJ.jpg,,"[28, 12, 35]",615678,movie,,,en,,In a world where supervillains are commonplace...,430.962,/3mKMWP5OokB7QpcOMA1yl8BXFAF.jpg,5.8,641,False,Thunder Force,2021-04-09,Thunder Force,False
23,/az0UktwHYMrWprVyomDsTTHprRz.jpg,,[],824974,movie,,,en,,"The comedic dream team from ""The Upshaws"" disc...",2.5,/iRww3ilt4lpzsKN7ht9BXSz4rRF.jpg,10.0,2,False,The Upshaws - The Afterparty,2021-05-14,The Upshaws - The Afterparty,False


Finally, I will drop all of the columns that are not relevant to my analysis.

In [10]:
additional_data_df.drop(columns = 
                        ['backdrop_path', 'first_air_date', 'media_type', 'name', 'original_name', 
                         'poster_path', 'adult', 'original_title', 'video', 'origin_country'], inplace=True)

In [11]:
additional_data_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 82 entries, 7 to 148
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   genre_ids          82 non-null     object 
 1   id                 82 non-null     int64  
 2   original_language  82 non-null     object 
 3   overview           82 non-null     object 
 4   popularity         82 non-null     float64
 5   vote_average       82 non-null     float64
 6   vote_count         82 non-null     int64  
 7   release_date       82 non-null     object 
 8   title              82 non-null     object 
dtypes: float64(2), int64(2), object(5)
memory usage: 6.4+ KB


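Our data now looks to be in good shape for analysis! However, the genres are still stored only as lists of ids, so they will need to be matched against the genre names retrieved earlier before they are meaningful.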
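A minimal sketch of translating the `genre_ids` lists into genre names is shown below. It assumes a lookup built from the genre table retrieved earlier; the column name `genre_names` and the `'Unknown'` fallback are illustrative choices, and the sample rows are a small subset of the data shown above.

```python
import pandas as pd

# A few rows of the genre table retrieved earlier
genres_df = pd.DataFrame({
    'id': [28, 12, 35, 10751, 14],
    'name': ['Action', 'Adventure', 'Comedy', 'Family', 'Fantasy'],
})

# A few movie rows with their genre id lists (one has no genres at all)
movies = pd.DataFrame({
    'title': ['We Can Be Heroes', 'Yes Day', 'The Upshaws - The Afterparty'],
    'genre_ids': [[28, 14, 10751, 35], [35, 10751], []],
})

# Build an id -> name dictionary for fast lookups
genre_lookup = dict(zip(genres_df['id'], genres_df['name']))

# Translate every id in each list; ids missing from the table become 'Unknown'
movies['genre_names'] = movies['genre_ids'].apply(
    lambda ids: [genre_lookup.get(genre_id, 'Unknown') for genre_id in ids]
)
print(movies[['title', 'genre_names']])
```

Keeping the names as a list preserves one row per movie; `DataFrame.explode` would be an alternative if one row per movie-genre pair is easier to analyze.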

## Merge Dataframes

The Netflix Top Ten data still contains TV shows, so I will drop those rows before merging the data.

In [12]:
netflix_df = netflix_df.loc[netflix_df['Type'] == 'Movie']

In [13]:
combined_df = additional_data_df.merge(netflix_df, how='outer', left_on='title', right_on='Title')

In [14]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 97 entries, 0 to 96
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   genre_ids             82 non-null     object 
 1   id                    82 non-null     float64
 2   original_language     82 non-null     object 
 3   overview              82 non-null     object 
 4   popularity            82 non-null     float64
 5   vote_average          82 non-null     float64
 6   vote_count            82 non-null     float64
 7   release_date          82 non-null     object 
 8   title                 82 non-null     object 
 9   Title                 82 non-null     object 
 10  Type                  82 non-null     object 
 11  Netflix Release Date  82 non-null     object 
 12  Days in Top Ten       82 non-null     float64
 13  Viewership Score      82 non-null     float64
dtypes: float64(6), object(8)
memory usage: 11.4+ KB


In [15]:
combined_df.loc[(combined_df['title'].isna())]

Unnamed: 0,genre_ids,id,original_language,overview,popularity,vote_average,vote_count,release_date,title,Title,Type,Netflix Release Date,Days in Top Ten,Viewership Score
82,,,,,,,,,,The Mitchells vs. The Machines,Movie,30-Apr-21,31.0,204.0
83,,,,,,,,,,I Care a Lot.,Movie,19-Feb-21,15.0,103.0
84,,,,,,,,,,Operation Varsity Blues,Movie,17-Mar-21,11.0,67.0
85,,,,,,,,,,Below Zero,Movie,29-Jan-21,6.0,54.0
86,,,,,,,,,,Home,Movie,25-May-21,6.0,53.0
87,,,,,,,,,,To All the Boys Always and Forever,Movie,12-Feb-21,7.0,47.0
88,,,,,,,,,,Malcolm & Marie,Movie,5-Feb-21,4.0,28.0
89,,,,,,,,,,Biggie: I Got A Story to Tell,Movie,1-Mar-21,4.0,27.0
90,,,,,,,,,,Monster,Movie,7-May-21,5.0,16.0
91,,,,,,,,,,The Stand-In,Movie,10-Apr-21,2.0,9.0


In [16]:
combined_df.loc[(combined_df['Title'].isna())]

Unnamed: 0,genre_ids,id,original_language,overview,popularity,vote_average,vote_count,release_date,title,Title,Type,Netflix Release Date,Days in Top Ten,Viewership Score
0,"[16, 12, 35, 10751, 878]",501929.0,en,"A quirky, dysfunctional family's road trip is ...",92.808,8.0,925.0,2021-04-22,The Mitchells vs. the Machines,,,,,
4,[],824974.0,en,"The comedic dream team from ""The Upshaws"" disc...",2.5,10.0,2.0,2021-05-14,The Upshaws - The Afterparty,,,,,
5,"[35, 80, 53]",601666.0,en,A court-appointed legal guardian defrauds her ...,87.627,6.7,1468.0,2021-02-19,I Care a Lot,,,,,
15,"[99, 80, 18]",799555.0,en,An examination that goes beyond the celebrity-...,15.779,7.2,68.0,2021-03-17,Operation Varsity Blues: The College Admission...,,,,,
24,"[28, 12, 878]",429617.0,en,Peter Parker and his friends go on a summer tr...,214.3,7.5,9974.0,2019-06-28,Spider-Man: Far From Home,,,,,
25,[35],773655.0,en,2020: A year so [insert adjective of choice he...,10.191,6.6,434.0,2020-12-27,Death to 2020,,,,,
27,"[10749, 35, 18]",614409.0,en,Senior year of high school takes center stage ...,76.642,7.9,1267.0,2021-02-12,To All the Boys: Always and Forever,,,,,
41,"[99, 10402]",797309.0,en,"Christopher Wallace, AKA The Notorious B.I.G.,...",8.443,7.1,59.0,2021-03-01,Biggie: I Got a Story to Tell,,,,,
57,"[28, 14, 12]",458576.0,en,A portal transports Cpt. Artemis and an elite ...,560.54,7.0,1718.0,2020-12-03,Monster Hunter,,,,,
67,[35],633908.0,en,"When ordered to serve a year in rehab, actress...",8.581,5.6,70.0,2020-12-03,The Stand In,,,,,


Inspecting the two sets of rows that didn't merge correctly, most of the mismatches come from small differences in data entry and naming conventions (for example, "I Care a Lot." vs. "I Care a Lot"). Others come from an inconsistency in my API call: because I kept only the first search result in every case, I sometimes ended up with the wrong title (for example, The Croods vs. The Croods: A New Age). Because most of these movies are lower on the list, I am comfortable losing most of this data, even though it represents a decent percentage of the rows. I am going to manually update The Mitchells vs. The Machines and I Care a Lot, since those are the two unmatched movies with viewership scores > 100, and then re-join using an inner join.
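Many of the naming differences above are only capitalization or punctuation, so one alternative to manual fixes is to merge on a normalized key. The sketch below is an illustration under that assumption, not what this notebook does: `normalize_title` and `merge_key` are hypothetical names, the frames are a small sample of the unmatched rows, and `indicator=True` adds a `_merge` column that shows which side each row came from.

```python
import re

import pandas as pd


def normalize_title(title):
    """Lowercase a title and replace runs of non-alphanumerics with one space."""
    return re.sub(r'[^a-z0-9]+', ' ', title.lower()).strip()


# Small sample of the titles that failed to match
tmdb = pd.DataFrame({
    'title': ['The Mitchells vs. the Machines', 'I Care a Lot',
              'The Stand In', 'Monster Hunter'],
})
netflix = pd.DataFrame({
    'Title': ['The Mitchells vs. The Machines', 'I Care a Lot.',
              'The Stand-In', 'Monster'],
    'Viewership Score': [204, 103, 9, 16],
})

# Merge on the normalized key instead of the raw titles
tmdb['merge_key'] = tmdb['title'].map(normalize_title)
netflix['merge_key'] = netflix['Title'].map(normalize_title)

merged = tmdb.merge(netflix, how='outer', on='merge_key', indicator=True)
print(merged[['title', 'Title', '_merge']])
```

Rows that are genuinely different titles (here, Monster Hunter vs. Monster) still show up as `left_only` or `right_only`, which is the desired outcome since the API returned the wrong film.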

In [17]:
#overwrite the TMDB titles with the Netflix spellings (both dataframes share row labels) so the rows match on merge
additional_data_df.at[7, 'title'] = netflix_df.at[7, 'Title']
additional_data_df.at[25, 'title'] = netflix_df.at[25, 'Title']

In [18]:
#re-join with an inner join
combined_df = additional_data_df.merge(netflix_df, how='inner', left_on='title', right_on='Title')

In [19]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 69 entries, 0 to 68
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   genre_ids             69 non-null     object 
 1   id                    69 non-null     int64  
 2   original_language     69 non-null     object 
 3   overview              69 non-null     object 
 4   popularity            69 non-null     float64
 5   vote_average          69 non-null     float64
 6   vote_count            69 non-null     int64  
 7   release_date          69 non-null     object 
 8   title                 69 non-null     object 
 9   Title                 69 non-null     object 
 10  Type                  69 non-null     object 
 11  Netflix Release Date  69 non-null     object 
 12  Days in Top Ten       69 non-null     int64  
 13  Viewership Score      69 non-null     int64  
dtypes: float64(2), int64(4), object(8)
memory usage: 8.1+ KB


In [41]:
combined_df.to_csv('Netflix Top Ten with Info.csv', index=False)

## Next Steps

- Return to my data cleaning notebook to merge these two new CSVs with some of my existing data ([Data Cleaning Notebook](https://github.com/zelda4669/movie-data-analysis/blob/main/Data%20Cleaning%20Notebooks/Noelle%20-%20Data%20Exploration%20and%20Cleaning.ipynb))