# DSC 540- Milestone-4 API
# Kannur, Gyan
# Instructor Catherine Williams

In [40]:
import pandas as pd
import requests
import traceback
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

##### Load the previously webscraped clean data

In [70]:
web_scraped_data = pd.read_csv("./project_datasets/clean-webscraped.csv")

##### Step 1: Reading API data using "requests" library available in Python

Before invoking the API, you need to signup for your [TMDB](https://www.themoviedb.org/) account and get your api key to work with tmdb API.

In [42]:
api_key = '5a51d21e02c0baae3a96d6a5e7687635'
genre_dict = dict()

In [44]:
genre_url=f'https://api.themoviedb.org/3/genre/movie/list?api_key={api_key}'

try:
    response = requests.get(genre_url)
    if response.status_code == 200:
        genre_dict = { genre_json['id']:genre_json['name'] for genre_json in response.json()['genres'] }
except Exception as e:
    print('Error downloading genres: ',e)
    traceback.print_exc()

#Lets add a unique key for movies without any genre
if 0 not in genre_dict.keys():
    genre_dict[0] = 'unknown'


In [71]:
#Genre Names
genre_dict

{'28': 'Action',
 '12': 'Adventure',
 '16': 'Animation',
 '35': 'Comedy',
 '80': 'Crime',
 '99': 'Documentary',
 '18': 'Drama',
 '10751': 'Family',
 '14': 'Fantasy',
 '36': 'History',
 '27': 'Horror',
 '10402': 'Music',
 '9648': 'Mystery',
 '10749': 'Romance',
 '878': 'Science Fiction',
 '10770': 'TV Movie',
 '53': 'Thriller',
 '10752': 'War',
 '37': 'Western',
 '0': 'unknown'}

In [72]:
#Display the standard list of genres

genre_dict

{'28': 'Action',
 '12': 'Adventure',
 '16': 'Animation',
 '35': 'Comedy',
 '80': 'Crime',
 '99': 'Documentary',
 '18': 'Drama',
 '10751': 'Family',
 '14': 'Fantasy',
 '36': 'History',
 '27': 'Horror',
 '10402': 'Music',
 '9648': 'Mystery',
 '10749': 'Romance',
 '878': 'Science Fiction',
 '10770': 'TV Movie',
 '53': 'Thriller',
 '10752': 'War',
 '37': 'Western',
 '0': 'unknown'}

In [46]:
#Get the list of unique movie titles
scraped_movie_list=web_scraped_data['movie_title'].unique()
len(scraped_movie_list)

353

## Fetch the movie details using the api keys

In [47]:
%%time
# We will first create an empty dataframe to store all the movie detail
api_df = pd.DataFrame()
# Our for loop will iterate through each page, get json data convert it into dataframe and append it to original dataframe
for title in scraped_movie_list:
    url = f"https://api.themoviedb.org/3/search/movie?api_key={api_key}&query={title}"
    response = requests.get(url)
    if 'results' in response.json():
        temporary_df = pd.DataFrame(response.json()['results'])
        api_df = pd.concat([api_df,temporary_df],ignore_index=True)

CPU times: total: 984 ms
Wall time: 34.1 s


In [73]:
api_df.head()

Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
0,False,/npCPnwDyWfQltGfIZKN6WqeUXGI.jpg,"[Fantasy, Adventure, Action]",57158,en,The Hobbit: The Desolation of Smaug,"The Dwarves, Bilbo and Gandalf have successful...",87.34,/xQYiXsheRCDBA39DOrmaw1aSpbk.jpg,2013-12-11,The Hobbit: The Desolation of Smaug,False,7.574,13227.0
1,False,/u2bZhH3nTf0So0UIC1QxAqBvC07.jpg,"[Animation, Family, Adventure, Fantasy]",109445,en,Frozen,Young princess Anna of Arendelle dreams about ...,191.622,/mmWheq3cFI4tYrZDiATOkCNTqgK.jpg,2013-11-20,Frozen,False,7.246,16611.0
2,False,/vD1yKObsRS2cvpmtuaCaMhr4zxe.jpg,[Thriller],44363,en,Frozen,When three skiers find themselves stranded on ...,40.401,/2J3URUnDrIpNvh0uVqINQvr4HhW.jpg,2010-02-05,Frozen,False,5.996,1831.0
3,False,/sGjhRHNiQSkgVec18D3oX45hPmz.jpg,[Thriller],26041,en,Frozen,It's two years since the mysterious disappeara...,7.794,/a6RlPQUerliQLkAieku5B8Loamk.jpg,2005-03-12,Frozen,False,5.7,18.0
4,False,,"[Drama, Horror]",170986,hi,Frozen,This is a touching and somber journey of Lasya...,4.0,/2GL9yZtrgbYKCeKBc3TF9gGfZpX.jpg,2007-07-21,Frozen,False,7.2,4.0


In [74]:
api_df.shape

(3001, 14)

##### Check if the genre_ids have any empty data like [] as genre_ids column is a list

In [75]:
api_df.genre_ids.value_counts()

genre_ids
[unknown]                                                  329
[Documentary]                                              262
[Drama]                                                    190
[Comedy]                                                   128
[Horror]                                                   125
                                                          ... 
[Documentary, History, Crime]                                1
[Crime, Thriller, Action]                                    1
[TV Movie, Action, Adventure, Fantasy, Science Fiction]      1
[Action, Animation, Fantasy, Science Fiction]                1
[Animation, Fantasy, Comedy]                                 1
Name: count, Length: 761, dtype: int64

##### pandas replace empty square brackets with [0], here genre[0] is unknown which we created earlier

In [76]:
# pandas replace empty square brackets with [0], here genre[0] is unknown which we created earlier
api_df['genre_ids'] = api_df['genre_ids'].apply(lambda  x : [0] if not x else x)

In [77]:
api_df.genre_ids.value_counts()

genre_ids
[unknown]                                                  329
[Documentary]                                              262
[Drama]                                                    190
[Comedy]                                                   128
[Horror]                                                   125
                                                          ... 
[Documentary, History, Crime]                                1
[Crime, Thriller, Action]                                    1
[TV Movie, Action, Adventure, Fantasy, Science Fiction]      1
[Action, Animation, Fantasy, Science Fiction]                1
[Animation, Fantasy, Comedy]                                 1
Name: count, Length: 761, dtype: int64

##### Removing brackets [] from list type inside pandas cell

In [78]:
#removing brackets [] from list type inside pandas cell
api_df['genre_ids'] = api_df['genre_ids'].str.join(',')

In [79]:
#List out the genre and its count, observe there are no square braces [] 
api_df['genre_ids'].value_counts()

genre_ids
unknown                                              329
Documentary                                          262
Drama                                                190
Comedy                                               128
Horror                                               125
                                                    ... 
Documentary,History,Crime                              1
Crime,Thriller,Action                                  1
TV Movie,Action,Adventure,Fantasy,Science Fiction      1
Action,Animation,Fantasy,Science Fiction               1
Animation,Fantasy,Comedy                               1
Name: count, Length: 761, dtype: int64

## Extract year as a separate column from release_date

In [80]:
#extract year as a separate column from release_date
api_df['release_date']=pd.to_datetime(api_df['release_date'],format="%Y-%m-%d")
api_df["release_year"] = pd.to_datetime(api_df['release_date'], format="%Y-%m-%d").dt.year
api_df.head()

Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count,release_year
0,False,/npCPnwDyWfQltGfIZKN6WqeUXGI.jpg,"Fantasy,Adventure,Action",57158,en,The Hobbit: The Desolation of Smaug,"The Dwarves, Bilbo and Gandalf have successful...",87.34,/xQYiXsheRCDBA39DOrmaw1aSpbk.jpg,2013-12-11,The Hobbit: The Desolation of Smaug,False,7.574,13227.0,2013.0
1,False,/u2bZhH3nTf0So0UIC1QxAqBvC07.jpg,"Animation,Family,Adventure,Fantasy",109445,en,Frozen,Young princess Anna of Arendelle dreams about ...,191.622,/mmWheq3cFI4tYrZDiATOkCNTqgK.jpg,2013-11-20,Frozen,False,7.246,16611.0,2013.0
2,False,/vD1yKObsRS2cvpmtuaCaMhr4zxe.jpg,Thriller,44363,en,Frozen,When three skiers find themselves stranded on ...,40.401,/2J3URUnDrIpNvh0uVqINQvr4HhW.jpg,2010-02-05,Frozen,False,5.996,1831.0,2010.0
3,False,/sGjhRHNiQSkgVec18D3oX45hPmz.jpg,Thriller,26041,en,Frozen,It's two years since the mysterious disappeara...,7.794,/a6RlPQUerliQLkAieku5B8Loamk.jpg,2005-03-12,Frozen,False,5.7,18.0,2005.0
4,False,,"Drama,Horror",170986,hi,Frozen,This is a touching and somber journey of Lasya...,4.0,/2GL9yZtrgbYKCeKBc3TF9gGfZpX.jpg,2007-07-21,Frozen,False,7.2,4.0,2007.0


In [81]:
#Check if there are any nan in release year
api_df.loc[api_df.release_year.isna()]

Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count,release_year
5,False,,unknown,950554,en,Frozen,A film by Adonia Bouchehri.,0.402,,NaT,Frozen,False,0.0,0.0,
32,False,,unknown,566990,en,Gravity,Three boys power play with a gun. Gravity is a...,0.001,,NaT,Gravity,False,0.0,0.0,
75,False,/xcIqtToUFrie1o4g4ZtYVKj5R1f.jpg,"Science Fiction,Action",374771,en,Riddick: Furya,"Riddick finally returns to his home world, a p...",16.738,,NaT,Riddick: Furya,False,0.0,0.0,
88,False,,"Comedy,Romance",887567,en,The Butler,A story of a butler who is in love with their ...,0.001,,NaT,The Butler,False,0.0,0.0,
113,False,,unknown,1159811,ko,극락전,"Srey Na, a female immigrant from Cambodia, who...",0.001,,NaT,The Road to Elysium,False,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2902,False,/46Br5afTklXva5gRI6wXcAApaGP.jpg,"Animation,Action,Adventure,Science Fiction",911916,en,Spider-Man: Beyond the Spider-Verse,The third installment in the Spider-Verse fran...,29.473,/rZ4arzyaDyI8l9Y7VIPPsDGARwh.jpg,NaT,Spider-Man: Beyond the Spider-Verse,False,0.0,0.0,
2924,False,,"Action,Adventure,Science Fiction",939345,en,Transformers: Rise of the Beasts 2,The first of two planned sequels to the 2023 f...,19.756,/f4PFiwOHVcNUXRcOmxX2hUYdAx7.jpg,NaT,Transformers: Rise of the Beasts 2,False,0.0,0.0,
2925,False,,"Action,Adventure,Science Fiction,Fantasy",939347,en,Transformers: Rise of the Beasts 3,The second of two planned sequels to the 2023 ...,14.596,/zjDGpjRj9M9pLqVVZPpaFhG6BLx.jpg,NaT,Transformers: Rise of the Beasts 3,False,0.0,0.0,
2974,False,,unknown,1195165,en,Dungeons & Derrick,"Derrick and Tori, best friends and avid player...",0.001,,NaT,Dungeons & Derrick,False,0.0,0.0,


In [82]:
#Fill in those nan with some arbitrary value so you can avoid any type conversion errors
api_df['release_year'] = api_df.release_year.fillna(1900)

In [83]:
#conver floats to ints now
api_df["release_year"] = api_df.release_year.astype('int64')

In [84]:
# checks if any of columns in the data have null values - should print False
api_df.isnull().sum().any()

True

In [85]:
api_df.dropna(inplace=True)

In [86]:
api_df.shape

(1893, 15)

In [87]:
api_df.head()

Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count,release_year
0,False,/npCPnwDyWfQltGfIZKN6WqeUXGI.jpg,"Fantasy,Adventure,Action",57158,en,The Hobbit: The Desolation of Smaug,"The Dwarves, Bilbo and Gandalf have successful...",87.34,/xQYiXsheRCDBA39DOrmaw1aSpbk.jpg,2013-12-11,The Hobbit: The Desolation of Smaug,False,7.574,13227.0,2013
1,False,/u2bZhH3nTf0So0UIC1QxAqBvC07.jpg,"Animation,Family,Adventure,Fantasy",109445,en,Frozen,Young princess Anna of Arendelle dreams about ...,191.622,/mmWheq3cFI4tYrZDiATOkCNTqgK.jpg,2013-11-20,Frozen,False,7.246,16611.0,2013
2,False,/vD1yKObsRS2cvpmtuaCaMhr4zxe.jpg,Thriller,44363,en,Frozen,When three skiers find themselves stranded on ...,40.401,/2J3URUnDrIpNvh0uVqINQvr4HhW.jpg,2010-02-05,Frozen,False,5.996,1831.0,2010
3,False,/sGjhRHNiQSkgVec18D3oX45hPmz.jpg,Thriller,26041,en,Frozen,It's two years since the mysterious disappeara...,7.794,/a6RlPQUerliQLkAieku5B8Loamk.jpg,2005-03-12,Frozen,False,5.7,18.0,2005
6,False,/9PxXSAnbVfvFacsGTJu1aXEWVg7.jpg,"Animation,Adventure,Comedy,Family",573171,es,Huevitos Congelados,"In the final Huevos adventure, Toto and his fa...",37.37,/8xCO3IarklLD4tK1rPn0e4gSMoV.jpg,2022-12-14,Little Eggs: A Frozen Rescue,False,7.67,348.0,2022


In [88]:
api_df.columns

Index(['adult', 'backdrop_path', 'genre_ids', 'id', 'original_language',
       'original_title', 'overview', 'popularity', 'poster_path',
       'release_date', 'title', 'video', 'vote_average', 'vote_count',
       'release_year'],
      dtype='object')

#### Convert key column title to lowercase

In [89]:
api_df['title'] = api_df.title.str.lower()

#### Drop unused columns

In [90]:
api_df.drop(['id','adult','backdrop_path','poster_path','video'], axis=1, inplace=True)

In [91]:
api_df.rename(columns={'genre_ids': 'genres'}, inplace=True)

In [92]:
new_col_order=['title', 'release_year','genres', 'popularity', 'vote_average','vote_count', 'original_title', 'overview'
                ,'release_date','original_language'
               ]

for i,col in enumerate(new_col_order):
    tmp = api_df[col]
    api_df.drop(labels=[col],axis=1,inplace=True)
    api_df.insert(i,col,tmp)

sum(api_df.duplicated())
api_df.drop_duplicates(inplace=True)    

In [93]:
#Final data of cleaned and transformed data
api_df.head()

Unnamed: 0,title,release_year,genres,popularity,vote_average,vote_count,original_title,overview,release_date,original_language
0,the hobbit: the desolation of smaug,2013,"Fantasy,Adventure,Action",87.34,7.574,13227.0,The Hobbit: The Desolation of Smaug,"The Dwarves, Bilbo and Gandalf have successful...",2013-12-11,en
1,frozen,2013,"Animation,Family,Adventure,Fantasy",191.622,7.246,16611.0,Frozen,Young princess Anna of Arendelle dreams about ...,2013-11-20,en
2,frozen,2010,Thriller,40.401,5.996,1831.0,Frozen,When three skiers find themselves stranded on ...,2010-02-05,en
3,frozen,2005,Thriller,7.794,5.7,18.0,Frozen,It's two years since the mysterious disappeara...,2005-03-12,en
6,little eggs: a frozen rescue,2022,"Animation,Adventure,Comedy,Family",37.37,7.67,348.0,Huevitos Congelados,"In the final Huevos adventure, Toto and his fa...",2022-12-14,es


# Store the transformed api data to be used in future

In [33]:
api_df.to_csv(r'./project_datasets/clean-api_data.csv',index=False)

# API Ethical Implications and Assumptions:

What changes were made to the data?

The api may provide some of the dollar amounts separate by commas. I had to create columns which tells within what range the production cost was, this user friendly representation is easier to read. Example:

production cost= 365,000,000
prod_cost_range_million = 361-366

The genres are listed as numbers, Example genre_id 28 which maps to Action and 12 maps to Adventure.

I had to call another api to fetch the standard genre list and covert the genre_ids to its equivalent names.

I also created a separate year column from the release date. The year the movie released is significant for my future analysis and visualizations.


Are there any legal or regulatory guidelines for your data or project topic?

There is a huge number of movies release every year and this would be a problem when making several calls and the api can be rate limited, hence I had to cut down the analysis to last 10 years.

What risks could be created based on the transformations done?

Care should be taken when saving the files into csv and later persisting them in database. Some of the database like SQLite does not allow a list of values to be stored.
 
Did you make any assumptions in cleaning/transforming the data?

I had to drop some  columns such as adult - which was to indicate the age recommendations for that movie. I also dropped columns like backdrop_path and poster_path which as not part of the analysis. The genre was tricky as some movies fall under more than one. As it had square braces, that would throw an exception when I have to save it in database. 

How was your data sourced / verified for credibility?

I initially signed up for an account in imdb for an api key.The api had limitations on the number of requests I can make, therefore I signed up with tmdb api key which was far more flexible. TMDB is a very well know movie analytics website catering the needs of several projects wanting the movie data.

Was your data acquired in an ethical way?

The api keys ensure the api is accessed only via verified sources. The service hosted can only be accessed via the api keys
