# 2. IMDB_Data_Wrangling<a id='2_Data_wrangling'></a>

## 2.1 Table of Contents<a id='2.1_Contents'></a>
* 2. IMDB_Data_Wrangling
  * 2.1 Table of Contents
  * 2.2 Introduction
  * 2.3 Imports
  * 2.4 Retrieve IMDB Movie Data
  * 2.5 Target Feature (Movie Genres)
  * 2.6 Save data
  * 2.7 Summary


## 2.2 Introduction

In the data wrangling/cleaning phase, the data is first retrieved with a GET request to the IMDB API and then converted into a pandas DataFrame for further manipulation and export to .csv.

## 2.3 Imports<a id='2.3_Imports'></a>

Import the packages needed to send API requests and to organize and clean the JSON data in a pandas DataFrame.

In [1]:
#Import requests, json, pandas, numpy, matplotlib.pyplot, seaborn, os, and ProfileReport from ydata_profiling
import requests
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from ydata_profiling import ProfileReport

## 2.4 Retrieve IMDB Movie Data

The dataset covers movies released from 1990 through 2023. The API request is filtered to G- and PG-rated movies, reflecting the company's focus on family-friendly movies.

In [2]:
# Initialize start and end dates (one pair per year, 1990 through 2023) for the IMDB API GET request
years = range(1990, 2024)
start_year = [f"{year}-01-01" for year in years]
end_year = [f"{year}-12-31" for year in years]

In [4]:
#Created a function to send an API request for a given start and end date and convert the response into a pandas DataFrame
#Code for API Get Request found on IMDB Website: https://imdb-api.com/api/#Search-header
#Code to convert json data into dict found on https://favtutor.com/blogs/string-to-dict-python
#Get keys from python dicts code found in https://tutorialdeep.com/knowhow/get-dictionary-value-key-python/
def imdb_data(start_year, end_year):
    #Request feature films released in the date range with a US certificate of G or PG
    url = f"https://imdb-api.com/API/AdvancedSearch/k_8jnkh6yr?title_type=feature&release_date={start_year},{end_year}&certificates=us:G,us:PG&count=250"
    payload = {}
    headers = {}
    res = requests.request("GET", url, headers=headers, data=payload)
    #Parse the JSON response body into a dict
    movie_list_decode = json.loads(res.text)
    #The movie records are stored under the 'results' key
    movie_list = movie_list_decode['results']
    movie_df = pd.DataFrame(movie_list)
    return movie_df
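
The function above assumes that every response contains a `results` list. As a minimal sketch of a more defensive parsing step, the hypothetical helper `results_to_frame` below (not part of the original notebook) takes the raw response text and returns an empty DataFrame when `results` is missing or null. The sample payloads are made up for illustration and are not real API output.

```python
import json

import pandas as pd


def results_to_frame(response_text):
    """Parse a JSON response body and return its 'results' list as a DataFrame.

    Hypothetical helper: returns an empty DataFrame when 'results' is absent or null.
    """
    parsed = json.loads(response_text)
    records = parsed.get('results') or []
    return pd.DataFrame(records)


# Made-up payloads for illustration only (not real API output)
good_payload = '{"results": [{"id": "tt0000001", "title": "Example Movie"}]}'
empty_payload = '{"results": null}'

good_df = results_to_frame(good_payload)
empty_df = results_to_frame(empty_payload)
print(good_df.shape, empty_df.shape)
```

Returning an empty DataFrame keeps a later `pd.concat` over the yearly results working even if one year's request yields no records.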

In [6]:
#Created list of movie dataframes of movies from 1990 to 2023
movie_df_concat = []
for start, end in zip(start_year, end_year):
    movie_df_concat.append(imdb_data(start, end))

In [8]:
#Joined the above dataframes to create a combined dataframe for movie data from 1990 to 2023
#Code to join dataframes ignore_index found on https://www.statology.org/stack-pandas-dataframes/ and https://pandas.pydata.org/pandas-docs/version/1.3/reference/api/pandas.DataFrame.append.html
movie_df_total = pd.concat(movie_df_concat, ignore_index=True)
movie_df_total

Unnamed: 0,id,image,title,description,runtimeStr,genres,genreList,contentRating,imDbRating,imDbRatingVotes,metacriticRating,plot,stars,starList
0,tt0099785,https://m.media-amazon.com/images/M/MV5BMzFkM2...,Home Alone,1990,103 mins,"Comedy, Family","[{'key': 'Comedy', 'value': 'Comedy'}, {'key':...",PG,7.7,644183,63,"An eight-year-old troublemaker, mistakenly lef...",,[]
1,tt0100944,https://m.media-amazon.com/images/M/MV5BMjI1MD...,The Witches,1990,91 mins,"Adventure, Comedy, Family","[{'key': 'Adventure', 'value': 'Adventure'}, {...",PG,6.8,54068,78,A young boy stumbles onto a witch convention a...,,[]
2,tt0099810,https://m.media-amazon.com/images/M/MV5BZDdkOD...,The Hunt for Red October,1990,135 mins,"Action, Adventure, Thriller","[{'key': 'Action', 'value': 'Action'}, {'key':...",PG,7.5,211894,58,"In November 1984, the Soviet Union's best subm...",,[]
3,tt0099088,https://m.media-amazon.com/images/M/MV5BYjhlMG...,Back to the Future Part III,1990,118 mins,"Adventure, Comedy, Sci-Fi","[{'key': 'Adventure', 'value': 'Adventure'}, {...",PG,7.4,474698,55,"Stranded in 1955, Marty McFly learns about the...",,[]
4,tt0100419,https://m.media-amazon.com/images/M/MV5BYzk3NG...,Problem Child,1990,81 mins,"Comedy, Family","[{'key': 'Comedy', 'value': 'Comedy'}, {'key':...",PG,5.4,32670,27,A young boy just short of a monster is adopted...,,[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4053,tt0122511,https://m.media-amazon.com/images/M/MV5BMTQwMW...,The Gnomes Great Adventure,2023,74 mins,"Animation, Adventure, Comedy","[{'key': 'Animation', 'value': 'Animation'}, {...",G,6.1,63,,When a bunch of dim-witted trolls steal the Ki...,,[]
4054,tt30133792,,Justice League x RWBY: Super Heroes and Huntsmen,2023,,"Animation, Action, Adventure","[{'key': 'Animation', 'value': 'Animation'}, {...",PG,0,0,,Justice League x RWBY: Super Heroes and Huntsm...,,[]
4055,tt15426874,https://m.media-amazon.com/images/M/MV5BNmIwZj...,Seaper Powers: Mystery of the Blue Pearls,2023,66 mins,Animation,"[{'key': 'Animation', 'value': 'Animation'}]",PG,0,0,,Emma works for NOAA as a diver and researcher....,,[]
4056,tt28018383,,Graham Bullard Movie: Peanuts Meets the Loud H...,2023,89 mins,"Adventure, Comedy, Family","[{'key': 'Adventure', 'value': 'Adventure'}, {...",G,0,0,,The Peanuts Gang and The Loud Family all have ...,,[]


Audited the IMDB Dataset with .head() and .info() methods.

In [9]:
movie_df_total.head()

Unnamed: 0,id,image,title,description,runtimeStr,genres,genreList,contentRating,imDbRating,imDbRatingVotes,metacriticRating,plot,stars,starList
0,tt0099785,https://m.media-amazon.com/images/M/MV5BMzFkM2...,Home Alone,1990,103 mins,"Comedy, Family","[{'key': 'Comedy', 'value': 'Comedy'}, {'key':...",PG,7.7,644183,63,"An eight-year-old troublemaker, mistakenly lef...",,[]
1,tt0100944,https://m.media-amazon.com/images/M/MV5BMjI1MD...,The Witches,1990,91 mins,"Adventure, Comedy, Family","[{'key': 'Adventure', 'value': 'Adventure'}, {...",PG,6.8,54068,78,A young boy stumbles onto a witch convention a...,,[]
2,tt0099810,https://m.media-amazon.com/images/M/MV5BZDdkOD...,The Hunt for Red October,1990,135 mins,"Action, Adventure, Thriller","[{'key': 'Action', 'value': 'Action'}, {'key':...",PG,7.5,211894,58,"In November 1984, the Soviet Union's best subm...",,[]
3,tt0099088,https://m.media-amazon.com/images/M/MV5BYjhlMG...,Back to the Future Part III,1990,118 mins,"Adventure, Comedy, Sci-Fi","[{'key': 'Adventure', 'value': 'Adventure'}, {...",PG,7.4,474698,55,"Stranded in 1955, Marty McFly learns about the...",,[]
4,tt0100419,https://m.media-amazon.com/images/M/MV5BYzk3NG...,Problem Child,1990,81 mins,"Comedy, Family","[{'key': 'Comedy', 'value': 'Comedy'}, {'key':...",PG,5.4,32670,27,A young boy just short of a monster is adopted...,,[]


In [10]:
movie_df_total.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4058 entries, 0 to 4057
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   id                4058 non-null   object
 1   image             3972 non-null   object
 2   title             4058 non-null   object
 3   description       4057 non-null   object
 4   runtimeStr        4004 non-null   object
 5   genres            4058 non-null   object
 6   genreList         4058 non-null   object
 7   contentRating     4057 non-null   object
 8   imDbRating        4058 non-null   object
 9   imDbRatingVotes   4058 non-null   object
 10  metacriticRating  1471 non-null   object
 11  plot              4058 non-null   object
 12  stars             4058 non-null   object
 13  starList          4058 non-null   object
dtypes: object(14)
memory usage: 444.0+ KB


The results above show that the "image" column has some null values (3972 non-null out of 4058 rows). Every row appears to have a genre value, but this will be investigated further. Rows with missing image data will be dropped, since they cannot be used for future image classification.

In [12]:
#Dropped rows with null values in the image column and reset the index
#Code to reset index found on https://www.geeksforgeeks.org/pandas-how-to-reset-index-in-a-given-dataframe/
movie_df_dropped = movie_df_total.dropna(subset=['image'])
#reset_index() keeps the old index as a new 'index' column, so the result has 15 columns
movie_df_dropped = movie_df_dropped.reset_index()
movie_df_dropped.shape

(3972, 15)
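
The shape above has 15 columns rather than 14 because `reset_index()` keeps the old index as a new `index` column. The sketch below illustrates this on a small made-up DataFrame (not the IMDB data) and shows the `drop=True` variant, which discards the old index.

```python
import pandas as pd

# Toy data for illustration only; not taken from the IMDB dataset
toy = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'image': ['http://example.com/a.jpg', None, 'http://example.com/c.jpg'],
})

# Keep only the rows that have an image value
kept = toy.dropna(subset=['image'])

with_old_index = kept.reset_index()              # old index preserved as an 'index' column
without_old_index = kept.reset_index(drop=True)  # old index discarded

print(with_old_index.shape)
print(without_old_index.shape)
print(with_old_index['index'].tolist())
```

Keeping the `index` column preserves each row's position in the original combined DataFrame, which can help trace rows back; `drop=True` is the alternative when that is not needed.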

In [13]:
movie_df_dropped.head()

Unnamed: 0,index,id,image,title,description,runtimeStr,genres,genreList,contentRating,imDbRating,imDbRatingVotes,metacriticRating,plot,stars,starList
0,0,tt0099785,https://m.media-amazon.com/images/M/MV5BMzFkM2...,Home Alone,1990,103 mins,"Comedy, Family","[{'key': 'Comedy', 'value': 'Comedy'}, {'key':...",PG,7.7,644183,63,"An eight-year-old troublemaker, mistakenly lef...",,[]
1,1,tt0100944,https://m.media-amazon.com/images/M/MV5BMjI1MD...,The Witches,1990,91 mins,"Adventure, Comedy, Family","[{'key': 'Adventure', 'value': 'Adventure'}, {...",PG,6.8,54068,78,A young boy stumbles onto a witch convention a...,,[]
2,2,tt0099810,https://m.media-amazon.com/images/M/MV5BZDdkOD...,The Hunt for Red October,1990,135 mins,"Action, Adventure, Thriller","[{'key': 'Action', 'value': 'Action'}, {'key':...",PG,7.5,211894,58,"In November 1984, the Soviet Union's best subm...",,[]
3,3,tt0099088,https://m.media-amazon.com/images/M/MV5BYjhlMG...,Back to the Future Part III,1990,118 mins,"Adventure, Comedy, Sci-Fi","[{'key': 'Adventure', 'value': 'Adventure'}, {...",PG,7.4,474698,55,"Stranded in 1955, Marty McFly learns about the...",,[]
4,4,tt0100419,https://m.media-amazon.com/images/M/MV5BYzk3NG...,Problem Child,1990,81 mins,"Comedy, Family","[{'key': 'Comedy', 'value': 'Comedy'}, {'key':...",PG,5.4,32670,27,A young boy just short of a monster is adopted...,,[]


In [24]:
len(movie_df_dropped[movie_df_dropped['genres'].astype(str).str.contains('None')])

0

In [25]:
len(movie_df_dropped[movie_df_dropped['genres'].astype(str).str.contains('none')])

0

In [26]:
len(movie_df_dropped[movie_df_dropped['genres'].astype(str).str.contains('NONE')])

0

In [27]:
len(movie_df_dropped[movie_df_dropped['genres'].astype(str).str.contains('nOne')])

0

In [28]:
len(movie_df_dropped[movie_df_dropped['genres'].astype(str).str.contains('noNe')])

0

In [29]:
len(movie_df_dropped[movie_df_dropped['genres'].astype(str).str.contains('nonE')])

0

In [30]:
len(movie_df_dropped[movie_df_dropped['genres'].astype(str).str.contains('NOne')])

0

In [31]:
len(movie_df_dropped[movie_df_dropped['genres'].astype(str).str.contains('NONe')])

0

The checks above, covering several capitalizations of the string "none" in the genres column, all returned 0, so genre data appears to be present for every row. This will be checked again in later steps when implementing the deep learning neural network model.
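
The separate checks for each capitalization can be collapsed into one case-insensitive search. The sketch below uses a made-up Series in place of the genres column and relies on the `case=False` argument of `Series.str.contains`.

```python
import pandas as pd

# Toy genres column for illustration only
toy_genres = pd.Series(['Comedy, Family', 'None', 'nOnE', 'NONE', 'Drama'])

# case=False matches 'none' regardless of capitalization
placeholder_mask = toy_genres.astype(str).str.contains('none', case=False)
placeholder_count = int(placeholder_mask.sum())
print(placeholder_count)
```

A single case-insensitive mask covers every capitalization, including ones that a hand-written list of variants might miss.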

With the rows that have null image values removed from the "cleaned" dataset, the images are then saved into the corresponding file folder for future image classification.

In [50]:
#For loop to download each image from its url and save it named with the movie id. Images are then moved into the appropriate file folder
#Actual code block not able to be retrieved for addressing errors in requests, but the error was diagnosed as caused by not having appropriate headers to accept application/json
#Similar explanation for headers seen on ChatGPT search https://chat.openai.com/c/85f15339-0e94-4816-a444-bf82bc5fad63 with start phrase: Python, remote forcibly closed connection, how to fix get request
#Code for handling timeouts in get requests found on https://stackoverflow.com/questions/21965484/timeout-for-python-requests-get-entire-response
for i in range(len(movie_df_dropped)):  #iterate over every row, including the last one
    url = movie_df_dropped['image'][i]
    headers = {'accept':'application/json'}
    data = requests.get(url, headers=headers, timeout=60).content
    #Write the image bytes to <movie id>.jpg; the with block closes the file automatically
    with open(f"{movie_df_dropped['id'][i]}.jpg",'wb') as f:
        f.write(data)
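
A single failed download stops the loop above. As a hedged sketch of one way to make it more resilient, the hypothetical function `save_images` below (not part of the original notebook) takes the download step as a `fetch` callable, records the ids that fail, and continues with the next row. The stand-in `fake_fetch` and the toy rows exist only so the example runs without network access.

```python
import os
import tempfile

import pandas as pd


def save_images(df, out_dir, fetch):
    """Save one image per row as <id>.jpg in out_dir; return the ids that failed.

    fetch is any callable that takes a url and returns the image bytes.
    """
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for movie_id, url in zip(df['id'], df['image']):
        try:
            data = fetch(url)
        except Exception:
            # Record the failure and continue with the next image
            failed.append(movie_id)
            continue
        with open(os.path.join(out_dir, f"{movie_id}.jpg"), 'wb') as f:
            f.write(data)
    return failed


# Stand-in fetch function and toy rows, for illustration only
def fake_fetch(url):
    if 'broken' in url:
        raise ConnectionError(url)
    return b'fake image bytes'


toy = pd.DataFrame({
    'id': ['tt0000001', 'tt0000002'],
    'image': ['http://example.com/ok.jpg', 'http://example.com/broken.jpg'],
})
out_dir = tempfile.mkdtemp()
failed_ids = save_images(toy, out_dir, fake_fetch)
saved_files = sorted(os.listdir(out_dir))
print(saved_files, failed_ids)
```

With the notebook's data, `fetch` could be something like `lambda url: requests.get(url, headers={'accept': 'application/json'}, timeout=60).content`, and the returned list shows which movie ids need a retry.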

## 2.5 Target Feature (Movie Genres)

After retrieving the movie dataset, the target feature for genres needs to be created to serve as the output layer of the future deep learning model.

In [32]:
#Split genres column based on ',' in order to retrieve unique genre values for movies
#Code to split dataframe columns based on ',' found on https://stackoverflow.com/questions/14745022/how-to-split-a-dataframe-string-column-into-two-columns
movie_df_dropped[['Genre_1', 'Genre_2']] = movie_df_dropped['genres'].str.split(',', n=1, expand=True)
movie_df_dropped[['Genre_2', 'Genre_3']] = movie_df_dropped['Genre_2'].str.split(',', n=1, expand=True)

In [33]:
#Stack the genre columns into one genre list, strip whitespace, and remove duplicates and null values to retrieve target genres
#Code for stacking dataframes vertically found on https://www.geeksforgeeks.org/stack-two-pandas-series-vertically-and-horizontally/
#Code for removing whitespace found on https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.strip.html
genre_list = pd.concat([movie_df_dropped['Genre_1'], movie_df_dropped['Genre_2'], movie_df_dropped['Genre_3']], axis=0, ignore_index = True)
genres = genre_list.str.strip()
dropped = genres.drop_duplicates()
dropped.replace('',np.nan, inplace=True)
target_genres = dropped.dropna()
target_genres

0            Comedy
1         Adventure
2            Action
7         Animation
12            Drama
62            Crime
63      Documentary
66        Biography
114          Family
134         Fantasy
375         Romance
455         Western
541          Sci-Fi
558          Horror
868         Musical
1068        History
1152        Mystery
1536       Thriller
2277           News
2285          Music
2802          Sport
4099            War
dtype: object
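
As an alternative sketch, not the approach used in this notebook, the unique genres can also be obtained without intermediate `Genre_1` to `Genre_3` columns by splitting and exploding the genres string. `str.get_dummies` can additionally build a multi-hot matrix with one column per genre, which is one possible shape for a multi-label target. The toy rows below are made up for illustration.

```python
import pandas as pd

# Toy rows in the same "Genre, Genre" format as the genres column; illustration only
toy = pd.DataFrame({'genres': ['Comedy, Family', 'Adventure, Comedy, Family', 'Animation']})

# One entry per (movie, genre) pair, whitespace stripped, then the distinct values
unique_genres = (
    toy['genres'].str.split(',').explode().str.strip().drop_duplicates().tolist()
)

# Multi-hot matrix: one column per genre, 1 where the movie has that genre
genre_matrix = toy['genres'].str.get_dummies(sep=', ')

print(unique_genres)
print(genre_matrix.columns.tolist())
```

The explode-based version works for any number of genres per movie, whereas the two-step split above is fixed at three.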

## 2.6 Save data

In [34]:
movie_df_dropped.shape

(3972, 18)

In [35]:
target_genres.shape

(22,)

In [36]:
# save the data to new csv files in the datasets folder
datapath = 'C:/Users/tpooz/OneDrive/Desktop/Data_Science_BootCamp_2023/SpringBoard_Github/IMDB_Classification/0_Datasets'
movie_df_dropped.to_csv(os.path.join(datapath, 'movie_df_clean.csv'))
target_genres.to_csv(os.path.join(datapath, 'target_genres.csv'))
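
By default, `to_csv` writes the DataFrame index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows the difference on a made-up DataFrame written to a temporary folder; the file name `toy_clean.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({'id': ['tt0000001', 'tt0000002'], 'Genre_1': ['Comedy', 'Drama']})

out_path = os.path.join(tempfile.mkdtemp(), 'toy_clean.csv')

toy.to_csv(out_path)                 # default: the index is written as an unnamed first column
with_index = pd.read_csv(out_path)   # that column comes back as 'Unnamed: 0'

toy.to_csv(out_path, index=False)    # skip the index when it carries no information
without_index = pd.read_csv(out_path)

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Either passing `index=False` when saving or `index_col=0` when loading avoids carrying the extra column into later notebooks.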

## 2.7 Summary

After retrieving the JSON data and "cleaning" the IMDB movie data, rows with null values in the "image" column were dropped, as movies without image data will not be useful for the image classifier model. The "genres" column was then split into individual genres and duplicates were dropped, producing the 22 target categories for the output layer of the future classifier model. The final shape of the dataset is (3972, 18), from an initial dataset of (4058, 14). Further EDA will be conducted to assess genre balance within the movie_df_clean dataset.