# 2. IMDB_Data_Wrangling<a id='2_Data_wrangling'></a>

## 2.1 Table of Contents<a id='2.1_Contents'></a>
* 2. IMDB_Data_Wrangling
  * 2.1 Table of Contents
  * 2.2 Introduction
  * 2.3 Imports
  * 2.4 Retrieve IMDB Movie Data
      * 2.4.1 Initial Data Retrieval
      * 2.4.2 Assess & Drop Null Values
      * 2.4.3 Retrieve IMDB Poster Image Data
  * 2.5 Target Feature (Movie Genres)
  * 2.6 Save data
  * 2.7 Summary


## 2.2 Introduction

In the data wrangling/cleaning phase, the data is first retrieved through GET requests to the IMDB API, then converted from JSON into a pandas dataframe for further manipulation, and finally exported to .csv.

## 2.3 Imports<a id='2.3_Imports'></a>

The packages below are imported to send the API requests and to organize and clean the JSON data as a pandas dataframe.

In [152]:
#Import requests, json, pandas, numpy, matplotlib.pyplot, seaborn, and os
import requests
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

## 2.4 Retrieve IMDB Movie Data

### 2.4.1 Initial Data Retrieval

The dataset contains movie data from 1990 through 2023, requested one calendar year at a time. The API request is filtered to G and PG rated movies, reflecting the company's focus on family-friendly movies.
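As a rough sketch of how the per-year, G/PG-filtered request URLs could be assembled, the snippet below builds one URL per year from a parameter dictionary. The helper name `build_search_url` and the `YOUR_API_KEY` placeholder are assumptions made for illustration; the parameter names and endpoint are the ones used in this notebook's request.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: substitute a real key, ideally read from an environment variable.
API_KEY = "YOUR_API_KEY"
BASE_URL = f"https://tv-api.com/API/AdvancedSearch/{API_KEY}"


def build_search_url(year: int) -> str:
    """Return the AdvancedSearch URL for G/PG feature films released in `year`."""
    params = {
        "title_type": "feature",
        "release_date": f"{year}-01-01,{year}-12-31",
        "certificates": "us:G,us:PG",
        "count": 250,
    }
    # Keep ',' and ':' unescaped so the query string matches the form used in this notebook.
    return f"{BASE_URL}?{urlencode(params, safe=',:')}"


# One URL per calendar year from 1990 through 2023.
urls = [build_search_url(year) for year in range(1990, 2024)]
print(len(urls))
print(urls[0])
```

Keeping the key out of the URL literal makes it easier to share the notebook without exposing credentials.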

In [79]:
# Initialize start and end dates (one pair per year, 1990 to 2023) for the IMDB API Get request
years = range(1990, 2024)
start_year = [f"{year}-01-01" for year in years]
end_year = [f"{year}-12-31" for year in years]

In [80]:
#Function that sends the API request for one start/end date pair and converts the response into a Pandas Dataframe
#Code for API Get Request found on IMDB Website: https://tv-api.com/api/#Search-header
#Code to convert json data into dict found on https://favtutor.com/blogs/string-to-dict-python
#Get keys from python dicts code found in https://tutorialdeep.com/knowhow/get-dictionary-value-key-python/
def imdb_data(start_year, end_year):
    url = f"https://tv-api.com/API/AdvancedSearch/k_8jnkh6yr?title_type=feature&release_date={start_year},{end_year}&certificates=us:G,us:PG&count=250"
    res = requests.get(url)
    #Parse the JSON body and keep only the list of movies stored under the 'results' key
    movie_list = res.json()['results']
    movie_df = pd.DataFrame(movie_list)
    return movie_df
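The JSON-to-dataframe step can be illustrated offline with a small payload. This is only a sketch: the payload below is a trimmed, hand-written stand-in whose field names follow the columns shown in this notebook, not a real API response.

```python
import json

import pandas as pd

# Trimmed, illustrative payload; a real response carries more fields per movie.
sample_response = """
{
  "results": [
    {"id": "tt0099785", "title": "Home Alone", "genres": "Comedy, Family", "contentRating": "PG"},
    {"id": "tt0099810", "title": "The Hunt for Red October", "genres": "Action, Adventure, Thriller", "contentRating": "PG"}
  ]
}
"""

payload = json.loads(sample_response)
# Fall back to an empty list when "results" is missing or null, so DataFrame() still succeeds.
records = payload.get("results") or []
sample_df = pd.DataFrame(records)
print(sample_df.shape)
```

Each dictionary in the `results` list becomes one row, and the dictionary keys become the column names.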

In [35]:
#Created list of movie dataframes of movies from 1990 to 2023
movie_df_concat = []
for i in range(len(start_year)):
    movie_df_concat.append(imdb_data(start_year[i], end_year[i]))

In [222]:
#Joined the above dataframes to create a combined dataframe for movie data from 1990 to 2023
#Code to join dataframes ignore_index found on https://www.statology.org/stack-pandas-dataframes/ and https://pandas.pydata.org/pandas-docs/version/1.3/reference/api/pandas.DataFrame.append.html
movie_df_total = pd.concat(movie_df_concat, ignore_index=True)
movie_df_total

Unnamed: 0,id,image,title,description,runtimeStr,genres,genreList,contentRating,imDbRating,imDbRatingVotes,metacriticRating,plot,stars,starList
0,tt0099785,https://m.media-amazon.com/images/M/MV5BMzFkM2...,Home Alone,1990,103 mins,"Comedy, Family","[{'key': 'Comedy', 'value': 'Comedy'}, {'key':...",PG,7.7,645257,63,"An eight-year-old troublemaker, mistakenly lef...",,[]
1,tt0099810,https://m.media-amazon.com/images/M/MV5BZDdkOD...,The Hunt for Red October,1990,135 mins,"Action, Adventure, Thriller","[{'key': 'Action', 'value': 'Action'}, {'key':...",PG,7.5,212075,58,"In November 1984, the Soviet Union's best subm...",,[]
2,tt0100758,https://m.media-amazon.com/images/M/MV5BNzg3NT...,Teenage Mutant Ninja Turtles,1990,93 mins,"Action, Adventure, Comedy","[{'key': 'Action', 'value': 'Action'}, {'key':...",PG,6.8,103577,51,Four teenage mutant ninja turtles emerge from ...,,[]
3,tt0099088,https://m.media-amazon.com/images/M/MV5BYjhlMG...,Back to the Future Part III,1990,118 mins,"Adventure, Comedy, Sci-Fi","[{'key': 'Adventure', 'value': 'Adventure'}, {...",PG,7.4,475051,55,"Stranded in 1955, Marty McFly learns about the...",,[]
4,tt0099422,https://m.media-amazon.com/images/M/MV5BMzA5MD...,Dick Tracy,1990,105 mins,"Action, Comedy, Crime","[{'key': 'Action', 'value': 'Action'}, {'key':...",PG,6.2,65117,68,The comic strip detective finds his life vastl...,,[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4054,tt15426874,https://m.media-amazon.com/images/M/MV5BNmIwZj...,Seaper Powers: Mystery of the Blue Pearls,2023,66 mins,Animation,"[{'key': 'Animation', 'value': 'Animation'}]",PG,0,0,,Emma works for NOAA as a diver and researcher....,,[]
4055,tt30133792,,Justice League x RWBY: Super Heroes and Huntsmen,2023,,"Animation, Action, Adventure","[{'key': 'Animation', 'value': 'Animation'}, {...",PG,0,0,,Justice League x RWBY: Super Heroes and Huntsm...,,[]
4056,tt27838714,https://m.media-amazon.com/images/M/MV5BYjkxNz...,M&A,2023,84 mins,Documentary,"[{'key': 'Documentary', 'value': 'Documentary'}]",PG,0,0,,Baby Boomers own 83% of businesses. It is $10 ...,,[]
4057,tt28018383,,Graham Bullard Movie: Peanuts Meets the Loud H...,2023,89 mins,"Adventure, Comedy, Family","[{'key': 'Adventure', 'value': 'Adventure'}, {...",G,0,0,,The Peanuts Gang and The Loud Family all have ...,,[]


Audited the IMDB Dataset with .head() and .info() methods.

In [223]:
movie_df_total.head()

Unnamed: 0,id,image,title,description,runtimeStr,genres,genreList,contentRating,imDbRating,imDbRatingVotes,metacriticRating,plot,stars,starList
0,tt0099785,https://m.media-amazon.com/images/M/MV5BMzFkM2...,Home Alone,1990,103 mins,"Comedy, Family","[{'key': 'Comedy', 'value': 'Comedy'}, {'key':...",PG,7.7,645257,63,"An eight-year-old troublemaker, mistakenly lef...",,[]
1,tt0099810,https://m.media-amazon.com/images/M/MV5BZDdkOD...,The Hunt for Red October,1990,135 mins,"Action, Adventure, Thriller","[{'key': 'Action', 'value': 'Action'}, {'key':...",PG,7.5,212075,58,"In November 1984, the Soviet Union's best subm...",,[]
2,tt0100758,https://m.media-amazon.com/images/M/MV5BNzg3NT...,Teenage Mutant Ninja Turtles,1990,93 mins,"Action, Adventure, Comedy","[{'key': 'Action', 'value': 'Action'}, {'key':...",PG,6.8,103577,51,Four teenage mutant ninja turtles emerge from ...,,[]
3,tt0099088,https://m.media-amazon.com/images/M/MV5BYjhlMG...,Back to the Future Part III,1990,118 mins,"Adventure, Comedy, Sci-Fi","[{'key': 'Adventure', 'value': 'Adventure'}, {...",PG,7.4,475051,55,"Stranded in 1955, Marty McFly learns about the...",,[]
4,tt0099422,https://m.media-amazon.com/images/M/MV5BMzA5MD...,Dick Tracy,1990,105 mins,"Action, Comedy, Crime","[{'key': 'Action', 'value': 'Action'}, {'key':...",PG,6.2,65117,68,The comic strip detective finds his life vastl...,,[]


In [224]:
movie_df_total.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4059 entries, 0 to 4058
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   id                4059 non-null   object
 1   image             3973 non-null   object
 2   title             4059 non-null   object
 3   description       4058 non-null   object
 4   runtimeStr        4005 non-null   object
 5   genres            4059 non-null   object
 6   genreList         4059 non-null   object
 7   contentRating     4058 non-null   object
 8   imDbRating        4059 non-null   object
 9   imDbRatingVotes   4059 non-null   object
 10  metacriticRating  1471 non-null   object
 11  plot              4059 non-null   object
 12  stars             4059 non-null   object
 13  starList          4059 non-null   object
dtypes: object(14)
memory usage: 444.1+ KB


The `.info()` output above shows that the "image" column has some null values (3973 non-null out of 4059 rows). Every row appears to have a genre value, but this will be investigated further. Rows with missing image data will be dropped, as they are not usable for future image classification. The `stars` and `starList` columns appear to be empty (blank) in the initial pull of the dataset from the IMDB API.

### 2.4.2 Assess & Drop Null Values

After the JSON data from the IMDB API is converted into a dataframe, the image and genres columns are assessed for null or blank values. Movies missing either value will be filtered out, as the future image classifier needs both image data and a corresponding genre label for training.
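One way to audit both kinds of missing value at once is to count nulls and empty strings per column side by side. The sketch below uses a made-up toy dataframe in place of `movie_df_total`; the column names match the notebook, but the values are invented.

```python
import pandas as pd

# Toy frame standing in for movie_df_total; values are made up for illustration.
toy_df = pd.DataFrame(
    {
        "id": ["tt1", "tt2", "tt3", "tt4"],
        "image": ["https://example.com/a.jpg", None, "https://example.com/c.jpg", None],
        "genres": ["Comedy, Family", "Drama", "", ""],
    }
)

audit = pd.DataFrame(
    {
        "null": toy_df.isna().sum(),    # None / NaN entries per column
        "blank": toy_df.eq("").sum(),   # empty-string entries per column
    }
)
print(audit)
```

A table like this makes it clear when two columns encode "missing" in different ways, which `.info()` alone does not reveal.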

In [225]:
movie_df_total[movie_df_total['image'] == 'None']

Unnamed: 0,id,image,title,description,runtimeStr,genres,genreList,contentRating,imDbRating,imDbRatingVotes,metacriticRating,plot,stars,starList


In [226]:
movie_df_total[movie_df_total['image'].isnull() == True]

Unnamed: 0,id,image,title,description,runtimeStr,genres,genreList,contentRating,imDbRating,imDbRatingVotes,metacriticRating,plot,stars,starList
139,tt8987008,,Hopes for a Miracle,1991,70 mins,"Drama, Family","[{'key': 'Drama', 'value': 'Drama'}, {'key': '...",G,0,0,,A young dancer stricken with a physical illnes...,,[]
140,tt0219516,,Athena: The Goddess Awakens,1991,,Documentary,"[{'key': 'Documentary', 'value': 'Documentary'}]",PG,0,0,,,,[]
205,tt2379689,,Backlash: Race and the American Dream,1992,58 mins,Documentary,"[{'key': 'Documentary', 'value': 'Documentary'}]",G,0,0,,Backlash: Race and the American Dream chronicl...,,[]
207,tt0104789,,The Magical World of Chuck Jones,1992,93 mins,Documentary,"[{'key': 'Documentary', 'value': 'Documentary'}]",PG,6.7,116,,Documentary on animator Chuck Jones.,,[]
208,tt11221864,,Live the Life You Love,1992,,,[],PG,0,0,,,,[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3887,tt13132110,,Walking the Park with Walt,2021,190 mins,Biography,"[{'key': 'Biography', 'value': 'Biography'}]",PG,0,0,,Interviewing Ben Harris and 6 other members of...,,[]
3981,tt25306094,,BFDI Plush: A Trip to Las Vegas,2022,101 mins,"Adventure, Comedy, Family","[{'key': 'Adventure', 'value': 'Adventure'}, {...",G,0,0,,It's time for the BFDI plushies to go on anoth...,,[]
4055,tt30133792,,Justice League x RWBY: Super Heroes and Huntsmen,2023,,"Animation, Action, Adventure","[{'key': 'Animation', 'value': 'Animation'}, {...",PG,0,0,,Justice League x RWBY: Super Heroes and Huntsm...,,[]
4057,tt28018383,,Graham Bullard Movie: Peanuts Meets the Loud H...,2023,89 mins,"Adventure, Comedy, Family","[{'key': 'Adventure', 'value': 'Adventure'}, {...",G,0,0,,The Peanuts Gang and The Loud Family all have ...,,[]


In [227]:
len(movie_df_total[movie_df_total['image'].isnull() == True])

86

In [228]:
movie_df_total[movie_df_total['image'] == ""]

Unnamed: 0,id,image,title,description,runtimeStr,genres,genreList,contentRating,imDbRating,imDbRatingVotes,metacriticRating,plot,stars,starList


In [229]:
movie_df_total[movie_df_total['genres'] == 'None']

Unnamed: 0,id,image,title,description,runtimeStr,genres,genreList,contentRating,imDbRating,imDbRatingVotes,metacriticRating,plot,stars,starList


In [230]:
movie_df_total[movie_df_total['genres'].isnull() == True]

Unnamed: 0,id,image,title,description,runtimeStr,genres,genreList,contentRating,imDbRating,imDbRatingVotes,metacriticRating,plot,stars,starList


In [231]:
movie_df_total[movie_df_total['genres'] == ""]

Unnamed: 0,id,image,title,description,runtimeStr,genres,genreList,contentRating,imDbRating,imDbRatingVotes,metacriticRating,plot,stars,starList
208,tt11221864,,Live the Life You Love,1992,,,[],PG,0.0,0,,,,[]
294,tt0107959,,The Ride,1993,,,[],PG,0.0,0,,,,[]
397,tt0109845,,Friend for a Day,1994,94 mins,,[],PG,0.0,0,,,,[]
484,tt0114409,https://m.media-amazon.com/images/M/MV5BMGQzYm...,Shakuhachi,1995,60 mins,,[],G,0.0,0,,Shakuhachi - the Japanese flute has a very rel...,,[]
485,tt0310978,https://m.media-amazon.com/images/M/MV5BMWUwMW...,Dark Passage to Wan,1995,60 mins,,[],G,0.0,0,,,,[]
744,tt0384129,https://m.media-amazon.com/images/M/MV5BMTc1MD...,Girl Cottage,1998,98 mins,,[],PG,0.0,0,,,,[]
745,tt0161956,,Sammy the Squirrel,1998,,,[],G,0.0,0,,,,[]
748,tt0161734,,King of the Birds,1998,,,[],G,0.0,0,,,,[]
749,tt0161727,,Jungle Bungle,1998,,,[],G,0.0,0,,,,[]
846,tt0220392,,The Dinosaur Piece,1999,,,[],G,0.0,0,,,,[]


In [232]:
len(movie_df_total[movie_df_total['genres'] == ""])

45

The results above show that the image and genres columns encode missing values differently. The image column stores missing values as nulls (86 movies), while the genres column stores them as blank strings (45 movies). Movies with a missing value in either column will be filtered out, as the image classifier needs both an image and a predefined label in order to train.
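As an alternative sketch, the two encodings can be unified first so that a single `dropna` call handles both. The toy data and the `required` list below are assumptions for illustration; the notebook itself filters the two columns in separate steps.

```python
import numpy as np
import pandas as pd

# Toy frame with both kinds of missing value; values are made up for illustration.
toy_df = pd.DataFrame(
    {
        "id": ["tt1", "tt2", "tt3", "tt4", "tt5"],
        "image": ["https://example.com/a.jpg", None, "https://example.com/c.jpg", None,
                  "https://example.com/e.jpg"],
        "genres": ["Comedy, Family", "Drama", "", "", "Animation"],
    }
)

required = ["image", "genres"]
normalized = toy_df.copy()
# Convert blank strings to NaN so that nulls and blanks are treated the same way.
normalized[required] = normalized[required].replace("", np.nan)
# Keep only rows that have both a poster URL and a genre label.
usable = normalized.dropna(subset=required).reset_index(drop=True)
print(usable)
```

Only the rows with both an image URL and a genre string survive, which is the same condition the classifier needs.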

In [233]:
movie_df_clean = movie_df_total[movie_df_total['image'].isnull() == False]

In [234]:
movie_df_clean = movie_df_clean[movie_df_clean['genres'] != ""]

In [235]:
#After filtering out null values, the dataframe's index is reset. Code to reset index found on https://www.geeksforgeeks.org/python-pandas-dataframe-reset_index/
movie_df_clean[movie_df_clean['image'].isnull() == True]
movie_df_clean = movie_df_clean.reset_index()

In [236]:
movie_df_clean[movie_df_clean['image'].isnull() == True]

Unnamed: 0,index,id,image,title,description,runtimeStr,genres,genreList,contentRating,imDbRating,imDbRatingVotes,metacriticRating,plot,stars,starList


In [237]:
movie_df_clean[movie_df_clean['genres'] == ""]

Unnamed: 0,index,id,image,title,description,runtimeStr,genres,genreList,contentRating,imDbRating,imDbRatingVotes,metacriticRating,plot,stars,starList


In [238]:
movie_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3940 entries, 0 to 3939
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   index             3940 non-null   int64 
 1   id                3940 non-null   object
 2   image             3940 non-null   object
 3   title             3940 non-null   object
 4   description       3939 non-null   object
 5   runtimeStr        3908 non-null   object
 6   genres            3940 non-null   object
 7   genreList         3940 non-null   object
 8   contentRating     3939 non-null   object
 9   imDbRating        3940 non-null   object
 10  imDbRatingVotes   3940 non-null   object
 11  metacriticRating  1470 non-null   object
 12  plot              3940 non-null   object
 13  stars             3940 non-null   object
 14  starList          3940 non-null   object
dtypes: int64(1), object(14)
memory usage: 461.8+ KB


In [239]:
movie_df_clean.shape

(3940, 15)

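The final shape of this "cleaned" dataset is (3940, 15) after removing movies without a poster image or genre listed. The 15th column, `index`, was added by `reset_index()` and holds each movie's row label from the original dataframe.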
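The effect of `reset_index()` on the column count can be seen on a small example. This is a minimal sketch with a made-up three-row dataframe; `drop=True` is shown only as an option for cases where the old row labels are not needed.

```python
import pandas as pd

# Toy frame whose index has gaps, as it would after filtering rows out.
toy = pd.DataFrame({"id": ["tt1", "tt2", "tt3"]}, index=[0, 2, 5])

kept = toy.reset_index()              # old labels are preserved in a new "index" column
dropped = toy.reset_index(drop=True)  # old labels are discarded

print(kept.columns.tolist())     # ['index', 'id']
print(dropped.columns.tolist())  # ['id']
```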

### 2.4.3 Retrieve IMDB Poster Image Data

With null image and genre values removed, the poster images are retrieved from the URLs listed in the 'image' column of the dataframe and saved into the corresponding file folder for the future image classification model.

In [217]:
#For loop to download each poster and save it under its movie id. Images are then moved into the appropriate file folder
#The original code block used to troubleshoot the request errors could not be recovered; the errors were diagnosed as being caused by missing headers that accept application/json
#Similar explanation for headers seen on ChatGPT search https://chat.openai.com/c/85f15339-0e94-4816-a444-bf82bc5fad63 with start phrase: Python, remote forcibly closed connection, how to fix get request
#Code for handling timeouts in get requests found on https://stackoverflow.com/questions/21965484/timeout-for-python-requests-get-entire-response
headers = {'accept': 'application/json'}
#Loop over every row of the cleaned dataframe
for i in range(len(movie_df_clean)):
    url = movie_df_clean['image'][i]
    data = requests.get(url, headers=headers, timeout=60).content
    #The with block closes the file even if the write fails
    with open(f"{movie_df_clean['id'][i]}.jpg", 'wb') as f:
        f.write(data)
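Because a single dropped connection would stop a long download loop, one possible refinement is to record failures and continue. The sketch below is an assumption-laden illustration, not the notebook's method: `save_posters` and `fake_fetch` are hypothetical names, and the fetch step is passed in as a function so the example runs without network access.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def save_posters(
    rows: Iterable[Tuple[str, str]],
    out_dir: Path,
    fetch: Callable[[str], bytes],
) -> List[str]:
    """Save one <id>.jpg per (id, url) pair and return the ids that failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for movie_id, url in rows:
        try:
            data = fetch(url)
        except Exception:  # broad on purpose: any failure just marks the id for a retry
            failed.append(movie_id)
            continue
        (out_dir / f"{movie_id}.jpg").write_bytes(data)
    return failed


# Stand-in fetcher so the sketch runs offline; it fails for one made-up URL.
def fake_fetch(url: str) -> bytes:
    if "broken" in url:
        raise ConnectionError("simulated dropped connection")
    return b"fake-image-bytes"


rows = [
    ("tt0000001", "https://example.com/a.jpg"),
    ("tt0000002", "https://example.com/broken.jpg"),
]
with tempfile.TemporaryDirectory() as tmp:
    poster_dir = Path(tmp) / "posters"
    failed_ids = save_posters(rows, poster_dir, fake_fetch)
    saved = sorted(p.name for p in poster_dir.iterdir())

print(saved, failed_ids)
```

With the notebook's data, `rows` could be `zip(movie_df_clean['id'], movie_df_clean['image'])` and `fetch` could wrap `requests.get(url, headers=headers, timeout=60).content`; the returned list of ids can then be retried separately.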

## 2.5 Target Feature (Movie Genres)

After retrieving the movie dataset, the target feature for genres needs to be created for the output layer of the future deep learning model.

In [240]:
#Split genres column based on ',' in order to retrieve unique genre values for movies
#Code to split dataframe columns based on ',' found on https://stackoverflow.com/questions/14745022/how-to-split-a-dataframe-string-column-into-two-columns
movie_df_clean[['Genre_1', 'Genre_2']] = movie_df_clean['genres'].str.split(',', n=1, expand=True)
movie_df_clean[['Genre_2', 'Genre_3']] = movie_df_clean['Genre_2'].str.split(',', n=1, expand=True)
movie_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3940 entries, 0 to 3939
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   index             3940 non-null   int64 
 1   id                3940 non-null   object
 2   image             3940 non-null   object
 3   title             3940 non-null   object
 4   description       3939 non-null   object
 5   runtimeStr        3908 non-null   object
 6   genres            3940 non-null   object
 7   genreList         3940 non-null   object
 8   contentRating     3939 non-null   object
 9   imDbRating        3940 non-null   object
 10  imDbRatingVotes   3940 non-null   object
 11  metacriticRating  1470 non-null   object
 12  plot              3940 non-null   object
 13  stars             3940 non-null   object
 14  starList          3940 non-null   object
 15  Genre_1           3940 non-null   object
 16  Genre_2           2920 non-null   object
 17  Genre_3       

After the genre labels are split into separate features, any null value in a genre feature is saved as the genre label 'None' for the future image classifier.
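If only the genre features should receive the 'None' label, the fill can be restricted to those columns. The sketch below uses a made-up three-row dataframe and is one possible approach, shown here because a dataframe-wide replacement also rewrites nulls in unrelated columns such as `metacriticRating`.

```python
import pandas as pd

# Toy frame; values are made up for illustration.
toy = pd.DataFrame(
    {
        "genres": ["Comedy, Family", "Action, Adventure, Thriller", "Drama"],
        "metacriticRating": ["63", "58", None],
    }
)

genre_cols = ["Genre_1", "Genre_2", "Genre_3"]
# expand=True pads movies that have fewer than three genres with missing values.
parts = toy["genres"].str.split(", ", expand=True)
parts.columns = genre_cols
toy = toy.join(parts)

# Fill only the genre columns; other columns keep their real missing values.
toy[genre_cols] = toy[genre_cols].fillna("None")
print(toy)
```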

In [245]:
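#Adding 'None' for any null values in the movie genre features (replace() runs on the whole dataframe, so nulls in other columns such as metacriticRating also become 'None')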
movie_df_clean = movie_df_clean.replace({np.nan:'None'})
movie_df_clean

Unnamed: 0,index,id,image,title,description,runtimeStr,genres,genreList,contentRating,imDbRating,imDbRatingVotes,metacriticRating,plot,stars,starList,Genre_1,Genre_2,Genre_3
0,0,tt0099785,https://m.media-amazon.com/images/M/MV5BMzFkM2...,Home Alone,1990,103 mins,"Comedy, Family","[{'key': 'Comedy', 'value': 'Comedy'}, {'key':...",PG,7.7,645257,63,"An eight-year-old troublemaker, mistakenly lef...",,[],Comedy,Family,
1,1,tt0099810,https://m.media-amazon.com/images/M/MV5BZDdkOD...,The Hunt for Red October,1990,135 mins,"Action, Adventure, Thriller","[{'key': 'Action', 'value': 'Action'}, {'key':...",PG,7.5,212075,58,"In November 1984, the Soviet Union's best subm...",,[],Action,Adventure,Thriller
2,2,tt0100758,https://m.media-amazon.com/images/M/MV5BNzg3NT...,Teenage Mutant Ninja Turtles,1990,93 mins,"Action, Adventure, Comedy","[{'key': 'Action', 'value': 'Action'}, {'key':...",PG,6.8,103577,51,Four teenage mutant ninja turtles emerge from ...,,[],Action,Adventure,Comedy
3,3,tt0099088,https://m.media-amazon.com/images/M/MV5BYjhlMG...,Back to the Future Part III,1990,118 mins,"Adventure, Comedy, Sci-Fi","[{'key': 'Adventure', 'value': 'Adventure'}, {...",PG,7.4,475051,55,"Stranded in 1955, Marty McFly learns about the...",,[],Adventure,Comedy,Sci-Fi
4,4,tt0099422,https://m.media-amazon.com/images/M/MV5BMzA5MD...,Dick Tracy,1990,105 mins,"Action, Comedy, Crime","[{'key': 'Action', 'value': 'Action'}, {'key':...",PG,6.2,65117,68,The comic strip detective finds his life vastl...,,[],Action,Comedy,Crime
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3935,4051,tt11650502,https://m.media-amazon.com/images/M/MV5BODZlNG...,Touch the Water,2023,94 mins,Drama,"[{'key': 'Drama', 'value': 'Drama'}]",PG,8.4,14,,When an intern at a Senior Center challenges a...,,[],Drama,,
3936,4052,tt21862696,https://m.media-amazon.com/images/M/MV5BYjZmND...,Snow White's Christmas Adventure,2023,,Family,"[{'key': 'Family', 'value': 'Family'}]",G,0,0,,,,[],Family,,
3937,4053,tt0122511,https://m.media-amazon.com/images/M/MV5BMTQwMW...,The Gnomes Great Adventure,2023,74 mins,"Animation, Adventure, Comedy","[{'key': 'Animation', 'value': 'Animation'}, {...",G,6.1,63,,When a bunch of dim-witted trolls steal the Ki...,,[],Animation,Adventure,Comedy
3938,4054,tt15426874,https://m.media-amazon.com/images/M/MV5BNmIwZj...,Seaper Powers: Mystery of the Blue Pearls,2023,66 mins,Animation,"[{'key': 'Animation', 'value': 'Animation'}]",PG,0,0,,Emma works for NOAA as a diver and researcher....,,[],Animation,,


In [246]:
#Stack the genre columns into one genre list, strip whitespace, and remove duplicates and null values to retrieve target genres
#Code for stacking dataframes vertically found on https://www.geeksforgeeks.org/stack-two-pandas-series-vertically-and-horizontally/
#Code for removing whitespace found on https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.strip.html
genre_list = pd.concat([movie_df_clean['Genre_1'], movie_df_clean['Genre_2'], movie_df_clean['Genre_3']], axis=0, ignore_index = True)
genres = genre_list.str.strip()
dropped = genres.drop_duplicates()
dropped.replace('','None', inplace=True)
target_genres = dropped.dropna()
target_genres

0            Comedy
1            Action
3         Adventure
7         Animation
11            Drama
52      Documentary
65            Crime
67        Biography
114          Family
130         Fantasy
379         Romance
455         Western
537          Horror
539          Sci-Fi
864         Musical
1065        History
1147        Mystery
1540       Thriller
2255          Music
2257           News
2762          Sport
3954           None
4067            War
dtype: object
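For comparison, the unique genre labels can also be derived directly from the comma-separated `genres` strings, and the same strings can be turned into one 0/1 column per genre. This is a sketch on made-up data and an assumption about what the later model may need, not the encoding used in this notebook.

```python
import pandas as pd

# Toy genre strings in the same "A, B, C" format as the genres column.
genres = pd.Series(["Comedy, Family", "Action, Adventure, Thriller", "Drama", "Comedy"])

# One entry per (movie, genre) pair, then the distinct labels.
unique_genres = sorted(genres.str.split(", ").explode().unique())

# One 0/1 column per genre, a common target layout for multi-label classifiers.
multi_hot = genres.str.get_dummies(sep=", ")

print(unique_genres)
print(multi_hot.shape)
```

A multi-hot layout keeps all of a movie's genres in a single target row, whereas the `Genre_1` to `Genre_3` columns store them positionally.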

## 2.6 Save data

In [247]:
movie_df_clean.shape

(3940, 18)

In [248]:
target_genres.shape

(23,)

In [249]:
# save the cleaned movie data and the target genres to new csv files
datapath = 'C:/Users/tpooz/OneDrive/Desktop/Data_Science_BootCamp_2023/SpringBoard_Github/IMDB_Classification/0_Datasets'
movie_df_clean.to_csv(f'{datapath}/movie_df_clean.csv')
target_genres.to_csv(f'{datapath}/target_genres.csv')
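A save-and-reload round trip is a quick way to confirm that the CSV holds what is expected. The sketch below writes a made-up two-row dataframe to a temporary folder; the folder name mirrors the project's dataset folder, but the temporary location and the `index=False` choice are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame; values are made up for illustration.
toy = pd.DataFrame({"id": ["tt1", "tt2"], "Genre_1": ["Comedy", "Drama"]})

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "0_Datasets"  # stand-in for the project's dataset folder
    data_dir.mkdir(parents=True, exist_ok=True)
    out_file = data_dir / "movie_df_clean.csv"
    # index=False avoids writing the row index as an extra unnamed column.
    toy.to_csv(out_file, index=False)
    reloaded = pd.read_csv(out_file)

print(reloaded.columns.tolist())
```

Building paths with `pathlib.Path` relative to a project folder also keeps the notebook portable across machines.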

## 2.7 Summary

After the JSON data was retrieved from the IMDB API and converted to a dataframe, the columns were assessed for null or missing values. The `stars` and `starList` columns appear to be empty in the initial API retrieval. The "image" and "genres" columns encode missing values differently: missing images are null values (found with `.isnull()`), whereas missing genres are blank strings. Movies missing either value were dropped, as the future image classifier needs both image data and a label in order to train. The genres were then split into separate features (`Genre_1`, `Genre_2`, `Genre_3`), and the list of unique genre labels was saved as the target for the output layer of the image classifier. The final shape of the dataset is (3940, 18), with 23 unique genre labels including 'None'. Further EDA will be conducted to assess genre balance within the movie_df_clean dataset.