# 4. IMDb_Classification_Preprocessing<a id='2_Data_wrangling'></a>

## 4.1 Table of Contents<a id='2.1_Contents'></a>
* 4. IMDb_Classification_Preprocessing
  * 4.1 Table of Contents
  * 4.2 Introduction
  * 4.3 Imports
  * 4.4 Load IMDb Genre Class Dataset & Retrieve Image Data
    * 4.4.1 Loading IMDb Genre Class Dataset
    * 4.4.2 Retrieve Image Data & Labels
    * 4.4.3 Transform IMDb Genre Dataset
  * 4.5 Create Train_Test Split
  * 4.6 Save Data
  * 4.7 Summary

## 4.2 Introduction

With the representation of each genre class analyzed, the next step is to preprocess the dataset so that it can be fed to the deep learning image classification model.

## 4.3 Imports<a id='2.3_Imports'></a>

Import the packages needed to preprocess the data and build the train and test sets for the deep learning image classification model.

In [1]:
#Import pandas, numpy, matplotlib.pyplot, seaborn, tensorflow, PIL and the associated scikit-learn functions, and set a random number for reproducibility
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import tensorflow as tf
import os
from PIL import Image
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import time
from datetime import datetime
random_number = 42

## 4.4 Load IMDb Genre Class Dataset & Retrieve Image Data

### 4.4.1 Loading IMDb Genre Class Dataset

In [2]:
#Load the filtered movie genre dataset into movie_df_filtered.
path_file = 'C:/Users/tpooz/OneDrive/Desktop/Data_Science_BootCamp_2023/SpringBoard_Github/IMDB_Classification/0_Datasets/movie_df_filtered.csv'
movie_df_filtered = pd.read_csv(path_file, index_col=0)

Auditing the dataset with .info() for a summary and .head() to display the first few records.

In [3]:
#.info() on movie_df_filtered to see a summary of the data
movie_df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3940 entries, 0 to 3939
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       3940 non-null   object
 1   Genre_1  3940 non-null   object
 2   Genre_2  2920 non-null   object
 3   Genre_3  1991 non-null   object
dtypes: object(4)
memory usage: 153.9+ KB


In [4]:
#head method on movie_df_filtered to print the first several rows of the data
movie_df_filtered.head()

Unnamed: 0,id,Genre_1,Genre_2,Genre_3
0,tt0099785,Comedy,Family,
1,tt0099810,Action,Adventure,Thriller
2,tt0100758,Action,Adventure,Comedy
3,tt0099088,Adventure,Comedy,Sci-Fi
4,tt0099422,Action,Comedy,Crime


In [5]:
movie_df_filtered.shape

(3940, 4)

### 4.4.2 Retrieve Image Data & Labels

The function below resizes each image found in the image folder to 200 x 200 pixels and returns its RGB values as an array. These arrays form the input data for the image classification model.

In [6]:
#Code to get filename from Image data found on https://stackoverflow.com/questions/45087638/get-image-filename-from-image-pil
#Code to reshape numpy data array found on https://realpython.com/numpy-reshape/
#Code to convert image data to numpy array found on https://www.geeksforgeeks.org/how-to-convert-images-to-numpy-array/
def image_open(path):
    #Open the image, resize it to 200 x 200 pixels and force three RGB channels
    shape = (200,200)
    with Image.open(path) as img:
        color = img.resize(shape).convert("RGB")
        #np.asarray on a PIL image returns a (200, 200, 3) uint8 array directly
        return np.asarray(color)
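
As a quick check of the resize-and-convert step, the sketch below runs the same Pillow calls on a small synthetic image held in memory instead of a poster file. The helper name `to_rgb_array` and the 64 x 48 grayscale test image are assumptions made purely for illustration.

```python
import numpy as np
from PIL import Image

def to_rgb_array(img, size=(200, 200)):
    """Resize a PIL image and return its RGB values as a (height, width, 3) array."""
    # Hypothetical helper: same steps as image_open, but for an already opened image.
    resized = img.resize(size)
    rgb = resized.convert("RGB")
    return np.asarray(rgb)

# A synthetic 64 x 48 grayscale image stands in for a poster file.
sample = Image.new("L", (64, 48), color=128)
pixels = to_rgb_array(sample)
print(pixels.shape, pixels.dtype)
```

Converting to RGB matters here because a grayscale or palette image would otherwise not have three channels, and the stacked model input needs every poster to share one shape.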

In [7]:
#Code to retrieve each image within the specified folder found on example seen in https://www.kaggle.com/code/prateek0x/multiclass-image-classification-using-keras/notebook
#Code to get .jpg filenames from specified folder seen in similar code on https://stackoverflow.com/questions/42331589/read-jpg-files-in-a-directory
path = 'C:/Users/tpooz/OneDrive/Desktop/Data_Science_BootCamp_2023/SpringBoard_Github/IMDB_Classification/0_Datasets/Images'
arr = []
movie_genre = []
#os.listdir returns files in arbitrary order, so sort them to keep the row order of the image array reproducible
for i in sorted(os.listdir(path)):
    if i.endswith(".jpg"):
        #Store the image array and its filename at the same position in both lists
        arr.append(image_open(f"{path}/{i}"))
        movie_genre.append(i)



In [8]:
#Code to stack array values to create proper input array for image classification model found on https://numpy.org/doc/stable/reference/generated/numpy.stack.html
arr_stack = np.stack(arr, axis=0)
arr_stack.shape

(3939, 200, 200, 3)
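
A stacked array of this shape is large, so it can help to estimate its memory footprint before saving or training. The sketch below is only arithmetic on the shape reported above. The dtype that `arr_stack` actually has depends on how the pixel values were read, so both an 8-bit and a 64-bit case are shown as assumptions.

```python
import numpy as np

n_images, height, width, channels = 3939, 200, 200, 3
n_values = n_images * height * width * channels

# Bytes needed to hold every pixel value at two candidate dtypes.
bytes_uint8 = n_values * np.dtype(np.uint8).itemsize
bytes_int64 = n_values * np.dtype(np.int64).itemsize

print(f"uint8: {bytes_uint8 / 1e9:.2f} GB")
print(f"int64: {bytes_int64 / 1e9:.2f} GB")
```

Checking `arr_stack.dtype` and `arr_stack.nbytes` on the real array gives the exact figure.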

In [9]:
movie_genre

['tt0081145.jpg',
 'tt0095902.jpg',
 'tt0097106.jpg',
 'tt0097115.jpg',
 'tt0097668.jpg',
 'tt0098372.jpg',
 'tt0098959.jpg',
 'tt0098965.jpg',
 'tt0098966.jpg',
 'tt0099018.jpg',
 'tt0099073.jpg',
 'tt0099088.jpg',
 'tt0099092.jpg',
 'tt0099219.jpg',
 'tt0099266.jpg',
 'tt0099300.jpg',
 'tt0099327.jpg',
 'tt0099334.jpg',
 'tt0099341.jpg',
 'tt0099422.jpg',
 'tt0099428.jpg',
 'tt0099462.jpg',
 'tt0099472.jpg',
 'tt0099512.jpg',
 'tt0099564.jpg',
 'tt0099615.jpg',
 'tt0099622.jpg',
 'tt0099654.jpg',
 'tt0099669.jpg',
 'tt0099675.jpg',
 'tt0099726.jpg',
 'tt0099741.jpg',
 'tt0099785.jpg',
 'tt0099810.jpg',
 'tt0099816.jpg',
 'tt0099845.jpg',
 'tt0099857.jpg',
 'tt0099878.jpg',
 'tt0099892.jpg',
 'tt0099969.jpg',
 'tt0100046.jpg',
 'tt0100102.jpg',
 'tt0100151.jpg',
 'tt0100240.jpg',
 'tt0100281.jpg',
 'tt0100284.jpg',
 'tt0100287.jpg',
 'tt0100411.jpg',
 'tt0100419.jpg',
 'tt0100470.jpg',
 'tt0100477.jpg',
 'tt0100519.jpg',
 'tt0100558.jpg',
 'tt0100631.jpg',
 'tt0100666.jpg',
 'tt010070

This yields an array of RGB values for the movie posters, along with a list of poster filenames that each contain the poster's IMDb id.

In [10]:
#Cleaned name array to remove the .jpg extension (the quotation marks in the output above are only Python's string display, so there is nothing to strip)
#Code to remove ".jpg" filename extension found in a similar code on https://stackoverflow.com/questions/3548673/how-can-i-replace-or-strip-an-extension-from-a-filename-in-python
movie_gen = pd.Series(movie_genre)
gen = []
for i in range(len(movie_gen)):
    #Split once from the right on "." and keep the part before the extension
    gen.append(movie_gen[i].rsplit(".",1)[0])
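
An alternative way to drop the extension, sketched here under the assumption that every filename has the form `<IMDb id>.jpg`, is the standard library's `pathlib`. The three filenames are taken from the listing above.

```python
from pathlib import Path

filenames = ["tt0081145.jpg", "tt0095902.jpg", "tt0097106.jpg"]

# Path.stem returns the final path component without its last suffix.
ids = [Path(name).stem for name in filenames]
print(ids)
```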

In [11]:
#Converted list of IMDb filename unique id's into pandas series
genr = pd.Series(gen)
genr

0       tt0081145
1       tt0095902
2       tt0097106
3       tt0097115
4       tt0097668
          ...    
3934    tt9844368
3935    tt9848626
3936    tt9876582
3937    tt9883996
3938    tt9896916
Length: 3939, dtype: object

In [12]:
#Look up the genre row for each image id so that the labels follow the same order as the image array.
labels = []
for i in range(len(genr)):
    labels.append(movie_df_filtered[movie_df_filtered['id'] == genr[i]])

In [13]:
#Code to remove whitespace within columns of dataframe found on https://www.geeksforgeeks.org/pandas-strip-whitespace-from-entire-dataframe/#
#Similar solution found with code on https://stackoverflow.com/questions/70770016/how-to-remove-excess-whitespaces-in-entire-python-dataframe-columns
train_labels = pd.concat(labels, ignore_index=True)
train_labels['id'] = train_labels['id'].str.strip()
train_labels['Genre_1'] = train_labels['Genre_1'].str.strip()
train_labels['Genre_2'] = train_labels['Genre_2'].str.strip()
train_labels['Genre_3'] = train_labels['Genre_3'].str.strip()
train_labels['Genre_2'][0]

'Crime'

Each row of genre labels now lines up with its image data array, so the two can be split into train and test sets for the deep learning image classification model.
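
The alignment only holds if every image id has exactly one row in the label table, since an id with no match would silently shorten the label dataframe. A minimal sketch of one way to check this is a left merge with an indicator column. The toy frames below reuse three ids from the dataset preview, while `tt0000000` is a made-up id standing in for an image with no label.

```python
import pandas as pd

# Toy stand-ins for the label table and the image id order.
labels_df = pd.DataFrame({
    "id": ["tt0099785", "tt0099810", "tt0100758"],
    "Genre_1": ["Comedy", "Action", "Action"],
})
image_ids = pd.Series(["tt0100758", "tt0099785", "tt0000000"], name="id")

# A left merge keeps one row per image, in image order; the indicator
# column marks ids that have no match in the label table.
aligned = image_ids.to_frame().merge(
    labels_df, on="id", how="left", indicator=True, validate="many_to_one"
)
missing = aligned.loc[aligned["_merge"] == "left_only", "id"].tolist()
print(missing)
```

An empty `missing` list, together with a row count equal to the number of images, would confirm that labels and images line up one to one.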

In [14]:
train_labels

Unnamed: 0,id,Genre_1,Genre_2,Genre_3
0,tt0081145,Comedy,Crime,Drama
1,tt0095902,Adventure,Drama,Family
2,tt0097106,Comedy,Drama,Romance
3,tt0097115,Adventure,Drama,Family
4,tt0097668,Adventure,,
...,...,...,...,...
3934,tt9844368,Drama,Western,
3935,tt9848626,Animation,Adventure,Comedy
3936,tt9876582,Documentary,,
3937,tt9883996,Biography,Comedy,Drama


In [62]:
train_labels.to_csv('C:/Users/tpooz/OneDrive/Desktop/Data_Science_BootCamp_2023/SpringBoard_Github/IMDB_Classification/0_Datasets/train_labels_2.csv', header=True, index=True)

In [63]:
len(train_labels['id'])

3939

In [87]:
len(train_labels[train_labels['Genre_2'] == 'War'])

8

In [145]:
len(train_labels[train_labels['Genre_3'] == 'War'])

15

### 4.4.3 Transform IMDb Genre Dataset

The genre values in the dataframe above need to be transformed so that each genre becomes its own binary feature. The position of a genre (Genre_1, Genre_2 or Genre_3) is not important; what matters is whether the model can accurately predict which genres a movie belongs to.
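
To illustrate why this encoding removes the ordering, the sketch below builds the same kind of binary table for two made-up movies whose genres are listed in opposite order. It uses `melt` and `crosstab` as one possible vectorized route, which is not necessarily the approach taken in the cells that follow.

```python
import pandas as pd

# Two hypothetical movies with the same genres listed in a different order.
toy = pd.DataFrame({
    "id": ["a", "b"],
    "Genre_1": ["Action", "Adventure"],
    "Genre_2": ["Adventure", "Action"],
    "Genre_3": [None, None],
})

# Reshape to one row per (movie, genre) pair and drop missing genres.
long_genres = toy.melt(
    id_vars="id", value_vars=["Genre_1", "Genre_2", "Genre_3"], value_name="genre"
).dropna(subset=["genre"])

# Count genres per movie, then cap at 1 to obtain binary flags.
multi_hot = pd.crosstab(long_genres["id"], long_genres["genre"]).clip(upper=1)
print(multi_hot)
```

Both movies end up with identical rows. Note that a movie with no genres at all would drop out of the `crosstab` result, so this route assumes every movie has at least one genre.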

In [169]:
#Code to insert an item to beginning of the list found on https://stackoverflow.com/questions/17911091/append-integer-to-beginning-of-list-in-python
#Retrieving each unique genre label and creating a list of all genre labels
#Converting list to a set code found on https://www.geeksforgeeks.org/python-get-unique-values-list/
#Converting list to python array found on https://www.geeksforgeeks.org/python-convert-list-to-python-array/
col = train_labels['Genre_1'].unique()
col_2 = train_labels['Genre_2'].unique()
col_3 = train_labels['Genre_3'].unique()
new_col = np.concatenate((col, col_2))
new_col_2 = np.concatenate((new_col, col_3))
list_2 = new_col_2.tolist()
unique_set = set(list_2)
unique_list = list(unique_set)
new_array = np.array(unique_list)
movie_list = new_array.tolist()
movie_list.append('None')
movie_list.insert(0,'id')
#Missing genres appear as the string 'nan'; remove it by value because its position in a set-derived list is not fixed between runs
movie_list.pop(movie_list.index('nan'))

'nan'

In [170]:
movie_list

['id',
 'Adventure',
 'Horror',
 'Western',
 'Sci-Fi',
 'News',
 'Romance',
 'Sport',
 'Action',
 'Musical',
 'History',
 'Biography',
 'Mystery',
 'Drama',
 'Animation',
 'Music',
 'Crime',
 'Comedy',
 'Family',
 'Thriller',
 'Documentary',
 'War',
 'Fantasy',
 'None']

In [171]:
#Initialized dictionary with each genre as a feature and 0 as its initial binary value.
movie_dict = {'id':0}
for i in range(1,len(movie_list)):
    movie_dict[movie_list[i]] = 0

In [172]:
movie_dict

{'id': 0,
 'Adventure': 0,
 'Horror': 0,
 'Western': 0,
 'Sci-Fi': 0,
 'News': 0,
 'Romance': 0,
 'Sport': 0,
 'Action': 0,
 'Musical': 0,
 'History': 0,
 'Biography': 0,
 'Mystery': 0,
 'Drama': 0,
 'Animation': 0,
 'Music': 0,
 'Crime': 0,
 'Comedy': 0,
 'Family': 0,
 'Thriller': 0,
 'Documentary': 0,
 'War': 0,
 'Fantasy': 0,
 'None': 0}

In [173]:
#Code to create a list of numbers found on https://codefather.tech/blog/python-create-list-from-1-to-n/
num = list(range(3939))

In [175]:
#Taking the constructed dictionary and expanding the index to create an initial zero-filled dataframe that will eventually hold the genre flags for every image extracted from IMDb
df_mov = pd.DataFrame(movie_dict, index = num)
df_mov

Unnamed: 0,id,Adventure,Horror,Western,Sci-Fi,News,Romance,Sport,Action,Musical,...,Animation,Music,Crime,Comedy,Family,Thriller,Documentary,War,Fantasy,None
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3934,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3935,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3936,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3937,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [202]:
#Copy the ids across, then flag each movie's Genre_1 column.
#.loc is used instead of chained indexing (df_mov['Comedy'][i] = 1), which raises SettingWithCopyWarning and may not update df_mov at all.
df_mov['id'] = train_labels['id']
genre_columns = movie_list[1:]
for i in range(len(train_labels['id'])):
    genre = train_labels['Genre_1'][i]
    if genre in genre_columns:
        df_mov.loc[i, genre] = 1

In [203]:
#Flag each movie's Genre_2 column; missing genres (NaN) match no column and are skipped.
#.loc is used instead of chained indexing to avoid SettingWithCopyWarning.
genre_columns = movie_list[1:]
for j in range(len(train_labels['id'])):
    genre = train_labels['Genre_2'][j]
    if genre in genre_columns:
        df_mov.loc[j, genre] = 1

In [208]:
#Flag each movie's Genre_3 column; missing genres (NaN) match no column and are skipped.
#.loc is used instead of chained indexing to avoid SettingWithCopyWarning.
genre_columns = movie_list[1:]
for k in range(len(train_labels['id'])):
    genre = train_labels['Genre_3'][k]
    if genre in genre_columns:
        df_mov.loc[k, genre] = 1

In [220]:
df_mov.to_csv('C:/Users/tpooz/OneDrive/Desktop/Data_Science_BootCamp_2023/SpringBoard_Github/IMDB_Classification/0_Datasets/df_mov_war.csv', header=True, index=True)

In [221]:
df_mov

Unnamed: 0,id,Adventure,Horror,Western,Sci-Fi,News,Romance,Sport,Action,Musical,...,Animation,Music,Crime,Comedy,Family,Thriller,Documentary,War,Fantasy,None
0,tt0081145,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
1,tt0095902,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,tt0097106,0,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,tt0097115,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,tt0097668,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3934,tt9844368,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3935,tt9848626,1,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
3936,tt9876582,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3937,tt9883996,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


The transformed dataframe is now complete, with each movie genre label as a binary feature. To ready the dataset for training, the 'id' and 'None' columns are dropped so that the model must predict genres from the movie poster image data alone.

In [222]:
dummy_labels_filter = df_mov.drop(['id','None'], axis=1)
dummy_labels_filter.shape

(3939, 22)

In [224]:
dummy_labels_filter

Unnamed: 0,Adventure,Horror,Western,Sci-Fi,News,Romance,Sport,Action,Musical,History,...,Drama,Animation,Music,Crime,Comedy,Family,Thriller,Documentary,War,Fantasy
0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,1,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,...,1,0,0,0,1,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3934,0,0,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3935,1,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
3936,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3937,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,0


In [113]:
name = dummy_labels_filter.columns
name

Index(['Adventure', 'Horror', 'Western', 'Sci-Fi', 'News', 'Romance', 'Sport',
       'Action', 'Musical', 'History', 'Biography', 'Mystery', 'Drama',
       'Animation', 'Music', 'Crime', 'Comedy', 'Family', 'Thriller',
       'Documentary', 'War', 'Fantasy'],
      dtype='object')
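
Before splitting, it can be worth checking how many posters carry each genre and whether any row is left without a positive label. The sketch below shows one way to do that on a toy label frame. The three columns and their values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy multi-hot label frame standing in for dummy_labels_filter.
toy_labels = pd.DataFrame(
    {"Comedy": [1, 0, 1, 0], "Drama": [1, 1, 0, 0], "War": [0, 0, 0, 0]}
)

genre_counts = toy_labels.sum(axis=0)      # posters per genre
labels_per_row = toy_labels.sum(axis=1)    # genres per poster
unlabelled = labels_per_row[labels_per_row == 0].index.tolist()

print(genre_counts.to_dict())
print(unlabelled)
```

Genres with very few positives, or rows with none, are cases a classifier can rarely learn from, so they are useful to know about before training.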

## 4.5 Create Train_Test Split

Taking the array of image data values and the binary genre labels built above, a train_test_split is created for the future image classifier. The test size is arbitrarily set to give an 80/20 split.

In [226]:
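#Split images and labels together so that the rows stay paired; random_number (42) is the seed defined in the imports cell
X_train, X_val, y_train, y_val = train_test_split(arr_stack, dummy_labels_filter, test_size=0.20, random_state=random_number)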

In [227]:
X_train.shape

(3151, 200, 200, 3)

In [228]:
y_train.shape

(3151, 22)
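
The row counts follow from the 80/20 split of the 3939 labelled posters. The arithmetic below is a sketch that assumes the held-out share is rounded up to a whole row and the remainder goes to training, which matches the shapes reported in this notebook.

```python
import math

n_samples = 3939
test_size = 0.20

n_val = math.ceil(test_size * n_samples)   # rows held out for validation
n_train = n_samples - n_val                # remaining rows used for training
print(n_train, n_val)
```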

In [229]:
y_train

Unnamed: 0,Adventure,Horror,Western,Sci-Fi,News,Romance,Sport,Action,Musical,History,...,Drama,Animation,Music,Crime,Comedy,Family,Thriller,Documentary,War,Fantasy
1921,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3287,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
903,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1894,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3125,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1130,1,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
1294,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0
860,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3507,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0


## 4.6 Save Data

In [230]:
#X_train and y_train shape
print(f"X_train shape: {X_train.shape} and y_train shape: {y_train.shape}")

X_train shape: (3151, 200, 200, 3) and y_train shape: (3151, 22)


In [231]:
#X_val and y_val set shape
print(f"X_val shape: {X_val.shape} and y_val shape: {y_val.shape}")

X_val shape: (788, 200, 200, 3) and y_val shape: (788, 22)


In [232]:
# save the X_train and X_val Image Data arrays as numpy array and y_train and y_val dataframes as csv files to load for future image classifier
np.save('C:/Users/tpooz/OneDrive/Desktop/Data_Science_BootCamp_2023/SpringBoard_Github/IMDB_Classification/0_Datasets/X_train', X_train)
np.save('C:/Users/tpooz/OneDrive/Desktop/Data_Science_BootCamp_2023/SpringBoard_Github/IMDB_Classification/0_Datasets/X_val', X_val)
y_train.to_csv('C:/Users/tpooz/OneDrive/Desktop/Data_Science_BootCamp_2023/SpringBoard_Github/IMDB_Classification/0_Datasets/y_train.csv', header=True, index=True)
y_val.to_csv('C:/Users/tpooz/OneDrive/Desktop/Data_Science_BootCamp_2023/SpringBoard_Github/IMDB_Classification/0_Datasets/y_val.csv', header=True, index=True)
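
For reference, the sketch below shows one way the saved files could be read back in a later notebook. It uses a temporary directory and small made-up arrays rather than the real paths, so the file names `X_demo` and `y_demo.csv` are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Small stand-ins for X_train and y_train (shapes and values are illustrative).
X_demo = np.zeros((4, 200, 200, 3), dtype=np.uint8)
y_demo = pd.DataFrame(
    {"Comedy": [1, 0, 1, 0], "Drama": [0, 1, 1, 0]}, index=[7, 2, 9, 4]
)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    np.save(out_dir / "X_demo", X_demo)   # np.save appends the .npy extension
    y_demo.to_csv(out_dir / "y_demo.csv", header=True, index=True)

    X_loaded = np.load(out_dir / "X_demo.npy")
    # index_col=0 restores the shuffled row index written by to_csv
    y_loaded = pd.read_csv(out_dir / "y_demo.csv", index_col=0)

print(X_loaded.shape, y_loaded.index.tolist())
```

Because the labels keep their original row index after the split, reading the CSV with `index_col=0` preserves the link back to the rows of the full label dataframe.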

## 4.7 Summary

Several transformation steps were carried out to develop the train and test sets for the deep learning image classification model. Because the model is an image classifier, the movie poster image data first had to be retrieved and transformed into a stacked array of 200 x 200 pixel images with their corresponding RGB values. The image data was then matched with its genre labels from the cleaned genre label dataset. Once each set of labels was matched with its image data array, the label dataset was transformed so that each unique genre became a binary feature. This transformation was needed because the classifier's performance should not depend on the order in which a movie's genres are listed (Action/Adventure vs. Adventure/Action), only on whether it predicts the right genres for the movie, which gives y_train 22 genre columns. Finally, the image data array and the genre label dataframe were split 80/20 into train and test sets for training the future image classification model and assessing its performance.