## Transform "genres_omdb" Column

Most movies belong to more than one genre. Data collected from API calls is stored as strings, so further parsing is needed. The goal is to assign a 0 or 1 to every movie for each unique genre found in the `genres_omdb` column. We expect this encoding to be useful as input to our machine learning models.
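
As a rough illustration of the parsing involved, the sketch below splits comma-separated genre strings and collects the distinct genres. The `sample` series and its values are made up for the example; only the `", "` separator and the lower-casing mirror what the notebook does later.

```python
import pandas as pd

# Toy data shaped like the "genres_omdb" column (values are illustrative only).
sample = pd.Series(["Crime, Drama", "Drama, Romance", None])

# Split each string into a list of genres; missing values stay missing.
genre_lists = sample.str.split(", ")

# Flatten the lists, drop missing entries, and collect the distinct genres.
unique_genres = sorted(genre_lists.explode().dropna().str.lower().unique())
print(unique_genres)
```

Each distinct genre found this way would become one 0/1 column in the final table.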

In [1]:
import pandas as pd
import os

In [2]:
path = os.path.join('..','resources','cleaned_data', 'movies_complete_cleaned.csv')
raw_df = pd.read_csv(path)
raw_df.head()

Unnamed: 0,name,production,director,runtime,released,year,month,country_kaggle,country_omdb,star_kaggle,...,plot,awards,score_imdb,votes_imdb,score_metacritic,budget,genre_kaggle,gross,genres_omdb,rating
0,Gold,Black Bear Pictures,Stephen Gaghan,120,2017-01-27,2016,1,USA,USA,Matthew McConaughey,...,"With the sudden death of his father, fourth-ge...",1 win & 5 nominations.,6.7,32147,49.0,20000000,Adventure,7222964,"Crime, Drama",R
1,The Choice,Nicholas Sparks Productions,Ross Katz,111,2016-02-05,2016,2,USA,United States,Benjamin Walker,...,"In a small coastal town, the veterinarian Trav...",3 nominations,6.6,22972,26.0,0,Drama,18709066,"Drama, Romance",PG-13
2,Middle School: The Worst Years of My Life,CBS Films,Steve Carr,92,2016-10-07,2016,10,USA,"USA, Cambodia",Griffin Gluck,...,Imaginative quiet teenager Rafe Katchadorian i...,5 nominations.,6.1,4556,51.0,8500000,Animation,19985196,"Animation, Comedy, Family",PG
3,Midnight Special,Warner Bros.,Jeff Nichols,112,2016-04-21,2016,4,USA,"USA, Greece",Michael Shannon,...,Alton Meyer is a boy unlike any other in the w...,3 wins & 14 nominations.,6.7,58549,76.0,18000000,Drama,3707794,"Drama, Mystery, Sci-Fi, Thriller",PG-13
4,A Monster Calls,Apaches Entertainment,J.A. Bayona,108,2017-01-06,2016,1,UK,"UK, Spain, USA",Lewis MacDougall,...,The monster does not come walking often. This ...,39 wins & 56 nominations.,7.5,49969,76.0,43000000,Drama,3730982,"Adventure, Drama, Family, Fantasy",PG-13


In [3]:
# Select only genre columns
df = raw_df[['name', 'genre_kaggle', 'genres_omdb']]
df.head(2)

Unnamed: 0,name,genre_kaggle,genres_omdb
0,Gold,Adventure,"Crime, Drama"
1,The Choice,Drama,"Drama, Romance"


In [4]:
df.isna().sum()

name              0
genre_kaggle      0
genres_omdb     347
dtype: int64

In [5]:
# Drop rows without genres labeled from the OMDB API
df = df.dropna(axis='index', how='any')
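
Here `how='any'` drops a row if any of the three selected columns is missing, which gives the same result as targeting `genres_omdb` alone only because the other two columns have no missing values. A minimal sketch of the more targeted form, on a made-up `toy` frame (not the project data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row "B" lacks a Kaggle genre, row "C" lacks OMDb genres.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre_kaggle": ["Drama", None, "Comedy"],
    "genres_omdb": ["Crime, Drama", "Drama", np.nan],
})

# Drop only the rows whose OMDb genres are missing; NaN elsewhere is tolerated.
kept = toy.dropna(subset=["genres_omdb"])
print(kept["name"].tolist())
```

With `subset`, a row is kept even when an unrelated column is empty, which may matter if more columns are selected later.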

In [6]:
# Transform: add one 0/1 indicator column per genre found in "genres_omdb"
df_genres = df.copy()

for index, genres_str in df['genres_omdb'].items():

    # Progress report (uses index labels, so dropped rows are skipped)
    if index % 1000 == 0:
        print(f'Counting row #{index}...')

    for genre in genres_str.split(', '):
        genre = genre.lower()
        # Create the column, initialized to 0, the first time a genre appears
        if genre not in df_genres.columns:
            df_genres[genre] = 0
        df_genres.loc[index, genre] += 1

print('---------------')
print('Mapping completed.')

df_genres.head()

Counting row #0...
Counting row #1000...
Counting row #2000...
Counting row #3000...
Counting row #4000...
Counting row #5000...
Counting row #6000...
---------------
Mapping completed.


Unnamed: 0,name,genre_kaggle,genres_omdb,crime,drama,romance,animation,comedy,family,mystery,...,history,war,music,sport,short,western,musical,documentary,film-noir,adult
0,Gold,Adventure,"Crime, Drama",1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,The Choice,Drama,"Drama, Romance",0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Middle School: The Worst Years of My Life,Animation,"Animation, Comedy, Family",0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
3,Midnight Special,Drama,"Drama, Mystery, Sci-Fi, Thriller",0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,A Monster Calls,Drama,"Adventure, Drama, Family, Fantasy",0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
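
A similar table can probably be built without an explicit loop. The sketch below uses pandas' `Series.str.get_dummies` on a two-row `toy` frame (illustrative data, not the project file). Two assumed differences from the loop: columns come out in alphabetical order rather than order of first appearance, and a genre repeated within one row still yields 1 rather than a count.

```python
import pandas as pd

# Two illustrative rows in the same shape as the notebook's DataFrame.
toy = pd.DataFrame({
    "name": ["Gold", "The Choice"],
    "genres_omdb": ["Crime, Drama", "Drama, Romance"],
})

# str.get_dummies splits on the separator and returns one 0/1 column per token.
indicators = toy["genres_omdb"].str.lower().str.get_dummies(sep=", ")

# Attach the indicator columns to the original rows (the indexes align).
encoded = toy.join(indicators)
print(encoded)
```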


In [7]:
# Double-check that no genre was recorded more than once for the same movie:
# every genre column should have a min of 0 and a max of 1.
check_dup = df_genres.describe().loc[['min', 'max']].T
print(check_dup.value_counts())
check_dup

min  max
0.0  1.0    24
dtype: int64


Unnamed: 0,min,max
crime,0.0,1.0
drama,0.0,1.0
romance,0.0,1.0
animation,0.0,1.0
comedy,0.0,1.0
family,0.0,1.0
mystery,0.0,1.0
sci-fi,0.0,1.0
thriller,0.0,1.0
adventure,0.0,1.0
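
The min/max summary can be complemented by a stricter check: every indicator is exactly 0 or 1, and each row's indicators sum to the number of genres listed for that movie. This is a minimal sketch on a made-up `toy` frame; `genre_cols`, `only_binary`, and `counts_match` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Illustrative frame with already-encoded genre columns.
toy = pd.DataFrame({
    "name": ["Gold", "The Choice"],
    "genres_omdb": ["Crime, Drama", "Drama, Romance"],
    "crime": [1, 0],
    "drama": [1, 1],
    "romance": [0, 1],
})

# Treat every column other than the identifying ones as a genre indicator.
genre_cols = [c for c in toy.columns if c not in ("name", "genres_omdb")]

# Every indicator must be exactly 0 or 1.
only_binary = bool(toy[genre_cols].isin([0, 1]).all().all())

# Each row's indicators should sum to the number of genres listed for it.
listed = toy["genres_omdb"].str.split(", ").str.len()
counts_match = bool((toy[genre_cols].sum(axis=1) == listed).all())

print(only_binary, counts_match)
```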


In [8]:
final_df = df_genres.copy()
final_df.dtypes

name            object
genre_kaggle    object
genres_omdb     object
crime            int64
drama            int64
romance          int64
animation        int64
comedy           int64
family           int64
mystery          int64
sci-fi           int64
thriller         int64
adventure        int64
fantasy          int64
action           int64
biography        int64
horror           int64
history          int64
war              int64
music            int64
sport            int64
short            int64
western          int64
musical          int64
documentary      int64
film-noir        int64
adult            int64
dtype: object

In [9]:
final_df.head(2)

Unnamed: 0,name,genre_kaggle,genres_omdb,crime,drama,romance,animation,comedy,family,mystery,...,history,war,music,sport,short,western,musical,documentary,film-noir,adult
0,Gold,Adventure,"Crime, Drama",1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,The Choice,Drama,"Drama, Romance",0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
# Export final_df to CSV
path = os.path.join('..','resources','cleaned_data', 'parsed_genres_table.csv')
final_df.to_csv(path, index=False)