## Transform "genres_omdb" Column

Most movies fall into more than one genre. Genre data collected from the API calls is stored as comma-separated strings, so further parsing is needed. The goal is to assign every movie a 0 or 1 for each unique genre that appears in the `genres_omdb` column. We believe these indicator columns will come in handy in our machine learning models.
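
To make the target format concrete, here is a minimal sketch on a hypothetical three-row frame. The `toy` data and the variable names are illustrative assumptions, not part of the project. It uses pandas' `Series.str.get_dummies`, which is one possible way to build such columns; the notebook itself builds them with an explicit loop further down.

```python
import pandas as pd

# Hypothetical toy data mirroring the format of the genres_omdb column
toy = pd.DataFrame({
    'name': ['Doctor Strange', 'Sleight', 'Manchester by the Sea'],
    'genres_omdb': [
        'Action, Adventure, Fantasy, Sci-Fi',
        'Crime, Drama, Sci-Fi',
        'Drama',
    ],
})

# Lowercase the strings, then split on ', ' and build one 0/1 column per genre
indicators = toy['genres_omdb'].str.lower().str.get_dummies(sep=', ')

# Attach the indicator columns to the original rows
encoded = toy.join(indicators)
print(encoded)
```

One design difference worth noting: `get_dummies` only records whether a genre is present, so a genre listed twice for the same movie would still yield 1. A counting loop, by contrast, would surface such a repeat as a value greater than 1.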

In [1]:
import pandas as pd
import os

In [2]:
path = os.path.join('..', '..', 'resources','cleaned_data', 'movies_complete_cleaned.csv')
raw_df = pd.read_csv(path)
raw_df.head()

Unnamed: 0,name,production,director,runtime,released,year,month,country_kaggle,country_omdb,star_kaggle,...,plot,awards,score_imdb,votes_imdb,score_metacritic,budget,genre_kaggle,gross,genres_omdb,rating
0,Doctor Strange,Marvel Studios,Scott Derrickson,115,2016-11-04,2016,11,USA,USA,Benedict Cumberbatch,...,"Marvel's ""Doctor Strange"" follows the story of...",Nominated for 1 Oscar. Another 19 wins & 67 no...,7.5,348307,72.0,165000000,Action,232641920,"Action, Adventure, Fantasy, Sci-Fi",PG-13
1,Sleight,Diablo Entertainment (II),J.D. Dillard,89,2017-04-28,2016,4,USA,USA,Jacob Latimore,...,A young street magician (Jacob Latimore) is le...,3 nominations.,5.9,4012,62.0,250000,Action,3986245,"Crime, Drama, Sci-Fi",R
2,Silence,Cappa Defina Productions,Martin Scorsese,161,2017-01-13,2016,1,USA,"USA, UK, Taiwan, Japan, Mexico, Italy",Andrew Garfield,...,The story of two Catholic missionaries (Andrew...,Nominated for 1 Oscar. Another 6 wins & 56 nom...,7.2,61798,79.0,46000000,Adventure,7100177,"Drama, History",R
3,Manchester by the Sea,Amazon Studios,Kenneth Lonergan,137,2016-12-16,2016,12,USA,USA,Casey Affleck,...,"Lee Chandler is a brooding, irritable loner wh...",Won 2 Oscars. Another 127 wins & 263 nominations.,7.9,159673,96.0,8500000,Drama,47695371,Drama,R
4,Dirty Grandpa,Lionsgate,Dan Mazer,102,2016-01-22,2016,1,USA,"United States, United Kingdom",Robert De Niro,...,"Jason Kelly, the grandson of Dick Kelly, loses...",2 wins & 11 nominations,6.0,82289,21.0,27500000,Comedy,35593113,Comedy,R


In [3]:
# Keep only the movie name and the two genre columns
df = raw_df[['name', 'genre_kaggle', 'genres_omdb']]
df.head(2)

Unnamed: 0,name,genre_kaggle,genres_omdb
0,Doctor Strange,Action,"Action, Adventure, Fantasy, Sci-Fi"
1,Sleight,Action,"Crime, Drama, Sci-Fi"


In [4]:
df.isna().sum()

name              0
genre_kaggle      0
genres_omdb     307
dtype: int64

In [5]:
# Drop rows that have no genres from the OMDb API
# (genres_omdb is the only column with missing values, see the counts above)
df = df.dropna(axis='index', subset=['genres_omdb'])

In [6]:
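# Transform: add one 0/1 indicator column per genre
df_genres = df.copy()

for index, row in df.iterrows():

    # Progress message (the index keeps its original labels after dropna)
    if index % 1000 == 0:
        print(f'Counting row #{index}...')

    genres = row['genres_omdb'].split(', ')

    for genre in genres:
        genre = genre.lower()
        # Create the indicator column the first time a genre is seen
        if genre not in df_genres.columns:
            df_genres[genre] = 0
        # Increment instead of assigning 1, so a repeated genre shows up as a value > 1
        df_genres.loc[index, genre] += 1

print('---------------')
print('Mapping completed.')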

df_genres.head()

Counting row #0...
Counting row #1000...
Counting row #2000...
Counting row #3000...
Counting row #4000...
Counting row #5000...
Counting row #6000...
---------------
Mapping completed.


Unnamed: 0,name,genre_kaggle,genres_omdb,action,adventure,fantasy,sci-fi,crime,drama,history,...,family,sport,music,mystery,short,western,musical,documentary,film-noir,adult
0,Doctor Strange,Action,"Action, Adventure, Fantasy, Sci-Fi",1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Sleight,Action,"Crime, Drama, Sci-Fi",0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
2,Silence,Adventure,"Drama, History",0,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
3,Manchester by the Sea,Drama,Drama,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,Dirty Grandpa,Comedy,Comedy,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
# Double-check that no genre was recorded more than once for the same movie:
# every indicator column should have a min of 0 and a max of 1
check_dup = df_genres.describe().loc[['min', 'max']].T
print(check_dup.value_counts())
check_dup

min  max
0.0  1.0    24
dtype: int64


Unnamed: 0,min,max
action,0.0,1.0
adventure,0.0,1.0
fantasy,0.0,1.0
sci-fi,0.0,1.0
crime,0.0,1.0
drama,0.0,1.0
history,0.0,1.0
comedy,0.0,1.0
biography,0.0,1.0
romance,0.0,1.0
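
The min/max check above works on the finished indicator columns. As a complementary sketch, the same question can be asked of the raw strings directly, by comparing each movie's genre list with its set of unique genres. The `toy` series below is hypothetical data with a deliberate repeat, and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; the second row repeats "Drama" on purpose
toy = pd.Series(
    ['Action, Adventure', 'Drama, Drama', 'Comedy'],
    name='genres_omdb',
)

# Split each string into a list of lowercase genres
genre_lists = toy.str.lower().str.split(', ')

# A row has duplicates when its list is longer than its set of unique genres
has_dup = genre_lists.apply(lambda genres: len(genres) != len(set(genres)))

# Show only the offending rows
print(toy[has_dup])
```

Applied to the real column, an empty result would agree with the min/max table, where every genre column tops out at 1.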


In [8]:
final_df = df_genres.copy()
final_df.dtypes

name            object
genre_kaggle    object
genres_omdb     object
action           int64
adventure        int64
fantasy          int64
sci-fi           int64
crime            int64
drama            int64
history          int64
comedy           int64
biography        int64
romance          int64
horror           int64
thriller         int64
war              int64
animation        int64
family           int64
sport            int64
music            int64
mystery          int64
short            int64
western          int64
musical          int64
documentary      int64
film-noir        int64
adult            int64
dtype: object

In [9]:
final_df.head(2)

Unnamed: 0,name,genre_kaggle,genres_omdb,action,adventure,fantasy,sci-fi,crime,drama,history,...,family,sport,music,mystery,short,western,musical,documentary,film-noir,adult
0,Doctor Strange,Action,"Action, Adventure, Fantasy, Sci-Fi",1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Sleight,Action,"Crime, Drama, Sci-Fi",0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
# Export final_df to CSV
path = os.path.join('..', '..', 'resources','cleaned_data', 'parsed_genres_table.csv')
final_df.to_csv(path, index=False)