This notebook preprocesses the Wikipedia movie plots dataset so that a movie's genre can be classified from its plot description.

In [None]:
import numpy as np
import pandas as pd
from os import listdir

In [None]:
listdir('/kaggle/input')

In [None]:
data = pd.read_csv('/kaggle/input/wikipedia-movie-plots/wiki_movie_plots_deduped.csv')

In [None]:
data.head()

We are interested only in the genre and the plot

In [None]:
data.drop(columns=set(data.columns)-{'Plot', 'Genre'}, inplace=True)
data.rename(str.lower, axis='columns', inplace=True)

In [None]:
data.info()

Lowercase the plot descriptions

In [None]:
data['plot'] = data['plot'].map(str.lower)
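
Note that `map(str.lower)` raises a `TypeError` if a value is missing, because a missing value is not a string. A minimal sketch of a more tolerant variant uses the `.str` accessor, which leaves missing values untouched. The toy series below is made up for illustration and is not taken from the dataset:

```python
import pandas as pd

# Hypothetical sample: one plot string and one missing value
toy_plots = pd.Series(['A Bartender Is Working', None])

# .str.lower() lowercases strings and propagates missing values
lowered = toy_plots.str.lower()
print(lowered)
```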

Remove genres that appear fewer than 100 times in the dataset

In [None]:
genres_count = data.groupby('genre').size()
genres_count = genres_count[genres_count >= 100]
data = pd.merge(data, pd.DataFrame(genres_count).drop(columns=[0]), left_on='genre', right_index=True)
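
The merge above works as an inner join against the index of the frequent genres. As a rough sketch of the same filter, a boolean mask built with `isin` may be easier to read. The toy frame and the threshold of 2 are assumptions chosen to keep the example small (the notebook uses 100):

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataframe
toy = pd.DataFrame({
    'genre': ['drama', 'drama', 'drama', 'western', 'comedy', 'comedy'],
    'plot': list('abcdef'),
})
min_count = 2  # assumption for the toy example; the notebook uses 100

# Count rows per genre and keep only the labels that are frequent enough
counts = toy['genre'].value_counts()
frequent = counts[counts >= min_count].index

# Keep the rows whose genre is one of the frequent labels
filtered = toy[toy['genre'].isin(frequent)]
print(filtered)
```

Unlike the merge, the mask keeps the original row order and index.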

In [None]:
data.info()

In [None]:
data['genre'].unique()

Now we parse the genres:
* Some movies are categorized with multiple genres, e.g. "comedy-drama". Note also that "comedy drama" and "comedy & drama" mean the same thing. We normalize the labels so that there is no ambiguity: all of these become "comedy,drama"
* "science fiction" is the same as "sci-fi", and "romance" is the same as "romantic"

In [None]:
from collections import ChainMap
defaults = dict(map(lambda genre: (genre, genre), data['genre'].unique()))
parser = dict(ChainMap({
    'romantic drama': 'romantic,drama',
    'crime drama': 'crime,drama',
    'comedy drama': 'comedy,drama',
    'romantic comedy': 'romantic,comedy',
    'musical comedy': 'musical,comedy',
    'comedy, drama': 'comedy,drama',
    'science fiction': 'sci-fi',
    'comedy-drama': 'comedy,drama',
    'romance': 'romantic'
}, defaults))
data['genre'] = data['genre'].map(parser)
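
The `ChainMap` with identity defaults makes sure that labels without a synonym map to themselves instead of becoming NaN. A minimal sketch of the same idea, assuming a shortened synonym table and a made-up toy series, uses `dict.get` with the label itself as the fallback:

```python
import pandas as pd

# Shortened synonym table (a subset of the mapping used in the notebook)
synonyms = {
    'comedy-drama': 'comedy,drama',
    'science fiction': 'sci-fi',
    'romance': 'romantic',
}

# Hypothetical toy labels; 'western' has no synonym and must stay as it is
toy_genres = pd.Series(['comedy-drama', 'western', 'science fiction', 'romance'])

# Fall back to the original label when it is not in the table
normalized = toy_genres.map(lambda genre: synonyms.get(genre, genre))
print(normalized.tolist())
```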

In [None]:
data['genre'].unique()

Now we convert the genre into one indicator column per category, in the same dataframe

In [None]:
from itertools import chain
# Sorted list (rather than a frozenset) so that the column order is the same on every run
categories = sorted(set(chain.from_iterable(data['genre'].map(lambda genre: genre.split(',')))))

# One row per distinct genre label, one 0/1 column per category
genres = pd.DataFrame(dict(map(lambda category: (category, pd.Series([], dtype=np.uint8)), categories)),
                      columns=categories)
for index in data['genre'].unique():
    genres.loc[index] = np.array(list(map(set(index.split(',')).__contains__, categories)), dtype=np.uint8)
# Attach the indicator columns to every movie and drop the original label
data = pd.merge(data, genres, left_on='genre', right_index=True).drop(columns=['genre'])
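
For comparison, pandas offers `Series.str.get_dummies`, which splits each string on a separator and returns one 0/1 column per distinct token. The sketch below is an alternative to the lookup table built above, not the method this notebook uses, and the toy frame is an assumption made for illustration:

```python
import pandas as pd

# Hypothetical toy data with comma-separated genre labels
toy = pd.DataFrame({
    'plot': ['p1', 'p2', 'p3'],
    'genre': ['comedy,drama', 'sci-fi', 'drama'],
})

# One indicator column per category, split on the comma
indicators = toy['genre'].str.get_dummies(sep=',').astype('uint8')

# Replace the label column with the indicator columns
encoded = pd.concat([toy.drop(columns=['genre']), indicators], axis=1)
print(encoded)
```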

In [None]:
data.info()

Save the result for other kernels

In [None]:
data.to_csv('preprocessed-data.csv', index=False)

In [None]:
listdir('/kaggle/working')