<center><h2>Predict Movie Genre for TMDb</h2></center>

_____


<center><img src="../images/image.png" width="80%"/></center>

In [15]:
import itertools
import json
import matplotlib.pyplot as plt
import nltk
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns
import string

from collections import defaultdict
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.models import Word2Vec
from glob import glob
from ipywidgets import interact, IntSlider
from IPython.display import display
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.preprocessing import MultiLabelBinarizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from tqdm import tqdm

In [16]:
def freeze_header(df, num_rows=30, num_columns=10, step_rows=1,
                  step_columns=1):
    """
    Freeze the headers (column and index names) of a Pandas DataFrame. A widget
    lets you slide through the rows and columns.
    """
    @interact(last_row=IntSlider(min=min(num_rows, df.shape[0]),
                                 max=df.shape[0],
                                 step=step_rows,
                                 description='rows',
                                 readout=False,
                                 disabled=False,
                                 continuous_update=True,
                                 orientation='horizontal',
                                 slider_color='purple'),
              last_column=IntSlider(min=min(num_columns, df.shape[1]),
                                    max=df.shape[1],
                                    step=step_columns,
                                    description='columns',
                                    readout=False,
                                    disabled=False,
                                    continuous_update=True,
                                    orientation='horizontal',
                                    slider_color='purple'))
    def _freeze_header(last_row, last_column):
        display(df.iloc[max(0, last_row-num_rows):last_row,
                        max(0, last_column-num_columns):last_column])

# Outline

______

+ Goal
+ Introduce the dataset
+ Feature selection and engineering
+ Visualization Demo
+ Models and Results

# Goal

______
+ Build a multi-label classification model
+ Predict genres for TMDB movies
+ Automatically tag or check movie genres
+ Possibly fix existing labeling problems

In [13]:
freeze_header(df=df_movies, num_rows=5)


In [10]:
genres_list

[('Drama', 1801),
 ('Comedy', 1442),
 ('Thriller', 1151),
 ('Action', 1025),
 ('Romance', 726),
 ('Adventure', 720),
 ('Crime', 627),
 ('Science Fiction', 488),
 ('Horror', 470),
 ('Family', 437),
 ('Fantasy', 388),
 ('Mystery', 316),
 ('Animation', 201),
 ('History', 153),
 ('Music', 149),
 ('War', 115),
 ('Western', 69),
 ('Documentary', 56),
 ('Foreign', 14),
 ('TV Movie', 3)]

# Dataset - features and instances

______

**Dataset**: TMDB (The Movie Database) and Kaggle

**General info**:
1. cast
2. budget
3. overview
4. production company

**Instances**: 4799

**Additional data**:
1. 3440 of the movies have subtitles
2. Subtitles were retrieved through an API
        

## Dataset 

-----

+ Ground truth is messy
    + User-contributed database: anyone can edit it
    + Some movies get little attention
    + That is what makes our work meaningful
_____



<center><img src="../images/edit_genres.jpg" width="60%"/></center>

In [17]:
"""
Load data
"""
df = pd.read_csv('dataset/tmdb_5000_movies_nonull.csv')
df_credits = pd.read_csv('dataset/tmdb_5000_credits.csv')
credits_sub = df_credits.loc[:, ['movie_id', 'cast']].rename(columns={
                                                             'movie_id': 'id'})
df = df[['id', 'budget', 'genres', 'keywords', 'original_language',
         'overview', 'popularity', 'production_companies',
         'production_countries', 'release_date', 'revenue', 'runtime',
         'spoken_languages', 'tagline', 'title', 'vote_average', 'vote_count']]
df = df.merge(credits_sub)

In [18]:
"""
Add subtitles
"""
with open('dataset/subtitles/subtitles.json', 'r') as f:
    sub_dict = json.load(f)
sub_dict = defaultdict(str, sub_dict)
df["subtitles"] = df.title.apply(
    lambda title: "\n\n".join(sub_dict[title]))  # single string

In [19]:
# drop movies with no genre info
df = df[df['genres'] != '[]']
df = df.dropna()
all_genres = defaultdict(int)
for row in df.genres:
    for item in json.loads(row):
        all_genres[item['name']] += 1
genres_list = sorted(all_genres.items(), key=lambda x: x[1], reverse=True)

In [20]:
def convert_list(cell):
    """convert the json format to a list of categories"""
    kw_list = []
    for kw in json.loads(cell):
        kw_list.append(kw['name'])
    return kw_list


def larger_n(col, n):
    """Return the names in column `col` that occur at least n times.
    Note: reads the global `df`, not a function argument.
    """
    keywords = defaultdict(int)
    for row in df[col]:
        row = json.loads(row)
        for entry in row:
            keywords[entry['name']] += 1
    kw_cnt = sorted(keywords.items(), key=lambda x: -x[1])
    return [kw[0] for kw in kw_cnt if kw[1] >= n]


def extract_gender(cell):
    """Extract cast gender"""
    female = 0
    male = 0
    for item in json.loads(cell):
        if item['gender'] == 1:
            female += 1
        elif item['gender'] == 2:
            male += 1
        else:
            continue
    return female, male


def concat_names(cell):
    """Concatenate first names and last names"""
    names = []
    for name in cell:
        names.append(name.replace(' ', ''))
    return names


def list2str(cell):
    """Convert list to string"""
    return ' '.join(cell)


def transform_cols(df, cols_to_transform):
    """Transform columns of a dataframe.
    cols_to_transform should be a dict(col_name: filter value n)
    """
    for col_name in cols_to_transform.keys():
        larger_col = larger_n(col_name, cols_to_transform[col_name])
        if col_name == 'cast':
            gen = df[col_name].apply(extract_gender)
            df['female_pct'] = gen.apply(lambda x: x[0]/(x[0]+x[1]+0.001))
            df['male_pct'] = gen.apply(lambda x: x[1]/(x[0]+x[1]+0.001))

        df[col_name] = df[col_name].apply(convert_list)\
            .apply(lambda cell: [kw for kw in cell if kw in larger_col])
    return df


def tokenize(text):
    """
    Tokenizer: lowercase the text, replace every non-letter character with a
    space, drop English stop words, and stem the remaining words.
    """
    text = text.lower()
    cleanString = re.sub('[^a-zA-Z]', ' ', text)
    words = nltk.word_tokenize(cleanString)
    english = list(ENGLISH_STOP_WORDS)
    goodwords = [w for w in words if w not in english]
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(word) for word in goodwords]
    return stemmed

In [21]:
cols_to_transform = {'keywords': 30,
                     'genres': 0,
                     'production_companies': 5,
                     'production_countries': 3,
                     'spoken_languages': 10,
                     'cast': 2}

df_movies = transform_cols(df, cols_to_transform)

In [22]:
# use only movies with subtitles
df_movies = df_movies[df_movies["subtitles"] != '']
df_movies = df_movies.reset_index(drop=True)

In [23]:
# Ground Truth is messy
df_movies[df_movies.title.str.lower()
          .str.contains("kung fu panda")][["title", "release_date", "genres"]]

|     | title           | release_date | genres                                         |
|-----|-----------------|--------------|------------------------------------------------|
| 120 | Kung Fu Panda 2 | 2011-05-25   | [Animation, Family]                            |
| 134 | Kung Fu Panda 3 | 2016-01-23   | [Action, Adventure, Animation, Comedy, Family] |
| 159 | Kung Fu Panda   | 2008-06-04   | [Adventure, Animation, Family, Comedy]         |


## Dataset 

______
+ Missing data
    + Missing labels
    + Abnormal data

In [24]:
df_movies = df_movies.drop(df_movies.loc[df_movies.runtime == 0].index)

In [25]:
df_movies = df_movies.reset_index(drop=True)

#### Summarized data  - wordclouds


<center><img src="../images/wccon.jpeg" width="120%"/></center>


# Data processing and Feature engineering -- Text

______




NLP for overview and sub-titles

+ Tokenization / Stemming
+ TfidfVectorizer
+ Doc2vec (Not pretrained vs Pretrained) PAINFUL
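
To make the TF-IDF step concrete, here is a minimal plain-Python sketch on three toy overviews. It is an illustration only, not the notebook's pipeline: it skips stemming and stop-word removal, uses binary term presence (mirroring `binary=True`), and assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation. The helper names `simple_tokenize` and `binary_tfidf` are hypothetical.

```python
import math
import re


def simple_tokenize(text):
    """Lowercase and split on non-letters (no stemming, no stop words)."""
    return [w for w in re.split(r'[^a-z]+', text.lower()) if w]


def binary_tfidf(docs):
    """Return (vocab, rows); each row is an L2-normalised TF-IDF vector."""
    tokenized = [set(simple_tokenize(d)) for d in docs]
    vocab = sorted(set().union(*tokenized))
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}
    # Smoothed inverse document frequency (assumed formula, see lead-in).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        raw = [idf[t] if t in doc else 0.0 for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


toy_overviews = [
    "A panda learns kung fu",
    "A detective hunts a killer",
    "A panda hunts bamboo",
]
vocab, rows = binary_tfidf(toy_overviews)
```

A word that appears in every overview (here "a") gets a lower weight than a word that appears in only one (here "kung"), which is the property the classifier relies on.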





## Data processing and Feature engineering -- Text

______




Tagline - do not use

+ Each movie's tagline is too short, and taglines share almost no words across movies

Cast names -- CountVectorizer

+ Concatenate first and last name, so that each actor becomes a single token
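
A small sketch of why the names are concatenated before counting, using made-up cast names. If names are split on whitespace, two different actors who share a first name contribute to the same token; joining first and last name keeps them apart. `concat_name` is a hypothetical stand-in for the notebook's `concat_names`, and `Counter` stands in for the count vectoriser.

```python
from collections import Counter


def concat_name(full_name):
    """Remove spaces so a full name becomes one token."""
    return full_name.replace(' ', '')


# Made-up casts for two movies.
casts = [
    ["Ann Lee", "Bob Stone"],
    ["Ann Smith", "Bob Rivers"],
]

# Counting whitespace-separated tokens merges unrelated actors.
split_counts = Counter(
    tok for cast in casts for name in cast for tok in name.split())

# Counting concatenated names keeps one token per actor.
joined_counts = Counter(
    concat_name(name) for cast in casts for name in cast)
```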



## Data processing and Feature engineering -- Numeric

______


Budget & Revenue - predict
+ Deal with abnormal values by predicting them from other information
+ For the 107 movies with a budget below \\$100 and revenue above \\$10000, we use `revenue + overview + popularity + release_date` to predict the actual budget.
+ For the other low-budget movies, whose revenue is \\$10000 or less, we use `overview + popularity + release_date` to predict the budget.
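
The two imputation cases above can be sketched as a simple routing rule. This is a plain-Python illustration with made-up rows; the thresholds come from the text, while the function name `budget_group` and the group labels are hypothetical.

```python
BUDGET_FLOOR = 100        # budgets below this are treated as unreliable
REVENUE_FLOOR = 10_000    # revenue above this is treated as informative


def budget_group(budget, revenue):
    """Decide how a movie's budget is handled."""
    if budget >= BUDGET_FLOOR:
        return "trusted"                  # keep the recorded budget
    if revenue > REVENUE_FLOOR:
        return "impute_with_revenue"      # first case: revenue is a feature
    return "impute_without_revenue"       # second case: revenue is dropped


# Made-up examples, one per group.
toy_movies = [
    {"title": "A", "budget": 50_000_000, "revenue": 120_000_000},
    {"title": "B", "budget": 0, "revenue": 3_000_000},
    {"title": "C", "budget": 12, "revenue": 0},
]
groups = {m["title"]: budget_group(m["budget"], m["revenue"])
          for m in toy_movies}
```

Movies in the "trusted" group serve as training data for the regressors that fill in the other two groups.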

## Column transformation

**Steps:**  
+ **Filter** each column, keeping only values that occur at least a column-specific number of times
+ **Convert** json-format cells to list
+ **Create feature**: extract cast gender and calculate the proportion of female and male cast members for each movie
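
The steps above can be sketched on toy JSON cells as follows. This is a simplified illustration, not the notebook's `transform_cols`: the helper names are hypothetical, the gender codes (1 = female, 2 = male) follow the notebook's `extract_gender`, and an explicit zero check replaces the notebook's `+0.001` in the denominator.

```python
import json
from collections import Counter


def names_from_json(cell):
    """Convert a JSON-encoded list of {'name': ...} dicts to a list of names."""
    return [entry["name"] for entry in json.loads(cell)]


def frequent_names(cells, min_count):
    """Names that occur at least min_count times across all cells."""
    counts = Counter(n for cell in cells for n in names_from_json(cell))
    return {name for name, c in counts.items() if c >= min_count}


def gender_share(cast_cell):
    """Return (female_share, male_share) among cast with a known gender."""
    genders = [entry.get("gender") for entry in json.loads(cast_cell)]
    female, male = genders.count(1), genders.count(2)
    known = female + male
    if known == 0:
        return 0.0, 0.0
    return female / known, male / known


keyword_cells = [
    '[{"id": 1, "name": "murder"}, {"id": 2, "name": "revenge"}]',
    '[{"id": 1, "name": "murder"}]',
]
keep = frequent_names(keyword_cells, min_count=2)
filtered = [[n for n in names_from_json(c) if n in keep]
            for c in keyword_cells]

cast_cell = ('[{"name": "X", "gender": 1}, {"name": "Y", "gender": 2}, '
             '{"name": "Z", "gender": 2}, {"name": "W", "gender": 0}]')
female_share, male_share = gender_share(cast_cell)
```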

**Other features:**
+ Production company
+ Popularity, Vote count -- MinMax
+ Vote
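
For reference, min-max scaling maps each value to `(x - min) / (max - min)`, so the column ends up in the range 0 to 1. A minimal plain-Python sketch with made-up vote counts (the function name `min_max_scale` is hypothetical):

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


vote_counts = [10, 250, 1000, 4000]   # made-up values
scaled = min_max_scale(vote_counts)
```

Scaling matters here because raw vote counts and popularity values span several orders of magnitude, while most other features are 0/1 indicators.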


In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer

df_movies = df_movies.loc[df_movies.overview.apply(lambda x: len(x) > 10)]

tfidf = TfidfVectorizer(stop_words='english', binary=True,
                        ngram_range=(1, 2), min_df=0.001)
overview_vec = tfidf.fit_transform(df_movies['overview'])

overview = pd.DataFrame(overview_vec.toarray(),
                        columns=tfidf.get_feature_names()).add_prefix('ov_')

In [28]:
df_movies = df_movies.loc[df_movies.overview.apply(lambda x: len(x) > 10)]

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english', binary=True,
                        ngram_range=(1, 2), min_df=0.001)
overview_vec = tfidf.fit_transform(df_movies['overview'])

overview_docvecs = pd.DataFrame(overview_vec.toarray(
), columns=tfidf.get_feature_names()).add_prefix('ov_')

In [29]:
def text_to_words(text):
    """
    remove punctuation and whitespace
    but keep hyphens and apostrophes
     """
    filtered_text = re.sub(r'[^\w\'\s-]',
                           '', text)
    return word_tokenize(filtered_text.lower())

In [30]:
idxs = df_movies.id.tolist()
%time sub_words = [text_to_words(text) for text in df_movies.subtitles.tolist()]
subs = dict(zip(idxs, sub_words))

CPU times: user 1min 7s, sys: 493 ms, total: 1min 7s
Wall time: 1min 7s


In [31]:
%time tagged_data = [TaggedDocument(words=word_list, tags=[index]) for index, word_list in subs.items()]

CPU times: user 274 ms, sys: 1.09 ms, total: 275 ms
Wall time: 274 ms


In [32]:
model = Doc2Vec(vector_size=50, min_count=2, workers=4)
%time model.build_vocab(tagged_data)

CPU times: user 8.95 s, sys: 37.9 ms, total: 8.98 s
Wall time: 8.99 s


In [33]:
%time model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)

CPU times: user 2min 30s, sys: 717 ms, total: 2min 30s
Wall time: 42.4 s


In [34]:
word2vec = Word2Vec(size=50)
%time word2vec.build_vocab([doc.words for doc in tagged_data])

CPU times: user 4.96 s, sys: 19.8 ms, total: 4.98 s
Wall time: 4.98 s


In [35]:
glove_path = "dataset/glove.6B.50d.word2vec.txt"
%time word2vec.intersect_word2vec_format(glove_path, lockf=1.0, binary=False, encoding='utf8', unicode_errors='strict')

CPU times: user 11.9 s, sys: 49.1 ms, total: 12 s
Wall time: 12 s


In [36]:
model_pretrained = Doc2Vec(vector_size=50, min_count=2, workers=4)
model_pretrained.build_vocab(tagged_data)
model_pretrained.wv = word2vec.wv
%time model_pretrained.train(tagged_data, total_examples=model_pretrained.corpus_count, epochs=model_pretrained.epochs)

CPU times: user 2min 43s, sys: 959 ms, total: 2min 44s
Wall time: 46 s


In [37]:
def infer_docvecs(df):
    """Infer a subtitle vector for every movie in df with model_pretrained"""
    docvecs = []
    for index in tqdm(df.id.tolist()):
        word_list = subs[index]
        vec = model_pretrained.infer_vector(word_list, steps=20)
        docvecs.append(vec)
    docvecs = np.array(docvecs, dtype=np.float32)
    return docvecs

In [38]:
movies_docvecs = infer_docvecs(df_movies)
movies_docvecs.shape

100%|██████████| 3036/3036 [08:10<00:00,  6.19it/s]


(3036, 50)

In [39]:
from sklearn.feature_extraction.text import CountVectorizer

# concatenate names and join the list of names to a string
df_movies['cast'] = df_movies['cast'].apply(concat_names).apply(list2str)

vectorizer = CountVectorizer(ngram_range=(1, 1))
cast_vect = vectorizer.fit_transform(df_movies['cast'])

# cast = pd.DataFrame(cast_vect.todense(), columns=vectorizer.get_feature_names()).add_prefix('cast_')

In [40]:
from sklearn import ensemble


def predict_budget1(dataset):
    """Predict budget of movies with less than $100 budget and more than $10000 revenue"""
    data_p = dataset[['revenue', 'popularity', 'budget']]
    data_p = pd.concat([data_p, overview_docvecs], axis=1)

    # rows with a usable budget (train) vs. rows whose budget is imputed (test)
    trusted = ~(data_p.budget < 100) & (data_p.revenue > 10000)
    missing = (data_p.budget < 100) & (data_p.revenue > 10000)

    x_train = data_p.loc[trusted].drop('budget', axis=1)
    y_train = data_p.loc[trusted, 'budget']
    x_test = data_p.loc[missing].drop('budget', axis=1)

    rfr = ensemble.RandomForestRegressor(random_state=42)
    rfr.fit(x_train, y_train)

    y_test = rfr.predict(x_test)
    return y_test


df_movies.loc[(df_movies.budget < 100) &
              (df_movies.revenue > 10000), 'budget'] = predict_budget1(df_movies)



In [30]:
def predict_budget2(dataset):
    """Predict budget of movies with less than $100 budget and at most $10000 revenue"""
    data_p = dataset[['popularity', 'budget']]
    data_p = pd.concat([data_p, overview_docvecs], axis=1)

    x_train = data_p.loc[~(data_p.budget < 100)].drop('budget', axis=1)
    y_train = data_p.loc[~(data_p.budget < 100)]['budget']
    # build the mask from the function argument, not the global df_movies
    x_test = data_p.loc[(data_p.budget < 100) & (
        dataset.revenue <= 10000)].drop('budget', axis=1)

    rfr = ensemble.RandomForestRegressor(random_state=42)
    rfr.fit(x_train, y_train)
    y_test = rfr.predict(x_test)
    return y_test


df_movies.loc[(df_movies.budget < 100) & (df_movies.revenue <=
                                          10000), 'budget'] = predict_budget2(df_movies)

In [41]:
kw_all = defaultdict(int)

for cell in df_movies.keywords:
    for kw in cell:
        kw_all[kw] += 1

sorted(kw_all.items(), key=lambda x: x[1], reverse=True)

[('duringcreditsstinger', 252),
 ('woman director', 182),
 ('independent film', 155),
 ('based on novel', 146),
 ('aftercreditsstinger', 138),
 ('murder', 136),
 ('violence', 111),
 ('dystopia', 109),
 ('revenge', 89),
 ('sport', 86),
 ('3d', 81),
 ('sequel', 79),
 ('friendship', 78),
 ('teenager', 74),
 ('sex', 73),
 ('musical', 70),
 ('suspense', 67),
 ('los angeles', 66),
 ('love', 63),
 ('new york', 62),
 ('high school', 62),
 ('alien', 60),
 ('superhero', 59),
 ('biography', 58),
 ('family', 54),
 ('police', 53),
 ('remake', 50),
 ('prison', 48),
 ('nudity', 48),
 ('drug', 48),
 ('based on comic book', 47),
 ('dying and death', 47),
 ('corruption', 46),
 ('serial killer', 46),
 ('airplane', 43),
 ('wedding', 43),
 ('magic', 42),
 ('father son relationship', 42),
 ('fbi', 42),
 ('friends', 41),
 ('london england', 41),
 ('daughter', 40),
 ('time travel', 40),
 ('party', 40),
 ('lawyer', 39),
 ('based on young adult novel', 38),
 ('cia', 38),
 ('brother brother relationship', 37),
 

In [42]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

keywords_dummies = pd.DataFrame(mlb.fit_transform(
    df_movies['keywords']), columns=mlb.classes_).add_prefix('kw_')

In [43]:
prod_com = pd.DataFrame(mlb.fit_transform(
    df_movies['production_companies']), columns=mlb.classes_).add_prefix('prodComp_')

In [44]:
prod_coun = pd.DataFrame(mlb.fit_transform(
    df_movies['production_countries']), columns=mlb.classes_).add_prefix('prodCoun_')

In [45]:
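# the first three characters of 'YYYY-MM-DD' identify the decade (e.g. '201')
years = pd.get_dummies(df_movies.apply(
    lambda row: str(row.release_date)[:3], axis=1))
# merge the six earliest columns into a single 'before97' bucket
years['before97'] = years.iloc[:, :6].sum(axis=1)
years = years.iloc[:, 6:]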

In [46]:
df_movies.loc[df_movies['vote_average'] == 0,
              'vote_average'] = df_movies.loc[df_movies['vote_average'] != 0].vote_average.mean()
df_movies.loc[df_movies['vote_count'] == 0,
              'vote_count'] = df_movies.loc[df_movies['vote_count'] != 0].vote_count.mean()

In [47]:
df_movies = df_movies.reset_index(drop=True)

## Data processing/Feature engineering 

**In the end, we got**:


**Features** :  
+ **numeric**: budget, revenue, female_pct, male_pct, runtime 
+ **text(vectorized)**: keywords, overview, cast(names)

**Labels**:  
+ 20 genres, i.e., `genre_*`
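
As an illustration of how the genre lists become a multi-label target, the sketch below binarises the three Kung Fu Panda genre lists shown earlier into one 0/1 column per genre. It is a plain-Python stand-in for what `MultiLabelBinarizer` does in the notebook; the function name `binarize_labels` is hypothetical.

```python
def binarize_labels(label_lists):
    """Return (classes, matrix) with one 0/1 column per distinct label."""
    classes = sorted({g for labels in label_lists for g in labels})
    matrix = [[1 if c in labels else 0 for c in classes]
              for labels in label_lists]
    return classes, matrix


movie_genres = [
    ["Animation", "Family"],                                    # Kung Fu Panda 2
    ["Action", "Adventure", "Animation", "Comedy", "Family"],   # Kung Fu Panda 3
    ["Adventure", "Animation", "Family", "Comedy"],             # Kung Fu Panda
]
classes, Y = binarize_labels(movie_genres)
```

Each row of `Y` can have several ones at once, which is what distinguishes this multi-label setup from ordinary multi-class classification.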

In [48]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
genres_dummies = pd.DataFrame(mlb.fit_transform(
    df_movies['genres']), columns=mlb.classes_).add_prefix('genre_')

In [49]:
movies_docvecs = pd.DataFrame(movies_docvecs)
features = pd.concat((df_movies.loc[:, ['budget', 'revenue', 'female_pct', 'male_pct', 'runtime',
                                        'vote_average', 'vote_count']],
                      keywords_dummies, prod_com, prod_coun, years, movies_docvecs, pd.DataFrame(overview_docvecs)), axis=1)

In [50]:
# We did MinMax Normalization after this
freeze_header(df=features, num_rows=5)


# Decision Tree Visualization Demo

## How to deal with subtitle features



**Tfidf**:  
+ Compute TF-IDF scores over each movie's subtitle text; this yields sparse features

**Doc2vec (50d)**:  
+ The document-level version of word2vec; this yields one dense 50-dimensional vector per movie

# Workflow

Train-test split: We split the data in two ways: a random 80/20 split, and a split by `release_date`.
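
The notebook does not show the splitting code, so the following is only a sketch of the two strategies under stated assumptions: the random split shuffles with a fixed seed, and the date-based split trains on the oldest 80% of movies and tests on the newest 20%. The function names and the toy `movies` list are hypothetical.

```python
import random


def random_split(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of items and hold out test_fraction of them."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def temporal_split(items, test_fraction=0.2, date_key="release_date"):
    """Train on the oldest items, test on the newest (assumed direction)."""
    # ISO dates ('YYYY-MM-DD') sort correctly as plain strings.
    ordered = sorted(items, key=lambda m: m[date_key])
    n_test = int(round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


movies = [{"id": i, "release_date": f"{2000 + i}-01-01"} for i in range(10)]
train_r, test_r = random_split(movies)
train_t, test_t = temporal_split(movies)
```

The date-based split is the stricter test: it checks whether a model trained on older movies still tags newer releases correctly.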

Pipeline: Data Processing + NLP on overview and subtitles + Standardization

Grid Search: 
+ Tfidf works better with MultinomialNB; this combination does better on Recall and F1 score
+ Doc2Vec works better with RandomForest; this combination does better on Precision and Hamming Loss
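
For readers unfamiliar with the metrics named above, here is a plain-Python sketch of how they are computed on 0/1 label matrices. The averaging scheme is an assumption: the notebook does not say whether precision, recall and F1 were micro- or macro-averaged, so micro-averaging is used here. The function names and the toy matrices are hypothetical.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    return wrong / total


def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 over all label slots."""
    tp = fp = fn = 0
    for row_t, row_p in zip(y_true, y_pred):
        for t, p in zip(row_t, row_p):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Two movies, three genres each (made-up labels).
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
loss = hamming_loss(y_true, y_pred)
precision, recall, f1 = micro_prf(y_true, y_pred)
```

Hamming loss is the only one of the four where lower is better, so "better on Hamming Loss" means a smaller value.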

<center><h2>MultinomialNB Model Performance</h2></center>

_____


<center><img src="../images/1.png" width="100%"/></center>

<center><h2>Random Forest Model Performance</h2></center>

_____


<center><img src="../images/2.png" width="100%"/></center>

<center><h2>Model Comparison</h2></center>

_____


<center><img src="../images/Comparison.png" width="45%"/></center>

# Summary
______

+ Goal: auto tagging/checking the genre
+ Build a multi-label classification model
+ Tfidf works better with MultinomialNB; this combination does better on Recall and F1 score
+ Doc2Vec works better with RandomForest; this combination does better on Precision and Hamming Loss
    

# Q&A