## Using Word2Vec in a non-NLP context

In this notebook, we encode movies' cast and crew into vectors and use them to predict the probability of any given movie being among the 1% with the most IMDB votes.

When a categorical variable can take a huge number of possible values, as with the actors/actresses involved in movies, one-hot encoding every person as a dummy variable becomes impractical and tends to overfit badly. There are ways around this though. One option is to compute features for each actor/actress (such as the average number of votes of his/her previous movies) and include those numeric variables in a model. In this case we do something different, and try to learn a representation for each actor/actress (a 100-dimensional vector) using **Word2Vec**.

Even though Word2Vec was originally designed to produce word embeddings, you can use it in any case where context matters. Here, the actors' and actresses' names (or IMDB ids) play the role of the words that would usually be the input of a **Word2Vec** model, and the other actors and actresses who appeared in the same movies act as *context words*.
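To make the "people as words, co-workers as context" idea concrete, here is a minimal sketch of which (target, context) pairs a single movie would produce. It is a simplified illustration only: the function name `context_pairs` and the ids are made up, and a real Word2Vec implementation adds further details on top of this pairing.

```python
def context_pairs(movie_people, window=10):
    """Return (target, context) pairs for one movie's list of people."""
    pairs = []
    for i, target in enumerate(movie_people):
        # Everyone within `window` positions of the target counts as context
        lo = max(0, i - window)
        hi = min(len(movie_people), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, movie_people[j]))
    return pairs

# Hypothetical movie with three credited people (ids are made up)
movie = ['nm_a', 'nm_b', 'nm_c']
pairs = context_pairs(movie, window=10)
print(pairs)
```

Since a movie's list of people is short, a window of 10 means that practically everybody in the movie is context for everybody else.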

In [None]:
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
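from sklearn.metrics import roc_auc_score, f1_score, fbeta_score, precision_score, recall_score
# precision_recall_curve and auc are needed by test_pipe() further below
from sklearn.metrics import precision_recall_curve, auc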
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.model_selection import ParameterGrid
from sklearn.base import BaseEstimator, TransformerMixin
from tqdm.notebook import tqdm
import csv
from gensim.models.callbacks import CallbackAny2Vec
from itertools import groupby
from gensim.models import Word2Vec
import gensim
from pathlib import Path
import matplotlib.pyplot as plt
import os
import random
pd.options.display.max_columns = 999

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
PATH = Path('/kaggle/input/imdb-dataset/')

# Load Data

In [None]:
def load_data(start_year, min_minutes, min_votes):
    title_basics = pd.read_csv(PATH / 'title.basics.tsv' / 'title.basics.tsv', sep='\t')
    title_ratings = pd.read_csv(PATH / 'title.ratings.tsv'/ 'title.ratings.tsv', sep='\t')
    title_basics.genres = title_basics.genres.apply(
                        lambda x: x.split(',') if ((type(x)!=float) & (x!=r'\N')) else ['no_genre'])

    title_basics.runtimeMinutes = (
     title_basics.runtimeMinutes.apply(lambda x: np.nan if not x.isdigit() else x).astype(float)
                                )
    
    title_basics = title_basics[
        title_basics.titleType.isin(['movie'])
        & ~title_basics.runtimeMinutes.isna()
        & (title_basics.runtimeMinutes <= 3.5 * 60)
        & title_basics.genres.apply(lambda x: 'Short' not in x)
        ]
    
    movies = pd.merge(title_basics, title_ratings, on='tconst', how='left')
    movies['startYear'] = movies['startYear'].apply(lambda x: np.nan if x == r'\N' else int(x))
    
    # MY CONDITIONS:
    movies = movies[movies.startYear > start_year].dropna(subset=['averageRating'])
    movies = movies[movies.runtimeMinutes >= min_minutes]
    movies = movies[movies.numVotes>=min_votes]
    return movies

I impose a couple of conditions to make the dataset more manageable:

1- Only load movies from after 1960, as I suspect films older than that have a different voting pattern. The choice of 1960 as the cutoff is kind of subjective though.

2- Only load movies that are at least 60 minutes long. It appears that the conventional minimum duration for a film to be considered a movie is either 40 or 80 minutes depending on the source. So I take 60 minutes, which is in the middle of both and makes sense to me.

3- Only load movies with at least 15 votes. With this we get rid of entries that are not even worth looking into (fewer than 15 votes means that not even the people involved in the movie appear to have voted for it).

In [None]:
movies = load_data(start_year = 1960, min_minutes = 60, min_votes = 15)

In [None]:
principals = pd.read_csv(PATH /'title.principals.tsv'/'title.principals.tsv',delimiter="\t")

We divide our "principals" dataset into "cast" (actors and actresses) and the rest.
The idea is to add to our Movies dataframe two columns: one with a list containing all actors/actresses in the movie, and another including all other crew members who were involved in the movie.
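As a rough illustration of the split described above, here is a minimal sketch in plain Python. The rows and ids are made up, and `split_cast_crew` is a hypothetical helper, not something from the notebook; it only shows the shape of the result we are after (one list of people per movie, for cast and for crew).

```python
from collections import defaultdict

# Hypothetical rows in the shape (tconst, nconst, category); values are made up
rows = [
    ('tt001', 'nm01', 'actor'),
    ('tt001', 'nm02', 'actress'),
    ('tt001', 'nm03', 'director'),
    ('tt002', 'nm02', 'actress'),
    ('tt002', 'nm04', 'writer'),
]

def split_cast_crew(rows):
    """Group person ids per movie, separating actors/actresses from everyone else."""
    cast, crew = defaultdict(list), defaultdict(list)
    for tconst, nconst, category in rows:
        target = cast if category in ('actor', 'actress') else crew
        target[tconst].append(nconst)
    return dict(cast), dict(crew)

cast_by_movie, crew_by_movie = split_cast_crew(rows)
print(cast_by_movie)
print(crew_by_movie)
```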

In [None]:
cast = principals[principals['category'].isin(['actor', 'actress'])]
crew = principals[~principals['category'].isin(['actor', 'actress'])]

In [None]:
# One row per movie, with the list of its actors/actresses in the 'cast' column
ordered_cast = cast.groupby(by='tconst')['nconst'].apply(list).reset_index(name='cast')
ordered_cast.head(2)

In [None]:
# If it went right, we should have no duplicated movies (tconst)
ordered_cast.duplicated('tconst').any()

In [None]:
# One row per movie, with the list of its non-acting crew in the 'crew' column
ordered_crew = crew.groupby(by='tconst')['nconst'].apply(list).reset_index(name='crew')
ordered_crew.head(2)

In [None]:
# If it went right, every movie (tconst) appears exactly once, so the maximum count should be 1
ordered_crew['tconst'].value_counts().max()

Now we add this information to our movies data.

In [None]:
movies = pd.merge(movies, ordered_cast, on='tconst', how='left').merge(ordered_crew, on='tconst', how='left')

# If it went right, we should have no duplicated movies (tconst)
movies.duplicated('tconst').any()

In [None]:
# CHECK: Back to The Future should have Fox and C. Lloyd within Cast, and Robert Zemeckis within Crew.
movies[movies['tconst']=='tt0088763']

It worked.

In [None]:
movies['cast'].isna().sum()

In [None]:
movies['crew'].isna().sum()

We do however have some movies with no cast or crew. Let's check a couple.

In [None]:
movies[movies['cast'].isna()].head(2)

Manually checking these movies, I see that this is indeed the case. These movies have some people listed as "self", but nobody listed as actor/actress. Therefore it makes sense. Let's just fill these movies with 'Unknown' cast/crew.

In [None]:
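# Use a one-element list so that later code iterates over people, not over the characters of a string
movies['cast'] = movies['cast'].apply(lambda x: x if isinstance(x, list) else ['Unknown'])
movies['crew'] = movies['crew'].apply(lambda x: x if isinstance(x, list) else ['Unknown'])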

# Word2Vec

The idea is to estimate one vector for each person who is involved in a movie. We use the other cast/crew members of a movie as context for the algorithm to learn a vector for a certain person. We need to feed Word2Vec a list of sublists, where each sublist contains the ids of the people involved in one movie. For example, one sublist looks something like:

[nm0000138, nm0000701, nm0000708, nm0000870, nm0365239, nm0000116, nm0484457, nm0000035]

In [None]:
# Series of lists: all people (cast and crew) involved in each movie
to_wtv = principals.groupby(by='tconst')['nconst'].apply(list).reset_index(drop=True)
to_wtv.head(3)

Now we build the Word2Vec model with *gensim*, which requires only a couple of lines of code. We use a window of 10, and a 100-dimensional vector (gensim's default size) will be estimated for each person using the other people involved in each movie as context. People who appear fewer than 5 times (min_count=5) get no vector.

In [None]:
# gensim 3.x signature; in gensim >= 4.0 the `iter` argument is named `epochs`
wtv = Word2Vec(window=10, iter=10, min_count=5)
wtv.build_vocab(to_wtv)
wtv.train(to_wtv, total_words=wtv.corpus_total_words, epochs=10)

# Results from Word2Vec

The following functions let us turn imdb id's into pictures and names of the cast/crew:

In [None]:
from IPython.display import HTML, display
from bs4 import BeautifulSoup
import requests

def get_name(id):
    # Scrape the person's name from their IMDB page
    response = requests.get(f'https://www.imdb.com/name/{id}/')
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup.select('.header .itemprop')[0].text

def get_image(id):
    # Scrape the person's picture, falling back to IMDB's "no picture" placeholder
    response = requests.get(f'https://www.imdb.com/name/{id}/')
    soup = BeautifulSoup(response.content, 'html.parser')
    candidates = soup.select('#name-poster')
    return candidates[0].attrs['src'] if candidates else 'https://m.media-amazon.com/images/G/01/imdb/images/nopicture/medium/name-2135195744._CB466677935_.png'

def render_person(id):
    name = get_name(id)
    picture = get_image(id)
    return f"""
    <div style="width: 150px; text-align: center">
        <h4 style='margin-top: -5px'>{name}</h4>
        <div style='font-size:75%; margin-bottom: 5px'>{id}</div>
        <a href="https://www.imdb.com/name/{id}" target="_blank">
            <img style="width: 100px; display: block; margin-left: auto; margin-right: auto;" src="{picture}"/>
        </a>
    </div>
    """

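def show_similars(id, n=10):
    # most_similar() raises a KeyError for people without a vector, so stop early
    if id not in wtv.wv:
        print(f'{id} has no vector (fewer than min_count appearances)')
        return
    display(HTML(render_person(id)))
    renders = []
    for similar_id, score in wtv.wv.most_similar(id, topn=n):
        renders.append(render_person(similar_id))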
        
    carousel = ''.join(
        [
            f'<div style="margin-left: 10px; float: left">{p}</div>' 
            for p in renders
        ]
        )
    display(HTML(f'<div style="width: 1800px">{carousel}</div>'))

def show_similars_tovector(id, n=10):
    renders = []
    
    for similar_id, score in wtv.wv.most_similar(id, topn=n):
        renders.append(render_person(similar_id))
        
    carousel = ''.join(
        [
            f'<div style="margin-left: 10px; float: left">{p}</div>' 
            for p in renders
        ]
    )
    display(HTML(f'<div style="width: 1800px">{carousel}</div>'))


**Let's check a few similarities**. For each actor/actress, we see which others have the most similar vectors (cosine similarity). If we give the id of a famous person, it would make sense to get back other actors/actresses who are about as well known.
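For reference, this is roughly what "most similar by cosine similarity" means. The sketch below uses made-up 3-dimensional vectors and a hypothetical `most_similar` helper written with NumPy; it is not gensim's implementation, just the same idea in a few lines.

```python
import numpy as np

def most_similar(query, vectors, topn=2):
    """Rank the keys of `vectors` by cosine similarity to the vector of `query`."""
    q = vectors[query] / np.linalg.norm(vectors[query])
    scores = {
        key: float(np.dot(q, vec / np.linalg.norm(vec)))
        for key, vec in vectors.items()
        if key != query
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

# Made-up vectors: two "stars" pointing in a similar direction and one "extra"
toy_vectors = {
    'star_a': np.array([1.0, 0.9, 0.0]),
    'star_b': np.array([0.9, 1.0, 0.1]),
    'extra_c': np.array([0.0, 0.1, 1.0]),
}
ranking = most_similar('star_a', toy_vectors)
print(ranking)
```

Cosine similarity only looks at the direction of the vectors, not at their length, so two people are "similar" when their vectors point the same way.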

In [None]:
show_similars('nm0000138')

In [None]:
show_similars('nm0000849')

In [None]:
show_similars('nm1297015')

Nice to see that the results seem to make a lot of sense in most cases :) Let's now look at a random actor/actress. If our model is coherent, we should get other little-known people in return.

In [None]:
random_actor = random.sample(list(principals.nconst.unique()), 1)[0]

In [None]:
show_similars('nm4643289')

Indeed, we get a bunch of random unknown people (at least to me), just as we hoped. Our random reference actor is actually called Jon Snow, which I find pretty funny.

Of course looking at the vectors themselves makes little sense to us humans. But they do appear to make sense to a computer, which is what makes this technique so interesting.
Here's for example what DiCaprio's vector looks like:

In [None]:
wtv.wv['nm0000138']

Let's play a bit more. W2V has a method called `doesnt_match` that returns the element of a list that does not belong in the group. More technically, it computes the center of the group (the mean of all vectors) and then returns the element that is furthest away from this center. I think this is a great way to check which aspects of the cast are captured by these vectors.
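Here is a minimal sketch of that "furthest from the center" logic, following the description above. The helper name `doesnt_match` and the 2-dimensional vectors are made up for illustration, and I measure closeness with cosine similarity on normalized vectors, which is an assumption about the details rather than a copy of gensim's code.

```python
import numpy as np

def doesnt_match(keys, vectors):
    """Return the key whose vector is least similar to the group's mean direction."""
    # Normalize every vector so that only its direction matters
    unit = np.array([vectors[k] / np.linalg.norm(vectors[k]) for k in keys])
    center = unit.mean(axis=0)
    center = center / np.linalg.norm(center)
    # Cosine similarity of each member to the center; the lowest one is the outlier
    sims = unit @ center
    return keys[int(np.argmin(sims))]

toy_vectors = {
    'a': np.array([1.0, 0.1]),
    'b': np.array([0.9, 0.2]),
    'c': np.array([1.0, 0.0]),
    'odd': np.array([0.0, 1.0]),
}
outlier = doesnt_match(['a', 'b', 'c', 'odd'], toy_vectors)
print(outlier)
```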

Let's start by checking whether the vectors capture how "famous" an actor is. The previous examples already suggest that they do.

In [None]:
def who_doesnt_match(person1, person2, person3, person4):
    # Make sure every id is a string before querying the model
    people = [str(p) for p in (person1, person2, person3, person4)]
    result = wtv.wv.doesnt_match(people)

    if result in wtv.wv:
        display(HTML(render_person(result)))

### **FAME**

We'll try with three random actors/actresses and a famous one. If "fame" is really hidden somewhere in those mysterious numbers of the vectors, the famous one should be the one that does not belong.

In [None]:
angelina_jolie= 'nm0001401'
sean_penn = 'nm0000576'
ryan_gosling = 'nm0331516'
random1 = random.sample(list(principals.nconst.unique()), 1)[0]
random2 = random.sample(list(principals.nconst.unique()), 1)[0]
random3 = random.sample(list(principals.nconst.unique()), 1)[0]

In [None]:
print(random1, random2, random3)

In [None]:
who_doesnt_match(angelina_jolie, random1, random2, random3)

In [None]:
who_doesnt_match(sean_penn, random1, random2, random3)

In [None]:
who_doesnt_match(ryan_gosling, random1, random2, random3)

It sure looks like our model is pretty good at differentiating famous actors from the rest.

### **AGE**

I take three relatively new, young stars and one star from the past. Even though they are all famous, let's see if the model can tell which era each of them belongs to.

In [None]:
paul_newman = 'nm0000056'
joseph_gordon_levitt = 'nm0330687'
jennifer_lawrence = 'nm2225369'
rooney_mara = 'nm1913734'

In [None]:
who_doesnt_match(paul_newman, joseph_gordon_levitt, jennifer_lawrence, rooney_mara)

Again we see it works pretty well.

### **GENRE**
To end with, let's see if the model also picks up which kind of genres the actor/actress is usually in. I'll include three (mostly) comedy actors and one more serious-type actor such as Christian Bale, who should not belong here.

In [None]:
christian_bale = 'nm0000288'
will_ferrell = 'nm0002071'
seth_rogen = 'nm0736622'
adam_sandler = 'nm0001191'

In [None]:
who_doesnt_match(christian_bale, will_ferrell, seth_rogen, adam_sandler)

We could keep trying things endlessly but let's end for now.

In summary, we have seen that using word2vec to embed actors into vectors seems to work pretty well. It looks like these numbers contain information such as how famous the actor/actress is, in which era he/she was a star, and the kind of movies he/she is usually in.

Remember that these vectors were assembled by looking at who worked with whom across all movies in our data. So it seems that you can describe an actor/actress pretty well by knowing who he/she has worked with in the past!

Now we should decide which vector to use for cast/crew who do not have a vector (because they did not appear in a sufficient minimum number of movies).
I consider two obvious options:

1- We take the mean vector of all people (np.mean).

2- We use a vector full of zeros.

Let's see who are the most similar people to each of these two default vectors:

In [None]:
default_vector = np.mean(wtv.wv.vectors, axis=0)
show_similars_tovector([default_vector])

In [None]:
default_vector = np.zeros(100)
show_similars_tovector([default_vector])

In both cases we get kind of obscure people, which is exactly what we would expect. Not sure what to decide here. For now let's use the np.zeros vector and later we check if we do better the other way.

# Choosing proxy for movies' popularity

I will use the number of IMDB votes as a proxy to measure a movie's popularity. The goal will be to predict which movies will be among the Top 1% in terms of IMDB votes (more than ~125,000 votes) based on their cast and crew. We'll call these movies "Classics" in our dataset to easily differentiate them from the rest.

In [None]:
np.percentile(movies.numVotes, 99)

In [None]:
movies_docs = movies.to_dict(orient='records')
# Compute the 99th percentile once instead of once per row
votes_threshold = np.percentile(movies.numVotes, 99)
movies['Classic'] = (movies['numVotes'] > votes_threshold).astype(int)

In [None]:
len(movies[movies['Classic']==1]) / len(movies)

Let's see which are some of these *classic* movies we have.

In [None]:
movies[movies['Classic']==1].sample(5)

Maybe not all of these movies can really be considered actual "classics", but they do have a very large number of votes and are truly famous movies.

# Feature Engineering

### Cast & Crew

The class below takes, for each movie, its entire cast or crew vectors, and then computes the mean of all of them. The result of this will be one 100-length vector that summarizes the cast (category='cast') or crew (category='crew') that was involved in the movie. We make the assumption that the mean of these vectors illustrates the movie's cast/crew quality/popularity as a whole.
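Before looking at the class, here is a tiny sketch of the averaging with made-up 2-dimensional vectors. The function name `movie_vector` and the ids are hypothetical; the point is only to show the mean over known people and the fallback to a default (all-zeros) vector when nobody is known.

```python
import numpy as np

def movie_vector(people, vectors, dim):
    """Average the vectors of the known people; fall back to zeros if nobody is known."""
    known = [vectors[p] for p in people if p in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Made-up embeddings for two people
toy = {'nm_a': np.array([1.0, 3.0]), 'nm_b': np.array([3.0, 5.0])}

with_known = movie_vector(['nm_a', 'nm_b', 'nm_unknown'], toy, dim=2)
without_known = movie_vector(['nm_unknown'], toy, dim=2)
print(with_known, without_known)
```

Note that unknown people are simply skipped, so a movie with one star and many unknown people ends up with that star's vector.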

In [None]:
class W2VFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, wtv, category, min_cnt_movies=2):
        self.category = category
        self.min_cnt_movies = min_cnt_movies
        self.wtv = wtv

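    def fit(self, X, y=None):
        # Fallback vector for movies where nobody has a usable embedding
        self.default_vector = np.zeros(self.wtv.wv.vector_size)
        #self.default_vector = np.mean(wtv.wv.vectors, axis=0)
        return self
    
    def _get_movie_vector(self, x_i):
        vectors = []
        for person in x_i[self.category]:
            # Skip people without a vector or seen in too few movies (gensim 3.x `vocab` API)
            if person not in self.wtv.wv or self.wtv.wv.vocab[person].count < self.min_cnt_movies:
                continue
            vectors.append(self.wtv.wv[person])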
            
        if len(vectors) == 0:
            return self.default_vector
        else:
            return np.mean(vectors, axis=0)
            
    def transform(self, X):
        return np.asarray([self._get_movie_vector(x_i) for x_i in X])

### Other features

There may be some other things, like the Genre of a movie, its Runtime and the Year it was released, that affect the number of votes it gets. For this reason, it is better to include these control variables in order to better understand the effect that the cast has on the movie's popularity.

### RunTime

In [None]:
movies.runtimeMinutes.corr(np.log(movies.numVotes))

We see that there is a weak/moderate correlation between runtime and the number of votes, even though I really doubt this relation is linear (which is what the coefficient above measures). Anyway, looks like we should add RunTime to our model.
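To illustrate why a linear coefficient can understate a non-linear relation, here is a small sketch on made-up data (not the movies data). It compares the Pearson correlation with a rank-based (Spearman-style) correlation on a relation that is perfectly monotonic but far from linear. The helper names are hypothetical, and the rank version assumes there are no ties.

```python
import numpy as np

def pearson(x, y):
    """Linear (Pearson) correlation coefficient."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Rank correlation: Pearson on the ranks (assumes no tied values)."""
    def rank(a):
        return np.argsort(np.argsort(a)).astype(float)
    return pearson(rank(x), rank(y))

x = np.arange(1, 11, dtype=float)
y = np.exp(x)  # grows monotonically with x, but not linearly

linear_corr = pearson(x, y)
rank_corr = spearman(x, y)
print(round(linear_corr, 3), round(rank_corr, 3))
```

The rank correlation is essentially 1 because y always increases with x, while the linear coefficient is clearly lower.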

In [None]:
class RunTime(BaseEstimator, TransformerMixin):
    def fit(self, X, y): return self
    def transform(self, X):
        res = []
        for e in X:
            res.append({'runTime': int(e['runtimeMinutes'])})
        return res

### Genre

First let's see how balanced/imbalanced the genres are in our data according to whether they are "classic" movies:

In [None]:
class GenreDummies(BaseEstimator, TransformerMixin):
    def fit(self, X, y): return self
    def transform(self, X):
        res = []
        for e in X:
            res.append({g: 1 for g in e['genres']})
        return res  

In [None]:
v = DictVectorizer(sparse=False)
dummies_genre = v.fit_transform(GenreDummies().transform(movies_docs))

In [None]:
df_genres = pd.DataFrame(dummies_genre, columns=v.feature_names_)
df_genres = df_genres.astype(int)
genres_analysis = pd.concat([movies, df_genres], axis=1)
graph1 = pd.DataFrame(genres_analysis[genres_analysis['numVotes'] > np.percentile(genres_analysis.numVotes,97)].iloc[:,14:-1] \
                     .sum(axis=0) / len(genres_analysis[genres_analysis['numVotes'] > \
                    np.percentile(genres_analysis.numVotes,97)])).reset_index().rename(columns={'index': 'genre', 0: 'pct'})
graph1['classic']="Classic"
graph2 = pd.DataFrame(genres_analysis.iloc[:,14:-1] \
                     .sum(axis=0) / len(genres_analysis)).reset_index().rename(columns={'index': 'genre', 0: 'pct'})
graph2['classic'] = "All"
graph = pd.concat([graph1, graph2], axis=0) 

In [None]:
import seaborn as sns
sns.set_style("white")
fig, ax = plt.subplots(figsize=(14,20))
ax = sns.barplot(y='genre', x='pct', hue='classic', data=graph, palette='viridis')
ax.grid(color='grey', linestyle='-', linewidth=0.1, axis='x')
ax.set_xticks([0.05, 0.10, .15, .20, .25, .30, .35, .40, .45, .50])
ax.set_yticklabels(graph.genre[:23], size = 13, fontfamily='serif')
ax.set_xlabel('% of total movies', fontsize=15)
ax.set_ylabel('Genre', fontsize=15, fontfamily='serif')
ax.tick_params(labelbottom=True,labeltop=True)
plt.title('% of total movies per genre (1960-2020)', fontsize=20)

We find some very clear differences, mostly in the genres Action and Adventure. Action movies comprise around 12% of total movies, but almost 30% of these so-called *Classic* movies belong to the Action genre. We find a similar situation with the Adventure genre movies. This suggests that we must control for a movie's genre in the model.

Let's see now how certain genres' popularity has evolved through time.

In [None]:
genres_prog = genres_analysis.groupby(by='startYear').agg({'Horror': 'sum', 'Comedy': 'sum', \
                                                           'Drama':'sum', 'Sci-Fi':'sum', 'tconst': 'count'})
genres_prog.reset_index(inplace=True)
genres_prog['%_horror'] = genres_prog['Horror'] / genres_prog['tconst'] * 100
genres_prog['%_comedy'] = genres_prog['Comedy'] / genres_prog['tconst'] * 100
genres_prog['%_drama'] = genres_prog['Drama'] / genres_prog['tconst'] * 100
genres_prog['%_scifi'] = genres_prog['Sci-Fi'] / genres_prog['tconst'] * 100
genres_prog = genres_prog.iloc[:-1,:]

In [None]:
fig, ax = plt.subplots(figsize=(18,8))
sns.lineplot(x=genres_prog['startYear'], y=genres_prog['%_horror'], legend='brief', ax=ax)
ax.set_xlabel('Year', fontsize=15)
ax.set_ylabel('# of Horror films per 100 movies', fontsize=15)
ax.grid(color='grey', linestyle='-', linewidth=0.1)
ax.set_xticks([1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015, 2020])
plt.title('Progression of Horror films through time', fontsize=16)

We see that the production of Horror films decreased heavily during the 90's after the golden age of the genre in the 80's. It then started slowly increasing again after year 2000 and nowadays it seems to be quite popular again.

In [None]:
fig, ax = plt.subplots(figsize=(18,8))
sns.lineplot(x=genres_prog['startYear'], y=genres_prog['%_scifi'], legend='brief', ax=ax)
ax.set_xlabel('Year', fontsize=15)
ax.set_ylabel('# of Sci-Fi films per 100 movies', fontsize=15)
ax.grid(color='grey', linestyle='-', linewidth=0.1)
ax.set_xticks([1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015, 2020])
plt.title('Progression of Sci-Fi films through time', fontsize=16)

### Year of Release

In [None]:
class ReleaseYear(BaseEstimator, TransformerMixin):
    def fit(self, X, y): return self
    def transform(self, X):
        res = []
        for e in X:
            res.append({'release_year': int(e['startYear'])})
        return res

In [None]:
fig, ax = plt.subplots(figsize=(18,8))
sns.lineplot(x=movies['startYear'], y=np.log(movies['numVotes']), ax=ax)
ax.grid(color='grey', linestyle='-', linewidth=0.1)
plt.title('Year vs. logVotes')

We see that the number of votes increases until the beginning of the 00's and then slowly goes down. It seems to make sense to control for Year of Release of the movies.

# Predictive Model

## Train/Test Split

I'll use only movies from 1975 onwards to reduce our data a bit and the computation power needed. To achieve a more realistic predictive scenario, I'll use movies from 1975-2016 for training, and movies from 2017-2019 for testing.

In [None]:
train_df = movies[movies.startYear.isin(range(1975,2017))]
test_df = movies[movies.startYear.isin(range(2017,2020))]
len(train_df), len(test_df), len(test_df) / len(train_df)

In [None]:
print(len(train_df[train_df['Classic']==1]) / len(train_df))
print(len(test_df[test_df['Classic']==1]) / len(test_df))

We have 1.2% of positive examples in our training set, and less than 0.6% in our testing set. This is not ideal, but I still prefer to train on older movies and test on more recent ones, as it is a more useful scenario in practical terms, so I'll keep it this way.

In [None]:
train_docs = train_df.to_dict(orient='records')
test_docs = test_df.to_dict(orient='records')

In [None]:
y_train = (train_df.Classic).values
y_test = (test_df.Classic).values

# Model Selection, Predictions & Feature Importances

## Logistic Regression

In [None]:
def test_pipe(pipe):
    # Predict once and reuse the results for every metric
    train_probs = pipe.predict_proba(train_docs)[:, 1]
    test_probs = pipe.predict_proba(test_docs)[:, 1]
    test_preds = pipe.predict(test_docs)
    precision, recall, _ = precision_recall_curve(y_test, test_probs)
    return {
        'train_auc': roc_auc_score(y_train, train_probs),
        'test_auc': roc_auc_score(y_test, test_probs),
        'f1': f1_score(y_test, test_preds),
        'precision': precision_score(y_test, test_preds),
        'recall': recall_score(y_test, test_preds),
        'pr_auc_score_testing': auc(recall, precision)
        }

def see_preds(pipe):
    preds = pipe.predict_proba(test_docs)
    # Copy the slice so that adding a column does not trigger a SettingWithCopyWarning
    vis = test_df[['tconst','primaryTitle','startYear','runtimeMinutes','genres','numVotes','averageRating', 'Classic']].copy()
    vis['prob_True'] = preds[:, 1]
    return vis

In [None]:
def get_features_pipe():
    steps = []
    steps.append(make_pipeline(W2VFeatures(wtv, category='cast', min_cnt_movies=3)))
    steps.append(make_pipeline(W2VFeatures(wtv, category='crew', min_cnt_movies=3)))
    steps.append(make_pipeline(RunTime(), DictVectorizer(sparse=False)))
    steps.append(make_pipeline(GenreDummies(), DictVectorizer(sparse=False)))
    steps.append(make_pipeline(ReleaseYear(), DictVectorizer(sparse=False)))
    res = make_union(*steps)
    return res

def get_model_pipe(features_pipe, scaler, estimator):
    return make_pipeline(features_pipe, scaler, estimator)

In [None]:
features_pipe=get_features_pipe()
logistic_model = get_model_pipe(
                        features_pipe,
                        scaler = StandardScaler(), 
                        estimator= LogisticRegression(max_iter=400
                                                     ))
logistic_model.fit(train_docs, y_train)

In [None]:
results = test_pipe(logistic_model)
results

In [None]:
print('We get an outstanding ROC AUC score of {}% in our testing set. We have a Precision of {}% and a Recall of {}%, both very good values considering that only 0.6% of the examples in the testing set are positive. So there are very few movies that got this number of votes in our test data, but we can correctly identify {}% of them with our simple Logistic Regression model. Finally, we get a pr-auc in testing of {}.'.format(round(results['test_auc']*100,1), round(results['precision']*100,1), 
        round(results['recall']*100,1), round(results['recall']*100,1), round(results['pr_auc_score_testing'], 3)))

## LightGBM (Gradient Boosting Machine)

Now let's get a bit more serious and try a more complex and powerful algorithm: LightGBM's Gradient Boosting Machine. We use hyperopt to tune the hyperparameters in a smart way. We'll use the Testing PR-AUC, instead of the ROC-AUC, as the metric we want to maximize here. I make this decision because (1) we've seen before that we can easily get a ROC-AUC of 99% in Testing already with a simple linear model, and (2) it is a more representative metric considering the huge imbalance in our data. Only 0.6% of the testing examples belong to the positive class, so the PR-AUC we would get by random guessing is about 0.6%. Let's see how much better than that we can do with our model.
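To back up the claim that random guessing gives a PR-AUC close to the share of positives, here is a small simulation on synthetic labels (not our movies). I use average precision, a closely related summary of the Precision-Recall curve, written by hand in NumPy; the helper name and the numbers (20,000 examples, 5% positives, chosen larger than our 0.6% only to keep the simulation less noisy) are assumptions for illustration.

```python
import numpy as np

def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive example."""
    order = np.argsort(-scores)                # highest score first
    y_sorted = y_true[order]
    hits = np.cumsum(y_sorted)                 # positives found so far
    precision_at_k = hits / np.arange(1, len(y_sorted) + 1)
    return float(precision_at_k[y_sorted == 1].mean())

rng = np.random.default_rng(0)
n, positive_rate = 20000, 0.05
y = (rng.random(n) < positive_rate).astype(int)

random_scores = rng.random(n)                  # a model that knows nothing
ap_random = average_precision(y, random_scores)
ap_perfect = average_precision(y, y.astype(float))  # a model that ranks all positives first
print(round(ap_random, 3), round(float(y.mean()), 3), ap_perfect)
```

The random scorer lands near the positive rate, while a perfect ranking reaches 1, which is why the positive rate is the natural baseline for this metric.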

In [None]:
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.metrics import roc_auc_score, average_precision_score, precision_score, recall_score, fbeta_score, precision_recall_curve, auc
import lightgbm as lgbm

def evaluate_model(params):
   
    parameters = {
                    'num_leaves':params['num_leaves'], 
                    'objective':'binary',
                    'max_depth':params['max_depth'],
                    'learning_rate':params['learning_rate'],
                    'max_bin':params['max_bin'], 
                    'metric': ['auc', 'binary_logloss']
                     }

    pipe = get_model_pipe(features_pipe,
                        scaler = StandardScaler(), 
                        estimator= lgbm.LGBMClassifier(**parameters)
                          )
    pipe.fit(train_docs, y_train)
    
    precision, recall, _ = precision_recall_curve(y_test, pipe.predict_proba(test_docs)[:, 1])
    pr_auc_score = auc(recall, precision)
    
    return {
        'num_leaves': params['num_leaves'],
        'max_depth': params['max_depth'],
        'learning_rate': params['learning_rate'],
        'max_bin': params['max_bin'],
        'Training ROC-AUC': round(roc_auc_score(y_train, pipe.predict_proba(train_docs)[:, 1]),3),
        'Testing ROC-AUC':round(roc_auc_score(y_test, pipe.predict_proba(test_docs)[:, 1]),3),
        'Testing PR-AUC':round(pr_auc_score,3),
        'Precision': round(precision_score(y_test, pipe.predict(test_docs), zero_division=1), 3),
        'Recall': round(recall_score(y_test, pipe.predict(test_docs)), 3)     ,
        }

def objective(params):
    res = evaluate_model(params)
    res['loss'] = - res['Testing PR-AUC']
    res['status'] = STATUS_OK
    return res 

hyperparameter_space = {
        'learning_rate': hp.uniform('learning_rate', 0.001, 0.3),
        'num_leaves': hp.choice('num_leaves', range(30, 270)),
        'max_depth': hp.choice('max_depth', range(3, 15)),
        'max_bin': hp.choice('max_bin', range(20, 380)),
}

In [None]:
trials = Trials()
best = fmin(
    objective,
    space=hyperparameter_space,
    algo=tpe.suggest,
    max_evals=50,
    trials=trials
);

In [None]:
lgbm_results = pd.DataFrame(trials.results)
lgbm_results.sort_values(by='loss').head(5)

There is no doubt that we at least do way better than random guessing. It also appears that our LGBM model outperforms the simple linear model from before. To summarize all this, let us pick the best values from hyperopt to build our final LGBM model, and then plot the Precision-Recall curves of both LGBM and Log. Regression, as well as a Naive model, to compare the performance of them all.

In [None]:
# Final LGBM Model:
parameters = {'num_leaves': 108, 
                'objective':'binary',
                'max_depth': 12,
                'learning_rate': 0.019,
                'max_bin': 363, 
                'metric': ['auc', 'binary_logloss']}

lgbm_model = get_model_pipe(features_pipe,
                        scaler = StandardScaler(), 
                        estimator= lgbm.LGBMClassifier(**parameters))
lgbm_model.fit(train_docs, y_train)

In [None]:
def plot_roc_curve(y_test, naive_probs, log_model_probs, lgbm_model_probs):
    # Despite its name, this function plots Precision-Recall curves
    sns.set_style('white')
    fig, ax = plt.subplots(figsize=(14,8))
    # plot naive (no skill) precision-recall curve
    precision, recall, _ = precision_recall_curve(y_test, naive_probs)
    ax = sns.lineplot(x=recall, y=precision, label='No Skill')
    # plot log model precision-recall curve
    precision, recall, _ = precision_recall_curve(y_test, log_model_probs)
    ax = sns.lineplot(x=recall, y=precision, markers=True, ci=False, label='Logistic Regression')
    # plot lgbm model precision-recall curve
    precision, recall, _ = precision_recall_curve(y_test, lgbm_model_probs)
    ax = sns.lineplot(x=recall, y=precision, markers=True, ci=False, label='LightGBM')
    # axis labels
    ax.set_xlabel('Recall')
    ax.set_ylabel('Precision')
    ax.set_yticks(np.arange(0,1.1,0.1))
    ax.set_xticks(np.arange(0,1.1,0.1))
    ax.grid(color='black', linestyle='-', linewidth=0.1)
    # show the legend
    plt.legend()
    plt.title('Precision-Recall Curves', fontsize=15)
    # show the plot
    plt.show()

In [None]:
from sklearn.dummy import DummyClassifier
naive_model = DummyClassifier(strategy='stratified')
naive_model.fit(train_docs, y_train)
naive_probs = naive_model.predict_proba(test_docs)[:, 1]
log_model_probs = logistic_model.predict_proba(test_docs)[:, 1]
lgbm_model_probs = lgbm_model.predict_proba(test_docs)[:, 1]

In [None]:
plot_roc_curve(y_test, naive_probs, log_model_probs, lgbm_model_probs)

It looks clear from the plot above that LGBM gives us the best performance. This is particularly true if we are looking for a model with good Precision. We see that we can obtain higher Precision with LGBM without sacrificing as much Recall as with the linear model. If, on the other hand, we were more interested in obtaining high Recall, then both models are equally good. We see that for anything over 50% in Recall we get around the same Precision with both models. In this particular case, I would be more interested in high Precision indeed. It would be better for our model to be very accurate when it predicts that a certain movie will be among the Top 1% in votes, and I don't mind so much if there are many movies that actually achieve this status without our model being able to recognize them as such, which is what Recall measures.

On the other hand, we see that our Naive model performs very poorly. We can get a Precision of 100% with a 0% Recall or vice versa, with nothing useful in between.
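As a reminder of how the Precision/Recall trade-off arises, here is a small worked example on made-up labels and scores. The function `precision_recall` is a hypothetical helper, and returning a Precision of 1.0 when nothing is predicted positive is a convention I assume here, not a universal rule.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and not p)
    # Convention (assumption): precision is 1.0 when no positive is predicted
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up example: 3 positives among 8 movies
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1, 0.1, 0.05]

for threshold in (0.8, 0.5, 0.25):
    print(threshold, precision_recall(y_true, scores, threshold))
```

A strict threshold flags only the surest movie (perfect Precision, low Recall), and lowering it finds more of the positives at the price of letting a false positive in.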

Now let's see some of the predictions to get a more practical feel for what our model is predicting:

In [None]:
see_preds(lgbm_model).sort_values(by='numVotes', ascending=False).head(10)

The movies above are the ones with the most votes in the test data. Our model correctly identified all of them as movies that would get >125,000 votes, except for **Dunkirk** and **Logan**. It largely failed to recognize **Dunkirk** as a future *classic*, probably because its cast is mostly unknown actors despite it being a Nolan movie. For **Logan**, the model came very close to the right prediction (it predicted a 49% chance of becoming a *classic*). Now let's check some random predictions.

In [None]:
see_preds(lgbm_model).sample(5)

We find some little-known movies that the model correctly identified as having practically no chance of getting >125,000 votes.

To better understand what our W2V Cast vectors are doing, let's see how well we would do without the Word2Vec features.

In [None]:
def get_features_pipe():
    steps = []
    steps.append(make_pipeline(RunTime(), DictVectorizer(sparse=False)))
    steps.append(make_pipeline(GenreDummies(), DictVectorizer(sparse=False)))
    steps.append(make_pipeline(ReleaseYear(), DictVectorizer(sparse=False)))
    res = make_union(*steps)
    return res

features_pipe = get_features_pipe()
model = get_model_pipe(
    features_pipe,
    scaler=StandardScaler(),
    estimator=lgbm.LGBMClassifier(**parameters),
)
                                                   
model.fit(train_docs, y_train)

results = test_pipe(model)
results

We do much worse, as expected. Despite the deceptively high 94% ROC AUC in testing, our recall drops to a poor 12%. With only about 1% of movies in the positive class, ROC AUC can stay high while the model still misses most of the positives. This shows that the cast and crew representations obtained with Word2Vec are fundamental to our model. Let's confirm this by seeing how well we do with those features only.
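
Before that, a brief aside on why a high ROC AUC can coexist with low recall. ROC AUC only measures how well positives are ranked above negatives, while recall depends on where the decision threshold falls. The sketch below uses hand-made toy scores (not the notebook's data) in which most positives rank above the negatives but only a few cross the 0.5 cutoff.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, recall_score, precision_score

# Hand-made toy scores (assumption, for illustration): 990 negatives, 10 positives.
neg_scores = np.linspace(0.00, 0.30, 990)
pos_scores = np.linspace(0.21, 0.57, 10)   # only two positives score >= 0.5

y_true = np.concatenate([np.zeros(990, dtype=int), np.ones(10, dtype=int)])
scores = np.concatenate([neg_scores, pos_scores])
y_pred = (scores >= 0.5).astype(int)       # default-style 0.5 cutoff

auc = roc_auc_score(y_true, scores)
recall = recall_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
print(f"ROC AUC={auc:.3f}  recall={recall:.2f}  precision={precision:.2f}")
```

This is why, with such an imbalanced target, it helps to read ROC AUC together with precision and recall at the threshold actually used.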

In [None]:
def get_features_pipe():
    steps = []
    steps.append(make_pipeline(W2VFeatures(wtv, category='cast', min_cnt_movies=3)))
    steps.append(make_pipeline(W2VFeatures(wtv, category='crew', min_cnt_movies=3)))
    res = make_union(*steps)
    return res

features_pipe = get_features_pipe()
model = get_model_pipe(
    features_pipe,
    scaler=StandardScaler(),
    estimator=lgbm.LGBMClassifier(**parameters),
)
                                                     
model.fit(train_docs, y_train)

test_pipe(model)

Performance drops, so the *Genre* dummies, the *Runtime* and the *Release Year* were helping at least somewhat. But the lead stars of this predictive model are the vectors extracted from Word2Vec: we do decently well using those and nothing else.

# Probability Distributions

To end this notebook, we'll take advantage of the fact that our predictive model works reasonably well and use it to draw approximate probability distributions of a movie's expected IMDB votes. To do this, we treat the task as several separate classification problems, where the target in each case is numVotes >= Xi for a threshold Xi.

In [None]:
y_train = train_df.numVotes
y_test = test_df.numVotes

We'll use the following thresholds, one for each classification model. I use few thresholds among the lowest percentiles because they correspond to very low vote counts (apparently many movies have fewer than 100 votes on IMDB). I use narrower intervals toward the top, as the difference between the 95th and 96th percentiles can be tens of thousands of votes.

In [None]:
# Percentiles of numVotes used as thresholds: coarse at the low end, finer near the top.
percentiles = [10, 25, 40, 55, 70, 80, 85, 87.5, 90, 91.5, 93, 95, 96.5, 98, 99, 99.25]
thresholds = np.percentile(movies.numVotes, percentiles)
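
To illustrate why the percentile grid is coarse at the bottom and fine at the top, the sketch below compares percentile gaps on a synthetic heavy-tailed sample. The log-normal distribution and its parameters are assumptions chosen only to mimic a long right tail, so the numbers it prints say nothing about the real IMDB data.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in for movies.numVotes (assumption): a heavy-tailed sample.
votes = np.round(rng.lognormal(mean=5.0, sigma=2.5, size=100_000))

low = np.percentile(votes, [10, 25])    # 15 percentile points apart
high = np.percentile(votes, [95, 96])   # 1 percentile point apart

gap_low = low[1] - low[0]
gap_high = high[1] - high[0]
print(f"10th to 25th percentile gap: {gap_low:.0f} votes")
print(f"95th to 96th percentile gap: {gap_high:.0f} votes")
```

On a distribution shaped like this, a single percentile point near the top spans far more votes than many points near the bottom, which is the motivation for spacing the thresholds unevenly.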

In [None]:
def get_bools(y):
    res = []
    for t in thresholds:
        res.append(y >= t)
    return res

ys_train = get_bools(y_train)
ys_test = get_bools(y_test)

In [None]:
models = [ get_model_pipe(
        features_pipe=get_features_pipe(),
        scaler=StandardScaler(),
        estimator=lgbm.LGBMClassifier(**parameters))  
    for _ in range(len(thresholds)) ]

In [None]:
for i, m in enumerate(models):
    m.fit(train_docs, ys_train[i])

In [None]:
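from random import randint
from itertools import compress
from io import BytesIO

import requests
from bs4 import BeautifulSoup
from PIL import Image

# Fallback image used when the IMDB page has no usable <img> tag
NO_IMAGE_URL = 'https://i2.wp.com/www.fryskekrite.nl/wordpress/wp-content/uploads/2017/03/No-image-available.jpg'

def get_movie_image(tconst):
    """Return the URL of the first image on the movie's IMDB page, or a placeholder."""
    response = requests.get(f'https://www.imdb.com/title/{tconst}/')
    soup = BeautifulSoup(response.content, 'html.parser')
    img = soup.find('img')
    return img.attrs['src'] if img and img.get('src') else NO_IMAGE_URL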

def prob_dist(tconst):
    sns.set_style("white")
    filt = [x['tconst']==tconst for x in test_docs]
    movie = list(compress(test_docs, filt))[0]
    
    # We plot the movie image on the corner
    response = requests.get(get_movie_image(tconst))
    im = Image.open(BytesIO(response.content))
    im = im.resize((int(im.size[0]*0.40), int(im.size[1]*0.36)))
    height = im.size[1]
    fig, ax = plt.subplots(figsize=(13,8))
    fig.figimage(im, 50, fig.bbox.ymax-height*0.9)
    
    # The following makes sure the prob distribution starts and ends with zero, and removes the inconsistencies in which
    # we would end up with Prob<0, for example when Prob(votes>Xi) > Prob(votes>X(i-1)).
    preds = np.asarray([m.predict_proba([movie])[0,1] for m in models])
    preds = preds[:-1] - preds[1:]
    preds = np.where(preds<0, 0, preds)
    preds = np.insert(preds, 0,0)
    thresholds_start = np.insert(thresholds,0,10)
    
    # We add more data points to the end of the numVotes range if this is a movie with a lot of votes:
    if movie['numVotes']>np.percentile(movies.numVotes,99):
        y_train_extra = (train_df.numVotes>np.percentile(movies.numVotes,99.5)).values
        y_train_extra2 = (train_df.numVotes>np.percentile(movies.numVotes,99.9)).values
        extra_model = get_model_pipe(
                                        features_pipe=get_features_pipe(),
                                        scaler=StandardScaler(),
                                        estimator=LogisticRegression(max_iter=400)   
                                    )
        extra_model.fit(train_docs, y_train_extra)
        preds = np.insert(preds, len(preds), extra_model.predict_proba([movie])[:,1])
        extra_model.fit(train_docs, y_train_extra2)
        preds = np.insert(preds, len(preds), extra_model.predict_proba([movie])[:,1])
        preds = np.insert(preds, len(preds),0)
        thresholds_start = np.insert(thresholds_start,len(thresholds_start),np.percentile(movies.numVotes,99.5))
        thresholds_start = np.insert(thresholds_start,len(thresholds_start),np.percentile(movies.numVotes,99.9))
        thresholds_start = np.insert(thresholds_start,len(thresholds_start),1500000)
    
    mids = [(x1 + x2) / 2 for x1, x2 in zip(thresholds_start[:-1], thresholds_start[1:])]
    
    ax = sns.lineplot(x=mids, y=preds, label='predicted distribution', marker="o")
    plt.plot([movie['numVotes'],movie['numVotes']], [0, preds.max()], color='seagreen', linewidth=1.5, label='true numVotes')
    ax.grid(color='black', linestyle='-', linewidth=0.1)
    plt.xscale("log")
    #plt.yticks(np.arange(0,0.75,0.05))
    plt.legend(loc='best', fontsize=13)
    plt.xlabel('Log(numVotes)', fontsize=14, fontfamily='serif')
    plt.ylabel('Probability', fontsize=14, fontfamily='serif')
    plt.suptitle('{originalTitle} \n'.format(**movie), fontsize=24, fontfamily='serif')
    plt.title(' \n \n \n ({startYear:.00f}, {genres})\n'.format(**movie), fontsize=16, fontfamily='serif')
    

### Some examples!

Keep in mind that these are only approximations and not real probability distributions. Each threshold has its own independently trained classifier, so the predicted probabilities are not guaranteed to be consistent with each other. Sometimes the model decides that a movie is less likely to receive, say, 50,000 votes than either 20,000 or 100,000 votes. This is a bit weird and makes some plots look unlike what one expects of a probability density. But does it make sense? Maybe it's not so crazy to predict that a movie will get either very little attention or a lot of it, with a low probability of anything in between. That said, the model is sometimes spot-on and predicts the number of votes a movie gets almost perfectly.
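
On the inconsistencies between thresholds: `prob_dist` clips negative differences to zero. One possible alternative, sketched here and not used in this notebook, is to first force the exceedance probabilities to be non-increasing with a running minimum and only then take differences. The helper name `bin_probabilities` and the example numbers are hypothetical.

```python
import numpy as np

def bin_probabilities(exceed_probs):
    """Turn estimates of P(votes >= t_i), for increasing t_i, into per-bin masses."""
    p = np.asarray(exceed_probs, dtype=float)
    # Force a non-increasing sequence: each value is capped by the smallest
    # value seen at any lower threshold.
    p_mono = np.minimum.accumulate(p)
    # Probability mass between consecutive thresholds t_i and t_(i+1).
    return p_mono[:-1] - p_mono[1:]

# Hypothetical classifier outputs; the third value breaks monotonicity.
raw = np.array([0.90, 0.70, 0.75, 0.40, 0.10])

clipped = np.clip(raw[:-1] - raw[1:], 0, None)   # clipping, as in prob_dist
monotone = bin_probabilities(raw)                # running-minimum alternative

print(clipped, clipped.sum())     # masses sum to more than raw[0] - raw[-1]
print(monotone, monotone.sum())   # masses sum to exactly p_mono[0] - p_mono[-1]
```

Clipping alone can make the bin masses add up to more than the probability they are supposed to split, while the running minimum keeps the total consistent. Neither choice turns the output into a calibrated density.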

In [None]:
prob_dist('tt7131622')

In [None]:
prob_dist('tt4123430')

In [None]:
prob_dist('tt5095030')

In [None]:
prob_dist('tt6323858')

Here are a few cases in which the model fails, to show that even though our model is pretty decent at identifying top 1% movies, it does not always make good predictions.

In [None]:
prob_dist('tt8106596')

In [None]:
prob_dist('tt5563334')

In [None]:
prob_dist('tt7923374')

# For the future...

It would be interesting to try the following:

- See how much our predictions could improve by adding more predictors that can easily be extracted from the data.

- Try to predict other things, such as the films' ratings.

- Identify clusters among actors/actresses according to their Word2Vec vectors.

- Consider alternatives to simply taking the np.mean of all the cast vectors as predictors. For example, a weighted average could give the vector of the *lead actor* twice the weight of the rest of the cast. The *director*'s vector could likewise have twice the weight of other crew members.
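
As a rough sketch of the weighted-average idea in the last item, `np.average` accepts per-row weights. The two-dimensional vectors, the assumption that the first row is the lead actor, and the helper name `weighted_mean_vector` are all made up for illustration. Real cast vectors would have 100 dimensions and come from the trained Word2Vec model.

```python
import numpy as np

def weighted_mean_vector(vectors, weights):
    """Weighted average of person vectors (one row per person)."""
    vectors = np.asarray(vectors, dtype=float)
    return np.average(vectors, axis=0, weights=weights)

# Toy 2-d vectors (hypothetical); the first row is assumed to be the lead actor.
cast_vectors = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
])
weights = [2.0, 1.0, 1.0]   # the lead actor counts double

plain = cast_vectors.mean(axis=0)                       # unweighted mean
weighted = weighted_mean_vector(cast_vectors, weights)  # [0.5, 0.5]
print(plain, weighted)
```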
