# Design an App that predicts movie genres and detects spoilers in reviews





### Part 5: Modeling

Author: Sana Krichen    
https://www.linkedin.com/in/sanakrichen/    
https://github.com/skrichen

In machine learning, multi-label classification is a problem where several labels may be assigned to each instance. Unlike traditional binary and multiclass classification, where the classes are mutually exclusive, the classes in a multi-label problem are not mutually exclusive, and there is no constraint on how many of them an instance can be assigned to. The following visual illustrates the difference.
<img src="https://drive.google.com/uc?export=view&id=1IuCAyXoFfyIk0Qo6c1vO427_lKvgxsbm" width="640" height="480">

Several models can handle multi-label classification. The One-Vs-All strategy (also called One-Vs-Rest) is one of them. As its name suggests, it trains one independent binary classifier per class, treating that class as the positive case and all the remaining classes as the negative case. For instance, this translates to the following set of questions: Is this movie a Comedy or not? Is this movie a Drama or not? Is this movie a Thriller or not?... The main assumption here is that the labels are completely independent: we do not consider any correlation between the classes.
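To make the "one yes/no question per genre" idea concrete, here is a minimal sketch on synthetic data. It does not use the movie data set: `X_toy`, `Y_toy` and the three anonymous labels are assumptions generated with scikit-learn's `make_multilabel_classification`.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data: 200 samples, 20 features, 3 labels (not the movie data)
X_toy, Y_toy = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=3, random_state=0
)

# One binary logistic regression is fitted per label column
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_toy, Y_toy)

# Each predicted row is a 0/1 vector; several labels can be 1 at the same time
Y_pred = ovr.predict(X_toy)
print(len(ovr.estimators_))  # number of binary classifiers that were trained
print(Y_pred[:5])
```

Each entry of `ovr.estimators_` answers one of the questions independently, which is why correlations between labels are not captured by this strategy.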



In [1]:
# import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import joblib

In [2]:
# read my train set
df_train = pd.read_csv('df_train_ready.csv')
df_train

Unnamed: 0,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,...,youtub,youv,zealand,zero,zip,zoe,zombi,zone,zoo,zoom
0,0,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9936,0,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
9937,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9938,1,1,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
9939,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [3]:
# read my test set
df_test = pd.read_csv('df_test_ready.csv')
df_test

Unnamed: 0,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,...,youtub,youv,zealand,zero,zip,zoe,zombi,zone,zoo,zoom
0,0,1,1,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3309,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3310,0,0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3311,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3312,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
# Select my features from the train set
X_train = df_train.loc[:, 'aaron':]
X_train

Unnamed: 0,aaron,aback,abandon,abbi,abdomen,abduct,abil,abl,ablaz,aboard,...,youtub,youv,zealand,zero,zip,zoe,zombi,zone,zoo,zoom
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9936,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9937,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9938,0,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9939,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
# Select my features from the test set
X_test = df_test.loc[:, 'aaron':]
X_test

Unnamed: 0,aaron,aback,abandon,abbi,abdomen,abduct,abil,abl,ablaz,aboard,...,youtub,youv,zealand,zero,zip,zoe,zombi,zone,zoo,zoom
0,0,0,1,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3309,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3310,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3311,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3312,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
# Select my target from the train set
y_train = df_train.loc[:, :'War']
y_train

Unnamed: 0,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,History,Horror,Music,Musical,Mystery,Romance,SciFi,Sport,Thriller,War
0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,1,0
1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0
4,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9936,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0
9937,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
9938,1,1,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0
9939,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0


In [7]:
# Select my target from the test set
y_test = df_test.loc[:, :'War']
y_test

Unnamed: 0,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,History,Horror,Music,Musical,Mystery,Romance,SciFi,Sport,Thriller,War
0,0,1,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0
4,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3309,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0
3310,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0
3311,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0
3312,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1
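The feature and target selections above rely on label-based slicing with `.loc`, which includes the end label and depends on the column order. The sketch below illustrates this on a hypothetical miniature frame (`toy` and its five columns are made up for the example); it assumes, as in the tables shown, that the genre columns come first and the token columns follow.

```python
import pandas as pd

# Hypothetical miniature of the prepared data: genre columns first, then token counts
toy = pd.DataFrame({
    "Action": [1, 0], "Drama": [0, 1], "War": [0, 0],
    "aaron": [0, 1], "zoom": [2, 0],
})

toy_targets = toy.loc[:, :"War"]     # label slices include the end label 'War'
toy_features = toy.loc[:, "aaron":]  # everything from the first token column onwards

print(list(toy_targets.columns))
print(list(toy_features.columns))

# Sanity check: no column should end up in both the features and the targets
assert set(toy_features.columns).isdisjoint(toy_targets.columns)
```

Because `:'War'` takes every column up to and including `War`, any extra leading column in the CSV (for example a saved index) would also land in the target slice, so it may be worth inspecting `y_train.columns` before fitting.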


It's time to do some modeling. Since I decided to use the One-Vs-All model along with Logistic Regression, I will tune the hyperparameters with GridSearchCV.
This way, we can define a pipeline and then test combinations of parameters to see which one gives the best cross-validated accuracy score. Let's define our pipeline:

In [8]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler



# Pipeline: scale the data, then apply logistic regression within a one-vs-rest classifier
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', OneVsRestClassifier(LogisticRegression(solver='lbfgs'), n_jobs=-1)),
                ])

# Define the parameter grid. The 'scaler' and 'clf' entries replace the pipeline
# steps above, so max_iter=3000 is what is actually used during the search.
param_grid = [
    {'scaler': [StandardScaler()],
     'clf': [OneVsRestClassifier(LogisticRegression(solver='lbfgs', max_iter=3000), n_jobs=-1)],
     'clf__estimator__C': [0.01, 0.1, 1, 10, 100],
     # 'none' disables regularization (C is then ignored);
     # newer scikit-learn releases expect None instead of the string 'none'
     'clf__estimator__penalty': ['none', 'l2']},
]

# Create the grid search object (3-fold cross-validation, scored with the pipeline's own score method)
grid = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1, verbose=1)

# Fit the grid
fittedgrid = grid.fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 64 concurrent workers.
[Parallel(n_jobs=-1)]: Done  27 out of  30 | elapsed:  9.8min remaining:  1.1min
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed: 10.0min finished
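The "10 candidates, totalling 30 fits" in the log can be reproduced by enumerating the grid. This is a small sketch using scikit-learn's `ParameterGrid` with the same value lists as the search; `toy_grid`, `n_candidates` and `n_folds` are names introduced only for this illustration, and the estimator objects are left out because each contributes a single option.

```python
from sklearn.model_selection import ParameterGrid

# Same value lists as in the grid search, without the estimator objects
toy_grid = ParameterGrid({
    "clf__estimator__C": [0.01, 0.1, 1, 10, 100],
    "clf__estimator__penalty": ["none", "l2"],
})

n_candidates = len(toy_grid)  # 5 values of C x 2 penalties
n_folds = 3                   # cv=3 in the grid search
print(n_candidates, n_candidates * n_folds)
```

Note that when the penalty is disabled, `C` has no effect, so the five unpenalized candidates describe the same model and differ only through the cross-validation noise of the solver.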


In multi-label classification, the accuracy score is the subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true. For example, if the model predicts a movie to be a drama and a comedy but the movie is actually labeled as drama, comedy and family, this is not an exact match: the whole prediction counts as wrong and the accuracy drops.
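The strictness of subset accuracy is easy to see on a tiny hand-made example. The two "movies" and three genre columns below are hypothetical; `hamming_loss` is included only as a contrast, since it counts individual label mistakes rather than whole rows.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Columns: Drama, Comedy, Family (hypothetical two-movie example)
y_true = np.array([[1, 1, 1],
                   [1, 0, 0]])
y_pred = np.array([[1, 1, 0],   # misses Family, so the whole row counts as wrong
                   [1, 0, 0]])  # exact match

subset_acc = accuracy_score(y_true, y_pred)   # fraction of rows matched exactly
per_label_err = hamming_loss(y_true, y_pred)  # fraction of individual labels that are wrong
print(subset_acc, per_label_err)
```

Here only one of the six individual labels is wrong, yet half of the rows are counted as errors, which is why subset accuracy tends to look low when there are many labels per sample.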


In [9]:
import joblib
joblib.dump(fittedgrid, "fittedgrid.pkl")

['fittedgrid.pkl']

Let's get the best model based on the accuracy metric:

In [10]:
fittedgrid.best_params_

{'clf': OneVsRestClassifier(estimator=LogisticRegression(C=0.01, class_weight=None,
                                                  dual=False, fit_intercept=True,
                                                  intercept_scaling=1,
                                                  l1_ratio=None, max_iter=3000,
                                                  multi_class='auto',
                                                  n_jobs=None, penalty='l2',
                                                  random_state=None,
                                                  solver='lbfgs', tol=0.0001,
                                                  verbose=0, warm_start=False),
                     n_jobs=-1),
 'clf__estimator__C': 0.01,
 'clf__estimator__penalty': 'l2',
 'scaler': StandardScaler(copy=True, with_mean=True, with_std=True)}

Let's check the accuracy score for the train set:

In [11]:
fittedgrid.score(X_train,y_train)

0.8568554471381149

Let's check the accuracy score for the test set:

In [12]:
fittedgrid.score(X_test,y_test)

0.28032589016294507

The model is clearly overfitting: it reaches a subset accuracy of about 0.86 on the train set but only about 0.28 on the test set, so it does not generalize well. This was expected given the strong class imbalance in my data. In fact, the model performs really well for some genres but struggles to learn others, and that shows up in the test accuracy.
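One way to see which genres the model handles well and which it struggles with is to compute metrics per label instead of a single subset accuracy. The helper below is a sketch, not part of the original notebook: `per_label_report` is a hypothetical function name, and the toy arrays stand in for real predictions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

def per_label_report(y_true, y_pred, label_names):
    """Return precision, recall, F1 and support for each label column."""
    precision, recall, f1, support = precision_recall_fscore_support(
        y_true, y_pred, average=None, zero_division=0
    )
    return pd.DataFrame(
        {"precision": precision, "recall": recall, "f1": f1, "support": support},
        index=label_names,
    )

# Hypothetical toy labels: a frequent genre (Drama) and a rare one (War)
y_true_toy = np.array([[1, 0], [1, 0], [1, 1], [1, 0]])
y_pred_toy = np.array([[1, 0], [1, 0], [1, 0], [1, 0]])  # the rare genre is never predicted
report = per_label_report(y_true_toy, y_pred_toy, ["Drama", "War"])
print(report)
```

With the notebook's objects, the same helper could be called as `per_label_report(y_test, fittedgrid.predict(X_test), y_test.columns)`, assuming the columns of `y_test` are exactly the genre labels.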

In [None]:
Coming next: my model