Daniel Rocha Ruiz, MSc in Data Science and Business Analytics

# Multi-Label Classification in Python

Tutorials:
- https://www.depends-on-the-definition.com/guide-to-multi-label-classification-with-neural-networks/
- https://towardsdatascience.com/multi-label-image-classification-with-neural-network-keras-ddc1ab1afede
- https://www.analyticsvidhya.com/blog/2019/04/predicting-movie-genres-nlp-multi-label-classification/
- https://stackoverflow.com/questions/38246559/how-to-create-a-heat-map-in-python-that-ranges-from-green-to-red

Dataset:
- www.cs.cmu.edu/~ark/personas/data/MovieSummaries.tar.gz

Summary:
- In this exercise, we use TensorFlow to build neural networks and perform multi-label classification.
- The dataset contains text data on many different movies. Our task is to create a model that correctly predicts the genres of each movie based on its summary. As each movie may have one or more genres, this is multi-label classification (i.e. each genre is a label, and a movie may have more than one genre).

- One technicality:
    - Multi-class is choosing one exclusive category out of many;
    - Multi-label is choosing at least one category out of many.
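
The distinction can be made concrete by looking at the shape of the target matrix. The snippet below is a minimal sketch on made-up genres (the toy labels are assumptions, not taken from the dataset): a multi-class target is one-hot, while a multi-label target is multi-hot.

```python
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

# Hypothetical toy genres, for illustration only
multiclass_labels = ["Drama", "Comedy", "Action"]  # exactly one genre per movie
multilabel_labels = [["Drama"], ["Comedy", "Drama"], ["Action", "Comedy", "Drama"]]

# Multi-class: one-hot rows, every row sums to exactly 1
y_multiclass = LabelBinarizer().fit_transform(multiclass_labels)

# Multi-label: multi-hot rows, a row may sum to more than 1
y_multilabel = MultiLabelBinarizer().fit_transform(multilabel_labels)

print(y_multiclass.sum(axis=1))  # genres per movie in the multi-class case
print(y_multilabel.sum(axis=1))  # genres per movie in the multi-label case
```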
    
Notebooks:
- In *notebook 01* we will have a look at the *metadata* to understand a bit more about the labels.
- In *notebook 02* we will clean the text data of movie plots, and create a dataset that we can use in model training.
- In *notebook 03* we will train a baseline model using a logistic regression.
- In *notebook 04* we will finally train the neural network.

# Set-up
## Environment
- First, create the *tf_env* environment following the instructions in the README.
- Alternatively, you can use an environment of your own that has tensorflow installed.

## Packages

In [198]:
# a few tweaks
import sys
sys.path.append("../")

%matplotlib inline

%load_ext autoreload
%autoreload 2

# import packages
import os
import warnings
from tqdm import tqdm
import joblib
import shutil

# neural networks
from tensorflow import keras
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# text processing
import csv
import json
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from utils.functions import *

# graphs
import matplotlib.pyplot as plt 
import seaborn as sns
from matplotlib.colors import LinearSegmentedColormap

# maths
import pandas as pd
import numpy as np

## Load the dataset

Let's load the dataset prepared in the previous notebook.

In [96]:
# replace path if different
df = pd.read_parquet("../data/cleaned_dataset.parquet")

# for ease
df.columns = ["x","y"]

print(df.shape)
df.head()

(41793, 2)


| movie_id | x | y |
|---|---|---|
| 23890098 | shlykov hard working taxi driver lyosha saxoph... | [114, 359] |
| 31186339 | nation panem consists wealthy capitol twelve p... | [2, 300, 5, 114] |
| 20663735 | poovalli induchoodan sentenced six years priso... | [232, 46, 2, 114] |
| 2231378 | lemon drop kid new york city swindler illegall... | [75, 302] |
| 595909 | seventh day adventist church pastor michael ch... | [98, 359, 109, 114, 93] |


# Baseline Prediction
First, we train a logistic regression model to use as a baseline.

## Data Formatting
A few last-mile adjustments are needed:
- The Y data needs to be converted into a binary indicator matrix (one column per genre). This conversion doesn't leak information between splits, so it's performed on the whole Y data at once.
- The X data is featurized with TF-IDF. This conversion could leak information, because the vocabulary and the IDF weights are learned from the text. So it's fitted on the training data and only then applied to the validation data, to avoid leakage (i.e. an optimistic bias).
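
To illustrate why the TF-IDF step is fitted on the training split only, here is a small sketch on made-up sentences (the texts are assumptions for illustration). The vocabulary is learned from the training texts alone, so a word that appears only in the validation text is simply ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy texts, for illustration only
train_texts = ["detective solves murder", "couple falls in love"]
val_texts = ["alien detective falls"]  # "alien" never appears in training

vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(train_texts)  # vocabulary and IDF weights are learned here
x_val = vectorizer.transform(val_texts)          # reuses them, learns nothing new

print(sorted(vectorizer.vocabulary_))     # words known to the vectorizer
print("alien" in vectorizer.vocabulary_)  # unseen validation word is not in the vocabulary
print(x_train.shape, x_val.shape)         # both splits share the same feature columns
```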

In [97]:
# Y data
multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(df['y'])
y = multilabel_binarizer.transform(df['y'])

# train-val split
xtrain, xval, ytrain, yval = train_test_split(df['x'], y, test_size=0.2, random_state=42)
print("X Training data:", xtrain.shape)
print("X Validation data:", xval.shape)
print("Y Training data:", ytrain.shape)
print("Y Validation data:", yval.shape)

# X data
# create TF-IDF features
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000)
xtrain_tfidf = tfidf_vectorizer.fit_transform(xtrain)
xval_tfidf = tfidf_vectorizer.transform(xval)

## Training the Classifier

Steps:
- For each label, we train a logistic regression. So, we will train 363 different logistic regressions.
- The output of each logistic regression is evaluated against a threshold, and the result has a boolean interpretation (e.g. Is this the summary of an Action movie?).
- These will be combined with the *OneVsRestClassifier* object.
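
The steps above can be sketched on a small synthetic problem. The data below comes from scikit-learn's `make_multilabel_classification` and is an assumption for illustration only, as are the names `ovr`, `proba` and the thresholds. The sketch shows that one estimator is fitted per label and that lowering the threshold can only add positive predictions.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data: 200 samples, 20 features, 5 labels
X, Y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=5, random_state=0
)

# One independent binary logistic regression per label
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, Y)
print("Estimators:", len(ovr.estimators_))

# Per-label probabilities, shape (n_samples, n_labels)
proba = ovr.predict_proba(X)

# Default decision versus a lower, custom threshold
pred_default = ovr.predict(X)
pred_custom = (proba >= 0.3).astype(int)
print("Positives at default threshold:", int(pred_default.sum()))
print("Positives at threshold 0.3:", int(pred_custom.sum()))
```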

Notes:
- Because many labels are very rare, some of them appear in only one of the training and validation sets, so we will see a few warnings.
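
One way to see which labels are affected is to count positives per column in each split. The helper below is a hypothetical sketch (the function name and the toy arrays are assumptions); it returns the columns that have no positive example in training but at least one in validation.

```python
import numpy as np

def labels_missing_from_train(y_train, y_val):
    """Return column indices with no positives in y_train but some in y_val."""
    train_support = np.asarray(y_train).sum(axis=0)
    val_support = np.asarray(y_val).sum(axis=0)
    return np.flatnonzero((train_support == 0) & (val_support > 0))

# Toy indicator matrices: the third label only occurs in the validation split
y_train_toy = np.array([[1, 0, 0],
                        [1, 1, 0]])
y_val_toy = np.array([[0, 0, 1],
                      [1, 0, 0]])

missing = labels_missing_from_train(y_train_toy, y_val_toy)
print(missing)  # [2]
```

With the notebook's variables, the same call would be `labels_missing_from_train(ytrain, yval)`.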

In [102]:
%%time
# usually takes 2min42sec

# create classifier objects
lr = LogisticRegression()
clf = OneVsRestClassifier(lr)

# fit model on train data
clf.fit(xtrain_tfidf, ytrain)
print("N-Classes", clf.n_classes_)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


N-Classes 363
CPU times: total: 2min 35s
Wall time: 2min 37s


## Performance evaluation

In [103]:
#####################################
# predictions for validation set
print("Validation Set, Default Threshold")
y_pred = clf.predict(xval_tfidf)
print("F1-Score:", f1_score(yval, y_pred, average="micro"))

# predictions for validation set with custom threshold t
print("Validation Set, Custom Threshold")
y_pred_prob = clf.predict_proba(xval_tfidf)
t = 0.3
y_pred_new = (y_pred_prob >= t).astype(int)
print("F1-Score:", f1_score(yval, y_pred_new, average="micro"))

Validation Set, Default Threshold
F1-Score: 0.3173826632013454
Validation Set, Custom Threshold
F1-Score: 0.44040165867143677
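
The threshold of 0.3 is a manual choice. A possible refinement, sketched below on made-up probabilities, is to sweep a grid of thresholds and keep the one with the best micro-averaged F1. The helper `sweep_thresholds` and the toy arrays are assumptions for illustration. Note that a threshold tuned on the validation set gives an optimistic score on that same set.

```python
import numpy as np
from sklearn.metrics import f1_score

def sweep_thresholds(y_true, y_prob, thresholds):
    """Return (best_threshold, best_micro_f1, all_scores) over the given grid."""
    scores = [
        f1_score(y_true, (y_prob >= t).astype(int), average="micro", zero_division=0)
        for t in thresholds
    ]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best], scores

# Toy ground truth and predicted probabilities (4 movies, 3 labels)
y_true_toy = np.array([[1, 0, 1],
                       [0, 1, 0],
                       [1, 1, 0],
                       [0, 0, 1]])
y_prob_toy = np.array([[0.60, 0.10, 0.35],
                       [0.20, 0.40, 0.10],
                       [0.45, 0.70, 0.20],
                       [0.10, 0.32, 0.80]])

best_t, best_f1, all_scores = sweep_thresholds(y_true_toy, y_prob_toy, [0.3, 0.5])
print("Best threshold:", best_t)
print("Best micro F1:", round(best_f1, 3))
```

In the notebook, the equivalent call would pass `yval`, `y_pred_prob` and a grid such as `np.arange(0.1, 0.6, 0.05)`.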


## A few examples
Looking at a few predictions, the classifier seems reasonably sensible: it does not predict *nonsense*. However, there are too many labels, and many of them don't *really* constitute a genre.

### Examples in the Validation Set

In [111]:
# decode genres
with open('../utils/decode_genres.json', 'r') as f:
    decode_genres = json.load(f)
    decode_genres = {int(k):v for k,v in decode_genres.items()}

# create a dataframe with the predictions on validation set
df_viz = pd.DataFrame()
df_viz.index = xval.index
df_viz = df_viz.rename_axis('movie_id')

#df_viz["y_val_encoded"]=multilabel_binarizer.inverse_transform(yval)
#df_viz["y_val_decoded"]=df_viz["y_val_encoded"].apply(lambda x: [decode_genres[i] for i in x])

df_viz["y_pred_encoded"]=multilabel_binarizer.inverse_transform(y_pred_new)
df_viz["y_pred_decoded"]=df_viz["y_pred_encoded"].apply(lambda x: [decode_genres[i] for i in x])

df_viz = df_viz[["y_pred_decoded"]]

# get the summary
df_sup = pd.read_parquet("../data/viz_dataset.parquet")

# merge dataframes on movie_id
df_viz = pd.merge(df_viz, df_sup, on="movie_id", how="left")
print(df.shape)

# reorder columns
df_viz = df_viz[['movie_title', 'movie_plot', 'movie_genre', 'y_pred_decoded']]

df_viz.head()

(41793, 2)


| movie_id | movie_title | movie_plot | movie_genre | y_pred_decoded |
|---|---|---|---|---|
| 18578504 | Table for Five | J.P. Tannen is a former professional golfer r... | [Family Drama, Drama, Indie] | [Comedy, Drama] |
| 98508 | The Sand Pebbles | In 1926, Machinist's Mate 1st Class Jake Holma... | [Historical fiction, Adventure, War film, Epic... | [Action, Action/Adventure, Adventure, Drama, W... |
| 7018587 | Minor Mishaps | Minor Mishaps is the story of a family's react... | [Drama, Comedy] | [Drama] |
| 35495058 | Worth the Risk? | The film opens at an emergency telephone excha... | [Short Film, Documentary] | [Drama, Short Film] |
| 2806249 | Homo Erectus | Ishbo is the younger son of Mookoo , the lead... | [Comedy] | [Adventure, Comedy, Drama, Fantasy] |


In [123]:
for i in range(5): 
    k = df_viz.sample(1).index[0]
    print("Title: ", df_viz['movie_title'][k]
          ,"\nPlot: ", df_viz['movie_plot'][k]
          ,"\nActual genre: ", df_viz['movie_genre'][k]
          ,"\nPredicted genre: ", df_viz['y_pred_decoded'][k]
          , "\n")

Movie:  Outrage! 
Plot:  Marcos, a young reporter, goes to a circus to write a Sunday supplement piece. As he is leaving, the next act is about to start. It involves a woman riding a horse and performing tricks; the presentation ends in shooting balloons from a horse while it is moving. Marcos is taken by the beauty of Ana, the equestrian sharpshooter, and returns to interview her. She invites him to dinner with the troupe. They dance, and then spend the night together. He falls in love with the beautiful horse-riding circus girl. An affair between them ensues; he considers following her around Europe and promises he would follow her to hell. Soon, Marco has to leave to cover a concert in Barcelona. Fate intervenes when three young mechanics come to repair circus equipment and the owner gives them complimentary tickets for the show. The trio makes a racket as they watch Ana perform. After the show, they follow Ana to her trailer and brutally rape her. Although she is badly hurt, she de

### New Examples 

In [193]:
def infer_genres(text
                 ,model=clf
                 ,binarizer=multilabel_binarizer
                 ,vectorizer=tfidf_vectorizer
                 ,decoder=decode_genres
                ):
    text = clean_text(text)
    text = remove_stopwords(text)
    q_vec = vectorizer.transform([text])
    q_pred = model.predict(q_vec)
    genres = binarizer.inverse_transform(q_pred)
    genres = [decoder[item] for t in genres for item in t]
    return genres

infer_genres("this is the plot of a romantic comedy movie")

['Comedy']

## Save the model

In [196]:
import os

model_name = "clf_logistic"
output_dir = "../outputs/" + model_name

# create the output folder if it does not exist yet
os.makedirs(output_dir, exist_ok=True)

# persist the model and everything needed to preprocess inputs and decode outputs
joblib.dump(clf, output_dir + "/model.joblib")
joblib.dump(multilabel_binarizer, output_dir + "/multilabel_binarizer.joblib")
joblib.dump(tfidf_vectorizer, output_dir + "/tfidf_vectorizer.joblib")
joblib.dump(decode_genres, output_dir + "/decode_genres.joblib")

# ship the text-cleaning helpers alongside the saved artifacts
shutil.copyfile("../utils/functions.py", output_dir + "/functions.py")

'../outputs/clf_logistic/functions.py'

In [202]:
%%writefile "../outputs/clf_logistic/inference_script.py"

# neural networks
from tensorflow import keras
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# text processing
import csv
import json
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from functions import *

# graphs
import matplotlib.pyplot as plt 
import seaborn as sns

# maths
import pandas as pd
import numpy as np

# others
import os
import joblib
from tqdm import tqdm

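# load the artifacts from the folder that contains this script,
# so the script also works when launched from another working directory
here = os.path.dirname(os.path.abspath(__file__))
model = joblib.load(os.path.join(here, 'model.joblib'))
multilabel_binarizer = joblib.load(os.path.join(here, 'multilabel_binarizer.joblib'))
tfidf_vectorizer = joblib.load(os.path.join(here, 'tfidf_vectorizer.joblib'))
decode_genres = joblib.load(os.path.join(here, 'decode_genres.joblib'))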

def infer_genres(text
                 ,model=model
                 ,binarizer=multilabel_binarizer
                 ,vectorizer=tfidf_vectorizer
                 ,decoder=decode_genres
                ):
    text = clean_text(text)
    text = remove_stopwords(text)
    q_vec = vectorizer.transform([text])
    q_pred = model.predict(q_vec)
    genres = binarizer.inverse_transform(q_pred)
    genres = [decoder[item] for t in genres for item in t]
    return genres

if __name__ == "__main__":
    print(infer_genres("this is the plot of a romantic comedy movie"))
    print(infer_genres("this is the plot of an action movie"))

Overwriting ../outputs/clf_logistic/inference_script.py
