Daniel Rocha Ruiz, MSc in Data Science and Business Analytics

# Multi-Label Classification in Python

Tutorials:
- https://www.depends-on-the-definition.com/guide-to-multi-label-classification-with-neural-networks/
- https://towardsdatascience.com/multi-label-image-classification-with-neural-network-keras-ddc1ab1afede
- https://www.analyticsvidhya.com/blog/2019/04/predicting-movie-genres-nlp-multi-label-classification/
- https://stackoverflow.com/questions/38246559/how-to-create-a-heat-map-in-python-that-ranges-from-green-to-red

Dataset:
- www.cs.cmu.edu/~ark/personas/data/MovieSummaries.tar.gz

Summary:
- In this exercise, we use TensorFlow to build neural networks and perform multi-label classification.
- The dataset contains text data on many different movies. Our task is to create a model that predicts the genres of each movie from its plot summary. As each movie may have one or more genres, this is multi-label classification (i.e. each genre is a label, and a movie may have more than one label).

- One technicality:
    - Multi-class means choosing exactly one category out of many;
    - Multi-label means choosing one or more categories out of many, so a single movie can carry several genres at once.
    
Notebooks:
- In *notebook 01* we will have a look at the *metadata* to understand a bit more about the labels.
- In *notebook 02* we will clean the text data of movie plots, and create a dataset that we can use in model training.
- In *notebook 03* we will train a baseline model using a logistic regression.
- In *notebook 04* we will finally train the neural network.

# Set-up
## Environment
- First, create the *tf_env* environment following the instructions in the README.
- Alternatively, you can use an environment of your own that has tensorflow installed.

## Packages

In [170]:
# a few tweaks
import sys
sys.path.append("../")

%matplotlib inline

%load_ext autoreload
%autoreload 2

# import packages
import warnings
from tqdm import tqdm
import joblib
import shutil

# neural networks
from tensorflow import keras
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# text processing
import csv
import json
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from utils.functions import *

# graphs
import matplotlib.pyplot as plt 
import seaborn as sns
from matplotlib.colors import LinearSegmentedColormap

# maths
import pandas as pd
import numpy as np

## Load the dataset

Let's load the dataset prepared in the previous notebook.

In [96]:
# replace path if different
df = pd.read_parquet("../data/cleaned_dataset.parquet")

# for ease
df.columns = ["x","y"]

print(df.shape)
df.head()

(41793, 2)


| movie_id | x | y |
|---|---|---|
| 23890098 | shlykov hard working taxi driver lyosha saxoph... | [114, 359] |
| 31186339 | nation panem consists wealthy capitol twelve p... | [2, 300, 5, 114] |
| 20663735 | poovalli induchoodan sentenced six years priso... | [232, 46, 2, 114] |
| 2231378 | lemon drop kid new york city swindler illegall... | [75, 302] |
| 595909 | seventh day adventist church pastor michael ch... | [98, 359, 109, 114, 93] |


# Data Formatting
A few last-mile adjustments are needed:
- The Y data needs to be converted into a binary indicator matrix (one column per genre). This conversion doesn't leak information between rows, so it's performed on the whole Y data at once.
- The X data is featurized with TF-IDF. This conversion could leak information, because the vocabulary and the IDF weights are learned from the texts. So, it's fitted on the training data, and only then applied to the validation data, to avoid leakage (i.e. optimistically biased validation scores).
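
As a toy illustration of the two steps above, the sketch below binarizes the labels in one go but fits the TF-IDF vocabulary on the training texts only. The texts, genre names and variable names are made up for the example and are not taken from the movie dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data, not from the movie dataset
train_texts = ["taxi driver meets saxophone player",
               "swindler loses money in new york"]
val_texts = ["driver loses taxi in spaceship"]
labels = [["Drama", "Comedy"], ["Comedy"], ["Drama", "Thriller"]]

# Labels: each row becomes a 0/1 vector with one column per genre
mlb = MultiLabelBinarizer()
y_toy = mlb.fit_transform(labels)
print(mlb.classes_)
print(y_toy)

# Features: vocabulary and IDF weights are learned from the training texts only
vectorizer = TfidfVectorizer()
x_train_toy = vectorizer.fit_transform(train_texts)
x_val_toy = vectorizer.transform(val_texts)

# "spaceship" never appears in the training texts, so it gets no column
print("spaceship" in vectorizer.vocabulary_)
print(x_train_toy.shape, x_val_toy.shape)
```

Words that only occur in the validation texts are simply ignored, which is the price of keeping the validation data unseen during fitting.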

In [1]:
# Y data
multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(df['y'])
y = multilabel_binarizer.transform(df['y'])

# train-val split
xtrain, xval, ytrain, yval = train_test_split(df['x'], y, test_size=0.2, random_state=42)
print("X Training data:", xtrain.shape)
print("X Validation data:", xval.shape)
print("Y Training data:", ytrain.shape)
print("Y Validation data:", yval.shape)

# X data
# create TF-IDF features
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000)
xtrain_tfidf = tfidf_vectorizer.fit_transform(xtrain)
xval_tfidf = tfidf_vectorizer.transform(xval)


# Neural Network

## Activation Functions

Softmax:
- The softmax is a generalization of the *logistic* function to more than two classes.
- The probabilities of the different labels sum to 1.

Sigmoid:
- The sigmoid treats each label independently.
    - Classifying a movie as *label1* doesn't change its probability of also being classified as *label2*.
    - So, the probabilities of the different labels do not have to sum to 1.
    - Hence, it is better suited to multi-label classification.

In [11]:
def softmax(scores):
    # exponentiate, then normalize so that the outputs sum to 1
    exp = np.exp(scores)
    return exp / np.sum(exp)

def sigmoid(scores):
    # squash each score independently into the interval (0, 1)
    exp = np.exp(np.negative(scores))
    return 1 / (1 + exp)

sample = [2, -1, .15, 3]
print(softmax(sample))
print(sigmoid(sample))

[0.2547572  0.01268361 0.0400573  0.69250188]
[0.88079708 0.26894142 0.53742985 0.95257413]
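
The two activations also lead to different decision rules. A minimal sketch using the same sample scores: with softmax the single most probable class wins, while with sigmoid every label above a cut-off is kept. The 0.5 threshold is an assumption for illustration and would normally be tuned on validation data.

```python
import numpy as np

scores = np.array([2, -1, .15, 3])

# Multi-class: softmax probabilities compete, so pick the single best class
softmax_probs = np.exp(scores) / np.sum(np.exp(scores))
predicted_class = int(np.argmax(softmax_probs))

# Multi-label: each sigmoid probability is thresholded on its own
sigmoid_probs = 1 / (1 + np.exp(-scores))
threshold = 0.5  # assumed cut-off, not a value from this notebook
predicted_labels = np.flatnonzero(sigmoid_probs >= threshold).tolist()

print(predicted_class)    # index of the winning class
print(predicted_labels)   # indices of all labels above the threshold
```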


## Design the Neural Network

In [None]:
# Toy example: 5 labels -> 1 output node for each label
nn = keras.models.Sequential()
nn.add(keras.layers.Dense(10, activation='relu', input_shape=(10,)))
nn.add(keras.layers.Dense(5, activation='sigmoid'))
nn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Multi-Class Classification: Final Layer = Softmax
model = keras.models.Sequential()
model.add(keras.layers.Conv2D(32, kernel_size=5, strides=2, activation='relu', input_shape=(268, 182, 3)))
model.add(keras.layers.Conv2D(64, kernel_size=3, strides=1, activation='relu'))
# flatten the feature maps into one vector per image before the dense layers
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(128, activation='relu'))
model.add(keras.layers.Dense(8, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Multi-Label Classification: Final Layer = Sigmoid
model = keras.models.Sequential()
model.add(keras.layers.Conv2D(32, kernel_size=5, strides=2, activation='relu', input_shape=(268, 182, 3)))
model.add(keras.layers.Conv2D(64, kernel_size=3, strides=1, activation='relu'))
# flatten the feature maps into one vector per image before the dense layers
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(128, activation='relu'))
model.add(keras.layers.Dense(8, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
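
The sigmoid output layer is paired with `binary_crossentropy` because that loss scores every label as its own yes/no question and then averages over the labels. The NumPy sketch below reproduces this by hand for one made-up movie. The helper name, the clipping constant and the example probabilities are assumptions for illustration, not code from this notebook.

```python
import numpy as np

def binary_crossentropy_by_hand(y_true, y_prob, eps=1e-7):
    """Average of the per-label log losses for one sample."""
    # clip to avoid log(0); eps is an assumed small constant
    y_prob = np.clip(y_prob, eps, 1 - eps)
    per_label = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(per_label))

# Hypothetical movie with two true genres out of four
y_true = np.array([1, 0, 0, 1])
good = np.array([0.9, 0.1, 0.2, 0.8])  # right on both genres
bad = np.array([0.9, 0.1, 0.2, 0.1])   # misses the second genre

print(binary_crossentropy_by_hand(y_true, good))
print(binary_crossentropy_by_hand(y_true, bad))
```

Missing one of the two genres raises the loss even though the other three outputs are unchanged, which is the behaviour we want when several labels can be true at once.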

# Model Training

## Dataset fitting



OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
          n_jobs=None)
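
The estimator printed above is a one-vs-rest logistic regression, which fits one binary classifier per label. A minimal sketch of how such a model can be fitted and scored is given below. It uses made-up random data rather than the TF-IDF features, and the 0.3 threshold is an illustrative assumption, not a value confirmed by this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical toy problem: 2 features, 2 labels, each label driven by one feature
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(200, 2))
y_toy = (x_toy > 0).astype(int)

# One logistic regression is trained per label column
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(x_toy, y_toy)

# Hard predictions: each label is decided at a probability of 0.5
y_pred_default = clf.predict(x_toy)

# A lower cut-off predicts more labels, trading precision for recall
threshold = 0.3  # assumed value for illustration
y_prob = clf.predict_proba(x_toy)
y_pred_low = (y_prob >= threshold).astype(int)

# Micro-averaged F1 pools all (movie, label) decisions together
print(f1_score(y_toy, y_pred_default, average="micro"))
print(f1_score(y_toy, y_pred_low, average="micro"))
```

On real data with many rare genres, lowering the threshold often helps recall, but the best value has to be checked on the validation set.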

In [56]:
y_pred[3]

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

('Action', 'Drama')

0.31540448604823657

0.4378137883178906

Movie:  The Boys Next Door 
Predicted genre:  [()]
Actual genre:  ['Crime Fiction', 'Thriller', 'Drama', 'Indie'] 

Movie:  Formosa Betrayed 
Predicted genre:  [('Action', 'Thriller')]
Actual genre:  ['Crime Fiction', 'Thriller', 'Mystery', 'Period piece', 'Drama', 'Political thriller', 'Crime Thriller', 'Political drama'] 

Movie:  Isn't Life Wonderful 
Predicted genre:  [('Drama',)]
Actual genre:  ['Silent film', 'Drama', 'Indie', 'Black-and-white'] 

Movie:  Belle Starr 
Predicted genre:  [('Drama',)]
Actual genre:  ['Western'] 

Movie:  Single Room Furnished 
Predicted genre:  [('Drama', 'Romance Film')]
Actual genre:  ['Drama'] 
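
The tuples shown as predicted genres have the shape that `MultiLabelBinarizer.inverse_transform` returns: one tuple of label names per row, and an empty tuple when no label is predicted. A small sketch with made-up labels and predictions (not the fitted binarizer from this notebook):

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical genre lists standing in for the real labels
mlb = MultiLabelBinarizer()
mlb.fit([["Action", "Drama"], ["Drama", "Romance Film"], ["Thriller"]])

# Three made-up prediction rows: two genres, one genre, no genre
y_pred_toy = np.array([
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])

# Map each 0/1 row back to a tuple of genre names
decoded = mlb.inverse_transform(y_pred_toy)
print(decoded)  # [('Action', 'Drama'), ('Drama',), ()]
```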

