Daniel Rocha Ruiz, MSc in Data Science and Business Analytics

# Multi-Label Classification in Python

Tutorials:
- https://www.depends-on-the-definition.com/guide-to-multi-label-classification-with-neural-networks/
- https://towardsdatascience.com/multi-label-image-classification-with-neural-network-keras-ddc1ab1afede
- https://www.analyticsvidhya.com/blog/2019/04/predicting-movie-genres-nlp-multi-label-classification/

Dataset:
- www.cs.cmu.edu/~ark/personas/data/MovieSummaries.tar.gz

Summary:
- In this exercise, we use TensorFlow to build neural networks and perform multi-label classification.
- The dataset contains plot summaries for many different movies. Our task is to create a model that predicts the genres of each movie from its summary. Because a movie may have one or more genres, this is multi-label classification: each genre is a label, and a movie may carry several labels at once.


- One technicality:
    - Multi-class: each example gets exactly one category out of many, and the categories are mutually exclusive;
    - Multi-label: each example can get several categories out of many at the same time.
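
The distinction shows up directly in how the targets are encoded. Here is a minimal sketch, assuming a hypothetical four-genre vocabulary (`genres`) and an illustrative helper (`encode`); neither comes from the dataset:

```python
# Hypothetical genre vocabulary, used only for this illustration.
genres = ["Action", "Comedy", "Drama", "Thriller"]

def encode(labels, vocabulary):
    """Return a 0/1 indicator vector with a 1 for every label present."""
    return [1 if g in labels else 0 for g in vocabulary]

multi_class_target = encode({"Drama"}, genres)               # exactly one 1
multi_label_target = encode({"Action", "Thriller"}, genres)  # several 1s allowed

print(multi_class_target)  # [0, 0, 1, 0]
print(multi_label_target)  # [1, 0, 0, 1]
```

A multi-class target always sums to 1, while a multi-label target can sum to any value up to the number of genres.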

# Set-up

## Select the environment
- First, if not done yet, create the *tf_env* environment following the instructions in the README.
- Alternatively, you can use an environment of your own that has TensorFlow installed.

## Import Packages

In [8]:
# models and metrics
from tensorflow import keras
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# text processing
import csv
import json
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from tqdm import tqdm

# graphs
import matplotlib.pyplot as plt 
import seaborn as sns

# maths
import pandas as pd
import numpy as np

%matplotlib inline

## Load the dataset

Let's load the dataset prepared in the previous notebook.

In [2]:
# replace path if different
df = pd.read_parquet("data/cleaned_dataset.parquet")

# for ease
df.columns = ["x","y"]

print(df.shape)
df.head()

(41793, 2)


Unnamed: 0,x,y
0,shlykov hard working taxi driver lyosha saxoph...,"[114, 359]"
1,nation panem consists wealthy capitol twelve p...,"[2, 300, 5, 114]"
2,poovalli induchoodan sentenced six years priso...,"[232, 46, 2, 114]"
3,lemon drop kid new york city swindler illegall...,"[75, 302]"
4,seventh day adventist church pastor michael ch...,"[98, 359, 109, 114, 93]"




# Baseline Prediction
First, we train a logistic regression model to use as a baseline.

## Data Formatting
A few last-mile adjustments are needed:
- The Y data needs to be converted into a binary indicator matrix, with one column per genre. This conversion doesn't leak information, so it's performed on the whole Y data at once.
- The X data is featurized with TF-IDF. This conversion could leak information, because the vocabulary and the IDF weights are learned from the text. So, it's fitted on the training data only, and then applied to the validation data, to avoid any leakage (i.e. bias).
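
The two conversions can be illustrated on toy data before applying them to the real dataset. This is only a sketch: the sentences and label ids below (`toy_train_x`, `toy_val_x`, `toy_y`) are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data, invented for this illustration only.
toy_train_x = ["taxi driver meets saxophonist", "swindler in new york city"]
toy_val_x = ["taxi driver in tokyo"]
toy_y = [[114, 359], [75, 302], [114]]

# Labels: binarizing is a fixed re-encoding, so it is applied to all rows at once.
mlb = MultiLabelBinarizer()
y_toy = mlb.fit_transform(toy_y)
print(mlb.classes_.tolist())  # [75, 114, 302, 359]
print(y_toy.tolist())         # [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 0]]

# Features: vocabulary and IDF weights are learned from the training text only.
vec = TfidfVectorizer()
x_train_toy = vec.fit_transform(toy_train_x)
x_val_toy = vec.transform(toy_val_x)  # "tokyo" was never seen, so it is ignored

print("tokyo" in vec.vocabulary_)     # False
print(x_train_toy.shape, x_val_toy.shape)
```

Words that appear only in the validation text do not enter the vocabulary, which is the behaviour we want when the validation set stands in for unseen data.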

In [3]:
# Y data
multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(df['y'])

# transform target variable
y = multilabel_binarizer.transform(df['y'])

In [4]:
# train-val split
xtrain, xval, ytrain, yval = train_test_split(df['x'], y, test_size=0.2, random_state=42)
print("X Training data:", xtrain.shape)
print("X Validation data:", xval.shape)
print("Y Training data:", ytrain.shape)
print("Y Validation data:", yval.shape)

X Training data: (33434,)
X Validation data: (8359,)
Y Training data: (33434, 363)
Y Validation data: (8359, 363)


In [5]:
# X data
# create TF-IDF features
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000)
xtrain_tfidf = tfidf_vectorizer.fit_transform(xtrain)
xval_tfidf = tfidf_vectorizer.transform(xval)

## Training the Classifier

Steps:
- For each label, we train a logistic regression. So, we will train 363 different logistic regressions.
- The output of each logistic regression is evaluated against a threshold, and the result has a boolean interpretation (e.g. Is this the summary of an Action movie?).
- These will be combined with the *OneVsRestClassifier* object.
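
The steps above can be sketched by hand on synthetic data. This is a rough illustration of the one-model-per-label idea, not a reproduction of scikit-learn's internals. The data (`X`, `Y`) is randomly generated for the example, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data, invented for illustration: 200 samples, 5 features, 3 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Label j is 1 when feature j is positive, so every column contains both classes.
Y = (X[:, :3] > 0).astype(int)

# One binary logistic regression per label column.
models = [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]

# Column j holds P(label j = 1); the columns are not forced to sum to 1.
proba = np.column_stack([m.predict_proba(X)[:, 1] for m in models])

# Comparing each probability with a threshold gives a yes/no answer per label.
threshold = 0.5
Y_hat = (proba >= threshold).astype(int)

print(proba.shape)
print("training accuracy per label:", (Y_hat == Y).mean(axis=0))
```

With 363 labels the same loop means 363 independent fits, which explains the long training time of the real classifier.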

In [9]:
%%time

# create classifier objects
lr = LogisticRegression()
clf = OneVsRestClassifier(lr)

# fit model on train data
clf.fit(xtrain_tfidf, ytrain)
print("N-Classes", clf.n_classes_)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[[0 0 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
CPU times: total: 2min 38s
Wall time: 2min 42s


## Performance evaluation

In [16]:
#####################################
# predictions for validation set
print("Validation Set, Default Threshold")
y_pred = clf.predict(xval_tfidf)
print("F1-Score:", f1_score(yval, y_pred, average="micro"))

# predictions for validation set with custom threshold t
print("Validation Set, Custom Threshold")
y_pred_prob = clf.predict_proba(xval_tfidf)
t = 0.3
y_pred_new = (y_pred_prob >= t).astype(int)
print("F1-Score:", f1_score(yval, y_pred_new, average="micro"))

Validation Set, Default Threshold
F1-Score: 0.3173826632013454
Validation Set, Custom Threshold
F1-Score: 0.44040165867143677
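
To see why a lower threshold can raise the micro-averaged F1, here is a small NumPy sketch. The targets and probabilities (`y_true_toy`, `y_prob_toy`) are invented, and `micro_f1` is an illustrative helper that pools true positives, false positives and false negatives over all labels before computing F1.

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all labels, then compute F1."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

# Toy targets and probabilities, invented for illustration (3 samples, 3 labels).
y_true_toy = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_prob_toy = np.array([[0.9, 0.2, 0.4], [0.1, 0.6, 0.35], [0.45, 0.35, 0.05]])

for t in (0.5, 0.3):
    y_hat = (y_prob_toy >= t).astype(int)
    print(t, round(micro_f1(y_true_toy, y_hat), 3))
# 0.5 0.571
# 0.3 0.909
```

At 0.5 the toy model misses three true labels. At 0.3 it recovers them at the cost of one false positive. The same trade-off between recall and precision is at work in the validation scores above. A threshold tuned on the validation set will look somewhat optimistic on that same set.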


## A few examples

In [18]:
df

Unnamed: 0,x,y
0,shlykov hard working taxi driver lyosha saxoph...,"[114, 359]"
1,nation panem consists wealthy capitol twelve p...,"[2, 300, 5, 114]"
2,poovalli induchoodan sentenced six years priso...,"[232, 46, 2, 114]"
3,lemon drop kid new york city swindler illegall...,"[75, 302]"
4,seventh day adventist church pastor michael ch...,"[98, 359, 109, 114, 93]"
...,...,...
42199,story reema young muslim schoolgirl malabar lo...,[63]
42200,hollywood director leo andreyev looks photogra...,"[352, 196, 261, 43, 306, 114, 255]"
42201,american luthier focuses randy parsons transfo...,"[305, 39, 110, 231]"
42202,abdur rehman khan middle aged dry fruit seller...,[114]


In [None]:
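def infer_tags(q):
    # clean_text and remove_stopwords are not defined in this notebook; they
    # must apply the same cleaning that was used to prepare the dataset.
    q = clean_text(q)
    q = remove_stopwords(q)
    q_vec = tfidf_vectorizer.transform([q])
    q_pred = clf.predict(q_vec)
    return multilabel_binarizer.inverse_transform(q_pred)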
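
The two helpers used by `infer_tags` are not shown here. As a rough idea of what such helpers might look like, here is a sketch with hypothetical stand-ins. The tiny stopword set `TOY_STOPWORDS` and both function bodies are assumptions, and the real helpers may differ.

```python
import re

# Hypothetical stand-ins; the helpers used to prepare the dataset may differ.
TOY_STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def clean_text(text):
    """Keep letters only, lowercase, and collapse repeated whitespace."""
    text = re.sub(r"[^a-zA-Z]", " ", text)
    return " ".join(text.lower().split())

def remove_stopwords(text):
    """Drop every word found in the (toy) stopword set."""
    return " ".join(w for w in text.split() if w not in TOY_STOPWORDS)

cleaned = remove_stopwords(clean_text("The nation of Panem consists of a wealthy Capitol!"))
print(cleaned)  # nation panem consists wealthy capitol
```

Whatever the actual implementation, new text has to go through the same cleaning as the training summaries, otherwise the TF-IDF vocabulary will not match.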

# Neural Network

## Activation Functions

Softmax:
- The softmax is a generalization of the *logistic* function to more than two classes.
- The probabilities of the different labels sum to 1.

Sigmoid:
- The sigmoid is applied to each label independently.
    - Classifying a movie as *label1* doesn't change its probability of also being classified as *label2*.
    - So, the probabilities of different labels do not have to sum to 1.
    - Hence, it is better suited to multi-label classification.

In [11]:
def softmax(scores):
    exp=np.exp(scores)
    scores=exp/np.sum(exp)
    return scores

def sigmoid(scores):
    scores=np.negative(scores)
    exp=np.exp(scores)
    scores=1/(1+exp)
    return scores

sample = [2, -1, .15, 3]
print(softmax(sample))
print(sigmoid(sample))

[0.2547572  0.01268361 0.0400573  0.69250188]
[0.88079708 0.26894142 0.53742985 0.95257413]
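
One practical caveat: exponentiating large scores can overflow. A common remedy is to subtract the maximum score before exponentiating, which leaves the softmax result unchanged. The sketch below uses illustrative names (`stable_softmax`, `elementwise_sigmoid`) and also checks the sum-to-one property discussed above.

```python
import numpy as np

def stable_softmax(scores):
    """Softmax with the maximum subtracted first to avoid overflow in exp."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def elementwise_sigmoid(scores):
    """Sigmoid applied to each score independently."""
    scores = np.asarray(scores, dtype=float)
    return 1.0 / (1.0 + np.exp(-scores))

sample = [2, -1, 0.15, 3]
print(stable_softmax(sample).sum())       # 1.0 up to rounding
print(elementwise_sigmoid(sample).sum())  # about 2.64, not constrained to 1
print(stable_softmax([1000.0, 1001.0]))   # finite values, no overflow
```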


## Design the Neural Network

In [None]:
# 5 classes -> 1 output node for each class

nn = keras.models.Sequential()
nn.add(keras.layers.Dense(10, activation='relu', input_shape=(10,)))
nn.add(keras.layers.Dense(5, activation='sigmoid'))

nn.compile(optimizer='adam', loss='binary_crossentropy' , metrics=['accuracy'])

# Multi-Class Classification

model = keras.models.Sequential()

model.add(keras.layers.Conv2D(32, kernel_size=5, strides=2, activation='relu', input_shape=(268, 182, 3)))
model.add(keras.layers.Conv2D(64, kernel_size=3, strides=1, activation='relu'))       

model.add(keras.layers.Flatten())   # flatten the feature maps before the dense layers
model.add(keras.layers.Dense(128, activation='relu'))
model.add(keras.layers.Dense(8, activation='softmax'))   # Final Layer using Softmax

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Multi-Label Classification

model = keras.models.Sequential()

model.add(keras.layers.Conv2D(32, kernel_size=5, strides=2, activation='relu', input_shape=(268, 182, 3)))
model.add(keras.layers.Conv2D(64, kernel_size=3, strides=1, activation='relu'))       

model.add(keras.layers.Flatten())   # flatten the feature maps before the dense layers
model.add(keras.layers.Dense(128, activation='relu'))
model.add(keras.layers.Dense(8, activation='sigmoid'))   # Final Layer using Sigmoid

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
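
The two setups differ in the loss as well as in the final activation. The NumPy sketch below shows what the two losses compute for a single sample. It is written from the textbook formulas, with a clipping constant `eps` chosen as an assumption, so it may not match the Keras implementation in every detail.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Per sample: mean over labels of -[y*log(p) + (1-y)*log(1-p)]."""
    p = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p), axis=-1)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Per sample: -sum(y*log(p)); assumes each row of y_prob sums to 1."""
    p = np.clip(y_prob, eps, 1 - eps)
    return -np.sum(y_true * np.log(p), axis=-1)

# Multi-label target with two active labels and sigmoid-style outputs.
y_multi = np.array([[1.0, 0.0, 1.0]])
p_sigmoid = np.array([[0.9, 0.2, 0.7]])
print(binary_crossentropy(y_multi, p_sigmoid))        # about 0.228

# Multi-class target (one-hot) with softmax-style outputs that sum to 1.
y_single = np.array([[0.0, 0.0, 1.0]])
p_softmax = np.array([[0.1, 0.2, 0.7]])
print(categorical_crossentropy(y_single, p_softmax))  # about 0.357
```

Binary cross-entropy scores every label as its own yes/no question, so it also penalizes confident predictions for labels that are absent. Categorical cross-entropy only looks at the probability assigned to the single true class.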

# Model Training

## Dataset fitting



OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
          n_jobs=None)

In [56]:
y_pred[3]

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [57]:
multilabel_binarizer.inverse_transform(y_pred)[3]

('Action', 'Drama')

0.31540448604823657

0.4378137883178906

In [64]:
# movies_new (movie titles and genre names) is not defined in this notebook
for i in range(5):
    k = xval.sample(1).index[0]
    print("Movie: ", movies_new['movie_name'][k],
          "\nPredicted genre: ", infer_tags(xval[k]),
          "\nActual genre: ", movies_new['genre_new'][k],
          "\n")

Movie:  The Boys Next Door 
Predicted genre:  [()]
Actual genre:  ['Crime Fiction', 'Thriller', 'Drama', 'Indie'] 

Movie:  Formosa Betrayed 
Predicted genre:  [('Action', 'Thriller')]
Actual genre:  ['Crime Fiction', 'Thriller', 'Mystery', 'Period piece', 'Drama', 'Political thriller', 'Crime Thriller', 'Political drama'] 

Movie:  Isn't Life Wonderful 
Predicted genre:  [('Drama',)]
Actual genre:  ['Silent film', 'Drama', 'Indie', 'Black-and-white'] 

Movie:  Belle Starr 
Predicted genre:  [('Drama',)]
Actual genre:  ['Western'] 

Movie:  Single Room Furnished 
Predicted genre:  [('Drama', 'Romance Film')]
Actual genre:  ['Drama'] 

