## Multi-Label Classification

### Overview

The goal of the modeling step is to develop a final model that effectively predicts the target defined in the problem identification section. Review the types of models that are appropriate for your response variable and the features in your dataset, and build two to three models. In addition to comparing different algorithm types, consider tuning each model's hyperparameters. Be sure to define the metrics you use to choose your final model.

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import multilabel_confusion_matrix

### Load Train and Test Data

In [2]:
train_df = pd.read_csv("../data/train_ready.csv")
test_df = pd.read_csv("../data/test_ready.csv")

In [3]:
train_df.head()

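| Unnamed: 0 | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate | tags | is_clean | ... | youcaltlas continue | young | youre | yourselfgo | yourselfgo fuck | youtube | zero | zionist | zuck | zuckerberg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4812fdf09bc8fc46 | i m afraid that you didn t follow the history ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | d9f2f633dce07c67 | style border spacing px margin px ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 33a8b2393d346005 | clans before using any words | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 3f6fb24b6e8c9a11 | the actors names in parenthesis i m reading ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 5b5399363c42d377 | further to notability mireille astore was one... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |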


In [4]:
test_df.head()

| Unnamed: 0 | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate | tags | is_clean | ... | youcaltlas continue | young | youre | yourselfgo | yourselfgo fuck | youtube | zero | zionist | zuck | zuckerberg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 73e51924905f2dcf | please refrain from removing content from wiki... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 13b1da3b1d0fa0d0 | for example mostafa malekian is a thinker whic... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 351543aa0bba57ee | redirect talk five pence irish coin | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | ebf96fb0a6a8cbb9 | in published dictionaries and | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | ad01d6108ec293a4 | i wouldn t worry about it too much for the mo... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### Separate Labels from Dataset

In [5]:
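# Extract label columns
label_columns = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
# Drop the identifier and raw-text columns. "Unnamed: 0", "tags", and "is_clean"
# remain in the features; if "tags" or "is_clean" are derived from the labels,
# they leak target information and should be dropped as well.
train_df.drop(columns=["id", "comment_text"], inplace=True)
X_train = train_df.drop(columns=label_columns)
y_train = train_df[label_columns]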

test_df.drop(columns=["id", "comment_text"], inplace=True)
X_test = test_df.drop(columns=label_columns)
y_test = test_df[label_columns]

In [6]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(24225, 5005) (24225, 6)
(4952, 5005) (4952, 6)
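The `head()` output shows `tags` and `is_clean` columns next to the labels, and only `id` and `comment_text` are dropped, so both columns end up in the feature matrix. The sketch below is one way to check whether such helper columns can be reconstructed from the labels. The toy frame and the guess that `tags` counts the active labels are assumptions for illustration, not facts about the real dataset.

```python
import pandas as pd

label_columns = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical toy frame that mimics the label and helper columns of train_df.
toy_df = pd.DataFrame(
    {
        "toxic":         [0, 1, 1, 0],
        "severe_toxic":  [0, 0, 1, 0],
        "obscene":       [0, 1, 1, 0],
        "threat":        [0, 0, 0, 0],
        "insult":        [0, 0, 1, 0],
        "identity_hate": [0, 0, 0, 0],
        "tags":          [0, 2, 4, 0],          # assumed: number of active labels
        "is_clean":      [1, 0, 0, 1],          # assumed: 1 when no label is set
        "young":         [0.0, 0.3, 0.0, 0.1],  # stand-in for a text feature
    }
)

label_sum = toy_df[label_columns].sum(axis=1)

# A helper column leaks if it can be rebuilt from the labels alone.
tags_is_label_count = bool((toy_df["tags"] == label_sum).all())
is_clean_is_no_label = bool((toy_df["is_clean"] == (label_sum == 0).astype(int)).all())
print(tags_is_label_count, is_clean_is_no_label)

# Drop every helper column that failed the check before building the features.
checks = [("tags", tags_is_label_count), ("is_clean", is_clean_is_no_label)]
leaky = [name for name, flag in checks if flag]
X_toy = toy_df.drop(columns=label_columns + leaky)
print(list(X_toy.columns))
```

Running the same two comparisons on `train_df` before the split would show whether the reported scores depend on these columns.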


### 1. Multiple Binary Classifications (One vs. Rest)

In [7]:
%%time
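# Apply LogisticRegression with OneVsRestClassifier (one binary classifier per label)
clf_1 = OneVsRestClassifier(LogisticRegression(max_iter=10000)).fit(X_train, y_train)

# X_test predictions
y_pred_1 = clf_1.predict(X_test)
# For multilabel targets, accuracy_score is subset accuracy:
# a row counts as correct only if all six labels match.
acc_1 = accuracy_score(y_test, y_pred_1)
print("Accuracy Score: {:.3f}".format(acc_1))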

Accuracy Score: 0.973
CPU times: user 8min 40s, sys: 19.2 s, total: 9min
Wall time: 1min 23s
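Because `accuracy_score` on multilabel targets is an exact-match (subset) accuracy, it can help to report it next to label-wise metrics when choosing the final model. The sketch below uses a small made-up label matrix (not data from this notebook) to show how subset accuracy, Hamming loss, and micro/macro F1 differ on the same predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Hypothetical toy labels: 4 samples, 3 labels.
y_true = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [0, 0, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 0],
                   [1, 0, 0]])

# Fraction of rows where every label matches.
subset_acc = accuracy_score(y_true, y_pred)
# Fraction of individual label entries that are wrong.
ham = hamming_loss(y_true, y_pred)
# Micro F1 pools all label decisions; macro F1 averages the per-label F1 scores.
micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"subset accuracy={subset_acc:.3f}  hamming loss={ham:.3f}")
print(f"micro F1={micro_f1:.3f}  macro F1={macro_f1:.3f}")
```

Macro F1 weights every label equally, so it is the most sensitive of the four to poor performance on rare labels.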


In [8]:
# Print one 2x2 confusion matrix per label, in the order of label_columns.
# Each matrix is laid out as [[TN, FP], [FN, TP]].
print(multilabel_confusion_matrix(y_test, y_pred_1, labels=clf_1.classes_))

[[[4487   13]
  [   1  451]]

 [[4922    5]
  [   7   18]]

 [[4649   24]
  [  38  241]]

 [[4930    1]
  [  17    4]]

 [[4662   31]
  [  26  233]]

 [[4898    7]
  [  27   20]]]
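The matrices above are easier to compare once they are reduced to per-label precision, recall, and F1. The sketch below copies the printed counts into an array and assumes the matrices follow the order of `label_columns`, with each one laid out as `[[TN, FP], [FN, TP]]` (the layout `multilabel_confusion_matrix` uses).

```python
import numpy as np

label_columns = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Counts copied from the printed output; each matrix is [[TN, FP], [FN, TP]].
mcm = np.array([
    [[4487, 13], [1, 451]],
    [[4922, 5], [7, 18]],
    [[4649, 24], [38, 241]],
    [[4930, 1], [17, 4]],
    [[4662, 31], [26, 233]],
    [[4898, 7], [27, 20]],
])

fp, fn, tp = mcm[:, 0, 1], mcm[:, 1, 0], mcm[:, 1, 1]

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

for name, p, r, f in zip(label_columns, precision, recall, f1):
    print(f"{name:>13}  precision={p:.3f}  recall={r:.3f}  f1={f:.3f}")
```

Under that assumption, `threat` has a recall of 4/21 (about 0.19) and `identity_hate` 20/47 (about 0.43), far below what the overall accuracy of 0.973 suggests for the rare labels.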


### 2. Multiple Binary Classifications (Binary Relevance)


In [9]:
%%time
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.naive_bayes import GaussianNB

# Perform classification per label
clf_2 = BinaryRelevance(classifier=GaussianNB())
clf_2.fit(X=X_train, y=y_train)
y_pred_2 = clf_2.predict(X_test)
acc_2 = accuracy_score(y_test, y_pred_2)
print("Accuracy Score: {:.3f}".format(acc_2))



Accuracy Score: 0.907
CPU times: user 7.84 s, sys: 6.8 s, total: 14.6 s
Wall time: 14 s
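Binary relevance treats each label as its own independent binary problem. As a minimal sketch of that idea, the code below fits one `GaussianNB` per label column on synthetic data from `make_multilabel_classification` (a stand-in for the notebook's features) and compares the result with scikit-learn's `MultiOutputClassifier`, which follows the same per-label strategy. It illustrates the technique and is not a reimplementation of `skmultilearn`'s `BinaryRelevance`.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import GaussianNB

# Synthetic multilabel data: 200 samples, 20 features, 4 labels.
X, Y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=0
)

# Binary relevance by hand: one independent classifier per label column.
models = [GaussianNB().fit(X, Y[:, j]) for j in range(Y.shape[1])]
Y_manual = np.column_stack([m.predict(X) for m in models])

# The scikit-learn wrapper that fits one estimator per target column.
Y_sklearn = MultiOutputClassifier(GaussianNB()).fit(X, Y).predict(X)

same = np.array_equal(Y_manual, Y_sklearn)
print(Y_manual.shape, same)
```

Because the labels are modeled independently, any correlation between them (for example between `toxic` and `insult`) is ignored, which is the limitation classifier chains try to address.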


### 3. Classifier Chains

In [10]:
%%time
from skmultilearn.problem_transform import ClassifierChain

# Build a classifier chain: each label's classifier also receives
# the preceding labels as input features.
clf_3 = ClassifierChain(classifier=LogisticRegression(max_iter=10000))

# Fit the chain on the training data
clf_3.fit(X_train, y_train)
y_pred_3 = clf_3.predict(X_test)
acc_3 = accuracy_score(y_test, y_pred_3)
print("Accuracy Score: {:.3f}".format(acc_3))

Accuracy Score: 0.978
CPU times: user 24min 25s, sys: 34.2 s, total: 24min 59s
Wall time: 3min 12s
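The chaining step can be sketched by hand to make the mechanics explicit. The code below is an illustration on synthetic data, not `skmultilearn`'s implementation: the classifier for label `j` is trained on the features plus the true values of labels `0..j-1`, and at prediction time the earlier predictions take the place of the unknown true labels. The label order and `max_iter=1000` are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression

# Synthetic multilabel data: 300 samples, 20 features, 3 labels.
X, Y = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=3, random_state=0
)
n_labels = Y.shape[1]

# Training: the model for label j sees X plus the true labels 0..j-1.
chain = []
for j in range(n_labels):
    X_aug = np.hstack([X, Y[:, :j]])
    chain.append(LogisticRegression(max_iter=1000).fit(X_aug, Y[:, j]))

# Prediction: true labels are unknown, so earlier predictions are fed forward.
Y_pred = np.zeros((X.shape[0], n_labels), dtype=int)
for j, model in enumerate(chain):
    X_aug = np.hstack([X, Y_pred[:, :j]])
    Y_pred[:, j] = model.predict(X_aug)

print(Y_pred.shape)
```

scikit-learn ships a comparable estimator as `sklearn.multioutput.ClassifierChain`, which also accepts an explicit label `order`.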
