# Overview
Multi-label classification is a machine learning problem in which each sample can carry several non-exclusive classes/labels at the same time. For example, a piece of text can cover several topics at once.

In scikit-learn, some algorithms support multi-label targets out of the box, while others need to be wrapped in a OneVsRestClassifier, with the target labels supplied as a 2D binary matrix. This sample notebook shows how to use one algorithm that supports multi-label classification out of the box and two that need to be wrapped in a OneVsRestClassifier.
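
As a rough illustration of what a 2D binary label matrix looks like, the sketch below converts lists of labels into that format with scikit-learn's `MultiLabelBinarizer`. The topic names and the `raw_labels` variable are made up for this example and are not part of the notebook's data.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical topic labels for four documents (illustrative only).
raw_labels = [
    ["sports", "politics"],
    ["tech"],
    [],                              # a sample may have no labels at all
    ["politics", "tech", "sports"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(raw_labels)    # one row per sample, one column per label

print(mlb.classes_)                  # column order (labels are sorted)
print(Y)
```

Each column corresponds to one label, and a 1 marks that the label applies to the sample in that row.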

In [171]:
# Importing
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB,MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.cross_decomposition import PLSRegression
from sklearn import datasets

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

In [168]:
# Synthetic multi-label data: 1000 samples, 5 features, 3 labels
# (n_labels=2 is the average number of labels per sample)
data = datasets.make_multilabel_classification(
    n_samples=1000, n_features=5, n_classes=3, n_labels=2, random_state=1)
df = pd.DataFrame(data[0], columns=['a', 'b', 'c', 'd', 'e'])
df['t1'] = data[1][:, 0]
df['t2'] = data[1][:, 1]
df['t3'] = data[1][:, 2]
df.head()

     a     b     c     d     e  t1  t2  t3
0  5.0  12.0   9.0  31.0  14.0   0   1   0
1  9.0   7.0   7.0  18.0  10.0   0   1   0
2  9.0   8.0   8.0  10.0   8.0   0   0   0
3  2.0   5.0  16.0   5.0  12.0   1   1   1
4  6.0   3.0  11.0  22.0  13.0   0   1   0


In [169]:
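# .ix has been removed from pandas; .iloc does the same positional slicing
features = df.iloc[:, 0:5]   # columns a..e
targets = df.iloc[:, -3:]    # columns t1..t3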

The following cell runs three different classifiers and prints a classification report for each, computed from cross-validated predictions.

The first classifier, RandomForestClassifier, supports multi-label targets out of the box.
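
A minimal sketch of what "out of the box" means here, assuming a small synthetic dataset generated the same way as the notebook's (the names `X`, `Y`, and `rf` are illustrative): the forest is fitted directly on the 2D label matrix with no wrapper.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic multi-label problem (sizes chosen only for illustration).
X, Y = make_multilabel_classification(
    n_samples=200, n_features=5, n_classes=3, n_labels=2, random_state=1)

# Fit directly on the 2D binary matrix; no OneVsRestClassifier needed.
rf = RandomForestClassifier(n_estimators=20, random_state=1)
rf.fit(X, Y)

pred = rf.predict(X[:5])
print(rf.n_outputs_)   # number of label columns the forest was trained on
print(pred.shape)      # one row per sample, one column per label
```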

The second and third classifiers, Gaussian Naive Bayes and Linear Discriminant Analysis, do not support multi-label targets natively. You need to wrap them in a OneVsRestClassifier, and the multi-label target matrix must be in binary format. Our label data is already binarized, so no further preprocessing is necessary in this example.
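
The wrapping step can be sketched as follows, again on a small synthetic dataset rather than the notebook's own (the names `X`, `Y`, and `ovr` are illustrative). OneVsRestClassifier trains one copy of the base estimator per label column and stacks their predictions back into a binary matrix.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB

# Small synthetic multi-label problem (sizes chosen only for illustration).
X, Y = make_multilabel_classification(
    n_samples=200, n_features=5, n_classes=3, n_labels=2, random_state=1)

# GaussianNB alone expects a 1D target; the wrapper fits one binary
# GaussianNB per label column of Y.
ovr = OneVsRestClassifier(GaussianNB())
ovr.fit(X, Y)

pred = ovr.predict(X[:5])
print(len(ovr.estimators_))   # one fitted estimator per label
print(pred.shape)             # predictions come back as a 2D binary matrix
```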


In [170]:
def multiLabel_Classifiers(model):
    # Print the model, then a per-label report on cross-validated predictions
    print(model)
    pred = cross_val_predict(model, features, targets)
    print(classification_report(targets, pred, target_names=['t1', 't2', 't3']))
    print('')

models = [RandomForestClassifier(n_estimators=100, random_state=1, n_jobs=-1),
          OneVsRestClassifier(GaussianNB()),
          OneVsRestClassifier(LinearDiscriminantAnalysis())]

for model in models:
    multiLabel_Classifiers(model)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=-1, oob_score=False, random_state=1,
            verbose=0, warm_start=False)
             precision    recall  f1-score   support

         t1       0.90      0.93      0.92       643
         t2       0.92      0.95      0.93       723
         t3       0.69      0.58      0.63       215

avg / total       0.88      0.89      0.89      1581


OneVsRestClassifier(estimator=GaussianNB(priors=None), n_jobs=1)
             precision    recall  f1-score   support

         t1       0.91      0.90      0.90       643
         t2       0.91      0.94      0.92       723
         t3       0.65      0.65      0.65       215

avg / total       0.87      0.88      0.88      1581


OneVsRestClassi