## Multilabel Classification with scikit-multilearn

### 1. Introduction

Supervised machine learning problems are typically grouped into classification and regression. Within classification we sometimes encounter multiclass problems, where the target is not binary and each sample must be assigned exactly one of `n` classes.

In multilabel classification, instead of one target variable $y$, we have multiple target variables $y_1$, $y_2$, ..., $y_n$. For example, an image can contain several objects that all need to be identified, or we may want to predict which combination of products a customer will buy.

Certain decision tree based algorithms in [Scikit-Learn](http://scikit-learn.org/stable/modules/multiclass.html) are naturally able to handle multilabel classification. [scikit-multilearn](http://scikit.ml/) leverages Scikit-Learn and is built specifically for multilabel problems. 


### 2. Datasets

We use the AmazonCat-13K dataset to explore different multilabel algorithms available in Scikit-Multilearn. Our goal is not to optimize classifier performance but to explore the various algorithms applicable to multilabel classification problems. The dataset is fairly large, with over 1.1M training points and 300k test points. There are over 200k features and 13k labels.

This dataset was chosen in order to work with a fairly large dataset that illustrates the difficulties of multilabel classification, instead of a toy example. In particular, when there are $N$ labels, the number of possible label combinations grows exponentially, to $2^N$. A list of multilabel datasets can be found at Manik Varma's [Extreme Classification Repository](http://manikvarma.org/downloads/XC/XMLRepository.html). The data is provided in a sparse text format and the authors only provide Matlab scripts to convert it, so some data wrangling is needed in Python to handle it.
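To make the sparse text format concrete, here is a minimal sketch of parsing a single record. It assumes each data line has the form `label,label,... index:value index:value ...`, which is the layout the loading code in this section relies on. The function name `parse_record` and the sample lines are hypothetical and only serve as illustration.

```python
def parse_record(line):
    """Split one sparse-format line into (labels, {feature_index: value})."""
    label_field, *feature_fields = line.rstrip("\n").split(" ")
    # An empty label field means the row carries no labels.
    labels = [int(s) for s in label_field.split(",")] if label_field else []
    features = {}
    for field in feature_fields:
        index, value = field.split(":")
        features[int(index)] = float(value)
    return labels, features


# Hypothetical record: labels 3 and 17, two non-zero features.
labels, features = parse_record("3,17 5:0.25 42:1.5\n")
print(labels)    # [3, 17]
print(features)  # {5: 0.25, 42: 1.5}
```

Only the non-zero features are stored, which is why the full matrices are later assembled with `scipy.sparse` rather than as dense arrays.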


In [1]:
from scipy import sparse
import gc

# Local path to the training file; adjust for your machine
with open(r'C:\Users\kaoyuant\Downloads\RCV1-x\rcv1x_train.txt', 'r', encoding='utf-8') as f:
    # Header line: number of rows, features, and labels
    nrows, nfeature, nlabel = [int(s) for s in f.readline().split()]
    x_m = [[] for i in range(nrows)]  # non-zero feature values per row
    pos = [[] for i in range(nrows)]  # column indices of those values
    y_m = [[] for i in range(nrows)]  # label indices per row

    # Each data line reads "label,label,... index:value index:value ..."
    for i in range(nrows):
        temp = f.readline().rstrip('\n').split(' ')
        pos[i] = [int(s.split(':')[0]) for s in temp[1:]]
        x_m[i] = [float(s.split(':')[1]) for s in temp[1:]]
        try:
            y_m[i] = [int(s) for s in temp[0].split(',')]
        except ValueError:
            # Empty or malformed label field: treat the row as unlabeled
            y_m[i] = []

x_train=sparse.lil_matrix((nrows,nfeature))
for i in range(nrows):
    for j in range(len(pos[i])):
        x_train[i,pos[i][j]]=x_m[i][j]

del x_m,pos
gc.collect()

y_train=sparse.lil_matrix((nrows,nlabel))
for i in range(nrows):
    for j in y_m[i]:
        y_train[i,j]=1

del y_m
gc.collect()  



0

In [None]:
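# Faster alternative to filling a lil_matrix element by element: build the
# matrix in one call from coordinate lists. This needs `pos` from the loading
# cell, so run it before `del x_m,pos`.
r = [i for i in range(nrows) for _ in pos[i]]  # row index of each non-zero entry
c = [j for row in pos for j in row]  # column index of each non-zero entry
data = [1] * len(r)  # binary indicator; use the values in x_m to keep feature weights
m = sparse.coo_matrix((data, (r, c)), shape=(nrows, nfeature))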

In [None]:


# Local path to the test file; adjust for your machine
with open(r'C:\Users\kaoyuant\Downloads\RCV1-x\rcv1x_test.txt', 'r', encoding='utf-8') as f:
    # Header line: number of rows, features, and labels
    nrows, nfeature, nlabel = [int(s) for s in f.readline().split()]
    x_m = [[] for i in range(nrows)]  # non-zero feature values per row
    pos = [[] for i in range(nrows)]  # column indices of those values
    y_m = [[] for i in range(nrows)]  # label indices per row

    # Each data line reads "label,label,... index:value index:value ..."
    for i in range(nrows):
        temp = f.readline().rstrip('\n').split(' ')
        pos[i] = [int(s.split(':')[0]) for s in temp[1:]]
        x_m[i] = [float(s.split(':')[1]) for s in temp[1:]]
        try:
            y_m[i] = [int(s) for s in temp[0].split(',')]
        except ValueError:
            # Empty or malformed label field: treat the row as unlabeled
            y_m[i] = []


x_test=sparse.lil_matrix((nrows,nfeature))
for i in range(nrows):
    for j in range(len(pos[i])):
        x_test[i,pos[i][j]]=x_m[i][j]

del x_m,pos
gc.collect()

y_test=sparse.lil_matrix((nrows,nlabel))
for i in range(nrows):
    for j in y_m[i]:
        y_test[i,j]=1

del y_m
gc.collect()  

### 3. Metric  

Before going into the details of each multilabel classification method, we select a metric to gauge how well each algorithm performs. As in ordinary classification, we can use `Hamming Loss`, `Accuracy`, `Precision`, `Recall`, `Jaccard Similarity`, and `F1 Score`, all of which are available in Scikit-Learn.

Going forward we'll choose the `F1 Score`, as it is the harmonic mean of `Precision` and `Recall` and so balances the two. It is also helpful to plot the confusion matrix to understand how the classifier is performing, but in our case there are too many labels to visualize.
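With several labels per sample, the F1 score has to be aggregated across labels in some way. As a minimal sketch, the snippet below computes the micro-averaged variant by hand, pooling true positives, false positives, and false negatives over every (sample, label) cell. The helper name `micro_f1` and the toy matrices are hypothetical; Scikit-Learn's `f1_score` with `average='micro'` computes the same quantity.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for binary indicator rows (lists of 0/1)."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1  # label correctly predicted
            elif t == 0 and p == 1:
                fp += 1  # label predicted but absent
            elif t == 1 and p == 0:
                fn += 1  # label present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example: 2 samples, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
print(micro_f1(y_true, y_pred))  # 2 TP, 1 FP, 1 FN -> 4 / 6
```

Micro-averaging weights every label assignment equally, so frequent labels dominate the score. Macro-averaging (the mean of per-label F1 scores) would instead give rare labels the same weight as common ones.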





### 4. Problem Transformation

#### Binary Relevance 

Binary relevance is simple: each target variable ($y_1$, $y_2$,..,$y_n$) is treated independently, which reduces the task to $n$ separate binary classification problems. `Scikit-Multilearn` implements this for us, saving us the hassle of splitting the dataset and training each classifier separately.
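As a rough illustration of the transformation itself (not of how `Scikit-Multilearn` implements it internally), the sketch below splits a small label indicator matrix into one binary target vector per label. The helper name `binary_relevance_targets` and the toy matrix are hypothetical.

```python
def binary_relevance_targets(Y):
    """Return one binary target vector per label (the columns of Y)."""
    n_labels = len(Y[0])
    return [[row[j] for row in Y] for j in range(n_labels)]


# Toy indicator matrix: 3 samples, 3 labels.
Y = [[1, 0, 1],
     [0, 1, 1],
     [1, 1, 0]]

for j, target in enumerate(binary_relevance_targets(Y)):
    # Each target would be fitted by its own independent binary classifier.
    print(f"classifier {j} is trained on target {target}")
```

Every classifier sees the same feature matrix and only its own column of labels, so any correlation between labels is ignored.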

In [None]:
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.ensemble import RandomForestClassifier
import time

classifier = BinaryRelevance(
    classifier = RandomForestClassifier(),
    require_dense = [False, True]
)

start=time.time()
classifier.fit(x_train, y_train)
y_hat = classifier.predict(x_test)
print(time.time()-start)

In [None]:
import sklearn.metrics as metrics

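# Multilabel targets need an explicit averaging mode; the default average='binary' raises an error
metrics.f1_score(y_test, y_hat, average='micro')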

#### Label Powerset

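This method transforms the problem into a multiclass classification problem: the target variables ($y_1$, $y_2$,..,$y_n$) are combined, and each distinct combination is treated as a unique class. With many labels this produces a very large number of classes, up to $2^n$ in principle, although only the combinations that occur in the training data are used.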
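The mapping from label combinations to classes can be sketched in a few lines. This is only an illustration of the idea under simple assumptions (class ids assigned in order of first appearance); the helper name `label_powerset_classes` and the toy matrix are hypothetical.

```python
def label_powerset_classes(Y):
    """Map each distinct row of Y (a label combination) to a class id."""
    class_of = {}  # label combination -> class id
    classes = []   # class id for every sample
    for row in Y:
        key = tuple(row)
        if key not in class_of:
            class_of[key] = len(class_of)
        classes.append(class_of[key])
    return classes, class_of


# Toy indicator matrix: 4 samples, 3 labels.
Y = [[1, 0, 1],
     [0, 1, 1],
     [1, 0, 1],
     [0, 0, 0]]

classes, class_of = label_powerset_classes(Y)
print(classes)  # [0, 1, 0, 2]
print(len(class_of), "classes observed out of", 2 ** len(Y[0]), "possible")
```

A single multiclass classifier is then trained on `classes`, and its predictions are mapped back to label combinations through the inverse of `class_of`. A combination that never appears in the training data can never be predicted.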

In [None]:
from skmultilearn.problem_transform import LabelPowerset

classifier = LabelPowerset(
    classifier = RandomForestClassifier(),
    require_dense = [False, True]
)

start=time.time()
classifier.fit(x_train, y_train)
y_hat = classifier.predict(x_test)
print(time.time()-start)

In [None]:
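# Multilabel targets need an explicit averaging mode; the default average='binary' raises an error
metrics.f1_score(y_test, y_hat, average='micro')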

#### Classifier Chains 

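Classifier chains are akin to binary relevance, but the target variables ($y_1$, $y_2$,..,$y_n$) are not treated as fully independent: the classifiers are arranged in a sequence, and each one receives the labels earlier in the chain as additional input features.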
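The chaining idea can be sketched by building the training set each classifier in the chain would see. This is a simplified illustration that assumes the labels are chained in column order and that the true labels are used as extra features during training; the helper name `chain_training_sets` and the toy data are hypothetical.

```python
def chain_training_sets(X, Y):
    """Build (features, target) for each link of a classifier chain."""
    n_labels = len(Y[0])
    sets = []
    for j in range(n_labels):
        # Classifier j sees the original features plus labels 0..j-1.
        features = [x + y[:j] for x, y in zip(X, Y)]
        target = [y[j] for y in Y]
        sets.append((features, target))
    return sets


# Toy data: 2 samples, 2 features, 3 labels.
X = [[0.5, 1.0], [2.0, 0.0]]
Y = [[1, 0, 1], [0, 1, 1]]

for j, (features, target) in enumerate(chain_training_sets(X, Y)):
    print(j, features, target)
```

At prediction time the true labels are unknown, so each classifier is fed the predictions of the classifiers before it. An early mistake can therefore propagate down the chain.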

In [None]:
from skmultilearn.problem_transform import ClassifierChain

classifier = ClassifierChain(
    classifier = RandomForestClassifier(),
    require_dense = [False, True]
)

start=time.time()
classifier.fit(x_train, y_train)
y_hat = classifier.predict(x_test)
print(time.time()-start)

In [None]:
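# Multilabel targets need an explicit averaging mode; the default average='binary' raises an error
metrics.f1_score(y_test, y_hat, average='micro')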

### FastXML

In [18]:
import os
os.chdir(r'C:\Users\kaoyuant\github\fastxml\fastxml')
import pyximport
pyximport.install()
from fastxml import Trainer, Inferencer


SystemError: Parent module '' not loaded, cannot perform relative import

In [16]:
import inferencer



Error compiling Cython file:
------------------------------------------------------------
...
from .utility cimport pair

cdef extern from "<unordered_map>" namespace "std" nogil:
    cdef cppclass unordered_map[T, U]:
        ctypedef T key_type
       ^
------------------------------------------------------------

C:\Users\kaoyuant\Anaconda3\lib\site-packages\Cython\Includes\libcpp\unordered_map.pxd:5:8: Expected an identifier, found 'ctypedef'

Error compiling Cython file:
------------------------------------------------------------
...
from .utility cimport pair

cdef extern from "<unordered_map>" namespace "std" nogil:
    cdef cppclass unordered_map[T, U]:
        ctypedef T key_type
                  ^
------------------------------------------------------------

C:\Users\kaoyuant\Anaconda3\lib\site-packages\Cython\Includes\libcpp\unordered_map.pxd:5:19: Syntax error in C variable declaration

Error compiling Cython file:
--------------------------------------------------------

ImportError: Building module inferencer failed: ["distutils.errors.CompileError: command 'C:\\\\Program Files (x86)\\\\Microsoft Visual Studio 14.0\\\\VC\\\\BIN\\\\x86_amd64\\\\cl.exe' failed with exit status 2\n"]

In [13]:
from inferencer import IForest, LeafComputer, Blender, IForestBlender



ImportError: Building module inferencer failed: ["distutils.errors.CompileError: command 'C:\\\\Program Files (x86)\\\\Microsoft Visual Studio 14.0\\\\VC\\\\BIN\\\\x86_amd64\\\\cl.exe' failed with exit status 2\n"]

In [19]:
from sklearn.ensemble import RandomForestClassifier
import time

#y_train=y_train.todense()
classifier = RandomForestClassifier()
start=time.time()
classifier.fit(x_train, y_train)
#y_hat = classifier.predict(x_test)
print(time.time()-start)

ValueError: Unknown label type: 'unknown'

In [13]:
y_train.shape

(623847, 2456)

In [14]:
x_train.shape

(623847, 47236)

In [15]:
type(y_train)

scipy.sparse.lil.lil_matrix

In [16]:
type(x_train)

scipy.sparse.lil.lil_matrix

In [9]:
import pyximport

In [20]:
cimport cython
cimport numpy as np

SyntaxError: invalid syntax (<ipython-input-20-1f4a457f2f4b>, line 1)