# Multilabel classification using RandomForests
This notebook walks through multilabel classification using sklearn's MultiLabelBinarizer and RandomForestClassifier. The input data is text, and the target for each observation is a collection of labels.

## Import revscoring and sklearn

In [1]:
from revscoring.languages import english
from revscoring.datasources.meta import (frequencies, gramming, hashing,
                                         mappers)
from revscoring.utilities.util import dump_observation, read_observations
from revscoring.datasources import revision_oriented as ro
from revscoring.dependencies import solve
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
import json
import pdb

## Prepare data
Read the JSON observations, one per line, into a list.

In [2]:
dataset = 'enwiki.labeled_wikiprojects.w_text_500.json'
obs = []
with open(dataset, 'r') as fin:
    for line in fin:
        obs.append(json.loads(line))
obs[0].keys()

dict_keys(['page_title', 'templates', 'rev_id', 'mid_level_categories', 'text', 'page_id'])

## Prepare a global corpus collection for fitting
Collect the text of every observation into a corpus list so that CountVectorizer can fit its vocabulary.
Do the same for the labels so that they can be fit using MultiLabelBinarizer.

In [3]:
corpus = []
labels = []
for ob in obs:
    corpus.append(ob['text'])
    labels.append(ob['mid_level_categories'])
labels[0]

['Biology']

## Fit data
Fit the vocabulary on the corpus using CountVectorizer, and fit the labels using [MultiLabelBinarizer](http://scikit-learn.org/stable/modules/multiclass.html#multilabel-classification-format).
Transforming the list of labels per observation yields a matrix of shape [n_samples, n_classes], where '1' marks a label as present for that observation and '0' marks it as absent.

In [4]:
vectorizer = CountVectorizer()
corpus_transformed = vectorizer.fit_transform(corpus)
mlb = MultiLabelBinarizer()
label_matrix = mlb.fit_transform(labels)
label_matrix[0]

array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
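
To make the indicator format concrete, here is a minimal sketch on toy data. The label lists and the `toy_` names are hypothetical and are not taken from the dataset. It shows that column i of the matrix corresponds to `classes_[i]`, and that an observation with no labels becomes a row of zeros.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label lists (hypothetical; not taken from the dataset)
toy_labels = [['Biology'], ['History', 'Geography'], []]

toy_mlb = MultiLabelBinarizer()
toy_matrix = toy_mlb.fit_transform(toy_labels)

# classes_ holds the label names in sorted order;
# column i of the matrix corresponds to classes_[i]
print(list(toy_mlb.classes_))
print(toy_matrix)
```

In the notebook itself, `mlb.classes_` should identify which of the 45 columns is the one set for 'Biology' in `label_matrix[0]`.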

## Fit classifier and score

In [5]:
rfc = RandomForestClassifier()
rfc.fit(corpus_transformed, label_matrix)
rfc.score(corpus_transformed, label_matrix)

0.77800000000000002
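
Note that the score above is computed on the same data the classifier was fit on, and that for multilabel targets `score` reports subset accuracy: a prediction counts only if every label of the observation is correct. The following is a sketch of a held-out evaluation under stated assumptions. It uses synthetic data from `make_multilabel_classification` as a stand-in for `corpus_transformed` and `label_matrix`, and the split ratio, `n_estimators`, and `random_state` values are illustrative choices, not settings from this notebook.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (corpus_transformed, label_matrix); an assumption
X, Y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=5, random_state=0)

# Hold out a quarter of the observations for evaluation
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, Y_train)

# Subset accuracy: every label of an observation must match
subset_acc = clf.score(X_test, Y_test)

# Micro-averaged F1 gives credit for partially correct label sets
micro_f1 = f1_score(Y_test, clf.predict(X_test), average='micro')

print(subset_acc, micro_f1)
```

Because subset accuracy is strict, a per-label metric such as micro-averaged F1 is often reported alongside it.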

In [6]:
rfc.predict(corpus_transformed[0:2])

array([[ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.]])
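
The predicted rows are indicator vectors, so they can be mapped back to label names with the binarizer's `inverse_transform`. The sketch below uses a toy binarizer and hand-written predictions (both hypothetical) to show the round trip. In this notebook the analogous call would presumably be `mlb.inverse_transform(rfc.predict(corpus_transformed[0:2]))`.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Toy binarizer fitted on hypothetical labels
toy_mlb = MultiLabelBinarizer()
toy_mlb.fit([['Biology'], ['History', 'Geography']])

# Hypothetical predictions, in the same column order as toy_mlb.classes_
toy_pred = np.array([[1, 0, 0],
                     [0, 1, 1]])

# Each row becomes a tuple of the label names whose column is 1
decoded = toy_mlb.inverse_transform(toy_pred)
print(decoded)
```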