# Text Classification in scikit-learn

First, let's get the corpus we will be using, which is included in NLTK. You will need NLTK and scikit-learn (as well as their dependencies, in particular SciPy and NumPy) to run this code.

In [1]:
import nltk
nltk.download("reuters") # if necessary
from nltk.corpus import reuters


[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\shivashankar\AppData\Roaming\nltk_data...


The NLTK sample of the Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics and are divided into training and test sets, a split which we will preserve here. Let's look at the number of texts in each category.

In [2]:
for category in reuters.categories():
    print(category, len(reuters.fileids(category)))

(u'acq', 2369)
(u'alum', 58)
(u'barley', 51)
(u'bop', 105)
(u'carcass', 68)
(u'castor-oil', 2)
(u'cocoa', 73)
(u'coconut', 6)
(u'coconut-oil', 7)
(u'coffee', 139)
(u'copper', 65)
(u'copra-cake', 3)
(u'corn', 237)
(u'cotton', 59)
(u'cotton-oil', 3)
(u'cpi', 97)
(u'cpu', 4)
(u'crude', 578)
(u'dfl', 3)
(u'dlr', 175)
(u'dmk', 14)
(u'earn', 3964)
(u'fuel', 23)
(u'gas', 54)
(u'gnp', 136)
(u'gold', 124)
(u'grain', 582)
(u'groundnut', 9)
(u'groundnut-oil', 2)
(u'heat', 19)
(u'hog', 22)
(u'housing', 20)
(u'income', 16)
(u'instal-debt', 6)
(u'interest', 478)
(u'ipi', 53)
(u'iron-steel', 54)
(u'jet', 5)
(u'jobs', 67)
(u'l-cattle', 8)
(u'lead', 29)
(u'lei', 15)
(u'lin-oil', 2)
(u'livestock', 99)
(u'lumber', 16)
(u'meal-feed', 49)
(u'money-fx', 717)
(u'money-supply', 174)
(u'naphtha', 6)
(u'nat-gas', 105)
(u'nickel', 9)
(u'nkr', 3)
(u'nzdlr', 4)
(u'oat', 14)
(u'oilseed', 171)
(u'orange', 27)
(u'palladium', 3)
(u'palm-oil', 40)
(u'palmkernel', 3)
(u'pet-chem', 32)
(u'platinum', 12)
(u'potato', 6)
(u

Many of the documents in the corpus are tagged with multiple labels; in this situation, a straightforward approach is to build a separate binary classifier for each label. Let's build a classifier for one of the most common topics in the corpus, "acq" (acquisitions). First, here's some code to build a dataset in preparation for classification using scikit-learn.

In [3]:
from sklearn.feature_extraction import DictVectorizer

def get_BOW(text):
    BOW = {}
    for word in text:
        BOW[word] = BOW.get(word,0) + 1
    return BOW

def prepare_reuters_data(topic,feature_extractor):
    feature_matrix = []
    classifications = []
    for file_id in reuters.fileids():
        feature_dict = feature_extractor(reuters.words(file_id))   
        feature_matrix.append(feature_dict)
        if topic in reuters.categories(file_id):
            classifications.append(topic)
        else:
            classifications.append("not " + topic)
     
    vectorizer = DictVectorizer()
    dataset = vectorizer.fit_transform(feature_matrix)
    return dataset,classifications

dataset,classifications = prepare_reuters_data("acq",get_BOW)

The above code builds a sparse bag-of-words feature representation (a Python dictionary) for each text in the corpus (which is pre-tokenized) and puts it in a list; a corresponding list of correct classifications is created at the same time. The scikit-learn DictVectorizer class converts Python dictionaries into the SciPy sparse matrices which scikit-learn uses; when working with a single dataset, use the fit_transform method to perform the conversion. We can look at the shape of the resulting sparse matrix to see how many texts and features we have.
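To make the conversion concrete before looking at the real matrix, here is a minimal sketch on two made-up feature dictionaries. The names toy_docs, toy_vectorizer and toy_matrix and the word counts are illustrative assumptions, not taken from the corpus.

```python
from sklearn.feature_extraction import DictVectorizer

# Two made-up bag-of-words dictionaries (illustrative only, not corpus data)
toy_docs = [
    {"oil": 2, "price": 1},
    {"merger": 1, "price": 3},
]

toy_vectorizer = DictVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)  # a scipy sparse matrix

print(toy_matrix.shape)                             # (documents, distinct words)
print(sorted(toy_vectorizer.vocabulary_.items()))   # word -> column index
print(toy_matrix.toarray())                         # dense view, only sensible for tiny examples
```

Each distinct word gets its own column, and a word that is absent from a document has no stored entry in that row, which is what keeps the matrix sparse.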

In [4]:
dataset.shape

(10788, 41600)

There are 10788 texts with 41600 features, which is a fairly large feature set. Let's set up several classifiers to compare...

In [12]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    DecisionTreeClassifier(max_depth=5),
    GaussianNB(),
    LogisticRegression()]

To start, we are using mostly default settings for these classifiers. Let's do 10-fold cross-validation and look at the accuracy, recall, precision, and f1-score... (as of scikit-learn 0.18, the cross_validation module is deprecated in favour of model_selection, so importing cross_validation triggers a deprecation warning)

In [13]:
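from sklearn.model_selection import cross_val_predict  # sklearn.cross_validation was removed in 0.20

# Build the features once; get_BOW_lowered_no_stopwords and check_results
# are defined in cells In [10] and In [7], which must be run first.
dataset, classifications = prepare_reuters_data("acq", get_BOW_lowered_no_stopwords)

for clf in classifiers:
    # GaussianNB does not accept sparse input, so give it a dense array
    X = dataset.toarray() if isinstance(clf, GaussianNB) else dataset
    predictions = cross_val_predict(clf, X, classifications, cv=10)
    check_results(predictions, classifications)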


accuracy
0.940674823878
             precision    recall  f1-score   support

        acq       0.88      0.84      0.86      2369
    not acq       0.96      0.97      0.96      8419

avg / total       0.94      0.94      0.94     10788

accuracy
0.981368186874
             precision    recall  f1-score   support

        acq       0.96      0.96      0.96      2369
    not acq       0.99      0.99      0.99      8419

avg / total       0.98      0.98      0.98     10788

accuracy
0.90971449759
             precision    recall  f1-score   support

        acq       0.85      0.71      0.78      2369
    not acq       0.92      0.97      0.94      8419

avg / total       0.91      0.91      0.91     10788

accuracy
0.821653689284
             precision    recall  f1-score   support

        acq       0.59      0.62      0.60      2369
    not acq       0.89      0.88      0.88      8419

avg / total       0.83      0.82      0.82     10788

accuracy
0.98155357805
             precision

It took a little while to build: decision trees don't scale well with large feature sets, and we are building 10 sets of 10 decision tree classifiers, one set for each cross-validation fold. Let's see what the results look like; scikit-learn has built-in functions to calculate accuracy and recall/precision/f-score.

In [7]:
from sklearn.metrics import accuracy_score, classification_report

def check_results(predictions, classifications):
    print("accuracy")
    print(accuracy_score(classifications,predictions))
    print(classification_report(classifications,predictions))
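As a rough illustration of what these metrics measure, the sketch below computes accuracy, precision, recall and F1 by hand for a single positive label. The helper binary_scores and the toy label lists are hypothetical, written for this example only (Python 3 is assumed); it is not how scikit-learn implements its metrics.

```python
def binary_scores(gold, predicted, positive):
    """Return (accuracy, precision, recall, f1) treating `positive` as the target label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)

    accuracy = correct / len(pairs)
    # Define a score as 0.0 when its denominator would be zero
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Made-up labels: 2 "acq" documents out of 10
gold = ["acq"] * 2 + ["not acq"] * 8

# A classifier that always predicts the majority label
majority_scores = binary_scores(gold, ["not acq"] * 10, "acq")

# A classifier that finds one of the two "acq" documents and makes no false alarms
partial_scores = binary_scores(gold, ["acq"] + ["not acq"] * 9, "acq")

print(majority_scores)
print(partial_scores)
```

The majority-label classifier still gets most documents right while scoring zero F1 on "acq", which is why accuracy alone can be misleading when one label dominates.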
    


In this case, the classifier is not obviously biased towards a particular class, so accuracy and f-score are nearly the same. The performance is quite high, indicating that this is a fairly easy classification task. Let's try to improve performance by removing stopwords and lowercasing.

In [10]:
nltk.download('stopwords')
from nltk.corpus import stopwords


# Use a separate name so the imported stopwords corpus reader is not overwritten
stopword_set = set(stopwords.words('english'))

def get_BOW_lowered_no_stopwords(text):
    BOW = {}
    for word in text:
        word = word.lower()
        if word not in stopword_set:
            BOW[word] = BOW.get(word,0) + 1
    return BOW



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\shivashankar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


There is a gain in performance, though it is fairly modest.

The default number of decision trees (n_estimators) used in the model is only 10 in this version of scikit-learn (releases from 0.22 onward default to 100), which is fairly low; let's see if we can find a better number (this will take a while)...
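The cell for this search is not shown here, so the following is only a sketch of how such a comparison might look. It assumes a RandomForestClassifier and, to stay self-contained, uses small synthetic data from make_classification instead of the Reuters features; it also uses cross_val_score for brevity where the cells above use cross_val_predict with check_results. The candidate values 10, 50 and 100 are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; these are not the Reuters features)
X, y = make_classification(n_samples=300, n_features=40,
                           n_informative=8, random_state=0)

mean_accuracy = {}
for n_trees in (10, 50, 100):
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)   # one accuracy value per fold
    mean_accuracy[n_trees] = scores.mean()
    print(n_trees, round(mean_accuracy[n_trees], 3))
```

To try this on the corpus, one would pass dataset and classifications in place of X and y; expect the running time to grow roughly in proportion to the number of trees.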

Yup, more subclassifiers improved things, though at the cost of speed. Feel free to play around more with this or another classifier to see if you can do better. 