# Text Classification in scikit-learn

First, let's get the corpus we will be using, which is included in NLTK. You will need NLTK and scikit-learn (as well as their dependencies, in particular SciPy and NumPy) to run this code.

In [1]:
import nltk
nltk.download("reuters") # if necessary
from nltk.corpus import reuters


[nltk_data] Downloading package reuters to /Users/daniel/nltk_data...


The NLTK sample of the Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics and are divided into training and test sets. (The code in this notebook pools both sets and relies on cross-validation rather than on that split.) Let's look at the number of texts in each category.

In [3]:
for category in reuters.categories():
    print(category, len(reuters.fileids(category)))

acq 2369
alum 58
barley 51
bop 105
carcass 68
castor-oil 2
cocoa 73
coconut 6
coconut-oil 7
coffee 139
copper 65
copra-cake 3
corn 237
cotton 59
cotton-oil 3
cpi 97
cpu 4
crude 578
dfl 3
dlr 175
dmk 14
earn 3964
fuel 23
gas 54
gnp 136
gold 124
grain 582
groundnut 9
groundnut-oil 2
heat 19
hog 22
housing 20
income 16
instal-debt 6
interest 478
ipi 53
iron-steel 54
jet 5
jobs 67
l-cattle 8
lead 29
lei 15
lin-oil 2
livestock 99
lumber 16
meal-feed 49
money-fx 717
money-supply 174
naphtha 6
nat-gas 105
nickel 9
nkr 3
nzdlr 4
oat 14
oilseed 171
orange 27
palladium 3
palm-oil 40
palmkernel 3
pet-chem 32
platinum 12
potato 6
propane 6
rand 3
rape-oil 8
rapeseed 27
reserves 73
retail 25
rice 59
rubber 49
rye 2
ship 286
silver 29
sorghum 34
soy-meal 26
soy-oil 25
soybean 111
strategic-metal 27
sugar 162
sun-meal 2
sun-oil 7
sunseed 16
tea 13
tin 30
trade 485
veg-oil 124
wheat 283
wpi 29
yen 59
zinc 34
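
The listing above is alphabetical, which makes it hard to see which topics dominate. As a rough sketch, sorting by count brings the largest categories to the top. The handful of counts below are copied from the output above rather than read from the corpus, and the variable names are only illustrative.

```python
# A few (category, document count) pairs copied from the listing above.
category_counts = {
    "acq": 2369,
    "crude": 578,
    "earn": 3964,
    "grain": 582,
    "money-fx": 717,
    "trade": 485,
    "rye": 2,
}

# Sort categories from most to least frequent.
ranked = sorted(category_counts.items(), key=lambda item: item[1], reverse=True)

for category, count in ranked:
    print(category, count)
```

With the corpus loaded, the same idea could be written as `sorted(reuters.categories(), key=lambda c: len(reuters.fileids(c)), reverse=True)`.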


Many of the documents in the corpus are tagged with multiple labels; in this situation, a straightforward approach is to build a separate classifier for each label. Let's build a classifier for one of the most common topics in the corpus, "acq" (acquisitions), which separates texts carrying that label from all the others. First, here's some code to build a dataset in preparation for classification using scikit-learn.

In [4]:
from sklearn.feature_extraction import DictVectorizer

def get_BOW(text):
    BOW = {}
    for word in text:
        BOW[word] = BOW.get(word,0) + 1
    return BOW

def prepare_reuters_data(topic,feature_extractor):
    feature_matrix = []
    classifications = []
    for file_id in reuters.fileids():
        feature_dict = feature_extractor(reuters.words(file_id))   
        feature_matrix.append(feature_dict)
        if topic in reuters.categories(file_id):
            classifications.append(topic)
        else:
            classifications.append("not " + topic)
     
    vectorizer = DictVectorizer()
    dataset = vectorizer.fit_transform(feature_matrix)
    return dataset,classifications

dataset,classifications = prepare_reuters_data("acq",get_BOW)

The above code builds a sparse bag-of-words feature representation (a Python dictionary) for each text in the corpus (which is pre-tokenized) and puts it in a list; a corresponding list of correct classifications is created at the same time. The scikit-learn DictVectorizer class converts Python dictionaries into the SciPy sparse matrices which scikit-learn uses; when working with a single dataset, use the fit_transform method to perform the conversion. We can look at the shape of the resulting sparse matrix to see how many texts and features we have.
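
To make the conversion concrete, here is a small sketch on two made-up token lists (the documents and variable names are invented for illustration). It uses collections.Counter, which produces the same word counts as get_BOW, and shows that transform, unlike fit_transform, reuses the feature set learned earlier and ignores words it has not seen.

```python
from collections import Counter

from sklearn.feature_extraction import DictVectorizer

# Two tiny, made-up "documents", already tokenized.
toy_docs = [
    ["the", "merger", "the"],
    ["the", "wheat"],
]

# Counter builds the same word -> count mapping as get_BOW.
toy_bows = [Counter(tokens) for tokens in toy_docs]

toy_vectorizer = DictVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_bows)

print(toy_vectorizer.feature_names_)  # feature names, sorted alphabetically
print(toy_matrix.shape)               # (number of texts, number of features)
print(toy_matrix.toarray())

# transform() maps new data onto the features learned above;
# words that were never seen during fitting are silently dropped.
new_matrix = toy_vectorizer.transform([Counter(["merger", "takeover"])])
print(new_matrix.toarray())
```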

In [5]:
dataset.shape

(10788, 41600)

There are 10788 texts with 41600 features, which is a fairly large feature set. Let's set up a Random Forest classifier...

In [6]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()


To start, we are using the default settings for this classifier. Let's do 10-fold cross-validation and look at the accuracy, recall, precision, and f1-score. (As of scikit-learn 0.18 the cross_validation module is deprecated and its functions live in model_selection; the old module was removed entirely in 0.20. The import below uses model_selection under the old name when it is available, so the remaining cells run unchanged.)

In [7]:
try:
    from sklearn import model_selection as cross_validation  # scikit-learn >= 0.18
except ImportError:
    from sklearn import cross_validation  # older releases

predictions = cross_validation.cross_val_predict(clf, dataset, classifications, cv=10)
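
Conceptually, cross_val_predict returns one out-of-fold prediction per text: the data is cut into folds, and each fold is labeled by a model trained only on the other folds. The following is a simplified pure-Python sketch of that idea, not scikit-learn's implementation. It uses interleaved folds and a stand-in "model" that always predicts the majority training label, and all names in it are hypothetical.

```python
from collections import Counter


def out_of_fold_predictions(labels, n_folds):
    """Predict every item with a model that never saw it during training.

    The "model" here is a placeholder that always predicts the most
    common label among the training items.
    """
    predictions = [None] * len(labels)
    for fold in range(n_folds):
        # Item i belongs to fold i % n_folds (interleaved folds).
        test_idx = [i for i in range(len(labels)) if i % n_folds == fold]
        train_labels = [labels[i] for i in range(len(labels)) if i % n_folds != fold]

        # "Train": find the majority label in the training portion.
        majority_label = Counter(train_labels).most_common(1)[0][0]

        # "Predict": label the held-out items only.
        for i in test_idx:
            predictions[i] = majority_label
    return predictions


toy_labels = ["not acq"] * 7 + ["acq"] * 3
toy_predictions = out_of_fold_predictions(toy_labels, n_folds=5)
print(toy_predictions)
```

For classifiers, scikit-learn additionally stratifies the folds, so each fold keeps roughly the same class proportions as the full dataset.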





The cross-validation run takes a little while because decision trees don't scale well with large feature sets, and we are building 10 forests of 10 decision trees each, one forest for each cross-validation fold. Let's see what the results look like; scikit-learn has built-in functions to calculate accuracy and recall/precision/f-score.

In [8]:
from sklearn.metrics import accuracy_score, classification_report

def check_results(predictions, classifications):
    print("accuracy")
    print(accuracy_score(classifications,predictions))
    print(classification_report(classifications,predictions))
    
check_results(predictions, classifications)

accuracy
0.957174638487
             precision    recall  f1-score   support

        acq       0.90      0.90      0.90      2369
    not acq       0.97      0.97      0.97      8419

avg / total       0.96      0.96      0.96     10788
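
The columns of the report follow from simple counts. As a sketch on a made-up set of labels (the helper name and toy data are illustrative only), precision is the fraction of texts predicted as "acq" that really are "acq", recall is the fraction of real "acq" texts that were found, and f1 is their harmonic mean. The last lines compute the accuracy of always answering "not acq", using the support column above, which gives a baseline for judging the accuracy figure.

```python
def precision_recall_f1(gold, predicted, target):
    """Compute precision, recall and F1 for one class from paired labels."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == target and p == target)
    false_pos = sum(1 for g, p in zip(gold, predicted) if g != target and p == target)
    false_neg = sum(1 for g, p in zip(gold, predicted) if g == target and p != target)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up gold labels and predictions for six texts.
toy_gold = ["acq", "acq", "acq", "not acq", "not acq", "not acq"]
toy_pred = ["acq", "acq", "not acq", "acq", "not acq", "not acq"]

toy_scores = precision_recall_f1(toy_gold, toy_pred, "acq")
print(toy_scores)

# Majority-class baseline from the support column: 8419 of 10788 texts are "not acq".
baseline_accuracy = 8419 / 10788
print(round(baseline_accuracy, 3))
```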



In this case, the classifier is not obviously biased towards a particular class, so accuracy and the average f-score are nearly the same. The performance is quite high, indicating that this is a fairly easy classification task. Let's try to improve performance by removing stopwords and lowercasing.

In [10]:
nltk.download('stopwords')
from nltk.corpus import stopwords


stopword_set = set(stopwords.words('english'))  # separate name, so the stopwords corpus reader is not overwritten

def get_BOW_lowered_no_stopwords(text):
    BOW = {}
    for word in text:
        word = word.lower()
        if word not in stopword_set:
            BOW[word] = BOW.get(word,0) + 1
    return BOW

dataset, classifications = prepare_reuters_data("acq",get_BOW_lowered_no_stopwords)
predictions = cross_validation.cross_val_predict(clf, dataset,classifications, cv=10)
check_results(predictions, classifications)


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/daniel/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
accuracy
0.958194289952
             precision    recall  f1-score   support

        acq       0.89      0.92      0.91      2369
    not acq       0.98      0.97      0.97      8419

avg / total       0.96      0.96      0.96     10788



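There is a gain in performance, though it is fairly modest: accuracy rises from 0.957 to 0.958, and recall on "acq" rises from 0.90 to 0.92 while precision slips from 0.90 to 0.89.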
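
To see what this preprocessing does to a single text, here is a sketch with a made-up sentence and a tiny hand-written stopword set (NLTK's English list is much longer; the names below are illustrative). Lowercasing merges "Company" and "company", and the stopword filter removes "The", "the" and "of" altogether.

```python
from collections import Counter

# A tiny stand-in for NLTK's English stopword list (assumption for this example).
toy_stopwords = {"the", "of", "a", "to"}

toy_tokens = ["The", "acquisition", "of", "the", "Company", "company", "shares"]

# Plain bag of words: case variants are counted separately.
plain_bow = Counter(toy_tokens)

# Lowercased bag of words without stopwords.
filtered_bow = Counter(
    token.lower() for token in toy_tokens if token.lower() not in toy_stopwords
)

print(len(plain_bow), dict(plain_bow))
print(len(filtered_bow), dict(filtered_bow))
```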

The default number of decision trees (n_estimators) used in the model here is only 10 (scikit-learn raised the default to 100 in version 0.22), which is fairly low; let's see if we can find a better number (this will take a while)...

In [11]:
n_to_test = [5,50,100,150]
clfs = [RandomForestClassifier(n_estimators=n) for n in n_to_test]
for clf in clfs:
    predictions = cross_validation.cross_val_predict(clf, dataset,classifications, cv=10)
    check_results(predictions, classifications)


accuracy
0.943548387097
             precision    recall  f1-score   support

        acq       0.88      0.85      0.87      2369
    not acq       0.96      0.97      0.96      8419

avg / total       0.94      0.94      0.94     10788

accuracy
0.970152020764
             precision    recall  f1-score   support

        acq       0.94      0.93      0.93      2369
    not acq       0.98      0.98      0.98      8419

avg / total       0.97      0.97      0.97     10788

accuracy
0.970615498702
             precision    recall  f1-score   support

        acq       0.94      0.92      0.93      2369
    not acq       0.98      0.98      0.98      8419

avg / total       0.97      0.97      0.97     10788

accuracy
0.97070819429
             precision    recall  f1-score   support

        acq       0.94      0.92      0.93      2369
    not acq       0.98      0.98      0.98      8419

avg / total       0.97      0.97      0.97     10788



Yup, more subclassifiers improved things, though at the cost of speed: accuracy rises from 0.944 with 5 trees to 0.970 with 50, and then barely changes at 100 and 150. Feel free to play around more with this or another classifier to see if you can do better.
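
One small refinement when experimenting is to record which setting produced which score, since the reports above are printed without their n_estimators value. The sketch below does this on a tiny made-up dataset. It assumes a scikit-learn version that provides model_selection (0.18 or later), and the data, names and fixed random_state are illustrative choices, not part of the experiment above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

# Made-up bag-of-words dictionaries: ten "acq"-like and ten other texts.
toy_bows = [{"merger": 1, "shares": n} for n in range(1, 11)]
toy_bows += [{"wheat": 1, "tonnes": n} for n in range(1, 11)]
toy_labels = ["acq"] * 10 + ["not acq"] * 10

toy_dataset = DictVectorizer().fit_transform(toy_bows)

# Map each n_estimators value to its cross-validated accuracy.
accuracy_by_n = {}
for n in [5, 50]:
    toy_clf = RandomForestClassifier(n_estimators=n, random_state=0)
    toy_predictions = cross_val_predict(toy_clf, toy_dataset, toy_labels, cv=5)
    accuracy_by_n[n] = accuracy_score(toy_labels, toy_predictions)

for n, accuracy in sorted(accuracy_by_n.items()):
    print(n, accuracy)
```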