# Text Classification in scikit-learn

First, let's get the corpus we will be using, which is included in NLTK. You will need NLTK and scikit-learn (as well as their dependencies, in particular SciPy and NumPy) to run this code.

In [None]:
import nltk
nltk.download("reuters") # if necessary
from nltk.corpus import reuters


The NLTK sample of the Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics and are divided into training and test sets, a split which we will preserve here. Let's look at how many texts each category contains.

In [None]:
for category in reuters.categories():
    print("%s %d" % (category, len(reuters.fileids(category))))

Many of the documents in the corpus are tagged with multiple labels; in this situation, a straightforward approach is to build a separate binary classifier for each label. Let's build a classifier for one of the most common topics in the corpus, "acq" (acquisitions). First, here's some code to build a dataset in preparation for classification using scikit-learn.

In [None]:
from sklearn.feature_extraction import DictVectorizer

def get_BOW(text):
    BOW = {}
    for word in text:
        BOW[word] = BOW.get(word,0) + 1
    return BOW

def prepare_reuters_data(topic,feature_extractor):
    feature_matrix = []
    classifications = []
    for file_id in reuters.fileids():
        feature_dict = feature_extractor(reuters.words(file_id))   
        feature_matrix.append(feature_dict)
        if topic in reuters.categories(file_id):
            classifications.append(topic)
        else:
            classifications.append("not " + topic)
     
    vectorizer = DictVectorizer()
    dataset = vectorizer.fit_transform(feature_matrix)
    return dataset,classifications

dataset,classifications = prepare_reuters_data("acq",get_BOW)

The code above builds a sparse bag-of-words feature representation (a Python dictionary) for each text in the corpus (which is pre-tokenized) and appends it to a list; a corresponding list of correct classifications is built at the same time. The scikit-learn DictVectorizer class converts the Python dictionaries into the SciPy sparse matrices that scikit-learn uses; when working with a single dataset, use the fit_transform method to perform the conversion. Next, let's prepare a Random Forest classifier to test...
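Before that, a small pure-Python sketch may help to show what the dictionary-to-matrix conversion amounts to. It is not scikit-learn's implementation: the names fit_vocabulary, transform and toy_docs are made up for illustration, and the rows are dense lists rather than a sparse matrix. The "fit" step fixes one column per feature name (sorted, as DictVectorizer does by default), and the "transform" step fills in one row per dictionary.

```python
# Pure-Python sketch of the dictionary-to-matrix idea (not scikit-learn's code).

def fit_vocabulary(feature_dicts):
    """Map every feature name seen in the data to a column index."""
    names = sorted({name for d in feature_dicts for name in d})
    return {name: column for column, name in enumerate(names)}

def transform(feature_dicts, vocabulary):
    """Build one dense row per dict; names outside the vocabulary are ignored."""
    rows = []
    for d in feature_dicts:
        row = [0] * len(vocabulary)
        for name, value in d.items():
            if name in vocabulary:
                row[vocabulary[name]] = value
        rows.append(row)
    return rows

# Two made-up bag-of-words dictionaries
toy_docs = [{"oil": 2, "prices": 1}, {"merger": 1, "prices": 3}]
vocabulary = fit_vocabulary(toy_docs)
matrix = transform(toy_docs, vocabulary)

# A later text containing a word that was not seen when fitting
unseen = transform([{"oil": 1, "tanker": 4}], vocabulary)

print(vocabulary)  # {'merger': 0, 'oil': 1, 'prices': 2}
print(matrix)      # [[0, 2, 1], [1, 0, 3]]
print(unseen)      # [[0, 1, 0]]
```

Fitting and transforming in one step is enough when there is only one dataset; the two steps only need to be separated when new texts must be mapped onto columns that were fixed earlier.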

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()


To start, we use the default settings for this classifier. Let's do 10-fold cross-validation and look at the accuracy, recall, precision, and f1-score. (From scikit-learn 0.18 on, importing cross_validation raises a deprecation warning, because its functions moved to model_selection; the cross_validation module was removed entirely in version 0.20.)
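For readers who have not met k-fold cross-validation before, the idea can be sketched in plain Python (written for Python 3). The helper k_fold_indices below is hypothetical and only illustrates the splitting: the data is cut into k folds, and each fold serves once as the test set while the remaining folds are used for training, so every item receives exactly one prediction.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train", train, "test", test)
```

When given cv=10 and a classifier, scikit-learn's cross_val_predict uses stratified folds by default, which additionally keep the class proportions roughly equal across folds; the sketch leaves that refinement out.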

In [None]:
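try:
    # scikit-learn 0.18 and later: same functions under a new module name
    from sklearn import model_selection as cross_validation
except ImportError:
    # older releases only provide sklearn.cross_validation
    from sklearn import cross_validation
from sklearn.metrics import accuracy_score, classification_report

def check_results(predictions, classifications):
    print("accuracy")
    print(accuracy_score(classifications, predictions))
    print(classification_report(classifications, predictions))

predictions = cross_validation.cross_val_predict(clf, dataset, classifications, cv=10)
check_results(predictions, classifications)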
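As a side note on the numbers reported here, the metrics can be computed by hand from counts of true positives, false positives and false negatives. The sketch below (Python 3) uses a hypothetical helper, binary_metrics, and made-up labels rather than the Reuters predictions. It shows how accuracy and f1-score can drift apart when the classifier favors the larger class.

```python
def binary_metrics(gold, predicted, positive):
    """Return (accuracy, precision, recall, f1) with respect to one label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Made-up labels: 2 "acq" texts out of 10, and only one of them is found
gold = ["acq"] * 2 + ["not acq"] * 8
predicted = ["acq", "not acq"] + ["not acq"] * 8

accuracy, precision, recall, f1 = binary_metrics(gold, predicted, "acq")
print(accuracy, precision, recall, round(f1, 3))  # 0.9 1.0 0.5 0.667
```

In this toy case accuracy is 0.9 while the f1-score for "acq" is only about 0.67, because half of the rare class is missed. The cross-validation results for the real corpus are discussed next.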


In the cross-validation results for "acq", the classifier is not obviously biased towards either class, so accuracy and f-score are nearly the same. Performance is quite high, which suggests that this is a fairly easy classification task. Let's try to improve performance by lowercasing and removing stopwords.

In [None]:
from nltk.corpus import stopwords
nltk.download("stopwords") # if necessary

# use a separate name so the stopwords corpus reader is not overwritten
stop_words = set(stopwords.words('english'))

def get_BOW_lowered_no_stopwords(text):
    BOW = {}
    for word in text:
        word = word.lower()
        if word not in stop_words:
            BOW[word] = BOW.get(word,0) + 1
    return BOW

dataset, classifications = prepare_reuters_data("acq",get_BOW_lowered_no_stopwords)
predictions = cross_validation.cross_val_predict(clf, dataset,classifications, cv=10)
check_results(predictions, classifications)


The gain in performance was fairly modest.
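To see what lowercasing and stopword removal do to a single text, here is a toy run in Python 3. TOY_STOPWORDS is a hypothetical three-word list standing in for NLTK's English stopword list, and bow_lowered_no_stopwords is an illustrative variant of the extractor that uses collections.Counter.

```python
from collections import Counter

TOY_STOPWORDS = {"the", "of", "a"}  # hypothetical stand-in for NLTK's list

def bow_lowered_no_stopwords(tokens, stop_words=TOY_STOPWORDS):
    """Count lowercased tokens, skipping anything in stop_words."""
    lowered = (token.lower() for token in tokens)
    return Counter(token for token in lowered if token not in stop_words)

tokens = ["The", "merger", "of", "the", "Bank", "and", "the", "bank"]
raw_bow = Counter(tokens)
filtered_bow = bow_lowered_no_stopwords(tokens)

print(len(raw_bow), dict(raw_bow))            # 7 distinct features
print(len(filtered_bow), dict(filtered_bow))  # 3 distinct features
```

The filtered representation merges "Bank" and "bank" into one feature and drops the function words, so the feature space shrinks while the topical words keep their counts.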

The default number of decision trees (n_estimators) used in the model is only 10 (in scikit-learn versions before 0.22; later versions default to 100), which is fairly low. Let's see if we can find a better number...

In [None]:
n_to_test = [5,50,100,150,300]
clfs = [RandomForestClassifier(n_estimators=n) for n in n_to_test]
for clf in clfs:
    # label each report with the number of trees that produced it
    print("n_estimators = %d" % clf.n_estimators)
    predictions = cross_validation.cross_val_predict(clf, dataset,classifications, cv=10)
    check_results(predictions, classifications)


Yup, more subclassifiers improved things, though at the cost of speed. Feel free to play around more with this or another classifier to see if you can do better. 
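One rough intuition for why more subclassifiers help comes from majority voting. The sketch below (Python 3.8 or later, for math.comb) rests on a strong simplifying assumption: every voter is right independently with the same probability. The trees in a real forest are correlated, and scikit-learn's RandomForestClassifier averages predicted probabilities rather than counting hard votes, so this is an idealized illustration and not a model of the experiment above. The function name majority_vote_accuracy is made up.

```python
from math import comb

def majority_vote_accuracy(n_voters, p_correct):
    """Probability that more than half of n independent voters are right."""
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )

# Odd ensemble sizes avoid ties; each voter is right 60% of the time
for n in [1, 5, 51, 101]:
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the accuracy of the vote rises steadily with the number of voters, while the cost of training and prediction grows roughly in proportion to the ensemble size.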