Let's try our hand at the classic [20 newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset. The goal is to train a classifier that can correctly categorize which newsgroup a given post came from.

We will restrict ourselves to four of the 20 classes:

In [None]:
categories = ['rec.sport.baseball', 'rec.sport.hockey',
               'comp.graphics', 'sci.med']

In [None]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories, shuffle=True, random_state=42)

The returned dataset is a scikit-learn “bunch”: a simple holder object whose fields can be accessed either as Python dict keys or as object attributes. For instance, `target_names` holds the list of the requested category names:

In [None]:
twenty_train.target_names
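To make the two access styles concrete without downloading anything, here is a minimal sketch that builds a toy `Bunch` by hand. The field values are made up for illustration; only the `Bunch` class itself comes from scikit-learn.

```python
from sklearn.utils import Bunch

# A toy Bunch standing in for the object returned by fetch_20newsgroups.
# The contents are illustrative, not real newsgroup data.
toy = Bunch(data=["first post", "second post"],
            target=[1, 0],
            target_names=["comp.graphics", "sci.med"])

# Dict-style and attribute-style access return the very same object.
names_by_key = toy["target_names"]
names_by_attr = toy.target_names
print(names_by_key is names_by_attr)

# Because a Bunch is a dict, the usual dict methods work too.
print(sorted(toy.keys()))
```

The same pattern applies to `twenty_train`: `twenty_train["data"]` and `twenty_train.data` refer to the same list.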

You can access the documents of the training set through the `.data` attribute.

In [None]:
len(twenty_train.data)

And the labels are in the `.target` attribute.

In [None]:
len(twenty_train.target)

Here's a sample of the data.

In [None]:
# the newsgroup header.
print("\n".join(twenty_train.data[0].split("\n")[:3]))

In [None]:
print(twenty_train.target_names[twenty_train.target[0]])

Supervised learning algorithms will require a category label for each document in the training set. In this case the category is the name of the newsgroup which also happens to be the name of the folder holding the individual documents.

For speed and space efficiency, scikit-learn loads the target attribute as an array of integers, each of which is the index of a category name in the `target_names` list. The integer category id of each sample is stored in the `target` attribute:

In [None]:
twenty_train.target[:10]

It is possible to get back the category names as follows:

In [None]:
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])
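The same lookup can be done in one step with NumPy indexing, which is handy for larger slices. This is a small sketch on toy values: the `target` array below is invented, and only the category names match the ones used in this notebook.

```python
import numpy as np

# Category names in the order scikit-learn would index them (sorted).
target_names = ["comp.graphics", "rec.sport.baseball",
                "rec.sport.hockey", "sci.med"]
# Made-up integer labels for five documents.
target = np.array([1, 1, 3, 0, 2])

# Indexing an array of names with the integer labels maps ids back to names.
names = np.array(target_names)[target]
print(names.tolist())

# bincount gives the number of documents per category id.
counts = np.bincount(target, minlength=len(target_names))
print(counts.tolist())
```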

Notice that the samples have been shuffled randomly (with a fixed RNG seed). This is useful if you want to select only the first samples to quickly train a model and get a first idea of the results before re-training on the complete dataset later.

## Extracting features from text files

In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors.

### Bags of words

The most intuitive way to do so is the bags of words representation:

1. Assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).
2. For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j, where j is the index of word w in the dictionary.

The bags of words representation implies that n_features is the number of distinct words in the corpus: this number is typically larger than 100,000.
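The dictionary-building and counting steps described above can be sketched in plain Python. This is a hand-rolled illustration on two invented sentences with naive whitespace tokenization, not how scikit-learn implements it.

```python
# Two made-up documents standing in for the training set.
docs = ["the puck hit the post", "the pitcher threw the ball"]

# Assign an integer id to every distinct word, in order of first appearance.
vocabulary = {}
for doc in docs:
    for word in doc.split():
        vocabulary.setdefault(word, len(vocabulary))

# X[i][j] is the number of times word j occurs in document i.
X = [[0] * len(vocabulary) for _ in docs]
for i, doc in enumerate(docs):
    for word in doc.split():
        X[i][vocabulary[word]] += 1

print(vocabulary)
print(X)
```

Here n_features is `len(vocabulary)`, which is 7 for this toy corpus; on a real corpus it grows with every new distinct word.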

If n_samples == 10000, storing X as a NumPy array of type float32 would require 10000 x 100000 x 4 bytes = 4 GB of RAM, which is barely manageable on today’s computers.

Fortunately, most values in X will be zeros, since a given document uses fewer than a few thousand distinct words. For this reason we say that bags of words are typically high-dimensional sparse datasets. We can save a lot of memory by storing only the non-zero parts of the feature vectors.

`scipy.sparse` matrices are data structures that do exactly this, and scikit-learn has built-in support for these structures.
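To get a feel for the savings, the sketch below compares the memory used by a dense array and its CSR equivalent. The sizes (1000 x 5000 with 20 non-zero entries per row) are arbitrary toy values chosen to run quickly, far smaller than a real bag-of-words matrix.

```python
import numpy as np
from scipy import sparse

n_samples, n_features = 1000, 5000   # toy sizes, chosen for speed
nonzeros_per_row = 20                # assumed sparsity, for illustration only
rng = np.random.default_rng(0)

# Dense matrix that is almost entirely zeros.
dense = np.zeros((n_samples, n_features), dtype=np.float32)
for i in range(n_samples):
    cols = rng.choice(n_features, size=nonzeros_per_row, replace=False)
    dense[i, cols] = 1.0

# CSR stores only the non-zero values plus their column indices and row offsets.
csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(f"dense:  {dense_bytes:,} bytes")
print(f"sparse: {sparse_bytes:,} bytes for {csr.nnz:,} stored values")
```

The dense array costs 4 bytes for every cell, while the sparse one pays only for the stored entries and their indices.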

## Tokenizing text with scikit-learn

Text preprocessing, tokenizing and filtering of stopwords are included in a high level component that is able to build a dictionary of features and transform documents to feature vectors:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

CountVectorizer supports counts of N-grams of words or consecutive characters. Once fitted, the vectorizer has built a dictionary of feature indices:

In [None]:
count_vect.vocabulary_.get(u'algorithm')

In [None]:
count_vect.vocabulary_.get(u'ergonomic')

In [None]:
# a word that is not in the vocabulary returns None
count_vect.vocabulary_.get(u'deodato')
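The n-gram support mentioned above is controlled by the `ngram_range` and `analyzer` parameters of `CountVectorizer`. The following sketch fits two vectorizers on a pair of invented sentences, so the vocabulary is tiny and easy to inspect.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, just to inspect the resulting vocabularies.
docs = ["home runs are exciting", "power play goals are exciting"]

# Word unigrams and bigrams.
word_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)
# Character trigrams.
char_trigrams = CountVectorizer(analyzer="char", ngram_range=(3, 3)).fit(docs)

print("home runs" in word_bigrams.vocabulary_)   # a word bigram
print("hom" in char_trigrams.vocabulary_)        # a character trigram
print(word_bigrams.vocabulary_.get("deodato"))   # absent entries give None
```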

## Training a classifier

Now that we have our features, we can train a classifier to try to predict the category of a post. Let’s start with a naïve Bayes classifier, which provides a nice baseline for this task. scikit-learn includes several variants of this classifier; the one most suitable for word counts is the multinomial variant:

In [None]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_counts, twenty_train.target)

To predict the category of a new document, we need to extract its features using almost the same feature-extraction chain as before. The difference is that we call `transform` instead of `fit_transform`, since the vectorizer has already been fit to the training set:

In [None]:
docs_new = [
    'home runs are exciting',
    'OpenGL on the GPU is fast',
    'diagnosis of a viral respiratory infection',
    'blue line power play goals are rock-em-sock-em'
]

X_new_counts = count_vect.transform(docs_new)

predicted = clf.predict(X_new_counts)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
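The difference between `fit_transform` and `transform` is easiest to see on a tiny corpus. In this sketch (with invented sentences), the vocabulary is learned once from the training documents, and a word that appears only in the new document is silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration.
train_docs = ["the goalie made a save", "the pitcher threw a strike"]
new_docs = ["the goalie threw a tantrum"]

vect = CountVectorizer()
X_train = vect.fit_transform(train_docs)   # learns the vocabulary
X_new = vect.transform(new_docs)           # reuses it unchanged

# Both matrices have the same number of columns (one per known word).
print(X_train.shape, X_new.shape)
# 'tantrum' was never seen during fitting, so it has no column.
print("tantrum" in vect.vocabulary_)
# Only the three known words ('the', 'goalie', 'threw') are counted.
print(X_new.sum())
```

Calling `fit_transform` on the new documents instead would build a different vocabulary, and the resulting columns would no longer line up with the features the classifier was trained on.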

## Building a pipeline

To make the vectorizer => transformer => classifier chain easier to work with, scikit-learn provides a `Pipeline` class that behaves like a compound classifier:

In [None]:
from sklearn.pipeline import Pipeline
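As a quick illustration of how a `Pipeline` is used, here is a minimal sketch on a four-document toy corpus. The documents, labels, and the name `toy_clf` are invented for this example. The pipeline accepts raw strings directly, because it runs the vectorizer before the classifier on both `fit` and `predict`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data: 0 = baseball, 1 = hockey (labels made up for this sketch).
train_docs = [
    "home run in the ninth inning",
    "pitcher threw a curveball",
    "goalie stopped the puck",
    "power play goal on the ice",
]
train_labels = [0, 0, 1, 1]

# Each step is a (name, estimator) pair; the last step is the classifier.
toy_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultinomialNB()),
])
toy_clf.fit(train_docs, train_labels)

# predict() vectorizes the raw text with the fitted vocabulary, then classifies.
pred = toy_clf.predict(["the goalie blocked the puck", "a home run"])
print(pred.tolist())
```

Individual steps remain accessible by name, for example `toy_clf.named_steps["vect"]`.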

## Evaluation of the performance on the test set

Evaluating the predictive accuracy of the model is pretty easy:

In [None]:
import numpy as np
twenty_test = fetch_20newsgroups(subset='test',
    categories=categories, shuffle=True, random_state=42)

docs_test = twenty_test.data

In [None]:
text_clf_basic = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])
text_clf_basic.fit(twenty_train.data, twenty_train.target)
predicted = text_clf_basic.predict(docs_test)
np.mean(predicted == twenty_test.target)

That gives about 96% accuracy. Not bad!

Let's compare that to a linear model (a linear SVM, since `loss='hinge'`) trained with stochastic gradient descent (SGD):

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfTransformer

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),  # accuracy is ~91% without this step
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, random_state=42,
                                           max_iter=5, tol=None)),
])
text_clf.fit(twenty_train.data, twenty_train.target)  
predicted2 = text_clf.predict(docs_test)
np.mean(predicted2 == twenty_test.target)

With the TF-IDF step, the SGDClassifier reaches about 92% accuracy on the test set.

scikit-learn further provides utilities for more detailed performance analysis of the results. Here we apply them to the naïve Bayes predictions stored in `predicted`:

In [None]:
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

In [None]:
metrics.confusion_matrix(twenty_test.target, predicted)

With the four categories selected here, we would expect the confusion matrix to show that posts from the two sports newsgroups (rec.sport.baseball and rec.sport.hockey) are confused with one another more often than with comp.graphics.
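When reading the matrix, keep in mind that rows correspond to true labels and columns to predicted labels, so off-diagonal entries are the mistakes. The sketch below uses small made-up label lists to show this, and to show that `np.mean(pred == target)` is the same quantity as `metrics.accuracy_score`.

```python
from sklearn import metrics

# Made-up labels for six documents in three classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 1, 0, 1, 1, 2]

cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)

# cm[i, j] counts documents whose true class is i and predicted class is j.
# Here one class-0 document was mistaken for class 1.
print(cm[0, 1])

# Accuracy is the diagonal sum divided by the total: 5 correct out of 6.
accuracy = metrics.accuracy_score(y_true, y_pred)
print(accuracy)
```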

