# Very basic introduction to text classification using sklearn

In [60]:
from __future__ import print_function, unicode_literals, division
import random
import re
from pprint import pprint
import numpy as np
import warnings

warnings.simplefilter("ignore")

# Getting some data

The toy data for this introduction consists of email subject lines from the Corpora mailing list, each labeled either as a job opening announcement (`job`) or as any other email on the list (`other`).

In [16]:
training_data = [
    ("job", "Postdoc in NLP and Machine Learning at the University College London"),
    ("job", "Senior Research Associate Post in Language Development, Lancaster University"),
    ("job", "Postdoc position in computational linguistics (reminder!)"),
    ("job", "Call for applicants : PhD on incremental text mining"),
    ("job", "Job Opening: Post-doctoral research fellow (Osnabrück University; DFG-funded fixed term contract, 3 years)"),
    ("job", "second announcement: postdoc position in computational linguistics at Duesseldorf, Germany"),
    ("job", "A job opportunity at Lancaster: Java programmer"),
    ("job", "looking for a Farsi linguist"),
    ("job", "Lecturer in Data Engineering at Lancaster University"),
    ("job", "Job: Research or Lead Scientist - automatic language processing, text mining, and information management"),

    ("other", "New Journal: Language and Law / Linguagem e Direito"),
    ("other", "Deadline extension: Challenges in the management of large corpora"),
    ("other", "Call for Participation 5th Lisbon Machine Learning School"),
    ("other", "Final CFP: NAACL 2016 Workshop on Metaphor in NLP"),
    ("other", "Announcement: Course in Statistics for PhD Students"),
    ("other", "Reminder: English Profile Seminar"),
    ("other", "Beth PhD Thesis Prize"),
    ("other", "Looking for Transliteration Data"),
    ("other", "Participants needed for a research web study on interactive virtual characters"),
    ("other", "LINGUIST List birthday and new service GeoLing"),
]

random.shuffle(training_data)
labels, texts = zip(*training_data)

# Preprocessing & feature extraction

As a first step, we transform each of our texts into a set of features that will be used by the classifier.

We are going to use simple bag-of-words features for this demo, but you can create any features you like.

Note that sklearn also comes with a variety of feature extraction methods, so for basic things there is not even the need to "do it yourself" if you don't want to.
For more info see: [Feature extraction](http://scikit-learn.org/stable/modules/feature_extraction.html)

In [47]:
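# a token is a run of word characters (letters and digits) without underscores;
# the raw string keeps the backslash in \W from being read as a string escape
tokenizer = re.compile(r"[^\W_]+", re.UNICODE)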

def extract_features(text):
    tokens = tokenizer.findall(text)
    # we are going to use simple bag of word features for this demo
    features = {token.lower(): 1 for token in tokens}
    return features

features = [extract_features(text) for text in texts]
features[0]

{'and': 1,
 'birthday': 1,
 'geoling': 1,
 'linguist': 1,
 'list': 1,
 'new': 1,
 'service': 1}
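
For comparison, here is a minimal sketch of the same binary bag-of-words idea using sklearn's built-in `CountVectorizer` instead of a hand-written function. The two subject lines are taken from the training data above, and the variable names are illustrative. Note that the default tokenizer of `CountVectorizer` differs slightly from the regular expression above (for example, it drops single-character tokens such as "a"), so the resulting features are not guaranteed to be identical.

```python
from sklearn.feature_extraction.text import CountVectorizer

sample_subjects = [
    "looking for a Farsi linguist",
    "Looking for Transliteration Data",
]

# binary=True records only presence/absence of a token, like the
# {token: 1} dictionaries built by extract_features
count_vectorizer = CountVectorizer(lowercase=True, binary=True)
X_sample = count_vectorizer.fit_transform(sample_subjects)

# vocabulary_ maps each token to its column index in the matrix
print(sorted(count_vectorizer.vocabulary_))
print(X_sample.toarray())
```

The result is already a sparse matrix, so no separate `DictVectorizer` step would be needed with this approach.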

# Transforming the features

Next we need to transform our data to a format that sklearn estimators can work with.

As we stored our features in dictionaries, we are going to use the [DictVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html).

In [45]:
from sklearn.feature_extraction import DictVectorizer
vectorizer = DictVectorizer()

# we create the feature matrix by fitting the vectorizer
# (building a mapping from feature names to vector indices)
# and transforming the given data
X = vectorizer.fit_transform(features)

X.shape  # number examples x number of features

(20, 106)
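
To make the mapping concrete, the following sketch fits a `DictVectorizer` on two tiny, made-up feature dictionaries and shows that a feature unseen during fitting is silently dropped at transform time. The dictionaries and variable names are illustrative only and are not part of the data above.

```python
from sklearn.feature_extraction import DictVectorizer

toy_features = [
    {"postdoc": 1, "position": 1},
    {"workshop": 1, "position": 1},
]

toy_vectorizer = DictVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_features)

# feature_names_ lists the feature name behind each matrix column
print(toy_vectorizer.feature_names_)
print(toy_X.toarray())

# "deadline" was never seen during fitting, so it gets no column
new_X = toy_vectorizer.transform([{"postdoc": 1, "deadline": 1}])
print(new_X.toarray())
```

This is why the number of columns stays fixed once the vectorizer has been fitted.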

* If we had more data, we could count the occurrences of tokens in our texts and weight the features according to TF-IDF using a [TfidfTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer)
* If we had too many features to efficiently train a classifier, we could use a [FeatureHasher](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher)
* Vectorizers, Transformers and Classifiers can be combined into [Pipelines](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
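
As a rough sketch of how these pieces could fit together, the example below chains a `DictVectorizer`, a `TfidfTransformer` and a classifier into a single `Pipeline`. The token counts, the step names and the choice of `LogisticRegression` are assumptions made for illustration, not something this notebook prescribes.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# made-up token counts for four tiny "documents"
toy_features = [
    {"postdoc": 2, "position": 1},
    {"job": 1, "opening": 1, "postdoc": 1},
    {"workshop": 2, "deadline": 1},
    {"call": 1, "cfp": 1, "workshop": 1},
]
toy_labels = ["job", "job", "other", "other"]

# each step transforms the data and hands it to the next one;
# the last step is the classifier
pipeline = Pipeline([
    ("vectorize", DictVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classify", LogisticRegression()),
])
pipeline.fit(toy_features, toy_labels)

# the pipeline accepts raw feature dictionaries at prediction time
toy_prediction = pipeline.predict([{"postdoc": 1, "job": 1}])
print(toy_prediction)
```

A pipeline can be passed to cross-validation or grid search like any single estimator, which keeps the vectorizer from being fitted on held-out data.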

# Training a classifier


Let's train a simple logistic regression classifier and test it with two examples.

In [54]:
y = labels

# train a classifier
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X, y)

examples = [
    "Postdoc/Programmer positions in Biomedical NLP",
    "1st CfP: 2nd Workshop on NLP for Translation Memories at LREC 2016"
]

# it is important to use the same vectorizer we fitted before (don't fit it
# again!) to transform the test data, as we need the same vector dimensions
# for training and test data
X_test = vectorizer.transform([extract_features(text) for text in examples])

# let the model predict the categories of the examples
predictions = clf.predict(X_test)
pprint(list(zip(examples, predictions)))

[('Postdoc/Programmer positions in Biomedical NLP', 'job'),
 ('1st CfP: 2nd Workshop on NLP for Translation Memories at LREC 2016',
  'other')]


*So far so good - we have a model that can make predictions!*

* If you want to [persist your model](http://scikit-learn.org/stable/modules/model_persistence.html), look into joblib and don't forget to persist the vectorizer that goes with your model as well!
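
A minimal sketch of that advice, assuming the separately installable `joblib` package is available: the vectorizer and the classifier are stored together in one file and restored as a pair. The toy data, the dictionary keys and the file name are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB

# tiny stand-in model, not the one trained above
toy_features = [{"postdoc": 1, "position": 1}, {"workshop": 1, "cfp": 1}]
toy_labels = ["job", "other"]
toy_vectorizer = DictVectorizer()
toy_clf = BernoulliNB().fit(toy_vectorizer.fit_transform(toy_features), toy_labels)

# store vectorizer and classifier together so they cannot get out of sync
model_path = os.path.join(tempfile.mkdtemp(), "subject_classifier.joblib")
joblib.dump({"vectorizer": toy_vectorizer, "classifier": toy_clf}, model_path)

# later (or in another process): load both and predict
restored = joblib.load(model_path)
restored_X = restored["vectorizer"].transform([{"postdoc": 1}])
restored_prediction = restored["classifier"].predict(restored_X)
print(restored_prediction)
```

Only load such files from sources you trust, since they are based on pickle.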

# Run a cross-validation evaluation of different classifiers

Sklearn offers built-in methods to create train/test splits of the data and to run cross-validation.

As all classifiers in sklearn share the same basic interface, it is easy to compare a few.
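
Before turning to cross-validation, here is a small sketch of a single train/test split with `train_test_split`. The eight subject lines are shortened, made-up variants of the data above, and `CountVectorizer` stands in for the custom feature extraction to keep the example self-contained. The split happens before vectorizing, so the vectorizer only ever sees training texts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

toy_subjects = [
    "Postdoc position in computational linguistics",
    "Job: Research Associate",
    "Lecturer in Data Engineering",
    "PhD position on text mining",
    "Call for Participation: Machine Learning School",
    "Final CFP: Workshop on Metaphor",
    "Reminder: English Profile Seminar",
    "Deadline extension: Workshop on large corpora",
]
toy_labels = ["job"] * 4 + ["other"] * 4

# hold back 25% of the examples; stratify keeps the class balance
train_subjects, held_out_subjects, y_train, y_held_out = train_test_split(
    toy_subjects, toy_labels, test_size=0.25, stratify=toy_labels, random_state=0
)

# fit the vectorizer on the training texts only, then reuse it
toy_vectorizer = CountVectorizer(binary=True)
X_train = toy_vectorizer.fit_transform(train_subjects)
X_held_out = toy_vectorizer.transform(held_out_subjects)

toy_clf = BernoulliNB().fit(X_train, y_train)
print("Held-out accuracy: %0.4f" % toy_clf.score(X_held_out, y_held_out))
```

With so few examples a single split is very noisy, which is one reason to prefer cross-validation on small datasets.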

In [106]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import Perceptron
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers_to_test = [
    DummyClassifier(strategy='uniform'),
    BernoulliNB(),
    LogisticRegression(),
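    # `n_iter` was renamed to `max_iter` in newer scikit-learn releases
    Perceptron(max_iter=100),
    LinearSVC(),
    # the old `warn_on_equidistant` argument is no longer accepted
    KNeighborsClassifier(weights="distance"),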
    RandomForestClassifier()
]

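# `cross_val_score` lives in `sklearn.model_selection` in current releases
# (the old `sklearn.cross_validation` module has been removed)
from sklearn.model_selection import cross_val_score
for clf in classifiers_to_test:
    print("\n", clf)
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy", n_jobs=-1)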
    print("Avg. accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() / 2))


 DummyClassifier(constant=None, random_state=None, strategy='uniform')
Avg. accuracy: 0.2500 (+/- 0.1250)

 BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
Avg. accuracy: 0.9000 (+/- 0.1000)

 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Avg. accuracy: 0.6000 (+/- 0.1000)

 Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,
      n_iter=100, n_jobs=1, penalty=None, random_state=0, shuffle=True,
      verbose=0, warm_start=False)
Avg. accuracy: 0.5500 (+/- 0.1346)

 LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
Avg. accuracy: 0.6500 (+/- 0.1146)

 KNeighborsC

# Optimize hyperparameters

Let's see if we can get some better results for support vector machines if we tune the parameters.

In [98]:
clf = LinearSVC()
parameters = {
    'fit_intercept': [True, False],
    'loss': ["hinge", "squared_hinge"],
    'C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10, 100, 1000],
    'max_iter': [5, 10, 100, 1000]
}

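# `GridSearchCV` lives in `sklearn.model_selection` in current releases
# (the old `sklearn.grid_search` module has been removed)
from sklearn.model_selection import GridSearchCV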
grid = GridSearchCV(clf, parameters, n_jobs=-1, cv=10, scoring="accuracy")
grid.fit(X, y)
print("Best score:", grid.best_score_)
print("Best params:", grid.best_params_)
print("Best model:", grid.best_estimator_)

Best score: 0.75
Best params: {'C': 0.1, 'loss': 'hinge', 'fit_intercept': True, 'max_iter': 5}
Best model: LinearSVC(C=0.1, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=5, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0)


*An improvement of 0.1 (from 0.65 to 0.75), but still not as good as the Naive Bayes classifier (0.90) on this tiny dataset.*
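
Besides the single best result, a fitted grid search also keeps the scores of every parameter combination it tried. The sketch below shows one way to list them via `cv_results_`, which is available on `GridSearchCV` from `sklearn.model_selection`. It uses synthetic data from `make_classification` as a stand-in, so the numbers it prints say nothing about the email subjects.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# synthetic stand-in data, not the email subjects from this notebook
toy_X, toy_y = make_classification(n_samples=60, n_features=10, random_state=0)

toy_grid = GridSearchCV(
    LinearSVC(max_iter=10000),
    {"C": [0.01, 0.1, 1.0]},
    cv=3,
    scoring="accuracy",
)
toy_grid.fit(toy_X, toy_y)

# cv_results_ holds one entry per parameter combination
for params, mean_score, std_score in zip(
    toy_grid.cv_results_["params"],
    toy_grid.cv_results_["mean_test_score"],
    toy_grid.cv_results_["std_test_score"],
):
    print("%r: %0.4f (std %0.4f)" % (params, mean_score, std_score))
```

Looking at the full table helps to judge whether the best setting is clearly better or just marginally ahead of its neighbours.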

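# Test performance on a hold-out test set

Finally, let's test our best model, the Bernoulli Naive Bayes classifier with its default parameters, against a small hold-out test set. Note that the first test example also appears in the training data, so the result is slightly optimistic.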

In [105]:
test_data = [
    ("job", "second announcement: postdoc position in computational linguistics at Duesseldorf, Germany"),
    ("job", "Job: Research Associate"),
    ("job", "10 PhD places in Data Science at the University of Edinburgh"),

    ("other", "Looking for available similar sentences corpora"),
    ("other", "PR: The Resource Management Agency (RMA) adopts the ISLRN"),
    ("other", "Announcement: Translating and the Computer 37"),
]
# transform data
test_labels, test_texts = zip(*test_data)
test_features = [extract_features(text) for text in test_texts]
X_test = vectorizer.transform(test_features)

# train classifier and make predictions
clf = BernoulliNB().fit(X, y)
test_predictions = clf.predict(X_test)

# evaluate predictions
from sklearn import metrics
y_test = np.array(test_labels)
print(metrics.classification_report(y_test, test_predictions))
print("Accuracy: %0.4f" % metrics.accuracy_score(y_test, test_predictions))

             precision    recall  f1-score   support

        job       1.00      1.00      1.00         3
      other       1.00      1.00      1.00         3

avg / total       1.00      1.00      1.00         6

Accuracy: 1.0000
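
With a perfect score, the report above has no errors to break down. As a sketch of how mistakes would show up, the example below builds a confusion matrix with `metrics.confusion_matrix` from hypothetical labels and predictions. These are not results of the model above.

```python
from sklearn import metrics

# hypothetical labels and predictions, chosen to include two mistakes
toy_true = ["job", "job", "job", "other", "other", "other"]
toy_pred = ["job", "job", "other", "other", "other", "job"]

# rows are true classes, columns are predicted classes,
# both in the order given by `labels`
toy_confusion = metrics.confusion_matrix(toy_true, toy_pred, labels=["job", "other"])
print(toy_confusion)
```

In this made-up case, one `job` subject is predicted as `other` and one `other` subject is predicted as `job`, which appear as the two off-diagonal entries.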
