# Document Classification

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Example 1: SMS classification

### 1. Getting and preparing the data

In this example we are going to train two models to classify SMS messages as "spam" or "ham" (not spam).

In [2]:
sms = pd.read_table('data/sms.tsv', header=None, names=['label', 'message'])

In [3]:
sms.shape

(5572, 2)

In [4]:
sms.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [5]:
# examine the class distribution
sms.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [6]:
# baseline: share of ham messages, i.e. the accuracy of always predicting "ham"
round(4825.0 / (747 + 4825), 4)

0.8659

In [7]:
# convert label to a numerical variable: 1 (positive class) will be "spam"
sms['label_num'] = (sms['label'] == 'spam').astype(int)

In [8]:
# check that the conversion worked
sms.head()

| | label | message | label_num |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


In [9]:
X = sms.message # Each element of X is a 'document'
y = sms.label_num

In [10]:
# split X and y into training and testing sets
# (train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

(4179,) (1393,)
(4179,) (1393,)
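
The sizes above follow from the default behaviour of `train_test_split`: when neither `test_size` nor `train_size` is given, 25% of the rows are held out (1393 of 5572 here). The following is a minimal sketch on made-up data; the toy arrays and the use of the optional `stratify` argument are assumptions added for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 "documents" with an imbalanced binary label.
X_toy = np.arange(20)
y_toy = np.array([0] * 16 + [1] * 4)

# Default split: 75% of the rows for training, 25% for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)
print(X_tr.shape, X_te.shape)

# Stratified split: both parts keep the original class ratio as closely as possible.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, random_state=1, stratify=y_toy
)
print(y_tr_s.mean(), y_te_s.mean())
```

With an imbalanced label such as ham/spam, stratifying can make the test-set class ratio less dependent on the random seed.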


### 2. Vectorizing: getting the features.

In [11]:
# Import the object
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the vectorizer
vect = CountVectorizer()

# Learn the vocabulary and build the document-term matrix in one step
X_train_dtm = vect.fit_transform(X_train)

# Examine the document-term matrix
X_train_dtm

<4179x7456 sparse matrix of type '<type 'numpy.int64'>'
	with 55209 stored elements in Compressed Sparse Row format>

**Question: what does 4179 x 7456 mean?**

In [12]:
# transform testing data (using the fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
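
To make the two vectorizing steps concrete, here is a small sketch on a made-up corpus (the three messages and the name `toy_vect` are assumptions for illustration). It shows that the matrix has one row per document and one column per distinct token, and that `transform` only counts tokens that were seen during `fit`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-message training corpus.
train_docs = ["call me tonight", "free prize call now", "win a free prize"]

toy_vect = CountVectorizer()
train_dtm = toy_vect.fit_transform(train_docs)

# Rows are documents, columns are the distinct tokens learned by fit.
# The default tokenizer ignores one-character tokens such as "a".
vocabulary = sorted(toy_vect.vocabulary_)
print(vocabulary)
print(train_dtm.shape)
print(train_dtm.toarray())

# transform reuses the fitted vocabulary: "urgent" and "cash" were never
# seen during fit, so they are silently dropped.
test_dtm = toy_vect.transform(["urgent free cash prize"])
print(test_dtm.toarray())
```

This is why the testing data is passed through `transform` rather than `fit_transform`: the columns of the test matrix must line up with the columns the model was trained on.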

### 3. Model building

We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
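
As a rough illustration of what the classifier does with word counts, the sketch below implements a simplified multinomial Naive Bayes with NumPy on a made-up document-term matrix. The token names, the toy counts and the helper functions are assumptions for illustration; this is not scikit-learn's implementation, only the same idea: class priors plus smoothed per-class token frequencies, combined in log space.

```python
import numpy as np

# Toy document-term matrix: rows are documents, columns are counts of the
# (hypothetical) tokens ["free", "win", "lunch", "meeting"].
X = np.array([
    [2, 1, 0, 0],   # spam
    [1, 2, 0, 0],   # spam
    [0, 0, 1, 2],   # ham
    [0, 0, 2, 1],   # ham
    [0, 1, 1, 1],   # ham
])
y = np.array([1, 1, 0, 0, 0])

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log token probabilities per class."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # Total count of each token per class, plus additive (Laplace) smoothing.
    counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
    log_likelihood = np.log(counts / counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_likelihood

def predict(X_new, classes, log_prior, log_likelihood):
    """Pick the class with the highest log posterior score."""
    scores = X_new @ log_likelihood.T + log_prior
    return classes[np.argmax(scores, axis=1)]

classes, log_prior, log_likelihood = fit_multinomial_nb(X, y)
new_docs = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])
predictions = predict(new_docs, classes, log_prior, log_likelihood)
print(predictions)
```

The smoothing term `alpha` plays the same role as the `alpha=1.0` parameter of `MultinomialNB`: it keeps a token that never appeared in a class from driving that class's probability to zero.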

In [13]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [14]:
# Train the model
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [15]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

### 4. Model evaluation

In [16]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.98851399856424982

In [17]:
# print the confusion matrix
conf_mat = metrics.confusion_matrix(y_test, y_pred_class)
conf_mat

array([[1203,    5],
       [  11,  174]])

In [18]:
pd.DataFrame(data=conf_mat, columns=['Pred_Ham','Pred_Spam'], index=['Obs_Ham','Obs_Spam'])

| | Pred_Ham | Pred_Spam |
|---|---|---|
| Obs_Ham | 1203 | 5 |
| Obs_Spam | 11 | 174 |
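
Accuracy alone hides how the errors are split between the two classes. As a small sketch, the confusion matrix reported above can be turned into precision and recall for the spam class; the variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix reported above: rows are observed, columns are predicted.
conf_mat_nb = np.array([[1203, 5],
                        [11, 174]])
tn, fp, fn, tp = conf_mat_nb.ravel()

accuracy = (tp + tn) / conf_mat_nb.sum()
precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the actual spam messages, how many were caught

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

For a spam filter, the 5 false positives (ham flagged as spam) are often the more costly kind of error, which is what precision tracks.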


**Exercise: print "False Positives" messages in the testing set (X_test)**

**Exercise: print "False Negatives" messages in the testing set (X_test)**
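
One possible approach to both exercises is boolean masking. The sketch below uses small made-up stand-ins for `X_test`, `y_test` and `y_pred_class` (the names `X_demo`, `y_true` and `y_pred` are hypothetical), so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_test, y_test and y_pred_class.
X_demo = pd.Series(["see you at lunch", "WIN cash now", "call mum", "free entry today"])
y_true = pd.Series([0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 0])

# False positives: actually ham (0) but predicted spam (1).
false_positives = X_demo[(y_true == 0) & (y_pred == 1)]
# False negatives: actually spam (1) but predicted ham (0).
false_negatives = X_demo[(y_true == 1) & (y_pred == 0)]

print(false_positives.tolist())
print(false_negatives.tolist())
```

The same two masks, applied with `y_test` and `y_pred_class`, should select the corresponding messages from `X_test`.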

## Congratulations! You have built your first text classifier!

**Exercise: predict the class for the following SMS messages:**
1. "Today is your lucky day! claim $100 of free gas now! just text back saying YES."
2. "I have been calling you all day, please let me know if you are comming back before dinner."

In [19]:
sms1 = "Today is your lucky day! claim $100 of free gas now! just text back saying YES."
sms2 = "I have been calling you all day, please let me know if you are comming back before dinner."
nb.predict(vect.transform([sms1, sms2]))

array([1, 0])

In [20]:
vect.transform([sms1, sms2])

<2x7456 sparse matrix of type '<type 'numpy.int64'>'
	with 31 stored elements in Compressed Sparse Row format>

**Exercise: train a new classifier, this time using a Logistic Regression model with the default parameters. Evaluate it with a confusion matrix and the accuracy metric.**

[Logistic Regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):

Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
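
As a minimal sketch of the idea, the snippet below applies the logistic function to a linear score computed from token counts. The weights, intercept and toy counts are made-up values for illustration, not parameters fitted on the SMS data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted parameters for three token-count features.
weights = np.array([1.5, -2.0, 0.5])
intercept = -1.0

# Two toy documents as rows of token counts.
X_toy = np.array([[2, 0, 1],
                  [0, 2, 0]])

scores = X_toy @ weights + intercept              # linear part of the model
probabilities = sigmoid(scores)                   # P(class = 1 | document)
predictions = (probabilities >= 0.5).astype(int)  # threshold at 0.5
print(probabilities, predictions)
```

The decision boundary is linear in the features: a document is assigned to class 1 exactly when its linear score is non-negative.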

In [21]:
# Import the object
from sklearn.linear_model import LogisticRegression

In [22]:
# Instantiate the model
logreg = LogisticRegression()
# Fit using the vectorized data
logreg.fit(X_train_dtm, y_train)
# make class predictions for X_test_dtm
logreg_pred = logreg.predict(X_test_dtm)

In [23]:
# calculate accuracy
metrics.accuracy_score(y_test, logreg_pred)

0.9877961234745154

In [24]:
conf_mat_logreg = metrics.confusion_matrix(y_test, logreg_pred)
pd.DataFrame(data=conf_mat_logreg, columns=['Pred_Ham','Pred_Spam'], index=['Obs_Ham','Obs_Spam'])

| | Pred_Ham | Pred_Spam |
|---|---|---|
| Obs_Ham | 1207 | 1 |
| Obs_Spam | 16 | 169 |


## Example 2: Document Classification

The goal of this guide is to explore some of the main scikit-learn tools on a single practical task: analysing a collection of text documents (newsgroups posts) on twenty different topics. In this notebook we will see how to:
    
- load the file contents and the categories
- extract feature vectors suitable for machine learning
- train different models to perform categorization
- use a grid search strategy to find a good configuration of both the feature extraction components and the classifier

The bag-of-words representation is quite simplistic but surprisingly useful in practice.
In a supervised setting in particular, it can be combined with fast and scalable linear models to train document classifiers. This example shows how scikit-learn can be used to classify documents by topic using a bag-of-words approach.

The dataset used in this example is the 20 newsgroups dataset. It will be automatically downloaded, then cached.

### Loading the 20 newsgroups dataset

The dataset is called “Twenty Newsgroups”. Here is the official description, quoted from the website:

*The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.*

In the following we will use the built-in dataset loader for 20 newsgroups from scikit-learn. In order to get faster execution times for this first example we will work on a partial dataset with only 4 categories out of the 20 available in the dataset:


### 1. Getting and preparing the data

In [25]:
from sklearn.datasets import fetch_20newsgroups
# We can now load the list of files matching those categories as follows:
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

The returned dataset is a scikit-learn “bunch”: a simple holder object whose fields can be accessed either as Python dict keys or as object attributes. For instance, target_names holds the list of the requested category names:

In [26]:
train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
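
The two access styles can be checked without downloading anything by building a tiny `Bunch` by hand. The contents below are made up for illustration and are not the real 20 newsgroups data.

```python
from sklearn.utils import Bunch

# Hypothetical miniature dataset built by hand.
toy = Bunch(
    data=["first post", "second post"],
    target=[0, 1],
    target_names=["alt.atheism", "comp.graphics"],
)

# The same field is reachable as a dict key or as an attribute.
print(toy["target_names"])
print(toy.target_names)
print(sorted(toy.keys()))
```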

The files themselves are loaded in memory in the data attribute. For reference, the file names are also available in the filenames attribute. The number of loaded documents is:

In [27]:
print(len(train.data))

2257


In [28]:
# Let's print the first loaded file:
print(train.data[0])
#print("\n".join(train.data[0].split("\n")[:5]))
print('----Category:')
print(train.target_names[train.target[0]])

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.

----Category:
comp.graphics


Supervised learning algorithms require a category label for each document in the training set. In this case the category is the name of the newsgroup, stored in the target attribute as an integer index into target_names.

In [29]:
print(train.target[:10])
print(np.array(train.target_names)[train.target[:10]])

[1 1 3 3 3 3 3 2 2 2]
['comp.graphics' 'comp.graphics' 'soc.religion.christian'
 'soc.religion.christian' 'soc.religion.christian' 'soc.religion.christian'
 'soc.religion.christian' 'sci.med' 'sci.med' 'sci.med']


### 2. Vectorizing: getting the features.

### Using Bag of Words

In [30]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train.data)
X_train_counts.shape

(2257, 35788)

### 3. Model building

Now that we have our features, we can train a classifier to try to predict the category of a post. Let’s start with a naïve Bayes classifier, which provides a nice baseline for this task. scikit-learn includes several variants of this classifier; the one most suitable for word counts is the multinomial variant:

In [31]:
# Load the object
from sklearn.naive_bayes import MultinomialNB
# Instantiate the model 
MNNB_clf = MultinomialNB()
# Fit the model
MNNB_clf.fit(X_train_counts, train.target)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### 4. Model evaluation

In [32]:
# Getting the test set
test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
docs_test = test.data
docs_test_dtm = count_vect.transform(docs_test)
MNNB_predicted = MNNB_clf.predict(docs_test_dtm)
## Accuracy
np.mean(MNNB_predicted == test.target)

0.93408788282290278

In [33]:
confusion_mtx1 = metrics.confusion_matrix(test.target, MNNB_predicted)
pd.DataFrame(data=confusion_mtx1, columns=train.target_names, index=train.target_names)

| | alt.atheism | comp.graphics | sci.med | soc.religion.christian |
|---|---|---|---|---|
| alt.atheism | 288 | 4 | 3 | 24 |
| comp.graphics | 8 | 370 | 8 | 3 |
| sci.med | 12 | 13 | 360 | 11 |
| soc.religion.christian | 5 | 4 | 4 | 385 |
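
With more than two classes it helps to look at each row of the confusion matrix separately. The sketch below reuses the numbers reported above (rows are observed classes, columns are predicted classes) to compute per-class recall and to find the most frequent kind of mistake. The variable names are chosen here for illustration.

```python
import numpy as np

names = ["alt.atheism", "comp.graphics", "sci.med", "soc.religion.christian"]
# Confusion matrix reported above: rows are observed, columns are predicted.
conf = np.array([[288, 4, 3, 24],
                 [8, 370, 8, 3],
                 [12, 13, 360, 11],
                 [5, 4, 4, 385]])

# Per-class recall: correct predictions divided by the row total.
recall = conf.diagonal() / conf.sum(axis=1)
for name, value in zip(names, recall):
    print(name, round(value, 3))

# Largest off-diagonal entry: the most frequent kind of mistake.
errors = conf.copy()
np.fill_diagonal(errors, 0)
obs, pred = np.unravel_index(errors.argmax(), errors.shape)
print(names[obs], "->", names[pred], errors[obs, pred])
```

The largest off-diagonal count is the 24 alt.atheism posts predicted as soc.religion.christian, which is plausible given how much vocabulary the two groups share.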


**Exercise: using the trained model, classify the following documents:**
1. "This document has to do with printing and computer graphics, for instance it says that OpenGL on the GPU is fast"
2. "Based on the life and teachings of Jesus Christ approximately 2,000 years ago, it states that God is love"

In [34]:
new_docs = ['This document has to do with printing and computer graphics, for instance it says that OpenGL on the GPU is fast',
            'Based on the life and teachings of Jesus Christ approximately 2,000 years ago, it states that God is love']
X_new_dtm = count_vect.transform(new_docs)
new_docs_pred = MNNB_clf.predict(X_new_dtm)

In [35]:
print(new_docs_pred)
print(train.target_names[1])
print(train.target_names[3])

[1 3]
comp.graphics
soc.religion.christian


### Test with an article from Wikipedia!

In [36]:
# Read the article; the with-block closes the file automatically
with open('test_article.txt', 'r') as test_file:
    test_article = test_file.read()
test_article = [''.join([i if ord(i) < 128 else '' for i in test_article])] # Removes non-ASCII characters.
test_article_dtm = count_vect.transform(test_article)
pred_class = MNNB_clf.predict(test_article_dtm)[0]
print(train.target_names[pred_class])

comp.graphics


**Exercise: build a new classifier using the same algorithm, but this time use tf-idf instead of bag of words. Compare the two models.**

### Using tf-idf

In [37]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2257, 35788)
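
One way to continue from here is to chain the counting, tf-idf weighting and classification steps. The sketch below does so with scikit-learn's `Pipeline` on a made-up four-document corpus; the corpus, the labels and the step names are assumptions for illustration, and the choice of `Pipeline` is a convenience rather than something the exercise requires.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for train.data and train.target.
label_names = ["comp.graphics", "soc.religion.christian"]
train_docs = [
    "opengl rendering on the gpu is fast",
    "the image file format stores pixels",
    "god is love and faith",
    "the church teaches faith and prayer",
]
train_labels = [0, 0, 1, 1]

# Each step is fitted on the training data only; predict() then applies
# the fitted vocabulary and idf weights to new documents.
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(train_docs, train_labels)

new_docs = ["gpu rendering of an image", "prayer and faith in god"]
predicted = text_clf.predict(new_docs)
predicted_names = [label_names[i] for i in predicted]
print(predicted_names)
```

When the steps are applied by hand instead, the test counts should go through `tfidf_transformer.transform`, not `fit_transform`, so that the idf weights learned on the training set are reused.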

## Resources:
- Excellent documentation and examples can be found at: http://scikit-learn.org/
- For the theory behind Machine Learning: [*The Elements of Statistical Learning: Data Mining, Inference, and Prediction.*](http://statweb.stanford.edu/~tibs/ElemStatLearn/)