# Introduction to Data Science
## Text classification
***

Read in some packages.

In [76]:
# Import the libraries we will be using
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 8

np.random.seed(36)

### Data
We have a new data set in `data/spam_ham.csv`. Let's take a look at what it contains.

In [54]:
!head -2 data/spam_ham.csv

text,spam
'Hi...I have to use R to find out the 90\% confidence-interval for the sensitivityand specificity of the following diagnostic test:A particular diagnostic test for multiple sclerosis was conducted on 20 MSpatients and 20 healthy subjects, 6 MS patients were classified as healthyand 8 healthy subjects were classified as suffering from the MS.Furthermore, I need to find the number of MS patients required for asensitivity of 1\%...Is there a simple R-command which can do that for me?I am completely new to R...Help please!Jochen-- View this message in context: http://www.nabble.com/Confidence-Intervals....-help...-tf3544217.html#a9894014Sent from the R help mailing list archive at Nabble.com.______________________________________________R-help@stat.math.ethz.ch mailing listhttps://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.html',ham


Looks like we have two columns: some text (looks like an email), and a label for spam or ham. What is the distribution of the target variable?

In [56]:
!cut -f2 -d',' data/spam_ham.csv | sort | uniq -c | head

   1 
   1                  Alonzo Houser
   1                  Andrea Winslow
   1                  Arron Tanner
   1                  Becky Conklin
   1                  Christie Slaughter
   1                  Danial Good
   1                  Darcy Berger
   1                  Dena Major
   1                  Donna Henderson


It doesn't look like that did what we wanted. Can you see why?

The file contains **text data**, and the text in the first column can itself contain commas. Command-line tools such as `cut` split on every instance of the delimiter, so they break a single text field into many pieces. Ideally, we would like a way of **encapsulating** the first column. The data actually already has this: the first column is wrapped in single quotes. Python (and pandas) have more explicit ways of dealing with this:

In [57]:
data = pd.read_csv("data/spam_ham.csv", quotechar="'", escapechar="\\")

Above, we specify that fields are encapsulated in single quotes (`quotechar`). But what if the text itself contains single quotes? For example, the apostrophe in a word like "can't" would end the encapsulation early. To overcome this, we **escape** single quotes that are just part of the text. Here, we specify the escape character as a backslash (`escapechar`), so "can't" is written in the file as "can\'t".
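
To see the effect of `quotechar` and `escapechar` in isolation, here is a minimal sketch on a made-up two-row file held in memory. The rows and the names `raw`, `naive_fields`, and `toy` are illustrative assumptions, not part of `data/spam_ham.csv`.

```python
import io
import pandas as pd

# Made-up file in the same layout as the real data: the text field is wrapped
# in single quotes and apostrophes inside the text are escaped with a backslash.
raw = (
    "text,spam\n"
    "'Hi, I can\\'t open the file, please help',ham\n"
    "'You won, claim your prize now',spam\n"
)

# Splitting the first record on every comma cuts the text into pieces.
naive_fields = raw.splitlines()[1].split(",")

# Telling pandas about the quote and escape characters keeps the text intact.
toy = pd.read_csv(io.StringIO(raw), quotechar="'", escapechar="\\")

print(len(naive_fields))
print(toy.shape)
print(toy.loc[0, "text"])
```

The naive split produces more pieces than the two columns we expect, while `read_csv` returns one text field and one label per row.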

Let's take another look at our data.

In [58]:
data.head()

Unnamed: 0,text,spam
0,Hi...I have to use R to find out the 90% confi...,ham
1,"Francesco Poli wrote:> On Sun, 15 Apr 2007 21:...",ham
2,Stephen Thorne wrote:> What I was thinking was...,ham
3,"Hi,I have this site that auto generates an ind...",ham
4,Author: metzeDate: 2007-04-16 08:20:13 +0000 (...,ham


Here, the target is whether or not a record should be considered spam. This is recorded as the string 'spam' or 'ham'. To make it a little easier for our classifier, let's recode it as `1` for spam and `0` for ham.

In [59]:
data['spam'] = (data['spam'] == 'spam').astype(int)

In [60]:
data.head()

Unnamed: 0,text,spam
0,Hi...I have to use R to find out the 90% confi...,0
1,"Francesco Poli wrote:> On Sun, 15 Apr 2007 21:...",0
2,Stephen Thorne wrote:> What I was thinking was...,0
3,"Hi,I have this site that auto generates an ind...",0
4,Author: metzeDate: 2007-04-16 08:20:13 +0000 (...,0
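
Once the label is numeric, the distribution of the target is easy to read off in pandas. The sketch below uses a made-up four-row frame (`toy` is a hypothetical name) rather than the real data.

```python
import pandas as pd

# Made-up labels standing in for the 'spam' column.
toy = pd.DataFrame({"spam": ["ham", "spam", "ham", "ham"]})

# Recode the strings as 1 (spam) and 0 (ham).
toy["spam"] = (toy["spam"] == "spam").astype(int)

# Count the rows per class; the mean of a 0/1 column is the share of spam.
counts = toy["spam"].value_counts()
share = toy["spam"].mean()

print(counts)
print("share of spam = %.2f" % share)
```

The same two calls on `data['spam']` would give the class balance of the full data set.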


Since we are going to do some modeling, we should split our data into training and test sets.

In [61]:
X = data['text']
Y = data['spam']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=.75)
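
If the classes are imbalanced, one option is to ask `train_test_split` to keep the class proportions the same in both parts. The sketch below is an assumption about how you might do that, not what the cell above does: it uses made-up data, the `stratify` argument, and a fixed `random_state`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 80 messages, a quarter of them labeled spam (1).
texts = np.array(["message %d" % i for i in range(80)])
labels = np.array([1] * 20 + [0] * 60)

# stratify=labels keeps the spam share equal in the training and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, train_size=0.75, stratify=labels, random_state=36
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```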

### Text as features
How can we turn the large amount of text for each record into useful features?


#### Counts
One way is to create a matrix that uses each word as a feature and keeps track of how often that word appears in each record. You can do this in sklearn with a `CountVectorizer()`. Much like fitting a model, you first fit a `CountVectorizer()`, which figures out what words exist in your data.

In [62]:
count_vectorizer = CountVectorizer()
count_vectorizer.fit(X_train)

CountVectorizer(analyzer=u'word', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

Let's look at the vocabulary the `CountVectorizer()` learned.

In [63]:
list(count_vectorizer.vocabulary_.keys())[0:10]

[u'fawn',
 u'somegovernments',
 u'localstatedir',
 u'sonja',
 u'woods',
 u'spiders',
 u'hanging',
 u'woody',
 u'ksh_command',
 u'localized']

Now that we know what words are in the data, we can transform our blobs of text into a clean matrix. Simply `.transform()` the raw data using our fitted `CountVectorizer()`. You will do this for the training and test data. What do you think happens if there are new words in the test data that were not seen in the training data?

In [64]:
X_train_counts = count_vectorizer.transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)
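
A small sketch on made-up sentences shows what happens to words that appear only in the test data: the fitted vocabulary is fixed, so they are simply dropped. The documents and variable names here are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents; "lunch" appears only in the test document.
train_docs = ["free money now", "meeting at noon"]
test_docs = ["free lunch at noon"]

vec = CountVectorizer()
vec.fit(train_docs)

# The columns of the count matrix follow the sorted vocabulary.
vocab = sorted(vec.vocabulary_)
test_counts = vec.transform(test_docs).toarray()

print(vocab)
print(test_counts)
print("lunch" in vec.vocabulary_)
```

The test row has counts only for words learned during `fit`; there is no column for "lunch".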

We can take a look at our new `X_test_counts`.

In [71]:
X_test_counts

<2028x71157 sparse matrix of type '<type 'numpy.int64'>'
	with 225416 stored elements in Compressed Sparse Row format>

Sparse matrix? Where is our data?

If you look at the output above, you will see that it is stored as a *sparse* matrix (as opposed to the typical dense matrix) with 2,028 rows and 71,157 columns. That is 144,306,396 cells in total. However, only 225,416 of them (about 0.16%) hold stored values! Why is this?

To save space, sklearn uses a sparse matrix. This means that only values that are not zero are stored! This saves a ton of space! This also means that visualizing the data is a little trickier. Let's look at a very small chunk.

In [70]:
X_test_counts[0:20, 0:20].todense()

matrix([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [3, 0, 1, 0,
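
The sparsity figures can be checked from a matrix's `shape` and `nnz` attributes. The sketch below builds a tiny made-up matrix with SciPy and also redoes the arithmetic for the 2,028 x 71,157 matrix using the numbers printed in its summary.

```python
import numpy as np
from scipy import sparse

# Tiny made-up count matrix with three non-zero cells.
dense = np.zeros((4, 6), dtype=np.int64)
dense[0, 2] = 1
dense[2, 0] = 2
dense[3, 1] = 8
m = sparse.csr_matrix(dense)

# nnz is the number of stored values; density is stored cells / all cells.
n_cells = m.shape[0] * m.shape[1]
density = m.nnz / float(n_cells)
print("%d of %d cells stored (%.1f%%)" % (m.nnz, n_cells, 100 * density))

# The same arithmetic for the test matrix summarized above.
doc_cells = 2028 * 71157
doc_density = 225416 / float(doc_cells)
print("%d cells, %.2f%% stored" % (doc_cells, 100 * doc_density))
```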

The `CountVectorizer()` class has many options. You can restrict which words go into the vocabulary (for example with `min_df`, `max_df`, or `max_features`), add n-grams (`ngram_range`), or drop words from a stop word list (`stop_words`). Stemming is not built in, but you can supply your own `tokenizer`. Which options you should use generally depends on the type of data you are dealing with. We can discuss some of them now.
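
As a rough illustration of a few of these options, the sketch below combines the built-in English stop word list, unigrams plus bigrams, and a minimum document frequency of 2. The three documents are made up, and the particular settings are an assumption, not a recommendation for the spam data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents.
docs = [
    "the free money is here",
    "the meeting is at noon",
    "free money for the meeting",
]

vec = CountVectorizer(
    stop_words="english",  # drop common English words such as "the" and "is"
    ngram_range=(1, 2),    # keep single words and two-word phrases
    min_df=2,              # keep only terms that occur in at least 2 documents
)
vec.fit(docs)

features = sorted(vec.vocabulary_)
print(features)
```

Stop words are removed before the n-grams are built, and `min_df=2` then discards every term that occurs in only one document.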

#### Tf-idf (term frequency - inverse document frequency)
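One of the most common ways to create features from text is to use the tf-idf measure instead of raw counts. It weights how often a word appears in a document (term frequency) against how many documents contain that word (inverse document frequency), so words that show up in almost every document count for less. This is easy to do in sklearn.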

In [72]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(X_train)

TfidfVectorizer(analyzer=u'word', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [73]:
X_train_tfidf = tfidf_vectorizer.transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

It's a little messier than the counts, but let's take a look at a small slice of the data.

In [75]:
X_test_tfidf[0:10, 0:10].todense()

matrix([[ 0.        ,  0.        ,  0.05372166,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.       
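
To make the tf-idf numbers less mysterious, here is a sketch that recomputes them by hand on three made-up documents and compares the result with `TfidfVectorizer`. It assumes the default settings shown in the output above (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), for which sklearn documents the weight as idf(t) = ln((1 + n) / (1 + df(t))) + 1.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents.
docs = ["free money free prize", "meeting about money", "team meeting monday"]

# Raw term counts, one row per document.
counts = CountVectorizer().fit_transform(docs).toarray()

# Smoothed inverse document frequency for each term.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0

# Weight the counts by idf, then scale each row to unit L2 length.
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

from_sklearn = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, from_sklearn))
```

The row normalization is why the values are fractions rather than whole counts.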

### Modeling
So far we have been exposed to tree classifiers and logistic regression in class. We have also seen SVMs in the homework. Another popular type of classifier is the naive Bayes classifier.
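
For word counts, the multinomial variant of naive Bayes estimates, for each class, how likely each word is, using smoothed counts: (count of the word in the class + alpha) / (total words in the class + alpha * number of words in the vocabulary). The sketch below checks that formula against `MultinomialNB` on a tiny made-up count matrix; the data and the name `manual` are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up counts: rows are messages, columns are three vocabulary words.
X = np.array([[3, 0, 1], [2, 1, 0], [0, 2, 3], [0, 3, 1]])
y = np.array([1, 1, 0, 0])

alpha = 1.0
nb = MultinomialNB(alpha=alpha).fit(X, y)

# Smoothed log-probability of each word within each class, computed by hand.
manual = []
for c in nb.classes_:
    word_totals = X[y == c].sum(axis=0)
    manual.append(np.log((word_totals + alpha) / (word_totals.sum() + alpha * X.shape[1])))
manual = np.array(manual)

print(np.allclose(manual, nb.feature_log_prob_))
print(nb.predict(np.array([[2, 0, 0]])))
```

A message that repeats the first word is assigned to class 1, because that word is much more frequent in the class 1 rows.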

We can apply this model to our text data using our count features.

In [78]:
model = MultinomialNB()
model.fit(X_train_counts, Y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [83]:
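# Note: roc_auc_score is documented as roc_auc_score(y_true, y_score). This cell passes the
# hard 0/1 predictions first and the true labels second.
print("AUC on the count data = %.3f" % metrics.roc_auc_score(model.predict(X_test_counts), Y_test))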

AUC on the count data = 0.981
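
`metrics.roc_auc_score` is documented as taking the true labels first and a score second, and AUC is usually computed from a continuous score such as a predicted probability (for example `model.predict_proba(X)[:, 1]`) rather than from hard 0/1 predictions. The sketch below uses made-up labels and scores to show that both choices change the number.

```python
import numpy as np
from sklearn import metrics

# Made-up true labels and predicted spam probabilities.
y_true = np.array([0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.45, 0.8])

# Hard predictions obtained by thresholding the scores at 0.5.
hard = (scores >= 0.5).astype(int)

auc_scores = metrics.roc_auc_score(y_true, scores)  # true labels, then scores
auc_hard = metrics.roc_auc_score(y_true, hard)      # true labels, then hard predictions
auc_swapped = metrics.roc_auc_score(hard, y_true)   # arguments in the other order

print("scores: %.3f, hard: %.3f, swapped: %.3f" % (auc_scores, auc_hard, auc_swapped))
```

On this toy data the three values are about 0.833, 0.667, and 0.750, so the argument order and the kind of score both matter.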


What about using the tf-idf features?

In [80]:
model.fit(X_train_tfidf, Y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [84]:
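# Note: roc_auc_score is documented as roc_auc_score(y_true, y_score). This cell passes the
# hard 0/1 predictions first and the true labels second.
print("AUC on the tf-idf data = %.3f" % metrics.roc_auc_score(model.predict(X_test_tfidf), Y_test))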

AUC on the tf-idf data = 0.988
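
The vectorizer and the classifier can also be chained so that raw text goes in and predictions come out. This is a sketch of one way to do that with sklearn's `make_pipeline`, on four made-up messages; it is an alternative arrangement, not the approach used in the cells above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training messages: 1 = spam, 0 = ham.
texts = [
    "win free money now",
    "free prize claim now",
    "meeting agenda for monday",
    "lunch with the team monday",
]
labels = [1, 1, 0, 0]

# fit() fits the vectorizer on the text, then the classifier on the tf-idf matrix.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)

# Column 1 of predict_proba is the estimated probability of class 1 (spam).
spam_prob = clf.predict_proba(["free money prize", "team meeting monday"])[:, 1]
print(spam_prob)
```

Keeping both steps in one object means the test text is always transformed with the vocabulary learned from the training text.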
