# ML Commando Course, Cambridge 2018

## Session 2a Text Classification with Naïve Bayes

This notebook might not work properly unless the Jupyter server is started with a raised output rate limit:
`jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000`

_One of the most successful applications of Naïve Bayes has been within the field of Natural Language Processing (NLP). NLP is closely related to machine learning, since many of its problems can be formulated as classification tasks. NLP problems usually come with large amounts of tagged data in the form of text documents, and this data can be used as a training dataset for machine learning algorithms.
In this section, we will use Naïve Bayes for text classification: we will have a set of text documents with their corresponding categories, and we will train a Naïve Bayes algorithm to predict the categories of new unseen instances. This simple task has many practical applications; probably the best known and most widely used one is spam filtering. Here we will try to classify newsgroup messages using a dataset that can be retrieved from within scikit-learn. This dataset consists of around 19,000 newsgroup messages on 20 different topics, ranging from politics and religion to sports and science._

Start by importing numpy, scikit-learn, and pyplot, the Python libraries we will be using in this chapter. Show the versions we will be using (in case you have problems running the notebooks).

In [1]:
%pylab inline
import IPython
import sklearn as sk
import numpy as np
import matplotlib  # explicit import so matplotlib.__version__ works without %pylab
from matplotlib import pyplot as plt
print ('IPython version:', IPython.__version__)
print ('numpy version:', np.__version__)
print ('scikit-learn version:', sk.__version__)
print ('matplotlib version:', matplotlib.__version__)

Populating the interactive namespace from numpy and matplotlib
IPython version: 6.2.1
numpy version: 1.13.3
scikit-learn version: 0.19.1
matplotlib version: 2.1.0


Import the newsgroup dataset, and explore its structure and data (this could take some time, especially if sklearn has to download the 14MB dataset from the Internet).

In [2]:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')

Let's explore the dataset structure:

In [3]:
print(type(news))
news.keys()

<class 'sklearn.utils.Bunch'>


dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])

If we look at the properties of the dataset, we will find that we have the usual ones: DESCR, data, target, and target_names. The difference now is that data holds a list of text contents, instead of a numpy matrix:

In [4]:
print (type(news.data), type(news.target), type(news.target_names))
print (news.target_names)
print (len(news.data))
print (len(news.target))

<class 'list'> <class 'numpy.ndarray'> <class 'list'>
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
18846
18846


If you look at a random instance, you will see the content of a newsgroup message, and you can get its corresponding category:

In [5]:
rand_ix = np.random.randint(0, len(news.data))
print (news.data[rand_ix])
category = news.target[rand_ix]
category_name = news.target_names[category]
print (category, category_name)

From: gt6511a@prism.gatech.EDU (COCHRANE,JAMES SHAPLEIGH)
Subject: Re: Change of name ??
Organization: Georgia Institute of Technology
Lines: 35

In article <CMM.0.90.2.735315429.thomasp@holmenkollen.ifi.uio.no> thomasp@ifi.uio.no (Thomas Parsli) writes:
:
:
:	1. Make a new Newsgroup called talk.politics.guns.PARANOID or 
:	talk.politics.guns.THEY'R.HERE.TO.TAKE.ME.AWAY
:
:	2. Move all postings about waco and burn to (guess where)..
:
:	3. Stop posting #### on this newsgroup
;
:	We are all SO glad you're trying to save us from the evil 
:	goverment, but would you mail this #### in regular mail to
:	let's say 1000 people ????
:	
:
:                        Thomas Parsli
And everybody who talked about the evil arising in Europe was labeled 
reactionary in the late 1930's... after all, we could negotiate with Hitler and
trust him to keep his end of the bargain... at least that's what Stalin and
Chamberlin thought... I guess they forgot to teach you about your country being
overrun by the G

Let's build the training and testing datasets:

In [6]:
#rjm49 - this time we're doing a manual split
SPLIT_PERC = 0.75
split_size = int(len(news.data)*SPLIT_PERC)
X_train = news.data[:split_size]
X_test = news.data[split_size:]
y_train = news.target[:split_size]
y_test = news.target[split_size:]
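
As a quick sanity check on the arithmetic of this slice-based split, the sketch below reproduces it on a stand-in list with the same length as `news.data` (18846 items). The stand-in list and the names `fake_data`, `train_part` and `test_part` are assumptions for illustration; only the index arithmetic mirrors the cell above.

```python
# Stand-in for news.data: only the length matters for this check.
n_documents = 18846
fake_data = list(range(n_documents))

SPLIT_PERC = 0.75
# 18846 * 0.75 = 14134.5, and int() truncates the fractional part.
split_size = int(len(fake_data) * SPLIT_PERC)

train_part = fake_data[:split_size]
test_part = fake_data[split_size:]

print(split_size, len(train_part), len(test_part))
```

Note that a positional split like this is only representative if the documents are not ordered by category.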


This function will serve to perform and evaluate a cross validation:

In [7]:
from sklearn.model_selection import cross_val_score, KFold
from scipy.stats import sem

def evaluate_cross_validation(clf, X, y, K):
    # create a K-fold cross-validation iterator (shuffled, with a fixed seed)
    print("underway...")
    k_cv = KFold(n_splits=K, shuffle=True, random_state=0)
    # by default the score used is the one returned by score method of the estimator (accuracy)
    scores = cross_val_score(clf, X, y, cv=k_cv)
    print (scores)
    print (("Mean score: {0:.3f} (+/-{1:.3f})").format(
        np.mean(scores), sem(scores)))
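
The "(+/-...)" figure printed by this function is the standard error of the mean over the folds. As a rough illustration, the sketch below recomputes the summary line with the standard library only, using the five fold accuracies reported for the `CountVectorizer` pipeline later in this notebook. It assumes `scipy.stats.sem` is used with its default settings (sample standard deviation, n - 1 denominator).

```python
import math
import statistics

# Fold accuracies as printed for the CountVectorizer pipeline in this notebook.
scores = [0.85782493, 0.85725657, 0.84664367, 0.85911382, 0.8458477]

mean_score = statistics.mean(scores)
# Standard error of the mean: sample standard deviation (n - 1 denominator)
# divided by sqrt(n).
std_error = statistics.stdev(scores) / math.sqrt(len(scores))

summary = "Mean score: {0:.3f} (+/-{1:.3f})".format(mean_score, std_error)
print(summary)
```

A small standard error relative to the differences between pipelines suggests that those differences are not just fold-to-fold noise.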

Our machine learning algorithms can work only on numeric data...

Inside the `sklearn.feature_extraction.text` module, there are three classes that can transform text into numeric features: `CountVectorizer`, `HashingVectorizer`, and `TfidfVectorizer`.
The difference between them resides in the calculations they perform to derive the numeric features:
- `CountVectorizer` basically creates a dictionary of words from the corpus. Then, each instance is converted to a vector of numeric features where each element will be the frequency of each word in the document.
- `HashingVectorizer`, instead of constructing and maintaining the dictionary in memory, implements a hashing function that maps tokens into feature indexes, and then computes the count as in CountVectorizer. (It is commented out in the code below: with its default settings it can produce negative feature values, which `MultinomialNB` rejects.)
- `TfidfVectorizer` works like CountVectorizer, with a more advanced calculation called "Term Frequency - Inverse Document Frequency" (TF-IDF). This is a statistic for measuring the importance of a word in a document or corpus. Intuitively, it looks for words that are more frequent in the current document, compared with their frequency across all documents. You can see this as a way to normalize the results and avoid words that are too frequent, and thus not useful to characterise the instances.
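
To make the difference between raw counts and TF-IDF weights concrete, here is a minimal pure-Python sketch on a toy corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what the scikit-learn documentation describes for the `TfidfVectorizer` defaults. The corpus, the whitespace tokenization and the helper names are hypothetical.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each document is tokenized by whitespace.
docs = ["the cat sat", "the dog barked", "the cat purred"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    # Raw term counts scaled by idf, then L2-normalized.
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

count_vector = Counter(tokenized[0])  # what a count-based encoding keeps
tfidf = tfidf_vector(tokenized[0])    # same document, TF-IDF weighted

print(dict(count_vector))
print({term: round(w, 3) for term, w in tfidf.items()})
```

In the first document every word occurs once, so the count vector cannot tell them apart, while the TF-IDF vector gives "the" (present in every document) the lowest weight and "sat" (present only here) the highest.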

We will create a Naïve Bayes classifier that is composed of a feature vectorizer and the actual Bayes classifier. We will use the MultinomialNB class from the `sklearn.naive_bayes` module. 


In [8]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer
from sklearn.preprocessing import MaxAbsScaler

clf_1 = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])
# clf_2 = Pipeline([
#     ('vect', HashingVectorizer()),
#     ('clf', MultinomialNB()),
# ])
clf_3 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])

In [9]:
clfs = [clf_1, clf_3]
for clf in clfs:
    print("for clf {}".format(clf))
    evaluate_cross_validation(clf, news.data, news.target, 5)


for clf Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
underway...
[ 0.85782493  0.85725657  0.84664367  0.85911382  0.8458477 ]
Mean score: 0.853 (+/-0.003)
for clf Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
  ...rue,
        vocabulary=None)), ('clf', MultinomialNB(alpha=1

We will keep the TF-IDF vectorizer but use a different regular expression to perform tokenization. The default regular expression, `\b\w\w+\b`, matches runs of two or more alphanumeric characters or underscores. Also allowing the hyphen and the dot could improve the tokenization, so that strings such as Wi-Fi and site.com are kept as single tokens. The new regular expression could be: `[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+`. If you have questions about how to define regular expressions, please refer to the Python re module documentation. Let's try our new classifier:

In [10]:
clf_4 = Pipeline([
    ('vect', TfidfVectorizer(
                token_pattern=r"[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+",  # raw string avoids invalid-escape warnings
    )),
    ('clf', MultinomialNB()),
])
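
To see what the custom `token_pattern` changes, the sketch below applies both patterns to a sample string with `re.findall`. The sample string is hypothetical and already lowercased, since the vectorizer lowercases its input by default (`lowercase=True` in the printed parameters).

```python
import re

# Default pattern as shown in the vectorizer repr, and the custom one used in clf_4.
default_pattern = r"(?u)\b\w\w+\b"
custom_pattern = r"[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+"

# Hypothetical sample, already lowercased as the vectorizer would do.
sample = "visit site.com over wi-fi at 10.30"

default_tokens = re.findall(default_pattern, sample)
custom_tokens = re.findall(custom_pattern, sample)

print(default_tokens)
print(custom_tokens)
```

The custom pattern keeps `site.com` and `wi-fi` whole, but it also drops tokens shorter than three characters and tokens without a letter (here `at` and `10.30`). Because hyphens and dots are allowed anywhere, it will also accept long runs of hyphens followed by letters.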

In [11]:
evaluate_cross_validation(clf_4, news.data, news.target, 5)

[ 0.85702918  0.87476784  0.85752189  0.87052269  0.85805253]
Mean score: 0.864 (+/-0.004)


Another parameter that we can use is stop_words: this argument allows us to pass a list of words we do not want to take into account, such as very frequent words, or words we do not expect a priori to provide information about the particular topic. Let's try to improve performance by filtering out the stop words:

In [12]:
from nltk.corpus import stopwords
#nltk.download('stopwords')

def get_stop_words():
    return stopwords.words('english')

In [13]:
stop_words = get_stop_words()
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
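
As a rough picture of what stop-word filtering does, the sketch below tokenizes a sentence with the custom pattern introduced earlier and then drops the tokens found in a stop list. The short stop list and the sample sentence are hypothetical; the notebook itself passes NLTK's full English list to the vectorizer.

```python
import re

# Small hypothetical stop list; the notebook uses NLTK's full English list.
stop_words = {"the", "is", "of", "you're", "and"}
token_pattern = r"[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+"

text = "The price of the modem is low and you're welcome"

# Tokenize first, then filter: stop words are compared against whole tokens.
tokens = re.findall(token_pattern, text.lower())
kept = [tok for tok in tokens if tok not in stop_words]

print(tokens)
print(kept)
```

One detail worth noticing: the pattern does not allow apostrophes, so `you're` is tokenized as `you`, and a stop-list entry such as `"you're"` can never match a token produced by this pattern.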

In [14]:
clf_5 = Pipeline([
    ('vect', TfidfVectorizer(
                stop_words=stop_words,
                token_pattern=r'[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+',  # raw string avoids invalid-escape warnings
    )),
    ('clf', MultinomialNB()),
])

In [15]:
evaluate_cross_validation(clf_5, news.data, news.target, 5)

underway...
[ 0.87612732  0.8896259   0.87689042  0.88803396  0.87848236]
Mean score: 0.882 (+/-0.003)


Try to improve the classification by changing the maximum number of features or the smoothing (alpha) parameter of the MultinomialNB classifier:

In [26]:
clf_7 = Pipeline([
    ('vect', TfidfVectorizer(
                max_features=None,
                stop_words=stop_words,
                token_pattern=r"[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+",  # raw string avoids invalid-escape warnings
    )),
    ('clf', MultinomialNB(alpha=0.1)),
])

In [30]:
evaluate_cross_validation(clf_7, news.data, news.target, 5)

underway...
[ 0.91246684  0.91430088  0.90793314  0.91536217  0.91164765]
Mean score: 0.912 (+/-0.001)
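
The `alpha` parameter of `MultinomialNB` is an additive smoothing term: it is added to every per-class feature count before the counts are turned into probabilities, so that words never seen in a class do not get probability zero. The sketch below illustrates the idea on hypothetical word counts for a single class; the function and variable names are assumptions, not part of scikit-learn.

```python
def smoothed_probs(counts, vocabulary, alpha):
    # Additive smoothing: (count + alpha) / (total + alpha * |vocabulary|).
    total = sum(counts.get(word, 0) for word in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {word: (counts.get(word, 0) + alpha) / denom for word in vocabulary}

# Hypothetical word counts observed in one class; "orbit" was never seen there.
vocabulary = ["goal", "puck", "orbit"]
hockey_counts = {"goal": 3, "puck": 1}

probs_alpha_1 = smoothed_probs(hockey_counts, vocabulary, alpha=1.0)
probs_alpha_01 = smoothed_probs(hockey_counts, vocabulary, alpha=0.1)

print(round(probs_alpha_1["orbit"], 4), round(probs_alpha_01["orbit"], 4))
```

A smaller alpha moves less probability mass onto unseen words, so the model trusts the observed counts more. With a vocabulary as large as the one in this notebook, that may be part of why `alpha=0.1` scores higher here than the default of 1.0.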


At this point, we could continue doing trials by using different values of alpha or doing new modifications of the vectorizer.

If we decide that we have made enough improvements in our model, we are ready to evaluate its performance on the testing set.

In [22]:
from sklearn import metrics

def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    
    clf.fit(X_train, y_train)
    
    print ("Accuracy on training set:")
    print (clf.score(X_train, y_train))
    print ("Accuracy on testing set:")
    print (clf.score(X_test, y_test))
    
    y_pred = clf.predict(X_test)
    
    print ("Classification Report:")
    print (metrics.classification_report(y_test, y_pred))
    print ("Confusion Matrix:")
    print (metrics.confusion_matrix(y_test, y_pred))
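
The classification report printed by this function lists precision, recall and f1-score per class. As a reminder of how those figures relate to the predictions, here is a small pure-Python sketch on hypothetical labels; the helper `per_class_metrics` is an illustration, not a scikit-learn function.

```python
def per_class_metrics(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical true and predicted labels for six documents and three classes.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = per_class_metrics(y_true, y_pred, label=1)
print(accuracy, precision, recall, f1)
```

Precision asks how many of the documents assigned to a class really belong to it, while recall asks how many of the documents of that class were found; the f1-score is their harmonic mean.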

In [32]:
train_and_evaluate(clf_7, X_train, X_test, y_train, y_test)

Accuracy on training set:
0.988113768218
Accuracy on testing set:
0.90662139219
Classification Report:
             precision    recall  f1-score   support

          0       0.92      0.87      0.89       216
          1       0.86      0.83      0.84       246
          2       0.91      0.84      0.87       274
          3       0.78      0.88      0.83       235
          4       0.90      0.90      0.90       231
          5       0.87      0.92      0.89       225
          6       0.88      0.76      0.81       248
          7       0.93      0.92      0.92       275
          8       0.95      0.97      0.96       226
          9       0.97      0.97      0.97       250
         10       0.98      1.00      0.99       257
         11       0.94      0.98      0.96       261
         12       0.88      0.90      0.89       216
         13       0.95      0.95      0.95       257
         14       0.94      0.96      0.95       246
         15       0.81      0.96      0.88      

As we can see, the results are very good and, as we would expect, accuracy on the training set (about 0.99) is noticeably higher than on the testing set (about 0.91). On new unseen instances we may therefore expect an accuracy of around 0.91.

If we look inside the vectorizer, we can see which tokens have been used to create our dictionary:

In [33]:
clf_7.named_steps['vect'].get_feature_names()

['--------------------------------------------------------------------------------cal',
 '--------------------------------------------------------------------------------tom',
 '-------------------------------------------------------------------w--u--w---',
 '---------------------------------------------------------------tim',
 '---------------------------------------------ooo--',
 '---------------------------------signature---------------------------------',
 '--------------------------------ooo--u--ooo-----------------------',
 '-------------------------------ching',
 '-------------------------------cut',
 '-------------------------------it',
 '----------------------------ooo--',
 '----------------------------original',
 '----------------------------response',
 '------------------------e-mail',
 '------------------------the',
 '-----------------------s-o-u-l---i-s---t-h-e---r-i-d-e-r--------',
 '---------------------pan',
 '--------------------ooo-',
 '--------------------p----------

In [34]:
print(len(clf_7.named_steps['vect'].get_feature_names()))

169933
