Naïve Bayes is a simple but powerful classifier based on a probabilistic model
derived from the Bayes' theorem. Basically it determines the probability that an
instance belongs to a class based on each of the feature value probabilities. The naïve
term comes from the fact that it assumes that each feature is independent of the rest,
that is, the value of a feature has no relation to the value of another feature.

Despite being very simple, it has been used in many domains with very good
results. The independence assumption, although a naïve and strong simplification,
is one of the features that make the model useful in practical applications. Training
the model is reduced to the calculation of the involved conditional probabilities,
which can be estimated by counting frequencies of correlations between feature
values and class values.
One of the most successful applications of Naïve Bayes has been within the field
of Natural Language Processing (NLP). NLP is a field that has been much related
to machine learning, since many of its problems can be formulated as a classification
task. Usually, NLP problems have important amounts of tagged data in the form
of text documents. This data can be used as a training dataset for machine
learning algorithms.

dataset consists of around
19,000 newsgroup messages from 20 different topics ranging from politics and
religion to sports and science.

In [36]:
%pylab inline
import IPython
import sklearn as sk
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

print ('IPython version:', IPython.__version__)
print ('numpy version:', np.__version__)
print ('scikit-learn version:', sk.__version__)
print ('matplotlib version:', matplotlib.__version__)

Populating the interactive namespace from numpy and matplotlib
IPython version: 4.0.1
numpy version: 1.13.1
scikit-learn version: 0.18.2
matplotlib version: 1.5.0


In [37]:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')

In [38]:
print (type(news.data), type(news.target), type(news.target_names))

<class 'list'> <class 'numpy.ndarray'> <class 'list'>


In [39]:
print (news.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [40]:
print (len(news.data))
print (len(news.target))

18846
18846


In [41]:
print (news.data[0])
print (news.target[0], news.target_names[news.target[0]])

From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!


10 rec.sport.hockey


In [42]:
# The sklearn.feature_extraction.text module has some useful utilities 
# to build numeric feature vectors from text documents.

In [43]:
SPLIT_PERC = 0.75
split_size = int(len(news.data)*SPLIT_PERC)
X_train = news.data[:split_size]
X_test = news.data[split_size:]
y_train = news.target[:split_size]
y_test = news.target[split_size:]

If you look inside the sklearn.feature_extraction.text module, you
will find three different classes that can transform text into numeric features:
CountVectorizer, HashingVectorizer, and TfidfVectorizer. The difference
between them resides in the calculations they perform to obtain the numeric features.
CountVectorizer basically creates a dictionary of words from the text corpus. Then,
each instance is converted to a vector of numeric features where each element will be
the count of the number of times a particular word appears in the document.
HashingVectorizer, instead of constricting and maintaining the dictionary in
memory, implements a hashing function that maps tokens into feature indexes, and
then computes the count as in CountVectorizer.
TfidfVectorizer works like the CountVectorizer, but with a more advanced
calculation called Term Frequency Inverse Document Frequency (TF-IDF). This is a
statistic for measuring the importance of a word in a document or corpus. Intuitively,
it looks for words that are more frequent in the current document, compared with
their frequency in the whole corpus of documents. You can see this as a way to
normalize the results and avoid words that are too frequent, and thus not useful to
characterize the instances.

We will create three different classifiers by combining MultinomialNB with the three
different text vectorizers just mentioned, and compare which one performs better
using the default parameters

In [44]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer

clf_1 = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])
clf_2 = Pipeline([
    ('vect', HashingVectorizer(non_negative=True)),
    ('clf', MultinomialNB()),
])
clf_3 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])

We will define a function that takes a classifier and performs the K-fold crossvalidation
over the specified X and y values

In [45]:
from sklearn.cross_validation import cross_val_score, KFold
from scipy.stats import sem

def evaluate_cross_validation(clf, X, y, K):
    # create a k-fold croos validation iterator of k=5 folds
    cv = KFold(len(y), K, shuffle=True, random_state=0)
    # by default the score used is the one returned by score method of the estimator (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print (scores)
    print (("Mean score: {0:.3f} (+/-{1:.3f})").format(np.mean(scores), sem(scores)))

Then we will perform a five-fold cross-validation by using each one of the classifiers

In [46]:
clfs = [clf_1, clf_2, clf_3]
for clf in clfs:
    evaluate_cross_validation(clf, news.data, news.target, 5)

[ 0.85782493  0.85725657  0.84664367  0.85911382  0.8458477 ]
Mean score: 0.853 (+/-0.003)
[ 0.75543767  0.77659857  0.77049615  0.78508888  0.76200584]
Mean score: 0.770 (+/-0.005)
[ 0.84482759  0.85990979  0.84558238  0.85990979  0.84213319]
Mean score: 0.850 (+/-0.004)


Let's continue with TfidfVectorizer; we could try to improve the results by trying
to parse the text documents into tokens with a different regular expression

In [47]:
clf_4 = Pipeline([
    ('vect', TfidfVectorizer(
                token_pattern=r'\b[a-z0-9_\-\.]+[a-z][a-z0->>> 9_\-\.]+\b'
    )),
    ('clf', MultinomialNB()),
])

In [48]:
evaluate_cross_validation(clf_4, news.data, news.target, 5)

[ 0.80450928  0.81029451  0.8113558   0.82143805  0.80817193]
Mean score: 0.811 (+/-0.003)


In [49]:
# Adding stop words
def get_stop_words():
    result = set()
    for line in open(r'data\stopwords_en.txt', 'r').readlines():
        result.add(line.strip())
    return result

In [50]:
clf_5 = Pipeline([
    ('vect', TfidfVectorizer(
        stop_words= get_stop_words(),
        token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
    )),
    ('clf', MultinomialNB()),
])

In [51]:
evaluate_cross_validation(clf_5, news.data, news.target, 5)

[ 0.88116711  0.89519767  0.88325816  0.89227912  0.88113558]
Mean score: 0.887 (+/-0.003)


Let's keep this vectorizer and start looking at the MultinomialNB parameters. This
classifier has few parameters to tweak; the most important is the alpha parameter,
which is a smoothing parameter. Let's set it to a lower value; instead of setting alpha
to 1.0 (the default value), we will set it to 0.01

In Multinomial Naive Bayes, the alpha parameter is what is known as a hyperparameter; i.e. a parameter that controls the form of the model itself. In most cases, the best way to determine optimal values for hyperparameters is through a grid search over possible parameter values, using cross validation to evaluate the performance of the model on your data at each value. Read the above links for details on how to do this with scikit-learn.

alpha : float, optional (default=1.0) - Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).

In [52]:
clf_7 = Pipeline([
    ('vect', TfidfVectorizer(
        stop_words= get_stop_words(),
        token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",
    )),
    ('clf', MultinomialNB(alpha=0.01)),
])

In [53]:
evaluate_cross_validation(clf_7, news.data, news.target, 5)

[ 0.9204244   0.91960732  0.91828071  0.92677103  0.91854603]
Mean score: 0.921 (+/-0.002)


# Evaluating the performance
If we decide that we have made enough improvements in our model, we are ready to
evaluate its performance on the testing set.
We will define a helper function that will train the model in the entire training set
and evaluate the accuracy in the training and in the testing sets. It will also print
a classification report (precision and recall on every class) and the corresponding
confusion matrix:

In [55]:
from sklearn import metrics

def train_and_evaluate(clf, X_train, X_test, y_train, y_test):

    clf.fit(X_train, y_train)

    print ("Accuracy on training set:")
    print (clf.score(X_train, y_train))
    print ("Accuracy on testing set:")
    print (clf.score(X_test, y_test))
    y_pred = clf.predict(X_test)

    print ("Classification Report:")
    print (metrics.classification_report(y_test, y_pred))
    print ("Confusion Matrix:")
    print (metrics.confusion_matrix(y_test, y_pred))

In [56]:
train_and_evaluate(clf_7, X_train, X_test, y_train, y_test)

Accuracy on training set:
0.996957690675
Accuracy on testing set:
0.917869269949
Classification Report:
             precision    recall  f1-score   support

          0       0.95      0.88      0.91       216
          1       0.85      0.85      0.85       246
          2       0.91      0.84      0.87       274
          3       0.81      0.86      0.83       235
          4       0.88      0.90      0.89       231
          5       0.89      0.91      0.90       225
          6       0.88      0.80      0.84       248
          7       0.92      0.93      0.93       275
          8       0.96      0.98      0.97       226
          9       0.97      0.94      0.96       250
         10       0.97      1.00      0.98       257
         11       0.97      0.97      0.97       261
         12       0.90      0.91      0.91       216
         13       0.94      0.95      0.95       257
         14       0.94      0.97      0.95       246
         15       0.90      0.96      0.93     

In [57]:
# If we look inside the vectorizer, we can see which tokens have been used to create our dictionary
print (len(clf_7.named_steps['vect'].get_feature_names()))

145767


In [66]:
# Let's print the feature names.
clf_7.named_steps['vect'].get_feature_names()[110000:110050]

['repay',
 'repeal',
 'repealed',
 'repealing',
 'repeat',
 'repeat.ps',
 'repeatability',
 'repeatable',
 'repeate',
 'repeated',
 'repeated-key',
 'repeatedly',
 'repeater',
 'repeating',
 'repeats',
 'repeattime',
 'repectable',
 'repel',
 'repellant',
 'repelled',
 'repellent',
 'repeller',
 'repellers',
 'repelling',
 'repels',
 'repent',
 'repentance',
 'repentant',
 'repented',
 'repentence',
 'repentently',
 'repenting',
 'repercussions',
 'repertoire',
 'repetative',
 'repetition',
 'repetitions',
 'repetitive',
 'repetoire',
 'rephrase',
 'rephrased',
 'rephrases',
 'rephrasing',
 'repinski',
 'repitition',
 'replacable',
 'replace',
 'replace-string',
 'replaced',
 'replaceing']

You can see that some words are semantically very similar, for example, sand
and sands, sanctuaries and sanctuary. Perhaps if the plurals and the singulars are
counted to the same bucket, we would better represent the documents. This is a very
common task, which could be solved using stemming, a technique that relates two
words having the same lexical root.