## Naïve Bayes for text classification

Given a set of text documents with their corresponding categories, we can train a Naïve Bayes algorithm to predict the categories of new, unseen instances.
This simple task has many practical applications; probably the best known and most widely used is spam filtering.
In this section I will try to classify newsgroup messages using a dataset that can be retrieved from within scikit-learn.

This dataset consists of around 19,000 newsgroup messages from 20 different topics ranging from politics and religion to sports and science.


In [29]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np

In [2]:
news = fetch_20newsgroups(subset='all')

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


`news.data` holds a list of raw text strings rather than a NumPy matrix:

In [5]:
print(type(news.data), type(news.target), type(news.target_names))
print(news.target_names)
print(len(news.data))
print(len(news.target))

<class 'list'> <class 'numpy.ndarray'> <class 'list'>
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
18846
18846


In [11]:
print(news.data[0])

From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!




In [10]:
print(news.target[0], news.target_names[news.target[0]])

10 rec.sport.hockey


### Preprocessing the data
Machine learning algorithms work only on numeric data, so our next step is to convert the text-based dataset into a numeric one.
The `sklearn.feature_extraction.text` module provides three classes that transform text into numeric features: CountVectorizer, HashingVectorizer, and TfidfVectorizer.

#### CountVectorizer: 
Creates a dictionary of words from the text corpus. Then, each instance is converted to a vector of numeric features where each element will be the count of the number of times a particular word appears in the document.
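
As a minimal sketch of this behaviour, the two-sentence corpus below (made up for illustration, not taken from the newsgroup data) is turned into a vocabulary and a matrix of word counts. The names `docs`, `vect`, `vocab` and `counts` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, used only to illustrate the transformation
docs = ["the pens beat the devils", "the devils lost"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each word to its column index; list the words in column order
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(vocab)
print(counts.toarray())
```

Each column corresponds to one word of the vocabulary, and each cell holds the number of times that word occurs in the document.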

#### HashingVectorizer:
Instead of constructing and maintaining the dictionary in memory, it applies a hashing function that maps tokens to feature indexes, and then computes the counts as in CountVectorizer.
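
A small sketch of the hashing approach, again on a made-up corpus. The settings `n_features=2**10`, `alternate_sign=False` and `norm=None` are choices made here so that the output stays small and contains plain counts; they are not the defaults.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Tiny made-up corpus, used only to illustrate the transformation
docs = ["the pens beat the devils", "the devils lost"]

# No vocabulary is stored: each token is hashed straight to a column index.
# alternate_sign=False keeps every value non-negative, and norm=None leaves
# raw counts instead of length-normalized rows.
vect = HashingVectorizer(n_features=2**10, alternate_sign=False, norm=None)
X = vect.transform(docs)  # stateless, so no fit step is required

print(X.shape)
print(X.sum(axis=1))  # total token count per document
```

Because the vectorizer keeps no dictionary, the column index of a word cannot be mapped back to the word itself, and distinct words may occasionally collide in the same column.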

#### TfidfVectorizer:
Works like CountVectorizer, but with a more advanced calculation called Term Frequency-Inverse Document Frequency (TF-IDF). This statistic measures how important a word is to a document within a corpus. Intuitively, it favors words that are frequent in the current document relative to their frequency in the whole corpus of documents. You can see this as a way to normalize the results and down-weight words that are too frequent, and thus not useful for characterizing the instances.

In [13]:
# Use the first 75% of the messages for training and the rest for testing
SPLIT_PERC = 0.75
split_size = int(len(news.data) * SPLIT_PERC)

X_train = news.data[:split_size]
X_test = news.data[split_size:]

y_train = news.target[:split_size]
y_test = news.target[split_size:]

### Training a Naïve Bayes classifier

In [17]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline 
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer

In [24]:
clf_1 = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB())
])

clf_2 = Pipeline([
    ('vect', HashingVectorizer()),
    ('clf', MultinomialNB())
])

clf_3 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', MultinomialNB())
])

In [26]:
from sklearn.model_selection import cross_val_score, KFold
from scipy.stats import sem

In [27]:
def evaluate_cross_validation(clf, X, y, K):
    # create a k-fold cross-validation iterator
    cv = KFold(K, random_state=0, shuffle=True)
    # by default, the score used is the one returned by the estimator's score method (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)

    print(scores)
    print('Mean score: {0:.3f} (+/-{1:.3f})'.format(np.mean(scores), sem(scores)))

In [30]:
clfs = [clf_1, clf_2, clf_3]
for clf in clfs:
    evaluate_cross_validation(clf, news.data, news.target, 5)

[0.85782493 0.85725657 0.84664367 0.85911382 0.8458477 ]
Mean score: 0.853 (+/-0.003)




ValueError: Input X must be non-negative
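
The `ValueError` comes from the HashingVectorizer pipeline: by default `HashingVectorizer` alternates the sign of hashed features, while `MultinomialNB` only accepts non-negative input. One possible fix, sketched here under the assumption of a recent scikit-learn version, is to pass `alternate_sign=False`. The names `clf_2_fixed`, `toy_docs` and `toy_labels` are hypothetical, and the toy data only checks that fitting no longer fails; it says nothing about accuracy on the newsgroup data.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# alternate_sign=False keeps all hashed feature values non-negative,
# which is what MultinomialNB requires
clf_2_fixed = Pipeline([
    ('vect', HashingVectorizer(alternate_sign=False)),
    ('clf', MultinomialNB()),
])

# Made-up toy data, just to show that fit and predict run without error
toy_docs = ["the pens won the hockey game", "the rocket reached orbit"]
toy_labels = ["hockey", "space"]

clf_2_fixed.fit(toy_docs, toy_labels)
pred = clf_2_fixed.predict(["hockey game tonight"])
print(pred)
```

The same pipeline could then be passed to `evaluate_cross_validation` in place of `clf_2`.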

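The loop stops at the second pipeline with the `ValueError` above, so only the CountVectorizer scores are printed: `MultinomialNB` only accepts non-negative features, and `HashingVectorizer` produces signed values by default (constructing it with `alternate_sign=False` avoids this). When all three pipelines run, CountVectorizer and TfidfVectorizer perform similarly, and much better than HashingVectorizer.

Let's continue with TfidfVectorizer. We could try to improve the results by parsing the text documents into tokens with a different regular expression.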

In [34]:
clf_4 = Pipeline([
    ('vect', TfidfVectorizer(
        # raw string; the Python 2 `ur` prefix is a syntax error in Python 3
        token_pattern=r'\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b'
    )),
    ('clf', MultinomialNB()),
])
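
To get a feel for what this token pattern keeps, it can be applied directly with `re.findall` and compared with scikit-learn's default pattern, `(?u)\b\w\w+\b`. This is only a sketch on a made-up sentence. The pattern is written as a plain raw string (`r'...'`), since the `ur` prefix is not valid in Python 3, and the text is lowercased by hand because the vectorizers lowercase input before tokenizing.

```python
import re

# Default token pattern of scikit-learn's text vectorizers
default_pattern = r"(?u)\b\w\w+\b"
# Pattern used for clf_4, written as a Python 3 raw string
custom_pattern = r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b"

# Made-up sample text, lowercased as the vectorizers would do
text = "Pens rule 12 times at po4.andrew.cmu.edu".lower()

default_tokens = re.findall(default_pattern, text)
custom_tokens = re.findall(custom_pattern, text)

print(default_tokens)
print(custom_tokens)
```

On this sample, the custom pattern drops the purely numeric token and the two-letter word, and keeps the dotted host name as a single token instead of splitting it at the dots.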