# Demo: Naive Bayes for Text Classification
This example, adapted from the scikit-learn tutorial (http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html), uses Naive Bayes for text classification.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

In [3]:
from sklearn.datasets import fetch_20newsgroups
# Load only the training split, restricted to the four categories above.
twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories,
                                  shuffle=True, random_state=42)


In [4]:
len(twenty_train.data)

2257

In [5]:
print(twenty_train.data[0])

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.



## Bags of words
The most intuitive way to featurize text is the bags of words representation:

- assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).
- for each document $i$, count the number of occurrences of each word $w$ and store it in $X[i, j]$ as the value of feature $j$, where $j$ is the index of word $w$ in the dictionary.

The bags of words representation implies that $n_{features}$ is the number of distinct words in the corpus: this number is typically larger than 100,000.
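The two steps listed above can be sketched in plain Python on a toy corpus. This is a minimal illustration, not how `CountVectorizer` is implemented: the whitespace tokenizer, the `toy_docs` corpus, and the helper name `build_counts` are assumptions made for the example.

```python
# Toy corpus (hypothetical, for illustration only).
toy_docs = [
    "the cat sat on the mat",
    "the dog sat",
]

def build_counts(docs):
    """Return (vocabulary, X) where X[i][j] counts word j in document i."""
    # Step 1: assign a fixed integer id to every distinct word.
    # Sorting makes the ids deterministic.
    words = sorted({word for doc in docs for word in doc.split()})
    vocabulary = {word: j for j, word in enumerate(words)}

    # Step 2: count the occurrences of each word in each document.
    X = [[0] * len(vocabulary) for _ in docs]
    for i, doc in enumerate(docs):
        for word in doc.split():
            X[i][vocabulary[word]] += 1
    return vocabulary, X

vocabulary, X = build_counts(toy_docs)
print(vocabulary)  # word -> column index
print(X)           # one row of counts per document
```

Here $n_{features}$ is 6, the number of distinct words, and most entries of the second row are already zero even on this tiny corpus.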

If $n_{samples} = 10000$ and $n_{features} = 100000$, storing $X$ as a dense NumPy array of type float32 would require 10000 x 100000 x 4 bytes = 4 GB of RAM.

Fortunately, most values in $X$ will be zeros, since a given document uses fewer than a few thousand distinct words. For this reason we say that bags of words are typically high-dimensional sparse datasets. We can save a lot of memory by storing only the non-zero parts of the feature vectors.

scipy.sparse matrices are data structures that do exactly this, and scikit-learn has built-in support for these structures.
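To make the saving concrete, here is a rough sketch of the compressed sparse row (CSR) layout, the format of the `csr_matrix` returned by `CountVectorizer` in this notebook. The helper `to_csr` is hypothetical and written in plain Python; the memory estimate assumes about 1,000 non-zero entries per document, float32 values, and int32 indices, none of which come from the original text.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix to CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for j, value in enumerate(row):
            if value != 0:
                data.append(value)   # the non-zero value itself
                indices.append(j)    # its column index
        indptr.append(len(data))     # offset in `data` where the next row starts
    return data, indices, indptr

dense = [
    [0, 2, 0, 0],
    [1, 0, 0, 3],
    [0, 0, 0, 0],
]
data, indices, indptr = to_csr(dense)
print(data, indices, indptr)

# Memory estimate for the 10000 x 100000 example in the text.
# ASSUMPTION: roughly 1,000 non-zero entries per document.
n_samples, n_features, nnz_per_doc = 10_000, 100_000, 1_000
dense_bytes = n_samples * n_features * 4           # float32 for every cell
nnz = n_samples * nnz_per_doc
csr_bytes = nnz * 4 + nnz * 4 + (n_samples + 1) * 4  # data + indices + indptr
print(dense_bytes, csr_bytes)
```

Under these assumptions the sparse layout needs about 80 MB instead of 4 GB. Row `i` of the matrix is recovered from `data[indptr[i]:indptr[i + 1]]` and the matching slice of `indices`.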

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35788)

In [7]:
type(X_train_counts)

scipy.sparse.csr.csr_matrix

### Note: under the hood, this built a vocabulary mapping words to integer indices.

In [8]:
count_vect.vocabulary_.get(u'algorithm')

4690

## TF-IDF

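Scaling the counts with TF-IDF gives better performance on this task; see the scikit-learn documentation for `TfidfTransformer` for an explanation. Note that the cell below passes `use_idf=False`, so it applies only term-frequency normalization and skips the inverse-document-frequency weighting. Either way, the transform rescales the counts but does not change the size of the feature space.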
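The difference between the two scalings can be sketched by hand. The code below is an illustration in plain Python, not the library's implementation: it assumes L2 row normalization and the smoothed formula $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$, which is how scikit-learn documents its default settings. The `counts` matrix and the helper names are made up for the example.

```python
import math

# Hypothetical count matrix: 2 documents, 3 vocabulary words.
counts = [
    [3, 0, 1],
    [0, 2, 1],
]

def l2_normalize(row):
    """Scale a row to unit Euclidean length (all-zero rows are left as-is)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else list(row)

def smooth_idf(counts):
    """idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1 for each column t."""
    n_docs = len(counts)
    idf = []
    for j in range(len(counts[0])):
        df = sum(1 for row in counts if row[j] > 0)  # documents containing word j
        idf.append(math.log((1 + n_docs) / (1 + df)) + 1)
    return idf

# Term frequency only (the use_idf=False case): normalize each row.
tf = [l2_normalize(row) for row in counts]

# Full TF-IDF: weight each column by its idf, then normalize each row.
idf = smooth_idf(counts)
tfidf = [l2_normalize([v * w for v, w in zip(row, idf)]) for row in counts]

print(tf)
print(idf)
print(tfidf)
```

The third word occurs in every document, so it receives the smallest IDF weight and is down-weighted relative to the rarer words. In both cases the output keeps the same number of columns as the input.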

In [9]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(2257, 35788)

# Build our classifier

We build a Naive Bayes classifier for text classification and observe the effect of Laplace smoothing, which `MultinomialNB` controls through its `alpha` parameter.
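As a reminder of what the smoothing parameter does, the sketch below estimates per-class word likelihoods as $(N_{w,c} + \alpha) / (N_c + \alpha \, |V|)$, the usual additive-smoothing formula for multinomial Naive Bayes. The counts and the helper name `smoothed_likelihoods` are hypothetical; this is a hand-rolled illustration, not the `MultinomialNB` code.

```python
def smoothed_likelihoods(word_counts, alpha):
    """Estimate P(word | class) from per-class word counts with additive smoothing."""
    total = sum(word_counts.values())   # N_c: all word occurrences in the class
    vocab_size = len(word_counts)       # |V|: number of distinct words
    denom = total + alpha * vocab_size
    return {w: (c + alpha) / denom for w, c in word_counts.items()}

# Hypothetical counts for one class; 'virus' never occurs with this class.
counts_in_class = {"god": 8, "image": 2, "virus": 0}

p_unsmoothed = smoothed_likelihoods(counts_in_class, alpha=0)
p_laplace = smoothed_likelihoods(counts_in_class, alpha=1)
print(p_unsmoothed)
print(p_laplace)
```

With `alpha=0` the word `virus` gets probability zero for this class, which would veto the class for any document containing it. With `alpha=1` it gets 1/13 instead. Smoothing applies to words in the vocabulary that never co-occur with a class; words absent from the vocabulary altogether are dropped before the classifier sees them.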

In [10]:
from sklearn.naive_bayes import MultinomialNB
def train(alpha):
    """Fit a multinomial Naive Bayes model with smoothing parameter alpha."""
    text_clf = MultinomialNB(alpha=alpha)
    text_clf.fit(X_train_tf, twenty_train.target)
    return text_clf

## Train
Let's train our Naive Bayes classifier with `alpha=1`.

In [11]:
clf = train(alpha=1)

Below we observe a new word, **Coronavirus**, in the test documents; the vocabulary lookup returns nothing for it.

In [37]:
docs = ['God is the eternal being who created', 
        'Deep learning is used to generate images with the aid of computers', 
        'Coronavirus is infectious disease caused by a new virus', 
]

In [38]:
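# CountVectorizer lowercases tokens by default, so look up the lowercased word.
# The lookup returns None when the word was not seen during training.
count_vect.vocabulary_.get('coronavirus')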
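A word missing from the fitted vocabulary is silently skipped when new documents are vectorized, so it contributes no feature at all. The sketch below mimics that behavior in plain Python. It is modeled on `CountVectorizer`'s documented defaults (lowercasing, tokens of two or more word characters), but the helpers `fit_vocabulary` and `transform` and the two training sentences are hypothetical.

```python
import re

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def fit_vocabulary(docs):
    """Map each lowercased token seen in training to a column index."""
    tokens = sorted({t for doc in docs for t in TOKEN_PATTERN.findall(doc.lower())})
    return {t: j for j, t in enumerate(tokens)}

def transform(doc, vocabulary):
    """Count only tokens present in the fitted vocabulary; skip the rest."""
    row = [0] * len(vocabulary)
    for t in TOKEN_PATTERN.findall(doc.lower()):
        j = vocabulary.get(t)
        if j is not None:  # out-of-vocabulary tokens are ignored
            row[j] += 1
    return row

# Hypothetical training documents.
train_docs = ["A new virus causes disease", "Computers render images"]
vocabulary = fit_vocabulary(train_docs)

row = transform("Coronavirus is infectious disease caused by a new virus", vocabulary)
print(vocabulary)
print(row)
```

Only `disease`, `new`, and `virus` are counted; `coronavirus` leaves no trace in the feature vector, so the prediction rests entirely on the known words around it. Because the vocabulary keys are lowercase, a capitalized lookup such as `vocabulary.get("Virus")` misses even though `virus` is present.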

In [39]:
def predict(clf, docs):
    X_new_counts = count_vect.transform(docs)
    X_new_tfidf = tf_transformer.transform(X_new_counts)
    
    predicted = clf.predict(X_new_tfidf)
    for doc, category in zip(docs, predicted):
        print('%r => %s' % (doc, twenty_train.target_names[category]))

In [40]:
predict(clf, docs)

'God is the eternal being who created' => soc.religion.christian
'Deep learning is used to generate images with the aid of computers' => comp.graphics
'Coronavirus is infectious disease caused by a new virus' => sci.med
