## Overview

The dataset used in this notebook is described in Section 7.2 of the sklearn User Guide: http://scikit-learn.org/stable/user_guide.html

## Load the data: a subset of the 20 Newsgroups dataset

In [1]:
from sklearn.datasets import fetch_20newsgroups

categories = ['rec.sport.baseball', 'talk.politics.guns','comp.graphics', 'sci.med']
twentyTrain = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

In [2]:
# The following commands inspect the target names (categories) and some of the data.
list(twentyTrain.target_names) #prints all the categories

['comp.graphics', 'rec.sport.baseball', 'sci.med', 'talk.politics.guns']

In [3]:
type(twentyTrain)  # the type 

sklearn.utils.Bunch

In [4]:
len(twentyTrain.data)  # the size


2321

In [5]:
len(twentyTrain.filenames)

2321

In [6]:
print(twentyTrain.data[0])   # print one instance of the data - .data is the data
print("\n".join(twentyTrain.data[0].split("\n")[:3]))  #print out the first 3 lines only
print("Target class is {}".format(twentyTrain.target_names[twentyTrain.target[0]]))   #print the class of that instance - .target is the class

From: marcbg@feenix.metronet.com (Marc Grant)
Subject: Adult Chicken Pox
Organization: Tx Metronet Communications Services, Dallas Tx
Distribution: usa
Lines: 13

I am 35 and am recovering from a case of Chicken Pox which I contracted
from my 5 year old daughter.  I have quite a few of these little puppies
all over my bod.  At what point am I no longer infectious?  My physician's
office says when they are all scabbed over.  Is this true?

Is there any medications which can promote healing of the pox?  Speed up
healing?  Please e-mail replies, and thanks in advance.

-- 
|Marc Grant          | Internet: marcbg@feenix.metronet.com |
|POB 850472          | Amateur Radio Station N5MEI          |
|Richardson, TX 75085| Voice/Fax: 214-231-3998              |
    - .... .- - ...  .- .-.. .-..    ..-. --- .-.. -.- ...

From: marcbg@feenix.metronet.com (Marc Grant)
Subject: Adult Chicken Pox
Organization: Tx Metronet Communications Services, Dallas Tx
Target class is sci.med


### Remove the metadata so the classifier doesn't overfit to the headers, footers and quotes

In [7]:
categories = ['rec.sport.baseball', 'talk.politics.guns','comp.graphics', 'sci.med']
twentyTrain = fetch_20newsgroups(subset='train', 
                                 categories=categories, 
                                 remove=('headers', 'footers', 'quotes'), 
                                 shuffle=True, 
                                 random_state=42)    # random seed 
print(twentyTrain.data[0])   # print one instance of the data - .data is the data

I am 35 and am recovering from a case of Chicken Pox which I contracted
from my 5 year old daughter.  I have quite a few of these little puppies
all over my bod.  At what point am I no longer infectious?  My physician's
office says when they are all scabbed over.  Is this true?

Is there any medications which can promote healing of the pox?  Speed up
healing?  Please e-mail replies, and thanks in advance.



In [8]:
print(twentyTrain.target[:10])   #.target are the classes

[2 1 1 1 1 1 0 3 3 1]


In [9]:
for t in twentyTrain.target[:10]:
    print(twentyTrain.target_names[t])  # .target_names are the class names

sci.med
rec.sport.baseball
rec.sport.baseball
rec.sport.baseball
rec.sport.baseball
rec.sport.baseball
comp.graphics
talk.politics.guns
talk.politics.guns
rec.sport.baseball
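
The integer labels in `.target` index into `.target_names`, so the lookup loop above can also be written with NumPy indexing. A minimal sketch, hard-coding the ten labels and four class names printed above instead of downloading the dataset (the variable names `first_ten`, `names` and `class_counts` are illustrative, not part of the notebook):

```python
import numpy as np

# Class names and the first ten labels, copied from the outputs above
target_names = ['comp.graphics', 'rec.sport.baseball', 'sci.med', 'talk.politics.guns']
first_ten = np.array([2, 1, 1, 1, 1, 1, 0, 3, 3, 1])

# Fancy indexing maps every integer label to its class name in one step
names = np.array(target_names)[first_ten]

# bincount tallies how many of the ten instances fall in each class
class_counts = np.bincount(first_ten, minlength=len(target_names))

print(list(names[:3]))
print(dict(zip(target_names, class_counts.tolist())))
```

The same two lines applied to the full `twentyTrain.target` array would give the class distribution of the whole training subset.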


##  Tokenising

The tokenising can be changed by changing the parameters of the vectorizer:  
- the `analyzer` and `ngram_range` params allow tokenising by char n-grams.  
- `max_df` and `min_df` allow terms to be filtered by document frequency.

Look up the documentation to see what can be changed. 
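
As a rough illustration of the parameters listed above, the sketch below fits two vectorizers on a three-document toy corpus (the corpus and variable names are made up for this example): one tokenises by character bigrams, the other drops terms that are too rare or too common.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this example
docs = ["the cat sat", "the dog sat", "the bird flew"]

# Character bigrams instead of word tokens
char_vect = CountVectorizer(analyzer="char", ngram_range=(2, 2))
char_vect.fit(docs)
char_vocab = sorted(char_vect.vocabulary_)

# min_df=2 drops terms found in fewer than 2 documents;
# max_df=0.9 drops terms found in more than 90% of documents
df_vect = CountVectorizer(min_df=2, max_df=0.9)
df_vect.fit(docs)
df_vocab = sorted(df_vect.vocabulary_)

print(char_vocab[:5])
print(df_vocab)
```

Here "the" occurs in every document and is removed by `max_df`, while "cat", "dog", "bird" and "flew" occur only once and are removed by `min_df`, leaving "sat" as the only surviving term.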

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()       
count_vect.get_params()      #shows the default parameters

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.int64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': None,
 'min_df': 1,
 'ngram_range': (1, 1),
 'preprocessor': None,
 'stop_words': None,
 'strip_accents': None,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'vocabulary': None}

## Create the Term-Document Matrix

In [11]:
tdm = count_vect.fit_transform(twentyTrain.data)   # tdm is a sparse 2-d matrix: one row per document, one column per term
tdm.shape     


(2321, 31164)

In [12]:
count_vect.vocabulary_.get('and')  # vocabulary_ is a dict mapping each term to its column index in tdm (not its frequency)

4204
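
The value returned by `vocabulary_` is a column index, so getting the actual frequency of a word takes one more step: summing that column of the count matrix. A small sketch on a toy corpus (the documents and variable names are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this example
docs = ["salt and pepper", "bread and butter and jam"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)      # sparse matrix: documents x terms

col = vect.vocabulary_["and"]          # column index of the term 'and'
total_and = counts[:, col].sum()       # number of times 'and' occurs in the corpus

print(col, total_and)
```

Applied to the notebook's objects, `tdm[:, count_vect.vocabulary_['and']].sum()` would give the corpus-wide count of 'and'.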

### Transform the TDM to a normalised tf or tf-idf matrix 

Check the `TfidfTransformer` parameters: `use_idf` switches between tf and tf-idf weighting, and `norm` selects l1 or l2 normalisation.

In [13]:
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()   
transformer.get_params()     #show default parameters 

{'norm': 'l2', 'smooth_idf': True, 'sublinear_tf': False, 'use_idf': True}

In [14]:
tdm_tfidf = transformer.fit_transform(tdm)   #transform the TDM
tdm_tfidf.shape



(2321, 31164)
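
To make the default transformation concrete, the sketch below recomputes it by hand on a small count matrix and compares the result with `TfidfTransformer`. It assumes the defaults shown above (`smooth_idf=True`, `use_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the weight is idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by row-wise l2 normalisation. The count matrix is made up for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 6 documents x 3 terms (invented for this example)
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                 # documents containing each term

# Smoothed idf, as used when smooth_idf=True
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

raw = counts * idf                                  # tf * idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)   # l2-normalise each row

from_sklearn = TfidfTransformer().fit_transform(counts).toarray()
matches = np.allclose(manual, from_sklearn)
print(matches)
```

With `use_idf=False` the idf factor is skipped, so each row is just the l2-normalised term counts.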

## Build a NB classifier

In [15]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(tdm_tfidf, twentyTrain.target)    #build the classifier (data, classes)

docs_test = ['I am sick', 'No more gun control']      #set up 2 test instances

# transform the test data in the same way as the training data, through CountVectorizer and TfidfTransformer
test_counts = count_vect.transform(docs_test)       # transform only, don't fit: the vocabulary was built from the training data
test_tfidf = transformer.transform(test_counts)

predicted = clf.predict(test_tfidf)   #predict  

for doc, category in zip(docs_test, predicted):
    print('%r => %s' % (doc, twentyTrain.target_names[category]))

'I am sick' => sci.med
'No more gun control' => talk.politics.guns
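
The reason for calling `transform` rather than `fit_transform` on the test documents can be seen on a toy example: the fitted vocabulary fixes the columns, and words that never occurred in training are silently ignored. The two-document training set below is invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy training documents, invented for this example
train = ["gun control laws", "flu symptoms and treatment"]
vect = CountVectorizer().fit(train)

# Known words land in the columns learned during fit
seen = vect.transform(["gun laws"])

# Words outside the training vocabulary produce an all-zero row
unseen = vect.transform(["baseball scores"])

print(seen.shape, seen.nnz, unseen.nnz)
```

Both results have the same number of columns as the training matrix, which is what lets the classifier score them.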


## Use Pipeline to do it all in one

In [17]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf', MultinomialNB()),])

text_clf.fit(twentyTrain.data, twentyTrain.target)  

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf', MultinomialNB())])
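
One convenience of a `Pipeline` is that the parameters of any step can be changed through the `<step name>__<parameter>` convention. The sketch below uses the same three steps on a tiny, made-up training set, switching to tf-only weighting and adding word bigrams; the documents and labels are illustrative only.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for this example
train_docs = ["the pitcher threw a strike",
              "home run in the ninth inning",
              "the doctor prescribed medication",
              "symptoms of the infection"]
train_labels = ["baseball", "baseball", "medicine", "medicine"]

pipe = Pipeline([("vect", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("clf", MultinomialNB())])

# Address step parameters as <step name>__<parameter>
pipe.set_params(tfidf__use_idf=False, vect__ngram_range=(1, 2))
pipe.fit(train_docs, train_labels)

pred = pipe.predict(["the doctor treated the infection"])
print(pred[0])
```

The same naming scheme is what grid search tools use to tune pipeline steps.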

### Load the 20 NG test data

In [18]:
# note: no remove=... argument here, so the test posts keep their headers, footers and quotes
twenty_test = fetch_20newsgroups(subset='test', 
                                 categories=categories, 
                                 shuffle=True, 
                                 random_state=42)  
docs_test = twenty_test.data

import numpy as np
predicted = text_clf.predict(docs_test)   # predict
np.mean(predicted == twenty_test.target)  #report accuracy

0.9450194049159121

### Using metrics package

In [19]:
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))    #print classification results

                    precision    recall  f1-score   support

     comp.graphics       0.92      0.94      0.93       389
rec.sport.baseball       0.99      0.94      0.96       397
           sci.med       0.95      0.91      0.93       396
talk.politics.guns       0.92      0.99      0.95       364

          accuracy                           0.95      1546
         macro avg       0.95      0.95      0.95      1546
      weighted avg       0.95      0.95      0.95      1546



In [20]:
metrics.confusion_matrix(twenty_test.target, predicted)  #print confusion matrix

array([[367,   2,  12,   8],
       [ 10, 373,   4,  10],
       [ 20,   1, 362,  13],
       [  2,   1,   2, 359]], dtype=int64)

In [21]:
metrics.f1_score(twenty_test.target, predicted, average='macro')   #print f-score



0.9451351758138555
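
The figures reported above can all be derived from the confusion matrix, where rows are the true classes and columns the predicted classes. A short sketch that recomputes them with NumPy, using the matrix printed in the previous cell (variable names are illustrative):

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([[367,   2,  12,   8],
               [ 10, 373,   4,  10],
               [ 20,   1, 362,  13],
               [  2,   1,   2, 359]])

true_pos = np.diag(cm)                      # correctly classified per class
precision = true_pos / cm.sum(axis=0)       # divide by column totals (predicted counts)
recall = true_pos / cm.sum(axis=1)          # divide by row totals (support)
f1 = 2 * precision * recall / (precision + recall)

accuracy = true_pos.sum() / cm.sum()
macro_f1 = f1.mean()                        # unweighted mean over the four classes

print(np.round(precision, 2), np.round(recall, 2))
print(accuracy, macro_f1)
```

The row totals reproduce the `support` column of the classification report, and the macro average weights each class equally regardless of its support.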

## Using TfidfVectorizer
`TfidfVectorizer` combines `CountVectorizer` and `TfidfTransformer` in a single step.

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer
categories = ['alt.atheism', 'talk.religion.misc',
              'comp.graphics', 'sci.med']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories,
                                      shuffle=True,
                                      random_state=42)
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)

newsgroups_test = fetch_20newsgroups(subset='test',
                                     categories=categories,
                                     shuffle=True,
                                     random_state=42)
vectors_test = vectorizer.transform(newsgroups_test.data)

classifier = MultinomialNB(alpha=.01)
classifier.fit(vectors, newsgroups_train.target)
predicted = classifier.predict(vectors_test)
print(metrics.classification_report(newsgroups_test.target, predicted,
    target_names=newsgroups_train.target_names))

                    precision    recall  f1-score   support

       alt.atheism       0.82      0.86      0.84       319
     comp.graphics       0.95      0.96      0.95       389
           sci.med       0.95      0.94      0.94       396
talk.religion.misc       0.82      0.77      0.79       251

          accuracy                           0.89      1355
         macro avg       0.88      0.88      0.88      1355
      weighted avg       0.89      0.89      0.89      1355
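
The equivalence between `TfidfVectorizer` and the two-step `CountVectorizer` plus `TfidfTransformer` route can be checked directly. A minimal sketch on a toy corpus, assuming default parameters for all three classes (the documents are invented for the example):

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

# Toy corpus, invented for this example
docs = ["graphics cards render images",
        "the doctor studies medical images",
        "religion and atheism are debated"]

# Two steps: raw counts, then tf-idf weighting
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(docs))

# One step: TfidfVectorizer does both
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The equivalence holds only when matching parameters are passed to both routes, for example the same `ngram_range` or `use_idf` setting.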

