# Naive Bayes Model for Newsgroups Data

For an explanation of the Naive Bayes model, see [our course notes](https://jennselby.github.io/MachineLearningCourseNotes/#naive-bayes).

This notebook uses code from http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html.

## Instructions
0. If you haven't already, follow [the setup instructions here](https://jennselby.github.io/MachineLearningCourseNotes/#setting-up-python3) to get all necessary software installed.
0. Read through the code in the following sections:
  * [Newgroups Data](#Newgroups-Data)
  * [Model Training](#Model-Training)
  * [Prediction](#Prediction)
0. Complete at least one of the following exercises:
  * [Exercise Option #1 - Standard Difficulty](#Exercise-Option-#1---Standard-Difficulty)
  * [Exercise Option #2 - Advanced Difficulty](#Exercise-Option-#2---Advanced-Difficulty)

In [4]:
from sklearn.datasets import fetch_20newsgroups # the 20 newsgroups set is included in scikit-learn
from sklearn.naive_bayes import MultinomialNB # we need this for our Naive Bayes model

# These next two are about processing the data. We'll look into this more later in the semester.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

## Newgroups Data

Back in the day, [Usenet](https://en.wikipedia.org/wiki/Usenet_newsgroup) was a popular discussion system where people could discuss topics in relevant newsgroups (think Slack channel or subreddit). At some point, someone pulled together messages sent to 20 different newsgroups, to use as [a dataset for doing text processing](http://qwone.com/~jason/20Newsgroups/).

We are going to pull out messages from just a few different groups to try out a Naive Bayes model.

Examine the newsgroups dictionary, to make sure you understand the dataset.

**Note**: If you get an error about SSL certificates, you can fix this with the following:
1. In Finder, click on Applications in the list on the left panel
1. Double click to go into the Python folder (it will be called something like Python 3.7)
1. Double click on the Install Certificates command in that folder


In [28]:
# which newsgroups we want to download
newsgroup_names = ['comp.graphics', 'rec.sport.hockey', 'sci.electronics', 'sci.space']

# get the newsgroup data (organized much like the iris data)
newsgroups = fetch_20newsgroups(categories=newsgroup_names, shuffle=True, random_state=265)

In [31]:
print(newsgroups.keys())
print(len(newsgroups))
print(newsgroups['data'][:10])

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])
5
["From: jca2@cec1.wustl.edu (Joseph Charles Achkar)\nSubject: Blues steal game 1 from Hawks\nKeywords: Blues, Hull, Shanahan, Joseph, Blackhawks, Belfour\nNntp-Posting-Host: cec1\nOrganization: Washington University, St. Louis MO\nLines: 125\n\n\n  The Blues scored two power-play goals in 17 seconds in the third period\nand the beat the Chicago Blackhawks 4-3 Sunday afternoon at Chicago Stadium.\nBrendan Shanahan tied the game 3-3 and Brett Hull scored the game winner 17\nseconds later. Jeff Brown and Denny Felsner scored the other Blues goals.\nBrian Noonan had the hat trick for the Hawks, who also had some very good\ngoaltending from Ed Belfour. Blues goalie Curtis Joseph was solid down the\nstretch to preserve the Blues lead.\n\nThe Hawks came out strong in the first period, outshooting the Blues 6-1 and\ntaking a 1-0 lead on Noonan's first goal. Right after an interference penalty\non Rick Zombo had expired, Keit

In [38]:
print(newsgroups['filenames'])
print(newsgroups['target_names'])
print(newsgroups['target'][:20])
print(newsgroups['DESCR'])

['/Users/yoarafa/scikit_learn_data/20news_home/20news-bydate-train/rec.sport.hockey/53622'
 '/Users/yoarafa/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38737'
 '/Users/yoarafa/scikit_learn_data/20news_home/20news-bydate-train/rec.sport.hockey/53704'
 ...
 '/Users/yoarafa/scikit_learn_data/20news_home/20news-bydate-train/sci.electronics/53811'
 '/Users/yoarafa/scikit_learn_data/20news_home/20news-bydate-train/rec.sport.hockey/53726'
 '/Users/yoarafa/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38497']
['comp.graphics', 'rec.sport.hockey', 'sci.electronics', 'sci.space']
[1 0 1 3 3 3 3 0 3 1 0 3 3 1 0 3 3 3 2 0]
.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon 
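
To make the link between `target` and `target_names` concrete, here is a small sketch that reuses the first 20 labels printed above (copied by hand, so nothing needs to be downloaded). The names `sample_targets`, `first_names`, and `per_category` are made up for this illustration.

```python
import numpy as np

# Values copied from the printed output above
target_names = ['comp.graphics', 'rec.sport.hockey', 'sci.electronics', 'sci.space']
sample_targets = np.array([1, 0, 1, 3, 3, 3, 3, 0, 3, 1, 0, 3, 3, 1, 0, 3, 3, 3, 2, 0])

# Each target value is an index into target_names
first_names = [target_names[t] for t in sample_targets[:3]]
print(first_names)

# np.bincount counts how often each label occurs in the sample
label_counts = np.bincount(sample_targets, minlength=len(target_names)).tolist()
per_category = dict(zip(target_names, label_counts))
print(per_category)
```

The first label is 1, which maps to `rec.sport.hockey`, and that matches the first message shown above (a post about the Blues and the Blackhawks).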

In [6]:
from platform import python_version
print(python_version())

3.8.5


This next part does some processing of the data, because the scikit-learn Naive Bayes module is expecting numerical data rather than text data. We will talk more about what this code is doing later in the semester. For now, you can ignore it.

In [7]:
# Convert the text into numbers that represent each word (bag of words method)
word_vector = CountVectorizer()
word_vector_counts = word_vector.fit_transform(newsgroups.data)

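# Account for the length of the documents and for how common each word is:
#   with default settings, TfidfTransformer turns raw counts into tf-idf weights
#   (term frequency scaled by inverse document frequency) and normalizes each document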
term_freq_transformer = TfidfTransformer()
term_freq = term_freq_transformer.fit_transform(word_vector_counts)
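
As a rough illustration of what these two steps produce, the sketch below runs them on a made-up three-document corpus (`toy_docs` and the other `toy_` names are hypothetical and not part of the newsgroups data). It assumes the default settings of both classes, under which `TfidfTransformer` down-weights words that occur in many documents and scales each row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, only for illustration
toy_docs = ['the puck hit the net', 'the rocket reached orbit', 'the puck']

# Step 1: bag of words, one column per distinct word
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each word to its column index
the_col = toy_vectorizer.vocabulary_['the']
puck_col = toy_vectorizer.vocabulary_['puck']
print('count of "the" in doc 0:', toy_counts.toarray()[0, the_col])

# Step 2: convert counts to tf-idf weights
toy_tfidf = TfidfTransformer().fit_transform(toy_counts).toarray()

# Each document vector has length 1 after the default L2 normalization
row_norms = np.linalg.norm(toy_tfidf, axis=1)
print('row norms:', row_norms)

# In the last document both words occur once, but "the" appears in every
# document, so it receives a lower weight than "puck"
print('the :', toy_tfidf[2, the_col])
print('puck:', toy_tfidf[2, puck_col])
```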

## Model Training

Now we fit the Naive Bayes model to the subset of the 20 newsgroups data that we've pulled out.

In [8]:
# Train the Naive Bayes model
model = MultinomialNB().fit(term_freq, newsgroups.target)
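
For a sense of what `fit` stores, this sketch trains `MultinomialNB` on a tiny made-up count matrix (the data and the names `toy_X`, `toy_y`, and `toy_model` are assumptions for illustration only). With default settings, the fitted model keeps one log prior per class in `class_log_prior_` and one smoothed log probability per class and word in `feature_log_prob_`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical data: rows are documents, columns are counts of ['puck', 'orbit']
toy_X = np.array([[3, 0],
                  [2, 0],
                  [0, 4]])
toy_y = np.array([0, 0, 1])  # 0 = hockey, 1 = space

toy_model = MultinomialNB().fit(toy_X, toy_y)

# Class priors: the fraction of training documents in each class
priors = np.exp(toy_model.class_log_prior_)
print(priors)

# P(word | class), with add-one smoothing (alpha=1.0 is the default),
# so a word never seen in a class still gets a small nonzero probability
word_probs = np.exp(toy_model.feature_log_prob_)
print(word_probs)
```

In this toy case "orbit" never occurs in a hockey document, yet its probability under the hockey class is 1/7 rather than 0 because of the smoothing.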

## Prediction

Let's see how the model does on some (very short) documents that we made up to fit into the specific categories our model is trained on.

In [46]:
# Predict some new fake documents
fake_docs = [
    'That GPU has amazing performance with a lot of shaders',
    'The player had a wicked slap shot',
    'I spent all day yesterday soldering banks of capacitors',
    'Today I have to solder a bank of capacitors',
    'NASA has rovers on Mars']
fake_counts = word_vector.transform(fake_docs)
fake_term_freq = term_freq_transformer.transform(fake_counts)

predicted = model.predict(fake_term_freq)
print('Predictions:')
for doc, group in zip(fake_docs, predicted):
    print('\t{0} => {1}'.format(doc, newsgroups.target_names[group]))

probabilities = model.predict_proba(fake_term_freq)
print('Probabilities:')
print(''.join(['{:17}'.format(name) for name in newsgroups.target_names]))
for probs in probabilities:
    print(''.join(['{:<17.8}'.format(prob) for prob in probs]))

Predictions:
	That GPU has amazing performance with a lot of shaders => comp.graphics
	The player had a wicked slap shot => rec.sport.hockey
	I spent all day yesterday soldering banks of capacitors => sci.space
	Today I have to solder a bank of capacitors => sci.electronics
	NASA has rovers on Mars => sci.space
Probabilities:
comp.graphics    rec.sport.hockey sci.electronics  sci.space        
0.29466149       0.22895149       0.24926344       0.22712357       
0.12948055       0.51155698       0.18248712       0.17647535       
0.18604814       0.24117771       0.27540452       0.29736963       
0.21285086       0.21081302       0.3486507        0.22768541       
0.079185633      0.066225915      0.10236622       0.75222223       


## Exercise Option #1 - Standard Difficulty

Modify the fake documents and add some new documents of your own. 

What words in your documents have particularly large effects on the model probabilities? Note that we're not looking for documents that consist of a single word, but for words that, when included or excluded from a document, tend to change the model's output.



In [66]:
# Predict some new fake documents
new_fake_docs = [
    'That has amazing performance with a lot of shaders',
    'The player had a wicked slap shot NASA',
    'I spent all day yesterday soldering banks of capacitors electronics',
    'Today I have to solder a bank of capacitors on Mars',
    'I love using ray casting for the ice rink',
    'ray casting',
    'ray casting mars',
    'ray casting capacitors',
    'GPU',
    'GPU capacitors',
    'GPU mars',
    'GPU puck',
    'I spent all day yesterday banks of capacitors',
    'I spent all day yesterday soldering of capacitors',
    'I all day yesterday soldering banks of capacitors',
    'I spent all yesterday soldering banks of capacitors',
    'I spent all day yesterday soldering banks of capacitor',
    'I spent all day soldering banks of capacitors',
    'I spent all day yesterday'
]
new_fake_counts = word_vector.transform(new_fake_docs)
new_fake_term_freq = term_freq_transformer.transform(new_fake_counts)

predicted = model.predict(new_fake_term_freq)
print('Predictions:')
for doc, group in zip(new_fake_docs, predicted):
    print('\t{0} => {1}'.format(doc, newsgroups.target_names[group]))

probabilities = model.predict_proba(new_fake_term_freq)
print('Probabilities:')
print(''.join(['{:17}'.format(name) for name in newsgroups.target_names]))
for probs in probabilities:
    print(''.join(['{:<17.8}'.format(prob) for prob in probs]))

Predictions:
	That has amazing performance with a lot of shaders => comp.graphics
	The player had a wicked slap shot NASA => rec.sport.hockey
	I spent all day yesterday soldering banks of capacitors electronics => sci.electronics
	Today I have to solder a bank of capacitors on Mars => sci.space
	I love using ray casting for the ice rink => rec.sport.hockey
	ray casting => comp.graphics
	ray casting mars => sci.space
	ray casting capacitors => sci.electronics
	GPU => rec.sport.hockey
	GPU capacitors => sci.electronics
	GPU mars => sci.space
	GPU puck => rec.sport.hockey
	I spent all day yesterday banks of capacitors => sci.space
	I spent all day yesterday soldering of capacitors => sci.space
	I all day yesterday soldering banks of capacitors => sci.electronics
	I spent all yesterday soldering banks of capacitors => sci.electronics
	I spent all day yesterday soldering banks of capacitor => sci.space
	I spent all day soldering banks of capacitors => sci.electronics
	I spent all day yester

It is difficult to tell which words have the strongest effect without inspecting the model itself, but the experiments above suggest a few things. "Ray casting" only weakly pushes the model towards graphics, and "GPU" turned out not to be in the training data at all (a correction to my first impression that it had a weak effect). "Mars" and "puck" pull strongly towards the categories they come from. "Capacitors" is the interesting case: it pushes many sentences strongly towards electronics, yet in a sentence that also contains "spent", "day", and "yesterday" it is not strong enough to move the output away from space.
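
One possible way to look at the model itself is to compare a word's entries in `feature_log_prob_` across classes. The sketch below does this on a made-up four-document corpus; `toy_docs`, `toy_names`, and the helper `word_log_probs` are hypothetical, and the same idea should carry over to the notebook's `model` and `word_vector`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus with two classes
toy_docs = [
    'the puck hit the goalie',
    'the goalie stopped the puck',
    'the rover landed on mars',
    'the mars orbit was stable',
]
toy_labels = np.array([0, 0, 1, 1])
toy_names = ['hockey', 'space']

toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_docs), toy_labels)

def word_log_probs(word):
    """Return log P(word | class) for every class, or None if the word is unknown."""
    col = toy_vectorizer.vocabulary_.get(word)
    if col is None:
        return None
    return toy_model.feature_log_prob_[:, col]

for word in ['the', 'puck', 'mars', 'gpu']:
    log_probs = word_log_probs(word)
    if log_probs is None:
        print('{0}: not in the training vocabulary'.format(word))
        continue
    # A large gap between classes means the word shifts the prediction strongly
    spread = log_probs.max() - log_probs.min()
    favored = toy_names[int(np.argmax(log_probs))]
    print('{0}: favors {1}, spread {2:.2f}'.format(word, favored, spread))
```

A word that is common in every class (such as "the" here) has a small spread, while a word concentrated in one class has a large one.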

## Exercise Option #2 - Advanced Difficulty

Write some code to count up how often the words you found in the exercise above appear in each category in the training dataset. Does this match up with your intuition?

In [53]:
# Total count of every word within each category: select the rows (documents)
# that carry a given target label and sum them column by column.
# The dense array is built once instead of once per category.
dense_counts = word_vector_counts.toarray()
targets = newsgroups['target']
graphics_counts = dense_counts[targets == 0].sum(axis=0)
hockey_counts = dense_counts[targets == 1].sum(axis=0)
electronics_counts = dense_counts[targets == 2].sum(axis=0)
space_counts = dense_counts[targets == 3].sum(axis=0)

In [54]:
graphics_counts

array([34, 17,  0, ...,  1,  0,  0])

In [56]:
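# Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
word_vector.get_feature_names()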

['00',
 '000',
 '0000',
 '00000',
 '000000',
 '000005102000',
 '000021',
 '000062david42',
 '000100255pixel',
 '000256',
 '00041032',
 '0004136',
 '0004246',
 '0004422',
 '00044513',
 '0004847546',
 '0005',
 '0007',
 '00090711',
 '000usd',
 '001',
 '0010580b',
 '0012',
 '001200201pixel',
 '001323',
 '001428',
 '001555',
 '001718',
 '001757',
 '0018',
 '00196',
 '002',
 '0020',
 '0022',
 '0028',
 '0029',
 '00309',
 '003221',
 '0033',
 '0034',
 '003719',
 '0038',
 '003800',
 '0039',
 '004418',
 '0049',
 '005',
 '005150',
 '005512',
 '0059',
 '006',
 '0065',
 '007',
 '0078',
 '008',
 '0086',
 '0094',
 '00969fba',
 '0098',
 '00index',
 '00pm',
 '00r',
 '01',
 '0100',
 '01002',
 '010305',
 '010326',
 '010821',
 '011',
 '011605',
 '011720',
 '013',
 '013653rap115',
 '013846',
 '013939',
 '014',
 '014237',
 '014305',
 '014506',
 '01463',
 '0150',
 '015225',
 '015415',
 '015936',
 '016',
 '01609',
 '0164',
 '01752',
 '01760',
 '01775',
 '01776',
 '0179',
 '01821',
 '01826',
 '0184',
 '01852',


In [60]:
fake_doc_vector = CountVectorizer()
fake_doc_counts = fake_doc_vector.fit_transform(fake_docs)
fake_doc_vector.get_feature_names()

['all',
 'amazing',
 'bank',
 'banks',
 'capacitors',
 'day',
 'gpu',
 'had',
 'has',
 'have',
 'lot',
 'mars',
 'nasa',
 'of',
 'on',
 'performance',
 'player',
 'rovers',
 'shaders',
 'shot',
 'slap',
 'solder',
 'soldering',
 'spent',
 'that',
 'the',
 'to',
 'today',
 'wicked',
 'with',
 'yesterday']

In [None]:
feature_names = fake_doc_vector.get_feature_names()
print('comp.graphics    rec.sport.hockey sci.electronics  sci.space')
for word in feature_names:
    print(word)

In [69]:
print(''.join(['{:17}'.format(name) for name in newsgroups.target_names]))
for word in fake_doc_vector.get_feature_names():
    # vocabulary_ maps each word seen in training to its column index;
    # words absent from the training data ('gpu' and 'rovers' here) have no column
    i = word_vector.vocabulary_.get(word)
    if i is None:
        continue
    target_counts = [graphics_counts[i], hockey_counts[i], electronics_counts[i], space_counts[i]]
    print(''.join(['{:<17}'.format(count) for count in target_counts])+word)

comp.graphics    rec.sport.hockey sci.electronics  sci.space        
296              438              240              394              all
3                6                4                4                amazing
12               1                1                7                bank
0                0                1                4                banks
0                0                22               0                capacitors
28               48               30               75               day
69               291              98               201              had
273              423              235              360              has
664              844              721              760              have
76               63               72               68               lot
2                0                2                141              mars
75               14               68               737              nasa
2470             2604             2213            

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://stackoverflow.com/questions/27488446/how-do-i-get-word-frequency-in-a-corpus-using-scikit-learn-- 
- https://stackoverflow.com/questions/5954603/transposing-a-numpy-array

It now makes sense why "gpu" doesn't affect the probabilities much: it doesn't appear anywhere in the training data! It also makes a lot more sense why "day" and "spent" lean so heavily towards space, as they are over-represented in that group. The counts for the other terms, such as "mars" and "nasa", match my intuition well.
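
The effect of an unseen word can be checked directly. In the sketch below (a made-up corpus; all `toy_` names are hypothetical), a document made only of unknown words is transformed into an all-zero vector, so the model has nothing to go on except its class priors. That would be consistent with "GPU" alone being assigned to whichever category has the most training documents.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: three hockey documents, one space document
toy_docs = ['puck goalie', 'puck net', 'slap shot', 'mars orbit']
toy_labels = np.array([0, 0, 0, 1])

toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_docs), toy_labels)

# 'gpu' was never seen during fitting, so it is silently dropped
unknown = toy_vectorizer.transform(['gpu'])
print('nonzero entries:', unknown.nnz)

# With no word evidence, the predicted probabilities equal the class priors
probs = toy_model.predict_proba(unknown)[0]
priors = np.exp(toy_model.class_log_prior_)
print('predicted:', probs)
print('priors:   ', priors)
```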