In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

emails = fetch_20newsgroups()

We’ve imported a dataset of emails from scikit-learn’s datasets. All of these emails are tagged based on their content.

In [2]:
emails.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

We want to see how well a Naive Bayes classifier can tell a baseball email from a religion email. We can select the categories of articles we want from fetch_20newsgroups by passing the categories parameter.

In [18]:
emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'soc.religion.christian'])

In [19]:
print(emails.data[5])

From: luigi@sgi.com (Randy Palermo)
Subject: Re: Grateful Dead?
Organization: Silicon Graphics, Inc., Mountain View, CA
Lines: 16
Nntp-Posting-Host: bullpen.csd.sgi.com

In article <93095.172834IO21087@MAINE.MAINE.EDU> IO21087@MAINE.MAINE.EDU writes:
>Being a baseball fan and a fan of the above mentioned band I was
>wondering if anyone could clue me in on whether the Dead (or members
>of) sang the national anthem at todays Giant opener?
>
>I would imagine that it is a bit too early for anyone to know, but
>an answer would be greatly appreciated.
>
It is my understanding that the Dead will sing the NA at the Giants
home opener on Mon. 4/12. The Giants are opening today in St. Louis.

luigi
--
Randy Palermo   luigi@csd.sgi.com    Fax: (415)961-6502
Silicon Graphics Computer Systems, 2011 N. Shoreline Blvd Mt. View, CA 94039
"Play an accordion, go to jail. That's the LAW"



In [20]:
print(emails.target[5])

0
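
The printed 0 is an integer label, not a category name. As a minimal sketch of how the two relate, assuming labels index into the selected category names in alphabetical order (as in the target_names listing earlier), the toy code below rebuilds that mapping without downloading anything. The helper label_to_name is hypothetical and exists only for illustration.

```python
# Toy reconstruction of how integer labels relate to category names.
# Assumption: labels index into the selected categories in sorted order,
# whatever order the categories were passed in.
categories = ['soc.religion.christian', 'rec.sport.baseball']
target_names = sorted(categories)

def label_to_name(label):
    """Return the category name for an integer label (hypothetical helper)."""
    return target_names[label]

print(label_to_name(0))  # rec.sport.baseball
print(label_to_name(1))  # soc.religion.christian
```

With the real dataset object, the same lookup is emails.target_names[emails.target[5]], which here gives the baseball category for the email printed above.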


Create the train and test datasets

In [6]:
train_emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'soc.religion.christian'], subset='train', shuffle=True, random_state=42)

In [7]:
# print(train_emails)

In [8]:
test_emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'soc.religion.christian'], subset='test', shuffle=True, random_state=42)

We want to transform these emails into lists of word counts. The CountVectorizer class makes this easy for us.

In [9]:
counter = CountVectorizer()

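We need to tell counter which words can appear in our emails. counter has a .fit() method that takes a list of documents and builds its vocabulary from them. Here we pass both the training and the test emails, so every word in either set gets its own column.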

In [10]:
counter.fit(train_emails.data+test_emails.data)
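
To make the fitting step concrete, here is a rough pure-Python sketch of what building a vocabulary amounts to. It is not CountVectorizer's implementation: build_vocabulary is a hypothetical helper, and the tokenization (lowercase, words of two or more characters, alphabetical column order) is an assumption modeled on CountVectorizer's default settings.

```python
import re

def build_vocabulary(documents):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = set()
    for text in documents:
        # Lowercase, then keep runs of two or more word characters.
        tokens.update(re.findall(r"\b\w\w+\b", text.lower()))
    return {token: index for index, token in enumerate(sorted(tokens))}

# Toy stand-ins for train_emails.data and test_emails.data.
train_docs = ["The Giants won the home opener", "Home run in the ninth"]
test_docs = ["The pitcher threw a curveball"]

vocab_train_only = build_vocabulary(train_docs)
vocab_all = build_vocabulary(train_docs + test_docs)

print(vocab_train_only)
print("curveball" in vocab_train_only)  # False: never seen during fitting
print("curveball" in vocab_all)         # True: test docs were included
```

Fitting on the training and test documents together, as this notebook does, guarantees a column for every test word. Fitting on the training documents alone is the more cautious habit, since words seen only at test time are then simply ignored when counting.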

In [11]:
train_counts = counter.transform(train_emails.data)

In [12]:
print(train_counts)

  (0, 1145)	2
  (0, 1196)	1
  (0, 2727)	3
  (0, 2999)	1
  (0, 3091)	1
  (0, 3189)	1
  (0, 3317)	1
  (0, 3510)	4
  (0, 3597)	1
  (0, 3944)	2
  (0, 4075)	1
  (0, 4150)	1
  (0, 4241)	1
  (0, 4574)	1
  (0, 4696)	1
  (0, 4821)	5
  (0, 4869)	2
  (0, 4977)	1
  (0, 5079)	1
  (0, 5247)	4
  (0, 5531)	1
  (0, 5973)	2
  (0, 6053)	1
  (0, 6291)	2
  (0, 6308)	1
  :	:
  (1195, 23756)	1
  (1195, 23757)	3
  (1195, 23964)	1
  (1195, 23974)	1
  (1195, 23978)	17
  (1195, 24028)	2
  (1195, 24160)	1
  (1195, 24200)	1
  (1195, 24233)	1
  (1195, 24259)	6
  (1195, 24726)	1
  (1195, 25443)	1
  (1195, 25722)	1
  (1195, 25761)	1
  (1195, 25798)	1
  (1195, 25846)	2
  (1195, 25946)	1
  (1195, 25957)	1
  (1195, 25959)	1
  (1195, 26012)	1
  (1195, 26142)	1
  (1195, 26329)	1
  (1195, 26446)	1
  (1195, 26481)	7
  (1195, 26491)	1
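
Each line of the output above reads as (document index, vocabulary index) followed by how many times that word occurs in that document. Only nonzero counts are listed. The sketch below reproduces that layout on toy data. The vocabulary and the count_triples helper are hypothetical, and the tokenization is the same assumption as before (lowercase, words of two or more characters).

```python
import re
from collections import Counter

# Hypothetical toy vocabulary: token -> column index.
vocabulary = {"giants": 0, "home": 1, "opener": 2, "the": 3}

def count_triples(documents, vocabulary):
    """Return (row, column, count) for every nonzero entry."""
    triples = []
    for row, text in enumerate(documents):
        tokens = re.findall(r"\b\w\w+\b", text.lower())
        # Words outside the vocabulary have no column and are dropped.
        counts = Counter(t for t in tokens if t in vocabulary)
        for token in sorted(counts, key=vocabulary.get):
            triples.append((row, vocabulary[token], counts[token]))
    return triples

docs = ["The Giants home opener", "The home team at home"]
for row, col, count in count_triples(docs, vocabulary):
    print(f"  ({row}, {col})\t{count}")
```

In the second toy document, "home" appears twice, so its entry carries a count of 2, while "team" and "at" are missing from the vocabulary and produce no entry at all.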


In [13]:
test_counts = counter.transform(test_emails.data)

In [14]:
nvc = MultinomialNB()

In [15]:
nvc.fit(train_counts,train_emails.target)

In [16]:
print(nvc.score(test_counts,test_emails.target))

0.9924528301886792


On the held-out test emails, the classifier reaches an accuracy of about 99% (0.9925).
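
For intuition about what fitting and scoring involve, here is a small from-scratch sketch of multinomial Naive Bayes on toy documents. It is not scikit-learn's code: train_nb and predict_nb are hypothetical helpers, the documents are invented, and the add-one smoothing (alpha=1.0) is chosen to mirror MultinomialNB's default.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = sorted({w for d in docs for w in d.split()})
    log_prior, log_like = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        # Smoothing keeps words unseen in a class from getting probability 0.
        total = sum(counts.values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict_nb(model, doc):
    """Pick the class with the highest log posterior; unknown words are skipped."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in doc.split() if w in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

train_docs = ["pitcher home run inning", "pitcher strike inning",
              "church faith prayer", "faith scripture prayer"]
train_labels = [0, 0, 1, 1]
model = train_nb(train_docs, train_labels)

test_docs = ["home run pitcher", "prayer and faith"]
test_labels = [0, 1]
predictions = [predict_nb(model, d) for d in test_docs]
# Accuracy is the fraction of test documents labeled correctly.
accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

The accuracy computed in the last step is the same kind of number that nvc.score reports: the share of test documents whose predicted label matches the true one.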

In [17]:
emails_new = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'])
email_counts_new = counter.transform(emails_new.data)
print(nvc.score(email_counts_new,emails_new.target))

0.5054302422723476


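The classifier was about 99% accurate at separating baseball emails from religious emails, but only about 51% accurate on baseball versus hockey emails, which is close to what guessing would give.

A drop makes sense: emails about two sports probably share far more words than sports and religion emails do. Note, however, that this classifier was never trained on hockey emails. Label 1 meant soc.religion.christian during training but means rec.sport.hockey in the new data, so the score reflects that mismatch as much as the difficulty of telling the two sports apart. A fair comparison would fit a new classifier on baseball and hockey training emails.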
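
One way to see why the score lands near 50% is to look at what the labels mean in each dataset. The sketch below is purely illustrative: the evaluation set is hypothetical and perfectly balanced, and the predictions are an assumed extreme case (every sports email looks like baseball to a model trained on baseball versus religion), not the actual outputs of nvc.

```python
def accuracy(predictions, targets):
    """Fraction of predictions that match the targets."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)

# What each integer label means in the two datasets (index = label).
train_names = ['rec.sport.baseball', 'soc.religion.christian']
new_names = ['rec.sport.baseball', 'rec.sport.hockey']

# Hypothetical balanced evaluation set: 50 baseball (0), then 50 hockey (1).
new_targets = [0] * 50 + [1] * 50

# Assumed extreme case: the model outputs label 0 (baseball) for every email,
# because nothing in the new data looks like religion.
predictions = [0] * 100

print(accuracy(predictions, new_targets))  # 0.5
print(train_names[1], "vs", new_names[1])  # label 1 means different things
```

Under these assumptions every baseball email is counted as correct and every hockey email as wrong, which gives exactly 50%. Measuring how well Naive Bayes separates the two sports would need a classifier fitted on baseball and hockey training emails.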