# Email Similarity
In this project, we will use scikit-learn’s Naive Bayes implementation on several different datasets. By reporting the accuracy of the classifier on each one, we can find out which datasets are harder to tell apart. For example, how difficult do you think it is to distinguish between emails about hockey and emails about baseball? How hard is it to tell the difference between emails about hockey and emails about computer hardware? In this project, we’ll measure exactly how difficult those two tasks are.

In [20]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [3]:
data = fetch_20newsgroups()
print(data.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']



We’re interested in seeing how effective our Naive Bayes classifier is at telling the difference between a baseball email and a hockey email.

In [8]:
emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'])
print(emails.data[5])
print(emails.target[5])

From: mmb@lamar.ColoState.EDU (Michael Burger)
Subject: More TV Info
Distribution: na
Nntp-Posting-Host: lamar.acns.colostate.edu
Organization: Colorado State University, Fort Collins, CO  80523
Lines: 36

United States Coverage:
Sunday April 18
  N.J./N.Y.I. at Pittsburgh - 1:00 EDT to Eastern Time Zone
  ABC - Gary Thorne and Bill Clement

  St. Louis at Chicago - 12:00 CDT and 11:00 MDT - to Central/Mountain Zones
  ABC - Mike Emerick and Jim Schoenfeld

  Los Angeles at Calgary - 12:00 PDT and 11:00 ADT - to Pacific/Alaskan Zones
  ABC - Al Michaels and John Davidson

Tuesday, April 20
  N.J./N.Y.I. at Pittsburgh - 7:30 EDT Nationwide
  ESPN - Gary Thorne and Bill Clement

Thursday, April 22 and Saturday April 24
  To Be Announced - 7:30 EDT Nationwide
  ESPN - To Be Announced


Canadian Coverage:

Sunday, April 18
  Buffalo at Boston - 7:30 EDT Nationwide
  TSN - ???

Tuesday, April 20
  N.J.D./N.Y. at Pittsburgh - 7:30 EDT Nationwide
  TSN - ???

Wednesday, April 21
  St. Louis a

The label 1 corresponds to rec.sport.hockey, so this email is about hockey. The label 0 corresponds to rec.sport.baseball.

# Making the Training and Test Sets

In [13]:
train_email = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'],
                                 subset='train', shuffle=True, random_state=100)
test_email = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'],
                                subset='test', shuffle=True, random_state=100)


We want to transform these emails into lists of word counts. The CountVectorizer class makes this easy for us.

In [18]:
vectorizer = CountVectorizer()
vectorizer.fit(train_email.data + test_email.data)
train_count = vectorizer.transform(train_email.data)
test_count = vectorizer.transform(test_email.data)
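
To make the idea of "lists of word counts" concrete, here is a minimal pure-Python sketch of a bag-of-words transform. It is not scikit-learn's implementation: the helper names (tokenize, fit_vocabulary, transform) and the toy emails are made up for illustration, and the sketch only approximates CountVectorizer's default behavior (lowercasing, keeping tokens of two or more word characters, sorting the vocabulary).

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters,
    # which approximates CountVectorizer's default tokenization.
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_vocabulary(documents):
    # Map each distinct token to a column index, in sorted order.
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    # Build one row per document and one column per vocabulary word.
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            if tok in vocabulary:  # words outside the vocabulary are ignored
                row[vocabulary[tok]] = n
        rows.append(row)
    return rows


# Hypothetical toy emails, not taken from the newsgroups dataset.
toy_emails = ["The puck hit the post", "The pitcher hit a home run"]
vocab = fit_vocabulary(toy_emails)
matrix = transform(toy_emails, vocab)
print(vocab)
print(matrix)
```

Each row of the matrix describes one email, and each column counts how often one vocabulary word appears in it. The single-letter word "a" is dropped, which mirrors the default token pattern.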


# Making a Naive Bayes Classifier

In [22]:
model = MultinomialNB()
model.fit(train_count, train_email.target)
print(model.score(test_count, test_email.target))

0.9723618090452262



Our classifier does a pretty good job distinguishing between baseball emails and hockey emails, with about 97.2% accuracy on the test set.
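
For readers curious about what happens inside the classifier, the following is a rough pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, plus the accuracy calculation that a score method reports. It is a simplified illustration under stated assumptions, not scikit-learn's code: the function names (train_nb, predict_nb, accuracy) and the toy token lists are invented for this example, and words never seen in training are simply skipped at prediction time.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    # docs is a list of token lists; labels holds one class id per document.
    vocab = {tok for doc in docs for tok in doc}
    log_prior = {}
    log_likelihood = {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        # Prior: fraction of training documents that belong to class c.
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(tok for d in class_docs for tok in d)
        # Smoothed word probabilities: (count + alpha) / (total + alpha * |V|).
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((counts[tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood


def predict_nb(model, doc):
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Sum log probabilities; skip words that were never seen in training.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)


def accuracy(model, docs, labels):
    # Fraction of documents whose predicted label matches the true label.
    correct = sum(predict_nb(model, d) == y for d, y in zip(docs, labels))
    return correct / len(labels)


# Hypothetical toy data: 1 = hockey, 0 = baseball.
train_docs = [["puck", "ice", "goalie"], ["goalie", "puck", "save"],
              ["pitcher", "inning", "bat"], ["bat", "home", "run"]]
train_labels = [1, 1, 0, 0]
test_docs = [["puck", "save", "overtime"], ["pitcher", "home", "run"]]
test_labels = [1, 0]

toy_model = train_nb(train_docs, train_labels)
print(accuracy(toy_model, test_docs, test_labels))
```

The classifier picks the class whose prior and word probabilities best explain the counts in an email. Two newsgroups that share much of their vocabulary give similar word probabilities, which is why they are harder to separate.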

# Testing Other Datasets

In [23]:
emails2 = fetch_20newsgroups(categories=['comp.sys.mac.hardware', 'rec.sport.hockey'])
print(emails2.data[5])
print(emails2.target[5])

From: cr292@cleveland.Freenet.Edu (Jim Schenk)
Subject: Re: the hawks WILL return to the finals!!!!!
Organization: Case Western Reserve University, Cleveland, Ohio (USA)
Lines: 9
NNTP-Posting-Host: hela.ins.cwru.edu


The Hawks won the Norris div, and sealed their fate.  It's bad luck
to win the Norris.  The Hawks will sweep the Blues in their dreams but will
lose in 6 in reality.  I predict that in the 6 game with the Blues Belfour
will go down on his knees 7000 time s and will spend the rest of the time 
looking behind him self.  Butcher will pound Roenick and The warthawks have
no one tough enough to prevent it

Bye Bye Wart HAwks

1


The label 1 corresponds to rec.sport.hockey, so this email is about hockey. The label 0 now corresponds to comp.sys.mac.hardware.

In [29]:
# Load the training and test splits for the new pair of categories
train_email2 = fetch_20newsgroups(categories=['comp.sys.mac.hardware', 'rec.sport.hockey'],
                                  subset='train', shuffle=True, random_state=100)
test_email2 = fetch_20newsgroups(categories=['comp.sys.mac.hardware', 'rec.sport.hockey'],
                                 subset='test', shuffle=True, random_state=100)

# Build the vocabulary and turn each email into a vector of word counts
vectorizer2 = CountVectorizer()
vectorizer2.fit(train_email2.data + test_email2.data)
train_count2 = vectorizer2.transform(train_email2.data)
test_count2 = vectorizer2.transform(test_email2.data)

# Train on the training counts and report accuracy on the test counts
model2 = MultinomialNB()
model2.fit(train_count2, train_email2.target)
print(model2.score(test_count2, test_email2.target))

0.9961734693877551


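Our classifier does an even better job distinguishing between Mac hardware emails and hockey emails, with about 99.6% accuracy on the test set. That is higher than the roughly 97.2% it reached on baseball versus hockey, which suggests that two sports newsgroups share more vocabulary than a sports newsgroup and a hardware newsgroup do.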
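
Because the same steps are repeated for each pair of categories, they could be wrapped in a helper. The sketch below is one possible way to do that, not part of the original notebook: score_texts and score_categories are hypothetical names. Like the cells above, it builds the vocabulary from the training and test texts together. Fitting the vectorizer on the training texts only is a common alternative and may give slightly different scores.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def score_texts(train_texts, train_labels, test_texts, test_labels):
    # Build the vocabulary from all texts, mirroring the notebook cells.
    vectorizer = CountVectorizer()
    vectorizer.fit(list(train_texts) + list(test_texts))
    # Train on the training counts, then report accuracy on the test counts.
    model = MultinomialNB()
    model.fit(vectorizer.transform(train_texts), train_labels)
    return model.score(vectorizer.transform(test_texts), test_labels)


def score_categories(categories, random_state=100):
    # Downloads the dataset on first use, so this needs network access.
    train = fetch_20newsgroups(categories=categories, subset='train',
                               shuffle=True, random_state=random_state)
    test = fetch_20newsgroups(categories=categories, subset='test',
                              shuffle=True, random_state=random_state)
    return score_texts(train.data, train.target, test.data, test.target)
```

With this in place, calling score_categories(['rec.sport.baseball', 'rec.sport.hockey']) should follow the same steps as the first experiment, and any other pair of names from data.target_names can be tried the same way.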