## Email Similarity

In this project, we use scikit-learn’s Naive Bayes implementation on several different datasets.
By comparing the classifier’s accuracy across datasets, we can see which pairs of topics are harder to tell apart. For example, how difficult do you think it is to distinguish emails about hockey from emails about baseball? How about emails about hockey from emails about tech?
In this project, we’ll find out exactly how difficult those tasks are.

#### 1. We’ve imported a dataset of emails from scikit-learn’s datasets. All of these emails are tagged based on their content.

In [3]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

emails = fetch_20newsgroups()
print(emails.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


#### 2. We’re interested in seeing how effective our Naive Bayes classifier is at telling the difference between a ***baseball*** email and a ***hockey*** email.

In [14]:
emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'])
print(emails.target_names)
print(emails.data[5])
print("Target name of the label:", emails.target_names[emails.target[5]])

['rec.sport.baseball', 'rec.sport.hockey']
From: mmb@lamar.ColoState.EDU (Michael Burger)
Subject: More TV Info
Distribution: na
Nntp-Posting-Host: lamar.acns.colostate.edu
Organization: Colorado State University, Fort Collins, CO  80523
Lines: 36

United States Coverage:
Sunday April 18
  N.J./N.Y.I. at Pittsburgh - 1:00 EDT to Eastern Time Zone
  ABC - Gary Thorne and Bill Clement

  St. Louis at Chicago - 12:00 CDT and 11:00 MDT - to Central/Mountain Zones
  ABC - Mike Emerick and Jim Schoenfeld

  Los Angeles at Calgary - 12:00 PDT and 11:00 ADT - to Pacific/Alaskan Zones
  ABC - Al Michaels and John Davidson

Tuesday, April 20
  N.J./N.Y.I. at Pittsburgh - 7:30 EDT Nationwide
  ESPN - Gary Thorne and Bill Clement

Thursday, April 22 and Saturday April 24
  To Be Announced - 7:30 EDT Nationwide
  ESPN - To Be Announced


Canadian Coverage:

Sunday, April 18
  Buffalo at Boston - 7:30 EDT Nationwide
  TSN - ???

Tuesday, April 20
  N.J.D./N.Y. at Pittsburgh - 7:30 EDT Nationwide
  T

#### 3. We now want to split our data into training and test sets.

Let's set the name of the training set to **train_emails**. We add these three parameters to the function call:

- subset='train'
- shuffle = True
- random_state = 108

Setting the random_state parameter makes the shuffle reproducible, so the emails come back in the same order every time you run the code.
We name the testing set **test_emails** and change the subset parameter to 'test'.

In [16]:
train_emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'], subset = 'train', shuffle=True, random_state=108)
test_emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'], subset = 'test', shuffle=True, random_state=108)

#### 4. We want to transform these emails into lists of word counts.
We use the CountVectorizer class for this.

In [17]:
counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)
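
To make the idea of word counts concrete, here is a minimal sketch on two made-up sentences (they are not taken from the dataset). The variable names `docs`, `toy_counter`, `toy_counts` and `vocab` are illustrative only. Each row of the resulting matrix is one document, and each column is one word of the learned vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented sentences standing in for emails (assumption: toy data only)
docs = ["the puck hit the post", "the pitcher threw the ball"]

toy_counter = CountVectorizer()
toy_counts = toy_counter.fit_transform(docs)  # sparse matrix: documents x vocabulary

# vocabulary_ maps each word to its column index; sort words by that index
vocab = sorted(toy_counter.vocabulary_, key=toy_counter.vocabulary_.get)

for doc, row in zip(docs, toy_counts.toarray()):
    # Pair every vocabulary word with its count in this document
    print(doc, {word: int(count) for word, count in zip(vocab, row)})
```

Words that never appear in a document get a count of 0, which is why the real count matrices are stored in sparse form.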

#### 5. We can now make a list of the counts of our words in our training and testing sets.

In [18]:
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

#### 6. Let’s now make a Naive Bayes classifier that we can train and test on.
We use the MultinomialNB classifier for this.

In [19]:
classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)
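
As a rough illustration of what fitting and predicting look like end to end, the sketch below trains the same kind of classifier on four invented sentences and classifies a fifth. The sentences, the string labels and the `toy_` names are assumptions made for this example. Unlike the cell above, the vocabulary here is learned from the training sentences only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training sentences and labels (assumption: toy data only)
toy_texts = [
    "the goalie stopped the puck",
    "he scored on the power play with the puck",
    "the pitcher threw a strike",
    "a home run in the ninth inning",
]
toy_labels = ["hockey", "hockey", "baseball", "baseball"]

toy_counter = CountVectorizer()
toy_train_counts = toy_counter.fit_transform(toy_texts)

toy_classifier = MultinomialNB()
toy_classifier.fit(toy_train_counts, toy_labels)

# Words unseen during fitting ("went", "past") are ignored by transform()
new_counts = toy_counter.transform(["the puck went past the goalie"])
prediction = toy_classifier.predict(new_counts)
print(prediction[0])
```

The prediction is driven by "puck" and "goalie", which only occur in the hockey training sentences.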

#### 7. Now, we test the Naive Bayes Classifier.

We print the result of the classifier’s .score() method.

This method returns the accuracy of the classifier on the test data. Accuracy is the fraction of test emails that the classifier labels correctly, reported as a number between 0 and 1.

In [20]:
print("Accuracy of NB classifier:", classifier.score(test_counts, test_emails.target))

Accuracy of NB classifier: 0.9723618090452262


#### Result
Our classifier does a pretty good job of distinguishing between baseball emails and hockey emails, with an accuracy of ***97.24%***.
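
The accuracy figure is just the share of correct predictions. A minimal sketch of that calculation, using a hypothetical helper named `accuracy` and made-up label lists rather than the real test set:

```python
def accuracy(predicted, actual):
    """Return the fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up labels: 4 of the 5 predictions agree with the true labels
predicted = [0, 1, 1, 0, 1]
actual = [0, 1, 0, 0, 1]
print(accuracy(predicted, actual))  # 0.8
```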


#### 8. Now, let’s see how it does with emails about really different topics.

We select a different pair of categories, ['comp.sys.ibm.pc.hardware','rec.sport.hockey'], and repeat the same steps.

In [23]:
train_emails = fetch_20newsgroups(categories = ['comp.sys.ibm.pc.hardware','rec.sport.hockey'], subset = 'train', shuffle=True, random_state=108)
test_emails = fetch_20newsgroups(categories = ['comp.sys.ibm.pc.hardware','rec.sport.hockey'], subset = 'test', shuffle=True, random_state=108)

counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)

print("Accuracy of NB classifier:", classifier.score(test_counts, test_emails.target))

Accuracy of NB classifier: 0.9974715549936789


#### Result
Our classifier works even better when the two subjects are this distinct. Accuracy is ***99.75%***!

#### 9. Finally, we pick what may be the two closest categories to see whether our classifier still works well on a challenging task.
We select the categories ['comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware'] and repeat the same steps.

In [24]:
train_emails = fetch_20newsgroups(categories = ['comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware'], subset = 'train', shuffle=True, random_state=108)
test_emails = fetch_20newsgroups(categories = ['comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware'], subset = 'test', shuffle=True, random_state=108)

counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)

print("Accuracy of NB classifier:", classifier.score(test_counts, test_emails.target))

Accuracy of NB classifier: 0.8996138996138996


### Conclusion
Our classifier’s accuracy depends on how similar the two categories are. We found ***89.96%*** for IBM PC hardware vs. Mac hardware, ***97.24%*** for baseball vs. hockey, and ***99.75%*** for IBM PC hardware vs. hockey. Even the lowest score is still fairly good for a simple word-count classifier. As a next step, we could build a cross table of accuracies for all possible pairs of email categories. That would help us understand when this algorithm is appropriate and when we should look for another approach to get better accuracy.
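
One possible way to build that cross table is sketched below. The function names `load_newsgroup_pair`, `pair_accuracy` and `accuracy_table` are hypothetical helpers introduced here, not part of scikit-learn. The steps inside them mirror the cells above, including fitting the vocabulary on both subsets. The loader is passed in as a parameter so the table logic can be tried on small in-memory data before downloading anything.

```python
from itertools import combinations

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def load_newsgroup_pair(pair):
    """Fetch train/test emails for two categories (downloads data on first use)."""
    train = fetch_20newsgroups(categories=list(pair), subset='train',
                               shuffle=True, random_state=108)
    test = fetch_20newsgroups(categories=list(pair), subset='test',
                              shuffle=True, random_state=108)
    return train.data, train.target, test.data, test.target


def pair_accuracy(train_texts, train_labels, test_texts, test_labels):
    """Repeat the notebook's steps for one pair and return the test accuracy."""
    counter = CountVectorizer()
    counter.fit(list(test_texts) + list(train_texts))
    train_counts = counter.transform(train_texts)
    test_counts = counter.transform(test_texts)
    classifier = MultinomialNB()
    classifier.fit(train_counts, train_labels)
    return classifier.score(test_counts, test_labels)


def accuracy_table(category_names, load_pair=load_newsgroup_pair):
    """Map every unordered pair of categories to its accuracy."""
    return {pair: pair_accuracy(*load_pair(pair))
            for pair in combinations(category_names, 2)}


# Tiny invented corpus so the sketch runs offline (assumption: toy data only)
TOY = {
    'hockey': ['goalie puck ice', 'puck ice skate', 'ice goalie puck'],
    'baseball': ['pitcher bat inning', 'bat inning glove', 'inning pitcher bat'],
    'hardware': ['disk drive controller', 'drive controller card', 'controller disk drive'],
}


def toy_loader(pair):
    """Use the first two texts of each category for training, the last for testing."""
    a, b = pair
    return (TOY[a][:2] + TOY[b][:2], [0, 0, 1, 1],
            TOY[a][2:] + TOY[b][2:], [0, 1])


table = accuracy_table(list(TOY), load_pair=toy_loader)
for pair, score in table.items():
    print(pair, score)
```

With the real data, calling `accuracy_table(emails.target_names)` would cover all 190 pairs of the 20 categories, so it may take a while to run.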