# **Email Similarity**

**In this project, we train a Naive Bayes classifier on several pairs of newsgroup categories to see how difficult it is to distinguish between emails about different topics.**

In [7]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import pandas as pd



## **Exploring the datasets**

Loading the datasets of emails about baseball and hockey.

In [10]:
emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'])

print(emails.data[5])
print(emails.target[5])



From: mmb@lamar.ColoState.EDU (Michael Burger)
Subject: More TV Info
Distribution: na
Nntp-Posting-Host: lamar.acns.colostate.edu
Organization: Colorado State University, Fort Collins, CO  80523
Lines: 36

United States Coverage:
Sunday April 18
  N.J./N.Y.I. at Pittsburgh - 1:00 EDT to Eastern Time Zone
  ABC - Gary Thorne and Bill Clement

  St. Louis at Chicago - 12:00 CDT and 11:00 MDT - to Central/Mountain Zones
  ABC - Mike Emerick and Jim Schoenfeld

  Los Angeles at Calgary - 12:00 PDT and 11:00 ADT - to Pacific/Alaskan Zones
  ABC - Al Michaels and John Davidson

Tuesday, April 20
  N.J./N.Y.I. at Pittsburgh - 7:30 EDT Nationwide
  ESPN - Gary Thorne and Bill Clement

Thursday, April 22 and Saturday April 24
  To Be Announced - 7:30 EDT Nationwide
  ESPN - To Be Announced


Canadian Coverage:

Sunday, April 18
  Buffalo at Boston - 7:30 EDT Nationwide
  TSN - ???

Tuesday, April 20
  N.J.D./N.Y. at Pittsburgh - 7:30 EDT Nationwide
  TSN - ???

Wednesday, April 21
  St. Louis a

## **Training the Naive Bayes Classifier**

**First we load the predefined training and test subsets of the dataset. We transform the emails into word-count vectors using CountVectorizer, which is a feature extractor rather than a classifier. After transforming both subsets, we train a Naive Bayes classifier (MultinomialNB) on the training counts and print its accuracy on the test subset.**
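To make the word-count step concrete, here is a minimal sketch on two made-up sentences (they are not taken from the newsgroups data). It assumes the default `CountVectorizer` settings, which lowercase the text and keep only tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration only
docs = [
    "the pitcher threw the ball",
    "the goalie stopped the puck",
]

counter = CountVectorizer()
counts = counter.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort words by that index
vocab = sorted(counter.vocabulary_, key=counter.vocabulary_.get)

print(vocab)             # one column per distinct word
print(counts.toarray())  # one row per document, entries are word counts
```

Each row of the resulting matrix is the representation that MultinomialNB receives for one document.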

In [19]:
# Load the predefined train and test subsets for the two categories
train_emails = fetch_20newsgroups(
    categories=['rec.sport.baseball', 'rec.sport.hockey'],
    subset="train", shuffle=True, random_state=180)

test_emails = fetch_20newsgroups(
    categories=['rec.sport.baseball', 'rec.sport.hockey'],
    subset="test", shuffle=True, random_state=180)

# Build the vocabulary from both subsets (the vectorizer also sees test-set words)
counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)

# Convert each email into a vector of word counts
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

# Train on the training counts, then report accuracy on the test counts
classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)

print(classifier.score(test_counts, test_emails.target))

0.9723618090452262


**The accuracy is 0.9724, so emails about hockey and baseball are classified correctly about 97.2% of the time. Both topics are sports, so some shared vocabulary is expected and a few errors are not surprising. Next, I compare emails about two unrelated topics, PC hardware and religion, and check the accuracy.**

In [13]:
emails_1 = fetch_20newsgroups(categories=['comp.sys.ibm.pc.hardware','talk.religion.misc'])

In [21]:
# Same steps as before, now for PC hardware vs. religion
train_emails = fetch_20newsgroups(
    categories=['comp.sys.ibm.pc.hardware', 'talk.religion.misc'],
    subset="train", shuffle=True, random_state=180)

test_emails = fetch_20newsgroups(
    categories=['comp.sys.ibm.pc.hardware', 'talk.religion.misc'],
    subset="test", shuffle=True, random_state=180)

# Build the vocabulary from both subsets
counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)

train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)

print(classifier.score(test_counts, test_emails.target))

0.9906687402799378


**The accuracy increases from about 97% to about 99%, which shows that it is easier to distinguish between religion and hardware than between baseball and hockey. Let's see how emails about Middle East politics differ from emails about gun politics.**

In [20]:
# Same steps as before, now for gun politics vs. Middle East politics
train_emails = fetch_20newsgroups(
    categories=['talk.politics.guns', 'talk.politics.mideast'],
    subset="train", shuffle=True, random_state=180)

test_emails = fetch_20newsgroups(
    categories=['talk.politics.guns', 'talk.politics.mideast'],
    subset="test", shuffle=True, random_state=180)

# Build the vocabulary from both subsets
counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)

train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)

print(classifier.score(test_counts, test_emails.target))

0.9837837837837838


**The accuracy for emails about Middle East politics and guns (about 98.4%) is higher than for emails about hockey and baseball (about 97.2%). A higher accuracy means the two groups are easier to tell apart, so these two politics groups are less similar to each other than the hockey and baseball emails are, even though both are political topics.**
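The cells so far repeat the same steps for each pair of categories. A minimal sketch of how they could be wrapped in one helper is shown here; the name `score_pair` and the toy sentences are hypothetical and only serve to make the block runnable without downloading the dataset. The helper mirrors the cells above, including fitting the vocabulary on both subsets.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def score_pair(train_texts, train_labels, test_texts, test_labels):
    """Count words, fit MultinomialNB on the training set, return test accuracy."""
    counter = CountVectorizer()
    # Same choice as the notebook cells: vocabulary built from both subsets
    counter.fit(list(test_texts) + list(train_texts))
    train_counts = counter.transform(train_texts)
    test_counts = counter.transform(test_texts)
    classifier = MultinomialNB()
    classifier.fit(train_counts, train_labels)
    return classifier.score(test_counts, test_labels)


# Toy data, invented for illustration only (0 = baseball, 1 = hockey)
train_texts = [
    "pitcher threw a strike",
    "batter hit a home run",
    "goalie saved the puck",
    "skater scored on the ice",
]
train_labels = [0, 0, 1, 1]
test_texts = ["the pitcher hit a home run", "the puck slid on the ice"]
test_labels = [0, 1]

accuracy = score_pair(train_texts, train_labels, test_texts, test_labels)
print(accuracy)
```

With the real data, the same helper would be called with `train_emails.data`, `train_emails.target`, `test_emails.data` and `test_emails.target`.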

**Let's see how similar emails about politics are to emails about religion.**

In [22]:
# Same steps as before, now for general politics vs. religion
train_emails = fetch_20newsgroups(
    categories=['talk.politics.misc', 'talk.religion.misc'],
    subset="train", shuffle=True, random_state=180)

test_emails = fetch_20newsgroups(
    categories=['talk.politics.misc', 'talk.religion.misc'],
    subset="test", shuffle=True, random_state=180)

# Build the vocabulary from both subsets
counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)

train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)

print(classifier.score(test_counts, test_emails.target))

0.8805704099821747


**The accuracy (about 88.1%) is lower than all the accuracies calculated above. A lower accuracy means the classifier finds the two groups harder to tell apart, so emails about politics and religion are more similar to each other than any of the earlier pairs, including hockey and baseball. This suggests that the general politics and religion groups share more vocabulary than the gun politics and Middle East politics groups do.**
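To compare the four experiments side by side, the accuracies printed above can be collected in a small table. This is only a sketch: the numbers are copied from the cell outputs in this notebook, and the column names `pair` and `accuracy` are chosen here for illustration.

```python
import pandas as pd

# Test accuracies copied from the cell outputs above
results = pd.DataFrame(
    {
        "pair": [
            "rec.sport.baseball vs rec.sport.hockey",
            "comp.sys.ibm.pc.hardware vs talk.religion.misc",
            "talk.politics.guns vs talk.politics.mideast",
            "talk.politics.misc vs talk.religion.misc",
        ],
        "accuracy": [
            0.9723618090452262,
            0.9906687402799378,
            0.9837837837837838,
            0.8805704099821747,
        ],
    }
)

# Sort from hardest to easiest pair to distinguish (lowest accuracy first)
ranked = results.sort_values("accuracy").reset_index(drop=True)
print(ranked)
```

Read from top to bottom, the table goes from the pair the classifier confuses most often to the pair it separates most easily.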