# Email Classification using Naive Bayes

A project for my Codecademy Certified Data Scientist: Machine Learning Specialist professional certification.

Robert Hall

01/08/2025

### 1. Import Data and Libraries, Do Basic Data Exploration

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

In [4]:
emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'])

In [8]:
# print target names to confirm the data loaded correctly
print(f"Target Index 0: {emails.target_names[0]}")
print(f"Target Index 1: {emails.target_names[1]}")

Target Index 0: rec.sport.baseball
Target Index 1: rec.sport.hockey


In [7]:
# print email at index 5
print(emails.data[5])

From: mmb@lamar.ColoState.EDU (Michael Burger)
Subject: More TV Info
Distribution: na
Nntp-Posting-Host: lamar.acns.colostate.edu
Organization: Colorado State University, Fort Collins, CO  80523
Lines: 36

United States Coverage:
Sunday April 18
  N.J./N.Y.I. at Pittsburgh - 1:00 EDT to Eastern Time Zone
  ABC - Gary Thorne and Bill Clement

  St. Louis at Chicago - 12:00 CDT and 11:00 MDT - to Central/Mountain Zones
  ABC - Mike Emerick and Jim Schoenfeld

  Los Angeles at Calgary - 12:00 PDT and 11:00 ADT - to Pacific/Alaskan Zones
  ABC - Al Michaels and John Davidson

Tuesday, April 20
  N.J./N.Y.I. at Pittsburgh - 7:30 EDT Nationwide
  ESPN - Gary Thorne and Bill Clement

Thursday, April 22 and Saturday April 24
  To Be Announced - 7:30 EDT Nationwide
  ESPN - To Be Announced


Canadian Coverage:

Sunday, April 18
  Buffalo at Boston - 7:30 EDT Nationwide
  TSN - ???

Tuesday, April 20
  N.J.D./N.Y. at Pittsburgh - 7:30 EDT Nationwide
  TSN - ???

Wednesday, April 21
  St. Louis a

In [9]:
# get target classification of email 5 to ensure that it's labeled correctly as a hockey email
print(f"Email 5 Classification: {emails.target[5]}")

Email 5 Classification: 1


### 2. Create Training and Test Datasets

In [11]:
train_emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'],
                                  subset='train',
                                  shuffle=True,
                                  random_state=108)

In [12]:
test_emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'],
                                 subset='test',
                                 shuffle=True,
                                 random_state=108)

### 3. Transform Emails into Lists of Word Counts

In [14]:
# instantiate counter
counter = CountVectorizer()

In [15]:
# fit the counter on the training and test emails so its vocabulary covers both sets
counter.fit(train_emails.data + test_emails.data)

In [16]:
# transform the training emails into word counts
train_counts = counter.transform(train_emails.data)

In [17]:
# transform the test emails into word counts
test_counts = counter.transform(test_emails.data)
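To make the word-count step concrete, here is a minimal pure-Python sketch of the bag-of-words idea behind `CountVectorizer`. It is an illustration, not scikit-learn's implementation: the helper names (`tokenize`, `fit_vocabulary`, `transform`) and the toy sentences are hypothetical. Unlike the cell above, this sketch builds the vocabulary from the training documents only, to show that words never seen during fitting are simply dropped when new text is transformed.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    # (roughly what CountVectorizer does with its default settings).
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(docs):
    # Map every distinct token to a column index, in sorted order.
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(docs, vocab):
    # One row per document, one column per vocabulary word.
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[tok] for tok in vocab])
    return rows

# Toy documents, made up for illustration (assumption: not from the dataset).
train_docs = ["the puck hit the ice", "the bat hit the ball"]
test_docs = ["the goalie stopped the puck"]

vocab = fit_vocabulary(train_docs)
train_counts_demo = transform(train_docs, vocab)
test_counts_demo = transform(test_docs, vocab)

print(list(vocab))        # column order of the count matrix
print(train_counts_demo)  # word counts for each training document
print(test_counts_demo)   # "goalie" and "stopped" are not in vocab, so they are ignored
```

In the notebook, the counter is fitted on the training and test emails together, so every test word has a column. Fitting on the training emails alone is the other common choice, and it keeps the test set fully unseen until scoring.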

### 4. Build, Train and Score the Naive Bayes Classifier

In [18]:
classifier = MultinomialNB()

In [20]:
# fit the classifier to the training word counts and labels
classifier.fit(train_counts, train_emails.target)

In [21]:
# score the model
classifier.score(test_counts, test_emails.target)

0.9723618090452262
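For intuition about what the classifier learns, the following is a rough from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is not scikit-learn's code: the function names `train_nb` and `predict_nb` and the toy documents are assumptions made for illustration. The `alpha=1.0` default mirrors the default smoothing value of `MultinomialNB`.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    # Estimate log P(class) and smoothed log P(word | class) from the training text.
    classes = sorted(set(labels))
    vocab = sorted({w for d in docs for w in d.split()})
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        # Smoothing adds alpha to every word count so unseen words never get probability 0.
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    # Score each class by log prior plus summed word log-likelihoods; pick the best.
    scores = {}
    for c in log_prior:
        known = [w for w in doc.split() if w in log_likelihood[c]]  # skip unknown words
        scores[c] = log_prior[c] + sum(log_likelihood[c][w] for w in known)
    return max(scores, key=scores.get)

# Toy training set, made up for illustration (assumption: not from the dataset).
toy_docs = ["puck ice goal", "ice skate puck", "bat ball pitch", "ball inning bat"]
toy_labels = ["hockey", "hockey", "baseball", "baseball"]

log_prior, log_likelihood = train_nb(toy_docs, toy_labels)
print(predict_nb("puck on ice", log_prior, log_likelihood))
print(predict_nb("bat and ball", log_prior, log_likelihood))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together, which matters for long emails.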

### 5. Build, Train, and Score Model on Different Categories to Compare Accuracy

In [22]:
# using categories ['comp.sys.ibm.pc.hardware', 'rec.sport.hockey']

train_emails = fetch_20newsgroups(categories = ['comp.sys.ibm.pc.hardware','rec.sport.hockey'], subset='train', shuffle=True, random_state=108)
test_emails = fetch_20newsgroups(categories = ['comp.sys.ibm.pc.hardware','rec.sport.hockey'], subset='test', shuffle=True, random_state=108)

counter = CountVectorizer()

counter.fit(test_emails.data + train_emails.data)

train_counts = counter.transform(train_emails.data)

test_counts = counter.transform(test_emails.data)

classifier = MultinomialNB()

classifier.fit(train_counts, train_emails.target)

print(classifier.score(test_counts, test_emails.target))

0.9974715549936789


The model appears to distinguish PC hardware posts from hockey posts better than it distinguishes hockey posts from baseball posts (test accuracy of about 0.997 versus 0.972), likely because the two subjects are more distinct and therefore probably use more divergent vocabularies.
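One way to probe the vocabulary explanation is to measure how much two categories' word sets overlap, for example with Jaccard similarity (shared words divided by all distinct words). The sketch below uses a few made-up sentences as stand-ins for each category, so the numbers say nothing about the real dataset. The helper names `vocabulary` and `jaccard` are hypothetical.

```python
def vocabulary(docs):
    # Set of distinct lowercase words across a list of documents.
    return {w for d in docs for w in d.lower().split()}

def jaccard(a, b):
    # Size of the intersection divided by size of the union (0.0 for two empty sets).
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy stand-ins for each category (assumption: invented sentences, not real posts).
hockey_docs = ["the team won the game", "the goalie made a save"]
baseball_docs = ["the team won the game", "the pitcher threw a strike"]
hardware_docs = ["the drive needs a new controller", "install the card in a slot"]

hockey_vocab = vocabulary(hockey_docs)
baseball_vocab = vocabulary(baseball_docs)
hardware_vocab = vocabulary(hardware_docs)

sports_overlap = jaccard(hockey_vocab, baseball_vocab)
mixed_overlap = jaccard(hockey_vocab, hardware_vocab)
print(f"hockey vs baseball: {sports_overlap:.3f}")
print(f"hockey vs hardware: {mixed_overlap:.3f}")
```

To apply the same check to the real data, the posts in `train_emails.data` would need to be grouped by their value in `train_emails.target` before building each vocabulary. Removing common stop words first would likely give a clearer picture.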