# Email Spam Classification 


We’ve all received spam emails. Spam mail, or junk mail, is email sent to a massive number
of users at one time, frequently containing cryptic messages, scams, or, most dangerously,
phishing content.

In this project, we use Python to build an email spam detector, then train it with machine
learning to classify emails as spam or non-spam (ham). Let’s get started.

Aditya Shinde
Oasis Infobyte Data Science Internship

In [26]:
# Import packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [27]:
# Import data (raw string so backslashes in the Windows path are not read as escape sequences)
spam_df = pd.read_csv(r'C:\Aditya_Work\MLCourse\spam.csv')

In [28]:
# inspect df
spam_df.groupby('Category').describe()

         Message
           count unique                                                top freq
Category
ham         4825   4516                             Sorry, I'll call later   30
spam         747    641  Please call our customer service representativ...    4


In [29]:
# Create a numeric label column: 1 for spam, 0 for ham
spam_df['Spam'] = (spam_df['Category'] == 'spam').astype(int)

In [30]:
# Create train test split
xtrain, xtest, ytrain, ytest = train_test_split(spam_df.Message, spam_df.Spam, test_size = 0.25)
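
The split above is random, so the exact rows (and the final accuracy) change from run to run, and the share of spam in each part can drift. One option, not used in the original notebook, is to pass `stratify` and `random_state` to `train_test_split`. The sketch below uses hypothetical placeholder messages rather than `spam.csv`:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 ham (0) and 20 spam (1) placeholder messages
messages = [f"message {i}" for i in range(100)]
labels = [0] * 80 + [1] * 20

# stratify keeps the ham/spam ratio equal in both parts;
# random_state makes the split reproducible
x_tr, x_te, y_tr, y_te = train_test_split(
    messages, labels, test_size=0.25, stratify=labels, random_state=42
)

print(Counter(y_tr), Counter(y_te))
```

In the notebook, the equivalent change would be adding `stratify=spam_df.Spam` and a fixed `random_state` to the existing call.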

In [31]:
# Count word occurrences and store them as a sparse document-term matrix
cv = CountVectorizer()
x_train_count = cv.fit_transform(xtrain.values)

In [32]:
x_train_count

<4179x7407 sparse matrix of type '<class 'numpy.int64'>'
	with 55496 stored elements in Compressed Sparse Row format>
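
To make the sparse matrix above less abstract, here is a minimal sketch of what `CountVectorizer` produces on a tiny corpus. The three messages are made up for illustration and are not taken from `spam.csv`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, not taken from spam.csv
toy_messages = ["free prize click now", "call me later", "free free call"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_messages)

# vocabulary_ maps each word to its column index; sort words by that index
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)

# One row per message, one column per word, each cell a word count
print(toy_counts.toarray())
```

Each row is one message and each column one vocabulary word, so the 4179x7407 matrix in the notebook holds 4179 training messages over a 7407-word vocabulary.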

In [33]:
ytrain

4110    0
5183    0
1837    0
2970    0
5381    1
       ..
133     0
4694    0
4539    0
4874    0
3145    0
Name: Spam, Length: 4179, dtype: int64

In [34]:
# Train model
model = MultinomialNB()
model.fit(x_train_count, ytrain)

In [35]:
# Test the model on a user-created ham email
email_ham = ["Selling tickets here"]
email_ham_count = cv.transform(email_ham)  # reuse the fitted vocabulary
result = model.predict(email_ham_count)
# predict returns an array with one label per input; 0 means ham
if result[0] == 0:
    print("Ham email")
else:
    print("Spam email")

Ham email


In [36]:
# Test the model on a user-created spam email
email_spam = ["Reward money click here"]
email_spam_count = cv.transform(email_spam)  # reuse the fitted vocabulary
result = model.predict(email_spam_count)
# predict returns an array with one label per input; 0 means ham
if result[0] == 0:
    print("Ham email")
else:
    print("Spam email")

Spam email
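
The two manual checks repeat the same transform-then-predict steps. As a sketch of one way to avoid that repetition (an alternative, not what the notebook does), the vectorizer and classifier can be chained with scikit-learn's `make_pipeline`. The training texts and the `classify` helper below are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set: 1 = spam, 0 = ham
toy_texts = [
    "win free prize money now",
    "free reward click here",
    "claim your free money",
    "see you at lunch",
    "call me later tonight",
    "are we meeting tomorrow",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes raw text and then applies Naive Bayes
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(toy_texts, toy_labels)

def classify(text):
    """Return a readable label for one raw email string (hypothetical helper)."""
    return "Spam email" if spam_pipeline.predict([text])[0] == 1 else "Ham email"

predictions = [classify("free money reward"), classify("meeting at lunch")]
print(predictions)
```

With a pipeline, raw strings go straight into `fit` and `predict`, so the fitted vocabulary cannot be accidentally skipped or refitted on test data.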


In [37]:
# Evaluate the model on the held-out test set
x_test_count = cv.transform(xtest.values)  # transform only; never refit on test data
score = model.score(x_test_count, ytest)  # mean accuracy
print(f"Accuracy of our spam filter is {score * 100:.2f} %")

Accuracy of our spam filter is 98.56 %
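
Accuracy alone can flatter a classifier on imbalanced data: with 4825 ham and 747 spam messages in the `describe()` output, a filter that always answers "ham" would already be right most of the time. The sketch below computes that baseline and shows how a confusion matrix, precision, and recall could be obtained. The `y_true` and `y_pred` lists are hypothetical, not the notebook's results:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Class counts taken from the describe() output earlier in the notebook
ham_count, spam_count = 4825, 747
baseline_accuracy = ham_count / (ham_count + spam_count)
print(f"Always-ham baseline accuracy: {baseline_accuracy:.2%}")

# Hypothetical labels and predictions (1 = spam, 0 = ham), for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In the notebook, the same calls would take `ytest` and `model.predict(x_test_count)` in place of the toy lists.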
