# Email Spam Detection with Machine Learning

#### Task

- We have all received spam emails. Spam mail, or junk mail, is email sent to a massive number of users at one time, frequently containing cryptic messages, scams, or, most dangerously, phishing content.

- In this project, we use Python to build an email spam detector, then train a machine learning model to classify messages as spam or non-spam (ham). Let’s get started!

### Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

### Loading and Exploring Dataset

In [3]:
df = pd.read_csv('spam.csv', encoding='latin1')
df.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |


In [4]:
df.shape

(5572, 5)

In [5]:
df.isnull().sum()

v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64

In [6]:
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)

In [7]:
df.head()

| | v1 | v2 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [8]:
df = df.rename(columns={'v1':'Category', 'v2':'Message'})
df.sample(5)

| | Category | Message |
|---|---|---|
| 3278 | ham | Solve d Case : A Man Was Found Murdered On &l... |
| 2031 | ham | I noe la... U wana pei bf oso rite... K lor, o... |
| 936 | ham | Since when, which side, any fever, any vomitin. |
| 5161 | ham | Lol no. I just need to cash in my nitros. Hurr... |
| 5244 | ham | thanks for the temales it was wonderful. Thank... |


In [9]:
df.groupby('Category').describe()

| Category | Message (count) | Message (unique) | Message (top) | Message (freq) |
|---|---|---|---|---|
| ham | 4825 | 4516 | Sorry, I'll call later | 30 |
| spam | 747 | 653 | Please call our customer service representativ... | 4 |
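
The counts above show that the two classes are imbalanced, which matters when reading any accuracy figure. A small sketch of the arithmetic, using only the counts reported by `describe()` (the variable names are illustrative and not part of the notebook):

```python
# Class counts taken from the describe() output above
ham_count = 4825
spam_count = 747
total = ham_count + spam_count  # 5572, matches df.shape

spam_share = spam_count / total
# Accuracy of a trivial classifier that labels every message as ham
majority_baseline = ham_count / total

print(f"Spam share: {spam_share:.2%}")
print(f"Always-ham baseline accuracy: {majority_baseline:.2%}")
```

Any trained model should be judged against this always-ham baseline rather than against 50%.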


In [10]:
df['Spam'] = df['Category'].apply(lambda x: 1 if x=='spam' else 0)
df.head()

| | Category | Message | Spam |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


### Training the Model

In [11]:
X = df['Message']
y = df['Spam']

In [12]:
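# Split the dataset: 80% for training, 20% for testing.
# No random_state is set, so the exact split (and the scores below) vary between runs.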
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
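
Because spam is the minority class, a reproducible, stratified split may be preferable. The sketch below is one possible variant, not the split used in this notebook; it runs on made-up data, and `messages`, `labels`, and the `random_state` value are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy data standing in for df['Message'] and df['Spam'] (illustrative only)
messages = [f"message {i}" for i in range(100)]
labels = [0] * 80 + [1] * 20  # 20% spam

# stratify keeps the spam ratio equal in both parts;
# random_state makes the split reproducible
x_tr, x_te, y_tr, y_te = train_test_split(
    messages, labels, test_size=0.2, stratify=labels, random_state=42
)

print(len(x_tr), len(x_te))
print(sum(y_tr) / len(y_tr), sum(y_te) / len(y_te))
```

Both parts keep the 20% spam ratio of the toy labels.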

In [13]:
# Learn the vocabulary from the training messages only and build the word-count matrix
vectorizer = CountVectorizer()
x_train_count = vectorizer.fit_transform(x_train.values)
x_train_count.toarray()[:5]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
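
To make the count matrix less abstract, here is a minimal sketch of what `CountVectorizer` does, run on two made-up messages rather than the dataset (`corpus` and `toy_vectorizer` are illustrative names):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up messages (not from the dataset)
corpus = ["free prize now", "call me now"]

toy_vectorizer = CountVectorizer()
counts = toy_vectorizer.fit_transform(corpus)

# vocabulary_ maps each word to its column index
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
print(counts.toarray())
```

Each row is a message and each column is a word from the learned vocabulary. The real matrix has the same layout with one column per distinct word in the training messages, which is why the printed rows are mostly zeros.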

In [14]:
model = MultinomialNB()
model.fit(x_train_count, y_train)
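
The vectorizer and classifier can also be bundled into a single scikit-learn pipeline, so raw text goes in and labels come out. This is an optional alternative to the two-step approach above, sketched on a tiny made-up training set; `spam_pipeline` and the example texts are assumptions, not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set (1 = spam, 0 = ham), for illustration only
train_texts = [
    "win free prize now",
    "free cash prize claim now",
    "see you at lunch",
    "call me when you arrive",
]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes and classifies in one object
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(train_texts, train_labels)

pipeline_prediction = spam_pipeline.predict(["claim your free prize"])
print(pipeline_prediction)
```

A pipeline removes the risk of forgetting to apply the same fitted vectorizer at prediction time.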

In [15]:
# Reuse the vocabulary fitted on the training set; never refit on the test set
x_test_count = vectorizer.transform(x_test.values)
model.predict(x_test_count)

array([0, 0, 0, ..., 0, 1, 0], dtype=int64)

### Evaluating the Model

In [16]:
y_pred = model.predict(x_test_count)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Accuracy: 0.9856502242152466


In [17]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[960,   0],
       [ 16, 139]], dtype=int64)
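
The headline metrics can be recomputed by hand from the confusion matrix above. scikit-learn lays it out with rows as actual classes and columns as predicted classes, in label order 0 (ham), 1 (spam). The variable names below are illustrative:

```python
# Values copied from the confusion matrix above
tn, fp = 960, 0    # actual ham:  predicted ham, predicted spam
fn, tp = 16, 139   # actual spam: predicted ham, predicted spam

cm_accuracy = (tp + tn) / (tp + tn + fp + fn)
cm_precision = tp / (tp + fp)   # share of predicted spam that is really spam
cm_recall = tp / (tp + fn)      # share of real spam that was caught
cm_f1 = 2 * cm_precision * cm_recall / (cm_precision + cm_recall)

print(f"accuracy={cm_accuracy:.4f} precision={cm_precision:.2f} "
      f"recall={cm_recall:.2f} f1={cm_f1:.2f}")
```

In this run no ham message was flagged as spam, while 16 spam messages were let through as ham.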

In [18]:
print('Classification Report:\n')
print(classification_report(y_test, y_pred))

Classification Report:

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       960
           1       1.00      0.90      0.95       155

    accuracy                           0.99      1115
   macro avg       0.99      0.95      0.97      1115
weighted avg       0.99      0.99      0.99      1115



In [19]:
input_mail = ["Congratulations! You've won a free vacation to an exotic destination. Claim your prize now by clicking the link below."]
input_features = vectorizer.transform(input_mail)

In [20]:
predicted_label = model.predict(input_features)

# Labels follow the encoding from In [10]: 1 = spam, 0 = ham
if predicted_label[0] == 1:
    print('The email is classified as "spam".')
elif predicted_label[0] == 0:
    print('The email is classified as "ham" (non-spam).')
else:
    print('Invalid prediction label.')

The email is classified as "spam".


In [21]:
# Columns of predict_proba follow model.classes_, i.e. [0, 1] = [ham, spam]
probabilities = model.predict_proba(input_features)[0]
probability_ham = probabilities[0]
probability_spam = probabilities[1]

print(f"Spam Probability: {probability_spam:.4f}")
print(f"Ham Probability: {probability_ham:.4f}")

Spam Probability: 1.0000
Ham Probability: 0.0000
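
The columns returned by `predict_proba` are ordered like `model.classes_`, so with the 0/1 encoding used here column 0 is ham and column 1 is spam. A minimal sketch of a helper that looks the column up instead of hard-coding it follows; `spam_probability` and the toy training data are hypothetical and only there to make the example self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def spam_probability(model, vectorizer, text, spam_label=1):
    """Return P(spam) for one message, finding the column via model.classes_."""
    features = vectorizer.transform([text])
    spam_column = list(model.classes_).index(spam_label)
    return model.predict_proba(features)[0][spam_column]


# Tiny made-up training set (1 = spam, 0 = ham), for illustration only
toy_texts = [
    "win free prize now",
    "free cash prize claim now",
    "see you at lunch",
    "call me when you arrive",
]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

p_spam = spam_probability(toy_model, toy_vectorizer, "claim your free prize")
print(f"Spam Probability: {p_spam:.4f}")
```

With the notebook's own objects, the call would be `spam_probability(model, vectorizer, input_mail[0])`.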
