<a href="https://colab.research.google.com/github/tuneday-hub/email_spam_detector/blob/master/Spam_mail.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **AN EMAIL SPAM DETECTOR**
The dataset was obtained from Kaggle.



**1) MOUNT GOOGLE DRIVE THAT CONTAINS THE DATASET**

In [80]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**2) IMPORT PACKAGES**

In [81]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string

**3) READ THE CSV FILE**

In [82]:
dataset = pd.read_csv('/content/drive/MyDrive/Dataset/spam_ham_dataset.csv')

**4) PRINT THE FIRST TEN ROWS OF DATA**

In [83]:
dataset.head(10)

,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0
5,2949,ham,Subject: ehronline web address change\r\nthis ...,0
6,2793,ham,Subject: spring savings certificate - take 30 ...,0
7,4185,spam,Subject: looking for medication ? we ` re the ...,1
8,2641,ham,Subject: noms / actual flow for 2 / 26\r\nwe a...,0
9,1870,ham,"Subject: nominations for oct . 21 - 23 , 2000\...",0


**5) CHECK FOR NUMBER OF ROWS AND COLUMNS IN DATASET**

In [84]:
dataset.shape

(5171, 4)

The dataset has 5171 rows and 4 columns.

**6) CHECK FOR COLUMNS IN DATASET**

In [85]:
dataset.columns

Index(['Unnamed: 0', 'label', 'text', 'label_num'], dtype='object')

**ATTRIBUTE INFORMATION**

*   Unnamed: 0: unknown
*   label: a categorical variable indicating whether an email is spam or ham
*   text: the email message
*   label_num: a binary variable indicating whether an email is spam or ham
    *   1 represents spam
    *   0 represents ham

**7) DROP THE UNKNOWN AND LABEL_NUM COLUMNS**

In [86]:
dataset.drop(['Unnamed: 0', 'label_num'], inplace=True, axis=1)

In [87]:
# Check that the columns have been dropped
dataset.head(10)

,label,text
0,ham,Subject: enron methanol ; meter # : 988291\r\n...
1,ham,"Subject: hpl nom for january 9 , 2001\r\n( see..."
2,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar..."
3,spam,"Subject: photoshop , windows , office . cheap ..."
4,ham,Subject: re : indian springs\r\nthis deal is t...
5,ham,Subject: ehronline web address change\r\nthis ...
6,ham,Subject: spring savings certificate - take 30 ...
7,spam,Subject: looking for medication ? we ` re the ...
8,ham,Subject: noms / actual flow for 2 / 26\r\nwe a...
9,ham,"Subject: nominations for oct . 21 - 23 , 2000\..."


**8) CHECK FOR DUPLICATES AND REMOVE THEM**

In [88]:
dataset.drop_duplicates(inplace=True)

In [89]:
# Check that the duplicates have been dropped
dataset.shape

(4993, 2)

The dataset now has 4993 rows and 2 columns; 178 duplicate rows were removed.
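To make the effect of `drop_duplicates` concrete, here is a minimal pure-Python sketch of the same idea: keep the first occurrence of each row and discard later exact copies. The rows and the helper name `drop_duplicate_rows` are invented for illustration and are not part of the notebook.

```python
# Toy rows standing in for (label, text) pairs; the values are invented.
rows = [
    ("ham", "meeting at noon"),
    ("spam", "cheap meds now"),
    ("ham", "meeting at noon"),    # exact duplicate of the first row
    ("spam", "cheap meds now"),    # exact duplicate of the second row
    ("ham", "see attached file"),
]


def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


unique_rows = drop_duplicate_rows(rows)
print(len(rows), "->", len(unique_rows))
```

A row counts as a duplicate only when every column matches, so two emails with the same text but different labels would both be kept.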

**9) CHECK FOR NULL VALUES IN DATASET**

In [90]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4993 entries, 0 to 5170
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   4993 non-null   object
 1   text    4993 non-null   object
dtypes: object(2)
memory usage: 117.0+ KB


In [91]:
dataset.isnull().sum()

label    0
text     0
dtype: int64

The data contains no null values.

In [92]:
dataset['label'].unique()

array(['ham', 'spam'], dtype=object)

In [93]:
dataset['label'].value_counts()

ham     3531
spam    1462
Name: label, dtype: int64

The dataset contains 3531 ham emails and 1462 spam emails, so ham outnumbers spam by roughly 2.4 to 1.
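Because the classes are unevenly sized, it helps to know what accuracy a trivial model would reach. The sketch below uses only the counts printed above and computes the share of each class and the accuracy of always predicting the majority class; the variable names are illustrative.

```python
# Class counts reported by value_counts() above.
counts = {"ham": 3531, "spam": 1462}
total = sum(counts.values())

# Share of each class in the deduplicated dataset.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the majority class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline ({majority_label}): {baseline_accuracy:.1%}")
```

Any accuracy reported for the classifier is best read against this baseline of about 71%, not against 50%.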

**10) DOWNLOAD THE STOPWORDS PACKAGE AND DEFINE THE TEXT-CLEANING FUNCTION**

In [94]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [95]:
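# Build the stop-word set once; calling stopwords.words() for every token is slow.
STOP_WORDS = set(stopwords.words('english'))

def process_text(text):
    """Remove punctuation and English stop words; return the remaining tokens."""
    # Drop every punctuation character, then rejoin into a single string.
    no_punctuation = ''.join(char for char in text if char not in string.punctuation)

    # Split on whitespace and keep tokens that are not stop words (case-insensitive).
    clean_words = [word for word in no_punctuation.split() if word.lower() not in STOP_WORDS]

    return clean_words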

In [96]:
dataset['text'].head(10).apply(process_text)

0    [Subject, enron, methanol, meter, 988291, foll...
1    [Subject, hpl, nom, january, 9, 2001, see, att...
2    [Subject, neon, retreat, ho, ho, ho, around, w...
3    [Subject, photoshop, windows, office, cheap, m...
4    [Subject, indian, springs, deal, book, teco, p...
5    [Subject, ehronline, web, address, change, mes...
6    [Subject, spring, savings, certificate, take, ...
7    [Subject, looking, medication, best, source, d...
8    [Subject, noms, actual, flow, 2, 26, agree, fo...
9    [Subject, nominations, oct, 21, 23, 2000, see,...
Name: text, dtype: object
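The two steps inside `process_text` can be traced on a single string. The sketch below is a stand-alone approximation: it uses a tiny, hypothetical stop-word set instead of NLTK's English list so that it runs without any download, and the sample string is invented, loosely modeled on row 7.

```python
import string

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook uses NLTK's full English list instead.
TOY_STOP_WORDS = {"for", "the", "is", "re", "we"}


def toy_process_text(text):
    """Strip punctuation, then drop stop words (case-insensitive match)."""
    no_punctuation = "".join(ch for ch in text if ch not in string.punctuation)
    return [w for w in no_punctuation.split() if w.lower() not in TOY_STOP_WORDS]


# Invented sample, loosely modeled on row 7 of the dataset preview.
sample = "Subject: looking for medication ? we ` re the best source ."
tokens = toy_process_text(sample)
print(tokens)
```

Note that punctuation is deleted rather than replaced by a space, so a word such as `don't` becomes the single token `dont`. The original casing of the kept tokens is also preserved, which means `Subject` and `subject` would be counted as different tokens.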

**11) CONVERT THE COLLECTION OF TEXTS TO A MATRIX OF TOKEN COUNTS**

In [97]:
from sklearn.feature_extraction.text import CountVectorizer
messages = CountVectorizer(analyzer=process_text).fit_transform(dataset['text'])

In [99]:
messages.shape

(4993, 50381)
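The shape above means one row per email (4993) and one column per distinct token (50381), with each cell holding how often that token occurs in that email. A minimal pure-Python sketch of the same bag-of-words idea, on invented toy documents, looks like this; it illustrates the concept and is not the `CountVectorizer` implementation.

```python
from collections import Counter

# Toy, already-tokenized documents (invented for illustration).
docs = [
    ["cheap", "meds", "cheap"],
    ["meeting", "notes"],
    ["cheap", "meeting"],
]

# Vocabulary: every distinct token, sorted so the column order is deterministic.
vocabulary = sorted({token for doc in docs for token in doc})
column = {token: j for j, token in enumerate(vocabulary)}

# One row per document, one column per vocabulary token; cells hold counts.
matrix = []
for doc in docs:
    row = [0] * len(vocabulary)
    for token, count in Counter(doc).items():
        row[column[token]] = count
    matrix.append(row)

print(vocabulary)
for row in matrix:
    print(row)
```

Most cells are zero, which is why the real matrix is stored in a sparse format instead of as nested lists.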

**12) SPLIT THE DATA INTO 80% TRAINING AND 20% TESTING**

In [102]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(messages, dataset['label'], test_size=0.20, random_state=0)
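The split sizes follow directly from the row count. The sketch below works out the arithmetic under the assumption that the test set receives `ceil(test_size * n)` rows and the training set the remainder, which agrees with the supports of 3994 and 999 shown in the evaluation output. The helper `toy_split` is hypothetical and only demonstrates a seeded shuffle-then-slice split; it does not reproduce the exact rows that `train_test_split` selects.

```python
import math
import random

n_samples = 4993      # rows left after removing duplicates
test_size = 0.20

# Assumption: test rows = ceil(test_size * n), training rows = the remainder.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)


def toy_split(items, test_size, seed):
    """Shuffle a copy with a seeded generator, then slice off the test part."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = math.ceil(test_size * len(shuffled))
    return shuffled[cut:], shuffled[:cut]


train, test = toy_split(list(range(10)), test_size, seed=0)
print(len(train), len(test))
```

Fixing the seed plays the same role as `random_state=0` in the notebook: the same rows land in the same split on every run, so results are repeatable.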

**13) CREATE AND TRAIN THE NAIVE BAYES CLASSIFIER**

In [103]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(X_train, y_train)
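For intuition about what the classifier learns, here is a minimal from-scratch sketch of multinomial Naive Bayes with additive (Laplace) smoothing, using `alpha=1.0` as in the scikit-learn default. The functions `train_nb` and `predict_nb` and the toy documents are invented for illustration; this is a simplified teaching version, not the library's implementation.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit log priors and smoothed log likelihoods from tokenized documents."""
    vocab = {tok for doc in docs for tok in doc}
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    model = {}
    for label, n_docs in class_counts.items():
        total = sum(token_counts[label].values())
        denom = total + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            # Smoothing gives tokens never seen in a class a small nonzero probability.
            "log_lik": {
                tok: math.log((token_counts[label][tok] + alpha) / denom)
                for tok in vocab
            },
        }
    return model


def predict_nb(model, doc):
    """Return the class with the highest log posterior; unknown tokens are skipped."""
    scores = {}
    for label, params in model.items():
        scores[label] = params["log_prior"] + sum(
            params["log_lik"][tok] for tok in doc if tok in params["log_lik"]
        )
    return max(scores, key=scores.get)


docs = [["cheap", "meds"], ["cheap", "offer"], ["meeting", "notes"], ["project", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)

print(predict_nb(model, ["cheap", "cheap", "offer"]))
print(predict_nb(model, ["meeting", "notes", "attached"]))
```

Working in log space turns the product of many small probabilities into a sum, which avoids numerical underflow on long emails.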

**14) PRINT PREDICTIONS AND ACTUAL VALUES FOR TRAINING DATASET**

In [104]:
#print predictions
print(classifier.predict(X_train))
#print actual values
print(y_train.values)

['spam' 'ham' 'spam' ... 'ham' 'ham' 'spam']
['spam' 'ham' 'spam' ... 'ham' 'ham' 'spam']


**15) EVALUATE THE MODEL ON THE TRAINING DATASET**

In [105]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
prediction = classifier.predict(X_train)
class_repo = classification_report(y_train, prediction)
confus_mat = confusion_matrix(y_train, prediction)
acc_sc = accuracy_score(y_train, prediction)
print("----------------------")
print("CLASSIFICATION REPORT: \n---------------------- \n\n", class_repo)
print()
print("----------------------")
print("CONFUSION MATRIX: \n---------------------- \n\n", confus_mat)
print()
print("----------------------")
print("ACCURACY SCORE: \n---------------------- \n\n", acc_sc)

----------------------
CLASSIFICATION REPORT: 
---------------------- 

               precision    recall  f1-score   support

         ham       1.00      0.99      0.99      2809
        spam       0.98      0.99      0.99      1185

    accuracy                           0.99      3994
   macro avg       0.99      0.99      0.99      3994
weighted avg       0.99      0.99      0.99      3994


----------------------
CONFUSION MATRIX: 
---------------------- 

 [[2787   22]
 [  13 1172]]

----------------------
ACCURACY SCORE: 
---------------------- 

 0.9912368552829244


**TRAIN ACCURACY: about 99% (0.9912). Precision and recall are also at least 0.98 for both classes.**
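Accuracy, precision, and recall are different quantities, and all three can be recomputed by hand from the training confusion matrix. The sketch below treats spam as the positive class and relies on the scikit-learn convention that rows are true labels and columns are predicted labels, here in the order ham, spam.

```python
# Training confusion matrix from the output above.
# Rows are true labels, columns are predicted labels, in the order (ham, spam).
matrix = [[2787, 22],
          [13, 1172]]

# Treating spam as the positive class.
(tn, fp), (fn, tp) = matrix

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all emails classified correctly
precision = tp / (tp + fp)                   # of the emails flagged as spam, how many were spam
recall = tp / (tp + fn)                      # of the actual spam emails, how many were flagged

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
```

These values agree with the spam row of the classification report once rounded to two decimals.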

**16) PRINT PREDICTIONS AND ACTUAL VALUES FOR TEST DATASET**

In [106]:
#print predictions
print(classifier.predict(X_test))
#print actual values
print(y_test.values)

['ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'spam' 'ham' 'ham' 'ham' 'ham' 'ham'
 'spam' 'ham' 'ham' 'ham' 'ham' 'spam' 'ham' 'ham' 'spam' 'ham' 'ham'
 'spam' 'spam' 'ham' 'ham' 'ham' 'ham' 'spam' 'ham' 'spam' 'spam' 'spam'
 'ham' 'spam' 'spam' 'ham' 'ham' 'ham' 'ham' 'spam' 'ham' 'spam' 'ham'
 'spam' 'ham' 'ham' 'ham' 'spam' 'spam' 'ham' 'ham' 'ham' 'ham' 'ham'
 'spam' 'ham' 'ham' 'spam' 'ham' 'ham' 'spam' 'ham' 'ham' 'ham' 'spam'
 'spam' 'ham' 'ham' 'spam' 'spam' 'spam' 'ham' 'ham' 'ham' 'ham' 'ham'
 'ham' 'spam' 'ham' 'ham' 'ham' 'ham' 'spam' 'ham' 'ham' 'ham' 'spam'
 'ham' 'ham' 'ham' 'ham' 'ham' 'spam' 'ham' 'ham' 'spam' 'ham' 'ham' 'ham'
 'spam' 'spam' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'spam' 'ham' 'ham'
 'ham' 'spam' 'ham' 'ham' 'spam' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham' 'ham'
 'spam' 'ham' 'ham' 'spam' 'spam' 'ham' 'ham' 'spam' 'spam' 'ham' 'ham'
 'spam' 'ham' 'ham' 'spam' 'ham' 'spam' 'spam' 'ham' 'spam' 'ham' 'spam'
 'ham' 'ham' 'ham' 'ham' 'spam' 'ham' 'spam' 'ham' 'spam' 'spam' 

**17) EVALUATE THE MODEL ON THE TEST DATASET**

In [107]:
prediction = classifier.predict(X_test)
class_repo = classification_report(y_test, prediction)
confus_mat = confusion_matrix(y_test, prediction)
acc_sc = accuracy_score(y_test, prediction)
print("----------------------")
print("CLASSIFICATION REPORT: \n---------------------- \n\n", class_repo)
print()
print("----------------------")
print("CONFUSION MATRIX: \n---------------------- \n\n", confus_mat)
print()
print("----------------------")
print("ACCURACY SCORE: \n---------------------- \n\n", acc_sc)

----------------------
CLASSIFICATION REPORT: 
---------------------- 

               precision    recall  f1-score   support

         ham       0.98      0.98      0.98       722
        spam       0.95      0.96      0.96       277

    accuracy                           0.98       999
   macro avg       0.97      0.97      0.97       999
weighted avg       0.98      0.98      0.98       999


----------------------
CONFUSION MATRIX: 
---------------------- 

 [[709  13]
 [ 11 266]]

----------------------
ACCURACY SCORE: 
---------------------- 

 0.975975975975976


**TEST ACCURACY: about 98% (0.976). Spam precision is 0.95 and spam recall is 0.96, slightly below the training figures.**
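For a spam filter the two kinds of error have different costs, so it is worth separating them. The sketch below takes the test confusion matrix (rows true, columns predicted, order ham, spam) and computes the share of ham wrongly flagged as spam, the share of spam that was missed, and the gap to the training accuracy printed earlier. The variable names are illustrative.

```python
# Test confusion matrix from the output above.
# Rows are true labels, columns are predicted labels, in the order (ham, spam).
matrix = [[709, 13],
          [11, 266]]

ham_total = sum(matrix[0])
spam_total = sum(matrix[1])

false_positive_rate = matrix[0][1] / ham_total    # ham wrongly flagged as spam
false_negative_rate = matrix[1][0] / spam_total   # spam that slipped through

# Training accuracy as printed in the training evaluation.
train_accuracy = 0.9912368552829244
test_accuracy = (matrix[0][0] + matrix[1][1]) / (ham_total + spam_total)
gap = train_accuracy - test_accuracy

print(f"ham flagged as spam: {false_positive_rate:.1%}")
print(f"spam missed:         {false_negative_rate:.1%}")
print(f"train/test gap:      {gap:.3f}")
```

A gap of roughly 1.5 percentage points between training and test accuracy suggests only mild overfitting on this split.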