# SPAM E-MAIL DETECTOR

**Email spam detection is a classification problem. Algorithms such as the Naive Bayes classifier and decision trees work well 
for it. KNN tends to struggle because bag-of-words features are high-dimensional (the curse of dimensionality), and 
linear regression is a poor fit because it predicts continuous values rather than class labels.**

# Approach - Naive Bayes Classifier

**Naive Bayes is one of the simplest classification algorithms: it is fast to build and regularly used for spam detection. 
It is a popular baseline method for text categorization, the problem of judging documents as
belonging to one category or another (such as spam or legitimate, sports or politics), 
with word frequencies as the features.**

**Why use Naive Bayes?**

NB is very simple, easy to implement and fast because essentially you’re just doing a bunch of counts.

If the NB conditional independence assumption holds, it converges more quickly than 
discriminative models like logistic regression.

NB works well even with relatively little training data.

NB is highly scalable. It scales linearly with the number of predictors and data points.

NB can be used for both binary and multi-class classification problems and handles continuous and discrete data [2].

NB is not sensitive to irrelevant features.
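
To make the "just doing a bunch of counts" point concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace smoothing. It is not the scikit-learn implementation used later in this notebook; the function names (`train_nb`, `predict_nb`) and the toy messages are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect the counts that multinomial Naive Bayes needs."""
    class_counts = Counter(labels)       # number of documents per class
    word_counts = defaultdict(Counter)   # word frequencies per class
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(words, class_counts, word_counts, vocab, alpha=1.0):
    """Return the class with the highest Laplace-smoothed log posterior."""
    n_docs = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n_label / n_docs)   # log prior
        for w in words:
            if w not in vocab:               # skip words never seen in training
                continue
            score += math.log((word_counts[label][w] + alpha)
                              / (total + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy, made-up training messages (already tokenized)
docs = [["free", "entry", "win"], ["win", "cash", "free"],
        ["ok", "see", "you"], ["see", "you", "later"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)

print(predict_nb(["free", "cash"], *model))  # spam
print(predict_nb(["see", "you"], *model))    # ham
```

Training is a single pass that updates counters, which is why the cost grows linearly with the number of messages and words.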

**Step 1 = IMPORTING LIBRARIES**

In [32]:
import numpy as np
import pandas as pd
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import string

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Step 2 = LOADING DATASET**

In [33]:
data = pd.read_csv("spam.csv")
data.head()

| | Label | EmailText |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [34]:
#check number of rows & columns in our dataset
data.shape

(5572, 2)

In [35]:
# checking the columns in our dataset
data.columns

Index(['Label', 'EmailText'], dtype='object')

In [36]:
# drop all duplicate rows from the dataset
data.drop_duplicates(inplace=True)

In [37]:
# after dropping the duplicates, check the number of rows & columns again
data.shape

(5169, 2)

In [38]:
# checking for NAN values in the dataset
data.isnull().sum()

Label        0
EmailText    0
dtype: int64

**This function removes punctuation and stopwords and returns the remaining words as a list**

In [39]:
# build the stopword set once; calling stopwords.words() for every word is slow
STOP_WORDS = set(stopwords.words('english'))

def process_text(text):
    # drop every punctuation character and rejoin the rest into one string
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    # keep the words that are not English stopwords (case-insensitive check)
    return [word for word in nopunc.split() if word.lower() not in STOP_WORDS]
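
To see what this cleaning step does without downloading the NLTK corpus, here is a minimal sketch of the same idea. The name `clean` and the tiny `DEMO_STOP_WORDS` set are assumptions for illustration only; the notebook itself uses the full NLTK English stopword list.

```python
import string

# Assumption: a small hand-picked subset of stop words, used only so this
# sketch runs without the NLTK download.
DEMO_STOP_WORDS = {"i", "he", "to", "the", "a", "in", "so", "then"}

def clean(text, stop_words=DEMO_STOP_WORDS):
    # remove punctuation characters (this turns "don't" into "dont")
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    # split on whitespace and drop stop words, ignoring case
    return [w for w in no_punct.split() if w.lower() not in stop_words]

print(clean("Nah I don't think he goes to usf"))
# ['Nah', 'dont', 'think', 'goes', 'usf']
print(clean("Ok lar... Joking wif u oni..."))
# ['Ok', 'lar', 'Joking', 'wif', 'u', 'oni']
```

Note that the original casing of the kept words is preserved; only the stopword comparison is lowercased.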

**Tokenization (each message becomes a list of tokens)**

In [41]:
# show tokenization: each message becomes a list of cleaned tokens (no lemmatization is applied)
data['EmailText'].head().apply(process_text)

0    [Go, jurong, point, crazy, Available, bugis, n...
1                       [Ok, lar, Joking, wif, u, oni]
2    [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3        [U, dun, say, early, hor, U, c, already, say]
4    [Nah, dont, think, goes, usf, lives, around, t...
Name: EmailText, dtype: object

In [44]:
#example
message4 = "hello world hello world test"
message5 = "play stop play stop"
print(message4)
print()

# convert the text to a matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer
# pass a flat list of strings, one string per document
a = CountVectorizer(analyzer=process_text).fit_transform([message4, message5])
print(a)
print()
print(a.shape)

hello world hello world test

  (0, 0)	2
  (0, 4)	2
  (0, 3)	1
  (1, 1)	2
  (1, 2)	2

(2, 5)
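
The sparse output above lists `(row, column) count` entries. A minimal pure-Python sketch of the same bag-of-words counting makes the mapping explicit. It assumes whitespace tokenization and an alphabetically sorted vocabulary, which is consistent with the column indices printed above; the variable names are illustrative.

```python
from collections import Counter

messages = ["hello world hello world test", "play stop play stop"]
tokenized = [m.split() for m in messages]

# one column per distinct token, in alphabetical order
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# one row per message, holding the count of each vocabulary term
matrix = [[Counter(toks)[term] for term in vocabulary] for toks in tokenized]

print(vocabulary)  # ['hello', 'play', 'stop', 'test', 'world']
print(matrix)      # [[2, 0, 0, 1, 2], [0, 2, 2, 0, 0]]
```

Row 0 has non-zero counts in columns 0, 3 and 4 ("hello", "test", "world"), and row 1 in columns 1 and 2 ("play", "stop"), matching the five entries and the `(2, 5)` shape shown above.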


In [45]:
# convert the whole collection of messages into a matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer
message_bag = CountVectorizer(analyzer=process_text).fit_transform(data['EmailText'])

**Split the data into 80% training & 20% testing**

In [47]:
#split the data into 80% training & 20% testing
from sklearn.model_selection import train_test_split  
x_train, x_test, y_train, y_test = train_test_split(message_bag,data['Label'],test_size=0.20,random_state=0) 

In [48]:
message_bag.shape

(5169, 11304)
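
As a quick sanity check on the 80/20 split, the expected subset sizes can be worked out from the 5169 rows. This sketch assumes the test share is rounded up to a whole number of rows, which is consistent with the support totals (4135 and 1034) in the evaluation reports further down.

```python
import math

n_samples = 5169   # rows left after dropping duplicates
test_size = 0.20

n_test = math.ceil(test_size * n_samples)  # assumption: test share rounded up
n_train = n_samples - n_test

print(n_train, n_test)  # 4135 1034
```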

**Create & train the Naive Bayes Classifier**

In [49]:
# create & train the Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(x_train,y_train)

In [62]:
# print the predictions
print(classifier.predict(x_train))
print()
# print the actual values
print(y_train.values)

['ham' 'ham' 'ham' ... 'ham' 'ham' 'ham']

['ham' 'ham' 'ham' ... 'ham' 'ham' 'ham']


**Evaluate the model on training dataset**

In [64]:

# per-class precision/recall/F1, confusion matrix and accuracy on the training data
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
pred = classifier.predict(x_train)
print(classification_report(y_train,pred))
print()
print("Confusion matrix : \n",confusion_matrix(y_train,pred))
print()
print("Accuracy:\n",accuracy_score(y_train,pred))

              precision    recall  f1-score   support

         ham       1.00      1.00      1.00      3631
        spam       0.98      0.98      0.98       504

    accuracy                           1.00      4135
   macro avg       0.99      0.99      0.99      4135
weighted avg       1.00      1.00      1.00      4135


Confusion matrix : 
 [[3623    8]
 [  11  493]]

Accuracy:
 0.9954050785973397


In [60]:
print(classifier.predict(x_test))
print()
print(y_test.values)

['ham' 'ham' 'ham' ... 'ham' 'ham' 'ham']

['ham' 'ham' 'ham' ... 'ham' 'ham' 'ham']


**Evaluate the model on testing dataset**

In [61]:

# per-class precision/recall/F1, confusion matrix and accuracy on the held-out test data
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
pred = classifier.predict(x_test)
print(classification_report(y_test,pred))
print()
print("Confusion matrix : \n",confusion_matrix(y_test,pred))
print()
print("Accuracy:\n",accuracy_score(y_test,pred))

              precision    recall  f1-score   support

         ham       0.99      0.96      0.97       885
        spam       0.80      0.93      0.86       149

    accuracy                           0.96      1034
   macro avg       0.89      0.94      0.92      1034
weighted avg       0.96      0.96      0.96      1034


Confusion matrix : 
 [[850  35]
 [ 11 138]]

Accuracy:
 0.9555125725338491
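
The figures in the test report can be re-derived by hand from the confusion matrix, which helps when reading it. A minimal sketch, assuming rows are the actual labels and columns the predicted labels, both in the order (ham, spam), and treating spam as the positive class:

```python
# Test-set confusion matrix copied from the output above
cm = [[850, 35],
      [11, 138]]
(tn, fp), (fn, tp) = cm   # spam is the positive class

accuracy = (tp + tn) / (tp + tn + fp + fn)
spam_precision = tp / (tp + fp)   # share of flagged messages that are spam
spam_recall = tp / (tp + fn)      # share of spam messages that were caught
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(round(accuracy, 4))        # 0.9555
print(round(spam_precision, 2))  # 0.8
print(round(spam_recall, 2))     # 0.93
print(round(spam_f1, 2))         # 0.86
```

The 35 ham messages flagged as spam are what pull spam precision down to 0.80, even though overall accuracy stays near 0.96.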


# CONCLUSION

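**The accuracy of my email spam detection model is 99.5% on the training data and 95.5% on the testing data. On the test set, spam precision is 0.80 and spam recall is 0.93, so most of the errors are legitimate messages flagged as spam (35 of 885).**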