# This program detects if an email is spam (1) or not (0)

**Import libraries**

In [11]:
import numpy as np
import pandas as pd
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import string
from nltk.stem.porter import PorterStemmer
import re

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\elsaw\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


**Load the data**

In [12]:
df = pd.read_csv(r'C:\Users\elsaw\OneDrive\Bureau\Machine Learning\Support Vector Machines\spam_ham_dataset.csv')

In [13]:
df.head()

Unnamed: 0.1,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0


**Data Cleaning and Preprocessing**

In [14]:
ps = PorterStemmer()

In [15]:
corpus = []
# Build the stopword set once; calling stopwords.words('english') for every word is very slow
stop_words = set(stopwords.words('english'))
for text in df['text']:
    # Replace every character that is not a letter (a-z, A-Z) with a space
    review = re.sub('[^a-zA-Z]', ' ', text)
    review = review.lower()
    review = review.split()
    # Drop stopwords ('the', 'is', 'are', ...) and stem the remaining words ('going' -> 'go')
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

Here is the first email before cleaning:

In [19]:
df['text'][0]

"Subject: enron methanol ; meter # : 988291\r\nthis is a follow up to the note i gave you on monday , 4 / 3 / 00 { preliminary\r\nflow data provided by daren } .\r\nplease override pop ' s daily volume { presently zero } to reflect daily\r\nactivity you can obtain from gas control .\r\nthis change is needed asap for economics purposes ."

And here is the same email after cleaning, stopword removal, and stemming:

In [21]:
corpus[0]

'subject enron methanol meter follow note gave monday preliminari flow data provid daren pleas overrid pop daili volum present zero reflect daili activ obtain ga control chang need asap econom purpos'
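
To make the cleaning steps easier to follow in isolation, here is a minimal, dependency-free sketch of the regex and stopword steps. `clean_text` and `TOY_STOPWORDS` are hypothetical names used only for illustration: the notebook itself uses NLTK's full English stopword list, and the stemming step is left out here.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# the notebook uses NLTK's full English list instead.
TOY_STOPWORDS = frozenset({"this", "is", "a", "to", "the", "i", "you", "on", "up"})

def clean_text(text, stop_words=TOY_STOPWORDS):
    # Replace everything that is not a letter with a space
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercase and split on whitespace
    tokens = letters_only.lower().split()
    # Drop stopwords and join the remaining tokens back into one string
    return " ".join(t for t in tokens if t not in stop_words)

sample = "Subject: enron methanol ; meter # : 988291\r\nthis is a follow up to the note"
cleaned = clean_text(sample)
print(cleaned)  # subject enron methanol meter follow note
```

Digits, punctuation, and line breaks all become spaces, which is why the meter number disappears from the cleaned text.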

**Creating the Bag of words model**

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

In [28]:
# Keep only the 5000 most frequent words as features
cv = CountVectorizer(max_features=5000)

In [29]:
X = cv.fit_transform(corpus).toarray()
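
As a rough illustration of what a bag of words with a feature cap looks like, here is a simplified plain-Python sketch. `bag_of_words` is a hypothetical helper, not scikit-learn's implementation: it splits on whitespace only, whereas `CountVectorizer` has its own tokenization rules and may break frequency ties differently.

```python
from collections import Counter

def bag_of_words(docs, max_features):
    # Total count of each term across all documents
    totals = Counter(token for doc in docs for token in doc.split())
    # Keep the max_features most frequent terms, then order them alphabetically
    vocab = sorted(term for term, _ in totals.most_common(max_features))
    # One row per document, one column per vocabulary term
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

toy_docs = ["cheap offic cheap window", "meter flow data", "cheap meter data data"]
toy_vocab, toy_X = bag_of_words(toy_docs, max_features=3)
print(toy_vocab)  # ['cheap', 'data', 'meter']
print(toy_X)      # [[2, 0, 0], [0, 1, 1], [1, 2, 1]]
```

Each row is one email and each column counts how often one vocabulary word occurs in it; words outside the top `max_features` are simply ignored.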

In [30]:
y = pd.get_dummies(df['label'])

In [32]:
y.head()

Unnamed: 0,ham,spam
0,1,0
1,1,0
2,1,0
3,0,1
4,1,0


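The two columns are complementary, so the spam column alone carries all the information (1 = spam, 0 = ham). We keep only that column as the target.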

In [33]:
y = y.iloc[:,1].values

**Train Test Split**

In [36]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

**Training model using Naive Bayes classifier**

In [37]:
from sklearn.naive_bayes import MultinomialNB

In [38]:
spam_detect_model = MultinomialNB().fit(X_train,y_train)

In [40]:
y_pred = spam_detect_model.predict(X_test)
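
For intuition about what the classifier computes, here is a minimal plain-Python sketch of multinomial Naive Bayes with Laplace smoothing (`alpha=1.0`, which is also the `MultinomialNB` default). The function names and the toy data are hypothetical, and this is a teaching sketch under those assumptions rather than scikit-learn's actual code.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        # Class prior: fraction of training documents in class c
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each word over all documents of class c
        feature_totals = [sum(col) for col in zip(*rows)]
        # Smoothed word probabilities, so unseen words never get probability 0
        denom = sum(feature_totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in feature_totals]
    return classes, log_prior, log_likelihood

def predict_multinomial_nb(model, x):
    classes, log_prior, log_likelihood = model
    def score(c):
        # Log prior plus the count-weighted sum of log word probabilities
        return log_prior[c] + sum(n * w for n, w in zip(x, log_likelihood[c]))
    return max(classes, key=score)

# Toy count matrix; columns are counts of ("cheap", "meter", "offer")
X_toy = [[2, 0, 1], [3, 0, 2], [0, 2, 0], [0, 3, 1]]
y_toy = [1, 1, 0, 0]  # 1 = spam, 0 = ham
model = fit_multinomial_nb(X_toy, y_toy)
pred_spam = predict_multinomial_nb(model, [1, 0, 1])
pred_ham = predict_multinomial_nb(model, [0, 2, 0])
print(pred_spam, pred_ham)  # 1 0
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.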

**Comparing prediction and truth**

In [42]:
from sklearn.metrics import confusion_matrix

In [49]:
A = confusion_matrix(y_test, y_pred)
print(A)

[[701  31]
 [ 17 286]]


Rows are the true labels and columns are the predicted labels, with ham (0) first and spam (1) second.<br>
701 ham mails were correctly labeled as ham.<br>
31 ham mails were wrongly labeled as spam.<br>
17 spam mails were wrongly labeled as ham.<br>
286 spam mails were correctly labeled as spam.
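
scikit-learn's `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns, in sorted label order. With ham = 0 and spam = 1, the matrix therefore reads `[[TN, FP], [FN, TP]]` for the spam class. As a small sketch, precision and recall for spam can be derived from the four printed counts in plain Python:

```python
# Counts copied from the printed matrix; rows = true label, columns = predicted label
matrix = [[701, 31],
          [17, 286]]
(tn, fp), (fn, tp) = matrix

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of mails predicted as spam that really are spam
recall = tp / (tp + fn)     # share of real spam mails that the model caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

Precision and recall are worth checking alongside accuracy here because the test set is imbalanced: ham mails outnumber spam mails by more than two to one.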

**Accuracy**

In [50]:
from sklearn.metrics import accuracy_score

In [51]:
accuracy = accuracy_score(y_test,y_pred)

In [52]:
accuracy

0.9536231884057971

The accuracy is about 95.4%: 987 of the 1035 test emails (701 + 286) are classified correctly.