<a href="https://colab.research.google.com/github/weammoghazy/MLEmailSpamFilter/blob/master/SpamFilter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Imports 

We will import some libraries (pandas, numpy, io, sklearn).


In [0]:
import pandas as pd
import numpy as np
import io
import random
from google.colab import files
import sklearn
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

## Load the data 


###Load the data from a file (don't run this if you don't have the file on your local drive)

In [0]:
# Upload the csv file from your local drive
uploaded = files.upload()

# Load in the data from a CSV file that is a tab (\t) separated.
# put the filename instead of 'SMSSpamCollection' if you are using this method to load a file
df = pd.read_csv(io.StringIO(uploaded['SMSSpamCollection'].decode('utf-8')),sep='\\t')

### OR Load the data from a URL (RUN THIS CODE)

In [0]:
# load the file from github repository
url = "https://raw.githubusercontent.com/weammoghazy/MLEmailSpamFilter/master/emails.csv"

# use unicode to decode the file and a comma separator to form the data frame
df = pd.read_csv(url, encoding='utf-8',sep=',')

##Randomize the data

In [150]:
# We'll then randomize the data, just to be sure not to get any pathological
# ordering effects that might harm the performance of the model.
df = df.reindex(np.random.permutation(df.index))

# view the first 10 rows of the data, just to check its structure 
df.head(10)

Unnamed: 0,text,spam
5382,Subject: re : christmas basket list here is t...,0
157,Subject: check this impotence medication don ...,1
4791,"Subject: re : address shirley , they are fro...",0
464,"Subject: v and more hello , welcome to the me...",1
3475,"Subject: updated org chart vince , i updated...",0
2595,"Subject: preface for book vince , ? hope yo...",0
4216,"Subject: re : presentation dawn , i met davi...",0
1551,Subject: re : implementing term - structure of...,0
683,Subject: re : money issues ygr repair your cr...,1
1175,Subject: your membership community charset = i...,1


## Feature Extraction

In [151]:
# extract features with CountVectorizer
feature = df["text"]

# import a list of stopwords that will be removed from th4e feature list
my_stop_words = text.ENGLISH_STOP_WORDS.union("Subject")

# Use a TFIDFVectorizer that considers the more frequent words less significant 
vectorizer = TfidfVectorizer(lowercase=True, stop_words=my_stop_words)
X = vectorizer.fit_transform(feature).toarray()

# print the 2D matrix [words, count]
print (X.shape)

# View 20 features of the list - just to get an idea of what it looks like
vectorizer.get_feature_names()[:20]

(5728, 36996)


['00',
 '000',
 '0000',
 '000000',
 '00000000',
 '0000000000',
 '000000000003619',
 '000000000003991',
 '000000000003997',
 '000000000005168',
 '000000000005409',
 '000000000005411',
 '000000000005412',
 '000000000005413',
 '000000000005820',
 '000000000006238',
 '000000000006452',
 '000000000007494',
 '000000000007498',
 '000000000007876']

In [0]:
## label encoding: map the label (spam, ham) to (0,1) - this part is only needed if the emails were marked "spam" and "ham"

# categories = df["type"].unique()
# category_dict = {value:index for index, value in enumerate(categories)}
# Y = df["type"].map(category_dict)
# category_dict

# now, get the class [spam/ham]
Y = df["spam"]

### Naive Bayes Classifier

"Naive Bayes is the most straightforward and fast classification algorithm, which is suitable for a large chunk of data. Naive Bayes classifier is successfully used in various applications such as spam filtering, text classification, sentiment analysis, and recommender systems. It uses Bayes theorem of probability for prediction of unknown class"

Split the data into training and testing sets with 8:2 ratio

In [0]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state=1) # 80% training and 20% testing 

Use a multinomial naive bayes classifier, which will eventually simplify into a binary naive base classifier (as we only have two classes[spam/ham])

In [154]:
# create a model
model = MultinomialNB()
# use the training set to train the model
model.fit(x_train,y_train)
# use the testing set to test the model
model.score(x_test,y_test)

0.9153577661431065

In [155]:
# You can compare the true y values with the predicted values 
print("true:", y_test[:20])
print("predicted:", model.predict(x_test)[:20])

true: 772     1
735     1
5636    0
913     1
979     1
2997    0
5572    0
2252    0
552     1
561     1
4194    0
2646    0
1327    1
468     1
136     1
2608    0
3451    0
2250    0
3571    0
1328    1
Name: spam, dtype: int64
predicted: [0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 0 0 0 0 1]


## Classification Report
Calculate how many items are spam from your test dataset.
Calculate accuracy.


In [156]:
  y_true = y_test
y_pred = model.predict(x_test)
target_names = ['ham', 'spam']
print(classification_report(y_true, y_pred, target_names=target_names))


              precision    recall  f1-score   support

         ham       0.90      1.00      0.95       869
        spam       1.00      0.65      0.79       277

    accuracy                           0.92      1146
   macro avg       0.95      0.82      0.87      1146
weighted avg       0.92      0.92      0.91      1146



From the classification report above, we can see that out of 1146 emails, 276 are spam. Also, the accuracy of predicting a spam email is 1, which is very important for a spam filter (we wouldn''t wanna lose a ham email to the spam folder but the opposite is okay).

## Manual Testing
Select an email from the test dataset and predict if it is spam or ham.


In [157]:
# select a random email from the test dataset
test = x_test[(random.randint(0,len(x_test)))]
# Reshape the test smaple using array.reshape(1, -1) as it contains a single sample.
predict_test = model.predict(test.reshape(1,-1))
# the test result is spam for '1' and ham for '0' 
result = "ham" if predict_test[0] == 0 else "spam"
result #print the result

'ham'