# Spam Detector

This program uses a multinomial naive Bayes algorithm to determine whether an email is spam or ham (not spam). Training teaches the model how strongly each word is associated with ham or with spam, purely in terms of probability. Given a test email, the model calculates the probability that the email is spam and the probability that it is ham independently, and classifies the email as whichever has the higher probability.
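
To make the decision rule concrete, here is a minimal sketch in plain Python. The priors and word likelihoods are made-up toy numbers (they are not taken from the dataset), and `log_score` and `classify` are hypothetical helper names used only for illustration.

```python
import math

# Hypothetical toy numbers, not taken from the dataset.
# priors: P(class); likelihoods: P(word | class), summing to 1 within each class.
priors = {"ham": 0.6, "spam": 0.4}
likelihoods = {
    "ham":  {"meeting": 0.5, "money": 0.1, "free": 0.4},
    "spam": {"meeting": 0.1, "money": 0.5, "free": 0.4},
}

def log_score(word_counts, label):
    """Unnormalized log posterior: log P(class) + sum(count * log P(word | class))."""
    score = math.log(priors[label])
    for word, count in word_counts.items():
        score += count * math.log(likelihoods[label][word])
    return score

def classify(word_counts):
    """Score each class independently and return the label with the higher score."""
    return max(priors, key=lambda label: log_score(word_counts, label))

print(classify({"money": 3, "free": 1}))    # spam
print(classify({"meeting": 2, "free": 1}))  # ham
```

Working in log space avoids multiplying many small probabilities together, which is why the scores are sums rather than products.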

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

## Basic Analysis

Let's read in the dataset and see what it's all about.

In [2]:
df = pd.read_csv('/Users/zacrossman/Downloads/archive (3)/spam:ham.csv')
df.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,ham,Id
0,0.0,14.28,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.8,5,9,True,1947
1,0.0,0.0,1.0,0.0,0.5,0.0,0.0,0.0,0.0,0.5,...,0.357,0.0,0.892,0.0,0.0,2.0,19,172,False,2159
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.29,0.0,0.43,...,0.124,0.0,0.31,0.062,0.0,1.477,8,65,False,4223
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.444,0.0,0.0,2.8,7,28,True,2624
4,0.0,0.0,0.0,0.0,1.17,0.0,0.0,0.0,0.0,1.17,...,0.0,0.0,0.0,0.0,0.0,1.551,10,45,True,2743


As we can see from the first few rows here, this dataset is mostly composed of the frequencies of certain words. Each row represents an email, and each word_freq column gives the percentage of that email's words made up by the given word. The dataset has character frequencies as well. We can also see in the second-to-last column (ham) whether that particular email was ham or spam.

In [3]:
print(df.shape)
df.info()

(3680, 59)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3680 entries, 0 to 3679
Data columns (total 59 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   word_freq_make              3680 non-null   float64
 1   word_freq_address           3680 non-null   float64
 2   word_freq_all               3680 non-null   float64
 3   word_freq_3d                3680 non-null   float64
 4   word_freq_our               3680 non-null   float64
 5   word_freq_over              3680 non-null   float64
 6   word_freq_remove            3680 non-null   float64
 7   word_freq_internet          3680 non-null   float64
 8   word_freq_order             3680 non-null   float64
 9   word_freq_mail              3680 non-null   float64
 10  word_freq_receive           3680 non-null   float64
 11  word_freq_will              3680 non-null   float64
 12  word_freq_people            3680 non-null   float64
 13  word_freq_report      

We can see we have 3680 emails to work with. The 'capital_run' columns will not be of any use to us, so we will leave them out of the model. The Id column is not useful either, so we will drop it now. The rest of our columns all contain 3680 non-null values, which is perfect because that is how many emails we have, so we don't have to worry about any null values messing with our model.

In [4]:
df = df.drop(['Id'], axis = 1)
df.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,ham
0,0.0,14.28,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.8,5,9,True
1,0.0,0.0,1.0,0.0,0.5,0.0,0.0,0.0,0.0,0.5,...,0.0,0.357,0.0,0.892,0.0,0.0,2.0,19,172,False
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.29,0.0,0.43,...,0.0,0.124,0.0,0.31,0.062,0.0,1.477,8,65,False
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.444,0.0,0.0,2.8,7,28,True
4,0.0,0.0,0.0,0.0,1.17,0.0,0.0,0.0,0.0,1.17,...,0.0,0.0,0.0,0.0,0.0,0.0,1.551,10,45,True


Instead of deleting all the other columns we don't want, we'll just select the columns we do want when we convert our features and label into arrays, which we'll do next.

In [5]:
X = np.array(df.loc[:, 'word_freq_make': 'char_freq_#'])
y = np.array(df['ham'])

Next, we'll split our data into training and testing subsets.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state=0)
print('Training data shape:',X_train.shape, y_train.shape)
print()
print('Test data shape:', X_test.shape, y_test.shape)

Training data shape: (2944, 54) (2944,)

Test data shape: (736, 54) (736,)


We set our test size to 0.2 (20% of the data), so we'll use 2944 emails to train our model, and we'll test our model on the remaining 736 emails.
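
As a rough illustration of what a shuffled 80/20 split does, the sketch below shuffles row indices and holds out a fraction of them. This is a simplified stand-in, not scikit-learn's actual implementation, and `simple_split` is a hypothetical name.

```python
import random

def simple_split(n_samples, test_size=0.2, seed=0):
    """Shuffle row indices and hold out a test_size fraction of them."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # a fixed seed keeps the split reproducible
    n_test = round(n_samples * test_size)     # number of held-out rows
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_split(3680)
print(len(train_idx), len(test_idx))  # 2944 736
```

Fixing the seed plays the same role as `random_state=0` in the cell above: rerunning the notebook gives the same split.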

## Training and Implementing our Model

In [7]:
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

MultinomialNB()

In [8]:
y_pred = classifier.predict(X_test)

Now let's evaluate our model.

In [9]:
print('Accuracy:', classifier.score(X_test, y_test))
print(y_test[:10])
cm = confusion_matrix(y_test, y_pred)
print(cm)

Accuracy: 0.8559782608695652
[ True  True False  True  True False  True  True  True  True]
[[256  18]
 [ 88 374]]
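
The confusion matrix above says more than the single accuracy figure. A short sketch of how to read it, assuming scikit-learn's sorted label order of [False, True], so that row and column 0 are spam (ham == False) and row and column 1 are ham (ham == True):

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels, ordered [spam, ham].
cm = [[256, 18],
      [88, 374]]

spam_as_spam, spam_as_ham = cm[0]
ham_as_spam, ham_as_ham = cm[1]
total = spam_as_spam + spam_as_ham + ham_as_spam + ham_as_ham

accuracy = (spam_as_spam + ham_as_ham) / total
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)     # share of spam that was caught
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)  # share of spam flags that were right
ham_recall = ham_as_ham / (ham_as_spam + ham_as_ham)          # share of ham that was kept

print(f"accuracy:       {accuracy:.3f}")
print(f"spam recall:    {spam_recall:.3f}")
print(f"spam precision: {spam_precision:.3f}")
print(f"ham recall:     {ham_recall:.3f}")
```

Read this way, the model catches about 93% of spam (256 of 274), but it also flags 88 of the 462 ham emails (about 19%) as spam, which is the costlier kind of mistake for a mail filter.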


In [10]:
# y_test is a boolean array where True means ham, so summing it counts the ham emails
ham = int(np.sum(y_test))
spam = len(y_test) - ham

print('Total Ham:', ham)
print('Total Spam:', spam)

Total Ham: 462
Total Spam: 274


Our model classifies emails with around 86% accuracy, which is quite solid for such a short and quick model. Now let's write a function that takes an email as input and outputs whether it is spam or ham. To do this, we first have to format the email in a way that our model is familiar with.

In [11]:
def formator(email):
    a_list = [' ', '3','0', '6', '5', '8', '7', '4', '1', '9', 'q', 'w', 'e', 'r', 't', 'y', 'u', 'i', 'o', 'p', 'a',
              's', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'z', 'x','c', 'v', 'b', 'n', 'm', 'Q', 'W', 'E', 'R', 'T', 'Y',
              'U', 'I', 'O', 'P', 'A', 'S', 'D', 'F', 'G', 'H', 'J', 'K', 'L', 'Z', 'X', 'C', 'V', 'B', 'N', 'M', ';',
              '(', '[','!', '$', '#']

    #Deleting every character that is not in a_list (including most punctuation) so we can more easily split the
    #email into words
    for character in email:
        if character not in a_list:
            email = email.replace(character, '')

    #Splitting our email into a list, where each element in the list is a word
    email = email.split(' ') 

    #Counting the frequency of each word in the email
    word_count = {}
    for word in email:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1

    #These are the words and characters that are relevant for our model. They must follow the order of the
    #training columns (54 entries)
    relevant_words = ['make', 'address', 'all', '3d', 'our', 'over', 'remove', 'internet', 'order', 'mail', 'receive',
                      'will', 'people', 'report', 'addresses', 'free', 'business', 'email', 'you', 'credit', 'your',
                      'font', '000', 'money', 'hp', 'hpl', 'george', '650', 'lab', 'labs', 'telnet', '857', 'data',
                      '415', '85', 'technology', '1999', 'parts', 'pm', 'direct', 'cs', 'meeting', 'original', 'project',
                      're', 'edu', 'table', 'conference', ';', '(', '[', '!', '$', '#']

    #Here we go through our relevant word list. If the word is in word_count, we calculate the percentage of the
    #email that the word makes up (on a 0-100 scale, like the training data) and append it to final_format. If the
    #word is not in word_count, 0.0 is appended for that word.
    total_words = len(email)
    final_format = []
    for element in relevant_words:
        percentage = 100 * word_count.get(element, 0) / total_words
        final_format.append(percentage)
    return final_format
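
The function above only counts a character such as `!` when it stands alone as a word, so the character-frequency features are almost always zero. Below is one possible alternative sketch that counts characters separately from words. It assumes word frequencies are percentages of total words and character frequencies are percentages of total characters, and that matching should ignore case; `featurize`, `WORDS`, and `CHARS` are hypothetical names, and the word list is shortened for illustration.

```python
import re
from collections import Counter

# Shortened, illustrative vocabularies (assumption): a full version would list
# every word_freq_* and char_freq_* column in the same order as the training data.
WORDS = ["free", "money", "you", "credit"]
CHARS = [";", "(", "[", "!", "$", "#"]

def featurize(email):
    """Return word and character frequencies as percentages (0-100)."""
    # Treat any run of letters or digits as a word, ignoring case.
    tokens = re.findall(r"[a-z0-9]+", email.lower())
    word_counts = Counter(tokens)
    char_counts = Counter(email)

    total_words = max(len(tokens), 1)  # avoid dividing by zero on empty input
    total_chars = max(len(email), 1)

    word_features = [100 * word_counts[w] / total_words for w in WORDS]
    char_features = [100 * char_counts[c] / total_chars for c in CHARS]
    return word_features + char_features

features = featurize("Free money!! Claim your free credit now")
print(features)
```

Splitting on a regular expression also means line breaks and punctuation attached to a word (as in "money!!") no longer change which word is counted.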

Now that our email has been converted to a list of percentages like our training data, we can convert it into an array and feed it to our classifier.

In [12]:
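def spam_or_ham(email):
    #Convert the raw email text into the feature list our model expects
    features = formator(email)
    #The classifier expects a 2D array with one row per email
    input_array = np.reshape(np.array(features), (1, -1))
    result = classifier.predict(input_array)
    return result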

Our function returns a one-element array containing True if the prediction is ham, and False if the prediction is spam.
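
If a readable label is preferred over a boolean array, a small wrapper can translate the result. This is only a sketch; `to_label` is a hypothetical helper, and it assumes the prediction is a one-element sequence like the one described above.

```python
def to_label(prediction):
    """Map a one-element prediction such as [True] to 'ham' or 'spam'."""
    return "ham" if bool(prediction[0]) else "spam"

print(to_label([True]))   # ham
print(to_label([False]))  # spam
```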

### Some Examples:

In [13]:
#This is a very normal email and should be classified as ham
ham_email = '''Good morning! How are you? I am just writing to you today to see how things are going and if you
need anything. Let me know!'''
print(spam_or_ham(ham_email))

[ True]


In [14]:
#This is a very strange email that is obviously not written by a human, and is just asking for money. This should
#be classified as spam
spam_email = '''Please give us credit card and address so we can see if  qualify for our 'money now'
program, where u receive money and rewards. Who doesn't like money money money! Thanks for business!'''
print(spam_or_ham(spam_email))

[False]
