# Bayes-Theorem based Spam Filter

# Info 371

In [322]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

## 1 (5pt) Explore and clean the data

1. (2pt) Load the lingspam-emails.csv.bz2 dataset.
Browse a handful of emails, both spam and non-spam ones, to see what kind of text we are working
with here.


Hint: check out textwrap module to print long strings on multiple lines.

In [13]:
emails = pd.read_csv('/Users/romilkinger/Downloads/lingspam-emails.csv.bz2', sep='\t')
emails.head()

Unnamed: 0,spam,files,message
0,False,3-1msg1.txt,Subject: re : 2 . 882 s - > np np > date : su...
1,False,3-1msg2.txt,Subject: s - > np + np the discussion of s - ...
2,False,3-1msg3.txt,Subject: 2 . 882 s - > np np . . . for me it ...
3,False,3-375msg1.txt,"Subject: gent conference "" for the listserv ""..."
4,False,3-378msg1.txt,Subject: query : causatives in korean could a...


In [14]:
emails.message

0       Subject: re : 2 . 882 s - > np np  > date : su...
1       Subject: s - > np + np  the discussion of s - ...
2       Subject: 2 . 882 s - > np np  . . . for me it ...
3       Subject: gent conference  " for the listserv "...
4       Subject: query : causatives in korean  could a...
                              ...                        
2888    Subject: .  international driver ' s license n...
2889    Subject: new on 95 . 8 capital fm  this is new...
2890    Subject: re : new medical technology  company ...
2891    Subject: re : your request for an overview  ye...
2892    Subject: new on capital fm  this is new at htt...
Name: message, Length: 2893, dtype: object

In [16]:
import textwrap

In [26]:
wrapper = textwrap.TextWrapper(width=50)
word_list = wrapper.wrap(text=emails.message[0])

#Print each line.
for element in word_list:
    print(element)

Subject: re : 2 . 882 s - > np np  > date : sun ,
15 dec 91 02 : 25 : 02 est > from : michael <
mmorse @ vm1 . yorku . ca > > subject : re : 2 .
864 queries > > wlodek zadrozny asks if there is "
anything interesting " to be said > about the
construction " s > np np " . . . second , > and
very much related : might we consider the
construction to be a form > of what has been
discussed on this list of late as reduplication ?
the > logical sense of " john mcnamara the name "
is tautologous and thus , at > that level ,
indistinguishable from " well , well now , what
have we here ? " . to say that ' john mcnamara the
name ' is tautologous is to give support to those
who say that a logic-based semantics is irrelevant
to natural language . in what sense is it
tautologous ? it supplies the value of an
attribute followed by the attribute of which it is
the value . if in fact the value of the name-
attribute for the relevant entity were ' chaim
shmendrik ' , ' john mcnamara the name ' would be
f
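The task asks for a handful of emails from both classes, while the cell above prints a single non-spam message. A minimal sketch of a helper that previews a few messages per class could look like the following. The frame `emails_demo` and the function `show_examples` are hypothetical stand-ins (made-up messages, same column names as `emails`), not part of the original notebook.

```python
import textwrap

import pandas as pd

# Hypothetical stand-in for the real `emails` frame (same column names,
# made-up messages).
emails_demo = pd.DataFrame({
    "spam": [False, False, True, True],
    "message": [
        "Subject: toy question about syntax could anyone point me to some references on this topic",
        "Subject: toy conference notice please find the call for papers below",
        "Subject: toy offer you have been selected to receive a special prize today",
        "Subject: toy promotion make money fast with this one simple trick",
    ],
})


def show_examples(df, n=2, width=50, max_chars=300):
    """Return (is_spam, wrapped_text) previews of the first n messages per class."""
    previews = []
    for is_spam, group in df.groupby("spam"):
        for text in group["message"].head(n):
            # Truncate long emails, then wrap to the requested width.
            wrapped = textwrap.fill(text[:max_chars], width=width)
            previews.append((bool(is_spam), wrapped))
    return previews


previews = show_examples(emails_demo)
for is_spam, wrapped in previews:
    print("SPAM" if is_spam else "NON-SPAM")
    print(wrapped)
    print("-" * 50)
```

With the real data the call would be `show_examples(emails)`; using `head` keeps the preview deterministic, and `sample` could be swapped in for a random selection.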

2. (3pt) Ensure the data is clean: remove all cases with missing spam and empty message field. We do not care about the file names.

In [47]:
emails.eq('').sum()

spam       0
files      0
message    0
dtype: int64

In [48]:
emails.isna().sum()

spam       0
files      0
message    0
dtype: int64

In [52]:
emails.spam.value_counts()

False    2412
True      481
Name: spam, dtype: int64

In [54]:
emails.shape

(2893, 3)

In [55]:
2412+481

2893
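The checks above find no missing or empty values, so nothing has to be removed here. If there were such rows, the cleaning step could be sketched as below. The toy frame `raw` is hypothetical and only mimics the columns of `emails`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same columns as `emails`.
raw = pd.DataFrame({
    "spam": [False, True, np.nan, False, True],
    "files": ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"],
    "message": ["Subject: hello", "Subject: win money", "Subject: lost label", "   ", None],
})

# Drop rows with a missing label or a missing message.
clean = raw.dropna(subset=["spam", "message"])
# Drop rows whose message is empty or whitespace only.
clean = clean[clean["message"].str.strip() != ""]
# The file names are not needed for modelling.
clean = clean.drop(columns="files").reset_index(drop=True)

print(clean)
```

Stripping whitespace before the comparison also catches messages that consist only of spaces, which a plain `eq('')` check would miss.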

## 2. (15pt) Create Document-term matrix (DTM)

1. (2pt) Choose ∼ 10 words which might be good to distinguish between spam/non-spam. Use these four: viagra, deadline, million, and and. Choose more words yourself (you may want to return here and reconsider your choice later).

In [62]:
list_of_words = ['viagra', 'deadline', 'million', 'and', 'congratulations', 'winner', 'lucky', 'draw', 'money', 'scholarship']

2. (10pt) Convert your messages into DTM. We do not use the full 60k-words DTM here but only a baby-DTM of the 10 words you picked above. You may add the DTM columns to the original data frame, or keep those in a separate structure.


Creating the DTM involves finding whether the word is contained in the message for all emails in data. You can loop over emails and check each one individually, but pandas string methods make life much easier. 

You will want to do case-insensitive matching, checking for both upper and lower case. 

It is more intuitive to work with your data if you convert the logical values returned by contains to numbers.

In [63]:
dtm = pd.DataFrame()

In [68]:
for w in list_of_words:
    dtm[w] = emails.message.str.lower().str.contains(w) + 0

In [78]:
dtm.head()

Unnamed: 0,viagra,deadline,million,and,congratulations,winner,lucky,draw,money,scholarship
0,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0,1,0
4,0,0,0,1,0,0,0,0,0,0
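The cell above matches substrings, so a short word such as `and` also fires on `band` or `standard`. A minimal sketch of a whole-word alternative, assuming regular-expression word boundaries are acceptable for this task, is shown below. The messages are made up to show the difference.

```python
import re

import pandas as pd

# Hypothetical messages chosen to show the difference.
messages = pd.Series([
    "Subject: the band played on",
    "Subject: cats AND dogs",
    "Subject: win a MILLION dollars",
])
words = ["and", "million"]

# Plain substring matching on lowercased text (equivalent to the cell
# above for these words).
dtm_substring = pd.DataFrame(
    {w: messages.str.lower().str.contains(w, regex=False).astype(int) for w in words}
)

# Whole-word, case-insensitive matching via regex word boundaries.
dtm_word = pd.DataFrame(
    {
        w: messages.str.contains(rf"\b{re.escape(w)}\b", case=False, regex=True).astype(int)
        for w in words
    }
)

print(dtm_substring)
print(dtm_word)
```

Whether whole-word matching helps depends on the word list; the numbers in the rest of this notebook come from the substring version.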


3. (3pt) Split your work data (i.e. the DTM) and target (the spam indicator) into training and validation chunks (80/20 is a good split).

In [81]:
X = dtm
y = emails.spam + 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [82]:
X_train.head()

Unnamed: 0,viagra,deadline,million,and,congratulations,winner,lucky,draw,money,scholarship
1509,0,0,0,1,0,0,0,0,0,0
1885,0,1,0,1,0,0,0,0,0,0
1154,0,0,0,1,0,0,0,0,0,0
2016,0,0,0,1,0,0,0,0,1,0
1603,0,0,0,1,0,0,0,0,0,0


In [83]:
X_test.head()

Unnamed: 0,viagra,deadline,million,and,congratulations,winner,lucky,draw,money,scholarship
1515,0,1,0,1,0,0,0,0,0,0
1396,0,0,0,1,0,0,0,0,0,0
2131,0,0,0,1,0,0,0,0,0,0
2170,0,1,0,1,0,0,0,0,0,0
2192,0,1,0,1,0,0,0,0,0,0


In [84]:
y_train

1509    0
1885    0
1154    1
2016    1
1603    0
       ..
502     0
2144    0
552     1
141     0
2825    0
Name: spam, Length: 2314, dtype: int64
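The split above is unseeded, so every run produces different probabilities and metrics. A hedged sketch of a reproducible variant follows; the seed value 42 and the use of `stratify` are assumptions for illustration, and the toy data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 emails, 20% spam, one binary word column.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"million": rng.integers(0, 2, size=100)})
y_demo = pd.Series([1] * 20 + [0] * 80, name="spam")

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo,
    test_size=0.20,
    random_state=42,   # arbitrary seed; fixes the split across runs
    stratify=y_demo,   # keep the spam share equal in both chunks
)

print(len(X_tr), len(X_va))
print(y_tr.mean(), y_va.mean())
```

Stratifying matters here because only about one email in six is spam, so an unlucky random split could leave the validation set with noticeably more or less spam than the training set.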

## 3. (80pt) Estimate and validate

1. (2pt) Design a scheme for your variable names that describes these probabilities so that a) you understand what they mean; and b) the others (including your grader) will understand those!

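I use pr_<event> for unconditional probabilities and pr_<event><condition> for conditional ones, where s1/s0 mean spam/non-spam and w1/w0 mean the word is present/absent:

- pr_s1, pr_s0: the priors Pr(category = S) and Pr(category = NS)
- pr_w1, pr_w0: the normalizers Pr(w = 1) and Pr(w = 0)
- pr_w1s1, pr_w0s1, pr_w1s0, pr_w0s0: Pr(w|category), e.g. pr_w1s1 = Pr(w = 1|category = S)
- pr_s1w1, pr_s0w1, pr_s1w0, pr_s0w0: Pr(category|w), e.g. pr_s1w1 = Pr(category = S|w = 1)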

2. (4pt) Compute the priors, the unconditional probabilities for an email being spam and non-spam, Pr(category = S) and Pr(category = NS). These probabilities are based on the spam variable alone, not on the text.

In [87]:
pr_s1 = (y_train == 1).mean()
pr_s1

0.16421780466724287

In [89]:
pr_s0 = (y_train == 0).mean()
pr_s0

0.8357821953327571

In [90]:
pr_s1 + pr_s0

1.0

3. (4pt) For each word w, compute the normalizers, Pr(w = 1) and Pr(w = 0).
Hint: this is Pr(million = 1) = 0.0484. But note this value (and the following hints) depends on
your random training/validation split!


In [253]:
prob_words = pd.DataFrame()
prob_words['pr_w1'] = ''

In [254]:
for w in X_train:
    prob_words.loc[w] = ((X_train[w] == 1).mean())

In [255]:
prob_words

Unnamed: 0,pr_w1
viagra,0.000432
deadline,0.149957
million,0.047537
and,0.942956
congratulations,0.002161
winner,0.009507
lucky,0.010804
draw,0.044944
money,0.082541
scholarship,0.008643


In [256]:
prob_words['pr_w0'] = ''
for w in X_train:
    prob_words.loc[[w],['pr_w0']] = ((X_train[w] == 0).mean())
prob_words

Unnamed: 0,pr_w1,pr_w0
viagra,0.000432,0.999568
deadline,0.149957,0.850043
million,0.047537,0.952463
and,0.942956,0.057044
congratulations,0.002161,0.997839
winner,0.009507,0.990493
lucky,0.010804,0.989196
draw,0.044944,0.955056
money,0.082541,0.917459
scholarship,0.008643,0.991357


4. (7pt) For each word w, compute Pr(w = 1|category = S) and Pr(w = 1|category = NS). These probabilities are based on both the spam-variable and on the DTM component that corresponds to the word w.
Hint: Pr(million = 1|category = S) = 0.252


In [257]:
temp_train = X_train.copy(deep=True)
temp_train['spam'] = y_train
temp_train.head()

Unnamed: 0,viagra,deadline,million,and,congratulations,winner,lucky,draw,money,scholarship,spam
1509,0,0,0,1,0,0,0,0,0,0,0
1885,0,1,0,1,0,0,0,0,0,0,0
1154,0,0,0,1,0,0,0,0,0,0,1
2016,0,0,0,1,0,0,0,0,1,0,1
1603,0,0,0,1,0,0,0,0,0,0,0


In [258]:
for w in X_train:
    prob_words.loc[[w],['pr_w1s1']] = temp_train[temp_train['spam']==1][w].mean()
    prob_words.loc[[w],['pr_w0s1']] = 1 - temp_train[temp_train['spam']==1][w].mean()
    prob_words.loc[[w],['pr_w1s0']] = temp_train[temp_train['spam']==0][w].mean()
    prob_words.loc[[w],['pr_w0s0']] = 1 - temp_train[temp_train['spam']==0][w].mean()

prob_words

Unnamed: 0,pr_w1,pr_w0,pr_w1s1,pr_w0s1,pr_w1s0,pr_w0s0
viagra,0.000432,0.999568,0.002632,0.997368,0.0,1.0
deadline,0.149957,0.850043,0.0,1.0,0.179421,0.820579
million,0.047537,0.952463,0.244737,0.755263,0.00879,0.99121
and,0.942956,0.057044,0.921053,0.078947,0.94726,0.05274
congratulations,0.002161,0.997839,0.013158,0.986842,0.0,1.0
winner,0.009507,0.990493,0.042105,0.957895,0.003102,0.996898
lucky,0.010804,0.989196,0.060526,0.939474,0.001034,0.998966
draw,0.044944,0.955056,0.018421,0.981579,0.050155,0.949845
money,0.082541,0.917459,0.360526,0.639474,0.027921,0.972079
scholarship,0.008643,0.991357,0.002632,0.997368,0.009824,0.990176
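The per-word loops above can also be written with a single `groupby`, and the result can be checked with the law of total probability, Pr(w = 1) = Pr(w = 1|S) Pr(S) + Pr(w = 1|NS) Pr(NS). The sketch below uses a small made-up training frame `train_demo`; it is an alternative formulation, not the code that produced the table.

```python
import pandas as pd

# Hypothetical toy training data: two word columns and a spam label.
train_demo = pd.DataFrame({
    "million": [1, 0, 1, 0, 0, 0, 0, 1],
    "and":     [1, 1, 1, 0, 1, 1, 1, 1],
    "spam":    [1, 1, 1, 0, 0, 0, 0, 0],
})
word_cols = ["million", "and"]

# Priors Pr(S) and Pr(NS), indexed by the label value.
priors = train_demo["spam"].value_counts(normalize=True)

# Row 1 holds Pr(w = 1 | S), row 0 holds Pr(w = 1 | NS).
cond = train_demo.groupby("spam")[word_cols].mean()

# Normalizers Pr(w = 1) computed directly from the columns.
pr_w1_direct = train_demo[word_cols].mean()

# Law of total probability: Pr(w=1) = Pr(w=1|S) Pr(S) + Pr(w=1|NS) Pr(NS).
pr_w1_total = cond.loc[1] * priors.loc[1] + cond.loc[0] * priors.loc[0]

print(cond)
print(pd.DataFrame({"direct": pr_w1_direct, "total": pr_w1_total}))
```

If the two columns of the last table disagree, one of the conditional probabilities or priors was computed on the wrong subset.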


5. (5pt) Finally, compute the probabilities of interest, Pr(category = S|w = 1) and Pr(category =
S|w = 0). Compute this value using Bayes theorem, not directly by counting!
For the check, you may also compute Pr(category = NS|w = 1) and Pr(category = NS|w = 0)
Hint: Pr(category = S|million = 1) = 0.843. But note this number depends on your random training-validation split!

In [259]:
(prob_words.loc[['million'],['pr_w1s1']]['pr_w1s1'] * pr_s1) / (prob_words.loc[['million'],['pr_w1']]['pr_w1'])

million    0.845455
dtype: float64
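Written out with the values from the tables above (pr_w1s1 = 0.244737 and pr_w1 = 0.047537 for million, pr_s1 = 0.164218), the computation in this cell is Bayes theorem applied to one word:

$$
\Pr(S \mid \text{million} = 1)
= \frac{\Pr(\text{million} = 1 \mid S)\,\Pr(S)}{\Pr(\text{million} = 1)}
= \frac{0.244737 \times 0.164218}{0.047537}
\approx 0.8455
$$

This is close to the 0.843 given in the hint; the small difference comes from the random split.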

In [261]:
for w in X_train:
    pr_w1s1 = prob_words.loc[[w],['pr_w1s1']]['pr_w1s1']
    pr_w1 = prob_words.loc[[w],['pr_w1']]['pr_w1']
    prob_words.loc[[w],['pr_s1w1']] = (pr_w1s1 * pr_s1) / pr_w1
    
    pr_w1s0 = prob_words.loc[[w],['pr_w1s0']]['pr_w1s0']
    pr_w1 = prob_words.loc[[w],['pr_w1']]['pr_w1']
    prob_words.loc[[w],['pr_s0w1']] = (pr_w1s0 * pr_s0) / pr_w1
    
    pr_w0s1 = prob_words.loc[[w],['pr_w0s1']]['pr_w0s1']
    pr_w0 = prob_words.loc[[w],['pr_w0']]['pr_w0']
    prob_words.loc[[w],['pr_s1w0']] = (pr_w0s1 * pr_s1) / pr_w0
    
    pr_w0s0 = prob_words.loc[[w],['pr_w0s0']]['pr_w0s0']
    pr_w0 = prob_words.loc[[w],['pr_w0']]['pr_w0']
    prob_words.loc[[w],['pr_s0w0']] = (pr_w0s0 * pr_s0) / pr_w0
prob_words

Unnamed: 0,pr_w1,pr_w0,pr_w1s1,pr_w0s1,pr_w1s0,pr_w0s0,pr_s1w1,pr_s0w1,pr_s1w0,pr_s0w0
viagra,0.000432,0.999568,0.002632,0.997368,0.0,1.0,1.0,0.0,0.163856,0.836144
deadline,0.149957,0.850043,0.0,1.0,0.179421,0.820579,0.0,1.0,0.193188,0.806812
million,0.047537,0.952463,0.244737,0.755263,0.00879,0.99121,0.845455,0.154545,0.130218,0.869782
and,0.942956,0.057044,0.921053,0.078947,0.94726,0.05274,0.160403,0.839597,0.227273,0.772727
congratulations,0.002161,0.997839,0.013158,0.986842,0.0,1.0,1.0,0.0,0.162408,0.837592
winner,0.009507,0.990493,0.042105,0.957895,0.003102,0.996898,0.727273,0.272727,0.158813,0.841187
lucky,0.010804,0.989196,0.060526,0.939474,0.001034,0.998966,0.92,0.08,0.155963,0.844037
draw,0.044944,0.955056,0.018421,0.981579,0.050155,0.949845,0.067308,0.932692,0.168778,0.831222
money,0.082541,0.917459,0.360526,0.639474,0.027921,0.972079,0.717277,0.282723,0.114461,0.885539
scholarship,0.008643,0.991357,0.002632,0.997368,0.009824,0.990176,0.05,0.95,0.165214,0.834786


6. (6pt) Which of these probabilities have to sum to one? (E.g. Pr(category = 1) + Pr(category = 0) = 1.) Which ones do not? Explain!

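Pairs that cover both outcomes of one event under the same condition have to sum to 1, because exactly one of the two outcomes must happen:
- pr_s1 and pr_s0 (an email is either spam or non-spam)
- pr_w1 and pr_w0 (an email either contains the word or does not)
- pr_w1s1 and pr_w0s1
- pr_w1s0 and pr_w0s0
- pr_s1w1 and pr_s0w1
- pr_s1w0 and pr_s0w0

Pairs that differ in the condition do not have to sum to 1, because they describe different groups of emails. For example, pr_w1s1 + pr_w1s0 (the word's frequency among spam plus its frequency among non-spam) is 0.245 + 0.009 = 0.254 for million, and pr_s1w1 + pr_s1w0 is 0.845 + 0.130 = 0.976.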
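The sums can be checked numerically. The short sketch below copies the row for million and the priors from the tables above and adds up each pair; the dictionary names are just for illustration.

```python
import math

# Values copied from the probability table above (row "million")
# and the priors computed earlier.
pr_s1, pr_s0 = 0.164218, 0.835782
row = {
    "pr_w1": 0.047537, "pr_w0": 0.952463,
    "pr_w1s1": 0.244737, "pr_w0s1": 0.755263,
    "pr_w1s0": 0.00879, "pr_w0s0": 0.99121,
    "pr_s1w1": 0.845455, "pr_s0w1": 0.154545,
    "pr_s1w0": 0.130218, "pr_s0w0": 0.869782,
}

# Pairs that share the same condition and cover both outcomes: sum to 1.
complementary = {
    "priors": pr_s1 + pr_s0,
    "normalizers": row["pr_w1"] + row["pr_w0"],
    "word given spam": row["pr_w1s1"] + row["pr_w0s1"],
    "word given non-spam": row["pr_w1s0"] + row["pr_w0s0"],
    "category given w=1": row["pr_s1w1"] + row["pr_s0w1"],
    "category given w=0": row["pr_s1w0"] + row["pr_s0w0"],
}

# Pairs with different conditions: no reason to sum to 1.
not_complementary = {
    "pr_w1s1 + pr_w1s0": row["pr_w1s1"] + row["pr_w1s0"],
    "pr_s1w1 + pr_s1w0": row["pr_s1w1"] + row["pr_s1w0"],
}

for name, total in complementary.items():
    print(f"{name}: {total:.4f}", math.isclose(total, 1.0, abs_tol=1e-4))
for name, total in not_complementary.items():
    print(f"{name}: {total:.4f}")
```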


Now we are done with the estimator. Your fitted model is completely described by these probabilities. Let’s now turn to prediction, using your validation data. Note that we are still inside the loop over each word w!

7. (8pt) For each email in your validation set, predict whether it is spam or non-spam. Hint: you should check if it contains the word w and use the appropriate probability, Pr(category = S|w = 1) or Pr(category = S|w = 0).

In [264]:
X_test

Unnamed: 0,viagra,deadline,million,and,congratulations,winner,lucky,draw,money,scholarship
1515,0,1,0,1,0,0,0,0,0,0
1396,0,0,0,1,0,0,0,0,0,0
2131,0,0,0,1,0,0,0,0,0,0
2170,0,1,0,1,0,0,0,0,0,0
2192,0,1,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
774,0,0,0,1,0,0,0,0,0,0
2130,0,0,0,1,0,0,0,0,0,0
1255,0,0,0,1,0,0,0,0,0,0
646,0,0,0,1,0,0,0,0,0,0


In [271]:
np.array(X_test.iloc[0])

array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0])

In [280]:
prob_words['pr_s1w1'].values 
pred = prob_words[['pr_s1w1', 'pr_s1w0']]

In [340]:
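result_list = []
for i in range(X_test.shape[0]):
    # Score for email i: the sum of Pr(S|w=1) over the words it contains
    # (absent words contribute 0, so pr_s1w0 is not used here).
    result_list.append(np.array(X_test.iloc[i]) @ pred['pr_s1w1'].values)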

In [334]:
result_arr = np.array(result_list)

In [336]:
result_arr[result_arr > 0.5] = 1
result_arr[result_arr < 0.5] = 0

In [339]:
(result_arr == y_test).mean()

0.8756476683937824
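The score used above adds Pr(S|w = 1) over all words present in an email, so it can exceed 1 and is better read as a heuristic score than as a probability. The task hint describes a per-word prediction instead: look up Pr(S|w = 1) if the word is present and Pr(S|w = 0) otherwise. A minimal sketch for a single word follows. The two posteriors are copied from the table above, while `x_val` and `y_val` are made-up illustration data, so the printed accuracy says nothing about the real model.

```python
import numpy as np

# Posteriors for "million", copied from the probability table above.
pr_s1w1 = 0.845455   # Pr(S | million = 1)
pr_s1w0 = 0.130218   # Pr(S | million = 0)

# Hypothetical validation column and labels, for illustration only.
x_val = np.array([1, 0, 0, 1, 0, 0])
y_val = np.array([1, 0, 1, 1, 0, 0])

# Pick the posterior that matches whether the word is present.
pr_spam = np.where(x_val == 1, pr_s1w1, pr_s1w0)

# Predict spam when the posterior exceeds 0.5.
y_hat = (pr_spam > 0.5).astype(int)

accuracy = (y_hat == y_val).mean()
print(y_hat, accuracy)
```

Running this once per word, inside the loop over w, gives one confusion matrix per word, which makes it easier to see which words help.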

8. (5pt) Print the resulting confusion matrix and compute accuracy, precision and recall.

In [344]:
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score

In [342]:
confusion_matrix(y_test, result_arr)

array([[454,  24],
       [ 48,  53]])

In [348]:
print('Precision: ',precision_score(y_test, result_arr))

Precision:  0.6883116883116883


In [349]:
print('Recall: ',recall_score(y_test, result_arr))

Recall:  0.5247524752475248


In [350]:
print('Accuracy: ',accuracy_score(y_test, result_arr))

Accuracy:  0.8756476683937824


9. (5pt) Which steps above constitute model training? In which steps do you use trained model? What
is a trained model in this case? Explain!

Model training is steps 2 to 5: I compute the priors, the normalizers, the conditional probabilities Pr(w|category) and, via Bayes theorem, Pr(category|w), all from the training data (80% of the emails). The trained model in this case is the table of these probabilities, since nothing else is needed to predict. The trained model is used in steps 7 and 8: step 7 applies the probabilities to the validation emails to predict spam or non-spam, and step 8 compares those predictions with the true labels to compute the confusion matrix, accuracy, precision and recall.

10. (4pt) Comment on the overall performance of the model: how do accuracy, precision and recall look?


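The overall performance of the model is moderate. Accuracy is 87.6%, which means most of the predictions are correct, but 82.6% of the validation emails are non-spam (478 of 579), so always predicting non-spam would already come close. Precision is decent at 68.8%: about two thirds of the emails flagged as spam really are spam. Recall is the weakest at 52.5%, most likely because we only used 10 words to train our model. Both precision and recall could be improved by using more words.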

### Self-Note: (not for grading, but for learning only.)

Accuracy: percentage of correct answers = (TN + TP) / (TN + TP + FN + FP)

Precision: percentage of predicted positives that are correct = TP / (TP + FP)
- sensitive to false positives
- a good measure if false positives are bad: judges must be sure the defendant is guilty, a scientific discovery must not be just a statistical blip

Recall: percentage of actual positives captured = TP / (TP + FN)
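These formulas can be checked against the confusion matrix reported in task 8. The sketch below recomputes the three metrics by hand and adds, as a point of comparison, the accuracy of always predicting non-spam; the variable names are only illustrative.

```python
import numpy as np

# Confusion matrix reported in task 8: rows are actual, columns are predicted.
cm = np.array([[454, 24],
               [48, 53]])
tn, fp = cm[0]
fn, tp = cm[1]
total = cm.sum()

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Accuracy of always predicting non-spam, for comparison.
baseline = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} baseline={baseline:.4f}")
```

The hand-computed values agree with the `sklearn` scores printed in task 8.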
    
    


11. (8pt) Explain why you see very low recall while the other indicators do not look that bad.

Recall is the percentage of actual positives captured: true positives / (true positives + false negatives). It is sensitive to false negatives, that is, spam emails that the model predicts to be non-spam. Our model has many of them: 48 of the 101 spam emails in the validation set are missed, so recall is low. The likely reason is that we only had 10 words to make the predictions, so the model misses spam that contains none of the spam-related words. Accuracy still looks fine because most emails are non-spam and the model gets almost all of those right (454 of 478), and precision only counts the emails the model does flag.

12. (8pt) Explain why some words work well and others not:

(a) why does “million” improve accuracy? 

(b) why does “viagra” not work?

(c) why does “deadline” not work? 

(d) why does “and” not work?

Hint: You may just see where in which emails these words occur, and how frequently. These are all different reasons!

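a) million works because it appears in about 24% of the spam emails but in fewer than 1% of the non-spam emails, so the probability of spam given this word is high (0.85) and emails containing it are flagged as spam.


b) viagra doesn't work because it appears in only one training email. The probability of spam given this word is 1, but that estimate rests on a single email, and the word is almost never there to trigger it.


c) deadline doesn't work because it doesn't appear in spam emails in the training data. It only identifies non-spam, and since the probability of spam without the word (0.19) is below 0.5, emails without it are predicted to be non-spam as well.


d) and doesn't work because it appears in most of the emails, spam (92%) and non-spam (95%) alike, so the probability of spam given this word (0.16) is about the same as the prior and the word carries almost no information.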
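One way to see the four cases side by side is to compare each word's two posteriors with the 0.5 threshold, together with how often the word occurs. The sketch below copies the numbers from the probability table in task 5; the dictionary layout is only for illustration.

```python
# Posteriors copied from the probability table in task 5.
posteriors = {
    #            Pr(S | w=1), Pr(S | w=0)
    "million":  (0.845455, 0.130218),
    "viagra":   (1.0,      0.163856),
    "deadline": (0.0,      0.193188),
    "and":      (0.160403, 0.227273),
}
# Share of training emails containing each word, from the same table.
pr_w1 = {"million": 0.047537, "viagra": 0.000432, "deadline": 0.149957, "and": 0.942956}

summary = {}
for word, (p_present, p_absent) in posteriors.items():
    summary[word] = {
        # True if an email containing the word is predicted to be spam.
        "flags_spam_when_present": p_present > 0.5,
        # True if an email without the word is predicted to be spam.
        "flags_spam_when_absent": p_absent > 0.5,
        "share_of_emails": pr_w1[word],
    }

for word, info in summary.items():
    print(word, info)
```

Only million and viagra ever produce a spam prediction, and viagra occurs in well under 1% of the emails, so of these four words million is the only one that changes a meaningful number of predictions.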

13. (5pt) Add such smoothing to the model. You can either literally add two such lines of data, or alternatively manipulate the way you compute the probabilities.


In [437]:
prob_words_sooth = pd.DataFrame()
prob_words_sooth['pr_w1'] = ''
for w in X_train:
    prob_words_sooth.loc[w] = ((X_train[w] == 1).sum()+2)/X_train.shape[0]

prob_words_sooth['pr_w0'] = ''
for w in X_train:
    prob_words_sooth.loc[[w],['pr_w0']] = ((X_train[w] == 0).mean())
prob_words_sooth

Unnamed: 0,pr_w1,pr_w0
viagra,0.001296,0.999568
deadline,0.150821,0.850043
million,0.048401,0.952463
and,0.94382,0.057044
congratulations,0.003025,0.997839
winner,0.010372,0.990493
lucky,0.011668,0.989196
draw,0.045808,0.955056
money,0.083405,0.917459
scholarship,0.009507,0.991357


In [438]:
for w in X_train:
    prob_words_sooth.loc[[w],['pr_w1s1']] = ((temp_train[temp_train['spam']==1][w]).sum()+1)/((temp_train['spam']==1).sum()+2)
    prob_words_sooth.loc[[w],['pr_w0s1']] = 1 - prob_words_sooth.loc[[w],['pr_w1s1']]['pr_w1s1']
    prob_words_sooth.loc[[w],['pr_w1s0']] = ((temp_train[temp_train['spam']==0][w]).sum()+1)/((temp_train['spam']==0).sum()+2)
    prob_words_sooth.loc[[w],['pr_w0s0']] = 1 - prob_words_sooth.loc[[w],['pr_w1s0']]['pr_w1s0']


    
for w in X_train:
    pr_w1s1 = prob_words_sooth.loc[[w],['pr_w1s1']]['pr_w1s1']
    pr_w1 = prob_words_sooth.loc[[w],['pr_w1']]['pr_w1']
    pr_s1 = ((temp_train['spam']==1).sum()+2) / (temp_train.shape[0] + 2)
    prob_words_sooth.loc[[w],['pr_s1w1']] = (pr_w1s1 * pr_s1) / pr_w1
    
    pr_w1s0 = prob_words_sooth.loc[[w],['pr_w1s0']]['pr_w1s0']
    pr_w1 = prob_words_sooth.loc[[w],['pr_w1']]['pr_w1']
    pr_s0 = ((temp_train['spam']==0).sum()+2) / (temp_train.shape[0] + 2)
    prob_words_sooth.loc[[w],['pr_s0w1']] = (pr_w1s0 * pr_s0) / pr_w1
    
    pr_w0s1 = prob_words_sooth.loc[[w],['pr_w0s1']]['pr_w0s1']
    pr_w0 = prob_words_sooth.loc[[w],['pr_w0']]['pr_w0']
    pr_s1 = ((temp_train['spam']==1).sum()+2) / (temp_train.shape[0] + 2)
    prob_words_sooth.loc[[w],['pr_s1w0']] = (pr_w0s1 * pr_s1) / pr_w0
    
    pr_w0s0 = prob_words_sooth.loc[[w],['pr_w0s0']]['pr_w0s0']
    pr_w0 = prob_words_sooth.loc[[w],['pr_w0']]['pr_w0']
    pr_s0 = ((temp_train['spam']==0).sum()+2) / (temp_train.shape[0] + 2)
    prob_words_sooth.loc[[w],['pr_s0w0']] = (pr_w0s0 * pr_s0) / pr_w0


prob_words_sooth

Unnamed: 0,pr_w1,pr_w0,pr_w1s1,pr_w0s1,pr_w1s0,pr_w0s0,pr_s1w1,pr_s0w1,pr_s1w0,pr_s0w0
viagra,0.001296,0.999568,0.005236,0.994764,0.000517,0.999483,0.666091,0.333045,0.164147,0.835853
deadline,0.150821,0.850043,0.002618,0.997382,0.179752,0.820248,0.002863,0.996274,0.193529,0.806624
million,0.048401,0.952463,0.246073,0.753927,0.009298,0.990702,0.838561,0.160575,0.130559,0.869484
and,0.94382,0.057044,0.918848,0.081152,0.946798,0.053202,0.160575,0.838561,0.234646,0.779629
congratulations,0.003025,0.997839,0.015707,0.984293,0.000517,0.999483,0.856403,0.142734,0.1627,0.837301
winner,0.010372,0.990493,0.044503,0.955497,0.003616,0.996384,0.707722,0.291415,0.159112,0.840896
lucky,0.011668,0.989196,0.062827,0.937173,0.00155,0.99845,0.888121,0.111015,0.156265,0.843744
draw,0.045808,0.955056,0.020942,0.979058,0.05062,0.94938,0.075407,0.92373,0.169085,0.830956
money,0.083405,0.917459,0.361257,0.638743,0.028409,0.971591,0.714408,0.284728,0.114832,0.885245
scholarship,0.009507,0.991357,0.005236,0.994764,0.010331,0.989669,0.090831,0.908306,0.165506,0.834501
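In the table above the posterior pairs no longer sum exactly to 1 (for viagra, 0.666091 + 0.333045 = 0.999136). The reason is that the pieces use different pseudo-counts: the normalizer pr_w1 adds 2 to the count but divides by the unsmoothed total, pr_w0 is not smoothed, and the priors divide by the total plus 2. A minimal sketch of one internally consistent alternative follows. It assumes a scheme of four pseudo-emails (one with and one without the word in each class), which may differ from the scheme the assignment intends, and it uses counts inferred from the tables above.

```python
# Counts inferred from the tables above: 380 spam and 1934 non-spam
# training emails; "viagra" occurs in 1 spam and 0 non-spam emails.
n_spam, n_ham = 380, 1934
k_spam, k_ham = 1, 0

# Assumed scheme: add one pseudo-email with the word and one without
# the word to each class (four pseudo-emails in total).
pr_w1s1 = (k_spam + 1) / (n_spam + 2)
pr_w1s0 = (k_ham + 1) / (n_ham + 2)
pr_s1 = (n_spam + 2) / (n_spam + n_ham + 4)
pr_s0 = (n_ham + 2) / (n_spam + n_ham + 4)

# Normalizer from the law of total probability, so every quantity
# is based on the same pseudo-counts.
pr_w1 = pr_w1s1 * pr_s1 + pr_w1s0 * pr_s0

pr_s1w1 = pr_w1s1 * pr_s1 / pr_w1
pr_s0w1 = pr_w1s0 * pr_s0 / pr_w1

print(round(pr_s1w1, 4), round(pr_s0w1, 4), pr_s1w1 + pr_s0w1)
```

Because the normalizer is built from the same smoothed pieces as the numerators, the two posteriors sum to 1 by construction.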


14. (5pt) Repeat the tasks above: compute the probabilities, do predictions, compute the accuracy, precision, recall for all words.

In [441]:
pred_sooth = prob_words_sooth[['pr_s1w1', 'pr_s1w0']]
result_list_sooth = []
for i in range(X_test.shape[0]): 
    result_list_sooth.append(np.array(X_test.iloc[i]) @ pred_sooth['pr_s1w1'].values)
result_arr_sooth = np.array(result_list_sooth)
result_arr_sooth[result_arr_sooth > 0.5] = 1
result_arr_sooth[result_arr_sooth < 0.5] = 0

In [442]:
confusion_matrix(y_test, result_arr_sooth)

array([[454,  24],
       [ 48,  53]])

In [443]:
print('Precision: ',precision_score(y_test, result_arr_sooth))
print('Recall: ',recall_score(y_test, result_arr_sooth))
print('Accuracy: ',accuracy_score(y_test, result_arr_sooth))

Precision:  0.6883116883116883
Recall:  0.5247524752475248
Accuracy:  0.8756476683937824


15. (4pt) Comment on the results. Does smoothing improve the overall performance?

Smoothing makes no difference to the performance: the confusion matrix, precision, recall and accuracy are identical to the unsmoothed results. The probabilities look more reasonable, though, for rare words like 'viagra', which had a spam probability of 1 because of a single instance and is now at 0.67.