Building a spam filter using the mulitnomial Naive Bayes algorithm

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the The UCI Machine Learning Repository.
https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

The data collection process is described in more details on this page, where you can also find some of the authors' papers.

http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition

# Read-in and data screening

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
sms=pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=["Label","SMS"])

In [3]:
sms.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
sms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.2+ KB


In [5]:
sms["Label"].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

# Creating a training and a test set

In [6]:
randomized_sms=sms.sample(frac=1, random_state=1)

In [7]:
randomized_sms.head()

Unnamed: 0,Label,SMS
1078,ham,"Yep, by the pretty sculpture"
4028,ham,"Yes, princess. Are you going to make me moan?"
958,ham,Welp apparently he retired
4642,ham,Havent.
4674,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [8]:
randomized_sms.shape[0]*.8

4457.6

In [9]:
training_set=randomized_sms.iloc[:4458,:].reset_index(drop=True).copy()

In [10]:
training_set.shape

(4458, 2)

In [11]:
test_set=randomized_sms.iloc[4458:,:].reset_index(drop=True).copy()

In [12]:
test_set.shape

(1114, 2)

In [13]:
training_set["Label"].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [14]:
test_set["Label"].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

# Data Cleaning

Zuerst wird die Punktuation entfernt.

In [15]:
re.sub("[^ \w]","","Secret!! Money, goods.")

'Secret Money goods'

Negiertes Set wird verwendet, weil \W auch Whitespaces erfasst

In [16]:
training_set["SMS"]=training_set["SMS"].str.replace("[^ \w]","")

In [17]:
training_set["SMS"]=training_set["SMS"].str.lower()

Hierdurch sind zwei leere Zeilen entstanden (Wahrscheinlich ursprünglich nur ein Satzzeichen)

In [18]:
(training_set["SMS"]==" ").sum()

2

In [19]:
training_set[training_set["SMS"]==" "]

Unnamed: 0,Label,SMS
1098,ham,
2700,ham,


In [20]:
training_set.shape

(4458, 2)

In [21]:
training_set.drop(labels=[1098,2700], inplace=True)

In [22]:
training_set.reset_index(inplace=True)

In [23]:
training_set.shape

(4456, 3)

In [24]:
training_set["SMS"].head()

0                          yep by the pretty sculpture
1           yes princess are you going to make me moan
2                           welp apparently he retired
3                                               havent
4    i forgot 2 ask ü all smth theres a card on da ...
Name: SMS, dtype: object

Das Gleiche wird mit dem Test-Set wiederholt

In [25]:
test_set["SMS"]=test_set["SMS"].str.replace("[^ \w]","")

In [26]:
test_set["SMS"]=test_set["SMS"].str.lower()

In [67]:
test_set[test_set["SMS"]==" "]

Unnamed: 0,Label,SMS,predicted


In [28]:
test_set["SMS"].head()

0              later i guess i needa do mcat study too
1                  but i haf enuff space got like 4 mb
2    had your mobile 10 mths update to latest orang...
3    all sounds good fingers  makes it difficult to...
4    all done all handed in dont know if mega shop ...
Name: SMS, dtype: object

# Creating a vocabulary

In [29]:
vocabulary=list(set(training_set["SMS"].str.split(expand=True).stack()))

In [30]:
len(vocabulary)

8450

In [31]:
word_counts_per_sms={unique_word: [0] * len(training_set["SMS"]) for unique_word in vocabulary}

In [32]:
len(word_counts_per_sms)

8450

In [33]:
for index, sms in enumerate(training_set["SMS"]):
    for word in sms.split():
        word_counts_per_sms[word][index]+=1

Manipulations-Check

In [34]:
training_set["SMS"][0].split()

['yep', 'by', 'the', 'pretty', 'sculpture']

In [35]:
word_counts_per_sms["yep"][0]

1

In [36]:
word_counts_per_sms["pretty"][0]

1

In [37]:
word_counts=pd.DataFrame(word_counts_per_sms)

In [38]:
word_counts.head()

Unnamed: 0,thanks,massages,750,hunny,cbe,vegetables,insha,konw,09064018838,residency,...,minutes,club4mobilescom,battery,vargu,windy,mk45,carefully,heat,07973788240,require
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [39]:
word_counts.shape

(4456, 8450)

In [40]:
word_counts.isnull().sum().sum()

0

In [41]:
training_set_wc=pd.concat([training_set,word_counts],axis=1)

Check

In [42]:
training_set_wc.shape

(4456, 8453)

In [43]:
training_set_wc.loc[0]["sculpture"]

1

# Creating the mathematical framework

The constants

P(Spam) & P(Ham)

In [44]:
training_set["Label"].value_counts(normalize=True)

ham     0.86535
spam    0.13465
Name: Label, dtype: float64

In [45]:
p_spam=training_set["Label"].value_counts(normalize=True)["spam"]

In [46]:
p_ham=1-p_spam

N_spam, N_ham, N_vocabulary

In [47]:
n_spam=training_set_wc[training_set_wc["Label"]=="spam"].loc[:,"0":].sum().sum()

In [48]:
n_ham=training_set_wc[training_set_wc["Label"]=="ham"].loc[:,"0":].sum().sum()

In [49]:
n_vocabulary=len(vocabulary)

In [50]:
alpha=1

The parameters

P(Word|Spam), P(Word|Ham)

In [51]:
p_spam_by_word={unique_word: 0 for unique_word in vocabulary}

In [52]:
p_ham_by_word={unique_word:0 for unique_word in vocabulary}

In [53]:
spam_df=training_set_wc[training_set_wc["Label"]=="spam"]

In [54]:
ham_df=training_set_wc[training_set_wc["Label"]=="ham"]

In [55]:
for key in p_spam_by_word.keys():
    n_word_given_spam=spam_df[key].sum()
    p_spam_by_word[key]=(n_word_given_spam+alpha)/(n_spam+alpha*n_vocabulary)
    
    n_word_given_ham=ham_df[key].sum()
    p_ham_by_word[key]=(n_word_given_ham+alpha)/(n_ham+alpha*n_vocabulary)

P(Spam|Message), P(Ham|Message)

In [56]:
import re

In [57]:
def classify(message):
    message=re.sub("[^ \w]","",message)
    message=message.lower()
    message=message.split()
    
    p_spam_given_message=p_spam
    p_ham_given_message=p_ham
    
    for word in message:
        if word in p_spam_by_word.keys():
            p_spam_given_message*=p_spam_by_word[word]
        if word in p_ham_by_word.keys():
            p_ham_given_message*=p_ham_by_word[word]
    
    if p_spam_given_message>p_ham_given_message:
        print("Label: Spam")
    elif p_ham_given_message>p_spam_given_message:
        print("Label: Ham")
    else:
        print("Equal propabilities, have a human classify this!")

In [58]:
classify("WINNER!! This is the secret code to unlock the money: C3421.")

Label: Spam


In [59]:
classify("Sounds good, Tom, then see u there")

Label: Ham


Die Klassifizierungsfunktion wird dahingehend geändert, dass die Labels zurückgegeben werden. Damit können sie zur automatischen Klassifizierung des Test-Datensatzes verwendet werden.

In [60]:
def classify_test_set(message):
    
    message=re.sub("[^ \w]","",message)
    message=message.lower()
    message=message.split()
    
    p_spam_given_message=p_spam
    p_ham_given_message=p_ham
    
    for word in message:
        if word in p_spam_by_word.keys():
            p_spam_given_message*=p_spam_by_word[word]
        if word in p_ham_by_word.keys():
            p_ham_given_message*=p_ham_by_word[word]
    
    if p_spam_given_message>p_ham_given_message:
        return "spam"
    elif p_ham_given_message>p_spam_given_message:
        return "ham"
    else:
        return "needs human classification"

In [61]:
test_set["predicted"]=test_set["SMS"].apply(classify_test_set)

In [62]:
test_set.head()

Unnamed: 0,Label,SMS,predicted
0,ham,later i guess i needa do mcat study too,ham
1,ham,but i haf enuff space got like 4 mb,ham
2,spam,had your mobile 10 mths update to latest orang...,spam
3,ham,all sounds good fingers makes it difficult to...,ham
4,ham,all done all handed in dont know if mega shop ...,ham


calculating the accuracy

In [63]:
accuracy=test_set[test_set["Label"]==test_set["predicted"]].shape[0]/test_set.shape[0]

In [64]:
accuracy

0.9712746858168761

Eine Klassifizierungsgenauigkeit von >97% ist deutlich besser als die angestrebten >80%. Der Filter funktioniert sehr gut.

In [65]:
test_set[test_set["Label"]!=test_set["predicted"]]

Unnamed: 0,Label,SMS,predicted
51,spam,freemsg hey im buffy 25 and love to satisfy m...,ham
89,spam,goldviking 29m is inviting you to be his frien...,ham
114,spam,not heard from u4 a while call me now am here ...,ham
135,spam,more people are dogging in your area now call ...,ham
152,ham,unlimited texts limited minutes,spam
159,ham,26th of july,spam
263,spam,themobyo yo yohere comes a new selection of ho...,ham
284,ham,nokia phone is lovly,spam
287,spam,cashbincouk get lots of cash this weekend wwwc...,ham
363,spam,email alertfrom jeri stewartsize 2kbsubject lo...,ham


Wie man sieht bestehen die Falsch-Klassifikationen fast ausschließlich aus dem Verpassen von Spam-Mails (false negative).