# Building a Spam Filter with Naive Bayes

In this short notebook, we are going to describe the Naive Bayes algorithm and how it can be used to build a rudimentary spam classifier for SMS messages.

First

In [2]:
import pandas as pd
messages=pd.read_csv('SMSSpamCollection.txt',sep='\t',header=None,names=['Labels','SMS'])

In [2]:
print(messages.head())

print(messages.shape)

  Labels                                                SMS
0    ham  Go until jurong point, crazy.. Available only ...
1    ham                      Ok lar... Joking wif u oni...
2   spam  Free entry in 2 a wkly comp to win FA Cup fina...
3    ham  U dun say so early hor... U c already then say...
4    ham  Nah I don't think he goes to usf, he lives aro...
(5572, 2)


Now, lets see how many of the messages are classified as spam or not spam (i.e. ham).

In [3]:
messages['Labels'].value_counts()

ham     4825
spam     747
Name: Labels, dtype: int64

In [4]:
messages['Labels'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Labels, dtype: float64

So we see that about 87% of the messages are not spam.

In order to build our spam filter we are goin to split our labeled data in training and test sets by an 80-20 split. We will build our model based on the training set and then treat the test set as new messages and then compare the label generated by our model to the original label.

We'll start by randomizing our dataset.

In [5]:
messages_random=messages.sample(frac=1,random_state=1)
training_test_index = round(len(messages_random) * 0.8)
training_set=messages_random.iloc[:training_test_index,:].reset_index(drop=True)
test_set=messages_random.iloc[training_test_index:,:].reset_index(drop=True)

In [6]:
print(training_set['Labels'].value_counts(normalize=True))
print(test_set['Labels'].value_counts(normalize=True))

ham     0.86541
spam    0.13459
Name: Labels, dtype: float64
ham     0.868043
spam    0.131957
Name: Labels, dtype: float64


Percentages looking good.

To apply the Naive Bayes we want to count the number of unique words that appear throughout all SMS messages and count how often each word appears in each message. First we are going to strip all punctuation from each entry in the `SMS` column of the `training_set` data frame and reduce all resulting strings to lower-case only.

In [7]:
training_set['SMS']=training_set['SMS'].str.replace('\W',' ').str.lower()
training_set.head()

Unnamed: 0,Labels,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


Now we'll collect all the unique word present.

In [8]:
vocabulary=[]
training_set['SMS']=training_set['SMS'].str.split()


In [9]:
for text in training_set['SMS']:
    for word in text:
        vocabulary.append(word)
            
vocabulary=list(set(vocabulary))

Now the list `vocabulary` contains all the unique words that appeared in the training set SMS messages. Now we'll create a disctionary whose keys are each word in `vocabulary` and whose values are lists keeping track of how often the given word occurs in each message

In [10]:
word_counts_per_sms={word:[0]*len(training_set['SMS']) for word in vocabulary}
for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index]+=1

In [11]:
word_counts=pd.DataFrame(word_counts_per_sms)
training_set_clean=pd.concat([training_set,word_counts],axis=1)
training_set_clean.head()

Unnamed: 0,Labels,SMS,network,endof,62468,woodland,forms,08715203649,iscoming,tyrone,...,trust,easter,costume,snappy,tulip,evaluation,02085076972,move,person2die,rec
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now we'll compute the parameters we need to apply the Naive Bayes algorithm with Laplace smoothing.

In [12]:
alpha=1
p_ham_tr=training_set['Labels'].value_counts(normalize=True)['ham']
p_spam_tr=training_set['Labels'].value_counts(normalize=True)['spam']

In [13]:
spam_messages = training_set_clean[training_set_clean['Labels'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Labels'] == 'ham']

n_words_per_spam=spam_messages['SMS'].apply(len)
n_spam=n_words_per_spam.sum()

n_words_per_ham=ham_messages['SMS'].apply(len)
n_ham=n_words_per_ham.sum()


Now we'll compute $P(w_i|Spam)$ and $P(w_i|Ham)$ with the smoothing adjustment
$$P(w_i|Spam)\ =\ \frac{N_{w_i|Spam}+\alpha}{N_{Spam}+\alpha\cdot N_{Vocabulary}}$$
$$P(w_i|Ham)\ =\ \frac{N_{w_i|Ham}+\alpha}{N_{Ham}+\alpha\cdot N_{Vocabulary}}$$

In [19]:
parameters_spam={unique_word:0 for unique_word in vocabulary}
parameters_ham={unique_word:0 for unique_word in vocabulary}

In [20]:
spam_messages.sum().head()

Labels     spamspamspamspamspamspamspamspamspamspamspamsp...
SMS        [freemsg, why, haven, t, you, replied, to, my,...
network                                                   20
endof                                                      0
62468                                                      5
dtype: object

In [21]:
spam_words=spam_messages.sum()
ham_words=ham_messages.sum()

In [22]:
n_v=len(vocabulary)
denominator_spam=n_spam+alpha*n_v
denominator_ham=n_ham+alpha*n_v
for word in vocabulary:
    numerator_spam=spam_words[word]+alpha
    parameters_spam[word]=numerator_spam/denominator_spam
    
    numerator_ham=ham_words[word]+alpha
    parameters_ham[word]=numerator_ham/denominator_ham
    

In [24]:
parameters_ham['a']

0.013395878191325745

In [48]:
import re

def naive_bayes(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()

    

    p_spam_given_message = n_spam
    p_ham_given_message = n_ham
      
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
        
        

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal proabilities, have a human classify this!')

Let's see how the classifier does for the following messages which to the human eye seem to clearly be not spam and spam respectively.

In [49]:
print(test_set['SMS'][0])
print(test_set['SMS'][1112])

Later i guess. I needa do mcat study too.
Text & meet someone sexy today. U can find a date or even flirt its up to U. Join 4 just 10p. REPLY with NAME & AGE eg Sam 25. 18 -msg recd@thirtyeight pence


In [50]:
naive_bayes(test_set['SMS'][0])
naive_bayes(test_set['SMS'][1112])

P(Spam|message): 7.448657242146347e-30
P(Ham|message): 6.65397533521218e-24
Label: Ham
P(Spam|message): 1.2972615438064536e-97
P(Ham|message): 2.543250046020176e-106
Label: Spam


Now lets change the above function a bit so it returns the classification label rather than just printing it.

In [51]:
def naive_bayes_classifier(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()

    

    p_spam_given_message = n_spam
    p_ham_given_message = n_ham
      
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
        
        

    

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'requires human classification'


We can now apply our classifier to the test set as a whole. We will add a `predicted` colum to the `test_set` dataframe. We'll then also add a Boolean column `correct` to keep track of whether or not the classifier chose the correct label.

In [63]:
test_set['predicted']=test_set['SMS'].apply(naive_bayes_classifier)

In [64]:
test_set

Unnamed: 0,Labels,SMS,predicted,correct
0,ham,Later i guess. I needa do mcat study too.,ham,True
1,ham,But i haf enuff space got like 4 mb...,ham,True
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam,True
3,ham,All sounds good. Fingers . Makes it difficult ...,ham,True
4,ham,"All done, all handed in. Don't know if mega sh...",ham,True
...,...,...,...,...
1109,ham,"We're all getting worried over here, derek and...",ham,True
1110,ham,Oh oh... Den muz change plan liao... Go back h...,ham,True
1111,ham,CERI U REBEL! SWEET DREAMZ ME LITTLE BUDDY!! C...,ham,True
1112,spam,Text & meet someone sexy today. U can find a d...,spam,True


In [58]:
test_set['correct']=test_set['Labels']==test_set['predicted']

In [59]:
test_set

Unnamed: 0,Labels,SMS,predicted,correct
0,ham,Later i guess. I needa do mcat study too.,ham,True
1,ham,But i haf enuff space got like 4 mb...,ham,True
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam,True
3,ham,All sounds good. Fingers . Makes it difficult ...,ham,True
4,ham,"All done, all handed in. Don't know if mega sh...",ham,True
...,...,...,...,...
1109,ham,"We're all getting worried over here, derek and...",ham,True
1110,ham,Oh oh... Den muz change plan liao... Go back h...,ham,True
1111,ham,CERI U REBEL! SWEET DREAMZ ME LITTLE BUDDY!! C...,ham,True
1112,spam,Text & meet someone sexy today. U can find a d...,spam,True


In [65]:
accuracy=test_set['correct'].sum()/test_set.shape[0]
accuracy

0.9865350089766607