# Building a Spam Filter with Naive Bayes

In this project, we build a spam filter for SMS messages.
To classify a message as spam or non-spam, the computer:
- Learns how humans classify messages.
- Uses that human knowledge to estimate two probabilities for each new message: the probability that it is spam and the probability that it is non-spam.
- Classifies the new message by comparing these two values. If the probability for spam is greater, it classifies the message as spam; if the probability for non-spam is greater, it classifies it as non-spam. If the two values are equal, we may need a human to classify the message.

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.
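
For reference, the multinomial Naive Bayes rule that the later cells implement can be summarized as follows. The notation is chosen here to match the variable names used in the code further down ($N_{w_i \mid Spam}$ is the number of times word $w_i$ occurs in spam messages, $N_{Spam}$ the total number of words in spam messages, $N_{Vocabulary}$ the vocabulary size, and $\alpha$ the smoothing parameter). This is a summary sketch, not a full derivation.

```latex
\begin{align*}
P(Spam \mid w_1, \dots, w_n) &\propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam) \\
P(Ham \mid w_1, \dots, w_n) &\propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham) \\
P(w_i \mid Spam) &= \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i \mid Ham) &= \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{align*}
```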

You can download the dataset from [here](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection).

## Exploring the Dataset

In [1]:
import pandas as pd

# The file is tab-separated and has no header row, so we name the columns ourselves
sms_spam = pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=["Label", "SMS"])
print(sms_spam.shape)
sms_spam["Label"].value_counts(normalize=True)*100

(5572, 2)


ham     86.593683
spam    13.406317
Name: Label, dtype: float64

## Training and Test Set

We're now going to split our dataset into a training and a test set, where the training set accounts for 80% of the data, and the test set for the remaining 20%.

In [2]:
# Randomize the dataset
data_randomized = sms_spam.sample(frac=1, random_state=1)

# Calculate index for split
training_test_index = round(len(data_randomized) * 0.8)

# Training/Test split
training_set = data_randomized[:training_test_index].reset_index(drop=True)
test_set = data_randomized[training_test_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


We'll now analyze the percentage of spam and ham messages in the training and test sets. We expect the percentages to be close to what we have in the full dataset, where about 87% of the messages are ham, and the remaining 13% are spam.

In [3]:
training_set['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [4]:
test_set['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

## Data Cleaning

In [5]:
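# Replace every non-word character (punctuation, symbols) with a space, then lowercase.
# regex=True is required on recent pandas versions, where str.replace defaults to literal matching.
training_set["SMS"] = training_set["SMS"].str.replace(r"\W", " ", regex=True)
training_set["SMS"] = training_set["SMS"].str.lower()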
training_set.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |
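
To make the cleaning step concrete, here is a minimal sketch that applies the same two operations (replace non-word characters with spaces, then lowercase) to a single string using the standard `re` module instead of pandas. The sample message is one that is classified later in this notebook; the variable names `raw` and `cleaned` are illustrative only.

```python
import re

raw = "WINNER!! This is the secret code to unlock the money: C3421."

# \W matches anything that is not a letter, digit, or underscore
cleaned = re.sub(r"\W", " ", raw).lower()

# split() with no arguments also collapses the runs of spaces left behind
tokens = cleaned.split()
print(tokens)
```

Note that digits survive the cleaning, so `c3421` stays in the message as a single token.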


In [6]:
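# Split each message into a list of words
training_set["SMS"] = training_set["SMS"].str.split()

# Collect every word from every message
vocabulary = []
for sms in training_set["SMS"]:
    for word in sms:
        vocabulary.append(word)

# Deduplicate through a set so each word appears once
vocabulary = list(set(vocabulary))
print(vocabulary)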



## The Final Training Set

In [7]:
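# One list of zeros per vocabulary word, with one entry per message
word_counts_per_sms = {unique_word: [0] * len(training_set["SMS"]) for unique_word in vocabulary}

# Increment the count of each word at the index of the message it appears in
for index, sms in enumerate(training_set["SMS"]):
    for word in sms:
        word_counts_per_sms[word][index] += 1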
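
The structure built by this cell is easier to see on a tiny input. The sketch below repeats the same logic on two made-up tokenized messages; `toy_messages`, `toy_vocabulary`, and `toy_counts` are hypothetical names used only for this illustration.

```python
# Two made-up, already tokenized messages
toy_messages = [["secret", "prize", "secret"], ["see", "you"]]

# Unique words across all messages (sorted only to make the output stable)
toy_vocabulary = sorted({word for message in toy_messages for word in message})

# One list per word, one slot per message, all starting at zero
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}

for index, message in enumerate(toy_messages):
    for word in message:
        toy_counts[word][index] += 1

print(toy_counts)
```

Each key becomes a column and each list position becomes a row once the dictionary is passed to `pd.DataFrame`.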

In [8]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

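| | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | 0207 | 02072069400 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |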


In [9]:
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()

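| | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |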


## Calculating Constants First

In [10]:
# Isolating spam and ham messages first
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(training_set_clean)
p_ham = len(ham_messages) / len(training_set_clean)

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

## Calculating Parameters

In [12]:
# Laplace-smoothed P(w_i|Spam) and P(w_i|Ham) for every word in the vocabulary
p_wi_spam = {word: (spam_messages[word].sum() + alpha) / (n_spam + alpha * n_vocabulary) for word in vocabulary}
p_wi_ham = {word: (ham_messages[word].sum() + alpha) / (n_ham + alpha * n_vocabulary) for word in vocabulary}
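
As a worked example of the smoothed estimate above, the following sketch computes the parameters by hand for a three-word vocabulary. All counts are made up, and the `toy_` names are hypothetical; only the formula is taken from the cell above.

```python
alpha = 1  # Laplace smoothing, same value as in the notebook

# Made-up counts of how often each vocabulary word occurs in spam messages
toy_spam_counts = {"money": 3, "secret": 1, "hello": 0}

n_spam_toy = sum(toy_spam_counts.values())   # total words in spam messages: 4
n_vocabulary_toy = len(toy_spam_counts)      # vocabulary size: 3

p_wi_spam_toy = {
    word: (count + alpha) / (n_spam_toy + alpha * n_vocabulary_toy)
    for word, count in toy_spam_counts.items()
}
print(p_wi_spam_toy)
```

The word `hello` never occurs in the toy spam messages, yet it receives probability 1/7 rather than 0. Without smoothing, a single unseen word would force the whole product for a message to zero.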

## Classifying A New Message

In [22]:
import re

def classify(message):
    # Clean the message the same way as the training data
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Multiply the per-word probabilities for each class.
    # Words that are not in the training vocabulary are skipped.
    p_word_given_spam = 1
    p_word_given_ham = 1
    for word in message:
        if word in p_wi_spam:
            p_word_given_spam *= p_wi_spam[word]
        if word in p_wi_ham:
            p_word_given_ham *= p_wi_ham[word]

    # These values are proportional to P(Spam|message) and P(Ham|message)
    p_spam_given_message = p_spam * p_word_given_spam
    p_ham_given_message = p_ham * p_word_given_ham

    #print('P(Spam|message):', p_spam_given_message)
    #print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [23]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

'spam'

In [24]:
classify('Sounds good, Tom, then see u there')

'ham'
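
Multiplying many small probabilities can underflow to zero for long messages, which would push both classes to the same value. A common variant, not used in this notebook, sums logarithms instead. The sketch below shows the idea with hypothetical toy parameters (`toy_p_spam`, `toy_p_wi_spam`, and so on) standing in for the values computed above, so that it runs on its own.

```python
import math
import re

# Hypothetical toy parameters; in the notebook these would be
# p_spam, p_ham, p_wi_spam and p_wi_ham
toy_p_spam, toy_p_ham = 0.2, 0.8
toy_p_wi_spam = {"money": 0.5, "see": 0.1, "you": 0.4}
toy_p_wi_ham = {"money": 0.1, "see": 0.4, "you": 0.5}

def classify_log(message):
    words = re.sub(r"\W", " ", message).lower().split()

    # Start from the log priors and add the log of each word probability
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        if word in toy_p_wi_spam:
            log_spam += math.log(toy_p_wi_spam[word])
        if word in toy_p_wi_ham:
            log_ham += math.log(toy_p_wi_ham[word])

    # The logarithm is monotonic, so comparing sums gives the same decision
    if log_ham > log_spam:
        return "ham"
    elif log_ham < log_spam:
        return "spam"
    return "needs human classification"

print(classify_log("Money money money!"))
print(classify_log("See you"))
```

Because the parameters are Laplace-smoothed, none of them is zero, so `math.log` is always defined here.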

## Measuring the Spam Filter's Accuracy

In [26]:
test_set['predicted'] = test_set['SMS'].apply(classify)
test_set.head()

| | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [34]:
correct = (test_set["Label"] == test_set["predicted"]).sum()
total = len(test_set)
accuracy = correct/total

print('correct:',correct)
print("total:",total)
print("accuracy:",accuracy)

correct: 1100
total: 1114
accuracy: 0.9874326750448833
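
Since about 87% of the messages are ham, a filter that always answers "ham" would already reach roughly 87% accuracy, so it can be informative to look at precision and recall for the spam class as well. The sketch below shows one way to compute them; the label lists are made up for illustration, and in the notebook they would come from `test_set['Label']` and `test_set['predicted']`.

```python
from collections import Counter

# Made-up true and predicted labels, for illustration only
labels    = ["ham", "ham", "spam", "spam", "ham", "spam"]
predicted = ["ham", "spam", "spam", "ham", "ham", "spam"]

# Count each (true label, predicted label) pair
confusion = Counter(zip(labels, predicted))

true_pos = confusion[("spam", "spam")]   # spam correctly flagged
false_pos = confusion[("ham", "spam")]   # ham wrongly flagged as spam
false_neg = confusion[("spam", "ham")]   # spam that slipped through

precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)

print("precision:", precision)
print("recall:", recall)
```

Precision answers "of the messages flagged as spam, how many really were spam", while recall answers "of the real spam messages, how many were caught".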
