# SMS Spam Filter

In this project, we use a dataset of SMS messages that have been labeled 'spam' or 'ham' (not spam) to build a spam filter with the Naive Bayes algorithm. Our goal is an accuracy above 80%, meaning that more than 80% of new messages are classified correctly.

The dataset comes from Tiago A. Almeida and José María Gómez Hidalgo, and can be found [here](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). It includes 5,572 SMS messages.

## Data Exploration

Start by reading the dataset and getting to know it.

In [1]:
import pandas as pd

sms = pd.read_csv('SMSSpamCollection', sep = '\t', header=None, names=['Label','SMS'])

print(sms.shape)
sms.head()

(5572, 2)


| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [2]:
print(sms['Label'].value_counts())
print(sms['Label'].value_counts(normalize=True))

ham     4825
spam     747
Name: Label, dtype: int64
ham     0.865937
spam    0.134063
Name: Label, dtype: float64


Above we can see that about 13% of the messages are spam. This isn't surprising, since most messages people receive are genuine (ham).
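
As a point of reference for the 80% goal, it helps to know how well a filter would do if it simply labeled every message as the majority class. The short sketch below works this out from the class counts printed above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a trivial filter that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 4))
```

Under these counts, always predicting ham would already be right about 86.6% of the time, so accuracy alone should be read alongside how the filter performs on the spam messages specifically.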

## Training Set and Testing Set

We're going to split our data into two sets - one to train the program and one to test the accuracy.

We will use 80% for training and 20% for testing.

In [3]:
randomized = sms.sample(frac=1,random_state=1)

training_index = round(len(randomized)*0.8)

train = randomized[:training_index].reset_index(drop=True)
test = randomized[training_index:].reset_index(drop=True)

print(train.shape)
print(test.shape)



(4458, 2)
(1114, 2)


In [4]:
print(train.Label.value_counts(normalize=True))
print(test.Label.value_counts(normalize=True))

ham     0.86541
spam    0.13459
Name: Label, dtype: float64
ham     0.868043
spam    0.131957
Name: Label, dtype: float64


The proportion of spam is about 13% in both the training and test sets, which matches the overall dataset.

## Data Cleaning

### Punctuation and Case

To process the data effectively, we replace punctuation with spaces and convert every message to lowercase:

In [5]:
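# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
train.SMS = train.SMS.str.replace(r'\W', ' ', regex=True).str.lower()
test.SMS = test.SMS.str.replace(r'\W', ' ', regex=True).str.lower()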

In [6]:
train.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |


### Creating Vocabulary

We need a list of all the unique words in our training set - we'll call this the vocabulary of the training set.

In [7]:
train.SMS = train.SMS.str.split()

In [8]:
train.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... |
| 2 | ham | [welp, apparently, he, retired] |
| 3 | ham | [havent] |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... |


In [9]:
vocab = []

for msg in train.SMS:
    for word in msg:
        vocab.append(word)
vocab = list(set(vocab))

In [10]:
len(vocab)

7783

We have 7,783 unique words in our training set of SMS messages.

### Data Transformation

We now need to turn each word in the vocabulary into its own column, where each row holds the number of times that word appears in the corresponding message.

In [11]:
word_counts_per_sms = {x: [0] * len(train.SMS) for x in vocab}

for index, msg in enumerate(train.SMS):
    for word in msg:
        word_counts_per_sms[word][index] +=1

In [12]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

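| | blimey | mornings | interested | acnt | following | islands | disconnect | girlie | ah | uncle | ... | intend | concern | brain | proof | manage | mandy | wish | 09071512433 | recession | finn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |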


In [13]:
train_clean = pd.concat([train, word_counts],axis=1)
train_clean.head()

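| | Label | SMS | blimey | mornings | interested | acnt | following | islands | disconnect | girlie | ... | intend | concern | brain | proof | manage | mandy | wish | 09071512433 | recession | finn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |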
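
Because the real table is almost entirely zeros, the transformation can be hard to see. The sketch below repeats the same steps on three made-up messages; the `toy` data and all `toy_*` names are hypothetical and only meant to illustrate the idea.

```python
import pandas as pd

# Hypothetical toy messages, already cleaned and split into words.
toy = pd.DataFrame({
    "Label": ["ham", "spam", "ham"],
    "SMS": [["see", "you", "soon"], ["win", "money", "win"], ["you", "win"]],
})

# Vocabulary: every unique word (sorted here only for a stable column order).
toy_vocab = sorted({word for msg in toy["SMS"] for word in msg})

# One list of counts per word, with one slot per message.
toy_counts = {word: [0] * len(toy) for word in toy_vocab}
for index, msg in enumerate(toy["SMS"]):
    for word in msg:
        toy_counts[word][index] += 1

# Attach the word-count columns to the original labels and messages.
toy_clean = pd.concat([toy, pd.DataFrame(toy_counts)], axis=1)
print(toy_clean)
```

In this toy table the `win` column holds 0, 2 and 1, one count per message, which is the same layout the real training table uses across all 7,783 words.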


## Creating the Spam Filter

### Constants

The Naive Bayes algorithm relies on resolving these two equations in order to classify new messages:

$ P(Spam| w_1,w_2,...,w_n) \propto P(Spam) * \prod_{i=1}^{n} P(w_i|Spam) $

$ P(Ham| w_1,w_2,...,w_n) \propto P(Ham) * \prod_{i=1}^{n} P(w_i|Ham) $

In order to calculate $P(w_i|Ham)$ and $P(w_i|Spam)$, we need to use these equations:

$ P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha * N_{Vocabulary}} $

$ P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha * N_{Vocabulary}} $

We must calculate the following constants in order to use the Naive Bayes algorithm:

- P(Spam)
- P(Ham)
- N(Spam) -- note that this is the total number of words in all spam messages
- N(Ham) -- note that this is the total number of words in all ham messages
- N(Vocabulary)
- Alpha
    
 We will set $\alpha = 1$ and calculate the remaining constants below:

In [14]:
p_spam = train_clean.Label.value_counts(normalize= True)['spam']
p_ham = train_clean.Label.value_counts(normalize= True)['ham']


spam = train_clean[train_clean.Label=='spam']
ham = train_clean[train_clean.Label=='ham']

n_spam = 0

for x in spam['SMS']:
    n_spam += len(x)
    
n_ham = 0
for x in ham['SMS']:
    n_ham += len(x)

n_vocab = len(vocab)
alpha=1

### Parameters

Now that we've found our constants, we need to calculate our parameters $ P(w_i|Spam) $ and $ P(w_i|Ham) $. Each parameter is the conditional probability of one word in our vocabulary given a class. The formulas used are as follows:

$ P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam}+\alpha*N_{Vocabulary}} $

$ P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham}+\alpha*N_{Vocabulary}} $

In [15]:
param_spam = {x:0 for x in vocab}
param_ham = {x:0 for x in vocab}

for word in vocab:
    p_word_given_spam = (spam[word].sum() + alpha) / (n_spam + alpha*n_vocab)
    param_spam[word] = p_word_given_spam

    p_word_given_ham = (ham[word].sum() + alpha) / (n_ham + alpha*n_vocab)
    param_ham[word] = p_word_given_ham
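
To see what the smoothing term does, it can help to work the formula by hand on a tiny example. The sketch below uses three made-up messages and hypothetical names (`spam_msgs`, `ham_msgs`, `smoothed`, and the `*_toy` constants); it is an illustration of the formula, not a replacement for the cell above.

```python
from collections import Counter

# Hypothetical toy training data (already cleaned and tokenized).
spam_msgs = [["win", "money", "win"]]
ham_msgs = [["see", "you", "soon"], ["you", "win"]]

# Word counts per class; a Counter returns 0 for words it has never seen.
spam_word_counts = Counter(w for msg in spam_msgs for w in msg)
ham_word_counts = Counter(w for msg in ham_msgs for w in msg)

toy_vocab = set(spam_word_counts) | set(ham_word_counts)
n_vocab_toy = len(toy_vocab)                  # 5 unique words
n_spam_toy = sum(spam_word_counts.values())   # 3 words in spam messages
n_ham_toy = sum(ham_word_counts.values())     # 5 words in ham messages
alpha_toy = 1


def smoothed(word_count, n_class):
    """P(word | class) with additive (Laplace) smoothing."""
    return (word_count + alpha_toy) / (n_class + alpha_toy * n_vocab_toy)


p_win_spam = smoothed(spam_word_counts["win"], n_spam_toy)  # (2 + 1) / (3 + 5)
p_win_ham = smoothed(ham_word_counts["win"], n_ham_toy)     # (1 + 1) / (5 + 5)
p_see_spam = smoothed(spam_word_counts["see"], n_spam_toy)  # (0 + 1) / (3 + 5)

print(p_win_spam, p_win_ham, p_see_spam)
```

Note that `see` never appears in a toy spam message, yet its probability given spam is 1/8 rather than 0. Without the smoothing term, a single unseen word would zero out the whole product for that class.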

## Classify Function

The following function uses the Naive Bayes algorithm to classify incoming text messages based on the classifications in the original training data.

In [16]:
import re

def classify(message):

    # Clean incoming words to remove punctuation, lower the case and split into a list of words
    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower()
    message = message.split()
    
    # Initialize probabilities of spam and ham
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    
    # Loop over each word in the input message and if the word is present in our vocab, 
    # multiply p_spam and p_ham by the parameter for that word
    for word in message:
        if word in param_spam:
            p_spam_given_message *= param_spam[word]
        if word in param_ham:
            p_ham_given_message *= param_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

### Initial Check

Before we test the algorithm with our test data, let's do a quick sanity check on two messages:

In [17]:
classify ('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [18]:
classify ("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
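
The probabilities printed above are already on the order of 1e-25 for short messages. For very long inputs, multiplying many small numbers can underflow to 0.0 in floating point, which would make both scores tie. A common workaround, which this notebook does not use, is to add logarithms instead of multiplying. The sketch below shows that variant under stated assumptions: `classify_log` is a hypothetical helper, and the `toy_*` priors and parameters are made-up values for illustration only.

```python
import math
import re


def classify_log(message, p_spam, p_ham, param_spam, param_ham):
    """Return 'spam', 'ham' or 'needs human' by comparing log-scores."""
    words = re.sub(r'\W', ' ', message).lower().split()

    # log(a * b) = log(a) + log(b), so the products become sums.
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        if word in param_spam:
            log_spam += math.log(param_spam[word])
        if word in param_ham:
            log_ham += math.log(param_ham[word])

    if log_ham > log_spam:
        return 'ham'
    elif log_ham < log_spam:
        return 'spam'
    return 'needs human'


# Hypothetical toy priors and parameters, for illustration only.
toy_p_spam, toy_p_ham = 1 / 3, 2 / 3
toy_param_spam = {'win': 0.375, 'money': 0.25, 'see': 0.125, 'you': 0.125, 'soon': 0.125}
toy_param_ham = {'win': 0.2, 'money': 0.1, 'see': 0.2, 'you': 0.3, 'soon': 0.2}

long_message = 'win ' * 1000
print(0.375 ** 1000, 0.2 ** 1000)  # the plain products both underflow to 0.0
print(classify_log(long_message, toy_p_spam, toy_p_ham, toy_param_spam, toy_param_ham))
```

Because the logarithm is monotonic, comparing log-scores gives the same label as comparing the original products whenever those products are representable, so this changes the arithmetic but not the decision rule.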


### Testing the Filter

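The results of our initial check look good, but let's evaluate the filter more thoroughly on our test set. The function below applies the same logic as `classify`, but returns the label instead of printing it.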

In [19]:
def classify_test(message):

    # Clean incoming words to remove punctuation, lower the case and split into a list of words
    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower()
    message = message.split()
    
    # Initialize probabilities of spam and ham
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    
    # Loop over each word in the input message and if the word is present in our vocab, 
    # multiply p_spam and p_ham by the parameter for that word
    for word in message:
        if word in param_spam:
            p_spam_given_message *= param_spam[word]
        if word in param_ham:
            p_ham_given_message *= param_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human'

Let's use this function to add a column to our test set:

In [20]:
test['prediction'] = test.SMS.apply(classify_test)
test.head()

| | Label | SMS | prediction |
|---|---|---|---|
| 0 | ham | later i guess i needa do mcat study too | ham |
| 1 | ham | but i haf enuff space got like 4 mb | ham |
| 2 | spam | had your mobile 10 mths update to latest oran... | spam |
| 3 | ham | all sounds good fingers makes it difficult ... | ham |
| 4 | ham | all done all handed in don t know if mega sh... | ham |


Now that we have a prediction and a label for each item in the test set, how accurate is the spam filter?

In [21]:
test['correct'] = test['Label']==test['prediction']

print('Accuracy: ',test.correct.sum()/len(test.correct))

Accuracy:  0.9874326750448833


The accuracy is about 98.7%, which is well above our 80% goal.

Let's look at the messages that were classified incorrectly:

In [22]:
print(test[~test.correct][['Label','SMS']])

    Label                                                SMS
114  spam  not heard from u4 a while  call me now am here...
135  spam  more people are dogging in your area now  call...
152   ham                  unlimited texts  limited minutes 
159   ham                                       26th of july
284   ham                             nokia phone is lovly  
293   ham  a boy loved a gal  he propsd bt she didnt mind...
302   ham                   no calls  messages  missed calls
319   ham  we have sent jd for customer service cum accou...
504  spam  oh my god  i ve found your number again  i m s...
546  spam  hi babe its chloe  how r u  i was smashed on s...
741  spam  0a networks allow companies to bill for sms  s...
876  spam           rct  thnq adrian for u text  rgds vatian
885  spam                                      2 2 146tf150p
953  spam  hello  we need some posh birds and chaps to us...


## Next Steps

In this project, we built a spam filter that correctly classified 98.7% of the messages in our test data.

Next steps could include diagnosing the issues that caused incorrect classifications. Of the 14 misclassified messages listed above, 6 are true ham and 8 are true spam, so the filter makes mistakes in both directions and it would be interesting to understand what causes each kind.
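
One way to start that diagnosis is to tabulate true labels against predictions, so errors on ham and errors on spam are counted separately. The sketch below uses `pd.crosstab` on a small made-up `results` table; the data and the `spam_recall` name are hypothetical stand-ins, and the same call could be tried on the real `test` DataFrame with its `Label` and `prediction` columns.

```python
import pandas as pd

# Hypothetical labels and predictions standing in for the real test set.
results = pd.DataFrame({
    "Label":      ["ham", "ham", "spam", "spam", "ham", "spam"],
    "prediction": ["ham", "spam", "spam", "ham", "ham", "spam"],
})

# Rows are true labels, columns are predicted labels.
confusion = pd.crosstab(results["Label"], results["prediction"])
print(confusion)

# Share of true spam that the filter caught.
spam_recall = confusion.loc["spam", "spam"] / confusion.loc["spam"].sum()
print(spam_recall)
```

Reading the off-diagonal cells shows which direction the mistakes go: ham flagged as spam is usually the more costly error for a user, since a genuine message gets hidden.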