# Detecting Spam Messages - a Naive Bayes Algorithm Filter
In this project we will build an algorithm to detect spam messages. It is trained on 5,572 human-classified SMS messages using the Naive Bayes algorithm.<br>
The algorithm classifies new messages based on the words that commonly appear in spam and non-spam messages. On the test set used in this project, it reaches 98.7% accuracy. At the end of the project, we will apply the algorithm to real-life texts and see how it holds up in realistic situations.<br>
The text dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

## Exploring the Dataset
We'll begin by reading in the dataset and getting a general overview.

In [131]:
import pandas as pd
texts = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
display(texts.head())
display(texts.shape)
display(texts['Label'].value_counts(normalize=True))

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


(5572, 2)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

We can see that the dataset consists of 5,572 messages, roughly 87% of which are non-spam. This imbalance is plausible, as most texts an average person receives are not spam.

## Train and test sets
Now we'll split our dataset into two parts: a train set with which we will build our algorithm, and a test set to test its accuracy. We'll use 80% for our train set and the remaining 20% for testing.

In [2]:
#Randomizing the data
texts = texts.sample(frac=1, random_state=1)
texts_train = texts[:4458].reset_index(drop=True)
texts_test = texts[4458:].reset_index(drop=True)

In [3]:
texts_train['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [4]:
texts_test['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

We can see that the proportions of spam and non-spam messages in both sets remain mostly unchanged, which reassures us that neither part is skewed.
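
The cutoff of 4,458 rows used above corresponds to 80% of the 5,572 messages, rounded to the nearest row. As a minimal sketch, the same cutoff could be computed from the data instead of being hard-coded. The small DataFrame `toy` and the names `split_at`, `toy_train` and `toy_test` are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the SMS dataset (made-up rows, for illustration only)
toy = pd.DataFrame({
    'Label': ['ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham'],
    'SMS': ['msg {}'.format(i) for i in range(10)],
})

# Shuffle, then cut at 80% of the rows instead of using a hard-coded index
shuffled = toy.sample(frac=1, random_state=1)
split_at = round(len(shuffled) * 0.8)
toy_train = shuffled[:split_at].reset_index(drop=True)
toy_test = shuffled[split_at:].reset_index(drop=True)

print(len(toy_train), len(toy_test))
```

For the full dataset, `round(5572 * 0.8)` gives 4458, which matches the index used in the split above.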

## Cleaning and Preparing the Training Set
From here on out (until the testing phase) we'll be working with the training set. First we'll clean and reformat the dataset to prepare it for the algorithm.

In [5]:
#Reformatting the texts into a series of lists
texts_train['SMS'] = texts_train['SMS'].str.replace(r'\W', ' ', regex=True).str.lower().str.split()

In [6]:
texts_train['SMS'].head()

0                    [yep, by, the, pretty, sculpture]
1    [yes, princess, are, you, going, to, make, me,...
2                      [welp, apparently, he, retired]
3                                             [havent]
4    [i, forgot, 2, ask, ü, all, smth, there, s, a,...
Name: SMS, dtype: object

In [7]:
#Building a vocabulary of all unique words found in the dataset
vocabulary = []
for row in texts_train['SMS']:
    for word in row:
        vocabulary.append(word)
vocabulary = set(vocabulary)
vocabulary = list(vocabulary)

In [8]:
# Creating a dictionary (and dataframe) of frequency counts of each word in all messages
word_counts_per_sms = {unique_word: [0] * len(texts_train['SMS']) for unique_word in vocabulary}
for index, sms in enumerate(texts_train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
word_counts = pd.DataFrame(word_counts_per_sms)

In [9]:
texts_train = pd.concat([texts_train, word_counts], axis=1)
texts_train.head()

Unnamed: 0,Label,SMS,october,lul,kl341,tb,mins,genuine,badrith,propsd,...,lag,facilities,comment,820554ad0a1705572711,round,finding,challenging,educational,pack,become
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


At this point we have transformed our dataset so that every unique word in the vocabulary has its own column, holding the number of times that word appears in each message. This will let us look up word counts per class when building our algorithm.
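
To make the transformation concrete, here is a small sketch that applies the same counting approach to two made-up messages. The names `toy_sms`, `toy_vocab` and `toy_counts` are illustrative only.

```python
import pandas as pd

# Two made-up messages, already cleaned and split into words
toy_sms = pd.Series([['free', 'prize', 'free'], ['see', 'you', 'soon']])

# Same vocabulary-and-counts approach as above, on a tiny scale
toy_vocab = sorted({word for sms in toy_sms for word in sms})
toy_counts = {word: [0] * len(toy_sms) for word in toy_vocab}
for index, sms in enumerate(toy_sms):
    for word in sms:
        toy_counts[word][index] += 1

# One row per message, one column per unique word
print(pd.DataFrame(toy_counts))
```

The first row holds 2 under `free` and 1 under `prize`, while the second row holds 1 under each of `see`, `soon` and `you`.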

## Building the Algorithm
We are now ready to create our spam filter.

We'll begin by computing the constants that are reused for every new message:
- p_spam - the probability of receiving a spam message
- p_ham - the probability of receiving a non-spam message
- n_spam - the number of (non-unique) words in spam messages
- n_ham - the number of (non-unique) words in non-spam messages
- n_vocabulary - the number of unique words in all messages
- alpha - for additive smoothing
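
For reference, these constants enter the usual multinomial Naive Bayes formulas with additive smoothing roughly as follows. The symbols are notation introduced here, not taken from the notebook: $P(Spam)$ corresponds to `p_spam`, $N_{Spam}$ to `n_spam`, $N_{Vocabulary}$ to `n_vocabulary`, and $N_{w_i \mid Spam}$ to the number of times word $w_i$ occurs in spam messages. The non-spam case is analogous.

```latex
P(Spam \mid w_1, \dots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)

P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
```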

In [10]:
# Class priors, looked up by label rather than by position
class_shares = texts_train['Label'].value_counts(normalize=True)
p_spam = class_shares['spam']
p_ham = class_shares['ham']
spam_msg = texts_train[texts_train['Label'] == 'spam']
ham_msg = texts_train[texts_train['Label'] == 'ham']
n_spam = spam_msg['SMS'].apply(len).sum()
n_ham = ham_msg['SMS'].apply(len).sum()
n_vocabulary = len(vocabulary)
alpha = 1

We'll now build our algorithm by creating two dictionaries that hold, for every word in the vocabulary, the conditional probability of that word being part of a spam or a non-spam message. When we receive a new message to classify, all of these values will already be computed, which greatly speeds up the process.

In [12]:
spam_dict = {unique_word: 0 for unique_word in vocabulary}
ham_dict = {unique_word: 0 for unique_word in vocabulary}

In [13]:
for word in vocabulary:
    n_word_spam = spam_msg[word].sum()
    p_word_given_spam = (n_word_spam + alpha) / (n_spam + alpha * n_vocabulary)
    spam_dict[word] = p_word_given_spam
    
    n_word_ham = ham_msg[word].sum()
    p_word_given_ham = (n_word_ham + alpha) / (n_ham + alpha * n_vocabulary)
    ham_dict[word] = p_word_given_ham

To illustrate what we have just created, we'll display a few words in each dictionary:

In [40]:
display(dict(list(spam_dict.items())[:3]))
display(dict(list(ham_dict.items())[:3]))

{'october': 4.3529360553693465e-05,
 'lul': 4.3529360553693465e-05,
 'kl341': 0.0001305880816610804}

{'october': 3.075976622577668e-05,
 'lul': 3.075976622577668e-05,
 'kl341': 1.537988311288834e-05}

Each word now has an associated number in both the spam and ham dictionaries. This number is the (smoothed) conditional probability of seeing that word given that the message is spam or non-spam.
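
The effect of the smoothing term can be seen in a small worked example. The sketch below uses made-up counts, and `smoothed_probability` is a hypothetical helper that mirrors the formula in the loop above.

```python
def smoothed_probability(n_word_in_class, n_class, n_vocabulary, alpha=1):
    """P(word | class) with additive (Laplace) smoothing."""
    return (n_word_in_class + alpha) / (n_class + alpha * n_vocabulary)

# Made-up counts: a class with 1,000 words in total and a 500-word vocabulary
seen_twice = smoothed_probability(2, 1000, 500)   # (2 + 1) / 1500
never_seen = smoothed_probability(0, 1000, 500)   # (0 + 1) / 1500

print(seen_twice, never_seen)
```

The word that never occurs in the class still gets a small non-zero probability, so a single unseen word cannot zero out the whole product. In the output above, `kl341` has exactly three times the spam probability of `october`, which is consistent with `kl341` appearing twice in spam messages and `october` not at all.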

With this algorithm, we'll create a function that takes in the new message and returns the classification - either 'ham', 'spam', or 'requires human clarification' (in case there is equal chance it is a spam or non-spam message).

In [14]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]
        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'requires human clarification'
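
One caveat of multiplying many small probabilities is that, for very long messages, both products can underflow to 0.0 and end in an artificial tie. A common workaround, which the original notebook does not use, is to sum logarithms instead. The sketch below shows the idea with made-up dictionaries. `classify_log` and all `toy_` names and values are assumptions for illustration.

```python
import math
import re

# Made-up stand-ins for p_spam, p_ham, spam_dict and ham_dict
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_spam_dict = {'free': 0.02, 'prize': 0.01, 'call': 0.005, 'home': 0.0005}
toy_ham_dict = {'free': 0.001, 'prize': 0.0002, 'call': 0.004, 'home': 0.006}

def classify_log(message):
    """Same decision rule as classify, but summing log-probabilities."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify
        if word in toy_spam_dict:
            log_spam += math.log(toy_spam_dict[word])
        if word in toy_ham_dict:
            log_ham += math.log(toy_ham_dict[word])

    if log_ham > log_spam:
        return 'ham'
    elif log_ham < log_spam:
        return 'spam'
    return 'requires human clarification'

print(classify_log('FREE prize!! Call now'))
print(classify_log('Call me when you get home'))
```

Because the logarithm is monotonic, comparing the sums gives the same decision as comparing the products whenever the products do not underflow.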

## Measuring Classification Accuracy
Now that we've created our algorithm, we'll put it to the test using our test set. Our algorithm should have at least an 80% accuracy rate for it to be considered reliable.

In [15]:
texts_test['predicted'] = texts_test['SMS'].apply(classify)
texts_test.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [55]:
# Counting the predictions that match the human labels
total = len(texts_test)
correct = (texts_test['Label'] == texts_test['predicted']).sum()
accuracy = correct / total
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', '{}%'.format(round(accuracy, 3) * 100))

Correct: 1100
Incorrect: 14
Accuracy: 98.7%


The accuracy rate is over 98%, which is well above our threshold. Overall, the algorithm misclassified only 14 out of 1,114 messages.

Below, I'll list the 14 instances where the prediction did not match the label. We can see that one message was returned as 'requires human clarification', and the rest were assigned the wrong class.

In [91]:
# Keeping only the rows where the prediction differs from the label
mismatch = texts_test['Label'] != texts_test['predicted']
misclassified = texts_test[mismatch].reset_index(drop=True)
misclassified

Unnamed: 0,Label,SMS,predicted
0,spam,Not heard from U4 a while. Call me now am here...,ham
1,spam,More people are dogging in your area now. Call...,ham
2,ham,Unlimited texts. Limited minutes.,spam
3,ham,26th OF JULY,spam
4,ham,Nokia phone is lovly..,spam
5,ham,A Boy loved a gal. He propsd bt she didnt mind...,requires human clarification
6,ham,No calls..messages..missed calls,spam
7,ham,We have sent JD for Customer Service cum Accou...,spam
8,spam,Oh my god! I've found your number again! I'm s...,ham
9,spam,"Hi babe its Chloe, how r u? I was smashed on s...",ham
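
Since roughly 87% of the test messages are ham, accuracy alone can hide how much spam slips through. A cross-tabulation of labels against predictions separates the two kinds of error. The sketch below uses made-up labels and predictions. `toy_results`, `confusion`, `recall` and `precision` are illustrative names, and this breakdown is not part of the original analysis.

```python
import pandas as pd

# Made-up labels and predictions, for illustration only
toy_results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'ham', 'spam', 'spam', 'ham', 'spam', 'ham'],
    'predicted': ['ham', 'ham', 'spam', 'spam', 'ham', 'ham', 'spam', 'ham'],
})

# Rows are the true labels, columns are the predictions
confusion = pd.crosstab(toy_results['Label'], toy_results['predicted'])
print(confusion)

# Share of real spam that was caught, and share of flagged messages that were spam
recall = confusion.loc['spam', 'spam'] / confusion.loc['spam'].sum()
precision = confusion.loc['spam', 'spam'] / confusion['spam'].sum()
print('Spam recall:', recall)
print('Spam precision:', precision)
```

Applied to `texts_test`, the same `pd.crosstab` call would also show 'requires human clarification' as a third prediction column.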


## Real-life Examples
After creating this algorithm, I wanted to try it out on actual messages to get a sense of how it would perform in real life. I collected a handful of messages, 80% non-spam and 20% spam, and applied the algorithm to them. Let's see how it performed!

In [128]:
# Creating the dataset
non_spam = ['Can we talk tonight?', 'Hey there isnt any bed sheet on one of the beds. Can u bring one over?', 'Capital One wont call you for this code. The temporary code you requested is 288516. Please use this code to complete your request.', 'But I still wanna know how work is going… And stop WINING about it… Get it? Get it? I did that just in case mommy hasn’t yet', 'Was a close call but i miss my kids, ready to go home lol', 'no.. we only have left some old cars for the weekend :(', 'FYI the mold test will take place today between 11 and 1.', 'I once looked it up in a jewish encyclopaedia on last names and found the closest thing to its origin was a German town named very similar to it l, something like Rummel, and many years ago, Jews from Poland would travel there to trade (buy/sell) horses so those jewish families became known by last names having that root association. Bottom line, be proud of your yichud....as a most respectable and  prominent horse dealer! Todays equivalent would be a car dealer!', 'Updated link for Berris calendar https://calendly.com/berrihires/quick-hello',  'U guys at 741? Should I come there? Wanna come here for a little while? Something else?', 'Washer in Hotel works,.   Whoever needs Tide pods or quarters can ask the shift manager before leaving the plant', 'The brixs arent quite where they want them. Samples every 5-10 minutes', 'I have one bedroom apartment available in 294 Alabny It’s $175 per night $2,500 per month', 'You can leave it at Rogers house. Ill get it later', 'Hi, i saw your listing on col for a short term rental on carrol. Is it available for elul?', 'My radio is dead can you bring a new battery']
spam = ['High speed crash including athlete Tiger woods leaves 4 dead :( http://notedwitness.boutique/G9Wo4MHo', 'dmlyn iqrco yokj http://easy.fitness', 'heres a new way to stay stiff. its natural without bad side effects highfor.online/VGGi7cn', '[U.S.P.S]:While trying to send your parceI to your address, we found that your details arent correct. Resolve this issue at https://reports-delivery.com/view']
real_texts = pd.DataFrame(columns=['label', 'SMS'])
for sms in non_spam:
    real_texts.loc[len(real_texts)] = ['ham', sms]
for sms in spam:
    real_texts.loc[len(real_texts)] = ['spam', sms]
display(real_texts)
display(real_texts['label'].value_counts(normalize=True))

Unnamed: 0,label,SMS
0,ham,Can we talk tonight?
1,ham,Hey there isnt any bed sheet on one of the bed...
2,ham,Capital One wont call you for this code. The t...
3,ham,But I still wanna know how work is going… And ...
4,ham,"Was a close call but i miss my kids, ready to ..."
5,ham,no.. we only have left some old cars for the w...
6,ham,FYI the mold test will take place today betwee...
7,ham,I once looked it up in a jewish encyclopaedia ...
8,ham,Updated link for Berris calendar https://calen...
9,ham,U guys at 741? Should I come there? Wanna come...


ham     0.8
spam    0.2
Name: label, dtype: float64

In [129]:
# Applying the algorithm to our dataset
real_texts['predicted'] = real_texts['SMS'].apply(classify)
display(real_texts)
correct = 0
total = len(real_texts['SMS'])
for i, row in real_texts.iterrows():
    if row['label'] == row['predicted']:
        correct += 1
accuracy = correct / total
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', '{}%'.format(round(accuracy, 3) * 100))

Unnamed: 0,label,SMS,predicted
0,ham,Can we talk tonight?,ham
1,ham,Hey there isnt any bed sheet on one of the bed...,ham
2,ham,Capital One wont call you for this code. The t...,spam
3,ham,But I still wanna know how work is going… And ...,ham
4,ham,"Was a close call but i miss my kids, ready to ...",ham
5,ham,no.. we only have left some old cars for the w...,ham
6,ham,FYI the mold test will take place today betwee...,ham
7,ham,I once looked it up in a jewish encyclopaedia ...,ham
8,ham,Updated link for Berris calendar https://calen...,ham
9,ham,U guys at 741? Should I come there? Wanna come...,ham


Correct: 17
Incorrect: 3
Accuracy: 85.0%


We can see that although the accuracy is still fairly high, it has dropped noticeably, from 98.7% to 85%. With only 20 messages this is a rough indication rather than a precise measurement. Presumably, the original dataset reflects a stereotype of spam that may not always mirror the messages sent by modern spammers.<br>
Upon closer inspection, you can see that the text misclassified as spam was an automated text (which has a strong resemblance to spam), which may account for that error. The problem seems to be primarily in catching spam messages that appear to be legitimate.
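
One way to carry out this kind of inspection is to look at which words push a message towards spam. The sketch below ranks words by the logarithm of the ratio between their spam and ham probabilities. The dictionaries hold made-up values, and `word_evidence` is a hypothetical helper, not part of the original notebook.

```python
import math
import re

# Made-up conditional probabilities standing in for spam_dict and ham_dict
toy_spam_dict = {'code': 0.004, 'call': 0.005, 'temporary': 0.0003, 'request': 0.0002}
toy_ham_dict = {'code': 0.0005, 'call': 0.004, 'temporary': 0.0003, 'request': 0.0004}

def word_evidence(message):
    """Return (word, log-ratio) pairs; positive values lean towards spam."""
    words = re.sub(r'\W', ' ', message).lower().split()
    evidence = []
    for word in words:
        if word in toy_spam_dict and word in toy_ham_dict:
            ratio = math.log(toy_spam_dict[word] / toy_ham_dict[word])
            evidence.append((word, ratio))
    # Strongest spam indicators first
    return sorted(evidence, key=lambda pair: pair[1], reverse=True)

for word, ratio in word_evidence('The temporary code you requested'):
    print(word, round(ratio, 2))
```

With the real `spam_dict` and `ham_dict` in place of the toy values, the top entries for a misclassified message would show which words drove the decision.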

## Conclusion
Overall, it seems that the algorithm itself performed very well, and the underperformance on real-world data was likely due to the dataset we used. Presumably, if a more realistic dataset were developed, this shortcoming would be reduced, and predictions would be more reliable and realistic.