# Spam Classifier

<font size="3">In this project I build a spam classifier using the multinomial naive Bayes algorithm. The dataset of spam and non-spam SMS messages can be downloaded [here](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).</font>
<br />
<br />

## Import library & dataset

In [136]:
# import pandas
import pandas as pd

# Read the data
sms_spam_collection = pd.read_csv('smsspamcollection/SMSSpamCollection',sep='\t',header=None, names=['Label','SMS'])

## Exploratory Data Analysis

In [137]:
# Get info on our dataframe
sms_spam_collection.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


<font size="3">Every value in both columns is non-null, so we don't have to worry about dealing with missing values. Now, let's take a look at the first few rows of our dataframe.</font>

In [138]:
# First five rows of our data
pd.options.display.max_colwidth = 200
sms_spam_collection.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


<font size="3">As per the documentation, non-spam messages are labeled as ham. We will refer to them as such moving forward.</font>

<br />
<font size="3">Let's take a look at the distribution of message labels.</font>

In [139]:
# How many spam vs. how many ham
print('Frequency of spam & ham messages:')
print(sms_spam_collection['Label'].value_counts())
print()
print('Percentage of spam & ham messages:')
print(sms_spam_collection['Label'].value_counts(normalize=True))

Frequency of spam & ham messages:
ham     4825
spam     747
Name: Label, dtype: int64

Percentage of spam & ham messages:
ham     0.865937
spam    0.134063
Name: Label, dtype: float64
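
<font size="3">The imbalance above gives a useful reference point for any accuracy figure: a classifier that always predicts ham is already right most of the time. The short sketch below computes that majority-class baseline from the counts printed above (the variable names are illustrative and not part of the notebook).</font>

```python
# Counts taken from the value_counts() output above
ham_count = 4825
spam_count = 747

# Accuracy of a classifier that always predicts the majority class ("ham")
baseline_accuracy = ham_count / (ham_count + spam_count)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

<font size="3">Any classifier worth keeping should clearly beat this baseline of roughly 86.6%.</font>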


<br />
<font size="3">Now let's take a look at some of the ham messages.</font>

In [140]:
# Take a look at some non-spam messages
ham_messages = sms_spam_collection[sms_spam_collection['Label'] == 'ham']
rand_ham_messages = ham_messages['SMS'].sample(n=15)
print(rand_ham_messages)

2688                                                                                                                                                                                                       Okie
4535                                                                                                                                                                            I have no money 4 steve mate! !
5463                                                                                                                                                                                          U GOIN OUT 2NITE?
1582                                                                                                                                           Hhahhaahahah rofl wtf nig was leonardo in your room or something
3594                                                                                                                                                                    

<br />
<font size="3">Now we take a look at some of the spam messages.</font>

In [141]:
# Take a look at some full spam messages
spam_messages = sms_spam_collection[sms_spam_collection['Label'] == 'spam']
rand_spam_messages = spam_messages['SMS'].sample(n=15)
print(rand_spam_messages)

3009                       Loan for any purpose £500 - £75,000. Homeowners + Tenants welcome. Have you been previously refused? We can still help. Call Free 0800 1956669 or text back 'help'
367                     Update_Now - Xmas Offer! Latest Motorola, SonyEricsson & Nokia & FREE Bluetooth! Double Mins & 1000 Txt on Orange. Call MobileUpd8 on 08000839402 or call2optout/F4Q=
3864                                                                               Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50
4410                                                                                                        For your chance to WIN a FREE Bluetooth Headset then simply reply back with "ADP"
797                          Orange customer, you may now claim your FREE CAMERA PHONE upgrade for your loyalty. Call now on 0207 153 9996. Offer ends 14thMarch. T&C's apply. Opt-out availa
2556                             FreeMSG You have 

<br />
<font size="3">After running the code cell above a few times, it appears that two features distinguish spam from ham messages: phone numbers and money amounts. Let's explore this further.</font>

In [142]:
# let's look at messages with numbers (the decimal point is escaped so it only matches a literal ".")
messages_with_num = sms_spam_collection[sms_spam_collection['SMS']\
                    .str.contains(r'(?:[0-9]{1,3},(?:[0-9]{3},)*[0-9]{3}|[0-9]+)(?:\.[0-9][0-9])?')]
messages_with_num['Label'].value_counts()

ham     755
spam    708
Name: Label, dtype: int64

<font size="3">Filtering on numbers alone doesn't distinguish between the two types of messages very well. What if we look at messages containing money amounts?</font>

In [143]:
# messages containing a pound amount such as £500, £75,000 or £33.65
messages_with_money = sms_spam_collection[sms_spam_collection['SMS']\
                    .str.contains(r'£(?:[0-9]{1,3},(?:[0-9]{3},)*[0-9]{3}|[0-9]+)(?:\.[0-9][0-9])?')]

In [144]:
messages_with_money['Label'].value_counts()

spam    252
ham       5
Name: Label, dtype: int64

<font size="3">Clearly, containing a money amount is a distinguishing feature of spam messages: 252 of the 257 matching messages are spam.</font>
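
<font size="3">To make the money pattern easier to follow, here is a minimal sketch that runs it on a few strings with Python's `re` module. It uses the pattern with the decimal point written as `\.`, on the assumption that the intent is to match amounts like £33.65, and contrasts it with a variant where the dot is unescaped and therefore matches any character. The first two sample strings come from the dataset; the `tricky` string is made up for illustration.</font>

```python
import re

# Money pattern with an escaped decimal point: only a literal "." starts the pence part
MONEY_ESCAPED = re.compile(r'£(?:[0-9]{1,3},(?:[0-9]{3},)*[0-9]{3}|[0-9]+)(?:\.[0-9][0-9])?')
# Variant with an unescaped dot, where "." matches any character
MONEY_UNESCAPED = re.compile(r'£(?:[0-9]{1,3},(?:[0-9]{3},)*[0-9]{3}|[0-9]+)(?:.[0-9][0-9])?')

samples = [
    "Your bill at 3 is £33.65 so thats not bad!",
    "Loan for any purpose £500 - £75,000. Homeowners + Tenants welcome.",
    "Ok lar... Joking wif u oni...",
]
matches = [MONEY_ESCAPED.findall(text) for text in samples]
print(matches)  # [['£33.65'], ['£500', '£75,000'], []]

# Hypothetical message showing the difference between the two variants
tricky = "It's £6 25 mins away"
print(MONEY_ESCAPED.findall(tricky))    # ['£6']
print(MONEY_UNESCAPED.findall(tricky))  # ['£6 25']
```

<font size="3">Both variants flag the same messages when used with `str.contains`, since the optional suffix never decides whether a match exists. The difference only matters when the matched text is counted or replaced.</font>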

In [145]:
# here are the 5 ham messages that do have money
messages_with_money[messages_with_money['Label'] == 'ham']

| | Label | SMS |
|---|---|---|
| 1677 | ham | Yeah, that's fine! It's £6 to get in, is that ok? |
| 1724 | ham | Hi Jon, Pete here, Ive bin 2 Spain recently & hav sum dinero left, Bill said u or ur rents mayb interested in it, I hav 12,000pes, so around £48, tb, James. |
| 1998 | ham | YEH I AM DEF UP4 SOMETHING SAT,JUST GOT PAYED2DAY & I HAVBEEN GIVEN A£50 PAY RISE 4MY WORK & HAVEBEEN MADE PRESCHOOLCO-ORDINATOR 2I AM FEELINGOOD LUV |
| 3044 | ham | Your bill at 3 is £33.65 so thats not bad! |
| 3736 | ham | It‘s £6 to get in, is that ok? |


<br />
<font size="3">Another potential differentiating feature is phone numbers. The phone numbers in these messages appear to be either 11 or 5 digits long.</font>

In [146]:
messages_with_phone_number = sms_spam_collection[sms_spam_collection['SMS']\
                    .str.contains(r'(?:\b[0-9]{11}\b)|(?:\b[0-9]{5}\b)')]

In [147]:
messages_with_phone_number['Label'].value_counts()

spam    544
Name: Label, dtype: int64

<font size="3">All 544 messages that match the pattern are spam, so phone numbers clearly distinguish spam from ham messages.</font>
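
<font size="3">One detail of the phone pattern is worth seeing in action: `\b` only matches between a word character and a non-word character, so a number glued directly to letters is not picked up. The sketch below, using messages from the dataset and Python's `re` module, illustrates this (the variable names are illustrative).</font>

```python
import re

# Same phone pattern as above: a standalone run of exactly 11 or exactly 5 digits
PHONE_PATTERN = re.compile(r'(?:\b[0-9]{11}\b)|(?:\b[0-9]{5}\b)')

messages = [
    "Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
    "Call 09090204448 and join like minded guys.",
    "Your bill at 3 is £33.65 so thats not bad!",
]
found = [PHONE_PATTERN.findall(message) for message in messages]
print(found)  # [['87121'], ['09090204448'], []]
```

<font size="3">In the first message, `08452810075` is followed immediately by `over18`, so there is no word boundary after the eleventh digit and only `87121` is matched.</font>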
<br />
<br />

## Data Cleaning & Feature Engineering

<font size="3">
In implementing the Naive Bayes Algorithm (NBA) we calculate, for each word, its conditional probabilities $P(word|spam)$ and $P(word|ham)$. For some strings in our messages, however, it makes more sense to group them into a class. For example, as we saw above, a distinguishing feature of the messages is the presence of phone numbers. Spam messages tend to contain phone numbers of length 11 or 5, while ham messages don't. However, as one might expect, the phone number in each spam message is probably different. Hence, if we just run NBA on individual phone numbers, they won't have much of an effect when we run our classifier on the test data, since the spam messages in the test data probably contain different phone numbers from the spam messages in the training data. Instead, it is more effective to introduce a phone_nums feature that equals the number of times a phone number appears in a message. Similarly, we will also introduce a money_num feature that equals the number of money values in a message.
</font>

In [148]:
sms_spam_collection['phone_nums'] = sms_spam_collection['SMS'].str.count(r'(?:\b[0-9]{11}\b)|(?:\b[0-9]{5}\b)')

In [149]:
# count money amounts per message (decimal point escaped so it only matches a literal ".")
sms_spam_collection['money_num'] = sms_spam_collection['SMS']\
.str.count(r'£(?:[0-9]{1,3},(?:[0-9]{3},)*[0-9]{3}|[0-9]+)(?:\.[0-9][0-9])?')

<br />
<font size="3">Now let's clean our data by replacing those phone numbers and money amounts with spaces.</font>

In [150]:
# replace phone numbers, then money amounts, with a space
sms_spam_collection['clean_message'] = sms_spam_collection['SMS']\
.str.replace(r'(?:\b[0-9]{11}\b)|(?:\b[0-9]{5}\b)', ' ', regex=True)
sms_spam_collection['clean_message'] = sms_spam_collection['clean_message']\
.str.replace(r'£(?:[0-9]{1,3},(?:[0-9]{3},)*[0-9]{3}|[0-9]+)(?:\.[0-9][0-9])?', ' ', regex=True)

<br />
<font size="3">It's unlikely that non-alphanumeric characters such as periods and apostrophes will help distinguish between spam and ham messages, so we clean our messages further by replacing them with spaces. We also split each message into a list of words.</font>

In [151]:
sms_spam_collection['clean_message'] = sms_spam_collection['clean_message'].str.replace(r'\W', ' ', regex=True).str.split()
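
<font size="3">The effect of this cleaning step on a single message can be sketched with Python's `re` module. The helper `tokenize` is a hypothetical stand-in for the pandas `str.replace(...).str.split()` chain above, not a function from the notebook.</font>

```python
import re

def tokenize(message):
    """Replace every non-word character with a space, then split on whitespace."""
    return re.sub(r'\W', ' ', message).split()

tokens = tokenize("Nah I don't think he goes to usf, he lives around here though")
print(tokens)
# ['Nah', 'I', 'don', 't', 'think', 'he', 'goes', 'to', 'usf', 'he', 'lives', 'around', 'here', 'though']

# Case is preserved, so differently capitalised spellings stay distinct words
print(len(set(tokenize("FREE Free free"))))  # 3
```

<font size="3">Note that apostrophes split contractions ("don't" becomes "don" and "t") and that capitalisation is kept, so "FREE" and "free" are counted as separate words.</font>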

### Creating our features table
<font size="3">We now create our features table.</font>

In [152]:
# Get all the words in the messages and construct a dataframe

word_list = []
for row in sms_spam_collection['clean_message']:
    for word in row:
        word_list.append(word)
        
# make all words in list unique
word_list = list(set(word_list))

In [153]:
# very big!
len(word_list)

10570

In [154]:
# Count freq of each word
word_count_per_message = {unique_word: [0] * len(sms_spam_collection['clean_message']) for unique_word in word_list}
for indx, row in enumerate(sms_spam_collection['clean_message']):
    for word in row:
        word_count_per_message[word][indx] += 1
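
<font size="3">The dictionary above stores one counter per word per message, which for 10,570 words and 5,572 messages is roughly 59 million integers, almost all of them zero. As a possible alternative, the sketch below keeps only the non-zero counts by using one `collections.Counter` per message. The toy `clean_messages` list is made up for illustration; this is not the approach the rest of the notebook relies on.</font>

```python
from collections import Counter

# Toy stand-in for sms_spam_collection['clean_message']
clean_messages = [
    ['Free', 'entry', 'to', 'win', 'to'],
    ['Ok', 'lar'],
]

# One sparse word-count mapping per message
counts_per_message = [Counter(words) for words in clean_messages]

# Vocabulary of unique words across all messages
vocabulary = sorted(set().union(*clean_messages))

print(vocabulary)                     # ['Free', 'Ok', 'entry', 'lar', 'to', 'win']
print(counts_per_message[0]['to'])    # 2
print(counts_per_message[1]['Free'])  # 0 (a Counter returns 0 for missing keys)
```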

In [155]:
# convert the word counts to a dataframe (it is concatenated to our original dataframe further below)
words_dataframe = pd.DataFrame(word_count_per_message)

# dataframe with one column per unique word and one row per message
words_dataframe.head()

Unnamed: 0,Honestly,er,uterus,7ish,yellow,wer,areyouunique,QUITEAMUZING,website,sleepin,...,Ldn,dictionary,About,English,raised,Tel,senses,Sam,ve,Nic
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [156]:
# reorganize columns of sms_spam_collection
cols = ['Label', 'SMS', 'clean_message','phone_nums','money_num']
sms_spam_collection = sms_spam_collection[cols]

In [171]:
# To make sure that none of the words in our words dataframe are the same as any of the column names in the
# sms_spam_collection dataframe, we will add an underscore to some of the current column names in sms_spam_collection
# before we concatenate the two dataframes.
sms_spam_collection.columns = ['Label_', 'SMS_', 'clean_message','phone_nums','money_num']

In [172]:
# concatenate words dataframe to sms_spam_collection dataframe
full_df = pd.concat([sms_spam_collection, words_dataframe],axis=1)
full_df.head()

Unnamed: 0,Label_,SMS_,clean_message,phone_nums,money_num,Honestly,er,uterus,7ish,yellow,...,Ldn,dictionary,About,English,raised,Tel,senses,Sam,ve,Nic
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...","[Go, until, jurong, point, crazy, Available, only, in, bugis, n, great, world, la, e, buffet, Cine, there, got, amore, wat]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,Ok lar... Joking wif u oni...,"[Ok, lar, Joking, wif, u, oni]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,"[Free, entry, in, 2, a, wkly, comp, to, win, FA, Cup, final, tkts, 21st, May, 2005, Text, FA, to, to, receive, entry, question, std, txt, rate, T, C, s, apply, 08452810075over18, s]",1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,U dun say so early hor... U c already then say...,"[U, dun, say, so, early, hor, U, c, already, then, say]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"Nah I don't think he goes to usf, he lives around here though","[Nah, I, don, t, think, he, goes, to, usf, he, lives, around, here, though]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Split data into train and test set
<font size="3">Now we shuffle our dataframe and split it into a training set (80%) and a test set (20%).</font>

In [173]:
# First randomize the dataset
data_randomized = full_df.sample(frac=1, random_state=1)

rows = len(data_randomized)
train = data_randomized[:round(rows*0.8)].reset_index(drop=True)
test = data_randomized[round(rows*0.8):].reset_index(drop=True)

## Model Training

Now we build our classifier. For each message we calculate $P(spam|message)$ and $P(ham|message)$. If the first probability is higher, we classify the message as spam; if the second is higher, we classify it as ham. Now,
\begin{equation*}
P(spam|message) = P(spam|w_1, w_2, ... ,w_n) = P(spam|w_1 \cap w_2 \cap ... \cap w_n)
\end{equation*}
where $w_1, w_2, ... ,w_n$ are the words in the message.
<br />
Then, by the definition of conditional probability and Bayes' theorem, we have:
\begin{align*}
P(spam|w_1 \cap w_2 \cap ... \cap w_n) &= \frac{P(spam \cap w_1 \cap w_2 \cap ... \cap w_n)}{P(w_1 \cap w_2 \cap ... \cap w_n)} \\
&= \frac{P(w_1 \cap w_2 \cap ... \cap w_n|spam) * P(spam)}{P(w_1 \cap w_2 \cap ... \cap w_n)} \\
&\propto P(w_1 \cap w_2 \cap ... \cap w_n|spam) * P(spam)
\end{align*}
The denominator can be dropped because it is the same for spam and ham, so it does not affect which of the two probabilities is larger.
<br />
Finally, we assume that the words are conditionally independent given the class, so our final result is:

\begin{equation*}
P(spam|message) \propto P(w_1 \cap w_2 \cap ... \cap w_n|spam) * P(spam) = \prod_{i = 1}^{n}P(w_i|spam) * P(spam)
\end{equation*}

We calculate $P(w_i|spam)$ as follows:
\begin{equation*}
P(w_i|spam) = \frac{(n\_w_i\_spam + \alpha)}{(n\_spam + \alpha * n\_vocab)}
\end{equation*}
where:
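* $n\_w_i\_spam$ is the number of times $w_i$ appears in spam messages
* $n\_spam$ is the total number of words in spam messages
* $n\_vocab$ is the total number of words in the training messages, as computed in the code below (standard Laplace smoothing uses the number of distinct words in the vocabulary here instead)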
* $\alpha$ is a smoothing parameter we will set to 1

$P(ham|message)$ is calculated analogously.
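
The smoothed estimate can be checked by hand on a tiny corpus. The sketch below is a worked example under stated assumptions: the three toy messages are made up, and $n\_vocab$ is taken to be the number of distinct words, which is the textbook form of Laplace smoothing and the choice that makes the probabilities sum to 1.

```python
from collections import Counter

# Toy training data (already cleaned and split into words)
spam_messages = [['win', 'cash', 'now'], ['win', 'prize']]
ham_messages = [['see', 'you', 'now']]

alpha = 1
spam_counts = Counter(word for message in spam_messages for word in message)
n_spam = sum(spam_counts.values())  # 5 words in spam messages

# Assumption: n_vocab is the number of DISTINCT words across all messages
vocabulary = {word for message in spam_messages + ham_messages for word in message}
n_vocab = len(vocabulary)           # 6 distinct words

def p_word_given_spam(word):
    """Laplace-smoothed estimate of P(word | spam)."""
    return (spam_counts[word] + alpha) / (n_spam + alpha * n_vocab)

print(p_word_given_spam('win'))  # (2 + 1) / (5 + 6) = 3/11
print(p_word_given_spam('see'))  # (0 + 1) / (5 + 6) = 1/11, non-zero thanks to smoothing
total = sum(p_word_given_spam(word) for word in vocabulary)
print(total)                     # sums to 1 (up to floating-point rounding)
```

Without the $\alpha$ term, a word that never appears in spam would get probability 0 and wipe out the whole product for any message containing it.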

*Note: Below we will also build the classifier using sklearn's NB module and compare results.*

In [175]:
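# Calculate p_spam, p_ham, n_vocab, n_ham & n_spam
alpha = 1

# class priors estimated from the training labels (p_spam is rounded to two decimals)
p_spam = round(float(len(train[train['Label_'] == 'spam'])) / len(train), 2)
p_ham = 1 - p_spam

# note: n_vocab here is the total number of words across all training messages
n_vocab = train['clean_message'].apply(len).sum()

# total number of words in ham and in spam training messages
n_ham = train[train['Label_'] == 'ham']['clean_message'].apply(len).sum()
n_spam = n_vocab - n_ham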

In [176]:
# now we calculate the smoothed probabilities p(w_i|spam) and p(w_i|ham) for each word
prob_of_word_given_spam = {}
prob_of_word_given_ham = {}
train_spam = train[train['Label_'] == 'spam']
train_ham = train[train['Label_'] == 'ham']
for word in word_list:
    n_wi_spam = train_spam[word].sum()
    n_wi_ham = train_ham[word].sum()
    prob_of_word_given_spam[word] = (n_wi_spam + alpha) / (n_spam + alpha * n_vocab)
    prob_of_word_given_ham[word] = (n_wi_ham + alpha) / (n_ham + alpha * n_vocab)

<br />
<font size="3">Now we can build our classifier!</font>

In [177]:
# the message we take in is already cleaned and is split into a list of words
def classify(message):
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        p_spam_given_message *= prob_of_word_given_spam[word]
        p_ham_given_message *= prob_of_word_given_ham[word]
        
    if (p_spam_given_message > p_ham_given_message):
        return 'spam'
    elif (p_spam_given_message < p_ham_given_message):
        return 'ham'
    return 'Could not classify'
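
<font size="3">Multiplying many small probabilities can underflow to 0.0 for long messages, in which case both scores tie and the message cannot be classified. A common remedy is to add logarithms instead of multiplying probabilities, which preserves the comparison. The sketch below is a hypothetical variant of the function above, not the notebook's implementation: it takes the priors and probability tables as arguments, and it skips words missing from the tables, which is an added assumption. The toy probabilities are made up.</font>

```python
import math

def classify_log(message, p_spam, p_ham, prob_spam, prob_ham):
    """Naive Bayes decision rule computed in log space to avoid underflow."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in message:
        # Assumption: words missing from the tables are ignored
        if word in prob_spam and word in prob_ham:
            log_spam += math.log(prob_spam[word])
            log_ham += math.log(prob_ham[word])
    if log_spam > log_ham:
        return 'spam'
    if log_spam < log_ham:
        return 'ham'
    return 'Could not classify'

# Toy probability tables (illustrative values only)
toy_spam = {'win': 0.3, 'now': 0.2, 'see': 0.05}
toy_ham = {'win': 0.05, 'now': 0.2, 'see': 0.3}

print(classify_log(['win', 'win', 'now'], 0.13, 0.87, toy_spam, toy_ham))  # spam
print(classify_log(['see', 'now'], 0.13, 0.87, toy_spam, toy_ham))         # ham

# A very long message: the plain products 0.3**1000 and 0.05**1000 both underflow to 0.0,
# but the log-space scores remain comparable
print(0.3 ** 1000 == 0.0)                                                  # True
print(classify_log(['win'] * 1000, 0.13, 0.87, toy_spam, toy_ham))         # spam
```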

## Predict the Results

In [178]:
# Now let's get our predictions
predictions = test['clean_message'].apply(classify)

## Check Accuracy

In [181]:
print('Accuracy is: ', (sum(predictions == test['Label_']) / len(predictions)) * 100)

Accuracy is:  97.30700179533214
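
<font size="3">Because about 86.6% of the messages are ham, accuracy alone can hide how much spam slips through. Precision and recall for the spam class give a fuller picture. The helper below is a minimal sketch in plain Python; the function name `spam_metrics` and the toy labels are illustrative and not part of the notebook.</font>

```python
def spam_metrics(y_true, y_pred):
    """Return (precision, recall) for the 'spam' class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(t == 'spam' and p == 'spam' for t, p in pairs)
    false_pos = sum(t == 'ham' and p == 'spam' for t, p in pairs)
    false_neg = sum(t == 'spam' and p != 'spam' for t, p in pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Toy labels for illustration
toy_true = ['spam', 'spam', 'ham', 'ham', 'ham', 'spam']
toy_pred = ['spam', 'ham', 'ham', 'spam', 'ham', 'spam']
precision, recall = spam_metrics(toy_true, toy_pred)
print(precision, recall)  # 2/3 and 2/3
```

<font size="3">In the notebook this could be called as `spam_metrics(test['Label_'], predictions)`.</font>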


In [183]:
# Let's take a look at messages we didn't accurately predict
pred_wrong = test[predictions != test['Label_']]
print(len(pred_wrong))
pred_wrong[['Label_', 'SMS_']].head(len(pred_wrong))

30


| | Label_ | SMS_ |
|---|---|---|
| 51 | spam | FreeMsg: Hey - I'm Buffy. 25 and love to satisfy men. Home alone feeling randy. Reply 2 C my PIX! QlynnBV Help08700621170150p a msg Send stop to stop txts |
| 114 | spam | Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net |
| 135 | spam | More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB |
| 141 | spam | Dear Voucher Holder 2 claim your 1st class airport lounge passes when using Your holiday voucher call 08704439680. When booking quote 1st class x 2 |
| 180 | spam | Win the newest Harry Potter and the Order of the Phoenix (Book 5) reply HARRY, answer 5 questions - chance to be the first among readers! |
| 263 | spam | TheMob>Yo yo yo-Here comes a new selection of hot downloads for our members to get for FREE! Just click & open the next link sent to ur fone... |
| 284 | ham | Nokia phone is lovly.. |
| 293 | ham | A Boy loved a gal. He propsd bt she didnt mind. He gv lv lttrs, Bt her frnds threw thm. Again d boy decided 2 aproach d gal , dt time a truck was speeding towards d gal. Wn it was about 2 hit d gi... |
| 343 | spam | U have a secret admirer who is looking 2 make contact with U-find out who they R*reveal who thinks UR so special-call on 09058094565 |
| 363 | spam | Email AlertFrom: Jeri StewartSize: 2KBSubject: Low-cost prescripiton drvgsTo listen to email call 123 |


### Conclusion
<font size="3">Looking at the messages our classifier mislabeled, all except two of them are spam messages that were mislabeled as ham. Moreover, most of these spam messages appear to contain 5- or 11-digit phone numbers. We extracted that feature into phone_nums, but the classifier above scores only the words in clean_message, from which the phone numbers were removed, so phone_nums never enters the prediction. One thing we can try in order to improve the accuracy of our model is to introduce a $\beta$ parameter that brings phone_nums into the model and controls its weight. Our new probability for $P(w_i|spam)$ would then be:</font>
 
\begin{equation*}
P(w_i|spam) = \frac{(n\_w_i\_spam + \alpha + \beta * phone\_nums)}{(n\_spam + \alpha * n\_vocab + \beta * phone\_nums)}
\end{equation*}
    
<font size="3">We can then experiment with different values of $\beta$ to see which yields the best results.</font>

## Model Fitting using scikit-learn's Multinomial Naive Bayes module
<font size="3">In this section we'll use scikit-learn's built-in multinomial naive Bayes module to train and test on the same data, and compare its accuracy to the accuracy achieved above. Here the feature matrix contains the phone_nums and money_num columns in addition to the word counts.</font>

In [184]:
# Declare feature vector & target variable
X = full_df.drop(['Label_', 'SMS_', 'clean_message'], axis=1)
y = full_df['Label_']

In [185]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [186]:
print(X_train.shape)
print(X_test.shape)

(4457, 10572)
(1115, 10572)


In [187]:
# train a Multinomial Naive Bayes classifier on the training set
from sklearn.naive_bayes import MultinomialNB

# instantiate the model
mb = MultinomialNB()

# fit the model
mb.fit(X_train, y_train)

MultinomialNB()

In [188]:
# Make predictions
y_pred = mb.predict(X_test)

In [191]:
# determine model accuracy
from sklearn.metrics import accuracy_score

print('Model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

Model accuracy score: 0.9695


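<font size="3">Using scikit-learn's MultinomialNB module yields an accuracy of 96.95%, close to the 97.31% achieved by the classifier built above. The two figures come from different random train/test splits, so they are only roughly comparable.</font>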