We will use a public set of labeled SMS messages, collected for mobile phone spam research, to classify new messages as spam or ham using the multinomial Naive Bayes algorithm.

The data set contains 2 columns.

Label: whether the message is spam or ham
SMS: the actual content of the message

In [1]:
#Imports
import pandas as pd
import re

#settings
%matplotlib inline

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [10]:

#Read tab delimited data
data = pd.read_csv(r"C:\Users\vijay aakula\Downloads\smsspamcollection\SMSSpamCollection", sep='\t', header=None, names=['Label', 'SMS'])

data.head()
data.describe()

,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


,Label,SMS
count,5572,5572
unique,2,5169
top,ham,"Sorry, I'll call later"
freq,4825,30



There are 5572 rows with 2 columns. As mentioned earlier, the Label column has two classes: 747 rows with "spam" and 4825 rows with "ham". The count row shows 5572 non-null values in both columns, so there are no null or NA values and we are good to build a learning model.
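
The class counts and the missing-value check can also be confirmed directly rather than read off describe(). The following is a minimal sketch on a made-up four-row frame; toy_data is hypothetical, and on the real data the same calls would be applied to data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real `data` DataFrame
toy_data = pd.DataFrame({
    "Label": ["ham", "spam", "ham", "ham"],
    "SMS": ["Ok lar", "Free entry now", "See you soon", "Call me later"],
})

# Number of missing values per column
null_counts = toy_data.isna().sum()

# Number of rows per class
class_counts = toy_data["Label"].value_counts()

print(null_counts)
print(class_counts)
```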

Let's split the data set into train and test.

In [11]:
#randomize data set
data_randomized = data.sample(frac=1, random_state=1)

#split data to train and test
split_row = int(data.shape[0] * 0.8)
train = data_randomized[0:split_row].reset_index(drop=True)
test = data_randomized[split_row:].reset_index(drop=True)

#Verify data
print("training set\n",train['Label'].value_counts(normalize=True)*100)
print("test set\n", test['Label'].value_counts(normalize=True)*100)

training set
 ham     86.53803
spam    13.46197
Name: Label, dtype: float64
test set
 ham     86.816143
spam    13.183857
Name: Label, dtype: float64


Above we have verified that the two sets have nearly equal class proportions, which is important during testing. Let's use the train set to train the model.
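
The split above relies on shuffling to produce similar proportions. If the proportions should match by construction, a stratified split is one option. The sketch below is an alternative to the approach used here, not a replacement for it; it uses groupby(...).sample, which assumes pandas 1.1 or later, and toy_data, toy_train and toy_test are hypothetical names.

```python
import pandas as pd

# Hypothetical toy frame: 80 ham and 20 spam messages
toy_data = pd.DataFrame({
    "Label": ["ham"] * 80 + ["spam"] * 20,
    "SMS": [f"message {i}" for i in range(100)],
})

# Sample 80% of the rows from each class separately
toy_train = toy_data.groupby("Label").sample(frac=0.8, random_state=1)

# Everything that was not sampled becomes the test set
toy_test = toy_data.drop(toy_train.index)

print(toy_train["Label"].value_counts(normalize=True) * 100)
print(toy_test["Label"].value_counts(normalize=True) * 100)
```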

First we will manually implement the multinomial Naive Bayes algorithm and later use the sklearn version.

Let's do some data cleaning to extract required information from the data.

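remove any punctuation characters (every non-word character matched by \W becomes a space; letters, digits and underscores are kept, including non-ASCII letters such as ü)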
convert everything to lower case

In [12]:
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True).str.strip().str.replace(r' +', ' ', regex=True)
train['SMS'] = train['SMS'].str.lower()

train['SMS'].head()

0                          yep by the pretty sculpture
1           yes princess are you going to make me moan
2                           welp apparently he retired
3                                               havent
4    i forgot 2 ask ü all smth there s a card on da...
Name: SMS, dtype: object
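
To see what the cleaning step does to a single message, the same three operations can be sketched with the re module alone. clean_sms is a hypothetical helper and the sample strings are made up for illustration.

```python
import re

def clean_sms(text):
    """Replace non-word characters with spaces, collapse repeated spaces, lower-case."""
    text = re.sub(r"\W", " ", text)           # punctuation and symbols become spaces
    text = re.sub(r" +", " ", text.strip())   # trim the ends, then collapse runs of spaces
    return text.lower()

print(clean_sms("Free entry!! Call 0800-123 NOW..."))
print(clean_sms("Ü r_late :-("))
```

The second call shows that \W treats underscores and non-ASCII letters as word characters, so they survive the cleaning.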

Now let's create a vocabulary (a list of all unique words across all messages).

In [13]:
#split messages on space
train['SMS'] = train['SMS'].str.split()

#collect words from all messages and filter unique
vocabulary = []
for sms in train['SMS']:
    for word in sms:
        vocabulary.append(word)

vocabulary = list(set(vocabulary))

Now that we have a list of all unique words across all messages, we need a way to count the number of times each word in the vocabulary appears in each message. For this we shall create a new dataframe that contains a Label column and one column for each unique word in the vocabulary, with that word's count in each message as the values.

In [17]:

#Create empty dictionary with all 0's as count for each word in each message
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

In [18]:
#get the actual counts
for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        
#create new dataframe
train_clean = pd.concat([train['Label'], pd.DataFrame(word_counts_per_sms)], axis=1)

train_clean.head()

,Label,ans,09066362206,99,worse,4th,m60,deal,strong,bitching,...,huh,gail,holby,signing,real,finishing,hesitant,tddnewsletter,festival,hw
0,ham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
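
Because the real table is almost entirely zeros, the structure is easier to see on a tiny corpus. The sketch below builds the same kind of per-message count rows with collections.Counter; toy_messages and the other names are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus: each message is already cleaned and split into words
toy_messages = [
    ["free", "entry", "free"],
    ["see", "you", "soon"],
    ["free", "tonight"],
]

# Unique words across all messages, sorted so the column order is predictable
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One Counter per message; a Counter returns 0 for words it has not seen
per_message = [Counter(sms) for sms in toy_messages]

# One row per message, one column per vocabulary word
count_rows = [[counts[word] for word in toy_vocabulary] for counts in per_message]

print(toy_vocabulary)
for row in count_rows:
    print(row)
```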



Now that we have a data set to work with, let's define a few terms which we will use in the Naive Bayes algorithm.

p_spam: probability of a message being spam
p_ham: probability of a message being ham
n_spam: number of words (all words, not just unique) in spam messages
n_ham: number of words (all words, not just unique) in ham messages
n_vocabulary: number of unique words across all messages
alpha: Laplace smoothing parameter, which will be set to 1

In [19]:

#calculate probabilities
p_spam = train_clean['Label'].value_counts(normalize=True)['spam']
p_ham = train_clean['Label'].value_counts(normalize=True)['ham']

#calculate counts
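#numeric_only=True skips the text Label column, which recent pandas versions no longer drop silently
n_spam = train_clean[train_clean['Label'] == 'spam'].sum(axis=1, numeric_only=True).sum()
n_ham = train_clean[train_clean['Label'] == 'ham'].sum(axis=1, numeric_only=True).sum()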
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1
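
To make the terms concrete, here is a sketch that computes them by hand for a three-message toy training set. All names prefixed with toy_ are hypothetical and the messages are made up.

```python
# Hypothetical toy training set: (label, cleaned and split message)
toy_train = [
    ("spam", ["free", "entry", "free"]),
    ("ham", ["see", "you", "soon"]),
    ("ham", ["free", "tonight"]),
]

n_messages = len(toy_train)

# Prior probabilities: share of messages in each class
toy_p_spam = sum(label == "spam" for label, _ in toy_train) / n_messages
toy_p_ham = sum(label == "ham" for label, _ in toy_train) / n_messages

# Total word counts per class (repeated words are counted every time)
toy_n_spam = sum(len(words) for label, words in toy_train if label == "spam")
toy_n_ham = sum(len(words) for label, words in toy_train if label == "ham")

# Number of unique words across all messages
toy_n_vocabulary = len({word for _, words in toy_train for word in words})

print(toy_p_spam, toy_p_ham, toy_n_spam, toy_n_ham, toy_n_vocabulary)
```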


Now we need to calculate the probability of each word given that the message is spam and given that it is ham. So for each word we will calculate:

$$
\begin{aligned}
P(w_i|Spam) &= \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i|Ham) &= \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{aligned}
$$
Because these probabilities are calculated beforehand, classifying a new message only needs lookups and multiplications, which makes the algorithm really fast.
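
As a worked example of the formula, the sketch below applies it to made-up spam word counts. smoothed_probability and toy_spam_counts are hypothetical names; the point is that a word never seen in spam still gets a small non-zero probability, and that the smoothed probabilities over the vocabulary still sum to 1.

```python
def smoothed_probability(n_word_in_class, n_words_in_class, n_vocabulary, alpha=1):
    """P(word | class) with Laplace smoothing, following the formula above."""
    return (n_word_in_class + alpha) / (n_words_in_class + alpha * n_vocabulary)

# Hypothetical counts of each vocabulary word inside spam messages
toy_spam_counts = {"free": 2, "entry": 1, "see": 0, "you": 0, "soon": 0, "tonight": 0}

n_spam_words = sum(toy_spam_counts.values())   # 3 words in spam in total
n_vocab = len(toy_spam_counts)                 # 6 unique words

toy_p_words_spam = {
    word: smoothed_probability(count, n_spam_words, n_vocab)
    for word, count in toy_spam_counts.items()
}

print(toy_p_words_spam["free"])   # (2 + 1) / (3 + 6)
print(toy_p_words_spam["see"])    # (0 + 1) / (3 + 6), not zero
print(sum(toy_p_words_spam.values()))
```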

In [20]:
#create two empty dictionaries with probabilities of each word
p_words_spam = {word: 0 for word in vocabulary}
p_words_ham = {word: 0 for word in vocabulary}

#split data into spam and ham
train_spam = train_clean[train_clean['Label'] == 'spam']
train_ham = train_clean[train_clean['Label'] == 'ham']

#calculate probabilities
for word in vocabulary:
    #calculate total number of times this word appeared in spam and in ham messages
    n_word_spam = train_spam[word].sum()
    n_word_ham = train_ham[word].sum()
    
    #calculate probabilities
    p_word_spam = (n_word_spam + alpha)/(n_spam + (alpha * n_vocabulary))
    p_word_ham = (n_word_ham + alpha)/(n_ham + (alpha * n_vocabulary))
    
    #store in dictionaries
    p_words_spam[word] = p_word_spam
    p_words_ham[word] = p_word_ham

In [21]:
#print first 3 items in both dicts
print("spam probabilities")
for key in list(p_words_spam)[0:3]:
    print("{}: {}".format(key, p_words_spam[key]))
    
print("\nham probabilities")
for key in list(p_words_ham)[0:3]:
    print("{}: {}".format(key, p_words_ham[key]))

spam probabilities
ans: 0.0002611875326484416
09066362206: 8.706251088281386e-05
99: 0.0001305937663242208

ham probabilities
ans: 0.0001076674613550719
09066362206: 1.5381065907867414e-05
99: 1.5381065907867414e-05
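
One rough way to read these numbers is the ratio P(w|Spam) / P(w|Ham): a ratio above 1 means the word pushes a message toward spam. The sketch below uses only the three values printed above; shown_spam, shown_ham and ratios are hypothetical names.

```python
# Values copied from the printed output above
shown_spam = {
    "ans": 0.0002611875326484416,
    "09066362206": 8.706251088281386e-05,
    "99": 0.0001305937663242208,
}
shown_ham = {
    "ans": 0.0001076674613550719,
    "09066362206": 1.5381065907867414e-05,
    "99": 1.5381065907867414e-05,
}

# Ratio above 1: the word is more likely under spam than under ham
ratios = {word: shown_spam[word] / shown_ham[word] for word in shown_spam}

# Print the words from most to least spam-leaning
for word, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print("{}: {:.2f}".format(word, ratio))
```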



We now have the probabilities of all the words and the other constants we need to classify new messages.

We will classify the message into three categories.

spam: if the probability of the message being spam is higher.
ham: if the probability of the message being ham is higher.
needs human classification: if the probabilities are equal.

In [22]:
#Create a function that takes an input string and classifies the message
def classify(message):
    #clean and split the message
    message = re.sub(r'\W', ' ', message)
    message = message.lower().strip()
    message = message.split()

    #initiate values
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    #calculate spam and ham probabilities
    for word in message:
        if word in p_words_spam:
            p_spam_given_message *= p_words_spam[word]
        if word in p_words_ham:
            p_ham_given_message *= p_words_ham[word]
    
    #return the label with the higher probability
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'
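
Multiplying many small probabilities can underflow to 0.0 for long messages, in which case both scores would compare as equal. A common variant, sketched below, sums logarithms instead of multiplying; it is not the version used in this notebook. classify_log is a hypothetical name, it takes the priors and dictionaries as arguments so the block runs on its own, and the toy probabilities are made up.

```python
import math

def classify_log(message_words, p_spam, p_ham, p_words_spam, p_words_ham):
    """Same decision rule as classify, computed with log probabilities."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)

    # Adding logs is equivalent to multiplying the probabilities
    for word in message_words:
        if word in p_words_spam:
            log_spam += math.log(p_words_spam[word])
        if word in p_words_ham:
            log_ham += math.log(p_words_ham[word])

    if log_ham > log_spam:
        return 'ham'
    elif log_ham < log_spam:
        return 'spam'
    else:
        return 'needs human classification'

# Hypothetical toy parameters (smoothed with alpha = 1 over a 6-word vocabulary)
toy_p_words_spam = {"free": 3/9, "entry": 2/9, "see": 1/9, "you": 1/9, "soon": 1/9, "tonight": 1/9}
toy_p_words_ham = {"free": 2/11, "entry": 1/11, "see": 2/11, "you": 2/11, "soon": 2/11, "tonight": 2/11}

print(classify_log(["free", "entry"], 1/3, 2/3, toy_p_words_spam, toy_p_words_ham))
print(classify_log(["see", "you"], 1/3, 2/3, toy_p_words_spam, toy_p_words_ham))

# A plain product of 100 factors of 0.0001 underflows to exactly 0.0
print(0.0001 ** 100)
```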


We now have an algorithm ready that can be used on the test data set.

In [23]:

#create new column with predicted values
test['predicted'] = test['SMS'].apply(classify)

test.head()

,Label,SMS,predicted
0,ham,Wherre's my boytoy ? :-(,ham
1,ham,Later i guess. I needa do mcat study too.,ham
2,ham,But i haf enuff space got like 4 mb...,ham
3,spam,Had your mobile 10 mths? Update to latest Oran...,spam
4,ham,All sounds good. Fingers . Makes it difficult ...,ham



The first few rows look accurate. Let's calculate the accuracy and display the confusion matrix.

In [24]:
#calculate accuracy
print("Accuracy score:", (test['Label']==test['predicted']).sum()/len(test))

#confusion matrix
confusion_matrix = pd.crosstab(test['Label'], test['predicted'], rownames=['True'], colnames=['Predicted'], margins=True)
confusion_matrix

Accuracy score: 0.9874439461883409


True \ Predicted,ham,needs human classification,spam,All
ham,962,1,5,968
spam,8,0,139,147
All,970,1,144,1115


We achieved an accuracy of 98.74%, which is really high. Out of 1115 messages, our filter misclassified only 13 and flagged 1 as needing human classification.

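However, we need to be careful about true ham messages being classified as spam. If we treat ham as the positive class, these are false negatives, and we should aim for 0 of them, which means the true positive rate (recall) for ham should be 100%. On this test set 5 ham messages were labeled spam and 1 was left for human classification, so ham recall is 962/968, about 99.38%.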
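
The per-class rates can be computed directly from the confusion matrix. The sketch below copies the counts from the table above into a plain dictionary; confusion, recall and precision are hypothetical names.

```python
# Counts copied from the confusion matrix: confusion[true_label][predicted_label]
confusion = {
    "ham": {"ham": 962, "needs human classification": 1, "spam": 5},
    "spam": {"ham": 8, "needs human classification": 0, "spam": 139},
}

def recall(matrix, label):
    """Share of messages that truly have `label` and were predicted as `label`."""
    return matrix[label][label] / sum(matrix[label].values())

def precision(matrix, label):
    """Share of messages predicted as `label` that truly have `label`."""
    return matrix[label][label] / sum(row[label] for row in matrix.values())

print("ham recall: {:.4f}".format(recall(confusion, "ham")))
print("spam recall: {:.4f}".format(recall(confusion, "spam")))
print("spam precision: {:.4f}".format(precision(confusion, "spam")))
```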

Achieving this might let more spam messages through as ham, which is still preferable to blocking a ham message as spam. This is the trade-off we need to consider.
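
One possible way to act on this trade-off, sketched here as an assumption rather than something the notebook implements, is to label a message spam only when the spam score beats the ham score by some margin. decide and margin are hypothetical names, the scores are made-up log probabilities, and the "needs human classification" outcome is left out for brevity.

```python
import math

def decide(log_p_spam, log_p_ham, margin=0.0):
    """Label 'spam' only when the spam log score exceeds the ham log score by more than `margin`."""
    return "spam" if log_p_spam - log_p_ham > margin else "ham"

# Hypothetical borderline message: spam is only 1.2 times as likely as ham
borderline = (math.log(0.012), math.log(0.010))

print(decide(*borderline))                        # no margin: labeled spam
print(decide(*borderline, margin=math.log(10)))   # spam must be 10x as likely: labeled ham
```

A larger margin blocks fewer ham messages at the cost of letting more spam through, so a suitable value would have to be chosen on held-out data.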