## Building a spam filter with Naive Bayes Theorem

The goal of his project tio build a spam filter using the multinomial Naive Bayes algorithm. It learsn how human classify the messages and uses that human knowledge to estimate probabilities for new messages by calculating the probabillities for both spam and non spam messages. Then based on the probability values, if the probability of the spam messages is higher than that of the non-spam messages, it classifies the new message as a spam message.This classification task is performed on the dataset of 5,572 messages extracted from UCI Machine Learning Repository.


### Exploring the dataset

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
sms=pd.read_csv('SMSSpamCollection',sep='\t',header=None,names=['Label',"SMS"])

In [2]:
sms.head(5)

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
shape=sms.shape
print('shape of the datasest:',{shape})
unique_labels=sms['Label'].value_counts(normalize=True)*100
print(unique_labels)

shape of the datasest: {(5572, 2)}
ham     86.593683
spam    13.406317
Name: Label, dtype: float64


The dataset contains 5,572 messages. Out of them, 86.59% are non-spam messages while rest of 13.41% are identified as spam messages.

### Training and Test datasets

In [4]:
sms_cp=sms.sample(frac=1,random_state=1) # randomize the entire datasest

# Using the manual method to split the data to test and trainig sets
data_index = round(len(sms_cp) * 0.8)

# Training/Test split
training_set = sms_cp[:data_index].reset_index(drop=True)
test_set = sms_cp[data_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


In [5]:
print(training_set['Label'].value_counts(normalize=True)*100)

ham     86.54105
spam    13.45895
Name: Label, dtype: float64


In [6]:
print(test_set['Label'].value_counts(normalize=True)*100)

ham     86.804309
spam    13.195691
Name: Label, dtype: float64


Based on the percentages of spam and non-spam messages in both training and test datasets above, the values corresponds with the percentages of the each label in the full dataset.


In [7]:
'''# Using the sklearn's train_test_split to split dataset
X=sms_cp['SMS']
y=sms_cp['Label']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=72)
training_labels= y_train.value_counts(normalize=True)*100
test_labels=y_test.value_counts(normalize=True)*100'''

"# Using the sklearn's train_test_split to split dataset\nX=sms_cp['SMS']\ny=sms_cp['Label']\nX_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=72)\ntraining_labels= y_train.value_counts(normalize=True)*100\ntest_labels=y_test.value_counts(normalize=True)*100"

### Data Cleaning

Each letter in the sms messages should be cleaned and transformed in to a standard format which makes the clasification task for each word in the message easier.


In [8]:
# Replacing the punctuation marks with a space using the regex '\w'
training_set['SMS']=training_set['SMS'].str.replace('\W',' ')

#Transforming the words to lowercase letters
training_set['SMS']=training_set['SMS'].str.lower()

training_set.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


### Creating the Vocabulary

creating a vocabulary which is a list containinig all unique words across all messages.

In [9]:
training_set['SMS']=training_set['SMS'].str.split() #splitting string at space character

vocabulary=[]
for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)
vocabulary=list(set(vocabulary)) # ELiminated duplicate by transforming to a set and retransformed back to a list

creating a dictionary with the unique word count and transform the dictionary to the format of the DataFrame which is necessary for this analysis.

In [10]:
word_counts_per_sms={unique_word:[0]*len(training_set['SMS']) for unique_word in vocabulary} # initialize the empty dictionary 
for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1


In [11]:

word_counts = pd.DataFrame(word_counts_per_sms)

training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


### Calculating constants to apply the Naive Bayes Theorem 

P(Spam|w1,w2,...,wn)∝**P(Spam)**⋅n∏i=1 **P(wi|Spam)**

P(Ham|w1,w2,...,wn)∝**P(Ham)**⋅n∏i=1  **P(wi|Ham)**

**P(wi|Spam)**=Nwi|Spam+α/(NSpam+α⋅NVocabulary)

**P(wi|Ham)**=Nwi|Ham+α /(NHam+α⋅NVocabulary)

In [12]:
#Isolating spam and ham messages 
training_spam=training_set_clean[training_set_clean['Label']=='spam']
training_ham=training_set_clean[training_set_clean['Label']=='ham']

N_spam=training_spam['SMS'].apply(len).sum()
N_ham=training_ham['SMS'].apply(len).sum()

N_vocabulary= len(vocabulary)


alpha = 1 # Laplace smoothing

p_spam=len(training_spam)/len(training_set_clean)
p_ham=len(training_ham)/len(training_set_clean)

### Calculating the parameters

In [13]:
# Initiate parameters
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

# calculate parameters 

for word in vocabulary:
    N_word_given_spam=training_spam[word].sum()
    P_word_given_spam=(N_word_given_spam*alpha)/(N_spam+alpha*N_vocabulary)
    parameters_spam[word]=P_word_given_spam
    
    N_word_given_ham=training_ham[word].sum()
    P_word_given_ham=(N_word_given_ham*alpha)/(N_ham+alpha*N_vocabulary)
    parameters_ham[word]=P_word_given_ham
       

### Classifying a new message

In [14]:
import re

def classify(message):
    '''
    message: a string
    '''
    
    message = re.sub('\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal proabilities, have a human classify this!')


In [15]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')


P(Spam|message): 8.100463856544482e-26
P(Ham|message): 0.0
Label: Spam


In [16]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 0.0
P(Ham|message): 2.1950292848845383e-21
Label: Ham


### Meassure spam filter's accuracy 

Modifying the classification functin to test on the test dataset and return the label as 'spam' or 'ham' for each message.

In [17]:
def classify_test_set(message):    
    '''
    message: a string
    '''
    
    message = re.sub('\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'


In [18]:

#Now that we have a function that returns labels instead of printing them, we can use it to create a new column in our test set

test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()


Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",needs human classification


In [19]:
correct = 0
total = test_set.shape[0]
    
for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)


Correct: 1025
Incorrect: 89
Accuracy: 0.9201077199281867


In [21]:
incorrect = []

for row in test_set.iterrows():
    row = row[1]
    if row['Label'] != row['predicted']:
        incorrect.append(row)
        
incor = pd.DataFrame(incorrect)
incor.head(89)

Unnamed: 0,Label,SMS,predicted
4,ham,"All done, all handed in. Don't know if mega sh...",needs human classification
16,ham,Doc prescribed me morphine cause the other pai...,needs human classification
57,spam,Want 2 get laid tonight? Want real Dogging loc...,needs human classification
71,ham,We know TAJ MAHAL as symbol of love. But the o...,needs human classification
84,spam,Show ur colours! Euro 2004 2-4-1 Offer! Get an...,needs human classification
89,spam,goldviking (29/M) is inviting you to be his fr...,needs human classification
93,ham,WE REGRET TO INFORM U THAT THE NHS HAS MADE A ...,needs human classification
98,ham,Ola would get back to you maybe not today but ...,needs human classification
114,spam,Not heard from U4 a while. Call me now am here...,needs human classification
121,spam,1000's flirting NOW! Txt GIRL or BLOKE & ur NA...,needs human classification


## Conclusion
In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 92% on the test set, which is a great result. The messages that were classified incorrect were difficult to classify might be due to inclusion of many new words out of the vocabulary or not having the complete message displayed in the SMS.