# Naive Bayes algorithm for spam detection

Here we predict whether an SMS message is 'spam' or 'ham' (i.e., not spam) using a Bernoulli Naive Bayes *classifier*.

The training data is from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection.


## Step 1: Getting, understanding, and cleaning the dataset

### Importing the dataset

In [18]:
# Import the usual libraries
import numpy as np 
import pandas as pd  
df = pd.read_csv('SMSSpamCollection', sep = '\t', header=None, names=['label', 'sms_message'])

df.head() 

|   | label | sms_message |
|---|-------|-------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### Data preprocessing
We map the labels to 1 (spam) and 0 (ham), since numerical values are easier to work with.

In [19]:
df['label']=df.label.map({'spam':1, 'ham':0})
df.head() 

|   | label | sms_message |
|---|-------|-------------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


### Splitting the dataset into a training set and a test set

In [20]:
# Use sklearn's train_test_split to separate the data
from sklearn.model_selection import train_test_split

df_train_msgs, df_test_msgs, df_ytrain, df_ytest = train_test_split(df['sms_message'],df['label'], random_state=0)

# Looking at the train/test split
print("The number of training examples: ", df_train_msgs.shape[0])
print("The number of test examples: ", df_test_msgs.shape[0])

The number of training examples:  4179
The number of test examples:  1393
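
The 4179/1393 split above corresponds to `train_test_split`'s default of holding out 25% of the data. The sketch below, on made-up toy labels (the arrays `toy_labels` and `toy_messages` are hypothetical and not part of the dataset), shows that default and the optional `stratify` argument, which keeps the spam fraction the same in both parts. The notebook itself does not stratify; this is only an illustration of the option.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 ham (0) and 4 spam (1) messages, i.e. 20% spam.
toy_labels = np.array([0] * 16 + [1] * 4)
toy_messages = np.array([f"message {i}" for i in range(20)])

# With neither test_size nor train_size given, 25% of the rows go to the test set.
# stratify=toy_labels keeps the class proportions equal in both parts.
msgs_train, msgs_test, y_tr, y_te = train_test_split(
    toy_messages, toy_labels, random_state=0, stratify=toy_labels
)

print("train size:", len(msgs_train), "test size:", len(msgs_test))
print("spam fraction in train:", y_tr.mean(), "in test:", y_te.mean())
```

With a class imbalance like the one in this dataset, stratifying is one way to make sure the test set is not accidentally short of spam examples.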


### Creating the feature vector from the text (feature extraction)

Each message gets its own feature vector, with one feature for every word in our vocabulary. The $j$th feature is set to one ($x_j=1$) if the $j$th word of the vocabulary occurs in the message, and to zero ($x_j=0$) otherwise.

We will use sklearn's `CountVectorizer` to create the feature vectors for every message. With `binary=True` it records only whether a word occurs, not how often, and `stop_words='english'` drops common English words from the vocabulary.
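
To make the encoding concrete, here is a minimal sketch on three made-up training messages (the names `toy_train`, `toy_vectorizer`, and `X_toy` are hypothetical, and stop-word removal is left out to keep the vocabulary easy to read). It shows that a repeated word still yields a 1, and that words not seen during `fit` are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy training messages.
toy_train = ["free prize free entry", "call me later", "win a free prize now"]

# binary=True stores 1 if a word occurs at all, regardless of how many times.
toy_vectorizer = CountVectorizer(binary=True)
toy_vectorizer.fit(toy_train)

# Vocabulary words ordered by their feature index.
# The default tokenizer drops one-character tokens such as "a".
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)

# "free" appears twice in the first message but is still encoded as 1;
# none of the words in the second message are in the vocabulary.
X_toy = toy_vectorizer.transform(["free free call", "unknown words only"]).toarray()
print(X_toy)
```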

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
# creating an instance of CountVectorizer

vectorizer = CountVectorizer(binary = True, stop_words='english')

In [22]:
# Creating the vocabulary for our feature transformation
vectorizer.fit(df_train_msgs)

# Next we create the feature vectors for both the training data and the test data
X_train = vectorizer.transform(df_train_msgs).toarray() # turn the training messages into feature vectors
X_test = vectorizer.transform(df_test_msgs).toarray() # turn the test messages into feature vectors

# Convert the target vectors from pandas Series to NumPy arrays
y_train = df_ytrain.to_numpy()
y_test = df_ytest.to_numpy()

# To observe what the data looks like
print("The label of the first training example: ", y_train[0])
print("The number of features in the first training example: ", len(X_train[0]))

The label of the first training example:  0
The number of features in the first training example:  7287
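
The next cell estimates, for each class, the probability that each vocabulary word appears in a message. As a reading aid, the smoothed estimate it computes can be written as follows, where the symbols $n_{jc}$ and $n_c$ are notation introduced here for illustration and do not appear in the code:

$$
P(x_j = 1 \mid y = c) = \frac{n_{jc} + m}{n_c + 2m}
$$

Here $n_{jc}$ is the number of training messages of class $c$ that contain word $j$, $n_c$ is the number of training messages of class $c$, and $m$ is the smoothing parameter (set to $0.1$ in the code). The $2m$ in the denominator reflects the two possible values of each binary feature. Because $m > 0$, every estimate lies strictly between 0 and 1, so the logarithms taken at prediction time stay finite.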


In [23]:
# Class priors P(spam) and P(ham).
# Note: these are estimated from the full dataset (df), which includes the test messages;
# estimating them from y_train alone would keep the test labels out of training.
probablity_spam_1 = len(df[df['label'] == 1]) / df.shape[0]
probablity_ham_0 = len(df[df['label'] == 0]) / df.shape[0]

print("probability for spam:", probablity_spam_1)
print("probability for ham:", probablity_ham_0)

# Split the training feature vectors by class
X_train_spam_1 = X_train[y_train == 1]
X_train_ham_0 = X_train[y_train == 0]

m = 0.1   # smoothing parameter; we can change the value of m

# Numerator: number of messages in the class that contain each word, plus m
X_train_probablity_spam_numerator = X_train_spam_1.sum(axis = 0) + m
X_train_probablity_ham_numerator = X_train_ham_0.sum(axis = 0) + m

# Denominator: number of messages in the class, plus 2m (each feature has two possible values)
X_train_probablity_spam_denominator = (2 * m) + len(X_train_spam_1)
X_train_probablity_ham_denominator = (2 * m) + len(X_train_ham_0)

# Smoothed estimates of P(x_j = 1 | spam) and P(x_j = 1 | ham) for every word j
X_train_probablity_given_1 = X_train_probablity_spam_numerator / X_train_probablity_spam_denominator
X_train_probablity_given_0 = X_train_probablity_ham_numerator / X_train_probablity_ham_denominator

print("probability for class spam")
print(X_train_probablity_given_1)
print("probability for class ham")
print(X_train_probablity_given_0)


probability for spam: 0.13406317300789664
probability for ham: 0.8659368269921034
probability for class spam
[0.01262896 0.0357524  0.00017787 ... 0.00017787 0.0019566  0.00017787]
probability for class ham
[2.76456928e-05 2.76456928e-05 3.04102621e-04 ... 3.04102621e-04
 2.76456928e-05 3.04102621e-04]
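
As a sanity check on the smoothing used in the training cell above, the sketch below compares the same hand-computed estimate with scikit-learn's `BernoulliNB` on a tiny made-up dataset. The arrays `X_toy` and `y_toy` and the helper `smoothed_word_probs` are hypothetical names introduced for this illustration. It assumes that `BernoulliNB(alpha=m)` applies the same additive smoothing, which the comparison itself verifies on the toy data.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data: 6 messages, 3 binary word features, label 1 = spam.
X_toy = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
])
y_toy = np.array([1, 1, 0, 0, 0, 0])

m = 0.1  # same smoothing value as in the notebook


def smoothed_word_probs(X_class, m):
    """Estimate P(x_j = 1 | class) as (word count + m) / (class size + 2m)."""
    return (X_class.sum(axis=0) + m) / (len(X_class) + 2 * m)


manual_spam = smoothed_word_probs(X_toy[y_toy == 1], m)
manual_ham = smoothed_word_probs(X_toy[y_toy == 0], m)

# feature_log_prob_ holds log P(x_j = 1 | class), one row per class in sorted order (0, 1).
clf = BernoulliNB(alpha=m)
clf.fit(X_toy, y_toy)
sk_ham, sk_spam = np.exp(clf.feature_log_prob_)

print("spam estimates match:", np.allclose(manual_spam, sk_spam))
print("ham estimates match:", np.allclose(manual_ham, sk_ham))
```

Note that `BernoulliNB` estimates the class priors from the data passed to `fit`, whereas the notebook computes them from the full dataset, so the priors of the two approaches would differ slightly on the real data.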


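### Prediction with Naive Bayes and the ZeroR baseline

Here we classify the test messages with the trained Naive Bayes model and, for comparison, with the ZeroR method, which always predicts the most frequent class in the training set.
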
In [24]:
# Note: percentage_error is defined in cell In [25] below; that cell must be run before this one.

# Per-word likelihood for every test message:
# P(x_j = 1 | class) if the word is present, 1 - P(x_j = 1 | class) if it is absent
probablity_matrix_given_ham = np.where(X_test == 1, X_train_probablity_given_0, 1 - X_train_probablity_given_0)
probablity_matrix_given_spam = np.where(X_test == 1, X_train_probablity_given_1, 1 - X_train_probablity_given_1)

# Work with logarithms so that the product over thousands of words becomes a sum
log_of_probablity_matrix_given_spam = np.log(probablity_matrix_given_spam)
log_of_probablity_matrix_given_ham = np.log(probablity_matrix_given_ham)

# Log-score of each class: sum of the log-likelihoods plus the log prior
naive_bayes_probablity_spam = np.sum(log_of_probablity_matrix_given_spam, axis = 1) + np.log(probablity_spam_1)
naive_bayes_probablity_ham = np.sum(log_of_probablity_matrix_given_ham, axis = 1) + np.log(probablity_ham_0)

# Predict spam (1) when its log-score is larger, otherwise ham (0)
predicted_class = np.where(naive_bayes_probablity_spam > naive_bayes_probablity_ham, 1, 0)

# Accuracy of ZeroR, which always predicts the majority class of the training set
if len(X_train_spam_1) > len(X_train_ham_0):
    highest_probablity = 1
else:
    highest_probablity = 0

accuracy_zero_r = len(y_test[y_test == highest_probablity]) / len(y_test)


print(predicted_class[:5])   # first five elements
print(predicted_class[-5:])  # last five elements
print("predicted class for first 50 samples")
print(predicted_class[:50])

print("percentage_error:" , percentage_error(predicted_class, y_test))
print("accuracy for zero_R:" , accuracy_zero_r)



[0 0 0 0 0]
[0 1 1 0 0]
predicted class for first 50 samples
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1
 0 0 1 0 0 0 0 0 0 0 0 0 1]
total number of test examples classified incorrectly
19
total number of test examples classified correctly
1374
accuracy for bernoulli: 0.9863603732950467
percentage_error: 0.013639626704953339
accuracy for zero_R: 0.8671931083991385
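
The prediction cell above sums log-probabilities rather than multiplying the probabilities directly. The sketch below illustrates one likely reason, using made-up likelihood values (0.5 and 0.4 for every feature, which are hypothetical and not taken from the trained model): with 7287 features, the direct products underflow to zero in floating point and can no longer be compared, while the log-scores remain usable.

```python
import numpy as np

n_features = 7287  # vocabulary size reported earlier in the notebook

# Hypothetical per-feature likelihoods for a single message under each class.
likelihood_spam = np.full(n_features, 0.5)
likelihood_ham = np.full(n_features, 0.4)

# Multiplying thousands of values below 1 underflows to exactly 0.0 in float64,
# so the two classes become indistinguishable.
product_spam = np.prod(likelihood_spam)
product_ham = np.prod(likelihood_ham)
print("products:", product_spam, product_ham)

# Summing logarithms keeps the scores finite and comparable.
log_score_spam = np.sum(np.log(likelihood_spam))
log_score_ham = np.sum(np.log(likelihood_ham))
print("log-scores:", log_score_spam, log_score_ham)
print("spam has the larger score:", log_score_spam > log_score_ham)
```

Since the logarithm is monotonic, comparing log-scores picks the same class that comparing the exact products would.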


In [25]:
def percentage_error(predicted_class, y_test):
    """Print the counts of correct and incorrect predictions and return the error rate as a fraction."""
    error_matrix = predicted_class - y_test   # zero wherever the prediction matches the label
    classified_correctly = (error_matrix == 0).sum()
    classified_incorrectly = (error_matrix != 0).sum()
    error_percentage = classified_incorrectly / len(error_matrix)
    print("total number of test examples classified incorrectly")
    print(classified_incorrectly)
    print("total number of test examples classified correctly")
    print(classified_correctly)
    accuracy_bernoulli = classified_correctly / (classified_incorrectly + classified_correctly)
    print("accuracy for bernoulli:" , accuracy_bernoulli)
    return error_percentage
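
The counts printed by `percentage_error` can also be obtained with scikit-learn's metrics. The sketch below uses small made-up label arrays (`y_true_toy` and `y_pred_toy` are hypothetical) to show `accuracy_score` and `confusion_matrix`; the latter additionally separates ham misclassified as spam from spam misclassified as ham, which a single error rate does not.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (1 = spam, 0 = ham).
y_true_toy = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred_toy = np.array([0, 0, 1, 0, 0, 1, 1, 0])

accuracy = accuracy_score(y_true_toy, y_pred_toy)
error_rate = 1 - accuracy

# Rows are true classes, columns are predicted classes, both in the order (0, 1).
cm = confusion_matrix(y_true_toy, y_pred_toy)

print("accuracy:", accuracy)
print("error rate:", error_rate)
print(cm)
```

On the real data, the same calls could be applied to `y_test` and `predicted_class`.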