<a href="https://colab.research.google.com/github/victorsmoreschi/Data-Studies/blob/main/Building_a_Spam_Filter_with_Naive_Bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Beginning
To classify messages as spam or non-spam, we saw in the previous lesson that the computer:

* Learns how humans classify messages.
* Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
* Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that humans have already classified.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/228/sms+spam+collection).

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

In [4]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/SMSSpamCollection', sep = '\t', header = None, names=['Label', 'SMS'])

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [7]:
df['Label'].value_counts(normalize= True)

Label
ham     0.865937
spam    0.134063
Name: proportion, dtype: float64


## Separating Training DF and Test DF

To test the spam filter, we're first going to split our dataset into two categories:

* A **training set**, which we'll use to teach the computer how to classify messages.
* A **test set**, which we'll use to test how good the spam filter is at classifying new messages.

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

* The training set will have 4,458 messages (about 80% of the dataset).
* The test set will have 1,114 messages (about 20% of the dataset).
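
As a quick check of the arithmetic above, the split sizes can be derived from the dataset size. This is only a sketch; the names `n_messages`, `train_fraction`, `n_train`, and `n_test` are illustrative and do not appear elsewhere in the notebook.

```python
n_messages = 5572        # size of the full dataset, as stated above
train_fraction = 0.8     # share of the data kept for training

# 5572 * 0.8 = 4457.6, which rounds to the nearest whole message
n_train = round(n_messages * train_fraction)
n_test = n_messages - n_train

print(n_train, n_test)
```

These are the same boundaries used when slicing the shuffled DataFrame at row 4458.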

In [8]:
# Shuffle the whole dataset; random_state=1 makes the shuffle reproducible
sample = df.sample(frac=1, random_state=1)

In [9]:
training = sample[:4458].reset_index(drop=True)
test = sample[4458:].reset_index(drop=True)

In [10]:
training.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4458 entries, 0 to 4457
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   4458 non-null   object
 1   SMS     4458 non-null   object
dtypes: object(2)
memory usage: 69.8+ KB


In [11]:
print(training['Label'].value_counts(normalize= True))
print(test['Label'].value_counts(normalize= True))

Label
ham     0.86541
spam    0.13459
Name: proportion, dtype: float64
Label
ham     0.868043
spam    0.131957
Name: proportion, dtype: float64


In [19]:

# Replace every non-word character (punctuation, symbols) with a space
training['SMS'] = training['SMS'].str.replace(r'\W', ' ', regex=True)
training['SMS'] = training['SMS'].str.lower()
training.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
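
To see what the cleaning step does to a single message, here is a minimal sketch using the standard `re` module. The helper name `clean` is hypothetical and is not part of the notebook; it also splits the message into words, which the notebook does in a later cell.

```python
import re

def clean(message):
    """Replace non-word characters with spaces, lowercase, and split into words."""
    message = re.sub(r'\W', ' ', message)
    return message.lower().split()

tokens = clean('WINNER!! This is the secret code to unlock the money: C3421.')
print(tokens)
```

Note that `\W` matches anything that is not a letter, a digit, or an underscore, so digits such as `2` and letters such as `ü` survive the cleaning, as in the last row of the output above.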


## Adjusting the Training Data

In [20]:
# Create a vocabulary for the messages in the training set
training['SMS'] = training['SMS'].str.split()

vocabulary = []
for sms in training['SMS']:
    for word in sms:
        vocabulary.append(word)

vocabulary = list(set(vocabulary))

In [21]:
word_counts_per_sms = {unique_word: [0] * len(training['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [22]:
word_counts = pd.DataFrame(word_counts_per_sms)

In [25]:
full_training_df = pd.concat([training, word_counts], axis=1)

In [26]:
full_training_df.head(2)

Unnamed: 0,Label,SMS,mca,generally,screamed,tron,s89,savamob,govt,nike,...,ithink,sang,mmmmmm,130,accommodationvouchers,fgkslpo,gymnastics,1146,palm,dai
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
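
The transformation is easier to follow on a tiny input. The sketch below repeats the same two steps (build a vocabulary, then count each word per message) on two made-up messages; `toy_messages`, `toy_vocabulary`, and `toy_counts` are illustrative names, and the vocabulary is sorted here only to make the printed order predictable.

```python
# Two made-up, already tokenized messages
toy_messages = [
    ['secret', 'prize', 'claim', 'secret'],
    ['see', 'you', 'soon'],
]

# Unique words across all messages
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One list per word, with one counter per message
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for index, sms in enumerate(toy_messages):
    for word in sms:
        toy_counts[word][index] += 1

print(toy_vocabulary)
print(toy_counts['secret'])
```

Each key of the dictionary becomes one column of the final DataFrame, and each list position corresponds to one message (one row).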


## Calculating Initial Probs

The terms P(Spam), P(Ham), N_Spam, N_Ham, and N_Vocabulary have the same value for every new message, regardless of the message or the individual words in it, so we can calculate them once from the training set.

However, P(wi|Spam) and P(wi|Ham) will vary depending on the individual words. For instance, P("secret"|Spam) will have a certain probability value, while P("cousin"|Spam) or P("lovely"|Spam) will most likely have other values.

Although both P(wi|Spam) and P(wi|Ham) vary depending on the word, the probability for each individual word is constant for every new message.
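
For reference, the quantities discussed above fit together in the standard multinomial Naive Bayes equations with additive (Laplace) smoothing. The notation is an assumption chosen to match the variable names in the code (`p_spam`, `n_spam`, `n_vocabulary`, `alpha`), where N with a word subscript denotes the number of times word w_i occurs in all messages of that class.

```latex
\begin{aligned}
P(Spam \mid w_1, \dots, w_n) &\propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam) \\
P(Ham \mid w_1, \dots, w_n) &\propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham) \\
P(w_i \mid Spam) &= \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i \mid Ham) &= \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{aligned}
```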

In [28]:
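# Prior probabilities: the share of spam and ham messages in the training set
p_spam = full_training_df['Label'].value_counts(normalize=True)['spam']
p_ham = full_training_df['Label'].value_counts(normalize=True)['ham']
# Total number of words across all spam messages and across all ham messages
n_spam = full_training_df[full_training_df['Label'] == 'spam']['SMS'].apply(len).sum()
n_ham = full_training_df[full_training_df['Label'] == 'ham']['SMS'].apply(len).sum()
# Number of unique words in the vocabulary
n_vocabulary = len(vocabulary)
# Additive (Laplace) smoothing parameter
alpha = 1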

In [32]:
words_spam = {unique_word: 0 for unique_word in vocabulary}
words_ham = {unique_word: 0 for unique_word in vocabulary}
# Isolating the spam and the ham messages in the training set into two different DataFrames
spam_messages = full_training_df[full_training_df['Label'] == 'spam']
ham_messages = full_training_df[full_training_df['Label'] == 'ham']

## Calculating Probs for Each Word

In [33]:
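for word in vocabulary:
    # Occurrences of the word in all spam messages, smoothed with alpha
    n_word_given_spam = spam_messages[word].sum()
    words_spam[word] = (n_word_given_spam + alpha) / (n_spam + alpha * n_vocabulary)
    # Occurrences of the word in all ham messages, smoothed with alpha
    n_word_given_ham = ham_messages[word].sum()
    words_ham[word] = (n_word_given_ham + alpha) / (n_ham + alpha * n_vocabulary)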
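
To make the smoothing concrete, here is a small worked example with made-up counts. The function name `smoothed_probability` and the numbers are hypothetical; only the formula mirrors the loop above.

```python
def smoothed_probability(word_count, n_words_in_class, n_vocabulary, alpha=1):
    """Additive smoothing: (count + alpha) / (class word total + alpha * vocabulary size)."""
    return (word_count + alpha) / (n_words_in_class + alpha * n_vocabulary)

# Made-up class with 10 words in total and a 6-word vocabulary
p_seen = smoothed_probability(word_count=3, n_words_in_class=10, n_vocabulary=6)
p_unseen = smoothed_probability(word_count=0, n_words_in_class=10, n_vocabulary=6)

print(p_seen)    # (3 + 1) / (10 + 6) = 0.25
print(p_unseen)  # (0 + 1) / (10 + 6) = 0.0625
```

The point of `alpha` is visible in the second call: a word that never appears in a class still gets a small non-zero probability, so a single unseen word cannot force the whole product to zero.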

In [35]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        # Words that are not in the vocabulary are ignored
        if word in words_spam:
            p_spam_given_message *= words_spam[word]
        if word in words_ham:
            p_ham_given_message *= words_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

## Testing the Classification System

In [36]:
test1 = classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [37]:
test2 = classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
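
The values printed above are already very small (around 1e-25), and longer messages push the products lower still. A common variation, which this notebook does not use, is to add logarithms instead of multiplying probabilities, since this gives the same ranking without the risk of floating-point underflow. The sketch below assumes made-up parameters; `log_score` and every `toy_` name are hypothetical.

```python
import math

def log_score(prior, word_probabilities, words):
    """Sum log-probabilities instead of multiplying raw probabilities."""
    score = math.log(prior)
    for word in words:
        if word in word_probabilities:   # unknown words are skipped, as in classify()
            score += math.log(word_probabilities[word])
    return score

# Made-up priors and per-word probabilities
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_words_spam = {'secret': 0.05, 'money': 0.04, 'see': 0.001}
toy_words_ham = {'secret': 0.002, 'money': 0.003, 'see': 0.03}

message = ['secret', 'money', 'now']
spam_score = log_score(toy_p_spam, toy_words_spam, message)
ham_score = log_score(toy_p_ham, toy_words_ham, message)
print('spam' if spam_score > ham_score else 'ham')

# Why it matters: a product of many small factors underflows to 0.0,
# while the equivalent sum of logs stays a finite number
print(1e-5 ** 100, 100 * math.log(1e-5))
```

Because the logarithm is monotonically increasing, comparing the two log scores yields the same label as comparing the two products.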


In [46]:
# Same classifier, but returning the label so it can be applied to a DataFrame column

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        # Words that are not in the vocabulary are ignored
        if word in words_spam:
            p_spam_given_message *= words_spam[word]
        if word in words_ham:
            p_ham_given_message *= words_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'Equal probabilities, have a human classify this!'

In [47]:
test['prediction'] = test['SMS'].apply(classify)

P(Spam|message): 3.4831070937898343e-26
P(Ham|message): 4.253245130534654e-19
P(Spam|message): 3.113880725982859e-34
P(Ham|message): 9.669410959057878e-29
P(Spam|message): 7.548549643070596e-83
P(Ham|message): 4.338466063216561e-98
P(Spam|message): 3.608707853113582e-34
P(Ham|message): 1.4814957224618124e-28
P(Spam|message): 2.764395015074001e-68
P(Ham|message): 6.58114288387539e-58
P(Spam|message): 3.003832099003537e-110
P(Ham|message): 1.3968662892114072e-88
P(Spam|message): 6.630543201285272e-08
P(Ham|message): 1.536822002759187e-05
P(Spam|message): 1.6750161514573394e-44
P(Ham|message): 9.822271357766382e-39
P(Spam|message): 1.2938388793330702e-42
P(Ham|message): 5.700659615758871e-36
P(Spam|message): 1.0298350092198955e-15
P(Ham|message): 6.458118857256101e-15
P(Spam|message): 6.697358800236469e-16
P(Ham|message): 1.8371872356584514e-12
P(Spam|message): 1.0054992015050658e-41
P(Ham|message): 3.978760177910219e-34
P(Spam|message): 6.888979648453438e-99
P(Ham|message): 5.55519583276

In [48]:
test.head(20)

Unnamed: 0,Label,SMS,prediction
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham
5,ham,But my family not responding for anything. Now...,ham
6,ham,U too...,ham
7,ham,Boo what time u get out? U were supposed to ta...,ham
8,ham,Genius what's up. How your brother. Pls send h...,ham
9,ham,I liked the new mobile,ham


In [49]:
total = test.shape[0]

# Count the rows where the prediction matches the human label
correct = (test['Label'] == test['prediction']).sum()

print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833
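
Since about 87% of the test messages are ham, a filter that always answered "ham" would already score roughly 0.87 accuracy, so it can be worth checking how the errors split between the two classes. The sketch below shows one way to do that with made-up labels; the lists `actual` and `predicted` are hypothetical and are not the notebook's results.

```python
from collections import Counter

# Made-up human labels and predictions, for illustration only
actual =    ['ham', 'ham', 'spam', 'spam', 'ham', 'spam']
predicted = ['ham', 'ham', 'spam', 'ham',  'ham', 'spam']

# Count each (actual, predicted) pair
pairs = Counter(zip(actual, predicted))
true_positives = pairs[('spam', 'spam')]    # spam correctly caught
false_negatives = pairs[('spam', 'ham')]    # spam that slipped through
false_positives = pairs[('ham', 'spam')]    # ham wrongly flagged

recall = true_positives / (true_positives + false_negatives)
precision = true_positives / (true_positives + false_positives)
print('Recall:', recall)
print('Precision:', precision)
```

With the real data, the same counts could be taken from `test['Label']` and `test['prediction']` instead of the toy lists.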
