# Building a Spam Filter with Naive Bayes

1. The **SMS Spam Collection** dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection#).


2. The SMS Spam Collection v.1 is a public set of labeled SMS messages collected for mobile phone spam research. It consists of a single collection of 5,574 real, non-encoded English messages, each tagged as either legitimate (ham) or spam.

In [39]:
import pandas as pd
import numpy as np
import re

In [13]:
# Read the dataset and take a first look at the data
file_loc = '/Users/sni/Documents/Python/Dataquest-Online-Courses-2022/Datasets/SMSSpamCollection.csv'
# The file is tab-separated and has no header row, so name the columns explicitly
df = pd.read_csv(file_loc, sep='\t', header=None, names=['Label', 'SMS'])
df.head()

,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [20]:
# Get basic information about the dataset: its shape and the label distribution
print(df.shape)
print('\n')
print(df['Label'].value_counts())
print('\n')
print(df['Label'].value_counts(normalize=True))

(5572, 2)


ham     4825
spam     747
Name: Label, dtype: int64


ham     0.865937
spam    0.134063
Name: Label, dtype: float64


Before we build the spam filter, it helps to decide how we will test it. A good rule of thumb is to design the test before writing the software.

Once the spam filter is done, we need to measure how well it classifies new messages. To do that, we are going to split our dataset into two sets:

- A **Training Set**, which we'll use to teach the computer how to classify messages.

- A **Test Set**, which we'll use to test how well the spam filter classifies new messages.

We are going to keep 80% of our dataset for training, and 20% for testing.

- The training set will have 4,458 messages
- The testing set will have 1,114 messages
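The two sizes above follow directly from the 80/20 split of the 5,572 rows reported by `df.shape`. A quick sketch of the arithmetic (the variable names are illustrative and not part of the notebook):

```python
total_messages = 5572      # number of rows reported by df.shape
training_fraction = 0.8    # share of the data kept for training

# Round to a whole number of rows; the remainder goes to the test set.
training_size = round(total_messages * training_fraction)
test_size = total_messages - training_size

print(training_size, test_size)  # 4458 1114
```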

### For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80%, so we expect more than 80% of the new messages to be classified correctly as spam or ham (non-spam).
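Accuracy here means the number of correctly classified messages divided by the total number of messages classified. As a minimal sketch of that measurement, assuming the true and predicted labels are available as two equal-length lists (the function name `accuracy` and the toy labels below are hypothetical, not output from this notebook):

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Toy example (made-up labels): 9 of 10 predictions are correct.
true_labels = ["ham"] * 7 + ["spam"] * 3
predicted_labels = ["ham"] * 7 + ["spam", "spam", "ham"]

score = accuracy(true_labels, predicted_labels)
meets_goal = score > 0.80  # the project's target is strictly above 80%

print(f"accuracy: {score:.1%}, meets goal: {meets_goal}")
```

The same function could later be applied to the `Label` column of the test set and the filter's predictions for it.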

In [78]:
'''Shuffle df, then split it into a training data set called
training and a test data set called test.'''

df_randomized = df.sample(frac=1,random_state=1)

training_rows = round(len(df_randomized)*0.8)

training = df_randomized[:training_rows].copy()
test = df_randomized[training_rows:].copy()

training.reset_index(drop=True,inplace=True)
test.reset_index(drop=True,inplace=True)

print(f'training dataset shape {training.shape}')
print(f'test dataset shape {test.shape}')

training dataset shape (4458, 2)
test dataset shape (1114, 2)


In [79]:
# Check the percentage split between spam and non-spam in each dataset
training['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [80]:
test['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64
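The proportions printed above suggest that both splits stay close to the full dataset's ham/spam ratio. A small sketch that quantifies the gap: the full-dataset counts come from the `value_counts()` output earlier, while the training and test counts are back-computed from the printed proportions and split sizes, so treat them as an assumption rather than notebook output. The helper name `spam_share` is also illustrative.

```python
# Ham/spam counts per split. Only the "full" counts are printed by the
# notebook; the other two are back-computed from proportions * split size.
counts = {
    "full": {"ham": 4825, "spam": 747},
    "training": {"ham": 3858, "spam": 600},
    "test": {"ham": 967, "spam": 147},
}


def spam_share(split_counts):
    """Return the fraction of messages labeled spam in one split."""
    total = sum(split_counts.values())
    return split_counts["spam"] / total


shares = {name: spam_share(c) for name, c in counts.items()}
for name, share in shares.items():
    print(f"{name:<8} spam share: {share:.4f}")

# Largest deviation of either split from the full dataset's spam share.
max_gap = max(abs(shares[s] - shares["full"]) for s in ("training", "test"))
print(f"largest gap: {max_gap:.4f}")
```

A gap well below one percentage point indicates that the random shuffle produced splits that are representative of the whole dataset.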

In [81]:
training.head(2)

,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
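The sample rows above still contain capital letters and punctuation. As a minimal sketch of what lowercasing and a `\W` substitution do to a single message (the helper name `clean_message` is hypothetical and not part of the notebook):

```python
import re


def clean_message(text):
    """Lowercase a message and replace each non-word character with a space."""
    # \W matches anything that is not a letter, digit, or underscore.
    # The raw string (r"...") keeps the backslash from being treated
    # as a Python string escape.
    return re.sub(r"\W", " ", text.lower())


raw = "Yep, by the pretty sculpture"
cleaned = clean_message(raw)

print(repr(cleaned))    # 'yep  by the pretty sculpture'
print(cleaned.split())  # ['yep', 'by', 'the', 'pretty', 'sculpture']
```

Each punctuation mark becomes a space rather than disappearing, so the comma leaves a double space and an apostrophe splits a word in two ("don't" becomes "don t"). Calling `split()` afterwards ignores the extra whitespace.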


In [82]:
'''The messages come in different formats: some contain capital letters
(which need to be converted to lowercase) and some contain punctuation
(which needs to be removed). Let's clean up the dataset first.'''

# First, convert every message to lowercase
training['SMS'] = training['SMS'].str.lower()

# Replace every non-word character (punctuation, symbols) with a space
# using re.sub() from the regular expression module, then apply the
# function to the 'SMS' column. The raw string avoids an invalid-escape warning.
def sub_W(x):
    return re.sub(r'\W', ' ', x)

training['SMS'] = training['SMS'].apply(sub_W)
training

,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
...,...,...
4453,ham,sorry i ll call later in meeting any thing re...
4454,ham,babe i fucking love you too you know fuck...
4455,spam,u ve been selected to stay in 1 of 250 top bri...
4456,ham,hello my boytoy geeee i miss you already a...
