# Building a Spam Filter Using Naive Bayes

Our aim in this project is to build a spam filter for SMS messages using the multinomial naive Bayes classifier. At the end of this project we will have a program that classifies messages as spam or non-spam (ham) with an accuracy greater than 98%.

In addition to implementing our own filter, we will look at `sklearn`'s multinomial naive Bayes, random forest and support vector classifier models. We will compare the performance of all four using cross-validation.

We will use a data set of 5,572 SMS messages that have already been labeled as spam or ham. The data set, along with more information about it, is available from [the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

## Exploring the Data Set

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [2]:
sms = pd.read_csv('files/SMSSpamCollection', sep='\t', header=None, names=['sms_label', 'sms_message'])

sms.head()

|   | sms_label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
sms.shape

(5572, 2)

Below we take a look at the distribution of the target and encode the target column in order to get it ready for ML algorithms.

In [4]:
sms['sms_label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: sms_label, dtype: float64

In [5]:
# Encode the target: spam -> 1, ham -> 0 (assignment avoids the chained in-place replace)
sms['sms_label'] = sms['sms_label'].map({'spam': 1, 'ham': 0})
sms.rename(columns={'sms_label':'is_spam'}, inplace=True)

## Forming Training and Test Sets

In order to check our filter, we'll initially use a simple pair of training and test sets; we'll separate the last fifth of our data set for testing. Later on, we'll move on to cross-validation for all models we consider.

In [6]:
# The index for splitting
cutoff = round(len(sms) * 0.8)

# Randomize the data set before splitting
sms_randomized = sms.sample(frac=1, random_state=1)

initial_training = sms_randomized[:cutoff].reset_index(drop=True)
initial_test = sms_randomized[cutoff:].reset_index(drop=True)

initial_training.head()

|   | is_spam | sms_message |
|---|---|---|
| 0 | 0 | Yep, by the pretty sculpture |
| 1 | 0 | Yes, princess. Are you going to make me moan? |
| 2 | 0 | Welp apparently he retired |
| 3 | 0 | Havent. |
| 4 | 0 | I forgot 2 ask ü all smth.. There's a card on ... |


In [7]:
initial_test.head()

|   | is_spam | sms_message |
|---|---|---|
| 0 | 0 | Later i guess. I needa do mcat study too. |
| 1 | 0 | But i haf enuff space got like 4 mb... |
| 2 | 1 | Had your mobile 10 mths? Update to latest Oran... |
| 3 | 0 | All sounds good. Fingers . Makes it difficult ... |
| 4 | 0 | All done, all handed in. Don't know if mega sh... |


Let's check whether the distribution of the target in each of these sets is similar to that in the whole data set.

In [8]:
initial_training['is_spam'].value_counts(normalize=True)

0    0.86541
1    0.13459
Name: is_spam, dtype: float64

In [9]:
initial_test['is_spam'].value_counts(normalize=True)

0    0.868043
1    0.131957
Name: is_spam, dtype: float64
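
The random split above happens to preserve the class balance, but that is not guaranteed in general. A stratified split enforces it by splitting each class separately. The snippet below is a minimal pure-Python sketch of that idea, using a hypothetical helper `stratified_split` and made-up labels; it is not part of the notebook. In practice, `sklearn.model_selection.train_test_split` offers a `stratify` argument for the same purpose.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=1):
    """Return (train_idx, test_idx) with roughly the same class shares in both parts."""
    rng = random.Random(seed)

    # Group the row positions by class label
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                       # shuffle within the class
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])          # first part of each class goes to test
        train_idx.extend(idx[n_test:])         # the rest goes to training
    return sorted(train_idx), sorted(test_idx)

# Toy labels (made up): 85 ham (0) and 15 spam (1)
labels = [0] * 85 + [1] * 15
train_idx, test_idx = stratified_split(labels)

spam_share_train = sum(labels[i] for i in train_idx) / len(train_idx)
spam_share_test = sum(labels[i] for i in test_idx) / len(test_idx)
print(spam_share_train, spam_share_test)
```

With these toy labels both parts end up with the same spam share, because each class is cut at the same fraction.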

## Using Multinomial Naive Bayes

Now we can start the calculations needed for the spam filter. We'll write three functions:

- `vectorize` will transform the training set into a matrix of token counts and return it together with the vocabulary

- `get_probs` will calculate the prior probabilities and the smoothed conditional probability of each word, using the output of `vectorize`

- `classify` will classify a message using the output of `get_probs`

Recall that, given a message with words $w_1, w_2, ..., w_n$, the multinomial naive Bayes algorithm uses the following proportionality relations to classify the message:

\begin{equation*}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation*}

\begin{equation*}
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation*}

Here, to calculate $P(w_i|Spam)$ and $P(w_i|Ham)$, we will use Laplace smoothing ($\alpha = 1$).

\begin{equation*}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation*}

\begin{equation*}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation*}

- We will start by calculating the terms that are the same for every word, that is, the ones independent of $w_i$. These are $P(Spam)$, $P(Ham)$, $N_{Spam}$, $N_{Ham}$ and $N_{Vocabulary}$.

- Next we will calculate the conditional probabilities for each word, namely $P(w_i|Spam)$ and $P(w_i|Ham)$. We will use separate dictionaries for the spam and ham cases.

- Then we'll be ready to go ahead with the classification. Recall that the classifier compares the products $P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)$ and $P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)$.
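
To make the smoothing formula concrete, here is a small worked example on a made-up corpus of four messages (not taken from the SMS data set). The helper `smoothed` is hypothetical and simply evaluates the formula for $P(w_i|Spam)$ or $P(w_i|Ham)$ with $\alpha = 1$.

```python
from collections import Counter

# Toy training data (made up for illustration)
spam_messages = ["win cash now", "win a prize"]
ham_messages = ["see you now", "call me"]

alpha = 1  # Laplace smoothing

# Word counts per class
spam_counts = Counter(w for m in spam_messages for w in m.split())
ham_counts = Counter(w for m in ham_messages for w in m.split())

vocabulary = set(spam_counts) | set(ham_counts)
n_vocabulary = len(vocabulary)        # 9 distinct words
n_spam = sum(spam_counts.values())    # 6 words in spam messages
n_ham = sum(ham_counts.values())      # 5 words in ham messages

def smoothed(word, counts, n_class):
    """(N_word|class + alpha) / (N_class + alpha * N_vocabulary)"""
    return (counts[word] + alpha) / (n_class + alpha * n_vocabulary)

p_win_spam = smoothed("win", spam_counts, n_spam)  # (2 + 1) / (6 + 9)
p_win_ham = smoothed("win", ham_counts, n_ham)     # (0 + 1) / (5 + 9)
print(p_win_spam, p_win_ham)

# The smoothed probabilities over the whole vocabulary sum to 1 for each class
total_spam = sum(smoothed(w, spam_counts, n_spam) for w in vocabulary)
print(total_spam)
```

Note that "win" never appears in a ham message, yet its smoothed probability given ham is positive. Without smoothing it would be zero and would wipe out the whole product.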

In [10]:
def vectorize(train):
    '''train: sub-dataframe of sms
    returns: train with one token-count column per word (dataframe)
             and the vocabulary of train (set)
    '''
    cvec = CountVectorizer(ngram_range=(1, 1), token_pattern=r"(?u)\b\w+\b")

    train_vec = cvec.fit_transform(train['sms_message'])

    train_vectorized = pd.concat(
        [train,
         pd.DataFrame(train_vec.toarray(),
                      columns=cvec.get_feature_names_out())],
         axis=1
    )

    vocabulary = set(train_vectorized.columns) - {'sms_message', 'is_spam'}
    return train_vectorized, vocabulary
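
To see what kind of matrix `vectorize` builds, the sketch below mimics the counting step in pure Python on two made-up messages. It applies the same token pattern after lowercasing (`CountVectorizer` lowercases by default). The helper `count_tokens` is hypothetical and only illustrates the idea; it is not how `CountVectorizer` is implemented.

```python
import re
from collections import Counter

# Same pattern as the one passed to CountVectorizer above; unlike the
# default pattern, it keeps single-character tokens such as "u" or "2"
TOKEN_PATTERN = re.compile(r"(?u)\b\w+\b")

def count_tokens(message):
    """Lowercase the message and count the tokens matching TOKEN_PATTERN."""
    return Counter(TOKEN_PATTERN.findall(message.lower()))

# Toy messages (made up for illustration)
messages = ["Free entry 2 win", "U r free now"]

counts = [count_tokens(m) for m in messages]
vocabulary = sorted(set().union(*counts))               # one column per distinct token
matrix = [[c[w] for w in vocabulary] for c in counts]   # one row per message

print(vocabulary)
for row in matrix:
    print(row)
```

Each row corresponds to a message and each column to a word of the vocabulary, which is the layout `get_probs` relies on when it sums a word's column over the spam or ham rows.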

In [11]:
def get_probs(train, vocabulary):
    '''Calculates probabilities using train as the training set

    train: vectorized training dataframe (first output of vectorize)
    vocabulary: set of words in train (second output of vectorize)
    returns: p_spam and p_ham (floats), cond_spam and cond_ham (dictionaries)
    '''
    n_vocabulary = len(vocabulary)

    alpha = 1 # Laplace smoothing

    spam = train[train['is_spam'] == 1]
    ham = train[train['is_spam'] == 0]

    p_spam = len(spam) / len(train)
    p_ham = len(ham) / len(train)

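    # NOTE: len() on a message string counts characters, not tokens, so these
    # totals are larger than the word counts N_Spam and N_Ham in the formulas above
    n_spam = spam['sms_message'].apply(len).sum()
    n_ham = ham['sms_message'].apply(len).sum()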

    cond_spam, cond_ham = {}, {}

    for w in vocabulary:
        cond_spam[w] = (spam[w].sum() + alpha) / (n_spam + alpha * n_vocabulary)
        cond_ham[w] = (ham[w].sum() + alpha) / (n_ham + alpha * n_vocabulary)
    
    return p_spam, p_ham, cond_spam, cond_ham

In [12]:
def classify(sms, p_spam, p_ham, cond_spam, cond_ham, print_posterior=False):
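    '''Classifies sms using precomputed probabilities

    sms: (str) an sms message
    p_spam, p_ham: (float) prior probabilities from get_probs
    cond_spam, cond_ham: (dict) conditional word probabilities from get_probs
    returns: 1 (spam), 0 (ham) or 'unclassified' if the two scores are equal
    '''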
    p_spam_given_sms = p_spam
    p_ham_given_sms = p_ham
    
    for w in sms.split():
        if w in cond_spam:
            p_spam_given_sms *= cond_spam[w]
        if w in cond_ham:
            p_ham_given_sms *= cond_ham[w]
    
    if print_posterior:
        print("SMS: \"{}\"".format(sms))
        print("P(Spam|SMS) = {}".format(p_spam_given_sms))
        print("P(Ham|SMS) = {}".format(p_ham_given_sms))
    
    if p_spam_given_sms > p_ham_given_sms:
        return 1
    elif p_spam_given_sms < p_ham_given_sms:
        return 0
    else:
        return 'unclassified'
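
One caveat of multiplying many small probabilities is floating-point underflow: for a long enough message both products can round to `0.0`, and `classify` would then return `'unclassified'`. A common remedy is to compare sums of logarithms instead, which preserves the ordering. The function `classify_log` below is a hypothetical variant sketched under that assumption, with made-up probabilities; it is not part of the notebook.

```python
import math

def classify_log(message, p_spam, p_ham, cond_spam, cond_ham):
    """Like classify, but compares log-scores instead of raw products."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)

    for w in message.split():
        if w in cond_spam:
            log_spam += math.log(cond_spam[w])
        if w in cond_ham:
            log_ham += math.log(cond_ham[w])

    if log_spam > log_ham:
        return 1
    if log_spam < log_ham:
        return 0
    return 'unclassified'

# Toy probabilities (made up for illustration)
cond_spam = {"free": 1e-3, "hello": 1e-5}
cond_ham = {"free": 1e-5, "hello": 1e-3}
long_message = " ".join(["free"] * 200)

# Raw products, computed the same way classify does: both underflow to 0.0
raw_spam = raw_ham = 0.5
for w in long_message.split():
    raw_spam *= cond_spam[w]
    raw_ham *= cond_ham[w]
print(raw_spam, raw_ham)

# The log-scores remain finite and still separate the two classes
print(classify_log(long_message, 0.5, 0.5, cond_spam, cond_ham))
```

SMS messages are short, so the raw products usually stay well above the smallest positive double (around 5e-324), but the log form is the safer choice for longer texts.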

## Initial Test of Our Filter

In [13]:
# Raw string avoids an invalid escape sequence warning for \W
initial_test['sms_message'] = initial_test['sms_message'].str.replace(r'\W', ' ', regex=True).str.lower()

In [14]:
train_vectorized, vocabulary = vectorize(initial_training)
p_spam, p_ham, cond_spam, cond_ham = get_probs(train_vectorized, vocabulary)

Let's check a couple of messages.

In [15]:
classify(initial_test['sms_message'][0], p_spam, p_ham, cond_spam, cond_ham, print_posterior=True)

SMS: "later i guess  i needa do mcat study too "
P(Spam|SMS) = 2.358527594972535e-30
P(Ham|SMS) = 1.3859954358708115e-23


0

In [16]:
classify(initial_test['sms_message'][2], p_spam, p_ham, cond_spam, cond_ham, print_posterior=True)

SMS: "had your mobile 10 mths  update to latest orange camera video phones for free  save  s with free texts weekend calls  text yes for a callback orno to opt out"
P(Spam|SMS) = 4.026633211507193e-100
P(Ham|SMS) = 1.118169555940427e-116


1

The filter classifies both messages correctly. Let's check the overall accuracy on the test set.

In [17]:
preds = initial_test['sms_message'].apply(
    lambda s: classify(s, p_spam, p_ham, cond_spam, cond_ham)
)

In [18]:
(preds == initial_test['is_spam']).value_counts()

True     1098
False      16
dtype: int64

In [19]:
(preds == initial_test['is_spam']).value_counts(normalize=True)

True     0.985637
False    0.014363
dtype: float64

The results look very good on our initial test set.
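
Since about 86.6% of the messages are ham, a filter that always predicts ham would already reach roughly that accuracy, so accuracy alone says little about how well spam is caught. Precision and recall on the spam class complement it. The sketch below computes them from a confusion count on made-up labels and predictions; the helper `confusion_counts` is hypothetical and the numbers are not results from the notebook.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) treating 1 (spam) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham left alone
    return tp, fp, fn, tn

# Toy labels and predictions (made up for illustration)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # share of flagged messages that are really spam
recall = tp / (tp + fn)     # share of spam messages that were flagged

print(accuracy, precision, recall)
```

For a spam filter, low precision means legitimate messages get lost, which is often the costlier kind of error.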

## Cross-Validation on Our Filter

Now we'll proceed with cross-validation. First we'll look at the distribution of the target in the training and test sets of each iteration to check whether they are similar to that of the whole data set.

In [20]:
cv = KFold(n_splits=5, shuffle=True, random_state=1)

for i, (tr, tt) in enumerate(cv.split(sms)):
    print(f"Distribution of target on iteration {i}")
    print("Training set:")
    print(sms['is_spam'].iloc[tr].value_counts(normalize=True))
    print("Test set:")
    print(sms['is_spam'].iloc[tt].value_counts(normalize=True))
    print()

Distribution of target on iteration 0
Training set:
0    0.86538
1    0.13462
Name: is_spam, dtype: float64
Test set:
0    0.868161
1    0.131839
Name: is_spam, dtype: float64

Distribution of target on iteration 1
Training set:
0    0.868521
1    0.131479
Name: is_spam, dtype: float64
Test set:
0    0.855605
1    0.144395
Name: is_spam, dtype: float64

Distribution of target on iteration 2
Training set:
0    0.866981
1    0.133019
Name: is_spam, dtype: float64
Test set:
0    0.861759
1    0.138241
Name: is_spam, dtype: float64

Distribution of target on iteration 3
Training set:
0    0.863392
1    0.136608
Name: is_spam, dtype: float64
Test set:
0    0.876122
1    0.123878
Name: is_spam, dtype: float64

Distribution of target on iteration 4
Training set:
0    0.86541
1    0.13459
Name: is_spam, dtype: float64
Test set:
0    0.868043
1    0.131957
Name: is_spam, dtype: float64



We'll keep records for each model in a separate dataframe.

In [21]:
results = pd.DataFrame()

Note that the output of `get_probs` depends only on the training set, so it is the same for every test message within an iteration of `cv`. We therefore run it only once per iteration.

In [22]:
accuracies = []

for i, (tr, tt) in enumerate(cv.split(sms)):
    train =  sms.iloc[tr].reset_index(drop=True)
    test = sms.iloc[tt].reset_index(drop=True)

    test['sms_message'] = test['sms_message'].str.replace(r'\W', ' ', regex=True).str.lower()

    train_vectorized, vocabulary = vectorize(train)
    p_spam, p_ham, cond_spam, cond_ham = get_probs(train_vectorized, vocabulary)
    
    preds = test['sms_message'].apply(
        lambda s: classify(s, p_spam, p_ham, cond_spam, cond_ham)
    )

    mask = (preds == test['is_spam'])
    print(mask.value_counts())

    accuracies.append(mask.value_counts(normalize=True)[True])

print()
print(f"accuracies: {accuracies}")
print(f"mean accuracy: {np.mean(accuracies)}")
print(f"standard deviation: {np.std(accuracies)}")

results['our_filter'] = [np.mean(accuracies), np.std(accuracies)]

True     1101
False      14
dtype: int64
True     1092
False      23
dtype: int64
True     1099
False      15
dtype: int64
True     1100
False      14
dtype: int64
True     1098
False      16
dtype: int64

accuracies: [0.9874439461883409, 0.979372197309417, 0.9865350089766607, 0.9874326750448833, 0.9856373429084381]
mean accuracy: 0.985284234085548
standard deviation: 0.0030305595996140775
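
The test-set sizes in the output above (1,115 for the first two folds and 1,114 for the rest) follow from how `KFold` divides the data: according to the scikit-learn documentation, the first `n_samples % n_splits` folds get one extra sample. The hypothetical helper `kfold_sizes` below is a minimal sketch that reproduces those sizes.

```python
def kfold_sizes(n_samples, n_splits):
    """Fold sizes as documented for KFold: the first
    n_samples % n_splits folds have one extra sample."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

sizes = kfold_sizes(5572, 5)
print(sizes)
```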


## Models from `scikit-learn`

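Now we'll try a few models from `scikit-learn` on our data set, using the same `cv` splits as above for every model. Note that `get_scores` below vectorizes the messages with both unigrams and bigrams (`ngram_range=(1, 2)`), whereas our own filter used unigrams only.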

In [23]:
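def get_scores(model):
    '''Cross-validates model on sms using the cv splits and records the results

    model: an unfitted scikit-learn classifier
    '''
    accuracies = []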

    for i, (tr, tt) in enumerate(cv.split(sms)):
        cvec = CountVectorizer(ngram_range=(1, 2), token_pattern=r"(?u)\b\w+\b")
        train_vectorized = cvec.fit_transform(sms.iloc[tr]['sms_message'])
        test_vectorized = cvec.transform(sms.iloc[tt]['sms_message'])

        model.fit(train_vectorized, sms.iloc[tr]['is_spam'])
        preds = model.predict(test_vectorized)

        mask = (preds == sms.iloc[tt]['is_spam'])
        print(mask.value_counts())

        accuracies.append(mask.value_counts(normalize=True)[True])
    
    print()
    print(f"accuracies: {accuracies}")
    print(f"mean accuracy: {np.mean(accuracies)}")
    print(f"standard deviation: {np.std(accuracies)}")

    results[str(model)] = [np.mean(accuracies), np.std(accuracies)]
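
The same fit-on-train, transform-on-test logic can also be expressed with a `Pipeline` and `cross_val_score`, which refit the vectorizer inside each fold automatically. The snippet below is a sketch of that alternative on a handful of made-up messages; it is not what the notebook ran, and the toy `texts` and `labels` merely stand in for `sms['sms_message']` and `sms['is_spam']`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (made up for illustration): 1 = spam, 0 = ham
texts = [
    "win a free prize now", "free cash call now", "claim your free entry",
    "urgent win cash today", "free ringtone text now",
    "see you at lunch", "call me when you are home", "ok thanks a lot",
    "are we still meeting today", "i will be late sorry",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Vectorizer and classifier are fitted together on each training fold
pipeline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), token_pattern=r"(?u)\b\w+\b"),
    MultinomialNB(alpha=1),
)

cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")

print(scores.mean(), scores.std())
```

Keeping the vectorizer inside the pipeline guards against accidentally fitting the vocabulary on test messages.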

### Multinomial Naive Bayes

In [24]:
mnb = MultinomialNB(alpha=1) # Laplace smoothing as before

get_scores(mnb)

True     1105
False      10
Name: is_spam, dtype: int64
True     1097
False      18
Name: is_spam, dtype: int64
True     1098
False      16
Name: is_spam, dtype: int64
True     1101
False      13
Name: is_spam, dtype: int64
True     1101
False      13
Name: is_spam, dtype: int64

accuracies: [0.9910313901345291, 0.9838565022421525, 0.9856373429084381, 0.9883303411131059, 0.9883303411131059]
mean accuracy: 0.9874371835022664
standard deviation: 0.0024728318503580344


### Random Forest Classifier

In [25]:
rf = RandomForestClassifier()

get_scores(rf)

True     1088
False      27
Name: is_spam, dtype: int64
True     1080
False      35
Name: is_spam, dtype: int64
True     1078
False      36
Name: is_spam, dtype: int64
True     1080
False      34
Name: is_spam, dtype: int64
True     1088
False      26
Name: is_spam, dtype: int64

accuracies: [0.9757847533632287, 0.968609865470852, 0.9676840215439856, 0.9694793536804309, 0.9766606822262118]
mean accuracy: 0.9716437352569418
standard deviation: 0.003791728735691018


### Support Vector Classifier

In [26]:
svc = SVC()

get_scores(svc)

True     1104
False      11
Name: is_spam, dtype: int64
True     1090
False      25
Name: is_spam, dtype: int64
True     1095
False      19
Name: is_spam, dtype: int64
True     1095
False      19
Name: is_spam, dtype: int64
True     1096
False      18
Name: is_spam, dtype: int64

accuracies: [0.9901345291479821, 0.9775784753363229, 0.9829443447037702, 0.9829443447037702, 0.9838420107719928]
mean accuracy: 0.9834887409327677
standard deviation: 0.003995379193758718


## Conclusion

In [27]:
results.rename({0:'mean accuracy', 1:'stdev of accuracies'}, axis=0, inplace=True)

results

|   | our_filter | MultinomialNB(alpha=1) | RandomForestClassifier() | SVC() |
|---|---|---|---|---|
| mean accuracy | 0.985284 | 0.987437 | 0.971644 | 0.983489 |
| stdev of accuracies | 0.003031 | 0.002473 | 0.003792 | 0.003995 |


Among the models we've tried, `sklearn`'s multinomial naive Bayes model has the highest mean accuracy on this data set (98.74%). The filter we've written is a close second (98.53%), followed by the support vector classifier (98.35%) and the random forest classifier (97.16%). The gap between the top two, about 0.2 percentage points, is smaller than the fold-to-fold standard deviation of either model, so their ranking should be read with caution.