# Simple SMS Spam Filter in Python

## Agenda

This project aims to implement several spam-filtering methods in Python on a dataset of SMS messages. It is developed in the script *sms_spam_filter.py*.

It uses a [dataset](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv) of 5,572 labelled SMS messages, and a Multinomial Naive Bayes model for classification.

This notebook is organised as follows:
* [1) Reading the SMS data](#1-reading-the-sms-data)
* [2) Learning to predict "spamminess"](#2-learning-to-predict-spamminess)
* [3) Analyzing the results](#3-analyzing-the-results)

## 1) Reading the SMS data <a class="anchor" id="1-reading-the-sms-data"></a>

This first section focuses on reading the data, taking a quick look at it, and preparing the datasets we are going to use in this study.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Let's load the data from an online TSV file
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv'
sms = pd.read_table(url, header=None, names=['label', 'message'])

In [3]:
# Let's check how the data looks
sms.head(10)

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


In [4]:
# Let's check the balance of the dataset
sms.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

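The dataset is very unbalanced: about 87% of the messages are ham. Depending on the chosen model, this might be an issue during the learning process. It also means that accuracy must be read with care, since the null model (always predicting ham) would already be right about 87% of the time.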
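
As a quick check of the null-model remark, the accuracy of a classifier that always predicts the majority class can be computed directly from the counts printed above. This is only a sketch of the arithmetic; the variable names are illustrative and are not part of the original script.

```python
# Class counts copied from the value_counts() output above
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy obtained by always predicting the majority class
baseline_accuracy = counts[majority_label] / total
print(majority_label, round(baseline_accuracy, 3))
```

Any accuracy reported later in the notebook is best compared with this baseline rather than with 50%.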

In [5]:
# Let us create a proper target column. 1 will be spam, since this is what we are trying to detect
sms['target'] = sms.label.map({'ham':0, 'spam':1})
sms.head()

| | label | message | target |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


In [6]:
# Let us now create our overall dataset
X = sms.message
y = sms.target
print(X.shape, y.shape)

(5572,) (5572,)


Both have the same length, as expected, and are one-dimensional: the data is not vectorized yet, so each observation of X is a single string.

But first, let us split our dataset into two sets:
* A training set, on which we will train our model (using cross-validation first to assess its performance and tune its parameters, before training the chosen model on the whole training set)
* A validation set, on which we will assess the performance of our best model once it is optimized and trained.

In [7]:
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1)

## 2) Learning to predict "spamminess" <a class="anchor" id="2-learning-to-predict-spamminess"></a>

The first thing to do is to vectorize the datasets (i.e. get a numerical matrix out of this one-dimensional array of strings). In order to do so, we will use CountVectorizer, which creates a sparse document-term matrix (DTM) with the observations ("documents") on the first dimension, the tokens ("terms") on the second dimension, and the number of occurrences of each token in each observation as values.
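
To make the document-term matrix concrete, here is a small sketch on a toy corpus. The three messages are made up for illustration and do not come from the SMS dataset; `toy_vect` is a throwaway vectorizer with default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical messages, not taken from the SMS dataset)
docs = ["free prize now", "call me now", "free free call"]

toy_vect = CountVectorizer()
dtm = toy_vect.fit_transform(docs)  # sparse matrix of shape (n_documents, n_tokens)

# Recover the column order from the learned vocabulary (token -> column index)
vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(vocab)
print(dtm.toarray())
```

Each row is one message and each column one token, so the third row holds a 2 in the "free" column because that word appears twice in the third message.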

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB as MNB

from sklearn.pipeline import make_pipeline
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score
from sklearn import metrics

The first difficulty here is the following: our vectorizer vect must be fitted only on the training set, so that the document-term matrix of the training set does not contain any information from the testing set (i.e. so that the vocabulary and frequencies of the training DTM do not include testing observations).

How to use cross-validation, then? The solution is pipelines. A pipeline (vectorizer, model) can be passed to the cross-validation function: for every iteration, it is fitted on the training fold only, before being applied to the testing fold.
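
The sketch below spells out by hand what a pipeline does inside each cross-validation iteration. It runs on a made-up toy corpus, not the SMS data, and the names `texts`, `labels` and `fold_scores` are illustrative. It is meant as an approximation of the behaviour, not a copy of scikit-learn's internals.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

# Toy corpus (hypothetical): 6 spam-like and 6 ham-like messages
texts = np.array([
    "free prize claim now", "win free prize today", "claim your free prize",
    "free cash prize call now", "urgent claim free cash", "win cash prize now",
    "see you at lunch", "are you home now", "call me later please",
    "see you later today", "ok see you then", "can you call me",
])
labels = np.array([1] * 6 + [0] * 6)

fold_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(texts, labels):
    fold_vect = CountVectorizer()
    # The vocabulary is learned from the training fold only
    train_dtm = fold_vect.fit_transform(texts[train_idx])
    # The held-out fold is only transformed; tokens unseen in training are ignored
    test_dtm = fold_vect.transform(texts[test_idx])
    model = MultinomialNB().fit(train_dtm, labels[train_idx])
    fold_scores.append(model.score(test_dtm, labels[test_idx]))

print(np.mean(fold_scores))
```

Fitting the vectorizer once on all messages before the loop would let the held-out messages influence the vocabulary, which is exactly the leak the pipeline avoids.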

In [9]:
vect = CountVectorizer(max_features=4000, max_df=0.4)
mnb = MNB()

pipeline = make_pipeline(vect, mnb)

In [10]:
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=5)
np.mean(scores)

0.9851037347738929

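The parameters chosen above have been adjusted by trial and error, trying to improve the overall mean score obtained through cross-validation. Other parameters have been tested, such as *stop_words*, *min_df* and *ngram_range*. Note that the cross-validation above is run on the full dataset (X, y); running it on X_train and y_train instead would keep the validation set entirely unseen during this tuning.

Now, we can just fit the pipeline on the whole training set. This simple operation does the following things: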
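
The trial-and-error search over vectorizer parameters can also be automated. The sketch below uses `GridSearchCV` on a made-up toy corpus. The grid values are arbitrary examples, not the ones tried by the author, and the step name `countvectorizer` is the one `make_pipeline` derives from the class name.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical): 6 spam-like and 6 ham-like messages
texts = [
    "free prize claim now", "win free prize today", "claim your free prize",
    "free cash prize call now", "urgent claim free cash", "win cash prize now",
    "see you at lunch", "are you home now", "call me later please",
    "see you later today", "ok see you then", "can you call me",
]
labels = [1] * 6 + [0] * 6

search_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "countvectorizer__ngram_range": [(1, 1), (1, 2)],
    "countvectorizer__min_df": [1, 2],
}

search = GridSearchCV(search_pipeline, param_grid, scoring="accuracy", cv=3)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

On the real data, the same search would be run on the training set only, so that the validation set stays untouched until the final assessment.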
* Fits the vectorizer on the training set
* Transforms the training set with the fitted vectorizer
* Fits the model on the training set with the given targets

In [11]:
pipeline.fit(X_train, y_train);

In [12]:
y_pred = pipeline.predict(X_valid)

In [13]:
metrics.accuracy_score(y_valid, y_pred)

0.99103139013452912

Our model reaches a prediction accuracy of 99.1% on the validation set: 99.1% of the SMS tested were correctly classified as spam or ham.

## 3) Analyzing the results <a class="anchor" id="3-analyzing-the-results"></a>

We are going to need access to our vectorizer's attributes to go further in the analysis of the results, so let's repeat the previous process without the pipeline (this is also an opportunity to notice how convenient the pipeline is).

In [14]:
vect = CountVectorizer(max_df=0.4, max_features=4000)
mnb = MNB()

In [15]:
X_train_dtm = vect.fit_transform(X_train, y_train)
mnb.fit(X_train_dtm, y_train);

In [16]:
X_valid_dtm = vect.transform(X_valid)
y_pred = mnb.predict(X_valid_dtm)

metrics.accuracy_score(y_valid, y_pred)

0.99103139013452912

### 3.a) Measuring accuracy

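First, let's check out the ROC AUC score, which measures how well the model ranks the items by their probability of belonging to the positive class.
This is often a good indicator when you want the predictions ordered by probability (*e.g.* if you want to know which items are the most likely to be class 1). Note that the cell below passes the hard class predictions `y_pred` rather than predicted probabilities, so the value it returns is the average of the recall obtained on each class; passing `mnb.predict_proba(X_valid_dtm)[:, 1]` instead would measure the ranking itself.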

In [17]:
metrics.roc_auc_score(y_valid, y_pred)

0.9746408894136166
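
To illustrate the ranking interpretation, here is a minimal pure-Python sketch of ROC AUC as the fraction of (spam, ham) pairs in which the spam message gets the higher score. The labels and scores are made up, and `rank_auc` is a hypothetical helper, not a scikit-learn function.

```python
def rank_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count for half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical labels and predicted spam probabilities)
y_true = [0, 0, 1, 1, 1]
spam_proba = [0.1, 0.6, 0.4, 0.7, 0.9]

# Hard labels obtained by thresholding the probabilities at 0.5
hard_labels = [int(p >= 0.5) for p in spam_proba]

print(rank_auc(y_true, spam_proba))   # uses the full ranking
print(rank_auc(y_true, hard_labels))  # only two distinct scores remain
```

With hard labels most pairs are ties or are decided by a single threshold, so the score only reflects the mean of the two per-class recalls, which is why probabilities are normally passed to `roc_auc_score`.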

Let's check the confusion matrix. This is a useful metric if we decide that one kind of mistake should weigh more than the other.

For example, one might prefer to let more actual spam through as ham rather than flag actual ham as spam (in order not to miss information): the confusion matrix (or the false positive rate) would then need to be monitored during the model optimisation phase.
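
One common way to favour one kind of mistake over the other is to move the decision threshold applied to the predicted spam probability. The sketch below uses made-up labels and probabilities, and `count_errors` is a hypothetical helper. In the notebook's setting the probabilities would come from the model's `predict_proba` output.

```python
def count_errors(y_true, spam_proba, threshold):
    """Return (false positives, false negatives) for a given decision threshold."""
    y_hat = [int(p >= threshold) for p in spam_proba]
    fp = sum(1 for y, h in zip(y_true, y_hat) if y == 0 and h == 1)
    fn = sum(1 for y, h in zip(y_true, y_hat) if y == 1 and h == 0)
    return fp, fn

# Toy example (hypothetical labels and predicted spam probabilities)
y_true = [0, 0, 0, 1, 1, 1]
spam_proba = [0.05, 0.30, 0.70, 0.60, 0.85, 0.99]

for threshold in (0.5, 0.8):
    print(threshold, count_errors(y_true, spam_proba, threshold))
```

Raising the threshold from 0.5 to 0.8 removes the false positive in this toy example at the cost of one false negative, which is the trade-off described above.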

In [18]:
metrics.confusion_matrix(y_valid, y_pred)

array([[965,   3],
       [  7, 140]])
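
The usual summary metrics can be read off this matrix. In scikit-learn's convention, rows are the actual classes and columns the predicted ones, so the layout is `[[TN, FP], [FN, TP]]`. The sketch below only redoes the arithmetic on the four numbers printed above.

```python
# Values copied from the confusion matrix above: rows = actual, columns = predicted
tn, fp = 965, 3    # actual ham:  predicted ham, predicted spam
fn, tp = 7, 140    # actual spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)       # share of predicted spam that really is spam
recall = tp / (tp + fn)          # share of actual spam that is caught
specificity = tn / (tn + fp)     # share of actual ham that is left alone

# Mean of the two per-class recalls
balanced_accuracy = (recall + specificity) / 2

print(accuracy, precision, recall, specificity, balanced_accuracy)
```

The accuracy matches the 0.991 reported earlier, and the mean of the two recalls reproduces the 0.9746 returned by `roc_auc_score` when it is given hard predictions.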

It can be interesting to know which SMS our model failed to classify correctly. To do so, let's check out the actual false positives and false negatives of our prediction:

In [19]:
# False positives:
pd.set_option("display.max_colwidth",150)
X_valid[y_valid < y_pred].head()

574               Waiting for your call.
3375             Also andros ice etc etc
45      No calls..messages..missed calls
Name: message, dtype: object

In [20]:
# False negatives:
pd.set_option("display.max_colwidth",150)
X_valid[y_valid > y_pred].head()

3132    LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MM...
5         FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
3530    Xmas & New Years Eve tickets are now on sale from the club, during the day from 10am till 8pm, and on Thurs, Fri & Sat night this week. They're se...
1875                                                                     Would you like to see my XXX pics they are so hot they were nearly banned in the uk!
1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIVE CHAT GOING ON IN THE OFFICE RIGHT NOW TOTAL PRIVACY NO ONE KNOWS YOUR [sic] LISTENING 60P MIN 24/7...
Name: message, dtype: object
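
The two selections above rely on comparing 0/1 labels element by element: `y_valid < y_pred` is true only where the actual label is 0 and the prediction is 1, and `y_valid > y_pred` only in the opposite case. A small sketch with made-up arrays (the `_toy` names are illustrative):

```python
import numpy as np

# Toy labels (hypothetical): 0 = ham, 1 = spam
y_true_toy = np.array([0, 0, 1, 1])
y_pred_toy = np.array([0, 1, 0, 1])

# True only where actual = 0 and predicted = 1
false_positive_mask = y_true_toy < y_pred_toy
# True only where actual = 1 and predicted = 0
false_negative_mask = y_true_toy > y_pred_toy

print(false_positive_mask)
print(false_negative_mask)
```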

One can notice that the false positives are all very short messages. My guess is that each of them contains one "spammy" word among its few words ("call", for example), which is enough to get a spam prediction, since there are not enough "hammy" words to counterbalance it.

### 3.b) Spamminess of each token

Let us now try to compute the "spamminess" and "hamminess" we were talking about in the previous section.

To do so, one can look at how frequently every word appears in spam content and in ham content. The ratio spam_frequency / ham_frequency can then be taken as a measure of spamminess.

In [21]:
# store the vocabulary of X_train
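# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
X_train_tokens = vect.get_feature_names_out()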
len(X_train_tokens)

4000

In [22]:
# Naive Bayes counts the number of times each token appears in each class
mnb.feature_count_

array([[  0.,   0.,   0., ...,   0.,   2.,   1.],
       [  5.,  24.,   2., ...,   6.,   0.,   1.]])

In [23]:
# number of times each token appears across all HAM messages
ham_token_count = mnb.feature_count_[0, :]
ham_token_count

array([ 0.,  0.,  0., ...,  0.,  2.,  1.])

In [24]:
# number of times each token appears across all SPAM messages
spam_token_count = mnb.feature_count_[1, :]
spam_token_count

array([  5.,  24.,   2., ...,   6.,   0.,   1.])

In [25]:
# create a DataFrame of tokens with their separate ham and spam counts
tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')
tokens.sample(5, random_state=6)

| token | ham | spam |
|---|---|---|
| respect | 4.0 | 0.0 |
| senses | 1.0 | 0.0 |
| lol | 64.0 | 0.0 |
| affair | 2.0 | 0.0 |
| haven | 19.0 | 0.0 |


Before we can use this to calculate the "spamminess" of each token, we need to avoid dividing by zero and take care of the class imbalance (by using frequencies rather than counts).

In [26]:
# add 1 to ham and spam counts to avoid dividing by 0
tokens['ham'] = tokens.ham + 1
tokens['spam'] = tokens.spam + 1

In [27]:
# convert the ham and spam counts into frequencies
tokens['ham'] = tokens.ham / mnb.class_count_[0]
tokens['spam'] = tokens.spam / mnb.class_count_[1]
tokens.sample(5, random_state=6)

| token | ham | spam |
|---|---|---|
| respect | 0.001296 | 0.001667 |
| senses | 0.000519 | 0.001667 |
| lol | 0.016852 | 0.001667 |
| affair | 0.000778 | 0.001667 |
| haven | 0.005185 | 0.001667 |


In [28]:
# calculate the spamminess (spam_frequency/ham_frequency)
tokens['spamminess'] = tokens.spam / tokens.ham
tokens.sample(5, random_state=6)

| token | ham | spam | spamminess |
|---|---|---|---|
| respect | 0.001296 | 0.001667 | 1.285667 |
| senses | 0.000519 | 0.001667 | 3.214167 |
| lol | 0.016852 | 0.001667 | 0.098897 |
| affair | 0.000778 | 0.001667 | 2.142778 |
| haven | 0.005185 | 0.001667 | 0.321417 |
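
The three steps above (add one, divide by the class size, take the ratio) can be condensed into a single function. This is a sketch: `spamminess` is a hypothetical helper, and the class sizes `N_HAM = 3857` and `N_SPAM = 600` are assumptions inferred from the rounded frequencies shown above, since `mnb.class_count_` is never printed in the notebook.

```python
def spamminess(ham_count, spam_count, n_ham, n_spam):
    """Ratio of add-one-smoothed spam frequency to add-one-smoothed ham frequency."""
    ham_freq = (ham_count + 1) / n_ham     # +1 avoids dividing by zero later
    spam_freq = (spam_count + 1) / n_spam
    return spam_freq / ham_freq

# Assumed training class sizes (inferred from the tables, not printed in the notebook)
N_HAM, N_SPAM = 3857, 600

# Raw counts taken from the sampled tokens above: "lol" and "respect"
lol_score = spamminess(64, 0, N_HAM, N_SPAM)
respect_score = spamminess(4, 0, N_HAM, N_SPAM)
print(round(lol_score, 6), round(respect_score, 6))
```

A score below 1 marks a token that is relatively more frequent in ham, while a score above 1 marks one that is relatively more frequent in spam. A token with very few occurrences overall, such as "respect", can still score above 1 purely because of the add-one smoothing.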


Now, one can look at the following list, sorted by descending spamminess... And it all seems quite understandable!

In [30]:
# examine the DataFrame sorted by spamminess
tokens.sort_values('spamminess', ascending=False)

| token | ham | spam | spamminess |
|---|---|---|---|
| claim | 0.000259 | 0.150000 | 578.550000 |
| prize | 0.000259 | 0.133333 | 514.266667 |
| 150p | 0.000259 | 0.090000 | 347.130000 |
| tone | 0.000259 | 0.083333 | 321.416667 |
| guaranteed | 0.000259 | 0.076667 | 295.703333 |
| 18 | 0.000259 | 0.075000 | 289.275000 |
| www | 0.000519 | 0.135000 | 260.347500 |
| cs | 0.000259 | 0.061667 | 237.848333 |
| 1000 | 0.000259 | 0.058333 | 224.991667 |
| 500 | 0.000259 | 0.056667 | 218.563333 |
