# SMS Spam Classification
***Reynaldo Vazquez***  
***November 16, 2017***

[GitHub Repository](https://github.com/reyvaz/SMS-Classification)

## Tests different classification algorithms to detect spam in SMS messages


Different specifications of Naïve-Bayes, Logistic-Regression, and Support Vector Machine models, with features of varying complexity, are tested to classify SMS (text) messages as spam or legitimate (ham).

The dataset contains 5,574 SMS messages in English, each tagged as ham or spam. It was downloaded from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset) on November 15, 2017, and can also be found at the [UCI](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) Machine Learning Repository. Acknowledgements to Tiago A. Almeida and José María Gómez Hidalgo, creators of the original dataset. More information can be found [here](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/).

Due to the nature of the problem, it is important to avoid misclassifying legitimate messages as spam; that is, letting some spam through as legitimate is preferable to the other way around. Thus, special attention is placed on the precision metric when determining the best model, although accuracy, recall, and area under the ROC curve are also considered.
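
To make the precision versus recall trade-off concrete, the following is a minimal sketch with two hypothetical filters. The counts are invented for illustration and do not come from the dataset; `precision_recall` is a helper name introduced here.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for the positive class, here spam."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical filters, each scored on 100 spam and 900 ham messages.
# Filter A never flags ham as spam but lets 10 spam messages through.
prec_a, rec_a = precision_recall(tp=90, fp=0, fn=10)
# Filter B catches every spam message but also flags 10 ham messages.
prec_b, rec_b = precision_recall(tp=100, fp=10, fn=0)

print(f"A: precision={prec_a:.3f}, recall={rec_a:.3f}")
print(f"B: precision={prec_b:.3f}, recall={rec_b:.3f}")
```

Under the priority stated above, filter A would be preferred: it loses no legitimate messages, at the price of some spam reaching the inbox.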

The best performing algorithms are a Naïve-Bayes specification (MultinomialNB) on the entire training dataset dictionary with alpha parameter 0.1, and a Support Vector Machine (SVM) specification on 2 to 5 character ngrams plus 3 additional features regarding type and length of character content, with C parameter 1000.

```
                           Param  Accuracy    Recall  Precision   ROC AUC
MultinomialNB        alpha = 0.1  0.992103  0.944162   1.000000  0.972081
SVC                     C = 1000  0.993539  0.959391   0.994737  0.979277
```

The Naïve-Bayes specification is slightly superior in terms of simplicity and speed, and has better precision on the test dataset, while the SVC specification performs slightly better in all of the other metrics.

### Importing the dataset

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('spam.csv', encoding='latin-1', header=0, 
                 names = ['target', 'text'], usecols = [0, 1])
pd.set_option('display.max_colwidth', 120)
print('\nDataset shape: ', df.shape)
df.head()


Dataset shape:  (5572, 2)


| | target | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std t... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


### Transform target variable to dummy

In [2]:
df['target'] = np.where(df['target']=='spam',1,0)
import random
# Draw 10 random row positions; range(len(df)) keeps row 0 eligible
sample = random.sample(range(len(df)), 10)
df.iloc[sample]

| | target | text |
|---|---|---|
| 2380 | 0 | If i let you do this, i want you in the house by 8am. |
| 1373 | 1 | Bears Pic Nick, and Tom, Pete and ... Dick. In fact, all types try gay chat with photo upload call 08718730666 (10p/... |
| 3057 | 1 | You are now unsubscribed all services. Get tons of sexy babes or hunks straight to your phone! go to http://gotbabes... |
| 2904 | 0 | Ha. You donÛ÷t know either. I did a a clever but simple thing with pears the other day, perfect for christmas. |
| 2126 | 0 | You do got a shitload of diamonds though |
| 2359 | 1 | Spook up your mob with a Halloween collection of a logo & pic message plus a free eerie tone, txt CARD SPOOK to 8007... |
| 4677 | 0 | It is a good thing I'm now getting the connection to bw |
| 383 | 0 | Hey i will be late ah... Meet you at 945+ |
| 5441 | 0 | By the way, make sure u get train to worc foregate street not shrub hill. Have fun night x |
| 210 | 0 | What's up bruv, hope you had a great break. Do have a rewarding semester. |


### Split dataset into training and testing

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'],
                                                    random_state=0)
print('Fraction Spam:', df['target'].mean())
print('\nX_train shape: ', X_train.shape)

Fraction Spam: 0.13406317300789664

X_train shape:  (4179,)


### Set up function that will calculate performance metrics and populate comparison matrix

In [4]:
from sklearn.metrics import (roc_auc_score, accuracy_score, recall_score, 
                             precision_score)

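def get_metrics(X_to_test, model, notes, y_test = y_test):
    '''
    Fits the model on the current training features, calculates performance
    metrics on the test set, and appends them to the model comparison matrix.
    Relies on the globals X_train_transformed, y_train, and metrics.
    '''
    model.fit(X_train_transformed, y_train)
    predictions = model.predict(X_to_test)
    # ROC AUC is computed from hard 0/1 predictions, not from scores
    auc  = roc_auc_score(y_test, predictions)
    acc  = accuracy_score(y_test, predictions)
    rec  = recall_score(y_test, predictions)
    prec = precision_score(y_test, predictions)
    # Class name without its parameter list, e.g. 'MultinomialNB'
    model_name = str(model)[0:str(model).index('(')]
    metrics.loc[len(metrics)] = [model_name, notes, acc, rec, prec, auc]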
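
Because `roc_auc_score` receives hard 0/1 predictions here, the ROC curve has a single operating point and the reported ROC AUC reduces to the average of the true positive rate and the true negative rate. The sketch below checks this against the MultinomialNB (alpha = 0.1) row of the summary table. The confusion counts are an assumption: they are not reported in the notebook, and were inferred from the reported metrics and a test set of 5572 - 4179 = 1393 messages. `rates_from_counts` is a helper name introduced for illustration.

```python
def rates_from_counts(tp, fp, fn, tn):
    """Return (true positive rate, true negative rate) from confusion counts."""
    tpr = tp / (tp + fn)  # recall on spam
    tnr = tn / (tn + fp)  # recall on ham (specificity)
    return tpr, tnr

# Assumed counts, inferred from the reported metrics (not stated in the source)
tp, fp, fn, tn = 186, 0, 11, 1196

tpr, tnr = rates_from_counts(tp, fp, fn, tn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
auc_hard = (tpr + tnr) / 2  # AUC of a ROC curve with one operating point

print(round(accuracy, 6), round(tpr, 6), round(auc_hard, 6))
```

A threshold-independent AUC would instead need continuous scores, for example from `predict_proba` or `decision_function`, passed to `roc_auc_score` in place of the predictions. That would change the reported ROC AUC values, so it is noted here only as an option.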

### Set up classification specifications to be tested
Naive-Bayes, Logistic Regression, and Support Vector Machine with varying parameters

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def test_models():
    for param in [0.01, 0.1]:
        notes = 'alpha = ' + str(param)
        get_metrics(X_test_transformed, MultinomialNB(alpha=param), notes)
    for param in [10, 100]:
        notes = 'C = ' + str(param)
        get_metrics(X_test_transformed, LogisticRegression(C = param), notes)
    for param in [1000, 10000]:
        notes = 'C = ' + str(param)
        get_metrics(X_test_transformed, SVC(C = param), notes)

### Set up performance comparison matrix

In [6]:
colnames = ['Model', 'Param', 'Accuracy', 'Recall', 'Precision', 'ROC AUC']
metrics =  pd.DataFrame(columns = colnames)

### Feature Specification 1:  Features built using CountVectorizer with default settings
#### i.e. features are all the words in the training set, and no extra features

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer().fit(X_train)
X_train_transformed = vect.transform(X_train)
X_test_transformed = vect.transform(X_test)

test_models()
metrics.set_index('Model')

| Model | Param | Accuracy | Recall | Precision | ROC AUC |
|---|---|---|---|---|---|
| MultinomialNB | alpha = 0.01 | 0.990668 | 0.93401 | 1.0 | 0.967005 |
| MultinomialNB | alpha = 0.1 | 0.992103 | 0.944162 | 1.0 | 0.972081 |
| LogisticRegression | C = 10 | 0.983489 | 0.888325 | 0.994318 | 0.943744 |
| LogisticRegression | C = 100 | 0.984207 | 0.893401 | 0.99435 | 0.946282 |
| SVC | C = 1000 | 0.979182 | 0.857868 | 0.994118 | 0.928516 |
| SVC | C = 10000 | 0.983489 | 0.898477 | 0.983333 | 0.947984 |


By all metrics, the best performing algorithm with **Feature Specification 1** is the **MultinomialNB with alpha = 0.1** (henceforth **baseline**). 
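
As a small illustration of what `CountVectorizer` does with its default settings, the sketch below fits it on two made-up messages (not taken from the dataset) and transforms a third. `train_docs` and `vect_demo` are names introduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages, for illustration only
train_docs = ["Free entry now", "Ok see you now"]

vect_demo = CountVectorizer().fit(train_docs)
# vocabulary_ maps each lowercased word to a column index (alphabetical order)
vocab = sorted(vect_demo.vocabulary_)

# Words unseen during fit ("call") are ignored; matching is case-insensitive
counts = vect_demo.transform(["free FREE call"]).toarray()[0]

print(vocab)
print(dict(zip(vocab, counts)))
```

Each column of the transformed matrix is the count of one training-set word, which is why the vectorizer is fit on `X_train` only and then applied to both sets.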

### Feature Specification 2:  Weighted features using term frequency–inverse document frequency, no extra features

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer().fit(X_train)
X_train_transformed = vect.transform(X_train)
X_test_transformed = vect.transform(X_test)

metrics =  pd.DataFrame(columns = colnames)
test_models()
metrics.set_index('Model')

| Model | Param | Accuracy | Recall | Precision | ROC AUC |
|---|---|---|---|---|---|
| MultinomialNB | alpha = 0.01 | 0.98636 | 0.903553 | 1.0 | 0.951777 |
| MultinomialNB | alpha = 0.1 | 0.985642 | 0.898477 | 1.0 | 0.949239 |
| LogisticRegression | C = 10 | 0.984207 | 0.893401 | 0.99435 | 0.946282 |
| LogisticRegression | C = 100 | 0.985642 | 0.913706 | 0.983607 | 0.955599 |
| SVC | C = 1000 | 0.970567 | 0.796954 | 0.993671 | 0.898059 |
| SVC | C = 10000 | 0.98636 | 0.923858 | 0.978495 | 0.960257 |


Best performing with **Feature Specification 2** are MultinomialNB with alpha = 0.01, which keeps perfect precision, and SVC with C = 10000, which has the highest recall and ROC AUC. Neither is superior to baseline.
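
For reference, the weighting behind this specification can be sketched by hand. The example assumes scikit-learn's default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`), which are the ones in effect above since no arguments are passed. The corpus statistics are hypothetical, and `smooth_idf` and `tfidf_l2` are helper names introduced here.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_l2(term_counts, doc_freqs, n_docs):
    """Weight raw counts by idf, then scale the vector to unit L2 norm."""
    raw = {t: c * smooth_idf(n_docs, doc_freqs[t]) for t, c in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

# Hypothetical corpus of 4 messages: "now" occurs in all of them, "prize" in one
doc_freqs = {"now": 4, "prize": 1}
weights = tfidf_l2({"now": 2, "prize": 1}, doc_freqs, n_docs=4)

print(weights)
```

A term that appears in every message gets an idf of exactly 1, while rarer terms are weighted up per occurrence, so a single "prize" ends up weighing almost as much as two occurrences of "now".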

### Feature Specification 3:
Same as Feature Specification 2, but ignoring terms that have a document frequency strictly lower than 3.

In [9]:
vect = TfidfVectorizer(min_df=3).fit(X_train)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
ft_names = vect.get_feature_names_out()
feature_names = np.array(ft_names)

X_train_transformed = vect.transform(X_train)
X_test_transformed  = vect.transform(X_test)

metrics =  pd.DataFrame(columns = colnames)
test_models()
metrics.set_index('Model')

| Model | Param | Accuracy | Recall | Precision | ROC AUC |
|---|---|---|---|---|---|
| MultinomialNB | alpha = 0.01 | 0.984925 | 0.893401 | 1.0 | 0.946701 |
| MultinomialNB | alpha = 0.1 | 0.983489 | 0.883249 | 1.0 | 0.941624 |
| LogisticRegression | C = 10 | 0.984925 | 0.903553 | 0.988889 | 0.950941 |
| LogisticRegression | C = 100 | 0.98636 | 0.918782 | 0.983696 | 0.958137 |
| SVC | C = 1000 | 0.985642 | 0.898477 | 1.0 | 0.949239 |
| SVC | C = 10000 | 0.982053 | 0.918782 | 0.952632 | 0.955628 |


With **Feature Specification 3**, recall is lower for the MultinomialNB models than under the previous specifications, while SVC with C = 1000 performs better than it did under Feature Specification 2. No algorithm is superior to baseline.
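
The effect of the `min_df` cutoff can be seen on a toy corpus. This is a minimal sketch with made-up messages and a threshold of 2 instead of the 3 used above, so that the filtering is visible on three documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up messages, for illustration only
docs = ["free prize now", "free call now", "see you now"]

# Without a cutoff, every word seen during fit becomes a feature
all_terms = sorted(TfidfVectorizer().fit(docs).vocabulary_)
# With min_df=2, words found in fewer than 2 documents are dropped
kept_terms = sorted(TfidfVectorizer(min_df=2).fit(docs).vocabulary_)

print(all_terms)
print(kept_terms)
```

Dropping rare terms shrinks the feature matrix, but it also discards words that may be distinctive of a handful of spam messages, which is one plausible reading of the lower MultinomialNB recall under this specification.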

### Explore New Features

#### Find indexes of ham and spam in training set for additional feature search

In [10]:
ham  = y_train == 0
ham  = [i for i, x in enumerate(ham) if x]
spam = y_train == 1
spam = [i for i, x in enumerate(spam) if x]

### Compare Lengths (Number of Characters) between Ham and Spam

In [11]:
doc_lengths = np.array([len(d) for d in X_train])
mean_len_ham = np.mean(doc_lengths[ham])
mean_len_spam = np.mean(doc_lengths[spam])
print('Average Length', '\n\nHam:  ', mean_len_ham, 'chars',
      '\nSpam: ', mean_len_spam, 'chars')

Average Length 

Ham:   70.34913199228437 chars 
Spam:  139.90727272727273 chars


#### On average, spam messages are much longer

### Compare Digit Counts between Ham and Spam

In [12]:
digits = X_train.str.findall(r'\d')  # raw string avoids an invalid-escape warning
dig_counts = np.array([len(d) for d in digits])
mean_dig_ham = np.mean(dig_counts[ham])
mean_dig_spam = np.mean(dig_counts[spam])
print('Average digit counts', '\n\nHam:  ', mean_dig_ham, 'digits',
      '\nSpam: ', mean_dig_spam, 'digits')

Average digit counts 

Ham:   0.2931937172774869 digits 
Spam:  15.841818181818182 digits


#### On average, spam messages have a lot more digits

### Compare Non-Alphanumeric Characters between Ham and Spam

In [13]:
non_alnum = X_train.str.findall(r'\W')  # \W also matches whitespace
non_alnum_counts = np.array([len(d) for d in non_alnum])
mean_nw_ham = np.mean(non_alnum_counts[ham])
mean_nw_spam = np.mean(non_alnum_counts[spam])
print('Average non-alnum char counts', '\n\nHam:  ', mean_nw_ham, 'non-alnum chars',
      '\nSpam: ', mean_nw_spam, 'non-alnum chars')

Average non-alnum char counts 

Ham:   17.158170294847064 non-alnum chars 
Spam:  29.325454545454544 non-alnum chars


#### On average, spam messages contain more non-alphanumeric characters

### Calculate the New Features for the Test Set

In [14]:
doc_lengths_test = np.array([len(d) for d in X_test])

digits = X_test.str.findall(r'\d')
dig_counts_test = np.array([len(d) for d in digits])

non_alnum = X_test.str.findall(r'\W')
non_alnum_counts_test = np.array([len(d) for d in non_alnum])
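
The same three counts are computed twice, once for the training set and once for the test set. As a sketch of how that duplication could be avoided, the helpers below compute all three for any collection of messages using only the standard library. `char_features` and `feature_columns` are hypothetical names that are not part of the original notebook.

```python
import re

def char_features(text):
    """Return (length, digit count, non-alphanumeric count) for one message."""
    n_digits = len(re.findall(r"\d", text))
    # \W matches anything that is not a letter, digit, or underscore,
    # so whitespace is counted as well
    n_non_alnum = len(re.findall(r"\W", text))
    return len(text), n_digits, n_non_alnum

def feature_columns(texts):
    """Return [lengths, digit_counts, non_alnum_counts], one list per feature."""
    rows = [char_features(t) for t in texts]
    return [list(col) for col in zip(*rows)]

cols = feature_columns(["Call 0800 now!", "ok"])
print(cols)
```

The returned list has the same layout as `new_features_train` and `new_features_test` used further down, so it could be computed in a single call per dataset.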

### Functions to transform original X_sets into sparse matrices with additional features

In [15]:
from scipy.sparse import csr_matrix, hstack
def add_features(X_sparse, new_features):
    """
    Returns sparse feature matrix with new feature added.
    new_features can be a feature or a list of features.
    """
    return hstack([X_sparse, csr_matrix(new_features).T], 'csr')

def transform_X(X, new_features):
    X_vectorized  = vect.transform(X)
    X_transformed = add_features(X_vectorized, new_features)
    return X_transformed
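
To see what the stacking in `add_features` produces, here is a small self-contained sketch on toy values (invented for illustration). Each extra feature arrives as one row per feature, so the transpose turns it into one column per feature before it is appended to the sparse matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy vectorized matrix: 2 messages, 3 vocabulary terms
X_sparse = csr_matrix(np.array([[1, 0, 2],
                                [0, 3, 0]]))

# Two extra features, each with one value per message
lengths = np.array([14, 2])
digits = np.array([4, 0])

# Shape (2 features, 2 messages) transposed to (2 messages, 2 features)
extra = csr_matrix([lengths, digits]).T
X_aug = hstack([X_sparse, extra], 'csr')

print(X_aug.shape)
print(X_aug.toarray())
```

The number of rows must match between the two blocks, which is why the extra features have to be computed on exactly the same messages, in the same order, as the vectorized matrix.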

### Feature Specification 4: 
* 2 to 5 character ngrams as features, built using CountVectorizer  
* Ignoring terms that have a document frequency strictly lower than 5  
* Add text length, digit count, and non-alphanumeric character count as features  

In [16]:
vect = CountVectorizer(min_df=5, ngram_range=(2,5), 
                       analyzer='char_wb').fit(X_train)

new_features_train = [doc_lengths, dig_counts, non_alnum_counts]
X_train_transformed = transform_X(X_train, new_features_train)

new_features_test = [doc_lengths_test, dig_counts_test, non_alnum_counts_test]
X_test_transformed = transform_X(X_test, new_features_test)

metrics =  pd.DataFrame(columns = colnames)
test_models()
metrics.set_index('Model')

| Model | Param | Accuracy | Recall | Precision | ROC AUC |
|---|---|---|---|---|---|
| MultinomialNB | alpha = 0.01 | 0.991385 | 0.964467 | 0.974359 | 0.980143 |
| MultinomialNB | alpha = 0.1 | 0.990668 | 0.969543 | 0.964646 | 0.981845 |
| LogisticRegression | C = 10 | 0.992103 | 0.954315 | 0.989474 | 0.976321 |
| LogisticRegression | C = 100 | 0.992821 | 0.959391 | 0.989529 | 0.978859 |
| SVC | C = 1000 | 0.993539 | 0.959391 | 0.994737 | 0.979277 |
| SVC | C = 10000 | 0.993539 | 0.959391 | 0.994737 | 0.979277 |


With **Feature Specification 4**, MultinomialNB 0.1's recall and ROC AUC improved compared to baseline, but at the cost of accuracy and, notably, precision. Both SVCs performed generally better under this feature specification than under the previous ones. The SVCs perform better than baseline in all metrics with the exception of precision, which is slightly lower.
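
For intuition on the character n-gram features, the sketch below inspects what the `char_wb` analyzer emits for a short made-up message, using a narrower `ngram_range` of 2 to 3 than the 2 to 5 used above to keep the output readable.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same analyzer type as in Feature Specification 4, with a narrower n-gram range
analyzer = CountVectorizer(analyzer='char_wb',
                           ngram_range=(2, 3)).build_analyzer()

# Text is lowercased and each word is padded with one space on both sides
grams_one_word = analyzer("Win")
grams_two_words = analyzer("win cash")

print(grams_one_word)
# n-grams are built inside each padded word, so none spans two words
print(any("n c" in g for g in grams_two_words))
```

Character n-grams can pick up fragments such as digit runs, currency symbols, or deliberately misspelled words that whole-word features miss, which may explain part of the recall gains under this specification.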

### Conclusion
The best performing algorithms are the **baseline** (MultinomialNB with alpha = 0.1 on Feature Specification 1) and the **Support Vector Machine** algorithms with **Feature Specification 4**, which uses character ngrams and additional features.