# EVALUATING CLASSIFIERS FOR SMS SPAM DETECTION
### Daniel Loden, May 2017

# Overview


----------
This document presents an evaluation of models for classifying SMS spam and ham messages based on message text. The following classifiers were assessed:

 - Naive Bayes;
 - random forest;
 - logistic regression (L1 and L2); and
 - support vector machine.

Multiple classifiers were trained and compared using 10-fold cross-validation. Of these, the support vector machine performed best, based on Cohen's kappa. When applied to the test data, the SVM achieved a kappa of 0.90 and F1-scores of 0.99 and 0.92 for the 'ham' (legitimate) and 'spam' (not legitimate) classes, respectively.

# Data Processing


----------
A dataset containing 5,572 labelled records was loaded and split into training and testing sets in a 70:30 ratio.

In [None]:
# Load packages
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.metrics import cohen_kappa_score, make_scorer, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Read data
data = pd.read_csv('../input/spam.csv',
                   encoding = 'ISO-8859-1')
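# Reorder columns by position (.ix has been removed from pandas; .iloc is the positional indexer)
data = data.iloc[:, [1, 0]]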
data.rename(columns={'v2':'text', 'v1':'ham_spam'}, inplace=True)

# Code spam flag
le = LabelEncoder()
y = le.fit_transform(data['ham_spam'])

# Create text variable
X = data['text']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=2008)
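
The label coding above matters later, because the spam proportion is computed as the mean of `y_train`. As a minimal sketch, assuming the labels are the strings 'ham' and 'spam', the following shows how `LabelEncoder` assigns codes in sorted order. The names `toy_labels`, `toy_le`, `toy_y` and `label_map` are hypothetical and are not part of the analysis.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels, standing in for data['ham_spam']
toy_labels = ['ham', 'spam', 'ham', 'ham']

toy_le = LabelEncoder()
toy_y = toy_le.fit_transform(toy_labels)

# classes_ is sorted, so code 0 is 'ham' and code 1 is 'spam'
label_map = {str(label): code for code, label in enumerate(toy_le.classes_)}
print(label_map)
print(toy_y.tolist())
```

With this coding, the mean of the encoded labels is the share of spam messages.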

# Comparison of Classifiers

----------

## Assessment Approach
Multiple classifiers were assessed to determine which one should be applied to the test dataset.  

Due to the imbalance in classes (see below), Cohen's kappa was used as the performance metric, to account for chance agreement between actual and predicted values.  10-fold cross-validation was used to estimate performance on new data.

In [None]:
print('Proportion of spam messages in the training data:', round(np.mean(y_train), 2))
kappa = make_scorer(cohen_kappa_score)
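
To illustrate why kappa was preferred over accuracy, here is a minimal sketch that computes Cohen's kappa by hand as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function `cohen_kappa_by_hand` and the label counts are hypothetical, chosen only for illustration; the analysis itself uses `cohen_kappa_score` from scikit-learn.

```python
def cohen_kappa_by_hand(y_true, y_pred):
    """Cohen's kappa for two equal-length lists of labels."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    # Observed agreement: share of positions where the labels match
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal frequencies, summed over labels
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical imbalanced labels: 90 ham (0) and 10 spam (1)
toy_true = [0] * 90 + [1] * 10

# A classifier that always predicts ham
always_ham = [0] * 100
always_ham_accuracy = sum(t == p for t, p in zip(toy_true, always_ham)) / len(toy_true)
always_ham_kappa = cohen_kappa_by_hand(toy_true, always_ham)

# A classifier that catches 8 of 10 spam messages and misflags 2 ham messages
better_pred = [0] * 88 + [1] * 2 + [1] * 8 + [0] * 2
better_kappa = cohen_kappa_by_hand(toy_true, better_pred)

print(always_ham_accuracy, always_ham_kappa)
print(round(better_kappa, 3))
```

The always-ham classifier reaches 90% accuracy but a kappa of zero, since its agreement with the true labels is no better than chance. The sketch does not handle the degenerate case where p_e equals 1.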

## Document-Term Matrix Specifications
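The specifications below were used to create document-term matrices: word-level tokens, English stop words removed, and a vocabulary limited to the 500 most frequent terms.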

In [None]:
count_vec = CountVectorizer(analyzer='word',
                            stop_words='english',
                            max_features=500)
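
As a rough illustration of what these settings produce, the sketch below builds a document-term matrix from three made-up messages. The messages and the names `toy_messages`, `toy_vec`, `toy_dtm` and `terms` are hypothetical, and `max_features` is reduced to 2 so that the result is small enough to read.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical messages, not taken from the dataset
toy_messages = ['Win a free prize now',
                'Are you free for lunch',
                'Claim your prize']

toy_vec = CountVectorizer(analyzer='word',
                          stop_words='english',
                          max_features=2)
toy_dtm = toy_vec.fit_transform(toy_messages)

# Terms in column order (vocabulary_ maps each term to its column index)
terms = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
print(terms)
print(toy_dtm.toarray())
```

Only 'free' and 'prize' survive, because each occurs twice across the messages while every other non-stop-word occurs once. Each row is a message and each column is a term count.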

## Naive Bayes
Naive Bayes was tried first, as it has been commonly used for spam detection.  This classifier performed fairly well.

In [None]:
nb = MultinomialNB()

nb_clf = Pipeline([('Count vectorizer', count_vec),
                   ('Naive Bayes', nb)])

print('Kappa (10-fold CV): ', 
      round(np.mean(cross_val_score(nb_clf, X_train, y_train, scoring=kappa, cv=10)), 3))
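
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds, so each fold keeps roughly the same class balance as the full training set. The sketch below shows this with `StratifiedKFold` on hypothetical labels (80 ham, 20 spam); the names `y_toy`, `X_toy`, `skf`, `fold_sizes` and `fold_spam_counts` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 80 ham (0) and 20 spam (1)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((100, 1))  # placeholder features; only the labels drive the split

skf = StratifiedKFold(n_splits=10)
fold_sizes = []
fold_spam_counts = []
for train_idx, test_idx in skf.split(X_toy, y_toy):
    fold_sizes.append(len(test_idx))
    fold_spam_counts.append(int(y_toy[test_idx].sum()))

print(fold_sizes)
print(fold_spam_counts)
```

Every held-out fold contains 10 messages, 2 of them spam, which matches the 20% spam share of the toy labels.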

## Random Forest
Random Forest did not perform as well as Naive Bayes.

In [None]:
rf = RandomForestClassifier()

rf_clf = Pipeline([('Count vectorizer', count_vec),
                   ('Random Forest', rf)])

print('Kappa (10-fold CV): ', 
      round(np.mean(cross_val_score(rf_clf, X_train, y_train, scoring=kappa, cv=10)), 3))

## Logistic Regression

### L1 Regularisation
Logistic regression with L1 regularisation did not perform as well as Naive Bayes.

In [None]:
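# The liblinear solver supports the L1 penalty (newer scikit-learn defaults do not)
lr = LogisticRegression(penalty='l1', solver='liblinear')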

lr_l1_clf = Pipeline([('Count vectorizer', count_vec),
                      ('Logistic regression', lr)])

print('Kappa (10-fold CV): ', 
      round(np.mean(cross_val_score(lr_l1_clf, X_train, y_train, scoring=kappa, cv=10)), 3))

### L2 Regularisation
However, logistic regression with L2 regularisation *did* perform better than Naive Bayes.

In [None]:
lr = LogisticRegression(penalty='l2')

lr_l2_clf = Pipeline([('Count vectorizer', count_vec),
                      ('Logistic regression', lr)])

print('Kappa (10-fold CV): ', 
      round(np.mean(cross_val_score(lr_l2_clf, X_train, y_train, scoring=kappa, cv=10)), 3))

## Support Vector Machine
The linear SVM performed better than all other classifiers and was chosen as the final model to be tested.

In [None]:
svm = SVC(kernel='linear')

svm_clf = Pipeline([('Count vectorizer', count_vec),
                    ('SVM', svm)])

print('Kappa (10-fold CV): ', 
      round(np.mean(cross_val_score(svm_clf, X_train, y_train, scoring=kappa, cv=10)), 3))

# Support Vector Machine Performance on the Test Set

----------

As expected, the SVM performed well on the test data, with strong F1-scores and Cohen's kappa.  

In [None]:
svm_clf.fit(X_train, y_train)
y_test_pred = svm_clf.predict(X_test)
print('Classification report: ')
print("----------------------")
print("")
print(classification_report(y_test, y_test_pred))
print("")
print("Cohen's kappa:")
print("--------------")
# Use a separate name so the `kappa` scorer defined earlier is not overwritten
test_kappa = cohen_kappa_score(y_test, y_test_pred)
print(round(test_kappa, 2))
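
For reference, the F1-score in the classification report is the harmonic mean of precision and recall for a given class. The following is a minimal sketch with hypothetical counts (they are not the results of this analysis), and the function name `f1_from_counts` is illustrative.

```python
def f1_from_counts(tp, fp, fn):
    """F1-score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for the spam class
toy_f1 = f1_from_counts(tp=90, fp=5, fn=10)
print(round(toy_f1, 3))  # 0.923
```

Because the score is computed per class, a model can have a high F1-score for the majority class and a noticeably lower one for the minority class.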