# SVM SPAM Filter

I recently learnt about SVMs, so I thought I would try out my new knowledge on a dataset they are well suited to. I chose a SPAM dataset because it's a binary classification problem and because the dataset is small (SVMs scale well with the number of features but not with the number of instances). This dataset also gave me an opportunity to look at vectorizing text.

In [None]:
import matplotlib.pyplot as plt
from matplotlib import colors
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


### Vectorizing Text Data

First we need to read in the data. Only the first two columns in the csv file are needed, so we can ignore all the others. Also, whilst the first row in the csv file gives headers, I preferred to use ones I found more descriptive, so I skip that row and supply my own names. Finally, I noticed that choosing the right encoding was important: here the file is read as latin-1.

In [None]:
data = pd.read_csv('/kaggle/input/sms-spam-collection-dataset/spam.csv', usecols=(0,1),
                   encoding='latin-1', names=["Label","Text"], skiprows = 1)
data.head()

The instances were already shuffled. To get a feel for the data available, we can count the instances of each label.

In [None]:
data["Label"].value_counts()

The two labels present were spam and ham, as expected. The next step is to split the data into a training set and a test set.

In [None]:
from sklearn.model_selection import train_test_split

# Create a training and testing dataset.
train, test = train_test_split(data,shuffle=True, test_size=0.2, random_state=42)

There are a lot more ham instances than spam. The data is shuffled before splitting, so there should be a decent number of each class in each dataset, but we can check quickly to make sure neither dataset is lacking.

In [None]:
# Compare ratio of SPAM instances to HAM instances
print('Percentage of Instances which are SPAM:')
print('Train: ',round(100.*len(train.loc[train["Label"]=="spam"])/len(train),2),'%')
print('Test: ',round(100.*len(test.loc[test["Label"]=="spam"])/len(test),2),'%')
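If the two percentages had turned out to be noticeably different, one option would be a stratified split. The sketch below is not what I used above; it runs on a made-up toy frame (`toy`, with invented messages) and only shows how the `stratify` argument of `train_test_split` keeps the label proportions matched.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the SMS data (hypothetical): 130 spam and 870 ham rows.
toy = pd.DataFrame({
    "Label": ["spam"] * 130 + ["ham"] * 870,
    "Text": [f"message {i}" for i in range(1000)],
})

# stratify= asks for the same label proportions in both splits.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Label"]
)

train_spam_pct = 100.0 * (toy_train["Label"] == "spam").mean()
test_spam_pct = 100.0 * (toy_test["Label"] == "spam").mean()
print(f"Train: {train_spam_pct:.2f}% spam")
print(f"Test:  {test_spam_pct:.2f}% spam")
```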

We can now start to think about vectorizing the text data. First of all, the spam and ham labels need to be converted into numeric values. I chose spam=1 (positive) and ham=0 (negative). Secondly, the text has to be converted into a numeric representation. There are several ways to do this, but for simplicity I chose to use CountVectorizer in scikit-learn. This builds a dictionary of the words found in the training set of sms messages. We can then use it to convert each message into a sparse array of the same length as the dictionary, in which the only non-zero elements correspond to words which occur in that message. Because we fit the tokenizer (CountVectorizer) using only the training set, it's possible that words will occur in the test set which are not in the dictionary. Such words are simply ignored when the test messages are transformed.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# First replace the label with a numeric value
mapping = {'spam': 1, 'ham': 0}
# map() converts each label directly and returns an integer Series
train_labels = train["Label"].map(mapping)
test_labels = test["Label"].map(mapping)

# Define a method to vectorize the text data, and fit it using the 
# training dataset.
vectorizer = CountVectorizer()
vectorizer.fit(train["Text"])

# Now convert all sms data to vector form
train_text = vectorizer.transform(train["Text"])
test_text = vectorizer.transform(test["Text"])
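To make the behaviour described above concrete, here is a small sketch on invented messages (the `toy_*` names and texts are mine, not from the dataset). It fits a CountVectorizer on three "training" messages and then transforms a "test" message containing a word the vectorizer has never seen.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages for illustration only.
toy_train_texts = ["free prize now", "see you at lunch", "free lunch"]
toy_test_texts = ["free pizza at lunch"]  # "pizza" is not in the training texts

toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(toy_train_texts)

# The dictionary: one entry (and one column) per distinct training word.
print(sorted(toy_vectorizer.vocabulary_))

# Each message becomes a row of word counts; unseen words get no column.
toy_vector = toy_vectorizer.transform(toy_test_texts)
print(toy_vector.toarray())
print("'pizza' in dictionary:", "pizza" in toy_vectorizer.vocabulary_)
```

The transformed row has one column per dictionary word, so the unseen word does not change its length; it just contributes nothing.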

### Training a Support Vector Machine

An SVM is a good option for this dataset, as there are lots of features (one per word which occurs in the training data) and comparatively fewer instances. Because of this, training should be fairly fast. We will first compare different kernel types: Linear, Polynomial, Sigmoid and Gaussian Radial Basis Function (RBF).

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

linear_classifier = SVC(kernel='linear',random_state = 42)
linear_scores = cross_val_score(linear_classifier,train_text,train_labels,cv=5)

poly_classifier = SVC(kernel='poly',random_state = 42)
poly_scores = cross_val_score(poly_classifier,train_text,train_labels,cv=5)

sigmoid_classifier = SVC(kernel='sigmoid',random_state = 42)
sigmoid_scores = cross_val_score(sigmoid_classifier,train_text,train_labels,cv=5)

rbf_classifier = SVC(kernel='rbf',random_state = 42)
rbf_scores = cross_val_score(rbf_classifier,train_text,train_labels,cv=5)

print('Kernel\tMean Score')
print('Linear: ',round(100*np.mean(linear_scores),2),'%')
print('Polynomial: ',round(100*np.mean(poly_scores),2),'%')
print('Sigmoid: ',round(100*np.mean(sigmoid_scores),2),'%')
print('RBF: ',round(100*np.mean(rbf_scores),2),'%')

Those scores are already pretty good (as a baseline, roughly 13% of the instances are spam, so a model which simply predicted that every instance was ham would be right 87% of the time). This suggests an SVM is well suited to this classification task. The linear and RBF kernels look the most promising, so we will use Grid Search to tune the hyperparameters of these.
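As an aside, the "always predict ham" baseline can be computed rather than estimated. This is a minimal sketch using scikit-learn's `DummyClassifier` on made-up labels with 13% positives (the `X_toy` and `y_toy` arrays are placeholders, not the real data); on the real data one would pass `train_text` and `train_labels` instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 13 spam (1) and 87 ham (0).
y_toy = np.array([1] * 13 + [0] * 87)
# DummyClassifier ignores the features, so a column of zeros is enough.
X_toy = np.zeros((len(y_toy), 1))

# Always predicts the most frequent class seen during fit (ham here).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print(f"Baseline accuracy: {100 * baseline_accuracy:.2f}%")
```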

In [None]:
from sklearn.model_selection import GridSearchCV

classifier = SVC(random_state = 26)
param_grid = [{'kernel': ['linear','rbf'], 'C':[0.5,0.75,1.0,1.5,2.0], 'gamma': ['auto','scale']}]

grid_search = GridSearchCV(classifier, param_grid, cv=5, scoring="accuracy", return_train_score=True)
grid_search.fit(train_text, train_labels)

curves = grid_search.cv_results_
print('Highest Score: ', round(100.*max(curves["mean_test_score"]),2), '%')
print('Corresponding Parameters: ', curves["params"][np.argmax(curves["mean_test_score"])])

The default parameters seem to be the best here. Whilst accuracy is the final scoring metric we will use, we can also check the model's generalisation by looking at the ROC curve, and the area underneath it.
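For anyone unfamiliar with the ROC curve: it is traced out by sweeping a decision threshold over the model's scores and recording the false and true positive rates at each step. The sketch below uses four hand-made labels and scores (invented for illustration, prefixed `toy_` so they don't clash with the real variables) to show what `roc_curve` and `roc_auc_score` return.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hand-made example: two ham (0) and two spam (1) messages with model scores.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# One (FPR, TPR) point per threshold.
toy_fpr, toy_tpr, toy_thresholds = roc_curve(y_true, y_score)
for f, t, thr in zip(toy_fpr, toy_tpr, toy_thresholds):
    print(f"threshold={thr:.2f}  FPR={f:.2f}  TPR={t:.2f}")

# AUC is the fraction of (spam, ham) pairs in which spam has the higher score.
toy_auc = roc_auc_score(y_true, y_score)
print("AUC:", toy_auc)
```

Here three of the four spam/ham pairs are ranked correctly, so the AUC is 0.75; a perfect ranking would give 1.0 and a random one about 0.5.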

In [None]:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

classifier = SVC(random_state=34, kernel='linear', probability=True)
spam_prob = cross_val_predict(classifier, train_text, train_labels, cv=3, method="predict_proba" )
spam_score = spam_prob[:,1] # Probability text is spam
fpr,tpr, thresholds = roc_curve(train_labels,spam_score)

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)
ax.plot([0,1],[0,1], color='black', ls='dashed', label='Random Baseline')
ax.plot(fpr, tpr, color='mediumvioletred', label='Linear SVM')
ax.set_xlabel('False Positive Rate',fontsize=12); ax.set_ylabel("True Positive Rate",fontsize=12)
ax.set_xlim(0,1); ax.set_ylim(0,1)
plt.legend(frameon=False)
plt.show()

print('AUC Score: ',round(roc_auc_score(train_labels,spam_score),5))

Again, that's a fairly good score. We can take a closer look at the precision, recall and F1 scores as well.

In [None]:
from sklearn.metrics import classification_report

spam_pred = cross_val_predict(classifier, train_text, train_labels, cv=3, method="predict" )
print(classification_report(train_labels,spam_pred))

In general, the precision, recall and F1 scores are all fairly high for this model. However, in order to see how well it really does, we should evaluate it on the test dataset.

In [None]:
# Evaluate model on test data
classifier.fit(train_text, train_labels)
predictions = classifier.predict(test_text)
correct = test_labels==predictions

print('Accuracy: ', round(100.*np.sum(correct)/len(correct),2),'%')

In [None]:
print(classification_report(test_labels,predictions))

The accuracy, precision, recall and F1 scores are not much lower on the test data, which implies the model generalises well. Overall, an SVM seems to be a good model for this problem, and the linear kernel works well. To improve this score, it's probably worth looking more at vectorizing the text (I was really more interested in testing out SVM classifiers).
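As one possible starting point for that, here is a sketch which swaps CountVectorizer for `TfidfVectorizer` and wraps it with the linear SVM in a scikit-learn pipeline. I have not tried this on the SMS data, so I make no claim that it improves the score; the messages and labels below are invented purely so that the snippet runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented messages: 1 = spam, 0 = ham.
toy_texts = [
    "win a free prize now",
    "free cash prize claim now",
    "claim your free ringtone",
    "are we still on for lunch",
    "see you at home later",
    "call me when you get home",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# The pipeline fits the vectorizer and the classifier together, so inside
# cross-validation the dictionary is rebuilt from each training fold only.
tfidf_svm = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", random_state=42))
tfidf_svm.fit(toy_texts, toy_labels)

toy_predictions = tfidf_svm.predict(toy_texts)
print(toy_predictions)
```

On the real data the same pipeline could be passed to `cross_val_score` with `train["Text"]` and `train_labels` in place of the toy lists.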