# Support Vector Machines (SVM)

SVM works by mapping data to a high-dimensional feature space so that data points can be categorized, even when the data are not otherwise linearly separable. The data are transformed in such a way that a separator between the categories can be drawn as a hyperplane. The characteristics of new data can then be used to predict the group to which a new record belongs.
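
As a rough illustration of this idea, the sketch below uses made-up one-dimensional points (not the cell data) that no single threshold can separate. Adding a squared feature lifts them into two dimensions, where a straight line does the job. The arrays `x`, `labels` and `phi` are names chosen here for illustration only.

```python
import numpy as np

# Toy 1-D data (made up for illustration): class 1 lies outside [-1, 1], class 0 inside.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
labels = np.array([1, 1, 0, 0, 0, 1, 1])

# No single threshold on x separates the classes, because class 1 sits on both sides.
# Mapping each point to (x, x**2) lifts the data into 2-D.
phi = np.column_stack([x, x ** 2])

# In the lifted space the horizontal line x**2 = 1 separates the two classes.
predicted = (phi[:, 1] > 1.0).astype(int)
print(predicted)
```

An SVM with a kernel performs this kind of lifting implicitly rather than building the extra features by hand.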

In this notebook, we will use SVM to build and train a model using human cell records, and classify cells as benign or malignant.

In [None]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt

## Data Pre-processing

The dataset consists of human cell sample records, each of which contains the values of a set of cell characteristics. The fields in each record are:

|Field name|Description|
|--- |--- |
|ID|Identification|
|Clump|Clump thickness|
|UnifSize|Uniformity of cell size|
|UnifShape|Uniformity of cell shape|
|MargAdh|Marginal adhesion|
|SingEpiSize|Single epithelial cell size|
|BareNuc|Bare nuclei|
|BlandChrom|Bland chromatin|
|NormNucl|Normal nucleoli|
|Mit|Mitoses|
|Class|Benign or malignant|

In [None]:
df = pd.read_csv('cell_samples.csv')

In [None]:
df.head()

In [None]:
df.shape

The ID field contains the patient identifiers. The characteristics of the cell samples from each patient are contained in fields Clump to Mit. The values are graded from 1 to 10, with 1 being the closest to benign.

The Class field contains the diagnosis, as confirmed by separate medical procedures, classifying the samples as benign (value = 2) or malignant (value = 4).

Let's look at the distribution of the classes based on clump thickness and uniformity of cell size:

In [None]:
ax = df[df['Class']==4][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='DarkBlue', label='Malignant')
df[df['Class']==2][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='Yellow', label='Benign', ax=ax)
plt.show()

In [None]:
df.dtypes

It looks like the __BareNuc__ column includes some values that are not numerical. We can drop those rows:

In [None]:
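# Keep only the rows whose BareNuc value parses as a number.
# .copy() avoids a SettingWithCopyWarning on the assignment below.
df = df[pd.to_numeric(df['BareNuc'], errors='coerce').notnull()].copy()
df['BareNuc'] = df['BareNuc'].astype(int)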
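
To see what the filtering step does, here is a minimal sketch on a hypothetical toy column. The `'?'` entry is an assumed placeholder for a non-numeric value; the actual entries in `cell_samples.csv` may differ.

```python
import pandas as pd

# Hypothetical toy column: '?' stands in for a non-numeric entry.
toy = pd.DataFrame({'BareNuc': ['1', '10', '?', '3']})

# errors='coerce' turns unparseable entries into NaN instead of raising an error.
as_numbers = pd.to_numeric(toy['BareNuc'], errors='coerce')

# Keep only the rows that parsed, then convert the column to integers.
clean = toy[as_numbers.notnull()].copy()
clean['BareNuc'] = clean['BareNuc'].astype(int)
print(clean['BareNuc'].tolist())
```

The row holding `'?'` is dropped, and the remaining values become integers.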

In [None]:
df.dtypes

Now, let's get the feature matrix X and target vector y:

In [None]:
f_df = df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]
X = np.asarray(f_df)
y = np.asarray(df['Class'])

In [None]:
y[0:5]

Let's split our data into train and test datasets:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 4)
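
As a quick sanity check of what `test_size = 0.2` means, the sketch below splits a made-up array of 10 samples. The names `X_toy` and `y_toy` are illustrative and not part of the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 10 samples with 2 features each, and alternating labels 2/4.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([2, 4] * 5)

# 20% of the 10 samples go to the test set; random_state makes the shuffle repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=4)
print(X_tr.shape, X_te.shape)
```

With 10 samples, 8 end up in the training set and 2 in the test set.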

## Modelling with SVM

The SVM algorithm offers a choice of kernel functions. Mapping data into a higher-dimensional space is called kernelling. The mathematical function used for the transformation is known as the kernel function, and can be of different types, such as:

    1. Linear
    2. Polynomial
    3. Radial basis function (RBF)
    4. Sigmoid
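
The four kernel types listed above can be written as small functions of two vectors. This is a sketch of the usual textbook forms; the parameter names `gamma`, `coef0` and `degree` follow scikit-learn's naming, but the default values chosen here are assumptions for illustration and are not scikit-learn's defaults.

```python
import numpy as np

def linear_kernel(a, b):
    # Plain dot product.
    return np.dot(a, b)

def polynomial_kernel(a, b, degree=3, gamma=1.0, coef0=0.0):
    # (gamma * <a, b> + coef0) ** degree
    return (gamma * np.dot(a, b) + coef0) ** degree

def rbf_kernel(a, b, gamma=1.0):
    # exp(-gamma * ||a - b||^2): close to 1 for nearby points, close to 0 for distant ones.
    return np.exp(-gamma * np.sum((a - b) ** 2))

def sigmoid_kernel(a, b, gamma=1.0, coef0=0.0):
    # tanh(gamma * <a, b> + coef0)
    return np.tanh(gamma * np.dot(a, b) + coef0)

a = np.array([1.0, 2.0])
b = np.array([2.0, 0.0])
for kernel in (linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel):
    print(kernel.__name__, kernel(a, b))
```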

Each of these functions has its own characteristics, pros and cons. Because there's no easy way of knowing which function performs best on a given dataset, we usually try different functions in turn and compare the results. For this exercise, let's just use the default, RBF (Radial Basis Function).

In [None]:
from sklearn import svm

clf = svm.SVC(kernel = 'rbf')
clf.fit(X_train, y_train)
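
Trying the kernels in turn and comparing the results could look something like the sketch below. It uses synthetic data from `make_moons` as a stand-in for the cell samples, and 5-fold cross-validation as one possible way to compare; both are assumptions made for this example, and the scores say nothing about the cell dataset.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic two-class data, used here only as a stand-in for the cell samples.
X_demo, y_demo = make_moons(n_samples=200, noise=0.2, random_state=0)

# Fit one SVC per kernel and record its mean 5-fold cross-validated accuracy.
scores = {}
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    model = SVC(kernel=kernel, gamma='scale')
    scores[kernel] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

for kernel, score in scores.items():
    print(f'{kernel}: {score:.3f}')
```

The same loop could be run on `X_train` and `y_train` to choose a kernel before evaluating on the test set.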

Now that we have the fitted model, we can make predictions on the test set:

In [None]:
yhat = clf.predict(X_test)

In [None]:
yhat[0:5]

Let's estimate the accuracy of our model:

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import itertools

def plot_confusion_matrix(cm, classes, normalize = False, title = 'Confusion Matrix', cmap = plt.cm.Blues):
    """This function prints/plots the confusion matrix. Normalization can be applied by setting normalize = True"""
    
    if normalize:
        cm = cm.astype('float')/cm.sum(axis = 1)[:, np.newaxis]
        print('Normalized confusion matrix')
    else:
        print('Confusion matrix, without normalization')
        
    print(cm)
    
    plt.imshow(cm, interpolation = 'nearest', cmap = cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation = 45)
    plt.yticks(tick_marks, classes)
    
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max()/2.
    
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment = 'center',
                 color = 'white' if cm[i,j] > thresh else 'black')

    plt.tight_layout()
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
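
The `normalize = True` branch divides each row of the confusion matrix by that row's total, so every row becomes the distribution of predictions for one true class. A minimal sketch with made-up counts:

```python
import numpy as np

# Made-up counts: rows are true classes, columns are predicted classes.
cm = np.array([[8, 2],
               [1, 9]])

# Sum across each row and keep it as a column vector so the division broadcasts row by row.
row_totals = cm.sum(axis=1)[:, np.newaxis]
cm_normalized = cm.astype('float') / row_totals
print(cm_normalized)
```

After normalization each row sums to 1, and the diagonal holds the per-class recall.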

In [None]:
# compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels = [2,4])
np.set_printoptions(precision = 2)

# plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes = ['Benign(2)', 'Malignant(4)'], normalize = False, title = 'Confusion matrix')

In [None]:
print(classification_report(y_test, yhat))
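
The precision, recall and F1 columns of the report can be derived from confusion-matrix counts. The sketch below uses made-up counts for a single class treated as "positive"; the variable names are illustrative.

```python
# Made-up counts for one class treated as "positive".
tp, fp, fn = 9, 2, 1

# Precision: of everything predicted positive, the fraction that really is positive.
precision = tp / (tp + fp)

# Recall: of everything that really is positive, the fraction that was found.
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f'precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}')
```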

We can also calculate the F1 score using the sklearn library:

In [None]:
from sklearn.metrics import f1_score

f1_score(y_test, yhat, average = 'weighted')
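
With `average = 'weighted'`, the per-class F1 scores are averaged using each class's support (its number of true samples) as the weight. A minimal sketch with made-up per-class values, not results from this notebook:

```python
# Made-up per-class F1 scores and supports, keyed by class label (2 = benign, 4 = malignant).
per_class_f1 = {2: 0.9, 4: 0.8}
support = {2: 30, 4: 10}

# Support-weighted mean of the per-class scores.
total = sum(support.values())
weighted_f1 = sum(per_class_f1[c] * support[c] for c in per_class_f1) / total
print(weighted_f1)
```

The larger class pulls the weighted score toward its own F1.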

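Finally, let's calculate the Jaccard index for the malignant class:

In [None]:
# jaccard_similarity_score has been removed from current scikit-learn releases; jaccard_score replaces it.
# pos_label = 4 selects the malignant class, since the labels here are 2 and 4 rather than 0 and 1.
from sklearn.metrics import jaccard_score

jaccard_score(y_test, yhat, pos_label = 4)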
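
For a single class, the Jaccard index is the size of the intersection of the true and predicted sets divided by the size of their union, which works out to TP / (TP + FP + FN). A minimal sketch with made-up labels, treating 4 (malignant) as the positive class:

```python
# Made-up true and predicted labels (2 = benign, 4 = malignant).
y_true = [2, 2, 4, 4, 4, 2]
y_pred = [2, 4, 4, 4, 2, 2]
positive = 4

# Count true positives, false positives and false negatives for the positive class.
tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))

# Intersection over union of the true and predicted positive sets.
jaccard = tp / (tp + fp + fn)
print(jaccard)
```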