# Support vector machines

In this section, we discuss the support vector machine (SVM), an approach for classification that was developed in the computer science community in the 1990s and that has grown in popularity since then. SVMs have been shown to perform well in a variety of settings, and are often considered one of the best "out of the box" classifiers.

- [Lab: 9.6.1 Support Vector Classifier](#9.6.1-Support-Vector-Classifier)
- [Lab: 9.6.2 Support Vector Machine](#9.6.2-Support-Vector-Machine)
- [Lab: 9.6.3 ROC Curves](#9.6.3-ROC-Curves)
- [Lab: 9.6.4 SVM with Multiple Classes](#9.6.4-SVM-with-Multiple-Classes)
- [Lab: 9.6.5 Application to Gene Expression Data](#9.6.5-Application-to-Gene-Expression-Data)

In [None]:
# %load ../standard_import.txt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import label_binarize
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC, LinearSVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, roc_curve, auc, classification_report

%matplotlib inline

## Maximum margin classifier

In this section, we define a hyperplane and introduce the concept of an optimal separating hyperplane.

In a $D$-dimensional space, a hyperplane is a flat affine subspace of dimension $D - 1$. For instance, in two dimensions, a hyperplane is a flat one-dimensional subspace, in other words, a line. In three dimensions, a hyperplane is a flat two-dimensional subspace, that is, a plane. In $D > 3$ dimensions, it can be hard to visualize a hyperplane, but the notion of a $(D - 1)$-dimensional flat subspace still applies.

The mathematical definition of a hyperplane is quite simple. In two dimensions, a hyperplane is defined by the equation,

$$
\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0
$$

where $\beta_0, \beta_1, \beta_2$ are parameters. When we say that the above equation “defines” the hyperplane, we mean that any $\mathbf{x} = [x_1, x_2]^\top$ that satisfies the equation is a point on the hyperplane. Note that the equation above is simply the equation of a line, since in two dimensions a hyperplane is a line. In the $D$-dimensional case, the equation defining a hyperplane is

$$
\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_D x_D = 0
$$

where $\beta_0, \beta_1, \beta_2, \ldots, \beta_D$ are parameters. Again, any $\mathbf{x} = [x_1, x_2, \ldots, x_D]^\top$ that satisfies this equation is a point on the hyperplane. If $\mathbf{x}$ does not satisfy the equation, e.g.

$$
\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_D x_D > 0
$$

then $\mathbf{x} $ is on one side of the hyperplane, and if

$$
\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_D x_D < 0
$$

then $\mathbf{x}$ is on the other side of the hyperplane. So we can think of the hyperplane as dividing $D$-dimensional space into two halves. One can easily determine on which side of the hyperplane a point lies by calculating the sign of the left-hand side of the hyperplane equation. The following code shades the two half-spaces of the hyperplane $1 + 2 x_1 + 3 x_2 = 0$ in two dimensions.

In [None]:
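# Plot the two half-spaces of the hyperplane 1 + 2*x1 + 3*x2 = 0
x = np.arange(-1.5, 1.51, 0.01)
y = np.arange(-1.5, 1.51, 0.01)

X, Y = np.meshgrid(x, y)
zz = np.array([X.ravel(), Y.ravel()]).T  # grid points, one (x1, x2) pair per row
Z = zz[:, 0] * 2 + zz[:, 1] * 3 + 1  # left-hand side of the hyperplane equation
Z = np.where(Z > 0, 1, -1)  # keep only the sign: the side each grid point lies on
Z = Z.reshape(X.shape)
plt.contourf(X, Y, Z, cmap=plt.cm.Paired, alpha=0.2)
plt.show()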
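
As a small numeric companion to the plot, the sketch below evaluates the sign of the left-hand side for a few points, using the same illustrative coefficients ($\beta_0 = 1$, $\beta_1 = 2$, $\beta_2 = 3$). The helper name `hyperplane_side` and the sample points are assumptions made for this example.

```python
import numpy as np


def hyperplane_side(points, beta0, beta):
    """Return +1, -1 or 0 for each row of `points`: the sign of beta0 + points @ beta."""
    values = beta0 + np.asarray(points, dtype=float) @ np.asarray(beta, dtype=float)
    return np.sign(values).astype(int)


# Same coefficients as the plotted hyperplane: 1 + 2*x1 + 3*x2 = 0
points = [[1.0, 1.0], [-1.0, -1.0], [-0.5, 0.0]]
sides = hyperplane_side(points, 1.0, [2.0, 3.0])
print(sides)  # [ 1 -1  0]
```

A value of 0 means the point lies exactly on the hyperplane.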

### Classification Using a Separating Hyperplane

Suppose we have a set of observations $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ in $D$-dimensional space

$$
\mathbf{x}_1 = \begin{bmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{1D} \end{bmatrix}, \ldots, \mathbf{x}_N = \begin{bmatrix} x_{N1} \\ x_{N2} \\ \vdots \\ x_{ND} \end{bmatrix},
$$

and that these observations fall into two classes, $y_1, \ldots, y_N \in \{-1, 1\}$, where $1$ represents one class and $-1$ the other class. We also have a test observation $\mathbf{x}^* = [x^*_1, \ldots, x^*_D]^\top$ that we would like to classify. We can do so by finding a hyperplane that separates the two classes. Such a hyperplane is known as a separating hyperplane.

We can label the observations from one class (say, the blue class) as $y_i = 1$ and those from the other (the purple class) as $y_i = -1$. Then a separating hyperplane has the property that

$$
\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_D x_{iD} > 0, \text{ if } y_i = 1
$$

and

$$
\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_D x_{iD} < 0, \text{ if } y_i = -1.
$$

Equivalently, we can write this as

$$
y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_D x_{iD}) > 0, \text{ for } i = 1, \ldots, N.
$$

If a separating hyperplane exists, we can use it to construct a very natural classifier: a test observation is assigned a class depending on which side of the hyperplane it is located. 
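
The separating condition and the resulting classifier can be sketched in a few lines. The toy data, the candidate hyperplane $x_1 + x_2 = 0$, and the `_demo` names below are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical, linearly separable toy data (not from the text)
X_demo = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, 0.0]])
y_demo = np.array([1, 1, -1, -1])

# Candidate hyperplane x1 + x2 = 0, i.e. beta0 = 0 and beta = (1, 1)
beta0_demo, beta_demo = 0.0, np.array([1.0, 1.0])

# Separating condition: y_i * f(x_i) > 0 for every training observation
f_train = beta0_demo + X_demo @ beta_demo
is_separating = bool(np.all(y_demo * f_train > 0))

# Classify a test observation by the sign of f(x*)
x_star = np.array([0.5, 1.0])
y_star = 1 if beta0_demo + x_star @ beta_demo > 0 else -1
print(is_separating, y_star)  # True 1
```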

### The Maximum Margin Classifier

In general, if the data can be perfectly separated using a hyperplane, then there will in fact exist an infinite number of such hyperplanes. This is because a given separating hyperplane can usually be shifted a tiny bit up or down, or rotated, without coming into contact with any of the observations.

A natural choice is the _maximal margin hyperplane_ (also known as the optimal separating hyperplane), which is the separating hyperplane that is farthest from the training observations. That is, we can compute the (perpendicular) distance from each training observation to a given separating hyperplane; the smallest such distance is the minimal distance from the observations to the hyperplane, and is known as the margin. The maximal margin hyperplane is the separating hyperplane for which the margin is largest, that is, the hyperplane whose minimum distance to the training observations is as large as possible. We can then classify a test observation based on which side of the maximal margin hyperplane it lies. This is known as the maximal margin classifier. We hope that a classifier that has a large margin on the training data will also have a large margin on the test data, and hence will classify the test observations correctly.
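
To make the definition of the margin concrete, the sketch below computes the perpendicular distance $|\beta_0 + \boldsymbol{\beta}^\top \mathbf{x}_i| / \lVert \boldsymbol{\beta} \rVert$ for each observation and takes the minimum. The function name `margin`, the toy data, and the two candidate hyperplanes are assumptions made for this example.

```python
import numpy as np


def margin(X_pts, beta0, beta):
    """Smallest perpendicular distance from the rows of X_pts to the hyperplane."""
    beta = np.asarray(beta, dtype=float)
    distances = np.abs(beta0 + X_pts @ beta) / np.linalg.norm(beta)
    return distances.min()


# Hypothetical toy data; with labels (1, 1, -1, -1) both hyperplanes below separate it
X_demo = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, 0.0]])

margin_a = margin(X_demo, 0.0, [1.0, 1.0])   # hyperplane x1 + x2 = 0
margin_b = margin(X_demo, -1.0, [1.0, 1.0])  # hyperplane x1 + x2 = 1
print(margin_a, margin_b)
```

Of these two candidates, the second has the larger margin, so the maximal margin criterion would prefer it.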


### Construction of maximum margin classifier

```{math}
:label: eq:max-margin-classifier
\begin{aligned}
& \max_{\beta_0, \beta_1, \ldots, \beta_D, M} M \\
& \text{subject to } \sum_{j=1}^D \beta_j^2 = 1 ,\\
& y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_D x_{iD}) \geq M, \text{ for } i = 1, \ldots, N.
\end{aligned}
```

where $M$ represents the margin of our hyperplane, and the optimization problem chooses $\beta_0, \beta_1, \ldots, \beta_D$ to maximize $M$. Under the constraint $\sum_{j=1}^D \beta_j^2 = 1$, the quantity $y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_D x_{iD})$ is the perpendicular distance from the $i$-th observation to the hyperplane, so the constraints require every observation to lie on the correct side at a distance of at least $M$. This is exactly the definition of the maximal margin hyperplane! The problem {eq}`eq:max-margin-classifier` can be solved efficiently via quadratic programming, but details of this optimization are outside of the scope of this course.
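
scikit-learn does not expose this hard-margin problem directly. One common approximation, used here as an assumption rather than as the method of the text, is a linear `SVC` with a very large `C` fitted to separable data. libsvm solves an equivalent rescaled problem, so the margin is recovered as the smallest distance $|\mathbf{w}^\top \mathbf{x}_i + b| / \lVert \mathbf{w} \rVert$. The toy data below is hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical separable data: two small clusters
X_sep = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]
)
y_sep = np.array([-1, -1, -1, 1, 1, 1])

# A very large C leaves essentially no room for margin violations
hard_margin = SVC(kernel="linear", C=1e6).fit(X_sep, y_sep)

w = hard_margin.coef_.ravel()
b = hard_margin.intercept_[0]
distances = np.abs(X_sep @ w + b) / np.linalg.norm(w)
M_hat = distances.min()  # estimated margin of the fitted hyperplane
print(M_hat, hard_margin.support_vectors_)
```

For this data the closest points of the two clusters are $(0.5, 0.5)$ and $(3, 3)$, so the margin should be close to half their distance, $2.5 / \sqrt{2}$.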

In [None]:
zz = np.array([X.ravel(), Y.ravel()]).T

In [None]:
Z = zz[:, 0] * 2 + zz[:, 1] * 3 + 0.3

In [None]:
Z

## LAB

### 9.6.1 Support Vector Classifier

Define a function to plot a classifier with support vectors.

In [None]:
def plot_svc(svc, X, y, h=0.02, pad=0.25):
    """Plot the decision regions of a fitted SVC on 2-D data and mark its support vectors."""
    x_min, x_max = X[:, 0].min() - pad, X[:, 0].max() + pad
    y_min, y_max = X[:, 1].min() - pad, X[:, 1].max() + pad
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the class at every grid point to shade the decision regions
    Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.2)

    plt.scatter(X[:, 0], X[:, 1], s=70, c=y, cmap=plt.cm.Paired)
    # Support vectors are marked with crosses
    sv = svc.support_vectors_
    plt.scatter(sv[:, 0], sv[:, 1], c="k", marker="x", s=100, alpha=0.5)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xlabel("X1")
    plt.ylabel("X2")
    plt.show()
    print("Number of support vectors: ", svc.support_.size)

In [None]:
# Generate random data: 20 observations of 2 features, divided into two classes.
np.random.seed(5)
X = np.random.randn(20, 2)
y = np.repeat([1, -1], 10)

X[y == -1] = X[y == -1] + 1
plt.scatter(X[:, 0], X[:, 1], s=70, c=y, cmap=plt.cm.Paired)
plt.xlabel("X1")
plt.ylabel("X2")

In [None]:
y

In [None]:
# Support Vector Classifier with linear kernel.
svc = SVC(C=1.0, kernel="linear")
svc.fit(X, y)

plot_svc(svc, X, y)

In [None]:
# When using a smaller cost parameter (C=0.1) the margin is wider, resulting in more support vectors.
svc2 = SVC(C=0.1, kernel="linear")
svc2.fit(X, y)
plot_svc(svc2, X, y)
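
The effect of `C` on the number of support vectors can also be checked numerically. The sketch below uses hypothetical overlapping data and an arbitrary set of `C` values; the exact counts depend on the random draw, so only the general trend should be read from it.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical overlapping two-class data, 20 observations per class
X_demo = np.vstack(
    [rng.normal(0.0, 1.0, size=(20, 2)), rng.normal(1.0, 1.0, size=(20, 2))]
)
y_demo = np.repeat([1, -1], 20)

n_support = {}
for C in [0.001, 0.1, 1.0, 100.0]:
    model = SVC(C=C, kernel="linear").fit(X_demo, y_demo)
    # n_support_ holds the number of support vectors per class
    n_support[C] = int(model.n_support_.sum())
print(n_support)
```

With a very small `C` the margin is so wide that nearly every observation falls inside it and becomes a support vector.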

In [None]:
# Select the optimal C parameter by cross-validation
tuned_parameters = [{"C": [0.001, 0.01, 0.1, 1, 5, 10, 100]}]
clf = GridSearchCV(
    SVC(kernel="linear"),
    tuned_parameters,
    cv=10,
    scoring="accuracy",
    return_train_score=True,
)
clf.fit(X, y)
clf.cv_results_

In [None]:
# 0.001 is best according to GridSearchCV.
clf.best_params_
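
The raw `cv_results_` dictionary is hard to read. A minimal sketch of one way to summarize it, assuming hypothetical data and a shorter grid than the one above, is to load it into a DataFrame and sort by rank.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Hypothetical two-class data, 30 observations per class
X_demo = np.vstack(
    [rng.normal(0.0, 1.0, size=(30, 2)), rng.normal(1.5, 1.0, size=(30, 2))]
)
y_demo = np.repeat([1, -1], 30)

grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="linear"), grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

# One row per candidate value of C, best-ranked first
summary = pd.DataFrame(search.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary)
print(search.best_params_)
```

When several values of `C` reach the same mean score, they share rank 1 in this table, so it is worth inspecting the table rather than relying on `best_params_` alone.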

In [None]:
# Generating test data
np.random.seed(1)
X_test = np.random.randn(20, 2)
y_test = np.random.choice([-1, 1], 20)
X_test[y_test == 1] = X_test[y_test == 1] - 1

plt.scatter(X_test[:, 0], X_test[:, 1], s=70, c=y_test, cmap=plt.cm.Paired)
plt.xlabel("X1")
plt.ylabel("X2")

In [None]:
# svc2 : C = 0.1
y_pred = svc2.predict(X_test)
pd.DataFrame(confusion_matrix(y_test, y_pred), index=svc2.classes_, columns=svc2.classes_)

In [None]:
svc3 = SVC(C=0.001, kernel="linear")
svc3.fit(X, y)

# svc3 : C = 0.001
y_pred = svc3.predict(X_test)
pd.DataFrame(
    confusion_matrix(y_test, y_pred), index=svc3.classes_, columns=svc3.classes_
)
# The misclassification is the same

In [None]:
# Changing the test data so that the classes are really separable with a hyperplane.
X_test[y_test == 1] = X_test[y_test == 1] - 1
plt.scatter(X_test[:, 0], X_test[:, 1], s=70, c=y_test, cmap=plt.cm.Paired)
plt.xlabel("X1")
plt.ylabel("X2")

In [None]:
svc4 = SVC(C=10.0, kernel="linear")
svc4.fit(X_test, y_test)

In [None]:
plot_svc(svc4, X_test, y_test)

In [None]:
# Increase the margin. Now there is one misclassification: increased bias, lower variance.
svc5 = SVC(C=1, kernel="linear")
svc5.fit(X_test, y_test)

In [None]:
plot_svc(svc5, X_test, y_test)

### 9.6.2 Support Vector Machine 

In [None]:
# Generate data with a non-linear class boundary
np.random.seed(8)
X = np.random.randn(200, 2)
X[:100] = X[:100] + 2
X[100:150] = X[100:150] - 2  # rows 100-149 (zero-based), so all 150 class -1 rows are shifted
y = np.concatenate([np.repeat(-1, 150), np.repeat(1, 50)])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=2)

plt.scatter(X[:, 0], X[:, 1], s=70, c=y, cmap=plt.cm.Paired)
plt.xlabel("X1")
plt.ylabel("X2");

In [None]:
svm = SVC(C=1.0, kernel="rbf", gamma=1)
svm.fit(X_train, y_train)

In [None]:
plot_svc(svm, X_train, y_train)

In [None]:
# Increasing C parameter, allowing more flexibility
svm2 = SVC(C=100, kernel="rbf", gamma=1.0)
svm2.fit(X_train, y_train)

In [None]:
plot_svc(svm2, X_train, y_train)
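
To see what the fitted RBF models compute, the sketch below rebuilds the decision function by hand as $f(\mathbf{x}) = \sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b$ with $K(\mathbf{x}_i, \mathbf{x}) = \exp(-\gamma \lVert \mathbf{x}_i - \mathbf{x} \rVert^2)$, where the sum runs over the support vectors. The data and the `_demo` names are hypothetical; `dual_coef_`, `support_vectors_` and `intercept_` are attributes of a fitted binary `SVC`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(60, 2))
# Hypothetical non-linear labels: +1 inside the unit circle, -1 outside
y_demo = np.where((X_demo**2).sum(axis=1) < 1.0, 1, -1)

gamma = 1.0
model = SVC(C=1.0, kernel="rbf", gamma=gamma).fit(X_demo, y_demo)

X_new = rng.normal(size=(5, 2))
# Kernel values between each support vector and each new point: shape (n_SV, 5)
K = rbf_kernel(model.support_vectors_, X_new, gamma=gamma)
# dual_coef_ stores alpha_i * y_i for the support vectors
manual = (model.dual_coef_ @ K + model.intercept_).ravel()
print(np.allclose(manual, model.decision_function(X_new)))
```

A larger `gamma` makes the kernel decay faster with distance, so each support vector influences a smaller neighborhood and the boundary becomes more flexible.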

In [None]:
# Set the parameters by cross-validation
tuned_parameters = [{"C": [0.01, 0.1, 1, 10, 100], "gamma": [0.5, 1, 2, 3, 4]}]
clf = GridSearchCV(
    SVC(kernel="rbf"),
    tuned_parameters,
    cv=10,
    scoring="accuracy",
    return_train_score=True,
)
clf.fit(X_train, y_train)
clf.cv_results_

In [None]:
clf.best_params_

In [None]:
confusion_matrix(y_test, clf.best_estimator_.predict(X_test))

In [None]:
# 15% of test observations misclassified
clf.best_estimator_.score(X_test, y_test)

### 9.6.3 ROC Curves

This section compares the ROC curves of two models on the training and test data. One model ($\gamma = 50$) is more flexible than the other ($\gamma = 2$).

In [None]:
svm3 = SVC(C=1, kernel="rbf", gamma=2)
svm3.fit(X_train, y_train)

In [None]:
# More flexible model
svm4 = SVC(C=1, kernel="rbf", gamma=50)
svm4.fit(X_train, y_train)

In [None]:
y_train_score3 = svm3.decision_function(X_train)
y_train_score4 = svm4.decision_function(X_train)

false_pos_rate3, true_pos_rate3, _ = roc_curve(y_train, y_train_score3)
roc_auc3 = auc(false_pos_rate3, true_pos_rate3)

false_pos_rate4, true_pos_rate4, _ = roc_curve(y_train, y_train_score4)
roc_auc4 = auc(false_pos_rate4, true_pos_rate4)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
ax1.plot(
    false_pos_rate3,
    true_pos_rate3,
    label=r"SVM $\gamma = 2$ ROC curve (area = %0.2f)" % roc_auc3,
    color="b",
)
ax1.plot(
    false_pos_rate4,
    true_pos_rate4,
    label=r"SVM $\gamma = 50$ ROC curve (area = %0.2f)" % roc_auc4,
    color="r",
)
ax1.set_title("Training Data")

y_test_score3 = svm3.decision_function(X_test)
y_test_score4 = svm4.decision_function(X_test)

false_pos_rate3, true_pos_rate3, _ = roc_curve(y_test, y_test_score3)
roc_auc3 = auc(false_pos_rate3, true_pos_rate3)

false_pos_rate4, true_pos_rate4, _ = roc_curve(y_test, y_test_score4)
roc_auc4 = auc(false_pos_rate4, true_pos_rate4)

ax2.plot(
    false_pos_rate3,
    true_pos_rate3,
    label=r"SVM $\gamma = 2$ ROC curve (area = %0.2f)" % roc_auc3,
    color="b",
)
ax2.plot(
    false_pos_rate4,
    true_pos_rate4,
    label=r"SVM $\gamma = 50$ ROC curve (area = %0.2f)" % roc_auc4,
    color="r",
)
ax2.set_title("Test Data")

for ax in fig.axes:
    ax.plot([0, 1], [0, 1], "k--")
    ax.set_xlim([-0.05, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.legend(loc="lower right")

As expected, the more flexible model scores better on training data but worse on the test data.
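
For intuition about what the area under the curve measures, here is a tiny hand-checkable sketch with hypothetical labels and scores. Of the four (positive, negative) pairs, three are ranked correctly by the score, which gives an area of 0.75.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Hypothetical labels and decision scores; higher scores should indicate class 1
y_true_demo = np.array([-1, -1, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve sweeps a threshold over the scores and records (FPR, TPR) pairs
fpr, tpr, thresholds = roc_curve(y_true_demo, scores_demo)
area = auc(fpr, tpr)
print(area)  # 0.75
```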

### 9.6.4 SVM with Multiple Classes

In [None]:
# Adding a third class of observations
np.random.seed(8)
XX = np.vstack([X, np.random.randn(50, 2)])
yy = np.hstack([y, np.repeat(0, 50)])
XX[yy == 0] = XX[yy == 0] + 4

plt.scatter(XX[:, 0], XX[:, 1], s=70, c=yy, cmap=plt.cm.prism)
plt.xlabel("XX1")
plt.ylabel("XX2");

In [None]:
svm5 = SVC(C=1, kernel="rbf")
svm5.fit(XX, yy)

In [None]:
plot_svc(svm5, XX, yy)
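
With more than two classes, `SVC` trains one classifier per pair of classes (one-versus-one). The `decision_function_shape` parameter only changes how the decision values are reported. The sketch below uses four hypothetical classes so that the two shapes differ: 6 pairwise columns versus 4 per-class columns.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
# Hypothetical data: four Gaussian clusters, 25 observations each
centers = np.array([[0, 0], [4, 0], [0, 4], [4, 4]])
X_demo = np.vstack([rng.normal(c, 1.0, size=(25, 2)) for c in centers])
y_demo = np.repeat([0, 1, 2, 3], 25)

ovo = SVC(kernel="rbf", decision_function_shape="ovo").fit(X_demo, y_demo)
ovr = SVC(kernel="rbf", decision_function_shape="ovr").fit(X_demo, y_demo)

ovo_shape = ovo.decision_function(X_demo[:3]).shape  # one column per pair of classes
ovr_shape = ovr.decision_function(X_demo[:3]).shape  # one column per class
print(ovo_shape, ovr_shape)  # (3, 6) (3, 4)
```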

### 9.6.5 Application to Gene Expression Data

The Khan gene expression dataset was exported from the R package `ISLR` to CSV files.

In [None]:
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
X_train = pd.read_csv("Data/Khan_xtrain.csv").drop("Unnamed: 0", axis=1)
y_train = (
    pd.read_csv("Data/Khan_ytrain.csv").drop("Unnamed: 0", axis=1).to_numpy().ravel()
)
X_test = pd.read_csv("Data/Khan_xtest.csv").drop("Unnamed: 0", axis=1)
y_test = (
    pd.read_csv("Data/Khan_ytest.csv").drop("Unnamed: 0", axis=1).to_numpy().ravel()
)

In [None]:
# y_train counts
pd.Series(y_train).value_counts(sort=False)

In [None]:
# y_test counts
pd.Series(y_test).value_counts(sort=False)

In [None]:
# This model gives identical results to the svm() of the R package e1071, also based on libsvm library.
svc = SVC(kernel="linear")

# This model is based on liblinear library and gives 100 score on the test data.
# svc = LinearSVC()

svc.fit(X_train, y_train)

In [None]:
cm = confusion_matrix(y_train, svc.predict(X_train))
cm_df = pd.DataFrame(cm.T, index=svc.classes_, columns=svc.classes_)
cm_df.index.name = "Predicted"
cm_df.columns.name = "True"
print(cm_df)

In [None]:
cm = confusion_matrix(y_test, svc.predict(X_test))
cm_df = pd.DataFrame(cm.T, index=svc.classes_, columns=svc.classes_)
cm_df.index.name = "Predicted"
cm_df.columns.name = "True"
print(cm_df)
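
A confusion matrix can be condensed into a single accuracy figure: the diagonal holds the correctly classified observations, so accuracy is the trace divided by the total count. The labels below are hypothetical and only illustrate the calculation. Because the trace is unchanged by transposition, the same formula applies to the transposed matrices printed above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for four classes
y_true_demo = np.array([1, 1, 2, 2, 3, 3, 4, 4])
y_pred_demo = np.array([1, 1, 2, 3, 3, 3, 4, 2])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)  # rows: true, columns: predicted
accuracy = np.trace(cm_demo) / cm_demo.sum()
print(accuracy)  # 0.75
```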