# Support vector machines

In this section, we discuss the support vector machine (SVM), an approach for classification that was developed in the computer science community in the 1990s and that has grown in popularity since then. SVMs have been shown to perform well in a variety of settings, and are often considered one of the best "out of the box" classifiers.

<!-- - [Lab: 9.6.1 Support Vector Classifier](#9.6.1-Support-Vector-Classifier)
- [Lab: 9.6.2 Support Vector Machine](#9.6.2-Support-Vector-Machine)
- [Lab: 9.6.3 ROC Curves](#9.6.3-ROC-Curves)
- [Lab: 9.6.4 SVM with Multiple Classes](#9.6.4-SVM-with-Multiple-Classes)
- [Lab: 9.6.5 Application to Gene Expression Data](#9.6.5-Application-to-Gene-Expression-Data) -->

In [None]:
# %load ../standard_import.txt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import label_binarize
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC, LinearSVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, roc_curve, auc, classification_report

%matplotlib inline

## Maximum margin classifier (hard margin)

In this section, we define a hyperplane and introduce the concept of an optimal separating hyperplane.

In a $D$-dimensional space, a hyperplane is a flat affine subspace of dimension $D - 1$. For instance, in two dimensions, a hyperplane is a flat one-dimensional subspace, in other words, a line. In three dimensions, a hyperplane is a flat two-dimensional subspace, that is, a plane. In $D > 3$ dimensions, it can be hard to visualize a hyperplane, but the notion of a $(D - 1)$-dimensional flat subspace still applies.

The mathematical definition of a hyperplane is quite simple. In two dimensions, a hyperplane is defined by the equation,

$$
\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0
$$

where $\beta_0, \beta_1, \beta_2$ are parameters. When we say that the above equation “defines” the hyperplane, we mean that any $\mathbf{x} = [x_1, x_2]^\top$ that satisfies the equation is a point on the hyperplane. Note that the equation above is simply the equation of a line, since in two dimensions a hyperplane is a line. In the $D$-dimensional case, the equation defining a hyperplane is

$$
\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_D x_D = 0
$$

where $\beta_0, \beta_1, \beta_2, \ldots, \beta_D$ are parameters. Any $\mathbf{x} = [x_1, x_2, \ldots, x_D]^\top$ that satisfies this equation is a point on the hyperplane. If $\mathbf{x}$ does not satisfy the equation, for example if

$$
\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_D x_D > 0
$$

then $\mathbf{x} $ is on one side of the hyperplane, and if

$$
\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_D x_D < 0
$$

then $\mathbf{x}$ is on the other side of the hyperplane. So we can think of the hyperplane as dividing $D$-dimensional space into two halves. One can easily determine on which side of the hyperplane a point lies by calculating the sign of the left-hand side of the hyperplane equation. The following code plots the two half-spaces defined by the hyperplane $1 + 2x_1 + 3x_2 = 0$ in two dimensions.

In [None]:
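# Plot the two sides of the hyperplane 1 + 2*x1 + 3*x2 = 0
x = np.arange(-1.5, 1.51, 0.01)
y = np.arange(-1.5, 1.51, 0.01)

X, Y = np.meshgrid(x, y)
zz = np.array([X.ravel(), Y.ravel()]).T
# Evaluate the left-hand side of the hyperplane equation at every grid point
Z = zz[:, 0] * 2 + zz[:, 1] * 3 + 1
# Keep only the sign: +1 on one side, -1 on the other (and on the hyperplane itself)
Z = np.where(Z > 0, 1, -1).reshape(X.shape)
plt.contourf(X, Y, Z, cmap=plt.cm.Paired, alpha=0.2)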
plt.show()
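
As a small numerical check of the statement above, the sketch below evaluates the left-hand side of the same hyperplane, $1 + 2x_1 + 3x_2 = 0$, at a few points and reads off the side from the sign. The three points are arbitrary values chosen for illustration, not data from this section.

```python
import numpy as np

# Hyperplane 1 + 2*x1 + 3*x2 = 0, the same one used in the plot
beta0 = 1.0
beta = np.array([2.0, 3.0])

# Illustrative points (assumed values, not from the text)
points = np.array([
    [1.0, 1.0],    # 1 + 2 + 3 = 6  -> positive side
    [-1.0, -1.0],  # 1 - 2 - 3 = -4 -> negative side
    [1.0, -1.0],   # 1 + 2 - 3 = 0  -> exactly on the hyperplane
])

# Left-hand side of the hyperplane equation for every point
values = beta0 + points @ beta
# The sign tells us the side: +1, -1, or 0 for a point on the hyperplane
sides = np.sign(values)
print(values)
print(sides)
```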

### Classification Using a Separating Hyperplane

Suppose we have a set of observations $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ in $D$-dimensional space

$$
\mathbf{x}_1 = \begin{bmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{1D} \end{bmatrix}, \ldots, \mathbf{x}_N = \begin{bmatrix} x_{N1} \\ x_{N2} \\ \vdots \\ x_{ND} \end{bmatrix},
$$

and that these observations fall into two classes, $y_1, \ldots, y_N \in \{-1, 1\}$, where 1 represents one class and -1 the other class. We also have a test observation $\mathbf{x}^* = [x^*_1, \ldots, x^*_D]^\top$ that we would like to classify. We can do so by finding a hyperplane that separates the two classes. Such a hyperplane is known as a separating hyperplane.

Following the colors used in {numref}`svm2`, we label the observations from the blue class as $y_i = 1$ and those from the purple class as $y_i = -1$. Then a separating hyperplane has the property that

$$
\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_D x_{iD} > 0, \text{ if } y_i = 1
$$

and

$$
\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_D x_{iD} < 0, \text{ if } y_i = -1.
$$

Equivalently, we can write this as

$$
y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_D x_{iD}) > 0, \text{ for } i = 1, \ldots, N.
$$

```{figure} https://raw.githubusercontent.com/pykale/transparentML/main/content/images/svm/svm2.png
---
width: 700px
name: svm2
figclass: margin-caption
---
Two classes of observations are shown in blue and in purple, respectively. Left: Three separating hyperplanes, out of many possible, are shown in black. Right: A separating hyperplane is shown in black. The blue and purple grid indicates the decision rule made by a classifier based on this separating hyperplane: a test observation that falls in the blue portion of the grid will be assigned to the blue class, and one that falls in the purple portion of the grid will be assigned to the purple class (figure source: [https://trevorhastie.github.io/ISLR/](https://trevorhastie.github.io/ISLR/)).
```

As shown in {numref}`svm2`, if a separating hyperplane exists, we can use it to construct a very natural classifier: a test observation is assigned a class depending on which side of the hyperplane it is located.
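
The separating condition and the resulting decision rule can be sketched in a few lines. The toy observations, the two candidate hyperplanes, and the helper names `separates` and `classify` below are assumptions made for illustration.

```python
import numpy as np

# Toy training data (assumed for illustration): two classes in two dimensions
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])


def separates(beta0, beta, X, y):
    """Return True if y_i * (beta0 + beta . x_i) > 0 for every observation."""
    return bool(np.all(y * (beta0 + X @ beta) > 0))


def classify(beta0, beta, x_new):
    """Assign +1 or -1 by the side of the hyperplane.

    A point exactly on the hyperplane is assigned to -1, an arbitrary convention.
    """
    return 1 if beta0 + x_new @ beta > 0 else -1


# Candidate A: x1 + x2 - 3.5 = 0 separates the two classes
a_separates = separates(-3.5, np.array([1.0, 1.0]), X, y)
# Candidate B: x1 - 0.5 = 0 leaves the observation (1, 0) on the wrong side
b_separates = separates(-0.5, np.array([1.0, 0.0]), X, y)
print(a_separates, b_separates)

# Classify a test observation with candidate A: 2.5 + 2.0 - 3.5 = 1 > 0
x_star = np.array([2.5, 2.0])
label = classify(-3.5, np.array([1.0, 1.0]), x_star)
print(label)
```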

### The maximum margin classifier

In general, if the data can be perfectly separated using a hyperplane, then there will in fact exist an infinite number of such hyperplanes. This is because a given separating hyperplane can usually be shifted a tiny bit up or down, or rotated, without coming into contact with any of the observations.

A natural choice is the _maximal margin hyperplane_ (also known as the optimal separating hyperplane), which is the separating hyperplane that is farthest from the training observations. That is, we can compute the (perpendicular) distance from each training observation to a given separating hyperplane; the smallest such distance is the minimal distance from the observations to the hyperplane, and is known as the margin. The maximal margin hyperplane is the separating hyperplane for which the margin is largest, that is, the hyperplane that has the farthest minimum distance to the training observations. We can then classify a test observation based on which side of the maximal margin hyperplane it lies. This is known as the maximal margin classifier. We hope that a classifier that has a large margin on the training data will also have a large margin on the test data, and hence will classify the test observations correctly.

```{figure} https://raw.githubusercontent.com/pykale/transparentML/main/content/images/svm/svm3.png
---
width: 400px
name: svm3
figclass: margin-caption
---
Two classes of observations are shown in blue and in purple, respectively. The maximal margin hyperplane is shown as a solid line. The margin is the distance from the solid line to either of the dashed lines. The two blue points and the purple point that lie on the dashed lines are the support vectors, and the distance from those points to the hyperplane is indicated by arrows. The purple and blue grid indicates the decision rule made by a classifier based on this separating hyperplane.
```

{numref}`svm3` shows a maximal margin hyperplane. Compared with the right-hand panel of {numref}`svm2`, the maximal margin hyperplane in {numref}`svm3` results in a greater minimal distance between the observations and the separating hyperplane, that is, a larger margin. We can also see that three training observations are equidistant from the maximal margin hyperplane and lie along the dashed lines indicating the width of the margin. These three observations are known as _support vectors_, and the distance from the support vectors to the hyperplane is the margin defined above (also called the _margin width_).


### Construction of maximum margin classifier

The maximal margin hyperplane is the solution to the following optimization problem

```{math}
:label: eq:max-margin-classifier
\begin{aligned}
& \max_{\beta_0, \beta_1, \ldots, \beta_D} M \\
& \text{subject to } \sum_{j=1}^D \beta_j^2 = 1 ,\\
& y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_D x_{iD}) \geq M, \text{ for } i = 1, \ldots, N.
\end{aligned}
```

where, under the constraint $\sum_{j=1}^D \beta_j^2 = 1$, the quantity $y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_D x_{iD})$ is the perpendicular distance from the $i\text{th}$ observation to the hyperplane whenever the observation is on the correct side (it is negative otherwise). The last constraint therefore guarantees that each observation is on the correct side of the hyperplane and at least a distance $M$ from it. Hence $M$ represents the margin of our hyperplane, and the optimization problem chooses $\beta_0, \beta_1, \ldots, \beta_D$ to maximize $M$. This is exactly the definition of the maximal margin hyperplane. The problem {eq}`eq:max-margin-classifier` can be solved efficiently via quadratic programming, but details of this optimization are outside the scope of this course.
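
To make the role of $M$ concrete, the sketch below computes the margin of a given separating hyperplane as the smallest value of $y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_D x_{iD})$ after rescaling the coefficients so that $\sum_j \beta_j^2 = 1$. The toy data, the two candidate hyperplanes, and the helper name `margin` are assumptions for illustration; the optimization itself is not performed here.

```python
import numpy as np

# Toy training data (assumed for illustration)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])


def margin(beta0, beta, X, y):
    """Smallest signed distance y_i * (beta0 + beta . x_i) / ||beta||.

    Dividing by ||beta|| is equivalent to enforcing sum_j beta_j^2 = 1.
    A negative result means the hyperplane does not separate the data.
    """
    norm = np.linalg.norm(beta)
    return float(np.min(y * (beta0 + X @ beta) / norm))


margin_a = margin(-3.5, np.array([1.0, 1.0]), X, y)  # hyperplane x1 + x2 = 3.5
margin_b = margin(-2.0, np.array([1.0, 0.0]), X, y)  # hyperplane x1 = 2
print(margin_a, margin_b)
```

Both hyperplanes separate the toy data, but the first has the larger margin, so the optimization problem above would prefer it over the second.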

## Support vector classifier (soft margin)

### The non-separable case

The maximal margin classifier is a very natural way to perform classification, if a separating hyperplane exists. However, in many cases no separating hyperplane exists, and so there is no maximal margin classifier, as shown in {numref}`svm4`. 

```{figure} https://raw.githubusercontent.com/pykale/transparentML/main/content/images/svm/svm4.png
---
width: 400px
name: svm4
figclass: margin-caption
---
Two classes of observations are shown in blue and in purple, respectively. The two classes are not separable by a hyperplane, and so the maximal margin classifier cannot be used.
```

In this case, we might be willing to allow some observations to be on the incorrect side of the margin, or even the incorrect side of the hyperplane, rather than insisting that every observation be on the correct side of both the hyperplane and the margin. This is the idea behind the _support vector classifier_, also known as the _soft margin classifier_. A soft margin can also be preferable when a separating hyperplane does exist: as {numref}`svm5` shows, adding a single observation can shift the maximal margin hyperplane dramatically.

```{figure} https://raw.githubusercontent.com/pykale/transparentML/main/content/images/svm/svm5.png
---
width: 700px
name: svm5
figclass: margin-caption
---
Left: Two classes of observations are shown in blue and in purple, along with the maximal margin hyperplane. Right: An additional blue observation has been added, leading to a dramatic shift in the maximal margin hyperplane shown as a solid line. The dashed line indicates the maximal margin hyperplane that was obtained in the absence of this additional point.
```

An example is shown in the left-hand panel of {numref}`svm6`. Most of the observations are on the correct side of the margin. However, a small subset of the observations are on the wrong side of the margin.

An observation can be not only on the wrong side of the margin, but also on the wrong side of the hyperplane. In fact, when there is no separating hyperplane, such a situation is inevitable. Observations on the wrong side of the hyperplane correspond to training observations that are misclassified by the support vector classifier. The right-hand panel of {numref}`svm6` illustrates such a scenario.

```{figure} https://raw.githubusercontent.com/pykale/transparentML/main/content/images/svm/svm6.png
---
width: 700px
name: svm6
figclass: margin-caption
---
Left: A support vector classifier was fit to a small data set. The hyperplane is shown as a solid line and the margins are shown as dashed lines. Purple observations: Observations 3, 4, 5, and 6 are on the correct side of the margin, observation 2 is on the margin, and observation 1 is on the wrong side of the margin. Blue observations: Observations 7 and 10 are on the correct side of the margin, observation 9 is on the margin, and observation 8 is on the wrong side of the margin. No observations are on the wrong side of the hyperplane. Right: Same as left panel with two additional points, 11 and 12. These two observations are on the wrong side of the hyperplane and the wrong side of the margin.
```

### Construction of support vector classifier

The support vector classifier classifies a test observation depending on which side of a hyperplane it lies. The hyperplane is chosen to correctly separate most of the training observations into the two classes, but may misclassify a few observations. It is the solution to the optimization problem

```{math}
:label: eq:soft-margin-classifier
\begin{aligned}
& \max_{\beta_0, \beta_1, \ldots, \beta_D, \epsilon_1, \ldots, \epsilon_N} M \\
& \text{subject to } \sum_{j=1}^D \beta_j^2 = 1 ,\\
& y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_D x_{iD}) \geq M(1 - \epsilon_i),\\
& \epsilon_i \geq 0, \sum_{i = 1}^N \epsilon_i \leq C, \text{ for } i = 1, \ldots, N,
\end{aligned}
```

where $\epsilon_1, \ldots, \epsilon_N$ are slack variables that allow individual observations to be on the wrong side of the margin or the hyperplane. If $\epsilon_i = 0$, the $i\text{th}$ observation is on the correct side of the margin; if $\epsilon_i > 0$, it is on the wrong side of the margin; and if $\epsilon_i > 1$, it is on the wrong side of the hyperplane. The slack variables are constrained to be non-negative, and the constraint $\sum_{i = 1}^N \epsilon_i \leq C$ places an upper bound on the total amount by which the observations can violate the margin. The non-negative tuning parameter $C$ thus acts as a budget for violations and controls the trade-off between a wide margin and few margin violations. When $C = 0$, no violations are allowed and the problem reduces to the maximal margin classifier {eq}`eq:max-margin-classifier`. When $C$ is small, the classifier tolerates few violations, so the margin is narrow. When $C$ is large, the classifier tolerates more violations, so the margin is wider. In practice, $C$ is often tuned using cross-validation. {numref}`svm7` illustrates the effect of $C$ on the support vector classifier. Note that the `C` argument of scikit-learn's `SVC`, used in the lab below, penalizes violations rather than budgeting them, so it acts in the opposite direction: a small `C` there gives a wider margin.

```{figure} https://raw.githubusercontent.com/pykale/transparentML/main/content/images/svm/svm7.png
---
width: 700px
name: svm7
figclass: margin-caption
---
A support vector classifier was fit using four different values of the tuning parameter $ C $ of Equation {eq}`eq:soft-margin-classifier`. The largest value of $ C $ was used in the top left panel, and smaller values were used in the top right, bottom left, and bottom right panels. When $ C $ is large, then there is a high tolerance for observations being on the wrong side of the margin, and so the margin will be large. As $ C $ decreases, the tolerance for observations being on the wrong side of the margin decreases, and the margin narrows.
```
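
The following sketch illustrates this trade-off numerically using scikit-learn's `SVC` with a linear kernel. Two caveats apply. First, scikit-learn parameterizes the problem differently from Equation {eq}`eq:soft-margin-classifier`: its `C` is a penalty on margin violations, so it acts inversely to the budget $C$ above. Second, it rescales the coefficients so that the margin boundaries sit at decision values $\pm 1$, which makes the margin equal to $1 / \lVert \mathbf{w} \rVert$ and the slack of observation $i$ equal to $\max(0, 1 - y_i f(\mathbf{x}_i))$. The simulated data and the helper name `margin_and_slack` are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Simulated two-class data with overlapping classes (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.repeat([1, -1], 20)
X[y == -1] += 1.5


def margin_and_slack(C):
    """Fit a linear support vector classifier; return (margin, total slack)."""
    svc = SVC(C=C, kernel="linear").fit(X, y)
    w = svc.coef_.ravel()
    # scikit-learn scales the solution so the margin boundaries are at f(x) = +/-1
    margin = 1.0 / np.linalg.norm(w)
    # Slack is positive only for observations on the wrong side of the margin
    slack = np.maximum(0.0, 1.0 - y * svc.decision_function(X))
    return margin, slack.sum()


for C in [0.01, 1.0, 100.0]:
    m, s = margin_and_slack(C)
    print(f"sklearn C={C:>6}: margin={m:.3f}, total slack={s:.3f}")
```

A small scikit-learn `C` should give a wide margin with a large total slack, and a large `C` a narrow margin with little slack.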

## Support vector machines

### Classification with non-linear decision boundaries

The support vector classifier is a natural approach for classification in the two-class setting, if the boundary between the two classes is linear. However, in practice we are sometimes faced with non-linear class boundaries, as in the example shown in the left panel of {numref}`svm8`.

```{figure} https://raw.githubusercontent.com/pykale/transparentML/main/content/images/svm/svm8.png
---
width: 700px
name: svm8
figclass: margin-caption
---
Left: The observations fall into two classes, with a non-linear boundary between them. Right: The support vector classifier seeks a linear boundary, and consequently performs very poorly.
```

In {doc}`Linear regression <../02-linear-reg/extension-limitation>`, we discussed using higher-order polynomials as a way to fit a non-linear relationship between the predictors and the response. The same idea applies here. For example, rather than fitting a support vector classifier using the $D$ features $x_{1}, x_{2}, \ldots, x_{D}$, we could fit a support vector classifier using the $2D$ features $x_{1}, x_{2}, \ldots, x_{D}, x_{1}^2, x_{2}^2, \ldots, x_{D}^2$. Then the optimization problem becomes

```{math}
:label: eq:svm-polynomial
\begin{aligned}
& \max_{\beta_0, \beta_{1,1}, \ldots, \beta_{D,2}, \epsilon_1, \ldots, \epsilon_N} M \\
& \text{subject to } y_i\left(\beta_0 + \sum_{j=1}^{D} \left(\beta_{j,1} x_{ij} + \beta_{j,2} x_{ij}^2\right)\right) \geq M(1 - \epsilon_i), \\
& \sum_{j=1}^{D} \left(\beta_{j,1}^2 + \beta_{j,2}^2\right) = 1, \epsilon_i \geq 0, \sum_{i = 1}^N \epsilon_i \leq C, \text{ for } i = 1, \ldots, N,
\end{aligned}
```

In the enlarged feature space, the decision boundary that results from Equation {eq}`eq:svm-polynomial` is in fact linear. But in the original feature space, the decision boundary is of the form $q(\mathbf{x}) = 0$, where $q(\cdot)$ is a quadratic polynomial, and its solutions are generally non-linear.
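
As a rough illustration of this idea, the sketch below fits a linear support vector classifier twice: once on the original two features and once on the enlarged set that also contains their squares. The simulated data with a circular class boundary, the value `C=10` (scikit-learn's penalty parameter), and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Simulated data with a circular class boundary (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where((X ** 2).sum(axis=1) < 2.0, 1, -1)

# Support vector classifier on the original features x1, x2
svc_raw = SVC(kernel="linear", C=10).fit(X, y)
acc_raw = svc_raw.score(X, y)

# Same classifier on the enlarged feature set x1, x2, x1^2, x2^2
X_aug = np.hstack([X, X ** 2])
svc_aug = SVC(kernel="linear", C=10).fit(X_aug, y)
acc_aug = svc_aug.score(X_aug, y)

print(f"training accuracy, original features: {acc_raw:.2f}")
print(f"training accuracy, enlarged features: {acc_aug:.2f}")
```

The boundary $x_1^2 + x_2^2 = 2$ is linear in the enlarged features, so the second fit should separate the classes far better than the first.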


### The support vector machine and kernel trick

The _support vector machine (SVM)_ is an extension of the support vector classifier that results from enlarging the feature space in a specific way, using _kernels_.

```{figure} https://raw.githubusercontent.com/pykale/transparentML/main/content/images/svm/svm9.png
---
width: 700px
name: svm9
figclass: margin-caption
---
Left: An SVM with a polynomial kernel of degree 3 is applied to the non-linear data from {numref}`svm8`, resulting in a far more appropriate decision rule. Right: An SVM with a radial kernel is applied. In this example, either kernel is capable of capturing the decision boundary.
```
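
The text above does not spell out the kernels themselves. As a hedged sketch, the code below uses the standard textbook definitions of the polynomial kernel, $K(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^\top \mathbf{x}')^d$, and the radial kernel, $K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^2)$. These formulas are assumptions brought in for illustration, as are the helper names `poly_kernel` and `radial_kernel` and the two example points. The results are checked against scikit-learn's pairwise kernel functions, where the polynomial form corresponds to `gamma=1` and `coef0=1`.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel


def poly_kernel(x, z, degree=3):
    """Polynomial kernel (1 + x . z)^degree."""
    return (1.0 + x @ z) ** degree


def radial_kernel(x, z, gamma=1.0):
    """Radial kernel exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))


# Two example points (assumed values)
x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

k_poly = poly_kernel(x, z, degree=3)    # (1 + 0.5 - 2)^3 = (-0.5)^3
k_rbf = radial_kernel(x, z, gamma=0.5)  # exp(-0.5 * (0.25 + 9))

# The same quantities computed by scikit-learn on 1 x 2 arrays
sk_poly = polynomial_kernel(x[None, :], z[None, :], degree=3, gamma=1, coef0=1)[0, 0]
sk_rbf = rbf_kernel(x[None, :], z[None, :], gamma=0.5)[0, 0]

print(k_poly, sk_poly)
print(k_rbf, sk_rbf)
```

In the radial kernel, a larger `gamma` makes the kernel value decay faster with distance, which is why larger `gamma` values in the lab below give more flexible decision boundaries.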

## LAB

### 9.6.1 Support Vector Classifier

Define a function to plot a classifier with support vectors.

In [None]:
def plot_svc(svc, X, y, h=0.02, pad=0.25):
    # Build a grid that covers the data, with some padding around it
    x_min, x_max = X[:, 0].min() - pad, X[:, 0].max() + pad
    y_min, y_max = X[:, 1].min() - pad, X[:, 1].max() + pad
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the class at every grid point to shade the decision regions
    Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.2)

    plt.scatter(X[:, 0], X[:, 1], s=70, c=y, cmap=plt.cm.Paired)
    # Support vectors are marked with black crosses
    sv = svc.support_vectors_
    plt.scatter(sv[:, 0], sv[:, 1], c="k", marker="x", s=100, alpha=0.5)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xlabel("X1")
    plt.ylabel("X2")
    plt.show()
    print("Number of support vectors: ", svc.support_.size)

In [None]:
# Generate random data: 20 observations of 2 features, divided into two classes.
np.random.seed(5)
X = np.random.randn(20, 2)
y = np.repeat([1, -1], 10)

X[y == -1] = X[y == -1] + 1
plt.scatter(X[:, 0], X[:, 1], s=70, c=y, cmap=plt.cm.Paired)
plt.xlabel("X1")
plt.ylabel("X2")

In [None]:
y

In [None]:
# Support Vector Classifier with linear kernel.
svc = SVC(C=1.0, kernel="linear")
svc.fit(X, y)

plot_svc(svc, X, y)

In [None]:
# When using a smaller cost parameter (C=0.1) the margin is wider, resulting in more support vectors.
svc2 = SVC(C=0.1, kernel="linear")
svc2.fit(X, y)
plot_svc(svc2, X, y)

In [None]:
# Select the optimal C parameter by cross-validation
tuned_parameters = [{"C": [0.001, 0.01, 0.1, 1, 5, 10, 100]}]
clf = GridSearchCV(
    SVC(kernel="linear"),
    tuned_parameters,
    cv=10,
    scoring="accuracy",
    return_train_score=True,
)
clf.fit(X, y)
clf.cv_results_

In [None]:
# 0.001 is best according to GridSearchCV.
clf.best_params_
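
The raw `cv_results_` dictionary is hard to read. One possible way to summarize it, sketched here on freshly simulated data so that the block runs on its own, is to load it into a DataFrame and keep only the columns of interest. The data, the 5-fold setting, and the names `grid` and `summary` are assumptions for illustration; the column names are the keys that scikit-learn produces for this grid when `return_train_score=True`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Simulated two-class data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.repeat([1, -1], 20)
X[y == -1] += 1

grid = GridSearchCV(
    SVC(kernel="linear"),
    {"C": [0.001, 0.01, 0.1, 1, 5, 10, 100]},
    cv=5,
    scoring="accuracy",
    return_train_score=True,
)
grid.fit(X, y)

# One row per candidate value of C, best-ranked candidates first
summary = pd.DataFrame(grid.cv_results_)[
    ["param_C", "mean_train_score", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary.to_string(index=False))
print(grid.best_params_)
```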

In [None]:
# Generating test data
np.random.seed(1)
X_test = np.random.randn(20, 2)
y_test = np.random.choice([-1, 1], 20)
X_test[y_test == 1] = X_test[y_test == 1] - 1

plt.scatter(X_test[:, 0], X_test[:, 1], s=70, c=y_test, cmap=plt.cm.Paired)
plt.xlabel("X1")
plt.ylabel("X2")

In [None]:
# svc2 : C = 0.1
y_pred = svc2.predict(X_test)
pd.DataFrame(confusion_matrix(y_test, y_pred), index=svc2.classes_, columns=svc2.classes_)

In [None]:
svc3 = SVC(C=0.001, kernel="linear")
svc3.fit(X, y)

# svc3 : C = 0.001
y_pred = svc3.predict(X_test)
pd.DataFrame(
    confusion_matrix(y_test, y_pred), index=svc3.classes_, columns=svc3.classes_
)
# The misclassification is the same

In [None]:
# Changing the test data so that the classes are really separable with a hyperplane.
X_test[y_test == 1] = X_test[y_test == 1] - 1
plt.scatter(X_test[:, 0], X_test[:, 1], s=70, c=y_test, cmap=plt.cm.Paired)
plt.xlabel("X1")
plt.ylabel("X2")

In [None]:
svc4 = SVC(C=10.0, kernel="linear")
svc4.fit(X_test, y_test)

In [None]:
plot_svc(svc4, X_test, y_test)

In [None]:
# Increase the margin. Now there is one misclassification: increased bias, lower variance.
svc5 = SVC(C=1, kernel="linear")
svc5.fit(X_test, y_test)

In [None]:
plot_svc(svc5, X_test, y_test)

### 9.6.2 Support Vector Machine 

In [None]:
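# Generating data with a non-linear class boundary
np.random.seed(8)
X = np.random.randn(200, 2)
X[:100] = X[:100] + 2
# Shift the next 50 observations; the slice starts at index 100 (0-based)
X[100:150] = X[100:150] - 2
y = np.concatenate([np.repeat(-1, 150), np.repeat(1, 50)])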

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=2)

plt.scatter(X[:, 0], X[:, 1], s=70, c=y, cmap=plt.cm.Paired)
plt.xlabel("X1")
plt.ylabel("X2");

In [None]:
svm = SVC(C=1.0, kernel="rbf", gamma=1)
svm.fit(X_train, y_train)

In [None]:
plot_svc(svm, X_train, y_train)

In [None]:
# Increasing C parameter, allowing more flexibility
svm2 = SVC(C=100, kernel="rbf", gamma=1.0)
svm2.fit(X_train, y_train)

In [None]:
plot_svc(svm2, X_train, y_train)

In [None]:
# Set the parameters by cross-validation
tuned_parameters = [{"C": [0.01, 0.1, 1, 10, 100], "gamma": [0.5, 1, 2, 3, 4]}]
clf = GridSearchCV(
    SVC(kernel="rbf"),
    tuned_parameters,
    cv=10,
    scoring="accuracy",
    return_train_score=True,
)
clf.fit(X_train, y_train)
clf.cv_results_

In [None]:
clf.best_params_

In [None]:
confusion_matrix(y_test, clf.best_estimator_.predict(X_test))

In [None]:
# Accuracy on the test set (roughly 15% of test observations are misclassified)
clf.best_estimator_.score(X_test, y_test)

### 9.6.3 ROC Curves

Comparing the ROC curves of two models on train/test data. One model is more flexible than the other.

In [None]:
svm3 = SVC(C=1, kernel="rbf", gamma=2)
svm3.fit(X_train, y_train)

In [None]:
# More flexible model
svm4 = SVC(C=1, kernel="rbf", gamma=50)
svm4.fit(X_train, y_train)

In [None]:
y_train_score3 = svm3.decision_function(X_train)
y_train_score4 = svm4.decision_function(X_train)

false_pos_rate3, true_pos_rate3, _ = roc_curve(y_train, y_train_score3)
roc_auc3 = auc(false_pos_rate3, true_pos_rate3)

false_pos_rate4, true_pos_rate4, _ = roc_curve(y_train, y_train_score4)
roc_auc4 = auc(false_pos_rate4, true_pos_rate4)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
ax1.plot(
    false_pos_rate3,
    true_pos_rate3,
    label=r"SVM $\gamma = 2$ ROC curve (area = %0.2f)" % roc_auc3,
    color="b",
)
ax1.plot(
    false_pos_rate4,
    true_pos_rate4,
    label=r"SVM $\gamma = 50$ ROC curve (area = %0.2f)" % roc_auc4,
    color="r",
)
ax1.set_title("Training Data")

y_test_score3 = svm3.decision_function(X_test)
y_test_score4 = svm4.decision_function(X_test)

false_pos_rate3, true_pos_rate3, _ = roc_curve(y_test, y_test_score3)
roc_auc3 = auc(false_pos_rate3, true_pos_rate3)

false_pos_rate4, true_pos_rate4, _ = roc_curve(y_test, y_test_score4)
roc_auc4 = auc(false_pos_rate4, true_pos_rate4)

ax2.plot(
    false_pos_rate3,
    true_pos_rate3,
    label=r"SVM $\gamma = 2$ ROC curve (area = %0.2f)" % roc_auc3,
    color="b",
)
ax2.plot(
    false_pos_rate4,
    true_pos_rate4,
    label=r"SVM $\gamma = 50$ ROC curve (area = %0.2f)" % roc_auc4,
    color="r",
)
ax2.set_title("Test Data")

for ax in fig.axes:
    ax.plot([0, 1], [0, 1], "k--")
    ax.set_xlim([-0.05, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.legend(loc="lower right")

As expected, the more flexible model scores better on training data but worse on the test data.

### 9.6.4 SVM with Multiple Classes

In [None]:
# Adding a third class of observations
np.random.seed(8)
XX = np.vstack([X, np.random.randn(50, 2)])
yy = np.hstack([y, np.repeat(0, 50)])
XX[yy == 0] = XX[yy == 0] + 4

plt.scatter(XX[:, 0], XX[:, 1], s=70, c=yy, cmap=plt.cm.prism)
plt.xlabel("XX1")
plt.ylabel("XX2");

In [None]:
svm5 = SVC(C=1, kernel="rbf")
svm5.fit(XX, yy)

In [None]:
plot_svc(svm5, XX, yy)
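
`SVC` handles more than two classes by fitting one classifier per pair of classes (one-versus-one). The sketch below shows how the `decision_function_shape` argument changes the shape of the returned decision values while leaving the predictions unchanged. It uses four simulated classes rather than the three above, so that the number of classes and the number of class pairs differ; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Four simulated classes centred on the corners of a square (assumed data)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [4, 0], [0, 4], [4, 4]])
X = np.vstack([rng.normal(size=(25, 2)) + c for c in centers])
y = np.repeat([0, 1, 2, 3], 25)

svm_ovr = SVC(C=1, kernel="rbf", decision_function_shape="ovr").fit(X, y)
svm_ovo = SVC(C=1, kernel="rbf", decision_function_shape="ovo").fit(X, y)

shape_ovr = svm_ovr.decision_function(X).shape  # one column per class
shape_ovo = svm_ovo.decision_function(X).shape  # one column per pair of classes
same_predictions = np.array_equal(svm_ovr.predict(X), svm_ovo.predict(X))

print(shape_ovr, shape_ovo, same_predictions)
```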

### 9.6.5 Application to Gene Expression Data

In R, I exported the dataset from package 'ISLR' to csv files.

In [None]:
# X_train = pd.read_csv("Data/Khan_xtrain.csv").drop("Unnamed: 0", axis=1)
# y_train = (
#     pd.read_csv("Data/Khan_ytrain.csv").drop("Unnamed: 0", axis=1).to_numpy().ravel()
# )
# X_test = pd.read_csv("Data/Khan_xtest.csv").drop("Unnamed: 0", axis=1)
# y_test = (
#     pd.read_csv("Data/Khan_ytest.csv").drop("Unnamed: 0", axis=1).to_numpy().ravel()
# )

In [None]:
# y_train counts
# pd.Series(y_train).value_counts(sort=False)

In [None]:
# y_test counts
# pd.Series(y_test).value_counts(sort=False)

In [None]:
# This model gives identical results to the svm() of the R package e1071, also based on libsvm library.
# svc = SVC(kernel="linear")

# # This model is based on liblinear library and gives 100 score on the test data.
# # svc = LinearSVC()

# svc.fit(X_train, y_train)

In [None]:
# cm = confusion_matrix(y_train, svc.predict(X_train))
# cm_df = pd.DataFrame(cm.T, index=svc.classes_, columns=svc.classes_)
# cm_df.index.name = "Predicted"
# cm_df.columns.name = "True"
# print(cm_df)

In [None]:
# cm = confusion_matrix(y_test, svc.predict(X_test))
# cm_df = pd.DataFrame(cm.T, index=svc.classes_, columns=svc.classes_)
# cm_df.index.name = "Predicted"
# cm_df.columns.name = "True"
# print(cm_df)