# Support vector machines

## Non-linear decision boundaries and kernel methods

### Classification with non-linear decision boundaries

The support vector classifier is a natural approach for classification in the two-class setting, if the boundary between the two classes is linear. However, in practice we are sometimes faced with non-linear class boundaries, as in the example shown in the left panel of {numref}`svm8`.

```{figure} https://raw.githubusercontent.com/pykale/transparentML/main/content/images/svm/svm8.png
---
width: 700px
name: svm8
figclass: margin-caption
---
Left: A support vector classifier was fit to a small data set. The hyperplane is shown as a solid line and the margins are shown as dashed lines. Purple observations: Observations 3, 4, 5, and 6 are on the correct side of the margin, observation 2 is on the margin, and observation 1 is on the wrong side of the margin. Blue observations: Observations 7 and 10 are on the correct side of the margin, observation 9 is on the margin, and observation 8 is on the wrong side of the margin. No observations are on the wrong side of the hyperplane. Right: Same as left panel with two additional points, 11 and 12. These two observations are on the wrong side of the hyperplane and the wrong side of the margin.
```

In {doc}`Linear regression <../02-linear-reg/extension-limitation>`, we discussed using higher-order polynomials as a way to fit a non-linear relationship between the predictors and the response. The support vector classifier can handle non-linear class boundaries in a similar way, by enlarging the feature space with polynomial functions of the predictors. For example, rather than fitting a support vector classifier using $D$ features: $ x_{1}, x_{2}, \ldots, x_{D} $, we could fit a support vector classifier using $ 2 \times D $ features: $ x_{1}, x_{2}, \ldots, x_{D}, x_{1}^2, x_{2}^2, \ldots, x_{D}^2 $. Then the optimisation problem becomes

```{math}
:label: eq:svm-polynomial
\begin{aligned}
& \max_{\beta_0, \beta_{1,1}, \ldots, \beta_{D,2}, \epsilon_1, \ldots, \epsilon_N} M \\
& \text{subject to } y_i\left(\beta_0 + \sum_{j=1}^{D} \beta_{j,1} x_{ij} + \sum_{j=1}^{D} \beta_{j,2} x_{ij}^2\right) \geq M(1 - \epsilon_i), \\
& \sum_{j=1}^{D} \left(\beta_{j,1}^2 + \beta_{j,2}^2\right) = 1, \quad \epsilon_i \geq 0, \quad \sum_{i = 1}^N \epsilon_i \leq C, \quad \text{ for } i = 1, \ldots, N.
\end{aligned}
```

In the enlarged feature space, the decision boundary that results from Equation {eq}`eq:svm-polynomial` is in fact linear. But in the original feature space, the decision boundary is of the form $ q(x) = 0 $, where $ q(\cdot) $ is a quadratic polynomial, and its solutions are generally non-linear.
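
As a rough illustration of this idea, the sketch below fits the same linear classifier twice: once on two original features and once on the enlarged set that also contains their squares. The synthetic data with a circular class boundary, the random seed, the variable names, and the value `C=10` are all assumptions chosen for illustration. Note that `C` in `scikit-learn` is a penalty weight rather than the budget used in Equation {eq}`eq:svm-polynomial`.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic data (an assumption for illustration): the class depends on the
# distance from the origin, so the true boundary is a circle.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(300, 2))
y_circle = np.where((X_raw**2).sum(axis=1) < 1.0, 1, -1)

# Enlarged feature space: x1, x2, x1^2, x2^2
X_quad = np.hstack([X_raw, X_raw**2])

# The same linear classifier, fitted in each feature space
acc_raw = SVC(kernel="linear", C=10).fit(X_raw, y_circle).score(X_raw, y_circle)
acc_quad = SVC(kernel="linear", C=10).fit(X_quad, y_circle).score(X_quad, y_circle)

print(f"Training accuracy with the original features: {acc_raw:.2f}")
print(f"Training accuracy with squared features added: {acc_quad:.2f}")
```

The boundary fitted on `X_quad` is a hyperplane in the four-dimensional enlarged space, but expressed in terms of the two original features it is a quadratic curve.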


### Kernel methods

The _support vector machine (SVM)_ is an extension of the support vector classifier that results from enlarging the feature space in a specific way, using _kernels_. As described above, the main idea is to enlarge our feature space in order to accommodate a non-linear boundary between the classes. The kernel approach that we describe here is simply an efficient computational approach for enacting this idea.

We have not discussed exactly how the support vector classifier is computed because the details become somewhat technical. However, it turns out that the solution to the support vector classifier problem {eq}`eq:soft-margin-classifier` involves only the inner products of the observations (as opposed to the observations themselves). The inner product of two $D$-vectors $\mathbf{a}$ and $\mathbf{b}$ is defined as $ \langle \mathbf{a}, \mathbf{b} \rangle = \sum_{i=1}^{D} a_i b_i $. Thus the inner product of two observations $\mathbf{x}_i$ and $\mathbf{x}_{i'}$ is $ \langle \mathbf{x}_i, \mathbf{x}_{i'} \rangle = \sum_{j=1}^{D} x_{ij} x_{i'j} $. As a result, the linear support vector classifier can be written as

```{math}
:label: eq:svm-inner-product
\begin{equation}
f(\mathbf{x}) = \beta_0 + \sum_{i=1}^{N} \alpha_i y_i \langle \mathbf{x}, \mathbf{x}_i \rangle,
\end{equation}
```
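where there are $N$ parameters $\alpha_i, \text{ for } i = 1, \ldots, N $, one per training observation. To estimate the parameters $\alpha_i$ and $\beta_0$, all we need are the inner products $ \langle \mathbf{x}_i, \mathbf{x}_{i'} \rangle $ between all pairs of training observations. Now suppose that every time an inner product appears in this representation, or in the computation of its solution, we replace it with a generalisation of the form

$$
k(\mathbf{x}_i, \mathbf{x}_{i'}),
$$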

where $ k(\cdot, \cdot) $ is some function that we will refer to as a _kernel_, which quantifies the similarity of two observations. For instance, we could simply take

$$
k(\mathbf{x}_i, \mathbf{x}_{i'}) = \sum_{j=1}^{D} x_{ij} x_{i'j},
$$

which is known as the _linear kernel_ and simply gives back the support vector classifier. However, we could also take

$$
k(\mathbf{x}_i, \mathbf{x}_{i'}) = \left(1 + \sum_{j=1}^{D} x_{ij} x_{i'j}\right)^d,
$$

which is known as the _polynomial kernel_ of degree $d$, where $d$ is a positive integer. Choosing $ d > 1 $ gives a more flexible decision boundary than the linear kernel. Another popular choice is the _radial kernel_, which takes the form

$$
k(\mathbf{x}_i, \mathbf{x}_{i'}) = \exp\left(-\gamma \sum_{j=1}^{D} (x_{ij} - x_{i'j})^2\right),
$$

where $ \gamma $ is a positive tuning parameter. The radial kernel is also known as the _Gaussian kernel_ or radial basis function (RBF) kernel. {numref}`svm9` shows the decision boundaries that result from using the polynomial (left) and radial (right) kernels.
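
The three kernels above can be written in a few lines of NumPy. The following is a minimal sketch, assuming toy data and hypothetical helper names (`k_linear`, `k_poly`, `k_radial`, `phi`); it cross-checks the results against the pairwise kernel functions in `scikit-learn`. The last check illustrates why kernels are a computational shortcut: for two features, the degree-2 polynomial kernel equals an ordinary inner product in an enlarged six-dimensional feature space, which never has to be built explicitly.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel


def k_linear(A, B):
    """Linear kernel: inner products between the rows of A and the rows of B."""
    return A @ B.T


def k_poly(A, B, d=2):
    """Polynomial kernel of degree d: (1 + <a, b>)^d."""
    return (1.0 + A @ B.T) ** d


def k_radial(A, B, gamma=1.0):
    """Radial kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)


def phi(A):
    """Explicit feature map matching the degree-2 polynomial kernel for two features."""
    x1, x2 = A[:, 0], A[:, 1]
    s = np.sqrt(2.0)
    return np.column_stack([np.ones(len(A)), s * x1, s * x2, x1**2, x2**2, s * x1 * x2])


rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))  # toy observations (an assumption for illustration)

# Cross-check against scikit-learn's pairwise kernels
same_linear = np.allclose(k_linear(A, A), linear_kernel(A, A))
same_poly = np.allclose(
    k_poly(A, A, d=2), polynomial_kernel(A, A, degree=2, gamma=1, coef0=1)
)
same_radial = np.allclose(k_radial(A, A, gamma=0.5), rbf_kernel(A, A, gamma=0.5))

# The kernel value equals the inner product of the explicitly enlarged features
same_feature_map = np.allclose(k_poly(A, A, d=2), phi(A) @ phi(A).T)

print(same_linear, same_poly, same_radial, same_feature_map)
```

Note that `polynomial_kernel` in `scikit-learn` computes $(\gamma \langle \mathbf{a}, \mathbf{b} \rangle + c_0)^d$, so `gamma=1` and `coef0=1` are needed to match the form used here.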

When the support vector classifier is combined with a non-linear kernel, the resulting classifier is known as a support vector machine:

```{math}
:label: eq:svm
\begin{equation}
f(\mathbf{x}) = \beta_0 + \sum_{i=1}^{N} \alpha_i y_i k(\mathbf{x}, \mathbf{x}_i).
\end{equation}
```
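
To connect this formula with a fitted model, the sketch below recomputes the decision function of a `scikit-learn` `SVC` by hand. It assumes toy data and an arbitrary choice of `gamma` and `C`. In `scikit-learn`, the weights that multiply the kernel values are stored in `dual_coef_` (already multiplied by the class labels, and only for the support vectors, since all other observations have zero weight), and $\beta_0$ is stored in `intercept_`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Toy two-class data (an assumption for illustration)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(80, 2))
y_toy = np.where(X_toy[:, 0] * X_toy[:, 1] > 0, 1, -1)

gamma = 0.5
model = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X_toy, y_toy)

# Kernel values between new points and the support vectors only
X_new = rng.normal(size=(10, 2))
K = rbf_kernel(X_new, model.support_vectors_, gamma=gamma)

# dual_coef_ has shape (1, n_support_vectors) in the two-class case
f_manual = K @ model.dual_coef_.ravel() + model.intercept_[0]

print(np.allclose(f_manual, model.decision_function(X_new)))
print("Support vectors:", len(model.support_), "of", len(X_toy), "observations")
```

Only the kernel values between a new point and the support vectors are needed, which is what makes prediction feasible even when the enlarged feature space is very high-dimensional.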

```{figure} https://raw.githubusercontent.com/pykale/transparentML/main/content/images/svm/svm9.png
---
width: 700px
name: svm9
figclass: margin-caption
---
Left: An SVM with a polynomial kernel of degree 3 is applied to the non-linear data from {numref}`svm8`, resulting in a far more appropriate decision rule. Right: An SVM with a radial kernel is applied. In this example, either kernel is capable of capturing the decision boundary.
```

### Example: SVMs on toy data using `scikit-learn`

Import the required libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, roc_curve, auc
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

%matplotlib inline

Define a function to plot a classifier with support vectors.

In [None]:
def plot_svc(svc, X, y, h=0.02, pad=0.25):
    """Plot the decision regions of a fitted two-feature classifier and mark its support vectors."""
    # Evaluate the classifier on a dense grid (step h) covering the data
    x_min, x_max = X[:, 0].min() - pad, X[:, 0].max() + pad
    y_min, y_max = X[:, 1].min() - pad, X[:, 1].max() + pad
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.2)

    # Observations, coloured by class label
    plt.scatter(X[:, 0], X[:, 1], s=70, c=y, cmap=plt.cm.Paired)
    # Support vectors are marked in the plot with black crosses
    sv = svc.support_vectors_
    plt.scatter(sv[:, 0], sv[:, 1], c="k", marker="x", s=100, alpha=0.5)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xlabel("X1")
    plt.ylabel("X2")
    plt.show()
    print("Number of support vectors: ", svc.support_.size)

Generating data

In [None]:
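np.random.seed(8)
X = np.random.randn(200, 2)
# Shift two groups of observations away from the origin so that
# the two classes cannot be separated by a straight line
X[:100] = X[:100] + 2
X[101:150] = X[101:150] - 2
# The first 150 observations are labelled -1, the remaining 50 are labelled +1
y = np.concatenate([np.repeat(-1, 150), np.repeat(1, 50)])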

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=2)

plt.scatter(X[:, 0], X[:, 1], s=70, c=y, cmap=plt.cm.Paired)
plt.xlabel("X1")
plt.ylabel("X2")
plt.show()

In [None]:
svm = SVC(C=1.0, kernel="rbf", gamma=1)
svm.fit(X_train, y_train)
plot_svc(svm, X_train, y_train)

In [None]:
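# Increasing scikit-learn's C penalises margin violations more heavily, allowing a more flexible fit
# (this C is a penalty weight, so it acts inversely to the budget C in the optimisation problem)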
svm2 = SVC(C=100, kernel="rbf", gamma=1.0)
svm2.fit(X_train, y_train)
plot_svc(svm2, X_train, y_train)

Choosing the parameters by cross-validation

In [None]:
tuned_parameters = {"C": [0.01, 0.1, 1, 10, 100], "gamma": [0.5, 1, 2, 3, 4]}
clf = GridSearchCV(
    SVC(kernel="rbf"),
    tuned_parameters,
    cv=10,
    scoring="accuracy",
    return_train_score=True,
)
clf.fit(X_train, y_train)
clf.cv_results_

Display the best combination of parameters

In [None]:
clf.best_params_

In [None]:
cm = confusion_matrix(y_test, clf.best_estimator_.predict(X_test))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
disp.plot()
plt.show()

In [None]:
# 15% of test observations misclassified
clf.best_estimator_.score(X_test, y_test)

### Evaluating with ROC curves

We compare the ROC curves of two SVMs with Gaussian (RBF) kernels that use different values of the hyper-parameter $\gamma$, on both the training and the test data. A larger $\gamma$ gives a more flexible model.

In [None]:
svm3 = SVC(C=1, kernel="rbf", gamma=2)
svm3.fit(X_train, y_train)

In [None]:
# More flexible model
svm4 = SVC(C=1, kernel="rbf", gamma=50)
svm4.fit(X_train, y_train)

In [None]:
y_train_score3 = svm3.decision_function(X_train)
y_train_score4 = svm4.decision_function(X_train)

false_pos_rate3, true_pos_rate3, _ = roc_curve(y_train, y_train_score3)
roc_auc3 = auc(false_pos_rate3, true_pos_rate3)

false_pos_rate4, true_pos_rate4, _ = roc_curve(y_train, y_train_score4)
roc_auc4 = auc(false_pos_rate4, true_pos_rate4)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
ax1.plot(
    false_pos_rate3,
    true_pos_rate3,
    label=r"SVM $\gamma = 2$ ROC curve (area = %0.2f)" % roc_auc3,
    color="b",
)
ax1.plot(
    false_pos_rate4,
    true_pos_rate4,
    label=r"SVM $\gamma = 50$ ROC curve (area = %0.2f)" % roc_auc4,
    color="r",
)
ax1.set_title("Training Data")

y_test_score3 = svm3.decision_function(X_test)
y_test_score4 = svm4.decision_function(X_test)

false_pos_rate3, true_pos_rate3, _ = roc_curve(y_test, y_test_score3)
roc_auc3 = auc(false_pos_rate3, true_pos_rate3)

false_pos_rate4, true_pos_rate4, _ = roc_curve(y_test, y_test_score4)
roc_auc4 = auc(false_pos_rate4, true_pos_rate4)

ax2.plot(
    false_pos_rate3,
    true_pos_rate3,
    label=r"SVM $\gamma = 2$ ROC curve (area = %0.2f)" % roc_auc3,
    color="b",
)
ax2.plot(
    false_pos_rate4,
    true_pos_rate4,
    label=r"SVM $\gamma = 50$ ROC curve (area = %0.2f)" % roc_auc4,
    color="r",
)
ax2.set_title("Test Data")

for ax in fig.axes:
    ax.plot([0, 1], [0, 1], "k--")
    ax.set_xlim([-0.05, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.legend(loc="lower right")

The more flexible model ($ \gamma = 50 $) scores better on the training data but worse on the test data, which suggests that it overfits the training data.
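
The same comparison can be summarised numerically over a wider range of $\gamma$ values. The sketch below is an illustration under stated assumptions: it generates its own synthetic data (similar in spirit to, but not identical with, the data above), uses a hypothetical grid of $\gamma$ values, and reports the area under the ROC curve with `roc_auc_score` instead of plotting the curves.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data (an assumption; not the data set used above)
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 2))
X_syn[:75] += 2
X_syn[75:150] -= 2
y_syn = np.concatenate([np.repeat(-1, 150), np.repeat(1, 50)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.5, random_state=0, stratify=y_syn
)

# Area under the ROC curve on training and test data for each gamma
auc_by_gamma = {}
for g in [0.1, 1, 2, 10, 50]:
    m = SVC(C=1, kernel="rbf", gamma=g).fit(X_tr, y_tr)
    auc_by_gamma[g] = (
        roc_auc_score(y_tr, m.decision_function(X_tr)),
        roc_auc_score(y_te, m.decision_function(X_te)),
    )
    print(f"gamma={g:>4}: train AUC={auc_by_gamma[g][0]:.3f}, test AUC={auc_by_gamma[g][1]:.3f}")
```

A growing gap between the training and test AUC as $\gamma$ increases is the typical signature of overfitting, although the exact numbers depend on the data and the random split.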

## SVMs with more than two classes

So far, our discussion has been limited to the case of binary classification: that is, classification in the two-class setting. How can we extend SVMs to the more general case where we have some arbitrary number of classes? It turns out that the concept of separating hyperplanes upon which SVMs are based does not lend itself naturally to more than two classes. Though a number of proposals for extending SVMs to the $K$-class case have been made, the two most popular are the one-versus-one and one-versus-all approaches.

### One-vs-one

Suppose that we would like to perform classification using SVMs, and there are $K > 2$ classes. A _one-versus-one_ or _all-pairs_ approach constructs $K(K-1)/2$ binary classifiers, one for each pair of classes. For example, if there are three classes, then we would construct three binary classifiers, one for each pair of classes. The first classifier would distinguish between class 1 and class 2, the second classifier would distinguish between class 1 and class 3, and the third classifier would distinguish between class 2 and class 3. To classify a new observation, we would apply each of the three classifiers, and assign the observation to the class that receives the most votes, that is, the class to which it was most frequently assigned in the pairwise classifications.
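
The voting procedure can be sketched by hand as follows. This is only an illustration, assuming toy three-class data from `make_blobs` and arbitrary kernel settings; it is not how `scikit-learn` implements the strategy internally.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy three-class data (an assumption for illustration)
X_blob, y_blob = make_blobs(n_samples=150, centers=3, random_state=0)
classes = np.unique(y_blob)
pairs = list(combinations(range(len(classes)), 2))  # K(K-1)/2 pairs of classes

votes = np.zeros((len(X_blob), len(classes)), dtype=int)
for a, b in pairs:
    # Train on the observations that belong to the two classes of this pair
    in_pair = np.isin(y_blob, [classes[a], classes[b]])
    clf_ab = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X_blob[in_pair], y_blob[in_pair])
    # Each pairwise classifier casts one vote per observation
    pred = clf_ab.predict(X_blob)
    votes[:, a] += pred == classes[a]
    votes[:, b] += pred == classes[b]

# Majority vote; argmax resolves ties in favour of the lowest class index
y_vote = classes[votes.argmax(axis=1)]

print("Number of pairwise classifiers:", len(pairs))
print("Training accuracy:", (y_vote == y_blob).mean())
```

Each pairwise classifier is trained on a subset of the data only, so the individual problems are smaller, but the number of classifiers grows quadratically with $K$.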


### One-vs-all

The one-versus-all approach (also called one-versus-rest) is an alternative procedure for applying SVMs in the case of $K > 2$ classes. We fit $K$ SVMs, each time comparing one of the $K$ classes to the remaining $K - 1$ classes. For example, if there are three classes, then we would fit three SVMs, one for each class. The first SVM would separate class 1 from classes 2 and 3, the second SVM would separate class 2 from classes 1 and 3, and the third SVM would separate class 3 from classes 1 and 2. To classify a new observation, we would apply each of the three classifiers, and assign the observation to the class whose classifier returns the largest decision function value, since a large value indicates high confidence that the observation belongs to that class rather than to any of the others.
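
A minimal sketch of this procedure is given below, again assuming toy three-class data from `make_blobs` and arbitrary kernel settings. The hand-written version is cross-checked against `OneVsRestClassifier`, the generic one-versus-rest wrapper in `scikit-learn`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Toy three-class data (an assumption for illustration)
X_blob, y_blob = make_blobs(n_samples=150, centers=3, random_state=0)
classes = np.unique(y_blob)

# One binary SVM per class: class k (label 1) against all other classes (label 0)
scores = np.column_stack(
    [
        SVC(kernel="rbf", gamma=1.0, C=1.0)
        .fit(X_blob, (y_blob == k).astype(int))
        .decision_function(X_blob)
        for k in classes
    ]
)

# Assign each observation to the class with the largest decision value
y_ova = classes[scores.argmax(axis=1)]

# Cross-check with scikit-learn's generic one-vs-rest wrapper
ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma=1.0, C=1.0)).fit(X_blob, y_blob)
agreement = (y_ova == ovr.predict(X_blob)).mean()
print("Agreement with OneVsRestClassifier:", agreement)
```

Only $K$ classifiers are needed here, but each one is trained on the full data set and typically faces imbalanced classes.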

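The following code shows how to train an SVM with more than two classes using `scikit-learn`. Internally, `SVC` always uses the "one vs one" strategy for multi-class classification. The `decision_function_shape` parameter only controls the output of `decision_function`: with `decision_function_shape='ovr'` (the default), the pairwise decision values are aggregated into one "one vs rest"-style score per class, whereas `'ovo'` returns one value per pair of classes.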

In [None]:
# Adding a third class of observations
np.random.seed(8)
XX = np.vstack([X, np.random.randn(50, 2)])
yy = np.hstack([y, np.repeat(0, 50)])
XX[yy == 0] = XX[yy == 0] + 4

plt.scatter(XX[:, 0], XX[:, 1], s=70, c=yy, cmap=plt.cm.prism)
plt.xlabel("XX1")
plt.ylabel("XX2");

In [None]:
svm5 = SVC(C=1, kernel="rbf")
svm5.fit(XX, yy)
plot_svc(svm5, XX, yy)
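
To see what `decision_function_shape` changes, the sketch below fits the same `SVC` with both settings and compares the outputs. The four-class toy data from `make_blobs` is an assumption for illustration; four classes are used so that the number of classes ($K = 4$) differs from the number of class pairs ($K(K-1)/2 = 6$).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy data with K = 4 classes (an assumption for illustration)
X4, y4 = make_blobs(n_samples=200, centers=4, random_state=0)

svc_ovo = SVC(kernel="rbf", decision_function_shape="ovo").fit(X4, y4)
svc_ovr = SVC(kernel="rbf", decision_function_shape="ovr").fit(X4, y4)

shape_ovo = svc_ovo.decision_function(X4).shape  # one column per pair of classes
shape_ovr = svc_ovr.decision_function(X4).shape  # one column per class
same_predictions = np.array_equal(svc_ovo.predict(X4), svc_ovr.predict(X4))

print(shape_ovo, shape_ovr, same_predictions)
```

With the default `break_ties=False`, the predicted labels are the same under both settings, because the underlying one-vs-one classifiers are identical and only the reported decision values differ in shape.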