# 7. Sparse Kernel Machines

### *Table of Contents*

* 7.1 [Maximum Margin Classifiers](#7.1-Maximum-Margin-Classifiers)

```python
import numpy as np
import matplotlib.pyplot as plt
```

One significant limitation of many kernel-based methods (see [Chapter 6](ch6_kernel_methods.ipynb)) is that the kernel function $k(\mathbf{x}_n, \mathbf{x}_m)$ must be evaluated for all possible pairs $\mathbf{x}_n$ and $\mathbf{x}_m$ of training data points. This can be computationally infeasible during training, since the number of pairs grows quadratically with the size of the dataset, and it also leads to excessive computation when making predictions for new data points.
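To make the cost concrete, here is a minimal sketch that builds the full Gram matrix for a small synthetic dataset. The Gaussian kernel, the helper name `gaussian_kernel`, the bandwidth `sigma`, and the random data are all assumptions chosen for illustration; any valid kernel would show the same $N \times N$ growth.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """Evaluate k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows."""
    sq_dists = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))  # N = 200 hypothetical training points
X_new = rng.normal(size=(5, 2))      # 5 hypothetical test points

K = gaussian_kernel(X_train, X_train)    # N x N kernel evaluations for training
k_new = gaussian_kernel(X_new, X_train)  # N evaluations per new point
print(K.shape, k_new.shape)              # (200, 200) (5, 200)
```

A non-sparse kernel method needs every entry of `K` during training and a full row of `k_new` for each prediction, which is the overhead a sparse method aims to reduce.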

On the other hand, there are kernel-based algorithms that yield *sparse* solutions, retaining only a subset of the training data points, so that predictions depend only on the kernel function evaluated at that subset. We shall look at the *support vector machine* (SVM), which is easy to train using convex optimization but does not provide posterior probabilities. An alternative sparse kernel technique, known as the *relevance vector machine* (RVM), is based on a Bayesian formulation and provides posterior probabilistic outputs. Additionally, the RVM typically yields much sparser solutions than the SVM, but it is slower to train.

## 7.1 Maximum Margin Classifiers

Consider the classification problem using linear models of the form,

$$
y(\mathbf{x}) = \mathbf{w}^T\boldsymbol\phi(\mathbf{x}) + b
$$

The training data set comprises $N$ input vectors $\mathbf{x}_1,\dots,\mathbf{x}_N$ and corresponding target values $t_1,\dots,t_N$, where $t_n\in\{-1, +1\}$.

> We assume for the moment that the training dataset is linearly separable in the feature space defined by $\boldsymbol\phi$, so that there exists at least one choice of the parameters $\mathbf{w}$ and $b$ such that $y(\mathbf{x}_n)>0$ for points having $t_n=+1$ and $y(\mathbf{x}_n)<0$ for points having $t_n=-1$. Equivalently, $t_ny(\mathbf{x}_n) > 0$ for all training data points.

The support vector machine approaches this problem through the concept of the margin, which is defined to be the smallest distance between the decision boundary and any of the training samples. The SVM chooses the decision boundary for which the margin is maximized. Recall that the perpendicular distance of a point $\mathbf{x}_n$ from the decision boundary, defined by $y(\mathbf{x}) = 0$, is given by $\frac{|y(\mathbf{x}_n)|}{||\mathbf{w}||_2}$. We are only interested in solutions for which all data points are correctly classified, so that $t_ny(\mathbf{x}_n) > 0$ for all $n$, and hence $|y(\mathbf{x}_n)| = t_ny(\mathbf{x}_n)$. Thus, the distance of a point $\mathbf{x}_n$ to the decision surface is given by,

$$
d_s(\mathbf{x}_n) = \frac{t_ny(\mathbf{x}_n)}{||\mathbf{w}||_2} = \frac{t_n(\mathbf{w}^T\boldsymbol\phi(\mathbf{x}_n) + b)}{||\mathbf{w}||_2}
$$
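As a quick numerical check of this formula, the sketch below evaluates $d_s(\mathbf{x}_n)$ on a tiny hand-made dataset. It assumes the identity feature map $\boldsymbol\phi(\mathbf{x}) = \mathbf{x}$, and the data, the parameters `w` and `b`, and the helper name `signed_distances` are hypothetical values picked only for illustration.

```python
import numpy as np

def signed_distances(X, t, w, b):
    """Return t_n * (w^T x_n + b) / ||w||_2 for every row x_n, assuming phi(x) = x."""
    y = X @ w + b
    return t * y / np.linalg.norm(w)

# Four hypothetical 2-D points with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
t = np.array([1, 1, -1, -1])

# A hand-picked boundary x_1 + x_2 = 0 (not an optimized one).
w = np.array([1.0, 1.0])
b = 0.0

d = signed_distances(X, t, w, b)
print(d)  # every entry is positive because all points are correctly classified
```

For a misclassified point the product $t_ny(\mathbf{x}_n)$ would be negative, so the sign of each entry also indicates whether the point lies on the correct side of the boundary.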

Thus, the margin is determined by the training data point $\mathbf{x}_n$ closest to the decision boundary. The goal of the SVM is to optimize the parameters $\mathbf{w}$ and $b$ so as to maximize the margin, that is, the distance to the closest point. Therefore, the maximum margin solution is found by solving,

$$
\underset{\mathbf{w}, b}{\arg\max} \Bigg\{\ \min_n d_s(\mathbf{x}_n) \Bigg\} \overset{(7.2)}{=} 
\underset{\mathbf{w}, b}{\arg\max} \Bigg\{\ \min_n\Bigg[ \frac{t_n(\mathbf{w}^T\boldsymbol\phi(\mathbf{x}_n) + b)}{||\mathbf{w}||_2} \Bigg] \Bigg\} =
\underset{\mathbf{w}, b}{\arg\max} \Bigg\{\ \frac{1}{||\mathbf{w}||_2} \min_n\Bigg[ t_n(\mathbf{w}^T\boldsymbol\phi(\mathbf{x}_n) + b) \Bigg] \Bigg\}
$$
$$
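The objective can be explored numerically by evaluating the margin for a few candidate boundaries and keeping the best one. The sketch below is only a comparison of hand-picked candidates, not an optimizer; it assumes $\boldsymbol\phi(\mathbf{x}) = \mathbf{x}$, and the data, the candidate parameters, and the helper name `margin` are hypothetical. The factor $1/||\mathbf{w}||_2$ is taken outside the minimum because it does not depend on $n$.

```python
import numpy as np

def margin(X, t, w, b):
    """(1 / ||w||_2) * min_n t_n (w^T x_n + b), assuming phi(x) = x."""
    return np.min(t * (X @ w + b)) / np.linalg.norm(w)

# Four hypothetical 2-D points with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
t = np.array([1, 1, -1, -1])

# Hand-picked candidate boundaries (w, b); each one separates the data.
candidates = {
    "A": (np.array([1.0, 0.0]), 0.0),
    "B": (np.array([1.0, 1.0]), 0.0),
    "C": (np.array([1.0, 1.0]), -1.0),
}

margins = {name: margin(X, t, w, b) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, best)

# Rescaling (w, b) by a positive constant leaves the margin unchanged.
w, b = candidates[best]
print(np.isclose(margin(X, t, 10.0 * w, 10.0 * b), margin(X, t, w, b)))
```

Among these three candidates, "C" has the largest margin. The last line illustrates that the objective depends only on the direction of $\mathbf{w}$ and the relative offset $b$, not on their overall scale.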

The figure on the left depicts the margin as the distance between the decision boundary and the closest data point. On the right, the margin is maximized, leading to a particular choice of the decision boundary (determined by the parameters $\mathbf{w}$ and $b$). The subset of data points that determine the location of the optimized boundary are called *support vectors*.

<img src="../images/fg7_1a.png" width="400">
<img src="../images/fg7_1b.png" width="400"> 

A direct solution of the above optimization problem is very complex, but there is an equivalent problem that is much easier to solve.