# Lecture 22: Kernel-based Learning

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import sympy as sy
import utils as utils
from fractions import Fraction

from IPython.display import display, HTML

# Inline plotting
%matplotlib inline

# Make sympy print pretty math expressions
sy.init_printing()

utils.load_custom_styles()

## Support Vector Machine

Suppose we have a training set $\mathbf{T} = \{ (\mathbf{x_1}, l_1), (\mathbf{x_2}, l_2), \cdots,  (\mathbf{x_N}, l_N) \}$ where each label is binary $l_i \in \{ -1, 1 \}$. We want to optimise the parameters of a decision function $g(\cdot)$ to define a discriminant hyperplane that separates the two classes.

SVM assumes that the samples are transformed by some function $\phi(\cdot)$ to a new feature space $\mathcal{F}$:

<img src="figures/lecture-22/feature-transformation.png" width="600" />


For a linear decision function, we can use $\phi(\mathbf{x})=\mathbf{x}$, in which case $\mathcal{F} = \mathbb{R}^D$. The advantage of writing the problem in terms of $\phi(\cdot)$ is that other choices of $\phi(\cdot)$ lead to **nonlinear decision functions**. We could come up with a mapping that transforms our samples from a feature space where they are not linearly separable to another feature space where they become linearly separable:

<img src="figures/lecture-22/high-dim-map.png" width="600" />
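As a small illustration of this idea, the sketch below uses a hypothetical feature map $\phi(\mathbf{x}) = (x_1, x_2, x_1^2 + x_2^2)$, which is our own choice and not necessarily the mapping shown in the figure. Two concentric rings cannot be separated by a line in $\mathbb{R}^2$, but after the mapping a plane in $\mathbb{R}^3$ separates them.

```python
import numpy as np

def phi(X):
    """Hypothetical feature map: append the squared radius as a third feature."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([X, np.sum(X**2, axis=1)])

# Inner class: 8 points on a circle of radius 1; outer class: radius 3.
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
inner = np.column_stack([np.cos(angles), np.sin(angles)])
outer = 3.0 * inner
X = np.vstack([inner, outer])
labels = np.array([-1] * 8 + [1] * 8)

# In the mapped space the plane "third feature = 5" separates the classes:
# w = (0, 0, 1) and b = -5 give g(x) = ||x||^2 - 5.
w = np.array([0.0, 0.0, 1.0])
b = -5.0
scores = phi(X) @ w + b
predictions = np.sign(scores).astype(int)

print(np.array_equal(predictions, labels))
```

The decision function is linear in the mapped features but corresponds to a circle of radius $\sqrt{5}$ in the original space.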















<img src="figures/lecture-22/svm-decision-function.png" width="600" />







<div class="sidenote">
The bias term expresses the displacement of the hyperplane from the origin in <strong><em>one</em></strong> of the axes. For example, in the two-dimensional case, $b$ could express the displacement on the $x$-axis or the $y$-axis. It does not really matter. What matters is that the definition is consistent throughout the problem.
</div>

Given an optimal $\mathbf{w}$ that represents the normal to the hyperplane that separates all the samples into two classes, the following equation would be true:

<img src="figures/lecture-22/separating-hyperplane.png" width="600" />


Using the decision function above and assuming that the training set is linearly separable, there are many different hyperplanes to choose from:

<img src="figures/lecture-22/multiple-decision-functions.png" width="450" />






















It makes sense to choose a hyperplane that separates the two clouds as clearly as possible. This is done by selecting the hyperplane of
maximum margin, that is, the one with the largest possible distance to the nearest point of either class. This will make our classifier more robust.

A decision function that separates the two classes with a margin $q$ can be defined as:


<img src="figures/lecture-22/decision-function-with-margin.png" width="600" />


The margin $q$ expresses the minimal distance between the decision hyperplane and the closest training samples.

<img src="figures/lecture-22/illustration-margin.png" width="300" />
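To make the notion of margin concrete, here is a minimal sketch that measures it for two candidate hyperplanes on a toy data set. It assumes the usual geometric definition, the smallest value of $l_i(\mathbf{w}^T \phi_i + b) / \|\mathbf{w}\|$ over the training samples; the function name `geometric_margin` and the data are ours, and the normalisation used in the figures may differ.

```python
import numpy as np

def geometric_margin(w, b, Phi, labels):
    """Smallest signed distance l_i (w^T phi_i + b) / ||w|| over all samples.

    A positive value means every sample is on the correct side of the hyperplane.
    """
    w = np.asarray(w, dtype=float)
    signed_distances = labels * (Phi @ w + b) / np.linalg.norm(w)
    return signed_distances.min()

# Toy, linearly separable data (one sample per row).
Phi = np.array([[2.0, 0.0], [3.0, 1.0], [-1.0, 0.0], [-2.0, -1.0]])
labels = np.array([1, 1, -1, -1])

# Two hyperplanes that both separate the data.
margin_a = geometric_margin([1.0, 0.0], -0.5, Phi, labels)  # vertical line at 0.5
margin_b = geometric_margin([1.0, 0.0], -1.5, Phi, labels)  # vertical line at 1.5

print(margin_a, margin_b)
```

Both hyperplanes classify every sample correctly, but the first one sits midway between the two classes and therefore has the larger margin.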

















For the samples closest to the hyperplane (the support vectors) $\phi_m$, we have:

<img src="figures/lecture-22/support-vector-distance.png" width="600" />




How do we maximise the margin while classifying all the samples correctly?

<img src="figures/lecture-22/maximise-margin.png" width="600" />














Based on the ideas above, the Support Vector Machine (SVM) optimization problem maximises the margin while correctly classifying as many training samples
as possible. The soft-margin SVM objective function can be expressed as:

<img src="figures/lecture-22/soft-svm-objective-function.png" width="600" />









The **slack variables** $\xi_i$ are introduced to allow samples to violate the margin. The constant $C$ is a hyperparameter that penalises violations and controls how much training error is allowed. Eq. 5.7 expresses an optimisation problem that balances:
- the minimisation of the norm of $\mathbf{w}$, which corresponds to the maximisation of the margin;
- the minimisation of the non-negative slack variables, which corresponds to the minimisation of the training error.

<img src="figures/lecture-22/soft-margin-svm.png" width="350" />
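The trade-off controlled by $C$ can be sketched numerically. The code below assumes the common form of the soft-margin objective, $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i$ with constraints $l_i(\mathbf{w}^T \phi_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$; the exact scaling in Eq. 5.7 may differ, and the function name and data are hypothetical.

```python
import numpy as np

def soft_margin_objective(w, b, Phi, labels, C):
    """Evaluate 0.5 * ||w||^2 + C * sum(xi) for a given hyperplane (assumed form)."""
    w = np.asarray(w, dtype=float)
    # Smallest slack that satisfies l_i (w^T phi_i + b) >= 1 - xi_i with xi_i >= 0.
    slack = np.maximum(0.0, 1.0 - labels * (Phi @ w + b))
    return 0.5 * w @ w + C * slack.sum(), slack

Phi = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [0.25, 0.0]])
labels = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([1.0, 0.0]), 0.0

objective_small_C, slack = soft_margin_objective(w, b, Phi, labels, C=1.0)
objective_large_C, _ = soft_margin_objective(w, b, Phi, labels, C=10.0)

print(slack)              # per-sample margin violations
print(objective_small_C)  # violations are cheap
print(objective_large_C)  # violations dominate the objective
```

Here the second sample lies inside the margin (slack between 0 and 1) and the fourth is misclassified (slack above 1). With a larger $C$, the same violations cost more, so the optimiser would prefer a hyperplane with fewer of them even at the price of a smaller margin.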














---
### Solving SVM Objective Function

<img src="figures/lecture-22/svm_lagrangian.png" width="600" />






<img src="figures/lecture-22/saddle-points-of-lagragian.png" width="600" />











<img src="figures/lecture-22/transform-convex-problem.png" width="600" />








#### Quadratic Problem Formulation

<img src="figures/lecture-22/convex-matrix-form.png" width="600" />












The optimization problem in Eq. 5.15 is a quadratic optimization problem which, when $\mathbf{K}$ is positive semi-definite, is convex and therefore has
a global optimum. For $\phi(\mathbf{x}) = \mathbf{x}$, it can be shown that $\mathbf{K}$ is always positive semi-definite.
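The positive semi-definiteness for $\phi(\mathbf{x}) = \mathbf{x}$ follows from $\mathbf{z}^T \mathbf{K} \mathbf{z} = \|\sum_i z_i \mathbf{x}_i\|^2 \geq 0$ for any $\mathbf{z}$. A quick numerical check on random data (the data and the tolerance are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))   # N = 6 samples (rows), D = 2 features
K = X @ X.T                   # K_ij = x_i^T x_j, shape (N, N)

# eigvalsh is intended for symmetric matrices and returns real eigenvalues.
eigenvalues = np.linalg.eigvalsh(K)
is_psd = bool(np.all(eigenvalues >= -1e-10))  # small tolerance for round-off

# The quadratic form equals a squared norm, hence it cannot be negative.
z = rng.normal(size=6)
quad_form = z @ K @ z
squared_norm = np.linalg.norm(X.T @ z) ** 2

print(is_psd, np.isclose(quad_form, squared_norm))
```

Since $N > D$ here, $\mathbf{K}$ has rank at most $D$ and some eigenvalues are numerically zero, which is why the check uses a tolerance rather than a strict comparison with zero.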

#### How to compute w?

<img src="figures/lecture-22/compute-w.png" width="600" />









#### How to compute b?

<img src="figures/lecture-22/compute-b.png" width="600" />
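The two steps above can be sketched on a two-sample toy problem. We assume the standard relations $\mathbf{w} = \sum_i \alpha_i l_i \phi_i$ and, for any support vector $\phi_m$, $b = l_m - \mathbf{w}^T \phi_m$; the figures may write these in a slightly different but equivalent form. The values of `alpha` are the dual optimum for this particular problem, worked out by hand rather than by a solver.

```python
import numpy as np

# Two training samples, one per class, with phi(x) = x.
Phi = np.array([[1.0, 0.0], [-1.0, 0.0]])
labels = np.array([1.0, -1.0])

# Dual optimum for this toy problem (derived by hand, not computed here).
alpha = np.array([0.5, 0.5])

# w is a weighted sum of the training samples.
w = (alpha * labels) @ Phi

# Support vectors are the samples with non-zero alpha. On the margin,
# l_m (w^T phi_m + b) = 1, and since l_m is +1 or -1 this gives b = l_m - w^T phi_m.
support = alpha > 1e-8
b = np.mean(labels[support] - Phi[support] @ w)

functional_margins = labels * (Phi @ w + b)
print(w, b, functional_margins)
```

Averaging over all support vectors, as done here, is a common way to reduce numerical noise in $b$. In the soft-margin case, only support vectors lying exactly on the margin (those with $0 < \alpha_i < C$) should be used for this step.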














---
## Kernel Functions

Until now, we have considered only the case where $\phi(\mathbf{x}) = \mathbf{x}$. However, as can be seen from Eq. 5.15, the solution of the SVM optimization problem does not depend on the explicit form of $\phi(\cdot)$, only on the kernel matrix. That is, as long as the kernel matrix $\mathbf{K} \in \mathbb{R}^{N\times N}$
is a positive semi-definite matrix, an optimal solution can be found. 

Since the elements of $\mathbf{K}$ are defined on pairs of samples, i.e. $K_{ij}= \phi_i^T \phi_j$, we can define any function $\kappa(\cdot, \cdot)$ involving sample pairs, as long as the corresponding $\mathbf{K}$ is positive semi-definite. 

Such functions are usually called kernel functions. Two popular kernel functions are the Gaussian (radial basis function, RBF) kernel and the polynomial kernel:

<img src="figures/lecture-22/kernel-functions.png" width="600" />




Eq. 5.19 is the so-called RBF kernel function with parameter $\gamma > 0$, while Eq. 5.20 is the so-called polynomial kernel function with parameter $d > 0$.
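A minimal NumPy sketch of both kernels follows. It assumes the common forms $\kappa(\mathbf{x}, \mathbf{y}) = \exp(-\gamma \|\mathbf{x} - \mathbf{y}\|^2)$ and $\kappa(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T \mathbf{y} + c)^d$; the exact parametrisation in Eqs. 5.19 and 5.20 may differ, and the offset `c` is our own assumption.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """K_ij = exp(-gamma * ||x_i - y_j||^2) for rows x_i of X and y_j of Y."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def polynomial_kernel(X, Y, d=2, c=1.0):
    """K_ij = (x_i^T y_j + c)^d; the offset c is an assumed parameter."""
    return (X @ Y.T + c) ** d

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [-1.0, 1.5], [2.0, -1.0]])
K_rbf = rbf_kernel(X, X, gamma=0.5)
K_poly = polynomial_kernel(X, X, d=2)

# Both kernel matrices should be symmetric with non-negative eigenvalues.
min_eig_rbf = np.linalg.eigvalsh(K_rbf).min()
min_eig_poly = np.linalg.eigvalsh(K_poly).min()
print(min_eig_rbf, min_eig_poly)
```

Note that neither function ever constructs $\phi(\mathbf{x})$ explicitly: only pairwise evaluations of $\kappa(\cdot, \cdot)$ are needed to build $\mathbf{K}$.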

---
## Kernels and SVM

<img src="figures/lecture-22/kernel-svm.png" width="600" />
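As a hedged illustration, the sketch below evaluates the usual kernelised decision function $g(\mathbf{x}) = \sum_i \alpha_i l_i \kappa(\mathbf{x}_i, \mathbf{x}) + b$ on a two-sample problem with an RBF kernel. This form is assumed rather than copied from the figure. For two samples with opposite labels, the dual has the closed-form solution $\alpha_1 = \alpha_2 = 1 / (1 - \kappa(\mathbf{x}_1, \mathbf{x}_2))$ with $b = 0$, which is what the code uses instead of a solver.

```python
import numpy as np

def rbf(x, y, gamma):
    """RBF kernel between two single samples (assumed form exp(-gamma * ||x - y||^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-gamma * diff @ diff)

def decision(x, X_train, labels, alpha, b, gamma):
    """g(x) = sum_i alpha_i * l_i * kappa(x_i, x) + b."""
    k = np.array([rbf(x_i, x, gamma) for x_i in X_train])
    return float(np.sum(alpha * labels * k) + b)

X_train = np.array([[1.0, 0.0], [-1.0, 0.0]])
labels = np.array([1.0, -1.0])
gamma = 0.5

# Closed-form dual solution for this symmetric two-sample problem.
k12 = rbf(X_train[0], X_train[1], gamma)
alpha = np.full(2, 1.0 / (1.0 - k12))
b = 0.0

g_pos = decision(X_train[0], X_train, labels, alpha, b, gamma)
g_neg = decision(X_train[1], X_train, labels, alpha, b, gamma)
g_mid = decision([0.0, 0.0], X_train, labels, alpha, b, gamma)
print(g_pos, g_neg, g_mid)
```

Both training samples sit exactly on the margin ($g = \pm 1$), and the midpoint between them lies on the decision boundary ($g = 0$). The weight vector $\mathbf{w}$ is never formed, since it lives in the feature space induced by the kernel.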












---
## Kernel Least Mean Squares Regression

<img src="figures/lecture-22/feature-transformation.png" width="600" />












