**CS596 - Machine Learning**
<br>
Date: **17 November 2020**


Title: **Lecture 10**
<br>
Speaker: **Dr. Shota Tsiskaridze**
<br>
Teaching Assistant: **Levan Sanadiradze**

Sources:
Bibliography: 
<br>[1] **Chapter 6-7**. Christopher M. Bishop, *Pattern Recognition and Machine Learning*, Springer, 2006.

<h1 align="center">Kernel Methods</h1>

<h3 align="center">Kernel Function</h3>

- For models which are based on a fixed nonlinear **feature space** mapping $\phi (\mathbf{x})$ the kernel function is given
by the relation:

  $$k(\mathbf{x}, \mathbf{x}') = \phi (\mathbf{x})^T \phi (\mathbf{x}').$$

  From this definition, we see that the **kernel** is a **symmetric function** of its arguments.

- Examples of kernel functions:
  - **stationary** kernel: $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}')$;
  - **homogeneous** or **radial basis functions** kernel: $k(\mathbf{x}, \mathbf{x}') = k(\|\mathbf{x} - \mathbf{x}'\|)$;
  
- Many **linear parametric models** can be **re-cast** into an equivalent **dual representation** in which the predictions are based on linear combinations of a **kernel function** evaluated at the **training data points**.
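As a small illustration of the definition $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T \phi(\mathbf{x}')$, the sketch below compares a kernel evaluated directly with the inner product of an explicit feature map. The quadratic kernel $(\mathbf{x}^T\mathbf{z})^2$ in two dimensions and the function names are choices made for this example only, not part of the lecture.

```python
import numpy as np

def quadratic_kernel(x, z):
    """Kernel k(x, z) = (x^T z)^2, evaluated without any feature map."""
    return float(np.dot(x, z)) ** 2

def quadratic_features(x):
    """Explicit feature map for 2-D inputs: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Both routes should give the same number
k_direct = quadratic_kernel(x, z)
k_feature = float(quadratic_features(x) @ quadratic_features(z))

print(k_direct, k_feature)
```

Swapping the arguments leaves both values unchanged, which is the symmetry property noted above.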

<h3 align="center">Dual Representations</h3>

- Let's consider a linear regression model whose parameters are determined by minimizing a regularized sum-of-squares error function given by:

  $$J(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ \mathbf{w^T\phi(x_n)}  - t_n \}^2 + \frac{\lambda}{2} \mathbf{w^Tw},$$
  
  where $\lambda \geq 0$.
  
  If we set the gradient of $J(\mathbf{w})$ with respect to $\mathbf{w}$ equal to zero, we see that the solution for  $\mathbf{w}$ takes the form of a linear combination of the vectors $\phi(\mathbf{x_n})$, with coefficients that are functions of $\mathbf{w}$, of the form:
  
  $$\mathbf{w} = -\frac{1}{\lambda} \sum_{n=1}^{N} \{ \mathbf{w^T\phi(x_n)}  - t_n \} \phi(\mathbf{x_n})  = 
  \sum_{n=1}^{N} a_n \phi(\mathbf{x_n}) = \mathbf{\Phi^T a},$$
  
  where $\mathbf{\Phi}$ is the design matrix, whose $n$-th row is given by $\mathbf{\phi(x_n)^T}$.
  
  Here the vector $\mathbf{a} = (a_1, ..., a_N)^\mathbf{T}$ is defined as:
  
  $$a_n = -\frac{1}{\lambda}\{ \mathbf{w^T\phi(x_n)}  - t_n \}.$$
  
  Therefore, **instead of working** with the parameter vector $\mathbf{w}$, we can now **reformulate the least-squares algorithm** in **terms of** the parameter vector $\mathbf{a}$, giving rise to a **dual representation**:
  
  $$J(\mathbf{a}) = \frac{1}{2} \mathbf{ a^T \Phi \Phi^T \Phi \Phi^T a - a^T \Phi \Phi^T t +\frac{1}{2} t^T t + \frac{\lambda}{2} a^T \Phi \Phi^T a},$$
  
  where $\mathbf{t} = (t_1, ..., t_N)^{\mathbf{T}}.$
  
  We now define the **Gram matrix** $\mathbf{K} = \mathbf{\Phi \Phi^T}$, which is an $N \times N$ symmetric matrix with elements:
  
  $$K_{nm} = \phi (\mathbf{x}_n)^{\mathbf{T}} \phi (\mathbf{x}_m) = k(\mathbf{x}_n, \mathbf{x}_m),$$
  
  where we have introduced the **kernel function** $k(\mathbf{x}_n, \mathbf{x}_m)$.
  
  In terms of the **Gram matrix**, the sum-of-squares error function can be written as:
  
  $$J(\mathbf{a}) = \frac{1}{2} \mathbf{ a^T K K a  - a^T K t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T K a}.$$
  
  Setting the gradient of $J(\mathbf{a})$ with respect to $\mathbf{a}$ to zero, we obtain the following solution:
  
  $$\mathbf{a = (K + \lambda I_N)^{-1} t}.$$
  
  If we substitute this back into the linear regression model, we obtain the following prediction for a new input $\mathbf{x}$:
  
  $$y(\mathbf{x}) = \mathbf{w^T \phi(x) = a^T \Phi \phi(x) = k(x)^T (K + \lambda I_N)^{-1} t},$$
  
  where we have defined the vector $\mathbf{k(x)}$ with elements $k_n (\mathbf{x}) = k(\mathbf{x}_n, \mathbf{x}).$
  
  This is known as a **dual formulation**.
  
  
- In the **dual formulation**, we determine the parameter vector $\mathbf{a}$ by inverting an $N \times N$ matrix, whereas in the **original parameter space formulation** we had to invert an $M \times M$ matrix in order to determine $\mathbf{w}$. 

- Because $N$ is typically **much larger** than $M$, the **dual formulation does not seem to be particularly useful**.
  
- However, the **advantage of the dual formulation** is that it is **expressed entirely in terms of the kernel function** $k(\mathbf{x}, \mathbf{x}')$. We can therefore **work directly in terms of kernels** and **avoid the explicit introduction of the feature vector** $\phi(\mathbf{x})$, which allows us to implicitly use **feature spaces of high dimensionality** (even infinite).
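The equivalence of the two formulations can be checked numerically. The following is a minimal sketch under stated assumptions: the synthetic data, the polynomial basis $\phi_j(x) = x^j$, and the values of `N`, `M` and `lam` are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 20, 4, 0.1

# Hypothetical data and polynomial basis functions phi_j(x) = x^j, j = 0..M-1
x = rng.uniform(-1.0, 1.0, size=N)
t = np.sin(np.pi * x) + 0.1 * rng.standard_normal(N)
Phi = np.vander(x, M, increasing=True)      # design matrix, shape (N, M)

# Primal solution: solve an M x M linear system for w
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Dual solution: solve an N x N system built from the Gram matrix K
K = Phi @ Phi.T
a = np.linalg.solve(K + lam * np.eye(N), t)
w_dual = Phi.T @ a                          # w = Phi^T a

# Prediction at a new input, using only kernel evaluations in the dual case
x_new = 0.3
phi_new = np.vander([x_new], M, increasing=True)[0]
k_new = Phi @ phi_new                       # k_n(x) = k(x_n, x)
y_dual = float(k_new @ a)
y_primal = float(w_primal @ phi_new)

print(np.allclose(w_primal, w_dual), y_primal, y_dual)
```

Here the feature vectors are still formed explicitly so that both routes can be compared. In a purely kernel-based implementation `K` and `k_new` would be filled by calling $k(\mathbf{x}_n, \mathbf{x}_m)$ directly.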

<h3 align="center">Constructing Kernels</h3>

- One approach to construct valid kernel functions is to choose a feature space mapping $\phi(x)$ and then use this to find the corresponding kernel.


- The kernel function is defined for a one-dimensional input space by:

  $$k(x, x') = \phi(x)^T \phi(x') = \sum_{i=1}^{M}\phi_i(x)\phi_i(x'),$$
  
  where $\phi_i$ are the basis functions.
 
- The construction of kernel functions starting from a corresponding set of basis functions is shown below.

  In each column the lower plot shows the kernel function $k(x, x')$ plotted as a function of $x$ for $x' = 0$.
 
  <img src="images/L10_Kernel_Function.png" width="1800" alt="Example" />


- An alternative approach is to construct kernel functions directly.


- In this case, we must ensure that the function we choose is a **valid kernel**, in other words that it corresponds to a scalar product in some (perhaps infinite dimensional) feature space.


- Given **valid kernels** $k_1(\mathbf{x},\mathbf{x}')$ and $k_2(\mathbf{x},\mathbf{x}')$, the following new kernels **will also be valid**:

$$
\begin{matrix}
k(\mathbf{x},\mathbf{x}') & = & c k_1(\mathbf{x},\mathbf{x}')& \\ 
k(\mathbf{x},\mathbf{x}') & = & f(\mathbf{x}) k_1(\mathbf{x},\mathbf{x}') f(\mathbf{x}')& \\
k(\mathbf{x},\mathbf{x}') & = & q(k_1(\mathbf{x},\mathbf{x}'))& \\ 
k(\mathbf{x},\mathbf{x}') & = & \mathrm{exp}(k_1(\mathbf{x},\mathbf{x}'))& \\ 
k(\mathbf{x},\mathbf{x}') & = & k_1(\mathbf{x},\mathbf{x}') + k_2(\mathbf{x},\mathbf{x}')& \\ 
k(\mathbf{x},\mathbf{x}') & = & k_1(\mathbf{x},\mathbf{x}') \cdot k_2(\mathbf{x},\mathbf{x}')& \\ 
k(\mathbf{x},\mathbf{x}') & = & k_3( \phi(\mathbf{x}), \phi(\mathbf{x}'))& \\ 
k(\mathbf{x},\mathbf{x}') & = & \mathbf{ x^T A x'}& \\ 
k(\mathbf{x},\mathbf{x}') & = & k_a(\mathbf{x}_a,\mathbf{x}_a') + k_b(\mathbf{x}_b,\mathbf{x}_b')& \\ 
k(\mathbf{x},\mathbf{x}') & = &  k_a(\mathbf{x}_a,\mathbf{x}_a') \cdot k_b(\mathbf{x}_b,\mathbf{x}_b')&
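\end{matrix}
$$

Here $c > 0$ is a constant, $f(\cdot)$ is any function, $q(\cdot)$ is a polynomial with nonnegative coefficients, $\phi(\mathbf{x})$ is a function from $\mathbf{x}$ to $\mathbb{R}^M$, $k_3(\cdot, \cdot)$ is a valid kernel in $\mathbb{R}^M$, $\mathbf{A}$ is a symmetric positive semidefinite matrix, $\mathbf{x}_a$ and $\mathbf{x}_b$ are variables (not necessarily disjoint) with $\mathbf{x} = (\mathbf{x}_a, \mathbf{x}_b)$, and $k_a$ and $k_b$ are valid kernel functions over their respective spaces.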
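A valid kernel must produce a positive semidefinite Gram matrix for every set of inputs. The sketch below checks this on one random sample for a few kernels built with the rules above. Passing the check on a single sample is only a necessary condition, not a proof of validity. The helper names, the constants and the sample points are assumptions made for this example.

```python
import numpy as np

def gram(kernel, X):
    """Gram matrix with entries K[n, m] = kernel(X[n], X[m])."""
    return np.array([[kernel(xn, xm) for xm in X] for xn in X])

def min_eigenvalue(K):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(K).min())

def k1(x, z):
    """Linear kernel."""
    return float(x @ z)

def k2(x, z):
    """Gaussian kernel with unit width."""
    return float(np.exp(-0.5 * np.sum((x - z) ** 2)))

# Kernels built with the construction rules (constants chosen arbitrarily)
constructed = {
    "c * k1": lambda x, z: 2.0 * k1(x, z),
    "q(k1)": lambda x, z: k1(x, z) ** 2 + 3.0 * k1(x, z) + 1.0,
    "exp(k1)": lambda x, z: float(np.exp(k1(x, z))),
    "k1 + k2": lambda x, z: k1(x, z) + k2(x, z),
    "k1 * k2": lambda x, z: k1(x, z) * k2(x, z),
}

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(15, 2))    # hypothetical sample points

# Smallest eigenvalues should be non-negative up to rounding error
min_eigs = {name: min_eigenvalue(gram(k, X)) for name, k in constructed.items()}

# Counter-example: the negative squared distance is not a valid kernel,
# because its Gram matrix has zero trace and so a negative eigenvalue
def not_a_kernel(x, z):
    return -float(np.sum((x - z) ** 2))

min_eig_invalid = min_eigenvalue(gram(not_a_kernel, X))

for name, value in min_eigs.items():
    print(f"{name:8s} min eigenvalue = {value:.2e}")
print(f"invalid  min eigenvalue = {min_eig_invalid:.2e}")
```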



<h3 align="center">Support Vector Machines</h3>

- Let's consider a two-class classification problem using linear models of the form:

  $$ y (\mathbf{x}) = \mathbf{w^T \phi (x)} + b,$$
  
  where $\mathbf{\phi(x)}$ denotes a fixed feature-space transformation, and we have made the bias parameter $b$ explicit.
  
  The training data set comprises $N$ input vectors $\mathbf{x}_1, ..., \mathbf{x}_N$ with corresponding target values $t_1, ..., t_N$ where $t_n \in \{ -1, 1\}$, and new data points $\mathbf{x}$ are classified according to the sign of $y(\mathbf{x})$.
  
  
- Let's assume for the moment that the training data set is **linearly separable** in **feature space**, so that **by definition** there **exists at least one choice of the parameters** $\mathbf{w}$ and $b$ such that $y(\mathbf{x}_n)> 0 $ for points having $t_n = +1$ and $y(\mathbf{x}_n)< 0 $ for points having $t_n = -1$, so that $t_n y(\mathbf{x}_n)> 0 $ for all training data points.


- The perceptron algorithm is guaranteed to find a solution in a finite number of steps. However, the solution will depend on the (arbitrary) initial values chosen for $\mathbf{w}$ and $b$ as well as on the **order in which the data points are presented**.


- If there are **multiple solutions** all of which classify the training data set exactly, then we should **try to find the one** that will give the **smallest generalization error**.


- The **Support Vector Machine (SVM)** approaches this problem through the concept of the **margin**, which is defined to be the **smallest distance between the decision boundary and any of the samples**.

  <img src="images/L10_SVM.png" width="1800" alt="Example" />
  
  
- Recall that the perpendicular distance of a point $\mathbf{x}$ from a hyperplane defined by $y(\mathbf{x}) = 0$ is given by:

  $$\frac{|y(\mathbf{x})|}{ \|\mathbf{w}\|}.$$
  
  Thus the distance of a point $\mathbf{x}_n$ to the decision surface is given by:
  
  $$\frac{t_n y(\mathbf{x}_n)}{ \| \mathbf{w}\|} = \frac{t_n (\mathbf{w^T} \phi(\mathbf{x}_n) +b)}{ \| \mathbf{w}\|}.$$
  
  The **margin** is given by the **perpendicular distance** to the closest point $\mathbf{x}_n$ from the data set, and we wish to optimize the parameters $\mathbf{w}$ and $b$ in order to maximize this distance.
  
  Thus the **maximum margin solution** is found by solving:
  
  $$ \arg\max_{\mathbf{w}, b} \left \{  \frac{1}{\| \mathbf{w}\|} \min_{n} \left [t_n (\mathbf{w^T} \phi(\mathbf{x}_n) + b) \right ] \right\},$$
  
  where we have taken the factor $\frac{1}{\| \mathbf{w}\|}$ outside because $\mathbf{w}$ does not depend on $n$.
  
  
- If we rescale $\mathbf{w} \rightarrow \kappa \mathbf{w}$ and $b \rightarrow \kappa b$, the distance from any point $\mathbf{x}_n$ to the **decision surface** is unchanged. We can use this freedom to set:

  $$t_n (\mathbf{w^T} \phi(\mathbf{x}_n) + b) = 1,$$
  
  for the point that is closest to the surface. In this case, all data points will satisfy the constraints:
  
  $$t_n (\mathbf{w^T} \phi(\mathbf{x}_n) + b) \geq 1,$$
  
  for $n = 1, ..., N$. 
  
  This is known as the **canonical representation** of the **decision hyperplane**.
  
- In the case of data points for which the equality holds, the constraints are said to be **active**, whereas for the remainder they are said to be **inactive**.


- By definition, there will always be at **least one active** constraint, because **there will always be a closest point**.
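The rescaling argument can be made concrete with a few lines of NumPy. This is a sketch under stated assumptions: the feature map is taken to be the identity, $\phi(\mathbf{x}) = \mathbf{x}$, the four data points and the initial hyperplane are invented, and the function names are hypothetical. It assumes the hyperplane already separates the data, so that $t_n y(\mathbf{x}_n) > 0$ for all $n$.

```python
import numpy as np

def margin(w, b, X, t):
    """Smallest distance t_n (w^T x_n + b) / ||w|| over the data set."""
    return float(np.min(t * (X @ w + b)) / np.linalg.norm(w))

def to_canonical(w, b, X, t):
    """Rescale (w, b) so the closest point satisfies t_n (w^T x_n + b) = 1."""
    kappa = 1.0 / np.min(t * (X @ w + b))
    return kappa * w, kappa * b

# Hypothetical linearly separable data and a separating hyperplane
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([3.0, 3.0]), -1.5

w_c, b_c = to_canonical(w, b, X, t)
constraints = t * (X @ w_c + b_c)           # all >= 1, equality when active
active = np.isclose(constraints, 1.0)

print(margin(w, b, X, t), margin(w_c, b_c, X, t), 1.0 / np.linalg.norm(w_c))
print(constraints, active)
```

The margin is the same before and after rescaling, and in the canonical representation it equals $1 / \|\mathbf{w}\|$, which is why maximizing the margin reduces to a statement about $\|\mathbf{w}\|$ alone.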


- The optimization problem then simply requires that we maximize $\| \mathbf{w}\|^{-1}$, which is equivalent to minimizing $\| \mathbf{w}\|^2$. Therefore, we need to solve the following optimization problem, subject to the constraints above:

  $$\arg\min_{\mathbf{w}, b} \frac{1}{2} \| \mathbf{w}\| ^ 2.$$
  
  This is an example of a quadratic programming problem in which we are trying to minimize a quadratic function subject to a set of linear inequality constraints.
  
  
- In order to solve this constrained optimization problem, we introduce Lagrange multipliers $a_n \geq 0$, with one multiplier $a_n$ for each of the constraints, giving the Lagrangian function:

  $$L(\mathbf{w}, b, \mathbf{a})  = \frac{1}{2} \| \mathbf{w}\|^2  - \sum_{n=1}^{N} a_n \{ t_n (\mathbf{w^T} \phi (\mathbf{x}_n) + b) - 1 \},$$
  
  where $\mathbf{a} = (a_1, ..., a_N)^{\mathbf{T}}$.
  

- Setting the derivatives of $L(\mathbf{w}, b, \mathbf{a})$ with respect to $\mathbf{w}$ and $b$ equal to zero, we obtain the following two conditions:

  $$\mathbf{w} = \sum_{n=1}^{N} a_n t_n \phi(\mathbf{x}_n),$$
  $$0 = \sum_{n=1}^{N} a_n t_n.$$

  Eliminating $\mathbf{w}$ and $b$ from  $L(\mathbf{w}, b, \mathbf{a})$ using these conditions gives the **dual representation** of the **maximum margin problem** in which we maximize:
  
  $$\widetilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m),$$

  with respect to $\mathbf{a}$ subject to the constraints:
  
  $$a_n \geq 0, n=1, ..., N;$$
  $$\sum_{n=1}^{N} a_n t_n = 0.$$
  
  Here the kernel function is defined by $k(\mathbf{x, x'}) = \mathbf{\phi(x)^T \phi (x')}.$
  
  
- Again, this takes the form of a **quadratic programming problem** in which we optimize a quadratic function of $\mathbf{a}$ subject to a set of inequality constraints.


- In general, the **solution** to a **quadratic programming problem** in $M$ variables has computational complexity that is $O(M^3)$.
  
  For a **fixed set of basis functions** whose **number $M$** is **smaller** than the **number $N$** of data points, the **move to the dual problem** appears **disadvantageous**. However, it allows the model to be reformulated using kernels, and so the maximum margin classifier can be applied efficiently to feature spaces whose dimensionality exceeds the number of data points, including infinite feature spaces.
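For the smallest possible data set, one point per class, the dual problem can be maximized in closed form, which gives a feel for how the solution depends only on kernel evaluations. With $t_1 = +1$ and $t_2 = -1$ the constraint $\sum_n a_n t_n = 0$ forces $a_1 = a_2 = a$, so $\widetilde{L}(a) = 2a - \frac{1}{2} a^2 (k_{11} - 2k_{12} + k_{22})$. The sketch below is an illustration only: the function names, the Gaussian kernel width and the two points are assumptions, and a general problem needs a quadratic programming solver instead.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel exp(-||x - z||^2 / (2 sigma^2))."""
    return float(np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2)))

def two_point_svm(x_pos, x_neg, kernel):
    """Closed-form maximizer of the dual for one point per class."""
    k_pp = kernel(x_pos, x_pos)
    k_pn = kernel(x_pos, x_neg)
    k_nn = kernel(x_neg, x_neg)
    # Stationary point of L(a) = 2a - 0.5 * a^2 * (k_pp - 2 k_pn + k_nn)
    a = 2.0 / (k_pp - 2.0 * k_pn + k_nn)
    # Bias: average of t_n - sum_m a_m t_m k(x_n, x_m) over both points
    b = 0.5 * ((1.0 - a * (k_pp - k_pn)) + (-1.0 - a * (k_pn - k_nn)))
    return a, b

def decision(x, x_pos, x_neg, a, b, kernel):
    """y(x) = sum_n a_n t_n k(x, x_n) + b for the two-point model."""
    return a * kernel(x, x_pos) - a * kernel(x, x_neg) + b

# Hypothetical training points, one per class
x_pos = np.array([1.0, 1.0])
x_neg = np.array([-1.0, 0.0])

a, b = two_point_svm(x_pos, x_neg, gaussian_kernel)
y_pos = decision(x_pos, x_pos, x_neg, a, b, gaussian_kernel)
y_neg = decision(x_neg, x_pos, x_neg, a, b, gaussian_kernel)
print(a, b, y_pos, y_neg)
```

Both points are support vectors here, so the decision function takes the values $+1$ and $-1$ on them, as the canonical representation requires.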


- In order to classify new data points using the trained model we evaluate the sign of $y(\mathbf{x})$.

  This can be expressed in terms of the parameters $\{a_n\}$ and the kernel function:
  
  $$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b.$$
  
- The constrained optimization of this form satisfies the **Karush-Kuhn-Tucker (KKT) conditions** which in this case require that the following three properties hold:
  
  $$a_n \geq 0,$$
  
  $$t_n y(\mathbf{x}_n) - 1 \geq 0,$$
  
  $$a_n \{ t_n y(\mathbf{x}_n) -1\} = 0.$$
  
  Thus for every data point, either $a_n = 0$ or $t_n y(\mathbf{x}_n) = 1$.
  
  
- Therefore, any data point for which $a_n =0$ will not appear in the sum and hence **plays no role** in making predictions for new data points.
  
  
- The remaining data points are called **support vectors**, and because they satisfy $t_n y(\mathbf{x}_n) = 1$, they correspond to points that lie on the maximum margin hyperplanes in feature space.


- **This property is central to the practical applicability of support vector machines**. 

  **Once the model is trained, a significant proportion of the data points can be discarded and only the support vectors retained.**


- Having solved the quadratic programming problem and found a value for $\mathbf{a}$, we can then determine the value of the threshold parameter $b$ by noting that any **support vector** $\mathbf{x}_n$ satisfies $t_n y(\mathbf{x}_n) = 1$:

  $$t_n \left ( \sum_{m \in \mathcal{S}} a_m t_m k(\mathbf{x}_n, \mathbf{x}_m) + b \right ) = 1,$$
  
  where $\mathcal{S}$ denotes the set of indices of the **support vectors**.
  
  Multiplying through by $t_n$, using $t_n^2 = 1$, and averaging over all support vectors gives a numerically stable solution for $b$:
  
  $$b = \frac{1}{N_{\mathcal{S}}} \sum_{n \in \mathcal{S}} \left ( t_n - \sum_{m \in \mathcal{S}} a_m t_m k(\mathbf{x}_n, \mathbf{x}_m)\right ),$$
  
  where $N_{\mathcal{S}}$ is the total number of support vectors.
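The bias formula, the prediction rule and the KKT conditions can be traced on a toy problem. In the sketch below the one-dimensional data set, the linear kernel and the dual solution `a` are assumptions made for illustration. The vector `a` was worked out by hand for this data set and is not the output of a quadratic programming solver.

```python
import numpy as np

def linear_kernel(x, z):
    """Linear kernel for 1-D inputs."""
    return float(x * z)

# Hypothetical 1-D training set; the max-margin boundary is at x = 0
X = np.array([-2.0, -1.0, 1.0, 3.0])
t = np.array([-1.0, -1.0, 1.0, 1.0])
a = np.array([0.0, 0.5, 0.5, 0.0])          # hand-derived dual solution

K = np.array([[linear_kernel(xn, xm) for xm in X] for xn in X])
S = np.flatnonzero(a > 1e-12)               # indices of the support vectors

# b = (1 / N_S) * sum_{n in S} ( t_n - sum_{m in S} a_m t_m k(x_n, x_m) )
b = float(np.mean(t[S] - K[np.ix_(S, S)] @ (a[S] * t[S])))

def predict(x):
    """y(x) = sum_{n in S} a_n t_n k(x, x_n) + b, using support vectors only."""
    return sum(a[n] * t[n] * linear_kernel(x, X[n]) for n in S) + b

y_train = np.array([predict(x) for x in X])
slack = a * (t * y_train - 1.0)             # complementary slackness terms

print("support vectors:", X[S], "b =", b)
print("t_n * y(x_n):", t * y_train)
print("a_n * (t_n y(x_n) - 1):", slack)
```

Only the two points closest to the boundary have $a_n > 0$, so the other two could be discarded after training without changing any prediction.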


- The figure below shows an example of the classification resulting from training a support vector machine on a simple synthetic data set using a Gaussian kernel.

<img src="images/L10_SVM2.png" width="600" alt="Example" />





<h1 align="center">End of Lecture</h1>