**Artificial Intelligence (CS550)**
<br>
Title: **Lecture 8: Logistic Regression**
<br>
Speaker: **Dr. Shota Tsiskaridze**

Bibliography: [1] Christopher M. Bishop, *Pattern Recognition and Machine Learning*, Springer, 2006.

<h1 align="center">Kernel Methods</h1>

<h3 align="center">Vandermonde Matrix</h3>

In linear algebra, a **Vandermonde matrix** is an $m \times n$ matrix with the terms of a geometric progression in each row:

$$
V = 
\begin{bmatrix}
1      & \alpha_1 & \alpha_1^2 & \cdots & \alpha_1^{n-1}\\ 
1      & \alpha_2 & \alpha_2^2 & \cdots & \alpha_2^{n-1}\\ 
1      & \alpha_3 & \alpha_3^2 & \cdots & \alpha_3^{n-1}\\ 
\vdots & \vdots   & \vdots     & \ddots & \vdots\\ 
1      & \alpha_m & \alpha_m^2 & \cdots & \alpha_m^{n-1}
\end{bmatrix},$$

or

$$V_{ij} = \alpha_i^{j-1}.$$

The determinant of a square Vandermonde matrix (where $m = n$) can be expressed as:

$$\det(V) = \prod_{1 \leq i < j \leq n} (\alpha_j - \alpha_i).$$

This is called the **Vandermonde determinant** or **Vandermonde polynomial**. 
<br>
If all the numbers $\alpha _{i}$ **are distinct**, then it is **non-zero**.
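
As a quick numerical check, the sketch below builds a small square Vandermonde matrix with NumPy and compares its determinant with the product of pairwise differences $\alpha_j - \alpha_i$ over $i < j$. The sample values in `alpha` are arbitrary assumptions chosen only for illustration.

```python
import itertools
import numpy as np

# Hypothetical sample nodes; any distinct real numbers would do.
alpha = np.array([1.0, 2.0, 4.0, 7.0])
n = len(alpha)

# increasing=True gives columns 1, alpha, alpha^2, ..., so V[i, j] = alpha[i] ** j
# with zero-based j (V_ij = alpha_i^(j-1) in one-based notation).
V = np.vander(alpha, N=n, increasing=True)

# Product of pairwise differences over all index pairs i < j.
det_formula = 1.0
for i, j in itertools.combinations(range(n), 2):
    det_formula *= alpha[j] - alpha[i]

# Determinant computed numerically, for comparison.
det_numeric = np.linalg.det(V)
```

For these four distinct values both quantities equal 540. Repeating any value in `alpha` makes one factor vanish, and the determinant becomes zero.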

<h3 align="center">Discriminant Functions (Two Classes Case)</h3>

- The simplest representation of a linear discriminant function is obtained by taking a linear function of the input vector so that:

  $$y(\mathbf{x}) = \mathbf{w}^{\mathbf{T}}\mathbf{x} + w_0,$$
  where $\mathbf{w}$ is called a **weight vector**, and $w_0$ is a **bias**.


- In a two-class classification problem, an input vector $\mathbf{x}$ is assigned to class $\mathcal{C}_1$ if $y(\mathbf{x})\geq 0$ and to class $\mathcal{C}_2$ otherwise.

- The corresponding **decision boundary** is therefore defined by the relation $y(\mathbf{x}) = 0$, which corresponds to a **$(D − 1)$-dimensional hyperplane** within the **$D$-dimensional input space**.


- If $\mathbf{x}$ and $\mathbf{x}'$ are two points which lie on the decision surface, then $y(\mathbf{x}) = y(\mathbf{x'}) = 0 \rightarrow \mathbf{w}^{\mathbf{T}}(\mathbf{x} - \mathbf{x}') = 0$, and hence the vector $\mathbf{w}$ is orthogonal to every vector lying within the decision surface, and so $\mathbf{w}$ determines the orientation of the decision surface.


- If $\mathbf{x}$ is a **point on the decision surface**, then the **normal distance** from the **origin** to the **decision surface** is given by:

  $$\frac{\mathbf{w}^{\mathbf{T}}\mathbf{x}}{\| \mathbf{w}\|} = -\frac{w_0}{\| \mathbf{w}\|},$$

  i.e. the **bias parameter** $w_0$ determines the location of the decision surface.
  
  
- The signed **perpendicular distance** $r$ of a point $\mathbf{x}$ from the **decision surface** can be calculated as:

  $$r = \frac{y(\mathbf{x})}{\| \mathbf{w}\|}.$$

  Indeed, consider an arbitrary point $\mathbf{x}$ and let $\mathbf{x}_{\perp}$ be its orthogonal projection onto the decision surface, so that:
  
  $$\mathbf{x} = \mathbf{x}_{\perp} + r\frac{\mathbf{w}}{\| \mathbf{w}\|}.$$
  
  Multiplying both sides by $\mathbf{w}^{\mathbf{T}}$ and adding $w_0$, and making use of $y(\mathbf{x}) = \mathbf{w}^{\mathbf{T}}\mathbf{x} + w_0$ and $y(\mathbf{x}_{\perp}) = \mathbf{w}^{\mathbf{T}}\mathbf{x}_{\perp} + w_0 = 0$, we get $y(\mathbf{x}) = r \| \mathbf{w}\|$, which is the stated expression.


<img src="images/L8_perpendicular_distance.png" width="600" alt="Example" />
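
The geometry of the two-class discriminant can be checked numerically. The following is a minimal sketch, assuming hypothetical values for $\mathbf{w}$, $w_0$ and the test point; it computes the signed distance $r = y(\mathbf{x}) / \|\mathbf{w}\|$, recovers the projection via $\mathbf{x}_{\perp} = \mathbf{x} - r\,\mathbf{w}/\|\mathbf{w}\|$, and evaluates the distance of the surface from the origin.

```python
import numpy as np

# Hypothetical parameters of a 2-D linear discriminant y(x) = w^T x + w0.
w = np.array([3.0, 4.0])
w0 = -5.0

def y(x):
    """Linear discriminant value at x."""
    return w @ x + w0

x = np.array([3.0, 4.0])            # an arbitrary test point (assumption)
norm_w = np.linalg.norm(w)

r = y(x) / norm_w                   # signed distance from x to the surface
x_perp = x - r * w / norm_w         # orthogonal projection of x onto the surface
origin_dist = -w0 / norm_w          # signed distance from the origin to the surface
```

Here `y(x_perp)` evaluates to zero up to rounding, which confirms that the projected point lies on the decision surface, and the Euclidean distance between `x` and `x_perp` equals `abs(r)`.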


<h3 align="center">Discriminant Functions (Multiple Classes Case)</h3>

- We might be tempted to **build** a $K$-class discriminant by **combining a number of two-class discriminant functions**. However, this leads to some serious difficulties:
  - **one-versus-the-rest** classifier: use $K−1$ classifiers each of which solves a two-class problem of separating points in a particular class $\mathcal{C}_k$ from points not in that class.

  - **one-versus-one** classifier: introduce $K(K − 1)/2$ binary discriminant functions, one for every possible pair of classes.

    Each of these binary discriminants has the two-class form $y(\mathbf{x}) = \mathbf{w}^{\mathbf{T}}\mathbf{x} + w_0$, with its own **weight vector** $\mathbf{w}$ and **bias** $w_0$.

  As we see, this approach runs into the **problem of ambiguous regions**, as illustrated in the Figure.
  
<img src="images/L8_Multiple_Classes.png" width="1800" alt="Example" />


- We can avoid these difficulties by considering a **single $K$-class discriminant** comprising $K$ linear functions of the form:

  $$y_k(\mathbf{x}) = \mathbf{w}_k^{\mathbf{T}}\mathbf{x} + w_{k0},$$
  
  and then assigning a point $\mathbf{x}$ to class $\mathcal{C}_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \neq k$.
  
  The decision boundary between class $\mathcal{C}_k$ and class $\mathcal{C}_j$ is therefore given by $y_k(\mathbf{x}) = y_j(\mathbf{x})$ and hence corresponds to a $(D − 1)$-dimensional hyperplane defined by:
  
  $$(\mathbf{w}_k - \mathbf{w}_j)^{\mathbf{T}} \mathbf{x} + (w_{k0} - w_{j0}) = 0.$$
  
  This has the same form as the decision boundary for the two-class case.
  
- Note that the **decision regions** of such a discriminant are always **singly connected** and **convex**.

  Indeed, if $\mathbf{x}$ and $\mathbf{x}'$ are two points both of which lie inside decision region $\mathcal{R}_k$, then any point $\hat{\mathbf{x}}$ that lies on the line connecting $\mathbf{x}$ and $\mathbf{x}'$ can be expressed in the form:
  
  $$\hat{\mathbf{x}} = \lambda \mathbf{x}  + (1- \lambda) \mathbf{x}',$$
  
  where $0 \leq \lambda \leq 1$. From the linearity of the discriminant functions, it follows that:
  
  $$y_k(\hat{\mathbf{x}}) = \lambda y_k(\mathbf{x})  + (1- \lambda) y_k(\mathbf{x}').$$
  
  Because both $\mathbf{x}$ and $\mathbf{x}'$ lie inside $\mathcal{R}_k$, it follows that $y_k({\mathbf{x}}) > y_j({\mathbf{x}})$ and $y_k({\mathbf{x}}') > y_j({\mathbf{x}}')$ for all $j\neq k$, and hence $y_k(\hat{\mathbf{x}}) > y_j(\hat{\mathbf{x}})$, and so $\hat{\mathbf{x}}$ also lies inside $\mathcal{R}_k$.
  
  
<img src="images/L8_Convex_decision_boundary.png" width="600" alt="Example" />
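
The $K$-class rule and the convexity argument can be illustrated with a short sketch. The weights below are hypothetical values for $K = 3$ classes in two dimensions; the code assigns each point to the class with the largest $y_k(\mathbf{x})$ and then tests empirically that convex combinations of same-class points keep the same label.

```python
import numpy as np

# Hypothetical parameters for K = 3 classes in D = 2 dimensions.
W = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])        # row k holds w_k
w0 = np.array([0.0, 0.5, -0.5])     # entry k holds w_k0

def classify(x):
    """Return the index k that maximizes y_k(x) = w_k^T x + w_k0."""
    return int(np.argmax(W @ x + w0))

rng = np.random.default_rng(0)
points = rng.uniform(-3.0, 3.0, size=(200, 2))
labels = np.array([classify(p) for p in points])

# Convexity check: for two points with the same label, a random convex
# combination of them must receive that label too.
convex_ok = True
for _ in range(500):
    i, j = rng.integers(0, len(points), size=2)
    if labels[i] != labels[j]:
        continue
    lam = rng.uniform(0.0, 1.0)
    x_hat = lam * points[i] + (1.0 - lam) * points[j]
    convex_ok = convex_ok and classify(x_hat) == labels[i]
```

A random test of this kind does not prove convexity, but it is a cheap sanity check of the linearity argument given above.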
  

<h3 align="center">Least Squares for Classification</h3>

The **least-squares** approach gives an exact **closed-form solution** for the discriminant function **parameters**. 

Let's demonstrate it:

- Each class $\mathcal{C}_k$ is described by its own linear model so that:

  $$y_k(\mathbf{x}) = \mathbf{w}_k^{\mathbf{T}}\mathbf{x} + w_{k0},$$

  where $k = 1, ..., K$. 
  
  We can conveniently group these together using vector notation so that:
  
  $$\mathbf{y}(\mathbf{x}) = \widetilde{\mathbf{W}}^{\mathbf{T}}\widetilde{\mathbf{x}},$$

  where $\widetilde{\mathbf{W}}$ is a matrix whose $k^{th}$ column comprises the $(D + 1)$-dimensional vector $\widetilde{\mathbf{w}}_k = (w_{k0}, \mathbf{w}_k^{\mathbf{T}})^{\mathbf{T}}$ and $\widetilde{\mathbf{x}}$ is the corresponding augmented input vector $(1, \mathbf{x}^{\mathbf{T}})^{\mathbf{T}}$ with a dummy input $x_0 = 1$.


- As we do for regression, we can determine the parameter matrix $\widetilde{\mathbf{W}}$ by minimizing a sum-of-squares error function.

  Let's consider a training data set $\{\mathbf{x}_n, \mathbf{t}_n\}$ where $n = 1, ..., N$, and define a matrix $\mathbf{T}$ whose $n^{th}$ row is the vector $\mathbf{t}_n^{\mathbf{T}}$ and a matrix $\widetilde{\mathbf{X}}$ whose $n^{th}$ row is the vector $\mathbf{\widetilde{x}}_n^{\mathbf{T}}$.
  
  The sum-of-squares error function can then be written as:
  
  $$E_D(\widetilde{\mathbf{W}}) = \frac{1}{2} \mathbf{Tr} \left \{ (\widetilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T})^{\mathbf{T}} (\widetilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T})\right \}.$$
  
  Setting the derivative with respect to $\widetilde{\mathbf{W}}$ to zero, and rearranging, we then obtain the solution for $\widetilde{\mathbf{W}}$ in the form:
  
  $$\widetilde{\mathbf{W}} = \left ( \widetilde{\mathbf{X}}^{\mathbf{T}}\widetilde{\mathbf{X}} \right )^{-1} \widetilde{\mathbf{X}}^{\mathbf{T}}\mathbf{T} = \widetilde{\mathbf{X}}^{\dagger} \mathbf{T},$$
  
  where $\widetilde{\mathbf{X}}^{\dagger}$ is the pseudo-inverse of the matrix $\widetilde{\mathbf{X}}$.
  
  The discriminant function will have the form:
  
  $$y(\mathbf{x}) = \widetilde{\mathbf{W}}^{\mathbf{T}}\widetilde{\mathbf{x}}  = \mathbf{T}^{\mathbf{T}} \left ( \widetilde{\mathbf{X}}^{\dagger} \right )^{\mathbf{T}} \widetilde{\mathbf{x}}.$$
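
The closed-form solution is easy to reproduce with NumPy. The sketch below is only an illustration: the toy data set and the 1-of-$K$ (one-hot) coding of the target vectors are assumptions, and `np.linalg.pinv` is used for the pseudo-inverse $\widetilde{\mathbf{X}}^{\dagger}$.

```python
import numpy as np

# Hypothetical toy data: N = 6 points in D = 2 dimensions, K = 2 classes.
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.5],
              [-1.0, -1.0], [-1.5, -0.5], [-2.0, -1.5]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Target matrix T (assumed 1-of-K coding): row n is the one-hot vector t_n^T.
T = np.eye(2)[labels]

# Augmented design matrix: prepend the dummy input x_0 = 1 to every row.
X_tilde = np.hstack([np.ones((len(X), 1)), X])

# Closed-form solution W = pinv(X_tilde) @ T, with shape (D + 1, K).
W_tilde = np.linalg.pinv(X_tilde) @ T

def predict(x):
    """Assign x to the class whose output y_k(x) is largest."""
    x_tilde = np.concatenate([[1.0], x])
    return int(np.argmax(W_tilde.T @ x_tilde))

predictions = np.array([predict(x) for x in X])
```

On this well-separated toy set the least-squares discriminant reproduces all training labels. The resulting `W_tilde` also satisfies the normal equations $\widetilde{\mathbf{X}}^{\mathbf{T}}\widetilde{\mathbf{X}}\widetilde{\mathbf{W}} = \widetilde{\mathbf{X}}^{\mathbf{T}}\mathbf{T}$.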
  

However, even as a discriminant function, this **solution suffers** from some **severe problems**:

- Lack of robustness to outliers: **additional data points** that lie far from the boundary can produce a **significant change** in the location of the **decision boundary**.
  
  <img src="images/L8_LS_suffering.png" width="1800" alt="Example" />

  The left plot shows data from **two classes**, denoted by **red crosses** and **blue circles**, together with the decision boundary found by **least squares (magenta curve)** and also by the **logistic regression model (green curve)**.


- The problems with least squares can be more severe than simply lack of robustness:

  <img src="images/L8_LS_suffering2.png" width="1800" alt="Example" />

  On the left is the result of using a least-squares discriminant. We see that the region of input space assigned to the green class is too small and so most of the points from this class are misclassified. On the right is the result of using logistic regression.
  
  
  
- The **failure of least squares** should not surprise us when we recall that it **corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution**, whereas binary target vectors clearly have a distribution that is **far from Gaussian**.
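
The lack of robustness can be reproduced on a tiny one-dimensional example. The sketch below is a simplification under stated assumptions: it uses a single output with targets $+1$ and $-1$ instead of a 1-of-$K$ target matrix, and the data values are hypothetical. It fits $y(x) = w x + w_0$ by least squares and reports the decision boundary $x = -w_0 / w$ before and after two extra, correctly placed points are added.

```python
import numpy as np

def ls_boundary(x, t):
    """Fit y(x) = w*x + w0 by least squares and return the boundary -w0/w."""
    X_tilde = np.column_stack([np.ones_like(x), x])
    w0, w = np.linalg.pinv(X_tilde) @ t
    return -w0 / w

# Hypothetical 1-D data with targets +1 for C1 and -1 for C2.
x_clean = np.array([-2.0, -1.0, 1.0, 2.0])
t_clean = np.array([-1.0, -1.0, 1.0, 1.0])

# Two extra C1 points lying far on the correct side of the boundary.
x_extra = np.concatenate([x_clean, [10.0, 11.0]])
t_extra = np.concatenate([t_clean, [1.0, 1.0]])

b_clean = ls_boundary(x_clean, t_clean)   # symmetric data: boundary at 0
b_extra = ls_boundary(x_extra, t_extra)   # boundary pulled toward the extra points
```

The extra points are classified correctly with a large margin, yet they move the boundary from $0$ to $0.875$, almost onto the nearest $\mathcal{C}_1$ point at $x = 1$. The sum-of-squares error penalizes predictions that are "too correct", which is the effect described above.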

<h3 align="center">Fisher’s Linear Discriminant</h3>

- Suppose we take the $D$-dimensional input vector $\mathbf{x}$ and project it down to one dimension using:

  $$y = \mathbf{w}^{\mathbf{T}} \mathbf{x}.$$ 
  
  If we place a threshold on $y$ and classify $y \geq -w_0$ as class $\mathcal{C}_1$ and otherwise class $\mathcal{C}_2$, then we obtain our standard linear classifier.
  

- Let's consider a two-class problem in which there are $N_1$ points of class $\mathcal{C}_1$ and $N_2$ points of class $\mathcal{C}_2$, so the mean vectors of the two classes are given by:

  $$\mathbf{m}_1 = \frac{1}{N_1} \sum_{n \in \mathcal{C}_1} \mathbf{x}_n, $$
  $$\mathbf{m}_2 = \frac{1}{N_2} \sum_{n \in \mathcal{C}_2} \mathbf{x}_n.$$
  
  The simplest measure of the separation of the classes, when projected onto $\mathbf{w}$, is the separation of the projected class means. This suggests that we might choose $\mathbf{w}$ so as to maximize:
   
  $$m_2 - m_1 = \mathbf{w}^{\mathbf{T}}(\mathbf{m_2} - \mathbf{m_1}),$$
  
  where $m_k = \mathbf{w}^{\mathbf{T}}\mathbf{m}_k.$
  
  
- However, this expression can be made arbitrarily large simply by increasing the magnitude of $\mathbf{w}$. To solve this problem, we could constrain $\mathbf{w}$ to have unit length:

  $$\sum_{i} w_i^{2} = 1.$$
  
- The projection $y = \mathbf{w}^{\mathbf{T}} \mathbf{x}$ **transforms** the **set of labelled data points** in $\mathbf{x}$ into a **labelled set in the one-dimensional space** $y$.
  
  The within-class variance of the transformed data from class $\mathcal{C}_k$ is therefore given by:
  
  $$s_k^{2} = \sum_{n \in \mathcal{C}_k} (y_n - m_k)^2,$$
  
  where $y_n = \mathbf{w}^{\mathbf{T}} \mathbf{x}_n.$
  
  We can define the total within-class variance for the whole data set to be simply $s_1^2 + s_2^2.$
  
  The **Fisher criterion** is defined to be the **ratio of the between-class variance to the within-class variance** and is given by:
  
  $$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}.$$
  
  We can introduce the **between-class** covariance matrix $\mathbf{S_B}$ and the **total within-class** covariance matrix $\mathbf{S_W}$ as follows:
  
  $$\mathbf{S_B} = (\mathbf{m_2} - \mathbf{m_1})(\mathbf{m_2} - \mathbf{m_1})^{\mathbf{T}},$$
  $$\mathbf{S_W} = \sum_{n\in \mathcal{C}_1}(\mathbf{x_n} - \mathbf{m_1})(\mathbf{x_n} - \mathbf{m_1})^{\mathbf{T}} + \sum_{n\in \mathcal{C}_2}(\mathbf{x_n} - \mathbf{m_2})(\mathbf{x_n} - \mathbf{m_2})^{\mathbf{T}}.$$
  
  Thus, we can rewrite the Fisher criterion in the form:
  
  $$J(\mathbf{w}) = \mathbf{\frac{w^T S_B w}{w^T S_W w}}.$$
  
  Differentiating with respect to $\mathbf{w}$ we find that $J(\mathbf{w})$ is maximized when:
  
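  $$\mathbf{ (w^T S_B w) S_W w = (w^T S_W w) S_B w }.$$
  
  But $\mathbf{S_B w}$ is always in the direction of $\mathbf{m_2 - m_1}$, because $\mathbf{S_B w} = (\mathbf{m_2} - \mathbf{m_1})(\mathbf{m_2} - \mathbf{m_1})^{\mathbf{T}}\mathbf{w}$ and the factor $(\mathbf{m_2} - \mathbf{m_1})^{\mathbf{T}}\mathbf{w}$ is a scalar. Furthermore, we do not care about the magnitude of $\mathbf{w}$, only its direction, and so we can drop the scalar factors $\mathbf{ (w^T S_B w)}$ and $\mathbf{(w^T S_W w)}.$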

  Finally, we get:
  
  $$\mathbf{ w \propto S_W^{-1} (m_2 - m_1)}.$$
  
  This result is known as **Fisher’s linear discriminant**.


  <img src="images/L8_Fisher_discriminant.png" width="1800" alt="Example" />
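
The derivation above can be sketched in a few lines of NumPy. The two small sample sets are hypothetical; the code forms $\mathbf{S_W}$ and $\mathbf{S_B}$, obtains the Fisher direction by solving $\mathbf{S_W w} = \mathbf{m_2 - m_1}$, and compares the Fisher criterion $J(\mathbf{w})$ with the one obtained by projecting onto the line joining the class means.

```python
import numpy as np

# Hypothetical 2-D samples for the two classes.
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 2.0]])
X2 = np.array([[5.0, 1.0], [6.0, 2.0], [7.0, 2.0], [6.0, 1.0]])

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Total within-class matrix S_W and between-class matrix S_B.
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
S_B = np.outer(m2 - m1, m2 - m1)

def fisher_criterion(w):
    """J(w) = (w^T S_B w) / (w^T S_W w)."""
    return (w @ S_B @ w) / (w @ S_W @ w)

# Fisher direction: solve S_W w = m2 - m1 rather than forming the inverse.
w_fisher = np.linalg.solve(S_W, m2 - m1)
w_fisher /= np.linalg.norm(w_fisher)     # only the direction matters

# Naive alternative: the line joining the class means.
w_means = (m2 - m1) / np.linalg.norm(m2 - m1)
```

For this data `fisher_criterion(w_fisher)` is larger than `fisher_criterion(w_means)`, because the Fisher direction also accounts for the within-class spread. Since $J$ is invariant to the scale of $\mathbf{w}$, the normalization step is a convenience rather than a requirement.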

   

<h3 align="center">The Perceptron Algorithm</h3>

- Another example of a linear discriminant model is the **perceptron of Rosenblatt**.

- Let's take a two-class model in which the input vector $\mathbf{x}$ is first transformed using a fixed nonlinear transformation to give a feature vector $\phi(\mathbf{x})$ and this is then used to construct a generalized linear model of the form:

  $$y(\mathbf{x}) = f(\mathbf{w}^\mathbf{T} \phi (\mathbf{x}) ),$$

  where the nonlinear activation function $f(\cdot)$ is given by a step function of the form:
  
  $$f(a) = 
\left\{\begin{matrix}
+1, & a \geq 0,
\\ 
-1, & a < 0.
\end{matrix}\right.$$
 
- For the perceptron, it is more convenient to use target values $t = +1$ for class $\mathcal{C}_1$ and $t = -1$ for $\mathcal{C}_2$.

- We are seeking a weight vector $\mathbf{w}$ such that patterns  $\mathbf{x}_n$ in class $\mathcal{C}_1$ will have $\mathbf{w^T} \phi(\mathbf{x}_n)>0$, whereas patterns $\mathbf{x}_n$ in class $\mathcal{C}_2$ have $\mathbf{w^T} \phi(\mathbf{x}_n)<0$.

  Using the $t \in \{-1, +1\}$ target coding scheme it follows that we would like all patterns to satisfy $\mathbf{w^T} \phi(\mathbf{x}_n) t_n>0$.
  
- The perceptron criterion is given by:

  $$ E_P(\mathbf{w}) = - \sum_{n \in \mathcal{M}} \mathbf{w^T} \phi(\mathbf{x}_n) t_n,$$
  
  where $\mathcal{M}$ denotes the set of all misclassified patterns, i.e. it associates **zero error** with any pattern that is **correctly classified**, whereas for a **misclassified pattern** $\mathbf{x}_n$ it tries to **minimize the quantity**  $-\mathbf{w^T} \phi(\mathbf{x}_n) t_n$.
  
- We can apply the **stochastic gradient descent** algorithm to this error function and get:

  $$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)}  - \eta \nabla E_P(\mathbf{w})  = \mathbf{w}^{(\tau)} + \eta \phi_n t_n,$$
  
  where $\eta$ is the learning rate parameter and $\tau$ is an integer that indexes the steps of the algorithm.
  
  We can set the learning rate parameter $\eta = 1$ without loss of generality, because the perceptron function $y (\mathbf{x, w})$ is unchanged if we multiply $\mathbf{w}$ by a positive constant.
  
  
- **Interpretation**: We cycle through the training patterns in turn, and for each pattern $\mathbf{x}_n$ we evaluate the perceptron function. 
  - If the pattern is correctly classified, then the weight vector remains unchanged;
  - If it is incorrectly classified, then for class $\mathcal{C}_1$ we add the vector $\phi(\mathbf{x}_n)$ onto the current estimate of weight vector $\mathbf{w}$ while for class $\mathcal{C}_2$ we subtract the vector $\phi(\mathbf{x}_n)$ from $\mathbf{w}$. 
  
  The perceptron learning algorithm is illustrated in the next figure:

  <img src="images/L8_Perceptron_discriminant.png" width="1800" alt="Example" />


- The **perceptron convergence theorem** states that if there **exists an exact solution** (in other words, if the training data set is linearly separable), then the **perceptron learning algorithm** is **guaranteed to find** an exact solution in a **finite number of steps**.
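
The update rule described above can be sketched as follows. The data set, the feature map $\phi(\mathbf{x}) = (1, x_1, x_2)^{\mathbf{T}}$, the zero initialization and the `max_epochs` cap are assumptions made for illustration; a pattern is treated as misclassified when $\mathbf{w^T}\phi(\mathbf{x}_n) t_n \leq 0$, and $\eta = 1$.

```python
import numpy as np

def train_perceptron(Phi, t, max_epochs=100):
    """Cycle through the patterns and apply w <- w + phi_n * t_n on mistakes (eta = 1)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for phi_n, t_n in zip(Phi, t):
            if (w @ phi_n) * t_n <= 0:      # misclassified (or on the boundary)
                w = w + phi_n * t_n
                mistakes += 1
        if mistakes == 0:                   # a full pass without errors: converged
            return w
    return w                                # cap reached (e.g. non-separable data)

# Hypothetical linearly separable data; phi(x) = (1, x1, x2) includes a bias feature.
X = np.array([[2.0, 1.0], [1.0, 3.0], [3.0, 2.0],
              [-1.0, -2.0], [-2.0, -1.0], [-3.0, -3.0]])
t = np.array([1, 1, 1, -1, -1, -1])
Phi = np.hstack([np.ones((len(X), 1)), X])

w = train_perceptron(Phi, t)
predictions = np.where(Phi @ w >= 0, 1, -1)
```

Because the toy data are linearly separable, the loop stops after a pass with no mistakes, in line with the convergence theorem. The `max_epochs` cap is there only because the algorithm would never terminate on data that are not separable.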

<h3 align="center">Kernel Function</h3>

- For models which are based on a fixed nonlinear **feature space** mapping $\phi (\mathbf{x})$ the kernel function is given by the relation:

  $$k(\mathbf{x}, \mathbf{x}') = \phi (\mathbf{x})^T \phi (\mathbf{x}').$$

  From this definition, we see that the **kernel** is a **symmetric function** of its arguments.

- Examples of kernel functions:
  - **stationary** kernel: $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}')$;
  - **homogeneous** or **radial basis functions** kernel: $k(\mathbf{x}, \mathbf{x}') = k(\|\mathbf{x} - \mathbf{x}'\|)$;
  
- Many **linear parametric models** can be **re-cast** into an equivalent **dual representation** in which the predictions are based on linear combinations of a **kernel function** evaluated at the **training data points**.


  <img src="images/L8_Kernels.png" width="600" alt="Example" />
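
To make the definition concrete, the sketch below checks that a kernel can be evaluated without forming the feature vectors. The quadratic kernel $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z})^2$ with its explicit two-dimensional feature map, and the Gaussian kernel with width `sigma`, are standard examples chosen here for illustration and are not taken from the text above.

```python
import numpy as np

def phi(x):
    """Explicit feature map of the 2-D quadratic kernel (illustrative choice)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def k_quadratic(x, z):
    """Kernel k(x, z) = (x^T z)^2, evaluated without forming phi."""
    return (x @ z) ** 2

def k_gaussian(x, z, sigma=1.0):
    """Gaussian kernel: depends only on ||x - z||, so it is a radial basis function kernel."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

via_features = phi(x) @ phi(z)   # inner product in feature space
via_kernel = k_quadratic(x, z)   # same value, computed in input space
```

Both routes give the same number, and swapping the arguments leaves either kernel unchanged, which is the symmetry noted above. Shifting `x` and `z` by the same vector does not change `k_gaussian`, which is what makes it a stationary kernel.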



<h3 align="center">Dual Representations</h3>

- Let's consider a linear regression model whose parameters are determined by minimizing a regularized sum-of-squares error function given by:

  $$J(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ \mathbf{w^T\phi(x_n)}  - t_n \}^2 + \frac{\lambda}{2} \mathbf{w^Tw},$$
  
  where $\lambda \geq 0$.
  
  If we set the gradient of $J(\mathbf{w})$ with respect to $\mathbf{w}$ equal to zero, we see that the solution for  $\mathbf{w}$ takes the form of a linear combination of the vectors $\phi(\mathbf{x_n})$, with coefficients that are functions of $\mathbf{w}$, of the form:
  
  $$\mathbf{w} = -\frac{1}{\lambda} \sum_{n=1}^{N} \{ \mathbf{w^T\phi(x_n)}  - t_n \} \phi(\mathbf{x_n})  = 
  \sum_{n=1}^{N} a_n \phi(\mathbf{x_n}) = \mathbf{\Phi^T a},$$
  
  where $\mathbf{\Phi}$ is the design matrix, whose $n^{th}$ row is given by $\mathbf{\phi(x_n)^T}$.
  
  Here the vector $\mathbf{a} = (a_1, ..., a_N)^\mathbf{T}$ is defined by:
  
  $$a_n = -\frac{1}{\lambda}\{ \mathbf{w^T\phi(x_n)}  - t_n \}.$$
  
  Therefore, **instead of working** with the parameter vector $\mathbf{w}$, we can now **reformulate the least-squares algorithm** in **terms of** the parameter vector $\mathbf{a}$, giving rise to a **dual representation**:
  
  $$J(\mathbf{a}) = \frac{1}{2} \mathbf{ a^T \Phi \Phi^T \Phi \Phi^T a - a^T \Phi \Phi^T t +\frac{1}{2} t^T t + \frac{\lambda}{2} a^T \Phi \Phi^T a},$$
  
  where $\mathbf{t} = (t_1, ..., t_N)^{\mathbf{T}}.$
  
  We now define the **Gram matrix** $\mathbf{K} = \mathbf{\Phi \Phi^T}$, which is an $N \times N$ symmetric matrix with elements:
  
  $$K_{nm} = \phi (\mathbf{x}_n)^{\mathbf{T}} \phi (\mathbf{x}_m) = k(\mathbf{x}_n, \mathbf{x}_m),$$
  
  where we have introduced the **kernel function** $k(\mathbf{x}_n, \mathbf{x}_m)$.
  
  In terms of the **Gram matrix**, the sum-of-squares error function can be written as:
  
  $$J(\mathbf{a}) = \frac{1}{2} \mathbf{ a^T K K a  - a^T K t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T K a}.$$
  
  Setting the gradient of $J(\mathbf{a})$ with respect to $\mathbf{a}$ to zero, we obtain the following solution:
  
  $$\mathbf{a = (K + \lambda I_N)^{-1} t}.$$
  
  If we substitute this back into the linear regression model, we obtain the following prediction for a new input $\mathbf{x}$:
  
  $$y(\mathbf{x}) = \mathbf{w^T \phi(x) = a^T \Phi \phi(x) = k(x)^T (K + \lambda I_N)^{-1} t},$$
  
  where we have defined the vector $\mathbf{k(x)}$ with elements $k_n (\mathbf{x}) = k(\mathbf{x}_n, \mathbf{x}).$
  
  This is known as a **dual formulation**.
  
  
- In the **dual formulation**, we determine the parameter vector $\mathbf{a}$ by inverting an $N \times N$ matrix, whereas in the **original parameter space formulation** we had to invert an $M \times M$ matrix in order to determine $\mathbf{w}$. 

- Because $N$ is typically **much larger** than $M$, the **dual formulation does not seem to be particularly useful**.
  
- However, the **advantage of the dual formulation** is that it is **expressed entirely in terms of the kernel function** $k(\mathbf{x}, \mathbf{x}')$. We can therefore **work directly in terms of kernels** and **avoid the explicit introduction of the feature vector** $\phi(\mathbf{x})$, which allows us to implicitly use **feature spaces of high dimensionality** (even infinite).

  <img src="images/L8_Kernel_Function.png" width="1800" alt="Example" />
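
The equivalence of the two formulations can be verified numerically. The sketch below is a minimal illustration under assumptions: the data are synthetic, the feature map $\phi(x) = (1, x, x^2)^{\mathbf{T}}$ is a hypothetical choice with $M = 3$, and $\lambda = 0.1$. It solves for $\mathbf{w}$ in the original parameter space, for $\mathbf{a} = (\mathbf{K} + \lambda \mathbf{I}_N)^{-1}\mathbf{t}$ in the dual, and compares the predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: N = 8 points with features phi(x) = (1, x, x^2), so M = 3.
x_train = np.linspace(-1.0, 1.0, 8)
t = np.sin(np.pi * x_train) + 0.1 * rng.standard_normal(8)
lam = 0.1

def features(x):
    """Design matrix whose n-th row is phi(x_n)^T."""
    return np.column_stack([np.ones_like(x), x, x ** 2])

Phi = features(x_train)
N, M = Phi.shape

# Original formulation: solve an M x M linear system for w.
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Dual formulation: solve an N x N system built from the Gram matrix K = Phi Phi^T.
K = Phi @ Phi.T
a = np.linalg.solve(K + lam * np.eye(N), t)

# Predictions at new inputs: y(x) = w^T phi(x) = k(x)^T (K + lam I)^{-1} t.
x_new = np.array([-0.5, 0.0, 0.7])
Phi_new = features(x_new)
y_primal = Phi_new @ w
y_dual = (Phi_new @ Phi.T) @ a      # row i of (Phi_new @ Phi.T) is k(x_new_i)^T
```

The arrays `y_primal` and `y_dual` agree to numerical precision, and `Phi.T @ a` recovers `w`. In the dual route the features enter only through inner products, so `Phi @ Phi.T` and `Phi_new @ Phi.T` could be replaced by direct kernel evaluations.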


<h3 align="center">Support Vector Machines</h3>

- Let's consider a two-class classification problem using linear models of the form:

  $$ y (\mathbf{x}) = \mathbf{w^T \phi (x)} + b,$$
  
  where $\mathbf{\phi(x)}$ denotes a fixed feature-space transformation, and we have made the bias parameter $b$ explicit.
  
  The training data set comprises $N$ input vectors $\mathbf{x_1, ..., x}_N$ with corresponding target values $t_1, ..., t_N$ where $t_n \in \{ -1, 1\}$, and new data points $\mathbf{x}$ are classified according to the sign of $y(\mathbf{x})$.
  
  
- Let's assume for the moment that the training data set is **linearly separable** in **feature space**, so that **by definition** there **exists at least one choice of the parameters** such that the function satisfies $y(\mathbf{x_n})> 0 $ for points having $t_n = +1$ and $y(\mathbf{x_n})< 0 $ for points having $t_n = -1$, so that $t_n y(\mathbf{x_n})> 0 $ for all training data points.

- The perceptron algorithm is guaranteed to find a solution in a finite number of steps. However, the solution will depend on the (arbitrary) initial values chosen for $\mathbf{w}$ and $b$ as well as on the **order in which the data points are presented**.

- If there are **multiple solutions** all of which classify the training data set exactly, then we should **try to find the one** that will give the **smallest generalization error**.


- The **Support Vector Machine (SVM)** approaches this problem through the concept of the **margin**, which is defined to be the **smallest distance between the decision boundary and any of the samples**.

  <img src="images/L8_SVM.png" width="1800" alt="Example" />
  
  
- Recall that the perpendicular distance of a point $\mathbf{x}$ from a hyperplane defined by $y(\mathbf{x}) = 0$ is given by:

  $$\frac{|y(\mathbf{x})|}{ \|\mathbf{w}\|}.$$
  
  Thus the distance of a point $\mathbf{x}_n$ to the decision surface is given by:
  
  $$\frac{t_n y(\mathbf{x_n})}{ \| \mathbf{w}\|} = \frac{t_n (\mathbf{w^T \phi(x_n}) +b)}{ \| \mathbf{w}\|}.$$
  
  The **margin** is given by the **perpendicular distance** to the closest point $\mathbf{x_n}$ from the data set, and we wish to optimize the parameters $\mathbf{w}$ and $b$ in order to maximize this distance.
  
  Thus the **maximum margin solution** is found by solving:
  
  $$ \mathbf{\arg\max_{w,b}} \left \{  \frac{1}{\| \mathbf{w}\|} \min_{n} \left [t_n (\mathbf{w^T \phi(x_n)} + b) \right ] \right\}.$$
  
  where we have taken the factor $\frac{1}{\| \mathbf{w}\|}$ outside because $\mathbf{w}$ does not depend on $n$.
  
  
- If we make the rescaling $\mathbf{w} \rightarrow \kappa \mathbf{w}$ and $b \rightarrow \kappa b$, then the distance from any point $\mathbf{x}_n$ to the **decision surface** is unchanged. We can use this freedom to set:

  $$t_n (\mathbf{w^T \phi(x_n)} + b) = 1,$$
  
  for the point that is closest to the surface. In this case, all data points will satisfy the constraints:
  
  $$t_n (\mathbf{w^T \phi(x_n)} + b) \geq 1,$$
  
  for $n = 1, ..., N$. 
  
  This is known as the **canonical representation** of the **decision hyperplane**.
  
- In the case of data points for which the equality holds, the constraints are said to be **active**, whereas for the remainder they are said to be **inactive**.

- By definition, there will always be at **least one active** constraint, because **there will always be a closest point**.


- The optimization problem then simply requires that we maximize $\| \mathbf{w}\|^{-1}$, which is equivalent to minimizing $\| \mathbf{w}\|^2$. Therefore, we need to solve the optimization problem:

  $$\mathbf{\arg\min_{w,b}} \frac{1}{2} \| \mathbf{w}\| ^ 2,$$

  subject to the constraints $t_n (\mathbf{w^T \phi(x_n)} + b) \geq 1$ for $n = 1, ..., N$.
  
  This is an example of a quadratic programming problem in which we are trying to minimize a quadratic function subject to a set of linear inequality constraints.
  
- In order to solve this constrained optimization problem, we introduce Lagrange multipliers $a_n \geq 0$, with one multiplier $a_n$ for each of the constraints, giving the Lagrangian function:

  $$L(\mathbf{w}, b, \mathbf{a})  = \frac{1}{2} \| \mathbf{w}\|^2  - \sum_{n=1}^{N} a_n \{ t_n (\mathbf{w^T \phi (x_n)} + b) - 1 \},$$
  
  where $\mathbf{a} = (a_1, ..., a_N)^{\mathbf{T}}$.
  

- Setting the derivatives of $L(\mathbf{w}, b, \mathbf{a})$ with respect to $\mathbf{w}$ and $b$ equal to zero, we obtain the following two conditions:

  $$\mathbf{w} = \sum_{n=1}^{N} a_n t_n \phi(\mathbf{x}_n),$$
  $$0 = \sum_{n=1}^{N} a_n t_n.$$

  Eliminating $\mathbf{w}$ and $b$ from  $L(\mathbf{w}, b, \mathbf{a})$ using these conditions, gives the **dual representation** of the **maximum margin problem** in which we maximize:
  
  $$\widetilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x_n, x_m}),$$

  with respect to $\mathbf{a}$ subject to the constraints:
  
  $$a_n \geq 0, n=1, ..., N;$$
  $$\sum_{n=1}^{N} a_n t_n = 0.$$
  
  Here the kernel function is defined by $k(\mathbf{x, x'}) = \mathbf{\phi(x)^T \phi (x')}.$
  
  
- Again, this takes the form of a **quadratic programming problem** in which we optimize a quadratic function of $\mathbf{a}$ subject to a set of inequality constraints.


- In general, the **solution** to a **quadratic programming problem** in $M$ variables has computational complexity that is $O(M^3)$.
  
  For a **fixed set of basis functions** whose **number $M$** is **smaller** than the **number $N$** of data points, the **move to the dual problem** appears **disadvantageous**. However, it allows the model to be reformulated using kernels, and so the maximum margin classifier can be applied efficiently to feature spaces whose dimensionality exceeds the number of data points, including infinite feature spaces.


- In order to classify new data points using the trained model we evaluate the sign of $y(\mathbf{x})$.

  This can be expressed in terms of the parameters $\{a_n\}$ and the kernel function:
  
  $$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x, x_n}) + b.$$
  
- The constrained optimization of this form satisfies the **Karush-Kuhn-Tucker (KKT) conditions** which in this case require that the following three properties hold:
  
  $$a_n \geq 0,$$
  $$t_n y(\mathbf{x_n}) - 1 \geq 0,$$
  $$a_n \{ t_n y(\mathbf{x_n}) -1\} = 0.$$
  
  Thus for every data point, either $a_n = 0$ or $t_n y(\mathbf{x_n}) = 1$.
  
- Therefore, any data point for which $a_n =0$ will not appear in the sum and hence **plays no role** in making predictions for new data points.
  
- The remaining data points are called **support vectors**, and because they satisfy $t_n y(\mathbf{x_n}) = 1$, they correspond to points that lie on the maximum margin hyperplanes in feature space.


- **This property is central to the practical applicability of support vector machines. Once the model is trained, a significant proportion of the data points can be discarded and only the support vectors retained.**


- Having solved the quadratic programming problem and found a value for $\mathbf{a}$, we can then determine the value of the threshold parameter $b$ by noting that any **support vector** $\mathbf{x_n}$ satisfies $t_n y(\mathbf{x_n}) = 1$:

  $$t_n \left ( \sum_{m \in \mathcal{S}} a_m t_m k(\mathbf{x_n, x_m}) + b \right ) = 1,$$
  
  where $\mathcal{S}$ denotes the set of indices of the **support vectors**.
  
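  Multiplying through by $t_n$, making use of $t_n^2 = 1$, and averaging over all support vectors gives a numerically stable solution for $b$: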
  
  $$b = \frac{1}{N_{\mathcal{S}}} \sum_{n \in \mathcal{S}} \left ( t_n - \sum_{m \in \mathcal{S}} a_m t_m k(\mathbf{x_n, x_m})\right ),$$
  
  where $N_{\mathcal{S}}$ is the total number of support vectors.


- The figure below shows an example of the classification resulting from training a support vector machine on a simple synthetic data set using a Gaussian kernel.

<img src="images/L8_SVM2.png" width="600" alt="Example" />
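
The last steps (recovering $b$, predicting with the kernel expansion, and checking the KKT conditions) can be sketched on a toy problem. Everything specific here is an assumption: the four data points, the linear kernel, and the dual coefficients `a`, which were worked out by hand for this particular set rather than produced by a quadratic programming solver.

```python
import numpy as np

def linear_kernel(x, z):
    """k(x, z) = x^T z, i.e. phi(x) = x (illustrative choice)."""
    return x @ z

# Hypothetical linearly separable data with targets in {-1, +1}.
X = np.array([[1.0, 0.0], [-1.0, 0.0], [3.0, 1.0], [-2.0, -2.0]])
t = np.array([1.0, -1.0, 1.0, -1.0])

# Dual solution for this toy set, derived by hand (not computed by a QP solver).
a = np.array([0.5, 0.5, 0.0, 0.0])

K = np.array([[linear_kernel(xn, xm) for xm in X] for xn in X])
S = np.flatnonzero(a > 1e-12)             # indices of the support vectors

# b = (1 / N_S) * sum_{n in S} ( t_n - sum_{m in S} a_m t_m k(x_n, x_m) )
b = np.mean([t[n] - np.sum(a[S] * t[S] * K[n, S]) for n in S])

def y(x):
    """y(x) = sum_n a_n t_n k(x, x_n) + b; only support vectors contribute."""
    return sum(a[n] * t[n] * linear_kernel(x, X[n]) for n in S) + b

margins = t * np.array([y(xn) for xn in X])   # t_n y(x_n) for every training point

# KKT conditions: a_n >= 0, t_n y(x_n) - 1 >= 0, a_n (t_n y(x_n) - 1) = 0.
kkt_ok = bool(np.all(a >= 0) and np.all(margins >= 1 - 1e-9)
              and np.allclose(a * (margins - 1), 0.0))
```

Only the first two points have $a_n > 0$, so they are the support vectors and lie exactly on the margin with $t_n y(\mathbf{x}_n) = 1$. The other two points satisfy the constraints strictly and could be discarded without changing any prediction. For realistic data sets the coefficients `a` would come from a quadratic programming solver.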





<h1 align="center">End of Lecture</h1>