# Support Vector Machines

This is the first in a series of posts where I attempt to break down the mathematics behind Support Vector Machine Classifiers. 

Suppose we are given $n$ training examples $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$, where $x^{(i)} \in \mathbb{R}^{D}$ and $y^{(i)} \in \{-1, 1\}$, and we want to learn a classifier:

$
h_{w,b}(x) = g(w^{T} x + b)
$

where $g(z) = 1$ if $z \geq 0$ and $g(z) = -1$ otherwise.
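To make the definition concrete, here is a minimal sketch of $h_{w,b}$ in plain Python. The function name `predict` and the values of `w` and `b` are illustrative assumptions, not learned parameters.

```python
def predict(w, b, x):
    """Return h_{w,b}(x) = g(w^T x + b), with g(z) = 1 if z >= 0 else -1."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1 if z >= 0 else -1


# Hypothetical parameters for a 2-D problem (chosen for illustration only).
w = [1.0, -1.0]
b = -0.5

label_a = predict(w, b, [2.0, 0.0])  # z = 2 - 0 - 0.5 = 1.5, so the label is 1
label_b = predict(w, b, [0.0, 1.0])  # z = 0 - 1 - 0.5 = -1.5, so the label is -1
```

Note that a point with $z = 0$ lies exactly on the decision boundary and is assigned to the positive class by this definition of $g$.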

One way SVMs differ from many other classification methods is that they use the hardest-to-classify points, those closest to the boundary, as the basis for defining an optimal decision boundary.



Consider for a moment a two-dimensional version of this problem, as shown in the figure below. The orange, black and green hyperplanes all perfectly separate the red and blue classes. So which one do we prefer?

<img src="multiple_hyperplanes.png" alt="Drawing" style="width: 500px;"/>

Support vector machines set out to find the decision boundary that is as far as possible from the closest training points, that is, the boundary with the largest margin.

<img src="maximal_margin.png" alt="Drawing" style="width: 500px;"/>

### Distance between points and the hyperplane

How can we measure the distance $\gamma^{(i)}$ between a training observation $x^{(i)}$ and our hyperplane? 

Consider the decision boundary shown as the red dotted line in the figure below. Let $x_{0}$ be any point on the hyperplane; then $x^{(i)} - x_{0}$ is the vector from $x_{0}$ to $x^{(i)}$. The dotted black line from $x^{(i)}$ to the hyperplane is the shortest path to the hyperplane, so it meets the hyperplane at a right angle. This dotted line forms a right-angled triangle with the hyperplane, and we label the unknown angle at the top of the triangle $\theta$. The vector $w^{*} = w/\|w\|$ is the unit vector perpendicular to the hyperplane.

<img src="svm_distance_to_hyperplane.png" alt="Drawing" style="width: 500px;"/>

Using trigonometry, and setting $f = x^{(i)} - x_{0}$: 

$
\cos{\theta} = \dfrac{\text{adjacent}}{\text{hypotenuse}}\\
\implies \cos{\theta} = \dfrac{\gamma^{(i)}}{\|f\|}\\
\implies \|f\|\cos{\theta} = \gamma^{(i)}\\
\implies \|w^{*}\|\|f\|\cos{\theta} = \gamma^{(i)} \quad \text{(since } \|w^{*}\| = 1 \text{)}\\
\implies {w^{*}}^{T} f = \gamma^{(i)}\\
\implies \dfrac{w^{T}(x^{(i)} - x_{0})}{\|w\|} = \gamma^{(i)}
$

Since $x_{0}$ lies on the hyperplane, $w^{T}x_{0} + b = 0$, i.e. $w^{T}x_{0} = -b$, so we have:

$
\implies \dfrac{w^{T}x^{(i)} + b}{\|w\|} = \gamma^{(i)}
$
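As a quick numerical check of this formula, here is a small sketch in plain Python. The helper name `signed_distance` and the hyperplane $3x_{1} + 4x_{2} - 5 = 0$ are assumptions chosen so that $\|w\| = 5$ and the arithmetic is easy to follow by hand. The value is positive for points on the side that $w$ points towards and negative on the other side.

```python
import math


def signed_distance(w, b, x):
    """(w^T x + b) / ||w||: signed distance from x to the hyperplane w^T x + b = 0."""
    norm_w = math.sqrt(sum(w_j * w_j for w_j in w))
    return (sum(w_j * x_j for w_j, x_j in zip(w, x)) + b) / norm_w


# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0, so ||w|| = 5.
w = [3.0, 4.0]
b = -5.0

d_point = signed_distance(w, b, [3.0, 4.0])   # (9 + 16 - 5) / 5 = 4
d_origin = signed_distance(w, b, [0.0, 0.0])  # -5 / 5 = -1 (negative side of the hyperplane)
```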

We will see below how this distance relates to the functional and geometric margins.

### Functional and Geometric Margins

Given a training example $(x^{(i)}, y^{(i)})$, we define the functional margin of $(w, b)$ with
respect to that example as:

$\hat{\gamma}^{(i)} = y^{(i)}(w^{T} x^{(i)} + b)$

The functional margin serves as a test of whether a given training point is classified correctly. A training example lies on the correct side of the decision boundary when $\hat{\gamma}^{(i)} > 0$; a value of exactly $0$ means the point sits on the boundary itself.

One problem with the functional margin is that it is affected by arbitrary scaling of $w$ and $b$: replacing $(w, b)$ with $(2w, 2b)$ doubles every functional margin without moving the decision boundary. The functional margin gives a number, but without a reference scale you can't tell whether the point is actually far from or close to the decision boundary. That brings us to the definition of the geometric margin:

$\gamma^{(i)} = \hat{\gamma}^{(i)}/\|w\|$


The geometric margin tells you not only whether the point is correctly classified, but also how far it is from the hyperplane: dividing by $\|w\|$ turns the functional margin into a Euclidean distance. It is invariant to any positive scaling of $w$ and $b$, which will be important later. The geometric margin should look familiar: it is $y^{(i)}$ multiplied by the distance between a point and our hyperplane that we derived in the previous section.

Given a training set 

$S = \{(x^{(i)}, y^{(i)}); i=1 \dotsc n\}$ 

we define the geometric margin of $(w,b)$ with respect to $S$ to be the smallest of the geometric margins on the individual training examples:

$\gamma = \min_{i=1 \dotsc n} \gamma ^ {(i)}$
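The definitions above can be sketched in a few lines of plain Python. The toy training set `S`, the parameters `w` and `b`, and the function names are all made up for illustration. The example also shows the scaling behaviour: multiplying $(w, b)$ by 10 multiplies the functional margin by 10 but leaves the geometric margin unchanged.

```python
import math


def functional_margin(w, b, x, y):
    """y * (w^T x + b)"""
    return y * (sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)


def geometric_margin(w, b, x, y):
    """Functional margin divided by ||w||."""
    return functional_margin(w, b, x, y) / math.sqrt(sum(w_j * w_j for w_j in w))


def dataset_margin(w, b, S):
    """Smallest geometric margin over all training examples in S."""
    return min(geometric_margin(w, b, x, y) for x, y in S)


# Hypothetical 2-D training set: a list of (x, y) pairs.
S = [([2.0, 0.0], 1), ([3.0, 1.0], 1), ([0.0, 0.0], -1), ([-1.0, 1.0], -1)]
w, b = [1.0, 0.0], -1.0

gamma = dataset_margin(w, b, S)            # per-example margins are 1, 2, 1, 2

# Scale (w, b) by 10: the decision boundary x1 = 1 does not move.
w10, b10 = [10.0 * w_j for w_j in w], 10.0 * b
fm = functional_margin(w, b, *S[0])        # 1 * (2 - 1) = 1
fm10 = functional_margin(w10, b10, *S[0])  # 1 * (20 - 10) = 10
gamma10 = dataset_margin(w10, b10, S)      # same as gamma
```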


### The optimal margin classifier

A linear SVM maximises the geometric margin of the training set, subject to all the training examples being correctly classified. This can be formulated as the following optimisation problem:

$\max_{w, b} \gamma \quad$ s.t. $\quad \dfrac{y^{(i)}(w^{T} x^{(i)} + b)}{\|w\|}\geq \gamma \text{ for } i=1 \dotsc n$

If $(w, b)$ satisfies the constraints above, then so does any positively scaled multiple $(cw, cb)$ with $c > 0$, because the geometric margin is invariant to such scaling.

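Therefore, we are free to fix the scale by requiring the smallest functional margin to equal $1$, which gives $\|w\| = \dfrac{1}{\gamma}$. The constraints then become $y^{(i)}(w^{T} x^{(i)} + b) \geq 1$ and the objective becomes maximising $\gamma = \dfrac{1}{\|w\|}$. Also note that maximising $\dfrac{1}{\|w\|}$ is the same as minimising $\|w\|$, which is the same as minimising $\dfrac{1}{2}\|w\|^{2}$.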

Thus, we can reformulate the optimisation problem as:  

$\min_{w, b} \dfrac{1}{2}\|w\|^{2} \quad$ s.t. $\quad y^{(i)}(w^{T} x^{(i)} + b)\geq 1 \text{ for } i=1 \dotsc n$
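The rescaling argument can be checked numerically. The sketch below does not solve the optimisation problem. It takes a hypothetical separating $(w, b)$ on a toy dataset, rescales it so that the smallest functional margin is 1, and confirms that the constraints hold and that $1/\|w\|$ equals the geometric margin. The data, parameter values and function names are assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(u_j * v_j for u_j, v_j in zip(u, v))


def rescale_to_unit_functional_margin(w, b, S):
    """Scale (w, b) so that the smallest functional margin over S equals 1."""
    smallest = min(y * (dot(w, x) + b) for x, y in S)
    if smallest <= 0:
        raise ValueError("(w, b) does not separate S")
    return [w_j / smallest for w_j in w], b / smallest


# Hypothetical 2-D training set and a separating hyperplane at an arbitrary scale.
S = [([2.0, 0.0], 1), ([3.0, 1.0], 1), ([0.0, 0.0], -1), ([-1.0, 1.0], -1)]
w, b = [2.0, 2.0], -2.0

# Geometric margin of (w, b) with respect to S: 2 / sqrt(8).
gamma = min(y * (dot(w, x) + b) for x, y in S) / math.sqrt(dot(w, w))

w_s, b_s = rescale_to_unit_functional_margin(w, b, S)     # ([1, 1], -1)
constraints = [y * (dot(w_s, x) + b_s) for x, y in S]     # every entry is >= 1
inv_norm = 1.0 / math.sqrt(dot(w_s, w_s))                 # equals gamma

# Another feasible candidate with a smaller norm, hence a larger margin.
w_alt, b_alt = [1.0, 0.0], -1.0
constraints_alt = [y * (dot(w_alt, x) + b_alt) for x, y in S]
objective_s = 0.5 * dot(w_s, w_s)        # 1.0
objective_alt = 0.5 * dot(w_alt, w_alt)  # 0.5
```

Both candidates satisfy the constraints, but the second has the smaller objective value, so the reformulated problem prefers it. On this toy set it is in fact the maximum-margin separator, because the closest pair of opposite-class points, $(2, 0)$ and $(0, 0)$, are a distance of 2 apart, so no separator can have a margin larger than 1.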


In a future post I'll detail how the above optimisation problem is solved. 

