# Support vector machine

## Optimization Objective


To build a support vector machine, we start from the cost function of logistic regression and modify its first term (the one that is active when y = 1) so that it outputs 0 whenever $θ^Tx$ is greater than or equal to 1.

For values less than 1, we replace the smooth logistic curve with a straight decreasing line (this is the hinge loss function).

Similarly, we modify the second term of the cost function (the one that is active when y = 0) so that it outputs 0 whenever $θ^Tx$ is less than or equal to -1.

For values greater than -1, we replace the smooth logistic curve with a straight increasing line.
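The two modified terms can be sketched in a few lines of plain Python. The names `cost1` and `cost0` and the slope of 1 for the straight lines are assumptions made for illustration (they correspond to the usual hinge loss); the notes above only require a straight line that reaches 0 at 1 and at -1 respectively.

```python
def cost1(z):
    """Cost of a positive example (y = 1): 0 once z >= 1, a decreasing line below 1."""
    return max(0.0, 1.0 - z)


def cost0(z):
    """Cost of a negative example (y = 0): 0 once z <= -1, an increasing line above -1."""
    return max(0.0, 1.0 + z)


# z stands for theta^T x; the slope of 1 is an assumption, not taken from the notes.
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"z = {z:+.1f}  cost1 = {cost1(z):.1f}  cost0 = {cost0(z):.1f}")
```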

<img src='files/svm_cost.png'>

The cost function of the linear SVM is as follows:

<img src='files/cost_functionSVM.png'>

By convention, the SVM is regularized with a factor C instead of λ, where C plays the role of 1 / λ.

When we wish to regularize more (that is, reduce overfitting), we decrease C, and when we wish to regularize less (that is, reduce underfitting), we increase C.

Finally, note that the hypothesis of the Support Vector Machine is not interpreted as the probability of y being 1 or 0 (as it is for the hypothesis of logistic regression). Instead, it directly outputs a class label, either 1 or 0:

<img src='files/svm_hp.png'>
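A minimal sketch of such a hypothesis, assuming the usual threshold at $θ^Tx = 0$ (the exact rule is given in the image above and is not reproduced here). The function name `predict` and the convention that `x` starts with a constant 1 for the intercept are illustrative assumptions.

```python
def predict(theta, x):
    """Return the class label 1 or 0 instead of a probability.

    Assumption: x[0] is the constant 1 that multiplies the intercept theta[0],
    and the label is 1 when theta^T x >= 0.
    """
    z = sum(t * x_j for t, x_j in zip(theta, x))
    return 1 if z >= 0 else 0


theta = [-1.0, 2.0]               # hypothetical parameters
print(predict(theta, [1.0, 1.0]))  # theta^T x = 1, label 1
print(predict(theta, [1.0, 0.0]))  # theta^T x = -1, label 0
```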

## The large margin classifier

<br>
<img src='files/cost_functionSVM.png'>

Suppose C is very large. To minimize the SVM cost function, we then need to make the non-regularized part of the equation as small as possible, ideally zero.

In that case, whenever a training example has the label y = 1, we want the first term to be zero, so we need to find a value of θ such that $θ^Tx$ is greater than or equal to 1.

Similarly, whenever an example has the label y = 0, we want the second term to be zero, so we need $θ^Tx$ to be less than or equal to -1.

Under these constraints, the SVM chooses the decision boundary that maximizes the distance between the two classes (this distance is called the margin of the support vector machine). This gives the SVM a certain robustness, because it separates the data with as large a margin as possible.

In practice, when C is not very large, the SVM does a better job of ignoring a few outliers, and it also behaves reasonably when the data is not linearly separable.
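The effect of C on outliers can be illustrated with a small hand-made comparison. This is only a sketch under stated assumptions: the data set, the two candidate parameter vectors `wide` and `narrow`, and the function name `objective` are invented for illustration, the straight lines of the cost terms are given a slope of 1, and the code compares two fixed candidates rather than running an actual optimizer.

```python
def objective(theta, X, y, C):
    """C * (sum of per-example costs) + 1/2 * sum of theta_j^2 for j >= 1.

    theta[0] is the intercept and is not regularized.
    """
    data_term = 0.0
    for x_i, y_i in zip(X, y):
        z = theta[0] + sum(t * x for t, x in zip(theta[1:], x_i))
        # y = 1 is penalized when z < 1, y = 0 is penalized when z > -1
        data_term += max(0.0, 1.0 - z) if y_i == 1 else max(0.0, 1.0 + z)
    return C * data_term + 0.5 * sum(t * t for t in theta[1:])


# One feature; the last positive example (x = 0) sits close to the negatives.
X = [[2.0], [3.0], [-2.0], [-3.0], [0.0]]
y = [1, 1, 0, 0, 1]

wide = [0.0, 0.5]    # large margin, but the outlier violates theta^T x >= 1
narrow = [1.0, 1.0]  # satisfies every constraint, with a smaller margin

for C in (100.0, 0.1):
    better = "narrow" if objective(narrow, X, y, C) < objective(wide, X, y, C) else "wide"
    print(f"C = {C}: lower objective for the {better} candidate")
```

With the large C, the candidate that satisfies every constraint has the lower objective; with the small C, the wide-margin candidate that ignores the outlier does.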



## Maths behind SVM

The length of a vector v is denoted $\left\|v\right\|$. It is the length of the line segment drawn on a graph from the origin $(0,0)$ to the point ($v_{1}$, $v_{2}$), and it can be calculated with the Pythagorean theorem: $\left\|v\right\| = \sqrt{v_{1}^2 + v_{2}^2}$.

The projection of vector v onto vector u is found by dropping a perpendicular from the end of v onto u, which creates a right triangle:

p = length of the projection of v onto the vector u

$u^Tv = v^Tu = p \cdot \left\|u\right\| = u_{1}v_{1} + u_{2}v_{2}$

If the angle between v and u is greater than 90 degrees, then the projection p is negative.
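The length and the signed projection can be checked numerically. The helper names `dot`, `norm` and `signed_projection` and the example vectors are assumptions chosen for illustration.

```python
import math


def dot(u, v):
    """Inner product u^T v."""
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    """Length of v, from the Pythagorean theorem."""
    return math.sqrt(dot(v, v))


def signed_projection(v, u):
    """Signed length p of the projection of v onto u, from u^T v = p * ||u||."""
    return dot(u, v) / norm(u)


u = [2.0, 0.0]
v_acute = [3.0, 4.0]    # angle with u is less than 90 degrees
v_obtuse = [-3.0, 4.0]  # angle with u is greater than 90 degrees

print(norm(v_acute))                   # 5.0
print(signed_projection(v_acute, u))   # 3.0
print(signed_projection(v_obtuse, u))  # -3.0
```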


\begin{align*} \min_\theta & \frac{1}{2}\sum_{j=1}^n\theta_j^2\\ \mbox{s.t.} & \|\theta\|\cdot p^{(i)} \geq 1\quad \mbox{if}\ y^{(i)} = 1\\ & \|\theta\|\cdot p^{(i)} \leq -1\quad \mbox{if}\ y^{(i)} = 0 \end{align*}

where $p^{(i)}$
  is the (signed - positive or negative) projection of $x^{(i)}$
  onto $\theta$.
  
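The objective can be rewritten in terms of the length of $\theta$:

\begin{align*} \frac{1}{2}\sum_{j=1}^n\theta_j^2 &= \frac{1}{2}\left(\theta_1^2 + \theta_2^2 + \dots + \theta_n^2\right)\\ &= \frac{1}{2}\left(\sqrt{\theta_1^2 + \theta_2^2 + \dots + \theta_n^2}\right)^2 = \frac{1}{2}\|\theta\|^2 \end{align*}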


Using the projection rule, we can rewrite $\theta^Tx^{(i)}$:

\begin{align*} \theta^Tx^{(i)} = p^{(i)} \cdot \|\theta\| = \theta_{1}x^{(i)}_{1} + \theta_{2}x^{(i)}_{2} + \dots + \theta_{n}x^{(i)}_{n} \end{align*}

If y=1, we want:
\begin{align*} p^{(i)} \cdot \|\theta\| \geq 1 \end{align*}
If y=0, we want:
\begin{align*} p^{(i)} \cdot \|\theta\| \leq -1 \end{align*}
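As a quick numeric check of this rewriting, the sketch below computes $p^{(i)}$ for a few points and tests the two constraints. The parameter vector, the example points and the function names are hypothetical, and the intercept is taken to be 0 so that $\theta$ and $x^{(i)}$ have the same number of components.

```python
import math


def dot(u, v):
    """Inner product u^T v."""
    return sum(a * b for a, b in zip(u, v))


def projection_onto_theta(x, theta):
    """Signed projection p of x onto theta, so that theta^T x = p * ||theta||."""
    return dot(theta, x) / math.sqrt(dot(theta, theta))


def satisfies_margin(theta, x, y):
    """True when theta^T x >= 1 for y = 1, or theta^T x <= -1 for y = 0."""
    z = dot(theta, x)
    return z >= 1 if y == 1 else z <= -1


theta = [3.0, 4.0]  # hypothetical parameters, intercept assumed to be 0
examples = [([1.0, 1.0], 1), ([-1.0, -1.0], 0), ([0.1, 0.1], 1)]

for x, y in examples:
    p = projection_onto_theta(x, theta)
    print(f"x = {x}, y = {y}, p = {p:.2f}, constraint met: {satisfies_margin(theta, x, y)}")
```

The third point has a positive but small projection, so $p^{(i)} \cdot \|\theta\|$ stays below 1 and the constraint is not met for this $\theta$.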


<br>
<img src='files/svm_db1.png'>
<br>

This leads to a "large margin" because the vector $\theta$ is perpendicular to the decision boundary (a standard result from linear algebra). Since we minimize $\|\theta\|$ while requiring $|p^{(i)}| \cdot \|\theta\| \geq 1$, the projections $p^{(i)}$ of the $x^{(i)}$ onto $\theta$ must be as large as possible in absolute value. In the image above, the $p^{(i)}$ are small, so the margin between the decision boundary and the $x^{(i)}$ is very small too, and $\|\theta\|$ would have to be large to satisfy the constraints.
The picture below shows the opposite case:

<br>
<img src='files/svm_db2.png'>
<br>
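As a rough numeric illustration (the values are made up, not read from the figures), take a positive example, for which the constraint $p^{(i)} \cdot \|\theta\| \geq 1$ is equivalent to $\|\theta\| \geq 1 / p^{(i)}$:

\begin{align*} p^{(i)} = 0.1 \quad &\Rightarrow \quad \|\theta\| \geq 10\\ p^{(i)} = 2 \quad &\Rightarrow \quad \|\theta\| \geq 0.5 \end{align*}

A boundary that gives larger projections therefore allows a much smaller $\|\theta\|$, which is exactly what minimizing $\frac{1}{2}\|\theta\|^2$ favors.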

If $\theta_{0} = 0$, then every decision boundary passes through the origin $(0,0)$. If $\theta_{0} \neq 0$, the support vector machine will still find a large margin for the decision boundary.

This whole argument assumes a large value of C, the regularization parameter.