# Non-linear SVM


## Kernels

Kernels allow us to make complex, non-linear classifiers using Support Vector Machines.

Given x, we compute new features depending on its proximity to landmarks $l^{(1)}$, $l^{(2)}$, $l^{(3)}$.

To do this, we find the "similarity" of x and each landmark $l^{(i)}$ :

<img src ='files/kernel_rbf.png'>

This "similarity" function is called a Gaussian Kernel. It is a specific example of a kernel.

The similarity function can also be written as follows :

<img src ='files/kernel_rbf2.png'>

There are a couple of properties of the similarity function :

<img src ='files/kernel_rbf3.png'>

In other words, if x and the landmark are close, then the similarity will be close to 1, and if x and the landmark are far away from each other, the similarity will be close to 0.
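
The two limiting cases can be checked numerically. The sketch below assumes the usual form of the Gaussian Kernel, $\exp(-\|x - l\|^2 / (2σ^2))$; the function name `gaussian_kernel` and the sample points are illustrative only.

```python
import math

def gaussian_kernel(x, landmark, sigma=1.0):
    """Similarity exp(-||x - l||^2 / (2 * sigma^2)) between x and a landmark."""
    squared_distance = sum((xi - li) ** 2 for xi, li in zip(x, landmark))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

landmark = [3.0, 5.0]  # illustrative landmark

on_landmark = gaussian_kernel([3.0, 5.0], landmark)  # x == l: exactly 1.0
near = gaussian_kernel([3.1, 5.1], landmark)         # x close to l: close to 1
far = gaussian_kernel([30.0, 50.0], landmark)        # x far from l: close to 0

print(on_landmark, near, far)
```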

Each landmark gives us one feature in our hypothesis :

<img src ='files/rbf_features_hp.png'>

The kernel trick maps data from a low-dimensional space to a higher-dimensional space. In the higher-dimensional space, a linear classifier can solve a problem that is non-linear in the original space.

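$σ^{2}$ is a parameter of the Gaussian Kernel that controls the drop-off of our feature: with a larger $σ^{2}$ the similarity decreases more slowly as x moves away from the landmark, and with a smaller $σ^{2}$ it decreases more sharply.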
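
As a rough illustration of the drop-off, the sketch below evaluates the same point against the same landmark for a few values of $σ^{2}$. It assumes the standard Gaussian Kernel form; the chosen values of `sigma_squared` are arbitrary.

```python
import math

def gaussian_kernel(x, landmark, sigma_squared):
    """Gaussian similarity, parameterized directly by sigma^2."""
    squared_distance = sum((xi - li) ** 2 for xi, li in zip(x, landmark))
    return math.exp(-squared_distance / (2.0 * sigma_squared))

x, landmark = [4.0, 5.0], [3.0, 5.0]  # squared distance between them is 1

# Same point and landmark, three illustrative values of sigma^2.
values = {s2: gaussian_kernel(x, landmark, s2) for s2 in (0.5, 1.0, 3.0)}
for s2, similarity in values.items():
    print(f"sigma^2 = {s2}: similarity = {similarity:.3f}")
# sigma^2 = 0.5: similarity = 0.368
# sigma^2 = 1.0: similarity = 0.607
# sigma^2 = 3.0: similarity = 0.846
```

At a fixed distance, the feature stays closer to 1 as $σ^{2}$ grows, which is the slower drop-off.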

Together with the values inside Θ, the choice of landmarks determines the general shape of the decision boundary.
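
A minimal sketch of how Θ and the landmarks interact, assuming the hypothesis predicts 1 when $θ_0 + θ_1 f_1 + θ_2 f_2 + θ_3 f_3 \geq 0$. The landmark positions and the values in `theta` are made up for illustration.

```python
import math

def gaussian_kernel(x, landmark, sigma_squared=1.0):
    squared_distance = sum((xi - li) ** 2 for xi, li in zip(x, landmark))
    return math.exp(-squared_distance / (2.0 * sigma_squared))

def predict(x, landmarks, theta, sigma_squared=1.0):
    """Return 1 if theta_0 + sum_i theta_i * f_i >= 0, else 0."""
    features = [gaussian_kernel(x, l, sigma_squared) for l in landmarks]
    score = theta[0] + sum(t * f for t, f in zip(theta[1:], features))
    return 1 if score >= 0 else 0

# Hypothetical landmarks and parameters: theta_3 = 0 means l3 has no influence.
landmarks = [[1.0, 1.0], [4.0, 4.0], [8.0, 1.0]]
theta = [-0.5, 1.0, 1.0, 0.0]

near_l1 = predict([1.2, 0.9], landmarks, theta)  # f_1 is close to 1
on_l3 = predict([8.0, 1.0], landmarks, theta)    # only f_3 is large, but theta_3 = 0
print(near_l1, on_l3)
```

With these values, points near $l^{(1)}$ or $l^{(2)}$ are predicted positive and everything else is predicted negative, so the boundary wraps around those two landmarks.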




## Kernels II

One way to get the landmarks is to put them in the exact same locations as all the training examples. This gives us m landmarks, with one landmark per training example.

Given example x:

<img src ='files/f_features.png'>

This gives us a "feature vector" $f_{i}$ containing all our features for example $x_{i}$. We may also set $f_{0} = 1$ to correspond with $θ_0$.
Thus, given training example $x_{i}$ :

<img src ='files/f_features_xi.png'>
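
The construction can be sketched as follows, assuming a Gaussian Kernel and landmarks placed at the training examples. `feature_vector` and the tiny `X_train` set are illustrative names and data, not part of the original notes.

```python
import math

def gaussian_kernel(a, b, sigma_squared=1.0):
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / (2.0 * sigma_squared))

def feature_vector(x, training_examples, sigma_squared=1.0):
    """Return [f_0, f_1, ..., f_m] with f_0 = 1 and f_j = similarity(x, l_j)."""
    similarities = [gaussian_kernel(x, l, sigma_squared) for l in training_examples]
    return [1.0] + similarities

# Illustrative training set with m = 3 examples; each example is also a landmark.
X_train = [[1.0, 2.0], [2.0, 1.0], [6.0, 7.0]]

# One feature vector of length m + 1 per training example.
F = [feature_vector(x, X_train) for x in X_train]
for row in F:
    print([round(value, 3) for value in row])
```

Each example sits exactly on its own landmark, so the corresponding entry of its feature vector is 1.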

Now to get the parameters Θ we can use the SVM minimization algorithm but with $f_{i}$ substituted in for $x_{i}$ :

<br>
<img src ='files/cost_fi.png'>
<br>
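
A hedged sketch of this objective, assuming it has the usual form $C \sum_{i} \left[ y_i \, cost_1(Θ^T f_i) + (1 - y_i) \, cost_0(Θ^T f_i) \right] + \frac{1}{2} \sum_{j \geq 1} θ_j^2$. The hinge-style definitions of `cost1` and `cost0` below are a common choice and an assumption here, and the data is made up.

```python
def cost1(z):
    """Cost for a positive example (y = 1): zero once z >= 1."""
    return max(0.0, 1.0 - z)

def cost0(z):
    """Cost for a negative example (y = 0): zero once z <= -1."""
    return max(0.0, 1.0 + z)

def svm_cost(theta, F, y, C):
    """SVM objective computed on kernel features F, one row per example."""
    data_term = 0.0
    for f_i, y_i in zip(F, y):
        z = sum(t * f for t, f in zip(theta, f_i))
        data_term += y_i * cost1(z) + (1 - y_i) * cost0(z)
    # theta_0 is conventionally left out of the regularization term.
    regularization = 0.5 * sum(t ** 2 for t in theta[1:])
    return C * data_term + regularization

# Illustrative feature vectors (f_0 = 1 first) and labels.
F = [[1.0, 1.0, 0.1], [1.0, 0.1, 1.0]]
y = [1, 0]

print(svm_cost([0.0, 0.0, 0.0], F, y, C=1.0))   # every example costs 1: 2.0
print(svm_cost([0.0, 2.0, -2.0], F, y, C=1.0))  # margins met, only regularization: 4.0
```

This only evaluates the objective; in practice an SVM library performs the minimization.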

Using kernels to generate $f_{i}$ is not exclusive to SVMs and may also be applied to logistic regression. 

However, because of computational optimizations specific to SVMs, kernels combined with SVMs are much faster than with other algorithms.

<img src ='files/svm_params.png'>

## Using An SVM

In practical application, the choices you need to make are (n is the number of features, m the number of training examples):
- Choice of parameter C
- Choice of kernel (similarity function) :
    - No kernel ("linear" kernel), which gives a standard linear classifier. Choose it when n is large and m is small.
    - Gaussian Kernel (above), where you need to choose σ². Choose it when n is small and m is large (but not above 50K).

Note: perform feature scaling before using the Gaussian Kernel, to improve convergence and to give each feature the same importance in the distance between x and a landmark.
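
One possible way to scale features is standardization (zero mean, unit standard deviation per feature). This is a sketch of one option, not the only valid scaling; the `standardize` helper and the two-feature data set are hypothetical.

```python
import math

def standardize(X):
    """Scale each column of X to zero mean and unit standard deviation."""
    m = len(X)
    columns = list(zip(*X))
    means = [sum(column) / m for column in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [
        math.sqrt(sum((v - mu) ** 2 for v in column) / m) or 1.0
        for column, mu in zip(columns, means)
    ]
    return [[(v - mu) / s for v, mu, s in zip(row, means, stds)] for row in X]

# Hypothetical data: the first feature has a range 500 times larger than the second.
X = [[1000.0, 1.0], [2000.0, 3.0], [3000.0, 5.0]]
X_scaled = standardize(X)
for row in X_scaled:
    print([round(v, 3) for v in row])
```

Without scaling, the squared distance between the first two rows is dominated by the first feature (1000² versus 2²); after scaling, both features contribute equally.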

Note: not all similarity functions are valid kernels. They must satisfy the condition of "Mercer's Theorem", which guarantees that the SVM package's optimizations run correctly and do not diverge.

### Multi-class Classification

You can use the one-vs-all method, just like we did for logistic regression, where we train k SVMs, one for each class vs. the others :

<img src ='files/multi_svm.png'>
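
The prediction step of one-vs-all can be sketched as picking the class whose classifier gives the largest score $(θ^{(k)})^T x$. The parameter vectors in `thetas` are invented for illustration; in practice each would come from training one SVM per class.

```python
def predict_one_vs_all(x, thetas):
    """Return the index k of the classifier with the largest theta_k^T x."""
    x_with_bias = [1.0] + list(x)  # prepend x_0 = 1 for the bias term
    scores = [
        sum(t * v for t, v in zip(theta, x_with_bias))
        for theta in thetas
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical parameters for k = 3 classes, each as [theta_0, theta_1, theta_2].
thetas = [
    [0.0, 1.0, 0.0],    # class 0 scores high when x_1 is large
    [0.0, 0.0, 1.0],    # class 1 scores high when x_2 is large
    [1.0, -1.0, -1.0],  # class 2 scores high when both are small
]

print(predict_one_vs_all([3.0, 0.5], thetas))  # 0
print(predict_one_vs_all([0.2, 2.0], thetas))  # 1
print(predict_one_vs_all([0.1, 0.1], thetas))  # 2
```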

### Logistic Regression vs. SVMs

If n is large (relative to m), then use logistic regression, or an SVM without a kernel (the "linear kernel") : with so few training examples, a simpler model is less likely to overfit the data.

If n is small (less than 1k) and m is intermediate (between 10 and 50K), then use SVM with a Gaussian Kernel : you can afford a more complicated hypothesis.

If n is small and m is large (more than 100K), then you need to manually create/add more features, then use logistic regression or an SVM without a kernel.

In the first case, we don't have enough examples to need a complicated polynomial hypothesis. In the second case, we have enough examples that we may need a complex non-linear hypothesis. In the last case, we want to increase the number of features so that logistic regression becomes applicable.

End note : a neural network is likely to work well for any of these situations, but may be slower to train.



