Support Vector Machines (SVMs) are very powerful models, capable of performing linear or nonlinear classification, regression, and even outlier detection. SVMs are particularly well suited for classification of complex but small- or medium-sized datasets.

# Linear SVM Classification
You can think of an SVM classifier as fitting the widest possible street between the classes. This is called **large margin classification**.
![fig 5-1](images/5-1.png)
Notice that adding more training instances "off the street" will not affect the decision boundary at all: it is fully determined (or "supported") by the instances located on the edge of the street. These instances are called the **support vectors**.

SVMs are sensitive to the feature scales, as you can see in fig 5-2.
![fig 5-2](images/5-2.png)
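
To make the scaling issue concrete, here is a minimal sketch on a tiny made-up dataset whose second feature spans a much larger range than the first. The data values, the variable names, and the choice of `SVC(kernel="linear", C=100)` are assumptions for illustration only; they are not taken from fig 5-2.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Tiny illustrative dataset (made-up values): the second feature
# varies over a much larger range than the first one.
X_raw = np.array([[1, 50], [5, 20], [3, 80], [5, 60]], dtype=np.float64)
y_toy = np.array([0, 0, 1, 1])

# Linear SVM trained directly on the raw features.
svm_raw = SVC(kernel="linear", C=100).fit(X_raw, y_toy)

# Same model trained after standardizing each feature
# (zero mean and unit variance per column).
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)
svm_scaled = SVC(kernel="linear", C=100).fit(X_scaled, y_toy)

print("feature std before scaling:", X_raw.std(axis=0))
print("feature std after scaling: ", X_scaled.std(axis=0))
print("weights on raw features:   ", svm_raw.coef_[0])
print("weights on scaled features:", svm_scaled.coef_[0])
```

Comparing the two weight vectors shows how the orientation of the street changes once both features live on the same scale.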

# Soft Margin Classification
If we strictly impose that all instances be off the street and on the right side, this is called **hard margin classification**. Hard margin classification has two main issues:
- It only works if the data is linearly separable.
- It is quite sensitive to outliers.

![fig 5-3](images/5-3.png)

To avoid these issues, it is preferable to use a more flexible model. The objective is to find a good balance between keeping the street as large as possible and limiting the margin violations (instances that end up in the middle of the street or even on the wrong side). This is called **soft margin classification**.

**In Scikit-Learn's SVM classes, you can control this balance using the C hyperparameter: a smaller C value leads to a wider street but more margin violations.** Fig 5-4 shows the decision boundaries and margins of two soft margin SVM classifiers on a nonlinearly separable dataset. On the left, using a high C value the classifier makes fewer margin violations but ends up with a smaller margin. On the right, using a low C value the margin is much larger, but many instances end up on the street. However, it seems likely that the second classifier will generalize better: in fact even on this training set it makes fewer prediction errors, since most of the margin violations are actually on the correct side of the decision boundary.
![fig 5-4](images/5-4.png)

**If your SVM model is overfitting, you can try regularizing it by reducing C**.
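
The effect of `C` on the street width can be checked numerically. The sketch below trains the same scaled `LinearSVC` pipeline with `C=1` and `C=100` and reports the margin width `2 / ||w||`, measured in the scaled feature space. The helper name `margin_width`, the `max_iter` and `random_state` settings, and the two `C` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X_iris = iris["data"][:, (2, 3)]                    # petal length, petal width
y_iris = (iris["target"] == 2).astype(np.float64)   # 1.0 if Iris-Virginica


def margin_width(C):
    """Train a scaled linear SVM (hypothetical helper) and return 2 / ||w||."""
    clf = Pipeline([
        ("scaler", StandardScaler()),
        ("linear_svc", LinearSVC(C=C, loss="hinge", max_iter=100_000,
                                 random_state=42)),
    ])
    clf.fit(X_iris, y_iris)
    w = clf.named_steps["linear_svc"].coef_[0]
    return 2.0 / np.linalg.norm(w)


width_low_C = margin_width(1)
width_high_C = margin_width(100)
print(f"street width with C=1:   {width_low_C:.3f}")
print(f"street width with C=100: {width_high_C:.3f}")
```

A smaller `C` penalizes margin violations less, so the optimizer can afford a smaller weight vector and therefore a wider street.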

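The code below trains the model shown on the right of fig 5-4: it loads the iris dataset, scales the features, and trains a linear SVM (`LinearSVC` with `C=1` and the hinge loss) to detect Iris-Virginica flowers.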

In [1]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris['data'][:,(2,3)] # petal length, petal width
y = (iris['target']==2).astype(np.float64) # iris-virginica

# Pipeline steps are passed as a list of (name, estimator) pairs
svm_clf = Pipeline([
            ('scaler', StandardScaler()),
            ('linear_svc', LinearSVC(C=1, loss='hinge')),
        ])

svm_clf.fit(X, y)

Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linear_svc', LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0))])

In [2]:
svm_clf.predict([[5.5, 1.7]])

array([ 1.])

**Alternatively, we could use the `SVC` class, using `SVC(kernel='linear', C=1)`, but it is much slower, especially with large training sets, so it is not recommended. Another option is to use the `SGDClassifier` class, with `SGDClassifier(loss='hinge', alpha=1/(m*C))`, where `m` is the number of training instances. This applies regular Stochastic Gradient Descent to train a linear SVM classifier. It does not converge as fast as the `LinearSVC` class, but it can be useful to handle huge datasets that do not fit in memory, or to handle online classification tasks.**

**The `LinearSVC` class regularizes the bias term, so you should center the training set first by subtracting its mean. This is automatic if you scale the data using the `StandardScaler`. Moreover, make sure you set the `loss` hyperparameter to `'hinge'`, as it is not the default value. Finally, for better performance you should set the `dual` hyperparameter to `False`, unless there are more features than training instances.** Note that Scikit-Learn does not support `dual=False` together with `loss='hinge'`, so this last tip only applies when you keep the default squared hinge loss.
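
The three options mentioned above can be compared side by side. This is a minimal sketch, assuming the same iris subset used earlier; the `max_iter`, `tol`, and `random_state` values are illustrative choices, and training accuracy is used only as a quick sanity check, not as an estimate of generalization.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

iris = datasets.load_iris()
X_iris = iris["data"][:, (2, 3)]                    # petal length, petal width
y_iris = (iris["target"] == 2).astype(np.float64)   # 1.0 if Iris-Virginica

# Scaling also centers the data, which matters because LinearSVC
# regularizes the bias term.
X_scaled = StandardScaler().fit_transform(X_iris)

C = 1
m = len(X_scaled)       # number of training instances
alpha = 1 / (m * C)     # SGD regularization strength matching C

models = {
    "LinearSVC": LinearSVC(C=C, loss="hinge", max_iter=10_000, random_state=42),
    "SVC(kernel='linear')": SVC(kernel="linear", C=C),
    "SGDClassifier": SGDClassifier(loss="hinge", alpha=alpha, max_iter=1000,
                                   tol=1e-3, random_state=42),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_scaled, y_iris)
    accuracies[name] = model.score(X_scaled, y_iris)
    print(f"{name:22s} training accuracy: {accuracies[name]:.3f}")
```

The three models optimize closely related objectives, so their decision boundaries are usually similar, though not identical.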

# Nonlinear SVM Classification
One approach to handling nonlinear datasets is to add more features, such as polynomial features; in some cases this results in a linearly separable dataset (see fig 5-5).

![fig 5-5](images/5-5.png)
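Here is an example using a `Pipeline` that chains a `PolynomialFeatures` transformer, a `StandardScaler`, and a `LinearSVC`. The result is shown in fig 5-6.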

In [3]:
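from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Generate the moons dataset (the original cell omitted this call,
# so these make_moons arguments are assumed)
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# Pipeline steps are passed as a list of (name, estimator) pairs
polynomial_svm_clf = Pipeline([
        ('poly_features', PolynomialFeatures(degree=3)),
        ('scaler', StandardScaler()),
        ('svm_clf', LinearSVC(C=10, loss='hinge'))
        ])
polynomial_svm_clf.fit(X, y)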

Pipeline(memory=None,
     steps=[('poly_features', PolynomialFeatures(degree=3, include_bias=True, interaction_only=False)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', LinearSVC(C=10, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0))])


![fig 5-6](images/5-6.png)
## Polynomial Kernel
**Adding polynomial features is simple to implement and can work great with all sorts of ML algorithms, but at a low polynomial degree it cannot deal with very complex datasets, and with a high polynomial degree it creates a huge number of features, making the model too slow**.
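
To get a feel for how fast the number of features grows, the sketch below counts the columns produced by `PolynomialFeatures` for a hypothetical dataset with 10 input features. The dummy array and the chosen degrees are assumptions for illustration; with the bias column included, the count equals the binomial coefficient $\binom{n+d}{d}$ for $n$ features and degree $d$.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 10
X_dummy = np.zeros((1, n_features))   # values do not matter, only the shape

feature_counts = {}
for degree in (2, 3, 10):
    poly = PolynomialFeatures(degree=degree).fit(X_dummy)
    feature_counts[degree] = poly.n_output_features_
    # Closed-form count, including the bias column added by default.
    expected = comb(n_features + degree, degree)
    print(f"degree={degree:2d}: {feature_counts[degree]} features "
          f"(formula gives {expected})")
```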

Fortunately, when using SVMs you can apply an almost miraculous mathematical technique called the **kernel trick**. **It makes it possible to get the same result as if you added many polynomial features, without actually having to add them.**
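
The idea behind the trick can be verified on a toy case. The sketch below uses a simplified 2nd-degree polynomial kernel $K(a, b) = (a^T b)^2$ on 2D vectors and an explicit mapping `phi`; both are assumptions chosen for illustration and are not the exact kernel `SVC` uses (which also includes `gamma` and `coef0`). It checks that the dot product of the mapped vectors equals the kernel computed on the original vectors.

```python
import numpy as np


def phi(x):
    """Explicit 2nd-degree polynomial mapping of a 2D vector (illustrative)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])


a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

# Dot product computed in the transformed 3D space.
explicit = phi(a) @ phi(b)

# Same quantity computed directly from the original 2D vectors.
kernel = (a @ b) ** 2

print("phi(a) . phi(b) =", explicit)
print("(a . b)^2       =", kernel)
```

Because the two values agree, an algorithm that only needs dot products can work in the original space and never build the mapped features.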

This trick is implemented by the `SVC` class. Let's test it on the moons dataset:

In [4]:
from sklearn.svm import SVC
# Pipeline steps are passed as a list of (name, estimator) pairs
poly_kernel_svm_clf = Pipeline([
        ('scaler', StandardScaler()),
        ('svm_clf', SVC(kernel='poly', degree=3, coef0=1, C=5))
])
poly_kernel_svm_clf.fit(X,y)

Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', SVC(C=5, cache_size=200, class_weight=None, coef0=1,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

This code trains an SVM classifier using a $3^{rd}$-degree polynomial kernel. It is represented on the left of fig 5-7. On the right is another SVM classifier using a $10^{th}$-degree polynomial kernel. **Obviously, if your model is overfitting, you might want to reduce the polynomial degree. Conversely, if it is underfitting, you can try increasing it. The hyperparameter `coef0` controls how much the model is influenced by high-degree polynomials versus low-degree polynomials. A common approach to finding the right hyperparameter values is to use grid search.**
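
A grid search over the polynomial kernel's hyperparameters could look like the sketch below. The moons dataset settings, the grid values, and the 3-fold cross-validation are all assumptions chosen to keep the example small; a real search would likely use a finer grid.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumed dataset settings, for illustration only.
X_moons, y_moons = make_moons(n_samples=100, noise=0.15, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly")),
])

# Pipeline hyperparameters are addressed as <step name>__<parameter>.
param_grid = {
    "svm_clf__degree": [2, 3, 5],
    "svm_clf__coef0": [0, 1, 10],
    "svm_clf__C": [0.1, 1, 5],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
grid_search.fit(X_moons, y_moons)

print("best hyperparameters:", grid_search.best_params_)
print(f"best cross-validated accuracy: {grid_search.best_score_:.3f}")
```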

![fig 5-7](images/5-7.png)

## Adding Similarity Features
**Another technique to tackle nonlinear problems is to add features computed using a similarity function that measures how much each instance resembles a particular landmark.**

For example, take the one-dimensional dataset discussed earlier and add two landmarks to it at $x_1=-2$ and $x_1=1$, **so $l_1=-2$ and $l_2=1$**. Next, let's define the similarity function to be the **Gaussian Radial Basis Function (RBF)** with $\gamma=0.3$.

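**Note:** A radial basis function (RBF) is a scalar function that is symmetric along the radial direction. It is usually defined as a monotonic function of the Euclidean distance between any point x in space and some center xc, written k(||x-xc||). Its effect tends to be local: the function value is very small when x is far from xc. The most commonly used radial basis function is the Gaussian kernel, of the form k(||x-xc||)=exp{-||x-xc||^2/(2*σ^2)}, where xc is the center of the kernel and σ is the width parameter, which controls the radial range of influence of the function. This matches Equation 5-1 with γ = 1/(2*σ^2).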

*Equation 5-1. Gaussian RBF*
$$\phi_\gamma(x, l) = \exp\left(-\gamma \left\| x - l \right\|^2\right)$$

The Gaussian RBF is a bell-shaped function varying from 0 (very far from the landmark) to 1 (at the landmark). Now we are ready to compute the new features. Look at the instance $x_1=-1$: it is located at a distance of 1 from the first landmark and 2 from the second landmark. Therefore its new features are $x_2=\exp(-0.3 \times \left\|-1-(-2)\right\|^2) \approx 0.74$ and $x_3=\exp(-0.3 \times \left\|-1-1\right\|^2) \approx 0.30$. See fig 5-8.

![fig 5-8](images/5-8.png)
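
The worked example above can be reproduced in a few lines of NumPy. The function name `gaussian_rbf` and the variable names are assumptions for illustration; the landmarks, the instance, and $\gamma$ are the values used in the text.

```python
import numpy as np


def gaussian_rbf(x, landmark, gamma):
    """Gaussian RBF similarity between an instance and a landmark."""
    return np.exp(-gamma * np.linalg.norm(x - landmark) ** 2)


gamma = 0.3
x = np.array([-1.0])                               # the instance x1 = -1
landmarks = [np.array([-2.0]), np.array([1.0])]    # l1 = -2, l2 = 1

# New features: similarity of the instance to each landmark.
x2, x3 = (gaussian_rbf(x, landmark, gamma) for landmark in landmarks)

print(f"x2 = {x2:.2f}")
print(f"x3 = {x3:.2f}")
```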

#### How to select the landmarks?
The simplest way is to create a landmark at the location of each and every instance in the dataset. This creates many dimensions and thus increases the chances that the transformed training set will be linearly separable. The downside is that a training set with m instances and n features gets transformed into a training set with m instances and m features (assuming you drop the original features). If the training set is very large, you end up with an equally large number of features.
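
Placing a landmark at every instance amounts to computing the pairwise RBF similarity matrix of the training set. The sketch below does this with `sklearn.metrics.pairwise.rbf_kernel`; the moons dataset settings and the $\gamma$ value are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

# Assumed dataset settings, for illustration only.
X_moons, _ = make_moons(n_samples=100, noise=0.15, random_state=42)
m, n = X_moons.shape

# Each column j holds the similarity of every instance to landmark j,
# where landmark j is simply training instance j.
X_similarity = rbf_kernel(X_moons, X_moons, gamma=0.3)

print(f"original training set:    {m} instances, {n} features")
print(f"transformed training set: {X_similarity.shape[0]} instances, "
      f"{X_similarity.shape[1]} features")
```

The transformed set has as many features as instances, which is why this approach becomes expensive on large training sets.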

## Gaussian RBF Kernel
The similarity features method can be useful with any ML algorithm, but it may be computationally expensive to compute all the additional features, especially on large training sets. However, **the kernel trick makes it possible to obtain a similar result as if you had added many similarity features, without actually having to add them.** Let's try the Gaussian RBF kernel using the `SVC` class:

In [5]:
# Pipeline steps are passed as a list of (name, estimator) pairs
rbf_kernel_svm_clf = Pipeline([
        ('scaler', StandardScaler()),
        ('svm_clf', SVC(kernel='rbf', gamma=5, C=0.001))
])
rbf_kernel_svm_clf.fit(X,y)

Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=5, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

![fig 5-9](images/5-9.png)
- $\gamma$: Increasing $\gamma$ makes the bell-shaped curve narrower, so each instance's range of influence is smaller and the decision boundary ends up more irregular, wiggling around individual instances. Conversely, decreasing $\gamma$ makes the curve wider, so each instance has a larger range of influence and the decision boundary ends up smoother. So $\gamma$ acts like a regularization hyperparameter: **if your model is overfitting, reduce it; if it is underfitting, increase it**.
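
One rough way to observe this behavior is to train the same RBF pipeline with several `gamma` and `C` values and compare training accuracy, as in the sketch below. The dataset settings and the grid of values are assumptions for illustration, and training accuracy is only a crude proxy for how tightly the boundary hugs the training instances.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumed dataset settings, for illustration only.
X_moons, y_moons = make_moons(n_samples=100, noise=0.15, random_state=42)

train_accuracy = {}
for gamma in (0.1, 5):
    for C in (0.001, 1000):
        clf = Pipeline([
            ("scaler", StandardScaler()),
            ("svm_clf", SVC(kernel="rbf", gamma=gamma, C=C)),
        ])
        clf.fit(X_moons, y_moons)
        # Accuracy on the training set itself: higher means a tighter fit.
        train_accuracy[(gamma, C)] = clf.score(X_moons, y_moons)
        print(f"gamma={gamma:<4} C={C:<6} "
              f"training accuracy={train_accuracy[(gamma, C)]:.2f}")
```

A model that fits the training set almost perfectly with a large `gamma` is a candidate for overfitting, so it is worth checking it with cross-validation before trusting it.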

#### How to select kernels?
As a rule of thumb, always try the linear kernel first (`LinearSVC` is much faster than `SVC(kernel='linear')`), especially if the training set is very large or if it has plenty of features. If the training set is not too large, try the Gaussian RBF kernel as well; it works well in most cases. Only after that, experiment with other kernels.
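
This rule of thumb can be turned into a small comparison loop. The sketch below cross-validates a linear SVM and an RBF-kernel SVM on the moons dataset; the dataset settings, hyperparameter values, and 5-fold cross-validation are assumptions for illustration, not tuned choices.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Assumed dataset settings, for illustration only.
X_moons, y_moons = make_moons(n_samples=200, noise=0.15, random_state=42)

# Candidates in the order suggested by the rule of thumb:
# linear kernel first, then the Gaussian RBF kernel.
candidates = {
    "linear": LinearSVC(C=1, loss="hinge", max_iter=10_000, random_state=42),
    "rbf": SVC(kernel="rbf", gamma="scale", C=1),
}

cv_scores = {}
for name, model in candidates.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("svm_clf", model)])
    cv_scores[name] = cross_val_score(pipe, X_moons, y_moons, cv=5).mean()
    print(f"{name:6s} mean cross-validated accuracy: {cv_scores[name]:.3f}")
```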

## Computational Complexity
