
SVMs are capable of performing linear or nonlinear classification, regression, and even outlier detection.

SVMs are particularly well suited for classification of complex but __small- or medium-sized__ datasets.

## 1. Linear SVM Classification
- large margin classification
- support vectors
- __SVMs are sensitive to the feature scales__

### 1.1 Soft Margin Classification
In Scikit-Learn’s SVM classes, you can control the balance between keeping the street as wide as possible and limiting margin violations using the C hyperparameter: a smaller C value leads to a wider street but more margin violations.

- If your SVM model is overfitting, you can try regularizing it by reducing C.
- Unlike Logistic Regression classifiers, SVM classifiers do not output probabilities for each class.
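
The C trade-off described above can be sketched as follows. This is only an illustration under stated assumptions: it uses `SVC(kernel="linear")` rather than `LinearSVC` so that the solver is exact, the C values 0.01 and 100 are arbitrary picks, and `street_stats` is a hypothetical helper name. The street width is measured in the scaled feature space as 2 / ‖w‖, and an instance is counted as a margin violation when it lies strictly inside the street or on the wrong side of it.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]            # petal length, petal width
y = (iris["target"] == 2).astype(int)  # 1 = Iris-Virginica, 0 = other

def street_stats(C):
    """Hypothetical helper: return (street width, number of margin violations)."""
    clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svc", SVC(kernel="linear", C=C)),
    ])
    clf.fit(X, y)
    w = clf.named_steps["svc"].coef_[0]
    width = 2 / np.linalg.norm(w)      # distance between the two margin lines
    t = 2 * y - 1                      # targets recoded as -1 / +1
    scores = clf.decision_function(X)  # w.x + b for each (scaled) instance
    violations = int(np.sum(t * scores < 1 - 1e-3))  # small slack for solver tolerance
    return width, violations

for C in (0.01, 100):
    width, violations = street_stats(C)
    print(f"C={C}: street width={width:.3f}, margin violations={violations}")
```

On this dataset the smaller C should give the wider street and the larger violation count, which is the behavior the text describes.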

In [2]:
# The following Scikit-Learn code loads the iris dataset, scales the features, and then trains a linear SVM
# model (using the LinearSVC class with C = 1 and the hinge loss function, described shortly) to detect
# Iris-Virginica flowers
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
iris = datasets.load_iris()
X = iris["data"][:, (2, 3)] # petal length, petal width
y = (iris["target"] == 2).astype(np.float64) # Iris-Virginica
# Pipeline steps are passed as a list of (name, estimator) tuples
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)

Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linear_svc', LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0))])

In [3]:
svm_clf.predict([[5.5, 1.7]])

array([ 1.])
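
Because SVM classifiers do not output class probabilities, one way to see how confident a prediction is would be to look at the signed score returned by `decision_function`. The sketch below rebuilds the same pipeline so that it runs on its own; the second test point `[1.5, 0.3]` and `random_state=42` are assumptions added for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]            # petal length, petal width
y = (iris["target"] == 2).astype(int)  # 1 = Iris-Virginica, 0 = other

svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge", random_state=42)),
])
svm_clf.fit(X, y)

# One point from the cell above plus one hypothetical small-petal point
X_new = np.array([[5.5, 1.7], [1.5, 0.3]])
scores = svm_clf.decision_function(X_new)  # signed scores: positive -> class 1
preds = svm_clf.predict(X_new)
print(scores, preds)
```

A positive score means the instance falls on the Iris-Virginica side of the decision boundary, and a larger absolute value means it lies farther from the boundary.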

- The LinearSVC class regularizes the bias term, so you should center the training set first by subtracting its mean. This is automatic if you scale the data using the StandardScaler.
- Make sure you set the loss hyperparameter to "hinge", as it is not the default value.
- For better performance, you should set the dual hyperparameter to False, unless there are more features than training instances. Note that LinearSVC rejects dual=False when loss="hinge" (with the default l2 penalty), so this advice applies when you keep the default squared hinge loss.

## 2. Nonlinear SVM Classification

Sometimes the data is not linearly separable. One way to handle this is to add more features, such as polynomial features.

In [4]:
# make polynomial features
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
# Pipeline steps are passed as a list of (name, estimator) tuples
polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge")),
])
polynomial_svm_clf.fit(X, y)

Pipeline(memory=None,
     steps=[('poly_features', PolynomialFeatures(degree=3, include_bias=True, interaction_only=False)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', LinearSVC(C=10, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0))])

### 2.1 Polynomial Kernel
With the polynomial-features approach above, a high polynomial degree creates a huge number of features, which makes training very slow.

- The kernel trick avoids this: it makes it possible to get the same result as if you added many polynomial features, even with very high-degree polynomials, without actually having to add them.

In [5]:
from sklearn.svm import SVC
# Pipeline steps are passed as a list of (name, estimator) tuples
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5)),
])
poly_kernel_svm_clf.fit(X, y)

Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', SVC(C=5, cache_size=200, class_weight=None, coef0=1,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

- If your model is overfitting, you might want to reduce the polynomial degree.
- The hyperparameter coef0 controls how much the model is influenced by high-degree polynomials versus low-degree polynomials.
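
To make the kernel trick in this section concrete, here is a small numeric check for the simplest case: a degree-2 polynomial kernel on 2D vectors, with no constant term. Scikit-Learn's polynomial kernel has the form (gamma · aᵀb + coef0)^degree; this sketch assumes gamma = 1 and coef0 = 0. The helper names `phi` and `poly2_kernel` and the two sample vectors are made up for illustration.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2D vector (illustrative helper)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(a, b):
    """Degree-2 polynomial kernel, computed in the original 2D space."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = np.dot(phi(a), phi(b))  # dot product in the expanded 3D space
kernel = poly2_kernel(a, b)        # same value without building phi(a), phi(b)
print(explicit, kernel)
```

Both values agree (up to floating-point rounding), which is why the kernelized SVM can behave as if the polynomial features had been added without ever computing them.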

### 2.2 Adding Similarity Features
Another technique for handling nonlinearly separable data is to add features computed using a _similarity function_.

- _similarity function_: measures how much each instance resembles a particular landmark
- Gaussian RBF: $\phi_\gamma(x, l) = \exp(-\gamma \lVert x - l \rVert^2)$
- a bell-shaped function varying from 0 (very far away from the landmark) to 1 (at the landmark).
- The downside is that a training set with m instances and n features gets transformed into a training set with m instances and m features (assuming you drop the original features). If your training set is very large, you end up with an equally large number of features.
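
The transformation described in this list can be sketched with NumPy. The toy 1D dataset, the choice gamma = 0.3, and the helper name `rbf_similarity_features` are assumptions made for illustration; every training instance is used as a landmark, which is what turns m × n data into m × m features.

```python
import numpy as np

def rbf_similarity_features(X, landmarks, gamma):
    """Return one Gaussian RBF similarity feature per landmark (illustrative helper)."""
    # Squared Euclidean distance between every instance and every landmark
    diffs = X[:, np.newaxis, :] - landmarks[np.newaxis, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    return np.exp(-gamma * sq_dists)

# Toy dataset: m = 5 instances, n = 1 feature
X_1d = np.array([[-4.0], [-2.0], [-1.0], [0.0], [3.0]])

# Use every instance as a landmark
features = rbf_similarity_features(X_1d, landmarks=X_1d, gamma=0.3)
print(features.shape)  # m instances, m similarity features
print(np.round(features, 3))
```

Each diagonal entry is 1 because every instance sits exactly on its own landmark, and the values fall toward 0 as the distance to the landmark grows.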

### 2.3 Gaussian RBF Kernel
Kernel trick: it makes it possible to obtain a similar result as if you had added many similarity features, without actually having to add them.

In [6]:
# Pipeline steps are passed as a list of (name, estimator) tuples
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001)),
])
rbf_kernel_svm_clf.fit(X, y)

Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=5, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

- Increasing gamma makes the bell-shape curve narrower, and as a result each instance’s range of influence is smaller: the decision boundary ends up being more irregular, wiggling around individual instances.
- Conversely, a small gamma value makes the bell-shaped curve wider, so instances have a larger range of influence, and the decision boundary ends up smoother.
- γ acts like a regularization hyperparameter: if your model is __overfitting__, you should reduce it, and if it is underfitting, you should increase it (similar to the C hyperparameter).
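
The effect of gamma on the bell curve can be checked numerically. The sketch below evaluates the Gaussian RBF at a few distances from a landmark for three gamma values; the distances, the gamma values, and the helper name `rbf` are arbitrary choices for illustration.

```python
import numpy as np

def rbf(distance, gamma):
    """Gaussian RBF similarity as a function of distance to the landmark."""
    return np.exp(-gamma * distance ** 2)

distances = np.array([0.0, 0.5, 1.0, 2.0])
for gamma in (0.1, 1.0, 5.0):
    # Larger gamma -> similarity drops off faster -> narrower bell curve
    print(f"gamma={gamma}:", np.round(rbf(distances, gamma), 3))
```

With the largest gamma the similarity is already close to 0 at distance 1, so each instance influences only its immediate neighborhood, which is what lets the decision boundary wiggle around individual instances.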

### 2.4 Computational Complexity
- The LinearSVC class is based on the liblinear library, which implements an optimized algorithm for linear SVMs. It does not support the kernel trick, but it scales almost linearly with the number of training instances and the number of features: its training time complexity is roughly O(m × n).

- The algorithm takes longer if you require a very high precision. This is controlled by the tolerance hyperparameter ϵ (called _tol_ in Scikit-Learn). In most classification tasks, the default tolerance is fine.

- The SVC class is based on the libsvm library, which implements an algorithm that supports the kernel trick. Unfortunately, this means that it gets dreadfully slow when the number of training instances gets large (e.g., hundreds of thousands of instances). __This algorithm is perfect for complex but small or medium training sets. However, it scales well with the number of features, especially with sparse features (i.e., when each instance has few nonzero features).__

| Class | Time complexity | Out-of-core support | Scaling required | Kernel trick |
|------|------|------|------|------|
| LinearSVC | O(m × n) | No | Yes | No |
| SGDClassifier | O(m × n) | Yes | Yes | No |
| SVC | O(m² × n) to O(m³ × n) | No | Yes | Yes |
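
The SGDClassifier row in the table refers to training a linear SVM with stochastic gradient descent. A minimal sketch, assuming the usual correspondence alpha = 1 / (m × C) between SGD's regularization strength and the SVM's C, could look like this; `random_state=42` and the explicit `max_iter` and `tol` values are added only for reproducibility.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]            # petal length, petal width
y = (iris["target"] == 2).astype(int)  # 1 = Iris-Virginica, 0 = other

m = len(X)  # number of training instances
C = 1       # SVM-style regularization hyperparameter

sgd_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    # Hinge loss makes SGDClassifier optimize a linear SVM objective
    ("sgd", SGDClassifier(loss="hinge", alpha=1 / (m * C),
                          max_iter=1000, tol=1e-3, random_state=42)),
])
sgd_svm_clf.fit(X, y)
train_accuracy = sgd_svm_clf.score(X, y)
print(train_accuracy)
```

For datasets that do not fit in memory, SGDClassifier also provides `partial_fit`, which accepts one mini-batch at a time; that is what the out-of-core column refers to.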

## 3. SVM Regression
SVMs support not only linear and nonlinear classification but also linear and nonlinear regression.

The trick is to reverse the objective: instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM Regression tries to fit as many instances as possible on the street while limiting margin violations (i.e., instances off the street).
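
The idea of fitting instances "on the street" can be illustrated with the epsilon-insensitive loss, where the half-width of the street is the epsilon hyperparameter: a prediction that is within epsilon of its target costs nothing, and only the part of the error beyond epsilon is penalized. The sample targets and predictions below are made up, and `epsilon_insensitive_loss` is a hypothetical helper name.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-instance loss: zero on the street, linear in the excess error off it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.2, 2.0, 5.0, 2.0])

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=1.5)
print(losses)  # the first two instances are on the street, the last two are off it
```

The absolute errors here are 0.2, 0, 2.0, and 2.0, so with epsilon = 1.5 only the last two instances contribute to the loss, by 0.5 each.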

In [7]:
from sklearn.svm import LinearSVR
svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)

LinearSVR(C=1.0, dual=True, epsilon=1.5, fit_intercept=True,
     intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
     random_state=None, tol=0.0001, verbose=0)

In [8]:
from sklearn.svm import SVR
svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)

SVR(C=100, cache_size=200, coef0=0.0, degree=2, epsilon=0.1, gamma='auto',
  kernel='poly', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

- SVMs can also be used for outlier detection; see Scikit-Learn’s documentation for more details.

## 4. Under the Hood
In this section, the bias term is called b and the feature weights vector is called w. No bias feature is added to the input feature vectors.
### 4.1 Decision Function and Predictions