### Support Vector Machines

#### 1. Linear SVM Classification

Two classes are **linearly separable** if a straight line (more generally, a hyperplane) separates them perfectly. An SVM classifier fits the widest possible "street" between the classes, which is called **large margin classification**.

 * Adding more training instances "off the street" does not affect the decision boundary at all --> it is fully determined (or "supported") by the instances located on the edge of the street, which are called **support vectors** 
 
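 * SVMs are **sensitive to feature scales**: a feature with a much larger range dominates the distance computations and distorts the street, so features should be scaled (e.g. with StandardScaler) before training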

#### 1a. Hard Margin Classification

If we strictly impose that all instances must be off the street, this is called hard margin classification, which has two issues: 

 * Only works if the data is linearly separable 
 * Sensitive to outliers
 

#### 1b. Soft Margin Classification 

To avoid these issues, we can modify the goal to find a **good balance between keeping the street as large as possible and limiting the margin violations**. 

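 * The balance is controlled by the hyperparameter C: a smaller C tolerates more margin violations and gives a wider street, so an overfitting model can be regularized by reducing C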
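
The effect of C on the street can be checked numerically. The following is a minimal sketch, not part of the original notebook: it assumes the same two petal features as the cell below, uses `SVC(kernel="linear")` so that the weight vector is available through `coef_`, and the names `X_iris`, `y_iris` and `margin_widths` are made up for the illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_iris = iris["data"][:, 2:4]               # petal length and petal width
y_iris = (iris["target"] == 2).astype(int)  # 1 for Iris virginica, else 0

margin_widths = {}  # hypothetical helper dict: C -> street width
for C in (0.01, 100):
    model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    model.fit(X_iris, y_iris)
    svc = model[-1]
    w = svc.coef_[0]
    # The street is bounded by w.x + b = +1 and w.x + b = -1,
    # so its width (in scaled feature units) is 2 / ||w||.
    margin_widths[C] = 2 / np.linalg.norm(w)
    print(f"C={C}: street width = {margin_widths[C]:.2f}, "
          f"support vectors = {svc.n_support_.sum()}")
```

The smaller C should report the wider street and more support vectors, since more instances are allowed to sit on or inside the street.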

In [1]:
import numpy as np 
from sklearn import datasets 
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import StandardScaler 
from sklearn.svm import LinearSVC 

iris = datasets.load_iris()
X = iris['data'][:, (2,3)] # petal length and petal width
y = (iris['target'] == 2).astype(np.float64) # Iris virginica 

svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ("linear_svc", LinearSVC(C = 1, loss = 'hinge'))
])
svm_clf.fit(X, y)
svm_clf.predict([[5.5, 1.7]])

array([1.])

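 * Unlike Logistic Regression classifiers, SVM classifiers **do not output probabilities** for each class by default. The SVC class can estimate them if it is created with probability = True.
 * The LinearSVC class regularizes the bias term, so we should **center the training set** by subtracting its mean --> this is automatic if we scale the data using the StandardScaler.
 * Pass **loss = "hinge"** explicitly, because it is not the default for LinearSVC (the default is the squared hinge loss).
 * For better performance, set **dual = False** unless there are more features than training instances. Note that LinearSVC does not accept dual = False together with loss = "hinge" and the default L2 penalty, so this advice applies when keeping the default squared hinge loss.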
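
To make the point about probabilities concrete, here is a small sketch (not from the original notebook) under the assumption that the same iris features are used. It compares the signed score returned by `decision_function` with the probabilities that `SVC` estimates when created with `probability=True`. The variable names are illustrative, and `max_iter` and `random_state` are set only to keep the run quiet and repeatable.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

iris = datasets.load_iris()
X_iris = iris["data"][:, 2:4]               # petal length and petal width
y_iris = (iris["target"] == 2).astype(int)  # 1 for Iris virginica, else 0
sample = [[5.5, 1.7]]

# LinearSVC only gives a signed distance-like score, not a probability.
linear_clf = make_pipeline(
    StandardScaler(),
    LinearSVC(C=1, loss="hinge", max_iter=10_000, random_state=42),
)
linear_clf.fit(X_iris, y_iris)
score = linear_clf.decision_function(sample)
print("signed score:", score)
print("LinearSVC has predict_proba:", hasattr(LinearSVC(), "predict_proba"))

# SVC fits an extra calibration step when probability=True (slower training).
proba_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="linear", C=1, probability=True, random_state=42),
)
proba_clf.fit(X_iris, y_iris)
proba = proba_clf.predict_proba(sample)
print("class probabilities:", proba)
```

A positive score means the instance is on the positive side of the decision boundary. The probabilities come from a separate calibration fitted on top of the SVM scores, so they are estimates rather than a native output of the model.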

#### 2. Nonlinear SVM Classification 

Adding features (e.g. polynomial features) can make a nonlinear dataset linearly separable. To implement this idea, we can create a Pipeline containing a PolynomialFeatures transformer, followed by a StandardScaler and a LinearSVC.

In [4]:
from sklearn.datasets import make_moons 
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures 

X, y = make_moons(n_samples = 100, noise = 0.15)
polynomial_svm_clf = Pipeline([
    ('poly_features', PolynomialFeatures(degree = 3)),
    ('scaler', StandardScaler()),
    ('svm_clf', LinearSVC(C = 10, loss = 'hinge'))
])

polynomial_svm_clf.fit(X, y)



Pipeline(steps=[('poly_features', PolynomialFeatures(degree=3)),
                ('scaler', StandardScaler()),
                ('svm_clf', LinearSVC(C=10, loss='hinge'))])

#### 2a. Polynomial kernel 

Adding polynomial features is not always practical: with a low polynomial degree the model cannot deal with complex datasets, and with a high degree it creates a huge number of features, which makes the model very slow.

--> When using SVMs we can apply the **kernel trick**, which gives the same result as if we had added many polynomial features, without actually having to add them. Since no features are actually added, there is no combinatorial explosion in the number of features.

In other words, a **kernel** is a function capable of computing the dot product $\phi(a)^{T}\phi(b)$ based only on the original vectors $a$ and $b$, without having to compute (or even know) the transformation $\phi$.

Most commonly used kernels are: 

 * Linear: $K(a, b) = a^{T}b$
 * Polynomial: $K(a, b) = (\gamma a^{T}b + r)^{d}$
 * Gaussian RBF: $K(a, b) = \exp(-\gamma||a - b||^{2})$
 * Sigmoid: $K(a, b) = \tanh(\gamma a^{T}b + r)$
 
 

According to **Mercer's theorem**, if a function $K(a, b)$ respects a few mathematical conditions (for example, $K$ must be continuous and symmetric in its arguments, so that $K(a, b) = K(b, a)$), then there exists a function $\phi$ that maps $a$ and $b$ into another space (possibly with many more dimensions) such that $K(a, b) = \phi(a)^{T}\phi(b)$.
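
As a quick sanity check of the kernel idea, the sketch below (added for illustration, with made-up helper names `phi` and `poly2_kernel`) takes the polynomial kernel with $\gamma = 1$, $r = 0$, $d = 2$ on 2D vectors and compares it with the dot product of an explicit second-degree feature map.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2D vector (illustrative helper)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly2_kernel(a, b):
    """Polynomial kernel (gamma=1, r=0, d=2), computed from a and b only."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = phi(a) @ phi(b)      # dot product in the 3D transformed space
via_kernel = poly2_kernel(a, b) # same value without ever computing phi

print(explicit, via_kernel)
```

Both numbers agree (up to floating-point rounding), even though the kernel never builds the transformed vectors.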

In [7]:
from sklearn.svm import SVC 

poly_kernel_svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel = 'poly', degree = 3, coef0 = 1, C = 5))
])

poly_kernel_svm_clf.fit(X, y)

Pipeline(steps=[('scaler', StandardScaler()),
                ('svm_clf', SVC(C=5, coef0=1, kernel='poly'))])

#### 2b. Similarity features 

Another technique to tackle nonlinear problems is to add features computed using a **similarity function**, which measures how much each instance resembles a particular **landmark**.
For example, the **Gaussian Radial Basis Function (RBF)**:

$\phi_{\gamma}(x, l) = \exp(-\gamma||x - l||^{2})$

 * $\gamma$ is a hyperparameter 
 * $l$ is the landmark, which we can choose to be the location of each and every instance in the dataset. This creates many dimensions and thus increases the chances that the transformed training set will be linearly separable. 
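
The similarity features can be computed by hand to see what the transformation does. This is a minimal sketch on a made-up one-dimensional dataset: the points, the two landmarks and $\gamma = 0.3$ are assumptions chosen only for the illustration.

```python
import numpy as np

def gaussian_rbf(x, landmark, gamma):
    """Gaussian RBF similarity between each row of x and one landmark."""
    return np.exp(-gamma * np.linalg.norm(x - landmark, axis=-1) ** 2)

X1d = np.array([[-4.0], [-1.0], [0.0], [2.0]])  # toy 1D instances
landmarks = np.array([[-2.0], [1.0]])           # two hand-picked landmarks
gamma = 0.3

# One new feature per landmark: each column holds the similarity to that landmark.
features = np.column_stack([gaussian_rbf(X1d, lm, gamma) for lm in landmarks])
print(features.round(2))
```

For the instance $x = -1$, the distance to the first landmark is 1 and to the second is 2, so its new features are $\exp(-0.3 \times 1^{2}) \approx 0.74$ and $\exp(-0.3 \times 2^{2}) \approx 0.30$. Using every training instance as a landmark would give as many new features as there are instances.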

#### 2c. Gaussian RBF kernel 

The **kernel trick** can also be applied to similarity features methods! 

In [9]:
rbf_kernel_svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel = 'rbf', gamma = 5, C = 0.001))
])

rbf_kernel_svm_clf.fit(X, y)

Pipeline(steps=[('scaler', StandardScaler()),
                ('svm_clf', SVC(C=0.001, gamma=5))])

 * Increasing $\gamma$ makes the bell-shaped curve narrower --> decision boundary is more irregular, wiggling around individual instances 
 * $\gamma$ acts as a regularization hyperparameter: if the model is overfitting, we should reduce it; if it's underfitting, we should increase it (similar to the C hyperparameter).
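
A rough way to see the effect of $\gamma$ is to refit the same model with a few values and look at the training accuracy and the number of support vectors. The sketch below is illustrative only: the moons data is regenerated with a fixed `random_state`, the values of `gamma` and `C` are arbitrary, and `X_moons`, `y_moons` and `results` are made-up names.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_moons, y_moons = make_moons(n_samples=100, noise=0.15, random_state=42)

results = {}  # gamma -> (training accuracy, number of support vectors)
for gamma in (0.1, 5, 100):
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma, C=1000))
    model.fit(X_moons, y_moons)
    train_acc = model.score(X_moons, y_moons)
    n_sv = int(model[-1].n_support_.sum())
    results[gamma] = (train_acc, n_sv)
    print(f"gamma={gamma}: training accuracy = {train_acc:.2f}, support vectors = {n_sv}")
```

A very high training accuracy at large $\gamma$ is a sign of the boundary wrapping around individual instances. It says nothing about generalization, which would need a validation set or cross-validation to assess.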

#### 3. SVM Regression

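To use SVMs for regression instead of classification, we reverse the objective: instead of fitting the widest possible street between two classes while limiting margin violations, SVM regression tries to fit as many instances as possible **on the street** while limiting margin violations (i.e. instances off the street). The width of the street is controlled by the hyperparameter $\epsilon$.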

 * Adding more training instances within the margin does not affect the model's predictions --> the model is said to be **$\epsilon$-insensitive**.
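
The $\epsilon$-insensitivity can be illustrated with the loss itself. The sketch below is added for illustration and writes out the $\epsilon$-insensitive loss, $\max(0, |y - \hat{y}| - \epsilon)$, which is the default loss of LinearSVR. The target and prediction values are made up.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the street (|error| <= epsilon), linear outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.2, 3.0, 3.0, 1.5])  # absolute errors: 0.2, 1.0, 0.0, 2.5

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=1.5)
print(losses)
```

Only the last instance, whose error of 2.5 exceeds $\epsilon = 1.5$, contributes to the loss. The other three are inside the street, so moving them around within it would not change the fitted model.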

In [10]:
from sklearn.svm import LinearSVR 

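# X, y are still the make_moons data from the cells above,
# reused here only to demonstrate the API.
svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)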

LinearSVR(epsilon=1.5)

To tackle nonlinear regression tasks, we can use a kernelized SVM model. 

In [11]:
from sklearn.svm import SVR 

svm_poly_reg = SVR(kernel = 'poly', degree = 2, C = 100, epsilon = 0.1)
svm_poly_reg.fit(X, y)

SVR(C=100, degree=2, kernel='poly')

#### 4. Hinge loss 

$\max(0, 1 - t)$ is called the **hinge loss** function.

 * equal to 0 when $t \geq 1$
 * derivative is equal to -1 if $t < 1$ and 0 if $t > 1$
 * not differentiable at $t = 1$, but as with Lasso Regression we can still use Gradient Descent with any subgradient at $t = 1$ (i.e. any value between -1 and 0)
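
The properties listed above can be checked with a few lines of NumPy. This is a minimal sketch with made-up function names. At $t = 1$ it picks 0 as the subgradient, which is one valid choice among the values between -1 and 0.

```python
import numpy as np

def hinge_loss(t):
    """Hinge loss max(0, 1 - t), applied element-wise."""
    return np.maximum(0.0, 1.0 - t)

def hinge_subgradient(t):
    """-1 where t < 1, 0 where t > 1; at t == 1 this sketch chooses 0."""
    return np.where(t < 1.0, -1.0, 0.0)

t = np.array([-1.0, 0.5, 1.0, 2.0])
loss_values = hinge_loss(t)
subgradients = hinge_subgradient(t)

print(loss_values)   # losses for t = -1, 0.5, 1, 2
print(subgradients)  # subgradients for the same values of t
```

For these inputs the losses are 2, 0.5, 0 and 0, and the subgradients are -1, -1, 0 and 0.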

#### End of notebook