# Support Vector Machines

Support vector machines (SVMs) are versatile models capable of linear and nonlinear classification, regression, and outlier detection. An SVM classifier tries to fit the widest possible "street" between the classes, and that street determines how the model makes predictions. The width and orientation of the street are determined by the instances that lie on its edges, called **support vectors**.

## Linear SVM Classification

### Soft Margin Classification

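Hard margin classification requires every instance to be off the street and on the correct side, so it only works when the data is linearly separable and it is very sensitive to outliers. Soft margin classification is more flexible: it keeps the street as wide as possible while limiting margin violations.
The C hyperparameter controls this tradeoff. **Large C = narrow street,** few violations. **Small C = wide street,** more violations. If an SVM is overfitting, try lowering C.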
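As a rough illustration of how C changes the street, the sketch below fits a linear SVC on two scaled iris features with a small and a large C and reports the street width, computed as 2 / ||w||. The helper name `street_width` and the two C values are assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_iris = StandardScaler().fit_transform(iris["data"][:, [2, 3]])  # petal length, petal width
y_iris = (iris["target"] == 2).astype(int)                        # 1 for Iris virginica

def street_width(C):
    """Fit a linear soft-margin SVM and return the street width 2 / ||w||."""
    clf = SVC(kernel="linear", C=C).fit(X_iris, y_iris)
    return 2.0 / np.linalg.norm(clf.coef_)

width_small_C = street_width(0.01)   # weak penalty on violations
width_large_C = street_width(100)    # strong penalty on violations
print(f"C=0.01 -> street width {width_small_C:.2f}")
print(f"C=100  -> street width {width_large_C:.2f}")
```

The smaller C should report the wider street, at the cost of more instances falling inside it.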

In the following, we will use a linear SVM to detect Iris virginica flowers, using C=1 and the hinge loss function.

In [4]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)] # petal length, petal width
y = (iris["target"] == 2).astype(np.float64) # iris-virginica

svm_clf = Pipeline((
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
))
svm_clf.fit(X, y)

Pipeline(steps=(('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linear_svc', LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0))))

In [6]:
print(svm_clf.predict([[5.5, 1.7]]))

[ 1.]


We could instead use the SVC class with SVC(kernel="linear", C=1), but it is much slower on large datasets. Another option is an SGDClassifier (Stochastic Gradient Descent) with hinge loss, which doesn't converge as fast as LinearSVC but handles large datasets that do not fit in memory.
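The three options can be compared side by side. This is a minimal sketch on the same two iris features; setting the SGD regularization strength to `alpha = 1 / (m * C)` is an assumption borrowed from a common recipe for roughly matching the other two models, not something the notes above specify.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

iris = datasets.load_iris()
X_iris = StandardScaler().fit_transform(iris["data"][:, [2, 3]])  # petal length, petal width
y_iris = (iris["target"] == 2).astype(int)                        # 1 for Iris virginica

C = 1
m = len(X_iris)
models = {
    "LinearSVC": LinearSVC(C=C, loss="hinge", dual=True, max_iter=10000, random_state=42),
    "SVC(linear)": SVC(kernel="linear", C=C),
    # alpha = 1 / (m * C) is an assumed setting to make the regularization comparable
    "SGDClassifier": SGDClassifier(loss="hinge", alpha=1 / (m * C),
                                   max_iter=1000, tol=1e-3, random_state=42),
}

# Training accuracy only; this is a sanity check, not a proper evaluation
scores = {name: model.fit(X_iris, y_iris).score(X_iris, y_iris)
          for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: training accuracy {score:.3f}")
```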

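**Always** scale the data: LinearSVC regularizes the bias term, so the training set should be centered by subtracting its mean, which happens automatically with StandardScaler. Set the loss hyperparameter to "hinge" to get the standard SVM loss, since it is not the LinearSVC default. For better performance, set dual to False unless there are more features than instances; note that Scikit-Learn does not support dual=False together with the hinge loss and the l2 penalty, so that setting only applies with the default squared hinge loss.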

## Nonlinear SVM Classification

Many datasets are not even close to linearly separable, which is where nonlinear SVMs come in. One way to handle such data is to add more features, such as polynomial features, which may produce a dataset that is linearly separable.
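A toy example makes the idea concrete. The one-dimensional dataset below is made up for illustration: the positive class sits in the middle, so no single threshold on `x1` separates the classes, but adding the feature `x2 = x1**2` does.

```python
import numpy as np

x1 = np.linspace(-4, 4, 9)            # -4, -3, ..., 4
y = (np.abs(x1) < 2.5).astype(int)    # class 1 in the middle, class 0 on both sides

# Add a second-degree polynomial feature
x2 = x1 ** 2

# In the new feature, every class-1 point lies below every class-0 point,
# so a single threshold on x2 separates the classes.
separable = x2[y == 1].max() < x2[y == 0].min()
threshold = (x2[y == 1].max() + x2[y == 0].min()) / 2
print("separable on x2:", separable, "threshold:", threshold)
```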

In Scikit-Learn we can do this with a pipeline containing the PolynomialFeatures transformer, the StandardScaler transformer, and a LinearSVC. We'll try this on the moons dataset.

In [8]:
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons()

polynomial_svm_clf = Pipeline((
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge"))
))

polynomial_svm_clf.fit(X, y)

Pipeline(steps=(('poly_features', PolynomialFeatures(degree=3, include_bias=True, interaction_only=False)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', LinearSVC(C=10, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0))))

### Polynomial Kernel

Adding polynomial features works well for simple datasets, but a low polynomial degree can't deal with complex datasets, and a high polynomial degree creates an enormous number of features, making the model too slow.
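To get a feel for how fast the feature count grows, the sketch below counts the monomials of total degree at most d in n variables, which is the binomial coefficient C(n + d, d) when the bias term is included. The function name `n_polynomial_features` is just an illustrative helper.

```python
from math import comb

def n_polynomial_features(n_features, degree):
    """Number of monomials of total degree <= `degree` in `n_features`
    variables, including the constant (bias) term."""
    return comb(n_features + degree, degree)

for n, d in [(2, 3), (10, 3), (100, 3), (100, 5)]:
    print(f"{n} features, degree {d}: {n_polynomial_features(n, d):,} polynomial features")
```

With only 2 input features and degree 3 the count is 10, but 100 input features at degree 3 already give more than 170,000 columns.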

By using a mathematical technique called the **kernel trick** we can get the same result as if we had added many polynomial features, without actually having to add them. Let's test this:

In [10]:
from sklearn.svm import SVC
poly_kernal_svm_clf = Pipeline((
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
))

poly_kernal_svm_clf.fit(X, y)

Pipeline(steps=(('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', SVC(C=5, cache_size=200, class_weight=None, coef0=1,
  decision_function_shape=None, degree=3, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))))
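The kernel trick can be checked numerically on a small case. The sketch below uses a degree-2 polynomial mapping of 2-D vectors (the mapping `phi` and the two vectors are chosen for illustration) and shows that the dot product of the mapped vectors equals the squared dot product of the original ones, so the mapping never has to be computed.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 polynomial mapping of a 2-D vector."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = phi(a) @ phi(b)   # dot product computed in the transformed space
kernel = (a @ b) ** 2        # same quantity computed from the original vectors
print(explicit, kernel)
```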

The coef0 hyperparameter controls how much the model is influenced by high-degree terms versus low-degree terms. If the model is overfitting we can reduce the polynomial degree, and if it's underfitting we can try raising it. We can use grid search to find the best hyperparameter values for our model. See chapter 2.
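A grid search over these hyperparameters might look like the following sketch. The grid values, the noise level, and the 3-fold cross-validation are arbitrary choices for the demonstration, not recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_moons, y_moons = make_moons(n_samples=200, noise=0.15, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly")),
])

# Hyperparameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "svm_clf__degree": [2, 3],
    "svm_clf__coef0": [0, 1],
    "svm_clf__C": [1, 5],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_moons, y_moons)
print(search.best_params_, round(search.best_score_, 3))
```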

### Adding Similarity Features

We can also add features computed with a similarity function, which measures how much each instance resembles a particular *landmark*. Landmarks can be chosen arbitrarily, but a simple choice is to create a landmark at the location of each training instance. The Gaussian Radial Basis Function (RBF) is an example of a similarity function. It is a bell-shaped curve varying from 0 (very far from the landmark) to 1 (at the landmark).

This simple method of choosing landmarks transforms our dataset from m instances and n features to m instances and m features (assuming the original features are dropped).
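The transformation can be written out by hand. This is a minimal sketch assuming a Gaussian RBF with a hypothetical width parameter `gamma=0.3` and random demo data; every training instance doubles as a landmark, so an (m, n) matrix becomes (m, m).

```python
import numpy as np

def rbf_similarity_features(X, landmarks, gamma):
    """Return exp(-gamma * ||x - l||^2) for every instance x and landmark l."""
    diffs = X[:, np.newaxis, :] - landmarks[np.newaxis, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(5, 2))   # m = 5 instances, n = 2 features

# Use each training instance as a landmark
X_sim = rbf_similarity_features(X_demo, landmarks=X_demo, gamma=0.3)
print(X_demo.shape, "->", X_sim.shape)
```

Each diagonal entry is 1 because every instance sits exactly on its own landmark.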

### Gaussian RBF Kernel

We can use the RBF kernel to get the same result as adding many similarity features, without the computational expense of actually computing them, by using the SVC class with the RBF kernel.

$$\phi_{\gamma}(\mathbf{x}, \boldsymbol{\ell}) = \exp\left( -\gamma \lVert \mathbf{x} - \boldsymbol{\ell} \rVert^2 \right)$$

In [32]:
rbf_kernel_clf = Pipeline((
    ("scalar", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
))
rbf_kernel_clf.fit(X, y)

Pipeline(steps=(('scalar', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=5, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))))

Using a large gamma makes the bell curve narrower, reducing each instance's range of influence, so the decision boundary becomes more irregular and wiggles around individual instances. Conversely, a smaller gamma makes the range of influence larger and the decision boundary smoother. Gamma therefore acts like a regularization hyperparameter: reducing it may help when the model is overfitting, and increasing it may help when the model is underfitting.
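The effect of gamma can be seen by comparing training accuracy with cross-validated accuracy. The sketch below uses two deliberately extreme gamma values on a noisy moons dataset; the dataset settings, C=100, and the specific gammas are assumptions for the demonstration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_moons, y_moons = make_moons(n_samples=200, noise=0.3, random_state=42)

train_acc = {}
cv_acc = {}
for gamma in (0.01, 100):
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma, C=100))
    # Cross-validated accuracy estimates how well the model generalizes
    cv_acc[gamma] = cross_val_score(clf, X_moons, y_moons, cv=5).mean()
    # Training accuracy shows how closely the boundary hugs the training set
    train_acc[gamma] = clf.fit(X_moons, y_moons).score(X_moons, y_moons)
    print(f"gamma={gamma}: train accuracy {train_acc[gamma]:.3f}, "
          f"CV accuracy {cv_acc[gamma]:.3f}")
```

A large gap between training and cross-validated accuracy at the high gamma is the usual sign of overfitting.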

Other kernels exist but are used in rare or specific cases, like string kernels when analyzing text documents. It's best to try the linear kernel first (LinearSVC is much faster than SVC(kernel="linear")), especially if the training set is large or has a large number of features. If the training set is not too large, try the Gaussian RBF kernel as well. If there's extra time and computing power, try other kernels using cross-validation (look back at MNIST) and grid search (chapter 2).

### Computational Complexity

LinearSVC: No kernel trick, but training time scales roughly linearly with the number of features and instances.

SVC: Supports the kernel trick, but training time scales between quadratically and cubically with the number of instances (and linearly with the number of features), so it gets very slow on large training sets. Scales well with sparse features.

SGDClassifier: Scales linearly with features and instances and supports out-of-core fitting. Works well for large datasets.

## SVM Regression

Like classification, but with the objective reversed: instead of trying to fit the widest possible street between classes while keeping instances off it, SVM regression tries to fit as many instances as possible on the street. Margin violations are now the instances that fall outside the street. The hyperparameter $\epsilon$ controls the width of the street: higher = wider, lower = narrower.

Since the model tries to fit instances within the margin, adding more training instances inside the margin does not affect the model's predictions, which makes the model *epsilon-insensitive*.
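The epsilon-insensitive idea corresponds to a loss that is zero inside the street and grows linearly outside it. A small sketch with made-up targets and predictions:

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon-wide street, linear in the excess error outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.2, 2.0, 4.5, 1.0])   # absolute errors: 0.2, 0.0, 1.5, 3.0

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5)
print(losses)   # the first two errors fall inside the street and cost nothing
```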

In [34]:
from sklearn.svm import LinearSVR

svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)

LinearSVR(C=1.0, dual=True, epsilon=1.5, fit_intercept=True,
     intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
     random_state=None, tol=0.0001, verbose=0)

We can use kernelized SVMs to tackle nonlinear regression tasks. Just as LinearSVC is faster than SVC(kernel="linear"), LinearSVR is faster than SVR(kernel="linear").

In [36]:
from sklearn.svm import SVR

svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=.1)
svm_poly_reg.fit(X, y)

SVR(C=100, cache_size=200, coef0=0.0, degree=2, epsilon=0.1, gamma='auto',
  kernel='poly', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

## Under the Hood

A linear SVM classifier computes the dot product of the weight vector and an instance's feature vector, plus a bias term: if the result is positive, the instance is assigned to the positive class, otherwise to the negative class.
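This rule can be reproduced by hand from a fitted model. The sketch below trains a LinearSVC on two scaled iris features, then recomputes the scores from `coef_` and `intercept_` and compares them with the model's own `decision_function` and `predict`.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X_scaled = StandardScaler().fit_transform(iris["data"][:, [2, 3]])  # petal length, petal width
y_virginica = (iris["target"] == 2).astype(int)

clf = LinearSVC(C=1, loss="hinge", dual=True, max_iter=10000, random_state=42)
clf.fit(X_scaled, y_virginica)

w, b = clf.coef_[0], clf.intercept_[0]
manual_scores = X_scaled @ w + b                  # w . x + b for every instance
manual_pred = (manual_scores > 0).astype(int)     # positive score -> class 1

print(np.allclose(manual_scores, clf.decision_function(X_scaled)))
print(np.array_equal(manual_pred, clf.predict(X_scaled)))
```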

The slope of the decision function is equal to the magnitude of the weight vector. If we divide this slope by two, the points where the decision function equals ±1 (the edges of the margin) end up twice as far from the decision boundary. In other words, the smaller the weight vector, the wider the margin, so we want to minimize the magnitude of this vector in order to maximize the margin.
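A quick numerical check of this relationship, using an arbitrary weight vector: the distance from the decision boundary to the edge of the margin is 1 / ||w||, so halving the weights doubles it.

```python
import numpy as np

def margin_half_width(w):
    """Distance from the boundary (w.x + b = 0) to the margin edge (w.x + b = +/-1)."""
    return 1.0 / np.linalg.norm(w)

w = np.array([3.0, 4.0])            # ||w|| = 5
half_width = margin_half_width(w)
half_width_halved_w = margin_half_width(w / 2)
print(half_width, half_width_halved_w)
```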

The hard margin optimization minimizes half the squared magnitude of the weight vector, subject to every instance being on the correct side of the margin. We use the squared magnitude rather than the magnitude itself because optimization algorithms work best on differentiable functions, and the magnitude is not differentiable when the weight vector is 0. Both are minimized by the same weight vector.

The soft margin adds a slack variable for each instance, which measures how much that instance is allowed to violate the margin. This creates two conflicting objectives: make the slack variables as small as possible to reduce margin violations, and make the magnitude of the weight vector as small as possible to widen the margin. We control this tradeoff with the C hyperparameter, which scales the sum of the slack variables in the objective.
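Written out, the two objectives take the following standard form. The notation is an assumption, since these notes do not fix one: $\mathbf{w}$ is the weight vector, $b$ the bias term, $t^{(i)} \in \{-1, +1\}$ the label of instance $i$, $\zeta^{(i)}$ its slack variable, and $m$ the number of training instances.

Hard margin:

$$\min_{\mathbf{w}, b} \; \frac{1}{2}\mathbf{w}^\top\mathbf{w} \quad \text{subject to} \quad t^{(i)}\left(\mathbf{w}^\top\mathbf{x}^{(i)} + b\right) \ge 1 \quad \text{for } i = 1, \dots, m$$

Soft margin:

$$\min_{\mathbf{w}, b, \zeta} \; \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C \sum_{i=1}^{m} \zeta^{(i)} \quad \text{subject to} \quad t^{(i)}\left(\mathbf{w}^\top\mathbf{x}^{(i)} + b\right) \ge 1 - \zeta^{(i)} \quad \text{and} \quad \zeta^{(i)} \ge 0 \quad \text{for } i = 1, \dots, m$$

Setting every $\zeta^{(i)}$ to zero recovers the hard margin problem, and a larger C makes each unit of slack more expensive, which matches the narrow-street behaviour of large C.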

Look at chapter 5 for more detail on these equations.

### Quadratic Programming

This math is too complicated for me to try and summarize; check the book for more info.

### The Dual Problem

Given a constrained optimization problem, known as the *primal problem*, it is possible to express a different but closely related problem, called its *dual problem*. The solution to the dual problem usually gives a lower bound on the solution of the primal problem, but under certain conditions the two have the same solution. The SVM problem satisfies these conditions, so we can solve either the primal or the dual.
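One way to see the link between the two formulations in Scikit-Learn: SVC solves the dual problem and stores the dual coefficients of the support vectors in `dual_coef_`. For a linear kernel, the primal weight vector can be rebuilt from them, as in this sketch on the iris features used earlier (the variable names are illustrative).

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_scaled = StandardScaler().fit_transform(iris["data"][:, [2, 3]])  # petal length, petal width
y_virginica = (iris["target"] == 2).astype(int)

svc = SVC(kernel="linear", C=1).fit(X_scaled, y_virginica)

# dual_coef_ holds one signed coefficient per support vector;
# their weighted sum of support vectors gives the primal weight vector.
w_from_dual = svc.dual_coef_ @ svc.support_vectors_

print(w_from_dual, svc.coef_)
print("support vectors used:", len(svc.support_vectors_), "of", len(X_scaled))
```

Only the support vectors contribute to the weights, which is why the remaining training instances do not affect the decision boundary.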