# Support Vector Machines

A Support Vector Machine, commonly referred to as an SVM, is a powerful and versatile machine learning model. SVMs are among the most popular supervised learning algorithms: they are capable of linear and nonlinear classification, regression, and even outlier detection. They are particularly well suited for classifying complex small or medium-sized datasets.

The goal of an SVM is to find the best decision boundary for dividing n-dimensional space into classes, so that new data points can easily be placed in the correct category. This best decision boundary is called a hyperplane. The SVM chooses the extreme points (vectors, that is, training instances) that define the hyperplane. These are called support vectors, hence the name of the algorithm. This process is discussed in more detail later.

This chapter is divided into the following sections:

1. Linear SVM Classification
2. Nonlinear SVM Classification
3. SVM Regression
4. Working of SVM

## 1. Linear SVM Classification

<br>
<img src="https://www.researchgate.net/publication/304611323/figure/fig8/AS:668377215406089@1536364954428/Classification-of-data-by-support-vector-machine-SVM.png" width="400">
<br>

The fundamental idea behind SVMs is best explained using the image above as an example. The two classes can easily be separated by a straight line (i.e., they are linearly separable).

The edges of the margin are marked by the dashed lines. As stated earlier, the support vectors are the instances used to construct the hyperplane. Instances are classified according to which side of the hyperplane they lie on. You can think of an SVM classifier as fitting the widest possible street (represented by the parallel dashed lines) between the classes. This is called large margin classification. Note that adding more training instances off the street does not affect the decision boundary at all: it is fully determined (or supported) by the instances located on the edge of the street. These instances are the support vectors.

<br>
<img src="https://www.oreilly.com/api/v2/epubs/9781787125933/files/graphics/B07030_03_09.jpg" height="350">
<br>
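The claim that instances off the street do not influence the boundary can be checked with a minimal sketch. The toy data below is made up for illustration, and `SVC(kernel="linear")` with a very large `C` is used here as an assumed stand-in for a classifier that tolerates almost no margin violations.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up toy data: two linearly separable clusters.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for margin violations.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
print("support vectors:\n", clf.support_vectors_)

# Add one more instance of class 1, far away from the street, and refit.
X_more = np.vstack([X, [[8.0, 8.0]]])
y_more = np.append(y, 1)
clf_more = SVC(kernel="linear", C=1e6).fit(X_more, y_more)

# The hyperplane parameters should agree up to solver tolerance.
same_w = np.allclose(clf.coef_, clf_more.coef_, atol=1e-2)
same_b = np.allclose(clf.intercept_, clf_more.intercept_, atol=1e-2)
print("same boundary:", same_w and same_b)
```

The fitted `support_vectors_` attribute lists the instances that define the boundary; the extra point is far from the street, so refitting should leave `coef_` and `intercept_` essentially unchanged.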

Remember that SVMs are sensitive to feature scales. After feature scaling, the decision boundary looks much better, i.e., the margin is a lot wider.


<br>
<img src="https://miro.medium.com/max/1332/1*mKH7ePxH9xJ2Avsess9nzA.png" height="250">
<br>
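A small sketch of this sensitivity, under stated assumptions: the four data points are made up, and `SVC(kernel="linear", C=100)` is an arbitrary choice of classifier. It fits the same data with and without `StandardScaler` and compares the orientation of the two boundaries in the original units.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Made-up toy data: feature 1 has a much larger numeric range than feature 0.
X = np.array([[1.0, 50.0], [5.0, 20.0], [3.0, 80.0], [5.0, 60.0]])
y = np.array([0, 0, 1, 1])

# Fit directly on the raw features.
raw_clf = SVC(kernel="linear", C=100).fit(X, y)
raw_w = raw_clf.coef_[0]

# Fit on standardised features.
scaled_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="linear", C=100)),
]).fit(X, y)

# Express the scaled model's weights in the original units for comparison.
scaled_w = (scaled_clf.named_steps["svc"].coef_[0]
            / scaled_clf.named_steps["scaler"].scale_)

print("raw fit, w0/w1:   ", raw_w[0] / raw_w[1])
print("scaled fit, w0/w1:", scaled_w[0] / scaled_w[1])
```

The two weight ratios differ, which means the two models draw differently oriented boundaries through the same data: the fit on raw features is driven mostly by the feature with the larger numeric range.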

### 1.1 Soft Margin Classification

It is not always possible to build a model in which all instances are off the street (the street being the area between the two margin hyperplanes). Strictly imposing this, with every instance also on the correct side, is called hard margin classification. It has two main issues: 1. it only works with linearly separable data, and 2. it is highly sensitive to outliers. For example, if a single instance of the negative class lies among the positive class instances, it becomes impossible to separate the classes with a hard margin, i.e., with no instances on the street. And even when an outlier still leaves the data separable, a decision boundary that accommodates it will overfit and have a hard time generalising.

To avoid these issues we use a more flexible model. The objective is to find a good balance between keeping the street as wide as possible and limiting the margin violations (i.e., instances that end up in the middle of the street or even on the wrong side). This is called soft margin classification.

When creating an SVM model with sklearn, we can specify a number of hyperparameters, and C is one of them. If we set it to a low value, we end up with a model that has a wide street but more instances on the street. If we set it to a higher value, the street is much narrower, but fewer instances end up on it. The former model has more margin violations but will probably generalise better. Our goal is to find a good balance between the two. If the SVM model is overfitting, you can regularise it by reducing the value of C.
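The trade-off controlled by C can be made concrete with a short sketch. It assumes the iris petal features, uses `SVC(kernel="linear")` as the classifier, and introduces a hypothetical helper `street_stats` that is not part of sklearn. The street width is computed as 2/||w||, and an instance counts as a margin violation when y * decision_function(x) < 1.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris["data"][:, (2, 3)])  # petal length, petal width
y = np.where(iris["target"] == 2, 1, -1)  # +1 for Iris virginica, -1 otherwise

def street_stats(C):
    """Hypothetical helper: return (street width, number of margin violations)."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2 / np.linalg.norm(clf.coef_)
    # The small tolerance keeps instances sitting on the edge from being counted.
    violations = int(np.sum(y * clf.decision_function(X) < 1 - 1e-3))
    return width, violations

for C in (0.01, 100):
    width, violations = street_stats(C)
    print(f"C={C}: street width={width:.2f}, margin violations={violations}")
```

The low value of C should report a wider street with more margin violations, and the high value a narrower street with fewer.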

Next, let's implement a linear SVM model and train it on the iris dataset. We will load the dataset, scale the features, and then train a linear SVM model (using the LinearSVC class with C=1 and the hinge loss function) to detect Iris virginica flowers.

In [2]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]  # to get petal length and petal width
y = (iris['target'] == 2).astype(float)

svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('linear_svc', LinearSVC(C=1, loss="hinge"))
])

svm_clf.fit(X, y)

svm_clf.predict([[5.5, 1.7]])

array([1.])

Unlike Logistic Regression classifiers, SVM classifiers do not output probabilities for each class. Instead of the LinearSVC class, we could also use the SVC class with a linear kernel by writing SVC(kernel="linear", C=1). Or we could use the SGDClassifier class, with SGDClassifier(loss="hinge", alpha=1/(m*C)), where m is the number of training instances. This applies regular Stochastic Gradient Descent to train a linear SVM classifier. It does not converge as fast as the LinearSVC class, but it can handle online classification tasks or huge datasets that do not fit in memory (out-of-core training).
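The two alternatives can be sketched as follows, assuming the same iris petal features and C=1 as in the LinearSVC example. The `random_state` value is an arbitrary choice made only for reproducibility, and `decision_function` is used to show that the models return a signed score rather than a probability.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]            # petal length, petal width
y = (iris["target"] == 2).astype(int)  # 1 for Iris virginica
m, C = len(X), 1                       # m = number of training instances

svc_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="linear", C=C)),
]).fit(X, y)

sgd_clf = Pipeline([
    ("scaler", StandardScaler()),
    # random_state is fixed only to make the sketch reproducible.
    ("sgd", SGDClassifier(loss="hinge", alpha=1 / (m * C), random_state=42)),
]).fit(X, y)

x_new = [[5.5, 1.7]]
for name, clf in (("SVC", svc_clf), ("SGDClassifier", sgd_clf)):
    # decision_function returns a signed score, not a probability.
    print(name, clf.predict(x_new), clf.decision_function(x_new))
```

A positive score places the instance on the Iris virginica side of the boundary, and its magnitude grows with the distance from the boundary.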

The LinearSVC class regularises the bias term, so you should center the training set first by subtracting its mean. This happens automatically if you scale the data with StandardScaler. Also make sure to set the loss hyperparameter to "hinge", as it is not the default value (the default is "squared_hinge"). For better performance, you should set the dual hyperparameter to False unless there are more features than training instances. Note, however, that scikit-learn only supports dual=False with the squared hinge loss, not with loss="hinge".


## 2. Nonlinear SVM Classification

Linear SVM classifiers are efficient and work very well in many cases, but many datasets are not even close to being linearly separable. One approach to handling nonlinear datasets is to add more features, such as polynomial features; in some cases this results in a linearly separable dataset. For example, a dataset with just one feature x1 may not be linearly separable, but if another feature is added (for instance, its square), the resulting 2D dataset can be perfectly linearly separable.

<br>
<img src="https://miro.medium.com/max/1400/1*NwhqamsvzBkUlYwSAubv5g.png" height="300">
<br>
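The following minimal sketch illustrates the idea on made-up data. The nine 1D points and the choice of x2 = x1 squared as the added feature are assumptions for illustration, as is the use of `SVC(kernel="linear", C=100)`.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 1D toy data: class 1 sits in the middle, class 0 on both sides.
x1 = np.linspace(-4, 4, 9)
y = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0])

X_1d = x1.reshape(-1, 1)
X_2d = np.c_[x1, x1 ** 2]  # add the second feature x2 = x1 squared

acc_1d = SVC(kernel="linear", C=100).fit(X_1d, y).score(X_1d, y)
acc_2d = SVC(kernel="linear", C=100).fit(X_2d, y).score(X_2d, y)

print("training accuracy with x1 only:  ", acc_1d)
print("training accuracy with x1 and x2:", acc_2d)
```

With x1 alone, no single threshold can separate the classes, so the training accuracy stays below 1. Once x2 is added, a horizontal line in the (x1, x2) plane separates them completely.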

To implement this using sklearn, we create a Pipeline containing a PolynomialFeatures transformer, followed by a StandardScaler and a LinearSVC. We can test this on the moons dataset: a toy dataset for binary classification in which the data points are shaped as two interleaving half circles. You can generate this dataset using the make_moons() function.

We will now implement this in Python.

In [12]:
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC
import matplotlib.pyplot as plt

X, y = make_moons(n_samples=100, noise=0.15)
poly_svm_clf = Pipeline([
    ('poly_features', PolynomialFeatures(degree=3)),
    ('scaler', StandardScaler()),
    # max_iter is raised above the default of 1000 because the solver failed to converge with fewer iterations
    ('svm_clf', LinearSVC(C=10, loss="hinge", max_iter=10000))
])

poly_svm_clf.fit(X, y)

<br>
<img src="https://img2018.cnblogs.com/blog/1012590/201903/1012590-20190331183702438-196976647.png" height="400">
<br>

### 2.1 Polynomial Kernel