# Support Vector Machines

A Support Vector Machine (SVM) is a very powerful and versatile Machine Learning model, capable of
performing linear or nonlinear classification, regression, and even outlier detection.


### Hard Margin Classification

The following figure illustrates the core idea behind SVMs, using two classes from the iris dataset and only two features.

![alt text](images/im1.png)

The left plot shows the decision boundaries of three
possible linear classifiers. The model whose decision boundary is represented by the dashed line is so
bad that it does not even separate the classes properly. The other two models work perfectly on this
training set, but their decision boundaries come so close to the instances that these models will probably
not perform as well on new instances. In contrast, the solid line in the plot on the right represents the
decision boundary of an SVM classifier; this line not only separates the two classes but also stays as far
away from the closest training instances as possible. You can think of an SVM classifier as fitting the
widest possible street (represented by the parallel dashed lines) between the classes. This is called **large
margin classification**. Notice that adding more training instances “off the street” will not affect the decision boundary at all: it is fully determined (or “supported”) by the instances located on the edge of the street. These instances are called the **support vectors**.
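
The claim that off-the-street instances do not move the boundary can be checked with a minimal sketch. It assumes the two linearly separable iris classes (Setosa and Versicolor) on petal length and width, and it approximates a hard margin with `SVC(kernel="linear")` and a very large `C`; the value `1e6`, the tight `tol`, and the extra instance `[1.0, 0.1]` are illustrative choices, not part of the original notes.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
# Keep Iris-Setosa (0) and Iris-Versicolor (1): linearly separable on petal size
mask = iris["target"] < 2
X = iris["data"][mask][:, [2, 3]]  # petal length, petal width
y = iris["target"][mask]

# Assumption: a very large C behaves like a hard margin on separable data
clf = SVC(kernel="linear", C=1e6, tol=1e-6).fit(X, y)
print("support vectors:\n", clf.support_vectors_)

# Add one more Setosa instance far away from the street, then refit
X_more = np.vstack([X, [[1.0, 0.1]]])
y_more = np.append(y, 0)
clf_more = SVC(kernel="linear", C=1e6, tol=1e-6).fit(X_more, y_more)

# The weights and bias should be (numerically) unchanged
print("before:", clf.coef_[0], clf.intercept_[0])
print("after: ", clf_more.coef_[0], clf_more.intercept_[0])
```

Only a handful of the 100 training instances appear in `support_vectors_`; the rest could be removed or duplicated without changing the fitted boundary.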

### Scaling Sensitivity

SVMs are sensitive to the feature scales, as you can see in the following figure: in the left plot, the vertical scale is much larger than the horizontal scale, so the widest possible street is close to horizontal. After feature scaling (e.g., using Scikit-Learn’s StandardScaler, described in the section 1 notes), the decision boundary looks much better.
![alt text](images/im2.png)
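
As a rough numerical illustration of this sensitivity, the sketch below fits a linear SVM on a made-up four-point dataset whose second feature has a much larger scale than the first. The data, the helper `feature_shares`, and the "share" measure (|w_j| times the standard deviation of feature j, normalized to sum to 1) are assumptions introduced here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data: feature 0 spans 1..5, feature 1 spans 20..80
X_toy = np.array([[1.0, 50.0], [5.0, 20.0], [3.0, 80.0], [5.0, 60.0]])
y_toy = np.array([0, 0, 1, 1])

def feature_shares(X, y):
    """Relative contribution of each feature to the decision function."""
    clf = SVC(kernel="linear", C=100).fit(X, y)
    contrib = np.abs(clf.coef_[0]) * X.std(axis=0)
    return contrib / contrib.sum()

shares_raw = feature_shares(X_toy, y_toy)
shares_scaled = feature_shares(StandardScaler().fit_transform(X_toy), y_toy)

print("unscaled:", shares_raw)
print("scaled:  ", shares_scaled)
```

Without scaling, the small-scale feature contributes almost nothing to the decision function; after standardization both features take part in defining the street.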


### Soft Margin Classification

If we strictly impose that all instances be off the street and on the correct side, this is called hard margin
classification. There are two main issues with hard margin classification. First, it only works if the data
is linearly separable, and second, it is quite sensitive to outliers, as the following figures show:

![alt text](images/im1.png)
![alt text](images/im3.png)

The objective is to find a good balance between keeping the street as large as possible and limiting the margin violations (i.e., instances that end up in the middle of the street or even on the wrong side). This is called soft margin classification.

In Scikit-Learn’s SVM classes, you can control this balance using the C hyperparameter: a smaller C value
leads to a wider street but more margin violations. The following figure shows the decision boundaries and margins
of two soft margin SVM classifiers on a dataset that is not linearly separable. On the left, using a high C value,
the classifier makes fewer margin violations but ends up with a smaller margin. On the right, using a low
C value, the margin is much larger, but many instances end up on the street. However, it seems likely that
the second classifier will generalize better: in fact, even on this training set it makes fewer prediction
errors, since most of the margin violations are actually on the correct side of the decision boundary.

![alt text](images/im4.png)
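
The trade-off controlled by C can be sketched numerically. The example below is an illustration under stated assumptions: it uses `SVC(kernel="linear")` rather than `LinearSVC` for a precise solution on this small dataset, the two C values (0.01 and 100) are arbitrary, and the `0.99` threshold is a tolerance chosen here so that instances sitting exactly on the edge of the street are not counted as violations.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris["data"][:, [2, 3]])  # petal length, width
y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0
y_signed = 2 * y - 1                   # same labels encoded as -1 / +1

results = {}
for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # The street is bounded by decision values -1 and +1, so its width is 2 / ||w||
    margin_width = 2 / np.linalg.norm(clf.coef_[0])
    # An instance violates the margin when y * f(x) is clearly below 1
    violations = int(np.sum(y_signed * clf.decision_function(X) < 0.99))
    results[C] = (margin_width, violations)
    print(f"C={C}: margin width={margin_width:.2f}, margin violations={violations}")
```

The low-C model should report a wider street together with more instances inside it, which is the balance described above.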

The following Scikit-Learn code loads the iris dataset, scales the features, and then trains a linear SVM
model (using the Linear Support Vector Classifier (LinearSVC) class with C = 1 and the hinge loss function) to detect Iris-Virginica flowers. The resulting model corresponds to the right plot in the previous figure.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # 1.0 if Iris-Virginica, else 0.0

# Pipeline steps are passed as a list of (name, estimator) pairs
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)
svm_clf.predict([[5.5, 1.7]])
```

```
array([1.])
```

Alternatively, you could use the SVC class, using SVC(kernel="linear", C=1), but it is much slower,
especially with large training sets, so it is not recommended.



### Nonlinear SVM Classification

Although linear SVM classifiers are efficient and work surprisingly well in many cases, many datasets
are not even close to being linearly separable. One approach to handling nonlinear datasets is to add more
features, such as polynomial features; in some cases this can result in a linearly
separable dataset. Consider the left plot in the following figure: it represents a simple dataset with just one feature $x_1$. This dataset is not linearly separable, as you can see. But if you add a second feature $x_2 = (x_1)^2$, the resulting 2D dataset is perfectly linearly separable.

![alt text](images/im5.png)
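
The effect of adding the squared feature can be reproduced with a small sketch. The nine-point dataset below (integers from -4 to 4, with the inner five points in the positive class) is an assumption chosen to resemble the figure, not data taken from the notes.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 1D dataset: the positive class sits in the middle of the axis
x1 = np.arange(-4.0, 5.0)           # -4, -3, ..., 4
y = (np.abs(x1) <= 2).astype(int)   # 1 for the inner points, 0 for the outer ones

X_1d = x1.reshape(-1, 1)            # original feature only
X_2d = np.c_[x1, x1 ** 2]           # add the second feature x2 = x1 squared

acc_1d = SVC(kernel="linear", C=100).fit(X_1d, y).score(X_1d, y)
acc_2d = SVC(kernel="linear", C=100).fit(X_2d, y).score(X_2d, y)

print("training accuracy with x1 only:    ", acc_1d)
print("training accuracy with x1 and x1^2:", acc_2d)
```

With a single feature, a linear classifier is just a threshold on the axis and cannot isolate the middle segment; in the 2D space a horizontal line between $x_2 = 4$ and $x_2 = 9$ separates the classes.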

To implement this idea using Scikit-Learn, you can create a Pipeline containing a PolynomialFeatures
transformer, followed by a StandardScaler and a LinearSVC.

```python
# Polynomial features followed by a linear SVM
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Assumption: the make_moons import suggests the moons toy dataset was intended;
# the sample size, noise level and seed below are illustrative.
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge")),
])
polynomial_svm_clf.fit(X, y)
```

```
Pipeline(memory=None,
         steps=[('poly_features',
                 PolynomialFeatures(degree=3, include_bias=True,
                                    interaction_only=False, order='C')),
                ('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svm_clf',
                 LinearSVC(C=10, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='hinge', max_iter=1000, multi_class='ovr',
                           penalty='l2', random_state=None, tol=0.0001,
                           verbose=0))],
         verbose=False)
```

### Polynomial Kernel

Adding polynomial features is simple to implement and can work great with all sorts of Machine Learning
algorithms (not just SVMs), but at a low polynomial degree it cannot deal with very complex datasets,
and with a high polynomial degree it creates a huge number of features, making the model too slow.
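
To get a feel for how quickly the number of features grows, the following sketch counts the columns that `PolynomialFeatures` would produce for a few input sizes and degrees. The helper name `n_poly_features` and the chosen sizes are illustrative assumptions; with the default bias column included, the count matches the binomial coefficient C(n + d, d).

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_poly_features(n_features, degree):
    """Number of columns produced by PolynomialFeatures (bias column included)."""
    poly = PolynomialFeatures(degree=degree)
    poly.fit(np.zeros((1, n_features)))  # only the shape of the input matters here
    return poly.n_output_features_

for n_features in (2, 10, 100):
    for degree in (2, 3, 4):
        count = n_poly_features(n_features, degree)
        print(f"{n_features:>3} input features, degree {degree}: "
              f"{count} output features (formula: {comb(n_features + degree, degree)})")
```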

Fortunately, when using SVMs you can apply an almost miraculous mathematical technique called the
kernel trick. It makes it possible to get the same result as if you added many
polynomial features, even with very high-degree polynomials, without actually having to add them. So
there is no combinatorial explosion of the number of features since you don’t actually add any features.
This trick is implemented by the SVC class. 


```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same X, y as in the previous cell; steps are passed as a list
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5)),
])
poly_kernel_svm_clf.fit(X, y)
```

```
Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svm_clf',
                 SVC(C=5, cache_size=200, class_weight=None, coef0=1,
                     decision_function_shape='ovr', degree=3,
                     gamma='auto_deprecated', kernel='poly', max_iter=-1,
                     probability=False, random_state=None, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)
```

This code trains an SVM classifier using a 3rd-degree polynomial kernel. It is represented on the left of
the following figure. On the right is another SVM classifier using a 10th-degree polynomial kernel. Obviously, if
your model is overfitting, you might want to reduce the polynomial degree. Conversely, if it is
underfitting, you can try increasing it. The hyperparameter coef0 controls how much the model is
influenced by high-degree polynomials versus low-degree polynomials.

![alt text](images/im6.png)
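
The kernel trick can be checked by hand in a tiny case. The sketch below assumes a 2nd-degree polynomial kernel with `gamma=1` and `coef0=0`, so that the kernel reduces to $(\mathbf{a}^T\mathbf{b})^2$; the feature map `phi` and the two vectors are illustrative choices. It compares the dot product of the explicitly transformed vectors with the kernel value computed directly on the original vectors.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

def phi(x):
    """Explicit 2nd-degree feature map for a 2D vector (illustrative helper)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = phi(a) @ phi(b)   # dot product in the expanded 3D feature space
via_kernel = (a @ b) ** 2    # same quantity computed from the original 2D vectors
via_sklearn = polynomial_kernel(
    a.reshape(1, -1), b.reshape(1, -1), degree=2, gamma=1, coef0=0
)[0, 0]

print(explicit, via_kernel, via_sklearn)
```

All three values agree, even though the second and third computations never build the expanded features. This is what lets an SVM with a polynomial kernel behave as if high-degree features had been added.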


### Linear SVM under the hood
The linear SVM classifier model predicts the class of a new instance $\mathbf{x}$ by simply computing the decision
function
$$ \mathbf{w}^{T}\mathbf{x} + b = w_{1}x_{1} + \cdots + w_{n}x_{n} + b $$
and checking its sign:
$$ \hat{y} = \begin{cases} 0 & \text{if } \mathbf{w}^{T}\mathbf{x} + b < 0 \\ 1 & \text{if } \mathbf{w}^{T}\mathbf{x} + b \geq 0 \end{cases} $$

The following figure shows the decision function of a linear SVM classifier: it is a two-dimensional plane, since this dataset has two features (petal width and petal length). The decision boundary is the set of points where the decision function is equal to 0: it is the intersection of two planes, which is a straight line (represented by the thick solid line). The dashed lines represent the points where the decision function is equal to 1 or -1: they are parallel and at equal distance to the decision boundary, forming a margin around it. Training a linear SVM classifier means finding the values of $\mathbf{w}$ and $b$ that make this margin as wide as possible while avoiding margin violations (hard margin) or limiting them (soft margin).

![alt text](images/im7.png)
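
The decision function above can be reproduced by hand from a fitted model. This is a minimal sketch, assuming a linear-kernel `SVC` with `C=1` trained on the scaled petal features to detect Iris-Virginica; the weights $\mathbf{w}$ and bias $b$ are read from the estimator's `coef_` and `intercept_` attributes.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris["data"][:, [2, 3]])  # petal length, width
y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0

clf = SVC(kernel="linear", C=1).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Decision function computed by hand: w^T x + b for every instance
scores = X @ w + b
# Predict class 1 when the score is non-negative, class 0 otherwise
predictions = (scores >= 0).astype(int)

print("matches decision_function:", np.allclose(scores, clf.decision_function(X)))
print("matches predict:          ", np.array_equal(predictions, clf.predict(X)))
# The dashed lines sit at scores -1 and +1, so the street is 2 / ||w|| wide
print("margin width:", 2 / np.linalg.norm(w))
```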
