### Support Vector Machines

A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression, although it is mostly used for classification.

In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate.

Then we perform classification by finding the hyper-plane that best separates the two classes.

![Support Vector Machines](./img/SVM_1.png)

**Note: The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points.**

Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. 

Also, the dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.
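
A hyperplane can be written as w·x + b = 0, and the side a point falls on is given by the sign of w·x + b. The following is a minimal sketch of that idea; the weights `w`, the offset `b`, and the sample points are hypothetical values chosen only for illustration:

```python
import numpy as np

# Hypothetical hyperplane in 2-D (with two features it is just a line):
# x1 + x2 - 3 = 0
w = np.array([1.0, 1.0])
b = -3.0


def side_of_hyperplane(points, w, b):
    """Return +1 or -1 for each point, depending on its side of w.x + b = 0."""
    scores = points @ w + b
    return np.where(scores >= 0, 1, -1)


points = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 0.5]])
labels = side_of_hyperplane(points, w, b)
print(labels)  # [-1 -1  1  1]
```

With three features, `w` would have three entries and the same function would describe a plane instead of a line.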

#### How does SVM Work?

**Identify the right hyper-plane (Scenario-1):** Here, we have three hyper-planes (A, B, and C). Now, identify the right hyper-plane to classify stars and circles.

![Scenario 1](./img/SVM_1_1.png)

You need to remember a rule of thumb to identify the right hyper-plane: “Select the hyper-plane which segregates the two classes better”. In this scenario, hyper-plane “B” does this job best.

**Identify the right hyper-plane (Scenario-2):** Here, we have three hyper-planes (A, B, and C) and all of them segregate the classes well. Now, how can we identify the right hyper-plane?

![Scenario 2](./img/SVM_3.png)

Above, you can see that the margin for hyper-plane C is larger than for both A and B. Here, the margin is the distance between the hyper-plane and the nearest data point of either class.

Hence, we choose C as the right hyper-plane. Another compelling reason for selecting the hyper-plane with the larger margin is robustness: if we select a hyper-plane with a small margin, there is a high chance of misclassification.
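
To make the notion of margin concrete, here is a small sketch using scikit-learn's `SVC` with a linear kernel. The six toy points are made up for illustration, and the very large `C` is an assumption used to approximate a hard-margin SVM on separable data. For a linear SVM with weight vector w, the width of the margin band is 2 / ‖w‖:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: three points per class.
X = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                         # normal vector of the hyper-plane
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin lines

print("margin width:", margin_width)
print("support vectors:\n", clf.support_vectors_)
```

The support vectors printed at the end are the training points closest to the hyper-plane; they alone determine where it lies.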

**Identify the right hyper-plane (Scenario-3):** Hint: use the rules discussed in the previous scenarios to identify the right hyper-plane.

![Scenario 3](./img/SVM_5.png)

Some of you may have selected hyper-plane B, as it has a larger margin than A. But here is the catch: SVM selects the hyper-plane that classifies the classes accurately before maximizing the margin. Here, hyper-plane B has a classification error, whereas A classifies everything correctly. Therefore, the right hyper-plane is A.

**Can we classify two classes (Scenario-4)?:** Below, I am unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.

![Scenario 4](./img/SVM_61.png)

As already mentioned, the one star at the other end is an outlier for the star class. The SVM algorithm can tolerate such outliers (by allowing some training points to be misclassified) and still find the hyper-plane with the maximum margin. Hence, we can say that SVM classification is robust to outliers.

![Scenario 4](./img/SVM_71.png)

**Find the hyper-plane to segregate two classes (Scenario-5):** In the scenario below, we can’t draw a linear hyper-plane between the two classes, so how does SVM classify them? So far, we have only looked at linear hyper-planes.

![Scenario 5](./img/SVM_8.png)

SVM can solve this problem easily, by introducing an additional feature. Here, we will add a new feature z = x² + y². Now, let’s plot the data points on the x and z axes:

![Scenario 6](./img/SVM_9.png)
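
The effect of the extra feature z = x² + y² can be sketched in code. The example below is only an illustration under assumptions: the data comes from scikit-learn's `make_circles` (not from the figures above), and the variable names are hypothetical. It fits a linear SVM once on the original (x, y) coordinates and once on (x, z):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic stand-in for the scenario: one class inside a ring of the other.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear hyper-plane in the original (x, y) space cannot separate the classes.
linear_xy = SVC(kernel="linear").fit(X, y)
acc_xy = linear_xy.score(X, y)

# Add the new feature z = x^2 + y^2 and work in the (x, z) space instead.
z = (X ** 2).sum(axis=1)
X_xz = np.column_stack([X[:, 0], z])

linear_xz = SVC(kernel="linear").fit(X_xz, y)
acc_xz = linear_xz.score(X_xz, y)

print("training accuracy in (x, y):", acc_xy)
print("training accuracy in (x, z):", acc_xz)
```

In the (x, z) space the inner class has small z values and the outer class has large ones, so a straight line is enough to tell them apart.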

#### Pros and Cons associated with SVM

Pros:

- It works really well when there is a clear margin of separation.
- It is effective in high dimensional spaces.
- It is effective in cases where the number of dimensions is greater than the number of samples.
- It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.


Cons:

- It doesn’t perform well on large data sets because the required training time is high.
- It also doesn’t perform very well when the data set is noisy, i.e. when the target classes overlap.
- SVM doesn’t directly provide probability estimates. In scikit-learn’s `SVC` they are computed with an expensive five-fold cross-validation when `probability=True` is set.
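
The last point can be illustrated with a short sketch. The data set here is synthetic (`make_classification`), chosen only so the example is self-contained. By default an `SVC` exposes signed distances to the hyper-plane through `decision_function`; probability estimates are only available when the model is created with `probability=True`, which makes fitting slower:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# probability=True triggers the extra internal cross-validation during fit.
clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)

scores = clf.decision_function(X[:3])  # signed distances, not probabilities
proba = clf.predict_proba(X[:3])       # one row per sample, one column per class

print(scores)
print(proba)
```

If only class labels are needed, leaving `probability` at its default avoids the additional training cost.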

### SVM for Non-Linear Data Sets

An example of non-linear data is:

![SVM's for Non-Linear Data Sets](./img/non_linear_svm.png)

In this case we cannot find a straight line to separate apples from lemons. So how can we solve this problem? We will use the kernel trick!

The basic idea is that when a data set is inseparable in its current dimensions, we add another dimension in the hope that the data becomes separable there.

The example above is in 2D and is inseparable, but in 3D there may be a gap between the apples and the lemons. For instance, there may be a level difference, with apples on level one and lemons on level two. In that case we can easily draw a separating hyperplane (in 3D a hyperplane is a plane) between levels one and two.

Let's assume that we add another dimension called x3, whose value is computed from the existing coordinates as x3 = x1² + x2².

If we plot the surface defined by x3 = x1² + x2², we will get something like this:

![3d_SVM](./img/3d_svm.png)

Now we have to map the apples and lemons (which are just simple points) to this new space. 

What did we do? We just used a transformation that assigns each point a level based on its distance from the origin.

Points at the origin sit on the lowest level. As we move away from the origin we climb the hill (moving from the center of the surface towards its edges), so the level of the points gets higher.

Now, if we take the lemon in the center as the origin, we get something like this:

![Transformed SVM](./img/transformed_svm.png)

Now we can easily separate the two classes. Such transformations are performed implicitly by functions called kernels.
Popular kernels include the polynomial kernel, the Gaussian radial basis function (RBF) kernel, the Laplace RBF kernel, the sigmoid kernel, and the ANOVA RBF kernel.
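
Strictly speaking, a kernel is a function that returns the inner product of two points in the transformed space without ever computing the transformation itself; that shortcut is the "trick". The sketch below checks this for one simple case, the degree-2 polynomial kernel K(a, b) = (a·b)² in 2-D, whose explicit feature map is φ(x) = (x1², √2·x1·x2, x2²). The function names and sample vectors are hypothetical:

```python
import numpy as np


def phi(v):
    """Explicit feature map for the kernel (a.b)^2 in two dimensions."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])


def poly2_kernel(a, b):
    """Same inner product, computed directly in the original 2-D space."""
    return np.dot(a, b) ** 2


a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = np.dot(phi(a), phi(b))  # map to 3-D first, then take the dot product
implicit = poly2_kernel(a, b)      # never leaves 2-D

print(explicit, implicit)
```

Both values agree up to floating-point rounding, which is why an SVM can work in the higher-dimensional space while only evaluating the kernel on the original points.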

Another example would be:

![](./img/1d_svm.png)

After using the kernel and after all the transformations we will get:

![](./img/transformed_1d_kernel.png)

So after the transformation, we can easily delimit the two classes using just a single line.
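
A one-dimensional case like this can be sketched as follows. The points and the mapping x → (x, x²) are assumptions made for illustration and are not taken from the figures above:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 1-D data: class 0 sits in the middle, class 1 on both sides.
x = np.array([-4, -3, -2, -1, 0, 1, 2, 3, 4], dtype=float)
y = np.array([1, 1, 0, 0, 0, 0, 0, 1, 1])

# In 1-D a linear boundary is a single threshold, which cannot isolate the middle.
X_1d = x.reshape(-1, 1)
acc_1d = SVC(kernel="linear").fit(X_1d, y).score(X_1d, y)

# Map each point to (x, x^2); now the classes differ in their second coordinate.
X_2d = np.column_stack([x, x ** 2])
acc_2d = SVC(kernel="linear").fit(X_2d, y).score(X_2d, y)

print("accuracy with x only:", acc_1d)
print("accuracy with (x, x^2):", acc_2d)
```

After the mapping, class 0 has x² ≤ 4 and class 1 has x² ≥ 9, so a single straight line between those values separates them.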

In real-life applications the boundary won’t be a simple straight line; there will be lots of curves and many dimensions. In some cases no hyperplane separates the data cleanly with no points inside the margin, so we need a trade-off: some tolerance for outliers.

Fortunately, the SVM algorithm has a so-called regularization parameter to configure this trade-off and tolerate outliers.

#### Regularization

The regularization parameter (called `C` in scikit-learn) tells the SVM optimization how much you want to avoid misclassifying each training example.

If C is high, the optimization will choose a smaller-margin hyperplane, so the misclassification rate on the training data will be lower.

On the other hand, if C is low, the margin will be large, even if some training examples end up misclassified. This is shown in the following two diagrams:

![](./img/reg_svm.png)

As you can see in the image, when C is low, the margin is larger (so the boundary has fewer curves and doesn’t strictly follow the data points), even though two apples were classified as lemons. When C is high, the boundary is full of curves and all the training data is classified correctly.

**Note:** even if all the training data is correctly classified, this doesn’t mean that increasing C will always improve accuracy on new data, because a high C can lead to overfitting.
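
The influence of C can be explored with a short experiment. This is a sketch under assumptions: the data is the synthetic `make_moons` set rather than the apples and lemons above, and the three C values are arbitrary. It reports training accuracy, held-out accuracy, and the number of support vectors for each setting:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy synthetic data with overlapping classes.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = {}
for C in (0.01, 1.0, 1000.0):
    clf = SVC(kernel="rbf", C=C, gamma="scale").fit(X_train, y_train)
    results[C] = {
        "train_accuracy": clf.score(X_train, y_train),
        "test_accuracy": clf.score(X_test, y_test),
        # A wide, tolerant margin contains many points, hence many support vectors.
        "n_support_vectors": int(clf.n_support_.sum()),
    }
    print(C, results[C])
```

Comparing the training and held-out columns is one way to see whether a higher C is still helping or has started to overfit; in practice C is usually chosen by cross-validation.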

#### Examples of SVM kernels

- **Polynomial kernel.** It is popular in image processing. The equation is:

![](./img/polynomial-kernel.png)

where d is the degree of the polynomial.

- **Gaussian kernel.** It is a general-purpose kernel, used when there is no prior knowledge about the data. The equation is:

![](./img/gaussian-kernel.png)

- **Sigmoid kernel.** We can use it as a proxy for neural networks. The equation is:

![](./img/sigmoid-kernel.png)
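
As a rough sketch, the three kernels can be written as plain NumPy functions. The parameter names `degree`, `gamma`, and `coef0` follow scikit-learn's convention and are an assumption here; they may differ from the symbols used in the equations above. Each function is checked against its counterpart in `sklearn.metrics.pairwise`:

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel, sigmoid_kernel


def poly_k(x, y, degree=3, gamma=1.0, coef0=1.0):
    """Polynomial kernel: (gamma * x.y + coef0) ** degree."""
    return (gamma * np.dot(x, y) + coef0) ** degree


def gaussian_k(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))


def sigmoid_k(x, y, gamma=0.1, coef0=0.0):
    """Sigmoid kernel: tanh(gamma * x.y + coef0)."""
    return np.tanh(gamma * np.dot(x, y) + coef0)


x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
X, Y = x.reshape(1, -1), y.reshape(1, -1)  # the pairwise helpers expect 2-D arrays

poly_ok = np.isclose(
    poly_k(x, y), polynomial_kernel(X, Y, degree=3, gamma=1.0, coef0=1.0)[0, 0]
)
gauss_ok = np.isclose(gaussian_k(x, y), rbf_kernel(X, Y, gamma=0.5)[0, 0])
sigmoid_ok = np.isclose(
    sigmoid_k(x, y), sigmoid_kernel(X, Y, gamma=0.1, coef0=0.0)[0, 0]
)

print(poly_ok, gauss_ok, sigmoid_ok)
```

In `SVC` these correspond to `kernel="poly"`, `kernel="rbf"`, and `kernel="sigmoid"`.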

```python
# Import libraries
import pandas as pd
import numpy as np
from sklearn.svm import SVR  # support vector machine for regression
from sklearn.svm import SVC  # support vector machine for classification
```

```python
# Load the red wine quality data set
df = pd.read_csv('./data/winequality-red.csv')
df.head()
```

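| Unnamed: 0 | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |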
