# **Support Vector Machines** 

A Support Vector Machine (SVM) is a powerful supervised machine learning model used for classification. An SVM classifies a point by defining a decision boundary and then checking which side of the boundary the point falls on. The sections below explain how these decision boundaries get defined, but for now, know that they’re defined using a training set of already-classified points. That’s why SVMs are supervised machine learning models.



## **Decision Boundaries**

Decision boundaries are easiest to wrap your head around when the data has two features. In this case, the decision boundary is a line. Take a look at the example below.


![image](images/2d_boundary.png)

After finding a decision boundary using the training set, you could give the SVM an unlabeled data point, and it will predict whether or not that team will make the playoffs.

Decision boundaries exist even when your data has more than two features. If there are three features, the decision boundary is now a plane rather than a line.

![images](images/3d_boundary.png)

With more than three features, the data becomes difficult to visualize. Nonetheless, SVMs can still find a decision boundary. Rather than a separating line or a separating plane, the decision boundary is then called a separating hyperplane.
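To make “which side of the boundary” concrete in any number of dimensions, a hyperplane can be written as w·x + b = 0, and the sign of w·x + b tells you which side a point is on. The sketch below is only an illustration: the helper `side_of_hyperplane` and the values of `w` and `b` are made up, not something an SVM has learned.

```python
import numpy as np

def side_of_hyperplane(w, b, points):
    """Return +1 or -1 for each point, depending on its side of w.x + b = 0."""
    scores = np.asarray(points, dtype=float) @ np.asarray(w, dtype=float) + b
    return np.where(scores >= 0, 1, -1)

# Two features: the boundary x + y - 10 = 0 is a line.
line_sides = side_of_hyperplane([1, 1], -10, [[2, 3], [8, 9]])

# Four features: same formula, but the boundary is now a hyperplane.
hyper_sides = side_of_hyperplane([1, -1, 2, 0.5], -3, [[1, 0, 0, 0], [4, 1, 2, 2]])

print(line_sides, hyper_sides)
```

The same function handles both cases because nothing in the formula depends on the number of features.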

One problem that SVMs need to solve is figuring out which decision boundary to use. After all, there could be an infinite number of decision boundaries that correctly separate the two classes. There are so many valid decision boundaries, but which one is best? In general, we want our decision boundary to be as far away from the training points as possible. Maximizing the distance between the decision boundary and the nearest points in each class decreases the chance of misclassifying a new point.
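As a rough illustration of “as far away from training points as possible”, the sketch below compares two candidate boundaries by their distance to the nearest training point. The points, the two vertical lines, and the helper `min_distance_to_line` are all made up for this example; an SVM searches over every possible line rather than two hand-picked ones.

```python
import numpy as np

def min_distance_to_line(w, b, points):
    """Smallest perpendicular distance from any point to the line w.x + b = 0."""
    w = np.asarray(w, dtype=float)
    distances = np.abs(np.asarray(points, dtype=float) @ w + b) / np.linalg.norm(w)
    return distances.min()

# A small made-up training set: the first three points are one class,
# the last three are the other.
points = [[1, 2], [1, 5], [2, 2], [7, 5], [9, 4], [8, 2]]

# Two vertical lines that both separate the classes.
close_line = min_distance_to_line([1, 0], -2.5, points)   # the line x = 2.5
middle_line = min_distance_to_line([1, 0], -4.5, points)  # the line x = 4.5

print(close_line, middle_line)
```

Both lines classify every training point correctly, but the second one leaves more room on each side, which is the property an SVM optimizes for.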

## **Support Vectors**

The support vectors are the points in the training set closest to the decision boundary. In fact, these vectors are what define the decision boundary. But why are they called vectors? Instead of thinking about the training data as points, we can think of them as vectors coming from the origin.

![image](images/vectors.png)

These vectors are crucial in defining the decision boundary; that’s where the “support” comes from. There are always at least two support vectors, with at least one from each class. The distance between a support vector and the decision boundary is called the margin. We want to make the margin as large as possible. The support vectors are highlighted in the image below:

![images](images/margins.png)

Because the support vectors are so critical in defining the decision boundary, many of the other training points can be ignored. This is one of the advantages of SVMs. Many supervised machine learning algorithms use every training point in order to make a prediction, even though many of those training points aren’t relevant. Once an SVM is trained, it only needs the support vectors to make a prediction, which keeps prediction fast.

## **Model Fit**

### **sklearn.svm.SVC**

To use scikit-learn’s SVM, we first need to create an `SVC` object. It is called an SVC because scikit-learn calls the model a Support Vector Classifier rather than a Support Vector Machine. We’ll soon go into what the `kernel` parameter does, but for now, let’s use a `'linear'` kernel.


```python
from sklearn.svm import SVC

classifier = SVC(kernel="linear")
```

Next, the model needs to be trained on a list of data points and a list of labels associated with those data points. The labels are analogous to the color of the point: you can think of a 1 as a red point and a 0 as a blue point. The training is done using the `.fit()` method:

```python
training_points = [[1, 2], [1, 5], [2, 2], [7, 5], [9, 4], [8, 2]]
labels = [1, 1, 1, 0, 0, 0]
classifier.fit(training_points, labels)
```

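Fitting returns the classifier itself, and the notebook displays its parameters: `C=1.0`, `kernel='linear'`, `degree=3`, `gamma='scale'`, `coef0=0.0`, `shrinking=True`, `probability=False`, `tol=0.001`, `cache_size=200`, and `class_weight` left unset.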


The graph of this dataset would look like this:

![image](images/example.png)

Calling `.fit()` finds the line that separates the two classes. Finally, the classifier predicts the label of new points using the `.predict()` method. The `.predict()` method takes a list of points you want to classify. Even if you only want to classify one point, make sure it is in a list:

```python
print(classifier.predict([[3, 2]]))
```

```
[1]
```
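If you want to see how far a point is from the boundary rather than just its label, `SVC` also provides `decision_function`. The sketch below refits the same classifier so that it runs on its own; the second query point, `[8, 4]`, is a made-up addition.

```python
from sklearn.svm import SVC

training_points = [[1, 2], [1, 5], [2, 2], [7, 5], [9, 4], [8, 2]]
labels = [1, 1, 1, 0, 0, 0]

classifier = SVC(kernel="linear")
classifier.fit(training_points, labels)

# Several points can be classified in one call.
new_points = [[3, 2], [8, 4]]
predictions = classifier.predict(new_points)

# decision_function gives a signed score per point. With two classes,
# a positive score maps to classifier.classes_[1] and a negative score
# maps to classifier.classes_[0].
scores = classifier.decision_function(new_points)

for point, label, score in zip(new_points, predictions, scores):
    print(point, label, round(float(score), 3))
```

A score close to zero means the point sits near the decision boundary, so its predicted label is less clear-cut than one with a large positive or negative score.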


In addition to using the SVM to make predictions, you can inspect some of its attributes. For example, you can print `classifier.support_vectors_` to see which points from the training set are the support vectors.

```python
classifier.support_vectors_
```

```
array([[7., 5.],
       [8., 2.],
       [2., 2.]])
```
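One way to check the earlier claim that the support vectors define the boundary is to refit a second classifier on the support vectors alone and compare the two lines. This is a minimal sketch on the same toy data; the variable names are made up, and the margin calculation assumes the usual formulation in which support vectors sit at w·x + b = ±1.

```python
import numpy as np
from sklearn.svm import SVC

training_points = np.array([[1, 2], [1, 5], [2, 2], [7, 5], [9, 4], [8, 2]])
labels = np.array([1, 1, 1, 0, 0, 0])

full_model = SVC(kernel="linear").fit(training_points, labels)

# support_ holds the indices of the support vectors within the training set.
sv_idx = full_model.support_
sv_only_model = SVC(kernel="linear").fit(training_points[sv_idx], labels[sv_idx])

# For a linear kernel, coef_ and intercept_ describe the line w.x + b = 0.
print(full_model.coef_, full_model.intercept_)
print(sv_only_model.coef_, sv_only_model.intercept_)

# Distance from the boundary to the nearest support vector,
# under the assumption stated above.
margin = 1 / np.linalg.norm(full_model.coef_)
print(margin)
```

On this dataset the two fits should agree up to the solver’s tolerance, which is what it means for the remaining training points to be ignorable.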

## **Outliers**

SVMs try to maximize the size of the margin while still correctly separating the points of each class. As a result, outliers can be a problem. Consider the image below.

![image](images/outliers.png)

The size of the margin decreases when a single outlier is present, and as a result, the decision boundary changes as well. However, if we allowed the decision boundary to have some error, we could still use the original line.

SVMs have a parameter C that determines how much error the SVM will allow for. If C is large, then the SVM has a hard margin — it won’t allow for many misclassifications, and as a result, the margin could be fairly small. If C is too large, the model runs the risk of overfitting. It relies too heavily on the training data, including the outliers.

On the other hand, if C is small, the SVM has a soft margin. Some points might fall on the wrong side of the line, but the margin will be large. This is resistant to outliers, but if C gets too small, you run the risk of underfitting. The SVM will allow for so much error that the training data won’t be represented.
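The trade-off between a hard and a soft margin can be seen on a small example. The sketch below uses made-up data with a single outlier and fits a linear `SVC` with a large and a small `C`; the helper `fit_and_measure` is hypothetical, and the margin is computed as 1/‖w‖, assuming the usual formulation in which support vectors on the margin satisfy w·x + b = ±1.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data: two clusters plus one class-1 outlier sitting near class 0.
points = np.array([[1, 2], [1, 5], [2, 2], [2, 4],
                   [7, 5], [9, 4], [8, 2], [8, 4],
                   [6, 3]])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1])

def fit_and_measure(C):
    """Fit a linear SVC and return (margin, number of training errors)."""
    model = SVC(kernel="linear", C=C).fit(points, labels)
    margin = 1 / np.linalg.norm(model.coef_)
    training_errors = int((model.predict(points) != labels).sum())
    return margin, training_errors

hard_margin, hard_errors = fit_and_measure(C=1000)
soft_margin, soft_errors = fit_and_measure(C=0.01)

print(hard_margin, hard_errors)
print(soft_margin, soft_errors)
```

With the large `C`, the boundary squeezes in next to the outlier to avoid any training error; with the small `C`, the margin is wider and the model is free to leave some points on the wrong side.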

When using scikit-learn’s `SVC`, you can set the value of `C` when you create the object:

```python
classifier = SVC(C=0.01)
```

The optimal value of C will depend on your data. Don’t always maximize margin size at the expense of error. Don’t always minimize error at the expense of margin size. The best strategy is to validate your model by testing many different values for C.
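One common way to test many values of `C` is cross-validation. The sketch below uses scikit-learn’s `GridSearchCV` on synthetic data from `make_classification`; the dataset, the list `candidate_C`, and the choice of five folds are assumptions for illustration, so substitute your own data and grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data with some label noise; swap in your own X and y.
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=0)

# Candidate values spread over several orders of magnitude.
candidate_C = [0.01, 0.1, 1, 10, 100]

# Each value of C is scored with 5-fold cross-validation.
search = GridSearchCV(SVC(kernel="linear"), param_grid={"C": candidate_C}, cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The winning value applies only to this dataset; a different dataset may prefer a very different `C`.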

## **Kernels**

Up to this point, we have been using datasets that are linearly separable. This means that it’s possible to draw a straight decision boundary between the two classes. However, what would happen if an SVM came across a dataset that wasn’t linearly separable?

![image](images/non-linear.png)

It’s impossible to draw a straight line to separate the red points from the blue points!

Luckily, SVMs have a way of handling these data sets. Remember when we set `kernel = 'linear'` when creating our SVM? Kernels are the key to creating a decision boundary between data points that are not linearly separable.

```python
# A polynomial kernel of degree 2
classifier = SVC(kernel="poly", degree=2)

# Some of the built-in kernel names that SVC accepts
kernel_list = ["linear", "poly", "rbf", "sigmoid"]
```

The kernel transforms the data in a clever way to make it linearly separable. We used a polynomial kernel of degree 2, which in effect transforms every point in the following way:

$(x,y)\rightarrow (\sqrt{2}xy, x^2, y^2)$
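A short numerical check can make this mapping less mysterious. The sketch below assumes the simplest degree-2 polynomial kernel, K(a, b) = (a·b)², with no scaling or constant term (scikit-learn’s `"poly"` kernel additionally applies `gamma` and `coef0`). The function `poly_features` and the two sample points are made up for illustration.

```python
import numpy as np

def poly_features(point):
    """Map (x, y) to (sqrt(2)*x*y, x^2, y^2)."""
    x, y = point
    return np.array([np.sqrt(2) * x * y, x ** 2, y ** 2])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

# Dot product after mapping both points into three dimensions.
mapped_dot = poly_features(a) @ poly_features(b)

# The same number computed directly from the original 2-D points.
kernel_value = (a @ b) ** 2

print(mapped_dot, kernel_value)
```

The two values agree, which is why a kernel lets the SVM work as if the data had been transformed without ever building the higher-dimensional points.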

If we plot these new three-dimensional points, we get the following graph:

![image.png](images/non-linear-boundary.png)

Look at that! All of the blue points have scooted away from the red ones. By projecting the data into a higher dimension, the two classes are now linearly separable by a plane. We could visualize what this plane would look like in two dimensions to get the following decision boundary.

![image.png](images/2d_non-linear_boundary.png)
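To see the difference a kernel makes, the sketch below compares a linear kernel with a degree-2 polynomial kernel on data where one class forms a ring around the other. The dataset comes from scikit-learn’s `make_circles` and is an assumption for illustration, not the data shown in the images above.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: one class forms a ring around the other.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Accuracy on held-out points for each kernel.
linear_score = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
poly_score = SVC(kernel="poly", degree=2).fit(X_train, y_train).score(X_test, y_test)

print(linear_score, poly_score)
```

No straight line can split a ring from its center, so the linear kernel struggles, while the polynomial kernel can separate the classes using the squared terms.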

## **Radial Basis Function**

The most commonly used kernel in SVMs is the radial basis function (rbf) kernel. This is the default kernel used in scikit-learn’s SVC object. If you don’t specifically set the kernel to something else, such as "linear", "poly", or "sigmoid", the SVC object will use an rbf kernel. If you want to be explicit, you can set `kernel = "rbf"`, although that is redundant.

It is very tricky to visualize how an rbf kernel “transforms” the data. The polynomial kernel we used transformed two-dimensional points into three-dimensional points. An rbf kernel transforms two-dimensional points into points with an infinite number of dimensions!

We won’t get into how the kernel does this — it involves some fairly complicated linear algebra. However, it is important to know about the rbf kernel’s `gamma` parameter.

```python
classifier = SVC(kernel="rbf", gamma=0.5, C=2)
```

`gamma` is similar to the `C` parameter. You can essentially tune the model to be more or less sensitive to the training data. A higher `gamma`, say `100`, will put more importance on the training data and could result in overfitting. Conversely, a lower `gamma` like `0.01` makes the points in the training data less relevant and can result in underfitting.
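The effect of `gamma` is easiest to see by comparing accuracy on the training data with accuracy on held-out data. The sketch below does this on a synthetic dataset from `make_moons`; the dataset, the train/test split, and the three `gamma` values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, noisy two-class data that is not linearly separable.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for gamma in [0.01, 1, 100]:
    model = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
    # Store (training accuracy, held-out accuracy) for this gamma.
    results[gamma] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(gamma, results[gamma])
```

A training score far above the held-out score is a sign of overfitting, while low scores on both suggest underfitting.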