# Support Vector Machine (SVM) Tutorial

Support vector machines (SVMs) are supervised learning algorithms that can be used for both classification and regression, although they are more commonly used for **classification**.

To understand how SVMs work, imagine each data point as a vector in $n$-dimensional space, where $n$ is the total number of features.

An SVM classifies data by finding the **optimal hyperplane** that best divides the data into groups by class. In a 2D space (data with two features), this "hyperplane" is a line dividing the space into two regions, as shown below. The trick to finding the optimal line is maximizing the distance from the line to the nearest data points of each class. This distance is the **maximum margin**.

<img src="https://cdn-images-1.medium.com/max/1200/0*0o8xIA4k3gXUDCFU.png">
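
The idea of a maximum margin can also be written down compactly. The following is a sketch in notation that is not part of the original lesson: we assume a weight vector $w$, a bias $b$, and labels $y_i \in \{-1, +1\}$. The hyperplane is the set of points $x$ where $w \cdot x + b = 0$, and for linearly separable data the maximum-margin fit can be stated as:

$$\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \ \text{ for all } i$$

Under this scaling the margin width is $2 / \lVert w \rVert$, so making $\lVert w \rVert$ as small as possible is the same as making the margin as wide as possible.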


**Support vectors** are the data points nearest to the hyperplane. If these points were removed, the optimal hyperplane would change. The position and number of the remaining points, which lie beyond the support vectors, do not change the hyperplane fit at all. This means SVMs can produce solid results on small datasets (as long as they contain informative support vectors). In the image below, the support vectors are circled.

![](https://raw.githubusercontent.com/BeaverWorksMedlytics2020/Data_Public/master/Images/Week1/svm_4.png)

Intuitively, the further from the hyperplane our support vectors lie, the larger the margin, and the more confident we are in our classifier. Therefore, we ideally want our data points to be as far away from the hyperplane as possible, while still being on the correct side.
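
To make support vectors and the margin concrete, here is a minimal sketch on a tiny, made-up 2D dataset (the six points are hypothetical and chosen only for illustration). It assumes a linear kernel, for which the margin width can be computed as $2 / \lVert w \rVert$ from the fitted weights, and it uses a very large `C` to approximate a hard margin. It then refits on the support vectors alone to check that the other points did not influence the hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2D data (hypothetical values for illustration only)
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# For a linear kernel, coef_ holds the weight vector w of the hyperplane
w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("Support vectors:\n", clf.support_vectors_)
print("Margin width:", margin_width)

# Refit using only the support vectors: the hyperplane should be unchanged
sv_idx = clf.support_
clf_sv = SVC(kernel="linear", C=1e6).fit(X[sv_idx], y[sv_idx])
same_hyperplane = np.allclose(clf.coef_, clf_sv.coef_, atol=1e-2)
print("Same hyperplane from support vectors alone:", same_hyperplane)
```

The attributes `support_`, `support_vectors_`, and `coef_` are part of sklearn's `SVC`; note that `coef_` is only available when `kernel="linear"`.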

So what happens when data overlaps, or doesn't have a clear dividing line? Take this image as an example:

<img src="https://miro.medium.com/max/720/1*fv8DDZLaR0t7SO-W6tdDAg.png">

Here we have two options. We can try to draw a straight line, accepting that some points end up on the wrong side:

<img src="https://miro.medium.com/max/600/1*1dwut8cWQ-39POHV48tv4w.png">

Or give up on having a straight line, and define a curved or segmented line instead:

<img src="https://miro.medium.com/max/600/1*gt_dkcA5p0ZTHjIpq1qnLQ.png"> 

Both options can work! However, there are tradeoffs. In some cases, the first option may not be accurate enough. The second option, on the other hand, may take too long for large data sets, and may over-fit to the training data.

## Important Parameters for SVM

In this notebook, we will be using sklearn's SVC (Support Vector Classifier documentation found here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). Feel free to look at the other available SVM models here: https://scikit-learn.org/stable/modules/svm.html.

In this section we will describe three key parameters for training SVMs: Regularization, Gamma, and Kernel.

### Regularization
---

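**Regularization (C)** impacts the division of data by telling the SVM optimization how much you want to avoid misclassifying the training data. Note that in sklearn the strength of the regularization is inversely proportional to `C`, so the bullets below are phrased in terms of the value of `C` itself.

- Low `C` values (stronger regularization) tolerate some misclassified training points and create smooth decision boundaries
- High `C` values (weaker regularization) create more complex decision boundaries, but may over-fit to the training set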

Low C:

<img src="https://miro.medium.com/max/600/1*1dwut8cWQ-39POHV48tv4w.png">

High C:

<img src="https://miro.medium.com/max/600/1*gt_dkcA5p0ZTHjIpq1qnLQ.png"> 
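
The effect of `C` can be explored with a short experiment. The sketch below is not from the original lesson: it assumes a synthetic, overlapping two-class dataset from `make_moons` and compares a very low and a very high `C` with the default RBF kernel. The specific values `0.01` and `1000` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Synthetic, overlapping two-class data (stand-in for a real dataset)
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

results = {}
for C in (0.01, 1000):
    clf = SVC(kernel="rbf", C=C).fit(X, y)
    results[C] = {
        "train_accuracy": clf.score(X, y),
        # n_support_ holds the number of support vectors per class
        "n_support_vectors": int(clf.n_support_.sum()),
    }
    print(f"C={C}: {results[C]}")
```

With a low `C`, the model typically keeps many support vectors and a smoother boundary. With a high `C`, it usually fits the training points more tightly, which is where over-fitting can creep in.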

### Gamma
---

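**Gamma** defines how close a training data point needs to be to impact the decision boundary. With low gamma, even points far from the boundary influence it, which gives a smoother boundary. With high gamma, only nearby points are considered, so much of the data is effectively ignored and the boundary can bend tightly around individual points, which may over-fit. Gamma applies to the RBF kernel (sklearn's default for `SVC`), as well as the polynomial and sigmoid kernels.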

<img src="https://miro.medium.com/max/720/1*dGDQxV8j83VB90skHsXktw.png">
<img src="https://miro.medium.com/max/720/1*ClmsnU_yb1YtIwAAr7krmg.png">
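
A quick way to see gamma at work is to sweep it and compare training and validation accuracy. This is a rough sketch under assumptions that are not in the original lesson: synthetic `make_moons` data, an RBF kernel with `C=1`, and three arbitrary gamma values.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data, split into training and validation sets
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

scores = {}
for gamma in (0.01, 1, 100):
    clf = SVC(kernel="rbf", gamma=gamma, C=1).fit(X_tr, y_tr)
    # Store (training accuracy, validation accuracy) for each gamma
    scores[gamma] = (clf.score(X_tr, y_tr), clf.score(X_va, y_va))
    print(f"gamma={gamma}: train={scores[gamma][0]:.3f}, val={scores[gamma][1]:.3f}")
```

A large gap between training and validation accuracy at high gamma is the usual sign that the boundary has wrapped itself around individual training points.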

### Kernel
---

A **kernel** is essentially a transformation that makes decision boundaries possible for different shaped distributions. In the example below, it is impossible to draw a straight line to separate the circles from the squares.

<img src="https://miro.medium.com/max/720/1*C3j5m3E3KviEApHKleILZQ.png"> <br><br>

However, if we apply a kernel to transform the data into 3D space (for example, with z = x² + y²), we may be able to separate the classes with a flat plane, which appears as a straight line when viewed on the Z-X or Z-Y plane.

<img src="https://miro.medium.com/max/720/1*FLolUnVUjqV0EGm3CYBPLw.png"> <br><br>

Looking at it again in X-Y, we have managed to separate the data quite well.

<img src="https://miro.medium.com/max/720/1*NN5VCpVg9gPCLYrDl0YFYw.png"> <br><br>
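
The z = x² + y² transformation described above can be tried directly. The sketch below assumes synthetic concentric-circle data from `make_circles` (not the data in the images) and compares a linear SVM on the raw 2D features, a linear SVM on the features with the extra `z` column added by hand, and an RBF-kernel SVM, which handles this kind of mapping implicitly.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by a straight line in X-Y
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Linear SVM on the original 2D features
acc_2d = SVC(kernel="linear").fit(X, y).score(X, y)

# Add the third feature z = x^2 + y^2 by hand, then fit a linear SVM in 3D
z = (X ** 2).sum(axis=1, keepdims=True)
X_lifted = np.hstack([X, z])
acc_3d = SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y)

# An RBF kernel works on the original 2D features without an explicit transform
acc_rbf = SVC(kernel="rbf").fit(X, y).score(X, y)

print(f"linear, 2D: {acc_2d:.2f}")
print(f"linear, 3D (with z): {acc_3d:.2f}")
print(f"rbf, 2D: {acc_rbf:.2f}")
```

In practice you rarely build the extra feature yourself. Choosing a kernel via the `kernel` parameter of `SVC` lets the model work in the transformed space without constructing it explicitly.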

# Example SVM

As an example, we'll use an SVM to predict diabetes using the Pima Diabetes dataset. Load and view the data in the cells below:

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

## Loading the Data

In [None]:
url = "https://raw.githubusercontent.com/BeaverWorksMedlytics2020/Data_Public/master/NotebookExampleData/Week1/diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)

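# Zeros in these columns are treated as missing values
invalid = ['plas', 'pres', 'skin', 'test', 'mass']

# Replace the zeros with NaN (assigning back avoids chained in-place modification)
data[invalid] = data[invalid].replace(0, np.nan)

# Drop the rows that contain NaN and renumber the index
data = data.dropna(axis=0).reset_index(drop=True)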

data.head()

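|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|------|------|------|------|------|------|------|-----|-------|
| 0 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 1 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |
| 2 | 3 | 78.0 | 50.0 | 32.0 | 88.0 | 31.0 | 0.248 | 26 | 1 |
| 3 | 2 | 197.0 | 70.0 | 45.0 | 543.0 | 30.5 | 0.158 | 53 | 1 |
| 4 | 1 | 189.0 | 60.0 | 23.0 | 846.0 | 30.1 | 0.398 | 59 | 1 |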


## Splitting Data into Training, Validation, and Testing

In [None]:
X_cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

y_col = 'class'

test_size = 0.2
X_train, X_test, y_train, y_test = train_test_split(data[X_cols], data[y_col], test_size=test_size, random_state=0)

# Further split the training data into training and validation sets
# (overall this gives roughly a 64% / 16% / 20% train / validation / test split)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=test_size, random_state=0)

## Building the Model
Next, we create a model using SVC and fit it to the training data.

In [None]:
# Creating a model with sklearn's SVC
svm = SVC(gamma=.1, C=2)

# Training/fitting a model with training data
svm.fit(X_train, y_train)

SVC(C=2, gamma=0.1)

## Evaluation

In [None]:
# Printing accuracy of training and validation data
y_train_pred = svm.predict(X_train)
print("Training Accuracy is ", accuracy_score(y_train, y_train_pred) * 100)
y_val_pred = svm.predict(X_val)
print("Validation Accuracy is ", accuracy_score(y_val, y_val_pred) * 100)

Training Accuracy is  100.0
Validation Accuracy is  63.49206349206349


As you can see above, despite achieving a training accuracy of 100%, the validation accuracy is only about 63%. This gap suggests that the model has been **over-fit**! In general, if your training accuracy reaches 100% while the validation accuracy lags far behind, you've most likely over-fit your model.

Play around with the parameters to try to balance out the accuracies. You can start with the ones we've mentioned above, but look through the documentation for more options!
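
One systematic way to play with the parameters is to loop over a few values of `C` and `gamma` and keep the combination with the best validation accuracy. The sketch below is only a starting point: `search_svc_params` is a hypothetical helper written for this example (not an sklearn function), the candidate values are arbitrary, and the data is a synthetic stand-in so the snippet runs on its own. To use it on the diabetes data, pass in the notebook's `X_train`, `y_train`, `X_val`, and `y_val` instead.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def search_svc_params(X_train, y_train, X_val, y_val, Cs, gammas):
    """Hypothetical helper: try every (C, gamma) pair and score it on the validation set."""
    all_scores = {}
    for C, gamma in product(Cs, gammas):
        model = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
        all_scores[(C, gamma)] = model.score(X_val, y_val)
    # Pick the pair with the highest validation accuracy
    best = max(all_scores, key=all_scores.get)
    return {"C": best[0], "gamma": best[1]}, all_scores[best], all_scores


# Synthetic stand-in data with 8 features (NOT the diabetes dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

best_params, best_acc, all_scores = search_svc_params(
    X_tr, y_tr, X_va, y_va, Cs=(0.1, 1, 10), gammas=(0.001, 0.01, 0.1)
)
print("Best parameters:", best_params)
print("Best validation accuracy:", best_acc)
```

Only the training and validation sets are used here, so the testing data stays untouched. sklearn also provides `GridSearchCV` in `sklearn.model_selection`, which performs a similar search using cross-validation.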

Once you feel like your model's at a good place, you can do one last evaluation using the **testing data**. Don't forget, your testing data should never be used to change your model and is reserved for one last evaluation.

In [None]:
# Final evaluation on the held-out testing data
y_test_pred = svm.predict(X_test)

print("Training Accuracy is ", accuracy_score(y_train, y_train_pred) * 100)
print("Validation Accuracy is ", accuracy_score(y_val, y_val_pred) * 100)
print("Testing Accuracy is ", accuracy_score(y_test, y_test_pred) * 100)

Training Accuracy is  100.0
Validation Accuracy is  63.49206349206349
Testing Accuracy is  68.35443037974683


# Conclusion
Pros of SVM
- SVM works well with small data sets with many attributes
- SVM models are memory-efficient and fast at prediction time, as the decision function only depends on the support vectors

Cons of SVM
- Training time grows quickly with the number of samples, so SVMs are not well suited to larger data sets
- SVM is less effective on "noisier" datasets with overlapping classes
- Results are very dependent on parameters, which can be hard to tune on small data sets

## Resource

Lesson adapted from https://medium.com/machine-learning-101
> This is a really great series of articles on introductory machine learning. Take a look if you feel like you need additional clarification.
