<a href="https://colab.research.google.com/github/vu-topics-in-big-data-2021/examples/blob/main/example-ml/svm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Support Vector Machines

### About this Lesson
In this lesson, we will learn about and implement a support vector machine for both classification and regression using scikit-learn.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Import Data

For this lesson, we will use two datasets. The first is the Wisconsin breast cancer dataset, which will be used to classify tumors as benign or malignant. The second is a slightly modified version of the Auto MPG dataset, which will be used to predict fuel consumption. Both datasets are originally from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/).

In [None]:
cancer_data = pd.read_csv("https://raw.githubusercontent.com/vu-topics-in-big-data-2021/examples/main/example-ml/datasets/classification/wisconsin_cancer.csv")

cancer_y = cancer_data["target"]
cancer_X = cancer_data.drop("target", axis=1)

cancer_X.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [None]:
car_data = pd.read_csv("https://raw.githubusercontent.com/vu-topics-in-big-data-2021/examples/main/example-ml//datasets/regression/auto-mpg.csv")
car_X = car_data[['cylinders', 'horsepower', 'weight']]
car_y = car_data['mpg']

car_X.head()

Unnamed: 0,cylinders,horsepower,weight
0,8,130.0,3504.0
1,8,165.0,3693.0
2,8,150.0,3436.0
3,8,150.0,3433.0
4,8,140.0,3449.0


## Creating a Train and Test Split

One of the most important aspects of creating models is being able to evaluate how well those models work. One way that we often do this in data science is to create a train-test split of the data. This process involves dividing (usually randomly) the full dataset into two mutually exclusive sets: one set that will be used to train the model and one set that will be used to test the model. Because the model never sees the test set during training, its performance on that set gives a more honest estimate of how it will perform on new data.

We can use the scikit-learn library to divide the dataset into two parts. There are many different ways that we can split the data, but a common way is to use 80% of the data for training and 20% for testing, so that is what we will do here. In this case, we will also set what is called a random seed to ensure that everyone in the class gets the same train-test split.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_cancer_train, X_cancer_test, y_cancer_train, y_cancer_test = train_test_split(cancer_X, cancer_y, test_size=0.2, random_state=1)
X_car_train, X_car_test, y_car_train, y_car_test = train_test_split(car_X, car_y, test_size=0.2, random_state=1)
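
As a quick check of what the split does, here is a minimal sketch on a small made-up array (the `_toy` names and values are hypothetical and are not part of the lesson's datasets). It splits the same data twice with the same seed and confirms that the two splits are identical, that 20% of the rows end up in the test set, and that no row appears in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each (made-up values).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Split twice using the same random seed.
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

print("Train shape:", X_tr1.shape)
print("Test shape:", X_te1.shape)
print("Same test rows both times:", np.array_equal(y_te1, y_te2))
print("Rows in both train and test:", set(y_tr1) & set(y_te1))
```

Changing `random_state` to a different integer would produce a different, but equally reproducible, split.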

## SVM Classification

Now we are ready to build the model. A support vector machine works by drawing a boundary (a hyperplane) between the classes such that the distance from the boundary to the closest training points of each class, called the margin, is maximized. Those closest points are the support vectors, and the resulting hyperplane is the maximum-margin decision boundary between the classes. The scikit-learn library has a built-in support vector classifier, so we'll use that to create our model from the training data.
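
To make the idea of a widest-possible boundary concrete, here is a small sketch on four made-up points (the `toy_` names are hypothetical and separate from the lesson's data). Two points of one class sit at x = 0 and two of the other at x = 2, so the widest gap a straight boundary can leave between the classes is 2. For a linear SVM with weight vector `w`, that gap is `2 / ||w||`; a very large `C` is assumed here so that the fit behaves approximately like a hard-margin SVM.

```python
import numpy as np
from sklearn.svm import SVC

# Four made-up points: class 0 on the line x = 0, class 1 on the line x = 2.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM (assumption for this sketch).
toy_clf = SVC(kernel="linear", C=1e6)
toy_clf.fit(X_toy, y_toy)

# For a linear kernel, coef_ holds the weight vector w of the boundary.
w = toy_clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("Support vectors:\n", toy_clf.support_vectors_)
print("Margin width: %.2f" % margin_width)
```

The printed margin width should come out close to 2, the horizontal gap between the two classes.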

In [None]:
from sklearn.svm import SVC

Support vector machines take a parameter known as a kernel, which is used during both training and prediction. The kernel is a function that measures the similarity between pairs of points as if the data had been mapped into a higher-dimensional space, without computing that mapping explicitly. There are several different kernel functions that can be used, but the simplest is the *linear* kernel, which does not transform the features at all and instead draws a linear separating boundary in the original feature space. We will use the linear kernel function for this dataset.

In [None]:
clf = SVC(kernel='linear')
clf.fit(X_cancer_train, y_cancer_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
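
As an aside on the choice of kernel, the sketch below compares the linear kernel with scikit-learn's `rbf` kernel on a synthetic dataset of two concentric circles, generated with `make_circles`. This data is made up for illustration and is not part of the lesson's datasets. No straight line can separate the two circles, so a non-linear kernel would be expected to do better here.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: an inner circle (one class) inside an outer circle (the other class).
X_c, y_c = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_c_train, X_c_test, y_c_train, y_c_test = train_test_split(
    X_c, y_c, test_size=0.2, random_state=1
)

# Fit one classifier per kernel and record its test accuracy.
kernel_scores = {}
for kernel in ("linear", "rbf"):
    model = SVC(kernel=kernel)
    model.fit(X_c_train, y_c_train)
    kernel_scores[kernel] = model.score(X_c_test, y_c_test)
    print("%s kernel accuracy: %.2f" % (kernel, kernel_scores[kernel]))
```

Which kernel works best depends on the data, so this comparison says nothing about which one suits the cancer dataset.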

Now that the classifier is trained, we want to use the testing data to evaluate its performance. When evaluating a binary classifier such as this one, there are many metrics available. Some of the most commonly used and easily interpretable are accuracy, precision, recall, and F1 score.

Accuracy measures the fraction of test data that was correctly labeled. The closer the accuracy is to 1, the better the results. In this sense, it is very easy to interpret. For accuracy to be a good measure of performance, however, the classes must contain approximately equal numbers of points. Otherwise, a model that always predicts the majority class can achieve a high accuracy while never identifying the minority class.
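
The weakness of accuracy on unbalanced data can be sketched with made-up labels (the `_imb` names and the 95/5 class ratio are hypothetical). Here a "model" that always predicts class 0 is scored against labels that are 95% class 0.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 95 points of class 0 and only 5 points of class 1.
y_true_imb = np.array([0] * 95 + [1] * 5)

# A trivial "model" that predicts class 0 for every point.
y_pred_imb = np.zeros_like(y_true_imb)

imb_accuracy = accuracy_score(y_true_imb, y_pred_imb)

# Count how many of the class-1 points were actually labeled as class 1.
found_minority = int(np.sum((y_true_imb == 1) & (y_pred_imb == 1)))

print("Accuracy: %.2f" % imb_accuracy)
print("Class-1 points found: %d of 5" % found_minority)
```

The accuracy is 0.95 even though none of the five class-1 points were found.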

Precision and recall are two metrics designed to address the problem of unbalanced datasets described above. Recall measures the ability of the model to find all of the relevant cases in the dataset. Precision measures the ability of the model to find only the relevant cases in the dataset. For our case, treating malignant as the class of interest, a high recall would mean that out of all of the malignant tumors, we correctly labeled most of them, while a high precision would mean that out of the tumors that we labeled as malignant, most actually were malignant. Note that scikit-learn's `precision_score` and `recall_score` treat the label `1` as the class of interest by default (`pos_label=1`), so it is worth checking which class is encoded as `1` in the `target` column.

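F1 score is the harmonic mean of precision and recall, which summarizes the two in a single number. Because the harmonic mean is dominated by the smaller of the two values, a classifier needs both high precision and high recall to achieve a high F1 score. This makes it more informative than accuracy when the data is unbalanced.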
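
The definitions above can be checked by hand on a small made-up example (the `_ex` arrays are hypothetical, with `1` standing for the class of interest). The sketch counts true positives, false positives, and false negatives, computes the three metrics from those counts, and compares them with scikit-learn's results.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up true labels and predictions; 1 is the class of interest.
y_true_ex = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_ex = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_true_ex == 1) & (y_pred_ex == 1))  # true positives
fp = np.sum((y_true_ex == 0) & (y_pred_ex == 1))  # false positives
fn = np.sum((y_true_ex == 1) & (y_pred_ex == 0))  # false negatives

precision = tp / (tp + fp)  # of everything labeled 1, how much really was 1
recall = tp / (tp + fn)     # of everything that really was 1, how much we found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("By hand: precision=%.2f recall=%.2f f1=%.2f" % (precision, recall, f1))
print("sklearn: precision=%.2f recall=%.2f f1=%.2f" % (
    precision_score(y_true_ex, y_pred_ex),
    recall_score(y_true_ex, y_pred_ex),
    f1_score(y_true_ex, y_pred_ex),
))
```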

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [None]:
predictions = clf.predict(X_cancer_test)

print("Accuracy Score:\t %.2f" % accuracy_score(y_cancer_test, predictions))
print("Precision Score: %.2f" % precision_score(y_cancer_test, predictions))
print("Recall Score:\t %.2f" % recall_score(y_cancer_test, predictions))
print("F1 Score:\t %.2f" % f1_score(y_cancer_test, predictions))

Accuracy Score:	 0.96
Precision Score: 0.94
Recall Score:	 1.00
F1 Score:	 0.97


All four metrics are at or above 0.94 on the test set, so the SVM classifier performed very well.

## SVM Regression

Support vector machines are very versatile: they can be used not only for classification but also for regression. For regression, instead of finding a plane that maximizes the margin between classes, the model fits a function so that as many data points as possible lie within a small distance of it (set by the `epsilon` parameter). Only points that fall outside this band contribute to the error. Once again, scikit-learn can be used to train a support vector regression model.
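
The error that support vector regression penalizes is usually described as an epsilon-insensitive loss: a prediction within `epsilon` of the true value costs nothing, and beyond that the cost grows linearly. The helper below is a hypothetical illustration of that idea written with NumPy. It is not a scikit-learn function, and the values are made up.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-point loss: zero inside the epsilon band, linear outside it."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, residual - epsilon)

# Made-up targets and predictions with errors of 0.05, 0.3, and 1.0.
y_true_ex = np.array([20.0, 25.0, 30.0])
y_pred_ex = np.array([20.05, 24.7, 31.0])

losses = epsilon_insensitive_loss(y_true_ex, y_pred_ex)
print(np.round(losses, 2))
```

The first prediction falls inside the band and contributes no loss, while the other two are penalized only for the part of their error that exceeds `epsilon`.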

In [None]:
from sklearn.svm import SVR

Once again, the support vector machine takes a parameter which represents the kernel function. To start, let's use a linear kernel, meaning that the model will fit a linear function of the features.

In [None]:
linear_model = SVR(kernel='linear')
linear_model.fit(X_car_train, y_car_train)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='linear', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

Now to evaluate the model, we can use several statistical metrics and the testing data that we generated earlier. There are many different metrics for evaluating regression models, but three of the most popular and easy to interpret are mean absolute error, mean squared error, and the coefficient of determination ($R^2$). We will use these three metrics from the scikit-learn library to evaluate our model.

Mean absolute error measures the absolute value of the difference between the model's result and the actual value of each tested point, and averages these errors. It is very easy to interpret because it is basically just the average error and can be interpreted in the same units as the target variable.

Mean squared error measures the square of the difference between the model's result and the actual value of each tested point, and averages these squared errors. Mean squared error is popular because it penalizes large errors more than small ones, so it can sometimes give a better sense of how many really big errors are being committed. Note that it is expressed in the squared units of the target variable.

The coefficient of determination, also called the $R^2$ value, measures the proportion of the variance in the target variable that is explained by the model's predictions. It is related to, but not the same as, the Pearson correlation coefficient. The best possible value is 1.0, which indicates that the predictions match the data perfectly. A model that always predicts the mean of the target scores 0, and the value can be negative if the model does worse than that.
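
To connect these definitions to the library functions, the sketch below computes all three metrics by hand with NumPy on a few made-up values (the `_ex` arrays are hypothetical) and compares them with scikit-learn's results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions.
y_true_ex = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_ex = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true_ex - y_pred_ex

mae = np.mean(np.abs(errors))  # average size of the error
mse = np.mean(errors ** 2)     # average squared error

# R^2 = 1 - (sum of squared errors) / (total variation around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_ex - np.mean(y_true_ex)) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("By hand: MAE=%.3f MSE=%.3f R2=%.3f" % (mae, mse, r2))
print("sklearn: MAE=%.3f MSE=%.3f R2=%.3f" % (
    mean_absolute_error(y_true_ex, y_pred_ex),
    mean_squared_error(y_true_ex, y_pred_ex),
    r2_score(y_true_ex, y_pred_ex),
))
```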

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [None]:
predictions = linear_model.predict(X_car_test)

print("Mean absolute error: %.2f" % mean_absolute_error(y_car_test, predictions))
print("Mean squared error: %.2f" % mean_squared_error(y_car_test, predictions))
print("R2 Score: %.2f" % r2_score(y_car_test, predictions))

Mean absolute error: 3.42
Mean squared error: 21.14
R2 Score: 0.70
