# Support Vector Machine

**Basic Description**
- The goal of an SVM is to find a hyperplane that separates the classes in the training data
- It aims to maximize the margin, which is the minimum distance from the decision boundary to any training point. The training points closest to the hyperplane are called support vectors
- Decision boundaries can be nonlinear. SVMs achieve this with kernels, which implicitly map the data into a higher-dimensional space where a hyperplane can separate the classes more cleanly
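
To make the kernel idea concrete, here is a minimal sketch on synthetic data. The concentric-ring dataset from `make_circles` and all variable names are assumptions chosen for illustration; none of this uses the project data.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy data (assumption): two concentric rings, which no straight line
# can separate in the original 2-D feature space.
X_toy, y_toy = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Same data, two kernels: a linear boundary vs. an RBF boundary.
linear_svm = SVC(kernel="linear").fit(X_toy, y_toy)
rbf_svm = SVC(kernel="rbf").fit(X_toy, y_toy)

linear_acc = linear_svm.score(X_toy, y_toy)
rbf_acc = rbf_svm.score(X_toy, y_toy)
print(f"linear kernel training accuracy: {linear_acc:.3f}")
print(f"rbf kernel training accuracy:    {rbf_acc:.3f}")

# Support vectors are the training points that define the boundary;
# n_support_ counts them per class.
print("support vectors per class (rbf):", rbf_svm.n_support_)
```

The RBF model should fit the rings far better than the linear one, and only a subset of the training points ends up as support vectors.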

**Bias-Variance Tradeoff**
- Greater model complexity (for example a larger `C`, or a larger `gamma` with the RBF kernel) lowers bias but raises variance
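
A rough way to see this tradeoff is to vary `gamma` on a small synthetic problem and compare training accuracy with held-out accuracy. The `make_moons` data, the split, and the specific `gamma` values below are assumptions for illustration only.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy toy data (assumption), split into a training and a validation part.
X_toy, y_toy = make_moons(n_samples=600, noise=0.3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

# Small gamma gives a smooth, simple boundary; large gamma gives a very
# flexible boundary that can follow individual training points.
results = []
for gamma in (0.01, 1.0, 100.0):
    model = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_tr, y_tr)
    results.append((gamma, model.score(X_tr, y_tr), model.score(X_val, y_val)))

for gamma, train_acc, val_acc in results:
    gap = train_acc - val_acc
    print(f"gamma={gamma:<6} train={train_acc:.3f} val={val_acc:.3f} gap={gap:+.3f}")
```

A widening gap between training and validation accuracy as `gamma` grows is the variance side of the tradeoff; low accuracy on both at small `gamma` is the bias side.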

**Upsides**
- Can draw nonlinear decision boundaries in the original feature space

**Downsides**
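- SVMs are computationally expensive to train (fit time for a kernel `SVC` grows at least quadratically with the number of samples), so they may not be appropriate for large datasets. They can also struggle when classes overlap without a clear separation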

**Other Notes**

## Load Packages and Prep Data

In [2]:
# custom utils
from utils import custom
from utils.cf_matrix import make_confusion_matrix

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.svm import SVC

import time

In [3]:
# load data
X_train, y_train, X_test, y_test = custom.load_data()

X_train (62889, 42)
y_train (62889,)
X_test (15723, 42)
y_test (15723,)


## Model 1
- Use default hyperparameters
- Notable
    - C=1.0
    - kernel='rbf'

In [4]:
# start timer
start = time.time()

# fit SVM model
svm_1 = SVC()
svm_1.fit(X_train, y_train)

# end timer
end = time.time()
print(end - start)

36.59624910354614
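
The start/end timer pattern recurs in every cell below. One possible way to factor it out is a small context manager; `timed` and `timings` are hypothetical names introduced here, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, log=None):
    """Time the enclosed block and print the elapsed seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if log is not None:
            log[label] = elapsed  # keep the value for later comparison
        print(f"{label}: {elapsed:.2f} s")

# Stand-in workload so the sketch runs on its own.
timings = {}
with timed("toy workload", timings):
    total = sum(i * i for i in range(100_000))
```

In this notebook the body of the `with` block would be the fit call, for example `with timed("svm_1 fit", timings): svm_1.fit(X_train, y_train)`.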


In [6]:
# start timer
start = time.time()

# cross-validation scoring
svm_1_scores = custom.cv_metrics(svm_1, X_train, y_train)

# end timer
end = time.time()
print(end - start)

660.1521825790405
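
The implementation of `custom.cv_metrics` is not shown in this notebook. As a rough guess at what such a helper might look like, here is a sketch built on scikit-learn's `cross_validate`. The function name `cv_metrics_sketch`, the 5-fold default, and the synthetic data are all assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

def cv_metrics_sketch(model, X, y, cv=5):
    """Hypothetical stand-in for custom.cv_metrics: mean CV score per metric."""
    scoring = ["accuracy", "precision", "recall", "f1"]
    # cross_validate refits a clone of the model once per fold and
    # evaluates every scorer on that single fit.
    results = cross_validate(model, X, y, cv=cv, scoring=scoring)
    means = {name: results[f"test_{name}"].mean() for name in scoring}
    return pd.Series(means).round(3)

# Small synthetic problem (assumption) so the sketch runs quickly.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
demo_scores = cv_metrics_sketch(SVC(), X_demo, y_demo)
print(demo_scores)
```

Because every fold triggers a full refit, cross-validation is expected to take several times longer than a single fit. If a helper calls `cross_val_score` separately for each metric, the number of fits is multiplied again by the number of metrics.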


In [10]:
# show scores
svm_1_scores

accuracy     0.955
precision    0.835
recall       0.403
f1           0.544
dtype: float64
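
High accuracy combined with low recall is consistent with a rare positive class, where most correct predictions are true negatives. The sketch below uses made-up confusion counts (hypothetical, not taken from this dataset) to show the effect, and then checks that the reported F1 is roughly the harmonic mean of the reported precision and recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four reported metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts: 1000 samples, only 60 of them positive.
metrics = classification_metrics(tp=24, fp=5, fn=36, tn=935)
for name, value in metrics.items():
    print(f"{name:<10}{value:.3f}")

# Harmonic mean of the precision and recall reported above.
# Fold-averaged F1 need not match this exactly, but it should be close.
f1_from_reported = 2 * 0.835 * 0.403 / (0.835 + 0.403)
print(f"F1 from reported precision/recall: {f1_from_reported:.3f}")
```

With these counts the classifier misses more than half of the positives yet still scores above 95% accuracy, which is why recall and F1 are the more informative numbers here.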

## Model 2
- Reduce model complexity by removing highly correlated features (absolute correlation above 0.8)
- May also reduce compute time

### Feature Selection

In [7]:
# drop correlated features: for each pair with |r| > 0.8, flag the later column

correlated_features = set()
correlation_matrix = X_train.corr()

# only the lower triangle (j < i) is scanned, so each pair is checked once
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > 0.8:
            colname = correlation_matrix.columns[i]
            correlated_features.add(colname)

print('Number of redundant features:', len(correlated_features))
print('Removed features:', correlated_features)
print()
# keep every column that was not flagged
X_train_selected = X_train.drop(columns=list(correlated_features))
print('Remaining features:', list(X_train_selected.columns))

Number of redundant features: 31
Removed features: {'fwidth', 'rnd_ell_prod', 'ethickness', 'dp', 'fthickness', 'angularity', 'thick_vol_prod', 'sieve', 'fiber_width', 'circularity', 'ellipse_ratio', 'w_l_ratio', 't_l_aspect_ratio', 'area', 'chull_surface_area', 'surface_area', 'flength', 'chull_perimeter', 'compactness', 'elength', 'perimeter', 'ellipticity', 'ewidth', 'roundness', 'l_w_ratio', 'l_t_ratio', 'fiber_length', 'thick_perm_prod', 't_w_ratio', 'concavity', 'chull_area'}

Remaining features: ['da', 'volume', 'sphericity', 'solidity', 'convexity', 'extent', 'transparency', 'curvature', 'w_t_ratio', 'krumbein_rnd', 'thick_trans_prod']
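
The nested loop above can also be written with a triangular mask. The sketch below is one possible vectorized version, shown on a tiny made-up frame. `correlated_columns` and the demo column names are hypothetical. It also drops the same columns from a held-out split, which would be needed before scoring a reduced model on test data.

```python
import numpy as np
import pandas as pd

def correlated_columns(df, threshold=0.8):
    """Columns whose |correlation| with any earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Strict lower triangle: row i only keeps entries for columns j < i,
    # so the later column of each correlated pair is the one flagged.
    lower_mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    lower = corr.where(lower_mask)
    return [col for col in lower.index if (lower.loc[col] > threshold).any()]

# Made-up data: "a_scaled" is nearly a multiple of "a"; "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_scaled": 2 * a + 0.01 * rng.normal(size=200),
    "b": rng.normal(size=200),
})

# Decide which columns to drop on the training rows only,
# then apply the same drop to the held-out rows.
demo_train, demo_test = demo.iloc[:150], demo.iloc[150:]
to_drop = correlated_columns(demo_train)
demo_train_selected = demo_train.drop(columns=to_drop)
demo_test_selected = demo_test.drop(columns=to_drop)
print(to_drop)
```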


### Fit Model

In [8]:
# start timer
start = time.time()

# fit SVM model
svm_2 = SVC()
svm_2.fit(X_train_selected, y_train)

# end timer
end = time.time()
print(end - start)

33.242859840393066


### Cross-Validation

In [9]:
# start timer
start = time.time()

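# cross-validation scoring on the reduced feature set
# (the run recorded below passed the full X_train, so its scores match Model 1)
svm_2_scores = custom.cv_metrics(svm_2, X_train_selected, y_train)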

# end timer
end = time.time()
print(end - start)

646.38356590271


In [11]:
# show scores
svm_2_scores

accuracy     0.955
precision    0.835
recall       0.403
f1           0.544
dtype: float64

## Conclusion
- SVMs are not a good candidate model here: cross-validation takes about 11 minutes per model, and recall is low (0.403) despite high accuracy (0.955). Time is better spent tuning other models for now.