# Support Vector Machine

Support vector machines (SVMs) are supervised learning algorithms used for both classification and regression.
They are discriminative classifiers: rather than modeling each class, they find a line or curve (in two dimensions) or a manifold (in higher dimensions) that divides the classes from each other.

With a linear kernel, datapoints from different classes are separated by a line (a hyperplane in higher dimensions) surrounded by a margin.
The margin is widened until it "touches" some datapoints.
These datapoints are called "support vectors", and they are the only datapoints that determine the decision boundary and hence future predictions. Datapoints that lie outside the margin do not influence the prediction.
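
The claim that only the support vectors matter can be checked on toy data. The following is a minimal sketch, assuming synthetic, linearly separable data from `make_blobs` (not the Titanic data used in this notebook) and a very large `C` to approximate a hard margin. A model refit on the support vectors alone should give the same predictions as the model fit on all points.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy data: two well-separated clusters (an assumption for illustration)
X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.60)

# A very large C approximates a hard margin
full = SVC(kernel='linear', C=1e10).fit(X, y)

# Indices of the training points that ended up as support vectors
sv_idx = full.support_
print('support vectors: {} of {} points'.format(len(sv_idx), len(X)))

# Refit using only the support vectors and compare predictions
reduced = SVC(kernel='linear', C=1e10).fit(X[sv_idx], y[sv_idx])
same = np.array_equal(full.predict(X), reduced.predict(X))
print('same predictions on all points: {}'.format(same))
```

Comparing predictions rather than raw coefficients keeps the check robust to the small numerical differences the solver may introduce between the two fits.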

The `C` parameter determines how tolerant the margin is of datapoints that fall inside it. The smaller `C` is, the more tolerant the margin.
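
The effect of `C` can be made visible by counting support vectors. This is a rough sketch on synthetic, overlapping clusters (the data and the two `C` values are assumptions chosen for illustration): a small `C` generally leaves more points on or inside the margin, which shows up as a larger number of support vectors.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy data: two overlapping clusters (an assumption for illustration)
X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.8)

# Fit once with a tolerant margin (small C) and once with a strict one (large C)
n_support = {}
for C in (0.01, 100):
    model = SVC(kernel='linear', C=C).fit(X, y)
    n_support[C] = len(model.support_)
    print('C={:<6} support vectors: {}'.format(C, n_support[C]))
```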

**References**
* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html)
* [An Idiot's guide to Support vector machines (SVMs)](http://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf)

## Load Data

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from sklearn import svm
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
import matplotlib.pyplot as plt
import pickle

# Environment settings
data_path = 'Data/'

# Deserialize previously saved data from "preprocessing"
with open(data_path+'train_pp.obj', 'rb') as train_pp, \
     open(data_path+'test_pp.obj', 'rb') as test_pp:
    df_train = pickle.load(train_pp)
    df_test = pickle.load(test_pp)

df_train.head(10)

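| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_Q | Embarked_S | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | False | 3 | 0 | 1 | 1 | 0 | 0.014151 | 0 | 1 | 6 |
| 1 | 2 | True | 1 | 1 | 4 | 1 | 0 | 0.139136 | 0 | 0 | 1 |
| 2 | 3 | True | 3 | 1 | 2 | 0 | 0 | 0.015469 | 0 | 1 | 2 |
| 3 | 4 | True | 1 | 1 | 3 | 1 | 0 | 0.103644 | 0 | 1 | 1 |
| 4 | 5 | False | 3 | 0 | 3 | 0 | 0 | 0.015713 | 0 | 1 | 6 |
| 5 | 6 | False | 3 | 0 | 1 | 0 | 0 | 0.01651 | 1 | 0 | 6 |
| 6 | 7 | False | 1 | 0 | 5 | 0 | 0 | 0.101229 | 0 | 1 | 6 |
| 7 | 8 | False | 3 | 0 | 0 | 3 | 1 | 0.041136 | 0 | 1 | 3 |
| 8 | 9 | True | 3 | 1 | 2 | 0 | 2 | 0.021731 | 0 | 1 | 1 |
| 9 | 10 | True | 2 | 1 | 0 | 1 | 0 | 0.058694 | 0 | 0 | 1 |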


## Data processing and model training

In [2]:
# Preprocessing
dv_train_X = df_train.drop(['PassengerId','Survived'], axis=1).values
dv_train_y = df_train['Survived'].values

In [3]:
# Split into training and held-out sets (stratified on the target)
X_train, X_test, y_train, y_test = train_test_split(
    dv_train_X, dv_train_y, test_size=0.25, random_state=1, stratify=dv_train_y)

In [4]:
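# Grid search over kernel, C and gamma (gamma is ignored by the linear kernel).
# Note: the search is fit on the full training data, which includes X_test,
# so the test-set score further down is not fully independent of model selection.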
param_grid = {
    'kernel': ['linear', 'rbf', 'sigmoid'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'gamma' : [0.001, 0.01, 0.1, 1]
}

grid_svc = GridSearchCV(svm.SVC(), param_grid, cv=10, scoring='accuracy')
grid_svc.fit(dv_train_X, dv_train_y)

print('Best score: {}'.format(grid_svc.best_score_))
print('Best parameters: {}'.format(grid_svc.best_params_))

Best score: 0.8372615039281706
Best parameters: {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}


In [5]:
# Model training
svc = svm.SVC(**grid_svc.best_params_).fit(X_train, y_train)
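
One way to keep the held-out split out of model selection is to run the search on the training split only. The following is a sketch under stated assumptions, not the method used in this notebook: `make_classification` provides synthetic stand-in data, and the grid is deliberately smaller than the one in cell 4 so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed features (an assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

# Reduced grid, only to keep the sketch fast
grid = GridSearchCV(
    SVC(),
    {'kernel': ['linear', 'rbf'], 'C': [0.1, 1, 10], 'gamma': [0.01, 0.1]},
    cv=5, scoring='accuracy')
grid.fit(X_tr, y_tr)  # the held-out split is never seen by the search

# With the default refit=True, best_estimator_ is already refit on X_tr
held_out_accuracy = grid.best_estimator_.score(X_te, y_te)
print('Best parameters: {}'.format(grid.best_params_))
print('Held-out accuracy: {:.2f}'.format(held_out_accuracy))
```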

## Model parameters

In [6]:
# Feature importance (svc.coef_ is only available for linear kernels)
try:
    f_name = df_train.drop(['PassengerId','Survived'], axis=1).columns.values
    f_score = -svc.coef_[0].round(2)
    
    print('{:<10}{:16}{:>10}'.format('RANK', 'FEATURE', 'SCORE'))
    for i, f in enumerate(sorted(zip(f_name, f_score), key=lambda x: x[1], reverse=True)):
        print('{:<10}{:16}{:10}'.format(i+1, f[0], f[1]))
    
except AttributeError:
    print('non-linear kernels are not supported')

non-linear kernels are not supported
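
For non-linear kernels, a different technique, permutation importance, can give a rough feature ranking. It is not equivalent to reading `coef_`: it shuffles one feature at a time and measures how much the score drops. The sketch below uses `sklearn.inspection.permutation_importance` on synthetic data. The data and the feature names `f0` to `f4` are hypothetical stand-ins, and scoring on the training data is a simplification, since a held-out set would normally be used.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption); f0..f4 are hypothetical feature names
X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
names = ['f{}'.format(i) for i in range(X.shape[1])]

model = SVC(kernel='rbf').fit(X, y)

# Shuffle one column at a time and measure the drop in accuracy
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Rank features by mean importance, highest first
ranking = sorted(zip(names, result.importances_mean),
                 key=lambda x: x[1], reverse=True)
print('{:<10}{:16}{:>10}'.format('RANK', 'FEATURE', 'SCORE'))
for i, (name, score) in enumerate(ranking):
    print('{:<10}{:16}{:>10.3f}'.format(i + 1, name, score))
```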


In [7]:
# Number of support vectors
print('number of support vectors: {}'.format(len(svc.support_vectors_)))

number of support vectors: 264


## Score

In [8]:
# Test set score (svc.score returns a single accuracy value, so there is no spread to report)
testset_score = svc.score(X_test, y_test)
print('Accuracy with test set: {}'.format(round(testset_score, 2)))

Accuracy with test set: 0.85


In [9]:
# Cross-validation score (the estimator is cloned and refit on each fold)
cv_iterations = 10
cv_score = cross_val_score(svc, dv_train_X, dv_train_y, cv=cv_iterations)
print('Accuracy with cross-validation (folds = {}): {} (+/- {})'
      .format(cv_iterations, round(cv_score.mean(),2), round(cv_score.std() * 2,2)))

Accuracy with cross-validation (folds = 10): 0.84 (+/- 0.07)


## Test set prediction

In [10]:
# Prediction on test set
dv_test_X = df_test.drop(['PassengerId'], axis=1).values

test_prediction_results = pd.DataFrame(
    data={'PassengerId': df_test['PassengerId'].values,
          'Survived': svc.predict(dv_test_X).astype(int)})

# Write results to a csv file
test_prediction_results.to_csv(data_path+'outputs/support-vector-machine.csv', index=False)