# Support Vector Machine

Support vector machines (SVMs) are supervised learning algorithms for both classification and regression.
They are discriminative classifiers: rather than modeling each class, they find a line or curve (in two dimensions) or a manifold (in higher dimensions) that divides the classes from each other.

With a linear kernel, the SVM separates datapoints from different classes with a straight line (a hyperplane in higher dimensions) that has a margin on either side.
The margin is widened until it "touches" some datapoints.
These datapoints are called "support vectors", and they are the only datapoints that determine the decision boundary and therefore future predictions. Datapoints that lie outside the margin do not influence the prediction.
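
As a minimal sketch of this property, the cell below fits a linear SVM on synthetic, linearly separable toy data (generated with `make_blobs`, an assumption for illustration, not the Titanic data used in this notebook) and then refits it using only the support vectors. If the claim holds, both models should label every point the same way.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Synthetic, linearly separable toy data (illustrative assumption)
X, y = make_blobs(n_samples=50, centers=2, cluster_std=0.6, random_state=0)

# A very large C approximates a hard margin
full = svm.SVC(kernel='linear', C=1e10).fit(X, y)

# Refit using only the support vectors found by the first model
sv_idx = full.support_
reduced = svm.SVC(kernel='linear', C=1e10).fit(X[sv_idx], y[sv_idx])

same_predictions = np.array_equal(full.predict(X), reduced.predict(X))
print('support vectors used: {} of {} datapoints'.format(len(sv_idx), len(X)))
print('same predictions after refit: {}'.format(same_predictions))
```

The indices of the support vectors are read from `support_`, and the points themselves are also available as `support_vectors_`.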

The `C` parameter determines how tolerant the margin is of datapoints that fall inside it. The smaller `C` is, the more tolerant the margin.
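
The effect of `C` can be sketched on synthetic, overlapping toy data (again generated with `make_blobs`, an assumption for illustration). The margin width is computed as `2 / ||w||`, which applies to the linear kernel only. A smaller `C` is expected to give a wider margin and more support vectors.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Synthetic, overlapping toy data (illustrative assumption)
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.2, random_state=0)

results = {}
for C in (0.01, 1.0, 100.0):
    model = svm.SVC(kernel='linear', C=C).fit(X, y)
    # Distance between the two margin lines of a linear SVM
    margin_width = 2.0 / np.linalg.norm(model.coef_[0])
    results[C] = (len(model.support_), margin_width)
    print('C={:<8} support vectors={:<5} margin width={:.2f}'.format(C, *results[C]))
```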

**References**
* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html)
* [An Idiot's guide to Support vector machines (SVMs)](http://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf)

## Load Data

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from sklearn import svm
from sklearn.model_selection import train_test_split, cross_val_score
import matplotlib.pyplot as plt
import pickle

# Environment settings
data_path = 'Data/'

# Deserialize previously saved data from "preprocessing"
with open(data_path+'train_pp.obj', 'rb') as train_pp, \
     open(data_path+'test_pp.obj', 'rb') as test_pp:
    df_train = pickle.load(train_pp)
    df_test = pickle.load(test_pp)

## Data processing and model training

In [2]:
# Experimental features
df_train['FamilySize'] = df_train['SibSp'] + df_train['Parch'] + 1
df_test['FamilySize'] = df_test['SibSp'] + df_test['Parch'] + 1

df_train = df_train.drop(['SibSp', 'Parch', 'Fare'], axis=1)
df_test = df_test.drop(['SibSp', 'Parch', 'Fare'], axis=1)

In [3]:
# Preprocessing
dv_train_X = df_train.drop(['PassengerId','Survived'], axis=1).values
dv_train_y = df_train['Survived'].values

In [4]:
# Prepare training set
X_train, X_test, y_train, y_test = train_test_split(
    dv_train_X, dv_train_y, test_size=0.25, random_state=1, stratify=dv_train_y)
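
The `stratify` argument keeps the class proportions the same in both parts of the split. A minimal sketch on a made-up label array (60 zeros and 40 ones, an assumption for illustration) shows the idea:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 60% class 0, 40% class 1 (illustrative assumption)
y = np.array([0] * 60 + [1] * 40)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

# Share of class 1 in each part of the split
train_ratio = y_tr.mean()
test_ratio = y_te.mean()
print('class 1 share in train: {:.2f}, in test: {:.2f}'.format(train_ratio, test_ratio))
```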

In [5]:
# Model training
svc_params = {
    'kernel': 'linear',  # kernel type
    'C': 100.0,  # regularization parameter (larger C = less tolerant margin)
}

svc = svm.SVC(**svc_params).fit(X_train, y_train)

## Model parameters

In [6]:
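# Feature weights (coef_ is only available with a linear kernel).
# Signs are negated and values rounded to 2 decimals before ranking.
f_name = df_train.drop(['PassengerId','Survived'], axis=1).columns.values
f_score = [-x.round(2) for x in svc.coef_[0]]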

print('{:<10}{:10}{:>10}'.format('RANK', 'FEATURE', 'SCORE'))
for i, f in enumerate(sorted(zip(f_name, f_score), key=lambda x: x[1], reverse=True)):
    print('{:<10}{:10}{:10}'.format(i+1, f[0], f[1]))

RANK      FEATURE        SCORE
1         Sex             0.73
2         Title           0.73
3         Pclass          0.38
4         FamilySize      0.34
5         Embarked_S      0.23
6         Embarked_Q      0.14
7         Age             0.01
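
To make the meaning of these weights concrete, the sketch below (on synthetic `make_blobs` data, an assumption for illustration) checks that, for a linear kernel, the decision score is the weighted sum `w · x + b` built from `coef_` and `intercept_`, and that a positive score maps to the second entry of `classes_`. Note that weights are only comparable across features when the features are on similar scales.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Synthetic toy data (illustrative assumption)
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.2, random_state=0)
model = svm.SVC(kernel='linear', C=1.0).fit(X, y)

# Weights and bias of the separating line
w, b = model.coef_[0], model.intercept_[0]

# Decision score computed by hand and by the library
manual_scores = X @ w + b
library_scores = model.decision_function(X)

# A positive score selects classes_[1], a negative one classes_[0]
manual_labels = model.classes_[(manual_scores > 0).astype(int)]
print('scores match: {}'.format(np.allclose(manual_scores, library_scores)))
```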


In [7]:
# Number of support vectors
print('number of support vectors: {}'.format(len(svc.support_vectors_)))

number of support vectors: 279


## Score

In [8]:
# Test set score (accuracy on the held-out 25% split)
testset_score = svc.score(X_test, y_test)
print('Accuracy with test set: {}'.format(round(testset_score, 2)))

Accuracy with test set: 0.86


In [9]:
# Cross-validation score
cv_iterations = 5
cv_score = cross_val_score(svc, dv_train_X, dv_train_y, cv=cv_iterations)
print('Accuracy with cross-validation (folds = {}): {} (+/- {})'
      .format(cv_iterations, round(cv_score.mean(),2), round(cv_score.std() * 2,2)))

Accuracy with cross-validation (folds = 5): 0.83 (+/- 0.04)
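
For reference, here is a sketch of what `cross_val_score` does when given a classifier and an integer `cv`: it splits the data with `StratifiedKFold`, refits a fresh copy of the estimator on each training fold, and scores it on the held-out fold. The data is synthetic (`make_blobs`, an assumption for illustration), and the manual loop is only meant to mirror the library call, not replace it.

```python
import numpy as np
from sklearn import svm
from sklearn.base import clone
from sklearn.datasets import make_blobs
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic toy data (illustrative assumption)
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.2, random_state=0)
model = svm.SVC(kernel='linear', C=1.0)

# Manual cross-validation: one fresh fit and one accuracy value per fold
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# The same computation through the library helper
library_scores = cross_val_score(model, X, y, cv=5)
print('manual : {}'.format(manual_scores.round(2)))
print('library: {}'.format(library_scores.round(2)))
```

The reported "+/-" value is twice the standard deviation of these per-fold scores, so it only makes sense when there is more than one score.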


## Test set prediction

In [10]:
# Prediction on the separate test data (df_test), not the held-out split X_test
dv_test_X = df_test.drop(['PassengerId'], axis=1).values

test_prediction_results = pd.DataFrame(
    data={'PassengerId': df_test['PassengerId'].values,
          'Survived': svc.predict(dv_test_X).astype(int)})

# Write results to a csv file
test_prediction_results.to_csv(data_path+'outputs/support-vector-machine.csv', index=False)