## Support Vector Machines: Fit and evaluate a model

This notebook uses the Titanic dataset from [this Kaggle competition](https://www.kaggle.com/c/titanic/overview).

In this section, we will fit and evaluate a simple Support Vector Machine (SVM) model.

### Read in Data

In [1]:
import warnings

import joblib
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

# Silence library warnings that would otherwise clutter the cell output.
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

# Training set. Adjust the paths to match your own file location.
df_tr = pd.read_csv('titanic_data/train_features.csv')
df_tr_labels = pd.read_csv('titanic_data/train_labels.csv', header=None)

In [2]:
# Validation set. Adjust the paths to match your own file location.
df_val = pd.read_csv('titanic_data/val_features.csv')
df_val_labels = pd.read_csv('titanic_data/val_labels.csv', header=None)

# Test set, held out until the final model has been chosen.
df_test = pd.read_csv('titanic_data/test_features.csv')
df_test_labels = pd.read_csv('titanic_data/test_labels.csv', header=None)

### Cross Validation
![CV](img/CV.png)
![Cross-Val](img/Cross-Val.png)
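
As a small illustration of the idea in the figures, the sketch below runs 3-fold cross-validation with `cross_val_score`. The data here is synthetic (generated with `make_classification`), which is an assumption made only so the snippet runs on its own; it is not the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the Titanic features).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# 3-fold CV: fit on two folds, score on the held-out fold, repeat for each fold.
scores = cross_val_score(SVC(kernel='linear', C=1), X, y, cv=3)

print('fold accuracies:', scores.round(3))
print('mean: {:.3f}, std: {:.3f}'.format(scores.mean(), scores.std()))
```

The mean and standard deviation printed here are the same two quantities that `GridSearchCV` reports per candidate as `mean_test_score` and `std_test_score`.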

### Hyperparameter tuning

![c](img/c.png)

In [3]:
df_tr.head()

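| Unnamed: 0 | Pclass | Sex | Age | Fare | Family_cnt | Cabin_ind |
|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 62.0 | 10.5 | 0 | 0 |
| 1 | 3 | 0 | 8.0 | 29.125 | 5 | 0 |
| 2 | 3 | 0 | 32.0 | 56.4958 | 0 | 0 |
| 3 | 3 | 1 | 20.0 | 9.825 | 1 | 0 |
| 4 | 2 | 1 | 28.0 | 13.0 | 0 | 0 |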


In [4]:
df_tr_labels.value_counts()

0    333
1    201
dtype: int64

In [5]:
def print_results(results):
    """Print the best parameters, then mean CV accuracy (+/- 2 std) for each candidate."""
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{:.3f} (+/-{:.2f}) for {}'.format(round(mean, 3), round(std * 2, 3), params))

### 1. Find the best parameter values of kernel type and C in SVM

In [6]:
svc = SVC()
parameters = {
    'kernel': ['linear', 'rbf'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100]
}

cv = GridSearchCV(svc, parameters, cv=3, n_jobs=-1)
cv.fit(df_tr, df_tr_labels.values.ravel())

print_results(cv)

BEST PARAMS: {'C': 100, 'kernel': 'linear'}

0.672 (+/-0.03) for {'C': 0.001, 'kernel': 'linear'}
0.624 (+/-0.00) for {'C': 0.001, 'kernel': 'rbf'}
0.704 (+/-0.04) for {'C': 0.01, 'kernel': 'linear'}
0.624 (+/-0.00) for {'C': 0.01, 'kernel': 'rbf'}
0.798 (+/-0.12) for {'C': 0.1, 'kernel': 'linear'}
0.648 (+/-0.01) for {'C': 0.1, 'kernel': 'rbf'}
0.798 (+/-0.12) for {'C': 1, 'kernel': 'linear'}
0.646 (+/-0.02) for {'C': 1, 'kernel': 'rbf'}
0.796 (+/-0.11) for {'C': 10, 'kernel': 'linear'}
0.678 (+/-0.04) for {'C': 10, 'kernel': 'rbf'}
0.801 (+/-0.11) for {'C': 100, 'kernel': 'linear'}
0.783 (+/-0.05) for {'C': 100, 'kernel': 'rbf'}
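
When a listing like the one above gets long, it can be easier to read as a sorted table. The sketch below loads `cv_results_` into a pandas DataFrame and sorts by `rank_test_score`. It uses synthetic data and a smaller grid (both assumptions, chosen so the snippet runs on its own), so its numbers will not match the Titanic results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the Titanic features).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

search = GridSearchCV(SVC(), {'kernel': ['linear', 'rbf'], 'C': [0.1, 1, 10]}, cv=3)
search.fit(X, y)

# One row per parameter combination, best-ranked candidate first.
columns = ['param_kernel', 'param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']
results_table = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values('rank_test_score')
    .reset_index(drop=True)
)
print(results_table)
```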


In [7]:
cv.best_estimator_

SVC(C=100, kernel='linear')

In [8]:
cv.best_estimator_.score(df_val, df_val_labels)

0.7528089887640449

What is the model performance when tested on the validation set?

In [9]:
pred = cv.best_estimator_.predict(df_val)

In [10]:
print(metrics.classification_report(df_val_labels, pred))

              precision    recall  f1-score   support

           0       0.79      0.84      0.81       113
           1       0.68      0.60      0.64        65

    accuracy                           0.75       178
   macro avg       0.73      0.72      0.73       178
weighted avg       0.75      0.75      0.75       178
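
To see how the columns of the report relate to each other, the short check below recomputes the class-1 F1 score and the overall accuracy from the rounded precision, recall, and support values printed above. Because the inputs are already rounded to two decimals, the results only agree with the report approximately (roughly 0.64 and 0.75).

```python
# Values copied from the report above (rounded to two decimals there).
precision_1, recall_1 = 0.68, 0.60
recall_0 = 0.84
support_0, support_1 = 113, 65

# F1 is the harmonic mean of precision and recall.
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Accuracy equals the support-weighted average of the per-class recalls.
accuracy = (recall_0 * support_0 + recall_1 * support_1) / (support_0 + support_1)

print('f1 (class 1): {:.4f}'.format(f1_1))
print('accuracy:     {:.4f}'.format(accuracy))
```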



#### Randomized Search

In [11]:
from scipy.stats import uniform

svc = SVC()

parameters = {
    'kernel': ['linear', 'rbf'],
    'C': uniform(loc=0, scale=10)
}

cv = RandomizedSearchCV(svc, parameters, cv=3, n_iter=20, n_jobs=-1)
cv.fit(df_tr, df_tr_labels.values.ravel())

print_results(cv)

BEST PARAMS: {'C': 4.631719300611734, 'kernel': 'linear'}

0.798 (+/-0.12) for {'C': 4.631719300611734, 'kernel': 'linear'}
0.654 (+/-0.02) for {'C': 5.557853341152079, 'kernel': 'rbf'}
0.648 (+/-0.02) for {'C': 2.153025794028304, 'kernel': 'rbf'}
0.655 (+/-0.04) for {'C': 0.2100503276029786, 'kernel': 'rbf'}
0.798 (+/-0.12) for {'C': 1.2452639700196255, 'kernel': 'linear'}
0.798 (+/-0.12) for {'C': 5.601837052007713, 'kernel': 'linear'}
0.796 (+/-0.11) for {'C': 8.525211693816937, 'kernel': 'linear'}
0.674 (+/-0.03) for {'C': 7.955646533932077, 'kernel': 'rbf'}
0.798 (+/-0.12) for {'C': 1.6002852690192926, 'kernel': 'linear'}
0.798 (+/-0.12) for {'C': 5.704511570871247, 'kernel': 'linear'}
0.654 (+/-0.01) for {'C': 4.790495876410203, 'kernel': 'rbf'}
0.646 (+/-0.02) for {'C': 0.9778986682425606, 'kernel': 'rbf'}
0.798 (+/-0.12) for {'C': 1.5285147411971256, 'kernel': 'linear'}
0.674 (+/-0.03) for {'C': 8.13228293636323, 'kernel': 'rbf'}
0.648 (+/-0.02) for {'C': 1.2285094627911641, 'k

In [12]:
cv.best_estimator_

SVC(C=4.631719300611734, kernel='linear')

### 2. Find the best parameter values of kernel type, C, and gamma in SVM

In [13]:
cv.best_estimator_

SVC(C=4.631719300611734, kernel='linear')

In [14]:
cv.best_estimator_.score(df_val, df_val_labels)

0.7471910112359551

What is the model performance when tested on the validation set?

In [15]:
pred = cv.best_estimator_.predict(df_val)

In [16]:
print(metrics.classification_report(df_val_labels, pred))

              precision    recall  f1-score   support

           0       0.78      0.83      0.81       113
           1       0.67      0.60      0.63        65

    accuracy                           0.75       178
   macro avg       0.73      0.72      0.72       178
weighted avg       0.74      0.75      0.74       178
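
Task 2 also asks for `gamma` to be tuned. One possible setup, sketched here on synthetic data (an assumption, so the snippet runs on its own), passes a list of two grids to `GridSearchCV` so that `gamma` is only varied for the `rbf` kernel, where it has an effect. The candidate values for `C` and `gamma` are illustrative choices, not recommendations for the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the Titanic features).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# gamma is ignored by the linear kernel, so it is searched only for rbf.
param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': ['scale', 0.01, 0.1, 1]},
]

gamma_search = GridSearchCV(SVC(), param_grid, cv=3)
gamma_search.fit(X, y)

print('best params:', gamma_search.best_params_)
print('best CV accuracy: {:.3f}'.format(gamma_search.best_score_))
```

On the notebook's data, the same grid could be fitted with `gamma_search.fit(df_tr, df_tr_labels.values.ravel())` and reported with `print_results(gamma_search)`.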



### 3. Choose another model and perform hyperparameter tuning using the train set.   
What is the model performance when tested on the validation set?

### 4. (Optional) You are free to check other models, apply feature engineering, etc. Try new ideas and see if the model performance improves.
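
One idea that may be worth trying is feature scaling: the RBF kernel is distance-based, so a feature with a much larger numeric range can dominate the others. The sketch below compares an unscaled SVC with a `StandardScaler` + SVC pipeline using cross-validation. It runs on synthetic data with one artificially inflated feature (both assumptions); whether scaling helps on the Titanic features has not been checked here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the Titanic features).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X[:, 0] *= 1000  # inflate one feature's range to mimic unscaled inputs

# Same model with and without standardization, scored by 3-fold CV.
raw_scores = cross_val_score(SVC(kernel='rbf'), X, y, cv=3)
scaled_model = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
scaled_scores = cross_val_score(scaled_model, X, y, cv=3)

print('unscaled mean accuracy: {:.3f}'.format(raw_scores.mean()))
print('scaled mean accuracy:   {:.3f}'.format(scaled_scores.mean()))
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so no information from the held-out fold leaks into the scaling.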

### 5. Finally, choose your best model (compared based on validation performance) and test it on the testing set.   
Which model is the best?  
What is the model performance when tested on the testing set? 
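
The selection procedure in this step can be sketched as follows: fit each candidate on the training split, compare them on the validation split, and score only the winner on the test split. The data, the split sizes, and the two candidate models are assumptions picked for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data with a 60/20/20 train/validation/test split (assumptions).
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Hypothetical candidates; in the notebook these would be the tuned models.
candidates = {
    'svm_linear': SVC(kernel='linear', C=1),
    'logreg': LogisticRegression(max_iter=1000),
}

# Fit on train, compare on validation only.
val_scores = {name: model.fit(X_tr, y_tr).score(X_val, y_val)
              for name, model in candidates.items()}
best_name = max(val_scores, key=val_scores.get)

# The test set is touched once, for the chosen model.
test_score = candidates[best_name].score(X_test, y_test)
print('validation accuracies:', val_scores)
print('chosen: {}, test accuracy: {:.3f}'.format(best_name, test_score))
```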

### Write out pickled model
Fitted scikit-learn models can be saved to disk and reloaded later.
- joblib's `dump` (to save) and `load` (to load) functions can be used for this purpose.

In [17]:
joblib.dump(cv.best_estimator_, 'SVM_model.pkl')

['SVM_model.pkl']

In [18]:
joblib.load("SVM_model.pkl")

SVC(C=4.631719300611734, kernel='linear')
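
A quick way to confirm that a saved model survived the round trip is to check that the reloaded estimator makes the same predictions as the original. The sketch below does this with a small SVC fitted on synthetic data and a temporary directory; the data and the file name `svm_sketch.pkl` are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the Titanic features).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = SVC(kernel='linear', C=1).fit(X, y)

# Save to a temporary location and load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'svm_sketch.pkl')  # hypothetical file name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print('predictions match after reload:', same_predictions)
```

Pickled models are generally only safe to load with the same scikit-learn version that wrote them, and only from sources you trust.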