## Support Vector Machines: Fit and evaluate a model

This lesson uses the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.

In this section, we will fit and evaluate a simple Support Vector Machine (SVM) model.

### Read in Data

![CV](img/CV.png)
![Cross-Val](img/Cross-Val.png)

_Welcome back to the final lesson in the Support Vector Machines chapter. Just like the final lesson in the Logistic Regression chapter, in this lesson we will use `GridSearchCV` to run a grid search within k-fold Cross-Validation and find the hyperparameter settings that produce the best SVM model._

_I included these two images as one more quick refresher on the stage of the pipeline we're in and what k-fold Cross-Validation looks like. Again, k-fold CV is a robust way to test a model: because it is scored on k different held-out subsets, you get a feel for the range of outcomes rather than a single number._
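
As a rough illustration of what k-fold Cross-Validation produces, the sketch below scores a default `SVC` on 5 folds. It uses a synthetic dataset from `make_classification` as a stand-in (an assumption, since the Titanic files are not loaded here), and the names `X`, `y`, and `scores` are only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the training data (assumption: not the Titanic set).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# cv=5 splits the data into 5 folds; each fold is held out once for scoring
# while the model is fit on the other 4.
scores = cross_val_score(SVC(), X, y, cv=5)

print('Fold scores:', scores.round(3))
print('Mean: {:.3f}, std: {:.3f}'.format(scores.mean(), scores.std()))
```

The spread of the five fold scores is the "range of outcomes" that a single train/test split would hide.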

_Let's start by importing the packages we will need for this lesson, then we will read in the training features and the training labels._

In [1]:
import joblib
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

tr_features = pd.read_csv('../train_features.csv')
tr_labels = pd.read_csv('../train_labels.csv', header=None)

### Hyperparameter tuning

![kernel](img/kernel.png)
![c](img/c.png)

_Now let's jump into the hyperparameter tuning. We'll be looking to optimize the `kernel` and `C` hyperparameters._

_One more reminder: I wrote this very simple `print_results` function to help us explore the results of `GridSearchCV`. It pulls out each hyperparameter setting, the average test score (across the k folds), and the standard deviation of the test scores (across the k folds), then prints them. The spread is printed as plus or minus two standard deviations._

In [2]:
def print_results(results):
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))
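
An alternative way to explore the same information is to load `cv_results_` into a pandas DataFrame. This is only a sketch on synthetic data (an assumption, not the Titanic set); `search` and `summary` are names made up for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the training data (assumption: not the Titanic set).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# A deliberately tiny grid so the example runs quickly.
search = GridSearchCV(SVC(), {'C': [0.1, 1]}, cv=5).fit(X, y)

# cv_results_ is a dict of arrays, one entry per hyperparameter combination.
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
summary = pd.DataFrame(search.cv_results_)[columns]

print(summary.sort_values('rank_test_score'))
```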

_In the next cell we instantiate `SVC` and store it as `svc`. We could hardcode hyperparameter values by entering them in the parentheses, but we want to test them using `GridSearchCV`, so we'll leave the rest of the hyperparameters at the default settings we saw in the last lesson._

_Let's define our hyperparameter dictionary. The keys in this dictionary match the names of the hyperparameters that would be passed into `SVC`, and for each key we define the list of settings we want to test. We'll explore `linear` and `rbf` for `kernel`, and we're going to limit our `C` values to 0.1, 1, and 10. As you will see in a minute, SVM is quite slow to train, even on our really small training set, so I'm going to restrict the number of combinations we're looking at._

_Now we call `GridSearchCV`, pass in our model object (`svc`) and the hyperparameter dictionary, and tell it we want to do 5-fold Cross-Validation._

_Then, as we saw previously, the standard `sklearn` API just requires you to call `.fit()`, and we will pass in the training features and training labels. Again, what is going to happen here is that it will grab the first hyperparameter combination, which is the `linear` kernel with C=0.1. It will pass those hyperparameter settings into `SVC` and run Cross-Validation with them. Because we're doing 5-fold CV, it will loop through the 5 subsets of data, each time fitting on 4 and evaluating on the 5th. Then it will store the average test score for that loop (and other things). It will do this for each hyperparameter combination._

_So let's look at the results._
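
The loop described above can be approximated by hand with `ParameterGrid` and `cross_val_score`. This is a simplified sketch of the idea, not how `GridSearchCV` is implemented internally; it runs on synthetic data (an assumption), and `results`, `best_mean`, `best_std`, and `best_params` are names invented for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the training data (assumption: not the Titanic set).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

parameters = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1, 10]
}

results = []
# ParameterGrid expands the dictionary into every combination (2 x 3 = 6).
for params in ParameterGrid(parameters):
    # For each combination, fit on 4 folds and score on the 5th, five times.
    scores = cross_val_score(SVC(**params), X, y, cv=5)
    results.append((scores.mean(), scores.std(), params))

# Pick the combination with the highest mean test score.
best_mean, best_std, best_params = max(results, key=lambda r: r[0])
print('BEST PARAMS: {} ({:.3f})'.format(best_params, best_mean))
```

With its default settings, `GridSearchCV` additionally refits the best combination on the full training set, which is what makes `best_estimator_` available afterwards.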

In [None]:
svc = SVC()
parameters = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1, 10]
}
cv = GridSearchCV(svc, parameters, cv=5)
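# Flatten the single-column labels DataFrame to a 1-D array so that
# scikit-learn does not raise a DataConversionWarning on every fit.
cv.fit(tr_features, tr_labels.values.ravel())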

print_results(cv)

_Let's take a quick look at the best estimator based on the mean test score._

In [None]:
cv.best_estimator_

### Write out pickled model

_Lastly, let's write out this model so we can compare it to the best models from some of the other algorithms in the final chapter of this course. Remember, `cv.best_estimator_` is a MODEL: there is an actual fitted model stored. So we can save it, then read it back in and go right to work making predictions with that model._

_We're going to pickle the model and write it out using `joblib`. Pass in the model object and just tell it to write to the same location where we stored our other model, `../SVM_model.pkl`._

In [None]:
joblib.dump(cv.best_estimator_, '../SVM_model.pkl')
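
To illustrate the "save it, then read it back in" step, here is a minimal round-trip sketch. It fits a small `SVC` on synthetic data and writes to a temporary directory rather than `../SVM_model.pkl`; both choices are assumptions made so the example is self-contained, and `model`, `restored`, and `same` are illustrative names.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the training data (assumption: not the Titanic set).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = SVC(kernel='linear', C=1).fit(X, y)

# Write the fitted model to disk, then load it back into a new object.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'SVM_model.pkl')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should make the same predictions as the original.
same = np.array_equal(model.predict(X), restored.predict(X))
print('Predictions match after reload:', same)
```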