## Logistic Regression: Fit and evaluate a model

This lesson uses the Titanic dataset from [this Kaggle competition](https://www.kaggle.com/c/titanic/overview).

In this section, we will fit and evaluate a simple Logistic Regression model.

### Read in Data

![CV](img/CV.png)
![Cross-Val](img/Cross-Val.png)

_Welcome back to the final lesson in the Logistic Regression chapter. In this lesson we will do some model testing for Logistic Regression. More specifically, we will use the `GridSearchCV` tool to run a grid search within k-fold Cross-Validation and find the hyperparameter setting that generates the best Logistic Regression model._

_If you're unfamiliar with k-fold Cross-Validation, it takes your dataset and splits it into k subsets. It then iterates through those k subsets, and on each loop it fits a model on k-1 subsets and tests it on the remaining subset. It generates performance metrics for each loop, and it's a great way to robustly train and test your model (by the end, every example in your dataset has been used for testing exactly once)._
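The loop described above can be sketched by hand with `KFold`. This is only an illustration: the data is synthetic (an assumption, not the Titanic features), and the names `X`, `y`, `fold_scores`, and `tested` are made up for the sketch. Note that `GridSearchCV`, when given an integer `cv` and a classifier, uses a stratified variant of this split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption; not the Titanic features).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []  # one accuracy score per fold
tested = []       # indices that have appeared in a test fold

for train_idx, test_idx in kf.split(X):
    # Fit on k-1 subsets, evaluate on the remaining one.
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))
    tested.extend(test_idx.tolist())

print(fold_scores)
print(np.mean(fold_scores))
```

After the loop, every row has landed in a test fold exactly once, which is why the averaged score reflects performance on the whole dataset.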

_We'll explore this more as we move through this lesson. Let's first import the packages we will need and then read in the training features and the training labels that we wrote out last chapter. Note that I indicated `header=None` for the labels because there is no column name for that dataframe._

In [1]:
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

tr_features = pd.read_csv('../train_features.csv')
tr_labels = pd.read_csv('../train_labels.csv', header=None)
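To see why `header=None` matters for a file with no column name, here is a small sketch using an in-memory CSV. The contents of `csv_text` are hypothetical and only stand in for a label file with one value per line.

```python
import io

import pandas as pd

# Hypothetical label file: one label per line, no header row.
csv_text = "0\n1\n1\n0\n"

with_header_none = pd.read_csv(io.StringIO(csv_text), header=None)
default_header = pd.read_csv(io.StringIO(csv_text))

print(with_header_none.shape)  # (4, 1): all four labels are kept as data
print(default_header.shape)    # (3, 1): the first label was consumed as a column name
```

Without `header=None`, pandas treats the first row as the header, so one label would silently go missing.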

### Hyperparameter tuning

![C](img/c.png)
![C LR](img/c_lr.png)

_These two images are just reminders of what we reviewed last lesson regarding the only hyperparameter we'll be looking to optimize: the `C` regularization parameter._
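As a quick reminder of what `C` does, the sketch below fits the same model at three values of `C` and compares the size of the learned coefficients. In `sklearn`, `C` is the inverse of the regularization strength, so a smaller `C` shrinks the coefficients more. The data here is synthetic and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the Titanic features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int)

coef_norms = {}
for C in [0.001, 1, 1000]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # L2 norm of the coefficient vector: smaller means more heavily regularized.
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(C, round(coef_norms[C], 3))
```

The norms should grow as `C` grows, which is the low-`C` (underfit) to high-`C` (less constrained) trade-off the images illustrate.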

_First, I wrote this very simple `print_results` function that is going to help us explore the results of this `GridSearchCV`. It pulls out each hyperparameter setting, the average test score (across k folds), and the standard deviation of the test scores (across k folds), then prints the mean along with a +/- range of two standard deviations._

In [2]:
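def print_results(results):
    """Print the best setting, then the mean test score (+/- 2 std) for every setting tried."""
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    # cv_results_ holds one entry per hyperparameter setting, aggregated across folds
    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))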
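If a table is easier to scan than printed lines, the same `cv_results_` fields can be loaded into a DataFrame. This is an optional alternative, sketched on synthetic data with a shortened `C` grid; the names `search` and `summary` are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the Titanic features).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

search = GridSearchCV(LogisticRegression(), {'C': [0.01, 1, 100]}, cv=5)
search.fit(X, y)

# One row per hyperparameter setting, with the fold-aggregated scores.
summary = pd.DataFrame(search.cv_results_)[
    ['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(summary.sort_values('rank_test_score'))
```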

_Ok, let's get into the actual `GridSearchCV`. We have instantiated the `LogisticRegression` object here and stored it as `lr`. If we wanted to hardcode any hyperparameter values, we would enter them in these parentheses. Otherwise, it will just use the defaults, which we saw in the last lesson._

_Next we're going to define our parameters dictionary, and you'll do this any time you're using `GridSearchCV`. The key needs to match the name of the hyperparameter in `LogisticRegression`, which is `C`, and the value is a list of the settings we want to explore. Let's say 0.001, 0.01, 0.1, 1, 10, 100, and 1000._

_Then we call `GridSearchCV`, passing in the model object (`lr`), the parameter dictionary, and the number of folds (we will use 5), and assign the result to `cv`._

_Then remember I told you how the API for all `sklearn` objects is exactly the same, with `.fit()` and `.predict()`. Well, `GridSearchCV` is no different. So we call `cv.fit()` and pass in our features and then our labels._

_Now, what is going to happen here is it will grab the first hyperparameter setting, that's `C=0.001`. Then it will pass that into `LogisticRegression` and run Cross-Validation with that setting. Because we're doing 5-fold CV, it will loop through the 5 subsets of data, each time fitting on 4 and evaluating on the 5th. Then it will store the average test score for that setting (and other things). It will do this for each hyperparameter setting._
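To make that procedure concrete, here is a rough hand-written equivalent: loop over the `C` values, run 5-fold cross-validation for each, and keep the mean score. It is a sketch on synthetic data with a shortened grid, and the names `c_values`, `manual_means`, and `grid_means` are assumptions for the example. It is not how `GridSearchCV` is implemented internally, only what it computes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data (an assumption; not the Titanic features).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=150) > 0).astype(int)

c_values = [0.001, 0.1, 10]

# Manual version: for each C, run 5-fold CV and store the average test score.
manual_means = {}
for C in c_values:
    scores = cross_val_score(LogisticRegression(C=C), X, y, cv=5)
    manual_means[C] = scores.mean()

# GridSearchCV version of the same search.
search = GridSearchCV(LogisticRegression(), {'C': c_values}, cv=5).fit(X, y)
grid_means = {
    params['C']: mean
    for params, mean in zip(search.cv_results_['params'],
                            search.cv_results_['mean_test_score'])
}

for C in c_values:
    print(C, round(manual_means[C], 3), round(grid_means[C], 3))
```

Both approaches use the same deterministic folds here, so the mean scores should agree. With its default `refit=True`, `GridSearchCV` additionally refits the best setting on the full data once the search is done.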

_Let's run the grid search and look at the results._

In [3]:
lr = LogisticRegression()
parameters = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}
cv = GridSearchCV(lr, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())  # flatten the single-column labels to 1-D to avoid a warning on every fit

print_results(cv)

BEST PARAMS: {'C': 1}

0.678 (+/-0.092) for {'C': 0.001}
0.704 (+/-0.099) for {'C': 0.01}
0.796 (+/-0.13) for {'C': 0.1}
0.798 (+/-0.123) for {'C': 1}
0.794 (+/-0.118) for {'C': 10}
0.794 (+/-0.118) for {'C': 100}
0.794 (+/-0.118) for {'C': 1000}

_Another nice thing `sklearn` does is store the best model (based on mean test score) as an attribute on the `GridSearchCV` object. Let's look at that, and you can see that `C=1` is the best model._

In [4]:
cv.best_estimator_

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
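A short sketch of how the best-model attributes relate to each other, again on synthetic data with a shortened grid (the names `search`, `best`, and `same` are assumptions for the example). With the default `refit=True`, `best_estimator_` is a fitted model, and calling `.predict()` on the search object delegates to it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the Titanic features).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

search = GridSearchCV(LogisticRegression(), {'C': [0.01, 1, 100]}, cv=5).fit(X, y)

best = search.best_estimator_  # fitted LogisticRegression using the winning C
print(search.best_params_, round(search.best_score_, 3))

# Predicting through the search object gives the same result as the best model.
same = np.array_equal(search.predict(X), best.predict(X))
print(same)
```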

### Write out pickled model

_Lastly, let's write out this best model so we can compare it to the best models from some of the other algorithms in the final chapter of this course. I'll just call out that `cv.best_estimator_` is a fitted MODEL: there is an actual equation stored. So we can save it, then read it back in and go right to work making predictions with that model._

_We're going to pickle the model and write it out using `joblib`. Pass in the model object and tell it to write to the same location where we stored the data: `../LR_model.pkl`._

In [5]:
joblib.dump(cv.best_estimator_, '../LR_model.pkl')

['../LR_model.pkl']
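Reading the model back in is the mirror image of writing it out. The sketch below does a round trip with `joblib.dump` and `joblib.load`, using a temporary directory and synthetic data so it can run anywhere; the path and variable names are assumptions for the example, not the course's files.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the Titanic features).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'LR_model.pkl')  # temporary path for this sketch only
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model is already fitted and predicts exactly like the original.
round_trip_ok = np.array_equal(model.predict(X), restored.predict(X))
print(round_trip_ok)
```

In the course setting, the equivalent call would be `joblib.load('../LR_model.pkl')`, after which `.predict()` can be called right away.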