# Model Comparison

Jupyter Notebook referenced from my website:
[Software Nirvana: Baseline Model (1)](https://sdiehl28.netlify.com/2018/03/baseline-model-1)

### Goals
Discuss how to choose the "best" of two models based on their Cross Validated scores.

### Common Imports and Notebook Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
%matplotlib inline
sns.set() # enable seaborn style

In [2]:
# read in all the labeled data
all_data = pd.read_csv('../data/train.csv')

# break up the dataframe into X and y
X = all_data.drop('Survived', axis=1)
y = all_data['Survived']

# As before, remove all non-numeric fields and PassengerId
drop_fields = ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'PassengerId']
X = X.drop(drop_fields, axis=1)

# Remove all columns with null values (1st iteration only)
X.dropna(axis=1, inplace=True)
X.dtypes

Pclass      int64
SibSp       int64
Parch       int64
Fare      float64
dtype: object

<a name="comparison"></a>
### Discussion
[Back to Outline](#outline)

The purpose of Cross Validation is to estimate the model's accuracy on out-of-sample data.  This estimate can be used to:
* predict how well the model will work when deployed to production
* compare different models
* compare the same model with different hyperparameters

The final model deployed to production will be built using all the available data, as this usually produces the most accurate model.  But we cannot estimate the model's accuracy on the same data it was trained on: an overfit model scores well on data it has already seen, so the estimate would be optimistically biased.  Model evaluation is best performed on data the model has never seen.
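A minimal sketch of why accuracy measured on the training data is misleading.  It does not use the Titanic data or Logistic Regression; it assumes synthetic features with purely random labels and a hand-written 1-nearest-neighbor classifier (`predict_1nn` is a hypothetical helper written for this illustration).  There is nothing to learn here, yet the training accuracy is perfect because each training point is its own nearest neighbor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): labels are pure noise
X_demo = rng.normal(size=(200, 4))
y_demo = rng.integers(0, 2, size=200)

def predict_1nn(X_train, y_train, X_query):
    """Predict the label of the closest training point (squared Euclidean distance)."""
    dists = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[dists.argmin(axis=1)]

# hold out the second half of the rows
X_train, y_train = X_demo[:100], y_demo[:100]
X_test, y_test = X_demo[100:], y_demo[100:]

train_acc = (predict_1nn(X_train, y_train, X_train) == y_train).mean()
test_acc = (predict_1nn(X_train, y_train, X_test) == y_test).mean()

print(f'Accuracy on training data: {train_acc:.3f}')
print(f'Accuracy on held-out data: {test_acc:.3f}')
```

The held-out accuracy should land near chance, which is the honest answer for labels that carry no signal.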

Cross Validation may introduce some bias in its accuracy estimate, but this bias is usually similar for all models under consideration.  When the scores of two models are compared, this bias largely cancels out in the difference, so cross validation is excellent for comparing the performance of different models.

The bias in the accuracy estimate is in part due to the fact that the models are built on overlapping data, and estimates built on overlapping data are not independent.  This means that the 5 estimates in 5-fold cross validation are not 5 truly independent estimates but perhaps more like 3 or 4.  It also means that the variance and standard deviation computed from the accuracy scores are too low.  The effect can show up when the model is deployed to production and performs less well than expected.  That said, except where there is a very large amount of data, the cross validated accuracy estimate is perhaps the best estimate available.
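To make the overlap concrete, the sketch below builds the 5 training sets of a plain, unshuffled 5-fold split by hand and measures how much two of them share.  The sample count of 1000 is an assumption chosen for round numbers, not the size of the Titanic data.

```python
import numpy as np

n_samples = 1000   # hypothetical dataset size
k_folds = 5

indices = np.arange(n_samples)
test_folds = np.array_split(indices, k_folds)

# each model trains on everything except its own test fold
train_sets = [np.setdiff1d(indices, fold) for fold in test_folds]

shared = np.intersect1d(train_sets[0], train_sets[1])
overlap_fraction = shared.size / train_sets[0].size

print(f'Training set size: {train_sets[0].size}')
print(f'Rows shared by the first two training sets: {shared.size}')
print(f'Fraction of a training set shared with another: {overlap_fraction:.2f}')
```

Any two training sets differ only in the two test folds that were left out, so most of what one model is trained on is also used to train every other model.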

Often a point estimate of the Cross Validated score, such as the mean of the scores, is used to compare two models.  However, if the standard deviation of the scores is comparable to the difference between the mean scores, then the lower score might be lower just by chance.  In some situations this is fine: picking one of two equally good models is as good as picking the other.
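A quick way to see whether a difference in mean scores is large or small relative to the spread is to express it in units of standard deviation.  The sketch below is only a rough check, not a hypothesis test.  The helper `mean_gap_in_sd_units` is hypothetical, the `baseline` array holds the ten baseline scores reported in this notebook, and the `candidate` scores are made up purely for illustration.

```python
import numpy as np

def mean_gap_in_sd_units(scores_a, scores_b):
    """Difference of mean scores divided by the pooled standard deviation."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    gap = a.mean() - b.mean()
    # np.std defaults to ddof=0, matching scores.std() as used in this notebook
    pooled_sd = np.sqrt((a.std() ** 2 + b.std() ** 2) / 2)
    return gap / pooled_sd

# baseline scores from this notebook
baseline = np.array([0.666, 0.708, 0.673, 0.694, 0.686,
                     0.679, 0.695, 0.676, 0.686, 0.688])

# MADE-UP scores for a hypothetical second model (illustration only)
candidate = np.array([0.684, 0.701, 0.690, 0.688, 0.699,
                      0.671, 0.702, 0.693, 0.680, 0.705])

gap_in_sd = mean_gap_in_sd_units(candidate, baseline)
print(f'Mean difference: {candidate.mean() - baseline.mean():.3f}')
print(f'Difference in units of pooled SD: {gap_in_sd:.2f}')
```

A gap of well under one standard deviation suggests the ranking of the two models could easily flip with a different random split.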

In some situations, we would like to know if a proposed model change results in a model that is better by a statistically significant amount.  For example, a model change is proposed that will cost time and money to implement, and we want to know beforehand if this cost is warranted.  We may also be looking for practical significance, not just statistical significance.  For example, we might want a hypothesis test that says the new model predicts with 3% better accuracy than the old model at the 0.05 significance level.
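One test that works directly with 5 repeats of 2-fold scores is Dietterich's 5x2cv paired t-test.  This notebook does not say which test is intended, so treat the following as one possible sketch rather than the method used here.  It assumes both models were scored on the same splits (same `random_state`) and that the scores are ordered repeat by repeat.  The function name `five_by_two_cv_t` is hypothetical and the `candidate` scores are made up for illustration.  The null hypothesis is that the two models have equal accuracy; testing for a specific margin such as 3% would need a different setup.

```python
import numpy as np

def five_by_two_cv_t(scores_a, scores_b):
    """5x2cv paired t statistic from two arrays of 10 fold scores.

    Scores must be ordered: repeat 1 fold 1, repeat 1 fold 2, repeat 2 fold 1, ...
    """
    diffs = (np.asarray(scores_a, dtype=float)
             - np.asarray(scores_b, dtype=float)).reshape(5, 2)
    mean_per_repeat = diffs.mean(axis=1, keepdims=True)
    # variance estimate for each repeat, from its two fold differences
    var_per_repeat = ((diffs - mean_per_repeat) ** 2).sum(axis=1)
    # numerator is the difference on the first fold of the first repeat
    return diffs[0, 0] / np.sqrt(var_per_repeat.mean())

# baseline scores from this notebook
baseline = np.array([0.666, 0.708, 0.673, 0.694, 0.686,
                     0.679, 0.695, 0.676, 0.686, 0.688])

# MADE-UP scores for a hypothetical improved model (illustration only)
candidate = np.array([0.781, 0.795, 0.772, 0.801, 0.790,
                      0.779, 0.798, 0.775, 0.784, 0.792])

T_CRITICAL = 2.571  # two-sided 0.05 critical value, t distribution, 5 degrees of freedom

t_stat = five_by_two_cv_t(candidate, baseline)
print(f't statistic: {t_stat:.3f}')
print('Significant at 0.05:', abs(t_stat) > T_CRITICAL)
```

The statistic is compared against a t distribution with 5 degrees of freedom, which is why only the first fold difference appears in the numerator while all five repeats contribute to the variance estimate.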




In [3]:
# Compute 5x2CV Scores for iteration 1
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, RepeatedKFold
k_folds = 2
repeats = 5
random_seed = 108
crossvalidation = RepeatedKFold(n_splits=k_folds, n_repeats=repeats, 
                                random_state=random_seed)

classifier = LogisticRegression()
scores = cross_val_score(classifier, X, y, cv=crossvalidation, 
                         scoring='accuracy', n_jobs=-1)

print('Scores: ', np.round(scores, 3))
print(f'Cross Validated Accuracy: {scores.mean():.3f}  SD: {scores.std():.3f}')

# save these results so they can be used for comparison in future iterations
# note: np.save appends '.npy', so the file on disk is '../data/iter01.data.npy'
np.save('../data/iter01.data', scores)

Scores:  [0.666 0.708 0.673 0.694 0.686 0.679 0.695 0.676 0.686 0.688]
Cross Validated Accuracy: 0.685  SD: 0.012
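A later iteration will need to read these scores back.  Because `np.save` appends `.npy` to a filename that does not already end with it, the file to load is `iter01.data.npy`.  The sketch below shows the round trip in a temporary directory (an assumption made so the example does not touch `../data`).

```python
import os
import tempfile

import numpy as np

# baseline scores from this notebook
scores = np.array([0.666, 0.708, 0.673, 0.694, 0.686,
                   0.679, 0.695, 0.676, 0.686, 0.688])

with tempfile.TemporaryDirectory() as tmp_dir:
    requested_path = os.path.join(tmp_dir, 'iter01.data')
    np.save(requested_path, scores)

    saved_files = os.listdir(tmp_dir)            # what actually landed on disk
    loaded = np.load(requested_path + '.npy')    # load using the full filename

print('Files written:', saved_files)
print(f'Reloaded mean accuracy: {loaded.mean():.3f}')
```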


<a name="summary"></a>
### Summary
[Back to Outline](#outline)

In this first iteration we:
* discussed how Scikit Learn computes accuracy, the confusion matrix, and cross validation scores
* created a simple model to use as our baseline
* established a baseline accuracy of 68.5%
* discussed how to compare models by their cross validated scores
* showed that our baseline model (68.5%) is better than the null model (61.6%)