# Cross Validation<br/>*Right and Wrong*

### Goals
1. Standardize the variables (to use Logistic Regression's regularization properly)
2. Demonstrate the right and wrong way to perform Cross Validation

Before continuing with iterative model development, it is important to understand some of the subtleties of using Cross Validation.

### Avoiding Data Leakage
There are two common statements made about model building and the test set.  The second statement includes the first.
1. Never use the test set's target variable to build a model.
2. Never use any part of the test set to build a model.

The first statement is an absolute must.  If the test set's target variable is used while building the model, the estimate of model performance can be far too high.

The second statement represents good practice. Although looking at some part of the test data other than the target variable might not affect the estimate of model performance very much, there is rarely a need to do so.

The easiest way to ensure that there is no data leakage is to encapsulate all data transformation operations inside a Scikit Learn `Pipeline` (a "pipe") and to use that pipe for both fitting and evaluation.

Every example I have reviewed in the Scikit Learn documentation that performs a data transformation encapsulates that transformation inside of a pipe.  In other words, the Scikit Learn examples ensure that no part of the test data is used to build a model.
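As a rough illustration of the pipe idea, the sketch below wraps a `StandardScaler` and a `LogisticRegression` in a pipe and hands the pipe to `cross_val_score`. The data comes from `make_classification` and is only a stand-in (an assumption, not the Titanic data); the names `X_demo`, `y_demo`, `pipe`, and `pipe_scores` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     random_state=0)

# The scaler lives inside the pipe, so for every fold it is re-fit
# on that fold's training rows only and then applied to the test rows.
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(solver='liblinear'))

pipe_scores = cross_val_score(pipe, X_demo, y_demo,
                              cv=5, scoring='accuracy')
print("mean accuracy:", pipe_scores.mean())
```

Because the transformation and the model travel together, there is no step at which statistics from the held-out rows can reach the fitted model.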

The reason for stressing this point is that *most* of the [kernels](https://www.kaggle.com/c/titanic/kernels) on Kaggle for the Titanic data set violate this rule.  A common example is to see all of the data standardized prior to starting the model building process.  This is poor practice even if it doesn't make much of a difference on the Titanic data set.

Below, standardization will be performed the "right way" (using the equivalent of a pipe) and the "wrong way" (on all data up front) to show the difference.

### Common Imports and Notebook Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
%matplotlib inline
sns.set() # enable seaborn style

import titanic_helper_code as tt

In [2]:
# Version Information
import sys
print('python:     ', sys.version)
print('numpy:      ', np.__version__)
print('pandas:     ', pd.__version__)
import matplotlib
print('matplotlib: ', matplotlib.__version__)
print('seaborn:    ', sns.__version__)
print('sklearn:    ', sk.__version__)
!lsb_release -d

python:      3.7.3 (default, Mar 27 2019, 22:11:17) 
[GCC 7.3.0]
numpy:       1.16.4
pandas:      0.24.2
matplotlib:  3.1.0
seaborn:     0.9.0
sklearn:     0.21.1
Description:	Ubuntu 18.04.2 LTS


### Previous Model Building Iteration

In [3]:
# Copied from my titanic_helper_code.py
def get_Xy_v1(filename='./data/train.csv'):
    """Data Encoding for Iteration 1

    Version 1
    * Keeps Pclass, Fare, and Sex
    * Sex is encoded as 1 for female and 0 for male
    """

    # read data
    all_data = pd.read_csv(filename)
    X = all_data.drop('Survived', axis=1)
    y = all_data['Survived']
    
    # encode data
    X['Sex'] = X['Sex'].replace({'female':1, 'male':0})
    
    # drop unused columns
    drop_columns = ['PassengerId', 'Name', 'Age', 'SibSp', 'Parch', 
                    'Ticket', 'Cabin', 'Embarked']
    X = X.drop(drop_columns, axis=1)
    
    return X, y

In [4]:
X, y = tt.get_Xy_v1()

<a name="crossvalidation"></a>

### Cross Validation: The Right Way
The goal here is simply to show that there is a difference between computing the accuracy when no test data is looked at vs when some of the test data is looked at (but not the target variable).

In order to show a difference in what follows, strong regularization is used (C=0.001).

In [5]:
# imports
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

In [6]:
# define a specific set of CV folds for repeatability
cv_select = RepeatedStratifiedKFold(n_splits=2, n_repeats=10, random_state=108)

In [7]:
# perform CV with transformation, without a pipe, to illustrate the concept
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
lr = LogisticRegression(penalty='l2', C=0.001, solver='liblinear')

score_per_fold = []
for train_idx, test_idx in cv_select.split(X,y):
    
    # train subset
    X_train = X.iloc[train_idx, :]
    y_train = y.iloc[train_idx]
    
    # test subset
    X_test = X.iloc[test_idx, :]
    y_test = y.iloc[test_idx]
    
    # standardize the variables on train
    X_train_transformed = ss.fit_transform(X_train)
    
    # fit model on train
    lr.fit(X_train_transformed, y_train)
    
    # standardize variables on test
    X_test_transformed = ss.transform(X_test) # do not call fit_transform!
    
    # predict using fitted model on test
    predictions = lr.predict(X_test_transformed)
    
    # evaluate accuracy
    fold_score = accuracy_score(y_test, predictions)
    score_per_fold.append(fold_score)
    
scores = np.array(score_per_fold)
tt.print_scores(scores)

20 Scores  min:0.757 max:0.813
CV Mean Score: 0.784 +/- 0.017


### Cross Validation: The Wrong Way
The wrong, but common, way to do this is to standardize the variables over the *entire* dataset and then estimate model accuracy using either a train/test split or cross validation.

Below it is shown that there is a difference between performing CV correctly and incorrectly.

In [8]:
# Prior to Cross Validation: standardize *all* the data up front
# This is "data leakage"

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
wrong_scores = cross_val_score(lr, X_scaled, y, 
                               cv=cv_select, scoring='accuracy')

# We do *not* get the same scores as above!
print("Scores Match: ", (scores == wrong_scores).all())
print("Scores Diff:  ", np.round(wrong_scores.mean() - scores.mean(), 4))

Scores Match:  False
Scores Diff:   0.0029


We see that standardizing all the values prior to evaluating the model with cross validation led to an estimate of model performance that was slightly too high (the mean accuracy is about 0.003 higher).

### Cross Validation The Wrong Way: Discussion
**What was wrong:** All of the data was used for standardization.  That is, test data was used to compute the mean and standard deviation that were applied in the standardization transform.

**What may happen:** Estimate of model accuracy may be too high.

**With any transformation that does not look at the target variable, this might not be a problem.**  In the above, we saw that it made almost no difference.
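To make the leak concrete, the sketch below fits one `StandardScaler` on training rows only and another on all rows, then compares the learned statistics and the resulting transformed test values. The single skewed feature is generated at random purely for illustration (an assumption; it is not the Titanic `Fare` column), and all variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up skewed feature (assumption: illustrative data only)
rng = np.random.RandomState(0)
feature_demo = rng.exponential(scale=30.0, size=(100, 1))

train_part, test_part = feature_demo[:50], feature_demo[50:]

scaler_train = StandardScaler().fit(train_part)   # right: train rows only
scaler_all = StandardScaler().fit(feature_demo)   # wrong: test rows included

print("train-only mean/std:", scaler_train.mean_[0], scaler_train.scale_[0])
print("all-data   mean/std:", scaler_all.mean_[0], scaler_all.scale_[0])

# The same test rows end up with different standardized values
z_right = scaler_train.transform(test_part)
z_wrong = scaler_all.transform(test_part)
print("max abs difference:", np.abs(z_right - z_wrong).max())
```

The two scalers disagree only because the second one has seen the test rows. Since no target values are involved, the effect on the accuracy estimate tends to be small, which is consistent with the small difference observed above.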

**With any transformation which does look at the target variable, as some variable selection procedures do, this may be a very serious problem** leading to highly inflated values of model accuracy.

**Great Explanation and Story by Robert Tibshirani:**  
Robert Tibshirani, in the YouTube video [Cross Validation: Right and Wrong](https://www.youtube.com/watch?v=S06JpVoNaA0&list=PL5-da3qGB5IA6E6ZNXu7dp89_uv8yocmf), 
explains the right and wrong way to perform cross validation in detail.  

In the video, he presents a story about a Ph.D. oral dissertation in which the presenter filtered away variables *prior* to cross validation, using correlations to the target variable, and the serious effect this had on his medical research.
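The variable-filtering mistake can be reproduced on pure noise. The sketch below is an illustration under stated assumptions, not a reconstruction of the study in the video: the features are random numbers unrelated to the labels, `SelectKBest` with `f_classif` stands in for "filtering by correlation to the target", and the sample and feature counts are arbitrary.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: the labels carry no information about the features
rng = np.random.RandomState(0)
n_samples, n_features = 50, 5000
X_noise = rng.normal(size=(n_samples, n_features))
y_noise = np.array([0, 1] * (n_samples // 2))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression(solver='liblinear')

# Wrong: pick the 20 "best" variables using ALL rows (and their labels),
# then cross validate on the already-filtered data
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky = cross_val_score(clf, X_selected, y_noise, cv=cv).mean()

# Right: the selection step is inside the pipe, so it is repeated
# on the training rows of each fold
pipe = make_pipeline(SelectKBest(f_classif, k=20), clf)
honest = cross_val_score(pipe, X_noise, y_noise, cv=cv).mean()

print("selection before CV:", leaky)
print("selection inside CV:", honest)
```

Since the data is noise, an honest estimate should sit near chance (0.5). The "selection before CV" number is typically far higher, because the filter has already seen the labels of the rows later used for testing.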

### Summary
**Best practice is to ensure that the entire model building process is encapsulated inside of cross validation.**  

This is most easily accomplished by encapsulating all data transformation operations inside of a pipe, as will be shown in the next notebook.

Note that certain mappings, such as encoding "female" as 1 and "male" as 0, are independent of any test data.  Such encodings need not be encapsulated within a pipe.
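A small check of why such a mapping is safe: a fixed dictionary lookup gives each row the same value no matter which other rows are present. The series below is made up for illustration.

```python
import pandas as pd

# Made-up values (assumption: illustrative only)
sex_demo = pd.Series(['female', 'male', 'male', 'female'])
mapping = {'female': 1, 'male': 0}

encoded_all = sex_demo.map(mapping)                # encode every row
encoded_subset = sex_demo.iloc[:2].map(mapping)    # encode a subset only

# The shared rows are encoded identically, so nothing is "learned"
# from the rows that were left out.
print(encoded_all.iloc[:2].equals(encoded_subset))
```

Contrast this with standardization, where the mean and standard deviation change with the rows used to compute them.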