# Cross Validation<br/>*Avoiding Data Leakage*

### Goals
1. Standardize the variables
2. Demonstrate the difference between standardizing all the data up front vs standardizing the data on training data only

### Avoiding Test Data Leakage
There are two common statements made about model building and the test set.  The second statement includes the first.
1. Never use the test set's target variable as part of the model building process.
2. Never use any of the test set data as part of the model building process.

The first statement is always true<sup>[1](#footnotes)</sup>.  
The second statement is good practice.

Example 1: Avoiding use of test data  
```python
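from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

s = StandardScaler()

X_train, X_test, y_train, y_test = \
    train_test_split(X, y)
# fit the scaler on the training data only
X_train_transform = s.fit_transform(X_train)
# reuse the training mean and standard deviation on the test data
X_test_transform = s.transform(X_test)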
```

Example 2: Using non-target values of the test data  
```python
s = StandardScaler()

X_transform = s.fit_transform(X)
X_train, X_test, y_train, y_test = \
    train_test_split(X_transform, y)
```  

With a medium-sized, homogeneous data set, the two examples above will likely produce nearly the same model.  This is because random samples of size `len(X_train)` will have means and standard deviations close to those of the full data set.  However, for small amounts of data, or for unbalanced data, the second example can produce an overly optimistic estimate of the model's ability to predict on unseen data.
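The sampling argument above can be checked with a small simulation. This is only a sketch on synthetic, normally distributed data (not the Titanic data); the helper name `train_vs_full_gap` and the 75% training fraction are assumptions made for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)

def train_vs_full_gap(n_samples, train_fraction=0.75, n_trials=200):
    """Average absolute gap between the training-subset mean and the
    full-data mean, over repeated random draws and random splits."""
    n_train = int(train_fraction * n_samples)
    gaps = []
    for _ in range(n_trials):
        x = rng.normal(loc=0.0, scale=1.0, size=n_samples)
        train = rng.permutation(x)[:n_train]  # random training subset
        gaps.append(abs(train.mean() - x.mean()))
    return float(np.mean(gaps))

gap_small = train_vs_full_gap(20)      # small data set
gap_medium = train_vs_full_gap(2000)   # medium-sized data set

print("small data gap: ", round(gap_small, 4))
print("medium data gap:", round(gap_medium, 4))
```

The gap shrinks as the data set grows, which is why the two examples tend to agree on larger, homogeneous data and can disagree on small data.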

When using cross validation instead of a train/test split, the easiest way to perform the transform without using the test data is to use a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).  
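As a minimal sketch of the Pipeline approach, the snippet below uses a synthetic data set from `make_classification` rather than the Titanic data; the step names `'scale'` and `'model'` and the variable names are arbitrary choices for illustration. Because the scaler is a step of the pipeline, `cross_val_score` refits it on the training portion of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (assumption: not the data used in this notebook)
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     random_state=0)

# the scaler is fit only on each fold's training rows
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', LogisticRegression(solver='liblinear')),
])

cv_demo = RepeatedStratifiedKFold(n_splits=2, n_repeats=10,
                                  random_state=108)
pipe_scores = cross_val_score(pipe, X_demo, y_demo,
                              cv=cv_demo, scoring='accuracy')
print("mean accuracy:", pipe_scores.mean())
```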

Below, standardization is performed both with and without looking at the test data, in order to show the difference.

<a name="footnotes"></a>

##### Footnotes
1. "On Comparison of feature selection algorithms" \[Refaeilzadeh et al., 2007\] states that for *ranking* feature selection algorithms for small amounts of data, it may be permissible to look at the target variable in a limited sense.  Unless you are an expert though, the target variable in the test set should not be used as part of the model building process.

### Common Imports and Notebook Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
%matplotlib inline
sns.set() # enable seaborn style

import titanic_helper_code as tt

In [2]:
# Version Information
import sys
print('python:     ', sys.version)
print('numpy:      ', np.__version__)
print('pandas:     ', pd.__version__)
import matplotlib
print('matplotlib: ', matplotlib.__version__)
print('seaborn:    ', sns.__version__)
print('sklearn:    ', sk.__version__)
!lsb_release -d

python:      3.7.3 (default, Mar 27 2019, 22:11:17) 
[GCC 7.3.0]
numpy:       1.16.4
pandas:      0.24.2
matplotlib:  3.1.0
seaborn:     0.9.0
sklearn:     0.21.2
Description:	Ubuntu 18.04.2 LTS


### Previous Model Building Iteration

In [3]:
# Copied from my titanic_helper_code.py
def get_Xy_v1(filename='./data/train.csv'):
    """Data Encoding for Iteration 1

    Version 1
    * Pclass and Fare used as-is; Sex encoded as 1/0 for female/male
    """

    # read data
    all_data = pd.read_csv(filename)
    X = all_data.drop('Survived', axis=1)
    y = all_data['Survived']
    
    # encode data
    X['Sex'] = X['Sex'].replace({'female':1, 'male':0})
    
    # drop unused columns
    drop_columns = ['PassengerId', 'Name', 'Age', 'SibSp', 'Parch', 
                    'Ticket', 'Cabin', 'Embarked']
    X = X.drop(drop_columns, axis=1)
    
    return X, y

In [4]:
X, y = tt.get_Xy_v1()

<a name="crossvalidation"></a>

### Cross Validation: Without Use of Test Data
The goal is simply to show that the estimated accuracy differs depending on whether no test data is looked at or some of the test data (but not the target variable) is looked at.

To make the difference visible in what follows, strong regularization is used (C=0.001).

In the next notebook, a Pipeline will be used to simplify the code.

In [5]:
# imports
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

In [6]:
# define a specific set of CV folds for repeatability
cv_select = RepeatedStratifiedKFold(n_splits=2, n_repeats=10, 
                                    random_state=108)

In [7]:
# perform CV with transformation, without a pipe, 
# to illustrate the concept
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
lr = LogisticRegression(penalty='l2', C=0.001, solver='liblinear')

score_per_fold = []
for train_idx, test_idx in cv_select.split(X,y):
    
    # train subset
    X_train = X.iloc[train_idx, :]
    y_train = y.iloc[train_idx]
    
    # test subset
    X_test = X.iloc[test_idx, :]
    y_test = y.iloc[test_idx]
    
    # standardize the variables on train
    X_train_transformed = ss.fit_transform(X_train)
    
    # fit model on train
    lr.fit(X_train_transformed, y_train)
    
    # standardize variables on test
    X_test_transformed = ss.transform(X_test) # not fit_transform!
    
    # predict using fitted model on test
    predictions = lr.predict(X_test_transformed)
    
    # evaluate accuracy
    fold_score = accuracy_score(y_test, predictions)
    score_per_fold.append(fold_score)
    
scores = np.array(score_per_fold)
tt.print_scores(scores)

20 Scores  min:0.738 max:0.809
CV Mean Score: 0.784 +/- 0.016


### Cross Validation: With Use of Test Data
A common, but not necessarily good, method is to standardize the variables over the *entire* dataset and then estimate model accuracy using either a train/test split or cross validation.

In [8]:
# Prior to Cross Validation: standardize *all* the data up front
# This is "data leakage"

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
wrong_scores = cross_val_score(lr, X_scaled, y, 
                               cv=cv_select, scoring='accuracy')

# We do *not* get the same scores as above!
print("Scores Match: ", (scores == wrong_scores).all())
print("Scores Diff:  ", np.round(wrong_scores.mean() - scores.mean(), 4))

Scores Match:  False
Scores Diff:   0.0033


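We see that standardizing all the values prior to evaluating the model with cross validation led to an estimate of model performance that was slightly too high: the mean accuracy is about 0.0033 higher than the estimate obtained without looking at the test data.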

### Summary
**Best practice is to ensure that the entire model building process is encapsulated inside of cross validation.**  

This is most easily accomplished by encapsulating all data transformation operations inside a Pipeline, as will be shown in the next notebook.

A good story about the above is presented by Robert Tibshirani in the YouTube video [Cross Validation: Right and Wrong](https://www.youtube.com/watch?v=S06JpVoNaA0&list=PL5-da3qGB5IA6E6ZNXu7dp89_uv8yocmf).  In this case, "wrong" means performing feature selection using the target variable in the test set, outside of the cross validation loop.
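The effect of selecting features outside the cross validation loop can be sketched on pure noise. The example below is an illustration under stated assumptions, not a reproduction of the video: the data is random, the labels are unrelated to the features, and the choices of `SelectKBest` with `f_classif` and `k=20` are arbitrary. Any accuracy well above 50% is therefore an artifact of leakage.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X_noise = rng.normal(size=(50, 2000))   # 2000 pure-noise features
y_noise = np.array([0, 1] * 25)         # labels unrelated to the features

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Wrong: features are selected using the targets of *all* rows,
# including rows that later serve as test folds
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_scores = cross_val_score(LogisticRegression(solver='liblinear'),
                               X_selected, y_noise, cv=cv_demo)

# Right: selection is a pipeline step, refit on each fold's training rows
honest_pipe = make_pipeline(SelectKBest(f_classif, k=20),
                            LogisticRegression(solver='liblinear'))
honest_scores = cross_val_score(honest_pipe, X_noise, y_noise, cv=cv_demo)

print("selection outside CV:", leaky_scores.mean())
print("selection inside CV: ", honest_scores.mean())
```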

Note that encoding data is not the same as transforming data.  For example, encoding "Passenger First Class" as 1, and "Passenger Second Class" as 2, can be performed without looking at the test set at all.
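A small sketch of why a fixed encoding does not leak: the mapping is decided in advance, so applying it to the training and test rows separately gives the same result as applying it to all rows at once. The toy series, the row split, and the variable names below are made up for illustration.

```python
import pandas as pd

# fixed mapping, chosen without looking at any data values
sex_codes = {'female': 1, 'male': 0}

sex = pd.Series(['male', 'female', 'female', 'male', 'female'])
train_rows, test_rows = [0, 1, 2], [3, 4]

# encode everything at once
encoded_all = sex.map(sex_codes)

# encode the train and test rows separately, then recombine
encoded_split = pd.concat([sex.iloc[train_rows].map(sex_codes),
                           sex.iloc[test_rows].map(sex_codes)])

print("Encodings match:", encoded_all.equals(encoded_split))
```

By contrast, a transform such as standardization depends on statistics (mean and standard deviation) estimated from the rows it is fit on, so which rows are included changes the result.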