# Iterative Model Dev 2 <br/>*Imputation and Cross Validation*

Jupyter Notebook referenced from my website:
[Software Nirvana: Iterative Dev 2](https://sdiehl28.netlify.com/2018/03/titanic-2/)

### Where We Are
In the first iteration, we created a simple model and showed that the accuracy was better than the null model.  The null model is the model that predicts the predominant class in all cases.
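
As a reminder of what the null model does, here is a minimal sketch using scikit-learn's `DummyClassifier`. The toy labels and the names `y_toy`, `X_toy` and `null_model` are made up for illustration and are not the Titanic data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels: 6 of the 10 rows belong to class 0
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
X_toy = np.zeros((len(y_toy), 1))  # the null model ignores the features

# always predict the predominant class seen during fit()
null_model = DummyClassifier(strategy="most_frequent")
null_model.fit(X_toy, y_toy)

# accuracy equals the share of the predominant class
null_accuracy = null_model.score(X_toy, y_toy)
print(null_accuracy)
```

Any real model should beat this number before it is worth refining further.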

### What's Next
<a href="https://en.wikipedia.org/wiki/Imputation_(statistics)">Imputation on Wikipedia</a>

This notebook will impute the missing values for Age and use Age as an additional attribute for prediction.  We will also check to see if adding the Age variable improved prediction accuracy.

Special attention will be paid to avoiding a common beginner's mistake: looking at the test data when performing imputation or other preprocessing steps.  The easiest way to ensure there is no "test set leakage" is to use a Pipeline.

### Common Imports and Notebook Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
%matplotlib inline
sns.set() # enable seaborn style

### Previous Iteration

In [2]:
# read in all the labeled data
all_data = pd.read_csv('../data/train.csv')

# break up the dataframe into X and y
X = all_data.drop('Survived', axis=1)
y = all_data['Survived']

# As before, remove all non-numeric fields and PassengerId
drop_cols = ['PassengerId', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
X = X.drop(drop_cols, axis=1)
X.dtypes

Pclass      int64
Age       float64
SibSp       int64
Parch       int64
Fare      float64
dtype: object

### Impute Age: Cross Validation the Right Way
This will be performed manually to emphasize how preprocessing operations work on the folds in a cross validation.  The key is that the held-out data, aka the test data, can never be looked at.

This will also be performed with an Imputer and a Pipeline to show a more concise workflow.

To a beginner, it can appear that using an Imputer and a Pipeline is a lot of extra work.  Why not just impute the Age before cross validation and be done with it?  If you were to look at the ["Kernels"](https://www.kaggle.com/c/titanic/kernels?sortBy=votes&group=everyone&pageSize=20&competitionId=3136) on Kaggle posted for the Titanic dataset, you would see that at least half the people do just that.  But this is bad practice and can lead to an estimate of model accuracy that is too high.  Using information from the test data while building your model is called "data leakage".

### Impute Age without using Pipeline
A nice introduction to overfitting, train/test split, and cross validation in Python is: [Train/Test Split and Cross Validation](https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6)

K-Fold Cross Validation *is* train/test split, but performed K times to get a more accurate estimate of model accuracy.
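
To make the "K train/test splits" idea concrete, the following sketch prints the row positions that `KFold` assigns to each split. The 10-row array `X_demo` and the other names are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # 10 hypothetical rows
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

# each split is one train/test partition of the row positions
folds = list(kf_demo.split(X_demo))
for fold_number, (demo_train, demo_test) in enumerate(folds):
    print(fold_number, "train:", demo_train, "test:", demo_test)

# collect the test positions of all folds
all_test_positions = np.sort(np.concatenate([te for _, te in folds]))
print(all_test_positions)
```

Every row is used for testing exactly once and for training in the other K-1 splits.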

We train the model on the train data and we test the accuracy of the model on the test data.

With a train/test split, we have missing Age values in both the train and test sets.  

For the train set, we replace the missing values with the average Age of the train set.  

For the test set, we are not permitted to look at its data as the purpose of the test set is for model evaluation.  If we look at its data, we have made the mistake of "data leakage".  The test data has "leaked" into our training data as part of our model building process.  This may cause our estimate of model accuracy to be too high.  For the test set, we replace the missing values with the average Age of the *train* set. 
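
For a single train/test split, the rule can be sketched as follows. The ages are toy values and the names `train`, `test` and `train_mean` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages, not the Titanic data
train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"Age": [np.nan, 80.0]})

# the statistic is computed from the train set only
train_mean = train["Age"].mean()  # (20 + 30 + 40) / 3

# the same train statistic fills the gaps in both sets
train_filled = train.fillna({"Age": train_mean})
test_filled = test.fillna({"Age": train_mean})  # train mean, not test mean

print(train_filled)
print(test_filled)
```

Note that the test set's own value of 80 plays no part in the replacement value.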

All of the above holds true for Cross Validation as well, as Cross Validation is just multiple train/test splits.

For each of the K folds, we will compute the mean Age value in the train set, and use that value to replace the missing values in both the train and test sets.

In [3]:
# Set random_state to get the same folds each time we call KFold()
random_state = 121212

# use low number of splits for illustration, 5 or 10 is the recommended value
n_splits = 2

# create K folds
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, KFold
crossvalidation = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)

In [4]:
# use the K folds to impute Age and compute the accuracy score
my_scores = np.zeros(n_splits)
i = 0
lr_model = LogisticRegression()
for train_idx, test_idx in crossvalidation.split(X):
    # train subset (KFold yields row positions, so index y with .iloc)
    X_train = X.iloc[train_idx, :].copy()
    y_train = y.iloc[train_idx].copy()
    
    # test subset
    X_test = X.iloc[test_idx, :].copy()
    y_test = y.iloc[test_idx].copy()
    
    # find the average age on the train set
    train_age_mean = X_train['Age'].mean()
    
    # use this value for *both* the train and test set
    X_train.loc[X_train['Age'].isnull(), 'Age'] = train_age_mean
    X_test.loc[X_test['Age'].isnull(), 'Age'] = train_age_mean # Key Concept!
    
    # fit model on train
    lr_model.fit(X_train, y_train)
    
    # predict using model on test
    predictions = lr_model.predict(X_test)
    
    # evaluate accuracy
    my_scores[i] = accuracy_score(y_test, predictions)
    i += 1

### Aside: Pandas copy
In the above, we had to use .copy() for our train and test sets.  This is critical to avoiding the Pandas warning: SettingWithCopyWarning.

The imputation requires us to modify both the train and test sets for X.  However, Pandas treats the result of X.iloc[] as possibly being a *view* into X rather than an independent copy of a subset of X.  If we try to modify data through it, we will get: SettingWithCopyWarning.

In most cases this warning means your code will not do what you intended.  Therefore, you should always write code that does not produce this Pandas warning.

It can be difficult to discover why this warning was issued.  One way to track this down is to print out the .is_copy member of your dataframe: a weak reference to the parent dataframe means Pandas considers it derived from the parent, while None means it is independent.  (The .is_copy attribute was deprecated in later Pandas releases, so this check applies to the version used in this notebook.)  If you get this warning when trying to modify such a derived dataframe, get an independent copy of your data using .copy() and the warning will go away.

In [5]:
print(X.iloc[train_idx, :].is_copy) # view into dataframe
print(X.iloc[train_idx, :].copy().is_copy) # independent copy of dataframe

<weakref at 0x7fbcc8c5c4a8; to 'DataFrame' at 0x7fbccb7c0be0>
None
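
A small sketch of the safe pattern, assuming a made-up dataframe `df` rather than the Titanic data: take an explicit `.copy()` of the subset, modify it, and confirm that the original is untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing ages
df = pd.DataFrame({"Age": [22.0, np.nan, 35.0, np.nan]})

# independent copy of the first two rows
subset = df.iloc[[0, 1], :].copy()

# modifying the copy is unambiguous, so no warning is expected
subset.loc[subset["Age"].isnull(), "Age"] = 99.0

print(subset)
print(df)  # the original still has its missing values
```

Because `subset` owns its data, the assignment can only affect `subset`, which is what we want for the per-fold train and test sets.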


### Impute Age with Pipeline
When an imputer is placed in a Pipeline, Scikit Learn correctly uses the mean of the train set as the replacement value for missing values in both the train and test sets of each fold.  This all happens behind the scenes.  The following produces exactly the same result as the for-loop above, but requires much less code.

In [6]:
# Setup: same as above
random_state = 121212
n_splits = 2
crossvalidation = KFold(n_splits=n_splits, shuffle=True, 
                        random_state=random_state)

In [7]:
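# Use an Imputer and a Pipeline
# Note: Age is the only column in X with null values
# SimpleImputer replaces the Imputer class from sklearn.preprocessing,
# which was deprecated in scikit-learn 0.20 and later removed
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')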

from sklearn.pipeline import make_pipeline
classifier = make_pipeline(imputer, LogisticRegression())

# cross_val_score() will properly compute 
# imputation and score per fold
scores = cross_val_score(classifier, X, y, cv=crossvalidation, 
                         scoring='accuracy', n_jobs=1)

# Check to see that we got the same scores 
# as in the above for-loop
print("Scores Match: ", (scores == my_scores).all())
print(scores, scores.mean())

Scores Match:  True
[0.70852018 0.69438202] 0.7014511009220538


### Both Methods Produce the Same Result
The for-loop over the K train/test splits produces the same scores as using an Imputer and LogisticRegression in a Pipeline passed to cross_val_score().
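
What the Pipeline does behind the scenes in each fold can be sketched by calling the imputer's `fit()` and `transform()` by hand. This assumes a scikit-learn version that provides `sklearn.impute.SimpleImputer`, the successor to the older `Imputer` class; the toy arrays and the name `imputer_demo` are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data (Age), not the Titanic data
age_train = np.array([[20.0], [30.0], [np.nan], [40.0]])
age_test = np.array([[np.nan], [80.0]])

imputer_demo = SimpleImputer(strategy="mean")
imputer_demo.fit(age_train)  # learns the mean from the train rows only
learned_mean = imputer_demo.statistics_[0]

# transform() reuses the learned train mean for the test rows
age_test_filled = imputer_demo.transform(age_test)

print(learned_mean)
print(age_test_filled)
```

Inside `cross_val_score()`, the Pipeline is refit on the train portion of every fold, so each fold gets its own train mean.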

### Impute Age: Cross Validation The Wrong Way
In my review of Kaggle Kernels published for the Titanic dataset, about half of the ["Kernels"](https://www.kaggle.com/c/titanic/kernels?sortBy=votes&group=everyone&pageSize=20&competitionId=3136) imputed the Age value over the entire data set.  This is a potentially serious mistake.

In [8]:
# Prior to Cross Validation:
#   impute the missing values as the mean of all the data
# Don't do this!  This is "data leakage"!
# Replace all null Age values with the mean of all Age Values
X.loc[X['Age'].isnull(), 'Age'] = X['Age'].mean()

In [9]:
# Setup: same as above
random_state = 121212
n_splits = 2
crossvalidation = KFold(n_splits=n_splits, shuffle=True, 
                        random_state=random_state)

In [10]:
# Use cross_val_score()
wrong_scores = cross_val_score(LogisticRegression(), X, y, 
                               cv=crossvalidation,
                               scoring='accuracy', n_jobs=1)

In [11]:
# We do *not* get the same scores as above
# because we did the cross validation wrong!
# Note: the second print shows the correct Pipeline scores again, for reference
print("Scores Match: ", (scores == wrong_scores).all())
print(scores, scores.mean())

Scores Match:  False
[0.70852018 0.69438202] 0.7014511009220538


Imputing all the missing values, over both the train and test sets, prior to performing cross validation produced different results!

The use of the Imputer and estimator in the Pipeline, or the hand coded for-loop over the K folds, produced the correct result.

For this dataset, the difference is very small: I had to use n_splits=2 and try a few different random_state values to get folds that illustrate it.

### Cross Validation The Wrong Way: Discussion
**What was wrong:** Used both train and test data for imputation. 

**What may happen:** Estimate of model accuracy may be too high.

**Importance in Practice:** With imputation, this is often not much of an issue if the amount of data from which to perform the imputation is large. Nevertheless, there is no point in making a potentially serious error when using Pipelines makes it easy to do this correctly.
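
The effect of data size can be illustrated with a small simulation, which is only a rough sketch: it draws made-up "ages" from a normal distribution (an assumption, not the Titanic data) and measures how far the mean of all rows is from the mean of the train half. The helper `mean_gap` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_gap(n_rows, n_repeats=200):
    """Average |mean(all rows) - mean(train half)| over simulated datasets."""
    gaps = []
    for _ in range(n_repeats):
        ages = rng.normal(loc=30.0, scale=14.0, size=n_rows)
        train_half = ages[: n_rows // 2]
        gaps.append(abs(ages.mean() - train_half.mean()))
    return float(np.mean(gaps))

gap_small = mean_gap(20)       # small dataset
gap_large = mean_gap(20_000)   # large dataset
print(gap_small, gap_large)
```

With many rows, the leaked mean and the train-only mean are nearly identical, which is why the damage from this particular mistake tends to shrink as the data grows.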

**Common Real Life Situation with Serious Consequences:**
If you have a lot of variables, and you decide, prior to performing cross validation, that you will remove some of the variables based on some statistic of the data, such as too little correlation with the target variable, then you will almost certainly overfit your model.  That is, your report of model accuracy will be too high.
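
This effect can be reproduced on pure noise. The sketch below is an illustration under stated assumptions: the data is random, the labels are unrelated to the features, and `SelectKBest` with `f_classif` stands in for "some statistic of the data". All names are made up for this example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(50, 5000))  # 5000 pure-noise features
y_noise = np.tile([0, 1], 25)          # labels unrelated to X_noise

# Wrong: select the 20 "best" features using all rows, then cross validate
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(LogisticRegression(max_iter=1000),
                            X_selected, y_noise, cv=5).mean()

# Right: the selection step is refit on the train portion of each fold
honest_pipe = make_pipeline(SelectKBest(f_classif, k=20),
                            LogisticRegression(max_iter=1000))
honest_acc = cross_val_score(honest_pipe, X_noise, y_noise, cv=5).mean()

print(leaky_acc, honest_acc)
```

Since the features carry no signal, an honest estimate should hover around chance, while the leaky estimate looks far better than it has any right to.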

**Great Explanation and Story by Robert Tibshirani:**
Robert Tibshirani, in the YouTube video [Cross Validation: Right and Wrong](https://www.youtube.com/watch?v=S06JpVoNaA0&list=PL5-da3qGB5IA6E6ZNXu7dp89_uv8yocmf), explains the above in detail and tells a memorable story about a Ph.D. candidate who, in an oral dissertation presentation, had filtered away variables prior to performing cross validation, and the serious effect this had on his medical research.

### Summary

In this iteration we:
* showed the right and wrong way to perform cross validation
* showed that using Pipelines with Cross Validation improves the quality of the software by making it easy to perform cross validation correctly and concisely
* added the Age variable to our model and used its mean to impute missing values
* showed that our estimate of model accuracy increased from 68.5% to 70.1%