# Iteration 2: Impute Age Without Test Data Leakage

Jupyter Notebook referenced from my website: <a href="https://sdiehl28.netlify.com/projects/titanic/titanic02/" target="_blank">Software Nirvana: Titanic02</a>

### Where We Are
In the first iteration, we created a simple model and showed that the accuracy was better than the null model.  The null model is the model that predicts the predominant class in all cases.
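
As a minimal sketch of what the null model's accuracy means, the following uses a small made-up label vector (`y_example` is hypothetical and is not the Titanic data). The null model's accuracy is simply the share of the most frequent class.

```python
import pandas as pd

# Hypothetical labels for illustration only; not the Titanic data
y_example = pd.Series([0, 0, 0, 1, 1, 0, 1, 0])

# The null model always predicts the most frequent class
majority_class = y_example.mode()[0]

# Its accuracy equals the fraction of rows that belong to that class
null_accuracy = (y_example == majority_class).mean()
print(majority_class, null_accuracy)
```

Any model worth keeping should beat this number on the same data.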

### What's Next
This notebook will impute the missing values for Age and use it as an additional attribute for prediction.  We will check to see if adding the age variable improved prediction accuracy.

Special attention will be paid to avoiding a common beginner's mistake, which is to look at the test data when performing imputation or other preprocessing steps.  The easiest way to ensure that there is no "test set leakage" is to use a Pipeline.

<a name="outline"></a>
### Outline
1. [Previous Iteration](#previous)
2. [Exploratory Data Analysis](#eda)
3. [Preprocessing](#preprocess)
4. [Model Building](#model)
5. [Model Evaluation](#eval)
6. [Summary](#summary)

### Common Imports and Notebook Setup

In [344]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
%matplotlib inline
sns.set() # enable seaborn style

<a name="previous"></a>
### Previous Iteration
[Back to Outline](#outline)

In [364]:
# read in all the labeled data
all_data = pd.read_csv('../data/train.csv')

In [365]:
# break up the dataframe into X and y
# X is a 2 dimensional "spreadsheet" of values used for prediction
# y is a 1 dimensional vector of target (aka response) values
X = all_data.drop('Survived', axis=1)
y = all_data['Survived']
print('X Shape: ', X.shape)
print('y Shape: ', y.shape)

X Shape:  (891, 11)
y Shape:  (891,)


In [366]:
# As before, removing non-numeric and PassengerId
drop_cols = ['PassengerId', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
X = X.drop(drop_cols, axis=1)
X.dtypes

Pclass      int64
Age       float64
SibSp       int64
Parch       int64
Fare      float64
dtype: object

### Impute Age Value: With and Without Pipeline
The imputation will be performed twice: first manually, to emphasize that we are not looking at the test data, and then with a scikit-learn Imputer and Pipeline.

The reason for using an Imputer and a Pipeline is to make it *easier* to write code that avoids looking at the test data, especially when using cross validation.

To a beginner, it sometimes appears that using an imputer and a pipeline is extra work.  To see that it is not, it is helpful to correctly impute the age value both with and without a pipeline.

### Impute Age without Pipeline

In [367]:
X_save = X.copy()

In [556]:
X['Age'].isnull().sum()

0

In [686]:
random_state = 121212
n_splits = 2 # use low number of splits to illustrate problem, this is not the recommended value
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, KFold
crossvalidation = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
X = X_save.copy()
print(X['Age'].isnull().sum())
print(X['Age'].mean())

177
29.69911764705882


In [687]:
# the following is very similar to the example at the end of the previous notebook
my_scores = np.zeros(n_splits)
i = 0
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression()
for train_idx, test_idx in crossvalidation.split(X):
    # train subset
    X_train = X.iloc[train_idx, :].copy()
    y_train = y[train_idx].copy()
    
    # test subset
    X_test = X.iloc[test_idx, :].copy()
    y_test = y[test_idx].copy()
    
    # find the average age on the train set
    train_age_mean = X_train['Age'].mean()
    print(train_age_mean)
    
    # use this value for *both* the train and test set
    X_train.loc[X_train['Age'].isnull(), 'Age'] = train_age_mean
    X_test.loc[X_test['Age'].isnull(), 'Age'] = train_age_mean # THIS IS THE KEY STEP TO UNDERSTAND
    
    # fit model on train
    lr_model.fit(X_train, y_train)
    
    # predict using model on test
    predictions = lr_model.predict(X_test)
    
    # evaluate accuracy
    my_scores[i] = accuracy_score(y_test, predictions)
    i += 1

29.970521739130433
29.445365853658537


### Impute Age with Pipeline

In [688]:
# Now do the same with an imputer and a Pipeline
# Note that the only column with null values, in X, is the numeric value Age
# SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')

In [689]:
from sklearn.pipeline import make_pipeline
classifier = make_pipeline(imputer, LogisticRegression())

In [690]:
from sklearn.model_selection import cross_val_score, KFold

crossvalidation = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
scores = cross_val_score(classifier, X, y, cv=crossvalidation, scoring='accuracy',
 n_jobs=1)

In [691]:
# Check to see if we got the same scores
print("Scores Match: ", (scores == my_scores).all())
print(scores, scores.mean())

Scores Match:  True
[0.70852018 0.69438202] 0.7014511009220538


### Impute Age: The Wrong Way
This is a common enough mistake to warrant demonstration.

In [692]:
# Imputation using the test set
# This is "test set leakage"!
# The test data was used to compute the mean.  This leads to overstating model accuracy.
X.loc[X['Age'].isnull(), 'Age'] = X['Age'].mean()
print(X['Age'].mean())

29.699117647058763


In [693]:
crossvalidation = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
scores = cross_val_score(LogisticRegression(), X, y, cv=crossvalidation, scoring='accuracy',
 n_jobs=1)

In [695]:
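# Check to see if we got the same scores
print("Scores Match: ", (scores == my_scores).all())
# Compare the mean score with leakage against the correctly computed mean score
print("Test Leakage Score > Correct Score: ", scores.mean() > my_scores.mean())
print(scores, scores.mean())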

Scores Match:  False
Test Leakage Score > Correct Score:  True
[0.70627803 0.69662921] 0.7014536201944879


### Replace Test Set Null Age Values with Mean from *Train* Set
This step is key to understanding how to avoid "test set data leakage".  If we look at the data in the test set, in any way, it no longer acts as a test set.

We must replace null values in the test set with the mean from the *train* set without looking at any of the values in the *test* set.

In [611]:
# Test dataset will be used to evaluate the model's accuracy
# Set the null Age values, in test, to the mean Age value, in train
test_age_null = X_test['Age'].isnull()
X_test.loc[test_age_null, 'Age'] = X_train['Age'].mean()

In [612]:
# double check that the values that were null are now the mean
X_test.loc[test_age_null, 'Age'].head()

Series([], Name: Age, dtype: float64)

### Summary of Imputation
It is critical to understand that we replaced null values in the test set with values computed from the train set.  We never looked at the data in the test set.

Although this section did not use scikit-learn's imputer, the imputer, when used properly, also uses only data from the train set to transform the test set.
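
As a minimal sketch of that fit-on-train, transform-on-test behavior, the following assumes a scikit-learn version that provides `SimpleImputer` (0.20 or later). The arrays `ages_train` and `ages_test` are made-up values for illustration, not the Titanic data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column "Age" data for illustration only
ages_train = np.array([[20.0], [30.0], [np.nan], [40.0]])
ages_test = np.array([[np.nan], [80.0]])

imputer = SimpleImputer(strategy="mean")

# fit() learns the mean from the train data only
imputer.fit(ages_train)

# transform() fills nulls in test with the mean learned from train;
# the value 80.0 in test has no influence on the fill value
filled_test = imputer.transform(ages_test)

print(imputer.statistics_)
print(filled_test.ravel())
```

Calling `fit` on the train data and only `transform` on the test data is what keeps the test values out of the computed mean.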

### Examine Datatypes
This cell was copied from the 1st iteration.  In this notebook we are not going to address datatypes, but keep this note as a reminder (or put it in an issue tracking system).

Based on a review of the data dictionary at [titanic](https://www.kaggle.com/c/titanic/data), and an examination of the values of each column, the following variables need to be converted to categorical:

**Next Iteration: convert the following variables to categorical**
- Pclass
- Sex
- Embarked


In [17]:
# For 2nd Iteration only, ignore all text and categorical variables
ignore_cols = ['Pclass', 'Name', 'Sex', 'Ticket', 'Embarked', 'Cabin']
X_train = X_train.drop(ignore_cols, axis=1)
X_test = X_test.drop(ignore_cols, axis=1)

A natural question to ask is: wouldn't it have been easier to drop these columns prior to creating the train/test split, so we wouldn't have to apply the same operation (drop column) to both subsets?  The answer is "yes", but the proper way to do this, while ensuring no "test data leakage", is by way of pipelines, and that will be discussed in a subsequent notebook.

In [18]:
# Examine the datatypes of each remaining column
X_train.dtypes

PassengerId      int64
Age            float64
SibSp            int64
Parch            int64
Fare           float64
dtype: object

<a name="preprocess"></a>
### Preprocessing
[Back to Outline](#outline)

Preprocessing was done "inline" with the Exploratory Data Analysis above.

<a name="model"></a>
### Model Building
[Back to Outline](#outline)

In [19]:
# Build Model
from sklearn.linear_model import LogisticRegression
base_model = LogisticRegression()
base_model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

<a name="eval"></a>
### Model Evaluation
[Back to Outline](#outline)

In [20]:
# Score will compute the accuracy
base_model.score(X_test, y_test)

0.6753731343283582

67.5% is better than the previous iteration of 65.7%.  This may or may not be statistically significant.  A hypothesis test could be performed to see if it is, but that will not be done here.

Here we will take a simpler approach.  The accuracy improved, so we will tentatively continue to use Age imputation as we iteratively improve (Kaizen) the model building process.
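
For a rough sense of scale, short of a formal hypothesis test, the sketch below compares the improvement with the binomial standard error of the accuracy estimate. It assumes the test set has 268 rows, which is an inference from the printed score (181 / 268 reproduces it) and is not stated in the notebook. A proper comparison of two models scored on the same rows would use a paired test, so treat this only as a back-of-the-envelope check.

```python
import math

# Assumption: 268 test rows, inferred because 181 / 268 matches the printed score
n_test = 268
acc_new = 181 / n_test   # this iteration's accuracy
acc_prev = 0.657         # previous iteration's accuracy, as quoted in the text

# Binomial standard error of an accuracy estimate measured on n_test rows
std_err = math.sqrt(acc_new * (1 - acc_new) / n_test)

diff = acc_new - acc_prev
print(f"difference = {diff:.3f}, standard error = {std_err:.3f}")
```

Under these assumptions the improvement is smaller than one standard error, which is consistent with treating the result as tentative.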

<a name="summary"></a>
### Summary
[Back to Outline](#outline)

In this iteration:
* we showed an example of preprocessing without looking at the test set
* we measured an accuracy of 67.5%, which is better than the previous iteration's 65.7%