# Iteration 2: Impute Age Without Test Data Leakage

Jupyter Notebook referenced from my website: <a href="https://sdiehl28.netlify.com/projects/titanic/titanic02/" target="_blank">Software Nirvana: Titanic02</a>

### Where We Are
In the first iteration, we created a simple model and showed that its accuracy was better than that of the null model.  The null model is the model that always predicts the predominant class.
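
As a minimal sketch of the null model described above, the snippet below computes the accuracy obtained by always predicting the most frequent class. The labels in `y_example` are made up for illustration and are not the Titanic data.

```python
import pandas as pd

# Hypothetical labels for illustration only (not the Titanic data)
y_example = pd.Series([0, 0, 0, 1, 1, 0, 1, 0])

# The null model always predicts the most frequent class
majority_class = y_example.mode()[0]

# Its accuracy is the fraction of labels equal to that class
null_accuracy = (y_example == majority_class).mean()
print(majority_class, null_accuracy)
```

Any real model should beat this number before it is worth refining further.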

### What's Next
This notebook will focus on avoiding a common beginner's mistake, which is to look at the test data when performing imputation or any other preprocessing step.

<a name="outline"></a>
### Outline
1. [Previous Iteration](#previous)
2. [Exploratory Data Analysis](#eda)
3. [Preprocessing](#preprocess)
4. [Model Building](#model)
5. [Model Evaluation](#eval)
6. [Summary](#summary)

### Common Imports and Notebook Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
%matplotlib inline
sns.set() # enable seaborn style

<a name="previous"></a>
### Previous Iteration
[Back to Outline](#outline)

In [3]:
# read in all the labeled data
all_data = pd.read_csv('../data/train.csv')

In [4]:
# break up the dataframe into X and y
# X is a 2 dimensional "spreadsheet" of values used for prediction
# y is a 1 dimensional vector of target (aka response) values
X = all_data.drop('Survived', axis=1)
y = all_data['Survived']
print('X Shape: ', X.shape)
print('y Shape: ', y.shape)

X Shape:  (891, 11)
y Shape:  (891,)


In [5]:
# create the train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.30, stratify=y, random_state=111)

******
### This Section Deals with a Subtle Train/Test Split Bug
The train/test split should create *copies* of a dataframe rather than a view into a dataframe.  However, in the version I am using, a view is returned for X.

In [6]:
print(X_train.is_copy)
print(X_test.is_copy)
print(y_train.is_copy)
print(y_test.is_copy)

<weakref at 0x7f603b9069f8; to 'DataFrame' at 0x7f603c486048>
<weakref at 0x7f603b9069f8; to 'DataFrame' at 0x7f603c486048>
None
None


Above we see a weakref for X_train and X_test, which means pandas treats each as derived from another DataFrame (that is, potentially a view).  That will cause problems with chained assignment later.  See: [Setting With Copy Warning](https://www.dataquest.io/blog/settingwithcopywarning/)

In [7]:
# Shouldn't have to do this, but it ensures we have independent copies
X_train = X_train.copy()
X_test = X_test.copy()
y_train = y_train.copy()
y_test = y_test.copy()

In [8]:
print(X_train.is_copy)
print(X_test.is_copy)
print(y_train.is_copy)
print(y_test.is_copy)

None
None
None
None


This is correct.  If DataFrame.is_copy is None, then we are not using a view into another DataFrame.
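
The `is_copy` attribute may not be available in newer pandas releases, so the following sketch checks independence directly instead: it modifies an explicit copy and confirms the original is unchanged. The `original` and `subset` names and their values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame for illustration only
original = pd.DataFrame({'Age': [22.0, None, 35.0]})

# Take a subset and make an explicit, independent copy
subset = original[original['Age'].notnull()].copy()

# Modifying the copy must not touch the original
subset.loc[:, 'Age'] = 0.0

print(original['Age'].tolist())
print(subset['Age'].tolist())
```

Because `subset` is an explicit copy, assignments to it are safe and never propagate back to `original`.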
******

<a name="eda"></a>
### Exploratory Data Analysis
[Back to Outline](#outline)

One of the first things to check for is **null values**.

In [9]:
# Find the fraction of missing values per column
nrows, ncols = X_train.shape
X_train.isnull().sum() / nrows

PassengerId    0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.219904
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.770465
Embarked       0.003210
dtype: float64

### Null Value Analysis
The following is a reasonable judgement call as to how to proceed based on the observed percentages of null values.
1. The Age attribute has some missing values => impute missing values
2. Most of the Cabin attribute is missing => remove it
3. Very few Embarked records are missing => remove records with missing Embarked value

In this notebook we will perform Age Imputation but leave removing records with missing Embarked values for later.

**Next Iteration:**
- Remove records with missing Embarked value

### Impute Age Value
For instructional purposes, to emphasize not looking at the test data, the imputation will be performed manually rather than with a Scikit Learn Imputer.

In [10]:
# Get a boolean Series where Age is null
train_age_null = X_train['Age'].isnull()

In [11]:
# Verify we did this correctly
X_train.loc[train_age_null, 'Age'].head()

680   NaN
727   NaN
531   NaN
410   NaN
718   NaN
Name: Age, dtype: float64

In [12]:
# Train dataset will be used to train model
# Set the null Age values, in train, to the mean Age value, in train
X_train.loc[train_age_null, 'Age'] = X_train['Age'].mean()

In [13]:
# Observe the mean value
X_train['Age'].mean()

29.39765432098768

In [14]:
# double check that the values that were null are now the mean
X_train.loc[train_age_null, 'Age'].head()

680    29.397654
727    29.397654
531    29.397654
410    29.397654
718    29.397654
Name: Age, dtype: float64

### Replace Test Set Null Age Values with Mean from *Train* Set
This step is key to understanding how to avoid "test set data leakage".  If we look at the data in the test set, in any way, it no longer acts as a test set.

We must replace null values in the test set with the mean from the *train* set without looking at any of the values in the *test* set.

In [15]:
# Test dataset will be used to evaluate the model's accuracy
# Set the null Age values, in test, to the mean Age value, in train
test_age_null = X_test['Age'].isnull()
X_test.loc[test_age_null, 'Age'] = X_train['Age'].mean()

In [16]:
# double check that the values that were null are now the mean
X_test.loc[test_age_null, 'Age'].head()

639    29.397654
507    29.397654
334    29.397654
598    29.397654
643    29.397654
Name: Age, dtype: float64

### Summary of Imputation
It is critical to understand that we replaced null values in the test set with values computed from the train set.  We never looked at the data in the test set.

Although we did not use Scikit Learn's imputer, when used properly, it also uses data from the train set to transform the test set.
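
As a rough sketch of what "used properly" could look like, the snippet below fits an imputer on train data only and then applies it to both sets. It assumes scikit-learn 0.20 or later, where `SimpleImputer` lives in `sklearn.impute` (older releases used a different class), and it uses hypothetical toy frames `toy_train` and `toy_test` rather than the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for X_train and X_test
toy_train = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({'Age': [np.nan, 80.0]})

# fit() learns the mean from the train set only
imputer = SimpleImputer(strategy='mean')
imputer.fit(toy_train[['Age']])

# transform() reuses the train mean for both sets
toy_train[['Age']] = imputer.transform(toy_train[['Age']])
toy_test[['Age']] = imputer.transform(toy_test[['Age']])

print(imputer.statistics_)
print(toy_test)
```

The mistake to avoid is calling `fit` (or `fit_transform`) on the test set or on the full dataset, since the learned mean would then depend on test values.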

### Examine Datatypes
This cell was copied from the 1st iteration.  In this notebook we are not going to address datatypes, but keep this note as a reminder (or put it in an issue tracking system).

Based on a review of the data dictionary at [titanic](https://www.kaggle.com/c/titanic/data), and an examination of the values of each column, the following variables need to be converted to categorical:

**Next Iteration: convert the following variables to categorical**
- Pclass
- Sex
- Embarked


In [17]:
# For 2nd Iteration only, ignore all text and categorical variables
drop_cols = ['Pclass', 'Name', 'Sex', 'Ticket', 'Embarked', 'Cabin']
X_train = X_train.drop(drop_cols, axis=1)
X_test = X_test.drop(drop_cols, axis=1)

A natural question to ask is: wouldn't it have been easier to drop these columns prior to creating the train/test split, so we wouldn't have to apply the same operation (drop column) to both sets?  The answer is "yes", but the proper way to do this, while ensuring no "test data leakage", is by way of pipelines, which will be discussed in a subsequent notebook.
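
To give a flavor of the pipeline idea, here is a minimal sketch that is not necessarily how the later notebook does it. It chains mean imputation and logistic regression so that every preprocessing statistic is learned during `fit` on the train data. The toy arrays are hypothetical, and column selection (which could be handled with something like scikit-learn's `ColumnTransformer`) is not shown.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: one feature with a missing value
X_toy_train = np.array([[1.0], [2.0], [np.nan], [8.0], [9.0], [10.0]])
y_toy_train = np.array([0, 0, 0, 1, 1, 1])
X_toy_test = np.array([[np.nan], [9.5]])

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),  # mean learned in fit()
    ('model', LogisticRegression()),
])

# fit() sees only train data; predict() reuses the train mean
pipe.fit(X_toy_train, y_toy_train)
preds = pipe.predict(X_toy_test)

print(pipe.named_steps['impute'].statistics_)
print(preds)
```

Since the pipeline is fit once on the train set, the same transformations are applied to the test set without any test values influencing them.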

In [18]:
# Examine the datatypes of each remaining column
X_train.dtypes

PassengerId      int64
Age            float64
SibSp            int64
Parch            int64
Fare           float64
dtype: object

<a name="preprocess"></a>
### Preprocessing
[Back to Outline](#outline)

Preprocessing was done "inline" with the Exploratory Data Analysis above.

<a name="model"></a>
### Model Building
[Back to Outline](#outline)

In [19]:
# Build Model
from sklearn.linear_model import LogisticRegression
base_model = LogisticRegression()
base_model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

<a name="eval"></a>
### Model Evaluation
[Back to Outline](#outline)

In [20]:
# Score will compute the accuracy
base_model.score(X_test, y_test)

0.6753731343283582

67.5% is better than the previous iteration of 65.7%.  This may or may not be statistically significant.  A hypothesis test could be performed to see if it is, but that will not be done here.

Here we will take a simpler approach.  The accuracy improved, so we will tentatively continue to use Age Imputation as we iteratively improve (Kaizen) the model building process.
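
For reference, a rough version of the hypothesis test mentioned above might look like the following two-proportion z-test. The counts are inferred rather than taken from the notebook output: a 30% split of 891 rows gives a 268-row test set, so 67.5% corresponds to 181 correct predictions and 65.7% to 176. The test also treats the two accuracies as independent samples, which is only an approximation because both models were scored on the same rows; a paired test such as McNemar's would be more appropriate.

```python
import math

# Counts inferred from the reported accuracies (assumption):
# 181/268 = 67.5% and 176/268 = 65.7%
n_test = 268
correct_new = 181
correct_old = 176

p_new = correct_new / n_test
p_old = correct_old / n_test
p_pooled = (correct_new + correct_old) / (2 * n_test)

# Standard error of the difference, assuming equal true accuracy
std_err = math.sqrt(p_pooled * (1 - p_pooled) * (2 / n_test))
z = (p_new - p_old) / std_err

# Two-sided p-value from the standard normal distribution
normal_cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
p_value = 2 * (1 - normal_cdf)

print(round(z, 2), round(p_value, 2))
```

Under these assumptions the z statistic is well below the usual 1.96 threshold, which is consistent with treating the improvement as tentative.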

<a name="summary"></a>
### Summary
[Back to Outline](#outline)

In this iteration:
* we showed an example of preprocessing without looking at the test set
* we measured an accuracy of 67.5%, which is better than the previous iteration's 65.7%