# Target Leakage in Machine Learning

© Yuriy Guts, 2018

## Example 02: Data Preparation Stage

In this example, we explore how preprocessing the dataset before partitioning it can leak information about the test features into the training pipeline.
As a result, the model gets slightly better test scores than with the more robust approach, where we derive the preprocessing parameters from the training subset only and then use them to transform the test set.

**Note**: This is a toy example on a rather small dataset, so the impact won't be large, but it is visible enough to illustrate the point.
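
As a rough preview of the robust approach, the sketch below wraps the preprocessing and the model in a scikit-learn `Pipeline`, so that the imputer and scaler are fitted on the training rows only. It is a minimal illustration under stated assumptions, not the code used in this notebook: the data is synthetic, and `SimpleImputer` (the replacement for `Imputer` in newer scikit-learn versions) is used instead of `Imputer`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the Titanic set).
rng = np.random.RandomState(0)
X = rng.normal(loc=30.0, scale=10.0, size=(200, 2))
X[rng.rand(200) < 0.2, 0] = np.nan   # knock out roughly 20% of column 0
y = (rng.rand(200) < 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The imputer and scaler are fitted inside fit(), on the training rows only.
pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=3),
)
pipeline.fit(X_train, y_train)

# predict_proba() reuses the training-set statistics to transform X_test.
test_proba = pipeline.predict_proba(X_test)[:, 1]

# The imputation value equals the training-set mean of the column.
train_mean = np.nanmean(X_train[:, 0])
imputer_mean = pipeline.named_steps['simpleimputer'].statistics_[0]
print(train_mean, imputer_mean)
```

Because the preprocessing steps live inside the pipeline, there is no opportunity to fit them on the test rows by accident.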

In [1]:
import numpy as np
import pandas as pd

In [2]:
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
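# Note: `Imputer` was deprecated in scikit-learn 0.20 and removed in 0.22.
# On newer versions, `sklearn.impute.SimpleImputer` is the replacement
# (it expects `missing_values=np.nan` rather than the string 'NaN').
from sklearn.preprocessing import StandardScaler, Imputer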

### Read Data

Let's read the [Titanic](https://www.kaggle.com/c/titanic/data) dataset.

In [4]:
df = pd.read_csv('data/titanic-train.csv')

In [5]:
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


We'll be using KNN, which works on numeric features, so let's encode the two categorical variables (sex and port of embarkation) as 0/1 indicators and add a flag for missing age values.

In [6]:
df['IsFemale'] = df['Sex'].map({'male': 0, 'female': 1})
df['IsAgeMissing'] = df['Age'].isnull()
df[['EmbarkedC', 'EmbarkedQ', 'EmbarkedS']] = pd.get_dummies(df['Embarked'])
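
To make the encoding step easier to inspect in isolation, here is a small sketch on a toy column. The values are hypothetical, and the `prefix`, `prefix_sep`, and `dtype` arguments are one possible way to obtain the same column names directly; they are not what the cell above does.

```python
import pandas as pd

# Toy column (hypothetical values) mimicking 'Embarked', with one missing entry.
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', None, 'S']})

# One indicator column per observed category.
# dtype=int yields 0/1 integers instead of booleans.
dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked', prefix_sep='', dtype=int)

encoded = pd.concat([toy, dummies], axis=1)
print(encoded)
```

Note that a missing port produces a row of all zeros, since `get_dummies` does not create an indicator for missing values by default.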

In [7]:
df.head()

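| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | IsFemale | IsAgeMissing | EmbarkedC | EmbarkedQ | EmbarkedS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 0 | False | 0 | 0 | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | False | 1 | 0 | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 1 | False | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 1 | False | 0 | 0 | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 0 | False | 0 | 0 | 1 |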


Let's keep only the simple features that are likely to carry the most signal.

In [8]:
df_X = df[['Pclass', 'IsFemale', 'Age', 'IsAgeMissing', 'SibSp', 'Parch', 'Fare', 'EmbarkedC', 'EmbarkedQ', 'EmbarkedS']].copy()
df_y = df['Survived'].copy()

In [9]:
df_X[df_X['Age'].isnull()]

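| | Pclass | IsFemale | Age | IsAgeMissing | SibSp | Parch | Fare | EmbarkedC | EmbarkedQ | EmbarkedS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 3 | 0 | NaN | True | 0 | 0 | 8.4583 | 0 | 1 | 0 |
| 17 | 2 | 0 | NaN | True | 0 | 0 | 13.0000 | 0 | 0 | 1 |
| 19 | 3 | 1 | NaN | True | 0 | 0 | 7.2250 | 1 | 0 | 0 |
| 26 | 3 | 0 | NaN | True | 0 | 0 | 7.2250 | 1 | 0 | 0 |
| 28 | 3 | 1 | NaN | True | 0 | 0 | 7.8792 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 859 | 3 | 0 | NaN | True | 0 | 0 | 7.2292 | 1 | 0 | 0 |
| 863 | 3 | 1 | NaN | True | 8 | 2 | 69.5500 | 0 | 0 | 1 |
| 868 | 3 | 0 | NaN | True | 0 | 0 | 9.5000 | 0 | 0 | 1 |
| 878 | 3 | 0 | NaN | True | 0 | 0 | 7.8958 | 0 | 0 | 1 |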


In [10]:
df_X['Age'].mean()

29.69911764705882

### Preprocess Data

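**MISTAKE INCOMING!** We will now transform the entire dataset before partitioning it into train and test. The imputation mean and the scaling parameters are then computed from all rows, including the ones that will later serve as the test set. This leaks test information into training, and the effect grows when the features drift significantly between the training and evaluation folds. In a real setting, we do not know the distribution of the test features at prediction time.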

In [11]:
mean_imputer = Imputer(missing_values='NaN', strategy='mean')
scaler = StandardScaler()



In [12]:
df_X['Age'] = mean_imputer.fit_transform(df_X[['Age']])
df_X[['Age', 'Fare']] = scaler.fit_transform(df_X[['Age', 'Fare']])

In [13]:
df_X.head()

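| | Pclass | IsFemale | Age | IsAgeMissing | SibSp | Parch | Fare | EmbarkedC | EmbarkedQ | EmbarkedS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | 0 | -0.592481 | False | 1 | 0 | -0.502445 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0.638789 | False | 1 | 0 | 0.786845 | 1 | 0 | 0 |
| 2 | 3 | 1 | -0.284663 | False | 0 | 0 | -0.488854 | 0 | 0 | 1 |
| 3 | 1 | 1 | 0.407926 | False | 1 | 0 | 0.42073 | 0 | 0 | 1 |
| 4 | 3 | 0 | 0.407926 | False | 0 | 0 | -0.486337 | 0 | 0 | 1 |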


In [14]:
print(mean_imputer.statistics_)

[29.69911765]


In [15]:
print(scaler.mean_)
print(scaler.scale_)


[29.69911765 32.20420797]
[12.99471687 49.66553444]
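
The statistics printed above were computed from all rows, including those that will end up in the test set. The following sketch, on synthetic data with hypothetical variable names, shows why that matters: a mean fitted on all rows is a size-weighted blend of the train and test means, so it necessarily carries information about the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, skewed "fare-like" feature (hypothetical data).
rng = np.random.RandomState(42)
values = rng.exponential(scale=30.0, size=(300, 1))

train, test = train_test_split(values, test_size=0.3, random_state=42)

scaler_all = StandardScaler().fit(values)    # sees the test rows too (leaky)
scaler_train = StandardScaler().fit(train)   # sees the training rows only

print('fitted on all rows:  ', scaler_all.mean_, scaler_all.scale_)
print('fitted on train rows:', scaler_train.mean_, scaler_train.scale_)

# The full-data mean is a size-weighted blend of the train and test means.
blend = (len(train) * train.mean() + len(test) * test.mean()) / len(values)
print('blend of train/test means:', blend)
```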


Only now will we partition and train the model.

In [16]:
X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.3, random_state=12345)

In [17]:
print('X_train:', X_train.shape)
print('X_test: ', X_test.shape)
print('y_train:', y_train.shape)
print('y_test: ', y_test.shape)

X_train: (623, 10)
X_test:  (268, 10)
y_train: (623,)
y_test:  (268,)


### Train and Evaluate Model

In [18]:
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

In [19]:
y_test_pred = model.predict_proba(X_test)[:, -1]

In [20]:
log_loss_before = log_loss(y_test, y_test_pred)
auc_before = roc_auc_score(y_test, y_test_pred)

In [21]:
print('Test LogLoss:', log_loss_before)
print('Test AUC:    ', auc_before)

Test LogLoss: 4.03039167749059
Test AUC:     0.7556662087912088


## Removing Leakage

### Read Data

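Before starting over, let's check the mean of the globally scaled `Age` in each partition of the leaky run. Neither is zero, because the scaler was centered on the full dataset rather than on the training subset.

In [22]: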
print(X_train['Age'].mean())
print(X_test['Age'].mean())

-0.045685896518762134
0.10620266242980987


Let's repeat our initial dataset preparation (missing-value indicator, one-hot encoding, feature selection).

In [23]:
df = pd.read_csv('data/titanic-train.csv')
df['IsFemale'] = df['Sex'].map({'male': 0, 'female': 1})
df['IsAgeMissing'] = df['Age'].isnull()
df[['EmbarkedC', 'EmbarkedQ', 'EmbarkedS']] = pd.get_dummies(df['Embarked'])
df_X = df[['Pclass', 'IsFemale', 'Age', 'IsAgeMissing', 'SibSp', 'Parch', 'Fare', 'EmbarkedC', 'EmbarkedQ', 'EmbarkedS']].copy()
df_y = df['Survived'].copy()

But now we'll partition first, then figure out the preprocessing.

In [24]:
X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.3, random_state=12345)

In [25]:
X_train = X_train.copy()
X_test = X_test.copy()
y_train = y_train.copy()
y_test = y_test.copy()

Learn the imputation and scaling parameters only on the training set...

In [26]:
mean_imputer = Imputer(missing_values='NaN', strategy='mean')
scaler = StandardScaler()



In [27]:
X_train['Age'] = mean_imputer.fit_transform(X_train[['Age']])
X_train[['Age', 'Fare']] = scaler.fit_transform(X_train[['Age', 'Fare']])

In [28]:
print(mean_imputer.statistics_)

[28.95791583]


In [29]:
print(scaler.mean_)
print(scaler.scale_)

[28.95791583 31.82662424]
[12.86442653 45.07671825]


...and use them to **transform** the test set.

In [30]:
X_test['Age'] = mean_imputer.transform(X_test[['Age']])
X_test[['Age', 'Fare']] = scaler.transform(X_test[['Age', 'Fare']])
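
The fit-on-train, transform-on-test pattern can be checked by hand on a tiny example. The values below are hypothetical, and `SimpleImputer` stands in for `Imputer` on the assumption of a newer scikit-learn version.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Tiny hand-made example (hypothetical values).
train = np.array([[20.0], [30.0], [np.nan], [40.0]])
test = np.array([[np.nan], [50.0]])

imputer = SimpleImputer(strategy='mean')
scaler = StandardScaler()

# fit_transform() learns the parameters from the training rows...
train_out = scaler.fit_transform(imputer.fit_transform(train))

# ...while transform() only applies them; nothing is re-estimated on test.
test_out = scaler.transform(imputer.transform(test))

print(imputer.statistics_)   # training mean of the observed values: [30.]
print(test_out)
```

The missing test value is filled with the training mean (30), not with anything derived from the test rows, so it maps to 0 after scaling.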

### Train and Evaluate Model

In [31]:
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

In [32]:
y_test_pred = model.predict_proba(X_test)[:, -1]

In [33]:
log_loss_after = log_loss(y_test, y_test_pred)
auc_after = roc_auc_score(y_test, y_test_pred)

In [34]:
print('Test LogLoss:', log_loss_after)
print('Test AUC:    ', auc_after)

Test LogLoss: 4.033417536506322
Test AUC:     0.7531478937728937


## Evaluate the Impact of Leakage

In [35]:
print('LogLoss difference:', log_loss_after - log_loss_before)
print('AUC difference:    ', auc_after - auc_before)

LogLoss difference: 0.0030258590157323795
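AUC difference:     -0.0025183150183151204

With the leakage removed, the log loss is about 0.003 higher and the AUC about 0.0025 lower: the scores obtained with the leaky preprocessing were slightly optimistic.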
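
A difference this small, measured on a single split, can be hard to tell apart from noise. One possible way to get a steadier picture is to repeat the same comparison over several random splits. The sketch below does this on synthetic data; the helper `split_auc`, the data, and the choice of 20 seeds are all hypothetical, and `SimpleImputer` is assumed in place of `Imputer`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the Titanic set).
rng = np.random.RandomState(1)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)
X[rng.rand(300) < 0.1, 1] = np.nan   # roughly 10% missing values in column 1


def split_auc(features, labels, seed, model):
    """Split the data, fit `model` on the training part, return the test AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, random_state=seed)
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])


# Leaky variant: imputer and scaler are fitted on all rows before splitting.
X_leaky = StandardScaler().fit_transform(
    SimpleImputer(strategy='mean').fit_transform(X))

auc_diffs = []
for seed in range(20):
    auc_leaky = split_auc(X_leaky, y, seed, KNeighborsClassifier(n_neighbors=3))
    # Clean variant: preprocessing is fitted on the training rows only.
    clean = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler(),
                          KNeighborsClassifier(n_neighbors=3))
    auc_clean = split_auc(X, y, seed, clean)
    auc_diffs.append(auc_clean - auc_leaky)

print('mean AUC difference (clean - leaky):', np.mean(auc_diffs))
print('std of the difference:              ', np.std(auc_diffs))
```

Using the same seed for both variants keeps the row assignment identical, so each difference reflects the preprocessing alone.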


## Example: How StandardScaler from scikit-learn Works

In [42]:
data = pd.DataFrame([[0,0],[0,0],[1,1],[1,1]], columns=['col1','col2'])

In [44]:
s = StandardScaler()
data_fit = s.fit(data)

In [49]:
print('mean of the data',data_fit.mean_)
print('scale(std) of the data',data_fit.scale_)

mean of the data [0.5 0.5]
scale(std) of the data [0.5 0.5]


In [51]:
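# Reuse the column names seen during fit so the input matches the fitted features.
transformed_data = s.transform(pd.DataFrame([[2, 2]], columns=['col1', 'col2']))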

In [53]:
transformed_data

array([[3., 3.]])

In [56]:
data = s.transform(data)
data

array([[-1., -1.],
       [-1., -1.],
       [ 1.,  1.],
       [ 1.,  1.]])
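
The outputs above follow from the standardization formula `z = (x - mean) / scale`, where `mean` and `scale` are estimated during `fit`. As a quick sanity check, the sketch below reproduces the result for the point `[2, 2]` by hand with NumPy (the variable names are illustrative).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]])

mean = data.mean(axis=0)   # [0.5 0.5]
std = data.std(axis=0)     # population std (ddof=0): [0.5 0.5]

new_point = np.array([[2.0, 2.0]])
manual = (new_point - mean) / std
print(manual)              # [[3. 3.]]

# The same parameters and result come out of StandardScaler.
scaler = StandardScaler().fit(data)
print(scaler.transform(new_point))
```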