## Arbitrary value imputation

Replacing NA with arbitrary values is appropriate when there are reasons to believe that the NA are not missing at random. In situations like this, we do not want to replace them with the median or the mean, which would make the NA look like the majority of our observations.

Instead, we want to flag them. We want to capture the missingness somehow.

In previous lectures we saw 2 methods to do this:

1) adding an additional binary variable to indicate whether the value is missing (1) or not (0)

2) replacing the NA by a value at a far end of the distribution
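
As a quick reminder, the two methods above can be sketched as follows. This is a minimal illustration on made-up data, not the Titanic dataset; the column names `Age_NA` and `Age_far_end` are hypothetical, and "mean + 3 standard deviations" is only one possible convention for "far end of the distribution".

```python
import numpy as np
import pandas as pd

# toy data, made up for illustration
toy = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0]})

# method 1: binary indicator, 1 where Age is missing and 0 otherwise
toy["Age_NA"] = toy["Age"].isnull().astype(int)

# method 2: replace NA with a value at the far end of the distribution
# (mean + 3 standard deviations is an assumed convention)
far_end = toy["Age"].mean() + 3 * toy["Age"].std()
toy["Age_far_end"] = toy["Age"].fillna(far_end)

print(toy)
```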

Here, I suggest an alternative to option 2, which I have seen in several Kaggle competitions. It consists of replacing the NA with an arbitrary value: any value of your choosing, but ideally different from the median/mean/mode, and outside the normal range of the variable.

The difficulty lies in deciding which arbitrary value to choose.
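
One way to guard against a poor choice is to check that the candidate value does not fall inside the observed range of the variable before using it. The sketch below uses a hypothetical helper, `fill_with_arbitrary`, and toy data. The range check is my own assumption about what "not within the normal values" could mean in practice.

```python
import numpy as np
import pandas as pd

def fill_with_arbitrary(series, value):
    """Hypothetical helper: fill NA with `value`, but refuse values
    that fall inside the observed range of the variable."""
    observed = series.dropna()
    if observed.min() <= value <= observed.max():
        raise ValueError(
            "{} lies within the observed range [{}, {}]".format(
                value, observed.min(), observed.max()))
    return series.fillna(value)

# toy data, made up for illustration
age = pd.Series([22.0, 38.0, np.nan, 35.0, 1.0, np.nan], name="Age")

age_filled = fill_with_arbitrary(age, 100)
print(age_filled.tolist())
```

Calling `fill_with_arbitrary(age, 30)` on the same data would raise a `ValueError`, because 30 sits in the middle of the observed ages.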

### Advantages

- Easy to implement
- Captures the importance of missingness if there is one

### Disadvantages

- Distorts the original distribution of the variable
- If missingness is not important, it may mask the predictive power of the original variable by distorting its distribution
- Hard to decide which value to use
- If the value is outside the distribution, it may mask or create outliers
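
The distortion mentioned above is easy to see by comparing summary statistics before and after imputation. A minimal sketch on made-up data (the values 0 and 100 mirror the ones used later in this notebook):

```python
import numpy as np
import pandas as pd

# toy data, made up for illustration
age = pd.Series([22.0, 38.0, np.nan, 35.0, 29.0, np.nan, 41.0, 30.0])

# describe() ignores NA, so "original" reflects the observed values only
summary = pd.DataFrame({
    "original": age.describe(),
    "filled_0": age.fillna(0).describe(),
    "filled_100": age.fillna(100).describe(),
})

print(summary.loc[["mean", "std", "min", "max"]])
```

In this toy example the mean shifts and the standard deviation grows after either imputation, and the imputed values become the new minimum or maximum, which is how they can end up looking like outliers.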

### Final note

When variables are captured by third parties, like credit agencies, they often already contain arbitrary numbers that signal missingness. So even if this is not common practice in data competitions, it is common practice in real-life data collection.
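
When data arrive with such codes already in place, you can either keep them as delivered or convert them back to NaN and pick your own strategy. The sketch below assumes a hypothetical extract in which a third party codes missing income as -9999; the column names and values are invented for illustration.

```python
import io
import pandas as pd

# hypothetical third-party extract: -9999 signals a missing income
raw_text = "customer_id,income\n1,42000\n2,-9999\n3,58000\n4,-9999\n"

# option A: keep the code as delivered (it is already an arbitrary value)
as_delivered = pd.read_csv(io.StringIO(raw_text))

# option B: turn the code back into NaN, for the income column only
as_nan = pd.read_csv(io.StringIO(raw_text), na_values={"income": ["-9999"]})

print(as_delivered["income"].tolist())
print(as_nan["income"].isnull().sum())
```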

===============================================================================

## Real-life example

### Predicting Survival on the Titanic: understanding society behaviour and beliefs

Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board. Interestingly, by analysing the probability of survival based on a few attributes like gender, age, and social status, we can make very accurate predictions about which passengers would survive. Some groups of people were more likely to survive than others, such as women, children, and the upper class. Therefore, we can learn about society's priorities and privileges at the time.

=============================================================================

In the following cells, I will show how this procedure impacts features and machine learning using the Titanic and House Price datasets from Kaggle.

If you haven't downloaded the datasets yet, in the lecture "Guide to setting up your computer" in section 1, you can find the details on how to do so.

In [1]:
import pandas as pd
import numpy as np

# for classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# to split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

# to evaluate classification models
from sklearn.metrics import roc_auc_score

import warnings
warnings.filterwarnings('ignore')

In [2]:
# load the Titanic Dataset with a few variables for demonstration

data = pd.read_csv('titanic.csv', usecols = ['Age', 'Fare','Survived'])
data.head()

|   | Survived | Age | Fare |
|---|---|---|---|
| 0 | 0 | 22.0 | 7.25 |
| 1 | 1 | 38.0 | 71.2833 |
| 2 | 1 | 26.0 | 7.925 |
| 3 | 1 | 35.0 | 53.1 |
| 4 | 0 | 35.0 | 8.05 |


In [3]:
# let's look at the percentage of NA
data.isnull().mean()

Survived    0.000000
Age         0.198653
Fare        0.000000
dtype: float64

In [4]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(data, data.Survived, test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((623, 3), (268, 3))

In [5]:
def impute_na(df, variable):
    # adds two new columns to df in place:
    # one with NA replaced by 0, one with NA replaced by 100
    df[variable+'_zero'] = df[variable].fillna(0)
    df[variable+'_hundred'] = df[variable].fillna(100)

In [6]:
# let's replace the NA with 0 and with 100 in the training and testing sets
impute_na(X_train, 'Age')
impute_na(X_test, 'Age')

X_train.head(20)

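|   | Survived | Age | Fare | Age_zero | Age_hundred |
|---|---|---|---|---|---|
| 857 | 1 | 51.0 | 26.55 | 51.0 | 51.0 |
| 52 | 1 | 49.0 | 76.7292 | 49.0 | 49.0 |
| 386 | 0 | 1.0 | 46.9 | 1.0 | 1.0 |
| 124 | 0 | 54.0 | 77.2875 | 54.0 | 54.0 |
| 578 | 0 | NaN | 14.4583 | 0.0 | 100.0 |
| 549 | 1 | 8.0 | 36.75 | 8.0 | 8.0 |
| 118 | 0 | 24.0 | 247.5208 | 24.0 | 24.0 |
| 12 | 0 | 20.0 | 8.05 | 20.0 | 20.0 |
| 157 | 0 | 30.0 | 8.05 | 30.0 | 30.0 |
| 127 | 1 | 24.0 | 7.1417 | 24.0 | 24.0 |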
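
As an aside, the same kind of imputation can also be expressed with scikit-learn's `SimpleImputer` using `strategy="constant"`. The sketch below is an alternative to the `fillna` approach used in this notebook, not the method it follows, and the small `train` and `test` frames are made-up stand-ins for the real splits.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# made-up stand-ins for the training and testing sets
train = pd.DataFrame({"Age": [51.0, np.nan, 1.0, 54.0]})
test = pd.DataFrame({"Age": [np.nan, 30.0]})

# replace every NA with the constant 100
imputer = SimpleImputer(strategy="constant", fill_value=100)

# fit on the training set, then apply the same transformation to both sets
train_age = imputer.fit_transform(train[["Age"]])  # 2-D NumPy array
test_age = imputer.transform(test[["Age"]])

print(train_age.ravel())
print(test_age.ravel())
```

A possible advantage of this form is that the imputer can be placed inside a scikit-learn `Pipeline`, so the same constant is applied consistently at training and prediction time.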


### Logistic Regression

In [7]:
# we compare the models built using Age filled with zero, vs Age filled with 100

logit = LogisticRegression(random_state=44, C=1000) # large C so that regularization is negligible
logit.fit(X_train[['Age_zero','Fare']], y_train)
print('Train set')
pred = logit.predict_proba(X_train[['Age_zero','Fare']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = logit.predict_proba(X_test[['Age_zero','Fare']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

logit = LogisticRegression(random_state=44, C=1000) # large C so that regularization is negligible
logit.fit(X_train[['Age_hundred','Fare']], y_train)
print('Train set')
pred = logit.predict_proba(X_train[['Age_hundred','Fare']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = logit.predict_proba(X_test[['Age_hundred','Fare']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train set
Logistic Regression roc-auc: 0.6863462831608859
Test set
Logistic Regression roc-auc: 0.7137499999999999
Train set
Logistic Regression roc-auc: 0.6803594282119694
Test set
Logistic Regression roc-auc: 0.7227976190476191


In [8]:
# random forests

rf = RandomForestClassifier(n_estimators=100, random_state=39, max_depth=3)
rf.fit(X_train[['Age_zero', 'Fare']], y_train)
print('Train set zero imputation')
pred = rf.predict_proba(X_train[['Age_zero', 'Fare']])
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set zero imputation')
pred = rf.predict_proba(X_test[['Age_zero', 'Fare']])
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))
print()
rf = RandomForestClassifier(n_estimators=100, random_state=39, max_depth=3)
rf.fit(X_train[['Age_hundred', 'Fare']], y_train)
print('Train set 100 imputation')
pred = rf.predict_proba(X_train[['Age_hundred', 'Fare']])
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set 100 imputation')
pred = rf.predict_proba(X_test[['Age_hundred', 'Fare']])
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))
print()

Train set zero imputation
Random Forests roc-auc: 0.7555855621353116
Test set zero imputation
Random Forests roc-auc: 0.7490476190476191

Train set 100 imputation
Random Forests roc-auc: 0.7490781111038807
Test set 100 imputation
Random Forests roc-auc: 0.7653571428571431



We can see that, on the test set, replacing NA with 100 makes the models perform better than replacing NA with 0. As you may remember from the lecture "Replacing NA by mean or median", this is because children were more likely to survive than adults. Filling NA with zeroes places the passengers with missing Age among the children, which distorts this relationship and makes the models lose predictive power. See below for a recap.

In [9]:
print('Average real survival of children: ', X_train[X_train.Age<15].Survived.mean())
print('Average survival of children when using Age imputed with zeroes: ', X_train[X_train.Age_zero<15].Survived.mean())
print('Average survival of children when using Age imputed with 100: ', X_train[X_train.Age_hundred<15].Survived.mean())

Average real survival of children:  0.5740740740740741
Average survival of children when using Age imputed with zeroes:  0.38857142857142857
Average survival of children when using Age imputed with 100:  0.5740740740740741


### Final notes

The arbitrary value has to be determined for each variable specifically. For example, for this dataset, replacing NA in Age with either 0 or 100 is a valid choice, because neither value is frequent in the original distribution of the variable, and both lie at the tails of the distribution.

However, if we were to replace NA in Fare, those values would no longer be suitable, because Fare can take values of up to 500. So we might want to consider using 500 or 1000 to replace NA instead of 100.
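
A per-variable choice like this can be written as a dictionary and passed to `fillna`. The sketch below uses made-up data and illustrative values (100 for Age, 1000 for Fare); the check that each value lies above the observed maximum is my own addition, not part of the original procedure.

```python
import numpy as np
import pandas as pd

# toy data, made up for illustration
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 80.0, 35.0],
    "Fare": [7.25, 512.33, np.nan, 53.1],
})

# one arbitrary value per variable (illustrative choices)
arbitrary_values = {"Age": 100, "Fare": 1000}

# sanity check: each value must lie beyond the observed maximum
for col, value in arbitrary_values.items():
    if value <= toy[col].max():
        raise ValueError("{} is inside the range of {}".format(value, col))

# fillna accepts a dict mapping column names to fill values
filled = toy.fillna(value=arbitrary_values)
print(filled)
```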

As you can see this is totally arbitrary. And yet, it is used in the industry.

Typical values chosen by companies are -9999 or 9999, or similar.

**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**