## Adding a variable to capture NA

In previous notebooks, we learned how to replace missing values with the mean or median, or with a value extracted at random from the variable. These methods assume that the data is missing completely at random (MCAR).

There are other methods that can be used when values are not missing at random, for example, arbitrary value imputation or end of distribution imputation. However, these imputation techniques will affect the variable distribution quite dramatically, and they are therefore not suitable for linear models.

**So what can we do if the data is not MCAR and we want to use linear models?**

If data are not missing at random, it is a good idea to replace missing observations with the mean, median, or mode while **flagging** them with a **Missing Indicator**. A missing indicator is a binary variable that indicates whether the data was missing (1) or not (0).
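As a minimal sketch of what such a binary variable looks like, the snippet below builds an indicator for a small, made-up `age` variable (the values are hypothetical and only serve as illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy variable with two missing observations
age = pd.Series([22.0, np.nan, 35.0, np.nan, 58.0], name="age")

# 1 where the value is missing, 0 where it was observed
age_na = age.isna().astype(int)

print(age_na.tolist())  # [0, 1, 0, 1, 0]
```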

### For which variables can I add a missing indicator?

We can add a missing indicator to both numerical and categorical variables. 

#### Note

A missing indicator is never used alone. On the contrary, it is always used together with another imputation technique, which can be mean / median imputation for numerical variables, or frequent category imputation for categorical variables. For both categorical and numerical variables, we can also combine random sample imputation with the addition of a missing indicator.

Commonly used together:

- Mean/median imputation + missing indicator (Numerical variables)

- Frequent category imputation + missing indicator (categorical variables)

- Random sample imputation + missing indicator (numerical and categorical)
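The first two combinations could be sketched as follows. The data frame and its column names (`income`, `colour`) are made up for illustration; the important detail is that the indicator is created before the original variable is imputed, otherwise there is nothing left to flag.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numerical and one categorical variable
df = pd.DataFrame(
    {
        "income": [30.0, np.nan, 50.0, 40.0, np.nan],
        "colour": ["red", "blue", None, "red", "red"],
    }
)

# Step 1: flag the missing observations BEFORE imputing
for col in ["income", "colour"]:
    df[f"{col}_NA"] = df[col].isna().astype(int)

# Step 2: impute the original variables
# median for the numerical variable, most frequent category for the categorical one
df["income"] = df["income"].fillna(df["income"].median())
df["colour"] = df["colour"].fillna(df["colour"].mode()[0])

print(df)
```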

### Assumptions

- Data is not missing at random

- Missing data is predictive

### Advantages

- Captures the importance of missing data, if there is any

### Limitations

- Expands the feature space

Adding a missing indicator will increase the number of variables in the dataset. So if the dataset contains 10 features, and all of them have missing values, after adding a missing indicator, we will have a dataset with 20 features: the original 10 features plus an additional 10 binary features, which indicate for each of the original variables whether they were missing or not. This may not be a problem in datasets with tens to a few hundred variables, but if our original dataset contains thousands of variables, creating an additional variable to indicate NA for each of them will leave us with a very big dataset.

In addition, data tends to be missing across multiple variables, which often leads to many of the missing indicators being similar to or even identical to each other.
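One possible way to spot identical indicators is to check for duplicated columns, as in the sketch below. The toy data and column names are hypothetical; here the two garage variables are missing in exactly the same rows, so their indicators carry the same information.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the two garage variables are missing together
df = pd.DataFrame(
    {
        "garage_area": [400.0, np.nan, 520.0, np.nan],
        "garage_year": [1999.0, np.nan, 2005.0, np.nan],
        "lot_area": [8000.0, 9500.0, np.nan, 7200.0],
    }
)

# One indicator per variable
indicators = df.isna().astype(int).add_suffix("_NA")

# duplicated() compares rows, so we transpose to compare columns,
# and keep only the first occurrence of each distinct indicator
unique_indicators = indicators.loc[:, ~indicators.T.duplicated()]

print(list(unique_indicators.columns))  # ['garage_area_NA', 'lot_area_NA']
```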

### Final note

Typically, mean / median / mode imputation is done together with adding a variable to capture those observations where the data was missing, thus covering 2 angles: if the data was missing completely at random, this is handled by the mean / median / mode imputation, and if it wasn't, this is captured by the missing indicator.

These methods combined are the top choice in data science competitions. See, for example, the winning solution of the KDD 2009 cup: ["Winning the KDD Cup Orange Challenge with Ensemble Selection"](http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf).

## Datasets:

- Ames House Price
- Titanic

To download the datasets, please refer to the lecture **Datasets** in **Section 2** of this course.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# to split the datasets
from sklearn.model_selection import train_test_split

In [2]:
# load the Titanic Dataset with a few variables for demonstration

data = pd.read_csv("../../titanic.csv", usecols=["age", "fare", "survived"])
data.head()

|   | survived | age | fare |
|---|---|---|---|
| 0 | 1 | 29.0 | 211.3375 |
| 1 | 1 | 0.9167 | 151.55 |
| 2 | 0 | 2.0 | 151.55 |
| 3 | 0 | 30.0 | 151.55 |
| 4 | 0 | 25.0 | 151.55 |


In [3]:
# let's look at the percentage of NA

data.isnull().mean()

survived    0.000000
age         0.200917
fare        0.000764
dtype: float64

To add a binary missing indicator, we don't necessarily need to learn anything from the training set, so in principle, we could do this in the original dataset and then separate it into train and test. However, I do not recommend this practice.

In addition, if you are using Scikit-learn to add the missing indicator, the indicator needs to learn from the train set which features contain missing values, that is, the features for which the binary variable needs to be added. We will see more about different implementations of missing indicators in future notebooks. For now, let's see how to create a binary missing indicator manually.
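To give a rough idea of what "learning from the train set" means, here is a small sketch with Scikit-learn's `MissingIndicator` from `sklearn.impute`. The toy train and test sets are hypothetical, and this is only one of several possible implementations, not necessarily the one used later in the course.

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# Hypothetical toy data: only "age" has NA in the train set
toy_train = pd.DataFrame({"age": [22.0, np.nan, 35.0], "fare": [7.25, 71.3, 8.05]})
toy_test = pd.DataFrame({"age": [np.nan, 40.0], "fare": [10.5, 13.0]})

# "missing-only" adds indicators only for features with NA seen during fit
indicator = MissingIndicator(features="missing-only")
indicator.fit(toy_train)

# positions of the features for which an indicator will be created
print(indicator.features_)

# boolean array with one column per learned feature
test_flags = indicator.transform(toy_test)
print(test_flags.astype(int))
```

With the default settings, `transform` raises an error if a feature shows missing values at transform time but had none during `fit`, which is one reason to fit on the train set only.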

In [4]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data[["age", "fare"]],  # predictors
    data["survived"],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0,  # seed to ensure reproducibility
)

X_train.shape, X_test.shape

((916, 2), (393, 2))

In [5]:
# Let's explore the missing data in the train set
# the percentages should be fairly similar to those
# of the whole dataset

X_train.isnull().mean()

age     0.191048
fare    0.000000
dtype: float64

In [6]:
# add the missing indicator

# this is done very simply by using np.where from numpy
# and isnull from pandas:

X_train["Age_NA"] = np.where(X_train["age"].isnull(), 1, 0)
X_test["Age_NA"] = np.where(X_test["age"].isnull(), 1, 0)

X_train.head()

|   | age | fare | Age_NA |
|---|---|---|---|
| 501 | 13.0 | 19.5 | 0 |
| 588 | 4.0 | 23.0 | 0 |
| 402 | 30.0 | 13.8583 | 0 |
| 1193 | NaN | 7.725 | 1 |
| 686 | 22.0 | 7.725 | 0 |


In [7]:
# the mean of the binary variable coincides with the
# percentage of missing values in the original variable

X_train["Age_NA"].mean()

0.19104803493449782

In [8]:
# yet the original variable still shows the missing values,
# which need to be replaced by any of the techniques
# we have learnt

X_train.isnull().mean()

age       0.191048
fare      0.000000
Age_NA    0.000000
dtype: float64

In [9]:
# for example median imputation

median = X_train["age"].median()

X_train["age"] = X_train["age"].fillna(median)
X_test["age"] = X_test["age"].fillna(median)

# check that there are no more missing values
X_train.isnull().mean()

age       0.0
fare      0.0
Age_NA    0.0
dtype: float64
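The two steps above (add the indicator, then impute with the median learned from the train set) could be wrapped in a small helper. The function name `add_indicator_and_impute` and the toy data are hypothetical; working on copies also avoids modifying the data frames returned by `train_test_split` in place.

```python
import numpy as np
import pandas as pd


def add_indicator_and_impute(train, test, variable):
    """Hypothetical helper: flag NA in `variable`, then fill with the train median."""
    train, test = train.copy(), test.copy()

    # the median is learned from the train set only
    median = train[variable].median()

    for df in (train, test):
        # the indicator must be created before the NA are replaced
        df[f"{variable}_NA"] = df[variable].isna().astype(int)
        df[variable] = df[variable].fillna(median)

    return train, test


# Toy data for illustration
toy_train = pd.DataFrame({"age": [10.0, np.nan, 30.0, 50.0]})
toy_test = pd.DataFrame({"age": [np.nan, 20.0]})

toy_train, toy_test = add_indicator_and_impute(toy_train, toy_test, "age")
print(toy_train)
print(toy_test)
```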

### House Prices dataset

In [10]:
# we are going to use the following variables,
# some are categorical some are numerical

cols_to_use = [
    "LotFrontage",
    "MasVnrArea",  # numerical
    "BsmtQual",
    "FireplaceQu",  # categorical
    "SalePrice",  # target
]

In [11]:
# let's load the House Prices dataset

data = pd.read_csv("../../houseprice.csv", usecols=cols_to_use)
print(data.shape)
data.head()

(1460, 5)


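|   | LotFrontage | MasVnrArea | BsmtQual | FireplaceQu | SalePrice |
|---|---|---|---|---|---|
| 0 | 65.0 | 196.0 | Gd | NaN | 208500 |
| 1 | 80.0 | 0.0 | Gd | TA | 181500 |
| 2 | 68.0 | 162.0 | Gd | TA | 223500 |
| 3 | 60.0 | 0.0 | TA | Gd | 140000 |
| 4 | 84.0 | 350.0 | Gd | TA | 250000 |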


In [12]:
# let's inspect the variables with missing values

data.isnull().mean()

LotFrontage    0.177397
MasVnrArea     0.005479
BsmtQual       0.025342
FireplaceQu    0.472603
SalePrice      0.000000
dtype: float64

In [13]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop("SalePrice", axis=1),
    data["SalePrice"],
    test_size=0.3,
    random_state=0,
)

X_train.shape, X_test.shape

((1022, 4), (438, 4))

In [14]:
# Remove target from variable list

cols_to_use = cols_to_use[:-1]

cols_to_use

['LotFrontage', 'MasVnrArea', 'BsmtQual', 'FireplaceQu']

In [15]:
# Let's create a list with the name
# of the new variables

indicators = [f"{var}_NA" for var in cols_to_use]

In [16]:
# Let's add the indicators
X_train[indicators] = X_train[cols_to_use].isna().astype(int)
X_test[indicators] = X_test[cols_to_use].isna().astype(int)

In [17]:
# Let's explore the first rows

X_train.head()

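|   | LotFrontage | MasVnrArea | BsmtQual | FireplaceQu | LotFrontage_NA | MasVnrArea_NA | BsmtQual_NA | FireplaceQu_NA |
|---|---|---|---|---|---|---|---|---|
| 64 | NaN | 573.0 | Gd | NaN | 1 | 0 | 0 | 1 |
| 682 | NaN | 0.0 | Gd | Gd | 1 | 0 | 0 | 0 |
| 960 | 50.0 | 0.0 | TA | NaN | 0 | 0 | 0 | 1 |
| 1384 | 60.0 | 0.0 | TA | NaN | 0 | 0 | 0 | 1 |
| 1100 | 60.0 | 0.0 | TA | NaN | 0 | 0 | 0 | 1 |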


In [18]:
# now let's evaluate the mean value of the missing indicators

# calculate the mean:

X_train[indicators].mean()

LotFrontage_NA    0.184932
MasVnrArea_NA     0.004892
BsmtQual_NA       0.023483
FireplaceQu_NA    0.467710
dtype: float64

In [19]:
# the mean of the missing indicator
# coincides with the percentage of missing values
# in the original variable

X_train.isnull().mean()

LotFrontage       0.184932
MasVnrArea        0.004892
BsmtQual          0.023483
FireplaceQu       0.467710
LotFrontage_NA    0.000000
MasVnrArea_NA     0.000000
BsmtQual_NA       0.000000
FireplaceQu_NA    0.000000
dtype: float64
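The reason both numbers coincide is simple arithmetic: the indicator is 1 exactly where the value is missing, so its mean is the number of missing values divided by the number of rows. A tiny check on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy variable: 2 of 5 values are missing
df = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan, 5.0]})
df["x_NA"] = df["x"].isna().astype(int)

fraction_missing = df["x"].isna().mean()  # 2 missing out of 5 rows
indicator_mean = df["x_NA"].mean()  # (0 + 1 + 0 + 1 + 0) / 5

print(fraction_missing, indicator_mean)
```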

In [20]:
# let's impute the NA with the median for numerical
# variables
# remember that we calculate the median using the train set

median = X_train["LotFrontage"].median()
X_train["LotFrontage"] = X_train["LotFrontage"].fillna(median)
X_test["LotFrontage"] = X_test["LotFrontage"].fillna(median)

median = X_train["MasVnrArea"].median()
X_train["MasVnrArea"] = X_train["MasVnrArea"].fillna(median)
X_test["MasVnrArea"] = X_test["MasVnrArea"].fillna(median)


# let's impute the NA in categorical variables by the
# most frequent category (aka the mode)
# the mode needs to be learnt from the train set

mode = X_train["BsmtQual"].mode()[0]
X_train["BsmtQual"] = X_train["BsmtQual"].fillna(mode)
X_test["BsmtQual"] = X_test["BsmtQual"].fillna(mode)

mode = X_train["FireplaceQu"].mode()[0]
X_train["FireplaceQu"] = X_train["FireplaceQu"].fillna(mode)
X_test["FireplaceQu"] = X_test["FireplaceQu"].fillna(mode)

In [21]:
# and now let's check there are no more NA
X_train.isnull().mean()

LotFrontage       0.0
MasVnrArea        0.0
BsmtQual          0.0
FireplaceQu       0.0
LotFrontage_NA    0.0
MasVnrArea_NA     0.0
BsmtQual_NA       0.0
FireplaceQu_NA    0.0
dtype: float64

As you can see, we now have twice as many features as in the original dataset. The original dataset had 4 predictor variables; the pre-processed dataset contains 8 (the target is kept separately in `y_train`).
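When there are many variables, repeating the imputation code for each of them becomes tedious. A possible alternative, sketched below on made-up data that borrows two column names from this notebook, is to learn one fill value per variable from the train set (median for numerical, mode for categorical) and apply it in a loop. The `fill_values` dictionary is an illustrative name, not part of any library.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data with one numerical and one categorical variable
train = pd.DataFrame(
    {
        "LotFrontage": [60.0, np.nan, 80.0, 70.0],
        "BsmtQual": ["TA", "Gd", None, "TA"],
    }
)
test = pd.DataFrame({"LotFrontage": [np.nan, 65.0], "BsmtQual": [None, "Gd"]})

# learn the replacement values from the train set only
fill_values = {
    col: train[col].median() if is_numeric_dtype(train[col]) else train[col].mode()[0]
    for col in train.columns
}

for df in (train, test):
    for col, value in fill_values.items():
        df[f"{col}_NA"] = df[col].isna().astype(int)  # flag first
        df[col] = df[col].fillna(value)  # then impute

print(train.shape)  # (4, 4): 2 original variables + 2 indicators
```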

**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**