# Missing data imputation
Another important topic in the preparation of data for machine learning models is the issue of missing data.

When data are missing, one can:

1. Omit observations with missing values (not recommended, avoid if possible)
2. Omit from the analysis a variable with a large number of missing values
3. Replace the missing values in a sensible way
4. For a qualitative variable, create a separate category "no data"

**Ad 1.**

A missing value in a single column eliminates the whole observation from modelling. If the dataset is large, losing some observations due to missing data is not a big problem, especially if the values are missing at random.

It is worse if the dataset is small and the missing values are not random. Eliminating such observations changes the structure of the sample and can lead to wrong conclusions from the analysis.

**Ad 2.**

If a relatively large number of missing values is concentrated in a small number of variables, it may make sense to omit these variables from the analysis. This limits the number of features considered in modeling, but it does not shrink the sample and, equally important, does not change its structure.

A common rule of thumb is that if a variable has more than 5%-10% missing values, one should consider omitting it from the analysis.
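
The rule of thumb above can be sketched as a small helper. This is only an illustration on a hypothetical toy frame: the function name `drop_sparse_columns` and the 10% default threshold are assumptions, not part of the lesson's dataset or code.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_share=0.10):
    """Drop columns whose share of missing values exceeds the threshold."""
    missing_share = df.isna().mean()  # fraction of NaN per column
    to_drop = missing_share[missing_share > max_missing_share].index
    return df.drop(columns=to_drop), missing_share

# Hypothetical toy data: "b" has 30% missing values, "c" has exactly 10%.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "b": [1.0, np.nan, 3.0, np.nan, 5.0, np.nan, 7.0, 8.0, 9.0, 10.0],
    "c": [np.nan, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})

reduced, share = drop_sparse_columns(toy, max_missing_share=0.10)
print(share)
print(list(reduced.columns))  # "c" stays because the comparison is strict
```

Note that the number of rows is unchanged; only the set of features shrinks.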

**Ad 3.**

An alternative to removing observations with missings is replacing missing values in a sensible way. Replacing/filling missings is also called **IMPUTATION**.

A missing value of a quantitative variable can be replaced (ALWAYS after analyzing the data and its context):

* by a single selected value (sometimes missing values can be replaced with 0),
* by the mean / median / mode for the whole sample,
* by the mean / median / mode for the subgroup (defined by another variable or variables) to which the observation belongs,
* by the predicted value from a model estimated on the non-missing observations (regression imputation), in which the feature with missing values is the dependent variable and the other variables are predictors (in other words, check whether the value of the feature with missing values can be predicted from the values of other, non-missing variables for this observation),
* by a random value drawn from a reasonable distribution (e.g. from the non-missing values),
* by multiple imputation: replace the missing value many times with different values (from the assumed distribution), perform the analysis on each completed dataset, and average the results.
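
Two of the options listed above, a random draw from the observed values and regression imputation, could be sketched roughly as follows. The toy data and the helper names `impute_random_draw` and `impute_regression` are hypothetical, and the regression is a plain least-squares fit, which is only one of many possible models.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: y depends linearly on x, then 20% of y is blanked out.
n = 200
x = rng.normal(size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=n)
toy = pd.DataFrame({"x": x, "y": y})
toy.loc[toy.sample(frac=0.2, random_state=0).index, "y"] = np.nan

def impute_random_draw(s, rng):
    """Fill NaN with values drawn (with replacement) from the observed values."""
    out = s.copy()
    mask = out.isna()
    observed = out[~mask].to_numpy()
    out.loc[mask] = rng.choice(observed, size=int(mask.sum()))
    return out

def impute_regression(df, target, predictors):
    """Fill NaN in `target` with predictions of a least-squares fit on complete rows."""
    out = df.copy()
    mask = out[target].isna().to_numpy()
    # design matrix with an intercept column
    X = np.column_stack([np.ones(len(out)), out[predictors].to_numpy()])
    coef, *_ = np.linalg.lstsq(X[~mask], out[target].to_numpy()[~mask], rcond=None)
    out.loc[mask, target] = X[mask] @ coef
    return out

y_random = impute_random_draw(toy["y"], rng)
toy_reg = impute_regression(toy, target="y", predictors=["x"])
print(y_random.isna().sum(), toy_reg["y"].isna().sum())
```

The random draw preserves the marginal distribution of the variable but ignores its relation to other features, while the regression fill does the opposite: it uses the relation but understates the variance of the imputed values.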

**Ad 4.**

For a qualitative variable, a missing value can be:

* encoded as a separate category,
* replaced with the most common value (mode), which is rather not recommended as it changes the distribution,
* replaced with the mode for the subgroup,
* replaced with a random value drawn from the non-missing values.
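
As a rough illustration of the first three options, here is a sketch in pandas. The toy frame and the column names `group` and `color` are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing "color" in each group.
toy = pd.DataFrame({
    "group": ["a", "a", "a", "a", "b", "b", "b"],
    "color": ["red", np.nan, "red", "red", "blue", "blue", np.nan],
})

# 1) encode missing values as a separate category
as_category = toy["color"].fillna("no data")

# 2) replace with the global mode (here "red")
with_global_mode = toy["color"].fillna(toy["color"].mode().iloc[0])

# 3) replace with the mode of the subgroup the row belongs to
modes = toy.groupby("group")["color"].agg(lambda s: s.mode().iloc[0])
with_group_mode = toy["color"].fillna(toy["group"].map(modes))

print(pd.DataFrame({
    "category": as_category,
    "global_mode": with_global_mode,
    "group_mode": with_group_mode,
}))
```

In the last row the global mode fills in "red" even though every observed value in group "b" is "blue", which shows how a global fill can distort the distribution within subgroups.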


Let's load the medical data and check whether there are any missing values. For this lesson we limit ourselves to a much smaller number of variables.

In [31]:
import pandas as pd
import numpy as np
import pickle
import statsmodels.api as sm

import matplotlib.pyplot as plt
plt.style.use('seaborn-ticks')
%matplotlib inline

pd.set_option("display.max_columns",101)

import gc
import math

medical = pd.read_pickle("data/medical.p")
print(medical.UCURNINS.unique())
medical["UCURNINS"] = (medical.UCURNINS=="Yes").astype(int)
print(medical.UCURNINS.unique())
allVars = ["UCURNINS",  "USATMED", "FDOCT", "FDENT", "UEDUC3"]
medical = medical[allVars]
features = ["USATMED", "FDOCT", "FDENT", "UEDUC3"]

['Yes' 'No']
[1 0]


Let's rescale our features in a simple way, dividing each one by its maximum.

In [32]:
medical[features] = medical[features]/medical[features].max()

`isna()` returns a boolean DataFrame. Summing its columns shows how many missing values there are in each one. Clearly there are none in this case.

In [3]:
medical.isna().sum()

UCURNINS    0
USATMED     0
FDOCT       0
FDENT       0
UEDUC3      0
dtype: int64

Let's split the data into training and validation sets. We will create "missing" values in some columns of the training set to evaluate different imputation methods later on.

In [17]:
import random
from sklearn.model_selection import train_test_split
medicalT, medicalV, y_train, y_test = train_test_split(medical, medical.UCURNINS, stratify=medical.UCURNINS, test_size=0.9, random_state=random.randint(0,10000))
print(medicalT.shape, medicalV.shape)

(3507, 5) (31565, 5)


**CAUTION!!** To amplify the results we are reducing the size of the training dataset very strongly. All the results below may depend heavily on the stochastic element.

Let's create a copy of the dataset into which we will insert missing values.

In [18]:
medicalM = medicalT.copy()

# number of observations that corresponds to 1% of the training set
len1 = int(medicalM.shape[0]/100)


medicalM.loc[medicalM.sample(50*len1).index, "FDENT"] = np.nan
medicalM.loc[medicalM.sample(35*len1).index, "FDOCT"] = np.nan
medicalM.loc[medicalM.sample(20*len1).index, "USATMED"] = np.nan
medicalM.loc[medicalM.sample(10*len1).index, "UEDUC3"] = np.nan

We can now see that in our "missing" dataset only about a quarter of the observations (846 of 3507) are complete. Most often a row has a single missing value, but a row can have up to 4 missing values.

In [19]:
medicalM.isna().sum(axis=1).value_counts()

1    1528
2     921
0     846
3     193
4      19
dtype: int64

## Omitting missing values
The most basic thing we can do is omit the observations that do not have complete values for all variables.


In [20]:
medicalOmit = medicalM.copy()
medicalOmit.dropna(inplace=True)
print(medicalM.shape)
print(medicalOmit.shape)

(3507, 5)
(846, 5)


This way we lose a lot of information. Let's see how it affects training. We fit a k-nearest-neighbors classifier on a limited number of variables to clearly see the effects of missing values.

In [21]:
from sklearn.model_selection import KFold
from sklearn import neighbors
from sklearn.metrics import roc_auc_score


n_neighbors = 20
clf = neighbors.KNeighborsClassifier(n_neighbors, n_jobs=-1, p=2)

clf.fit(medicalT[features].values, medicalT["UCURNINS"].values)
preds = clf.predict_proba(medicalV[features].values)

clf.fit(medicalOmit[features].values, medicalOmit["UCURNINS"].values)
predsOmit = clf.predict_proba(medicalV[features].values)


print("No missings:", roc_auc_score(medicalV["UCURNINS"], preds[:,1]))
print("Omitting missings:", roc_auc_score(medicalV["UCURNINS"], predsOmit[:,1]))

No missings: 0.7618182165070773
Omitting missings: 0.7493880540661864


As we can see, the k-nearest-neighbors classifier is fairly robust here: even though only 846 of 3507 observations remain, the AUC is only slightly lower when we omit incomplete observations. In other cases the effect may be much stronger, so let's see if we can mitigate it somehow.

## Imputation using averages

The most basic way to impute missing values is to fill in the mean value (of the whole dataset or within some type of grouping/aggregation).

In [22]:
medicalMean = medicalM.copy()

In [23]:
# features = ["USATMED", "FDOCT", "FDENT", "UEDUC3"]
medicalMean.loc[medicalMean.USATMED.isna(), "USATMED"] = medicalM["USATMED"].mean()
medicalMean.loc[medicalMean.FDOCT.isna(), "FDOCT"] = medicalM["FDOCT"].mean()
medicalMean.loc[medicalMean.FDENT.isna(), "FDENT"] = medicalM["FDENT"].mean()
medicalMean.loc[medicalMean.UEDUC3.isna(), "UEDUC3"] = medicalM["UEDUC3"].mean()

In [24]:
medicalMean.isna().sum()

UCURNINS    0
USATMED     0
FDOCT       0
FDENT       0
UEDUC3      0
dtype: int64
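
The column-by-column fill above can be written more compactly. The sketch below uses a hypothetical toy frame rather than the medical data and shows two equivalent routes: pandas `fillna` with the column means, and scikit-learn's `SimpleImputer`, which stores the training means so that the same values can be reused on a validation set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for training and validation data.
train = pd.DataFrame({"f1": [0.0, 1.0, np.nan, 3.0], "f2": [np.nan, 2.0, 4.0, 6.0]})
valid = pd.DataFrame({"f1": [np.nan, 5.0], "f2": [1.0, np.nan]})

# pandas: fill every column with its own mean (NaN values are skipped by mean())
train_filled = train.fillna(train.mean())

# scikit-learn: learn the means on the training set, apply them anywhere
imputer = SimpleImputer(strategy="mean")
train_imputed = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)
valid_imputed = pd.DataFrame(
    imputer.transform(valid), columns=valid.columns, index=valid.index
)

print(imputer.statistics_)  # per-column means learned from the training data
print(valid_imputed)
```

Reusing statistics learned on the training set avoids leaking information from the validation data into the imputation step.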

In [25]:
from sklearn.model_selection import KFold
from sklearn import neighbors
n_neighbors = 20
clf = neighbors.KNeighborsClassifier(n_neighbors, n_jobs=-1, p=2)

clf.fit(medicalT[features].values, medicalT["UCURNINS"].values)
preds = clf.predict_proba(medicalV[features].values)

clf.fit(medicalMean[features].values, medicalMean["UCURNINS"].values)
predsOmit = clf.predict_proba(medicalV[features].values)


print("No missings:", roc_auc_score(medicalV["UCURNINS"], preds[:,1]))
print("Mean imputation:", roc_auc_score(medicalV["UCURNINS"], predsOmit[:,1]))

No missings: 0.7618182165070773
Mean imputation: 0.7490579218120181


In [26]:
medicalMean = medicalM.copy()
medicalMean.UEDUC3.value_counts()

0.5    1868
1.0     948
0.0     341
Name: UEDUC3, dtype: int64

In [27]:
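for var in features:
    # mean of `var` within each UEDUC3 group (NaN values are skipped)
    means = medicalM.groupby("UEDUC3")[var].mean()
    # group mean for every row; stays NaN where UEDUC3 itself is missing
    medicalMean["temp"] = medicalMean.UEDUC3.map(means)
    medicalMean.loc[medicalMean[var].isna(), var] = medicalMean.loc[medicalMean[var].isna(), "temp"]
medicalMean.drop(columns="temp", inplace=True)  # helper column is no longer needed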
    
medicalMean.loc[medicalMean.USATMED.isna(), "USATMED"] = medicalM["USATMED"].mean()
medicalMean.loc[medicalMean.FDOCT.isna(), "FDOCT"] = medicalM["FDOCT"].mean()
medicalMean.loc[medicalMean.FDENT.isna(), "FDENT"] = medicalM["FDENT"].mean()
medicalMean.loc[medicalMean.UEDUC3.isna(), "UEDUC3"] = medicalM["UEDUC3"].mean()
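
The same idea, a group mean first and the overall mean as a fallback, can also be sketched with `groupby(...).transform("mean")`. The toy frame below is hypothetical; group `2.0` has no observed values at all, so its row falls back to the overall mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group 2.0 has no observed "value" at all.
toy = pd.DataFrame({
    "group": [0.0, 0.0, 0.0, 1.0, 1.0, 2.0],
    "value": [1.0, 3.0, np.nan, 10.0, np.nan, np.nan],
})

# mean of "value" within each group, broadcast back to every row
group_means = toy.groupby("group")["value"].transform("mean")

# fill with the group mean first, then with the overall mean where that is NaN
filled = toy["value"].fillna(group_means).fillna(toy["value"].mean())
print(filled.tolist())
```

Chaining the two `fillna` calls keeps the order of preference explicit: the more specific estimate is used wherever it exists.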

In [28]:
from sklearn.model_selection import KFold
from sklearn import neighbors
n_neighbors = 20
clf = neighbors.KNeighborsClassifier(n_neighbors, n_jobs=-1, p=2)

clf.fit(medicalT[features].values, medicalT["UCURNINS"].values)
preds = clf.predict_proba(medicalV[features].values)

clf.fit(medicalMean[features].values, medicalMean["UCURNINS"].values)
predsOmit = clf.predict_proba(medicalV[features].values)


print("No missings:", roc_auc_score(medicalV["UCURNINS"], preds[:,1]))
print("Group-mean imputation:", roc_auc_score(medicalV["UCURNINS"], predsOmit[:,1]))

No missings: 0.7618182165070773
Group-mean imputation: 0.7497223164254384


## Working with missing nominal (non-ordinal) data
The mode is the most frequent value, and it is the most common choice for imputing missing nominal data. To go beyond aggregation-based methods such as the mode, we would need to treat imputation as a multinomial classification problem.
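
A minimal sketch of the classification approach, assuming a hypothetical toy dataset with a three-level nominal column `label` and two numeric predictors. `LogisticRegression` is just one possible classifier here; any multiclass model could play the same role.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical toy data: three well-separated clusters, one label per cluster.
centers = {"low": -4.0, "mid": 0.0, "high": 4.0}
labels = rng.choice(list(centers), size=300)
x1 = np.array([centers[l] for l in labels]) + rng.normal(scale=0.5, size=300)
x2 = rng.normal(size=300)
toy = pd.DataFrame({"x1": x1, "x2": x2, "label": labels})
true_labels = toy["label"].copy()
toy.loc[toy.sample(frac=0.2, random_state=0).index, "label"] = np.nan

# Baseline: fill every missing label with the mode.
mode_filled = toy["label"].fillna(toy["label"].mode().iloc[0])

# Classification: fit on rows with a known label, predict the missing ones.
known = toy["label"].notna()
clf = LogisticRegression(max_iter=1000)
clf.fit(toy.loc[known, ["x1", "x2"]], toy.loc[known, "label"])

imputed = toy.copy()
imputed.loc[~known, "label"] = clf.predict(toy.loc[~known, ["x1", "x2"]])

acc_mode = (mode_filled[~known] == true_labels[~known]).mean()
acc_clf = (imputed.loc[~known, "label"] == true_labels[~known]).mean()
print("mode:", acc_mode, "classifier:", acc_clf)
```

On data like this, where the predictors carry strong information about the category, the classifier recovers far more of the hidden labels than the mode does. With weak predictors the gain can be small.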

## Exercises

**Exercise 10.1**

Please check missing values of variables in the titanic set. Visualize missings with `md.pattern()` and `aggr()` (these are R functions from the `mice` and `VIM` packages). WARNING! Check that you correctly identify missing data in text columns!

**Exercise 10.2**

Please create a copy of the titanic data containing only non-missing observations - how many have been left?


**Exercise 10.3**

Please create a copy of the titanic data, then:

* remove variables with a very large number of missings
* replace missings in the "embarked" column with the value "S"
* replace missings in the "age" column with the mean / median in subgroups defined by the variables "pclass", "survived" and "sibsp"