# Handling Missing Data

Will Badart <badart_william@bah.com>

created: **SEP 2018**

This notebook explains several techniques for handling missing data, particularly large swaths of it. We will compare how each technique affects models' performance. I'll be using [this][SO] Stack Overflow post as an outline for the techniques we'll explore.

[SO]: https://stackoverflow.com/a/35684975/4025659

## The Dataset

I'll be using the [breast cancer][dataset] dataset from `sklearn` as a base and applying a (reverse?) pre-processing step: blanking out a random sample of cells to simulate the problem of missing data.

[dataset]: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer

In [1]:
import numpy as np

from random import (
    choices, random, randint, seed as seed_py)
from sklearn.datasets import load_breast_cancer

RANDOM_STATE = 0xdecafbad
PROB_MISSING = 0.9

seed_py(RANDOM_STATE)

def should_i_do_it():
    return random() < PROB_MISSING

def stomp_indexes(x):
    options = list(range(len(x)))
    return choices(options,
                   # See NOTE below
                   weights=[6, 3, 1] * 10,
                   k=randint(0, len(x)))

def append_col(arr, newcol):
    assert len(arr) == len(newcol)
    return np.append(
        arr, newcol.reshape(len(newcol), 1), axis=1)

X, y = load_breast_cancer(return_X_y=True)

for x in X:
    if should_i_do_it():
        targets = stomp_indexes(x)
        x[targets] = np.nan

Xy = append_col(X, y)

**NOTE:** The weights here are a cycle of 30 values, one per feature, repeating `6`, `3`, and `1`. The achieved effect is three tiers of 10 features each. With the settings above, and judging by the columns visible in the `describe()` output below, the weight-`6` features are missing roughly 46-50% of their values, the weight-`3` features roughly 27-34%, and the weight-`1` features roughly 12-15%.

This creates a distinction between "good" features (with fewer missing values) and bad ones.
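
To see how the `weights` cycle turns into per-column missing rates, the sketch below repeats the same sampling scheme on a placeholder matrix of zeros and averages the rates by weight. The function name `missing_rate_by_weight`, the matrix shape, and the seed are assumptions made for illustration; they are not part of this notebook's pipeline. Note that `choices` samples with replacement, so a row can lose fewer than `k` distinct cells.

```python
import numpy as np
from random import Random

def missing_rate_by_weight(n_rows=2000, n_cols=30, prob_missing=0.9, seed=0):
    """Simulate the stomping scheme; return the mean missing rate per weight."""
    rng = Random(seed)                     # local RNG, leaves global state alone
    weights = [6, 3, 1] * (n_cols // 3)
    A = np.zeros((n_rows, n_cols))
    for row in A:                          # each row is a view into A
        if rng.random() < prob_missing:
            # choices() samples WITH replacement, so duplicates are possible
            targets = rng.choices(range(n_cols), weights=weights,
                                  k=rng.randint(0, n_cols))
            row[targets] = np.nan
    rates = np.isnan(A).mean(axis=0)       # per-column missing proportion
    w = np.array(weights)
    return {int(v): float(rates[w == v].mean()) for v in (6, 3, 1)}

rates = missing_rate_by_weight()
print(rates)
```

Changing `prob_missing` or the weight cycle in this sketch is a cheap way to preview a configuration before stomping on the real data.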

## Assessing the Damage

Below, I note a few summary statistics to give an idea of the distribution of missing values.

First, we note the proportion of values in each column which have been squashed.

In [2]:
import matplotlib.pyplot as plt

def proportion_na(col):
    return sum(np.isnan(col)) / len(col)

proportions = [proportion_na(col) for col in X.T]
ax = plt.gca()
ax.bar(range(len(proportions)), proportions)
ax.set_ylim([0, 1])
plt.show()

<Figure size 640x480 with 1 Axes>

Please review the above proportions and adjust `PROB_MISSING` and the `weights` keyword argument to `choices` to your preference.

Here, the `count` row shows the number of non-missing values in each column. Column `30` is the label `y`, which is never blanked, so its count equals the number of rows.

In [3]:
import pandas as pd
pd.DataFrame(Xy).describe()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 284.0 | 378.0 | 490.0 | 300.0 | 394.0 | 500.0 | 307.0 | 403.0 | 497.0 | 305.0 | ... | 292.0 | 391.0 | 485.0 | 294.0 | 414.0 | 496.0 | 290.0 | 400.0 | 499.0 | 569.0 |
| mean | 13.999511 | 19.36291 | 92.684816 | 640.283667 | 0.096308 | 0.105046 | 0.091598 | 0.049484 | 0.181258 | 0.063617 | ... | 25.634623 | 107.386957 | 889.244536 | 0.130969 | 0.259754 | 0.271677 | 0.114128 | 0.292258 | 0.084485 | 0.627417 |
| std | 3.548366 | 4.200937 | 25.110112 | 329.493834 | 0.014474 | 0.053688 | 0.079407 | 0.038586 | 0.02725 | 0.007578 | ... | 6.355213 | 33.92333 | 586.330346 | 0.022498 | 0.16058 | 0.207228 | 0.06504 | 0.064015 | 0.01839 | 0.483918 |
| min | 6.981 | 9.71 | 43.79 | 178.8 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.1167 | 0.05024 | ... | 12.49 | 50.41 | 223.6 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 | 0.0 |
| 25% | 11.6 | 16.4975 | 75.225 | 422.825 | 0.086002 | 0.065402 | 0.029805 | 0.02082 | 0.1619 | 0.05871 | ... | 21.16 | 83.64 | 513.1 | 0.1166 | 0.151375 | 0.115925 | 0.06551 | 0.2517 | 0.072045 | 0.0 |
| 50% | 13.255 | 18.885 | 86.545 | 549.8 | 0.09662 | 0.094035 | 0.06651 | 0.0335 | 0.1792 | 0.06194 | ... | 25.18 | 97.96 | 688.6 | 0.1301 | 0.2209 | 0.22655 | 0.09856 | 0.2826 | 0.0802 | 1.0 |
| 75% | 15.6225 | 21.87 | 106.825 | 751.325 | 0.1049 | 0.1306 | 0.1368 | 0.07449 | 0.1956 | 0.06739 | ... | 29.9025 | 127.75 | 1088.0 | 0.143475 | 0.3392 | 0.3814 | 0.1605 | 0.318925 | 0.09232 | 1.0 |
| max | 27.22 | 39.28 | 188.5 | 2010.0 | 0.1634 | 0.3454 | 0.3635 | 0.2012 | 0.304 | 0.09575 | ... | 47.16 | 229.3 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.2867 | 0.6638 | 0.2075 | 1.0 |


Below is the proportion of data objects with no missing values at all:

In [4]:
len([x for x in X if not any(np.isnan(x))]) / len(X)

0.14411247803163443
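
The same proportion can also be computed without a Python-level loop. The sketch below is one possible vectorized form, shown on a small made-up array; the helper name `proportion_complete` and the toy values are illustrative assumptions.

```python
import numpy as np

def proportion_complete(A):
    """Fraction of rows in a 2-D float array that contain no NaN."""
    has_nan = np.isnan(A).any(axis=1)   # one boolean per row
    return float((~has_nan).mean())

toy = np.array([[1.0, 2.0],
                [np.nan, 3.0],
                [4.0, 5.0],
                [6.0, np.nan]])
print(proportion_complete(toy))  # 0.5 (two of the four rows are complete)
```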

## Dealing with it

The five strategies I'm going to try are:

1. *Drop rows with missing data:* it's exactly what it sounds like
2. *Mean/mode:* fill missing cells with the mean (or, if we had categorical attributes, the mode) of the present values of the column
3. *Conditional mean/mode:* same as (2) but only take the mean of rows which share your label
4. *Hot-decking:* use a distance metric to find the closest row which has a value in your missing column, and use that
5. *KNN:* same as hot-deck but *K* > 1.

### 1. Drop rows with missing values

Since (1) affects the shape of `X`, there is also a little extra handling that needs to be done for `y`. Filtering the combined array `Xy` keeps each label with its row:

In [6]:
def filter_na(A):
    combined = np.array([
        x for x in A if not any(np.isnan(x))])
    return combined

Xy_dropped = filter_na(Xy)
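
If `X` and `y` are kept as separate arrays, a boolean mask achieves the same row-dropping while keeping the two aligned. This is a minimal sketch on made-up data; `drop_na_rows`, `X_toy`, and `y_toy` are hypothetical names, not objects defined elsewhere in this notebook.

```python
import numpy as np

def drop_na_rows(X, y):
    """Return the rows of X with no NaN, together with their labels."""
    complete = ~np.isnan(X).any(axis=1)   # True where the row is fully observed
    return X[complete], y[complete]

X_toy = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [4.0, 5.0]])
y_toy = np.array([0, 1, 1])

X_kept, y_kept = drop_na_rows(X_toy, y_toy)
print(X_kept.shape, y_kept)
```

Applying one mask to both arrays is what prevents labels from drifting out of step with their rows.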

The rest of the strategies don't change `y`, but some need to consider it.

### 2. Fill with column mean

In [12]:
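def fill_mean(col):
    # Replace this column's NaNs with the mean of its observed values
    col[np.isnan(col)] = np.nanmean(col)
    return col

X_mean = np.copy(X)
# Keep the returned array rather than relying on in-place mutation of views
X_mean = np.apply_along_axis(fill_mean, 0, X_mean)
Xy_mean = append_col(X_mean, y)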
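
For comparison, column-mean filling can also be written as a single broadcasted expression. The sketch below assumes a 2-D float array in which every column has at least one observed value; `fill_column_means` and the toy array are illustrative names only.

```python
import numpy as np

def fill_column_means(A):
    """Return a copy of A with each NaN replaced by its column's observed mean."""
    col_means = np.nanmean(A, axis=0)            # per-column mean, ignoring NaN
    return np.where(np.isnan(A), col_means, A)   # means broadcast across rows

toy = np.array([[1.0, np.nan],
                [3.0, 4.0],
                [np.nan, 8.0]])
filled = fill_column_means(toy)
print(filled)
```

Because `np.where` builds a new array, the input is left untouched, which makes it easy to compare the filled and unfilled versions side by side.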

### 3. Conditional fill with column mean

In [None]:
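def fill_cond_mean(A):
    # Per-class column means over the observed values only. The label is the
    # last column. Computing the means once up front keeps already-filled
    # cells from feeding back into later means.
    class_means = {
        label: np.nanmean(A[A[:, -1] == label], axis=0)
        for label in np.unique(A[:, -1])}
    for row in A:
        for j, v in enumerate(row):
            if np.isnan(v):
                row[j] = class_means[row[-1]][j]
    return A

Xy_cond = fill_cond_mean(np.copy(Xy))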
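
A toy example makes the difference from the unconditional mean concrete. In the sketch below, which uses hypothetical names (`fill_class_means`, `X_toy`, `y_toy`) and made-up values, the global mean of the single feature is about 4.67, but each missing cell instead receives the mean of its own class.

```python
import numpy as np

def fill_class_means(X, y):
    """Return a copy of X with NaNs replaced by the column mean within each class."""
    out = X.copy()
    for label in np.unique(y):
        in_class = y == label
        class_means = np.nanmean(X[in_class], axis=0)   # observed values only
        block = out[in_class]                           # copy of this class's rows
        out[in_class] = np.where(np.isnan(block), class_means, block)
    return out

X_toy = np.array([[1.0], [3.0], [np.nan], [10.0], [np.nan]])
y_toy = np.array([0, 0, 0, 1, 1])

filled = fill_class_means(X_toy, y_toy)
print(filled.ravel())
```

The class-0 gap is filled with 2.0 and the class-1 gap with 10.0. Note that this strategy uses the label during imputation, so it can only be applied where labels are known.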

### 4. Hot-decking
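
One way the hot-deck idea from the strategy list could look is sketched below: for each missing cell, copy the value from the nearest row that has that column observed. Several choices here are assumptions rather than anything fixed by this notebook: the distance is a root-mean-square difference over the coordinates observed in both rows, features are not rescaled (so large-magnitude columns dominate), and the names `shared_distance` and `fill_hot_deck` are hypothetical.

```python
import numpy as np

def shared_distance(a, b):
    """Root-mean-square difference over the coordinates observed in both rows."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return np.inf                      # nothing to compare on
    return float(np.sqrt(np.mean((a[shared] - b[shared]) ** 2)))

def fill_hot_deck(A):
    """Return a copy of A where each NaN takes the value of the nearest donor row."""
    out = A.copy()
    for i, row in enumerate(A):
        for j in np.flatnonzero(np.isnan(row)):
            # Donors are rows with column j observed (never the row itself)
            donors = np.flatnonzero(~np.isnan(A[:, j]))
            if donors.size == 0:
                continue                   # no donor available; leave the NaN
            dists = [shared_distance(row, A[d]) for d in donors]
            out[i, j] = A[donors[int(np.argmin(dists))], j]
    return out

toy = np.array([[1.0, 1.0, np.nan],
                [1.1, 0.9, 5.0],
                [9.0, 9.0, 50.0]])
filled = fill_hot_deck(toy)
print(filled)
```

Distances are computed against the original array `A`, not the partially filled copy, so imputed values never influence later donor choices.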

### 5. KNN

In [30]:
from functools import partial
from heapq import nsmallest
from scipy.spatial.distance import euclidean

def euclidean_with_nan(j, x, y):
    # Distance between x and y, leaving out column j (the one being filled)
    def skip(k, a):
        return a[[i for i, _ in enumerate(a) if i != k]]
    x, y = skip(j, x), skip(j, y)
    # Don't consider pairs with mismatching missing vals
    if not (np.isnan(x) == np.isnan(y)).all():
        return float('inf')
    return euclidean(x[~np.isnan(x)], y[~np.isnan(y)])

def knn(A, x, k=5, skip=None):
    # Returns a list of the k rows of A closest to x
    return nsmallest(
        k, A, key=partial(euclidean_with_nan, skip, x))

def fill_knn(A, k=5):
    for row in A:
        for j, v in enumerate(row):
            if np.isnan(v):
                # Only rows with a value in column j can supply one; this
                # also keeps `row` itself out of its own neighbors
                donors = A[~np.isnan(A[:, j])]
                neighbors = np.array(knn(donors, row, k, j))
                row[j] = neighbors[:, j].mean()
    return A

In [32]:
knn(X, X[1])

[array([      nan, 1.777e+01, 1.329e+02,       nan,       nan, 7.864e-02,
              nan, 7.017e-02,       nan,       nan, 5.435e-01, 7.339e-01,
              nan, 7.408e+01,       nan, 1.308e-02,       nan, 1.340e-02,
        1.389e-02,       nan, 2.499e+01, 2.341e+01,       nan, 1.956e+03,
        1.238e-01, 1.866e-01, 2.416e-01,       nan,       nan, 8.902e-02]),
 array([1.799e+01, 1.038e+01, 1.228e+02, 1.001e+03, 1.184e-01, 2.776e-01,
        3.001e-01, 1.471e-01, 2.419e-01, 7.871e-02, 1.095e+00, 9.053e-01,
        8.589e+00, 1.534e+02, 6.399e-03, 4.904e-02, 5.373e-02, 1.587e-02,
        3.003e-02, 6.193e-03, 2.538e+01, 1.733e+01, 1.846e+02, 2.019e+03,
        1.622e-01, 6.656e-01, 7.119e-01, 2.654e-01, 4.601e-01, 1.189e-01]),
 array([1.969e+01, 2.125e+01, 1.300e+02, 1.203e+03, 1.096e-01, 1.599e-01,
        1.974e-01, 1.279e-01, 2.069e-01, 5.999e-02, 7.456e-01, 7.869e-01,
        4.585e+00, 9.403e+01, 6.150e-03, 4.006e-02, 3.832e-02, 2.058e-02,
        2.250e-02, 4.571e-03, 2.35