# How random are the NaNs?

I wanted to take a look at the distribution of the missing values to see if they are completely random.

Primarily I'm interested in the feasibility of synthesizing subsets of test data that would evaluate comparably to the leaderboard (LB) metric. So I will investigate:

- Is the distribution of NaNs in the data random?
- If so, how sensitive is the metric to our synthesized test data?
- Are eval metrics on the test subsets comparable to the LB metric?

TLDR:
- NaNs appear to be distributed completely at random.
- Taking a subset of the complete records and randomly generating missing values produces evaluation results that align well with leaderboard results.


In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import seaborn as sns

from scipy.stats import binom
import matplotlib.pyplot as plt

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

# does this speed imputation up?
from sklearnex import patch_sklearn
patch_sklearn()

RANDOM_STATE=42
INPUT_PATH = Path('../input/tabular-playground-series-jun-2022')

pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.3f}'.format)

np.random.seed(RANDOM_STATE)

In [None]:
# row_id and the 25 F_2 columns (F_2_0 .. F_2_24) are integers
dtypes = {'row_id': 'int',
          **{f'F_2_{i}': 'int' for i in range(25)}}

data = pd.read_csv(INPUT_PATH / 'data.csv', 
                   index_col='row_id',
                   dtype = dtypes)

submission = pd.read_csv(INPUT_PATH / 'sample_submission.csv', 
                         index_col='row-col')

In [None]:
def cols_by_prefix(columns, prefix):
    return [x for x in columns if x.startswith(prefix)]

cols_f1 = cols_by_prefix(data.columns, 'F_1')
cols_f2 = cols_by_prefix(data.columns, 'F_2')
cols_f3 = cols_by_prefix(data.columns, 'F_3')
cols_f4 = cols_by_prefix(data.columns, 'F_4')
cols_f134 = cols_f1 + cols_f3 + cols_f4

data_f134 = data[cols_f134]
data_f1 = data[cols_f1]
data_f2 = data[cols_f2]
data_f3 = data[cols_f3]
data_f4 = data[cols_f4]

# What's Missing

Let's examine the data and the column groups to get a sense for what's missing.

- We have a million rows. 
- Column groups F_1, F_3, F_4 are all floating point, and every column in them has some missing values.
- The F_2 columns appear to be ordinals or categoricals of some kind, and have no missing values.

In [None]:
data_f1.describe()

In [None]:
data_f2.describe()

In [None]:
data_f3.describe()

In [None]:
data_f4.describe()

# Hypothesis - values are missing at random

In [None]:
print(f'total missing values: {data.isna().sum().sum()}')

We have 1,000,000 NaNs, an average of 1 per row. Since 55 columns contain NaNs, our hypothesis is that each cell in those columns is missing completely at random, with probability $1/55$.

Let's look at the correlation between features that have missing values - here we are just looking at correlating missingness, so we replace the values of the features with a missing indicator.

Let's also look at correlation of missing values with the values in the F_2 columns, to determine if particular values of those columns induce missingness in the other columns.

Conclusion: Missing values don't appear to be correlated with each other, or with the values in the F_2 columns.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10,8))

temp = pd.concat([data_f134.isnull().astype(int), data[cols_f2]], axis=1)
corr = temp.corr()
sns.heatmap(corr, vmin=-1, vmax=1,  cmap='BrBG')

plt.show()

# Examine the column distributions

If every cell is randomly NaN with probability $1/55$, then we can view each column as a binomial random variable, distributed as $Binom(1000000, 1/55)$

Per the binomial distribution, this gives:
$$
\begin{aligned}
n &= 1000000, \quad p = 1/55 \\
\mu &= np \approx 18181.8182 \\
\sigma^2 &= np(1-p) \approx 17851.2397 \\
\sigma &= \sqrt{\sigma^2} \approx 133.6085
\end{aligned}
$$

What do our sample mean and sample variance look like?
- the mean matches exactly, by construction: 1,000,000 NaNs spread over 55 columns always average $1000000/55$ per column
- the variance and standard deviation are reasonably close to our hypothesis



In [None]:
data_f134.isna().sum().agg(['mean', 'var', 'std'])

What does this binomial distribution look like?

In [None]:
p = 1/55
n = 1E6

rv = binom(n,p)
mean, var = rv.stats()
mean = mean[()]
var = var[()]
std = var**.5

fig, ax = plt.subplots(1, 1, figsize=(10,8))

x = np.arange(binom.ppf(0.005, n, p),
              binom.ppf(0.995, n, p))

ax.plot(x, binom.pmf(x, n, p), 'bo', ms=5, label='pmf')
ax.legend(loc='best', frameon=False)
plt.title(f'binom(1E6, 1/55) [mean:{mean:.2f} std:{std:.2f}]')

plt.show()


Note this looks normal! When $n$ is large, we can approximate a binomial distribution with a normal distribution.

So in our case, we have 55 RVs, which we hypothesize are distributed approximately $N(\mu = 18181.82, \sigma^2 = 17851.24)$, i.e. with standard deviation $\sigma \approx 133.61$.

We expect to see the count of missing values by column clustered around the mean, with a spread of a few hundred (since our standard deviation is $\approx 133$)

Let's plot the missing value counts by column, and also look at a histogram. 
- We see the values are indeed clustered around the mean.
- The histogram doesn't look entirely normal, but remember our sample size here is only 55.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(14,6))

ax[0].set_title("NaN column counts")
data_f134.isna().sum().plot(ax=ax[0], ylim=(0,20000))


ax[1].set_title("NaN column counts - histogram")
na_col_counts = data_f134.isna().sum()
na_col_counts.hist(ax=ax[1])

plt.show()

What else can we do to confirm the column counts are indeed drawn from a normal distribution?

Well, if it's normal, we expect ~68% of values to fall within 1-sigma, ~95% within 2-sigma, and ~99.7% within 3-sigma. 

Looks pretty close!


In [None]:
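# NaN count per column (axis=0 sums down the rows of each column)
na_col_counts = data_f134.isna().sum(axis=0)

# share of columns whose count falls within i standard deviations of the mean
for i in range(1, 4):
    sigma_lb = mean - i * std
    sigma_ub = mean + i * std
    pct = na_col_counts.between(sigma_lb, sigma_ub).sum() / na_col_counts.size
    print(f"{i}-sigma: {pct:.2f}")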


_Conclusion_: the per-column NaN counts do indeed appear to be drawn from the same binomial distribution.

_Notes_: 
- I think we could test each column's count against the hypothesized population mean.
- I think we could do a hypothesis test on the sample variance, since the binomial distribution is approximately normal for large $n$.
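
As a rough sketch of the first note, the block below runs a two-sided z-test per column rather than a t-test, because each column contributes a single count and the variance is fixed by the hypothesis. The counts are simulated (`sim_counts` is a stand-in for `data_f134.isna().sum()`), and the Bonferroni correction and `alpha = 0.05` are my own assumptions, not something taken from the competition.

```python
import numpy as np
from statistics import NormalDist

n_rows, n_cols = 1_000_000, 55
p_missing = 1 / n_cols

# Mean and standard deviation of a column count under the hypothesis
mu = n_rows * p_missing
sigma = (n_rows * p_missing * (1 - p_missing)) ** 0.5

# Simulated stand-in for data_f134.isna().sum(): one NaN count per column
rng = np.random.default_rng(42)
sim_counts = rng.binomial(n_rows, p_missing, size=n_cols)

# Two-sided z-test of each column count against the hypothesized mean
z = (sim_counts - mu) / sigma
std_normal = NormalDist()
p_values = np.array([2 * (1 - std_normal.cdf(abs(float(v)))) for v in z])

# Bonferroni correction (an assumption): 55 tests, so compare to alpha / 55
alpha = 0.05
n_rejected = int((p_values < alpha / n_cols).sum())
print(f"columns rejected at corrected alpha: {n_rejected} of {n_cols}")
```

On the real data, swapping `sim_counts` for the actual column counts would be the only change. Note that the real counts sum to exactly 1,000,000, so they are not perfectly independent.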

# Examine Row Distributions
If we examine row distributions, we expect the number of missing values per row to be distributed $Binom(55, 1/55)$. Note our sample size is now 1000000.

For small $n$, the binomial distribution doesn't look anything like a normal distribution. But since we have such a large sample, we can compare the observed proportions directly with the theoretical PMF.
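
To make the expected proportions concrete, here is a minimal sketch that evaluates the binomial PMF by hand for the first few values. The helper name `binom_pmf` is hypothetical and just mirrors what `scipy.stats.binom.pmf` computes.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_cols = 55          # columns that can contain a NaN
p_missing = 1 / 55   # hypothesized per-cell missing probability
n_rows = 1_000_000

# Expected share and count of rows with k missing values, for k = 0..4
expected = {k: binom_pmf(k, n_cols, p_missing) for k in range(5)}
for k, prob in expected.items():
    print(f"{k} missing: p={prob:.4f}, expected rows={prob * n_rows:,.0f}")
```

Under this hypothesis roughly 36% of rows have no missing values at all, which is where the complete rows used later come from.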



In [None]:
n = 55

fig, ax = plt.subplots(1, 2, figsize=(14,6))

x = np.arange(binom.ppf(0.000000001, n, p),
              binom.ppf(0.999999999, n, p))

# plot the PMF
ax[0].plot(x, binom.pmf(x, n, p), 'bo', ms=8)
ax[0].set_title("PMF - binom(55,1/55)")

# plot the sample proportions
proportions = data.isnull().sum(axis=1).value_counts() / 1E6
ax[1].plot(proportions, 'r+', ms=8)
ax[1].set_title("Missing value proportions")
plt.show()

Looks like a perfect match!

Conclusions:
- Missing values appear uncorrelated with each other
- Column distributions match expectations
- Row distributions match expectations
- Values are missing completely at random in the F_1, F_3, F_4 column groups, with probability $1/55$

# Why does this matter? Testing!

So why does this matter? Well, if values are missing completely at random, perhaps we can use the rows with no missing values to synthesize our own test data for imputation. 

And since we know what the values actually are, we can use the test metric to evaluate our imputation without having to submit to the competition.

Also, if using a subset of the data is predictive of the LB score, then our training/evaluation can go faster.

So the basic idea is:
- use the ~365000 rows with no missing values
- randomly generate a set of missing values 
- run a couple of imputation techniques on this synthetic data
- is the delta in performance reflective of the delta in performance on the full dataset / leaderboard?
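
The steps above can be sketched end to end on toy data. Everything here is hypothetical (random normal data, 5 columns, mean imputation done by hand in NumPy); it is only meant to show the mask, impute, score loop, and that scoring over all cells gives the same RMSE as scoring over the masked cells alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "complete" data: 1,000 rows x 5 float columns
truth = rng.normal(size=(1_000, 5))

# Knock out cells completely at random, each with probability p
p = 1 / 5
mask = rng.random(truth.shape) < p
observed = np.where(mask, np.nan, truth)

# Mean imputation: fill each NaN with the mean of the observed cells in its column
col_means = np.nanmean(observed, axis=0)
imputed = np.where(mask, col_means, observed)

# RMSE computed over the masked cells only
rmse_masked = np.sqrt(np.mean((imputed[mask] - truth[mask]) ** 2))

# Same value from the squared error summed over *all* cells, divided by the
# number of masked cells: unmasked cells are untouched and contribute zero
rmse_all = np.sqrt(((imputed - truth) ** 2).sum() / mask.sum())

print(f"RMSE (masked cells): {rmse_masked:.4f}")
print(f"RMSE (all cells / n masked): {rmse_all:.4f}")
```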


In [None]:
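def make_training(df, n, p, random_state):
    """Return (complete rows, copy with synthetic NaNs, number of NaNs added)."""
    # keep only rows with *no* NaN; optionally sample n of them (n <= 0 keeps all)
    df = df[~df.isnull().any(axis=1)]
    if n > 0:
        df = df.sample(n=n, random_state=random_state)

    # random mask of NaN locations, each cell with probability p; only cols F_1*, F_3*, F_4*
    # note: the mask uses NumPy's global RNG (seeded at the top), not random_state
    mask = np.random.random(df[cols_f134].shape) < p
    df_na = df[cols_f134].mask(mask)

    # put it back together with F_2*; rmse() below assumes this matches df's column order
    df_na = pd.concat([df_na[cols_f1], df[cols_f2], df_na[cols_f3], df_na[cols_f4]], axis=1)
    return df, df_na, df_na.isna().sum().sum()

def rmse(df1, df2, n):
    """RMSE over the n masked cells; unmasked cells are unchanged and add zero error."""
    sse = (df1.to_numpy() - df2.to_numpy())**2
    return (sse.sum()/n)**0.5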


## Mean Imputer

Here we run sklearn's SimpleImputer. I've tested this in the competition with the full dataset:
- __LB=1.41613__

RMSE on our synthetic test set is close!

In [None]:
%%time

train, train_na, na_count = make_training(data, -1, p, RANDOM_STATE)
imputer = SimpleImputer(strategy="mean")
train_na[:] = imputer.fit_transform(train_na)
print(f'RMSE={rmse(train, train_na, na_count)}')

## Iterative Imputer

Here we run sklearn IterativeImputer, restricting to 3 iterations. I've also tested this on the leaderboard:
- __LB=0.99623__

Once again, RMSE on the synthetic test data is close!

In [None]:
%%time

train, train_na, na_count = make_training(data, -1, p, RANDOM_STATE)
imputer = IterativeImputer(verbose=2, max_iter=3, random_state=RANDOM_STATE)
train_na[:] = imputer.fit_transform(train_na)
print(f'RMSE={rmse(train, train_na, na_count)}')

## Variability

Perhaps we just got lucky. Here we'll take 3 folds (random samples) of 200,000 records from the complete rows, and compare the RMSE of all 3.

Results seem pretty consistent, which is promising.
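
One caveat: with only about 365,000 complete rows, three samples of 200,000 necessarily share rows. If strictly disjoint folds were wanted, a sketch along these lines would work, at the cost of smaller folds. The index here is a hypothetical stand-in; in the notebook it would be the index of rows where `data.isnull().any(axis=1)` is False.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the index of complete rows (about 365,000 of them)
n_complete = 365_000
complete_idx = np.arange(n_complete)

# Shuffle once, then cut into 3 non-overlapping folds
shuffled = rng.permutation(complete_idx)
folds = np.array_split(shuffled, 3)

for i, fold in enumerate(folds):
    print(f"fold {i}: {fold.size:,} rows")
```

Each fold could then be passed through the same masking and imputation steps, for example via `data.loc[fold]`.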

In [None]:
imputer = IterativeImputer(verbose=2, max_iter=3, random_state=RANDOM_STATE)
rmses = []
for i in range(3):
    # vary the seed so each iteration samples a different set of rows
    train, train_na, na_count = make_training(data, 200000, p, RANDOM_STATE + i)
    train_na[:] = imputer.fit_transform(train_na)
    rmses.append(rmse(train, train_na, na_count))
    print(f'RMSE={rmses[i]}')
print(f'RMSE mean over folds: {np.mean(rmses)}')

# Conclusions

- Values are missing completely at random in column groups F_1, F_3, F_4, with probability $1/55$
- We can create synthetic test data by randomizing missing values in the complete records of the full dataset
- This synthetic test data produces evaluation results that seem a good proxy for the full dataset results on the leaderboard

