# Mystery Behind Null Values

It turns out that null values are not scattered randomly through the data but seem to follow a pattern, so I am going to explore this a little further.

This kind of makes sense: when you access market data, you either have the data for almost every time period or you don't, and there is rarely anything in between, unless you have multiple markets or securities. Different markets and securities may be missing different pieces of data, so in this notebook I try to categorize these markets or securities based on their null values and see if they have a significantly different distribution.

In [None]:
# Importing Necessary Libraries
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
pd.options.display.max_columns = 999

In [None]:
# Every column is float16 except date (int16), feature_0 (int8) and ts_id (int32).
dtypes = {
    'date': 'int16',
    'weight': 'float16',
    **{col: 'float16' for col in ['resp_1', 'resp_2', 'resp_3', 'resp_4', 'resp']},
    'feature_0': 'int8',
    **{f'feature_{i}': 'float16' for i in range(1, 130)},
    'ts_id': 'int32'
}

In [None]:
train = pd.read_feather('../input/fast-reading-w-pickle-feather-parquet-jay/jane_street_train.feather')
train = train.astype(dtypes)

In [None]:
train

Now, let's first create a column that has the number of missing values in that row, and then see a summary of train data.

In [None]:
train['numNull'] = train.isnull().sum(axis=1)
train

If you scroll to the end, you will see a new column 'numNull', which is just the number of null values in the row. We can already see that there is some order: the first few rows all have 31 null values, and the last few have 0. Let's list all of the distinct null counts per row.

In [None]:
train['numNull'].unique()

The number of unique values is very low considering the massive size of the dataset, but it is still more than I expected. I think a few of these might just be outliers with only a handful of rows in them, so they would be insignificant. Let's first find the number of rows for each null count.

In [None]:
b = train.groupby('numNull').count()
b = b.reset_index()
b

Just what we expected. A lot of null counts occur in fewer than 100 rows (their row counts have fewer than three digits). What I am going to do now is drop every null count that has fewer than 100 rows. Anything with more rows is significant and of interest.

Side note: We can see that most rows have no null values. So any information that comes out of this null value analysis would probably only lead to a small score boost, nothing like a major breakthrough!

In [None]:
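# Mean of every column for each null count
a = train.groupby('numNull').mean()
a = a.reset_index()
# Entries kept after dropping the null counts with fewer than 100 rows.
# Note: after reset_index(), .loc selects by the 0-based row label of `a`.
significantNulls = [0, 1, 2, 3, 4, 6, 7, 12, 13, 14, 15, 16, 18, 19, 21, 25, 26, 27, 35, 36, 39, 40, 42]
a = a.loc[significantNulls]
a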

Wow! Now inspecting this, we already see that there is a difference in distribution for some features, based on the number of null values in a row. 
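
The list of significant null counts above is hard-coded from reading the counts table. A minimal sketch of deriving it from a row-count threshold instead, on a tiny made-up frame (`toy`, `min_rows` and the column names are assumptions for illustration, not part of the competition data). It indexes by the numNull value itself, which avoids depending on row positions after `reset_index()`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train`: three features with some missing values (illustrative only).
toy = pd.DataFrame({
    "f1": [np.nan, np.nan, 1.0, 2.0, 3.0, 4.0],
    "f2": [np.nan, np.nan, 1.0, 2.0, 3.0, np.nan],
    "f3": [0.5, 0.1, 0.2, 0.3, 0.4, 0.6],
})
toy["numNull"] = toy.isnull().sum(axis=1)

# Number of rows for each null count, indexed by the numNull value.
rows_per_count = toy["numNull"].value_counts().sort_index()

min_rows = 2  # hypothetical threshold; the notebook uses 100 on the full data
significant = rows_per_count[rows_per_count >= min_rows].index.tolist()

# groupby keeps numNull as the index, so .loc selects by null count, not by position.
means = toy.groupby("numNull").mean().loc[significant].reset_index()
print(significant)
print(means)
```

On the real data you would swap `toy` for `train` and set `min_rows = 100`.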

# Investigating Avg Resp and Weight

Let's make some plots and see this further. The first set of plots I am going to make is just the average weight and Resp (1, 2, 3, 4 and normal) for each null count. Let's see if there is any difference in the average return.

In [None]:
fig, ax = plt.subplots(3, 2, figsize=(15, 15))
# One bar chart per column, laid out on a 3x2 grid
for j, col in enumerate(['weight', 'resp_1', 'resp_2', 'resp_3', 'resp_4', 'resp']):
    ax[j // 2, j % 2].bar(a['numNull'], a[col])
    ax[j // 2, j % 2].set_title("Average " + col + " for Different Null Counts")
plt.show()

So the average weight does have some variation, but it doesn't seem to be related to the null counts.
The resp features on the other hand are quite interesting. Some null counts have a significantly negative return on average, whereas some have a positive return. This could be a very helpful bit of information when training a model. 

Now, let's look at the distribution of weight and the resp features in train for each null count (so far we only looked at the average, now we're looking at the full distribution). I deliberately made the plots big so that you can see lines of all colors.

In [None]:
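# Note: sns.distplot is deprecated since seaborn 0.11;
# sns.histplot(..., kde=True) or sns.kdeplot are the suggested replacements.
for col in ['weight', 'resp_1', 'resp_2', 'resp_3', 'resp_4', 'resp']:
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    label = []
    # Overlay one distribution per null count on the same axes
    for null in a['numNull'].values:
        nullData = train[train['numNull'] == null]
        sns.distplot(nullData[col], ax=ax)
        label.append(null)
    fig.legend(labels=label)
    plt.show()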

So there is some variability between the null counts, especially with the brown line (numNull = 28), which has an interesting flattish hump. These also look like the returns of different securities: they follow a similar distribution but have different skewness. This supports the notion that the null counts might have something to do with different markets or securities.
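
Eyeballing skewness from overlapping curves is hard, so here is a rough sketch of putting numbers on it with a per-group summary. The data below is synthetic (`toy` and its three groups are assumptions made up for the example), so it only shows the mechanics, not any result about the real train set:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: three null-count groups with differently shaped returns.
toy = pd.DataFrame({
    "numNull": np.repeat([0, 1, 2], 500),
    "resp": np.concatenate([
        rng.normal(0.0, 1.0, 500),        # roughly symmetric
        rng.exponential(1.0, 500) - 1.0,  # right-skewed, centered near 0
        rng.normal(0.0, 2.0, 500),        # symmetric but wider
    ]),
})

# Mean, spread and skewness of resp for each null count.
summary = toy.groupby("numNull")["resp"].agg(["mean", "std", "skew"])
print(summary)
```

Running the same `groupby(...).agg(...)` on `train` (restricted to the significant null counts) would give one row per null count to compare against the plots.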

# Variations in Train
Now let's see if the training data has any general distinction based on null counts. I will use PCA here since the data is quite high dimensional.

In [None]:
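from sklearn.decomposition import PCA

# Keep only the feature columns (drop targets, ids and the helper column)
nonFeatureCols = ['date', 'weight', 'resp_1', 'resp_2', 'resp_3', 'resp_4', 'resp', 'ts_id', 'numNull']
cols = [col for col in train.columns if col not in nonFeatureCols]

# PCA cannot handle missing values, so fill them with 0
X = train[cols]
X = X.fillna(0)

# Project the features onto the first two principal components
pca = PCA(n_components = 2)
y = pca.fit_transform(X)

# One scatter plot per significant null count on a 6x4 grid
fig, ax = plt.subplots(6, 4, figsize=(15,15))
for i, null in enumerate(a['numNull'].values):
    c = y[(train['numNull'] == null).values]
    ax[i // 4, i % 4].scatter(c[:, 0], c[:, 1])
    ax[i // 4, i % 4].set_title("NumNull = " + str(null))
plt.show()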

So it seems like most of the data is leaning towards a similar trend, but there are significant variations. NumNull = 1, 76, 16 look like blobs, rather than having the tail that many others have. 
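
One caveat with reading shapes off a two-component projection is that it only means much if those two components capture a decent share of the variance. A small sketch of how to check that with scikit-learn's `explained_variance_ratio_`, using random toy data (`X_toy`, `latent` and `loadings` are made up for the example):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy data: 200 rows, 10 features, driven mostly by two latent factors plus noise.
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 10))
X_toy = latent @ loadings + 0.1 * rng.normal(size=(200, 10))

pca = PCA(n_components=2)
proj = pca.fit_transform(X_toy)

# Share of total variance captured by the two components kept.
explained = pca.explained_variance_ratio_.sum()
print(f"Variance explained by 2 components: {explained:.3f}")
```

On the real features the share will depend on the data; if it is low, differences between the scatter plots should be read with more caution.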

So it may even make sense to build a separate model for each null count, but that doesn't seem practical given the timing constraints and the fact that some null counts have a low number of data points.

# Final Notes

So this was a short look at the missing values. The most important piece of information that this gives us (I think) is that we are probably dealing with multiple securities / markets over here, rather than a single homogeneous index.

Some other things that could be added to the feature engineering stage:
- Add a feature that is just the number of null values in the row. Complex models might be able to understand the variation in the training data due to null counts.
- Add more features for the average weight and resp for that null count (beware that this is leaking some information, so may lead to some amount of overfitting. You could try adding some noise here to make it more robust).
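
A minimal sketch of both ideas on a made-up frame. The names `toy`, `feature_cols`, `resp_by_numNull` and `noise_scale` are assumptions for illustration, and the noise level is arbitrary rather than tuned:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy stand-in for train (illustrative values only).
toy = pd.DataFrame({
    "feature_1": [np.nan, 0.2, np.nan, 0.4, 0.5, 0.6],
    "feature_2": [np.nan, 1.0, 2.0, 3.0, np.nan, 5.0],
    "resp": [0.01, -0.02, 0.03, 0.00, -0.01, 0.02],
})
feature_cols = ["feature_1", "feature_2"]

# Idea 1: the null count of the row as a feature.
toy["numNull"] = toy[feature_cols].isnull().sum(axis=1)

# Idea 2: average resp for that null count (this uses the target, so it leaks).
mean_resp = toy.groupby("numNull")["resp"].mean()
toy["resp_by_numNull"] = toy["numNull"].map(mean_resp)

# Add Gaussian noise so the model cannot lean on the exact group mean.
noise_scale = 0.001  # hypothetical value
toy["resp_by_numNull_noisy"] = toy["resp_by_numNull"] + rng.normal(0.0, noise_scale, size=len(toy))
print(toy)
```

In a real pipeline the group means should come from the training folds only and then be mapped onto validation and test rows, otherwise the leak mentioned above gets worse.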

There is one more interpretation of the null counts, which is that they tell us the time of day (link to discussion: https://www.kaggle.com/c/jane-street-market-prediction/discussion/201264). I am not too sure about this, since there are many days where such a pattern is not repeated, which I think can only be explained by there being multiple markets / securities, but these are just my thoughts.
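
One way to probe the time-of-day idea is to look at the null count against each row's relative position within its day. A rough sketch on synthetic data where the first rows of every day carry nulls by construction (`toy`, `pos_in_day`, `day_bucket` and the bucket edges are all assumptions for the example):

```python
import numpy as np
import pandas as pd

# Synthetic data: 3 days of 10 rows each; the first 2 rows of every day have nulls.
toy = pd.DataFrame({
    "date": np.repeat([0, 1, 2], 10),
    "ts_id": np.arange(30),
    "numNull": np.tile([5, 5, 0, 0, 0, 0, 0, 0, 0, 0], 3),
})

# Relative position of each row within its day, in (0, 1].
toy["pos_in_day"] = toy.groupby("date")["ts_id"].rank(pct=True)

# Coarse buckets for the start, middle and end of the day.
toy["day_bucket"] = pd.cut(
    toy["pos_in_day"], bins=[0, 0.25, 0.65, 1.0], labels=["open", "mid", "close"]
)

# Average null count per bucket across all days.
profile = toy.groupby("day_bucket", observed=False)["numNull"].mean()
print(profile)
```

If the same profile on `train` looked very different from day to day, that would fit the multiple markets / securities reading better than a pure time-of-day effect.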

Thanks for going through it and please do upvote if you liked it.
If you want a more general understanding and analysis of the data, check out my EDA notebook: https://www.kaggle.com/yushg123/a-walk-down-jane-street-eda-baseline
