<a id="top"></a>

# NaN Values Analysis Jane Street

<img src="https://www.europol.europa.eu/sites/default/files/images/finance_budget.jpg">


The notebook is divided into the following sections:

- [*0.0 Imports and Data Loading*](#imports)<br>
- [*1.0 Top 40 Features for number of NaNs*](#top40)<br>
- [*1.1 NaN values based on daily_ts_id, for the first 30000 trades*](#30k)<br>
- [*1.2 Distribution of Time Of Day where NaNs Occur for each feature*](#timeofday)<br>
- [**Take on NaNs vs Time of Day**](#nan_vs_timeofday)<br>
- [*2.0 Market basket analysis on Sets of NaN Features*](#market_basket)<br>
- [*2.1 Distribution of feature itemsets*](#feature_itemsets)<br>
- [**Take on Features having the same NaN behaviour**](#take_market_basket)<br>



Please let me know your opinion in the comments! Glad to improve!

## TLDR: 

- For most features with NaN values, the NaNs occur in the first trades of each day

- Groups of features have NaNs on the same trades (Market Basket Analysis)

<a id="imports"></a>
# Imports and Data Loading

Here we just install the necessary packages (datatable), import them, define helper functions for later use and load the train.csv data.

In [None]:
!pip install datatable

In [None]:
import numpy as np
import pandas as pd
import tqdm
import itertools
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
plt.style.use('fivethirtyeight')
import seaborn as sns
pd.options.display.max_columns = 200
import os
import gc
import re
import datatable as dt

def chunks(l, n):
    """ Yield n successive chunks from l.
    """
    newn = int(len(l) / n)
    for i in range(0, n-1):
        yield l[i*newn:i*newn+newn]
    yield l[n*newn-newn:]
    
input_path = '/kaggle/input/'
root_path = os.path.join(input_path, 'jane-street-market-prediction')

In [None]:
# Memory saving function credit to https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype.name

        if col_type not in ['object', 'category', 'datetime64[ns, UTC]']:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

In [None]:
%%time

train = (dt.fread(os.path.join(root_path, "train.csv")).to_pandas()
        .query('weight > 0').pipe(reduce_mem_usage)
        .reset_index(drop = True))

train['action'] = (train.resp > 0).astype(int)

resp_cols = [i for i in train.columns if 'resp' in i]

features_names = list(set(train.columns) - set(resp_cols) - set(['weight', 'ts_id', 'date', 'action']))
features_index = list(map(lambda x: int(re.sub("feature_", "", x)), features_names))
features = sorted(list(zip(features_names, features_index)), key = lambda x: x[1])
features = [i[0] for i in features] + resp_cols

In [None]:
#count
nan_values_train = (train
 .apply(lambda x: x.isna().sum(axis = 0)/len(train))
 .to_frame()
 .rename(columns = {0: 'percentage_nan_values'})
.sort_values('percentage_nan_values', ascending = False)
)

# show the NaN counts and NaN shares per column, most affected columns first
display((train
 .apply(lambda x: x.isna().sum(axis = 0))
 .to_frame()
 .rename(columns = {0: 'count_nan_values'})
.sort_values('count_nan_values', ascending = False)
.transpose()), nan_values_train.transpose())

# print separately: passing print(...) to display() would also render its None return value
print("Number of features with at least one NaN value: {}/{}".format(len(nan_values_train.query('percentage_nan_values>0')),
                                                                           len(train.columns)))

<a id="top40"></a>

# Top 40 Features for number of NaNs

In [None]:
fig, ax = plt.subplots(figsize = (20, 12))

sns.set_palette("RdBu", 10)
#RdBu, YlGn
ax = sns.barplot(x='percentage_nan_values', 
            y='feature', 
            palette = 'GnBu_r',
            data=nan_values_train.reset_index().rename(columns = {'index': 'feature'}).head(40))

for p in ax.patches:
    width = p.get_width()  # bar length, i.e. the fraction of NaN values
    if width < 0.01:
        # bar too short to hold a label: start the text at the end of the bar
        ax.text(width,
            p.get_y() + p.get_height() / 2, # vertical centre of the bar
            '{:1.4f}'.format(width), # value to display, 4 decimals
            ha = 'left',   # horizontal alignment
            va = 'center')  # vertical alignment
    else:
        # black label for the shorter bars, white label for the longer ones
        if width < 0.04:
            color_text = 'black'
        else:
            color_text = 'white'
        ax.text(width /2, # start the text at the middle of the bar
            p.get_y() + p.get_height() / 2, # vertical centre of the bar
            '{:1.4f}'.format(width), # value to display, 4 decimals
            ha = 'left',   # horizontal alignment
            va = 'center',  # vertical alignment
            color = color_text,
            fontsize = 10)

ax.set_title('Top 40 Features for percentage of NaN Values')



<img src="https://images.emojiterra.com/google/android-10/512px/1f914.png" width="50" height="50" style="top:03%; left:80%"> 

Some features have almost the same number of NaN values. Let's plot where their NaNs occur over time.
First I define a new column, daily_ts_id, which is the zero-based trade number within each day (the first trade of a day gets 0). Then I'll look for relationships between NaN values and time.
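
As a quick illustration of the daily trade counter, here is a minimal sketch on a made-up frame (the `toy` data is an assumption, not the competition data). It only shows how `groupby(...).cumcount()` restarts at 0 on every new day.

```python
import pandas as pd

# Toy stand-in for the real data: day 0 has three trades, day 1 has two.
toy = pd.DataFrame({
    "date": [0, 0, 0, 1, 1],
    "ts_id": [0, 1, 2, 3, 4],
})

# cumcount() numbers the rows inside each group starting from 0,
# so the first trade of every day gets daily_ts_id == 0.
toy["daily_ts_id"] = toy.groupby("date").cumcount()
print(toy)
```

Rows with `daily_ts_id == 0` are exactly the day boundaries used for the vertical lines in the plots.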

In [None]:
train['daily_ts_id'] = (train.groupby('date').cumcount())

<a id="30k"></a>
# NaN values based on daily_ts_id, for the first 30000 trades

Let's see some of them and then I'll leave you the option to check all of them.

In [None]:
top_nan_features = nan_values_train.head(40).index.tolist()
mini_df = pd.concat([(train[top_nan_features].isna().astype(int)),train[['ts_id']]], axis = 1).iloc[:30000, :]
new_day = (train.iloc[:30000, :].query("daily_ts_id == 0").ts_id.tolist())
for feature_name in top_nan_features[:4]:
    fig, ax = plt.subplots(1, 1, figsize = (15, 8))
    
    mini_df[[feature_name, "ts_id"]].plot(y = feature_name , kind = 'line',
                                         xlabel = 'ts_id', 
                                         ylabel = feature_name+ "_nans", linewidth=0.3,
                                         legend = False,
                                         ax = ax)
    for m in range(len(new_day)):
        ax.axvline(new_day[m], alpha = 0.5, ymin = 0, ymax = 1, linestyle = ":", color = 'blue')
        if m == 2:
            ax.text(new_day[m]-1700, 1.1, "day {}".format(m), size = 16, alpha = 0.8)
        else:
            ax.text(new_day[m]+70, 1.1, "day {}".format(m), size = 16, alpha = 0.8)

Unhide the following to see all of the NaN plots

In [None]:
top_nan_features = nan_values_train.head(40).index.tolist() #take the first 40 with most nans
mini_df = pd.concat([(train[top_nan_features].isna().astype(int)),train[['ts_id']]], axis = 1).iloc[:30000, :]
new_day = (train.iloc[:30000, :].query("daily_ts_id == 0").ts_id.tolist())

for feature_name in top_nan_features:
    fig, ax = plt.subplots(1, 1, figsize = (15, 8))
    
    mini_df[[feature_name, "ts_id"]].plot(y = feature_name , kind = 'line',
                                         xlabel = 'ts_id', 
                                         ylabel = feature_name+ "_nans", linewidth=0.3,
                                         legend = False,
                                         ax = ax)
    for m in range(len(new_day)):
        ax.axvline(new_day[m], alpha = 0.5, ymin = 0, ymax = 1, linestyle = ":", color = 'blue')
        if m == 2:
            ax.text(new_day[m]-1700, 1.1, "day {}".format(m), size = 16, alpha = 0.8)
        else:
            ax.text(new_day[m]+70, 1.1, "day {}".format(m), size = 16, alpha = 0.8)

<img src="https://images.emojiterra.com/google/android-10/512px/1f914.png" width="50" height="50" style="top:03%; left:80%"> 

I think we can clearly see some patterns here: 

- some features (17, 27, 28, etc.) have most of their NaNs some trades after the first trade of the day;

- some features (21, 22, 31, 32, etc.) have most NaNs right at the start of each day;

- almost all other features belong to one of the 2 groups above, but their NaNs are more sporadic.

Let's try to verify these hypotheses.

I will add a new column 'pseudo_time_of_day', which is the trade number within the day divided by the number of trades on that day, so it lies in the interval [0, 1).
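
To make the scaling concrete, here is a small sketch on made-up data (the `toy` frame is an assumption). It uses `transform("size")` as a shortcut to get the number of trades per day, which is a different route from the join used in this notebook but should give the same column.

```python
import pandas as pd

# Toy stand-in: day 0 has four trades, day 1 has two.
toy = pd.DataFrame({"date": [0, 0, 0, 0, 1, 1]})

# Zero-based trade number within each day.
toy["daily_ts_id"] = toy.groupby("date").cumcount()

# transform("size") broadcasts each day's trade count back to its rows.
toy["daily_number_of_trades"] = toy.groupby("date")["date"].transform("size")

# First trade of a day -> 0.0, last trade -> (n - 1) / n, so never quite 1.
toy["pseudo_time_of_day"] = toy["daily_ts_id"] / toy["daily_number_of_trades"]
print(toy)
```

Because days have different numbers of trades, the same pseudo time can correspond to very different trade numbers on different days.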


In [None]:
train = (train.set_index('date').join(train.groupby('date').size().reset_index().rename(columns = {0: 'daily_number_of_trades'}).set_index('date'))
        .reset_index())

In [None]:
train['pseudo_time_of_day'] = (train['daily_ts_id']/train['daily_number_of_trades'])
gc.collect()

In [None]:
nan_df = pd.concat([(train[top_nan_features].isna().astype(int)), train[['pseudo_time_of_day']]], axis = 1)

<a id = "timeofday"></a>
# Distribution of Time Of Day where NaNs Occur for each feature

Let's see some of them and then I'll leave you the option to check all of them. 

In [None]:
features_chunks = chunks(top_nan_features, 10)
chunk_len = 4

for enum, k_chunk in enumerate(features_chunks):
    if enum > 0:
        break
    fig, axes = plt.subplots(2, 2, figsize = (40, 30))
    ax = axes.ravel()
    
    for i in range(len(k_chunk)):

        feature_name = k_chunk[i]
        feature = nan_df.loc[nan_df[feature_name] == 1]['pseudo_time_of_day']
        
        sns.distplot(feature, hist=True, kde=True, color = 'red', hist_kws={'edgecolor':'black'},
                     kde_kws={'linewidth': 2, 'color': 'blue'}, ax = ax[i%10])
        ax[i%10].grid(True)
        ax[i%10].set(xlabel = 'pseudo_time_of_day')
        ax[i%10].set_title(feature_name, fontsize=30)
    fig.suptitle('Time of Day where NaN occur', fontsize = 45)

Unhide the following to see all distributions

In [None]:
features_chunks = chunks(top_nan_features, 10)
chunk_len = 4

for enum, k_chunk in enumerate(features_chunks):
    
    fig, axes = plt.subplots(2, 2, figsize = (40, 30))
    ax = axes.ravel()
    
    for i in range(len(k_chunk)):
        feature_name = k_chunk[i]

        # pseudo time of day of the trades where this feature is NaN
        feature = nan_df.loc[nan_df[feature_name] == 1]['pseudo_time_of_day']
        
        sns.distplot(feature, hist=True, kde=True, color = 'red', hist_kws={'edgecolor':'black'},
                     kde_kws={'linewidth': 2, 'color': 'blue'}, ax = ax[i%10])
        ax[i%10].grid(True)
        ax[i%10].set(xlabel = 'pseudo_time_of_day')
        ax[i%10].set_title(feature_name, fontsize=30)
    fig.suptitle('Time of Day where NaN occur', fontsize = 45)

<a id = "nan_vs_timeofday"></a>
# Take on NaNs vs Time of Day

<img src="https://images.emojiterra.com/google/android-10/512px/1f914.png" width="50" height="50" style="top:03%; left:80%"> 

I guess we could state the following:

- almost all features have most of their NaNs at the beginning of each day

- all features also have NaNs occurring just after 'midday' (where midday is the point at which half of the daily trades have occurred)
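
One way to put a number on these statements, instead of reading them off the histograms, is to compute the share of a feature's NaNs that fall before a given pseudo time. This is only a sketch on made-up data: `toy`, `feature_x` and the helper `share_of_nans_before` are hypothetical names, and the cutoff is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Toy stand-in: two days of ten trades each; the hypothetical feature_x
# is NaN on the first two trades of every day.
toy = pd.DataFrame({
    "pseudo_time_of_day": np.tile(np.arange(10) / 10, 2),
    "feature_x": np.tile([np.nan, np.nan] + [1.0] * 8, 2),
})

def share_of_nans_before(df, feature, cutoff):
    """Fraction of a feature's NaNs that occur before `cutoff` in pseudo time."""
    nan_times = df.loc[df[feature].isna(), "pseudo_time_of_day"]
    if nan_times.empty:
        return float("nan")  # feature has no NaNs at all
    return float((nan_times < cutoff).mean())

early_share = share_of_nans_before(toy, "feature_x", cutoff=0.25)
print(early_share)
```

A value close to 1 for a small cutoff would support the "NaNs at the start of the day" reading for that feature.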

<a id = "market_basket"></a>
# Market basket analysis on Sets of NaN Features



<img src="https://upload.wikimedia.org/wikipedia/commons/4/4a/AffinityAnalysis.png">


Here I consider each trade as a transaction in which the items 'bought' are the features that are NaN.

The idea is to look for frequent itemsets, i.e. groups of features that are often NaN together!
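
To show what "support" means in this setting, here is a minimal sketch on made-up data. It is a brute-force count over pairs of columns, not the apriori algorithm, and the column names `f_a`, `f_b`, `f_c` are hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Toy data: f_a and f_b are always missing together, f_c mostly on its own.
toy = pd.DataFrame({
    "f_a": [np.nan, np.nan, 1.0, 1.0],
    "f_b": [np.nan, np.nan, 2.0, 2.0],
    "f_c": [3.0, np.nan, np.nan, 3.0],
})

# Keep only "transactions" (trades) with at least one NaN feature.
nan_mask = toy.isna()
nan_mask = nan_mask[nan_mask.any(axis=1)]

# Support of an itemset = share of transactions in which all its features are NaN.
support = {
    pair: float(nan_mask[list(pair)].all(axis=1).mean())
    for pair in combinations(nan_mask.columns, 2)
}
print(support)
```

Apriori avoids enumerating every combination by only extending itemsets that already meet the minimum support, which matters once there are 40 candidate features.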

In [None]:
gc.collect()
ALREADY_CALCULATED = True
if not ALREADY_CALCULATED:
    nan_df['nan_features'] = nan_df.iloc[:, :-1].apply(lambda x: list(np.where(np.array(x) == 1)[0]), axis = 1)
    nan_df = nan_df.loc[nan_df.nan_features.apply(lambda x: len(x) > 0)]
    nan_df.reset_index(drop = True).to_pickle('nan_df_itemsets.pickle')
    te = TransactionEncoder()
    te_ary = te.fit(nan_df.nan_features).transform(nan_df.nan_features)

    df = pd.DataFrame(te_ary, columns=te.columns_)

    frequent_itemsets = (apriori(df, min_support=0.35, use_colnames=True))
    frequent_itemsets = (frequent_itemsets.loc[frequent_itemsets.itemsets.apply(lambda x: len(x)>1)].reset_index(drop = True)
                        .sort_values('support', ascending = False))
    frequent_itemsets['n_of_items'] = frequent_itemsets.itemsets.apply(lambda x: len(x))
    frequent_itemsets = (frequent_itemsets.set_index('support').join(frequent_itemsets.groupby('support').n_of_items.max().rename('max_n_of_items'))
    .query('n_of_items == max_n_of_items'))
    pd.options.display.max_colwidth = 300
else:
    nan_df = pd.read_pickle("/kaggle/input/nandfforitemsets/nan_df_itemsets.pickle")
    frequent_itemsets = pd.read_pickle('/kaggle/input/freq-itemsets/frequent_itemsets_035_already_calculated.pickle')
gc.collect()

In [None]:
display(frequent_itemsets)

In [None]:
group_of_features = nan_df.columns.tolist()[:14]

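Items 0 to 13 have a crazy similar behaviour! Note that, since the itemsets are built from positional indices, these numbers appear to be column positions in nan_df (its first 14 columns), not feature_0 to feature_13.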
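
Since the items in the itemsets look like column positions rather than feature names, a small lookup makes the results easier to read. This is a sketch with made-up stand-ins: `columns` and `itemset` are hypothetical, and the real column order would come from nan_df.

```python
# Hypothetical stand-ins for nan_df.columns and one row of frequent_itemsets.
columns = ["feature_x", "feature_y", "feature_z", "pseudo_time_of_day"]
itemset = frozenset({0, 2})

# Items are column positions, so translate them back to column names.
named_itemset = sorted(columns[i] for i in itemset)
print(named_itemset)
```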

<a id = "feature_itemsets"></a>
## Distribution of feature itemsets

In [None]:
features_chunks = list(chunks(group_of_features, 3))
features_chunks = [sorted(i) for i in features_chunks]
mini_df = train[group_of_features + ['ts_id']].iloc[:50000, :]
new_day = (train.iloc[:50000, :].query("daily_ts_id == 0").ts_id.tolist())

for k_chunk in features_chunks:
    chunk_len = len(k_chunk)

    fig, axes = plt.subplots(chunk_len, 1, figsize = (40, 30))
    
    ax = axes.ravel()
    
    for i in range(len(k_chunk)):

        feature_name = k_chunk[i]
        feature = train[feature_name]
        
        mini_df[[feature_name , "ts_id"]].plot(kind = 'line',
                                              xlabel = 'ts_id', 
                                              linewidth=0.3, legend = False,
                                              ax = ax[i%chunk_len], sharex=True)
        
        ax[i%chunk_len].grid(True)
        ax[i%chunk_len].set(xlabel = 'ts_id')
        ax[i%chunk_len].set_title(feature_name)
        ax[i%chunk_len].set_ylim(ymin=mini_df[feature_name].min(), ymax=mini_df[feature_name].max())
        for m in range(len(new_day)):
            ax[i%chunk_len].axvline(new_day[m], alpha = 0.5, ymin = 0, ymax = 1, linestyle = ":", color = 'blue')
            if m == 2:
                ax[i%chunk_len].text(new_day[m]-1700, 1.1, "day {}".format(m), size = 16, alpha = 0.8)
            else:
                ax[i%chunk_len].text(new_day[m]+70, 1.1, "day {}".format(m), size = 16, alpha = 0.8)

<a id = "take_market_basket"></a>
# Take on Features having the same NaN behaviour

<img src="https://images.emojiterra.com/google/android-10/512px/1f914.png" width="50" height="50" style="top:03%; left:80%"> 

I guess we could state the following:

- looking at the first days of our dataset (which of course is a limitation), some features seem very similar in their behaviour (17-18, 27-28, 7-8, 108-84, 102-114, 72-78, 90-96). A correlation/cross-correlation analysis may be very helpful and provide further insight. I did not include it here since I wanted to focus on the NaN analysis alone. There's another notebook of mine on that, but I saw there are plenty of others so I won't brag about mine!
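
Staying within the NaN analysis, one simple check for features that share the same NaN behaviour is to correlate their NaN indicators. This is only a sketch on made-up data (the `toy` frame and its column names are assumptions), not something computed in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: f_a and f_b are missing on the same rows, f_c on a different row.
toy = pd.DataFrame({
    "f_a": [np.nan, 1.0, 1.0, np.nan],
    "f_b": [np.nan, 2.0, 2.0, np.nan],
    "f_c": [3.0, np.nan, 3.0, 3.0],
})

# Correlation between 0/1 NaN indicators:
# 1.0 means two features are always missing on exactly the same rows.
nan_corr = toy.isna().astype(int).corr()
print(nan_corr.round(2))
```

Columns without any NaN have a constant indicator, so their correlations come out as NaN and are best dropped before reading the matrix.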