In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import datatable as dt

import os

## <center style="background-color:Gainsboro; width:40%;">Summary</center>
* [Introduction](#introduction)
* [Reading Data](#reading_data)
* [Rows Count Over Time](#rows_over_time)
* [Null Counts by Feature Over Time](#missing_over_time)
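* [Features Values Over Time](#value_over_time)
* [Target Values Over Time](#target_value_over_time)
* [Weight Values Over Time](#weight_value_over_time)
* [Conclusion](#conclusion)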

<a class="anchor" id="introduction"></a>
## <center style="background-color:Gainsboro; width:40%;">Introduction</center>

This notebook evaluates how the features and the target behave over time. The goal is to check for major shifts in the proportion of missing values or in the feature and target distributions. This matters because we do not want the model to learn patterns that no longer apply.

<a class="anchor" id="reading_data"></a>
## <center style="background-color:Gainsboro; width:40%;">Reading Data</center>


In [None]:
%%time

example_sample_submission = dt.fread("../input/jane-street-market-prediction/example_sample_submission.csv").to_pandas()
example_test = dt.fread("../input/jane-street-market-prediction/example_test.csv").to_pandas()
features = dt.fread("../input/jane-street-market-prediction/features.csv").to_pandas()
train = dt.fread("../input/jane-street-market-prediction/train.csv").to_pandas()

In [None]:
display(example_sample_submission.head())
display(example_test.head())
display(features.head())
display(train.head())

<a class="anchor" id="rows_over_time"></a>
## <center style="background-color:Gainsboro; width:40%;">Rows Count Over Time</center>

Here we will evaluate the number of examples available for each date.

In [None]:
# Number of rows per date
total_by_date = train.groupby('date').size().reset_index(name='total')
fig,ax = plt.subplots(figsize = (15,5))
sns.lineplot(data=total_by_date, x='date', y='total', ax=ax)

Nothing noteworthy here. The rows are roughly evenly spread across all dates.
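
As a rough numeric companion to the plot, the spread of the daily row counts can be summarised with a coefficient of variation. The sketch below runs on a small synthetic frame: the `toy` data and the helper name `daily_row_count_summary` are assumptions made for illustration and are not part of the competition data. On the real data one would pass `train` instead.

```python
import numpy as np
import pandas as pd


def daily_row_count_summary(df, date_col="date"):
    """Return min, max, mean and coefficient of variation of rows per date."""
    counts = df.groupby(date_col).size()
    return {
        "min": int(counts.min()),
        "max": int(counts.max()),
        "mean": float(counts.mean()),
        # std / mean: values close to 0 mean the rows are spread evenly
        "cv": float(counts.std(ddof=0) / counts.mean()),
    }


# Synthetic stand-in for `train`: 5 dates with a varying number of rows each
rng = np.random.default_rng(0)
rows_per_date = [100, 120, 90, 110, 80]
toy = pd.DataFrame({
    "date": np.repeat(np.arange(5), rows_per_date),
    "resp": rng.normal(size=sum(rows_per_date)),
})

summary = daily_row_count_summary(toy)
print(summary)
```

A single number like this does not replace the plot, but it makes it easier to compare the training set against any later data.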

<a class="anchor" id="missing_over_time"></a>
## <center style="background-color:Gainsboro; width:40%;">Null Counts by Feature Over Time</center>

Here we will evaluate the proportion of null values for each feature over time.

In [None]:
def plot_over_time(data, cols, ylabel):
    fig,ax = plt.subplots(figsize = (15,5))

    for col in cols:
        sns.lineplot(data=data, x='date', y=col, ax=ax, label=col)

    ax.legend()
    ax.set(xlabel='date', ylabel=ylabel)

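def plot_in_groups(data, cols, n=5, ylabel='missing_count'):
    # Plot `cols` in chunks of `n` lines per figure so each plot stays readable.
    # Pass a different `ylabel` when the values are not missing proportions.
    cols = list(cols)
    for start in range(0, len(cols), n):
        plot_over_time(data, cols[start:start + n], ylabel=ylabel)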
        
features = [col for col in train.columns if "feature" in col]
features_resp_weight = features + ['resp', 'weight']

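# Proportion of missing values per date: the mean of the boolean null mask.
# Grouping the mask by the date column keeps `date` out of the aggregated columns.
missing_over_time = train[features_resp_weight]\
    .isnull()\
    .groupby(train['date'])\
    .mean()\
    .reset_index()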

max_missing_count = missing_over_time.drop(columns=['date']).max(axis=0).drop(['resp', 'weight'])

In [None]:
plot_in_groups(missing_over_time,max_missing_count[max_missing_count<0.01].index)

In [None]:
plot_in_groups(missing_over_time,max_missing_count[(max_missing_count>=0.01) & (max_missing_count<0.05)].index)

In [None]:
plot_in_groups(missing_over_time,max_missing_count[(max_missing_count>=0.05) & (max_missing_count<0.10)].index)

In [None]:
plot_in_groups(missing_over_time, max_missing_count[max_missing_count>=0.10].index)

In [None]:
plot_over_time(missing_over_time, ['resp'], ylabel='missing_count')

In [None]:
plot_over_time(missing_over_time, ['weight'], ylabel='missing_count')

What we want to check here is whether there are major shifts in the proportion of missing values. Right before day 300, several features show a large drop in their missing proportion, but this shouldn't be a problem.
<br><br>Apart from that, the plots show no major shifts, so we can safely use all features as far as missing values are concerned.
<br><br>The target `resp` and `weight` don't have any missing values, as expected.
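
To put a number on a change like the one seen right before day 300, one option is to compare the average daily missing proportion before and after a cutoff date. The following is a minimal sketch on synthetic data. The helper `missing_rate_shift`, the `toy` frame and its column names are hypothetical; on the real data one would pass `train`, the list of feature columns and a cutoff near 300.

```python
import numpy as np
import pandas as pd


def missing_rate_shift(df, feature_cols, cutoff, date_col="date"):
    """Mean daily missing proportion before and from `cutoff`, plus the difference."""
    # Daily proportion of nulls per feature (mean of the boolean null mask)
    daily = df[feature_cols].isnull().groupby(df[date_col]).mean()
    before = daily[daily.index < cutoff].mean()
    after = daily[daily.index >= cutoff].mean()
    return pd.DataFrame({"before": before, "after": after, "shift": after - before})


# Synthetic stand-in for `train`: feature_a stops having missing values from day 3 on
toy = pd.DataFrame({
    "date":      [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    "feature_a": [np.nan, 1.0, np.nan, 2.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0],
    "feature_b": [1.0] * 10,
})

shift = missing_rate_shift(toy, ["feature_a", "feature_b"], cutoff=3)
print(shift)
```

Sorting the result by the absolute value of `shift` would list the features whose missing pattern changed the most.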

<a class="anchor" id="value_over_time"></a>
## <center style="background-color:Gainsboro; width:40%;">Features Values Over Time</center>

Here we will evaluate the mean value of each feature over time.

In [None]:
# Arbitrary maximum number of unique values to consider the feature categorical
max_categories_count = 50

features_nunique = {feature:train[feature].nunique() for feature in features}

categorical_features = [feature for (feature, nunique) in features_nunique.items() if nunique <= max_categories_count]
numerical_features = [feature for (feature, nunique) in features_nunique.items() if nunique > max_categories_count]

print("Categorical Features:")
print(", ".join(categorical_features))

print("\nNumerical Features:")
print(", ".join(numerical_features))

### Categorical Features

In [None]:
feature_0_mean = train[['date','feature_0']].replace({'feature_0': {-1:0}}).groupby('date').mean().reset_index()

fig, ax = plt.subplots(figsize = (15,5))

sns.lineplot(data=feature_0_mean, x='date', y='feature_0', ax=ax)
ax.set(xlabel='date', ylabel='positive class proportion')

The `feature_0` distribution is fairly stable over time, so we should be able to use it safely.

### Numerical Features

In [None]:
feature_mean_over_time = train[['date']+numerical_features]\
    .groupby('date')\
    .mean()\
    .reset_index()

max_mean = feature_mean_over_time.drop(columns=['date']).max(axis=0)

In [None]:
plot_in_groups(feature_mean_over_time,max_mean[max_mean<1].index)

In [None]:
plot_in_groups(feature_mean_over_time,max_mean[(max_mean>=1) & (max_mean<5)].index)

In [None]:
plot_in_groups(feature_mean_over_time,max_mean[(max_mean>=5) & (max_mean<10)].index)

In [None]:
plot_in_groups(feature_mean_over_time,max_mean[max_mean>=10].index)

The mean values of the numerical features remain reasonably stable over time, with no major shifts.
<p> It would be ideal to evaluate the full distribution of each feature over time, but this is left for a future version.
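
One possible way to look beyond the mean is to track a few quantiles of a feature per block of consecutive dates. The sketch below is only an illustration on synthetic data. The helper `quantiles_by_period`, the column `feature_x` and the choice of 10-day blocks built with floor division are assumptions, not something taken from the analysis above.

```python
import numpy as np
import pandas as pd


def quantiles_by_period(df, col, period=10, qs=(0.05, 0.5, 0.95), date_col="date"):
    """Quantiles of `col` for each block of `period` consecutive dates."""
    # Dates 0..period-1 fall in block 0, the next `period` dates in block 1, etc.
    bucket = (df[date_col] // period).rename("period")
    out = df[col].groupby(bucket).quantile(list(qs)).unstack()
    out.columns = [f"q{int(round(q * 100)):02d}" for q in out.columns]
    return out


# Synthetic stand-in for `train`: the spread of feature_x triples after day 10
rng = np.random.default_rng(0)
dates = np.repeat(np.arange(20), 50)
scale = np.where(dates < 10, 1.0, 3.0)
toy = pd.DataFrame({"date": dates, "feature_x": rng.normal(size=dates.size) * scale})

spread = quantiles_by_period(toy, "feature_x")
print(spread)
```

In this toy example the mean stays near zero in both blocks while the outer quantiles move apart, which is the kind of shift a mean-only plot would miss.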

<a class="anchor" id="target_value_over_time"></a>
## <center style="background-color:Gainsboro; width:40%;">Target Values Over Time</center>

Here we will evaluate the distribution of the target over time.

In [None]:
plot_data = train.assign(agg_date=lambda df: (df.date/10).round())
fig,ax = plt.subplots(figsize = (15,5))

sns.boxplot(x="agg_date", y="resp", data=plot_data, ax=ax, whis=[1,99])

In [None]:
fig,ax = plt.subplots(figsize = (15,5))

sns.boxplot(x="agg_date", y="resp", data=plot_data, ax=ax, whis=[1,99], showfliers=False)

In the first plot (including outliers) we can see that historically the target (`resp`) values range from roughly -0.6 to 0.5. In the second plot (which disregards the first and last percentiles), we can see that 98% of the values fall roughly within the [-0.1, 0.1] range.
<br><br>
One important thing to notice is that, most of the time, the target values are roughly symmetrical around 0. If we turn the target into a binary label (for example, `resp > 0`), the two classes should therefore be roughly balanced, which allows us to use classification algorithms without worrying about class balancing.
<br><br>
Lastly, I wonder whether this variable is the percentage change in the price of some stock.
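
The balance argument can be checked directly by computing the share of positive targets in each block of dates. This is a minimal sketch that assumes the binary label is defined as `resp > 0`, which the notebook does not state. The helper `positive_share_by_period` and the `toy` values are made up for illustration.

```python
import pandas as pd


def positive_share_by_period(df, target_col="resp", period=10, date_col="date"):
    """Share of rows with target > 0 in each block of `period` consecutive dates."""
    bucket = (df[date_col] // period).rename("period")
    # Mean of a boolean mask = proportion of True values
    return (df[target_col] > 0).groupby(bucket).mean()


# Synthetic stand-in for `train`
toy = pd.DataFrame({
    "date": [0, 1, 2, 3, 10, 11, 12, 13],
    "resp": [0.02, -0.01, 0.03, -0.04, 0.01, 0.02, -0.03, 0.05],
})

share = positive_share_by_period(toy)
print(share)
```

Values that stay close to 0.5 across periods would support skipping class balancing.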

<a class="anchor" id="weight_value_over_time"></a>
## <center style="background-color:Gainsboro; width:40%;">Weight Values Over Time</center>

Here we will evaluate the distribution of the weight over time.

In [None]:
plot_data = train.assign(agg_date=lambda df: (df.date/10).round())
fig,ax = plt.subplots(figsize = (15,5))

sns.boxplot(x="agg_date", y="weight", data=plot_data, ax=ax, whis=[1,99])

In [None]:
plot_data = train.assign(agg_date=lambda df: (df.date/10).round())
fig,ax = plt.subplots(figsize = (15,5))

sns.boxplot(x="agg_date", y="weight", data=plot_data, ax=ax, whis=[0,95], showfliers=False)

In the first plot (including outliers) we can see that `weight` values can go as high as 170 or so, but the second plot (excluding the highest 5% of values in each period) shows that the majority of the values (95%) fall within the range [0, 16].
<br><br>
This might be the stock value, which would explain why it is as important as `resp` in the evaluation.

<a class="anchor" id="conclusion"></a>
## <center style="background-color:Gainsboro; width:40%;">Conclusion</center>

This initial analysis suggests that it is fairly safe to use all the features provided, since we did not detect any major shift in their behaviour over time.