# Context
Some Kagglers train without the samples with ```date```<=85. There is indeed a marked aberration around day 85. This notebook displays the aberration, without advocating that those samples be dismissed as outliers.

Cumulative plots of ```feature_*``` and of the number of trades per day are already given in the discussion [Did Jane Street modify their trading model around day 85?](https://www.kaggle.com/c/jane-street-market-prediction/discussion/201930). Here I plot the daily $p_i$ against ```date```. As defined under the competition [evaluation tab](https://www.kaggle.com/c/jane-street-market-prediction/overview/evaluation), for each ```date``` $i$ we have

$$ p_i = \sum_j \left( weight_{ij} \cdot resp_{ij} \cdot action_{ij} \right) $$

$$ t = \frac{\sum p_i}{\sqrt{\sum p_i^2}} \cdot \sqrt{\frac{250}{|i|}} $$
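
As a minimal sketch of the two formulas above, the following computes $p_i$ and $t$ on a made-up toy frame. The ```toy``` values are invented for illustration only, and ```action``` is set to 1 whenever ```resp```>0, which is the same assumption the rest of this notebook makes rather than the output of any model.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): three dates, two trades each.
toy = pd.DataFrame({
    'date':   [0, 0, 1, 1, 2, 2],
    'weight': [1.0, 2.0, 1.0, 0.5, 2.0, 1.0],
    'resp':   [0.01, -0.02, 0.03, 0.01, -0.01, 0.02],
})

# Assumption, as in this notebook: action=1 exactly when resp>0.
toy['action'] = (toy['resp'] > 0).astype(int)

# p_i: sum of weight*resp*action over the trades j of each date i.
p = (toy['weight'] * toy['resp'] * toy['action']).groupby(toy['date']).sum()

# t: sum of p_i over the root of the sum of squares,
# scaled by sqrt(250 / number of dates).
t = p.sum() / np.sqrt((p ** 2).sum()) * np.sqrt(250 / len(p))

print(p)
print(t)
```

Here ```len(p)``` plays the role of $|i|$, the number of distinct dates.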

In [None]:
import sys
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
from pytz import timezone
print('tic', datetime.now(timezone('Canada/Pacific')).isoformat(timespec='minutes'))

In [None]:
train = pd.read_csv('../input/jane-street-market-prediction/train.csv')

# slim down the frame

# drop rows with weight 0: they contribute nothing to p_i
train = train.loc[ train['weight']>0 ]

# remove columns we don't need
train = train[ ['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4', 'date', 'weight'] ]

In [None]:
targets = ['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4']

dailyp = pd.DataFrame(index=train['date'].unique(), columns=targets)
dailyp.index.name = 'date'
plt.figure(figsize=(15, 30))
for ntarget, target in enumerate(targets):
    # assuming action=1 when target>0, so only rows with target>0 contribute to p_i
    df = train.loc[ train[target]>0 ]
    # daily p: sum of weight*target over the rows of each date
    dailyp[target] = (df['weight'] * df[target]).groupby(df['date']).sum()
    plt.subplot(5, 1, 1+ntarget)
    plt.plot(dailyp.index, dailyp[target], '.r')
    plt.axvline(85)
    plt.grid(); plt.xlabel('date'); plt.ylabel('daily p'); plt.title(target)

In [None]:
# sanity check: recompute one entry of dailyp by hand
pick = train.loc[ (train['date']==333) & (train['weight']>0) & (train['resp_3']>0) ]
manual = (pick['weight']*pick['resp_3']).sum()
auto = dailyp.loc[333, 'resp_3']
ratio = (manual-auto)/auto
if abs(ratio) > .01:  # 1% tolerance, in either direction
    print('insane, not ok')
else:
    print('sane, ok')
manual, auto, ratio

In [None]:
dailyp

# Quantiles
Now break the red dots above down by quintile of daily p. Same points, just richer info and richer colours.
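
To illustrate how the colouring is assigned, here is a small sketch of ```pd.qcut``` with 5 buckets on a made-up series. The ```toy_p``` values are invented for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy daily p values (made up for illustration), one per date.
toy_p = pd.Series([0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 1.0, 0.6])

# labels=False returns the integer bucket index:
# 0 for the lowest fifth of the values, 4 for the highest fifth.
bucket = pd.qcut(toy_p, 5, labels=False)

# Equal-frequency binning: each bucket holds the same number of points.
counts = bucket.value_counts().sort_index()

print(bucket.tolist())
print(counts.tolist())
```

Because the buckets are cut by rank rather than by value, each colour in the plots covers roughly one fifth of the days, however unevenly the daily p values are spread.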

In [None]:
plt.figure(figsize=(15, 30))
for ntarget, target in enumerate(targets):
    plt.subplot(5, 1, 1+ntarget)
    dailyp[f'{target}_quantile'] = pd.qcut(dailyp[target], 5, labels=False).astype(int)
    sns.scatterplot(data=dailyp, x='date', y=target, hue=f'{target}_quantile', palette='tab10')
    plt.axvline(85)
    plt.grid()

In [None]:
# sanity check: position of day 333 among the sorted resp_3 values, floor-divided by the bucket size
manual = np.where(dailyp['resp_3'].sort_values()>=dailyp.loc[333, 'resp_3'])[0].min() // (len(dailyp)/5)
manual = int(manual)
auto = dailyp.loc[333, 'resp_3_quantile']
np.testing.assert_allclose(auto, manual)
manual, auto

In [None]:
# further sanity check
# quantile buckets are equal-frequency by construction, so the histogram should be flat
plt.hist(dailyp['resp_3_quantile'], bins=np.arange(6)-0.5)  # one bin per bucket 0..4

In [None]:
print('toc', datetime.now(timezone('Canada/Pacific')).isoformat(timespec='minutes') )