# Jane Street Market Prediction - data exploration
[Kaggle Jane Street Market Prediction](https://www.kaggle.com/c/jane-street-market-prediction/overview)


## Calculate utility score of training set where `action` = 1 for all `resp` > 0
[training set](https://www.kaggle.com/c/jane-street-market-prediction/data?select=train.csv)

In [1]:
import pandas as pd
import numpy as np
import datatable as dt
X = dt.fread('jane-street-market-prediction/train.csv').to_pandas()

In [2]:
X['pj'] = X.weight * np.where((X.resp > 0), X.resp, 0)
pi = X.groupby(['date']).pj.sum()
t = pi.sum()/((pi**2).sum()**0.5) * (250/pi.count())**0.5
u = min(max(t, 0), 6) * pi.sum()
X.drop('pj', 1)
print(u)

224162.2681796676


## Calculate utility score of the mock test set where `action` = 1 for all *predicted* `resp` > 0
[mock test set](https://www.kaggle.com/c/jane-street-market-prediction/data?select=example_test.csv)

In [3]:
y = np.where((X.resp > 0), 1, 0)
X.drop(['date', 'resp_1', 'resp_2', 'resp_3', 'resp_4', 'resp', 'ts_id'], axis=1, inplace=True)

In [4]:
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
gbc = HistGradientBoostingClassifier(verbose=1).fit(X, y)

Binning 2.272 GB of training data: 42.850 s
Binning 0.252 GB of validation data: 1.759 s
Fitting gradient boosted rounds:
[1/100] 1 tree, 31 leaves, max depth = 11, train loss: 0.61420, val loss: 0.61429, in 1.928s
[2/100] 1 tree, 31 leaves, max depth = 12, train loss: 0.54952, val loss: 0.54970, in 1.823s
[3/100] 1 tree, 31 leaves, max depth = 11, train loss: 0.49559, val loss: 0.49583, in 1.766s
[4/100] 1 tree, 31 leaves, max depth = 11, train loss: 0.45003, val loss: 0.45034, in 1.783s
[5/100] 1 tree, 31 leaves, max depth = 13, train loss: 0.41107, val loss: 0.41143, in 1.699s
[6/100] 1 tree, 31 leaves, max depth = 12, train loss: 0.37752, val loss: 0.37793, in 1.795s
[7/100] 1 tree, 31 leaves, max depth = 11, train loss: 0.34841, val loss: 0.34886, in 1.842s
[8/100] 1 tree, 31 leaves, max depth = 14, train loss: 0.32301, val loss: 0.32351, in 1.774s
[9/100] 1 tree, 31 leaves, max depth = 11, train loss: 0.30073, val loss: 0.30127, in 1.817s
[10/100] 1 tree, 31 leaves, max depth = 1

In [5]:
del X, y

In [6]:
import pandas as pd
import numpy as np
import datatable as dt
example_sample_submission = dt.fread('jane-street-market-prediction/example_sample_submission.csv').to_pandas()
example_test = dt.fread('jane-street-market-prediction/example_test.csv').to_pandas()

In [8]:
pred = gbc.predict(example_test.drop('date', 1))
pred[pred == 0]

array([], dtype=int32)

In [None]:
pred[pred == 0]
example_sample_submission[example_sample_submission == 0]

In [None]:
X = pd.concat([example_sample_submission.set_index('ts_id'), example_test.set_index('ts_id')], axis=1)

In [None]:
X

In [None]:
X['pj'] = X.weight * X.action
pi = X.groupby(['date']).pj.sum()
t = pi.sum()/((pi**2).sum()**0.5) * (250/pi.count())**0.5
u = min(max(t, 0), 6) * pi.sum()
print(u)

In [None]:
X['pj'] = X.weight * np.where((X.resp > 0), X.resp, 0)
pi = X.groupby(['date']).pj.sum()
t = pi.sum()/((pi**2).sum()**0.5) * (250/pi.count())**0.5
u = min(max(t, 0), 6) * pi.sum()
print(u)