Playground for Jane Street Market Prediction Competition on Kaggle

Introduction

Jane Street hosted a code competition of predicting the stock market (Feb 2021 to Aug 2021) using the past high frequency trading data (2 years of data before 2018?) on Kaggle: https://www.kaggle.com/c/jane-street-market-prediction. The training data provided contain 500 days of high frequency trading data, total 2.4 million rows. The public leaderboard data contain 1 year of high frequency trading data from some time before Aug 2020 and up to that. The private ranges from a random time from July/Aug 2020 up to Aug 2021 (it was March 2021 as of the time of writings). This training dataset contains an anonymized set of features, feature_{0...129}, representing real stock market data. Each row in the dataset represents a trading opportunity.

This is a code competition in that we have to prepare a pipeline of models that can do inference 1 trading opportunity at a time (no peaking into the future) subject to the inference API on Kaggle, and this submission should be able to perform the inference for 1.1 million samples in under 5 hours on cloud. For each row, we will be predicting an action value: 1 to make the trade and 0 to pass on it. Each trade has an associated weight and resp, which together represents a return on the trade. The date column is an integer which represents the day of the trade, while ts_id represents a time ordering.

Team: Semper Augustus

Shuhao Cao, Carl McBride Ellis, Ethan Zheng

Model performance on live stock market data update

Date of LB	Ranking	Overfit Ensemble (OE)	OE delta	Local Best CV (LBC)	LBC delta
Mar 5	99/4245, top 2.33%	4790.458		4541.474
Mar 17	75/4245, top 1.77%	5153.324	+363	4952.939	+411
Mar 31	252/4245,top 5.93%	3934.002	-1219	3849.940	-1103
Apr 14	260/4245,top 6.12%	3999.195	+65	4010.201	+160
Apr 29	252/4245,top 5.93%	3843.239	-156	3889.275	-121
May 12	152/4245,top 3.58%	4506.561	+663	4493.300	+604
May 28	171/4245,top 4.03%	4467.388	-39	4419.595	-74
Jun 9	148/4245,top 3.48%	4441.644	-26	4350.219	-69
Jun 25	206/4245,top 4.85%	4488.654	+47	4468.779	+118
Jul 21	270/4245,top 6.36%	4479.715	-9	4445.238	-23
Aug 2	359/4245,top 8.46%	4465.683	-14	?	?
Aug 18	212/4245,top 4.99%	4369.578	-96	4346.610	?
Final standing	241/4245,top 5.68%	4272.599	-67	4144.837	-202

Final submissions

Data preparation

All data: only drop the two partial days and the two <2k ts_id days (done first).
fillna() uses the past day mean including all weight zero rows for every feature.
~~Most common values fillna for spike features rows.~~ (not any more after categorical embedding)
Smoother data: aside from 1, query day > 85, ~~drop ts_id > 9000 days~~(decreases CV by a margin so still included), dropping of data before day 85 is covered in Carl's EDA: Jane Street: EDA of day 0 and feature importance.
Final training uses only weight > 0 rows, ~~with a randomly selected 40% of weight zero rows' weight being replaced by 1e-7 to reduce overfitting~~ (reduces CV so discarded).
~~A new de-noised target is generated with all five targets~~ (CV too good but leaderboard bad).

Models

(PT) PyTorch baseline with the skip connection mechanics, around 400k parameters, fast inference. Easy to get overfit.
(S) Carl found that some features have an extremely high number of common values. Based on close inspection. I have a conjecture that they are certain categorical features' embedding. So this model is designed to add an embedding block for these features. Also with the skip connection mechanics, around 300k parameters, best local CV and best single model leaderboard score.
(AE) Tensorflow implementation of an autoencoder + a small MLP net with skip connection in the first layer. Small net. Currently the best scored public ones with a serious CV using 3 folds ensemble.
(TF) Tensorflow Residual MLP using a filtering layer with high dropout rates to filter out hand-picked unimportant features suggested by Carl.
~~(TF overfit) the infamous overfit model with a 1111 seed.~~ (we decided to exclude this one in the final submission)

Train

Validation score

Instead of the common accuracy or area-under-curve metrics for the classification problem, this competition is evaluated on a utility score.

For each date $i$, we define: for r representing the resp (response), w representing the weight, and a representing the action (1 for taking the trade, 0s for pass):

$\displaystyle p_i = \sum_{j} w_{ij} r_{ij} a_{ij}$

Then it is summed up to

$\displaystyle t = \frac{\sum p_i }{\sqrt{\sum p_i^2}} * \sqrt{\frac{250}{|i|}},$

Finally the utility is computed by:

$\displaystyle u = \min(\max(t,0), 6) \sum_i p_i.$

Essentially, without considering some real market constraint, when every p_i become positive, this is to maximize

$\displaystyle \left(\sum_i p_i\right)^2 \cdot \left( \sum_i p_i^2\right)^{-1}$

which we will use to construct a fine tuner for trained models.

Train-validation strategy

A grouped validation strategy based on a total of 100 days as validation, a 10-day gap between the last day of train and the first of valid, three folds. The gap is due to the speculation of certain features being the moving average of certain metrics for the tradings.

splits = {
          'train_days': (range(0,457), range(0,424), range(0,391)),
          'valid_days': (range(467, 500), range(434, 466), range(401, 433)),
          }

Training of different models

Volatile models: all data with only resp, resp_3, resp_4 as targets.
Smoother models: smoother data with all five resps. ~~3. De-noised models: smoother data with all five resps + a de-noised target~~.
Optimizer is simply Adam with a cosine annealing scheduler that allow warm restarts. Rectified Adam for tensorflow models.
During training of torch models, a fine-tuning regularizer is applied each 10 epochs to maximize the utility function by choosing action being the sigmoid of the outputs (Only for torch models, I do not know how to incorporate this in tensorflow training, as tensorflow's custom loss function is not that straightforward to keep track of extra inputs between batches).

Submissions

Local best CV ones within a several-seeded bag. Final models: a set of 3(S) + 3(PT) + 3(AE) + 1(TF) for both smooth and volatile data.
Trained with all data using the “public leaderboard as CV” epochs determined earlier, plus the infamous tensorflow seed 1111 overfit model. The validation for this submission is based on the variation of the utility score in all train data among all 25-day non-overlapping spans.
As our designated submission timed out... due to my poor judgement on the number of models to ensemble, we decided to choose an overfit model using the first pipeline.

Inference

CPU inference because the submission is CPU-bounded rather GPU. Torch models are usually faster than TF, TF models with numba backend enabled. (Update Feb 23 after the competition ended) I found that GPU inference became faster than CPU as more Tensorflow-based models are incorporated in the pipeline.
(Main contribution of Semper Augustus) Use feature_64's average gradient (a scaled version of $\arcsin (t)$) suggest by Carl, and the number of trades in the previous day as a criterion to determine the models to include. Reference: slope test of the past day class by Ethan and iter_cv simulation written by Shuhao, slope validation
Blending is always concatenating models in a bag then taking the middle 60%'s average (median if only 3 models), then concatenating again to take the middle 60% average (50% if a day is busy). For example, if we have 5 (PT) + 3 (AE) + 1 (TF), then 5 (PT)'s predictions are concatenated and averaged along axis 0 with the middle three, and (AE) submissions are taken the median. Lastly, the subs are concatenated again to take the middle 9 entries (15 total).
Regular days: 3 (P), 3 (S) ~~with denoised target~~, 3 (AE), and 1 (TF) trained on the smoother models.
Busy days: above models trained on all data.

Below were the notes and tries before we orchestrated our final solution.

Things to try for the final submission:

Ideas and notes

Final sub: 1 with best public LB+CV, 1 experimental.
submission idea: using different models for the iter_env, see if we can extract features representing general market trend, then switch/ensemble different models. Especially when the market is in volatile, the strategy in general needs to be more conservative.
feature_64 may be used as time to build CV strategy. Ref: https://www.kaggle.com/c/jane-street-market-prediction/discussion/202253
Number of trades ts_id might be related to volatility on that day: https://www.kaggle.com/c/jane-street-market-prediction/discussion/201930#1125847
Another notebook on volatility (this one we should use its info at dicretion): https://www.kaggle.com/charlesleahy/identifying-volatility-features
Two-Sigma's stock prediction competition (with a similar format) winning solution of 5th place: https://medium.com/kaggle-blog/two-sigma-financial-modeling-code-competition-5th-place-winners-interview-team-best-fitting-279a493c76bd
The data in the actual test is disjoint from the train (confirmed by the host at https://www.kaggle.com/c/jane-street-market-prediction/discussion/199551)

EDA

Only 35%-40% of the samples have action being 1, depending on the CV split.
Carl's observation: huge spike in the histogram of features 3,4,6,19,20,22,38 etc, also lurking on the far left side of features 71, 85, 87, 92, 97, 105, 127 and 129. A smaller spike is seen for feature 116.

NN models

Current NN models use date>85 and weight>0.

Current best: Ethan's AE+MLP baseline the last 2 folds, not fine-tune models, with a custom median ensembling.
(After debugging) Both custom median (average of middle 50%) and np.mean has better public score.
Current NN models uses fillna either with mean or forward fill, mean performs better on public LB but certain is subject to leakage.

Autoencoder

Kaggle Notebook: https://www.kaggle.com/ztyreg/fork-of-s8900-ts
Local Notebook: TBD
Score: 8358.763
Submission time: ~2 hours (CPU)

Thoughts:

Forward fill (8781.740) seems to be better than mean imputation, although I haven't tested if the difference is significant

AE+MLP+prediction cache

Attempt 0.1: simply saving pred_df.copy() and using pd.concat is way too slow (7-8 iteration/s << 45 which is the current starter's).
TO-DO: add a class so that prediction is a function under this class, model outputs to give more information, and some objects "depicting" the current market volatility.

A new Residual+MLP model

The key is to train using the actual resp columns as target, and when doing the inference, apply the sigmoid function to the output (why BCEwLogits performs better than CrossEntropy???).
Set up the baseline training, adding a 16-target model (using various sums between the resp columns).
Tested the sensitivity of seeds to the CV vs public leaderboard. Bigger model in general is less sensitive than smaller models (esp the seed 1111 overfit model).
A local-public LB stable training strategy: RAdam/Adam with cosine annealing scheduler, utility function regularizer finetuning every 10 epochs with a 1e-3*lr learning rate, 1 or 2 denoised targets added, 50% median average ensembling.
Feature neutralization might not fit the iteration speed needed for the inference.

Validation scores

ResNet (TF), two features group, regular days

Fold	Seed	Score
0	1127802	1621.86
1	1127802	1080.24
1	792734	1221.17
2	1127802	80.85
2	97275	146.31
0	157157	1554.01
1	157157	1273.48
2	157157	19.76

ResNet (TF), two features group, volatile days

Fold	Seed	Score
0	1127802	1640.27
1	1127802	1054.42
2	1127802	45.15
0	157157	1563.25
1	157157	1253.98
2	157157	11.14
0	745273	1511.12
1	962656	0.01
0	5567273	1457.13
1	123835	1290.73
2	676656	34.38

ResNet+spike (TF+S), three features group, regular days (too slow for inference not going into the final sub pipeline)

Fold	Seed	Score
0	1127802	1417.43
1	1127802	1082.22
2	1127802	59.87
2	802	175.96

(AE) regular days (only first two folds are used due to time constraint)

Fold	Seed	Score
0	692874	1413.37
0	1127802	1552.13
1	692874	1037.59
1	1127802	1209.71
2	692874	144.69
2	1127802	144.29
0	157157	1529.70
1	157157	1052.70
2	157157	402.80

(AE) volatile days (only first two folds are used due to time constraint)

Fold	Seed	Score
0	969725	1485.01
0	1127802	1672.50
0	618734	1623.88
0	283467	1670.67
1	969725	1284.02
1	1127802	1347.90
1	618734	969.63
1	283467	1006.84
2	969725	0.83
2	1127802	0.26
2	618734	0
2	283467	49.79

Gradient boosting models

XGBoost:

Kaggle Notebook: https://www.kaggle.com/ztyreg/xgb-benchmark
Local Notebook: https://github.com/scaomath/kaggle-jane-street/blob/main/lgb/v01_explore.ipynb
Score: 5557.170
Submission time: ~2 hours (CPU)

Notes:

Training 1 XGBoost model only takes about 5 minutes, so we do not need to save the model
Needs different feature processing than the autoencoder model

Thoughts:

Add time lag features
- Add all lag1 features: no improvement (5039.022)
Add transformed features (abs, log, std, polynomial)

Trained models

Google drive folder: TBD

Repo structure

├── model
│   └── model dumps: hdf5, pt, etc
├── data
│   ├── EDA ipynbs
│   ├── processed data
│   └── raw data
├── nn
├── transformer
├── lgb
├── data.py: comepetition data downloader
├── utils.py: utility functions
├── README.md: can be used as a log
└── .gitignore

Name		Name	Last commit message	Last commit date
Latest commit History 134 Commits
data		data
lgb		lgb
mlp		mlp
models		models
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
__init__.py		__init__.py
cv.py		cv.py
cv_final.py		cv_final.py
cv_splits.py		cv_splits.py
data.py		data.py
eda.ipynb		eda.ipynb
eda_final.ipynb		eda_final.ipynb
iter_cv.py		iter_cv.py
iter_cv_torch.py		iter_cv_torch.py
janest.code-workspace		janest.code-workspace
utils.py		utils.py
utils_js.py		utils_js.py
utils_lgb.py		utils_lgb.py

scaomath/kaggle-jane-street

Folders and files

Latest commit

History

Repository files navigation