# Final solution of Semper Augustus


## Model performance on live stock market data update

| Date of LB |  Ranking   | Overfit Ensemble (version 2) | delta | Local Best CV (version 6) |  delta |
|:----------:|:----------:|:---------------------:|:--------:|---------------------|:---------:|
|    Mar 5   | 99/4245    |       4790.458        |          |       4541.474     |           |
|   Mar 17   | 75/4245    |       5153.324        |   +363   |       4952.939      | +411      |


## Data preparation
The data contain 500 days of high-frequency trading data from Jane Street, about 2.4 million rows in total.

0. All data: only drop the two partial days and the two <2k `ts_id` days (done first).
1. `fillna()` with the past day's mean, computed including all zero-weight rows.
2. ~~Most common values `fillna` for spike features rows.~~ (not any more after categorical embedding)
4. Smoother data: aside from 1, query day > 85, ~~drop `ts_id` > 9000 days~~ (reduces CV).
5. Final training uses only `weight > 0` rows, ~~with a randomly selected 40% of weight zero rows' weight being replaced by 1e-7 to reduce overfitting~~ (reduces CV so discarded).
6. ~~A new de-noised target is generated with all five targets~~ (CV too good > 0.57 but leaderboard bad).
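
Step 1 above can be sketched as follows. This is a minimal illustration, not the training code: the function name `fill_with_past_day_mean` and the `initial_mean` argument are hypothetical, and the sketch assumes the daily mean is taken after filling, over all rows regardless of weight.

```python
import numpy as np

def fill_with_past_day_mean(dates, features, initial_mean):
    """Replace NaNs in each day's rows with the column means of the previous day.

    dates: 1-D array of day indices, one per row.
    features: 2-D float array (rows x features), may contain NaN.
    initial_mean: column means to use for the very first day (assumption).
    """
    filled = np.array(features, dtype=float)
    past_mean = np.asarray(initial_mean, dtype=float)
    for day in np.unique(dates):  # np.unique returns the days in sorted order
        rows = dates == day
        block = np.where(np.isnan(filled[rows]), past_mean, filled[rows])
        filled[rows] = block
        # the mean of this (already filled) day is used for the next day
        past_mean = block.mean(axis=0)
    return filled

dates = np.array([0, 0, 1, 1])
features = np.array([[1.0, np.nan],
                     [3.0, 4.0],
                     [np.nan, 2.0],
                     [5.0, np.nan]])
filled = fill_with_past_day_mean(dates, features, initial_mean=[0.0, 10.0])
```

The inference pipeline further down keeps the same statistic online through the `RunningPDA` class, so the fill values stay consistent between training and submission.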

## Models
- (PT) PyTorch baseline with skip connections, around 400k parameters, fast inference. Prone to overfitting.
- (S) Carl found that some features have an extremely high number of common values. Based on close inspection, I conjecture that these are embeddings of certain categorical features, so this model adds an embedding block for them. It also uses skip connections, has around 300k parameters, and gives the best local CV (>0.555 on multiple folds) and the best single-model leaderboard score (7818).
- (AE) Tensorflow implementation of an autoencoder + a small MLP net with a skip connection in the first layer. Small net. Currently the best-scoring public model with a serious CV, using a 3-fold ensemble.
- (TF) Tensorflow Residual MLP using a filtering layer with high dropout rates to filter out hand-picked unimportant features suggested by Carl. The filter layer input is different for busy day model and regular day model.
- (TF overfit) the infamous overfit model with a 1111 seed.

## Train

### Train-validation splits
A grouped validation strategy over three folds: about 100 days in total are used for validation, with a 10-day gap between the last day of train and the first day of valid.
```python
splits = {
          'train_days': (range(0,457), range(0,424), range(0,391)),
          'valid_days': (range(467, 500), range(434, 466), range(401, 433)),
          }
```
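
As a sketch of how the `splits` dictionary above could be turned into row masks, the snippet below selects train and validation rows by day for one fold. The helper name `fold_masks` and the synthetic `dates` array are assumptions for illustration only.

```python
import numpy as np

splits = {
    'train_days': (range(0, 457), range(0, 424), range(0, 391)),
    'valid_days': (range(467, 500), range(434, 466), range(401, 433)),
}

def fold_masks(dates, fold):
    """Return boolean (train, valid) row masks for the given fold index."""
    train = np.isin(dates, list(splits['train_days'][fold]))
    valid = np.isin(dates, list(splits['valid_days'][fold]))
    return train, valid

# one synthetic row per day, just to check the day bookkeeping
dates = np.arange(500)
train, valid = fold_masks(dates, fold=0)
gap = dates[valid].min() - dates[train].max() - 1  # days skipped between train and valid
```

The gap days are dropped from both sets, so no validation row sits directly after a training row.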

1. Volatile models: all data with only `resp`, `resp_3`, `resp_4` as targets.
2. Smoother models: smoother data with all five `resp`s.
~~3. De-noised models: smoother data with all five `resp`s + a de-noised target~~.
4. The optimizer is simply Adam with a cosine annealing scheduler that allows warm restarts. Rectified Adam is used for the tensorflow models.
5. During training of the torch models, a fine-tuning regularizer is applied every 10 epochs to maximize the utility function, with the action taken to be the sigmoid of the outputs. This is only done for torch models: I do not know how to incorporate it in `tensorflow` training, because tensorflow's custom loss functions make it hard to keep track of extra inputs between batches.
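
For reference, the learning-rate curve of cosine annealing with warm restarts can be written in a few lines of plain Python. This only illustrates the schedule shape: the values of `base_lr`, `min_lr` and `cycle_len` are assumptions, not the settings used for the models above, and the cycle length is kept fixed for simplicity.

```python
import math

def cosine_warm_restart_lr(epoch, base_lr=1e-3, min_lr=1e-5, cycle_len=10):
    """Learning rate at a given epoch for fixed-length cosine cycles."""
    t_cur = epoch % cycle_len  # position inside the current cycle
    cosine = 1.0 + math.cos(math.pi * t_cur / cycle_len)
    return min_lr + 0.5 * (base_lr - min_lr) * cosine

schedule = [cosine_warm_restart_lr(e) for e in range(20)]
```

The rate decays along a half cosine within each cycle and jumps back to `base_lr` at every restart (epochs 0, 10, ... in this sketch).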

### Fine-tuning using utility
For each date $i$, let `r` denote the `resp` (response), `w` the `weight`, and `a` the `action` (1 for taking the trade, 0 for passing). We define:
<p align="center">
<img src="https://render.githubusercontent.com/render/math?math=%5Cdisplaystyle%20p_i%20%3D%20%5Csum_%7Bj%7D%20w_%7Bij%7D%20r_%7Bij%7D%20a_%7Bij%7D">
</p>

Then it is summed up to 
<p align="center">
<img src="https://render.githubusercontent.com/render/math?math=%5Cdisplaystyle%20t%20%3D%20%5Cfrac%7B%5Csum%20p_i%20%7D%7B%5Csqrt%7B%5Csum%20p_i%5E2%7D%7D%20*%20%5Csqrt%7B%5Cfrac%7B250%7D%7B%7Ci%7C%7D%7D%2C">
</p>

Finally the utility is computed by:
<p align="center">
<img src="https://render.githubusercontent.com/render/math?math=%5Cdisplaystyle%20u%20%3D%20%5Cmin(%5Cmax(t%2C0)%2C%206)%20%20%5Csum_i%20p_i.">
</p>
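
The three formulas above translate directly into NumPy. The function name `utility_score` and the toy arrays are illustrative; returning 0 when no trade is taken is an assumption made to avoid a division by zero.

```python
import numpy as np

def utility_score(dates, weights, resps, actions):
    """Utility u = min(max(t, 0), 6) * sum_i p_i, with p_i the daily weighted return."""
    days = np.unique(dates)
    contrib = weights * resps * actions
    # p_i: sum of w * r * a over the rows of day i
    p = np.array([contrib[dates == d].sum() for d in days])
    sum_sq = np.sum(p ** 2)
    if sum_sq == 0:
        return 0.0  # no trades taken at all
    t = p.sum() / np.sqrt(sum_sq) * np.sqrt(250.0 / len(days))
    return float(min(max(t, 0.0), 6.0) * p.sum())

dates = np.array([0, 0, 1, 1])
weights = np.ones(4)
resps = np.array([1.0, 2.0, 3.0, -1.0])
actions = np.array([1, 1, 1, 0])
u = utility_score(dates, weights, resps, actions)
```

In this toy case both days have `p_i = 3`, so `t` exceeds the cap of 6 and the utility is `6 * 6`.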

Essentially, ignoring some real market constraints, when every `p_i` becomes positive, this amounts to maximizing

<p align= "center">
<img src="https://render.githubusercontent.com/render/math?math=%5Cdisplaystyle%20%5Cleft(%5Csum_i%20p_i%5Cright)%5E2%20%5Ccdot%20%5Cleft(%20%0A%20%5Csum_i%20p_i%5E2%5Cright)%5E%7B-1%7D">
</p>
We have constructed a fine-tuner using this to train the SpikeNet, by replacing the discrete $p_i$ with a continuously changing one.
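
A minimal sketch of the continuous relaxation, assuming the action is replaced by the sigmoid of the model output as described in the training notes. It is written in NumPy to show the objective only; the actual fine-tuner would need an autograd framework, and the function name `soft_utility` and the `eps` stabilizer are assumptions.

```python
import numpy as np

def soft_utility(dates, weights, resps, logits, eps=1e-12):
    """Smooth surrogate (sum_i p_i)^2 / sum_i p_i^2 with a = sigmoid(logit)."""
    actions = 1.0 / (1.0 + np.exp(-logits))  # soft action in (0, 1)
    contrib = weights * resps * actions
    # map each row to its day index, then sum contributions per day
    days, day_idx = np.unique(dates, return_inverse=True)
    p = np.bincount(day_idx.ravel(), weights=contrib, minlength=len(days))
    return p.sum() ** 2 / (np.sum(p ** 2) + eps)

dates = np.array([0, 0, 1, 1])
weights = np.ones(4)
resps = np.ones(4)
value = soft_utility(dates, weights, resps, logits=np.full(4, 50.0))
```

By the Cauchy-Schwarz inequality this ratio is at most the number of days, and it reaches that bound when all daily `p_i` are equal, so maximizing it rewards returns that are both positive and evenly spread across days.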


## Submissions
1. The models with the best local CV within a bag of three seeds. Final models: a set of `2(S) + 2(PT) + 2(AE) + 2(TF)` for smooth days, and `5(S) + 2(PT) + 2(AE) + 3(TF)` for volatile days.
2. ~~Trained with all data using the “public leaderboard as CV” epochs determined earlier, plus the infamous tensorflow seed 1111 overfit model. The validation for this submission is based on the variation of the utility score in all train data among all 25-day non-overlapping spans~~.
3. As our designated submission (version 4) timed out, we decided to choose an overfit model using this pipeline.

### Inference pipeline
1. CPU inference, because the submission is CPU-bound rather than GPU-bound. Torch models are usually faster than TF; TF models run with the `numba` backend enabled.
2. Use `feature_64`'s [average gradient (a scaled version of $\arcsin (t)$) suggested by Carl](https://www.kaggle.com/c/jane-street-market-prediction/discussion/208013#1135364), and the number of trades in the previous day, as criteria to determine the models to include. Reference: [slope test of the past day class by Ethan and iter_cv simulation written by Shuhao](https://www.kaggle.com/ztyreg/validate-busy-prediction), [slope validation](https://www.kaggle.com/ztyreg/validate-busy-prediction)
3. Blending always concatenates the models in a bag and takes the average of the middle 60% (the median if there are only 3 models), then concatenates the results again and takes the middle 60% average (50% if a day is busy). For example, if we have `5 (PT) + 3 (AE) + 1 (TF)`, then the `5 (PT)` predictions are concatenated and the middle three are averaged along `axis 0`, and the median of the `(AE)` predictions is taken. Lastly, the results are concatenated again to take the middle 9 entries (15 total).
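
The two-stage blending in step 3 can be sketched as below. The helpers `middle_mean` and `blend` are hypothetical and use a simple centered window; they reproduce the examples in the text (middle 3 of 5, median of 3, middle 9 of 15) but are not guaranteed to match the indexing of the notebook's `median_avg` for every ensemble size.

```python
import numpy as np

def middle_mean(values, frac=0.6):
    """Average the centered `frac` portion of the values sorted along axis 0."""
    values = np.sort(np.asarray(values, dtype=float), axis=0)
    n = values.shape[0]
    k = max(1, int(n * frac + 1e-9))  # number of entries to keep
    if (n - k) % 2:                   # widen by one so the window stays centered
        k += 1
    start = (n - k) // 2
    return values[start:start + k].mean(axis=0)

def blend(bags, frac=0.6):
    """Stage 1: middle mean inside each bag. Stage 2: middle mean over the pooled results."""
    per_bag = [np.atleast_1d(middle_mean(bag, frac)) for bag in bags]
    pooled = np.concatenate(per_bag)
    return float(middle_mean(pooled, frac))

# toy bags with shape (n_models, n_targets)
bag_pt = np.array([[0.1], [0.2], [0.3], [0.4], [0.9]])
bag_ae = np.array([[0.5], [0.7], [0.6]])
final = blend([bag_pt, bag_ae])
```

Trimming both tails before averaging keeps a single badly calibrated model (such as the `0.9` above) from dragging the blended prediction across the decision threshold.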

###### Version notes

Final submission: one overfit model (ver 2) and one serious model (ver 6)

- Ver 1: test run with 3 embed+resnet models, 3 tf ae+mlp models, volatile and regular for each
- Ver 2: including the overfit seed tf model (fixed a bug, forgot using past_day_mean in the submission pipeline)
- Ver 3: final run ver 1: typo in the final sub, maybe too tired, a good test against the concat blending...(4 hours 50 min inference....time)
- Ver 4: fixed the typo, decrease the number of models to be safe.
- Ver 5: increased 1 more common model for both regular and volatile, only first two folds (more regular than fold 2) for regular days tf resnet model.
- Ver 6: Rerun of ver 5 
- Ver 7-8: Rerun of ver 4
- Ver 9-11, 20-29: CPU rerun of ver 6
- Ver 12-19, 30-: GPU rerun of ver 6

In [None]:
import os
import time
import pickle
import random
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
from collections import namedtuple
from sklearn.metrics import log_loss, roc_auc_score

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torch.nn import CrossEntropyLoss, MSELoss
from torch.nn.modules.loss import _WeightedLoss
import torch.nn.functional as F

from tensorflow.keras.layers import Input, Dense, BatchNormalization, Dropout, Concatenate, Lambda, GaussianNoise, Activation
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers.experimental.preprocessing import Normalization
import tensorflow as tf
import tensorflow_addons as tfa

tf.config.optimizer.set_jit(True)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

import warnings
warnings.filterwarnings ("ignore")

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

In [None]:
import torch
print(torch.cuda.is_available())

### Settings and column names for different models

In [None]:
BETA = 0.7
ALPHA = 0.6 # for simple mean subtracted features

feat_cols = [f'feature_{i}' for i in range(130)]

target_cols = ['action_1', 'action_2', 'action_3','action', 'action_4']

target_cols_volatile = ['action_3','action', 'action_4']

f_mean = np.load('../input/janest-mlp-models/f_mean_after_85_include_zero_weight.npy').reshape(1,-1)
spike_val = np.load('../input/jane-street-train-data-final/spike_common_vals_42.npy').reshape(1,-1)

In [None]:
# features for spike net
feat_spike_index = [1, 2, 3, 4, 5, 6, 10, 14, 16, 69, 70, 71, 73, 74, 75, 76, 79, 80, 81, 82, 85,
                    86, 87, 88, 91, 92, 93, 94, 97, 98, 99, 100, 103, 104, 105, 106, 109, 111, 112, 115, 117, 118]

features_spike = [f'feature_{i}' for i in feat_spike_index]
feat_cols = [f'feature_{i}' for i in range(130)]

resp_cols  = ['resp_1', 'resp_2', 'resp_3','resp', 'resp_4', ]  
resp_cols_vol  = ['resp_3','resp', 'resp_4', ] 
cat_cols = [f+'_c' for f in features_spike]

In [None]:
##### Making features for baseline torch models
all_feat_cols = [col for col in feat_cols]
all_feat_cols.extend(['cross_41_42_43', 'cross_1_2'])

In [None]:
# resnet for all five resp
features_2_index = [0, 1, 2, 3, 4, 5, 6, 15, 16, 25, 26, 35, 
             36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 
             49, 50, 51, 52, 53, 54, 59, 60, 61, 62, 63, 64, 65, 
             66, 67, 68, 69, 70, 71, 76, 77, 82, 83, 88, 89, 94, 
             95, 100, 101, 106, 107, 112, 113, 118, 119, 128, 129]

features_1_index = [0] + list(set(range(130)).difference(features_2_index))

features_1 = [f'feature_{i}' for i in features_1_index]

features_2 = [f'feature_{i}' for i in features_2_index]


In [None]:
# resnet for volatile days
features_1_index_v = [0,
                   7, 8, 17, 18, 27, 28, 55, 72, 78, 84, 90, 96, 102, 108, 114, 120, 121,
                   11, 12, 21, 22, 31, 32, 57, 74, 80, 86, 92, 98, 104, 110, 116, 124, 125] 
                # resp_1 resp_2 feat
    
features_2_index_v = [0] + list(set(range(130)).difference(features_1_index_v))

features_1_v = [f'feature_{i}' for i in features_1_index_v]

features_2_v = [f'feature_{i}' for i in features_2_index_v]

# Torch models

In [None]:
##### Model&Data fnc
class ResMLP(nn.Module):
    def __init__(self, hidden_size=256, 
                       output_size=len(target_cols), 
                       input_size=len(all_feat_cols),
                       dropout_rate=0.2):
        super(ResMLP, self).__init__()
        self.batch_norm0 = nn.BatchNorm1d(input_size)
        self.dropout0 = nn.Dropout(0.2)

        self.dense1 = nn.Linear(input_size, hidden_size)
        self.batch_norm1 = nn.BatchNorm1d(hidden_size)
        self.dropout1 = nn.Dropout(dropout_rate)

        self.dense2 = nn.Linear(hidden_size+input_size, hidden_size)
        self.batch_norm2 = nn.BatchNorm1d(hidden_size)
        self.dropout2 = nn.Dropout(dropout_rate)

        self.dense3 = nn.Linear(hidden_size+hidden_size, hidden_size)
        self.batch_norm3 = nn.BatchNorm1d(hidden_size)
        self.dropout3 = nn.Dropout(dropout_rate)

        self.dense4 = nn.Linear(hidden_size+hidden_size, hidden_size)
        self.batch_norm4 = nn.BatchNorm1d(hidden_size)
        self.dropout4 = nn.Dropout(dropout_rate)

        self.dense5 = nn.Linear(hidden_size+hidden_size, output_size)

        self.Relu = nn.ReLU(inplace=True)
        self.PReLU = nn.PReLU()
        self.LeakyReLU = nn.LeakyReLU(negative_slope=0.01, inplace=True)
        # self.GeLU = nn.GELU()
        self.RReLU = nn.RReLU()

    def forward(self, x):
        x = self.batch_norm0(x)
        x = self.dropout0(x)

        x1 = self.dense1(x)
        x1 = self.batch_norm1(x1)
        x1 = self.LeakyReLU(x1)
        x1 = self.dropout1(x1)

        x = torch.cat([x, x1], 1)

        x2 = self.dense2(x)
        x2 = self.batch_norm2(x2)
        x2 = self.LeakyReLU(x2)
        x2 = self.dropout2(x2)

        x = torch.cat([x1, x2], 1)

        x3 = self.dense3(x)
        x3 = self.batch_norm3(x3)
        x3 = self.LeakyReLU(x3)
        x3 = self.dropout3(x3)

        x = torch.cat([x2, x3], 1)

        x4 = self.dense4(x)
        x4 = self.batch_norm4(x4)
        x4 = self.LeakyReLU(x4)
        x4 = self.dropout4(x4)

        x = torch.cat([x3, x4], 1)

        x = self.dense5(x)

        return x
    
    
class SpikeNet(nn.Module):
    def __init__(self, hidden_size=256,
                 cat_dim=len(cat_cols),
                 output_size=len(resp_cols),
                 input_size=len(feat_cols),
                 dropout_rate=0.2,
                 alpha=ALPHA):
        super(SpikeNet, self).__init__()
        # self.embed = nn.Embedding(cat_dim, 2)
        self.embed = nn.Linear(cat_dim, int(cat_dim*alpha))
        self.emb_dropout = nn.Dropout(0.1)

        self.batch_norm0 = nn.BatchNorm1d(input_size+int(cat_dim*alpha))
        self.dropout0 = nn.Dropout(0.1)

        self.dense1 = nn.Linear(input_size+int(cat_dim*alpha), hidden_size)
        # nn.init.kaiming_normal_(self.dense1.weight.data)
        self.batch_norm1 = nn.BatchNorm1d(hidden_size)
        self.dropout1 = nn.Dropout(dropout_rate)

        self.dense2 = nn.Linear(
            hidden_size+input_size+int(cat_dim*alpha), hidden_size)
        # nn.init.kaiming_normal_(self.dense2.weight.data)
        self.batch_norm2 = nn.BatchNorm1d(hidden_size)
        self.dropout2 = nn.Dropout(dropout_rate)

        self.dense3 = nn.Linear(hidden_size+hidden_size, hidden_size)
        # nn.init.kaiming_normal_(self.dense3.weight.data)
        self.batch_norm3 = nn.BatchNorm1d(hidden_size)
        self.dropout3 = nn.Dropout(dropout_rate)

        self.dense4 = nn.Linear(hidden_size+hidden_size, output_size)
        # nn.init.kaiming_normal_(self.dense4.weight.data)

        self.LeakyReLU = nn.LeakyReLU(negative_slope=0.01, inplace=True)

    def forward(self, x, x_cat):
        #
        x_cat = self.embed(x_cat)
        # x_cat = self.emb_dropout(x_cat)
        x = torch.cat([x, x_cat], dim=1)
        x = self.batch_norm0(x)
        x = self.dropout0(x)

        x1 = self.dense1(x)
        x1 = self.batch_norm1(x1)
        x1 = self.LeakyReLU(x1)
        x1 = self.dropout1(x1)

        x = torch.cat([x, x1], 1)

        x2 = self.dense2(x)
        x2 = self.batch_norm2(x2)
        x2 = self.LeakyReLU(x2)
        x2 = self.dropout2(x2)

        x = torch.cat([x1, x2], 1)

        x3 = self.dense3(x)
        x3 = self.batch_norm3(x3)
        x3 = self.LeakyReLU(x3)
        x3 = self.dropout3(x3)

        x = torch.cat([x2, x3], 1)

        x = self.dense4(x)

        return x

In [None]:
torch_weights_1 = ['../input/jane-street-train-data-final/emb_fold_0_util_1351_auc_0.5536.pth',
                '../input/jane-street-train-data-final/emb_fold_1_util_1232_auc_0.5539.pth',
#                 '../input/jane-street-train-data-final/emb_fold_2_util_266_auc_0.5441.pth'
                  ]

N_TORCH = len(torch_weights_1)

torch_model_1 = []
for _fold in range(N_TORCH):
    torch.cuda.empty_cache()
    model = SpikeNet()
    model.to(device)
    model_weights = torch_weights_1[_fold]
    # map_location lets GPU-saved weights load on a CPU-only machine as well
    model.load_state_dict(torch.load(model_weights, map_location=device))
    model.eval()
    torch_model_1.append(model)
    print(f"spike net {_fold} loaded.")
    
torch_model_1 = torch_model_1[:2]

## volatile day 

In [None]:
torch_weights_2 = ['../input/jane-street-train-data-final/emb_volatile_fold_0_util_1445_auc_0.5550.pth',
                  '../input/jane-street-train-data-final/emb_volatile_fold_1_util_1225_auc_0.5557.pth',
                  '../input/jane-street-train-data-final/emb_volatile_fold_2_util_240_auc_0.5455.pth']

N_TORCH_2 = len(torch_weights_2)

torch_model_2 = []
for _fold in range(N_TORCH_2):
    torch.cuda.empty_cache()
    model = SpikeNet()
    model.to(device)
    model_weights = torch_weights_2[_fold]
    # map_location lets GPU-saved weights load on a CPU-only machine as well
    model.load_state_dict(torch.load(model_weights, map_location=device))
    model.eval()
    torch_model_2.append(model)
    print(f"Volatile spike net {_fold} loaded.")

In [None]:

torch_weights_3 = ['../input/jane-street-train-data-final/pt_volatile_0_util_1424_auc_0.5520.pth',
                  '../input/jane-street-train-data-final/pt_volatile_1_util_1137_auc_0.5470.pth',
#                   '../input/jane-street-train-data-final/pt_volatile_2_util_322_auc_0.5444.pth',
                  ]

N_TORCH_3 = len(torch_weights_3)

torch_model_3 = []
for _fold in range(N_TORCH_3):
    torch.cuda.empty_cache()
    model = ResMLP()
    model.to(device)
    model_weights = torch_weights_3[_fold]
    # map_location lets GPU-saved weights load on a CPU-only machine as well
    model.load_state_dict(torch.load(model_weights, map_location=device))
    model.eval()
    torch_model_3.append(model)
    print(f"Volatile torch {_fold} loaded.")
    
torch_model_3 = torch_model_3[:2]

# Tensorflow models

In [None]:
# enable mish
from tensorflow.keras import backend as K

class Mish(tf.keras.layers.Layer):

    def __init__(self, **kwargs):
        super(Mish, self).__init__(**kwargs)
        self.supports_masking = True

    def call(self, inputs):
        return inputs * K.tanh(K.softplus(inputs))

    def get_config(self):
        # the layer takes no extra arguments, so the base config is sufficient
        return super(Mish, self).get_config()

    def compute_output_shape(self, input_shape):
        return input_shape

def mish(x):
    return tf.keras.layers.Lambda(lambda x: x*K.tanh(K.softplus(x)))(x)

tf.keras.utils.get_custom_objects().update({'mish': tf.keras.layers.Activation(mish)})

def create_resnet_reg(n_features, n_features_2, n_labels, hidden_size, 
                  learning_rate=1e-3, label_smoothing = 0.005):    
    input_1 = tf.keras.layers.Input(shape = (n_features,), name = 'Input1')
    input_2 = tf.keras.layers.Input(shape = (n_features_2,), name = 'Input2')

    head_1 = tf.keras.Sequential([
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(hidden_size, activation="mish"), 
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(hidden_size//2, activation = "mish")
        ],name='Head1') 

    input_3 = head_1(input_1)
    input_3_concat = tf.keras.layers.Concatenate()([input_2, input_3])

    head_2 = tf.keras.Sequential([
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(hidden_size, "mish"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(hidden_size, "mish"),
        ],name='Head2')

    input_4 = head_2(input_3_concat)
    input_4_concat = tf.keras.layers.Concatenate()([input_3_concat, input_4]) 

    head_3 = tf.keras.Sequential([
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(hidden_size, kernel_initializer='lecun_normal', activation='mish'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(hidden_size//2, kernel_initializer='lecun_normal', activation='mish'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(n_labels, activation="sigmoid")
        ],name='Head3')

    output = head_3(input_4_concat)


    model = tf.keras.models.Model(inputs = [input_1, input_2], outputs = output)
    model.compile(optimizer=tfa.optimizers.RectifiedAdam(learning_rate=learning_rate), 
                  loss=tf.keras.losses.BinaryCrossentropy(label_smoothing=label_smoothing), 
                  metrics=['AUC'])
    
    return model


def create_resnet(n_features, n_features_2, n_labels, hidden_size, learning_rate=1e-3, 
                  label_smoothing = 0.005):    
    input_1 = tf.keras.layers.Input(shape = (n_features,), name = 'Input1')
    input_2 = tf.keras.layers.Input(shape = (n_features_2,), name = 'Input2')

    head_1 = tf.keras.Sequential([
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(hidden_size, activation="mish"), 
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(hidden_size//2, activation = "mish")
        ],name='Head1') 

    input_3 = head_1(input_1)
    input_3_concat = tf.keras.layers.Concatenate()([input_2, input_3])

    head_2 = tf.keras.Sequential([
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(hidden_size, "mish"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(hidden_size//2, "mish"),
        ],name='Head2')

    input_4 = head_2(input_3_concat)
    input_4_concat = tf.keras.layers.Concatenate()([input_3_concat, input_4]) 

    head_3 = tf.keras.Sequential([
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(hidden_size, kernel_initializer='lecun_normal', activation='mish'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(n_labels, activation="sigmoid")
        ],name='Head3')

    output = head_3(input_4_concat)


    model = tf.keras.models.Model(inputs = [input_1, input_2], outputs = output)
    model.compile(optimizer=tfa.optimizers.RectifiedAdam(learning_rate=learning_rate), 
                  loss=tf.keras.losses.BinaryCrossentropy(label_smoothing=label_smoothing), 
                  metrics=['AUC'])
    
    return model

In [None]:
tf_weights_0 = ['../input/jane-street-train-data-final/resnet_reg_fold_0_seed_1127802.h5',
                  '../input/jane-street-train-data-final/resnet_reg_fold_1_seed_157157.h5',
                  '../input/jane-street-train-data-final/resnet_reg_fold_2_res_seed_97275.h5']

n_tf_0 = len(tf_weights_0)

tf_model_0 = []
for _fold in range(n_tf_0):
    tf.keras.backend.clear_session()
    model = create_resnet_reg(len(features_1), len(features_2), len(resp_cols), hidden_size=256, label_smoothing=5e-03)
    model_weights = tf_weights_0[_fold]
    model.load_weights(model_weights)
    model.call = tf.function(model.call, experimental_relax_shapes=True)
    tf_model_0.append(model)
    print(f"regular tf resnet {_fold} loaded.")

tf_model_0 = tf_model_0[:2]

In [None]:
tf_weights_1 = ['../input/jane-street-train-data-final/resnet_volatile_fold_0_seed_1127802.h5',
                '../input/jane-street-train-data-final/resnet_volatile_fold_1_seed_123835.h5',
                '../input/jane-street-train-data-final/resnet_volatile_fold_2_seed_1127802.h5']

n_tf_1 = len(tf_weights_1)

tf_model_1 = []
for _fold in range(n_tf_1):
    tf.keras.backend.clear_session()
    model = create_resnet(len(features_1_v), len(features_2_v), len(resp_cols_vol), hidden_size=256, label_smoothing=5e-03)
    model_weights = tf_weights_1[_fold]
    model.load_weights(model_weights)
    model.call = tf.function(model.call, experimental_relax_shapes=True)
    tf_model_1.append(model)
    print(f"volatile tf resnet {_fold} loaded.")

## tf model 2



In [None]:
def create_autoencoder(input_dim,output_dim,noise=0.05):
    i = Input(input_dim)
    encoded = BatchNormalization()(i)
    encoded = GaussianNoise(noise)(encoded)
    encoded = Dense(64,activation='relu')(encoded)
    decoded = Dropout(0.2)(encoded)
    decoded = Dense(input_dim,name='decoded')(decoded)
    x = Dense(32,activation='relu')(decoded)
    x = BatchNormalization()(x)
    x = Dropout(0.2)(x)
    x = Dense(output_dim,activation='sigmoid',name='label_output')(x)
    
    encoder = Model(inputs=i,outputs=encoded)
    autoencoder = Model(inputs=i,outputs=[decoded,x])
    
    autoencoder.compile(optimizer=Adam(0.001),loss={'decoded':'mse',
                                                    'label_output':'binary_crossentropy'})
    return autoencoder, encoder

def create_model(hp,input_dim,output_dim,encoder):
    inputs = Input(input_dim)
    
    x = encoder(inputs)
    x = Concatenate()([x,inputs]) #use both raw and encoded features
    x = BatchNormalization()(x)
    x = Dropout(hp.Float('init_dropout',0.0,0.4))(x)
    
    for i in range(hp.Int('num_layers',1,4)):
        x = Dense(hp.Int(f'num_units_{i}',64,256))(x)
        x = BatchNormalization()(x)
        x = Lambda(tf.keras.activations.swish)(x)
        x = Dropout(hp.Float(f'dropout_{i}',0.0,0.4))(x)
        
    x = Dense(output_dim,activation='sigmoid')(x)
    model = Model(inputs=inputs,outputs=x)
    model.compile(optimizer=Adam(hp.Float('lr',0.00001,0.1,default=0.001)),
                  loss=BinaryCrossentropy(label_smoothing=hp.Float('label_smoothing',0.0,0.1)),
                  metrics=[tf.keras.metrics.AUC(name = 'auc')])
    return model

In [None]:
encoder_file = '../input/jane-street-train-data-final/encoder_reg.hdf5'
hp_file = f'../input/jane-street-train-data-final/hp_ae_reg.pkl'

tf.keras.backend.clear_session()
_, encoder = create_autoencoder(len(feat_cols),len(target_cols),noise=0.1)

encoder.load_weights(encoder_file)
encoder.trainable = False

model_fn = lambda hp: create_model(hp,len(feat_cols),len(target_cols), encoder)
tf_models_2 = []

hp = pd.read_pickle(hp_file)
for _fold in range(3):
    tf.keras.backend.clear_session()
    model = model_fn(hp)
    model.load_weights(f'../input/jane-street-train-data-final/ae_reg_fold_{_fold}.hdf5')
    model.call = tf.function(model.call, experimental_relax_shapes=True)
    tf_models_2.append(model)
    print(f"Regular tf ae model {_fold} loaded")
    
tf_models_2 = tf_models_2[:2]

In [None]:
tf.keras.backend.clear_session()
_, encoder_vol = create_autoencoder(len(feat_cols),len(target_cols_volatile),noise=0.1)

encoder_vol.load_weights('../input/jane-street-train-data-final/encoder_volatile.hdf5')
encoder_vol.trainable = False

model_fn_vol = lambda hp: create_model(hp,len(feat_cols),len(target_cols_volatile), encoder_vol)
tf_models_3 = []

hp = pd.read_pickle(f'../input/jane-street-train-data-final/hp_ae_volatile.pkl')
for _fold in range(3):
    tf.keras.backend.clear_session()
    model = model_fn_vol(hp)
    model.load_weights(f'../input/jane-street-train-data-final/ae_volatile_fold_{_fold}.hdf5')
    model.call = tf.function(model.call, experimental_relax_shapes=True)
    tf_models_3.append(model)
    print(f"Volatile tf ae model {_fold} loaded")
    
tf_models_3 = tf_models_3[:2]

# Inference

Print the number of models in the ensemble below.

In [None]:
## common for regular and volatile
print("Number of baseline models for all days")
print(len(torch_model_1))
print(len(tf_model_0))

## volatile
print("\nNumber of models for volatile days")
print(len(torch_model_2))
print(len(torch_model_3))
print(len(tf_model_1))
print(len(tf_models_3))

## only regular day
print("\nNumber of models for regular days")
print(len(tf_models_2))

In [None]:
def median_avg(predictions, beta=BETA):
    '''
    predictions: 1-D array of pooled model predictions, shape (n_models,)
    beta: fraction of the sorted predictions around the median to average,
          e.g. if beta is 0.5, then the middle 50% will be averaged
    '''
    sorted_predictions = np.sort(predictions)
    n_model = len(sorted_predictions)
    mid_point = n_model//2+1
    n_avg = int(n_model*beta)

    to_avg = sorted_predictions[mid_point-n_avg//2-1:mid_point+n_avg//2]
    
    return to_avg.mean()

In [None]:
class RunningPDA:
    '''
    https://www.kaggle.com/lucasmorin/running-algos-fe-for-fast-inference?scriptVersionId=50754012
    Modified by Ethan to add slope prediction using feature 64
    inspired by Carl: https://www.kaggle.com/carlmcbrideellis/jane-street-eda-of-day-0-and-feature-importance
    '''
    def __init__(self, past_mean=0, start=1000, end=2500, slope=0.00116):
        self.day = -1
        self.past_mean = past_mean # past day mean, initialized as the mean
        self.cum_sum = 0
        self.day_instances = 0 # current day instances
        self.past_value = past_mean # the previous row's value, initialized as the mean
        self.past_instances = 0 # instances in the past day
        
        self.start = start
        self.end = end
        self.slope = slope
        self.start_value = None
        self.end_value = None

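    def clear(self):
        # reset the running statistics of the current day
        self.cum_sum = 0
        self.day_instances = 0
        self.start_value, self.end_value = None, None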

    def push(self, x, date):
        x = fast_fillna(x, self.past_value)
        self.past_value = x
        
        # change of day
        if date > self.day:
            self.day = date
            if self.day_instances > 0:
                self.past_mean = self.cum_sum/self.day_instances
            self.past_instances = self.day_instances
            self.day_instances = 1
            self.cum_sum = x
            
            self.start_value, self.end_value = None, None
            
        else:
            self.day_instances += 1
            self.cum_sum += x
        
        if self.day_instances == self.start:
            self.start_value = x[:, 64]
        if self.day_instances == self.end:
            self.end_value = x[:, 64]

    def get_mean(self):
        return self.cum_sum/self.day_instances

    def get_past_mean(self):
        return self.past_mean

    def get_past_trade(self):
        return self.past_instances
    
    def predict_today_busy(self):
        if self.start_value is None or self.end_value is None:
            return False
        return (self.end_value - self.start_value) / (self.end - self.start) < self.slope
    
    
def fast_fillna(array, values):
    if np.isnan(array.sum()):
        array = np.where(np.isnan(array), values, array)
    return array

In [None]:
th = 0.50
f = median_avg
HIGH_VOL = 5000

GPU = torch.cuda.is_available()

pdm = RunningPDA(past_mean=f_mean, start=1000, end=2500, slope=0.00144)

In [None]:
import janestreet
env = janestreet.make_env()
env_iter = env.iter_test()

In [None]:
########## GPU
pbar = tqdm(total=15219)
for (test_df, pred_df) in env_iter:

    date = test_df['date'].values
    x_tt = test_df[feat_cols].values

    pdm.push(x_tt, date)
    past_day_mean = pdm.get_past_mean()
    past_day_vol = pdm.get_past_trade()

    if test_df['weight'].values[0] > 0:

        if np.isnan(x_tt.sum()):
            x_tt = np.nan_to_num(x_tt) + np.isnan(x_tt) * past_day_mean
        x_cat = (x_tt[:,feat_spike_index] - spike_val).astype(np.int32)


        ###### torch_pred_1: spikenet, 2 or 3 models
        torch_preds_1 = [model(torch.tensor(x_tt,dtype=torch.float).to(device), 
                               torch.tensor(x_cat,dtype=torch.float).to(device))\
                           .sigmoid().detach().cpu().numpy()\
                           for model in torch_model_1]
        torch_pred_1 = np.mean(torch_preds_1, axis=0)

        ### tf resnet for regular days
        x_tt_1 = x_tt[:,features_1_index]
        x_tt_2 = x_tt[:,features_2_index]
        tf_preds_0 = [model([x_tt_1, x_tt_2], training = False).numpy() for model in tf_model_0]
        tf_pred_0 = np.mean(tf_preds_0, axis=0)

        if past_day_vol > HIGH_VOL or pdm.predict_today_busy():
            ####### spike net for volatile days
            torch_preds_2 = [model(torch.tensor(x_tt,dtype=torch.float).to(device), 
                                torch.tensor(x_cat,dtype=torch.float).to(device))\
                             .sigmoid().detach().cpu().numpy()\
                           for model in torch_model_2]
            torch_pred_2 = np.median(torch_preds_2, axis=0)

            ####### vanilla torch for volatile days
            cross_41_42_43 = x_tt[:, 41] + x_tt[:, 42] + x_tt[:, 43]
            cross_1_2 = x_tt[:, 1] / (x_tt[:, 2] + 1e-5)
            # reshape(-1, 1) gives a column for any number of rows (identical to (1, 1) for one row)
            feature_inp = np.c_[x_tt, cross_41_42_43.reshape(-1, 1), cross_1_2.reshape(-1, 1)]

            torch_preds_3 = [model(torch.tensor(feature_inp,dtype=torch.float).to(device))\
                        .sigmoid().detach().cpu().numpy() for model in torch_model_3]
            torch_pred_3 = np.mean(torch_preds_3, axis=0)

            ### resnet for volatile days
            x_tt_1_v = x_tt[:,features_1_index_v]
            x_tt_2_v = x_tt[:,features_2_index_v]
            tf_preds_1 = [model([x_tt_1_v, x_tt_2_v], training = False).numpy() for model in tf_model_1]
            tf_pred_1 = np.median(tf_preds_1, axis=0)

            ## ae+mlp model volatile days
            tf_preds_3 = [model(x_tt, training = False).numpy() for model in tf_models_3]
            tf_pred_3 = np.mean(tf_preds_3, axis=0)

            #### concat blending
            pred = np.c_[torch_pred_1, torch_pred_2, torch_pred_3, tf_pred_0, tf_pred_1, tf_pred_3].squeeze()
            pred = f(pred, beta=0.6)
        else:

            # tf_preds for ae+mlp model
            tf_preds_2 = [model(x_tt, training = False).numpy() for model in tf_models_2]
            tf_pred_2 = np.mean(tf_preds_2, axis=0)

            pred = np.c_[torch_pred_1, tf_pred_0, tf_pred_2].squeeze()
            pred = f(pred)

        pred_df.action.values[0] = int(pred >= th)
    else:
        pred_df.action.values[0] = 0

    env.predict(pred_df)
    pbar.update()
pbar.close()
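
The volatile-day branch of the loop appends two hand-built cross features to the raw features before calling the vanilla PyTorch models. Below is a minimal sketch of that step on random data. The helper name `add_cross_features` and the width of 130 columns are assumptions made for illustration. The sketch uses `reshape(-1, 1)` so that it accepts any number of rows.

```python
import numpy as np

def add_cross_features(x):
    """Append the two cross features to a 2-D array of shape (n_rows, n_features)."""
    # Sum of three feature columns.
    cross_sum = x[:, 41] + x[:, 42] + x[:, 43]
    # Ratio of two feature columns; the small constant guards against an exactly zero denominator.
    cross_ratio = x[:, 1] / (x[:, 2] + 1e-5)
    # np.c_ stacks the new columns to the right of the original features.
    return np.c_[x, cross_sum.reshape(-1, 1), cross_ratio.reshape(-1, 1)]

rng = np.random.default_rng(0)
single = rng.normal(size=(1, 130))  # one row, as served by the inference API
batch = rng.normal(size=(4, 130))   # several rows, e.g. for offline validation

out_single = add_cross_features(single)
out_batch = add_cross_features(batch)
print(out_single.shape, out_batch.shape)
```

The original feature columns are left untouched and the two new columns are always the last two, so a model trained on this layout sees the same ordering at inference time.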

In [None]:
####### CPU: same loop as the GPU cell without the .to(device) / .cpu() transfers (kept commented out)

# pbar = tqdm(total=15219)
# for (test_df, pred_df) in env_iter:

#     date = test_df['date'].values
#     x_tt = test_df[feat_cols].values

#     pdm.push(x_tt, date)
#     past_day_mean = pdm.get_past_mean()
#     past_day_vol = pdm.get_past_trade()

#     if test_df['weight'].values[0] > 0:

#         if np.isnan(x_tt.sum()):
#             x_tt = np.nan_to_num(x_tt) + np.isnan(x_tt) * past_day_mean
#         x_cat = (x_tt[:,feat_spike_index] - spike_val).astype(np.int32)


#         ###### torch_pred_1: spikenet, 2 or 3 models
#         torch_preds_1 = [model(torch.tensor(x_tt,dtype=torch.float), 
#                                torch.tensor(x_cat,dtype=torch.float)).sigmoid().detach().numpy()\
#                            for model in torch_model_1]
#         torch_pred_1 = np.mean(torch_preds_1, axis=0)

#         ### tf resnet for regular days
#         x_tt_1 = x_tt[:,features_1_index]
#         x_tt_2 = x_tt[:,features_2_index]
#         tf_preds_0 = [model([x_tt_1, x_tt_2], training = False).numpy() for model in tf_model_0]
#         tf_pred_0 = np.mean(tf_preds_0, axis=0)

#         if past_day_vol > HIGH_VOL or pdm.predict_today_busy():
#             ####### spike net for volatile days
#             torch_preds_2 = [model(torch.tensor(x_tt,dtype=torch.float), 
#                                 torch.tensor(x_cat,dtype=torch.float)).sigmoid().detach().numpy()\
#                            for model in torch_model_2]
#             torch_pred_2 = np.median(torch_preds_2, axis=0)

#             ####### vanilla torch for volatile days
#             cross_41_42_43 = x_tt[:, 41] + x_tt[:, 42] + x_tt[:, 43]
#             cross_1_2 = x_tt[:, 1] / (x_tt[:, 2] + 1e-5)
#             feature_inp = np.c_[x_tt, cross_41_42_43.reshape(-1, 1), cross_1_2.reshape(-1, 1)]

#             torch_preds_3 = [model(torch.tensor(feature_inp,dtype=torch.float))\
#                         .sigmoid().detach().numpy() for model in torch_model_3]
#             torch_pred_3 = np.mean(torch_preds_3, axis=0)

#             ### resnet for volatile days
#             x_tt_1_v = x_tt[:,features_1_index_v]
#             x_tt_2_v = x_tt[:,features_2_index_v]
#             tf_preds_1 = [model([x_tt_1_v, x_tt_2_v], training = False).numpy() for model in tf_model_1]
#             tf_pred_1 = np.median(tf_preds_1, axis=0)

#             ## ae+mlp model volatile days
#             tf_preds_3 = [model(x_tt, training = False).numpy() for model in tf_models_3]
#             tf_pred_3 = np.mean(tf_preds_3, axis=0)

#             #### concat blending
#             pred = np.c_[torch_pred_1, torch_pred_2, torch_pred_3, tf_pred_0, tf_pred_1, tf_pred_3].squeeze()
#             pred = f(pred, beta=0.6)
#         else:

#             # tf_preds for ae+mlp model
#             tf_preds_2 = [model(x_tt, training = False).numpy() for model in tf_models_2]
#             tf_pred_2 = np.mean(tf_preds_2, axis=0)

#             pred = np.c_[torch_pred_1, tf_pred_0, tf_pred_2].squeeze()
#             pred = f(pred)

#         pred_df.action.values[0] = int(pred >= th)
#     else:
#         pred_df.action.values[0] = 0

#     env.predict(pred_df)
#     pbar.update()
# pbar.close()

In [None]:
torch_pred_1

In [None]:
tf_pred_0

In [None]:
torch_pred_2 # should be in a busy day (day 2 in example test)

In [None]:
torch_pred_3 # should be in a busy day (day 2 in example test)

In [None]:
tf_pred_1 # should be in a busy day (day 2 in example test)

In [None]:
tf_pred_3 # should be in a busy day (day 2 in example test)

In [None]:
np.c_[torch_pred_1, tf_pred_0, tf_pred_2].squeeze() # regular day ensemble

In [None]:
np.c_[torch_pred_1, torch_pred_2, torch_pred_3, tf_pred_0, tf_pred_1, tf_pred_3].squeeze() # busy day ensemble