## How this notebook differs from others


+ Consider intra-`time_id` context + `BertLayer` + PCC loss function -> hits LB 0.148
    + > I've just started tuning the model structure and hyperparameters, and I believe there's still a lot of room for improvement!
+ Provide a flexible template that allows us to validate a wide range of our ideas without changing any code (use command-line arguments if run locally or overwrite the settings in the last cell if run on Kaggle), including:
    + Basic hyperparameters
        + optimizer
        + learning rate
        + learning rate scheduler
        + ...
    + Model structures
        + number of layers
        + dimension of each layer
        + where to add multi-head-attention layers 
        + whether to use dropout, dropout rate
        + ...
    + Others
        + apply MSE or PCC as loss function
        + number of folds
        + whether to early stop
        + dataset split ratio
        + ...
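
As a rough illustration of the command-line route, the sketch below mimics a few of the flags defined later in `parse_args` (`--szs`, `--mhas`, `--loss`, `--dropout`) with a stand-alone parser. The `build_parser` helper is hypothetical and covers only a subset of the options, and the `python main.py` invocation in the comment assumes the script is run under that name.

```python
from argparse import ArgumentParser


def build_parser():
    # Hypothetical stand-alone parser; mirrors a subset of the flags in main.py.
    parser = ArgumentParser()
    parser.add_argument('--szs', type=int, nargs='+',
                        default=[512, 256, 128, 64])
    parser.add_argument('--mhas', type=int, nargs='+', default=[])
    parser.add_argument('--loss', default='pcc', choices=['mse', 'pcc'])
    parser.add_argument('--dropout', type=float, default=0.0)
    return parser


# Assumed equivalent of:
#   python main.py --szs 512 128 64 --mhas 2 --loss mse --dropout 0.4
args = build_parser().parse_args(
    ['--szs', '512', '128', '64', '--mhas', '2',
     '--loss', 'mse', '--dropout', '0.4'])
print(args.szs, args.mhas, args.loss, args.dropout)
```

On Kaggle the same fields can simply be overwritten on the parsed `args` object, which is what the last cell does.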

## More about this notebook

It is based on PyTorch Lightning and therefore supports a lot of advanced features such as

+ Multi-GPU training, TPU training
+ 16-bit precision training
+ TensorBoard visualization
+ Gradient accumulation every k batches
+ Stochastic weight averaging
+ ...

without the need to change the code.

In addition, if you would like to run it locally, please refer directly to this [GitHub repo](https://github.com/siahuat0727/ubiquant-market-prediction), install the dependencies by following the README, and it should work! :)


## Approach and model structure

> There are many valuable public notebooks I haven't gone through yet. (And I am a newbie in market prediction, so the following is just my own thinking.) I would appreciate it if you would comment and share your experience or opinion!

I'm not going to focus on feature engineering in this notebook, even though it might have some benefits.
We know from many public notebooks that the dataset has been preprocessed and is clean enough to be fed directly into a DNN model. Therefore, I am focusing on what context information we let the model capture and what we want the model to learn.

I believe we can divide the modeling problem into three stages, with each stage gradually considering more context.

**Stage 1 - Consider every target independently**

This is the structure used by most other public notebooks that take the DNN approach. Although this template supports tuning Stage 1, I am not going to pay much attention to it.

**Stage 2 - Consider intra-`time_id` context information**

In this stage, we consider all the data in a `time_id` as a whole and then predict the corresponding targets.

According to [What this competition is about](https://www.kaggle.com/c/ubiquant-market-prediction/discussion/303397), we are probably predicting the position of an asset in a day. If that is the case, a target value means nothing on its own; it only matters in comparison with the other targets in the same `time_id`. Therefore, I prefer a **correlation-based loss** over an MSE-like loss. ~~Even though many people are sharing that MSE loss can get a better result in this contest.~~
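
To make the argument concrete, here is a small sketch with made-up numbers (not competition data). It shows that Pearson correlation ignores the scale and offset of the predictions, while MSE penalizes them heavily. The `pearson` and `mse` helpers are written out by hand for illustration only.

```python
import math


def pearson(xs, ys):
    # Pearson correlation of two equal-length sequences.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mse(xs, ys):
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


targets = [0.3, -1.2, 0.8, 0.1, -0.5]       # made-up targets for one time_id
preds = [0.2, -0.9, 1.1, 0.0, -0.2]         # made-up predictions
rescaled = [10.0 * p + 3.0 for p in preds]  # same ordering, new scale and offset

print(pearson(preds, targets), pearson(rescaled, targets))  # equal up to rounding
print(mse(preds, targets), mse(rescaled, targets))          # very different
```

Under a correlation-based loss the model is only asked to get the relative positions right, which matches the interpretation above.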

 
To benefit from intra-`time_id` information, I believe an intuitive way is to treat the investment data of each `time_id` as an unordered sequence and then apply a model from the NLP field. A self-attention (multi-head attention) layer without positional encoding would be a good choice. In particular, I am trying `BertLayer` since the BERT model has been proven effective. Additionally, it contains a skip connection, which can (probably) ensure that adding a self-attention layer at least won't make things worse.

>  To the extreme, if an identity mapping (_which means don't consider intra-`time_id` context in our case_) were optimal, it would be easier to push the residual (_self-attention layer in our case_) to zero than to fit an identity mapping by a stack of nonlinear layers.  -- _Deep Residual Learning for Image Recognition_
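
The "unordered sequence" property can be checked on a toy example. The sketch below is a deliberately simplified, parameter-free single-head self-attention with a skip connection. It is not `BertLayer`, which additionally has learned projections, multiple heads, layer normalization, and a feed-forward block. It only illustrates that, without positional encoding, reordering the assets reorders the outputs in the same way and changes nothing else.

```python
import math


def self_attention(tokens):
    # Single-head self-attention with Q = K = V = tokens (no learned weights,
    # no positional encoding), followed by a skip connection.
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        attended = [sum(w * v[j] for w, v in zip(weights, tokens)) / total
                    for j in range(dim)]
        out.append([x + a for x, a in zip(q, attended)])  # skip connection
    return out


assets = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]  # toy features of three assets
perm = [2, 0, 1]                                # an arbitrary reordering

out = self_attention(assets)
out_perm = self_attention([assets[i] for i in perm])

# Largest difference between the permuted outputs and the outputs of the
# permuted inputs.
max_diff = max(abs(a - b)
               for row, i in zip(out_perm, perm)
               for a, b in zip(row, out[i]))
print(max_diff)  # zero up to floating-point rounding
```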

**Stage 3 - Consider inter-`time_id` context information**

It is not implemented yet. An intuitive way is to apply cross-attention over all the previous inputs, but the time complexity of this structure is unacceptable if the time series is too long. To tackle this challenge, we can define a sufficient number of tokens that join the self-attention, and feed the resulting tokens recurrently to the next time step so that they act as a memory pool.
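
Since Stage 3 is not implemented here, the following is only a hypothetical sketch of the memory-pool idea, not the author's design. It uses a parameter-free attention function (`attend`) as a stand-in for a real attention layer, a zero-initialized memory, and made-up inputs. All names, including `run_with_memory` and `n_memory`, are assumptions made for illustration.

```python
import math


def attend(tokens):
    # Parameter-free self-attention (Q = K = V = tokens) with a skip connection.
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in tokens]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed = [sum(w * v[j] for w, v in zip(weights, tokens)) / total
                 for j in range(dim)]
        out.append([x + m for x, m in zip(q, mixed)])
    return out


def run_with_memory(steps, n_memory, dim):
    # `steps` is a list of time steps, each a list of asset vectors.
    memory = [[0.0] * dim for _ in range(n_memory)]  # hypothetical initial memory
    outputs = []
    for assets in steps:
        mixed = attend(memory + assets)   # memory tokens join the self-attention
        memory = mixed[:n_memory]         # updated memory goes to the next step
        outputs.append(mixed[n_memory:])  # one output vector per asset
    return outputs, memory


steps = [
    [[1.0, 0.0], [0.0, 1.0]],               # time step 0: two assets
    [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]],  # time step 1: three assets
]
outputs, memory = run_with_memory(steps, n_memory=4, dim=2)
print([len(o) for o in outputs], len(memory))  # [2, 3] 4
```

The cost of each step depends only on the number of assets at that step plus the fixed memory size, not on how many time steps came before.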

## Batching strategy and data augmentation

Starting with Stage 2, we sample training data by `time_id`. We can't stack the samples directly into a mini-batch because the number of investments differs between `time_id`s. An intuitive solution is to set the batch size to 1, but this is inefficient and can make training unstable. Again, we are probably predicting the relative positions of a set of assets. If that is the case, we can randomly drop some rows so that every `time_id` in the batch has the same number of investments. Indeed, I believe this can be treated as **data augmentation** for Stages 2 and 3, since we are implicitly teaching the model that **the absence of some assets' data shouldn't affect the relative positions of the remaining assets.** (It means nothing for Stage 1, which doesn't consider other assets' data when making a prediction.)
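
The truncation step can be sketched without any framework. The example below uses plain lists and made-up asset names. The notebook's `collate_fn` does the same thing on tensors with `torch.randperm`. The helper name `truncate_to_common_length` is hypothetical.

```python
import random


def truncate_to_common_length(groups, rng):
    # `groups` holds one list of asset records per time_id; lengths may differ.
    length = min(len(group) for group in groups)
    # Shuffle each group independently, then keep the first `length` records,
    # so every longer group loses a different random subset of assets.
    return [rng.sample(group, len(group))[:length] for group in groups]


rng = random.Random(0)
groups = [
    ['a1', 'a2', 'a3', 'a4', 'a5'],  # time_id 0: five assets
    ['b1', 'b2', 'b3'],              # time_id 1: three assets
    ['c1', 'c2', 'c3', 'c4'],        # time_id 2: four assets
]
batch = truncate_to_common_length(groups, rng)
print(batch)  # three lists of three records each, ready to be stacked
```

Because the dropped assets change every time a batch is drawn, the model sees each `time_id` with a different subset of assets across epochs.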

## Others

+ unseen investment data augmentation (?)
	+ randomly mask the investment embedding when training
+ cv vs. lb
	+ I think the next urgent thing might be learning about cv strategies from the discussion

Check out the follow-up work in [this notebook](https://www.kaggle.com/siahuat/memory-transformer-intra-inter-time-context) if you are interested!

**Please upvote if you like it!**

## Talk is cheap. Here is the code.

In [None]:
!pip install ../input/202202-libraries/torchmetrics-0.7.2-py3-none-any.whl --user

In [None]:
import torchmetrics
assert torchmetrics.__version__ == '0.7.2', torchmetrics.__version__

import torch
print(torch.cuda.get_device_name(0))

## constants.py

In [None]:
FEATURES = [f'f_{i}' for i in range(300)]

## model.py

In [None]:
import torch
from torch import nn
from transformers import BertConfig
from transformers.models.bert.modeling_bert import BertLayer as _BertLayer


class SafeEmbedding(nn.Embedding):
    "Handle unseen id"

    def forward(self, input):
        output = torch.empty((*input.size(), self.embedding_dim),
                             device=input.device,
                             dtype=self.weight.dtype)

        seen = input < self.num_embeddings
        unseen = seen.logical_not()

        output[seen] = super().forward(input[seen])
#         output[unseen] = torch.zeros_like(self.weight[0])
        output[unseen] = self.weight.mean(dim=0).detach()
        return output


class FlattenBatchNorm1d(nn.BatchNorm1d):
    "BatchNorm1d over the last dim: treats (N, L, C) as (N*L, C)"

    def forward(self, input):
        sz = input.size()
        return super().forward(input.view(-1, sz[-1])).view(*sz)


class BertLayer(_BertLayer):
    def forward(self, *args, **kwargs):
        return super().forward(*args, **kwargs)[0]


class Net(nn.Module):
    def __init__(self, args, n_feature):
        super().__init__()

        self.emb = SafeEmbedding(args.n_emb, args.emb_dim)

        in_size = args.emb_dim + n_feature
        szs = [in_size] + args.szs

        self.layers = nn.Sequential(*self.get_layers(args, szs))

        self.post_init()

    def get_layers(self, args, szs):
        layers = []

        for layer_i, (in_sz, out_sz) in enumerate(zip(szs[:-1], szs[1:])):
            layers.append(nn.Linear(in_sz, out_sz))
            layers.append(FlattenBatchNorm1d(out_sz))
            layers.append(nn.SiLU(inplace=True))

            if args.dropout > 0.0:
                layers.append(nn.Dropout(p=args.dropout, inplace=True))

            if layer_i in args.mhas:
                layers.append(BertLayer(BertConfig(
                    num_attention_heads=8,
                    hidden_size=out_sz,
                    intermediate_size=out_sz)))

        layers.append(nn.Linear(szs[-1], 1))
        return layers

    def forward(self, x_id, x_feat):
        x_emb = self.emb(x_id)
        out = torch.cat((x_emb, x_feat), dim=-1)
        return self.layers(out).squeeze(-1)

    def post_init(self):
        for m in self.modules():
            if isinstance(m, (nn.Linear, SafeEmbedding)):
                nn.init.kaiming_normal_(
                    m.weight, mode="fan_out", nonlinearity="relu")
            elif isinstance(m, (nn.BatchNorm1d, nn.LayerNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Embedding):
                m.weight.data.normal_(mean=0.0, std=0.02)
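
As a framework-free illustration of what `SafeEmbedding` does with out-of-range ids, the sketch below uses plain lists and a toy table. The `lookup` helper is hypothetical and only mirrors the fallback rule: ids inside the table get their own row, and ids beyond it get the mean of all rows.

```python
def lookup(table, ids):
    # `table` is a list of embedding vectors; ids >= len(table) count as unseen.
    dim = len(table[0])
    mean = [sum(row[j] for row in table) / len(table) for j in range(dim)]
    return [table[i] if i < len(table) else mean for i in ids]


table = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # toy table with three known ids
print(lookup(table, [0, 2, 7]))                # id 7 is out of range
# [[1.0, 0.0], [2.0, 2.0], [1.0, 1.0]]
```

Note that "unseen" here means out of range: an id below `n_emb` that never appeared in the training data still gets its own row.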

## data_module.py

In [None]:
import numpy as np
import pandas as pd
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, random_split

# from constants import FEATURES


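def collate_fn(datas):
    # Each item is (ids, feats, targets) for one time_id. Shuffle the rows of
    # every item, then truncate all items to the shortest one so they stack.
    perms = [torch.randperm(data[0].size(0)) for data in datas]
    length = min(data[0].size(0) for data in datas)
    return [
        torch.stack([d[i][perm][:length] for d, perm in zip(datas, perms)])
        for i in range(3)
    ]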


class MyDataset(torch.utils.data.Dataset):
    def __init__(self, *tensor_lists) -> None:
        assert all(len(tensor_lists[0]) == len(
            t) for t in tensor_lists), "Size mismatch between tensor_lists"
        self.tensor_lists = tensor_lists

    def __getitem__(self, index):
        return tuple(t[index] for t in self.tensor_lists)

    def __len__(self):
        return len(self.tensor_lists[0])


def df_to_input_id(df):
    return torch.tensor(df['investment_id'].to_numpy(dtype=np.int16),
                        dtype=torch.int)


def df_to_input_feat(df):
    return torch.tensor(df[FEATURES].to_numpy(),
                        dtype=torch.float32)


def df_to_target(df):
    return torch.tensor(df['target'].to_numpy(),
                        dtype=torch.float32)


def load_data(path):
    df = pd.read_parquet(path)
    groups = df.groupby('time_id')
    return [
        groups.get_group(v)
        for v in df.time_id.unique()
    ]


def split(df_groupby_time, split_ratios):
    ids = [df_to_input_id(df) for df in df_groupby_time]
    feats = [df_to_input_feat(df) for df in df_groupby_time]
    targets = [df_to_target(df) for df in df_groupby_time]

    dataset = MyDataset(ids, feats, targets)

    lengths = []
    for ratio in split_ratios[:-1]:
        lengths.append(int(len(dataset)*ratio))
    lengths.append(len(dataset) - sum(lengths))

    return random_split(dataset, lengths)


class UMPDataModule(pl.LightningDataModule):
    def __init__(self, args):
        super().__init__()
        self.args = args

        datasets = split(load_data(args.input), args.split_ratios)
        if len(datasets) == 3:
            self.tr, self.val, self.test = datasets
        else:
            self.tr, self.val = datasets
            self.test = self.val

    def train_dataloader(self):
        return DataLoader(self.tr, batch_size=self.args.batch_size,
                          num_workers=self.args.workers, shuffle=True,
                          collate_fn=collate_fn, drop_last=True,
                          pin_memory=True)

    def _val_dataloader(self, dataset):
        return DataLoader(dataset, batch_size=1,
                          num_workers=self.args.workers, pin_memory=True)

    def val_dataloader(self):
        return self._val_dataloader(self.val)

    def test_dataloader(self):
        return self._val_dataloader(self.test)
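
The length computation inside `split` can be tried on its own. In the sketch below, `split_lengths` is a hypothetical helper that repeats the same arithmetic, and `1211` is just an example count of `time_id` groups.

```python
def split_lengths(n_items, ratios):
    # All but the last subset get int(n_items * ratio) items; the last subset
    # takes the remainder, so the lengths always sum to n_items.
    lengths = [int(n_items * ratio) for ratio in ratios[:-1]]
    lengths.append(n_items - sum(lengths))
    return lengths


print(split_lengths(1211, [0.7, 0.15, 0.15]))  # [847, 181, 183]
print(split_lengths(1211, [0.8, 0.2]))         # [968, 243]
```

With two ratios there is no separate test subset, which is why `UMPDataModule` reuses the validation set for testing in that case.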

## litmodule.py

In [None]:
import torch
from pytorch_lightning import LightningModule
from pytorch_lightning.callbacks import (EarlyStopping, LearningRateMonitor,
                                         ModelCheckpoint,
                                         StochasticWeightAveraging)
from torch import nn
from torchmetrics import PearsonCorrCoef

# from constants import FEATURES
# from model import Net


def get_loss_fn(loss):
    def mse(preds, y):
        return nn.MSELoss()(preds, y)

    def pcc(preds, y):
        assert preds.dim() == 2, preds.size()
        assert preds.size() == y.size(), (preds.size(), y.size())

        cos = nn.CosineSimilarity(dim=1, eps=1e-6)
        return -cos(preds - preds.mean(dim=1, keepdim=True),
                    y - y.mean(dim=1, keepdim=True)).mean()

    return {
        'mse': mse,
        'pcc': pcc,
    }[loss]


class UMPLitModule(LightningModule):
    def __init__(self, args):
        super().__init__()
        self.args = args
        self.model = Net(args, n_feature=len(FEATURES))
        self.test_pearson = PearsonCorrCoef()
        self.loss_fn = get_loss_fn(args.loss)

    def forward(self, *args):
        return self.model(*args)

    def training_step(self, batch, batch_idx):
        x_id, x_feat, y = batch

        preds = self.forward(x_id, x_feat)
        loss = self.loss_fn(preds, y)
        self.log('train_loss', loss, on_epoch=True)
        return loss

    def _evaluate_step(self, batch, batch_idx, stage):
        x_id, x_feat, y = batch

        preds = self.forward(x_id, x_feat)
        self.test_pearson(preds, y)
        self.log(f'{stage}_pearson', self.test_pearson, prog_bar=True)

    def test_step(self, batch, batch_idx):
        return self._evaluate_step(batch, batch_idx, 'test')

    def validation_step(self, batch, batch_idx):
        return self._evaluate_step(batch, batch_idx, 'val')

    def configure_optimizers(self):
        kwargs = {
            'lr': self.args.lr,
            'weight_decay': self.args.weight_decay,
        }

        optimizer_cls = {
            'adam': torch.optim.Adam,
            'adamw': torch.optim.AdamW,
        }[self.args.optimizer]
        # Instantiate only the selected optimizer.
        optimizer = optimizer_cls(self.model.parameters(), **kwargs)

        optim_config = {
            'optimizer': optimizer,
        }
        if self.args.lr_scheduler is not None:
            optim_config['lr_scheduler'] = {
                'step_lr': torch.optim.lr_scheduler.StepLR(
                    optimizer, step_size=5, gamma=0.8),
            }[self.args.lr_scheduler]

        return optim_config

    def configure_callbacks(self):
        callbacks = [
            LearningRateMonitor(),
            ModelCheckpoint(monitor='val_pearson', mode='max', save_last=True,
                            filename='{epoch}-{val_pearson:.4f}'),
        ]
        if self.args.swa:
            callbacks.append(StochasticWeightAveraging(swa_epoch_start=0.7,
                                                       device='cpu'))
        if self.args.early_stop:
            callbacks.append(EarlyStopping(monitor='val_pearson',
                                           mode='max', patience=10))
        return callbacks
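
The `pcc` loss above relies on the fact that the cosine similarity of two mean-centered vectors equals their Pearson correlation. The sketch below checks this on made-up numbers for a single `time_id`, using hand-written helpers (`cosine`, `center`, `pearson`) and ignoring the small `eps` that `nn.CosineSimilarity` adds for numerical stability.

```python
import math


def cosine(xs, ys):
    dot = sum(x * y for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum(x * x for x in xs))
    norm_y = math.sqrt(sum(y * y for y in ys))
    return dot / (norm_x * norm_y)


def center(xs):
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]


def pearson(xs, ys):
    # Textbook definition: covariance over the product of standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


preds = [0.2, -0.9, 1.1, 0.0, -0.2]    # made-up predictions for one time_id
targets = [0.3, -1.2, 0.8, 0.1, -0.5]  # made-up targets

pcc_loss = -cosine(center(preds), center(targets))
print(pcc_loss, -pearson(preds, targets))  # equal up to rounding
```

In the notebook the same quantity is computed per row of the batch (one row per `time_id`) and then averaged.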

## main.py

In [None]:
from argparse import ArgumentParser

import torch
from pytorch_lightning import Trainer, seed_everything
from pytorch_lightning.loggers import TensorBoardLogger

# from data_module import (UMPDataModule, df_to_input_feat, df_to_input_id,
#                          load_data)
# from litmodule import UMPLitModule


def get_name(args):
    return '-'.join(filter(None, [  # Remove empty string by filtering
        'x'.join(str(sz) for sz in args.szs),
        'x'.join(str(mha) for mha in args.mhas),
        f'epch{args.max_epochs}',
        f'btch{args.batch_size}',
        f'{args.optimizer}',
        f'drop{args.dropout}',
        f'schd{args.lr_scheduler}',
        f'loss{args.loss}',
        f'lr{args.lr}',
        f'wd{args.weight_decay}',
        f'swa{args.swa}',
        f'emb{args.emb_dim}',
    ])).replace(' ', '')


def submit(args, ckpts):

    litmodels = [
        UMPLitModule.load_from_checkpoint(ckpt_path, args=args).eval()
        for ckpt_path in ckpts
    ]

    import ubiquant
    env = ubiquant.make_env()   # initialize the environment

    for test_df, submit_df in env.iter_test():
        input_ids = df_to_input_id(test_df).unsqueeze(0)
        input_feats = df_to_input_feat(test_df).unsqueeze(0)

        with torch.no_grad():
            submit_df['target'] = torch.cat([
                litmodel.forward(input_ids, input_feats)
                for litmodel in litmodels
            ]).mean(dim=0)

        env.predict(submit_df)   # register your predictions


def test(args):
    seed_everything(args.seed)

    litmodel = UMPLitModule.load_from_checkpoint(args.checkpoint, args=args)
    dm = UMPDataModule(args)

    Trainer.from_argparse_args(args).test(litmodel, datamodule=dm)


def train_single(args, seed):
    seed_everything(seed)

    litmodel = UMPLitModule(args)
    dm = UMPDataModule(args)

    name = get_name(args)
    logger = TensorBoardLogger(save_dir='logs', name=name)

    trainer = Trainer.from_argparse_args(args,
                                         logger=logger,
                                         deterministic=True,
                                         precision=16)
    trainer.fit(litmodel, dm)

    best_ckpt = trainer.checkpoint_callback.best_model_path
    test_result = trainer.test(ckpt_path=best_ckpt,
                               datamodule=dm)

    return {
        'ckpt_path': best_ckpt,
        'test_pearson': test_result[0]['test_pearson']
    }


def train(args):
    return [
        train_single(args, seed)
        for seed in range(args.seed, args.seed + args.n_fold)
    ]


def parse_args(is_kaggle=False):
    parser = ArgumentParser()
    parser = Trainer.add_argparse_args(parser)

    parser.add_argument('--workers', type=int, default=2)
    parser.add_argument(
        '--input', default='../input/ubiquant-parquet/train_low_mem.parquet')

    # Hyperparams
    parser.add_argument('--batch_size', type=int, default=8)
    parser.add_argument('--lr', type=float, default=0.001)
    parser.add_argument('--weight_decay', type=float, default=1e-4)
    parser.add_argument('--seed', type=int, default=42)
    parser.add_argument('--optimizer', default='adam',
                        choices=['adam', 'adamw'])
    parser.add_argument('--lr_scheduler', default=None)
    parser.add_argument('--loss', default='pcc', choices=['mse', 'pcc'])
    parser.add_argument('--emb_dim', type=int, default=32)
    parser.add_argument('--n_fold', type=int, default=1)
    parser.add_argument('--split_ratios', type=float, nargs='+',
                        default=[0.7, 0.15, 0.15],
                        help='train, val, and test set (optional) split ratio')
    parser.add_argument('--early_stop', action='store_true')
    parser.add_argument('--swa', action='store_true',
                        help='whether to perform Stochastic Weight Averaging')

    # Model structure
    parser.add_argument('--n_emb', type=int, default=4000)  # TODO tight
    parser.add_argument('--szs', type=int, nargs='+',
                        default=[512, 256, 128, 64])
    parser.add_argument(
        '--mhas', type=int, nargs='+', default=[],
        help=('Insert an MHA layer (BertLayer) after the block with index i '
              '(0-based). Every element should be < len(szs)'))
    parser.add_argument('--dropout', type=float, default=0.0,
                        help='Set to 0.0 to disable')

    # Test
    parser.add_argument('--test', action='store_true')

    # Checkpoint
    parser.add_argument('--checkpoint', help='path to checkpoints (for test)')

    # Handle kaggle platform
    args, unknown = parser.parse_known_args()

    if not is_kaggle:
        assert not unknown, f'unknown args: {unknown}'

    # Block indices are 0-based in Net.get_layers, so valid values are
    # 0 .. len(szs) - 1.
    assert all(0 <= i < len(args.szs) for i in args.mhas)
    return args


def run_local():
    args = parse_args()

    if args.test:
        test(args)
        return

    best_results = train(args)
    test_pearsons = [res['test_pearson'] for res in best_results]
    print(f'mean={sum(test_pearsons)/len(test_pearsons)}, {test_pearsons}')
    print(best_results)

In [None]:
def kaggle():
    args = parse_args(True)
    # In Kaggle mode, only the default argument values are used.
    # To change arguments, hard-code them below, e.g.:
    # args.loss = 'mse'
    # args.szs = [512, 128, 64, 64, 64]

    args.szs = [512, 256, 64, 64, 32]
    args.mhas = [3, 4]
    args.dropout = 0.4
    args.weight_decay = 0.0002
    args.split_ratios = [0.8, 0.2]
    args.gpus = 1
    args.max_epochs = 1
    args.early_stop = True
    
    do_submit = True
    train_on_kaggle = False
    
    if train_on_kaggle:
        best_results = train(args)
        ckpts = [res['ckpt_path'] for res in best_results]

        test_pearsons = [res['test_pearson'] for res in best_results]
        print(f'mean={sum(test_pearsons)/len(test_pearsons)}, {test_pearsons}')
    else:
        from glob import glob
        # TODO fill in the ckpt paths
        ckpts = []
        ckpts = glob('../input/512x256x64x64x32x3x4x04xplateaux0001x00003/*/version_*/*/e*.ckpt')
        # ckpts.sort()
        # ckpts = [ckpts[5]]
        print('\n'.join(ckpts))

    assert ckpts

    if do_submit:
        submit(args, ckpts)


if __name__ == '__main__':
    is_kaggle = True
    if is_kaggle:
        kaggle()
    else:
        run_local()

**Use tensorboard to visualize the training process**

```
tensorboard --logdir=logs/
```