# How are the buckets shifted in time?

Throughout this competition I was not entirely sure how exactly the order book and trade data were scaled and shifted.
In the discussion forum I came across these two posts by Matteo and Jiashen:

> Hi, a random amount of seconds is dropped at the beginning of each time_id. That is meant to make the data harder to revert for someone with access to the full dataset.
> The normalization happens at the beginning of each time bucket by scaling the WAP to 1, so small deviations in price can be explained because of it.

https://www.kaggle.com/c/optiver-realized-volatility-prediction/discussion/249474#1400435

> Hi @stassl , thank you very much for your question. Yes, you are right. If, at the second when the bucket starts there is no book update, we will rebase the second_in_bucket field which means the time sync for that stock on second level breaks. However if a stock has a book update on that second, then there is no rebase hence the sync does not break. Now in hindsight we should admit that this part of the design is not ideal --- something we also learned as competition host.

https://www.kaggle.com/c/optiver-realized-volatility-prediction/discussion/251775#1386295

If I understand these statements correctly, there is no way to differentiate between the following two order books once they have been rebased to 0:

Book 1

|  second| wap |
| --- | --- |
|  100| 1.4 |
|  220| 1.5 |
|  400| 1.2 |

Book 2

|  second| wap |
| --- | --- |
|  200| 1.4 |
|  320| 1.5 |
|  500| 1.2 |

as both get rebased to

|  second| wap |
| --- | --- |
|  000| 1.4 |
|  120| 1.5 |
|  300| 1.2 |
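The example above can be reproduced in a few lines. This is only a minimal sketch: `rebase_seconds` is a hypothetical helper that assumes rebasing simply subtracts the second of the first update, which is how the quoted forum posts read, not a confirmed description of the host's pipeline.

```python
import pandas as pd


def rebase_seconds(book: pd.DataFrame) -> pd.DataFrame:
    """Shift the seconds so that the first update lands on second 0.

    Assumption: rebasing is a plain subtraction of the first update's second.
    """
    rebased = book.copy()
    rebased["second"] = rebased["second"] - rebased["second"].min()
    return rebased


# The two books from the tables above
book_1 = pd.DataFrame({"second": [100, 220, 400], "wap": [1.4, 1.5, 1.2]})
book_2 = pd.DataFrame({"second": [200, 320, 500], "wap": [1.4, 1.5, 1.2]})

rebased_1 = rebase_seconds(book_1)
rebased_2 = rebase_seconds(book_2)

# After rebasing, nothing is left that tells the two books apart
books_identical = rebased_1.equals(rebased_2)
print(rebased_1)
print("identical after rebasing:", books_identical)
```

The original offsets of 100 and 200 seconds are gone after the subtraction, so they can only be estimated from other information.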

This means a competitor might be able to gain a significant advantage by shifting the buckets by the right number of seconds during preprocessing. When we plot the number of occurrences of each $seconds\_in\_bucket$ value, we can see that the counts drop for values close to the end of a bucket, while the first few values occur more often. This makes sense: a bucket that originally had no update at second 0 gets shifted, so it can no longer have an update at second 599.



In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import collections

basePath = "../input/optiver-realized-volatility-prediction/"

bookTrainFolder  = os.path.join(basePath, "book_train.parquet") # Folder for book data  

activity = np.zeros(600) # An array to count the number of orderbook changes that happened at each second

for f in os.listdir(bookTrainFolder):
    parquetFile = os.listdir(os.path.join(bookTrainFolder, f))[0]

    # Only the seconds_in_bucket column is needed for the count
    df = pd.read_parquet(os.path.join(bookTrainFolder, f, parquetFile), columns=["seconds_in_bucket"])

    # Count the total orderbook changes of one stock at each second
    # and add the activity of all the stocks together
    activity += np.bincount(df["seconds_in_bucket"].to_numpy(), minlength=600)

plt.bar(np.arange(600), activity, width=1.0) # Plot activity
plt.show()

In [None]:
plt.bar(np.arange(575, 600), activity[-25:], width=1.0) # Plot activity of the last 25 seconds (575 to 599)
plt.show()

Many will have noticed that the mean of $|\log{WAP_0 \over 1}|$ (the change of the first calculated WAP compared to a WAP of 1) is greater than the mean of $|\log{WAP_n \over WAP_{n-1}}|$ (the change between any two neighboring WAPs). If the buckets are normalized to 1, this should not be the case.
Matteo stated that:
> a random amount of seconds is dropped at the beginning of each time_id.

This means that, for some buckets, many more seconds have actually passed since the point in time at which the WAP was normalized to 1, which also explains the bigger price movement at the beginning.
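The comparison of the first move against the later moves can be measured with a short sketch like the one below. The helper names `wap1` and `first_vs_later_moves` are made up for this illustration, the WAP is the level 1 formula from the competition material, and the toy book holds invented numbers, not competition data.

```python
import numpy as np
import pandas as pd


def wap1(book: pd.DataFrame) -> pd.Series:
    """Level 1 weighted average price."""
    return (book["bid_price1"] * book["ask_size1"] + book["ask_price1"] * book["bid_size1"]) / (
        book["bid_size1"] + book["ask_size1"]
    )


def first_vs_later_moves(book: pd.DataFrame):
    """Return (mean |log(WAP_0 / 1)|, mean |log(WAP_n / WAP_{n-1})|) over all time_ids."""
    book = book.sort_values(["time_id", "seconds_in_bucket"]).copy()
    book["log_wap"] = np.log(wap1(book))
    grouped = book.groupby("time_id")["log_wap"]
    first_move = grouped.first().abs().mean()   # distance of the first WAP from 1
    later_moves = grouped.diff().abs().mean()   # NaN at each bucket start is skipped
    return first_move, later_moves


# Toy book with two time_ids (invented values, equal sizes so the WAP is the mid price)
toy_book = pd.DataFrame({
    "time_id": [5, 5, 11, 11],
    "seconds_in_bucket": [0, 3, 0, 8],
    "bid_price1": [1.00, 1.01, 0.98, 0.98],
    "ask_price1": [1.02, 1.03, 1.00, 1.00],
    "bid_size1": [10, 10, 10, 10],
    "ask_size1": [10, 10, 10, 10],
})

first_move, later_moves = first_vs_later_moves(toy_book)
print(first_move, later_moves)
```

On the real data, the same function could be applied to one stock's book DataFrame at a time.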

In this competition I tried to shift buckets to the "right" whenever all of their $seconds\_in\_bucket$ values are below 599, as these are the only buckets that can possibly have been shifted. If we assume that most stocks are slightly correlated (especially during bigger market movements), we can approximate the bucket shifts by minimizing $\epsilon$ in the following equation for any two distinct stocks $A$ and $B$ within a $time\_id$:

$$\epsilon_{AB} = \sum\limits_{second=1}^{599} \left( \log\left(\frac{WAP_{A,\,second}}{WAP_{A,\,second-1}}\right) - \log\left(\frac{WAP_{B,\,second}}{WAP_{B,\,second-1}}\right) \right)^2$$
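A rough sketch of how the equation above could be evaluated for a candidate shift is shown below. All function names are hypothetical. Two choices are assumptions of this sketch rather than part of the formula: the sparse WAP updates are forward-filled onto a per-second grid, and the squared differences are averaged over the overlapping seconds instead of summed, so that larger shifts are not favored just because fewer terms remain.

```python
import numpy as np


def log_returns_on_grid(seconds, wap, length=600):
    """Forward-fill sparse WAP updates onto a per-second grid and return log returns."""
    seconds = np.asarray(seconds)
    wap = np.asarray(wap, dtype=float)
    # Index of the most recent update at or before each second of the grid
    idx = np.searchsorted(seconds, np.arange(length), side="right") - 1
    idx = np.clip(idx, 0, None)  # seconds before the first update reuse the first WAP
    return np.diff(np.log(wap[idx]))


def epsilon(returns_a, returns_b, shift):
    """Mean squared difference of log returns when stock B is moved `shift` seconds to the right."""
    overlap = len(returns_a) - shift
    diff = returns_a[shift:] - returns_b[:overlap]
    return float(np.mean(diff ** 2))


def best_shift(returns_a, returns_b, max_shift):
    """Candidate shift in [0, max_shift] with the smallest epsilon."""
    costs = [epsilon(returns_a, returns_b, s) for s in range(max_shift + 1)]
    return int(np.argmin(costs))


# Synthetic check: B is a copy of A whose first 7 seconds were dropped
rng = np.random.default_rng(0)
returns_a = rng.normal(scale=1e-4, size=599)
true_shift = 7
returns_b = np.concatenate([returns_a[true_shift:], np.zeros(true_shift)])

estimated_shift = best_shift(returns_a, returns_b, max_shift=30)
print("estimated shift:", estimated_shift)
```

Real stocks are only loosely correlated, so the minimum will be far less sharp than in this synthetic check.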

In practice I tried a few different things. For example, I selected the stocks within a $time\_id$ that have the highest $seconds\_in\_bucket$ value (the least shiftable stocks) and used these stocks as a scale for the other stocks. I then minimized the sum of $\epsilon_{AB}$ over all possible pairs of stocks $A$ and $B$, where $A$ is a stock from the scaling group and $B$ is a stock from the group of the other stocks. I also chose to use a small moving average of the log returns instead of the raw log returns, in case the correlation between stocks was slightly offset.
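The procedure described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the exact code used in the competition: the function names, the window of 5 seconds, the averaging over the overlap, and the rule that a stock can be shifted by at most 599 minus its last $seconds\_in\_bucket$ are all choices made for this example.

```python
import numpy as np


def split_by_last_second(last_seconds, bucket_end=599):
    """Split stocks into the scaling group and the shiftable rest.

    last_seconds maps a stock id to its highest seconds_in_bucket within one time_id.
    Returns the scaling group and, for every other stock, its assumed maximum shift.
    """
    latest = max(last_seconds.values())
    scaling_group = [s for s, sec in last_seconds.items() if sec == latest]
    max_shifts = {s: bucket_end - sec for s, sec in last_seconds.items() if sec != latest}
    return scaling_group, max_shifts


def smooth(returns, window=5):
    """Small moving average of per-second log returns."""
    return np.convolve(returns, np.ones(window) / window, mode="same")


def estimate_shift(target, references, max_shift, window=5):
    """Shift of `target` that minimizes the summed epsilon against all reference stocks."""
    target_s = smooth(target, window)
    refs = [smooth(r, window) for r in references]
    n = len(target_s)
    costs = []
    for shift in range(max_shift + 1):
        overlap = n - shift
        costs.append(sum(np.mean((r[shift:] - target_s[:overlap]) ** 2) for r in refs))
    return int(np.argmin(costs))


# Synthetic check: three reference stocks and one stock that lost its first 4 seconds
rng = np.random.default_rng(1)
market = rng.normal(size=599)
references = [market + rng.normal(scale=0.1, size=599) for _ in range(3)]
true_returns = market + rng.normal(scale=0.1, size=599)
observed = np.concatenate([true_returns[4:], np.zeros(4)])

scaling_group, max_shifts = split_by_last_second({"a": 599, "b": 590, "c": 599})
estimated = estimate_shift(observed, references, max_shift=20)
print(scaling_group, max_shifts, estimated)
```

In the synthetic check all series share one strong common factor, which is much more favorable than real market data.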

Shifting the buckets like this has the disadvantage that negatively correlated stocks will not align very well.

# Did this help my score?

Sadly, I was not able to see any noticeable improvement in my score, but I kept this method of preprocessing in my final models anyway. I also haven't fully explored all the possible methods to align the buckets in time, so I wonder whether any of you were able to improve by messing with the alignment of the buckets.
