# De-anonymization: Time Aggregation Tags

After some very poor submissions and a lot of timeouts, I decided to focus on de-anonymizing the data.

Here, I show the meaning of: `tag_{0, 1, 2, 3, 4, 5}`.

## Results


| Tag | Aggregation Type | Characteristic Time (% of day) | Characteristic Time (approx.) |
| ----------- | ----------- | ----------- | ----------- |
| `tag_0` | Rolling Sum/Average | 0.03 | ~ 20 sec |
| `tag_1` | Rolling Sum/Average | 0.27 | ~ 3 min |
| `tag_2` | Rolling Sum/Average | 0.59 | ~ 5 min |
| `tag_3` | Rolling Sum/Average | 1.33 | ~ 10 min |
| `tag_4` | Rolling Sum/Average | 4.55 | ~ 30 min |
| `tag_5` | Cumulative throughout the day | | |


## Prerequisites

This notebook relies on other findings about the tags without showing how I obtained them. What you need to know:
- `tag_6` and `tag_23` are key tags that split the features into 2 different groups.
  - `tag_6` means a price feature
  - `tag_23` means an "additive" feature (either a number of trades or a traded volume, still not sure).
- For a given day, grouping trades made on the same instrument is possible. One way to do this is to look at `feature_{41, 42, 43, 44, 45}`. They are constant for a day / stock combination. 
  - `41, 42, 43` are probably an embedding (PCA on returns?) of the different stocks. 
  - `44, 45` might be related to volumes? 
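
A minimal sketch of the grouping idea above, on toy data. The helper name `add_stock_id_sketch` is hypothetical and is not the `utility` code used in this notebook; it only assumes that `feature_{41..45}` are constant for a day / stock combination.

```python
import pandas as pd

# Columns assumed constant for a (day, instrument) pair, as described above.
ID_FEATURES = ['feature_41', 'feature_42', 'feature_43', 'feature_44', 'feature_45']

def add_stock_id_sketch(df):
    """Label rows that share the same date and identifying features (hypothetical helper)."""
    out = df.copy()
    # dropna=False keeps rows whose identifying features are NaN in a group of their own.
    out['stock_id'] = out.groupby(['date'] + ID_FEATURES, dropna=False).ngroup()
    return out

# Toy data: two instruments on day 0, one instrument on day 1.
toy = pd.DataFrame({
    'date':       [0, 0, 0, 0, 1, 1],
    'feature_41': [0.1, 0.1, 0.5, 0.5, 0.1, 0.1],
    'feature_42': [1.0, 1.0, 2.0, 2.0, 1.0, 1.0],
    'feature_43': [0.3, 0.3, 0.7, 0.7, 0.3, 0.3],
    'feature_44': [5.0, 5.0, 6.0, 6.0, 5.0, 5.0],
    'feature_45': [9.0, 9.0, 8.0, 8.0, 9.0, 9.0],
})
labelled = add_stock_id_sketch(toy)
print(labelled[['date', 'stock_id']])
```

Because the date is part of the key, the resulting id is only meaningful within a day: the same instrument gets a different id on another date.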

In [None]:
# Preparing the data
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 
import warnings
warnings.filterwarnings('ignore')


from importlib import reload  # imp is deprecated (removed in Python 3.12)
import utility

u = utility.Utility('/kaggle/input/jane-street-market-prediction/')
train = pd.read_csv(u.filepath_train(), nrows=int(1e6))
train = u.add_intraday_ts(train)
train_na = u.build_train_na(train)
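
The `utility` module is not shown in this notebook. As a rough guess at what the two helpers above might do, here is a sketch on toy data. Both function names and the definition of `intraday_ts` (position of `ts_id` within its date, rescaled to [0, 1]) are assumptions, not the actual implementation.

```python
import numpy as np
import pandas as pd

def add_intraday_ts_sketch(df):
    """Assumed definition: rescale ts_id within each date to the range [0, 1]."""
    out = df.copy()
    grouped = out.groupby('date')['ts_id']
    lo = grouped.transform('min')
    hi = grouped.transform('max')
    span = (hi - lo).replace(0, 1)  # avoid dividing by zero on single-row days
    out['intraday_ts'] = (out['ts_id'] - lo) / span
    return out

def build_train_na_sketch(df):
    """Assumed definition: 0/1 NaN indicator per feature, plus date and intraday_ts."""
    feature_cols = [c for c in df.columns if c.startswith('feature_')]
    na = df[feature_cols].isna().astype(int)
    na.insert(0, 'intraday_ts', df['intraday_ts'])
    na.insert(0, 'date', df['date'])
    return na

toy = pd.DataFrame({
    'date':      [0, 0, 0, 1, 1],
    'ts_id':     [0, 1, 2, 3, 4],
    'feature_0': [np.nan, 1.0, 2.0, np.nan, 3.0],
})
toy = add_intraday_ts_sketch(toy)
toy_na = build_train_na_sketch(toy)
print(toy_na)
```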

## Clusters and Intraday distributions of NaNs

Some notebooks have already shown that NaN values follow an intraday pattern: [NaN values depending on time of day](https://www.kaggle.com/tomwarrens/nan-values-depending-on-time-of-day). Others showed that NaNs cluster by feature (i.e. some features are NaN together): [NaN Cluster](https://www.kaggle.com/samir95/features-nan-clusters-and-pca).


I dug into those ideas to find the meaning of the first five tags. Here are the steps I took: 
- Cluster the features according to their "NaN similarity". (not shown here, just SVD + KMeans + Manual data viz) 
- For each cluster, plot the NaN indicator throughout a day.
- Make sense of the clusters of features with the `features.csv` file. 
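
The clustering step is not shown in this notebook. The sketch below illustrates the general idea (SVD followed by KMeans on the per-feature NaN indicators) on synthetic data, using scikit-learn. The toy matrix, the number of components and the number of clusters are all assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

n_rows = 200
steps = np.arange(n_rows)

# Toy NaN indicators: three features are NaN for the first 10 rows,
# three others are NaN for the first 60 rows.
short = (steps < 10).astype(float)
long_ = (steps < 60).astype(float)
na_matrix = np.column_stack([short, short, short, long_, long_, long_])

# One row per feature, so that features with similar NaN patterns are close.
X = na_matrix.T
embedding = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)
print(labels)
```

On real data the cluster count is not known in advance, which is why some manual inspection of the clusters was needed.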


The clusters are governed by the first 5 tags of the `features.csv` file. To show this, I grouped the features into 5 different groups:
- `tag_23` and `tag_0` true 
- `tag_23` and `tag_1` true
- `tag_23` and `tag_2` true
- `tag_23` and `tag_3` true
- `tag_23` and `tag_4` true

`tag_23` is used as a filter so that the underlying hidden features belong to the same category.
`tag_6` could work as well.
Or even no filtering other than `tag_{0, 1, 2, 3, 4}`.
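
As an illustration of how such a tag filter can be expressed, here is a sketch on a toy stand-in for `features.csv` (one row per feature, one boolean column per tag). The helper name `get_features_sketch` is hypothetical; the real `u.get_features` is not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for features.csv.
features = pd.DataFrame({
    'feature': ['feature_a', 'feature_b', 'feature_c', 'feature_d'],
    'tag_0':   [True, False, True, False],
    'tag_1':   [False, True, False, True],
    'tag_23':  [True, True, False, False],
})

def get_features_sketch(features, query):
    """Return the names of the features whose tags satisfy the boolean query."""
    return features.query(query)['feature'].tolist()

group_0 = get_features_sketch(features, 'tag_23 & tag_0')
group_1 = get_features_sketch(features, 'tag_23 & tag_1')
print(group_0, group_1)
```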


Below are the results for a randomly chosen date (12), first with `tag_6` as the filter and then with `tag_23`:

In [None]:
# Copy the slice so that adding the group columns below does not modify a view
train_na = train_na[train_na['date'] == 12].copy()

query = 'tag_6' 
groups_def = {
    'group_0': u.get_features(query + ' & tag_0'),
    'group_1': u.get_features(query + ' & tag_1'),
    'group_2': u.get_features(query + ' & tag_2'),
    'group_3': u.get_features(query + ' & tag_3'),
    'group_4': u.get_features(query + ' & tag_4'),
}
for g in groups_def.keys(): 
    train_na[g] = train_na[groups_def[g]].mean(axis=1)

fig, axs = plt.subplots(5, 1, sharex=True, sharey=True, figsize=(10, 5))
axs[0].set_title('NA Clustering tag_6 + {tag_0, tag_1, tag_2, tag_3, tag_4}')
for i, g in enumerate(groups_def.keys()): 
    axs[i].plot(train_na['intraday_ts'], train_na[g], label=g)
    axs[i].set_ylabel(g)
plt.xlabel('Intraday Time')
plt.show()


query = 'tag_23' 
groups_def = {
    'group_0': u.get_features(query + ' & tag_0'),
    'group_1': u.get_features(query + ' & tag_1'),
    'group_2': u.get_features(query + ' & tag_2'),
    'group_3': u.get_features(query + ' & tag_3'),
    'group_4': u.get_features(query + ' & tag_4'),
}
for g in groups_def.keys(): 
    train_na[g] = train_na[groups_def[g]].mean(axis=1)

fig, axs = plt.subplots(5, 1, sharex=True, sharey=True, figsize=(10, 5))
axs[0].set_title('NA Clustering tag_23 + {tag_0, tag_1, tag_2, tag_3, tag_4}')
for i, g in enumerate(groups_def.keys()): 
    axs[i].plot(train_na['intraday_ts'], train_na[g], label=g)
    axs[i].set_ylabel(g)
plt.xlabel('Intraday Time')
plt.show()

##### What do we observe?
From those graphs, we can see that NaNs occur twice a day: at the open and around "mid-day". Furthermore, the "duration" of the NaN window depends on the group.

These results hold for all dates.

##### How can we explain this?
Financial markets are scheduled. Some events occur at pre-arranged times: open and close of course, but others throughout the day. Some markets have mid-day auctions, some markets even have a lunch break.

It seems that the "open event" and the "lunch event" cause an underlying metric to become NaN for a short period of time. For example, if the underlying metric is a price, there is no price during an auction. This NaN then persists in our dataset features if the feature construction uses a rolling window.

If we call $f$ our feature (from the dataset) and $u$ an underlying metric, not shown in our dataset, we have the convolution with a weight $w$:
$$ f(t) = \sum_{s > t - \tau} w(t-s) u(s)$$
If there is a NaN in $u(s)$ between $t-\tau$ and $t$, there will be a NaN in $f(t)$. 

This explains both the time distribution of the NaNs and why features cluster by NaN pattern.
The length of a NaN period will then be roughly $\tau$ (plus the duration of the gap in $u$ itself).
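
The propagation argument can be checked on synthetic data. In the sketch below (toy series, arbitrary window sizes), a single missing value in the underlying metric produces a run of NaNs in the rolling sum whose length equals the window size.

```python
import numpy as np
import pandas as pd

# Toy underlying metric with one missing observation (e.g. during an auction).
u = pd.Series(np.arange(30, dtype=float))
u.iloc[10] = np.nan

nan_run = {}
for window in (2, 5, 10):
    # min_periods defaults to the window size, so any NaN in the window gives NaN.
    f = u.rolling(window).sum()
    # Ignore the first window-1 rows, which are NaN only because the window is not full yet.
    nan_run[window] = int(f.iloc[window - 1:].isna().sum())

print(nan_run)  # {2: 2, 5: 5, 10: 10}
```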


##### What are $w$, $\tau$ and $u$?

$w$ can be anything: the constant value 1 gives a plain rolling sum (a rolling average once normalized), an exponential function gives an exponential moving average, etc.
$\tau$ is the characteristic time of the rolling mechanism.

$u$ is an underlying metric not shown in the dataset. It can be for example the price of the stock, or the number of trades. 

A feature (in our dataset) is defined by this rolling mechanism on an underlying metric. The underlying metric will be defined by other tags of the features. 
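
To make the role of $w$ concrete, here is a small sketch of the convolution above for two choices of weights: a flat window and an exponentially decaying one. The function name, the toy input and the decay rate are illustrative assumptions; nothing here is claimed to be how the actual features were built.

```python
import numpy as np
import pandas as pd

def causal_convolution(u, w):
    """f[t] = sum_k w[k] * u[t-k] for k = 0..len(w)-1, NaN until the window is full."""
    u = np.asarray(u, dtype=float)
    tau = len(w)
    f = np.full(len(u), np.nan)
    for t in range(tau - 1, len(u)):
        window = u[t - tau + 1: t + 1][::-1]  # window[k] == u[t-k]
        f[t] = np.dot(w, window)
    return f

u = np.arange(10, dtype=float)  # toy underlying metric
tau = 4

flat = np.ones(tau) / tau                 # plain rolling average
decay = np.exp(-np.arange(tau) / 2.0)     # more weight on recent values
decay /= decay.sum()

f_mean = causal_convolution(u, flat)
f_exp = causal_convolution(u, decay)

# The flat weights reproduce the pandas rolling mean.
reference = pd.Series(u).rolling(tau).mean().to_numpy()
print(np.allclose(f_mean[tau - 1:], reference[tau - 1:]))
```

Since the toy input is increasing, the exponentially weighted version sits above the flat average: it reacts faster to recent values.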

See the table below for an example. The underlying (hidden) metric is defined by "tag_23 and tag_15 and tag_26". 
For the sake of the example, one could imagine the meaning:  
- "tag_23" -> number of trades
- "tag_15" -> buyer
- "tag_26" -> trade quantity > 100

Then to complete the definition of the feature, you add one of the tags (0, 1, 2, 3, 4, 5). 

Therefore, we would have: 
- `feature_108` is the rolling sum over the last 30 min of the number of trades made by a buyer on this exchange with size larger than 100.
- `feature_109` is the rolling sum over the last 20 sec of the number of trades made by a buyer on this exchange with size larger than 100.
- etc

Of course, those values are then re-normalized. 

In [None]:
u.select_tags('tag_23 & tag_15 & tag_26')

## Characteristic Time $\tau$ of the rolling metrics

Now that we know that the tags refer to those rolling windows, we can try to quantify the characteristic time $\tau$ for each tag (0, 1, 2, 3, 4).

We keep the same grouping of features. Then, for each day, we look only at the mid-day NaN period and compute its "length" (as a % of the trading day).

In [None]:
train_na = u.build_train_na(train)
query = 'tag_23' 
groups = ['group_0', 'group_1', 'group_2', 'group_3', 'group_4']
groups_def = {
    'group_0': u.get_features(query + ' & tag_0'),
    'group_1': u.get_features(query + ' & tag_1'),
    'group_2': u.get_features(query + ' & tag_2'),
    'group_3': u.get_features(query + ' & tag_3'),
    'group_4': u.get_features(query + ' & tag_4'),
}
for g in groups_def.keys(): 
    train_na[g] = train_na[groups_def[g]].mean(axis=1)

train_na = train_na[train_na['intraday_ts'] > 0.4][['date', 'intraday_ts'] + groups]
def bound_ts_nan(df, g): 
    x =  df[df[g] == 1]
    return x['intraday_ts'].min(), x['intraday_ts'].max()

na_summary = []
for date in train_na['date'].unique():
    train_na_date = train_na[(train_na['date'] == date)]
    this_date_data = {'date': date}
    for g in groups: 
        bounds = bound_ts_nan(train_na_date, g)
        this_date_data[f'{g}_min'] = bounds[0]
        this_date_data[f'{g}_max'] = bounds[1]
    na_summary.append(this_date_data)
na_summary = pd.DataFrame(na_summary)
for g in groups: 
    na_summary[f'{g}_size'] = na_summary[f'{g}_max'] - na_summary[f'{g}_min'] 
    
plt.figure(figsize=(10, 5))
for g in groups: 
    plt.plot(na_summary['date'], na_summary[f'{g}_size'] * 100, label=g)
plt.legend()
plt.ylabel('% Time of day')
plt.xlabel('Date')
plt.title('Lag Size')
plt.show()

By averaging over those dates, we find: 

In [None]:
for g in groups: 
    mean_size = (na_summary[f'{g}_size'] * 100).mean()
    print(f'{g} \t % of day lag: {mean_size: .2f}')

## More evidence 

To further support my claim, I plot one such underlying variable under the different time aggregations, for a given stock on a given day.
You can see the smoothing effect of the tags 0->4. 

See my code to understand how I am able to isolate one financial instrument. I will try to post a notebook on this later. 

In [None]:
train_date = train[train['date'] == 12]
stocks = u.add_stock_id(train_date)

# Pick the most frequent stock
stock_id = stocks.groupby('stock_id').agg({'ts_id': 'count'}).sort_values('ts_id', ascending=False).index[0]
stock = stocks[stocks['stock_id'] == stock_id]

# Features to plot: 
features_to_plot = u.get_features('tag_23 & tag_15 & tag_26')[:5]
tags = ['tag_4', 'tag_0', 'tag_3', 'tag_2', 'tag_1']
plt.figure(figsize=(10, 5))
for f, t in zip(features_to_plot, tags): 
    plt.scatter(
        stock['intraday_ts'], 
        stock[f], 
        label=f'{t}: {f}', 
        marker='.',
    )
plt.legend()
plt.xlabel('Time of day')
plt.show()
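
The smoothing effect mentioned above can be reproduced on synthetic data. The sketch below uses a made-up noisy count series and arbitrary window sizes (not the real characteristic times), and measures how the step-to-step variation of the rolling average shrinks as the window grows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic underlying metric: noisy per-step trade counts (purely illustrative).
u = pd.Series(rng.poisson(lam=5.0, size=2000).astype(float))

roughness = {}
for window in (5, 20, 80):
    f = u.rolling(window).mean().dropna()
    # Standard deviation of step-to-step changes as a simple roughness measure.
    roughness[window] = float(f.diff().std())

print(roughness)
```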

## And tag_5?

`tag_5` appears for some underlying features, but not all. In short, it appears for features related to `tag_23`.
It seems to play a similar role to tags 0->4, yet it is slightly different. For me, `tag_5` means a cumulative sum of an underlying variable.

It can again be written as a convolution, with $w$ non-zero over the whole day so far (for a plain cumulative sum, $w = 1$).

$$ f(t) = \sum_{s<t} w(t-s) u(s)$$
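
A minimal sketch of such a cumulative feature on toy data, assuming the sum restarts every day for each instrument. The column names `stock_id` and `u` are hypothetical, and the current observation is included in the sum here.

```python
import pandas as pd

toy = pd.DataFrame({
    'date':     [0, 0, 0, 0, 1, 1, 1],
    'stock_id': [7, 7, 7, 7, 7, 7, 7],
    'u':        [2, 0, 3, 1, 4, 1, 0],   # e.g. number of trades per time step
})

# Cumulative sum within each (date, stock) pair: it resets at the start of a new day.
toy['cumulative'] = toy.groupby(['date', 'stock_id'])['u'].cumsum()
print(toy)
```

For a non-negative underlying metric, such a feature can only increase during the day, which is one way to recognize it in a plot.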

On the same example as before: 

In [None]:
train_date = train[train['date'] == 12]
stocks = u.add_stock_id(train_date)

# Pick the most frequent stock
stock_id = stocks.groupby('stock_id').agg({'ts_id': 'count'}).sort_values('ts_id', ascending=False).index[0]
stock = stocks[stocks['stock_id'] == stock_id]

# Features to plot: 
f = u.get_features('tag_23 & tag_15 & tag_26')[5]
t = 'tag_5'
plt.figure(figsize=(10, 5))
plt.scatter(
    stock['intraday_ts'], 
    stock[f], 
    label=f'{t}: {f}', 
    marker='.',
)
plt.legend()
plt.xlabel('Time of day')
plt.show()

## Conclusion

I hope there is enough evidence to show that the first five tags refer to time aggregation methods with different characteristic times.

If others agree with the conclusions of this notebook, I will show more de-anonymization in another notebook. My current hypotheses are:
- `tag_14` is a stock embedding, so is `tag_18`
- `tag_6` is a price
- `tag_23` is a volume / number of trades / quantity
- `tag_20` is a spread


Let me know your thoughts!