# De-anonymization: Buy/Sell/Net/Gross

### Introduction


##### Previously
This is my 4th, and likely last, notebook on this competition. Here are the links to the first three: 
1. Time Aggregation tag `tag_{0, 1, 2, 3, 4, 5}` ([Notebook](https://www.kaggle.com/gregorycalvez/de-anonymization-time-aggregation-tags))
2. Price, Stock and Quantity `tag_{6, 14, 23}` ([Notebook](https://www.kaggle.com/gregorycalvez/de-anonymization-price-quantity-stocks))
3. Min, Max and Time `tag_{12, 13, 22}` ([Notebook](https://www.kaggle.com/gregorycalvez/de-anonymization-min-max-and-time))

##### Results
In this notebook, we will focus on the tags `tag_{24, 25, 26, 27}` and show that they likely correspond to: 
- `tag_24`: Net (i.e. Buy - Sell)
- `tag_25`: Buy
- `tag_26`: Sell
- `tag_27`: Gross (i.e. Buy + Sell)

Then, we'll also look quickly into `tag_19`, `tag_20` and `tag_28`.   

In [None]:
### Imports and data loading 

import pandas as pd
import datatable as dt 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 
import plotly_express as px
import warnings
warnings.filterwarnings('ignore')


from importlib import reload  # `imp` is deprecated and was removed in Python 3.12
import janest_utility as utility
reload(utility)

u = utility.Utility('/kaggle/input/jane-street-market-prediction/')
train_datatable = dt.fread(u.filepath_train())
train = train_datatable.to_pandas()
train = u.add_intraday_ts(train)
train = u.add_stock_id_all(train)
n_trades = u.get_n_trades(train)
train = u.add_feature_0(train)

# .copy() so that columns can be added later without touching `train`
train_date = train[train['date'] == 299].copy()
stock = train[(train['date'] == 299) & (train['stock_id'] == '5')].copy()

# Buy / Sell / Net / Gross

I mentioned in other notebooks that `tag_23` is a "quantity or volume" tag. 

Each feature associated with `tag_23` is built from the following combination of tags: 
- `tag_23`
- One of `tag_{0, 1, 2, 3, 4, 5}`
- One of `tag_{24, 25, 26, 27}`
- One of `tag_{15, 17}`

As mentioned earlier, `tag_{0, 1, 2, 3, 4, 5}` is a time-aggregation tag. For this study, we will use the `tag_0` features: `tag_0` is a rolling window with a very short characteristic time, so these features are almost "noiseless" and well suited to studying `tag_{24, 25, 26, 27}`. 

For now, we will only use `tag_15` features. We will look into the difference of `tag_15` and `tag_17` later. 
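The `u.get_features` helper used in the next cell comes from my utility module and is not reproduced here. As a minimal sketch of what such a selection could look like, assume a boolean tag table with one row per feature and one column per tag. The toy table and the name `features_with_tags` are illustrative only, not the actual helper.

```python
import pandas as pd

# Toy tag table (hypothetical values): True means the feature carries the tag.
tags = pd.DataFrame(
    {
        "tag_0": [True, True, False],
        "tag_15": [True, False, True],
        "tag_23": [True, True, True],
    },
    index=["feature_a", "feature_b", "feature_c"],
)


def features_with_tags(tag_table, required):
    """Return the features that carry every tag listed in `required`."""
    mask = tag_table[list(required)].all(axis=1)
    return tag_table.index[mask].tolist()


selected = features_with_tags(tags, ["tag_23", "tag_0", "tag_15"])
print(selected)
```

Requiring all tags at once is what the `&` in `'tag_23 & tag_0 & tag_15'` expresses.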

In [None]:
fs = u.get_features('tag_23 & tag_0 & tag_15')
u.select_tags('tag_23 & tag_0 & tag_15').astype(int)

##### A lower bound on `tag_{25, 26, 27}` and a center on `tag_24`
Here we use all the `tag_15` features but the same applies for the equivalent `tag_17` features. 

As seen below on the graphs: 
- `feature_73` is centered around 0 and "looks like" a normal distribution 
- `feature_85` and `feature_97` have a lower bound and "look like" exponential distributions
- `feature_109` has a lower bound and "looks like" a Gamma distribution. 

But the interesting part is that the features seem to share some universal constants, as shown by the `value_counts`. On that day, 114 trades have exactly the same `feature_{73, 85, 97, 109}` values. 

In [None]:
fig, axs = plt.subplots(1, 4, figsize=(20, 4))
for ax, f in zip(axs, fs): 
    sns.histplot(train_date[f], kde=False, ax=ax)  # distplot is deprecated in recent seaborn
    ax.set_title(f'Distribution of {f}')
plt.show()

display(train_date[fs].value_counts().head())

What are those values? The answer is in the title of this section. For `feature_{85, 97, 109}`, the "magic numbers" are the minimum of the feature. For `feature_73`, it is the center of the distribution. 
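For a lower-bounded feature, this claim can be checked by comparing the most frequent value with the minimum. Here is a small sketch on a made-up column (the numbers are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy column (hypothetical values): the lower bound -1.5 repeats many times.
values = pd.Series([-1.5, -1.5, -1.5, -0.7, 0.2, -1.5, 1.3, -0.7])

most_common = values.value_counts().idxmax()     # most frequent value
share_at_min = (values == values.min()).mean()   # fraction of rows at the minimum

print(most_common, values.min(), share_at_min)
```

On the real data, the same two lines can be run on `train_date['feature_85']` and the other lower-bounded features.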


##### Some pretty plots and a dim-2 manifold
To understand the link between those features, I scatter-plotted them. I also colored the points which had any of the magic numbers (for `feature_{85, 97}`). 

I did the plots using a full day of data (left) and using a specific stock / day (right). 

In [None]:
minimums = {f: train[f].min() for f in fs}
# Coloring: which of feature_85 / feature_97 sit at their global minimum
train_date['category'] = np.where(
    (train_date['feature_97'] == minimums['feature_97']) & (train_date['feature_85'] == minimums['feature_85']), 'black',    # Both 85 and 97 are at their minimum
    np.where((train_date['feature_85'] == minimums['feature_85']), 'purple',                                                 # Only 85 is at its minimum
             np.where((train_date['feature_97'] == minimums['feature_97']), 'red',                                           # Only 97 is at its minimum
                      'pink'                                                                                                 # Neither is at its minimum
                     )
            )
)
### A single stock
stock = train_date[(train_date['stock_id'] == '5')]
# Limit the data to complete grid
stock_lim = stock[
    (np.abs(stock[fs[0]]) < 2)
    & (stock[fs[3]] < 2)
]

# Plots
fig, axs = plt.subplots(3, 2, sharex=True, figsize=(20, 20))
axs[0, 0].scatter(
    train_date[fs[0]], 
    train_date[fs[3]], 
    marker='+', 
    c=train_date['category'], 
)
axs[0, 0].set_ylim((-5, 13))
axs[0, 0].set_ylabel(fs[3])

axs[1, 0].scatter(
    train_date[fs[0]], 
    train_date[fs[1]], 
    marker='+', 
    c=train_date['category'], 
)
axs[1, 0].set_ylim((-2, 15))
axs[1, 0].set_ylabel(fs[1])

axs[2, 0].scatter(
    train_date[fs[0]], 
    train_date[fs[2]], 
    marker='.', 
    c=train_date['category'], 
)
axs[2, 0].set_ylim((-2, 15))
axs[2, 0].set_ylabel(fs[2])


axs[2, 0].set_xlabel('feature_73')
axs[2, 0].set_xlim((-10, 10))


axs[0, 0].set_title('Relations between feature_{73, 85, 97, 109}')


# Plots
axs[0, 1].scatter(
    stock_lim[fs[0]], 
    stock_lim[fs[3]], 
    marker='+', 
    c=stock_lim['category'], 
)
axs[0, 1].set_ylim((-5, 2))
axs[0, 1].set_ylabel(fs[3])

axs[1, 1].scatter(
    stock_lim[fs[0]], 
    stock_lim[fs[1]], 
    marker='+', 
    c=stock_lim['category'], 
)
axs[1, 1].set_ylim((-2, 3))
axs[1, 1].set_ylabel(fs[1])

axs[2, 1].scatter(
    stock_lim[fs[0]], 
    stock_lim[fs[2]], 
    marker='.', 
    c=stock_lim['category'], 
)
axs[2, 1].set_ylim((-2, 3))
axs[2, 1].set_ylabel(fs[2])


axs[2, 1].set_xlabel('feature_73')
axs[2, 1].set_xlim((-2, 2))

axs[0, 1].set_title('For one stock')

plt.show()

##### What do we see?
1. I hope the graphs above help show the importance of the minimums of `feature_{85, 97}`. 
2. Furthermore, on this stock (chosen carefully), the values of the features are aligned on a grid: they are discrete. 
3. For each stock, those 4 features are strongly related to each other. In mathematical terms, if you define a point by its coordinates on the 4 features, the set of trades lies on a 2-dimensional manifold. 

I also used a lot of 3D plots, but those render poorly in the notebook, so I did not reproduce them. 

**In short, only two parameters are enough to fully give `feature_{73, 85, 97, 109}`**
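To illustrate what "two parameters are enough" means, here is a sketch on synthetic data where the four coordinates are built from two free quantities. I assume small non-negative integers and no normalization, so the surface is a flat plane and a simple rank computation reveals its dimension. On the normalized features the surface is curved, so this exact test would not apply as is.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, un-normalized quantities (assumption: small non-negative integers).
buy = rng.integers(0, 6, size=500)
sell = rng.integers(0, 6, size=500)
points = np.column_stack([buy - sell, buy, sell, buy + sell]).astype(float)

# Center the cloud, then count the independent directions that carry variance.
centered = points - points.mean(axis=0)
rank = np.linalg.matrix_rank(centered)
print(rank)
```

Four coordinates but only two independent directions: that is the 2-dimensional structure the scatter plots hint at.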

### Straighten up the plot - Invert the host's normalization


##### Host's normalization
To obfuscate the data and to help us, the host normalized all the features. In mathematical terms, for each feature available to us, say `feature_x`, we have: 
$$ \text{feature_}x = \phi_x(\text{original_feature_}x) $$
where $\phi_x$ is the normalization function of `feature_x`. I expect those functions $\phi$ to be the usual sklearn scalers (e.g. [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)).

##### And our four features?
For the four features in question, for the stock used in the plots above, the values of the features are definitely discrete. 

So I imagined that the "original features" in question simply take the values 0, 1, 2, 3, 4, etc. (the natural numbers, the simplest set of discrete values). That would explain both why the features we have are discrete and why three of them are lower-bounded (`original_feature_73` being in $\mathbb Z$ and `original_feature_{85, 97, 109}` being in $\mathbb N$). 

With this assumption, and assuming $\phi_x$ is increasing, it's easy to find and apply $\phi_x^{-1}$ to our features: the k-th smallest observed value maps back to the k-th integer. The code below does this and shows the plots once again. 

The idea is that the normalization makes the relation complicated to "see". I hoped that going back to the "original space" will help find the relations. And it indeed helped. 
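Here is a toy version of that inversion. The function `phi` below is a made-up increasing function standing in for the host's unknown normalization; the point is only that ranking the distinct observed values recovers the integers without knowing `phi`.

```python
import numpy as np


# Hypothetical normalization: any strictly increasing function would do.
def phi(x):
    return np.arcsinh(0.8 * x) - 0.3


original = np.array([0, 3, 1, 1, 2, 0, 4, 2])   # assumed integer quantities
observed = phi(original)                         # what we would see in the data

# Rank the distinct observed values: the smallest becomes 0, the next 1, ...
levels, recovered = np.unique(observed, return_inverse=True)
print(recovered)
```

Two caveats: this only works if every integer level is actually observed (hence the restriction to a "complete grid" for the chosen stock), and the result is known only up to an additive offset since the smallest observed value is always mapped to 0.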

In [None]:
### Phi^{-1}: map each distinct feature value to its rank (0, 1, 2, ...)
for f in fs: 
    map_f = {v: i for i, v in enumerate(np.sort(stock_lim[f].unique()))}
    stock_lim[f'map_{f}'] = stock_lim[f].map(map_f)

### Plots
fig, axs = plt.subplots(3, 2, figsize=(20, 20))
# Not straight
axs[0, 0].scatter(
    stock_lim[fs[0]], 
    stock_lim[fs[3]], 
    marker='+', 
    c=stock_lim['category'], 
)
axs[0, 0].set_ylim((-5, 2))
axs[0, 0].set_ylabel(fs[3])
axs[1, 0].scatter(
    stock_lim[fs[0]], 
    stock_lim[fs[1]], 
    marker='+', 
    c=stock_lim['category'], 
)
axs[1, 0].set_ylim((-2, 3))
axs[1, 0].set_ylabel(fs[1])
axs[2, 0].scatter(
    stock_lim[fs[0]], 
    stock_lim[fs[2]], 
    marker='.', 
    c=stock_lim['category'], 
)
axs[2, 0].set_ylim((-2, 3))
axs[2, 0].set_ylabel(fs[2])
axs[2, 0].set_xlabel('feature_73')
axs[0, 0].set_xlim((-2, 2))
axs[1, 0].set_xlim((-2, 2))
axs[2, 0].set_xlim((-2, 2))
axs[0, 0].set_title('Plots in feature space')
### Straight plots
axs[0, 1].scatter(
    stock_lim[f'map_{fs[0]}'], 
    stock_lim[f'map_{fs[3]}'], 
    marker='+', 
    c=stock_lim['category'], 
)
axs[0, 1].set_ylabel(f'map_{fs[3]}')
axs[1, 1].scatter(
    stock_lim[f'map_{fs[0]}'], 
    stock_lim[f'map_{fs[1]}'], 
    marker='+', 
    c=stock_lim['category'], 
)
axs[1, 1].set_ylabel(f'map_{fs[1]}')

axs[2, 1].scatter(
    stock_lim[f'map_{fs[0]}'], 
    stock_lim[f'map_{fs[2]}'], 
    marker='.', 
    c=stock_lim['category'], 
)
axs[2, 1].set_ylabel(f'map_{fs[2]}')

axs[2, 1].set_xlabel(f'map_{fs[0]}')

axs[0, 1].set_title('Plots in original space')

plt.show()

##### 2-dimension manifold

I said earlier that the points with the four features as coordinates lie on a 2D manifold. That means there exist two mathematical relations between the four features. 

Both relations can be checked perfectly on the original-space values (`map_feature_x` in the code). They are: 
* `feature_73` = `feature_85` - `feature_97`
* `feature_109` = `feature_85` + `feature_97`


For anyone familiar with finance, those equations ring a bell. They are exactly what you would have if: 
- `feature_73` is a Net quantity
- `feature_109` is a Gross quantity
- `feature_85` is a Buy quantity
- `feature_97` is a Sell quantity
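One practical detail when checking these equalities: as far as I can tell, the rank mapping sends the smallest value of every feature to 0, including the signed Net feature, so the relations can only hold up to a constant offset. A sketch of the check on synthetic buy/sell quantities (made-up data, with `to_rank` standing in for the mapping):

```python
import numpy as np

rng = np.random.default_rng(1)
buy = rng.integers(0, 5, size=200)
sell = rng.integers(0, 5, size=200)


# Mimic the rank mapping: each column is shifted so that its minimum is 0.
def to_rank(x):
    return x - x.min()


net, gross = to_rank(buy - sell), to_rank(buy + sell)
buy_r, sell_r = to_rank(buy), to_rank(sell)

# If the relations hold, both residuals are constant across all trades.
net_residual = net - (buy_r - sell_r)
gross_residual = gross - (buy_r + sell_r)
print(np.unique(net_residual), np.unique(gross_residual))
```

A single distinct residual value per relation is the signature to look for on the `map_feature_x` columns.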


### Some comments

##### Round lots
Those plots can be reproduced only for some very specific stocks in the dataset. Why is that and why is it still consistent with my conclusion? 

`tag_23` features are quantities. In financial markets, the quantity traded for a given asset cannot be arbitrary. Often, it must be a "round lot" (i.e. a multiple of 100) or an integer. ([Round Lot](https://www.investopedia.com/terms/r/roundlot.asp))\
But in practice, if the stock is cheap enough, the usual quantities traded are in the thousands or more, so the quantity appears continuous. 
However, for some specific assets, the price is so large that the trade quantities are usually integers below 10 (e.g. [Lindt](https://uk.finance.yahoo.com/quote/lisn.sw/)). 

##### `tag_15` v. `tag_17`
I have no proof of this, but my assumption for `tag_15` v. `tag_17` is that one of them is a quantity in number of shares and the other is a quantity in notional (i.e. price times number of shares). 

Extreme assets like Lindt would explain why having those two variants of the same variable is important. 
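To make the distinction concrete, here is a toy computation with made-up prices and sizes (none of these numbers come from the dataset). The share count and the notional tell very different stories for a cheap and an expensive stock.

```python
# Hypothetical prices and order sizes, chosen only to illustrate the scale gap.
trades = [
    {"name": "cheap_stock", "price": 2.0, "shares": 5000},
    {"name": "expensive_stock", "price": 80000.0, "shares": 2},
]

for trade in trades:
    # Notional is the cash value of the trade: price times number of shares.
    trade["notional"] = trade["price"] * trade["shares"]
    print(trade["name"], trade["shares"], trade["notional"])
```

In shares, the expensive stock looks tiny; in notional, it is the larger trade.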

See the plot below showing that the `tag_15` features and `tag_17` features are definitely related. But the normalization done by the host makes it impossible to understand the relation. 

In [None]:
plt.figure(figsize=(15, 8))
for date, stock_id in n_trades.head(10)[['date', 'stock_id']].values: 
    # One stock / day, without the trades sitting at the minimum of feature_109
    stock = train[
        (train['date'] == date)
        & (train['stock_id'] == stock_id)
        & (train['feature_109'] != minimums['feature_109'])
    ].sort_values('feature_112')
    plt.plot(
        stock['feature_112'], 
        stock['feature_118'],
        marker='.', 
        label=f'Date: {date}, Stock ID: {stock_id}'
    )
plt.legend()
plt.xlabel('Tag_15 feature')
plt.ylabel('Tag_17 feature')
plt.show()

### Retrieving $\phi$

For those four features, we can then plot the normalization function $\phi$. 
These normalization functions definitely depend on the stock (e.g. via the average daily traded volume). So the parameters differ from stock to stock, but the "shape" of the normalization functions is the same. 

Below is a plot of what those normalization functions look like. 

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(15, 10))
for i, f in enumerate(fs): 
    ix = i // 2
    iy = i % 2
    axs[ix, iy].scatter(
        stock_lim[f'map_{f}'], 
        stock_lim[f],
        marker='+'
    )
    axs[ix, iy].set_xlabel('Mapped')
    axs[ix, iy].set_ylabel('Raw')
    axs[ix, iy].set_title(f)
plt.show()

##### Fit $\phi$, hyperbolic functions?
- `feature_73`: the normalization function is some kind of [arcsinh](https://reference.wolfram.com/language/ref/ArcSinh.html) function
- `feature_{85, 97, 109}`: the normalization functions seem to be of the same form (with different params) for those three features. Maybe an [arccosh](https://reference.wolfram.com/language/ref/ArcCosh.html). 

Of course, I tried to fit them. But unfortunately, I did not manage to get a 0 error. My hope was that $\phi$ was dependent on the stock and that those normalizations would help us map the stocks from one day to another. 

I think I did not manage to fit a perfect function on those points partly because there is a discontinuity around 0. The $\phi$ is probably defined piece-wise. I won't expose my failed fitting attempts here. 
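For readers who want to try, here is a minimal sketch of one way to fit an arcsinh-shaped curve. It is not one of my attempts. It assumes the parametrization $y = a \cdot \text{arcsinh}(c \cdot x) + b$ (my guess at a plausible family) and runs on synthetic points generated from that same family, so it says nothing about whether the real $\phi$ fits.

```python
import numpy as np


def fit_arcsinh(x, y, scales):
    """Fit y ~ a * arcsinh(c * x) + b: grid over c, least squares for (a, b)."""
    best = None
    for c in scales:
        design = np.column_stack([np.arcsinh(c * x), np.ones_like(x)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        error = np.sum((design @ coef - y) ** 2)
        if best is None or error < best[0]:
            best = (error, coef[0], coef[1], c)
    return best


# Synthetic data generated from a known arcsinh curve (hypothetical parameters).
x = np.arange(-10, 11, dtype=float)
y = 0.6 * np.arcsinh(0.5 * x) + 0.1

error, a, b, c = fit_arcsinh(x, y, scales=np.linspace(0.1, 2.0, 20))
print(error, a, b, c)
```

For a fixed `c` the model is linear in `a` and `b`, which is why a grid over `c` plus ordinary least squares is enough here. A piece-wise $\phi$ would need one such fit per piece.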

# Quick look at `tag_{19, 20, 28}`

### A graph

In [None]:
stock = train[(train['date'] == 299) & (train['stock_id'] == '5')].copy()  # copy: a column is added below
stock['feature_51_cat'] = np.where(stock['feature_51'] < -1.2, 'orange', 'blue')
fig, axs = plt.subplots(3, 1, sharex=True, figsize=(20, 10))
axs[0].scatter(
    stock['intraday_ts'], 
    stock['feature_51'],
    marker='+', 
)
axs[0].set_ylabel('feature_51')
axs[0].set_title('feature_51')
axs[1].scatter(
    stock[stock['feature_51_cat'] == 'orange']['intraday_ts'], 
    stock[stock['feature_51_cat'] == 'orange']['feature_51'],
    marker='+', 
    color='orange'
)
axs[1].set_title('feature_51, lower band')
axs[1].set_ylabel('feature_51')
axs[2].scatter(
    stock['intraday_ts'], 
    stock['feature_3_x0'],
    marker='+', 
    color='red'
)
axs[2].set_ylabel('(Normalized) Price')
axs[2].set_title('Price (feature_3 x feature_0)')
axs[2].set_xlabel('Time')
plt.show()

### What do we see? 

For that stock, for that day, `feature_51` shows a very singular pattern. It has bands. 
I think that the bands are so visible on this stock thanks to its characteristics (i.e. a large share price). 

Within each band, `feature_51` is strongly negatively correlated with the price of this asset, as you can see by comparing with the price plot. 
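The per-band correlation can be quantified rather than eyeballed. Below is a sketch on synthetic data: a random-walk price and a banded feature that I constructed to fall when the price rises. On the real data, the band label would have to be derived from thresholds on `feature_51`, as done for the lower band in the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300

# Synthetic stand-ins: a random-walk price and a banded feature that falls
# when the price rises (assumption made only for this illustration).
price = np.cumsum(rng.normal(size=n))
band = rng.integers(0, 3, size=n)   # which band each trade sits in
feature = 2.0 * band - 0.1 * price + rng.normal(scale=0.01, size=n)

toy = pd.DataFrame({"price": price, "band": band, "feature": feature})

# Correlation between feature and price, computed separately inside each band.
per_band = {b: g["feature"].corr(g["price"]) for b, g in toy.groupby("band")}
print(per_band)
```

Computing the correlation on all trades at once would blur the effect, because the jumps between bands dominate the variance of the feature.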

The same happens for most of the features among `tag_{19, 20, 28}`. Some notebooks already showed a relation between `tag_19` and the `weight` values, meaning that `tag_19` could refer to the liquidity available (i.e. size at the first limit) of the stock. 

# Conclusion

### On this notebook
This notebook was harder to write than the others, mainly because the main idea relies on four variables being related by two equations, which is hard to visualize. I again resorted to math-like formalization to explain it; I hope it makes sense. 

Furthermore, I played so much with the data that I might just be seeing relations that do not actually exist. So any challenging comment is welcome. 


### Summary of de-anonymization
Here's a quick summary of the de-anonymization results. This will likely be the last notebook in that series, I will probably never know how right/wrong those results were... Let me know your thoughts. \
In any case, the process of digging into that data was a lot of fun. 


- `tag_{0, 1, 2, 3, 4, 5}`: time aggregation tag
- `tag_6`: price
- `tag_{7, 8, 9, 10}`: ??? 
- `tag_{12, 13}`: min / max
- `tag_14`: stock embedding with price
- `tag_{15, 17}`: number of shares, notional
- `tag_16`: ??? 
- `tag_18`: Average daily volumes
- `tag_{19, 20, 28}`: ??? 
- `tag_{23}`: Quantity traded
- `tag_{24, 25, 26, 27}`: Net, Buy, Sell, Gross