# De-anonymization: Price, Quantity and Stocks

### Introduction

In a previous notebook, [Time aggregation tags](https://www.kaggle.com/gregorycalvez/de-anonymization-time-aggregation-tags), I showed the meaning of `tag_{0, 1, 2, 3, 4, 5}`. These are the "time-aggregation tags"; see the table below.


| Tag      | Aggregation Type | Characteristic Time (% of day) | Characteristic Time (approx.) |
| ----------- | ----------- | ----------- | ----------- |
| `tag_0`      | Rolling Sum/Average       | 0.03 | ~ 20 sec
| `tag_1`      | Rolling Sum/Average       | 0.27 | ~ 3 min
| `tag_2`      | Rolling Sum/Average       | 0.59 | ~ 5 min
| `tag_3`      | Rolling Sum/Average       | 1.33 | ~ 10 min
| `tag_4`      | Rolling Sum/Average       | 4.55 | ~ 30 min
| `tag_5`      | Cumulative throughout the day        | None | None


We continue the study by showing that it is possible to identify trades on the same financial instrument in the same day. 
In this notebook I will use the word "stock" instead of "financial instrument", although I am not sure that we are actually dealing with stocks (it could be options, ETFs, or something else entirely).

Here we try to show that: 
- Trades for a stock / day combination can be isolated 
- `tag_14` and `tag_18` might be embeddings of the stocks
- `tag_6` marks a price metric
- `tag_23` marks a volume-like metric
- `tag_20` might be a spread

# Stock Identification

### How `tag_5` gives the stocks

<span style="color:blue">**Remark:** this section is not strictly needed, as there is another (easier) way of identifying stocks. It does, however, explain how I found the easier way. Feel free to jump to the next section for the "easier" stock identification.</span>

In the previous notebook, `tag_5` appeared to be a time-aggregation tag. `tag_5` is always associated with `tag_23`. 

Furthermore, `tag_23` seems to be "modulated" by other tags. A feature that has `tag_23` will always be of the form: 
```feature_x = tag_{0, 1, 2, 3, 4, 5} x tag_{15, 17} x tag_23 x tag_{24, 25, 26, 27}```
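
To make this tag pattern concrete, here is a minimal sketch of how features can be selected from a boolean tag table with a tag expression. The table below is a made-up miniature of `features.csv` (the feature names and the helper `select_features` are hypothetical); it is not the `utility` module used in this notebook.

```python
import pandas as pd

# Hypothetical miniature of features.csv: one row per feature, one boolean column per tag.
tags = pd.DataFrame(
    {
        "tag_5":  [True,  True,  False, False],
        "tag_23": [True,  True,  True,  False],
        "tag_26": [True,  False, False, False],
    },
    index=["feature_a", "feature_b", "feature_c", "feature_d"],
)

def select_features(tag_table: pd.DataFrame, expr: str) -> list:
    """Return the names of the features whose tags satisfy a boolean expression."""
    return tag_table.query(expr).index.tolist()

# Features that are both cumulative over the day (tag_5) and volume-like (tag_23).
cumulative_volume_like = select_features(tags, "tag_5 & tag_23")
print(cumulative_volume_like)
```

The same kind of expression (`&`, `|`, `~`) is what the `u.select_tags(...)` calls in this notebook take as input.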

For a given day (`date = 12`), let's plot the features that carry `tag_5`, `tag_23` and `tag_26`, i.e. `feature_101` and `feature_107`. These two very similar features represent cumulative variables and are therefore (this is the key part) smooth functions of time.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

import utility
u = utility.Utility('/kaggle/input/jane-street-market-prediction/')
train = pd.read_csv(u.filepath_train(), nrows=int(1e5))
train = u.add_intraday_ts(train)
train_date = train[train['date'] == 12]

In [None]:
display(u.select_tags('tag_5').astype(int))

In [None]:
fig = px.scatter_3d(
    train_date, 
    'intraday_ts', 
    'feature_101', 
    'feature_107', 
    size=np.ones(train_date.shape[0]) * 0.01,
    color='intraday_ts'
)
fig.show()

##### What do we see? 
On this 3D graph, we can easily see that the data (`feature_101`, `feature_107`) consist of points sampled from a set of continuous functions of time. 
This is the case for all the features sharing `tag_5` and `tag_23`. 

##### What does it mean and how can we use it? 
I understood `tag_23` as an "additive" metric, for instance the number of trades or the volume traded. Accumulated over the day (`tag_5`), such a metric evolves smoothly for a given instrument, so each of the "continuous functions" seen on the graph must correspond to a single instrument.
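
A small synthetic illustration of that argument, under the assumption that the features are running totals of an additive quantity (the two instruments and their trade sizes below are invented): mixed together the points jump around, but split by instrument each series is monotone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical instruments, each with its own trade times and traded sizes.
n_trades = 50
times_a = np.sort(rng.uniform(0, 1, n_trades))
times_b = np.sort(rng.uniform(0, 1, n_trades))
cum_a = np.cumsum(rng.uniform(1, 3, n_trades))      # slow accumulation
cum_b = np.cumsum(rng.uniform(10, 30, n_trades))    # fast accumulation

# Interleave the two instruments by time, as they would appear in the dataset.
times = np.concatenate([times_a, times_b])
values = np.concatenate([cum_a, cum_b])
labels = np.array(["a"] * n_trades + ["b"] * n_trades)
order = np.argsort(times)
times, values, labels = times[order], values[order], labels[order]

# Check monotonicity on the mixed series and on each instrument separately.
mixed_is_monotone = bool(np.all(np.diff(values) >= 0))
per_instrument_monotone = all(
    bool(np.all(np.diff(values[labels == k]) >= 0)) for k in ("a", "b")
)
print(mixed_is_monotone, per_instrument_monotone)
```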

Then, we need to cluster those points. 

##### Cluster the points
To group the points into sets forming "smooth" functions, I couldn't find a nice "ready-to-use" library. An EM algorithm might have worked. Instead, I designed a greedy algorithm based on a smooth-time-series approach. 

The algorithm is fully implemented in the `utility.py` file and I will only call it here for demo purposes. 
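
For readers who want the flavour of the idea without opening the utility file, here is a deliberately simplified sketch of a greedy assignment. It is not the `Cluster1DSmoothFunctions` implementation: it works on a single value per point, ignores the slope prior (`alpha_prior`) and the start-time rule (`t_threshold`), and the function name `greedy_cluster` is made up for this example.

```python
import numpy as np

def greedy_cluster(times, values, distance_threshold):
    """Greedily assign time-ordered points to curves.

    Each point joins the curve whose most recent value is closest, provided
    the gap is at most `distance_threshold`; otherwise it starts a new curve.
    Returns an integer label per point, in the original point order.
    """
    values = np.asarray(values, dtype=float)
    labels = np.empty(len(values), dtype=int)
    last_value = []  # most recent value of each curve found so far
    for i in np.argsort(times):
        gaps = [abs(v - values[i]) for v in last_value]
        if gaps and min(gaps) <= distance_threshold:
            curve = int(np.argmin(gaps))
            last_value[curve] = values[i]
        else:
            curve = len(last_value)
            last_value.append(values[i])
        labels[i] = curve
    return labels

# Toy data: two well-separated smooth curves sampled at different times.
t_a = np.linspace(0.0, 1.0, 20)
t_b = np.linspace(0.01, 0.99, 20)
times = np.concatenate([t_a, t_b])
values = np.concatenate([t_a, 5.0 + 2.0 * t_b])

labels = greedy_cluster(times, values, distance_threshold=0.5)
print(labels[:20], labels[20:])
```

On real data the curves cross and start close to each other, which is why a slope prior and hand-tuned thresholds are needed.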

In [None]:
### Preparing the data
train_date = train_date.copy()  # Work on a copy to avoid SettingWithCopyWarning
train_date['intraday_ts'] = 1 - train_date['intraday_ts']  # Flip time axis
train_date = train_date.sort_values('intraday_ts')
features_to_use = ['feature_101', 'feature_107']
times = train_date['intraday_ts'].values
x_points = train_date[features_to_use].values
points = [(t, x) for t, x in zip(times, x_points)]

### Running the algo (details in the utility file)
alpha_prior = np.array([-3.22728931, -6.3402686 ])   # Found with regression on the whole dataset
epsilon = 20                                         # Manual Tweaking
distance_threshold = 0.08                            # Manual Tweaking
t_threshold = 0.3                                    # Do not start a new curve in the middle of the day
clusterer = utility.Cluster1DSmoothFunctions(
    distance_threshold, 
    t_threshold, 
    epsilon, 
    verbose=True
)
stock_ids = clusterer.run(points, alpha_prior)
train_date['stock_id'] = stock_ids
train_date['stock_id'] = train_date['stock_id'].astype(str)
train_date['intraday_ts'] = 1 - train_date['intraday_ts']  # Re-flip time axis
# Plotting the results
fig = px.scatter_3d(
    train_date,
    'intraday_ts', 
    'feature_101', 
    'feature_107', 
    size=np.ones(train_date.shape[0]) * 0.01,
    color='stock_id', 
)
fig.show()

In [None]:
# Plotting some tag_5 features on one stock
plt.figure(figsize=(10, 5))
one_stock = train_date[train_date['stock_id'] == '41']
plt.plot(one_stock['intraday_ts'], one_stock['feature_101'], label='feature_101')
plt.plot(one_stock['intraday_ts'], one_stock['feature_107'], label='feature_107')
plt.plot(one_stock['intraday_ts'], one_stock['feature_77'], label='feature_77')
plt.xlabel('Time')
plt.ylabel('Features')
plt.title('Some features on one stock')
plt.legend()
plt.show()

##### What now?
This "clustering" is not perfect. 
But at least some clusters seem to make sense, so we have retrieved at least one series of trades from a single stock. 

Based on that series of trades, I figured out many things. The most important one: this whole process (`tag_5` + `tag_23` and custom clustering) is unnecessary: 

`tag_14` and related features `feature_{41, 42, 43}` give the stock. 

In [None]:
display(one_stock[['feature_41', 'feature_42', 'feature_43']].drop_duplicates())

### Embedding of stocks: `tag_14` and `tag_18`

`tag_14` seems to be a good indicator of a stock. The values of `feature_{41, 42, 43, 44, 45}` (note the continuity in the feature naming) seem to be constant for a stock / day combination. That allows us to "cluster" the graph seen previously more efficiently. 
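
As a rough sketch of how such constant columns can be turned into an identifier, the example below groups a toy day of trades by three embedding-like columns and numbers the groups. The values are invented, and this is only a guess at the kind of logic a helper such as `u.add_stock_id` could use, not its actual code.

```python
import pandas as pd

# Hypothetical trades for one day: three embedding-like columns that stay
# constant for a given instrument.
day = pd.DataFrame({
    "feature_41": [0.1, 0.1, -1.2, 0.1, -1.2],
    "feature_42": [2.0, 2.0,  0.5, 2.0,  0.5],
    "feature_43": [0.3, 0.3,  0.9, 0.3,  0.9],
})

# Rows sharing the same triple of values get the same group number.
key_columns = ["feature_41", "feature_42", "feature_43"]
day["stock_id"] = day.groupby(key_columns, sort=False).ngroup().astype(str)
print(day["stock_id"].tolist())
```

This relies on exact float equality, and rows with a missing key value are not assigned to a regular group, so both cases would need care on the real data.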

In [None]:
display(u.select_tags('tag_14 | tag_18').astype(int))

In [None]:
train_date = train[train['date'] == 3]
train_date = u.add_stock_id(train_date)
most_present_stocks = (
    train_date
    .groupby('stock_id', as_index=False)
    .agg({'ts_id': 'count'})
    .rename(columns={'ts_id': 'num_trades'})
    .sort_values('num_trades', ascending=False)
    .head(100)['stock_id']
    .values.tolist()
)
most_present_stocks = [str(i) for i in most_present_stocks]

to_plot = train_date[train_date['stock_id'].isin(most_present_stocks)]
fig = px.scatter_3d(
    to_plot, 
    'intraday_ts', 
    'feature_101', 
    'feature_107', 
    size=np.ones(to_plot.shape[0]) * 0.01,
    color='stock_id'
)
fig.show()

# Prices `tag_6` and `feature_0`

With all that new information, I started to look for a price chart. And it seems that `tag_6` is the answer. 

A quick look at the `features.csv` file shows how important this tag is. As with `tag_23`, `tag_6` seems to be associated with other tags "modulating" the price. A typical feature with `tag_6` looks like: 
```feature_x = tag_{0, 1, 2, 3, 4, 5} x tag_6 x tag_9 (y/n) x tag_{11, 12, 13}```
or
```feature_x = tag_6 x tag_9 (y/n) x tag_{7, 8, 10}```

I can't fully explain this pattern today. But when `tag_{0, 1, 2, 3, 4, 5}` are all 0 and `tag_6` is 1, my guess is that the feature represents a non-aggregated, "instantaneous" price.

For every feature with `tag_9 == 1` and `tag_6 == 1`, there is a feature with `tag_9 == 0 & tag_6 == 1`, and the two are always very strongly correlated (maybe a bid/ask or buy/sell tag). So let's select some features and look at our stock!

In [None]:
display(u.select_tags('tag_6 & ~tag_9 & ~tag_0 & ~tag_1 & ~tag_2 & ~tag_3 & ~tag_4').astype(int))

In [None]:
features_to_look_at = u.get_features('tag_6 & ~tag_9 & ~tag_0 & ~tag_1 & ~tag_2 & ~tag_3 & ~tag_4')
stock_id = '533'
stock = to_plot[to_plot['stock_id'] == stock_id].copy()  # Copy so that columns can be added later
sns.pairplot(stock[features_to_look_at])
plt.show()

### `feature_0` helping us


`feature_0` was the first feature to be de-anonymized. It seems to be the side of the trade. How can it help us then?

First, those features are not randomly distributed: the distribution of `feature_{3, 5, 37, 39}` is split into 2 clusters. I clustered that data and studied it more closely. It turned out that the clusters match the `feature_0` values perfectly. 

So I tried multiplying those features by `feature_0`. 
In financial terms, multiplying a price by the side of the trade gives you the "cost" of the trade. Example: you BUY 1 share @ 100 USD, and your cash becomes -100 USD (i.e. you used 100 USD to buy something). If you sold that share, your cash would become +100 USD. 
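
One way to read this step, as a toy example: if a feature was stored as "side times some underlying quantity" (an assumption, with invented numbers), it splits into two clusters by sign, and multiplying by the side once more cancels the sign because the side squared is 1.

```python
import numpy as np

side = np.array([1, -1, 1, -1, 1])             # stand-in for feature_0
signal = np.array([0.8, 1.1, 0.9, 1.0, 1.2])   # hypothetical underlying quantity

stored = side * signal        # what a side-signed feature would look like
recovered = side * stored     # side**2 == 1, so the sign cancels

print(stored)      # positive where side is +1, negative where side is -1
print(recovered)   # a single cluster again
```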

Here is what those new features give us: 

In [None]:
features_to_look_at = ['feature_3', 'feature_5', 'feature_37', 'feature_39']
new_features_to_look_at = []
for f in features_to_look_at: 
    new_f = f'{f}_0'
    new_features_to_look_at.append(new_f)
    stock[new_f] = stock['feature_0'] * stock[f]
sns.pairplot(stock[new_features_to_look_at])
plt.show()

### Back to prices

Let's see what it looks like once we plot those new features against time.
I also added some colors to show `feature_0` being `1` or `-1`. 

Those graphs are similar to "classical" intraday price charts. That, together with the fact that `tag_6` features with rolling windows are impacted by NaNs at the open, at mid-day and randomly throughout the day, makes me think of `tag_6` as a price tag. 
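
The NaN pattern is what one would expect from a rolling statistic that needs a full window of observations. A minimal pandas illustration on invented prices (the window length of 3 is arbitrary):

```python
import pandas as pd

# Hypothetical prices at the start of a session.
prices = pd.Series([100.0, 100.5, 101.0, 100.8, 101.2, 101.5])

# A rolling mean over 3 observations is undefined until 3 prices are available.
rolling_mean = prices.rolling(window=3).mean()
print(rolling_mean)
```

The first two values are NaN, and the same would happen after any gap in the data, for instance around a mid-day break.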

In [None]:
plt.figure(figsize=(20, 10))
n_fig = len(new_features_to_look_at)
for i, f in enumerate(new_features_to_look_at): 
    plt.subplot(n_fig, 1, i+1)
    sc = plt.scatter(stock['intraday_ts'], stock[f], c=stock['feature_0'], cmap='plasma')
    plt.colorbar(sc, label='feature_0')  # Color encodes the trade side (+1 / -1)
    plt.ylabel(f)
plt.xlabel('Time')
plt.show()

# Other guesses

### On embedding values

##### `tag_14`, a PCA?
Just a guess: could `feature_{41, 42, 43}` be the first 3 components of a PCA embedding of the stocks?

##### `tag_18`, average daily trading volume?
`tag_18` is associated with `tag_{15, 17}`, which are associated with `tag_23`. I already said that `tag_23` looks like an additive variable (maybe the volume or the number of trades). So could `tag_18` be the [ADTV](https://www.investopedia.com/terms/a/averagedailytradingvolume.asp#:~:text=Average%20daily%20trading%20volume%20(ADTV)%20is%20the%20average%20number%20of,find%20the%20average%20daily%20volume.)?
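
For reference, ADTV is simply the mean of the daily traded volume per instrument. A toy computation on invented numbers shows why it would behave like a per-stock constant, and hence like an embedding:

```python
import pandas as pd

# Hypothetical daily volumes for two instruments over three days.
daily_volume = pd.DataFrame({
    "stock_id": ["a", "a", "a", "b", "b", "b"],
    "date":     [0, 1, 2, 0, 1, 2],
    "volume":   [100, 120, 110, 1000, 900, 1100],
})

# One number per instrument: the average of its daily volumes.
adtv = daily_volume.groupby("stock_id")["volume"].mean()
print(adtv)
```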

### `tag_{15, 17}` buy / sell?
Another guess, with no justification.

### `tag_{7, 8, 10}` related instruments price?
Maybe currencies or options related to the main financial instrument?

### `tag_9`, bid / ask indicator?

### `tag_20` a spread?
See below the pairplot of the features related to `tag_20`. All these features are "aligned": they lie on a grid. 

Financial markets work like an auction. People ready to buy "bid" a price, people ready to sell "ask" a price. Those prices are not equal (otherwise a trade would take place). The difference between those two values is called the spread. 

Prices are "regulated" and must be a multiple of a "tick size", e.g. 0.01 USD. Therefore, the spread must be a multiple of that tick size as well. That might be what we see in the pairplot below. 
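
A quick way to test the "grid" idea on raw quotes is to express the spread in ticks and check that it is close to an integer. The quotes and the tick size below are invented; the competition features are normalized, so on the real data the grid step would be in arbitrary units rather than 0.01.

```python
import numpy as np

tick_size = 0.01                                   # assumed tick size
bid = np.array([100.00, 100.01, 99.98, 100.05])    # hypothetical best bids
ask = np.array([100.02, 100.02, 100.01, 100.06])   # hypothetical best asks

spread = ask - bid
spread_in_ticks = spread / tick_size

# Allow for floating-point noise when checking for whole numbers of ticks.
on_grid = np.isclose(spread_in_ticks, np.round(spread_in_ticks), atol=1e-6)
print(np.round(spread_in_ticks).astype(int), on_grid)
```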

In [None]:
display(u.select_tags('tag_20').astype(int))
features_spread = u.get_features('tag_20')

In [None]:
sns.pairplot(stock[features_spread])
plt.show()

# Summary

I hope that this notebook, together with the previous notebook [Time aggregation tags](https://www.kaggle.com/gregorycalvez/de-anonymization-time-aggregation-tags), will encourage others to look further into de-anonymizing the data. 

### Next Steps


##### Further feature relations
We managed to plot "price charts" of some stocks with the data, but many things remain to be done. Knowing how to isolate the trades relating to a single instrument, many new relations between features can be uncovered. 

##### `date` de-anonymization
Furthermore, if one manages to fully de-anonymize the date, one could imagine matching the price charts computed from the dataset with the tick prices of "famous" securities. If that is done, we could dream of being able to explain most of the features. 

Let me know your thoughts!