![Decoding](http://www.daskeyboard.com/blog/decode-our-das-keyboard-holiday-message-and-win/decryptthemessage-2/)

This notebook was inspired by the excellent notebooks ["De-anonymization: Time Aggregation Tags"](https://www.kaggle.com/gregorycalvez/de-anonymization-time-aggregation-tags/notebook#De-anonymization:-Time-Aggregation-Tags) and ["De-anonymization: Price, Quantity, Stocks"](https://www.kaggle.com/gregorycalvez/de-anonymization-price-quantity-stocks). (I could not work out how to @-mention their author, sorry about that.) It shares my insights into the features and the meaning of the tags. Comments and ideas are welcome!

**TL;DR**

* Feature 0: the side of the trade
* Feature 64: the time within a date
* Tags 0-4: time windows (lengths in the table below, measured on the feature 64 clock)
* Features 41-43: identify a stock within a date

|Tag|Window length (feature 64 units)|
|:----|:--------|
|tag_0|0.000000|
|tag_1|0.006602|
|tag_2|0.020753|
|tag_3|0.058880|
|tag_4|0.236351|


In [None]:
# Libraries used throughout the notebook.
# All of them ship with the kaggle/python Docker image: https://github.com/kaggle/docker-python

import datatable as dt  # fast CSV reader
import matplotlib.pyplot as plt
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px
import seaborn as sns

pd.options.display.max_rows = 999

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
%%time

train_df = dt.fread('../input/jane-street-market-prediction/train.csv').to_pandas()
feature_tag = pd.read_csv('../input/jane-street-market-prediction/features.csv', index_col=0)
print(train_df.shape)
print(train_df.columns)


<a id='content'></a>
# Table of Contents


* [Explore & Visualization](#section-1)
* [Feature 0](#section-2)
* [Feature 64 & Tag 22](#section-3)
* [Tag 0-4](#section-4)
* [Feature 41-43 & Tag 14](#section-5)

* [TODO](#section-100)


In [None]:
# Helper functions
def display_query(query, show = True):
    query_raw = feature_tag.query(query)
    query_compact = query_raw.loc[:, query_raw.any()]
    if show:
        display((query_compact*1).style.background_gradient(cmap='Oranges', vmin=0, vmax=1))
        
    return query_compact


def check_unique(df, sub_cols):
    return df.drop_duplicates().equals(df.drop_duplicates(subset=sub_cols))
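
To make the intent of `check_unique` concrete, here is a minimal sketch on a hand-made frame. The column names `key`, `value` and `noise` are hypothetical and do not come from the competition data. The helper returns `True` when the columns in `sub_cols` already determine the whole row, which is the property needed later for an identifier.

```python
import pandas as pd


def check_unique(df, sub_cols):
    # Same helper as above: True if de-duplicating on sub_cols
    # gives the same rows as de-duplicating on all columns.
    return df.drop_duplicates().equals(df.drop_duplicates(subset=sub_cols))


# Hypothetical toy data, not taken from train.csv
toy = pd.DataFrame({
    "key": [1, 1, 2, 2],
    "value": ["a", "a", "b", "b"],
    "noise": [0.1, 0.1, 0.2, 0.3],
})

# "key" fully determines "value" ...
key_determines_value = check_unique(toy[["key", "value"]], ["key"])
# ... but not "noise", because key 2 appears with two different noise values.
key_determines_noise = check_unique(toy[["key", "noise"]], ["key"])

print(key_determines_value, key_determines_noise)
```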

<a id="section-1"></a>
# Explore & Visualization
[Back to content](#content)

In [None]:
# How many tags for each feature?
tag_counts = feature_tag.sum(axis=1)
fig = px.bar(tag_counts, title = 'Tag counts')
fig.show()

In [None]:
# How many times each tag occurs in features:
tag_counts = feature_tag.sum(axis=0)
fig = px.bar(tag_counts, title = 'Feature counts')
fig.show()

In [None]:
# Overall visualization:
display((feature_tag*1).style.background_gradient(cmap='Oranges', vmin=0, vmax=1))

In [None]:
# You can play around with different queries, for example:
# display_query('tag_6') 
# display_query('tag_6 & tag_9')
# display_query('tag_20 | tag_28')
features = display_query('(tag_0 | tag_1 | tag_2 | tag_3 | tag_4)&(tag_23)')
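
For readers unfamiliar with `DataFrame.query` on boolean columns, the sketch below reproduces what `display_query` does on a small stand-in for `features.csv`. The feature names and tag assignments are made up for illustration.

```python
import pandas as pd

# Toy stand-in for features.csv: rows are features, columns are boolean tags.
toy_tags = pd.DataFrame(
    {
        "tag_0": [True, False, True, False],
        "tag_1": [False, True, False, False],
        "tag_6": [False, False, True, False],
        "tag_23": [True, True, False, False],
    },
    index=["feature_a", "feature_b", "feature_c", "feature_d"],
)

# Keep the features that carry (tag_0 or tag_1) and also tag_23.
selected = toy_tags.query("(tag_0 | tag_1) & tag_23")

# Drop tag columns that are False for every selected feature,
# which is the "compact" view that display_query returns.
compact = selected.loc[:, selected.any()]

selected_features = list(compact.index)
remaining_tags = list(compact.columns)
print(selected_features, remaining_tags)
```

Here `tag_6` disappears from the compact view because none of the selected features carries it.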


<a id=section-2></a>
# Feature 0: Side of trade

A binary feature taking the values 1 and -1, each on roughly the same number of rows.

Educated guess: it is the side of the trade, i.e. whether the stock is bought or sold.

[Back to content](#content)

In [None]:
train_df['feature_0'].value_counts()

In [None]:
# Mean resp and share of positive resp for each side:
df = pd.DataFrame()
df['Mean resp'] = train_df.groupby('feature_0')['resp'].mean()
df['Pos resp ratio'] = train_df.groupby('feature_0')['resp'].apply(lambda s: (s > 0).mean())

df

In [None]:
# Visualize the distribution of resp for both sides:
row_index = train_df['feature_0']>0

fig, ax = plt.subplots()
ax.hist(train_df.loc[row_index, 'resp'], label = 'Buy Order', bins=100, alpha = 0.3, density=True)
ax.hist(train_df.loc[~row_index, 'resp'], label = 'Sell Order', bins=100, alpha = 0.3, density=True)
ax.legend()
plt.show()

Over the long term, the return from holding a stock is slightly positively skewed.  

Therefore, the guess here is that "-1" indicates a **sell** order on the market, which means we are **buying** if the trade is executed.  
Similarly, "1" indicates a **buy** order, and we take a short position if it is executed.
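
The skew argument can be checked numerically. The sketch below shows one way to measure it, assuming the sample skewness from `groupby(...).skew()` is an acceptable measure. The numbers are hand-picked toy values that make the pattern visible; they are not results from the competition data.

```python
import pandas as pd

# Toy data: four hypothetical resp values per side, chosen by hand.
toy = pd.DataFrame({
    "feature_0": [-1, -1, -1, -1, 1, 1, 1, 1],
    "resp": [-0.01, 0.00, 0.01, 0.05, -0.05, -0.01, 0.00, 0.01],
})

# Sample skewness of resp for each side.
skew_by_side = toy.groupby("feature_0")["resp"].skew()
print(skew_by_side)
```

On the real data the same call would be `train_df.groupby('feature_0')['resp'].skew()`.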

<a id=section-3></a>
# Feature 64 & Tag 22: Intraday time

[Back to content](#content)

In [None]:
features = display_query('tag_22')

In [None]:
# Look at a random sample date:
date = 42
feature = 'feature_64'

sample_df = train_df.query(f'date == {date}')
print(f'Range in date {date}: {min(sample_df[feature]):.4f} - {max(sample_df[feature]):.4f}')
sample_df[feature].plot()

In [None]:
# Range of feature 64 in all dates:
f_range_df = pd.DataFrame()
f_range_df['MAX'] = train_df.groupby('date')[feature].max()
f_range_df['MIN'] = train_df.groupby('date')[feature].min()
f_range_df = f_range_df.reset_index()

px.line(f_range_df, x='date', y=['MAX', 'MIN'])

In [None]:
# Outlier dates spotted in the graph above:
outlier_dates = [2, 14, 87, 294] # dates 2 and 294 are abnormally short

print(f'Average trades in a date: {train_df["date"].value_counts().mean():.2f}')
print(train_df.loc[train_df.date.isin(outlier_dates), 'date'].value_counts())

In [None]:
# Consistency check of the clock feature: it should go backwards only when a new date starts
reverse_clock = train_df[feature] < train_df[feature].shift(1)
new_date = train_df['date'] > train_df['date'].shift(1)

all(reverse_clock == new_date)
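
The check above compares each row with the previous one via `shift(1)`. A minimal sketch on a toy clock column (the values are invented) shows what the two boolean series look like.

```python
import pandas as pd

# Toy data: three dates, with a clock that restarts near 0 on each new date.
toy = pd.DataFrame({
    "date": [0, 0, 0, 1, 1, 2, 2, 2],
    "clock": [0.0, 1.5, 6.0, 0.1, 5.9, 0.0, 2.0, 6.1],
})

# True where the clock is lower than in the previous row.
reverse_clock = toy["clock"] < toy["clock"].shift(1)
# True where the date is higher than in the previous row.
new_date = toy["date"] > toy["date"].shift(1)

# The clock is consistent if it goes backwards exactly at the date boundaries.
clock_resets_only_at_new_date = bool((reverse_clock == new_date).all())
reset_rows = list(toy.index[reverse_clock])
print(clock_resets_only_at_new_date, reset_rows)
```

The first row compares against `NaN` in both series, so both comparisons are `False` there and the row does not count as a reset.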

In [None]:
# Finding the lunch gap:
sub_df = train_df[['date',feature]].copy()
sub_df[f'{feature}_pre'] = sub_df[feature].shift(1)
gap_df = sub_df.loc[(sub_df[feature].diff() > 0.5) & (sub_df[feature]>0) & (sub_df[feature]<4), :]
lunch_start = gap_df[feature+"_pre"].mean()
lunch_end = gap_df[feature].mean()

print(f'The lunch gap is from {lunch_start:.4f} to {lunch_end:.4f}')
px.line(gap_df, x='date', y = [feature, feature+'_pre'])
# any(gap_df.date.duplicated())
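
The gap detection can be illustrated on toy data. This sketch is a small variation on the cell above: it takes the difference per date with `groupby(...).diff()` rather than filtering on the clock value, so the jump at a date boundary never enters the comparison. The clock values and the 0.5 threshold on them are assumptions for the example.

```python
import pandas as pd

# Toy data: two dates, each with one large jump in the clock around "lunch".
toy = pd.DataFrame({
    "date": [0] * 6 + [1] * 6,
    "clock": [0.0, 0.4, 0.8, 1.4, 1.8, 2.2,
              0.0, 0.3, 0.7, 1.4, 1.7, 2.1],
})
toy["clock_pre"] = toy["clock"].shift(1)

# Difference to the previous row within the same date.
jump = toy.groupby("date")["clock"].diff()

# Rows whose clock jumped by more than 0.5 mark the end of a gap.
gaps = toy[jump > 0.5]

gap_start = gaps["clock_pre"].mean()  # average last time before the gap
gap_end = gaps["clock"].mean()        # average first time after the gap
print(f"gap from {gap_start:.4f} to {gap_end:.4f}")
```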

In [None]:
# Cumulative number of trades over the course of a trading date:
fig, ax = plt.subplots()
for i in range(10, 18):
    sample_ser = train_df.loc[train_df.date==i, feature]
    ax.scatter(x=sample_ser, y = list(range(len(sample_ser))), label = f'Date {i}', alpha=0.2, s=1)

ax.legend(markerscale = 10)
plt.show()

<a id = "section-4"></a>
# Tag 0-4: Time windows

[Back to content](#content)

In [None]:
basic_query = '(tag_0|tag_1|tag_2|tag_3|tag_4)'
add_query = '&(tag_6|tag_23)'
features = display_query(basic_query+add_query)


In [None]:
date = 12
x_col = 'feature_64'
sample_df = train_df.query(f'date == {date}')
df = pd.DataFrame()
df['time'] = sample_df[x_col]
lunch_end = 1.3769  # end of the lunch gap on the feature 64 clock (hard-coded)

cols = []
for i in range(5):
    y_col = features.index[features[f'tag_{i}']]
    count_col = f'NA_counts_tag_{i}'
    cols.append(count_col)
    df[count_col] = sample_df[y_col].isnull().sum(axis=1)
    
    missing_time = df.loc[df[count_col]>0, 'time']
    if len(missing_time)>0:
        window = max(missing_time) - lunch_end
        print(f'Estimated window len of tag_{i}: {window:.4f}')
    else:
        print(f'No missing values for tag_{i} on date {date}')

px.line(df, x='time', y=cols)

In [None]:
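def estimate_window_len(date, tag):
    """Estimate the window length of `tag` on `date`: the time from the end of
    lunch (global `lunch_end`) to the last row with a missing tagged feature."""
    x_col = 'feature_64'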
    sample_df = train_df.query(f'date == {date}')
    features = display_query(f'(tag_23|tag_6) & {tag}', show=False)
    y_cols = features.index
    
    df = pd.DataFrame()
    df['time'] = sample_df[x_col]
    df['NA_counts'] = sample_df[y_cols].isnull().sum(axis=1)
    missing_time = df.loc[df['NA_counts']>0, 'time']
    if len(missing_time)>0:
        window = max(missing_time) - lunch_end
        if window > 0:
            return window
        else:
            return 0
    else:
        return np.nan

# estimate_window_len(12, 'tag_2')

estimate_df = pd.DataFrame()
outlier_dates = [2, 14, 87, 294]
for i in range(0, 5):
    for date in range(50):
        if date in outlier_dates:
            continue
        estimate_df.loc[date, f'tag_{i}'] = estimate_window_len(date, f'tag_{i}')

print('Estimate windows: (first 50 dates)')
print(estimate_df.median())


estimate_df = pd.DataFrame()
outlier_dates = [2, 14, 87, 294]
for i in range(0, 5):
    for date in range(450, 500):
        if date in outlier_dates:
            continue
        estimate_df.loc[date, f'tag_{i}'] = estimate_window_len(date, f'tag_{i}')

print('Estimate windows: (last 50 dates)')
print(estimate_df.median())
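
The estimate rests on one assumption: a windowed feature is missing until its look-back window has fully elapsed since the session (re)started. The sketch below simulates a feature under that assumption and applies the same "last missing time minus `lunch_end`" rule. The clock values, `true_window` and the toy `lunch_end` are all invented for the example.

```python
import numpy as np
import pandas as pd

lunch_end = 1.4      # assumed time at which trading resumes after lunch
true_window = 0.25   # assumed look-back length of the toy feature

clock = np.array([0.0, 0.1, 0.3, 0.6, 0.9, 1.4, 1.5, 1.6, 1.7, 2.0, 2.5])

# Time elapsed since the most recent session start (open at 0.0, or lunch_end).
session_start = np.where(clock >= lunch_end, lunch_end, 0.0)
elapsed = clock - session_start

# Assumption: the feature is NaN until a full window has elapsed.
feature = np.where(elapsed < true_window, np.nan, 1.0)
toy = pd.DataFrame({"time": clock, "feature": feature})

# Same rule as estimate_window_len: last missing time minus lunch_end, floored at 0.
missing_time = toy.loc[toy["feature"].isnull(), "time"]
estimated_window = max(missing_time.max() - lunch_end, 0)
print(f"true window {true_window}, estimated {estimated_window:.4f}")
```

In this toy case the estimate comes out below the true window, because the last missing row is observed at 1.6 and the next trade only arrives at 1.7. The estimate can therefore be no finer than the spacing between trades.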
        

<a id = "section-5"></a>
# Feature 41-43 & Tag 14: Stock in Date

Using features 41-45 (or feature 45 alone) gives the same identification.

[Back to content](#content)

In [None]:
date = 12
id_cols = ['feature_41', 'feature_42', 'feature_43']

sample_df = train_df.query(f'date == {date}').copy()
sample_df['stock_id'] = sample_df['feature_41'].astype(str) +"_"+sample_df['feature_42'].astype(str) +"_"+ sample_df['feature_43'].astype(str)
sample_df[id_cols+['stock_id']]



In [None]:
sample_df['stock_id'].value_counts()
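
Concatenating the three columns as strings works, but it is not the only option. A possible alternative, sketched here on invented values, is to let pandas number the distinct (date, feature_41, feature_42, feature_43) combinations with `groupby(...).ngroup()`, which avoids building long strings.

```python
import pandas as pd

id_cols = ["feature_41", "feature_42", "feature_43"]

# Toy data: five trades on one date, with three distinct id triples.
toy = pd.DataFrame({
    "date": [12, 12, 12, 12, 12],
    "feature_41": [0.5, 0.5, -1.2, -1.2, 0.5],
    "feature_42": [1.0, 1.0, 0.3, 0.3, 2.0],
    "feature_43": [0.1, 0.1, 0.7, 0.7, 0.1],
})

# One integer label per distinct (date, feature_41, feature_42, feature_43) combination.
toy["stock_id"] = toy.groupby(["date"] + id_cols, sort=False).ngroup()
n_stocks = toy["stock_id"].nunique()
print(toy[["stock_id"]], n_stocks)
```

Rows with missing values in any of the key columns are not assigned to a regular group and would need separate handling.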

In [None]:
features = display_query('tag_5')
feature_name = features.index.values

In [None]:
# The pairwise relations between tag 5 features support the identification:
for i in range(features.shape[0]//2):
    col_x = feature_name[2*i]
    col_y = feature_name[2*i+1]
    fig = px.scatter(sample_df, x=col_x, y=col_y, color = 'stock_id')
    fig.show()

In [None]:
outlier_dates = [2, 14, 87, 294]

df = pd.DataFrame()
df['date'] = list(range(500))
df['trade'] = train_df['date'].value_counts()  # trades per date (aligned on the date index)
df['stock'] = train_df.groupby('date').apply(lambda g: len(g[id_cols].value_counts()))  # distinct id triples per date
df = df.loc[~df.date.isin(outlier_dates), :]
df['ratio'] = df['trade']/df['stock']  # average number of trades per stock

df.set_index('date').plot(subplots=True)

print(f'Trades: {df.trade.mean():.2f} with std ({df.trade.std():.2f})')
print(f'Stocks: {df.stock.mean():.2f} with std ({df.stock.std():.2f})')
print(f'Ratio: {df.ratio.mean():.4f} with std ({df.ratio.std():.4f})')

<a id = "section-100"></a>
# TODO:

* Use the time and stock identification to better understand the other features and tags.
* Find better NA-filling methods than mean/median or naive backward/forward filling.
* Feature engineering becomes possible once the meaning of the tags is better understood.
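
One possible direction for the NA-filling item, which this notebook does not evaluate, is to forward-fill within each (date, stock) group instead of across the whole frame. The sketch below assumes a `stock_id` column built as in the Feature 41-43 section. The data and the name `feature_x` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: two stocks on date 0, one stock on date 1.
toy = pd.DataFrame({
    "date": [0, 0, 0, 0, 1, 1],
    "stock_id": ["a", "b", "a", "b", "a", "a"],
    "feature_x": [1.0, 5.0, np.nan, np.nan, np.nan, 2.0],
})

# Naive forward fill: takes the last valid value regardless of stock or date.
toy["naive_ffill"] = toy["feature_x"].ffill()

# Group-wise forward fill: only uses earlier rows of the same stock on the same date.
toy["group_ffill"] = toy.groupby(["date", "stock_id"])["feature_x"].ffill()

naive = toy["naive_ffill"].tolist()
grouped = toy["group_ffill"].tolist()
print(toy)
```

In the toy frame the naive fill copies stock b's value into stock a's rows, and even carries it into the next date. The group-wise fill keeps each stock's own last value and leaves the first row of date 1 missing.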

[Back to content](#content)