## About this notebook

This notebook applies t-SNE for dimensionality reduction and HDBSCAN for clustering.
To keep the data small, only rows with stock_id = 0 are handled.

The installation and use of HDBSCAN follow [Optiver HDBSCAN](https://www.kaggle.com/something4kag/optiver-hdbscan); this notebook adapts that approach for this competition.

Thanks to @something4kag.

UMAP and HDBSCAN are covered in a separate notebook: [UMAP & HDBSCAN (stock_id=0)](https://www.kaggle.com/takemi/umap-hdbscan-stock-id-0)

In [None]:
from IPython.display import display, HTML

import glob
import os
import gc

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns

from joblib import Parallel, delayed
from sklearn import preprocessing, model_selection
from tqdm import tqdm


data_dir = '../input/optiver-realized-volatility-prediction/'

In [None]:
!mkdir -p /tmp/pip/cache/
!cp ../input/hdbscan0827-whl/hdbscan-0.8.27-cp37-cp37m-linux_x86_64.whl /tmp/pip/cache/
!pip install --no-index --find-links /tmp/pip/cache/ hdbscan

In [None]:
import hdbscan
from sklearn import datasets
from sklearn.manifold import TSNE

## Simple Preprocess

Merge train.csv with the trade data in trade_train.parquet to build the training data.


In [None]:
def read_train_test():
    # Despite the name, only train.csv is read here
    train = pd.read_csv(data_dir + 'train.csv')
    # Create a key to merge with book and trade data
    train['row_id'] = train['stock_id'].astype(str) + '-' + train['time_id'].astype(str)
    print(f'Our training set has {train.shape[0]} rows')
    return train

def trade_preprocessor(file_path):
    df = pd.read_parquet(file_path)
    stock_id = file_path.split('=')[1]
    df['row_id'] = df['time_id'].apply(lambda x:f'{stock_id}-{x}')
    df = df.drop('time_id',axis=1)
    return df

def preprocessor(list_stock_ids, is_train = True):
    # Note: is_train is currently unused; only the training trade files are read

    # Function executed in parallel, once per stock_id
    def for_joblib(stock_id):
        # Train
        file_path_trade = data_dir + "trade_train.parquet/stock_id=" + str(stock_id)
        df_tmp = trade_preprocessor(file_path_trade)
        return df_tmp

    # Use the joblib Parallel API to run for_joblib over all stock ids
    df = Parallel(n_jobs = -1, verbose = 1)(delayed(for_joblib)(stock_id) for stock_id in list_stock_ids)
    # Concatenate all the dataframes returned by Parallel
    df = pd.concat(df, ignore_index = True)
    return df

In [None]:
# Read the training set
train = read_train_test()

# Preprocess the trade data for stock_id = 0 and attach it via row_id
train_ = preprocessor([0], is_train = True)
train = train.merge(train_, on = ['row_id'], how = 'left')

In [None]:
train[train['stock_id']==0]
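
To illustrate what the left merge on `row_id` does, here is a minimal sketch on toy data. The frames `toy_train` and `toy_trade` and all their values are made up for illustration and are not taken from the competition files. Each `row_id` is repeated once per matching trade, and a `row_id` with no trades keeps a single row whose trade columns are NaN.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical values)
toy_train = pd.DataFrame({
    'stock_id': [0, 0, 1],
    'time_id': [5, 11, 5],
    'target': [0.004, 0.001, 0.002],
})
toy_train['row_id'] = toy_train['stock_id'].astype(str) + '-' + toy_train['time_id'].astype(str)

# Toy stand-in for the stock_id=0 trade data (hypothetical values)
toy_trade = pd.DataFrame({
    'row_id': ['0-5', '0-5', '0-11'],
    'price': [1.002, 1.003, 0.999],
    'size': [326, 230, 100],
    'order_count': [12, 3, 1],
})

# Left merge: every row of toy_train is kept, duplicated per matching trade
toy_merged = toy_train.merge(toy_trade, on=['row_id'], how='left')
print(toy_merged)
```

This is why rows for stocks other than 0 survive the merge with empty trade columns until they are filtered out later.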

Group by stock_id and time_id, then compute the mean, std, max, and min of price, size, and order_count to use as features.

In [None]:
# Select the columns with a list (double brackets); tuple-style selection fails on recent pandas
train = train.groupby(['stock_id','time_id'])[['price','size','order_count']].agg(['mean', 'std', 'max', 'min']).reset_index()
train
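
Aggregating with a list of functions produces two-level column names such as `('price', 'mean')`. If flat names are preferred, one possible approach is sketched below on toy data; the frame `toy` and the `price_mean` naming scheme are assumptions for illustration, not something the notebook itself does.

```python
import pandas as pd

# Toy trade rows (hypothetical values)
toy = pd.DataFrame({
    'stock_id': [0, 0, 0, 0],
    'time_id': [5, 5, 11, 11],
    'price': [1.0, 1.2, 0.9, 1.1],
    'size': [10, 30, 20, 40],
})

agg = toy.groupby(['stock_id', 'time_id'])[['price', 'size']].agg(['mean', 'max']).reset_index()

# Each column is a tuple like ('price', 'mean') or ('stock_id', '');
# join the non-empty parts with an underscore
agg.columns = ['_'.join(part for part in col if part) for col in agg.columns]
print(agg.columns.tolist())
```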

In [None]:
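# Drop rows with missing aggregates: other stocks have no trade data merged,
# and std is NaN when a time_id has a single trade
train = train.dropna()
train = train[train['stock_id']==0]
# X keeps the stock_id and time_id columns alongside the 12 aggregated features
X = np.array(train)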

In [None]:
np.array(train)

In [None]:
train

## t-SNE

In [None]:
# Project the feature matrix to 2-D with t-SNE (default parameters)
projection = TSNE().fit_transform(X)
plt.scatter(*projection.T)
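
t-SNE works on pairwise distances, so columns with large numeric ranges can dominate the embedding. The sketch below shows one common option, standardizing the features first and fixing `random_state` so the layout is repeatable. It runs on synthetic data (`X_toy` is made up), and the `perplexity` value is an arbitrary choice for this toy size; this is not what the cell above does.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic features on very different scales (made-up data)
X_toy = np.column_stack([
    rng.normal(1.0, 0.001, size=100),    # price-like column
    rng.normal(300.0, 100.0, size=100),  # size-like column
    rng.normal(5.0, 2.0, size=100),      # order_count-like column
])

# Rescale each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_toy)

# Fixed random_state makes the embedding reproducible between runs
projection_toy = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(X_scaled)
print(projection_toy.shape)
```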

## HDBSCAN

Cluster the data with HDBSCAN and color the t-SNE projection above by cluster label.

In [None]:
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True).fit(X)
color_palette = sns.color_palette('Paired', 12)
# Noise points (label -1) are drawn in gray; the modulo keeps the lookup
# in range if more than 12 clusters are found
cluster_colors = [color_palette[x % len(color_palette)] if x >= 0
                  else (0.5, 0.5, 0.5)
                  for x in clusterer.labels_]
# Desaturate each point by its cluster membership probability
cluster_member_colors = [sns.desaturate(x, p) for x, p in
                         zip(cluster_colors, clusterer.probabilities_)]
plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)

In [None]:
u, counts = np.unique(clusterer.labels_, return_counts=True)
print(u)
print(counts)

In [None]:
len(clusterer.labels_)
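
HDBSCAN assigns the label -1 to points it treats as noise, so the counts printed above mix real clusters with the noise bucket. A minimal sketch of separating the two is shown below; the `labels` array is a hypothetical stand-in for `clusterer.labels_`.

```python
import numpy as np

# Hypothetical label array; in the notebook this would be clusterer.labels_
labels = np.array([0, 0, 1, -1, 1, 1, -1, 0, 2, 2])

u, counts = np.unique(labels, return_counts=True)
sizes = {int(k): int(v) for k, v in zip(u, counts)}

# Pull the noise bucket out so only real clusters remain in sizes
n_noise = sizes.pop(-1, 0)
noise_fraction = n_noise / len(labels)
n_clusters = len(sizes)

print(n_clusters, n_noise, noise_fraction)
```

A high noise fraction may suggest revisiting `min_cluster_size` or the feature scaling.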