# Ubiquant Market Prediction - A Simple EDA

## About this Competition 
In this competition, competitors are asked to build a model that forecasts an investment's **return rate**, helping **quantitative** researchers make better predictions and decisions.
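> [Ubiquant](https://www.ubiquant.com/website/home): "Quantitative trading is about doing the most correct thing within 2$\sigma$ while controlling the risk beyond 2$\sigma$. But to work in quant, you have to be excellent beyond 2$\sigma$."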

## About this Notebook
In this kernel, I'll briefly illustrate what the dataset looks like. Also, some visualization will be implemented to facilitate the preliminary understanding of the data.

## Acknowledgements
* Data type conversion of raw DataFrame - [Speed Up Reading (csv-to-pickle)](https://www.kaggle.com/columbia2131/speed-up-reading-csv-to-pickle)
* File IO using `.parquet` extension - [Fast Data Loading and Low Mem with Parquet Files](https://www.kaggle.com/robikscube/fast-data-loading-and-low-mem-with-parquet-files)

## Table of Contents
* [i. Data Appearance](#data_appearance)
    * *[Large Dataset](#large_dataset)*
    * *[Data File Description](#data_file_description)*
    * *[Basic Description of `train.csv`](#based_description_of_train)*
* [ii. Primary Key](#primary_key)
    * *[`time_id`](#time_id)*
    * *[`investment_id`](#inv_id)*
    * *[Relationship between `time_id` and `investment_id`](#rel_btw_time_id_inv_id)*
* [iii. Target Exploration](#target_exploration)
    * *[Predicting Target Distribution](#return_rate_dist)*
    * *[Predicting Target Series](#return_rate_series)*
    * *[2D Predicting Target Map](#return_rate_map)*
* [iv. Feature Exploration](#feature_exploration)
    * *[Single Feature Distribution](#single_feat_dist)*
    * *[Low-Variance Feature](#low_var_feat)*
    * *[Outliers](#outliers)*
    * *[Feature Interaction](#feat_interaction)*

In [None]:
# Import packages 
import os
import gc
import pickle
import warnings
import random
from random import sample
from tqdm import tqdm

import pandas as pd 
import numpy as np 
from sklearn.feature_selection import mutual_info_regression as mir
from sklearn.metrics.cluster import normalized_mutual_info_score
from sklearn.metrics.cluster import adjusted_mutual_info_score
import matplotlib.pyplot as plt 
import seaborn as sns 
import plotly.graph_objects as go 
import plotly.express as px 

# Configuration 
warnings.simplefilter('ignore')
pd.set_option('display.max_columns', 200)
random.seed(2022)

In [None]:
# Variable definitions
DATA_PATH_RAW = "../input/ubiquant-market-prediction"
DATA_PATH_RAW_CUSTOM = "../input/ubiquant-raw"
BASE_COLS = ['row_id', 'time_id', 'investment_id', 'target']
FEAT_COLS = [f'f_{i}' for i in range(300)]

<a id="data_appearance"></a>
## i. Data Appearance
In this section, basic data appearance is introduced, helping readers understand how the data is structured and what information the data contains.

<a id="large_dataset"></a>
### *Large Dataset*
The size of raw data is large (about 18.5GB); hence, there's a need to convert and dump the data for better storage and IO efficiency.
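
To give a feel for why the conversion helps, here is a small sketch on a toy DataFrame. The row count, the column subset and the ID ranges are made up for illustration and are not taken from the competition data. Casting the ID columns to `uint16` and the remaining columns to `float32` cuts the in-memory footprint by more than half.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 1_000  # hypothetical size, far smaller than the real data

# NumPy/pandas default to 64-bit integers and floats here
toy = pd.DataFrame({
    "time_id": rng.integers(0, 1_000, size=n_rows),
    "investment_id": rng.integers(0, 3_000, size=n_rows),
    "target": rng.normal(size=n_rows),
    "f_0": rng.normal(size=n_rows),
    "f_1": rng.normal(size=n_rows),
})

# Same idea as the conversion cell: small unsigned ints for IDs, float32 elsewhere
dtype_map = {"time_id": "uint16", "investment_id": "uint16",
             "target": "float32", "f_0": "float32", "f_1": "float32"}
toy_light = toy.astype(dtype_map)

mem_before = toy.memory_usage(index=False).sum()
mem_after = toy_light.memory_usage(index=False).sum()
print(f"{mem_before} bytes -> {mem_after} bytes")
```

On the toy frame this is 8 bytes per value before and 2 or 4 bytes per value after; the real saving depends on the actual column mix.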

In [None]:
# dtype_map = {
#     'row_id': 'str',
#     'time_id': 'uint16',
#     'investment_id': 'uint16',
#     'target': 'float32',
# }
# for col in FEAT_COLS:
#     dtype_map[col] = 'float32'
# df = pd.read_csv(os.path.join(DATA_PATH_RAW, 'train.csv'), dtype=dtype_map)
# df.to_parquet(os.path.join(DATA_PATH_RAW, 'train_light.parquet'), index=False)

<a id="data_file_description"></a>
### *Data File Description* 
#### `train.csv`
`train.csv` contains the information needed to build a model: anonymized features derived from real market data, the ground truth (*i.e.*, the obfuscated metric to predict) and the corresponding primary key (*i.e.*, the (`time_id`, `investment_id`) pair). Because of the large data size, `train.csv` has been converted and dumped to `.parquet` format using the code snippet above.

In [None]:
df = pd.read_parquet(os.path.join(DATA_PATH_RAW_CUSTOM, 'train_light.parquet'))
df.head()

#### `example_test.csv`
`example_test.csv` is the random data provided to demonstrate what shape and format of data the API will deliver to your notebook when you submit. That is, the file gives us a sense about what raw input data will be fed into the well-trained model in submission stage.

In [None]:
df_test = pd.read_csv(os.path.join(DATA_PATH_RAW, 'example_test.csv'))
df_test.head()

#### `example_sample_submission.csv`
`example_sample_submission.csv` is an example submission file provided so that the publicly accessible copy of the API provides the correct data shape and format. Note that the submission file contains no `investment_id` column.

In [None]:
df_sub = pd.read_csv(os.path.join(DATA_PATH_RAW, 'example_sample_submission.csv'))
df_sub.head()

In [None]:
del df_test, df_sub
gc.collect()

<a id="based_description_of_train"></a>
### *Basic Description of `train.csv`*
#### Data Shape
Detailed descriptions of the columns, taken from the [data page](https://www.kaggle.com/c/ubiquant-market-prediction/data), are as follows:
* `row_id` - A unique identifier for the row.
* `time_id` - The ID code for the time the data was gathered. The time IDs are in order, but the real time between the time IDs is not constant and will likely be shorter for the final private test set than in the training set.
* `investment_id` - The ID code for an investment. Not all investments have data in all time IDs.
* `target` - The target.
* `[f_0:f_299]` - Anonymized features generated from market data.

In [None]:
print(f"Shape of train.csv is {df.shape}")

#### Concise Summary
After conversion, raw data becomes more memory-efficient.

In [None]:
df.info()

#### Descriptive Statistics
Observing stats of DataFrame, we can find that most of the features have mean value close to **zero** and std close to **one**.

In [None]:
df.describe()

#### Missing Values
There are no missing values in the raw DataFrame.

In [None]:
assert not df.isna().any().any()

<a id="primary_key"></a>
## ii. Primary Key
The primary key of this DataFrame is (`time_id`, `investment_id`) pair, which is presented as `row_id`. Because there's a supplementary statement related to `time_id` and `investment_id` in [data page](https://www.kaggle.com/c/ubiquant-market-prediction/data), I decide to explore the primary key.
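
As a quick sanity check of that claim, the sketch below verifies on a toy frame that the (`time_id`, `investment_id`) pair is unique and that it reproduces `row_id`. The `"<time_id>_<investment_id>"` format is an assumption about how `row_id` is built; it should be confirmed against `df.head()` before relying on it.

```python
import pandas as pd

# Toy rows; the row_id format "<time_id>_<investment_id>" is assumed, not confirmed
toy = pd.DataFrame({
    "row_id": ["0_1", "0_2", "1_1", "1_6"],
    "time_id": [0, 0, 1, 1],
    "investment_id": [1, 2, 1, 6],
})

# A primary key must not repeat
has_dupes = toy.duplicated(subset=["time_id", "investment_id"]).any()

# Rebuild row_id from the pair and compare it with the stored column
rebuilt = toy["time_id"].astype(str) + "_" + toy["investment_id"].astype(str)
matches = (rebuilt == toy["row_id"]).all()

print(f"duplicated keys: {has_dupes}, row_id matches pair: {matches}")
```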

<a id="time_id"></a>
### *`time_id`*
In addition to the inconsistency of time gaps between `time_id`s, there are also some skipped `time_id`s.

In [None]:
time_id_min = df['time_id'].min()
time_id_max = df['time_id'].max()
time_ids_range = [_ for _ in range(time_id_min, time_id_max)]
time_ids_skip = set(time_ids_range).difference(set(df['time_id'].unique()))

print(f"There are {df['time_id'].nunique()} unique time_ids.")
print(f"The minimum time_id is {time_id_min}, "
      f"and the maximum time_id is {time_id_max}.")
print(f"Skipped time_ids are {sorted(time_ids_skip)}")

<a id="inv_id"></a>
### *`investment_id`*
The same check is applied to `investment_id`: the cell below lists the IDs within the observed range that never appear in the data.

In [None]:
inv_id_min = df['investment_id'].min()
inv_id_max = df['investment_id'].max()
inv_ids_range = [_ for _ in range(inv_id_min, inv_id_max)]
inv_ids_skip = set(inv_ids_range).difference(set(df['investment_id'].unique()))

print(f"There are {df['investment_id'].nunique()} unique investment_ids.")
print(f"The minimum investment_id is {inv_id_min}, "
      f"and the maximum investment_id is {inv_id_max}.")
print(f"Skipped investment_ids are {sorted(inv_ids_skip)} "
      f"({len(inv_ids_skip)} in total)")

<a id="rel_btw_time_id_inv_id"></a>
### *Relationship between `time_id` and `investment_id`*
After basic introduction of `time_id` and `investment_id`, following is the exploration of relationship between them.

<a id="n_samples_time_id"></a>
#### Number of Samples in Each `time_id`
Because not all investments have data in all `time_id`s, number of samples (*i.e.*, number of investments) in each `time_id` is different from others. 

In [None]:
n_samples_time_id = df.groupby('time_id').agg('size')
assert n_samples_time_id.equals(df.groupby('time_id')['investment_id'].nunique())

fig, ax = plt.subplots(figsize=(14, 7))
ax.bar(n_samples_time_id.index, n_samples_time_id.values)
ax.set_title("#Samples in Each time_id")
ax.set_xlabel("time_id")
ax.set_ylabel("#Samples")
plt.show()

In [None]:
n_samples_max_t = n_samples_time_id.max()
n_samples_max_time_id = n_samples_time_id.index[n_samples_time_id.argmax()]
n_samples_min_t = n_samples_time_id.min()
n_samples_min_time_id = n_samples_time_id.index[n_samples_time_id.argmin()]

print(f"Maximum #samples in one time_id is {n_samples_max_t}, "
      f"the corresponding time_id is {n_samples_max_time_id}.")
print(f"Minimum #samples in one time_id is {n_samples_min_t}, "
      f"the corresponding time_id is {n_samples_min_time_id}.")

#### Number of Samples of Each `investment_id`
From the perspective of `investment_id`, we can again verify the fact that not all investments have data in all `time_id`s. Investment with `investment_id` 1415 only has data recorded in **2** `time_id`s (*i.e.*, 1216 and 1217), which will make it hard to capture the **temporal pattern**.
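
One way to surface such sparsely observed investments without reading them off a bar chart is to aggregate the observation count together with the first and last `time_id` per investment. The toy values below are hypothetical; only the pattern (an investment with two consecutive observations) mirrors the case described above.

```python
import pandas as pd

toy = pd.DataFrame({
    "time_id":       [0, 1, 2, 3, 2, 3, 0, 1, 3],
    "investment_id": [7, 7, 7, 7, 9, 9, 4, 4, 4],
})

# Observation count and covered time span per investment, sparsest first
coverage = (
    toy.groupby("investment_id")["time_id"]
       .agg(n_obs="size", first="min", last="max")
       .sort_values("n_obs")
)
print(coverage)

# Investment with the fewest observations
sparsest_id = coverage.index[0]
```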

In [None]:
n_samples_inv_id = df.groupby('investment_id').agg('size')
assert n_samples_inv_id.equals(df.groupby('investment_id')['time_id'].nunique())

fig, ax = plt.subplots(figsize=(14, 7))
ax.bar(n_samples_inv_id.index, n_samples_inv_id.values)
ax.set_title("#Samples of Each investment_id")
ax.set_xlabel("investment_id")
ax.set_ylabel("#Samples")
plt.show()

In [None]:
n_samples_max_i = n_samples_inv_id.max()
n_samples_max_inv_id = n_samples_inv_id.index[n_samples_inv_id.argmax()]
n_samples_min_i = n_samples_inv_id.min()
n_samples_min_inv_id = n_samples_inv_id.index[n_samples_inv_id.argmin()]

print(f"Maximum #samples of one investment_id is {n_samples_max_i}, "
      f"the corresponding investment_id is {n_samples_max_inv_id}.")
print(f"Minimum #samples of one investment_id is {n_samples_min_i}, "
      f"the corresponding investment_id is {n_samples_min_inv_id}.")

<a id="target_exploration"></a>
## iii. Target Exploration
The predicting target of this competition is the **return rate** of investments, which is relevant for making trading decisions.

<a id="return_rate_dist"></a>
### *Predicting Target Distribution*
#### Complete Target Series
The predicting target looks roughly **normally distributed**, with no apparent outliers or heavy tails. Hence, no transformation of the target seems necessary in this case.
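
A visual impression of normality can be complemented with moment statistics. The sketch below runs on synthetic data, not on the competition target: it compares sample skewness and excess kurtosis for a Gaussian sample and a heavy-tailed Student-t sample. Both histograms would look bell-shaped, but the kurtosis separates them. Calling `df['target'].skew()` and `df['target'].kurt()` would be the analogous check here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2022)
n = 100_000

gauss = pd.Series(rng.normal(size=n))
heavy = pd.Series(rng.standard_t(df=5, size=n))  # heavier tails than a Gaussian

# pandas reports excess kurtosis, so a Gaussian sample scores close to 0
stats = pd.DataFrame({
    "skew": [gauss.skew(), heavy.skew()],
    "excess_kurtosis": [gauss.kurt(), heavy.kurt()],
}, index=["gaussian", "student_t_5"])
print(stats.round(3))
```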

In [None]:
fig, ax = plt.subplots(figsize=(14, 7))
ax.hist(df['target'], bins=2000)
ax.set_title("Distribution of Predicting Target")
ax.set_xlabel("Return Rate (Binned)")
ax.set_ylabel("Value Count")
plt.show()

#### Randomly Sampled `time_id`s
The **dispersions** of return rate distributions in different `time_id`s show diversity.

In [None]:
time_ids_rand = sample(list(df['time_id'].unique()), k=16)

fig, axs = plt.subplots(nrows=4, ncols=4, figsize=(14, 14))
for i, time_id in enumerate(time_ids_rand):
    target_ = df[df['time_id'] == time_id]['target']
    mean, std = target_.mean(), target_.std()
    axs[i//4, i%4].hist(target_, bins=250)
    axs[i//4, i%4].set_title(f"time_id={time_id}, mean={mean:.2f}, std={std:.2f}")
    axs[i//4, i%4].set_xlabel("Return Rate (Binned)")    
    axs[i//4, i%4].set_ylabel("Value Count")
    del target_
plt.tight_layout()

#### Randomly Sampled `investment_id`s
The **dispersions** of return rate distributions of different `investment_id`s show significant diversity. A possible reason is that each investment has its own characteristics (*e.g.*, larger trading volume, greater price fluctuation). Also, we should again note that not all investments have data in all `time_id`s.

In [None]:
inv_ids_rand = sample(list(df['investment_id'].unique()), k=16)

fig, axs = plt.subplots(nrows=4, ncols=4, figsize=(14, 14))
for i, inv_id in enumerate(inv_ids_rand):
    target_ = df[df['investment_id'] == inv_id]['target']
    mean, std = target_.mean(), target_.std()
    axs[i//4, i%4].hist(target_, bins=250)
    axs[i//4, i%4].set_title(f"inv_id={inv_id}, mean={mean:.2f}, std={std:.2f}")
    axs[i//4, i%4].set_xlabel("Return Rate (Binned)")    
    axs[i//4, i%4].set_ylabel("Value Count")
    del target_
plt.tight_layout()

<a id="return_rate_series"></a>
### *Predicting Target Series*
Given the sequential nature of `time_id`s, I explore the return rate from a **time-series** perspective. We can see that some investments have **greater return rate fluctuation** than others, so **temporal pattern extraction** might be important. In addition, some **synchronous movement** in the return rate series of different investments shows the potential of modeling the **spatial dependency** among different investments.

In [None]:
time_ids_complete = list(range(df['time_id'].min(), df['time_id'].max() + 1))   # +1 keeps the last time_id
inv_ids_rand = sample(list(df['investment_id'].unique()), k=3)

fig = go.Figure()
for i, inv_id in tqdm(enumerate(inv_ids_rand)):
    target_ = df[df['investment_id'] == inv_id].loc[:, ['time_id', 'target']]
    target_.sort_values('time_id', inplace=True)
    target_.set_index('time_id', drop=True, inplace=True)
    target_ = target_.reindex(index=time_ids_complete)
    
    fig.add_trace(go.Scatter(x=time_ids_complete, y=target_['target'], 
                             mode='lines+markers', name=f'inv_{inv_id}'))
    del target_
    
fig.update_layout(
    title="Target Series of Different Investments",
    xaxis_title="time_id",
    yaxis_title="Return Rate",
    legend_title="investment_id",
)
fig.show()

<a id="return_rate_map"></a>
### *2D Predicting Target Map*
Return rate series can be converted to a 2D target map using `time_id` and `investment_id` as the two axes. Because not all investments have data in all `time_id`s, I simply leave `NaN`s for entries with no data recorded.
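
A minimal illustration of that conversion on a toy frame (IDs and values are invented): `pivot` places each `investment_id` on a row and each `time_id` on a column, and any pair absent from the long table becomes `NaN`.

```python
import pandas as pd

toy = pd.DataFrame({
    "time_id":       [0, 0, 1, 2, 2],
    "investment_id": [1, 2, 1, 1, 2],
    "target":        [0.10, -0.20, 0.05, 0.30, 0.00],
})

# Investment 2 has no row at time_id 1, so that cell is left as NaN
toy_map = toy.pivot(index="investment_id", columns="time_id", values="target")
print(toy_map)
```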

In [None]:
target_map = df.pivot(index='investment_id', columns='time_id', values='target')
target_map.head(2)

#### Cumulative Return Rate
Cumulative return rates of randomly selected investments show significant diversity. Could I conclude that the return rate follows a **random walk**?
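
One hedged way to probe the question: a cumulative sum of independent returns is a random walk by construction, so the informative check is whether the per-period returns themselves are autocorrelated. The sketch below uses synthetic returns (not the competition data) and pandas' `Series.autocorr`. Applying the same call to each row of `target_map` would be the corresponding experiment, bearing in mind that gaps between `time_id`s blur the meaning of a lag.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000

# Case 1: independent returns, whose cumulative sum is a random walk
iid_ret = pd.Series(rng.normal(size=n))

# Case 2: AR(1) returns with coefficient 0.5, i.e. partly predictable
noise = rng.normal(size=n)
ar_vals = np.zeros(n)
for t in range(1, n):
    ar_vals[t] = 0.5 * ar_vals[t - 1] + noise[t]
ar_ret = pd.Series(ar_vals)

lag1_iid = iid_ret.autocorr(lag=1)
lag1_ar = ar_ret.autocorr(lag=1)
print(f"lag-1 autocorr: iid={lag1_iid:.3f}, AR(1)={lag1_ar:.3f}")
```

A lag-1 autocorrelation near zero is consistent with a random walk in the cumulative series but does not prove it; higher lags and nonlinear dependence would need separate checks.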

In [None]:
inv_ids_rand = sample(list(target_map.index.unique()), 300)

fig = go.Figure()
for i, inv_id in tqdm(enumerate(inv_ids_rand)):
    target_vec = target_map[target_map.index == inv_id].values[0]
    target_cumsum = np.nancumsum(target_vec)
    
    fig.add_trace(go.Scatter(x=target_map.columns, y=target_cumsum, 
                             mode='lines', name=f'inv_{inv_id}'))
    del target_vec, target_cumsum
    
fig.update_layout(
    title="Cumulative Return Rate of Different Investments",
    xaxis_title="time_id",
    yaxis_title="Cumulative Return Rate",
    legend_title="investment_id",
)
fig.show()

#### Spatial Dependency among Investments
To observe whether there's any **spatial dependency** among investments, I compute correlations between the return rate series of different `investment_id`s. I only select `investment_id`s with sufficient data (*i.e.*, more than 600 `time_id`s). In the end, some highly correlated investments are found!
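
A possible variation on the cell below, sketched on synthetic data: `DataFrame.corr` accepts a `min_periods` argument, which requires a minimum number of overlapping observations per pair, and masking the diagonal explicitly avoids discarding genuine off-diagonal correlations that happen to equal 1. The investment IDs, the threshold of 30 and the series themselves are all made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_time = 50

base = rng.normal(size=n_time)
toy_map = pd.DataFrame({
    10: base,
    11: base + 0.1 * rng.normal(size=n_time),  # moves together with 10
    12: rng.normal(size=n_time),
    13: rng.normal(size=n_time),
}).T                                  # rows: investment_id, columns: time_id
toy_map.iloc[3, :25] = np.nan         # investment 13 misses the first 25 time_ids

# Pairs sharing fewer than 30 time_ids get NaN instead of a noisy estimate
corrs = toy_map.T.corr(min_periods=30).abs()
corrs = corrs.mask(np.eye(len(corrs), dtype=bool))  # hide self-correlations
corrs = corrs.dropna(how="all")                     # drop investments with no valid pair

# Most correlated partner of each remaining investment
best_partner = corrs.idxmax(axis=1)
print(best_partner)
```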

In [None]:
corrs_inv = target_map.T.corr()   # Derive corr of return rates of different invs
corrs_inv = abs(corrs_inv[corrs_inv != 1])   # Take off-diagonal corrs
inv_ids_leg = list(n_samples_inv_id[n_samples_inv_id > 600].index)
corrs_inv = corrs_inv.loc[inv_ids_leg, inv_ids_leg]

corrs_max = {'inv_id1': [], 'inv_id2': [], 'corr': []}
for inv_id, corr_vec in corrs_inv.iterrows():
    corrs_max['inv_id1'].append(inv_id)
    corrs_max['inv_id2'].append(corr_vec.index[corr_vec.argmax()])
    corrs_max['corr'].append(corr_vec.max())
corrs_max = pd.DataFrame.from_dict(corrs_max, orient='columns')
corrs_max.sort_values('corr', ascending=False, inplace=True)
corrs_max.head()

In [None]:
inv_ids_top = [194, 1144, 1121, 1929, 2406, 2669]
target_map_ = target_map.loc[inv_ids_top, :]
sns.pairplot(target_map_.T, corner=True)

In [None]:
fig = go.Figure()
for i, inv_id in tqdm(enumerate(inv_ids_top)):
    target_vec = target_map[target_map.index == inv_id].values[0]
    target_cumsum = np.nancumsum(target_vec)
    
    fig.add_trace(go.Scatter(x=target_map.columns, y=target_cumsum, 
                             mode='lines', name=f'inv_{inv_id}'))
    del target_vec, target_cumsum
    
fig.update_layout(
    title="Cumulative Return Rate of Highly Correlated Investments",
    xaxis_title="time_id",
    yaxis_title="Cumulative Return Rate",
    legend_title="investment_id",
)
fig.show()

<a id="feature_exploration"></a>
## iv. Feature Exploration
Anonymized features can provide rich information boosting the quality of model learning. Though the actual meanings of features are unknown, it's still worth exploring underlying property and hidden pattern of them.

<a id="single_feat_dist"></a>
### *Single Feature Distribution*
#### Histogram
Though there are some similar properties (*e.g.*, zero-mean) among randomly selected features, each feature still has its own characteristics (*e.g.*, [skewness](https://en.wikipedia.org/wiki/Skewness), [multimodality](https://en.wikipedia.org/wiki/Multimodal_distribution)).

In [None]:
feat_nums_rand = sample(range(0, 300), k=16)

fig, axs = plt.subplots(nrows=4, ncols=4, figsize=(14, 14))
for i, feat_num in enumerate(feat_nums_rand):
    feat = df[f'f_{feat_num}']
    mean, std = feat.mean(), feat.std()
    axs[i//4, i%4].hist(feat, bins=250)
    axs[i//4, i%4].set_title(f"f_{feat_num}, mean={mean:.2f}, std={std:.2f}")
    axs[i//4, i%4].set_xlabel("Feature Value (Binned)")    
    axs[i//4, i%4].set_ylabel("Value Count")
    del feat
plt.tight_layout()

#### Boxplot
I use a **3IQR whisker**, instead of the default **1.5IQR whisker**, to observe whether there are any **extreme outliers**. The **mean value** is also marked to help observe whether feature values are **strongly centered**. It's obvious that most features have **extreme outliers**. Moreover, some features have a single data point far away from the others (*e.g.*, the minimum of `f_12`, the maximum of `f_104`). Some features with relatively **low variance** (*e.g.*, `f_124`, `f_170`) are also observed. All these observations motivate me to dig deeper into [**outlier analysis**](#outliers) below.
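
For reference, the whisker rule can be written out numerically. With `whis=3`, points lying more than 3 times the interquartile range (IQR) beyond the first or third quartile are drawn as outliers. A small worked example on made-up values:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q3 = np.percentile(x, [25, 75])                # 3.25 and 7.75
iqr = q3 - q1                                      # 4.5
whis = 3
lower, upper = q1 - whis * iqr, q3 + whis * iqr    # -10.25 and 21.25

# Points outside the fences are the "extreme outliers"
extreme = x[(x < lower) | (x > upper)]
print(extreme)   # [100.]
```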

In [None]:
fig, axs = plt.subplots(nrows=60, ncols=5, figsize=(20, 240))
for i, feat_num in tqdm(enumerate(FEAT_COLS)):
    feat = df[f'{feat_num}']
    mean, std = feat.mean(), feat.std()
    sns.boxplot(x=feat, color='green', saturation=0.4,
                width=0.3, whis=3, ax=axs[i//5, i%5])
    axs[i//5, i%5].axvline(mean, color= 'orange', linestyle='dotted',
                           linewidth=2)
    axs[i//5, i%5].set_title(f"{feat_num}, mean={mean:.2f}, std={std:.2f}")
    axs[i//5, i%5].set_xlabel("Feature Value")    
    del feat
plt.tight_layout()

<a id="low_var_feat"></a>
### *Low-Variance Feature*
Observing boxplots of features illustrated above, there are some features with **relatively low variance** compared with others. `f_124` is the one with the lowest variance and the highest zero ratio (about $47.54\%$).
#### Histogram

In [None]:
feats_low_var = [124, 170, 175, 182, 200, 272]

fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(14, 7))
for i, feat_num in enumerate(feats_low_var):
    feat = df[f'f_{feat_num}']
    mean, std = feat.mean(), feat.std()
    axs[i//3, i%3].hist(feat, bins=250)
    axs[i//3, i%3].set_title(f"f_{feat_num}, mean={mean:.2f}, std={std:.2f}")
    axs[i//3, i%3].set_xlabel("Feature Value (Binned)")    
    axs[i//3, i%3].set_ylabel("Value Count")
    del feat
plt.tight_layout()

#### Clustering Property
Based on the analysis of **low-variance** features, I try to find out whether there's any **time-dependent property**. By checking the feature values of different investments in **every `time_id`**, it's evident that there's a **clustering effect along the time axis**. Though it's hard to conclude what's going on behind the scenes, I think this may be a characteristic similar to [volatility clustering](https://en.wikipedia.org/wiki/Volatility_clustering).

In [None]:
fig, axs = plt.subplots(nrows=122, ncols=10, figsize=(40, 480))
feat_cols = [f'f_{i}' for i in feats_low_var]

for i, t in tqdm(enumerate(df['time_id'].unique())):
    df_ = df[df['time_id'] == t]
    for col in feat_cols:
        axs[i//10, i%10].plot(df_[col], label=col)
    axs[i//10, i%10].axes.xaxis.set_visible(False)
    axs[i//10, i%10].set_title(f"time_id={t}, #inv={len(df_)}")
    if i == 0:
        axs[i//10, i%10].legend()
fig.show()

#### Time-Grouped Dispersion 
The **dynamic dispersion** (measured by std) of feature values along the time axis seems to have some unknown **periodic fluctuation**. However, we can't get the real time gap between `time_id`s. Also, some features show **synchronous movement** of dispersion; the last plot of this subsection is just a manually selected example.
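
To make the binning step concrete, here is a toy version of the time-grouped dispersion (bin edges and values are invented). `pd.cut` assigns each `time_id` to an interval, `include_lowest=True` keeps `time_id` 0 in the first interval, and the standard deviation is then computed per interval.

```python
import pandas as pd

toy = pd.DataFrame({
    "time_id": [0, 1, 2, 3, 4, 5],
    "f_0":     [1.0, 3.0, 2.0, 10.0, 20.0, 30.0],
})

# Two blocks: time_id 0-2 and time_id 3-5
toy["time_block"] = pd.cut(toy["time_id"], bins=[0, 2, 5],
                           right=True, include_lowest=True)

# Sample standard deviation of the feature within each block
block_std = toy.groupby("time_block", observed=True)["f_0"].std()
print(block_std)
```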

In [None]:
bins = [0, 50, 114, 306, 367, 586, 650, 665, 729, 818,
        926, 956, 1013, 1051, 1141, 1145, 1209, 1220]
df['time_block'] = pd.cut(df['time_id'], bins=bins, right=True, 
                          include_lowest=True)

fig, axs = plt.subplots(nrows=60, ncols=5, figsize=(24, 240))
for i, feat in tqdm(enumerate(FEAT_COLS)):
    feat_std = df.groupby('time_block')[feat].std()
    if i == 0:
        xtick_lbs = feat_std.index
        xticks = [_ for _ in range(len(xtick_lbs))]
    axs[i//5, i%5].plot(feat_std)
    axs[i//5, i%5].set_title(f"Time-Grouped Std of {feat}")
    axs[i//5, i%5].set_ylabel("Feature Std")
    axs[i//5, i%5].set_xticks(xticks, xtick_lbs, rotation=90)
    del feat_std
plt.tight_layout()

In [None]:
fig, ax = plt.subplots(figsize=(14, 7))
for i in [0, 1, 2, 5, 6, 8, 9, 296, 297, 298]:
    feat_stds = df.groupby('time_block')[f'f_{i}'].std()
    ax.plot(feat_stds, label=f'f_{i}')
    
ax.set_title("Synchronous Time-Grouped Standard Deviation")
ax.set_xlabel("time_id Interval")
ax.set_ylabel("Standard Deviation")
ax.legend()
ax.set_xticks(xticks, xtick_lbs, rotation=90)
plt.show()

<a id="outliers"></a>
### *Outliers*

<a id="feat_interaction"></a>
### *Feature Interaction*
Aside from single feature distributions, feature dependency is illustrated to show how each feature interacts with other features and predicting target.

In [None]:
df_sub = df.sample(frac=0.2, random_state=2022)   # Seed pandas' sampler; random.seed doesn't affect it
cols = FEAT_COLS + ['target']

In [None]:
def plot_heatmap(df, depend):
    '''Plot the heatmap based on correlation or mutual information.
    
    Parameters:
        df: pd.DataFrame, measure of mutual dependence between each 
            feature pair
        depend: str, name of dependence, the choices are as follows:
            {'Corr', 'NMI'}
            
    Return:
        None
    '''
    fig = go.Figure()
    fig.add_trace(go.Heatmap(x=df.index, 
                             y=df.columns, 
                             z=df.T, 
                             colorbar=dict(title=depend), 
                             hoverinfo='text',
                             text=get_hover_text(df)))
    fig.update_layout(
        title=f"{depend} of Feature Pairs",
        xaxis_title="Feature1 (Including target)",
        yaxis_title="Feature2 (Including target)",
        width=800, height=800
    )
    fig_widget = go.FigureWidget(fig)
    fig_widget.show()

def get_hover_text(df):
    '''Return text list storing hover information for interactive 
    heatmap.
    
    Parameters:
        df: pd.DataFrame, measure of mutual dependence between each 
            feature pair
        
    Return:
        hover_txt: list, hover information displayed when hovering 
                   grids in the heatmap
    '''
    hover_txt = []
    for yi, feat_y in enumerate(df.columns):
        hover_txt.append([])
        for xi, feat_x in enumerate(df.index):
            hover_txt[-1].append(f"Feature1: {feat_x}<br />Feature2: {feat_y} "
                                 f"<br />Score: {df[feat_y][feat_x]}")
    return hover_txt

#### Correlation of Feature Pairs
The correlation is calculated using randomly selected subset of samples to reduce computational cost. From interactive heatmap, we can find that there are some **highly correlated** feature pairs (*e.g.*, (`f_148`, `f_205`)). 
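
Besides hovering over the heatmap, strongly correlated pairs can be listed programmatically. The sketch below ranks the upper-triangle entries of a correlation matrix by absolute value. It runs on synthetic features (the names `f_a`, `f_b`, `f_c` are placeholders), and the same steps could be applied to the correlation matrix of the real features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
f_a = rng.normal(size=500)
toy = pd.DataFrame({
    "f_a": f_a,
    "f_b": f_a + 0.1 * rng.normal(size=500),   # nearly a copy of f_a
    "f_c": rng.normal(size=500),
})
toy_corr = toy.corr()

# Upper triangle (k=1 skips the diagonal) so each pair appears once
rows, cols_ = np.triu_indices(len(toy_corr), k=1)
pairs = pd.DataFrame({
    "feat_1": toy_corr.index[rows],
    "feat_2": toy_corr.columns[cols_],
    "abs_corr": np.abs(toy_corr.to_numpy()[rows, cols_]),
}).sort_values("abs_corr", ascending=False).reset_index(drop=True)
print(pairs)
```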

In [None]:
corrs = df_sub[cols].corr()
plot_heatmap(corrs, depend='Corr')

#### Correlation with Predicting Target
All anonymized features have low correlations with the predicting target. The maximum absolute correlation is approximately **0.06**.

In [None]:
corrs_with_target = corrs.loc[:, 'target']
corrs_with_target = corrs_with_target[corrs_with_target.index != 'target']
corrs_with_target = abs(corrs_with_target)

fig, ax = plt.subplots(figsize=(14, 7))
ax.hist(corrs_with_target, bins=25)
ax.set_title("Absolute Correlation of Anonymized Features and Target")
ax.set_xlabel("Absolute Correlation (Binned)")
ax.set_ylabel("Value Count")
plt.show()
print(f"Maximum absolute correlation with target is {corrs_with_target.max()}, "
      f"and the corresponding feature is {corrs_with_target.idxmax()}")   # idxmax gives the name

#### Correlations with Predicting Target in Different Time Blocks
Correlations of anonymized features with return rate are different in **discretized time blocks**, which motivates me to explore correlations in fine-grained time scale [below](#corrs_time_id).

In [None]:
df_sub['time_block'] = pd.cut(df['time_id'], bins=25)
time_gps = df_sub.groupby('time_block')

fig, axs = plt.subplots(nrows=5, ncols=5, figsize=(16, 16))
for i, (t, gp) in tqdm(enumerate(time_gps)):
    corrs_gp = gp[cols].corr().loc[:, 'target']
    corrs_gp = corrs_gp[corrs_gp.index != 'target']
    corrs_gp = abs(corrs_gp)

    axs[i//5, i%5].hist(corrs_gp, bins=25)
    axs[i//5, i%5].set_title(f"Time Block {t}")
    axs[i//5, i%5].set_xlabel("Absolute Correlation (Binned)")
    axs[i//5, i%5].set_ylabel("Value Count")
    del corrs_gp
plt.tight_layout()

<a id="corrs_time_id"></a>
#### Correlations with Predicting Target in First 25 `time_id`s
With fine-grained time scale (*i.e.*, `time_id`), correlations of features with return rate become more diversified among different `time_id`s, showing that anonymized features **might** have dynamic effect on return rate in different time intervals.
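
The per-`time_id` correlations can also be collected into a single table instead of one histogram per panel. A minimal sketch on a toy frame (all values invented) using `DataFrame.corrwith`, where the sign of the relationship between `f_0` and the target flips from one `time_id` to the next:

```python
import pandas as pd

toy = pd.DataFrame({
    "time_id": [0, 0, 0, 0, 1, 1, 1, 1],
    "f_0":     [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "f_1":     [2.0, 1.0, 4.0, 3.0, 1.0, 3.0, 2.0, 4.0],
    "target":  [0.1, 0.2, 0.3, 0.4, 0.4, 0.3, 0.2, 0.1],
})
feats = ["f_0", "f_1"]

# One row per time_id, one column per feature
corr_by_time = pd.DataFrame({
    t: g[feats].corrwith(g["target"]) for t, g in toy.groupby("time_id")
}).T
corr_by_time.index.name = "time_id"
print(corr_by_time.round(2))
```

A table like this makes it easy to track a single feature's correlation over time, for example by plotting one column against `time_id`.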

In [None]:
fig, axs = plt.subplots(nrows=5, ncols=5, figsize=(16, 16))
for i, time_id in tqdm(enumerate(range(25))):
    df_time_id = df[df['time_id'] == time_id]
    corrs_time_id = df_time_id[cols].corr().loc[:, 'target']
    corrs_time_id = corrs_time_id[corrs_time_id.index != 'target']
    corrs_time_id = abs(corrs_time_id)

    axs[i//5, i%5].hist(corrs_time_id, bins=25)
    axs[i//5, i%5].set_title(f"time_id={time_id}")
    axs[i//5, i%5].set_xlabel("Absolute Correlation (Binned)")
    axs[i//5, i%5].set_ylabel("Value Count")
    del df_time_id, corrs_time_id
plt.tight_layout()

#### Correlations with Predicting Target of First 25 `investment_id`s
In addition to `time_id`, anonymized features of different investments also have different correlations with return rate, again showing that each investment has its own characteristic.

In [None]:
inv_ids = sorted(df['investment_id'].unique())[:25]

fig, axs = plt.subplots(nrows=5, ncols=5, figsize=(16, 16))
for i, inv_id in tqdm(enumerate(inv_ids)):
    df_inv_id = df[df['investment_id'] == inv_id]
    corrs_inv_id = df_inv_id[cols].corr().loc[:, 'target']
    corrs_inv_id = corrs_inv_id[corrs_inv_id.index != 'target']
    corrs_inv_id = abs(corrs_inv_id)

    axs[i//5, i%5].hist(corrs_inv_id, bins=25)
    axs[i//5, i%5].set_title(f"inv_id={inv_id}")
    axs[i//5, i%5].set_xlabel("Absolute Correlation (Binned)")
    axs[i//5, i%5].set_ylabel("Value Count")
    del df_inv_id, corrs_inv_id
plt.tight_layout()

#### Correlations with Predicting Target at `time_id` 483 and 1205
Based on the analysis in this [section](#n_samples_time_id), `time_id` 1205 has the **largest** number of investments and 483 the **smallest**. Comparing the two heatmaps, we can easily observe the difference in correlations between the two `time_id`s. Also, the occurrence of `NaN`s in the second heatmap inspires me to analyze the **zero ratio** of each feature below. Finally, we find that a feature with a higher **zero ratio** might have a higher chance of being a **zero vector** in some `time_id`s like 483, which might contain valuable information to utilize.
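
The link between zero vectors and `NaN` correlations follows from the definition of Pearson correlation, which divides by each variable's standard deviation. A column that is constant within a `time_id` has zero standard deviation, so the ratio is undefined. A toy demonstration (values invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "f_zero": [0.0, 0.0, 0.0, 0.0],     # constant, e.g. all zeros in one time_id
    "f_ok":   [0.5, -1.0, 0.3, 1.2],
    "target": [0.1, -0.2, 0.0, 0.3],
})

toy_corr = toy.corr()
print(toy_corr)

# Columns whose correlation with every column is NaN
nan_cols = toy_corr.columns[toy_corr.isna().all()].tolist()
print(nan_cols)
```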

In [None]:
corrs_t1205 = df[df['time_id'] == 1205][cols].corr()
plot_heatmap(corrs_t1205, 'Corr (time_id=1205)')

In [None]:
corrs_t483 = df[df['time_id'] == 483][cols].corr()
plot_heatmap(corrs_t483, 'Corr (time_id=483)')

In [None]:
zero_ratios = {}
n_samples = df.shape[0]

for col in cols:
    zero_ratios[col] = (df[col] == 0).sum() / n_samples
zero_ratios = pd.DataFrame.from_dict(zero_ratios, 
                                     orient='index', 
                                     columns=['zero_ratio'])
zero_ratios.sort_values('zero_ratio', ascending=False, inplace=True)
zero_ratios.T

#### Mutual Information with Predicting Target
[Mutual information](https://en.wikipedia.org/wiki/Mutual_information) measures the **mutual dependence** of two variables and, unlike **correlation**, isn't limited to **linear dependence**. We can see that most of the anonymized features have high NMI with the return rate. However, I still have no idea what's going on, and I'll try to figure it out!
<div class="alert alert-block alert-danger">
    <p>Another thing to mention is that the computational cost of calculating NMI is relatively high, so I don't implement <strong>pairwise</strong> NMI of all feature pairs.</p>
</div>
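
A possible explanation, offered as a sketch rather than a confirmed diagnosis: `normalized_mutual_info_score` is a clustering metric that treats every distinct value as a class label, so two continuous columns whose values are (almost) all unique score close to 1 even when they are independent. `mutual_info_regression`, imported above as `mir`, is designed for continuous variables. The synthetic example below contrasts the two on independent noise.

```python
import warnings

import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.metrics.cluster import normalized_mutual_info_score

rng = np.random.default_rng(0)
x = rng.normal(size=2_000)
y = rng.normal(size=2_000)          # independent of x by construction

# Every float is its own "cluster", so the score is ~1 despite independence
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # newer scikit-learn versions warn about continuous labels
    nmi = normalized_mutual_info_score(x, y)

# k-NN based estimator intended for continuous data; near 0 for independent inputs
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"NMI (clustering metric): {nmi:.3f}, MI (regression estimator): {mi:.3f}")
```

If the same effect is at work here, the NMI histogram mostly reflects how many distinct values each feature takes rather than its dependence on the target.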

In [None]:
nmis = {}
for f in tqdm(FEAT_COLS):
    nmis[f] = normalized_mutual_info_score(df_sub['target'], df_sub[f])
nmis = pd.Series(nmis)

fig, ax = plt.subplots(figsize=(14, 7))
ax.hist(nmis, bins=25)
ax.set_title("NMI between Anonymized Features and Target")
ax.set_xlabel("Normalized Mutual Information (Binned)")
ax.set_ylabel("Value Count")
plt.show()
print(f"Maximum NMI with target is {nmis.max()}, "
      f"and the corresponding feature is {nmis.idxmax()}")   # idxmax gives the feature name

<div class="alert alert-blocks alert-info" style="text-align: center">
    <h3>Work in Progress...</h3>
    <h3>Thanks for Your Attention!</h3>
</div>