## References and source of analysis:

https://www.kaggle.com/lucamassaron/eda-target-analysis

https://www.kaggle.com/datafan07/ubiquant-market-prediction-what-do-we-have-here

https://www.kaggle.com/gunesevitan/ubiquant-market-prediction-eda

https://www.kaggle.com/code/valleyzw/ubiquant-lgbm-baseline/notebook

https://www.kaggle.com/code/junjitakeshima/ubiquant-simple-lgbm-removing-outliers-en-jp

In [None]:
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats  # the stats submodule is used below as scipy.stats

Data has been preprocessed and converted to parquet in another notebook to avoid running out of memory

In [None]:
df = pd.read_parquet('/kaggle/input/ubiquant-dataset-compressed/ubiquant_dataset_compressed.parquet')

In [None]:
pd.options.display.float_format = '{:,.3f}'.format

In [None]:
df.info()

In [None]:
df

From the description: "Your challenge is to predict the value of an obfuscated metric (target) relevant for making trading decisions." "Submissions are evaluated on the mean of the Pearson correlation coefficient for each time ID."
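
A minimal sketch of that evaluation metric on toy data. The `pred` column and the function name `mean_pearson_by_time` are hypothetical, introduced only for illustration; the competition's own scoring code is not shown in this notebook.

```python
import pandas as pd

def mean_pearson_by_time(frame, pred_col="pred", target_col="target"):
    """Mean over time_id of the Pearson correlation between predictions and target."""
    per_time = frame.groupby("time_id")[[pred_col, target_col]].apply(
        lambda g: g[pred_col].corr(g[target_col])
    )
    return per_time.mean()

# Toy data: predictions perfectly aligned at time 0, perfectly reversed at time 1
toy = pd.DataFrame({
    "time_id": [0, 0, 0, 1, 1, 1],
    "target":  [0.1, 0.2, 0.3, 0.3, 0.2, 0.1],
    "pred":    [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
})
score = mean_pearson_by_time(toy)
print(score)
```

Because the correlation is computed inside each time_id, only the relative ordering and spread of the predictions within a time step matter, not their overall level.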

About the data, some important things:

* time_id - The ID code for the time the data was gathered. **The time IDs are in order, but the real time between the time IDs is not constant** and will likely be shorter for the final private test set than in the training set.
* investment_id - The ID code for an investment. Not all investment have data in all time IDs.
* [f_0:f_299] - Anonymized features generated from market data.

*row_id* is the concatenation of time_id + investment_id

There are 3141410 samples

Additional comments from the Q&A to consider:

* The mapping between an investment_id and the underlying investment is fixed, but the investment_ids that appear in the train data, the public leaderboard, and the private leaderboard are not the same: some appear only in the train data, some only in the public leaderboard, and some only in the private leaderboard.

A quick check to make sure there are no null values

In [None]:
df.isna().sum().sum()

# Target analysis

In [None]:
pd.DataFrame(df.target.describe())

## Target distribution

In [None]:
plt.figure(figsize=(15,8))
axes = sns.kdeplot(df.target, label='target', fill=True)
axes.axvline(df.target.mean(), label='Mean', color='r', linewidth=2, linestyle='--')
axes.axvline(df.target.median(), label='Median', color='b', linewidth=2, linestyle='--')
axes.legend(prop={'size': 15})
print(f'Skew: {df.target.skew():.4f}  -  Kurtosis: {df.target.kurtosis():.4f}')

The data is right-skewed (the right tail is long relative to the left tail), and the excess kurtosis is around 6, whereas for a normal distribution it would be 0.
The mean is close to 0, which suggests the target has been standardized.

Comparison between a normal distribution and the actual distribution

In [None]:
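plt.figure(figsize=(10, 7))
# sns.distplot is deprecated; use histplot with a KDE and overlay a fitted normal pdf
ax = sns.histplot(df.target, stat='density', bins=50, kde=True, label='Actual distrib')
mu, sigma = scipy.stats.norm.fit(df.target)
xs = np.linspace(df.target.min(), df.target.max(), 500)
ax.plot(xs, scipy.stats.norm.pdf(xs, mu, sigma), color='k', label='Normal')
ax.legend();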

Inspecting tails of the distribution with a QQ plot

In [None]:
fig = plt.figure()
ax = fig.add_subplot()
scipy.stats.probplot(df.target, plot=ax);

These deviations are more extreme than those of typical stock returns, which suggests that the obfuscated target is not simply a daily return. In addition, the competition gives no explanation of what a time_id represents.

**NOTE**: an important update has arisen in the last few days: some participants have discovered the relation between the target metric and some of the corresponding stock market listings. Check: https://www.kaggle.com/competitions/ubiquant-market-prediction/discussion/315131

Quoting: 
> There are a total of 1220 time_ids in the data set, and from 2014 to 2018, A shares have 245, 244, 244, 244, 243 trading days respectively, for a total of 1220 trading days. These two numbers coincide exactly, so we conjecture that the dataset covers transaction data from 2014 to 2018. (Or refer to someone else's [sorry i forgot which answer] answer and match it with the volatility during the 2015 stock market crash)

> Note that Target is obviously not the original rate of return. The organizer has processed the rate of return so that the standard deviation of the target in each time_id is roughly the same, so we cannot directly match it with stock prices. However, no matter what changes Target has made, the original higher rate of return is still higher after the transformation, and similarly, the original lower rate of return is still lower after the transformation; therefore, we only need to calculate the ranking percentile corresponding to the return rate of each stock in each trading day and compare it with the corresponding ranking percentile in each time_id to find the corresponding stock.

> For each investment_id, we compare it with each A-share stock and select the stock with the highest correlation coefficient.
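
The rank-percentile idea in the quote can be illustrated on toy data. This is only a sketch, not the procedure used in the linked discussion: the `raw_return` column and the per-time_id standardization used as a stand-in for the obfuscation are both assumptions made for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "time_id":       [0, 0, 0, 1, 1, 1],
    "investment_id": [1, 2, 3, 1, 2, 3],
    "raw_return":    [0.01, -0.02, 0.03, 0.10, 0.05, -0.20],
})

# Hypothetical obfuscation: standardize returns within each time_id
grp = toy.groupby("time_id")["raw_return"]
toy["target"] = (toy["raw_return"] - grp.transform("mean")) / grp.transform("std")

# Rank percentile of each investment within its time_id, before and after
toy["return_pct"] = toy.groupby("time_id")["raw_return"].rank(pct=True)
toy["target_pct"] = toy.groupby("time_id")["target"].rank(pct=True)

# A monotone increasing transform leaves the within-day ranking unchanged
ranks_match = (toy["return_pct"] == toy["target_pct"]).all()
print(ranks_match)
```

Since the percentiles survive the transformation, a per-investment series of `target_pct` could in principle be correlated against the percentile series of candidate stocks, which is the matching step the quote describes.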

# Assets and target analysis

In [None]:
time_steps = df.time_id.nunique()
assets = df.investment_id.nunique()
print(f"number of assets: {assets}, time steps: {time_steps}")

Note that time_id values span a range of 1220 IDs (0 to 1219), but there are only 1211 unique time steps. Which time IDs are missing?

In [None]:
set(df.time_id.unique()).symmetric_difference(range(0, 1220))

Discontinuities of assets samples by time

In [None]:
df[['investment_id', 'time_id']].plot.scatter('time_id', 'investment_id', figsize=(20, 25), s=0.5)
plt.show()

Most of the discontinuities fall between time steps 368 and 372. There are fewer discontinuities and missing samples in the second half of the period.

In [None]:
trading_day = 368 - 245  # trading days elapsed in 2015 (2014 had 245 trading days)
trading_day += 11 # holidays
trading_day *= 7/5 # consider weekends
trading_day  # approximate calendar day within 2015

On the 8th of July 2015, trading on the Chinese market was halted. The number obtained above (about 188 calendar days into 2015, i.e. early July) roughly matches the moment the halt started.
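
The same back-of-the-envelope conversion can be turned into a calendar date. This is a rough sketch under several assumptions: time_id 0 is taken to be the first trading day of 2014, 2014 is taken to have 245 trading days (from the quoted discussion), and the 11-day holiday allowance is the same crude guess used above. The helper name `approx_calendar_date` is hypothetical.

```python
from datetime import date, timedelta

TRADING_DAYS_2014 = 245  # from the quoted discussion
HOLIDAYS = 11            # rough allowance, an assumption

def approx_calendar_date(time_id, year_start=date(2015, 1, 1)):
    """Rough calendar date for a time_id that falls in 2015."""
    trading_days_into_year = time_id - TRADING_DAYS_2014
    # 5 trading days per 7 calendar days
    calendar_days = (trading_days_into_year + HOLIDAYS) * 7 / 5
    return year_start + timedelta(days=round(calendar_days))

estimated = approx_calendar_date(368)
print(estimated)
```

The estimate lands in early July 2015, but it ignores the missing time IDs found earlier, so it should be read as an approximation only.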

When are the assets added to the timeline?

In [None]:
invest_set = set(df['investment_id'][df.time_id == 0].to_numpy())
invests_by_time = [len(invest_set)]
invests_by_time

In [None]:
for pos, time_id in enumerate(df.time_id.unique()[1:]):
    invests_at_t = set(df['investment_id'][df.time_id == time_id].to_numpy())
    diff = len(invest_set.union(invests_at_t)) - len(invest_set)
#     print(diff, end=' ')
    invest_set = invest_set.union(invests_at_t)
    invests_by_time.append(diff + invests_by_time[pos])

In [None]:
plt.plot(invests_by_time);

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
df.groupby('time_id')['investment_id'].nunique().plot()
plt.title("number of assets for which there are samples on each time step")
plt.show()

Quoting what is said on a notebook: "Assets are distributed in a different way, there are assets that are actually more frequently observed and others that are not. A good cv and modelling strategy should keep this into account (**stratify if you are working with subsamples**)."

How are the targets distributed according to each of the assets?

In [None]:
target_inv_mean_df = df.groupby('investment_id')['target'].mean().reset_index().rename(columns={'target': 'target_mean_for_inv_id'})
plt.figure(figsize=(20,6))
plt.plot(target_inv_mean_df.set_index('investment_id')['target_mean_for_inv_id'], label='target_mean_for_inv_id')
plt.title('Mean target value for each investment id');

In [None]:
mean_target = df.groupby(['investment_id'])['target'].mean()
mean_mean_target = np.mean(mean_target)

fig, ax = plt.subplots(1, 1, figsize=(12, 6))
mean_target.plot.hist(bins=60)
plt.title("mean target distribution")
plt.show()

In [None]:
sts_target = df.groupby(['investment_id'])['target'].std()
mean_std_target = np.mean(sts_target)

fig, ax = plt.subplots(1, 1, figsize=(12, 6))
sts_target.plot.hist(bins=60)
plt.title("standard deviation of target distribution")
plt.show()

print(f"Mean of std target: {mean_std_target: 0.5f}")

The distribution of the per-asset target standard deviation is skewed to the right. For some assets the target std is very close to 0.

By jointly plotting the distribution of observations by asset and the mean target value by asset, we may notice that the mean target decreases slightly as the number of samples grows.

In [None]:
# number of samples per asset
obs_by_asset = df.groupby(['investment_id'])['target'].count()

ax = sns.jointplot(x=obs_by_asset, y=mean_target, kind="reg", 
                   height=8, joint_kws={'line_kws':{'color':'red'}})
ax.ax_joint.set_xlabel('observations')
ax.ax_joint.set_ylabel('mean target')
plt.show()

The dispersion of values tends to grow with fewer observations, so we re-plot the scatterplot, this time using the standard deviation.

In [None]:
ax = sns.jointplot(x=obs_by_asset.values, y=sts_target, kind="reg", 
                   height=8, joint_kws={'line_kws':{'color':'red'}})
ax.ax_joint.set_xlabel('observations')
ax.ax_joint.set_ylabel('std target')
plt.show()

Plot some extreme outlier investments

In [None]:
outlier_threshold=0.001
outliers_df = df[["investment_id", "target"]].groupby("investment_id").target.mean()
upper_bound, lower_bound = outliers_df.quantile([1-outlier_threshold, outlier_threshold])

In [None]:
outlier_investments = outliers_df.loc[(outliers_df>upper_bound)|(outliers_df<lower_bound)|(outliers_df==0)].index
outlier_investments

In [None]:
pd.pivot(
    df.loc[df.investment_id.isin(outlier_investments), ["investment_id", "time_id", "target"]],
    index='time_id', columns='investment_id', values='target'
).plot(figsize=(16,12), subplots=True, sharex=True);

**Takeaway**: the fewer the observations, the higher the uncertainty in the mean target. In *training* you need to control for this effect by *making the number of observations explicit* (for example, as a feature), because it is predictive of the uncertainty of the predictions. In the *test* phase, instead, when you are working with an asset that you don't know about, you need to *impute an average number of observations, thus expecting an average dispersion of predictions* for that asset.
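
One possible way to implement this takeaway, sketched on toy frames. The helper `add_observation_count` and the `n_obs` column are hypothetical names, and using the mean training count for unseen assets is just one reading of "impute an average number of observations".

```python
import pandas as pd

def add_observation_count(train, test, id_col="investment_id"):
    """Add a per-asset observation count; unseen test assets get the training average."""
    counts = train.groupby(id_col).size()
    train = train.assign(n_obs=train[id_col].map(counts))
    test = test.assign(n_obs=test[id_col].map(counts).fillna(counts.mean()))
    return train, test

train_toy = pd.DataFrame({
    "investment_id": [1, 1, 1, 2],
    "target":        [0.1, 0.2, 0.0, -0.3],
})
test_toy = pd.DataFrame({"investment_id": [1, 99]})  # asset 99 never appears in train

train_toy, test_toy = add_observation_count(train_toy, test_toy)
print(test_toy)
```

Whether this feature actually helps would have to be checked in cross-validation.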

How does the target evolve through time?

In [None]:
plt.figure(figsize=(12, 6))

plt.subplot(3, 1, 1,)
(df.groupby('time_id')['investment_id'].nunique()).plot()
plt.title("number of unique assets by time")

plt.subplot(3, 1, 2)
df.groupby('time_id')['target'].mean().plot()
plt.title("average target by time")
plt.axhline(y=mean_mean_target, color='r', linestyle='--', label="mean")
plt.legend(loc='lower left')

plt.subplot(3, 1, 3)
df.groupby('time_id')['target'].std().plot()
plt.title("std of target by time")
plt.axhline(y=mean_std_target, color='r', linestyle='--', label="mean")
plt.legend(loc='lower left')

plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9, 
                    top=1.3, 
                    wspace=0.4, 
                    hspace=0.4)

plt.show()

When there are fewer assets, the target oscillates more and tends to take higher values.

It is suggested that: "The correlation of assets number and target is negative, in fact. I wonder if we are modelling the asset allocation strategies alongside the markets." It may also be that the calculation of the obfuscated target is affected by the number of samples available at a time step.
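
The claimed negative correlation can be checked directly. The sketch below uses toy data and a hypothetical helper, `assets_vs_target_corr`; on the real data the call would be `assets_vs_target_corr(df)`.

```python
import pandas as pd

def assets_vs_target_corr(frame):
    """Pearson correlation between assets per time_id and mean target per time_id."""
    per_time = frame.groupby("time_id").agg(
        n_assets=("investment_id", "nunique"),
        mean_target=("target", "mean"),
    )
    return per_time["n_assets"].corr(per_time["mean_target"])

# Toy data built so that time steps with fewer assets have a higher mean target
toy = pd.DataFrame({
    "time_id":       [0, 0, 0, 1, 1, 2],
    "investment_id": [1, 2, 3, 1, 2, 1],
    "target":        [-0.1, 0.0, 0.1, 0.2, 0.4, 0.9],
})
corr_toy = assets_vs_target_corr(toy)
print(corr_toy)
```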

Target mean and std along the time axis

In [None]:
time2target_mean = df.groupby(['time_id'])['target'].mean()
time2target_std = df.groupby(['time_id'])['target'].std()

_, axes = plt.subplots(1, 1, figsize=(24, 12))
plt.fill_between(
        time2target_mean.index,
        time2target_mean - time2target_std,
        time2target_mean + time2target_std,
        alpha=0.1,
        color="b",
        label='target std'
    )
plt.plot(
        time2target_mean.index, time2target_mean, "o-", color="b", label="target mean"
    )
plt.axhline(y=mean_mean_target, color='r', linestyle='--', label="mean of means")
plt.legend()
axes.set_ylabel("target")
axes.set_xlabel("time")
plt.show()

The target has a mean close to 0 and std close to 1.

Final suggestions:

Basically, this chart is the key. The task of the competition is to find out the position of an asset in a day. Is the asset near the average, or how far away from it is it? (You are predicting volatility, basically.) In fact the evaluation is based on the mean of the Pearson correlation coefficient for each time ID.

In the following chart we superimpose the target for asset 10 (the `asset` variable in the cell below) on the market average and the unit standard deviation band.

Clearly the position of the asset depends on its performance but also on the way the mean and standard deviation for that time_id are calculated (are we analyzing the volatility inside a basket of investments, maybe?).

In [None]:
time2target_mean = df.groupby(['time_id'])['target'].mean()
time2target_std = df.groupby(['time_id'])['target'].std()

_, axes = plt.subplots(1, 1, figsize=(24, 12))
plt.fill_between(
        time2target_mean.index,
        time2target_mean - time2target_std,
        time2target_mean + time2target_std,
        alpha=0.1,
        color="b",
    )
plt.plot(
        time2target_mean.index, time2target_mean, "o-", color="b", label="target mean"
    )
plt.axhline(y=mean_mean_target, color='r', linestyle='--', label="mean of means")

asset = 10
plt.plot(df[df.investment_id==asset].time_id,
               df[df.investment_id==asset].target, '.', label=f"asset {asset}")
plt.legend()

axes.set_ylabel("target")
axes.set_xlabel("time")
plt.show()


**Takeaway**: now your CV strategy should be clear: do a GroupKFold on time_id, keeping all the assets relative to a time_id either in train or in validation.
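
A minimal sketch of such a split using scikit-learn's `GroupKFold` (scikit-learn is not imported elsewhere in this notebook, so it is an added dependency here). The toy frame and the choice of 3 folds are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy data: 6 time steps, 3 assets each
toy = pd.DataFrame({
    "time_id":       np.repeat(np.arange(6), 3),
    "investment_id": np.tile(np.arange(3), 6),
})
toy["target"] = np.linspace(-1, 1, len(toy))

gkf = GroupKFold(n_splits=3)
folds = list(gkf.split(toy, toy["target"], groups=toy["time_id"]))

for fold, (train_idx, valid_idx) in enumerate(folds):
    valid_times = sorted(set(toy["time_id"].iloc[valid_idx]))
    # Every row of a given time_id ends up on the same side of the split
    print(f"fold {fold}: validation time_ids = {valid_times}")
```

With this grouping, the per-time_id Pearson correlation can be computed on each validation fold without any time step being split between train and validation.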

# Features analysis

There are 300 continuous features. (To perform the analysis, a sample is taken so it doesn't take forever)

In [None]:
sampled_df = df.sample(frac=0.05, random_state=42)

In [None]:
sampled_df.drop(columns=['time_id', 'investment_id']).describe().T.style.background_gradient(cmap = 'Blues')\
                           .bar(subset = ["mean",], color = 'lightgreen')\
                           .bar(subset = ["std"], color = '#ee1f5f')\
                           .bar(subset = ["max"], color = '#FFA07A')

Note how the means and standard deviations of the features are centered around 0 and 1, as happens with the target.

Feature descriptions

In [None]:
def visualize_feature(df, column):
    
    print(f'{column}\n{"-" * len(column)}')
    print(f'Mean: {df[column].mean():.4f}  -  Median: {df[column].median():.4f}  -  Std: {df[column].std():.4f}')
    print(f'Min: {df[column].min():.4f}  -  25%: {df[column].quantile(0.25):.4f}  -  50%: {df[column].quantile(0.5):.4f}  -  75%: {df[column].quantile(0.75):.4f}  -  Max: {df[column].max():.4f}')
    print(f'Skew: {df[column].skew():.4f}  -  Kurtosis: {df[column].kurtosis():.4f}')
    missing_count = df[df[column].isnull()].shape[0]
    total_count = df.shape[0]
    print(f'Missing Values: {missing_count}/{total_count} ({missing_count * 100 / total_count:.4f}%)')

    fig, axes = plt.subplots(ncols=2, nrows=3, figsize=(24, 22), dpi=100)

    sns.kdeplot(df[column], label=column, fill=True, ax=axes[0][0])
    axes[0][0].axvline(df[column].mean(), label='Mean', color='r', linewidth=2, linestyle='--')
    axes[0][0].axvline(df[column].median(), label='Median', color='b', linewidth=2, linestyle='--')
    axes[0][0].legend(prop={'size': 15})
    sns.scatterplot(x=df[column], y=df['target'], ax=axes[0][1])
    
    df_feature_means_in_time_ids = df.groupby('time_id')[column].mean().reset_index().rename(columns={column: f'{column}_means_in_time_ids'})
    axes[1][0].plot(df_feature_means_in_time_ids.set_index('time_id')[f'{column}_means_in_time_ids'], label=f'{column}_means_in_time_ids')
    df_feature_stds_in_time_ids = df.groupby('time_id')[column].std().reset_index().rename(columns={column: f'{column}_stds_in_time_ids'})
    axes[1][1].plot(df_feature_stds_in_time_ids.set_index('time_id')[f'{column}_stds_in_time_ids'], label=f'{column}_stds_in_time_ids')
    
    df_feature_means_in_investment_ids = df.groupby('investment_id')[column].mean().reset_index().rename(columns={column: f'{column}_means_in_investment_ids'})
    axes[2][0].plot(df_feature_means_in_investment_ids.set_index('investment_id')[f'{column}_means_in_investment_ids'], label=f'{column}_means_in_investment_ids')
    df_feature_stds_in_investment_ids = df.groupby('investment_id')[column].std().reset_index().rename(columns={column: f'{column}_stds_in_investment_ids'})
    axes[2][1].plot(df_feature_stds_in_investment_ids.set_index('investment_id')[f'{column}_stds_in_investment_ids'], label=f'{column}_stds_in_investment_ids')

    for i in range(3):
        for j in range(2):
            axes[i][j].tick_params(axis='x', labelsize=12.5)
            axes[i][j].tick_params(axis='y', labelsize=12.5)
            axes[i][j].set_ylabel('')
            
    axes[0][0].set_xlabel('')
    axes[0][1].set_xlabel(column, fontsize=12.5)
    axes[0][1].set_ylabel('target', fontsize=12.5)
    
    for i in range(2):
        axes[1][i].set_xlabel('time_id', fontsize=12.5)
        axes[1][i].set_ylabel(column, fontsize=12.5)
        
    for i in range(2):
        axes[2][i].set_xlabel('investment_id', fontsize=12.5)
        axes[2][i].set_ylabel(column, fontsize=12.5)
        
    axes[0][0].set_title(f'{column} Distribution', fontsize=15, pad=12)
    axes[0][1].set_title(f'{column} vs Target', fontsize=15, pad=12)
    axes[1][0].set_title(f'{column} Means as a Function of Time', fontsize=15, pad=12)
    axes[1][1].set_title(f'{column} Stds as a Function of Time', fontsize=15, pad=12)
    axes[2][0].set_title(f'{column} Means as a Function of Investment', fontsize=15, pad=12)
    axes[2][1].set_title(f'{column} Stds as a Function of Investment', fontsize=15, pad=12)
    
    plt.show()

Choose a feature to visualize it

In [None]:
visualize_feature(df, 'f_0')

Features distributions and outliers in them

In [None]:
plt.figure(figsize=(20,15))
for i in range(15) :
    plt.subplot(5, 5, i+1)
    plt.hist(df[f"f_{i}"], bins=100)
    plt.title(f"f_{i}")
plt.show();

It is observable that quite a few of these features have some extreme values. Let's get a list of those (values more than 70 standard deviations away from the mean).

In [None]:
outlier_list = []
outlier_cols = []

feature_cols = [f'f_{i}' for i in range(300)]

for col in feature_cols :
    
    temp_df = df[(df[col] > df[col].mean() + df[col].std() * 70) |
                       (df[col] < df[col].mean() - df[col].std() * 70) ]
    if len(temp_df) >0 :
        outliers = temp_df.index.to_list()
        outlier_list.extend(outliers)
        print(col, len(temp_df))
        outlier_cols.append(col)

outlier_list = list(set(outlier_list))
print(len(outlier_list))

In [None]:
plt.figure(figsize=(20,20))
for i, (col) in enumerate(outlier_cols):
    plt.subplot(5, 5, i+1)
    plt.scatter(df[col], df["target"])
    plt.title(col)
plt.show()

**Takeaways**: 
* Remove the extreme points, as doing so can help improve the performance of the models.
* Should we also normalize the features? The only way to know is to test it.
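
A vectorized sketch of the removal step, using the same "more than `n_std` standard deviations from the mean" rule as the loop above. The helper name `drop_extreme_rows` is hypothetical. The toy example uses a threshold of 20 rather than 70, because in a sample of only 1000 rows a single point cannot be more than about 31.6 standard deviations from the mean.

```python
import numpy as np
import pandas as pd

def drop_extreme_rows(frame, cols, n_std=70):
    """Drop rows where any column in `cols` is more than n_std stds from its mean."""
    z = (frame[cols] - frame[cols].mean()) / frame[cols].std()
    is_extreme = (z.abs() > n_std).any(axis=1)
    return frame[~is_extreme]

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "f_0": rng.normal(size=1000),
    "f_1": rng.normal(size=1000),
})
toy.loc[0, "f_0"] = 1e4  # plant one extreme value

cleaned = drop_extreme_rows(toy, ["f_0", "f_1"], n_std=20)
print(len(toy), len(cleaned))
```

On the real data the call would look like `drop_extreme_rows(df, feature_cols)`, applied to the training rows only.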

In [None]:
%pip install pingouin -q

In [None]:
import pingouin as pg
pg.normality(df[['f_0', 'f_1']], method='jarque_bera')

Correlation between features

In [None]:
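# Select feature columns by name rather than by position, so the result does not depend on column order
corr = sampled_df[[f'f_{i}' for i in range(300)]].corr()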
sns.clustermap(corr, metric="correlation", cmap="inferno", figsize=(20, 20))
plt.suptitle('Correlations Between Features', fontsize=24, weight='bold')
plt.show();

There is clustering and many strong correlations between some features

In [None]:
corr = corr.abs()

corrs = corr.unstack()
pair = corrs.sort_values(ascending=False)
pair = pair.reset_index(name='correlation').rename(columns={'level_0': 'feature_a', 'level_1': 'feature_b', 0: 'correlation'})
pair = pair[pair['feature_a'] != pair['feature_b']].iloc[::2,:]
pair = pair[:10]
pair

Displaying those highly correlated features with hexbin plots

In [None]:
def hex_plot(df, pair, rows=3, columns=3, title=None):
    
    '''Hexbin plots for the feature pairs listed in `pair` (columns feature_a, feature_b)'''
    
    fig, axes = plt.subplots(rows, columns, figsize=(30, 25), constrained_layout=True)
    axes = axes.flatten()

    for i, ax in enumerate(axes):
        ax.hexbin(df[pair['feature_a'].iloc[i]], df[pair['feature_b'].iloc[i]],  gridsize=100, cmap='inferno', bins='log')
        ax.set_xlabel(pair['feature_a'].iloc[i], fontsize=18)
        ax.set_ylabel(pair['feature_b'].iloc[i], fontsize=18)

    # Set the figure title once, after all panels are drawn
    fig.suptitle(f'{title}', fontsize=24, weight='bold')

In [None]:
hex_plot(sampled_df, pair, rows=4, columns=2, title='Highly Correlated Features')

**Takeaway**: We should take a closer look at these variables to prevent multicollinearity while modelling.
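
One common way to act on this is to drop one feature from each highly correlated pair. The sketch below is an illustration only: the helper name `correlated_columns_to_drop` and the 0.95 threshold are assumptions, and whether dropping features helps should be validated.

```python
import numpy as np
import pandas as pd

def correlated_columns_to_drop(frame, threshold=0.95):
    """Return one column from each pair whose absolute correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "f_a": base,
    "f_b": base + rng.normal(scale=0.01, size=500),  # near-duplicate of f_a
    "f_c": rng.normal(size=500),                     # independent feature
})
to_drop = correlated_columns_to_drop(toy)
print(to_drop)
```

Tree-based models such as the LightGBM baselines referenced at the top are usually less sensitive to multicollinearity than linear models, so this step matters most for the latter.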

## Features to Target Correlation

In [None]:
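# Correlate only the feature columns with the target (selected by name, not by position)
correlations = sampled_df[[f'f_{i}' for i in range(300)]].corrwith(sampled_df['target']).to_frame()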
correlations['Abs Corr'] = correlations[0].abs()
sorted_correlations = correlations.sort_values('Abs Corr', ascending=False)['Abs Corr']
sorted_correlations[:10]

There is almost no linear correlation between features and target

## Feature values through time

In [None]:
# for any investment chosen
inv_id = 33
for ix in range(0, 300):
    df[df.investment_id == inv_id].reset_index()[f'f_{ix}'].plot(figsize=(20, 7))

As indicated, there are some time gaps in the data

`investment_id` - The ID code for an investment. Not all investment have data in all time IDs.

In [None]:
df[df.investment_id == 50].target.plot()

In [None]:
target_50 = df[df.investment_id == 50].target.reset_index().target

In [None]:
target_50.plot(figsize=(20,5))

In [None]:
target_50.rolling(window=10).mean().plot(figsize=(20, 5))
target_50.plot(alpha=0.25)

In [None]:
target_50.rolling(window=10).mean().rolling(window=10).std().plot()

## Evaluate correlation in time for some targets

In [None]:
df[df.investment_id == 1][['time_id', 'target']]

In [None]:
df[df.investment_id == 1][['time_id', 'target']].set_index('time_id').rename(columns={"target": "target_1"})

In [None]:
df[df.investment_id == 2][['time_id', 'target']].set_index('time_id')

In [None]:
inv_1 = df[df.investment_id == 1][['time_id', 'target']].set_index('time_id').rename(columns={"target": "target_1"})
inv_2 = df[df.investment_id == 2][['time_id', 'target']].set_index('time_id').rename(columns={"target": "target_2"})
two_targets = pd.concat([inv_1, inv_2], axis=1)
two_targets

In [None]:
two_targets.dropna(inplace=True)
two_targets

In [None]:
two_targets.index//20

In [None]:
two_targets.groupby(two_targets.index//10).corr()

In [None]:
two_targets.groupby(two_targets.index//10).corr().loc[:, 'target_1']

In [None]:
two_targets.groupby(two_targets.index//10).corr().loc[:, 'target_1'].loc[:, 'target_2']

In [None]:
two_targets.groupby(two_targets.index//10).corr().loc[:, 'target_1'].loc[:, 'target_2'].plot()

In [None]:
inv_1 = df[df.investment_id == 1][['time_id', 'target']].set_index('time_id').rename(columns={"target": "target_1"})
inv_2 = df[df.investment_id == 3][['time_id', 'target']].set_index('time_id').rename(columns={"target": "target_2"})
two_targets = pd.concat([inv_1, inv_2], axis=1)
two_targets.dropna(inplace=True)
two_targets

In [None]:
two_targets.groupby(two_targets.index//20).corr().loc[:, 'target_1'].loc[:, 'target_2'].plot()

## Check for some possible correlations visually with scatter plots

In [None]:
two_targets.loc[24:500].plot.scatter(x='target_1', y='target_2', s=5)

In [None]:
inv_df = df[df.investment_id == 1][['time_id', 'target']].set_index('time_id').rename(columns={"target": "target_1"})
for inv_id in df.investment_id.unique()[1:11]:  # distinct investment ids, skipping the first
    inv_ = df[df.investment_id == inv_id][['time_id', 'target']].set_index('time_id').rename(columns={"target": f"target_{inv_id}"})
    inv_df = pd.concat([inv_df, inv_], axis=1)
    
inv_df.dropna(inplace=True)
inv_df

In [None]:
fig, axs = plt.subplots(2, 5, figsize=(20, 10))
axes = axs.flatten()
for idx, ax in enumerate(axes):
    ax.scatter(inv_df.loc[:, 'target_1'], inv_df.loc[:, inv_df.columns[idx + 1]], s=5)