A series of Qlib-based kernels. You are welcome to explore more features and models in [Qlib](http://github.com/microsoft/qlib).
- [A Different EDA based on Qlib\[EN/中文\]](https://www.kaggle.com/youngyang/a-different-eda-based-on-qlib-en/)
- [A Naive Qlib Example\[EN/中文\]](https://www.kaggle.com/youngyang/qlibnaiveexample-en/)


Main conclusions

- The features are well preprocessed: at every timestamp the mean is approximately 0 and the std is approximately 1, and no inf or NaN values are found.
- Some of the features are highly autocorrelated (fundamental factors usually are).
- Some features **switch over time between behaving like a categorical feature (on a given day, all values concentrate on two or three distinct values) and a numerical feature (on a given day, every stock has a different value)**.
- The performance of the features/factors varies greatly over time, **which may be the main cause of the large gap between local CV and LB scores**.


In [None]:
# set env
!tar xf ../input/qlib-dev/packages.all -C . 
import sys
sys.path.insert(0, './packages')
%env JOBLIB_TEMP_FOLDER=/tmp

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(color_codes=True)
from tqdm.auto import tqdm
from qlib.contrib.report.data.ana import CombFeaAna, FeaDistAna, FeaNanAnaRatio, FeaInfAna, FeaMeanStd, ValueCNT, FeaACAna, RawFeaAna

Read Data

In [None]:
def read_data(path="../input/train.pkl"):
    df = pd.read_pickle(path)
    df = df.set_index(["time_id", "investment_id"])
    df.columns = pd.MultiIndex.from_tuples([("label" if col == "target" else "feature", col) for col in df.columns])
    df.index.names = ["datetime", "instrument"]  # Qlib's processors require a "datetime" index level
    df = df.astype(np.float32)  # cast to float32 so that the unstack operation is supported
    return df
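
To make the resulting layout concrete, here is a minimal sketch that applies the same index and column transformation to a tiny made-up frame. The values and the names `toy`, `f_0` and `f_1` are assumptions for illustration only; the real data is loaded by `read_data`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the competition table: 2 timestamps x 2 investments.
toy = pd.DataFrame(
    {
        "time_id": [0, 0, 1, 1],
        "investment_id": [1, 2, 1, 2],
        "target": [0.1, -0.2, 0.3, 0.0],
        "f_0": [1.0, 2.0, 3.0, 4.0],
        "f_1": [0.5, 0.5, 0.7, 0.9],
    }
)

# Same steps as read_data: index by (time, investment), then group the
# columns under a two-level header ("label" or "feature", original name).
toy = toy.set_index(["time_id", "investment_id"])
toy.columns = pd.MultiIndex.from_tuples(
    [("label" if col == "target" else "feature", col) for col in toy.columns]
)
toy.index.names = ["datetime", "instrument"]
toy = toy.astype(np.float32)

features = toy["feature"]          # DataFrame with columns f_0, f_1
target = toy[("label", "target")]  # Series indexed by (datetime, instrument)
```

With this layout, `toy["feature"]` selects all feature columns at once, which is how `raw_df['feature']` is used in the rest of the notebook.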

In [None]:
raw_df = read_data("../input/ubiquant-market-prediction-half-precision-pickle/train.pkl")

# Analysis of Features

In [None]:
from functools import partial

Explanation of the graphs
- Each column in the graphs below corresponds to one feature.
- Each column contains 4 sub-graphs, which show, from top to bottom:
   - the overall distribution (histogram) of the feature
   - the mean and std over time
   - the ratio of the number of unique values to the total number of stocks at each timestamp (designed to tell whether the feature is categorical or numerical)
   - the autocorrelation of the feature with a lag of 1
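
For readers without Qlib at hand, the last three statistics described above can be approximated in plain pandas. This is only a sketch on synthetic data, not Qlib's implementation: the names `toy_feature`, `mean_std`, `uniq_ratio` and `lag1_ac` are made up here, and Qlib's analysers may compute the autocorrelation differently (for example on ranks).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one feature: 50 timestamps x 20 instruments (assumed sizes).
rng = np.random.default_rng(0)
index = pd.MultiIndex.from_product(
    [range(50), range(20)], names=["datetime", "instrument"]
)
toy_feature = pd.Series(rng.standard_normal(len(index)), index=index, name="f_toy")

by_day = toy_feature.groupby(level="datetime")

# Cross-sectional mean and std at each timestamp.
mean_std = by_day.agg(["mean", "std"])

# Number of unique values divided by the number of instruments at each timestamp.
uniq_ratio = by_day.nunique() / by_day.size()

# Lag-1 autocorrelation: at each timestamp, correlate every instrument's value
# with the same instrument's value at the previous timestamp.
prev = toy_feature.groupby(level="instrument").shift(1)
pairs = pd.concat({"cur": toy_feature, "prev": prev}, axis=1).dropna()
lag1_ac = pairs.groupby(level="datetime").apply(lambda g: g["cur"].corr(g["prev"]))
```

A ratio in `uniq_ratio` close to 1 means nearly every stock has its own value that day (numerical behaviour), while a ratio close to 0 means the values collapse onto a few levels (categorical behaviour).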

In [None]:
fa_full = CombFeaAna(raw_df['feature'], FeaDistAna, FeaMeanStd, partial(ValueCNT, ratio=True), FeaACAna)
fa_full.plot_all(sub_fs=(5, 2), col_n=5, wspace=0.5)

## Case Study of Single Features

> - the ratio of the number of unique values to the total number of stocks at each timestamp (designed to tell whether the feature is categorical or numerical)

As mentioned above, the distribution of some features becomes very weird at certain timestamps.
We pick one feature and display its distribution on different days.

In [None]:
fea_df = raw_df['feature']

In [None]:
def plot_sample_dist(fname, fea_df):
    """Plot sampled distribution"""
    sample_dates = fea_df.index.get_level_values("datetime").unique().to_series().sample(30)
    fea_focus = fea_df[fname].loc[sample_dates.values, :]
    fea_focus = fea_focus.unstack(level='datetime')
    fa = FeaDistAna(fea_focus)
    fa.plot_all(sub_fs=(4, 2), col_n=5, wspace=0.5)

In [None]:
plot_sample_dist('f_294', fea_df)

# Cluster days

Each sample corresponds to a timestamp and has 300 dimensions, one per feature. Each dimension is the ratio __< the number of unique values >/< the number of stocks >__ of one feature at that timestamp.

In [None]:
def get_daily_uniq_val_ratio(gdf):
    """For one day's rows, return each feature's number of unique values divided by the number of stocks that day."""
    return gdf.apply(lambda s: len(s.unique())) / gdf.shape[0]

day_cnt_df = fea_df.groupby('datetime').apply(get_daily_uniq_val_ratio)
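
As a quick sanity check of what this ratio measures, the sketch below applies the same function to a tiny made-up frame. The data and the names `toy`, `toy_ratio` and `f_toy` are assumptions for illustration: on day 0 the feature takes only two distinct values, on day 1 every instrument has its own value.

```python
import pandas as pd


def get_daily_uniq_val_ratio(gdf):
    # Same computation as in the notebook cell above.
    return gdf.apply(lambda s: len(s.unique())) / gdf.shape[0]


# Toy data (made up): 2 timestamps x 4 instruments, one feature.
index = pd.MultiIndex.from_product(
    [[0, 1], list("abcd")], names=["datetime", "instrument"]
)
toy = pd.DataFrame(
    {"f_toy": [0.0, 0.0, 1.0, 1.0, 0.1, 0.2, 0.3, 0.4]}, index=index
)

# One row per timestamp, one column per feature.
toy_ratio = toy.groupby("datetime").apply(get_daily_uniq_val_ratio)
```

Here day 0 yields 2 unique values over 4 instruments (ratio 0.5, categorical-looking) and day 1 yields 4 over 4 (ratio 1.0, numerical-looking).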

We then cluster the samples from different timestamps based on these ratios.

The resulting clusters demonstrate **that the features which switch back and forth between categorical and numerical do so at the same time.**


In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0)
clusters = kmeans.fit_predict(day_cnt_df)

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=800)
tsne_results = tsne.fit_transform(day_cnt_df)

df_data = pd.DataFrame()

df_data['tsne-2d-one'] = tsne_results[:,0]
df_data['tsne-2d-two'] = tsne_results[:,1]
df_data['cluster'] = clusters
plt.figure(figsize=(16,10))
sns.scatterplot(
    x="tsne-2d-one", y="tsne-2d-two",
    hue="cluster",
    palette=sns.color_palette("hls", 2),
    data=df_data,
    legend="full",
    alpha=0.3
)

# Performance Analysis of factors

In [None]:
from qlib.contrib.eva.alpha import calc_all_ic

The graphs below show the IC (information coefficient) between each feature and the label over time, smoothed with a 100-step moving average.
- The performance of the factors/features varies greatly over time.
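
As a rough illustration of the quantity being plotted, the sketch below computes a per-timestamp IC as the cross-sectional Pearson correlation between one feature and the label, then smooths it with a 100-step moving average. It runs on synthetic data and is not Qlib's `calc_all_ic`; the names `toy`, `daily_ic` and `smoothed_ic`, the data sizes and the 0.1 signal strength are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumed sizes): 300 timestamps x 50 instruments.
rng = np.random.default_rng(0)
index = pd.MultiIndex.from_product(
    [range(300), range(50)], names=["datetime", "instrument"]
)
label = pd.Series(rng.standard_normal(len(index)), index=index)
# Hypothetical feature: a weak copy of the label plus independent noise.
feature = 0.1 * label + pd.Series(rng.standard_normal(len(index)), index=index)

toy = pd.concat({"feature": feature, "label": label}, axis=1)

# IC at each timestamp: correlation between feature and label across instruments.
daily_ic = toy.groupby(level="datetime").apply(
    lambda g: g["feature"].corr(g["label"])
)

# 100-step moving average; the first 99 entries are NaN by construction.
smoothed_ic = daily_ic.rolling(100).mean()
```

Plotting `smoothed_ic` for every feature side by side is what makes the drift in factor performance visible.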

In [None]:
all_ic = calc_all_ic(raw_df['feature'].to_dict('series'), raw_df[('label', 'target')])
all_ic_df = pd.concat({f: d['ic'] for f, d in all_ic.items()})
all_ic_df = all_ic_df.unstack(0)

fa = RawFeaAna(all_ic_df.rolling(100).mean())
fa.plot_all(sub_fs=(4, 2), col_n=10, wspace=0.5)

Averaged over time, the performance of the factors themselves is quite good.

In [None]:
all_ic_df.mean().plot(kind='bar', figsize=(60, 10))