# Introduction
- I am new to Kaggle competitions and joined this one.
- I have tried my best but have not achieved a high score so far.
- I would like to share what I have done (a very basic analysis, though...). Any comments or advice would be very helpful (and please upvote if you find this useful)!

# Contents (List of Questions I Came Up With)

- Which model should we use? (**Part I**)
- Should we use all the data for training? If not, how should we filter it? (**Part I**)
- CV and the baseline score (**Part I**)
- Should we apply Target preprocessing? If so, how? (**Part II**)
- Should we apply Feature preprocessing? If so, how? (**Part II**)
- What is missing?

In [None]:
import os
import sys
import gc
import psutil
import time
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, Ridge, LinearRegression, LassoCV
from sklearn.preprocessing import StandardScaler
import seaborn as sns
from sklearn import metrics
from scipy.stats import pearsonr
from typing import Tuple

mem = psutil.virtual_memory()
print(f'Memory usage: {mem.percent}%')
print(f'Memory used: {mem.used/1000/1000/1000:.2f} GB')
print(f'Memory available: {mem.available/1000/1000/1000:.2f} GB')

### Load Data
Thanks to https://www.kaggle.com/datasets/robikscube/ubiquant-parquet, we can reduce the memory usage.

In [None]:
%%time
def reduce_memory_usage(df, features):
    for feature in features:
        item = df[feature].astype(np.float16)
        df[feature] = item
        del item
        gc.collect()
        
target = 'target'
n_features = 300
features = [f'f_{i}' for i in range(n_features)]
feature_columns = ['investment_id', 'time_id'] + features
X = pd.read_parquet('../input/ubiquant-parquet/train_low_mem.parquet', columns=feature_columns + ["target"])
reduce_memory_usage(X, features + ["target"])
print(X.shape)
X.head()
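
The `float16` cast above trades precision for memory. Here is a minimal sketch of that trade-off on hypothetical toy data (the frame `toy` and its values are assumptions for illustration, not the competition data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real feature table (hypothetical data).
toy = pd.DataFrame({
    "f_0": np.linspace(-3.0, 3.0, 1000),
    "f_1": np.linspace(0.0, 1.0, 1000),
})

bytes_before = toy.memory_usage(index=False, deep=True).sum()
toy_small = toy.astype(np.float16)
bytes_after = toy_small.memory_usage(index=False, deep=True).sum()

# float16 keeps only about 3 significant decimal digits, so values get rounded.
max_abs_error = (toy - toy_small.astype(np.float64)).abs().to_numpy().max()

print(f"bytes before: {bytes_before}, after: {bytes_after}")
print(f"largest rounding error: {max_abs_error:.5f}")
```

Going from float64 to float16 cuts the column memory to a quarter, at the cost of small rounding errors in every value.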

## Filter out Training Data

In [None]:
time_id_list = list(X['time_id'].unique())
investment_id_list = list(X['investment_id'].unique())
data_dummy = [np.nan for _ in range(len(time_id_list)*len(investment_id_list))]
X_filled = pd.DataFrame(data_dummy, columns=['dummy'])
X_filled['time_id'] = np.repeat(time_id_list, len(investment_id_list))
X_filled['investment_id'] = np.tile(investment_id_list, len(time_id_list))
X_filled = X_filled.set_index(['time_id', 'investment_id'])
X_orig = X.set_index(['time_id','investment_id'])
X_orig = X_orig[['target']]
X_filled = X_filled.join(X_orig).drop('dummy', axis=1) # join
X_filled = X_filled.unstack() # move investment_id to columns
X_filled = X_filled.T.reset_index(level=0).drop('level_0', axis=1).T # multi columns to single column
thred = len(X_filled) - 500 # if an investment_id has at least 500 samples (about 2 years), handle it in the normal training process.
df_null_count = X_filled.isnull().sum(axis=0).sort_values(ascending=False)
df_chosen = df_null_count[df_null_count < thred]
investment_id_list = list(df_chosen.index)
X_filled_chosen = X_filled[investment_id_list]
df_nnull = X_filled_chosen.isnull().sum(axis=1).to_frame()
df_nnull.columns = ['nnull']
df_nnull['nnull_diff'] = df_nnull['nnull'].diff(1)
df_nnull['too_many_missing'] = (df_nnull['nnull'] > 1000) | (df_nnull['nnull_diff'] > 200)
X_filled_chosen = X_filled_chosen.iloc[~df_nnull['too_many_missing'].values, :]
X_filled_chosen = X_filled_chosen.astype("object").ffill().astype("float")  # fillna(method="ffill") is deprecated in recent pandas
X_filled_chosen.dropna(inplace=True)
time_id_chosen = list(X_filled_chosen.index)

train = X[X['time_id'].isin(time_id_chosen)]
train.head()
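
The first half of the cell above builds a `time_id` x `investment_id` grid of targets and keeps only the investments with enough observations. As a rough sketch of that step only, the same grid can be built with `pivot`. The toy data and the threshold name `min_samples` are assumptions for illustration, and the time_id filtering and forward-fill steps are not reproduced here:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: investment 3 has only one observation.
toy = pd.DataFrame({
    "time_id":       [0, 0, 1, 1, 2, 2, 2],
    "investment_id": [1, 2, 1, 2, 1, 2, 3],
    "target":        [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, 0.5],
})

# Rows are time_id, columns are investment_id; absent pairs become NaN.
grid = toy.pivot(index="time_id", columns="investment_id", values="target")

min_samples = 1  # the notebook uses 500; 1 is only for this toy example
n_observed = grid.notna().sum(axis=0)

# Keep investments observed at more than min_samples time_ids.
kept_ids = n_observed[n_observed > min_samples].index.tolist()
print(kept_ids)
```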

In [None]:
# a bit of cleaning in memory
del X_filled, X_filled_chosen, X_orig, data_dummy
gc.collect()

In [None]:
class GroupTimeSeriesSplit:
    """
    From: https://www.kaggle.com/c/ubiquant-market-prediction/discussion/304036
    Custom class to create a Group Time Series Split. We ensure
    that the time id values that are in the testing data are not a part
    of the training data & the splits are temporal
    """
    def __init__(self, n_folds: int, holdout_size: int, groups: pd.Series, cv: bool = False) -> None:
        self.n_folds = n_folds
        self.holdout_size = holdout_size
        self.groups = groups
        self.cv = cv

    def split(self, X) -> Tuple[np.array, np.array]:
        # Take the group column and get the unique values
        unique_time_ids = np.unique(self.groups.values)

        # Split the time ids into the length of the holdout size
        # and reverse so we work backwards in time. Also, makes
        # it easier to get the correct time_id values per
        # split
        array_split_time_ids = np.array_split(
            unique_time_ids, len(unique_time_ids) // self.holdout_size
        )[::-1]

        # Get the first n_folds values
        array_split_time_ids = array_split_time_ids[:self.n_folds]

        for time_ids in array_split_time_ids:
            # Get test index - time id values that are in the time_ids
            test_condition = X['time_id'].isin(time_ids)
            test_index = X.loc[test_condition].index

            # Get train index - The train index will be the time
            # id values right up until the minimum value in the test
            # data - we can also add a gap to this step by
            # time id < (min - gap)
            if self.cv:
                train_condition = ( X['time_id'] < (np.min(time_ids)) ) | ( X['time_id'] > (np.max(time_ids)) )
            else:
                train_condition = X['time_id'] < (np.min(time_ids))
            train_index = X.loc[train_condition].index

            yield train_index, test_index
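
To make the splitting logic easier to follow, here is a small standalone sketch of it on hypothetical toy data (12 time_ids, 2 rows each). It repeats the block construction from `split` rather than calling the class, and the variable names are my own:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 12 time_ids with 2 rows each.
toy = pd.DataFrame({"time_id": np.repeat(np.arange(12), 2)})
holdout_size, n_folds = 4, 2

unique_ids = np.unique(toy["time_id"])
# Cut the time axis into blocks, newest first, and keep the newest n_folds blocks.
blocks = np.array_split(unique_ids, len(unique_ids) // holdout_size)[::-1][:n_folds]

summary = []
for block in blocks:
    past_only = toy["time_id"] < block.min()                 # cv=False
    both_sides = past_only | (toy["time_id"] > block.max())  # cv=True
    summary.append((block.tolist(), int(past_only.sum()), int(both_sides.sum())))
    print(summary[-1])
```

Note that with `cv=True` (the setting used in the cells of this notebook) the training rows also include time_ids that come after the validation block, so the split is no longer strictly "train on the past, validate on the future".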

## 3. Should we apply Target preprocessing? If so, how?

- In linear regression, it is important to deal with outlier samples (they can pull the fit away from the bulk of the data, which leads to underfitting).
- As far as I know, clipping outliers in the target is a popular approach.
- I am not 100% sure we should do that, because we cannot clip the targets in the Public/Private LB stage.
- Here we check the effect of target clipping in the following way:
  - clip the target in the training set.
  - do NOT clip the target in the validation set.
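
A minimal sketch of per-`investment_id` quantile clipping, assuming toy data and a hypothetical helper `clip_by_group` (the cells below use a `groupby().apply()` version instead). Leaving a bound as `None` gives one-sided clipping:

```python
import numpy as np
import pandas as pd

def clip_by_group(df, group_col, value_col, lower_q=None, upper_q=None):
    """Clip value_col to per-group quantiles; a bound left as None is skipped."""
    grouped = df.groupby(group_col)[value_col]
    lower = grouped.transform(lambda s: s.quantile(lower_q)) if lower_q is not None else None
    upper = grouped.transform(lambda s: s.quantile(upper_q)) if upper_q is not None else None
    return df[value_col].clip(lower=lower, upper=upper)

# Hypothetical training targets with one extreme value per investment.
toy_train = pd.DataFrame({
    "investment_id": [1] * 5 + [2] * 5,
    "target": [0.0, 1.0, 2.0, 3.0, 100.0, -50.0, 0.0, 1.0, 2.0, 3.0],
})

# Wide quantiles (25%/75%) are used only so the effect is visible on 5 samples.
both_sides = clip_by_group(toy_train, "investment_id", "target", 0.25, 0.75)
upper_only = clip_by_group(toy_train, "investment_id", "target", upper_q=0.75)
print(both_sides.tolist())
print(upper_only.tolist())
```

The quantiles are computed on the training rows only, and the validation targets are left untouched, matching the setup described above.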

### Case 0: No Target Clipping

In [None]:
y_all_orig = train['target'].astype(np.float16)

holdout_size = 60
FOLDS = int(len(np.unique(train['time_id'])) / holdout_size)-1
pearson_means = []
gtss = GroupTimeSeriesSplit(n_folds=FOLDS, holdout_size=holdout_size, groups=train['time_id'], cv=True)
for fold, (tr, val) in enumerate(gtss.split(train)):
    X_train = train.loc[tr, features]
    y_train = y_all_orig[tr]
    model = LinearRegression()
    model.fit(X_train, y_train)

    train_val = train.loc[val, :]
    df_pearson_all = pd.DataFrame(columns=['time_id', 'investment_id', 'y_pred', 'y_true'])
    df_pearson_all['time_id'] = train_val['time_id'].values
    df_pearson_all['investment_id'] = train_val['investment_id'].values
    X_val = train.loc[val, features]
    y_val = y_all_orig[val]
    df_pearson_all['y_pred'] = model.predict(X_val)
    df_pearson_all['y_true'] = y_val.values
    m = df_pearson_all.groupby('time_id').apply(lambda x: pearsonr(x['y_true'], x['y_pred'])[0]).mean()
    std = df_pearson_all.groupby('time_id').apply(lambda x: pearsonr(x['y_true'], x['y_pred'])[0]).std()
    pearson_means.append(m)
    r2_train = model.score(X_train, y_train)
    r2_val = model.score(X_val, y_val)
    print("fold: {}, val mean corr: {:.3%}, val stdev corr: {:.3%}, train r2: {:.3%}, val r2: {:.3%}".format(fold, m, std, r2_train, r2_val))
    del X_train, y_train, X_val, y_val, tr, val
    gc.collect()
    
print("Pearson Mean over Folds: {:.3%}, Stdev over Folds {:.3%}".format(np.array(pearson_means).mean(), np.array(pearson_means).std()))
print()
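
The validation score printed in the loop above is the mean (and stdev) of the Pearson correlation computed separately for each `time_id`. A minimal sketch of that metric on hypothetical toy predictions follows. The helper name `pearson_by_time` is an assumption, and it uses `Series.corr` in place of `scipy.stats.pearsonr`:

```python
import numpy as np
import pandas as pd

def pearson_by_time(df, time_col="time_id", true_col="y_true", pred_col="y_pred"):
    """Return one Pearson correlation per time_id (Series indexed by time_id)."""
    corrs = {t: g[true_col].corr(g[pred_col]) for t, g in df.groupby(time_col)}
    return pd.Series(corrs, name="pearson")

# Hypothetical predictions: perfect at time 0, perfectly reversed at time 1.
toy = pd.DataFrame({
    "time_id": [0, 0, 0, 1, 1, 1],
    "y_true":  [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "y_pred":  [0.1, 0.2, 0.3, 0.3, 0.2, 0.1],
})

per_time = pearson_by_time(toy)
print(per_time.mean(), per_time.std())
```

Computing the per-`time_id` correlations once and reusing them for both the mean and the stdev avoids running the same groupby twice.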

### Case 1: Clip Target for both the Lower and Upper side
- Lower 1% quantile and upper 1% quantile

In [None]:
%%time

y_lower_qs = {}
y_upper_qs = {}
TGTID = ['investment_id', 'target']
def f_y(x):
    invest_id = x['investment_id'].values[0]
    x_num = x.drop('investment_id', axis=1).values
    y_lower_qs[invest_id] = np.quantile(x_num, q=0.01, axis=0)
    y_upper_qs[invest_id] = np.quantile(x_num, q=0.99, axis=0) # 0.99
    x_num = np.clip(x_num, y_lower_qs[invest_id], y_upper_qs[invest_id])
    x['target'] = x_num.astype(np.float16)
    return x
y_all_orig = train['target'].astype(np.float16)
y_all_correct = train[TGTID].groupby('investment_id').apply(lambda x: f_y(x))['target'].astype(np.float16)

In [None]:
holdout_size = 60
FOLDS = int(len(np.unique(train['time_id'])) / holdout_size)-1
pearson_means = []
gtss = GroupTimeSeriesSplit(n_folds=FOLDS, holdout_size=holdout_size, groups=train['time_id'], cv=True)
for fold, (tr, val) in enumerate(gtss.split(train)):
    X_train = train.loc[tr, features]
    y_train = y_all_correct[tr]
    model = LinearRegression()
    model.fit(X_train, y_train)

    train_val = train.loc[val, :]
    df_pearson_all = pd.DataFrame(columns=['time_id', 'investment_id', 'y_pred', 'y_true'])
    df_pearson_all['time_id'] = train_val['time_id'].values
    df_pearson_all['investment_id'] = train_val['investment_id'].values
    X_val = train.loc[val, features]
    y_val = y_all_orig[val]
    df_pearson_all['y_pred'] = model.predict(X_val)
    df_pearson_all['y_true'] = y_val.values
    m = df_pearson_all.groupby('time_id').apply(lambda x: pearsonr(x['y_true'], x['y_pred'])[0]).mean()
    std = df_pearson_all.groupby('time_id').apply(lambda x: pearsonr(x['y_true'], x['y_pred'])[0]).std()
    pearson_means.append(m)
    r2_train = model.score(X_train, y_train)
    r2_val = model.score(X_val, y_val)
    print("fold: {}, val mean corr: {:.3%}, val stdev corr: {:.3%}, train r2: {:.3%}, val r2: {:.3%}".format(fold, m, std, r2_train, r2_val))
    del X_train, y_train, X_val, y_val, tr, val
    gc.collect()
    
print("Pearson Mean over Folds: {:.3%}, Stdev over Folds {:.3%}".format(np.array(pearson_means).mean(), np.array(pearson_means).std()))
print()

- Lower 3% quantile and upper 3% quantile

In [None]:
y_lower_qs = {}
y_upper_qs = {}
TGTID = ['investment_id', 'target']
def f_y(x):
    invest_id = x['investment_id'].values[0]
    x_num = x.drop('investment_id', axis=1).values
    y_lower_qs[invest_id] = np.quantile(x_num, q=0.03, axis=0)
    y_upper_qs[invest_id] = np.quantile(x_num, q=0.97, axis=0) # upper 3% quantile
    x_num = np.clip(x_num, y_lower_qs[invest_id], y_upper_qs[invest_id])
    x['target'] = x_num.astype(np.float16)
    return x
y_all_orig = train['target'].astype(np.float16)
y_all_correct = train[TGTID].groupby('investment_id').apply(lambda x: f_y(x))['target'].astype(np.float16)

In [None]:
holdout_size = 60
FOLDS = int(len(np.unique(train['time_id'])) / holdout_size)-1
pearson_means = []
gtss = GroupTimeSeriesSplit(n_folds=FOLDS, holdout_size=holdout_size, groups=train['time_id'], cv=True)
for fold, (tr, val) in enumerate(gtss.split(train)):
    X_train = train.loc[tr, features]
    y_train = y_all_correct[tr]
    model = LinearRegression()
    model.fit(X_train, y_train)

    train_val = train.loc[val, :]
    df_pearson_all = pd.DataFrame(columns=['time_id', 'investment_id', 'y_pred', 'y_true'])
    df_pearson_all['time_id'] = train_val['time_id'].values
    df_pearson_all['investment_id'] = train_val['investment_id'].values
    X_val = train.loc[val, features]
    y_val = y_all_orig[val]
    df_pearson_all['y_pred'] = model.predict(X_val)
    df_pearson_all['y_true'] = y_val.values
    m = df_pearson_all.groupby('time_id').apply(lambda x: pearsonr(x['y_true'], x['y_pred'])[0]).mean()
    std = df_pearson_all.groupby('time_id').apply(lambda x: pearsonr(x['y_true'], x['y_pred'])[0]).std()
    pearson_means.append(m)
    r2_train = model.score(X_train, y_train)
    r2_val = model.score(X_val, y_val)
    print("fold: {}, val mean corr: {:.3%}, val stdev corr: {:.3%}, train r2: {:.3%}, val r2: {:.3%}".format(fold, m, std, r2_train, r2_val))
    del X_train, y_train, X_val, y_val, tr, val
    gc.collect()
    
print("Pearson Mean over Folds: {:.3%}, Stdev over Folds {:.3%}".format(np.array(pearson_means).mean(), np.array(pearson_means).std()))
print()

### Case 2: Clip Target only for Upper side
- Upper 3% quantile only.
- It improves **train r2** A LOT.
- From my experience (NIKKEI225 and the options market), the market behaves consistently when it crashes, whereas it may not behave consistently when it rises. Clipping only the upper side therefore does not seem strange to me...
- For fold 5, the val corr is relatively low. (There is room for improvement.)
- **In conclusion, we use target clipping with this setup.**

In [None]:
y_upper_qs = {}
TGTID = ['investment_id', 'target']
def f_y(x):
    invest_id = x['investment_id'].values[0]
    x_num = x.drop('investment_id', axis=1).values
    y_upper_qs[invest_id] = np.quantile(x_num, q=0.97, axis=0) # upper 3% quantile
    x_num = np.clip(x_num, None, y_upper_qs[invest_id])
    x['target'] = x_num.astype(np.float16)
    return x
y_all_orig = train['target'].astype(np.float16)
y_all_correct = train[TGTID].groupby('investment_id').apply(lambda x: f_y(x))['target'].astype(np.float16)

In [None]:
holdout_size = 60
FOLDS = int(len(np.unique(train['time_id'])) / holdout_size)-1
pearson_means = []
gtss = GroupTimeSeriesSplit(n_folds=FOLDS, holdout_size=holdout_size, groups=train['time_id'], cv=True)
for fold, (tr, val) in enumerate(gtss.split(train)):
    X_train = train.loc[tr, features]
    y_train = y_all_correct[tr]
    model = LinearRegression()
    model.fit(X_train, y_train)

    train_val = train.loc[val, :]
    df_pearson_all = pd.DataFrame(columns=['time_id', 'investment_id', 'y_pred', 'y_true'])
    df_pearson_all['time_id'] = train_val['time_id'].values
    df_pearson_all['investment_id'] = train_val['investment_id'].values
    X_val = train.loc[val, features]
    y_val = y_all_orig[val]
    df_pearson_all['y_pred'] = model.predict(X_val)
    df_pearson_all['y_true'] = y_val.values
    m = df_pearson_all.groupby('time_id').apply(lambda x: pearsonr(x['y_true'], x['y_pred'])[0]).mean()
    std = df_pearson_all.groupby('time_id').apply(lambda x: pearsonr(x['y_true'], x['y_pred'])[0]).std()
    pearson_means.append(m)
    r2_train = model.score(X_train, y_train)
    r2_val = model.score(X_val, y_val)
    print("fold: {}, val mean corr: {:.3%}, val stdev corr: {:.3%}, train r2: {:.3%}, val r2: {:.3%}".format(fold, m, std, r2_train, r2_val))
    del X_train, y_train, X_val, y_val, tr, val
    gc.collect()
    
print("Pearson Mean over Folds: {:.3%}, Stdev over Folds {:.3%}".format(np.array(pearson_means).mean(), np.array(pearson_means).std()))
print()

## 4. Should we apply Feature preprocessing? If so, how?

- Next, we consider feature clipping and normalization.
- Since the model does not distinguish between investment_ids in training, it is important to normalize the feature values of different investment_ids into the same range.
- Here we check the effect of feature clipping and normalization in the following way:
  - learn the clipping bounds and scalers on the training set.
  - apply the fitted clipping bounds and scalers to the validation set.
  - if an investment_id in the validation set is not in the training set, we leave its features unchanged.
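
The three steps above can be sketched as a fit/apply pair. This is a minimal sketch on hypothetical toy data, with plain NumPy mean/std standing in for `StandardScaler`. The helper names `fit_group_stats` and `apply_group_stats` are assumptions, not part of the notebook:

```python
import numpy as np
import pandas as pd

def fit_group_stats(train_df, group_col, feature_cols, lower_q=0.01, upper_q=0.99):
    """Learn clipping bounds and mean/std per group from the training rows only."""
    stats = {}
    for group_id, g in train_df.groupby(group_col):
        values = g[feature_cols].to_numpy(dtype=np.float64)
        lo = np.quantile(values, lower_q, axis=0)
        hi = np.quantile(values, upper_q, axis=0)
        clipped = np.clip(values, lo, hi)
        std = clipped.std(axis=0)  # population std, like StandardScaler
        std[std == 0] = 1.0        # avoid division by zero for constant features
        stats[group_id] = (lo, hi, clipped.mean(axis=0), std)
    return stats

def apply_group_stats(df, stats, group_col, feature_cols):
    """Clip and scale with the learned stats; unseen groups are left unchanged."""
    out = df.copy()
    out[feature_cols] = out[feature_cols].astype(np.float64)
    for group_id, g in df.groupby(group_col):
        if group_id not in stats:
            continue
        lo, hi, mean, std = stats[group_id]
        values = np.clip(g[feature_cols].to_numpy(dtype=np.float64), lo, hi)
        out.loc[g.index, feature_cols] = (values - mean) / std
    return out

# Hypothetical data: investment 3 appears only in the validation set.
toy_train = pd.DataFrame({"investment_id": [1, 1, 1, 2, 2, 2],
                          "f_0": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]})
toy_val = pd.DataFrame({"investment_id": [1, 3], "f_0": [2.0, 5.0]})

stats = fit_group_stats(toy_train, "investment_id", ["f_0"])
val_scaled = apply_group_stats(toy_val, stats, "investment_id", ["f_0"])
print(val_scaled)
```

Keeping the fit and apply steps separate makes it harder to accidentally refit on the validation rows.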

### Case 1: Clip and Normalize Features in Training

- **train r2** consistently increases in every fold.
- However, **val r2** decreases in almost all folds.

In [None]:
FEATID = features + ['investment_id']
TGTID = ['investment_id', 'target']
pearson_means = []
holdout_size = 60
FOLDS = int(len(np.unique(train['time_id'])) / holdout_size)-1
gtss = GroupTimeSeriesSplit(n_folds=FOLDS, holdout_size=holdout_size, groups=train['time_id'], cv=True)
for fold, (tr, val) in enumerate(gtss.split(train)):
    X_train = train.loc[tr, FEATID]
    y_train = train.loc[tr, TGTID]

    # Outlier treatment and Normalization of X_train (by investment_id)
    scalers = {}
    lower_qs = {}
    upper_qs = {}
    for invest_id in X_train['investment_id'].unique():
        scalers[invest_id] = StandardScaler()
    def ft_x(x):
        invest_id = x['investment_id'].values[0]
        x_num = x.drop('investment_id', axis=1).values
        lower_qs[invest_id] = np.quantile(x_num, q=0.01, axis=0)
        upper_qs[invest_id] = np.quantile(x_num, q=0.99, axis=0)
        x_num = np.clip(x_num, lower_qs[invest_id], upper_qs[invest_id])
        x_num = scalers[invest_id].fit_transform(x_num)
        x[features] = x_num.astype(np.float16)
        return x
    X_train = X_train.groupby('investment_id').apply(lambda x: ft_x(x))[features].astype(np.float16)

    # Outlier treatment of y_train (by investment_id)
    y_upper_qs = {}
    def ft_y(x):
        invest_id = x['investment_id'].values[0]
        x_num = x.drop('investment_id', axis=1).values
        y_upper_qs[invest_id] = np.quantile(x_num, q=0.97, axis=0)
        x_num = np.clip(x_num, None, y_upper_qs[invest_id])
        x['target'] = x_num.astype(np.float16)
        return x
    y_train = y_train.groupby('investment_id').apply(lambda x: ft_y(x))['target'].astype(np.float16)

    model = LinearRegression()
    model.fit(X_train, y_train)
    train_val = train.loc[val, :]
    df_pearson_all = pd.DataFrame(columns=['time_id', 'investment_id', 'y_pred', 'y_true'])
    df_pearson_all['time_id'] = train_val['time_id'].values
    df_pearson_all['investment_id'] = train_val['investment_id'].values
    train_val = train_val.drop(['time_id', 'target'], axis=1)  # avoid an in-place drop on a slice of train

    # apply trained Scaler/quantiles to X_val
    # if a new investment_id appears, we do nothing (we could apply an average scaler/quantile, for example...).
    def t_x(x):
        invest_id = x['investment_id'].values[0]
        if invest_id in lower_qs.keys():
            x_num = x.drop('investment_id', axis=1).values
            x_num = np.clip(x_num, lower_qs[invest_id], upper_qs[invest_id])
            x_num = scalers[invest_id].transform(x_num)
            x[features] = x_num.astype(np.float16)
        else:
            #print("invest id not found in train: {}".format(invest_id))
            pass
        return x
    X_val = train_val.groupby('investment_id').apply(lambda x: t_x(x))[features].astype(np.float16)
    y_val = train.loc[val, 'target'].astype(np.float16)
    # evaluation
    df_pearson_all['y_pred'] = model.predict(X_val)
    df_pearson_all['y_true'] = y_val.values
    m = df_pearson_all.groupby('time_id').apply(lambda x: pearsonr(x['y_true'], x['y_pred'])[0]).mean()
    std = df_pearson_all.groupby('time_id').apply(lambda x: pearsonr(x['y_true'], x['y_pred'])[0]).std()
    pearson_means.append(m)
    r2_train = model.score(X_train, y_train)
    r2_val = model.score(X_val, y_val)
    print("fold: {}, val mean corr: {:.3%}, val stdev corr: {:.3%}, train r2: {:.3%}, val r2: {:.3%}".format(fold, m, std, r2_train, r2_val))
    del X_train, y_train, X_val, y_val, tr, val
    gc.collect()

print("Pearson Mean over Folds: {:.3%}, Stdev over Folds {:.3%}".format(np.array(pearson_means).mean(), np.array(pearson_means).std()))
print()

### Case 2: Fit and Transform StandardScaler to X_val
- Given the above results, I wanted to try the case where we know the true mean and stdev of X_val (I know this is not possible in this competition).
- We find **very high correlation values** in all the folds (again, I know I am cheating...).
- I have not fully understood what is happening here.
  - It seems that the mean and stdev of X shift from the training period to the validation period (domain shift).
  - If so, a model trained on the training dataset should not have predictive power on the validation dataset.
  - However, the same model predicts the validation dataset well once we refit the scaler on the validation dataset.
  - I am a bit confused...
- Can we estimate the mean and stdev of X in the Public/Private LB?
  - We can store the history of X in the TimeSeriesAPI, so at least we can calculate a rolling mean/std of X.
  - I thought we could achieve something similar by using `(X-rolling_mean)/rolling_std` as a feature instead of doing feature normalization.
  - I did a small test, but it did not work (maybe we can do better...).
- My gut feeling now is that we should use a non-linear model:
  - A linear model fits one fixed set of coefficients for all periods, so it cannot adapt to a shift in the feature distribution.
  - However, we found domain shift. If we believe that the provided training data covers all market regimes that will occur in the future (those in the Private LB), we can build a regime-switching model that depends on the input X.
  - If I have time, I want to try the above (using a simple regime-switching model, or a GBDT or NN model that is expected to learn the regimes internally; we should be careful about overfitting when using large models like GBDT/NN).
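
For reference, the `(X-rolling_mean)/rolling_std` idea can be sketched as below. This is only an illustration on hypothetical toy data, not the small test I ran. The helper name `rolling_zscore`, the window length, and `min_periods` are all assumptions:

```python
import numpy as np
import pandas as pd

def rolling_zscore(df, group_col, feature_cols, window, min_periods):
    """(X - rolling_mean) / rolling_std per group, using only rows up to the current one.

    Assumes df is already sorted by time within each group.
    """
    grouped = df.groupby(group_col)[feature_cols]
    mean = grouped.transform(lambda s: s.rolling(window, min_periods=min_periods).mean())
    std = grouped.transform(lambda s: s.rolling(window, min_periods=min_periods).std())
    # A zero std would divide by zero, so treat it as missing instead.
    return (df[feature_cols] - mean) / std.replace(0.0, np.nan)

# Hypothetical history of one investment over four time_ids.
toy = pd.DataFrame({
    "investment_id": [1, 1, 1, 1],
    "time_id": [0, 1, 2, 3],
    "f_0": [1.0, 2.0, 3.0, 4.0],
})

z = rolling_zscore(toy, "investment_id", ["f_0"], window=2, min_periods=2)
print(z)
```

Because each row only uses values up to its own `time_id`, this kind of feature could in principle be computed online from stored history, unlike refitting the scaler on the whole validation period.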

In [None]:
FEATID = features + ['investment_id']
TGTID = ['investment_id', 'target']
pearson_means = []
holdout_size = 60
FOLDS = int(len(np.unique(train['time_id'])) / holdout_size)-1
gtss = GroupTimeSeriesSplit(n_folds=FOLDS, holdout_size=holdout_size, groups=train['time_id'], cv=True)
for fold, (tr, val) in enumerate(gtss.split(train)):
    X_train = train.loc[tr, FEATID]
    y_train = train.loc[tr, TGTID]

    # Outlier treatment and Normalization of X_train (by investment_id)
    scalers = {}
    lower_qs = {}
    upper_qs = {}
    for invest_id in X_train['investment_id'].unique():
        scalers[invest_id] = StandardScaler()
    def ft_x(x):
        invest_id = x['investment_id'].values[0]
        x_num = x.drop('investment_id', axis=1).values
        lower_qs[invest_id] = np.quantile(x_num, q=0.01, axis=0)
        upper_qs[invest_id] = np.quantile(x_num, q=0.99, axis=0)
        x_num = np.clip(x_num, lower_qs[invest_id], upper_qs[invest_id])
        x_num = scalers[invest_id].fit_transform(x_num)
        x[features] = x_num.astype(np.float16)
        return x
    X_train = X_train.groupby('investment_id').apply(lambda x: ft_x(x))[features].astype(np.float16)

    # Outlier treatment of y_train (by investment_id)
    y_upper_qs = {}
    def ft_y(x):
        invest_id = x['investment_id'].values[0]
        x_num = x.drop('investment_id', axis=1).values
        y_upper_qs[invest_id] = np.quantile(x_num, q=0.97, axis=0)
        x_num = np.clip(x_num, None, y_upper_qs[invest_id])
        x['target'] = x_num.astype(np.float16)
        return x
    y_train = y_train.groupby('investment_id').apply(lambda x: ft_y(x))['target'].astype(np.float16)

    model = LinearRegression()
    model.fit(X_train, y_train)
    train_val = train.loc[val, :]
    df_pearson_all = pd.DataFrame(columns=['time_id', 'investment_id', 'y_pred', 'y_true'])
    df_pearson_all['time_id'] = train_val['time_id'].values
    df_pearson_all['investment_id'] = train_val['investment_id'].values
    train_val = train_val.drop(['time_id', 'target'], axis=1)  # avoid an in-place drop on a slice of train

    # REFIT the scaler on X_val (clipping still uses the training quantiles)
    # if a new investment_id appears, we do nothing (we could apply an average scaler/quantile, for example...).
    def t_x(x):
        invest_id = x['investment_id'].values[0]
        if invest_id in lower_qs.keys():
            x_num = x.drop('investment_id', axis=1).values
            x_num = np.clip(x_num, lower_qs[invest_id], upper_qs[invest_id])
            x_num = scalers[invest_id].fit_transform(x_num)
            x[features] = x_num.astype(np.float16)
        else:
            #print("invest id not found in train: {}".format(invest_id))
            pass
        return x
    X_val = train_val.groupby('investment_id').apply(lambda x: t_x(x))[features].astype(np.float16)
    y_val = train.loc[val, 'target'].astype(np.float16)
    # evaluation
    df_pearson_all['y_pred'] = model.predict(X_val)
    df_pearson_all['y_true'] = y_val.values
    m = df_pearson_all.groupby('time_id').apply(lambda x: pearsonr(x['y_true'], x['y_pred'])[0]).mean()
    std = df_pearson_all.groupby('time_id').apply(lambda x: pearsonr(x['y_true'], x['y_pred'])[0]).std()
    pearson_means.append(m)
    r2_train = model.score(X_train, y_train)
    r2_val = model.score(X_val, y_val)
    print("fold: {}, val mean corr: {:.3%}, val stdev corr: {:.3%}, train r2: {:.3%}, val r2: {:.3%}".format(fold, m, std, r2_train, r2_val))
    del X_train, y_train, X_val, y_val, tr, val
    gc.collect()

print("Pearson Mean over Folds: {:.3%}, Stdev over Folds {:.3%}".format(np.array(pearson_means).mean(), np.array(pearson_means).std()))
print()

## 5. What is missing?

I used a linear regression model with target clipping and feature clipping/normalization. I have not achieved a high correlation score so far.

I think the model is missing the following capabilities:
- switching between market regimes to deal with domain shift.
- taking inter-feature (interaction) effects into account.
- handling outliers (this could improve the OOS correlation a bit).

In [None]:
print("{}{: >25}{}{: >10}{}".format('|','Variable Name','|','Memory','|'))
print(" ------------------------------------ ")
for var_name in dir():
    if not var_name.startswith("_") and sys.getsizeof(eval(var_name)) > 1000000:
        print("{}{: >25}{}{: >10}{}".format('|',var_name,'|',sys.getsizeof(eval(var_name))/1000/1000/1000,'|'))