# Introduction
- I am new to Kaggle competitions and have joined this one.
- I have tried my best but have not achieved a high score so far.
- I would like to share what I have done (a very basic analysis, though). Any comments or advice would be very helpful (and please upvote if you find this useful)!

# Contents (List of Questions I came up with)

- Which model should we use? (**Part I**)
- Should we use all the data for training? If not, how to filter them out? (**Part I**)
- CV and the baseline score (**Part I**)
- Should we apply Target preprocessing? If so, how? (**Part II**)
- Should we apply Feature preprocessing? If so, how? (**Part II**)
- What is missing?

# 1. Which model should we use?

Financial time series data has a few well-known characteristics:
- non-stationarity
- low signal/noise ratio
- etc.

Common GBDT/NN models can easily overfit the training data unless they are tuned with care. That is why I started with a simple linear regression model (with and without regularization). Interactions between features could also improve the score, so I would like to try them if time allows (Part III?).
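To make the "with/without regularization" idea concrete, here is a minimal sketch on made-up data. It assumes a weak linear signal buried in noise, and it uses a hand-written closed-form ridge solver (`fit_ridge` is a hypothetical helper, not part of this notebook) instead of the scikit-learn estimators imported below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 40 features, a weak linear signal, heavy noise.
n_train, n_test, n_feat = 200, 2000, 40
true_coef = rng.normal(scale=0.05, size=n_feat)
X_tr = rng.normal(size=(n_train, n_feat))
X_te = rng.normal(size=(n_test, n_feat))
y_tr = X_tr @ true_coef + rng.normal(size=n_train)
y_te = X_te @ true_coef + rng.normal(size=n_test)

def fit_ridge(X, y, alpha):
    """Closed-form ridge solution; alpha=0 reduces to ordinary least squares."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ y)

for alpha in (0.0, 100.0):
    coef = fit_ridge(X_tr, y_tr, alpha)
    corr = np.corrcoef(X_te @ coef, y_te)[0, 1]
    print(f"alpha={alpha:>6}: coef norm={np.linalg.norm(coef):.3f}, test corr={corr:.3f}")
```

Regularization always shrinks the coefficient norm. Whether it also improves the out-of-sample correlation depends on the data, which is exactly why it is worth checking with CV.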

# 2. Should we use all the data for training? If not, how to filter them out?

- Because we use a linear model, we do not need a massive number of training samples.
- At the very least, each time_id should have enough training samples.
- Each time_id can have at most ~3500 entries, one per investment_id.
- Dealing with the data asset by asset would be difficult, so we ignore investment_id during training.
- Here we identify the subset of time_ids that have enough investment_ids.
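As a rough illustration of the last point, the number of distinct investment_ids per time_id can be counted with a `groupby`. The frame and the threshold `min_assets` below are made up for the example; the notebook itself uses a different, grid-based procedure in the following cells.

```python
import pandas as pd

# Hypothetical toy frame with the same two key columns as the competition data.
toy = pd.DataFrame({
    "time_id":       [0, 0, 0, 1, 2, 2, 2, 2],
    "investment_id": [1, 2, 3, 1, 1, 2, 3, 4],
})

# Number of distinct assets observed at each time_id.
assets_per_time = toy.groupby("time_id")["investment_id"].nunique()

min_assets = 3  # illustrative threshold, not the value used in this notebook
kept_time_ids = assets_per_time[assets_per_time >= min_assets].index.tolist()

print(assets_per_time.to_dict())
print(kept_time_ids)
```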

In [None]:
import os
import sys
import gc
import psutil
import time
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, Ridge, LinearRegression, LassoCV
from sklearn.preprocessing import StandardScaler
import seaborn as sns
from sklearn import metrics
from scipy.stats import pearsonr
from typing import Tuple

mem = psutil.virtual_memory()
print(f'Memory usage rate: {mem.percent}%')
print(f'Memory used: {mem.used/1000/1000/1000:.2f} GB')
print(f'Memory available: {mem.available/1000/1000/1000:.2f} GB')

### Load Data
Thanks to https://www.kaggle.com/datasets/robikscube/ubiquant-parquet, we can reduce the memory usage.

In [None]:
%%time
def reduce_memory_usage(df, features):
    for feature in features:
        item = df[feature].astype(np.float16)
        df[feature] = item
        del item
        gc.collect()
        
target = 'target'
n_features = 300
features = [f'f_{i}' for i in range(n_features)]
feature_columns = ['investment_id', 'time_id'] + features
X = pd.read_parquet('../input/ubiquant-parquet/train_low_mem.parquet', columns=feature_columns + ["target"])
reduce_memory_usage(X, features + ["target"])
print(X.shape)
X.head()

### Keep only the data with enough samples for each time_id and each investment_id
- Reorganize the data into a fixed-size table of shape `[number of time_id, number of investment_id]`.
- Entries that do not exist in the original data are set to null.
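For intuition, the same kind of time_id × investment_id table can be built on a toy frame with `DataFrame.pivot`. This is only a sketch of the target layout on made-up values, not the code used in this notebook (the cell below builds the table with an explicit join).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: asset 20 has no row at time_id 1.
toy = pd.DataFrame({
    "time_id":       [0, 0, 1, 2, 2],
    "investment_id": [10, 20, 10, 10, 20],
    "target":        [0.1, -0.2, 0.3, 0.0, 0.5],
})

# Rows are time_ids, columns are investment_ids; missing pairs become NaN.
grid = toy.pivot(index="time_id", columns="investment_id", values="target")

print(grid)
print(grid.isnull().sum(axis=0))  # missing time_ids per asset
```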

In [None]:
time_id_list = list(X['time_id'].unique())
invetment_id_list = list(X['investment_id'].unique())
data_dummy = [np.nan for _ in range(len(time_id_list)*len(invetment_id_list))]
X_filled = pd.DataFrame(data_dummy, columns=['dummy'])
X_filled['time_id'] = np.repeat(time_id_list, len(invetment_id_list))
X_filled['investment_id'] = np.tile(invetment_id_list, len(time_id_list))
X_filled = X_filled.set_index(['time_id', 'investment_id'])
X_orig = X.set_index(['time_id','investment_id'])
X_orig = X_orig[['target']]
X_filled = X_filled.join(X_orig).drop('dummy', axis=1) # join
X_filled = X_filled.unstack() # move investment_id to columns
X_filled = X_filled.T.reset_index(level=0).drop('level_0', axis=1).T # multi columns to single column
print(X_filled.shape)
X_filled.head()

Keep assets (investment_id) with more than 500 samples (time_id)

In [None]:
thred = len(X_filled) - 500 # assets with at least 500 samples (about 2 years) are handled in the normal training process.
#print(thred)
df_null_count = X_filled.isnull().sum(axis=0).sort_values(ascending=False)
df_chosen = df_null_count[df_null_count < thred]
investment_id_list = list(df_chosen.index)
X_filled_chosen = X_filled[investment_id_list]
print("Number of assets(investment_id) with more than 500 samples(time_id): {}".format(len(investment_id_list)))

Remove time_ids where:
- only a small number of investment_ids have data, and/or
- the number of investment_ids with data drops drastically around that time_id.
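The two rules can be tried out on a tiny made-up series of missing counts. The thresholds `max_missing` and `max_jump` mirror the values 1000 and 200 used in the next cell; the counts themselves are hypothetical.

```python
import pandas as pd

# Hypothetical number of missing investment_ids at each time_id.
nnull = pd.Series([900, 850, 1200, 300, 320, 600, 610],
                  index=pd.Index(range(7), name="time_id"))

max_missing, max_jump = 1000, 200

# Increase in missing assets relative to the previous time_id.
# The first value is NaN, and NaN > max_jump evaluates to False.
jump = nnull.diff(1)

too_many_missing = (nnull > max_missing) | (jump > max_jump)
print(too_many_missing.tolist())
```

Note that the second rule only flags a sudden increase in missing assets; a sudden decrease (assets coming back) is kept.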

In [None]:
df_nnull = X_filled_chosen.isnull().sum(axis=1).to_frame()
df_nnull.columns = ['nnull']
df_nnull['nnull_diff'] = df_nnull['nnull'].diff(1)
df_nnull['too_many_missing'] = (df_nnull['nnull'] > 1000) | (df_nnull['nnull_diff'] > 200)
df_nnull.plot(figsize=(20,6), grid=True, title='Number of missing investment_ids w.r.t. time_id')

In [None]:
# Remove Missing Values
print(df_nnull['too_many_missing'].values.sum())
X_filled_chosen = X_filled_chosen.iloc[~df_nnull['too_many_missing'].values, :]
X_filled_chosen.isnull().sum(axis=1).plot(figsize=(20,4), grid=True, title='Number of missing investment_id w.r.t. time_id')

Forward-fill (ffill) the remaining missing values.
- Note that missing values can still remain at the oldest entries in the history, where there is no earlier value to fill from.
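A small made-up example shows why leading gaps survive a forward fill and why a final `dropna` is still needed (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"asset_a": [np.nan, 0.1, np.nan, 0.3],
     "asset_b": [0.5, np.nan, np.nan, 0.2]},
    index=pd.Index([0, 1, 2, 3], name="time_id"),
)

# Each gap takes the last known value of the same asset.
filled = toy.ffill()
print(filled)

# asset_a has no earlier value at time_id 0, so that row is dropped.
clean = filled.dropna()
print(clean.index.tolist())
```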

In [None]:
# cast to float64 first, then forward-fill (fillna(method=...) is deprecated in recent pandas)
X_filled_chosen = X_filled_chosen.astype("float").ffill()
X_filled_chosen.isnull().sum(axis=1).plot(figsize=(20,4), grid=True, title='Number of missing investment_id w.r.t. time_id')

Lastly, remove the data with missing values.

In [None]:
X_filled_chosen.dropna(inplace=True)
X_filled_chosen.isnull().sum(axis=1).plot(figsize=(20,4), grid=True, title='Number of missing investment_id w.r.t. time_id')
time_id_chosen = list(X_filled_chosen.index)

## Resulting Training Data

- About 500 time_ids are left for training. Usefully, the most recent data remains in the training set, which matters because we want to forecast future target values.

In [None]:
train = X[X['time_id'].isin(time_id_chosen)]
print(train.shape)
train.head()

In [None]:
# a bit of cleaning in memory
del X_filled, X_filled_chosen, X_orig, data_dummy
gc.collect()

# 3. CV and the baseline score

CV is important because it tells us:
- the mean accuracy of the model across different market environments (expected to be at the same level as the Public/Private LB score; larger is better).
- the standard deviation of that accuracy across different market environments (the Public/Private LB score should vary within this width; smaller is better).
- when the model works well or badly (which can give us ideas for improving it).

Thanks to https://www.kaggle.com/c/ubiquant-market-prediction/discussion/304036, we use a simple K-fold CV:
- No shuffling.
- I modified it slightly so that all samples outside the validation data are used for training.
- The holdout size is set to 60 weekdays (aligned with the likely forecasting period of the Private LB, 2022/4/18~7/18).
- I did not use CPCV (in my experience it does not change the results much; to be honest, I should have checked the autocorrelations beforehand).
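The fold layout described above can be sketched on ten made-up time_ids with a holdout size of 3. This is a simplified illustration that uses every block as a validation fold, whereas the class below is given an explicit number of folds.

```python
import numpy as np

time_ids = np.arange(10)   # hypothetical, already sorted time_ids
holdout_size = 3

# Contiguous blocks of at least holdout_size time_ids, most recent first.
blocks = np.array_split(time_ids, len(time_ids) // holdout_size)[::-1]

folds = []
for val_ids in blocks:
    # Train on everything outside the validation block (no shuffling).
    outside = (time_ids < val_ids.min()) | (time_ids > val_ids.max())
    train_ids = time_ids[outside]
    folds.append((train_ids.tolist(), val_ids.tolist()))
    print("val:", val_ids.tolist(), "train:", train_ids.tolist())
```

Because training data can come from after the validation block, this is a K-fold layout on contiguous blocks rather than a strictly walk-forward split.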

The baseline model is a simple linear regression.
- We do not create custom features (we just use the 300 features f_0~f_299).

We can observe the following from the CV result:
- The mean correlation is 11.5% with a standard deviation of 0.9% across folds.
- We can never know the future, but we can expect the Public/Private LB score to fall within 11.5% ± 0.9% (1 sigma).
- We can improve the model by investigating the fold with the lowest mean correlation (for example).
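The 1-sigma band is simply the mean of the per-fold scores plus or minus their standard deviation. A minimal sketch, using made-up fold scores that are not results from this notebook:

```python
import numpy as np

# Hypothetical per-fold mean correlations (illustrative numbers only).
fold_means = np.array([0.08, 0.10, 0.12])

mean = fold_means.mean()
std = fold_means.std()          # population std (ddof=0), the NumPy default
low, high = mean - std, mean + std

print(f"mean={mean:.3%}, std={std:.3%}, 1-sigma band=[{low:.3%}, {high:.3%}]")
```

With only a handful of folds the standard deviation itself is a noisy estimate, so the band is best read as a rough guide.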

In [None]:
class GroupTimeSeriesSplit:
    """
    From: https://www.kaggle.com/c/ubiquant-market-prediction/discussion/304036
    Custom class to create a Group Time Series Split. We ensure
    that the time id values that are in the testing data are not a part
    of the training data & the splits are temporal
    """
    def __init__(self, n_folds: int, holdout_size: int, groups: pd.Series, cv: bool = False) -> None:
        self.n_folds = n_folds
        self.holdout_size = holdout_size
        self.groups = groups
        self.cv = cv

    def split(self, X) -> Tuple[np.array, np.array]:
        # Take the group column and get the unique values
        unique_time_ids = np.unique(self.groups.values)

        # Split the time ids into the length of the holdout size
        # and reverse so we work backwards in time. Also, makes
        # it easier to get the correct time_id values per
        # split
        array_split_time_ids = np.array_split(
            unique_time_ids, len(unique_time_ids) // self.holdout_size
        )[::-1]

        # Get the first n_folds values
        array_split_time_ids = array_split_time_ids[:self.n_folds]

        for time_ids in array_split_time_ids:
            # Get test index - time id values that are in the time_ids
            test_condition = X['time_id'].isin(time_ids)
            test_index = X.loc[test_condition].index

            # Get train index - The train index will be the time
            # id values right up until the minimum value in the test
            # data - we can also add a gap to this step by
            # time id < (min - gap)
            if self.cv:
                train_condition = ( X['time_id'] < (np.min(time_ids)) ) | ( X['time_id'] > (np.max(time_ids)) )
            else:
                train_condition = X['time_id'] < (np.min(time_ids))
            train_index = X.loc[train_condition].index

            yield train_index, test_index

In [None]:
%%time

FEATS = features + ['investment_id', 'time_id']
models = []
pearson_means = []
pearsons_ts = []
holdout_size = 60 # aligned to forecasting period (4/18~7/18)
FOLDS = int(len(np.unique(train['time_id'])) / holdout_size)-1 # almost all of the history serves as validation data in some fold (every holdout block except the oldest), to check how well the model generalizes.
gtss = GroupTimeSeriesSplit(n_folds=FOLDS, holdout_size=holdout_size, groups=train['time_id'], cv=True)
for fold, (tr, val) in enumerate(gtss.split(train)):
    print('FOLD:', fold)
    
    # Training
    X_train = train.loc[tr, FEATS]
    y_train = train.loc[tr, 'target']
    del tr
    gc.collect()
    print('Train time_id range:', X_train['time_id'].min(), '->', X_train['time_id'].max(), 'Nb time_id: ', len(X_train['time_id'].unique()))
    
    model = LinearRegression()
    model.fit(X_train.drop(['investment_id', 'time_id'], axis=1), y_train)
    models.append(model)    
    del X_train, y_train
    gc.collect()
    
    # Evaluation
    X_val = train.loc[val, FEATS]
    y_val = train.loc[val, 'target']
    del val
    gc.collect()
    print('Val time_id range:', X_val['time_id'].min(), '->', X_val['time_id'].max(), 'Nb time_id: ', len(X_val['time_id'].unique()))

    time_ids_val = X_val['time_id'].values
    X_val['y_pred'] = model.predict(X_val.drop(['investment_id', 'time_id'], axis=1))
    X_val['y_true'] = y_val.values
    X_val['time_id'] = time_ids_val
    
    pearson = X_val[['time_id', 'y_true', 'y_pred']].groupby('time_id').apply(lambda x: pearsonr(x['y_true'], x['y_pred'])[0])
    pearson_mean = pearson.mean()
    pearson_stdev = pearson.std()
    print('Pearson Mean: {:.3%}, Pearson Stdev: {:.3%}'.format(pearson_mean, pearson_stdev))
    print()
    pearson_means.append(pearson_mean)
    df_pearson = pearson.to_frame('corr')  # one row per time_id, indexed by time_id
    pearsons_ts.append(df_pearson)
    del X_val, y_val, time_ids_val
    gc.collect()
    
print("Pearson Mean over Folds: {:.3%}, Stdev over Folds {:.3%}".format(np.array(pearson_means).mean(), np.array(pearson_means).std()))
print()

### Correlation in Time for Validation Dataset
- The correlation oscillates around 10% in every fold.
- Its volatility depends on the fold.
- Nothing looks wrong, but the average correlation level needs to be lifted!

In [None]:
nr = int(len(pearsons_ts)/2)+1
fig, ax = plt.subplots(nr, 2, figsize=(20, 8+nr))
for i in range(len(pearsons_ts)):
    df_pearson = pearsons_ts[i].groupby('time_id').mean()
    df_pearson.columns = ['correl']
    df_pearson.plot(grid=True, ax=ax[int(i/2), i%2])

# 4. Should we apply Target preprocessing? If so, how?
See **Part II**

# 5. Should we apply Feature preprocessing? If so, how?
See **Part II**

In [None]:
print("{}{: >25}{}{: >10}{}".format('|','Variable Name','|','Memory','|'))
print(" ------------------------------------ ")
for var_name in dir():
    if not var_name.startswith("_") and sys.getsizeof(eval(var_name)) > 1000000:
        print("{}{: >25}{}{: >10}{}".format('|',var_name,'|',sys.getsizeof(eval(var_name))/1000/1000/1000,'|'))