## Reference:

Kaggle \
[G-Research Crypto Forecasting - baseline & FE](https://www.kaggle.com/vbmokin/g-research-crypto-forecasting-baseline-fe)\
[G-Research- Starter LGBM Pipeline(copied)](https://www.kaggle.com/yliu27/g-research-starter-lgbm-pipeline-copied)\
[[GResearch] Simple LGB Starter](https://www.kaggle.com/code1110/gresearch-simple-lgb-starter)\
[LightGBM with Sklearn pipelines](https://www.kaggle.com/paweljankiewicz/lightgbm-with-sklearn-pipelines)\
[Parameter grid search LGBM with scikit-learn](https://www.kaggle.com/bitit1994/parameter-grid-search-lgbm-with-scikit-learn)

External \
[You Are Missing Out on LightGBM. It Crushes XGBoost in Every Aspect](https://towardsdatascience.com/how-to-beat-the-heck-out-of-xgboost-with-lightgbm-comprehensive-tutorial-5eba52195997)
[Machine Learning Tutorial Python - 16: Hyper parameter Tuning (GridSearchCV)](https://www.youtube.com/watch?v=HdlDYng8g9s&t=16s)

# Environment Setup

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/g-research-crypto-forecasting/example_sample_submission.csv
/kaggle/input/g-research-crypto-forecasting/asset_details.csv
/kaggle/input/g-research-crypto-forecasting/example_test.csv
/kaggle/input/g-research-crypto-forecasting/train.csv
/kaggle/input/g-research-crypto-forecasting/supplemental_train.csv
/kaggle/input/g-research-crypto-forecasting/gresearch_crypto/competition.cpython-37m-x86_64-linux-gnu.so
/kaggle/input/g-research-crypto-forecasting/gresearch_crypto/__init__.py


In [2]:
import sys
sys.path.insert(0, '/kaggle/input/g-research-crypto-forecasting')
# somehow need to run this before importing competition API

import gresearch_crypto
import time
from datetime import datetime

import warnings
warnings.simplefilter('ignore')

dir_in = '/kaggle/input/g-research-crypto-forecasting/'
file_train = 'train.csv'
file_asset_details = 'asset_details.csv'

df_train = pd.read_csv(os.path.join(dir_in, file_train))
df_asset_details = pd.read_csv(os.path.join(dir_in, file_asset_details))

In [3]:
import random

def fix_all_seeds(seed):
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

fix_all_seeds(2021)

# Feature Engineering

In [4]:
def get_features(df):
    
    df = df.set_index('timestamp')
    
    df['upper_shadow'] = df['High'] / df[['Close', 'Open']].max(axis=1)
    df['lower_shadow'] = df[['Close', 'Open']].min(axis=1) / df['Low']
    df['open2close'] = df['Close'] / df['Open']
    df['high2low'] = df['High'] / df['Low']
    
    mean_price = df[['Open', 'High', 'Low', 'Close']].mean(axis=1)
    median_price = df[['Open', 'High', 'Low', 'Close']].median(axis=1)
    
    df['high2mean'] = df['High'] / mean_price
    df['low2mean'] = df['Low'] / mean_price
    df['high2median'] = df['High'] / median_price
    df['low2median'] = df['Low'] / median_price
    df['volume2count'] = df['Volume'] / (df['Count'] + 1)
    
    return df    

In [5]:
def get_asset_data(df_train, asset_id):
    
    df = df_train[df_train["Asset_ID"] == asset_id].copy()
    df = df.replace([np.inf, -np.inf], np.nan)
    y = df['Target'].copy()
    y = y.fillna(0)
    X = df.drop('Target', axis=1)
    
    return X, y

In [6]:
def get_corr(y_pred, y):
    corr = np.corrcoef(y_pred, y)[0,1]
    return corr

# Pipeline

In [7]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from lightgbm import LGBMRegressor
from category_encoders import OneHotEncoder

import time     # timer

# customize class for feature transformation
from sklearn.base import BaseEstimator, TransformerMixin

class GetFeatureTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X_ = get_features(X)
        return X_

In [8]:
cat_cols = ['Asset_ID']
num_cols = ['Count', 'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP',
            'upper_shadow', 'lower_shadow', 'open2close', 'high2low', 'high2mean', 'low2mean', 'high2median', 'low2median', 'volume2count']

[hyper parameter optimization - suggested parameter grid](https://github.com/Microsoft/LightGBM/issues/695)

>For heavily unbalanced datasets such as 1:10000:
>
>max_bin: keep it only for memory pressure, not to tune (otherwise overfitting)\
>learning rate: keep it only for training speed, not to tune (otherwise overfitting)\
>n_estimators: must be infinite (like 9999999) and use early stopping to auto-tune (otherwise overfitting)\
>num_leaves: [7, 4095]\
>max_depth: [2, 63] and infinite (I personally saw metric performance increases with such 63 depth with small number of leaves on sparse unbalanced datasets)\
>scale_pos_weight: [1, 10000] (if over 10000, something might be wrong because I never saw it that good after 5000)\
>min_child_weight: [0.01, (sample size / 1000)] if you are using logloss (think about the hessian possible value range before putting "sample size / 1000", it is dataset-dependent and loss-dependent)\
>subsample: [0.4, 1]\
>bagging_freq: only 1, keep as is (otherwise overfitting)\
>colsample_bytree: [0.4, 1]\
>is_unbalance: false (make your own weighting with scale_pos_weight)\
>USE A CUSTOM METRIC (to reflect reality without weighting, otherwise you have weights inside your metric with premade metrics like xgboost)\
>Never tune these parameters unless you have an explicit requirement to tune them:
>
>Learning rate (lower means longer to train but more accurate, higher means smaller to train but less accurate)\
>Number of boosting iterations (automatically tuned with early stopping and learning rate)\
>Maximum number of bins (RAM dependent)


In [9]:
params = {
    'n_estimators': 1000,
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'max_depth': -1,
    'learning_rate': 0.01,
    'subsample': 0.72,
    'subsample_freq': 4,
    'feature_fraction': 0.4,
    'lambda_l1': 1,
    'lambda_l2': 1   
}

In [10]:
pipe_lgbm = Pipeline(steps=[
    ('get_feature', GetFeatureTransformer()),
    ('transform_columns', ColumnTransformer([
        ('tf_num', StandardScaler(), num_cols),
        ('tf_cat', OneHotEncoder(), cat_cols)
    ])),
    ('model', LGBMRegressor(**params))
])

# Training

In [11]:
X_train = {}
y_train = {}
model_lgbm = {}
y_insmpl_pred = {}
score_insmpl = {}

# for asset_id, asset_name in zip([10], ['Maker']):
for asset_id, asset_name in zip(df_asset_details['Asset_ID'], df_asset_details['Asset_Name']):
    start_ts = time.time()
    print(f"Training model for {asset_name:<16} (ID={asset_id:<2})...")
    
    X, y = get_asset_data(df_train, asset_id)
    model = pipe_lgbm.fit(X, y)
#     y_pred = model.predict(X)
    y_pred = cross_val_predict(pipe_lgbm, X, y, cv = 5)

    score = get_corr(y_pred, y)
    
#     print(f"In-sample test score for {asset_name:<16} {score:.4f}")
    print(f"Cross validation test score for {asset_name:<16} {score:.4f}")
    
    X_train[asset_id] = X
    y_train[asset_id] = y
    model_lgbm[asset_id] = model
    y_insmpl_pred[asset_id] = y_pred
    score_insmpl[asset_id] = score
    
    end_ts = time.time()
    print(f"Time consumption: {(end_ts-start_ts)/60:.2f}min")

Training model for Bitcoin Cash     (ID=2 )...
Cross validation test score for Bitcoin Cash     0.0047
Time consumption: 6.56min
Training model for Binance Coin     (ID=0 )...
Cross validation test score for Binance Coin     0.0046
Time consumption: 7.50min
Training model for Bitcoin          (ID=1 )...
Cross validation test score for Bitcoin          0.0215
Time consumption: 9.07min
Training model for EOS.IO           (ID=5 )...
Cross validation test score for EOS.IO           0.0053
Time consumption: 8.62min
Training model for Ethereum Classic (ID=7 )...
Cross validation test score for Ethereum Classic 0.0036
Time consumption: 6.48min
Training model for Ethereum         (ID=6 )...
Cross validation test score for Ethereum         0.0243
Time consumption: 9.21min
Training model for Litecoin         (ID=9 )...
Cross validation test score for Litecoin         0.0065
Time consumption: 8.79min
Training model for Monero           (ID=11)...
Cross validation test score for Monero           0

In [12]:
import traceback

df_test_all = {}
df_pred_all = {}

env = gresearch_crypto.make_env()
iter_test = env.iter_test()

In [13]:
for i, (df_test, df_pred) in enumerate(iter_test):
    
    # make predictions
    for j, row in df_test.iterrows():
        asset_id = row['Asset_ID']
        try:
            y_pred = model_lgbm[asset_id].predict(row.to_frame().T)[0]
        except:
            y_pred = 0.0
            traceback.print_exc()
        df_pred.loc[df_pred['row_id']==row['row_id'], 'Target'] = y_pred
        
    # store test dataframes
    df_test_all[i] = df_test
    df_pred_all[i] = df_pred
    
    # submit predictions
    env.predict(df_pred)

This version of the API is not optimized and should not be used to estimate the runtime of your code on the hidden test set.


In [14]:
file_smpl_subm = 'example_sample_submission.csv'
df_smpl_subm = pd.read_csv(os.path.join(dir_in, file_smpl_subm))

In [15]:
df_smpl_subm.head()

Unnamed: 0,group_num,row_id,Target
0,0,0,0
1,0,1,0
2,0,2,0
3,0,3,0
4,0,4,0


In [16]:
df_subm_wgid = pd.DataFrame(columns = df_smpl_subm.columns)
df_subm = pd.DataFrame(columns = ['row_id', 'Target'])

In [17]:
for group_num, df_pred in df_pred_all.items():
    df = df_pred.copy()
    
    # without group_num
    df_subm = df_subm.append(df)
    
    # with group_num
    df['group_num'] = group_num
    df_subm_wgid = df_subm_wgid.append(df)

In [18]:
df_subm_wgid.head()

Unnamed: 0,group_num,row_id,Target
0,0,0,-0.001065
1,0,1,-0.001376
2,0,2,-0.000756
3,0,3,-0.000719
4,0,4,-3.8e-05


In [19]:
df_subm.to_csv('submission.csv', index=False)
df_subm_wgid.to_csv('submission_with_group_num.csv', index=False)