## Introduction
This notebook introduces how our team reached 18th place on the public leaderboard with an **LGBM and CatBoost ensemble model with holdout CV**, and demonstrates the **online learning process that collects public leaderboard data to continuously update the model** (~40 days after date_id 480).

These methods are designed for public leaderboard data and should not be extrapolated to private leaderboard predictions. (This is not the notebook we submitted for the private leaderboard.)
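
As a rough illustration of the ensemble step, the sketch below blends two prediction vectors, re-centres them on a target mean, and clips the result. The weights (0.67 / 0.33) and the clip range (-64, 64) are the values used in the inference loop at the end of this notebook; the function name `blend_predictions` and the toy inputs are hypothetical and only serve the example.

```python
import numpy as np

def blend_predictions(lgb_pred, ctb_pred, target_mean,
                      w_lgb=0.67, w_ctb=0.33, y_min=-64.0, y_max=64.0):
    """Weighted average of two models, re-centred on target_mean, then clipped."""
    lgb_pred = np.asarray(lgb_pred, dtype=float)
    ctb_pred = np.asarray(ctb_pred, dtype=float)
    blended = w_lgb * lgb_pred + w_ctb * ctb_pred
    # Shift the predictions so that their mean equals target_mean
    blended = blended - blended.mean() + target_mean
    return np.clip(blended, y_min, y_max)

# Toy inputs (made up for illustration)
lgb_pred = np.array([1.0, -2.0, 3.0])
ctb_pred = np.array([2.0, -1.0, 0.0])
blended = blend_predictions(lgb_pred, ctb_pred, target_mean=0.0)
print(blended)
```

After the shift, the predictions for one time step average to `target_mean`, so the blend only changes the relative ordering and spread across stocks.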

Special thanks to the authors of the following public notebooks for their invaluable contributions:

- **@lblhandsome**, ["⚡Optiver 🚀Robust Best ⚡ Single Model"](https://www.kaggle.com/code/lblhandsome/optiver-robust-best-single-model)

- **@zulqarnainali**, ["LGB Fine-tuned 🚀(Explained)"](https://www.kaggle.com/code/zulqarnainali/lgb-fine-tuned-explained)

- **@verracodeguacas**, ["🖐️-Fold CV 🚀"](https://www.kaggle.com/code/verracodeguacas/fold-cv)

- **@yekenot**, ["Feature Elimination by CatBoost"](https://www.kaggle.com/code/yekenot/feature-elimination-by-catboost)

- **@jirkaborovec**, ["📈Optiver📉: Feature Eng & Optuna🌪️LightGBM@GPU"](https://www.kaggle.com/code/jirkaborovec/optiver-feature-eng-optuna-lightgbm-gpu)

- **@yunchonggan**, ["Weights of the Synthetic Index"](https://www.kaggle.com/competitions/optiver-trading-at-the-close/discussion/442851)


I would also like to extend my great appreciation to my teammates for their tremendous effort in this competition:

- **@chellyfan**
- **@unknownkid**



Please note that you should run this notebook with **GPU T4 x2**. The P100 GPU runs into memory issues when training CatBoost on the Kaggle platform.
See this notebook for details: https://www.kaggle.com/code/yekenot/feature-elimination-by-catboost

# 🧹 Importing necessary libraries

In [None]:
import gc  # Garbage collection for memory management
import os  # Operating system-related functions
import time  # Time-related functions
import warnings  # Handling warnings
from itertools import combinations  # For creating combinations of elements
from warnings import simplefilter  # Simplifying warning handling

# 📦 Importing machine learning libraries
import joblib  # For saving and loading models
import xgboost as xgb
import lightgbm as lgb  # LightGBM gradient boosting framework
import catboost as ctb
import numpy as np  # Numerical operations
import pandas as pd  # Data manipulation and analysis
# from pandarallel import pandarallel
from sklearn.metrics import mean_absolute_error  # Metric for evaluation
from sklearn.model_selection import KFold, TimeSeriesSplit  # Cross-validation techniques

# 🤐 Disable warnings to keep the code clean
warnings.filterwarnings("ignore")
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)
# pandarallel.initialize(nb_workers=4)

# 📊 Define flags and variables
# is_offline = True  # Flag for online/offline mode
# is_train = False  # Flag for training mode
# is_infer = False  # Flag for inference mode
max_lookback = np.nan  # Maximum lookback (not specified)
split_day = 435  # Split day for time series data
N_STOCKS = 200
MAX_N_NEIGHBOURS = 10
NEIGHBOUR_CORR_THRESHOLD = 0.3

# 📊 Data Loading and Preprocessing 📊

In [None]:
# 📂 Read the dataset from a CSV file using Pandas
df = pd.read_csv("/kaggle/input/optiver-trading-at-the-close/train.csv")

# 🧹 Remove rows with missing values in the "target" or "wap" columns
df = df.dropna(subset=["target"])
df = df.dropna(subset=["wap"])

# 🔁 Reset the index of the DataFrame and apply the changes in place
df.reset_index(drop=True, inplace=True)

# 📏 Get the shape of the DataFrame (number of rows and columns)
df_shape = df.shape

df_train = df

# 🚀 Memory Optimization Function with Data Type Conversion 🧹
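
To give a feel for what downcasting saves, here is a minimal sketch on a toy frame. The column names mirror the competition data, but the values and the frame itself are made up for illustration. It casts an `int64` column to `int16` and a `float64` column to `float32`, which is the kind of conversion the function below performs.

```python
import numpy as np
import pandas as pd

# Toy frame: 1000 rows, one integer and one float column (both 8 bytes per value)
df = pd.DataFrame({
    "stock_id": np.arange(1000, dtype=np.int64) % 200,  # values 0..199
    "wap": np.linspace(0.99, 1.01, 1000),               # float64 by default
})
before = df.memory_usage(index=False).sum()

# 0..199 does not fit in int8 (max 127) but fits in int16
df["stock_id"] = df["stock_id"].astype(np.int16)
df["wap"] = df["wap"].astype(np.float32)
after = df.memory_usage(index=False).sum()

print(f"{before} bytes -> {after} bytes")
```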

In [None]:
# 🧹 Function to reduce memory usage of a Pandas DataFrame
def reduce_mem_usage(df, verbose=0):
    """
    Iterate through all numeric columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    
    # 📏 Calculate the initial memory usage of the DataFrame
    start_mem = df.memory_usage().sum() / 1024**2

    # 🔄 Iterate through each column in the DataFrame
    for col in df.columns:
        col_type = df[col].dtype

        # Check if the column's data type is not 'object' (i.e., numeric)
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            
            # Check if the column's data type is an integer
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Non-integer numeric columns are always cast to float32 (float16 is not used)
                df[col] = df[col].astype(np.float32)

    # ℹ️ Provide memory optimization information if 'verbose' is True
    if verbose:
        # `print` is used here because no logger is defined in this notebook
        print(f"Memory usage of dataframe is {start_mem:.2f} MB")
        end_mem = df.memory_usage().sum() / 1024**2
        print(f"Memory usage after optimization is: {end_mem:.2f} MB")
        decrease = 100 * (start_mem - end_mem) / start_mem
        print(f"Decreased by {decrease:.2f}%")

    # 🔄 Return the DataFrame with optimized memory usage
    return df


# 🏎️Parallel Triplet Imbalance Calculation with Numba
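
For reference, the quantity computed per row is `(max - mid) / (mid - min)` over three values, with NaN when the denominator is zero. The plain-Python sketch below is only an illustration of that formula for a single row; the helper name `triplet_imbalance` is hypothetical and it is not the Numba kernel used in this notebook.

```python
import math

def triplet_imbalance(a, b, c):
    """Return (max - mid) / (mid - min) for three values, or NaN if mid == min."""
    lo, mid, hi = sorted((a, b, c))
    if mid == lo:
        return float("nan")  # avoid division by zero
    return (hi - mid) / (mid - lo)

print(triplet_imbalance(1.0, 2.0, 4.0))  # (4 - 2) / (2 - 1)
print(triplet_imbalance(3.0, 1.0, 1.0))  # mid == min, so NaN
```

The result does not depend on the order of the three inputs, which is why one feature per unordered triplet is enough.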

In [None]:
# 🏎️ Import Numba for just-in-time (JIT) compilation and parallel processing
from numba import njit, prange

# 📊 Function to compute triplet imbalance in parallel using Numba
@njit(parallel=True)
def compute_triplet_imbalance(df_values, comb_indices):
    num_rows = df_values.shape[0]
    num_combinations = len(comb_indices)
    imbalance_features = np.empty((num_rows, num_combinations))

    # 🔁 Loop through all combinations of triplets
    for i in prange(num_combinations):
        a, b, c = comb_indices[i]
        
        # 🔁 Loop through rows of the DataFrame
        for j in range(num_rows):
            max_val = max(df_values[j, a], df_values[j, b], df_values[j, c])
            min_val = min(df_values[j, a], df_values[j, b], df_values[j, c])
            mid_val = df_values[j, a] + df_values[j, b] + df_values[j, c] - min_val - max_val
            
            # 🚫 Prevent division by zero
            if mid_val == min_val:
                imbalance_features[j, i] = np.nan
            else:
                imbalance_features[j, i] = (max_val - mid_val) / (mid_val - min_val)

    return imbalance_features

# 📊 Redefinition of compute_triplet_imbalance: this version replaces the one above and
# finds min/mid/max with explicit comparisons instead of max()/min() calls
@njit(parallel=True)
def compute_triplet_imbalance(df_values, comb_indices):
    num_rows = df_values.shape[0]
    num_combinations = len(comb_indices)
    imbalance_features = np.empty((num_rows, num_combinations))

    # 🔁 Loop through all combinations of triplets
    for i in prange(num_combinations):
        a, b, c = comb_indices[i]
        
        # 🔁 Loop through rows of the DataFrame
        for j in range(num_rows):

            if df_values[j, a] < df_values[j, b]:
                min_val = df_values[j, a]
                max_val = df_values[j, b]
            else:
                max_val = df_values[j, a]
                min_val = df_values[j, b]

            if min_val < df_values[j, c]:
                if df_values[j, c] < max_val:
                    mid_val = df_values[j, c]
                else:
                    mid_val = max_val
                    max_val = df_values[j, c]
            else:
                mid_val = min_val
                min_val = df_values[j, c]
            
            # 🚫 Prevent division by zero
            if max_val == min_val:
                imbalance_features[j, i] = np.nan
            elif mid_val == min_val:
                imbalance_features[j, i] = np.nan
            else:
                imbalance_features[j, i] = (max_val - mid_val) / (mid_val - min_val)
    
    return imbalance_features

# 📈 Function to calculate triplet imbalance for given price data and a DataFrame
def calculate_triplet_imbalance_numba(price, df):
    # Convert DataFrame to numpy array for Numba compatibility
    df_values = df[price].values
    comb_indices = [(price.index(a), price.index(b), price.index(c)) for a, b, c in combinations(price, 3)]

    # Calculate the triplet imbalance using the Numba-optimized function
    features_array = compute_triplet_imbalance(df_values, comb_indices)

    # Create a DataFrame from the results
    columns = [f"{a}_{b}_{c}_imb2" for a, b, c in combinations(price, 3)]
    features = pd.DataFrame(features_array, columns=columns)

    return features


# 📊 Feature Generation Functions 📊
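
One of the less obvious features built in this section is `time_since_last_imbalance_change`. The sketch below reproduces the same steps (flag change, cumulative run id, time since the start of the run) on a tiny hand-made frame for one stock and one day, so the intermediate columns can be inspected. The data is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "stock_id": [0] * 6,
    "date_id": [0] * 6,
    "seconds_in_bucket": [0, 10, 20, 30, 40, 50],
    "imbalance_buy_sell_flag": [1, 1, -1, -1, -1, 1],
})
keys = ["stock_id", "date_id"]

# 1 where the flag differs from the previous row; the first row of each
# (stock, day) group also counts, because its diff is NaN and NaN != 0
df["flag_change"] = df.groupby(keys)["imbalance_buy_sell_flag"].diff().ne(0).astype(int)

# Run identifier: increases by one every time the flag changes
df["group"] = df.groupby(keys)["flag_change"].cumsum()

# Seconds elapsed since the first row of the current run
run_start = df.groupby(keys + ["group"])["seconds_in_bucket"].transform("min")
df["time_since_last_imbalance_change"] = df["seconds_in_bucket"] - run_start

print(df[["seconds_in_bucket", "imbalance_buy_sell_flag", "time_since_last_imbalance_change"]])
```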

In [None]:
# 📊 Function to generate imbalance features
def imbalance_features(df):

    stock_groups = df.groupby(["date_id", "seconds_in_bucket"])
    # Index WAP
    df["wwap"] = df.stock_id.map(weights) * df.wap
    df["iwap"] = stock_groups["wwap"].transform(lambda x: x.sum())
    del df["wwap"]

    # Define lists of price and size-related column names
    prices = ["reference_price", "far_price", "near_price", "ask_price", "bid_price", "wap"]
    sizes = ["matched_size", "bid_size", "ask_size", "imbalance_size"]

    # V1 features
    # Calculate various features using Pandas eval function
    df["volume"] = df.eval("ask_size + bid_size")
    df["mid_price"] = df.eval("(ask_price + bid_price) / 2")
    df["liquidity_imbalance"] = df.eval("(bid_size-ask_size)/(bid_size+ask_size)")
    df["matched_imbalance"] = df.eval("(imbalance_size-matched_size)/(matched_size+imbalance_size)")
    df["all_size"] = df.eval("matched_size + imbalance_size") # add
    df["imbalance_size_for_buy_sell"] = df.eval("imbalance_size * imbalance_buy_sell_flag")  # add
    # df["size_imbalance"] = df.eval("bid_size / ask_size")
    
    cols = ['wap', 'imbalance_size_for_buy_sell', "bid_size", "ask_size"]
    for q in [0.25, 0.5, 0.75]:  # Try more/different q
        df[[f'{col}_quantile_{q}' for col in cols]] = stock_groups[cols].transform(lambda x: x.quantile(q)).astype(np.float32)
    
    # Create features for pairwise price imbalances
    for c in combinations(prices, 2):
        df[f"{c[0]}_{c[1]}_imb"] = df.eval(f"({c[0]} - {c[1]})/({c[0]} + {c[1]})").astype(np.float32)

    for c in combinations(sizes, 2):
        df[f"{c[0]}/{c[1]}"] = df.eval(f"({c[0]})/({c[1]})").astype(np.float32)

    # Calculate triplet imbalance features using the Numba-optimized function
    for c in [['ask_price', 'bid_price', 'wap', 'reference_price'], sizes]:
        triplet_feature = calculate_triplet_imbalance_numba(c, df)
        df[triplet_feature.columns] = triplet_feature.values.astype(np.float32)
        
    # V2 features
    # Calculate additional features
    stock_groups = df.groupby(['stock_id', 'date_id'])
    df["imbalance_momentum"] = stock_groups['imbalance_size'].diff(periods=1) / df['matched_size']
    df["price_spread"] = df["ask_price"] - df["bid_price"]
    df["spread_intensity"] = stock_groups['price_spread'].diff()
    df['price_pressure'] = df['imbalance_size'] * (df['ask_price'] - df['bid_price'])
    df['market_urgency'] = df['price_spread'] * df['liquidity_imbalance']
    df['depth_pressure'] = (df['ask_size'] - df['bid_size']) * (df['far_price'] - df['near_price'])
    df['wap_advantage'] = df.wap - df.iwap  # add

    # Calculate various statistical aggregation features
    df_prices = df[prices]
    df_sizes = df[sizes]
    for func in ["mean", "std", "skew", "kurt"]:
        df[f"all_prices_{func}"] = df_prices.agg(func, axis=1)
        df[f"all_sizes_{func}"] = df_sizes.agg(func, axis=1)
        
    # V3 features
    # Calculate shifted and return features for specific columns
    cols = ['matched_size', 'imbalance_size', 'reference_price', 'imbalance_buy_sell_flag', "wap", "iwap"]
    stock_groups_cols = stock_groups[cols]
    for window in [1, 2, 3, 6, 10]:
        df[[f"{col}_shift_{window}" for col in cols]] = stock_groups_cols.shift(window)

    cols = ['matched_size', 'imbalance_size', 'reference_price', "iwap"] #wap
    stock_groups_cols = stock_groups[cols]
    for window in [1, 2, 3, 6, 10]:
        df[[f"{col}_ret_{window}" for col in cols]] = stock_groups_cols.pct_change(window).astype(np.float32)

    # Calculate diff features for specific columns
    cols = ['ask_price', 'bid_price', 'ask_size', 'bid_size', 'wap', 'near_price', 'far_price', 'imbalance_size_for_buy_sell']
    stock_groups_cols = stock_groups[cols]
    for window in [1, 2, 3, 6, 10]:
        df[[f"{col}_diff_{window}" for col in cols]] = stock_groups_cols.diff(window).astype(np.float32)

    # V4 features
    # Construct `time_since_last_imbalance_change`
    # When `imbalance_buy_sell_flag` changes, the 'flag_change' column is set to 1 for that row
#     df['flag_change'] = stock_groups['imbalance_buy_sell_flag'].diff().ne(0).astype(int)
#     # Use cumsum to build a group identifier that increases each time the flag changes
#     df['group'] = stock_groups['flag_change'].cumsum()
#     # Within each group, subtract the minimum 'seconds_in_bucket' to get the time since the last flag change
#     df['time_since_last_imbalance_change'] = df.groupby(['stock_id', 'date_id', 'group'])['seconds_in_bucket'].transform(lambda x: x - x.min())
#     # Set the value to 0 where `flag_change` is 1
#     df.loc[df['flag_change'] == 1, 'time_since_last_imbalance_change'] = 0
#     df.drop(columns=['flag_change', 'group'], inplace=True)
    
    df['flag_change'] = stock_groups['imbalance_buy_sell_flag'].diff().ne(0).astype(int)
    # Use cumsum to build a group identifier that increases each time the flag changes
    df['group'] = df.groupby(['stock_id', 'date_id'])['flag_change'].cumsum()
    # Within each group, subtract the minimum 'seconds_in_bucket' to get the time since the last flag change
    group_min = df.groupby(['stock_id', 'date_id', 'group'])['seconds_in_bucket'].transform('min')
    df['time_since_last_imbalance_change'] = df['seconds_in_bucket'] - group_min
    # Set the value to 0 where `flag_change` is 1
    df['time_since_last_imbalance_change'] *= (1 - df['flag_change'])
    df.drop(columns=['flag_change', 'group'], inplace=True)
    
    cols = ['imbalance_size_for_buy_sell']
    stock_groups_cols = stock_groups[cols]
    for window in [5, 10]:
        mean_col = stock_groups_cols.transform(lambda x: x.rolling(window=window).mean())
        std_col = stock_groups_cols.transform(lambda x: x.rolling(window=window).std())
        df[[f'z_score_{col}_{window}' for col in cols]] = (df[cols] - mean_col) / std_col
    
    # Replace infinite values with 0
    return df.replace([np.inf, -np.inf], 0)

# 📅 Function to generate time and stock-related features
def other_features(df):
    df["seconds"] = df["seconds_in_bucket"] % 60  # Seconds
    df["minute"] = df["seconds_in_bucket"] // 60  # Minutes

    # Map global features to the DataFrame
    for key, value in global_stock_id_feats.items():
        df[f"global_{key}"] = df["stock_id"].map(value.to_dict())
        
    for key, value in global_seconds_feats.items():
        df[f"global_seconds_{key}"] = df["seconds_in_bucket"].map(value.to_dict())

    return df

def last_days_features(df: pd.DataFrame, feat_last=None, target_last=None):
    size = None
    
    if feat_last is not None and len(feat_last) > 0:
        cols = [col for col in df.columns if col in set(feat_last.columns)]
        if target_last is not None:
            cols.append("target")
            feat_last["target"] = target_last
            df["target"] = 0
        paddings = []
        second_start = df.seconds_in_bucket.max()
        padding_src = df[df.seconds_in_bucket == second_start]
        size = len(df)
        size_pad = len(padding_src) * 6
        for second in range(second_start + 10, second_start + 70, 10):
            padding = padding_src.copy()
            padding["seconds_in_bucket"] = second
            paddings.append(padding)
        df = pd.concat([feat_last[cols], df] + paddings)

    # Add Last days features
    # TODO: Try more features
    cols = ['near_price', 'far_price', 'depth_pressure']
    if 'target' in df.columns:
        cols.append('target')
    stock_groups = df.groupby(['stock_id', 'seconds_in_bucket'])
    stock_groups_cols = stock_groups[cols]
    for window in [1]:
        df[[f"{col}_last_{window}day" for col in cols]] = stock_groups_cols.shift(window)
    if cols[-1] == "target":
        cols.pop()
    
    cols = [f"{col}_last_{window}day" for col in cols]
    stock_groups = df.groupby(['stock_id', 'date_id'])
    stock_groups_cols = stock_groups[cols]
    for window in [1, 2, 3, 6]:
        df[[f"{col}_future_{window}" for col in cols]] = stock_groups_cols.shift(-window)
        
    if size:
        return df[-(size + size_pad):-size_pad]
    return df

# 🚀 Function to generate all features by combining imbalance and other features
def generate_all_features(df, feat_last=None, target_last=None):
    # Select relevant columns for feature generation
    cols = [c for c in df.columns if c not in {"row_id", "time_id", "currently_scored"}]
    df = df[cols]
    
    # Generate imbalance features
    df = imbalance_features(df)
    
    # Generate last days features
    df = last_days_features(df, feat_last, target_last)
    
    # Generate time and stock-related features
    df = other_features(df)

    gc.collect()  # Perform garbage collection to free up memory
    
    # Select and return the generated features
    feature_name = [i for i in df.columns if i not in {"row_id", "target", "time_id"}]
    
    return df[feature_name]

In [None]:
weights = [
    0.004, 0.001, 0.002, 0.006, 0.004, 0.004, 0.002, 0.006, 0.006, 0.002, 0.002, 0.008,
    0.006, 0.002, 0.008, 0.006, 0.002, 0.006, 0.004, 0.002, 0.004, 0.001, 0.006, 0.004,
    0.002, 0.002, 0.004, 0.002, 0.004, 0.004, 0.001, 0.001, 0.002, 0.002, 0.006, 0.004,
    0.004, 0.004, 0.006, 0.002, 0.002, 0.04 , 0.002, 0.002, 0.004, 0.04 , 0.002, 0.001,
    0.006, 0.004, 0.004, 0.006, 0.001, 0.004, 0.004, 0.002, 0.006, 0.004, 0.006, 0.004,
    0.006, 0.004, 0.002, 0.001, 0.002, 0.004, 0.002, 0.008, 0.004, 0.004, 0.002, 0.004,
    0.006, 0.002, 0.004, 0.004, 0.002, 0.004, 0.004, 0.004, 0.001, 0.002, 0.002, 0.008,
    0.02 , 0.004, 0.006, 0.002, 0.02 , 0.002, 0.002, 0.006, 0.004, 0.002, 0.001, 0.02,
    0.006, 0.001, 0.002, 0.004, 0.001, 0.002, 0.006, 0.006, 0.004, 0.006, 0.001, 0.002,
    0.004, 0.006, 0.006, 0.001, 0.04 , 0.006, 0.002, 0.004, 0.002, 0.002, 0.006, 0.002,
    0.002, 0.004, 0.006, 0.006, 0.002, 0.002, 0.008, 0.006, 0.004, 0.002, 0.006, 0.002,
    0.004, 0.006, 0.002, 0.004, 0.001, 0.004, 0.002, 0.004, 0.008, 0.006, 0.008, 0.002,
    0.004, 0.002, 0.001, 0.004, 0.004, 0.004, 0.006, 0.008, 0.004, 0.001, 0.001, 0.002,
    0.006, 0.004, 0.001, 0.002, 0.006, 0.004, 0.006, 0.008, 0.002, 0.002, 0.004, 0.002,
    0.04 , 0.002, 0.002, 0.004, 0.002, 0.002, 0.006, 0.02 , 0.004, 0.002, 0.006, 0.02,
    0.001, 0.002, 0.006, 0.004, 0.006, 0.004, 0.004, 0.004, 0.004, 0.002, 0.004, 0.04,
    0.002, 0.008, 0.002, 0.004, 0.001, 0.004, 0.006, 0.004,
]

weights = {int(k):v for k,v in enumerate(weights)}

stock_group = df_train.groupby("stock_id")
df_train["imbalance_size_for_buy_sell"] = df_train.eval("imbalance_size * imbalance_buy_sell_flag")  # add
global_stock_id_feats = {
        "median_size": stock_group["bid_size"].median() + stock_group["ask_size"].median(),
        "std_size": stock_group["bid_size"].std() + stock_group["ask_size"].std(),
        "ptp_size": stock_group["bid_size"].max() - stock_group["bid_size"].min(),
        "median_price": stock_group["bid_price"].median() + stock_group["ask_price"].median(),
        "std_price": stock_group["bid_price"].std() + stock_group["ask_price"].std(),
        "ptp_price": stock_group["bid_price"].max() - stock_group["ask_price"].min(),
        "median_far_price": stock_group["far_price"].median(),
        "median_near_price": stock_group["near_price"].median(),
        "median_imbalance_size_for_buy_sell": stock_group["imbalance_size_for_buy_sell"].median(),
        "matched_size":stock_group["matched_size"].median(),
    }

global_seconds_feats = {
    "median_target": df_train.groupby(["date_id", "seconds_in_bucket"])["target"].apply(lambda x: x.abs().mean()).reset_index(0, drop=True).groupby("seconds_in_bucket").median(),
}
# stock2nei = get_stock_neighbour(df_train)

# LGBM training function

LGBM parameters are taken from **@jirkaborovec**, ["📈Optiver📉: Feature Eng & Optuna🌪️LightGBM@GPU"](https://www.kaggle.com/code/jirkaborovec/optiver-feature-eng-optuna-lightgbm-gpu), except for the learning rate and `n_estimators`.
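
The training function below uses a holdout scheme: the last 45 days before `split_day` are held out for early stopping, and the final model is then refit on all data with 1.2 times the best iteration count. The sketch here isolates just those two pieces of logic on a toy `date_id` series. The helper names `holdout_mask` and `final_n_estimators` are hypothetical, and no model is trained.

```python
import numpy as np
import pandas as pd

def holdout_mask(date_ids, split_day, holdout_days=45):
    """True for rows in the validation window (date_id > split_day - holdout_days)."""
    return date_ids > (split_day - holdout_days)

def final_n_estimators(best_iteration, scale=1.2):
    """Number of trees for the refit on the full data."""
    return int(scale * best_iteration)

# Toy data: two rows per day for date_id 0..480
date_ids = pd.Series(np.repeat(np.arange(0, 481), 2))
mask = holdout_mask(date_ids, split_day=480)

print("train rows:", int((~mask).sum()), "valid rows:", int(mask.sum()))
print("trees for refit:", final_n_estimators(1000))
```

With `split_day=480`, days 436 to 480 form the validation window and everything earlier is used for training. The extra 20% of trees is a heuristic to account for the larger training set in the refit.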

In [None]:
def train_lgb(df_train_feats, targets, split_day):

    feature_name = list(df_train_feats.columns)
    feature_name.remove("date_id")
    offline_split = df_train_feats['date_id'] > (split_day - 45)
    
#     pre = df_train_feats[offline_split][["date_id", "seconds_in_bucket"]]
#     pre["target"] = targets[offline_split].values
#     df_train_feats["imbalance_size_for_buy_sell"] = df_train.eval("imbalance_size * imbalance_buy_sell_flag")
    # Update global features
    stock_group = df_train_feats.groupby("stock_id")
    global global_stock_id_feats, global_seconds_feats
    global_stock_id_feats = {
        "median_size": stock_group["bid_size"].median() + stock_group["ask_size"].median(),
        "std_size": stock_group["bid_size"].std() + stock_group["ask_size"].std(),
        "ptp_size": stock_group["bid_size"].max() - stock_group["bid_size"].min(),
        "median_price": stock_group["bid_price"].median() + stock_group["ask_price"].median(),
        "std_price": stock_group["bid_price"].std() + stock_group["ask_price"].std(),
        "ptp_price": stock_group["bid_price"].max() - stock_group["ask_price"].min(),
        "median_far_price": stock_group["far_price"].median(),
        "median_near_price": stock_group["near_price"].median(),
        "median_imbalance_size_for_buy_sell": stock_group["imbalance_size_for_buy_sell"].median(),
        "matched_size": stock_group["matched_size"].median(),
    }
    df_train_feats["target"] = targets.values
    global_seconds_feats = {
        "median_target": df_train_feats.groupby(["date_id", "seconds_in_bucket"])["target"].apply(lambda x: x.abs().mean()).reset_index(0, drop=True).groupby("seconds_in_bucket").median(),
    }
    del df_train_feats["target"]
    
    # update global features
    for key, value in global_stock_id_feats.items():
        df_train_feats[f"global_{key}"] = df_train_feats["stock_id"].map(value.to_dict())
        
    for key, value in global_seconds_feats.items():
        df_train_feats[f"global_seconds_{key}"] = df_train_feats["seconds_in_bucket"].map(value.to_dict())
        
    lgb_params = {
        "objective": "mae",
        "n_estimators": 5500,
        "num_leaves": 465,
        "subsample": 0.65791,
        "colsample_bytree": 0.7,
        "learning_rate": 0.00877,  # 0.00877
        "n_jobs": 4,
        "device": "gpu",
        "verbosity": -1,
        "importance_type": "gain",
        "max_depth": 14,  # Maximum depth of the tree
        "min_child_samples": 132,  # Minimum number of data points in a leaf
        "reg_alpha": 6,  # L1 regularization term
        "reg_lambda": 0.08,  # L2 regularization term
    }

    print(f"Feature length = {len(feature_name)}")

    # Split data for offline training based on a specific date
    df_offline_train = df_train_feats[~offline_split]
    df_offline_valid = df_train_feats[offline_split]
    df_offline_train_target = targets[~offline_split]
    df_offline_valid_target = targets[offline_split]
    
    
    train_dataset = lgb.Dataset(df_offline_train[feature_name].values.astype(np.float32), 
                                label=df_offline_train_target.values.astype(np.float32))
    
    valid_dataset = lgb.Dataset(df_offline_valid[feature_name].values.astype(np.float32), 
                                label=df_offline_valid_target.values.astype(np.float32))
    
    del df_offline_train, df_offline_valid, df_offline_train_target, df_offline_valid_target, offline_split

    print("Valid Model Training.")
    
    # Train a LightGBM model on the offline data
    lgb_model = lgb.LGBMRegressor(**lgb_params)
  
    lgb_model.fit(
        train_dataset.data,
        train_dataset.label,
        eval_set=[(valid_dataset.data, valid_dataset.label)],
        callbacks=[
            lgb.callback.early_stopping(stopping_rounds=200),
            lgb.callback.log_evaluation(period=100),
        ],
        feature_name=feature_name,
    )
    
    best_iteration_ = lgb_model.best_iteration_
    
    # Free up memory by deleting variables
    del lgb_model
    gc.collect()

    # Inference
    df_train_target = targets
    full_train_dataset = lgb.Dataset(df_train_feats[feature_name].values.astype(np.float32), 
                                     label=df_train_target.values.astype(np.float32))
    print("Infer Model Training.")

    # Adjust the number of estimators for the inference model
    infer_params = lgb_params.copy()
    infer_params["n_estimators"] = int(1.2 * best_iteration_)
    infer_lgb_model = lgb.LGBMRegressor(**infer_params)
    
    infer_lgb_model.fit(
        full_train_dataset.data,
        full_train_dataset.label,
        feature_name=feature_name,
        )
    
#     infer_lgb_model.fit(
#         df_train_feats[feature_name].values.astype(np.float32), 
#         df_train_target.values.astype(np.float32), 
#         feature_name = feature_name
#         )

    return infer_lgb_model

# CatBoost training & feature elimination

Based on **@yekenot**, ["Feature Elimination by CatBoost"](https://www.kaggle.com/code/yekenot/feature-elimination-by-catboost)


In [None]:
eliminated_features_names = set(['near_price_ask_price_imb',
                             'iwap',
                             'all_imbalance_size_for_buy_sell_quantile_0.5',
                             'iwap_shift_10',
                             'all_imbalance_size_for_buy_sell_quantile_0.75',
                             'matched_size_ret_10',
                             'all_wap_quantile_0.25',
                             'iwap_shift_6',
                             'near_price_wap_imb',
                             'iwap_shift_1',
                             'iwap_ret_10',
                             'iwap_ret_6',
                             'all_imbalance_size_for_buy_sell_quantile_0.25',
                             'global_ptp_size',
                             'reference_price_shift_10',
                             'all_ask_size_quantile_0.75',
                             'iwap_shift_3',
                             'iwap_shift_2',
                             'reference_price_near_price_imb',
                             'all_bid_size_quantile_0.25',
                             'global_std_size',
                             'global_median_price',
                             'matched_size_ret_6',
                             'wap_shift_10'])

In [None]:
def train_ctb(df_train_feats, targets, split_day):

    feature_name = list(df_train_feats.columns)
    feature_name.remove("date_id")
    feature_name = [feat for feat in feature_name if feat not in eliminated_features_names]
    offline_split = df_train_feats['date_id'] > (split_day - 45)
    
#     target_mean = targets.values.mean()
#     pre = df_train_feats[offline_split][["date_id", "seconds_in_bucket"]]
#     pre["target"] = targets[offline_split].values
    
    ctb_params = dict(iterations=2000,
                      learning_rate=1.0,
                      depth=9,
                      l2_leaf_reg=30,
                      bootstrap_type='Bernoulli',
                      subsample=0.66,
                      loss_function='MAE',
                      eval_metric = 'MAE',
                      metric_period=100,
                      od_type='Iter',
                      od_wait=200,
                      task_type='GPU',
                      allow_writing_files=False,
                      random_strength=4.428571428571429
                      )

    print(f"Feature length = {len(feature_name)}")

    # Split data for offline training based on a specific date
    df_offline_train = df_train_feats[~offline_split]
    df_offline_valid = df_train_feats[offline_split]
    df_offline_train_target = targets[~offline_split]
    df_offline_valid_target = targets[offline_split]
    
#     train_dataset = lgb.Dataset(df_offline_train[feature_name].values, label = df_offline_train_target, feature_name = feature_name)
#     valid_dataset = lgb.Dataset(df_offline_valid[feature_name].values, label = df_offline_valid_target, feature_name = feature_name)

    print("Valid Model Training.")

    # Train a CatBoost model on the offline data
    ctb_model = ctb.CatBoostRegressor(**ctb_params)
    
#   feature elimination

#     summary = ctb_model.select_features(
#         df_offline_train[feature_name], df_offline_train_target,
#         eval_set=[(df_offline_valid[feature_name], df_offline_valid_target)],
#         features_for_select=feature_name,
#         num_features_to_select=len(feature_name)-24,    # Dropping from 124 to 100
#         steps=3,
#         algorithm=ctb.EFeaturesSelectionAlgorithm.RecursiveByShapValues,
#         shap_calc_type=ctb.EShapCalcType.Regular,
#         train_final_model=False,
#         plot=True,
#     )
    ctb_model.fit(
        df_offline_train[feature_name].astype(np.float32), df_offline_train_target.astype(np.float32),
        eval_set=[(df_offline_valid[feature_name].astype(np.float32), df_offline_valid_target.astype(np.float32))],
        use_best_model=True,
#         early_stopping_rounds=200
    )
    
    best_iteration_ = ctb_model.best_iteration_
    
    # Free up memory by deleting variables
    del df_offline_train, df_offline_valid, df_offline_train_target, df_offline_valid_target, offline_split, ctb_model
    gc.collect()

    # Inference
    df_train_target = targets
    print("Infer Model Training.")

    # Adjust the number of estimators for the inference model
    infer_params = ctb_params.copy()
    infer_params["iterations"] = int(1.2 * best_iteration_)
    infer_ctb_model = ctb.CatBoostRegressor(**infer_params)
    infer_ctb_model.fit(df_train_feats[feature_name].astype(np.float32), df_train_target.astype(np.float32))
    return infer_ctb_model

# Inference and online learning

The models are retrained during inference at date_id 481, 481+15, and 481+30.
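
The retraining schedule can be sketched in isolation as follows. This is a simplified illustration, assuming the same rule as the loop below: retraining is triggered only on the first bucket (`seconds_in_bucket == 0`) of a scheduled day. The helper `should_retrain` is hypothetical.

```python
start_date_id = 481
retrain_days = {start_date_id, start_date_id + 15, start_date_id + 30}

def should_retrain(date_id, seconds_in_bucket):
    """Retrain only at the first bucket of a scheduled day."""
    return seconds_in_bucket == 0 and date_id in retrain_days

# Days on which retraining would be triggered within a sample range
schedule = [d for d in range(478, 520) if should_retrain(d, 0)]
print(schedule)
```

Each retrain uses all data collected up to the previous day, so the models see progressively more of the public leaderboard period.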

In [None]:
def zero_sum(prices, volumes):
    std_error = np.sqrt(volumes)
    step = np.sum(prices)/np.sum(std_error)
    out = prices-std_error*step
    
    return out

import optiver2023
env = optiver2023.make_env()
iter_test = env.iter_test()
counter = 0
y_min, y_max = -64, 64
qps, predictions = [], []
cache = pd.DataFrame()
# train_cache = pd.DataFrame()
print("Start initialization...")
targets = df_train["target"].astype(np.float32)
data = reduce_mem_usage(generate_all_features(df_train))
del df_train
gc.collect()
print("Initialization done.")
start_date_id = 481  # 481
start_pro_date_id = 480
# online training at the following date_ids
train_date_id = set([start_date_id, start_date_id + 15, start_date_id + 30])  # set([start_date_id + 20])
feature_name = list(data.columns)
feature_name.remove("date_id")
feature_name_ctb = [feat for feat in feature_name if feat not in eliminated_features_names]
feat_last = None
target_last = None
target_mean = targets.values.mean()
print("Start prediction...")
for (test, revealed_targets, sample_prediction) in iter_test:
    currently_scored = test.iloc[0]["currently_scored"]
    date_id = test.iloc[0]["date_id"]
#     currently_scored = date_id >= start_date_id
    if not currently_scored and date_id != start_pro_date_id:
        sample_prediction['target'] = 0
        env.predict(sample_prediction)
        continue
    del test['currently_scored']
    seconds_in_bucket = test.iloc[0]["seconds_in_bucket"]
    # Online train
    if seconds_in_bucket == 0:
        cache = pd.DataFrame()
        target_last = revealed_targets["revealed_target"].values.astype(np.float32)
        #  concat target
        if len(targets) < len(data):
            targets = pd.concat([targets, revealed_targets["revealed_target"].astype(np.float32)], ignore_index=True, axis=0)
        # Online learning
        if date_id in train_date_id:
            infer_lgb_model = train_lgb(data, targets, date_id - 1)
            infer_ctb_model = train_ctb(data, targets, date_id - 1)
            target_mean = targets.values.mean()

#     now_time = time.time()
    cache = pd.concat([cache, test], ignore_index=True, axis=0)
    feat = generate_all_features(cache, feat_last, target_last)
    if seconds_in_bucket == 540:
        feat_last = feat.copy()
        if currently_scored:
            # concat data for online learning
            data = pd.concat([data, reduce_mem_usage(feat)], ignore_index=True, axis=0)
    if not currently_scored:
        sample_prediction['target'] = 0
        env.predict(sample_prediction)
        continue
    feat = feat[-len(test):]
    lgb_prediction = infer_lgb_model.predict(feat[feature_name])
    ctb_prediction = infer_ctb_model.predict(feat[feature_name_ctb])
    prediction = lgb_prediction * 0.67 + ctb_prediction * 0.33
    prediction = prediction - prediction.mean() + target_mean
    clipped_predictions = np.clip(prediction, y_min, y_max)
    sample_prediction['target'] = clipped_predictions
    env.predict(sample_prediction)
print("OK")