# User Guide

## 0. requirements
#### 0.1. Installing a package in the local environment
- ```!pip install --target=/home/<user_name>/<venv_name>/lib/python3.10/site-packages <package_name>```

#### 0.2. Installing the Kaggle API
- See https://www.kaggle.com/docs/api#getting-started-installation-&-authentication
- ```!pip install --target=/home/<user_name>/<venv_name>/lib/python3.10/site-packages kaggle```
- Download the Kaggle API token (kaggle.json) and upload it to your home directory
- ```!mkdir ~/.kaggle```
- ```!mv ~/kaggle.json ~/.kaggle/``` (move kaggle.json into ~/.kaggle/)
- ```!chmod 600 ~/.kaggle/kaggle.json``` (restrict the token file to owner read/write)

## 1. Config settings

#### 1.1. init config
- MODE: one of train, inference, or both (train: local environment; inference and both: Kaggle environment)
- KAGGLE_DATASET_NAME: name of the Kaggle dataset used for inference in the Kaggle environment
  - The Kaggle dataset is created under this name, so it must be unique.

#### 1.2. train / inference config
- model_directory: directory where models are saved
- data_directory: directory containing the data
- train_mode: whether to run training
- infer_mode: whether to run inference

#### 1.3. model config
- model_name: list of model names to use
    - Each name must match a model name defined in models_config below (choose from those).
- target: name of the target column
- split_method: how the data is split
  - time_series: time-series split
  - rolling: rolling-window split
  - blocking: blocked split
  - holdout: holdout split
- n_splits: number of splits (1 or more; holdout requires exactly 1, the other methods require more than 1)
- correct: whether to align split boundaries to date boundaries (True / False)
- initial_fold_size_ratio: size of the initial fold as a fraction of the data (0 ~ 1)
- train_test_ratio: fraction of each fold used for training; the remainder is the test set (0 ~ 1)
- ~train_start: start of the training period~ (holdout only; set it manually when split_method is holdout)
- ~train_end: end of the training period~ (holdout only; set it manually when split_method is holdout)
- ~valid_start: start of the validation period~ (holdout only; set it manually when split_method is holdout)
- ~valid_end: end of the validation period~ (holdout only; set it manually when split_method is holdout)
- optuna_random_state: random state for optuna
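
As a rough illustration of how the ratio settings above turn into row counts, the sketch below mirrors the size arithmetic at the top of `Splitter._rolling_split`. The helper name `initial_rolling_window` and the sample numbers are made up for this example, and the ratios are chosen to be exactly representable in binary floating point so that the `int()` truncation is predictable.

```python
def initial_rolling_window(n, initial_fold_size_ratio, train_test_ratio, gap, n_splits):
    """Reproduce only the size arithmetic of a rolling split.

    Hypothetical helper: it does not apply the date-boundary correction
    and does not iterate over folds.
    """
    total_fold_size = int(n * initial_fold_size_ratio)
    test_size = int(total_fold_size * (1 - train_test_ratio))
    gap_size = int(total_fold_size * gap)  # gap is a fraction of the fold size
    train_size = total_fold_size - test_size
    rolling_increment = (n - total_fold_size) // (n_splits - 1)

    # The most recent window is anchored at the end of the data.
    end_of_test = n - 1
    start_of_test = end_of_test - test_size
    end_of_train = start_of_test - gap_size
    start_of_train = end_of_train - train_size

    return {
        "test_size": test_size,
        "gap_size": gap_size,
        "train_size": train_size,
        "rolling_increment": rolling_increment,
        "train": (start_of_train, end_of_train),
        "test": (start_of_test, end_of_test),
    }


window = initial_rolling_window(
    n=1000, initial_fold_size_ratio=0.5, train_test_ratio=0.75, gap=0.125, n_splits=3
)
print(window)
```

With ratios such as 0.9, `1 - train_test_ratio` is not exact in floating point, so `int()` can truncate to one row fewer than the nominal size.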
#### 1.4. model hyperparameter config
- models_config: hyperparameter settings for each model
    - model: model class
    - params: hyperparameters for the model
        - ... : individual hyperparameter
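
A minimal sketch of what such a `models_config` could look like and how `model_name` entries would be resolved against it. The exact layout is an assumption based on the bullet list above; `StandInRegressor`, `build_models`, and the parameter values are hypothetical and stand in for a real model class and its settings.

```python
class StandInRegressor:
    """Hypothetical placeholder for a real model class."""

    def __init__(self, **params):
        self.params = params


# Assumed layout: one entry per model name, each holding a class and its params.
models_config = {
    "lgb": {
        "model": StandInRegressor,
        "params": {"learning_rate": 0.05, "n_estimators": 100},  # illustrative values only
    },
}


def build_models(model_names, models_config):
    """Instantiate every requested model; fail early on names missing from the config."""
    unknown = [name for name in model_names if name not in models_config]
    if unknown:
        raise KeyError(f"model_name entries missing from models_config: {unknown}")
    return {
        name: models_config[name]["model"](**models_config[name]["params"])
        for name in model_names
    }


models = build_models(["lgb"], models_config)
print(models["lgb"].params)
```

Checking the names up front surfaces a mismatch between `model_name` and `models_config` before any training starts.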

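## 2. Global Method
- reduce_mem_usage: downcasts numeric columns to reduce memory usage
- compute_triplet_imbalance: Numba kernel that computes the triplet imbalance for each row
- calculate_triplet_imbalance_numba: wrapper that builds the column triplets and returns the triplet imbalance features as a DataFrame
- print_log: decorator that times the wrapped function and prints its name, elapsed time, and result shape (logging is skipped when the instance's infer flag is True)
- zero_sum: adjusts prices in proportion to the square root of volume so that the adjusted values sum to zero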

## 3. Pre Code
- DataPreprocessor: data preprocessing class
- FeatureEngineer: feature engineering class
- Splitter: data splitting class
- Model: model class
- Trainer: training class

## 4. Main Code
 

---

## 0. requirements

In [1]:
# !pip install --target=/home/<user_name>/<venv_name>/lib/python3.10/site-packages <package_name>

#### 0.2. Installing the Kaggle API

In [2]:
# !pip install --target=/home/<user_name>/<venv_name>/lib/python3.10/site-packages kaggle

In [3]:
# !mkdir ~/.kaggle

In [4]:
# !mv ~/kaggle.json ~/.kaggle/

In [5]:
# !chmod 600 ~/.kaggle/kaggle.json

## 1. Config settings

#### 1.1. init config

In [1]:
MODE = "train"  # train, inference, both
KAGGLE_DATASET_NAME = "model-nn-version-yongmin-6"

In [2]:
import gc
import os
import time
import warnings
from itertools import combinations
from warnings import simplefilter
import functools
from numba import njit, prange
import pyarrow.parquet as pq

import joblib
import lightgbm as lgb
import xgboost as xgb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import optuna
from functools import partial
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, TimeSeriesSplit

from sklearn.preprocessing import MinMaxScaler, QuantileTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

import tensorflow as tf
import tensorflow.keras.backend as K
import tensorflow.keras.layers as layers
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import Callback, ReduceLROnPlateau, ModelCheckpoint, EarlyStopping

warnings.filterwarnings("ignore")
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)


#### 1.2. train / inference config

In [3]:
EPS = 1e-10

In [4]:
if MODE == "train":
    print("You are in train mode")
    model_directory = "./models/" + time.strftime("%Y%m%d_%H:%M:%S", time.localtime(time.time() + 9 * 60 * 60))
    data_directory = "./data"
    train_mode = True
    infer_mode = False
elif MODE == "inference":
    print("You are in inference mode")
    model_directory = f'/kaggle/input/{KAGGLE_DATASET_NAME}'
    data_directory = "/kaggle/input/optiver-trading-at-the-close"
    train_mode = False
    infer_mode = True
elif MODE == "both":
    print("You are in both mode")
    model_directory = f'/kaggle/working/'
    data_directory = "/kaggle/input/optiver-trading-at-the-close"
    train_mode = True
    infer_mode = True
else:
    raise ValueError("Invalid mode")

You are in train mode


#### 1.3. model config

In [5]:
config = {
    "data_dir": data_directory,
    "model_dir": model_directory,

    "train_mode": train_mode,  # True : train, False : not train
    "infer_mode": infer_mode,  # True : inference, False : not inference
    "model_name": ["lgb"],  # model name
    "final_mode": False,  # True : using final model, False : not using final model
    "best_iterate_ratio": 1.2,  # best iteration ratio
    'target': 'target',

    'split_method': 'rolling',  # time_series, rolling, blocking, holdout
    'n_splits': 3,  # number of splits
    'correct': True,  # correct boundary
    'gap': 0.05,  # gap between train and test (0.05 = 5% of train size)

    'initial_fold_size_ratio': 0.8,  # initial fold size ratio
    'train_test_ratio': 0.9,  # train, test ratio

    'optuna_random_state': 42,
}
config["model_mode"] = "single" if len(config["model_name"]) == 1 else "stacking"  # single or stacking, depending on the number of models
config["mae_mode"] = True if config["model_mode"] == "single" and not config[
    "final_mode"] else False  # single model without final_mode: with several folds there is no criterion for choosing a model, so evaluate by MAE
config["inference_n_splits"] = len(config['model_name']) if config["final_mode"] or config["mae_mode"] else config[
    "n_splits"]  # with neither final_mode nor mae_mode, run inference over n_splits folds
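
The three derived settings are easier to follow as a small truth table. The sketch below restates the same expressions as a function; `derive_modes` is a name introduced only for this example, and `"xgb"` is just a second sample model name.

```python
def derive_modes(model_name, final_mode, n_splits):
    """Restate the derived config entries as a pure function (illustrative only)."""
    model_mode = "single" if len(model_name) == 1 else "stacking"
    mae_mode = model_mode == "single" and not final_mode
    inference_n_splits = len(model_name) if (final_mode or mae_mode) else n_splits
    return model_mode, mae_mode, inference_n_splits


cases = [
    (["lgb"], False),         # single model, per-fold models compared by MAE
    (["lgb"], True),          # single final model
    (["lgb", "xgb"], False),  # stacking, one inference pass per fold
    (["lgb", "xgb"], True),   # stacking with final models
]
for names, final_mode in cases:
    print(names, final_mode, derive_modes(names, final_mode, n_splits=3))
```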

#### 1.4. model hyperparameter config

In [6]:
if MODE == "train":
    if not os.path.exists(config["model_dir"]):
        os.makedirs(config["model_dir"])
    if not os.path.exists(config["data_dir"]):
        os.makedirs(config["data_dir"])
    !kaggle competitions download optiver-trading-at-the-close -p {config["data_dir"]} --force
    !unzip -o {config["data_dir"]}/optiver-trading-at-the-close.zip -d {config["data_dir"]}
    !rm {config["data_dir"]}/optiver-trading-at-the-close.zip

Downloading optiver-trading-at-the-close.zip to ./data
 98%|███████████████████████████████████████▎| 197M/201M [00:06<00:00, 30.7MB/s]
100%|████████████████████████████████████████| 201M/201M [00:06<00:00, 31.4MB/s]
Archive:  ./data/optiver-trading-at-the-close.zip
  inflating: ./data/example_test_files/revealed_targets.csv  
  inflating: ./data/example_test_files/sample_submission.csv  
  inflating: ./data/example_test_files/test.csv  
  inflating: ./data/optiver2023/__init__.py  
  inflating: ./data/optiver2023/competition.cpython-310-x86_64-linux-gnu.so  
  inflating: ./data/public_timeseries_testing_util.py  
  inflating: ./data/train.csv        


## 2. Global Method

In [7]:
def reduce_mem_usage(df, verbose=0):
    """
    Iterate through all numeric columns of a dataframe and modify the data type
    to reduce memory usage.
    """

    start_mem = df.memory_usage().sum() / 1024 ** 2

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # All three original range checks cast to float32, so they
                # collapse into a single cast with the same behavior.
                df[col] = df[col].astype(np.float32)

    if verbose:
        end_mem = df.memory_usage().sum() / 1024 ** 2
        print(f"Memory usage: {start_mem:.2f} MB -> {end_mem:.2f} MB")

    return df

In [8]:
def print_log(message_format):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Check for self: see whether the first argument is an instance with an `infer` attribute.
            if args and hasattr(args[0], 'infer'):
                self = args[0]

                # If self.infer is True, skip logging and return the function's result directly.
                if self.infer:
                    return func(*args, **kwargs)

            start_time = time.time()
            result = func(*args, **kwargs)
            end_time = time.time()

            elapsed_time = end_time - start_time

            if result is not None:
                data_shape = getattr(result, 'shape', 'No shape attribute')
                shape_message = f", shape({data_shape})"
            else:
                shape_message = ""

            print(f"\n{'-' * 100}")
            print(message_format.format(func_name=func.__name__, elapsed_time=elapsed_time) + shape_message)
            print(f"{'-' * 100}\n")

            return result

        return wrapper

    return decorator
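
A short usage sketch for the decorator. To keep the example runnable on its own it carries a condensed copy of `print_log`; the `Demo` class and its `double` method are hypothetical and exist only to show the effect of the `infer` flag.

```python
import functools
import time


def print_log(message_format):
    """Condensed copy of the decorator above, so this example runs alone."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Skip logging when the first argument has a truthy `infer` attribute.
            if args and getattr(args[0], "infer", False):
                return func(*args, **kwargs)
            start = time.time()
            result = func(*args, **kwargs)
            elapsed = time.time() - start
            shape = ""
            if result is not None:
                shape = f", shape({getattr(result, 'shape', 'No shape attribute')})"
            print(message_format.format(func_name=func.__name__, elapsed_time=elapsed) + shape)
            return result
        return wrapper
    return decorator


class Demo:  # hypothetical class, not part of the notebook
    def __init__(self, infer):
        self.infer = infer

    @print_log("Executed {func_name}, Elapsed time: {elapsed_time:.2f} seconds")
    def double(self, values):
        return [v * 2 for v in values]


train_result = Demo(infer=False).double([1, 2])  # prints one log line
infer_result = Demo(infer=True).double([1, 2])   # prints nothing
```

Because the wrapper uses `functools.wraps`, the decorated method keeps its original name, which is what `{func_name}` reports.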


In [9]:
def zero_sum(prices, volumes):
    std_error = np.sqrt(volumes)
    step = np.sum(prices) / np.sum(std_error)
    out = prices - std_error * step
    return out
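
Why the result sums to zero: the total subtracted is `step * sum(sqrt(volumes))`, which equals `sum(prices)` by the definition of `step`. The sketch below checks this on made-up numbers; the function body is repeated so the example is self-contained.

```python
import numpy as np


def zero_sum(prices, volumes):
    std_error = np.sqrt(volumes)
    step = np.sum(prices) / np.sum(std_error)
    return prices - std_error * step


# Illustrative inputs: sqrt(volumes) = [1, 2, 3], sum(prices) = 6, so step = 1.
prices = np.array([3.0, 1.0, 2.0])
volumes = np.array([1.0, 4.0, 9.0])

adjusted = zero_sum(prices, volumes)
print(adjusted)        # each price is reduced by step * sqrt(volume)
print(adjusted.sum())  # zero up to floating-point rounding
```

Items with larger volume absorb a larger share of the adjustment, since the correction is proportional to `sqrt(volume)`.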

#### Add methods to each class as needed. When you do, list the method in the class docstring and briefly describe its role in the method's own docstring.

## 3. Pre Code

## Data Preprocessing Class

## Feature Engineering Class

In [10]:
global_features = {}

In [11]:
def calculate_rsi(data, window_size=7):
    price_diff = data['wap'].diff()
    gain = price_diff.where(price_diff > 0, 0)
    loss = -price_diff.where(price_diff < 0, 0)

    avg_gain = gain.rolling(window=window_size).mean()
    avg_loss = loss.rolling(window=window_size).mean()

    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))

    return rsi

In [12]:
@njit(parallel=True)
def compute_triplet_imbalance(df_values, comb_indices):
    """
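    Calculate the triplet imbalance for each row of a 2-D array.
    :param df_values: 2-D NumPy array of shape (num_rows, num_columns)
    :param comb_indices: list of (a, b, c) column-index triplets
    :return: array of shape (num_rows, len(comb_indices)); NaN where the middle and minimum values are equal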
    """
    num_rows = df_values.shape[0]
    num_combinations = len(comb_indices)
    imbalance_features = np.empty((num_rows, num_combinations))

    for i in prange(num_combinations):
        a, b, c = comb_indices[i]
        for j in range(num_rows):
            max_val = max(df_values[j, a], df_values[j, b], df_values[j, c])
            min_val = min(df_values[j, a], df_values[j, b], df_values[j, c])
            mid_val = df_values[j, a] + df_values[j, b] + df_values[j, c] - min_val - max_val
            if mid_val == min_val:  # Prevent division by zero
                imbalance_features[j, i] = np.nan
            else:
                imbalance_features[j, i] = (max_val - mid_val) / (mid_val - min_val + EPS)

    return imbalance_features


def calculate_triplet_imbalance_numba(price, df):
    """
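    Calculate the triplet imbalance features for every 3-column combination.
    :param price: list of column names to combine (prices or sizes)
    :param df: DataFrame containing those columns
    :return: DataFrame with one "{a}_{b}_{c}_imb2" column per triplet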
    """
    # Convert DataFrame to numpy array for Numba compatibility
    df_values = df[price].values
    comb_indices = [(price.index(a), price.index(b), price.index(c)) for a, b, c in combinations(price, 3)]

    # Calculate the triplet imbalance
    features_array = compute_triplet_imbalance(df_values, comb_indices)

    # Create a DataFrame from the results
    columns = [f"{a}_{b}_{c}_imb2" for a, b, c in combinations(price, 3)]
    features = pd.DataFrame(features_array, columns=columns)

    return features
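
For a single row, the triplet imbalance is `(max - mid) / (mid - min)` over three values. The plain-Python sketch below follows the same rule as the Numba kernel, without the parallel loop, to make the column naming and the tie case concrete. The `row` values are made up for illustration.

```python
from itertools import combinations

EPS = 1e-10


def triplet_imbalance(a, b, c):
    """Return (max - mid) / (mid - min + EPS), or NaN when mid equals min."""
    max_val = max(a, b, c)
    min_val = min(a, b, c)
    mid_val = a + b + c - min_val - max_val
    if mid_val == min_val:
        return float("nan")
    return (max_val - mid_val) / (mid_val - min_val + EPS)


# Illustrative row of size columns.
row = {"matched_size": 8.0, "bid_size": 1.0, "ask_size": 2.0, "imbalance_size": 4.0}

features = {
    f"{a}_{b}_{c}_imb2": triplet_imbalance(row[a], row[b], row[c])
    for a, b, c in combinations(row, 3)
}
for name, value in features.items():
    print(name, value)

tie = triplet_imbalance(1.0, 1.0, 3.0)  # mid equals min, so the result is NaN
```

Four columns yield four triplets, so the feature count grows as "n choose 3" with the number of input columns.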

In [13]:
from tqdm import tqdm
import glob

all_stock_data = {}

for s in tqdm(glob.glob("./data/alpha/*.csv") if MODE == "train" else glob.glob(
        "/kaggle/input/nasdaq-stocks-historical-data/alpha/*.csv"), desc="Processing files"):
    stock_df = pd.read_csv(s, dtype={"ticker": str})
    stock_df.query("Date >= '2021-08-05' and Date <= '2023-07-06'", inplace=True)
    if len(stock_df) > 180:
        all_stock_data[s[13:-15]] = (stock_df, len(stock_df))

reversed_stock_list = [
        'MNST', 'WING', 'AXON', 'HON', 'MAR', 'OKTA', 'POOL', 'LRCX', 'YOTA', 'PFG',
        'NDAQ', 'COIN', 'AMGN', 'TER', 'ADBE', 'ABNB', 'ZBRA', 'KLAC', 'ZI', 'ALNY',
        'ULTA', 'SSNC', 'ON', 'SWKS', 'AKAM', 'ASML', 'PPBI', 'QRVO', 'FANG', 'ORLY',
        'LNT', 'AGRX', 'NTAP', 'CROX', 'REGN', 'ROST', 'DLTR', 'ADP', 'EMCG', 'CTAS',
        'CZR', 'NVDA', 'SAIA', 'JKHY', 'FOSLL', 'MSFT', 'TECH', 'TXRH', 'WDAY', 'FITB',
        'MTCH', 'ROKU', 'CINF', 'EBAY', 'SNPS', 'FAST', 'ETSY', 'IDXX', 'INTU', 'ZG',
        'CRWD', 'LYFT', 'RGEN', 'LKQ', 'MKTX', 'EXC', 'LBRDK', 'MRNA', 'PAYX', 'SOFI',
        'BYND', 'EQIX', 'ADI', 'GEN', 'ALGN', 'CDNS', 'HAS', 'VRTX', 'HOOD', 'WBD',
        'TXG', 'SGEN', 'OPEN', 'INTC', 'GOOG', 'CAR', 'UPST', 'LSCC', 'NFLX', 'ENTG',
        'FFIV', 'DOCU', 'MSTR', 'ZION', 'PCTY', 'AMD', 'MRVL', 'NBIX', 'JBLU', 'PARA',
        'MQ', 'FCNCA', 'TEAM', 'ZS', 'WBA', 'MDLZ', 'TRMB', 'PODD', 'SEDG', 'CSX',
        'TMUS', 'SPWR', 'AAPL', 'LULU', 'LPLA', 'ILMN', 'CDW', 'GDS', 'MELI', 'MASI',
        'FOXA', 'KDP', 'AAL', 'GILD', 'ASO', 'UTHR', 'MU', 'MDB', 'WDC', 'CFLT',
        'SBUX', 'INCY', 'TSCO', 'ISRG', 'VTRS', 'DKNG', 'LITE', 'TTWO', 'SMCI', 'EXPE',
        'VRTS', 'AMAT', 'AVGO', 'TLRY', 'PCAR', 'CG', 'MIDD', 'APA', 'LNT', 'VRSK',
        'PANW', 'CSCO', 'SBAC', 'HTZ', 'DBX', 'CHKEW', 'LCID', 'ADSK', 'APLS', 'STLD',
        'PEP', 'PTON', 'ENPH', 'COST', 'CPRT', 'HST', 'KHC', 'CHRW', 'AMZN', 'ANSS',
        'HOLX', 'TROW', 'APP', 'FIVE', 'AFRM', 'GOOGL', 'FTNT', 'SWAV', 'ZM', 'META',
        'GH', 'JBHT', 'UAL', 'MCHP', 'DDOG', 'ODFL', 'CTSH', 'EA', 'RUN', 'CSGP',
        'DXCM', 'TSLA', 'PTC', 'PYPL', 'PENN', 'XEL', 'XRAY', 'SPLK', 'CMCSA', 'BKR'
]

stock_list_df = pd.read_csv('./data/nasdaq-screener/nasdaq_screener_1701158836955.csv') if MODE == "train" else pd.read_csv(
    '/kaggle/input/nasdaq-screener/nasdaq_screener_1701158836955.csv')

Processing files: 100%|██████████| 3131/3131 [00:05<00:00, 600.79it/s]


In [14]:
from sklearn.preprocessing import LabelEncoder

def get_stock_info(df, data, column_name):  # column_name = "Market Cap", "Sector", "Industry"
    le = LabelEncoder()

    if column_name != "Market Cap":
        stock_list_df[column_name] = le.fit_transform(stock_list_df[column_name])

    df[f'{column_name}'] = -1

    for idx, ticker in enumerate(reversed_stock_list):
        stock_id_indices = data[data['stock_id'] == idx].index
        if ticker in stock_list_df["Symbol"].values:
            value = stock_list_df[stock_list_df["Symbol"] == ticker][column_name].iloc[0]
            df.loc[stock_id_indices, f'{column_name}'] = value

    return df

In [15]:
weights = [
    0.004, 0.001, 0.002, 0.006, 0.004, 0.004, 0.002, 0.006, 0.006, 0.002, 0.002, 0.008,
    0.006, 0.002, 0.008, 0.006, 0.002, 0.006, 0.004, 0.002, 0.004, 0.001, 0.006, 0.004,
    0.002, 0.002, 0.004, 0.002, 0.004, 0.004, 0.001, 0.001, 0.002, 0.002, 0.006, 0.004,
    0.004, 0.004, 0.006, 0.002, 0.002, 0.04 , 0.002, 0.002, 0.004, 0.04 , 0.002, 0.001,
    0.006, 0.004, 0.004, 0.006, 0.001, 0.004, 0.004, 0.002, 0.006, 0.004, 0.006, 0.004,
    0.006, 0.004, 0.002, 0.001, 0.002, 0.004, 0.002, 0.008, 0.004, 0.004, 0.002, 0.004,
    0.006, 0.002, 0.004, 0.004, 0.002, 0.004, 0.004, 0.004, 0.001, 0.002, 0.002, 0.008,
    0.02 , 0.004, 0.006, 0.002, 0.02 , 0.002, 0.002, 0.006, 0.004, 0.002, 0.001, 0.02,
    0.006, 0.001, 0.002, 0.004, 0.001, 0.002, 0.006, 0.006, 0.004, 0.006, 0.001, 0.002,
    0.004, 0.006, 0.006, 0.001, 0.04 , 0.006, 0.002, 0.004, 0.002, 0.002, 0.006, 0.002,
    0.002, 0.004, 0.006, 0.006, 0.002, 0.002, 0.008, 0.006, 0.004, 0.002, 0.006, 0.002,
    0.004, 0.006, 0.002, 0.004, 0.001, 0.004, 0.002, 0.004, 0.008, 0.006, 0.008, 0.002,
    0.004, 0.002, 0.001, 0.004, 0.004, 0.004, 0.006, 0.008, 0.004, 0.001, 0.001, 0.002,
    0.006, 0.004, 0.001, 0.002, 0.006, 0.004, 0.006, 0.008, 0.002, 0.002, 0.004, 0.002,
    0.04 , 0.002, 0.002, 0.004, 0.002, 0.002, 0.006, 0.02 , 0.004, 0.002, 0.006, 0.02,
    0.001, 0.002, 0.006, 0.004, 0.006, 0.004, 0.004, 0.004, 0.004, 0.002, 0.004, 0.04,
    0.002, 0.008, 0.002, 0.004, 0.001, 0.004, 0.006, 0.004,
]
_weights = {int(k):v for k,v in enumerate(weights)}

In [23]:
class FeatureEngineer:

    def __init__(self, data, infer=False, feature_versions=None, dependencies=None,
                 base_directory="./data/fe_versions"):
        self.data = data
        self.infer = infer
        self.feature_versions = feature_versions or []
        self.dependencies = dependencies or {}  # dict defining the dependencies between feature versions
        self.base_directory = base_directory
        if not os.path.exists(self.base_directory):
            os.makedirs(self.base_directory)

    @staticmethod
    @print_log("Executed {func_name}, Elapsed time: {elapsed_time:.2f} seconds")
    def generate_global_features(data):
        global_features["version_0"] = {
            "median_size": data.groupby("stock_id")["bid_size"].median() + data.groupby("stock_id")[
                "ask_size"].median(),
            "std_size": data.groupby("stock_id")["bid_size"].std() + data.groupby("stock_id")["ask_size"].std(),
            "ptp_size": data.groupby("stock_id")["bid_size"].max() - data.groupby("stock_id")["bid_size"].min(),
            "median_price": data.groupby("stock_id")["bid_price"].median() + data.groupby("stock_id")[
                "ask_price"].median(),
            "std_price": data.groupby("stock_id")["bid_price"].std() + data.groupby("stock_id")["ask_price"].std(),
            "ptp_price": data.groupby("stock_id")["bid_price"].max() - data.groupby("stock_id")["ask_price"].min(),
        }

    @print_log("Executed {func_name}, Elapsed time: {elapsed_time:.2f} seconds")
    def feature_selection(self, data, exclude_columns):
        # Build a new DataFrame from every column that is not in exclude_columns.
        selected_columns = [c for c in data.columns if c not in exclude_columns]
        data = data[selected_columns]
        return data

    @print_log("Executed {func_name}, Elapsed time: {elapsed_time:.2f} seconds")
    def feature_version_yongmin_0(self, *args, version_name="feature_version_yongmin_0"):
        
        df = pd.DataFrame(index=self.data.index)

        df['dow'] = self.data["date_id"] % 5
        df['seconds'] = self.data['seconds_in_bucket'] % 60
        df['minute'] = self.data['seconds_in_bucket'] // 60
    
        df["volume"] = self.data.eval("ask_size + bid_size")
        df['cum_wap'] = self.data.groupby(['stock_id'])['wap'].cumprod()
    
        for i in [1, 6]:
            df[f'pct_change_{i}'] = self.data.groupby(['stock_id', 'seconds_in_bucket'])['wap'].pct_change(i).fillna(0)

            f = lambda x: 1 if x > 0 else (0 if x == 0 else -1)
            df[f'polarize_pct_{i}'] = df[f'pct_change_{i}'].apply(f)
    
        return df

    @print_log("Executed {func_name}, Elapsed time: {elapsed_time:.2f} seconds")
    def feature_version_yongmin_1(self, *args, version_name="feature_version_yongmin_1"):
        
        df = pd.DataFrame(index=self.data.index)
        
        window_size = 6
        short_window = 1
        long_window = 6

        df['vol_std'] = self.data.groupby(['stock_id'])['wap'].pct_change().rolling(window=window_size).std()
        df['rolling_vol_di'] = self.data.groupby(['date_id'])['wap'].pct_change().rolling(window=window_size).std()
        df['std_st'] = self.data.groupby(['stock_id'])['wap'].rolling(window=window_size).std().values
        df['wap_pctch'] = self.data.groupby(['stock_id','date_id'])['wap'].pct_change().values*100
        df['short_ema'] = self.data.groupby(['stock_id'])['wap'].ewm(span=short_window, adjust=False).mean().values
        df['long_ema'] = self.data.groupby(['stock_id'])['wap'].ewm(span=long_window, adjust=False).mean().values
        wap_mean = self.data['wap'].mean()
        df['wap_vs_market'] = self.data['wap'] - self.data.groupby(['stock_id'])['wap'].transform('mean')
        df['macd'] = df['short_ema'] - df['long_ema']
        
        # Bollinger Bands computed per stock (rolling mean +/- 2 rolling standard deviations)
        df['bollinger_upper'] = self.data.groupby(['stock_id'])['wap'].rolling(window=long_window).mean().values + 2 * self.data.groupby(['stock_id'])['wap'].rolling(window=window_size).std().values
        df['bollinger_lower'] = self.data.groupby(['stock_id'])['wap'].rolling(window=long_window).mean().values - 2 * self.data.groupby(['stock_id'])['wap'].rolling(window=window_size).std().values
        # RSI computed per stock
        df['rsi'] = self.data.groupby(['stock_id']).apply(calculate_rsi).values
        
        return df

    @print_log("Executed {func_name}, Elapsed time: {elapsed_time:.2f} seconds")
    def feature_version_yongmin_2(self, *args, version_name="feature_version_yongmin_2"):
        # feature engineering: yongmin version 2
        # create empty dataframe
        df = pd.DataFrame(index=self.data.index)
        self.data["stock_weights"] = self.data["stock_id"].map(_weights)
        self.data["weighted_wap"] = self.data["stock_weights"] * self.data["wap"]
        df['wap_momentum'] = self.data.groupby('stock_id')['weighted_wap'].pct_change(periods=6)

        df["imbalance_momentum"] = self.data.groupby(['stock_id'])['imbalance_size'].diff(periods=1) / self.data['matched_size']
        self.data["price_spread"] = self.data["ask_price"] - self.data["bid_price"]
        
        self.data["mid_price"] = self.data.eval("(ask_price + bid_price) / 2")
        self.data["liquidity_imbalance"] = self.data.eval(f"(bid_size-ask_size)/(bid_size+ask_size+{EPS})")
        df["matched_imbalance"] = self.data.eval(f"(imbalance_size-matched_size)/(matched_size+imbalance_size+{EPS})")
        df["size_imbalance"] = self.data.eval(f"bid_size / (ask_size + {EPS})")  # parentheses keep EPS in the denominator
        
        df["spread_intensity"] = self.data.groupby(['stock_id'])['price_spread'].diff()
        df['price_pressure'] = self.data['imbalance_size'] * (self.data['ask_price'] - self.data['bid_price'])
        df['market_urgency'] = self.data['price_spread'] * self.data['liquidity_imbalance']
        df['depth_pressure'] = (self.data['ask_size'] - self.data['bid_size']) * (self.data['far_price'] - self.data['near_price'])
        
        df['spread_depth_ratio'] = (self.data['ask_price'] - self.data['bid_price']) / (self.data['bid_size'] + self.data['ask_size'])
        df['mid_price_movement'] = self.data['mid_price'].diff(periods=5).apply(lambda x: 1 if x > 0 else (-1 if x < 0 else 0))
        
        df['micro_price'] = ((self.data['bid_price'] * self.data['ask_size']) + (self.data['ask_price'] * self.data['bid_size'])) / (self.data['bid_size'] + self.data['ask_size'])
        df['relative_spread'] = (self.data['ask_price'] - self.data['bid_price']) / self.data['wap']

        return df
    

    @print_log("Executed {func_name}, Elapsed time: {elapsed_time:.2f} seconds")
    def feature_version_alvin_1(self, *args, version_name="feature_version_alvin_1"):
        # feature engineering version 1
        # create empty dataframe
        df = pd.DataFrame(index=self.data.index)
        prices = ["reference_price", "far_price", "near_price", "ask_price", "bid_price", "wap"]
        sizes = ["matched_size", "bid_size", "ask_size", "imbalance_size"]

        for c in combinations(prices, 2):
            df[f"{c[0]}_{c[1]}_imb"] = self.data.eval(f"({c[0]} - {c[1]})/({c[0]} + {c[1]} + {EPS})")

        for c in [['ask_price', 'bid_price', 'wap', 'reference_price'], sizes]:
            triplet_feature = calculate_triplet_imbalance_numba(c, self.data)
            df[triplet_feature.columns] = triplet_feature.values

        return df

    @print_log("Executed {func_name}, Elapsed time: {elapsed_time:.2f} seconds")
    def feature_market_cap(self, *args, version_name="feature_market_cap"):
        df = pd.DataFrame(index=self.data.index)

        df = get_stock_info(df, self.data, "Market Cap")

        return df

    @print_log("Executed {func_name}, Elapsed time: {elapsed_time:.2f} seconds")
    def feature_sector(self, *args, version_name="feature_sector"):
        df = pd.DataFrame(index=self.data.index)

        df = get_stock_info(df, self.data, "Sector")

        return df

    @print_log("Executed {func_name}, Elapsed time: {elapsed_time:.2f} seconds")
    def feature_industry(self, *args, version_name="feature_industry"):
        df = pd.DataFrame(index=self.data.index)

        df = get_stock_info(df, self.data, "Industry")

        return df

    # You can add more feature engineering versions like the ones above
    @print_log("Executed {func_name}, Elapsed time: {elapsed_time:.2f} seconds")
    def execute_feature_versions(self, save=False, load=False):
        results = {}

        for version in self.feature_versions:
            if load:
                df = self._load_from_parquet(version)
            else:
                method = getattr(self, version, None)
                if callable(method):
                    args = []
                    for dep in self.dependencies.get(version, []):
                        dep_result = results.get(dep)
                        if isinstance(dep_result, pd.DataFrame):
                            args.append(dep_result)
                        elif dep_result is None and hasattr(self, dep):
                            dep_method = getattr(self, dep)
                            dep_result = dep_method()
                            results[dep] = dep_result
                            args.append(dep_result)
                        else:
                            args.append(None)
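                    df = method(*args)
                    if save:
                        self._save_to_parquet(df, version)
                else:
                    # Fail loudly instead of reusing a stale or undefined df
                    raise AttributeError(f"FeatureEngineer has no feature version method named {version!r}")
            results[version] = df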

        # Return only the versions listed in self.feature_versions
        return {k: v for k, v in results.items() if k in self.feature_versions}

    @print_log("Executed {func_name}, Elapsed time: {elapsed_time:.2f} seconds")
    def transform(self, save=False, load=False):
        feature_versions_results = self.execute_feature_versions(save=save, load=load)
        if not self.infer:
            self.data["date_id_copy"] = self.data["date_id"]
        concat_df = pd.concat([self.data] + list(feature_versions_results.values()), axis=1)

        exclude_columns = ["row_id", "time_id", "date_id"]
        final_data = self.feature_selection(concat_df, exclude_columns)
        # NOTE: this overwrites the feature_selection result, so exclude_columns currently has no effect
        final_data = concat_df
        return final_data


## Split Data Class
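
When `correct` is True, the `Splitter` class moves each split index to the nearest date boundary so that no day is cut in half. The sketch below reproduces the logic of `_correct_boundary` on a plain list of date ids instead of a DataFrame; the standalone function and the sample list are illustrative only.

```python
def correct_boundary(date_ids, idx, direction="forward"):
    """Move idx to the edge of the run of equal date ids that contains it."""
    if idx == 0 or idx == len(date_ids) - 1:
        return idx
    original_date = date_ids[idx]
    if direction == "forward":
        # Advance to the first row of the next date.
        while idx < len(date_ids) and date_ids[idx] == original_date:
            idx += 1
    elif direction == "backward":
        # Step back to the first row of the current date.
        while idx > 0 and date_ids[idx] == original_date:
            idx -= 1
        idx += 1
    return idx


date_ids = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # three rows per date

forward = correct_boundary(date_ids, 4, "forward")    # first row of date 2
backward = correct_boundary(date_ids, 4, "backward")  # first row of date 1
print(forward, backward)
```

Used as slice bounds, the forward result ends a slice after the last row of the current date, and the backward result ends it just before the current date begins.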

In [24]:
class Splitter:
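    """
    Data splitting class

    Attributes
    ----------
    method : str
        Split method (time_series, rolling, blocking, holdout)
    n_splits : int
        Number of splits
    correct : bool
        Whether to align split boundaries to date boundaries
    initial_fold_size_ratio : float
        Size of the initial fold as a fraction of the data
    train_test_ratio : float
        Fraction of each fold used for training

    Methods
    -------
    split()
        Perform the data split
    """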

    def __init__(self, method, n_splits, correct, initial_fold_size_ratio=0.6, train_test_ratio=0.8, gap=0,
                 overlap=True, train_start=0,
                 train_end=390, valid_start=391, valid_end=480):
        self.method = method
        self.n_splits = n_splits
        self.correct = correct
        self.initial_fold_size_ratio = initial_fold_size_ratio
        self.train_test_ratio = train_test_ratio

        self.gap = gap
        self.overlap = overlap

        # only for holdout method
        self.train_start = train_start
        self.train_end = train_end
        self.valid_start = valid_start
        self.valid_end = valid_end

        self.target = config["target"]

        self.boundaries = []

    def split(self, data):
        self.data = data #reduce_mem_usage(data)
        self.all_dates = self.data['date_id_copy'].unique()
        if self.method == "time_series":
            if self.n_splits <= 1:
                raise ValueError("Time series split method only works with n_splits > 1")
            return self._time_series_split(data)
        elif self.method == "rolling":
            if self.n_splits <= 1:
                raise ValueError("Rolling split method only works with n_splits > 1")
            return self._rolling_split(data)
        elif self.method == "blocking":
            if self.n_splits <= 1:
                raise ValueError("Blocking split method only works with n_splits > 1")
            self.initial_fold_size_ratio = 1.0 / self.n_splits
            return self._rolling_split(data)
        elif self.method == "holdout":
            if self.n_splits != 1:
                raise ValueError("Holdout method only works with n_splits=1")
            return self._holdout_split(data)
        else:
            raise ValueError("Invalid method")

    def _correct_boundary(self, data, idx, direction="forward"):
        # Correct the boundary based on date_id_copy
        original_idx = idx
        if idx == 0 or idx == len(data) - 1:
            return idx
        if direction == "forward":
            while idx < len(data) and data.iloc[idx]['date_id_copy'] == data.iloc[original_idx]['date_id_copy']:
                idx += 1
        elif direction == "backward":
            while idx > 0 and data.iloc[idx]['date_id_copy'] == data.iloc[original_idx]['date_id_copy']:
                idx -= 1
            idx += 1  # adjust to include the boundary
        return idx

    def _time_series_split(self, data):
        n = len(data)
        initial_fold_size = int(n * self.initial_fold_size_ratio)
        initial_test_size = int(initial_fold_size * (1 - self.train_test_ratio))
        increment = (1.0 - self.initial_fold_size_ratio) / (self.n_splits - 1)

        for i in range(self.n_splits):
            fold_size = int(n * (self.initial_fold_size_ratio + i * increment))
            train_size = fold_size - initial_test_size

            if self.correct:
                train_size = self._correct_boundary(data, train_size, "forward")
                end_of_test = self._correct_boundary(data, train_size + initial_test_size, "forward")
            else:
                end_of_test = train_size + initial_test_size

            train_slice = data.iloc[:train_size]
            test_slice = data.iloc[train_size:end_of_test]
            if test_slice.shape[0] == 0:
                raise ValueError("Try setting correct=False or Try reducing the train_test_ratio")

            X_train = train_slice.drop(columns=[self.target, 'date_id_copy'])
            y_train = train_slice[self.target]
            X_test = test_slice.drop(columns=[self.target, 'date_id_copy'])
            y_test = test_slice[self.target]

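            # Store (train_start, train_end, test_start, test_end); visualize_splits unpacks four values
            self.boundaries.append((
                train_slice['date_id_copy'].iloc[0],
                train_slice['date_id_copy'].iloc[-1],
                test_slice['date_id_copy'].iloc[0],
                test_slice['date_id_copy'].iloc[-1]
            ))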
            yield X_train, y_train, X_test, y_test

    def _rolling_split(self, data):
        n = len(data)
        total_fold_size = int(n * self.initial_fold_size_ratio)
        test_size = int(total_fold_size * (1 - self.train_test_ratio))
        gap_size = int(total_fold_size * self.gap)
        train_size = total_fold_size - test_size
        # Guard against division by zero when n_splits == 1
        rolling_increment = (n - total_fold_size) // max(self.n_splits - 1, 1)

        end_of_test = n - 1
        start_of_test = end_of_test - test_size
        end_of_train = start_of_test - gap_size
        start_of_train = end_of_train - train_size

        for _ in range(self.n_splits):
            if self.correct:
                start_of_train = self._correct_boundary(data, start_of_train, direction="forward")
                end_of_train = self._correct_boundary(data, end_of_train, direction="backward")
                start_of_test = self._correct_boundary(data, start_of_test, direction="forward")
                end_of_test = self._correct_boundary(data, end_of_test, direction="forward")

            # Explicit positional slicing, consistent with _time_series_split
            train_slice = data.iloc[start_of_train:end_of_train]
            test_slice = data.iloc[start_of_test:end_of_test]
            if test_slice.shape[0] == 0:
                raise ValueError("Try setting correct=False or Try reducing the train_test_ratio")

            X_train = train_slice.drop(columns=[self.target, 'date_id_copy'])
            y_train = train_slice[self.target]
            X_test = test_slice.drop(columns=[self.target, 'date_id_copy'])
            y_test = test_slice[self.target]

            self.boundaries.append((
                train_slice['date_id_copy'].iloc[0],
                train_slice['date_id_copy'].iloc[-1],
                test_slice['date_id_copy'].iloc[0],
                test_slice['date_id_copy'].iloc[-1]
            ))
            yield X_train, y_train, X_test, y_test
            start_of_train = max(start_of_train - rolling_increment, 0)
            end_of_train -= rolling_increment
            start_of_test -= rolling_increment
            end_of_test -= rolling_increment

    def _holdout_split(self, data):
        # train_start ~ train_end: training period
        # valid_start ~ valid_end: validation period
        # Split the data into training and validation sets
        threshold = int(data['date_id_copy'].nunique() * self.train_test_ratio)
        self.train_start, self.train_end = 0, data['date_id_copy'].unique()[threshold]
        self.valid_start, self.valid_end = self.train_end + self.gap, data['date_id_copy'].unique()[-1]
        
        train_mask = (data['date_id_copy'] >= self.train_start) & (data['date_id_copy'] <= self.train_end)
        valid_mask = (data['date_id_copy'] >= self.valid_start) & (data['date_id_copy'] <= self.valid_end)

        train_slice = data[train_mask]
        valid_slice = data[valid_mask]

        X_train = train_slice.drop(columns=[self.target, 'date_id_copy'])
        y_train = train_slice[self.target]
        X_valid = valid_slice.drop(columns=[self.target, 'date_id_copy'])
        y_valid = valid_slice[self.target]

        self.boundaries.append((
            train_slice['date_id_copy'].iloc[0],
            train_slice['date_id_copy'].iloc[-1],
            valid_slice['date_id_copy'].iloc[0],
            valid_slice['date_id_copy'].iloc[-1]
        ))
        yield X_train, y_train, X_valid, y_valid

    def visualize_splits(self):
        print("Visualizing Train/Test Split Boundaries")

        plt.figure(figsize=(15, 6))

        for idx, (train_start, train_end, test_start, test_end) in enumerate(self.boundaries):
            train_width = train_end - train_start + 1
            plt.barh(y=idx, width=train_width, left=train_start, color='blue', edgecolor='black')
            plt.text(train_start + train_width / 2, idx - 0.15, f'{train_start}-{train_end}', ha='center', va='center',
                     color='black', fontsize=8)

            test_width = test_end - test_start + 1
            plt.barh(y=idx, width=test_width, left=test_start, color='red', edgecolor='black')
            if test_width > 0:
                plt.text(test_start + test_width / 2, idx + 0.15, f'{test_start}-{test_end}', ha='center', va='center',
                         color='black', fontsize=8)

        plt.yticks(range(len(self.boundaries)), [f"split {i + 1}" for i in range(len(self.boundaries))])
        # max(1, ...) avoids a zero slice step when there are fewer than 10 dates
        plt.xticks(self.all_dates[::max(1, len(self.all_dates) // 10)])
        plt.xlabel("date_id_copy")
        plt.title("Train/Test Split Boundaries")
        plt.grid(axis='x')

        plt.tight_layout()
        plt.show()
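
The boundary correction used by `Splitter` can be illustrated without pandas. The sketch below is a simplified, standalone version of the "forward" branch of `_correct_boundary`; the helper name `snap_forward` and the toy `date_ids` list are illustrative and not part of the class. It moves a cut index forward until the date changes, so that no single `date_id` ends up in both train and test.

```python
def snap_forward(date_ids, idx):
    """Move idx forward to the first row whose date differs from date_ids[idx].

    Mirrors the "forward" branch of _correct_boundary: indices at either end
    of the data are returned unchanged.
    """
    if idx == 0 or idx == len(date_ids) - 1:
        return idx
    start_date = date_ids[idx]
    while idx < len(date_ids) and date_ids[idx] == start_date:
        idx += 1
    return idx


# Toy data: three rows per date, dates 0..3 (hypothetical values).
date_ids = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]

raw_cut = 4                        # falls in the middle of date 1
cut = snap_forward(date_ids, raw_cut)

train_dates = set(date_ids[:cut])
test_dates = set(date_ids[cut:])
print(cut, sorted(train_dates), sorted(test_dates))
```

Note that an index already sitting on the first row of a date is still pushed to the start of the next date, which matches the behaviour of the original loop.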

## Model Class

In [25]:
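def create_mlp(num_continuous_features, num_categorical_features, embedding_dims, num_labels, hidden_units, dropout_rates, learning_rate, l2_strength=0.01):
    # num_categorical_features: list with the cardinality of each categorical column
    # dropout_rates: one rate for the input plus one per hidden layer (len(hidden_units) + 1)
    # Note: l2_strength is accepted but not used anywhere in this function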
    
    # Numerical variables input
    input_continuous = tf.keras.layers.Input(shape=(num_continuous_features,))
    
    # Categorical variables input
    input_categorical = [tf.keras.layers.Input(shape=(1,)) 
                         for _ in range(len(num_categorical_features))]
    
    # Embedding layer for categorical variables
    embeddings = [tf.keras.layers.Embedding(input_dim=num_categorical_features[i] + 1, 
                                            output_dim=embedding_dims[i], 
                                            embeddings_initializer='he_normal')(input_cat) 
                  for i, input_cat in enumerate(input_categorical)]
    flat_embeddings = [tf.keras.layers.Flatten()(embed) for embed in embeddings]
    
    # concat numerical and categorical
    concat_input = tf.keras.layers.concatenate([input_continuous] + flat_embeddings)
    
    # MLP
    x = tf.keras.layers.BatchNormalization()(concat_input)
    x = tf.keras.layers.Dropout(dropout_rates[0])(x)
    
    for i in range(len(hidden_units)): 
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Dropout(dropout_rates[i+1])(x)
        x = tf.keras.layers.Dense(hidden_units[i], kernel_initializer='he_normal')(x)
        # x = tf.keras.layers.LeakyReLU()(x)
        x = tf.keras.layers.Activation(tf.keras.activations.swish)(x)
        
    # Output layer: linear (no activation)
    out = tf.keras.layers.Dense(num_labels, kernel_initializer='he_normal')(x) 
    
    model = tf.keras.models.Model(inputs=[input_continuous] + input_categorical, outputs=out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss='mae',  
                  metrics=['mae'])
    
    gc.collect()
    return model
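
`create_mlp` returns a model with one continuous input followed by one single-column input per categorical feature, so the arrays passed to `fit` and `predict` have to follow the same order and count. A minimal sketch of assembling that list, using plain Python lists in place of NumPy arrays; `split_categorical_columns` is a hypothetical helper name:

```python
def split_categorical_columns(categorical_rows):
    """Turn rows of categorical codes into one single-column block per feature."""
    if not categorical_rows:
        return []
    n_features = len(categorical_rows[0])
    return [[[row[i]] for row in categorical_rows] for i in range(n_features)]


# Two samples, three categorical features (e.g. stock_id, Sector, Industry).
continuous = [[0.1, -1.2], [0.4, 0.3]]
categorical = [[7, 2, 15], [3, 0, 9]]

model_inputs = [continuous] + split_categorical_columns(categorical)
print(len(model_inputs))   # one continuous block plus one block per categorical feature
print(model_inputs[1])     # first categorical feature as a column
```

With NumPy arrays the same split can be written as `[X_cat[:, i:i + 1] for i in range(X_cat.shape[1])]`, which keeps the number of inputs tied to the number of categorical columns.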

## Main
## import data

quantile transformer works well
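
For intuition, a quantile transform with a normal output replaces each value by the normal quantile of its empirical rank, which flattens heavy tails and outliers. The sketch below is a simplified pure-Python illustration of that idea, not scikit-learn's implementation (which interpolates over a fixed set of quantiles); `quantile_to_normal` is a hypothetical helper.

```python
from statistics import NormalDist


def quantile_to_normal(values):
    """Map values to standard-normal scores via their empirical ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    normal = NormalDist()  # mean 0, standard deviation 1
    scores = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n keeps probabilities strictly inside (0, 1)
        scores[i] = normal.inv_cdf((rank + 0.5) / n)
    return scores


raw = [1.0, 2.0, 3.0, 4.0, 1000.0]   # one extreme outlier
scores = quantile_to_normal(raw)
print([round(s, 3) for s in scores])
```

The outlier `1000.0` ends up only as far from the centre as the smallest value is on the other side, because only its rank matters, not its magnitude.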

In [26]:
# If a feature engineering function takes args, add it to dependencies
dependencies = {
     # "feature_version_yongmin_1": ["feature_version_yongmin_0"],
}
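
How `FeatureEngineer` consumes `dependencies` is not shown in this section. One plausible reading is that each key is a feature function and its value lists the feature functions whose outputs it needs as arguments, so the functions have to run in dependency order. A minimal sketch under that assumption, using the standard-library `graphlib`; `resolve_order` is a hypothetical helper and not part of `FeatureEngineer`:

```python
from graphlib import TopologicalSorter


def resolve_order(feature_versions, dependencies):
    """Return feature function names ordered so prerequisites come first."""
    # Each node maps to the set of functions that must run before it.
    graph = {name: set(dependencies.get(name, [])) for name in feature_versions}
    return list(TopologicalSorter(graph).static_order())


dependencies = {
    "feature_version_yongmin_1": ["feature_version_yongmin_0"],
}
order = resolve_order(
    ["feature_version_yongmin_1", "feature_version_yongmin_0"],
    dependencies,
)
print(order)
```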

In [27]:
class QuantileTF:
    def __init__(self):
        self.scaler = QuantileTransformer(output_distribution='normal')
        self.pipe = None

    def initialize_pipeline(self, df, usecols, passcols):
        columnTF = ColumnTransformer(
                                        transformers = [
                                            ("scaler" , self.scaler, usecols),
                                            ("pass", 'passthrough', passcols)
                                        ]
                                    )

        self.pipe = Pipeline([
                                ("scaler", columnTF)
                            ])

    def fit(self, df, usecols, passcols):
        self.initialize_pipeline(df, usecols, passcols)
        self.pipe.fit(df)
        
    def transform(self, df):
        return self.pipe.transform(df)

    def fit_transform(self, df, usecols, passcols):
        self.initialize_pipeline(df, usecols, passcols)
        return self.pipe.fit_transform(df)

In [28]:
if config["train_mode"]:
    
    df = pd.read_csv(f"{config['data_dir']}/train.csv")

    # Data preprocessing
    df = reduce_mem_usage(df)

    # Select the feature engineering functions to use
    feature_engineer = FeatureEngineer(df, feature_versions=['feature_version_yongmin_0', 'feature_version_yongmin_1', 'feature_version_yongmin_2',
                                                             'feature_version_alvin_1', 'feature_market_cap', 'feature_sector', 'feature_industry'],
                                       dependencies=dependencies)
    
    feature_engineer.generate_global_features(df)
    
    df = feature_engineer.transform()  # If the first run used save=True, later runs can switch to transform(load=True)

    # fillna
    df = (df.replace([np.inf, -np.inf], np.nan)
          .ffill()  # fillna(method='ffill') is deprecated in recent pandas
          .fillna(0)
         )

    df = df.drop(['row_id', 'time_id'], axis=1)
    df = df.loc[:,~df.columns.duplicated()].copy()

    batch_size = 200
    epochs = 15
    hidden_units = [128, 128, 32] # 3 layer
    dropout_rates = [0, 0.1, 0.1, 0.1]
    learning_rate = 1e-3
    embedding_dims = [50, 10, 10]
    MODEL_NAME = "my_nn_model_15epochV6_0.h5"
    
    # set checkpoint path
    ckp_path = os.path.join(config['model_dir'], MODEL_NAME)
    os.makedirs(config['model_dir'], exist_ok=True)
    
    rlr = ReduceLROnPlateau(monitor='val_mae', factor=0.5, patience=3, verbose=0, min_delta=1e-5, mode='min')
    ckp = ModelCheckpoint(ckp_path, monitor='val_mae', verbose=0, save_best_only=True, save_weights_only=True, mode='min')
    es = EarlyStopping(monitor='val_mae', min_delta=1e-4, patience=7, mode='min', restore_best_weights=True, verbose=0)

    model_checkpoint = [rlr, ckp, es]

    splitter = Splitter(method='holdout', n_splits=1, correct=True, train_test_ratio=0.98, gap=5)
    for idx, (X_train, y_train, X_test, y_test) in enumerate(splitter.split(df)):
        if 'date_id_copy' in X_train.columns:
            X_train = X_train.drop(['date_id_copy'], axis=1)
            X_test  = X_test.drop(['date_id_copy'], axis=1)
            
        print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

        scaler = QuantileTF()

        categorical = ["stock_id", "Sector", "Industry"]
        # sorted() gives a reproducible column order; iterating a set of strings can differ between Python sessions
        numerical = sorted(set(X_train.columns) - set(categorical) - set(['target', 'date_id_copy']))
        num_categorical = [len(X_train[col].unique()) for col in categorical]
    
        unselect_columns = ['dow', 'seconds', 'minute', 'polarize_pct_1', 'polarize_pct_6', 'imbalance_buy_sell_flag']
        select_columns = [col for col in numerical if col not in unselect_columns]
        
        X_tr_continuous = scaler.fit_transform(X_train[numerical], select_columns, unselect_columns)
        X_tr_categorical = X_train[categorical].values

        X_ts_continuous = scaler.transform(X_test[numerical])
        X_ts_categorical = X_test[categorical].values

        print("X_train_numerical shape:",X_tr_continuous.shape)
        print("X_train_categorical shape:",X_tr_categorical.shape)
        print("Y_train shape:",y_train.shape)

        print("\n")
        print("X_test_numerical shape:",X_ts_continuous.shape)
        print("X_test_categorical shape:",X_ts_categorical.shape)
        print("Y_test shape:",y_test.shape)
        
        # create model
        model = create_mlp(len(numerical), num_categorical, embedding_dims, 1, hidden_units, dropout_rates, learning_rate)

        print(f"\nFitting Model - Holdout")
        model.fit((X_tr_continuous, X_tr_categorical[:, 0:1], X_tr_categorical[:, 1:2], 
                   X_tr_categorical[:, 2:3]), y_train,
                   epochs=epochs, batch_size=batch_size, 
                   validation_data=((X_ts_continuous, X_ts_categorical[:, 0:1], X_ts_categorical[:, 1:2],
                                     X_ts_categorical[:, 2:3]), y_test),
                   callbacks=model_checkpoint)
    
        model.save_weights(ckp_path) # SAVE MODEL WEIGHTS
    
        pred = model.predict((X_ts_continuous, X_ts_categorical[:, 0:1], X_ts_categorical[:, 1:2],
                              X_ts_categorical[:, 2:3]),
                              batch_size=batch_size)
    
        # Score is computed on the holdout (validation) split, not on the training data
        print("Validation NN Score (MAE):", mean_absolute_error(y_test, pred))

    K.clear_session()
    del model
    rubbish = gc.collect()


----------------------------------------------------------------------------------------------------
Executed generate_global_features, Elapsed time: 0.80 seconds
----------------------------------------------------------------------------------------------------


----------------------------------------------------------------------------------------------------
Executed feature_version_yongmin_0, Elapsed time: 2.54 seconds, shape((5237980, 9))
----------------------------------------------------------------------------------------------------


----------------------------------------------------------------------------------------------------
Executed feature_version_yongmin_1, Elapsed time: 11.50 seconds, shape((5237980, 11))
----------------------------------------------------------------------------------------------------


----------------------------------------------------------------------------------------------------
Executed feature_version_yongmin_2, Elapsed time: 1.29

### upload kaggle dataset

#### dataset init
! /home/username/.local/bin/kaggle datasets init -p {config['model_dir']}
#### dataset create 
! /home/username/.local/bin/kaggle datasets create -p {config['model_dir']}

In [29]:
if MODE == "train":
    ! /usr/local/bin/kaggle datasets init -p {config['model_dir']}
    import json

    with open(f"{config['model_dir']}/dataset-metadata.json", "r") as file:
        data = json.load(file)

    data["title"] = data["title"].replace("INSERT_TITLE_HERE", f"{KAGGLE_DATASET_NAME}")
    data["id"] = data["id"].replace("INSERT_SLUG_HERE", f"{KAGGLE_DATASET_NAME}")

    with open(f"{config['model_dir']}/dataset-metadata.json", "w") as file:
        json.dump(data, file, indent=2)

    ! /usr/local/bin/kaggle datasets create -p {config['model_dir']}

    # !/usr/local/bin/kaggle datasets version -p {config['model_dir']} -m 'Updated data'

Data package template written to: ./models/20231208_20:39:19/dataset-metadata.json
Starting upload for file my_nn_model_15epochV6_0.h5
100%|█████████████████████████████████████████| 258k/258k [00:02<00:00, 122kB/s]
Upload successful: my_nn_model_15epochV6_0.h5 (258KB)
Your private Dataset is being created. Please check progress at https://www.kaggle.com/datasets/jhk3211/model-nn-version-yongmin-6
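
The metadata edit in the cell above can be factored into a small function that is easy to test without calling the Kaggle CLI. The sketch assumes the template written by `kaggle datasets init` contains the `INSERT_TITLE_HERE` and `INSERT_SLUG_HERE` placeholders used above; `fill_dataset_metadata` and the stand-in template are illustrative only.

```python
import json
import os
import tempfile


def fill_dataset_metadata(model_dir, dataset_name):
    """Replace the placeholder title and slug in dataset-metadata.json."""
    path = os.path.join(model_dir, "dataset-metadata.json")
    with open(path, "r") as file:
        data = json.load(file)

    data["title"] = data["title"].replace("INSERT_TITLE_HERE", dataset_name)
    data["id"] = data["id"].replace("INSERT_SLUG_HERE", dataset_name)

    with open(path, "w") as file:
        json.dump(data, file, indent=2)
    return data


# Demo with a stand-in template (the real one is written by the CLI).
with tempfile.TemporaryDirectory() as model_dir:
    template = {"title": "INSERT_TITLE_HERE", "id": "username/INSERT_SLUG_HERE"}
    with open(os.path.join(model_dir, "dataset-metadata.json"), "w") as file:
        json.dump(template, file)
    result = fill_dataset_metadata(model_dir, "model-nn-version-yongmin-6")
    print(result["id"])
```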


In [54]:
dependencies = {
    # "feature_version_alvin_2_1": ["feature_version_alvin_1", "feature_version_alvin_2_0"],
}

In [26]:
if config["infer_mode"]:
    import time
    import optiver2023
    from sklearn.preprocessing import QuantileTransformer
    
    # env and iter_test are required by the prediction loop below
    env = optiver2023.make_env()
    iter_test = env.iter_test()

    y_min, y_max = -64, 64
    qps = []
    counter = 0
    cache = pd.DataFrame()

    # set scaler and features
    scaler = QuantileTransformer(output_distribution='normal')  # match the scaler used in training (QuantileTF)

    df = pd.read_csv(f"{config['data_dir']}/train.csv")
    
    feature_engineer = FeatureEngineer(df, feature_versions=['feature_version_yongmin_0', 'feature_version_yongmin_1', 'feature_version_yongmin_2',
                                                             'feature_version_alvin_1',
                                                             'feature_market_cap', 'feature_sector', 'feature_industry'],
                                       dependencies=dependencies, infer=True)
    feature_engineer.generate_global_features(df)

    df = feature_engineer.transform()

    df = df.drop(['time_id', 'row_id', 'target'], axis = 1)
    
    # fillna
    df = (df.replace([np.inf, -np.inf], np.nan)
          .ffill()  # fillna(method='ffill') is deprecated in recent pandas
          .fillna(0)
         )
    
    # cat - num
    categorical = ["stock_id", "Sector", "Industry"]  # same categorical inputs as the training cell
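    # sorted() gives a reproducible column order; iterating a set of strings can differ between Python sessions
    numerical = sorted(set(df.columns) - set(categorical) - set(['target', 'date_id_copy']))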
    num_categorical = [len(df[col].unique()) for col in categorical]

    unselect_columns = ['dow', 'seconds', 'minute', 'polarize_pct_1', 'polarize_pct_6', 'imbalance_buy_sell_flag']
    select_columns = [col for col in numerical if col not in unselect_columns]  # keeps a reproducible order
    
    # build scaler pipeline
    scaler.fit(df[select_columns])
    
    # The architecture must match the training cell, otherwise load_weights fails on mismatched shapes
    batch_size = 200
    hidden_units = [128, 128, 32] # 3 layer
    dropout_rates = [0, 0.1, 0.1, 0.1]
    learning_rate = 1e-3
    embedding_dims = [50, 10, 10]
    
    # Load Model
    final_model = create_mlp(len(numerical), num_categorical, embedding_dims, 1, hidden_units, dropout_rates, learning_rate)
    model_path = "/kaggle/input/model-nn-version-yongmin-6/my_nn_model_15epochV6_0.h5"
    final_model.load_weights(model_path)
    
    for (test, revealed_targets, sample_prediction) in iter_test:
        
        now_time = time.time()
        cache = pd.concat([cache, test], ignore_index=True, axis=0)
    
        if counter > 0:
            cache = cache.groupby(['stock_id']).tail(21).sort_values(
                                  by=['date_id', 'seconds_in_bucket', 'stock_id']).reset_index(drop=True)
        
        # feature engineering
        feature_engineer = FeatureEngineer(cache, feature_versions=['feature_version_yongmin_0', 'feature_version_yongmin_1', 'feature_version_yongmin_2',
                                                                    'feature_version_alvin_1', 'feature_market_cap', 'feature_sector', 'feature_industry'],
                                           dependencies=dependencies, infer=True)
        
        cache_df = feature_engineer.transform()
        
        cache_df = (cache_df.replace([np.inf, -np.inf], np.nan)
          .ffill()  # fillna(method='ffill') is deprecated in recent pandas
          .fillna(0)
         )
        
        feat = cache_df[-len(test):]
        
        if 'currently_scored' in feat.columns:
            feat = feat.drop(['currently_scored'], axis=1)
        
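        # Same column order as the training ColumnTransformer output: scaled columns first, then passthrough columns
        # .copy() avoids assigning into a slice of cache_df
        X_num = feat[select_columns + unselect_columns].copy()
        X_num[select_columns] = scaler.transform(X_num[select_columns])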
        X_cat = feat[categorical].values
        
        print("X_test_numerical shape:",X_num.shape)
        print("X_test_categorical shape:",X_cat.shape)
        
        # feat = generate_all_features(cache)[-len(test):]
        test_predss = np.zeros(feat.shape[0])
        
        # prediction
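        # One single-column input per categorical feature, so the input count always matches the model
        inference_prediction = final_model.predict([X_num.values] + [X_cat[:, i:i + 1] for i in range(len(categorical))]).ravel()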
            
        test_predss = zero_sum(inference_prediction, test['bid_size'] + test['ask_size'])
        
        clipped_predictions = np.clip(test_predss, y_min, y_max)
        sample_prediction['target'] = clipped_predictions
        
        env.predict(sample_prediction)
        counter += 1
        qps.append(time.time() - now_time)
        
        if counter % 10 == 0:
            print(counter, 'qps:', np.mean(qps))

    time_cost = 1.146 * np.mean(qps)
    print(f"The code will take approximately {np.round(time_cost, 4)} hours to run inference")
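
`zero_sum` is defined elsewhere in the notebook and is not shown in this section. As a rough illustration of the post-processing step only, the sketch below assumes a volume-weighted adjustment that shifts the predictions so they sum to zero, with larger shifts for larger order-book sizes, followed by the same clipping to `[y_min, y_max]`. The function bodies are assumptions, not the author's definitions.

```python
import math


def zero_sum_sketch(predictions, volumes):
    """Shift predictions so they sum to zero, weighting the shift by sqrt(volume)."""
    weights = [math.sqrt(v) for v in volumes]
    step = sum(predictions) / sum(weights)
    return [p - w * step for p, w in zip(predictions, weights)]


def clip(values, low, high):
    """Clamp every value to the closed interval [low, high]."""
    return [min(max(v, low), high) for v in values]


y_min, y_max = -64, 64
predictions = [1.5, -0.5, 2.0, 80.0]
volumes = [100.0, 400.0, 900.0, 1600.0]   # e.g. bid_size + ask_size per stock

adjusted = zero_sum_sketch(predictions, volumes)
final = clip(adjusted, y_min, y_max)
print(round(sum(adjusted), 9), final)
```

Clipping after the adjustment means the final values may no longer sum exactly to zero when a prediction hits a bound; the loop above applies the two steps in the same order.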

In [27]:
# single 1fold final / final no
# single 1fold final / final
# single 5fold final / final no
# single 5fold final / final
# stacking 1fold final / final no
# stacking 1fold final / final
# stacking 5fold final / final no
# stacking 5fold final / final