# User Guide

## 0. requirements
#### 0.1. Installing a package in the local environment
- ```!pip install --target=/home/<user_name>/<venv_name>/lib/python3.10/site-packages <package_name>```

#### 0.2. Installing the Kaggle API
- See https://www.kaggle.com/docs/api#getting-started-installation-&-authentication
- ```!pip install --target=/home/<user_name>/<venv_name>/lib/python3.10/site-packages kaggle```
- Download the Kaggle API token (kaggle.json), then upload it
- ```!mkdir ~/.kaggle```
- ```!mv ~/kaggle.json ~/.kaggle/``` (move kaggle.json into ~/.kaggle/)
- ```!chmod 600 ~/.kaggle/kaggle.json``` (set permissions)
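
The token steps above can also be done from Python. This is a minimal sketch using only the standard library; `install_kaggle_token` and its `home` parameter are hypothetical names introduced here for illustration and are not part of the Kaggle CLI.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, home=None):
    """Move kaggle.json into <home>/.kaggle/ and restrict it to owner read/write."""
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    # Same effect as `mkdir ~/.kaggle`, but tolerates an existing directory.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    # Same effect as `mv ~/kaggle.json ~/.kaggle/`.
    shutil.move(str(token_path), str(target))
    # Same effect as `chmod 600 ~/.kaggle/kaggle.json`.
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target
```

Calling `install_kaggle_token(Path.home() / "kaggle.json")` would mirror the three shell commands, assuming the token was uploaded to the home directory.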

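## 1. config settings

#### 1.1. init config
- MODE: choose train or inference (train: local environment, inference: Kaggle environment). The notebook code also accepts both, which trains and infers in the Kaggle environment.
- KAGGLE_DATASET_NAME: name of the dataset used for inference in the Kaggle environment
  - A Kaggle dataset is created under this name, so it must be unique.

#### 1.2. train / inference config
- model_directory: path where models are saved
- data_directory: path to the data
- train_mode: whether train mode is on
- infer_mode: whether inference mode is on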

#### 1.3. model config
- model_name: names of the models to use
    - Each name must match a model name defined in models_config below (you choose from those). Type: list.
- target: name of the target column
- split_method: how the data is split
  - time_series: time-series split
  - rolling: rolling-window split
  - blocking: block-wise split
  - holdout: holdout split
- n_splits: number of splits (1 or more)
- correct: whether to align split boundaries to date boundaries (True / False)
- initial_fold_size_ratio: initial fold size ratio (0 to 1)
- train_test_ratio: train/test ratio (0 to 1)
- ~train_start: start of the training period~ (holdout only; set it yourself when split_method is holdout)
- ~train_end: end of the training period~ (holdout only; set it yourself when split_method is holdout)
- ~valid_start: start of the validation period~ (holdout only; set it yourself when split_method is holdout)
- ~valid_end: end of the validation period~ (holdout only; set it yourself when split_method is holdout)
- optuna_random_state: Optuna random state
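
The split settings above constrain each other. The sketch below checks them before training starts; the `n_splits` rules mirror the checks in `Splitter.split` later in this notebook, while `validate_split_config` itself and the ratio range check are assumptions added for illustration.

```python
VALID_SPLIT_METHODS = {"time_series", "rolling", "blocking", "holdout"}


def validate_split_config(cfg):
    """Raise ValueError if the split-related settings are inconsistent."""
    method = cfg["split_method"]
    if method not in VALID_SPLIT_METHODS:
        raise ValueError(f"Invalid split_method: {method!r}")
    if method == "holdout":
        # Holdout produces exactly one train/validation pair.
        if cfg["n_splits"] != 1:
            raise ValueError("holdout only works with n_splits=1")
    elif cfg["n_splits"] <= 1:
        # time_series, rolling, and blocking all need at least two folds.
        raise ValueError(f"{method} only works with n_splits > 1")
    for key in ("initial_fold_size_ratio", "train_test_ratio"):
        # Assumption: ratios follow the documented 0 to 1 range.
        if not 0 <= cfg[key] <= 1:
            raise ValueError(f"{key} must be between 0 and 1")
    return cfg
```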
#### 1.4. model hyperparameter config
- models_config: model hyperparameter settings
    - model: model class
    - params: the model's hyperparameters
        - ... : an individual hyperparameter

## 2. Global Method
- reduce_mem_usage: reduces memory usage
- compute_triplet_imbalance: computes the triplet imbalance
- calculate_triplet_imbalance_numba: computes the triplet imbalance
- print_log: a decorator that runs the code you want before and after a function call.
- zero_sum: zero-sum function

## 3. Pre Code
- DataPreprocessor: data preprocessing class
- FeatureEngineer: feature engineering class
- Splitter: data splitting class
- Model: model class
- Trainer: training class

## 4. Main Code
 

---

## 0. requirements

In [3]:
# !pip install --target=/home/<user_name>/<venv_name>/lib/python3.10/site-packages <package_name>

#### 0.2. Installing the Kaggle API

In [4]:
# !pip install --target=/home/<user_name>/<venv_name>/lib/python3.10/site-packages kaggle

In [5]:
# !mkdir ~/.kaggle

In [6]:
# !mv ~/kaggle.json ~/.kaggle/

In [7]:
# !chmod 600 ~/.kaggle/kaggle.json

## 1. config settings

#### 1.1. init config

In [2]:
MODE = "train"  # train, inference, both
KAGGLE_DATASET_NAME = "model-nn-version-yongmin-0"

In [3]:
import gc
import os
import time
import warnings
from itertools import combinations
from warnings import simplefilter
import functools
from numba import njit, prange
import pyarrow.parquet as pq

import joblib
import lightgbm as lgb
import xgboost as xgb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import optuna
from functools import partial
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, TimeSeriesSplit

from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

import tensorflow as tf
import tensorflow.keras.backend as K
import tensorflow.keras.layers as layers
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import Callback, ReduceLROnPlateau, ModelCheckpoint, EarlyStopping

warnings.filterwarnings("ignore")
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)

2023-11-27 10:09:23.943804: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-11-27 10:09:23.963294: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-27 10:09:23.963314: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-27 10:09:23.963327: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-27 10:09:23.967282: I tensorflow/core/platform/cpu_feature_g

#### 1.2. train / inference config

In [4]:
lgb.__version__, xgb.__version__

('3.3.2', '2.0.1')

In [5]:
EPS = 1e-10

In [6]:
if MODE == "train":
    print("You are in train mode")
    model_directory = "./models/" + time.strftime("%Y%m%d_%H:%M:%S", time.localtime(time.time() + 9 * 60 * 60))
    data_directory = "./data"
    train_mode = True
    infer_mode = False
elif MODE == "inference":
    print("You are in inference mode")
    model_directory = f'/kaggle/input/{KAGGLE_DATASET_NAME}'
    data_directory = "/kaggle/input/optiver-trading-at-the-close"
    train_mode = False
    infer_mode = True
elif MODE == "both":
    print("You are in both mode")
    model_directory = f'/kaggle/working/'
    data_directory = "/kaggle/input/optiver-trading-at-the-close"
    train_mode = True
    infer_mode = True
else:
    raise ValueError("Invalid mode")

You are in train mode


#### 1.3. model config

In [7]:
config = {
    "data_dir": data_directory,
    "model_dir": model_directory,

    "train_mode": train_mode,  # True : train, False : not train
    "infer_mode": infer_mode,  # True : inference, False : not inference
    "model_name": ["lgb"],  # model name
    "final_mode": False,  # True : using final model, False : not using final model
    "best_iterate_ratio": 1.2,  # best iteration ratio
    'target': 'target',

    'split_method': 'rolling',  # time_series, rolling, blocking, holdout
    'n_splits': 3,  # number of splits
    'correct': True,  # correct boundary
    'gap': 0.05,  # gap between train and test (0.05 = 5% of train size)

    'initial_fold_size_ratio': 0.8,  # initial fold size ratio
    'train_test_ratio': 0.9,  # train, test ratio

    'optuna_random_state': 42,
}
config["model_mode"] = "single" if len(config["model_name"]) == 1 else "stacking"  # single vs. stacking, decided by the number of models
# A single model that is not in final_mode has no evaluation criterion across multiple folds, so it is evaluated by MAE.
config["mae_mode"] = config["model_mode"] == "single" and not config["final_mode"]
# In final_mode or mae_mode, run inference once per model; otherwise run it n_splits times.
config["inference_n_splits"] = len(config["model_name"]) if config["final_mode"] or config["mae_mode"] else config["n_splits"]

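The three derived settings in the config cell follow from `model_name`, `final_mode`, and `n_splits`. A small sketch of the same logic as a pure function, to make the combinations easier to check; `derive_modes` is a hypothetical helper name, not part of the notebook.

```python
def derive_modes(model_names, final_mode, n_splits):
    """Reproduce the model_mode / mae_mode / inference_n_splits logic as a function."""
    model_mode = "single" if len(model_names) == 1 else "stacking"
    mae_mode = model_mode == "single" and not final_mode
    inference_n_splits = len(model_names) if (final_mode or mae_mode) else n_splits
    return {
        "model_mode": model_mode,
        "mae_mode": mae_mode,
        "inference_n_splits": inference_n_splits,
    }


print(derive_modes(["lgb"], final_mode=False, n_splits=3))
# {'model_mode': 'single', 'mae_mode': True, 'inference_n_splits': 1}
print(derive_modes(["lgb", "xgb"], final_mode=False, n_splits=3))
# {'model_mode': 'stacking', 'mae_mode': False, 'inference_n_splits': 3}
```
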
#### 1.4. model hyperparameter config

In [8]:
if MODE == "train":
    if not os.path.exists(config["model_dir"]):
        os.makedirs(config["model_dir"])
    if not os.path.exists(config["data_dir"]):
        os.makedirs(config["data_dir"])
    !kaggle competitions download optiver-trading-at-the-close -p {config["data_dir"]} --force
    !unzip -o {config["data_dir"]}/optiver-trading-at-the-close.zip -d {config["data_dir"]}
    !rm {config["data_dir"]}/optiver-trading-at-the-close.zip

Downloading optiver-trading-at-the-close.zip to ./data
100%|███████████████████████████████████████▉| 200M/201M [00:07<00:00, 33.1MB/s]
100%|████████████████████████████████████████| 201M/201M [00:07<00:00, 29.3MB/s]
Archive:  ./data/optiver-trading-at-the-close.zip
  inflating: ./data/example_test_files/revealed_targets.csv  
  inflating: ./data/example_test_files/sample_submission.csv  
  inflating: ./data/example_test_files/test.csv  
  inflating: ./data/optiver2023/__init__.py  
  inflating: ./data/optiver2023/competition.cpython-310-x86_64-linux-gnu.so  
  inflating: ./data/public_timeseries_testing_util.py  
  inflating: ./data/train.csv        


## Global Method

In [9]:
def reduce_mem_usage(df, verbose=0):
    """
    Iterate through all numeric columns of a dataframe and modify the data type
    to reduce memory usage.
    """

    start_mem = df.memory_usage().sum() / 1024 ** 2

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float32)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float32)

    return df
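
The integer branch of `reduce_mem_usage` picks the first of int8, int16, int32, int64 whose range strictly contains the column's minimum and maximum. The sketch below isolates that selection; `smallest_int_dtype` is a hypothetical helper name, not part of the notebook. Float columns are not covered here, since the function above casts them all to float32.

```python
import numpy as np


def smallest_int_dtype(c_min, c_max):
    """Return the narrowest signed integer dtype whose range strictly contains [c_min, c_max]."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        # Strict comparisons, as in reduce_mem_usage: a value equal to the
        # dtype limit is pushed to the next wider dtype.
        if c_min > info.min and c_max < info.max:
            return dtype
    return None  # nothing fits; a caller would keep the original dtype


print(smallest_int_dtype(0, 100))            # <class 'numpy.int8'>
print(smallest_int_dtype(0, 127))            # <class 'numpy.int16'>
print(smallest_int_dtype(-40_000, 40_000))   # <class 'numpy.int32'>
```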

In [10]:
def print_log(message_format):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Check for self: see whether the first argument is a class instance with an `infer` attribute.
            if args and hasattr(args[0], 'infer'):
                self = args[0]

                # If self.infer is True, skip logging and return the function result directly.
                if self.infer:
                    return func(*args, **kwargs)

            start_time = time.time()
            result = func(*args, **kwargs)
            end_time = time.time()

            elapsed_time = end_time - start_time

            if result is not None:
                data_shape = getattr(result, 'shape', 'No shape attribute')
                shape_message = f", shape({data_shape})"
            else:
                shape_message = ""

            print(f"\n{'-' * 100}")
            print(message_format.format(func_name=func.__name__, elapsed_time=elapsed_time) + shape_message)
            print(f"{'-' * 100}\n")

            return result

        return wrapper

    return decorator


In [11]:
def zero_sum(prices, volumes):
    std_error = np.sqrt(volumes)
    step = np.sum(prices) / np.sum(std_error)
    out = prices - std_error * step
    return out
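
Why the result sums to zero: `sum(out) = sum(prices) - step * sum(sqrt(volumes))`, and `step` is defined as `sum(prices) / sum(sqrt(volumes))`, so the two terms cancel. A quick numerical check follows; the function body is repeated so the example runs on its own, and the input values are made up for illustration.

```python
import numpy as np


def zero_sum(prices, volumes):
    # Same arithmetic as the notebook's zero_sum, repeated for a standalone example.
    std_error = np.sqrt(volumes)
    step = np.sum(prices) / np.sum(std_error)
    return prices - std_error * step


prices = np.array([1.0, -0.5, 2.5])    # made-up predictions
volumes = np.array([4.0, 9.0, 16.0])   # made-up volumes
adjusted = zero_sum(prices, volumes)

# Each element is shifted in proportion to sqrt(volume); the total is ~0.
print(adjusted)
print(abs(adjusted.sum()) < 1e-9)  # True
```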

#### Add methods to each class as needed. When you do, list the new method in the class docstring and briefly describe its role in the method's own docstring.

## Pre Code

## Data Preprocessing Class

## Feature Engineering Class

In [12]:
global_features = {}

In [13]:
class FeatureEngineer:
    """
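    This class performs feature engineering on a dataset.
    Its main purpose is to apply various transformations to the dataset and produce features in a form suitable for machine learning models.

    The class contains the following methods:
    1. feature_version_n: implement different versions of feature engineering.
       Each of these methods applies its own transformation to the data and can also combine the results of other feature engineering versions.
    2. transform: combines the results of all feature engineering versions into one unified dataset and returns it.

    About the 'args' parameter of the feature_version_n methods:
    - 'args' is a variadic argument used to pass in the results of other feature engineering versions.
    - For example, if feature_version_2 needs the result of feature_version_0,
      it can be called as feature_version_2(feature_version_0()).
    - Using 'args' this way lets one feature engineering version reference and reuse the results of another.

    Note: feature transformations are applied to new DataFrames rather than to the original data, and
    transform returns the final combined dataset. The one exception is that transform adds a
    date_id_copy column to self.data when infer is False.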
    """

    def __init__(self, data, infer=False, feature_versions=None, dependencies=None,
                 base_directory="./data/fe_versions"):
        self.data = data
        self.infer = infer
        self.feature_versions = feature_versions or []
        self.dependencies = dependencies or {}  # dict defining dependencies between feature versions
        self.base_directory = base_directory
        if not os.path.exists(self.base_directory):
            os.makedirs(self.base_directory)

    def _save_to_parquet(self, df, version_name):
        file_path = os.path.join(self.base_directory, f"{version_name}.parquet")
        df.to_parquet(file_path)
        print(f"Saved {version_name} to {file_path}")

    def _load_from_parquet(self, version_name):
        file_path = os.path.join(self.base_directory, f"{version_name}.parquet")
        if os.path.exists(file_path):
            return pq.read_table(file_path).to_pandas()
        else:
            raise FileNotFoundError(f"File {file_path} not found.")


    @staticmethod
    @print_log("Executed {func_name}, Elapsed time: {elapsed_time:.2f} seconds")
    def generate_global_features(data):
        global_features["version_0"] = {
            "median_size": data.groupby("stock_id")["bid_size"].median() + data.groupby("stock_id")[
                "ask_size"].median(),
            "std_size": data.groupby("stock_id")["bid_size"].std() + data.groupby("stock_id")["ask_size"].std(),
            "ptp_size": data.groupby("stock_id")["bid_size"].max() - data.groupby("stock_id")["bid_size"].min(),
            "median_price": data.groupby("stock_id")["bid_price"].median() + data.groupby("stock_id")[
                "ask_price"].median(),
            "std_price": data.groupby("stock_id")["bid_price"].std() + data.groupby("stock_id")["ask_price"].std(),
            "ptp_price": data.groupby("stock_id")["bid_price"].max() - data.groupby("stock_id")["ask_price"].min(),
        }

    @print_log("Executed {func_name}, Elapsed time: {elapsed_time:.2f} seconds")
    def feature_selection(self, data, exclude_columns):
        # Build a new DataFrame from every column except the excluded ones.
        selected_columns = [c for c in data.columns if c not in exclude_columns]
        data = data[selected_columns]
        return data

    @print_log("Executed {func_name}, Elapsed time: {elapsed_time:.2f} seconds")
    def feature_version_yongmin_0(self, *args, version_name="feature_version_yongmin_0"):
        
        df = pd.DataFrame(index=self.data.index)

        df['dow'] = self.data["date_id"] % 5
        df['seconds'] = self.data['seconds_in_bucket'] % 60
        df['minute'] = self.data['seconds_in_bucket'] // 60
    
        df["volume"] = self.data.eval("ask_size + bid_size")
    
        for i in [1, 5, 10]:
            df[f'pct_change_{i}'] = self.data.groupby(['stock_id', 'seconds_in_bucket'])['wap'].pct_change(i).fillna(0)
    
        return df
        
    # you can add more feature engineering version like above
    @print_log("Executed {func_name}, Elapsed time: {elapsed_time:.2f} seconds")
    def execute_feature_versions(self, save=False, load=False):
        results = {}

        for version in self.feature_versions:
            if load:
                df = self._load_from_parquet(version)
            else:
                method = getattr(self, version, None)
                if callable(method):
                    args = []
                    for dep in self.dependencies.get(version, []):
                        dep_result = results.get(dep)
                        if isinstance(dep_result, pd.DataFrame):
                            args.append(dep_result)
                        elif dep_result is None and hasattr(self, dep):
                            dep_method = getattr(self, dep)
                            dep_result = dep_method()
                            results[dep] = dep_result
                            args.append(dep_result)
                        else:
                            args.append(None)
                    df = method(*args)
                    if save:
                        self._save_to_parquet(df, version)
            results[version] = df

        # return that was in self.feature_versions
        return {k: v for k, v in results.items() if k in self.feature_versions}

    @print_log("Executed {func_name}, Elapsed time: {elapsed_time:.2f} seconds")
    def transform(self, save=False, load=False):
        feature_versions_results = self.execute_feature_versions(save=save, load=load)
        if not self.infer:
            self.data["date_id_copy"] = self.data["date_id"]
        concat_df = pd.concat([self.data] + list(feature_versions_results.values()), axis=1)

        exclude_columns = ["row_id", "time_id", "date_id"]
        final_data = self.feature_selection(concat_df, exclude_columns)
        return final_data
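
The dependency handling in `execute_feature_versions` can be hard to follow inside the class. Below is a simplified sketch of the same idea using plain functions instead of methods and lists instead of DataFrames; `run_versions`, `registry`, and the version names `base` and `doubled` are hypothetical and exist only for this example.

```python
def run_versions(versions, dependencies, registry):
    """Run each requested version, passing its dependencies' results as positional args."""
    results = {}
    for version in versions:
        args = []
        for dep in dependencies.get(version, []):
            if dep not in results and dep in registry:
                # A dependency that has not run yet is computed on demand, without args.
                results[dep] = registry[dep]()
            # Unknown dependencies are passed as None.
            args.append(results.get(dep))
        results[version] = registry[version](*args)
    # Only the explicitly requested versions are returned.
    return {k: v for k, v in results.items() if k in versions}


registry = {
    "base": lambda: [1, 2, 3],
    "doubled": lambda base: [2 * x for x in base],
}
out = run_versions(["doubled"], {"doubled": ["base"]}, registry)
print(out)  # {'doubled': [2, 4, 6]}
```

Here `base` is computed because `doubled` depends on it, but it is left out of the returned dict because it was not requested, which is how the final filter in `execute_feature_versions` behaves.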


## Split Data Class

In [14]:
class Splitter:
    """
    Data splitting class
    
    Attributes
    ----------
    method : str
        splitting method
    n_splits : int
        number of splits
    correct : bool
        whether to align split boundaries to date boundaries
    initial_fold_size_ratio : float
        initial fold size ratio
    train_test_ratio : float
        train/test ratio
        
    Methods
    -------
    split()
        performs the split
    """

    def __init__(self, method, n_splits, correct, initial_fold_size_ratio=0.6, train_test_ratio=0.8, gap=0,
                 overlap=True, train_start=0,
                 train_end=390, valid_start=391, valid_end=480):
        self.method = method
        self.n_splits = n_splits
        self.correct = correct
        self.initial_fold_size_ratio = initial_fold_size_ratio
        self.train_test_ratio = train_test_ratio

        self.gap = gap
        self.overlap = overlap

        # only for holdout method
        self.train_start = train_start
        self.train_end = train_end
        self.valid_start = valid_start
        self.valid_end = valid_end

        self.target = config["target"]

        self.boundaries = []

    def split(self, data):
        self.data = reduce_mem_usage(data)
        self.all_dates = self.data['date_id_copy'].unique()
        if self.method == "time_series":
            if self.n_splits <= 1:
                raise ValueError("Time series split method only works with n_splits > 1")
            return self._time_series_split(data)
        elif self.method == "rolling":
            if self.n_splits <= 1:
                raise ValueError("Rolling split method only works with n_splits > 1")
            return self._rolling_split(data)
        elif self.method == "blocking":
            if self.n_splits <= 1:
                raise ValueError("Blocking split method only works with n_splits > 1")
            self.initial_fold_size_ratio = 1.0 / self.n_splits
            return self._rolling_split(data)
        elif self.method == "holdout":
            if self.n_splits != 1:
                raise ValueError("Holdout method only works with n_splits=1")
            return self._holdout_split(data)
        else:
            raise ValueError("Invalid method")

    def _correct_boundary(self, data, idx, direction="forward"):
        # Correct the boundary based on date_id_copy
        original_idx = idx
        if idx == 0 or idx == len(data) - 1:
            return idx
        if direction == "forward":
            while idx < len(data) and data.iloc[idx]['date_id_copy'] == data.iloc[original_idx]['date_id_copy']:
                idx += 1
        elif direction == "backward":
            while idx > 0 and data.iloc[idx]['date_id_copy'] == data.iloc[original_idx]['date_id_copy']:
                idx -= 1
            idx += 1  # adjust to include the boundary
        return idx

    def _time_series_split(self, data):
        n = len(data)
        initial_fold_size = int(n * self.initial_fold_size_ratio)
        initial_test_size = int(initial_fold_size * (1 - self.train_test_ratio))
        increment = (1.0 - self.initial_fold_size_ratio) / (self.n_splits - 1)

        for i in range(self.n_splits):
            fold_size = int(n * (self.initial_fold_size_ratio + i * increment))
            train_size = fold_size - initial_test_size

            if self.correct:
                train_size = self._correct_boundary(data, train_size, "forward")
                end_of_test = self._correct_boundary(data, train_size + initial_test_size, "forward")
            else:
                end_of_test = train_size + initial_test_size

            train_slice = data.iloc[:train_size]
            test_slice = data.iloc[train_size:end_of_test]
            if test_slice.shape[0] == 0:
                raise ValueError("Try setting correct=False or Try reducing the train_test_ratio")

            X_train = train_slice.drop(columns=[self.target, 'date_id_copy'])
            y_train = train_slice[self.target]
            X_test = test_slice.drop(columns=[self.target, 'date_id_copy'])
            y_test = test_slice[self.target]

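            self.boundaries.append((
                train_slice['date_id_copy'].iloc[0],
                train_slice['date_id_copy'].iloc[-1],
                test_slice['date_id_copy'].iloc[0],  # test start, so the tuple has the 4 values visualize_splits unpacks
                test_slice['date_id_copy'].iloc[-1]
            ))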
            yield X_train, y_train, X_test, y_test

    def _rolling_split(self, data):
        n = len(data)
        total_fold_size = int(n * self.initial_fold_size_ratio)
        test_size = int(total_fold_size * (1 - self.train_test_ratio))
        gap_size = int(total_fold_size * self.gap)
        train_size = total_fold_size - test_size
        rolling_increment = (n - total_fold_size) // (self.n_splits - 1)

        end_of_test = n - 1
        start_of_test = end_of_test - test_size
        end_of_train = start_of_test - gap_size
        start_of_train = end_of_train - train_size

        for _ in range(self.n_splits):
            if self.correct:
                start_of_train = self._correct_boundary(data, start_of_train, direction="forward")
                end_of_train = self._correct_boundary(data, end_of_train, direction="backward")
                start_of_test = self._correct_boundary(data, start_of_test, direction="forward")
                end_of_test = self._correct_boundary(data, end_of_test, direction="forward")

            train_slice = data[start_of_train:end_of_train]
            test_slice = data[start_of_test:end_of_test]
            if test_slice.shape[0] == 0:
                raise ValueError("Try setting correct=False or Try reducing the train_test_ratio")

            X_train = train_slice.drop(columns=[self.target, 'date_id_copy'])
            y_train = train_slice[self.target]
            X_test = test_slice.drop(columns=[self.target, 'date_id_copy'])
            y_test = test_slice[self.target]

            self.boundaries.append((
                train_slice['date_id_copy'].iloc[0],
                train_slice['date_id_copy'].iloc[-1],
                test_slice['date_id_copy'].iloc[0],
                test_slice['date_id_copy'].iloc[-1]
            ))
            yield X_train, y_train, X_test, y_test
            start_of_train = max(start_of_train - rolling_increment, 0)
            end_of_train -= rolling_increment
            start_of_test -= rolling_increment
            end_of_test -= rolling_increment

    def _holdout_split(self, data):
        # train_start ~ train_end : training period
        # valid_start ~ valid_end : validation period
        # Split into training and validation data
        train_mask = (data['date_id_copy'] >= self.train_start) & (data['date_id_copy'] <= self.train_end)
        valid_mask = (data['date_id_copy'] >= self.valid_start) & (data['date_id_copy'] <= self.valid_end)

        train_slice = data[train_mask]
        valid_slice = data[valid_mask]

        X_train = train_slice.drop(columns=[self.target, 'date_id_copy'])
        y_train = train_slice[self.target]
        X_valid = valid_slice.drop(columns=[self.target, 'date_id_copy'])
        y_valid = valid_slice[self.target]

        self.boundaries.append((
            train_slice['date_id_copy'].iloc[0],
            train_slice['date_id_copy'].iloc[-1],
            valid_slice['date_id_copy'].iloc[0],
            valid_slice['date_id_copy'].iloc[-1]
        ))
        yield X_train, y_train, X_valid, y_valid

    def visualize_splits(self):
        print("Visualizing Train/Test Split Boundaries")

        plt.figure(figsize=(15, 6))

        for idx, (train_start, train_end, test_start, test_end) in enumerate(self.boundaries):
            train_width = train_end - train_start + 1
            plt.barh(y=idx, width=train_width, left=train_start, color='blue', edgecolor='black')
            plt.text(train_start + train_width / 2, idx - 0.15, f'{train_start}-{train_end}', ha='center', va='center',
                     color='black', fontsize=8)

            test_width = test_end - test_start + 1
            plt.barh(y=idx, width=test_width, left=test_start, color='red', edgecolor='black')
            if test_width > 0:
                plt.text(test_start + test_width / 2, idx + 0.15, f'{test_start}-{test_end}', ha='center', va='center',
                         color='black', fontsize=8)

        plt.yticks(range(len(self.boundaries)), [f"split {i + 1}" for i in range(len(self.boundaries))])
        plt.xticks(self.all_dates[::max(1, len(self.all_dates) // 10)])  # step of at least 1, even with fewer than 10 dates
        plt.xlabel("date_id_copy")
        plt.title("Train/Test Split Boundaries")
        plt.grid(axis='x')

        plt.tight_layout()
        plt.show()
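
To see which row ranges `_rolling_split` produces, the index arithmetic can be run on its own. This sketch copies that arithmetic without the date-boundary correction (`correct=False`); `rolling_windows` is a hypothetical helper name, and the sizes in the example call are made up.

```python
def rolling_windows(n, n_splits, fold_ratio, train_test_ratio, gap=0.0):
    """Return (start_of_train, end_of_train, start_of_test, end_of_test) per fold."""
    total_fold_size = int(n * fold_ratio)
    test_size = int(total_fold_size * (1 - train_test_ratio))
    gap_size = int(total_fold_size * gap)
    train_size = total_fold_size - test_size
    increment = (n - total_fold_size) // (n_splits - 1)

    # The first fold sits at the end of the data; later folds move backwards.
    end_of_test = n - 1
    start_of_test = end_of_test - test_size
    end_of_train = start_of_test - gap_size
    start_of_train = end_of_train - train_size

    windows = []
    for _ in range(n_splits):
        windows.append((start_of_train, end_of_train, start_of_test, end_of_test))
        start_of_train = max(start_of_train - increment, 0)
        end_of_train -= increment
        start_of_test -= increment
        end_of_test -= increment
    return windows


for window in rolling_windows(n=1000, n_splits=3, fold_ratio=0.5, train_test_ratio=0.75):
    print(window)
# (499, 874, 874, 999)
# (249, 624, 624, 749)
# (0, 374, 374, 499)
```

Setting `fold_ratio` to `1 / n_splits` gives the windows used by the `blocking` method, since `split` overrides `initial_fold_size_ratio` that way before calling `_rolling_split`.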

## Model Class

In [15]:
def create_mlp(num_continuous_features, num_categorical_features, embedding_dims, num_labels, hidden_units, dropout_rates, learning_rate,l2_strength=0.01):
    
    # Numerical variables input
    input_continuous = tf.keras.layers.Input(shape=(num_continuous_features,))
    
    # Categorical variables input
    input_categorical = [tf.keras.layers.Input(shape=(1,)) 
                         for _ in range(len(num_categorical_features))]
    
    # Embedding layer for categorical variables
    embeddings = [tf.keras.layers.Embedding(input_dim=num_categorical_features[i], 
                                            output_dim=embedding_dims[i], 
                                            embeddings_initializer='he_normal')(input_cat) 
                  for i, input_cat in enumerate(input_categorical)]
    flat_embeddings = [tf.keras.layers.Flatten()(embed) for embed in embeddings]
    
    # concat numerical and categorical
    concat_input = tf.keras.layers.concatenate([input_continuous] + flat_embeddings)
    
    # MLP
    x = tf.keras.layers.BatchNormalization()(concat_input)
    x = tf.keras.layers.Dropout(dropout_rates[0])(x)
    
    for i in range(len(hidden_units)): 
        x = tf.keras.layers.Dense(hidden_units[i],kernel_initializer='he_normal')(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.LeakyReLU()(x)
        #x = tf.keras.layers.Activation(tf.keras.activations.swish)(x)
        x = tf.keras.layers.Dropout(dropout_rates[i+1])(x)    
        
    #No activation
    out = tf.keras.layers.Dense(num_labels, kernel_initializer='he_normal')(x) 
    
    model = tf.keras.models.Model(inputs=[input_continuous] + input_categorical, outputs=out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss='mean_absolute_error',  
                  metrics=['mean_absolute_error'])  
    gc.collect()
    return model

## Main
## import data

In [16]:
# If a feature engineering function takes args, add it to dependencies
dependencies = {
    # "feature_version_alvin_2_1": ["feature_version_alvin_1", "feature_version_alvin_2_0"],
}

In [21]:
if config["train_mode"]:
    # Load the data
    df = pd.read_csv(f"{config['data_dir']}/train.csv")

    # Preprocess the data
    df = reduce_mem_usage(df)

    # Select the feature engineering functions to use
    feature_engineer = FeatureEngineer(df, feature_versions=['feature_version_yongmin_0'],
                                       dependencies=dependencies)
    
    feature_engineer.generate_global_features(df)
    
    df = feature_engineer.transform()  # After a first run with save=True, switch to transform(load=True) for later runs

    df = df.drop(['date_id_copy'], axis=1)

    # fillna
    df = (df.replace([np.inf, -np.inf], np.nan)
          .ffill()  # equivalent to fillna(method='ffill'), which newer pandas versions deprecate
          .fillna(0)
         )

    batch_size = 2048
    epochs = 50
    hidden_units = [512, 1024, 1024, 512]
    dropout_rates = [0.2, 0.2, 0.2, 0.2, 0.2]
    learning_rate = 1e-2
    embedding_dims = [25]
    
    # set scaler
    categorical = ["stock_id"]
    numerical = list(set(df.columns) - set(categorical) - set(['target']))
    num_categorical = [len(df[col].unique()) for col in categorical]

    scaler = MinMaxScaler()

    select_columns = list(set(numerical) - set(['dow', 'seconds', 'minute']))
    unselect_columns = ['dow', 'seconds', 'minute']
    
    preprocessor = ColumnTransformer(
        transformers = [
            ("scaler" , scaler, select_columns),
            ("pass", 'passthrough', unselect_columns)
        ]
    )
    
    pipe = Pipeline([
        ("scaler", preprocessor)
    ])

    
    ckp_path = os.path.join(config['model_dir'], 'my_nn_model_10epoch.h5')
    if not os.path.exists(config['model_dir']):
        os.mkdir(config['model_dir'])

    X_train = df.drop(['target'], axis=1)
    Y = df['target']
        
        
    X_tr_continuous = pipe.fit_transform(X_train[numerical])
    X_tr_categorical = X_train[categorical].values

    print("X_train_numerical shape:",X_tr_continuous.shape)
    print("X_train_categorical shape:",X_tr_categorical.shape)
    print("Y_train shape:",Y.shape)
    
    # create model
    model = create_mlp(len(numerical), num_categorical, embedding_dims, 1, hidden_units, dropout_rates, learning_rate)
    
    rlr = ReduceLROnPlateau(monitor='mean_absolute_error', factor=0.1, patience=3, verbose=0, min_delta=1e-4, mode='min')
    ckp = ModelCheckpoint(ckp_path, monitor='mean_absolute_error', verbose=0, save_best_only=True, save_weights_only=True, mode='min')
    es = EarlyStopping(monitor='mean_absolute_error', min_delta=1e-4, patience=7, mode='min', restore_best_weights=True, verbose=0)

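    # NOTE: rlr, ckp, and es are created above but are not passed to fit(), so they have no effect here.
    print("Fitting Model - No CV")
    model.fit((X_tr_continuous,X_tr_categorical[:, 0:1]), Y,
          epochs=epochs, batch_size=batch_size)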

    model.save_weights(ckp_path)

    pred = model.predict((X_tr_continuous,X_tr_categorical[:,0:1]), batch_size=batch_size).ravel()
    print("Train NN Score:", mean_absolute_error(Y, pred))

    K.clear_session()
    del model
    rubbish = gc.collect()


----------------------------------------------------------------------------------------------------
Executed generate_global_features, Elapsed time: 0.51 seconds
----------------------------------------------------------------------------------------------------


----------------------------------------------------------------------------------------------------
Executed feature_version_yongmin_0, Elapsed time: 1.45 seconds, shape((5237980, 7))
----------------------------------------------------------------------------------------------------


----------------------------------------------------------------------------------------------------
Executed execute_feature_versions, Elapsed time: 1.45 seconds, shape(No shape attribute)
----------------------------------------------------------------------------------------------------


----------------------------------------------------------------------------------------------------
Executed feature_selection, Elapsed time: 0.06 seco

### upload kaggle dataset

#### dataset init
! /home/username/.local/bin/kaggle datasets init -p {config['model_dir']}
#### dataset create 
! /home/username/.local/bin/kaggle datasets create -p {config['model_dir']}

In [25]:
if MODE == "train":
    ! /usr/local/bin/kaggle datasets init -p {config['model_dir']}
    import json

    with open(f"{config['model_dir']}/dataset-metadata.json", "r") as file:
        data = json.load(file)

    data["title"] = data["title"].replace("INSERT_TITLE_HERE", f"{KAGGLE_DATASET_NAME}")
    data["id"] = data["id"].replace("INSERT_SLUG_HERE", f"{KAGGLE_DATASET_NAME}")

    with open(f"{config['model_dir']}/dataset-metadata.json", "w") as file:
        json.dump(data, file, indent=2)

    ! /usr/local/bin/kaggle datasets create -p {config['model_dir']}

    # !/usr/local/bin/kaggle datasets version -p {config['model_dir']} -m 'Updated data'

Data package template written to: ./models/20231127_10:36:02/dataset-metadata.json
Starting upload for file my_nn_model_50epoch.h5
100%|██████████████████████████████████████| 8.22M/8.22M [00:02<00:00, 3.14MB/s]
Upload successful: my_nn_model_50epoch.h5 (8MB)
Skipping folder: .ipynb_checkpoints; use '--dir-mode' to upload folders
Your private Dataset is being created. Please check progress at https://www.kaggle.com/datasets/jhk3211/model-nn-version-yongmin-0


In [18]:
dependencies = {
    # "feature_version_alvin_2_1": ["feature_version_alvin_1", "feature_version_alvin_2_0"],
}
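
The `dependencies` dictionary maps a feature version to the feature versions it needs first. How `FeatureEngineer` consumes it is not shown in this part of the notebook, so the following is only a minimal sketch of one way such a mapping could be turned into an execution order. The helper name `resolve_feature_order` is hypothetical and not part of the source.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def resolve_feature_order(feature_versions, dependencies):
    """Return feature versions ordered so that dependencies come first.

    Hypothetical helper: assumes `dependencies` maps a feature version
    name to the list of feature version names it requires.
    """
    sorter = TopologicalSorter()
    for name in feature_versions:
        # Register the node together with its prerequisites (possibly none).
        sorter.add(name, *dependencies.get(name, []))
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(sorter.static_order())


example_dependencies = {
    "feature_version_alvin_2_1": ["feature_version_alvin_1", "feature_version_alvin_2_0"],
}
order = resolve_feature_order(["feature_version_alvin_2_1"], example_dependencies)
print(order)
```

With an empty `dependencies` dictionary, as in the cell above, the function simply returns the requested feature versions.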

In [None]:
if config["infer_mode"]:

    model_list = []
    print("Loading Models...")
    final_model = create_mlp(len(numerical), num_categorical, embedding_dims, 1, hidden_units, dropout_rates, learning_rate)

    model_path = '/kaggle/input/optiv-try-an-simple-neural-network-with-keras/NN_Models/my_nn_model_10epoch.h5'
    final_model.load_weights(model_path)
    model_list.append(final_model)

In [None]:
if config["infer_mode"]:
    import optiver2023

    env = optiver2023.make_env()
    iter_test = env.iter_test()

    y_min, y_max = -64, 64
    qps = []
    counter = 0
    cache = pd.DataFrame()

    # set scaler
    scaler = MinMaxScaler()

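    # Time-like columns are passed through unscaled.
    # `numerical` must already be defined by an earlier cell at this point.
    unselect_columns = ['dow', 'seconds', 'minute']
    select_columns = list(set(numerical) - set(unselect_columns))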
    
    preprocessor = ColumnTransformer(
        transformers = [
            ("scaler" , scaler, select_columns),
            ("pass", 'passthrough', unselect_columns)
        ]
    )
    
    pipe = Pipeline([
        ("scaler", preprocessor)
    ])

    df = pd.read_csv(f"{config['data_dir']}/train.csv")
    
    feature_engineer = FeatureEngineer(df)
    feature_engineer.generate_global_features(df)

    df = feature_engineer.transform()

    # fillna
    df = (df.replace([np.inf, -np.inf], np.nan)
          .ffill()  # fillna(method='ffill') is deprecated in recent pandas
          .fillna(0)
         )

    categorical = ["stock_id"]
    numerical = list(set(df.columns) - set(categorical) - set(['target']))
    num_categorical = [len(df[col].unique()) for col in categorical]

    pipe.fit(df[numerical])
    
    for (test, revealed_targets, sample_prediction) in iter_test:
        
        now_time = time.time()
        cache = pd.concat([cache, test], ignore_index=True, axis=0)
    
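        # Keep the 21 most recent rows per stock. This also runs on the first
        # iteration so that cache_df is always defined.
        cache_df = cache.groupby(['stock_id']).tail(21).sort_values(
                                 by=['date_id', 'seconds_in_bucket', 'stock_id']).reset_index(drop=True)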
        
        # feature engineering
        feature_engineer = FeatureEngineer(cache_df, infer=True, feature_versions=['feature_version_yongmin_0'],
                                           dependencies=dependencies)
        
        cache_df = feature_engineer.transform()
        
        feat = cache_df[-len(test):]

        X_num = pipe.transform(feat[numerical])
        X_cat = feat[categorical].values
        
        print("X_num shape:", X_num.shape)
        print("X_cat shape:", X_cat.shape)
        
        # feat = generate_all_features(cache)[-len(test):]
        test_predss = np.zeros(feat.shape[0])

        
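        # prediction: the model takes the continuous block and the single
        # categorical column (stock_id), matching the inputs used in training
        for i in range(config["inference_n_splits"]):
            inference_prediction = final_model.predict((X_num, X_cat[:, 0:1])).ravel()
            
            test_predss += inference_prediction / config["inference_n_splits"]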
            
        test_predss = zero_sum(test_predss, test['bid_size'] + test['ask_size'])
        
        clipped_predictions = np.clip(test_predss, y_min, y_max)
        sample_prediction['target'] = clipped_predictions
        
        env.predict(sample_prediction)
        counter += 1
        qps.append(time.time() - now_time)
        
        if counter % 10 == 0:
            print(counter, 'qps:', np.mean(qps))

    time_cost = 1.146 * np.mean(qps)
    print(f"Inference will take approximately {np.round(time_cost, 4)} hours")

NameError: name 'df' is not defined
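
`zero_sum` is called in the inference loop above but is not defined in this part of the notebook. The following is a minimal sketch of one plausible definition. It assumes the function shifts the predictions so that they sum to zero and spreads the adjustment across stocks in proportion to the square root of their volume. The argument names `prices` and `volumes` are illustrative and not taken from the source.

```python
import numpy as np


def zero_sum(prices, volumes):
    """Shift `prices` so they sum to zero, weighting the shift by sqrt(volume)."""
    prices = np.asarray(prices, dtype=float)
    # Assumption: each stock's weight is the square root of its volume.
    std_error = np.sqrt(np.asarray(volumes, dtype=float))
    # Total amount to remove, expressed per unit of weight.
    step = prices.sum() / std_error.sum()
    return prices - std_error * step


adjusted = zero_sum([1.0, 2.0, 3.0], [4.0, 9.0, 16.0])
print(adjusted, adjusted.sum())
```

This sketch does not guard against a batch in which every volume is zero, which would divide by zero.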

In [None]:
# single 1fold final / final no
# single 1fold final / final
# single 5fold final / final no
# single 5fold final / final
# stacking 1fold final / final no
# stacking 1fold final / final
# stacking 5fold final / final no
# stacking 5fold final / final