# Trading at the Close - Inference
-----------------------
-----------------------


This notebook is intended to be run after the train notebook. It takes the artifacts generated by the hyperparameter search and produces the final predictions on the public leaderboard.

## Installs
------------

In [1]:
%pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.


## Imports

In [3]:
import sys
import numpy as np
from pathlib import Path
from loguru import logger # for nice colored logging
from pprint import pformat
import pandas as pd
import json
from sklearn.metrics import mean_absolute_error
from lightgbm import LGBMRegressor
import lightgbm as lgb
import matplotlib.pyplot as plt
import seaborn as sns
from timeit import default_timer as timer
from IPython.display import clear_output

sns.set_style("ticks")

In [4]:
class CFG:
    JOBS_PATH = Path(".", "job_artifacts")
    JOB_NAME = "optiver-inference_lgbmregressor"
    TEST_PATH = Path(".", "example_test_files", "test.csv")
    FEATURES_PATH = JOBS_PATH.joinpath("optiver-feature_selection-0002")
    FEATURES_NAME = "feature_names.json"
    MODEL_PATH = JOBS_PATH.joinpath("optiver-tuning_lgbmregressor-0006")

Create the artifacts folder tree incrementally. Each run will correspond to a different folder.

In [5]:
CFG.JOBS_PATH.mkdir(exist_ok=True, parents=True)

for i in range(1, 10000):
    CFG.JOB_PATH = CFG.JOBS_PATH.joinpath(CFG.JOB_NAME + "-" +  str(i).zfill(4))
    try:
        CFG.JOB_PATH.mkdir()
        break
    except:
        continue

# Data Loading

In the next cell, I define code to load the data and to generate features with optional memory reduction. I write these two functions in a file called `preprocess.py` with the cell magi `%%writefile`, so that I can import it with `from preprocess import load_data`. With this, my other notebooks can do the same thing without having to redefine these functions. In case I make changes, all notebooks automatically use the same updated version.

In [8]:
from utils.featurizers import featurize
from utils.files import read_json

df = pd.read_csv(CFG.TEST_PATH)
selected_features = read_json(CFG.FEATURES_PATH.joinpath(CFG.FEATURES_NAME))["selected_features"]

# Get features
featurize(df, selected_features)

X = df.copy()

del df

[32m2023-10-12 18:35:59.273[0m | [1mINFO    [0m | [36mutils.featurizers[0m:[36mfeaturize[0m:[36m16[0m - [1mCreating additional features...[0m
[32m2023-10-12 18:35:59.791[0m | [1mINFO    [0m | [36mutils.featurizers[0m:[36mfeaturize[0m:[36m64[0m - [1mDropping unnecesary features...[0m
[32m2023-10-12 18:35:59.796[0m | [1mINFO    [0m | [36mutils.featurizers[0m:[36mfeaturize[0m:[36m68[0m - [1mReducing data memory footprint...[0m
[32m2023-10-12 18:35:59.799[0m | [1mINFO    [0m | [36mutils.compression[0m:[36mdowncast[0m:[36m11[0m - [1mMemory usage of dataframe is 3.78 MB[0m
[32m2023-10-12 18:35:59.828[0m | [1mINFO    [0m | [36mutils.compression[0m:[36mdowncast[0m:[36m20[0m - [1mMemory usage after optimization is: 2.36 MB[0m
[32m2023-10-12 18:35:59.829[0m | [1mINFO    [0m | [36mutils.compression[0m:[36mdowncast[0m:[36m22[0m - [1mDecreased by 37.50%[0m


## Inference
--------------

## Setup the evaluation process

We have two options:
1. Use a simple train-test split for evaluation
2. Use a Cross-Validation with TimeSeriesSplit for more robust evaluation

The second option is much more expensive to tune, but it yields more robust estimations of mean absolute error.
We define functions for both and then try them out!

### Evaluation via Cross Validation with TimeSeriesSplit

In [9]:
def zero_sum(prices, volumes):
    std_error = np.sqrt(volumes)
    step = np.sum(prices)/np.sum(std_error)
    out = prices - std_error * step

    return out

In [10]:
import optiver2023
from utils.files import load_model

env = optiver2023.make_env()
iter_test = env.iter_test()
counter = 0
predictions = []

models_path = CFG.MODEL_PATH.glob("**/*.pkl")
models = [load_model(path) for path in models_path]

for (test, revealed_targets, sample_prediction) in iter_test:
    feat = featurize(test)

    # Mean ensemble
    fold_prediction = 0
    for model in models:
        model_prediction += model.predict(feat)   
    fold_prediction /= N_Folds

    fold_prediction = zero_sum(fold_prediction, test.loc[:, "bid_size"] + test.loc[:, "ask_size"])
    clipped_predictions = np.clip(fold_prediction, y_min, y_max)
    sample_prediction["target"] = clipped_predictions
    env.predict(sample_prediction)
    counter += 1

SyntaxError: invalid syntax (2361610516.py, line 10)

## Optimization specifics

We define a convenience function `run_optimization`, that starts the optimization process with sane defaults, given a objective function.

The objective function is returned by the `get_objective_function` method, which configures the logging and evaluation process ("simple" or "cross_validate").
Inside this method, the parameter space is defined using the optuna `trial` object.