<div style="padding:20px;color:#76B900;margin:0;font-size:240%;text-align:center;display:fill;border-radius:5px;background-color:white;overflow:hidden;font-weight:600">JPX Tokyo Stock Exchange Prediction with NVIDIA-TSPP</div>

This notebook shows how to use the NVIDIA Time Series Prediction Platform (NVIDIA-TSPP).

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# <div style="padding:20px;color:#76B900;margin:0;font-size:100%;text-align:left;display:fill;border-radius:5px;background-color:#5E5E5E;overflow:hidden">1 | About NVIDIA-TSPP</div>
<br>
NVIDIA's Time Series Prediction Platform (NVIDIA-TSPP) enables users to mix and match time-series datasets and models. The user has complete control over the settings listed below and can compare the results obtained from different solutions side by side.


![TSPP Features](https://i.imgur.com/YlwY9Td.png)


1. NVIDIA's Time Series Prediction Platform ([NVIDIA-TSPP](https://github.com/NVIDIA/DeepLearningExamples/tree/master/Tools/PyTorch/TimeSeriesPredictionPlatform)) attempts to unify all time-series models under one evaluation framework <br>
2. Enables users to mix and match time-series datasets and models <br>
3. Built on the PyTorch framework <br>
4. Modularized components (dataset, model, loss, evaluation metrics, etc.) <br>
5. Allows for common point(s) of comparison <br>
6. Makes comparisons low-effort <br>
7. Easily swap in your own dataset to compare portfolio models’ accuracy <br>
8. Easily swap in your model to compare its accuracy across datasets <br>
9. Supports Triton-based deployment <br>
10. Enables quick experimentation with multi-GPU and AMP <br>
11. Optimized for Command-Line Interface with [hydra](https://hydra.cc/) <br>
<br>
The NVIDIA-TSPP uses Hydra for experiment configuration. Hydra makes it easy to select models, datasets, and training and evaluation parameters.<br>
**NOTE: The NVIDIA-TSPP code in this notebook is a development version. An official update will soon be available on the [NVIDIA Deep Learning Examples GitHub](https://github.com/NVIDIA/DeepLearningExamples/tree/master/Tools/PyTorch/TimeSeriesPredictionPlatform)**

# <div style="padding:20px;color:#76B900;margin:0;font-size:100%;text-align:left;display:fill;border-radius:5px;background-color:#5E5E5E;overflow:hidden">2 | Setup and Installations</div>


This section covers the installation, setup, and imports. Because this is a code competition, we uploaded the NVIDIA-TSPP code as a dataset. We also uploaded all required packages as a dataset to allow an offline pip install. The installation code is hidden for the sake of clarity, but you can have a look if you are interested.

# <div style="padding:10px;color:#76B900;margin:0;font-size:60%;text-align:left;display:fill;border-radius:5px;background-color:#5E5E5E;overflow:hidden">Installing NVIDIA-TSPP's libraries - Offline</div>

Install libraries required by NVIDIA-TSPP

In [None]:
!ls ../input/nvidia-tspp-libraries/offline_whls/

In [None]:
!pip install --no-index ../input/nvidia-tspp-libraries/offline_whls/hydra_core-1.1.1-py3-none-any.whl --find-links=../input/nvidia-tspp-libraries/offline_whls/
!pip install --no-index ../input/nvidia-tspp-libraries/offline_whls/patool-1.12-py2.py3-none-any.whl --find-links=../input/nvidia-tspp-libraries/offline_whls/
!pip install --no-index ../input/nvidia-tspp-libraries/offline_whls/optuna_dashboard-0.6.4-py3-none-any.whl --find-links=../input/nvidia-tspp-libraries/offline_whls/
!pip install --no-index ../input/nvidia-tspp-libraries/offline_whls/hydra_optuna_sweeper-1.1.2-py3-none-any.whl --find-links=../input/nvidia-tspp-libraries/offline_whls/
!pip install --no-index ../input/nvidia-tspp-libraries/offline_whls/hydra_joblib_launcher-1.1.5-py3-none-any.whl --find-links=../input/nvidia-tspp-libraries/offline_whls/
!pip install --no-index ../input/nvidia-tspp-libraries/offline_whls/nvidia-pyindex-1.0.9.tar.gz --find-links=../input/nvidia-tspp-libraries/offline_whls/
!pip install --no-index ../input/nvidia-tspp-libraries/offline_whls/dgl-0.6.1-cp37-cp37m-manylinux1_x86_64.whl --find-links=../input/nvidia-tspp-libraries/offline_whls/
!pip install --no-index ../input/nvidia-tspp-libraries/offline_whls/dask_cuda-22.2.0-py3-none-any.whl --find-links=../input/nvidia-tspp-libraries/offline_whls/
!pip install --no-index ../input/nvidia-tspp-libraries/offline_whls/DLLogger-1.0.0.zip --find-links=../input/nvidia-tspp-libraries/offline_whls/
!pip install --no-index ../input/nvidia-tspp-libraries/offline_whls/dtw_python-1.1.12-cp37-cp37m-manylinux2010_x86_64.whl --find-links=../input/nvidia-tspp-libraries/offline_whls/
!pip install --no-index ../input/nvidia-tspp-libraries/offline_whls/mlflow-1.26.1-py3-none-any.whl --find-links=../input/nvidia-tspp-libraries/offline_whls/
!pip install --no-index ../input/nvidia-tspp-libraries/offline_whls/pmdarima-1.8.0-cp37-cp37m-manylinux1_x86_64.whl --find-links=../input/nvidia-tspp-libraries/offline_whls/

Adding NVIDIA-TSPP code to the search path

In [None]:
import sys
tspp_ws = "/kaggle/input/nvidia-tspp-code/nvidia_tspp_code"
sys.path.insert(1, tspp_ws)
sys.path

Imports

In [None]:
import os
from decimal import ROUND_HALF_UP, Decimal
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import hydra
import warnings
import torch
from omegaconf import OmegaConf
import pickle
from hydra.utils import get_original_cwd
import conf.conf_utils
from hydra_utils import get_config
from data.data_utils import Preprocessor
from training.utils import set_seed

curr_workdir = globals()['_dh'][0]
config_path = "./conf"
warnings.filterwarnings("ignore")

# <div style="padding:20px;color:#76B900;margin:0;font-size:100%;text-align:left;display:fill;border-radius:5px;background-color:#5E5E5E;overflow:hidden">3 | Feature Engineering</div>

In this section, we will review the dataset and create features for training. Our features are various aggregates on lag variables. We provide these as examples, and these could certainly be modified and improved.

Load stock_prices training dataset

In [None]:
dataset_path = "/kaggle/input/jpx-tokyo-stock-exchange-prediction"
train_sp_file = "train_files/stock_prices.csv"
train_sp_file = os.path.join(dataset_path, train_sp_file)

In [None]:
df = pd.read_csv(train_sp_file)
print(df.head())

The cell below checks whether all 2000 stocks are available on every date

In [None]:
%%script false --no-raise-error
idcount = df.groupby("Date")["SecuritiesCode"].count().reset_index()
plt.plot(idcount["Date"],idcount["SecuritiesCode"])
idcount.loc[idcount["SecuritiesCode"]==2000,:]

We will update the prices using the AdjustmentFactor value. This should reduce the gaps in historical prices caused by splits and reverse splits. <br>
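
As a worked illustration of the adjustment, the sketch below mirrors the reverse cumulative product that `adjust_price` in the next cell applies per stock, using plain Python lists. The prices and factors are hypothetical, and the sketch assumes that a factor sits on the last pre-split row, which is how the cumulative product in the notebook treats it.

```python
from decimal import ROUND_HALF_UP, Decimal


def adjusted_close(closes, factors):
    """Reverse cumulative product of factors applied to closes for one stock.

    Both lists are in chronological order (oldest first).
    """
    adjusted = [0.0] * len(closes)
    cumulative = 1.0
    # Walk from the newest row to the oldest, like the descending sort does.
    for i in range(len(closes) - 1, -1, -1):
        cumulative *= factors[i]
        value = Decimal(str(cumulative * closes[i]))
        # Round half up to one decimal place, as in the notebook.
        adjusted[i] = float(value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP))
    return adjusted


# Hypothetical data: a 2-for-1 split, recorded as factor 0.5 on the second row.
closes = [1000.0, 1010.0, 505.0, 510.0]
factors = [1.0, 0.5, 1.0, 1.0]
adjusted = adjusted_close(closes, factors)
print(adjusted)
```

After the adjustment, the pre-split prices are on the same scale as the post-split ones, so the artificial 50% drop no longer appears in the series.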


In [None]:
def adjust_price(price):
    """
    Ref: https://www.kaggle.com/code/smeitoma/train-demo#Generating-AdjustedClose-price
    Args:
        price (pd.DataFrame)  : pd.DataFrame include stock_price
    Returns:
        price DataFrame (pd.DataFrame): stock_price with generated Adjusted Prices
    """

    def generate_adjusted_price(df):
        """
        Args:
            df (pd.DataFrame)  : stock_price for a single SecuritiesCode
        Returns:
            df (pd.DataFrame): stock_price with Adjusted Price for a single SecuritiesCode
        """
        # sort data to generate CumulativeAdjustmentFactor
        df = df.sort_values("Date", ascending=False)
        # generate CumulativeAdjustmentFactor
        df.loc[:, "CumulativeAdjustmentFactor"] = df["AdjustmentFactor"].cumprod()
        # generate AdjustedClose
        df.loc[:, "AdjustedClose"] = (
            df["CumulativeAdjustmentFactor"] * df["Close"]
        ).map(lambda x: float(
            Decimal(str(x)).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP)
        ))
        df.loc[:, "AdjustedOpen"] = (
            df["CumulativeAdjustmentFactor"] * df["Open"]
        ).map(lambda x: float(
            Decimal(str(x)).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP)
        ))
        df.loc[:, "AdjustedLow"] = (
            df["CumulativeAdjustmentFactor"] * df["Low"]
        ).map(lambda x: float(
            Decimal(str(x)).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP)
        ))
        df.loc[:, "AdjustedHigh"] = (
            df["CumulativeAdjustmentFactor"] * df["High"]
        ).map(lambda x: float(
            Decimal(str(x)).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP)
        ))
        # reverse order
        df = df.sort_values("Date")
        # to fill AdjustedClose, replace 0 into np.nan
        df.loc[df["AdjustedClose"] == 0, "AdjustedClose"] = np.nan
        # forward fill AdjustedClose
        df.loc[:, "AdjustedClose"] = df.loc[:, "AdjustedClose"].ffill()

        df.loc[df["AdjustedOpen"] == 0, "AdjustedOpen"] = np.nan
        df.loc[:, "AdjustedOpen"] = df.loc[:, "AdjustedOpen"].ffill()

        df.loc[df["AdjustedLow"] == 0, "AdjustedLow"] = np.nan
        df.loc[:, "AdjustedLow"] = df.loc[:, "AdjustedLow"].ffill()
        
        df.loc[df["AdjustedHigh"] == 0, "AdjustedHigh"] = np.nan
        df.loc[:, "AdjustedHigh"] = df.loc[:, "AdjustedHigh"].ffill()
        
        return df

    # generate AdjustedClose
    price = price.sort_values(["SecuritiesCode", "Date"])
    price = price.groupby("SecuritiesCode").apply(generate_adjusted_price).reset_index(drop=True)

    return price

We will create a new set of features from the adjusted close price: mean, std, min, and max, each over rolling windows of 5 and 20 trading days (roughly one and four weeks). Rows that do not yet have a full window are filled with 0.
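
To make the warm-up behaviour of these rolling features concrete, here is a minimal dependency-free sketch of a trailing mean that returns 0 until a full window is available, which is what `rolling(window=...).mean().fillna(0)` produces with pandas' default settings. The function name and the prices are made up for illustration.

```python
from statistics import mean


def rolling_mean_zero_filled(values, window):
    """Trailing mean over `window` rows; 0.0 until a full window exists."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(0.0)  # not enough history yet
        else:
            out.append(mean(values[i + 1 - window : i + 1]))
    return out


prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]  # hypothetical adjusted closes
rolled = rolling_mean_zero_filled(prices, window=5)
print(rolled)
```

The zero-filled warm-up rows are far from real price levels, so it may be worth experimenting with dropping them or filling them differently.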

In [None]:
def create_features(price):
    """
    Args:
        price (pd.DataFrame)  : pd.DataFrame include stock_price
    Returns:
        price DataFrame (pd.DataFrame): stock_price with new generated features
    """

    def generate_features_single_stock(df):
        """
        Args:
            df (pd.DataFrame)  : stock_price for a single SecuritiesCode
        Returns:
            df (pd.DataFrame): stock_price with new features for a single SecuritiesCode
        """
        
        df['Close_1week_mean'] = df['AdjustedClose'].rolling(window = 5).mean().fillna(0)
        df['Close_4weeks_mean'] = df['AdjustedClose'].rolling(window = 20).mean().fillna(0)
        df['Close_1week_std'] = df['AdjustedClose'].rolling(window = 5).std().fillna(0)
        df['Close_4weeks_std'] = df['AdjustedClose'].rolling(window = 20).std().fillna(0)
        df['Close_1week_min'] = df['AdjustedClose'].rolling(window = 5).min().fillna(0)
        df['Close_4weeks_min'] = df['AdjustedClose'].rolling(window = 20).min().fillna(0)
        df['Close_1week_max'] = df['AdjustedClose'].rolling(window = 5).max().fillna(0)
        df['Close_4weeks_max'] = df['AdjustedClose'].rolling(window = 20).max().fillna(0)
        return df

    price = price.sort_values(["SecuritiesCode", "Date"])
    price = price.groupby("SecuritiesCode").apply(generate_features_single_stock).reset_index(drop=True)

    return price

The NVIDIA-TSPP provides download functions for a number of datasets. For some datasets, it supports an end-to-end flow, from automatic download all the way to creating the train, valid and test binaries (for example: [electricity dataset](http://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014)). For other datasets, it expects users to download the data themselves, and the NVIDIA-TSPP takes care of the remaining steps. <br>
For downloading and cleaning up the datasets, the NVIDIA-TSPP provides the script: <code>{tspp_ws}/data/script_download_data.py --dataset {dataset_name} --output_dir {output_dir_path} </code> <br>
This script writes its output to the directory: <code>{output_dir}/{dataset}</code> <br>
Here, since we already have the dataset, we define a function that creates the dataset features for NVIDIA-TSPP's preprocessing.
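
One detail of `process_raw_data` in the next cell is the continuous timeline: `Timestep` is the number of calendar days since the earliest timestamp. The sketch below reproduces that arithmetic with the standard library only; the helper name and the dates are illustrative.

```python
import datetime


def days_since_start(dates, min_date=None):
    """Calendar days elapsed since min_date (or the earliest date)."""
    start = min_date if min_date is not None else min(dates)
    return [(d - start) / datetime.timedelta(days=1) for d in dates]


# Hypothetical trading days: Mon, Tue, Fri, and the following Mon.
dates = [
    datetime.date(2021, 1, 4),
    datetime.date(2021, 1, 5),
    datetime.date(2021, 1, 8),
    datetime.date(2021, 1, 11),
]
steps = days_since_start(dates)
print(steps)
```

Because weekends and holidays are skipped in the data, the resulting timesteps are not consecutive integers. This is why the split code later indexes into the array of unique timesteps instead of assuming a step of one.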


In [None]:
def process_raw_data(df, min_timestamp=None):
    df["Date"] = pd.to_datetime(df["Date"])
    df['DayOfWeek'] = df['Date'].apply(lambda x: x.dayofweek)
    df['Month'] = df['Date'].apply(lambda x: x.month)
    df['Day'] = df['Date'].apply(lambda x: x.day)
    df['Hour'] = df['Date'].apply(lambda x: x.hour)
    df['Minute'] = df['Date'].apply(lambda x: x.minute)
    df['Timestamp'] = df['Date']
    df['id'] = df["SecuritiesCode"]
    df['Weight'] = 1
    df = adjust_price(df)
    df = create_features(df)
    df['DayOfYear'] = df['Date'].apply(lambda x: x.dayofyear)
    df['Year'] = df['Date'].apply(lambda x: x.year)
    df['IsMonday'] = (df['DayOfWeek'] == 0).astype(int)
    # Add continuous timeline
    if min_timestamp:
        min_timestamp = pd.to_datetime(min_timestamp)
    else:
        min_timestamp = min(df['Timestamp'])
    df['Timestep'] = df['Timestamp'].apply(lambda x: (x-min_timestamp)/datetime.timedelta(days=1))
    df['Timestep_id'] = df['Timestep']
    df = df.set_index('Date')
    return df

Create output directory for dataset

In [None]:
out_dataset_dir = "/kaggle/working/dataset"
os.makedirs(out_dataset_dir, exist_ok=True)

Note that we can model missing stocks too, because NVIDIA-TSPP maintains a separate scaler per stock id. We went for simplicity here.

Select stock prices for dates from 2021-01-01 onwards and save the updated features dataframe

In [None]:
train_df = df.loc[df["Date"]>= "2021-01-01"].reset_index(drop=True)
train_df = process_raw_data(train_df)
train_df.to_csv(os.path.join(out_dataset_dir, 'filtered_stock_prices.csv'))

# <div style="padding:20px;color:#76B900;margin:0;font-size:100%;text-align:left;display:fill;border-radius:5px;background-color:#5E5E5E;overflow:hidden">4 | Dataset Preprocessing</div>


In this section, we will preprocess the dataset for NVIDIA-TSPP.

# <div style="padding:10px;color:#76B900;margin:0;font-size:60%;text-align:left;display:fill;border-radius:5px;background-color:#5E5E5E;overflow:hidden">NVIDIA-TSPP dataset parameters</div>

![TSPP Features](https://i.imgur.com/3jRx1Ad.png)

As NVIDIA-TSPP works with YAML-based configuration files, we need to create a configuration file for the competition dataset. <br>
The cell below prints the configuration file for the electricity dataset.

In [None]:
with open(os.path.join(tspp_ws, "conf/dataset/electricity.yaml"), "rb") as f:
    elec_config = OmegaConf.load(f)
    print("\nDataset Config\n", OmegaConf.to_yaml(elec_config))

Some of the relevant fields are described below:


1. <code>config.source_path</code>: path to the source CSV file that contains the dataset
2. <code>config.dest_path</code>: output directory path for NVIDIA-TSPP's preprocessing
3. <code>config.time_ids</code>: The name of the column within the source CSV that is used to split the data into training, validation, and test sets. The ranges are set with <code>config.train_range</code>, <code>config.valid_range</code>, and <code>config.test_range</code>; in the example above, the train range is \[0, 1315). The subsets can overlap, since predicting the first ‘unseen element’ requires the seen elements before it as input.
4. <code>config.dataset_stride</code>: Defines how far apart in time consecutive examples are in the dataset. When set to 1, neighboring examples differ by only one time step
5. <code>config.scale_per_id</code>: If true, preprocessing uses different scaling factors and biases for each of the time-series in the dataset
6. <code>config.encoder_length</code>: The length of data known up until the ‘present’ for a single sample
7. <code>config.example_length</code>: The length of all data, including data known into the future, for a single sample. The targets to predict occupy the last example_length - encoder_length steps of the sample.
8. <code>config.input_length</code>: same as encoder_length
9. <code>config.features</code>: We will go over features and their properties in a bit more detail in the next cell.
10. <code>config.train_samples</code>, <code>config.valid_samples</code>: Randomly subsample train and valid splits to reduce amount of correlated data fed to the model during a training epoch
11. <code>config.binarized</code>: Store train, validation and test splits in binarized versions to improve the data-loading speed (by default csv files are created)
12. <code>config.time_series_count</code>: The number of unique time-series contained in the dataset
13. <code>config.graph</code>: This option is relevant for models that can use graph-related information, for example time-series forecasting with GNNs
14. <code>config.MultiID</code>: option is relevant for models that can use multiple time-series as inputs or predict multiple time-series in a single step 
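
The interplay of <code>encoder_length</code>, <code>example_length</code>, and <code>dataset_stride</code> can be sketched as a sliding window over one series. This is only an illustration of the concept under our reading of the fields above, not the NVIDIA-TSPP dataset implementation; `make_examples` is a hypothetical helper.

```python
def make_examples(series, encoder_length, example_length, stride=1):
    """Slice a series into (history, future) pairs with a sliding window."""
    examples = []
    for start in range(0, len(series) - example_length + 1, stride):
        window = series[start : start + example_length]
        # The first encoder_length steps are the known past,
        # the remaining steps are what the model has to predict.
        examples.append((window[:encoder_length], window[encoder_length:]))
    return examples


series = list(range(15))  # hypothetical values for a single stock
examples = make_examples(series, encoder_length=10, example_length=12)
print(len(examples))
print(examples[0])
```

With the values used later in this notebook (an encoder length of 10 and an example length of 12), each example predicts 2 steps from 10 steps of history.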

Feature Specification in Dataset configuration files: <br>


The NVIDIA-TSPP requires a feature list for each dataset that the models take as input. Each feature should be represented by an object containing descriptive attributes based on the supported types. These are used for mapping input dataset columns to types. Different types of features are treated differently during preprocessing and model runtime. <br><br>Each feature should have at least: <br>
<code>config.features.feature_type</code>: OPTIONS(**ID, KNOWN, STATIC, OBSERVED, TARGET, WEIGHT, TIME**) <br><br>
**ID**: Unique Identifier representing a time-series node/channel/sensor. Example: Securities Code <br>
**KNOWN**: Features known in advance for both history and future. Example: Hour, day of week, etc <br>
**STATIC**: Constant feature throughout a single time series. Example: physical location of a sensor <br>
**OBSERVED**: Features known only for historical data <br>
**TARGET**: Target value to predict. Historical target values are used as model input; future target values are used for calculating the loss. Example: Power Usage, Stock Closing Price <br>
**WEIGHT**: Some of the target values could be missing from the dataset for some timesteps. This can be used to mask those entries during loss calculation. <br>
**TIME**: Unique timestamp to create the time-series. Example: Hours from Start, Time Steps <br>
<br>and <br>
<code>config.features.feature_embed_type</code>: OPTIONS(**CATEGORICAL, CONTINUOUS**) <br><br>
**CATEGORICAL**: Features that usually have a fixed number of possible values. For some models, they can be passed through trainable categorical embedding tables. Categorical columns should have a cardinality attribute, set with <code>config.features.cardinality</code>, that represents the number of unique values the feature takes. Cardinality is 1 + the number of categories, where the extra category supports NaNs. <br>
**CONTINUOUS**: Represents real-valued features.  Continuous features may have a scaler attribute that represents the type of scaler used in preprocessing using <code>config.features.scaler</code> <br>
<br>
<b>Required features are one TIME feature, at least one ID feature, one TARGET feature, and at least one KNOWN, OBSERVED, or STATIC feature. </b><br>
<br>
A single example is a dictionary of features grouped into the categories specified above. The Dataset class provides a utility to iterate over a monolithic CSV (or its binarized version), creating examples on the fly. The NVIDIA-TSPP assumes that a model doesn't need any more information than the type of a group and the underlying feature tensor.
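
Before writing a feature list into a configuration file, it can help to check it against the rules above. The following is a small hypothetical checker, not part of NVIDIA-TSPP; it only tests that the required feature types are present and that categorical features carry a cardinality.

```python
from collections import Counter


def check_feature_spec(features):
    """Return a list of problems found in a feature list (empty if none)."""
    counts = Counter(feat["feature_type"] for feat in features)
    problems = []
    for required in ("TIME", "ID", "TARGET"):
        if counts[required] == 0:
            problems.append(f"missing {required} feature")
    if counts["KNOWN"] + counts["OBSERVED"] + counts["STATIC"] == 0:
        problems.append("need at least one KNOWN, OBSERVED, or STATIC feature")
    for feat in features:
        if feat["feature_embed_type"] == "CATEGORICAL" and "cardinality" not in feat:
            problems.append(f"{feat['name']}: CATEGORICAL feature without cardinality")
    return problems


# Hypothetical spec with one deliberate mistake (DayOfWeek lacks a cardinality).
spec = [
    {"name": "id", "feature_type": "ID", "feature_embed_type": "CATEGORICAL", "cardinality": 2001},
    {"name": "Timestep", "feature_type": "TIME", "feature_embed_type": "CONTINUOUS"},
    {"name": "AdjustedClose", "feature_type": "TARGET", "feature_embed_type": "CONTINUOUS"},
    {"name": "DayOfWeek", "feature_type": "KNOWN", "feature_embed_type": "CATEGORICAL"},
]
problems = check_feature_spec(spec)
print(problems)
```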

# <div style="padding:10px;color:#76B900;margin:0;font-size:60%;text-align:left;display:fill;border-radius:5px;background-color:#5E5E5E;overflow:hidden">Create a new dataset configuration file</div>

In the cell below, we will start with the electricity dataset configuration that is already available within NVIDIA-TSPP's code and make changes to it for this competition's dataset <br>
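
The train/valid/test ranges computed in the next cell follow an 80/10/10 split over the unique timesteps, with the validation and test ranges starting <code>encoder_length</code> steps early so that their first targets have history available. A stripped-down sketch of that arithmetic, using a hypothetical contiguous timeline and a made-up helper name:

```python
def split_ranges(timesteps, encoder_length, train_frac=0.8, test_frac=0.1):
    """Half-open [start, end) ranges; valid/test reach back by encoder_length."""
    n = len(timesteps)
    num_train = round(n * train_frac)
    num_test = round(n * test_frac)
    num_val = n - num_train - num_test
    train = (timesteps[0], timesteps[num_train])
    valid = (timesteps[num_train - encoder_length], timesteps[num_train + num_val])
    test = (timesteps[num_train + num_val - encoder_length], timesteps[-1] + 1)
    return train, valid, test


timesteps = list(range(100))  # hypothetical: 100 consecutive timesteps
train_rng, valid_rng, test_rng = split_ranges(timesteps, encoder_length=10)
print(train_rng, valid_rng, test_rng)
```

The overlap only covers model inputs: the validation targets start where the training range ends, and the test targets start where the validation range ends.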

In [None]:
# CREATE DATASET CONFIG HERE
config = None
out_conf_dir = "/kaggle/working/custom_conf/dataset"
os.makedirs(out_conf_dir, exist_ok=True)
with open(os.path.join(tspp_ws, "conf/dataset/electricity.yaml"), "rb") as f:
    config = OmegaConf.load(f)
#print(config)
config.config.source_path = os.path.join(out_dataset_dir, 'filtered_stock_prices.csv')
config.config.dest_path = out_dataset_dir
config.config.time_ids = 'Timestep_id'

config.config.encoder_length = 10
config.config.example_length = 10+2
config.config.input_length = 10

# This feature is used by the xgboost model only: it is similar to setting an encoder_length of 10 for xgboost
config.config.lag_features=[{'name': 'AdjustedClose', 'min_value': 1, 'max_value': 10}]

# Train-Valid-Test Split
num_samples = len(train_df["Timestep_id"].unique())
num_test = round(num_samples * 0.1)
num_train = round(num_samples * 0.8)
num_val = num_samples - num_test - num_train
timesteps = train_df["Timestep_id"].unique()
train_start = int(timesteps[0])
train_end = int(timesteps[num_train])
valid_start = int(timesteps[num_train - config.config.encoder_length])
valid_end = int(timesteps[num_train + num_val])
test_start = int(timesteps[num_train + num_val - config.config.encoder_length])
test_end = int(timesteps[-1]+1)
print(F"Train Range: [{train_start}, {train_end})")
print(F"Valid Range: [{valid_start}, {valid_end})")
print(F"Test Range: [{test_start}, {test_end})")
config.config.train_range = [train_start, train_end]
config.config.valid_range = [valid_start, valid_end]
config.config.test_range = [test_start, test_end]

config.config.scale_per_id = True
#Generate Feature Spec

time_series_count = len(train_df["SecuritiesCode"].unique())

features = [
    {'name': 'id', 'feature_type': 'ID', 'feature_embed_type': 'CATEGORICAL', 'cardinality': time_series_count+1},
    {'name': 'Timestep', 'feature_type': 'TIME', 'feature_embed_type': 'CONTINUOUS'},
    {'name': 'Weight', 'feature_type': 'WEIGHT', 'feature_embed_type': 'CONTINUOUS'},
    {'name': 'AdjustedClose', 'feature_type': 'TARGET', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}},
    {'name': 'DayOfWeek', 'feature_type': 'KNOWN', 'feature_embed_type': 'CATEGORICAL', 'cardinality': 6},
    {'name': 'Timestep', 'feature_type': 'KNOWN', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}},
    {'name': 'Month', 'feature_type': 'KNOWN', 'feature_embed_type': 'CATEGORICAL', 'cardinality': 13},
    {'name': 'Day', 'feature_type': 'KNOWN', 'feature_embed_type': 'CATEGORICAL', 'cardinality': 32},
    {'name': 'AdjustedOpen', 'feature_type': 'OBSERVED', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}},
    {'name': 'AdjustedHigh', 'feature_type': 'OBSERVED', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}},
    {'name': 'AdjustedLow', 'feature_type': 'OBSERVED', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}},
    {'name': 'Volume', 'feature_type': 'OBSERVED', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}},
    {'name': 'id', 'feature_type': 'STATIC', 'feature_embed_type': 'CATEGORICAL', 'cardinality': time_series_count+1},
    {'name': 'IsMonday', 'feature_type': 'KNOWN', 'feature_embed_type': 'CATEGORICAL', 'cardinality': 3},
    {'name': 'Close_1week_mean', 'feature_type': 'OBSERVED', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}},
    {'name': 'Close_1week_std', 'feature_type': 'OBSERVED', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}},
    {'name': 'Close_1week_min', 'feature_type': 'OBSERVED', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}},
    {'name': 'Close_1week_max', 'feature_type': 'OBSERVED', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}},
    {'name': 'Close_4weeks_mean', 'feature_type': 'OBSERVED', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}},
    {'name': 'Close_4weeks_std', 'feature_type': 'OBSERVED', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}},
    {'name': 'Close_4weeks_min', 'feature_type': 'OBSERVED', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}},
    {'name': 'Close_4weeks_max', 'feature_type': 'OBSERVED', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}},
    {'name': 'DayOfYear', 'feature_type': 'KNOWN', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}},
    {'name': 'Year', 'feature_type': 'KNOWN', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}},
]
    
config.config.features = features
config.config.time_series_count = time_series_count
config.config.train_samples = -1
config.config.valid_samples = -1

# OmegaConf.save accepts a path, so no separate file handle is needed
OmegaConf.save(config=config, f=os.path.join(out_conf_dir, "jpx.yaml"))

print(config)

# <div style="padding:10px;color:#76B900;margin:0;font-size:60%;text-align:left;display:fill;border-radius:5px;background-color:#5E5E5E;overflow:hidden">Run Dataset Preprocessing</div>

The cell below generates train.csv, valid.csv and test.csv (plus .bin files if <code>config.binarized</code> is set to True) in the output directory set by <code>config.dest_path</code>. <br>
<code>hydra.searchpath</code> is required because the dataset config is created in a new directory and is not present in the *{tspp_ws}/conf* directory, which is the default Hydra path in the NVIDIA-TSPP.
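
Since <code>config.scale_per_id</code> is set to True, each stock gets its own scaling statistics. The sketch below shows the general idea with the standard library: fit a mean and standard deviation per id on the training rows only, then apply them to another split. It is a simplified stand-in and not the NVIDIA-TSPP `Preprocessor`; the helper names and numbers are made up, and the population standard deviation is used to match how scikit-learn's `StandardScaler` computes its scale.

```python
from statistics import mean, pstdev


def fit_scalers(train_rows):
    """Compute (mean, std) per series id from (id, value) training rows."""
    by_id = {}
    for series_id, value in train_rows:
        by_id.setdefault(series_id, []).append(value)
    return {sid: (mean(vals), pstdev(vals)) for sid, vals in by_id.items()}


def apply_scalers(rows, scalers):
    """Standardize each value with the statistics of its own series id."""
    scaled = []
    for series_id, value in rows:
        mu, sigma = scalers[series_id]
        # Fall back to 1.0 so a constant series does not divide by zero.
        scaled.append((series_id, (value - mu) / (sigma or 1.0)))
    return scaled


train_rows = [("A", 100.0), ("A", 110.0), ("A", 120.0), ("B", 1.0), ("B", 3.0)]
valid_rows = [("A", 130.0), ("B", 4.0)]
scalers = fit_scalers(train_rows)
scaled_valid = apply_scalers(valid_rows, scalers)
print(scaled_valid)
```

Fitting on the training split only keeps information from the validation and test periods out of the scaling statistics, which matches the order of `fit_scalers(train)` and `apply_scalers(...)` in the cell below.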

In [None]:
dataset = "jpx"
skip_preprocessing = False

# Dataset Preprocessing
if not skip_preprocessing:
    # Hydra Setup: this is just needed for notebook example
    preproc_cfg = get_config("preproc_config", config_path, override_list=[F"hydra.searchpath=[pkg://custom_conf]", F"dataset={dataset}"])
    preprocessor = hydra.utils.instantiate(preproc_cfg, _recursive_=False)
    train, valid, test = preprocessor.preprocess()
    preprocessor.fit_scalers(train)
    train = preprocessor.apply_scalers(train)
    valid = preprocessor.apply_scalers(valid)
    test = preprocessor.apply_scalers(test)
    train = preprocessor.impute(train)
    valid = preprocessor.impute(valid)
    test = preprocessor.impute(test)
    preprocessor.save_state()
    preprocessor.save_datasets(train, valid, test)

# <div style="padding:20px;color:#76B900;margin:0;font-size:100%;text-align:left;display:fill;border-radius:5px;background-color:#5E5E5E;overflow:hidden">5 | Training</div>

The platform has the following architecture: <br>
![TSPP Architecture](https://i.imgur.com/BsFpL6D.png)
<br>
In the above figure, the command line feeds input to the NVIDIA-TSPP launcher, which uses said input to configure the components (green blocks) required to train and test the model.

# <div style="padding:10px;color:#76B900;margin:0;font-size:60%;text-align:left;display:fill;border-radius:5px;background-color:#5E5E5E;overflow:hidden">Training Setup</div>


After setting the dataset configuration, we can select the training experiment configuration. The NVIDIA-TSPP offers flexibility while setting up the training for time-series models. <br>
In the cell below, we have the training setup for 3 models: **[TFT](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Forecasting/TFT), [N-BEATS](https://arxiv.org/abs/1905.10437), [XGBoost](https://www.kdd.org/kdd2016/papers/files/rfp0697-chenAemb.pdf)**. Model implementations can be found in the <i>{tspp_ws}/models</i> directory. <br>
All model-related parameters are represented as <code>model.config.parameter_name</code>. The available parameters depend on the model implementation. <br>
In the later cells, we will go over the features available in the NVIDIA-TSPP. <br>
<code>hydra.run.dir</code>: explicitly sets the output directory for training instead of the directory that Hydra generates automatically. <br>
<code>hydra.searchpath</code> is required because the dataset config is created in a new directory and is not present in the *{tspp_ws}/conf* directory, which is the default Hydra path in the NVIDIA-TSPP.
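
The override strings used in the next cell follow Hydra's command-line syntax: `a.b.c=value` changes an existing value, a `+` prefix adds a new key, and `++` adds the key or overrides it if it already exists. The toy function below shows how such dotted keys map onto a nested configuration. It is a deliberately simplified sketch, not Hydra's parser: it ignores the difference between the prefixes, keeps all values as strings, and does not handle config-group selections such as `trainer/optimizer=TorchAdam`.

```python
def apply_overrides(config, overrides):
    """Apply 'a.b.c=value' strings to a nested dict (simplified sketch)."""
    for item in overrides:
        key, raw_value = item.split("=", 1)
        key = key.lstrip("+")  # the sketch treats '+', '++' and no prefix alike
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = raw_value  # real Hydra would also parse the value type
    return config


cfg = {"trainer": {"config": {"num_epochs": "5"}}}
apply_overrides(cfg, ["trainer.config.num_epochs=20", "++trainer.config.ema=False"])
print(cfg)
```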

In [None]:
# Training setup for the NVIDIA-TSPP
dataset = "jpx"
def get_training_config(model_name):

    # Hydra Setup: this is just needed for notebook example
    model_params_list = []
    hydra_override_list = []
    if model_name == "tft":
        model_params_list = [
            "trainer/optimizer=TorchAdam",
            "trainer.config.num_epochs=20",
            "trainer.config.batch_size=1024",
            "model.config.hidden_size=168",
            "model.config.n_head=4",
            "model.config.dropout=0.1",
            "trainer.optimizer.lr=0.0009171768778020634",
            "++trainer.config.ema=False"
        ]
    elif model_name == "nbeats":
        model_params_list = [
            "model.config.stacks=[{type:generic,num_blocks:2,theta_dim:0,share_weights:False,hidden_size:256},{type:generic,num_blocks:2,theta_dim:0,share_weights:True,hidden_size:256}]",
            "trainer/optimizer=TorchAdam",
            "trainer.optimizer.lr=0.008472935555938357",
            "trainer.config.num_epochs=20",
            "trainer.config.batch_size=1024",
            "++trainer.config.ema=False"
        ]
    elif model_name == "xgboost":
        model_params_list = [
            "++model.config.max_depth=6",
            "++model.config.colsample_bylevel=1.0",
            "++model.config.colsample_bytree=0.6",
            "++model.config.subsample=1.0",
            "++model.config.gamma=0.0001",
            "++model.config.learning_rate=0.010213742143683735",
            "++model.config.n_rounds=1000",
            "++trainer.callbacks.early_stopping.patience=50"
        ]
    else:
        raise ValueError(F"model: {model_name} not supported")

    output_workdir = os.path.join(curr_workdir, F'outputs/kaggle_{model_name}_{dataset}')
    os.makedirs(output_workdir, exist_ok = True)
    
    hydra_override_list = [
        F"model={model_name}",
        F"hydra.searchpath=[pkg://custom_conf]",
        F"dataset={dataset}",
        F"hydra.run.dir={output_workdir}",
        "+trainer.config.force_rerun=True",
        "++trainer.config.log_interval=200"
    ] + model_params_list
        

    cfg = get_config("train_config", config_path, override_list=hydra_override_list, return_hydra_config=True)
    output_hydra_path = os.path.join(output_workdir, ".hydra")
    os.makedirs(output_hydra_path, exist_ok=True)
    with open(os.path.join(output_hydra_path, "config.yaml"), "w") as f:
        OmegaConf.save(cfg, f=f)
    return cfg, output_workdir

The cell below creates the training config for each of the three models

In [None]:
tft_cfg, tft_outdir = get_training_config("tft")
nbeats_cfg, nbeats_outdir = get_training_config("nbeats")
xgb_cfg, xgb_outdir = get_training_config("xgboost")

# <div style="padding:10px;color:#76B900;margin:0;font-size:60%;text-align:left;display:fill;border-radius:5px;background-color:#5E5E5E;overflow:hidden">Training Parameters</div>

In the cell below we will go over the training features available in the NVIDIA-TSPP

In [None]:
print("\nTrainer Config\n", OmegaConf.to_yaml(tft_cfg.trainer))

Training parameters are described below.

The NVIDIA-TSPP supports three types of trainers:
<code>trainer</code>=OPTIONS(ctltrainer, xgbtrainer, stattrainer). ctltrainer is used for all DL models; xgbtrainer and stattrainer are used for xgboost and autoarima, respectively. The trainer is selected automatically based on the model config. <br>
<code>trainer.config.force_rerun</code>: should be set to True for this notebook. This setting relates to resuming training from a saved checkpoint, which is not supported in the notebook at this point. <br>
<code>trainer.config.log_interval</code>: iteration interval for printing the training status <br>
<code>trainer.config.logfile_name</code>: output log file name <br>
<code>trainer.config.world_size</code>: the number of GPUs the launcher is running on <br>
<code>trainer.config.device</code>: target device for training. Default: 'cuda' (set to 'cpu' for CPU-only training)<br>
Features such as early stopping, logging, checkpoint saving, and throughput measurement are available through the callback mechanism (<code>trainer.callbacks</code>). By default, all of these callbacks are enabled. Config files are available at *{tspp_ws}/trainer/callbacks* <br>
<br>Parameters specific to DL models<br>
<code>trainer.config.batch_size</code>: batch size per training iteration <br>
<code>trainer.config.num_workers</code>: the number of workers to use for dataloading <br>
<code>trainer.config.num_epochs</code>:  the number of epochs to train the model for <br>
<code>trainer.config.amp</code>: whether to enable automatic mixed precision (AMP) for accelerated training (this currently requires the APEX library in the NVIDIA-TSPP and is not applicable to this notebook) <br>
<code>trainer.config.ema</code>: enables the exponential moving average (EMA) feature when set to True. The model weights are accumulated into a weighted moving average, and that average is used in lieu of the directly trained model weights at test time. Our experiments have found that this technique improves the convergence properties of most models and datasets we work with. [EMA Paper](https://arxiv.org/pdf/1803.05407.pdf)<br>
<br>
<code>trainer/optimizer</code>: selects the training optimizer (default: 'Adam', the APEX-optimized Adam optimizer, which will not work with this notebook). Available optimizers: TorchAdam, ASGD, Adadelta, Adagrad, AdamW, Adamax, LBFGS, RMSProp, Rprop, SGD, SparseAdam. Optimizer configs are available in the *{tspp_ws}/conf/trainer/optimizer* directory. Optimizer-specific parameters are set using <code>trainer.optimizer.parameter_name=..</code><br>
<code>trainer/criterion</code>: selects the training criterion (default: 'MSE'). Available criteria: GLL, L1, MSE, quantile. Criterion configs are available in the <i>{tspp_ws}/conf/trainer/criterion</i> directory <br><br>
For evaluation, the NVIDIA-TSPP provides several metrics: MSE, MAE, RMSE, SMAPE, TDI (Temporal Distortion Index), and ND (Normalized Deviation). Evaluation configs are available in the *{tspp_ws}/conf/evaluator* directory. These metrics can be used as objectives for hyperparameter search with Optuna (see the example in a later cell).
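
To make the <code>trainer.config.ema</code> option more concrete, the sketch below keeps an exponential moving average of a small weight vector across training steps. The decay value and the plain-list representation of the weights are assumptions chosen for illustration; the NVIDIA-TSPP implementation may differ in its details.

```python
def ema_update(shadow, weights, decay):
    """Blend the current weights into the running average."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]


# Pretend these are the model weights after three consecutive optimizer steps.
steps = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]

shadow = list(steps[0])              # initialise the average with the first weights
for weights in steps[1:]:
    shadow = ema_update(shadow, weights, decay=0.5)

# The averaged weights lag behind the latest raw weights [3.0, 6.0];
# at test time the averaged copy would be used instead of the raw one.
print(shadow)
```

A decay close to 1 gives a smoother but slower-moving average; the value 0.5 is used here only to keep the arithmetic easy to follow.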



In [None]:
%%script false --no-raise-error
# Remove the "%%script false" line above and uncomment the lines below to look at all the parameters for this experiment

#print("\nModel Config\n", OmegaConf.to_yaml(tft_cfg.model))
#print("\nDataset Config\n", OmegaConf.to_yaml(tft_cfg.dataset))
#print("\nEvaluator Config\n", OmegaConf.to_yaml(tft_cfg.evaluator))

# <div style="padding:10px;color:#76B900;margin:0;font-size:60%;text-align:left;display:fill;border-radius:5px;background-color:#5E5E5E;overflow:hidden">Launch Training</div>

Method to launch training on the NVIDIA-TSPP

In [None]:
def launch_training(model_name, cfg, output_workdir):
    os.chdir(curr_workdir)
    os.chdir(output_workdir)
    # Training
    set_seed(cfg.get("seed", None))
    train, valid, test = hydra.utils.call(cfg.dataset)
    model = hydra.utils.instantiate(cfg.model)
    if model_name == "xgboost":
        if cfg.trainer.get("criterion", None):
            del cfg.trainer.criterion
        trainer = hydra.utils.instantiate(
            cfg.trainer,
            model=model,
            train_dataset=train,
            valid_dataset=valid,
            patience=cfg.trainer.callbacks.early_stopping.patience,
            log_interval=cfg.trainer.config.get('log_interval', 25)
        )
        trainer.train()
    else:
        model = model.to(device=cfg.model.config.device)
        trainer = hydra.utils.instantiate(
                cfg.trainer,
                optimizer={'params': model.parameters()},
                model=model,
                train_dataset=train,
                valid_dataset=valid)
        trainer.train()
    os.chdir(curr_workdir)
    print(output_workdir)

Select model for training

In [None]:
print("==========TFT===========")
launch_training("tft", tft_cfg, tft_outdir)
base_cfg = tft_cfg
base_outdir = tft_outdir
base_name = "tft"
# print("==========NBEATS===========")
# launch_training("nbeats", nbeats_cfg, nbeats_outdir)
# base_cfg = nbeats_cfg
# base_outdir = nbeats_outdir
# base_name = "nbeats"
# print("==========XGB===========")
# launch_training("xgboost", xgb_cfg, xgb_outdir)
# base_cfg = xgb_cfg
# base_outdir = xgb_outdir
# base_name = "xgboost"


# <div style="padding:10px;color:#76B900;margin:0;font-size:60%;text-align:left;display:fill;border-radius:5px;background-color:#5E5E5E;overflow:hidden">HP Search with Optuna</div>


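Hyperparameter search can be used to find semi-optimal hyperparameter configurations for a given model or dataset. In the NVIDIA-TSPP, hyperparameter search is driven by Optuna. <br>
The cell below shows examples of running a hyperparameter search with the NVIDIA-TSPP. It is skipped by default because of the <code>%%script false --no-raise-error</code> line; remove that line to run the search.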

In [None]:
%%script false --no-raise-error
#Optuna Support
num_trials = 2
model_name = "tft"
#model_name = "xgboost"
dataset = "jpx"
output_workdir_optuna = os.path.join(curr_workdir, F'outputs/kaggle_{model_name}_{dataset}_optuna')
os.makedirs(output_workdir_optuna, exist_ok = True)
if model_name == "tft":
    !python {tspp_ws}/launch_training.py --config-dir custom_conf -m 'model.config.n_head=choice(1,2,4)' 'trainer.optimizer.lr=tag(log, interval(1e-5, 1e-2))' model={model_name} dataset={dataset} \
    trainer.config.batch_size=1024 trainer.config.num_epochs=1   ++trainer.config.log_interval=-1  ++trainer.config.force_rerun=True trainer/optimizer=TorchAdam +optuna_objectives=[MAE] hydra/sweeper=optuna \
    hydra.sweeper.n_trials={num_trials} hydra.sweeper.n_jobs=1 hydra.sweeper.storage=sqlite:///{output_workdir_optuna}/hp_search_multiobjective.db \
    hydra.sweep.dir={output_workdir_optuna}

if model_name == "xgboost":
    !python {tspp_ws}/launch_training.py --config-dir custom_conf -m '++model.config.max_depth=choice(2, 3, 4, 5, 6)' model={model_name} dataset={dataset} \
    ++trainer.config.log_interval=200 ++trainer.config.force_rerun=True ++trainer.callbacks.early_stopping.patience=20 \
    ++model.config.learning_rate=0.017 ++model.config.subsample=0.8 \
    ++model.config.colsample_bytree=1.0 ++model.config.colsample_bylevel=0.4 ++model.config.gamma=0.3 ++model.config.n_rounds=1000 \
    +optuna_objectives=[MSE] hydra/sweeper=optuna \
    hydra.sweeper.n_trials={num_trials} hydra.sweeper.n_jobs=1 hydra.sweeper.storage=sqlite:///{output_workdir_optuna}/hp_search_multiobjective.db \
    hydra.sweep.dir={output_workdir_optuna}

# Find the directory and model with the best results    
n_trials = num_trials
print(F"Optuna Results: {output_workdir_optuna}")
best_model_dir = None
best_params = None
with open(os.path.join(output_workdir_optuna, "optimization_results.yaml"), "r") as f:
    best_params = OmegaConf.load(f)
    best_params = best_params.best_params
    best_params = set([str(k)+"="+str(v) for k,v in best_params.items()])
print(best_params)
for trial_id in range(n_trials):
    trial_path = os.path.join(output_workdir_optuna, str(trial_id))
    overrides_file_path = os.path.join(trial_path, ".hydra/overrides.yaml")
    if os.path.exists(overrides_file_path):
        with open(overrides_file_path, "r") as f:
            overrides_list = set(OmegaConf.load(f))
            #print(overrides_list)
            if best_params.issubset(overrides_list):
                #print(F"Best Model in the directory: {trial_path}")
                best_model_dir = trial_path
                break
    else:
        print(F"Couldn't find directory or config files at: {trial_path}")

print(F"best model dir: {best_model_dir}")
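
The TFT sweep above searches `model.config.n_head` over a categorical choice and `trainer.optimizer.lr` over a log-scaled interval. The sketch below draws candidates from the same space with plain random search and keeps the best one. The `objective` function is a hypothetical stand-in for a training run that returns a validation MAE, and random sampling is used only for simplicity; it is not meant to reproduce the sampler that Optuna uses.

```python
import math
import random


def sample_params(rng):
    """Draw one candidate from the search space used in the TFT sweep."""
    # Sampling uniformly in log space spreads trials evenly across 1e-5 .. 1e-2.
    log_lr = rng.uniform(math.log(1e-5), math.log(1e-2))
    return {"n_head": rng.choice([1, 2, 4]), "lr": math.exp(log_lr)}


def objective(params):
    """Hypothetical stand-in for a training run returning a validation MAE."""
    return abs(math.log10(params["lr"]) + 3.0) + 0.1 / params["n_head"]


rng = random.Random(0)               # fixed seed so the sketch is repeatable
trials = [sample_params(rng) for _ in range(20)]
best = min(trials, key=objective)
print(best, objective(best))
```

Sampling the learning rate on a log scale matters because its useful values span several orders of magnitude; a linear interval would place almost every trial near 1e-2.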

# <div style="padding:20px;color:#76B900;margin:0;font-size:100%;text-align:left;display:fill;border-radius:5px;background-color:#5E5E5E;overflow:hidden">6 | Evaluation</div>

Method to check Inference Performance <br>

In [None]:
def evaluate_model(model_name, cfg, output_workdir):
    print(F"Dataset: {dataset}")
    print(F"Model: {model_name}")
    print(F"Curr Directory: {curr_workdir}")
    print(F"Checkpoint Directory: {output_workdir}")
    os.chdir(curr_workdir)
    train, valid, test = hydra.utils.call(cfg.dataset)
    del train, valid
    evaluator = hydra.utils.instantiate(cfg.evaluator, test_data=test)
    model = hydra.utils.instantiate(cfg.model)

    if model_name == "xgboost":
        model.load(output_workdir)
    else:
        state_dict = torch.load(os.path.join(output_workdir, "best_checkpoint.zip"))['model_state_dict']
        model.load_state_dict(state_dict)
        device = torch.device(cfg.model.config.device)  # maybe change depending on evaluator
        model.to(device=device)

    preds_full, labels_full, ids_full, weights_full = evaluator.predict(model)
    eval_metrics = evaluator.evaluate(preds_full, labels_full, ids_full, weights_full)
    print(eval_metrics)

    #Command Line
    #! python launch_inference.py dataset={dataset} model={model} evaluator.config.checkpoint={output_workdir}
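
`evaluate_model` prints the metrics produced by the evaluator. For reference, the sketch below computes MAE, RMSE, ND and SMAPE on a toy forecast. The formulas are common textbook definitions and are stated here as an assumption: SMAPE in particular has several variants, so the values reported by the NVIDIA-TSPP evaluator may be scaled differently.

```python
import math


def mae(labels, preds):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(labels, preds)) / len(labels)


def rmse(labels, preds):
    """Root mean squared error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels))


def nd(labels, preds):
    """Normalized deviation: total absolute error over total absolute label."""
    return sum(abs(y - p) for y, p in zip(labels, preds)) / sum(abs(y) for y in labels)


def smape(labels, preds):
    """Symmetric MAPE in percent (one of several variants in use)."""
    terms = [2 * abs(y - p) / (abs(y) + abs(p)) for y, p in zip(labels, preds)]
    return 100.0 * sum(terms) / len(terms)


labels = [100.0, 200.0, 300.0]       # invented closing prices
preds = [110.0, 190.0, 330.0]        # invented forecasts

for name, fn in [("MAE", mae), ("RMSE", rmse), ("ND", nd), ("SMAPE", smape)]:
    print(name, round(fn(labels, preds), 4))
```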

Select model for evaluation

In [None]:
evaluate_model(base_name, base_cfg, base_outdir)
#print("====TFT Perf===")
#evaluate_model("tft", tft_cfg, tft_outdir)
#print("====Nbeats Perf===")
#evaluate_model("nbeats", nbeats_cfg, nbeats_outdir)
#print("====Xgboost Perf===")
#evaluate_model("xgboost", xgb_cfg, xgb_outdir)

# <div style="padding:20px;color:#76B900;margin:0;font-size:100%;text-align:left;display:fill;border-radius:5px;background-color:#5E5E5E;overflow:hidden">7 | Inference Pipeline</div>
Below we describe the steps for inference for this competition

1. For each target date (T), first select historical entries from the training set (based on encoder_length)
2. Append the target date entries to the historical dataframe
3. Also append the prediction date entries to this dataframe (based on example_length - encoder_length). For this example, the difference is set to 2 to predict the closing prices of T+1 and T+2 (these rows are primarily needed for the known future features of the predicted dates: Day of Week, Day, Month)
4. After this, first extract Time features from this dataframe, then preprocess this dataframe to apply scalers and categorical mappings
5. Run inference
6. Unscale predictions to get closing prices for T+1 and T+2
7. Calculate Target
8. Calculate Sharpe Ratio
9. Compare it with the actual target value
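
Steps 6 and 7 can be sketched as follows: the Target for date T is the relative change between the predicted closing prices of T+1 and T+2, and securities are then ranked so that the highest Target receives Rank 0. The security codes and prices below are made-up values for illustration.

```python
# Hypothetical unscaled predictions: code -> (close at T+1, close at T+2)
predicted_close = {
    1301: (100.0, 102.0),
    1332: (50.0, 49.0),
    1333: (200.0, 210.0),
}

# Target is the rate of change from the T+1 close to the T+2 close.
targets = {code: t2 / t1 - 1.0 for code, (t1, t2) in predicted_close.items()}

# Highest target first; ranks start at 0 and are unique.
ordered = sorted(targets, key=lambda code: targets[code], reverse=True)
ranks = {code: rank for rank, code in enumerate(ordered)}

print(targets)
print(ranks)
```

Because the competition metric only looks at the ordering, an error that shifts every predicted Target by the same amount leaves the ranks unchanged.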

Method to calculate Sharpe Ratio <br>

In [None]:
def calc_spread_return_sharpe(df: pd.DataFrame, portfolio_size: int = 200, toprank_weight_ratio: float = 2) -> tuple:
    """
    Ref: https://www.kaggle.com/code/smeitoma/jpx-competition-metric-definition
    Args:
        df (pd.DataFrame): predicted results
        portfolio_size (int): # of equities to buy/sell
        toprank_weight_ratio (float): the relative weight of the most highly ranked stock compared to the least.
    Returns:
        (float): sharpe ratio
        (pd.Series): spread return per day, indexed by Date
    """
    def _calc_spread_return_per_day(df, portfolio_size, toprank_weight_ratio):
        """
        Args:
            df (pd.DataFrame): predicted results
            portfolio_size (int): # of equities to buy/sell
            toprank_weight_ratio (float): the relative weight of the most highly ranked stock compared to the least.
        Returns:
            (float): spread return
        """
        assert df['Rank'].min() == 0
        assert df['Rank'].max() == len(df['Rank']) - 1
        weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)
        purchase = (df.sort_values(by='Rank')['Target'][:portfolio_size] * weights).sum() / weights.mean()
        short = (df.sort_values(by='Rank', ascending=False)['Target'][:portfolio_size] * weights).sum() / weights.mean()
        return purchase - short

    buf = df.groupby('Date').apply(_calc_spread_return_per_day, portfolio_size, toprank_weight_ratio)
    sharpe_ratio = buf.mean() / buf.std()
    return sharpe_ratio, buf
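
As a worked example of the daily spread return, the sketch below repeats the arithmetic of `_calc_spread_return_per_day` on four securities with `portfolio_size=2`. It is written without NumPy or pandas so that each step is visible; the Target values are invented for illustration.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop, both included."""
    if num == 1:
        return [float(start)]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def spread_return(targets_by_rank, portfolio_size, toprank_weight_ratio):
    """targets_by_rank[i] is the realised Target of the security ranked i."""
    weights = linspace(toprank_weight_ratio, 1, portfolio_size)
    mean_weight = sum(weights) / len(weights)
    top = targets_by_rank[:portfolio_size]             # bought, best rank first
    bottom = targets_by_rank[::-1][:portfolio_size]    # shorted, worst rank first
    purchase = sum(t * w for t, w in zip(top, weights)) / mean_weight
    short = sum(t * w for t, w in zip(bottom, weights)) / mean_weight
    return purchase - short


# Weights are [2.0, 1.0] with mean 1.5:
#   purchase = (0.03 * 2 + 0.01 * 1) / 1.5
#   short    = (-0.02 * 2 + -0.01 * 1) / 1.5
daily = spread_return([0.03, 0.01, -0.01, -0.02], portfolio_size=2, toprank_weight_ratio=2)
print(daily)
```

The Sharpe ratio returned by `calc_spread_return_sharpe` is the mean of these daily values divided by their standard deviation, so it is only meaningful once more than one date has been scored.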

Loading supplemental files for Inference Performance

In [None]:
supp_file = "supplemental_files/stock_prices.csv"
supp_file = os.path.join(dataset_path, supp_file)
supp_df = pd.read_csv(supp_file)
min_date = pd.to_datetime(supp_df["Date"]).min()
max_date = pd.to_datetime(supp_df["Date"]).max()
print(F"min_date={min_date}, max_date={max_date}, num_days = {max_date - min_date}")

Loading Dataset Mappings for inference

In [None]:
dataset_cfg = base_cfg.dataset
encoder_length = dataset_cfg.config.encoder_length
with open(os.path.join(dataset_cfg.config.dest_path, "tspp_preprocess.bin"), "rb") as f:
    preprocess_map = pickle.load(f)
scalers = preprocess_map["scalers"]
node_mappings_df = preprocess_map["id_mappings"]
id_mappings_dict = pd.Series(node_mappings_df['id'].values,index=node_mappings_df["_id_"]).to_dict()

Method to run inference

In [None]:
def get_inference(model_name, cfg, output_workdir, pred_df):
    model = hydra.utils.instantiate(cfg.model)
    if model_name != "xgboost":
        state_dict = torch.load(os.path.join(output_workdir, "best_checkpoint.zip"))['model_state_dict']
        model.load_state_dict(state_dict)
        device = torch.device(cfg.model.config.device)  # maybe change depending on evaluator
        model.to(device=device)
    else:
        model.load(output_workdir)
    
    _, _, test_dataobj = hydra.utils.call(cfg.dataset, input_df=pred_df)
    evaluator = hydra.utils.instantiate(cfg.evaluator, test_data=test_dataobj)
    preds_full, labels_full, ids_full, weights_full = evaluator.predict(model)
    return preds_full, labels_full, ids_full, weights_full

Select Model

In [None]:
#models_list = [("xgboost", xgb_cfg, xgb_outdir), ("tft", tft_cfg, tft_outdir), ("nbeats", nbeats_cfg, nbeats_outdir)]
models_list = [(base_name, base_cfg, base_outdir)]

Inference Test

In [None]:
#%%time
# For inference, we need historical data for encoder part
df = pd.read_csv(train_sp_file)
df["Date"] = pd.to_datetime(df["Date"])

history_date_start = np.sort(df["Date"].unique())[-60]
df = df[df["Date"] >= history_date_start].reset_index()

supp_df["Date"] = pd.to_datetime(supp_df["Date"])
supp_df = supp_df.sort_values(["Date"])

if df.Date.iloc[-1] < supp_df.Date.iloc[0]:
    print(F"Last date on train file: {df.Date.iloc[-1]}, First date on supplemental file: {supp_df.Date.iloc[0]}")
    df = pd.concat([df, supp_df], ignore_index=True)
else:
    print("Overlap in dates of supplemental files and train files")

hdays = encoder_length
prediction_length = dataset_cfg.config.example_length - dataset_cfg.config.encoder_length
num_codes = len(supp_df["SecuritiesCode"].unique())
test_dates = supp_df["Date"].unique()
final_df = pd.DataFrame()

preprocessor = Preprocessor(dataset_cfg.config)
preprocessor.load_state(os.path.join(dataset_cfg.config.dest_path, "tspp_preprocess.bin"))

# Predict two dates from the supplemental file (indices 10 and 11) in a loop
for test_date in test_dates[10:12]:
    print(test_date)
    test_entry_df = supp_df[supp_df["Date"] == test_date]
    print("History Date Before", df.Date.iloc[-1])
    df = df[df["Date"] < test_date]
    print("History Date After", df.Date.iloc[-1])
    df = pd.concat([df, test_entry_df], ignore_index=True)
    pred_df = pd.DataFrame({'Date': pd.date_range(start=test_entry_df.Date.iloc[-1], periods=prediction_length+1, freq='B', closed='right')})
    codes = test_entry_df.SecuritiesCode.unique()
    pred_df = pd.concat([pred_df]*len(codes), ignore_index=True)
    expanded_codes = [element for element in codes for i in range(prediction_length)]
    pred_df["SecuritiesCode"] = expanded_codes
    pred_df = pd.concat([df, pred_df], ignore_index=True)
    pred_df = pred_df.sort_values(["SecuritiesCode", "Date"])
    pred_df.ffill(inplace=True)
    pred_df = process_raw_data(pred_df)
    pred_df = pred_df.groupby("SecuritiesCode").tail(hdays+prediction_length).copy()
    pred_df = pred_df.reset_index()
    pred_df = preprocessor.preprocess_test(dataset=pred_df)
    pred_df = preprocessor.apply_scalers(pred_df)
    pred_df = preprocessor.impute(pred_df)
    
    model_results_dict = {}
    
    for model_name, cfg, model_dir in models_list:
        print("Running Inference on model: ", model_name)
        preds_full, labels_full, ids_full, weights_full = get_inference(model_name, cfg, model_dir, pred_df)
        upreds = np.stack([scalers.inverse_transform_targets(preds_full[...,i], ids_full) for i in range(preds_full.shape[-1])], axis=-1)
        targets = upreds[:, 1]/upreds[:, 0] - 1
        targets = list(targets.squeeze())
        securities = [id_mappings_dict[i] for i in ids_full]
        target_dict = dict(zip(securities, targets))
        model_results_dict[model_name] = target_dict
    
    out_df = pd.DataFrame(codes, columns=["SecuritiesCode"])
    out_df["Date"] = test_entry_df.Date.iloc[-1]
    out_df["Target"] = 0
    for model_name, _, _ in models_list:
        out_df["Target"] += out_df["SecuritiesCode"].map(model_results_dict[model_name])
    out_df["Target"] /= len(models_list)
    out_df['Rank'] = out_df.groupby('Date')['Target'].rank(ascending = False, method = 'first') - 1 
    out_df['Rank'] = out_df['Rank'].astype("int")
    
    _, predicted_spread_return = calc_spread_return_sharpe(out_df, 200, 2)
    final_df = pd.concat([final_df, out_df], ignore_index=True)
    #Valid Part:
    valid_df = test_entry_df.copy()
    valid_df['Rank'] = valid_df.groupby('Date')['Target'].rank(ascending = False, method = 'first') - 1 
    valid_df['Rank'] = valid_df['Rank'].astype("int")
    _, actual_spread_return = calc_spread_return_sharpe(valid_df, 200, 2)
    print(F"Spread Return on date: {pd.to_datetime(test_date).strftime('%Y-%m-%d')}, Predicted: {predicted_spread_return.iloc[0]}, Actual: {actual_spread_return.iloc[0]}")
    
calc_spread_return_sharpe(final_df, 200, 2)
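
The loop above appends `prediction_length` future rows per security by asking pandas for the business days that follow the current date (`freq='B'`). The dependency-free sketch below shows the same idea with the standard library. Like `freq='B'`, it skips weekends only and knows nothing about exchange holidays; `next_business_days` is a hypothetical helper name.

```python
from datetime import date, timedelta


def next_business_days(start, count):
    """Return the `count` weekdays (Monday to Friday) that follow `start`."""
    days = []
    current = start
    while len(days) < count:
        current += timedelta(days=1)
        if current.weekday() < 5:    # 0..4 are Monday..Friday
            days.append(current)
    return days


# A Friday: the two following business days fall in the next week.
future = next_business_days(date(2022, 3, 4), 2)
print(future)
```

These placeholder rows carry no prices; they exist so that the known future features (Day of Week, Day, Month) can be computed for the dates being predicted.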

# <div style="padding:20px;color:#76B900;margin:0;font-size:100%;text-align:left;display:fill;border-radius:5px;background-color:#5E5E5E;overflow:hidden">8 | Submission</div>

Spread Return Per Day

In [None]:
def calc_spread_return_per_day(df, portfolio_size, toprank_weight_ratio):
    """
    Args:
        df (pd.DataFrame): predicted results
        portfolio_size (int): # of equities to buy/sell
        toprank_weight_ratio (float): the relative weight of the most highly ranked stock compared to the least.
    Returns:
        (float): spread return
    """
    assert df['Rank'].min() == 0
    assert df['Rank'].max() == len(df['Rank']) - 1
    weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)
    purchase = (df.sort_values(by='Rank')['Target'][:portfolio_size] * weights).sum() / weights.mean()
    short = (df.sort_values(by='Rank', ascending=False)['Target'][:portfolio_size] * weights).sum() / weights.mean()
    return purchase - short

Inference Run for Submission

In [None]:
%%time

import jpx_tokyo_market_prediction
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()

# For inference, we need historical data for encoder part
df = pd.read_csv(train_sp_file)
df["Date"] = pd.to_datetime(df["Date"])

history_date_start = np.sort(df["Date"].unique())[-60]
df = df[df["Date"] >= history_date_start].reset_index()

supp_file = "supplemental_files/stock_prices.csv"
supp_file = os.path.join(dataset_path, supp_file)
supp_df = pd.read_csv(supp_file)
supp_df["Date"] = pd.to_datetime(supp_df["Date"])
supp_df = supp_df.sort_values(["Date"])

if df.Date.iloc[-1] < supp_df.Date.iloc[0]:
    print(F"Last date on train file: {df.Date.iloc[-1]}, First date on supplemental file: {supp_df.Date.iloc[0]}")
    df = pd.concat([df, supp_df], ignore_index=True)
else:
    print("Overlap in dates of supplemental files and train files")

hdays = encoder_length
prediction_length = dataset_cfg.config.example_length - dataset_cfg.config.encoder_length


preprocessor = Preprocessor(dataset_cfg.config)
preprocessor.load_state(os.path.join(dataset_cfg.config.dest_path, "tspp_preprocess.bin"))


for prices, options, financials, trades, secondary_prices, sample_prediction in iter_test:
    print(F"Current date: {prices.Date.iloc[-1]}")
    print("History Date Before", df.Date.iloc[-1])
    df = df[df["Date"] < prices.Date.iloc[-1]]
    print("History Date After", df.Date.iloc[-1])
    df = pd.concat([df, prices], ignore_index=True)
    df["Date"] = pd.to_datetime(df["Date"])
    pred_df = pd.DataFrame({'Date': pd.date_range(start=prices.Date.iloc[-1], periods=prediction_length+1, freq='B', closed='right')})
    codes = prices.SecuritiesCode.unique()
    pred_df = pd.concat([pred_df]*len(codes), ignore_index=True)
    expanded_codes = [element for element in codes for i in range(prediction_length)]
    pred_df["SecuritiesCode"] = expanded_codes
    pred_df = pd.concat([df, pred_df], ignore_index=True)
    pred_df = pred_df.sort_values(["SecuritiesCode", "Date"])
    pred_df.ffill(inplace=True)
    pred_df = process_raw_data(pred_df)
    pred_df = pred_df.groupby("SecuritiesCode").tail(hdays+prediction_length).copy()
    pred_df = pred_df.reset_index()
    pred_df = preprocessor.preprocess_test(dataset=pred_df)
    pred_df = preprocessor.apply_scalers(pred_df)
    pred_df = preprocessor.impute(pred_df)

    model_results_dict = {}
    
    for model_name, cfg, model_dir in models_list:
        print("Running Inference on model: ", model_name)
        preds_full, labels_full, ids_full, weights_full = get_inference(model_name, cfg, model_dir, pred_df)
        upreds = np.stack([scalers.inverse_transform_targets(preds_full[...,i], ids_full) for i in range(preds_full.shape[-1])], axis=-1)
        targets = upreds[:, 1]/upreds[:, 0] - 1
        targets = list(targets.squeeze())
        securities = [id_mappings_dict[i] for i in ids_full]
        target_dict = dict(zip(securities, targets))
        model_results_dict[model_name] = target_dict

    
    out_df = pd.DataFrame(codes, columns=["SecuritiesCode"])
    out_df["Date"] = prices.Date.iloc[-1]
    out_df["Target"] = 0
    for model_name, _, _ in models_list:
        out_df["Target"] += out_df["SecuritiesCode"].map(model_results_dict[model_name])
    out_df["Target"] /= len(models_list)
    out_df['Rank'] = out_df.groupby('Date')['Target'].rank(ascending = False, method = 'first') - 1 
    out_df['Rank'] = out_df['Rank'].astype("int")
    
    score = calc_spread_return_per_day(out_df, 200, 2)
    print(F"Score: {score}")
    subm_preds = out_df.set_index("SecuritiesCode")["Rank"]
    sample_prediction['Rank'] = sample_prediction["SecuritiesCode"].map(subm_preds)
    env.predict(sample_prediction)
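
When `models_list` holds more than one model, the loop above averages the per-security Targets before ranking. The sketch below shows that step on plain dictionaries. The model names are the ones used in this notebook, while the security codes and Target values are invented.

```python
# Hypothetical per-model predictions: model name -> {security code: Target}
model_results = {
    "tft": {1301: 0.02, 1332: -0.01},
    "xgboost": {1301: 0.04, 1332: 0.03},
}
codes = [1301, 1332]

# Simple unweighted mean across models for every security.
ensemble = {
    code: sum(result[code] for result in model_results.values()) / len(model_results)
    for code in codes
}
print(ensemble)
```

One difference is worth noting: this sketch raises a `KeyError` if a model has no prediction for a security, whereas the pandas `map` call in the loop yields NaN for that security, which then propagates into the averaged Target.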
    

# <div style="padding:20px;color:#76B900;margin:0;font-size:100%;text-align:left;display:fill;border-radius:5px;background-color:#5E5E5E;overflow:hidden">9 | References</div>

Here are the references to notebooks we started from as well as pointers to additional material if you want to know more about TSPP.  We hope you enjoyed this notebook and that it will be useful to you!

1. Kaggle Notebooks
* https://www.kaggle.com/code/smeitoma/train-demo#Generating-AdjustedClose-price
* https://www.kaggle.com/code/smeitoma/jpx-competition-metric-definition
* https://www.kaggle.com/code/datahobbit/jpx-network-models-and-feature-generation
* https://www.kaggle.com/code/kellibelcher/jpx-stock-market-analysis-prediction-with-lgbm
* https://www.kaggle.com/code/chumajin/easy-to-understand-the-competition
2. NVIDIA-TSPP: https://github.com/NVIDIA/DeepLearningExamples/tree/master/Tools/PyTorch/TimeSeriesPredictionPlatform
3. NVIDIA-TFT code: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Forecasting/TFT
4. NVIDIA-TSPP Blog: https://developer.nvidia.com/blog/time-series-forecasting-with-the-nvidia-time-series-prediction-platform-and-triton-inference-server/
5. XGBoost: https://www.kdd.org/kdd2016/papers/files/rfp0697-chenAemb.pdf
6. TFT: https://arxiv.org/abs/1912.09363
7. N-Beats: https://arxiv.org/abs/1905.10437
8. N-Beats code: https://github.com/philipperemy/n-beats
