<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>ABOUT</b></p>
</div>

This is yet another LGB notebook for the JPX competition.

Its main difference is the modular structure:

- A utility script [jpx_feat_util](https://www.kaggle.com/code/notabene/jpx-feat-util) handles the pre-processing and feature engineering
- This notebook uses that script and handles the CV. It pre-computes the CV fold files, which other notebooks can then use simply by importing the output of this notebook
- The fold files are imported by another notebook [jpx-lgb-train-no-leak](https://www.kaggle.com/code/notabene/jpx-lgb-train-no-leak) which trains a separate LGB model on each fold
- Yet another notebook [jpx-lgb-test-no-leak](https://www.kaggle.com/code/notabene/jpx-lgb-test-no-leak) then imports the models and the utility script to produce a submission file

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>CROSS-VALIDATION STRATEGY</b></p>
</div>

The CV strategy pursued in this notebook can be summarized simply as:

- Use every record in a test set exactly *once* by rolling a test window across the time series.
- To prevent leakage, leave a gap between the test window and the training data on either side of it.
- The width of this gap is determined by the lags in the features used. The longest lag used is 66 days, so the gap is set to 70 days.
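
The splitting rule above can be sketched on a toy date index. This is only an illustration under stated assumptions: `rolling_folds` is a hypothetical helper that is not part of this notebook or of `jpx_feat_util`, it works on integer positions rather than real dates, and the toy sizes (20 dates, test window of 5, gap of 2) are made up to keep the output small.

```python
def rolling_folds(n_dates, test_period, embargo):
    """Hypothetical helper: (test, train) index lists for a rolling test
    window with an embargo gap on both sides of the window."""
    folds = []
    for start in range(0, n_dates, test_period):
        stop = min(start + test_period, n_dates)
        test = list(range(start, stop))
        # Training data before the window, minus the embargo gap
        train_a = list(range(0, max(0, start - embargo)))
        # Training data after the window, minus the embargo gap
        train_b = list(range(min(stop + embargo, n_dates), n_dates))
        folds.append((test, train_a + train_b))
    return folds


# Toy sizes, chosen only for illustration
toy_folds = rolling_folds(n_dates=20, test_period=5, embargo=2)

# Property 1: every date is used for testing exactly once
tested = sorted(d for test, _ in toy_folds for d in test)
assert tested == list(range(20))

# Property 2: no training date lies within the embargo of the test window
for test, train in toy_folds:
    assert all(t < test[0] - 2 or t > test[-1] + 2 for t in train)

for test, train in toy_folds:
    print(f"test {test[0]}-{test[-1]}, train {train}")
```

The two assertions mirror the first two bullets: full test coverage and a gap of at least the embargo width between test and training dates.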

In [None]:
import numpy as np 
import pandas as pd
from jpx_feat_util import get_stock_features, create_features

prices = pd.read_csv('../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv')

In [None]:
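# Drop 2020-10-01, the date of the full-day trading halt on the Tokyo Stock Exchange
prices = prices.drop(prices[prices.Date == '2020-10-01'].index)
# Keep only rows that have both an opening price and a target
prices = prices[prices.Open.notnull() & prices.Target.notnull()]
# Count the remaining missing values per column
prices.isna().sum()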

In [None]:
numUniqDates = prices['Date'].unique()
len(numUniqDates)

In [None]:
testPeriod, embargoPeriod = 130, 70  # test window and embargo gap, both in trading dates
conseqDates = sorted(numUniqDates)   # chronologically sorted list of unique dates
folds, curr = [], 0
while curr < len(conseqDates):
    testEnd = min(curr + testPeriod, len(conseqDates))
    trainAEnd = max(0, curr - embargoPeriod)
    trainBStart = min(curr + testPeriod + embargoPeriod, len(conseqDates))
    # Each fold is (test dates, train dates); the train dates skip the embargo on both sides
    folds.append((conseqDates[curr:testEnd], conseqDates[:trainAEnd] + conseqDates[trainBStart:]))
    print(f'test interval: {curr}:{testEnd}, train interval A: [:{trainAEnd}], train interval B: [{trainBStart}:]')
    curr += testPeriod

In [None]:
%%time
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
feats = create_features(prices)
feats.head()

In [None]:
feats.isna().sum()

In [None]:
# Split the feature table by date and save one train/test pickle pair per fold
for i, (test, train) in enumerate(folds):
    print(f'Creating fold {i}..')
    feats[feats['Date'].isin(frozenset(train))].to_pickle(f"train{i}.pkl")
    feats[feats['Date'].isin(frozenset(test))].to_pickle(f"test{i}.pkl")
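
A downstream notebook could read the fold files back along the following lines. This is a minimal sketch, not the code of the training notebook linked above: `load_folds` is a hypothetical helper, the folder is a temporary directory standing in for wherever the output of this notebook is mounted, and the toy frame only has `Date` and `Target` columns.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_folds(fold_dir):
    """Hypothetical helper: yield (fold index, train frame, test frame)
    for every train{i}.pkl / test{i}.pkl pair found in fold_dir."""
    fold_dir = Path(fold_dir)
    i = 0
    while (fold_dir / f"train{i}.pkl").exists() and (fold_dir / f"test{i}.pkl").exists():
        train = pd.read_pickle(fold_dir / f"train{i}.pkl")
        test = pd.read_pickle(fold_dir / f"test{i}.pkl")
        yield i, train, test
        i += 1


# Toy demonstration: write one fold pair to a temporary folder and read it back
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame({
        "Date": ["2021-01-04", "2021-01-05", "2021-01-06"],
        "Target": [0.1, -0.2, 0.3],
    })
    toy[toy["Date"] != "2021-01-05"].to_pickle(Path(tmp) / "train0.pkl")
    toy[toy["Date"] == "2021-01-05"].to_pickle(Path(tmp) / "test0.pkl")
    loaded = [(i, len(train), len(test)) for i, train, test in load_folds(tmp)]
```

Iterating until a file pair is missing avoids hard-coding the number of folds, which depends on the number of unique dates and on `testPeriod`.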

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>CREDITS</b></p>
</div>

The modular structure used here is influenced by [this notebook](https://www.kaggle.com/code/slawekbiel/short-fast-nn-lgb) by [Slawek Biel](https://www.kaggle.com/slawekbiel).

The layout used in the markdown cells of this notebook is influenced by [this notebook](https://www.kaggle.com/code/shtrausslearning/building-an-asset-trading-strategy) by [Andrey Shtrauss](https://www.kaggle.com/shtrausslearning).

The CV strategy has been influenced by the "Combinatorial Purged Cross-Validation" (CPCV) technique for time series described in Marcos Lopez de Prado's "Advances in Financial Machine Learning" book (p. 163).