# JPX TSE Minimal CatBoost Baseline
This notebook treats the challenge as a regression problem on the `Target` column in order to implement a minimal training and inference baseline with CatBoost.

Thanks to [Lonnie](https://www.kaggle.com/lonnieqin) and [swimmy](https://www.kaggle.com/swimmy) for the following notebooks, which were very helpful:
* [https://www.kaggle.com/code/lonnieqin/tokyo-stock-market-prediction-with-catboost-v2](https://www.kaggle.com/code/lonnieqin/tokyo-stock-market-prediction-with-catboost-v2)
* [https://www.kaggle.com/code/swimmy/lgbm-opt-model-jpx/notebook](https://www.kaggle.com/code/swimmy/lgbm-opt-model-jpx/notebook)


In [None]:
import catboost as cb
import jpx_tokyo_market_prediction
import numpy as np
import os
import pandas as pd

In [None]:
class CFG:
    TRAIN_PERCENTAGE = 0.8

Load & process the data

To simplify matters,
- drop all rows where 'AdjustmentFactor' is not equal to 1.0
- only use 'SecuritiesCode', 'Open', 'High', 'Low', 'Close' and 'Volume' from *stock_prices.csv* as features
- drop all NA rows

In [None]:
path = '../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv'
cols = ['RowId', 'Date', 'SecuritiesCode', 'Open', 'High', 'Low', 'Close', 'Volume', 'AdjustmentFactor', 'ExpectedDividend', 'SupervisionFlag', 'Target']
feat_cols = ['SecuritiesCode', 'Open', 'High', 'Low', 'Close', 'Volume']
all_prices_df = pd.read_csv(path)
all_prices_df['Date'] = pd.to_datetime(all_prices_df['Date'])
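# Keep only rows without a price adjustment (AdjustmentFactor == 1.0)
all_prices_df = all_prices_df[all_prices_df['AdjustmentFactor'] == 1.0]
# Keep the date, the features and the target, then drop rows with missing values
# (reassigning instead of using inplace=True avoids a SettingWithCopyWarning)
all_prices_df = all_prices_df[['Date'] + feat_cols + ['Target']].dropna(axis=0)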
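
As a quick illustration of the three simplifications listed above, here is a minimal sketch on a made-up four-row frame. The values and the name `toy` are hypothetical and only serve to show which rows survive the filtering.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one adjusted row and one row with a missing price
toy = pd.DataFrame({
    'SecuritiesCode': [1301, 1301, 1332, 1332],
    'Close': [100.0, 50.0, np.nan, 210.0],
    'AdjustmentFactor': [1.0, 0.5, 1.0, 1.0],
    'Target': [0.01, 0.02, 0.00, -0.01],
})

# Keep unadjusted rows, drop the helper column, then drop rows with NA values
kept = toy[toy['AdjustmentFactor'] == 1.0].drop(columns='AdjustmentFactor').dropna()
print(kept)
```

Only the first and last rows remain: the second row has an adjustment factor of 0.5 and the third has a missing `Close`.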

Split the data into training and test sets by date, so that no trading day appears in both sets, to avoid information leakage

In [None]:
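# Sort chronologically so that a row position corresponds to a point in time
# (a stable sort leaves the order unchanged if the rows are already sorted by date)
all_prices_df = all_prices_df.sort_values('Date', kind='stable').reset_index(drop=True)
train_rows = int(len(all_prices_df.index)*CFG.TRAIN_PERCENTAGE)
# Every row dated on or after this day goes to the test set
test_start_date = all_prices_df['Date'].iloc[train_rows]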

train_df = all_prices_df[all_prices_df['Date'] < test_start_date]
test_df = all_prices_df[all_prices_df['Date'] >= test_start_date]

X_train, y_train = train_df[feat_cols], train_df['Target']
X_test, y_test = test_df[feat_cols], test_df['Target']

train_dataset = cb.Pool(X_train, y_train, cat_features=['SecuritiesCode'])
test_dataset = cb.Pool(X_test, y_test, cat_features=['SecuritiesCode'])
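
The date-based split can be checked on a small example. The sketch below uses a made-up frame (`toy`, five trading days with two hypothetical securities each) and assumes the rows are in chronological order; it only demonstrates that the two sets share no date.

```python
import pandas as pd

# Hypothetical toy data: 5 trading days, 2 securities per day
dates = pd.to_datetime(['2021-01-04', '2021-01-05', '2021-01-06',
                        '2021-01-07', '2021-01-08']).repeat(2)
toy = pd.DataFrame({
    'Date': dates,
    'SecuritiesCode': [1301, 1332] * 5,
    'Target': [0.01, -0.02, 0.00, 0.03, -0.01, 0.02, 0.01, 0.00, -0.03, 0.01],
})

split_row = int(len(toy) * 0.8)            # row position of the 80% mark
split_date = toy['Date'].iloc[split_row]   # date found at that position

toy_train = toy[toy['Date'] < split_date]
toy_test = toy[toy['Date'] >= split_date]

# No trading day is shared between the two sets
assert toy_train['Date'].max() < toy_test['Date'].min()
print(len(toy_train), len(toy_test))
```

Because all rows of the boundary date go to the test set, the training share can end up slightly below `TRAIN_PERCENTAGE` when the 80% mark falls in the middle of a trading day.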

Train a CatBoostRegressor on GPU with the RMSE loss function, random seed 42 and otherwise default parameters

In [None]:
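model = cb.CatBoostRegressor(loss_function='RMSE', random_seed=42, task_type='GPU')
# Pass the held-out pool as eval_set so that a best iteration and a validation score are reported;
# use_best_model=False keeps all trees, as when training without an eval_set
model.fit(train_dataset, eval_set=test_dataset, use_best_model=False, verbose=False)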
print(model.get_all_params())
print(model.get_best_iteration())
print(model.get_best_score())

Apply **argsort** from numpy twice to turn the predictions into ranks, so that the stock with the highest predicted target gets rank 0 (reference: [https://stackoverflow.com/a/6266510](https://stackoverflow.com/a/6266510))

In [None]:
env = jpx_tokyo_market_prediction.make_env()   # initialize the environment
iter_test = env.iter_test()    # an iterator which loops over the test files
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    y_pred = model.predict(prices[feat_cols]).reshape(-1)
    ranks = (-1*y_pred).argsort().argsort()
    sample_prediction['Rank'] = ranks
    env.predict(sample_prediction)
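
To see why the double **argsort** yields ranks, here is a minimal sketch on four made-up predictions (the values are hypothetical). The first `argsort` returns the indices that would sort the predictions in descending order; the second converts those indices into the position of each element in that order.

```python
import numpy as np

# Hypothetical predictions for four stocks
y_pred = np.array([0.1, 0.5, -0.2, 0.3])

order = (-y_pred).argsort()   # indices sorted by descending prediction: [1, 3, 0, 2]
ranks = order.argsort()       # rank of each stock, 0 = highest prediction: [2, 0, 3, 1]

print(order, ranks)
```

The stock at index 1 has the highest prediction and therefore receives rank 0, while the stock at index 2 has the lowest prediction and receives rank 3.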