# Overview

**My goal with this notebook was to create baseline models without overfitting to the leaderboard.**

In this notebook, I demonstrate how to build a baseline deep neural network (DNN) and a LightGBM (LGBM) model. Neither model is hyperparameter-tuned or otherwise optimized, so both leave plenty of room for improvement.

### Features

I used the open, high, low, and close price features of each stock. I dropped the ExpectedDividend column and then removed any rows that still contained null values. I also used the SecuritiesCode feature so the models can learn whether one stock is generally better or worse than the others, and I applied an ordinal encoder to it.
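
As a small illustration of what the ordinal encoding does, here is a minimal sketch on a toy frame. The security codes and the `enc_demo` and `code_idx` names are made up for the example. The encoder sorts the distinct codes and maps them to 0, 1, 2, and so on. The sketch also shows the usual way to reuse an already fitted encoder on new rows.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame; the codes below are illustrative values, not taken from the dataset.
toy = pd.DataFrame({"SecuritiesCode": [1301, 9997, 1332, 1301]})

enc_demo = OrdinalEncoder()
# fit_transform returns a 2-D float array, so flatten it before assigning.
toy["code_idx"] = enc_demo.fit_transform(toy[["SecuritiesCode"]]).ravel().astype(int)
print(toy)

# New rows reuse the fitted mapping via transform (no re-fitting).
new = pd.DataFrame({"SecuritiesCode": [1332, 9997]})
new_idx = enc_demo.transform(new[["SecuritiesCode"]]).ravel().astype(int)
print(new_idx)
```

Because the categories are sorted before indexing, the same code always maps to the same integer as long as the fitted encoder is reused.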

I did not use the date feature in any way, because the final leaderboard is scored on new days that do not appear in the training data.

### Cross Validation

I binned the target value according to Sturges' rule and applied StratifiedKFold to the binned column. This is a common technique for regression-style problems, because it keeps the target distribution similar across folds. I used 5 splits for this notebook.
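
The binning idea can be tried on synthetic data first. The sketch below is only an illustration: the random `Target` values and the `toy` frame are assumptions, not the competition data. Sturges' rule gives k = floor(1 + log2(n)) bins, and the bin index is then used as the stratification label.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real target column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Target": rng.normal(0, 0.02, size=1000)})

# Sturges' rule: k = floor(1 + log2(n)).
num_bins = int(np.floor(1 + np.log2(len(toy))))
toy["bins"] = pd.cut(toy["Target"], bins=num_bins, labels=False)

# Stratify on the bin index so every fold covers the whole target range.
toy["fold"] = -1
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, valid_idx) in enumerate(skf.split(toy, toy["bins"])):
    toy.loc[valid_idx, "fold"] = fold

print(num_bins)
print(toy.groupby("fold")["Target"].agg(["size", "mean", "std"]))
```

Sparse bins in the tails can trigger a scikit-learn warning about classes with fewer members than splits. That is expected with equal-width bins and does not stop the split.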

### Models

For the LGBM, I simply used the default parameters.

DNN structure:
- Price branch: input open, high, low, and close, then batch normalization and a dense layer
- Security branch: input SecuritiesCode, then an embedding layer and a dense layer
- Concatenate the two branches
- Dense blocks (dropout, batch norm, and dense), x3
- Output

The DNN is trained for 15 epochs with the Adam optimizer (learning rate 0.001). It minimizes mean squared error (MSE) with a batch size of 128.
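
To make the two-branch structure concrete, here is a rough shape walkthrough in plain NumPy. It is not the Keras model itself: the embedding is imitated by a random lookup table, the dense layers are replaced by random stand-ins, and `n_codes = 2000` is an assumed vocabulary size.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_codes, emb_dim = 8, 2000, 32  # n_codes is an assumption for this sketch

# An embedding layer behaves like a trainable lookup table, one row per code.
embedding_table = rng.normal(size=(n_codes, emb_dim))
code_ids = rng.integers(0, n_codes, size=(batch, 1))  # shape (batch, 1)

embedded = embedding_table[code_ids]      # shape (batch, 1, emb_dim)
embedded = embedded.reshape(batch, -1)    # shape (batch, emb_dim), like Reshape((-1,))

# Stand-in for the 64-unit dense output of the price branch.
price_branch = rng.normal(size=(batch, 64))

# Concatenating along the feature axis gives 32 + 64 = 96 features per row.
merged = np.concatenate([embedded, price_branch], axis=1)
print(embedded.shape, price_branch.shape, merged.shape)
```

The 32-unit dense layer on the security branch keeps the width at 32, so the concatenated tensor that feeds the dense blocks has 96 features.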

### References
https://www.kaggle.com/code/realneuralnetwork/jpx-lgbm-model-overfitting-high-score by Kabir Ivan - notebook demonstrating how to use an LGBM for this competition; it includes SecuritiesCode and date features and produces a high public leaderboard score by overfitting

https://www.kaggle.com/code/lonnieqin/ubiquant-market-prediction-with-dnn/notebook by Lonnie - good DNN starter reference made for a different market prediction challenge

# Preprocessing
### Imports

In [None]:
import numpy as np
import pandas as pd
import jpx_tokyo_market_prediction

from lightgbm import LGBMRegressor
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OrdinalEncoder

import tensorflow as tf
import tensorflow.keras.layers as L
import tensorflow.keras.models as M
import tensorflow.keras.backend as K
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, EarlyStopping

import warnings
warnings.filterwarnings("ignore")

In [None]:
prices = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv")

### Handle Nulls

In [None]:
prices = prices.drop("ExpectedDividend", axis=1)
prices = prices.dropna()
prices.isnull().sum()

### Cross Validation Split

In [None]:
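def setup_cv(df, splits=5):
    # Shuffle the rows; sample() returns a copy, so the caller's frame is untouched.
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)
    df["fold"] = -1

    # Sturges' rule for the number of bins: k = floor(1 + log2(n)).
    num_bins = int(np.floor(1 + np.log2(len(df))))
    df["bins"] = pd.cut(df["Target"], bins=num_bins, labels=False)

    # Stratify on the binned target and record each row's validation fold.
    kf = StratifiedKFold(n_splits=splits)
    for f, (t_, v_) in enumerate(kf.split(X=df, y=df["bins"].values)):
        df.loc[v_, "fold"] = f

    return df.drop("bins", axis=1)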

In [None]:
prices = setup_cv(prices)

### Ordinal Encode Securities Code

In [None]:
enc = OrdinalEncoder()
prices["SecuritiesCode"] = enc.fit_transform(prices[["SecuritiesCode"]])

# Train Models

### LGBM

In [None]:
def train_lgbm(prices, folds):
    models = list()
    
    for f in range(folds):
        X_train = prices[prices.fold != f][["SecuritiesCode", "Open", "High", "Low", "Close"]]
        y_train = prices[prices.fold != f][["Target"]]
        X_valid = prices[prices.fold == f][["SecuritiesCode", "Open", "High", "Low", "Close"]]
        y_valid = prices[prices.fold == f][["Target"]]
        
        model = LGBMRegressor()
        model.fit(X_train, y_train)
        oof_preds = model.predict(X_valid)
        oof_score = np.sqrt(mean_squared_error(y_valid, oof_preds))
        print(oof_score)
        models.append(model)
        
    return models

### Deep Neural Network

In [None]:
codes = list(prices.SecuritiesCode.unique())
codes_size = len(codes)

def dense_block(x, units, act='swish', dr=0.2):
    x = L.Dropout(dr)(x)
    x = L.BatchNormalization()(x)
    x = L.Dense(units, activation=act)(x)
    return x

def get_dnn(dense_blocks):
    prices_in = L.Input(shape=(4,), name='input_prices')
    x_prices = L.BatchNormalization()(prices_in)
    x_prices = L.Dense(64, activation='swish')(x_prices)
    
    security_code_input = L.Input(shape=(1,), name='input_security_code')
    x_id = L.Embedding(codes_size, 32, input_length=1)(security_code_input)
    x_id = L.Reshape((-1, ))(x_id)
    x_id = L.Dense(32, activation='swish')(x_id)

    x = L.Concatenate(axis=1)([x_id, x_prices])
    
    for units in dense_blocks:
        x = dense_block(x, units)
    
    output = L.Dense(1)(x)
    
    model = M.Model([prices_in, security_code_input], 
                    [output])

    # `lr` is a deprecated alias; current Keras versions expect `learning_rate`.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss='mse', metrics=['mse'])
    
    return model
    
def train_dnn(prices, folds):
    models = list()
    
    for f in range(folds):
        X_train_prices = prices[prices.fold != f][["Open", "High", "Low", "Close"]]
        X_train_id = prices[prices.fold != f][["SecuritiesCode"]]
        y_train = prices[prices.fold != f][["Target"]]
        X_valid_prices = prices[prices.fold == f][["Open", "High", "Low", "Close"]]
        X_valid_id = prices[prices.fold == f][["SecuritiesCode"]]
        y_valid = prices[prices.fold == f][["Target"]]

        model = get_dnn([128, 64, 32])
        model.fit([X_train_prices, X_train_id], y_train,
                   validation_data=([X_valid_prices, X_valid_id], y_valid),
                   batch_size=128, epochs=15, verbose=0)

        oof_preds = model.predict([X_valid_prices, X_valid_id])
        oof_score = np.sqrt(mean_squared_error(y_valid, oof_preds))
        print(oof_score)
        models.append(model)
        # Stop after the first fold to keep training fast; remove this break to train all folds.
        break
    
    return models

### Run - prints RMSE for each fold

In [None]:
lgbm_models = train_lgbm(prices, 5)

In [None]:
dnn_models = train_dnn(prices, 5)

# Make Predictions & Submit

In [None]:
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()

for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    # Reuse the encoder fitted on the training data; re-fitting here could change the code mapping.
    prices["SecuritiesCode"] = enc.transform(prices[["SecuritiesCode"]])
    
    X_test = prices[["SecuritiesCode", "Open", "High", "Low", "Close"]]
    lgbm_preds = list()
    for model in lgbm_models:
        lgbm_preds.append( model.predict(X_test) )
    lgbm_preds = np.mean(lgbm_preds, axis=0)
    
    X_test_prices = prices[["Open", "High", "Low", "Close"]]
    X_test_id = prices[["SecuritiesCode"]]
    dnn_preds = list()
    for model in dnn_models:
        dnn_preds.append( model.predict([X_test_prices, X_test_id]) )
    # Average over models, then flatten (n, 1) -> (n,) so it lines up with lgbm_preds.
    dnn_preds = np.mean(dnn_preds, axis=0).ravel()
    
    sample_prediction["Prediction"] = lgbm_preds*0.8 + dnn_preds*0.2
    
    # Rank 0 goes to the highest predicted target.
    sample_prediction = sample_prediction.sort_values(by="Prediction", ascending=False)
    sample_prediction["Rank"] = np.arange(len(sample_prediction))
    sample_prediction = sample_prediction.sort_values(by="SecuritiesCode", ascending=True)
    submission = sample_prediction[["Date", "SecuritiesCode", "Rank"]]
    env.predict(submission)
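
The blend-and-rank step in the loop above can be checked in isolation on a handful of rows. The sketch below uses made-up codes and prediction values, and it uses `Series.rank` as an alternative to sorting. Keras `predict` returns an array of shape (n, 1), so the sketch flattens it before blending.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions for five securities (illustrative values only).
sub = pd.DataFrame({"SecuritiesCode": [1301, 1332, 1333, 1375, 1376]})
lgbm_preds = np.array([0.010, -0.020, 0.030, 0.000, 0.005])           # shape (n,)
dnn_preds = np.array([[0.02], [0.01], [-0.01], [0.00], [0.03]])       # shape (n, 1)

# Flatten the DNN output so both arrays have shape (n,) before blending.
sub["Prediction"] = 0.8 * lgbm_preds + 0.2 * dnn_preds.ravel()

# Highest prediction gets rank 0; method="first" breaks ties by row order,
# so the ranks are always a permutation of 0..n-1.
sub["Rank"] = sub["Prediction"].rank(ascending=False, method="first").astype(int) - 1
print(sub)
```

Ranking in place like this avoids the two sorts, and it keeps the rows in their original order.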

In [None]:
pd.read_csv("./submission.csv")