In this starter notebook, I build a basic neural network with Keras for the JPX challenge.  
This notebook does not aim for a high score; my purpose is to set a starting point for deeper analysis and model experiments. Doing so also lets me get familiar with the competition dataset and practice what I have learned about deep learning and neural networks.  
Hope everyone enjoys this competition 😄

# Setup

In [None]:
import os
import numpy as np
import pandas as pd
import random
import jpx_tokyo_market_prediction
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [None]:
# Set random seed
seed = 30
random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
np.random.seed(seed)
tf.random.set_seed(seed)

# Data Prep

## Dataset choice

To keep things simple, I will only use the data in *stock_prices.csv* to train the model.

In [None]:
df = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv")
df

## Train & valid split

In [None]:
df.Date.describe()

The raw data table has 2332531 rows. The *Date* column starts at *2017-01-04* and ends at *2021-12-03*.

In [None]:
print("Start date: {}, end date: {}".format(df.Date.min(), df.Date.max()))

I want to avoid a complicated time-series split or CV strategy, so let's use the 2021 data for validation and the rest of the data for training.

In [None]:
df_train = df[df['Date'] < '2021-01-01'].copy()
df_train.shape

In [None]:
df_valid = df[df['Date'] >= '2021-01-01'].copy()
df_valid.shape

About 20% of the data will go to validation, which is reasonable.

In [None]:
df_valid.shape[0] / df.shape[0] * 100
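
The date-based split above can be sketched on a tiny made-up frame. This is only an illustration: `toy` and `split_by_date` are hypothetical names that do not appear in the notebook, and the sketch assumes the dates are ISO-formatted strings (as in *stock_prices.csv*), so plain string comparison orders them correctly.

```python
import pandas as pd

# Toy frame standing in for stock_prices.csv (values are made up)
toy = pd.DataFrame({
    "Date": ["2020-12-29", "2020-12-30", "2021-01-04", "2021-01-05"],
    "Close": [100.0, 101.0, 102.0, 103.0],
})

def split_by_date(frame, cutoff):
    """Rows strictly before `cutoff` go to training, the rest to validation."""
    is_train = frame["Date"] < cutoff
    return frame[is_train].copy(), frame[~is_train].copy()

toy_train, toy_valid = split_by_date(toy, "2021-01-01")

# Share of rows that ended up in the validation part, in percent
valid_share = len(toy_valid) / len(toy) * 100
print(len(toy_train), len(toy_valid), valid_share)
```

Because the validation rows are all later in time than the training rows, the model is never evaluated on dates it has already seen.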

## Feature selection

For this toy model, we will use **Open**, **High**, **Low**, **Close** and **Volume** as five numerical predictor features, and ignore other information such as Date, SecuritiesCode, etc.

In [None]:
num_features = ['Open', 'High', 'Low', 'Close', 'Volume']
target = ['Target']
df_train = df_train[num_features + target].reset_index(drop=True).copy()
df_valid = df_valid[num_features + target].reset_index(drop=True).copy()
df_valid.head()

There are some missing values in these columns; this time I simply drop the affected rows.

In [None]:
df.isnull().sum()

In [None]:
df_train.dropna(subset=num_features + target, axis=0, inplace=True)
df_valid.dropna(subset=num_features + target, axis=0, inplace=True)

In [None]:
df_train.isnull().sum() + df_valid.isnull().sum()

Looks good.

## Preprocessing

The data preprocessing part mainly includes two operations:
* normalizing the features
* creating a TensorFlow dataset

The official Keras documentation provides a great example:
https://keras.io/examples/structured_data/structured_data_classification_from_scratch/#preparing-the-data  
For each of the continuous numerical features, we will use the Keras Normalization layer so that each feature has a mean of 0 and a standard deviation of 1.

In [None]:
# Define encoding function for numerical features
def encode_numerical_feature(feature, name, dataset):
    # Create a Normalization layer for our feature
    normalizer = layers.Normalization()

    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))

    # Learn the statistics of the data
    normalizer.adapt(feature_ds)

    # Normalize the input feature
    encoded_feature = normalizer(feature)
    return encoded_feature
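
As a rough picture of what the normalization step does, here is a minimal NumPy sketch: the statistics are learned from the training values only and then reused for other data. The helper names `fit_stats` and `apply_stats` and all numbers are made up for illustration; this is not the Keras layer itself, only the `(x - mean) / std` arithmetic it is based on.

```python
import numpy as np

def fit_stats(train_values):
    """Learn mean and (population) standard deviation from training values."""
    train_values = np.asarray(train_values, dtype="float64")
    return train_values.mean(), train_values.std()

def apply_stats(values, mean, std):
    """Standardize values with previously learned statistics."""
    return (np.asarray(values, dtype="float64") - mean) / std

train_close = [100.0, 200.0, 300.0, 400.0]   # made-up training values
valid_close = [250.0, 500.0]                 # made-up validation values

mean, std = fit_stats(train_close)
train_scaled = apply_stats(train_close, mean, std)
valid_scaled = apply_stats(valid_close, mean, std)

print(train_scaled.mean(), train_scaled.std())  # close to 0 and 1
print(valid_scaled)                             # scaled with the training statistics
```

Only the training data is guaranteed to end up with mean 0 and standard deviation 1; validation and test data are shifted and scaled by the same training statistics.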

In [None]:
# Generate tensorflow dataset
def dataframe_to_dataset(dataframe):
    dataframe = dataframe.copy()
    labels = dataframe.pop("Target")
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    ds = ds.shuffle(buffer_size=len(dataframe))
    return ds

train_ds = dataframe_to_dataset(df_train)
valid_ds = dataframe_to_dataset(df_valid)

Each Dataset yields a tuple (input, target), where input is a dictionary of features keyed by column name.

In [None]:
for x, y in train_ds.take(1):
    print("Input:", x)
    print("Target:", y)

In [None]:
# Batch the dataset
train_ds = train_ds.batch(1024)
valid_ds = valid_ds.batch(1024)

# Build Model

First we define the input layers of our NN model, then encode them.

In [None]:
%%time
# Raw numerical features
Open = keras.Input(shape=(1,), name="Open")
High = keras.Input(shape=(1,), name="High")
Low = keras.Input(shape=(1,), name="Low")
Close = keras.Input(shape=(1,), name="Close")
Volume = keras.Input(shape=(1,), name="Volume")

all_inputs = [Open, High, Low, Close, Volume]

# Encode numerical features
open_encoded = encode_numerical_feature(Open, "Open", train_ds)
high_encoded = encode_numerical_feature(High, "High", train_ds)
low_encoded = encode_numerical_feature(Low, "Low", train_ds)
close_encoded = encode_numerical_feature(Close, "Close", train_ds)
volume_encoded = encode_numerical_feature(Volume, "Volume", train_ds)

The code block above takes a while to run, because each `adapt` call iterates over the training dataset. After that, we can concatenate the encoded inputs and connect them to several hidden Dense layers.

In [None]:
# Concat all features of input layer
all_features = layers.concatenate(
    [
        open_encoded,
        high_encoded,
        low_encoded,
        close_encoded,
        volume_encoded,
    ]
)

# Add several hidden layers with batch_norm and dropout
x = layers.Dense(128, activation="relu")(all_features)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(64, activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(32, activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.2)(x)

# Output layer for regression task
output = layers.Dense(1, activation="linear")(x)

# Create our NN model
model = keras.Model(all_inputs, output)
model.compile(
    optimizer='adam', 
    loss="mse", 
    metrics=[tf.keras.metrics.RootMeanSquaredError()]
)
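
The model is compiled with an MSE loss and reports RMSE as a metric. As a small worked example of how the two relate, the sketch below computes both by hand with NumPy; the target and prediction values are made up and are not taken from the competition data.

```python
import numpy as np

# Made-up targets and a trivial all-zero prediction
y_true = np.array([0.01, -0.02, 0.00, 0.03])
y_pred = np.zeros_like(y_true)

# Mean squared error: average of the squared differences
mse = np.mean((y_true - y_pred) ** 2)

# Root mean squared error: square root of the MSE, in the same units as the target
rmse = np.sqrt(mse)

print(mse, rmse)
```

RMSE is easier to read than MSE because it is on the same scale as the target itself.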

In [None]:
# NN model structure
model.summary()

In [None]:
# Model visualization
keras.utils.plot_model(model, show_shapes=True, expand_nested=True)

# Train Model

Before we start training the model, we set an early-stopping callback: if the validation loss does not improve for a given number of epochs, training stops and the best model weights are restored.

In [None]:
# Set early_stopping callbacks, if val_loss does not improve for 10 epochs, stop training and restore best model weights
early_stopping = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,
    min_delta=1e-3,
    restore_best_weights=True,
)
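
To make the roles of `patience` and `min_delta` concrete, here is a simplified pure-Python sketch of the stopping rule. It is an illustration under my own assumptions, not Keras's actual implementation; `simulate_early_stopping` is a hypothetical helper and the loss values are made up.

```python
def simulate_early_stopping(val_losses, patience, min_delta):
    """Return (stopped_epoch, best_epoch) for a list of per-epoch validation losses.

    Simplified illustration only; this is not Keras's actual implementation.
    """
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        # A loss counts as an improvement only if it beats the best by more than min_delta
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs without triggering the stop
    return None, best_epoch

losses = [1.0, 0.9, 0.8995, 0.8999, 0.8998, 0.85]  # made-up validation losses

# With min_delta=1e-3, the tiny gains after epoch 1 do not count as improvements
stopped_epoch, best_epoch = simulate_early_stopping(losses, patience=3, min_delta=1e-3)
print(stopped_epoch, best_epoch)

# With min_delta=0, any decrease counts, so training runs to the end
stopped_epoch_0, best_epoch_0 = simulate_early_stopping(losses, patience=3, min_delta=0.0)
print(stopped_epoch_0, best_epoch_0)
```

The comparison shows why `min_delta` should be chosen relative to the scale of the loss: if it is larger than the typical per-epoch gain, training can stop while the loss is still slowly decreasing.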

Now let's train the model.

In [None]:
# Model training 
model.fit(
    train_ds, 
    validation_data=valid_ds, 
    epochs=50, 
    callbacks=[early_stopping])

In [None]:
# Save model
model.save("spx_toy_model.h5")

# Submission

In [None]:
# Load trained model
best_model = keras.models.load_model("spx_toy_model.h5")

In [None]:
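# Generate tensorflow dataset for test data
def dataframe_to_dataset_test(dataframe):
    # Keep only the feature columns the model was trained on
    dataframe = dataframe[num_features].copy()
    ds = tf.data.Dataset.from_tensor_slices(dict(dataframe))
    # Batch the test data with the same batch size as in training
    ds = ds.batch(1024)
    return ds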

In [None]:
# Make predictions and submission
env = jpx_tokyo_market_prediction.make_env()   # initialize the environment
iter_test = env.iter_test()    # an iterator which loops over the test files
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    test_ds = dataframe_to_dataset_test(prices)
    # Flatten the model output to a 1-D array before storing it as a column
    sample_prediction['target_pred'] = best_model.predict(test_ds).ravel()
    # Highest predicted target gets Rank 0
    sample_prediction = sample_prediction.sort_values(by="target_pred", ascending=False)
    # Use the actual number of rows instead of a hard-coded 2000
    sample_prediction['Rank'] = np.arange(len(sample_prediction))
    sample_prediction = sample_prediction.sort_values(by="SecuritiesCode", ascending=True)
    sample_prediction.drop(['target_pred'], axis=1, inplace=True)
    display(sample_prediction)
    env.predict(sample_prediction)   # register your predictions
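
The ranking step in the loop above (sort by prediction, then number the rows) can also be sketched with `Series.rank` on a toy frame. The security codes and prediction values below are made up for illustration, `add_rank` is a hypothetical helper, and the sketch assumes there are no NaN predictions.

```python
import pandas as pd

# Toy predictions for a single day (all values are made up)
toy_pred = pd.DataFrame({
    "SecuritiesCode": [1301, 1332, 1333, 1375],
    "target_pred": [0.1, 0.5, -0.2, 0.3],
})

def add_rank(frame, score_col="target_pred"):
    """Assign Rank 0 to the highest score, 1 to the next, and so on."""
    out = frame.copy()
    # method="first" breaks ties by row order, so every rank is unique
    out["Rank"] = out[score_col].rank(ascending=False, method="first").astype(int) - 1
    return out

ranked = add_rank(toy_pred)
print(ranked)
```

This gives each security a unique rank from 0 to n-1 without reordering the rows, so the frame stays in its original SecuritiesCode order.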

# Reference
There are other good starter notebooks about NNs, such as:  
* Ravi trained both NN and LGBM baseline models with cross validation in https://www.kaggle.com/code/ravishah1/jpx-dnn-lgbm-with-cross-validation. His NN model design is a little different: it uses SecuritiesCode as an input feature but does not use the Volume column.