# Jane Street Market Prediction - A Multi-layer Perceptron
*This notebook is a response to the problem posed by the "Jane Street Market Prediction" Kaggle Competition (Nov 2020 - Feb 2021).*

The application of Deep Learning to financial markets has always been one of the hot topics of the field. The Jane Street Market Prediction competition challenges us to create a quantitative trading model, one that utilizes real-time market data to help make trading decisions and maximise returns.

### Framing the Problem

The goal of the model is to **predict whether it is better to make a trade or pass on it** at a certain point in time, given an anonymized set of features representing stock market data at that point.

This is a **Multi-layer Perceptron (MLP)** model. With 131 input columns in the dataset (`weight` plus `feature_0` to `feature_129`), a basic MLP should have reasonable performance despite its simplicity and inability to take time into account. After the poor performance of the LSTM model, I decided it would be best to avoid looking back through the data and return to the basics.

Below, I go through the preparation of data, model creation and finally prediction.

## 1. Cleaning the Dataset

We first have to import the dataset from Kaggle.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datatable

# datatable reads large csv files faster than pandas
train_df = datatable.fread('/kaggle/input/jane-street-market-prediction/train.csv').to_pandas()

Let's inspect the data:

In [None]:
print(train_df.info())
train_df.head()

The `date` is the day on which the trading opportunity occurs. It ranges from day 0 to day 499.

The `weight` and `resp` together represent the value of each trade. `resp_1` to `resp_4` are 'resp' values over different time horizons. **The five 'resp' values will be the dependent variables, and hence the targets of prediction.**

`feature_0` to `feature_129` represent stock market data.

The `ts_id` is the index of each row. It is the number representing the time of the trading opportunity.

### Dealing with NaN entries

Right away we see that we will have to deal with numerous NaN entries, as seen in feature_121. Let's dig a little deeper:

In [None]:
isna_df = train_df.isnull().sum()
isna_df[isna_df > 0]

In [None]:
print("Max:", isna_df.max())
print(isna_df[isna_df == isna_df.max()])
isna_df.max()/train_df.size

We can see that there are 88 columns with NaN entries, with a maximum of 395535 NaN entries in a single column. However, this is only about 0.1% of all cells in the dataset, so it should be okay to fill in the NaN entries.
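
Note that the ratio above divides by `train_df.size`, which counts every cell in the frame rather than the rows of a single column. A per-column ratio can be a useful complement. The sketch below shows both on a small made-up frame (`toy_df` is hypothetical and only stands in for the competition data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_df
toy_df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
})

nan_counts = toy_df.isnull().sum()

# NaNs of the worst column as a share of ALL cells in the frame
share_of_all_cells = nan_counts.max() / toy_df.size

# NaNs as a share of the rows within each column
share_per_column = toy_df.isnull().mean()

print(share_of_all_cells)
print(share_per_column)
```

The two numbers answer different questions, so a column can look negligible against the whole frame while still missing a large share of its own rows.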

An analysis by Tom Warrens strongly suggests that most NaN values occur at the start of the day and during midday, which corresponds to the market opening and lunch breaks. With this information, it makes sense to fill in the NaN values with the last valid observation.

However, this only holds true if the data is at least generally continuous. Carl McBride's Day 0 Exploratory Data Analysis workbook shows that this is not always the case: `feature_41` to `feature_45` consist of discrete values. For these features, it makes more sense to fill in NaN values with the mean.

*Tom Warrens' analysis can be found here: https://www.kaggle.com/tomwarrens/nan-values-depending-on-time-of-day*

*Carl McBride's Day 0 EDA can be found here: https://www.kaggle.com/carlmcbrideellis/jane-street-eda-of-day-0-and-feature-importance*

In [None]:
discrete_features = ['feature_41', 'feature_42', 'feature_43', 'feature_44', 'feature_45']

isna_df = train_df[discrete_features].isnull().sum()
isna_df[isna_df > 0]

Since there are discrete features with NaN entries, we need to take two different approaches to filling in the data: forward-filling the continuous data and filling with mean for the discrete data.

We deal with the discrete data first.

In [None]:
train_df[discrete_features] = train_df[discrete_features].fillna(value=train_df[discrete_features].mean())

isna_df = train_df[discrete_features].isnull().sum()
isna_df[isna_df > 0]

Next, we can use forward-filling to fill the rest of the data.

In [None]:
# DataFrame.ffill is equivalent to fillna(method="ffill"), which newer pandas versions deprecate
train_df.ffill(inplace=True)
train_df.head()

In [None]:
isna_df = train_df.isnull().sum()
isna_df[isna_df > 0]

We can see that the number of NaN entries has been drastically reduced, but there are still many entries with NaN values. This is likely because many NaN values start at index 0 (as can be seen from feature_121 above) and hence do not have a last valid observation to fill from.

Although this is not ideal since in actual use we will not have future data on hand, for training purposes we can fill in the last few NaN entries with the next valid observation instead.

In [None]:
# DataFrame.bfill is equivalent to fillna(method="bfill"), which newer pandas versions deprecate
train_df.bfill(inplace=True)
train_df.head()

In [None]:
isna_df = train_df.isnull().sum()
isna_df[isna_df > 0].size
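
To see why all three filling steps are needed, here is a minimal sketch on a made-up frame. The column names `cont` and `disc` are hypothetical stand-ins for a continuous and a discrete feature:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "cont" is continuous, "disc" is discrete-valued
toy_df = pd.DataFrame({
    "cont": [np.nan, 1.0, np.nan, 3.0],
    "disc": [0.0, np.nan, 2.0, 4.0],
})

# Step 1: mean-fill the discrete column
toy_df["disc"] = toy_df["disc"].fillna(toy_df["disc"].mean())

# Step 2: forward-fill; the leading NaN in "cont" has no earlier value to copy
after_ffill = toy_df.ffill()

# Step 3: back-fill whatever is still missing at the start
filled = after_ffill.bfill()

print(after_ffill)
print(filled)
```

Forward-filling alone leaves the first row of `cont` empty, which mirrors what happens with the columns whose NaN values start at index 0.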

### Reducing Memory Usage

Before we continue, we should return to the memory usage of the dataset, as seen above. At 2.4GB, the training dataset takes up quite a lot of memory. Let's try to reduce the memory usage by optimizing the data types.

(Note: if done before we fill the NaN entries, the pandas.fillna method will not work)

In [None]:
def reduce_memory_usage(df):
    """Downcast each numeric column to the smallest dtype whose range holds its values."""
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != 'object':
            cmin = df[col].min()
            cmax = df[col].max()
            
            if str(col_type)[:3] == 'int':
                if cmin > np.iinfo(np.int8).min and cmax < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif cmin > np.iinfo(np.int16).min and cmax < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif cmin > np.iinfo(np.int32).min and cmax < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif cmin > np.iinfo(np.int64).min and cmax < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if cmin > np.finfo(np.float16).min and cmax < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif cmin > np.finfo(np.float32).min and cmax < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                    
        else:
            df[col] = df[col].astype('category')
            
    return df

train_df = reduce_memory_usage(train_df)
train_df.info()
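
One caveat worth keeping in mind: the range check above only guarantees that values fit inside `float16`, not that they survive the conversion unchanged. `float16` keeps roughly three significant decimal digits and flushes very small magnitudes to zero. The sketch below uses made-up values (not taken from the dataset) to compare the rounding error of the two downcasts:

```python
import numpy as np

# Made-up values: a small fraction, a large number, and a tiny magnitude
values = np.array([0.1234567, 1234.567, 1e-9], dtype=np.float64)

as_f16 = values.astype(np.float16)
as_f32 = values.astype(np.float32)

# Absolute rounding error introduced by each downcast
err_f16 = np.abs(as_f16.astype(np.float64) - values)
err_f32 = np.abs(as_f32.astype(np.float64) - values)

print(as_f16)
print(err_f16)
print(err_f32)
```

If this matters for a column (for example, a tiny positive `resp` could in principle round to zero), one option is to stop at `float32` for that column.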

### Re-indexing the Data

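Lastly, we should set the index of train_df to "ts_id". Since `set_index` returns a new DataFrame by default, we apply it in place. We also keep "ts_id" as a regular column (`drop=False`), because it is dropped together with the other non-feature columns in the next section.

In [None]:
train_df.set_index("ts_id", drop=False, inplace=True)
train_df.head()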

## 2. Transforming the Dataset

Now that the data is clean, we can start to prepare the data for the model. We first separate the features from our dependent variables, which are "resp" and the other "resp" values over the various time horizons. Each target is converted to 1 if the value is positive and 0 otherwise.

In [None]:
Y = (train_df[["resp", "resp_1", "resp_2", "resp_3", "resp_4"]] > 0).astype(int)
X = train_df.drop(["resp", "resp_1", "resp_2", "resp_3", "resp_4", "date", "ts_id"], axis=1)

print(X.head())
print(Y.head())
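
To make the target construction concrete, here is the same comparison applied to a made-up frame (`toy_resp` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical returns over two horizons
toy_resp = pd.DataFrame({
    "resp":   [0.02, -0.01, 0.00],
    "resp_1": [-0.03, 0.04, 0.01],
})

# Positive return -> 1 (trade), zero or negative return -> 0 (pass)
toy_Y = (toy_resp > 0).astype(int)

print(toy_Y)
```

Note that a return of exactly zero is labelled 0, since the comparison is strict.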

We split the data into training and validation data (10% of the data will be taken as validation).

In [None]:
from sklearn.model_selection import train_test_split

valid_ratio = 0.1 # 90% training data 10% validation data

train_X, valid_X, train_Y, valid_Y = train_test_split(X, Y, test_size=valid_ratio, random_state=42)

print(len(train_X.index))
print(len(valid_X.index))
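
`train_test_split` shuffles rows by default, so trades from the same day can end up on both sides of the split. One alternative, which this notebook does not use, is to hold out the final days instead. The following is only a sketch on a made-up frame; `toy_df`, `cutoff` and the 20% ratio are assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical toy frame with a "date" column like the one in train_df
toy_df = pd.DataFrame({
    "date":      [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    "feature_0": range(10),
})

valid_ratio = 0.2  # hold out the last 20% of days in this toy example

# Pick the first day that belongs to the validation period
days = sorted(toy_df["date"].unique())
n_valid_days = max(1, int(len(days) * valid_ratio))
cutoff = days[-n_valid_days]

toy_train = toy_df[toy_df["date"] < cutoff]
toy_valid = toy_df[toy_df["date"] >= cutoff]

print(len(toy_train), len(toy_valid))
```

With this kind of split, every validation row comes from a later day than every training row.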

## 3. Building and Training the Model

We will then start building the model. I use Keras to build an MLP model, using Adam as the optimizer, Binary-Crossentropy as the loss, and AUC-ROC and accuracy as the metrics.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_mlp(num_columns, num_labels, dense_units, dropout_rate, learning_rate, label_smoothing):
    inp = layers.Input(shape=(num_columns, ))
    x = layers.BatchNormalization()(inp)
    x = layers.Dropout(dropout_rate)(x)
        
    for j in range(len(dense_units)):
        x = layers.Dense(dense_units[j])(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation(tf.keras.activations.swish)(x)
        x = layers.Dropout(dropout_rate)(x)
        
    x = layers.Dense(num_labels)(x)
    out = layers.Activation("sigmoid")(x)
    
    model = keras.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss=keras.losses.BinaryCrossentropy(label_smoothing=label_smoothing),
                  metrics=['AUC', 'accuracy'])
    model.summary()  # summary() prints the table itself and returns None
    return model
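
The `label_smoothing` argument passed to the loss pulls the hard 0/1 targets slightly towards 0.5 before the cross-entropy is computed. Below is a minimal NumPy sketch of that idea. The helper names `smooth_labels` and `binary_crossentropy` are my own, and the code is an illustration rather than the Keras implementation:

```python
import numpy as np

def smooth_labels(y_true, label_smoothing):
    """Pull hard 0/1 labels towards 0.5 by the given amount."""
    return y_true * (1.0 - label_smoothing) + 0.5 * label_smoothing

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, with clipping for numerical safety."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))

# Made-up targets and predicted probabilities
y_true = np.array([1.0, 0.0])
y_pred = np.array([0.9, 0.2])

y_smooth = smooth_labels(y_true, label_smoothing=0.01)
hard_loss = binary_crossentropy(y_true, y_pred)
smooth_loss = binary_crossentropy(y_smooth, y_pred)

print(y_smooth)
print(hard_loss, smooth_loss)
```

With smoothing, a confident prediction is never rewarded with a loss of exactly zero, which discourages the network from pushing its outputs to the extremes.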

In [None]:
# Tuning attempt 3
num_epochs = 30

num_columns = len(train_X.columns)
num_labels = len(train_Y.columns)
dense_units = [128, 256, 128]
dropout_rate = 0.2
learning_rate = 0.001
label_smoothing = 0.01

# early stopping
callback = keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', patience=5, verbose=1)

mlp_model = build_mlp(num_columns, num_labels, dense_units, dropout_rate, learning_rate, label_smoothing)
mlp_model.fit(train_X, train_Y, validation_data=(valid_X, valid_Y), epochs=num_epochs, callbacks=[callback])
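
The early-stopping rule used above can be sketched in plain Python. This is a simplified illustration of the patience logic only: it ignores options such as `min_delta`, and `early_stopping_epoch` is a hypothetical helper, not a Keras function:

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter
            best = loss
            wait = 0
        else:
            # No improvement: count one more epoch of waiting
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: the best value is reached at epoch 2
toy_val_losses = [0.70, 0.69, 0.693, 0.692, 0.691, 0.695, 0.694, 0.68]
stop_epoch = early_stopping_epoch(toy_val_losses, patience=5)

print(stop_epoch)
```

In this toy run the final, lower loss is never reached because training stops first. Also note that with the default `restore_best_weights=False`, the model keeps the weights from the last epoch that ran rather than from the best one.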

## 4. Submission

Using the Jane Street Time-series API, we set up our notebook for submission to the competition.

In [None]:
import janestreet
env = janestreet.make_env() # initialize the environment
iter_test = env.iter_test() # an iterator which loops over the test set

threshold = 0.500

for (test_df, sample_prediction_df) in iter_test:
    test_df.drop('date', axis=1, inplace=True)
    
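    # each test_df holds a single row, so this compares one weight value
    if test_df['weight'].values[0] > 0:
        prediction = mlp_model.predict(test_df)
        # average the five predicted probabilities into a single score
        avg = prediction.mean()
        sample_prediction_df.action = 1 if avg > threshold else 0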
    else:
        sample_prediction_df.action = 0
    env.predict(sample_prediction_df)
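
The loop above passes each test row to the model as-is, so a NaN in the row would generally propagate into the prediction, and a NaN average fails the `avg > threshold` check. One way to carry the training-time forward fill into the loop is to remember the last valid value of each feature. The sketch below works on plain NumPy arrays; `fill_from_previous` and `decide_action` are hypothetical helpers (not part of the `janestreet` API), and starting from zeros is an arbitrary assumption:

```python
import numpy as np

def fill_from_previous(row, last_valid):
    """Replace NaNs in `row` with the most recent valid value per feature."""
    filled = np.where(np.isnan(row), last_valid, row)
    # The filled row becomes the memory for the next call
    return filled, filled.copy()

def decide_action(probabilities, weight, threshold=0.5):
    """Trade (1) only if weight is positive and the mean probability beats the threshold."""
    if weight <= 0:
        return 0
    return int(np.mean(probabilities) > threshold)

# Toy stream of two rows with three features each (made-up values)
last_valid = np.zeros(3)  # assumption: unseen features start at zero
row_1 = np.array([0.5, np.nan, -1.0])
row_2 = np.array([np.nan, 2.0, np.nan])

filled_1, last_valid = fill_from_previous(row_1, last_valid)
filled_2, last_valid = fill_from_previous(row_2, last_valid)

print(filled_1, filled_2)
print(decide_action([0.6, 0.55, 0.4, 0.7, 0.5], weight=1.0))
```

In the real loop, the filled row would be passed to `mlp_model.predict` in place of the raw `test_df` values.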

## 5. Notes and Observations

Compared to the previous LSTM model, this MLP model performed much better. While the accuracy of the model is comparable at 15-30%, the AUC-ROC is consistently above 0.56, indicating that the model has significantly more distinguishing power than the LSTM model. Despite the inability to look back into past data, it seems that the 131 inputs provide enough information to produce a good prediction of returns.
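
An accuracy of 15-30% on binary targets is below what coin-flipping would give, which hints that the metric may not measure what its name suggests. My understanding is that Keras in TensorFlow 2.x resolves the string `'accuracy'` from the shapes of the targets and outputs, and that with five output columns it can resolve to categorical (argmax) accuracy rather than per-label binary accuracy. This is an assumption that has not been checked against the TensorFlow version used here. The NumPy sketch below, on made-up values, shows how far apart the two definitions can be:

```python
import numpy as np

# Made-up multi-label targets and predicted probabilities (3 rows, 5 labels)
y_true = np.array([[1, 1, 1, 1, 1],
                   [0, 0, 0, 0, 0],
                   [1, 1, 0, 0, 1]])
y_prob = np.array([[0.6, 0.7, 0.8, 0.9, 0.7],
                   [0.4, 0.3, 0.2, 0.1, 0.3],
                   [0.6, 0.7, 0.4, 0.3, 0.8]])

# Per-label accuracy: threshold every probability at 0.5
binary_acc = np.mean((y_prob > 0.5).astype(int) == y_true)

# Argmax-style accuracy: only compares the position of the largest entry per row
categorical_acc = np.mean(np.argmax(y_true, axis=1) == np.argmax(y_prob, axis=1))

print(binary_acc, categorical_acc)
```

Here every individual label is predicted correctly, yet the argmax-style score is low. If that is what happened in this notebook, passing an explicit binary accuracy metric would give a more meaningful number.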

Sometimes the basic approach is best.

### References:

https://www.kaggle.com/tomwarrens/nan-values-depending-on-time-of-day

https://www.kaggle.com/manavtrivedi/lstm-rnn-classifier/output

https://www.kaggle.com/rajkumarl/jane-tf-keras-lstm

https://www.kaggle.com/tarlannazarov/own-jane-street-with-keras-nn