# Jane Street Market Prediction - An LSTM Approach
*This notebook is a response to the problem posed by the "Jane Street Market Prediction" Kaggle Competition (Nov 2020 - Feb 2021).*

The application of Deep Learning to financial markets has long been one of the hot topics of the field. The Jane Street Market Prediction competition challenges us to create a quantitative trading model, one that utilizes real-time market data to help make trading decisions and maximise returns.

### Framing the Problem

The goal of the model is to **predict whether it is better to make a trade or pass on it** at a certain point in time, given an anonymized set of features representing stock market data at that point.

I opted to use a **Long Short-Term Memory (LSTM)** model because market data is a Time Series. Analysing past patterns to predict future performance is already established practice in technical market analysis, so I decided to have the model take into account past data in addition to current data.

Below, I go through the preparation of data, model creation and finally prediction.

## 1. Cleaning the Dataset

We first have to import the dataset from Kaggle.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datatable

# datatable reads large csv files faster than pandas
train_df = datatable.fread('/kaggle/input/jane-street-market-prediction/train.csv').to_pandas()

Let's inspect the data:

In [None]:
print(train_df.info())
train_df.head()

The `date` is the day on which the trading opportunity occurs. This goes from Day 0-499.

The `weight` and `resp` together represent the value of each trade. `resp_1` to `resp_4` are 'resp' values over different time horizons. **The five 'resp' values will be the dependent variables, and hence the targets of prediction.**

`feature_0` to `feature_129` represent stock market data.

The `ts_id` is the index of each row. It is the number representing the time of the trading opportunity.

In [None]:
f_mean = train_df.mean().drop(['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4', 'date', 'ts_id']) # Will be used later

### Dealing with NaN entries

Right away we see that we will have to deal with numerous NaN entries, as seen in feature_121. Let's dig a little deeper:

In [None]:
isna_df = train_df.isnull().sum()
isna_df[isna_df > 0]

In [None]:
print("Max:", isna_df.max())
print(isna_df[isna_df == isna_df.max()])
isna_df.max()/train_df.size

We can see that there are 88 columns with NaN entries, with a maximum of 395535 NaN entries in a single column. However, this is only about 0.1% of all cells in the dataset (`train_df.size` counts every cell, not just the rows), so it should be okay to fill in the NaN entries.
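
To make the difference between "share of all cells" and "share of one column" concrete, here is a minimal sketch on a toy DataFrame. The frame and its values are made up for illustration and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): column "b" is half missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 5.0, 6.0],
    "c": [7.0, 8.0, 9.0, 10.0],
})

nan_counts = toy.isnull().sum()

# Worst column's NaN count as a share of ALL cells (rows * columns)
share_of_cells = nan_counts.max() / toy.size

# Share of missing rows within each individual column
share_per_column = toy.isnull().mean()

print(share_of_cells)         # 2 missing cells out of 12
print(share_per_column["b"])  # 2 missing rows out of 4
```

The per-column share is the more cautious number to look at when deciding whether a single feature is still usable after filling.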

An analysis by Tom Warrens strongly suggests that most NaN values occur at the start of the day and during midday, which corresponds to the market opening and lunch breaks. With this information, it makes sense to fill in the NaN values with the last valid observation.

However, this only holds true if the data is at least generally continuous. Carl McBride's Day 0 Exploratory Data Analysis notebook shows that this is not always the case: `feature_41` to `feature_45` consist of discrete values. For these features, it makes more sense to fill in NaN values with the mean.

*Tom Warrens' analysis can be found here: https://www.kaggle.com/tomwarrens/nan-values-depending-on-time-of-day*

*Carl McBride's Day 0 EDA can be found here: https://www.kaggle.com/carlmcbrideellis/jane-street-eda-of-day-0-and-feature-importance*

In [None]:
discrete_features = ['feature_41', 'feature_42', 'feature_43', 'feature_44', 'feature_45']

isna_df = train_df[discrete_features].isnull().sum()
isna_df[isna_df > 0]

Since there are discrete features with NaN entries, we need to take two different approaches to filling in the data: forward-filling the continuous data and filling with mean for the discrete data.

We deal with the discrete data first.

*Note: For the sake of simplicity in this notebook, I split the data into training and validation datasets and fill with the mean before concatenating again. This is to prevent **data leakage** that occurs when the mean used to fill in the values is the mean of the whole dataset, rather than just the training set.*

In [None]:
# Filling with mean for discrete data
def fill_na_mean_discrete(df, discrete_features):
    df[discrete_features] = df[discrete_features].fillna(value=df[discrete_features].mean())
    return df

# Splitting into validation and training datasets to prevent data leakage
valid_ratio = 0.1 # 90% training data, 10% validation data
valid_index = int(len(train_df.index) * (1 - valid_ratio))

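# .copy() so that we fill an independent DataFrame rather than a slice of train_df (avoids SettingWithCopyWarning)
valid_df = fill_na_mean_discrete(train_df[valid_index:].copy(), discrete_features)
train_df = fill_na_mean_discrete(train_df[0:valid_index].copy(), discrete_features)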

# Re-concatenating both datasets
train_df = pd.concat([train_df, valid_df], axis=0)
train_df.head()
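
The cell above fills each split with its own mean. A stricter variant, sketched here as an assumption rather than what this notebook actually ran, computes the mean on the training rows only and reuses it for the validation rows. The helper name `fill_with_train_mean` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_with_train_mean(train, valid, cols):
    """Fill NaNs in `cols` of both frames using means computed on `train` only."""
    train, valid = train.copy(), valid.copy()
    train_mean = train[cols].mean()  # statistics come from the training rows only
    train[cols] = train[cols].fillna(train_mean)
    valid[cols] = valid[cols].fillna(train_mean)
    return train, valid

# Toy data: the last two rows play the role of the validation split
toy = pd.DataFrame({"feature_41": [1.0, np.nan, 3.0, np.nan, 100.0]})
toy_train, toy_valid = fill_with_train_mean(toy[:3], toy[3:], ["feature_41"])

print(toy_train["feature_41"].tolist())  # [1.0, 2.0, 3.0]
print(toy_valid["feature_41"].tolist())  # [2.0, 100.0]
```

The validation gap is filled with 2.0 (the training mean), so the large validation value of 100.0 never influences the fill.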

In [None]:
isna_df = train_df[discrete_features].isnull().sum()
isna_df[isna_df > 0]

Next, we can use forward-filling to fill the rest of the data.

In [None]:
# Forward-filling
train_df.ffill(inplace=True) # fillna(method="ffill") is deprecated in recent pandas versions
train_df.head()

In [None]:
isna_df = train_df.isnull().sum()
isna_df[isna_df > 0]

We can see that the number of NaN entries has been drastically reduced, but there are still many entries with NaN values. This is likely because many NaN values start at index 0 (as can be seen from `feature_121` above) and hence do not have a last valid observation to fill from.

Although this is not ideal since in actual use we will not have future data on hand, for training purposes we can fill in the last few NaN entries with the next valid observation instead.

In [None]:
train_df.bfill(inplace=True) # fillna(method="bfill") is deprecated in recent pandas versions
train_df.head()

In [None]:
isna_df = train_df.isnull().sum()
isna_df[isna_df > 0]
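
To see why forward-filling alone leaves gaps at the start of a column, here is a small sketch on a toy Series (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 1.0, np.nan, 2.0])

forward = s.ffill()     # leading NaNs have no earlier value to copy from
both = forward.bfill()  # back-fill picks up the first valid observation

print(forward.isnull().sum())  # 2
print(both.tolist())           # [1.0, 1.0, 1.0, 1.0, 2.0]
```

Only the leading NaNs are touched by the back-fill, so the amount of future information used is limited to the first valid observation of each column.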

### Reducing Memory Usage

Before we continue, we should return to the memory usage of the dataset, as seen above. At 2.4GB, the training dataset takes up quite a lot of memory. Let's try to reduce the memory usage by optimizing the data types.

(Note: if done before we fill the NaN entries, the pandas.fillna method will not work)

In [None]:
def reduce_memory_usage(df):
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != 'object':
            cmin = df[col].min()
            cmax = df[col].max()
            
            if str(col_type)[:3] == 'int':
                if cmin > np.iinfo(np.int8).min and cmax < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif cmin > np.iinfo(np.int16).min and cmax < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif cmin > np.iinfo(np.int32).min and cmax < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif cmin > np.iinfo(np.int64).min and cmax < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if cmin > np.finfo(np.float16).min and cmax < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif cmin > np.finfo(np.float32).min and cmax < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                    
        else:
            df[col] = df[col].astype('category')
            
    return df

train_df = reduce_memory_usage(train_df)
train_df.info()
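
Downcasting saves memory at the cost of precision. The following sketch, on a few made-up values, shows both effects for `float16`; it is an illustration of the trade-off, not a measurement on the competition data.

```python
import numpy as np

values = np.array([0.1234567, 1234.567, 60000.0], dtype=np.float64)
half = values.astype(np.float16)

print(values.nbytes, half.nbytes)  # 24 bytes vs 6 bytes
print(np.finfo(np.float16).max)    # 65504.0, the largest finite float16
print(float(half[1]))              # 1235.0: between 1024 and 2048, float16 only stores whole numbers
```

If the rounding matters for a feature, `float32` is a safer target type for that column.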

### Re-indexing the Data

Lastly, we should set the index of train_df to "ts_id".

In [None]:
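# set_index returns a new DataFrame, so the result has to be assigned.
# ts_id is also kept as a column (drop=False) because it is dropped by name further below.
train_df = train_df.set_index("ts_id", drop=False)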

## 2. Transforming the Dataset

Now that the data is clean, we can start to prepare the data for the model. We first split the data into training and validation data (because this is Time Series, the last 10% of data will be taken as validation).

In [None]:
# valid_index established above
valid_df = train_df[valid_index:]
train_df = train_df[0:valid_index]

print(len(train_df.index))
print(len(valid_df.index))

We separate the features from our dependent variables, which are `resp` and the four other 'resp' values (`resp_1` to `resp_4`) over the various time horizons.

In [None]:
train_Y = (train_df[["resp", "resp_1", "resp_2", "resp_3", "resp_4"]] > 0).astype(int) # The model just has to predict whether the 'resp' value is positive or negative
train_X = train_df.drop(["resp", "resp_1", "resp_2", "resp_3", "resp_4", "date", "ts_id"], axis=1)

valid_Y = (valid_df[["resp", "resp_1", "resp_2", "resp_3", "resp_4"]] > 0).astype(int)
valid_X = valid_df.drop(["resp", "resp_1", "resp_2", "resp_3", "resp_4", "date", "ts_id"], axis=1)

print(train_X.head())
print(train_Y.head())
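
As a quick illustration of how the targets are binarized, here is the same `> 0` comparison applied to a toy frame with made-up 'resp' values:

```python
import pandas as pd

toy_resp = pd.DataFrame({
    "resp":   [0.012, -0.004, 0.0],
    "resp_1": [-0.001, 0.003, 0.002],
})

# 1 where the return is strictly positive, 0 otherwise
toy_targets = (toy_resp > 0).astype(int)

print(toy_targets["resp"].tolist())    # [1, 0, 0]
print(toy_targets["resp_1"].tolist())  # [0, 1, 1]
```

Note that a value of exactly 0 is mapped to the negative class.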

### Data Windowing

Next, we have to reshape our data for our model. Our model expects us to **window** our data for Time Series analysis. The final shape should be 3D, of the format **(batch_size, time_steps, feature_count)**.

In [None]:
import tensorflow as tf

# returns a tf.data.Dataset object
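def get_windowed_dataset(x_data, y_data, window_size, batch_size=4096, mode='train'):
    if mode not in ('train', 'predict'):
        raise ValueError(f"Unknown mode: {mode!r}")

    x_ds = tf.data.Dataset.from_tensor_slices(x_data) # converting pandas Dataframe into tf.data.Dataset object

    # Sliding windows of window_size consecutive rows; incomplete windows at the end are dropped
    x_ds = x_ds.window(window_size, shift=1, drop_remainder=True)
    x_ds = x_ds.flat_map(lambda window : window.batch(window_size, drop_remainder=True))

    if mode == 'train':
        # Each window is paired with the label of its last (most recent) row,
        # which matches how the windows are built at prediction time
        y_ds = tf.data.Dataset.from_tensor_slices(y_data[window_size - 1:])
        ds = tf.data.Dataset.zip((x_ds, y_ds))
        ds = ds.shuffle(10000).batch(batch_size)
    else:
        ds = x_ds.batch(batch_size)

    ds = ds.prefetch(tf.data.AUTOTUNE)
    return ds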
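
To make the target shape easier to picture, here is a minimal NumPy sketch of the same windowing idea, independent of `tf.data`. It assumes each window should be paired with the label of its last row; the helper name `make_windows` and the toy arrays are hypothetical.

```python
import numpy as np

def make_windows(x, y, window_size):
    """Return (windows, labels); each window ends at the row its label belongs to."""
    n_windows = len(x) - window_size + 1
    windows = np.stack([x[i:i + window_size] for i in range(n_windows)])
    labels = y[window_size - 1:]
    return windows, labels

x = np.arange(12, dtype=np.float32).reshape(6, 2)  # 6 time steps, 2 features
y = np.arange(6)                                   # one label per time step

windows, labels = make_windows(x, y, window_size=3)

print(windows.shape)    # (4, 3, 2) -> (num_windows, time_steps, feature_count)
print(labels.tolist())  # [2, 3, 4, 5]
```

Batching then adds the leading `batch_size` dimension that the model sees.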

In [None]:
lookback = 15 # The window_size is the lookback of the model

train_ds = get_windowed_dataset(train_X, train_Y, lookback)
valid_ds = get_windowed_dataset(valid_X, valid_Y, lookback)

In [None]:
for line in train_ds.take(5):
    print(line)

## 3. Building and Training the Model

We will then start building the model. I use Keras to build an LSTM model, using Adam as the optimizer, Binary-Crossentropy as the loss, and AUC-ROC and accuracy as the metrics.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm(lookback, num_columns, num_labels, lstm_units, dense_units, dropout_rate, learning_rate, label_smoothing):
    inp = layers.Input(shape=(lookback, num_columns, ))
    x = layers.BatchNormalization()(inp)
    x = layers.Dropout(dropout_rate)(x)
    
    for i in range(len(lstm_units)):
        x = layers.LSTM(lstm_units[i], return_state=False, return_sequences=(False if i==len(lstm_units)-1 else True))(x)
        x = layers.Dropout(dropout_rate)(x)
        
    for j in range(len(dense_units)):
        x = layers.Dense(dense_units[j])(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation(tf.keras.activations.swish)(x)
        x = layers.Dropout(dropout_rate)(x)
        
    x = layers.Dense(num_labels)(x)
    out = layers.Activation("sigmoid")(x)
    
    model = keras.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss=keras.losses.BinaryCrossentropy(label_smoothing=label_smoothing),
                  metrics=['AUC', 'accuracy'])
    model.summary() # summary() prints the table itself and returns None
    return model
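
For readers unfamiliar with the `label_smoothing` argument, the following NumPy sketch shows what I understand it to do for binary cross-entropy: the hard 0/1 targets are pulled slightly towards 0.5 before the loss is computed. The function names and the toy predictions are my own and only meant as an illustration.

```python
import numpy as np

def smooth_labels(y_true, label_smoothing):
    # Pull hard 0/1 targets towards 0.5 by the smoothing amount
    return y_true * (1.0 - label_smoothing) + 0.5 * label_smoothing

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])

y_smooth = smooth_labels(y_true, label_smoothing=0.01)
print(y_smooth)  # [0.995 0.005 0.995]
print(binary_crossentropy(y_true, y_pred), binary_crossentropy(y_smooth, y_pred))
```

With smoothed targets the model is penalised a little for being overly confident, which can help with noisy labels such as these.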

In [None]:
# This model has been tuned
num_epochs = 20

num_columns = len(train_X.columns)
num_labels = len(train_Y.columns)
lstm_units = [64, 64]
dense_units = [512, 256]
dropout_rate = 0.2
learning_rate = 0.001
label_smoothing = 0.01

# Early stopping
callback = keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', patience=5, verbose=1)

lstm_model = build_lstm(lookback, num_columns, num_labels, lstm_units, dense_units, dropout_rate, learning_rate, label_smoothing)
lstm_model.fit(train_ds, validation_data=valid_ds, epochs=num_epochs, callbacks=[callback])
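
The early-stopping rule used above can be sketched in plain Python. This is a simplified illustration of the patience logic (it ignores options such as a minimum improvement threshold), and the validation losses are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss  # new best validation loss: reset the counter
            wait = 0
        else:
            wait += 1    # another epoch without improvement
            if wait >= patience:
                return epoch
    return None

losses = [0.700, 0.695, 0.694, 0.696, 0.697, 0.695, 0.698, 0.699]
stopped_at = early_stopping_epoch(losses, patience=5)
print(stopped_at)  # 8: the best loss was at epoch 3, followed by 5 epochs without improvement
```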

## 4. Submission

Using the Jane Street Time-series API, we set up our notebook for submission to the competition.

In [None]:
# Object for keeping track of the windowed data
class DataWindower:
    def __init__(self, lookback, discrete_features):
        self.data = pd.DataFrame()
        self.cols = None
        self.lookback = lookback
        self.discrete_features = discrete_features

    def add_data(self, data):
        data = data.copy() # Avoid modifying the DataFrame passed in by the caller
        if self.data.empty:
            data = data.fillna(f_mean) # No history yet: fill every NaN entry with the training mean
            # Filling all rows with copies of the first data entry
            self.data = pd.concat([data for _ in range(self.lookback)], axis=0, ignore_index=True)
            self.cols = self.data.columns
        else:
            data = self.__fill_na_mean_discrete(data) # Dealing with discrete NaN entries
            data = data.fillna(self.data.iloc[-1]) # Dealing with continuous NaN entries: reuse the most recent row
            # Drop the oldest row by position and append the new one,
            # ensuring that the data window is always of lookback length
            self.data = pd.concat([self.data.iloc[1:], data], axis=0, ignore_index=True)

    def __fill_na_mean_discrete(self, data):
        data[self.discrete_features] = data[self.discrete_features].fillna(value=f_mean[self.discrete_features])
        return data

    def get_data(self):
        # Shape: (1, lookback, feature_count)
        return self.data.values.reshape(1, self.data.shape[0], self.data.shape[1])
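
The rolling-window idea behind this class can also be sketched with a fixed-length `deque`, which evicts the oldest row automatically. This is a simplified, hypothetical alternative (the class name `RollingWindow` is my own, and NaN handling is left out), not the code used for the submission.

```python
from collections import deque

import numpy as np

class RollingWindow:
    """Hypothetical fixed-length buffer holding the most recent feature rows."""

    def __init__(self, lookback):
        self.lookback = lookback
        self.rows = deque(maxlen=lookback)

    def add(self, row):
        row = np.asarray(row, dtype=np.float32)
        if not self.rows:
            # No history yet: pad the buffer with copies of the first row
            self.rows.extend([row] * self.lookback)
        else:
            self.rows.append(row)  # the oldest row is evicted automatically

    def as_batch(self):
        # Shape (1, lookback, feature_count), i.e. a batch containing one window
        return np.stack(list(self.rows))[np.newaxis, ...]

window = RollingWindow(lookback=3)
window.add([1.0, 10.0])
window.add([2.0, 20.0])
batch = window.as_batch()

print(batch.shape)             # (1, 3, 2)
print(batch[0, :, 0].tolist()) # [1.0, 1.0, 2.0]
```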

In [None]:
import janestreet
env = janestreet.make_env() # initialize the environment
iter_test = env.iter_test() # an iterator which loops over the test set

data_w = DataWindower(lookback, discrete_features)
threshold = 0.500

for (test_df, sample_prediction_df) in iter_test:
    data_w.add_data(test_df.drop('date', axis=1))
    
    if test_df['weight'].item() > 0: # test_df holds a single row, so take its scalar weight
        prediction = lstm_model.predict(data_w.get_data())
        avg = prediction.mean() # Average of the five predicted probabilities
        sample_prediction_df.action = 1 if avg > threshold else 0
    else:
        sample_prediction_df.action = 0
    env.predict(sample_prediction_df)
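
The decision rule in the loop above can be isolated into a small function for testing. This is a sketch with made-up probabilities; the function name `decide_action` is hypothetical.

```python
import numpy as np

def decide_action(prediction, weight, threshold=0.5):
    """Trade (1) only if the weight is positive and the mean predicted probability exceeds the threshold."""
    if weight <= 0:
        return 0
    return int(np.mean(prediction) > threshold)

confident = np.array([[0.62, 0.55, 0.48, 0.51, 0.58]])  # mean 0.548
doubtful = np.array([[0.40, 0.45, 0.50, 0.49, 0.52]])   # mean 0.472

print(decide_action(confident, weight=1.2))  # 1
print(decide_action(confident, weight=0.0))  # 0
print(decide_action(doubtful, weight=1.2))   # 0
```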

## 5. Notes and Observations

Despite my initial optimism that an LSTM would be an improvement over a standard multi-layer perceptron (MLP), the model did not perform well. The AUC-ROC was very close to 0.5, indicating that the model had little to no distinguishing power, even on the training data. The accuracy was also low, hovering around 15-20%.

The poor performance might be due to the very short time between each data point. A lookback of 10, 50 or even 100 will only retain data from a short period of time into the past. In contrast, technical analysis tends to look at data going back hours, days or weeks. With such a short lookback, the data is also likely very noisy.

This LSTM approach was also much more resource-intensive than simpler approaches, because windowing increases the amount of data processed by a factor of the lookback value and because an LSTM is more complex than an MLP. Given the computing limits of Kaggle Notebooks, this restricted the amount of tuning and the number of epochs I could run.
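
As a rough back-of-the-envelope check of that factor, the sketch below compares the number of values in a flat table with the number of values the model sees once every row is expanded into a window. The row and feature counts are hypothetical placeholders, not the exact competition figures.

```python
def windowed_size_bytes(n_rows, n_features, lookback, bytes_per_value):
    """Bytes processed per pass when every window is expanded in full."""
    n_windows = n_rows - lookback + 1
    return n_windows * lookback * n_features * bytes_per_value

n_rows, n_features = 1_000_000, 131  # hypothetical sizes
bytes_per_value = 2                  # e.g. float16

flat = n_rows * n_features * bytes_per_value
windowed = windowed_size_bytes(n_rows, n_features, lookback=15, bytes_per_value=bytes_per_value)
ratio = windowed / flat

print(round(ratio, 2))  # just under 15, i.e. roughly the lookback value
```

The windows are generated lazily by `tf.data`, so they do not all sit in memory at once, but the amount of data pushed through the model per epoch still grows by about this factor.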
 
Ultimately, I conclude that the model, as it is, is ill-suited for this problem.

### References:

https://www.kaggle.com/carlmcbrideellis/jane-street-eda-of-day-0-and-feature-importance

https://www.kaggle.com/tomwarrens/nan-values-depending-on-time-of-day

https://www.kaggle.com/manavtrivedi/lstm-rnn-classifier

https://www.kaggle.com/rajkumarl/jane-tf-keras-lstm

https://www.kaggle.com/tarlannazarov/own-jane-street-with-keras-nn