# Multi-Horizon Forecasting for Limit Order Books: Novel Deep Learning Approaches and Hardware Acceleration using Intelligent Processing Units
### Authors: Zihao Zhang and Stefan Zohren
### Oxford-Man Institute of Quantitative Finance, Department of Engineering Science, University of Oxford

This Jupyter notebook accompanies our recent paper [2], published in <...>. We use the FI-2010 dataset [1] and show how the model architecture is constructed.

### Data:
FI-2010 is publicly available, and interested readers can consult the accompanying paper [1]. The dataset can be downloaded from: https://etsin.fairdata.fi/dataset/73eb48d7-4dbc-4a10-a52a-da745b47a649

Alternatively, the notebook downloads the data automatically, or the data can be obtained from:

https://drive.google.com/drive/folders/1Xen3aRid9ZZhFqJRgEMyETNazk02cNmv?usp=sharing

### References:

[1] Ntakaris A, Magris M, Kanniainen J, Gabbouj M, Iosifidis A. Benchmark dataset for mid‐price forecasting of limit order book data with machine learning methods. Journal of Forecasting. 2018 Dec;37(8):852-66. https://arxiv.org/abs/1705.03233

[2] Zhang Z, Zohren S. Multi-Horizon Forecasting for Limit Order Books: Novel Deep Learning Approaches and Hardware Acceleration using Intelligent Processing Units. 

#### This notebook demonstrates how to train DeepLOB-Seq2Seq using TensorFlow 2 on GPUs.

In [1]:
# obtain data
import os 
if not os.path.isfile('data.zip'):
    !wget https://raw.githubusercontent.com/zcakhaa/DeepLOB-Deep-Convolutional-Neural-Networks-for-Limit-Order-Books/master/data/data.zip
    !unzip -n data.zip
    print('data downloaded.')
else:
    print('data already exists.')

--2021-07-14 23:19:53--  https://raw.githubusercontent.com/zcakhaa/DeepLOB-Deep-Convolutional-Neural-Networks-for-Limit-Order-Books/master/data/data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 56278154 (54M) [application/zip]
Saving to: ‘data.zip’


2021-07-14 23:19:57 (70.8 MB/s) - ‘data.zip’ saved [56278154/56278154]

Archive:  data.zip
  inflating: Test_Dst_NoAuction_DecPre_CF_7.txt  
  inflating: Test_Dst_NoAuction_DecPre_CF_9.txt  
  inflating: Test_Dst_NoAuction_DecPre_CF_8.txt  
  inflating: Train_Dst_NoAuction_DecPre_CF_7.txt  
data downloaded.


In [1]:
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        # Query logical devices only after memory growth is set on every GPU,
        # because this call initializes the GPUs
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

# %%
import os
import logging
import glob
import argparse
import sys
import time
import pandas as pd
import pickle
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

# set random seeds
np.random.seed(1)
tf.random.set_seed(2)

1 Physical GPUs, 1 Logical GPUs


In [2]:
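# import only the helpers used in this notebook
from preprocess import (prepare_x, get_label, data_classification,
                        prepare_decoder_input, evaluation_metrics)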
from model_gpu import get_model_seq

In [3]:
# please change the data_path to your local path
# data_path = '/nfs/home/zihaoz/limit_order_book/data'
T = 50
epochs = 50
batch_size = 32
n_hidden = 64
checkpoint_filepath = './model_deeplob_seq/weights'

In [4]:
# %%
dec_train = np.loadtxt('Train_Dst_NoAuction_DecPre_CF_7.txt')
dec_test1 = np.loadtxt('Test_Dst_NoAuction_DecPre_CF_7.txt')
dec_test2 = np.loadtxt('Test_Dst_NoAuction_DecPre_CF_8.txt')
dec_test3 = np.loadtxt('Test_Dst_NoAuction_DecPre_CF_9.txt')
dec_test = np.hstack((dec_test1, dec_test2, dec_test3))

# extract limit order book data from the FI-2010 dataset
train_lob = prepare_x(dec_train)
test_lob = prepare_x(dec_test)

# extract label from the FI-2010 dataset
train_label = get_label(dec_train)
test_label = get_label(dec_test)

# prepare training data. We feed the past T (= 50) observations into our algorithms.
train_encoder_input, train_decoder_target = data_classification(train_lob, train_label, T)
train_decoder_input = prepare_decoder_input(train_encoder_input, teacher_forcing=False)

test_encoder_input, test_decoder_target = data_classification(test_lob, test_label, T)
test_decoder_input = prepare_decoder_input(test_encoder_input, teacher_forcing=False)

print(f'train_encoder_input.shape = {train_encoder_input.shape},'
      f'train_decoder_target.shape = {train_decoder_target.shape}')
print(f'test_encoder_input.shape = {test_encoder_input.shape},'
      f'test_decoder_target.shape = {test_decoder_target.shape}')

train_encoder_input.shape = (254701, 50, 40, 1),train_decoder_target.shape = (254701, 5, 3)
test_encoder_input.shape = (139538, 50, 40, 1),test_decoder_target.shape = (139538, 5, 3)
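
The shapes above are consistent with a sliding window of length `T` over the order book rows: each sample holds 50 consecutive snapshots of 40 features plus a trailing channel axis, and each target holds one 3-class one-hot vector for each of the 5 prediction horizons. `data_classification` lives in `preprocess` and is not shown here, so the following is only a minimal sketch of that idea. The helper name `sliding_windows`, the toy data, and the assumption that labels are integers in {1, 2, 3} are ours for illustration.

```python
import numpy as np

def sliding_windows(features, labels, window):
    """Hypothetical helper: pair each length-`window` slice of `features`
    with the label row at the slice's last time step."""
    n_samples = features.shape[0]
    n_windows = n_samples - window + 1
    # x[i] holds rows i .. i + window - 1 of the feature matrix.
    x = np.stack([features[i:i + window] for i in range(n_windows)])
    # Trailing channel axis: (n_windows, window, n_features, 1).
    x = x[..., np.newaxis]
    # Assumption: labels are integers in {1, 2, 3}; shift to {0, 1, 2} and one-hot encode.
    y = np.eye(3)[labels[window - 1:].astype(int) - 1]
    return x, y

rng = np.random.default_rng(0)
toy_features = rng.normal(size=(60, 40))        # 60 time steps, 40 LOB features
toy_labels = rng.integers(1, 4, size=(60, 5))   # 5 horizons, classes 1..3
x, y = sliding_windows(toy_features, toy_labels, window=50)
print(x.shape, y.shape)  # (11, 50, 40, 1) (11, 5, 3)
```

With `N` time steps and window length `T`, this yields `N - T + 1` samples, which is why the sample count is slightly smaller than the number of rows in the raw data.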


In [5]:
model = get_model_seq(n_hidden)
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
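
Because the targets have shape `(n_samples, 5, 3)`, the categorical cross-entropy is taken over the class axis at every horizon and then averaged. The sketch below is an illustrative NumPy re-implementation, not the Keras source, and it assumes the model emits a probability distribution over the three classes at each horizon. It also gives a useful reference point: a uniform guess scores ln 3, roughly 1.0986, so a training loss below that value indicates the model has learned something beyond chance.

```python
import numpy as np

def mean_categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Illustrative helper (hypothetical name): cross-entropy over the last
    (class) axis, averaged over samples and horizons."""
    y_pred = np.clip(y_pred, eps, 1.0)                  # avoid log(0)
    per_step = -(y_true * np.log(y_pred)).sum(axis=-1)  # shape (n_samples, n_horizons)
    return per_step.mean()

# Toy batch: 4 samples, 5 horizons, 3 classes, with a uniform prediction everywhere.
y_true = np.eye(3)[np.array([[0, 1, 2, 1, 1]] * 4)]
y_pred = np.full((4, 5, 3), 1.0 / 3.0)
print(mean_categorical_crossentropy(y_true, y_pred))  # ln(3), about 1.0986
```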

In [None]:
split_train_val = int(np.floor(len(train_encoder_input) * 0.8))

model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_loss',
    mode='auto',
    save_best_only=True)

model.fit([train_encoder_input[:split_train_val], train_decoder_input[:split_train_val]],
          train_decoder_target[:split_train_val],
          validation_data=([train_encoder_input[split_train_val:], train_decoder_input[split_train_val:]],
                           train_decoder_target[split_train_val:]),
          epochs=epochs, batch_size=batch_size, verbose=2, callbacks=[model_checkpoint_callback])

Epoch 1/50
6368/6368 - 181s - loss: 0.9576 - accuracy: 0.5171 - val_loss: 0.8496 - val_accuracy: 0.6138
Epoch 2/50
6368/6368 - 179s - loss: 0.6975 - accuracy: 0.7035 - val_loss: 0.7852 - val_accuracy: 0.6524
Epoch 3/50
6368/6368 - 180s - loss: 0.6446 - accuracy: 0.7311 - val_loss: 0.7602 - val_accuracy: 0.6728
Epoch 4/50
6368/6368 - 179s - loss: 0.6162 - accuracy: 0.7455 - val_loss: 0.7625 - val_accuracy: 0.6723
Epoch 5/50
6368/6368 - 180s - loss: 0.5984 - accuracy: 0.7540 - val_loss: 0.7443 - val_accuracy: 0.6825
Epoch 6/50
6368/6368 - 181s - loss: 0.5848 - accuracy: 0.7603 - val_loss: 0.7305 - val_accuracy: 0.6892
Epoch 7/50
6368/6368 - 180s - loss: 0.5740 - accuracy: 0.7649 - val_loss: 0.7427 - val_accuracy: 0.6822
Epoch 8/50
6368/6368 - 180s - loss: 0.5650 - accuracy: 0.7689 - val_loss: 0.7615 - val_accuracy: 0.6819
Epoch 9/50
6368/6368 - 179s - loss: 0.5576 - accuracy: 0.7719 - val_loss: 0.7527 - val_accuracy: 0.6846
Epoch 10/50
6368/6368 - 179s - loss: 0.5518 - accuracy: 0.7745 -
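
The split in the cell above is chronological: the first 80% of the training windows are used for fitting and the last 20% for validation. The `6368/6368` step count in the log follows directly from that split and the batch size, as the arithmetic below shows. The sample count is copied from the shape printed earlier; the variable names are ours.

```python
import math

n_windows = 254701   # training windows, taken from the shape printed earlier
batch_size = 32      # as set in the configuration cell

split = math.floor(n_windows * 0.8)              # first 80% of windows, in time order
n_val = n_windows - split                        # remaining 20% used for validation
steps_per_epoch = math.ceil(split / batch_size)  # the last batch may be partial

print(split, n_val, steps_per_epoch)  # 203760 50941 6368
```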

In [9]:
model.load_weights(checkpoint_filepath)
pred = model.predict([test_encoder_input, test_decoder_input])
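
`pred` has the same layout as `test_decoder_target`: one 3-class probability vector per sample and prediction horizon. `evaluation_metrics` is defined in `preprocess` and is not shown in this notebook, so the following is only a minimal sketch of how accuracy could be computed horizon by horizon, assuming the predicted class is the arg max over the last axis. The helper name `per_horizon_accuracy` and the toy arrays are hypothetical.

```python
import numpy as np

def per_horizon_accuracy(y_true, y_pred):
    """Hypothetical helper: accuracy at each horizon for arrays of shape
    (n_samples, n_horizons, n_classes)."""
    true_cls = y_true.argmax(axis=-1)   # (n_samples, n_horizons)
    pred_cls = y_pred.argmax(axis=-1)
    # Average the hits over samples, keeping one value per horizon.
    return (true_cls == pred_cls).mean(axis=0)

# Toy example: 3 samples, 2 horizons, 3 classes.
toy_true = np.eye(3)[np.array([[0, 1], [2, 2], [1, 0]])]
toy_pred = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]],
    [[0.5, 0.4, 0.1], [0.9, 0.05, 0.05]],
])
acc = per_horizon_accuracy(toy_true, toy_pred)
for h, a in enumerate(acc):
    print(f'Prediction horizon = {h}, accuracy = {a:.4f}')
```

In the toy data, two of the three samples are classified correctly at each horizon.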

In [10]:
evaluation_metrics(test_decoder_target, pred)


Prediction horizon = 0
accuracy_score = 0.8222921354756411
classification_report =               precision    recall  f1-score   support

           0     0.6910    0.5771    0.6289     21147
           1     0.8593    0.9309    0.8937     98624
           2     0.7136    0.5427    0.6165     19767

    accuracy                         0.8223    139538
   macro avg     0.7546    0.6835    0.7130    139538
weighted avg     0.8131    0.8223    0.8143    139538

-------------------------------
Prediction horizon = 1
accuracy_score = 0.7423497541888231
classification_report =               precision    recall  f1-score   support

           0     0.6292    0.5130    0.5652     27448
           1     0.7868    0.8924    0.8363     86605
           2     0.6454    0.4793    0.5501     25485

    accuracy                         0.7423    139538
   macro avg     0.6871    0.6283    0.6505    139538
weighted avg     0.7300    0.7423    0.7307    139538

-------------------------------
Predicti