# Data Preparation

To take full advantage of HftBacktest, you need to feed it tick-by-tick full order book and trade feed data. Unfortunately, unlike the daily bar data provided by platforms such as Yahoo Finance, free tick-by-tick full order book and trade feed data for HFT is not available. In the case of cryptocurrency, however, you can collect the full raw feed yourself.

## Getting started from Binance Futures' raw feed data

You can collect the Binance Futures feed yourself using https://github.com/nkaz001/collect-binancefutures  

In [1]:
import gzip

with gzip.open('usdm/btcusdt_20230404.dat.gz', 'r') as f:
    for i in range(20):
        line = f.readline()
        print(line)

b'1680652700423575 {"stream":"btcusdt@bookTicker","data":{"e":"bookTicker","u":2710246762461,"s":"BTCUSDT","b":"28145.10","B":"3.868","a":"28145.20","A":"6.887","T":1680652700430,"E":1680652700435}}\n'
b'1680652700441533 {"stream":"btcusdt@trade","data":{"e":"trade","E":1680652700455,"T":1680652700452,"s":"BTCUSDT","t":3535186032,"p":"28145.10","q":"0.002","X":"MARKET","m":true}}\n'
b'1680652700441685 {"stream":"btcusdt@trade","data":{"e":"trade","E":1680652700455,"T":1680652700452,"s":"BTCUSDT","t":3535186033,"p":"28145.10","q":"0.020","X":"MARKET","m":true}}\n'
b'1680652700441725 {"stream":"btcusdt@trade","data":{"e":"trade","E":1680652700455,"T":1680652700452,"s":"BTCUSDT","t":3535186034,"p":"28145.10","q":"0.020","X":"MARKET","m":true}}\n'
b'1680652700442528 {"stream":"btcusdt@trade","data":{"e":"trade","E":1680652700455,"T":1680652700452,"s":"BTCUSDT","t":3535186035,"p":"28145.10","q":"0.008","X":"MARKET","m":true}}\n'
b'1680652700442569 {"stream":"btcusdt@bookTicker","data":{"e":

The first token of each line is the timestamp at which the message was received locally.
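
As a minimal sketch of how one raw line can be split into its local timestamp and JSON payload, the snippet below parses one of the trade lines printed above. The helper name `parse_raw_line` is hypothetical and not part of HftBacktest, and the scaling of `T` assumes that the exchange timestamp is in milliseconds while the local timestamp is in microseconds, which is what the digit counts in the sample suggest.

```python
import json


def parse_raw_line(line: bytes):
    """Split one collected line into (local_timestamp, message dict)."""
    # The local receive timestamp and the JSON payload are separated by the first space.
    ts_token, payload = line.decode("utf-8").split(" ", 1)
    return int(ts_token), json.loads(payload)


# One of the trade lines shown in the output above.
sample = (
    b'1680652700441533 {"stream":"btcusdt@trade","data":{"e":"trade",'
    b'"E":1680652700455,"T":1680652700452,"s":"BTCUSDT","t":3535186032,'
    b'"p":"28145.10","q":"0.002","X":"MARKET","m":true}}\n'
)

local_ts, msg = parse_raw_line(sample)
# Assumption: "T" is in milliseconds, so scale it to microseconds for comparison.
exch_ts = msg["data"]["T"] * 1000
print(msg["stream"], local_ts, exch_ts, local_ts - exch_ts)
```

For this line the difference `local_ts - exch_ts` is negative, which means the local clock reads earlier than the exchange clock. This is the kind of inconsistency the conversion step has to deal with.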

The data needs to be converted into the normalized format that can be fed into HftBacktest.  
The `convert` method also attempts to correct timestamps by reordering the rows.

In [2]:
import numpy as np

from hftbacktest.data.utils import binancefutures

data = binancefutures.convert('usdm/btcusdt_20230404.dat.gz')
np.savez('btcusdt_20230404', data=data)

local_timestamp is ahead of exch_timestamp by 18836.0
found 542 rows that exch_timestamp is ahead of the previous exch_timestamp
Correction is done.
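
To make the two diagnostic messages above more concrete, here is a rough sketch of the kind of checks they appear to describe. It is not the implementation used by `binancefutures.convert`; the function name `inspect_timestamps` and the toy rows are made up for illustration.

```python
def inspect_timestamps(rows):
    """rows: list of (exch_timestamp, local_timestamp) tuples in arrival order."""
    # Smallest observed (local - exchange) difference. A negative value means
    # at least one message looks as if it was received before the exchange sent it.
    min_latency = min(local - exch for exch, local in rows)
    # Count rows whose exchange timestamp is earlier than the previous row's.
    out_of_order = sum(1 for prev, cur in zip(rows, rows[1:]) if cur[0] < prev[0])
    return min_latency, out_of_order


# Toy values, not taken from the real feed.
rows = [(100, 95), (110, 104), (105, 108), (120, 118)]
min_latency, out_of_order = inspect_timestamps(rows)
print(min_latency, out_of_order)

# One possible repair (an assumption, not necessarily what convert does):
# shift local timestamps so that no latency is negative.
offset = -min_latency if min_latency < 0 else 0
shifted = [(exch, local + offset) for exch, local in rows]
print(shifted)
```

In the real data above, the local timestamp of the first trade is 18836 larger in the normalized output than in the raw feed, which matches the offset reported by `convert`.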


You can save the data directly to a file by providing `output_filename`.

In [3]:
binancefutures.convert('usdm/btcusdt_20230405.dat.gz', output_filename='btcusdt_20230405')

local_timestamp is ahead of exch_timestamp by 26932.0
found 6555 rows that exch_timestamp is ahead of the previous exch_timestamp
Correction is done.
Saving to btcusdt_20230405


array([[ 1.00000000e+00,  1.68065280e+15,  1.68065280e+15,
         1.00000000e+00,  2.23000000e+04,  2.78800000e+00],
       [ 1.00000000e+00,  1.68065280e+15,  1.68065280e+15,
         1.00000000e+00,  2.75774000e+04,  0.00000000e+00],
       [ 1.00000000e+00,  1.68065280e+15,  1.68065280e+15,
         1.00000000e+00,  2.80238000e+04,  1.63800000e+00],
       ...,
       [ 1.00000000e+00,  1.68065321e+15,  1.68065321e+15,
        -1.00000000e+00,  2.81499000e+04,  1.53200000e+00],
       [ 1.00000000e+00,  1.68065321e+15,  1.68065321e+15,
        -1.00000000e+00,  2.85725000e+04,  1.83000000e-01],
       [ 1.00000000e+00,  1.68065321e+15,  1.68065321e+15,
        -1.00000000e+00,  2.89844000e+04,  1.00000000e-03]])

The normalized data looks as follows. You can find more details on the [Data](https://github.com/nkaz001/hftbacktest/wiki/Data) page.

In [4]:
import pandas as pd

df = pd.DataFrame(data, columns=['event', 'exch_timestamp', 'local_timestamp', 'side', 'price', 'qty'])
df['event'] = df['event'].astype(int)
df['exch_timestamp'] = df['exch_timestamp'].astype(int)
df['local_timestamp'] = df['local_timestamp'].astype(int)
df['side'] = df['side'].astype(int)
df

,event,exch_timestamp,local_timestamp,side,price,qty
0,2,1680652700452000,1680652700460369,-1,28145.1,0.002
1,2,1680652700452000,1680652700460521,-1,28145.1,0.020
2,2,1680652700452000,1680652700460561,-1,28145.1,0.020
3,2,1680652700452000,1680652700461364,-1,28145.1,0.008
4,2,1680652700462000,1680652700473746,1,28145.2,0.002
...,...,...,...,...,...,...
71014,1,1680652799975000,1680652799977784,-1,28182.7,0.441
71015,1,1680652799975000,1680652799977784,-1,28186.9,0.054
71016,1,1680652799975000,1680652799977784,-1,28225.5,3.213
71017,1,1680652799975000,1680652799977784,-1,28231.7,0.356
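
The following sketch shows one way to read individual rows of the normalized data. The code-to-name mappings are assumptions inferred from the rows in this notebook (trades appear with event `2`, depth updates with `1`, snapshot rows with `4`), so check the Data page for the authoritative definitions. The helper `describe` is hypothetical.

```python
# Assumed meanings, inferred from the rows shown in this notebook.
EVENT_NAMES = {1: "depth", 2: "trade", 4: "snapshot"}
SIDE_NAMES = {1: "buy/bid", -1: "sell/ask"}

# Three rows copied from the table above.
rows = [
    (2, 1680652700452000, 1680652700460369, -1, 28145.1, 0.002),
    (2, 1680652700462000, 1680652700473746, 1, 28145.2, 0.002),
    (1, 1680652799975000, 1680652799977784, -1, 28182.7, 0.441),
]


def describe(row):
    """Turn one normalized row into a labeled dict."""
    event, exch_ts, local_ts, side, price, qty = row
    return {
        "event": EVENT_NAMES.get(event, "unknown"),
        "side": SIDE_NAMES.get(side, "unknown"),
        "price": price,
        "qty": qty,
        # Difference between the two timestamp columns, in their native unit.
        "feed_latency": local_ts - exch_ts,
    }


described = [describe(r) for r in rows]
for d in described:
    print(d)
```

The `feed_latency` value is only as reliable as the timestamp correction applied during conversion, so treat it as an estimate.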


## Creating a market depth snapshot

Because the Binance Futures exchange runs 24/7, you need an initial snapshot to obtain the (almost) complete market depth.  
`collect-binancefutures` fetches the snapshot only when it establishes the connection, so you need to build the initial snapshot from the start of the collected feed data.

In [5]:
from hftbacktest.data.utils import create_last_snapshot

# Build 20230404 End of Day snapshot. It will be used for the initial snapshot for 20230405.
data = create_last_snapshot('btcusdt_20230404.npz', tick_size=0.01, lot_size=0.001)
np.savez('btcusdt_20230404_eod.npz', data=data)

# Build 20230405 End of Day snapshot.
# Due to the file size limitation, btcusdt_20230405.npz does not contain data for the entire day.
create_last_snapshot(
    'btcusdt_20230405.npz',
    tick_size=0.01,
    lot_size=0.001,
    initial_snapshot='btcusdt_20230404_eod.npz',
    output_snapshot_filename='btcusdt_20230405_eod'
)

Load btcusdt_20230404.npz
Load btcusdt_20230405.npz


array([[ 4.00000000e+00,  1.68065321e+15, -1.00000000e+00,
         1.00000000e+00,  2.81401000e+04,  8.25100000e+00],
       [ 4.00000000e+00,  1.68065321e+15, -1.00000000e+00,
         1.00000000e+00,  2.81400000e+04,  1.62000000e-01],
       [ 4.00000000e+00,  1.68065321e+15, -1.00000000e+00,
         1.00000000e+00,  2.81399000e+04,  4.00000000e-03],
       ...,
       [ 4.00000000e+00,  1.68065321e+15, -1.00000000e+00,
        -1.00000000e+00,  3.09404800e+05,  2.00000000e-03],
       [ 4.00000000e+00,  1.68065321e+15, -1.00000000e+00,
        -1.00000000e+00,  3.09425600e+05,  7.00000000e-03],
       [ 4.00000000e+00,  1.68065321e+15, -1.00000000e+00,
        -1.00000000e+00,  3.09443200e+05,  5.00000000e-03]])

In [6]:
import pandas as pd

df = pd.DataFrame(data, columns=['event', 'exch_timestamp', 'local_timestamp', 'side', 'price', 'qty'])
df['event'] = df['event'].astype(int)
df['exch_timestamp'] = df['exch_timestamp'].astype(int)
df['local_timestamp'] = df['local_timestamp'].astype(int)
df['side'] = df['side'].astype(int)
df

,event,exch_timestamp,local_timestamp,side,price,qty
0,4,1680652799977784,-1,1,28155.1,0.060
1,4,1680652799977784,-1,1,28155.0,0.004
2,4,1680652799977784,-1,1,28154.9,0.001
3,4,1680652799977784,-1,1,28154.8,0.001
4,4,1680652799977784,-1,1,28154.7,0.002
...,...,...,...,...,...,...
4092,4,1680652799977784,-1,-1,30827.5,1.620
4093,4,1680652799977784,-1,-1,31500.0,33.077
4094,4,1680652799977784,-1,-1,33500.0,11.648
4095,4,1680652799977784,-1,-1,33752.3,0.001
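
Conceptually, an end-of-day snapshot is the order book that remains after all depth updates of the day have been applied. The sketch below illustrates that idea with plain dictionaries. It is not the code behind `create_last_snapshot`; the function `build_snapshot` and the sample events are hypothetical, and it assumes that a quantity of zero removes a price level.

```python
def build_snapshot(depth_events):
    """depth_events: iterable of (side, price, qty); side is 1 for bids, -1 for asks."""
    bids, asks = {}, {}
    for side, price, qty in depth_events:
        book = bids if side == 1 else asks
        if qty == 0:
            # Assumption: a zero quantity means the level no longer exists.
            book.pop(price, None)
        else:
            # Otherwise the quantity at this price is replaced.
            book[price] = qty
    # Best bid first (highest price), best ask first (lowest price).
    bid_levels = sorted(bids.items(), reverse=True)
    ask_levels = sorted(asks.items())
    return bid_levels, ask_levels


# Illustrative events, not taken from the real feed.
events = [
    (1, 28145.1, 3.868),
    (-1, 28145.2, 6.887),
    (1, 28145.0, 1.0),
    (1, 28145.0, 0.0),   # bid level removed
    (-1, 28145.3, 0.5),
    (-1, 28145.2, 2.0),  # ask level updated
]
bid_levels, ask_levels = build_snapshot(events)
print(bid_levels)
print(ask_levels)
```

Float prices are used as dictionary keys here only for brevity. A more robust version would convert prices to integer ticks first, which is presumably why `create_last_snapshot` takes `tick_size` and `lot_size`.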


## Getting started from Tardis.dev data

Few vendors offer tick-by-tick full market depth data along with snapshot and trade data, and Tardis.dev is among them.

In [None]:
# https://docs.tardis.dev/historical-data-details/binance-futures

# Download sample Binance futures BTCUSDT trades
!wget https://datasets.tardis.dev/v1/binance-futures/trades/2020/02/01/BTCUSDT.csv.gz -O BTCUSDT_trades.csv.gz
    
# Download sample Binance futures BTCUSDT book
!wget https://datasets.tardis.dev/v1/binance-futures/incremental_book_L2/2020/02/01/BTCUSDT.csv.gz -O BTCUSDT_book.csv.gz

In [8]:
from hftbacktest.data.utils import tardis

data = tardis.convert(['BTCUSDT_trades.csv.gz', 'BTCUSDT_book.csv.gz'])
np.savez('btcusdt_20200201.npz', data=data)

Reading BTCUSDT_trades.csv.gz
Reading BTCUSDT_book.csv.gz
Merging
found 20948 rows that exch_timestamp is ahead of the previous exch_timestamp
Correction is done.


You can save the data directly to a file by providing `output_filename`. If there are too many rows, you need to increase `buffer_size`.  

In [9]:
tardis.convert(
    ['BTCUSDT_trades.csv.gz', 'BTCUSDT_book.csv.gz'],
    output_filename='btcusdt_20200201.npz',
    buffer_size=200_000_000
)

Reading BTCUSDT_trades.csv.gz
Reading BTCUSDT_book.csv.gz
Merging
found 20948 rows that exch_timestamp is ahead of the previous exch_timestamp
Correction is done.
Saving to btcusdt_20200201.npz


array([[ 2.0000000e+00,  1.5805152e+15,  1.5805152e+15,  1.0000000e+00,
         9.3645100e+03,  1.1970000e+00],
       [ 2.0000000e+00,  1.5805152e+15,  1.5805152e+15,  1.0000000e+00,
         9.3656700e+03,  2.0000000e-02],
       [ 2.0000000e+00,  1.5805152e+15,  1.5805152e+15,  1.0000000e+00,
         9.3658600e+03,  1.0000000e-02],
       ...,
       [ 1.0000000e+00,  1.5806016e+15,  1.5806016e+15,  1.0000000e+00,
         9.3514700e+03,  3.9140000e+00],
       [ 1.0000000e+00,  1.5806016e+15,  1.5806016e+15, -1.0000000e+00,
         9.3977800e+03,  1.0000000e-01],
       [ 1.0000000e+00,  1.5806016e+15,  1.5806016e+15,  1.0000000e+00,
         9.3481400e+03,  3.9800000e+00]])

You can also build the snapshot in the same way as described above.