# Binance Data Retrieval Overview
This notebook downloads and preprocesses historical Binance Futures data for the ETH/USDT trading pair from the Tardis.dev datasets. It retrieves trade and incremental order book data, converts them to `.npz` files, filters the events, and computes order latencies from the feed latency. The processed data serves as the foundation for analysis and backtesting in later examples.

By running the scripts in this notebook, you will generate .npz files containing the relevant data, as well as calculated latency files. These outputs will be stored in the data/ and latency/ directories, respectively, for easy access in subsequent workflows.
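
As a rough guide to what ends up on disk, the sketch below lists the output paths for a given year and range of months. The helper name `expected_outputs` is hypothetical; the file name patterns are taken from the processing loop later in this notebook, which downloads the first day of each month.

```python
from pathlib import PurePosixPath

def expected_outputs(year, months):
    """Hypothetical helper: list (data_path, latency_path) pairs, one per month."""
    paths = []
    for month in months:
        stamp = f"{year}{month:02d}01"  # first day of the month, e.g. 20200101
        data_path = PurePosixPath("data") / f"ethusdt_{stamp}.npz"
        latency_path = PurePosixPath("latency") / f"feed_latency_{stamp}.npz"
        paths.append((str(data_path), str(latency_path)))
    return paths

for data_path, latency_path in expected_outputs(2020, range(1, 3)):
    print(data_path, latency_path)
```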

In [1]:
import numpy as np
from numba import njit
import polars as pl
from hftbacktest import LOCAL_EVENT, EXCH_EVENT

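@njit
def generate_order_latency_nb(data, order_latency, mul_entry, offset_entry, mul_resp, offset_resp):
    for i in range(len(data)):
        exch_ts = data[i].exch_ts
        local_ts = data[i].local_ts
        # Feed latency: time from the exchange timestamp to local receipt
        feed_latency = local_ts - exch_ts
        # Model both order latencies as a linear function of the feed latency
        order_entry_latency = mul_entry * feed_latency + offset_entry
        order_resp_latency = mul_resp * feed_latency + offset_resp

        # The request is timestamped at local receipt, reaches the exchange after
        # the entry latency, and the response returns after the response latency
        req_ts = local_ts
        order_exch_ts = req_ts + order_entry_latency
        resp_ts = order_exch_ts + order_resp_latency

        order_latency[i].req_ts = req_ts
        order_latency[i].exch_ts = order_exch_ts
        order_latency[i].resp_ts = resp_ts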

def generate_order_latency(feed_file, output_file=None, mul_entry=1, offset_entry=0, mul_resp=1, offset_resp=0):
    data = np.load(feed_file)['data']
    df = pl.DataFrame(data)

    # Keep events flagged as both exchange and local events, then take the
    # last exch_ts/local_ts pair in each 1-second window of local_ts
    df = df.filter(
        (pl.col('ev') & EXCH_EVENT == EXCH_EVENT) & (pl.col('ev') & LOCAL_EVENT == LOCAL_EVENT)
    ).with_columns(
        pl.col('local_ts').alias('ts')
    ).group_by_dynamic(
        'ts', every='1000000000i'
    ).agg(
        pl.col('exch_ts').last(),
        pl.col('local_ts').last()
    ).drop('ts')

    # Convert back to a NumPy structured array for the Numba kernel
    data = df.to_numpy(structured=True)

    order_latency = np.zeros(len(data), dtype=[('req_ts', 'i8'), ('exch_ts', 'i8'), ('resp_ts', 'i8'), ('_padding', 'i8')])
    generate_order_latency_nb(data, order_latency, mul_entry, offset_entry, mul_resp, offset_resp)

    if output_file is not None:
        np.savez_compressed(output_file, data=order_latency)

    return order_latency
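
To make the latency model concrete, here is a small pure-Python sketch of the same arithmetic applied to a single feed row. The function name `order_latency_row` and the sample timestamps are made up for illustration; only the formulas come from `generate_order_latency_nb` above. The timestamps are assumed to be in nanoseconds, consistent with the 1-second grouping interval of `1000000000`.

```python
def order_latency_row(exch_ts, local_ts, mul_entry=1, offset_entry=0, mul_resp=1, offset_resp=0):
    """Illustrative only: mirror the per-row arithmetic of the Numba kernel."""
    feed_latency = local_ts - exch_ts
    order_entry_latency = mul_entry * feed_latency + offset_entry
    order_resp_latency = mul_resp * feed_latency + offset_resp

    req_ts = local_ts                             # request timestamp = local receipt time
    order_exch_ts = req_ts + order_entry_latency  # order reaches the exchange
    resp_ts = order_exch_ts + order_resp_latency  # response arrives back locally
    return req_ts, order_exch_ts, resp_ts

# Made-up timestamps: the feed event took 2 ms (2_000_000 ns) to arrive.
row = order_latency_row(exch_ts=1_000_000_000, local_ts=1_002_000_000, mul_entry=4, mul_resp=3)
print(row)  # (1002000000, 1010000000, 1016000000)
```

With `mul_entry=4` and `mul_resp=3`, a 2 ms feed latency yields an 8 ms order entry latency and a 6 ms response latency.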

## Getting started from Tardis.dev data

Few vendors offer tick-by-tick full market depth data along with snapshot and trade data, and Tardis.dev is among them.

<div class="alert alert-info">
    
**Note:** Some data may have an issue with the exchange timestamp. Ideally, the exchange timestamp reflects the moment the event occurred at the matching engine. However, some data instead carries the time at which the server sent the data.

</div>
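
One rough way to inspect the timestamps is to summarize the feed latency, `local_ts - exch_ts`, and count the rows where it is not positive. The sketch below uses plain tuples and a hypothetical helper name, `feed_latency_summary`; it is a sanity check that assumes both timestamps share the same unit, not something provided by hftbacktest or Tardis.dev. It cannot tell which clock produced `exch_ts`; it only shows how the two timestamps relate.

```python
from statistics import median

def feed_latency_summary(rows):
    """Hypothetical helper: rows is an iterable of (exch_ts, local_ts) pairs."""
    latencies = [local_ts - exch_ts for exch_ts, local_ts in rows]
    return {
        "min": min(latencies),
        "median": median(latencies),
        "max": max(latencies),
        # Rows where local receipt is not later than the exchange timestamp
        "non_positive": sum(1 for latency in latencies if latency <= 0),
    }

sample = [(100, 130), (200, 225), (300, 295)]  # made-up values
summary = feed_latency_summary(sample)
print(summary)
```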

In [2]:
from hftbacktest.data.utils import tardis
import os
import polars as pl
from hftbacktest import EXCH_EVENT, LOCAL_EVENT
import numpy as np

# Define the year and months to process
year = 2020
months = range(1, 6)  

# Base URLs for trades and book data
trade_base_url = "https://datasets.tardis.dev/v1/binance-futures/trades/{year}/{month:02d}/01/ETHUSDT.csv.gz"
book_base_url = "https://datasets.tardis.dev/v1/binance-futures/incremental_book_L2/{year}/{month:02d}/01/ETHUSDT.csv.gz"

# Create directories if not present
os.makedirs('data', exist_ok=True)
os.makedirs('latency', exist_ok=True)

# Loop through each month and process data
for month in months:
    trade_url = trade_base_url.format(year=year, month=month)
    book_url = book_base_url.format(year=year, month=month)

    # Define filenames
    trade_filename = f"ETHUSDT_trades_{year}{month:02d}01.csv.gz"
    book_filename = f"ETHUSDT_book_{year}{month:02d}01.csv.gz"

    # Download trade and book data
    os.system(f"wget {trade_url} -O {trade_filename}")
    os.system(f"wget {book_url} -O {book_filename}")

    # Convert downloaded files using tardis
    output_filename = f"ethusdt_{year}{month:02d}01.npz"
    _ = tardis.convert(
        [trade_filename, book_filename],
        output_filename=output_filename,
        buffer_size=200_000_000,
        ss_buffer_size=500_000
    )

    # Load the converted data
    data = np.load(output_filename)['data']

    # Convert to Polars DataFrame and filter for relevant events
    df = pl.DataFrame(data)
    df = df.filter((pl.col('ev') & EXCH_EVENT == EXCH_EVENT) & (pl.col('ev') & LOCAL_EVENT == LOCAL_EVENT))

    # Group data by 1-second intervals
    df = df.with_columns(pl.col('local_ts').alias('ts')).group_by_dynamic('ts', every='1000000000i').agg(
        pl.col('exch_ts').last(),
        pl.col('local_ts').last()
    ).drop('ts')

    # Convert filtered data back to numpy structured array
    data = df.to_numpy(structured=True)

    # Parameters for order latency calculation
    mul_entry, offset_entry, mul_resp, offset_resp = 4, 0, 3, 0

    # Initialize order latency array
    order_latency = np.zeros(len(data), dtype=[('req_ts', 'i8'), ('exch_ts', 'i8'), ('resp_ts', 'i8'), ('_padding', 'i8')])

    # Calculate order latencies
    for i, (exch_ts, local_ts) in enumerate(data):
        feed_latency = local_ts - exch_ts
        order_entry_latency = mul_entry * feed_latency + offset_entry
        order_resp_latency = mul_resp * feed_latency + offset_resp

        req_ts = local_ts
        order_exch_ts = req_ts + order_entry_latency
        resp_ts = order_exch_ts + order_resp_latency

        order_latency[i] = (req_ts, order_exch_ts, resp_ts, 0)

    # Derive per-row entry and response latencies from the array for inspection
    df_order_latency = pl.DataFrame(order_latency)
    order_entry_latency = df_order_latency['exch_ts'] - df_order_latency['req_ts']
    order_resp_latency = df_order_latency['resp_ts'] - df_order_latency['exch_ts']

    # Generate and save the order latency file; this recomputes the same values
    # as the loop above with the Numba helper and writes them to feed_file
    feed_file = f'feed_latency_{year}{month:02d}01.npz'
    order_latency = generate_order_latency(output_filename, output_file=feed_file, mul_entry=4, mul_resp=3)

    # Move files to respective directories
    os.rename(output_filename, f'data/{output_filename}')
    os.rename(feed_file, f'latency/{feed_file}')

--2024-09-25 13:02:48--  https://datasets.tardis.dev/v1/binance-futures/trades/2020/01/01/ETHUSDT.csv.gz
Resolving datasets.tardis.dev (datasets.tardis.dev)... 2606:4700:4400::6812:28cd, 2606:4700:4400::ac40:9333, 104.18.40.205, ...
Connecting to datasets.tardis.dev (datasets.tardis.dev)|2606:4700:4400::6812:28cd|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 612298 (598K) [text/csv]
Saving to: ‘ETHUSDT_trades_20200101.csv.gz’

Reading ETHUSDT_trades_20200101.csv.gz
Reading ETHUSDT_book_20200101.csv.gz
Correcting the latency
Correcting the event order
Saving to ethusdt_20200101.npz


--2024-09-25 13:02:55--  https://datasets.tardis.dev/v1/binance-futures/trades/2020/02/01/ETHUSDT.csv.gz
Resolving datasets.tardis.dev (datasets.tardis.dev)... 2606:4700:4400::6812:28cd, 2606:4700:4400::ac40:9333, 172.64.147.51, ...
Connecting to datasets.tardis.dev (datasets.tardis.dev)|2606:4700:4400::6812:28cd|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1090855 (1.0M) [text/csv]
Saving to: ‘ETHUSDT_trades_20200201.csv.gz’

Reading ETHUSDT_trades_20200201.csv.gz
Reading ETHUSDT_book_20200201.csv.gz
Correcting the latency
Correcting the event order
Saving to ethusdt_20200201.npz


--2024-09-25 13:03:28--  https://datasets.tardis.dev/v1/binance-futures/trades/2020/03/01/ETHUSDT.csv.gz
Resolving datasets.tardis.dev (datasets.tardis.dev)... 2606:4700:4400::6812:28cd, 2606:4700:4400::ac40:9333, 172.64.147.51, ...
Connecting to datasets.tardis.dev (datasets.tardis.dev)|2606:4700:4400::6812:28cd|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4339605 (4.1M) [text/csv]
Saving to: ‘ETHUSDT_trades_20200301.csv.gz’

Reading ETHUSDT_trades_20200301.csv.gz
Reading ETHUSDT_book_20200301.csv.gz
Correcting the latency
Correcting the event order
Saving to ethusdt_20200301.npz


--2024-09-25 13:04:33--  https://datasets.tardis.dev/v1/binance-futures/trades/2020/04/01/ETHUSDT.csv.gz
Resolving datasets.tardis.dev (datasets.tardis.dev)... 2606:4700:4400::ac40:9333, 2606:4700:4400::6812:28cd, 104.18.40.205, ...
Connecting to datasets.tardis.dev (datasets.tardis.dev)|2606:4700:4400::ac40:9333|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2752273 (2.6M) [text/csv]
Saving to: ‘ETHUSDT_trades_20200401.csv.gz’

Reading ETHUSDT_trades_20200401.csv.gz
Reading ETHUSDT_book_20200401.csv.gz
Correcting the latency
Correcting the event order
Saving to ethusdt_20200401.npz


It is recommended to pass trade files before depth files. When a depth event is caused by a trade event, processing the trade first can produce a more realistic fill during backtesting. Because the sorting process prioritizes events from the first input file when two events share the same timestamp, listing the trade file first preserves that order.
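
The tie-breaking behavior can be illustrated with a stable sort over concatenated inputs. This is a minimal sketch with made-up `(timestamp, kind)` tuples and a hypothetical `merge_events` helper; it shows the ordering effect only and is not how `tardis.convert` is implemented.

```python
def merge_events(*sources):
    """Illustrative stable merge: ties on timestamp keep input order."""
    combined = [event for source in sources for event in source]
    # sorted() is stable, so events with equal timestamps stay in the order
    # of concatenation, i.e. events from earlier sources come first.
    return sorted(combined, key=lambda event: event[0])

trades = [(100, "trade"), (200, "trade")]
depth = [(100, "depth"), (150, "depth")]

merged = merge_events(trades, depth)
print(merged)
# [(100, 'trade'), (100, 'depth'), (150, 'depth'), (200, 'trade')]
```

Swapping the argument order to `merge_events(depth, trades)` would place the depth event at timestamp 100 ahead of the trade.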