# Hyperliquid Data Overview

This notebook provides a complete overview of Hyperliquid's publicly available data infrastructure.

**Goal**: Build a trading engine that deeply understands Hyperliquid through its traders.

---

## Setup

Create `.env` from `.env.example`:
```
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_REGION=us-east-2
```

In [59]:
import os
import json
import boto3
import lz4.frame
import msgpack
from collections import Counter
from dotenv import load_dotenv
from datetime import datetime

load_dotenv()

s3 = boto3.client(
    's3',
    aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY'),
    region_name=os.getenv('AWS_REGION', 'us-east-2')
)

REQUEST_PAYER = {'RequestPayer': 'requester'}

def list_prefixes(bucket, prefix=''):
    """List folder prefixes in S3"""
    r = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter='/', **REQUEST_PAYER)
    return [p['Prefix'] for p in r.get('CommonPrefixes', [])]

def list_files(bucket, prefix, limit=100):
    """List files in S3"""
    r = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=limit, **REQUEST_PAYER)
    return [obj['Key'] for obj in r.get('Contents', [])]

def download(bucket, key):
    """Download file from S3"""
    return s3.get_object(Bucket=bucket, Key=key, **REQUEST_PAYER)['Body'].read()

def parse_jsonl_lz4(data):
    """Parse LZ4-compressed JSON lines"""
    for line in lz4.frame.decompress(data).decode().strip().split('\n'):
        if line:
            yield json.loads(line)

def parse_msgpack_lz4(data):
    """Parse LZ4-compressed MessagePack"""
    return msgpack.unpackb(lz4.frame.decompress(data), raw=False)

print("Ready")

Ready
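
`list_objects_v2` returns at most 1,000 entries per call, so the listing helpers above can silently truncate on larger prefixes. A minimal sketch of a paginated variant using boto3's built-in paginator follows. The name `list_prefixes_paginated` and the explicit `client` argument are assumptions for illustration and are not part of the original helpers.

```python
def list_prefixes_paginated(client, bucket, prefix=''):
    """List every folder prefix under `prefix`, following S3 pagination."""
    paginator = client.get_paginator('list_objects_v2')
    pages = paginator.paginate(
        Bucket=bucket,
        Prefix=prefix,
        Delimiter='/',
        RequestPayer='requester',  # same requester-pays flag as REQUEST_PAYER
    )
    prefixes = []
    for page in pages:
        # Pages without sub-folders simply omit 'CommonPrefixes'
        prefixes.extend(p['Prefix'] for p in page.get('CommonPrefixes', []))
    return prefixes
```

It can be called with the client created above, e.g. `list_prefixes_paginated(s3, 'hl-mainnet-node-data', 'node_fills_by_block/hourly/')`.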


---

## The Two S3 Buckets

From [Hyperliquid docs](https://hyperliquid.gitbook.io/hyperliquid-docs/historical-data):

| Bucket | Purpose | Data Types |
|--------|---------|------------|
| `hyperliquid-archive` | Market data archives | L2 book snapshots, asset contexts |
| `hl-mainnet-node-data` | Node-streamed data | Explorer blocks, trades, fills |


In [49]:
print("=== hyperliquid-archive ===")
for p in list_prefixes('hyperliquid-archive'):
    print(f"  {p}")

print("\n=== hl-mainnet-node-data ===")
for p in list_prefixes('hl-mainnet-node-data'):
    print(f"  {p}")

=== hyperliquid-archive ===
  Testnet/
  asset_ctxs/
  market_data/

=== hl-mainnet-node-data ===
  explorer_blocks/
  misc_events_by_block/
  node_fills/
  node_fills_by_block/
  node_trades/
  replica_cmds/


---

## Bucket 1: `hyperliquid-archive`

Market data archives (L2 book snapshots and per-asset context CSVs). Not central to the trader analysis here.

```
s3://hyperliquid-archive/
├── market_data/{YYYYMMDD}/{HH}/l2Book/{COIN}.lz4
├── asset_ctxs/{YYYYMMDD}.csv.lz4
└── Testnet/
```

In [None]:
# Discover actual date range from S3
dates = list_prefixes('hyperliquid-archive', 'market_data/')
first = dates[0].split('/')[-2]
last = dates[-1].split('/')[-2]
print(f"market_data available: {first} to {last} ({len(dates)} days)")

market_data available: 20230415 to 20251102 (925 days)


---

## Bucket 2: `hl-mainnet-node-data`

Node-streamed data. This is what we need for trader analysis.

```
s3://hl-mainnet-node-data/
├── explorer_blocks/
├── node_trades/hourly/
├── node_fills/hourly/
├── node_fills_by_block/hourly/
├── replica_cmds/
└── misc_events_by_block/
```

| Dataset | First Date | Content | Format |
|---------|------------|---------|--------|
| `explorer_blocks` | Feb 2023 | Raw blocks: orders, cancels (no fills) | MessagePack+LZ4 |
| `node_trades` | Mar 2025 | Trades with buyer/seller | JSON+LZ4 |
| `node_fills` | May 2025 | Fills with PnL, fees | JSON+LZ4 |
| `node_fills_by_block` | Jul 2025 | Fills organized by block | JSON+LZ4 |
| `replica_cmds` | Jul 2025 | L1 transactions | JSON+LZ4 |
| `misc_events_by_block` | Jul 2025 | Liquidations, etc. | JSON+LZ4 |

In [62]:
# Discover actual date ranges from S3
def get_date_range(bucket, prefix):
    dates = list_prefixes(bucket, prefix)
    if not dates:
        return None, None, 0
    first = dates[0].rstrip('/').split('/')[-1]
    last = dates[-1].rstrip('/').split('/')[-1]
    return first, last, len(dates)

datasets = [
    ('node_fills_by_block', 'node_fills_by_block/hourly/'),
    ('node_fills', 'node_fills/hourly/'),
    ('node_trades', 'node_trades/hourly/'),
]

print("Date ranges discovered from S3:\n")
for name, prefix in datasets:
    first, last, count = get_date_range('hl-mainnet-node-data', prefix)
    if first:
        print(f"{name:25} {datetime.strptime(first, '%Y%m%d').strftime('%B %d, %Y')} to {datetime.strptime(last, '%Y%m%d').strftime('%B %d, %Y')} ({count} days)")

Date ranges discovered from S3:

node_fills_by_block       July 27, 2025 to November 28, 2025 (125 days)
node_fills                May 25, 2025 to July 27, 2025 (64 days)
node_trades               March 22, 2025 to June 21, 2025 (66 days)


In [63]:
# Check explorer_blocks - different structure
prefixes = list_prefixes('hl-mainnet-node-data', 'explorer_blocks/')
print(f"explorer_blocks prefixes: {[p.split('/')[1] for p in prefixes]}")
print(f"\nBlock ranges: 0 to {prefixes[-1].split('/')[1]}+ (100M blocks each)")

explorer_blocks prefixes: ['0', '100000000', '200000000', '300000000', '400000000', '500000000', '600000000', '700000000', '800000000']

Block ranges: 0 to 800000000+ (100M blocks each)


---

## Exploring `node_fills_by_block` (The Best Data)

Every fill since July 27, 2025 (the first date in the listing above), with price, size, direction, realized PnL, and fees.

In [68]:
# Find a date with data
dates = list_prefixes('hl-mainnet-node-data', 'node_fills_by_block/hourly/')
sample_date = dates[-2].rstrip('/').split('/')[-1]  # Second to last (likely complete)
print(f"Using date: {sample_date}")

# List files for hour 12 (assumes an hourly sub-prefix; an empty result may mean the keys use a different layout)
files = list_files('hl-mainnet-node-data', f'node_fills_by_block/hourly/{sample_date}/12/')
print(f"Files in hour 12: {len(files)}")
if files:
    print(f"First file: {files[0]}")

Using date: 20251127
Files in hour 12: 0
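
The empty result above may mean the hourly objects are not stored under a `12/` sub-prefix. A minimal sketch that filters an already-listed set of keys client-side follows. It accepts either `{date}/{hour}/...` or `{date}/{hour}.<ext>`. Both layouts and the helper name `keys_for_hour` are assumptions, since the actual key layout was not confirmed in this notebook.

```python
def keys_for_hour(keys, date, hour):
    """Return keys whose path component right after `date` names `hour`.

    Matches both a sub-folder ('.../20251127/12/x') and a single file
    ('.../20251127/12.lz4'); either layout is an assumption.
    """
    matches = []
    for key in keys:
        parts = key.split('/')
        if date not in parts:
            continue
        idx = parts.index(date)
        if idx + 1 >= len(parts):
            continue
        stem = parts[idx + 1].split('.', 1)[0]  # '12.lz4' -> '12'
        if stem.isdigit() and int(stem) == int(hour):
            matches.append(key)
    return matches
```

One way to use it is to list the whole date prefix first, e.g. `keys_for_hour(list_files('hl-mainnet-node-data', f'node_fills_by_block/hourly/{sample_date}/'), sample_date, 12)`, and inspect what comes back.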


In [54]:
# Download and parse a sample
if files:
    data = download('hl-mainnet-node-data', files[0])
    fills = list(parse_jsonl_lz4(data))
    print(f"Loaded {len(fills)} fills\n")
    print("Sample fill:")
    print(json.dumps(fills[0], indent=2))

### Fill Schema

| Field | Description |
|-------|-------------|
| `user` | Wallet address |
| `coin` | Asset traded |
| `px`, `sz` | Price and size |
| `dir` | Direction: Open Long, Open Short, Close Long, Close Short |
| `closedPnl` | Realized PnL (only on closes) |
| `fee` | Fee paid |
| `crossed` | true = taker, false = maker |
| `startPosition` | Position before this fill |

In [55]:
# Quick stats
if files:
    print(f"Unique users: {len(set(f['user'] for f in fills))}")
    print(f"Unique coins: {len(set(f['coin'] for f in fills))}")
    print(f"\nDirections:")
    for d, c in Counter(f['dir'] for f in fills).most_common():
        print(f"  {d}: {c}")

---

## Exploring `explorer_blocks` (Raw Historical Data)

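From Feb 2023 until the trade and fill datasets begin (`node_trades` on March 22, 2025 and `node_fills` on May 25, 2025), only raw blocks exist, so fills for that period must be reconstructed.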

In [56]:
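# List a few block files (S3 returns keys in lexicographic order, so the first key is not necessarily the lowest block number)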
block_files = list_files('hl-mainnet-node-data', 'explorer_blocks/0/0/', limit=5)
print("Early block files:")
for f in block_files:
    print(f"  {f}")

Early block files:
  explorer_blocks/0/0/1000.rmp.lz4
  explorer_blocks/0/0/10000.rmp.lz4
  explorer_blocks/0/0/100000.rmp.lz4
  explorer_blocks/0/0/10100.rmp.lz4
  explorer_blocks/0/0/10200.rmp.lz4
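
The listing above is in lexicographic order (`100000` sorts before `10100`), which is how S3 returns keys. A minimal sketch for ordering block files by block number instead is shown below. The helper name `sort_block_keys` is hypothetical, and it assumes every file name starts with an integer, as in the sample keys.

```python
def sort_block_keys(keys):
    """Sort keys like 'explorer_blocks/0/0/1000.rmp.lz4' by block number."""
    def block_number(key):
        filename = key.rsplit('/', 1)[-1]      # '1000.rmp.lz4'
        return int(filename.split('.', 1)[0])  # 1000
    return sorted(keys, key=block_number)
```

Because `limit=5` only returns the first five keys lexicographically, the full prefix has to be listed before sorting if the goal is the true lowest block file.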


In [57]:
# Download and parse
if block_files:
    data = download('hl-mainnet-node-data', block_files[0])
    blocks = parse_msgpack_lz4(data)
    print(f"Loaded {len(blocks)} blocks")
    print(f"Block range: {blocks[0]['header']['height']} - {blocks[-1]['header']['height']}")
    print(f"Time: {blocks[0]['header']['block_time']}")

Loaded 100 blocks
Block range: 901 - 1000
Time: 2023-02-26T17:41:39.942659


In [58]:
# Action types in blocks
if block_files:
    actions = Counter()
    for block in blocks:
        for tx in block.get('txs', []):
            for action in tx.get('actions', []):
                actions[action.get('type', 'unknown')] += 1
    
    print("Action types:")
    for t, c in actions.most_common():
        print(f"  {t}: {c}")

Action types:
  order: 100
  cancel: 25
  SetGlobalAction: 5
  CreditBridgeDepositAction: 1
  connect: 1


### What's in Raw Blocks

- Orders (asset, side, price, size)
- Cancels
- User addresses

### What's Missing

- Fills (must reconstruct via matching engine)
- PnL (must calculate)
- Fees

---

## Reconstruction: Theoretical

Fills can be reconstructed because matching is deterministic:

```
Same orders + Same sequence = Same fills
```
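
To make the determinism claim concrete, here is a toy price-time priority matcher. It is a sketch under strong assumptions and not Hyperliquid's matching engine: it handles only limit orders with integer prices and sizes, and it ignores cancels, trigger orders, liquidations, and fees. The tuple layout `(order_id, side, price, size)` and the name `match_orders` are made up for illustration.

```python
from collections import deque

def match_orders(orders):
    """Replay limit orders in sequence; return fills as
    (resting_id, incoming_id, price, size) tuples."""
    bids, asks = {}, {}  # price -> deque of [order_id, remaining_size]
    fills = []
    for oid, side, px, sz in orders:
        book, opposite = (bids, asks) if side == 'B' else (asks, bids)
        remaining = sz
        while remaining > 0 and opposite:
            # Best opposing price: lowest ask for a buy, highest bid for a sell
            best = min(opposite) if side == 'B' else max(opposite)
            crosses = best <= px if side == 'B' else best >= px
            if not crosses:
                break
            queue = opposite[best]
            resting = queue[0]  # oldest order at this price level
            traded = min(remaining, resting[1])
            fills.append((resting[0], oid, best, traded))
            remaining -= traded
            resting[1] -= traded
            if resting[1] == 0:
                queue.popleft()
            if not queue:
                del opposite[best]
        if remaining > 0:
            # Unfilled remainder rests on the book at its limit price
            book.setdefault(px, deque()).append([oid, remaining])
    return fills

orders = [(1, 'A', 101, 5), (2, 'A', 100, 3), (3, 'B', 101, 6)]
assert match_orders(orders) == match_orders(orders)  # same input, same fills
```

Replaying the same sequence always yields the same fills, which is the property a full reconstruction would rely on. The hard part is reproducing the real engine's rules exactly.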

**Challenges:**
- ~3-4 TB of data
- Must track order book state
- Edge cases: liquidations, funding, trigger orders

**Effort estimate:**

| Approach | Coverage | Time |
|----------|----------|------|
| `node_fills_by_block` only | Jul 2025+ (100%) | 1 day |
| + API backfill (10k limit) | ~95% of fills | 3 days |
| + Full reconstruction | 100% | 3-4 weeks |

---

## Our Approach

**Use `node_fills_by_block`** (Jul 2025 - Present)

- ~4 months of complete fill data (125 days in the listing above: July 27 to November 28, 2025)
- Every trader, every fill, with PnL and fees
- No reconstruction needed

**What we can compute:**

| Metric | Source |
|--------|--------|
| Realized PnL | `SUM(closedPnl)` |
| Volume | `SUM(px * sz)` |
| Trade count | `COUNT(*)` |
| Maker % | `SUM(crossed=false) / COUNT(*)` |
| Win rate | Closes with positive PnL / total closes |
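
The metrics in the table can be sketched in plain Python over a list of parsed fills. This assumes the fields from the fill schema (`user`, `px`, `sz`, `dir`, `closedPnl`, `crossed`) and that numeric fields arrive as numbers or numeric strings. Neither was verified against a downloaded file in this notebook, and the function name `trader_metrics` is hypothetical.

```python
from collections import defaultdict

def trader_metrics(fills):
    """Aggregate per-user metrics from an iterable of fill dicts."""
    acc = defaultdict(lambda: {'pnl': 0.0, 'volume': 0.0, 'trades': 0,
                               'maker': 0, 'closes': 0, 'wins': 0})
    for f in fills:
        s = acc[f['user']]
        pnl = float(f.get('closedPnl', 0))
        s['pnl'] += pnl                                 # SUM(closedPnl)
        s['volume'] += float(f['px']) * float(f['sz'])  # SUM(px * sz)
        s['trades'] += 1                                # COUNT(*)
        if not f['crossed']:                            # crossed=false means maker
            s['maker'] += 1
        if str(f.get('dir', '')).startswith('Close'):
            s['closes'] += 1
            if pnl > 0:
                s['wins'] += 1
    return {
        user: {
            'realized_pnl': s['pnl'],
            'volume': s['volume'],
            'trade_count': s['trades'],
            'maker_pct': s['maker'] / s['trades'],
            # Undefined when the user has no closing fills
            'win_rate': s['wins'] / s['closes'] if s['closes'] else None,
        }
        for user, s in acc.items()
    }
```

At full dataset scale the same aggregation would more likely run in a dataframe or SQL engine. The dict-based version only pins down the definitions.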

---

## Cost Estimates

S3 data transfer out at $0.09/GB (the buckets are accessed as requester-pays, so the downloader bears this cost):

| Dataset | Size (est.) | Cost |
|---------|-------------|------|
| `node_fills_by_block` | ~200-400 GB | ~$18-36 |
| `explorer_blocks` | ~3-4 TB | ~$270-360 |
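
The transfer arithmetic behind the table can be reproduced in a few lines. This is a rough sketch that assumes the flat $0.09/GB rate quoted above and 1 TB = 1,000 GB, and it ignores per-request charges. The function name `transfer_cost_usd` is made up for illustration.

```python
def transfer_cost_usd(size_gb, usd_per_gb=0.09):
    """Estimated S3 transfer cost for `size_gb` gigabytes at a flat rate."""
    return size_gb * usd_per_gb

# Size ranges are the estimates from the table, not measured values
for name, low_gb, high_gb in [
    ('node_fills_by_block', 200, 400),
    ('explorer_blocks', 3000, 4000),
]:
    low, high = transfer_cost_usd(low_gb), transfer_cost_usd(high_gb)
    print(f"{name:22} ${low:,.0f} to ${high:,.0f}")
```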

**Start with `node_fills_by_block`.**