# Hyperliquid Data Overview

This notebook provides a complete overview of Hyperliquid's publicly available data infrastructure.

**Goal**: Build a trading engine that deeply understands Hyperliquid through its traders.

---

## Setup

Create `.env` from `.env.example`:
```
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_REGION=us-east-2
```

In [23]:
import os
import json
import boto3
import lz4.frame
import msgpack
from collections import Counter
from dotenv import load_dotenv
from datetime import datetime

load_dotenv()

s3 = boto3.client(
    's3',
    aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY'),
    region_name=os.getenv('AWS_REGION', 'us-east-2')
)

REQUEST_PAYER = {'RequestPayer': 'requester'}

def list_prefixes(bucket, prefix=''):
    """List folder prefixes in S3"""
    r = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter='/', **REQUEST_PAYER)
    return [p['Prefix'] for p in r.get('CommonPrefixes', [])]

def list_files(bucket, prefix, limit=100):
    """List up to `limit` files in S3 (single page; S3 returns at most 1,000 keys per call)"""
    r = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=limit, **REQUEST_PAYER)
    return [obj['Key'] for obj in r.get('Contents', [])]

def download(bucket, key):
    """Download file from S3"""
    return s3.get_object(Bucket=bucket, Key=key, **REQUEST_PAYER)['Body'].read()

def parse_jsonl_lz4(data):
    """Parse LZ4-compressed JSON lines"""
    for line in lz4.frame.decompress(data).decode().strip().split('\n'):
        if line:
            yield json.loads(line)

def parse_msgpack_lz4(data):
    """Parse LZ4-compressed MessagePack"""
    return msgpack.unpackb(lz4.frame.decompress(data), raw=False)

print("Ready")

Ready
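
`list_objects_v2` returns at most 1,000 keys per call, so the helpers above truncate longer listings without warning. A minimal sketch of a paginated variant follows, assuming boto3's built-in paginator; `list_all_keys` is a hypothetical helper name, not part of the original notebook.

```python
def list_all_keys(client, bucket, prefix=''):
    """Yield every object key under a prefix, following S3 pagination."""
    # get_paginator handles continuation tokens across pages for us.
    paginator = client.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix, RequestPayer='requester')
    for page in pages:
        # A page with no matching objects has no 'Contents' entry.
        for obj in page.get('Contents', []):
            yield obj['Key']
```

With the client from the setup cell this could be called as `list(list_all_keys(s3, 'hyperliquid-archive', 'asset_ctxs/'))`. Passing the client in as an argument keeps the function easy to exercise with a stub.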


---

## The Two S3 Buckets

From [Hyperliquid docs](https://hyperliquid.gitbook.io/hyperliquid-docs/historical-data):

| Bucket | Purpose | Data Types |
|--------|---------|------------|
| `hyperliquid-archive` | Market data archives | L2 orderbook snapshots, asset contexts |
| `hl-mainnet-node-data` | Node-streamed data | Explorer blocks, trades, fills |


In [24]:
print("=== hyperliquid-archive ===")
for p in list_prefixes('hyperliquid-archive'):
    print(f"  {p}")

print("\n=== hl-mainnet-node-data ===")
for p in list_prefixes('hl-mainnet-node-data'):
    print(f"  {p}")

=== hyperliquid-archive ===
  Testnet/
  asset_ctxs/
  market_data/

=== hl-mainnet-node-data ===
  explorer_blocks/
  misc_events_by_block/
  node_fills/
  node_fills_by_block/
  node_trades/
  replica_cmds/


---

## Bucket 1: `hyperliquid-archive`

Market data archives. Not the focus for trader analysis, but useful for orderbook studies.

| Dataset | Date Range | Content | Format |
|---------|------------|---------|--------|
| `market_data` | Apr 2023 - Present | L2 orderbook snapshots per coin/hour | JSON+LZ4 |
| `asset_ctxs` | May 2023 - Present | Daily asset context (funding, OI, etc.) | CSV+LZ4 |

In [25]:
# Discover date ranges for hyperliquid-archive
def fmt_date(d):
    return datetime.strptime(d, '%Y%m%d').strftime('%b %d, %Y')

print("=" * 70)
print(f"{'Dataset':<20} {'First':<15} {'Last':<15} {'Days':>6}")
print("=" * 70)

# market_data (organized by date folders)
dates = list_prefixes('hyperliquid-archive', 'market_data/')
if dates:
    first = dates[0].split('/')[-2]
    last = dates[-1].split('/')[-2]
    print(f"{'market_data':<20} {fmt_date(first):<15} {fmt_date(last):<15} {len(dates):>6}")

# asset_ctxs (files named by date)
files = list_files('hyperliquid-archive', 'asset_ctxs/', limit=1000)
if files:
    # Extract dates from filenames like "asset_ctxs/20230520.csv.lz4"
    file_dates = [f.split('/')[-1].split('.')[0] for f in files]
    first = file_dates[0]
    last = file_dates[-1]
    print(f"{'asset_ctxs':<20} {fmt_date(first):<15} {fmt_date(last):<15} {len(files):>6}")

print("=" * 70)

Dataset              First           Last              Days
market_data          Apr 15, 2023    Nov 02, 2025       925
asset_ctxs           May 20, 2023    Nov 02, 2025       898


---

## Bucket 2: `hl-mainnet-node-data`

Node-streamed data. This is what we need for trader analysis.

| Dataset | Date Range | Content | Format |
|---------|------------|---------|--------|
| `node_fills_by_block` | **Jul 2025 - Present** | Fills with PnL, fees, maker/taker | JSON+LZ4 |
| `node_fills` | May 2025 - Jul 2025 | Fills (legacy format) | JSON+LZ4 |
| `node_trades` | Mar 2025 - Jun 2025 | Trades with buyer/seller | JSON+LZ4 |
| `replica_cmds` | Jul 2025 - Present | L1 transactions | JSON+LZ4 |
| `misc_events_by_block` | Sep 2025 - Present | Liquidations, funding, etc. | JSON+LZ4 |
| `explorer_blocks` | Feb 2023 - Present | Raw blocks (orders, cancels) | MessagePack+LZ4 |

**Best dataset**: `node_fills_by_block`. It has complete fill data, so no reconstruction is needed.

In [26]:
# Discover all date ranges from S3
def get_date_range(prefix):
    """Get first/last date for an hourly prefix"""
    dates = list_prefixes('hl-mainnet-node-data', f'{prefix}hourly/')
    if not dates:
        return None, None, 0
    first = dates[0].rstrip('/').split('/')[-1]
    last = dates[-1].rstrip('/').split('/')[-1]
    return first, last, len(dates)

def fmt_date(d):
    return datetime.strptime(d, '%Y%m%d').strftime('%b %d, %Y')

datasets = [
    ('node_fills_by_block/', 'Fills by block (BEST)'),
    ('node_fills/', 'Node fills (legacy)'),
    ('node_trades/', 'Node trades (legacy)'),
    ('replica_cmds/', 'Replica commands'),
    ('misc_events_by_block/', 'Misc events'),
]

print("=" * 70)
print(f"{'Dataset':<25} {'First':<15} {'Last':<15} {'Days':>6}")
print("=" * 70)

for prefix, label in datasets:
    first, last, count = get_date_range(prefix)
    if first:
        print(f"{label:<25} {fmt_date(first):<15} {fmt_date(last):<15} {count:>6}")
    else:
        print(f"{label:<25} {'N/A':<15} {'N/A':<15} {0:>6}")

# Explorer blocks - different structure (by block number, not date)
prefixes = list_prefixes('hl-mainnet-node-data', 'explorer_blocks/')
if prefixes:
    last_range = prefixes[-1].split('/')[1]
    print(f"{'Explorer blocks':<25} {'Block 0':<15} {f'Block {last_range}+':<15} {'N/A':>6}")

print("=" * 70)

Dataset                   First           Last              Days
Fills by block (BEST)     Jul 27, 2025    Nov 29, 2025       126
Node fills (legacy)       May 25, 2025    Jul 27, 2025        64
Node trades (legacy)      Mar 22, 2025    Jun 21, 2025        66
Replica commands          N/A             N/A                  0
Misc events               Sep 27, 2025    Nov 29, 2025        64
Explorer blocks           Block 0         Block 800000000+    N/A


---

## Exploring `node_fills_by_block` (The Best Data)

Every fill since Jul 27, 2025, with PnL, fees, and maker/taker flags included.

In [27]:
# Find a date with data
dates = list_prefixes('hl-mainnet-node-data', 'node_fills_by_block/hourly/')
sample_date = dates[-2].rstrip('/').split('/')[-1]  # Second to last (likely complete)
print(f"Using date: {sample_date}")

# List files for this date (files are named {hour}.lz4, not in hour subfolders)
files = list_files('hl-mainnet-node-data', f'node_fills_by_block/hourly/{sample_date}/')
print(f"Files for this date: {len(files)}")
print("Sample files:", [f.split('/')[-1] for f in files[:5]])

Using date: 20251128
Files for this date: 24
Sample files: ['0.lz4', '1.lz4', '10.lz4', '11.lz4', '12.lz4']
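
The sample output above is in lexicographic order (`10.lz4` comes before `2.lz4`), so hours must be sorted numerically before processing a day in time order. A small sketch; `hour_of` is a hypothetical helper that assumes keys end in `{hour}.lz4` as shown above.

```python
def hour_of(key):
    """Extract the integer hour from a key such as '.../20251128/10.lz4'."""
    filename = key.rsplit('/', 1)[-1]
    return int(filename.split('.')[0])

# Illustrative keys in the lexicographic order S3 returns them.
keys = [
    'node_fills_by_block/hourly/20251128/0.lz4',
    'node_fills_by_block/hourly/20251128/1.lz4',
    'node_fills_by_block/hourly/20251128/10.lz4',
    'node_fills_by_block/hourly/20251128/11.lz4',
    'node_fills_by_block/hourly/20251128/2.lz4',
]
ordered = sorted(keys, key=hour_of)
print([k.rsplit('/', 1)[-1] for k in ordered])
```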


In [28]:
# Download and parse hour 12 (midday, typically active)
sample_key = f'node_fills_by_block/hourly/{sample_date}/12.lz4'
data = download('hl-mainnet-node-data', sample_key)

# Each line is a block with multiple fill events
blocks = list(parse_jsonl_lz4(data))
print(f"Loaded {len(blocks):,} blocks from hour 12")

# Flatten to individual fills: each event is [user_address, fill_data]
fills = []
for block in blocks:
    for user, fill_data in block.get('events', []):
        fill_data['user'] = user
        fill_data['block_time'] = block['block_time']
        fills.append(fill_data)

print(f"Extracted {len(fills):,} fills\n")
print("Sample fill:")
print(json.dumps(fills[0], indent=2))

Loaded 44,437 blocks from hour 12
Extracted 230,074 fills

Sample fill:
{
  "coin": "HYPE",
  "px": "35.982",
  "sz": "104.87",
  "side": "B",
  "time": 1764331199926,
  "startPosition": "-4036.68",
  "dir": "Close Short",
  "closedPnl": "-77.089937",
  "hash": "0x8535edac35dc9a1786af043059f8ba02031b0091d0dfb8e928fe98fef4d07402",
  "oid": 251510202190,
  "crossed": true,
  "fee": "0.0",
  "tid": 124684432082590,
  "feeToken": "USDC",
  "twapId": null,
  "user": "0x010461c14e146ac35fe42271bdc1134ee31c703a",
  "block_time": "2025-11-28T11:59:59.926782774"
}


### Data Structure

Each file contains **blocks** (one per line), each block contains multiple **fill events**:

```
{block_time, block_number, events: [[user_address, fill_data], ...]}
```

### Fill Schema

| Field | Description |
|-------|-------------|
| `user` | Wallet address (from event tuple) |
| `coin` | Asset traded |
| `px`, `sz` | Price and size |
| `side` | B (buy) or A (ask/sell) |
| `dir` | Direction: Open Long, Open Short, Close Long, Close Short, Long > Short, Short > Long. Buy and Sell also appear in the data (presumably spot fills) |
| `closedPnl` | Realized PnL (only on closes) |
| `fee` | Fee paid (negative = rebate) |
| `crossed` | true = taker, false = maker |
| `startPosition` | Position before this fill |
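
Numeric fields arrive as strings, so they need parsing before any arithmetic. The sketch below converts them to `Decimal` (to avoid float rounding on prices and sizes) and turns `time` into a UTC datetime. It assumes `time` is milliseconds since the Unix epoch, which is consistent with the sample fill's `block_time`; `normalize_fill` and `DECIMAL_FIELDS` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal

DECIMAL_FIELDS = ('px', 'sz', 'startPosition', 'closedPnl', 'fee')

def normalize_fill(fill):
    """Return a copy of a fill with numeric strings parsed and `time` as a UTC datetime."""
    out = dict(fill)  # copy so the raw record is left untouched
    for name in DECIMAL_FIELDS:
        if out.get(name) is not None:
            out[name] = Decimal(out[name])
    # Assumption: `time` is milliseconds since the Unix epoch.
    seconds, millis = divmod(fill['time'], 1000)
    out['time'] = (datetime.fromtimestamp(seconds, tz=timezone.utc)
                   + timedelta(milliseconds=millis))
    return out

# A subset of the sample fill shown earlier.
sample = {'coin': 'HYPE', 'px': '35.982', 'sz': '104.87', 'time': 1764331199926,
          'startPosition': '-4036.68', 'closedPnl': '-77.089937', 'fee': '0.0'}
parsed = normalize_fill(sample)
notional = parsed['px'] * parsed['sz']
print(notional, parsed['time'].isoformat())
```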

In [29]:
# Quick stats
print(f"Unique users: {len(set(f['user'] for f in fills))}")
print(f"Unique coins: {len(set(f['coin'] for f in fills))}")
print(f"\nDirections:")
for d, c in Counter(f['dir'] for f in fills).most_common():
    print(f"  {d}: {c}")

Unique users: 5497
Unique coins: 249

Directions:
  Open Short: 51788
  Open Long: 47976
  Close Short: 47480
  Close Long: 43688
  Buy: 17541
  Sell: 17541
  Short > Long: 2040
  Long > Short: 2020


---

## Exploring `explorer_blocks` (Raw Historical Data)

Raw blocks go back to Feb 2023. For the period before the legacy trade and fill datasets begin (Mar 2025), they are the only source, so fills must be reconstructed.

In [30]:
# List a few early block files (S3 returns keys in lexicographic order, not block order)
block_files = list_files('hl-mainnet-node-data', 'explorer_blocks/0/0/', limit=5)
print("Early block files:")
for f in block_files:
    print(f"  {f}")

Early block files:
  explorer_blocks/0/0/1000.rmp.lz4
  explorer_blocks/0/0/10000.rmp.lz4
  explorer_blocks/0/0/100000.rmp.lz4
  explorer_blocks/0/0/10100.rmp.lz4
  explorer_blocks/0/0/10200.rmp.lz4


In [31]:
# Download and parse
if block_files:
    data = download('hl-mainnet-node-data', block_files[0])
    blocks = parse_msgpack_lz4(data)
    print(f"Loaded {len(blocks)} blocks")
    print(f"Block range: {blocks[0]['header']['height']} - {blocks[-1]['header']['height']}")
    print(f"Time: {blocks[0]['header']['block_time']}")

Loaded 100 blocks
Block range: 901 - 1000
Time: 2023-02-26T17:41:39.942659


In [32]:
# Action types in blocks
if block_files:
    actions = Counter()
    for block in blocks:
        for tx in block.get('txs', []):
            for action in tx.get('actions', []):
                actions[action.get('type', 'unknown')] += 1
    
    print("Action types:")
    for t, c in actions.most_common():
        print(f"  {t}: {c}")

Action types:
  order: 100
  cancel: 25
  SetGlobalAction: 5
  CreditBridgeDepositAction: 1
  connect: 1


### What's in Raw Blocks

- Orders (asset, side, price, size)
- Cancels
- User addresses

### What's Missing

- Fills (must reconstruct via matching engine)
- PnL (must calculate)
- Fees

---

## Reconstruction: Theoretical

Fills can be reconstructed because matching is deterministic:

```
Same orders + Same sequence = Same fills
```
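
To illustrate the determinism claim, here is a toy limit order book. It is a sketch only: it assumes plain price-time priority, which this notebook does not confirm for Hyperliquid, and it ignores cancels, liquidations, funding, and trigger orders. `match_orders` and the order tuple layout are hypothetical.

```python
import heapq
from itertools import count

def match_orders(orders):
    """Replay limit orders through a toy price-time-priority book; return fills.

    Each order is (order_id, side, price, size) with side 'B' (buy) or 'A' (sell).
    Prices and sizes are integers so comparisons are exact.
    """
    bids, asks = [], []   # heaps of [sort_key, arrival_seq, order_id, price, remaining]
    seq = count()         # arrival counter breaks price ties (time priority)
    fills = []
    for oid, side, px, sz in orders:
        book, opposite = (bids, asks) if side == 'B' else (asks, bids)
        while sz > 0 and opposite:
            best = opposite[0]
            best_px = best[3]
            # A buy crosses asks priced at or below it; a sell crosses bids at or above it.
            crosses = best_px <= px if side == 'B' else best_px >= px
            if not crosses:
                break
            qty = min(sz, best[4])
            fills.append({'maker': best[2], 'taker': oid, 'px': best_px, 'sz': qty})
            sz -= qty
            best[4] -= qty
            if best[4] == 0:
                heapq.heappop(opposite)
        if sz > 0:
            # Rest the remainder; bids store a negated price so the heap top is the best bid.
            key = -px if side == 'B' else px
            heapq.heappush(book, [key, next(seq), oid, px, sz])
    return fills

orders = [(1, 'A', 101, 5), (2, 'A', 100, 3), (3, 'B', 100, 2),
          (4, 'B', 101, 4), (5, 'A', 99, 1)]
first_run = match_orders(orders)
second_run = match_orders(orders)
print(first_run == second_run)
for fill in first_run:
    print(fill)
```

Replaying the same sequence twice yields identical fills, which is the property full reconstruction would rely on. A real reconstruction would also have to reproduce every rule the toy leaves out.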

**Challenges:**
- ~3-4 TB of data
- Must track order book state
- Edge cases: liquidations, funding, trigger orders

**Effort estimate:**

| Approach | Coverage | Time |
|----------|----------|------|
| `node_fills_by_block` only | Jul 2025+ (100%) | 1 day |
| + API backfill (10k limit) | ~95% of fills | 3 days |
| + Full reconstruction | 100% | 3-4 weeks |

---

## Our Approach

**Use `node_fills_by_block`** (Jul 2025 - Present)

- ~4 months of complete fill data (126 days as of Nov 29, 2025)
- Every trader, every fill, with PnL and fees
- No reconstruction needed

**What we can compute:**

| Metric | Source |
|--------|--------|
| Realized PnL | `SUM(closedPnl)` |
| Volume | `SUM(px * sz)` |
| Trade count | `COUNT(*)` |
| Maker % | `SUM(crossed=false) / COUNT(*)` |
| Win rate | Closes with positive PnL / total closes |
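
The metrics in the table can be sketched in plain Python over the flattened fills. This version assumes a "close" is any fill whose `dir` starts with `Close`, so position flips such as `Long > Short` are not counted toward win rate; `trader_metrics` is a hypothetical helper, and floats are used for brevity.

```python
from collections import defaultdict

def trader_metrics(fills):
    """Aggregate per-user metrics from flattened fills (numeric fields as strings)."""
    raw = defaultdict(lambda: {'pnl': 0.0, 'volume': 0.0, 'trades': 0,
                               'maker': 0, 'closes': 0, 'wins': 0})
    for f in fills:
        s = raw[f['user']]
        pnl = float(f['closedPnl'])
        s['pnl'] += pnl
        s['volume'] += float(f['px']) * float(f['sz'])
        s['trades'] += 1
        if not f['crossed']:          # crossed=false means the fill was a maker fill
            s['maker'] += 1
        if f['dir'].startswith('Close'):
            s['closes'] += 1
            if pnl > 0:
                s['wins'] += 1
    return {
        user: {
            'realized_pnl': s['pnl'],
            'volume': s['volume'],
            'trade_count': s['trades'],
            'maker_pct': s['maker'] / s['trades'],
            # Win rate is undefined for users with no closing fills.
            'win_rate': s['wins'] / s['closes'] if s['closes'] else None,
        }
        for user, s in raw.items()
    }

example_fills = [
    {'user': '0xa', 'px': '10', 'sz': '2', 'closedPnl': '0.0', 'crossed': True, 'dir': 'Open Long'},
    {'user': '0xa', 'px': '12', 'sz': '2', 'closedPnl': '4.0', 'crossed': False, 'dir': 'Close Long'},
    {'user': '0xb', 'px': '10', 'sz': '1', 'closedPnl': '-1.5', 'crossed': True, 'dir': 'Close Short'},
]
metrics = trader_metrics(example_fills)
print(metrics['0xa'])
```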

---

## Cost Estimates

S3 transfer at $0.09/GB:

| Dataset | Size (est.) | Cost |
|---------|-------------|------|
| `node_fills_by_block` | ~200-400 GB | ~$30 |
| `explorer_blocks` | ~3-4 TB | ~$315 |
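
The arithmetic behind these figures can be checked with a few lines. The sketch assumes 1 TB = 1,000 GB and covers data transfer only; per-request charges on a requester-pays bucket are not included. `transfer_cost` is a hypothetical helper.

```python
PRICE_PER_GB = 0.09  # USD per GB, the transfer rate quoted above

def transfer_cost(size_gb, price_per_gb=PRICE_PER_GB):
    """Estimated transfer cost in USD for a download of `size_gb` gigabytes."""
    return size_gb * price_per_gb

# (low, high) size estimates in GB, taken from the table above.
estimates = {
    'node_fills_by_block': (200, 400),
    'explorer_blocks': (3000, 4000),
}
for name, (low, high) in estimates.items():
    print(f"{name}: ${transfer_cost(low):,.0f} - ${transfer_cost(high):,.0f}")
```

The single figures in the table sit inside these ranges: ~$315 corresponds to 3.5 TB, and ~$30 to roughly 330 GB.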

**Start with `node_fills_by_block`.**