# Hyperliquid S3 Data Sampler

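This notebook downloads a small sample from each Hyperliquid S3 bucket to show how the data is laid out. All requests use `--request-payer requester`, so request and transfer costs are billed to your AWS account.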

## Available Buckets

| Bucket | Contents | Format |
|--------|----------|--------|
| `hyperliquid-archive` | L2 book snapshots, asset contexts | JSON Lines / CSV, LZ4-compressed |
| `hl-mainnet-node-data` | Fills, trades, explorer blocks | LZ4/JSON |
| `hl-mainnet-evm-blocks` | HyperEVM block data | MessagePack + LZ4 |

Samples are saved under `../hyperliquid_samples/<bucket>/`, in subdirectories that follow the S3 layout where practical.

In [None]:
import subprocess
import json
import lz4.frame
from pathlib import Path
from datetime import datetime, timedelta

SAMPLES_DIR = Path("../hyperliquid_samples")
SAMPLES_DIR.mkdir(exist_ok=True)

def run_aws(cmd: str) -> str:
    """Run AWS CLI command with requester-pays."""
    full_cmd = f"aws s3 {cmd} --request-payer requester"
    result = subprocess.run(full_cmd, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Error: {result.stderr}")
        return ""
    return result.stdout

def list_s3(bucket: str, prefix: str = "", max_items: int = 20) -> list[str]:
    """List objects in S3 bucket."""
    output = run_aws(f"ls s3://{bucket}/{prefix}")
    lines = [l.strip() for l in output.strip().split("\n") if l.strip()][:max_items]
    for line in lines:
        print(line)
    return lines

def download_sample(bucket: str, key: str, local_subdir: str = "") -> Path:
    """Download a file from S3 to samples directory."""
    local_dir = SAMPLES_DIR / bucket / local_subdir
    local_dir.mkdir(parents=True, exist_ok=True)
    local_path = local_dir / Path(key).name
    
    if local_path.exists():
        print(f"Already exists: {local_path}")
        return local_path
    
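    print(f"Downloading s3://{bucket}/{key}...")
    run_aws(f"cp s3://{bucket}/{key} {local_path}")
    # run_aws only prints on failure, so confirm the file actually arrived
    if local_path.exists():
        print(f"Saved to: {local_path}")
    else:
        print(f"Download failed: {local_path} was not created")
    return local_path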

def detect_format(data: bytes) -> str:
    """Guess the file format from the first 1000 bytes (heuristic)."""
    text = data[:1000].decode("utf-8", errors="ignore").strip()
    if not text:
        return "txt"
    lines = [ln for ln in text.split("\n") if ln.strip()]
    # CSV: has commas and newlines, and the first line looks like a header row
    if "," in text and "\n" in text:
        first_line = lines[0]
        if all(c.isalnum() or c in ",_" for c in first_line.replace(" ", "")):
            return "csv"
    # JSON Lines: more than one line, each starting a new object
    if len(lines) > 1 and all(ln.lstrip().startswith("{") for ln in lines):
        return "jsonl"
    # Single JSON document (object or array)
    if text.startswith("{") or text.startswith("["):
        return "json"
    return "txt"

def decompress_lz4(path: Path, force_ext: str = None) -> Path:
    """Decompress LZ4 file and rename with proper extension.
    
    Args:
        path: Path to .lz4 file
        force_ext: Force a specific extension (e.g., "json", "csv")
    
    Returns:
        Path to decompressed file with appropriate extension
    """
    if path.suffix != ".lz4":
        return path
    
    # Decompress
    with open(path, "rb") as f_in:
        data = lz4.frame.decompress(f_in.read())
    
    # Determine output extension
    base = path.with_suffix("")  # Remove .lz4
    
    if force_ext:
        ext = force_ext
    elif base.suffix in [".csv", ".json", ".jsonl", ".txt"]:
        # Already has a good extension (e.g., file.csv.lz4 -> file.csv)
        ext = None  # Keep as-is
    else:
        # Detect format from content
        ext = detect_format(data)
    
    if ext:
        output_path = base.with_suffix(f".{ext}")
    else:
        output_path = base
    
    if output_path.exists():
        print(f"Already decompressed: {output_path}")
        return output_path
    
    with open(output_path, "wb") as f_out:
        f_out.write(data)
    
    print(f"Decompressed: {output_path}")
    return output_path

def decode_msgpack_to_json(path: Path) -> Path:
    """Decode MessagePack (.rmp.lz4) file to JSON."""
    import msgpack
    
    # Target path: .rmp.lz4 -> .json
    json_path = path.with_suffix("").with_suffix(".json")

    # Skip the decompress/decode work if the output is already there
    if json_path.exists():
        print(f"Already decoded: {json_path}")
        return json_path

    # Read, decompress, and decode
    with open(path, "rb") as f:
        compressed = f.read()

    decompressed = lz4.frame.decompress(compressed)
    data = msgpack.unpackb(decompressed, raw=False)
    
    with open(json_path, "w") as f:
        json.dump(data, f, indent=2, default=str)
    
    print(f"Decoded to JSON: {json_path}")
    return json_path

def preview_file(path: Path, lines: int = 20) -> None:
    """Preview first N lines of a file."""
    with open(path, "r", errors="replace") as f:
        for i, line in enumerate(f):
            if i >= lines:
                print(f"... ({lines} of many lines shown)")
                break
            print(line.rstrip())

print("Setup complete. Samples will be saved to:", SAMPLES_DIR.resolve())

---
## 1. hyperliquid-archive

Contains L2 orderbook snapshots and asset context data. Updated ~monthly.

**Structure:**
```
hyperliquid-archive/
├── market_data/
│   └── [YYYYMMDD]/
│       └── [hour 0-23]/
│           └── l2Book/
│               └── [COIN].lz4
├── asset_ctxs/
│   └── [YYYYMMDD].csv.lz4
└── Testnet/                # Present in the top-level listing; not sampled here
```
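
The layout above maps directly onto object keys. A minimal sketch, assuming the hour directory is written without zero padding (as `[hour 0-23]` suggests); `l2_book_key` and `asset_ctx_key` are hypothetical helper names, not part of this notebook:

```python
from datetime import date


def l2_book_key(day: date, hour: int, coin: str) -> str:
    """Key of one hourly L2 book file (hypothetical helper)."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0-23, got {hour}")
    # Assumption: the hour directory is not zero-padded.
    return f"market_data/{day:%Y%m%d}/{hour}/l2Book/{coin}.lz4"


def asset_ctx_key(day: date) -> str:
    """Key of one daily asset context file (hypothetical helper)."""
    return f"asset_ctxs/{day:%Y%m%d}.csv.lz4"


print(l2_book_key(date(2024, 1, 1), 12, "BTC"))
print(asset_ctx_key(date(2024, 1, 1)))
```

The returned strings can be passed as the `key` argument of `download_sample`.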

In [2]:
# Explore hyperliquid-archive structure
print("=" * 60)
print("hyperliquid-archive - Top level")
print("=" * 60)
list_s3("hyperliquid-archive")

hyperliquid-archive - Top level
PRE Testnet/
PRE asset_ctxs/
PRE market_data/


['PRE Testnet/', 'PRE asset_ctxs/', 'PRE market_data/']

In [3]:
# List available dates in market_data
print("Available dates in market_data/:")
list_s3("hyperliquid-archive", "market_data/", max_items=10)

Available dates in market_data/:
PRE 20230415/
PRE 20230416/
PRE 20230417/
PRE 20230418/
PRE 20230419/
PRE 20230420/
PRE 20230421/
PRE 20230422/
PRE 20230423/
PRE 20230424/


['PRE 20230415/',
 'PRE 20230416/',
 'PRE 20230417/',
 'PRE 20230418/',
 'PRE 20230419/',
 'PRE 20230420/',
 'PRE 20230421/',
 'PRE 20230422/',
 'PRE 20230423/',
 'PRE 20230424/']

In [4]:
# Sample: L2 Book snapshot for BTC
# Pick a date that exists (adjust based on list above)
SAMPLE_DATE = "20240101"  # Adjust to an available date
SAMPLE_HOUR = "12"
SAMPLE_COIN = "BTC"

# Check what coins are available
print(f"Coins available for {SAMPLE_DATE} hour {SAMPLE_HOUR}:")
list_s3("hyperliquid-archive", f"market_data/{SAMPLE_DATE}/{SAMPLE_HOUR}/l2Book/")

Coins available for 20240101 hour 12:
2024-01-10 13:34:55     249607 AAVE.lz4
2024-01-10 13:34:55     503267 ACE.lz4
2024-01-10 13:34:55     304366 ADA.lz4
2024-01-10 13:34:55     210242 APE.lz4
2024-01-10 13:34:55     340025 APT.lz4
2024-01-10 13:34:55     367217 ARB.lz4
2024-01-10 13:34:55     266178 ARK.lz4
2024-01-10 13:34:55     270370 ATOM.lz4
2024-01-10 13:34:55     441728 AVAX.lz4
2024-01-10 13:34:55     237988 BADGER.lz4
2024-01-10 13:34:55     154178 BANANA.lz4
2024-01-10 13:34:55     267246 BCH.lz4
2024-01-10 13:34:55     370771 BIGTIME.lz4
2024-01-10 13:34:55     316714 BLUR.lz4
2024-01-10 13:34:55     301069 BLZ.lz4
2024-01-10 13:34:55     249844 BNB.lz4
2024-01-10 13:34:55     235714 BNT.lz4
2024-01-10 13:34:55     461182 BSV.lz4
2024-01-10 13:34:55     309355 BTC.lz4
2024-01-10 13:34:55     256782 CAKE.lz4


['2024-01-10 13:34:55     249607 AAVE.lz4',
 '2024-01-10 13:34:55     503267 ACE.lz4',
 '2024-01-10 13:34:55     304366 ADA.lz4',
 '2024-01-10 13:34:55     210242 APE.lz4',
 '2024-01-10 13:34:55     340025 APT.lz4',
 '2024-01-10 13:34:55     367217 ARB.lz4',
 '2024-01-10 13:34:55     266178 ARK.lz4',
 '2024-01-10 13:34:55     270370 ATOM.lz4',
 '2024-01-10 13:34:55     441728 AVAX.lz4',
 '2024-01-10 13:34:55     237988 BADGER.lz4',
 '2024-01-10 13:34:55     154178 BANANA.lz4',
 '2024-01-10 13:34:55     267246 BCH.lz4',
 '2024-01-10 13:34:55     370771 BIGTIME.lz4',
 '2024-01-10 13:34:55     316714 BLUR.lz4',
 '2024-01-10 13:34:55     301069 BLZ.lz4',
 '2024-01-10 13:34:55     249844 BNB.lz4',
 '2024-01-10 13:34:55     235714 BNT.lz4',
 '2024-01-10 13:34:55     461182 BSV.lz4',
 '2024-01-10 13:34:55     309355 BTC.lz4',
 '2024-01-10 13:34:55     256782 CAKE.lz4']
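
`list_s3` returns the raw `aws s3 ls` lines, which come in two shapes: `PRE <name>/` for prefixes and `<date> <time> <size> <name>` for objects. A sketch of turning those lines into structured records; `S3Entry` and `parse_ls_line` are hypothetical names introduced here for illustration:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class S3Entry:
    name: str
    is_prefix: bool
    size: Optional[int] = None
    modified: Optional[datetime] = None


def parse_ls_line(line: str) -> S3Entry:
    """Parse one line of `aws s3 ls` output (hypothetical helper)."""
    line = line.strip()
    if line.startswith("PRE "):
        return S3Entry(name=line[4:], is_prefix=True)
    # Object rows: date, time, size, then the name (which may contain spaces).
    day, clock, size, name = line.split(None, 3)
    modified = datetime.strptime(f"{day} {clock}", "%Y-%m-%d %H:%M:%S")
    return S3Entry(name=name, is_prefix=False, size=int(size), modified=modified)


entry = parse_ls_line("2024-01-10 13:34:55     309355 BTC.lz4")
print(entry.name, entry.size, entry.modified)
print(parse_ls_line("PRE market_data/"))
```

With sizes available as integers, it becomes easy to pick the smallest file in a listing as the sample to download.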

In [None]:
# Download and inspect L2 book sample
l2_key = f"market_data/{SAMPLE_DATE}/{SAMPLE_HOUR}/l2Book/{SAMPLE_COIN}.lz4"
l2_path = download_sample("hyperliquid-archive", l2_key, f"market_data/{SAMPLE_DATE}/{SAMPLE_HOUR}/l2Book")

if l2_path.exists():
    # L2 book files are JSON lines
    decompressed = decompress_lz4(l2_path, force_ext="jsonl")
    print("\n" + "=" * 60)
    print(f"L2 Book Data Structure ({SAMPLE_COIN}):")
    print("=" * 60)
    preview_file(decompressed, lines=30)
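
Raw JSON lines can be very long, so a line preview is not always the clearest view of the record shape. A sketch that counts top-level keys across the first few records without assuming any schema; `summarize_jsonl` is a hypothetical helper, and the sample records below are made up for illustration, not real L2 book data:

```python
import io
import json
from collections import Counter
from typing import TextIO


def summarize_jsonl(stream: TextIO, max_records: int = 100) -> Counter:
    """Count top-level keys over the first `max_records` JSON lines."""
    keys: Counter = Counter()
    seen = 0
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if seen >= max_records:
            break
        record = json.loads(line)
        seen += 1
        if isinstance(record, dict):
            keys.update(record.keys())
        else:
            keys["<non-object>"] += 1
    return keys


# Made-up records for illustration only.
sample = io.StringIO('{"a": 1, "b": 2}\n{"a": 3}\n')
print(summarize_jsonl(sample))
```

With a real file, pass an open text handle (for example `open(decompressed)`) as the stream.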

In [6]:
# Sample: Asset contexts
print("Available asset_ctxs files:")
list_s3("hyperliquid-archive", "asset_ctxs/", max_items=10)

Available asset_ctxs files:
2024-12-27 02:36:59    1004923 20230520.csv.lz4
2024-12-27 02:36:59    1164671 20230521.csv.lz4
2024-12-27 02:36:59    1215123 20230522.csv.lz4
2024-12-27 02:36:59    1201427 20230523.csv.lz4
2024-12-27 02:36:59    1344355 20230524.csv.lz4
2024-12-27 02:36:59    1536620 20230525.csv.lz4
2024-12-27 02:36:59    1511494 20230526.csv.lz4
2024-12-27 02:36:59    1393419 20230527.csv.lz4
2024-12-27 02:36:59    1467243 20230528.csv.lz4
2024-12-27 02:37:00    1503738 20230529.csv.lz4


['2024-12-27 02:36:59    1004923 20230520.csv.lz4',
 '2024-12-27 02:36:59    1164671 20230521.csv.lz4',
 '2024-12-27 02:36:59    1215123 20230522.csv.lz4',
 '2024-12-27 02:36:59    1201427 20230523.csv.lz4',
 '2024-12-27 02:36:59    1344355 20230524.csv.lz4',
 '2024-12-27 02:36:59    1536620 20230525.csv.lz4',
 '2024-12-27 02:36:59    1511494 20230526.csv.lz4',
 '2024-12-27 02:36:59    1393419 20230527.csv.lz4',
 '2024-12-27 02:36:59    1467243 20230528.csv.lz4',
 '2024-12-27 02:37:00    1503738 20230529.csv.lz4']

In [7]:
# Download and inspect asset context sample
ctx_key = f"asset_ctxs/{SAMPLE_DATE}.csv.lz4"  # Adjust date if needed
ctx_path = download_sample("hyperliquid-archive", ctx_key, "asset_ctxs")

if ctx_path.exists():
    decompressed = decompress_lz4(ctx_path)
    print("\n" + "=" * 60)
    print("Asset Context Data Structure:")
    print("=" * 60)
    preview_file(decompressed, lines=20)

Downloading s3://hyperliquid-archive/asset_ctxs/20240101.csv.lz4...
Saved to: ../hyperliquid_samples/hyperliquid-archive/asset_ctxs/20240101.csv.lz4
Decompressed: ../hyperliquid_samples/hyperliquid-archive/asset_ctxs/20240101.csv

Asset Context Data Structure:
time,coin,funding,open_interest,prev_day_px,day_ntl_vlm,premium,oracle_px,mark_px,mid_px,impact_bid_px,impact_ask_px
2024-01-01T00:00:00Z,AAVE,0.00002789,1075.9,111.09,1759054.2941,0.00072316,108.69,108.79,108.765,108.75,108.7872
2024-01-01T00:00:00Z,ACE,0.00004605,9238.23,10.331,1168558.484093,0.00086837,9.3278,9.331,9.3361,9.334,9.3378
2024-01-01T00:00:00Z,ADA,0.0000125,725829,0.60135,723752.77572,0.00048152,0.59395,0.59424,0.594245,0.594144,0.594328
2024-01-01T00:00:00Z,APE,0.00013221,136489.6,1.6509,453141.36398,0.00155768,1.621,1.6235,1.62365,1.62285,1.6242
2024-01-01T00:00:00Z,APT,0.00018119,13074.82,9.3941,851013.11729,0.00194953,9.3869,9.4035,9.4051,9.4031,9.4073
2024-01-01T00:00:00Z,ARB,0.00008562,1043726.6,1.4815,669656
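
The header row above is enough to load the file with the standard `csv` module. A sketch using the first two data rows of that output, copied verbatim; `read_asset_ctxs` is a hypothetical helper, and treating every column except `time` and `coin` as a float is an assumption based on this sample:

```python
import csv
import io

# Header and first two rows of the output above, copied verbatim.
SAMPLE = """\
time,coin,funding,open_interest,prev_day_px,day_ntl_vlm,premium,oracle_px,mark_px,mid_px,impact_bid_px,impact_ask_px
2024-01-01T00:00:00Z,AAVE,0.00002789,1075.9,111.09,1759054.2941,0.00072316,108.69,108.79,108.765,108.75,108.7872
2024-01-01T00:00:00Z,ACE,0.00004605,9238.23,10.331,1168558.484093,0.00086837,9.3278,9.331,9.3361,9.334,9.3378
"""

TEXT_COLUMNS = {"time", "coin"}


def read_asset_ctxs(stream):
    """Yield rows as dicts; non-text columns become floats, empty cells become None."""
    for row in csv.DictReader(stream):
        yield {
            key: value if key in TEXT_COLUMNS else (float(value) if value else None)
            for key, value in row.items()
        }


for row in read_asset_ctxs(io.StringIO(SAMPLE)):
    gap = row["mark_px"] - row["oracle_px"]
    print(f'{row["coin"]}: funding={row["funding"]}, mark - oracle = {gap:.4f}')
```

For the full file, replace `io.StringIO(SAMPLE)` with `open(decompressed, newline="")`.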

---
## 2. hl-mainnet-node-data

Contains fill records, trades, and explorer data from Hyperliquid nodes.

**Structure:**
```
hl-mainnet-node-data/
├── node_fills_by_block/    # Modern format - fills batched by block
├── node_fills/             # Legacy format - individual fills
├── node_trades/            # Legacy trade records
├── explorer_blocks/        # Block data for explorer
├── misc_events_by_block/   # Present in the top-level listing; not sampled here
└── replica_cmds/           # L1 transaction commands
```

In [8]:
# Explore hl-mainnet-node-data structure
print("=" * 60)
print("hl-mainnet-node-data - Top level")
print("=" * 60)
list_s3("hl-mainnet-node-data")

hl-mainnet-node-data - Top level
PRE explorer_blocks/
PRE misc_events_by_block/
PRE node_fills/
PRE node_fills_by_block/
PRE node_trades/
PRE replica_cmds/


['PRE explorer_blocks/',
 'PRE misc_events_by_block/',
 'PRE node_fills/',
 'PRE node_fills_by_block/',
 'PRE node_trades/',
 'PRE replica_cmds/']

In [9]:
# Explore node_fills_by_block (modern format)
print("node_fills_by_block/ structure:")
list_s3("hl-mainnet-node-data", "node_fills_by_block/", max_items=15)

node_fills_by_block/ structure:
PRE hourly/


['PRE hourly/']

In [10]:
# Download a sample fills file
# First, find an available file
output = run_aws("ls s3://hl-mainnet-node-data/node_fills_by_block/ --recursive")
files = [l.split()[-1] for l in output.strip().split("\n") if l.strip()][:5]
print("Sample files available:")
for f in files:
    print(f"  {f}")

Sample files available:
  node_fills_by_block/hourly/20250727/10.lz4
  node_fills_by_block/hourly/20250727/11.lz4
  node_fills_by_block/hourly/20250727/12.lz4
  node_fills_by_block/hourly/20250727/13.lz4
  node_fills_by_block/hourly/20250727/14.lz4
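
The keys above follow an `hourly/YYYYMMDD/H.lz4` pattern, so the files for a time window can be enumerated without a recursive listing. A sketch under one loud assumption: the listing only shows two-digit hours, so whether single-digit hours are zero-padded is not confirmed here. `hourly_fill_keys` is a hypothetical helper:

```python
from datetime import datetime, timedelta
from typing import Iterator


def hourly_fill_keys(start: datetime, end: datetime) -> Iterator[str]:
    """Yield one key per hour in [start, end) (hypothetical helper)."""
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        # Assumption: the hour is not zero-padded (`9.lz4`, not `09.lz4`).
        yield f"node_fills_by_block/hourly/{current:%Y%m%d}/{current.hour}.lz4"
        current += timedelta(hours=1)


for key in hourly_fill_keys(datetime(2025, 7, 27, 22), datetime(2025, 7, 28, 1)):
    print(key)
```

If the padding assumption turns out to be wrong, changing `{current.hour}` to `{current.hour:02d}` is the only edit needed.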


In [None]:
# Download and inspect first available fills file
if files:
    sample_key = files[0]
    fills_path = download_sample("hl-mainnet-node-data", sample_key, "node_fills_by_block")
    
    if fills_path.exists():
        # Node fills are JSON lines
        if fills_path.suffix == ".lz4":
            fills_path = decompress_lz4(fills_path, force_ext="jsonl")
        
        print("\n" + "=" * 60)
        print("Node Fills by Block Data Structure:")
        print("=" * 60)
        preview_file(fills_path, lines=30)

In [12]:
# Explore legacy node_fills format
print("node_fills/ (legacy format):")
list_s3("hl-mainnet-node-data", "node_fills/", max_items=10)

node_fills/ (legacy format):
PRE hourly/


['PRE hourly/']

In [13]:
# Explore node_trades
print("node_trades/:")
list_s3("hl-mainnet-node-data", "node_trades/", max_items=10)

node_trades/:
PRE hourly/


['PRE hourly/']

In [14]:
# Explore explorer_blocks
print("explorer_blocks/:")
list_s3("hl-mainnet-node-data", "explorer_blocks/", max_items=10)

explorer_blocks/:
PRE 0/
PRE 100000000/
PRE 200000000/
PRE 300000000/
PRE 400000000/
PRE 500000000/
PRE 600000000/
PRE 700000000/
PRE 800000000/


['PRE 0/',
 'PRE 100000000/',
 'PRE 200000000/',
 'PRE 300000000/',
 'PRE 400000000/',
 'PRE 500000000/',
 'PRE 600000000/',
 'PRE 700000000/',
 'PRE 800000000/']

---
## 3. hl-mainnet-evm-blocks

Contains HyperEVM block data for indexing without running a node.

**Structure:**
```
hl-mainnet-evm-blocks/
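└── [block number rounded down to 1,000,000]/
    └── [block number rounded down to 1,000]/
        └── [block_number].rmp.lz4
```

Files are MessagePack-encoded and LZ4-compressed. Directory names are derived from the block number, so keys are predictable.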
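
A block's key can be computed directly from its number. A minimal sketch, inferred from the listings in this section rather than from a specification: the first directory is taken to be the block number rounded down to a multiple of 1,000,000 and the second rounded down to a multiple of 1,000. `evm_block_key` is a hypothetical helper:

```python
def evm_block_key(block_number: int) -> str:
    """Key of one EVM block file, inferred from the bucket listings (hypothetical helper)."""
    if block_number < 0:
        raise ValueError("block_number must be non-negative")
    # Assumption: directories are the block number floored to 1,000,000 and 1,000.
    millions = block_number // 1_000_000 * 1_000_000
    thousands = block_number // 1_000 * 1_000
    return f"{millions}/{thousands}/{block_number}.rmp.lz4"


print(evm_block_key(1))
print(evm_block_key(12_345_678))
```

The first call reproduces `0/0/1.rmp.lz4`, which matches a key found by listing the bucket later in this section.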

In [15]:
# Explore hl-mainnet-evm-blocks structure
print("=" * 60)
print("hl-mainnet-evm-blocks - Top level")
print("=" * 60)
list_s3("hl-mainnet-evm-blocks")

hl-mainnet-evm-blocks - Top level
PRE 0/
PRE 1000000/
PRE 10000000/
PRE 11000000/
PRE 12000000/
PRE 13000000/
PRE 14000000/
PRE 15000000/
PRE 16000000/
PRE 17000000/
PRE 18000000/
PRE 19000000/
PRE 2000000/
PRE 20000000/
PRE 3000000/
PRE 4000000/
PRE 5000000/
PRE 6000000/
PRE 7000000/
PRE 8000000/


['PRE 0/',
 'PRE 1000000/',
 'PRE 10000000/',
 'PRE 11000000/',
 'PRE 12000000/',
 'PRE 13000000/',
 'PRE 14000000/',
 'PRE 15000000/',
 'PRE 16000000/',
 'PRE 17000000/',
 'PRE 18000000/',
 'PRE 19000000/',
 'PRE 2000000/',
 'PRE 20000000/',
 'PRE 3000000/',
 'PRE 4000000/',
 'PRE 5000000/',
 'PRE 6000000/',
 'PRE 7000000/',
 'PRE 8000000/']
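
The listing is in lexicographic order, which is why `10000000/` appears before `2000000/`, and `max_items=20` may hide later entries. A sketch that recovers numeric order from the returned lines; `numeric_prefixes` is a hypothetical helper (requires Python 3.9+ for `str.removeprefix`):

```python
def numeric_prefixes(lines: list[str]) -> list[int]:
    """Extract integer prefixes from `PRE <n>/` lines and sort them numerically."""
    values = []
    for line in lines:
        name = line.strip().removeprefix("PRE ").rstrip("/")
        if name.isdigit():  # skip anything that is not a pure number
            values.append(int(name))
    return sorted(values)


listing = ["PRE 0/", "PRE 1000000/", "PRE 10000000/", "PRE 2000000/"]
print(numeric_prefixes(listing))
```

Taking the last element of the result gives the highest block range present in the lines that were passed in.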

In [16]:
# Dive into structure
print("Exploring nested structure:")
list_s3("hl-mainnet-evm-blocks", "0/", max_items=10)

Exploring nested structure:
PRE 0/
PRE 1000/
PRE 10000/
PRE 100000/
PRE 101000/
PRE 102000/
PRE 103000/
PRE 104000/
PRE 105000/
PRE 106000/


['PRE 0/',
 'PRE 1000/',
 'PRE 10000/',
 'PRE 100000/',
 'PRE 101000/',
 'PRE 102000/',
 'PRE 103000/',
 'PRE 104000/',
 'PRE 105000/',
 'PRE 106000/']

In [17]:
# Download a sample EVM block
import msgpack

# Find a sample block file
output = run_aws("ls s3://hl-mainnet-evm-blocks/0/0/ --recursive")
evm_files = [l.split()[-1] for l in output.strip().split("\n") if l.strip() and ".rmp.lz4" in l][:3]
print("Sample EVM block files:")
for f in evm_files:
    print(f"  {f}")

Sample EVM block files:
  0/0/1.rmp.lz4
  0/0/10.rmp.lz4
  0/0/100.rmp.lz4


In [None]:
# Download and decode EVM block to JSON
if evm_files:
    evm_key = evm_files[0]
    evm_path = download_sample("hl-mainnet-evm-blocks", evm_key, "blocks")
    
    if evm_path.exists():
        # Decode MessagePack to JSON
        json_path = decode_msgpack_to_json(evm_path)
        
        print("\n" + "=" * 60)
        print("EVM Block Data Structure:")
        print("=" * 60)
        preview_file(json_path, lines=50)
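
`decode_msgpack_to_json` relies on `default=str`, so any `bytes` value ends up in the JSON as its Python repr (for example `b'\x12'`). A sketch that converts nested binary values to hex strings before dumping, assuming binary fields arrive as `bytes` after `msgpack.unpackb`; `to_jsonable` is a hypothetical helper and the sample structure is made up, not a real block:

```python
import json
from typing import Any


def to_jsonable(obj: Any) -> Any:
    """Recursively replace bytes with 0x-prefixed hex strings."""
    if isinstance(obj, (bytes, bytearray)):
        return "0x" + bytes(obj).hex()
    if isinstance(obj, dict):
        # json.dumps rejects bytes keys, so keys are converted as well.
        return {
            (to_jsonable(k) if isinstance(k, (bytes, bytearray)) else k): to_jsonable(v)
            for k, v in obj.items()
        }
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(item) for item in obj]
    return obj


# Made-up structure for illustration; real block fields may differ.
sample = {"hash": b"\x12\xab", "items": [b"\x00", 7]}
print(json.dumps(to_jsonable(sample)))
```

Applied to the decoded block, `json.dump(to_jsonable(data), f, indent=2)` would then no longer need `default=str` for binary fields.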

---
## Summary: What's in Each Bucket

### hyperliquid-archive
- **market_data/**: L2 orderbook snapshots per coin per hour
  - Format: JSON lines → saved as `.jsonl`
  - Use case: Historical orderbook analysis, liquidity studies
- **asset_ctxs/**: One file per day of timestamped per-coin context rows
  - Format: CSV → saved as `.csv`
  - Use case: Tracking funding, open interest, and oracle/mark prices over time

### hl-mainnet-node-data
- **node_fills_by_block/**: Trade fills grouped by block (modern)
  - Format: JSON lines → saved as `.jsonl`
  - Use case: Trade analysis, volume studies
- **node_fills/**: Legacy individual fill records
- **node_trades/**: Legacy trade format
- **explorer_blocks/**: Block data for explorer indexing
- **misc_events_by_block/**: Listed in the bucket but not sampled in this notebook
- **replica_cmds/**: L1 transaction commands

### hl-mainnet-evm-blocks
- EVM block data in MessagePack format → decoded to `.json`
- Use case: Indexing HyperEVM without running a node

---
### File Extensions After Download
| Original | Decompressed |
|----------|--------------|
| `BTC.lz4` | `BTC.jsonl` |
| `20240101.csv.lz4` | `20240101.csv` |
| `10.lz4` (fills) | `10.jsonl` |
| `1.rmp.lz4` (EVM) | `1.json` |

Check `../hyperliquid_samples/` for downloaded files organized by bucket.

In [19]:
# Show what we've downloaded
print("Downloaded samples:")
print("=" * 60)
for path in sorted(SAMPLES_DIR.rglob("*")):
    if path.is_file():
        size_kb = path.stat().st_size / 1024
        rel_path = path.relative_to(SAMPLES_DIR)
        print(f"  {rel_path}  ({size_kb:.1f} KB)")

Downloaded samples:
  README.md  (1.2 KB)
  hl-mainnet-evm-blocks/blocks/1.rmp.lz4  (0.5 KB)
  hl-mainnet-node-data/node_fills_by_block/10  (102161.9 KB)
  hl-mainnet-node-data/node_fills_by_block/10.lz4  (20848.9 KB)
  hyperliquid-archive/asset_ctxs/20240101.csv  (16072.7 KB)
  hyperliquid-archive/asset_ctxs/20240101.csv.lz4  (5879.0 KB)
  hyperliquid-archive/market_data/20240101/12/l2Book/BTC  (8963.4 KB)
  hyperliquid-archive/market_data/20240101/12/l2Book/BTC.lz4  (302.1 KB)
