# 1) Parse and order book reconstruction (Seg=48)

This notebook contains instructions and command cells only; the logic lives in `src/` and `scripts/`.
Please make sure `0_env_data_prep.ipynb` has been run (dependencies + DuckDB init) before proceeding.


In [120]:
# Repository path
import os
REPO_DIR = '/content/drive/MyDrive/00_EUREX/eurex-liquidity-demo'
assert os.path.exists(REPO_DIR), f'Repo not found: {REPO_DIR}'

# Source root (use Google Drive persistent extraction)
SRC_ROOT = f"{REPO_DIR}/data_raw/Sample_Eurex_20201201_10MktSegID"
assert os.path.exists(SRC_ROOT), f"Source root not found: {SRC_ROOT}"

print("REPO_DIR =", REPO_DIR)
print("SRC_ROOT =", SRC_ROOT)
print("DI exists under SRC_ROOT?", any(fn.endswith("DI_48_20201201.csv") for dp, dn, fns in os.walk(SRC_ROOT) for fn in fns))

REPO_DIR = /content/drive/MyDrive/00_EUREX/eurex-liquidity-demo
SRC_ROOT = /content/drive/MyDrive/00_EUREX/eurex-liquidity-demo/data_raw/Sample_Eurex_20201201_10MktSegID
DI exists under SRC_ROOT? True


## Propose Seg=48 opening window (continuous trading, first 10 minutes)

The script will scan DI timestamps (and optionally ISC/PSC) to detect the opening
minute of sustained activity and propose a 10-minute window. This step does not
perform slicing yet; it only writes a JSON manifest for your review.
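
As a rough illustration of the detection idea (not the actual logic in `scripts/make_samples.py`), the sketch below buckets nanosecond timestamps by minute and picks the first minute that starts a run of `sustain` consecutive active minutes. The function names `ns_to_iso` and `propose_window`, the reading of `--sustain` as "consecutive active minutes", and the rule that one timestamp makes a minute "active" are all assumptions.

```python
import datetime as dt

NS_PER_SEC = 1_000_000_000
NS_PER_MIN = 60 * NS_PER_SEC


def ns_to_iso(ns: int) -> str:
    """Format epoch nanoseconds as an ISO-8601 UTC string with 9 fractional digits."""
    sec, rem_ns = divmod(ns, NS_PER_SEC)
    base = dt.datetime.fromtimestamp(sec, tz=dt.timezone.utc)
    return base.strftime("%Y-%m-%dT%H:%M:%S") + f".{rem_ns:09d}Z"


def propose_window(ts_ns, sustain=5, window_minutes=10):
    """Return a proposed window dict, or None if no sustained run is found.

    Assumption: a minute counts as active if it holds at least one timestamp.
    """
    active = {t // NS_PER_MIN for t in ts_ns}
    for minute in sorted(active):
        # Require `sustain` consecutive active minutes starting at `minute`.
        if all((minute + k) in active for k in range(sustain)):
            open_ns = minute * NS_PER_MIN
            end_ns = open_ns + window_minutes * NS_PER_MIN
            return {
                "open_ns": open_ns,
                "open_iso": ns_to_iso(open_ns),
                "end_ns": end_ns,
                "end_iso": ns_to_iso(end_ns),
            }
    return None
```

The helper uses the timezone-aware `datetime.fromtimestamp(..., tz=timezone.utc)` rather than `utcfromtimestamp`, which is deprecated as of Python 3.12.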

In [121]:
# Propose window (no slicing yet)
# This will scan DI timestamps and write a JSON manifest under data_samples/48-FSTK-ADSG/raw/ for review.
!python "{REPO_DIR}/scripts/make_samples.py" \
  --seg 48 \
  --src "{SRC_ROOT}" \
  --out "{REPO_DIR}/data_samples/48-FSTK-ADSG" \
  --sustain 5 \
  --window-minutes 10 \
  --propose-only


== Proposed window ==
{
  "segment": 48,
  "open_ns": 1606809660000000000,
  "open_iso": "2020-12-01T08:01:00.000000000Z",
  "end_ns": 1606810260000000000,
  "end_iso": "2020-12-01T08:11:00.000000000Z"
}
ISC earliest: 1606804200139187733 2020-12-01T06:30:00.139187733Z
PSC earliest: 1606804200139187733 2020-12-01T06:30:00.139187733Z
Saved proposal: /content/drive/MyDrive/00_EUREX/eurex-liquidity-demo/data_samples/48-FSTK-ADSG/raw/proposed_window_seg48.json


## Slice files into a 10-minute sample

Use the proposed window to slice DI/DS/ISC/PSC into [open, end) and copy IS fully into `data_samples/48-FSTK-ADSG/`.
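
The half-open interval `[open, end)` means a row is kept when `open_ns <= ts < end_ns`, so a row stamped exactly at `end_ns` belongs to the next window, not this one. A minimal sketch of that filter, assuming rows are already parsed into dicts with a `ts_ns` key (the helper name `slice_half_open` is hypothetical):

```python
def slice_half_open(rows, open_ns, end_ns, ts_key="ts_ns"):
    """Keep rows with open_ns <= ts < end_ns; return (kept_rows, scanned_count)."""
    kept, scanned = [], 0
    for row in rows:
        scanned += 1
        if open_ns <= row[ts_key] < end_ns:
            kept.append(row)
    return kept, scanned
```

Returning both counts mirrors the `written=` / `scanned=` figures in the slicing summary. A file whose rows all fall before `open_ns` would report `written=0` under this rule.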

In [122]:
# Slice files into a 10-minute sample using the proposed window
!python "{REPO_DIR}/scripts/make_samples.py" \
  --seg 48 \
  --src "{SRC_ROOT}" \
  --out "{REPO_DIR}/data_samples/48-FSTK-ADSG" \
  --sustain 5 \
  --window-minutes 10

== Proposed window ==
{
  "segment": 48,
  "open_ns": 1606809660000000000,
  "open_iso": "2020-12-01T08:01:00.000000000Z",
  "end_ns": 1606810260000000000,
  "end_iso": "2020-12-01T08:11:00.000000000Z"
}
ISC earliest: 1606804200139187733 2020-12-01T06:30:00.139187733Z
PSC earliest: 1606804200139187733 2020-12-01T06:30:00.139187733Z
Saved proposal: /content/drive/MyDrive/00_EUREX/eurex-liquidity-demo/data_samples/48-FSTK-ADSG/raw/proposed_window_seg48.json
[INFO] Proceeding to slice files using the proposed window [open_ns, end_ns)...
== Slicing summary ==
  DI: written=60 scanned=7175
  DS: written=11 scanned=9616
  ISC: written=0 scanned=4
  PSC: written=0 scanned=8
  IS: written=16 scanned=16


## Inspect DI schema and write mapping JSON

This step infers a minimal field mapping for DI entries from the sliced sample and writes it to `data_samples/48-FSTK-ADSG/raw/di_mapping_seg48.json`. It also prints a few parsed entries for visual validation.
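
To show how such a mapping could be consumed, here is a small sketch that applies a `*_idx` mapping to one entry's positional fields. The index values are the ones printed by the inspection cell. The helper `parse_entry`, the flat list-of-strings layout, and the placeholder values at unused positions are assumptions, not the real DI format.

```python
import json

# Index values as printed by the schema inspection step.
MAPPING = json.loads("""{
  "md_update_action_idx": 0, "entry_type_idx": 2, "price_level_idx": 1,
  "security_id_idx": 3, "price_idx": 5, "size_idx": 6, "ts_ns_idx": 9
}""")


def parse_entry(fields, mapping):
    """Pick named values out of one entry's positional string fields."""
    casts = {"price": float}  # everything else is treated as an integer
    parsed = {}
    for key, idx in mapping.items():
        name = key.removesuffix("_idx")
        parsed[name] = casts.get(name, int)(fields[idx])
    return parsed


# Hypothetical raw entry; positions 4, 7 and 8 are unused placeholders.
fields = ["0", "0", "0", "2788279", "", "269.9571", "5", "", "", "1606809697524721888"]
entry = parse_entry(fields, MAPPING)
```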

## Check Maximum Depth in DI Data

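Before running multi-level construction (L5/L10/L20), verify the maximum depth available in the raw data so that no computation is wasted on levels that do not exist. The check reads the DI mapping JSON written by the schema inspection step, so that file must already exist when this cell runs.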
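
The idea behind the check can be sketched as a frequency count over the `price_level` field of parsed entries. This is only an illustration on already-parsed dicts; the helper name `level_distribution` is hypothetical and `scripts/check_max_depth.py` may work differently.

```python
from collections import Counter


def level_distribution(entries):
    """Return (max_level, {level: (count, percent)}) for parsed DI entries."""
    counts = Counter(e["price_level"] for e in entries)
    if not counts:
        return None, {}
    total = sum(counts.values())
    dist = {lvl: (n, 100.0 * n / total) for lvl, n in sorted(counts.items())}
    return max(counts), dist
```

Because level numbers can be sparse, the maximum level and the number of distinct levels are different quantities; the distribution makes both visible.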

In [123]:
# === Check Maximum Depth Level in DI Data ===
# This prevents wasting time constructing L10/L20 if the data only has 5 levels

SEG = 48
DI_FULL = f"{REPO_DIR}/data_raw/Sample_Eurex_20201201_10MktSegID/48/DI_48_20201201.csv"
MAPPING_JSON = f"{REPO_DIR}/data_samples/48-FSTK-ADSG/raw/di_mapping_seg48.json"

print("🔍 Checking maximum depth level in DI data...")
print("⏱️  Sampling first 1000 lines (use --sample-limit for different amount)")
print()

!python "{REPO_DIR}/scripts/check_max_depth.py" \
  --di "{DI_FULL}" \
  --mapping "{MAPPING_JSON}" \
  --sample-limit 1000

🔍 Checking maximum depth level in DI data...
⏱️  Sampling first 1000 lines (use --sample-limit for different amount)

[INFO] Analyzing DI file: /content/drive/MyDrive/00_EUREX/eurex-liquidity-demo/data_raw/Sample_Eurex_20201201_10MktSegID/48/DI_48_20201201.csv
[INFO] Sampling first 1000 lines

MAXIMUM DEPTH ANALYSIS
Maximum price level found: 5
Total entries analyzed: 2698
Lines scanned: 1000

Price Level Distribution:
  Level  0:      963 entries ( 35.7%)
  Level  2:      959 entries ( 35.5%)
  Level  5:      776 entries ( 28.8%)

RECOMMENDATION:
✅ Maximum useful level: L5
⚠️  Using --levels > 5 will not capture additional depth
💡 Suggested configurations:
   • L1 : Basic best bid/ask
   • L5 : Rich market depth (recommended)
   • L5: Maximum available depth


## ⚠️ Important: Depth Analysis Results for Segment 48

**The 1000 sampled lines contain only 3 distinct price levels (sparse numbering):**
- **Level 0**: ~36% of entries (best bid/ask)
- **Level 2**: ~36% of entries (second tier)
- **Level 5**: ~29% of entries (third tier)

**Levels 1, 3, 4, and 6+ do not appear in the sample.**

**Recommended configuration:**
- ✅ **L1**: Use for best bid/ask analysis only
- ✅ **L5**: Use for maximum depth (captures all 3 actual levels: 0, 2, 5)
- ❌ **L10/L20**: Not needed - would produce identical results to L5

In [124]:
# Run schema inspection on the sliced DI sample (outputs to organized structure)
MAPPING_JSON = f"{REPO_DIR}/data_samples/48-FSTK-ADSG/raw/di_mapping_seg48.json"      # Mapping JSON to write
DI_SLICE = f"{REPO_DIR}/data_samples/48-FSTK-ADSG/raw/DI_48_20201201_window.csv"       # Sliced 10-minute DI sample

!python "{REPO_DIR}/scripts/inspect_schema.py" \
  --di "{DI_SLICE}" \
  --out "{MAPPING_JSON}" \
  --sample-limit 200

[OK] Wrote DI mapping JSON to: /content/drive/MyDrive/00_EUREX/eurex-liquidity-demo/data_samples/48-FSTK-ADSG/raw/di_mapping_seg48.json
{
  "md_update_action_idx": 0,
  "entry_type_idx": 2,
  "price_level_idx": 1,
  "security_id_idx": 3,
  "price_idx": 5,
  "size_idx": 6,
  "ts_ns_idx": 9
}

[Preview] First 6 parsed entries:
{'md_update_action': 0, 'entry_type': 0, 'price_level': 0, 'security_id': 2788279, 'price': 269.9571, 'size': 5, 'ts_ns': 1606809697524721888}
{'md_update_action': 0, 'entry_type': 1, 'price_level': 0, 'security_id': 2788279, 'price': 270.9045, 'size': 5, 'ts_ns': 1606809697524721888}
{'md_update_action': 0, 'entry_type': 0, 'price_level': 2, 'security_id': 2788279, 'price': 269.9571, 'size': 5, 'ts_ns': 1606809701948251798}
{'md_update_action': 0, 'entry_type': 1, 'price_level': 2, 'security_id': 2788279, 'price': 270.9045, 'size': 5, 'ts_ns': 1606809701948251798}
{'md_update_action': 0, 'entry_type': 0, 'price_level': 0, 'security_id': 2788279, 'price': 269.8074,

## Build L1 snapshots from DI

This step parses the sliced DI using the inferred mapping and reconstructs best bid/ask snapshots per security. Outputs are written to `data_samples/48-FSTK-ADSG/l1/` as Parquet and CSV for quick inspection.
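
A minimal sketch of the reconstruction idea, working on entries already parsed into dicts. It is not the code in `scripts/parse_and_l1.py`. It assumes `entry_type` 0 is a bid and 1 is an ask (the FIX `MDEntryType` convention), that `price_level` 0 is the top of book, and it ignores `md_update_action`, so deletes are not handled. The function name `build_l1` is hypothetical.

```python
def build_l1(entries):
    """Emit one snapshot per level-0 entry, carrying the other side forward."""
    book = {}  # security_id -> current best bid/ask state
    snapshots = []
    for e in entries:
        if e["price_level"] != 0:
            continue  # deeper levels do not affect L1
        state = book.setdefault(e["security_id"], {
            "best_bid": None, "bid_size": None,
            "best_ask": None, "ask_size": None,
        })
        if e["entry_type"] == 0:      # assumed: bid side
            state["best_bid"], state["bid_size"] = e["price"], e["size"]
        elif e["entry_type"] == 1:    # assumed: ask side
            state["best_ask"], state["ask_size"] = e["price"], e["size"]
        else:
            continue  # trades and other entry types are skipped
        snapshots.append({
            "ts_ns": e["ts_ns"],
            **state,  # copies the state as it is right now
            "action": e["md_update_action"],
            "security_id": e["security_id"],
        })
    return snapshots
```

With this rule the first snapshot for a security has only one side populated, which is one possible reason for a missing `ask_size` in the first row of such a preview.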

In [125]:
# Run L1 snapshot builder (10-minute window demo)
SEG = 48
DI_SLICE = f"{REPO_DIR}/data_samples/48-FSTK-ADSG/raw/DI_48_20201201_window.csv"       # Sliced 10-minute DI sample
MAPPING_JSON = f"{REPO_DIR}/data_samples/48-FSTK-ADSG/raw/di_mapping_seg48.json"        # Mapping from schema inspection
OUT_DIR = f"{REPO_DIR}/data_samples/48-FSTK-ADSG"

!python "{REPO_DIR}/scripts/parse_and_l1.py" \
  --seg {SEG} \
  --di "{DI_SLICE}" \
  --mapping "{MAPPING_JSON}" \
  --out "{OUT_DIR}"

[OK] Wrote: /content/drive/MyDrive/00_EUREX/eurex-liquidity-demo/data_samples/48-FSTK-ADSG/l1/l1_snapshots_seg48.parquet rows= 100
[OK] Wrote: /content/drive/MyDrive/00_EUREX/eurex-liquidity-demo/data_samples/48-FSTK-ADSG/l1/l1_snapshots_seg48.csv rows= 100

[Preview] First 6 rows:
                 ts_ns  best_bid  bid_size  ...  ask_size  action  security_id
0  1606809697524721888  269.9571         5  ...       NaN       0      2788279
1  1606809697524721888  269.9571         5  ...       5.0       0      2788279
2  1606809778487620430  269.8074         5  ...       5.0       0      2788279
3  1606809778487620430  269.8074         5  ...       5.0       0      2788279
4  1606809810878532690  269.9571         5  ...       5.0       0      2788279
5  1606809810878532690  269.9571         5  ...       5.0       0      2788279

[6 rows x 7 columns]


## Aggregate to 1-second metrics

This step aggregates L1 snapshots into 1-second metrics per security and joins DI action counts (updates/cancels). Outputs are written to `data_samples/48-FSTK-ADSG/l1/`.
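
A rough sketch of the per-second aggregation, assuming the last complete snapshot in each second represents that second and that the microprice is the size-weighted price `(bid * ask_size + ask * bid_size) / (bid_size + ask_size)`. The function name `aggregate_1s` and the `spread` and `mid` columns are assumptions. The join with DI update/cancel counts is left out.

```python
NS_PER_SEC = 1_000_000_000


def aggregate_1s(snapshots):
    """Keep the last complete snapshot per (security_id, second) and derive metrics."""
    last = {}
    for s in snapshots:  # assumed to be sorted by ts_ns
        sides = (s["best_bid"], s["best_ask"], s["bid_size"], s["ask_size"])
        if any(v is None for v in sides):
            continue  # skip one-sided books
        if s["bid_size"] + s["ask_size"] == 0:
            continue  # microprice is undefined without size
        last[(s["security_id"], s["ts_ns"] // NS_PER_SEC)] = s

    rows = []
    for (security_id, ts_s), s in sorted(last.items()):
        bid, ask = s["best_bid"], s["best_ask"]
        bid_size, ask_size = s["bid_size"], s["ask_size"]
        rows.append({
            "security_id": security_id,
            "ts_s": ts_s,
            "best_bid": bid,
            "best_ask": ask,
            "spread": ask - bid,
            "mid": (bid + ask) / 2,
            # Bid is weighted by ask size and vice versa.
            "microprice": (bid * ask_size + ask * bid_size) / (bid_size + ask_size),
        })
    return rows
```

For a bid of 269.9571 and an ask of 270.9045 with equal sizes, this formula gives 270.4308, the midpoint.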

In [126]:
# Run 1-second aggregation
SEG = 48
L1_CSV = f"{REPO_DIR}/data_samples/48-FSTK-ADSG/l1/l1_snapshots_seg{SEG}.csv"
DI_SLICE = f"{REPO_DIR}/data_samples/48-FSTK-ADSG/raw/DI_48_20201201_window.csv"
MAPPING_JSON = f"{REPO_DIR}/data_samples/48-FSTK-ADSG/raw/di_mapping_seg48.json"
OUT_DIR = f"{REPO_DIR}/data_samples/48-FSTK-ADSG/"

!python "{REPO_DIR}/scripts/aggregate_1s.py" \
  --seg {SEG} \
  --l1 "{L1_CSV}" \
  --di "{DI_SLICE}" \
  --mapping "{MAPPING_JSON}" \
  --out "{OUT_DIR}"


[OK] Wrote: /content/drive/MyDrive/00_EUREX/eurex-liquidity-demo/data_samples/48-FSTK-ADSG/l1/l1_agg_1s_seg48.parquet rows= 33
[OK] Wrote: /content/drive/MyDrive/00_EUREX/eurex-liquidity-demo/data_samples/48-FSTK-ADSG/l1/l1_agg_1s_seg48.csv rows= 33

[Preview] First 6 rows:
   security_id        ts_s  best_bid  ...  microprice  update_count  cancel_count
0      2788279  1606809697  269.9571  ...   270.43080             2             0
1      2788279  1606809778  269.8074  ...   270.28080             2             0
2      2788279  1606809810  269.9571  ...   270.43080             4             0
3      2788279  1606809816  270.1068  ...   270.58075             2             0
4      2788279  1606809824  269.9571  ...   270.43080             2             0
5      2788279  1606809886  270.2565  ...   270.73070             2             0

[6 rows x 12 columns]


In [127]:
# === Full-Day L5 Maximum Depth Order Book Construction ===
# Build L5 snapshots using complete daily DI data
# NOTE: This dataset only has 3 actual levels (0, 2, 5), so L5 captures ALL available depth

SEG = 48
DI_FULL = f"{REPO_DIR}/data_raw/Sample_Eurex_20201201_10MktSegID/48/DI_48_20201201.csv"  # Complete daily file
MAPPING_JSON = f"{REPO_DIR}/data_samples/48-FSTK-ADSG/raw/di_mapping_seg48.json"          # Mapping from schema inspection
OUT_DIR = f"{REPO_DIR}/data_samples/48-FSTK-ADSG"                                         # Base directory

print("🚀 Building L5 order book from full day data")
print(f"Input: {DI_FULL}")
print(f"Output: {OUT_DIR}/l5/")
print("Note: Captures all 3 available depth levels (0, 2, 5)")

!python "{REPO_DIR}/scripts/parse_and_l5.py" \
  --seg {SEG} \
  --di "{DI_FULL}" \
  --mapping "{MAPPING_JSON}" \
  --out "{OUT_DIR}" \
  --levels 5

🚀 Building L5 order book from full day data
Input: /content/drive/MyDrive/00_EUREX/eurex-liquidity-demo/data_raw/Sample_Eurex_20201201_10MktSegID/48/DI_48_20201201.csv
Output: /content/drive/MyDrive/00_EUREX/eurex-liquidity-demo/data_samples/48-FSTK-ADSG/l5/
Note: Captures all 3 available depth levels (0, 2, 5)
[INFO] Parsing DI file: /content/drive/MyDrive/00_EUREX/eurex-liquidity-demo/data_raw/Sample_Eurex_20201201_10MktSegID/48/DI_48_20201201.csv
[INFO] Tracking top 5 levels per side
  Processed 5000 lines...
[INFO] Processing complete:
  Lines: 7175
  Events: 19658
  Changes: 19098
  Snapshots: 19098
[OK] Wrote: /content/drive/MyDrive/00_EUREX/eurex-liquidity-demo/data_samples/48-FSTK-ADSG/l5/l5_snapshots_seg48.parquet rows= 19098
[OK] Wrote: /content/drive/MyDrive/00_EUREX/eurex-liquidity-demo/data_samples/48-FSTK-ADSG/l5/l5_snapshots_seg48.csv rows= 19098

[Preview] First 3 rows:
                 ts_ns  bid_price_1  bid_size_1  ...  ask_size_5  security_id  action
0  160680969