
[EPIC] Push pandas to the seams: views-frames migration for sample-bearing predictions #85

Description

@Polichinel

Problem

The FAO delivery is moving from point estimates to predictions-with-uncertainty (S samples per cell). As long as pandas sits on the data path, samples cannot propagate: a distribution becomes an object-dtype list-in-cell (the inherited PGMDataset representation) — the memory / encode-decode wall captured in register C-40 (and pipeline-core's own ~18 GB OOM, their #181). views-frames stores a distribution natively as a contiguous (N, S) float32 PredictionFrame.
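
To make the representation gap concrete, here is a minimal sketch using NumPy only. The object-dtype array is a stand-in for a pandas list-in-cell column, and the plain ndarray is a stand-in for the sample block of a PredictionFrame; the actual views-frames API is not shown, and the sizes are illustrative.

```python
import numpy as np

N, S = 4, 3  # 4 cells, 3 samples per cell (illustrative sizes)

# List-in-cell: an object-dtype array whose elements are separate Python lists.
list_in_cell = np.empty(N, dtype=object)
for i in range(N):
    list_in_cell[i] = [float(i * S + j) for j in range(S)]

# Contiguous block: one (N, S) float32 buffer holding the same numbers.
block = np.stack([np.asarray(cell, dtype=np.float32) for cell in list_in_cell])

print(list_in_cell.dtype, list_in_cell.shape)   # object (4,)
print(block.dtype, block.shape)                 # float32 (4, 3)
print(block.nbytes)                             # 48, i.e. N * S * 4 bytes
```

The object array holds N pointers to separately allocated lists of boxed floats, while the block is a single buffer of N * S * 4 bytes that can be sliced or written out without per-cell encoding.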

This epic maps every pandas site in views-postprocessing and sequences the swap to native views-frames, one site at a time, each with a parity proof, disturbing cross-repo interconnectivity as little as possible.

Key enabling fact: the samples already arrive inside the forecast PGMDataset's object-dtype cells — they're collapsed to point estimates today only because unfao/frames.py hardcodes S=1 and the wires carry object-dtype parquet. So vpp can make its interior sample-aware unilaterally, while the two wire boundaries stay frozen until cross-repo coordination.

Why it matters

The map — three concentric rings (grounded in file:line)

Ring 0 — already pandas-free (do not touch): views_postprocessing/delivery/{coverage,identity,observed_range,provenance}.py, unfao/gaul_schema.py, unfao/source_metadata.py.

Ring 1 — vpp-owned seams (UNILATERAL):

  • unfao/extraction.py — the pandas→primitives seam (cells_of/months_of/drop_months_above/unmapped_cell_count).
  • unfao/enrichment.py — pd.read_parquet(lookup) (:48) + .merge(...) (:117-119): a keyed metadata-attach join.
  • unfao/frames.py — the unused views-frames adapter; _column_2d(...).reshape(-1,1) hardcodes S=1 (:85).
  • scripts/build_gaul_lookup.py — offline lookup build (off the delivery path).
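
The S=1 hardcode in unfao/frames.py can be illustrated with a small sketch. Both function names below are hypothetical and are not the repo's `_column_2d`; the assumption is only that each cell holds either a scalar or a length-S sequence, and that every cell in a column has the same S.

```python
import numpy as np

def column_2d_point(values):
    """Point-estimate path: one scalar per cell, so S is fixed at 1."""
    return np.asarray(values, dtype=np.float32).reshape(-1, 1)

def column_2d_samples(cells):
    """Sample-aware path (hypothetical): scalar or length-S sequence per cell."""
    rows = [np.atleast_1d(np.asarray(c, dtype=np.float32)) for c in cells]
    widths = {r.shape[0] for r in rows}
    if len(widths) != 1:
        # Covers ragged columns and the empty-column case.
        raise ValueError(f"expected one sample count, got {sorted(widths)}")
    return np.stack(rows)  # shape (N, S)

scalars = [0.5, 1.5, 2.5]
samples = [[0.4, 0.6], [1.4, 1.6], [2.4, 2.6]]
print(column_2d_point(scalars).shape)    # (3, 1)
print(column_2d_samples(scalars).shape)  # (3, 1)
print(column_2d_samples(samples).shape)  # (3, 2)
```

On scalar cells the sample-aware path reduces to the same (N, 1) shape as the point path, which is the property the S=1 parity requirement relies on.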

Ring 2 — forced / contract boundaries (CROSS-REPO-GATED):

  • historical inbound read_dataframe → PGMDataset (unfao.py:58-64) + _append_metadata/_validate pandas ops — gated on pipeline-core DataFrame retirement (C-40; their #186/#207/#161/#159/#211).
  • forecast inbound pd.read_parquet(io.BytesIO(...)) (unfao.py:125) — producer wire format.
  • outbound df.to_parquet (unfao.py:298,309) → Appwrite — gated on the views-faoapi wire contract (FAO delivery path is point/DataFrame-based — won't carry rusty_bucket's pooled draws #45); faoapi already rebuilds views-frames at ingestion.

Desired end state

A forecast flows through vpp's interior as a PredictionFrame carrying (N, S), producing byte-identical deliveries to today on S=1 inputs (proving the interior is sample-ready) behind frozen wires. The final outbound hop that actually ships (N,S) to FAO (S6) and the historical inbound retirement (S7) are landed only when their cross-repo gates clear.

Scope

In: the 7 sequenced steps below (S1–S7), each independently shippable with a parity proof.
Out: Ring 0 (already pandas-free); pushing the enrichment metadata join into views-frames (S4 moves it off pandas, not into a frame — it is not frame algebra); changing the outbound format ahead of faoapi.

Stories (in sequence)

  1. S1 — frames.py: generalize to_prediction_frame to S>1 [UNILATERAL] ← recommended first
  2. S2 — extraction.py: frame-native sibling readers [UNILATERAL]
  3. S3 — forecast "convert at the door" (internal PredictionFrame) [UNILATERAL]
  4. S4 — enrichment.py: numpy/pyarrow keyed gather [UNILATERAL, low priority]
  5. S5 — build_gaul_lookup.py: pyarrow-native build [UNILATERAL, lowest priority]
  6. S6 — forecast outbound → arrow sample-frame wire [CROSS-REPO-GATED: faoapi #45 + C-40]
  7. S7 — historical inbound read_dataframe/PGMDataset retirement [CROSS-REPO-GATED: pipeline-core C-40] — do not move yet
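
As an illustration of what the "keyed gather" in S4 could look like, here is a minimal NumPy sketch of a left-join-style lookup. The function name, the sample keys, and the metadata values are all hypothetical; it assumes the lookup keys are unique and the lookup table is non-empty, and it is not the implementation in enrichment.py.

```python
import numpy as np

def keyed_gather(cell_ids, lookup_keys, lookup_values):
    """Align lookup_values to cell_ids; also return a mask of matched rows.

    Assumes lookup_keys are unique and non-empty. Rows where the mask is
    False carry a placeholder value and must be handled by the caller.
    """
    order = np.argsort(lookup_keys, kind="stable")
    sorted_keys = lookup_keys[order]
    pos = np.searchsorted(sorted_keys, cell_ids)
    pos = np.clip(pos, 0, len(sorted_keys) - 1)   # keep out-of-range ids indexable
    found = sorted_keys[pos] == cell_ids
    return lookup_values[order[pos]], found

cell_ids = np.array([30, 10, 99, 20])
lookup_keys = np.array([20, 10, 30])
lookup_values = np.array(["B", "A", "C"])

gathered, found = keyed_gather(cell_ids, lookup_keys, lookup_values)
print(found.tolist())            # [True, True, False, True]
print(gathered[found].tolist())  # ['C', 'A', 'B']
```

The mask plays the role of a merge indicator: the count of False entries is the number of cells with no lookup row.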

Dependency sequence

S1 ──► S2 ──► S3 ──► S6 (gated: faoapi #45)
        S4 (independent, low pri)
        S5 (independent, lowest pri)
S7 (gated: pipeline-core C-40) — independent of S1–S6; never blocks samples
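
The same ordering can be written down as data and checked mechanically. This is a sketch using the standard-library `graphlib` module (Python 3.9+); the dictionary is transcribed from the diagram above, and the gate labels are taken from the story list.

```python
from graphlib import TopologicalSorter

# Each step maps to the steps it depends on.
depends_on = {
    "S1": set(),
    "S2": {"S1"},
    "S3": {"S2"},
    "S6": {"S3"},
    "S4": set(),
    "S5": set(),
    "S7": set(),
}

# Steps that cannot land until a cross-repo gate clears.
gated = {"S6": "faoapi #45 + C-40", "S7": "pipeline-core C-40"}

order = list(TopologicalSorter(depends_on).static_order())
unilateral = [step for step in order if step not in gated]

print(order)       # one valid ordering; S1 precedes S2, S3, S6
print(unilateral)  # the steps vpp can land on its own
```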

Epic acceptance criteria

  • S1–S3 merged: a forecast is represented internally as a (N,S) PredictionFrame; deliveries are byte-identical to today on S=1 inputs (parity).
  • Each landed step carries a parity proof (build pandas → convert → assert arrays equal, and/or head-to-head old-vs-new) and keeps ruff + the full suite green.
  • delivery/ remains pandas-free throughout (the design-contract test stays green).
  • S6/S7 remain explicitly blocked until their gates (faoapi #45 / pipeline-core C-40) clear; no unilateral wire change.
  • Tracking issue complete.
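
The design-contract test itself is not shown in this issue. One hypothetical way to express a "pandas-free" check is to parse a module's source and look for pandas imports at any nesting level, as sketched below with the standard-library `ast` module.

```python
import ast

def imports_pandas(source: str) -> bool:
    """True if the module source imports pandas at any nesting level."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] == "pandas" for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            # level == 0 means an absolute import, not a relative one.
            if node.level == 0 and (node.module or "").split(".")[0] == "pandas":
                return True
    return False

print(imports_pandas("import pandas as pd"))                           # True
print(imports_pandas("def f():\n    from pandas import DataFrame"))    # True
print(imports_pandas("import numpy as np"))                            # False
```

Applied to each `*.py` file under views_postprocessing/delivery/, a check like this would fail as soon as a pandas import is introduced. It only sees static imports in the file itself, not transitive ones.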

Refs

Register C-40 (the pandas gate), #45 (the delivery-side draw carrier / faoapi wire), pipeline-core #186/#207/#161/#159/#211 (their DataFrame retirement). Parity pattern: tests/test_views_frames_conformance.py.
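
For orientation, a head-to-head parity check on an S=1 input might look like the sketch below. It is not the content of tests/test_views_frames_conformance.py; the helper name is hypothetical, and both "paths" are inline NumPy stand-ins for the old and new code. Comparing raw bytes rather than values is a deliberate choice here, because NaN entries compare unequal under value comparison.

```python
import numpy as np

def assert_parity(old: np.ndarray, new: np.ndarray) -> None:
    """Fail unless both arrays agree in shape, dtype, and raw bytes."""
    assert old.shape == new.shape, (old.shape, new.shape)
    assert old.dtype == new.dtype, (old.dtype, new.dtype)
    assert np.ascontiguousarray(old).tobytes() == np.ascontiguousarray(new).tobytes()

point_values = [0.25, 1.0, np.nan, 3.5]

# Old path (stand-in): scalars reshaped to a fixed S=1 column.
old = np.asarray(point_values, dtype=np.float32).reshape(-1, 1)

# New path (stand-in): each cell treated as a length-S vector, here S=1.
new = np.stack([np.atleast_1d(np.float32(v)) for v in point_values])

assert_parity(old, new)
```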

