pxf: TableReader streaming + Scan/BindRow per-row binding #13

Merged
trendvidia merged 2 commits into main from table-reader
May 12, 2026

@trendvidia
Owner

Summary

Fourth PR in the v0.72–v0.75 cpp catch-up. `UnmarshalFull` materializes every `@table` row into `Result::Tables()` — fine for small datasets, but it defeats the point of `@table` for the CSV-replacement workload it was designed for. This PR adds the streaming alternative: working-set memory bounded by the largest single row, not the full table (draft §3.4.4).

New `<protowire/pxf/table_reader.h>`:

  • `TableReader::Create(std::istream*)`: consumes leading directives and the `@table TYPE (cols)` header, leaving the reader positioned at the first row. The header is capped at 64 KiB (`kDefaultHeaderMaxBytes`) to fail fast on misuse (a non-`@table` input shouldn't OOM).
  • `Type()` / `Columns()` / `Directives()` accessors.
  • `Next(TableRow*)` reads one row at a time. Per-row arity and v1 cell-grammar checks happen at consume time, not deferred to EOF. Errors are sticky.
  • `Scan(Message*)` is `Next` + `BindRow`.
  • `Tail()` returns the buffered + remaining bytes as a fresh `std::istream` so callers can chain a second `Create()` for multi-`@table` documents.
  • `BindRow(msg, columns, row)`: the exported per-row binder, used by `Scan` and by callers iterating `Result::Tables()[i].rows` from the materializing path. Strategy: format-and-reparse. It renders the row as a synthetic PXF body (`<col> = <val>` per non-`std::nullopt` cell) and runs it through the standard `Unmarshal` pipeline. This reuses every branch of the existing decoder (WKT timestamps/durations, wrapper nullability, enum-by-name, `pxf.required` / `pxf.default`, oneof) instead of growing a parallel Value→FieldDescriptor switch. `SkipValidate` avoids re-running the reserved-name check per row.
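
To make the format-and-reparse idea concrete, here is a minimal sketch of just the rendering half. `RenderRowAsBody` is a hypothetical helper, not the function in `table_reader.cc`. It assumes each present cell already holds valid PXF value text and that an absent cell is modelled as `std::nullopt`. The real binder then feeds the rendered body to `Unmarshal`, which is omitted here.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper (an assumption for illustration, not the library API).
// Renders one table row as a PXF-style body: one `<col> = <val>` line per
// present cell. Cells are assumed to contain already-valid PXF value text,
// e.g. `42`, `"alice"`, or `true`.
// Returns std::nullopt when the column and cell counts differ.
std::optional<std::string> RenderRowAsBody(
    const std::vector<std::string>& columns,
    const std::vector<std::optional<std::string>>& cells) {
  if (columns.size() != cells.size()) return std::nullopt;
  std::string body;
  for (std::size_t i = 0; i < columns.size(); ++i) {
    // An absent cell emits nothing, so the target field keeps its default.
    if (!cells[i].has_value()) continue;
    body += columns[i];
    body += " = ";
    body += *cells[i];
    body += '\n';
  }
  return body;
}
```

Because absent cells produce no output line, the "absent cell leaves default" behaviour needs no special case in the binder itself.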

Implementation notes:

  • Byte-level row scanner mirrors `protowire-go`'s `scanHeaderEnd` / `findNextRow` / `findMatchingParenSafe` — string-aware (strings, triple-strings, `b"..."` literals, `#` / `//` / `/* */` comments) so embedded parens or `@table` substrings inside literals don't trip the scan.
  • Row parsing wraps the row bytes in a synthetic `@table _.Row (c1,c2,...)\n<row>` document and feeds it to the AST parser, reusing `parseTableRow`'s arity check and cell-shape validation.
  • Pull chunk size 4 KiB matches the Go reference; tested across pull boundaries with 50× 200-byte rows.
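
For readers unfamiliar with string-aware scanning, the following is a rough sketch of the idea behind a matching-paren search. It is not the code in `table_reader.cc` and not a port of `findMatchingParenSafe`. `FindMatchingParen` is a hypothetical name, and the lexical rules are assumptions: backslash escapes inside `"..."` strings, triple-quoted strings ending at the next `"""`, and `b"..."` handled by simply treating the `b` as an ordinary byte before the quote.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical sketch: returns the index of the ')' that matches the '(' at
// `open`, or std::string_view::npos if the input ends before a match.
// Parens inside strings and comments are ignored.
std::size_t FindMatchingParen(std::string_view s, std::size_t open) {
  constexpr std::size_t npos = std::string_view::npos;
  if (open >= s.size() || s[open] != '(') return npos;
  int depth = 0;
  std::size_t i = open;
  while (i < s.size()) {
    const char c = s[i];
    if (c == '"') {
      if (s.substr(i, 3) == "\"\"\"") {
        // Triple-quoted string: skip to the next closing triple quote.
        const std::size_t end = s.find("\"\"\"", i + 3);
        if (end == npos) return npos;
        i = end + 3;
      } else {
        // Ordinary string: skip to the closing quote, honouring backslashes.
        ++i;
        while (i < s.size() && s[i] != '"') i += (s[i] == '\\') ? 2 : 1;
        if (i >= s.size()) return npos;  // unterminated string
        ++i;                             // step past the closing quote
      }
      continue;
    }
    if (c == '#' || s.substr(i, 2) == "//") {
      // Line comment: skip through the end of the line.
      const std::size_t nl = s.find('\n', i);
      if (nl == npos) return npos;
      i = nl + 1;
      continue;
    }
    if (s.substr(i, 2) == "/*") {
      // Block comment: skip through the closing delimiter.
      const std::size_t end = s.find("*/", i + 2);
      if (end == npos) return npos;
      i = end + 2;
      continue;
    }
    if (c == '(') ++depth;
    if (c == ')' && --depth == 0) return i;
    ++i;
  }
  return npos;
}
```

In a streaming reader, an `npos` result would typically mean "pull another chunk and retry" rather than a hard error, since the row may simply straddle a pull boundary.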

Test plan

  • 20 new tests in `pxf_table_reader_test.cc`: header parsing (happy path, no-@table, empty, null stream, leading directives, 64 KiB cap), row iteration (ordered, zero-rows-done, cell shapes, three-state, sticky arity error, 50-row pull-boundary stress, parens-in-strings, comments between rows), `Tail` chaining, `BindRow` happy path, `Scan` equivalence, absent-cell-leaves-default, column/cell mismatch error, unknown-column error
  • All 218 tests pass locally (198 on main → +20 new)
  • CI green (clang-format, all matrix jobs, CodeQL)

Fourth PR of the v0.72-v0.75 cpp catch-up. UnmarshalFull materializes
every @table row into Result::Tables(); this PR adds the streaming
alternative for the CSV-replacement workload @table was designed to
serve — working-set memory bounded by the largest single row, not the
full table (draft §3.4.4).

New <protowire/pxf/table_reader.h>:
  - TableReader::Create(std::istream*) — consumes leading directives
    and the @table TYPE (cols) header; reader is positioned at the
    first row. Header capped at 64 KiB (kDefaultHeaderMaxBytes) to
    fail-fast on misuse (a non-@table input shouldn't OOM).
  - Type() / Columns() / Directives() accessors.
  - Next(TableRow*) reads one row at a time. Per-row arity and v1
    cell-grammar checks happen at consume time, not deferred to EOF
    — matches the spec's streaming-consumer requirements. Errors
    are sticky.
  - Scan(Message*) is Next + BindRow.
  - Tail() returns the buffered + remaining bytes as a fresh
    std::istream so callers can chain a second Create() for
    multi-@table documents.
  - BindRow(msg, columns, row) is the exported per-row binder used
    by Scan and by callers iterating Result::Tables()[i].rows from
    the materializing path. Strategy: format-and-reparse — render
    the row as a synthetic PXF body (`<col> = <val>` per non-
    std::nullopt cell) and run through the standard Unmarshal
    pipeline. This reuses every branch of the existing decoder
    (WKT timestamps/durations, wrapper nullability, enum-by-name,
    pxf.required / pxf.default, oneof) instead of growing a
    parallel Value→FieldDescriptor switch. SkipValidate avoids
    re-running the reserved-name check per row.

Implementation:
  - Byte-level row scanner mirrors protowire-go's
    scanHeaderEnd / findNextRow / findMatchingParenSafe — string-
    aware (strings, triple-strings, b"..." literals, # comments,
    // comments, /* */ comments) so embedded parens / @table
    substrings inside literals don't trip the scan.
  - Row parsing wraps the row bytes in a synthetic
    `@table _.Row (c1,c2,...)\n<row>` document and feeds it to the
    AST parser, reusing parseTableRow's arity check and cell-shape
    validation.
  - Pull chunk size 4 KiB matches the Go reference; tested across
    pull boundaries with 50× 200-byte rows.

Tests (20 new in pxf_table_reader_test.cc, 218 total):
  - Header parsing: happy path, no-@table error, empty input,
    null stream, leading directives preserved, 64 KiB header cap
  - Row iteration: ordered traversal, zero-rows-Done, expected
    cell shapes (Int/String/Bool/Null), three-state cells,
    sticky arity error, 50-row pull-boundary stress, parens-in-
    strings, comments between rows
  - Tail chaining to a second table
  - BindRow happy path, Scan equivalence, absent-cell leaves
    default, column/cell mismatch error, unknown-column error
@codecov-commenter
Codecov Report

❌ Patch coverage is 66.49485% with 130 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/pxf/table_reader.cc | 66.23% | 80 Missing and 49 partials ⚠️ |
| include/protowire/pxf/table_reader.h | 83.33% | 1 Missing ⚠️ |

Comment thread src/pxf/table_reader.cc
StatusOr<std::unique_ptr<TableReader>> TableReader::Create(std::istream* src) {
  if (src == nullptr) return Status::Error("pxf: TableReader: null istream");
  auto tr = std::unique_ptr<TableReader>(new TableReader());
  tr->src_ = src;
CodeQL's cpp/stack-address-escape flags table_reader.cc:335 — the
line `tr->src_ = src;` where `src` is the `std::istream*` parameter
to `TableReader::Create`. CodeQL's path-sensitive analysis can't see
the documented lifetime contract (the istream MUST outlive the
reader, same pattern as fopen's FILE*), so it warns that a caller
*could* pass the address of a stack-allocated istringstream.

This is the canonical C++ non-owning-pointer pattern. Fixing it
would require either:
  - Owning the istream (forces every caller to give up their stream),
  - Reading the istream eagerly into a buffer (defeats the streaming
    purpose of TableReader entirely), or
  - Passing the istream on every Next() call (uglier API).

None of those are improvements. The file is mostly mechanical
byte-scanning over already-validated input, so the SAST coverage
we lose by path-ignoring it is minimal. Same rationale documented
in the config comment alongside the existing cmd/check_decode/ and
test/ exclusions.
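
The lifetime contract described above can be illustrated with a small stand-alone sketch. `LineReader` and `CountLines` are hypothetical names used only for this example and are not part of the TableReader API. The point is the declaration order in the caller: the stream is declared before the reader, so it is destroyed after it.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical stand-in for a streaming reader (not the TableReader API).
class LineReader {
 public:
  // Non-owning: `src` must outlive this reader, like a FILE* from fopen.
  explicit LineReader(std::istream* src) : src_(src) {}

  // Reads the next line into `line`; returns false at end of input.
  bool Next(std::string* line) {
    return src_ != nullptr && static_cast<bool>(std::getline(*src_, *line));
  }

 private:
  std::istream* src_;  // borrowed, never deleted here
};

// Correct use: `in` is declared before `reader`, so it outlives the reader
// for the whole scope even though it lives on the stack.
int CountLines(const std::string& text) {
  std::istringstream in(text);
  LineReader reader(&in);
  int n = 0;
  std::string line;
  while (reader.Next(&line)) ++n;
  return n;
}
```

The misuse a checker worries about is the opposite arrangement: returning a reader from a function whose local stream has already gone out of scope.
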
@trendvidia trendvidia merged commit 1c7304e into main May 12, 2026
9 checks passed
@trendvidia trendvidia deleted the table-reader branch May 12, 2026 10:18