
pxf: parser-side @<name>/@entry/@table directive grammar#9

Merged: trendvidia merged 1 commit into main from directives-parser on May 12, 2026

@trendvidia

Summary

First of 5 PRs porting v0.72-v0.75 features to protowire-cpp. Adds the directive grammar to the AST parser and the fast direct-decode path — parser-tier only; runtime semantics (Result accessors, TableReader streaming, per-row Scan/BindRow) arrive in subsequent PRs.

Mirrors the Go reference at `protowire-go/encoding/pxf/{parser,decode_fast}.go` (draft §3.4.2 – §3.4.4).

AST additions to `Document`:

  • `directives[]`: generic `@<name> *(prefix) [{ ... }]` entries, each carrying name, prefixes, optional raw block body, and (back-compat) `type` populated when there's exactly one prefix
  • `tables[]`: `@table <type> ( cols ) row*` entries, with cells modeled as `vector<optional<ValuePtr>>` so the three-state grammar (absent / present-but-null / present) is faithfully preserved
  • `body_offset`: byte offset where the schema-typed body begins (chameleon hashes from this point)
  • `Position` gains an `offset` field so directive body extraction can slice raw bytes

Lexer: `kAtDirective` / `kAtTable` join `kAtType`; `kLParen` / `kRParen` added for table column lists.

Parser (slow / AST tier): top-of-document directive prelude; `parseDirective` uses one-token lookahead (IDENT followed by `=` / `:` is a body key, not a prefix); inline-block body extracted via `findMatchingBrace` (string- and comment-aware); `parseTableDirective` enforces v1 cell grammar (scalar shapes only), column arity, and rejects dotted column paths.
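
The one-token lookahead can be sketched roughly as below. This is an illustration only, not the code from `parser.cc`: the `Tok` enum, the `Token` struct, and `CollectPrefixes` are hypothetical names, and the token stream is assumed to be already materialized in a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds; the real lexer's enum differs.
enum class Tok { kIdent, kEquals, kColon, kLBrace, kRBrace, kString, kEof };

struct Token {
  Tok kind;
  std::string value;
};

// Collects prefix identifiers starting at `pos`.
// An IDENT counts as a prefix unless the NEXT token is '=' or ':',
// in which case it is a body key and collection stops before it.
// On return, `pos` indexes the first token that is not a prefix.
inline std::vector<std::string> CollectPrefixes(const std::vector<Token>& toks,
                                                std::size_t& pos) {
  std::vector<std::string> prefixes;
  while (pos < toks.size() && toks[pos].kind == Tok::kIdent) {
    const bool has_next = pos + 1 < toks.size();
    if (has_next && (toks[pos + 1].kind == Tok::kEquals ||
                     toks[pos + 1].kind == Tok::kColon)) {
      break;  // IDENT followed by '=' / ':' is a body key, not a prefix
    }
    prefixes.push_back(toks[pos].value);
    ++pos;
  }
  return prefixes;
}
```

For `@frob alpha beta` followed by a body entry `id = "x"`, the loop takes `alpha` and `beta` as prefixes and stops at `id` because the token after it is `=`.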

Fast path (`decode_fast.cc`): `consumeDirectives` mirrors the AST prelude at the token level — no AST allocation, contents discarded in this PR. Same standalone constraint and arity checks.

Standalone constraint (draft §3.4.4): a document with any `@table` MUST NOT also have `@type` or top-level field entries.
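
As a minimal sketch, the constraint reduces to a check over three facts about the document. `DocShape` and `ViolatesStandalone` are hypothetical names used only for illustration; the PR enforces the rule inside the parser and the fast path rather than through a helper like this.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary of what the prelude and body contained.
struct DocShape {
  std::size_t table_count = 0;  // number of @table directives seen
  bool has_type = false;        // an @type directive was present
  std::size_t entry_count = 0;  // top-level field entries in the body
};

// Returns true when the document breaks the standalone rule:
// any @table excludes both @type and top-level field entries.
// Generic @<name> directives are not part of the rule.
inline bool ViolatesStandalone(const DocShape& d) {
  if (d.table_count == 0) return false;
  return d.has_type || d.entry_count > 0;
}
```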

Test plan

  • 21 new tests in `pxf_directive_test.cc` covering bare directives, one/two-prefix shapes, lookahead disambiguation, block body raw bytes, nested braces, braces-in-strings, `@table` happy path, three-state cells, zero rows, arity mismatch, dotted columns, list/block cell rejection, both standalone-constraint violations
  • Full test suite passes locally (132/132)
  • End-to-end smoke via `cmd/check_decode`: a doc with `@header T { id = "x" }` + `@frob alpha beta` + body fields decodes correctly (fast path skips directives, decodes body)
  • CI green (clang-format, all matrix jobs, CodeQL)

Adds the v0.72-v0.75 directive grammar to the PXF parser and the fast
direct-decode path (parser-tier only — runtime semantics arrive in
follow-up PRs of the cpp catch-up sequence).

AST changes:
  - Document gains directives[], tables[], body_offset; type_url and
    entries[] keep their meaning
  - Directive { pos, name, prefixes[], type (back-compat single-prefix),
    body, has_body, leading_comments } — Body holds the raw bytes
    between '{' and '}', preserved verbatim
  - TableDirective { pos, type, columns[], rows[], leading_comments }
    with TableRow::cells[] = vector<optional<ValuePtr>> (nullopt =
    absent cell, *NullVal = present-but-null, other = present-with-value)
  - Position gains `offset` (byte offset into lexer input) so directive
    Body extraction can slice raw bytes
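
A rough sketch of the three-state cell model follows. The `Value` hierarchy, the use of `std::shared_ptr` for `ValuePtr`, and the `Classify` helper are assumptions made for illustration; only the `vector<optional<ValuePtr>>` shape and the nullopt / `NullVal` / other distinction come from the description above.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <vector>

// Assumed minimal value hierarchy; the real AST has more node types.
struct Value {
  virtual ~Value() = default;
};
struct NullVal : Value {};
struct IntVal : Value {
  long long v;
  explicit IntVal(long long x) : v(x) {}
};
using ValuePtr = std::shared_ptr<Value>;  // assumption: pointer type unknown

struct TableRow {
  std::vector<std::optional<ValuePtr>> cells;
};

enum class CellState { kAbsent, kNull, kValue };

// Maps a cell onto the three states:
//   nullopt        -> absent
//   holds NullVal  -> present-but-null
//   anything else  -> present-with-value
inline CellState Classify(const std::optional<ValuePtr>& cell) {
  if (!cell.has_value()) return CellState::kAbsent;
  if (dynamic_cast<const NullVal*>(cell->get()) != nullptr) {
    return CellState::kNull;
  }
  return CellState::kValue;
}
```

Keeping "absent" outside the value type means a consumer can tell an omitted cell apart from an explicit null without inspecting the value node itself.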

Lexer changes:
  - kAtDirective (any @<ident> not "type" / "table"; Token.value is the
    bare name) and kAtTable join the existing kAtType
  - kLParen / kRParen for @table column lists and row tuples
  - lex_.Input() exposed so the parser can slice directive Body bytes
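
The `@`-token classification can be illustrated with a small sketch. `AtToken` and `ClassifyAt` are hypothetical names, and leaving `value` empty for `@type` / `@table` is an assumption; the description only states that `kAtDirective` carries the bare name.

```cpp
#include <cassert>
#include <string>
#include <string_view>

enum class TokenKind { kAtType, kAtTable, kAtDirective };

struct AtToken {
  TokenKind kind;
  std::string value;  // bare directive name for kAtDirective
};

// Classifies the identifier that follows '@'.
// "type" and "table" are reserved; every other name is a generic directive.
inline AtToken ClassifyAt(std::string_view name) {
  if (name == "type") return {TokenKind::kAtType, ""};
  if (name == "table") return {TokenKind::kAtTable, ""};
  return {TokenKind::kAtDirective, std::string(name)};
}
```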

Parser (slow / AST tier):
  - parseDocument runs a top-of-document directive prelude: @type,
    @<directive>, @table in any order; doc.body_offset tracks the
    end of the last directive (chameleon's hashing anchor)
  - parseDirective handles zero-or-more prefix identifiers with
    one-token lookahead (IDENT followed by '=' / ':' is a body key,
    not a prefix); optional inline block, body raw bytes extracted
    via findMatchingBrace (mirrors protowire-go)
  - parseTableDirective + parseTableRow enforce v1 cell grammar
    (scalar shapes only — no list / block in cells); arity check
    against the column list; dotted column paths rejected
  - Standalone constraint (draft §3.4.4): a document with any @table
    MUST NOT also have @type or top-level field entries
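
The brace matching described for `findMatchingBrace` can be sketched as follows. This is a minimal illustration, not the protowire-cpp implementation: the name `FindMatchingBraceSketch`, the backslash-escape handling inside strings, and the treatment of `b"..."` as an ordinary string preceded by `b` are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns the offset of the '}' that closes the '{' at `open`,
// or npos when `open` is not a '{' or the block is unbalanced.
// Braces inside strings and comments are ignored.
inline std::size_t FindMatchingBraceSketch(std::string_view in,
                                           std::size_t open) {
  if (open >= in.size() || in[open] != '{') return std::string_view::npos;
  int depth = 0;
  std::size_t i = open;
  while (i < in.size()) {
    const char c = in[i];
    if (c == '"') {
      // String literal (also covers b"..." since 'b' is an ordinary char).
      ++i;
      while (i < in.size() && in[i] != '"') {
        if (in[i] == '\\') ++i;  // skip the escaped character
        ++i;
      }
      ++i;  // step past the closing quote
      continue;
    }
    const bool has_next = i + 1 < in.size();
    if (c == '#' || (c == '/' && has_next && in[i + 1] == '/')) {
      // Line comment: skip to end of line.
      while (i < in.size() && in[i] != '\n') ++i;
      continue;
    }
    if (c == '/' && has_next && in[i + 1] == '*') {
      // Block comment: skip to the closing "*/".
      i += 2;
      while (i + 1 < in.size() && !(in[i] == '*' && in[i + 1] == '/')) ++i;
      i += 2;
      continue;
    }
    if (c == '{') {
      ++depth;
    } else if (c == '}') {
      --depth;
      if (depth == 0) return i;
    }
    ++i;
  }
  return std::string_view::npos;
}
```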

Fast path (direct decode):
  - consumeDirectives mirrors the AST parser's prelude at the token
    level (no AST allocation); discards directive contents in this
    PR — Result accessors / TableReader / BindRow land in subsequent
    PRs
  - Same standalone constraint and arity checks
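
The column and arity checks shared by both tiers might look roughly like this. `CheckColumns`, `CheckRowArity`, and the error strings are hypothetical; the sketch also assumes column names arrive as plain strings, whereas a token-level implementation would more likely reject the `.` token directly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns an empty string when the column list is acceptable,
// otherwise a human-readable error message.
inline std::string CheckColumns(const std::vector<std::string>& columns) {
  if (columns.empty()) return "empty column list";
  for (const std::string& col : columns) {
    if (col.find('.') != std::string::npos) {
      return "dotted column path not allowed: " + col;
    }
  }
  return "";
}

// Every row must supply exactly one cell slot per declared column.
inline std::string CheckRowArity(std::size_t column_count,
                                 std::size_t cell_count) {
  if (cell_count != column_count) {
    return "row has " + std::to_string(cell_count) + " cells, expected " +
           std::to_string(column_count);
  }
  return "";
}
```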

Tests:
  - 21 new tests in pxf_directive_test.cc covering bare directives,
    one/two-prefix shapes, lookahead disambiguation, block body raw
    bytes, nested braces, braces inside strings, @table happy path,
    empty / null cells (three-state grammar), zero rows, arity
    mismatch, dotted columns, list/block cell rejection, both
    standalone-constraint violations
  - All 132 tests pass; end-to-end smoke via cmd/check_decode
    confirms the fast path correctly skips chameleon-style
    `@header T { ... }` plus a multi-prefix `@frob alpha beta`
    before decoding the schema-typed body
@codecov-commenter

Codecov Report

❌ Patch coverage is 46.20390% with 248 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/pxf/decode_fast.cc | 7.91% | 123 Missing and 5 partials ⚠️ |
| src/pxf/parser.cc | 63.03% | 86 Missing and 26 partials ⚠️ |
| src/pxf/lexer.cc | 52.94% | 4 Missing and 4 partials ⚠️ |


@trendvidia trendvidia merged commit 4898a2f into main May 12, 2026
10 checks passed
trendvidia added a commit that referenced this pull request May 12, 2026
Adds 26 new tests on top of PR #9's initial 21:

Fast-path (PxfDirectiveFast fixture, 19 tests) — every code path in
ConsumeDirectives, including the standalone-constraint enforcement in
both orderings (@type before @table and @table before @type), bare /
single-prefix / multi-prefix / inline-block / nested-block directive
shapes, the prefix-lookahead disambiguator, every @table error return
(missing type, missing '(', empty column list, bad column token,
missing ',' or ')' in column list and rows, arity mismatch, dotted
columns, list-cell and block-cell rejection), and the "@type accepts
string form" back-compat that the fast path supports but the AST
parser does not.

AST-tier error paths (7 tests) — @type without an IDENT,
findMatchingBrace's #-comment / //-comment / /*-comment / b"..."
sub-skip branches, zero-prefix-no-legacy-type back-compat, and the
@table-after-@type rejection symmetric to @type-after-@table.

All 170 tests pass locally; the fast-path tests exercise the 123-line
ConsumeDirectives block that PR #9's initial test set missed because
it only invoked Parse() (AST tier).
trendvidia added a commit that referenced this pull request May 12, 2026
Third PR of the v0.72-v0.75 cpp catch-up. The fast-path direct
decoder previously walked directives just enough to satisfy the
standalone constraint and arity checks; it discarded their content.
This PR wires the parsed shape onto Result so consumers can read the
document-root directive list after UnmarshalFull returns.

API additions on Result:
  - Directives() → const vector<Directive>& : generic
    `@<name> *(prefix) [{ ... }]` blocks in source order. body holds
    raw bytes between '{' and '}', preserved verbatim for downstream
    re-parsing (chameleon's @header reader, etc.). Single-prefix
    populates the back-compat `type` field per v0.72.0 shape.
  - Tables() → const vector<TableDirective>& : @table directives
    with full column metadata and parsed cell ValuePtr per row,
    faithful to the three-state cell grammar (absent / present-null /
    present-with-value).
  - AddDirective(...) / AddTable(...) : internal mutators used by
    the fast path; not part of the consumer API.

Fast path (decode_fast.cc):
  - ConsumeDirectives builds Directive / TableDirective structs
    inline, conditionally appending to result_ when non-null.
    Unmarshal (Result=nullptr) retains its zero-allocation contract:
    the fast path still walks and validates directives but allocates
    nothing on the prelude.
  - New ParseScalarCellValue helper mirrors the scalar branches of
    the AST parser's ParseValue. Used by @table row parsing; list /
    block cell tokens are already rejected before it's called.
  - Body bytes for @<directive> blocks are sliced from
    lex_.Input().substr(open + 1, close - (open + 1)) using the
    Position::offset added in PR #9.
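
The body slicing and the null-`Result` branch can be sketched together as below. `SliceBody`, `ResultSketch`, `DirectiveSketch`, and `RecordDirective` are hypothetical stand-ins for illustration; only the `substr(open + 1, close - (open + 1))` expression and the "append only when the result is non-null" behavior come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Returns the bytes strictly between the '{' at `open` and the '}' at `close`.
inline std::string SliceBody(std::string_view input, std::size_t open,
                             std::size_t close) {
  assert(open < close && close < input.size());
  return std::string(input.substr(open + 1, close - (open + 1)));
}

struct DirectiveSketch {
  std::string name;
  std::string body;
  bool has_body = false;
};

struct ResultSketch {
  std::vector<DirectiveSketch> directives;
};

// Records a directive only when a result object was supplied.
// With result == nullptr (the plain Unmarshal path) nothing is allocated.
inline void RecordDirective(ResultSketch* result, std::string_view name,
                            std::string_view input, std::size_t open,
                            std::size_t close) {
  if (result == nullptr) return;
  DirectiveSketch d;
  d.name = std::string(name);
  d.body = SliceBody(input, open, close);
  d.has_body = true;
  result->directives.push_back(std::move(d));
}
```

Because the slice excludes both braces, the body keeps its inner whitespace exactly as written, which is what a downstream re-parser needs.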

Tests (14 new in pxf_result_directives_test.cc, 198 total):
  - Empty document, bare / single-prefix / multi-prefix directives
  - @type does not leak into Directives()
  - Nested block body preserved verbatim
  - Multiple directives in source order
  - @table columns / rows / cells (concrete value tagging)
  - Three-state cells (absent / null / value)
  - Multiple tables in order
  - Directives + tables coexisting
  - Unmarshal (no Result) still succeeds (regression check on the
    result_-null branch)
@trendvidia trendvidia mentioned this pull request May 12, 2026
2 tasks