feat: dump and classify api round-trip failures #217

Merged
webern merged 5 commits into main from claude/new-session-xin58f
Jun 20, 2026

@webern

@webern webern commented Jun 20, 2026

Owner

Summary

Implements Phases 1 and 2 of the api round-trip triage plan (#208).

Phase 1 (#210) adds a --dump <dir> flag to the api round-trip harness's discovery mode. For every non-PASS, non-SKIP file it re-runs the pipeline and writes the fully-normalized expected and actual documents.

runRoundtrip() is left untouched; a new dumpDocuments() helper replays the normalization sequence. Pipeline errors produce no actual document, so a small .status sidecar records the exact error.

New target: make dump-api-roundtrip.

Phase 2 (#211) adds audit/classify.py and the python3 -m audit classify subcommand.

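It diffs each dumped pair as an order-free element multiset (Counter(expected) - Counter(actual)) because a positional walk would desynchronize after the first deletion, and deletion is the dominant failure signal. The difference enumerates every missing element class, so the number of distinct missing classes is simply distinct_missing_count = len(missing).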
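A minimal sketch of the multiset diff described above, assuming an "element class" is keyed by tag name alone. The helper names `element_counts` and `missing_elements` are hypothetical and are not taken from audit/classify.py.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def element_counts(xml_text: str) -> Counter:
    """Count every element tag in the document, ignoring order and position."""
    return Counter(elem.tag for elem in ET.fromstring(xml_text).iter())


def missing_elements(expected_xml: str, actual_xml: str) -> Counter:
    """Multiset difference: tags present in expected that were lost in actual.

    Counter subtraction keeps only positive counts, so extra elements in
    actual are ignored and a pure reorder yields an empty Counter.
    """
    return element_counts(expected_xml) - element_counts(actual_xml)


expected = "<note><pitch/><voice/><staff/><lyric/></note>"
actual = "<note><pitch/><lyric/></note>"  # voice and staff were dropped

missing = missing_elements(expected, actual)
distinct_missing_count = len(missing)
print(sorted(missing), distinct_missing_count)
```

Both drops are reported even though `lyric` follows them, which is the case a positional walk would lose after the first mismatch.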

It cross-references data/api.features.xml to assign each file a root-cause category and writes build/api/classified.json. It also prints a worklist of blocking features ranked by files unblocked, with a single-blocker (low-hanging-fruit) count for each feature.
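The ranking can be sketched as follows. This reads "files unblocked" as the number of files that list a feature among their blockers, and "single-blocker" as the number of files where it is the only blocker; both readings, the input shape, and the function name `rank_blockers` are assumptions for illustration.

```python
from collections import Counter


def rank_blockers(files: dict[str, list[str]]) -> list[tuple[str, int, int]]:
    """Return (feature, files_unblocked, single_blocker) rows, best first.

    `files` maps a file name to the features blocking it (hypothetical shape).
    """
    unblocked: Counter = Counter()
    single: Counter = Counter()
    for blockers in files.values():
        distinct = set(blockers)
        for feature in distinct:
            unblocked[feature] += 1
        if len(distinct) == 1:
            # Fixing this one feature would clear the file entirely.
            single[next(iter(distinct))] += 1
    rows = [(feature, count, single[feature]) for feature, count in unblocked.items()]
    # Most files first, then most single-blocker files, then name for stability.
    return sorted(rows, key=lambda row: (-row[1], -row[2], row[0]))


worklist = rank_blockers(
    {"a.xml": ["source"], "b.xml": ["staff", "voice"], "c.xml": ["staff"]}
)
for feature, count, single_count in worklist:
    print(f"{feature:<12}{count} files   ({single_count} single-blocker)")
```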

Design rationale (defect analysis, layered algorithm, library survey, cited research) is in docs/ai/design/api-roundtrip-classifier.md.

Testing

  • make test-audit — 11 classifier tests pass (all categories, the multiset-completeness fix, single-blocker metric, ranking)
  • Real corpus run: make dump-api-roundtrip && make classify-api-roundtrip classified 828 non-passing files into B 1 / C 51 / F 16 / unknown 760. The large unknown bucket is the correct result on this pre-#209 branch (#209: "api: run discovery and pin currently-passing corpus files to the round-trip baseline"): those files drop elements that are support="full" (e.g. part-group, staff, voice). These are genuine correctness signals, not feature gaps, and the design deliberately surfaces them as unknown rather than mislabeling them as drop-only.
  • .status sidecar round-trips: a CREATEFAIL file reports "pipeline_error_kind": "CREATEFAIL".
  • git status clean after a full dump + classify run (all artifacts gitignored).
  • Regression gate unaffected: mxtest-api-roundtrip regression passes (runRoundtrip() and regression mode untouched; the change only adds a discovery-mode branch).
  • C++ clang-format clean on the changed file.

Sample stdout from the real run:

Classified 828 files from build/api/roundtrip-dump

  B  drop-only divergence           1
  C  reorder-only divergence       51
  D  enum bug                       0
  E  missing attribute/element      0
  F  pipeline error                16
  ?  unknown                      760

Top blocking features (ranked by files unblocked; B+D+E):
  source                     1 files   (1 single-blocker)

References

@webern webern added the testing, non-breaking (fixes or implementation that do not require breaking changes), area/mx::api, and ai (issues opened by, or through, a coding agent) labels Jun 20, 2026, with Claude
@github-actions


gen-quality gen/

gen-quality: 84.5 / 100   (floor 84.5, +0.0)

  structure     86.5  x0.50   [fn 90.5 / file 82.6]
  cyclomatic    88.4  x0.25
  cognitive     76.6  x0.25

  409 functions across 31 files, 7702 lines (largest file 1044)
  max cc 56  max cognitive 44  max fn loc 152

Worst offenders (top 5 per axis; full lists in score.json):
  cyclomatic gen/xsd/analyze.py:311     report                             56
  cyclomatic gen/plates/build.py:956    _validate_config_against_ir        35
  cyclomatic gen/press/context.py:145   plate_context                      34
  cyclomatic gen/__main__.py:46         _ir                                23
  cyclomatic gen/tests/test_ir.py:102   _check_references                  20
  cognitive  gen/xsd/analyze.py:311     report                             44
  cognitive  gen/ir/resolve.py:119      flat_elements                      40
  cognitive  gen/tests/test_ir.py:102   _check_references                  38
  cognitive  gen/press/context.py:145   plate_context                      37
  cognitive  gen/xsd/analyze.py:207     _sccs                              37
  size       gen/xsd/analyze.py:311     report                             152
  size       gen/press/context.py:145   plate_context                      96
  size       gen/plates/build.py:533    _value_plate                       89
  size       gen/plates/build.py:956    _validate_config_against_ir        89
  size       gen/ir/resolve.py:119      flat_elements                      78

Commit fda04bdf35f07340ce89359792240ca158ce74ed.
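The headline number in the gen-quality report above appears to be a weighted sum of the three axis scores, using the `x0.50` / `x0.25` / `x0.25` weights printed beside them. A quick arithmetic check under that assumption (how the tool actually combines the axes is not shown in this thread):

```python
# Axis scores and weights copied from the report; the weighted-sum
# combination rule itself is an assumption, not confirmed by the tool.
scores = {"structure": 86.5, "cyclomatic": 88.4, "cognitive": 76.6}
weights = {"structure": 0.50, "cyclomatic": 0.25, "cognitive": 0.25}

composite = sum(scores[axis] * weights[axis] for axis in scores)
print(round(composite, 1))  # 84.5, matching the reported score
```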

@github-actions


Coverage report

Core-dev coverage src/private/mx/core/

| Metric | Coverage | Covered / Total |
|---|---|---|
| Lines | 77.9% | 28539 / 36624 |
| Functions | 74.4% | 6360 / 8550 |
| Branches | 50.7% | 22672 / 44725 |

API coverage src/private/mx/{api,impl,utility}/

| Metric | Coverage | Covered / Total |
|---|---|---|
| Lines | 72.1% | 5328 / 7390 |
| Functions | 60.1% | 1819 / 3029 |
| Branches | 43.0% | 4447 / 10333 |

Core HTML report | API HTML report

Commit fda04bdf35f07340ce89359792240ca158ce74ed.

@webern

webern commented Jun 20, 2026

Owner Author

@copilot rebase

Copilot AI commented Jun 20, 2026

Contributor

@copilot rebase

Done — merged current main (a235185) into the branch. The PR is now up to date.

webern added 4 commits June 20, 2026 17:00
Document the multiset-first, layered diff design for audit/classify.py,
superseding the naive "walk both trees in parallel until the first
mismatch" sketch. A positional walk cannot survive a deletion (the
dominant mx::api failure signal): one drop desynchronizes every later
sibling, so only the first divergence is trustworthy. Replace it with a
collections.Counter multiset difference that enumerates all missing
element classes in O(n), reorder-invariant, stdlib-only. Includes a
cited research appendix surveying tree edit distance, XML-specific diff
algorithms, sequence alignment, Python libraries, and yield-based
ranking.

Phase 1 (#210): add a --dump <dir> flag to the api round-trip harness'
discovery mode. For every non-PASS, non-SKIP file it re-runs the pipeline and
writes the fully-normalized expected (and, when produced, actual) documents to
<dir>/<flat>.expected.xml / .actual.xml. Pipeline errors (LOADFAIL/GETDATAFAIL/
CREATEFAIL) have no actual document, so a <flat>.status sidecar records the exact
code. runRoundtrip() is left untouched; dumpDocuments() replicates the
normalization sequence. New make target: dump-api-roundtrip.
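A consumer of the dump directory could pair the files up roughly like this. It is a sketch that relies only on the `<flat>.expected.xml` / `.actual.xml` / `.status` naming described above; it assumes every dumped file has an expected document, treats a missing actual document as a pipeline error, and makes no assumption about the sidecar's internal format. `DumpEntry` and `scan_dump` are hypothetical names.

```python
from pathlib import Path
from typing import NamedTuple, Optional

EXPECTED_SUFFIX = ".expected.xml"


class DumpEntry(NamedTuple):
    name: str                # the <flat> file stem
    expected: Path
    actual: Optional[Path]   # None when the pipeline produced no document
    status: Optional[Path]   # sidecar, present for pipeline errors

    @property
    def is_pipeline_error(self) -> bool:
        return self.actual is None


def scan_dump(dump_dir: Path) -> list[DumpEntry]:
    """Pair each expected document with its actual document and/or sidecar."""
    entries = []
    for expected in sorted(dump_dir.glob("*" + EXPECTED_SUFFIX)):
        name = expected.name[: -len(EXPECTED_SUFFIX)]
        actual = dump_dir / f"{name}.actual.xml"
        status = dump_dir / f"{name}.status"
        entries.append(
            DumpEntry(
                name,
                expected,
                actual if actual.exists() else None,
                status if status.exists() else None,
            )
        )
    return entries
```

Calling `scan_dump(Path("build/api/roundtrip-dump"))` would then yield one entry per dumped file, with `is_pipeline_error` separating the LOADFAIL/GETDATAFAIL/CREATEFAIL cases from the diffable pairs.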

Phase 2 (#211): add audit/classify.py and the `classify` subcommand. It diffs
each dumped pair as an order-free element multiset
(Counter(expected) - Counter(actual)), which enumerates every dropped element
class in O(n) and is reorder-invariant -- fixing the positional-walk cascade
where one drop desynchronizes all later siblings. It cross-references
data/api.features.xml and assigns each file a root-cause category (B drop-only,
C reorder-only, D enum-bug, E missing-attribute, F pipeline-error, or unknown),
emitting build/api/classified.json plus a stdout worklist ranked by files
unblocked. New make targets: classify-api-roundtrip, test-audit (wired into CI).

Tests in audit/tests/test_classify.py cover all categories, the multiset
completeness fix, the single-blocker low-hanging-fruit metric, and ranking.

Drives the failure classifier (dump -> classify) and turns
build/api/classified.json into a prioritized, plain-language explanation
of what is wrong with the mx::api round-trip and what to fix next,
grouped by failure mode (crash, supported-element drop, reorder,
by-design drop, audit blind spot).
@webern webern force-pushed the claude/new-session-xin58f branch from b83931e to 2f3675a Compare June 20, 2026 15:00

Comment thread .claude/skills/explain-api-roundtrip/SKILL.md Outdated
@webern webern merged commit 2cd996b into main Jun 20, 2026
@webern webern deleted the claude/new-session-xin58f branch June 20, 2026 15:11
webern added a commit that referenced this pull request Jun 20, 2026
…pported-element drops (#224)

## Summary

Progresses #219 in two parts: fix part-group round-trip, and make the
round-trip classifier surface the dropped-supported-element signal #219
is about (instead of burying it in `unknown`).

### part-group round-trip (mx::api / mx::impl)

#219 flagged part-group as the most-dropped `support=full` element on
api round-trip (373 files). It was two problems:

1. Misleading corpus signal. All 373 drops were synthetic files with an
unmatched `<part-group type="start">` (no stop) -- schema-valid but
semantically invalid, a start/stop pairing constraint XSD cannot
express. `mx::api` correctly drops an unmatched start (it models a
complete start..stop span), and zero real-world files drop part-group.
Fixed by making the synthetic part-groups well-formed (trailing `start`
-> `stop`, 386 files; schema-valid per version, audit surface
unchanged).

2. Overstated support. Even well-formed groups lost data on write:
`group-abbreviation` read but never written, `group-barline` fabricated
as a constant `yes`, `displayName`/`displayAbbreviation` dead. Now:
model `group-barline` (`api::GroupBarline`), write `group-abbreviation`,
wire the display names to
`group-name-display`/`group-abbreviation-display` via a shared
`NameDisplayFunctions` helper. `api.features.xml` corrected `full` ->
`partial`; `group-time`/editorial stay unmodeled by design.
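The start/stop pairing constraint from item 1 is easy to check outside the schema. The sketch below is illustrative only: it pairs `part-group` elements by their `number` attribute in document order, assumes a missing `number` means "1", and uses the hypothetical function name `unmatched_part_group_starts`.

```python
import xml.etree.ElementTree as ET


def unmatched_part_group_starts(xml_text: str) -> list[str]:
    """Return the `number` of every part-group start that never gets a stop."""
    root = ET.fromstring(xml_text)
    open_numbers: list[str] = []
    for group in root.iter("part-group"):  # iter() walks in document order
        number = group.get("number", "1")  # assumed default when omitted
        kind = group.get("type")
        if kind == "start":
            open_numbers.append(number)
        elif kind == "stop" and number in open_numbers:
            open_numbers.remove(number)
    return open_numbers


broken = (
    '<part-list><part-group type="start" number="1"/>'
    '<score-part id="P1"/></part-list>'
)
fixed = (
    '<part-list><part-group type="start" number="1"/>'
    '<score-part id="P1"/><part-group type="stop" number="1"/></part-list>'
)
print(unmatched_part_group_starts(broken), unmatched_part_group_starts(fixed))
```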

### classifier (audit/classify.py)

The classifier put every file dropping a `support=full`/`partial`
element into `unknown` -- category B only fires when every dropped class
is `support=none`, so any file mixing a supported drop with unsupported
ones (nearly all) fell through. That buried #219's premise: 759 of 828
files were `unknown`. Added category G (supported-element drop): an
actionable impl-bug-or-audit-overstatement signal, evaluated after
B/C/D/E (a precise enum/attribute finding still wins) and listing the
dropped supported classes as `blocking_features`. On the corpus this
moves 575 files out of `unknown` (-> 183) and ranks the real offenders:
staff (280), lyric (117), text (115), voice (96), measure-numbering
(88).
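The B versus G distinction described above can be sketched as a small decision function. This covers only the drop-based categories (the C/D/E checks are omitted), assumes the support levels have already been read from `api.features.xml` into a plain dict, and uses the hypothetical name `categorize_drops`; which features B reports as blocking is also an assumption.

```python
def categorize_drops(
    dropped: set[str], support: dict[str, str]
) -> tuple[str, list[str]]:
    """Return (category, blocking_features) for a file's dropped element classes.

    `support` maps an element name to "full", "partial", or "none".
    """
    if not dropped:
        return "unknown", []
    # B: drop-only, fires only when every dropped class is unsupported.
    if all(support.get(name) == "none" for name in dropped):
        return "B", sorted(dropped)
    # G: at least one dropped class claims full or partial support.
    supported = sorted(
        name for name in dropped if support.get(name) in ("full", "partial")
    )
    if supported:
        return "G", supported
    # Dropped classes missing from the feature map stay unexplained.
    return "unknown", []


support = {"staff": "full", "lyric": "partial", "footnote": "none", "level": "none"}
print(categorize_drops({"footnote", "level"}, support))
print(categorize_drops({"footnote", "staff"}, support))
```

A file mixing `footnote` with `staff` no longer falls through to `unknown`: it lands in G with `staff` as the blocking feature.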

## Testing

- [x] New `partGroupRoundTrip` test: red before the part-group fix,
green after
- [x] Full api/impl suite: 4113 assertions / 271 cases
- [x] corert over the corpus: 830 files (the 386 synthetic edits
round-trip in `mx::core`)
- [x] `make test-audit`: 13 cases incl. 2 new category-G tests
- [x] api round-trip: files dropping part-group 373 -> 0; classifier
`unknown` 759 -> 183
- [x] Changed synthetic files schema-valid (3.0/3.1/4.0); `make check`
and the api-roundtrip regression gate pass

No corpus files become green (each still drops
`footnote`/`level`/`staff`), so the pinned baseline is unchanged; the
targeted test is the demonstration.

## References

- Progresses #219
- Part of #208
- Surfaced by #211 / #217
webern added a commit that referenced this pull request Jun 20, 2026
## Summary

The api pipeline crashed outright (no output at all) on 16 corpus
metronome/tempo files — 8 GETDATAFAIL on read, 8 CREATEFAIL on write —
caused by two long-standing MX_THROW sites in the impl layer (present in
the original library too; the round-trip classifier from #217 is what
surfaced them).

Read (MetronomeReader):
- parseNoteRelationNote() threw "wtf is this" on the metronome-note
form. The api has no representation for it, so the tempo is now left
unspecified instead of crashing the whole document.
- parseMetronomeModulation() was an empty stub that left the tempo
unspecified (which then crashed the writer). Implemented it — the two
beat-units are read into api::MetricModulation.

Write (DirectionWriter):
- The tempo loop threw on any non-beatsPerMinute tempo. Replaced with a
switch that writes beatsPerMinute and metricModulation, and skips
tempoText / unspecified tempos gracefully.

Result: metric modulation now round-trips through the api. The
metronome-note form and a non-numeric per-minute (a legal xs:string,
e.g. "ca. 76") no longer crash — their tempo is dropped, producing
imperfect output instead of none.

Corpus discovery confirms GETDATAFAIL and CREATEFAIL both drop from 8 to
0. No file reaches a full PASS: all 16 still fail the strict DOM compare
on pre-existing, by-design fidelity losses unrelated to metronomes
(identification / encoding / part-list reordering), so there are no new
live-corpus baseline entries — as the issue anticipated.

## Testing

- [x] New targeted red/green tests cover both former crash sites plus
metric-modulation fidelity (`*_MetronomeApi`: 17 assertions in 4 test
cases)
- [x] Full api test suite passes (4114 assertions in 272 test cases)
- [x] Corpus discovery: 0 GETDATAFAIL, 0 CREATEFAIL (was 8 + 8)

## References

- Closes #218
- Progresses #208
- Surfaced by #211 / #217