feat: dump and classify api round-trip failures by webern · Pull Request #217 · webern/mx

webern · 2026-06-20T09:19:11Z

Summary

Implements Phases 1 and 2 of the api round-trip triage plan (#208).

Phase 1 (#210) adds a --dump <dir> flag to the api round-trip harness's discovery mode. For every non-PASS, non-SKIP file it re-runs the pipeline and writes the fully-normalized expected and actual documents.

runRoundtrip() is left untouched; a new dumpDocuments() helper replays the normalization sequence. Pipeline errors produce no actual document, so a small .status sidecar records the exact error.
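A consumer of the dump directory might pair the files up as in the sketch below. This is not the harness's code. The `<flat>.expected.xml` / `<flat>.actual.xml` / `<flat>.status` naming follows the Phase 1 commit message. Treating the sidecar as JSON is an assumption based only on the `"pipeline_error_kind"` key quoted under Testing. `DumpEntry` and `load_dump` are hypothetical names.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class DumpEntry:
    """One non-passing file as found in the dump directory (hypothetical shape)."""
    name: str
    expected: Path
    actual: Optional[Path]   # None when the pipeline produced no output
    status: Optional[dict]   # parsed .status sidecar, if one exists


def load_dump(dump_dir: Path) -> list[DumpEntry]:
    """Pair each <flat>.expected.xml with its optional .actual.xml and .status."""
    suffix = ".expected.xml"
    entries = []
    for expected in sorted(dump_dir.glob("*" + suffix)):
        flat = expected.name[: -len(suffix)]
        actual = dump_dir / (flat + ".actual.xml")
        sidecar = dump_dir / (flat + ".status")
        status = None
        if sidecar.exists():
            # Assumption: the sidecar is a small JSON object.
            status = json.loads(sidecar.read_text(encoding="utf-8"))
        entries.append(
            DumpEntry(flat, expected, actual if actual.exists() else None, status)
        )
    return entries
```

An entry with `actual is None` and a populated `status` would correspond to a pipeline error such as CREATEFAIL.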

New target: make dump-api-roundtrip.

Phase 2 (#211) adds audit/classify.py and the python3 -m audit classify subcommand.

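It diffs each dumped pair as an order-free element multiset (Counter(expected) - Counter(actual)) rather than walking the two trees in parallel, because the dominant signal, deletion, desynchronizes a positional walk after the first drop. The multiset difference enumerates every missing element class, and the number of distinct missing classes is reported as distinct_missing_count = len(missing).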
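A minimal sketch of that multiset difference, using only the standard library. Keying the count on the bare tag name is an assumption; the real classifier may key on something richer. `element_counts` and `missing_elements` are hypothetical names, and the two XML snippets are made up for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def element_counts(xml_text: str) -> Counter:
    """Count elements by tag name, ignoring document order and nesting."""
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iter())


def missing_elements(expected_xml: str, actual_xml: str) -> Counter:
    """Element classes that occur fewer times in actual than in expected."""
    # Counter subtraction keeps only positive counts, so reordered or
    # surplus elements in the actual document do not appear in the result.
    return element_counts(expected_xml) - element_counts(actual_xml)


# Illustrative input: the actual document is reordered and drops three things.
expected = "<note><pitch/><voice/><staff/><lyric/><lyric/></note>"
actual = "<note><lyric/><pitch/></note>"

missing = missing_elements(expected, actual)
distinct_missing_count = len(missing)  # 3 distinct classes: voice, staff, lyric
print(dict(missing), distinct_missing_count)
```

Because the comparison never depends on position, the reordered `lyric` and `pitch` elements do not produce a spurious mismatch.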

It cross-references data/api.features.xml to assign each file a root-cause category and writes build/api/classified.json. It also prints a worklist ranked by the number of files each feature would unblock, with a single-blocker count alongside to flag the low-hanging fruit.
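The ranking could be computed roughly as follows. This sketch assumes each classified file carries a set of blocking features, and it reads "single-blocker" as "files for which this feature is the only blocker". That reading is inferred from the sample stdout, not confirmed by the source. `rank_blockers` and the file names are hypothetical.

```python
from collections import Counter


def rank_blockers(blocking_by_file: dict[str, set[str]]) -> list[tuple[str, int, int]]:
    """Return (feature, files_blocked, single_blocker_files), best target first."""
    files = Counter()
    single = Counter()
    for features in blocking_by_file.values():
        for feature in features:
            files[feature] += 1
        if len(features) == 1:
            # Fixing this one feature would fully unblock the file.
            single[next(iter(features))] += 1
    return sorted(
        ((feature, count, single[feature]) for feature, count in files.items()),
        key=lambda row: (-row[1], -row[2], row[0]),
    )


# Illustrative data only.
blocking = {
    "a.xml": {"source"},
    "b.xml": {"staff", "voice"},
    "c.xml": {"staff"},
}
for feature, count, singles in rank_blockers(blocking):
    print(f"  {feature:<24}{count:>4} files   ({singles} single-blocker)")
```

Sorting by files blocked first and single-blocker count second keeps the broadest fixes at the top while still separating features that unblock a file on their own.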

Design rationale (defect analysis, layered algorithm, library survey, cited research) is in docs/ai/design/api-roundtrip-classifier.md.

Testing

make test-audit — 11 classifier tests pass (all categories, the multiset-completeness fix, single-blocker metric, ranking)
Real corpus run: make dump-api-roundtrip && make classify-api-roundtrip classified 828 non-passing files into B 1 / C 51 / F 16 / unknown 760. The large unknown bucket is the correct result on this pre-#209 branch (#209 is "api: run discovery and pin currently-passing corpus files to the round-trip baseline"): those files drop elements that are support="full" (e.g. part-group, staff, voice). These are genuine correctness signals, not feature gaps, and the design deliberately surfaces them as unknown rather than mislabeling them drop-only.
.status sidecar round-trips: a CREATEFAIL file reports "pipeline_error_kind": "CREATEFAIL".
git status clean after a full dump + classify run (all artifacts gitignored).
Regression gate unaffected: mxtest-api-roundtrip regression passes (runRoundtrip() and regression mode untouched; the change only adds a discovery-mode branch).
C++ clang-format clean on the changed file.

Sample stdout from the real run:

Classified 828 files from build/api/roundtrip-dump

  B  drop-only divergence           1
  C  reorder-only divergence       51
  D  enum bug                       0
  E  missing attribute/element      0
  F  pipeline error                16
  ?  unknown                      760

Top blocking features (ranked by files unblocked; B+D+E):
  source                     1 files   (1 single-blocker)

References

github-actions · 2026-06-20T09:21:36Z

gen-quality `gen/`

gen-quality: 84.5 / 100   (floor 84.5, +0.0)

  structure     86.5  x0.50   [fn 90.5 / file 82.6]
  cyclomatic    88.4  x0.25
  cognitive     76.6  x0.25

  409 functions across 31 files, 7702 lines (largest file 1044)
  max cc 56  max cognitive 44  max fn loc 152

Worst offenders (top 5 per axis; full lists in score.json):
  cyclomatic gen/xsd/analyze.py:311     report                             56
  cyclomatic gen/plates/build.py:956    _validate_config_against_ir        35
  cyclomatic gen/press/context.py:145   plate_context                      34
  cyclomatic gen/__main__.py:46         _ir                                23
  cyclomatic gen/tests/test_ir.py:102   _check_references                  20
  cognitive  gen/xsd/analyze.py:311     report                             44
  cognitive  gen/ir/resolve.py:119      flat_elements                      40
  cognitive  gen/tests/test_ir.py:102   _check_references                  38
  cognitive  gen/press/context.py:145   plate_context                      37
  cognitive  gen/xsd/analyze.py:207     _sccs                              37
  size       gen/xsd/analyze.py:311     report                             152
  size       gen/press/context.py:145   plate_context                      96
  size       gen/plates/build.py:533    _value_plate                       89
  size       gen/plates/build.py:956    _validate_config_against_ir        89
  size       gen/ir/resolve.py:119      flat_elements                      78

Commit fda04bdf35f07340ce89359792240ca158ce74ed.

github-actions · 2026-06-20T09:35:18Z

Coverage report

Core-dev coverage `src/private/mx/core/`

Metric	Coverage	Covered / Total
Lines	77.9%	28539 / 36624
Functions	74.4%	6360 / 8550
Branches	50.7%	22672 / 44725

API coverage `src/private/mx/{api,impl,utility}/`

Metric	Coverage	Covered / Total
Lines	72.1%	5328 / 7390
Functions	60.1%	1819 / 3029
Branches	43.0%	4447 / 10333

Core HTML report | API HTML report

Commit fda04bdf35f07340ce89359792240ca158ce74ed.

github-actions · 2026-06-20T10:32:37Z

gen-quality `gen/`

gen-quality: 84.5 / 100   (floor 84.5, +0.0)

  structure     86.5  x0.50   [fn 90.5 / file 82.6]
  cyclomatic    88.4  x0.25
  cognitive     76.6  x0.25

  409 functions across 31 files, 7702 lines (largest file 1044)
  max cc 56  max cognitive 44  max fn loc 152

Worst offenders (top 5 per axis; full lists in score.json):
  cyclomatic gen/xsd/analyze.py:311     report                             56
  cyclomatic gen/plates/build.py:956    _validate_config_against_ir        35
  cyclomatic gen/press/context.py:145   plate_context                      34
  cyclomatic gen/__main__.py:46         _ir                                23
  cyclomatic gen/tests/test_ir.py:102   _check_references                  20
  cognitive  gen/xsd/analyze.py:311     report                             44
  cognitive  gen/ir/resolve.py:119      flat_elements                      40
  cognitive  gen/tests/test_ir.py:102   _check_references                  38
  cognitive  gen/press/context.py:145   plate_context                      37
  cognitive  gen/xsd/analyze.py:207     _sccs                              37
  size       gen/xsd/analyze.py:311     report                             152
  size       gen/press/context.py:145   plate_context                      96
  size       gen/plates/build.py:533    _value_plate                       89
  size       gen/plates/build.py:956    _validate_config_against_ir        89
  size       gen/ir/resolve.py:119      flat_elements                      78

Commit 2fdd68002c9a67c717cce0d09369e45b382892d3.

github-actions · 2026-06-20T10:46:20Z

Coverage report

Core-dev coverage `src/private/mx/core/`

Metric	Coverage	Covered / Total
Lines	77.9%	28539 / 36624
Functions	74.4%	6360 / 8550
Branches	50.7%	22672 / 44725

API coverage `src/private/mx/{api,impl,utility}/`

Metric	Coverage	Covered / Total
Lines	72.1%	5328 / 7390
Functions	60.1%	1819 / 3029
Branches	43.0%	4447 / 10333

Core HTML report | API HTML report

Commit 2fdd68002c9a67c717cce0d09369e45b382892d3.

webern · 2026-06-20T14:27:35Z

@copilot rebase

Copilot · 2026-06-20T14:29:13Z

@copilot rebase

Done — merged current main (a235185) into the branch. The PR is now up to date.

github-actions · 2026-06-20T14:37:22Z

gen-quality `gen/`

gen-quality: 84.5 / 100   (floor 84.5, +0.0)

  structure     86.5  x0.50   [fn 90.5 / file 82.6]
  cyclomatic    88.4  x0.25
  cognitive     76.6  x0.25

  409 functions across 31 files, 7702 lines (largest file 1044)
  max cc 56  max cognitive 44  max fn loc 152

Worst offenders (top 5 per axis; full lists in score.json):
  cyclomatic gen/xsd/analyze.py:311     report                             56
  cyclomatic gen/plates/build.py:956    _validate_config_against_ir        35
  cyclomatic gen/press/context.py:145   plate_context                      34
  cyclomatic gen/__main__.py:46         _ir                                23
  cyclomatic gen/tests/test_ir.py:102   _check_references                  20
  cognitive  gen/xsd/analyze.py:311     report                             44
  cognitive  gen/ir/resolve.py:119      flat_elements                      40
  cognitive  gen/tests/test_ir.py:102   _check_references                  38
  cognitive  gen/press/context.py:145   plate_context                      37
  cognitive  gen/xsd/analyze.py:207     _sccs                              37
  size       gen/xsd/analyze.py:311     report                             152
  size       gen/press/context.py:145   plate_context                      96
  size       gen/plates/build.py:533    _value_plate                       89
  size       gen/plates/build.py:956    _validate_config_against_ir        89
  size       gen/ir/resolve.py:119      flat_elements                      78

Commit f604535dc246b72c57953921cb6cea25683a0f3c.

github-actions · 2026-06-20T14:51:50Z

Coverage report

Core-dev coverage `src/private/mx/core/`

Metric	Coverage	Covered / Total
Lines	77.9%	28539 / 36624
Functions	74.4%	6360 / 8550
Branches	50.7%	22672 / 44725

API coverage `src/private/mx/{api,impl,utility}/`

Metric	Coverage	Covered / Total
Lines	72.1%	5328 / 7390
Functions	60.1%	1819 / 3029
Branches	43.0%	4447 / 10333

Core HTML report | API HTML report

Commit f604535dc246b72c57953921cb6cea25683a0f3c.

Document the multiset-first, layered diff design for audit/classify.py, superseding the naive "walk both trees in parallel until the first mismatch" sketch. A positional walk cannot survive a deletion (the dominant mx::api failure signal): one drop desynchronizes every later sibling, so only the first divergence is trustworthy. Replace it with a collections.Counter multiset difference that enumerates all missing element classes in O(n), reorder-invariant, stdlib-only. Includes a cited research appendix surveying tree edit distance, XML-specific diff algorithms, sequence alignment, Python libraries, and yield-based ranking.
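The cascade can be seen on a toy sibling list. The sketch below is illustrative only: `positional_mismatches` is a hypothetical stand-in for the naive parallel walk, and the tag sequence is made up rather than taken from a corpus file.

```python
from collections import Counter


def positional_mismatches(expected: list[str], actual: list[str]) -> list[int]:
    """Naive parallel walk: every index at which the two sequences disagree."""
    diffs = [i for i, (e, a) in enumerate(zip(expected, actual)) if e != a]
    # Positions present in only one of the sequences also count as mismatches.
    shorter, longer = sorted((len(expected), len(actual)))
    diffs.extend(range(shorter, longer))
    return diffs


expected = ["part-group", "score-part", "staff", "voice", "lyric"]
actual = ["score-part", "staff", "voice", "lyric"]  # only part-group was dropped

# One deletion shifts every later sibling, so all five positions mismatch.
walk = positional_mismatches(expected, actual)      # [0, 1, 2, 3, 4]

# The multiset difference reports exactly the one dropped class.
dropped = Counter(expected) - Counter(actual)       # Counter({'part-group': 1})
print(walk, dict(dropped))
```

Only index 0 of the positional result is meaningful. The multiset result needs no such caveat.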

Phase 1 (#210): add a --dump <dir> flag to the api round-trip harness' discovery mode. For every non-PASS, non-SKIP file it re-runs the pipeline and writes the fully-normalized expected (and, when produced, actual) documents to <dir>/<flat>.expected.xml / .actual.xml. Pipeline errors (LOADFAIL/GETDATAFAIL/CREATEFAIL) have no actual document, so a <flat>.status sidecar records the exact code. runRoundtrip() is left untouched; dumpDocuments() replicates the normalization sequence. New make target: dump-api-roundtrip.

Phase 2 (#211): add audit/classify.py and the `classify` subcommand. It diffs each dumped pair as an order-free element multiset (Counter(expected) - Counter(actual)), which enumerates every dropped element class in O(n) and is reorder-invariant -- fixing the positional-walk cascade where one drop desynchronizes all later siblings. It cross-references data/api.features.xml and assigns each file a root-cause category (B drop-only, C reorder-only, D enum-bug, E missing-attribute, F pipeline-error, or unknown), emitting build/api/classified.json plus a stdout worklist ranked by files unblocked. New make targets: classify-api-roundtrip, test-audit (wired into CI). Tests in audit/tests/test_classify.py cover all categories, the multiset completeness fix, the single-blocker low-hanging-fruit metric, and ranking.

Drives the failure classifier (dump -> classify) and turns build/api/classified.json into a prioritized, plain-language explanation of what is wrong with the mx::api round-trip and what to fix next, grouped by failure mode (crash, supported-element drop, reorder, by-design drop, audit blind spot).

github-actions · 2026-06-20T15:02:16Z

gen-quality `gen/`

gen-quality: 84.5 / 100   (floor 84.5, +0.0)

  structure     86.5  x0.50   [fn 90.5 / file 82.6]
  cyclomatic    88.4  x0.25
  cognitive     76.6  x0.25

  409 functions across 31 files, 7702 lines (largest file 1044)
  max cc 56  max cognitive 44  max fn loc 152

Worst offenders (top 5 per axis; full lists in score.json):
  cyclomatic gen/xsd/analyze.py:311     report                             56
  cyclomatic gen/plates/build.py:956    _validate_config_against_ir        35
  cyclomatic gen/press/context.py:145   plate_context                      34
  cyclomatic gen/__main__.py:46         _ir                                23
  cyclomatic gen/tests/test_ir.py:102   _check_references                  20
  cognitive  gen/xsd/analyze.py:311     report                             44
  cognitive  gen/ir/resolve.py:119      flat_elements                      40
  cognitive  gen/tests/test_ir.py:102   _check_references                  38
  cognitive  gen/press/context.py:145   plate_context                      37
  cognitive  gen/xsd/analyze.py:207     _sccs                              37
  size       gen/xsd/analyze.py:311     report                             152
  size       gen/press/context.py:145   plate_context                      96
  size       gen/plates/build.py:533    _value_plate                       89
  size       gen/plates/build.py:956    _validate_config_against_ir        89
  size       gen/ir/resolve.py:119      flat_elements                      78

Commit e85689e58f724146d7a3fe8d16aec68795754c8d.

github-actions · 2026-06-20T15:14:23Z

gen-quality `gen/`

gen-quality: 84.5 / 100   (floor 84.5, +0.0)

  structure     86.5  x0.50   [fn 90.5 / file 82.6]
  cyclomatic    88.4  x0.25
  cognitive     76.6  x0.25

  409 functions across 31 files, 7702 lines (largest file 1044)
  max cc 56  max cognitive 44  max fn loc 152

Worst offenders (top 5 per axis; full lists in score.json):
  cyclomatic gen/xsd/analyze.py:311     report                             56
  cyclomatic gen/plates/build.py:956    _validate_config_against_ir        35
  cyclomatic gen/press/context.py:145   plate_context                      34
  cyclomatic gen/__main__.py:46         _ir                                23
  cyclomatic gen/tests/test_ir.py:102   _check_references                  20
  cognitive  gen/xsd/analyze.py:311     report                             44
  cognitive  gen/ir/resolve.py:119      flat_elements                      40
  cognitive  gen/tests/test_ir.py:102   _check_references                  38
  cognitive  gen/press/context.py:145   plate_context                      37
  cognitive  gen/xsd/analyze.py:207     _sccs                              37
  size       gen/xsd/analyze.py:311     report                             152
  size       gen/press/context.py:145   plate_context                      96
  size       gen/plates/build.py:533    _value_plate                       89
  size       gen/plates/build.py:956    _validate_config_against_ir        89
  size       gen/ir/resolve.py:119      flat_elements                      78

Commit 2cd996bbfcd8dc03492f330b9bb0d15268346705.

github-actions · 2026-06-20T15:28:26Z

Coverage report

Core-dev coverage `src/private/mx/core/`

Metric	Coverage	Covered / Total
Lines	77.9%	28539 / 36624
Functions	74.4%	6360 / 8550
Branches	50.7%	22672 / 44725

API coverage `src/private/mx/{api,impl,utility}/`

Metric	Coverage	Covered / Total
Lines	72.1%	5328 / 7390
Functions	60.1%	1819 / 3029
Branches	43.0%	4447 / 10333

Core HTML report | API HTML report

Commit 2cd996bbfcd8dc03492f330b9bb0d15268346705.

…pported-element drops (#224)

## Summary

Progresses #219 in two parts: fix part-group round-trip, and make the round-trip classifier surface the dropped-supported-element signal #219 is about (instead of burying it in `unknown`).

### part-group round-trip (mx::api / mx::impl)

#219 flagged part-group as the most-dropped `support=full` element on api round-trip (373 files). It was two problems:

1. Misleading corpus signal. All 373 drops were synthetic files with an unmatched `<part-group type="start">` (no stop) -- schema-valid but semantically invalid, a start/stop pairing constraint XSD cannot express. `mx::api` correctly drops an unmatched start (it models a complete start..stop span), and zero real-world files drop part-group. Fixed by making the synthetic part-groups well-formed (trailing `start` -> `stop`, 386 files; schema-valid per version, audit surface unchanged).
2. Overstated support. Even well-formed groups lost data on write: `group-abbreviation` read but never written, `group-barline` fabricated as a constant `yes`, `displayName`/`displayAbbreviation` dead. Now: model `group-barline` (`api::GroupBarline`), write `group-abbreviation`, wire the display names to `group-name-display`/`group-abbreviation-display` via a shared `NameDisplayFunctions` helper. `api.features.xml` corrected `full` -> `partial`; `group-time`/editorial stay unmodeled by design.

### classifier (audit/classify.py)

The classifier put every file dropping a `support=full`/`partial` element into `unknown` -- category B only fires when every dropped class is `support=none`, so any file mixing a supported drop with unsupported ones (nearly all) fell through. That buried #219's premise: 759 of 828 files were `unknown`. Added category G (supported-element drop): an actionable impl-bug-or-audit-overstatement signal, evaluated after B/C/D/E (a precise enum/attribute finding still wins) and listing the dropped supported classes as `blocking_features`. On the corpus this moves 575 files out of `unknown` (-> 183) and ranks the real offenders: staff (280), lyric (117), text (115), voice (96), measure-numbering (88).

## Testing

- [x] New `partGroupRoundTrip` test: red before the part-group fix, green after
- [x] Full api/impl suite: 4113 assertions / 271 cases
- [x] corert over the corpus: 830 files (the 386 synthetic edits round-trip in `mx::core`)
- [x] `make test-audit`: 13 cases incl. 2 new category-G tests
- [x] api round-trip: files dropping part-group 373 -> 0; classifier `unknown` 759 -> 183
- [x] Changed synthetic files schema-valid (3.0/3.1/4.0); `make check` and the api-roundtrip regression gate pass

No corpus files become green (each still drops `footnote`/`level`/`staff`), so the pinned baseline is unchanged; the targeted test is the demonstration.

## References

- Progresses #219
- Part of #208
- Surfaced by #211 / #217
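The category ordering described in the classifier section of the commit message above might look like the following sketch. It covers only the rules the message states: B requires every dropped class to be `support=none`, and G is evaluated after the more specific categories. The D and E checks are omitted because the source does not describe how they are detected. Checking F first and treating "nothing missing" as reorder-only are assumptions, as are the name `classify` and the support values in the usage lines.

```python
from typing import Optional

SUPPORTED = {"full", "partial"}


def classify(
    missing: set[str],
    support: dict[str, str],
    pipeline_error: Optional[str] = None,
) -> tuple[str, list[str]]:
    """Return (category, blocking_features) for one dumped file."""
    if pipeline_error is not None:
        return "F", []                       # pipeline error: nothing to diff
    if not missing:
        # Assumption: a non-passing file with no missing class only reordered.
        return "C", []
    levels = {name: support.get(name) for name in missing}
    if all(level == "none" for level in levels.values()):
        return "B", sorted(missing)          # drop-only: every drop is unsupported
    # (D enum-bug and E missing-attribute checks would run here.)
    dropped_supported = sorted(n for n, lvl in levels.items() if lvl in SUPPORTED)
    if dropped_supported:
        return "G", dropped_supported        # supported-element drop
    return "unknown", []


# Hypothetical feature table for illustration.
features = {"footnote": "none", "level": "none", "staff": "full"}
print(classify({"footnote", "level"}, features))  # ('B', ['footnote', 'level'])
print(classify({"footnote", "staff"}, features))  # ('G', ['staff'])
```

The second call shows the case the message describes: a file mixing a supported drop with unsupported ones no longer falls through to `unknown`.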

## Summary

The api pipeline crashed outright (no output at all) on 16 corpus metronome/tempo files (8 GETDATAFAIL on read, 8 CREATEFAIL on write), caused by two long-standing MX_THROW sites in the impl layer (present in the original library too; the round-trip classifier from #217 is what surfaced them).

Read (MetronomeReader):

- parseNoteRelationNote() threw "wtf is this" on the metronome-note form. The api has no representation for it, so the tempo is now left unspecified instead of crashing the whole document.
- parseMetronomeModulation() was an empty stub that left the tempo unspecified (which then crashed the writer). Implemented it: the two beat-units are read into api::MetricModulation.

Write (DirectionWriter):

- The tempo loop threw on any non-beatsPerMinute tempo. Replaced with a switch that writes beatsPerMinute and metricModulation, and skips tempoText / unspecified tempos gracefully.

Result: metric modulation now round-trips through the api. The metronome-note form and a non-numeric per-minute (a legal xs:string, e.g. "ca. 76") no longer crash; their tempo is dropped, producing imperfect output instead of none.

Corpus discovery confirms GETDATAFAIL and CREATEFAIL both drop from 8 to 0. No file reaches a full PASS: all 16 still fail the strict DOM compare on pre-existing, by-design fidelity losses unrelated to metronomes (identification / encoding / part-list reordering), so there are no new live-corpus baseline entries, as the issue anticipated.

## Testing

- [x] New targeted red/green tests cover both former crash sites plus metric-modulation fidelity (`*_MetronomeApi`: 17 assertions in 4 test cases)
- [x] Full api test suite passes (4114 assertions in 272 test cases)
- [x] Corpus discovery: 0 GETDATAFAIL, 0 CREATEFAIL (was 8 + 8)

## References

- Closes #218
- Progresses #208
- Surfaced by #211 / #217

webern added the testing, non-breaking, area/mx::api, and ai labels Jun 20, 2026 (with Claude)

Copilot started work on behalf of webern June 20, 2026 14:27

Copilot finished work on behalf of webern June 20, 2026 14:29

webern added 4 commits June 20, 2026 17:00

chore: delete claude removing ci file that did not work

b06b39d

webern force-pushed the claude/new-session-xin58f branch from b83931e to 2f3675a Compare June 20, 2026 15:00

webern commented Jun 20, 2026

Comment thread .claude/skills/explain-api-roundtrip/SKILL.md Outdated

Update .claude/skills/explain-api-roundtrip/SKILL.md

db56b6a

webern merged commit 2cd996b into main Jun 20, 2026

webern deleted the claude/new-session-xin58f branch June 20, 2026 15:11

This was referenced Jun 20, 2026

fix: stop api round-trip crashes on metronome/tempo marks #223

Merged

fix: part-group round-trip fidelity, and a classifier category for supported-element drops #224

Merged

webern mentioned this pull request Jun 20, 2026

audit: classify api round-trip failures by measured divergence and rank a worklist #225

Merged

3 tasks

webern merged 5 commits into main from claude/new-session-xin58f
