Validate XML chars in name and sbmlName setters#1677

Merged
jcschaff merged 4 commits into master from xmlchars-validation
May 5, 2026

jcschaff commented May 5, 2026

Summary

  • New XmlChars helper in vcell-util enforces XML 1.0 char rules plus a project policy banning U+FFFD (the replacement char, almost always evidence of upstream charset corruption); separate "name" and "attribute-content" modes for whitespace handling.
  • TokenMangler.checkNameProperty and ReactionStep.vetoableChange("name") reject control chars and U+FFFD in entity names; existing whitespace-bearing biomodel/application/simulation names continue to work.
  • New SpeciesContext.fixAndValidateSbmlName is the single chokepoint for all 9 setSbmlName paths (the setters on SpeciesContext, Structure, ReactionStep, BioModel, Model.GlobalParameter, AssignmentRule, RateRule, and BioEvent, plus the existing fixSbmlName fix-only path); it throws PropertyVetoException on bad chars so existing UI catch sites still surface a friendly dialog.
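
For readers unfamiliar with the XML 1.0 character rules, the two validation modes described above can be sketched roughly as follows. This is an illustration only, not the actual vcell-util source: the class name XmlCharsSketch, the method firstInvalidIndex, the use of IllegalArgumentException, and the assumption that "name" mode rejects every Java whitespace character (including plain space) are all hypothetical. Only the XML 1.0 Char ranges and the U+FFFD ban come from the spec and from this PR, respectively.

```java
/** Illustrative sketch only; not the vcell-util XmlChars source. */
final class XmlCharsSketch {
    private XmlCharsSketch() {}

    /** XML 1.0 "Char" production: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. */
    static boolean isXml10Char(int cp) {
        return cp == 0x09 || cp == 0x0A || cp == 0x0D
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns the index of the first rejected character, or -1 if the string is clean.
     * nameMode == true additionally rejects whitespace (an assumption about "name" mode).
     */
    static int firstInvalidIndex(String s, boolean nameMode) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i); // a lone surrogate comes back as itself and fails isXml10Char
            boolean ok = isXml10Char(cp)
                    && cp != 0xFFFD // project policy: treat the replacement char as corruption
                    && !(nameMode && Character.isWhitespace(cp));
            if (!ok) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    /** Attribute-content mode: TAB, LF, CR and spaces pass; control chars and U+FFFD do not. */
    static String requireValidAttributeContent(String s) {
        int bad = firstInvalidIndex(s, false);
        if (bad >= 0) {
            throw new IllegalArgumentException(String.format(
                    "invalid character U+%04X at index %d", s.codePointAt(bad), bad));
        }
        return s;
    }
}
```

Iterating by code point rather than by char keeps supplementary characters (valid surrogate pairs) legal while still rejecting unpaired surrogates.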

Why

Two stored BioModels (311226221, 311875206) had cached VCML containing C0 control chars (0x13, 0x1C) and U+FFFD inside reaction Name/SbmlName attributes; SAX rejected the cached XML on load. This PR closes the write-side leak so future imports/edits cannot serialize bad chars in the first place. PR #1676 (charset hygiene) addresses the upstream cause; this PR is the defense-in-depth follow-up.
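
The load-side failure can be reproduced in isolation with the JDK's built-in parser. The sketch below is not VCell code; the class name ControlCharParseDemo and the Reaction/Name element and attribute are made up for illustration. It shows that a literal C0 control character in an attribute value is a fatal well-formedness error. Note that U+FFFD by itself falls inside the XML 1.0 Char range, so in this sketch only the control characters trigger the failure; rejecting U+FFFD is the project policy described in the Summary.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative only: checks whether a small XML string is well-formed. */
final class ControlCharParseDemo {
    private ControlCharParseDemo() {}

    static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, e.g. an illegal character in an attribute value
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<Reaction Name=\"r0\"/>"));        // clean name
        System.out.println(parses("<Reaction Name=\"a\u0013b\"/>"));  // contains 0x13
        System.out.println(parses("<Reaction Name=\"a\u001Cb\"/>"));  // contains 0x1C
        System.out.println(parses("<Reaction Name=\"a\uFFFDb\"/>"));  // contains U+FFFD
    }
}
```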

Test plan

  • mvn -pl vcell-util -Dtest=XmlCharsTest test — 17/17 pass, including real-world bad patterns observed in the corrupt biomodels (U+FFFD + 0x1C, lone 0x13).
  • mvn -pl vcell-core -Dgroups=Fast test — 418 run, 0 failures, 0 new errors (1 pre-existing unrelated VCellDataTest.test_3D failure due to missing vtkmodules Python module).
  • Validate via UI that an attempt to paste a control char into a reaction-name or sbmlName field shows the new error message rather than corrupting the model.

🤖 Generated with Claude Code

jcschaff and others added 4 commits May 5, 2026 10:03
Centralizes XML 1.0 character validation rules, plus project policy
hard-rejecting U+FFFD (almost always charset corruption). Two modes:
name (forbids whitespace) and attribute-content (allows TAB/LF/CR).

Motivated by two stored BioModels (311226221, 311875206) whose cached
VCML contained C0 control chars in reaction-name attributes and could
no longer be parsed. The helper itself is a defensive primitive; this
commit adds only the helper + tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wire XmlChars.requireValidAttributeContent into TokenMangler.checkNameProperty
(used by BioModel, MathModel, Simulation, SimulationContext, Structure name
vetos) and into ReactionStep.vetoableChange("name"). Allows whitespace but
rejects C0 control chars and U+FFFD, so that bad chars from external
sources can't reach the VCML attribute payload via setName.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add SpeciesContext.fixAndValidateSbmlName as the single chokepoint for
all 9 setSbmlName methods (SpeciesContext, Structure, ReactionStep,
BioModel, Model.GlobalParameter, AssignmentRule, RateRule, BioEvent,
plus existing fixSbmlName fix-only path). The helper rejects C0 control
chars and U+FFFD via XmlChars.requireValidAttributeContent, throwing
PropertyVetoException so existing UI catch blocks render a friendly
error instead of a stack trace.

This closes the SBML import → setSbmlName → cached VCML attribute path
that produced the un-parseable cached XML observed in BioModels
311226221 and 311875206.
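
A minimal sketch of the chokepoint pattern described in this commit, assuming a simplified signature: the real SpeciesContext.fixAndValidateSbmlName signature is not shown in this PR, so the parameters, the classes SbmlNameChokepointSketch and DemoEntity, and the inline character check are hypothetical. The "fix" half is omitted because the source does not describe it. The point is only that every setter funnels through one validator that throws PropertyVetoException, which existing catch blocks already handle.

```java
import java.beans.PropertyChangeEvent;
import java.beans.PropertyVetoException;

/** Illustrative sketch of a single validation chokepoint; not VCell code. */
final class SbmlNameChokepointSketch {
    private SbmlNameChokepointSketch() {}

    /** Returns newName unchanged if acceptable; otherwise vetoes the change. */
    static String validate(Object source, String oldName, String newName)
            throws PropertyVetoException {
        if (newName == null) {
            return null; // assumption: clearing the sbmlName is allowed
        }
        for (int i = 0; i < newName.length(); i++) {
            char c = newName.charAt(i);
            boolean c0Control = c < 0x20 && c != '\t' && c != '\n' && c != '\r';
            if (c0Control || c == 0xFFFD) {
                String msg = String.format(
                        "sbmlName contains invalid character U+%04X at index %d", (int) c, i);
                throw new PropertyVetoException(
                        msg, new PropertyChangeEvent(source, "sbmlName", oldName, newName));
            }
        }
        return newName;
    }
}

/** Hypothetical entity showing how a setter would delegate to the chokepoint. */
final class DemoEntity {
    private String sbmlName;

    void setSbmlName(String newName) throws PropertyVetoException {
        // Assign only after validation succeeds, so a veto leaves the old value intact.
        this.sbmlName = SbmlNameChokepointSketch.validate(this, this.sbmlName, newName);
    }

    String getSbmlName() {
        return sbmlName;
    }
}
```

Because the field is assigned only after validation returns, a vetoed change leaves the previous sbmlName in place.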

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Simulation Name (and matching Version Name) contained two U+FFFD
replacement characters where 'µ' (micron) used to be — a charset
corruption baked into the fixture: "Figure 2 is 4� micrometer radius".
The new TokenMangler validation correctly rejects U+FFFD as invalid
attribute content, breaking SEDMLExporterSBMLTest test case 91 on this
branch.

The pre-existing test only worked because TokenMangler.fixTokenStrict
silently mangled U+FFFD into '_' before SBML export. Replacing the bad
char with the actual 'µ' unicode letter doesn't help — fixTokenStrict
preserves Unicode letters, so the resulting SBML SId becomes invalid.
Substituting plain ASCII 'u' is round-trip-safe and self-explanatory:
"Figure 2 is 4u micrometer radius".

Verified via mvn -pl vcell-core -Dtest=SEDMLExporterSBMLTest test
restricted to this single fixture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jcschaff force-pushed the xmlchars-validation branch from 788e115 to a0a9105 May 5, 2026 14:06
jcschaff merged commit 4245802 into master May 5, 2026
13 checks passed
jcschaff deleted the xmlchars-validation branch May 5, 2026 15:06