Preserve XML-fragment markup in Bibcollection title/author #125

Merged
andrew2net merged 3 commits into main from fix-html-escaping-collection-title on May 13, 2026

@opoudjis
Contributor

Summary

Fix the bug reported in metanorma/isodoc#785 where an & in the landing-page collection title is passed through unescaped and then dropped by the browser's HTML entity parser. The fix preserves XML-fragment form (markup + entities) for the collection-level title and author by reading them via inner_html instead of Nokogiri's .text, and strips inline tags only in the HTML <title> element position.

Diagnosis

Three observations that together explain the bug:

  1. Liquid 5 does not auto-escape {{ var }} output. Verified at runtime: Liquid::Template.parse("{{ x }}").render("x" => "a & b <c>") emits a & b <c> literally. Every {{ var }} in the index/document templates outputs the raw Ruby string.
  2. The collection-level title and author were being entity-decoded on the way in. Bibcollection.from_xml used find_text, which calls Nokogiri's .text — that strips child elements and decodes &amp; back to a raw & in the Ruby string. That raw & then flowed straight through Liquid into the rendered HTML, reproducing the screenshot in #785.
  3. The document-level title is a different path and works correctly. RelatonBib::BibliographicItem.to_hash returns titles as literal XML-fragment strings (e.g. Use of <strong>ActualText</strong> &amp; <strong>Reference</strong> structure elements), and that whole string lands in the Ruby hash with markup and entities intact. Liquid emits it raw and it is valid HTML.

The two title pathways were thus producing inconsistent in-memory string formats: doc-titles already XML-fragment-encoded, collection-titles decoded to bare characters. The fix unifies them on the XML-fragment form.

Nokogiri behaviour on a namespaced relaton-collection title (probed at runtime):

| Call | Output |
|---|---|
| .text | Use of ActualText & Reference structure elements (markup stripped, entity decoded; this is the current bug) |
| .inner_html | Use of <strong>ActualText</strong> &amp; <strong>Reference</strong> structure elements (markup and entities preserved, no namespace decoration) |
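
The difference between the two string forms can be reproduced without Nokogiri. The sketch below is only an illustration: it uses REXML (bundled with Ruby) as a stand-in parser, and `decoded_text` and `inner_markup` are hypothetical helpers that approximate what `.text` and `.inner_html` are described as returning above. It is not the code in this PR.

```ruby
require "rexml/document"

TITLE_XML = "<title>Use of <strong>ActualText</strong> &amp; " \
            "<strong>Reference</strong> structure elements</title>"

# Hypothetical helper: concatenated text of all descendants,
# with tags dropped and entities decoded (the ".text"-like form).
def decoded_text(node)
  node.children.map do |child|
    case child
    when REXML::Text then child.value # "&amp;" is decoded to "&"
    when REXML::Element then decoded_text(child)
    else ""
    end
  end.join
end

# Hypothetical helper: children serialized back to markup,
# with tags and entities kept (the ".inner_html"-like form).
def inner_markup(node)
  node.children.map(&:to_s).join
end

title = REXML::Document.new(TITLE_XML).root
puts decoded_text(title)
puts inner_markup(title)
```

The first string is the form that reached Liquid before this change; the second is the form the fix aims to keep in memory.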

Why this approach rather than CGI.escapeHTML

The initial instinct was to wrap bibcollection.title and bibcollection.author in CGI.escapeHTML before passing them to Liquid. But CGI.escapeHTML escapes < and > as well as &, which would render any legitimate inline <strong>/<sub>/<sup> markup in a title as visible &lt;strong&gt; text. The collection title in the current test data is plain text, but nothing in the data model prevents future inputs from carrying inline markup the way doc-titles already do — so a blanket escape would be brittle.
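
A small sketch of that trade-off, using only the standard library. The two input strings are taken from the examples in this description; the variable names are illustrative.

```ruby
require "cgi/escape"

decoded  = "Use of ActualText & Reference structure elements"
fragment = "Use of <strong>ActualText</strong> &amp; " \
           "<strong>Reference</strong> structure elements"

# On a decoded plain-text title, escaping does the right thing.
puts CGI.escapeHTML(decoded)
# => Use of ActualText &amp; Reference structure elements

# On an XML-fragment title, the tags become visible text and the
# already-encoded entity is escaped a second time.
puts CGI.escapeHTML(fragment)
# => Use of &lt;strong&gt;ActualText&lt;/strong&gt; &amp;amp; &lt;strong&gt;Reference&lt;/strong&gt; structure elements
```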

Using inner_html is the parse-side dual of the doc-title path: the in-memory string is already HTML-safe (entities encoded, markup preserved), and Liquid's no-op rendering becomes correct by construction.

This also fixes a latent invalid-XML bug in Bibcollection#to_xml — that method currently interpolates the bare-character title into <title>#{title}</title>, which produces invalid XML when the title contains a raw &. After this change the title holds &amp;, so the round-trip is XML-safe.
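
The round-trip problem can be sketched as follows. This is an illustration under assumptions, not the project's code: REXML (bundled with Ruby) is used only as a convenient well-formedness check, `well_formed?` is a hypothetical helper, and the titles are made-up values.

```ruby
require "rexml/document"

# Hypothetical helper: true if the string parses as XML.
def well_formed?(xml)
  REXML::Document.new(xml)
  true
rescue REXML::ParseException
  false
end

decoded_title  = "Standards & Guides"     # form produced by reading via .text
fragment_title = "Standards &amp; Guides" # form produced by reading via inner_html

# Same interpolation pattern as described for Bibcollection#to_xml.
puts well_formed?("<title>#{decoded_title}</title>")
puts well_formed?("<title>#{fragment_title}</title>")
```

With the decoded title the interpolated string should be rejected by the parser; with the fragment form it should parse.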

Changes

  • lib/relaton/element_finder.rb — add find_html(xpath, element) alongside the existing find_text, returning find(xpath, element)&.inner_html.
  • lib/relaton/bibcollection.rb — switch the two find_text calls in from_xml to find_html.
  • templates/_index.liquid — change line 4 from <title>{{ title }}</title> to <title>{{ title | strip_html }}</title>. HTML5 treats the content of <title> as text — any inline element tags inside it would render as literal characters in the browser tab bar. strip_html is a built-in Liquid filter. The visible coverpage title at line 31 keeps {{ title }} unchanged so markup is rendered there as intended.
  • spec/assets/index-with-markup.xml (new) — minimal relaton-collection fixture with Use of <strong>ActualText</strong> &amp; <strong>Reference</strong> structure elements in the collection title and Acme &amp; Co in the author org name.
  • spec/relaton/cli/xml_to_html_renderer_spec.rb — three new examples asserting:
    • <span class="title-first"> contains the markup + &amp; literally,
    • <title> in the HTML head contains &amp; and no <strong> (verifies strip_html ran),
    • the author position emits Acme &amp; Co.

All three examples fail cleanly on main (verified by stash-and-run) and pass on this branch.
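
What the two template positions are expected to emit can be sketched in plain Ruby. The `strip_tags` helper is a simplified, hypothetical stand-in for Liquid's `strip_html` filter (it assumes that removing anything between `<` and `>` is enough for inline tags); the real templates use Liquid, not string interpolation.

```ruby
# Simplified stand-in for Liquid's strip_html filter (assumption:
# inline tags such as <strong> are the only markup in the title).
def strip_tags(fragment)
  fragment.gsub(/<[^>]*>/, "")
end

title = "Use of <strong>ActualText</strong> &amp; " \
        "<strong>Reference</strong> structure elements"

# HTML <title> position: tags removed, entity kept.
head_title = "<title>#{strip_tags(title)}</title>"
# Visible coverpage position: markup and entity passed through.
cover_title = %(<span class="title-first">#{title}</span>)

puts head_title
# => <title>Use of ActualText &amp; Reference structure elements</title>
puts cover_title
# => <span class="title-first">Use of <strong>ActualText</strong> &amp; <strong>Reference</strong> structure elements</span>
```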

Out of scope

Raw & in URL href attributes. URL fields (html, pdf, doc, xml, rxl, uri) are emitted unescaped into href attributes. A raw & in href is technically invalid HTML; browsers tolerate it; current Relaton data doesn't typically produce malformed URLs. Worth a separate ticket if it ever bites.

Doc-title markup pass-through contract. The rendering relies on RelatonBib::BibliographicItem.to_hash returning titles as XML-fragment strings (markup + entities intact). That contract is currently load-bearing for the doc-title pathway here. This PR does not modify or document it; the assumption is that the contract is stable for the foreseeable future.

Test plan

  • bundle exec rspec spec/relaton/cli/xml_to_html_renderer_spec.rb — passes (8 examples, 0 failures), with the 3 new examples verified to fail on main.
  • bundle exec rspec spec/relaton/bibcollection_spec.rb — passes (2 examples, 0 failures).
  • rubocop lib/relaton/element_finder.rb lib/relaton/bibcollection.rb spec/relaton/cli/xml_to_html_renderer_spec.rb — no new offences introduced by this change (pre-existing offences unrelated to this change remain).
  • End-to-end against metanorma site generate: set collection.name in a site's metanorma.yml to contain a literal &, point metanorma-cli at this branch via Gemfile.devel, recompile, and confirm _site/index.html shows &amp; in the coverpage-title rather than dropping the &.

Note: the unrelated lutaml-model adapter failures in the full bundle exec rspec run are from an in-flight upstream migration (lutaml-model 0.8.0) and pre-exist this branch.

🤖 Generated with Claude Code

Switch Bibcollection.from_xml to read the collection title and author
via inner_html instead of Nokogiri's .text, so the in-memory strings
keep their XML-fragment form (markup + entities intact). Apply the
strip_html Liquid filter on the HTML <title> tag position so the
browser tab title stays plain text. Adds find_html to ElementFinder
alongside find_text. Adds a regression spec with markup and &amp; in
both the collection title and the author name.

Refs metanorma/isodoc#785.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
module ElementFinder
  attr_reader :document

  def find_text(xpath, element = nil)
Contributor

Do we still need the find_text method?

Contributor Author

Claude injected that to avoid Rubocop whining. I wouldn't have. Your call.

Contributor

The method isn't used in the relaton-cli anymore. But it can be used outside. Let's keep it.

@andrew2net andrew2net merged commit fd20c9d into main May 13, 2026
23 of 25 checks passed
@andrew2net andrew2net deleted the fix-html-escaping-collection-title branch May 13, 2026 01:51
@ronaldtse
Contributor

@andrew2net just FYI that lutaml-model now handles XML in most ways so there is no reason to use any Nokogiri and "raw" at all.
