Preserve XML-fragment markup in Bibcollection title/author by opoudjis · Pull Request #125 · relaton/relaton-cli

opoudjis · 2026-05-12T14:07:36Z

Summary

Fix the bug reported in metanorma/isodoc#785 where an & in the landing-page collection title is passed through unescaped and then dropped by the browser's HTML entity parser. The fix preserves XML-fragment form (markup + entities) for the collection-level title and author by reading them via inner_html instead of Nokogiri's .text, and strips inline tags only in the HTML <title> element position.

Diagnosis

Three observations that together explain the bug:

Liquid 5 does not auto-escape {{ var }} output. Verified at runtime: Liquid::Template.parse("{{ x }}").render("x" => "a & b <c>") emits a & b <c> literally. Every {{ var }} in the index/document templates outputs the raw Ruby string.
The collection-level title and author were being entity-decoded on the way in. Bibcollection.from_xml used find_text, which calls Nokogiri's .text; that strips child elements and decodes `&amp;` back to a raw `&` in the Ruby string. That raw `&` then flowed straight through Liquid into the rendered HTML, reproducing the screenshot in #785.
The document-level title is a different path and works correctly. RelatonBib::BibliographicItem.to_hash returns titles as literal XML-fragment strings (e.g. `Use of <strong>ActualText</strong> &amp; <strong>Reference</strong> structure elements`), and that whole string lands in the Ruby hash with markup and entities intact. Liquid emits it raw and it is valid HTML.

The two title pathways were thus producing inconsistent in-memory string formats: doc-titles already XML-fragment-encoded, collection-titles decoded to bare characters. The fix unifies them on the XML-fragment form.

Nokogiri behaviour on a namespaced relaton-collection title (probed at runtime):

| Call | Output |
| --- | --- |
| `.text` | `Use of ActualText & Reference structure elements` (markup stripped, entity decoded: the current bug) |
| `.inner_html` | `Use of <strong>ActualText</strong> &amp; <strong>Reference</strong> structure elements` (markup + entities preserved, no namespace decoration) |

Why this approach rather than `CGI.escapeHTML`

The initial instinct was to wrap bibcollection.title and bibcollection.author in CGI.escapeHTML before passing them to Liquid. But CGI.escapeHTML escapes < and > as well as &, which would render any legitimate inline markup in a title as visible tag text. The collection title in the current test data is plain text, but nothing in the data model prevents future inputs from carrying inline markup the way doc-titles already do, so a blanket escape would be brittle.
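To make the trade-off concrete, here is a minimal sketch using only Ruby's standard library. The two title strings are illustrative values modelled on the fixture in this PR, not taken from real data.

```ruby
require "cgi/escape"

# Illustrative title in XML-fragment form: inline markup plus an encoded ampersand.
fragment_title = "Use of <strong>ActualText</strong> &amp; Reference"

# The same title with markup stripped and the entity decoded.
decoded_title = "Use of ActualText & Reference"

# A blanket escape repairs the bare ampersand in the decoded string...
escaped_decoded = CGI.escapeHTML(decoded_title)

# ...but on a string that carries markup it escapes the tags as well,
# and double-escapes the entity that was already encoded.
escaped_fragment = CGI.escapeHTML(fragment_title)

puts escaped_decoded
puts escaped_fragment
```

The second result would show the tags as literal text in the browser, which is the brittleness described above.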

Using inner_html is the parse-side dual of the doc-title path: the in-memory string is already HTML-safe (entities encoded, markup preserved), and Liquid's no-op rendering becomes correct by construction.

This also fixes a latent invalid-XML bug in Bibcollection#to_xml: that method currently interpolates the bare-character title into <title>#{title}</title>, which produces invalid XML when the title contains a raw `&`. After this change the title holds `&amp;`, so the round-trip is XML-safe.
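The round-trip point can be checked with a small sketch. It uses REXML purely as a convenient well-formedness check (relaton-cli itself uses Nokogiri), and `well_formed?` is a hypothetical helper introduced here for illustration.

```ruby
require "rexml/document"

# Hypothetical helper: true when the string parses as well-formed XML.
def well_formed?(xml)
  REXML::Document.new(xml)
  true
rescue StandardError
  # REXML rejects a raw "&" in character data; the exact error class
  # can differ between REXML versions, so rescue broadly.
  false
end

raw_title = "Acme & Co"          # entity-decoded string
encoded_title = "Acme &amp; Co"  # XML-fragment form

# Same interpolation pattern as the one described for to_xml.
raw_ok = well_formed?("<title>#{raw_title}</title>")
encoded_ok = well_formed?("<title>#{encoded_title}</title>")

puts raw_ok
puts encoded_ok
```

Only the encoded form survives interpolation into an XML element without further escaping.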

Changes

lib/relaton/element_finder.rb: add find_html(xpath, element) alongside the existing find_text, returning find(xpath, element)&.inner_html.
lib/relaton/bibcollection.rb: switch the two find_text calls in from_xml to find_html.
templates/_index.liquid: change line 4 from <title>{{ title }}</title> to <title>{{ title | strip_html }}</title>. HTML5 treats the content of <title> as text, so any inline element tags inside it would render as literal characters in the browser tab bar. strip_html is a built-in Liquid filter. The visible coverpage title at line 31 keeps {{ title }} unchanged so markup is rendered there as intended.
spec/assets/index-with-markup.xml (new): minimal relaton-collection fixture with `Use of <strong>ActualText</strong> &amp; <strong>Reference</strong> structure elements` in the collection title and `Acme &amp; Co` in the author org name.
spec/relaton/cli/xml_to_html_renderer_spec.rb: three new examples asserting:
- the visible coverpage title contains the markup + `&amp;` literally,
- <title> in the HTML head contains `&amp;` and no inline tags (verifies strip_html ran),
- the author position emits `Acme &amp; Co`.

All three examples fail cleanly on main (verified by stash-and-run) and pass on this branch.
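The split between the head <title> and the visible coverpage title can be sketched in plain Ruby. `strip_tags` below is a hypothetical stand-in that only approximates what the strip_html filter does; it is not Liquid's own implementation, and the title string is illustrative.

```ruby
# Hypothetical stand-in for Liquid's strip_html filter: drops anything that
# looks like a tag and leaves entities untouched.
def strip_tags(fragment)
  fragment.to_s.gsub(/<[^>]*>/, "")
end

title = "Use of <strong>ActualText</strong> &amp; <strong>Reference</strong> structure elements"

head_title = strip_tags(title) # goes into the <title> element: plain text only
cover_title = title            # goes into the visible heading: markup kept

puts head_title
puts cover_title
```

In both positions the encoded ampersand is left as is, so the browser decodes it to a single & when rendering.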

Out of scope

Raw & in URL href attributes. URL fields (html, pdf, doc, xml, rxl, uri) are emitted unescaped into href attributes. A raw & in href is technically invalid HTML; browsers tolerate it; current Relaton data doesn't typically produce malformed URLs. Worth a separate ticket if it ever bites.

Doc-title markup pass-through contract. The rendering relies on RelatonBib::BibliographicItem.to_hash returning titles as XML-fragment strings (markup + entities intact). That contract is currently load-bearing for the doc-title pathway here. This PR does not modify or document it; the assumption is that the contract is stable for the foreseeable future.

Test plan

bundle exec rspec spec/relaton/cli/xml_to_html_renderer_spec.rb — passes (8 examples, 0 failures), with the 3 new examples verified to fail on main.
bundle exec rspec spec/relaton/bibcollection_spec.rb — passes (2 examples, 0 failures).
rubocop lib/relaton/element_finder.rb lib/relaton/bibcollection.rb spec/relaton/cli/xml_to_html_renderer_spec.rb — no new offences introduced by this change (pre-existing offences unrelated to this change remain).
End-to-end against metanorma site generate: set collection.name in a site's metanorma.yml to contain a literal &, point metanorma-cli at this branch via Gemfile.devel, recompile, and confirm _site/index.html shows & in the coverpage-title rather than dropping the &.

Note: the unrelated lutaml-model adapter failures in the full bundle exec rspec run are from an in-flight upstream migration (lutaml-model 0.8.0) and pre-exist this branch.

🤖 Generated with Claude Code

Switch Bibcollection.from_xml to read the collection title and author via inner_html instead of Nokogiri's .text, so the in-memory strings keep their XML-fragment form (markup + entities intact). Apply the strip_html Liquid filter on the HTML <title> tag position so the browser tab title stays plain text. Adds find_html to ElementFinder alongside find_text. Adds a regression spec with markup and & in both the collection title and the author name. Refs metanorma/isodoc#785. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

andrew2net · 2026-05-12T17:29:39Z

  module ElementFinder
    attr_reader :document

    def find_text(xpath, element = nil)


Do we still need the find_text method?

Claude injected that to avoid Rubocop whining. I wouldn't have. Your call.

The method isn't used in the relaton-cli anymore. But it can be used outside. Let's keep it.

ronaldtse · 2026-05-13T02:02:08Z

@andrew2net just FYI that lutaml-model now handles XML in most ways so there is no reason to use any Nokogiri and "raw" at all.

opoudjis assigned andrew2net May 12, 2026

opoudjis requested a review from andrew2net May 12, 2026 14:08

opoudjis mentioned this pull request May 12, 2026

Minor bug: HTML landing page text is not being HTML-escaped metanorma/isodoc#785

Open

andrew2net approved these changes May 12, 2026


andrew2net added 2 commits May 12, 2026 17:44

Add workflow_dispatch trigger to rake.yml

373990e

Update relaton dependency to version 1.20.3

e1c9d85

andrew2net merged commit fd20c9d into main May 13, 2026
23 of 25 checks passed

andrew2net deleted the fix-html-escaping-collection-title branch May 13, 2026 01:51

andrew2net mentioned this pull request May 13, 2026

Reimplement collection model with lutaml-model #126

Open

andrew2net merged 3 commits into main from fix-html-escaping-collection-title
