Preserve XML-fragment markup in Bibcollection title/author #125
Merged
Conversation
Switch Bibcollection.from_xml to read the collection title and author via inner_html instead of Nokogiri's .text, so the in-memory strings keep their XML-fragment form (markup + entities intact). Apply the strip_html Liquid filter on the HTML <title> tag position so the browser tab title stays plain text.

Adds find_html to ElementFinder alongside find_text. Adds a regression spec with markup and & in both the collection title and the author name.

Refs metanorma/isodoc#785.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
andrew2net
approved these changes
May 12, 2026
module ElementFinder
  attr_reader :document

  def find_text(xpath, element = nil)
Contributor
Do we still need the find_text method?
Contributor
Author
Claude injected that to avoid Rubocop whining. I wouldn't have. Your call.
Contributor
The method isn't used in the relaton-cli anymore. But it can be used outside. Let's keep it.
Contributor
@andrew2net just FYI that lutaml-model now handles XML in most ways, so there is no reason to use any Nokogiri and "raw" at all.
Summary
Fix the bug reported in metanorma/isodoc#785, where an `&` in the landing-page collection title is passed through unescaped and then dropped by the browser's HTML entity parser. The fix preserves the XML-fragment form (markup + entities) of the collection-level title and author by reading them via `inner_html` instead of Nokogiri's `.text`, and strips inline tags only in the HTML `<title>` element position.

Diagnosis
Three observations that together explain the bug:

1. Liquid does not escape `{{ var }}` output. Verified at runtime: `Liquid::Template.parse("{{ x }}").render("x" => "a & b <c>")` emits `a & b <c>` literally. Every `{{ var }}` in the index/document templates outputs the raw Ruby string.
2. The collection-level `title` and `author` were being entity-decoded on the way in. `Bibcollection.from_xml` used `find_text`, which calls Nokogiri's `.text`; that strips child elements and decodes `&amp;` back to a raw `&` in the Ruby string. That raw `&` then flowed straight through Liquid into the rendered HTML, reproducing the screenshot in #785.
3. The doc-level `title` is a different path and works correctly. `RelatonBib::BibliographicItem.to_hash` returns titles as literal XML-fragment strings (e.g. `Use of <strong>ActualText</strong> &amp; <strong>Reference</strong> structure elements`), and that whole string lands in the Ruby hash with markup and entities intact. Liquid emits it raw and it is valid HTML.

The two title pathways were thus producing inconsistent in-memory string formats: doc-titles already XML-fragment-encoded, collection-titles decoded to bare characters. The fix unifies them on the XML-fragment form.
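The mismatch between the two in-memory forms can be sketched with the standard library alone. This is a stand-in, not the project's code: plain string interpolation plays the role of Liquid's non-escaping `{{ var }}` output, and the hypothetical helpers `decode_like_text` and `render_heading` only approximate what `.text` and the template do.

```ruby
require "cgi/escape"

# XML-fragment form, as the doc-title path already delivers it.
fragment = "Use of <strong>ActualText</strong> &amp; <strong>Reference</strong> structure elements"

# Hypothetical stand-in for Nokogiri's .text: drop tags, then decode entities.
def decode_like_text(xml_fragment)
  CGI.unescapeHTML(xml_fragment.gsub(/<[^>]+>/, ""))
end

# Hypothetical stand-in for a non-escaping template output such as {{ title }}.
def render_heading(title)
  "<h1>#{title}</h1>"
end

decoded = decode_like_text(fragment)

html_from_decoded  = render_heading(decoded)   # carries a bare "&"
html_from_fragment = render_heading(fragment)  # keeps "&amp;" and <strong>

puts html_from_decoded
puts html_from_fragment
```

The first line printed contains a bare `&` and no markup; the second keeps both `&amp;` and the `<strong>` elements, which is the form the fix standardises on.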
Nokogiri behaviour on a namespaced relaton-collection title (probed at runtime):

- `.text`: `Use of ActualText & Reference structure elements` (markup stripped, entity decoded; the current bug)
- `.inner_html`: `Use of <strong>ActualText</strong> &amp; <strong>Reference</strong> structure elements` (markup + entities preserved, no namespace decoration)

Why this approach rather than `CGI.escapeHTML`

The initial instinct was to wrap `bibcollection.title` and `bibcollection.author` in `CGI.escapeHTML` before passing them to Liquid. But `CGI.escapeHTML` escapes `<` and `>` as well as `&`, which would render any legitimate inline `<strong>`/`<sub>`/`<sup>` markup in a title as visible `&lt;strong&gt;` text. The collection title in the current test data is plain text, but nothing in the data model prevents future inputs from carrying inline markup the way doc-titles already do, so a blanket escape would be brittle.

Using `inner_html` is the parse-side dual of the doc-title path: the in-memory string is already HTML-safe (entities encoded, markup preserved), and Liquid's no-op rendering becomes correct by construction.

This also fixes a latent invalid-XML bug in `Bibcollection#to_xml`: that method currently interpolates the bare-character title into `<title>#{title}</title>`, which produces invalid XML when the title contains a raw `&`. After this change the title holds `&amp;`, so the round-trip is XML-safe.

Changes
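The `CGI.escapeHTML` trade-off that motivates these changes can be checked with the standard library alone. A minimal sketch, using the strings quoted in this description; the variable names are illustrative and not part of the PR:

```ruby
require "cgi/escape"

# A title already in XML-fragment form, and a plain author string with a bare "&".
fragment_title = "Use of <strong>ActualText</strong> &amp; <strong>Reference</strong> structure elements"
plain_author   = "Acme & Co"

# On plain text a blanket escape does what is wanted.
escaped_author = CGI.escapeHTML(plain_author)

# On a string that carries inline markup it escapes the tags as well,
# and it escapes the already-encoded entity a second time.
escaped_title = CGI.escapeHTML(fragment_title)

puts escaped_author
puts escaped_title
```

The escaped title would show the literal characters `<strong>` in the page instead of bold text, which is why the branch keeps the XML-fragment form and leaves the template output unescaped.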
- `lib/relaton/element_finder.rb`: add `find_html(xpath, element)` alongside the existing `find_text`, returning `find(xpath, element)&.inner_html`.
- `lib/relaton/bibcollection.rb`: switch the two `find_text` calls in `from_xml` to `find_html`.
- `templates/_index.liquid`: change line 4 from `<title>{{ title }}</title>` to `<title>{{ title | strip_html }}</title>`. HTML5 treats the content of `<title>` as text, so any inline element tags inside it would render as literal characters in the browser tab bar. `strip_html` is a built-in Liquid filter. The visible coverpage title at line 31 keeps `{{ title }}` unchanged so markup is rendered there as intended.
- `spec/assets/index-with-markup.xml` (new): minimal `relaton-collection` fixture with `Use of <strong>ActualText</strong> &amp; <strong>Reference</strong> structure elements` in the collection title and `Acme &amp; Co` in the author org name.
- `spec/relaton/cli/xml_to_html_renderer_spec.rb`: three new examples asserting that:
  - `<span class="title-first">` contains the markup + `&amp;` literally,
  - `<title>` in the HTML head contains `&amp;` and no `<strong>` (verifies `strip_html` ran),
  - the author name is emitted as `Acme &amp; Co`.

All three examples fail cleanly on `main` (verified by stash-and-run) and pass on this branch.

Out of scope

- Raw `&` in URL `href` attributes. URL fields (`html`, `pdf`, `doc`, `xml`, `rxl`, `uri`) are emitted unescaped into `href` attributes. A raw `&` in `href` is technically invalid HTML; browsers tolerate it; current Relaton data doesn't typically produce malformed URLs. Worth a separate ticket if it ever bites.
- Doc-title markup pass-through contract. The rendering relies on `RelatonBib::BibliographicItem.to_hash` returning titles as XML-fragment strings (markup + entities intact). That contract is currently load-bearing for the doc-title pathway here. This PR does not modify or document it; the assumption is that the contract is stable for the foreseeable future.

Test plan
- `bundle exec rspec spec/relaton/cli/xml_to_html_renderer_spec.rb`: passes (8 examples, 0 failures), with the 3 new examples verified to fail on `main`.
- `bundle exec rspec spec/relaton/bibcollection_spec.rb`: passes (2 examples, 0 failures).
- `rubocop lib/relaton/element_finder.rb lib/relaton/bibcollection.rb spec/relaton/cli/xml_to_html_renderer_spec.rb`: no new offences introduced by this change (pre-existing offences unrelated to this change remain).
- `metanorma site generate`: set `collection.name` in a site's `metanorma.yml` to contain a literal `&`, point `metanorma-cli` at this branch via `Gemfile.devel`, recompile, and confirm `_site/index.html` shows `&amp;` in the coverpage-title rather than dropping the `&`.

Note: the unrelated lutaml-model adapter failures in the full `bundle exec rspec` run are from an in-flight upstream migration (lutaml-model 0.8.0) and pre-exist this branch.

🤖 Generated with Claude Code