RAWTEXT/RCDATA elements are closed inconsistently #19

12Me21 · 2022-12-20T02:30:54Z

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

7.2.3

Link to runnable example

No response

Steps to reproduce

import {raw} from 'hast-util-raw'
import {u} from 'unist-builder'

let test1 = u('root', [
	u('raw', '<xmp><b>bold</b>after</xmp>'),
])

let test2 = u('root', [
	u('raw', '<xmp>'),
	u('raw', '<b>'),
	u('text', 'bold'),
	u('raw', '</b>'),
	u('text', 'after'),
	u('raw', '</xmp>'),
])

console.log(raw(test1))
console.log(raw(test2))

Expected behavior

The output tree should contain:

root
- element: xmp
 - text: "boldafter"

(note: <xmp> is parsed in RAWTEXT mode, so tags/escapes aren't parsed inside it, and it only ends when the parser reaches </xmp>)

Actual behavior

root
- element: xmp
 - text: "bold"
- text: "after"

the <xmp> element is being closed by the tag
This affects all elements whose contents are parsed as RAWTEXT/SCRIPT_DATA (<style>, <iframe>, <noembed>, <noframes>, <script>) or RCDATA (<title>, <textarea>)

This can be seen in remark/rehype (with allowDangerousHtml enabled), in cases like:

<xmp><b>one</b><b>two</b></xmp>

which renders as:

onetwo

Affected runtime and version

node@18.0.0

Affected package manager and version

npm@8.6.0

Affected OS and version

Debian

Build and bundle tools

Rollup

The text was updated successfully, but these errors were encountered:

wooorm · 2022-12-20T09:36:12Z

So, ehh, why are you using obsolete elements anyway? It’s marked as “Non-conforming features” by the HTML spec: so humans shouldn’t use it. Of course, you’re right that it would be good to fix.
A more logical reduced test case can probably be made with style, iframe, which also use rawtext (note that xmp does not relate to rcdata as far as I am aware, that’s for textarea, title).

12Me21 · 2022-12-20T09:58:33Z

welll, <xmp> is still pretty convenient despite being obsolete, since it's the only "normal" element which is parsed in rawtext mode
anyway, I mentioned rcdata since I believe this bug applies to those elements as well

12Me21 · 2022-12-22T14:30:34Z

it seems like this happens because the text node handler calls resetTokenizer(), which always resets the tokenizer into the DATA state, even when inside an element that should be parsed in some other state.
I suppose it would be possible to fix by just not resetting the state? (or only resetting it if it's not DATA/RAWTEXT/RCDATA/SCRIPT_DATA/etc.)
Edit:
Another issue is that the element node handler passes start/end tag tokens to parser._processInputToken while in RAWTEXT/RCDATA/etc. state, which is normally impossible, so parse5 just ignores them.

The result of this is:

With the markdown: <xmp>*one* *two*</xmp>, and allowDangerousHtml enabled:

using micromark to output html directly:
- result: <xmp>one two</xmp>
 - displayed as either "one two" OR "one two" (depending on how the html is sanitized)
using mdast-util-from-markdown + mdast-util-to-hast + hast-util-raw + hast-util-to-html:
- result: <xmp>one</xmp> two
 - displayed as "one two" (the first token is ignored since we're still in RAWTEXT mode, but then the "one" text node sets the parser back to DATA mode. so closes the <xmp> since there's no open , (and the second em node is handled normally since we're outside xmp now))

(note: in both cases, the actual result (as displayed by a browser) is <xmp>...  because xmp is not allowed inside p, so it splits the paragraph, I think)

12Me21 · 2022-12-22T15:35:40Z

Here's the code i used to check the results of that:

import {micromark} from 'micromark'
import {fromMarkdown} from 'mdast-util-from-markdown'

import {toHast} from 'mdast-util-to-hast'
import {raw} from 'hast-util-raw'
import {toHtml} from 'hast-util-to-html'

import {inspect} from 'unist-util-inspect'

let text = `<xmp>*one* *two*</xmp>`

let html = micromark(text, {allowDangerousHtml: true})

let mdast = fromMarkdown(text, {allowDangerousHtml: true})
let hast = toHast(mdast, {allowDangerousHtml:true})
let hast2 = raw(hast)
let html2 = toHtml(hast2)

console.log('=== TEXT ===')
console.log(text)
console.log('=== HTML ===')
console.log(html)
console.log('='.repeat(50))
console.log('=== MDAST ===')
console.log(inspect(mdast, {showPositions:false}))
console.log('=== HAST ===')
console.log(inspect(hast, {showPositions:false}))
console.log('=== HAST (after hast-util-raw) ===')
console.log(inspect(hast2, {showPositions:false}))
console.log('=== HTML ===')
console.log(html2)

wooorm · 2022-12-23T06:31:42Z

So to repeat: you really should not use xmp. https://developer.mozilla.org/en-US/docs/Web/HTML/Element/xmp.

If you start working on a PR, I recommend basing it off of #17. parse5 changed a lot last major, so this project was rewritten for that. I plan on releasing it in one or two months.

github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Dec 20, 2022

wooorm closed this as completed in e66705a Jan 16, 2023

This comment has been minimized.

Sign in to view

wooorm added the 💪 phase/solved Post is done label Jan 16, 2023

github-actions bot removed the 🤞 phase/open Post is being triaged manually label Jan 16, 2023

wooorm added the 🐛 type/bug This is a problem label Jan 16, 2023

wooorm mentioned this issue Jan 31, 2023

Add invoice for @wooorm unifiedjs/collective#59

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RAWTEXT/RCDATA elements are closed inconsistently #19

RAWTEXT/RCDATA elements are closed inconsistently #19

12Me21 commented Dec 20, 2022

wooorm commented Dec 20, 2022 •

edited

12Me21 commented Dec 20, 2022

12Me21 commented Dec 22, 2022 •

edited

12Me21 commented Dec 22, 2022

wooorm commented Dec 23, 2022

This comment has been minimized.

RAWTEXT/RCDATA elements are closed inconsistently #19

RAWTEXT/RCDATA elements are closed inconsistently #19

Comments

12Me21 commented Dec 20, 2022

Initial checklist

Affected packages and versions

Link to runnable example

Steps to reproduce

Expected behavior

Actual behavior

Affected runtime and version

Affected package manager and version

Affected OS and version

Build and bundle tools

wooorm commented Dec 20, 2022 • edited

12Me21 commented Dec 20, 2022

12Me21 commented Dec 22, 2022 • edited

12Me21 commented Dec 22, 2022

wooorm commented Dec 23, 2022

This comment has been minimized.

wooorm commented Dec 20, 2022 •

edited

12Me21 commented Dec 22, 2022 •

edited