Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RAWTEXT/RCDATA elements are closed inconsistently #19

Closed
4 tasks done
12Me21 opened this issue Dec 20, 2022 · 6 comments
Closed
4 tasks done

RAWTEXT/RCDATA elements are closed inconsistently #19

12Me21 opened this issue Dec 20, 2022 · 6 comments
Labels
💪 phase/solved Post is done 🐛 type/bug This is a problem

Comments

@12Me21
Copy link

12Me21 commented Dec 20, 2022

Initial checklist

Affected packages and versions

7.2.3

Link to runnable example

No response

Steps to reproduce

import {raw} from 'hast-util-raw'
import {u} from 'unist-builder'

let test1 = u('root', [
	u('raw', '<xmp><b>bold</b>after</xmp>'),
])

let test2 = u('root', [
	u('raw', '<xmp>'),
	u('raw', '<b>'),
	u('text', 'bold'),
	u('raw', '</b>'),
	u('text', 'after'),
	u('raw', '</xmp>'),
])

console.log(raw(test1))
console.log(raw(test2))

Expected behavior

The output tree should contain:

  • root
    • element: xmp
      • text: "<b>bold</b>after"

(note: <xmp> is parsed in RAWTEXT mode, so tags/escapes aren't parsed inside it, and it only ends when the parser reaches </xmp>)

Actual behavior

  • root
    • element: xmp
      • text: "<b>bold"
    • text: "after"

the <xmp> element is being closed by the </b> tag
This affects all elements whose contents are parsed as RAWTEXT/SCRIPT_DATA (<style>, <iframe>, <noembed>, <noframes>, <script>) or RCDATA (<title>, <textarea>)

This can be seen in remark/rehype (with allowDangerousHtml enabled), in cases like:

<xmp><b>one</b><b>two</b></xmp>

which renders as:

<b>onetwo

Affected runtime and version

node@18.0.0

Affected package manager and version

npm@8.6.0

Affected OS and version

Debian

Build and bundle tools

Rollup

@github-actions github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Dec 20, 2022
@wooorm
Copy link
Member

wooorm commented Dec 20, 2022

So, ehh, why are you using obsolete elements anyway? It’s marked as “Non-conforming features” by the HTML spec: so humans shouldn’t use it. Of course, you’re right that it would be good to fix.
A more logical reduced test case can probably be made with style, iframe, which also use rawtext (note that xmp does not relate to rcdata as far as I am aware, that’s for textarea, title).

@12Me21
Copy link
Author

12Me21 commented Dec 20, 2022

welll, <xmp> is still pretty convenient despite being obsolete, since it's the only "normal" element which is parsed in rawtext mode
anyway, I mentioned rcdata since I believe this bug applies to those elements as well

@12Me21
Copy link
Author

12Me21 commented Dec 22, 2022

it seems like this happens because the text node handler calls resetTokenizer(), which always resets the tokenizer into the DATA state, even when inside an element that should be parsed in some other state.
I suppose it would be possible to fix by just not resetting the state? (or only resetting it if it's not DATA/RAWTEXT/RCDATA/SCRIPT_DATA/etc.)
Edit:
Another issue is that the element node handler passes start/end tag tokens to parser._processInputToken while in RAWTEXT/RCDATA/etc. state, which is normally impossible, so parse5 just ignores them.

The result of this is:

With the markdown: <xmp>*one* *two*</xmp>, and allowDangerousHtml enabled:

  • using micromark to output html directly:
    • result: <p><xmp><em>one</em> <em>two</em></xmp></p>
      • displayed as either "<em>one</em> <em>two</em>" OR "one two" (depending on how the html is sanitized)
  • using mdast-util-from-markdown + mdast-util-to-hast + hast-util-raw + hast-util-to-html:
    • result: <p><xmp>one</xmp> <em>two</em></p>
      • displayed as "one two" (the first <em> token is ignored since we're still in RAWTEXT mode, but then the "one" text node sets the parser back to DATA mode. so </em> closes the <xmp> since there's no open <em>, (and the second em node is handled normally since we're outside xmp now))

(note: in both cases, the actual result (as displayed by a browser) is <p></p><xmp>... <p></p> because xmp is not allowed inside p, so it splits the paragraph, I think)

@12Me21
Copy link
Author

12Me21 commented Dec 22, 2022

Here's the code i used to check the results of that:

import {micromark} from 'micromark'
import {fromMarkdown} from 'mdast-util-from-markdown'

import {toHast} from 'mdast-util-to-hast'
import {raw} from 'hast-util-raw'
import {toHtml} from 'hast-util-to-html'

import {inspect} from 'unist-util-inspect'

let text = `<xmp>*one* *two*</xmp>`

let html = micromark(text, {allowDangerousHtml: true})

let mdast = fromMarkdown(text, {allowDangerousHtml: true})
let hast = toHast(mdast, {allowDangerousHtml:true})
let hast2 = raw(hast)
let html2 = toHtml(hast2)

console.log('=== TEXT ===')
console.log(text)
console.log('=== HTML ===')
console.log(html)
console.log('='.repeat(50))
console.log('=== MDAST ===')
console.log(inspect(mdast, {showPositions:false}))
console.log('=== HAST ===')
console.log(inspect(hast, {showPositions:false}))
console.log('=== HAST (after hast-util-raw) ===')
console.log(inspect(hast2, {showPositions:false}))
console.log('=== HTML ===')
console.log(html2)

@wooorm
Copy link
Member

wooorm commented Dec 23, 2022

So to repeat: you really should not use xmp. https://developer.mozilla.org/en-US/docs/Web/HTML/Element/xmp.


If you start working on a PR, I recommend basing it off of #17. parse5 changed a lot last major, so this project was rewritten for that. I plan on releasing it in one or two months.

@wooorm wooorm closed this as completed in e66705a Jan 16, 2023
@github-actions

This comment has been minimized.

@wooorm wooorm added the 💪 phase/solved Post is done label Jan 16, 2023
@github-actions github-actions bot removed the 🤞 phase/open Post is being triaged manually label Jan 16, 2023
@wooorm wooorm added the 🐛 type/bug This is a problem label Jan 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💪 phase/solved Post is done 🐛 type/bug This is a problem
Development

No branches or pull requests

2 participants