Why is rehype-raw not parsing textnodes? #8

Sewdn · 2017-10-26T13:01:05Z

According to the documentation, rehype-raw parses text nodes that are nested within HTML tags (and that can contain HTML syntax themselves):

<div class="note">

A mix of *markdown* and <em>HTML</em>.

</div>

yields:

<div class="note">
    <p>A mix of <em>markdown</em> and <em>HTML</em>.</p>
</div>

The text node of the div.note element was interpreted as a markdown paragraph, so it was parsed and wrapped in a p element. Its inner text was parsed as well, resulting in an em node with "markdown" as its text value.

However, when I leave out the blank lines around the paragraph:

<div class="note">
A mix of *markdown* and <em>HTML</em>.
</div>

or

<div class="note">A mix of *markdown* and <em>HTML</em>.</div>

this yields:

<div class="note">
    A mix of *markdown* and <em>HTML</em>.
</div>

The text node of the div.note element was not wrapped in a paragraph (because it didn't have the required line breaks). This is correct behaviour. But I would still expect the text value to be processed, ending up with an emphasis element like this:

<div class="note">
    A mix of <em>markdown</em> and <em>HTML</em>.
</div>

Is it intended behaviour that the text node is not parsed at all because it is not a paragraph? If it is a bug, I will look into fixing it.

Thanks for looking into this!

wooorm · 2017-10-28T17:07:07Z

I think rehype-raw does the opposite of what you expect it to. It deals with HTML in markdown, not markdown in HTML in markdown! You’d need a different project for this I think!

wooorm · 2017-11-06T18:53:06Z

Yeah, it definitely sounds like you’re expecting sort-of the “inverse” of what this project does! Someone will have to create that — this isn’t it!

CxRes · 2019-05-01T11:01:25Z

I find this behaviour counter-intuitive as well. I am afraid I have to agree with @Sewdn on this with the following explanation:

If you claim that we are dealing with HTML in markdown only, then <p> tags should not be added. After all, turning \n\n into <p> is a markdown-to-HTML conversion (and so is turning * into <em>).

IMHO the line breaks are unnecessary. See pandoc's behaviour.

If you want to interpret a single HTML line break as whitespace (which is strictly what HTML and markdown do), then only the second line break (the blank line after the opening tag) should count as markdown, and it should be interpreted as more whitespace rather than as a new <p>.

I am afraid that you might be double counting here!

CxRes · 2019-05-02T10:53:19Z

After a lot more research, it turns out that in CommonMark and Pandoc strict, only markdown within inline tags is converted; block tags are treated as pure HTML. Pandoc in its regular operation provides the convenience of processing markdown in block tags. While the behaviour is IMHO counter-intuitive, it is in line with the current rules of markdown-to-HTML conversion.

In this regard, all I have to add is that the example above, where the div seems to wrap the sentence, is misleading. The div tags (opening and closing) and the sentence are each processed individually. In @Sewdn's second example, however, the whole thing is treated as a single HTML block.
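
To make the distinction above concrete, here is a minimal sketch in plain JavaScript of how a CommonMark-style block splitter sees the two examples. It is a toy model, assuming only a simplified form of the rule that an HTML block opened by a block-level tag runs until the next blank line. The names `splitBlocks`, `BLOCK_TAGS`, and `HTML_BLOCK_START` are invented for illustration and are not part of remark-parse or rehype-raw.

```javascript
// Toy model only: a small subset of the CommonMark "HTML block" rule,
// not the actual remark-parse tokenizer. All names are hypothetical.
const BLOCK_TAGS = ['div', 'p', 'section', 'figure', 'table', 'ul', 'ol'];
const HTML_BLOCK_START = new RegExp(
  '^ {0,3}</?(?:' + BLOCK_TAGS.join('|') + ')(?:\\s|/?>|$)',
  'i'
);

function splitBlocks(source) {
  const blocks = [];
  let current = null;

  for (const line of source.split('\n')) {
    if (line.trim() === '') {
      // A blank line closes whatever block is open, HTML or paragraph.
      current = null;
      continue;
    }
    if (current === null) {
      const type = HTML_BLOCK_START.test(line) ? 'html' : 'paragraph';
      current = {type, value: line};
      blocks.push(current);
    } else if (current.type === 'paragraph' && HTML_BLOCK_START.test(line)) {
      // A block-level tag may interrupt a paragraph and open an HTML block.
      current = {type: 'html', value: line};
      blocks.push(current);
    } else {
      // Inside an open block, the line is appended as-is. For an HTML
      // block this means markdown syntax on the line stays raw text.
      current.value += '\n' + line;
    }
  }
  return blocks;
}

// First example from the thread: blank lines around the sentence.
const loose =
  '<div class="note">\n\nA mix of *markdown* and <em>HTML</em>.\n\n</div>';
// Second example from the thread: no blank lines.
const tight =
  '<div class="note">\nA mix of *markdown* and <em>HTML</em>.\n</div>';

console.log(splitBlocks(loose).map((b) => b.type).join(', ')); // html, paragraph, html
console.log(splitBlocks(tight).map((b) => b.type).join(', ')); // html
```

In the first case the sentence is its own paragraph block, so it goes through markdown inline parsing. In the second case the sentence is swallowed by the HTML block and is never seen as markdown.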

wooorm · 2019-05-02T13:31:06Z

@CxRes Correct, that’s what my first comment was in regard to as well. The second example is markdown (a), with one block of HTML (b), inside which is some further markdown syntax (c).

This project focuses on HTML inside markdown (b), not on markdown in HTML (c). That’s why this issue is unrelated to this project.

Your first comment is a different question: it’s about a “non-standard” syntax. That would be an issue for a different project (remark-parse), and I don’t think we should support it because we stick with CM / GFM. remark is pluggable, so it can of course be supported, but through a plugin that changes the HTML tokeniser!

CxRes · 2019-05-02T14:07:45Z

@wooorm A plugin would do nicely 😄 (I am guessing one that reruns remark-parse on the html nodes; however, that is beyond my capabilities/knowledge!). Especially since there is a requirement (such as in my use case) to render documents and/or use styles + conventions meant for extended flavors such as Pandoc (which is sort of a de facto standard for non-standard/extended md syntax).

I would still suggest the examples here could be made more clear, because a novice like yours truly is going to trip up from time to time. One is an anomaly, two's a trend!

wooorm · 2019-05-02T20:13:20Z

A plugin would do nicely! [...]

That would be welcome! I’d suggest first checking whether it’s possible to change the remark HTML block tokeniser to exit on one newline instead of two. Another route would be to support an attribute on elements (<div markdown>...</div>) or so.

I would still suggest the examples here could be made more clear [...]

Feel free to open a PR!

CxRes · 2019-05-03T22:23:55Z

@wooorm I have tried every which way to process example 2, with no dice:

I shut down the test condition (line 91 onward in HTML block tokenizer) for end of sequence, causing the HTML block tokenizer to exit on the first new line. The next line still had a p tag surrounding it, which defeats the purpose.
I tried to shut down block processing for all block tags by setting the array to []. With this, all tags were processed as "other" block elements.
Next, I shut down the processing for other block elements in the hope that all elements would be processed inline. This meant there was a p tag at every line break, even though it did not surround the second line.
Finally, I tried making the remainder of the line a child of the HTML node. I could not get this to work. I then realized that it is simply unclear to me how to structure the AST for this purpose.

My hope was to place an option processMarkdownInHtml in the parser, which would then allow the different processing in a single pass. The other alternative is to look at each block html value and run the parser over it a second time - which is ugly.

I simply do not have the knowledge needed for this and would have to defer to your expertise...
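
For readers following along, the "second pass" alternative can be sketched in plain JavaScript. This is only an illustration under loud assumptions: the node shape imitates mdast `html` nodes but nothing here calls remark or rehype, the `markdown` attribute is the hypothetical opt-in floated earlier in the thread rather than an existing feature, `reprocessHtmlNodes` is an invented name, and a single regular expression for `*emphasis*` stands in for re-running a real markdown parser over the inner content.

```javascript
// Toy sketch only. A real plugin would re-run the markdown parser on the
// inner content; here one regex for *emphasis* stands in for that step.
// `markdown` as an opt-in attribute is a hypothetical convention.
const WRAPPED = /^<([a-z][a-z0-9]*)((?:\s[^>]*)?)>([\s\S]*)<\/\1>\s*$/i;
// Naive check: can be fooled by "markdown" inside a quoted attribute value.
const HAS_MARKDOWN_ATTR = /\smarkdown(?=\s|=|$)/i;
const EMPHASIS = /\*([^*\n]+)\*/g;

function reprocessHtmlNodes(tree) {
  for (const node of tree.children) {
    if (node.type !== 'html') continue;
    const match = WRAPPED.exec(node.value);
    if (!match) continue; // not a single wrapping element, leave untouched
    const [, tag, attrs, inner] = match;
    if (!HAS_MARKDOWN_ATTR.test(attrs)) continue; // author did not opt in
    const converted = inner.replace(EMPHASIS, '<em>$1</em>');
    node.value = '<' + tag + attrs + '>' + converted + '</' + tag + '>';
  }
  return tree;
}

const tree = {
  type: 'root',
  children: [
    {
      type: 'html',
      value: '<div class="note" markdown>\nA mix of *markdown* and <em>HTML</em>.\n</div>'
    },
    {type: 'html', value: '<div class="note">\nLeft *as is*.\n</div>'}
  ]
};

reprocessHtmlNodes(tree);
console.log(tree.children[0].value); // emphasis converted inside the opted-in div
console.log(tree.children[1].value); // unchanged, no markdown attribute
```

The sketch only handles an HTML node that is exactly one wrapping element. Nested elements with the same tag name, block-level markdown inside the element, and escaping are all ignored, which is part of why the single-pass parser option discussed above would be the cleaner design.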


Sewdn mentioned this issue Oct 27, 2017

Hoes does HTML in Markdown work? syntax-tree/hast-util-raw#4

Closed

wooorm closed this as completed Nov 6, 2017

CxRes mentioned this issue May 2, 2019

Markdown content in html block tags is not converted luhmann/tufte-markdown#4

Open

CxRes mentioned this issue May 3, 2019

HTML figure tags are processed as markdown remarkjs/remark#400

Closed

CxRes added a commit to CxRes/rehype-raw that referenced this issue May 7, 2019

Updated example in Readme

54e2ab6

Updated the example in Readme to clarify the processing of markdown embedded within html. In response to discussion in rehypejs#8

CxRes mentioned this issue May 7, 2019

Updated example in Readme #12

Closed

CxRes added a commit to CxRes/rehype-raw that referenced this issue May 14, 2019

Updated example in Readme

8838b79

Updated the example in Readme to clarify the processing of markdown embedded within html. In response to discussion in rehypejs#8

CxRes mentioned this issue May 14, 2019

Updated example in Readme #13

Closed

wooorm added the 🙋 no/question This does not need any changes label Aug 13, 2019

wooorm mentioned this issue Feb 27, 2020

Markdown after self-closing html tag is not parsed #16

Closed

Specy mentioned this issue Jan 2, 2024

Unknown tags which are self closing do not close #30

Closed
