Why is rehype-raw not parsing textnodes? #8

Closed
Sewdn opened this issue Oct 26, 2017 · 8 comments
Labels
🙋 no/question This does not need any changes

Comments


Sewdn commented Oct 26, 2017

According to the documentation, rehype-raw parses text nodes nested within HTML tags (text that can itself contain HTML syntax):

<div class="note">

A mix of *markdown* and <em>HTML</em>.

</div>

yields:

<div class="note">
    <p>A mix of <em>markdown</em> and <em>HTML</em>.</p>
</div>

The text node of the div.note element was interpreted as a markdown paragraph, so it was parsed and wrapped in a p element. Its inner text was parsed as well, producing an em node with "markdown" as its text value.

However, when I don't put blank lines around the paragraph:

<div class="note">
A mix of *markdown* and <em>HTML</em>.
</div>

or

<div class="note">A mix of *markdown* and <em>HTML</em>.</div>

this yields:

<div class="note">
    A mix of *markdown* and <em>HTML</em>.
</div>

The text node of the div.note element was not wrapped in a paragraph (because it didn't have the required line breaks). This is correct behaviour. But I would still expect the text value to be processed, so that I end up with an emphasis element like this:

<div class="note">
    A mix of <em>markdown</em> and <em>HTML</em>.
</div>

Is it intended behaviour not to parse the text node at all because it is not a paragraph? If it is a bug, I will look into fixing it.

Thanks for looking into this!

Member

wooorm commented Oct 28, 2017

I think rehype-raw does the opposite of what you expect it to. It deals with HTML in markdown, not markdown in HTML in markdown! You’d need a different project for this I think!

Member

wooorm commented Nov 6, 2017

Yeah, it definitely sounds like you’re expecting sort-of the “inverse” of what this project does! Someone will have to create that — this isn’t it!

wooorm closed this as completed Nov 6, 2017

Contributor

CxRes commented May 1, 2019

I find this behaviour counter-intuitive as well. I am afraid I have to agree with @Sewdn on this with the following explanation:

If you claim that we are dealing with HTML in markdown only, then <p> tags should not be added. After all, turning \n\n into <p> is a markdown-to-HTML conversion (and so is turning * into em).

IMHO the line breaks are unnecessary. See pandoc's behaviour.

If you want to interpret a single HTML line break as whitespace (strictly what HTML and markdown do), then only the second line break (in the blank line after the opening tag) should count as markdown, and it should be interpreted as more whitespace, not as a new <p>.

I am afraid that you might be double counting here!

Contributor

CxRes commented May 2, 2019

After a lot more research, it turns out that in CommonMark and Pandoc strict mode, only markdown within inline tags is converted; block tags are treated as pure HTML. Pandoc in its regular operation provides the convenience of processing markdown in block tags. While the behaviour is, IMHO, counter-intuitive, it is in line with the current rules of markdown-to-HTML conversion.

In this regard, all I have to add is that the first example above, where it seems that the div wraps the sentence, is misleading. The div tags (opening and closing) and the sentence are each processed individually. In @Sewdn's second example, however, the whole thing is treated as a single HTML block.
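To make the distinction concrete, here is a minimal sketch of the CommonMark rule being described: an HTML block that starts with a block-level tag runs until the next blank line. This is not remark-parse's actual tokenizer; the names `BLOCK_TAGS`, `startsHtmlBlock`, and `splitBlocks` are made up for illustration, and the tag list is shortened on purpose.

```javascript
// Simplified sketch of the CommonMark "HTML block" rule for block-level tags:
// a line that begins with such a tag opens an HTML block, and the block runs
// until the next blank line. Every other run of lines becomes a paragraph.
// Assumption: only a handful of tag names are listed; the specification has many more.
const BLOCK_TAGS = ['div', 'p', 'section', 'table'];

function startsHtmlBlock(line) {
  // Up to three spaces of indentation, then an opening or closing tag name.
  const match = /^ {0,3}<\/?([a-zA-Z][a-zA-Z0-9-]*)(?=[\s/>]|$)/.exec(line);
  return match !== null && BLOCK_TAGS.includes(match[1].toLowerCase());
}

function splitBlocks(markdown) {
  const blocks = [];
  let current = null;
  for (const line of markdown.split('\n')) {
    if (line.trim() === '') {
      current = null; // a blank line closes both kinds of block
      continue;
    }
    // Start a new block after a blank line, or when a block-level tag
    // interrupts a paragraph.
    if (current === null || (current.type === 'paragraph' && startsHtmlBlock(line))) {
      current = { type: startsHtmlBlock(line) ? 'html' : 'paragraph', lines: [] };
      blocks.push(current);
    }
    current.lines.push(line);
  }
  return blocks.map((block) => ({ type: block.type, value: block.lines.join('\n') }));
}

// First example (blank lines around the sentence): three separate blocks.
const loose = splitBlocks('<div class="note">\n\nA mix of *markdown* and <em>HTML</em>.\n\n</div>');
console.log(loose.map((block) => block.type).join(',')); // html,paragraph,html

// Second example (no blank lines): one HTML block, so the markdown inside is never parsed.
const tight = splitBlocks('<div class="note">\nA mix of *markdown* and <em>HTML</em>.\n</div>');
console.log(tight.map((block) => block.type).join(',')); // html
```

In the first case the sentence is an ordinary paragraph that merely sits between two HTML blocks, which is why it gets a p element and parsed emphasis. In the second case the sentence is part of the HTML block's raw value.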

Member

wooorm commented May 2, 2019

@CxRes Correct, that’s what my first comment was in regard to as well. The second example is markdown (a), with one block of HTML (b), inside which is some further markdown syntax (c).

This project focuses on HTML inside markdown (b), not on markdown in HTML (c). That’s why this issue is unrelated to this project.

Your first comment is a different question: it’s about a “non-standard” syntax, which would be an issue for a different project (remark-parse), and which I don’t think we should support because we stick with CM / GFM. remark is pluggable, which means that it can of course be supported, but through a plugin that changes the HTML tokeniser!

Contributor

CxRes commented May 2, 2019

@wooorm A plugin would do nicely 😄 (I am guessing one that reruns remark-parse on the HTML nodes; however, that is beyond my capabilities/knowledge!). Especially since there is a requirement (such as in my use case) to render documents and/or use styles and conventions meant for extended flavors such as Pandoc (which is sort of a de facto standard for non-standard/extended md syntax).

I would still suggest the examples here could be made more clear, because a novice like yours truly is going to trip up from time to time. One is an anomaly, two's a trend!

Member

wooorm commented May 2, 2019

> A plugin would do nicely! [...]

That would be welcome! I’d suggest first checking whether it’s possible to change the remark HTML block tokeniser to exit on one newline instead of two. Another route to take would be to support an attribute on elements (<div markdown>...</div>) or so.
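As a rough illustration of the attribute route, a plugin would first need to decide whether an HTML block opts in. The sketch below is only an assumption about how such a check could look; `optsIntoMarkdown` is a hypothetical helper, not part of remark or rehype-raw, and it uses a regular expression rather than a real HTML parser.

```javascript
// Hypothetical opt-in check for the `<div markdown>` idea: returns true when the
// first tag of an HTML block carries a `markdown` attribute, with or without a value.
// Limitation of this regex heuristic: the word "markdown" inside a quoted
// attribute value (for example class="a markdown b") would also match.
function optsIntoMarkdown(htmlValue) {
  const openTag = /^\s*<[a-zA-Z][^>]*>/.exec(htmlValue);
  if (openTag === null) return false;
  return /\smarkdown(?=[\s=>\/])/.test(openTag[0]);
}

console.log(optsIntoMarkdown('<div markdown>A mix of *markdown* and <em>HTML</em>.</div>')); // true
console.log(optsIntoMarkdown('<div class="note">A mix of *markdown* and <em>HTML</em>.</div>')); // false
```

Blocks without the attribute would keep today's behaviour, so existing documents would render unchanged.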

> I would still suggest the examples here could be made more clear [...]

Feel free to open a PR!

Contributor

CxRes commented May 3, 2019

@wooorm I have tried every which way to process example 2, with no dice:

  1. I shut down the end-of-sequence test condition (line 91 onward in the HTML block tokenizer), causing the HTML block tokenizer to exit on the first new line. The next line still had a p tag surrounding it, so it defeats the purpose.

  2. I tried to shut down block processing for all block tags by setting the array to []. With this, all tags were processed as "other" block elements.

  3. Next, I shut down the processing for other block elements in the hope that all elements would be processed inline. This meant there was a p tag at every line break, even though it did not surround the second line.

  4. Finally, I tried making the remainder of the line a child of the HTML node. I could not get this to work. I then realized that it is simply unclear to me how to structure the AST for this purpose.

My hope was to place an option processMarkdownInHtml in the parser, which would then allow the different processing in a single pass. The other alternative is to look at each block html value and run the parser over it a second time - which is ugly.
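For reference, the second-pass alternative could be sketched roughly as follows. This is an illustration under stated assumptions, not a working remark plugin: the tree is a plain object shaped like mdast (nodes with `type`, `value`, `children`), `reprocessHtmlNodes` and its helpers are hypothetical names, and the toy `convertEmphasis` only understands single-asterisk emphasis where a real plugin would call the actual markdown parser.

```javascript
// Toy inline converter: turns *text* into <em>text</em>.
// Assumption: stands in for a real markdown parser run on the inner text.
function convertEmphasis(text) {
  return text.replace(/\*([^*\s](?:[^*]*[^*\s])?)\*/g, '<em>$1</em>');
}

// Split the raw HTML into tags and the text between them. Parts that start
// with "<" are treated as tags and left alone, so attribute values are not touched.
function reprocessHtmlValue(value) {
  return value
    .split(/(<[^>]+>)/)
    .map((part) => (part.startsWith('<') ? part : convertEmphasis(part)))
    .join('');
}

// Walk the tree and return a copy in which every `html` node has had its
// value reprocessed; all other nodes are kept as they are.
function reprocessHtmlNodes(node) {
  if (node.type === 'html') {
    return { ...node, value: reprocessHtmlValue(node.value) };
  }
  if (Array.isArray(node.children)) {
    return { ...node, children: node.children.map(reprocessHtmlNodes) };
  }
  return node;
}

const tree = {
  type: 'root',
  children: [
    { type: 'html', value: '<div class="note">A mix of *markdown* and <em>HTML</em>.</div>' }
  ]
};

console.log(reprocessHtmlNodes(tree).children[0].value);
// <div class="note">A mix of <em>markdown</em> and <em>HTML</em>.</div>
```

The drawback mentioned above still applies: the content is scanned twice, and the second pass has to re-implement or re-invoke the inline rules separately from the main parse.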

I simply do not have the knowledge needed for this and would have to defer to your expertise...

CxRes added a commit to CxRes/rehype-raw that referenced this issue May 7, 2019
Updated the example in Readme to clarify the processing of markdown embedded within html. 

In response to discussion in rehypejs#8
CxRes added a commit to CxRes/rehype-raw that referenced this issue May 14, 2019
Updated the example in Readme to clarify the processing of markdown embedded within html.

In response to discussion in rehypejs#8
wooorm added the 🙋 no/question This does not need any changes label Aug 13, 2019
3 participants