Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rustdoc: check parsing diffs between pulldown-cmark 0.9.6 and 0.10 #121659

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

notriddle
Copy link
Contributor

This commit is not meant to be merged as-is. It's meant to run in Crater, so that we can estimate the impact of bumping to the new version of the markdown parser.

r? rustdoc

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. labels Feb 27, 2024
@rust-log-analyzer

This comment has been minimized.

@notriddle notriddle force-pushed the notriddle/bump-pulldown-cmark branch 2 times, most recently from c861e61 to 8ddc204 Compare February 27, 2024 04:55
@rust-log-analyzer

This comment has been minimized.

This commit is not meant to be merged as-is. It's meant to run
in Crater, so that we can estimate the impact of bumping to the
new version of the markdown parser.
LL | | /// This is a multi-paragraph footnote.
| |___________________________________________^
|
= help: new parser sees Start(Paragraph), old sees End(FootnoteDefinition(Borrowed("foot")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see why it's needed, but it also displays internal stuff. Not sure how helpful it would be for users...

@GuillaumeGomez
Copy link
Member

I think it looks ready for a crater run.

@notriddle
Copy link
Contributor Author

@bors try

@bors
Copy link
Contributor

bors commented Feb 28, 2024

⌛ Trying commit 5b1ebc2 with merge 223112b...

bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 28, 2024
…ark, r=<try>

rustdoc: check parsing diffs between pulldown-cmark 0.9.6 and 0.10

This commit is not meant to be merged as-is. It's meant to run in Crater, so that we can estimate the impact of bumping to the new version of the markdown parser.

r? rustdoc
@notriddle
Copy link
Contributor Author

@craterbot run name=pr-121659-bump-pulldown-cmark mode=rustdoc

@craterbot
Copy link
Collaborator

🚨 Error: missing start toolchain

🆘 If you have any trouble with Crater please ping @rust-lang/infra!
ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

@notriddle
Copy link
Contributor Author

@craterbot run name=pr-121659-bump-pulldown-cmark mode=rustdoc start=master#ef324565d071c6d7e2477a195648549e33d6a465 end=try#223112bf1ca31911a6475910c77b36bfa127d5f8

@craterbot
Copy link
Collaborator

👌 Experiment pr-121659-bump-pulldown-cmark created and queued.
🔍 You can check out the queue and this experiment's details.

ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

@craterbot craterbot added S-waiting-on-crater Status: Waiting on a crater run to be completed. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 28, 2024
@bors
Copy link
Contributor

bors commented Feb 28, 2024

☀️ Try build successful - checks-actions
Build commit: 223112b (223112bf1ca31911a6475910c77b36bfa127d5f8)

@craterbot
Copy link
Collaborator

🚧 Experiment pr-121659-bump-pulldown-cmark is now running

ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

@craterbot
Copy link
Collaborator

🎉 Experiment pr-121659-bump-pulldown-cmark is completed!
📊 267 regressed and 3 fixed (422032 total)
📰 Open the full report.

⚠️ If you notice any spurious failure please add them to the blacklist!
ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

@craterbot craterbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-crater Status: Waiting on a crater run to be completed. labels Mar 12, 2024
@notriddle
Copy link
Contributor Author

notriddle commented Mar 12, 2024

Roughly categorizing what I'm seeing here. The specific crates are in this Gist:

(P)roblems: docs that are broken by this change

P1: unintended strikethrough

pulldown-cmark/pulldown-cmark#648 allows abso~fricking~lutely to render as absofrickinglutely. It makes pulldown-cmark agree with cmark-gfm.

This change causes 39 docs to have strikethroughs where it looks like the author intended literal tildes.

P2: Unintended block quote

pulldown-cmark/pulldown-cmark#675 makes block quotes, introduced by >, consistent between paragraph interruption and starting after a blank line. It's pretty unambiguous that this is a bug fix.

Causes 24 crates to appear wrong.

P3: Footnote reference immediately after link or another footnote

pulldown-cmark/pulldown-cmark#773

The way that pulldown-cmark ignores the apparent footnote reference in [something][^foot] is the same way GFM does it, and it's the easiest way to parse this given everything else, but it's weird.

Causes 16 crates to appear wrong.

P4: block nested inside footnote definition without indenting

pulldown-cmark/pulldown-cmark#654

This is, basically, the expected outcome of GFM-compatible footnote parsing. The syntax now parses the same way GitHub does, which is not the way rustdoc used to.

Causes 6 crates to appear wrong.

P5: emphasis does not match what author expects

In a text like __imp_(_), commonmark rules say that the result should be _imp(_), but I'm pretty sure the author intended those to be literal underscores. In the old parser, they were.

Causes 2 crates to appear wrong.

P6: Single | does not continue table

This isn't what GitHub does, and the new parser aligns closer with GitHub, but some authors have used this as a way to put dividers in their table:

| header | two |
|--------|-----|
| item   | n   |
| item   | e   |
|
| section | l  |
| two     | o  |

Causes 6 crates to appear wrong.

P7: Table is required to have a valid header line

This is not a table according to GitHub, but the old parser would render it as one sometimes.

| first | second |
| third | fourth |
| fifth | sixth  |

The header line also needs to have at least one hyphen in each cell, so this isn't allowed either.

| first  ||
|--------||
| second ||

Causes 4 crates to appear wrong.

P8: unintended link definition

This is a link definition in the new parser, but not the old one.

[Self::method()]:
frobnicates

Causes 1 crate to appear wrong.

(F)ixes: docs that are actually render better with the new parser than the old one

F1: [^x] in the old parser rendered a broken footnote link when no footnote was intended or defined

pulldown-cmark/pulldown-cmark#654

Usually, they're trying to write regex inverted character classes, but they come out as footnotes.

This fix also makes things consistent with GitHub, and the fact that it's not breaking anyone's docs makes me happy.

Fixes 39 crates

F2: writing 1. alone on a line in the middle of a paragraph shouldn't start a list

pulldown-cmark/pulldown-cmark#681

It used to do the wrong thing with this:

Test paragraph with a count of
1.

Fixes 6 crates

F3: tables interrupt paragraphs

pulldown-cmark/pulldown-cmark#653

This makes things consistent with GitHub, and fixes NN docs. Not likely to be a problem, because you kinda have to try really hard to make something look like a table.

Fixes 37 crates

F4: ASCII art misinterpreted as list

Since asterisks on their own no longer count as lists when they interrupt paragraphs, some things that were never intended as list markers stop being seen as such.

Fixes 1 crate

F5: footnote has nested, indented children or lazy continuation

This is the flip side of P4, where the author actually wanted it to be parsed the way github parses it.

Fixes 2 crates

F6: Link definition is seen in new parser where old parser did not

The old parser didn't properly recognize link reference definitions when they were right after code blocks.

Fixes 1 crates

F7: block structure and inline structure are separated better in new parser

Consider this example:

> [my link](https://example.com "Example web site
> run by the IETF")

The > on the second line shouldn't go in there, and now it doesn't.'

Fixes 2 crates

F8: a table row followed by two spaces isn't a hard break

A very specific bug that only seems to show up when a paragraph is nested inside a list. The new parser fixes it, it only affects 1 crate, and the change makes the generated docs look better.

Fixes 2 crates

F9: footnote with <autolink> was misparsed as link def

The old parser thought this was a link definition:

[^0]: <https://example.com>

The new parser sees it as a footnote, which is also how github does it.

Fixes 1 crate

F10: indented or spaced link definition still counts

This document:

[first]: https://example.com
    [second]: https://example.com

[first]

[second]

The old parser ignored the definition of [second], so it only saw one link, but the new one doesn't, so it sees two links. This also agrees with the reference implementation.

The new parser also trims spaces, so [x ] and [x] are the same thing.

Fixes 2 crates

F11: intra-word strikethrough

This is the flip side of P1, where the author apparently intended to write intra-word strikeout.

Fixes 1 crate

(Q)uestionable cases, where the docs were broken before and are still broken now

Q1: crate author wrote ASCII art, or some language that isn't CommonMark, in their doc comments

This happened on N crates, and produces terrible results in both parsers. I'm not bothering to categorize them by which change to pulldown-cmark causes them to render differently.

There are 30 crates that do this

Q2: links with mismatched parens go from being broken links to not being links at all

pulldown-cmark/pulldown-cmark#738

There are 4 crates that do this

Q3: incorrect trim in block doc comment

Block comments are supposed to be written like this:

/**
 * first
 * second
 * third
 */
^^ these two characters are trimmed

If you get this wrong, the text that you intended to be a paragraph gets turned into a long unordered list instead, because trim is computed by checking every line for a common prefix.

It shows up here because incorrectly trimmed block doc comments often have asterisks with no text after them, and pulldown-cmark/pulldown-cmark#681 changes it from a list to plain text.

There are 8 crates that do this

Q4: numbered list that starts with zero and has two ones

0. do stuff
1.
2. do more stuff
3. do more stuff

The 0 isn't a valid list item in commonmark, so that first line is a paragraph. The divergence between the new and old parser is on the second line, but I'm not sure what the intended result by the crate author was? I think it's supposed to be a list starting at zero, but it doesn't work in the old version or the new one.

There is 1 crate that does this

Q5: incorrect footnote definition markup

The parser used to return a dangling footnote reference, but the new one doesn't. Both saw the intended footnote definition as invalid.

There are 18 crates that do this

Q6: block quotes written at start of line

Try writing block quotes like this:

>
First paragraph
>
Second paragraph

I think used to work in GitHub-flavored markdown (it works in Pandoc-flavored markdown now), and seems to be what the author intended in the 1 crate that hits this case, but it doesn't work on GitHub now, and it never worked right in rustdoc (though the exact way it fails changed, which is what brought it to my attention).

There is 1 crate that does this

(S)purious cases where the lint fires and it shouldn't

S1: loose task list

The lint I wrote fires false positives for task lists. The event stream from the parser is different, thanks to pulldown-cmark/pulldown-cmark#558, but they're detected in the same cases.

There are 7 crates that hit this

S2: minor change in HTML blocks

The indentation is handled differently, but will usually generate the same rendered result.

There are 3 crates that hit this

S3: two spaces on the last line of a paragraph

Strictly speaking, this is a different parse result. It will produce slight differences in the spacing.

There is 1 crate that hits this

S4: spans are different when emphasis wraps code

This causes the lint to claim there's a problem when there isn't.

There is 1 crate that hits this

@GuillaumeGomez
Copy link
Member

Thanks for the very detailed report! What do you plan to do from this?

@notriddle
Copy link
Contributor Author

notriddle commented Mar 13, 2024

Looking at some of the Ps and Qs with more than 10 matches (I'm just going to be happy at the Fixes with more than 10 matches).

P1: I'm making a PR against pulldown-cmark to limit single-tilde strikethroughs to the same rules as underscores. GitHub might render first~second third~fourth as firstsecond thirdfourth, but markdown-it, the parser used in Discourse, does not, and neither does commonmark-hs, the parser used in Pandoc "GFM" mode. pulldown-cmark/pulldown-cmark#864

P2: This should be a lint. The alternative would be willfully violating the spec, which I'd rather not do. The heuristic would be: if the > is not followed by a space, then suggest that the user either add a space or escape the > with a backslash.

P3: Not sure if this should be a lint or if should just be fixed.

Q1: We can't help you if you aren't even trying to write useful Markdown.

Q5: We should detect and warn about [^WHATEVER]-like footnotes with no corresponding definition that aren't backslash escaped. The trigger is the same as P3, but the suggestion would differ based on sniffing around for valid footnote defs or link refs right next to the footnote.

The solution for the spurious problems is to write lints for these specific problems instead of linting on every spot where the parsers disagree. It's also valuable to avoid setting off The Everything's Okay Alarm when their docs are fixed.

@Dylan-DPC Dylan-DPC added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Apr 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants