Text events in code block start with newlines #507

BenjaminRi · 2020-12-03T02:50:19Z

As can be seen in issue #457 , if you parse the following code block

```
test

test
```

you get

Start(CodeBlock(Fenced(Borrowed(""))))
Text(Borrowed("test"))
Text(Borrowed("\n"))
Text(Borrowed("\ntest"))
Text(Borrowed("\n"))
End(CodeBlock(Fenced(Borrowed(""))))

The strange behaviour here is that the lines start with \n, but don't end with \n. I think this is highly unusual and makes the strings harder to work with than necessary. Desired behaviour would be:

Start(CodeBlock(Fenced(Borrowed(""))))
Text(Borrowed("test\n"))
Text(Borrowed("\n"))
Text(Borrowed("test\n"))
End(CodeBlock(Fenced(Borrowed(""))))

This behaviour would make much more sense (it is also more natural because it reflects the actual lines seen in the code block) and provides enhanced compatibility to other libraries like syntect which usually parse code on a line-by-line basis, where a line is defined as a string terminated by \n.

I am currently running into issues with this and the only remedy seems to be string slicing and copying, which costs performance.

The text was updated successfully, but these errors were encountered:

marcusklaas · 2020-12-04T13:38:41Z

That's interesting. When I parse

```
test

test
```

on master, I just get three events:

0..18: Start(CodeBlock(Fenced(Borrowed(""))))
4..15: Text(Borrowed("test\n\ntest\n"))
0..18: End(CodeBlock(Fenced(Borrowed(""))))

Are you on a windows machine and using \r\n line breaks, by chance?

Edit: Ahh no, that wouldn't make sense since we couldn't borrow test\n in that scenario. Is the whole block indented perhaps, as part of list?

BenjaminRi · 2020-12-04T21:14:42Z

You are spot on. I am on Windows 10 and it has to do with \r\n newlines.

Raw input with \n (hex-bytes):

60 60 60 0a 74 65 73 74 0a 0a 74 65 73 74 0a 60 60 60

Text event from Parser:

Borrowed("test\n\ntest\n")

However,

Raw input with \r\n (hex-bytes):

60 60 60 0d 0a 74 65 73 74 0d 0a 0d 0a 74 65 73 74 0d 0a 60 60 60

Text events from Parser:

Borrowed("test")
Borrowed("\n")
Borrowed("\ntest")
Borrowed("\n")

Also, I've been thinking... Are there any guarantees provided by pulldown-cmark regarding how text is emitted? Could I technically get an event for every single character, in other words, does my code have to handle arbitrarily emitted text when I interface with the Parser? I think that's an important part of the interface. I looked over the documentation but couldn't find anything, it just says it's "a text node".

BenjaminRi · 2020-12-05T00:14:02Z

Okay, so I played around on master (e97974b) and this is highly interesting.

The issue is hard to reproduce because Rust does some string magic. Check this out:

fn main() {
	let markdown_input: &str = "```
test

test
```";

	println!("{:?}",  markdown_input.as_bytes());
}

Output:

[96, 96, 96, 10, 116, 101, 115, 116, 10, 10, 116, 101, 115, 116, 10, 96, 96, 96]

This little program will always print a byte sequence with \n, rather than \r\n, even if the original source file has \r\n line endings. In other words, the Rust compiler automatically strips \r from your string even on Windows when you use such line endings.

However, you can trick Rust into using Windows line endings by manually giving it the bytes and converting those into a string, here is a repro of the problem:

use pulldown_cmark::{html, Event, Options, Parser};

fn main() {
	let binary_input = &[0x60, 0x60, 0x60, 0x0d, 0x0a, 0x74, 0x65, 0x73, 0x74, 0x0d, 0x0a, 0x0d, 0x0a, 0x74, 0x65, 0x73, 0x74, 0x0d, 0x0a, 0x60, 0x60, 0x60];
	let markdown_input = std::str::from_utf8(binary_input).unwrap();
	let parser = Parser::new_ext(markdown_input, Options::empty())
		.map(|event| match event {
			Event::Text(text) => {println!("Text: {:?}", &text); Event::Text(text)},
			_ => event,
		});

	let mut html_output = String::new();
	html::push_html(&mut html_output, parser);
}

Output:

Text: Borrowed("test")
Text: Borrowed("\n")
Text: Borrowed("\ntest")
Text: Borrowed("\n")

The reason why I stumbled into this problem is because I'm using an SQLite database which contains such Windows line endings.

marcusklaas · 2020-12-05T16:13:23Z

There's no Rust compiler shenanigans going with the newlines. The language has no concept of line endings in strings. It's all just codepoints.

It's actually pulldown doing this. We are normalizing line endings ourselves.

BenjaminRi · 2020-12-05T19:47:53Z

There's no Rust compiler shenanigans going with the newlines. The language has no concept of line endings in strings. It's all just codepoints.

I beg to differ:

This is the reason why I initially couldn't reproduce the problem in a minimal example, because rustc gobbles up the \r part of the newlines. The repro I posted in the comment above works though, because I give the input in raw bytes.

It's actually pulldown doing this. We are normalizing line endings ourselves.

Yes, pulldown-cmark does the thing where it emits more text events when \r is in the markdown string.

BenjaminRi · 2020-12-06T00:53:58Z

Okay, I now understand that if you normalize line endings to \n and strip away the \r, and you always want to use Borrowed and not do any additional copies of the strings, you are forced to emit a new Text event whenever \r comes up because you have to jump over the \r.

edward-shen · 2020-12-06T01:09:00Z

I think @BenjaminRi brings up a good point -- there needs to be clarification on what a Text event is, or clients will need to handle continuous text events.

Considering how we don't emit multiple Text events for non-normalized strings, I suggest we stay consistent and don't attempt to normalize strings. The stdlib already provides tools for correctly handling splitting strings in a platform-independent way, so I'm not convinced that this library should try and do so on the user's behalf. Since I'm advocating for this change as well, I don't mind making a PR for this if you'd like, @marcusklaas.

edward-shen · 2020-12-06T01:49:10Z

It also looks like we're currently also inconsistent with HTML events on master, where \n are emitted separately and where the library also does normalization:

#[test]
fn html_no_extraneous_events_with_crlf() {
    let markdown_input = "<p>\r\nhello\r\nworld\r\n</p>";
    let parser_events: Vec<_> = Parser::new_ext(markdown_input, Options::empty()).collect();

    // might not be correct
    let expected_events = vec![
        Event::Html(CowStr::Borrowed("<p>\r\n")),
        Event::Html(CowStr::Borrowed("test\r\ntest\r\n")),
        Event::Html(CowStr::Borrowed("</p>")),
    ];

    assert_eq!(parser_events, expected_events);
}

---- html_no_extraneous_events_with_crlf stdout ----
thread 'html_no_extraneous_events_with_crlf' panicked at 'assertion failed: `(left == right)`
  left: `[Html(Borrowed("<p>")), Html(Borrowed("\n")), Html(Borrowed("hello")), Html(Borrowed("\n")), Html(Borrowed("world")), Html(Borrowed("\n")), Html(Borrowed("</p>"))]`,
 right: `[Html(Borrowed("<p>")), Html(Borrowed("test\r\n\r\ntest\r\n")), Html(Borrowed("</p>"))]`', tests/events.rs:27:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

marcusklaas · 2020-12-06T13:19:12Z

There's no Rust compiler shenanigans going with the newlines. The language has no concept of line endings in strings. It's all just codepoints.

I beg to differ: <images>

Thanks for that, I honestly did not know the compiler normalized line-endings.

Okay, I now understand that if you normalize line endings to \n and strip away the \r, and you always want to use Borrowed and not do any additional copies of the strings, you are forced to emit a new Text event whenever \r comes up because you have to jump over the \r.

Correct!

I think @BenjaminRi brings up a good point -- there needs to be clarification on what a Text event is, or clients will need to handle continuous text events.

If by continuous text events you mean consecutive text events, then yes: pulldown does not guarantee that all text is combined into a single event, even when it can be. We try to do it sometimes as an optimization, but it cannot be relied on. It would be good to clarify this indeed.

Considering how we don't emit multiple Text events for non-normalized strings, I suggest we stay consistent and don't attempt to normalize strings. The stdlib already provides tools for correctly handling splitting strings in a platform-independent way, so I'm not convinced that this library should try and do so on the user's behalf.

These are solid arguments. It would benefit throughput as well, as reducing the number of events we emit is one of the most effective ways to speed up the parser (and renderer!). We need to exercise some caution here though. This would be a breaking change. Clients may implicitly or even unknowingly rely on this behavior. I am not against the change, but we would need consensus that this would indeed benefit the majority of users.

edit: While it would be nice to be able guarantee that continuous blocks of text are indeed emitted as a single event, it seems to not be optimal from a performance standpoint. By the nature of the pull based parsing, we are only emitting events as we walk the tree. Consider the following markdown:

This is text. <notquitehtml This is more text.

The first pass sees the opening bracket and marks it MaybeHtml. When we do the second pass, we can only emit Text("This is text. ") since the next bit could be html. But it is not since it lacks a closing bracket, and the bit <notquitehtml ends up being more text. So even though This is text. <notquitehtml This is more text. could have been emitted as a single text event, we didn't know that a priori. There are ways to work around this, but in my experiments they ended up being slower and more complex than just emitting separate events.

edward-shen · 2020-12-06T18:05:12Z

It sounds like there are two actionable items then:

Clarify the docs to specify that text events (perhaps other events as well -- HTML for example) aren't currently guaranteed to be continuous blocks of text, and that clients may need to handle multiple events of the same type in a row to get all the text for an object. Since this is a doc change, it should be non-breaking.
Make a breaking change that would remove the normalization code, which we would prefer to hear some feedback to make sure is a worthwhile change. This change does not guarantee a single event for all continuous text.

Are there any issues with the first item? Can we (I, if no one else) make a PR for this?

Alternative suggestion: Leave the parsing as-is, but have some iteration logic that lets the user decide whether or not to merge events. Perhaps part of the Options (or a method on the iterator), we could do some logic to see if the subsequent items are of the same type, and do merging then? While it wouldn't entirely solve the use case, it would greatly simplify it.

The benefits of this approach is that the cost of merging events would only be invoked if you actually set the flag. It could also be effectively implemented as a 3rd optional "ergonomics" pass, where the user could opt into some nicer but expensive things -- stuff like line-by-line text events, full text events, or normalization on those full text events. We wouldn't guarantee any performance metrics on this but could offer an ergonomic API for the clients that would rather have an easier API to use over a performant one.

BenjaminRi · 2020-12-06T20:11:14Z

Thank you both for your help. This means, as a user of this library, I will adapt my code to handle any number of continuous Text Events, and then it works fine.

Clarify the docs to specify that text events (perhaps other events as well -- HTML for example) aren't currently guaranteed to be continuous blocks of text, and that clients may need to handle multiple events of the same type in a row to get all the text for an object. Since this is a doc change, it should be non-breaking.

From my point of view, that would be really helpful, yes. Knowing this detail means that library users know what events to expect, so they can write robust code interfacing with pulldown-cmark. The good thing is, people who search for this on google will find this issue on GitHub, so the situation is already improved.

Make a breaking change that would remove the normalization code, which we would prefer to hear some feedback to make sure is a worthwhile change. This change does not guarantee a single event for all continuous text.

It's probably not necessary to introduce a breaking change. As long as the users know that multiple events can be emitted, the parser interface can be used fine. To normalize lines or not to, that's a difficult choice.

Martin1887 · 2023-05-21T11:00:25Z

What do you think about this, @raphlinus? This does not seem a bug to me now and the breaking change does not seem necessary. Maybe a docs update explaining the consecutive text events is the best option.

BenjaminRi · 2023-05-21T11:30:06Z

On my side, I ended up writing a text merge stream that merges all these consecutive text events into one large text event, because in almost all use cases, this is what you actually want and need for further processing.

https://github.com/BenjaminRi/Redwood-Wiki/blob/31ae4c43ae34a8bbaf66f9c99e8e5476e4e17502/src/markdown_utils.rs#L9

I just apply the TextMergeStream over the parser and it automatically gives me what I want without changing any of the other semantics of the parser:

TextMergeStream::new(Parser::new_with_broken_link_callback(
	&article.text,
	options,
	Some(&mut broken_link_callback),
))

Other people facing this issue can copy this TextMergeStream, perhaps it would even make sense to add it as a utility to pulldown-cmark directly.

From my point of view, the issue is resolved - with the strong recommendation to describe this behavior in the documentation because it is unexpected and may confuse library users.

Martin1887 · 2023-05-21T14:43:11Z

The TextMergeStream seems a very useful utility that could be included in the crate, and it also serves as an opportunity to emphasize in docs that usually text events are fired separately due to performance and algorithmic reasons.

If Raph approves it, I will include it in the next release.

Thanks!

raphlinus · 2023-06-17T14:08:42Z

The TextMergeStream sounds like a good idea to me, and the opt-in nature means it won't break existing uses or cause performance issues with the extra string allocations, unless the new behavior is specifically requested.

BenjaminRi · 2023-06-18T15:58:45Z

I am happy to hear that you find my code useful. A note on the license: The original source code is GPL 3.0, but you may use, modify and/or redistribute the code in markdown_utils.rs under any license you want.

Martin1887 · 2023-06-19T15:48:03Z

Great, I hope to include it this week.

Thanks.

Martin1887 · 2023-07-01T16:45:15Z

The consecutive text events issue has been fixed in #686. The normalization issue is open and it could be handled in the future but it is not guaranteed.

Martin1887 · 2023-11-04T11:05:05Z

Check if fixed by #776.

Martin1887 · 2024-02-10T18:23:54Z

Already fixed.

max-sixty mentioned this issue Apr 15, 2022

Allow examples to be extracted on Windows PRQL/prql#356

Closed

agrandville mentioned this issue Jun 23, 2022

fix windows EOL in markdown files getzola/zola#1911

Merged

CptPotato mentioned this issue Oct 23, 2022

CRLF line endings cause issues with code blocks EdJoPaTo/markdown-to-standalone-html#59

Open

markbahnman mentioned this issue Dec 11, 2022

Issues with passing CRLF to pulldown_cmark grego/blades#21

Closed

Martin1887 added the bug label May 7, 2023

EpicEricEE mentioned this issue May 14, 2023

Tests failing on Windows typst/typst#1206

Closed

1 task

Martin1887 added enhancement and removed bug labels Jul 1, 2023

Martin1887 mentioned this issue Jul 1, 2023

feat: TextMergeStream utility to merge consecutive text events #686

Merged

Martin1887 closed this as completed Feb 10, 2024

BenjaminRi mentioned this issue Apr 28, 2024

HTML blocks are split up in events #803

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Text events in code block start with newlines #507

Text events in code block start with newlines #507

BenjaminRi commented Dec 3, 2020 •

edited

Loading

marcusklaas commented Dec 4, 2020 •

edited

Loading

BenjaminRi commented Dec 4, 2020 •

edited

Loading

BenjaminRi commented Dec 5, 2020 •

edited

Loading

marcusklaas commented Dec 5, 2020

BenjaminRi commented Dec 5, 2020

BenjaminRi commented Dec 6, 2020

edward-shen commented Dec 6, 2020 •

edited

Loading

edward-shen commented Dec 6, 2020

marcusklaas commented Dec 6, 2020 •

edited

Loading

edward-shen commented Dec 6, 2020

BenjaminRi commented Dec 6, 2020

Martin1887 commented May 21, 2023

BenjaminRi commented May 21, 2023 •

edited

Loading

Martin1887 commented May 21, 2023 •

edited

Loading

raphlinus commented Jun 17, 2023

BenjaminRi commented Jun 18, 2023

Martin1887 commented Jun 19, 2023

Martin1887 commented Jul 1, 2023

Martin1887 commented Nov 4, 2023

Martin1887 commented Feb 10, 2024

Text events in code block start with newlines #507

Text events in code block start with newlines #507

Comments

BenjaminRi commented Dec 3, 2020 • edited Loading

marcusklaas commented Dec 4, 2020 • edited Loading

BenjaminRi commented Dec 4, 2020 • edited Loading

BenjaminRi commented Dec 5, 2020 • edited Loading

marcusklaas commented Dec 5, 2020

BenjaminRi commented Dec 5, 2020

BenjaminRi commented Dec 6, 2020

edward-shen commented Dec 6, 2020 • edited Loading

edward-shen commented Dec 6, 2020

marcusklaas commented Dec 6, 2020 • edited Loading

edward-shen commented Dec 6, 2020

BenjaminRi commented Dec 6, 2020

Martin1887 commented May 21, 2023

BenjaminRi commented May 21, 2023 • edited Loading

Martin1887 commented May 21, 2023 • edited Loading

raphlinus commented Jun 17, 2023

BenjaminRi commented Jun 18, 2023

Martin1887 commented Jun 19, 2023

Martin1887 commented Jul 1, 2023

Martin1887 commented Nov 4, 2023

Martin1887 commented Feb 10, 2024

BenjaminRi commented Dec 3, 2020 •

edited

Loading

marcusklaas commented Dec 4, 2020 •

edited

Loading

BenjaminRi commented Dec 4, 2020 •

edited

Loading

BenjaminRi commented Dec 5, 2020 •

edited

Loading

edward-shen commented Dec 6, 2020 •

edited

Loading

marcusklaas commented Dec 6, 2020 •

edited

Loading

BenjaminRi commented May 21, 2023 •

edited

Loading

Martin1887 commented May 21, 2023 •

edited

Loading