Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Text events in code block start with newlines #507

Closed
BenjaminRi opened this issue Dec 3, 2020 · 20 comments
Closed

Text events in code block start with newlines #507

BenjaminRi opened this issue Dec 3, 2020 · 20 comments

Comments

@BenjaminRi
Copy link

BenjaminRi commented Dec 3, 2020

As can be seen in issue #457 , if you parse the following code block

```
test

test
```

you get

Start(CodeBlock(Fenced(Borrowed(""))))
Text(Borrowed("test"))
Text(Borrowed("\n"))
Text(Borrowed("\ntest"))
Text(Borrowed("\n"))
End(CodeBlock(Fenced(Borrowed(""))))

The strange behaviour here is that the lines start with \n, but don't end with \n. I think this is highly unusual and makes the strings harder to work with than necessary. Desired behaviour would be:

Start(CodeBlock(Fenced(Borrowed(""))))
Text(Borrowed("test\n"))
Text(Borrowed("\n"))
Text(Borrowed("test\n"))
End(CodeBlock(Fenced(Borrowed(""))))

This behaviour would make much more sense (it is also more natural because it reflects the actual lines seen in the code block) and provides enhanced compatibility to other libraries like syntect which usually parse code on a line-by-line basis, where a line is defined as a string terminated by \n.

I am currently running into issues with this and the only remedy seems to be string slicing and copying, which costs performance.

@marcusklaas
Copy link
Collaborator

marcusklaas commented Dec 4, 2020

That's interesting. When I parse

```
test

test
```

on master, I just get three events:

0..18: Start(CodeBlock(Fenced(Borrowed(""))))
4..15: Text(Borrowed("test\n\ntest\n"))
0..18: End(CodeBlock(Fenced(Borrowed(""))))

Are you on a windows machine and using \r\n line breaks, by chance?

Edit: Ahh no, that wouldn't make sense since we couldn't borrow test\n in that scenario. Is the whole block indented perhaps, as part of list?

@BenjaminRi
Copy link
Author

BenjaminRi commented Dec 4, 2020

You are spot on. I am on Windows 10 and it has to do with \r\n newlines.

Raw input with \n (hex-bytes):

60 60 60 0a 74 65 73 74 0a 0a 74 65 73 74 0a 60 60 60

Text event from Parser:

Borrowed("test\n\ntest\n")

However,

Raw input with \r\n (hex-bytes):

60 60 60 0d 0a 74 65 73 74 0d 0a 0d 0a 74 65 73 74 0d 0a 60 60 60

Text events from Parser:

Borrowed("test")
Borrowed("\n")
Borrowed("\ntest")
Borrowed("\n")

Also, I've been thinking... Are there any guarantees provided by pulldown-cmark regarding how text is emitted? Could I technically get an event for every single character, in other words, does my code have to handle arbitrarily emitted text when I interface with the Parser? I think that's an important part of the interface. I looked over the documentation but couldn't find anything, it just says it's "a text node".

@BenjaminRi
Copy link
Author

BenjaminRi commented Dec 5, 2020

Okay, so I played around on master (e97974b) and this is highly interesting.

The issue is hard to reproduce because Rust does some string magic. Check this out:

fn main() {
	let markdown_input: &str = "```
test

test
```";

	println!("{:?}",  markdown_input.as_bytes());
}

Output:

[96, 96, 96, 10, 116, 101, 115, 116, 10, 10, 116, 101, 115, 116, 10, 96, 96, 96]

This little program will always print a byte sequence with \n, rather than \r\n, even if the original source file has \r\n line endings. In other words, the Rust compiler automatically strips \r from your string even on Windows when you use such line endings.

However, you can trick Rust into using Windows line endings by manually giving it the bytes and converting those into a string, here is a repro of the problem:

use pulldown_cmark::{html, Event, Options, Parser};

fn main() {
	let binary_input = &[0x60, 0x60, 0x60, 0x0d, 0x0a, 0x74, 0x65, 0x73, 0x74, 0x0d, 0x0a, 0x0d, 0x0a, 0x74, 0x65, 0x73, 0x74, 0x0d, 0x0a, 0x60, 0x60, 0x60];
	let markdown_input = std::str::from_utf8(binary_input).unwrap();
	let parser = Parser::new_ext(markdown_input, Options::empty())
		.map(|event| match event {
			Event::Text(text) => {println!("Text: {:?}", &text); Event::Text(text)},
			_ => event,
		});

	let mut html_output = String::new();
	html::push_html(&mut html_output, parser);
}

Output:

Text: Borrowed("test")
Text: Borrowed("\n")
Text: Borrowed("\ntest")
Text: Borrowed("\n")

The reason why I stumbled into this problem is because I'm using an SQLite database which contains such Windows line endings.

@marcusklaas
Copy link
Collaborator

There's no Rust compiler shenanigans going with the newlines. The language has no concept of line endings in strings. It's all just codepoints.

It's actually pulldown doing this. We are normalizing line endings ourselves.

@BenjaminRi
Copy link
Author

There's no Rust compiler shenanigans going with the newlines. The language has no concept of line endings in strings. It's all just codepoints.

I beg to differ:
eol_conversion

This is the reason why I initially couldn't reproduce the problem in a minimal example, because rustc gobbles up the \r part of the newlines. The repro I posted in the comment above works though, because I give the input in raw bytes.

It's actually pulldown doing this. We are normalizing line endings ourselves.

Yes, pulldown-cmark does the thing where it emits more text events when \r is in the markdown string.

@BenjaminRi
Copy link
Author

Okay, I now understand that if you normalize line endings to \n and strip away the \r, and you always want to use Borrowed and not do any additional copies of the strings, you are forced to emit a new Text event whenever \r comes up because you have to jump over the \r.

@edward-shen
Copy link
Contributor

edward-shen commented Dec 6, 2020

I think @BenjaminRi brings up a good point -- there needs to be clarification on what a Text event is, or clients will need to handle continuous text events.

Considering how we don't emit multiple Text events for non-normalized strings, I suggest we stay consistent and don't attempt to normalize strings. The stdlib already provides tools for correctly handling splitting strings in a platform-independent way, so I'm not convinced that this library should try and do so on the user's behalf. Since I'm advocating for this change as well, I don't mind making a PR for this if you'd like, @marcusklaas.

@edward-shen
Copy link
Contributor

It also looks like we're currently also inconsistent with HTML events on master, where \n are emitted separately and where the library also does normalization:

#[test]
fn html_no_extraneous_events_with_crlf() {
    let markdown_input = "<p>\r\nhello\r\nworld\r\n</p>";
    let parser_events: Vec<_> = Parser::new_ext(markdown_input, Options::empty()).collect();

    // might not be correct
    let expected_events = vec![
        Event::Html(CowStr::Borrowed("<p>\r\n")),
        Event::Html(CowStr::Borrowed("test\r\ntest\r\n")),
        Event::Html(CowStr::Borrowed("</p>")),
    ];

    assert_eq!(parser_events, expected_events);
}
---- html_no_extraneous_events_with_crlf stdout ----
thread 'html_no_extraneous_events_with_crlf' panicked at 'assertion failed: `(left == right)`
  left: `[Html(Borrowed("<p>")), Html(Borrowed("\n")), Html(Borrowed("hello")), Html(Borrowed("\n")), Html(Borrowed("world")), Html(Borrowed("\n")), Html(Borrowed("</p>"))]`,
 right: `[Html(Borrowed("<p>")), Html(Borrowed("test\r\n\r\ntest\r\n")), Html(Borrowed("</p>"))]`', tests/events.rs:27:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@marcusklaas
Copy link
Collaborator

marcusklaas commented Dec 6, 2020

There's no Rust compiler shenanigans going with the newlines. The language has no concept of line endings in strings. It's all just codepoints.

I beg to differ: <images>

Thanks for that, I honestly did not know the compiler normalized line-endings.

Okay, I now understand that if you normalize line endings to \n and strip away the \r, and you always want to use Borrowed and not do any additional copies of the strings, you are forced to emit a new Text event whenever \r comes up because you have to jump over the \r.

Correct!

I think @BenjaminRi brings up a good point -- there needs to be clarification on what a Text event is, or clients will need to handle continuous text events.

If by continuous text events you mean consecutive text events, then yes: pulldown does not guarantee that all text is combined into a single event, even when it can be. We try to do it sometimes as an optimization, but it cannot be relied on. It would be good to clarify this indeed.

Considering how we don't emit multiple Text events for non-normalized strings, I suggest we stay consistent and don't attempt to normalize strings. The stdlib already provides tools for correctly handling splitting strings in a platform-independent way, so I'm not convinced that this library should try and do so on the user's behalf.

These are solid arguments. It would benefit throughput as well, as reducing the number of events we emit is one of the most effective ways to speed up the parser (and renderer!). We need to exercise some caution here though. This would be a breaking change. Clients may implicitly or even unknowingly rely on this behavior. I am not against the change, but we would need consensus that this would indeed benefit the majority of users.

edit: While it would be nice to be able guarantee that continuous blocks of text are indeed emitted as a single event, it seems to not be optimal from a performance standpoint. By the nature of the pull based parsing, we are only emitting events as we walk the tree. Consider the following markdown:

This is text. <notquitehtml This is more text.

The first pass sees the opening bracket and marks it MaybeHtml. When we do the second pass, we can only emit Text("This is text. ") since the next bit could be html. But it is not since it lacks a closing bracket, and the bit <notquitehtml ends up being more text. So even though This is text. <notquitehtml This is more text. could have been emitted as a single text event, we didn't know that a priori. There are ways to work around this, but in my experiments they ended up being slower and more complex than just emitting separate events.

@edward-shen
Copy link
Contributor

It sounds like there are two actionable items then:

  1. Clarify the docs to specify that text events (perhaps other events as well -- HTML for example) aren't currently guaranteed to be continuous blocks of text, and that clients may need to handle multiple events of the same type in a row to get all the text for an object. Since this is a doc change, it should be non-breaking.
  2. Make a breaking change that would remove the normalization code, which we would prefer to hear some feedback to make sure is a worthwhile change. This change does not guarantee a single event for all continuous text.

Are there any issues with the first item? Can we (I, if no one else) make a PR for this?


Alternative suggestion: Leave the parsing as-is, but have some iteration logic that lets the user decide whether or not to merge events. Perhaps part of the Options (or a method on the iterator), we could do some logic to see if the subsequent items are of the same type, and do merging then? While it wouldn't entirely solve the use case, it would greatly simplify it.

The benefits of this approach is that the cost of merging events would only be invoked if you actually set the flag. It could also be effectively implemented as a 3rd optional "ergonomics" pass, where the user could opt into some nicer but expensive things -- stuff like line-by-line text events, full text events, or normalization on those full text events. We wouldn't guarantee any performance metrics on this but could offer an ergonomic API for the clients that would rather have an easier API to use over a performant one.

@BenjaminRi
Copy link
Author

Thank you both for your help. This means, as a user of this library, I will adapt my code to handle any number of continuous Text Events, and then it works fine.

Clarify the docs to specify that text events (perhaps other events as well -- HTML for example) aren't currently guaranteed to be continuous blocks of text, and that clients may need to handle multiple events of the same type in a row to get all the text for an object. Since this is a doc change, it should be non-breaking.

From my point of view, that would be really helpful, yes. Knowing this detail means that library users know what events to expect, so they can write robust code interfacing with pulldown-cmark. The good thing is, people who search for this on google will find this issue on GitHub, so the situation is already improved.

Make a breaking change that would remove the normalization code, which we would prefer to hear some feedback to make sure is a worthwhile change. This change does not guarantee a single event for all continuous text.

It's probably not necessary to introduce a breaking change. As long as the users know that multiple events can be emitted, the parser interface can be used fine. To normalize lines or not to, that's a difficult choice.

@Martin1887
Copy link
Collaborator

What do you think about this, @raphlinus? This does not seem a bug to me now and the breaking change does not seem necessary. Maybe a docs update explaining the consecutive text events is the best option.

@BenjaminRi
Copy link
Author

BenjaminRi commented May 21, 2023

On my side, I ended up writing a text merge stream that merges all these consecutive text events into one large text event, because in almost all use cases, this is what you actually want and need for further processing.

https://github.com/BenjaminRi/Redwood-Wiki/blob/31ae4c43ae34a8bbaf66f9c99e8e5476e4e17502/src/markdown_utils.rs#L9

I just apply the TextMergeStream over the parser and it automatically gives me what I want without changing any of the other semantics of the parser:

TextMergeStream::new(Parser::new_with_broken_link_callback(
	&article.text,
	options,
	Some(&mut broken_link_callback),
))

Other people facing this issue can copy this TextMergeStream, perhaps it would even make sense to add it as a utility to pulldown-cmark directly.

From my point of view, the issue is resolved - with the strong recommendation to describe this behavior in the documentation because it is unexpected and may confuse library users.

@Martin1887
Copy link
Collaborator

Martin1887 commented May 21, 2023

The TextMergeStream seems a very useful utility that could be included in the crate, and it also serves as an opportunity to emphasize in docs that usually text events are fired separately due to performance and algorithmic reasons.

If Raph approves it, I will include it in the next release.

Thanks!

@raphlinus
Copy link
Collaborator

The TextMergeStream sounds like a good idea to me, and the opt-in nature means it won't break existing uses or cause performance issues with the extra string allocations, unless the new behavior is specifically requested.

@BenjaminRi
Copy link
Author

I am happy to hear that you find my code useful. A note on the license: The original source code is GPL 3.0, but you may use, modify and/or redistribute the code in markdown_utils.rs under any license you want.

@Martin1887
Copy link
Collaborator

Great, I hope to include it this week.

Thanks.

@Martin1887
Copy link
Collaborator

The consecutive text events issue has been fixed in #686. The normalization issue is open and it could be handled in the future but it is not guaranteed.

@Martin1887
Copy link
Collaborator

Check if fixed by #776.

@Martin1887
Copy link
Collaborator

Already fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants