Text events in code block start with newlines #507
Comments
That's interesting. When I parse
on master, I just get three events:
Are you on a Windows machine and using Edit: Ahh no, that wouldn't make sense since we couldn't borrow |
You are spot on. I am on Windows 10 and it has to do with Raw input with
Text event from
However, Raw input with
Text events from
Also, I've been thinking... Are there any guarantees provided by |
Okay, so I played around on master (e97974b) and this is highly interesting. The issue is hard to reproduce because Rust does some string magic. Check this out:

fn main() {
    let markdown_input: &str = "```
test
test
```";
    println!("{:?}", markdown_input.as_bytes());
}

Output:
This little program will always print a byte sequence with However, you can trick Rust into using Windows line endings by manually giving it the bytes and converting those into a string. Here is a repro of the problem:

use pulldown_cmark::{html, Event, Options, Parser};

fn main() {
    // The code block as raw bytes, so that the CRLF (0x0d 0x0a) line endings survive.
    let binary_input = &[
        0x60, 0x60, 0x60, 0x0d, 0x0a, 0x74, 0x65, 0x73, 0x74, 0x0d, 0x0a, 0x0d, 0x0a, 0x74, 0x65,
        0x73, 0x74, 0x0d, 0x0a, 0x60, 0x60, 0x60,
    ];
    let markdown_input = std::str::from_utf8(binary_input).unwrap();
    // Print every Text event while passing all events through unchanged.
    let parser = Parser::new_ext(markdown_input, Options::empty()).map(|event| match event {
        Event::Text(text) => {
            println!("Text: {:?}", &text);
            Event::Text(text)
        }
        _ => event,
    });
    let mut html_output = String::new();
    html::push_html(&mut html_output, parser);
}

Output:
The reason why I stumbled into this problem is because I'm using an SQLite database which contains such Windows line endings. |
There's no Rust compiler shenanigans going on with the newlines. The language has no concept of line endings in strings. It's all just codepoints. It's actually pulldown doing this. We are normalizing line endings ourselves. |
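To illustrate why normalizing line endings while still borrowing from the input tends to produce several small text pieces, here is a minimal sketch. It is an assumption-laden illustration, not pulldown's actual implementation: the function name `borrowed_chunks` is hypothetical, and it simply skips every `\r` that is directly followed by `\n`. Because the skipped byte cannot be part of a contiguous borrowed slice, the text has to be cut at each CRLF.

```rust
/// Hypothetical helper: split `input` into borrowed slices that, when
/// concatenated, equal `input` with every "\r\n" replaced by "\n".
/// No bytes are copied; every chunk points into the original string.
fn borrowed_chunks(input: &str) -> Vec<&str> {
    let mut chunks = Vec::new();
    let mut start = 0;
    for (idx, _) in input.match_indices("\r\n") {
        // Everything before the '\r' can be borrowed as one slice.
        if idx > start {
            chunks.push(&input[start..idx]);
        }
        // Skip the '\r' itself; the next chunk starts at the '\n'.
        start = idx + 1;
    }
    if start < input.len() {
        chunks.push(&input[start..]);
    }
    chunks
}
```

For the input `"test\r\n\r\ntest\r\n"` this sketch yields the chunks `"test"`, `"\n"`, `"\ntest"` and `"\n"`, so some chunks begin with a newline, which resembles the pattern this issue is about.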
This is the reason why I initially couldn't reproduce the problem in a minimal example, because
Yes, |
Okay, I now understand that if you normalize line endings to |
I think @BenjaminRi brings up a good point -- there needs to be clarification on what a Text event is, or clients will need to handle continuous text events. Considering how we don't emit multiple Text events for non-normalized strings, I suggest we stay consistent and don't attempt to normalize strings. The stdlib already provides tools for correctly handling splitting strings in a platform-independent way, so I'm not convinced that this library should try and do so on the user's behalf. Since I'm advocating for this change as well, I don't mind making a PR for this if you'd like, @marcusklaas. |
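As a small illustration of the stdlib tooling mentioned above: `str::lines` splits on both `\n` and `\r\n`, so a client that receives non-normalized text can still iterate over lines in a platform-independent way. The wrapper name `lines_without_endings` is hypothetical and only there to make the sketch testable.

```rust
/// Hypothetical wrapper around `str::lines`, which treats both "\n" and
/// "\r\n" as line terminators and yields the lines without them.
fn lines_without_endings(text: &str) -> Vec<&str> {
    text.lines().collect()
}
```

With this, `"test\r\n\r\ntest\r\n"` and `"test\n\ntest\n"` both produce the same three lines (`"test"`, an empty line, `"test"`), all borrowed from the input.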
It also looks like we're currently inconsistent with HTML events on

use pulldown_cmark::{CowStr, Event, Options, Parser};

#[test]
fn html_no_extraneous_events_with_crlf() {
    let markdown_input = "<p>\r\nhello\r\nworld\r\n</p>";
    let parser_events: Vec<_> = Parser::new_ext(markdown_input, Options::empty()).collect();
    // might not be correct
    let expected_events = vec![
        Event::Html(CowStr::Borrowed("<p>\r\n")),
        Event::Html(CowStr::Borrowed("hello\r\nworld\r\n")),
        Event::Html(CowStr::Borrowed("</p>")),
    ];
    assert_eq!(parser_events, expected_events);
}
|
Thanks for that, I honestly did not know the compiler normalized line-endings.
Correct!
If by continuous text events you mean consecutive text events, then yes: pulldown does not guarantee that all text is combined into a single event, even when it can be. We try to do it sometimes as an optimization, but it cannot be relied on. It would be good to clarify this indeed.
These are solid arguments. It would benefit throughput as well, as reducing the number of events we emit is one of the most effective ways to speed up the parser (and renderer!).

We need to exercise some caution here though. This would be a breaking change. Clients may implicitly or even unknowingly rely on this behavior. I am not against the change, but we would need consensus that this would indeed benefit the majority of users.

edit: While it would be nice to be able to guarantee that continuous blocks of text are indeed emitted as a single event, it does not seem optimal from a performance standpoint. By the nature of pull-based parsing, we are only emitting events as we walk the tree. Consider the following markdown:

This is text. <notquitehtml This is more text.

The first pass sees the opening bracket and marks it |
It sounds like there are two actionable items then:
Are there any issues with the first item? Can we (I, if no one else) make a PR for this?

Alternative suggestion: Leave the parsing as-is, but have some iteration logic that lets the user decide whether or not to merge events. Perhaps part of the The benefit of this approach is that the cost of merging events would only be incurred if you actually set the flag. It could also be effectively implemented as a 3rd optional "ergonomics" pass, where the user could opt into some nicer but expensive things -- stuff like line-by-line text events, full text events, or normalization on those full text events. We wouldn't guarantee any performance metrics on this but could offer an ergonomic API for the clients that would rather have an easy-to-use API than a performant one. |
Thank you both for your help. This means, as a user of this library, I will adapt my code to handle any number of continuous Text Events, and then it works fine.
From my point of view, that would be really helpful, yes. Knowing this detail means that library users know what events to expect, so they can write robust code interfacing with
It's probably not necessary to introduce a breaking change. As long as the users know that multiple events can be emitted, the parser interface can be used fine. To normalize lines or not to, that's a difficult choice. |
What do you think about this, @raphlinus? This does not seem a bug to me now and the breaking change does not seem necessary. Maybe a docs update explaining the consecutive text events is the best option. |
On my side, I ended up writing a text merge stream that merges all these consecutive text events into one large text event, because in almost all use cases, this is what you actually want and need for further processing. I just apply the

TextMergeStream::new(Parser::new_with_broken_link_callback(
    &article.text,
    options,
    Some(&mut broken_link_callback),
))

Other people facing this issue can copy this From my point of view, the issue is resolved - with the strong recommendation to describe this behavior in the documentation because it is unexpected and may confuse library users. |
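For readers who want to see the idea, merging consecutive text events might look roughly like the following. This is a minimal sketch under stated assumptions: the `Event` enum is a simplified stand-in rather than pulldown_cmark's `Event` type, and `MergeText` is a hypothetical adapter, not the TextMergeStream code referred to above.

```rust
use std::iter::Peekable;

// Simplified stand-in for a parser event; NOT pulldown_cmark's `Event`.
#[derive(Debug, PartialEq)]
enum Event {
    Start,
    End,
    Text(String),
}

// Hypothetical iterator adapter that merges runs of consecutive
// `Text` events into a single `Text` event.
struct MergeText<I: Iterator<Item = Event>> {
    inner: Peekable<I>,
}

impl<I: Iterator<Item = Event>> MergeText<I> {
    fn new(iter: I) -> Self {
        MergeText { inner: iter.peekable() }
    }
}

impl<I: Iterator<Item = Event>> Iterator for MergeText<I> {
    type Item = Event;

    fn next(&mut self) -> Option<Event> {
        match self.inner.next()? {
            Event::Text(mut merged) => {
                // Keep absorbing text while the next event is also text.
                while let Some(Event::Text(more)) = self.inner.peek() {
                    merged.push_str(more);
                    self.inner.next();
                }
                Some(Event::Text(merged))
            }
            // Every non-text event is passed through unchanged.
            other => Some(other),
        }
    }
}
```

Because the adapter wraps any iterator of events, it stays opt-in: callers who do not wrap the parser pay no cost for the extra string building.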
The If Raph approves it, I will include it in the next release. Thanks! |
The TextMergeStream sounds like a good idea to me, and the opt-in nature means it won't break existing uses or cause performance issues with the extra string allocations, unless the new behavior is specifically requested. |
I am happy to hear that you find my code useful. A note on the license: The original source code is GPL 3.0, but you may use, modify and/or redistribute the code in |
Great, I hope to include it this week. Thanks. |
The consecutive text events issue has been fixed in #686. The normalization issue is open and it could be handled in the future but it is not guaranteed. |
Check if fixed by #776. |
Already fixed. |
As can be seen in issue #457, if you parse the following code block
you get
The strange behaviour here is that the lines start with \n, but don't end with \n. I think this is highly unusual and makes the strings harder to work with than necessary. Desired behaviour would be:
This behaviour would make much more sense (it is also more natural because it reflects the actual lines seen in the code block) and provides enhanced compatibility with other libraries like syntect which usually parse code on a line-by-line basis, where a line is defined as a string terminated by \n.
I am currently running into issues with this and the only remedy seems to be string slicing and copying, which costs performance.
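For illustration, once the text of a code block is available as one string with \n line endings, lines that keep their terminating \n can be obtained without copying by using the standard library's `str::split_inclusive`. This is only a sketch; the helper name `lines_with_terminators` is hypothetical, and how the result is fed into a line-based library is left open.

```rust
/// Hypothetical helper: split `code` into borrowed lines that each keep
/// their trailing '\n' (the last line may lack one if the input does).
fn lines_with_terminators(code: &str) -> Vec<&str> {
    code.split_inclusive('\n').collect()
}
```

For `"test\n\ntest\n"` this yields `"test\n"`, `"\n"` and `"test\n"`, each a slice of the original string.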