Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strip BOM from the event stream, add a method to Writer for writing BOM #459

Merged
merged 3 commits into from Aug 18, 2022

Conversation

dralley
Copy link
Collaborator

@dralley dralley commented Aug 17, 2022

No description provided.

@dralley
Copy link
Collaborator Author

dralley commented Aug 17, 2022

What I actually want to do: provide a configuration option for either trim_before_first_element which would skip all data prior to the first XML element (including the BOM), or go further and provide a trim_interelement_text option which would do this while also generally trimming all text events in places where they aren't supposed to be (which would address #285 (comment)). If you want to fully reproduce the original event stream you could just disable that trimming and you would get it all (including the BOM in a Text event)

But after attempting to implement it I realized it requires more substantial parsing changes then I'm prepared to make right now. This approach isn't ideal but it covers most needs fairly well, addresses the original issue, and still moves in the right direction.

@codecov-commenter
Copy link

codecov-commenter commented Aug 17, 2022

Codecov Report

Merging #459 (340ea04) into master (11e483a) will increase coverage by 0.81%.
The diff coverage is 63.63%.

@@            Coverage Diff             @@
##           master     #459      +/-   ##
==========================================
+ Coverage   52.45%   53.27%   +0.81%     
==========================================
  Files          29       29              
  Lines       13555    13527      -28     
==========================================
+ Hits         7110     7206      +96     
+ Misses       6445     6321     -124     
Flag Coverage Δ
unittests 53.27% <63.63%> (+0.81%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
benches/microbenches.rs 0.00% <ø> (ø)
src/de/mod.rs 82.24% <ø> (-0.32%) ⬇️
src/events/mod.rs 69.90% <ø> (+1.38%) ⬆️
src/writer.rs 49.34% <5.00%> (+0.42%) ⬆️
src/encoding.rs 83.87% <87.50%> (-12.91%) ⬇️
src/reader/mod.rs 90.43% <100.00%> (+0.10%) ⬆️
src/reader/parser.rs 98.67% <100.00%> (+0.05%) ⬆️
... and 4 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@dralley dralley requested a review from Mingun August 17, 2022 04:05
Copy link
Collaborator

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of the trim_interelement_text option (the name, probably, could be better), that's what I was thinking.

I'm not sure that we should remove StartText if we want

  • return text only as an str
  • return the BOM, which is not an str (UTF BOMs seems to be representable as special characters, but I'm not sure that this is a general case. Should we bother about other possible BOMs?)

StartText solves this problem by splitting out the ordinary text from the BOM bytes and other unexpected chars before the declaration, which are not according to the standard, but I think, could be used in the wild (the examples is our tests where it is often convenient to have leading spaces before a declaration). Actually, it should be named StartBytes, because this is not a text

I agree that using StartText event not the best way to write BOM, the writer should write it by himself. That is why I was thinking about splitting event into events for reading and events for writing, for which reason public reader and writer modules was introduced and Event::borrow method was implemented.

Having different events we could fix #332 (because reader event no longer will own the data and we could return right lifetime with existing Attributes struct).

Different events for reading and writing also has an advantage, that reader event can contain bytes internally and decoded only on demand (although we have a goal to step away from that, but is can have non-obvious implications) and a span, but writer event could

  • own their content
  • store their content in UTF-8 only (and in fact right now you should use only UTF-8 -- event generation in other encodings seems to be broken. serde serializer, for example, always writes in UTF-8)
  • does not have a span which is anyway is useless for it

What exactly prevents StartText?

tests/encodings.rs Show resolved Hide resolved
src/reader/parser.rs Show resolved Hide resolved
@dralley
Copy link
Collaborator Author

dralley commented Aug 17, 2022

I'm not sure that we should remove StartText if we want

  • return text only as an str
  • return the BOM, which is not an str (UTF BOMs seems to be representable as special characters, but I'm not sure that this is a general case. Should we bother about other possible BOMs?)

StartText solves this problem by splitting out the ordinary text from the BOM bytes and other unexpected chars before the declaration, which are not according to the standard, but I think, could be used in the wild (the examples is our tests where it is often convenient to have leading spaces before a declaration). Actually, it should be named StartBytes, because this is not a text

This is a misunderstanding of the BOM that I also shared until a day or two ago.

It's not that it's able to be represented as a special character but that it is literally just a character, with an assigned Unicode character number, which can be encoded in different encodings just like any other Unicode character - and if you put this character at the beginning of the document, and you know how different encodings represent that character, then you can just look at the first few bytes and see if they match how a particular encoding would encode the character U+FEFF.

In retrospect this makes total sense but it never really clicked into place for me until very recently.

The implications are that:

  • str can store it just fine. It's just a unicode character, UTF-8 encodes unicode characters, str stores UTF-8, it all works.
  • Encoders and decoders know how to handle it. You don't have to treat it differently, you can just pass it through the decoder. If you decode a UTF-16 LE BOM as UTF-8, you get a UTF-8 BOM. Because it's just two representations of the same character. The same applies to any combination of UTF-8, UTF-16, and UTF-32 - you can pass it back and forth and it will always be correct for the document encoding.

Therefore, since it is truly just Text, we don't need a separate StartText.

The problem with the BOM is not that it can't be represented with str but that it's annoying and undesirable to output as part of the XML event stream - which is approximately the same problem we have with un-trimmed whitespace between elements. So I think it makes sense to deal with it in the same way. If you want to parse a normalized XML representation (which most people do), you should trim all that stuff away. If you want an exact reproduction, don't trim it.

In terms of Writer, although we probably won't support writing different encodings any time soon (since encoding_rs doesn't support any other encodings), we don't need to worry about multiple different kinds of BOMs. Just write the UTF-8 BOM into the text stream the same as everything else and let the encoder handle it.

edit: I was also wrong about encoding_rs_io not letting you pass the BOM through. You can do that and it's safe for the reason I just described.

Mingun
Mingun approved these changes Aug 18, 2022
Copy link
Collaborator

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not that it's able to be represented as a special character but that it is literally just a character, with an assigned Unicode character number, which can be encoded in different encodings just like any other Unicode character - and if you put this character at the beginning of the document, and you know how different encodings represent that character, then you can just look at the first few bytes and see if they match how a particular encoding would encode the character U+FEFF.

OK, the official FAQ about that: https://unicode.org/faq/utf_bom.html#bom1

Also I've made some experiments:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=4f7773f8164bd228c4f35329b915f38b

Generally approved, but please made the small improvements in doc & tests

Changelog.md Outdated Show resolved Hide resolved
Changelog.md Outdated Show resolved Hide resolved
src/encoding.rs Outdated Show resolved Hide resolved
src/reader/mod.rs Show resolved Hide resolved
src/reader/mod.rs Outdated Show resolved Hide resolved
src/writer.rs Outdated Show resolved Hide resolved
src/writer.rs Show resolved Hide resolved
StartText would be out of place once all events are expected to contain UTF-8.
Additionally the decoder implementation strips BOM bytes out of the
bytestream so there's no good way to access them.
@dralley
Copy link
Collaborator Author

dralley commented Aug 18, 2022

@Mingun re: test, I've marked it as an expected failure until we can address the same issue you pointed out. We can strip out the BOM here but it's too late to avoid emitting the text event entirely.

But we had the same issue before with *_with_bom_removal(), so it's not a regression.

@dralley dralley merged commit bbe490b into tafia:master Aug 18, 2022
3 checks passed
@dralley dralley deleted the remove-bytesstart branch August 18, 2022 14:14
Mingun added a commit to Mingun/quick-xml that referenced this pull request Aug 26, 2022
After removing StartText event in tafia#459 text events can be generated
at the beginning of the stream
Mingun added a commit to Mingun/quick-xml that referenced this pull request Aug 26, 2022
After removing StartText event in tafia#459 text events can be generated
at the beginning of the stream
Mingun added a commit to Mingun/quick-xml that referenced this pull request Aug 27, 2022
After removing StartText event in tafia#459 text events can be generated
at the beginning of the stream
dralley pushed a commit that referenced this pull request Aug 27, 2022
After removing StartText event in #459 text events can be generated
at the beginning of the stream
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants