Change read_text() return value to BytesText#944
Conversation
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #944 +/- ##
==========================================
+ Coverage 55.00% 56.38% +1.38%
==========================================
Files 44 44
Lines 16816 17580 +764
==========================================
+ Hits 9249 9913 +664
- Misses 7567 7667 +100
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Tangential to this PR, the XML spec reads as though EOL handling (replacing CR-LF sequences, etc.) ought to be a step that precedes any parsing of the XML. That is, it's not really intended to be something which happens during the processing of individual elements. Is that understanding correct? This PR would still be a reasonable change given that no such mechanism currently exists (same as with decoding in general), but it might be relevant to e.g. #441 (comment) I could imagine having a series of layers where Layer 1) Decoding to utf-8 OR validating utf-8 OR assuming utf-8 (e.g. slice reader) Pros:
Cons:
|
…tent with properly normalized EOLs
5e7c0cf to
c5506c3
Compare
Correct.
Although this is logical from an architectural point of view, it seems to me that one of the key performance features when parsing non-UTF-8 encoded documents is to work with bytes, not characters. Since we only support encodings in which XML control bytes are encoded exactly as in ASCII (and in UTF-8), we can postpone the decoding step until it is actually needed. The same goes for the EOL normalization phase -- all encodings that we support are ASCII-compatible, so we can work with bytes and do normalization at the byte level too. Postpone all this work is especially noticeable when we skip a lot of events during processing. Naturally, there are also disadvantages -- we do not check the correctness of the encoded stream. It would be ideal to leave the validation step optional so that those who do not need it can turn it off. |
I don't mean that we would be working with The other goal behind pre-decoding would be to be able to support encodings like UTF-16 in the first place, as right now we could not do so. I don't have recent hard data to back this up, but a foundational premise is also that the overall performance cost of validating OR decoding would be reduced dramatically by doing it in-bulk rather than doing it repeatedly on many small items due to the nature of caches, vectorization, required allocations etc.
Anyway, that can be discussed further on my other draft PR, where there's something to look at. |
dralley
left a comment
There was a problem hiding this comment.
LGTM, my discussion is not strictly related
Previously we decode the bytes that was read by
read_text(), but correct processing probably should also include EOL normalization (because according to the specification, XML parser should operate at normalized input). So now the user can choose how to process the content:decode()to get only decoded text (as it was before)xml_content()to get the text according to the XML standardDerefimplementation to get the underlying bytes