Change `read_text()` return value to `BytesText` by Mingun · Pull Request #944 · tafia/quick-xml

Mingun · 2026-02-20T18:08:26Z

Previously we only decoded the bytes that were read by read_text(), but correct processing should probably also include EOL normalization (because, according to the specification, an XML parser should operate on normalized input). So now the user can choose how to process the content:

- use decode() to get only the decoded text (as it was before)
- use xml_content() to get the text according to the XML standard
- use the Deref implementation to get the underlying bytes
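
For illustration, the three options can be sketched with a stand-in type. `RawText` below is hypothetical and is not quick-xml's `BytesText`; it assumes UTF-8 input and assumes that `xml_content()` amounts to decoding plus XML 1.0 EOL normalization. Only the method names `decode()` and `xml_content()` come from the description above; the signatures are guesses.

```rust
use std::borrow::Cow;
use std::ops::Deref;
use std::str::Utf8Error;

/// Hypothetical stand-in for the value returned by `read_text()`.
/// It only mirrors the three access paths; it is not the library type.
struct RawText<'a>(&'a [u8]);

impl<'a> RawText<'a> {
    /// Decode only: line endings stay exactly as they were in the input.
    fn decode(&self) -> Result<Cow<'a, str>, Utf8Error> {
        std::str::from_utf8(self.0).map(Cow::Borrowed)
    }

    /// Decode, then turn CRLF and lone CR into LF (assumed behaviour).
    fn xml_content(&self) -> Result<Cow<'a, str>, Utf8Error> {
        let text = std::str::from_utf8(self.0)?;
        if text.contains('\r') {
            Ok(Cow::Owned(text.replace("\r\n", "\n").replace('\r', "\n")))
        } else {
            // Nothing to normalize, so no allocation is needed.
            Ok(Cow::Borrowed(text))
        }
    }
}

impl<'a> Deref for RawText<'a> {
    type Target = [u8];

    /// Raw access to the undecoded bytes.
    fn deref(&self) -> &[u8] {
        self.0
    }
}

fn main() {
    let text = RawText(b"line 1\r\nline 2\rline 3");
    println!("decoded:     {:?}", text.decode());
    println!("xml content: {:?}", text.xml_content());
    println!("raw length:  {}", text.len());
}
```

The point of the split is that callers who only skip over text, or who want the exact bytes, do not pay for normalization.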

codecov-commenter · 2026-02-20T18:37:28Z


Codecov Report

❌ Patch coverage is 42.85714% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.38%. Comparing base (2b21d40) to head (c5506c3).
⚠️ Report is 24 commits behind head on master.

Files with missing lines	Patch %	Lines
examples/read_nodes.rs	0.00%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #944      +/-   ##
==========================================
+ Coverage   55.00%   56.38%   +1.38%     
==========================================
  Files          44       44              
  Lines       16816    17580     +764     
==========================================
+ Hits         9249     9913     +664     
- Misses       7567     7667     +100

Flag	Coverage Δ
unittests	`56.38% <42.85%> (+1.38%)`	⬆️


dralley · 2026-02-21T17:11:32Z

Tangential to this PR, the XML spec reads as though EOL handling (replacing CR-LF sequences, etc.) ought to be a step that precedes any parsing of the XML. That is, it's not really intended to be something which happens during the processing of individual elements.

Is that understanding correct?

This PR would still be a reasonable change given that no such mechanism currently exists (same as with decoding in general), but it might be relevant to e.g. #441 (comment)

I could imagine having a series of layers where

Layer 1) Decoding to utf-8 OR validating utf-8 OR assuming utf-8 (e.g. slice reader)
Layer 2) Preprocessing including EOL normalization, perhaps newline tracking for more useful errors
Layer 3) XML Parsing, which can universally assume pre-normalized utf-8

Pros:

- simpler
- probably more standards-compliant

Cons:

- slice reader becomes slightly less useful
  - you would either still need a step involving buffered processing, or enforce pre-processing the document using a function that returns some special type wrapped around Cow<str> that might allocate a new copy
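
A minimal sketch of how the layers and the `Cow<str>` wrapper could fit together, assuming UTF-8 input. The names `Preprocessed`, `validate`, `preprocess`, and `line_count` are hypothetical and are not part of quick-xml; layer 3 is reduced to a line counter that stands in for a parser relying on normalized input.

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// Hypothetical wrapper: text known to be valid UTF-8 with normalized EOLs.
#[derive(Debug)]
struct Preprocessed<'a>(Cow<'a, str>);

/// Layer 1: validate the whole input as UTF-8 without copying it.
fn validate(input: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(input)
}

/// Layer 2: normalize EOLs, allocating only when a CR is present.
fn preprocess(input: &[u8]) -> Result<Preprocessed<'_>, Utf8Error> {
    let text = validate(input)?;
    let normalized = if text.contains('\r') {
        Cow::Owned(text.replace("\r\n", "\n").replace('\r', "\n"))
    } else {
        Cow::Borrowed(text)
    };
    Ok(Preprocessed(normalized))
}

/// Layer 3 stand-in: because input is pre-normalized, counting lines
/// only has to look for LF.
fn line_count(doc: &Preprocessed<'_>) -> usize {
    doc.0.split('\n').count()
}

fn main() {
    let doc = preprocess(b"<a>\r\n  text\r</a>").expect("valid UTF-8");
    println!("{:?} has {} lines", doc.0, line_count(&doc));
}
```

The borrowed case keeps the slice reader zero-copy for documents that contain no CR; the owned case is the "might allocate a new copy" cost mentioned above.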


Mingun · 2026-02-22T12:47:40Z

> Is that understanding correct?

Correct.

> I could imagine having a series of layers where

Although this is logical from an architectural point of view, it seems to me that one of the key performance features when parsing non-UTF-8 encoded documents is to work with bytes, not characters. Since we only support encodings in which XML control bytes are encoded exactly as in ASCII (and in UTF-8), we can postpone the decoding step until it is actually needed. The same goes for the EOL normalization phase -- all encodings that we support are ASCII-compatible, so we can work with bytes and do normalization at the byte level too. The benefit of postponing all this work is especially noticeable when we skip a lot of events during processing. Naturally, there are also disadvantages -- we do not check the correctness of the encoded stream. It would be ideal to leave the validation step optional so that those who do not need it can turn it off.
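
A rough sketch of byte-level normalization, assuming (as stated above) that the bytes 0x0D and 0x0A always mean CR and LF in the input encoding. `normalize_eols` is a hypothetical helper, not a quick-xml function, and it applies only the XML 1.0 rule (CRLF and lone CR become LF).

```rust
use std::borrow::Cow;

/// Normalize EOLs on raw bytes without decoding them first.
fn normalize_eols(input: &[u8]) -> Cow<'_, [u8]> {
    // Fast path: no CR at all, so the input is returned as-is.
    let Some(first_cr) = input.iter().position(|&b| b == b'\r') else {
        return Cow::Borrowed(input);
    };

    let mut out = Vec::with_capacity(input.len());
    out.extend_from_slice(&input[..first_cr]);

    let mut i = first_cr;
    while i < input.len() {
        match input[i] {
            b'\r' => {
                out.push(b'\n');
                // Skip the LF of a CRLF pair so it is not emitted twice.
                if input.get(i + 1) == Some(&b'\n') {
                    i += 1;
                }
            }
            other => out.push(other),
        }
        i += 1;
    }
    Cow::Owned(out)
}

fn main() {
    // 0xEF and 0xE8 are not valid UTF-8 on their own (think of a
    // single-byte legacy encoding); they pass through untouched.
    let input = [0xEF, b'\r', b'\n', 0xE8, b'\r'];
    println!("{:?}", normalize_eols(&input));
}
```

Nothing here validates the encoded stream, which is exactly the trade-off described above.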

dralley · 2026-02-22T14:53:15Z

> Although this is logical from an architectural point of view, it seems to me that one of the key performance features when parsing non-UTF-8 encoded documents is to work with bytes, not characters. Since we only support encodings in which XML control bytes are encoded exactly as in ASCII (and in UTF-8), we can postpone the decoding step until it is actually needed. The same goes for the EOL normalization phase -- all encodings that we support are ASCII-compatible, so we can work with bytes and do normalization at the byte level too. The benefit of postponing all this work is especially noticeable when we skip a lot of events during processing. Naturally, there are also disadvantages -- we do not check the correctness of the encoded stream. It would be ideal to leave the validation step optional so that those who do not need it can turn it off.

I don't mean that we would be working with `char` or anything. The point is that if you know everything is UTF-8, and the parser knows where the relevant separators are (`<`, `>`, etc.), then any "raw bytes" between those known boundaries can be assumed to be valid UTF-8 (via the unsafe fn `std::str::from_utf8_unchecked()`) without any additional performance cost.
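
To make that concrete, here is a toy sketch: the document is validated once in bulk, and slices between ASCII delimiters are then reinterpreted without a second check. `text_segments` is a hypothetical helper; it ignores comments, CDATA, and `>` inside attribute values, so it is not a parser.

```rust
/// Return the non-empty text runs found between a `>` and the next `<`.
/// Taking `&str` encodes the assumption that the input was validated once.
fn text_segments(doc: &str) -> Vec<&str> {
    let bytes = doc.as_bytes();
    let mut segments = Vec::new();
    let mut start = None; // index just after the most recent `>`

    for (i, &b) in bytes.iter().enumerate() {
        match b {
            b'>' => start = Some(i + 1),
            b'<' => {
                if let Some(s) = start.take() {
                    if s < i {
                        // SAFETY: `doc` is valid UTF-8, and both `s` and `i`
                        // are adjacent to ASCII bytes, which are always char
                        // boundaries, so the subslice is valid UTF-8 too.
                        let text =
                            unsafe { std::str::from_utf8_unchecked(&bytes[s..i]) };
                        segments.push(text);
                    }
                }
            }
            _ => {}
        }
    }
    segments
}

fn main() {
    let raw: &[u8] = "<a>héllo</a><b>wörld</b>".as_bytes();
    // One bulk validation pass for the whole document.
    let doc = std::str::from_utf8(raw).expect("valid UTF-8");
    println!("{:?}", text_segments(doc));
}
```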

The other goal behind pre-decoding would be to be able to support encodings like UTF-16 in the first place, as right now we could not do so.
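
A small illustration of why UTF-16 needs a decoding step before any byte-level scanning, using only the standard library. `decode_utf16le` is a hypothetical helper; it does not handle a BOM or big-endian input.

```rust
/// Decode UTF-16LE bytes into a UTF-8 `String`.
/// Returns `None` for an odd byte count or invalid surrogates.
fn decode_utf16le(input: &[u8]) -> Option<String> {
    if input.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = input
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}

fn main() {
    // "<a>" in UTF-16LE: every ASCII character is followed by a zero byte.
    let raw = [b'<', 0, b'a', 0, b'>', 0];

    // A byte-level search for the ASCII sequence `<a>` finds nothing.
    let found_raw = raw.windows(3).any(|w| w == &b"<a>"[..]);
    println!("found in raw bytes: {found_raw}");

    // After pre-decoding, the usual ASCII-compatible scanning works again.
    let text = decode_utf16le(&raw).expect("valid UTF-16LE");
    println!("found after decoding: {}", text.contains("<a>"));
}
```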

I don't have recent hard data to back this up, but a foundational premise is also that the overall performance cost of validating OR decoding would be reduced dramatically by doing it in bulk rather than repeatedly on many small items, due to the nature of caches, vectorization, required allocations, etc.

> It would be ideal to leave the validation step optional so that those who do not need it can turn it off.

Anyway, that can be discussed further on my other draft PR, where there's something to look at.

dralley

LGTM, my discussion is not strictly related

Mingun added the enhancement label Feb 20, 2026

read_text() now returns BytesText which allows you to get the con…

c5506c3

…tent with properly normalized EOLs

Mingun force-pushed the flexible-read-text branch from 5e7c0cf to c5506c3 Compare February 22, 2026 12:30

dralley approved these changes Feb 22, 2026


Mingun merged commit 6238d8a into tafia:master Feb 22, 2026
7 checks passed

Mingun deleted the flexible-read-text branch February 22, 2026 16:46
