Closure-based unescaping with custom entities #415

dralley · 2022-07-10T04:03:47Z

dralley · 2022-07-10T04:06:11Z

@Mingun I think this is roughly the API you asked for, but I'm having some trouble getting the custom_entities.rs example to compile.

Under this new model, the lifetimes are a bit more complex. Feel free to do some experimentation (I'm going to sleep)

src/escapei.rs

Mingun · 2022-07-11T16:37:22Z

It is better to replace suffix _with_custom_entities to _with -- this is more similar how other Rust APIs is named

examples/custom_entities.rs

dralley · 2022-07-11T16:48:30Z

It is better to replace suffix _with_custom_entities to _with -- this is more similar how other Rust APIs is named

I agree but when it comes to unescape_and_decode_with() the name sounds misleading - it sounds like the "with" refers to the decoding portion instead of the unescaping portion.

As with the corrections you've made recently, the more appropriate name would be decode_and_unescape_with(). Thoughts?

Mingun · 2022-07-11T16:58:55Z

Yes, it should be decode_and_unescape everywhere, because this is the correct order of operations.

Actually, I would prefer the following naming:

Name	Description
`unescape(&self, decoder: Decoder)`	Decodes internal byte content by specified decoder, then unescapes it
`unescape_with<F>(&self, decoder: Decoder, resolve_entity: F)`	-//-
`utf8_unescape(&self)`	Assumes UTF-8 encoded internal content, unescapes it (available only if feature `"encoding"` is not enabled)
`utf8_unescape_with<F>(&self, resolve_entity: F)`	-//-

But moving some functions under a feature flag will now breaks serde deserializer, which currently also uses incorrect unescape-then-decode order. Fixing that is not a fast task

src/events/attributes.rs

examples/custom_entities.rs

dralley · 2022-07-11T17:46:02Z

@Mingun I'm not 100% sure if this makes sense, obviously you've put more thought into decoding, but:

Given that encoding is a property of the source file, and not something that changes on an element-by-element basis, does it really make sense for the complexity of decoding to be distributed throughout all the different structs?

I feel like the reader ought to just decode everything into the buffer as utf-8, and thereafter Event* and Attribute* and *escape* can just assume that it's working with UTF-8. Instead of reading the bytes directly and trying to work with it no matter the encoding. And likewise on the writing side, working with UTF-8 and then converting it to UTF-16 (or whichever encoding) as it gets written seems much simpler.

This is the approach used by other libraries such as libxml https://gitlab.gnome.org/GNOME/libxml2/-/wikis/Encodings-support#the-internal-encoding-how-and-why

From a performance perspective, this can also be better, since encoding / decoding huge blocks of data keeps the loops hot and is easy to accelerate with SIMD (which encoding_rs tries to use when possible). It might be worse in some cases, if you're using an alternate encoding and not reading the full document, but that feels a) not especially common and b) not worth adding runtime overhead to the common utf-8 case for.

It also means the API would be cut down to a much smaller number of functions.

Mingun · 2022-07-11T17:54:38Z

I would be glad to decode the input at time of get them from the underlying reader, even at cost of some minor performance penalty. But need to make this with caution, I trying to keep the spirit of @tafia's implementation -- do not do work if it is unrequired

dralley · 2022-07-11T18:06:52Z

(BTW, @tafia , feel free to weigh in).

The way I see this is that if the user actually does want / need to support different encodings, the current approach will actually require more work (both human and computer) to be done.

Decoding individual strings means allocating a bunch of individual strings, and if you use the unescape_and_decode() functions to do that then you will be allocating everything all the time whether you're using utf-8 or not - on top of any other small incremental performance penalties that decoding larger blocks of data might avoid.

You could try to decode and unescape manually to get back the performance for utf-8, but that's going to be a huge amount of work to do that in every individual place that it would need to be done, and not worth it at all for the vast majority of users.

So I think, the approach would be better in most cases, and also much simpler to use, maintain and test.

examples/custom_entities.rs

dralley · 2022-07-12T15:48:12Z

Commits are a little dirty in the sense that changes mixed together a bit, I can fix that if you mind deeply.

codecov-commenter · 2022-07-12T15:52:34Z

Codecov Report

Merging #415 (a4d50a5) into master (a6588c2) will increase coverage by 0.09%.
The diff coverage is 54.02%.

@@            Coverage Diff             @@
##           master     #415      +/-   ##
==========================================
+ Coverage   49.58%   49.67%   +0.09%     
==========================================
  Files          22       22              
  Lines       13935    13856      -79     
==========================================
- Hits         6909     6883      -26     
+ Misses       7026     6973      -53

Flag	Coverage Δ
unittests	`49.67% <54.02%> (+0.09%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
benches/macrobenches.rs	`0.00% <0.00%> (ø)`
benches/microbenches.rs	`0.00% <0.00%> (ø)`
examples/custom_entities.rs	`0.00% <0.00%> (ø)`
src/events/attributes.rs	`94.42% <0.00%> (+0.83%)`	⬆️
src/lib.rs	`12.33% <0.00%> (+<0.01%)`	⬆️
src/reader.rs	`60.13% <0.00%> (ø)`
src/de/mod.rs	`77.38% <50.00%> (+0.09%)`	⬆️
src/escapei.rs	`12.10% <100.00%> (-1.81%)`	⬇️
src/events/mod.rs	`72.28% <100.00%> (+2.17%)`	⬆️
src/writer.rs	`49.28% <100.00%> (ø)`
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a6588c2...a4d50a5. Read the comment docs.

dralley · 2022-07-12T15:59:19Z

I'm leaving the decode_and_unescape*() API as-is for now. The more I think about it the more the UTF-8 only buffer approach seems like the appropriate course of action.

Otherwise it is very difficult to match tags in an encoding-agnostic way without decoding them, and that's a tremendous number of allocations, probably enough to nullify any benefit of avoiding the conversion overhead. If you're parsing the whole document including text and attributes, it would become pure overhead that you can't avoid.

Mingun

It is almost good, only few small nits are better to solve:

change *ed method names to *e names for further consistency
add feature gate for methods that assumes UTF-8 encoding, which should restrict their incorrect usage

src/events/attributes.rs

Mingun · 2022-07-12T16:35:46Z

src/events/mod.rs

    }

    /// gets escaped content
    ///
    /// Searches for '&' into content and try to escape the coded character if possible
    /// returns Malformed error with index within element if '&' is not followed by ';'
    ///
-    /// See also [`unescaped_with_custom_entities()`](Self::unescaped_with_custom_entities)
+    /// See also [`unescaped_with()`](Self::unescaped_with)
    pub fn unescaped(&self) -> Result<Cow<[u8]>> {


Similarly

Suggested change

pub fn unescaped(&self) -> Result<Cow<[u8]>> {

#[cfg(not(feature = "encoding"))]

pub fn unescape(&self) -> Result<Cow<[u8]>> {

Mingun · 2022-07-12T16:36:13Z

src/events/mod.rs

-    }
-
-    fn make_unescaped<'s>(
+    pub fn unescaped_with<'s, 'entity>(


Similarly

Suggested change

pub fn unescaped_with<'s, 'entity>(

#[cfg(not(feature = "encoding"))]

pub fn unescape_with<'s, 'entity>(

Mingun · 2022-07-12T16:55:17Z

src/events/attributes.rs

@@ -37,39 +37,33 @@ impl<'a> Attribute<'a> {
    ///
    /// This will allocate if the value contains any escape sequences.
    ///
-    /// See also [`unescaped_value_with_custom_entities()`](Self::unescaped_value_with_custom_entities)
+    /// See also [`unescaped_value_with()`](Self::unescaped_value_with)
    pub fn unescaped_value(&self) -> XmlResult<Cow<[u8]>> {


Similarly

Suggested change

pub fn unescaped_value(&self) -> XmlResult<Cow<[u8]>> {

#[cfg(not(feature = "encoding"))]

pub fn unescape_value(&self) -> XmlResult<Cow<[u8]>> {

Mingun · 2022-07-12T17:07:29Z

src/events/attributes.rs

+    pub fn unescaped_value_with<'s, 'entity>(
+        &'s self,
+        resolve_entity: impl Fn(&[u8]) -> Option<&'entity str>,
+    ) -> XmlResult<Cow<'s, [u8]>> {
+        unescape_with(&*self.value, resolve_entity).map_err(Error::EscapeError)


I think it will be more consistently with other codebase

Suggested change

pub fn unescaped_value_with<'s, 'entity>(

&'s self,

resolve_entity: impl Fn(&[u8]) -> Option<&'entity str>,

) -> XmlResult<Cow<'s, [u8]>> {

unescape_with(&*self.value, resolve_entity).map_err(Error::EscapeError)

#[cfg(not(feature = "encoding"))]

pub fn unescape_value_with<'s, 'entity>(

&'s self,

resolve_entity: impl Fn(&[u8]) -> Option<&'entity str>,

) -> XmlResult<Cow<'s, [u8]>> {

Ok(unescape_with(&*self.value, resolve_entity)?)

The added feature gate prevents incorrect usage. All usages in tests should be replaced with decode_* variant. Another way to solve that -- add a decoder to the Attribute / Event and keep only variant without reader parameter, but I think that this may be a task for another PR, focused on encoding.

I would rather push all the decoding changes into a different PR. At the moment all of the parsing code is based on searches for single ascii bytes, which means it is effectively a UTF-8 only library, even if the detection of encodings works properly. #322

So there's a lot of work needed to make the encoding feature actually useful.

Yes, I understand and support. That is why I suggest just disable dangerous methods when "encoding" feature is enabled.

So there's a lot of work needed to make the encoding feature actually useful.

Actually, it is useful for one-byte encodings. Bearing in mind, that

we use encoding_rs

encoding_rs is not extensible by design

as the result, the list of encodings is fixed

the only supported encodings, that is not XML compatible (because is not ASCII compatible, which is more strict restriction) is (generated by this snippet):

ISO-2022-JP

replacement (not a real encoding, actually)

UTF-16BE

UTF-16LE

So the encoding feature actually may be useful even in the current state.

The other suggestion is to make namings is even more consistently, compare:

unescaped

decode_and_unescape

Because the difference only in the decode operation applied in the second function, it is logically, that stripping out the decode_and_ prefix we should get a name of the method that works without decoding

(also, I think, you could squash last 3 commits)

You are right about the other one-byte encodings, I just don't know how common they actually are, certainly compared to UTF-16. But probably what I should have said is that I don't really want to go back and forth over functions that it might make sense to remove entirely at some point in the future.

I will go ahead and make the changes anyway (tomorrow, since I really need to go to sleep now)

@Mingun I wont forget about this, but I am going to put this in a separate PR. I want to finish running some experiments first re: bulk decode vs. decoding individual elements and the outcome of that might influence what we want the API to look like.

I'm totally agree that encoding part should be in a separate PR, but can you polish namings in this PR? Only one simple change:

unescaped() -> unescape()

unescaped_with() -> unescape_with()

unescaped_value() -> unescape_value()

unescaped_value_with() -> unescape_value_with()

As you already changed some of that names it is quite logically to finish with that here, just to keep history clean :). And after that it can be merged

OK :) but, please take a look at #158 (comment)

I'm going to draft this for now because I'm 100% convinced we should change the APIs to pure &str / String ones.

I'm going to draft this for now because I'm 100% convinced we should change the APIs to pure &str / String ones.

I agree, I already work in this direction on top of this PR, namely changing the order of unescape and decode operations in the serde deserializer. I think I'm going to merge it now and continue evolving API in the another PRs. At least I want to make the new closure-based API accepts &str instead of &[u8] before the release.

src/events/mod.rs

dralley · 2022-07-13T06:13:49Z

I moved all the thoughts about future encoding strategy to the appropriate issue #158, and did a bit of background research on performance

It's redundant now that the macrobenchmarks and (un)escape benchmarks exist.

Instead of providing unescaping functions with an entity mapping via a data structure, instead provide a closure which maps the entity with replacement text.

dralley · 2022-07-15T05:00:50Z

src/events/attributes.rs

    ///
-    /// [`unescaped_value()`]: Self::unescaped_value
+    /// [`unescape_value()`]: Self::unescape_value


To be honest, I think it reads a little worse, even if it's more consistent. I don't care enough to argue any more (lol) but I'll just say it for what it's worth

.value / raw_value() -> ... (nothing)

unescaped_value() -> unescape_value()

normalized_value() -> normalize_value()

The former fields weird when compared against the latter, just a personal opinion. Ultimately it doesn't matter very much.

Also, name conflict needs to be resolved https://github.com/tafia/quick-xml/blob/master/src/events/mod.rs#L738-L757=

Just rename private function to something, I'll remove it in my PR tonight.

Mingun · 2022-07-15T05:57:46Z

Thanks for you hard work!

dralley force-pushed the escape-api branch 2 times, most recently from 657a8f6 to 81d63a4 Compare July 10, 2022 20:06

Mingun reviewed Jul 10, 2022

View reviewed changes

src/escapei.rs Outdated Show resolved Hide resolved

dralley force-pushed the escape-api branch 6 times, most recently from 5eccdae to fd609e2 Compare July 11, 2022 16:22

dralley commented Jul 11, 2022

View reviewed changes

examples/custom_entities.rs Outdated Show resolved Hide resolved

Mingun reviewed Jul 11, 2022

View reviewed changes

src/events/attributes.rs Outdated Show resolved Hide resolved

examples/custom_entities.rs Outdated Show resolved Hide resolved

dralley force-pushed the escape-api branch 4 times, most recently from 6abd001 to ce19763 Compare July 12, 2022 15:40

dralley marked this pull request as ready for review July 12, 2022 15:40

dralley commented Jul 12, 2022

View reviewed changes

examples/custom_entities.rs Show resolved Hide resolved

dralley requested a review from Mingun July 12, 2022 15:46

Mingun requested changes Jul 12, 2022

View reviewed changes

dralley mentioned this pull request Jul 12, 2022

ergonomics & encodings #158

Open

dralley force-pushed the escape-api branch from ce19763 to d764f1e Compare July 12, 2022 21:23

dralley force-pushed the escape-api branch from d764f1e to 99c7e6a Compare July 14, 2022 04:07

Remove the bytes_text_unescaped benchmark

4656f3b

It's redundant now that the macrobenchmarks and (un)escape benchmarks exist.

dralley force-pushed the escape-api branch from 99c7e6a to 2507e12 Compare July 14, 2022 04:09

Change API for providing custom entities

a4a0f8d

Instead of providing unescaping functions with an entity mapping via a data structure, instead provide a closure which maps the entity with replacement text.

dralley force-pushed the escape-api branch from 2507e12 to a4a0f8d Compare July 14, 2022 04:22

dralley marked this pull request as draft July 15, 2022 02:01

dralley marked this pull request as ready for review July 15, 2022 04:53

dralley force-pushed the escape-api branch from 15797df to 22ee00b Compare July 15, 2022 04:56

dralley commented Jul 15, 2022

View reviewed changes

dralley force-pushed the escape-api branch from 22ee00b to b769c64 Compare July 15, 2022 05:03

Make escape / unescape consistent throughout API

a4d50a5

dralley force-pushed the escape-api branch from b769c64 to a4d50a5 Compare July 15, 2022 05:40

Mingun merged commit 72e11d1 into tafia:master Jul 15, 2022

dralley deleted the escape-api branch July 15, 2022 15:40

plugwash mentioned this pull request Dec 22, 2022

patches to support newer quick-xml rust-x-bindings/rust-xcb#216

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Closure-based unescaping with custom entities #415

Closure-based unescaping with custom entities #415

dralley commented Jul 10, 2022

dralley commented Jul 10, 2022 •

edited

Loading

Mingun commented Jul 11, 2022

dralley commented Jul 11, 2022

Mingun commented Jul 11, 2022 •

edited

Loading

dralley commented Jul 11, 2022 •

edited

Loading

Mingun commented Jul 11, 2022

dralley commented Jul 11, 2022 •

edited

Loading

dralley commented Jul 12, 2022

codecov-commenter commented Jul 12, 2022 •

edited

Loading

dralley commented Jul 12, 2022 •

edited

Loading

Mingun left a comment

Mingun Jul 12, 2022

Mingun Jul 12, 2022

Mingun Jul 12, 2022

Mingun Jul 12, 2022

dralley Jul 13, 2022

Mingun Jul 13, 2022 •

edited

Loading

dralley Jul 13, 2022

dralley Jul 13, 2022

dralley Jul 14, 2022

Mingun Jul 14, 2022

dralley Jul 15, 2022

Mingun Jul 15, 2022

dralley commented Jul 13, 2022

dralley Jul 15, 2022 •

edited

Loading

dralley Jul 15, 2022

Mingun Jul 15, 2022

Mingun commented Jul 15, 2022

	pub fn unescaped(&self) -> Result<Cow<[u8]>> {
	#[cfg(not(feature = "encoding"))]
	pub fn unescape(&self) -> Result<Cow<[u8]>> {

	pub fn unescaped_with<'s, 'entity>(
	#[cfg(not(feature = "encoding"))]
	pub fn unescape_with<'s, 'entity>(

	pub fn unescaped_value(&self) -> XmlResult<Cow<[u8]>> {
	#[cfg(not(feature = "encoding"))]
	pub fn unescape_value(&self) -> XmlResult<Cow<[u8]>> {

Closure-based unescaping with custom entities #415

Closure-based unescaping with custom entities #415

Conversation

dralley commented Jul 10, 2022

dralley commented Jul 10, 2022 • edited Loading

Mingun commented Jul 11, 2022

dralley commented Jul 11, 2022

Mingun commented Jul 11, 2022 • edited Loading

dralley commented Jul 11, 2022 • edited Loading

Mingun commented Jul 11, 2022

dralley commented Jul 11, 2022 • edited Loading

dralley commented Jul 12, 2022

codecov-commenter commented Jul 12, 2022 • edited Loading

Codecov Report

dralley commented Jul 12, 2022 • edited Loading

Mingun left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Mingun Jul 13, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dralley commented Jul 13, 2022

dralley Jul 15, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Mingun commented Jul 15, 2022

dralley commented Jul 10, 2022 •

edited

Loading

Mingun commented Jul 11, 2022 •

edited

Loading

dralley commented Jul 11, 2022 •

edited

Loading

dralley commented Jul 11, 2022 •

edited

Loading

codecov-commenter commented Jul 12, 2022 •

edited

Loading

dralley commented Jul 12, 2022 •

edited

Loading

Mingun Jul 13, 2022 •

edited

Loading

dralley Jul 15, 2022 •

edited

Loading