Closure-based unescaping with custom entities #415
@Mingun I think this is roughly the API you asked for, but I'm having some trouble getting the Under this new model, the lifetimes are a bit more complex. Feel free to do some experimentation (I'm going to sleep)
It is better to replace suffix
I agree but when it comes to As with the corrections you've made recently, the more appropriate name would be
Yes, it should be Actually, I would prefer the following naming:
But moving some functions under a feature flag will now break the serde deserializer, which currently also uses the incorrect unescape-then-decode order. Fixing that is not a quick task.
@Mingun I'm not 100% sure if this makes sense, obviously you've put more thought into decoding, but:
Given that encoding is a property of the source file, and not something that changes on an element-by-element basis, does it really make sense for the complexity of decoding to be distributed throughout all the different structs? I feel like the reader ought to just decode everything into the buffer as UTF-8, and thereafter
This is the approach used by other libraries such as libxml: https://gitlab.gnome.org/GNOME/libxml2/-/wikis/Encodings-support#the-internal-encoding-how-and-why
From a performance perspective, this can also be better, since encoding / decoding huge blocks of data keeps the loops hot and is easy to accelerate with SIMD (which encoding_rs tries to use when possible). It might be worse in some cases, if you're using an alternate encoding and not reading the full document, but that feels a) not especially common and b) not worth adding runtime overhead to the common UTF-8 case for.
It also means the API would be cut down to a much smaller number of functions.
I would be glad to decode the input as soon as it is read from the underlying reader, even at the cost of a minor performance penalty. But this needs to be done with caution: I am trying to keep the spirit of @tafia's implementation, which is to not do work that is not required.
(BTW, @tafia, feel free to weigh in.)
The way I see this is that if the user actually does want / need to support different encodings, the current approach will actually require more work (both human and computer) to be done. Decoding individual strings means allocating a bunch of individual strings, and if you use the
You could try to decode and unescape manually to get back the performance for UTF-8, but that's going to be a huge amount of work to do in every individual place that it would need to be done, and not worth it at all for the vast majority of users.
So I think the approach would be better in most cases, and also much simpler to use, maintain and test.
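To make the "decode once at the reader, then hand out UTF-8" idea concrete, here is a minimal sketch. It is not quick-xml code: `decode_latin1` and `text_of` are hypothetical helpers, ISO-8859-1 is decoded by hand only to keep the example free of dependencies (a real implementation would presumably go through encoding_rs), and the "parser" is just a substring search.

```rust
/// Decodes ISO-8859-1 (Latin-1) bytes into a UTF-8 `String` in one pass.
/// Latin-1 maps every byte to the Unicode code point with the same value.
/// (Hypothetical helper, stands in for a bulk decode step in the reader.)
pub fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Returns the text between the first `<name>` and the following `</name>`.
/// The result borrows from the already-decoded document, so no per-element
/// decoding or allocation of the returned text is needed.
/// (Hypothetical helper, not a real XML parser.)
pub fn text_of<'d>(doc: &'d str, name: &str) -> Option<&'d str> {
    let open = format!("<{}>", name);
    let close = format!("</{}>", name);
    let start = doc.find(open.as_str())? + open.len();
    let end = start + doc[start..].find(close.as_str())?;
    Some(&doc[start..end])
}
```

With this shape, the only encoding-aware code is the single `decode_latin1` call; every accessor after it deals in `&str`. The trade-off raised in the discussion still applies: the whole input is decoded even if only a small part of the document is read.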
The commits are a little dirty, in the sense that changes are mixed together a bit. I can fix that if you mind deeply.
Codecov Report
@@ Coverage Diff @@
## master #415 +/- ##
==========================================
+ Coverage 49.58% 49.67% +0.09%
==========================================
Files 22 22
Lines 13935 13856 -79
==========================================
- Hits 6909 6883 -26
+ Misses 7026 6973 -53
I'm leaving the Otherwise it is very difficult to match tags in an encoding-agnostic way without decoding them, and that's a tremendous number of allocations, probably enough to nullify any benefit of avoiding the conversion overhead. If you're parsing the whole document including text and attributes, it would become pure overhead that you can't avoid.
It is almost good; only a few small nits would be better to solve:
- change *ed method names to *e names for further consistency
- add a feature gate for methods that assume UTF-8 encoding, which should restrict their incorrect usage
src/events/mod.rs
Outdated
  }

  /// gets escaped content
  ///
  /// Searches for '&' into content and try to escape the coded character if possible
  /// returns Malformed error with index within element if '&' is not followed by ';'
  ///
- /// See also [`unescaped_with_custom_entities()`](Self::unescaped_with_custom_entities)
+ /// See also [`unescaped_with()`](Self::unescaped_with)
  pub fn unescaped(&self) -> Result<Cow<[u8]>> {
Similarly
- pub fn unescaped(&self) -> Result<Cow<[u8]>> {
+ #[cfg(not(feature = "encoding"))]
+ pub fn unescape(&self) -> Result<Cow<[u8]>> {
src/events/mod.rs
Outdated
  }

- fn make_unescaped<'s>(
+ pub fn unescaped_with<'s, 'entity>(
Similarly
- pub fn unescaped_with<'s, 'entity>(
+ #[cfg(not(feature = "encoding"))]
+ pub fn unescape_with<'s, 'entity>(
src/events/attributes.rs
Outdated
@@ -37,39 +37,33 @@ impl<'a> Attribute<'a> {
  ///
  /// This will allocate if the value contains any escape sequences.
  ///
- /// See also [`unescaped_value_with_custom_entities()`](Self::unescaped_value_with_custom_entities)
+ /// See also [`unescaped_value_with()`](Self::unescaped_value_with)
  pub fn unescaped_value(&self) -> XmlResult<Cow<[u8]>> {
Similarly
- pub fn unescaped_value(&self) -> XmlResult<Cow<[u8]>> {
+ #[cfg(not(feature = "encoding"))]
+ pub fn unescape_value(&self) -> XmlResult<Cow<[u8]>> {
src/events/attributes.rs
Outdated
  pub fn unescaped_value_with<'s, 'entity>(
      &'s self,
      resolve_entity: impl Fn(&[u8]) -> Option<&'entity str>,
  ) -> XmlResult<Cow<'s, [u8]>> {
      unescape_with(&*self.value, resolve_entity).map_err(Error::EscapeError)
I think this would be more consistent with the rest of the codebase:
- pub fn unescaped_value_with<'s, 'entity>(
-     &'s self,
-     resolve_entity: impl Fn(&[u8]) -> Option<&'entity str>,
- ) -> XmlResult<Cow<'s, [u8]>> {
-     unescape_with(&*self.value, resolve_entity).map_err(Error::EscapeError)
+ #[cfg(not(feature = "encoding"))]
+ pub fn unescape_value_with<'s, 'entity>(
+     &'s self,
+     resolve_entity: impl Fn(&[u8]) -> Option<&'entity str>,
+ ) -> XmlResult<Cow<'s, [u8]>> {
+     Ok(unescape_with(&*self.value, resolve_entity)?)
The added feature gate prevents incorrect usage. All usages in tests should be replaced with the decode_* variants. Another way to solve this is to add a decoder to the Attribute / Event and keep only the variant without the reader parameter, but I think that may be a task for another PR, focused on encoding.
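As a rough illustration of what such a gate does, here is a small standalone sketch. `RawValue`, `as_utf8` and `decode_with` are hypothetical names chosen for this example and are not part of quick-xml; only the `#[cfg(not(feature = "encoding"))]` attribute is taken from the suggestion above.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for `Attribute`: it only holds the raw bytes.
pub struct RawValue<'a> {
    pub value: Cow<'a, [u8]>,
}

impl<'a> RawValue<'a> {
    /// Interprets the raw bytes as UTF-8 without consulting any decoder.
    /// Compiled only when the `encoding` feature is off, i.e. when the input
    /// is known to be UTF-8. With the feature on, the method does not exist,
    /// so calling it is a compile error rather than a silent mis-decode.
    #[cfg(not(feature = "encoding"))]
    pub fn as_utf8(&self) -> Result<&str, std::str::Utf8Error> {
        std::str::from_utf8(&self.value)
    }

    /// Always available: the caller has to supply the decoder explicitly.
    pub fn decode_with(&self, decode: impl Fn(&[u8]) -> String) -> String {
        decode(&self.value)
    }
}
```

The cost of this design is the one already noted in the thread: any caller that used the gated method (tests, the serde deserializer) stops compiling once the feature is enabled and has to move to the decoding variant.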
I would rather push all the decoding changes into a different PR. At the moment all of the parsing code is based on searches for single ASCII bytes, which means it is effectively a UTF-8 only library, even if the detection of encodings works properly (#322). So there's a lot of work needed to make the encoding feature actually useful.
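A small sketch of why single-byte searches break down on UTF-16 input. The helpers `to_utf16le` and `find_byte` are hypothetical and only mimic the style of a byte-oriented scanner; they are not taken from the quick-xml source.

```rust
/// Encodes `text` as UTF-16LE bytes using only the standard library.
/// (Hypothetical helper for this example.)
pub fn to_utf16le(text: &str) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(text.len() * 2);
    for unit in text.encode_utf16() {
        bytes.extend_from_slice(&unit.to_le_bytes());
    }
    bytes
}

/// Returns the offset of every occurrence of `needle` in `input`,
/// the way a parser that scans for single ASCII bytes would see them.
/// (Hypothetical helper for this example.)
pub fn find_byte(input: &[u8], needle: u8) -> Vec<usize> {
    input
        .iter()
        .enumerate()
        .filter(|&(_, &b)| b == needle)
        .map(|(i, _)| i)
        .collect()
}
```

For the document `"<a>\u{3C00}</a>"`, scanning the UTF-8 bytes for `b'<'` finds exactly the two tag starts, because multi-byte UTF-8 sequences never contain ASCII bytes. Scanning the UTF-16LE bytes finds three hits: the code unit for U+3C00 is stored as `00 3C`, so its second byte looks like `<`. One-byte ASCII-compatible encodings do not have this problem, which matches the point made in the following comments.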
Yes, I understand and support that. That is why I suggest just disabling the dangerous methods when the "encoding" feature is enabled.
> So there's a lot of work needed to make the encoding feature actually useful.
Actually, it is useful for one-byte encodings. Bear in mind that
- we use encoding_rs
- encoding_rs is not extensible by design
- as a result, the list of encodings is fixed
- the only supported encodings that are not XML compatible (because they are not ASCII compatible, which is a stricter restriction) are (generated by this snippet):
  - ISO-2022-JP
  - replacement (not a real encoding, actually)
  - UTF-16BE
  - UTF-16LE
So the encoding feature may actually be useful even in its current state.
The other suggestion is to make the naming even more consistent. Compare:
- unescaped
- decode_and_unescape
Because the only difference is the decode operation applied in the second function, it is logical that stripping the decode_and_ prefix should give the name of the method that works without decoding.
(Also, I think you could squash the last 3 commits.)
You are right about the other one-byte encodings, I just don't know how common they actually are, certainly compared to UTF-16. But probably what I should have said is that I don't really want to go back and forth over functions that it might make sense to remove entirely at some point in the future.
I will go ahead and make the changes anyway (tomorrow, since I really need to go to sleep now)
@Mingun I won't forget about this, but I am going to put this in a separate PR. I want to finish running some experiments first re: bulk decode vs. decoding individual elements, and the outcome of that might influence what we want the API to look like.
I totally agree that the encoding part should be in a separate PR, but can you polish the naming in this PR? Only one simple change:
- unescaped() -> unescape()
- unescaped_with() -> unescape_with()
- unescaped_value() -> unescape_value()
- unescaped_value_with() -> unescape_value_with()
As you have already changed some of those names, it is quite logical to finish that here, just to keep the history clean :). After that it can be merged.
OK :) but, please take a look at #158 (comment)
I'm going to draft this for now because I'm 100% convinced we should change the APIs to pure &str / String ones.
> I'm going to draft this for now because I'm 100% convinced we should change the APIs to pure &str / String ones.
I agree, I am already working in this direction on top of this PR, namely changing the order of the unescape and decode operations in the serde deserializer. I think I'm going to merge it now and continue evolving the API in other PRs. At the least, I want to make the new closure-based API accept &str instead of &[u8] before the release.
I moved all the thoughts about future encoding strategy to the appropriate issue #158, and did a bit of background research on performance.
It's redundant now that the macrobenchmarks and (un)escape benchmarks exist.
Instead of providing unescaping functions with an entity mapping via a data structure, provide a closure which maps the entity to its replacement text.
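For readers who want to see the shape of a closure-based unescape, here is a simplified standalone sketch. It mirrors the `resolve_entity: impl Fn(&[u8]) -> Option<&str>` parameter shown in the review hunks, but the body, the `UnescapeError` type and the `resolve` function are assumptions written for this example; this is not the implementation from the PR, and it does not handle numeric character references.

```rust
use std::borrow::Cow;

/// Error type for this sketch only (hypothetical, not quick-xml's `EscapeError`).
#[derive(Debug, PartialEq)]
pub enum UnescapeError {
    /// An `&` at this byte offset has no closing `;`.
    UnterminatedEntity(usize),
    /// The entity name in this byte range was not resolved by the closure.
    UnrecognizedEntity(usize, usize),
}

/// Replaces every `&name;` in `raw` with whatever `resolve_entity` returns
/// for `name`. Borrows the input if it contains no `&`, allocates otherwise.
pub fn unescape_with<'i, 'e>(
    raw: &'i [u8],
    resolve_entity: impl Fn(&[u8]) -> Option<&'e str>,
) -> Result<Cow<'i, [u8]>, UnescapeError> {
    let mut out: Option<Vec<u8>> = None;
    let mut pos = 0; // start of the part of `raw` not yet copied to `out`
    while let Some(offset) = raw[pos..].iter().position(|&b| b == b'&') {
        let amp = pos + offset;
        let semi = raw[amp..]
            .iter()
            .position(|&b| b == b';')
            .map(|i| amp + i)
            .ok_or(UnescapeError::UnterminatedEntity(amp))?;
        let replacement = resolve_entity(&raw[amp + 1..semi])
            .ok_or(UnescapeError::UnrecognizedEntity(amp + 1, semi))?;
        let buf = out.get_or_insert_with(|| Vec::with_capacity(raw.len()));
        buf.extend_from_slice(&raw[pos..amp]);
        buf.extend_from_slice(replacement.as_bytes());
        pos = semi + 1;
    }
    Ok(match out {
        Some(mut buf) => {
            buf.extend_from_slice(&raw[pos..]);
            Cow::Owned(buf)
        }
        None => Cow::Borrowed(raw),
    })
}

/// Example resolver: the five predefined XML entities plus one custom entity.
pub fn resolve(name: &[u8]) -> Option<&'static str> {
    match name {
        b"lt" => Some("<"),
        b"gt" => Some(">"),
        b"amp" => Some("&"),
        b"apos" => Some("'"),
        b"quot" => Some("\""),
        b"company" => Some("ACME"), // custom entity, made up for this example
        _ => None,
    }
}
```

The point of the closure is that the caller decides where entity definitions live (a `match`, a `HashMap`, a DTD parsed earlier) without the unescape function depending on any particular container type.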
  ///
- /// [`unescaped_value()`]: Self::unescaped_value
+ /// [`unescape_value()`]: Self::unescape_value
To be honest, I think it reads a little worse, even if it's more consistent. I don't care enough to argue any more (lol) but I'll just say it for what it's worth:
- .value / raw_value() -> ... (nothing)
- unescaped_value() -> unescape_value()
- normalized_value() -> normalize_value()
The former feels weird when compared against the latter, just a personal opinion. Ultimately it doesn't matter very much.
Also, the name conflict needs to be resolved: https://github.com/tafia/quick-xml/blob/master/src/events/mod.rs#L738-L757=
Just rename the private function to something; I'll remove it in my PR tonight.
Thanks for your hard work!
Re: #379 (comment)