Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for binary blobs inside xml tags #555

Closed
wants to merge 3 commits into from

Conversation

elrnv
Copy link
Contributor

@elrnv elrnv commented Feb 17, 2023

In summary, this PR combines binary support with text events, to avoid copious code duplication. Also:

  • Renamed text events to Content to avoid confusion, since these now represent raw byte slices.
  • Swapped String for Vec<u8> in a few places involving Text (not everywhere) to facilitate this functionality.
  • Fixed unblocked tests due to this change.

Some versions ago quick-xml had exposed a more low level API, which allowed reading/writing bytes directly between xml tags. This was very useful for working with XML based VTK files and I really wish to have this functionality again in quick-xml. This PR attempts to recreate this capability and is being tested in a new release for vtkio. Please let me know if there are any design changes needed here.

Copy link
Collaborator

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not complete review.

It would be much easier for review if non-functional changes (like renames) was in its own commit, so I can review it separately and then focus only on functional changes. In that case, however, I prefer to not rename Text to Content. The "text" here should be treated as defined in the XML specification -- the piece of input that does not form a markup.

@dralley
Copy link
Collaborator

dralley commented Feb 17, 2023

I'm still not sure how I feel about adding a bunch of non-standard XML extensions in general, but at minimum I think it would need to be under of a non-default-enabled feature flag.

I agree with @Mingun, that the current API needs to reflect standard XML definitions and procedures, so you should make a new type rather than changing Text.

However this may also be difficult to integrate with the planned work on encodings because it breaks the expectation that a file that declares its encoding as X can actually be decoded in its entirety as X. That makes me uncomfortable.

I'm also not sure this is reliable if the binary blobs can contain, for example, un-escaped < characters. How does this format handle that?

Copy link
Collaborator

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree that it is better to support non-standard extensions wherever possible, the way of doing that here will be incompatible with changes from #520.

Moreover, this PR contains two changes:

  • the first one to support raw binary data. I would be happy has this feature, but it require more design efforts. I prefer to focus on valid ways of storing binary data first (by specifying rules for converting binary data to strings, for example, in base64 encoding). Unfortunately, I have to reject PR in his current form. I'll leave it open if you plan to adapt it after landing Merge consequent text and CDATA events into one string  #520, but that will probably require a significant rework.
  • serializing structs (and not maps) for text serializer, that should produce only text content, and not structured content. This change is totally unrelated to the binary serialization and should be reverted

@@ -133,7 +133,7 @@ impl<'i, W: Write> Serializer for ContentSerializer<'i, W> {
type SerializeTupleStruct = Self;
type SerializeTupleVariant = ElementSerializer<'i, W>;
type SerializeMap = Impossible<Self::Ok, Self::Error>;
type SerializeStruct = Impossible<Self::Ok, Self::Error>;
type SerializeStruct = Struct<'i, W>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the wrong change. Content serializer represents data in a text, or, in your case, a binary, form. Structs and map a complex data objects with internal structure and cannot be serialized in such unstructured manner directly. If you want that, you should use #[serde(serialize_with)] attribute and convert your types to text / binary data manually

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, thanks!

@@ -189,7 +192,7 @@ mod without_root {
e
),
}
assert_eq!(buffer, "");
assert!(buffer.is_empty());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert_eq used for purpose because it is from pretty_assertions crate and will produce a diff in case of failure

Combined binary support with text events

Fix unblocked tests due to this change.
This commit creates an optional feature (disabled by default) to be able
to store binary data in xml text.
@elrnv
Copy link
Contributor Author

elrnv commented Mar 13, 2023

Thank you @dralley and @Mingun. Sorry for the late update. It looks like there was a PR that refactored DeEvent further, which invalidated the changes I made. I also hear the reservations for adding builtin handling for binary data in XML formats, which I don't disagree, it seems not straightforward to implement for various reasons, and should be tested and maintained.

So in light of these 2 things, I want to rework how I handle binary data in XML, and maybe instead, open up a custom way to handle text events within quick-xml without being so intrusive and putting the maintenance burden on quick-xml for parsing binary data. I will update this PR with the changes in the near future I hope, since it will serve the same purpose.

Let me know if this sounds reasonable.

@Mingun
Copy link
Collaborator

Mingun commented Mar 13, 2023

Let me know if this sounds reasonable.

Yes, I think so.

@dralley
Copy link
Collaborator

dralley commented Jul 10, 2023

I think that perhaps the least invasive way might be to just have a backdoor way to "read X bytes" from the reader directly, thus bypassing XML parsing entirely for the data section - rather than trying to make the text event itself handle it. I doubt that can be made to work with the existing Serde implementation though.

It does also still require a way to ensure that the reader does not try to bulk-decode / validate the data upfront as some particular encoding (say, utf-8), which would obviously fail.

Or to not use this raw encoding method at all and instead use base64 as suggested by the docs, if that is an option. This embedded binary option seems to be opt-in only.

There is one case in which the file is not a valid XML document. When the AppendedData section is not encoded as base64, raw binary data is present that may violate the XML specification. This is not default behavior, and must be explicitly enabled by the user.

Obviously it's part of the format so that's not ideal, but whether it's acceptable maybe depends on how common such files are.

@dralley
Copy link
Collaborator

dralley commented Jul 10, 2023

I'm going to close this PR. If you want to continue the work that would be fine, so long as it doesn't run afoul of the mentioned concerns, but please file an issue first so we can discuss any proposed design before expending a bunch of work on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants