Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge text and CDATA events in serde deserializer #474

Closed
Mingun opened this issue Sep 9, 2022 · 2 comments · Fixed by #520
Closed

Merge text and CDATA events in serde deserializer #474

Mingun opened this issue Sep 9, 2022 · 2 comments · Fixed by #520
Labels
bug help wanted serde Issues related to mapping from Rust types to XML

Comments

@Mingun
Copy link
Collaborator

Mingun commented Sep 9, 2022

CDATA elements cannot contain sequence ]]>. When that sequence is appeared in the data, it should be split into two pieces and each piece should be put in their own CDATA container:

]]>

become

<![CDATA[]]]>
<![CDATA[]>]]>

or

<![CDATA[]]]]>
<![CDATA[>]]>

Currently in serde deserializer only one CDATA event processed at time, that means, that deserialization

<root>
  <string><![CDATA[]]]]><![CDATA[>]]></string>
</root>

into

struct AnyName {
  string: String,
}

would fail or wrongly return ]] instead of ]]>.

To fix that we should merge CDATA events, that there are some ambiguities that should be investigated:

  1. should we merge CDATA and text events:
    <![CDATA[one]]>two
    should return onetwo?

    Judging from that and that, we should do that

  2. should we ignore comments between CDATA events? Between CDATA and text events?
    <![CDATA[one]]><!--comment--><![CDATA[two]]>
    should return onetwo?
    <![CDATA[one]]><!--comment-->two
    should return onetwo?
    Currently all comments are skips at very early stage and deserializer sees
    <![CDATA[one]]><!--comment--><![CDATA[two]]>
    as
    <![CDATA[one]]><![CDATA[two]]>
  3. should we ignore processing instructions between CDATA events? Between CDATA and text events?
    <![CDATA[one]]><?pi?><![CDATA[two]]>
    should return onetwo?
    <![CDATA[one]]><?pi?>two
    should return onetwo?
    Currently all processing instructions are skips at very early stage and deserializer sees
    <![CDATA[one]]><?pi?><![CDATA[two]]>
    as
    <![CDATA[one]]><![CDATA[two]]>
  4. should we ignore whitespaces between CDATAs? Between CDATA and text?
    <![CDATA[one]]>
    <![CDATA[two]]>
    should return onetwo?
    <![CDATA[one]]>
    two
    should return onetwo?
@Mingun Mingun added bug help wanted serde Issues related to mapping from Rust types to XML labels Sep 9, 2022
@Mingun
Copy link
Collaborator Author

Mingun commented Sep 12, 2022

I made some experiments with XmlBeans 5.0.0 -- a popular Java library to work with XML.

Use the following XSD:

<xs:schema xmlns:this="types.xsd"
           xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="types.xsd"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified"
>
  <xs:element name="Str" type="xs:string"/>
</xs:schema>

It skips comments and processing instructions and merge texts and CDATA sections, as suggested in the issue description. All white spaces are significant (namespace definition xmlns="types.xsd" is omitted for brevity in most examples):

XML

Result of doc.getStr()

<Str xmlns="types.xsd">
  text
</Str>

  text

<Str><![CDATA[cdata]]></Str>
cdata
<Str><![CDATA[cdata]]> with text</Str>
cdata with text
<Str><![CDATA[cdata]]]]><![CDATA[> with cdata]]></Str>
cdata]]> with cdata
<Str><![CDATA[cdata]]]]><!--comment--><![CDATA[> with comment between cdata]]></Str>
cdata]]> with comment between cdata
<Str><![CDATA[cdata]]]]><?pi?><![CDATA[> with PI between cdata]]></Str>
cdata]]> with PI between cdata
<Str><![CDATA[cdata ]]><!--comment-->with comment between text</Str>
cdata with comment between text
<Str><![CDATA[cdata ]]><?pi?>with PI between text</Str>
cdata with PI between text

@Mingun
Copy link
Collaborator Author

Mingun commented Nov 5, 2022

Unfortunately, this is not the easy task, because of trim feature, that is activated for serde deserializer. That means, that spaces between CDATA section and text will be trimmed, and it is not easely to fix that, because to do that correctly, we need to lookahead at infinity depth to solve such situations:

text
<!--comment 1-->
<!--comment 2-->
...
<!--comment N--><![CDATA[cdata section]]>

We should not strip between text and CDATA, but should trim between text and tag.

Because comments should not change the content of document, that document is equivalent to:

text


...
<![CDATA[cdata section]]>

("text" + N newlines + "cdata section"). Probably solving #460 first will make that easier to implement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug help wanted serde Issues related to mapping from Rust types to XML
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant