Merge text and CDATA events in serde deserializer #474

Mingun · 2022-09-09T05:14:07Z

CDATA sections cannot contain the sequence ]]>. When that sequence appears in the data, the data should be split into two pieces, and each piece should be put in its own CDATA section:

]]>

becomes

<![CDATA[]]]>
<![CDATA[]>]]>

or

<![CDATA[]]]]>
<![CDATA[>]]>
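The second variant of the split can be sketched as a small writer-side helper. This is only an illustration of the rule described above: `to_cdata_sections` is a hypothetical name, not part of quick-xml's API.

```rust
/// Wraps `data` in one or more CDATA sections so that the sequence `]]>`
/// never appears inside a single section.
///
/// Hypothetical helper for illustration only.
fn to_cdata_sections(data: &str) -> String {
    let mut out = String::new();
    let mut rest = data;
    while let Some(pos) = rest.find("]]>") {
        // Keep `]]` in the current section...
        out.push_str("<![CDATA[");
        out.push_str(&rest[..pos + 2]);
        out.push_str("]]>");
        // ...and let `>` start the next one.
        rest = &rest[pos + 2..];
    }
    // Whatever remains contains no `]]>` and fits in one section.
    out.push_str("<![CDATA[");
    out.push_str(rest);
    out.push_str("]]>");
    out
}
```

For the input `]]>` this yields the two adjacent sections `<![CDATA[]]]]>` and `<![CDATA[>]]>`, so a reader has to concatenate them to recover the original string.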

Currently the serde deserializer processes only one CDATA event at a time, which means that deserializing

<root>
  <string><![CDATA[]]]]><![CDATA[>]]></string>
</root>

into

struct AnyName {
  string: String,
}

would fail or wrongly return ]] instead of ]]>.

To fix that we should merge CDATA events, but there are some ambiguities that should be investigated:

should we merge CDATA and text events:
```
<![CDATA[one]]>two
```
should return onetwo?

Judging from that and that, we should do that
should we ignore comments between CDATA events? Between CDATA and text events?
```
<![CDATA[one]]><!--comment--><![CDATA[two]]>
```
should return onetwo?
```
<![CDATA[one]]><!--comment-->two
```
should return onetwo?
Currently all comments are skipped at a very early stage, so the deserializer sees
```
<![CDATA[one]]><!--comment--><![CDATA[two]]>
```
as
```
<![CDATA[one]]><![CDATA[two]]>
```
should we ignore processing instructions between CDATA events? Between CDATA and text events?
```
<![CDATA[one]]><?pi?><![CDATA[two]]>
```
should return onetwo?
```
<![CDATA[one]]><?pi?>two
```
should return onetwo?
Currently all processing instructions are skipped at a very early stage, so the deserializer sees
```
<![CDATA[one]]><?pi?><![CDATA[two]]>
```
as
```
<![CDATA[one]]><![CDATA[two]]>
```
should we ignore whitespace between CDATA sections? Between CDATA and text?
```
<![CDATA[one]]>
<![CDATA[two]]>
```
should return onetwo?
```
<![CDATA[one]]>
two
```
should return onetwo?
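One possible answer to the questions above can be sketched as follows: merge consecutive text and CDATA events, skip comments and processing instructions, and keep all whitespace. The `Event` enum and `merge_text_and_cdata` are hypothetical names used only for this illustration; they do not mirror quick-xml's actual types, and the whitespace behaviour is an assumption, not a decision.

```rust
/// Simplified stand-in for a stream of XML events (hypothetical type).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Start(String),
    End(String),
    Text(String),
    CData(String),
    Comment(String),
    PI(String),
}

/// Merges each run of text and CDATA events into a single `Text` event.
/// Comments and processing instructions are dropped and do not break a run.
fn merge_text_and_cdata(events: Vec<Event>) -> Vec<Event> {
    let mut out = Vec::new();
    // Accumulates the current run of text/CDATA content, if any.
    let mut buf: Option<String> = None;
    for event in events {
        match event {
            Event::Text(s) | Event::CData(s) => {
                buf.get_or_insert_with(String::new).push_str(&s);
            }
            // Skipped: they must not change the content of the document.
            Event::Comment(_) | Event::PI(_) => {}
            // A tag ends the current run.
            other => {
                if let Some(text) = buf.take() {
                    out.push(Event::Text(text));
                }
                out.push(other);
            }
        }
    }
    // Flush a run that reaches the end of the input.
    if let Some(text) = buf.take() {
        out.push(Event::Text(text));
    }
    out
}
```

With this rule, the two CDATA events from the `<string>` example collapse into one `Text("]]>")` event, which is what a `String` field would then receive.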


Mingun · 2022-09-12T05:41:18Z

I ran some experiments with XmlBeans 5.0.0, a popular Java library for working with XML.

I used the following XSD:

<xs:schema xmlns:this="types.xsd"
           xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="types.xsd"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified"
>
  <xs:element name="Str" type="xs:string"/>
</xs:schema>

It skips comments and processing instructions and merges text and CDATA sections, as suggested in the issue description. All whitespace is significant (the namespace definition xmlns="types.xsd" is omitted for brevity in most examples):

| XML | Result of `doc.getStr()` |
|---|---|
| `<Str xmlns="types.xsd"> text </Str>` | `text` |
| `<Str><![CDATA[cdata]]></Str>` | `cdata` |
| `<Str><![CDATA[cdata]]> with text</Str>` | `cdata with text` |
| `<Str><![CDATA[cdata]]]]><![CDATA[> with cdata]]></Str>` | `cdata]]> with cdata` |
| `<Str><![CDATA[cdata]]]]><!--comment--><![CDATA[> with comment between cdata]]></Str>` | `cdata]]> with comment between cdata` |
| `<Str><![CDATA[cdata]]]]><?pi?><![CDATA[> with PI between cdata]]></Str>` | `cdata]]> with PI between cdata` |
| `<Str><![CDATA[cdata ]]><!--comment-->with comment between text</Str>` | `cdata with comment between text` |
| `<Str><![CDATA[cdata ]]><?pi?>with PI between text</Str>` | `cdata with PI between text` |

Mingun · 2022-11-05T19:59:06Z

Unfortunately, this is not an easy task because of the trim feature, which is activated for the serde deserializer. It means that spaces between a CDATA section and text will be trimmed, and that is not easy to fix: to do it correctly, we need to look ahead to an unbounded depth to handle situations such as:

text
<!--comment 1-->
<!--comment 2-->
...
<!--comment N--><![CDATA[cdata section]]>

We should not trim whitespace between text and CDATA, but we should trim it between text and a tag.

Because comments should not change the content of the document, that document is equivalent to:

text


...
<![CDATA[cdata section]]>

("text" + N newlines + "cdata section"). Probably solving #460 first will make that easier to implement.
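The lookahead problem can be sketched like this: whether the trailing whitespace of a text node may be trimmed depends on the next node that is not a comment, which may be arbitrarily far away. The `Node` enum and `may_trim_end` are hypothetical names for illustration, not quick-xml's types, and treating the end of input as trimmable is an assumption.

```rust
/// Simplified stand-in for parsed XML nodes (hypothetical type).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Text(String),
    CData(String),
    Comment,
    Tag,
}

/// Returns `true` if the trailing whitespace of the text node at `idx`
/// may be trimmed, i.e. the next non-comment node is a tag.
///
/// Panics if `idx` is out of bounds.
fn may_trim_end(nodes: &[Node], idx: usize) -> bool {
    // Any number of comments may sit between the text and the node
    // that decides the answer, so the scan is unbounded.
    for node in &nodes[idx + 1..] {
        match node {
            Node::Comment => continue,
            // Text or CDATA follows: the whitespace is part of the value.
            Node::Text(_) | Node::CData(_) => return false,
            // A tag follows: the whitespace can be trimmed.
            Node::Tag => return true,
        }
    }
    // End of input: assumed trimmable in this sketch.
    true
}
```

For the document above (text, N comments, then a CDATA section) the function returns `false` only after scanning past all N comments, which is why a fixed one-event lookahead is not enough.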

Mingun added the bug, help wanted, and serde (Issues related to mapping from Rust types to XML) labels Sep 9, 2022

This was referenced Nov 9, 2022

simple example with sub-tree reading in separate function #504

Merged

Add additional tests for borrowed input #509

Merged

Mingun added a commit to Mingun/quick-xml that referenced this issue Dec 4, 2022

Fix tafia#474: merge consequent text and CDATA events into one string

519d271

Mingun mentioned this issue Dec 4, 2022

Merge consequent text and CDATA events into one string #520

Merged

Mingun closed this as completed in #520 Feb 20, 2023
