CDATA should not be treated as markup #82

pacman82 · 2017-08-08T13:07:22Z

How would you feel about making quick-xml not omitting CDATA Events? Instead I would suggest handling CDATA as part of decode(). After all writing: <![CDATA[<example>]]> is just another way of writing <example>. To make matters worse the to approaches may be mixed <exa<![CDATA[mple>]]>. With the current API it is really hard to get this right.

The text was updated successfully, but these errors were encountered:

tafia · 2017-08-18T08:11:24Z

Sorry for the late answer!

I am not very fond of the idea:

As per the spec, we cannot guarantee what's inside CDATA. It may or may not be a markup. It can be literally anything.
From this page: https://developer.mozilla.org/en-US/docs/Web/API/CDATASection, CDATA is considered obsolete. As a result I am not willing to try to adapt the library to all these uses cases, I'd prefer having a separate crate for this.

I'm happy to be convinced otherwise but I'm not sure this is the right thing to do.

pacman82 · 2017-08-21T08:55:20Z

Nothing to apologise for. We all have stuff to do.

Yeah, you are wise to not be fond of the idea, because I stated it wrong. There was some confusion in my mind as to what 'markup' actually means. Let me restate my suggestion in the terminology of quick_xml. I think the CData should not be a variant of Event. The base for my reasoning can be found in the very first line of the spec:

CDATA sections may occur anywhere character data may occur;

In terms of quick_xml API, I read this as: CDATA sections only occur within the sections BytesText points to (am I wrong here?). As such I would not encounter a CData Event if parsing a valid XML Document.

I have not been aware that CDATA is deprecated. I can use the link you send me to argue that my parser doesn need to support it either. So, thanks for that. I am gladly dropping the requirement that CDATA should be supported through decode.

How do you think about dropping the CData variant now?

linkmauve · 2017-08-22T13:07:25Z

This is only about HTML and DOM, while quick-xml targets XML. Many things will break if you don’t handle CDATA correctly, on the web or elsewhere.

pacman82 · 2017-08-29T10:24:49Z

Just to clear up a possible misunderstanding:
My suggestion is not about not supporting CData, but supporting it as part of decode instead as a variant of Event.

tafia · 2017-08-29T11:21:59Z

In this case I think we could even have an external crate to parse CData, (so other xml library may use it) and it'll be an opt-in feature.

pacman82 · 2017-08-29T12:33:37Z

@tafia So, I would just call a decode method implemented by the external crate instead of the one provided by quick_xml?

tafia · 2017-08-29T13:36:16Z

Yes for the time being at least I think it is better to iterate this way. Is this a problem? I am not clear yet how it should look exactly on your API side.

Please let this issue opened so if anyone wants to help or is looking for the same feature they will have some info.

fatihpense · 2018-02-24T22:16:30Z

I think providing content of CDATA as it is, is a correct behavior for an XML parser. XML parser decodes XML content while reading. The data in CDATA is already in the raw format, so no need for decoding.

So <![CDATA[<example>]]> should emit <example> ,
<example> in regular character data should also emit <example>

<exa<![CDATA[mple>]]> should emit <example> better with a way to differentiate CDATA and character data. The library provides a way to differentiate between them. You should decode character data part and concatenate the two.

I think changing <example> to <example> is actually encoding XML So it should be done only while writing XML.

By the way, I'm new to this library. I have only read some source code and tests to reason about it. Sorry if I misunderstood something.

pacman82 · 2018-02-26T17:24:56Z

@fatihpense What is the use case for differentiating between CDATA and 'normal' encoding? I am always interested in the final decoded text. I think it is a bit of a pain, what I have to handle an arbitrary amount of CDATA events anytime I expect character data. Yet I see that my suggestion would ruin any use case which would require differentiating between the two encoding styles.

fatihpense · 2018-02-27T08:20:40Z

@pacman82 I work on enterprise integrations. We do a lot of XML mapping and people can put meaning&business logic in every edge of the spec. Rarely we even deal with invalid XML. So I think it is better to give more power to the library user when it is possible.

In Java SAX, this situation is solved by two interfaces. ContentHandler, which gives you characters as <example> for the input <exa<![CDATA[mple>]]>. If you need to know when CDATA starts and ends there is optional LexicalHandler interface which adds notifier methods: startCDATA() and endCDATA()

Edit: ContentHandler doesn't guarantee giving character data in one function call even there is no CDATA(I think that is because of performance). So in your "push parsing" SAX code you get chardata part, startCDATA, chardata part, endCDATA.

I understand you need convenience over configuration for most of the tasks. I think your needs are better served as another library or new easy-mode methods in the same library.

pacman82 · 2018-02-27T19:36:08Z

@fatihpense Thank you for explaining your use case. While I do think that it is an invalid use of XML to give semantic meaning to the way character data in XML is masked I can not deny that such things may happen.

tafia closed this as completed Feb 1, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CDATA should not be treated as markup #82

CDATA should not be treated as markup #82

pacman82 commented Aug 8, 2017

tafia commented Aug 18, 2017

pacman82 commented Aug 21, 2017 •

edited

linkmauve commented Aug 22, 2017

pacman82 commented Aug 29, 2017

tafia commented Aug 29, 2017

pacman82 commented Aug 29, 2017

tafia commented Aug 29, 2017

fatihpense commented Feb 24, 2018

pacman82 commented Feb 26, 2018

fatihpense commented Feb 27, 2018 •

edited

pacman82 commented Feb 27, 2018

CDATA should not be treated as markup #82

CDATA should not be treated as markup #82

Comments

pacman82 commented Aug 8, 2017

tafia commented Aug 18, 2017

pacman82 commented Aug 21, 2017 • edited

linkmauve commented Aug 22, 2017

pacman82 commented Aug 29, 2017

tafia commented Aug 29, 2017

pacman82 commented Aug 29, 2017

tafia commented Aug 29, 2017

fatihpense commented Feb 24, 2018

pacman82 commented Feb 26, 2018

fatihpense commented Feb 27, 2018 • edited

pacman82 commented Feb 27, 2018

pacman82 commented Aug 21, 2017 •

edited

fatihpense commented Feb 27, 2018 •

edited