Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CDATA should not be treated as markup #82

Closed
pacman82 opened this issue Aug 8, 2017 · 11 comments
Closed

CDATA should not be treated as markup #82

pacman82 opened this issue Aug 8, 2017 · 11 comments

Comments

@pacman82
Copy link
Contributor

pacman82 commented Aug 8, 2017

How would you feel about making quick-xml not omitting CDATA Events? Instead I would suggest handling CDATA as part of decode(). After all writing: <![CDATA[<example>]]> is just another way of writing &lt;example&gt;. To make matters worse the to approaches may be mixed &lt;exa<![CDATA[mple>]]>. With the current API it is really hard to get this right.

@tafia
Copy link
Owner

tafia commented Aug 18, 2017

Sorry for the late answer!

I am not very fond of the idea:

  • As per the spec, we cannot guarantee what's inside CDATA. It may or may not be a markup. It can be literally anything.
  • From this page: https://developer.mozilla.org/en-US/docs/Web/API/CDATASection, CDATA is considered obsolete. As a result I am not willing to try to adapt the library to all these uses cases, I'd prefer having a separate crate for this.

I'm happy to be convinced otherwise but I'm not sure this is the right thing to do.

@pacman82
Copy link
Contributor Author

pacman82 commented Aug 21, 2017

Nothing to apologise for. We all have stuff to do.

Yeah, you are wise to not be fond of the idea, because I stated it wrong. There was some confusion in my mind as to what 'markup' actually means. Let me restate my suggestion in the terminology of quick_xml. I think the CData should not be a variant of Event. The base for my reasoning can be found in the very first line of the spec:

CDATA sections may occur anywhere character data may occur;

In terms of quick_xml API, I read this as: CDATA sections only occur within the sections BytesText points to (am I wrong here?). As such I would not encounter a CData Event if parsing a valid XML Document.

I have not been aware that CDATA is deprecated. I can use the link you send me to argue that my parser doesn need to support it either. So, thanks for that. I am gladly dropping the requirement that CDATA should be supported through decode.

How do you think about dropping the CData variant now?

@linkmauve
Copy link

This is only about HTML and DOM, while quick-xml targets XML. Many things will break if you don’t handle CDATA correctly, on the web or elsewhere.

@pacman82
Copy link
Contributor Author

Just to clear up a possible misunderstanding:
My suggestion is not about not supporting CData, but supporting it as part of decode instead as a variant of Event.

@tafia
Copy link
Owner

tafia commented Aug 29, 2017

In this case I think we could even have an external crate to parse CData, (so other xml library may use it) and it'll be an opt-in feature.

@pacman82
Copy link
Contributor Author

@tafia So, I would just call a decode method implemented by the external crate instead of the one provided by quick_xml?

@tafia
Copy link
Owner

tafia commented Aug 29, 2017

Yes for the time being at least I think it is better to iterate this way. Is this a problem? I am not clear yet how it should look exactly on your API side.

Please let this issue opened so if anyone wants to help or is looking for the same feature they will have some info.

@fatihpense
Copy link

I think providing content of CDATA as it is, is a correct behavior for an XML parser. XML parser decodes XML content while reading. The data in CDATA is already in the raw format, so no need for decoding.

So <![CDATA[<example>]]> should emit <example> ,
&lt;example&gt; in regular character data should also emit <example>

&lt;exa<![CDATA[mple>]]> should emit <example> better with a way to differentiate CDATA and character data. The library provides a way to differentiate between them. You should decode character data part and concatenate the two.

I think changing <example> to &lt;example&gt; is actually encoding XML So it should be done only while writing XML.

By the way, I'm new to this library. I have only read some source code and tests to reason about it. Sorry if I misunderstood something.

@pacman82
Copy link
Contributor Author

@fatihpense What is the use case for differentiating between CDATA and 'normal' encoding? I am always interested in the final decoded text. I think it is a bit of a pain, what I have to handle an arbitrary amount of CDATA events anytime I expect character data. Yet I see that my suggestion would ruin any use case which would require differentiating between the two encoding styles.

@fatihpense
Copy link

fatihpense commented Feb 27, 2018

@pacman82 I work on enterprise integrations. We do a lot of XML mapping and people can put meaning&business logic in every edge of the spec. Rarely we even deal with invalid XML. So I think it is better to give more power to the library user when it is possible.

In Java SAX, this situation is solved by two interfaces. ContentHandler, which gives you characters as <example> for the input &lt;exa<![CDATA[mple>]]>. If you need to know when CDATA starts and ends there is optional LexicalHandler interface which adds notifier methods: startCDATA() and endCDATA()

Edit: ContentHandler doesn't guarantee giving character data in one function call even there is no CDATA(I think that is because of performance). So in your "push parsing" SAX code you get chardata part, startCDATA, chardata part, endCDATA.

I understand you need convenience over configuration for most of the tasks. I think your needs are better served as another library or new easy-mode methods in the same library.

@pacman82
Copy link
Contributor Author

@fatihpense Thank you for explaining your use case. While I do think that it is an invalid use of XML to give semantic meaning to the way character data in XML is masked I can not deny that such things may happen.

@tafia tafia closed this as completed Feb 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants