Why remove CDataSection? #142

Closed
Harbs opened this issue Jan 5, 2016 · 15 comments

@Harbs

Harbs commented Jan 5, 2016

I'm in the process of implementing an XML processing library and I see that CDATASection and Node.CDATA_SECTION_NODE are being removed. What is the rationale for that decision?

I do not see any other way of determining that an XML node is CDATA. I need to know whether a node is plain text or CDATA to handle it correctly.

@annevk
Member

annevk commented Jan 5, 2016

It might make its way back because it was too hard to remove from browsers, but why would you need to distinguish between these at the data level?

@Harbs
Author

Harbs commented Jan 5, 2016

Huh? How else is one to preserve the intent of the XML content?

Basically, I'm implementing E4X support for cross-compiled ActionScript. (The compiler maps E4X features that are not compatible with standard JavaScript to standard function calls.) Plain text and CDATA content are handled differently in various places internally. I can give more details if you need them, but I'm just trying to understand what the rationale for removing it was.

Why get rid of potentially useful information?

@annevk
Member

annevk commented Jan 5, 2016

CDATA sections are just a "convenient" way to write text in XML, treating them as a distinct type is a bug.

@Harbs
Author

Harbs commented Jan 5, 2016

Not exactly. CDATA text is not encoded while plain text is. Whether it's an entirely distinct type or there's some meta-data describing the content, maintaining the distinction is important.

There are applications where CDATA in XML files is NOT the same as escaped string content. Round-tripping XML files in those cases would produce incorrect output.

I do agree that it's not a fully distinct type, but simply losing the information would be very bad for XML processing.
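
To make the disagreement concrete, here is a minimal sketch using Python's standard-library `xml.dom.minidom` as a stand-in DOM implementation. That choice is an assumption for illustration only; browsers may behave differently, and the sample script text is made up. It shows that an escaped text node and a CDATA section carry the same character data, while the serialized output differs only because the node type is kept.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# The same script text, written once with entity escapes and once as CDATA.
escaped = parseString("<s>if (a &lt; b &amp;&amp; c) {}</s>")
cdata = parseString("<s><![CDATA[if (a < b && c) {}]]></s>")

escaped_node = escaped.documentElement.firstChild
cdata_node = cdata.documentElement.firstChild

# At the data level the two nodes are indistinguishable...
same_data = escaped_node.data == cdata_node.data

# ...but minidom keeps the node type, so serialization round-trips the syntax.
escaped_out = escaped.documentElement.toxml()
cdata_out = cdata.documentElement.toxml()

print(same_data)
print(escaped_node.nodeType == Node.TEXT_NODE, escaped_out)
print(cdata_node.nodeType == Node.CDATA_SECTION_NODE, cdata_out)
```

If the CDATA section were exposed as a plain text node, the second document would serialize in the escaped form, which is the loss of legibility described above.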

@domenic
Member

domenic commented Jan 5, 2016

Maybe the misconception here is that you think the DOM is supposed to support round-tripping? It in general does not.

@Harbs
Author

Harbs commented Jan 5, 2016

Why is this different from the handling of whitespace? XML parsing in the DOM does some very funky things in an attempt to preserve whitespace. https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Whitespace_in_the_DOM

I don't see why CDATA should be any less important.

@annevk
Member

annevk commented Jan 5, 2016

Because CDATA is syntax, not content. See e.g., the definition of canonical XML. Anyway, it doesn't really matter, we most likely have to reintroduce CDATA nodes due to web content.

@Harbs
Author

Harbs commented Jan 5, 2016

IIUC, some of the white-space handling can be considered "syntax" as well. "Syntax" becomes "content" when the content of the syntax is important.

Either way, if current DOM3 behavior is future-proof, I'm happy leaving this discussion alone.

@ArkadiuszMichalski
Contributor

@Harbs There was an idea to expose a boolean attribute, Text.serializeAsCDATA, so that after removing CDATASection (as an unnecessary node type) you could still get the functionality of CDATASection through a Text node (see the old P&S spec: https://rawgit.com/whatwg/domparsing/edc795ccfdc03e396197bf81a0f550105930e90b/source.html). But it ended up being just an idea.
You can find more information in this still-open bug (and in the other bugs cited within it): https://www.w3.org/Bugs/Public/show_bug.cgi?id=27386

@travisleithead
Member

If we bring back CDATASection in DOM, then I'm happy to re-add that API to produce a serialization that includes CDATA.

@Harbs
Author

Harbs commented Jan 6, 2016

Thanks for that link. Reading through that bug, it looks like making it a Text node would have the problem of CDATA text being merged with adjacent non-CDATA text nodes during normalization. I don't see how serializeAsCDATA would help with that.

Based on what I've read here, I'm working under the assumption that the following will be future-proof:

  1. nodeType == 4 (CDATA_SECTION_NODE) is a valid check for CDATA
  2. the [CDATA node].data is exactly the contents contained within the original CDATA tags
  3. CDATA will not be merged with adjacent text nodes

If any of these assumptions are wrong, please let me know. Even if changing CDATA does not break things now, the chances of things breaking after the next release of FlexJS will be much higher. I believe that E4X will be a very popular feature of FlexJS.
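
The three assumptions can be sketched as a small check. This uses Python's standard-library `xml.dom.minidom` purely as an illustrative DOM implementation (an assumption; it says nothing about what browsers or the DOM Standard will guarantee). The sample document and the `describe` helper are hypothetical.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Text, then a CDATA section, then text again.
doc = parseString("<r>a &lt; b<![CDATA[c < d]]>e</r>")
root = doc.documentElement

# Add a second, adjacent Text node so normalize() has something to merge.
root.appendChild(doc.createTextNode("f"))


def describe(parent):
    """Return (nodeType, data) for each child of `parent` (hypothetical helper)."""
    return [(child.nodeType, child.data) for child in parent.childNodes]


before = describe(root)
root.normalize()
after = describe(root)

# Assumption 1: nodeType == 4 identifies the CDATA section.
cdata_nodes = [c for c in root.childNodes if c.nodeType == Node.CDATA_SECTION_NODE]

print(before)
print(after)
print(cdata_nodes[0].data)
```

In this implementation the adjacent Text nodes "e" and "f" are merged by `normalize()`, while the CDATA section stays separate and its `data` is exactly the text between the CDATA delimiters.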

@annevk
Member

annevk commented Jan 6, 2016

It's still not clear to me why you think it's semantically distinct from text. I haven't checked with others yet if they also consider it a lost cause to remove it.

@Harbs
Author

Harbs commented Jan 6, 2016

I was looking for a good practical example, and this is the only one I came up with at the moment, but I do think it's valid.

Take a look at this (Flash) web app here: http://www.radii8.com/demo/ It's basically an online design tool for building apps. One of the formats that it supports is MXML. Script blocks within MXML have to be contained within CDATA blocks. I have never tried to see whether script blocks as XML encoded text (as opposed to CDATA blocks) will compile, but trying to edit the file with all the code converted into gobbledygook would be a nightmare. So, even though from an XML structure perspective CDATA is syntax, in this case, the "syntax" clearly becomes "content".

One of our goals with FlexJS is to minimize the pain of porting Flash apps to HTML apps. Flash apps in general tend to do a lot more XML processing than HTML ones (probably because of the native E4X support). Hence preserving XML processing native to E4X is an important goal. I have no data on how prevalent CDATA blocks are in XML files, but if preserving legibility of the files is important (and I think it is), preserving CDATA blocks is important.

I hope this is helpful in explaining where I'm coming from...

@Harbs
Author

Harbs commented Jan 6, 2016

FWIW, another data point:
It seems like E4X treats CDATA as content. The CDATA node when parsed in E4X is considered a normal text node, but it includes the CDATA tags in the text. The E4X spec is a bit vague on this point, but I do think that's the intention and that's definitely the implementation as I've observed it in ActionScript. I tried to test what the implementation was in SpiderMonkey (which is still in use in Adobe's ExtendScript), but I was having some problems getting my tests to work.

@annevk
Member

annevk commented Mar 14, 2016

Duping this against https://www.w3.org/Bugs/Public/show_bug.cgi?id=27386.

annevk closed this as completed Mar 14, 2016