Scala XML Conformance

abcoates edited this page Sep 13, 2010 · 1 revision

There are a couple of open questions about the existing XML support in Scala:

  • is the XML parsing fully conformant with W3C XML specifications?
  • does the Scala parser for XML files/strings produce exactly the same XML object trees as the parser which is used for inline XML values in code?

XML conformance is not easy to get correct. The following note comes from the website for the XOM XML API (Java), which is arguably the most technically correct of the XML object APIs for Java:

XOM is not complete unto itself. It depends on an underlying SAX parser to read documents and feed the data into a tree structure. While theoretically any SAX2 compliant parser should work, Xerces 2.6.1 and later is the only one that I am fairly confident does work. Xerces 2.8.0 is included with the full distribution. This product includes software developed by the Apache Software Foundation. Piccolo 1.0.3, Crimson, GNU JAXP 1.0b1, the Oracle XML Parser for Java, and Xerces versions prior to 2.6.1 all have bugs that prevent them from doing what XOM needs them to do. (Note to XML parser vendors: XOM’s test suite gives parsers a very thorough workout, and delves into many of the more obscure parts of the XML spec that many parsers get wrong. You could do a lot worse for testing than making sure all the XOM unit tests pass when using your parser.)

XOM’s test results suggest that, for Java, you need to choose the right parser in order to get the best conformance. The available parsers are not all equal. Given how many established Java parsers fall short, it is likely that Scala’s built-in parser is also less conformant than Xerces.

Note that when I refer to Xerces, I am referring to genuine Apache Xerces, not the modified Xerces that is packaged with the JDK. The version in the JDK has been altered by Sun, and is known to be less conformant than the original Apache version.
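One way to tell the two apart at runtime is to look at the class that JAXP resolves. The sketch below assumes the usual package names: the JDK relocates its bundled Xerces fork under `com.sun.org.apache.xerces.internal`, while Apache Xerces uses `org.apache.xerces`. The object and method names (`ParserIdentity`, `classify`, `resolvedFactory`) are made up for illustration.

```scala
import javax.xml.parsers.SAXParserFactory

object ParserIdentity {
  // Assumed package prefixes: the JDK ships its Xerces fork under com.sun.*,
  // whereas the Apache distribution keeps the org.apache.xerces package.
  private val JdkPrefix = "com.sun.org.apache.xerces.internal."
  private val ApachePrefix = "org.apache.xerces."

  // Classify a factory class name by its package prefix.
  def classify(className: String): String =
    if (className.startsWith(JdkPrefix)) "JDK-bundled Xerces fork"
    else if (className.startsWith(ApachePrefix)) "Apache Xerces"
    else "other parser"

  // Class name of the SAX parser factory that JAXP resolves right now.
  def resolvedFactory(): String =
    SAXParserFactory.newInstance().getClass.getName

  def main(args: Array[String]): Unit = {
    val name = resolvedFactory()
    println(s"$name (${classify(name)})")
  }
}
```

With the Apache Xerces jar on the classpath, JAXP can be pointed at it by setting the system property `javax.xml.parsers.SAXParserFactory` to `org.apache.xerces.jaxp.SAXParserFactoryImpl`. Without it, a stock JDK will normally report its own bundled implementation.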

An issue for Scala is that it needs to support not only “vanilla” XML but also Scala’s extension that allows values to be inserted into XML expressions using braces, e.g.

val docId = ...
val docText = ...
val result = <doc id={docId}>{docText}</doc>

A possible approach for Scala would be as follows:

  • use JAXP to allow users to select the Java parser of their choice for files/strings, and ship Scala with a genuine Apache version of Xerces
  • create a JAXP equivalent for .NET, and ship Scala for .NET with a genuine Apache version of Xerces that has been transcoded to .NET CLR code using IKVM.NET (as used by Mike Kay to generate the .NET version of Saxon)
  • test the Scala parser for inline XML values against Xerces for XML conformance using a large set of test cases, e.g. the W3C XML Conformance Test Suite

This approach should allow a reasonable balance between conformance, compatibility across Java and .NET, and continued support for Scala expressions inside inline XML values in Scala code.
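The comparison step could be sketched roughly as follows. This is a minimal illustration, not an existing Scala facility: `ConformanceDiff`, `jaxpAccepts` and `disagreements` are hypothetical names, the `candidate` function stands in for whatever check the Scala parser under test would provide, and the reference is whichever SAX parser JAXP resolves (ideally genuine Apache Xerces). It only checks well-formedness, which is the simplest part of what a conformance suite covers.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{InputSource, SAXException}
import org.xml.sax.helpers.DefaultHandler

object ConformanceDiff {
  // Reference check: does the JAXP-resolved SAX parser accept the input
  // as well-formed? DefaultHandler rethrows fatal errors as exceptions.
  def jaxpAccepts(xml: String): Boolean =
    try {
      val parser = SAXParserFactory.newInstance().newSAXParser()
      parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler)
      true
    } catch {
      case _: SAXException => false
    }

  // Return the test inputs on which the candidate parser (hypothetical
  // stand-in for the Scala parser) disagrees with the reference parser.
  def disagreements(cases: Seq[String], candidate: String => Boolean): Seq[String] =
    cases.filter(c => candidate(c) != jaxpAccepts(c))
}
```

In practice the `cases` would be read from the files of a conformance suite rather than written inline, and each disagreement would point at an input worth investigating in one parser or the other.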
