XML Pretty Printing

abcoates edited this page Sep 13, 2010 · 3 revisions

Scala currently has an XML pretty printer based on a “text box” model of fitting XML content into text boxes of fixed widths. There are some issues with the current approach:

  • it wraps elements/attributes, but doesn’t wrap text;
  • it forces you to specify a particular maximum width, even if you just want the XML wrapped without caring about the final width.

The following is a proposal for an alternative approach to XML Pretty Printing.

Formatting XML Using Weighted Breakpoints

There are many places where you can legitimately break XML across lines without changing its meaning. For example,

<a b="bee" c="sea"><d>Old McDonald had a farm</d><e><f h="Hi">fore</f><g j="Jay">go</g></e></a>

can be written as

<a
b
=
"bee"
c
=
"sea"
>
<d
>
Old McDonald had a farm
</d
>
<e
>
<f
h
=
"Hi"
>
fore
</f
>
<g
j
=
"Jay"
>
go
</g
>
</e
>
</a
>

Further, the text can be wrapped if the user doesn’t mind having text re-flowed during formatting. On the other hand, no user is likely to want this level of wrapping. For example, the user may not want a single attribute to be split across multiple lines, and may not want a right angle-bracket “>” on a line of its own. Also, the user may like some indenting:

<a
  b="bee"
  c="sea">
  <d>
    Old McDonald had a farm
  </d>
  <e>
    <f
      h="Hi">
      fore
    </f>
    <g
      j="Jay">
      go
    </g>
  </e>
</a>

Here a two-space indent has been used, but the user may want any number of spaces or tabs as the indent. Perhaps the user prefers to have all attributes on the same line as the element name, and doesn’t want breaks before and after the text in an element:

<a b="bee" c="sea">
  <d>Old McDonald had a farm</d>
  <e>
    <f h="Hi">fore</f>
    <g j="Jay">go</g>
  </e>
</a>

What differs between these cases is:

  • where the user always wants a line break;
  • where the user would accept a line break if needed to fit a particular width;
  • where the user never wants a line break;
  • whether the user wants indenting, and if so, what the indenting string is;
  • whether the user wants text re-flowed to fit a particular width, and if so, whether white space should be modified at the beginnings and ends of text lines as part of that re-flowing.
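The choices in this list could be gathered into a single options value. The following Scala sketch is only an illustration: `BreakPolicy`, `Options`, the field names, and the two presets are hypothetical names made up here and are not part of the Scala library.

```scala
object PrettyPrintOptions {
  // How the formatter treats one kind of potential break position.
  sealed trait BreakPolicy
  case object AlwaysBreak extends BreakPolicy   // the user always wants a line break
  case object BreakIfNeeded extends BreakPolicy // acceptable only when needed to fit a width
  case object NeverBreak extends BreakPolicy    // the user never wants a line break

  // Hypothetical option set; the three break positions are examples, not a complete list.
  final case class Options(
    beforeAttribute: BreakPolicy,
    aroundText: BreakPolicy,
    betweenElements: BreakPolicy,
    indent: Option[String],     // None means no indenting
    maxWidth: Option[Int],      // None means no width limit
    reflowText: Boolean,        // may text be re-flowed to fit maxWidth?
    trimReflowedLines: Boolean  // may white space at line ends change when re-flowing?
  )

  // Roughly the first indented example above: one attribute per line, text on its own line.
  val expanded: Options =
    Options(AlwaysBreak, AlwaysBreak, AlwaysBreak, Some("  "), None, false, false)

  // Roughly the second example: attributes and text stay on the element's line.
  val compact: Options =
    Options(NeverBreak, NeverBreak, AlwaysBreak, Some("  "), None, false, false)

  // The indent string repeated once per nesting level, or nothing if indenting is off.
  def indentFor(options: Options, depth: Int): String =
    options.indent.map(_ * depth).getOrElse("")
}
```

Keeping the indent as a string rather than a count lets the same field cover spaces and tabs, for example `expanded.copy(indent = Some("\t"))`.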

There is also the question of how the user wants to see namespace declarations:

  • on the root element? If so, before or after attributes, or in the “original order” if such an order exists?
  • at the lowest level possible, repeated as required?
  • with minimum use of prefixes? With maximum use of prefixes?
  • if with prefixes, only using prefixes declared in the XML, or using an external list? Does the external list override prefixes in the XML, or not?

Returning to the core question of when and where to wrap an XML file when formatting it, the above suggests that a weighting should be applied to each point in the XML where a line break is possible. One weighting is “always”, one is “never”, and the others (“intermediate weightings”) lie somewhere in between. Intermediate weightings would only be used when wrapping to a particular width, with “more preferred” wrapping points being used before “less preferred” ones.
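One possible reading of the weighting idea is sketched below in Scala. Everything in it is an assumption made for illustration: the names `Break`, `Prefer`, `Piece`, and `layout` are hypothetical, a higher weight is taken to mean a more preferred breakpoint, and when a line is too long every breakpoint sharing the best weight in that line is broken together. Indentation and text re-flow are left out.

```scala
object WeightedBreaks {
  sealed trait Break
  case object Always extends Break                  // always break here
  case object Never extends Break                   // never break here
  final case class Prefer(weight: Int) extends Break // break only to fit; higher = preferred

  // A run of output text, the break opportunity that follows it, and the
  // text emitted in place of the break when no break is taken (e.g. a space).
  final case class Piece(text: String, after: Break, gap: String = "")

  // Split after every piece whose trailing break satisfies `p`.
  private def splitAfter(seg: Vector[Piece])(p: Break => Boolean): Vector[Vector[Piece]] = {
    val (done, rest) =
      seg.foldLeft((Vector.empty[Vector[Piece]], Vector.empty[Piece])) {
        case ((lines, cur), piece) =>
          if (p(piece.after)) (lines :+ (cur :+ piece), Vector.empty[Piece])
          else (lines, cur :+ piece)
      }
    if (rest.isEmpty) done else done :+ rest
  }

  // Text of one line; the gap after the final piece is dropped.
  private def render(seg: Vector[Piece]): String =
    seg.init.map(p => p.text + p.gap).mkString + seg.last.text

  // Break an overlong line at its most preferred internal breakpoints, then recurse.
  private def fit(seg: Vector[Piece], maxWidth: Int): Vector[String] = {
    val weights = seg.init.collect { case Piece(_, Prefer(w), _) => w }
    if (render(seg).length <= maxWidth || weights.isEmpty) Vector(render(seg))
    else {
      val best = weights.max
      splitAfter(seg)(_ == Prefer(best)).flatMap(fit(_, maxWidth))
    }
  }

  // "Always" breaks are applied first; intermediate weights only when a line is too wide.
  def layout(pieces: Seq[Piece], maxWidth: Int): Vector[String] =
    splitAfter(pieces.toVector)(_ == Always).flatMap(fit(_, maxWidth))

  // A shortened version of the example document, with made-up weights.
  val example: Vector[Piece] = Vector(
    Piece("<a", Prefer(1), " "),
    Piece("b=\"bee\"", Prefer(1), " "),
    Piece("c=\"sea\">", Prefer(3)),
    Piece("<d>", Never),
    Piece("Old McDonald had a farm", Never),
    Piece("</d>", Prefer(3)),
    Piece("</a>", Always)
  )
}
```

With this sketch, `WeightedBreaks.layout(WeightedBreaks.example, 40)` breaks only at the weight-3 points between elements. At a width of 10 the attribute breaks are used as well, while the `<d>` element stays on one overlong line because its internal breakpoints are marked `Never`.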

Some advice from Michael Kay, author of Saxon

Comment from ‘kontrawize’ blog:

You seem to be reinventing the wheel.

Follow the rules of the XSLT/XQuery serialization spec, providing all the serialization parameters listed in that spec as user options if you can. Apart from anything else, this gives you the opportunity to reuse an existing serializer written to this spec, and it means that others who want to reuse your serializer as part of a larger piece of software are likely to be able to customize it as required.

The tricky part is selecting the default serialization parameters, and the trickiest one (as you appear to recognize in the Wiki) is the indent parameter. There’s no right answer to this: you want indent=yes for data-oriented XML but this is bad for mixed content.

Canonical XML

A suggestion from Ron Bourret is that Canonical XML might be the best format to use for doing textual differencing of XML files. This is something to investigate.
