Combine Scalar Forms

Ingy döt Net edited this page Nov 28, 2011 · 6 revisions

YAML currently has 5 scalar forms: plain, single-quoted, double-quoted, literal and folded. All 5 are supported only in block context; flow context allows only the first 3.

This proposal is about reducing this number to 3 without losing anything in wide use, and possibly even adding more. The three new forms are plain, single-quoted and double-quoted, but each of them is modified from how it exists now. The benefits of the literal and folded forms can be obtained with slight modifications to the quoted forms.

Plain

The new plain scalar is an unquoted string of characters from the YAML character set that conforms to the same restrictions as the current plain form. We add the restriction that a plain scalar does not contain newline characters; in other words, a plain scalar is always on a single line.

We should also consider lifting some of the restrictions on starting characters. For instance % can almost always be allowed, since it only indicates a directive in very specific contexts.

Single-Quoted

In flow context and in block mapping keys, single-quoted scalars are parsed under the current rules.

In block context (but not in a key) we change things a bit. The idea is to leave alone the stuff that would exist in real world YAML today, while taking some forms that would not exist and changing their semantics to get what we want. Namely, folded and literal support.

Since quoted strings already fold, there is really not much we need to do to support folding. All we really need to do is worry about whether or not the string has a final newline.

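Consider this example that combines YAML 1.2 and YAML 2.0 in equivalent forms. Here and in the examples that follow, key 1 is the YAML 1.2 form, key 2 is the proposed YAML 2.0 form, and V is a double-quoted value that shows what they mean: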

Ends with newline:
  1: >
    folded
    stuff
  2: 'folded
    stuff
    '
  V: "folded stuff\n"
No final newline:
  1: >-
    folded
    stuff
  2: 'folded
    stuff'
  V: "folded stuff"
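
The final-newline rule in the example above can be sketched in a few lines of Python. This is only an illustration, not code from any YAML implementation: the helper name `fold_block_single_quoted` is made up, it assumes the scalar text has already been cut out of the document, and it ignores blank lines and `''` escapes inside the scalar.

```python
def fold_block_single_quoted(raw: str) -> str:
    """Fold a multi-line single-quoted scalar and decide on a final newline.

    Hypothetical helper: `raw` is the scalar text from the opening quote
    through the closing quote.
    """
    text = raw.rstrip()
    if len(text) < 2 or text[0] != "'" or text[-1] != "'":
        raise ValueError("expected text wrapped in single quotes")
    # Drop the opening and closing quote, then trim each line.
    parts = [line.strip() for line in text[1:-1].split("\n")]
    if len(parts) > 1 and parts[0] == "":
        raise ValueError("this sketch needs text right after the opening quote")
    # A closing quote alone on the last line means "ends with a newline".
    final_newline = len(parts) > 1 and parts[-1] == ""
    if final_newline:
        parts.pop()
    folded = " ".join(parts)  # folding: line breaks become single spaces
    return folded + "\n" if final_newline else folded


# The two "2:" forms from the example above.
assert fold_block_single_quoted("'folded\n    stuff\n    '") == "folded stuff\n"
assert fold_block_single_quoted("'folded\n    stuff'") == "folded stuff"
```

The only decision the sketch makes beyond ordinary folding is where the closing quote sits: on its own line, or right after the last word.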

Great. This is a nice reduction. Goodbye folded indicator. Now let's get rid of the literal indicator. Imagine if you wanted to serialize ' foo '. You could do it like this:

1: '
  foo
  '
V: ' foo '

YAML 1 folding puts a space before and after foo, but nobody would write it this way in the real world. So we are free to change the semantics of this form to this:

1: |
  \ / /\ |\/||
   | /--\|  ||__
2: '
  \ / /\ |\/||
   | /--\|  ||__
  '
V: "\\ / /\\ |\\/||\n | /--\\|  ||__\n"

So a leading single-quote followed by a newline means we parse with literal semantics. This is a good start but the final single-quote is distracting and we don't need it. We know when a block level scalar ends from indentation:

1: |
  \ / /\ |\/||
   | /--\|  ||__
2: '
  \ / /\ |\/||
   | /--\|  ||__
V: "\\ / /\\ |\\/||\n | /--\\|  ||__\n"

Nice. Two more things to worry about. Indentation amount and final newlines. I propose we do it like this:

1: |4-
       /\ |\/|
      /--\|  |
2: '
   '   /\ |\/|
      /--\|  |'
V: "   /\\ |\\/|\n  /--\\|  |"

Three things have happened above:

  1. We used the first quote to indicate literal form.
  2. The second quote indicates where indentation begins (we can't infer it).
  3. The third quote indicates no final newline.
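
Here is a rough Python sketch of how a parser might read those three quotes. None of the details are fixed by the proposal: `parse_quoted_literal` is a hypothetical name, the input is assumed to be already delimited by indentation, a quote that starts the first content line is assumed to always be the indentation marker, and without a marker the indentation is inferred from the first content line.

```python
def parse_quoted_literal(raw: str) -> str:
    """Decode the proposed literal form (hypothetical helper).

    `raw` runs from the opening quote to the end of the scalar.
    """
    head, _, body = raw.partition("\n")
    # Quote 1: an opening quote followed by a newline selects literal form.
    if head.strip() != "'" or not body:
        raise ValueError("literal form needs an opening quote, then a newline")
    lines = body.split("\n")

    # Quote 3: a quote at the very end of the scalar is a closing quote.
    final_newline = True
    last = lines[-1].rstrip()
    if last.endswith("'"):
        if last[:-1].strip() == "":
            lines.pop()              # alone on its line: final newline is kept
        else:
            lines[-1] = last[:-1]    # right after content: no final newline
            final_newline = False
    else:
        # No closing quote (assumption): drop trailing blank lines.
        while lines and lines[-1].strip() == "":
            lines.pop()
    if not lines:
        return ""

    # Quote 2: a quote opening the first content line marks the indentation.
    first = lines[0]
    rest = first.lstrip(" ")
    indent = len(first) - len(rest)
    if rest.startswith("'"):
        indent += 1                  # content starts just after the marker
        lines[0] = " " * indent + rest[1:]

    text = "\n".join(line[indent:] for line in lines)
    return text + "\n" if final_newline else text


# The "2:" form from the example above decodes to its V value.
raw = "'\n   '   /\\ |\\/|\n      /--\\|  |'"
assert parse_quoted_literal(raw) == "   /\\ |\\/|\n  /--\\|  |"
```

Treating every trailing quote as a closing quote keeps the sketch simple, at the cost that a value really ending in a quote needs one extra quote after it.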

We can specify extra trailing newlines as such:

1: |+
  \ / /\ |\/||
   | /--\|  ||__

2: '
  \ / /\ |\/||
   | /--\|  ||__

  '
V: "\\ / /\\ |\\/||\n | /--\\|  ||__\n\n"

In order for this form to be used for literals, it has to be true that single quotes don't need to be escaped. Luckily, we have asserted that indentation terminates the scalar, so we don't need to escape the quote. Since we still allow (and strip) a closing quote, if a scalar ends in a quote, we just have to add a closing quote after it.

1: '- I can''t believe it
    ends with a '''
2: '- I can't believe it
    ends with a ''
V: "- I can't believe it ends with a '"
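
To make the difference concrete, here is a small Python sketch that decodes both forms above. It is an illustration under assumptions, not a real parser: `decode_single_quoted` and its `proposed` flag are made-up names, the scalar is assumed to be already cut out of the document (by indentation, under the proposal), and final-newline handling is left out.

```python
def decode_single_quoted(raw: str, proposed: bool) -> str:
    """Decode a folded single-quoted block scalar under either rule set.

    Hypothetical helper: `raw` is the whole scalar, opening quote through
    closing quote.
    """
    text = raw.strip()
    if len(text) < 2 or text[0] != "'" or text[-1] != "'":
        raise ValueError("expected text wrapped in single quotes")
    # Strip exactly one opening and one closing quote, then fold the lines.
    folded = " ".join(line.strip() for line in text[1:-1].split("\n"))
    if proposed:
        return folded                 # inner quotes are taken as written
    return folded.replace("''", "'")  # current rule: '' stands for one quote


expected = "- I can't believe it ends with a '"
# Form 1: current rules, every quote in the content is doubled.
assert decode_single_quoted("'- I can''t believe it\n    ends with a '''", False) == expected
# Form 2: proposed rules, only the extra closing quote is added.
assert decode_single_quoted("'- I can't believe it\n    ends with a ''", True) == expected
```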

Double-Quoted

Double-quoted works the same as single-quoted above, except that it supports the current double-quoting escapes. This actually adds a whole new ability to YAML. Now you can use YAML's rich escaping capabilities together with the benefits of the literal and folded forms.

Note that it is no longer required to escape a double quote with \".
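
As a sketch of what this could look like in a parser, the Python below reads a double-quoted literal and then applies escapes. Everything here is an assumption made for illustration: the names `unescape` and `parse_double_quoted_literal` are hypothetical, only a handful of escapes are handled (anything else is left untouched), and the indentation marker and closing quote are left out.

```python
import re

# Small subset of the double-quoted escapes, enough for illustration.
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"', "0": "\0"}
ESCAPE_RE = re.compile(r'\\(x[0-9A-Fa-f]{2}|[nt\\"0])')


def unescape(text: str) -> str:
    """Replace the supported backslash escapes in `text`."""
    def replace(match):
        code = match.group(1)
        if code.startswith("x"):
            return chr(int(code[1:], 16))  # \xNN: two hex digits
        return SIMPLE_ESCAPES[code]
    return ESCAPE_RE.sub(replace, text)


def parse_double_quoted_literal(raw: str) -> str:
    """Literal form opened by a double quote, with escapes applied."""
    head, _, body = raw.partition("\n")
    if head.strip() != '"':
        raise ValueError("expected an opening double quote, then a newline")
    lines = body.split("\n")
    while lines and lines[-1].strip() == "":
        lines.pop()                        # drop trailing blank lines
    if not lines:
        return ""
    # Indentation is inferred from the first content line.
    indent = len(lines[0]) - len(lines[0].lstrip(" "))
    text = "\n".join(line[indent:] for line in lines) + "\n"
    return unescape(text)


# Escapes are processed, and the inner double quotes need no backslash.
raw = '"\n  name:\\tYAML\n  say "hi" \\x21\n'
assert parse_double_quoted_literal(raw) == 'name:\tYAML\nsay "hi" !\n'
```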

Rationale for the Changes

Folded block scalars (>) actually have really weird semantics and are probably not fully understood (let alone used) by anyone other than YAML implementors. The folding properties already exist in the quoted and unquoted forms. So really the folded form serves no good purpose and should be abandoned.

Assuming we get rid of folded, it makes literal (|) stick out like a sore thumb. No other language uses | or > as a quoting mechanism.

So we overload the semantics of the quotes a little. It makes YAML seem more normal, and actually makes the language more expressive than before. Cheers.