Canonical form of tag:yaml.org,2002:str #274

Open
jasom opened this issue Jan 25, 2022 · 1 comment
@jasom
jasom commented Jan 25, 2022

The spec just says "the obvious". Assuming that the canonical form is a double-quoted string (which is not immediately obvious) and that all printable non-newline characters should be printed unescaped, what is the canonical form of a string that contains:

  • Newlines
  • C0 control codes
  • C1 control codes
  • Surrogate codes

All of these have multiple valid representations inside an nb-double-one-line.

@Thom1729
Collaborator

The canonical form of a scalar node with tag tag:yaml.org,2002:str is the same as the formatted content of the node.

Whether a scalar is double-quoted, single-quoted, plain, or in block form is a presentational detail, as is escaping. For instance, the scalars 'foo' and "fo\x6f" have the same formatted content. Those two scalars are perfectly interchangeable for all purposes, regardless of the tag. An implementation may present scalars in whatever style and with whatever escaping it chooses.
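
As a rough illustration of that point, here is a minimal Python sketch. The helper `decode_hex_escapes` is hypothetical and only handles the `\xNN`, `\uNNNN` and `\UNNNNNNNN` escapes; it is not a full YAML double-quoted scalar decoder and does not handle an escaped backslash before `x`.

```python
import re

# Hypothetical helper for illustration only: it covers just the hex escapes,
# not the full set of YAML double-quoted escape sequences.
_HEX_ESCAPE = re.compile(
    r"\\x([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})"
)

def decode_hex_escapes(body: str) -> str:
    """Replace each hex escape in a double-quoted scalar body with its character."""
    def replace(match: re.Match) -> str:
        digits = match.group(1) or match.group(2) or match.group(3)
        return chr(int(digits, 16))
    return _HEX_ESCAPE.sub(replace, body)

single_quoted_content = "foo"                           # content of the scalar 'foo'
double_quoted_content = decode_hex_escapes(r"fo\x6f")   # content of the scalar "fo\x6f"

# Both presentations carry the same three characters.
print(single_quoted_content == double_quoted_content)
```

Once the escape is decoded, nothing in the resulting string records which style or escaping was used, which is why the choice is left to the implementation.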

Note that scalar content cannot include surrogates. The content of a scalar is a sequence of zero or more Unicode characters. Surrogates are not Unicode characters: a high or low surrogate code point on its own does not correspond to any character. C0 and C1 control codes (including line feeds and carriage returns), by contrast, are Unicode characters.
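
The distinction can be checked with Python's standard `unicodedata` module, as a sketch. The helper name `is_surrogate` is an assumption made for this example; the general categories `Cc` (control) and `Cs` (surrogate) come from the Unicode character database.

```python
import unicodedata

def is_surrogate(code_point: int) -> bool:
    """True for code points in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF

# Line feed, carriage return, a C1 control (NEL), and a high surrogate.
for cp in (0x0A, 0x0D, 0x85, 0xD834):
    category = unicodedata.category(chr(cp))
    print(f"U+{cp:04X} category={category} surrogate={is_surrogate(cp)}")

# A lone surrogate cannot be encoded as UTF-8, unlike the control codes.
try:
    chr(0xD834).encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate is not encodable")
```

The control codes report category `Cc`, while the surrogate code point reports `Cs` and is rejected by the UTF-8 encoder.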

(Notwithstanding the above, a single astral character in a double-quoted scalar may be represented by two escape sequences specifying the code points of a surrogate pair. For instance, the scalar '𝄞' may also be presented as "\ud834\udd1e". In either case, the formatted content of the scalar is the single character 𝄞. This feature is purely for JSON compatibility, because the same scalar could also be presented as "\U0001d11e", which avoids surrogates entirely.)
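
For reference, the surrogate pair arithmetic can be sketched in Python as follows. `combine_surrogates` is a hypothetical helper written for this example; the comparison uses the standard `json` module, which combines escaped surrogate pairs when decoding a JSON string.

```python
import json

def combine_surrogates(high: int, low: int) -> int:
    """Combine a high/low surrogate pair into the code point it encodes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

code_point = combine_surrogates(0xD834, 0xDD1E)
print(hex(code_point))  # 0x1d11e

# The JSON form with two \u escapes decodes to the same single character.
print(json.loads('"\\ud834\\udd1e"') == chr(code_point))  # True
```

The pair D834/DD1E therefore stands for the single code point U+1D11E, and the decoded content is one character either way.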
