Ambiguities/Lack of Specification—Non Printable characters, Unicode, Doubles, Keys, Keygroups. #42

Closed
tef opened this Issue Feb 24, 2013 · 15 comments

tef commented Feb 24, 2013

Is the format designed to allow non-printable characters inside of it? The special characters pick a small set of ASCII control codes to escape, but ignore the other C0/Unicode control codes.

I.e., why is tab escaped, but not vertical tab? Additionally, Unicode line breaks such as \u2028 (Line separator) and \u2029 (Paragraph separator) aren't escaped, but the ASCII CR and LF are.

Since strings are in UTF-8, you won't normally expect a BOM, but the spec doesn't say whether TOML files should/must/may have a BOM. It should also be clear on whether any Unicode normalisation should be applied/expected on the strings inside, or on the key names.

What should a compliant parser do when it encounters an invalid utf-8 byte sequence? Does it skip the line? Should it error? Interoperability means parsers should have the same failure behaviours.
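One possible answer to the failure-behaviour question is "error on the whole document". A minimal sketch of that behaviour, assuming a hypothetical helper named `decode_toml_bytes` (not part of any TOML library) and Python's built-in strict UTF-8 decoder:

```python
def decode_toml_bytes(raw: bytes) -> str:
    """Decode raw file bytes as UTF-8, rejecting malformed sequences.

    Hypothetical helper for illustration; the name is not from the spec.
    """
    # errors="strict" is Python's default; it is spelled out for emphasis.
    return raw.decode("utf-8", errors="strict")


valid = 'name = "caf\u00e9"'.encode("utf-8")
invalid = b'name = "caf\xe9"'  # a lone 0xE9 byte (Latin-1 style) is not valid UTF-8

print(decode_toml_bytes(valid))
try:
    decode_toml_bytes(invalid)
except UnicodeDecodeError as exc:
    # The whole input is rejected rather than skipping the offending line.
    print("rejected:", exc.reason)
```

Skipping the bad line instead would need a line-by-line decode, which is exactly the kind of divergence between parsers the question is worried about.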

Floats (doubles) don't represent the full spectrum of acceptable IEEE values: -Infinity, +Infinity and NaN are missing.

What should a compliant parser do when it encounters a float it can't represent? Similarly, there is no rounding mechanism specified for when a decimal float value doesn't fit neatly into a double.
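To make both questions concrete, here is a small sketch of what Python's own `float()` does with decimal text that does not fit a double. This shows one language's behaviour, not something the spec mandates; `parse_float_strict` is a hypothetical helper showing one way a parser could refuse unrepresentable values.

```python
import math
from decimal import Decimal

# 0.1 has no exact binary representation, so the nearest double is stored.
stored = Decimal(0.1)              # exact value of the double that was stored
print(stored == Decimal("0.1"))    # False: rounding happened silently

# Out-of-range decimal text overflows to infinity or underflows to zero.
print(float("1e400"))
print(float("1e-400"))


def parse_float_strict(text: str) -> float:
    """Hypothetical: reject text whose value is not a finite double."""
    value = float(text)
    if not math.isfinite(value):
        raise ValueError(f"float out of range: {text!r}")
    return value
```

Whether a parser should silently round, overflow, or raise like the helper above is the open question here.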

Note: If precision is required, perhaps C99 hex-formatted floats should be allowed (or, in C speak, scanf("%a")).
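As an illustration of why hex floats sidestep the rounding question, Python exposes the same C99-style notation through `float.hex()` and `float.fromhex()`. This is only a sketch of the round-trip property, not a proposal for TOML syntax:

```python
x = 0.1

# float.hex() writes the exact bits of the double in C99 "%a" style.
as_hex = x.hex()
print(as_hex)  # 0x1.999999999999ap-4

# Reading it back needs no decimal-to-binary rounding, so the
# round trip is exact.
assert float.fromhex(as_hex) == x
```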

Datetimes: You may also want to refer to RFC 3339 over ISO8601. The RFC is more explicit about handling leap seconds and what information is required in a timestamp, and demands uppercase 'Z' and 'T' in datetimes, which toml doesn't mandate.

You ask for "Full Zulu" but your example skips the fractional seconds part of an ISO datetime, which is optional in the RFC above, but it isn't clear if a compliant parser should support them.
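A sketch of the strict reading suggested above (uppercase 'T' and 'Z', fractional seconds optional), written as a regular expression. The pattern and the name `is_zulu_datetime` are assumptions for illustration and only check the shape of the text, not calendar validity:

```python
import re

# Shape check only: "2013-02-24T10:00:00Z" or "2013-02-24T10:00:00.5Z".
ZULU = re.compile(
    r"[0-9]{4}-[0-9]{2}-[0-9]{2}"    # date
    r"T[0-9]{2}:[0-9]{2}:[0-9]{2}"   # uppercase 'T', then the time
    r"(\.[0-9]+)?"                   # optional fractional seconds
    r"Z"                             # uppercase 'Z' (UTC)
)


def is_zulu_datetime(text: str) -> bool:
    """Hypothetical helper: True if text has the strict Zulu shape."""
    return ZULU.fullmatch(text) is not None


print(is_zulu_datetime("1979-05-27T07:32:00Z"))      # True
print(is_zulu_datetime("1979-05-27T07:32:00.999Z"))  # True
print(is_zulu_datetime("1979-05-27t07:32:00z"))      # False
```

An explicit pattern is used rather than `datetime.strptime` so that letter case is checked exactly as written.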

Keygroups/Keys trim leading and trailing tabs/spaces, and it is implied that these names will not contain any whitespace (0x09 or 0x20) characters, but as mentioned, Unicode's whitespace characters (including vertical tabs) are not forbidden or stripped. Note: Unicode defines 26 whitespace characters.
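The gap between "trim tabs and spaces" and "trim whitespace" is easy to see in code. A minimal sketch, assuming a hypothetical `trim_key` helper that follows the tab/space-only reading:

```python
TOML_WS = " \t"  # the only two characters the spec talks about trimming


def trim_key(raw: str) -> str:
    """Hypothetical helper: strip only tab (0x09) and space (0x20)."""
    return raw.strip(TOML_WS)


raw = "\t foo\x0b \t"  # a vertical tab (0x0B) sits right after "foo"

print(repr(trim_key(raw)))   # 'foo\x0b'  (the vertical tab survives)
print(repr(raw.strip()))     # 'foo'      (Python's default strips it too)
```

A parser that reaches for its language's default "strip whitespace" routine will therefore disagree with one that trims only the two characters named in the spec.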

Keys and Keygroup names have a whole load of ambiguity and I have a lot of questions:

Can escape sequences appear inside of key/keygroup names?

Can a key/group have a space in the name? Is "[foo bar]" valid? is "foo bar = 1" valid?

Can key groups have []'s in their name? Is "[[foo]]" a valid keygroup name, or "[[foo]", "[foo]]" ?

How do you parse "foo = bar = 1"? is this the key "foo = bar" or assigning to two keys "foo" and "bar"? The spec doesn't say which equals sign to use when there is more than one, or if more than one is an error.

Can keygroups have empty names? i.e is [] a valid keygroup.

Can keygroups have nothing between dots? is [foo..bar] the keygroup for "foo", "", "bar".

Can keys be empty? Does "=1" parse as the key '' with the value 1?

Can keys have the characters '[' or ']' inside of them. What if you have "[foo] = blah"?

Can keys have '.' in the name? Would "foo.blah = 1" be the key 'foo.blah' or would it work like a keygroup?

Can keys be numbers or dates? Can keygroup names be numbers or dates?

What happens if you use the name 'true' or 'false' as a keygroup or a key? Does it use the string value or the implied boolean value?

mikkelee commented Feb 24, 2013

+1

tef commented Feb 24, 2013

So far from discussions and reading other implementations:

Keys/Keygroups cannot be zero length.
Datetimes should have uppercase letters. Fractional seconds are ignored.
The left most equal sign is used to split key = value configurations.
Keys can have .'s in them and are treated like text. The line "true = false" would set the key "true" to the boolean value false.
BOM should be ignored if present. BOM should not be written.
Floats cannot represent Infinity or NaN. How to convert a decimal into a binary float is implementation-defined.
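The key-splitting rules in this summary can be sketched as follows. `split_key_value` is a hypothetical helper, and the value is left as raw text because value parsing is a separate step:

```python
def split_key_value(line: str) -> tuple[str, str]:
    """Split on the leftmost '=' and trim tabs/spaces from both sides."""
    # str.partition splits at the first (leftmost) occurrence.
    key, sep, value = line.partition("=")
    if not sep:
        raise ValueError("no '=' on line")
    key = key.strip(" \t")
    value = value.strip(" \t")
    if not key:
        # Per the summary above: keys cannot be zero length.
        raise ValueError("empty key")
    return key, value


print(split_key_value("foo = bar = 1"))   # ('foo', 'bar = 1')
print(split_key_value("true = false"))    # ('true', 'false')
print(split_key_value("foo.bar = 1"))     # ('foo.bar', '1')
```

With this split, `foo = bar = 1` hands the text `bar = 1` to the value parser, which can then reject it as not being a valid value.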

Member

mojombo commented Feb 24, 2013

I'll hit these tomorrow. I'd love to hear any recommendations you have.

mikkelee commented Feb 24, 2013

I'm a lil confused. What are we doing right now? Are we (read: you) defining a ruleset to limit damage?

tef commented Feb 24, 2013

I am going to make recommendations in the form of a pull request. I have no idea what I am doing.

#51

mikkelee commented Feb 24, 2013

Awesome. I'm into this

tef commented Feb 24, 2013

Scanning through the other issues.
#55 covers the float complaint
#44 covers hex floats.

I've opened two pull requests and hopefully not fucked it up.

#56 covers float conversion, using rfc in lieu of iso, banning stupid keygroup names, die on bad escape sequences.
#62 is a proposal to add identifier-like strings, to make parsing easier; it also eliminates issues like bad key/keygroup names (i.e. [[foo], [foo]], and foo.bar = value).

Contributor

benolee commented Feb 24, 2013

I think my question fits into the scope of this issue, but if not, I can move it elsewhere.

Keygroup whitespace

For keygroups, are spaces (let's just say spaces are \s and \t) allowed between [ and the keyname, and between the keyname and ]? If so, are they ignored?

I made some diagrams to explain what I mean.

Option 1

  • spaces between [ and keyname are not allowed
  • spaces between keyname and ] are not allowed

[Diagram: keygroup-1]

Option 2

  • spaces after [ but before key name starts are ignored
  • spaces after key name but before ] are ignored

[Diagram: keygroup-2]

ocxo commented Feb 24, 2013

I opt for the simpler, no diagrams required, "ignore all whitespace".

jayzeng commented Feb 25, 2013

Simple is beautiful. +1 on "ignore all whitespace", simplifies parsing and increases flexibility.

Contributor

rossipedia commented Feb 25, 2013

So does that mean that [foo bar] and [foobar] are equivalent? That seems like it could be confusing and cause all sorts of headaches.

Contributor

rossipedia commented Feb 25, 2013

I'll probably be submitting a pull request for a formal language-agnostic EBNF document for the language. That's really the best way to resolve these ambiguities.

88Alex commented Apr 13, 2013

I think it would be simpler to just disallow spaces in identifier names, like most languages.

Member

BurntSushi commented Jun 25, 2014

There's a lot in this issue, so I'm going to run through all of it. I think we've already got most things covered, but I'll try to fix things we don't. (I agree that we're going to want an EBNF, but resolving ambiguity is necessary before we can write one.)

I'm going to close this issue. If I've done or said anything deserving of criticism, please open a new issue for each. (I'm trying to get rid of these kitchen-sink issues that are hard to manage.)

Note that some of these questions seem a little weird now, but it looks like they were asked a long time ago and many of them were clarified.

Is the format designed to allow non-printable characters inside of it? The special characters pick a small set of ASCII control codes to escape, but ignore the other C0/Unicode control codes.

This was fixed when strings were revamped in the early days.

I.e., why is tab escaped, but not vertical tab? Additionally, Unicode line breaks such as \u2028 (Line separator) and \u2029 (Paragraph separator) aren't escaped, but the ASCII CR and LF are.

Also fixed. The special escape sequences such as \t were chosen because they are convenient (cf. most programming languages).

Since strings are in UTF-8, you won't normally expect a BOM, but the spec doesn't say if toml files should/must/may have a BOM.

TOML is UTF-8 only, so no BOM.
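The thread does not settle whether "no BOM" means a parser rejects a leading BOM or quietly drops it. A sketch of the lenient reading from the earlier summary ("ignored if present"), using a hypothetical `strip_bom` helper and the standard `codecs.BOM_UTF8` constant:

```python
import codecs


def strip_bom(raw: bytes) -> bytes:
    """Hypothetical helper: drop a leading UTF-8 BOM if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


with_bom = codecs.BOM_UTF8 + b"a = 1"

print(strip_bom(with_bom))           # b'a = 1'
print(strip_bom(b"a = 1"))           # b'a = 1'

# Python's "utf-8-sig" codec does the same thing while decoding.
print(with_bom.decode("utf-8-sig"))  # a = 1
```

A strict parser would instead raise when `raw.startswith(codecs.BOM_UTF8)` is true.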

It should also be clear on if any unicode normalisation should be applied/expected on the strings inside, or the key names.

I'm fine with this being unspecified. I'm happy to be shown to be wrong. (However, I don't think it's a good idea to specifically require some kind of normalization.)

What should a compliant parser do when it encounters an invalid utf-8 byte sequence? Does it skip the line? Should it error? Interoperability means parsers should have the same failure behaviours.

TOML is UTF-8 only. A parser which accepts invalid UTF-8 isn't compliant.

Floats (doubles) don't represent the full spectrum of acceptable IEEE values: -Infinity, +Infinity and NaN are missing.

I think the initial TOML spec did include IEEE in it, but that claim is no longer there.

What should a compliant parser do when it encounters a float it can't represent? Similarly, there is no rounding mechanism specified for when a decimal float value doesn't fit neatly into a double.

TOML should be easy to parse. Would specifying such things impose a large burden? I suspect so. We're pretty much at the mercy of the floating point representation chosen by our programming language of choice. So we might want to leave this unspecified.

Datetimes: You may also want to refer to RFC 3339 over ISO8601. The RFC is more explicit about handling leap seconds and what information is required in a timestamp, and demands uppercase 'Z' and 'T' in datetimes, which toml doesn't mandate.

Covered by #189.

Keygroups/Keys trim leading and trailing tabs/spaces, and it is implied that these names will not contain any whitespace (0x09 or 0x20) characters, but as mentioned, Unicode's whitespace characters (including vertical tabs) are not forbidden or stripped. Note: Unicode defines 26 whitespace characters.

Both can have whitespace. The TOML spec defines whitespace only as tabs and spaces.

Can escape sequences appear inside of key/keygroup names?

No. I think the spec is fine as is. It's clear that escape sequences only show up in strings. (With that said, if #220 is accepted and strings are allowed as keys, then escape sequences would obviously be allowed.)

Can a key/group have a space in the name? Is "[foo bar]" valid? is "foo bar = 1" valid?

Yes. Spec is clear, I think.

Can key groups have []'s in their name? Is "[[foo]]" a valid keygroup name, or "[[foo]", "[foo]]" ?

No. Fixed in #222.

How do you parse "foo = bar = 1"? is this the key "foo = bar" or assigning to two keys "foo" and "bar"? The spec doesn't say which equals sign to use when there is more than one, or if more than one is an error.

Spec is clear. Keys "end with the last non-whitespace character before the equals sign." This implies that = is not allowed in a key name. Since bar = 1 isn't a valid TOML value, foo = bar = 1 is invalid.

Can keygroups have empty names? i.e is [] a valid keygroup.

To be fixed in #223.

Can keygroups have nothing between dots? is [foo..bar] the keygroup for "foo", "", "bar".

Clarified in #223.

Can keys be empty? Does "=1" parse as the key '' with the value 1?

Clarified in #223.

Can keys have the characters '[' or ']' inside of them. What if you have "[foo] = blah"?

Hopefully fixed with #224.

Can keys have '.' in the name? Would "foo.blah = 1" be the key 'foo.blah' or would it work like a keygroup?

Spec is clear I think. foo.blah = 1 maps to {"foo.blah": 1}. That is, . has no significance in key names (only table names).
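The distinction drawn here (dots split table names but not key names) can be sketched like this. The helpers are hypothetical and follow only the reading given in this comment; they are not a description of how any particular parser behaves:

```python
def table_path(header: str) -> list[str]:
    """Split a table header such as "[a.b]" into its parts on '.'."""
    inner = header.strip(" \t")
    if not (inner.startswith("[") and inner.endswith("]")):
        raise ValueError("not a table header")
    parts = [part.strip(" \t") for part in inner[1:-1].split(".")]
    if any(part == "" for part in parts):
        raise ValueError("empty table name component")
    return parts


def assign(doc: dict, line: str) -> None:
    """Store an integer key/value pair; the key is kept verbatim."""
    key, _, value = line.partition("=")
    doc[key.strip(" \t")] = int(value)


doc: dict = {}
assign(doc, "foo.blah = 1")
print(doc)                       # {'foo.blah': 1}
print(table_path("[foo.blah]"))  # ['foo', 'blah']
```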

Can keys be numbers or dates? Can keygroup names be numbers or dates?

Yes.

What happens if you use the name 'true' or 'false' as a keygroup or a key? Does it use the string value or the implied boolean value?

Keys and table names are their own syntactic category that is distinct from values (which booleans are a part of). I think this is clear in the spec.
