Ambiguities/Lack of Specification—Non Printable characters, Unicode, Doubles, Keys, Keygroups. #42
Comments
mikkelee
Feb 24, 2013
+1
tef
Feb 24, 2013
So far from discussions and reading other implementations:
Keys/Keygroups cannot be zero length.
Datetimes should have uppercase letters. Fractional seconds are ignored.
The leftmost equals sign is used to split key = value lines.
Keys can have .'s in them and are treated like text. The line "true = false" would set the key "true" to the boolean value false.
BOM should be ignored if present. BOM should not be written.
Floats cannot represent Infinity or NaN. How to convert a decimal into a binary float is implementation-defined.
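The key/value rules in this summary can be sketched in a few lines of Python. This is only an illustration of the rules as stated in this comment, not code from any TOML implementation; `split_key_value` is a hypothetical helper, and value parsing is deliberately left out.

```python
BOM = "\ufeff"  # byte order mark as it appears after UTF-8 decoding


def split_key_value(line):
    """Hypothetical helper: split one `key = value` line into (key, raw_value)."""
    # Ignore a BOM if present (only meaningful at the very start of a file).
    if line.startswith(BOM):
        line = line[len(BOM):]
    # partition() splits at the leftmost '=' only.
    key, sep, raw_value = line.partition("=")
    if not sep:
        raise ValueError("no '=' found")
    # Trim only tab and space; dots in the key are kept as plain text.
    key = key.strip(" \t")
    if not key:
        raise ValueError("keys cannot be zero length")
    return key, raw_value.strip(" \t")


print(split_key_value("true = false"))  # ('true', 'false')
print(split_key_value("foo.bar = 1"))   # ('foo.bar', '1')
```

Under these rules the line "true = false" yields the text key "true"; turning the raw value "false" into a boolean would be a separate step.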
I'll hit these tomorrow. I'd love to hear any recommendations you have.
mikkelee
Feb 24, 2013
I'm a lil confused. What are we doing right now? Are we (read: you) defining a ruleset to limit damage?
tef
Feb 24, 2013
I am going to make recommendations in the form of a pull request. I have no idea what I am doing.
mikkelee
Feb 24, 2013
Awesome. I'm into this
tef
Feb 24, 2013
Scanning through the other issues.
#55 covers the float complaint
#44 covers hex floats.
I've opened two pull requests and hopefully not fucked it up.
#56 covers float conversion, using rfc in lieu of iso, banning stupid keygroup names, die on bad escape sequences.
#62 is a proposal to add identifier-like strings to make parsing easier. It also eliminates issues like bad key/keygroup names (e.g. [[foo], [foo]], and foo.bar = value).
benolee
Feb 24, 2013
Contributor
I think my question fits into the scope of this issue, but if not, I can move it elsewhere.
Keygroup whitespace
For keygroups, are spaces (let's just say spaces are \s and \t) allowed between [ and the keyname, and between the keyname and ]? If so, are they ignored?
I made some diagrams to explain what I mean.
Option 1
- spaces between [ and keyname are not allowed
- spaces between keyname and ] are not allowed
Option 2
- spaces after [ but before key name starts are ignored
- spaces after key name but before ] are ignored
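Since the diagrams did not survive, the two options can be sketched as a pair of regular expressions. This is a rough Python illustration under stated assumptions: `keygroup_name`, `STRICT` and `LENIENT` are hypothetical names, whitespace means only space and tab, and spaces inside the name are left permitted because the question above does not settle them.

```python
import re

# A name: starts and ends with a character that is not a bracket, space or tab.
NAME = r"[^\[\] \t](?:[^\[\]]*[^\[\] \t])?"

# Option 1: no whitespace allowed between the brackets and the name.
STRICT = re.compile(r"\[(" + NAME + r")\]")
# Option 2: whitespace next to the brackets is ignored.
LENIENT = re.compile(r"\[[ \t]*(" + NAME + r")[ \t]*\]")


def keygroup_name(line, lenient):
    """Return the keygroup name, or None if the header is rejected."""
    pattern = LENIENT if lenient else STRICT
    match = pattern.fullmatch(line)
    return match.group(1) if match else None


print(keygroup_name("[ foo ]", lenient=False))  # None
print(keygroup_name("[ foo ]", lenient=True))   # foo
```

Either way, an empty header such as "[]" is rejected, because the name pattern requires at least one character.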
ocxo
Feb 24, 2013
I opt for the simpler, no diagrams required, "ignore all whitespace".
jayzeng
Feb 25, 2013
Simple is beautiful. +1 on "ignore all whitespace", simplifies parsing and increases flexibility.
rossipedia
Feb 25, 2013
Contributor
So does that mean that [foo bar] and [foobar] are equivalent? That seems like it could be confusing and cause all sorts of headaches.
rossipedia
Feb 25, 2013
Contributor
I'll probably be submitting a pull request for a formal language-agnostic EBNF document for the language. That's really the best way to resolve these ambiguities.
88Alex
Apr 13, 2013
I think it would be simpler to just disallow spaces in identifier names, like most languages.
88Alex
Apr 18, 2013
RE spaces in keys: https://www.github.com/mojombo/toml/issues/185
BurntSushi
referenced this issue
Jun 25, 2014
Merged
Clarify that empty keys/table names are not allowed. #223
BurntSushi
Jun 25, 2014
Member
There's a lot in this issue, so I'm going to run through all of it. I think we've already got most things covered, but I'll try to fix things we don't. (I agree that we're going to want an EBNF, but resolving ambiguity is necessary before we can write one.)
I'm going to close this issue. If I've done or said anything deserving of criticism, please open a new issue for each. (I'm trying to get rid of these kitchen-sink issues that are hard to manage.)
Note that some of these questions seem a little weird now, but it looks like they were asked a long time ago and many of them were clarified.
Is the format designed to allow non printable characters inside of it? The special characters pick a small set of ascii control codes to escape, but ignores the other C0/Unicode control codes.
This was fixed when strings were revamped in the early days.
I.e., why is tab escaped, but not vertical tab? Additionally, Unicode line breaks such as \u2028 (line separator) and \u2029 (paragraph separator) aren't escaped, but the ASCII CR and LF are.
Also fixed. The \t special escape sequences were chosen because they are convenient. (cf. Most programming languages.)
Since strings are in UTF-8, you won't normally expect a BOM, but the spec doesn't say if toml files should/must/may have a BOM.
TOML is UTF-8 only, so no BOM.
It should also be clear on if any unicode normalisation should be applied/expected on the strings inside, or the key names.
I'm fine with this being unspecified. I'm happy to be shown to be wrong. (However, I don't think it's a good idea to specifically require some kind of normalization.)
What should a compliant parser do when it encounters an invalid utf-8 byte sequence? Does it skip the line? Should it error? Interoperability means parsers should have the same failure behaviours.
TOML is UTF-8 only. A parser which accepts invalid UTF-8 isn't compliant.
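One way a parser could enforce this is to decode the raw bytes strictly before doing anything else. The sketch below uses Python's built-in strict UTF-8 decoding; `decode_toml_bytes` is a hypothetical name, and failing with an error (rather than skipping the line) is the policy this answer implies, not something quoted from the spec.

```python
def decode_toml_bytes(data):
    """Hypothetical helper: decode raw TOML input, rejecting invalid UTF-8."""
    try:
        # errors="strict" raises instead of silently replacing bad bytes.
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"invalid UTF-8 at byte offset {exc.start}") from exc
```

Failing on the whole input keeps the behaviour identical across parsers, which is the interoperability concern raised in the question.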
Floats (doubles) don't represent the full spectrum of acceptable IEEE values: -Infinity, +Infinity and NaN are missing.
I think the initial TOML spec did include IEEE in it, but that claim is no longer there.
What should a compliant parser do when it encounters a float it can't represent? Similarly, there is no rounding mechanism specified for when a decimal float value doesn't fit neatly into a double.
TOML should be easy to parse. Would specifying such things impose a large burden? I suspect so. We're pretty much at the mercy of the floating point representation chosen by our programming language of choice. So we might want to leave this unspecified.
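To make the "mercy of the language" point concrete: in Python, converting an out-of-range decimal string does not fail, it quietly yields infinity. A parser that wanted to reject such input would have to check for it explicitly. The sketch below shows one possible policy; `parse_float_strict` is a hypothetical name, and the thread leaves the actual behaviour unspecified.

```python
import math


def parse_float_strict(text):
    """Hypothetical policy: reject values that do not fit a finite double."""
    value = float(text)  # rounding to the nearest double is left to Python
    if math.isinf(value) or math.isnan(value):
        raise ValueError(f"float not representable as a finite double: {text!r}")
    return value
```

Other languages may round or fail differently, which is why leaving this unspecified trades interoperability for parser simplicity.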
Datetimes: You may also want to refer to RFC 3339 over ISO8601. The RFC is more explicit about handling leap seconds and what information is required in a timestamp, and demands uppercase 'Z' and 'T' in datetimes, which toml doesn't mandate.
Covered by #189.
Keygroups/Keys trimming leading and trailing tab/spaces, and it is implied that these names will not have any whitespace (0x09 or 0x20) characters, but as mentioned, unicode's whitespace (including vertical tabs) are not forbidden or stripped. Note: Unicode defines 26 whitespace characters.
Both can have whitespace. The TOML spec defines whitespace only as tabs and spaces.
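The distinction matters in practice because many standard-library trimming functions use the Unicode definition of whitespace. In Python, for example, `str.strip()` with no argument also removes vertical tabs and no-break spaces, so a parser following the tabs-and-spaces definition would need to pass the character set explicitly. `trim_name` below is a hypothetical helper for illustration.

```python
TOML_WHITESPACE = " \t"  # the only whitespace characters, per the answer above


def trim_name(raw):
    """Hypothetical helper: trim only tab and space from a key or table name."""
    return raw.strip(TOML_WHITESPACE)


name = " \tfoo\u00a0"                  # ends with a no-break space
print(repr(trim_name(name)))           # 'foo\xa0'
print(repr(name.strip()))              # 'foo'
```

With the explicit character set, the no-break space stays part of the name instead of being silently dropped.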
Can escape sequences appear inside of key/keygroup names?
No. I think the spec is fine as is. It's clear that escape sequences only show up in strings. (With that said, if #220 is accepted and strings are allowed as keys, then escape sequences would obviously be allowed.)
Can a key/group have a space in the name? Is "[foo bar]" valid? is "foo bar = 1" valid?
Yes. Spec is clear, I think.
Can key groups have []'s in their name? Is "[[foo]]" a valid keygroup name, or "[[foo]", "[foo]]" ?
No. Fixed in #222.
How do you parse "foo = bar = 1"? is this the key "foo = bar" or assigning to two keys "foo" and "bar"? The spec doesn't say which equals sign to use when there is more than one, or if more than one is an error.
Spec is clear. Keys "end with the last non-whitespace character before the equals sign." This implies that = is not allowed in a key name. Since bar = 1 isn't a valid TOML value, foo = bar = 1 is invalid.
Can keygroups have empty names? I.e., is [] a valid keygroup?
To be fixed in #223.
Can keygroups have nothing between dots? Is [foo..bar] the keygroup for "foo", "", "bar"?
Clarified in #223.
Can keys be empty? Does "=1" parse as the key '' with the value 1?
Clarified in #223.
Can keys have the characters '[' or ']' inside of them. What if you have "[foo] = blah"?
Hopefully fixed with #224.
Can keys have '.' in the name? Would "foo.blah = 1" be the key 'foo.blah' or would it work like a keygroup?
Spec is clear I think. foo.blah = 1 maps to {"foo.blah": 1}. That is, . has no significance in key names (only table names).
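The difference between a dot in a key and a dot in a table name, as described in this answer, can be sketched with plain dictionaries. `set_key` and `open_table` are hypothetical helpers, and the rejection of empty components follows the clarification attributed to #223 in this thread.

```python
def set_key(table, key, value):
    """A key is used verbatim: dots have no special meaning."""
    table[key] = value


def open_table(root, dotted_name):
    """A table name is split on dots, creating nested tables as needed."""
    current = root
    for part in dotted_name.split("."):
        if not part:
            raise ValueError("empty table name component")
        current = current.setdefault(part, {})
    return current


doc = {}
set_key(doc, "foo.blah", 1)
print(doc)  # {'foo.blah': 1}

doc = {}
set_key(open_table(doc, "foo.blah"), "x", 1)
print(doc)  # {'foo': {'blah': {'x': 1}}}
```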
Can keys be numbers or dates? Can keygroup names be numbers or dates?
Yes.
What happens if you use the name 'true' or 'false' as a keygroup or a key? Does it use the string value or the implied boolean value?
Keys and table names are their own syntactic category that is distinct from values (which booleans are a part of). I think this is clear in the spec.


tef commented Feb 24, 2013
Is the format designed to allow non printable characters inside of it? The special characters pick a small set of ascii control codes to escape, but ignores the other C0/Unicode control codes.
I.e., why is tab escaped, but not vertical tab? Additionally, Unicode line breaks such as \u2028 (line separator) and \u2029 (paragraph separator) aren't escaped, but the ASCII CR and LF are.
Since strings are in UTF-8, you won't normally expect a BOM, but the spec doesn't say if toml files should/must/may have a BOM. It should also be clear on if any unicode normalisation should be applied/expected on the strings inside, or the key names.
What should a compliant parser do when it encounters an invalid utf-8 byte sequence? Does it skip the line? Should it error? Interoperability means parsers should have the same failure behaviours.
Floats (doubles) don't represent the full spectrum of acceptable IEEE values: -Infinity, +Infinity and NaN are missing.
What should a compliant parser do when it encounters a float it can't represent? Similarly, there is no rounding mechanism specified for when a decimal float value doesn't fit neatly into a double.
Note: If precision is required, perhaps C99 hex formatted floats should be allowed (or in c speak, scanf("%a"))
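For readers unfamiliar with the hex float notation mentioned here, Python exposes the same format through `float.hex()` and `float.fromhex()`. The sketch below only illustrates why the notation avoids rounding questions; `roundtrip_hex` is a hypothetical helper, and nothing in this thread says TOML adopted the format.

```python
def roundtrip_hex(value):
    """Write a double as a hex float string and read it back."""
    text = value.hex()           # exact textual form of the binary double
    return text, float.fromhex(text)


text, restored = roundtrip_hex(0.1)
print(text)
print(restored == 0.1)  # True: no decimal-to-binary rounding is involved
```

Because each hex digit maps directly onto mantissa bits, every parser reading the string recovers the same double.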
Datetimes: You may also want to refer to RFC 3339 over ISO8601. The RFC is more explicit about handling leap seconds and what information is required in a timestamp, and demands uppercase 'Z' and 'T' in datetimes, which toml doesn't mandate.
You ask for "Full Zulu" but your example skips the fractional seconds part of an ISO datetime, which is optional in the RFC above, but it isn't clear if a compliant parser should support them.
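The datetime shape under discussion can be sketched as a regular expression. This is an assumption-laden illustration: it follows this post's reading that 'T' and 'Z' must be uppercase, treats fractional seconds as optional, and checks only the shape of the string, not whether the month, day or time values are in range. `is_zulu_datetime` is a hypothetical name.

```python
import re

# Shape only: YYYY-MM-DDTHH:MM:SS, optional fractional seconds, then 'Z'.
ZULU_DATETIME = re.compile(
    r"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(\.[0-9]+)?Z"
)


def is_zulu_datetime(text):
    """Return True if text has the uppercase Zulu datetime shape."""
    return ZULU_DATETIME.fullmatch(text) is not None


print(is_zulu_datetime("1979-05-27T07:32:00Z"))      # True
print(is_zulu_datetime("1979-05-27t07:32:00z"))      # False
print(is_zulu_datetime("1979-05-27T07:32:00.999Z"))  # True
```

Whether a compliant parser must accept the third form is exactly the open question raised above.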
Keygroups/Keys trimming leading and trailing tab/spaces, and it is implied that these names will not have any whitespace (0x09 or 0x20) characters, but as mentioned, unicode's whitespace (including vertical tabs) are not forbidden or stripped. Note: Unicode defines 26 whitespace characters.
Keys and Keygroup names have a whole load of ambiguity and I have a lot of questions:
Can escape sequences appear inside of key/keygroup names?
Can a key/group have a space in the name? Is "[foo bar]" valid? is "foo bar = 1" valid?
Can key groups have []'s in their name? Is "[[foo]]" a valid keygroup name, or "[[foo]", "[foo]]" ?
How do you parse "foo = bar = 1"? is this the key "foo = bar" or assigning to two keys "foo" and "bar"? The spec doesn't say which equals sign to use when there is more than one, or if more than one is an error.
Can keygroups have empty names? I.e., is [] a valid keygroup?
Can keygroups have nothing between dots? Is [foo..bar] the keygroup for "foo", "", "bar"?
Can keys be empty? Does "=1" parse as the key '' with the value 1?
Can keys have the characters '[' or ']' inside of them. What if you have "[foo] = blah"?
Can keys have '.' in the name? Would "foo.blah = 1" be the key 'foo.blah' or would it work like a keygroup?
Can keys be numbers or dates? Can keygroup names be numbers or dates?
What happens if you use the name 'true' or 'false' as a keygroup or a key? Does it use the string value or the implied boolean value?