Permit more control characters in comments #924

Merged

Conversation

eksortso
Contributor

In reconsideration of the limitation of control codes within comments discussed in #567, a case was made to simplify comment parsing. This PR intends to replace those previously imposed restrictions with a simplified approach that permits anything except newlines.

@ChristianSi
Contributor

ChristianSi commented Oct 12, 2022

I think I could get on board with this. However, I don't understand why this seems to redefine the concept of a newline. The spec already clearly says: "Newline means LF (0x0A) or CRLF (0x0D 0x0A)."

So a comment ends at a newline, whether it's LF or CRLF. That's very clear and non-negotiable. So I don't understand what this language regarding "exclude a final carriage return" is meant to say.

Another issue is whether a lone CR (0x0D) should be allowed and ignored in comments. The PR currently does so, but I think it should rather be rejected. Some not fully compliant parsers might treat CR as a linebreak. That would allow smuggling data into a document that a compliant parser ignores as part of a comment. If such CRs are rejected, this risk is reduced, since the compliant parser would then reject the document altogether.

@ChristianSi
Contributor

ChristianSi commented Oct 12, 2022

So I would suggest to change the wording to:

All characters except line feed (0x0A) and carriage return (0x0D) are permitted in comments and will be ignored.

The ABNF would need to be adjusted as well.

@arp242
Contributor

arp242 commented Oct 12, 2022

Also, this should use U+ syntax instead of 0x for codepoints.

The way it's phrased now is kinda confusing; U+0A is the newline; this is already allowed in comments, because every comment ends with a newline.

In general, I don't really see the point of this though; no one raised any practical concerns over this ("I tried to do X but I can't and the specification limits me") and it just seems to introduce needless churn for implementers who will need to update the implementations, tests, etc. I would not be in favour of merging this.

@eksortso
Contributor Author

@ChristianSi I was not attempting to redefine the concept of newlines. But let me come back to that.

I expect comments to ignore everything to the end of the line. For practical purposes, that means starting from a free-standing hash character, scanning ahead until the first LF character (or the EOF) is found, and disposing of everything from the hash up to but not including that LF character. And by everything, I mean everything, including CR characters. Call it a "naive" approach to comment disposal.
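The "naive" approach described above could be sketched roughly as follows. This is only an illustration, not code from any TOML parser: `skip_comment` is a hypothetical helper, and it assumes the caller has already established that the `#` at `pos` is outside a string.

```python
def skip_comment(src: str, pos: int) -> int:
    """Return the index where scanning resumes after a comment.

    Hypothetical helper: assumes src[pos] is a '#' found outside any string.
    Everything up to, but not including, the next LF (or EOF) is discarded,
    CR characters included.
    """
    assert src[pos] == "#"
    end = src.find("\n", pos)
    return len(src) if end == -1 else end


doc = 'a = 1 # note\r\nb = 2 # last line, no newline'
first = skip_comment(doc, doc.index("#"))
print(repr(doc[first]))  # '\n'
```

The returned index points at the LF itself (or at EOF), so the newline is still handled by whatever code processes line endings.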

Your concern about "smuggling data into a document" would be mostly baseless, except for the existence of comment-preserving parsers. Such a parser cannot use the naive approach described previously, because if the TOML document uses CR LF for newlines, the final CR must be excluded from the comment registered by the parser. The last sentence of my change to toml.md was intended to allow for this. But since it does not seem necessary in retrospect, I'm deleting that sentence.

Strike the entire paragraph, and replace the first paragraph in the Comment section with the following, which simplifies the thing further:

A hash symbol marks the rest of the line as a comment, except when inside a
string. All code points except line feeds (U+000A) are permitted in comments.

I explicitly say "code points" here because that includes all characters, all ASCII control codes, all U+everything, except for a U+000A, which is a newline in TOML. It makes no sense to make an exception for U+000D, except to parsers that read all newlines first. And the authors of those parsers should be savvy enough to know what to keep if they're preserving a comment.

@arp242 This PR addresses a number of points raised by @abelbraaksma in #567, specifically getting rid of (almost) all restrictions on characters permitted within comments, because we do not want a document rejected just because a control character shows up in a comment.

And if you've ever copy/pasted Excel cells into a text widget before, you do come across stray CR characters far too often.

@arp242
Contributor

arp242 commented Oct 13, 2022

@arp242 This PR addresses a number of points raised by @abelbraaksma in #567, specifically getting rid of (almost) all restrictions on characters permitted within comments, because we do not want a document rejected just because a control character shows up in a comment.

And if you've ever copy/pasted Excel cells into a text widget before, you do come across stray CR characters far too often.

Yes, I read that; but I don't see any practical problems. And both LF and CRLF end of lines are already allowed; if the ABNF excludes that then that's a bug in the ABNF.

@eksortso
Contributor Author

And both LF and CRLF end of lines are already allowed; if the ABNF excludes that then that's a bug in the ABNF.

A CR LF line ending is still permitted, because if you follow the ABNF to the letter, the CR in the newline is considered part of the content of the comment; it's all in comment and gets picked up in *non-eol. What remains is the LF, which is still a newline. So the ABNF is not ambiguous. If you are preserving comments, you can exclude the final CR in the comment if the newline immediately follows.
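For a comment-preserving parser, the last sentence above might translate into something like this sketch. The name `comment_text` is made up for illustration, and the function assumes `pos` points at a `#` that starts a comment.

```python
def comment_text(src: str, pos: int) -> str:
    """Return the comment starting at src[pos], without its line ending."""
    end = src.find("\n", pos)
    if end == -1:
        return src[pos:]  # comment runs to EOF, nothing to trim
    text = src[pos:end]
    # If the line ends in CRLF, the CR belongs to the newline, not the comment.
    if text.endswith("\r"):
        text = text[:-1]
    return text


print(repr(comment_text("# hi\r\nx = 1", 0)))  # '# hi'
```

Only a CR immediately before the LF is dropped; a CR elsewhere in the comment is kept as comment content, which is the behavior the rest of this thread goes on to debate.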

@ChristianSi
Contributor

ChristianSi commented Oct 13, 2022

OK, I think I can live with lonely CR's being ignored.

And I agree with @arp242 that even the wording

All code points except line feeds (U+000A) are permitted in comments.

is still a bit confusing – after all, if you add an LF to a comment, you don't produce an error, but simply close the comment. Maybe it would be better to express this as something like:

All valid code points are permitted in comments (but keep in mind that a linebreak – LF or CRLF – terminates the comment).

Another issue that needs to be addressed is Unicode validity. Wikipedia writes:

UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units.

Explaining the calculation in a footnote:

17 planes times 2^16 code points per plane, minus 2^11 technically-invalid surrogates.
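The footnote's arithmetic can be checked directly. The constant names below are only for illustration.

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 2**16
SURROGATES = 0xDFFF - 0xD800 + 1  # 2**11 == 2048

scalar_values = PLANES * CODE_POINTS_PER_PLANE - SURROGATES
print(scalar_values)  # 1112064
```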

What to do with these surrogate code points? They are "technically invalid" and have no place in a well-formed UTF-8 string. Hence they are not allowed in TOML strings, for good reasons. But should their inclusion in a comment be tolerated, or should it cause the document to be rejected as invalid? As I argued earlier, it might be reasonable to allow either behavior, stating something like:

Surrogate code points (U+D800 to U+DFFF) should never be used in a TOML document, not even in comments. Parsers encountering such a code point may reject the document, but don't have to do so.

There may be a few other code points to which the same reasoning applies, but I'll leave that to the Unicode experts.

@eksortso
Contributor Author

@ChristianSi You said:

Maybe it would be better to express this as something like:

All valid code points are permitted in comments (but keep in mind that a linebreak – LF or CRLF – terminates the comment).

That makes sense. In fact, let's remove the parentheses. The definition of a newline immediately precedes this section, so it ought to be clear why we mention LF and CRLF. Here's my reworked version.

All valid code points are permitted in a comment, but keep in mind that a LF
or CRLF terminates the comment.

I weighed the value of keeping the word "valid" in there. I was presuming that the document was already verified as being valid UTF-8, meaning no illegal bytes and no byte streams representing surrogate code points (among others) would be present. If a TOML document is read into a string for processing, then validation is performed before the hash symbol is even processed as the start of a comment. But you are coming from a perspective in which the validity of the document's byte stream is yet to be verified. For languages that live close to the bare metal, that could be the case. But surrogates are always invalid in UTF-8, and if they are present in any text document, then that's a bigger problem than a TOML parser ought to handle. Certainly bigger than isolated CRs from Excel.

Invalid byte streams, including surrogate code points, should be rejected outright. Parsers should not tolerate them in comments, as they should not tolerate invalid UTF-8 documents in general. But we can still say "valid" code points, so there's no ambiguity whatsoever.

I will make one modification to the ABNF: rename non-eol to non-linefeed. It isn't used anywhere else, and the new name doesn't presume the line ending. I tested it in Instaparse Live, which was smart enough not to suck in the CR in the CRLF into the comment.

@eksortso eksortso marked this pull request as ready for review October 14, 2022 03:31
@hukkin
Contributor

hukkin commented Oct 14, 2022

As an implementation maintainer I'm slightly opposed to this change simply because it seems this isn't something that users are asking for but rather something that toml-lang contributors think would be an improvement. Seems like potential busywork for implementation maintainers.

I'm also concerned about allowing lone CR in comments. Assuming that text editors render lone CR as a line ending, I think it creates an opportunity to craft valid TOML documents that look different to the human eye than to a compliant parser. For example:

# All line endings in this document are lone CRs
a = "hello"
b = "world"

Assuming that all line endings in the document are CRs, the document is still valid. It is a single-line TOML document consisting of one comment. I.e., the content, when read by a compliant parser, is an empty mapping {}. To the human eye it looks like {"a": "hello", "b": "world"} though.
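The mismatch can be illustrated without a TOML parser. In this sketch, splitting on LF or CRLF stands in for how a compliant parser finds lines, and Python's `str.splitlines()`, which also breaks on a lone CR, stands in for an editor that renders CR as a linebreak. Both stand-ins are assumptions made for the demonstration.

```python
import re

doc = '# All line endings in this document are lone CRs\ra = "hello"\rb = "world"\r'

# TOML newlines are LF or CRLF only, so no split happens here.
toml_lines = re.split(r"\r?\n", doc)
# str.splitlines() treats a lone CR as a line boundary.
visual_lines = doc.splitlines()

print(len(toml_lines))    # 1
print(len(visual_lines))  # 3
```

The single "TOML line" starts with `#`, so under the wording proposed in this PR the whole document would be one comment.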

@eksortso
Contributor Author

Assuming that all line endings in the document are CRs, the document is still valid

The only places where I've seen single CRs as line endings are in text documents on pre-OSX MacOS systems and in text cells in Excel. Since we are not creating the TOML standard for twenty-year-old Macs, uncoupled CRs created by text editors ought to be a rare occurrence.

When comments were first defined in TOML, there were no restrictions on control codes in comments. That was changed a few years ago when all control codes except horizontal tabs were excluded.

Was this a burden for parser writers to handle when it was introduced? A little, though I'd have to check if they were doing it right. Was it a limitation that users were asking for? It was not, based on what I've read here.

Will existing parsers have to relax these limitations if this PR is merged? Well yes, but they would also be making changes to implement other new features in the future TOML v1.1.0.

Do I have a stake in this? Just in how users copy whole Excel cells into text widgets and the pain I've experienced in stripping loose CRs out of database columns when they are not properly removed or converted. Naive users won't be flummoxed by CRs in comments any longer. This PR doesn't address single CRs showing up anywhere else though.

@hukkin
Contributor

hukkin commented Oct 14, 2022

The only places where I've seen single CRs as line endings are in text documents on pre-OSX MacOS systems and in text cells in Excel. Since we are not creating the TOML standard for twenty-year-old Macs, uncoupled CRs created by text editors ought to be a rare occurrence.

I agree lone CRs are a rare occurrence. My concern is someone with malicious intent creating a TOML document that looks different to the human eye than to a compliant parser. The status quo, where the compliant parser errors, may be safer than what this PR proposes.

@ChristianSi
Contributor

I think @hukkin's concern is valid if editors tend to interpret lone CR's as newlines. What you see should be what you get, so a TOML file that has different linebreaks in a text editor than what a parser sees is not a good idea.

So what do text editors do? I saved @hukkin's CR-only example from above as a text file:

with open('test.txt', 'w') as f:
    f.write('# All line endings in this document are lone CRs\ra = "hello"\rb = "world"\r')

and opened it in a few editors, with the following results:

  • vim/gvim renders them as ^M, not linebreak
  • libreoffice ignores them altogether
  • nano renders them as linebreaks and shows "Converted from Mac format"
  • gedit renders them as linebreaks and doesn't show any warning, as far as I could see.

You are all invited to make the same experiment with the editors you commonly use or can think of, and report back.

For me, I must say that two out of four is too close for comfort. Gedit's behavior is especially problematic, and since it's the default GUI text editor in Ubuntu (the one that opens when you double-click a text file), I assume it's pretty widely used.

Therefore I return to my earlier position that lone CRs should remain forbidden in comments, even if all other control characters are allowed.

@eksortso
Contributor Author

eksortso commented Oct 16, 2022

@ChristianSi I tried basically the same thing in several Windows 10 text editors. I altered your code slightly, writing the same file in binary mode:

with open('testCR.txt', 'wb') as f:
    f.write(b'# All line endings in this document are lone CRs\ra = "hello"\rb = "world"\r')

When I opened them, these were the results:

  • Windows 10 Notepad renders them as linebreaks and shows "Macintosh (CR)" as the line ending.
  • Notepad++ does the same thing. The line ending status allows you to modify the line ending to "Windows (CR LF)" and "Unix (LF)" and back to "Macintosh (CR)".
  • VS Code, out of the box, renders the CRs as linebreaks but incorrectly identifies the line ending as "CRLF". The hex editor does show the 0D codes correctly though. The line ending status will allow you to switch to LF; switching to CRLF does nothing immediately. There's no option to return to CR.
  • Notepad2 (an old reliable workhorse) renders the CRs as linebreaks and identifies the line ending as "CR".
  • micro (a new reliable workhorse) incorrectly renders the CRs as whitespace and also misidentifies the line ending as "unix". Among these editors, micro is the newest, so the disregard for CRs can't be unexpected.

I also ran type testCR.txt in both cmd and PowerShell 7. In cmd, typing out the document caused each subsequent line to overwrite the beginning of the previous line, adhering closely to CR's original meaning as a carriage return without a line feed. In PowerShell 7, the CR's were treated as linebreaks.

The mixed behaviors are discomforting. So despite my misgivings expressed previously, I'm changing my mind. Lone CRs ought to be forbidden in comments.

But @hukkin's position is to retain the status quo. Keep all control codes except tab out of comments.

Let me run one more acid test. What happens when the DEL control code, U+007F, is present?

with open('testDEL.txt', 'w') as f:
    f.write("""\
# All lines in this document end with a DEL character.\x7f
a = "hello"\x7f
b = "world"\x7f
""")

Notepad and Notepad2 show the DEL character as an unknown character, with a rectangular placeholder. VS Code and Notepad++ show the character as a "DEL" symbol, the former even displaying the symbol in red. micro shows them as whitespace characters. type testDEL.txt showed DEL as a pentagon-shaped placeholder when run in both cmd and PowerShell 7.

@ChristianSi @abelbraaksma You may want to test how these appear in *nix environment text editors, and how they appear when you cat or bat them. (I ought to install WSL2 someday.)

The similarities in DEL's appearances, though less glaring, still suggest that allowing other control characters in comments may still be sensible. But I am starting to think that, per @hukkin, keeping the status quo is the best route for us to take. I'm not about to close this PR though; in fact, I will disallow CR characters and replace the "0x"s with "U+00"s in toml.md where appropriate, per @arp242.

@abelbraaksma I'm beginning to consider closing this PR, so if you want to continue making your case, now's the time to do it.

As for anyone else with a stake in this, including @pradyunsg @BurntSushi @mojombo, what are your thoughts?

@marzer
Contributor

marzer commented Oct 16, 2022

I haven't really been following this discourse, but FWIW my position is this:

  • status quo bad
  • but lone CR also bad

@abelbraaksma
Contributor

abelbraaksma commented Oct 16, 2022

@abelbraaksma I'm beginning to consider closing this PR, so if you want to continue making your case, now's the time to do it.

@eksortso It's been a while since I wrote my original opinion. But from your comments above, I think we have basically two things to consider:

  • Line ending normalization
  • Any Unicode character allowed in comments

Thoughts on line endings

I think the first point above belongs to editors. However, if you compare it to some W3C standards, they are liberal in what they accept, but require normalization. That is, they allow CRLF, lone CR, lone LF. Not all programming platforms do this automatically, though (I mean, in general, not specific to XML or anything W3 related).

However, the algorithm is ridiculously simple:

  • All line endings must be normalized before parsing
  • CRLF is interpreted as single LF
  • any CR not followed by an LF is interpreted as a single LF

Henceforth, line endings are normalized, and whatever we call a newline or carriage return in Unicode is interpreted as such and always starts a new line.
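The normalization steps listed above are short enough to sketch in full. This shows the proposal as described in this comment, not what the TOML spec does; `normalize_newlines` is a hypothetical name.

```python
def normalize_newlines(text: str) -> str:
    """Map CRLF to LF first, then any remaining lone CR to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


print(repr(normalize_newlines("a\r\nb\rc\nd")))  # 'a\nb\nc\nd'
```

The order matters: replacing lone CRs first would turn each CRLF into two LFs.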

In practice, this would mean removing the following from the ABNF:

newline =  %x0A     ; LF
newline =/ %x0D.0A  ; CRLF

and making newlines implicit (i.e., part of the prose, not the ABNF). Or, conversely, since alternatives in parsing are ordered, this could work, I think (which, btw, already suggests that the current code is actually incorrect?):

newline =  %x0D.0A  ; CRLF
newline =/ %x0D     ; CR
newline =/ %x0A     ; LF

Thoughts on control characters

Currently, it is clear to me that comments are overly restrictive (i.e., control characters are absent), but also too lenient (illegal unicode characters are allowed). The range, as it currently stands is:

comment-start-symbol = %x23 ; #
non-ascii = %x80-D7FF / %xE000-10FFFF
non-eol = %x09 / %x20-7E / non-ascii

comment = comment-start-symbol *non-eol

If we expand this, we get:

allowed-comment-char = %x09 / %x20-7E / %x80-D7FF / %xE000-10FFFF

Let's break this down:

  1. Control characters (the current discussion) are disallowed, but my point in the original issue was: "encounter #, then read to EOL" is easier to understand, and at least parsers won't need to raise to a user something like: "hey, you wrote a comment, and we know you just want to ignore whatever is in there, but we cannot do that, there are invisible characters that we don't like, please remove them".

  2. %7F is the DEL character. This is just a valid Unicode character, albeit without a glyph, but that's true for so many. I'd just allow it.

  3. %D800 - %xDFFF, the so-called "surrogate block". These have special meaning, but only in UTF-16, when encountered as literal bytes in the stream. However, they themselves can be encoded just fine, both in UTF-16 and UTF-8, so there is no specific need to disallow them. See below for a better explanation.

  4. %FFFE and %FFFF. Of all the noncharacters in Unicode, these are special, as they are used as BOM (byte order mark). It is certainly discouraged to use these in any text document. Even in UTF-8, it often appears at the start to signal that the document should be interpreted as UTF-8 and not ASCII or Latin-1 or something. But while they may be special when appearing at the start of a document, nothing prevents UTF-8 from encoding them, even though the characters themselves have no meaning (unless software assigns special meaning). This used to be different, but has changed long ago, see section 9.3 in these minutes: https://www.unicode.org/wg2/docs/n2403.pdf.

  5. Other noncharacters: basically anything that contains FFFE or FFFF is considered a noncharacter in Unicode. I.e. %2FFFE and %10FFFE are noncharacters. Again, their use is discouraged, because both UTF-16 and UTF-32 use these for endianness detection (i.e., the byte sequence FE FF signals big endian and FF FE little endian). But again, in UTF-8, there's nothing wrong in encoding these, it's just bad practice to use them.

  6. %FFFD the Replacement Character. This one is currently allowed, but problematic in many ways. A lot of editors, when encountering invalid encodings (not to be confused with invalid code points), replace these with this character. It is used for cases where the parser basically says "I don't know what you mean". Since it's already allowed, it should stay that way, I guess.

  7. %00. The NUL character is allowed in Unicode, but many standards disallow it explicitly. The reason is simple: a lot of programming languages represent a string in memory using a NUL delimiter, like C/C++. While this may not be the case for many other languages, there have been recorded cases where DoS and buffer overflow attacks were successful using NUL in Unicode strings. TLDR: I'd vote for certainly still disallowing this character anywhere in the document.

  8. %85, %2028 and %2029 (see https://en.wikipedia.org/wiki/Newline#Unicode). Different standards give these different meanings. %85 is NEL and that code point is special as it maps to some old EBCDIC newline character. The others were "invented" for Unicode and I've rarely seen them in practice. Since we currently don't consider these newlines, I don't think we can change that (or should even consider).

  9. %0B and %0C are Vertical Tab (VT) and Form Feed (FF). While modern editors mostly ignore these (and just display some replacement char), there will be editors out there that "correctly" interpret these the ASCII way. That is: apply a form feed (new line/page/para) or vertical tab (never seen this in practice). It is probably best to continue to disallow these anywhere in the TOML document.

Conclusion?

Sorry for the long post (it didn't set out that way!). Basically, we have this situation:

  • We already allow characters that are strongly discouraged to appear in Unicode
  • We disallow characters that Unicode has absolutely no problem with

My vote: let's take the union of these two and simplify parsing. A comment can just contain any Unicode character, with just two exceptions: line endings (including FF and VT), and NUL. I.e., it would become this:

allowed-comment-char = %x01-09 / %x0E-10FFFF

Or alternatively (that is, allowing FF and VT, as these are really just a remnant of the past):

allowed-comment-char = %x01-09 / %x0B / %x0C / %x0E-10FFFF

Meanwhile, I do think we should correct the newline handling.

I realize this may be a little too lenient for some, but I see no harm in doing so. Any existing correct TOML file out there still parses just fine, and once this is adopted, parsers won't have to treat comments any different than "ignore from # to EOL".

@ChristianSi
Contributor

ChristianSi commented Oct 17, 2022

Hmm, I don't think we should allow lone CR's as linebreaks – quite simply, since we had discussed that in the past and decided against it. I'm not sure where that was, but anyone willing to search should be able to find it. And just reversing one's position every few years without really new arguments having come up is not a good thing.

More specifically, there are currently no TOML files that use CR's as linebreaks. And since this linebreak convention isn't used anywhere anymore, except maybe on some computers that have long ago ended up in museums, there is just no reason why they should ever come into existence.

As for @abelbraaksma's suggestions on allowed-comment-char, the first variant seems fine to me. But CR shouldn't be interpreted as linebreak, it should simply remain forbidden (and hence trigger an error, just like FF, VT, and NUL).

@abelbraaksma
Contributor

abelbraaksma commented Oct 17, 2022

@ChristianSi thanks for clarifying. I’m not aware of the old decisions, I wholeheartedly agree to staying consistent.

I have seen such files in the wild, usually as a result of bad switching between platforms or editors mistreating LFs (try opening such files in Notepad, then manually adding newlines on top of those).

But bad editing is no reason for allowing such things, I agree. And the times that DNX had LFCR (as opposed to CRLF) is indeed far in the past.

@eksortso
Contributor Author

I'll use @abelbraaksma's first allowed-comment-char variant. So we will be disallowing NUL, CR, FF, VT, and LF in comments (though the last one won't generate an error because it would simply end the line).
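As a rough sketch of what this rule asks of a parser, the check below scans a comment for the code points named above. It is an illustration under stated assumptions, not the merged ABNF: `first_forbidden_in_comment` and `FORBIDDEN_IN_COMMENT` are hypothetical names, the input is assumed to be text already decoded from valid UTF-8, and the line ending is assumed to have been removed by the caller.

```python
from typing import Optional

# Code points rejected inside comments per this comment: NUL, LF, VT, FF, CR.
# LF never triggers an error in practice because it ends the comment first.
FORBIDDEN_IN_COMMENT = {0x00, 0x0A, 0x0B, 0x0C, 0x0D}


def first_forbidden_in_comment(comment: str) -> Optional[int]:
    """Return the index of the first forbidden code point, or None."""
    for i, ch in enumerate(comment):
        if ord(ch) in FORBIDDEN_IN_COMMENT:
            return i
    return None


print(first_forbidden_in_comment("# tab\tand DEL \x7f are fine"))  # None
print(first_forbidden_in_comment("# bad\rdata"))                   # 5
```

Note that the comment still has to be inspected one code point at a time; the set of rejected characters is smaller, but scanning is not avoided.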

The matter of normalizing line endings (just LF and CRLF in TOML's case) is appealing to me, especially since we're going to someday form an RFC based on the TOML spec, per #870. But seeing as text editors can unintentionally mix line endings, I won't push it.

@eksortso
Contributor Author

A few more notes. I do not like surrogate code points being encoded in UTF-8, because it is in fact illegal to have such byte streams in UTF-8. And in fact, we banned surrogates from TOML documents in #619 three years ago with the introduction of non-ascii in the ABNF.

But to simplify comment scanning while disallowing the historic line-breaking ASCII codes (please help: what's a better name for the collected set of codes CR, FF, VT, and LF?), I'll use the new allowed-comment-char = %x01-09 / %x0E-10FFFF, and relocate non-ascii elsewhere closer to where it is first used.

Expect a little rearrangement.

@eksortso
Contributor Author

And now that I think about it, I ought to strip out the word "valid" in "All valid code points" since we're allowing invalid code points in comments now. But somebody give me a better description of what %x0A through %x0D should be called collectively.

Fix newline description

Co-authored-by: Taneli Hukkinen <3275109+hukkin@users.noreply.github.com>
@abelbraaksma
Contributor

abelbraaksma commented Oct 18, 2022

I wrote this about surrogate codepoints:

However, they themselves can be encoded just fine, both in UTF-16 and UTF-8, so there is no specific need to disallow them.

This isn't (entirely) correct; in fact, it's just wrong, except for the conclusion. I remembered things wrongly and didn't do enough research, apologies. The short summary of a long reading of specs:

  • In the past, surrogate characters where encoded as two pairs of three UTF-8 code units (bytes). There is still software out there that does this, but it is considered invalid. A historical Unicode encoding, CESU-8, supports this, though. UTF-8 readers encountering this should either signal with an error, or display the U+FFFD (replacement char, or ).
  • Proper encoding of single surrogate characters (not being in pairs) does not exist (unless historically, CESU-8), and when encountered, should be replaced, again, by U+FFFD.
  • Proper encoding of proper pairs of surrogates, encountered as transcoding from a UTF-16 stream, must be encoded as a single 4-unit (byte) UTF-8 stream. It is illegal to do otherwise (illegal from a standards perspective, I doubt you'll be persecuted).
  • Encoding of surrogate code-points as themselves in UTF-16 is not possible. They have special meaning.
  • Lone surrogate code code-points in UTF-16 are, again, illegal. They can either be signaled, or replaced by .

What does this mean for us?

Not much, really. TOML currently requires encoding as UTF-8. What this means is that it is impossible to encounter surrogate characters.

If you do encounter them, you'll encounter them prior to reading them as UTF-8. By the time your stream turns from a byte stream into a proper UTF-8 stream, those (illegal) surrogate pairs are no longer there, they must be replaced by , or, alternatively, have raised an error: "Illegal file encoding detected", or something like that.
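Python's built-in UTF-8 codec happens to behave the way this paragraph describes, which makes for a quick demonstration. The bytes below are the three-byte form of the lone surrogate U+D800, which UTF-8 forbids; the variable names are illustrative only.

```python
# Three-byte encoding of the surrogate U+D800 inside a comment; not valid UTF-8.
raw = b"# comment \xed\xa0\x80\n"

try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decode substitutes U+FFFD instead of raising.
lenient = raw.decode("utf-8", errors="replace")

print(strict_ok)            # False
print("\ufffd" in lenient)  # True
```

Either way, no surrogate code point survives decoding, so a parser working on decoded text never sees one.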

So, what does this really mean for us?

We should not treat surrogate characters as special. They don't exist. The only place where we should disallow them is in escape sequences. I.e., you should not be allowed to write \uD800 in a string. (And we do disallow that; we actually point to the definition of Unicode scalar values.)

If we still have remnants of ABNF dealing with surrogates, we should really (probably?) remove them. Or we should remove the requirement for TOML being valid UTF-8, in which case it makes sense to embed the Unicode rules in the spec.


For reference, this FAQ summarizes the above as well: UTF-8 FAQ

@pradyunsg pradyunsg mentioned this pull request Oct 26, 2022
@eksortso
Contributor Author

eksortso commented Oct 27, 2022

@pradyunsg Could we split the changes into two separate PRs? The Unicode language changed a lot. Suppose we could isolate those changes and address them separately?

UPDATE: Already isolated those changes. See #929. If that's merged, then this PR will be greatly simplified.

eksortso added a commit to eksortso/toml that referenced this pull request Oct 27, 2022
eksortso added a commit to eksortso/toml that referenced this pull request Oct 27, 2022
@eksortso eksortso force-pushed the reallow-control-characters-in-comments branch from a338fb0 to 0dade12 Compare November 7, 2022 21:52
@eksortso
Contributor Author

eksortso commented Nov 7, 2022

I stripped this PR back down to its essentials, that is, the re-allowance of most control characters inside comments. Anything pertaining to Unicode language changes was already moved over to #929, which can be reviewed separately.

@ChristianSi @abelbraaksma Want to take another look?

@pradyunsg Can you review this PR again?

Contributor

@ChristianSi ChristianSi left a comment

LGTM!

Contributor

@abelbraaksma abelbraaksma left a comment

Looks good! Let’s finish this! :)

@eksortso
Contributor Author

@pradyunsg This is a lot simpler since the last time you looked at it. We're allowing all but five (UTF-8-valid) code points in comments. Is this good to merge?

@eksortso
Contributor Author

eksortso commented Jan 7, 2023

@pradyunsg Could you please review this and make a decision? It's been two months since the last reminder.

Member

@pradyunsg pradyunsg left a comment

LGTM, with a minor nit-pick. Thanks for a very extensive discussion here folks. ^>^

eksortso and others added 2 commits January 10, 2023 15:08
Sure thing. I didn't know this was a style guideline, but I know now.

Co-authored-by: Pradyun Gedam <pradyunsg@gmail.com>
@pradyunsg pradyunsg changed the title Relax comment parsing, per discussion on #567 Permit more control characters in comments Jan 10, 2023
@pradyunsg pradyunsg merged commit ab74958 into toml-lang:main Jan 10, 2023
@pradyunsg
Member

Thanks @eksortso! ^.^

@eksortso eksortso deleted the reallow-control-characters-in-comments branch January 10, 2023 21:33
@abelbraaksma
Contributor

Well done, great work!

arp242 added a commit to arp242/toml that referenced this pull request Oct 1, 2023
This reverts commit ab74958.

I'm a simple guy. Someone reports a problem, I fix it. No one reports a problem? There is nothing to fix so I go drink beer.

No one really reported this as a problem, so there isn't anything to fix. But it *does* introduce entirely needless churn for all TOML implementations. Do we need to forbid *anything* in comments? Probably not. In strings we probably only need to forbid \x00. But at least before it was consistent with strings, and more importantly, what everyone wrote code for, which is tested, and already works.

And [none of the hypotheticals](toml-lang#567 (comment)) on why this is "needed" are practical issues people reported, and most aren't even fixed: a comment can still invalidate the file, you must still parse each character in a comment as some are still forbidden, the performance benefits are so close to zero they might as well be zero, and you still can't "dump whatever you like" in comments.

So it doesn't *actually* change anything, it just changes "disallow this set of control characters" to ... another (smaller) set. That's not really a substantial change. The only (minor) real-world issue that was reported (from the person doing the Java implementation) was that "it's substantially more complicated to parse out control characters in comments and raise an error, and this kind of strictness provides no real advantage to users". And that's not addressed at all with this.

---

And while I'm at it, let me have a complaint about how this was merged:

1. Two people, both of whom actually maintain implementations, say they don't like this change.
2. This is basically ignored.
3. Three people continue writing a fairly large number of extensive comments, so anyone who wasn't already interested in this change unsubscribes and/or goes 🤷
4. "Consensus".

Sometimes I feel TOML attracts people who like to argue things from a mile-high ivory tower with abstract arguments that have only superficial bearing on actual pragmatic reality.

Fixes toml-lang#995
arp242 added a commit to arp242/toml that referenced this pull request Oct 1, 2023
This reverts commit ab74958.

I'm a simple guy. Someone reports a problem, I drink coffee and fix it. No one reports a problem? There is nothing to fix and I go drink beer.

No one really reported this as a problem, but it *does* introduce needless churn for all TOML implementations and the test suite. Do we need to forbid *anything* in comments? Probably not, and in strings we probably only need to forbid \x00. But at least before it was consistent with strings, and more importantly, what everyone wrote code for, which is tested, and already works.

[None of the hypotheticals](toml-lang#567 (comment)) on why this is "needed" are practical issues people reported, and most aren't even fixed: a comment can still invalidate the file, you must still parse each character in a comment as some are still forbidden, the performance benefits are so close to zero they might as well be zero, and you still can't "dump whatever you like" in comments.

So it doesn't *actually* change anything, it just changes "disallow this set of control characters" to ... "disallow this set of control characters" (but for a different set). That's not really a substantial or meaningful change. The only (minor) real-world issue that was reported (from the person doing the Java implementation) was that "it's substantially more complicated to parse out control characters in comments and raise an error, and this kind of strictness provides no real advantage to users". And that's not addressed at all with this, so...

---

And while I'm at it, let me have a complaint about how this was merged:

1. Two people, both of whom actually maintain implementations, say they don't like this change.
2. This is basically ignored.
3. Three people continue writing a fairly large number of large comments, so anyone who wasn't already interested in this change unsubscribes and/or goes 🤷
4. "Consensus".

Sometimes I feel TOML attracts people who like to argue things from a mile-high ivory tower with abstract arguments that have only passing familiarity with any actual pragmatic reality.

Fixes toml-lang#995

7 participants