grammar clarifications regarding tabs and carriage returns #38

andrewrk · 2021-07-08T01:33:33Z

NL = New Line (0x0a)
CR = Carriage Return (0x0d)
TAB = Tab (0x09)

Inside Line Comments and Documentation Comments

Any TAB is rejected by the grammar since it is ambiguous how it should be rendered.
CR directly preceding NL is unambiguously part of the newline sequence. It is accepted by the grammar and removed by zig fmt, leaving only NL.
CR anywhere else is rejected by the grammar.

Inside Multi-Line String Literals

zig fmt may not mangle multi line string literals, and therefore the control character TAB are rejected by the grammar inside multi-line string literals.
CR inside the multiline string literal is also rejected for the same reason
However CR directly before NL is interpreted as only a newline and not part of the multiline string. zig fmt will delete the CR.

For string literals that want to include CR, TAB, or any other control sequences, they will need to use regular string literals and the ++ operator, or @embedFile.

Whitespace

TAB used as whitespace is unambiguous. It is accepted by the grammar and replaced by the canonical whitespace by zig fmt.
CR used as whitespace, whether directly preceding NL or stray, is still unambiguously whitespace. It is accepted by the grammar and replaced by the canonical whitespace by zig fmt.

The text was updated successfully, but these errors were encountered:

vrischmann · 2021-07-15T10:12:48Z

This

zig fmt may not mangle multi line string literals, and therefore the control characters TAB and CR are rejected by the grammar inside multi-line string literals.

causes a problem when using git on windows, unless I'm missing something.

If you clone this gist (which contains no CR) on Windows with the default git configuration, it produces this:

PS C:\Users\vincent\dev\tmp\zig-invalid-bytes2> xxd .\main.zig
00000000: 666e 2063 7265 6174 6554 6573 7454 6162  fn createTestTab
00000010: 6c65 7328 2920 2176 6f69 6420 7b0d 0a20  les() !void {..
00000020: 2020 2063 6f6e 7374 2062 6172 203d 2026     const bar = &
00000030: 5b5f 5d5b 5d63 6f6e 7374 2075 387b 0d0a  [_][]const u8{..
00000040: 2020 2020 2020 2020 5c5c 666f 6f62 6172          \\foobar
00000050: 0d0a 2020 2020 7d3b 0d0a 2020 2020 5f20  ..    };..    _
00000060: 3d20 6261 723b 0d0a 7d                   = bar;..}

which does not compile:

PS C:\Users\vincent\dev\tmp\zig-invalid-bytes2> zig build-exe .\main.zig
.\main.zig:3:17: error: expected ',', found invalid bytes
        \\foobar
                ^
.\main.zig:3:18: note: invalid byte: '\n'
        \\foobar
                 ^

it is the intended behaviour but IMO it's surprising. Currently as far as I can see you need to either configure git with core.autocrlf=false or use a gitattributes files.

thejoshwolfe · 2021-07-15T13:58:15Z

It still surprises me that Git's default configuration is to give you the wrong bytes on windows. Can you clarify which client you're using? is it provided by GitHub? by git-scm.com? by a cygwin package manager (probably not this one)?

As much as I'd like to say that Git is wrong (which i kinda did in the last paragraph), the conflict is arising from the concept of a "text file", which many people think is an array of lines rather than a single byte array. Sometimes discussions of character encoding get tangled up in the concept of a "text file" too.

But Zig is not interested in all the complexity of "text files", and specifies always UTF-8, always spaces over tabs, always LF line endings. There's some room for accepting and canonicalizing (in zig fmt) unambiguous alternatives for everyone's convenience, but CRLF line endings are officially the wrong way to end lines in Zig (along with NEL, LS, and PS). While it bothers me emotionally that some Git cient on Windows is wasting our time here talking about CRLF line endings, it seems important to get this right to avoid even more wasted time for everyone in the future.

With that preamble ramble out of the way: i think we should definitely canonicalize CRLF line endings in multiline string literals to LF line endings.

vrischmann · 2021-07-15T14:31:29Z

I'm using the installer from git-scm.

Vexu · 2021-07-15T14:53:06Z

Couldn't we just add an error like file uses tabs/carriage returns; run 'zig fmt {path-to-file}' to convert it to the canonical form. It would allow us to completely ban tabs/CRLF while also providing a pretty decent user experience.

If mangling the files is an issue, the feature can be put behind a --convert-whitespace flag.

nektro · 2022-04-28T23:48:11Z

could even make zig build call zig fmt on the whole directory before running, or otherwise tell them to run it on the whole folder since its likely not the one file thats mangled

andrewrk · 2022-10-31T00:24:41Z

Couldn't we just add an error like file uses tabs/carriage returns; run 'zig fmt {path-to-file}' to convert it to the canonical form. It would allow us to completely ban tabs/CRLF while also providing a pretty decent user experience.

Yes, I'm thinking along the same lines. Make it a hard error, just like unused locals, and then have --autofix (as well as zig fmt) make the compilation "just work" while modifying the source files to comply.

rockorager · 2024-08-03T01:14:21Z

For string literals that want to include [...] any other control sequences, they will need to use regular string literals and the ++ operator, or @embedfile.

This is specific to multi-line string literals, right? I can still have const seq = "\x1b[31m";? Or potentially any other string literal that contains 0x00...0x1F, except for CR, NL, and Tab?

castholm · 2024-08-05T21:50:37Z

This is specific to multi-line string literals, right? I can still have const seq = "\x1b[31m";? Or potentially any other string literal that contains 0x00...0x1F, except for CR, NL, and Tab?

Yes, this clarification is specifically for multi-line string literals. Regular string literals can of course still use escape sequences to encode characters/byte sequences like control characters or invalid UTF-8 sequences that are illegal in Zig source code. But multi-line string literals don't support escape sequences and thus have no way to encode HT, LF, CR or other illegal characters, so for those rare cases you are need to concatenate the multi-line string literal with a regular string literal containing the necessary escape sequences, or put the text data in a separate file and import it using @embedfile.

PanSashko · 2024-09-08T11:09:23Z

The restriction to use tabs in comments and multi line strings is frustrating.

What is "rendering" and why does it matter?

It is a really annoying change for any project where there are comments or multi line strings with tabs. But what are real benefits for others who do not care?

thejoshwolfe · 2024-09-08T13:46:07Z

@PanSashko is your project where you want tabs in comments and strings something you can link to? It might help better understand the pros and cons to see a real example.

PanSashko · 2024-09-08T15:11:59Z

@thejoshwolfe unfortunately I cannot publish it yet. But I can describe 2 examples.

Imagine a pet language that is indent-based like Python. Let's say it supports tabs or spaces for indentation. Then, e.g., parser tests should cover both cases. Some small tests are using multi line strings for embedding source code under test.
An obvious workaround for that case is @embedFile, but why?
With the restriction not all UTF-8 strings can be represented in Zig's multi line strings.

Other example. Imagine a Zig project and 2 developers that work on it. One of them prefers 4-space indents, other one 2-space indents. The easiest option is to use tabs, and that's allowed by Zig, according to this issue. But the comments restriction disables an ability to temporarily comment out a piece of code, as it does not compile.
So the convenient code commenting is broken for any Zig project that uses tabs. As I understand, that is not justified, because regular comments are ignored by compiler anyway.

I think before this change Zig coders from both sides of the "spaces vs tabs war" were happy. Also everyone who did not care was happy. This why I really wonder what is the advantage of this change?

castholm · 2024-09-08T16:37:52Z

One problem with rejecting tabs in comments is that you can't easily comment out code. If you are forced to write Zig with tabs for whatever reason, you will need to find a way to make your editor replace the commented out tabs with spaces (and then bring them back when uncommenting). Just selecting lines and prepending them with // won't work if they contain indentation.

thejoshwolfe · 2024-09-08T19:35:21Z

i certainly agree that it's confusing to get an error when commenting out working code. the original design of zig was to consistently disallow tabs (and all this is true about carriage returns as well), but in order for zig fmt to fix tabs, it had to support parsing them without error. once the default implementation of zig switched to the self hosted one, the main compiler got this tolerance as well.

however, there is still an intention to disallow hard tabs in zig source, whether in a comment, a string literal, or whitespace around tokens. see andrewrk's comment above. this would avoid the confusion when commenting out code, because the error would be much sooner.

What is "rendering" and why does it matter?

the relevant aspect of rendering in this context is the vertical alignment of grapheme clusters across multiple lines when using a monospace font. different editor implementations and configurations are also relevant. i have a hard time explaining why it matters in general, but certainly there are plenty of specific cases where the text is intended to align in a particular way independent of editor configuration.

Imagine a pet language that is indent-based like Python. Let's say it supports tabs or spaces for indentation. Then, e.g., parser tests should cover both cases. Some small tests are using multi line strings for embedding source code under test.

I don't have an easy way to search the zig codebase at the moment, but I'm pretty sure zig's own tests do this. the workaround is to use ++ "\t" ++ style string concatenation to get the invalid characters into the test case, or at least that's an option i would reach for. pretty awkward, but it seems appropriate for how unusual the use case is.

Imagine a Zig project and 2 developers that work on it. One of them prefers 4-space indents, other one 2-space indents. The easiest option is to use tabs,

The intent is that you use zig fmt, and then your preferences yield to the community standard. allowing different preferences for indentation is an explicit antifeature of zig fmt.

I think before this change Zig coders from both sides of the "spaces vs tabs war" were happy

ideally, we get to a point where is no spaces vs tabs war among zig authors, because there will be no tabs in zig. that's what i hope for at least.

If you are forced to write Zig with tabs for whatever reason

@castholm are you in that situation? how did that happen?

castholm · 2024-09-08T20:07:28Z

are you in that situation? how did that happen?

I'm not (but I recall from earlier discussion from back when tabs were hard banned that some people suggested they need them for accessibility) and fwiw I'm strictly on the spaces side. I was mainly just trying to make a point that allowing them in source but not comments is the worst of both worlds.
If the end goal is to begrudgingly support tabs but as a second-class citizen (i.e. still have the canonical zig fmt format favor spaces), they should also be supported in regular comments.
If the end goal is to make tabs a compile error, they shouldn't be supported anywhere. But as you mention, since zig fmt is currently the mechanism that normalizes source encoding, for zig fmt to even be able to parse and fix up the code they would need to be allowed.

PanSashko · 2024-09-08T22:19:28Z

pretty awkward, but it seems appropriate for how unusual the use case is.

I do not see any point in awkwardness, when it is so easy to fix. But in this case it was working, but now it is intentionally broken.

On other hand, "unambiguous source formatting" looks to me like a fictional benefit. I would be happy to hear at least one practical example when it is required or useful, because I cannot find any.

Recommending zig fmt and "community standards" for everyone is OK. But in this case it causes annoyance for some with no benefits for others.

Closes ziglang#21358 Closes ziglang#21360 This commit modifies the `multiline_string_literal_line`, `doc_comment`, and `container_doc_comment` tokens to no longer include the line ending as part of the token. This makes it easier to handle line endings (which may be LF, CRLF, or in edge cases possibly nonexistent) consistently. In the two issues linked above, Autodoc was already assuming this for doc comments, and yielding incorrect results when handling files with CRLF line endings (both in Markdown parsing and source rendering). Applying the same simplification for multiline string literals also brings `zig fmt` into conformance with ziglang/zig-spec#38 regarding formatting of multiline strings with CRLF line endings: the spec says that `zig fmt` should remove the CR from such line endings, but this was not previously the case.

Closes #21358 Closes #21360 This commit modifies the `multiline_string_literal_line`, `doc_comment`, and `container_doc_comment` tokens to no longer include the line ending as part of the token. This makes it easier to handle line endings (which may be LF, CRLF, or in edge cases possibly nonexistent) consistently. In the two issues linked above, Autodoc was already assuming this for doc comments, and yielding incorrect results when handling files with CRLF line endings (both in Markdown parsing and source rendering). Applying the same simplification for multiline string literals also brings `zig fmt` into conformance with ziglang/zig-spec#38 regarding formatting of multiline strings with CRLF line endings: the spec says that `zig fmt` should remove the CR from such line endings, but this was not previously the case.

Closes ziglang#21358 Closes ziglang#21360 This commit modifies the `multiline_string_literal_line`, `doc_comment`, and `container_doc_comment` tokens to no longer include the line ending as part of the token. This makes it easier to handle line endings (which may be LF, CRLF, or in edge cases possibly nonexistent) consistently. In the two issues linked above, Autodoc was already assuming this for doc comments, and yielding incorrect results when handling files with CRLF line endings (both in Markdown parsing and source rendering). Applying the same simplification for multiline string literals also brings `zig fmt` into conformance with ziglang/zig-spec#38 regarding formatting of multiline strings with CRLF line endings: the spec says that `zig fmt` should remove the CR from such line endings, but this was not previously the case.

thejoshwolfe · 2024-09-28T15:22:52Z

On other hand, "unambiguous source formatting" looks to me like a fictional benefit. I would be happy to hear at least one practical example when it is required or useful, because I cannot find any.

Back in Java 1.5, before openjdk, the source for java.lang.String and other classes was formatted using a mix of tabs and spaces, where the intent was that a tab stood in for 8 spaces (to save disk space or something probably). I can't find the source code for jdk pre openjdk 7 (without creating an oracle account?), but i can recreate it as an example. Starting with this modern excerpt: https://github.com/openjdk/jdk/blob/73ebb848fdb66861e912ea747c039ddd1f7a5f48/src/java.base/share/classes/java/lang/String.java#L552-L581

then replacing 8 spaces with tabs to recreate the style of jdk 1.5, it looks like this in github web/mobile app/whatever you're using, which for me renders the intended way:

    @SuppressWarnings("removal")
    private String(Charset charset, byte[] bytes, int offset, int length) {
	if (length == 0) {
	    this.value = "".value;
	    this.coder = "".coder;
	} else if (charset == UTF_8.INSTANCE) {
	    if (COMPACT_STRINGS) {
		int dp = StringCoding.countPositives(bytes, offset, length);
		if (dp == length) {
		    this.value = Arrays.copyOfRange(bytes, offset, offset + length);
		    this.coder = LATIN1;
		    return;
		}
		// Decode with a stable copy, to be the result if the decoded length is the same
		byte[] latin1 = Arrays.copyOfRange(bytes, offset, offset + length);
		int sp = dp;            // first dp bytes are already in the copy
		while (sp < length) {
		    int b1 = latin1[sp++];
		    if (b1 >= 0) {
			latin1[dp++] = (byte)b1;
			continue;
		    }
		    if ((b1 & 0xfe) == 0xc2 && sp < length) { // b1 either 0xc2 or 0xc3
			int b2 = latin1[sp];
			if (b2 < -64) { // continuation bytes are always negative values in the range -128 to -65
			    latin1[dp++] = (byte)decode2(b1, b2);
			    sp++;
			    continue;
			}
		    }

And if you set tabs to appear as 2 spaces, a configuration i've seen commonly in web development contexts, it looks like this:

I would call that simply incorrect behavior. For the indentation to appear to go backwards when it should go forwards is a failure between the source code author and the editor configuration. The original authors (according to my story, which i've simulated in this discussion) were vigilant about formatting the indentation consistently; it wasn't sloppiness; it was a strategy that was prone to ambiguous rendering.

In countless modern scenarios, mostly in closed-source code authored by my colleagues at work, I've encountered a few stray tabs here and there where someone didn't notice that their editor was defaulting to inserting tabs where there were already spaces or visa versa. This typically happens in code bases without auto formatters, linters, or other modern comforts, such as in shell scripts for devops use cases. It looked correct with their editor configuration, which is often 4-space tab width, but then in the github PR review web interface, tabs are typically rendered 8-spaces wide. This causes incorrect indentation rendering, which is a failure between the author and the renderer configuration.

These types of failure are the main casualties in the tabs vs spaces war. There are many solutions, of course, and an auto formatter is involved in many of them. Zig has an autoformatter, and it's generally recommended to use it in order to avoid ambiguous and incorrect appearing indentation like these examples.

PanSashko · 2024-09-30T07:32:39Z

@thejoshwolfe you described a "problem" that Zig still has too. It allows mixed indentation, so Zig sources with mixed indentation will look like your screenshot for some tab widths.

On other hand, zig fmt solves all of that for anyone who cares.
I don't see the point of breaking developer experience for those who use tabs and do not use zig fmt.

IntegratedQuantum · 2024-10-03T12:52:22Z

I just came across this because I wanted to comment out some code to take care of after testing the other changes that I made.
And I have to say, getting an error for commented code is a pretty bad user experience.

Sure there is a workaround for this: Only comment the code without the tabs in front of it.
This is quite easy to do with a multi-cursor capable editor, but it is non-obvious.

Do you really want to return to the old (stage1) days, where we (tab users) had to write an extra pre-processing step in the build.zig just to be able to conveniently use tabs in the source code?

Closes ziglang#21358 Closes ziglang#21360 This commit modifies the `multiline_string_literal_line`, `doc_comment`, and `container_doc_comment` tokens to no longer include the line ending as part of the token. This makes it easier to handle line endings (which may be LF, CRLF, or in edge cases possibly nonexistent) consistently. In the two issues linked above, Autodoc was already assuming this for doc comments, and yielding incorrect results when handling files with CRLF line endings (both in Markdown parsing and source rendering). Applying the same simplification for multiline string literals also brings `zig fmt` into conformance with ziglang/zig-spec#38 regarding formatting of multiline strings with CRLF line endings: the spec says that `zig fmt` should remove the CR from such line endings, but this was not previously the case.

jibal · 2024-11-09T20:15:37Z

"CR used as whitespace, whether directly preceding NL or stray, is still unambiguously whitespace."

But it's not unambiguously rendered ... there's no telling how it will be displayed, and on some displays it will hide code. If you're concerned about rendering, then you should ban CRs that don't immediately precede NL ... unlike tabs, there's no reason at all for anyone to want a stray CR. The reference manual gets it right:

Each LF may be immediately preceded by a single CR (byte value 0x0d, code point U+000d, '\r') to form a Windows style line ending, but this is discouraged. Note that in multiline strings, CRLF sequences will be encoded as LF when compiled into a zig program. A CR in any other context is not allowed.

BTW, the reference manual has the wrong (example) regex for line ends (how has no one noticed? Or is feedback about such things broken? I like that the D documentation has a link on every page to file an error report): /\r\n?|[\n\u0085\u2028\u2029]/

jibal · 2024-11-09T20:42:51Z

Inside Line Comments [...]
Any TAB is rejected by the grammar since it is ambiguous how it should be rendered.

But who cares? I suggest that inside of non-doc comments anything at all should be allowed, including ctrl chars and chars with the high bit set (so arbitrary unicode). Since non-doc comments are completely elided, it's none of the compiler's (or language designers') damn business what is in them.

andrewrk mentioned this issue Jul 8, 2021

parsing failure on windows ziglang/zig#9257

Closed

mbcrocci mentioned this issue Jul 12, 2021

Compatibility with 0.9.0-dev changes Not-Nik/raylib-zig#10

Merged

Vexu mentioned this issue Apr 28, 2022

CR-LF Line Endings Break Multi-Line Strings ziglang/zig#11414

Closed

This was referenced Aug 28, 2022

Compiler asserts & crashes on multiline string containing CRLF line ending (mainly affects Windows) ziglang/zig#12550

Closed

Correctly handle carriage return characters according to the spec ziglang/zig#12661

Merged

IntegratedQuantum mentioned this issue Sep 29, 2022

Allow tabs for the indentation added by code actions. zigtools/zls#684

Merged

WillLillis mentioned this issue Jul 19, 2024

feat: Implement string literal conversion code actions zigtools/zls#1931

Open

andrewrk mentioned this issue Jul 31, 2024

std.zig.tokenizer: simplification and spec conformance ziglang/zig#20885

Merged

karlseguin mentioned this issue Aug 3, 2024

bizarre zig compilation error on commented out Go code in zig src file karlseguin/zul#5

Closed

Vexu mentioned this issue Aug 5, 2024

unhelpful error message on invalid bytes inside multi line string literal ziglang/zig#20900

Open

jiacai2050 mentioned this issue Aug 6, 2024

0.14.0 jiacai2050/zigcli#27

Merged

castholm mentioned this issue Aug 18, 2024

Invalid token in commented code ziglang/zig#21126

Closed

ianprime0509 mentioned this issue Sep 10, 2024

std.zig.tokenizer: simplify line-based tokens ziglang/zig#21371

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

grammar clarifications regarding tabs and carriage returns #38

grammar clarifications regarding tabs and carriage returns #38

andrewrk commented Jul 8, 2021 •

edited

Loading

vrischmann commented Jul 15, 2021 •

edited

Loading

thejoshwolfe commented Jul 15, 2021

vrischmann commented Jul 15, 2021

Vexu commented Jul 15, 2021

nektro commented Apr 28, 2022 •

edited

Loading

andrewrk commented Oct 31, 2022

rockorager commented Aug 3, 2024 •

edited

Loading

castholm commented Aug 5, 2024

PanSashko commented Sep 8, 2024

thejoshwolfe commented Sep 8, 2024

PanSashko commented Sep 8, 2024 •

edited

Loading

castholm commented Sep 8, 2024

thejoshwolfe commented Sep 8, 2024

castholm commented Sep 8, 2024

PanSashko commented Sep 8, 2024

thejoshwolfe commented Sep 28, 2024 •

edited

Loading

PanSashko commented Sep 30, 2024 •

edited

Loading

IntegratedQuantum commented Oct 3, 2024

jibal commented Nov 9, 2024 •

edited

Loading

jibal commented Nov 9, 2024 •

edited

Loading

grammar clarifications regarding tabs and carriage returns #38

grammar clarifications regarding tabs and carriage returns #38

Comments

andrewrk commented Jul 8, 2021 • edited Loading

Inside Line Comments and Documentation Comments

Inside Multi-Line String Literals

Whitespace

vrischmann commented Jul 15, 2021 • edited Loading

thejoshwolfe commented Jul 15, 2021

vrischmann commented Jul 15, 2021

Vexu commented Jul 15, 2021

nektro commented Apr 28, 2022 • edited Loading

andrewrk commented Oct 31, 2022

rockorager commented Aug 3, 2024 • edited Loading

castholm commented Aug 5, 2024

PanSashko commented Sep 8, 2024

thejoshwolfe commented Sep 8, 2024

PanSashko commented Sep 8, 2024 • edited Loading

castholm commented Sep 8, 2024

thejoshwolfe commented Sep 8, 2024

castholm commented Sep 8, 2024

PanSashko commented Sep 8, 2024

thejoshwolfe commented Sep 28, 2024 • edited Loading

PanSashko commented Sep 30, 2024 • edited Loading

IntegratedQuantum commented Oct 3, 2024

jibal commented Nov 9, 2024 • edited Loading

jibal commented Nov 9, 2024 • edited Loading

andrewrk commented Jul 8, 2021 •

edited

Loading

vrischmann commented Jul 15, 2021 •

edited

Loading

nektro commented Apr 28, 2022 •

edited

Loading

rockorager commented Aug 3, 2024 •

edited

Loading

PanSashko commented Sep 8, 2024 •

edited

Loading

thejoshwolfe commented Sep 28, 2024 •

edited

Loading

PanSashko commented Sep 30, 2024 •

edited

Loading

jibal commented Nov 9, 2024 •

edited

Loading

jibal commented Nov 9, 2024 •

edited

Loading