-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
grammar clarifications regarding tabs and carriage returns #38
Comments
This
causes a problem when using git on windows, unless I'm missing something. If you clone this gist (which contains no CR) on Windows with the default git configuration, it produces this:
which does not compile:
it is the intended behaviour but IMO it's surprising. Currently as far as I can see you need to either configure git with |
It still surprises me that Git's default configuration is to give you the wrong bytes on windows. Can you clarify which client you're using? is it provided by GitHub? by git-scm.com? by a cygwin package manager (probably not this one)? As much as I'd like to say that Git is wrong (which i kinda did in the last paragraph), the conflict is arising from the concept of a "text file", which many people think is an array of lines rather than a single byte array. Sometimes discussions of character encoding get tangled up in the concept of a "text file" too. But Zig is not interested in all the complexity of "text files", and specifies always UTF-8, always spaces over tabs, always LF line endings. There's some room for accepting and canonicalizing (in zig fmt) unambiguous alternatives for everyone's convenience, but CRLF line endings are officially the wrong way to end lines in Zig (along with NEL, LS, and PS). While it bothers me emotionally that some Git cient on Windows is wasting our time here talking about CRLF line endings, it seems important to get this right to avoid even more wasted time for everyone in the future. With that preamble ramble out of the way: i think we should definitely canonicalize CRLF line endings in multiline string literals to LF line endings. |
I'm using the installer from git-scm. |
Couldn't we just add an error like If mangling the files is an issue, the feature can be put behind a |
could even make |
Yes, I'm thinking along the same lines. Make it a hard error, just like unused locals, and then have |
This is specific to multi-line string literals, right? I can still have |
Yes, this clarification is specifically for multi-line string literals. Regular string literals can of course still use escape sequences to encode characters/byte sequences like control characters or invalid UTF-8 sequences that are illegal in Zig source code. But multi-line string literals don't support escape sequences and thus have no way to encode HT, LF, CR or other illegal characters, so for those rare cases you are need to concatenate the multi-line string literal with a regular string literal containing the necessary escape sequences, or put the text data in a separate file and import it using |
The restriction to use tabs in comments and multi line strings is frustrating. What is "rendering" and why does it matter? It is a really annoying change for any project where there are comments or multi line strings with tabs. But what are real benefits for others who do not care? |
@PanSashko is your project where you want tabs in comments and strings something you can link to? It might help better understand the pros and cons to see a real example. |
@thejoshwolfe unfortunately I cannot publish it yet. But I can describe 2 examples. Imagine a pet language that is indent-based like Python. Let's say it supports tabs or spaces for indentation. Then, e.g., parser tests should cover both cases. Some small tests are using multi line strings for embedding source code under test. Other example. Imagine a Zig project and 2 developers that work on it. One of them prefers 4-space indents, other one 2-space indents. The easiest option is to use tabs, and that's allowed by Zig, according to this issue. But the comments restriction disables an ability to temporarily comment out a piece of code, as it does not compile. I think before this change Zig coders from both sides of the "spaces vs tabs war" were happy. Also everyone who did not care was happy. This why I really wonder what is the advantage of this change? |
One problem with rejecting tabs in comments is that you can't easily comment out code. If you are forced to write Zig with tabs for whatever reason, you will need to find a way to make your editor replace the commented out tabs with spaces (and then bring them back when uncommenting). Just selecting lines and prepending them with |
i certainly agree that it's confusing to get an error when commenting out working code. the original design of zig was to consistently disallow tabs (and all this is true about carriage returns as well), but in order for however, there is still an intention to disallow hard tabs in zig source, whether in a comment, a string literal, or whitespace around tokens. see andrewrk's comment above. this would avoid the confusion when commenting out code, because the error would be much sooner.
the relevant aspect of rendering in this context is the vertical alignment of grapheme clusters across multiple lines when using a monospace font. different editor implementations and configurations are also relevant. i have a hard time explaining why it matters in general, but certainly there are plenty of specific cases where the text is intended to align in a particular way independent of editor configuration.
I don't have an easy way to search the zig codebase at the moment, but I'm pretty sure zig's own tests do this. the workaround is to use
The intent is that you use
ideally, we get to a point where is no spaces vs tabs war among zig authors, because there will be no tabs in zig. that's what i hope for at least.
@castholm are you in that situation? how did that happen? |
I'm not (but I recall from earlier discussion from back when tabs were hard banned that some people suggested they need them for accessibility) and fwiw I'm strictly on the spaces side. I was mainly just trying to make a point that allowing them in source but not comments is the worst of both worlds. |
I do not see any point in awkwardness, when it is so easy to fix. But in this case it was working, but now it is intentionally broken. On other hand, "unambiguous source formatting" looks to me like a fictional benefit. I would be happy to hear at least one practical example when it is required or useful, because I cannot find any. Recommending |
Closes ziglang#21358 Closes ziglang#21360 This commit modifies the `multiline_string_literal_line`, `doc_comment`, and `container_doc_comment` tokens to no longer include the line ending as part of the token. This makes it easier to handle line endings (which may be LF, CRLF, or in edge cases possibly nonexistent) consistently. In the two issues linked above, Autodoc was already assuming this for doc comments, and yielding incorrect results when handling files with CRLF line endings (both in Markdown parsing and source rendering). Applying the same simplification for multiline string literals also brings `zig fmt` into conformance with ziglang/zig-spec#38 regarding formatting of multiline strings with CRLF line endings: the spec says that `zig fmt` should remove the CR from such line endings, but this was not previously the case.
Closes #21358 Closes #21360 This commit modifies the `multiline_string_literal_line`, `doc_comment`, and `container_doc_comment` tokens to no longer include the line ending as part of the token. This makes it easier to handle line endings (which may be LF, CRLF, or in edge cases possibly nonexistent) consistently. In the two issues linked above, Autodoc was already assuming this for doc comments, and yielding incorrect results when handling files with CRLF line endings (both in Markdown parsing and source rendering). Applying the same simplification for multiline string literals also brings `zig fmt` into conformance with ziglang/zig-spec#38 regarding formatting of multiline strings with CRLF line endings: the spec says that `zig fmt` should remove the CR from such line endings, but this was not previously the case.
Closes ziglang#21358 Closes ziglang#21360 This commit modifies the `multiline_string_literal_line`, `doc_comment`, and `container_doc_comment` tokens to no longer include the line ending as part of the token. This makes it easier to handle line endings (which may be LF, CRLF, or in edge cases possibly nonexistent) consistently. In the two issues linked above, Autodoc was already assuming this for doc comments, and yielding incorrect results when handling files with CRLF line endings (both in Markdown parsing and source rendering). Applying the same simplification for multiline string literals also brings `zig fmt` into conformance with ziglang/zig-spec#38 regarding formatting of multiline strings with CRLF line endings: the spec says that `zig fmt` should remove the CR from such line endings, but this was not previously the case.
Back in Java 1.5, before openjdk, the source for then replacing 8 spaces with tabs to recreate the style of jdk 1.5, it looks like this in github web/mobile app/whatever you're using, which for me renders the intended way: @SuppressWarnings("removal")
private String(Charset charset, byte[] bytes, int offset, int length) {
if (length == 0) {
this.value = "".value;
this.coder = "".coder;
} else if (charset == UTF_8.INSTANCE) {
if (COMPACT_STRINGS) {
int dp = StringCoding.countPositives(bytes, offset, length);
if (dp == length) {
this.value = Arrays.copyOfRange(bytes, offset, offset + length);
this.coder = LATIN1;
return;
}
// Decode with a stable copy, to be the result if the decoded length is the same
byte[] latin1 = Arrays.copyOfRange(bytes, offset, offset + length);
int sp = dp; // first dp bytes are already in the copy
while (sp < length) {
int b1 = latin1[sp++];
if (b1 >= 0) {
latin1[dp++] = (byte)b1;
continue;
}
if ((b1 & 0xfe) == 0xc2 && sp < length) { // b1 either 0xc2 or 0xc3
int b2 = latin1[sp];
if (b2 < -64) { // continuation bytes are always negative values in the range -128 to -65
latin1[dp++] = (byte)decode2(b1, b2);
sp++;
continue;
}
} And if you set tabs to appear as 2 spaces, a configuration i've seen commonly in web development contexts, it looks like this: I would call that simply incorrect behavior. For the indentation to appear to go backwards when it should go forwards is a failure between the source code author and the editor configuration. The original authors (according to my story, which i've simulated in this discussion) were vigilant about formatting the indentation consistently; it wasn't sloppiness; it was a strategy that was prone to ambiguous rendering. In countless modern scenarios, mostly in closed-source code authored by my colleagues at work, I've encountered a few stray tabs here and there where someone didn't notice that their editor was defaulting to inserting tabs where there were already spaces or visa versa. This typically happens in code bases without auto formatters, linters, or other modern comforts, such as in shell scripts for devops use cases. It looked correct with their editor configuration, which is often 4-space tab width, but then in the github PR review web interface, tabs are typically rendered 8-spaces wide. This causes incorrect indentation rendering, which is a failure between the author and the renderer configuration. These types of failure are the main casualties in the tabs vs spaces war. There are many solutions, of course, and an auto formatter is involved in many of them. Zig has an autoformatter, and it's generally recommended to use it in order to avoid ambiguous and incorrect appearing indentation like these examples. |
@thejoshwolfe you described a "problem" that Zig still has too. It allows mixed indentation, so Zig sources with mixed indentation will look like your screenshot for some tab widths. On other hand, |
I just came across this because I wanted to comment out some code to take care of after testing the other changes that I made. Sure there is a workaround for this: Only comment the code without the tabs in front of it. Do you really want to return to the old (stage1) days, where we (tab users) had to write an extra pre-processing step in the build.zig just to be able to conveniently use tabs in the source code? |
Closes ziglang#21358 Closes ziglang#21360 This commit modifies the `multiline_string_literal_line`, `doc_comment`, and `container_doc_comment` tokens to no longer include the line ending as part of the token. This makes it easier to handle line endings (which may be LF, CRLF, or in edge cases possibly nonexistent) consistently. In the two issues linked above, Autodoc was already assuming this for doc comments, and yielding incorrect results when handling files with CRLF line endings (both in Markdown parsing and source rendering). Applying the same simplification for multiline string literals also brings `zig fmt` into conformance with ziglang/zig-spec#38 regarding formatting of multiline strings with CRLF line endings: the spec says that `zig fmt` should remove the CR from such line endings, but this was not previously the case.
"CR used as whitespace, whether directly preceding NL or stray, is still unambiguously whitespace." But it's not unambiguously rendered ... there's no telling how it will be displayed, and on some displays it will hide code. If you're concerned about rendering, then you should ban CRs that don't immediately precede NL ... unlike tabs, there's no reason at all for anyone to want a stray CR. The reference manual gets it right:
BTW, the reference manual has the wrong (example) regex for line ends (how has no one noticed? Or is feedback about such things broken? I like that the D documentation has a link on every page to file an error report): /\r\n?|[\n\u0085\u2028\u2029]/ |
But who cares? I suggest that inside of non-doc comments anything at all should be allowed, including ctrl chars and chars with the high bit set (so arbitrary unicode). Since non-doc comments are completely elided, it's none of the compiler's (or language designers') damn business what is in them. |
NL = New Line (0x0a)
CR = Carriage Return (0x0d)
TAB = Tab (0x09)
Inside Line Comments and Documentation Comments
zig fmt
, leaving only NL.Inside Multi-Line String Literals
zig fmt
may not mangle multi line string literals, and therefore the control character TAB are rejected by the grammar inside multi-line string literals.For string literals that want to include CR, TAB, or any other control sequences, they will need to use regular string literals and the
++
operator, or@embedFile
.Whitespace
zig fmt
.zig fmt
.The text was updated successfully, but these errors were encountered: