Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parsing failure on windows #9257

Closed
vrischmann opened this issue Jun 27, 2021 · 16 comments · Fixed by #9274
Closed

parsing failure on windows #9257

vrischmann opened this issue Jun 27, 2021 · 16 comments · Fixed by #9274

Comments

@vrischmann
Copy link
Contributor

Hi,

the following code fails to compile only on Windows:

fn createTestTables() !void {
    const bar = &[_][]const u8{
        \\foobar
    };
    _ = bar;
}

I get this error:

>> C:\Users\vincent\dev\zig-windows-x86_64-0.9.0-dev.321+15a030ef3\zig.exe build-obj .\sqlite.zig
.\sqlite.zig:3:17: error: expected ',', found invalid bytes
        \\foobar
                ^

I'm testing with zig-windows-x86_64-0.9.0-dev.321+15a030ef3.zip.

Weird thing is it works fine on Linux and FreeBSD.

@jmc-88
Copy link
Contributor

jmc-88 commented Jul 1, 2021

I can reproduce the issue on a Linux system, provided that I save the file with DOS newlines (CR+LF). Likely the lexer is choking on the unexpected CR.

Easy steps to reproduce:

  • vi dosnl.zig
  • set fileformat=dos
  • paste the above code in and :x to save + quit
  • zig build-obj dosnl.zig

@jmc-88
Copy link
Contributor

jmc-88 commented Jul 1, 2021

A patch as simple as this seems to make the tokenizer happy:

diff --git a/lib/std/zig/tokenizer.zig b/lib/std/zig/tokenizer.zig
index 94a20d958..edf690ebe 100644
--- a/lib/std/zig/tokenizer.zig
+++ b/lib/std/zig/tokenizer.zig
@@ -834,7 +834,7 @@ pub const Tokenizer = struct {
                 },
 
                 .multiline_string_literal_line => switch (c) {
-                    '\n' => {
+                    '\r', '\n' => {
                         self.index += 1;
                         break;
                     },

I'll see if there's any tests where I can record the regression and send a PR out for review.

jmc-88 added a commit to jmc-88/zig that referenced this issue Jul 1, 2021
Fixes ziglang#9257.
This is needed when tokenizing input containing DOS line endings, i.e.
the CRLF sequence.
jmc-88 added a commit to jmc-88/zig that referenced this issue Jul 1, 2021
Fixes ziglang#9257.
This is needed when tokenizing input containing DOS line endings, i.e.
the CRLF sequence.
@Avokadoen
Copy link

I have the same issue on zig-windows-x86_64-0.9.0-dev.397+b7da1b2d4

jmc-88 added a commit to jmc-88/zig that referenced this issue Jul 6, 2021
Fixes ziglang#9257.
This is needed when tokenizing input containing DOS line endings, i.e.
the CRLF sequence.
@Vexu Vexu closed this as completed in #9274 Jul 7, 2021
Vexu pushed a commit that referenced this issue Jul 7, 2021
Fixes #9257.
This is needed when tokenizing input containing DOS line endings, i.e.
the CRLF sequence.
@andrewrk
Copy link
Member

andrewrk commented Jul 8, 2021

Thanks for the patch @jmc-88, however I reverted it because this was working as designed. Carriage returns are discouraged in general, removed by zig fmt, and in the case of multi-line string literals, not allowed at all. If you need carriage returns in a string literal, you have to use a clunkier syntax.

@andrewrk
Copy link
Member

andrewrk commented Jul 8, 2021

See ziglang/zig-spec#38 for clarification. I opened an issue because the draft spec and the current compiler implementation do not match up exactly; this issue has the canonical specification.

@Hadron67
Copy link
Contributor

Hadron67 commented Jul 8, 2021

If no CRs are allowed in multi-line string literals, does this mean it's impossible to include multi-line string literals in source files with DOS newlines without an editor plugin?

@andrewrk
Copy link
Member

andrewrk commented Jul 8, 2021

I'm not sure what you mean about an editor plugin, but yes, if you need a string with DOS newlines, it would require a strategy such as:

  • regular string literals, using \r and the ++ operator (this would look similar to C code accomplishing the same task)
  • a function that is called at comptime and converts to DOS newlines, and using that to wrap a multiline string literal
  • @embedFile

@Hadron67
Copy link
Contributor

Hadron67 commented Jul 8, 2021

I mean, if my source file was saved with DOS newlines and it includes multiline strings, the newlines in them would become CRLF and thus rejected by the compiler, as in this issue. But I don't mean to verbatim include those CRLFs in my string, I just want them to be standard '\n's. To solve this I'll need to either save the file with UNIX newlines, or use some editor plugin that converts CRs in multiline strings to NL.

I know language such as Javascript accepts all styles of newlines in multiline string literals, but they will get converted to '\n', making the newlines unambiguous. Maybe this should also be zig's case?

@jmc-88
Copy link
Contributor

jmc-88 commented Jul 8, 2021

Thanks for the patch @jmc-88, however I reverted it because this was working as designed. Carriage returns are discouraged in general, removed by zig fmt, and in the case of multi-line string literals, not allowed at all. If you need carriage returns in a string literal, you have to use a clunkier syntax.

OK, but then looking at ziglang/zig-spec#38, '\t' characters should be similarly disallowed, while currently the tokenizer ignores them? From the above it sounds this line should also disappear.

In any case, if certain bytes are disallowed it may help to improve the error message to point out which invalid bytes were found and rejected.

I mean, if my source file was saved with DOS newlines and it includes multiline strings, the newlines in them would become CRLF and thus rejected by the compiler, as in this issue. But I don't mean to verbatim include those CRLFs in my string, I just want them to be standard '\n's. To solve this I'll need to either save the file with UNIX newlines, or use some editor plugin that converts CRs in multiline strings to NL.

I think what @andrewrk suggests is that you need to apply zig fmt as part of your development workflow, to automatically convert these bytes sequences for you.

@vrischmann
Copy link
Contributor Author

I just got back to this. My issue is that git on windows has core.autocrlf=true by default (at least on my vanilla installation) which now will cause code to not compile. Simply cloning like this works: git clone --config core.autocrlf=false but I don't think this is ideal.

@vrischmann
Copy link
Contributor Author

Also, I can't run zig fmt on my file either:

PS C:\Users\vincent\dev\tmp\invalid-bytes> C:\Users\vincent\dev\zig-windows-x86_64-0.9.0-dev.453+7ef854682\zig.exe fmt .\main.zig
.\main.zig:3:17: error: expected ',', found invalid bytes
        \\foobar
                ^
.\main.zig:3:18: note: invalid byte: ' '
        \\foobar
                 ^
.\main.zig:3:17: error: expected expression, found 'invalid'
        \\foobar
                ^
.\main.zig:3:18: note: invalid byte: ' '
        \\foobar
                 ^
.\main.zig:4:6: error: expected test, comptime, var decl, or container field, found ';'
    };
     ^
.\main.zig:5:12: error: expected ',', found ';'
    _ = bar;
           ^
.\main.zig:6:1: error: expected 'eof', found '}'
}
^

@Hadron67
Copy link
Contributor

I think what @andrewrk suggests is that you need to apply zig fmt as part of your development workflow, to automatically convert these bytes sequences for you.

zig fmt rejects them too, as ziglang/zig-spec#38 suggests:

  • zig fmt may not mangle multi line string literals, and therefore the control characters TAB and CR are rejected by the grammar inside multi-line string literals.

@crystalthoughts
Copy link

crystalthoughts commented Jul 20, 2021

This has really sent me down a rabbit hole trying to get certain repositories to work on Windows! I think this is one of those times where striving for simplicity is actually making things harder in certain cases.

It's quite a subtle thing to try and debug for someone not already aware of the issue, especially given the non-visible nature of whitespace. Is there a way for this to fail gracefully? Either by a conversion prompt or a tweak to how things are handled?

@tecanec
Copy link
Contributor

tecanec commented Mar 5, 2022

I must point out that this has confused me, as well. CR is mostly invisible to the programmer, so it gets confusing when your code suddenly won't compile on Windows and the compiler is complaining about an invalid byte in your string literal that you can't see.

I agree with throwing compile errors when the programmer themself makes a mistake. However, I do think that this particular issue is less about the programmer and more about compatibility with software made by third parties. And even though most proper text editors have settings for choosing between CRLF or just LF, I still think that it's important to support any could-be outliers. In other words: The language belongs to Zig, but the format for encoding text files doesn't. Zig merely supports it, and holds no more power over the encoding than the average text editor does.

@Hadron67 mentioned the idea of simply converting the line endings of string literals to LF. I agree with that idea. While it technically is a "hidden element", it does simplify things overall and shouldn't be hard to document. It means that you can write the exact same code in any text editor and get the exact same result, regardless of whether the editor uses CRLF or LF. I'm fine with using \r when I really have to, as long as everything else is consistent.

However, if the compiler really must reject CR like this, I'd appreciate it if it were at least being less confusing about it. Rejecting CR in multiline string literals specifically seems arbitrary and confusing, so I honestly think that it'd be better to just outright reject CR everywhere, period. Other than that, simply making the compile error a bit more specific (such as error: expected ',', found invalid byte 'CR') seems like a safe bet.

@motiejus
Copy link
Contributor

motiejus commented Jan 3, 2023

I suggest we re-open this issue.

One more data point: https://lists.sr.ht/~motiejus/bazel-zig-cc/%3C167179347159.25324.12223141818448163553-0%40git.sr.ht%3E#%3Cee9821ec-6b7c-4d62-a1f2-0b54cc2ef728@hahn.graphics%3E

Extract from @FabianHahn email:

Zig further fails to compile launcher.zig with the following error:

launcher.zig:70:73: error: expected ';' after declaration

     \\Usage: <...>/tools/<target-triple>/{[zig_tool]s}{[exe]s} <args>...

                                                                         ^~~~

launcher.zig:71:3: note: invalid byte: ' '

     \\

   ^

The reason for this is that Git on Windows inserts carriage return
characters before newlines when checking out local files so they show up
fine in Windows text editors where \r\n is the default for line breaks.
This then breaks the zig parser's support for multiline constants as
detailed in #9257 which is marked
as closed even though it doesn't look resolved to me at all. Converting
newlines is the default behavior for Git unless you explicitly disable
it by setting core.autocrlf in your Git config. While that would be a
workaround, I would find this an uncommon thing to do and an odd thing
to require of Windows users, which is something also pointed out in the
above issue.

@Vexu
Copy link
Member

Vexu commented Jan 3, 2023

That should be covered by #11414

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants