Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

yaml parsing: End_of_file exception on Windows #2991

Open
Tracked by #1330
mjambon opened this issue Apr 22, 2021 · 22 comments
Open
Tracked by #1330

yaml parsing: End_of_file exception on Windows #2991

mjambon opened this issue Apr 22, 2021 · 22 comments
Labels
alpha Relates to an experimental feature bug Something isn't working lang:yaml os:windows user:external requested by someone outside of r2c

Comments

@mjambon
Copy link
Member

mjambon commented Apr 22, 2021

C:\foo>semgrep --config=rules.yaml --lang=py foo.py
running 44 rules...
  0%|                                                                                                             |0/44
an internal error occured while invoking semgrep-core while running rule 'testmodule.Test'. Consider skipping this rule and reporting this issue.
        unknown exception: End_of_file
An error occurred while invoking the semgrep engine; please help us fix this by creating an issue at https://github.com/returntocorp/semgrep

See original report.

@fasiha
Copy link

fasiha commented Apr 23, 2021

I added some logging to core_runner.py so that, when I invoke semgrep as

semgrep -e "eval(...)" --lang=py demo.py

the full command (cmd in the code) that Python will shell out to semgrep-core is

C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\semgrep\bin\semgrep-core -lang py -json -rules_file C:\Users\User\AppData\Local\Temp\tmpj9xbese9 -j 2 -target_file C:\Users\User\AppData\Local\Temp\tmpwohu0jdd -use_parsing_cache C:\Users\User\AppData\Local\Temp\tmpn_ilbh86 -timeout 30 -max_memory 0

I can put a sleep after the command is invoked to inspect these tempt files. The rules_file contains YAML, but different than the rules spec that I've seen?

$ cat /c/Users/User/AppData/Local/Temp/tmpj9xbese9
rules:
- id: 0.-
  pattern: eval(...)
  severity: ERROR
  languages:
  - py
  message: <internalonly>

(This is in Git Bash so I can use cat etc.) And target_file looks to be just a list of paths:

$ cat /c/Users/User/AppData/Local/Temp/tmpwohu0jdd
demo.py

But I noticed that the rules file uses CRLF Windows newlines. When I convert to Unix newlines via dos2unix, and rerun the above command but adapted to use local files, I get a new error message 😳😳!!

C:\foo>C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\semgrep\bin\semgrep-core.exe -lang py -json -rules_file C:\foo\prepped1-unix.txt -target_file c:\foo\files.txt
{"pattern_id":"0.-","error":"invalid pattern","pattern":"eval(...)","language":"Python","message":"(Sys_error \"/tmp/tmp-681-6c7c1c.py: No such file or directory\")"}

(Above, C:\foo> is the cmd.exe equivalent of $, so the command I ran starts with C:\Users\User…, hopefully that's clear. And prepped1-unix.txt just has the same YAML as above, just Unix line endings, and files.txt just contains "c:\demo.py".)

SO. I can readily change the Python code to replace the contents of pattern_file.name (line 236) with Unix line endings.

But this new error message seems to say that semgrep-core is writing to /tmp which of course doesn't exist in Windows. I see that

  1. both program_of_string and any_of_sting in Parse_python.ml both use
  2. Common2.with_tmp_file which uses
  3. Common.new_temp_file which invokes
  4. Filename.temp_file.

Can you help me find out who is attempting to write to /tmp, if it's in the chain above, or somewhere else?

(I can also prepare a PR to semgrep to ensure that the files sent to semgrep-core have Unix line ends—or would you prefer to ensure that the OCaml file readers are cross-platform?)

@mjambon
Copy link
Member Author

mjambon commented Apr 23, 2021

Indeed it looks like the bad temp folder /tmp comes from the call to Filename.temp_file by Common.new_temp_file. This is surprising because the whole point of such function is to pick a proper folder.

To test this, you can run utop (if not installed, use ocaml or run opam install utop to install it) and check the value of Filename.get_temp_dir_name ():

$ utop
────────┬─────────────────────────────────────────────────────────────┬─────────
        │ Welcome to utop version 2.6.0 (using OCaml version 4.10.0)! │         
        └─────────────────────────────────────────────────────────────┘         
Findlib has been successfully loaded. Additional directives:
  #require "package";;      to load a package
  #list;;                   to list the available packages
  #camlp4o;;                to load camlp4 (standard syntax)
  #camlp4r;;                to load camlp4 (revised syntax)
  #predicates "p,q,...";;   to set these predicates
  Topfind.reset();;         to force that packages will be reloaded
  #thread;;                 to enable threads


Type #utop_help for help about using utop.

─( 18:12:39 )─< command 0 >──────────────────────────────────────{ counter: 0 }─
utop # Filename.get_temp_dir_name ();;
- : string = "/tmp"

^ if you're getting "/tmp" too, this explains the bug.

I'm also getting this because I'm on linux:

utop # Sys.os_type;;
- : string = "Unix"

You should get "Win32" or "Cygwin". Based on that, the Filename module implementation selects one of three submodules. Apparently, for Cygwin, it's the same as for Unix, which uses an environment variable and falls back to /tmp. For Windows, it uses the TEMP environment variable and falls back to ..

@fasiha
Copy link

fasiha commented Apr 23, 2021

You should get "Win32" or "Cygwin". Based on that, the Filename module implementation selects one of three submodules. Apparently, for Cygwin, it's the same as for Unix, which uses an environment variable and falls back to /tmp. For Windows, it uses the TEMP environment variable and falls back to ..

Ahhh, thank you @mjambon that makes total sense! I built this binary on Cygwin, and it would make sense that the resulting binary only contains the Cygwin module! Otherwise it would be silly to compile all modules in the resulting binary and checking Sys.os_type at runtime to decide how to make temp files. Hmmm.

From Semgrep's perspective, you're probably not interested in Cygwin and would rather have a fully-native Windows build (requiring a workaround to parmap and the semgrep/ocaml-tree-sitter-semgrep#65 bug). I'd of course prefer that too, but I need a Windows Semgrep build now, even if it depends on Cygwin's DLLs. (Recall I'm building semgrep-core on a box with full Cygwin install but running it on a plain Windows box with no Cygwin, just four Cygwin DLLs and the binaries for semgrep-core and spacegrep.)

I'm sure I can hack the Sysdeps in the OCaml source to choose Win32 for Cygwin but, do you know if OCaml can do "cross-compilation" where I build in Cygwin but the resulting binary is intended for Win32? Or any elegant workaround to getting Win32 temp file behavior despite being built on Cygwin?

I'll also continue poking around, but—thank you SO MUCH For your time and your hard work in getting to the bottom of this! 🙏🙇🙏!

@mjambon
Copy link
Member Author

mjambon commented Apr 23, 2021

Ahhh, thank you @mjambon that makes total sense! I built this binary on Cygwin, and it would make sense if the compiler's dead-code elimination removed all non-Cygwin functionality at compile time, instead of checking Sys.os_type at runtime to decide how to make temp files. Hmmm.

"Hmmm" is also my reaction 😅. I don't fully understand what's going on. I asked the question here: https://discuss.ocaml.org/t/executable-built-on-cygwin-runs-on-win32-tries-and-fail-to-use-tmp/7726

PS: I'm not clear on the different variants of Windows builds. Until today I thought there were just 2 kinds: MinGW32 and Cygwin, the former requiring Cygwin at compile time but otherwise building a native Windows executable independent from Cygwin. The latter would require the Cygwin DLL(s) but I thought it would also run in a Cygwin environment and have access to /tmp.

@fasiha
Copy link

fasiha commented Apr 23, 2021

The latter would require the Cygwin DLL(s) but I thought it would also run in a Cygwin environment and have access to /tmp.

Ahh you raise a great point—when Cygwin executables ask for /tmp, who translates that to somewhere on the Windows file system? Is it the shell? No, because I can open cmd.exe and run c:\cygwin64\bin\ls -sh /tmp which works just fine, and prints out the contents of C:\cygwin\tmp. Sooo I'm guessing that semgrep-core is failing to write to it's "/tmp" because that directory doesn't exist in whatever Windows directory the executable or the DLLs interpret to be /. Let me try to find an answer for this, maybe it's as easy as creating a tmp subdirectory somewhere relative to the DLLs or the executable! Thanks for bearing with me and helping me so much in figuring this out!

(And just to confirm. In OCaml for Windows (which as you said uses MinGW plus Cygwin to build fully-native Windows executables):

# Filename.get_temp_dir_name () ;;
- : string = "C:\\OCaml64\\tmp"

Whereas in Cygwin:

utop # Filename.get_temp_dir_name () ;;
- : string = "/tmp"

Makes sense!)

@fasiha
Copy link

fasiha commented Apr 23, 2021

Copying cygpath from my full Cygwin install to the plain Windows VM where I've installed Semgrep, I see that cygpath -w /tmp corresponds to C:\Users\User\AppData\Local\Programs\Python\Python39\Lib\site-packages\semgrep\tmp.

This must be because I put cygwin1.dll in C:\Users\User\AppData\Local\Programs\Python\Python39\Lib\site-packages\semgrep\bin, to sit alongside the semgrep-core executable, and because this path is in my %PATH% environment variable. In a full Cygwin install, cygwin1.dll expects to be in /bin, so it makes sense that it thinks /tmp corresponds to ../tmp relative to it.

When I create this "tmp" directory site-packages\semgrep\tmp, and convert my demo.py script to have Unix line endings via dos2unix, semgrep-core runs!!!

c:\foo>semgrep-core.exe -lang py -json -rules_file c:\foo\prepped1-unix.txt -target_file c:\foo\files.txt
{"matches":[{"check_id":"0.-","path":"c:\\foo\\foo.py","start":{"line":4,"col":1,"offset":15},"end":{"line":4,"col":14,"offset":28},"extra":{"message":"<internalonly>","metavars":{}}}],"errors":[],"stats":{"okfiles":1,"errorfiles":0}}

¡¡¡🙌👏👌💋🥳!!!

(For completeness, here's the contents of prepped1-unix.txt, which Semgrep Python library automatically created when I fed it -e "eval(...)":

rules:
- id: 0.-
  pattern: eval(...)
  severity: ERROR
  languages:
  - py
  message: <internalonly>

and here's the contents of files.txt:

c:\foo\foo.py

and foo.py contains a use of eval.)

(If foo.py has Windows file endings, semgrep-core reports a FatalError: {"matches":[],"errors":[{"check_id":"FatalError","path":"c:\\foo\\foo.py","start":{"line":1,"col":1},"end":{"line":1,"col":1},"extra":{"message":"Fatal Error: End_of_file","line":"x = 123"}}],"stats":{"okfiles":0,"errorfiles":1}}. It has to have Unix line endings.)

Thank you @mjambon for your comment about Cygwin exes understanding Unix paths, that was the crucial breakthrough!

Soooo. Taking stock. For Cygwin to be a viable option, we have to, in addition to the steps in my comment to #1330,

  • copy the four Cygwin DLLs to C:\Users\User\AppData\Local\Programs\Python\Python39\Lib\site-packages\semgrep\bin to sit alongside semgrep-core,
  • then mkdir site-packages\semgrep\tmp
  • ensure that the rules and the target files given as arguments to semgrep-core are Unix formatted (these paths can be Windows paths, but the file contents need Unix line endings).

(Might this perhaps also explain #1037 and #2237, if files had Windows file endings…?)

I think this is what I need to make progress on my project!

Hopefully some of this analysis will be useful in a non-Cygwin pure-OCaml for Windows approach! Super duper thank you to @mjambon and @mschwager and everyone for being so patient and generous with your time and effort! I am truly in your debt.

@fasiha
Copy link

fasiha commented Apr 23, 2021

I asked the question here: https://discuss.ocaml.org/t/executable-built-on-cygwin-runs-on-win32-tries-and-fail-to-use-tmp/7726

I posted a reply there that hopefully is both correct and also useful to the OCaml community. In researching it, I realized that as an alternative to creating a tmp subdirectory to be a sibling directory of the directory that contains cygwin1.dll, we could also explicitly set TMPDIR, which is what the Unix temp_dir_name function looks for. That could be very convenient too, for cases where we maybe can't create directories.

@mjambon
Copy link
Member Author

mjambon commented Apr 23, 2021

Excellent work, @fasiha!

@mschwager
Copy link
Contributor

Great work @fasiha! Thanks for looking so deeply into this, you're making Semgrep better for everyone 🎉

then mkdir site-packages\semgrep\tmp

As you mentioned, it seems we can leverage TMPDIR to get around this issue. I suspect setting an environment variable here is preferable vs. patching the build 👍. Similar to #2018, this is one of the great things about leveraging these standard libraries.

ensure that the rules and the target files given as arguments to semgrep-core are Unix formatted (these paths can be Windows paths, but the file contents need Unix line endings).

Supporting Windows line endings seems like a quick fix we could make. That would make downstream maintainers lives much easier and limit the "one-off steps" necessary to get these binaries running on Windows environments. To clarify this task a bit, can we consider this issue fixed when we support Windows line endings?

@fasiha
Copy link

fasiha commented Apr 26, 2021

Supporting Windows line endings seems like a quick fix we could make. That would make downstream maintainers lives much easier and limit the "one-off steps" necessary to get these binaries running on Windows environments. To clarify this task a bit, can we consider this issue fixed when we support Windows line endings?

If you'd like to support both line endings in the same binary, that would be very cool!

But this behavior might be "correct": we might be seeing this behavior only because the binary was built in Cygwin, so it only supports Unix line endings, similar to how it would support only Windows line endings if it was built as a native-Windows executable? If this is the case, I think you can close this issue right now, since it's caused by an "incorrect" use of the binary (built on Cygwin, running on Windows).


Sidenote. A bit frustratingly for me (and my stupid Cygwin use case 😅)—the files are created by ruamel, which is doing the right thing by emitting Windows line endings: in core_runer.py:

            yaml = YAML()
            yaml.dump({"rules": patterns_json}, pattern_file)

My patch for Cygwin support needs to be pretty ugly 😕. I hope we can get a Windows-native build soon!

@emjin
Copy link
Collaborator

emjin commented May 5, 2021

Thanks for all the work you've already done on this! I'm looking into whether we can support the line endings. Would it be possible for you to attach a file that exhibits what you're describing (both line endings in the same binary)?

@fasiha
Copy link

fasiha commented May 5, 2021

Here's a Python script that I saved as test.py that produces the same rules (and target) files that the real Semgrep Python wrapper creates before invoking semgrep-core—with Windows line endings:

rules = """rules:
- id: 0.-
  pattern: eval(...)
  severity: ERROR
  languages:
  - py
  message: <internalonly>
"""

target = """test.py"""


def save_windows_lineends(contents: str, filename: str):
    new_contents = '\r\n'.join(contents.splitlines())
    open(filename, 'wb').write(bytes(new_contents, 'ascii'))


save_windows_lineends(rules, 'rules.txt')
save_windows_lineends(target, 'target.txt')

eval('1+2')

After I run the script, if I invoke semgrep-core like so on Windows:

>C:\Users\User\AppData\Local\Programs\Python\Python39\Lib\site-packages\semgrep\bin\semgrep-core.exe -lang py -json -rules_file rules.txt -target_file target.txt
{"error":"unknown exception","message":"End_of_file"}

If I convert from Win to Unix line endings, it works:

>copy rules.txt rules-unix.txt
        1 file(s) copied.

>c:\OCaml64\bin\dos2unix.exe rules-unix.txt
dos2unix: converting file rules-unix.txt to Unix format...

>C:\Users\User\AppData\Local\Programs\Python\Python39\Lib\site-packages\semgrep\bin\semgrep-core.exe -lang py -json -rules_file rules-unix.txt -target_file target.txt
{"matches":[{"check_id":"0.-","path":"test.py","start":{"line":23,"col":1,"offset":563},"end":{"line":23,"col":12,"offset":574},"extra":{"message":"<internalonly>","metavars":{}}}],"errors":[],"stats":{"okfiles":1,"errorfiles":0}}

Attached is a ZIP file containing the Cygwin-built binaries and required Cygwin DLLs. If you have a Windows machine or download a Windows 10 evaluation image from Microsoft, you should be able to uncompress this and run \bin\semgrep-core: Cygwin-built semgrep binaries and Cygwin DLLs: semgrep.zip

Hmm, very curious. I just noticed that if I invoke semgrep-core on macOS on the rules file with Windows line endings, it runs fine!? I'm actually not surprised, I'd expect a nice language like OCaml to know how to deal with line endings, but it really confuses me that the Cygwin–Windows build chokes on Windows line endings but the macOS build doesn't? If it's difficult to reproduce this issue in macOS/Linux, please feel free to close, it must be something really weird about Windows, either the build or the runtime.

@fasiha
Copy link

fasiha commented May 7, 2021

Please forgive me for posting this minor observation in this thread instead of #1330—I'm unsure how much of it is related to Windows vs the Cygwin build I'm putting together, so I decided to try here to keep the thread intact.

Recall above our discovery that the Cygwin build of semgrep-core needs (1) a TMPDIR env var declared to work, and (2) Unix-line endings YAML files. I'm experimenting with a patched version of the Python Semgrep package that ensures these two fixes are in place, when I ran into a new error—

I run the following, and get a basic error:

>semgrep -e "eval(...)" --lang=py  z:\Downloads\semgrep-2991\test.py --disable-version-check
ran 1 rules on 1 files: 0 findings
1 files could not be analyzed; run with --verbose for details or run with --strict to exit non-zero if any file cannot be analyzed

Adding logging, I see that this is the command that Python is running (newlines added for clarity):

C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\semgrep\bin\semgrep-core
-lang py
-json
-rules_file C:\Users\User\AppData\Local\Temp\tmpgv5zhylr
-j 2
-target_file C:\Users\User\AppData\Local\Temp\tmpxmdk3zua
-use_parsing_cache C:\Users\User\AppData\Local\Temp\tmp0yqxv1_r
-timeout 30
-max_memory 0

which returns 0 but which prints the following to stdout (whitespace added for clarity):

{
    "matches": [],
    "errors": [
        {
            "check_id": "FatalError",
            "path": "z:\\Downloads\\semgrep-2991\\test.py",
            "start": {"line": 1,"col": 1},
            "end": {"line": 1,"col": 1},
            "extra": {
                "message": "Fatal Error: (Failure \"output_value: not a binary channel\")",
                "line": "#%%"
            }
        }
    ],
    "stats": {"okfiles": 0,"errorfiles": 1}
}

Notice this Fatal Error: (Failure \"output_value: not a binary channel\") error.

However, if I remove the -use_parsing_cache flag (by commenting out two lines), the semgrep invocation works fine:

>semgrep -e "eval(...)" --lang=py test.py --disable-version-check
test.py
22:eval('1+2')
ran 1 rules on 1 files: 1 findings

I see in ocaml-base-compiler.4.10.2/runtime/extern.c:696 caml_failwith("output_value: not a binary channel");.

I noticed that running the raw semgrep-core command above results in an empty file in the parsing cache directory called 0c0defdb9fc576635baf2b07d888bc2d.ast_cache. Rerunning it results in the same empty file. (Changing the Python file to scan results in a different ast_cache file.) I wonder if the error when using parsing cache might be caused by some error in persisting an OCaml data structure to disk?

Just a minor issue I thought to document that might be handy in the eventual native-Windows build (or might be handy to someone like me creating a Cygwin version) 😄.

@mjambon
Copy link
Member Author

mjambon commented May 7, 2021

Thanks for the detailed report, @fasiha. I think I know what causes Fatal Error: (Failure \"output_value: not a binary channel\") and it should be easy to fix. output_value is the function used by Marshal.to_channel to write a binary blob representing an OCaml value (in this case presumably the representation of a parsed file). The error would be because the channel is opened in text mode rather than binary mode and it matters on Windows while it makes no difference on unixes. The fix should consist in using open_out_bin instead of open_out.

@fasiha
Copy link

fasiha commented Jun 17, 2021

Hi Semgrep friends, I'm running into another interesting error with my Cygwin-compiled Windows build of semgrep-core, specifically with multiline patterns.

The following config:

rules:
- id: testmodule.Test
  languages: [python]
  message: Oops...
  severity: WARNING
  patterns:
  - pattern-inside: |
      $OBJECT = testmodule.Test(...)
      ...
      $OBJECT.do_something(...)
  - pattern: $OBJECT.do_something(..., b=$ARG, ...)

runs fine on macOS (of course 😅) but on my Windows build, I see the following error, with whitespace added for ease of reading:

>semgrep.exe --config rule-0.yaml --lang=py --json --disable-version-check z:\demo.py
{
    "results": [],
    "errors": [
        {
            "type": "InvalidPatternError",
            "code": 4,
            "level": "error",
            "short_msg": "invalid pattern",
            "long_msg": "Pattern could not be parsed as a Python semgrep pattern",
            "spans": [
                {
                    "start": {"line": 7,"col": 21},
                    "end": {"line": 11,"col": 1},
                    "source_hash": "5fefce88f712c0b9740c9a9651c4cb8d3c9adebe67fd94cab939fc0dacc92584",
                    "file": "rule-0.yaml",
                    "context_start": null,
                    "context_end": null
                }
            ]
        }
    ]
}

Notice the "Pattern could not be parsed as a Python semgrep pattern" error.

Semgrep converted the above into the following semgrep-core invocation (newlines added for clarity):

semgrep-core
 -lang python
 -json
 -rules_file C:\Users\traveler\pattern-file.txt
 -j 2
 -target_file C:\Users\traveler\target-file.txt
 -timeout 30
 -max_memory 0

And here's the rules_file that the Python Semgrep package generates on Windows (it's identical to the file generated on macOS, same MD5 checksum):

rules:
- id: 0..0
  pattern: |
    $OBJECT = testmodule.Test(...)
    ...
    $OBJECT.do_something(...)
  severity: WARNING
  languages:
  - python
  message: <internalonly>
- id: 0..1
  pattern: $OBJECT.do_something(..., b=$ARG, ...)
  severity: WARNING
  languages:
  - python
  message: <internalonly>

That semgrep-core invocation prints the following to stdout—again whitespace added for readability:

{
    "pattern_id": "0..0",
    "error": "invalid pattern",
    "pattern": "$OBJECT = testmodule.Test(...)\n...\n$OBJECT.do_something(...)\n",
    "language": "Python",
    "message": "End_of_file"
}

It looks like semgrep-core is parsing the pattern correctly: in its error printout above, I think it looks exactly like it ought to?

Is semgrep-core saying the pattern is wrong when it says End_of_file? Or is the pattern being multiline causing some other file to be parsed incorrectly? (Recall from earlier in this thread that if the files passed into semgrep-core had Windows line endings, semgrep-core would error with End_of_file but I'm taking care that everything has Unix line endings, and works great for many other rules.)

I was wondering if you had any idea what might be causing this End_of_file issue? Things I've done to narrow things down and rule things out:

  • Semgrep on Windows works fine if I don't have a multiline pattern-inside. That is,
    • a single-line case (which is quite useless), - pattern-inside: $OBJECT = testmodule.Test(...), works fine.
    • A "multiline" YAML string without any newlines works fine too—I can achieve this by using the |- YAML multiline syntax to strip trailing newline:
      - pattern-inside: |-
            $OBJECT = testmodule.Test(...)
      This works fine too, but obviously is useless.
    • The block-chomping multiline syntax, i.e., replacing | with |- or |+ to control trailing newlines, doesn't help: semgrep-core still prints the End_of_fle error for all of these, if my pattern actually has more than one line. From all these observations, I don't think it's a YAML parsing issue.
  • The rules_file that semgrep-core gets is exactly the same on Windows (which doesn't work) and macOS (which does work), same MD5 checksum.
  • The Python input file to be checked is the same file.
  • For simpler rules without multi-line patterns, Semgrep works great on Windows.
  • I am confident that the problem is with the multiline string in pattern-inside because Semgrep works fine for the same input Python file if I change it to a single line pattern.

Hopefully the above made sense, forgive me for including so many details, I'm hoping someone can think of another tip. Thanks for your patience with me and my weird Cygwin-based Semgrep build! I'm happy to share a zip file of a conda tarball that you can install on Windows to experiment with this.

Edit Adding a -debug flag to semgrep-core adds an extra line of context:

[0.156  Debug      Main.Dune__exe__Main ] exn before exit Parse_mini_rule.InvalidPatternException("0..0", "$OBJECT = testmodule.Test(...)\n...\n$OBJECT.do_something(...)\n", "Python", "End_of_file")

@fasiha
Copy link

fasiha commented Jun 18, 2021

A quick update on the multiline pattern problem described above.

This happens because pfff's Parse_python.ml opens a temp file and write the pattern to it, but it uses Common2.with_tmp_file which uses Common2.write_file which uses open_out and output_string, which unfortunately in Cygwin translate strings with Unix newline (LF) into Windows newlines (CRLF):

$ ocaml
        OCaml version 4.10.2

# let chan = open_out "foo.py" in (output_string chan "a\nb\nc\nd!"; close_out chan);;
- : unit = ()
#

$ xxd foo.py
00000000: 610d 0a62 0d0a 630d 0a64 21              a..b..c..d!

You see 0xd 0xa (CRLF) newlines in the file.

Therefore, when Parse_info.tokenize_all_and_adjust_pos asks Unicode.input_all to use really_input_string and slurp the entire file contents, OCaml uses the length of the channel (in_channel_length) in bytes as the number of characters to read. This causes the End_of_file errors when Cygwin-compiled semgrep-core attempts to read files with Windows line endings.

I think this might be a OCaml+Cygwin bug? Do you think I should ask on the OCaml Discourse why

let file = "foo.txt" ;;

let chan = open_out file;;
output_string chan "a\nb\nc\nd!";;
close_out chan;;

let chan = open_in file ;;
let res = really_input_string chan (in_channel_length chan);;
close_in chan;;

throws End_of_file only on Cygwin? The exact same reader code above with the Windows-formatted file runs fine on macOS: no End_of_file there. That is, your idiom on Unix handles Win and Unix line endings just fine, but breaks horribly on Cygwin.

A stupid solution appears to be, in Parse_python.tokens to just pass in ~unicode_hack:false. I admit I don't quite understand what impact this will have, but with this tiny change, Cygwin-built Semgrep happily parses my rules with multiline patterns on all my Python files 😊. But this just sidesteps the problem.

A second solution appears to be, in the beginning of Unicode.input_and_replace_non_ascii, if you set the channel to binary mode with set_binary_mode_in ic true;, Semgrep again happily consumes multiline patterns and all is well.

I think for my Windows version I'll go with the second solution, but do let me know if you think I ought to raise this question with OCaml Discourse. I can also try and use native Windows OCaml to see if it has the same issue.

@fasiha
Copy link

fasiha commented Jun 18, 2021

I asked on OCaml Discourse. Someone else might chime in but for now I'm afraid the idea seems to be that we'll have to convert all input channels in Semgrep to binary channels for an eventual Windows port 😕. It's not a Cygwin-only issue but rather a Windows issue with text mode: ocaml/ocaml#9868.

@mjambon
Copy link
Member Author

mjambon commented Jun 21, 2021

@fasiha great investigation. I think you should try replacing all instances of open_in with open_in_bin in both pfff and semgrep. pfff is used as a git module within semgrep. So you'd make a branch in pfff, and another one in semgrep, and we'll see if the tests pass. @aryx might know if there's a reason to not do this but he's on vacation until next week so I say let's just try this.

@fasiha
Copy link

fasiha commented Jun 22, 2021

replacing all instances of open_in with open_in_bin in both pfff and semgrep

Ok! Will try this and let you know what happens 💪🤞!

@mjambon
Copy link
Member Author

mjambon commented Aug 4, 2021

I went ahead and changed the ocaml code in pfff and semgrep-core so that all files are opened in binary mode, both for writing and for reading (#3663). I hope this will just fix a bunch of problems that are only seen on Windows (End_of_file exceptions, corrupt cache data). Anything that writes to a regular file will write LF-terminated lines rather than CRLF. This may be too much for certain things like logs, I'm not sure. It may not matter much for now.

@fasiha
Copy link

fasiha commented Aug 5, 2021

pfff and semgrep-core so that all files are opened in binary mode

Very nice, I'm glad to see all checks passing in the PR! I should have mentioned it earlier: when I ran the tests on semgrep before making this change that you suggested, on Windows, a few weeks ago, I saw, as a baseline:

FAILED: Cases: 1432 Tried: 1432 Errors: 5 Failures: 0 Skip:  0 Todo: 0 Timeouts: 0.

Then I changed all open_out to open_out_bin and got many more failures:

FAILED: Cases: 1432 Tried: 1432 Errors: 39 Failures: 0 Skip:  0 Todo: 0 Timeouts: 0.

I'll clone your PR and see if I can run tests on that in Windows. But it's very gratifying that this change doesn't seem to break anything on Unix!

@mjambon
Copy link
Member Author

mjambon commented Aug 6, 2021

I'm also curious to see what we get now.

@nbrahms nbrahms added alpha Relates to an experimental feature and removed priority:medium labels Jan 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
alpha Relates to an experimental feature bug Something isn't working lang:yaml os:windows user:external requested by someone outside of r2c
Development

No branches or pull requests

7 participants
@fasiha @mjambon @mschwager @nbrahms @emjin @sabrinabrogren and others