Semgrep Core sets file as "NO FILE INFO YET" when failing to parse C code #1925

Closed
brendongo opened this issue Oct 29, 2020 · 6 comments
Labels
bug Something isn't working lang:c

Comments

@brendongo
Member

Describe the bug
When run on invalid C code, semgrep fails to parse the semgrep-core output. It assumes that
the path field of a LexicalError is a path to a file, but the field instead contains the string "NO FILE INFO YET".

To Reproduce
Locally run semgrep --config s/eryd on the following target:

foo.c

# this is invalid code
int x = 1

Expected behavior
semgrep should report this as a parse error.

@brendongo brendongo added the bug Something isn't working label Oct 29, 2020
@aryx aryx added the lang:c label Oct 30, 2020
@mschwager
Contributor

See our other NO FILE INFO YET issues too:

We should at least return a more helpful error - something that the user can use to fix the problem or take additional action.

@aryx
Collaborator

aryx commented Nov 9, 2020

Note that the code in your example is invalid because it uses a Python-style comment (# xxx) in a C file.
This causes a lexical error (not a parse error), which in turn causes the NO FILE INFO YET.
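
To illustrate the distinction, here is a minimal Python sketch of a toy lexer. It is not the pfff lexer: the token grammar, `TOKEN_RE`, `tokenize`, and this `LexicalError` class are all hypothetical names chosen for the example. The point is only that an unrecognised character fails during tokenization, before any parser sees the input.

```python
import re


class LexicalError(Exception):
    """Raised when no token rule matches the input (hypothetical name)."""


# A deliberately tiny token grammar for a C-like language; '#' is not in it.
TOKEN_RE = re.compile(r"\s+|[A-Za-z_]\w*|\d+|[=;]")


def tokenize(source):
    """Split source into tokens, failing on the first unrecognised character."""
    tokens, pos = [], 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            # No token rule applies here: this is a lexical error.
            raise LexicalError(f"unrecognised symbol: {source[pos]!r}")
        if not m.group().isspace():
            tokens.append(m.group())
        pos = m.end()
    return tokens
```

With this sketch, `tokenize("int x = 1")` succeeds even though the statement lacks a semicolon (that would be for a parser to reject), while the input from the report fails on the `#` in the first line.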

aryx added a commit to semgrep/pfff that referenced this issue Nov 9, 2020
…st_pos

This will help semgrep/semgrep#1925
The helper tokenize_all_and_adjust_pos correctly intercepts Lexical_error
and adjusts the file position of the token inside the Lexical_error.
When I introduced this helper function, I forgot to use it for the
C/C++ parser (not sure why, maybe because the code was also handling
ExpandedTok).

test plan:
$ semgrep -l c -e 'FOO' /tmp/foo.c
ran 1 rules on 1 files: 0 findings
1 files could not be analyzed; run with --verbose for details or run with --strict to exit non-zero if any file cannot be analyzed

does not generate Python backtrace anymore.

Same with
$ /home/pad/semgrep/_build/default/cli/Main.exe -dump_ast /tmp/foo.c
/tmp/foo.c:3:0: Lexical error: unrecognised symbol, in token rule:#
Raised at file "parsing/Parse_code.ml", line 144, characters 24-27
Called from file "parsing/Parse_code.ml", line 236, characters 18-48
Called from file "cli/Main.ml", line 855, characters 6-72
Called from file "pfff/h_program-lang/Error_code.ml", line 388, characters 4-8

no more "NO FILE INFO YET" exn.
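
The pattern described in this commit message can be sketched in Python as follows. This is an illustration under stated assumptions, not the OCaml code in pfff: `Pos`, `raw_tokenize`, `complete_pos`, and `tokenize_and_adjust_pos` are hypothetical stand-ins. The raw lexer only knows a character offset, and the wrapper intercepts the error and fills in the file, line, and column before re-raising.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pos:
    # Placeholder values, mirroring the ones seen in this issue.
    file: str = "NO FILE INFO YET"
    line: int = -1
    col: int = -1


class LexicalError(Exception):
    def __init__(self, message, charpos, pos=Pos()):
        super().__init__(message)
        self.message = message
        self.charpos = charpos  # offset into the text, known to the lexer
        self.pos = pos          # file/line/col, unknown until completed


def raw_tokenize(text):
    """Toy lexer (hypothetical): rejects '#', knows character offsets only."""
    for i, ch in enumerate(text):
        if ch == "#":
            raise LexicalError("unrecognised symbol, in token rule:#", i)
    return text.split()


def complete_pos(path, text, charpos):
    """Turn a character offset into a 1-based line and 0-based column."""
    line = text.count("\n", 0, charpos) + 1
    col = charpos - (text.rfind("\n", 0, charpos) + 1)
    return Pos(path, line, col)


def tokenize_and_adjust_pos(path, text):
    """Tokenize, re-raising any LexicalError with a completed position."""
    try:
        return raw_tokenize(text)
    except LexicalError as err:
        fixed = complete_pos(path, text, err.charpos)
        raise LexicalError(err.message, err.charpos, fixed) from None
```

A parser that calls `raw_tokenize` directly leaks the placeholder position, which is the kind of omission the commit message describes for the C/C++ parser.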
aryx added a commit that referenced this issue Nov 9, 2020
…C code

Fixes #1925

test plan:
$ /home/pad/semgrep/_build/default/cli/Main.exe -dump_ast foo.c
foo.c:3:0: Lexical error: unrecognised symbol, in token rule:#
Raised at file "parsing/Parse_code.ml", line 144, characters 24-27
Called from file "parsing/Parse_code.ml", line 236, characters 18-48

no more NO_FILE_INFO_YET error (which causes the python wrapper
to crash).

Also:
$ semgrep -l c -e 'FOO' tests/OTHER/parsing_errors/foo.c
ran 1 rules on 1 files: 0 findings
1 files could not be analyzed; run with --verbose for details or run with --strict to exit non-zero if any file cannot be analyzed
@aryx aryx closed this as completed in 0fb6321 Nov 9, 2020
@mschwager
Contributor

mschwager commented Nov 9, 2020

@aryx Can we output something more descriptive than NO FILE INFO YET? That would be very helpful for debugging.

EDIT: Looking at a288fcc more closely, it looks like we shouldn't get NO FILE INFO YET anymore?

@aryx
Collaborator

aryx commented Nov 10, 2020

Ok but what? I can put "Bug in lexer, you forgot the call to complete_parse_info somewhere"

@mschwager
Contributor

Ok but what? I can put "Bug in lexer, you forgot the call to complete_parse_info somewhere"

I think the problem is that we're putting NO FILE INFO YET in the output path key. Then, when the Python code tries to use path, it blows up because this file doesn't exist. E.g.

# this is invalid code
int x = 1

$ python -m semgrep --pattern '$X == $X' --lang c /tmp/test.c
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
...
FileNotFoundError: [Errno 2] No such file or directory: 'NO FILE INFO YET'

The semgrep-core output is:

ipdb> pp output_json
{'errors': [{'check_id': 'LexicalError',
             'end': {'col': 1, 'line': -1},
             'extra': {'line': 'NO LINE',
                       'message': 'Lexical error: unrecognised symbol, in '
                                  'token rule:#'},
             'path': 'NO FILE INFO YET',
             'start': {'col': 0, 'line': -1}}],
 'matches': [],
 'stats': {'errorfiles': 1, 'okfiles': 0}}

Notice 'path': 'NO FILE INFO YET' - this makes the Python code blow up because it assumes path is an existing file. Shouldn't semgrep-core know that the path is /tmp/test.c here?
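
Independently of a fix in semgrep-core, the Python side could avoid the crash by checking the path before using it. The following is a rough sketch only, not the actual semgrep code: `split_core_errors` is a hypothetical helper, and the only thing taken from the output above is the shape of the `errors` list and its `path` key.

```python
import os


def split_core_errors(output_json):
    """Separate errors whose 'path' is a real file from those without one.

    Hypothetical helper: returns (usable, unlocated) lists of error dicts.
    """
    usable, unlocated = [], []
    for err in output_json.get("errors", []):
        path = err.get("path")
        if isinstance(path, str) and os.path.isfile(path):
            usable.append(err)
        else:
            # e.g. 'NO FILE INFO YET': report it, but never open it as a file.
            unlocated.append(err)
    return usable, unlocated
```

Errors in the second list could then be reported with their message alone instead of raising FileNotFoundError.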

@mschwager
Contributor

#1999!
