Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues when scanning long lines with generic mode #6071

Closed
inkz opened this issue Sep 9, 2022 · 11 comments · Fixed by #6113
Closed

Issues when scanning long lines with generic mode #6071

inkz opened this issue Sep 9, 2022 · 11 comments · Fixed by #6113
Labels
feature:autofix feature:matching lang:aliengrep newer engine for the generic mode lang:generic generic mode issues (spacegrep, aliengrep) priority:medium user:external requested by someone outside of r2c

Comments

@inkz
Copy link
Member

inkz commented Sep 9, 2022

Describe the bug

To Reproduce
https://semgrep.dev/s/7PnQ

@r2c-demo
Copy link
Collaborator

r2c-demo commented Sep 9, 2022

This issue is synced in Linear at https://linear.app/r2c/issue/PA-1870/issues-when-scanning-long-lines-with-generic-mode. Note: this link is for r2c use only and is not accessible publicly.

@inkz inkz added the user:external requested by someone outside of r2c label Sep 9, 2022
@inkz
Copy link
Member Author

inkz commented Sep 9, 2022

@emjin emjin added feature:autofix lang:generic generic mode issues (spacegrep, aliengrep) feature:matching labels Sep 9, 2022
@dmgping
Copy link

dmgping commented Sep 15, 2022

Hi guys, is there an estimated fix timeline for this one?

@mjambon
Copy link
Member

mjambon commented Sep 15, 2022

The issue here is that the target file is just one very long line. Adding a newline character will restore normal functionality in this particular example: https://semgrep.dev/s/2eP5

Unfortunately, the generic mode engine (spacegrep) will not handle files made of only extremely long lines. This is meant to avoid extremely slow parsing and matching, which could happen if it were to process a minified or a binary file. It's not really a question of file size but rather the lack of a tree-like structure determined by indentation.

To users, I would recommend adding a linter pass that requires lines to be under some reasonable limit, e.g. 80 or 120. The current heuristic used to determine if a file is source code requires an average line length of under 150 bytes. 150 is a lot given that it's an average, except in the case of one-line files.

To rule writers, I would recommend being patient and trying to remember that test cases that are just one long line may run into this problem, for the time being.

The changes we could make in the software include:

  1. Communicate to the user of the playground (and semgrep in general) that the target was identified as not human-readable and is being ignored.
  2. Relax the heuristic to allow any small file (say, 500 bytes) to be accepted as input.

Item (2) is relatively easy to do. More investigation is needed to evaluate what needs to be done for (1).

@dmgping
Copy link

dmgping commented Sep 15, 2022

Hi @mjambon
No.2 sounds like what users(namely myself) will need for the tool. While the line in this example is a long line, for a html file I don't thinks it's that unreasonable in terms of length, and certainly isn't the longest I've seen.
Maybe it could be a flag that could be passed to the semgrep command in to set a higher value for the length heuristic? Thus keeping the default lightweight.

@dmgping
Copy link

dmgping commented Sep 22, 2022

Hi @mjambon , do you know when this fix will be merged into the cli and the returntocorp/semgrep-agent:v1 image?

@mjambon
Copy link
Member

mjambon commented Sep 22, 2022

The 500-byte limit will be available in the next semgrep release (0.115.x), next week. This value is not configurable because I was a bit worried of performance (and I was lazy). In retrospect, I think it's a good thing to have a safe default and let users experiment with a value that works for them.

@mjambon
Copy link
Member

mjambon commented Sep 22, 2022

^ #6162 is the follow-up task.

@dmgping
Copy link

dmgping commented Sep 27, 2022

@mjambon It appears the implemented fix hasn't worked for this issue. I upgraded to v0.115 and ran the rule again and seen the same misplacement of the autofix code inserted. The original targeted line was 197 chars long.
Screen Shot 2022-09-27 at 10 18 03 AM
The detection is correct in the terminal output, but the changes made are wrong.
image
On further investigation, it does not seem to be an issue when there is only one finding on a long line, but more so when multiple findings occur on the same long line.

@dmgping
Copy link

dmgping commented Oct 17, 2022

Hi @mjambon, can you review the above? Thanks

@mjambon mjambon added the lang:aliengrep newer engine for the generic mode label May 9, 2023
@dmgping
Copy link

dmgping commented Sep 13, 2023

Following up on this @mjambon, issue still exists today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature:autofix feature:matching lang:aliengrep newer engine for the generic mode lang:generic generic mode issues (spacegrep, aliengrep) priority:medium user:external requested by someone outside of r2c
Development

Successfully merging a pull request may close this issue.

5 participants