Issues when scanning long lines with `generic` mode #6071

inkz · 2022-09-09T04:52:07Z

Describe the bug

When you run a generic rule (this one specifically: https://semgrep.dev/s/7PnQ) on a one-line snippet of code, it does not work.
When you add an extra line to the snippet, as here: https://semgrep.dev/s/E6zN, it works.
The same happens in the CLI.
If you make the first line even longer (https://semgrep.dev/playground/s/7PQ2), it stops working again, even with the 1 extra line.
If I then add one more extra line (https://semgrep.dev/s/Le26), it works again.
This behavior also seems to affect the autofix feature, which works incorrectly on such lines (but this may be unrelated ✌️).
Running https://semgrep.dev/s/7PnQ in the CLI will incorrectly fix the last finding.

To Reproduce
https://semgrep.dev/s/7PnQ


r2c-demo · 2022-09-09T04:52:22Z

This issue is synced in Linear at https://linear.app/r2c/issue/PA-1870/issues-when-scanning-long-lines-with-generic-mode. Note: this link is for r2c use only and is not accessible publicly.

inkz · 2022-09-09T04:54:16Z

requested by user: https://r2c-community.slack.com/archives/C018NJRRCJ0/p1662577736189749

dmgping · 2022-09-15T00:21:06Z

Hi guys, is there an estimated fix timeline for this one?

mjambon · 2022-09-15T20:23:01Z

The issue here is that the target file is just one very long line. Adding a newline character will restore normal functionality in this particular example: https://semgrep.dev/s/2eP5

Unfortunately, the generic mode engine (spacegrep) will not handle files made of only extremely long lines. This is meant to avoid extremely slow parsing and matching, which could happen if it were to process a minified or a binary file. It's not really a question of file size but rather the lack of a tree-like structure determined by indentation.

To users, I would recommend adding a linter pass that requires lines to be under some reasonable limit, e.g. 80 or 120. The current heuristic used to determine if a file is source code requires an average line length of under 150 bytes. 150 is a lot given that it's an average, except in the case of one-line files.
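To make the averaging behavior concrete, here is a minimal Python sketch of an average-line-length check. It is not spacegrep's actual code: the function names and the way lines are counted are assumptions for illustration, and only the 150-byte threshold comes from the comment above. It shows why a single 200-byte line is rejected while the same line followed by a short second line is accepted.

```python
def average_line_length(data: bytes) -> float:
    """Mean number of bytes per line, not counting newline characters."""
    lines = data.splitlines()
    if not lines:
        return 0.0
    return sum(len(line) for line in lines) / len(lines)


def looks_like_source(data: bytes, max_avg: int = 150) -> bool:
    # Hypothetical stand-in for the engine's check; only the 150-byte
    # figure is taken from the discussion, the rest is assumed.
    return average_line_length(data) < max_avg


one_long_line = b"x" * 200
with_extra_line = one_long_line + b"\nshort line\n"

print(looks_like_source(one_long_line))    # False: average is 200 bytes
print(looks_like_source(with_extra_line))  # True: average is (200 + 10) / 2 = 105 bytes
```

Under this assumed model, making the first line longer pushes the average back over the limit, and adding another short line pulls it back under, which matches the pattern reported at the top of this issue.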

To rule writers, I would recommend being patient and trying to remember that test cases that are just one long line may run into this problem, for the time being.

The changes we could make in the software include:

1. Communicate to the user of the playground (and semgrep in general) that the target was identified as not human-readable and is being ignored.
2. Relax the heuristic to allow any small file (say, 500 bytes) to be accepted as input.

Item (2) is relatively easy to do. More investigation is needed to evaluate what needs to be done for (1).
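As a rough illustration of item (2), the following Python sketch accepts any small file before falling back to the average-line-length check. The function name, the inclusive comparison, and the line-counting method are assumptions made for the example, not the engine's implementation; the 500-byte and 150-byte figures are the ones mentioned in this comment.

```python
def accept_target(data: bytes, small_file_limit: int = 500, max_avg: int = 150) -> bool:
    # Assumed behavior: files at or under the size limit are always accepted,
    # regardless of how long their lines are.
    if len(data) <= small_file_limit:
        return True
    # Otherwise fall back to the average line length, in bytes.
    lines = data.splitlines() or [b""]
    average = sum(len(line) for line in lines) / len(lines)
    return average < max_avg


short_one_liner = b"x" * 200    # small file made of one long line
large_one_liner = b"x" * 2000   # large file made of one very long line

print(accept_target(short_one_liner))  # True: under the small-file limit
print(accept_target(large_one_liner))  # False: too large, and average line length is 2000
```

The point of the design is that short test snippets are cheap to process whatever their layout, so the size check can run first without weakening the protection against minified or binary files.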

dmgping · 2022-09-15T22:07:39Z

Hi @mjambon
No. 2 sounds like what users (namely myself) will need from the tool. While the line in this example is long, for an HTML file I don't think it's that unreasonable in terms of length, and it certainly isn't the longest I've seen.
Maybe there could be a flag passed to the semgrep command to set a higher value for the length heuristic? That would keep the default lightweight.

dmgping · 2022-09-22T19:50:27Z

Hi @mjambon, do you know when this fix will be merged into the CLI and the returntocorp/semgrep-agent:v1 image?

mjambon · 2022-09-22T20:09:04Z

The 500-byte limit will be available in the next semgrep release (0.115.x), next week. This value is not configurable because I was a bit worried about performance (and I was lazy). In retrospect, I think it's a good thing to have a safe default and let users experiment with a value that works for them.

mjambon · 2022-09-22T20:21:36Z

^ #6162 is the follow-up task.

dmgping · 2022-09-27T17:23:06Z

@mjambon It appears the implemented fix hasn't worked for this issue. I upgraded to v0.115, ran the rule again, and saw the same misplacement of the inserted autofix code. The original targeted line was 197 chars long.

The detection is correct in the terminal output, but the changes made are wrong.

On further investigation, it does not seem to be an issue when there is only one finding on a long line, but rather when multiple findings occur on the same long line.
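Correct detection combined with a misplaced fix when several findings share a line is the kind of symptom that can appear when multiple replacements are applied using offsets computed against the original text. Whether that is what happens in semgrep is not established in this thread. The Python sketch below is purely illustrative, with hypothetical names and a made-up input line, and shows how stale offsets shift a second fix while applying edits from right to left keeps them valid.

```python
from typing import List, Tuple

# (start offset, end offset, replacement); offsets refer to the ORIGINAL text.
Edit = Tuple[int, int, str]


def apply_left_to_right_unadjusted(text: str, edits: List[Edit]) -> str:
    # Each edit changes the text length, so the offsets of later edits
    # no longer point at the intended span.
    for start, end, replacement in sorted(edits):
        text = text[:start] + replacement + text[end:]
    return text


def apply_right_to_left(text: str, edits: List[Edit]) -> str:
    # Editing from the end of the text backwards leaves earlier offsets untouched.
    for start, end, replacement in sorted(edits, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text


line = "<a href=x> <a href=y>"
edits = [(3, 9, 'href="x"'), (14, 20, 'href="y"')]

print(apply_left_to_right_unadjusted(line, edits))  # second fix lands in the wrong place
print(apply_right_to_left(line, edits))             # <a href="x"> <a href="y">
```

With a single finding per line both functions give the same result, which would be consistent with the observation that the problem only shows up when multiple findings occur on the same line.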

dmgping · 2022-10-17T18:01:22Z

Hi @mjambon, can you review the above? Thanks

dmgping · 2023-09-13T19:07:14Z

Following up on this @mjambon, issue still exists today.

inkz added the user:external requested by someone outside of r2c label Sep 9, 2022

emjin added feature:autofix lang:generic generic mode issues (spacegrep, aliengrep) feature:matching labels Sep 9, 2022

mjambon added the priority:medium label Sep 15, 2022

mjambon mentioned this issue Sep 16, 2022

generic mode: Allow short text targets without 2D layout #6113

Merged

5 tasks

mjambon closed this as completed in #6113 Sep 20, 2022

mjambon mentioned this issue Sep 22, 2022

Make generic mode's 500-byte limit configurable #6162

Open

mjambon added the lang:aliengrep newer engine for the generic mode label May 9, 2023
