Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Identify strings made of repeated characters as having low entropy #4833

Merged
merged 8 commits into from Mar 21, 2022

Conversation

mjambon
Copy link
Member

@mjambon mjambon commented Mar 18, 2022

This should work but I want to add tests to make sure it does. The issue is that the semgrep matcher returns quoted strings e.g. "xxxxxxxxxxxx" and the entropy analysis is supposed to be smart and eliminate the quotes. This is necessary for the string to be recognized as a repeated character.

Edit: added tests and fixed implementation accordingly.

PR checklist:

  • Tests
  • Documentation is up-to-date
  • Changelog is up-to-date
  • Change has no security implications (otherwise, ping security team)

@mjambon mjambon requested review from kurt-r2c and a team March 18, 2022 03:44
@mjambon mjambon marked this pull request as draft March 18, 2022 03:45
@mjambon mjambon marked this pull request as ready for review March 18, 2022 07:12
@mjambon
Copy link
Member Author

mjambon commented Mar 19, 2022

@kurt-r2c if you find a link to the original issue about entropy analysis, we should add it to CHANGES.md, otherwise I think it's no big deal. This is what I wrote:

- Entropy analysis: strings made of repeated characters such as
  `'xxxxxxxxxxxxxx'` are no longer reported has having high entropy (#4833)

@mjambon
Copy link
Member Author

mjambon commented Mar 19, 2022

mmm, a test fails with generic mode when binding a metavariable to something that looks like an int, which ends up being represented as an Int kind of value:

(Metavariable.E
   { e = (L (Int ((Some 123456789012), ()))); e_id = 0; e_range = None })

Trying to figure out how this works.

@mjambon
Copy link
Member Author

mjambon commented Mar 19, 2022

The new behavior is to unquote the captured string value if possible. If it's not a string, use the original source code fragment.

@github-actions
Copy link
Contributor

🔥 Potential speedup in benchmark semgrep.bench.coinbase.std: -23.6% (-3.137 s)

14 benchmarks, 3.8% faster on average.

Individual deviations greater than 20% from the baseline are reported. An individual performance degradation of over 30% or a global degradation of over 7% is an error and will block the pull request. See run output for full results ('Show all checks' > 'Tests / semgrep benchmark tests' 'Details').

@mjambon
Copy link
Member Author

mjambon commented Mar 19, 2022

Getting what looks like a network error (twice in a row). The URL being reported (https://semgrep.dev/c/auto) works for me from home. I'm going to re-run the job again.

https://github.com/returntocorp/semgrep/runs/5610188578?check_suite_focus=true

--- stdout from semgrep process ---
{"errors": [{"code": 2, "level": "error", "message": "Failed to download config from https://semgrep.dev/c/auto: ('Connection aborted.', RemoteDisconnected('Remote end closed\nconnection without response'))", "type": "SemgrepError"}, {"code": 7, "level": "error", "message": "invalid configuration file found (1 configs were invalid)", "type": "SemgrepError"}], "paths": {"_comment": "<add --verbose for a list of skipped paths>", "scanned": []}, "results": []}

--- end semgrep stdout ---

@kurt-r2c
Copy link
Contributor

@mjambon we were tracking this internally via Linear: RULES-446. There's no associated GitHub issue. The statement as-is should be fine.

@mjambon mjambon merged commit bfe6b94 into develop Mar 21, 2022
@mjambon mjambon deleted the mj-xxxx-not-secret branch March 21, 2022 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants