Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Semgrep consume all of the CPU, Possible Regex DoS inside the engine. #827

Closed
ajinabraham opened this issue May 23, 2020 · 6 comments · Fixed by #852
Closed

Semgrep consume all of the CPU, Possible Regex DoS inside the engine. #827

ajinabraham opened this issue May 23, 2020 · 6 comments · Fixed by #852
Assignees
Labels
bug Something isn't working lang:javascript pattern:ellipsis performance user:external requested by someone outside of r2c

Comments

@ajinabraham
Copy link
Contributor

Describe the bug

Certain files and rules cause semgrep to consume all CPU and hangs the process.

To Reproduce
$ semgrep -f foo.yaml ~/Downloads/juice-shop-master/frontend/src/assets/private/three.js

File: https://github.com/bkimminich/juice-shop/blob/master/frontend/src/assets/private/three.js
Rule: foo.yaml

rules:
  - id: cpu_kill
    patterns:
      - pattern-either:
          - pattern: |
              var $PARSER = new libxmljs.SaxParser();
              ...
              $PARSER.parseString(..., <... $REQ.$QUERY ...>, ...);
          - pattern: |
              var $PARSER = new libxmljs.SaxParser();
              ...
              $PARSER.foo(...);
          - pattern: |
              var $PARSER = new libxmljs.SaxParser();
              ...
              $PARSER.bar(...);
          - pattern: |
              var $PARSER = new libxmljs.SaxParser();
              ...
              $PARSER.again(...);
          - pattern: |
              var $PARSER = new libxmljs.SaxParser();
              ...
              $PARSER.again(...);
          - pattern: |
              var $PARSER = new libxmljs.SaxParser();
              ...
              $PARSER.blaa(...);
          - pattern: |
              var $PARSER = new libxmljs.SaxParser();
              ...
              $PARSER.fo(...);
          - pattern: |
              var $PARSER = new libxmljs.SaxParser();
              ...
              $PARSER.ell(...);
          - pattern: |
              var $PARSER = new libxmljs.SaxParser();
              ...
              $PARSER.nothing(...);
          - pattern: |
              var $PARSER = new libxmljs.SaxParser();
              ...
              $PARSER.again(...);
          - pattern: |
              var $PARSER = new libxmljs.SaxParser();
              ...
              $PARSER.fance(...);
    message: CPU gone!
    languages:
      - javascript
    severity: ERROR

This issue is present with rules that have ellipsis ... used to match eveything in between, which makes me think that there is Regex dos happening inside semgrep core.

 - pattern: |
              var $PARSER = new libxmljs.SaxParser();
              ...
              $PARSER.fance(...);

Expected behavior
semgrep should have checks against this or timeout on the file if the analysis is taking more than X time.

Screenshots
Screenshot 2020-05-23 at 5 48 52 PM

Environment
semgrep 0.8.0 installed from PyPi

$ semgrep --version
0.8.0
@ajinabraham
Copy link
Contributor Author

Forcefully killing throws

-c intercepted, will do some cleaning before exiting
[Parmap]: error at index j=3 in (10,14), chunksize=5 of a total of 5 got exception Common.UnixExit(-1) on core 2
an internal error occured while invoking the semgrep engine: unknown exception: Common.UnixExit(-1).
An error occurred while invoking the semgrep engine; please help us fix this by filing an an issue at https://semgrep.dev

@dlukeomalley dlukeomalley added bug Something isn't working pattern:ellipsis performance user:external requested by someone outside of r2c labels May 24, 2020
@dlukeomalley
Copy link
Member

dlukeomalley commented May 24, 2020

@ajinabraham Thank you for filing this. Being sensitive to your project, on what timeline would you like to see a fix? We don't want to block you or your v4 rollout. Depending on how pressing this is, is it acceptable for us to investigate this next week?

@ajinabraham
Copy link
Contributor Author

This is not blocking me as of now. This was reported by one of the early adopters of nodejsscan cli. For now he can scan the source by ignoring the problematic files.ajinabraham/njsscan#1

@aryx
Copy link
Collaborator

aryx commented May 26, 2020

Yes it's probably because of the '...' which right now can be slow on huge files because of the
deep statement matching that is done implicitely (it basically tries to match on any sequence of statements where those sequences are the flatten list of all statements in a file, including in nested functions. We know this could be slow, I'm not sure though what to do for those. Maybe limit the depth of those sequences.

@ajinabraham
Copy link
Contributor Author

Yeah it would be ideal to support a configurable depth parameter in semgrep API with a reasonable default.

aryx added a commit that referenced this issue May 28, 2020
This should close #827 and #664

The code to handle foo(); ...; bar(); was very naive
and was doing lots of useless work. This fixes that.

Test plan:
time pipenv run semgrep -f ~/semgrep/tests/PERF/ajin.yaml ~/semgrep/tests/PERF/three.js
3.2s

(was 3min before)

+ /home/pad/github/semgrep/semgrep-core/_build/default/bin/Main.exe -profile -lang py -f tests/PERF/ellipsis-python.sgrep tests/PERF/my_first_calculator.py
---------------------
profiling result
---------------------
Main total                               :      1.670 sec          1 count
Parse_python.parse                       :      0.918 sec          1 count
Parse_python.tokens                      :      0.525 sec          2 count
Semgrep.check                            :      0.458 sec          1 count
Parser_python.main                       :      0.277 sec          1 count
Semgrep.match_sts_sts                    :      0.186 sec      41627 count

(was 85sec before)

+ /home/pad/github/semgrep/semgrep-core/_build/default/bin/Main.exe -profile -lang js -f tests/PERF/ellipsis-js.sgrep tests/PERF/three.js
---------------------
profiling result
---------------------
Main total                               :      2.151 sec          1 count
Parse_js.parse                           :      1.236 sec          1 count
Parse_js.tokens                          :      0.398 sec          2 count
Semgrep.check                            :      0.389 sec          1 count
Semgrep.match_sts_sts                    :      0.239 sec      16824 count
Parser_js.module_item                    :      0.192 sec        609 count

(was a lot more before)
@aryx
Copy link
Collaborator

aryx commented May 28, 2020

no need for -depth parameter yet. The code I wrote was just terribly slow and was doing lots of useless work. With the small fix things seems more tractable.

@aryx aryx self-assigned this May 28, 2020
@aryx aryx closed this as completed in #852 May 29, 2020
aryx added a commit that referenced this issue May 29, 2020
* Optimize deep statement matching

This should close #827 and #664

The code to handle foo(); ...; bar(); was very naive
and was doing lots of useless work. This fixes that.

Test plan:
time pipenv run semgrep -f ~/semgrep/tests/PERF/ajin.yaml ~/semgrep/tests/PERF/three.js
3.2s

(was 3min before)

+ /home/pad/github/semgrep/semgrep-core/_build/default/bin/Main.exe -profile -lang py -f tests/PERF/ellipsis-python.sgrep tests/PERF/my_first_calculator.py
---------------------
profiling result
---------------------
Main total                               :      1.670 sec          1 count
Parse_python.parse                       :      0.918 sec          1 count
Parse_python.tokens                      :      0.525 sec          2 count
Semgrep.check                            :      0.458 sec          1 count
Parser_python.main                       :      0.277 sec          1 count
Semgrep.match_sts_sts                    :      0.186 sec      41627 count

(was 85sec before)

+ /home/pad/github/semgrep/semgrep-core/_build/default/bin/Main.exe -profile -lang js -f tests/PERF/ellipsis-js.sgrep tests/PERF/three.js
---------------------
profiling result
---------------------
Main total                               :      2.151 sec          1 count
Parse_js.parse                           :      1.236 sec          1 count
Parse_js.tokens                          :      0.398 sec          2 count
Semgrep.check                            :      0.389 sec          1 count
Semgrep.match_sts_sts                    :      0.239 sec      16824 count
Parser_js.module_item                    :      0.192 sec        609 count

(was a lot more before)

* Adding -debug -profile to semgrep-core and SEMGREP_CORE_DEBUG/PROFILE env var

Those options are useful to debug or profile semgrep-core.
Using the environment variable allows us to pass options to semgrep-core
without having to modify semgrep-python.

Test plan:
pad@yrax:~/github/semgrep/semgrep$ export SEMGREP_CORE_DEBUG=1
pad@yrax:~/github/semgrep/semgrep$ export SEMGREP_CORE_PROFILE=1
pad@yrax:~/github/semgrep/semgrep$ pipenv run semgrep -f ../semgrep-core/tests/PERF/ajin.yaml ../semgrep-core/tests/PERF/three.js
Debug mode On
Executed as: semgrep-core -lang javascript -rules_file /tmp/tmpqfdc1lug -j 8 ../semgrep-core/tests/PERF/three.js
Profile mode On
disabling -j when in profiling mode
PARSING: ../semgrep-core/tests/PERF/three.js
saving rules file for debugging in: /tmp/semgrep_core_rule-4e8afb.yaml
---------------------
profiling result
---------------------
Main total                               :      1.625 sec          1 count
Parse_js.parse                           :      0.724 sec          1 count
Semgrep.check                            :      0.568 sec          1 count
Semgrep.match_sts_sts                    :      0.333 sec     185064 count

* Profiling information for each rule

This allows to see which rules take the most time. Note that
when called from semgrep-python, the rule id are not very readable,
but the generated file is saved in /tmp/ so you can find back
what the rule it corresponds to.

Test plan:
export SEMGREP_CORE_PROFILE=1
export SEMGREP_CORE_DEBUG=1
pad@yrax:~/github/semgrep/semgrep$ pipenv run semgrep -f ../semgrep-core/tests/PERF/ajin.yaml ../semgrep-core/tests/PERF/three.js
Debug mode On
Executed as: semgrep-core -lang javascript -rules_file /tmp/tmpy5pzp3p_ -j 8 ../semgrep-core/tests/PERF/three.js
Profile mode On
disabling -j when in profiling mode
PARSING: ../semgrep-core/tests/PERF/three.js
saving rules file for debugging in: /tmp/semgrep_core_rule-97ae74.yaml
---------------------
profiling result
---------------------
Main total                               :      1.975 sec          1 count
Parse_js.parse                           :      0.828 sec          1 count
Semgrep.check                            :      0.791 sec          1 count
Semgrep.match_sts_sts                    :      0.559 sec     185064 count
Parse_js.tokens                          :      0.335 sec         12 count
Parser_js.module_item                    :      0.083 sec        609 count
Normalize_ast.normalize                  :      0.058 sec          1 count
Common.=~                                :      0.043 sec      51044 count
Common.full_charpos_to_pos_large         :      0.042 sec         12 count
rule:0..0.10                             :      0.035 sec      16824 count
rule:0..0.9                              :      0.031 sec      16824 count
rule:0..0.8                              :      0.030 sec      16824 count
rule:0..0.7                              :      0.029 sec      16824 count
rule:0..0.6                              :      0.029 sec      16824 count
rule:0..0.5                              :      0.029 sec      16824 count
rule:0..0.4                              :      0.029 sec      16824 count
rule:0..0.0                              :      0.029 sec      16824 count
rule:0..0.2                              :      0.029 sec      16824 count
rule:0..0.1                              :      0.029 sec      16824 count
rule:0..0.3                              :      0.029 sec      16824 count
file_type_of_file                        :      0.000 sec          2 count
Semgrep.apply_equivalences               :      0.000 sec         11 count
Common.sort_by_xxx                       :      0.000 sec         11 count
Unix.stat                                :      0.000 sec         12 count

* * docs/development.md: improve doc

Co-authored-by: brendon <brendon.go@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working lang:javascript pattern:ellipsis performance user:external requested by someone outside of r2c
Development

Successfully merging a pull request may close this issue.

3 participants