Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize deep statement matching #852

Merged
merged 5 commits into from May 29, 2020
Merged

Optimize deep statement matching #852

merged 5 commits into from May 29, 2020

Conversation

aryx
Copy link
Collaborator

@aryx aryx commented May 28, 2020

This should close #827 and #664

The code to handle foo(); ...; bar(); was very naive
and was doing lots of useless work. This fixes that.

Test plan:
time pipenv run semgrep -f ~/semgrep/tests/PERF/ajin.yaml ~/semgrep/tests/PERF/three.js
3.2s

(was 3min before)

  • /home/pad/github/semgrep/semgrep-core/_build/default/bin/Main.exe -profile -lang py -f tests/PERF/ellipsis-python.sgrep tests/PERF/my_first_calculator.py

profiling result

Main total : 1.670 sec 1 count
Parse_python.parse : 0.918 sec 1 count
Parse_python.tokens : 0.525 sec 2 count
Semgrep.check : 0.458 sec 1 count
Parser_python.main : 0.277 sec 1 count
Semgrep.match_sts_sts : 0.186 sec 41627 count

(was 85sec before)

  • /home/pad/github/semgrep/semgrep-core/_build/default/bin/Main.exe -profile -lang js -f tests/PERF/ellipsis-js.sgrep tests/PERF/three.js

profiling result

Main total : 2.151 sec 1 count
Parse_js.parse : 1.236 sec 1 count
Parse_js.tokens : 0.398 sec 2 count
Semgrep.check : 0.389 sec 1 count
Semgrep.match_sts_sts : 0.239 sec 16824 count
Parser_js.module_item : 0.192 sec 609 count

(was a lot more before)

aryx added 2 commits May 28, 2020 13:54
This should close #827 and #664

The code to handle foo(); ...; bar(); was very naive
and was doing lots of useless work. This fixes that.

Test plan:
time pipenv run semgrep -f ~/semgrep/tests/PERF/ajin.yaml ~/semgrep/tests/PERF/three.js
3.2s

(was 3min before)

+ /home/pad/github/semgrep/semgrep-core/_build/default/bin/Main.exe -profile -lang py -f tests/PERF/ellipsis-python.sgrep tests/PERF/my_first_calculator.py
---------------------
profiling result
---------------------
Main total                               :      1.670 sec          1 count
Parse_python.parse                       :      0.918 sec          1 count
Parse_python.tokens                      :      0.525 sec          2 count
Semgrep.check                            :      0.458 sec          1 count
Parser_python.main                       :      0.277 sec          1 count
Semgrep.match_sts_sts                    :      0.186 sec      41627 count

(was 85sec before)

+ /home/pad/github/semgrep/semgrep-core/_build/default/bin/Main.exe -profile -lang js -f tests/PERF/ellipsis-js.sgrep tests/PERF/three.js
---------------------
profiling result
---------------------
Main total                               :      2.151 sec          1 count
Parse_js.parse                           :      1.236 sec          1 count
Parse_js.tokens                          :      0.398 sec          2 count
Semgrep.check                            :      0.389 sec          1 count
Semgrep.match_sts_sts                    :      0.239 sec      16824 count
Parser_js.module_item                    :      0.192 sec        609 count

(was a lot more before)
… env var

Those options are useful to debug or profile semgrep-core.
Using the environment variable allows us to pass options to semgrep-core
without having to modify semgrep-python.

Test plan:
pad@yrax:~/github/semgrep/semgrep$ export SEMGREP_CORE_DEBUG=1
pad@yrax:~/github/semgrep/semgrep$ export SEMGREP_CORE_PROFILE=1
pad@yrax:~/github/semgrep/semgrep$ pipenv run semgrep -f ../semgrep-core/tests/PERF/ajin.yaml ../semgrep-core/tests/PERF/three.js
Debug mode On
Executed as: semgrep-core -lang javascript -rules_file /tmp/tmpqfdc1lug -j 8 ../semgrep-core/tests/PERF/three.js
Profile mode On
disabling -j when in profiling mode
PARSING: ../semgrep-core/tests/PERF/three.js
saving rules file for debugging in: /tmp/semgrep_core_rule-4e8afb.yaml
---------------------
profiling result
---------------------
Main total                               :      1.625 sec          1 count
Parse_js.parse                           :      0.724 sec          1 count
Semgrep.check                            :      0.568 sec          1 count
Semgrep.match_sts_sts                    :      0.333 sec     185064 count
@aryx
Copy link
Collaborator Author

aryx commented May 28, 2020

No idea what those tox-tests github actions are.

pr2 "Debug mode On";
pr2 (spf "Executed as: %s" (Sys.argv|>Array.to_list|> String.concat " "));
end;
if !profile then begin
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rcoh maybe this was the issue. Maybe you were running the ocaml programs with profiling information but
because of -j the job was actually done in another process ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that occurred to me after I read that multi threading in OCaml is actually multiprocessing

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, OCaml has concurrent threads (Xavier Leroy the author of OCaml actually added the first POSIX C thead library for Linux a long time ago, and he did it because he wanted threads in OCaml), but it does not have yet multi-core threads. There is work ongoing to suppor that.
Note that neither Python/PHP/Ruby/... have multi-core threads either.

aryx added 2 commits May 28, 2020 15:23
This allows to see which rules take the most time. Note that
when called from semgrep-python, the rule id are not very readable,
but the generated file is saved in /tmp/ so you can find back
what the rule it corresponds to.

Test plan:
export SEMGREP_CORE_PROFILE=1
export SEMGREP_CORE_DEBUG=1
pad@yrax:~/github/semgrep/semgrep$ pipenv run semgrep -f ../semgrep-core/tests/PERF/ajin.yaml ../semgrep-core/tests/PERF/three.js
Debug mode On
Executed as: semgrep-core -lang javascript -rules_file /tmp/tmpy5pzp3p_ -j 8 ../semgrep-core/tests/PERF/three.js
Profile mode On
disabling -j when in profiling mode
PARSING: ../semgrep-core/tests/PERF/three.js
saving rules file for debugging in: /tmp/semgrep_core_rule-97ae74.yaml
---------------------
profiling result
---------------------
Main total                               :      1.975 sec          1 count
Parse_js.parse                           :      0.828 sec          1 count
Semgrep.check                            :      0.791 sec          1 count
Semgrep.match_sts_sts                    :      0.559 sec     185064 count
Parse_js.tokens                          :      0.335 sec         12 count
Parser_js.module_item                    :      0.083 sec        609 count
Normalize_ast.normalize                  :      0.058 sec          1 count
Common.=~                                :      0.043 sec      51044 count
Common.full_charpos_to_pos_large         :      0.042 sec         12 count
rule:0..0.10                             :      0.035 sec      16824 count
rule:0..0.9                              :      0.031 sec      16824 count
rule:0..0.8                              :      0.030 sec      16824 count
rule:0..0.7                              :      0.029 sec      16824 count
rule:0..0.6                              :      0.029 sec      16824 count
rule:0..0.5                              :      0.029 sec      16824 count
rule:0..0.4                              :      0.029 sec      16824 count
rule:0..0.0                              :      0.029 sec      16824 count
rule:0..0.2                              :      0.029 sec      16824 count
rule:0..0.1                              :      0.029 sec      16824 count
rule:0..0.3                              :      0.029 sec      16824 count
file_type_of_file                        :      0.000 sec          2 count
Semgrep.apply_equivalences               :      0.000 sec         11 count
Common.sort_by_xxx                       :      0.000 sec         11 count
Unix.stat                                :      0.000 sec         12 count
@mschwager
Copy link
Contributor

Test plan:
time pipenv run semgrep -f ~/semgrep/tests/PERF/ajin.yaml ~/semgrep/tests/PERF/three.js
3.2s

(was 3min before)

This is amazing!

It looks like a lot of perf tests were added, but none confirming existing behavior. There are a lot of larger changes here, are we confident there's no behavioral regressions without additional functionality tests?

* factorize, but I prefer to control and limit the number of places
* where we call m_stmts_deep. Once we call m_list__m_stmt, we
* are in a simpler world where the list of stmts will not grow.
*)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for explaining the context and the intent

Common.profile_code "Sgrep_generic.check" (
fun () -> check2 ~hook rules equivs file lang
)
let check ~hook a b c d e =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I prefer having the labeled arguments here 🤷

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, it's just that those Common.profile_code are just hacks because there's no super easy way to profile code. In theory I should just run ocamlprof and get nice stats, but I like the focused profile that allows Common.profile_code. Then I want to mimimize the amount of modifications I have to do to the program to support this non-functional property (profiling), so I do that. A better way probably would be to use the recent OCaml attribute to do that, have something like [@@ profile] let check a b c d = ... Maybe @mjambon knows a good ppx rewriter that support that.

let env = Matching_generic.empty_environment () in
GG.m_expr pattern e env
(*e: function [[Semgrep_generic.match_e_e]] *)
let match_e_e ruleid a b =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about naming the wrapper match_e_e_profiled (et c. for other profiled calls)?

I know I was rather confused by the ...2 naming scheme when I first met this code base.

Comment on lines +1285 to +1298
| [], [] ->
return ()
(*s: [[Generic_vs_generic.m_list__m_stmt()]] empty list vs list case *)
(* less-is-ok:
* it's ok to have statements after in the concrete code as long as we
* matched all the statements in the pattern (there is an implicit
* '...' at the end, in addition to implicit '...' at the beginning
* handled by kstmts calling the pattern for each subsequences).
* TODO: sgrep_generic though then display the whole sequence as a match
* instead of just the relevant part.
*)
| [], _::_ ->
return ()
(*e: [[Generic_vs_generic.m_list__m_stmt()]] empty list vs list case *)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the motivation to separate these two cases for documentation?

(vs. | [], _ ->)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just more precise. there is already a case above for [], [], so [], _ below would be more general that it needs to be.

(* let's first try the without going deep *)
(
(* can match nothing *)
(m_list__m_stmt xsa (xb::xsb)) >||>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does one get documentation on >||>?

>>= seems common enough that it's nicer than Monad.bind, but I'm struggling to grok >||> and >!>. Maybe use the Googleable version instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

following

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's defined in Matching_generic.ml, which is 'open'ed at the beginning of the file.
Neither >>= nor >||> are predefined OCaml operators. I've defined those operators
for the purpose of the matching process.

@nbrahms
Copy link
Contributor

nbrahms commented May 28, 2020

Just style thoughts from me, feel free to ignore them.

@brendongo brendongo added this to the 0.9.0 milestone May 28, 2020
@brendongo
Copy link
Member

Pulled in changes from develop to get tests passing

Copy link
Member

@brendongo brendongo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want 81000 lines to be added?

@aryx aryx merged commit c1ca429 into develop May 29, 2020
@aryx aryx deleted the optimize_deep_stmt branch May 29, 2020 09:42
@aryx
Copy link
Collaborator Author

aryx commented May 29, 2020

Yes, I think it's ok to have those 81000 lines; it's convenient to test sometimes performance locally without having to use another repo. Also it boosts my github stats!

IagoAbal added a commit that referenced this pull request Aug 10, 2023
TODO: Test performance

Follows: c1ca429 ("Optimize deep statement matching (#852)")

test plan:
TODO
IagoAbal added a commit that referenced this pull request Aug 10, 2023
TODO: Test performance

Follows: c1ca429 ("Optimize deep statement matching (#852)")

test plan:
TODO
IagoAbal added a commit that referenced this pull request Aug 10, 2023
TODO: Test performance

Follows: c1ca429 ("Optimize deep statement matching (#852)")

test plan:
TODO
IagoAbal added a commit that referenced this pull request Aug 14, 2023
Follows: c1ca429 ("Optimize deep statement matching (#852)")

test plan:
make test # new tests
IagoAbal added a commit that referenced this pull request Aug 14, 2023
Follows: c1ca429 ("Optimize deep statement matching (#852)")

test plan:
make test # new tests
IagoAbal added a commit that referenced this pull request Aug 15, 2023
Follows: c1ca429 ("Optimize deep statement matching (#852)")

test plan:
make test # new tests
IagoAbal added a commit that referenced this pull request Aug 15, 2023
Follows: c1ca429 ("Optimize deep statement matching (#852)")

test plan:
make test # new tests
IagoAbal added a commit that referenced this pull request Aug 15, 2023
Follows: c1ca429 ("Optimize deep statement matching (#852)")

test plan:
make test # new tests
IagoAbal added a commit that referenced this pull request Aug 16, 2023
Follows: c1ca429 ("Optimize deep statement matching (#852)")

test plan:
make test # new tests
IagoAbal added a commit that referenced this pull request Aug 17, 2023
Follows: c1ca429 ("Optimize deep statement matching (#852)")

test plan:
make test # new tests
IagoAbal added a commit that referenced this pull request Aug 17, 2023
Pattern `... bar()` would behave strangely because it would only try to
match deeply if it did not match anything non-deeply. So depending on
how the target code looked, `... bar()` would match or not match inside
e.g. an `if` statement. This was confusing and not a great property for
Semgrep to have.

The regression was introduced in PR #852 as part of a set of
optimizations done back in 0.9.0 (!), but at present reverting this one
thing does not seem to have any negative perf impact.

Follows: c1ca429 ("Optimize deep statement matching (#852)")

Closes semgrep/semgrep-rules#660
Closes PA-2992

test plan:
make test # new tests
Also, compared against develop running p/default on 32 repos from
stress-test-monorepo, and no meaningful slowdown or increase in memory
usage was observed.
cretoxyrhina pushed a commit to cretoxyrhina/semgrep that referenced this pull request Oct 17, 2023
Pattern `... bar()` would behave strangely because it would only try to
match deeply if it did not match anything non-deeply. So depending on
how the target code looked, `... bar()` would match or not match inside
e.g. an `if` statement. This was confusing and not a great property for
Semgrep to have.

The regression was introduced in PR semgrep#852 as part of a set of
optimizations done back in 0.9.0 (!), but at present reverting this one
thing does not seem to have any negative perf impact.

Follows: c1ca429 ("Optimize deep statement matching (semgrep#852)")

Closes semgrep/semgrep-rules#660
Closes PA-2992

test plan:
make test # new tests
Also, compared against develop running p/default on 32 repos from
stress-test-monorepo, and no meaningful slowdown or increase in memory
usage was observed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Semgrep consume all of the CPU, Possible Regex DoS inside the engine.
6 participants