Optimize deep statement matching #852

aryx · 2020-05-28T11:54:29Z

This should close #827 and #664

The code to handle foo(); ...; bar(); was very naive
and was doing lots of useless work. This fixes that.

Test plan:
time pipenv run semgrep -f ~/semgrep/tests/PERF/ajin.yaml ~/semgrep/tests/PERF/three.js
3.2s

(was 3min before)

/home/pad/github/semgrep/semgrep-core/_build/default/bin/Main.exe -profile -lang py -f tests/PERF/ellipsis-python.sgrep tests/PERF/my_first_calculator.py

profiling result

Main total : 1.670 sec 1 count
Parse_python.parse : 0.918 sec 1 count
Parse_python.tokens : 0.525 sec 2 count
Semgrep.check : 0.458 sec 1 count
Parser_python.main : 0.277 sec 1 count
Semgrep.match_sts_sts : 0.186 sec 41627 count

(was 85sec before)

/home/pad/github/semgrep/semgrep-core/_build/default/bin/Main.exe -profile -lang js -f tests/PERF/ellipsis-js.sgrep tests/PERF/three.js

profiling result

Main total : 2.151 sec 1 count
Parse_js.parse : 1.236 sec 1 count
Parse_js.tokens : 0.398 sec 2 count
Semgrep.check : 0.389 sec 1 count
Semgrep.match_sts_sts : 0.239 sec 16824 count
Parser_js.module_item : 0.192 sec 609 count

(was a lot more before)


… env var

Those options are useful to debug or profile semgrep-core.
Using the environment variable allows us to pass options to semgrep-core
without having to modify semgrep-python.

Test plan:
pad@yrax:~/github/semgrep/semgrep$ export SEMGREP_CORE_DEBUG=1
pad@yrax:~/github/semgrep/semgrep$ export SEMGREP_CORE_PROFILE=1
pad@yrax:~/github/semgrep/semgrep$ pipenv run semgrep -f ../semgrep-core/tests/PERF/ajin.yaml ../semgrep-core/tests/PERF/three.js
Debug mode On
Executed as: semgrep-core -lang javascript -rules_file /tmp/tmpqfdc1lug -j 8 ../semgrep-core/tests/PERF/three.js
Profile mode On
disabling -j when in profiling mode
PARSING: ../semgrep-core/tests/PERF/three.js
saving rules file for debugging in: /tmp/semgrep_core_rule-4e8afb.yaml

profiling result

Main total : 1.625 sec 1 count
Parse_js.parse : 0.724 sec 1 count
Semgrep.check : 0.568 sec 1 count
Semgrep.match_sts_sts : 0.333 sec 185064 count
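For readers unfamiliar with the approach, here is a minimal sketch of how a binary might turn the two environment variables above into flags. The helper names (`flag_enabled`, `init_from_env`) and the rule that an empty value or `0` means "off" are assumptions made for illustration; the actual semgrep-core code may parse the variables differently.

```ocaml
(* Hypothetical helper: decide whether a boolean option is enabled.
   [lookup] is a parameter so the logic can be exercised without touching
   the real process environment; it defaults to Sys.getenv_opt. *)
let flag_enabled ?(lookup = Sys.getenv_opt) (name : string) : bool =
  match lookup name with
  | None | Some "" | Some "0" -> false   (* assumption: these mean "off" *)
  | Some _ -> true

let debug = ref false
let profile = ref false

(* Read the two variables named in the commit message. *)
let init_from_env ?lookup () =
  debug := flag_enabled ?lookup "SEMGREP_CORE_DEBUG";
  profile := flag_enabled ?lookup "SEMGREP_CORE_PROFILE"
```

Calling `init_from_env ()` at startup would set both references from the real environment, which is what lets the wrapper stay unchanged.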

aryx · 2020-05-28T13:09:08Z

No idea what those tox-tests github actions are.

aryx · 2020-05-28T13:10:22Z

semgrep-core/bin/Main.ml

+    pr2 "Debug mode On";
+    pr2 (spf "Executed as: %s" (Sys.argv|>Array.to_list|> String.concat " "));
+  end;
+  if !profile then begin


@rcoh maybe this was the issue. Maybe you were running the ocaml programs with profiling information but
because of -j the job was actually done in another process ...

Yeah that occurred to me after I read that multi threading in OCaml is actually multiprocessing

Well, OCaml has concurrent threads (Xavier Leroy, the author of OCaml, actually added the first POSIX C thread library for Linux a long time ago, and he did it because he wanted threads in OCaml), but it does not yet have multi-core threads. There is ongoing work to support that.
Note that Python/PHP/Ruby/... do not have multi-core threads either.

This allows us to see which rules take the most time.
Note that when called from semgrep-python, the rule ids are not very
readable, but the generated file is saved in /tmp/ so you can look up
which rule each id corresponds to.

Test plan:
export SEMGREP_CORE_PROFILE=1
export SEMGREP_CORE_DEBUG=1
pad@yrax:~/github/semgrep/semgrep$ pipenv run semgrep -f ../semgrep-core/tests/PERF/ajin.yaml ../semgrep-core/tests/PERF/three.js
Debug mode On
Executed as: semgrep-core -lang javascript -rules_file /tmp/tmpy5pzp3p_ -j 8 ../semgrep-core/tests/PERF/three.js
Profile mode On
disabling -j when in profiling mode
PARSING: ../semgrep-core/tests/PERF/three.js
saving rules file for debugging in: /tmp/semgrep_core_rule-97ae74.yaml

profiling result

Main total : 1.975 sec 1 count
Parse_js.parse : 0.828 sec 1 count
Semgrep.check : 0.791 sec 1 count
Semgrep.match_sts_sts : 0.559 sec 185064 count
Parse_js.tokens : 0.335 sec 12 count
Parser_js.module_item : 0.083 sec 609 count
Normalize_ast.normalize : 0.058 sec 1 count
Common.=~ : 0.043 sec 51044 count
Common.full_charpos_to_pos_large : 0.042 sec 12 count
rule:0..0.10 : 0.035 sec 16824 count
rule:0..0.9 : 0.031 sec 16824 count
rule:0..0.8 : 0.030 sec 16824 count
rule:0..0.7 : 0.029 sec 16824 count
rule:0..0.6 : 0.029 sec 16824 count
rule:0..0.5 : 0.029 sec 16824 count
rule:0..0.4 : 0.029 sec 16824 count
rule:0..0.0 : 0.029 sec 16824 count
rule:0..0.2 : 0.029 sec 16824 count
rule:0..0.1 : 0.029 sec 16824 count
rule:0..0.3 : 0.029 sec 16824 count
file_type_of_file : 0.000 sec 2 count
Semgrep.apply_equivalences : 0.000 sec 11 count
Common.sort_by_xxx : 0.000 sec 11 count
Unix.stat : 0.000 sec 12 count

mschwager · 2020-05-28T15:18:05Z

Test plan:
time pipenv run semgrep -f ~/semgrep/tests/PERF/ajin.yaml ~/semgrep/tests/PERF/three.js
3.2s

(was 3min before)

This is amazing!

It looks like a lot of perf tests were added, but none confirming existing behavior. There are a lot of larger changes here. Are we confident there are no behavioral regressions without additional functionality tests?

mjambon · 2020-05-28T18:36:25Z

semgrep-core/matching/Generic_vs_generic.ml

+   * factorize, but I prefer to control and limit the number of places
+   * where we call m_stmts_deep. Once we call m_list__m_stmt, we
+   * are in a simpler world where the list of stmts will not grow.
+   *)


+1 for explaining the context and the intent

nbrahms · 2020-05-28T15:56:40Z

semgrep-core/matching/Semgrep_generic.ml

-  Common.profile_code "Sgrep_generic.check" (
-    fun () -> check2 ~hook rules equivs file lang
-  )
+let check ~hook a b c d e =


FWIW, I prefer having the labeled arguments here 🤷

True, it's just that those Common.profile_code calls are hacks because there's no super easy way to profile code. In theory I should just run ocamlprof and get nice stats, but I like the focused profile that Common.profile_code allows. I want to minimize the amount of modifications I have to make to the program to support this non-functional property (profiling), so I do that. A better way would probably be to use the recent OCaml attributes, with something like [@@ profile] let check a b c d = ... Maybe @mjambon knows a good ppx rewriter that supports that.
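To make the trade-off in this thread concrete, here is a rough sketch of a label-based profiling wrapper that still keeps the labeled argument. `profile_code` below is a hypothetical stand-in written for this example, not the real `Common.profile_code`, and `check_impl` is an invented placeholder for the unprofiled function.

```ocaml
(* Accumulated CPU time and call count per label. *)
let stats : (string, float * int) Hashtbl.t = Hashtbl.create 16

(* Hypothetical stand-in for a helper such as Common.profile_code. *)
let profile_code (label : string) (f : unit -> 'a) : 'a =
  let start = Sys.time () in
  let record () =
    let elapsed = Sys.time () -. start in
    let total, count =
      Option.value (Hashtbl.find_opt stats label) ~default:(0.0, 0) in
    Hashtbl.replace stats label (total +. elapsed, count + 1)
  in
  (* Fun.protect runs [record] even if [f] raises. *)
  Fun.protect ~finally:record f

(* Placeholder for the real work; it keeps its labeled argument. *)
let check_impl ~hook rules file =
  List.iter hook rules;
  List.length rules + String.length file

(* The wrapper forwards the labeled argument instead of using a b c d e. *)
let check ~hook rules file =
  profile_code "Semgrep.check" (fun () -> check_impl ~hook rules file)

(* Print one line per label, loosely imitating the report format above. *)
let print_stats () =
  Hashtbl.iter
    (fun label (total, count) ->
      Printf.printf "%-30s: %.3f sec %d count\n" label total count)
    stats
```

The cost of keeping the labels is that the wrapper has to repeat the full signature, which is the extra editing the reply above wants to avoid.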

nbrahms · 2020-05-28T15:58:27Z

semgrep-core/matching/Semgrep_generic.ml

  let env = Matching_generic.empty_environment () in
  GG.m_expr pattern e env
 (*e: function [[Semgrep_generic.match_e_e]] *)
+let match_e_e ruleid a b =


What about naming the wrapper match_e_e_profiled (etc. for other profiled calls)?

I know I was rather confused by the ...2 naming scheme when I first met this code base.

nbrahms · 2020-05-28T16:41:35Z

semgrep-core/matching/Generic_vs_generic.ml

+  | [], [] ->
+      return ()
+  (*s: [[Generic_vs_generic.m_list__m_stmt()]] empty list vs list case *)
+  (* less-is-ok:
+   * it's ok to have statements after in the concrete code as long as we
+   * matched all the statements in the pattern (there is an implicit
+   * '...' at the end, in addition to implicit '...' at the beginning
+   * handled by kstmts calling the pattern for each subsequences).
+   * TODO: sgrep_generic though then display the whole sequence as a match
+   * instead of just the relevant part.
+   *)
+  | [], _::_ ->
+      return ()
+  (*e: [[Generic_vs_generic.m_list__m_stmt()]] empty list vs list case *)


Is the motivation to separate these two cases for documentation?

(vs. | [], _ ->)

It's just more precise. There is already a case above for [], [], so [], _ below would be more general than it needs to be.
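As an illustration of the two cases discussed above, the following toy function (hypothetical, not taken from Generic_vs_generic.ml) matches a pattern list against a prefix of a concrete list. Writing `[], _ :: _` rather than `[], _` keeps the cases disjoint, so each arm documents exactly one situation.

```ocaml
(* Toy "less-is-ok" matching: the pattern must match a prefix of the
   concrete list; leftover concrete items are acceptable. *)
let rec match_prefix (pattern : 'a list) (concrete : 'a list) : bool =
  match pattern, concrete with
  | [], [] -> true              (* both lists exhausted *)
  | [], _ :: _ -> true          (* less-is-ok: extra concrete statements *)
  | _ :: _, [] -> false         (* pattern wants more than the code has *)
  | p :: ps, c :: cs -> p = c && match_prefix ps cs
```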

nbrahms · 2020-05-28T18:47:54Z

semgrep-core/matching/Generic_vs_generic.ml

+    (* let's first try the without going deep *)
+     (
+      (* can match nothing *)
+      (m_list__m_stmt xsa (xb::xsb)) >||>


how does one get documentation on >||>?

>>= seems common enough that it's nicer than Monad.bind, but I'm struggling to grok >||> and >!>. Maybe use the Googleable version instead?

FWIW:

http://symbolhound.com/?q=%3E%7C%7C%3E+ocaml

http://symbolhound.com/?q=%3E%21%3E+ocaml

both return no results.

It's defined in Matching_generic.ml, which is 'open'ed at the beginning of the file.
Neither >>= nor >||> are predefined OCaml operators. I've defined those operators
for the purpose of the matching process.
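For readers who have not seen such operators before, here is one plausible shape for them, assuming a matcher is a function from an environment to the list of environments under which matching succeeds. These are illustrative definitions written for this note, not a copy of Matching_generic.ml; the types `env` and `matcher` and the helper `bind` are invented here.

```ocaml
(* A matcher maps metavariable bindings to every environment under which
   matching succeeds; the empty list means failure. *)
type env = (string * string) list
type matcher = env -> env list

let return () : matcher = fun env -> [env]
let fail () : matcher = fun _ -> []

(* Sequencing: run [m], then continue with [f ()] from each result. *)
let ( >>= ) (m : matcher) (f : unit -> matcher) : matcher =
  fun env -> List.concat_map (fun env' -> f () env') (m env)

(* Alternatives: keep the results of both branches. *)
let ( >||> ) (m1 : matcher) (m2 : matcher) : matcher =
  fun env -> m1 env @ m2 env

(* Fallback: try [m2] only when [m1] produced nothing. *)
let ( >!> ) (m1 : matcher) (m2 : unit -> matcher) : matcher =
  fun env -> match m1 env with [] -> m2 () env | results -> results

(* Bind metavariable [name] to [value], or check consistency if bound. *)
let bind name value : matcher = fun env ->
  match List.assoc_opt name env with
  | None -> [(name, value) :: env]
  | Some v when v = value -> [env]
  | Some _ -> []
```

Under these definitions, `a >||> b` explores both alternatives, while `a >!> (fun () -> b)` only reaches `b` when `a` fails, which is the distinction that matters in the deep-matching code quoted above.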

nbrahms · 2020-05-28T18:49:57Z

Just style thoughts from me, feel free to ignore them.

brendongo · 2020-05-28T20:21:45Z

Pulled in changes from develop to get tests passing

brendongo

Do we really want 81000 lines to be added?

aryx · 2020-05-29T09:43:21Z

Yes, I think it's ok to have those 81000 lines; it's convenient to sometimes test performance locally without having to use another repo. Also it boosts my github stats!


Pattern `... bar()` would behave strangely because it would only try to match deeply if it did not match anything non-deeply. So depending on how the target code looked, `... bar()` would match or not match inside e.g. an `if` statement. This was confusing and not a great property for Semgrep to have.

The regression was introduced in PR #852 as part of a set of optimizations done back in 0.9.0 (!), but at present reverting this one thing does not seem to have any negative perf impact.

Follows: c1ca429 ("Optimize deep statement matching (#852)")
Closes semgrep/semgrep-rules#660
Closes PA-2992

test plan:
make test # new tests

Also, compared against develop running p/default on 32 repos from stress-test-monorepo, and no meaningful slowdown or increase in memory usage was observed.
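The behavior described in this commit message can be reproduced on a toy model. The sketch below is an assumption-laden simplification written for this note (the types `stmt` and `pat`, and the functions `match_stmts` and `matches`, are all invented); it only illustrates the difference between "go deep only when the shallow attempt fails" and "always try going deep".

```ocaml
(* Toy AST: a call tagged with its line number, or an `if` body. *)
type stmt = Call of string * int | If of stmt list
type pat = Ellipsis | PCall of string

let sub_stmts = function If body -> body | Call _ -> []

(* Each match is the list of line numbers matched by the PCall items. *)
let rec match_stmts ~always_deep pattern code : int list list =
  match pattern, code with
  | [], _ -> [ [] ]
  | Ellipsis :: rest, [] -> match_stmts ~always_deep rest []
  | Ellipsis :: rest, c :: cs ->
      let shallow =
        match_stmts ~always_deep rest code      (* "..." matches nothing *)
        @ match_stmts ~always_deep pattern cs   (* "..." absorbs [c] *)
      in
      let deep () =
        match sub_stmts c with
        | [] -> []
        | subs -> match_stmts ~always_deep pattern (subs @ cs)
      in
      if always_deep then shallow @ deep ()
      else if shallow <> [] then shallow        (* old: deep only as fallback *)
      else deep ()
  | PCall f :: rest, Call (g, line) :: cs when f = g ->
      List.map (fun lines -> line :: lines) (match_stmts ~always_deep rest cs)
  | PCall _ :: _, _ -> []

(* Deduplicate, since going deep can rediscover a shallow match. *)
let matches ~always_deep pattern code =
  List.sort_uniq compare (match_stmts ~always_deep pattern code)

let pattern = [Ellipsis; PCall "bar"]
let nested_only = [If [Call ("bar", 2)]]
let nested_and_flat = [If [Call ("bar", 2)]; Call ("bar", 4)]
```

In this model, with `~always_deep:false` the call on line 2 is found in `nested_only` but not in `nested_and_flat`, because the shallow match on line 4 suppresses the deep attempt. With `~always_deep:true` both lines are reported for `nested_and_flat`.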


aryx added 2 commits May 28, 2020 13:54

aryx requested review from nbrahms, brendongo, DrewDennison, ievans, mjambon and rcoh May 28, 2020 13:07

aryx commented May 28, 2020


aryx added 2 commits May 28, 2020 15:23

* docs/development.md: improve doc

3d28f8f

mjambon reviewed May 28, 2020


rcoh approved these changes May 28, 2020


nbrahms reviewed May 28, 2020


brendongo added this to the 0.9.0 milestone May 28, 2020

Merge branch 'develop' into optimize_deep_stmt

868cc94

brendongo requested changes May 29, 2020


aryx merged commit c1ca429 into develop May 29, 2020

aryx deleted the optimize_deep_stmt branch May 29, 2020 09:42

IagoAbal added a commit that referenced this pull request Aug 10, 2023

matching: ellipsis: Always try going deep

703d2a4

TODO: Test performance Follows: c1ca429 ("Optimize deep statement matching (#852)") test plan: TODO

IagoAbal mentioned this pull request Aug 10, 2023

matching: ellipsis: Always try going deep #8440

Merged

5 tasks

IagoAbal added a commit that referenced this pull request Aug 10, 2023

matching: ellipsis: Always try going deep

d194e4b

TODO: Test performance Follows: c1ca429 ("Optimize deep statement matching (#852)") test plan: TODO

IagoAbal added a commit that referenced this pull request Aug 10, 2023

matching: ellipsis: Always try going deep

cd13296

TODO: Test performance Follows: c1ca429 ("Optimize deep statement matching (#852)") test plan: TODO

IagoAbal added a commit that referenced this pull request Aug 14, 2023

matching: ellipsis: Always try going deep

336feb1

Follows: c1ca429 ("Optimize deep statement matching (#852)") test plan: make test # new tests

IagoAbal added a commit that referenced this pull request Aug 14, 2023

matching: ellipsis: Always try going deep

ab7136e

Follows: c1ca429 ("Optimize deep statement matching (#852)") test plan: make test # new tests

IagoAbal added a commit that referenced this pull request Aug 15, 2023

matching: ellipsis: Always try going deep

09ad246

Follows: c1ca429 ("Optimize deep statement matching (#852)") test plan: make test # new tests

IagoAbal added a commit that referenced this pull request Aug 15, 2023

matching: ellipsis: Always try going deep

652dbe0

Follows: c1ca429 ("Optimize deep statement matching (#852)") test plan: make test # new tests

IagoAbal added a commit that referenced this pull request Aug 15, 2023

matching: ellipsis: Always try going deep

1f2b5c9

Follows: c1ca429 ("Optimize deep statement matching (#852)") test plan: make test # new tests

IagoAbal added a commit that referenced this pull request Aug 16, 2023

matching: ellipsis: Always try going deep

161f33d

Follows: c1ca429 ("Optimize deep statement matching (#852)") test plan: make test # new tests

IagoAbal added a commit that referenced this pull request Aug 17, 2023

matching: ellipsis: Always try going deep

a00a6d8

Follows: c1ca429 ("Optimize deep statement matching (#852)") test plan: make test # new tests
