New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize deep statement matching #852
Conversation
This should close #827 and #664 The code to handle foo(); ...; bar(); was very naive and was doing lots of useless work. This fixes that. Test plan: time pipenv run semgrep -f ~/semgrep/tests/PERF/ajin.yaml ~/semgrep/tests/PERF/three.js 3.2s (was 3min before) + /home/pad/github/semgrep/semgrep-core/_build/default/bin/Main.exe -profile -lang py -f tests/PERF/ellipsis-python.sgrep tests/PERF/my_first_calculator.py --------------------- profiling result --------------------- Main total : 1.670 sec 1 count Parse_python.parse : 0.918 sec 1 count Parse_python.tokens : 0.525 sec 2 count Semgrep.check : 0.458 sec 1 count Parser_python.main : 0.277 sec 1 count Semgrep.match_sts_sts : 0.186 sec 41627 count (was 85sec before) + /home/pad/github/semgrep/semgrep-core/_build/default/bin/Main.exe -profile -lang js -f tests/PERF/ellipsis-js.sgrep tests/PERF/three.js --------------------- profiling result --------------------- Main total : 2.151 sec 1 count Parse_js.parse : 1.236 sec 1 count Parse_js.tokens : 0.398 sec 2 count Semgrep.check : 0.389 sec 1 count Semgrep.match_sts_sts : 0.239 sec 16824 count Parser_js.module_item : 0.192 sec 609 count (was a lot more before)
… env var Those options are useful to debug or profile semgrep-core. Using the environment variable allows us to pass options to semgrep-core without having to modify semgrep-python. Test plan: pad@yrax:~/github/semgrep/semgrep$ export SEMGREP_CORE_DEBUG=1 pad@yrax:~/github/semgrep/semgrep$ export SEMGREP_CORE_PROFILE=1 pad@yrax:~/github/semgrep/semgrep$ pipenv run semgrep -f ../semgrep-core/tests/PERF/ajin.yaml ../semgrep-core/tests/PERF/three.js Debug mode On Executed as: semgrep-core -lang javascript -rules_file /tmp/tmpqfdc1lug -j 8 ../semgrep-core/tests/PERF/three.js Profile mode On disabling -j when in profiling mode PARSING: ../semgrep-core/tests/PERF/three.js saving rules file for debugging in: /tmp/semgrep_core_rule-4e8afb.yaml --------------------- profiling result --------------------- Main total : 1.625 sec 1 count Parse_js.parse : 0.724 sec 1 count Semgrep.check : 0.568 sec 1 count Semgrep.match_sts_sts : 0.333 sec 185064 count
No idea what those tox-tests github actions are. |
pr2 "Debug mode On"; | ||
pr2 (spf "Executed as: %s" (Sys.argv|>Array.to_list|> String.concat " ")); | ||
end; | ||
if !profile then begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rcoh maybe this was the issue. Maybe you were running the ocaml programs with profiling information but
because of -j the job was actually done in another process ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that occurred to me after I read that multi threading in OCaml is actually multiprocessing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, OCaml has concurrent threads (Xavier Leroy the author of OCaml actually added the first POSIX C thead library for Linux a long time ago, and he did it because he wanted threads in OCaml), but it does not have yet multi-core threads. There is work ongoing to suppor that.
Note that neither Python/PHP/Ruby/... have multi-core threads either.
This allows to see which rules take the most time. Note that when called from semgrep-python, the rule id are not very readable, but the generated file is saved in /tmp/ so you can find back what the rule it corresponds to. Test plan: export SEMGREP_CORE_PROFILE=1 export SEMGREP_CORE_DEBUG=1 pad@yrax:~/github/semgrep/semgrep$ pipenv run semgrep -f ../semgrep-core/tests/PERF/ajin.yaml ../semgrep-core/tests/PERF/three.js Debug mode On Executed as: semgrep-core -lang javascript -rules_file /tmp/tmpy5pzp3p_ -j 8 ../semgrep-core/tests/PERF/three.js Profile mode On disabling -j when in profiling mode PARSING: ../semgrep-core/tests/PERF/three.js saving rules file for debugging in: /tmp/semgrep_core_rule-97ae74.yaml --------------------- profiling result --------------------- Main total : 1.975 sec 1 count Parse_js.parse : 0.828 sec 1 count Semgrep.check : 0.791 sec 1 count Semgrep.match_sts_sts : 0.559 sec 185064 count Parse_js.tokens : 0.335 sec 12 count Parser_js.module_item : 0.083 sec 609 count Normalize_ast.normalize : 0.058 sec 1 count Common.=~ : 0.043 sec 51044 count Common.full_charpos_to_pos_large : 0.042 sec 12 count rule:0..0.10 : 0.035 sec 16824 count rule:0..0.9 : 0.031 sec 16824 count rule:0..0.8 : 0.030 sec 16824 count rule:0..0.7 : 0.029 sec 16824 count rule:0..0.6 : 0.029 sec 16824 count rule:0..0.5 : 0.029 sec 16824 count rule:0..0.4 : 0.029 sec 16824 count rule:0..0.0 : 0.029 sec 16824 count rule:0..0.2 : 0.029 sec 16824 count rule:0..0.1 : 0.029 sec 16824 count rule:0..0.3 : 0.029 sec 16824 count file_type_of_file : 0.000 sec 2 count Semgrep.apply_equivalences : 0.000 sec 11 count Common.sort_by_xxx : 0.000 sec 11 count Unix.stat : 0.000 sec 12 count
This is amazing! It looks like a lot of perf tests were added, but none confirming existing behavior. There are a lot of larger changes here, are we confident there's no behavioral regressions without additional functionality tests? |
* factorize, but I prefer to control and limit the number of places | ||
* where we call m_stmts_deep. Once we call m_list__m_stmt, we | ||
* are in a simpler world where the list of stmts will not grow. | ||
*) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for explaining the context and the intent
Common.profile_code "Sgrep_generic.check" ( | ||
fun () -> check2 ~hook rules equivs file lang | ||
) | ||
let check ~hook a b c d e = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I prefer having the labeled arguments here 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, it's just that those Common.profile_code are just hacks because there's no super easy way to profile code. In theory I should just run ocamlprof and get nice stats, but I like the focused profile that allows Common.profile_code. Then I want to mimimize the amount of modifications I have to do to the program to support this non-functional property (profiling), so I do that. A better way probably would be to use the recent OCaml attribute to do that, have something like [@@ profile] let check a b c d = ... Maybe @mjambon knows a good ppx rewriter that support that.
let env = Matching_generic.empty_environment () in | ||
GG.m_expr pattern e env | ||
(*e: function [[Semgrep_generic.match_e_e]] *) | ||
let match_e_e ruleid a b = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about naming the wrapper match_e_e_profiled
(et c. for other profiled calls)?
I know I was rather confused by the ...2
naming scheme when I first met this code base.
| [], [] -> | ||
return () | ||
(*s: [[Generic_vs_generic.m_list__m_stmt()]] empty list vs list case *) | ||
(* less-is-ok: | ||
* it's ok to have statements after in the concrete code as long as we | ||
* matched all the statements in the pattern (there is an implicit | ||
* '...' at the end, in addition to implicit '...' at the beginning | ||
* handled by kstmts calling the pattern for each subsequences). | ||
* TODO: sgrep_generic though then display the whole sequence as a match | ||
* instead of just the relevant part. | ||
*) | ||
| [], _::_ -> | ||
return () | ||
(*e: [[Generic_vs_generic.m_list__m_stmt()]] empty list vs list case *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the motivation to separate these two cases for documentation?
(vs. | [], _ ->
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just more precise. there is already a case above for [], [], so [], _ below would be more general that it needs to be.
(* let's first try the without going deep *) | ||
( | ||
(* can match nothing *) | ||
(m_list__m_stmt xsa (xb::xsb)) >||> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does one get documentation on >||>
?
>>=
seems common enough that it's nicer than Monad.bind
, but I'm struggling to grok >||>
and >!>
. Maybe use the Googleable version instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW:
http://symbolhound.com/?q=%3E%7C%7C%3E+ocaml
http://symbolhound.com/?q=%3E%21%3E+ocaml
both return no results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
following
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's defined in Matching_generic.ml, which is 'open'ed at the beginning of the file.
Neither >>= nor >||> are predefined OCaml operators. I've defined those operators
for the purpose of the matching process.
Just style thoughts from me, feel free to ignore them. |
Pulled in changes from develop to get tests passing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really want 81000 lines to be added?
Yes, I think it's ok to have those 81000 lines; it's convenient to test sometimes performance locally without having to use another repo. Also it boosts my github stats! |
Pattern `... bar()` would behave strangely because it would only try to match deeply if it did not match anything non-deeply. So depending on how the target code looked, `... bar()` would match or not match inside e.g. an `if` statement. This was confusing and not a great property for Semgrep to have. The regression was introduced in PR #852 as part of a set of optimizations done back in 0.9.0 (!), but at present reverting this one thing does not seem to have any negative perf impact. Follows: c1ca429 ("Optimize deep statement matching (#852)") Closes semgrep/semgrep-rules#660 Closes PA-2992 test plan: make test # new tests Also, compared against develop running p/default on 32 repos from stress-test-monorepo, and no meaningful slowdown or increase in memory usage was observed.
Pattern `... bar()` would behave strangely because it would only try to match deeply if it did not match anything non-deeply. So depending on how the target code looked, `... bar()` would match or not match inside e.g. an `if` statement. This was confusing and not a great property for Semgrep to have. The regression was introduced in PR semgrep#852 as part of a set of optimizations done back in 0.9.0 (!), but at present reverting this one thing does not seem to have any negative perf impact. Follows: c1ca429 ("Optimize deep statement matching (semgrep#852)") Closes semgrep/semgrep-rules#660 Closes PA-2992 test plan: make test # new tests Also, compared against develop running p/default on 32 repos from stress-test-monorepo, and no meaningful slowdown or increase in memory usage was observed.
This should close #827 and #664
The code to handle foo(); ...; bar(); was very naive
and was doing lots of useless work. This fixes that.
Test plan:
time pipenv run semgrep -f ~/semgrep/tests/PERF/ajin.yaml ~/semgrep/tests/PERF/three.js
3.2s
(was 3min before)
profiling result
Main total : 1.670 sec 1 count
Parse_python.parse : 0.918 sec 1 count
Parse_python.tokens : 0.525 sec 2 count
Semgrep.check : 0.458 sec 1 count
Parser_python.main : 0.277 sec 1 count
Semgrep.match_sts_sts : 0.186 sec 41627 count
(was 85sec before)
profiling result
Main total : 2.151 sec 1 count
Parse_js.parse : 1.236 sec 1 count
Parse_js.tokens : 0.398 sec 2 count
Semgrep.check : 0.389 sec 1 count
Semgrep.match_sts_sts : 0.239 sec 16824 count
Parser_js.module_item : 0.192 sec 609 count
(was a lot more before)