New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
C++ support #1952
Comments
Thank you @hexagonrecursion for the tracking ticket and request! We'll look for more +1s here and start the conversation 👍 |
cc @aryx @mjambon and also see some prior discussion on HN: https://news.ycombinator.com/item?id=24932891 |
+1 |
Do you have example of codebase you would like to analyze? A big problem is that we want to parse the code as-is, but |
|
Yes for mozilla it is not gonna happen ... They're using weird macros everywhere that makes the code unparsable. |
This will help #1952 test plan: pad@yrax:~/work/lang-cpp/Cataclysm-DDA$ yy -lang cpp -test_parse_tree_sitter . + /home/pad/semgrep/_build/default/cli/Main.exe -lang cpp -test_parse_tree_sitter . 11 / 852/OVERLOAD/work/lang-cpp/Cataclysm-DDA/src/activity_actor.cpp: exn = Tree_sitter_run.Tree_sitter_error.Error(_) ... NB total files = 852; NB total lines = 474254; perfect = 662; pbs = 190; timeout = 0; =========> 77% nb good = 174878, nb passed = 0 =========> 0.000000% nb good = 174878, nb bad = 299376 =========> 36.874333% For comparison, the C++/C/cpp parser in pfff is doing: NB total files = 852; NB total lines = 473336; perfect = 110; pbs = 742; timeout = 0; =========> 12% nb good = 160267, nb passed = 187 =========> 0.116680% nb good = 160267, nb bad = 313069 =========> 33.859035% pad@yrax:~/work/lang-cpp/Cataclysm-DDA$
The tree-sitter c++ parser can parse 93% of Cataclysm, which is not bad, but there's lots of parse errors on code I don't really understand:
|
This will help #1952 test plan: pad@yrax:~/work/lang-cpp/Cataclysm-DDA$ yy -lang cpp -test_parse_tree_sitter . + /home/pad/semgrep/_build/default/cli/Main.exe -lang cpp -test_parse_tree_sitter . 11 / 852/OVERLOAD/work/lang-cpp/Cataclysm-DDA/src/activity_actor.cpp: exn = Tree_sitter_run.Tree_sitter_error.Error(_) ... NB total files = 852; NB total lines = 474254; perfect = 662; pbs = 190; timeout = 0; =========> 77% nb good = 174878, nb passed = 0 =========> 0.000000% nb good = 174878, nb bad = 299376 =========> 36.874333% For comparison, the C++/C/cpp parser in pfff is doing: NB total files = 852; NB total lines = 473336; perfect = 110; pbs = 742; timeout = 0; =========> 12% nb good = 160267, nb passed = 187 =========> 0.116680% nb good = 160267, nb bad = 313069 =========> 33.859035% pad@yrax:~/work/lang-cpp/Cataclysm-DDA$
This is c++11 https://en.cppreference.com/w/cpp/language/user_literal |
Could you create an issue on the tree-sitter C++ parser, given that's the one we are using for C++: |
Just listening to Clint Gibler's talk about semgrep at Empire Hacking. I would be interested in using semgrep with the osquery codebase. We don't use macros very extensively there. Would this be feasible? |
I've just tried the tree-sitter-cpp parser on osquery and it just parses 75% of the codebase.
problem on size_t{10}
problem on HIDDEN_FLAG macro. It problable needs quite some work to extend https://github.com/tree-sitter/tree-sitter-cpp to handle all those constructs An alternative is to rely on the clang AST, but this requires the source to be compilable and to instruct semgrep where to find the header files (-I) and possible macros (-D) to correctly parse the code. This requires lots of work and may be quite slow as clang needs to process all the included files to get all the macros and typedefs etc to be able to parse correctly the code. |
I think in the short term we could spend some time to integrate tree-sitter-cpp in semgrep, to be able to parse and match C++ code, but you will get lower coverage than for the other languages as lots of the C++ code (in osquery case 25%) would not be parsed. |
@aryx thank you for the great feedback! I wonder about that first case -- isn't that standard C++ syntax? Trying to remember what the checks were that I wanted to add to osquery that I thought semgrep could be a good fit for. I will post here if I remember those. |
I don't think you can have toplevel function calls in C or C++, and that's what this code looks like, it looks like a function call, Macros ... |
Parsing C++ is hard. tree-sitter/tree-sitter-cpp#74 http://www.computing.surrey.ac.uk/research/dsrg/fog/FogThesis.pdf#page=375 appendix F LibTooling (an C/C++/ObjectiveC parsing library that uses the same parser as the clang compiler) requires a "compilation database" as input. I assume there is a good reason for that i.e. the parse tree of a c++ file depends on the compiler flags and the current working directory. https://medium.com/@mujjingun_23509/full-proof-that-c-grammar-is-undecidable-34e22dd8b664 |
It's worth noting that if you can compile the project, Bear makes it easy to generate the compilation database. You basically just do So I don't think having to have a compilation database would be a hard requirement to satisfy. I think it's probably the way to go. |
How good is Bear now? A few years ago I had a quick look and it could handle only a few projects. Can it now really be used on any project? Is there examples out there of complex projects using complex makefiles where Bear does a good job? Can it be used for the mozilla codebase for example? Does clang return any error using a Bear generated database? |
From my experience, Bear is pretty robust now. I've mostly used Bear/compilation databases with varied vim/nvim tooling, e.g. the ccls C/C++/Obj-C Language Server can optionally use compilation databases and they recommend it for larger projects. I haven't tried it on a gigantic, complex project like the Mozilla codebase. But then again, my ambitions with semgrep and C++ are lesser. :) It would be sweet if it was robust enough to be applicable to Mozilla's codebase, but the largest pain point for me is when someone throws a moderate amount of C++-isms in a basically-C project, making semgrep completely unable to handle it.
I've personally never encountered this. |
This will help semgrep/semgrep#1952 test plan: make
This is the start of C++ support in Semgrep. This will help #1952 test plan: make
This is the start of C++ support in Semgrep. This will help #1952 test plan: make
Closing this issue since we added support for C++ last year (I forgot to close the issue). |
I searched the issues and did not find a request for C++ support. Now there is one.
The text was updated successfully, but these errors were encountered: