Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C++ support #1952

Closed
hexagonrecursion opened this issue Nov 3, 2020 · 19 comments
Closed

C++ support #1952

hexagonrecursion opened this issue Nov 3, 2020 · 19 comments

Comments

@hexagonrecursion
Copy link

I searched the issues and did not find a request for C++ support. Now there is one.

@dlukeomalley
Copy link
Member

Thank you @hexagonrecursion for the tracking ticket and request! We'll look for more +1s here and start the conversation 👍

@ievans
Copy link
Member

ievans commented Nov 3, 2020

cc @aryx @mjambon and also see some prior discussion on HN: https://news.ycombinator.com/item?id=24932891

@ubimars
Copy link

ubimars commented Nov 9, 2020

+1

@aryx
Copy link
Collaborator

aryx commented Nov 9, 2020

Do you have example of codebase you would like to analyze? A big problem is that we want to parse the code as-is, but
if the C++ code uses heavily cpp macros using weird syntax, we will not be able to parse the code.

@hexagonrecursion
Copy link
Author

@aryx

Do you have example of codebase you would like to analyze? A big problem is that we want to parse the code as-is, but
if the C++ code uses heavily cpp macros using weird syntax, we will not be able to parse the code.

  1. Cataclysm - Dark Days Ahead. A turn-based survival game set in a post-apocalyptic world.
  2. https://hg.mozilla.org/mozilla-central/

@aryx
Copy link
Collaborator

aryx commented Nov 10, 2020

Yes for mozilla it is not gonna happen ... They're using weird macros everywhere that makes the code unparsable.
It's not C++, it's C++_WITH_OUR_OWN_MINI_LANGUAGE_FOR_RESOURCE_MANAGEMENT_AND_OTHER_STUFF.

aryx added a commit that referenced this issue Nov 10, 2020
This will help #1952

test plan:
pad@yrax:~/work/lang-cpp/Cataclysm-DDA$ yy -lang cpp -test_parse_tree_sitter .
+ /home/pad/semgrep/_build/default/cli/Main.exe -lang cpp -test_parse_tree_sitter .
11 / 852/OVERLOAD/work/lang-cpp/Cataclysm-DDA/src/activity_actor.cpp: exn = Tree_sitter_run.Tree_sitter_error.Error(_)
...

NB total files = 852; NB total lines = 474254; perfect = 662; pbs = 190; timeout = 0; =========> 77%
nb good = 174878,  nb passed = 0 =========> 0.000000%
nb good = 174878,  nb bad = 299376 =========> 36.874333%

For comparison, the C++/C/cpp parser in pfff is doing:
NB total files = 852; NB total lines = 473336; perfect = 110; pbs = 742; timeout = 0; =========> 12%
nb good = 160267,  nb passed = 187 =========> 0.116680%
nb good = 160267,  nb bad = 313069 =========> 33.859035%
pad@yrax:~/work/lang-cpp/Cataclysm-DDA$
@aryx
Copy link
Collaborator

aryx commented Nov 10, 2020

The tree-sitter c++ parser can parse 93% of Cataclysm, which is not bad, but there's lots of parse errors on code I don't really understand:

Unrecognized construct
Error: File Cataclysm-DDA/tests/vehicle_part_test.cpp, line 23, characters 51-52:
#include "vpart_range.h"

static time_point midnight = calendar::turn_zero + 0_hours;
                                                   ^       
static time_point midday = calendar::turn_zero + 12_hours;

aryx added a commit that referenced this issue Nov 10, 2020
This will help #1952

test plan:
pad@yrax:~/work/lang-cpp/Cataclysm-DDA$ yy -lang cpp -test_parse_tree_sitter .
+ /home/pad/semgrep/_build/default/cli/Main.exe -lang cpp -test_parse_tree_sitter .
11 / 852/OVERLOAD/work/lang-cpp/Cataclysm-DDA/src/activity_actor.cpp: exn = Tree_sitter_run.Tree_sitter_error.Error(_)
...

NB total files = 852; NB total lines = 474254; perfect = 662; pbs = 190; timeout = 0; =========> 77%
nb good = 174878,  nb passed = 0 =========> 0.000000%
nb good = 174878,  nb bad = 299376 =========> 36.874333%

For comparison, the C++/C/cpp parser in pfff is doing:
NB total files = 852; NB total lines = 473336; perfect = 110; pbs = 742; timeout = 0; =========> 12%
nb good = 160267,  nb passed = 187 =========> 0.116680%
nb good = 160267,  nb bad = 313069 =========> 33.859035%
pad@yrax:~/work/lang-cpp/Cataclysm-DDA$
@aryx
Copy link
Collaborator

aryx commented Nov 13, 2020

Could you create an issue on the tree-sitter C++ parser, given that's the one we are using for C++:
https://github.com/tree-sitter/tree-sitter-cpp

@zwass
Copy link

zwass commented Dec 9, 2020

Just listening to Clint Gibler's talk about semgrep at Empire Hacking. I would be interested in using semgrep with the osquery codebase. We don't use macros very extensively there. Would this be feasible?

@aryx
Copy link
Collaborator

aryx commented Dec 9, 2020

I've just tried the tree-sitter-cpp parser on osquery and it just parses 75% of the codebase.
Here are a few parsing errors reported by the current version of tree-sitter-cpp:

File utils/json/tests/json.cpp, line 202, characters 22-28:

  auto doc2 = JSON::newObject();
  doc2.add("new_key", size_t{10});
  doc2.addCopy("new_key1", "new_value");

problem on size_t{10}

File logger/logger.cpp, line 79, character 0 to line 84, character 39:
 * Within the daemon, logs are drained every 3 seconds.
 */
HIDDEN_FLAG(bool,
            logger_status_sync,
            false,
            "Always send status logs synchronously");

DECLARE_bool(enable_numeric_monitoring);

problem on HIDDEN_FLAG macro.

It problable needs quite some work to extend https://github.com/tree-sitter/tree-sitter-cpp to handle all those constructs
to get a good parsing percentage on osquery.

An alternative is to rely on the clang AST, but this requires the source to be compilable and to instruct semgrep where to find the header files (-I) and possible macros (-D) to correctly parse the code. This requires lots of work and may be quite slow as clang needs to process all the included files to get all the macros and typedefs etc to be able to parse correctly the code.

@aryx
Copy link
Collaborator

aryx commented Dec 9, 2020

I think in the short term we could spend some time to integrate tree-sitter-cpp in semgrep, to be able to parse and match C++ code, but you will get lower coverage than for the other languages as lots of the C++ code (in osquery case 25%) would not be parsed.

@zwass
Copy link

zwass commented Dec 10, 2020

@aryx thank you for the great feedback! I wonder about that first case -- isn't that standard C++ syntax? Trying to remember what the checks were that I wanted to add to osquery that I thought semgrep could be a good fit for. I will post here if I remember those.

@aryx
Copy link
Collaborator

aryx commented Dec 10, 2020

I don't think you can have toplevel function calls in C or C++, and that's what this code looks like, it looks like a function call,
when really it's a macro that expand in a declaration (that you can have at the toplevel in C or c++).

Macros ...

@hexagonrecursion
Copy link
Author

Parsing C++ is hard.

tree-sitter/tree-sitter-cpp#74
tldr: in order to support its primary use case (updating the parse tree in real time as you type) tree-sitter-cpp makes certain sacrifices and accepts certain inaccuracies that may make it a less than ideal backend for semgrep.

http://www.computing.surrey.ac.uk/research/dsrg/fog/FogThesis.pdf#page=375 appendix F
A fairly comprehensive list of C++ ambiguities and other parsing difficulties

LibTooling (an C/C++/ObjectiveC parsing library that uses the same parser as the clang compiler) requires a "compilation database" as input. I assume there is a good reason for that i.e. the parse tree of a c++ file depends on the compiler flags and the current working directory.

https://medium.com/@mujjingun_23509/full-proof-that-c-grammar-is-undecidable-34e22dd8b664
Full Proof that C++ Grammar is Undecidable

@dkasak
Copy link

dkasak commented May 5, 2021

It's worth noting that if you can compile the project, Bear makes it easy to generate the compilation database. You basically just do bear -- cmd_line_to_start_compilation, e.g. bear -- make.

So I don't think having to have a compilation database would be a hard requirement to satisfy. I think it's probably the way to go.

@aryx
Copy link
Collaborator

aryx commented May 5, 2021

How good is Bear now? A few years ago I had a quick look and it could handle only a few projects. Can it now really be used on any project? Is there examples out there of complex projects using complex makefiles where Bear does a good job? Can it be used for the mozilla codebase for example? Does clang return any error using a Bear generated database?

@dkasak
Copy link

dkasak commented May 5, 2021

From my experience, Bear is pretty robust now. I've mostly used Bear/compilation databases with varied vim/nvim tooling, e.g. the ccls C/C++/Obj-C Language Server can optionally use compilation databases and they recommend it for larger projects.

I haven't tried it on a gigantic, complex project like the Mozilla codebase. But then again, my ambitions with semgrep and C++ are lesser. :) It would be sweet if it was robust enough to be applicable to Mozilla's codebase, but the largest pain point for me is when someone throws a moderate amount of C++-isms in a basically-C project, making semgrep completely unable to handle it.

Does clang return any error using a Bear generated database?

I've personally never encountered this.

aryx added a commit to semgrep/pfff that referenced this issue Aug 3, 2021
aryx added a commit that referenced this issue Aug 3, 2021
This is the start of C++ support in Semgrep.
This will help #1952

test plan:
make
aryx added a commit that referenced this issue Aug 3, 2021
This is the start of C++ support in Semgrep.
This will help #1952

test plan:
make
@aryx
Copy link
Collaborator

aryx commented Jun 1, 2022

Closing this issue since we added support for C++ last year (I forgot to close the issue).

@aryx aryx closed this as completed Jun 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

7 participants