C++ support #1952

hexagonrecursion · 2020-11-03T10:51:19Z

I searched the issues and did not find a request for C++ support. Now there is one.

dlukeomalley · 2020-11-03T16:59:33Z

Thank you @hexagonrecursion for the tracking ticket and request! We'll look for more +1s here and start the conversation 👍

ievans · 2020-11-03T17:00:14Z

cc @aryx @mjambon and also see some prior discussion on HN: https://news.ycombinator.com/item?id=24932891

ubimars · 2020-11-09T12:25:28Z

+1

aryx · 2020-11-09T13:54:45Z

Do you have an example of a codebase you would like to analyze? A big problem is that we want to parse the code as-is, but
if the C++ code makes heavy use of cpp macros with weird syntax, we will not be able to parse it.

hexagonrecursion · 2020-11-10T06:55:18Z

@aryx

Do you have an example of a codebase you would like to analyze? A big problem is that we want to parse the code as-is, but
if the C++ code makes heavy use of cpp macros with weird syntax, we will not be able to parse it.

aryx · 2020-11-10T07:46:39Z

Yes, for Mozilla it is not gonna happen ... They're using weird macros everywhere that make the code unparsable.
It's not C++, it's C++_WITH_OUR_OWN_MINI_LANGUAGE_FOR_RESOURCE_MANAGEMENT_AND_OTHER_STUFF.

This will help #1952

test plan:
pad@yrax:~/work/lang-cpp/Cataclysm-DDA$ yy -lang cpp -test_parse_tree_sitter .
+ /home/pad/semgrep/_build/default/cli/Main.exe -lang cpp -test_parse_tree_sitter .
11 / 852/OVERLOAD/work/lang-cpp/Cataclysm-DDA/src/activity_actor.cpp: exn = Tree_sitter_run.Tree_sitter_error.Error(_)
...
NB total files = 852; NB total lines = 474254; perfect = 662; pbs = 190; timeout = 0; =========> 77%
nb good = 174878, nb passed = 0 =========> 0.000000%
nb good = 174878, nb bad = 299376 =========> 36.874333%

For comparison, the C++/C/cpp parser in pfff is doing:
NB total files = 852; NB total lines = 473336; perfect = 110; pbs = 742; timeout = 0; =========> 12%
nb good = 160267, nb passed = 187 =========> 0.116680%
nb good = 160267, nb bad = 313069 =========> 33.859035%
pad@yrax:~/work/lang-cpp/Cataclysm-DDA$

aryx · 2020-11-10T09:57:54Z

The tree-sitter C++ parser can parse 93% of Cataclysm, which is not bad, but there are lots of parse errors on code I don't really understand:

Unrecognized construct
Error: File Cataclysm-DDA/tests/vehicle_part_test.cpp, line 23, characters 51-52:
#include "vpart_range.h"

static time_point midnight = calendar::turn_zero + 0_hours;
                                                   ^       
static time_point midday = calendar::turn_zero + 12_hours;

hexagonrecursion · 2020-11-12T15:43:22Z

This is a C++11 feature, user-defined literals: https://en.cppreference.com/w/cpp/language/user_literal
Headers where the literals are defined:
https://github.com/CleverRaven/Cataclysm-DDA/blob/master/src/units.h#L635-L839
https://github.com/CleverRaven/Cataclysm-DDA/blob/master/src/calendar.h#L339-L367
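
For readers unfamiliar with the feature, here is a minimal sketch of how a literal such as `12_hours` comes to be valid C++11. The `Hours` type and `turn_zero` constant below are assumptions made up for this illustration; they are not the definitions in the Cataclysm-DDA headers linked above.

```cpp
#include <cstdint>

// Hypothetical duration type for illustration only; this is NOT the
// definition used in Cataclysm-DDA's calendar.h.
struct Hours {
    std::int64_t count;
};

constexpr Hours operator+(Hours a, Hours b) { return Hours{a.count + b.count}; }
constexpr bool operator==(Hours a, Hours b) { return a.count == b.count; }

// C++11 literal operator: the integer form must take unsigned long long.
constexpr Hours operator""_hours(unsigned long long n) {
    return Hours{static_cast<std::int64_t>(n)};
}

// `0_hours` and `12_hours` are each a single token (a user-defined integer
// literal), not a number followed by an identifier.
constexpr Hours turn_zero{0};  // hypothetical stand-in for calendar::turn_zero
constexpr Hours midnight = turn_zero + 0_hours;
constexpr Hours midday = turn_zero + 12_hours;

static_assert(midday == Hours{12}, "12_hours should equal Hours{12}");
```

A grammar that has no rule for a suffix attached to a numeric literal would plausibly stop right at the suffix, which matches where the error above is reported.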

aryx · 2020-11-13T07:39:45Z

Could you create an issue on the tree-sitter C++ parser, given that's the one we are using for C++:
https://github.com/tree-sitter/tree-sitter-cpp

zwass · 2020-12-09T00:29:30Z

Just listening to Clint Gibler's talk about semgrep at Empire Hacking. I would be interested in using semgrep with the osquery codebase. We don't use macros very extensively there. Would this be feasible?

aryx · 2020-12-09T09:03:08Z

I've just tried the tree-sitter-cpp parser on osquery and it parses only 75% of the codebase.
Here are a few parsing errors reported by the current version of tree-sitter-cpp:

File utils/json/tests/json.cpp, line 202, characters 22-28:

  auto doc2 = JSON::newObject();
  doc2.add("new_key", size_t{10});
  doc2.addCopy("new_key1", "new_value");

problem on size_t{10}
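
As a side note on this construct: `size_t{10}` is a type name followed by a braced-init-list, which has been standard since C++11. The short sketch below compares it with the older spellings; the variable names are arbitrary and `std::size_t` is used in place of the unqualified `size_t` from the osquery snippet.

```cpp
#include <cstddef>

// Three spellings of "a size_t with value 10".
constexpr std::size_t a = std::size_t{10};               // C++11: type followed by a braced-init-list
constexpr std::size_t b = std::size_t(10);               // functional cast, valid since C++98
constexpr std::size_t c = static_cast<std::size_t>(10);  // explicit cast

static_assert(a == b && b == c, "all three forms yield the same value");

// Unlike the parenthesized form, the braced form rejects narrowing:
// std::size_t{-1} is ill-formed, while std::size_t(-1) compiles.
```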

File logger/logger.cpp, line 79, character 0 to line 84, character 39:
 * Within the daemon, logs are drained every 3 seconds.
 */
HIDDEN_FLAG(bool,
            logger_status_sync,
            false,
            "Always send status logs synchronously");

DECLARE_bool(enable_numeric_monitoring);

problem on HIDDEN_FLAG macro.

It probably needs quite some work to extend https://github.com/tree-sitter/tree-sitter-cpp to handle all those constructs
and get a good parsing percentage on osquery.

An alternative is to rely on the clang AST, but this requires the source to be compilable and semgrep to be told where to find the header files (-I) and any macro definitions (-D) needed to parse the code correctly. This requires lots of work and may be quite slow, as clang needs to process all the included files to collect the macros, typedefs, etc. before it can parse the code correctly.

aryx · 2020-12-09T09:04:32Z

I think in the short term we could spend some time integrating tree-sitter-cpp in semgrep, to be able to parse and match C++ code, but you will get lower coverage than for the other languages, as a lot of the C++ code (25% in osquery's case) would not be parsed.

zwass · 2020-12-10T19:40:48Z

@aryx thank you for the great feedback! I wonder about that first case -- isn't that standard C++ syntax? Trying to remember what the checks were that I wanted to add to osquery that I thought semgrep could be a good fit for. I will post here if I remember those.

aryx · 2020-12-10T20:32:30Z

I don't think you can have toplevel function calls in C or C++, and that's what this code looks like: it looks like a function call,
when really it's a macro that expands into a declaration (which you can have at the toplevel in C or C++).

Macros ...
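
To make the point concrete, here is a sketch of a macro that looks like a toplevel call but expands into declarations. The macro body is a hypothetical, heavily simplified stand-in; the real HIDDEN_FLAG in osquery does more than this, and the `FLAGS_` and `FLAGS_help_` names are assumptions for the illustration.

```cpp
#include <type_traits>

// Hypothetical, heavily simplified stand-in for a flag-defining macro.
// The real HIDDEN_FLAG in osquery does more than this.
#define HIDDEN_FLAG(type, name, default_value, help) \
    type FLAGS_##name = default_value;               \
    const char* const FLAGS_help_##name = help

// To a parser that does not run the preprocessor, this reads as a function
// call at namespace scope, which neither C nor C++ allows.
HIDDEN_FLAG(bool,
            logger_status_sync,
            false,
            "Always send status logs synchronously");

// After expansion it is just two declarations, so the variable exists:
static_assert(std::is_same<decltype(FLAGS_logger_status_sync), bool>::value,
              "the macro expanded into a declaration of a bool");
```

A parser that works on the unpreprocessed text has to either special-case such macros or report an error, since it cannot know what the macro expands to.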

hexagonrecursion · 2020-12-26T14:35:17Z

Parsing C++ is hard.

tree-sitter/tree-sitter-cpp#74
tldr: in order to support its primary use case (updating the parse tree in real time as you type) tree-sitter-cpp makes certain sacrifices and accepts certain inaccuracies that may make it a less than ideal backend for semgrep.

http://www.computing.surrey.ac.uk/research/dsrg/fog/FogThesis.pdf#page=375 appendix F
A fairly comprehensive list of C++ ambiguities and other parsing difficulties

LibTooling (a C/C++/Objective-C parsing library that uses the same parser as the clang compiler) requires a "compilation database" as input. I assume there is a good reason for that, i.e. the parse tree of a C++ file depends on the compiler flags and the current working directory.

https://medium.com/@mujjingun_23509/full-proof-that-c-grammar-is-undecidable-34e22dd8b664
Full Proof that C++ Grammar is Undecidable
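
A small sketch of why compiler flags can change the parse tree, using the classic `T * p;` ambiguity. The flag name `T_IS_TYPE` and the function name are made up for this illustration.

```cpp
#include <type_traits>

// Build with -DT_IS_TYPE to select the first branch; T_IS_TYPE is a
// hypothetical flag name used only for this illustration.
#ifdef T_IS_TYPE
typedef int T;  // T names a type
#else
int T = 2;      // T names a variable
#endif

int p = 3;

// Same tokens, two different parse trees depending on the -D flag.
bool statement_declares_pointer() {
    T * p;    // with T_IS_TYPE: declares a local `int* p`;
              // otherwise: an expression multiplying T by the global p
    (void)p;  // silence unused-variable warnings in the declaration case
    return std::is_pointer<decltype(p)>::value;
}
```

Built without flags, the function returns false because `p` still refers to the global `int`; built with `-DT_IS_TYPE`, it returns true because the statement has become a declaration. A tool that never sees the flags cannot tell which reading is the right one.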

dkasak · 2021-05-05T07:35:37Z

It's worth noting that if you can compile the project, Bear makes it easy to generate the compilation database. You basically just do bear -- cmd_line_to_start_compilation, e.g. bear -- make.

So I don't think having to have a compilation database would be a hard requirement to satisfy. I think it's probably the way to go.

aryx · 2021-05-05T07:42:43Z

How good is Bear now? A few years ago I had a quick look and it could handle only a few projects. Can it now really be used on any project? Are there examples out there of complex projects using complex makefiles where Bear does a good job? Can it be used for the Mozilla codebase, for example? Does clang return any error using a Bear generated database?

dkasak · 2021-05-05T08:07:31Z

From my experience, Bear is pretty robust now. I've mostly used Bear/compilation databases with varied vim/nvim tooling, e.g. the ccls C/C++/Obj-C Language Server can optionally use compilation databases and they recommend it for larger projects.

I haven't tried it on a gigantic, complex project like the Mozilla codebase. But then again, my ambitions with semgrep and C++ are lesser. :) It would be sweet if it was robust enough to be applicable to Mozilla's codebase, but the largest pain point for me is when someone throws a moderate amount of C++-isms in a basically-C project, making semgrep completely unable to handle it.

Does clang return any error using a Bear generated database?

I've personally never encountered this.

aryx · 2022-06-01T10:30:43Z

Closing this issue since we added support for C++ last year (I forgot to close the issue).

ievans added the new-language label Nov 3, 2020

aryx mentioned this issue Nov 10, 2020: Test tree-sitter c++ parser #1996 (merged)

aryx mentioned this issue Nov 10, 2020: A few projects to test the tree-sitter C and C++ parser semgrep/ocaml-tree-sitter-semgrep#125 (merged)

aryx added a commit to semgrep/pfff that referenced this issue Aug 3, 2021: ebffd58, "[C++] adapt parser interface to Parse_info.parse_result". This will help semgrep/semgrep#1952. Test plan: make

aryx added a commit that referenced this issue Aug 3, 2021: 09994ad, "[C++] add tree-sitter boilerplate file". This is the start of C++ support in Semgrep. This will help #1952. Test plan: make

aryx mentioned this issue Aug 3, 2021: [C++] add tree-sitter boilerplate file #3661 (merged)

aryx added a commit that referenced this issue Aug 3, 2021: ebb919a, "[C++] add tree-sitter boilerplate file (#3661)". This is the start of C++ support in Semgrep. This will help #1952. Test plan: make

aryx closed this as completed Jun 1, 2022
