Replace TCC Header Parser by Tree-Sitter based one #275
Comments
We plan to use tree-sitter-cpp for this; libclang is too heavy. Some customization might be required, something like |
I created a small PoC with libclang and identified 3 major advantages over tree-sitter.
output:
target i386
target x86_64
Perhaps the type definition system could depend on the architecture of the loaded binary? Source code:
#include "clang-c/Index.h"
#include <err.h>
#include <stdio.h>
#include <string.h>

#define PROJECT_NAME "test_libclang"

/* Visitor for the children of a struct: prints each member with its size and offset. */
enum CXChildVisitResult display_struct_decl(
    CXCursor cursor,
    CXCursor parent,
    __attribute__((unused)) CXClientData client_data) {
    CXString field_name = clang_getCursorSpelling(cursor);
    CXType struct_type = clang_getCursorType(cursor);
    CXType parent_type = clang_getCursorType(parent);
    CXString kind_name = clang_getTypeSpelling(struct_type);
    const char *kind_name_c = clang_getCString(kind_name);
    const char *field_name_c = clang_getCString(field_name);
    /* Skip children without a type spelling, but still release the strings below. */
    if (strlen(kind_name_c) > 0) {
        /* Both calls return long long: the size is in bytes, the offset in bits. */
        long long field_size = clang_Type_getSizeOf(struct_type);
        long long field_offset = clang_Type_getOffsetOf(parent_type, field_name_c) / 8;
        printf(
            "\t%s %s; # size = %lld byte, offset = %lld byte\n",
            kind_name_c,
            field_name_c,
            field_size,
            field_offset);
    }
    clang_disposeString(field_name);
    clang_disposeString(kind_name);
    return CXChildVisit_Continue;
}

/* Prints one struct declaration together with all of its members. */
void display_struct(CXCursor struct_cursor) {
    CXString struct_name = clang_getCursorSpelling(struct_cursor);
    printf("struct %s {\n", clang_getCString(struct_name));
    clang_visitChildren(struct_cursor, display_struct_decl, NULL);
    printf("}\n");
    clang_disposeString(struct_name);
}

/* Top-level visitor: only struct declarations are of interest. */
enum CXChildVisitResult display_cursor_info(
    CXCursor cursor,
    __attribute__((unused)) CXCursor parent,
    __attribute__((unused)) CXClientData client_data) {
    enum CXCursorKind cursor_type = clang_getCursorKind(cursor);
    if (cursor_type == CXCursor_StructDecl) {
        display_struct(cursor);
    }
    return CXChildVisit_Continue;
}

void display_info(CXTranslationUnit unit) {
    CXCursor root = clang_getTranslationUnitCursor(unit);
    clang_visitChildren(root, display_cursor_info, NULL);
}

/* Parses the header with the given compiler arguments (e.g. a target flag). */
void parse_header(
    const char *filename,
    const char *const *arguments,
    int argc) {
    CXIndex index = clang_createIndex(0, 0);
    CXTranslationUnit unit = clang_parseTranslationUnit(
        index,
        filename,
        arguments,
        argc,
        NULL,
        0,
        CXTranslationUnit_None);
    if (unit == NULL) {
        errx(1, "Invalid header file.");
    }
    display_info(unit);
    clang_disposeTranslationUnit(unit);
    clang_disposeIndex(index);
}

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "%s takes at least 1 argument.\n", argv[0]);
        return 1;
    }
    /* Everything after the header path is forwarded to clang as compiler arguments. */
    parse_header(argv[1], (const char *const *)(argv + 2), argc - 2);
    return 0;
} |
It should depend on the architecture of the binary and OS, indeed. It
} else if (bt == VT_DOUBLE || bt == VT_INT64) {
    if (!strncmp (tcc_state->arch, "x86", 3) && tcc_state->bits == 32) {
        if (!strncmp (tcc_state->os, "windows", 7)) {
            *a = 8;
        } else {
            *a = 4;
        }
Regarding the use of libclang as a dependency of such a core feature of Rizin: I am strongly against it. In my experience, libclang has no stable API, and it's a huge pain to maintain anything built upon it while migrating from version to version. It is even worse if you need to support multiple versions of it, or differences in how distributions package it. Using Clang just for simple data type parsing is overkill. It will make our CI cry as well. I think using tree-sitter-cpp is the way to go. If their preprocessor support is insufficient we could just use old But your libclang-based PoC might be useful for generating tests for the new C data types parser. |
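To make the arch/OS dependency concrete, here is a minimal sketch of how the alignment rule from the tcc excerpt changes a struct layout. The function names `align_of_64bit_scalar` and `align_up` are hypothetical and not part of tcc or Rizin, and the fallback of 8 for every other target is an assumption, since the excerpt does not show that branch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper mirroring the tcc excerpt: alignment of double/int64.
 * On 32-bit x86 it is 8 on Windows and 4 elsewhere. */
int align_of_64bit_scalar(const char *arch, int bits, const char *os) {
    if (!strncmp(arch, "x86", 3) && bits == 32) {
        return !strncmp(os, "windows", 7) ? 8 : 4;
    }
    return 8; /* assumption: natural alignment on all other targets */
}

/* Rounds offset up to the next multiple of align (align must be > 0). */
size_t align_up(size_t offset, size_t align) {
    return (offset + align - 1) / align * align;
}
```

For `struct { char a; double b; }`, the offset of `b` is `align_up(1, align_of_64bit_scalar(arch, bits, os))`, so the same header yields a different layout on 32-bit x86 Linux than on 32-bit x86 Windows.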
Does Rizin really need to be able to parse C syntax (and header files)? If C parsing is not the priority, we can use a simpler solution: another language to represent the type system. Example:
(struct (name "test") (field (u8) "a") (field (u64) (offset 4) "b"))
(enum (name "hello") (field "a") (field "b" 1))
offset is optional; by default, the fields of the structure are packed.
struct test {
    u8 a;
    u64 b;
};
enum hello {
    a,
    b = 1,
}
S-expressions are easy to parse (we could use tree-sitter) and to serialize (to save configuration). We could also add a command to load arch/OS-dependent types.
x86_64.type
(typedef (u8) (char))
(typedef (u64) (long))
...
What do you think about it? |
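As a rough illustration of the "offset is optional, packed by default" rule, the following sketch resolves field offsets after such a description has been parsed. The `Field` record, the `NO_OFFSET` marker and `resolve_offsets` are hypothetical names invented for this example; sizes are assumed to be already known in bytes.

```c
#include <stddef.h>

/* Marks a field that carries no explicit (offset N). */
#define NO_OFFSET ((size_t)-1)

/* Hypothetical record for one parsed (field ...) entry. */
typedef struct {
    const char *name;
    size_t size;   /* size in bytes, e.g. 1 for u8, 8 for u64 */
    size_t offset; /* explicit offset in bytes, or NO_OFFSET */
} Field;

/* Replaces every NO_OFFSET with the end of the previous field (packed layout)
 * and returns the total size of the structure in bytes. */
size_t resolve_offsets(Field *fields, size_t count) {
    size_t end = 0;   /* first byte after the previous field */
    size_t total = 0; /* first byte after the furthest field seen so far */
    for (size_t i = 0; i < count; i++) {
        if (fields[i].offset == NO_OFFSET) {
            fields[i].offset = end;
        }
        end = fields[i].offset + fields[i].size;
        if (end > total) {
            total = end;
        }
    }
    return total;
}
```

With the "test" struct above, "a" resolves to offset 0 and "b" keeps its explicit offset 4; without the explicit offset, "b" would be placed directly after "a" at offset 1.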
For the purpose of parsing real-life headers using clang, it might be better to have it as a separate helper executable or script instead of linking libclang directly into rizin. Linking to libclang has multiple potential problems:
|
Another note: using libclang to parse real-life headers is a challenging problem unless you are on Linux parsing Linux library headers. Standard library and system library headers include a lot of compiler-specific junk that often can't be easily parsed by other compilers' parsers. And the standard library headers can't simply be replaced with the standard library used by Clang, because that could change the meaning of the main headers you are parsing, especially if you care about exact byte layout. So a robust solution which can work with Windows and macOS headers will likely require a combination of careful flag tuning/matching with the target platform, replacements for features that can't be dealt with, and some header replacements. It is not impossible, but it takes some effort and maintenance to keep it working.
Latest MSVC versions are somewhat easier to handle, but they still occasionally break things, as demonstrated by Cutter's current need to use an older MSVC version so that shiboken (which uses libclang) can deal with MSVC headers. Even on macOS with Xcode, which also uses Clang, things aren't perfect. It isn't quite the same as open source Clang; they have their own private fork, which is neither newer nor older than open source Clang. It is also not the same LLVM as in their public Swift repos. They have their own changes, which may take months until they get open sourced or reimplemented. The opposite is also possible, where the latest open source Clang has incompatible changes which aren't included in the latest Xcode version.
At the same time, if you manage to get clang through parsing, having differences in the layout of a few structures is probably not too bad for reverse engineering purposes. That's still 10x more correct type definitions than if you had to define them manually. Not having to understand the content of function bodies (almost, because C++ can be complicated) also makes things a bit easier. |
Yes. There is no way to avoid this.
If we use tree-sitter we can parse both C and C++ definitions too ;) |
I've updated the description with a concrete plan. |
Current State
There are currently two parsers for different parts of C syntax in rizin:
Parsing of C headers for struct definitions, enums, etc. using tcc into base types for the database:
rizin/librz/include/rz_parse.h
Lines 73 to 77 in a3a339b
Parsing C type expressions like char *[42] using a custom mpc-based parser into ctypes, to be used on top of base types in variables, etc.:
rizin/librz/include/rz_parse.h
Lines 79 to 115 in a3a339b
The tcc one for base types has several limitations, such as its use of global state, and it also writes the sdb records out directly, which is very inflexible and hard to inspect. The mpc one for ctypes works ok, but mpc also has some aspects that make it annoying to use and extend (see the amount of strcmp in https://github.com/rizinorg/rizin/blob/a3a339b11b1e241ee95201d7b666ff291c8e2100/librz/parse/ctype.c).
Solution
tree-sitter-cpp should be used to replace tcc. The parser for base types should be written in such a way that it emits RzAnalysisBaseType structs and not raw sdb. Thus, the results can be used more flexibly or just stored into the database through the API.
Because a complete C(++) parser, and thus presumably also tree-sitter-cpp, must also parse cast expressions such as (char *[42])<something>, the parser for the char *[42] in this example could also be re-used directly for RzParseCTypeType, replacing mpc there.
Conflicts/Dependencies
This conflicts slightly with #371 if the parser is done first, because the new parser will generate RzAnalysisBaseType objects, but #371 will modify their structure to use RzParseCTypeType instead of strings, forcing the newly written parser to be adapted too.
If, however, #371 is done first and the parser second, then the old mpc parser will perhaps have to be integrated in many more places, making refactoring to the tree-sitter parser a bit harder.
I think the best solution will be to implement the parser first and have it create an RzParseCTypeType whenever such a type is encountered, converting that to a string only in the very last step, when constructing the RzAnalysisBaseType. As such, this issue depends on #370.
While writing the new parser, one will probably encounter inline definitions of structs, enums and functions. For functions, this will depend on #373. For other inline types, I think it's something we cannot support with the current ctypes/base types separation, and we should postpone it until all the refactoring is done and we might consider merging these two kinds of types.
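The "structured type first, string only in the last step" idea could look roughly like the sketch below. `CType` is a deliberately simplified stand-in invented for this example; it is not the real RzParseCTypeType layout, and `ctype_to_string` is a hypothetical helper that only handles identifiers, pointers and arrays.

```c
#include <stdio.h>
#include <string.h>

typedef enum { CT_IDENT, CT_POINTER, CT_ARRAY } CTypeKind;

/* Simplified stand-in for a parsed type expression (not the rizin struct). */
typedef struct CType {
    CTypeKind kind;
    const char *name;        /* CT_IDENT: type name */
    unsigned long count;     /* CT_ARRAY: element count */
    const struct CType *sub; /* CT_POINTER: pointee, CT_ARRAY: element type */
} CType;

/* Renders the type as a C abstract declarator, e.g. "char *[42]".
 * Returns 0 on success, -1 if a buffer is too small. */
int ctype_to_string(const CType *t, char *out, size_t out_size) {
    char decl[128] = "";
    char tmp[128];
    int n;
    while (t->kind != CT_IDENT) {
        if (t->kind == CT_ARRAY) {
            n = snprintf(tmp, sizeof(tmp), "%s[%lu]", decl, t->count);
        } else if (t->sub->kind == CT_ARRAY) {
            /* A pointer to an array needs parentheses: (*)[42] */
            n = snprintf(tmp, sizeof(tmp), "(*%s)", decl);
        } else {
            n = snprintf(tmp, sizeof(tmp), "*%s", decl);
        }
        if (n < 0 || (size_t)n >= sizeof(tmp)) {
            return -1;
        }
        strcpy(decl, tmp);
        t = t->sub;
    }
    if (decl[0] != '\0') {
        n = snprintf(out, out_size, "%s %s", t->name, decl);
    } else {
        n = snprintf(out, out_size, "%s", t->name);
    }
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

An array of 42 pointers to char renders as "char *[42]", while a pointer to an array of 42 chars renders as "char (*)[42]". Keeping the tree around until this final conversion means the parser would not need to change much once base types store the structured form instead of strings.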
Original issue contents: