Deadlock/memory leak/crash when parsing a specific file #207
Comments
Same pattern for this file as well (deadlocks as is, but not if we convert tabs into spaces): |
Thanks for the report! Could you try to remove parts of that file to get a minimal reproduction for this bug? |
Hi @maxbrunsfeld, thank you for your answer. Based on the second example, the simplest (admittedly pointless) code I found that triggers the bug is this:
bytecode = b'def main():\n\t\t\t\t\t\t\t\t\t\t\t\tif True:\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tfunc()\n\t\t\t\t\t\t\t\t\t\t\t\telif option_menu == 2:\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tfunc()\n\t\t\t\t\t\t\t\t\t\t\t\telif option_menu == 3:\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tfunc()\n'
parser.parse(bytecode) |
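Since the hang reportedly goes away once tabs are converted to spaces, a possible stopgap is to expand leading tabs before handing the bytes to the parser. This is only a rough sketch: `expand_leading_tabs` is a hypothetical helper (not part of py-tree-sitter), the width of 4 follows the original report, and it assumes indentation uses tabs only (files that mix tabs and spaces could change meaning).

```python
def expand_leading_tabs(source: bytes, width: int = 4) -> bytes:
    """Replace each leading tab on every line with `width` spaces.

    Tabs that appear after the first non-tab byte (e.g. inside string
    literals) are left untouched.
    """
    out = []
    for line in source.splitlines(keepends=True):
        stripped = line.lstrip(b"\t")          # drop leading tabs only
        n_tabs = len(line) - len(stripped)     # how many were dropped
        out.append(b" " * (width * n_tabs) + stripped)
    return b"".join(out)


# Hypothetical usage, with `parser` and `bytecode` as in the snippet above:
# tree = parser.parse(expand_leading_tabs(bytecode))
```

Note that byte offsets in the resulting tree refer to the expanded text, not the original file.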
Confirming I also see similar failures on 0.20.8 while fuzz testing on |
I have located the issue, which was introduced by tree-sitter version 0.20.7 (0.20.6 is good). py-tree-sitter version 0.20.1, which uses tree-sitter version 0.20.7, triggers this issue. In contrast, py-tree-sitter version 0.20.0, which uses tree-sitter version 0.20.2, does not have this issue. |
This interested me, so I decided to bisect it as a start: Log:
|
After some more investigating: the bug wasn't actually in tree-sitter's code, but in the scanner.
Because the cast is not explicit, any indent equal to the uint8_t maximum (255) gets truncated to -1, and when that value is then widened to a uint16_t in push_back, we get a wildly different result in the vector, so we gain an extra char. The loop in ts_parser__lex then checks whether the external scanner state has changed by comparing the length of the stored serialized buffer with the length just returned. Once this bug occurs they are never equal, because serialize writes one less than is read, leading to a back-and-forth off-by-one fight. This can be observed by adding a debug print in ts_external_scanner_state_eq:
bool ts_external_scanner_state_eq(const ExternalScannerState *a, const char *buffer, unsigned length) {
  // temporary debug print: the two lengths keep differing by one
  printf("a->length: %u, length: %u\n", a->length, length);
  return
    a->length == length &&
    memcmp(ts_external_scanner_state_data(a), buffer, length) == 0;
}
As a result, #221 would actually fix this, as I cast everything properly there. @maxbrunsfeld I'd appreciate a review of both #220 and #221 :) |
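To make the truncation step concrete, here is a small Python sketch that mimics the round trip described above using `ctypes` fixed-width integers. It is an illustration only, not the scanner's code: it assumes `char` is signed on the platform and that each indent is stored in a single `char` on serialize and widened to `uint16_t` on the way back. `roundtrip_indent` is a hypothetical name.

```python
import ctypes


def roundtrip_indent(indent: int) -> int:
    """Store an indent in a signed 8-bit slot, then widen it to 16 bits unsigned."""
    # What a signed `char` slot would hold (ctypes does no overflow checking).
    as_char = ctypes.c_int8(indent).value
    # What comes back when that char is converted to uint16_t.
    return ctypes.c_uint16(as_char).value


for indent in (96, 127, 128, 255):
    print(indent, "->", roundtrip_indent(indent))
# 96 -> 96
# 127 -> 127
# 128 -> 65408
# 255 -> 65535
```

Anything that fits in 7 bits survives unchanged; larger values come back sign-extended, which matches the "wildly different result" described in the comment.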
Wow, nice investigation @amaanq. Thanks so much for your work on this - I’ll review in the morning. |
Hello,
While I've parsed countless files so far without issue, when parsing this particular file tree-sitter deadlocks and consumes more and more memory (100s of GB) until none is left, then crashes with a core dump.
Note that if I replace all the tabs (\t) with 4 spaces, the parser works just fine as usual.
versions:
tree-sitter==0.20.1
tree-sitter-python at tagc6cfa75 4f25ce7 (master as of Feb 28th)
repro: