Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock/memory leak/crash when parsing a specific file #207

Closed
jrapin opened this issue Mar 7, 2023 · 8 comments · Fixed by #221
Closed

Deadlock/memory leak/crash when parsing a specific file #207

jrapin opened this issue Mar 7, 2023 · 8 comments · Fixed by #221

Comments

@jrapin
Copy link

jrapin commented Mar 7, 2023

Hello,
While I've parsed countless files so far without issue, when parsing this very file tree-sitter deadlocks and starts consuming more and more memory (100s of GB) until there is no more left, and it crashes with core dumped.
Note that if I replace all the tabs \t by 4 spaces, then the parser however works just fine as usual.

versions:

  • tree-sitter==0.20.1
  • lib compiled from tree-sitter-python at tag c6cfa75 4f25ce7 (master as of Feb 28th)

repro:

import requests
from pathlib import Path
import tree_sitter as ts

# download code to parse
url = "https://raw.githubusercontent.com/Tomi23R/sureBet-bot/294dda827eab8440fe06b97f698beabf6b580946/crawler/tenis_codere.py"
res = requests.get(url)
code = res.content.decode("utf8")

# instantiate tree-sitter
lib_path = Path().resolve().absolute() / "tree-sitter" / "python.so"
assert lib_path.exists(), "Update the lib_path to your install"
language = ts.Language(lib_path, "python")
parser = ts.Parser()
parser.set_language(language)

# this does not deadlock:
parser.parse(code.replace("\t", "    ").encode("utf8"))

# this deadlocks:
parser.parse(code.encode("utf8"))
@jrapin
Copy link
Author

jrapin commented Mar 7, 2023

Same pattern for this file as well (deadlocks as is, but not if we convert tabs into spaces):
https://github.com/dLauraM/PROYECTO-01-MARTINEZ-DIANALAURA/blob/3381270416d92240ba9e5f6ffa246e70d26bb3c1/main.py

@maxbrunsfeld
Copy link
Contributor

Thanks for the report! Could you try to remove parts of that file to get a minimal reproduction for this bug?

@jrapin
Copy link
Author

jrapin commented Mar 8, 2023

Hi @maxbrunsfeld , thank you for your answer. Based on the second example, the simpler (and pointless) code I found to bug was this:

bytecode = b'def main():\n\t\t\t\t\t\t\t\t\t\t\t\tif True:\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tfunc()\n\t\t\t\t\t\t\t\t\t\t\t\telif option_menu == 2:\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tfunc()\n\t\t\t\t\t\t\t\t\t\t\t\telif option_menu == 3:\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tfunc()\n'
parser.parse(bytecode)

@pqn
Copy link

pqn commented Apr 4, 2023

Confirming I also see similar failures on 0.20.8 while fuzz testing on tree-sitter-python commit 6282715 (base64 encoding: uwoJCQkJCQkJCQkJCQkJCQkJCQkJCQkJCQkJCQkJCXsrYkRBAEhiQEEAfEBj).

@ChillMagic
Copy link

ChillMagic commented May 8, 2023

I have located the issue, which was introduced by tree-sitter version 0.20.7 (0.20.6 is good). py-tree-sitter version 0.20.1, which uses tree-sitter version 0.20.7, triggers this issue. In contrast, py-tree-sitter version 0.20.0, which uses tree-sitter version 0.20.2, does not have this issue.

@amaanq
Copy link
Member

amaanq commented Jun 16, 2023

This interested me, so I decided to bisect it as a start:

Log:

git bisect start
# status: waiting for both good and bad commits
# bad: [d0029a15273e526925a764033e9b7f18f96a7ce5] Avoid unused value warning from array_pop
git bisect bad d0029a15273e526925a764033e9b7f18f96a7ce5
# status: waiting for good commit(s), bad commit known
# good: [062421dece3315bd6f228ad6d468cba083d0a2d5] 0.20.1
git bisect good 062421dece3315bd6f228ad6d468cba083d0a2d5
# good: [062421dece3315bd6f228ad6d468cba083d0a2d5] 0.20.1
git bisect good 062421dece3315bd6f228ad6d468cba083d0a2d5
# bad: [1848d0bc3632ef128d70736695a3fb969f6e0c69] Add tree included ranges getter
git bisect bad 1848d0bc3632ef128d70736695a3fb969f6e0c69
# good: [764c8c88ca2fb100f40fdbfbae61ffd32585fc67] last tweaks
git bisect good 764c8c88ca2fb100f40fdbfbae61ffd32585fc67
# good: [c0e1991f6bf439a52fb8a5a1176db341c56a7d63] :art: ts_parser__lex
git bisect good c0e1991f6bf439a52fb8a5a1176db341c56a7d63
# bad: [20d44ed13eb2785683a018bf9b58e40e381a3c3c] Merge pull request #1785 from petrisch/patch-1
git bisect bad 20d44ed13eb2785683a018bf9b58e40e381a3c3c
# bad: [3739b7794e381582c8f4a37b2ccec756c3504984] Merge pull request #1819 from slackner/uint16-production-id
git bisect bad 3739b7794e381582c8f4a37b2ccec756c3504984
# bad: [01df16ca9faa1635909ebea242deac200624d9ee] lib: 0.20.8
git bisect bad 01df16ca9faa1635909ebea242deac200624d9ee
# bad: [04381dcea31c9a6ccc4c4eeaf306e867f5fcc893] Add more python error recovery tests
git bisect bad 04381dcea31c9a6ccc4c4eeaf306e867f5fcc893
# bad: [5aa2f4dc8c00dacf72d6b46a75b2020c3d4d99cb] Log when ignoring an empty external token after an error
git bisect bad 5aa2f4dc8c00dacf72d6b46a75b2020c3d4d99cb
# bad: [d223a81b5064587af3a5f61f52d519670ba8995f] Allow empty external tokens during err recovery if they change the scanner's state
git bisect bad d223a81b5064587af3a5f61f52d519670ba8995f
# first bad commit: [d223a81b5064587af3a5f61f52d519670ba8995f] Allow empty external tokens during err recovery if they change the scanner's state

d223a81b5064587af3a5f61f52d519670ba8995f is the first bad commit
commit d223a81b5064587af3a5f61f52d519670ba8995f
Author: Max Brunsfeld <maxbrunsfeld@gmail.com>
Date:   Fri Jun 24 15:58:13 2022 -0700

    Allow empty external tokens during err recovery if they change the scanner's state

 lib/src/parser.c                             | 64 ++++++++++++++++---------
 lib/src/subtree.c                            | 48 ++++++++++++-------
 lib/src/subtree.h                            | 12 ++++-
 test/fixtures/error_corpus/python_errors.txt | 71 +++++++++++++++++++++++-----
 4 files changed, 143 insertions(+), 52 deletions(-)

@amaanq
Copy link
Member

amaanq commented Jun 22, 2023

Some more investigating, the bug wasn't actually in tree-sitter's code, but the scanner.

diff --git a/src/scanner.cc b/src/scanner.cc
index e9ba26b..970a0e7 100644
--- a/src/scanner.cc
+++ b/src/scanner.cc
@@ -139,7 +139,7 @@ struct Scanner {
       i += delimiter_count;
 
       for (; i < length; i++) {
-        indent_length_stack.push_back(buffer[i]);
+        indent_length_stack.push_back((unsigned char)buffer[i]);
       }
     }
   }

The non explicit cast caused any indent that is UINT8_T max (255), to be truncated to -1, and then when that's expanded to a uint16_t in push_back, we get a wildly different result in the vector - so we gained an extra char. Then, the loop in ts_parser__lex compares the size of the serialized buffer length and the length returned when comparing if the external scanner state changes, and as a result they will never be equal because on serialize after this bug occurs - we write one less than is read, leading to a back-and-forth off-by-one fight which can be observed by adding a debug print in ts_external_scanner_state_eq:

bool ts_external_scanner_state_eq(const ExternalScannerState *a, const char *buffer, unsigned length) {
  printf("a->length: %d, length: %d\n", a->length, length);
  return
    a->length == length &&
    memcmp(ts_external_scanner_state_data(a), buffer, length) == 0;
}

demo:

image

As a result - #221 would actually fix this as I cast everything properly

@maxbrunsfeld I'd appreciate a review of both #220 and #221 :)

@maxbrunsfeld
Copy link
Contributor

Wow, nice investigation @amaanq. Thanks so much for your work on this - I’ll review in the morning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants