Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Huge memory usage when used through python bindings #199

Closed
arthurzam opened this issue Aug 24, 2023 · 27 comments
Closed

Huge memory usage when used through python bindings #199

arthurzam opened this issue Aug 24, 2023 · 27 comments

Comments

@arthurzam
Copy link

When trying to use new tree-sitter-bash (0.20.0, 0.20.1, 0.20.2) through the Python bindings, a huge memory usage is detected. Please note that by huge memory usage I mean really huge and really fast, in 6-8 seconds gets to 16GB and then I need to kill it before filling my RAM on PC. Tested on extra big system and also got to 490 GB RAM before I killed it. I think this is some memory leak or allocation loop, but can't be sure, so I'm afraid to call it as such.

Also, 0.19.0 was working ok. But I'm really thankful for the work the new maintainer has done since. We at Gentoo really want to upgrade so we can use the new improved grammar 😄

Reproducer

Caution: I recommend checking you have ready infra for killing the python process running for that, or a good OOM killer configuration. This really grows fast and can freeze your system. For me I'm testing inside a chroot container, so I have a terminal ready to kill the whole chroot before my system freezes.

from tree_sitter import Language, Parser
from ctypes.util import find_library

syslib = find_library("tree-sitter-bash")
lang = Language(syslib, "bash")
parser = Parser()
parser.set_language(lang)
# the next lines does all the problems
tree = parser.parse(b"""
  echo command
""")

At first we were thinking we were passing a very complicated bash scripts (we do have such), but the above example is literally just a simple echo.

Another maintainer at Gentoo noticed it also happens with emacs bindings, so I don't think this is only python bindings issue, maybe the general bindings interface? This is also reproduced across multiple Gentoo systems, even different arches (amd64, arm64, sparc, ppc64, ppc64le).

System configuration

This is done on a Gentoo system. Here are the commands to compile the resulting so file (maybe we are compiling incorrectly?):

$ make -j4 parser.o scanner.o 
x86_64-pc-linux-gnu-gcc -O2 -pipe -frecord-gcc-switches -fPIC   -c -o parser.o parser.c
x86_64-pc-linux-gnu-gcc -O2 -pipe -frecord-gcc-switches -fPIC   -c -o scanner.o scanner.c
$ x86_64-pc-linux-gnu-gcc -O2 -pipe -frecord-gcc-switches -fPIC -Wl,-O1 -Wl,--as-needed -shared parser.o scanner.o -Wl,--soname=libtree-sitter-bash.so.14 -o /var/tmp/portage/dev-libs/tree-sitter-bash-0.20.2/work/libtree-sitter-bash.so.14

Here is the link tree for the so:

$ lddtree /usr/lib64/libtree-sitter-bash.so.14
libtree-sitter-bash.so.14 => /usr/lib64/libtree-sitter-bash.so.14 (interpreter => none)
    libc.so.6 => /lib64/libc.so.6
        ld-linux-x86-64.so.2 => /lib64/ld-linux-x86-64.so.2

The system has:

  • glibc-2.37
  • gcc-12.2.1
  • tree-sitter-0.20.8
  • python bindings of tree-sitter-0.20.1

Any more info, testing, or anything you want I'll be happy to provide.

@MatthewGentoo
Copy link

I attached a debugger to Emacs to see what it was doing, and it looks like we end up stuck in this loop:

while (ts_stack_is_active(self->stack, version)) {
  LOG(
    "process version:%d, version_count:%u, state:%d, row:%u, col:%u",
    version,
    ts_stack_version_count(self->stack),
    ts_stack_state(self->stack, version),
    ts_stack_position(self->stack, version).extent.row,
    ts_stack_position(self->stack, version).extent.column
  );

  if (!ts_parser__advance(self, version, allow_node_reuse)) return NULL;
  LOG_STACK();

  position = ts_stack_position(self->stack, version).bytes;
  if (position > last_position || (version > 0 && position == last_position)) {
    last_position = position;
    break;
  }
}

(https://github.com/tree-sitter/tree-sitter/blob/0c49d6745b3fc4822ab02e0018770cd6383a779c/lib/src/parser.c#L1923-L1941)

It also looks like an empty file can trigger the issue.

Unsure what makes the Emacs and Python bindings to the ts-cli test runner, as all of the tests are able to run just fine.

@amaanq
Copy link
Member

amaanq commented Aug 24, 2023

It's possible it's a bug w/ the python bindings, fuzzing and a corpus input of 6000+ files has the cli puttering along just fine

@amaanq
Copy link
Member

amaanq commented Aug 24, 2023

If you can (and I'd really appreciate it), try bisecting the repo between 0.19.0 and when you first detect this bug. Note that you might have to regenerate the parser between some commits.

@ahlinc
Copy link
Contributor

ahlinc commented Aug 24, 2023

This may be related tree-sitter/tree-sitter#2517, details are in edit notes there.

My respect for Gentoo users to be so detailed in reports!

@arthurzam
Copy link
Author

Thank you for the prompt reply, I'll try bisecting and regenning.

@arthurzam
Copy link
Author

OK, so now I'm very confused. I wanted to start the bisecting, but of course before that I needed to prepare the verification steps, which I was planning it to be:

  1. tree-sitter generate

  2. compile the so locally:

    cd src/
    x86_64-pc-linux-gnu-gcc -O2 -pipe -frecord-gcc-switches -fPIC   -c -o parser.o parser.c
    x86_64-pc-linux-gnu-gcc -O2 -pipe -frecord-gcc-switches -fPIC   -c -o scanner.o scanner.c
    x86_64-pc-linux-gnu-gcc -O2 -pipe -frecord-gcc-switches -fPIC -Wl,-O1 -Wl,--as-needed -shared parser.o scanner.o -Wl,--soname=libtree-sitter-bash.so.14 -o libtree-sitter-bash.so.14
  3. run the python code and see the memory usage.

And to my surprise, the current master code works? Going back to the tagged versions of 0.20.* also works?

After rechecking, I think I found something weird. The tree-sitter generate command generates quite different C files. Even when I'm on the commits called chore: generate (from my understanding is when you sync the parsers into the generated C code, sorry if I misunderstood), it still generates different files. Every time I generate the new parser and scanner, the new files work as intended (and I also see the fixes done since 0.19.0, so this isn't a case of using old version by mistake).

When I'm on commit a7be575, I get those 2 files (the diff between my files and the once in repo is huge, not something cosmetic like timestamps, unless I'm missing something):

Clearly something is very different in the generator used? The versions on my system are:

  • tree-sitter-0.20.8
  • tree-sitter-cli-0.20.8
  • nodejs-20.5.1

Could we maybe investigate what happens here? While, yes, we could take the easy route in Gentoo: run tree-sitter generate on each build, or pre-gen them manually before releasing and distribute "our" files, but I think this is quite a wrong solution. Can we please compare the difference between the generators on mine and yours, so the version distributed in the release tarballs works also for us?

@amaanq
Copy link
Member

amaanq commented Aug 26, 2023

I generate on ts master w/ the Rust CLI crate, and I think ahlinc does too. Can you bisect tree-sitter master? Node 0.20.8 is a few months old now - I know I'm asking a lot but I would really appreciate it.

@arthurzam
Copy link
Author

I generate on ts master w/ the Rust CLI crate, and I think ahlinc does too. Can you bisect tree-sitter master? Node 0.20.8 is a few months old now - I know I'm asking a lot but I would really appreciate it.

Mind linking me the relevant git repos, ideally on exactly which commit you are on master? Just so I can reproduce as well as possible :)

@amaanq
Copy link
Member

amaanq commented Aug 26, 2023

https://github.com/tree-sitter/tree-sitter/

I constantly change commits ( generating on master right ;) ), anything within the last day or two should work

right now - tree-sitter 0.20.8 (6d41d99990cfc8527f424603cdfce1c9af2d78ef)

Build it by running cargo install --path cli (you can add --debug for faster build times, but slower generate times...)

@arthurzam
Copy link
Author

@amaanq Yes, you are very right, and indeed with master tree-sitter-cli I get nearly same file as yours (in the parser.c it moved one enum declaration to just another place, but this doesn't affect anything at all).

I guess that the new master CLI uses a little bit different functions/codebase, which affects how bindings access the library (I mean the the master CLI generates code for the master tree-sitter lib), so it doesn't work well for anyone using not-master tree-sitter-lib.

When I installed live libtree-sitter on my system, the new tree-sitter-bash works.

Now I want to ask you both what is the best way forward?

@amaanq
Copy link
Member

amaanq commented Aug 26, 2023

Well that's even more interesting, a mismatch of versions between the installed tree-sitter library and generated parsers caused this?

I think it's because of this: tree-sitter/tree-sitter#2316

If you feel like going above and beyond, bisecting an installed version before and after that PR w/ my generated-from-master parser.c file would be awesome. (The results of this might help us make a decision more easily)

I think we need to bump a new ver...

@ahlinc

@arthurzam
Copy link
Author

Well that's even more interesting, a mismatch of versions between the installed tree-sitter library and generated parsers caused this?

I think it's because of this: tree-sitter/tree-sitter#2316

If you feel like going above and beyond, bisecting an installed version before and after that PR w/ my generated-from-master parser.c file would be awesome. (The results of this might help us make a decision more easily)

Well, why not? Did a full git bisect between 0.20.8 tag and HEAD, and tree-sitter/tree-sitter@4a00725 is the commit doing all the difference. On parent commit it memory leaks with current tree-sitter-bash generated files, on on linked commit it doesn't leak.

Meaning starting from tree-sitter/tree-sitter@4a00725 the memory leak stops and the libtree-sitter matches the current generated files.


I've also tried the other way - find the first commit generating "leaking" files with released libtree-sitter-0.20.8, and also got exactly the same commit tree-sitter/tree-sitter@4a00725 - meaning this is indeed the changing commit, starting from the generated files are incompatible with release libtree-sitter.


Why did this commit break? Cause the generator stops adding eof = lexer->eof(lexer); to the file, expecting the parser.h to do it, but this isn't present in installed lib/include/tree_sitter/parser.h.

I guess this commit can be reverted for now to not break backward compatibility - if you want.

@ahlinc
Copy link
Contributor

ahlinc commented Aug 26, 2023

I guess this commit can be reverted for now to not break backward compatibility - if you want.

Nice catch, I think we can revert that commit for now and postpone it to couple with some other more important changes to be worth to bump a language version.

Also I think it's important rule that grammar maintainers should always regenerate parsers only with the latest released version of Tree-sitter CLI and don't use Tree-sitter CLI built from the master branch. Or better the parser regeneration should be an automatic step that would happen on CI. Because it's an easy mistake that may accidentally happened.
Edit: above assumption isn't actual, see further conversation.

@ahlinc
Copy link
Contributor

ahlinc commented Aug 26, 2023

but this isn't present in installed lib/include/tree_sitter/parser.h

@arthurzam could you explain this sentence a bit more? Where is it installed and on what moments is it used?

@arthurzam
Copy link
Author

arthurzam commented Aug 26, 2023

@arthurzam could you explain this sentence a bit more? Where is it installed and on what moments is it used?

libtree-sitter is installed on main system, so the header file is also installed globally. For example on gentoo:

$ qlist dev-libs/tree-sitter
/usr/lib64/libtree-sitter.so.0.0
/usr/lib64/libtree-sitter.so.0
/usr/lib64/libtree-sitter.so
/usr/lib64/pkgconfig/tree-sitter.pc
/usr/include/tree_sitter/api.h
/usr/include/tree_sitter/parser.h
/usr/lib64/libtree-sitter-bash.so.13
/usr/lib64/libtree-sitter-bash.so

So when the tree-sitter-bash's C files are compiled on the system, it #include <tree_sitter/parser.h> so it gets current released header file, while the tree-sitter-cli generated for "newer" master header.

@ahlinc
Copy link
Contributor

ahlinc commented Aug 26, 2023

So when the tree-sitter-bash's C files are compiled on the system, it #include <tree_sitter/parser.h> so it gets current released header file, while the tree-sitter-cli generated for "newer" master header.

Thank you for the explanation, so for now this issue with the bash grammar happened because I habitually used a CLI built from the Tree-sitter master to regenerate parser files in this bash repo.

@ahlinc
Copy link
Contributor

ahlinc commented Aug 26, 2023

I assume that Neovim nightly wasn't affected by this issue because they don't trust to generated files and do parsers regeneration during the build process. Or they just didn't do an update.

@ahlinc
Copy link
Contributor

ahlinc commented Aug 26, 2023

@arthurzam a couple of additional questions:

  • Why do you install the /usr/include/tree_sitter/parser.h header to the system?
    This header file is an accompanied header file that should exists in every grammars repository together with the generated src/parser.c file.

  • What header file do you use when building a grammar like bash?
    That /usr/include/tree_sitter/parser.h system wide installed or provided in a grammar repo like src/tree_sitter/parser.h?

@arthurzam
Copy link
Author

* Why do you install the `/usr/include/tree_sitter/parser.h` header to the system?
  This header file is an accompanied header file that should exists in every grammars repository together with the generated `src/parser.c` file.

This is installed by the makefile of tree-sitter: link

* What header file do you use when building a grammar like bash?
  That `/usr/include/tree_sitter/parser.h` system wide installed or provided in a grammar repo like `src/tree_sitter/parser.h`?

Since the include directive is #include <tree_sitter/parser.h>, with the < gcc takes it from system level. If you want it to start with local paths, it should use ". Full docs

Thinking more on it, if indeed it had " in the include it would take the file generated in repo, so it might be a better way for the future.

@ahlinc
Copy link
Contributor

ahlinc commented Aug 26, 2023

This is installed by the makefile of tree-sitter: link

Oookay 😄, that will be fixed for sure.

Since the include directive is #include <tree_sitter/parser.h>, with the < gcc takes it from system level. If you want it to start with local paths, it should use ". Full docs

That's an another issue in Tree-sitter, will be fixed too.

@thesamesam
Copy link

thesamesam commented Aug 26, 2023

Unless I'm misunderstanding here, in this instance, we (in Gentoo) weren't regenerating the parser anyway, so the header on our system in particular doesn't negate the need for the PR you filed (even if it's fair to ask why it's there).

i.e. the issue is the tree-sitter-bash release had one built in a bad env, not that we just happen to have the header around, even if that's an independent problem.

EDIT: sorry, I follow now, we still have to build the library each time, duh

@ahlinc
Copy link
Contributor

ahlinc commented Aug 26, 2023

i.e. the issue is the tree-sitter-bash release had one built in a bad env, not that we just happen to have the header around, even if that's an independent problem.

You shouldn't have a system wide /usr/include/tree_sitter/parser.h header.

@ahlinc
Copy link
Contributor

ahlinc commented Aug 26, 2023

@thesamesam Could you please patch Gentoo's emerge file for that until the next Tree-sitter release would happen?

thesamesam added a commit to thesamesam/gentoo that referenced this issue Aug 26, 2023
There shouldn't be a system-wide copy of parser.h and upstream plan on
dropping this from dev-libs/tree-sitter as it can cause issues if there's
a mismatch between the version used for the bundled generated parser
vs the header used to build the library.

Force use of the correct one in the dist tarball at src/tree_sitter/parser.h
instead. Drop the dependency on dev-libs/tree-sitter given we were only depending
on it for this header.

Bug: tree-sitter/tree-sitter-bash#199
Bug: https://bugs.gentoo.org/912716
Signed-off-by: Sam James <sam@gentoo.org>
thesamesam added a commit to thesamesam/gentoo that referenced this issue Aug 26, 2023
There shouldn't be a system-wide copy of parser.h and upstream plan on
dropping this from dev-libs/tree-sitter as it can cause issues if there's
a mismatch between the version used for the bundled generated parser
vs the header used to build the library.

Backport that change to avoid mismatches.

Bug: tree-sitter/tree-sitter-bash#199
Bug: https://bugs.gentoo.org/912716
Signed-off-by: Sam James <sam@gentoo.org>
thesamesam added a commit to thesamesam/gentoo that referenced this issue Aug 26, 2023
There shouldn't be a system-wide copy of parser.h and upstream plan on
dropping this from dev-libs/tree-sitter as it can cause issues if there's
a mismatch between the version used for the bundled generated parser
vs the header used to build the library.

Force use of the correct one in the dist tarball at src/tree_sitter/parser.h
instead. Drop the dependency on dev-libs/tree-sitter given we were only depending
on it for this header.

Bug: tree-sitter/tree-sitter-bash#199
Bug: https://bugs.gentoo.org/912716
Signed-off-by: Sam James <sam@gentoo.org>
thesamesam added a commit to thesamesam/gentoo that referenced this issue Aug 26, 2023
There shouldn't be a system-wide copy of parser.h and upstream plan on
dropping this from dev-libs/tree-sitter as it can cause issues if there's
a mismatch between the version used for the bundled generated parser
vs the header used to build the library.

Backport that change to avoid mismatches.

Bug: tree-sitter/tree-sitter-bash#199
Bug: https://bugs.gentoo.org/912716
Signed-off-by: Sam James <sam@gentoo.org>
thesamesam added a commit to thesamesam/gentoo that referenced this issue Aug 27, 2023
There shouldn't be a system-wide copy of parser.h and upstream plan on
dropping this from dev-libs/tree-sitter as it can cause issues if there's
a mismatch between the version used for the bundled generated parser
vs the header used to build the library.

Force use of the correct one in the dist tarball at src/tree_sitter/parser.h
instead. Drop the dependency on dev-libs/tree-sitter given we were only depending
on it for this header.

Bug: tree-sitter/tree-sitter-bash#199
Bug: https://bugs.gentoo.org/912716
Signed-off-by: Sam James <sam@gentoo.org>
thesamesam added a commit to thesamesam/gentoo that referenced this issue Aug 27, 2023
There shouldn't be a system-wide copy of parser.h and upstream plan on
dropping this from dev-libs/tree-sitter as it can cause issues if there's
a mismatch between the version used for the bundled generated parser
vs the header used to build the library.

Backport that change to avoid mismatches.

Bug: tree-sitter/tree-sitter-bash#199
Bug: https://bugs.gentoo.org/912716
Signed-off-by: Sam James <sam@gentoo.org>
@arthurzam
Copy link
Author

arthurzam commented Aug 27, 2023

@ahlinc @amaanq Thank you for helping us debug it, and for fixing all problems we found. I believe we can continue from here on Gentoo side on packaging tree-sitter-bash. Ideally you tag a new version with updated includes (" instead of <, already on master) and our life will be excellent 😄

Could you please patch Gentoo's emerge file for that until the next Tree-sitter release would happen?

Work on it has been done, we are now at the formalities part until we merge it to downstream.


Now that I can check for regressions in parsing, and we have a huge repository of bash code https://github.com/gentoo/gentoo, I'm starting to find regressions in the parsing, so if you aren't against I'll file bugs for them.

I believe we can close this bug. Thank you for working with us on this bug, and for working on tree-sitter-bash.

gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue Aug 27, 2023
There shouldn't be a system-wide copy of parser.h and upstream plan on
dropping this from dev-libs/tree-sitter as it can cause issues if there's
a mismatch between the version used for the bundled generated parser
vs the header used to build the library.

Force use of the correct one in the dist tarball at src/tree_sitter/parser.h
instead. Drop the dependency on dev-libs/tree-sitter given we were only depending
on it for this header.

Bug: tree-sitter/tree-sitter-bash#199
Bug: https://bugs.gentoo.org/912716
Signed-off-by: Sam James <sam@gentoo.org>
gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue Aug 27, 2023
There shouldn't be a system-wide copy of parser.h and upstream plan on
dropping this from dev-libs/tree-sitter as it can cause issues if there's
a mismatch between the version used for the bundled generated parser
vs the header used to build the library.

Backport that change to avoid mismatches.

Bug: tree-sitter/tree-sitter-bash#199
Bug: https://bugs.gentoo.org/912716
Signed-off-by: Sam James <sam@gentoo.org>
@ahlinc
Copy link
Contributor

ahlinc commented Aug 27, 2023

@thesamesam @arthurzam @MatthewGentoo

I also want to say thank you for Gentoo team for passion and for many details in replies that helped to find real issues and fix them for both projects! 🎉

P.S: We in Tree-sitter team would be happy to know any details about use cases of Tree-sitter in Gentoo project.

@arthurzam
Copy link
Author

P.S: We in Tree-sitter team would be happy to know any details about use cases of Tree-sitter in Gentoo project.

In Gentoo, we define the packages building instructions as ebuild files, which are bash files, inheriting "libraries" called eclass files. So we have a QA tool called pkgcheck whose purpose is to check for errors, mistakes and such.

The tool has 2 types of validation strategy: sourcing the files (many wrappers around just calling it through bash and collecting env vars) and parsing the files (using tree-sitter-bash). As a very minimal example (to not spook you up from all the python code collected in this tool over the years) you can see this PR.

@arthurzam
Copy link
Author

Thank you all for all the fixes and improvements. With the release of 0.20.4, I think this bug is closed.

Many thanks for the great maintainers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants