Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up parse table representation, use 16 bits for production_id #943

Merged
merged 2 commits into from
Feb 26, 2021

Conversation

maxbrunsfeld
Copy link
Contributor

Fixes #511

Background

Currently, because of the way that Tree-sitter's parser ABI incrementally evolved, the production_id field on parse actions (which is used to identify the sequence of fields and aliases that applies to a given node's children) is represented as a uint8_t. So far, this has not been a problem for any of the grammars that we maintain, but other users have hit the limit, and when it happens, it's very confusing.

Change

This PR changes production_id to a nice uint16_t, which should always be more than large enough.

Unfortunately, because of some shortcomings of the old struct layout, this was not feasible to do without making a breaking change to the parser ABI. So, for the first time in a long time, I have bumped tree-sitter's TREE_SITTER_MIN_COMPATIBLE_LANGUAGE_VERSION constant. This means that new versions of the library will refuse to load parsers that were generated with previous versions of the CLI. Specifically ts_parser_set_language will return false. Users will need to regenerate their parsers.

Normally, this is not a problem: I think that for most use cases, users can easily upgrade the library and their parsers together. There are a few exceptions that I know of:

  1. Atom - At some point, when Atom adopts this version of Tree-sitter, this change will affect Atom users who have installed packages using third-party Tree-sitter parsers. The packages will needto be updated to use a regenerated version of the parsers. /cc @darangi - I don't think Atom plans on upgrading Tree-sitter anytime soon, so this is probably not urgent for you I think?

  2. Neovim - I think that some end-users of Neovim have installed their own third-party parsers. When Neovim upgrades to the latest version of Tree-sitter, Neovim will refuse to load these older parsers. The end users will need to install new versions of the parsers. /cc @bfredl @vigoux @theHamsta @tjdevries

  3. Emacs-Tree-Sitter - /cc @ubolonton I'm not sure if users can install their own parsers. If so, then this is something to be aware of.

Details

Since an ABI change was already needed, I took this opportunity to fix some dumb aspects of the current ABI. There are no more bitfields (they were used unnecessarily before). Also, I've added one more field to the TSLanguage struct - production_id_count, which makes it so that all the arrays on TSLanguage can have their length known at runtime.

Also remove the use of bitfields from the parse table format.
In all cases, bitfields were not necessary to achieve the
current binary sizes. Avoiding them makes the binaries more
portable.

There was no way to make this change backward-compatible,
so we have finally dropped support for parsers generated
with an earlier version of Tree-sitter.

At some point, when Atom adopts this version of Tree-sitter,
this change will affect Atom users who have installed packages
using third-party Tree-sitter parsers. The packages will need
to be updated to use a regenerated version of the parsers.
I think this is the last additional field that's needed so
that every array member of TSLanguage has a length that
can be calculated at runtime.
@ubolonton
Copy link
Contributor

Most Emacs users of tree-sitter download pre-compiled parsers distributed by the Emacs package tree-sitter-langs, so this shouldn't be too much of an issue.

maxbrunsfeld added a commit that referenced this pull request Mar 1, 2021
Due to the breaking ABI change in #943, this is our chance
to reorder the fields in a more logical way.
qjoner

This comment was marked as spam.

@dundargoc dundargoc added this to the 1.0 milestone Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tree-sitter generates invalid code for grammars with large numbers of fields and/or aliases
4 participants