Improve the performance of running a query in a small range of a large file #2085
Merged
…siting hidden nodes
…odes it descends into
maxbrunsfeld force-pushed the faster-query-in-range branch from 57508ea to 837899e on February 16, 2023 19:59
This change gives noticeable performance improvements for highlighting problematic files in Neovim. Are you by any chance planning on a release in the next few weeks so this could reach more users?
jamessan added a commit to jamessan/tree-sitter that referenced this pull request on Mar 15, 2023
tree-sitter#2085 added the ts_query_is_pattern_non_local API and its usage in tree-sitter-cli, so bump version accordingly.
maxbrunsfeld added a commit that referenced this pull request on Jun 27, 2023
amaanq pushed a commit that referenced this pull request on Jul 10, 2023
amaanq pushed a commit that referenced this pull request on Jul 10, 2023
Background
querying in a range - Tree queries can be executed on a restricted range of a source file. This is useful for e.g. efficiently syntax-highlighting a screen's worth of code, regardless of how large the underlying file (and syntax tree) are. To impose a range restriction, you use the `ts_query_cursor_set_point_range` or `ts_query_cursor_set_byte_range` functions.
The precise semantics of these range restrictions is that we return all matches that intersect the given range. In other words, the matched nodes don't all have to be contained within the given range; some of the nodes in a match may be entirely outside of the range.
non-rooted patterns - Most of the time, the patterns in a tree query have a definite root node. For example, with the pattern `(call_expression function: (identifier) @foo)`, we know that any match must be contained entirely within a `call_expression` node. However, you can also write patterns like `("{" @open "}" @close)`, which match two nodes that are siblings, without specifying what their parent is. We call these patterns 'non-rooted'.
wide-branching repetition nodes - When parsing large files, there are often certain nodes with very large numbers of children. These nodes generally correspond to rules in the grammar that use the `repeat()` rule function to indicate that they can contain arbitrarily many children. Even though repetition nodes conceptually represent a flat sequence of children, Tree-sitter stores them internally in a balanced binary tree.
hidden syntax nodes - Many of the nodes in this balanced tree structure are hidden from Tree-sitter's public API. All functions that walk the tree skip over these nodes, so that downstream users are not aware of their existence.
Problem
Previously, the query cursor (the object that executes queries) would use Tree-sitter's public `TreeCursor` APIs for walking the tree. So it would visit each visible node. When executing a query within a specified range, it would avoid descending into any syntax nodes outside of that range. But if a syntax node did intersect the range, the cursor would descend into that node, and it would visit all of the node's children.
To see why this is a problem, consider this Rust file:
The syntax tree for this file would look like this. Notice that the `source_file` node has a large number of children.
Now suppose we were querying only on lines 4-5. The query cursor would descend into the `source_file` node (because it extends across lines 4-5). Then, the cursor would visit all of the `source_file` node's children (every `function_item` node in the file). Of course, it would avoid descending into any of the `function_item` nodes outside of lines 4-5, but it's still expensive just to walk across all of those nodes. And the bigger the file is, the more work it is.
Solution
This PR changes the way that the query cursor walks the syntax tree, so that we no longer rely on the public `TreeCursor` API. Instead, we explicitly visit every node in the tree, including the hidden nodes. And at each step, we decide whether to descend further or not. In the case of large repetition nodes like the one above, we can avoid descending into any subtree outside of the given range (even if that subtree is rooted at a hidden node).
Subtleties
Unfortunately, because of the existence of non-rooted patterns, we can't always avoid descending into hidden subtrees outside of the given range. For example, if your query had a pattern like this (which matches any pair of consecutive functions), we would need to look at functions outside of the range in order to find all of the matches intersecting the range.
So in certain cases, we do need to descend into hidden nodes that are entirely outside of the range. The way that I've solved this is to introduce a new step into the up-front analysis of queries.
Now, as part of constructing a query, we look at each non-rooted pattern in the query, and compute (by walking the parse table) the set of repetitions in the grammar that could possibly contain a match for that pattern. Luckily, this analysis is a pretty straightforward extension of some analysis that we were already doing in order to validate that all patterns are possible to match.
Then, at runtime, when deciding whether or not to descend into a syntax node outside of the query's range restriction, we use this information to decide if it's necessary.
Tasks