Improve the performance of running a query in a small range of a large file #2085

Merged
merged 9 commits into from Feb 16, 2023


maxbrunsfeld
Contributor

@maxbrunsfeld maxbrunsfeld commented Feb 15, 2023

Background

querying in a range - Tree queries can be executed on a restricted range of a source file. This is useful for e.g. efficiently syntax-highlighting a screen's worth of code, regardless of how large the underlying file (and syntax tree) are. To impose a range restriction, you use the ts_query_cursor_set_point_range or ts_query_cursor_set_byte_range functions.

The precise semantics of these range restrictions is that we return all matches that intersect the given range. In other words, the matched nodes don't all have to be contained within the given range; some of the nodes in a match may be entirely outside of the range.
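The intersection rule can be sketched in a few lines. This is a minimal model, not Tree-sitter's implementation: the half-open byte ranges and the names `intersects` and `match_is_reported` are assumptions made for illustration.

```python
from typing import List, Tuple

# Half-open byte range [start, end); this convention is an assumption of the sketch.
ByteRange = Tuple[int, int]


def intersects(node: ByteRange, query: ByteRange) -> bool:
    """True if the two half-open ranges share at least one byte."""
    return node[0] < query[1] and query[0] < node[1]


def match_is_reported(match_nodes: List[ByteRange], query: ByteRange) -> bool:
    """A match is kept if any one of its nodes overlaps the query range."""
    return any(intersects(node, query) for node in match_nodes)


# One node of the match overlaps [10, 20); the other lies entirely outside it.
print(match_is_reported([(15, 18), (40, 50)], (10, 20)))  # True
# Every node of the match lies outside the range.
print(match_is_reported([(30, 35), (40, 50)], (10, 20)))  # False
```

The first call shows the case described above: the match is still returned even though one of its nodes is far outside the restricted range.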

non-rooted patterns - Most of the time, the patterns in a tree query have a definite root node. For example, with the pattern (call_expression function: (identifier) @foo), we know that any match must be contained entirely within a call_expression node. However, you can also write patterns like ("{" @open "}" @close), which match two nodes that are siblings, without specifying what their parent is. We call these patterns 'non-rooted'.

wide-branching repetition nodes - When parsing large files, there are often certain nodes with very large numbers of children. These nodes generally correspond to rules in the grammar that use the repeat() rule function to indicate that they can contain arbitrarily many children. Even though repetition nodes conceptually represent a flat sequence of children, Tree-sitter stores them internally in a balanced binary tree that looks like this.

[Screenshot: a repetition node stored internally as a balanced binary tree]

hidden syntax nodes - Many of the nodes in this balanced tree structure are hidden from Tree-sitter's public API. All functions that walk the tree skip over these nodes, so that downstream users are not aware of their existence.
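As a rough model of the two ideas above, the sketch below stores a flat sequence of children in a balanced binary tree of hidden nodes, then flattens it again the way a public tree-walking API would. The `Node` class, the `_repeat` kind, and both function names are hypothetical; Tree-sitter's actual internal representation differs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str
    visible: bool = True
    children: List["Node"] = field(default_factory=list)


def build_repetition(items: List[Node]) -> Node:
    """Group items under a balanced binary tree of hidden nodes."""
    if len(items) <= 2:
        return Node("_repeat", visible=False, children=list(items))
    mid = len(items) // 2
    return Node(
        "_repeat",
        visible=False,
        children=[build_repetition(items[:mid]), build_repetition(items[mid:])],
    )


def visible_children(node: Node) -> List[Node]:
    """Return the children a public API would report, skipping hidden nodes."""
    result: List[Node] = []
    for child in node.children:
        if child.visible:
            result.append(child)
        else:
            # Hidden node: splice its visible descendants in place.
            result.extend(visible_children(child))
    return result


functions = [Node(f"f{i}") for i in range(1, 11)]
source_file = Node("source_file", children=[build_repetition(functions)])
print([n.kind for n in visible_children(source_file)])
```

From the outside, `source_file` appears to have ten direct children in their original order, even though they are stored several levels deep.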

Problem

Previously, the query cursor (the object that executes queries) would use Tree-sitter's public TreeCursor APIs for walking the tree. So it would visit each visible node. When executing a query within a specified range, it would avoid descending into any syntax nodes outside of that range. But if a syntax node did intersect the range, the cursor would descend into that node, and it would visit all of the node's children.

To see why this is a problem, consider this Rust file:

fn f1() {}
fn f2() {}
fn f3() {}
fn f4() {}
fn f5() {}
fn f6() {}
fn f7() {}
fn f8() {}
fn f9() {}
fn f10() {}

The syntax tree for this file would look like this. Notice that the source_file node has a large number of children.

(source_file
  (function_item name: (identifier) parameters: (parameters) body: (block))
  (function_item name: (identifier) parameters: (parameters) body: (block))
  (function_item name: (identifier) parameters: (parameters) body: (block))
  (function_item name: (identifier) parameters: (parameters) body: (block))
  (function_item name: (identifier) parameters: (parameters) body: (block))
  (function_item name: (identifier) parameters: (parameters) body: (block))
  (function_item name: (identifier) parameters: (parameters) body: (block))
  (function_item name: (identifier) parameters: (parameters) body: (block))
  (function_item name: (identifier) parameters: (parameters) body: (block))
  (function_item name: (identifier) parameters: (parameters) body: (block)))

Now suppose we were querying only lines 4-5. The query cursor would descend into the source_file node (because it spans lines 4-5). Then the cursor would visit all of the source_file node's children (every function_item node in the file). It would avoid descending into any of the function_item nodes outside of lines 4-5, but just walking across all of those siblings is still expensive: the work grows with the total number of children, not with the size of the queried range.
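A toy model of that old behavior makes the cost visible. It assumes each top-level node is described only by an inclusive line range; `old_walk` is a hypothetical name, not a Tree-sitter function.

```python
from typing import List, Tuple

LineRange = Tuple[int, int]  # inclusive (start_line, end_line)


def old_walk(children: List[LineRange], query: LineRange) -> Tuple[int, List[LineRange]]:
    """Visit every sibling, but descend only into those that overlap the query."""
    visited = 0
    descended: List[LineRange] = []
    for start, end in children:
        visited += 1  # every sibling is stepped over, in range or not
        if start <= query[1] and query[0] <= end:
            descended.append((start, end))
    return visited, descended


# Ten one-line functions, as in the Rust file above; the query covers lines 4-5.
functions = [(line, line) for line in range(1, 11)]
print(old_walk(functions, (4, 5)))  # (10, [(4, 4), (5, 5)])
```

Only two nodes are descended into, yet all ten are visited. With a million top-level items the visit count would be a million for the same two-line query.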

Solution

This PR changes the way that the query cursor walks the syntax tree, so that we no longer rely on the public TreeCursor API. Instead, we walk the tree directly, hidden nodes included, and at each step we decide whether to descend further or not. In the case of large repetition nodes like the one above, we can avoid descending into any subtree outside of the given range (even if that subtree is rooted at a hidden node).
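The effect of pruning at hidden nodes can be sketched as follows. This is a simplified model under stated assumptions: nodes carry inclusive line ranges, repetitions are stored as a balanced binary tree of hidden `_repeat` nodes, and the names `Node`, `build_repetition`, and `walk_in_range` are all hypothetical rather than taken from the PR.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str
    start: int  # first line, inclusive
    end: int    # last line, inclusive
    visible: bool = True
    children: List["Node"] = field(default_factory=list)


def build_repetition(items: List[Node]) -> Node:
    """Store a flat sequence as a balanced binary tree of hidden nodes."""
    if len(items) == 1:
        return items[0]
    mid = len(items) // 2
    left = build_repetition(items[:mid])
    right = build_repetition(items[mid:])
    return Node("_repeat", left.start, right.end, visible=False, children=[left, right])


def walk_in_range(node: Node, lo: int, hi: int, visited: List[Node]) -> None:
    """Record each node entered, skipping any subtree outside lines lo..hi."""
    visited.append(node)
    for child in node.children:
        if child.end < lo or child.start > hi:
            continue  # prune the whole subtree, hidden or visible
        walk_in_range(child, lo, hi, visited)


# 1024 one-line functions; the query covers lines 4-5.
functions = [Node("function_item", line, line) for line in range(1, 1025)]
source_file = Node("source_file", 1, 1024, children=[build_repetition(functions)])
visited: List[Node] = []
walk_in_range(source_file, 4, 5, visited)
print(len(visited))
print([n.start for n in visited if n.kind == "function_item"])
```

In this model the number of nodes entered grows roughly with the depth of the balanced tree rather than with the number of functions in the file.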

Subtleties

Unfortunately, because of the existence of non-rooted patterns, we can't always avoid descending into hidden subtrees outside of the given range. For example, if your query had a pattern like this (which matches any pair of consecutive functions), we would need to look at functions outside of the range in order to find all of the matches intersecting the range.

(
  (function_item) @fn-1
  .
  (function_item) @fn-2
)
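To make the example concrete, the sketch below enumerates the matches of a "two consecutive siblings" pattern over ten one-line functions and keeps those that intersect lines 4-5. It is an illustrative model only: `adjacent_pair_matches` is a hypothetical helper, and it assumes a match intersects the range when at least one of its two nodes does.

```python
from typing import List, Tuple

Function = Tuple[str, int, int]  # (name, start_line, end_line), lines inclusive


def adjacent_pair_matches(functions: List[Function], lo: int, hi: int) -> List[Tuple[str, str]]:
    """Pairs of consecutive siblings where at least one overlaps lines lo..hi."""

    def in_range(f: Function) -> bool:
        return f[1] <= hi and lo <= f[2]

    return [
        (a[0], b[0])
        for a, b in zip(functions, functions[1:])
        if in_range(a) or in_range(b)
    ]


functions = [(f"f{i}", i, i) for i in range(1, 11)]
print(adjacent_pair_matches(functions, 4, 5))
# [('f3', 'f4'), ('f4', 'f5'), ('f5', 'f6')]
```

The first and last matches each include a function (f3 and f6) that lies entirely outside the range, so a walk that never looked outside lines 4-5 would miss them.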

So in certain cases, we do need to descend into hidden nodes that are entirely outside of the range. The way that I've solved this is to introduce a new step into the up-front analysis of queries.

Now, as part of constructing a query, we look at each non-rooted pattern in the query, and compute (by walking the parse table) the set of repetitions in the grammar that could possibly contain a match for that pattern. Luckily, this analysis is a pretty straightforward extension of some analysis that we were already doing in order to validate that all patterns are possible to match.

Then, at runtime, when deciding whether or not to descend into a syntax node outside of the query's range restriction, we use this information to decide if it's necessary.
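The runtime decision might look roughly like the sketch below. Everything here is an assumption made for illustration: the function name `should_descend`, the use of a set of node kinds to represent the precomputed analysis result, and the example kind names. The actual data structures in the PR are not described in this text.

```python
from typing import FrozenSet


def should_descend(
    node_kind: str,
    node_intersects_range: bool,
    nonlocal_repetitions: FrozenSet[str],
) -> bool:
    """Decide whether the cursor needs to enter a node.

    nonlocal_repetitions stands in for the result of the query-construction
    analysis: the repetitions that could contain a match for some
    non-rooted pattern.
    """
    if node_intersects_range:
        return True
    # Outside the range: enter only if a non-rooted pattern could match here.
    return node_kind in nonlocal_repetitions


# Hypothetical analysis result for a query with a pattern of consecutive function_items.
analysis = frozenset({"source_file_repeat1"})
print(should_descend("source_file_repeat1", False, analysis))  # True
print(should_descend("block_repeat1", False, analysis))        # False
print(should_descend("block_repeat1", True, analysis))         # True
```

For a query with no non-rooted patterns the set would be empty, so every out-of-range subtree could be skipped.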

Tasks

  • Add an API for checking which non-rooted patterns disable range-based tree-walking optimizations
  • Verify that we haven't slowed down query construction

@clason
Contributor

clason commented Feb 25, 2023

This change gives noticeable performance improvements for highlighting problematic files in Neovim. Are you by any chance planning on a release in the next few weeks so this could reach more users?

jamessan added a commit to jamessan/tree-sitter that referenced this pull request Mar 15, 2023
tree-sitter#2085 added the ts_query_is_pattern_non_local API
and its usage in tree-sitter-cli, so bump version accordingly.