Improve the performance of running a query in a small range of a large file #2085

maxbrunsfeld · 2023-02-15T23:37:14Z

Background

querying in a range - Tree queries can be executed on a restricted range of a source file. This is useful for e.g. efficiently syntax-highlighting a screen's worth of code, regardless of how large the underlying file (and syntax tree) are. To impose a range restriction, you use the ts_query_cursor_set_point_range or ts_query_cursor_set_byte_range functions.

The precise semantics of these range restrictions is that we return all matches that intersect the given range. In other words, the matched nodes don't all have to be contained within the given range; some of the nodes in a match may be entirely outside of the range.
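The intersection rule can be illustrated with a small model. This is a sketch under stated assumptions, not Tree-sitter's implementation: `Span`, `intersects`, and `match_intersects` are hypothetical names, and nodes are reduced to half-open byte ranges.

```rust
/// Half-open byte range [start, end). Hypothetical type for illustration.
#[derive(Clone, Copy)]
struct Span {
    start: usize,
    end: usize,
}

/// True if the two half-open ranges share at least one byte.
fn intersects(a: Span, b: Span) -> bool {
    a.start < b.end && b.start < a.end
}

/// Assumed rule: a match is reported when at least one of its nodes
/// intersects the range, even if its other nodes lie entirely outside.
fn match_intersects(nodes: &[Span], range: Span) -> bool {
    nodes.iter().any(|&node| intersects(node, range))
}
```

Under this model, a match whose nodes cover bytes 0..5 and 18..25 is reported for the range 10..20, because the second node overlaps the range even though the first does not.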

non-rooted patterns - Most of the time, the patterns in a tree query have a definite root node. For example, with the pattern (call_expression function: (identifier) @foo), we know that any match must be contained entirely within a call_expression node. However, you can also write patterns like ("{" @open "}" @close), which match two nodes that are siblings, without specifying what their parent is. We call these patterns 'non-rooted'.

wide-branching repetition nodes - When parsing large files, there are often certain nodes with very large numbers of children. These nodes generally correspond to rules in the grammar that use the repeat() rule function to indicate that they can contain arbitrarily many children. Even though repetition nodes conceptually represent a flat sequence of children, Tree-sitter stores them internally as a balanced binary tree.

hidden syntax nodes - Many of the nodes in this balanced tree structure are hidden from Tree-sitter's public API. All functions that walk the tree skip over these nodes, so that downstream users are not aware of their existence.
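The relationship between the flat public view and the hidden balanced structure can be sketched as follows. This is a simplified model, assuming a plain halving strategy; `Subtree`, `build`, `visible_children`, and `depth` are hypothetical names and do not correspond to Tree-sitter's internal types.

```rust
/// Toy subtree: a visible leaf (e.g. one function_item) or a hidden
/// repetition node with exactly two children.
enum Subtree {
    Leaf(usize),
    Hidden(Box<Subtree>, Box<Subtree>),
}

/// Build a balanced binary tree over the items `lo..hi` (requires lo < hi).
fn build(lo: usize, hi: usize) -> Subtree {
    if hi - lo == 1 {
        Subtree::Leaf(lo)
    } else {
        let mid = lo + (hi - lo) / 2;
        Subtree::Hidden(Box::new(build(lo, mid)), Box::new(build(mid, hi)))
    }
}

/// What a public-API walk reports: hidden nodes are skipped, so the
/// parent appears to have one flat list of children.
fn visible_children(tree: &Subtree, out: &mut Vec<usize>) {
    match tree {
        Subtree::Leaf(i) => out.push(*i),
        Subtree::Hidden(left, right) => {
            visible_children(left, out);
            visible_children(right, out);
        }
    }
}

/// Height of the hidden structure (a single leaf has depth 0).
fn depth(tree: &Subtree) -> usize {
    match tree {
        Subtree::Leaf(_) => 0,
        Subtree::Hidden(left, right) => 1 + depth(left).max(depth(right)),
    }
}
```

In this model the visible children still come out as a flat, ordered sequence, while the hidden structure above them is only logarithmically deep in the number of children.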

Problem

Previously, the query cursor (the object that executes queries) would use Tree-sitter's public TreeCursor APIs for walking the tree. So it would visit each visible node. When executing a query within a specified range, it would avoid descending into any syntax nodes outside of that range. But if a syntax node did intersect the range, the cursor would descend into that node, and it would visit all of the node's children.

To see why this is a problem, consider this Rust file:

fn f1() {}
fn f2() {}
fn f3() {}
fn f4() {}
fn f5() {}
fn f6() {}
fn f7() {}
fn f8() {}
fn f9() {}
fn f10() {}

The syntax tree for this file would look like this. Notice that the source_file node has a large number of children.

(source_file
  (function_item name: (identifier) parameters: (parameters) body: (block))
  (function_item name: (identifier) parameters: (parameters) body: (block))
  (function_item name: (identifier) parameters: (parameters) body: (block))
  (function_item name: (identifier) parameters: (parameters) body: (block))
  (function_item name: (identifier) parameters: (parameters) body: (block))
  (function_item name: (identifier) parameters: (parameters) body: (block))
  (function_item name: (identifier) parameters: (parameters) body: (block))
  (function_item name: (identifier) parameters: (parameters) body: (block))
  (function_item name: (identifier) parameters: (parameters) body: (block))
  (function_item name: (identifier) parameters: (parameters) body: (block)))

Now suppose we were querying only on lines 4-5. The query cursor would descend into the source_file node (because it extends across lines 4-5). Then, the cursor would visit all of the source_file node's children (every function_item node in the file). Of course, it would avoid descending into any of the function_item nodes outside of lines 4-5, but it's still expensive just to walk across all of those nodes. And the bigger the file is, the more work it is.
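The cost of the old walk can be modelled by counting the nodes it touches. This is a toy sketch, assuming a tree of visible nodes only and half-open, zero-based row ranges; `Node`, `flat_file`, and `visit_count` are hypothetical names rather than Tree-sitter APIs.

```rust
/// Visible node covering the half-open row range [start_row, end_row).
struct Node {
    start_row: usize,
    end_row: usize,
    children: Vec<Node>,
}

/// A file of `n` one-line functions, as seen through the public API:
/// one root whose children are all direct siblings.
fn flat_file(n: usize) -> Node {
    let children = (0..n)
        .map(|row| Node { start_row: row, end_row: row + 1, children: Vec::new() })
        .collect();
    Node { start_row: 0, end_row: n, children }
}

/// Count the nodes touched when querying rows [lo, hi): every child of an
/// intersecting node is stepped onto, but only intersecting nodes are
/// descended into.
fn visit_count(node: &Node, lo: usize, hi: usize) -> usize {
    let mut count = 1; // stepping onto this node
    let in_range = node.start_row < hi && lo < node.end_row;
    if in_range {
        for child in &node.children {
            count += visit_count(child, lo, hi);
        }
    }
    count
}
```

Querying lines 4-5 of the ten-function file corresponds to `visit_count(&flat_file(10), 3, 5)` in this model. The count grows linearly with the number of top-level items, regardless of how small the queried range is.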

Solution

This PR changes the way that the query cursor walks the syntax tree, so that we no longer rely on the public TreeCursor API. Instead, we walk the tree one node at a time, hidden nodes included, and at each step we decide whether to descend further or not. In the case of large repetition nodes like the one above, we can avoid descending into any subtree outside of the given range (even if that subtree is rooted at a hidden node).
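The effect of pruning at hidden nodes can be sketched by counting visited nodes in a balanced tree. This is a simplified model, not the code in this PR: `Subtree`, `build`, and `visit_count` are hypothetical names, each leaf stands for a one-line function, and every node with children stands for a hidden repetition node.

```rust
/// Node covering the half-open row range [start_row, end_row).
/// Nodes with children model hidden repetition nodes; leaves model
/// visible one-line items.
struct Subtree {
    start_row: usize,
    end_row: usize,
    children: Vec<Subtree>,
}

/// Build a balanced binary tree over rows `lo..hi` (requires lo < hi).
fn build(lo: usize, hi: usize) -> Subtree {
    let children = if hi - lo == 1 {
        Vec::new()
    } else {
        let mid = lo + (hi - lo) / 2;
        vec![build(lo, mid), build(mid, hi)]
    };
    Subtree { start_row: lo, end_row: hi, children }
}

/// Count the nodes touched when querying rows [lo, hi), skipping any
/// subtree (hidden or not) that lies entirely outside the range.
fn visit_count(tree: &Subtree, lo: usize, hi: usize) -> usize {
    let in_range = tree.start_row < hi && lo < tree.end_row;
    if !in_range {
        return 1; // stepped onto the node, then pruned
    }
    1 + tree.children.iter().map(|c| visit_count(c, lo, hi)).sum::<usize>()
}
```

In this model the number of visited nodes grows with the depth of the tree rather than with the number of functions in the file, because at most a few subtrees per level intersect a small range.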

Subtleties

Unfortunately, because of the existence of non-rooted patterns, we can't always avoid descending into hidden subtrees outside of the given range. For example, if your query had a pattern like this (which matches any pair of consecutive functions), we would need to look at functions outside of the range in order to find all of the matches intersecting the range.

(
  (function_item) @fn-1
  .
  (function_item) @fn-2
)
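To see why such a pattern reaches outside the range, here is a toy model of it applied to a flat list of one-line functions. It is only an illustration: `adjacent_pairs_intersecting` is a hypothetical helper, each function is reduced to its zero-based row, and the range is half-open.

```rust
/// Return index pairs (i, i + 1) of adjacent siblings where at least one
/// of the two siblings lies on a row within [lo, hi).
fn adjacent_pairs_intersecting(rows: &[usize], lo: usize, hi: usize) -> Vec<(usize, usize)> {
    let in_range = |row: usize| lo <= row && row < hi;
    rows.windows(2)
        .enumerate()
        .filter(|(_, pair)| in_range(pair[0]) || in_range(pair[1]))
        .map(|(i, _)| (i, i + 1))
        .collect()
}
```

For ten functions on rows 0 through 9 and the row range 3..5, the first and last reported pairs each include a function that sits just outside the range, so a walk that never looked outside the range would miss them.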

So in certain cases, we do need to descend into hidden nodes that are entirely outside of the range. The way that I've solved this is to introduce a new step into the up-front analysis of queries.

Now, as part of constructing a query, we look at each non-rooted pattern in the query, and compute (by walking the parse table) the set of repetitions in the grammar that could possibly contain a match for that pattern. Luckily, this analysis is a pretty straightforward extension of some analysis that we were already doing in order to validate that all patterns are possible to match.

Then, at runtime, when deciding whether or not to descend into a syntax node outside of the query's range restriction, we use this information to decide if it's necessary.
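The runtime decision described above can be sketched roughly as follows. This is a minimal model under stated assumptions, not the code in this PR: `QueryInfo`, `HiddenNode`, `should_descend`, and the field names are all hypothetical, and the precomputed analysis result is represented as a plain set of symbol ids.

```rust
use std::collections::HashSet;

/// Hypothetical symbol id type.
type Symbol = u16;

/// Result of the up-front analysis (hypothetical shape): the repetition
/// symbols that could contain a match for some non-rooted pattern.
struct QueryInfo {
    repetitions_for_rootless_patterns: HashSet<Symbol>,
}

/// A hidden node, reduced to its symbol and half-open byte range.
struct HiddenNode {
    symbol: Symbol,
    start_byte: usize,
    end_byte: usize,
}

/// Descend if the node intersects the range; otherwise descend only when
/// a non-rooted pattern could match inside this kind of repetition.
fn should_descend(query: &QueryInfo, node: &HiddenNode, range_start: usize, range_end: usize) -> bool {
    let intersects = node.start_byte < range_end && range_start < node.end_byte;
    intersects || query.repetitions_for_rootless_patterns.contains(&node.symbol)
}
```

With this shape, a query made up only of rooted patterns has an empty set, so every subtree outside the range is skipped. A query with a non-rooted pattern gives up the optimization only for the repetition symbols that the pattern could actually match within.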

Tasks

Add an API for checking which non-rooted patterns disable range-based tree-walking optimizations
Verify that we haven't slowed down query construction

clason · 2023-02-25T10:44:54Z

This change gives noticeable performance improvements for highlighting problematic files in Neovim. Are you by any chance planning on a release in the next few weeks so this could reach more users?

maxbrunsfeld added 8 commits February 14, 2023 14:41

Add --row-range, --quiet, and --time flags to query subcommand

ff2436a

Precompute the set of repetition symbols that can match rootless patterns

32ce1fc

Group analysis state sets into QueryAnalysis struct

189cf6d

Extract 'internal' versions of tree cursor movement fns that allow visiting hidden nodes

29c9073

Restructure query_cursor_advance to explicitly control which hidden nodes it descends into

fa869cf

Tweak query tests

bd63fb2

Fix bug in maintenance of query cursor's tree depth

40703f1

Add API for checking if a pattern in a query is non-local

837899e

maxbrunsfeld force-pushed the faster-query-in-range branch from 57508ea to 837899e February 16, 2023 19:59

Add unit test for querying within a range of a long top-level repetition

8dcf851

maxbrunsfeld merged commit c51896d into master Feb 16, 2023

maxbrunsfeld deleted the faster-query-in-range branch February 16, 2023 20:26

maxbrunsfeld mentioned this pull request Feb 16, 2023

Fix syntax-related performance problems on gigantic files zed-industries/zed#2182

Merged

the-mikedavis mentioned this pull request Mar 7, 2023

Pin tree-sitter at git master helix-editor/helix#6218

Merged

jamessan mentioned this pull request Mar 15, 2023

cli: Bump tree-sitter dependency to 0.20.10 #1895

Merged

jamessan added a commit to jamessan/tree-sitter that referenced this pull request Mar 15, 2023

cli: Bump tree-sitter dependency to 0.20.10

23faf59

tree-sitter#2085 added the ts_query_is_pattern_non_local API and its usage in tree-sitter-cli, so bump version accordingly.

maxbrunsfeld added a commit that referenced this pull request Jun 27, 2023

Fix false positive query match bug, introduced in #2085

e48773a

amaanq pushed a commit that referenced this pull request Jul 10, 2023

Fix false positive query match bug, introduced in #2085

1a52e75

amaanq pushed a commit that referenced this pull request Jul 10, 2023

Fix false positive query match bug, introduced in #2085

356f682
