Introduce the 'Tree query' - an API for pattern-matching on syntax trees #444
This pull request adds a new data type to the Tree-sitter C library:
Many code analysis tasks involve searching for patterns in syntax trees. Some of these analysis tasks are very common, and it'd be nice to avoid implementing them multiple times. Examples of some common tasks you might want to perform with a Tree-sitter syntax tree:
The Tree-sitter C library is used from several languages, so in order for these analyses to be reusable, they have to be specified in a way that doesn't depend on any particular high-level language's runtime.
But these solutions have some major drawbacks:
The Query Language
Instead of using CSS, compiled into a DFA ahead-of-time, the new
To select all of the methods, together with the name of the class:
(class_declaration name: (identifier) @the-class-name body: (class_body (method_definition name: (property_identifier) @the-method-name)))
To select variables to which functions or arrow functions are assigned:
(assignment_expression (identifier) @function-name (function)) (assignment_expression (identifier) @function-name (arrow_function))
To select all null checks:
(binary_expression left: (*) @null-checked-object operator: "==" right: (null))
The annotations that start with
With the exception of captures, the query language is identical to the format in which Tree-sitter's unit tests are written: S-expressions.
Static Verification of Queries
When a query is instantiated, all of the node names and field names are transformed into integer ids, and an error is raised if any of the names are not actually defined in the grammar.
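As a rough illustration of this kind of check, here is a minimal sketch in C. It is not the code from this pull request: `NameTable`, `name_table_lookup`, and `resolve_names` are hypothetical names, and a plain array of strings stands in for the grammar's real symbol and field tables.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the names a grammar defines.
 * The position of a name in the array serves as its integer id. */
typedef struct {
  const char *const *names;
  size_t count;
} NameTable;

/* Returns the integer id of `name`, or -1 if the grammar does not define it. */
int name_table_lookup(const NameTable *table, const char *name) {
  assert(table != NULL && name != NULL);
  for (size_t i = 0; i < table->count; i++) {
    if (strcmp(table->names[i], name) == 0) return (int)i;
  }
  return -1;
}

/* Resolves every name used by a query into `ids_out`.
 * Returns -1 on success, or the index of the first unknown name,
 * so the caller can report exactly which name was rejected. */
int resolve_names(const NameTable *table, const char *const *names,
                  size_t name_count, int *ids_out) {
  for (size_t i = 0; i < name_count; i++) {
    int id = name_table_lookup(table, names[i]);
    if (id < 0) return (int)i;
    ids_out[i] = id;
  }
  return -1;
}
```

Because resolution happens once, when the query is created, a misspelled node or field name is reported before any tree is searched.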
Because the queries are so easy to parse, it would be easy to add an even more thorough check that uses the
Trying it Out
You can write and execute queries interactively in the web UI, both on the docs site and via the
This binary search implementation differs from Rust's `slice::binary_search_by` method in how it deals with ties. In Rust's implementation:
> If there are multiple matches, then any one of the matches
> could be returned.
This implementation needs to return the index of the *first* match.
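For readers unfamiliar with the distinction, a first-match binary search can be sketched as follows. This is an illustrative version over a sorted `int` array, not the code from the commit; the function name `binary_search_first` and the convention of returning `len` for "not found" are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the FIRST element equal to `target` in the sorted
 * array `items`, or `len` if no element matches. */
size_t binary_search_first(const int *items, size_t len, int target) {
  assert(items != NULL || len == 0);
  size_t lo = 0, hi = len;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (items[mid] < target) {
      lo = mid + 1;
    } else {
      /* On an exact match, keep narrowing to the left instead of
       * returning immediately; this is what makes ties resolve to
       * the first matching index. */
      hi = mid;
    }
  }
  return (lo < len && items[lo] == target) ? lo : len;
}
```

The only difference from a textbook binary search is the absence of an early return on equality, so the loop always converges on the leftmost candidate.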
* Rename TSQueryContext -> TSQueryCursor
* Remove the permanent association between the cursor and its query. The cursor can now be used again for a different query.
@maxbrunsfeld Very nice, this seems like it will simplify the implementation of tree-sitter highlighting in neovim a lot (which will be a priority again after the imminent 0.4 release).
I have one question though. The property sheet cursor in
This PR looks like it implements a separate cursor type with much more limited cursor movement (only forward-advancement, though still skipping irrelevant sub-trees). Would it be possible to support updating this new NFA on top of the unrestricted TSTreeCursor, as with the property sheets? It would simplify the API if we only supported one fully-featured cursor type. Or do the new query semantics (perhaps conditioning on future siblings and such) make it impossible or hard to support a random-access cursor?
@turbo It is a work in progress: https://github.com/neovim/neovim/pull/10124/files#diff-5ee91216d7ccd53f980bbf60ffcc29a6 . Eventually we might separate out the parts that are not specific to neovim as a separate library. Though we make some simplifying assumptions (like all trees accessible to lua are treated as immutable) that might not be appropriate for completely general bindings.
@maxbrunsfeld This is great. I'm maintaining an internal parsing toolkit similar to tree-sitter and was just about to start implementing a query language for static analysis purposes (security, code duplication etc). So now I can aim for compatibility with this query format, too. Though in my case, the query would be run non-realtime against a previously serialized, immutable tree-sitter syntax tree.
@bfredl Thanks! For the most part, that fits my use case. I'll try to play around with that soon. Our whole system is a mix of Rust and Lua, so I should be able to make it work either way.
Yeah, queries can capture nodes based on patterns that match against their later siblings, so the process of identifying captures isn't as 'local' as it was with the property sheet system.
The result is that
For Lua, I might expose the
In that design,
@maxbrunsfeld Thanks, that makes sense. Our plan is to first implement fine-grained Lua API calls and then profile that solution. If it is too slow, there would be multiple ways forward, including doing baseline highlighting (everything that can be expressed with queries alone, which is now most things, I guess) in C only, and using the Lua API only for specific additional highlighting.
@ubolonton You mentioned that you're working on syntax highlighting in Emacs. My goal with this feature is to replace the current Property-sheet system for syntax highlighting. I think that because this feature is totally implemented in the C library, it will make it easier to share more of the implementation between different applications.
@bfredl @ubolonton I would like to get the design of this right for my use case at GitHub as well as both of your use cases, since NeoVim and Emacs are important ecosystems. Some notes about how this system differs from property sheets:
However, there are still some issues to figure out:
Also pinging @bryphe since you have started work on syntax-highlighting with Tree-sitter in Onivim 2. I saw that you are supporting Atom's format for specifying highlighting. This makes total sense, but I want to give you a heads up that I am pursuing this different system for syntax highlighting that's built into Tree-sitter itself. Since you're not yet saddled with backward-compatibility concerns like Atom is, you may want to target this one instead, even though it is still being fleshed out. Sorry for any confusion around this.
I think this is the correct baseline solution. We wouldn't want to create highlighting rules from scratch for all languages that tree-sitter supports. Though I think we (neovim) should allow both actually. As queries can be parsed at runtime, some users will want to add their own matches/captures on top of the baseline rules, which can use vim highlight group names directly. As canonical tree-sitter capture names use
I will need to experiment a bit with this in practice, but I think we can handle this ourselves, as long as the matches are ordered by their start position. I.e., if we see a match on line 197, we know that line 196 is "done", and we can flush that line to the terminal/UI. But if a match on line 197 has a capture on line 200, that should be no problem: we can just stash it in some temporary storage until the renderer has reached line 200.
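The flush-and-stash idea described above can be sketched roughly as below. This is only an illustration under stated assumptions, not neovim's or Tree-sitter's code: `LineStash`, its functions, and the fixed `MAX_LINES` capacity are all hypothetical, and "flushing" is modeled by counting captures rather than drawing anything.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LINES 256 /* hypothetical fixed capacity, chosen for the sketch */

typedef struct {
  unsigned stashed[MAX_LINES]; /* captures waiting on each line */
  size_t next_line;            /* first line that has not been flushed */
  size_t flushed_captures;     /* captures already handed to the renderer */
} LineStash;

void line_stash_init(LineStash *s) {
  memset(s, 0, sizeof *s);
}

/* Flushes every line before `line`. Once a match starting on `line` has
 * been seen, no later match can add captures to the earlier lines. */
void line_stash_flush_before(LineStash *s, size_t line) {
  if (line > MAX_LINES) line = MAX_LINES;
  while (s->next_line < line) {
    s->flushed_captures += s->stashed[s->next_line];
    s->stashed[s->next_line] = 0;
    s->next_line++;
  }
}

/* Handles one match that starts on `match_line`; its captures may sit on
 * that line or on later ones, where they wait until those lines flush. */
void line_stash_add_match(LineStash *s, size_t match_line,
                          const size_t *capture_lines, size_t count) {
  assert(match_line >= s->next_line); /* matches arrive in start order */
  line_stash_flush_before(s, match_line);
  for (size_t i = 0; i < count; i++) {
    assert(capture_lines[i] >= match_line && capture_lines[i] < MAX_LINES);
    s->stashed[capture_lines[i]]++;
  }
}
```

With this shape, a match on line 197 that captures something on line 200 simply leaves one entry stashed for line 200 while lines up to 196 are flushed; everything hinges on matches being delivered in order of their start position.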
Yeah, this is a good point. I'm going to need to add or tweak APIs to make this possible. I will do that before merging.
This is absolutely fantastic, thank you for building awesome stuff. This has been my largest external obstacle to using Tree-sitter in practice. Later this year I'd love to work on contributing some tools to make it easy for theme makers to take advantage of this. I'd also be interested in getting people together to agree on a mostly-universal naming convention for capture names.
The only piece of concern I have is for the capture names though, like