Introduce the 'Tree query' - an API for pattern-matching on syntax trees #444

Merged
merged 34 commits into from Sep 19, 2019


@maxbrunsfeld
Member

maxbrunsfeld commented Sep 12, 2019

Overview

This pull request adds a new data type to the Tree-sitter C library: TSQuery. A query represents one or more patterns of nodes in a syntax tree. You can instantiate a query from a series of S-expressions (similar to those used in Tree-sitter's unit testing system). You can then execute the query on a syntax tree, which lets you efficiently iterate over all of the occurrences of the patterns in the tree. This works in C, Rust, and JavaScript (via Wasm).

Background

Many code analysis tasks involve searching for patterns in syntax trees. Some of these analysis tasks are very common, and it'd be nice to avoid implementing them multiple times. Examples of some common tasks you might want to perform with a Tree-sitter syntax tree:

  • Computing syntax highlighting
  • Computing code-folding regions
  • Finding nested documents to parse separately (JavaScript within HTML, Ruby within ERB, etc)

The Tree-sitter C library is used from several languages, so in order for these analyses to be reusable, they have to be specified in a way that doesn't depend on any particular high-level language's runtime.

Prior Solutions

  • In #204, I added a system called property sheets, which let you use CSS to select syntax nodes and assign properties to them. It was based on a compile-time step that converted the .css files into finite state machines encoded as JSON.
  • In #283, I added a Rust implementation of syntax highlighting based on these property sheets.

But these solutions have some major drawbacks:

  1. It's awkward to have to compile the CSS files into JSON files ahead of time. It means that the JSON files need to be checked into the git repository (or somewhere) and applications can't easily use the feature for other purposes. Also, the generated JSON files are larger than the source CSS, which is bad for front-end use of Tree-sitter.

  2. We could switch to processing the CSS at runtime, but the CLI currently relies on a Rust library for parsing the CSS, and a bunch of Rust code for transforming it into a DFA. It would be hard to consolidate this into the core C library, and it would bloat the library.

  3. Most pressingly, CSS has limited expressive power. It's awkward to select nodes based on details of their siblings, which turns out to be important for certain syntax highlighting tasks. For example, in this JavaScript code:

    var foo = function(a) { /* ... */ }

    The variable foo is normally highlighted as a function. The current tree-sitter-highlight crate can't do this, because properties of nodes can't depend on their siblings like that.

The Query Language

Instead of using CSS, compiled into a DFA ahead-of-time, the new TSQuery API uses S-expressions, compiled into an NFA at runtime. Here are some examples of what the query language currently looks like.

To select all of the methods, together with the name of the class:

(class_declaration
  name: (identifier) @the-class-name
  body: (class_body
    (method_definition
      name: (property_identifier) @the-method-name)))

To select variables to which functions or arrow functions are assigned:

(assignment_expression
  (identifier) @function-name
  (function))

(assignment_expression
  (identifier) @function-name
  (arrow_function))

To select all null checks:

(binary_expression
  left: (*) @null-checked-object
  operator: "=="
  right: (null))

The annotations that start with @, like @the-class-name, are called captures. They identify which nodes should be returned from the query, and what name should be associated with each node when it is returned.

With the exception of captures, the query language is identical to the format in which Tree-sitter's unit tests are written: S-expressions.
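To make the role of captures concrete, here is a minimal Python sketch that lists the capture names appearing in a query string. It is only an illustration: `capture_names` and the tokenizer regex are hypothetical helpers, not part of the Tree-sitter API, and the sketch does not validate the query in any way.

```python
import re

# Hypothetical tokenizer: string literals, parentheses, or bare words.
# This is NOT Tree-sitter's query parser, just enough to find captures.
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|[()]|[^\s()"]+')


def capture_names(query):
    """Return the distinct capture names in a query, in order of appearance."""
    names = []
    for token in TOKEN.findall(query):
        # A capture is a bare token that starts with "@".
        if token.startswith("@") and len(token) > 1:
            name = token[1:]
            if name not in names:
                names.append(name)
    return names


QUERY = """
(class_declaration
  name: (identifier) @the-class-name
  body: (class_body
    (method_definition
      name: (property_identifier) @the-method-name)))
"""

print(capture_names(QUERY))  # ['the-class-name', 'the-method-name']
```

String literals such as "==" are tokenized as a whole, so an "@" inside a quoted string is not mistaken for a capture.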

Static Verification of Queries

When a query is instantiated, all of the node names and field names are transformed into integer ids, and an error is raised if any of the names are not actually defined in the grammar.
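The idea of resolving names to integer ids at instantiation time can be sketched roughly as follows. The symbol tables, the `QueryError` class, and the `resolve` function are all hypothetical names made up for this illustration; the C library's actual data structures and error reporting are not shown in this thread.

```python
# Hypothetical symbol tables; a real grammar defines many more names.
NODE_IDS = {"identifier": 1, "function": 2, "assignment_expression": 3}
FIELD_IDS = {"left": 1, "right": 2}


class QueryError(ValueError):
    """Raised when a query mentions a name the grammar does not define."""


def resolve(names, table, kind):
    """Map each name to its integer id, failing on the first unknown name."""
    ids = []
    for name in names:
        if name not in table:
            raise QueryError(f"unknown {kind} name: {name!r}")
        ids.append(table[name])
    return ids


print(resolve(["assignment_expression", "identifier"], NODE_IDS, "node"))
print(resolve(["left"], FIELD_IDS, "field"))
```

Because the check happens once, when the query is created, a misspelled node or field name is reported immediately rather than silently matching nothing at execution time.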

Because the queries are so easy to parse, it would be easy to add an even more thorough check that uses the node-types.json to check that all of the parent-child relationships are valid, and could actually occur. I'll implement this in a follow-up PR. My plan is that reusable queries could be stored in a top-level queries folder in the grammar repositories, and the tree-sitter test command could be augmented to check the validity of the queries using the existing node type information.

Trying it Out

You can write and execute queries interactively in the web UI, both on the docs site and via the tree-sitter web-ui command in your own grammar repos.

[Image: playground-queries]

maxbrunsfeld added 17 commits Sep 9, 2019
This binary search implementation differs from Rust's
`slice::binary_search_by` method in how it deals with ties.

In Rust's implementation:

> If there are multiple matches, then any one of the matches
> could be returned.

This implementation needs to return the index of the *first* match.
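
The tie-handling requirement in the commit message above can be illustrated with a short Python sketch of a "leftmost match" binary search. This is not the code from the commit; `first_match` and its comparator convention are assumptions made for the example, as is returning `None` when nothing matches.

```python
def first_match(items, compare):
    """Return the index of the FIRST item for which compare(item) == 0.

    `items` must be sorted consistently with `compare`, which returns a
    negative number if the item sorts before the target, zero if it
    matches, and a positive number if it sorts after. Returns None if
    no item matches (an assumption of this sketch).
    """
    lo, hi = 0, len(items)
    while lo < hi:
        mid = (lo + hi) // 2
        if compare(items[mid]) < 0:
            lo = mid + 1  # the first match, if any, is to the right
        else:
            hi = mid  # keep narrowing left, even when items[mid] matches
    if lo < len(items) and compare(items[lo]) == 0:
        return lo
    return None


values = [1, 3, 3, 3, 7]
print(first_match(values, lambda v: v - 3))  # 1
print(first_match(values, lambda v: v - 4))  # None
```

The key difference from a search that may return any match is that the loop does not stop when it finds an equal element; it keeps shrinking the upper bound until the leftmost one is isolated.
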
* Rename TSQueryContext -> TSQueryCursor
* Remove the permanent association between the cursor and its query. The 
cursor can now be used again for a different query.
@bfredl

Contributor

bfredl commented Sep 13, 2019

@maxbrunsfeld Very nice, this seems like it will simplify the implementation of tree-sitter highlighting in neovim a lot (which will be a priority again after the imminent 0.4 release).

I have one question though. The property sheet cursor in lib.rs, which I partially reimplemented in our Lua bindings, implemented the DFA execution on top of the random-access cursor, so it supported the full set of movement operations of a "raw" TSTreeCursor and updated the DFA state accordingly. I.e., we only needed to wrap one cursor API, which could optionally have a property sheet assigned to it.

This PR looks like it implements a separate cursor type with much more limited cursor movement (forward advancement only, though still skipping irrelevant subtrees). Would it be possible to support updating this new NFA on top of the unrestricted TSTreeCursor, as with the property sheets? It would simplify the API if we only supported one fully featured cursor type. Or do the new query semantics (perhaps conditioning on future siblings and such) make it impossible or hard to support a random-access cursor?

@turbo


turbo commented Sep 13, 2019

@bfredl You have Lua bindings for tree-sitter? Are they available somewhere?

@bfredl

Contributor

bfredl commented Sep 13, 2019

@turbo It is a work in progress: https://github.com/neovim/neovim/pull/10124/files#diff-5ee91216d7ccd53f980bbf60ffcc29a6 . Eventually we might separate out the parts that are not specific to neovim as a separate library. Though we make some simplifying assumptions (like all trees accessible to Lua being treated as immutable) that might not be appropriate for completely general bindings.

@turbo


turbo commented Sep 13, 2019

@maxbrunsfeld This is great. I'm maintaining an internal parsing toolkit similar to tree-sitter and was just about to start implementing a query language for static analysis purposes (security, code duplication etc). So now I can aim for compatibility with this query format, too. Though in my case, the query would be run non-realtime against a previously serialized, immutable tree-sitter syntax tree.

@bfredl Thanks! For the most part, that fits my use case. I'll try to play around with that soon. Our whole system is a mix of Rust and Lua, so I should be able to make it work either way.

@maxbrunsfeld

Member Author

maxbrunsfeld commented Sep 13, 2019

Or do the new query semantics (perhaps conditioning on future siblings and such) make it impossible or hard to support a random-access cursor?

Yeah, queries can capture nodes based on patterns that match against their later siblings, so the process of identifying captures isn't as 'local' as it was with the property sheet system.

The result is that TSQueryCursor has a more limited API than TSTreeCursor, but I actually think it's an easier API to use. It's more like a classic iterator conceptually: a simple sequence of values that you can lazily consume. In Rust, we expose it using the Iterator trait.

For Lua, I might expose the TSQueryCursor as a simple iterator.

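-- `end` is a reserved word in Lua, so the range endpoints need other names.
local start_point = {row = 30, column = 0}
local end_point = {row = 55, column = 0}

for i, match in query.exec(tree, start_point, end_point) do
  print(match.pattern_index)
  -- captures are keyed by name, so iterate with pairs rather than ipairs
  for capture_name, node in pairs(match.captures) do
    print(capture_name, node)
  end
end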

In that design, TSQueryCursor itself isn't even exposed directly to Lua. The downside of that is there is a small cost to instantiating and deleting a TSQueryCursor. In the JavaScript WASM bindings, since there's only one thread, I just maintain a global TSQueryCursor instance, and reuse it any time Query.exec is called. I'm not sure if that would be good for NeoVim.

Another binding idea - In JavaScript/WASM, FFI is more costly than allocating arrays, so I don't even expose the iterator interface. Instead, exec just returns an array of matches. I'm not sure if the same applies to Lua though.

@maxbrunsfeld

Member Author

maxbrunsfeld commented Sep 13, 2019

in my case, the query would be run non-realtime against a previously serialized, immutable tree-sitter syntax tree.

@turbo That's interesting. Is that project open source by chance? I'd be curious to check it out.

@bfredl

Contributor

bfredl commented Sep 13, 2019

@maxbrunsfeld Thanks, that makes sense. Our plan is to first implement fine-grained lua API calls and then profile that solution. If it is too slow there would be multiple ways forward, including to do baseline highlighting (everything that can be expressed as queries only, which now is most things I guess) in C only, and use lua API for specific additional highlighting only.

@maxbrunsfeld

Member Author

maxbrunsfeld commented Sep 13, 2019

@ubolonton You mentioned that you're working on syntax highlighting in Emacs. My goal with this feature is to replace the current Property-sheet system for syntax highlighting. I think that because this feature is totally implemented in the C library, it will make it easier to share more of the implementation between different applications.

@bfredl @ubolonton I would like to get the design of this right for my use case at GitHub as well as both of your use cases, since NeoVim and Emacs are important ecosystems. Some notes about how this system differs from property sheets:

  1. There is no fixed enum containing all of the valid highlight names. The idea is that the @capture names can be arbitrary strings that you can use as highlighting identifiers.
  2. This is part of the core C library, not a separate Rust library.
  3. This already works with existing trees, instead of operating on the source code.
  4. You can already use this to search only the visible lines.
  5. The queries can be automatically checked for validity against the grammar.

However, there are still some issues to figure out:

  1. Different applications have different naming conventions for highlights. I see two ways of handling this.

    • My preference is that we would still share the queries, but each dependent app would apply a mapping from Tree-sitter's highlight names into the application's native highlighting names. In this design, there would be a queries/highlightquery file checked into each Tree-sitter grammar repo. These files would all use consistent capture names like @function.builtin, @variable.parameter, @number.float, etc. Emacs and Vim could use these queries directly, but would have to maintain mappings for these highlight names.
    • Alternatively, Vim and Emacs could just use their own highlight queries that use the applications' native highlight names directly. This is at least easier now than with Property Sheets, because queries are compiled from source at runtime, so you wouldn't have to store the generated JSON file.
  2. EDIT - this is no longer true. See below. The query API isn't highly tailored toward syntax highlighting. Currently, it lets you iterate over the matches, where each match can have multiple captures. Matches can overlap, so individual captures are not guaranteed to appear in order. For example, in this JavaScript query:

    (object
      (pair
        key: (identifier) @method.def
        value: (function)))
    
    ":" @delimiter.pair

    With this code:

    x = {foo: function() {}}

    The delimiter.pair capture for : would be returned before the method.def capture for foo, even though foo appears first in the code.

    For syntax highlighting purposes, we'd want to iterate over the individual captures (instead of the matches), in the order that the captures appear. So a little bit of post-processing logic is required. Right now, I'm not sure if this should be implemented as part of the C TSQuery API, or if it should be left to applications to implement this part.
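
    The post-processing step described here could look roughly like the Python sketch below, written as an application might do it on its own. The `Capture` tuple and `captures_in_order` function are hypothetical, the positions are hand-written for the `x = {foo: function() {}}` example, and breaking ties by end position is an arbitrary choice of this sketch.

```python
from typing import NamedTuple


class Capture(NamedTuple):
    name: str
    start: tuple  # (row, column)
    end: tuple    # (row, column)


def captures_in_order(matches):
    """Flatten matches (lists of captures) and sort captures by position."""
    flat = [capture for match in matches for capture in match]
    return sorted(flat, key=lambda c: (c.start, c.end))


# Matches in the order described above: the ":" match is returned first,
# even though `foo` appears earlier in the source text.
matches = [
    [Capture("delimiter.pair", (0, 8), (0, 9))],
    [Capture("method.def", (0, 5), (0, 8))],
]

for capture in captures_in_order(matches):
    print(capture.name, capture.start)
```

    After sorting, method.def for foo comes before delimiter.pair for the colon, which is the order a highlighter walking the text from left to right would want.
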

Also pinging @bryphe since you have started work on syntax-highlighting with Tree-sitter in Onivim 2. I saw that you are supporting Atom's format for specifying highlighting. This makes total sense, but I want to give you a heads up that I am pursuing this different system for syntax highlighting that's built into Tree-sitter itself. Since you're not yet saddled with backward-compatibility concerns like Atom is, you may want to target this one instead, even though it is still being fleshed out. Sorry for any confusion around this.

@bfredl

Contributor

bfredl commented Sep 13, 2019

My preference is that we would still share the queries, but each dependent app would apply a mapping from Tree-sitter's highlight names into the application's native highlighting names. In this design, there would be a queries/highlightquery file checked into each Tree-sitter grammar repo. These files would all use consistent capture names like @function.builtin, @variable.parameter, @number.float, etc. Emacs and Vim could use these queries directly, but would have to maintain mappings for these highlight names.

I think this is the correct baseline solution. We wouldn't want to create highlighting rules from scratch for all languages that tree-sitter supports. Though I think we (neovim) should allow both actually. As queries can be parsed at runtime, some users will want to add their own matches/captures on top of the baseline rules, which can use vim highlight group names directly. As canonical tree-sitter capture names use lower.case and vim highlight groups are UpperCase, the intention of any rule should be unambiguous.
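
A rough Python sketch of this mapping idea follows. The table contents are only illustrative (the Vim group names are standard ones, but which capture maps to which group is a guess), and `highlight_group` is a hypothetical helper. Falling back from a dotted name to its shorter prefixes is an extra assumption of the sketch that nobody in this thread proposed.

```python
# Illustrative mapping from shared capture names to Vim highlight groups.
HIGHLIGHT_MAP = {
    "function": "Function",
    "function.builtin": "Special",
    "variable.parameter": "Identifier",
    "number": "Number",
    "number.float": "Float",
}


def highlight_group(capture_name):
    """Resolve a capture name to a native highlight group, or None."""
    # UpperCase names are taken to be native group names used directly
    # by user-written rules, so they pass through unchanged.
    if capture_name[:1].isupper():
        return capture_name
    # Otherwise try the full lower.case name, then shorter dotted prefixes
    # (an assumption of this sketch).
    parts = capture_name.split(".")
    while parts:
        group = HIGHLIGHT_MAP.get(".".join(parts))
        if group is not None:
            return group
        parts.pop()
    return None


print(highlight_group("number.float"))     # Float
print(highlight_group("function.method"))  # Function
print(highlight_group("Comment"))          # Comment
```
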

For syntax highlighting purposes, we'd want to iterate over the individual captures (instead of the matches), in the order that the captures appear. So a little bit of post-processing logic is required. Right now, I'm not sure if this should be implemented as part of the C TSQuery API, or if it should be left to applications to implement this part.

I will need to experiment a bit with this in practice, but I think we can handle this ourselves, as long as the matches are ordered by their start position. I.e., if we see a match on line 197, we know that line 196 is "done", and we can flush that line to the terminal/UI. But if a match on line 197 has a capture on line 200, that should be no problem: we can just stash it in some temporary storage until the renderer has reached line 200.
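
The stash-and-flush approach could be sketched in Python like this. It assumes matches arrive ordered by start row and that every capture lies at or after its match's start row; `flush_by_line` and the tuple layout are hypothetical and not taken from the neovim implementation.

```python
import heapq


def flush_by_line(matches):
    """Yield (row, name) captures in non-decreasing row order.

    `matches` is an iterable of (start_row, [(row, name), ...]) pairs,
    ordered by start_row. Captures on later rows are stashed in a heap
    until no future match can start before them.
    """
    pending = []  # min-heap of (row, sequence number, name)
    sequence = 0
    for start_row, captures in matches:
        # Every row before this match's start row is "done".
        while pending and pending[0][0] < start_row:
            row, _, name = heapq.heappop(pending)
            yield row, name
        for row, name in captures:
            heapq.heappush(pending, (row, sequence, name))
            sequence += 1
    # Input exhausted: flush whatever is still stashed.
    while pending:
        row, _, name = heapq.heappop(pending)
        yield row, name


MATCHES = [
    (196, [(196, "keyword")]),
    (197, [(197, "function"), (200, "string")]),
    (198, [(198, "number")]),
]

for row, name in flush_by_line(MATCHES):
    print(row, name)
```

The capture on line 200 stays in the heap while lines 197 and 198 are flushed, and is emitted only once nothing earlier can still arrive.
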

@maxbrunsfeld

Member Author

maxbrunsfeld commented Sep 13, 2019

I will need to experiment a bit with this in practice, but I think we can handle this ourselves, as long as the matches are ordered by their start position.

Yeah, this is a good point. I'm going to need to add or tweak APIs to make this possible. I will do that before merging.

For syntax highlighting, we want to iterate over all of the captures in 
order, and don't care about grouping the captures by pattern.
@Razzeee

Contributor

Razzeee commented Sep 18, 2019

Sounds correct. This issue might also be involved in theming:
microsoft/vscode#77133
Both are planned for this iteration.

When iterating over captures, this prevents reasonable queries from 
forcing the tree cursor to buffer matches unnecessarily.
@maxbrunsfeld maxbrunsfeld merged commit 07afce0 into master Sep 19, 2019
3 checks passed:
  • continuous-integration/appveyor/branch: AppVeyor build succeeded
  • continuous-integration/appveyor/pr: AppVeyor build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed
@maxbrunsfeld maxbrunsfeld deleted the tree-queries branch Sep 19, 2019
@jeff-hykin


jeff-hykin commented Sep 20, 2019

  • You can also compare the text contents of two captured nodes. This is pretty interesting - it lets you query for more sophisticated patterns. For example, this pattern would match assignments that update a variable based on its own previous value, like x = x + 1:
    ((assignment_expression
       left: (identifier) @id1
       right: (binary_expression
         left: (identifier) @id2))
     (eq? @id1 @id2))
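
As an illustration of what a text-equality check like this involves, here is a small Python sketch that filters matches by comparing the source text of two captured ranges. The match dictionaries, the offsets, and the `eq_predicate` helper are all hypothetical and hand-written for the example; they are not how the library represents matches.

```python
def node_text(source, span):
    """Return the slice of source text covered by a (start, end) range."""
    start, end = span
    return source[start:end]


def eq_predicate(source, match, first, second):
    """True if the two named captures cover identical source text."""
    return node_text(source, match[first]) == node_text(source, match[second])


SOURCE = "x = x + 1; y = z + 1;"

# Hand-written capture ranges for the two assignments in SOURCE.
candidate_matches = [
    {"id1": (0, 1), "id2": (4, 5)},      # x = x + 1
    {"id1": (11, 12), "id2": (15, 16)},  # y = z + 1
]

kept = [m for m in candidate_matches if eq_predicate(SOURCE, m, "id1", "id2")]
print(kept)  # only the `x = x + 1` match remains
```
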

This is absolutely fantastic, thank you for building awesome stuff. This has been my largest external obstacle to using Tree-sitter in practice. Later this year I'd love to contribute some tools that make it easy for theme makers to take advantage of this. I'd also be interested in getting people together to agree on a mostly universal naming convention for capture names.

My only concern is with the capture names, like @function.builtin. TextMate has the terrible legacy of making users pick between multiple incorrect answers, resulting in everyone having answers that are both bad and uniquely bad. Sometimes tokens are both parameters and constants; sometimes they're equally a function, class, variable, and object all at the same time. So long as there's a way to tag/label a token with multiple capture names, I'll be 100% on board.

7 participants