Application of the generated parser as a compiler's parser #831

resolritter · 2020-12-04T15:57:02Z

resolritter
Dec 4, 2020

Sorry, this isn't an issue. Rather, it's a question, but I did not know where else to ask this.

I've been prototyping my language's grammar with tree-sitter. This is not only because tree-sitter grammars are easy to write, but also because the conflicts it reports are a pretty useful way to find out whether there's ambiguity in the syntax.

tree-sitter emphasizes "incremental parsing". As I'm not experienced with parsers/lexers in general, I'd like to know your thoughts on using the generated parser as the actual parser in the compiler front-end for this language I've been working on. I've noticed other language projects generally rely on hand-written parsers which, in my naivete, would be more precise in reporting errors at specific places in the code... Right? I'm sure there are some other trade-offs, but I'm not knowledgeable enough about this.

Particularly, this language I've mentioned is supposed to be transpiled to JavaScript. Definitely nothing like e.g. Rust which has complicated lifetime semantics scattered throughout the code - in my case it'd be fine to simply abort the compilation on the first syntax error, because I'll not be doing semantic analysis in the code itself.

In general, how do you feel about trying to use the generated parser as a compiler component in an actual project? Are there some limitations you envision by taking that approach? Or, taking a step back, is the approach proper for doing this kind of work or would I be missing out on some valuable context/information that I'd have if I were to build the AST with a hand-written parser?

Answered by maxbrunsfeld


maxbrunsfeld · 2020-12-04T17:27:17Z

maxbrunsfeld
Dec 4, 2020
Maintainer

I converted this to a discussion thread since we're trying to move toward using Discussions for these types of conversations. The advantage is that they never need to be closed, and they're not mixed in with the more traditional "issues" that each capture an actionable item.

1 reply

resolritter Dec 4, 2020
Author

Thanks for that. I did not know this feature existed, so the thought of opening a discussion didn't cross my mind.

maxbrunsfeld · 2020-12-04T17:28:31Z

maxbrunsfeld
Dec 4, 2020
Maintainer

I think the biggest downside to using a Tree-sitter parser in a compiler front-end is that, while we've done a lot of work on Tree-sitter's error recovery, we haven't yet built out functionality for error messages. So it isn't trivial to find out the exact token/position where the error initiated, and get a list of expected tokens, and things like that.

Also, the error recovery currently isn't customizable in domain-specific ways (e.g. as soon as the word "function" appears, assume that the user meant to write an entire function definition).

Down the road, I would love to invest in both of these things, but because there's so much other stuff we're working on, it may be a while before this happens.

1 reply

resolritter Dec 5, 2020
Author

Opened #833

stephe-ada-guru · 2020-12-04T23:08:33Z

stephe-ada-guru
Dec 4, 2020

Max Brunsfeld <notifications@github.com> writes:

> I converted this to a discussion thread since we're trying to move toward using Discussions for these types of conversations. The advantage is that they never need to be closed, and they're not mixed in with the more traditional "issues" that each capture an actionable item.

I don't see how to create a new "Discussion" item. On the tree-sitter github home page https://github.com/ubolonton/emacs-tree-sitter, the word "discussion" does not occur, but there is a tab for Issues. Perhaps the header of the Issues page (and the README on the home page?) could mention how to create a Discussion instead?

…

-- Stephe

1 reply

maxbrunsfeld Dec 4, 2020
Maintainer

It's one of the tabs, alongside "Code", "Issues" and "Pull Requests".

You linked to another repo, not the Tree-sitter repo. Not all repos have the Discussions feature enabled, I believe.

reverendpaco · 2020-12-13T15:53:32Z

reverendpaco
Dec 13, 2020

Even though this question has been answered, I wanted to ask a more specific question since I came here with the same thought process: 'Can I make tree-sitter my parser generator for a language implementation?'

My question is, is there an API within tree-sitter that actually returns the content of the node (not just the ranges)?

It looks like the answer is 'no', as I did spend some time in the code, and also looked at the Neovim Lua integration at https://github.com/nvim-treesitter/nvim-treesitter/blob/master/lua/nvim-treesitter/ts_utils.lua, which provides a utility to fetch the content from the originating text given the ranges.

2 replies

resolritter Dec 15, 2020
Author

While I'm not sure if it's covered by the API, the library user should be able to extract the text for a node's range from the source itself in whatever language they're using, since they provide the source to tree-sitter in the first place. For instance, the Rust bindings provide a utf8_text method which does this (it's not coming from the tree-sitter C API).

tree-sitter/lib/binding_rust/lib.rs

Lines 1014 to 1016 in 7aca288

    pub fn utf8_text<'a>(&self, source: &'a [u8]) -> Result<&'a str, str::Utf8Error> {
        str::from_utf8(&source[self.start_byte()..self.end_byte()])
    }

You can see its application here

tree-sitter/cli/src/query.rs

Line 56 in d6a3e4c

capture.node.utf8_text(&source_code).unwrap_or("")

i.e. since node.start_byte() and node.end_byte() should already be provided in the bindings for your language, it's easy to slice the text in the language itself. Otherwise, it'd require that the tree-sitter C API stored a copy of the source code at all times together with the AST representation for the sake of this feature, which might not necessarily be the case.
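A minimal sketch of that slicing idea, written without the tree-sitter crate so it stands alone. The `Span` type is a hypothetical stand-in for a node and is not part of any tree-sitter API; it only assumes that a node can report its start and end byte offsets, as described above.

```rust
use std::str;

/// Hypothetical stand-in for a syntax node: it stores only byte offsets,
/// never the text itself.
#[derive(Debug, Clone, Copy)]
struct Span {
    start_byte: usize,
    end_byte: usize,
}

impl Span {
    /// Borrow this span's text out of the caller-owned source buffer.
    /// Fails if the byte range is not valid UTF-8 on its own.
    fn utf8_text<'a>(&self, source: &'a [u8]) -> Result<&'a str, str::Utf8Error> {
        str::from_utf8(&source[self.start_byte..self.end_byte])
    }
}

fn main() {
    // The caller keeps the source alive; the "node" only points into it.
    let source = "let answer = 42;";
    let name = Span { start_byte: 4, end_byte: 10 };
    let value = Span { start_byte: 13, end_byte: 15 };

    println!("{:?}", name.utf8_text(source.as_bytes()));
    println!("{:?}", value.utf8_text(source.as_bytes()));
}
```

Because the returned `&str` borrows from `source`, no text is copied and the tree never needs to own the source code.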

marcel0ll Dec 16, 2020

#849

I asked the same thing in the above issue and the answer was no. You have to store the original source and get the text from it using: ts_node_start_byte(TSNode) and ts_node_end_byte(TSNode).

I am trying to use tree-sitter to test an idea of incrementally generating code. I thought of trying to use it as a linter or a js dialect compiler to js.

NicholasLYang · 2022-07-28T03:41:45Z

NicholasLYang
Jul 28, 2022

I'm curious, @maxbrunsfeld, do you still believe that this is the case? Or is tree-sitter now a good candidate for a compiler front end?

5 replies

ljleb Nov 15, 2022

I'm curious about this as well!

Jachdich May 3, 2023

For what it's worth, I'm interested too. I thought using tree-sitter would be an excellent way to get syntax highlighting and a bootstrap compiler (since I plan to rewrite my language's compiler in itself, the bootstrap compiler does not have to be perfect).

NicholasLYang May 3, 2023

Some updates: I wrote the initial parser for my programming language, Vicuna, with tree-sitter. While I'm still a huge fan of tree-sitter, it was rather frustrating since I couldn't easily convert the tree-sitter parse tree to a more idiomatic abstract syntax tree. I ended up having to traverse the parse tree and build up my AST, which basically felt like maintaining a second, redundant parser. Furthermore, I'd agree with the assessment that the error message story is not quite there. Also, having C dependencies made it rather difficult to build for WebAssembly, since Rust's wasm32-unknown-unknown backend has broken C interop. I ended up switching to chumsky, a rather nice parser combinator library.

Again, I still think tree-sitter is excellent, but it's not really designed to be a front-end for a compiler. You could try rust-sitter, a promising, albeit very new library that seems to solve the AST generation problem.
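To make the "second pass over the parse tree" concrete, here is a rough sketch of what such a lowering step can look like. Everything in it is an assumption for illustration: `CstNode` is a hypothetical, simplified stand-in for a parse tree node (kind, byte range, children), the node kind names are invented, and none of this is Vicuna's or tree-sitter's actual code.

```rust
/// Hypothetical, simplified parse tree node (not a tree-sitter type).
#[derive(Debug)]
struct CstNode {
    kind: &'static str,
    start_byte: usize,
    end_byte: usize,
    children: Vec<CstNode>,
}

/// The typed AST that the rest of a compiler would consume.
#[derive(Debug, PartialEq)]
enum Expr {
    Number(i64),
    Ident(String),
    Binary { op: String, lhs: Box<Expr>, rhs: Box<Expr> },
}

#[derive(Debug, PartialEq)]
struct LowerError {
    message: String,
    start_byte: usize,
}

/// Slice a node's text out of the source the caller still owns.
fn text<'a>(node: &CstNode, source: &'a str) -> &'a str {
    &source[node.start_byte..node.end_byte]
}

/// Convert one parse tree node into a typed expression, reporting the
/// byte offset of anything unexpected (including error nodes).
fn lower(node: &CstNode, source: &str) -> Result<Expr, LowerError> {
    let err = |message: String| LowerError { message, start_byte: node.start_byte };
    match node.kind {
        "number" => text(node, source)
            .parse::<i64>()
            .map(Expr::Number)
            .map_err(|e| err(format!("invalid number: {e}"))),
        "identifier" => Ok(Expr::Ident(text(node, source).to_string())),
        "binary_expression" => match node.children.as_slice() {
            [lhs, op, rhs] => Ok(Expr::Binary {
                op: text(op, source).to_string(),
                lhs: Box::new(lower(lhs, source)?),
                rhs: Box::new(lower(rhs, source)?),
            }),
            _ => Err(err("malformed binary expression".to_string())),
        },
        other => Err(err(format!("unexpected node kind `{other}`"))),
    }
}

fn leaf(kind: &'static str, start_byte: usize, end_byte: usize) -> CstNode {
    CstNode { kind, start_byte, end_byte, children: Vec::new() }
}

fn main() {
    let source = "x + 42";
    let good = CstNode {
        kind: "binary_expression",
        start_byte: 0,
        end_byte: 6,
        children: vec![leaf("identifier", 0, 1), leaf("operator", 2, 3), leaf("number", 4, 6)],
    };
    let bad = leaf("ERROR", 0, 6);

    for tree in [&good, &bad] {
        match lower(tree, source) {
            Ok(ast) => println!("{ast:?}"),
            Err(e) => println!("error at byte {}: {}", e.message, e.start_byte),
        }
    }
}
```

The `match` on node kinds is where the duplication shows up: each grammar rule needs a matching arm here, so the grammar and the lowering code have to be kept in sync by hand.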

sgraf812 May 22, 2023

Here's what an example interpreter using tree-sitter for an ML-lookalike lambda calculus + numbers looks like: https://github.com/sgraf812/tree-sitter-lambda/blob/35fe05520e806548dedb48e7f97118847b531b26/src/main.rs#L39

As always, it's fun when you finally make it work, but it was a rather unpleasant experience.

VonTum Apr 20, 2024

Having gone through the effort of switching to tree-sitter as the SUS compiler's frontend, I can say I'm very content with the change.

The arguments given in this thread turned out way less important than they originally seemed.

> we haven't yet built out functionality for error messages. So it isn't trivial to find out the exact token/position where the error initiated, and get a list of expected tokens, and things like that.

This is totally fair, things like expected token lists would be useful to have at an error location. But there are two ways I mitigated this:

1. Instead of trying to provide the minimal edit that the user should make to make their program compile, I would give the user a list of valid constructs in this scope. So for instance if I get an ERROR inside my block scope, I tell them: "Here are the types of statements that are valid here"
2. Another way to give better errors is to make fewer errors parse errors. Take this example:

       a + 3 = func(x)

   Instead of making my grammar as tight as possible, like $._statement: $ => seq($._assignable_expr, '=', $._expr) and going "Invalid syntax on '+'", I define my grammar as loosely as possible ($._statement: $ => seq($._expr, '=', $._expr)), and give proper errors in the interpretation stage.
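A small sketch of the second mitigation, assuming the loose grammar has already produced an AST in which both sides of `=` are ordinary expressions. The `Expr` and `Assignment` types and the `check_assignment` function are hypothetical names invented for this example, not code from the SUS compiler.

```rust
/// Hypothetical AST for a grammar that accepts any expression on the
/// left-hand side of `=`.
#[derive(Debug)]
enum Expr {
    Ident(String),
    Number(i64),
    Binary { op: char, lhs: Box<Expr>, rhs: Box<Expr> },
    Call { callee: String, args: Vec<Expr> },
}

#[derive(Debug)]
struct Assignment {
    target: Expr,
    value: Expr,
}

/// Turn an expression back into source-like text for error messages.
fn render(expr: &Expr) -> String {
    match expr {
        Expr::Ident(name) => name.clone(),
        Expr::Number(n) => n.to_string(),
        Expr::Binary { op, lhs, rhs } => format!("{} {} {}", render(lhs), op, render(rhs)),
        Expr::Call { callee, args } => {
            let args: Vec<String> = args.iter().map(render).collect();
            format!("{}({})", callee, args.join(", "))
        }
    }
}

/// Semantic check run after parsing: the parser accepted the statement,
/// so this stage can explain exactly what is wrong with it.
fn check_assignment(stmt: &Assignment) -> Result<(), String> {
    match &stmt.target {
        Expr::Ident(_) => Ok(()),
        other => Err(format!(
            "cannot assign to `{}` in `{} = {}`: the left-hand side must be a variable",
            render(other),
            render(other),
            render(&stmt.value)
        )),
    }
}

fn main() {
    // a + 3 = func(x)
    let stmt = Assignment {
        target: Expr::Binary {
            op: '+',
            lhs: Box::new(Expr::Ident("a".to_string())),
            rhs: Box::new(Expr::Number(3)),
        },
        value: Expr::Call {
            callee: "func".to_string(),
            args: vec![Expr::Ident("x".to_string())],
        },
    };

    if let Err(message) = check_assignment(&stmt) {
        println!("error: {message}");
    }
}
```

Since the whole statement parsed, the check has the full left-hand expression available and can name it in the message, which a parse error pointing at `+` could not do.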

> I ended up having to traverse the parse tree and build up my AST, which basically felt like maintaining a second, redundant parser.

Well, any internal representation conversion will feel like a full-scale parser. But you have to keep in mind that tree-sitter has done the most arduous work already. All grammar conflicts are already out of the tree.

To build a better parser though, I recommend adding some abstraction over TreeCursor. I created a wrapper to allow descending down the hierarchy in a functional style here: https://github.com/pc2/sus-compiler/blob/5314928aaf9aa95ff4328be95bc4aed4f09d11b5/src/parser.rs#L81-L322
