Introduce tree-sitter-language crate for grammar crates to depend on #3069

maxbrunsfeld · 2024-02-23T22:14:44Z

This new crate tree-sitter-language just provides a LanguageFn type that grammar crates like tree-sitter-json can create instances of. Formerly, those grammar crates depended on tree-sitter itself, which was bad, because they had to depend on a specific version of the library, even though they don't use the library.

This is a breaking change. Grammar repos will need to regenerate their Rust bindings.
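
As a rough illustration of the split described above, a crate like tree-sitter-language could be little more than a newtype around the grammar's C entry point. This is a minimal sketch, not the crate's actual source: `LanguageFn` and `from_raw` are named in this thread, while `into_raw`, `DEMO_LANGUAGE`, and the stand-in `tree_sitter_demo` function are assumptions made for the example.

```rust
// Sketch of what a `tree-sitter-language`-style crate might contain.
// `LanguageFn` and `from_raw` appear in this PR discussion; `into_raw`,
// `DEMO_LANGUAGE`, and `tree_sitter_demo` are hypothetical names used only
// for illustration.

/// A grammar's C entry point, wrapped so it can cross crate boundaries
/// without the grammar crate depending on the `tree-sitter` crate.
#[repr(transparent)]
#[derive(Clone, Copy)]
pub struct LanguageFn(unsafe extern "C" fn() -> *const ());

impl LanguageFn {
    /// # Safety
    /// `f` must return a pointer to a valid language definition.
    pub const unsafe fn from_raw(f: unsafe extern "C" fn() -> *const ()) -> Self {
        Self(f)
    }

    /// Returns the wrapped function pointer for the parsing library to call.
    pub const fn into_raw(self) -> unsafe extern "C" fn() -> *const () {
        self.0
    }
}

// Stand-in for the symbol a generated parser would export (assumption).
static DEMO_LANGUAGE: u8 = 0;

unsafe extern "C" fn tree_sitter_demo() -> *const () {
    &DEMO_LANGUAGE as *const u8 as *const ()
}

/// What a grammar crate would expose: the one `unsafe` call lives here.
pub const LANGUAGE: LanguageFn = unsafe { LanguageFn::from_raw(tree_sitter_demo) };
```

With this shape, the grammar crate only needs the small wrapper crate, and the parsing library decides how to turn a `LanguageFn` into its own language type.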

dcreager · 2024-02-23T22:45:39Z

Oh this is interesting, I think I like this quite a lot!

dcreager

If we're willing to take on a backwards-incompatible change in the tree-sitter crate, I think there might be a way to accomplish the same goal (language grammar crates don't need to depend on tree-sitter) without introducing the new tree-sitter-language crate.

Instead, you would update the signature of Parser::set_language to:

type LanguageFn = unsafe extern "C" fn() -> *const ();

impl Parser {
  pub fn set_language(&mut self, language: LanguageFn) -> Result<(), LanguageError> {
    /* ... */
  }
}

So the same type as you've put into ts-lang, but without a struct wrapper around it, and just put it into tree-sitter crate directly. (It's the named struct wrapper that requires the new crate in your original formulation.) And then set_language takes in that factory function directly (and calls it to get the language pointer to then pass on to ts_parser_set_language).

The call would end up looking like

parser.set_language(tree_sitter_python)?;

And the language grammar crates would be updated to change the return type of their language constructor from tree_sitter::Language to *const ().

What do you think?

dcreager · 2024-02-26T19:28:12Z

We might also want to keep a version of Parser::set_language that takes in the Language directly, as it currently does, which I think would mean that downstream consumers would not be forced to update tree-sitter and all of their language grammars in lock-step.

maxbrunsfeld · 2024-02-26T20:51:36Z

@dcreager The drawback I see with that approach is that I think Parser::set_language should be marked as unsafe then, because you can pass any function to it with the signature unsafe extern "C" fn() -> *const ().

I admit that this extra crate is kinda weird, and I don't totally love it. And the backward-incompatible change is going to introduce some busy work for my use case at Zed as well. But I would like to properly use the unsafe attribute on any Rust APIs that are technically unsafe. Thoughts?
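
To make the soundness concern concrete, here is a small sketch of the alternative signature. The `Parser` below is a stand-in that only stores the pointer (per the proposal above, the real one would pass it on to ts_parser_set_language), and `not_a_grammar` is a hypothetical function invented for this example. The point is that safe code can hand a safe `set_language` any function with the matching signature, and nothing checks what it returns.

```rust
// Stand-in types for illustration only; not the real tree-sitter API.
pub type LanguageFn = unsafe extern "C" fn() -> *const ();

#[derive(Debug)]
pub struct LanguageError;

pub struct Parser {
    language: *const (),
}

impl Parser {
    pub fn new() -> Self {
        Parser { language: std::ptr::null() }
    }

    // A *safe* method that calls an arbitrary unsafe function pointer.
    // A real parser would go on to treat the result as a language struct.
    pub fn set_language(&mut self, language: LanguageFn) -> Result<(), LanguageError> {
        self.language = unsafe { language() };
        Ok(())
    }
}

// Hypothetical: has the right signature but returns a meaningless pointer.
unsafe extern "C" fn not_a_grammar() -> *const () {
    0xdead_beef_usize as *const ()
}

pub fn demo() -> *const () {
    let mut parser = Parser::new();
    // Compiles with no `unsafe` at the call site, which is the problem.
    parser.set_language(not_a_grammar).unwrap();
    parser.language
}
```

Wrapping the pointer in a type that can only be constructed through an `unsafe` function moves that obligation to whoever creates the value, rather than leaving it unstated at every call to `set_language`.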

dcreager · 2024-02-26T21:03:16Z

> The drawback I see with that approach is that I think Parser::set_language should be marked as unsafe then, because you can pass any function to it with the signature unsafe extern "C" fn() -> *const ().

Oof, that's a great point, I hadn't thought of that. Though I think that means the LanguageFn::from method in your current draft would have to be marked unsafe too. I don't see a way to do the dependency inversion without having to carry some unsafe call across the crate boundary.

maxbrunsfeld · 2024-02-26T21:10:47Z

Yeah, in the latest commit, I have LanguageFn::from_raw as a const unsafe function that is called in the generated code in the language crates.

dcreager · 2024-02-26T21:36:24Z

cli/src/generate/templates/lib.rs

-    unsafe { tree_sitter_PARSER_NAME() }
-}
+/// The tree-sitter [`LanguageFn`] for this grammar.
+pub const LANGUAGE: LanguageFn = unsafe { LanguageFn::from_raw(tree_sitter_PARSER_NAME) };


> Yeah, in the latest commit, I have LanguageFn::from_raw as a const unsafe function that is called in the generated code in the language crates.

Nice! I'm convinced. With the unsafe call now living in the grammar crates, I think this is the least gross solution. 😂

dcreager · 2024-02-26T21:42:09Z

cli/src/generate/templates/lib.rs

 //! let mut parser = tree_sitter::Parser::new();
-//! parser.set_language(&tree_sitter_PARSER_NAME::language()).expect("Error loading CAMEL_PARSER_NAME grammar");
+//! parser.set_language(&tree_sitter_PARSER_NAME::LANGUAGE.into()).unwrap();


What do you think about set_language taking in LanguageFn then? That would simplify this part to

parser.set_language(tree_sitter_PARSER_NAME::LANGUAGE).unwrap();

It's just aesthetic really but I like how it reduces the noise a bit.

maxbrunsfeld · 2024-02-26

Yeah, it's true the into() thing is a bit noisy.

My thing with set_language is that now we have a second code path for creating Language objects, which doesn't go through a LanguageFn: loading languages from WASM files via WasmStore::load_language. So I wouldn't want to change the type to take a LanguageFn.

Although, we could (I guess) make it take an impl Into<Language>, so that you could pass either a Language or a LanguageFn.
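
A minimal sketch of that `impl Into<Language>` idea, using stand-in types rather than the real tree-sitter ones. The by-value signature, the `Language` newtype, and the `demo_grammar` function are assumptions for illustration; as noted elsewhere in this thread, the existing `set_language` takes a reference.

```rust
// Stand-in types; the real `Language` and `LanguageFn` differ in detail.
#[derive(Clone, Copy)]
pub struct LanguageFn(unsafe extern "C" fn() -> *const ());

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Language(*const ());

impl From<LanguageFn> for Language {
    fn from(f: LanguageFn) -> Self {
        // Sound only if every `LanguageFn` was built from a genuine grammar
        // entry point, which is why constructing one is the `unsafe` step
        // in the design discussed here.
        Language(unsafe { (f.0)() })
    }
}

pub struct Parser {
    language: Option<Language>,
}

impl Parser {
    pub fn new() -> Self {
        Parser { language: None }
    }

    // Accepts either a `Language` (for example, one loaded from WASM)
    // or a `LanguageFn` from a grammar crate.
    pub fn set_language(&mut self, language: impl Into<Language>) {
        self.language = Some(language.into());
    }
}

// Hypothetical grammar entry point for the example.
static GRAMMAR: u8 = 0;

unsafe extern "C" fn demo_grammar() -> *const () {
    &GRAMMAR as *const u8 as *const ()
}

pub const LANGUAGE: LanguageFn = LanguageFn(demo_grammar);
```

Both `parser.set_language(LANGUAGE)` and `parser.set_language(some_language)` type-check here, because the standard library provides `Into<Language>` for `Language` itself.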

dcreager

[Apologies if my comments are coming across as nit-picking — I really like the dependency inversion in this change and want to make sure we make it as ergonomic to use as possible]

dcreager · 2024-02-27T15:48:57Z

> My thing with set_language is that now we have a second code path for creating Language objects, which doesn't go through a LanguageFn: loading languages from WASM files via WasmStore::load_language. So I wouldn't want to change the type to take a LanguageFn.

Oh that's nice! I hadn't been following the wasm stuff closely. So that's how you're doing dynamic linking, even though it's not very well standardized on wasm yet?

> Although, we could (I guess) make it take an impl Into<Language>, so that you could pass either a Language or a LanguageFn.

That would definitely be a nice way to provide backwards-compatibility. The new tree-sitter crate with this change would still work with old language grammar releases, since you'd just have the Language directly to pass in. You can update your grammar dependencies at your leisure and it would be a one-liner for each to change how you're passing the language into your parser.

Also, if you went with TryInto instead I think that could even subsume the error handling from WasmStore::load_language. So you'd end up creating a wrapper along the lines of

pub struct WasmLanguage<'a> {
  store: &'a mut WasmStore,
  name: &'a str,
  bytes: &'a [u8],
}

impl<'a> TryFrom<WasmLanguage<'a>> for Language {
  type Error = WasmError;

  fn try_from(wasm_lang: WasmLanguage<'a>) -> Result<Self, Self::Error> {
    // the body of `WasmStore::load_language`
  }
}

impl WasmStore {
  // All three borrows share one lifetime so they can be stored together.
  pub fn language<'a>(&'a mut self, name: &'a str, bytes: &'a [u8]) -> WasmLanguage<'a> {
    WasmLanguage { store: self, name, bytes }
  }
}

// let mut store: WasmStore = ...;
// parser.set_language(store.language("python", python_bytes))?;

dcreager · 2024-03-06T15:55:09Z

I think I'm wrong about my suggestion that Into or TryInto would provide transparent backwards compatibility, since set_language currently takes a reference to a language. Given that, I retract my suggestions — I think the dependency inversion on its own is a huge win, and worth getting in ASAP. We can iterate on the set_language ergonomics later if it turns out to be an actual pain point.

hendrikvanantwerpen · 2024-03-06T16:59:36Z

This is looking very cool! Let me try to check my understanding of what the dependency structure will look like with these changes in:

Grammar source depends on a tree-sitter version (cli, for generation) compatible with the source to generate the grammar artifact. This tree-sitter version determines the ABI (ignoring the fact that it might support multiple for now).
Grammar artifact's Rust bindings depend on a tree-sitter-language version. This is only a wrapped pointer, so the idea is, I guess, that this will be very stable.
Users depend on a tree-sitter version (lib, for parsing) and the grammar artifact. The only requirement is that they agree on the tree-sitter-language version.

Results:

The tree-sitter cli version used to generate the grammar artifact does not determine the tree-sitter lib version used to parse.
The compatibility between the grammar and the lib is not visible in the versions anymore, because the ABI is only dynamic.
Grammars can start using semantic versioning based on parse tree structure (and downstream query compatibility) regardless of the cli or lib tree-sitter versions involved.

This would improve things a lot!

I wonder if it would make sense to try to support multiple ABI versions of the same grammar easily. If I understand things correctly, the cli supports generating artifacts for older ABI versions, and the lib also supports reading multiple ABI versions. If we made the ABI part of the grammar artifact crate name, it would be easy to provide a grammar for multiple ABI targets, and they could all be checked into the repo, or published to crates.io. Would that make sense to do?

amaanq · 2024-05-25T00:44:56Z

@maxbrunsfeld I think this is ready - I've rebased it for you and fixed conflicts, as well as updated the generating files code to update Cargo.toml and/or lib.rs if they need to be updated

maxbrunsfeld · 2024-05-25T01:09:45Z

Awesome, thanks @amaanq . I guess let’s go for it.

amaanq · 2024-05-25T01:54:11Z

🥳

maxbrunsfeld marked this pull request as draft February 23, 2024 22:22

ObserverOfTime mentioned this pull request Feb 24, 2024

feat: add bindings for C, Go, Python, and Swift #2438

Merged

maxbrunsfeld force-pushed the language-crate branch from 1da97d0 to 01f7edb Compare February 26, 2024 17:20

maxbrunsfeld marked this pull request as ready for review February 26, 2024 17:53

maxbrunsfeld force-pushed the language-crate branch from 5fecdbf to fbe78be Compare February 26, 2024 18:24

dcreager reviewed Feb 26, 2024

dcreager reviewed Feb 27, 2024

WillLillis mentioned this pull request Feb 27, 2024

Versioning Conflict for Grammars' Rust Bindings #3095

Closed

dcreager mentioned this pull request Mar 5, 2024

Dependency version cleanup github/stack-graphs#411

Merged

dcreager approved these changes Mar 6, 2024

amaanq force-pushed the master branch from c206aad to 0a5a564 Compare March 10, 2024 21:15

amaanq force-pushed the master branch from 16be3ee to d569d0e Compare March 17, 2024 23:01

ObserverOfTime mentioned this pull request Apr 1, 2024

tree-sitter-cli@0.22.2 breaks tree-sitter-css tests #3238

Closed

amaanq mentioned this pull request Apr 8, 2024

Upgrade to tree-sitter 0.21.0 tree-sitter/tree-sitter-go#138

Closed

ObserverOfTime added this to the 0.23 milestone Apr 11, 2024

ObserverOfTime mentioned this pull request Apr 12, 2024

Compile-time kind and field ids that are generated by binding generation. #3276

Open

amaanq force-pushed the language-crate branch from cf263bb to 20991b8 Compare May 24, 2024 21:23

feat!: introduce tree-sitter-language crate for grammar crates to dep…

09a8eb2

…end on Co-authored-by: Conrad <conrad@zed.dev> Co-authored-by: Marshall <marshall@zed.dev> Co-authored-by: Amaan Qureshi <amaanq12@gmail.com>

amaanq force-pushed the language-crate branch from afba330 to 09a8eb2 Compare May 25, 2024 01:53

amaanq merged commit 38137c7 into master May 25, 2024
12 checks passed

amaanq deleted the language-crate branch May 25, 2024 01:54
