Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce tree-sitter-language crate for grammar crates to depend on #3069

Merged
merged 1 commit into from
May 25, 2024

Conversation

maxbrunsfeld
Copy link
Contributor

@maxbrunsfeld maxbrunsfeld commented Feb 23, 2024

This new crate tree-sitter-language just provides a LanguageFn type that grammar crates like tree-sitter-json can create instances of. Formerly, those grammar crates depended on tree-sitter itself, which was bad, because they had to depend on a specific version of the library, even though they don't use the library.

This is a breaking change. Grammar repos will need to regenerate their rust bindings.

@maxbrunsfeld maxbrunsfeld marked this pull request as draft February 23, 2024 22:22
@dcreager
Copy link
Contributor

Oh this is interesting, I think I like this quite a lot!

Copy link
Contributor

@dcreager dcreager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're willing to take on a backwards-incompatible change in the tree-sitter crate, I think there might be a way to accomplish the same goal (language grammar crates don't need to depend on tree-sitter) without introducing the new tree-sitter-language crate.

Instead, you would update the signature of Parser::set_language to:

type LanguageFn = unsafe extern "C" fn() -> *const ();

impl Parser {
  pub fn set_language(&mut self, language: LanguageFn) -> Result<(), LanguageError> {
    /* ... */
  }
}

So the same type as you've put into ts-lang, but without a struct wrapper around it, and just put it into tree-sitter crate directly. (It's the named struct wrapper that requires the new crate in your original formulation.) And then set_language takes in that factory function directly (and calls it to get the language pointer to then pass on to ts_parser_set_language).

The call would end up looking like

parser.set_language(tree_sitter_python)?;

And the language grammar crates would be updated to change the return type of their language constructor from tree_sitter::Language to *const ().

What do you think?

@dcreager
Copy link
Contributor

We might also want to keep a version of Parser::set_language that takes in the Language directly, as it currently does, which I think would mean that downstream consumers would not be forced to update tree-sitter and all of their language grammars in lock-step.

@maxbrunsfeld
Copy link
Contributor Author

@dcreager The drawback I see with that approach is that I think Parser::set_language should be marked as unsafe then, because you can pass any function to it with the signature unsafe fn extern "C" fn() -> *const ().

I admit that this extra crate is kinda weird, and I don't totally love it. And the backward-incompatible change is going to introduce some busy work for my use case at Zed as well. But I would like to properly use the unsafe attribute on any Rust APIs that are technically unsafe. Thoughts?

@dcreager
Copy link
Contributor

The drawback I see with that approach is that I think Parser::set_language should be marked as unsafe then, because you can pass any function to it with the signature unsafe fn extern "C" fn() -> *const ().

Oof, that's a great point, I hadn't thought of that. Though I think that means the LanguageFn::from method in your current draft would have to be marked unsafe too. I don't see a way to do the dependency inversion without having to carry some unsafe call across the crate boundary.

@maxbrunsfeld
Copy link
Contributor Author

Yeah, in the latest commit, I have LanguageFn::from_raw as a const unsafe function that is called in the generate code in the language crates.

unsafe { tree_sitter_PARSER_NAME() }
}
/// The tree-sitter [`LanguageFn`] for this grammar.
pub const LANGUAGE: LanguageFn = unsafe { LanguageFn::from_raw(tree_sitter_PARSER_NAME) };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, in the latest commit, I have LanguageFn::from_raw as a const unsafe function that is called in the generate code in the language crates.

Nice! I'm convinced. With the unsafe call now living in the grammar crates, I think this is the least gross solution. 😂

//! let mut parser = tree_sitter::Parser::new();
//! parser.set_language(&tree_sitter_PARSER_NAME::language()).expect("Error loading CAMEL_PARSER_NAME grammar");
//! parser.set_language(&tree_sitter_PARSER_NAME::LANGUAGE.into()).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about set_language taking in LanguageFn then? That would simplify this part to

parser.set_language(tree_sitter_PARSER_NAME::LANGUAGE).unwrap();

It's just aesthetic really but I like how it reduces the noise a bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's true the into() thing is a bit noisy.

My thing with set_language is that now we have a second code path for creating Language objects, which doesn't go through a LanguageFn: loading languages from WASM files via WasmStore::load_language. So I wouldn't want to change the type to take a LanguageFn.

Although, we could (I guess) make it take an impl Into<Language>, so that you could pass either a Language or a LanguageFn.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thing with set_language is that now we have a second code path for creating Language objects, which doesn't go through a LanguageFn: loading languages from WASM files via WasmStore::load_language. So I wouldn't want to change the type to take a LanguageFn.

Oh that's nice! I hadn't been following the wasm stuff closely. So that's how you're doing dynamic linking if though it's not very well standardized on wasm yet?

Although, we could (I guess) make it take an impl Into<Language>, so that you could pass either a Language or a LanguageFn.

That would definitely be a nice way to provide backwards-compatibility. The new tree-sitter crate with this change would still work with old language grammar releases, since you'd just have the Language directly to pass in. You can update your grammar dependencies at your leisure and it would be a one-liner for each to change how you're passing the language into your parser.

Also, if you went with TryInto instead I think that could even subsume the error handling from WasmStore::load_language. So you'd end up creating a wrapper along the lines of

pub struct WasmLanguage<'a> {
  store: &'a mut WasmStore,
  name: &'a str,
  bytes: &'a [u8],
}

impl<'a> TryFrom<WasmLanguage<'a>> for Language {
  type Error = WasmError;
  fn try_from(wasm_lang: WasmLanguage<'a>) -> Result<Result, WasmError> {
    // the body of `WasmStore::load_language`
  }
}

impl WasmStore {
  pub fn language(&mut self, name: &str, bytes: &[u8]) -> WasmLanguage {
    WasmLanguage { store: self, name, bytes }
  }
}

// let store: WasmStore = ...;
// parser.set_language(store.language("python", python_bytes))?;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm wrong about my suggestion that Into or TryInto would provide transparent backwards compatibility, since set_language currently takes a reference to a language. Given that, I retract my suggestions — I think the dependency inversion on its own is a huge win, and worth getting in ASAP. We can iterate on the set_language ergonomics later if it turns out to be an actual pain point.

Copy link
Contributor

@dcreager dcreager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Apologies if my comments are coming across as nit-picking — I really like the dependency inversion in this change and want to make sure we make it as ergonomic to use as possible]

//! let mut parser = tree_sitter::Parser::new();
//! parser.set_language(&tree_sitter_PARSER_NAME::language()).expect("Error loading CAMEL_PARSER_NAME grammar");
//! parser.set_language(&tree_sitter_PARSER_NAME::LANGUAGE.into()).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thing with set_language is that now we have a second code path for creating Language objects, which doesn't go through a LanguageFn: loading languages from WASM files via WasmStore::load_language. So I wouldn't want to change the type to take a LanguageFn.

Oh that's nice! I hadn't been following the wasm stuff closely. So that's how you're doing dynamic linking if though it's not very well standardized on wasm yet?

Although, we could (I guess) make it take an impl Into<Language>, so that you could pass either a Language or a LanguageFn.

That would definitely be a nice way to provide backwards-compatibility. The new tree-sitter crate with this change would still work with old language grammar releases, since you'd just have the Language directly to pass in. You can update your grammar dependencies at your leisure and it would be a one-liner for each to change how you're passing the language into your parser.

Also, if you went with TryInto instead I think that could even subsume the error handling from WasmStore::load_language. So you'd end up creating a wrapper along the lines of

pub struct WasmLanguage<'a> {
  store: &'a mut WasmStore,
  name: &'a str,
  bytes: &'a [u8],
}

impl<'a> TryFrom<WasmLanguage<'a>> for Language {
  type Error = WasmError;
  fn try_from(wasm_lang: WasmLanguage<'a>) -> Result<Result, WasmError> {
    // the body of `WasmStore::load_language`
  }
}

impl WasmStore {
  pub fn language(&mut self, name: &str, bytes: &[u8]) -> WasmLanguage {
    WasmLanguage { store: self, name, bytes }
  }
}

// let store: WasmStore = ...;
// parser.set_language(store.language("python", python_bytes))?;

//! let mut parser = tree_sitter::Parser::new();
//! parser.set_language(&tree_sitter_PARSER_NAME::language()).expect("Error loading CAMEL_PARSER_NAME grammar");
//! parser.set_language(&tree_sitter_PARSER_NAME::LANGUAGE.into()).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm wrong about my suggestion that Into or TryInto would provide transparent backwards compatibility, since set_language currently takes a reference to a language. Given that, I retract my suggestions — I think the dependency inversion on its own is a huge win, and worth getting in ASAP. We can iterate on the set_language ergonomics later if it turns out to be an actual pain point.

@hendrikvanantwerpen
Copy link
Contributor

This is looking very cool! Let me try to check my understanding of what the dependency structure will look like with these changes in:

  • Grammar source depends on a tree-sitter version (cli, for generation) compatible with the source to generate the grammar artifact. This tree-sitter version determines the ABI (ignoring the fact that it might support multiple for now).
  • Grammar artifact's Rust bindings depend on a tree-sitter-language version. This is only a wrapped pointer, so the idea is, I guess, that this will be very stable.
  • Users depend on a tree-sitter version (lib, for parsing) and the grammar artifact. The only requirement is that they agree on the tree-sitter-language version.

Results:

  • The tree-sitter cli version used to generate the grammar artifact does not determine the tree-sitter lib version used to parse.
  • The compatibility between the grammar and the lib is not visible in the versions anymore, because the ABI is only dynamic.
  • Grammars can start using semantic versioning based on parse tree structure (and downstream query compatibility) regardless of the cli or lib tree-sitter versions involved.

This would improve things a lot!

I wonder if it would make sense to try to support multiple ABI versions of the same grammar easily. If I understand things correctly, the cli supports generating artifacts for older ABI versions, and the lib also supports reading multiple ABI versions. If we made the ABI part of the grammar artifact crate name, it would be easy to provide a grammar for multiple ABI targets, and they could all be checked into the repo, or published to crates.io. Would that make sense to do?

@amaanq
Copy link
Member

amaanq commented May 25, 2024

@maxbrunsfeld I think this is ready - I've rebased it for you and fixed conflicts, as well as updated the generating files code to update Cargo.toml and/or lib.rs if they need to be updated

@maxbrunsfeld
Copy link
Contributor Author

Awesome, thanks @amaanq . I guess let’s go for it.

…end on

Co-authored-by: Conrad <conrad@zed.dev>
Co-authored-by: Marshall <marshall@zed.dev>
Co-authored-by: Amaan Qureshi <amaanq12@gmail.com>
@amaanq amaanq merged commit 38137c7 into master May 25, 2024
12 checks passed
@amaanq amaanq deleted the language-crate branch May 25, 2024 01:54
@amaanq
Copy link
Member

amaanq commented May 25, 2024

🥳

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants