Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide a consistent set of standard library function to wasm parsers across environments #2897

Merged
merged 8 commits into from
Feb 5, 2024

Conversation

maxbrunsfeld
Copy link
Contributor

Background

Tree-sitter parsers can be compiled to WebAssembly. Those parsers can then be used in the web browser, via web-tree-sitter, the WASM build of the Tree-sitter library. Now, they can also be used by the native Tree-sitter library, if the new wasm feature is enabled.

From what I've found so far, when building WASM modules as dynamic libraries with Clang/Emscripten, those dynamic libraries cannot include functions from the C/C++ standard library that they depend on. Instead, the "main module" needs to explicitly include any standard library functions that we plan to use.

So far, we have manually included a somewhat arbitrary list of functions from the C standard library, as well as a few from libc++ (new, delete, plus some std::string ones), but grammar authors have not had any visibility into what functions are available, or even the fact that there is a limited set of functions. Also, the list of stdlib symbols has been specified in a number of duplicated places, so there wasn't one clear place to change the list of symbols (in case we want to add any, or if we bump Emscripten and the C++ ABI changes).

Change

This change consolidates all of the logic for deciding which standard library functions to include into one file, stdlib-symbols.txt. That file is used for:

  • The web-tree-sitter library
  • The native Tree-sitter library, when the WASM feature is enabled
  • A Rust library function, wasm_stdlib_symbols
  • A new step in the build-wasm CLI command, which checks that the WASM artifact doesn't depend on any unavailable symbols.

Now, when you run tree-sitter build-wasm in a grammar repo, it will give you an informative error message if you depend on a function that's unavailable.

Next Steps

As I work on language extensions in zed, I'll be building a list of community parsers that we want to use via WASM. I may end up adding some new stdlib functions to the available set. Now, there is a single place where they need to be added.

@amaanq
Copy link
Member

amaanq commented Feb 2, 2024

Hey Max, I've just fixed all the CI errors stemming from the javascript parser updates and from no longer installing parser.h (now moved to lib/src) - can you rebase on top of that?

Also, this is a good time to maybe ask (and discuss more in the chat perhaps) about providing a set of ts_lib functions for consumers that wish to leverage libc-like functions in a scanner

For example, a commonly used one is iswspace - we could instead provide ts_iswspace that is effectively the same code, but divorces itself from the libc dependency - and apply this to a bunch of common functions used

Another one is reusing the allocators that one might provide when using the library. If I, for example, want to use my own custom allocator, and I'm using the Python grammar, I'm forced to use libc's malloc since the Python grammar uses that for delimiter tracking, and there's no way to directly use the same allocator I've provided. So the definitions in parser.h could add the allocator functions to allow scanners to make use of the same allocator everywhere (and thus potentially removing some more libc dependencies)

It will prevent us from needing to bundle in stdlib functions into wasm as well, which I think is a plus point

@maxbrunsfeld
Copy link
Contributor Author

Another one is reusing the allocators that one might provide when using the library. If I, for example, want to use my own custom allocator, and I'm using the Python grammar, I'm forced to use libc's malloc since the Python grammar uses that for delimiter tracking, and there's no way to directly use the same allocator I've provided.

I think it's a good idea to encourage scanners to use ts_current_malloc and friends. Maybe parser.h should define ts_malloc etc, so that external scanners could easily use the same malloc? We could even have parser.h override malloc itself, to encourage it more strongly.

For example, a commonly used one is iswspace - we could instead provide ts_iswspace that is effectively the same code, but divorces itself from the libc dependency - and apply this to a bunch of common functions used

I don't understand the benefit of this, vs just using iwspace directly. Either way, we'll have a fixed list of library functions available, but why not match the normal libc interface, rather than inventing our own?

@amaanq
Copy link
Member

amaanq commented Feb 5, 2024

I think it's a good idea to encourage scanners to use ts_current_malloc and friends. Maybe parser.h should define ts_malloc etc, so that external scanners could easily use the same malloc? We could even have parser.h override malloc itself, to encourage it more strongly.

Yeah I agree, and overriding malloc sounds like an ever better idea so that it's less work for downstream to potentially migrate. We should also override calloc, realloc, and free as well. I'll look into getting this done soon!

I don't understand the benefit of this, vs just using iwspace directly. Either way, we'll have a fixed list of library functions available, but why not match the normal libc interface, rather than inventing our own?

I was thinking it'd improve the accessibility of using libtree-sitter from any OS, even obscure ones where libc isn't implemented (or only partially), but now I also see the downside and extra maintenance cost of using our own implementation, and it could be worse in cases where the OS's libc uses special optimizations/intrinsics for performance. But yeah after thinking about it, I think we can disregard that idea, sorry about that.

@maxbrunsfeld
Copy link
Contributor Author

No worries! I see your point about not being dependent on libc. It'll continue to evolve. I think that this PR gets us closer to having clear expectations about what symbols external scanners can use.

In light of the concerns you raise about the custom allocator, I'm thinking about removing the libc++ symbols from the list (making it impossible to use C++ at all in external scanners, if you care about WASM), but I'm not 100% sure at this point, so I'll leave them for now.

@maxbrunsfeld maxbrunsfeld merged commit 3fe69d6 into master Feb 5, 2024
16 checks passed
@maxbrunsfeld maxbrunsfeld deleted the wasm-stdlib branch February 5, 2024 20:24
@amaanq
Copy link
Member

amaanq commented Feb 5, 2024

I think it'd be great to remove C++ from external scanner usage - it is much more difficult to get that working cross-platform on systems such as windows, and we really don't need the overhead/extra requirements that come with it, especially since pure-C projects like neovim were able to add tree-sitter-python and bash upstream in the build process instead of via the plugin that downloads and compiles the parsers because of the C++ to C rewrites I did 😁 I think we should document this that it is really preferred to just use C.

On that note, I'll rewrite the Ruby grammar's scanner soon, last one I've yet to tackle

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants