-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide a consistent set of standard library function to wasm parsers across environments #2897
Conversation
Hey Max, I've just fixed all the CI errors stemming from the javascript parser updates and from no longer installing parser.h (now moved to lib/src) - can you rebase on top of that? Also, this is a good time to maybe ask (and discuss more in the chat perhaps) about providing a set of ts_lib functions for consumers that wish to leverage libc-like functions in a scanner For example, a commonly used one is iswspace - we could instead provide ts_iswspace that is effectively the same code, but divorces itself from the libc dependency - and apply this to a bunch of common functions used Another one is reusing the allocators that one might provide when using the library. If I, for example, want to use my own custom allocator, and I'm using the Python grammar, I'm forced to use libc's malloc since the Python grammar uses that for delimiter tracking, and there's no way to directly use the same allocator I've provided. So the definitions in parser.h could add the allocator functions to allow scanners to make use of the same allocator everywhere (and thus potentially removing some more libc dependencies) It will prevent us from needing to bundle in stdlib functions into wasm as well, which I think is a plus point |
fe21c0a
to
dcc9e71
Compare
dcc9e71
to
e054de4
Compare
I think it's a good idea to encourage scanners to use
I don't understand the benefit of this, vs just using |
Yeah I agree, and overriding malloc sounds like an ever better idea so that it's less work for downstream to potentially migrate. We should also override calloc, realloc, and free as well. I'll look into getting this done soon!
I was thinking it'd improve the accessibility of using libtree-sitter from any OS, even obscure ones where libc isn't implemented (or only partially), but now I also see the downside and extra maintenance cost of using our own implementation, and it could be worse in cases where the OS's libc uses special optimizations/intrinsics for performance. But yeah after thinking about it, I think we can disregard that idea, sorry about that. |
No worries! I see your point about not being dependent on libc. It'll continue to evolve. I think that this PR gets us closer to having clear expectations about what symbols external scanners can use. In light of the concerns you raise about the custom allocator, I'm thinking about removing the |
I think it'd be great to remove C++ from external scanner usage - it is much more difficult to get that working cross-platform on systems such as windows, and we really don't need the overhead/extra requirements that come with it, especially since pure-C projects like neovim were able to add tree-sitter-python and bash upstream in the build process instead of via the plugin that downloads and compiles the parsers because of the C++ to C rewrites I did 😁 I think we should document this that it is really preferred to just use C. On that note, I'll rewrite the Ruby grammar's scanner soon, last one I've yet to tackle |
Background
Tree-sitter parsers can be compiled to WebAssembly. Those parsers can then be used in the web browser, via
web-tree-sitter
, the WASM build of the Tree-sitter library. Now, they can also be used by the native Tree-sitter library, if the new wasm feature is enabled.From what I've found so far, when building WASM modules as dynamic libraries with Clang/Emscripten, those dynamic libraries cannot include functions from the C/C++ standard library that they depend on. Instead, the "main module" needs to explicitly include any standard library functions that we plan to use.
So far, we have manually included a somewhat arbitrary list of functions from the C standard library, as well as a few from libc++ (
new
,delete
, plus somestd::string
ones), but grammar authors have not had any visibility into what functions are available, or even the fact that there is a limited set of functions. Also, the list of stdlib symbols has been specified in a number of duplicated places, so there wasn't one clear place to change the list of symbols (in case we want to add any, or if we bump Emscripten and the C++ ABI changes).Change
This change consolidates all of the logic for deciding which standard library functions to include into one file,
stdlib-symbols.txt
. That file is used for:web-tree-sitter
librarywasm_stdlib_symbols
build-wasm
CLI command, which checks that the WASM artifact doesn't depend on any unavailable symbols.Now, when you run
tree-sitter build-wasm
in a grammar repo, it will give you an informative error message if you depend on a function that's unavailable.Next Steps
As I work on language extensions in zed, I'll be building a list of community parsers that we want to use via WASM. I may end up adding some new stdlib functions to the available set. Now, there is a single place where they need to be added.