Add CodePointTrie Builder to ICU4X #1837
Comments
Hmm, one option here, which sounds a little crazy but isn't totally crazy, is to compile the ICU4C CPT builder to a WASM file, check that WASM file into tree, and then evaluate it in Rust using Wasmer. This would produce a cross-platform Rust CPT builder that isn't super fast but is perfectly fine for build tools. It's also a fun thought experiment if we can get it to work. Thoughts? |
Might be okay, but I kinda think that if we're going to have a crate with all these types I'd prefer if it were decently self-contained |
What we could do is to copy the relevant C++ files into tree, and have a script that depends on wasienv or emcc that builds them into WASM files, which also live in tree, but you don't need wasienv or emcc to build ICU4X. |
Okay, so to unblock #1835, what I need in the short term is some solution for generating CPTs from datagen. That's the use case. I have considered a number of approaches. Options 3, 4, and 8 talk about WASM; this is based on the principle that we can check WASM into the repo and then run it using the Rust-based Wasmer runtime.
Thoughts? |
Porting the CodePointTrie builder (ICU MutableCodePointTrie, e.g., Java version) to Rust is doable, but yes, it's a fair bit of code, and it's tricky enough for bugs to sneak in. So in addition to porting that chunk of code, you would also want to port the test code (e.g., Java version). I think you already have a port of parts of that test code, but using canned tries (output from the C test); when you add the builder, you would want to build the tries on the fly. You could keep the canned versions and compare them with your builder. Note that the C++ builder has a small set of dependencies on other ICU code, so it could be built into a very small library. You can follow the dependencies from the Going through WASM seems very roundabout. If you were going to add parts of ICU4C into the ICU4X repo, then just copy a small subset of ICU4C files into the ICU4X repo and FFI into that. FFI into rust_icu sounds reasonable. ICU could create a new command-line tool just for generating CodePointTries, and you could invoke that. We would need to define the input and output formats. Also note that part of the MutableCodePointTrie is that it functions like a CodePointTrie. Aside from defining values for ranges and building a trie, you can also query values and enumerate ranges (which is why it's called "mutable code point trie"). Strictly speaking, you don't need that in a pure builder, but it makes for an efficient data structure, like a mutable HashMap but optimized for code point --> uint32_t. I use this a lot in ICU data builders, updating values and iterating on data until it's done and can be poured into its runtime form. |
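To make the "mutable map from code point to uint32_t" idea above concrete, here is a minimal Rust sketch of that interface shape: set a value for a range, query a single code point, and enumerate ranges. It is not ICU's MutableCodePointTrie and does not build a trie. The type `MutableCodePointMap` and its methods are hypothetical names chosen for illustration, and the `BTreeMap`-of-run-starts storage is a simplifying assumption rather than what ICU does.

```rust
use std::collections::BTreeMap;

/// Highest valid Unicode code point.
const MAX_CP: u32 = 0x10FFFF;

/// Hypothetical stand-in for a mutable code point -> u32 map.
/// Each key is the first code point of a run; the run extends
/// up to the next key (or to U+10FFFF for the last run).
pub struct MutableCodePointMap {
    starts: BTreeMap<u32, u32>,
}

impl MutableCodePointMap {
    /// Creates a map where every code point has `initial_value`.
    pub fn new(initial_value: u32) -> Self {
        let mut starts = BTreeMap::new();
        starts.insert(0, initial_value);
        Self { starts }
    }

    /// Returns the value for one code point.
    pub fn get(&self, cp: u32) -> u32 {
        // The run containing `cp` is the last run starting at or before it.
        *self
            .starts
            .range(..=cp)
            .next_back()
            .expect("key 0 is always present")
            .1
    }

    /// Sets `value` for every code point in `start..=end`.
    pub fn set_range(&mut self, start: u32, end: u32, value: u32) {
        assert!(start <= end && end <= MAX_CP);
        // Remember which value must resume right after `end`.
        let after = if end < MAX_CP { Some(self.get(end + 1)) } else { None };
        // Drop every run start that the new range overwrites.
        let inside: Vec<u32> = self.starts.range(start..=end).map(|(k, _)| *k).collect();
        for k in inside {
            self.starts.remove(&k);
        }
        self.starts.insert(start, value);
        if let Some(v) = after {
            self.starts.insert(end + 1, v);
        }
    }

    /// Enumerates `(start, end, value)` ranges, merging adjacent equal runs.
    pub fn ranges(&self) -> Vec<(u32, u32, u32)> {
        let runs: Vec<(u32, u32)> = self.starts.iter().map(|(k, v)| (*k, *v)).collect();
        let mut out: Vec<(u32, u32, u32)> = Vec::new();
        for (i, (start, value)) in runs.iter().enumerate() {
            let end = runs.get(i + 1).map_or(MAX_CP, |(next, _)| next - 1);
            match out.last_mut() {
                Some(last) if last.2 == *value => last.1 = end,
                _ => out.push((*start, end, *value)),
            }
        }
        out
    }
}
```

A data builder in this style would call `set_range` repeatedly while iterating on its data, use `get` and `ranges` to inspect intermediate state, and only at the end hand the ranges to whatever produces the runtime CodePointTrie.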
This requires having a C++ compiler available at build time. But, maybe that's an acceptable dependency. Most systems have some C++ compiler available. I believe this is a much lower bar than I added this as Option 7. |
Yes, however systems also often differ on the precise setup of the C++ compiler and linker, and the libraries available, and other things. My experience with C deps in Rust is that they work, but they add more complexity to setting up the crate. I'd rather go with the WASM option over C++. What if we export the CPTs from ICU, and provide an optional datagen step that uses wasm or C++ to generate them? |
The problem is that the contents of the CPTs are generated in Rust code currently. We would need to port that Rust code to C++, which is Option 6. |
Ah I see. |
I'm open to a wasm dep; can we perhaps make that a separate datagen phase so most datagen users don't need to pull in that (heavy) dependency tree? I.e. datagen just copies checked-in files, and there's a separate cargo-make task and crate that generates them. |
Datagen clients should not need to run any cargo-make tasks, as that would require checking out the repository. Also, if the checked-in data file doesn't match, this will be a more complex workflow than just including the wasm builder and making the build 30s longer. |
I added an option 8. It's a variant of option 3 that may be easier to implement: check in the WASM file to the repo, and have a script that re-generates the WASM file from a local checkout of ICU4C, using a fully custom build script, rather than relying on the heavy ICU4C build system. |
Oh, I mean, it would be something they can run without cargo-make too; I just mean that from our side it's not needed or built in. |
Anything involving option 1 means that we need to have all possible CPTs checked into our repo. My hesitation is that if clients download datagen and run it with some weird configuration that we don't have pre-computed, then datagen fails for them, and they have to download the repo and do this big complicated workaround. To avoid this scenario, I heavily lean toward a solution that generates the CPTs on-demand when people run datagen. |
Actually, I think my concern is addressed by recommending contributors to call Basically I don't want this on the primary path for local development, ideally. |
I don't see how to avoid having it on the path for |
Right, my suggestion is to avoid having it in the path for whatever icu4x contributors will typically run to regenerate testdata, which need not be |
Before Markus's comment, I was thinking of responding with the simplest idea, but that ended up covered by the end of his comment -- being able to run ICU4C (via a tool) in order to construct customized data in the form of a CodePointTrie. It allows us to avoid porting the builder code (or actively keeping up with optimization improvements therein). I thought that would be sufficient because datagen is offline (just like our icuexportdata tool) and not frequent, so the time impact would be of lesser importance. I also thought because ICU can be built on multiple platforms, the tooling wouldn't be a problem. But it sounds like it is a concern. It sounds like Options 7 and 8 among others are desirable to avoid truly building all of ICU, even with the extra effort (and complexity) required to achieve that, and it sounds like early prototypes show that Option 8 is viable. That's fine with me, too. Option 8 has complexity of its own (we're trading build complexity in one place for another), so documentation should also cover the manual prerequisite steps required to make the WASM file. |
I do not believe that there is a significant difference in the overall effort required for any of the options, except for porting the builder to Rust.
Exactly true. Although I managed to get it working, an additional downside I've found with Option 8 (which @markusicu also predicted) is that the steps for re-building the WASM file are a bit nontrivial, although good documentation should mitigate this. Once the WASM file is built, the wasmer crate makes it really easy to run it. |
FYI, I have Option 8 running in #1864. I ended up crafting a Makefile (checked in to the ICU4X tree) that compiles a small part of ICU4C into wasm32-wasi *.o files and then links what it needs into a corresponding *.wasm file. This solves our problem without having to copy any source files into ICU4X (there would have been 20-30) and/or maintain patches, as would have likely been required in Option 7. It's not a perfect solution:
However, I think it's the best solution overall, given that this code is not intended to be run at runtime, and both of the above problems have solutions that can be done if necessary. |
Although we currently have option 8, I'll add a couple more options:
|
Another option, rather useful for codebases that are already carrying ICU4C around:
The way I envision this working is that icu_cpt_builder is refactored a bit so that the In datagen, we can then allow people to provide a path to a binary in the CLI args, and ask them to build icu4x/components/collections/codepointtrie_builder/src/wasm.rs Lines 60 to 66 in b74aafd
|
Currently the only way to generate CodePointTries in ICU4X is to leverage ICU4C's UMutableCPTrie. However, we would like to build our own CodePointTries in Segmenter (#1835), and possibly elsewhere. We should try to decouple ICU4X and ICU4C in this way.