Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CodePointTrie Builder to ICU4X #1837

Closed
sffc opened this issue May 2, 2022 · 22 comments · Fixed by #1864
Closed

Add CodePointTrie Builder to ICU4X #1837

sffc opened this issue May 2, 2022 · 22 comments · Fixed by #1864
Assignees
Labels
C-unicode Component: Props, sets, tries S-medium Size: Less than a week (larger bug fix or enhancement) T-techdebt Type: ICU4X code health and tech debt

Comments

@sffc
Copy link
Member

sffc commented May 2, 2022

Currently the only way to generate CodePointTries in ICU4X is to leverage ICU4C's UMutableCPTrie. However, we would like to build our own CodePointTries in Segmenter (#1835), and possibly elsewhere. We should try to decouple ICU4X and ICU4C in this way.

@sffc sffc added C-unicode Component: Props, sets, tries T-techdebt Type: ICU4X code health and tech debt S-medium Size: Less than a week (larger bug fix or enhancement) labels May 2, 2022
@sffc
Copy link
Member Author

sffc commented May 4, 2022

Hmm, one option here, which sounds a little crazy but isn't totally crazy, is to compile the ICU4C CPT builder to a WASM file, check that WASM file into tree, and then evaluate it in Rust using Wasmer.

This would produce a cross-platform Rust CPT builder that isn't super fast but is perfectly fine for build tools. It's also a fun thought experiment if we can get it to work.

Thoughts?

@sffc sffc added the needs-approval One or more stakeholders need to approve proposal label May 4, 2022
@Manishearth
Copy link
Member

Might be okay, but I kinda think that if we're going to have a crate with all these types I'd prefer if it were decently self-contained

@sffc
Copy link
Member Author

sffc commented May 5, 2022

What we could do is to copy the relevant C++ files into tree, and have a script that depends on wasienv or emcc that builds them into WASM files, which also live in tree, but you don't need wasienv or emcc to build ICU4X.

@sffc
Copy link
Member Author

sffc commented May 5, 2022

Okay, so to unblock #1835, what I need in the short term is some solution for generating CPTs from datagen. That's the use case.

I have considered a number of approaches. Options 3, 4, and 8 talk about WASM; this is based on the principle that we can check WASM into the repo and then run it using the Rust-based Wasmer runtime.

  1. Check in pre-built CPTs to the repo
    • Pro: Easy
    • Con: Does not scale well to custom data or CLDR updates
  2. Port the CPT builder to ICU4X (this issue)
    • Pro: Good end result with no external dependencies
    • Con: Fair bit of work added to the critical path of ICU4X 1.0; risk of the CPT builders drifting out of sync
  3. Compile full ICU4C into WASM and then link it into a small script runner that we also compile into WASM
    • Pro: Keeps a single source of truth for the CPT builder
    • Con: Building proper ICU4C to WASM is nontrivial
  4. Copy the CPT C++ builder code into the ICU4X repo and compile that into WASM (instead of full ICU4C)
    • Pro: Fully self-contained in the ICU4X project
    • Con: Likely requires removing ICU4C-specific boilerplate from the C++ file; could drift out of sync
  5. Add an FFI for the CPT builder to rust_icu and add a dependency on rust_icu to datagen; see Support UCPTrie and UMutableCPTrie google/rust_icu#247
    • Pro: Fairly clean, official solution
    • Con: Requires a copy of ICU4C for running datagen, which we've explicitly avoided
  6. Move the Segmenter datagen code into ICU4C's icuexportdata tool and do all the work over there
    • Pro: Full access to ICU4C tooling when building the segmentation data
    • Con: Moving Rust code into C++ sounds like a bad idea; customizing the segmenter data requires re-running the ICU4C toolchain
  7. Copy the CPT C++ builder code into the ICU4X repo, compile it with a C++ compiler from the environment, and invoke it via FFI
    • Pro: The code is fully self-contained
    • Con: Same cons as option 4; requires the presence of a C++ compiler in the environment when running datagen (we would likely leverage a solution such as cc-rs)
  8. Write a script that builds the subset of ICU4C that we need for running the CPT builder into a WASM file, circumventing much of the complexity of the ICU4C build system
    • Pro: No need to copy ICU4C source files into our repo; no dependency on the ICU4C build system
    • Con: WASM dependency

Thoughts?

@markusicu
Copy link
Member

Porting the CodePointTrie builder (ICU MutableCodePointTrie, e.g., Java version) to Rust is doable, but yes, it's a fair bit of code, and it's tricky enough for bugs to sneak in. So in addition to porting that chunk of code, you would also want to port the test code (e.g., Java version). I think you already have a port of parts of that test code, but using canned tries (output from the C test); when you add the builder, you would want to build the tries on the fly. You could keep the canned versions and compare them with your builder.

Note that the C++ builder has a small set of dependencies on other ICU code, so it could be built into a very small library. You can follow the dependencies from the common:umutablecptrie build target (in the special build system for Unicode updates).

Going through WASM seems very roundabout. If you were going to add parts of ICU4C into the ICU4X repo, then just copy a small subset of ICU4C files into the ICU4X repo and FFI into that.

FFI into rust_icu sounds reasonable.

ICU could create a new command-line tool just for generating CodePointTrie's, and you could invoke that. We would need to define the input and output formats.


Also note that part of the MutableCodePointTrie is that it functions like a CodePointTrie. Aside from defining values for ranges and building a trie, you can also query values and enumerate ranges (which is why it's called "mutable code point trie"). Strictly speaking, you don't need that in a pure builder, but it makes for an efficient data structure, like a mutable HashMap but optimized for code point --> uint32_t. I use this a lot in ICU data builders, updating values and iterating on data until it's done and can be poured into its runtime form.

@sffc
Copy link
Member Author

sffc commented May 5, 2022

Going through WASM seems very roundabout. If you were going to add parts of ICU4C into the ICU4X repo, then just copy a small subset of ICU4C files into the ICU4X repo and FFI into that.

This requires having a C++ compiler available at build time. But, maybe that's an acceptable dependency. Most systems have some C+ compiler available. I believe this is a much lower bar than rust_icu and the ICU-based CLI tool, which require a fully working, up-to-date local build of ICU4C.

I added this as Option 7.

@Manishearth
Copy link
Member

Most systems have some C+ compiler available.

Yes, however systems also often differ on the precise setup of the C++ compiler and linker, and the libraries available, and other things.

My experience with C deps in Rust is that they work, but they add more complexity to setting up the crate.

I'd rather go with the WASM option over C++


What if we export the CPTs from ICU, and provide an optional datagen step that uses wasm or C++ to generate them?

@sffc
Copy link
Member Author

sffc commented May 5, 2022

What if we export the CPTs from ICU, and provide an optional datagen step that uses wasm or C++ to generate them?

The problem is that the contents of the CPTs are generated in Rust code currently. We would need to port that Rust code to C++, which is Option 6.

@Manishearth
Copy link
Member

Ah I see.

@Manishearth
Copy link
Member

I'm open to a wasm dep; can we perhaps make that a separate datagen phase so most datagen users don't need to pull in that (heavy) dependency tree? I.e. datagen just copies checked-in files, there's a separate cargo-make task and crate that generates them.

@robertbastian
Copy link
Member

Datagen clients should not need to run any cargo-make tasks, as that would require checking out the repository. Also if the checked-in data file doesn't match this will be a more complex workflow than just including the wasm builder and making the build 30s longer.

@sffc
Copy link
Member Author

sffc commented May 5, 2022

I added an option 8. It's a variant of option 3 that may be easier to implement: check in the WASM file to the repo, and have a script that re-generates the WASM file from a local checkout of ICU4C, using a fully custom build script, rather than relying on the heavy ICU4C build system.

@Manishearth
Copy link
Member

Datagen clients should not need to run any cargo-make tasks, as that would require checking out the repository. Also if the checked-in data file doesn't match this will be a more complex workflow than just including the wasm builder and making the build 30s longer.

Oh, I mean, it would be something they can run without cargo-make too, i just mean that from our side it's not needed or built in cargo make testdata by default

@sffc
Copy link
Member Author

sffc commented May 5, 2022

Anything involving option 1 means that we need to have all possible CPTs checked into our repo. My hesitation is that if clients download datagen and run it with some weird configuration that we don't have pre-computed, then datagen fails for them, and they have to download the repo and do this big complicated workaround. To avoid this scenario, I heavily lean toward a solution that generates the CPTs on-demand when people run datagen.

@Manishearth
Copy link
Member

Actually, I think my concern is addressed by recommending contributors to call cargo make testdata-cldr that only generates cldr-based testdata (or something) and avoids this.

Basically I don't want this on the primary path for local development, ideally.

@sffc
Copy link
Member Author

sffc commented May 5, 2022

I don't see how to avoid having it on the path for cargo make testdata, but if you're just building and running components locally and you don't need datagen, you should be fine.

@Manishearth
Copy link
Member

I don't see how to avoid having it on the path for cargo make testdata, but if you're just building and running components locally and you don't need datagen, you should be fine.

Right, my suggestion is to avoid having it in the path for whatever icu4x contributors will typically run to regenerate testdata, which need not be cargo make testdata, it can be a separate command that runs a subset.

@echeran
Copy link
Contributor

echeran commented May 6, 2022

Before Markus's comment, I was thinking of responding with the simplest idea, but that ended up covered by the end of his comment -- being able to run ICU4C (via a tool) in order to construct customized data in the form of a CodePointTrie. It allows us to avoid porting the builder code (or actively keeping up with optimization improvements therein). I thought that would be sufficient because datagen is offline (just like our icuexportdata tool) and not frequent, so the time impact would be of lesser importance. I also thought because ICU can be built on multiple platforms, the tooling wouldn't be a problem. But it sounds like it is a concern.

It sounds like Options 7 and 8 among others are desirable to avoid truly building all of ICU, even with the extra effort (and complexity) required to achieve that, and it sounds like early prototypes show that Option 8 is viable. That's fine to me, too. Option 8 has complexity of its own (we're trading build complexity in one place for another), so documentation should also cover the manual pre-requisite steps required to make the WASM file.

@sffc
Copy link
Member Author

sffc commented May 6, 2022

even with the extra effort (and complexity) required to achieve that

I do not believe that there is a significant difference in the overall effort required for any of the options, except for porting the builder to Rust.

it sounds like early prototypes show that Option 8 is viable. That's fine to me, too. Option 8 has complexity of its own (we're trading build complexity in one place for another), so documentation should also cover the manual pre-requisite steps required to make the WASM file.

Exactly true. Although I managed to get it working, an additional downside I've found with Option 8 (which @markusicu also predicted) is that the steps for re-building the WASM file are a bit nontrivial, although good documentation should mitigate this. Once the WASM file is built, the wasmer crate makes it really easy to run it.

@sffc sffc added this to the ICU4X 0.7 Sprint A milestone May 9, 2022
@sffc sffc self-assigned this May 9, 2022
@sffc sffc removed the needs-approval One or more stakeholders need to approve proposal label May 9, 2022
@sffc
Copy link
Member Author

sffc commented May 9, 2022

FYI, I have Option 8 running in #1864. I ended up crafting a Makefile (checked in to the ICU4X tree) that compiles a small part of ICU4C into wasm32-wasi *.o files and then links what it needs into a corresponding *.wasm file. This solves our problem without having to copy any source files into ICU4X (there would have been 20-30) and/or maintain patches, as would have likely been required in Option 7.

It's not a perfect solution:

  1. Relies on ICU4C not making an incompatible change in the future (see ICU-22016)
    • Potential future solution: Add a CI test
  2. Requires a bit of startup time in order for the WASM engine to parse and prepare the code for execution
    • Potential future solution: Enable filesystem caching of the compiled code in a build.rs script

However, I think it's the best solution overall, given that this code is not intended to be run at runtime, and both of the above problems have solutions that can be done if necessary.

@sffc
Copy link
Member Author

sffc commented Dec 21, 2022

Although we currently have option 8, I'll add a couple more options:

  1. Move the C++ builder code to a standalone library that both ICU4C and ICU4X depend on
    • Pro: No code duplication
    • Con: Unclear where this shared standalone library would live
  2. Port the C++ builder to Rust and then replace the C++ one with the Rust one in ICU4C
    • Pro: Great long-term solution from ICU4X's point of view
    • Con: Extra complexity in the ICU4C build

@Manishearth
Copy link
Member

Manishearth commented Dec 21, 2022

Another option, rather useful for codebases that are already carrying ICU4C around:

  1. Use some cargo features to give the CPTbuilder code the ability to switch between using wasmer and a natively compiled binary.
    • Pro: More flexibility for clients that care about third party dependencies.
    • Con: More code to carry around. Might be painful to test in CI

The way I envision this working is that icu_cpt_builder is refactored a bit so that the build() code has an option to insert a closure that takes a values iterator (and the trie type) and produces a toml string. perhaps add a build_with() option and hide build()and the wasmer dep behind a Cargo feature.

In datagen, we can then allow people to provide a path to a binary in the CLI args, and ask them to build list_to_ucptrie.cpp themselves. Datagen can be built without the wasm option (and will then panic if this binary is not provided), and when provided it will use build_with() instead of build(), and pass in a small closure that basically does the equivalent of the following code but with a binary.

writeln!(wasi_stdin, "{}", values.len()).expect("valid pipe");
for value in values {
let num: u32 = (*value).into();
writeln!(wasi_stdin, "{}", num).expect("valid pipe");
}
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-unicode Component: Props, sets, tries S-medium Size: Less than a week (larger bug fix or enhancement) T-techdebt Type: ICU4X code health and tech debt
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants