Port BytesTrie to ICU4X #131

sffc · 2020-06-17T01:55:09Z

BytesTrie is a data structure in ICU4C that maps from byte strings to integer values. This data structure is fundamental to various pieces of functionality, including case mapping and language matching. We should port this data structure to ICU4X.

zbraniecki · 2020-10-24T02:12:21Z

Prior art:

Any of those may be a good starting point?

sffc · 2021-02-04T18:39:33Z

@kpozin What are your plans on this issue?

zbraniecki · 2021-02-15T22:46:08Z

Would SetTrie be a good entry point for this - https://github.com/KaiserKarel/set-trie ?

kpozin · 2021-02-16T23:28:58Z

> Would SetTrie be a good entry point for this - https://github.com/KaiserKarel/set-trie ?

Unfortunately, no, if we want to maintain backward compatibility.

The main complication is that we're trying to achieve perfect compatibility with an existing data format that does not have a formal specification. The C++/Java code is the spec. The existing code relies heavily on data and method inheritance, reuses a single set of mutable builder classes for several discrete building stages, and even mutates built Tries while traversing them. As you can imagine, these design decisions make the design exceptionally Rust-unfriendly. In writing a port, I've started the whole thing over several times, with varying degrees of Rust-iness and Java-ness in my design.

Moreover, all the logic is closely coupled together in such a way that unit testing individual methods or build stages is impractical (there do not appear to be any unit tests in the legacy code); one basically has to port the entire conglomeration, both Builder and Trie, and then port the integration tests one by one to discover what's broken. This is the stage I'm at now: I have a full port of most of the production code, am porting tests, and am learning the numerous ways in which my port is utterly broken.

kpozin · 2021-02-16T23:32:51Z

I'm aware of the sunk costs (I've sunk a fair portion of them :), but as I proposed at the beginning of this project, we would be much better off finding a standard, well-specified trie format and implementing that, instead of continuing down this road.

markusicu · 2021-02-23T22:00:34Z

@sffc :

> This data structure is fundamental to various pieces of functionality, including case mapping and language matching. We should port this data structure to ICU4X.

Not quite. In ICU, case mapping, normalization, collation, and character properties use “code point tries” which are heavily optimized for lookup by code point, as well as by walking through UTF-16 and UTF-8 code units. Between ca. 2001 and 2018 I came up with three versions of this. I moved some ICU code to the third version, and turned that on its own into public API. I would love to convert code that still uses older versions to the latest one, given time.

Design doc: http://site.icu-project.org/design/struct/utrie

By contrast, BytesTrie and UCharsTrie are designed for lookups from arbitrarily long sequences of bytes (8-bit units) vs. UChars (16-bit units). These are more-classic retrieval trees. You could use them for lookup from single code points if you turn those into sequences of 8-bit or 16-bit units, but these “string tries” are not especially optimized for that. We use them for BreakIterator dictionaries, in collation for contraction matching (after the first code point), and for matching Unicode character property (and value) names.

One of the design points for the “string tries” is to get a byte-serialized or 16-bit-word-serialized sequence that requires no or only trivial byte swapping for little-endian vs. big-endian machines, for use as read-only, memory-mapped data.
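To illustrate the byte-swapping point, here is a minimal sketch. The function names are made up for illustration and nothing here reproduces the actual ICU serialization. A buffer of 16-bit words has to be decoded with the byte order it was written in, while a buffer of 8-bit units can be used exactly as it sits in memory.

```rust
/// Decode a buffer of serialized 16-bit words.
/// `little_endian` describes how the buffer was written, not the host CPU.
/// (Hypothetical helper; not an ICU or ICU4X function.)
pub fn read_u16_words(raw: &[u8], little_endian: bool) -> Vec<u16> {
    raw.chunks_exact(2)
        .map(|pair| {
            let pair = [pair[0], pair[1]];
            if little_endian {
                u16::from_le_bytes(pair)
            } else {
                u16::from_be_bytes(pair)
            }
        })
        .collect()
}

/// A byte-serialized structure needs no decoding step at all:
/// the memory-mapped bytes are already the units the trie walks over.
pub fn read_u8_units(raw: &[u8]) -> &[u8] {
    raw
}
```

The same two bytes give different 16-bit words depending on the declared byte order, which is the "trivial byte swapping" case; the 8-bit view is identical on every platform.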

If you “abused” a “string trie” for pure code point lookups, I would expect them to be a lot slower than code point tries, but they may be more compact.

Design docs:

(There is another version of tries in the ICU character conversion code, customized in various ways to deal with special issues there. Both code point tries and string tries, since some charsets map multiple Unicode code points to single charset codes.)

@kpozin :

> The existing code relies heavily on data and method inheritance, reuses a single set of mutable builder classes for several discrete building stages

Yes. In principle, you could write a builder from scratch, as long as you build the data structure that the runtime understands.

By the way, I would port the Java builder, not the C++ builder. Late in development, I made a significant improvement to the Java builder that I never got around to porting back to C++. (I have an old ICU ticket for that that I keep meaning to get to.)

> and even mutates built Tries while traversing them.

That is either a misunderstanding or an overstatement. The built structure is immutable. But of course while you walk the structure you need a mutable iterator, and that's what the Trie classes are. Each iterator object is tiny. It just keeps a pointer into the big array and a little bit of extra state.
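In Rust terms, that split between immutable data and a small mutable iterator might look roughly like the sketch below. `TrieData` and `Cursor` are hypothetical names, and the "trie" is just a flat byte array standing in for the real serialized format, which is not modeled here.

```rust
/// Hypothetical stand-in for a built trie: an immutable byte array.
/// The real ICU serialization is NOT modeled here.
pub struct TrieData {
    bytes: Vec<u8>,
}

/// Hypothetical cursor: a shared borrow of the data plus a position.
/// Advancing mutates only the cursor, never the data.
#[derive(Clone)]
pub struct Cursor<'a> {
    data: &'a [u8],
    pos: usize,
}

impl TrieData {
    pub fn new(bytes: Vec<u8>) -> Self {
        Self { bytes }
    }

    /// Any number of cursors can borrow the same data at once.
    pub fn cursor(&self) -> Cursor<'_> {
        Cursor { data: &self.bytes, pos: 0 }
    }
}

impl<'a> Cursor<'a> {
    /// Toy step: advance only if the next stored byte equals `b`.
    pub fn step(&mut self, b: u8) -> bool {
        match self.data.get(self.pos) {
            Some(&x) if x == b => {
                self.pos += 1;
                true
            }
            _ => false,
        }
    }

    pub fn position(&self) -> usize {
        self.pos
    }
}
```

Because each cursor holds only `&[u8]` and a `usize`, several of them can walk the same built array independently, and the borrow checker guarantees the array itself is never modified during traversal.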

> Moreover, all the logic is closely coupled together in such a way that unit testing individual methods or build stages is impractical (there do not appear to be any unit tests in the legacy code); one basically has to port the entire conglomeration, both Builder and Trie, and then port the integration tests one by one to discover what's broken.

Sorry. And thanks for your bug report! I intend to fix that for ICU 69; I need to think about how to unit-test the internal function that has the bogus code. I would be happy to entertain PRs for better unit testing!

markusicu · 2021-02-23T22:28:48Z

The runtime code for each of these data structures is fairly small. It might work well for ICU4X to have offline builder code calling ICU4J or ICU4C that writes Rust code with initializer lists. We already have some C++ code that writes C and Java initializers, especially for code point tries.
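As a rough sketch of what "writes Rust code with initializer lists" could mean, the function below renders an already-built byte array as Rust source. The function name, output layout, and the idea of writing the generator itself in Rust are assumptions for illustration; the real tool would presumably live in ICU4C or ICU4J.

```rust
use std::fmt::Write;

/// Render a byte array as Rust source for a `static` initializer.
/// `name` is assumed to be a valid Rust identifier.
/// (Hypothetical helper; not part of ICU or ICU4X.)
pub fn emit_rust_static(name: &str, data: &[u8]) -> String {
    let mut out = String::new();
    // Writing to a String cannot fail, so unwrap() is safe here.
    writeln!(out, "pub static {}: [u8; {}] = [", name, data.len()).unwrap();
    // Eight bytes per line keeps the generated file readable.
    for chunk in data.chunks(8) {
        let line: Vec<String> = chunk.iter().map(|b| format!("0x{:02x}", b)).collect();
        writeln!(out, "    {},", line.join(", ")).unwrap();
    }
    out.push_str("];\n");
    out
}
```

For example, `emit_rust_static("TRIE", &[0x01, 0xab])` yields a three-line `pub static TRIE: [u8; 2] = [...]` item that can be compiled directly into the crate, so no builder code is needed at runtime.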

sffc · 2021-02-24T03:54:37Z

Summary of conclusions from a meeting with me, @kpozin, @markusicu, @echeran, @zbraniecki, @dminor:

- CodePointTrie (aka UCPTrie) is the right tool for selecting enumerated properties (one code point mapping to one of several enumerated values)
- BytesTrie is useful for likely subtags, etc., but probably the wrong tool for enumerated properties
- Both CodePointTrie and BytesTrie have a binary representation, which I refer to as an "ArrayBuffer"
- Enumerated property selection is what's needed for segmentation
- We could reinvent the wheel in ICU4X by building our own data structures, but I'm convinced that we should stand on the shoulders of giants and use the same data structures as ICU4C/ICU4J
- It's not a good use of our time to write the BytesTrie/CodePointTrie builder code in ICU4X, which is what @kpozin was attempting to do above

Next steps:

- Continue with "Add PPUCD enumerated property parsing" (#448), which may be useful for regular expression performance.
- Manually extract the ArrayBuffers from the ICU4C/ICU4J tests as a one-time job, and use them as input to ICU4X unit tests.
- Add a tool to ICU4C that creates JSON artifacts containing ArrayBuffers for enumerated property selection; the artifacts should be versioned and shipped alongside an ICU release. Assignee: undetermined

echeran · 2021-02-24T19:58:02Z

Here are the full meeting notes for yesterday's meeting.

sffc · 2021-02-25T19:26:09Z

CodePointTrie is split off into #508.

sffc · 2021-04-19T22:21:28Z

Note: Design of language matching is in #173

sffc · 2021-05-13T18:03:51Z

See unicode-org/icu#1660 for how we're exporting CodePointTrie data from ICU4C into toml files.

sffc · 2021-07-21T19:36:15Z

Blocks #842

sffc · 2021-08-24T16:05:51Z

@fabura is going to take a look at this.

sffc · 2021-08-24T16:09:56Z

Take a look at:

- CodePointTrie, which is the closest thing we have in ICU4X to what you would be building: https://github.com/unicode-org/icu4x/blob/main/experimental/codepointtrie/src/codepointtrie.rs
- BytesTrie documentation in ICU4C: https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1BytesTrie.html
- BytesTrie impls:
  - Code data structure: https://github.com/unicode-org/icu/blob/main/icu4c/source/common/bytestrie.cpp
  - Builder: https://github.com/unicode-org/icu/blob/main/icu4c/source/common/bytestriebuilder.cpp
  - Iterator: https://github.com/unicode-org/icu/blob/main/icu4c/source/common/bytestrieiterator.cpp

Consider writing code under experimental/bytestrie. Consider copying boilerplate (Cargo.toml, LICENSE, src/lib.rs) from codepointtrie.

makotokato · 2021-08-31T03:09:29Z

I have a tiny version of an ICU4C/ICU4J-compatible BytesTrie/UCharsTrie (https://github.com/makotokato/dictionary_segmenter/blob/main/src/bytes_trie.rs and https://github.com/makotokato/dictionary_segmenter/blob/main/src/uchars_trie.rs), since I need it in order to use the dictionary segmenter data of ICU4C/ICU4J.

dminor · 2022-01-19T01:24:19Z

Makoto will need this for the segmenter. We should find an owner. It should be straightforward based upon the Char16Trie implementation and Makoto's implementation mentioned above.

sffc · 2022-01-19T01:25:51Z

> Makoto will need this for the segmenter.

Could you clarify? My understanding is that BytesTrie and CharsTrie are largely interchangeable. If we build our own data, then we can build it as either BytesTrie or CharsTrie; either should work.

dminor · 2022-01-19T01:35:00Z

He was concerned about data size if we used CharsTrie... if that's not a valid concern, then that is good to know.

markusicu · 2022-01-20T22:58:30Z

I co-developed BytesTrie and CharsTrie at the same time. In principle, they work almost the same.

- BytesTrie works with input sequences of bytes (u8), and is itself stored as a sequence of bytes.
- CharsTrie works with input sequences of 16-bit words (u16), and is itself stored as a sequence of 16-bit words.
- They both output the same 4-way result enum, and an optional 32-bit value.
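A minimal sketch of that lookup contract follows, using a naive in-memory node tree rather than the serialized ICU format. The variant names are modeled on ICU's `UStringTrieResult` (no match, no value, final value, intermediate value); `Node`, `insert`, and `lookup` are hypothetical names, and folding the 32-bit value into the enum variants is a design choice of this sketch, not something the thread specifies.

```rust
use std::collections::BTreeMap;

/// 4-way lookup result, modeled on ICU's UStringTrieResult.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum TrieResult {
    /// The input is not a prefix of any key.
    NoMatch,
    /// The input is a proper prefix of some key, but not a key itself.
    NoValue,
    /// The input is a key and no longer key continues from it.
    FinalValue(i32),
    /// The input is a key and is also a prefix of a longer key.
    IntermediateValue(i32),
}

/// Naive pointer-based node; NOT the compact serialized ICU layout.
#[derive(Default)]
pub struct Node {
    value: Option<i32>,
    children: BTreeMap<u8, Node>,
}

impl Node {
    pub fn insert(&mut self, key: &[u8], value: i32) {
        let mut node = self;
        for &b in key {
            node = node.children.entry(b).or_default();
        }
        node.value = Some(value);
    }

    pub fn lookup(&self, key: &[u8]) -> TrieResult {
        let mut node = self;
        for b in key {
            match node.children.get(b) {
                Some(next) => node = next,
                None => return TrieResult::NoMatch,
            }
        }
        match (node.value, node.children.is_empty()) {
            (Some(v), true) => TrieResult::FinalValue(v),
            (Some(v), false) => TrieResult::IntermediateValue(v),
            (None, _) => TrieResult::NoValue,
        }
    }
}
```

A CharsTrie-style variant would have the same shape with `u16` in place of `u8` for the key units; only the unit type and the storage width change.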

BytesTrie is more natural when the input is already a sequence of bytes, or is naturally encoded as such. In ICU, I use it for Unicode property aliases and for Thai/Khmer/Burmese dictionaries; the latter are small scripts and use a custom mapping from Unicode code points to single bytes.
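To make the "custom mapping from code points to single bytes" idea concrete, here is a hedged sketch. It assumes a simple offset from the start of the Thai block (U+0E00 to U+0E7F); the actual transform used by ICU's dictionary data is not reproduced here, and `thai_to_byte` and `encode_key` are hypothetical names.

```rust
/// Hypothetical mapping: Thai block U+0E00..=U+0E7F to bytes 0x00..=0x7F.
/// This offset scheme is an assumption, not ICU's actual transform.
pub fn thai_to_byte(c: char) -> Option<u8> {
    let cp = c as u32;
    if (0x0E00..=0x0E7F).contains(&cp) {
        Some((cp - 0x0E00) as u8)
    } else {
        None
    }
}

/// Encode a whole word as a byte key; `None` if any character
/// falls outside the mapped block.
pub fn encode_key(word: &str) -> Option<Vec<u8>> {
    word.chars().map(thai_to_byte).collect()
}
```

With a mapping like this, each character of a small script costs exactly one trie unit, which is why a byte-based trie fits such dictionaries naturally.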

CharsTrie is more natural when the input units are larger. In ICU, I use it for CJK dictionaries and collation contraction lookup, encoding the input as UTF-16 where that's not already the case.

The storage of these "string tries" uses variable-width encodings of result values and internal offsets. The bigger the values and offsets, the more storage units need to be stored and decoded. My hunch is that for small amounts of data a BytesTrie should be smaller, and for large amounts of data a CharsTrie should be faster.

Also, using a BytesTrie for large input units means you need to convert your inputs to multiple bytes (more bytes than the number of 16-bit words you would need), which adds some encoding cost and requires a larger number of lookups.
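A quick way to see the difference in unit counts, assuming UTF-8 as the byte encoding for the BytesTrie case (the helper names are illustrative only):

```rust
/// Number of 8-bit units a byte-based trie would walk if keys are UTF-8.
pub fn utf8_units(s: &str) -> usize {
    s.len()
}

/// Number of 16-bit units a 16-bit-word trie would walk if keys are UTF-16.
pub fn utf16_units(s: &str) -> usize {
    s.encode_utf16().count()
}
```

For a CJK key such as "日本語", the UTF-8 form is 9 bytes while the UTF-16 form is 3 words, so a byte-based lookup takes three times as many steps. For ASCII-only keys the two counts are equal.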

I have not done detailed benchmarking of one vs. the other for a variety of types and volumes of data.

sffc · 2022-01-20T23:27:56Z

My preference is to start with CharsTrie, since we already have it implemented, and since I understand it to be the simpler of the two. We should use BytesTrie only if we've tried using CharsTrie and found it to be too large or slow for the use case.

sffc · 2022-11-10T18:34:42Z

See #1155 for an early attempt at this.

sffc added T-core Type: Required functionality C-meta Component: Relating to ICU4X as a whole labels Jun 17, 2020

sffc mentioned this issue Jun 17, 2020

What are the key low-level data structures we need to support ECMA-402? #17

Closed

sffc added this to the 2020 Q3 milestone Jun 17, 2020

sffc assigned kpozin Jun 17, 2020

sffc added C-unicode Component: Props, sets, tries and removed C-meta Component: Relating to ICU4X as a whole labels Jun 22, 2020

sffc modified the milestones: 2020 Q3, 2020 Q4 Oct 22, 2020

kpozin modified the milestones: 2020 Q4, 2021-Q1-m1 Jan 7, 2021

sffc modified the milestones: 2021-Q1-m1, 2021-Q1-m2 Feb 4, 2021

zbraniecki added the discuss Discuss at a future ICU4X-SC meeting label Feb 18, 2021

sffc removed the discuss Discuss at a future ICU4X-SC meeting label Mar 4, 2021

sffc modified the milestones: 2021-Q1-m2, 2021-Q2-m1 Mar 12, 2021

sffc modified the milestones: 2021-Q2-m1, ICU4X 0.3 May 7, 2021

sffc modified the milestones: ICU4X 0.3, 2021 Q2-m3 May 13, 2021

sffc added the good first issue Good for newcomers label Jul 24, 2021

sffc removed this from the 2021 Q3-m1 milestone Aug 12, 2021

sffc added backlog help wanted Issue needs an assignee S-medium Size: Less than a week (larger bug fix or enhancement) labels Aug 12, 2021

sffc assigned sffc and unassigned kpozin Aug 24, 2021

sffc removed the help wanted Issue needs an assignee label Aug 24, 2021

dminor mentioned this issue Oct 21, 2021

Port UCharsTrie to ICU4X #1202

Closed

dminor added the discuss-priority Discuss at the next ICU4X meeting label Jan 19, 2022

sffc removed the discuss-priority Discuss at the next ICU4X meeting label Feb 3, 2022

sffc mentioned this issue Nov 10, 2022

Implementation of icu's bytestrie structure to Rust #1155

Closed

sffc added this to the Backlog milestone Dec 22, 2022

sffc removed the backlog label Dec 22, 2022
