Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port BytesTrie to ICU4X #131

Open
sffc opened this issue Jun 17, 2020 · 22 comments
Open

Port BytesTrie to ICU4X #131

sffc opened this issue Jun 17, 2020 · 22 comments
Assignees
Labels
C-unicode Component: Props, sets, tries good first issue Good for newcomers S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality

Comments

@sffc
Copy link
Member

sffc commented Jun 17, 2020

BytesTrie is a data structure in ICU4C that maps from byte strings to integer values. This data structure is fundamental to various pieces of functionality, including case mapping and language matching. We should port this data structure to ICU4X.

@sffc sffc added T-core Type: Required functionality C-meta Component: Relating to ICU4X as a whole labels Jun 17, 2020
@sffc sffc added this to the 2020 Q3 milestone Jun 17, 2020
@sffc sffc added C-unicode Component: Props, sets, tries and removed C-meta Component: Relating to ICU4X as a whole labels Jun 22, 2020
@sffc sffc modified the milestones: 2020 Q3, 2020 Q4 Oct 22, 2020
@zbraniecki
Copy link
Member

@kpozin kpozin modified the milestones: 2020 Q4, 2021-Q1-m1 Jan 7, 2021
@sffc
Copy link
Member Author

sffc commented Feb 4, 2021

@kpozin What are your plans on this issue?

@sffc sffc modified the milestones: 2021-Q1-m1, 2021-Q1-m2 Feb 4, 2021
@zbraniecki
Copy link
Member

Would SetTrie be a good entry point for this - https://github.com/KaiserKarel/set-trie ?

@kpozin
Copy link
Contributor

kpozin commented Feb 16, 2021

Would SetTrie be a good entry point for this - https://github.com/KaiserKarel/set-trie ?

Unfortunately, no, if we want to maintain backward compatibility.

The main complication is that we're trying to achieve perfect compatibility with an existing data format that does not have a formal specification. The C++/Java code is the spec. The existing code relies heavily on data and method inheritance, reuses a single set of mutable builder classes for several discrete building stages, and even mutates built Tries while traversing them. As you can imagine, these design decisions make the design exceptionally Rust-unfriendly. In writing a port, I've started the whole thing over several times, with varying degrees of Rust-iness and Java-ness in my design.

Moreover, all the logic is closely coupled together in such a way that unit testing individual methods or build stages is impractical (there do not appear to be any unit tests in the legacy code); one basically has to port the entire conglomeration, both Builder and Trie, and then port the integration tests one by one to discover what's broken. This is the stage I'm at now -- I have a full port of most of the production code, am porting tests, and learning the numerous ways in my port is utterly broken.

@kpozin
Copy link
Contributor

kpozin commented Feb 16, 2021

I'm aware of the sunk costs (I've sunk a fair portion of them :), but as I proposed at the beginning of this project, we would be much better off finding a standard, well-specified trie format and implementing that, instead of continuing down this road.

@zbraniecki zbraniecki added the discuss Discuss at a future ICU4X-SC meeting label Feb 18, 2021
@markusicu
Copy link
Member

@sffc :

This data structure is fundamental to various pieces of functionality, including case mapping and language matching. We should port this data structure to ICU4X.

Not quite. In ICU, case mapping, normalization, collation, and character properties use “code point tries” which are heavily optimized for lookup by code point, as well as by walking through UTF-16 and UTF-8 code units. Between ca. 2001 and 2018 I came up with three versions of this. I moved some ICU code to the third version, and turned that on its own into public API. I would love to convert code that still uses older versions to the latest one, given time.

Design doc: http://site.icu-project.org/design/struct/utrie

By contrast, BytesTrie and UCharsTrie are designed for lookups from arbitrarily long sequences of bytes (8-bit units) vs. UChars (16-bit units). These are more-classic retrieval trees. You could use them for lookup from single code points if you turn those into sequences of 8-bit or 16-bit units, but these “string tries” are not especially optimized for that. We use them for BreakIterator dictionaries, in collation for contraction matching (after the first code point), and for matching Unicode character property (and value) names.

One of the design points for the “string tries” is to get a byte-serialized or 16-bit-word-serialized sequence that requires no or only trivial byte swapping for little-endian vs. big-endian machines, for use as read-only, memory-mapped data.

If you “abused” a “string trie” for pure code point lookups, I would expect them to be a lot slower than code point tries, but they may be more compact.

Design docs:

(There is another version of tries in the ICU character conversion code, customized in various ways to deal with special issues there. Both code point tries and string tries, since some charsets map multiple Unicode code points to single charset codes.)

@kpozin :

The existing code relies heavily on data and method inheritance, reuses a single set of mutable builder classes for several discrete building stages

Yes. In principle, you could write a builder from scratch, as long as you build the data structure that the runtime understands.

By the way, I would port the Java builder, not the C++ builder. Late in development, I made a significant improvement to the Java builder that I never got around to porting back to C++. (I have an old ICU ticket for that that I keep meaning to get to.)

and even mutates built Tries while traversing them.

That is either a misunderstanding or an overstatement. The built structure is immutable. But of course while you walk the structure you need a mutable iterator, and that's what the Trie classes are. Each iterator object is tiny. It just keeps a pointer into the big array and a little bit of extra state.

Moreover, all the logic is closely coupled together in such a way that unit testing individual methods or build stages is impractical (there do not appear to be any unit tests in the legacy code); one basically has to port the entire conglomeration, both Builder and Trie, and then port the integration tests one by one to discover what's broken.

Sorry. And thanks for your bug report! I intend to fix that for ICU 69; I need to think about how to unit-test the internal function that has the bogus code. I would be happy to entertain PRs for better unit testing!

@markusicu
Copy link
Member

The runtime code for each of these data structures is fairly small. It might work well for ICU4X to have offline builder code calling ICU4J or ICU4C that writes Rust code with initializer lists. We already have some C++ code that writes C and Java initializers, especially for code point tries.

@sffc
Copy link
Member Author

sffc commented Feb 24, 2021

Summary of conclusions form a meeting with me, @kpozin, @markusicu, @echeran, @zbraniecki, @dminor:

  • CodePointTrie (aka UCPTrie) is the right tool for selecting enumerated properties (one code point mapping to one of several enumerated values)
  • BytesTrie is useful for likely subtags, etc., but probably the wrong tool for enumerated properties
  • Both CodePointTrie and BytesTrie have a binary representation, which I refer to as an "ArrayBuffer"
  • Enumerated property selection is what's needed for segmentation
  • We could reinvent the wheel in ICU4X by building our own data structures, but I'm convinced that we should build on the shoulders of giants and use the same data structures as ICU4C/ICU4J
  • It's not a good use of our time to write the BytesTrie/CodePointTrie builder code in ICU4X, which is what @kpozin was attempting to do above

Next steps:

  • Continue with Add PPUCD enumerated property parsing #448, which may be useful for regular expression performance.
  • Manually extract the ArrayBuffers from the ICU4C/ICU4J tests as a one-time job, and use them as input to ICU4X unit tests.
  • Add a tool to ICU4C that creates JSON artifacts containing ArrayBuffers for enumerated property selection; the artifacts should be versioned and shipped alongside an ICU release. Assignee: undetermined

@echeran
Copy link
Contributor

echeran commented Feb 24, 2021

Here are the full meeting notes for yesterday's meeting.

@sffc
Copy link
Member Author

sffc commented Feb 25, 2021

CodePointTrie is split off into #508.

@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Mar 4, 2021
@sffc sffc modified the milestones: 2021-Q1-m2, 2021-Q2-m1 Mar 12, 2021
@sffc
Copy link
Member Author

sffc commented Apr 19, 2021

Note: Design of language matching is in #173

@sffc sffc modified the milestones: 2021-Q2-m1, ICU4X 0.3 May 7, 2021
@sffc
Copy link
Member Author

sffc commented May 13, 2021

See unicode-org/icu#1660 for how we're exporting CodePointTrie data from ICU4C into toml files.

@sffc sffc modified the milestones: ICU4X 0.3, 2021 Q2-m3 May 13, 2021
@sffc
Copy link
Member Author

sffc commented Jul 21, 2021

Blocks #842

@sffc sffc added the good first issue Good for newcomers label Jul 24, 2021
@sffc sffc removed this from the 2021 Q3-m1 milestone Aug 12, 2021
@sffc sffc added backlog help wanted Issue needs an assignee S-medium Size: Less than a week (larger bug fix or enhancement) labels Aug 12, 2021
@sffc sffc assigned sffc and unassigned kpozin Aug 24, 2021
@sffc sffc removed the help wanted Issue needs an assignee label Aug 24, 2021
@sffc
Copy link
Member Author

sffc commented Aug 24, 2021

@fabura is going to take a look at this.

@sffc
Copy link
Member Author

sffc commented Aug 24, 2021

Take a look at:

Consider writing code under experimental/bytestrie. Consider copying boilerplate (Cargo.toml, LICENSE, src/lib.rs) from codepointtrie.

@makotokato
Copy link
Member

I have tiny version of ICU4C/ICU4J compatible bytetrie/uchartrie (https://github.com/makotokato/dictionary_segmenter/blob/main/src/bytes_trie.rs and https://github.com/makotokato/dictionary_segmenter/blob/main/src/uchars_trie.rs) since I need this to use dictionary segmenter of ICU4C/ICU4J.

@dminor dminor added the discuss-priority Discuss at the next ICU4X meeting label Jan 19, 2022
@dminor
Copy link
Contributor

dminor commented Jan 19, 2022

Makoto will need this for the segmenter. We should find an owner. It should be straightforward based upon the Char16Trie implementation and Makoto's implementation mentioned above.

@sffc
Copy link
Member Author

sffc commented Jan 19, 2022

Makoto will need this for the segmenter.

Could you clarify? My understanding is that BytesTrie and CharsTrie are largely interchangeable. If we build our own data, then we can build it as either BytesTrie or CharsTrie; either should work.

@dminor
Copy link
Contributor

dminor commented Jan 19, 2022

He was concerned about data size if we used CharsTrie... if that's not a valid concern, then that is good to know.

@markusicu
Copy link
Member

I co-developed BytesTrie and CharsTrie at the same time. In principle, they work almost the same.

  • BytesTrie works with input sequences of bytes (u8), and is itself stored as a sequence of bytes.
  • CharsTrie works with input sequences of 16-bit words (u16), and is itself stored as a sequence of 16-bit words.
    They both output the same 4-way result enum, and an optional 32-bit value.

BytesTrie is more natural when the input is already a sequence of bytes, or is naturally encoded as such. In ICU, I use it for Unicode property aliases and for Thai/Khmer/Burmese dictionaries; the latter are small scripts and use a custom mapping from Unicode code points to single bytes.

CharsTrie is more natural when the input units are larger. In ICU, I use it for CJK dictionaries and collation contraction lookup, encoding the input as UTF-16 where that's not already the case.

The storage of these "string tries" uses variable-width encodings of result values and internal offsets. The bigger the values and offsets, the more storage units need to be stored and decoded. My hunch is that for small amounts of data a BytesTrie should be smaller, and for large amounts of data a CharsTrie should be faster.

Also, using a BytesTrie for large input units means you need to convert your inputs to multiple bytes (more bytes than the number of 16-bit words you would need), which adds some encoding cost and requires a larger number of lookups.

I have not done detailed benchmarking of one vs. the other for a variety of types and volumes of data.

@sffc
Copy link
Member Author

sffc commented Jan 20, 2022

My preference is to start with CharsTrie, since we already have it implemented, and since I understand it to be the simpler of the two. We should use BytesTrie only if we've tried using CharsTrie and found it to be too large or slow for the use case.

@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Feb 3, 2022
@sffc
Copy link
Member Author

sffc commented Nov 10, 2022

See #1155 for an early attempt at this.

@sffc sffc added this to the Backlog milestone Dec 22, 2022
@sffc sffc removed the backlog label Dec 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-unicode Component: Props, sets, tries good first issue Good for newcomers S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality
Projects
None yet
Development

No branches or pull requests

7 participants