
Shrink Unicode tables (even more) #70486

Merged 14 commits into rust-lang:master from Mark-Simulacrum:unicode-shrink on Mar 28, 2020
Conversation

@Mark-Simulacrum (Member) commented Mar 27, 2020

This shrinks the Unicode tables further, building upon the wins in #68232 (the previous
counts differ due to an interim Unicode version update; see #69929).
The new data structure is around 3x slower on a benchmark that sequentially looks up
every Unicode scalar value in each included data set. Note that for ASCII, the exposed
functions on char optimize to direct branches, so ASCII retains the same performance
regardless of internal optimizations (or regressions). Also note that the size reduction
due to the skip list (where the performance loss comes from) is around 40%; as a result,
I believe the performance loss is acceptable, as the routines are still quite fast.
Anywhere this is hot should probably use a custom data structure anyway (e.g., a raw
bitset, or something optimized for frequently seen values).

This PR updates the bitset data structure and introduces a new data structure similar to
a skip list. For more details, see the main.rs of the table generator, which describes
both. The commits mostly work individually and document size wins. (A simplified sketch
of the run-based lookup follows.)
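
As a rough illustration (not the generator's actual code), the run-based structure can be
viewed as a sorted list of code points at which set membership flips; the real tables
additionally pack run lengths and prefix sums, as described in main.rs:

```rust
/// Membership test against a sorted list of "flip" boundaries: the needle is
/// in the set iff an odd number of boundaries lie at or below it.
fn lookup(needle: u32, runs: &[u32]) -> bool {
    let idx = match runs.binary_search(&needle) {
        Ok(i) => i + 1, // an exact boundary hit counts as crossed
        Err(i) => i,    // number of boundaries strictly below the needle
    };
    idx % 2 == 1
}
```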

As before, this is tested on all valid chars to have the same results as nightly (and the
canonical Unicode data sets); happily, no bugs were found.

Set                Previous    New   % of old   Codepoints   Ranges
Alphabetic             3055   1599        52%       132875      695
Case Ignorable         2136    949        44%         2413      410
Cased                   934    359        38%         4286      141
Cc                       43      9        20%           65        2
Grapheme Extend        1774    813        46%         1979      344
Lowercase               985    867        88%         2344      652
N                      1266    419        33%         1781      133
Uppercase               934    777        83%         1911      643
White_Space             140     37        26%           25       10
----------------   --------   ----   --------   ----------   ------
Total                 11267   5829        51%            -        -

If the unicode-downloads folder already exists, we likely just fetched the data, so don't
make any further network requests. Unicode versions are released rarely enough that this
doesn't matter much in practice.

Try chunk sizes between 1 and 64, selecting the one which minimizes the number of bytes
used. 16, the previous constant, turned out to be a rather good choice, with 5/9 of the
data sets still using it. (A sketch of this search appears after the size listing below.)

Alphabetic     : 3036 bytes    (- 19 bytes)
Case_Ignorable : 2136 bytes
Cased          : 934 bytes
Cc             : 32 bytes      (- 11 bytes)
Grapheme_Extend: 1774 bytes
Lowercase      : 985 bytes
N              : 1225 bytes    (- 41 bytes)
Uppercase      : 934 bytes
White_Space    : 97 bytes      (- 43 bytes)
Total table sizes: 11153 bytes (-114 bytes)
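
A hedged sketch of that chunk-size search, using a deliberately simplified cost model
(one index byte per chunk, plus 8 bytes per word in each deduplicated chunk); the real
generator in main.rs accounts for the encoding more precisely:

```rust
use std::collections::HashSet;

// Size, in bytes, of a chunked bitset encoding under the simplified model.
fn encoded_size(words: &[u64], chunk: usize) -> usize {
    // Identical chunks are stored once; the index references them by position.
    let unique: HashSet<&[u64]> = words.chunks(chunk).collect();
    words.chunks(chunk).count() + unique.iter().map(|c| c.len() * 8).sum::<usize>()
}

// Try every chunk size from 1 to 64 and keep the one minimizing total bytes.
fn best_chunk_size(words: &[u64]) -> usize {
    (1..=64usize).min_by_key(|&chunk| encoded_size(words, chunk)).unwrap()
}
```

Exhaustively trying all 64 sizes is cheap because the data sets are small and the search
runs only at table-generation time.
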
Currently the test file takes a while to compile -- 30 seconds or so -- but since it's
not going to be committed and is just for local testing, that seems fine.

This avoids wasting a small amount of space for some of the data sets.

The chunk resizing is caused by, but not directly related to, the changes in this commit.

Alphabetic     : 3036 bytes
Case_Ignorable : 2133 bytes    (- 3 bytes)
Cased          : 934 bytes
Cc             : 32 bytes
Grapheme_Extend: 1760 bytes    (-14 bytes)
Lowercase      : 985 bytes
N              : 1220 bytes    (- 5 bytes)
Uppercase      : 934 bytes
White_Space    : 97 bytes
Total table sizes: 11131 bytes (-22 bytes)

Previously, all words in the (deduplicated) bitset would be stored raw -- a full 64 bits
(8 bytes). Now, words that are equivalent to others under a specific mapping are stored
separately and "mapped" back to the base word when loading; this shrinks the table sizes
significantly, as each mapped word is stored in 2 bytes (a 4x decrease from the previous
encoding).

The new encoding is also potentially non-optimal: the "mapped" byte is frequently
repeated, as in practice many mapped words use the same base word.

Currently we only support two forms of mapping: rotation and inversion. Note that these
are both guaranteed to map transitively if they map at all, and supporting mappings for
which this is not true may require a more interesting algorithm for choosing the optimal
pairing.
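
A hedged sketch of the mapping idea (names and the exact 2-byte layout here are
illustrative, not the generator's actual format):

```rust
/// Reconstruct a mapped word from its base: rotate by a small amount and/or
/// invert all bits. Both operations are cheap to apply when loading a word.
fn apply_mapping(base: u64, rotate_by: u32, invert: bool) -> u64 {
    let rotated = base.rotate_left(rotate_by);
    if invert { !rotated } else { rotated }
}
```

For example, a word of all ones is just the all-zero word inverted, so it needs no raw
8-byte entry of its own.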

Updated sizes:

Alphabetic     : 2622 bytes     (-  414 bytes)
Case_Ignorable : 1803 bytes     (-  330 bytes)
Cased          : 808 bytes      (-  126 bytes)
Cc             : 32 bytes
Grapheme_Extend: 1508 bytes     (-  252 bytes)
Lowercase      : 901 bytes      (-   84 bytes)
N              : 1064 bytes     (-  156 bytes)
Uppercase      : 838 bytes      (-   96 bytes)
White_Space    : 91 bytes       (-    6 bytes)
Total table sizes: 9667 bytes   (-1464 bytes)

This saves fewer bytes -- by far -- and is likely not the best operator to choose. But
for now, it works; a better choice may arise later.

Alphabetic     : 2538 bytes   (- 84 bytes)
Case_Ignorable : 1773 bytes   (- 30 bytes)
Cased          : 790 bytes    (- 18 bytes)
Cc             : 26 bytes     (-  6 bytes)
Grapheme_Extend: 1490 bytes   (- 18 bytes)
Lowercase      : 865 bytes    (- 36 bytes)
N              : 1040 bytes   (- 24 bytes)
Uppercase      : 778 bytes    (- 60 bytes)
White_Space    : 85 bytes     (-  6 bytes)
Total table sizes: 9385 bytes (-282 bytes)

This ensures that what we test is what we get for final results as well.

This optimizes slightly better.

Alphabetic     : 2536 bytes
Case_Ignorable : 1771 bytes
Cased          : 788 bytes
Cc             : 24 bytes
Grapheme_Extend: 1488 bytes
Lowercase      : 863 bytes
N              : 1038 bytes
Uppercase      : 776 bytes
White_Space    : 83 bytes
Total table sizes: 9367 bytes  (-18 bytes; 2 bytes per set)

We find that it is common for large ranges of chars to be false -- which means it is
plausibly common for us to ask about a word that is entirely empty. Therefore, we should
make sure that we do not need to rotate bits or otherwise perform some operation to map
to the zero word; canonicalize it first if possible.
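
Under the two supported mappings, only two words are reachable from the all-zero base
(rotation leaves 0 unchanged; inversion yields !0) -- a minimal sketch of the resulting
check:

```rust
/// True if `word` can use the all-zero word as its base: rotating 0 by any
/// amount yields 0, and inverting it yields !0, so both decode with no work.
fn maps_from_zero(word: u64) -> bool {
    word == 0 || word == !0
}
```
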
LLVM seems to at least sometimes optimize better when the length comes directly
from the `len()` of the array vs. an equivalent integer.

Also, this allows easier copy/pasting of the function into compiler explorer for
experimentation.
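
A hedged illustration of that pattern (purely illustrative, not the actual function):
taking the table as a fixed-size array reference, so the bound comes from the type rather
than from a separately passed integer:

```rust
/// The bound `N` comes directly from the array type, so `table.len()` is a
/// compile-time constant at the use site -- the shape that, per the note
/// above, LLVM seems to optimize better than an equivalent loose integer.
fn last_le_index<const N: usize>(needle: u32, table: &[u32; N]) -> usize {
    match table.binary_search(&needle) {
        Ok(i) => i + 1,
        Err(i) => i,
    }
}
```
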
This arranges for the sparser sets (everything except lowercase and uppercase) to be
encoded in a significantly smaller form. However, it is also a performance trade-off
(roughly 3x slower than the bitset encoding). The 40% size reduction is deemed
sufficiently important to merit this performance loss, particularly as it is unlikely
that this code is hot anywhere (and if it is, paying the memory cost for a bitset that
directly represents the data seems worthwhile).

Alphabetic     : 1599 bytes     (- 937 bytes)
Case_Ignorable : 949 bytes      (- 822 bytes)
Cased          : 359 bytes      (- 429 bytes)
Cc             : 9 bytes        (-  15 bytes)
Grapheme_Extend: 813 bytes      (- 675 bytes)
Lowercase      : 863 bytes
N              : 419 bytes      (- 619 bytes)
Uppercase      : 776 bytes
White_Space    : 37 bytes       (-  46 bytes)
Total table sizes: 5824 bytes   (-3543 bytes)
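
For intuition, a hedged sketch of how such a run-based table can be derived from a
property's sorted, non-overlapping ranges (illustrative only; the real encoding packs
these boundaries further):

```rust
/// Flatten sorted, non-overlapping half-open `(start, end)` ranges into the
/// list of code points at which membership flips -- the input assumed by the
/// boundary-flip lookup sketched in the PR description above.
fn boundaries(ranges: &[(u32, u32)]) -> Vec<u32> {
    ranges.iter().flat_map(|&(start, end)| [start, end]).collect()
}
```
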
In practice, for the two data sets that still use the bitset encoding (uppercase and
lowercase), this is not a significant win, so just drop it entirely. It costs us about
5 bytes, and the complexity is nontrivial.
@rust-highfive (Collaborator) commented:

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive added the S-waiting-on-review label on Mar 27, 2020
@Mark-Simulacrum (Member, Author) commented:

r? @dtolnay perhaps?

cc @jamesmunns once more for embedded

Also cc @thomcc from BurntSushi/ucd-generate#30

@joshtriplett also brought up in #68232 (comment) that we may just want to jettison the grapheme_extend table (and its only user, the debug printing for chars/strs). I am personally ambivalent -- certainly, for strings which are non-ASCII, it can be a big win to avoid the extensive \u{...} escapes in printouts. But I don't know how to get data on how common that is. Certainly we intentionally made the choice to use Grapheme_Extend; see the discussion on #49283, which I believe is the most recent extensive discussion relating to this.

@dtolnay (Member) left a comment:

Thanks, this is terrific. I agree that these lookups are fast enough that optimizing for size is the important thing.

Is there a way to measure the impact of the more complicated lookup logic on the binary size? In this case it doesn't look like the difference would be more than ~100 bytes, but it's worth taking into account next to the raw table sizes. For example, if we shrink grapheme_extend by 100 bytes but grow the search code it requires by 200 bytes, that would not be smart.

@dtolnay (Member) commented Mar 28, 2020

@bors r+

@bors (Contributor) commented Mar 28, 2020

📌 Commit ad679a7 has been approved by dtolnay

@bors added the S-waiting-on-bors label and removed the S-waiting-on-review label on Mar 28, 2020
@joshtriplett (Member) commented:

This looks good to me as well.

@Mark-Simulacrum Yes, I'd still like to see that change made, to save the extra 813 bytes (plus a bit for the function using it).

@Mark-Simulacrum (Member, Author) commented:

It looks like the lookup logic increased in size for pretty much everything -- roughly doubling. It's still pretty small, but for the sets which had a small delta (100-200 bytes or so), most of that was eaten up by the lookup logic; I think most sets shrunk by enough that it doesn't matter too much. In any case, for the most important one -- Grapheme_Extend -- the new set is still 56% of the old one.

With that in mind, I think changing things here is probably not worth it, but I'll revisit this if I decide to look into it some more. `nm --print-size --radix=d ./nightly | rustfilt | grep unicode | sed 's@core::unicode::unicode_data::@@'` is what I used to gather these statistics.

Set               Old   New   Delta
Alphabetic        151   341     190
White_Space       146   275     129
Case Ignorable    151   325     174
Grapheme Extend   151   325     174
N                 141   341     200
Cc                 80    18     -62
Cased             146   325     178
Lowercase         146   261     115
Uppercase         146   261     115

bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 28, 2020
Rollup of 5 pull requests

Successful merges:

 - rust-lang#70418 (Add long error explanation for E0703)
 - rust-lang#70448 (Create output dir in rustdoc markdown render)
 - rust-lang#70486 (Shrink Unicode tables (even more))
 - rust-lang#70493 (Fix rustdoc.css CSS tab-size property)
 - rust-lang#70495 (Replace last mention of IRC with Discord)

Failed merges:

r? @ghost
@bors bors merged commit 7f1e626 into rust-lang:master Mar 28, 2020
@Mark-Simulacrum Mark-Simulacrum deleted the unicode-shrink branch February 13, 2022 18:19
A review comment on the merged lookup code:

```rust
// This means that we can avoid bounds checking for the accesses below, too.
let last_idx =
    match short_offset_runs.binary_search_by_key(&(needle << 11), |header| header << 11) {
        Ok(idx) => idx + 1,
        Err(idx) => idx,
    };
```

Can someone explain why these `<< 11`s are needed?
