Perform normalization performance evaluation between Rust and ICU #93
The current procedure plan is:
…
Question marks represent things I'm unsure of:
- How to "synthesize a document to NFC"?
- Is there any other step or is the result text the input for our test?
- What to do with the Korean text?
- What operation should I perform from ICU and what from the Rust normalization crate?

@hsivonen, @Manishearth, @sffc, @nciric - feedback and help?
I think we should just get text examples, and: … for everything. For everything but Vietnamese, the text is probably already in NFC form. The Rust crate has extension …
Let me elaborate on what I said before: for perf testing we should use representative data. That means ideally a representative sample for each of the top world languages, then weighting the resulting durations by the percentage of web text in that language (approximately, per https://w3techs.com/technologies/history_overview/content_language/ms/y).
Anyway, a test with Vietnamese text, in the format used on the web, would be interesting, but we have to keep in mind that the majority of text will already be in NFC, so the speed with text that is already normalized will be crucial.
Caveats of course, since there are diminishing returns on the effort to collect that data, and other domains may vary, e.g. texting.
So you should also test English, and a few other languages at the top of that list, as proxies for the whole list.
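To make the weighting step concrete, here is a minimal Rust sketch of the aggregation; the language shares and durations are placeholders for illustration, not the real w3techs numbers or any measured results.

```rust
/// Weighted mean of per-language benchmark durations, weighted by each
/// language's share of web content. `results` holds (language, share, ns).
fn weighted_duration(results: &[(&str, f64, f64)]) -> f64 {
    let total: f64 = results.iter().map(|(_, share, _)| share).sum();
    results.iter().map(|(_, share, ns)| share * ns).sum::<f64>() / total
}

fn main() {
    // Placeholder shares and timings, for illustration only.
    let results = [("en", 60.0, 1200.0), ("vi", 1.4, 2100.0), ("cs", 1.0, 1500.0)];
    println!("weighted duration: {:.1} ns", weighted_duration(&results));
}
```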
On Thu, May 14, 2020, 20:19 Zibi Braniecki wrote:
Question marks represent things I'm unsure of:

- How to "synthesize a document to NFC"?

Not sure what that means either.

- Is there any other step or is the result text the input for our test?
- What to do with the Korean text?

Less than 1%, so other languages would be better.

- What operation should I perform from ICU and what from the Rust normalization crate?

I'd recommend:
- Convert to NFC
- Test if NFC
…
@hsivonen, @Manishearth, @sffc, @nciric - feedback and help?
Thank you all! Updated the plan to list languages and 4 benchmarks. Let me know if that sounds good.
FWIW, for encoding_rs, I chose the page for Mars, the planet (for the reasons documented: https://github.com/hsivonen/encoding_bench#selection-of-test-data). (Get plain-text versions from the encoding_bench repo at https://github.com/hsivonen/encoding_bench/tree/master/src/wikipedia, links to Wikipedia revisions in https://github.com/hsivonen/encoding_bench/blob/master/src/wikipedia/sources.txt, license at https://github.com/hsivonen/encoding_bench/blob/master/LICENSE-CC-BY-SA; not suited for checking into the ICU4X repo; also note the Czech version has GFDL portions.)
For Vietnamese, please also generate a starting point that corresponds to the keyboard layout, which is neither NFC nor NFD. To get this form, first normalize to NFC and then apply the detone crate (https://crates.io/crates/detone) with orthographic set to true.
I suggest using the unic_normal crate. testdet has a bit of code (https://github.com/hsivonen/testdet/blob/3f2629c9814290603e995be3184e79b4c349b5a7/src/main.rs#L701-L703) that generates as-if-from-keyboard Vietnamese using unic_normal and detone.
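A minimal sketch of that pipeline, assuming detone's iterator adapter `decompose_vietnamese_tones` (as in the linked testdet code) and using unicode-normalization's trait in place of unic_normal, which offers an equivalent `.nfc()`:

```rust
// Sketch only: assumes detone's IterDecomposeVietnamese adapter.
use detone::IterDecomposeVietnamese;
use unicode_normalization::UnicodeNormalization;

/// Produce the as-if-from-keyboard Vietnamese form: NFC first, then an
/// orthographic tone decomposition, which is neither NFC nor NFD.
fn as_if_from_keyboard(text: &str) -> String {
    text.chars()
        .nfc()
        .decompose_vietnamese_tones(true) // true => orthographic form
        .collect()
}
```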
I think the inputs are:
- Original data normalized to NFC (bench normalization to NFC, which should be a no-op, and normalization to NFD)
- Original data normalized to NFD (bench normalization to NFD, which should be a no-op, and normalization to NFC)
- Additionally, for Vietnamese only: original data normalized to NFC followed by orthographic decomposition (bench normalization to NFC and NFD)

For unic_normal you'd do input.chars().nfc().collect::<String>(), which has fundamentally different allocation behavior from ICU4C. You could also bench input.chars().nfc().count() to estimate the allocation and memory write cost. For ICU4C, the relevant function seems to be unorm2_normalize (https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unorm2_8h.html#a0a596802db767da410b4b04cb75cbc53), which writes to a caller-allocated buffer.
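A sketch of those two Rust measurement variants, shown with unicode-normalization's trait (unic_normal exposes an equivalent `.nfc()`):

```rust
use unicode_normalization::UnicodeNormalization;

/// Full pipeline: normalization plus output allocation and memory writes,
/// which is what real callers pay.
fn nfc_collect(input: &str) -> String {
    input.chars().nfc().collect::<String>()
}

/// Same iterator, but only counting chars: comparing this against the
/// collect timing estimates the allocation and write cost.
fn nfc_count(input: &str) -> usize {
    input.chars().nfc().count()
}
```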
Sadly, there doesn't appear to be enough API surface to test the two engines in a commensurable way in terms of allocation behavior.
On the Rust side, if you set the capacity of the output String …
Using …
On the output side, AFAICT, you'd need a bit of custom code around …
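One way to approximate ICU4C's caller-allocated-buffer model on the Rust side is to pre-size the output; a sketch follows (the input.len() capacity is a heuristic, since NFC output can occasionally be longer than the input):

```rust
use unicode_normalization::UnicodeNormalization;

/// Normalize into a pre-sized String so that extending it does not
/// reallocate in the common case.
fn nfc_preallocated(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    out.extend(input.chars().nfc());
    out
}
```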
I don't think normalizing from NFD to NFC is a common use case. I'd recommend (where "wild" is representative of what's on the web):
- wild => NFC
- wild => NFD
- isNFC(wild)
- isNFD(wild)

I'm a bit leery of simulating what we *think* Vietnamese might be on the web at this point in time. However, vi is only 1.4% of the total weight, so it's fine as long as we weight it appropriately.
Mark
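A minimal Rust sketch of those four operations, assuming the unicode-normalization crate (its is_nfc/is_nfd quick checks avoid allocation; unic_normal has analogous APIs):

```rust
use unicode_normalization::{is_nfc, is_nfd, UnicodeNormalization};

fn four_benchmarks(wild: &str) {
    // wild => NFC and wild => NFD: allocate and write the normalized output.
    let nfc: String = wild.chars().nfc().collect();
    let nfd: String = wild.chars().nfd().collect();
    // isNFC(wild) and isNFD(wild): read-only quick checks.
    let already_nfc = is_nfc(wild);
    let already_nfd = is_nfd(wild);
    println!("{} {} {} {}", nfc.len(), nfd.len(), already_nfc, already_nfd);
}
```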
Btw, nice page explaining why to pick the Wikipedia Mars article...
With help from @hsivonen and @Manishearth I completed the initial test harness for the normalization crate.

Explainer

Due to differences between UTF-8 and UTF-16 handling and their impact on the nature of the operation, both the C and Rust code are measured within the Rust crate, both in an example binary and in a Criterion benchmark. For the sake of sanity, I also verified on a very small subset that a C++ app gives comparable numbers for ICU. The Criterion benchmark is more stable, but a single run of the binary in …

For the performance review I used Henri's encoding test suite, which uses Wikipedia articles for a given set of locales. As for locales, I used the following set: …

For the sake of completeness I compared UTF-16 operations on NFC and NFD, and for …

Notice: I did not validate the output of any of the methods. I exclusively compared the performance of the calls based on the given input.

Results
…
To reproduce, clone the intl-measurements repo …

Summary
…
I encourage everyone to evaluate the source code to validate the code used for the benchmarks.
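For reference, a Criterion benchmark for one of these operations would look roughly like this; the function names and the fixture path are illustrative, not the actual intl-measurements code:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use unicode_normalization::UnicodeNormalization;

fn bench_nfc(c: &mut Criterion) {
    // Hypothetical plain-text fixture extracted from a Wikipedia article.
    let input = include_str!("../data/vi.txt");
    c.bench_function("nfc_vi_utf8", |b| {
        b.iter(|| black_box(input).chars().nfc().collect::<String>())
    });
}

criterion_group!(benches, bench_nfc);
criterion_main!(benches);
```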
@Manishearth @hsivonen @sffc - can I get your thumbs up on the results? I'd like to close this issue and get back to #40 once we confirm the validity of the results. We may also want to report the results to the https://github.com/unicode-rs/unicode-normalization authors for evaluation (@Manishearth - can you facilitate this?)
Hmm, that's surprising; worth checking out what ICU does to get such a perf win. unicode-normalization has not been vetted much for perf either. Either way, it seems like a from-scratch impl following the design of ICU might be better.
Oh, I see why: the tables are generated as giant match blocks instead of the cached binary-search tables that most of the other Unicode crates use. I'd kind of assumed this was the default for all unicode-rs crates.
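An illustrative contrast between the two table styles, with toy data (not the actual generated tables in either crate):

```rust
use std::cmp::Ordering;

// Style A: per-character data compiled into a giant match, as in the
// unicode-normalization tables at the time.
fn combining_class_match(c: char) -> u8 {
    match c {
        '\u{0300}'..='\u{0314}' => 230,
        '\u{0316}'..='\u{0319}' => 220,
        _ => 0,
    }
}

// Style B: binary search over a sorted static range table, as in most
// other unicode-rs crates.
static CCC_RANGES: &[(u32, u32, u8)] = &[(0x0300, 0x0314, 230), (0x0316, 0x0319, 220)];

fn combining_class_bsearch(c: char) -> u8 {
    let cp = c as u32;
    CCC_RANGES
        .binary_search_by(|&(lo, hi, _)| {
            if hi < cp {
                Ordering::Less
            } else if lo > cp {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .map(|i| CCC_RANGES[i].2)
        .unwrap_or(0)
}
```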
This is now completed.
Summary of the decision: We see a major performance difference in favor of ICU4C. Manish believes that the issue he filed would close the gap, but said that between rewriting and fixing … So, whoever takes this on can decide which route to take, and the harness is available to retest against a new target.
I think we need to discuss the data loading issue with normalization. Would refitting the existing crate with a data provider interface be harder than writing one from scratch? As Manish said, normalization code is not very complex.
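To make the refit-vs-rewrite question concrete, the data provider interface might look something like the hypothetical trait below; this is not an actual ICU4X API, just a sketch of the shape the normalization data would need.

```rust
/// Hypothetical provider surface for normalization data; a refit would route
/// the crate's table lookups through something like this instead of
/// compiled-in statics.
trait NormalizationDataProvider {
    /// Canonical decomposition for a character, if it has one.
    fn canonical_decomposition(&self, c: char) -> Option<&[char]>;
    /// Canonical combining class (0 for starters).
    fn combining_class(&self, c: char) -> u8;
    /// Primary composite for a pair, per the canonical composition algorithm.
    fn compose(&self, starter: char, combining: char) -> Option<char>;
}
```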
The proposed experiment will allow us to get a basic idea of the relative performance of normalization libraries, comparing ICU and the Rust normalization crate.
I will use the https://github.com/zbraniecki/intl-measurements/ harness, building two example apps that use micro-timing intervals to measure the time used by each.
On top of that, I'll plug the Rust test into Criterion for more detailed measurements.
If the results warrant further testing, it may then be useful to write a basic FFI for ICU4C to plug it into Criterion.
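If that FFI step happens, a minimal binding could look like the sketch below; the extern declarations mirror unorm2.h, though real builds must account for ICU's versioned symbol names (e.g. unorm2_normalize_67 on some platforms) and proper UErrorCode handling.

```rust
// Sketch of a minimal ICU4C binding; link against icuuc. UChar is UTF-16,
// and UErrorCode is represented here as a plain i32 (0 = U_ZERO_ERROR).
#[repr(C)]
pub struct UNormalizer2 {
    _private: [u8; 0],
}

extern "C" {
    fn unorm2_getNFCInstance(err: *mut i32) -> *const UNormalizer2;
    fn unorm2_normalize(
        norm2: *const UNormalizer2,
        src: *const u16,
        length: i32,
        dest: *mut u16,
        capacity: i32,
        err: *mut i32,
    ) -> i32;
}

/// Normalize UTF-16 input into a caller-allocated buffer; returns the length
/// written (or required, if the buffer was too small).
fn nfc_utf16(src: &[u16], dest: &mut [u16]) -> i32 {
    let mut err = 0i32;
    unsafe {
        let nfc = unorm2_getNFCInstance(&mut err);
        unorm2_normalize(
            nfc,
            src.as_ptr(),
            src.len() as i32,
            dest.as_mut_ptr(),
            dest.len() as i32,
            &mut err,
        )
    }
}
```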