Perform normalization performance evaluation between Rust and ICU #93

Closed · zbraniecki opened this issue May 15, 2020 · 21 comments
Labels: C-unicode (Component: Props, sets, tries), T-task (Type: Tracking thread for a non-code task)

Comments

@zbraniecki (Member)

The proposed experiment will give us a basic idea of the relative performance of ICU and the Rust normalization crate.

I will use the https://github.com/zbraniecki/intl-measurements/ harness, building two example apps that use microtiming intervals to measure the time spent by each.
On top of that, I'll plug the Rust test into Criterion for more detailed measurements.

If the results warrant further testing, it may then be useful to write a basic FFI for ICU4C so that it can be plugged into Criterion as well.
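A minimal sketch of such a microtiming measurement (illustrative only; time_us and the normalization call are placeholders, and the real harness lives in the intl-measurements repo):

```rust
use std::time::Instant;

/// Time a single operation with a microsecond-resolution interval.
fn time_us<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = f();
    println!("{}: {} us", label, start.elapsed().as_micros());
    result
}

fn main() {
    let input = "example input"; // the harness would read a full document here
    let _out = time_us("nfc", || input.to_string()); // placeholder for the normalization call
}
```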

zbraniecki self-assigned this May 15, 2020
@zbraniecki (Member Author)

This is related to #40 and #66.

@zbraniecki (Member Author) commented May 15, 2020

The current procedure plan is:

  • Build sample set
    • Select a Wikipedia page of a substantial length
    • Languages: Vietnamese, Korean, English, French, German, Russian, Spanish
  • For each document
    • Convert to NFC
    • Convert to NFD
    • Convert the text pre-converted to NFD back to NFC
    • Convert the text pre-converted to NFC back to NFD

@zbraniecki (Member Author)

A few things I'm unsure of:

  • How to "synthesize a document to NFC"?
  • Is there any other step or is the result text the input for our test?
  • What to do with the Korean text?
  • Which operations should I perform with ICU and which with the Rust normalization crate?

@hsivonen, @Manishearth, @sffc, @nciric - feedback and help?

@Manishearth (Member)

I think we should just get text examples, and:

  • benchmark conversion to NFC
  • benchmark conversion to NFD
  • benchmark conversion of NFC form to NFD
  • benchmark conversion of NFD form to NFC

for each of them. For everything but Vietnamese, the text is probably already in NFC form.

The Rust crate has extension methods nfc() and nfd(), which return character iterators: https://docs.rs/unicode-normalization/0.1.12/unicode_normalization/trait.UnicodeNormalization.html
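For reference, a minimal sketch of what using that iterator API looks like (the input string here is just an example):

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    let input = "Việt Nam"; // example input; any &str works
    // nfc()/nfd() adapt the characters into a normalizing iterator;
    // collect() materializes the result into a String.
    let nfc: String = input.nfc().collect();
    let nfd: String = input.nfd().collect();
    println!("NFC: {} bytes, NFD: {} bytes", nfc.len(), nfd.len());
}
```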

@macchiati (Member) commented May 15, 2020 via email

@zbraniecki (Member Author)

Thank you all!

Updated the plan to list the languages and the 4 benchmarks. Let me know if that sounds good.

@hsivonen (Member)

> Select a Wikipedia page of a substantial length

FWIW, for encoding_rs I chose the page for Mars, the planet (for the reasons documented). Plain-text versions are available from the encoding_bench repo, which links to the Wikipedia revisions and the license (not suited for checking into the ICU4X repo; also note that the Czech version has GFDL portions).

> For each document

For Vietnamese, please also generate a starting point that corresponds to the keyboard layout, which is neither NFC nor NFD. To get this form, first normalize to NFC and then apply the detone crate with orthographic set to true.

> How to "synthesize a document to NFC"?

I suggest using the unic_normal crate. testdet has a bit of code that generates as-if-from-keyboard Vietnamese using unic_normal and detone.
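A rough sketch of that generation step, assuming unic_normal's StrNormalForm trait and detone's IterDecomposeVietnamese trait (testdet has the authoritative version):

```rust
use detone::IterDecomposeVietnamese; // assumed trait/method names; see the detone docs
use unic_normal::StrNormalForm;

/// Produce the as-if-typed Vietnamese form: NFC first,
/// then orthographic tone decomposition.
fn keyboard_form(input: &str) -> String {
    input
        .nfc() // unic_normal: normalize to NFC
        .decompose_vietnamese_tones(true) // detone: true selects the orthographic form
        .collect()
}
```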

@hsivonen (Member)

> Is there any other step or is the result text the input for our test?

I think the inputs are:

  • Original data normalized to NFC (bench normalization to NFC, which should be no-op, and normalization to NFD)
  • Original data normalized to NFD (bench normalization to NFD, which should be no-op, and normalization to NFC)
  • Additionally, for Vietnamese only: original data normalized to NFC followed by orthographic decomposition (bench normalization to NFC and to NFD)

> Which operations should I perform with ICU and which with the Rust normalization crate?

For unic_normal you'd do input.chars().nfc().collect::<String>(), which has fundamentally different allocation behavior from ICU4C. You could also bench input.chars().nfc().count() to estimate the allocation and memory-write cost by comparison. For ICU4C, the relevant function seems to be unorm2_normalize, which writes to a caller-allocated buffer.
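A sketch of those two measurement variants with Criterion (the bench names and data path are illustrative, and the unic_normal trait import is an assumption):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use unic_normal::StrNormalForm; // assumed import for .nfc() on char iterators

fn nfc_benches(c: &mut Criterion) {
    let input = include_str!("../data/sample.txt"); // hypothetical path to a sample document
    // Allocating variant: collect() builds a String, so allocation is part of the timing.
    c.bench_function("nfc_collect", |b| {
        b.iter(|| black_box(input).chars().nfc().collect::<String>())
    });
    // Non-allocating variant: count() drives the iterator without writing output.
    c.bench_function("nfc_count", |b| {
        b.iter(|| black_box(input).chars().nfc().count())
    });
}

criterion_group!(benches, nfc_benches);
criterion_main!(benches);
```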

@hsivonen (Member)

Sadly, there doesn't appear to be enough API surface to test the two engines in a commensurable way in terms of allocation behavior.

@hsivonen (Member)

On the Rust side, if you set the capacity of the String ahead of time and then extend it from an iterator, the allocation behavior matches ICU4C, and then UTF-8 vs. UTF-16 remains as the difference.
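A minimal sketch of that pre-sized approach (the capacity choice here is a naive guess; a careful harness would size the buffer exactly):

```rust
use unicode_normalization::UnicodeNormalization;

fn nfc_presized(input: &str) -> String {
    // Single up-front allocation, mirroring ICU4C's caller-allocated buffer.
    // NFC output is usually no longer than the input in bytes, but this is
    // not guaranteed, so a robust harness would handle growth explicitly.
    let mut out = String::with_capacity(input.len());
    out.extend(input.nfc());
    out
}
```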

@hsivonen (Member)

Using std::char::decode_utf16 to get the iterator over char in the Rust case would address the UTF-8 vs. UTF-16 difference on the input side.
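For example (mapping unpaired surrogates to U+FFFD here; a real harness might handle the error case differently):

```rust
use unicode_normalization::UnicodeNormalization;

fn nfc_from_utf16(input: &[u16]) -> String {
    std::char::decode_utf16(input.iter().copied())
        .map(|r| r.unwrap_or(std::char::REPLACEMENT_CHARACTER))
        .nfc()
        .collect()
}
```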

@hsivonen (Member)

On the output side, AFAICT, you'd need a bit of custom code around encode_utf16 to write an iterator over char into a UTF-16 buffer.
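That custom output step could look roughly like this:

```rust
/// Write an iterator over char into a UTF-16 buffer, matching ICU4C's
/// UTF-16 output without going through an intermediate String.
fn extend_utf16(chars: impl Iterator<Item = char>, out: &mut Vec<u16>) {
    let mut buf = [0u16; 2]; // one char encodes to at most two UTF-16 code units
    for c in chars {
        out.extend_from_slice(c.encode_utf16(&mut buf));
    }
}
```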

@macchiati (Member) commented May 15, 2020 via email

@macchiati (Member) commented May 15, 2020 via email

sffc added the T-task (Type: Tracking thread for a non-code task) and C-unicode (Component: Props, sets, tries) labels May 15, 2020
@zbraniecki (Member Author) commented May 19, 2020

With help from @hsivonen and @Manishearth I completed the initial test harness for the normalization crate.

Explainer

Due to the differences between UTF-8 and UTF-16 handling and their impact on the nature of the operation, both the C and the Rust code are measured from within a Rust crate, both in an example binary and in a Criterion benchmark. As a sanity check, I also verified on a very small subset that a C++ app gives comparable numbers for ICU.

The Criterion benchmark is more stable, but a single run of the binary in --release mode gives comparable results, so while I'm presenting Criterion numbers below, cargo run --release on that crate should reproduce them approximately.

For the performance review I used Henri's encoding test suite, which uses Wikipedia articles for a given set of locales.

As for locales, I used the following set: "ar", "de", "en", "ko", "ru", "vi".

For the sake of completeness, I compared UTF-16 operations on NFC and NFD inputs, and for vi Henri added an orthographic scenario for both.

Notice: I did not validate the output of any of the methods. I exclusively compared the performance of the calls based on the given input.
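For context, the ICU4C side is driven through hand-written FFI from Rust. A sketch of what such a binding looks like (signatures follow ICU4C's unorm2.h C API; real builds may need version-suffixed symbol names, and the actual harness code lives in the intl-measurements repo):

```rust
use std::os::raw::c_void;

extern "C" {
    // ICU4C C API for normalization (see unorm2.h).
    fn unorm2_getNFCInstance(err: *mut i32) -> *const c_void;
    fn unorm2_normalize(
        norm2: *const c_void,
        src: *const u16,
        length: i32,
        dest: *mut u16,
        capacity: i32,
        err: *mut i32,
    ) -> i32;
}

/// Normalize UTF-16 input into a caller-allocated UTF-16 buffer.
/// Returns the output length reported by ICU.
fn icu_nfc(src: &[u16], dest: &mut [u16]) -> i32 {
    let mut err = 0i32; // U_ZERO_ERROR
    unsafe {
        let nfc = unorm2_getNFCInstance(&mut err);
        unorm2_normalize(
            nfc,
            src.as_ptr(),
            src.len() as i32,
            dest.as_mut_ptr(),
            dest.len() as i32,
            &mut err,
        )
    }
}
```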

Results

(C++ numbers exist only for UTF-16, ICU4C's native encoding; Diff compares Rust with C++ on those rows.)

| Locale | Operation | Source | Encoding | C++       | Rust      | Diff  |
|--------|-----------|--------|----------|-----------|-----------|-------|
| ar     | NFC       | NFC    | UTF-16   | 27.224 us | 365.91 us | x13.0 |
| ar     | NFC       | NFC    | UTF-8    |           | 321.13 us |       |
| ar     | NFD       | NFC    | UTF-16   | 42.021 us | 234.21 us | x5.5  |
| ar     | NFD       | NFC    | UTF-8    |           | 302.74 us |       |
| ar     | NFC       | NFD    | UTF-16   | 46.405 us | 391.28 us | x8.4  |
| ar     | NFC       | NFD    | UTF-8    |           | 470.15 us |       |
| ar     | NFD       | NFD    | UTF-16   | 42.489 us | 249.78 us | x5.9  |
| ar     | NFD       | NFD    | UTF-8    |           | 318.81 us |       |
| de     | NFC       | NFC    | UTF-16   | 76.972 us | 1.7980 ms | x23.2 |
| de     | NFC       | NFC    | UTF-8    |           | 1.7830 ms |       |
| de     | NFD       | NFC    | UTF-16   | 94.661 us | 1.1210 ms | x11.9 |
| de     | NFD       | NFC    | UTF-8    |           | 1.1002 ms |       |
| de     | NFC       | NFD    | UTF-16   | 132.27 us | 1.8075 ms | x13.6 |
| de     | NFC       | NFD    | UTF-8    |           | 1.8072 ms |       |
| de     | NFD       | NFD    | UTF-16   | 96.353 us | 1.1102 ms | x11.5 |
| de     | NFD       | NFD    | UTF-8    |           | 1.0919 ms |       |
| en     | NFC       | NFC    | UTF-16   | 65.217 us | 1.5867 ms | x24.4 |
| en     | NFC       | NFC    | UTF-8    |           | 1.5546 ms |       |
| en     | NFD       | NFC    | UTF-16   | 64.187 us | 873.03 us | x13.6 |
| en     | NFD       | NFC    | UTF-8    |           | 884.41 us |       |
| en     | NFC       | NFD    | UTF-16   | 62.624 us | 1.5472 ms | x24.7 |
| en     | NFC       | NFD    | UTF-8    |           | 1.4711 ms |       |
| en     | NFD       | NFD    | UTF-16   | 67.357 us | 856.90 us | x12.7 |
| en     | NFD       | NFD    | UTF-8    |           | 896.31 us |       |
| ko     | NFC       | NFC    | UTF-16   | 20.629 us | 368.37 us | x17.8 |
| ko     | NFC       | NFC    | UTF-8    |           | 399.63 us |       |
| ko     | NFD       | NFC    | UTF-16   | 106.02 us | 247.49 us | x2.3  |
| ko     | NFD       | NFC    | UTF-8    |           | 358.79 us |       |
| ko     | NFC       | NFD    | UTF-16   | 92.915 us | 465.07 us | x5.0  |
| ko     | NFC       | NFD    | UTF-8    |           | 513.15 us |       |
| ko     | NFD       | NFD    | UTF-16   | 58.509 us | 303.24 us | x5.1  |
| ko     | NFD       | NFD    | UTF-8    |           | 387.68 us |       |
| ru     | NFC       | NFC    | UTF-16   | 138.41 us | 1.8304 ms | x13.2 |
| ru     | NFC       | NFC    | UTF-8    |           | 2.2583 ms |       |
| ru     | NFD       | NFC    | UTF-16   | 198.54 us | 1.2005 ms | x6.0  |
| ru     | NFD       | NFC    | UTF-8    |           | 1.5329 ms |       |
| ru     | NFC       | NFD    | UTF-16   | 183.29 us | 1.8699 ms | x10.1 |
| ru     | NFC       | NFD    | UTF-8    |           | 2.3132 ms |       |
| ru     | NFD       | NFD    | UTF-16   | 189.12 us | 1.2479 ms | x6.6  |
| ru     | NFD       | NFD    | UTF-8    |           | 1.5129 ms |       |
| vi     | NFC       | NFC    | UTF-16   | 97.061 us | 2.0190 ms | x20.8 |
| vi     | NFC       | NFC    | UTF-8    |           | 2.2570 ms |       |
| vi     | NFD       | NFC    | UTF-16   | 364.88 us | 1.3493 ms | x3.6  |
| vi     | NFD       | NFC    | UTF-8    |           | 1.4905 ms |       |
| vi     | NFC       | NFD    | UTF-16   | 1.0147 ms | 2.2458 ms | x2.2  |
| vi     | NFC       | NFD    | UTF-8    |           | 2.2110 ms |       |
| vi     | NFD       | NFD    | UTF-16   | 370.88 us | 1.2600 ms | x3.4  |
| vi     | NFD       | NFD    | UTF-8    |           | 1.4762 ms |       |

| Locale | Operation | Source       | Encoding | C++       | Rust      | Diff |
|--------|-----------|--------------|----------|-----------|-----------|------|
| vi     | NFC       | orthographic | UTF-16   | 911.77 us | 2.2167 ms | x2.4 |
| vi     | NFC       | orthographic | UTF-8    |           | 2.1625 ms |      |
| vi     | NFD       | orthographic | UTF-16   | 403.16 us | 1.3254 ms | x3.3 |
| vi     | NFD       | orthographic | UTF-8    |           | 1.5354 ms |      |

To reproduce, clone the intl-measurements repo, cd into ./unic/normalize, and run either cargo run --release or cargo bench.

Summary

  • In all collected results, ICU is 2-25 times faster than unicode-normalization
  • The biggest differences can be observed in NFC->NFC operations
  • The choice of UTF-8 vs. UTF-16 doesn't yield any significant differences

I encourage everyone to review the source code to validate the code used for the benchmarks.

@zbraniecki (Member Author)

@Manishearth @hsivonen @sffc - can I get your thumbs up on the results?

I'd like to close this issue and get back to #40 once we confirm the validity of the results.

We may also want to report the results to the https://github.com/unicode-rs/unicode-normalization authors for evaluation (@Manishearth - can you facilitate this communication?)

@Manishearth (Member)

Hmm, that's surprising; it's worth checking what ICU does to get such a perf win. unicode-normalization hasn't been vetted much for perf, either.

Either way, it seems like a from-scratch implementation following the design of ICU might be better.

@Manishearth (Member) commented May 20, 2020

Oh, I see why: the tables are generated as giant match blocks instead of the binary-searchable lookup tables that most of the other Unicode crates use. I'd kind of assumed the latter was the default for all unicode-rs crates.
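To illustrate the difference with hypothetical data (these are not the crate's actual generated tables), compare a lookup compiled as match arms with a binary search over a static range table:

```rust
use core::cmp::Ordering;

// Style 1: generated match arms (real generated tables have thousands of arms).
fn ccc_via_match(c: char) -> u8 {
    match c {
        '\u{0300}'..='\u{0314}' => 230,
        '\u{0316}'..='\u{0319}' => 220,
        _ => 0,
    }
}

// Style 2: binary search over a static, sorted table of (start, end, value) ranges.
static CCC_RANGES: &[(u32, u32, u8)] = &[(0x0300, 0x0314, 230), (0x0316, 0x0319, 220)];

fn ccc_via_search(c: char) -> u8 {
    let cp = c as u32;
    CCC_RANGES
        .binary_search_by(|&(lo, hi, _)| {
            if hi < cp {
                Ordering::Less
            } else if lo > cp {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .map(|i| CCC_RANGES[i].2)
        .unwrap_or(0)
}
```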

@zbraniecki (Member Author)

This is now completed.

@zbraniecki (Member Author)

Summary of the decision: We see a major performance difference in favor of ICU4C. Manish believes that the issue he filed would close the gap, but isn't strongly opinionated on whether we should fix unicode-normalization or rewrite it from scratch.

So, whoever takes this on can decide which route to take; the harness is available to retest against a new target.

@nciric (Contributor) commented May 28, 2020

I think we need to discuss the data-loading issue with normalization. Would refitting the existing crate with a data provider interface be harder than writing one from scratch? As Manish said, the normalization code is not very complex.
