Perform normalization performance evaluation between Rust and ICU #93
The current procedure plan is:
…
Question marks represent things I'm unsure of:
- How to "synthesize a document to NFC"?
- Is there any other step or is the result text the input for our test?
- What to do with the Korean text?
- What operation should I perform from ICU and what from the Rust normalization crate?

@hsivonen, @Manishearth, @sffc, @nciric - feedback and help?
I think we should just get text examples, and: … for everything. For everything but Vietnamese, the text is probably already in NFC form. The Rust crate has extension …
Let me elaborate on what I said before: for perf testing we should use representative data. That means ideally a representative sample for each of the top world languages, then weighting the resulting durations by the percentage of web text in that language (approximately, per https://w3techs.com/technologies/history_overview/content_language/ms/y).
Anyway, a test with Vietnamese text, in the format used on the web, would be interesting, but we have to keep in mind that the majority of text will already be in NFC, so the speed with text that is already normalized will be crucial.
Caveats of course, since there are diminishing returns on the effort to collect that data, and other domains may vary, e.g. texting.
So you should also test English, and a few other languages at the top of that list, as proxies for the whole list.
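To make the weighting step concrete, here is a minimal Rust sketch of the aggregation; the language shares and durations are placeholders for illustration, not the real w3techs numbers or any measured results.

```rust
/// Weighted mean of per-language benchmark durations, weighted by each
/// language's share of web content. `results` holds (language, share, ns).
fn weighted_duration(results: &[(&str, f64, f64)]) -> f64 {
    let total: f64 = results.iter().map(|(_, share, _)| share).sum();
    results.iter().map(|(_, share, ns)| share * ns).sum::<f64>() / total
}

fn main() {
    // Placeholder shares and timings, for illustration only.
    let results = [("en", 60.0, 1200.0), ("vi", 1.4, 2100.0), ("cs", 1.0, 1500.0)];
    println!("weighted duration: {:.1} ns", weighted_duration(&results));
}
```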
On Thu, May 14, 2020, 20:19 Zibi Braniecki wrote:
Question marks represent things I'm unsure of:

- How to "synthesize a document to NFC"?

Not sure what that means either.

- Is there any other step or is the result text the input for our test?
- What to do with the Korean text?

Less than 1%, so other languages would be better.

- What operation should I perform from ICU and what from the Rust normalization crate?

I'd recommend:
- Convert to NFC
- Test if NFC
…
@hsivonen, @Manishearth, @sffc, @nciric - feedback and help?
Thank you all! Updated the plan to list languages and 4 benchmarks. Let me know if that sounds good.
FWIW, for encoding_rs, I chose the page for Mars, the planet (for the reasons documented: https://github.com/hsivonen/encoding_bench#selection-of-test-data). (Get plain-text versions from the encoding_bench repo at https://github.com/hsivonen/encoding_bench/tree/master/src/wikipedia, links to Wikipedia revisions in https://github.com/hsivonen/encoding_bench/blob/master/src/wikipedia/sources.txt, license at https://github.com/hsivonen/encoding_bench/blob/master/LICENSE-CC-BY-SA; not suited for checking into the ICU4X repo; also note the Czech version has GFDL portions.)
For Vietnamese, please also generate a starting point that corresponds to the keyboard layout, which is neither NFC nor NFD. To get this form, first normalize to NFC and then apply the detone crate (https://crates.io/crates/detone) with orthographic set to true.
I suggest using the unic_normal crate. testdet has a bit of code (https://github.com/hsivonen/testdet/blob/3f2629c9814290603e995be3184e79b4c349b5a7/src/main.rs#L701-L703) that generates as-if-from-keyboard Vietnamese using unic_normal and detone.
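A minimal sketch of that pipeline, assuming detone's iterator adapter `decompose_vietnamese_tones` (as in the linked testdet code) and using unicode-normalization's trait in place of unic_normal, which offers an equivalent `.nfc()`:

```rust
// Sketch only: assumes detone's IterDecomposeVietnamese adapter.
use detone::IterDecomposeVietnamese;
use unicode_normalization::UnicodeNormalization;

/// Produce the as-if-from-keyboard Vietnamese form: NFC first, then an
/// orthographic tone decomposition, which is neither NFC nor NFD.
fn as_if_from_keyboard(text: &str) -> String {
    text.chars()
        .nfc()
        .decompose_vietnamese_tones(true) // true => orthographic form
        .collect()
}
```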
I think the inputs are:
- Original data normalized to NFC (bench normalization to NFC, which should be a no-op, and normalization to NFD)
- Original data normalized to NFD (bench normalization to NFD, which should be a no-op, and normalization to NFC)
- Additionally, for Vietnamese only: original data normalized to NFC followed by orthographic decomposition (bench normalization to NFC and NFD)

For unic_normal you'd do input.chars().nfc().collect::<String>(), which has fundamentally different allocation behavior from ICU4C. You could also bench input.chars().nfc().count() to estimate the allocation and memory write cost. For ICU4C, the relevant function seems to be unorm2_normalize (https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unorm2_8h.html#a0a596802db767da410b4b04cb75cbc53), which writes to a caller-allocated buffer.
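A sketch of those two Rust measurement variants, shown with unicode-normalization's trait (unic_normal exposes an equivalent `.nfc()`):

```rust
use unicode_normalization::UnicodeNormalization;

/// Full pipeline: normalization plus output allocation and memory writes,
/// which is what real callers pay.
fn nfc_collect(input: &str) -> String {
    input.chars().nfc().collect::<String>()
}

/// Same iterator, but only counting chars: comparing this against the
/// collect timing estimates the allocation and write cost.
fn nfc_count(input: &str) -> usize {
    input.chars().nfc().count()
}
```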
Sadly, there doesn't appear to be enough API surface to test the two engines in a commensurable way in terms of allocation behavior.
On the Rust side, if you set the capacity of the output String …
Using …
On the output side, AFAICT, you'd need a bit of custom code around …
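One way to approximate ICU4C's caller-allocated-buffer model on the Rust side is to pre-size the output; a sketch follows (the input.len() capacity is a heuristic, since NFC output can occasionally be longer than the input):

```rust
use unicode_normalization::UnicodeNormalization;

/// Normalize into a pre-sized String so that extending it does not
/// reallocate in the common case.
fn nfc_preallocated(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    out.extend(input.chars().nfc());
    out
}
```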
I don't think normalizing from NFD to NFC is a common use case. I'd recommend (where "wild" is representative of what's on the web):
- wild => NFC
- wild => NFD
- isNFC(wild)
- isNFD(wild)

I'm a bit leery of simulating what we *think* Vietnamese might be on the web at this point in time. However, vi is only 1.4% of the total weight, so it's fine as long as we weight it appropriately.
Mark
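A minimal Rust sketch of those four operations, assuming the unicode-normalization crate (its is_nfc/is_nfd quick checks avoid allocation; unic_normal has analogous APIs):

```rust
use unicode_normalization::{is_nfc, is_nfd, UnicodeNormalization};

fn four_benchmarks(wild: &str) {
    // wild => NFC and wild => NFD: allocate and write the normalized output.
    let nfc: String = wild.chars().nfc().collect();
    let nfd: String = wild.chars().nfd().collect();
    // isNFC(wild) and isNFD(wild): read-only quick checks.
    let already_nfc = is_nfc(wild);
    let already_nfd = is_nfd(wild);
    println!("{} {} {} {}", nfc.len(), nfd.len(), already_nfc, already_nfd);
}
```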
Btw, nice page explaining why to pick the Wikipedia Mars article...
With help from @hsivonen and @Manishearth I completed the initial test harness for the normalization crate.

Explainer

Due to differences between UTF-8 and UTF-16 handling and their impact on the nature of the operation, both the C and Rust code are measured within the Rust crate, both in an example binary and in a Criterion benchmark. For the sake of sanity, I also verified on a very small subset that a C++ app gives comparable numbers for ICU. The Criterion benchmark is more stable, but a single run of the binary in …

For the performance review I used Henri's encoding test suite, which uses Wikipedia articles for a given set of locales. As for locales, I used the following set: …

For the sake of completeness I compared UTF-16 operations on NFC and NFD, and for …

Notice: I did not validate the output of any of the methods. I exclusively compared the performance of the calls based on the given input.

Results
…
To reproduce, clone the intl-measurements repo …

Summary
…
I encourage everyone to evaluate the source code to validate the code used for the benchmarks.
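For reference, a Criterion benchmark for one of these operations would look roughly like this; the function names and the fixture path are illustrative, not the actual intl-measurements code:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use unicode_normalization::UnicodeNormalization;

fn bench_nfc(c: &mut Criterion) {
    // Hypothetical plain-text fixture extracted from a Wikipedia article.
    let input = include_str!("../data/vi.txt");
    c.bench_function("nfc_vi_utf8", |b| {
        b.iter(|| black_box(input).chars().nfc().collect::<String>())
    });
}

criterion_group!(benches, bench_nfc);
criterion_main!(benches);
```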
@Manishearth @hsivonen @sffc - can I get your thumbs up on the results? I'd like to close this issue and get back to #40 once we confirm the validity of the results. We may also want to report the results to the https://github.com/unicode-rs/unicode-normalization authors for evaluation (@Manishearth - can you facilitate this?)
Hmm, that's surprising; worth checking out what ICU does to get such a perf win. unicode-normalization has not been vetted much for perf either. Either way, it seems like a from-scratch impl following the design of ICU might be better.
Oh, I see why: the tables are generated as giant match blocks instead of the cached binary-search tables that most of the other Unicode crates use. I'd kind of assumed this was the default for all unicode-rs crates.
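An illustrative contrast between the two table styles, with toy data (not the actual generated tables in either crate):

```rust
use std::cmp::Ordering;

// Style A: per-character data compiled into a giant match, as in the
// unicode-normalization tables at the time.
fn combining_class_match(c: char) -> u8 {
    match c {
        '\u{0300}'..='\u{0314}' => 230,
        '\u{0316}'..='\u{0319}' => 220,
        _ => 0,
    }
}

// Style B: binary search over a sorted static range table, as in most
// other unicode-rs crates.
static CCC_RANGES: &[(u32, u32, u8)] = &[(0x0300, 0x0314, 230), (0x0316, 0x0319, 220)];

fn combining_class_bsearch(c: char) -> u8 {
    let cp = c as u32;
    CCC_RANGES
        .binary_search_by(|&(lo, hi, _)| {
            if hi < cp {
                Ordering::Less
            } else if lo > cp {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .map(|i| CCC_RANGES[i].2)
        .unwrap_or(0)
}
```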
This is now completed.
Summary of the decision: We see a major performance difference in favor of ICU4C. Manish believes that the issue he filed would close the gap, but said that between rewriting and fixing … So, whoever takes this on can decide which route to take, and the harness is available to retest against a new target.
I think we need to discuss the data loading issue with normalization. Would refitting the existing crate with a data provider interface be harder than writing one from scratch? As Manish said, normalization code is not very complex.
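To make the refit-vs-rewrite question concrete, the data provider interface might look something like the hypothetical trait below; this is not an actual ICU4X API, just a sketch of the shape the normalization data would need.

```rust
/// Hypothetical provider surface for normalization data; a refit would route
/// the crate's table lookups through something like this instead of
/// compiled-in statics.
trait NormalizationDataProvider {
    /// Canonical decomposition for a character, if it has one.
    fn canonical_decomposition(&self, c: char) -> Option<&[char]>;
    /// Canonical combining class (0 for starters).
    fn combining_class(&self, c: char) -> u8;
    /// Primary composite for a pair, per the canonical composition algorithm.
    fn compose(&self, starter: char, combining: char) -> Option<char>;
}
```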
The proposed experiment will allow us to get a basic idea of the relative performance of normalization libraries, comparing ICU and the Rust normalization crate.
I will use the https://github.com/zbraniecki/intl-measurements/ harness, building two example apps that use micro-timing intervals to measure the time used by each.
On top of that, I'll plug the Rust test into Criterion for more detailed measurements.
If the results warrant further testing, it may then be useful to write a basic FFI for ICU4C to plug it into Criterion.
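If that FFI step happens, a minimal binding could look like the sketch below; the extern declarations mirror unorm2.h, though real builds must account for ICU's versioned symbol names (e.g. unorm2_normalize_67 on some platforms) and proper UErrorCode handling.

```rust
// Sketch of a minimal ICU4C binding; link against icuuc. UChar is UTF-16,
// and UErrorCode is represented here as a plain i32 (0 = U_ZERO_ERROR).
#[repr(C)]
pub struct UNormalizer2 {
    _private: [u8; 0],
}

extern "C" {
    fn unorm2_getNFCInstance(err: *mut i32) -> *const UNormalizer2;
    fn unorm2_normalize(
        norm2: *const UNormalizer2,
        src: *const u16,
        length: i32,
        dest: *mut u16,
        capacity: i32,
        err: *mut i32,
    ) -> i32;
}

/// Normalize UTF-16 input into a caller-allocated buffer; returns the length
/// written (or required, if the buffer was too small).
fn nfc_utf16(src: &[u16], dest: &mut [u16]) -> i32 {
    let mut err = 0i32;
    unsafe {
        let nfc = unorm2_getNFCInstance(&mut err);
        unorm2_normalize(
            nfc,
            src.as_ptr(),
            src.len() as i32,
            dest.as_mut_ptr(),
            dest.len() as i32,
            &mut err,
        )
    }
}
```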