
Optimization for grapheme iteration. #77

Merged (2 commits) Feb 12, 2020
Conversation

@cessen (Contributor) commented Feb 9, 2020

Typical text involves a lot of chars one-after-another that
are in a similar scalar value range and in the same category. This
commit takes advantage of that to speed up grapheme iteration by
caching the range and category of the last char, and always checking
against that cache first before doing a binary search on the whole
grapheme category table.

This results in significant speed-ups on many texts, in some cases
up to 2x faster.

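In code terms, the idea looks roughly like this (an illustrative sketch only; the actual category names and generated table layout in the crate differ):

```rust
use std::cmp::Ordering;

// Illustrative category enum; the real generated table has many more variants.
#[derive(Copy, Clone)]
enum GraphemeCat {
    Other,
    Extend,
}

// (start, end, category) ranges, sorted by start. Abridged here.
static GRAPHEME_TABLE: &[(char, char, GraphemeCat)] = &[
    ('\u{300}', '\u{36F}', GraphemeCat::Extend),
    // ...
];

// Cache of the last range + category that matched.
struct CatCache {
    range: (char, char),
    cat: GraphemeCat,
}

fn grapheme_category(c: char, cache: &mut Option<CatCache>) -> GraphemeCat {
    // Fast path: consecutive chars usually land in the same table range,
    // so check the cached range before binary-searching the whole table.
    if let Some(cached) = cache {
        if cached.range.0 <= c && c <= cached.range.1 {
            return cached.cat;
        }
    }
    // Slow path: binary search over the ranges, then refill the cache.
    match GRAPHEME_TABLE.binary_search_by(|&(start, end, _)| {
        if start > c {
            Ordering::Greater
        } else if end < c {
            Ordering::Less
        } else {
            Ordering::Equal
        }
    }) {
        Ok(idx) => {
            let (start, end, cat) = GRAPHEME_TABLE[idx];
            *cache = Some(CatCache { range: (start, end), cat });
            cat
        }
        // Chars not covered by any listed range fall back to a default here.
        Err(_) => GraphemeCat::Other,
    }
}
```
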
@cessen (Contributor, Author) commented Feb 9, 2020

Here are some benchmarks I did on a variety of texts in various languages, most of which are sourced from the Wikipedia articles about each respective language (e.g. the Mandarin Chinese text is from the Mandarin Chinese Wikipedia page about the Mandarin Chinese language). They were all converted to plain UTF-8 text (no HTML markup) and copy-pasted to create roughly 200 kB text files.

Before:

test iterate_text_arabic      ... bench:   4,200,019 ns/iter (+/- 100,975) = 43 MB/s
test iterate_text_english     ... bench:   7,563,328 ns/iter (+/- 197,099) = 27 MB/s
test iterate_text_hindi       ... bench:   3,408,050 ns/iter (+/- 78,840) = 56 MB/s
test iterate_text_japanese    ... bench:   2,909,737 ns/iter (+/- 137,862) = 67 MB/s
test iterate_text_korean      ... bench:   3,158,322 ns/iter (+/- 77,622) = 54 MB/s
test iterate_text_mandarin    ... bench:   2,784,563 ns/iter (+/- 74,021) = 67 MB/s
test iterate_text_russian     ... bench:   4,437,562 ns/iter (+/- 136,513) = 42 MB/s
test iterate_text_source_code ... bench:   5,531,391 ns/iter (+/- 161,335) = 25 MB/s
test iterate_text_worst_case  ... bench:   7,252,276 ns/iter (+/- 242,166) = 26 MB/s
test iterate_text_zalgo       ... bench:   3,025,250 ns/iter (+/- 77,794) = 62 MB/s

After:

test iterate_text_arabic      ... bench:   3,150,853 ns/iter (+/- 178,068) = 58 MB/s
test iterate_text_english     ... bench:   4,021,696 ns/iter (+/- 141,119) = 50 MB/s
test iterate_text_hindi       ... bench:   3,462,884 ns/iter (+/- 85,012) = 55 MB/s
test iterate_text_japanese    ... bench:   2,571,715 ns/iter (+/- 103,212) = 76 MB/s
test iterate_text_korean      ... bench:   3,530,435 ns/iter (+/- 139,610) = 48 MB/s
test iterate_text_mandarin    ... bench:   1,913,650 ns/iter (+/- 65,902) = 98 MB/s
test iterate_text_russian     ... bench:   3,140,924 ns/iter (+/- 80,945) = 60 MB/s
test iterate_text_source_code ... bench:   3,170,036 ns/iter (+/- 103,675) = 45 MB/s
test iterate_text_worst_case  ... bench:   8,669,671 ns/iter (+/- 335,782) = 22 MB/s
test iterate_text_zalgo       ... bench:   1,234,701 ns/iter (+/- 25,555) = 153 MB/s

With the exception of Korean and an artificial worst-case text I created, everything either improves in performance or breaks even. And the performance improvements are often substantial.

The worst-case text consists of alternating grapheme categories on every char, completely defeating the optimization and leaving only the overhead of the cache check. This should give a good indication of the worst-case performance drop that might happen.

Korean is a bit of a unique case; due to the nature of Korean text, it's pretty close to a real-world equivalent of the artificial worst-case text. Nevertheless, the drop in performance seems (IMO) reasonable given the improvements for other texts.

The zalgo text case includes artificially long combining character sequences (https://en.wikipedia.org/wiki/Combining_character#Zalgo_text), and shouldn't be taken too seriously.

@Manishearth (Member) left a comment

This is great! Could you also check in the benchmarks you wrote? Perhaps make a benches/ folder.

Be sure to exclude that folder from publishing in Cargo.toml.

Also, an additional optimization that can be done is hardcoding the table for ASCII and checking for that first.

@Manishearth merged commit b82ed4d into unicode-rs:master Feb 12, 2020
@Manishearth (Member) commented

Eh, those things can be followups.

@cessen (Contributor, Author) commented Feb 13, 2020

@Manishearth
Sure, I'll work on tidying up and committing the benchmarks the next chance I get (possibly this weekend, or maybe the weekend after).

Yeah, I've also been thinking about how to hard-code the ASCII range in some way. I think with some bit-fiddling it could be made extremely fast, because the only pure-ASCII case that needs handling is CRLF (I'm pretty sure?). And it would likely introduce very little overhead for other texts, I'm guessing.

Originally I was planning to do that optimization just in my own application code, bypassing UnicodeSegmentation altogether for those cases, but if you're interested in having it in-tree I'd be more than happy to take a crack at it in another PR.
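To illustrate the kind of shortcut I mean (a hypothetical helper, based only on the reasoning above that CRLF is the sole multi-char grapheme within pure ASCII):

```rust
// Between two ASCII bytes, the only pair with no grapheme boundary between
// them is CR followed by LF. Returns None for non-ASCII so the caller can
// fall back to the full table-driven logic.
#[inline]
fn ascii_fast_boundary(prev: u8, next: u8) -> Option<bool> {
    if prev < 0x80 && next < 0x80 {
        Some(!(prev == b'\r' && next == b'\n'))
    } else {
        None
    }
}
```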

@Manishearth (Member) commented

Oh, I meant that we could hardcode the categories so that ASCII doesn't hit the binary search

@cessen (Contributor, Author) commented Feb 13, 2020

Ah, okay. I'm sure that would make ASCII text faster, though I think that would be more because it's effectively biasing the search to check ASCII first rather than because of the hard-coding itself. We could choose to bias other ranges similarly, if we wanted to. So I guess the question is: is the ASCII range special enough to warrant doing that?

Edit: to clarify, I mean "special enough" in the general case. For my own project, a code editor, for example, it definitely would be. But I can write my own ASCII-specific optimizations in that case, as could other applications for their own domains.

@Manishearth (Member) commented

@cessen it's special enough because a lot of punctuation/etc is ASCII even if text might not necessarily be. It's going to be a single comparison followed by an indexing operation, very cheap if the alternative is a binary search.

It will complement the cache really well, since the cache can continue to deal with the non-ASCII general range of the text while spacing and punctuation don't invalidate it. We'd need to tweak the bsearch method to not return the cache index markers in the ASCII case, though.
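
Roughly something like this (sketch only, with made-up category names; the stub stands in for the cached lookup plus binary search from this PR):

```rust
// Illustrative categories; the real generated table uses different names.
#[derive(Copy, Clone)]
enum GraphemeCat {
    Other,
    Control,
    CR,
    LF,
}

// Hardcoded categories for the ASCII range, built at compile time.
const fn build_ascii_table() -> [GraphemeCat; 128] {
    let mut table = [GraphemeCat::Other; 128];
    let mut i = 0;
    while i < 0x20 {
        table[i] = GraphemeCat::Control;
        i += 1;
    }
    table[0x7F] = GraphemeCat::Control;
    table[b'\r' as usize] = GraphemeCat::CR;
    table[b'\n' as usize] = GraphemeCat::LF;
    table
}

static ASCII_TABLE: [GraphemeCat; 128] = build_ascii_table();

// Stand-in for the cached-range check + binary search added in this PR.
fn slow_path(_c: char) -> GraphemeCat {
    GraphemeCat::Other
}

fn grapheme_category(c: char) -> GraphemeCat {
    if (c as u32) < 128 {
        // One comparison plus an index, and it doesn't touch the cache,
        // so ASCII spacing/punctuation never evicts a useful cached range.
        return ASCII_TABLE[c as usize];
    }
    slow_path(c)
}
```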

@cessen (Contributor, Author) commented Feb 14, 2020

@Manishearth I think that really depends on the text. The CJK languages, for example, have their own punctuation and whitespace characters, and the only ASCII-range characters they ever use are line endings, which occur only occasionally in typical texts.

But having said that, I think you're right that many other languages still use ASCII punctuation and whitespace. And then there are things like HTML, too. I'd be happy to take a crack at this and see how it affects the benchmarks, and whether it seems worth whatever overhead it might introduce.

@Manishearth (Member) commented

Yes, not all languages do this.

This was definitely a win for unicode-xid in unicode-rs/unicode-xid#16 (which doesn't even go all the way to an ASCII table, it just reorders the binary search!), though unicode-xid admittedly works with different sets of characters.
