
Use minimal perfect hashing for lookups #37

Merged
merged 8 commits into unicode-rs:master on Apr 16, 2019

Conversation

@raphlinus (Contributor) commented Apr 10, 2019

This patch moves many lookups from large match statements to a custom approach based on minimal perfect hashing.

Should improve #29 considerably: cargo build --release goes from 6.28 s to 2.11 s on my machine. In addition, code size is considerably improved (1,432,576 to 858,112 bytes for the benchmark executable). Speed is basically the same.

Also moves the generation script to Python 3. Note that the Unicode version is still 9.0.
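
For readers unfamiliar with the technique, here is a minimal sketch of the lookup side. The names, table layout, and constants are illustrative rather than the crate's exact API:

```rust
/// Two-level minimal-perfect-hash lookup: a first hash picks a bucket,
/// the bucket's salt re-hashes the key to a unique slot, and a final
/// key comparison rejects codepoints that aren't in the table.
fn mph_lookup(key: u32, salts: &[u16], keys: &[u32], values: &[u8], default: u8) -> u8 {
    let salt = salts[my_hash(key, 0, salts.len())] as u32;
    let idx = my_hash(key, salt, keys.len());
    if keys[idx] == key { values[idx] } else { default }
}

/// Salted multiplicative hash mapping a key into 0..n.
fn my_hash(key: u32, salt: u32, n: usize) -> usize {
    let y = key.wrapping_add(salt).wrapping_mul(2654435769); // ~2^32 / golden ratio
    let y = y ^ key.wrapping_mul(0x31415926);
    (((y as u64) * (n as u64)) >> 32) as usize
}
```

"Minimal" here means exactly one slot per key, so the tables stay as small as the data itself, which is what drives the code-size win over the generated match statements.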

@raphlinus (Contributor, Author) commented Apr 10, 2019

More detail on the benchmarks. Here's the before:

test bench_is_nfc_ascii                      ... bench:          23 ns/iter (+/- 5)
test bench_is_nfc_normalized                 ... bench:          36 ns/iter (+/- 3)
test bench_is_nfc_not_normalized             ... bench:         452 ns/iter (+/- 163)
test bench_is_nfc_stream_safe_ascii          ... bench:          22 ns/iter (+/- 2)
test bench_is_nfc_stream_safe_normalized     ... bench:          46 ns/iter (+/- 6)
test bench_is_nfc_stream_safe_not_normalized ... bench:         528 ns/iter (+/- 225)
test bench_is_nfd_ascii                      ... bench:          21 ns/iter (+/- 4)
test bench_is_nfd_normalized                 ... bench:          45 ns/iter (+/- 3)
test bench_is_nfd_not_normalized             ... bench:          16 ns/iter (+/- 3)
test bench_is_nfd_stream_safe_ascii          ... bench:          24 ns/iter (+/- 4)
test bench_is_nfd_stream_safe_normalized     ... bench:          55 ns/iter (+/- 5)
test bench_is_nfd_stream_safe_not_normalized ... bench:          17 ns/iter (+/- 3)
test bench_nfc_ascii                         ... bench:         661 ns/iter (+/- 113)
test bench_nfc_long                          ... bench:     234,811 ns/iter (+/- 44,577)
test bench_nfd_ascii                         ... bench:         308 ns/iter (+/- 51)
test bench_nfd_long                          ... bench:     127,452 ns/iter (+/- 11,391)
test bench_nfkc_ascii                        ... bench:         599 ns/iter (+/- 49)
test bench_nfkc_long                         ... bench:     236,973 ns/iter (+/- 19,020)
test bench_nfkd_ascii                        ... bench:         316 ns/iter (+/- 21)
test bench_nfkd_long                         ... bench:     141,850 ns/iter (+/- 22,229)
test bench_streamsafe_adversarial            ... bench:         507 ns/iter (+/- 26)
test bench_streamsafe_ascii                  ... bench:          75 ns/iter (+/- 5)

And here's the after:

test bench_is_nfc_ascii                      ... bench:          22 ns/iter (+/- 1)
test bench_is_nfc_normalized                 ... bench:          35 ns/iter (+/- 4)
test bench_is_nfc_not_normalized             ... bench:         419 ns/iter (+/- 119)
test bench_is_nfc_stream_safe_ascii          ... bench:          26 ns/iter (+/- 7)
test bench_is_nfc_stream_safe_normalized     ... bench:          45 ns/iter (+/- 8)
test bench_is_nfc_stream_safe_not_normalized ... bench:         447 ns/iter (+/- 49)
test bench_is_nfd_ascii                      ... bench:          22 ns/iter (+/- 1)
test bench_is_nfd_normalized                 ... bench:          46 ns/iter (+/- 6)
test bench_is_nfd_not_normalized             ... bench:          16 ns/iter (+/- 5)
test bench_is_nfd_stream_safe_ascii          ... bench:          22 ns/iter (+/- 2)
test bench_is_nfd_stream_safe_normalized     ... bench:          61 ns/iter (+/- 8)
test bench_is_nfd_stream_safe_not_normalized ... bench:          16 ns/iter (+/- 4)
test bench_nfc_ascii                         ... bench:         620 ns/iter (+/- 376)
test bench_nfc_long                          ... bench:     195,177 ns/iter (+/- 21,275)
test bench_nfd_ascii                         ... bench:         392 ns/iter (+/- 42)
test bench_nfd_long                          ... bench:     146,535 ns/iter (+/- 9,473)
test bench_nfkc_ascii                        ... bench:         550 ns/iter (+/- 41)
test bench_nfkc_long                         ... bench:     212,233 ns/iter (+/- 16,049)
test bench_nfkd_ascii                        ... bench:         384 ns/iter (+/- 27)
test bench_nfkd_long                         ... bench:     155,408 ns/iter (+/- 12,506)
test bench_streamsafe_adversarial            ... bench:         458 ns/iter (+/- 24)
test bench_streamsafe_ascii                  ... bench:          77 ns/iter (+/- 6)

More commentary. I also tested the singleton-bucket "optimization" described in Steve Hanov's blog post on minimal perfect hashing, and it was about 50% slower on the long tests. It saves rehashing work, but the cost of the extra branching outweighs that. Leaving it out makes table generation a bit slower, and also less robust (it would not be too difficult to construct an adversarial example that would overflow the salt).
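
For illustration, here is a sketch of the construction loop being described, in Rust rather than the actual Python generator, using my_hash from the sketch above. Buckets are processed largest first; each bucket searches for a salt under which all of its keys land on distinct, unclaimed slots:

```rust
use std::collections::HashSet;

/// Build the salts table and key layout for a set of `keys`, with `n`
/// buckets and `n` slots. Returns None in the (unlikely) event that some
/// bucket exhausts the salt range, the failure mode mentioned above.
fn build_mph(keys: &[u32], n: usize) -> Option<(Vec<u32>, Vec<u32>)> {
    // First-level hash with salt 0 assigns every key to a bucket.
    let mut buckets: Vec<Vec<u32>> = vec![Vec::new(); n];
    for &k in keys {
        buckets[my_hash(k, 0, n)].push(k);
    }
    // Largest buckets first: they are the hardest to place.
    let mut order: Vec<usize> = (0..n).collect();
    order.sort_by_key(|&b| std::cmp::Reverse(buckets[b].len()));

    let mut salts = vec![0u32; n];
    let mut slot_keys = vec![0u32; n];
    let mut claimed = vec![false; n];
    for &b in &order {
        if buckets[b].is_empty() {
            break; // sorted by size, so every remaining bucket is empty
        }
        // Find a salt under which this bucket's keys rehash to slots that
        // are unclaimed *and* distinct within the bucket (the `seen` set).
        let salt = (1u32..32768).find(|&s| {
            let mut seen = HashSet::new();
            buckets[b]
                .iter()
                .map(|&k| my_hash(k, s, n))
                .all(|i| !claimed[i] && seen.insert(i))
        })?;
        salts[b] = salt;
        for &k in &buckets[b] {
            let i = my_hash(k, salt, n);
            claimed[i] = true;
            slot_keys[i] = k;
        }
    }
    Some((salts, slot_keys))
}
```

The seen set is what guards against collisions within a single bucket's rehashes, which comes up again in the review discussion below.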

I wouldn't be surprised if there were a better hash function. Using a single multiplication doesn't work; there are too many collisions. I also tried a variant of the Jenkins one-at-a-time hash function, and it was slower. Several other proposals were mentioned in a Twitter thread, but I don't think anything will be faster.
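
For comparison, the rejected single-multiplication variant would look something like this (again illustrative); with keys that are tightly clustered codepoints, too many of them map to the same slot:

```rust
/// A plain salted multiplicative hash. Structurally similar codepoint
/// keys tend to collide under this, which is why the extra xor-multiply
/// mixing step is used instead.
fn weak_hash(key: u32, salt: u32, n: usize) -> usize {
    let y = key.wrapping_add(salt).wrapping_mul(2654435769);
    (((y as u64) * (n as u64)) >> 32) as usize
}
```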

@trishume commented Apr 10, 2019

Another approach that might lead to even better compile times is to output the tables in some simple packed binary format, include them with the include_bytes! macro, and then just index into the byte arrays to extract what you need. That would avoid generating a 0.5 MB Rust file. I'm not sure how much compile time it would save for the effort it would take, though.

@raphlinus (Contributor, Author) commented Apr 10, 2019

@trishume That's well worth considering. One factor against it is that this crate has strictly no unsafe code, and deserialization from the packed format would need at least checks for the conversion into char. But it's probably a good idea to investigate.
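
For illustration, a sketch of what the packed-bytes approach could look like while keeping the no-unsafe guarantee; the file name and record format here are hypothetical:

```rust
use std::convert::TryInto;

// Hypothetical: a generator-emitted file of little-endian u32 records.
static TABLE: &[u8] = include_bytes!("tables.bin");

/// Fetch the i-th record as a char without any unsafe code. The bounds
/// check and char validity check are exactly the deserialization cost
/// mentioned above.
fn entry(i: usize) -> Option<char> {
    let bytes: [u8; 4] = TABLE.get(i * 4..i * 4 + 4)?.try_into().ok()?;
    std::char::from_u32(u32::from_le_bytes(bytes))
}
```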

@Manishearth (Contributor) left a comment

I'd slightly prefer it if the generated code and the generated tables lived separately -- i.e., have the functions generate the DECOMPOSITION_KEYS and DECOMPOSITION_SALTS tables, and have the actual mph_lookup calls live outside of tables.rs, so that tables.rs is just tables and no actual code.
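
Concretely, the requested split might look like the following sketch (table contents are illustrative, and mph_lookup is as in the earlier sketch):

```rust
// tables.rs: nothing but generated data.
pub const COMBINING_CLASS_SALT: &[u16] = &[3, 1, 4];
pub const COMBINING_CLASS_KEYS: &[u32] = &[0x300, 0x301, 0x302];
pub const COMBINING_CLASS_VALUES: &[u8] = &[230, 230, 230];

// perfect_hash.rs (or similar): the hand-written lookup logic.
pub fn canonical_combining_class(c: char) -> u8 {
    mph_lookup(
        c as u32,
        COMBINING_CLASS_SALT,
        COMBINING_CLASS_KEYS,
        COMBINING_CLASS_VALUES,
        0,
    )
}
```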

return (y * n) >> 32

# Compute minimal perfect hash function, d can be either a dict or list of keys.
def minimal_perfect_hash(d, singleton_buckets = False):

@Manishearth (Contributor) commented Apr 14, 2019:

I'd prefer it if this function had more comments

@@ -432,13 +436,61 @@ def gen_tests(tests, out):

out.write("];\n")

def my_hash(x, salt, n):

@Manishearth (Contributor) commented Apr 14, 2019:

This probably should have a comment saying "guaranteed to be less than n"
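
For reference, the requested guarantee follows from the return line shown above, assuming y has been reduced to 32 bits earlier in the function: then y < 2^32, so y * n < 2^32 * n, and (y * n) >> 32 = floor(y * n / 2^32) < n.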

for (bucket_size, h) in bsorted:
if bucket_size == 0:
break
elif singleton_buckets and bucket_size == 1:

@Manishearth (Contributor) commented Apr 14, 2019:

Do we use the singleton_buckets case at all?

@raphlinus (Author, Contributor) commented Apr 14, 2019:

No, I can remove it, especially as it seems to perform worse in benchmarks. The main reason I left it in is that it's more robust; without it there's a much greater probability that the hashing will fail.

else:
for salt in range(1, 32768):
rehashes = [my_hash(key, salt, n) for key in buckets[h]]
if all(not claimed[hash] for hash in rehashes):

@Manishearth (Contributor) commented Apr 14, 2019:

Is there a guarantee that we won't have a collision amongst the rehashes? Or is it just really unlikely? (I suspect it's the latter, but want to confirm.)

@raphlinus (Author, Contributor) commented Apr 14, 2019:

Yes, if it finds a suitable salt, that comes with a guarantee that the rehash won't have a collision (this is what the claimed bool array keeps track of). On the other hand, it's possible that no satisfying salt can be found, though I believe that to be quite a low probability. There are things that can be done to make it more robust; I'll try to add a comment outlining them in case somebody does run into this with a data update.

@Manishearth (Contributor) commented Apr 14, 2019:

Oh wait, the set check deals with this; I'd forgotten it was there 😄. To be clear, I was specifically worried about cases where a single run of rehashes collides within itself, which claimed won't catch since we only update it afterwards.

(Worth leaving a comment saying that.)

out.write("pub fn composition_table(c1: char, c2: char) -> Option<char> {\n")
out.write(" match (c1, c2) {\n")
out.write(" if c1 < '\\u{10000}' && c2 < '\\u{10000}' {\n")
out.write(" mph_lookup((c1 as u32) << 16 | (c2 as u32), &[\n")

@Manishearth (Contributor) commented Apr 14, 2019:

Could the code outputting the mph_lookup calls be factored out into a function?
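
For context, the generated guard in the snippet above packs a pair of BMP characters into a single u32 so composition can reuse the same mph_lookup machinery; roughly:

```rust
/// Pack two BMP scalar values into one u32 key. Pairs involving
/// supplementary-plane characters don't fit in 16 bits each and have to
/// be handled by a separate fallback path.
fn composition_key(c1: char, c2: char) -> Option<u32> {
    if (c1 as u32) < 0x10000 && (c2 as u32) < 0x10000 {
        Some((c1 as u32) << 16 | (c2 as u32))
    } else {
        None
    }
}
```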

raphlinus added 2 commits on Apr 16, 2019

Move code out of tables: the code has been moved out of the tables module into perfect_hash, and there is a bit more explanation in comments.
@raphlinus (Contributor, Author) commented Apr 16, 2019

@Manishearth does this address your concerns? It's a bit denser (less cut-and-paste of generated code), but hopefully reasonably clear in organization and with comments.

@Manishearth (Contributor) left a comment

LGTM, minor issue

/// Look up the canonical combining class for a codepoint.
///
/// The value returned is as defined in the Unicode Character Database.
pub fn canonical_combining_class(c: char) -> u8 {

@Manishearth (Contributor) commented Apr 16, 2019:

These functions should live elsewhere

@raphlinus (Author, Contributor) commented Apr 16, 2019:

Their own module? That's what I did in 40f9ba6.

@Manishearth (Contributor) commented Apr 16, 2019

Thanks!

@Manishearth merged commit 7c23cc9 into unicode-rs:master on Apr 16, 2019

1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)