Use HashMap for entities.rs #309

oberien · 2019-05-09T01:43:09Z

I did some benchmarking on how entities.rs could be implemented:

Before hashbrown integration into std

test tests::binary_search            ... bench:       4,356 ns/iter (+/- 129)
test tests::btreemap                 ... bench:       4,894 ns/iter (+/- 170)
test tests::eytzinger                ... bench:       3,813 ns/iter (+/- 181)
test tests::hashmap_fnv              ... bench:       1,303 ns/iter (+/- 199)
test tests::hashmap_seahash          ... bench:       2,429 ns/iter (+/- 38)
test tests::hashmap_siphash          ... bench:       1,729 ns/iter (+/- 59)
test tests::phf                      ... bench:       1,730 ns/iter (+/- 145)
test tests::via_match                ... bench:     104,778 ns/iter (+/- 2,642)
test tests::x_create_hashmap_fnv     ... bench:      44,698 ns/iter (+/- 1,442)
test tests::x_create_hashmap_siphash ... bench:      64,140 ns/iter (+/- 9,549)

With hashbrown (meaning on the latest nightly which has hashbrown integrated already):

test tests::binary_search            ... bench:       3,810 ns/iter (+/- 247)
test tests::btreemap                 ... bench:       4,433 ns/iter (+/- 177)
test tests::eytzinger                ... bench:       3,538 ns/iter (+/- 132)
test tests::hashmap_fnv              ... bench:       1,089 ns/iter (+/- 28)
test tests::hashmap_seahash          ... bench:       2,575 ns/iter (+/- 52)
test tests::hashmap_siphash          ... bench:       1,657 ns/iter (+/- 34)
test tests::phf                      ... bench:       1,687 ns/iter (+/- 54)
test tests::via_match                ... bench:     104,619 ns/iter (+/- 4,298)
test tests::x_create_hashmap_fnv     ... bench:      28,834 ns/iter (+/- 517)
test tests::x_create_hashmap_siphash ... bench:      44,849 ns/iter (+/- 667)

(Measurements were done by looking up 90 entities and 10 non-entities. Times are over all 100 lookups. Creation times are allocating the map with capacity and extending it from the entity list.)

The fastest one is HashMap with FNV-hasher. Unfortunately we can't create an FNV-based hashmap during compiletime, phf currently only supports SipHash. That said, phf with SipHash is as fast as the plain std hashmap with its SipHash (which was to be expected).

Let's compare the relevant ones face to face:

Alg	creation	lookup	#lookups before creation time is armorized (compared to FNV)
binary search	0	3_800	1000 - 1100
FNV	28_800	1_000	-
PHF	0	1_700	4000 - 4100

For "normal" documents assumed not to contain a lot of html entities (or entity-likes), PHF is always better than binary_search. FNV only starts getting better than PHF once there are 4100 HTML entities (or entity-likes). The idea of FNV would be to have a lazy_static, which performs creation of the HashMap, then it'll only ever be accessed immutably, not requiring any synchronization. Once created, querying it will be faster than anything else.

In the perfect world phf would support faster algorithms than SipHash. Unfortunately currently we need to make a tradeoff. Do we want to have more overhead when having less than 4000 entity-likes, or do we want to be faster when having more than 4000 entity-likes?

My guess is that most documents passed to pulldown-cmark don't contain more than 4000 html entities. As such IMO PHF should be used.

The text was updated successfully, but these errors were encountered:

oberien · 2019-05-09T15:53:53Z

I redid the tests with the overhead of lazy_static. The relevant part is that the uncontested atomic lookup in the branch condition of the always correct predicted branch of the lazy static has pretty much no impact on performance. In fact I ran about 10 benchmarks before I found one where the lazy_static variations weren't faster than their non-lazy_static counterpart.

test tests::hashmap_fnv                 ... bench:       1,110 ns/iter (+/- 37)
test tests::hashmap_fnv_lazy_static     ... bench:       1,121 ns/iter (+/- 20)
test tests::hashmap_seahash             ... bench:       2,565 ns/iter (+/- 66)
test tests::hashmap_seahash_lazy_static ... bench:       2,573 ns/iter (+/- 76)
test tests::hashmap_siphash             ... bench:       1,653 ns/iter (+/- 89)
test tests::hashmap_siphash_lazy_static ... bench:       1,673 ns/iter (+/- 76)

marcusklaas · 2019-06-02T10:20:14Z

I'm not opposed to using PHFs. It certainly is a good solution from a technical point of view.

My concern is that it does introduce an additional dependency, and one with non-negligible compile time cost. Given how rare html entities are and how binary search is already reasonably fast (~38ns/ lookup, if I read your overview correctly), I am not sure it is worth it. Thoughts, @raphlinus?

Martin1887 added the enhancement label May 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use HashMap for entities.rs #309

Use HashMap for entities.rs #309

oberien commented May 9, 2019 •

edited

oberien commented May 9, 2019

marcusklaas commented Jun 2, 2019

Use HashMap for entities.rs #309

Use HashMap for entities.rs #309

Comments

oberien commented May 9, 2019 • edited

oberien commented May 9, 2019

marcusklaas commented Jun 2, 2019

oberien commented May 9, 2019 •

edited