Add initial ZeroHashMap #2579

pdogr · 2022-09-19T03:41:10Z

#2532
Adds core functionality of ZeroHashMap.
Perfect hashes are computed using the CHD algorithm, with aHash as the hash function.

sffc

Praise: This looks like the type of hash map impl that will work well in ICU4X. It looks algorithmically similar to the approach rkyv is taking.

I would like to see benches against ZeroMap before merging, both data size and lookup performance, and code size for bonus points.

pdogr · 2022-09-19T06:35:46Z

> Praise: This looks like the type of hash map impl that will work well in ICU4X. It looks algorithmically similar to the approach rkyv is taking.
>
> I would like to see benches against ZeroMap before merging, both data size and lookup performance, and code size for bonus points.

Yes, I based my implementation on rkyv's hashmap, with a few changes. Both use the CHD algorithm with the random hash function approach. Appendix A of the paper describes a practical approach that uses a simpler hash but computes the key hash only once, compared to twice for the random hash function approach.

Changes from the rkyv impl:

Different hash function.
Normally the value would be stored inline at the slot, but here we store the index of the original placement of (k, v).
assignments.contains(&index) in rkyv takes O(bucket_chain_len), which might slow hash building a bit depending on the bucket chain length (I haven't benchmarked it). So we use a generation trick to make contains fast.
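One common way to get an O(1) `contains` with an O(1) reset is to stamp each slot with a generation counter. The sketch below illustrates that idea only; the type and method names (`GenerationSet`, `insert`, `clear`) are hypothetical and are not taken from this PR or from rkyv.

```rust
/// Tracks which slots are taken during one placement attempt, without
/// clearing the whole table between attempts.
/// Hypothetical helper, for illustration only.
struct GenerationSet {
    /// `stamps[i] == generation` means slot `i` is taken in this attempt.
    stamps: Vec<u32>,
    generation: u32,
}

impl GenerationSet {
    fn new(len: usize) -> Self {
        // Start at generation 1 so the zero-initialised stamps read as empty.
        Self { stamps: vec![0; len], generation: 1 }
    }

    /// O(1) membership test, independent of how many slots are taken.
    fn contains(&self, slot: usize) -> bool {
        self.stamps[slot] == self.generation
    }

    /// Marks `slot` as taken. Returns `false` if it was already taken.
    fn insert(&mut self, slot: usize) -> bool {
        if self.contains(slot) {
            return false;
        }
        self.stamps[slot] = self.generation;
        true
    }

    /// O(1) reset: bumping the generation invalidates every old stamp.
    /// (A real implementation would need to handle counter overflow.)
    fn clear(&mut self) {
        self.generation += 1;
    }
}
```

A builder would call `clear` before trying each new seed for a bucket and `insert` for each candidate slot, abandoning the seed as soon as `insert` returns `false`.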

rkyv does have one additional optimization for the case when the chain has only one bucket.
Instead of storing the seed in this case, the original index is stored in the displacement array, and seeds have the highest bit set to 1. Since the original index has the highest bit set to 0, the two cannot collide.
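The tagging scheme described above can be sketched as a pair of encode/decode helpers over a `u32`. This is only an illustration of the high-bit idea; `Entry`, `TAG`, `encode`, and `decode` are hypothetical names, not rkyv's or this PR's API.

```rust
/// One raw entry of a displacement array, decoded.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
enum Entry {
    /// General case: the entry holds a seed.
    Seed(u32),
    /// Single-element case: the entry holds the original index directly.
    Direct(u32),
}

/// The highest bit marks an entry as a seed.
const TAG: u32 = 1 << 31;

fn encode(entry: &Entry) -> u32 {
    match *entry {
        Entry::Seed(seed) => {
            debug_assert!(seed < TAG, "seed must fit in 31 bits");
            seed | TAG
        }
        Entry::Direct(index) => {
            debug_assert!(index < TAG, "index must fit in 31 bits");
            index
        }
    }
}

fn decode(raw: u32) -> Entry {
    if raw & TAG != 0 {
        Entry::Seed(raw & !TAG)
    } else {
        Entry::Direct(raw)
    }
}
```

The cost of the scheme is one bit of range: both seeds and indices are limited to 31 bits.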

sffc · 2022-09-19T07:28:20Z

Sounds good. When optimizing, please focus on (1) data size and (2) lookup speed. I don't care too much about building speed since it is done offline.

robertbastian

I haven't reviewed the construction algorithm yet, but I wanted to leave my comments so far.

robertbastian · 2022-09-19T11:30:08Z

utils/zerovec/src/map/hashmap.rs

+fn compute_hash<K: Hash>(seed: u32, k: K, m: usize) -> u32 {
+    let mut hasher = create_hasher_with_seed(seed.into());
+    k.hash(&mut hasher);
+    (hasher.finish() % m as u64) as u32


Suggested change

-    (hasher.finish() % m as u64) as u32
+    (hasher.finish() as usize % m) as u32

This avoids a u64 mod instruction on 32-bit architectures.

I have to take this back: it does not give the same results on 32-bit and 64-bit targets. Assume the hash is 2^32 + 1. On a 32-bit target it is truncated to 1, which is not congruent to 2^32 + 1 under arbitrary moduli.

I don't think we need 64 bits here, so let's do the arithmetic in 32 bits and then widen to usize.
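The mismatch can be reproduced on any machine by modelling the 32-bit `as usize` cast with `as u32`. This is a small sketch; `reduce_wide` and `reduce_truncated` are hypothetical names used only for this illustration.

```rust
/// Reduces a 64-bit hash the way a 64-bit target would, where
/// `hash as usize` keeps all 64 bits. Precondition: `m != 0`.
fn reduce_wide(hash: u64, m: u32) -> u32 {
    (hash % m as u64) as u32
}

/// Reduces the same hash the way a 32-bit target would after
/// `hash as usize`: the cast keeps only the low 32 bits, which is
/// modelled here with `as u32`. Precondition: `m != 0`.
fn reduce_truncated(hash: u64, m: u32) -> u32 {
    (hash as u32) % m
}
```

With `hash = 2^32 + 1` and `m = 3`, `reduce_wide` yields 2 while `reduce_truncated` yields 1, so data built on one target would not be readable on the other. The two only agree for moduli that divide 2^32, such as `m = 16`.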

pdogr · 2022-09-20T07:58:33Z

Added some benchmarks that reuse the data from the existing ones.
Results:

zeromap/lookup/large    time:   [188.63 ns 188.94 ns 189.27 ns]
                        change: [-2.8188% -2.3026% -1.7709%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  6 (6.00%) high severe

zerohashmap/lookup/large
                        time:   [82.513 ns 82.638 ns 82.766 ns]
                        change: [-1.1974% -0.7730% -0.3681%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

zeromap/lookup/large/hashmap
                        time:   [65.359 ns 65.476 ns 65.601 ns]
                        change: [-1.3054% -0.8122% -0.3220%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

A few tradeoffs and observations:

Almost half of the time is spent in hash computation. I'll try the "practical approach" from the paper above to see if it improves things.
FlexZeroVec vs ZeroVec for HashIndex entries (size vs lookup time).
Reverse mapping: ideally we would store key and value together as (K, V), which would remove the need for the reverse mapping and also give better locality.
Hash function: these benches use ahash for now. I will run some experiments with platform-independent hashes.

sffc · 2022-09-20T17:18:12Z

@pdogr Good data on the large map. How does the performance compare on the small version of the map?

pdogr · 2022-09-20T17:30:40Z

> @pdogr Good data on the large map. How does the performance compare on the small version of the map?

For smaller keys, hash computation time really comes into the picture:

zeromap/lookup/small    time:   [52.080 ns 52.323 ns 52.614 ns]
                        change: [+5.7258% +6.4265% +7.0869%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

zerohashmap/lookup/small
                        time:   [98.832 ns 99.025 ns 99.231 ns]
                        change: [-0.8245% -0.3353% +0.1726%] (p = 0.19 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe

zeromap/lookup/small/hashmap
                        time:   [65.108 ns 65.336 ns 65.587 ns]
                        change: [-5.1179% -4.3311% -3.5866%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

pdogr · 2022-09-20T17:53:24Z

FlexZeroVec vs ZeroVec

Using ZeroVec improves lookup time by ~11% (99 ns -> 87 ns) for small lookups and by ~25% (82 ns -> 64 ns) for large lookups with the existing algorithm.

sffc · 2022-09-20T18:51:45Z

It seems that another way to architect this could be to make this a standalone type that maps from keys to an index, and then keep a proper ZeroMap as the second stage of lookup.

robertbastian · 2022-09-20T21:20:27Z

utils/zerovec/src/map/hashmap.rs

+    where
+        K: Hash,
+    {
+        let l1 = compute_hash(0, k, self.displacements.len());


This will panic if the map is empty, because compute_hash will evaluate % 0. Add a non-zero modulus as a precondition of compute_hash and guard against the empty case here.
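One way to make the precondition explicit is to return `Option` and check the modulus before reducing. The sketch below assumes a stand-in hasher (`DefaultHasher` from the standard library) and mixes the seed in by hashing it first; neither choice is taken from the PR, which uses a seeded hasher of its own.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hashes `key` with `seed` and reduces the result modulo `m`.
/// Returns `None` when `m == 0`, because `x % 0` panics in Rust.
/// Assumes `m` fits in a `u32`, so the final cast does not truncate.
fn compute_hash<K: Hash>(seed: u32, key: K, m: usize) -> Option<u32> {
    if m == 0 {
        return None;
    }
    // Stand-in hasher for illustration; not the hash function used in the PR.
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    key.hash(&mut hasher);
    Some((hasher.finish() % m as u64) as u32)
}
```

A lookup on an empty map can then propagate the `None` with `?` and report a miss instead of panicking.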

pdogr · 2022-09-23T06:07:27Z

Switched to the practical approach from the paper, using a single 64-bit hash split into 16 bits for g and 24 bits each for f0 and f1.
Changed the hash function to wyhash.
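The 16/24/24 split can be sketched as follows. The function names echo ones mentioned in this PR's commit messages (`split_hash64`, `compute_index`), but the bodies are assumptions: which bits go to which field, the wrapping arithmetic, and the combining formula `(f0 + f1 * d0 + d1) mod m` are chosen for illustration and may differ from the merged code.

```rust
/// Splits one 64-bit hash into (g, f0, f1) using 16 + 24 + 24 bits.
/// `g` selects the bucket; `f0` and `f1` feed the displacement step.
/// The bit layout is an assumption. Precondition: `m != 0`.
fn split_hash64(hash: u64, m: usize) -> (usize, u32, u32) {
    let g = (hash >> 48) as usize % m; // top 16 bits, reduced to a bucket
    let f0 = ((hash >> 24) & 0xff_ffff) as u32; // middle 24 bits
    let f1 = (hash & 0xff_ffff) as u32; // low 24 bits
    (g, f0, f1)
}

/// Computes the slot for a key from its (f0, f1) pair and its bucket's
/// displacement pair (d0, d1), as (f0 + f1 * d0 + d1) mod m.
/// Arithmetic wraps at 32 bits so results match on every target.
/// Returns `None` for an empty table.
fn compute_index(f: (u32, u32), d: (u32, u32), m: u32) -> Option<usize> {
    if m == 0 {
        return None;
    }
    let mixed = f.1.wrapping_mul(d.0).wrapping_add(f.0).wrapping_add(d.1);
    Some((mixed % m) as usize)
}
```

Because the key is hashed once and the displacement step only does integer arithmetic, trying another (d0, d1) pair during construction does not require rehashing the keys.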

Benches

zeromap/lookup/small    time:   [50.796 ns 50.913 ns 51.042 ns]
                        change: [+0.1679% +0.7275% +1.3576%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  7 (7.00%) high severe

zeromap/lookup/large    time:   [196.49 ns 197.21 ns 198.13 ns]
                        change: [-7.7555% -6.8637% -5.9877%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe

zeromap/lookup/small/hashmap
                        time:   [69.802 ns 70.089 ns 70.439 ns]
                        change: [+2.9376% +3.6548% +4.3437%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low severe
  8 (8.00%) high mild

zeromap/lookup/large/hashmap
                        time:   [66.232 ns 66.464 ns 66.749 ns]
                        change: [+0.4782% +1.0997% +1.7342%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low severe
  3 (3.00%) high mild
  6 (6.00%) high severe

zerohashmap/lookup/small
                        time:   [45.760 ns 45.844 ns 45.928 ns]
                        change: [-1.8354% -1.3624% -0.9334%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low severe
  4 (4.00%) high mild

zerohashmap/lookup/large
                        time:   [43.285 ns 43.356 ns 43.434 ns]
                        change: [+3.1555% +3.8198% +4.4009%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

robertbastian · 2023-01-12T14:22:02Z

utils/zerovec/src/hashmap/mod.rs

+    /// placeholder.
+    #[cfg_attr(feature = "serde", serde(borrow))]
+    displacements: ZeroVec<'a, (u32, u32)>,
+    keys: K::Container,


For the next PR: this doesn't borrow keys and values. You'll need a ZeroHashMapBorrowed and all the boilerplate, as in ZeroMap.

robertbastian · 2023-01-12T14:23:08Z

utils/zerovec/src/hashmap/mod.rs

+/// assert_eq!(hashmap.get(&2), Some("c"));
+/// assert_eq!(hashmap.get(&4), None);
+/// ```
+#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]


For the next PR: ZeroMap has a special implementation that produces readable JSON; you can probably reuse it by pointing it at keys and values.

robertbastian · 2023-01-13T14:52:10Z

utils/zerovec/src/hashmap/mod.rs

+where
+    K: ZeroMapKV<'a> + ?Sized,
+    V: ZeroMapKV<'a> + ?Sized,
+{


not blocking: The iterator code is duplicated across ZeroMap and ZeroHashMap now. Move the logic to ZeroVecLike to deduplicate.

Will do this in a follow-up.

robertbastian · 2023-02-09T09:26:29Z

utils/zerovec/src/hashmap/serde.rs

+    where
+        S: serde::Serializer,
+    {
+        (&self.displacements, &self.keys, &self.values).serialize(serializer)


Nit: this isn't great for self-describing formats like JSON, but it's fine for now.

add core functionality of ZeroHashMap based on CHD algorithm

d455634

sffc reviewed Sep 19, 2022

robertbastian reviewed Sep 19, 2022

remove keys from HashIndex, rename iter -> keys, change return type of compute hash, change container to FlexZeroVec

9c4bba2

robertbastian reviewed Sep 19, 2022

pdogr added 3 commits September 19, 2022 13:28

rename variables, use vec! macro

4423552

remove #[macro_use] and directly import macro

3aa891e

add benchmarks for zerohashmap

9d01fd5

add #[zerovec::derive(Hash)] which derives byte hash of the ule; fix some function signatures

robertbastian reviewed Sep 20, 2022

make functions inline

2ee9633

remove duplicate casts; support only one `build_from_exact_iter` function

robertbastian reviewed Sep 20, 2022

add hashmap feature for ZeroHashMapStatic

92700b4

Manishearth reviewed Sep 21, 2022

pdogr and others added 6 commits September 21, 2022 17:53

remove unnecessary Iterator impl

5bd7ddb

Merge branch 'unicode-org:main' into hm

027a3ce

Apply reverse permutation in HashIndex bulding using zvl_permute

3a17393

Derive serde for HashIndex, ZeroHashMapStatic

9284bde

Change benches to read data from `large_zerohashmap.postcard`

modify generation algorithm using hashing only once

6ebde0e

replace ahash with wyhash

1d5f211

Merge branch 'unicode-org:main' into hm

305038f

robertbastian reviewed Sep 26, 2022

pdogr added 5 commits January 8, 2023 18:52

Use t1ha hash function

553705b

s/ZeroHashMapStatic/ZeroHashMap

7bf72b0

Add docs, fix zhm lookup bench, add zhm deserialize bench

061a38e

add zhm deserialize benches

b0d7b8b

remove hashindex and refactor code

3d10558

pdogr marked this pull request as ready for review January 12, 2023 10:25

pdogr requested a review from a team as a code owner January 12, 2023 10:25

pdogr added 2 commits January 12, 2023 13:45

move common functions to algorithms module

35b8393

impl FromIterator for zhm

52d9e79

robertbastian reviewed Jan 12, 2023

pdogr and others added 5 commits January 13, 2023 01:39

Update utils/zerovec/src/hashmap/mod.rs

1cebd79

Co-authored-by: Robert Bastian <robertbastian@users.noreply.github.com>

remove borrow, pub changes

3e1a1d8

add Hash to make_ule

51e577e

add contains_key, iter_keys, iter_values, iter

900c668

move key equality check to index method

minor benchmark refactor

575a040

robertbastian reviewed Jan 13, 2023

sffc mentioned this pull request Jan 16, 2023

Add ZeroTrie, an efficient string-to-int collection #2722

Merged

pdogr added 3 commits January 19, 2023 03:26

pass m as usize

60f4d7f

fix maths

ffe15f6

remove inlining

dbe8882

compute_index in u64; shift back to (usize, u32, u32) from split_hash64

robertbastian reviewed Feb 2, 2023

custom serde for zerohashmap to bake in length validations

8e6dae9

robertbastian reviewed Feb 9, 2023

pdogr added 4 commits February 9, 2023 10:17

revert back to u32 arithmetic

96fd1c2

Merge branch 'main' into hm

a2abdde

fix clippy errors

715f667

add derive Hash to make_ule

7811713

robertbastian approved these changes Feb 10, 2023

robertbastian removed the request for review from a team February 10, 2023 15:30

pdogr merged commit 6de6825 into unicode-org:main Feb 13, 2023
