chore: Use `hashbrown` in most workspace members #4389

jbencin · 2024-02-16T23:00:21Z

Description

This PR replaces std::collections::{HashMap, HashSet} with the hashbrown equivalents in most of the workspace members (excluding stackslib and stacks-node). By doing this, we are able to validate blocks around 3% faster:

❯ hyperfine -w 3 -r 20 "stacks-core/target/release/stacks-inspect.next replay-block /home/jbencin/data/next/ range 99990 100000" "stacks-core/target/release/stacks-inspect.hashbrown replay-block /home/jbencin/data/next/ range 99990 100000"
Benchmark 1: stacks-core/target/release/stacks-inspect.next replay-block /home/jbencin/data/next/ range 99990 100000
  Time (mean ± σ):     11.102 s ±  0.021 s    [User: 10.591 s, System: 0.460 s]
  Range (min … max):   11.066 s … 11.147 s    20 runs
 
Benchmark 2: stacks-core/target/release/stacks-inspect.hashbrown replay-block /home/jbencin/data/next/ range 99990 100000
  Time (mean ± σ):     10.738 s ±  0.023 s    [User: 10.223 s, System: 0.464 s]
  Range (min … max):   10.709 s … 10.796 s    20 runs
 
Summary
  stacks-core/target/release/stacks-inspect.hashbrown replay-block /home/jbencin/data/next/ range 99990 100000 ran
    1.03 ± 0.00 times faster than stacks-core/target/release/stacks-inspect.next replay-block /home/jbencin/data/next/ range 99990 100000

Thanks @cylewitruk for the suggestion!

Applicable issues

Find Optimizations in the current Clarity VM via benchmarking current VM and Wasm #4316

Additional info (benefits, drawbacks, caveats)

From the Hashbrown README, not sure if this matters in the locations that it is used:

Uses AHash as the default hasher, which is much faster than SipHash. However, AHash does not provide the same level of HashDoS resistance as SipHash, so if that is important to you, you might want to consider using a different hasher
hashbrown is not quite a drop in replacement for the std::collections equivalents. It uses different trait bounds on certain functions, which breaks type inference in some places, and this is why I've skipped making this change in stackslib and stacks-node for now

Checklist

Test coverage for new or modified code paths
Changelog is updated
Required documentation changes (e.g., docs/rpc/openapi.yaml and rpc-endpoints.md for v2 endpoints, event-dispatcher.md for new events)
New clarity functions have corresponding PR in clarity-benchmarking repo
New integration test(s) added to bitcoin-tests.yml

xoloki

LGTM. I'm generally in favor of minimizing std.

codecov · 2024-02-17T02:42:46Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (5a4b388) 77.00% compared to head (610f2da) 48.39%.

Additional details and impacted files

@@             Coverage Diff             @@
##             next    #4389       +/-   ##
===========================================
- Coverage   77.00%   48.39%   -28.61%     
===========================================
  Files         448      448               
  Lines      321363   321363               
===========================================
- Hits       247454   155530    -91924     
- Misses      73909   165833    +91924

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

cylewitruk

lgtm 👍

If we got 3% in the peripheral libs (+ clarity) then there's more than likely some nice gains to be had in stackslib, especially in the chainstate index & MARF code (src/chainstate/stacks) when it comes to block/transaction processing times.

cylewitruk · 2024-02-17T11:13:31Z

Uses AHash as the default hasher, which is much faster than SipHash. However, AHash does not provide the same level of HashDoS resistance as SipHash, so if that is important to you, you might want to consider using a different hasher

This shouldn't be an issue as we're not directly using any arbitrary user-provided input as keys for hashmaps or values for hashsets, short of information in contracts such as qualified contract identifier, variable and function names, etc. which have been validated for both content and length.

There is some documentation from AHash regarding DoS resistance for anyone interested:

jcnelson · 2024-02-17T13:53:07Z

This shouldn't be an issue as we're not directly using any arbitrary

user-provided input as keys for hashmaps or values for hashsets, Yes we are. Block IDs, txids, and clarity DB keys (among others) are user-chosen.

…

On Sat, Feb 17, 2024, 6:13 AM Cyle Witruk ***@***.***> wrote: Uses AHash <https://github.com/tkaitchuck/aHash> as the default hasher, which is much faster than SipHash. However, AHash does not provide the same level of HashDoS resistance as SipHash, so if that is important to you, you might want to consider using a different hasher This shouldn't be an issue as we're not directly using any arbitrary user-provided input as keys for hashmaps or values for hashsets, short of information in contracts such as qualified contract identifier, variable and function names, etc. which have been validated for both content and length. There is some documentation from AHash regarding DoS resistance for anyone interested: - https://github.com/tkaitchuck/aHash/blob/master/FAQ.md - https://github.com/tkaitchuck/aHash/wiki/How-aHash-is-resists-DOS-attacks — Reply to this email directly, view it on GitHub <#4389 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AADQJK7OWX4MIC77DC73O4DYUCGGRAVCNFSM6AAAAABDMYNKL2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNBZHEZTSNRYGY> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

cylewitruk · 2024-02-17T23:58:33Z

Yes we are. Block IDs, txids, and clarity DB keys (among others) are user-chosen.

The only place where I can find any actual risk of putting arbitrary "dirty" user input in them is potentially in the HTTP server modules where there are HashMaps and HashSets keyed on String - we may want to keep these using std::collections... Otherwise, ClarityName is a guarded string (validated for both length and content), and Txid, StacksBlockId and related are all fixed-length arrays.

cylewitruk · 2024-02-19T17:04:34Z

For giggles, here's some information compiled by Bing's ChatGPT which is admittedly much smarter than myself 😂

jbencin · 2024-02-19T17:37:10Z

Block IDs, txids, and clarity DB keys

The user can't use any arbitrary data for these values, right? That would make finding collisions much harder

Even in the unlikely event that someone does manage find hash collisions, it doesn't "break" the HashMap, it just degrades performance. If we observe this, we just change that HashMap back to the one from std::collections

xoloki · 2024-02-19T17:50:17Z

Block IDs, txids, and clarity DB keys

The user can't use any arbitrary data for these values, right? That would make finding collisions much harder

Even in the unlikely event that someone does manage find hash collisions, it doesn't "break" the HashMap, it just degrades performance. If we observe this, we just change that HashMap back to the one from std::collections

We wouldn't even need to switch back to std, we can just use a different hasher:

https://doc.rust-lang.org/nightly/core/hash/struct.BuildHasherDefault.html#examples

cylewitruk · 2024-02-19T18:12:02Z

The user can't use any arbitrary data for these values, right? That would make finding collisions much harder

The only place allowing HashMap keys or HashSet values of arbitrary length+content (i.e. String is the HTTP modules. I'm not familiar enough in this area to speak about validations done prior to storage, which is worth checking, but realistically an attack vector via http/headers is astronomically unlikely with even the worst "real" hashing algorithms.

jcnelson · 2024-02-19T18:19:46Z

The user can't use any arbitrary data for these values, right? That would make finding collisions much harder

@jbencin @cylewitruk The user can choose the hash pre-image, which in turn lets them influence the structure hash table itself (to detrimental ends). The AHash hash function is not a cryptographic hash function [1], so we have to assume for now that calculating pre-images that hash to the same value is feasible. So, the user can force collisions by being clever about what inputs they use. The project README warns you about this. Unfortunately, it seems that Bing's stochastic parrot doesn't know enough to tell us these nuances.

Speaking of, according to the hashbrown README, hashbrown is the basis for std::collections::HashSet and std::collections::HashMap already [2]. So @jbencin, I'm curious how this speed-up was realized, if this is what was already being used under the hood?

Regarding collision resistance, I'm trying to figure out whether or not this is going to be a problem for us (or rather, already is a problem). The AHash authors speak about it here [3] and here [4].

[1] https://github.com/tkaitchuck/aHash?tab=readme-ov-file#ahash-------

[2] https://github.com/rust-lang/hashbrown?tab=readme-ov-file#hashbrown

[3] https://github.com/tkaitchuck/aHash/blob/master/FAQ.md

[4] https://github.com/tkaitchuck/aHash/wiki/How-aHash-is-resists-DOS-attacks

jbencin · 2024-02-19T18:55:58Z

The AHash hash function is not a cryptographic hash function

I don't think a cryptographic hash is necessary for a hash table. If someone finds a single collision, it's not a problem. An attacker would have to find many collisions to noticeably degrade performance. And even if they do, it's only a performance issue. It's not like we're dealing with block data, where there could be serious consequences for even a single collision. Is there any cryptanalysis on AHash suggesting this is feasible or even possible?

I'm curious how this speed-up was realized, if this is what was already being used under the hood?

My guess is that it uses a different feature set when used via the hashbrown crate, like substituting AHash for SipHash. It may also be a newer version than the one in the standard library. Some things fail to build when replaced with the hashbrown equivalents, so it's not identical. Also, it increases the binary size by a few MB when I use it

jbencin · 2024-02-19T19:21:45Z

Also, from the AHash README:

runtime-rng: To obtain a seed for Hashers will obtain randomness from the operating system. (On by default) This is done using the getrandom crate.

compile-time-rng: For OS targets without access to a random number generator, compile-time-rng provides an alternative. If getrandom is unavailable and compile-time-rng is enabled, aHash will generate random numbers at compile time and embed them in the binary.

It looks like AHash is initialized with a random seed, which affects the value of the hashes produced. This would make any attempt at a DoS attack pointless, as hashes would not be consistent across nodes, and would not persist through a restart. It also means there is no way for an attacker to predict the hashing behavior of a node in order to find collisions, as it cannot observe the internal state

jcnelson · 2024-02-19T20:53:39Z

It looks like AHash is initialized with a random seed, which affects the value of the hashes produced. This would make any attempt at a DoS attack pointless, as hashes would not be consistent across nodes, and would not persist through a restart. It also means there is no way for an attacker to predict the hashing behavior of a node in order to find collisions, as it cannot observe the internal state

I doubt both of these claims. First, the attacker can measure (through many samples, if needed) the amount of time the node takes to handle a p2p or HTTP message. If there's a hashing operation on these code paths, then the attacker can infer the state of the hash table based on how quickly or slowly the target node processes the operation. AHash doesn't claim constant-time hash operations, so I very much doubt that this information cannot be learned. Second, armed with this information, the attacker can construct a sequence of queries against the node that lead to the DoS. The random factor raises the difficulty of this attack only by increasing the number of samples that must be acquired.

However, the real issue here is that there's a DoS-able hash table on a network I/O path. What we really want to do is make all such hash tables bound in state, so that the worst case performance for an insert is O(max-size) (which is O(1)).

jbencin · 2024-02-19T21:58:56Z

First, the attacker can measure (through many samples, if needed)

Is collecting enough data for a timing attack really feasible here? Would it require paying transaction fees? Shouldn't we ban clients anyway that make excessive requests anyways? If not, isn't that a much simpler DoS attack vector than any HashMap?

jcnelson · 2024-02-19T22:21:52Z

Is collecting enough data for a timing attack really feasible here? Would it require paying transaction fees? Shouldn't we ban clients anyway that make excessive requests anyways? If not, isn't that a much simpler DoS attack vector than any HashMap?

Depends on how patient the attacker is
Probably not
They'll just wait longer between requests
Public node operators already firewall their nodes to throttle requests

Anyway, a few comments ago I pointed out that hashbrown is already the default hasher for std, so if it leads to network vulnerabilities, they're already present. The fix for them is to bound the amount of state that any network-reachable hash table can hold (and this is true for any container type used by the network; you don't want memory exhaustion to be possible regardless).

Back on topic, regarding merging this PR -- since hashbrown is already present in std, is there a way to tweak its compile-time settings directly, instead of importing hashbrown?

xoloki · 2024-02-19T23:01:03Z

Is collecting enough data for a timing attack really feasible here? Would it require paying transaction fees? Shouldn't we ban clients anyway that make excessive requests anyways? If not, isn't that a much simpler DoS attack vector than any HashMap?

Depends on how patient the attacker is

Probably not

They'll just wait longer between requests

Public node operators already firewall their nodes to throttle requests

Anyway, a few comments ago I pointed out that hashbrown is already the default hasher for std, so if it leads to network vulnerabilities, they're already present. The fix for them is to bound the amount of state that any network-reachable hash table can hold (and this is true for any container type used by the network; you don't want memory exhaustion to be possible regardless).

Back on topic, regarding merging this PR -- since hashbrown is already present in std, is there a way to tweak its compile-time settings directly, instead of importing hashbrown?

hashbrown is not in std. The code for hashbrown hashmap is essentially the same as the code for std hashmap, but the default hasher is different (ahash vs siphash).

The performance gains we see in this PR are due to the different hashers, not the hashmaps.

Anywhere we are worried about hash collision, we can either use std or use a different hasher for hashbrown.

Personally I think the odds of anyone actually managing to DOS a node with hashdos is essentially zero.

jbencin · 2024-02-19T23:11:17Z

The fix for them is to bound the amount of state that any network-reachable hash

Right. If an attacker can insert enough entries into hashtables to cause problems without incurring significant transaction costs, that should be fixed, whether we have collisions or not

is there a way to tweak its compile-time settings directly

I don't think so. If there were, I don't think the hashbrown crate would have 240 million downloads. Looking through the Rust std code, it doesn't look like it's there

Personally I think the odds of anyone actually managing to DOS a node with hashdos is essentially zero

Even if they did manage to make a single node slow and unresponsive, I assume we'd detect it and restart the node. I'm not a DevOps expert, but I'd assume our k8s config has healthchecks for that. It would be a difficult attack to pull off, and I don't see the attacker getting anything out of it

…`./stackslib` and `./testnet/stacks-node`)

cylewitruk · 2024-02-20T15:53:09Z

Personally I think the odds of anyone actually managing to DOS a node with hashdos is essentially zero.

Agree - we're in extreme-hypothetical-land here, and it's very unclear what such an attack would yield for a potential attacker.

I'm pro including this change in stackslib as well for any hashmap keyed by either a guarded_string!, numeric primitive or type alias for fixed-length arrays.

xoloki approved these changes Feb 16, 2024

View reviewed changes

jbencin force-pushed the chore/use-hashbrown branch from 9d74f4e to 93efa44 Compare February 17, 2024 02:27

cylewitruk approved these changes Feb 17, 2024

View reviewed changes

chore: Replace HashMap and HashSet with Hashbrown versions (skip …

610f2da

…`./stackslib` and `./testnet/stacks-node`)

jbencin force-pushed the chore/use-hashbrown branch from 93efa44 to 610f2da Compare February 20, 2024 15:21

jbencin merged commit f8c6760 into stacks-network:next Feb 20, 2024
1 of 2 checks passed

jbencin deleted the chore/use-hashbrown branch February 20, 2024 16:11

cylewitruk mentioned this pull request Feb 28, 2024

[Chore] Add wrapper types for hashbrown's HashMap and HashSet #4441

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore: Use `hashbrown` in most workspace members #4389

chore: Use `hashbrown` in most workspace members #4389

jbencin commented Feb 16, 2024

xoloki left a comment

codecov bot commented Feb 17, 2024 •

edited

cylewitruk left a comment •

edited

cylewitruk commented Feb 17, 2024

jcnelson commented Feb 17, 2024 via email

cylewitruk commented Feb 17, 2024

cylewitruk commented Feb 19, 2024 •

edited

jbencin commented Feb 19, 2024 •

edited by cylewitruk

xoloki commented Feb 19, 2024

cylewitruk commented Feb 19, 2024

jcnelson commented Feb 19, 2024

jbencin commented Feb 19, 2024 •

edited

jbencin commented Feb 19, 2024 •

edited

jcnelson commented Feb 19, 2024

jbencin commented Feb 19, 2024

jcnelson commented Feb 19, 2024

xoloki commented Feb 19, 2024

jbencin commented Feb 19, 2024

cylewitruk commented Feb 20, 2024 •

edited

chore: Use hashbrown in most workspace members #4389

chore: Use hashbrown in most workspace members #4389

Conversation

jbencin commented Feb 16, 2024

Description

Applicable issues

Additional info (benefits, drawbacks, caveats)

Checklist

xoloki left a comment

Choose a reason for hiding this comment

codecov bot commented Feb 17, 2024 • edited

Codecov Report

cylewitruk left a comment • edited

Choose a reason for hiding this comment

cylewitruk commented Feb 17, 2024

jcnelson commented Feb 17, 2024 via email

cylewitruk commented Feb 17, 2024

cylewitruk commented Feb 19, 2024 • edited

jbencin commented Feb 19, 2024 • edited by cylewitruk

xoloki commented Feb 19, 2024

cylewitruk commented Feb 19, 2024

jcnelson commented Feb 19, 2024

jbencin commented Feb 19, 2024 • edited

jbencin commented Feb 19, 2024 • edited

jcnelson commented Feb 19, 2024

jbencin commented Feb 19, 2024

jcnelson commented Feb 19, 2024

xoloki commented Feb 19, 2024

jbencin commented Feb 19, 2024

cylewitruk commented Feb 20, 2024 • edited

chore: Use `hashbrown` in most workspace members #4389

chore: Use `hashbrown` in most workspace members #4389

codecov bot commented Feb 17, 2024 •

edited

cylewitruk left a comment •

edited

cylewitruk commented Feb 19, 2024 •

edited

jbencin commented Feb 19, 2024 •

edited by cylewitruk

jbencin commented Feb 19, 2024 •

edited

jbencin commented Feb 19, 2024 •

edited

cylewitruk commented Feb 20, 2024 •

edited