Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Use hashbrown in most workspace members #4389

Merged
merged 1 commit into from Feb 20, 2024

Conversation

jbencin
Copy link
Contributor

@jbencin jbencin commented Feb 16, 2024

Description

This PR replaces std::collections::{HashMap, HashSet} with the hashbrown equivalents in most of the workspace members (excluding stackslib and stacks-node). By doing this, we are able to validate blocks around 3% faster:

hyperfine -w 3 -r 20 "stacks-core/target/release/stacks-inspect.next replay-block /home/jbencin/data/next/ range 99990 100000" "stacks-core/target/release/stacks-inspect.hashbrown replay-block /home/jbencin/data/next/ range 99990 100000"
Benchmark 1: stacks-core/target/release/stacks-inspect.next replay-block /home/jbencin/data/next/ range 99990 100000
  Time (mean ± σ):     11.102 s ±  0.021 s    [User: 10.591 s, System: 0.460 s]
  Range (min … max):   11.066 s … 11.147 s    20 runs
 
Benchmark 2: stacks-core/target/release/stacks-inspect.hashbrown replay-block /home/jbencin/data/next/ range 99990 100000
  Time (mean ± σ):     10.738 s ±  0.023 s    [User: 10.223 s, System: 0.464 s]
  Range (min … max):   10.709 s … 10.796 s    20 runs
 
Summary
  stacks-core/target/release/stacks-inspect.hashbrown replay-block /home/jbencin/data/next/ range 99990 100000 ran
    1.03 ± 0.00 times faster than stacks-core/target/release/stacks-inspect.next replay-block /home/jbencin/data/next/ range 99990 100000

Thanks @cylewitruk for the suggestion!

Applicable issues

Additional info (benefits, drawbacks, caveats)

  • From the Hashbrown README, not sure if this matters in the locations that it is used:

    Uses AHash as the default hasher, which is much faster than SipHash. However, AHash does not provide the same level of HashDoS resistance as SipHash, so if that is important to you, you might want to consider using a different hasher

  • hashbrown is not quite a drop in replacement for the std::collections equivalents. It uses different trait bounds on certain functions, which breaks type inference in some places, and this is why I've skipped making this change in stackslib and stacks-node for now

Checklist

  • Test coverage for new or modified code paths
  • Changelog is updated
  • Required documentation changes (e.g., docs/rpc/openapi.yaml and rpc-endpoints.md for v2 endpoints, event-dispatcher.md for new events)
  • New clarity functions have corresponding PR in clarity-benchmarking repo
  • New integration test(s) added to bitcoin-tests.yml

Copy link
Collaborator

@xoloki xoloki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'm generally in favor of minimizing std.

Copy link

codecov bot commented Feb 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (5a4b388) 77.00% compared to head (610f2da) 48.39%.

Additional details and impacted files
@@             Coverage Diff             @@
##             next    #4389       +/-   ##
===========================================
- Coverage   77.00%   48.39%   -28.61%     
===========================================
  Files         448      448               
  Lines      321363   321363               
===========================================
- Hits       247454   155530    -91924     
- Misses      73909   165833    +91924     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@cylewitruk cylewitruk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm 👍

If we got 3% in the peripheral libs (+ clarity) then there's more than likely some nice gains to be had in stackslib, especially in the chainstate index & MARF code (src/chainstate/stacks) when it comes to block/transaction processing times.

@cylewitruk
Copy link
Member

Uses AHash as the default hasher, which is much faster than SipHash. However, AHash does not provide the same level of HashDoS resistance as SipHash, so if that is important to you, you might want to consider using a different hasher

This shouldn't be an issue as we're not directly using any arbitrary user-provided input as keys for hashmaps or values for hashsets, short of information in contracts such as qualified contract identifier, variable and function names, etc. which have been validated for both content and length.

There is some documentation from AHash regarding DoS resistance for anyone interested:

@jcnelson
Copy link
Member

jcnelson commented Feb 17, 2024 via email

@cylewitruk
Copy link
Member

Yes we are. Block IDs, txids, and clarity DB keys (among others) are user-chosen.

The only place where I can find any actual risk of putting arbitrary "dirty" user input in them is potentially in the HTTP server modules where there are HashMaps and HashSets keyed on String - we may want to keep these using std::collections... Otherwise, ClarityName is a guarded string (validated for both length and content), and Txid, StacksBlockId and related are all fixed-length arrays.

@cylewitruk
Copy link
Member

cylewitruk commented Feb 19, 2024

For giggles, here's some information compiled by Bing's ChatGPT which is admittedly much smarter than myself 😂

Screenshot_20240219_180144_Bing.jpg

Screenshot_20240219_180954_Bing.jpg

@jbencin
Copy link
Contributor Author

jbencin commented Feb 19, 2024

Block IDs, txids, and clarity DB keys

The user can't use any arbitrary data for these values, right? That would make finding collisions much harder

Even in the unlikely event that someone does manage find hash collisions, it doesn't "break" the HashMap, it just degrades performance. If we observe this, we just change that HashMap back to the one from std::collections

@xoloki
Copy link
Collaborator

xoloki commented Feb 19, 2024

Block IDs, txids, and clarity DB keys

The user can't use any arbitrary data for these values, right? That would make finding collisions much harder

Even in the unlikely event that someone does manage find hash collisions, it doesn't "break" the HashMap, it just degrades performance. If we observe this, we just change that HashMap back to the one from std::collections

We wouldn't even need to switch back to std, we can just use a different hasher:

https://doc.rust-lang.org/nightly/core/hash/struct.BuildHasherDefault.html#examples

@cylewitruk
Copy link
Member

The user can't use any arbitrary data for these values, right? That would make finding collisions much harder

The only place allowing HashMap keys or HashSet values of arbitrary length+content (i.e. String is the HTTP modules. I'm not familiar enough in this area to speak about validations done prior to storage, which is worth checking, but realistically an attack vector via http/headers is astronomically unlikely with even the worst "real" hashing algorithms.

@jcnelson
Copy link
Member

The user can't use any arbitrary data for these values, right? That would make finding collisions much harder

@jbencin @cylewitruk The user can choose the hash pre-image, which in turn lets them influence the structure hash table itself (to detrimental ends). The AHash hash function is not a cryptographic hash function [1], so we have to assume for now that calculating pre-images that hash to the same value is feasible. So, the user can force collisions by being clever about what inputs they use. The project README warns you about this. Unfortunately, it seems that Bing's stochastic parrot doesn't know enough to tell us these nuances.

Speaking of, according to the hashbrown README, hashbrown is the basis for std::collections::HashSet and std::collections::HashMap already [2]. So @jbencin, I'm curious how this speed-up was realized, if this is what was already being used under the hood?

Regarding collision resistance, I'm trying to figure out whether or not this is going to be a problem for us (or rather, already is a problem). The AHash authors speak about it here [3] and here [4].

[1] https://github.com/tkaitchuck/aHash?tab=readme-ov-file#ahash-------

[2] https://github.com/rust-lang/hashbrown?tab=readme-ov-file#hashbrown

[3] https://github.com/tkaitchuck/aHash/blob/master/FAQ.md

[4] https://github.com/tkaitchuck/aHash/wiki/How-aHash-is-resists-DOS-attacks

@jbencin
Copy link
Contributor Author

jbencin commented Feb 19, 2024

The AHash hash function is not a cryptographic hash function

I don't think a cryptographic hash is necessary for a hash table. If someone finds a single collision, it's not a problem. An attacker would have to find many collisions to noticeably degrade performance. And even if they do, it's only a performance issue. It's not like we're dealing with block data, where there could be serious consequences for even a single collision. Is there any cryptanalysis on AHash suggesting this is feasible or even possible?

I'm curious how this speed-up was realized, if this is what was already being used under the hood?

My guess is that it uses a different feature set when used via the hashbrown crate, like substituting AHash for SipHash. It may also be a newer version than the one in the standard library. Some things fail to build when replaced with the hashbrown equivalents, so it's not identical. Also, it increases the binary size by a few MB when I use it

@jbencin
Copy link
Contributor Author

jbencin commented Feb 19, 2024

Also, from the AHash README:

  • runtime-rng: To obtain a seed for Hashers will obtain randomness from the operating system. (On by default) This is done using the getrandom crate.
  • compile-time-rng: For OS targets without access to a random number generator, compile-time-rng provides an alternative. If getrandom is unavailable and compile-time-rng is enabled, aHash will generate random numbers at compile time and embed them in the binary.

It looks like AHash is initialized with a random seed, which affects the value of the hashes produced. This would make any attempt at a DoS attack pointless, as hashes would not be consistent across nodes, and would not persist through a restart. It also means there is no way for an attacker to predict the hashing behavior of a node in order to find collisions, as it cannot observe the internal state

@jcnelson
Copy link
Member

It looks like AHash is initialized with a random seed, which affects the value of the hashes produced. This would make any attempt at a DoS attack pointless, as hashes would not be consistent across nodes, and would not persist through a restart. It also means there is no way for an attacker to predict the hashing behavior of a node in order to find collisions, as it cannot observe the internal state

I doubt both of these claims. First, the attacker can measure (through many samples, if needed) the amount of time the node takes to handle a p2p or HTTP message. If there's a hashing operation on these code paths, then the attacker can infer the state of the hash table based on how quickly or slowly the target node processes the operation. AHash doesn't claim constant-time hash operations, so I very much doubt that this information cannot be learned. Second, armed with this information, the attacker can construct a sequence of queries against the node that lead to the DoS. The random factor raises the difficulty of this attack only by increasing the number of samples that must be acquired.

However, the real issue here is that there's a DoS-able hash table on a network I/O path. What we really want to do is make all such hash tables bound in state, so that the worst case performance for an insert is O(max-size) (which is O(1)).

@jbencin
Copy link
Contributor Author

jbencin commented Feb 19, 2024

First, the attacker can measure (through many samples, if needed)

Is collecting enough data for a timing attack really feasible here? Would it require paying transaction fees? Shouldn't we ban clients anyway that make excessive requests anyways? If not, isn't that a much simpler DoS attack vector than any HashMap?

@jcnelson
Copy link
Member

Is collecting enough data for a timing attack really feasible here? Would it require paying transaction fees? Shouldn't we ban clients anyway that make excessive requests anyways? If not, isn't that a much simpler DoS attack vector than any HashMap?

  1. Depends on how patient the attacker is
  2. Probably not
  3. They'll just wait longer between requests
  4. Public node operators already firewall their nodes to throttle requests

Anyway, a few comments ago I pointed out that hashbrown is already the default hasher for std, so if it leads to network vulnerabilities, they're already present. The fix for them is to bound the amount of state that any network-reachable hash table can hold (and this is true for any container type used by the network; you don't want memory exhaustion to be possible regardless).

Back on topic, regarding merging this PR -- since hashbrown is already present in std, is there a way to tweak its compile-time settings directly, instead of importing hashbrown?

@xoloki
Copy link
Collaborator

xoloki commented Feb 19, 2024

Is collecting enough data for a timing attack really feasible here? Would it require paying transaction fees? Shouldn't we ban clients anyway that make excessive requests anyways? If not, isn't that a much simpler DoS attack vector than any HashMap?

  1. Depends on how patient the attacker is
  2. Probably not
  3. They'll just wait longer between requests
  4. Public node operators already firewall their nodes to throttle requests

Anyway, a few comments ago I pointed out that hashbrown is already the default hasher for std, so if it leads to network vulnerabilities, they're already present. The fix for them is to bound the amount of state that any network-reachable hash table can hold (and this is true for any container type used by the network; you don't want memory exhaustion to be possible regardless).

Back on topic, regarding merging this PR -- since hashbrown is already present in std, is there a way to tweak its compile-time settings directly, instead of importing hashbrown?

hashbrown is not in std. The code for hashbrown hashmap is essentially the same as the code for std hashmap, but the default hasher is different (ahash vs siphash).

The performance gains we see in this PR are due to the different hashers, not the hashmaps.

Anywhere we are worried about hash collision, we can either use std or use a different hasher for hashbrown.

Personally I think the odds of anyone actually managing to DOS a node with hashdos is essentially zero.

@jbencin
Copy link
Contributor Author

jbencin commented Feb 19, 2024

The fix for them is to bound the amount of state that any network-reachable hash

Right. If an attacker can insert enough entries into hashtables to cause problems without incurring significant transaction costs, that should be fixed, whether we have collisions or not

is there a way to tweak its compile-time settings directly

I don't think so. If there were, I don't think the hashbrown crate would have 240 million downloads. Looking through the Rust std code, it doesn't look like it's there

Personally I think the odds of anyone actually managing to DOS a node with hashdos is essentially zero

Even if they did manage to make a single node slow and unresponsive, I assume we'd detect it and restart the node. I'm not a DevOps expert, but I'd assume our k8s config has healthchecks for that. It would be a difficult attack to pull off, and I don't see the attacker getting anything out of it

…`./stackslib` and `./testnet/stacks-node`)
@cylewitruk
Copy link
Member

cylewitruk commented Feb 20, 2024

Personally I think the odds of anyone actually managing to DOS a node with hashdos is essentially zero.

Agree - we're in extreme-hypothetical-land here, and it's very unclear what such an attack would yield for a potential attacker.

I'm pro including this change in stackslib as well for any hashmap keyed by either a guarded_string!, numeric primitive or type alias for fixed-length arrays.

@jbencin jbencin merged commit f8c6760 into stacks-network:next Feb 20, 2024
1 of 2 checks passed
@jbencin jbencin deleted the chore/use-hashbrown branch February 20, 2024 16:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants