
Exposure of HashMap iteration order allows for O(n²) blowup. #36481

Open
Veedrac opened this issue Sep 14, 2016 · 101 comments
Labels
A-collections Area: `std::collection` C-bug Category: This is a bug. P-medium Medium priority T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@Veedrac
Contributor

Veedrac commented Sep 14, 2016

Exposing HashMap's iteration order can cause O(n²) blowup even in innocent-looking code without the presence of an attacker. In the presence of an attacker, access to the order of a dictionary allows HashDoS-like attacks with only two requests in common scenarios.

Without an attacker

Consider a user with two possibly-disjoint hash maps

let first_map: HashMap<u64, _> = (0..900000).map(|i| (i, ())).collect();
let second_map: HashMap<u64, _> = (900000..1800000).map(|i| (i, ())).collect();

The user wants to merge the hash maps, and does so naïvely,

let mut merged = first_map;
merged.extend(second_map);

Time for merge when second_map is shuffled: 0.4s

Time for merge when second_map is not shuffled: 40s (x100 amplification)

This effect is noticeably more pronounced when merging with a round robin strategy.
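For reference, a self-contained sketch of this experiment (sizes reduced; absolute timings are machine-dependent, and a post-hashbrown std may show a much smaller gap than the figures above — the harness only demonstrates how to measure it):

```rust
use std::collections::HashMap;
use std::time::Instant;

// Merge `pairs` into `first` and report (resulting length, seconds taken).
fn timed_merge(mut first: HashMap<u64, ()>, pairs: Vec<(u64, ())>) -> (usize, f64) {
    let t0 = Instant::now();
    first.extend(pairs);
    (first.len(), t0.elapsed().as_secs_f64())
}

fn main() {
    let n: u64 = 100_000;
    let first: HashMap<u64, ()> = (0..n).map(|i| (i, ())).collect();
    let second: HashMap<u64, ()> = (n..2 * n).map(|i| (i, ())).collect();

    // Pathological case: feed `second` in its own iteration order, which
    // correlates with `first`'s bucket layout when both use the same seed.
    let in_iter_order: Vec<(u64, ())> = second.iter().map(|(&k, &v)| (k, v)).collect();
    // Control: the same pairs sorted by key.
    let mut sorted = in_iter_order.clone();
    sorted.sort_unstable_by_key(|&(k, _)| k);

    let (len_a, t_iter) = timed_merge(first.clone(), in_iter_order);
    let (len_b, t_sorted) = timed_merge(first, sorted);
    assert_eq!(len_a, 2 * n as usize);
    assert_eq!(len_b, 2 * n as usize);
    println!("merge in iteration order: {t_iter:.3}s, sorted: {t_sorted:.3}s");
}
```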

With an attacker

The threat model here is simple. The attacker is able to send JSON to the server. The server parses the JSON into a HashMap and through whatever means - an error message including the formatted map or explicit listing of the contents of the map - may reveal the order of the map.

The attack on this model requires two requests. The first sends some large JSON to the server

let first_request: HashMap<u64, _> = (0..900000).map(|i| (i, ())).collect();

and receives an ordering

let returned_data: Vec<_> = first_request.into_iter().collect();

The attacker then sends the first and third quarters of this list in a new JSON object.

let second_request: HashMap<u64, _> =
 returned_data[..225000].iter()
     .chain(&returned_data[450000..675000])
     .cloned().collect();

Total time without second request: 0.1s

Total time with second request: 200s (x2000 amplification)


Solutions, near-solutions and non-solutions

These solutions are not ordered, not necessarily disjoint, and not an exhaustive list of options.

Fall back to BTreeMap

It should be clear that we cannot treat hash maps as a solved problem just because we use SipHash. In fact, SipHash is entirely insufficient to solve the problem. My first suggestion is thus to stop trying to make the hasher secure, and instead fall back to BTreeMap when nonlinear behaviour is detected.

This guarantees a minimum level of performance regardless of the capabilities of the attacker, and allows usage of a faster hashing algorithm by default. Hash maps should get faster by default as a result. This does not prevent having to consider the issue, since the fallback is costly and must be rare, but this is an easier problem than entirely securing the hash map.

Use a hash map without problematic blowup, or less affected by it

Java solves this problem by using a hash map with chaining and converting large buckets to tree maps. This mitigates the impact of degradation, but does not seem to allow using contiguous hash maps by default.

As far as I am aware, the blowup cannot be resolved by moving to another common form of open addressing, although quadratic probing would be significantly less affected by some of these attacks. Chaining alone also defeats the attacks given here, but still requires a secure hash and fails with attackers with more capabilities.

Use different seeds for each hash map

Pull requests #31356 (closed) and #33318 (merged) first proposed incrementing the thread local seed for each hash map. This was later removed when no threat model was seen, but would prevent both attacks listed here.

This still allows attacks when hash maps are reused.

Randomize iteration order

I am not sure how one would randomize iteration order efficiently. However, it should solve the problem unless hashes are exposed through other means.
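One cheap variant would randomize at the API boundary rather than inside the table. A sketch, not a std proposal; the helper name and the toy xorshift PRNG are illustrative only:

```rust
use std::collections::HashMap;

// Return the map's entries in a freshly shuffled order, so the bucket order
// is never observable. The xorshift64 step is a stand-in for a real RNG,
// requires a nonzero seed, and is not cryptographic.
fn shuffled_entries<K: Clone, V: Clone>(map: &HashMap<K, V>, mut seed: u64) -> Vec<(K, V)> {
    assert!(seed != 0, "xorshift needs a nonzero seed");
    let mut out: Vec<(K, V)> = map.iter().map(|(k, v)| (k.clone(), v.clone())).collect();
    // Fisher-Yates shuffle.
    for i in (1..out.len()).rev() {
        seed ^= seed << 13;
        seed ^= seed >> 7;
        seed ^= seed << 17;
        let j = (seed % (i as u64 + 1)) as usize;
        out.swap(i, j);
    }
    out
}

fn main() {
    let map: HashMap<u32, u32> = (0..100).map(|i| (i, i * i)).collect();
    let mut entries = shuffled_entries(&map, 0x9e37_79b9_7f4a_7c15);
    assert_eq!(entries.len(), 100);
    entries.sort_unstable();
    assert_eq!(entries, (0..100).map(|i| (i, i * i)).collect::<Vec<_>>());
}
```

An observer who only ever sees shuffled output learns nothing about the bucket layout, though hashes could still leak through other channels such as timing.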

Ignore the problem

Given that Rust went so far as to use SipHash, quadratic blowup on code as simple as fst.extend(snd) seems too extreme to ignore.

@sfackler sfackler added I-nominated T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Sep 14, 2016
@sfackler
Member

It seems to me like we should basically just revert #33318 for now and use different seeds per map. Ideally we could find a strategy that avoids key reuse but also doesn't regress hashmap creation performance too much, but as @briansmith noted in #33318 there are risks to using multiple correlated keys as well.

@sfackler
Member

cc @rust-lang/libs

@briansmith
Contributor

The threat model here is simple. The attacker is able to send JSON to the server. The server parses the JSON into a HashMap and through whatever means - an error message including the formatted map or explicit listing of the contents of the map - may reveal the order of the map.

Perhaps we can resolve this by just saying this is outside the threat model. And/or, the Debug implementation for a HashMap isn't performance-sensitive in most applications, and so it could be redone to always output the values in sorted order by key, i.e. it could make a copy of itself into a BTreeMap and then Debug::fmt the BTreeMap. This seems reasonable since either the HashMap is small or it is probably too gigantic to reasonably expect to format in the first place. (Note that HashMap doesn't implement Display, nor should it.)

@sfackler
Member

We'd also have to mandate a sort in JSON serialization, which seems somewhat unreasonable.

The fact that simply merging two maps can run into sadness makes it seem to me like we have to do something about this.

@Veedrac
Contributor Author

Veedrac commented Sep 14, 2016

You can't always sort a hashmap because the keys aren't required to be Ord. This doesn't matter for the BTreeMap fallback because you can sort by hash code (note: this means it's required to use a cryptographic hash), but sorting by hash is exactly the opposite of what you want to do when you don't have that fallback.

@sfackler
Member

A shuffle would do fine though I think.

@eternaleye
Contributor

eternaleye commented Sep 15, 2016

@briansmith: Debug is just a trivial example of how the information might leak, though, not the actual issue.

Another example would be if a web service validates the JSON it's given key-by-key, by iterating over (key, value) pairs and delegating to a validation function.

  • If it returns the first error it encounters, the attacker can learn the order.
  • If it errors early, even without a descriptive message, the attacker can learn the order.
  • If it collects the errors and returns them without shuffling (or sorting), the attacker can learn the order.

It's a footgun.

@Veedrac
Contributor Author

Veedrac commented Sep 15, 2016

We could also use Python's technique to preserve insertion order. It introduces a little more indirection, but the cost doesn't seem to be excessive.

This also makes hash map order deterministic, which is a nice usability bonus. Note that tombstones aren't required for determinism, but are required to preserve insertion order through deletions. Either way prevents the attacks given.
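A minimal sketch of that layout (a sparse index table pointing into a dense, insertion-ordered entry vector, as in CPython 3.6's dict; no resizing, no deletion or tombstones, and all names are illustrative, not std's design):

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash};

// Sparse index table pointing into a dense, insertion-ordered entry vector.
struct InsertionOrderMap<K, V> {
    indices: Vec<Option<usize>>, // slot -> position in `entries`
    entries: Vec<(K, V)>,        // dense, in insertion order
    hasher: RandomState,
}

impl<K: Hash + Eq, V> InsertionOrderMap<K, V> {
    fn with_slots(slots: usize) -> Self {
        assert!(slots.is_power_of_two());
        Self { indices: vec![None; slots], entries: Vec::new(), hasher: RandomState::new() }
    }

    // Probe linearly until we find `key` or the first empty slot.
    fn slot_for(&self, key: &K) -> usize {
        let mask = self.indices.len() - 1;
        let mut slot = self.hasher.hash_one(key) as usize & mask;
        loop {
            match self.indices[slot] {
                Some(i) if self.entries[i].0 == *key => return slot,
                None => return slot,
                _ => slot = (slot + 1) & mask,
            }
        }
    }

    fn insert(&mut self, key: K, value: V) {
        assert!(self.entries.len() < self.indices.len(), "sketch has no resizing");
        let slot = self.slot_for(&key);
        match self.indices[slot] {
            Some(i) => self.entries[i].1 = value,
            None => {
                self.indices[slot] = Some(self.entries.len());
                self.entries.push((key, value));
            }
        }
    }

    fn get(&self, key: &K) -> Option<&V> {
        self.indices[self.slot_for(key)].map(|i| &self.entries[i].1)
    }

    // Iteration just walks the dense vector: deterministic insertion order,
    // independent of hash values, and cache-friendly.
    fn iter(&self) -> impl Iterator<Item = &(K, V)> {
        self.entries.iter()
    }
}

fn main() {
    let mut m = InsertionOrderMap::with_slots(16);
    m.insert("b", 2);
    m.insert("a", 1);
    m.insert("c", 3);
    m.insert("a", 10); // update keeps the original position
    assert_eq!(m.get(&"a"), Some(&10));
    let keys: Vec<_> = m.iter().map(|(k, _)| *k).collect();
    assert_eq!(keys, ["b", "a", "c"]);
}
```

The cost is the extra indirection on lookup; the win is iteration and growth speed over a dense vector.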

@arthurprs
Contributor

arthurprs commented Sep 15, 2016

Revisiting #33318 sounds like the reasonable way out.

@arielb1
Contributor

arielb1 commented Sep 15, 2016

HashMap creation needs to be fast in the common case. I think we need a HashMap that defaults to FNV but rehashes with a random SipHash key when it encounters too many collisions.

@eternaleye
Contributor

eternaleye commented Sep 15, 2016

TBH, I'm also curious how alternate approaches (a BTreeMap whose nodes are stored in an overallocated Vec, for similar allocation amortization as HashMap; a trie or crit-bit tree keyed on hashes with the same allocation style, etc) would perform.

Both would likely avoid the pathological merging behavior, and the amortized allocations may bring them closer to HashMap.

@arielb1
Contributor

arielb1 commented Sep 15, 2016

So the problem here is not hard collisions, but rather soft collisions?

@bluss
Member

bluss commented Sep 15, 2016

@Veedrac It has some wins too, since the data that needs to be swapped by bucket stealing while inserting is smaller, and iteration is much faster since it's on a mostly compact space. But the indirection costs in lookup may not be acceptable at all. (I made a prototype that demonstrates exactly these differences with a +6% hit to lookup.)

Maybe one could move the hash field to be next to the index, or even special-case maps smaller than 2**32 entries so that they keep 32-bit hashes and 32-bit indices next to each other?

@Veedrac
Contributor Author

Veedrac commented Sep 15, 2016

@bluss The new layout should allow you to lower the occupancy (perhaps cap it at 80%?) without paying much of a hit, since the key/value arrays can safely cap out at 100% - only the index array needs to grow. This should get you some speed back, as well as make pathological insertions a lot rarer.

@Veedrac
Contributor Author

Veedrac commented Sep 15, 2016

@arielb1 Exactly; the first half of the table is ending up with significantly over 100% occupancy.

@Veedrac
Contributor Author

Veedrac commented Sep 16, 2016

Some extra discussion is available in this Reddit thread, including back of the envelope analysis of the asymptotics and commentary on a similar attack mentioned in the SipHash paper.

@bluss
Member

bluss commented Sep 16, 2016

The consistent-order map prototype is here: https://github.com/bluss/ordermap. I actually began it not because of this thread but out of curiosity about the new dict representation in Python 3.6.

Removal breaks the insertion-order property; otherwise you need tombstones (ideas/forks for an efficient way of doing that are welcome).

Conclusions: iteration can be much faster than libstd HashMap's, and growth too. Lookup is slightly slower (more than the 6% I said above if the map is large and out of cache). It solves the merge pathology.

@sfackler
Member

That looks pretty awesome at first glance.

I don't think we need to guarantee insertion-order iteration, but it would be nice to make it iterate more obviously in not-insertion order - maybe alternating between adding to the front and back of the iteration list, or something like that? Probably not worth doing if it'd get complex.

@bluss
Member

bluss commented Sep 16, 2016

However it looks in this issue, the current HashMap impl already has great performance in many respects. I don't think it's that easy to replace.

@sfackler
Member

Oh yeah, absolutely. It's just good to see that there are reasonable routes forward.

@Veedrac
Contributor Author

Veedrac commented Sep 16, 2016

@bluss Safe, stable Rust and only 6% slower insertion normally? Amazing work.

sfackler added a commit to sfackler/rust that referenced this issue Sep 18, 2016
This reverts commit eaeef3d.

This is a short-term workaround to rust-lang#36481.
@bluss
Member

bluss commented Sep 19, 2016

The particular representation in ordermap maps well to safe Rust (using vectors without holes). Making accurate benchmarks is tricky (there are so many different scenarios); I can only ask for help there, and my general summary is in the readme.

@alexcrichton
Member

The libs team discussed this during triage yesterday and the conclusion was that while we'd definitely like to fix this it's not absolutely critical we do so immediately. We don't mind taking a bit of time to explore our options like @bluss's crate.

@alexcrichton alexcrichton added P-high High priority and removed I-nominated labels Sep 29, 2016
@alkis
Contributor

alkis commented Oct 23, 2017

Here's a general solution to the problem: after every rehash or clone, reseed the hasher. It will regress the performance of rehashing, but it will solve the O(n²) problem. Luckily, in production code (as a whole), lookups dominate insertions by a large margin, so the tradeoff is actually sound.

Note: if you end up going this route, it might no longer make sense to store the full hashes in the hashtable, as they are not going to help when rehashing (they will still save comparisons, but it is likely that the costs outweigh the benefits). You can instead store the distance from the ideal position in a u8. This will reduce the amount of memory touched before finding the element or terminating the search.

@funny-falcon

@alkis

it might no longer make sense to store the full hashes ...
they are still going to save comparisons

Storing 8 bits of the hash is enough to save comparisons. Actually, I even stored 7 bits with great results (using the eighth bit for another purpose).
If this "hashik" is always non-zero (i.e. calculated as hashik = hash % 251 + 1), then it can also play the role of a "presence" flag.
(Modulo by a constant is optimized by the compiler into two multiplications.)
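A sketch of the arithmetic (the function name `tag` is illustrative):

```rust
// Derive a guaranteed-nonzero byte from a full 64-bit hash. Because
// hash % 251 is in 0..=250, the +1 maps it into 1..=251, leaving 0 free
// to mean "empty slot".
fn tag(hash: u64) -> u8 {
    (hash % 251 + 1) as u8
}

fn main() {
    assert_eq!(tag(0), 1);
    assert_eq!(tag(250), 251);
    // The tag is never zero, so zero can mark an empty slot.
    assert!((0..10_000u64).all(|h| tag(h) != 0));
    // Two hashes that collide in the low bucket bits can still disagree
    // on the tag, which is what saves the key comparison.
    assert_ne!(tag(0xdef0), tag(0x1_def0));
}
```

Since 251 is prime, entries landing in the same bucket still spread over 251 possible tag values.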

@arthurprs
Contributor

arthurprs commented Oct 24, 2017

That's not general, as it requires support from the hasher; most hashers just initialize to the same seed. Edit: it can be made general by hashing the table address as well #36481 (comment)

I don't think that's a reasonable tradeoff for a stdlib hasher; growing and cloning the hashmap becomes much, much slower. As the slot moving is essentially randomized, it'll require a lot of random access and Robin Hood adjustments as things go into the new table. The fact that SipHash isn't the fastest hasher around just makes it even worse.

@alkis
Contributor

alkis commented Oct 24, 2017

@funny-falcon if we store 8 bits of hash for comparisons, then we can't bail out early from probe chains based on distance from the ideal position. It basically voids the Robin Hood optimization. Or perhaps I misunderstood your suggestion.

@arthurprs yes, either the BuildHasher or Hasher traits need an additional requirement. BuildHasher can support it by requiring Default, and Hasher by allowing the seed to be set. You are right that there is going to be a performance hit - I already mentioned that it regresses the performance of rehashing. I posit there is no win-win solution here: you either take a performance hit by shuffling the order on rehash, or you get O(n²) insertion. As already mentioned, in the grand scheme of things this does not matter: insertions happen orders of magnitude less often than lookups. Also, a lot of the regression can be mitigated by using reserve().

@funny-falcon

@alkis, I mean store both: 8 bits of hash + 8 bits of distance, so 16 bits in total.
Storing just 16 bits of hash is less meaningful, because in a big table they will all be equal for the same position.
(That is another reason why "hashik" is modulo a prime: for the same table position there will be 251 different "hashik" values.)

@alkis
Contributor

alkis commented Oct 24, 2017

@funny-falcon This makes sense 👍

@arthurprs
Contributor

That's not reasonable, it's too easy to trip on that land-mine.

The most reasonable trade-off I've seen thus far is the two-array map (OrderMap), which "fixes" the problem by making the degenerate case so cheap that it can't really be exploited.

@alkis
Contributor

alkis commented Oct 24, 2017

@arthurprs there is a way to "reseed" without requiring support from either BuildHasher or Hasher and placing no extra requirements on the user. Instead of hashing the key like this:

let mut state = hash_state.build_hasher();
t.hash(&mut state);
state.finish()

use the pointer to the buckets as additional bytes (this becomes a naturally occurring seed that changes at all the right places):

let mut state = hash_state.build_hasher();
state.write_usize(table.hash_start as usize);
t.hash(&mut state);
state.finish()

EDIT: The disadvantage to this solution is that if hash(usize) + hash(T) is slower than set_seed(seed) + hash(T) we get a suboptimal implementation. If there was an API in BuildHasher to create a Hasher with a specific seed then this could be done more efficiently across all hashers. This can be done by changing BuildHasher::build_hasher to take an additional argument. I am not sure how feasible this is at this point.
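A runnable sketch of the idea, with an arbitrary integer standing in for `table.hash_start` (nothing here is std's actual implementation; the function name is illustrative):

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

// Mix a table-specific value into the hasher state before hashing the key,
// so every table sees a different effective seed ("naturally occurring seed").
fn hash_in_table<T: Hash>(build: &RandomState, table_addr: usize, key: &T) -> u64 {
    let mut state = build.build_hasher();
    state.write_usize(table_addr);
    key.hash(&mut state);
    state.finish()
}

fn main() {
    let build = RandomState::new();
    // The same key hashes differently under two different table addresses,
    // so iteration order leaked from one table says nothing about another.
    let a = hash_in_table(&build, 0x7000_1000, &42u64);
    let b = hash_in_table(&build, 0x7000_2000, &42u64);
    assert_ne!(a, b);
    // And it stays deterministic for a fixed table.
    assert_eq!(a, hash_in_table(&build, 0x7000_1000, &42u64));
}
```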

@arthurprs
Contributor

arthurprs commented Oct 24, 2017

@alkis neat! That makes it general.

@alkis
Contributor

alkis commented Oct 24, 2017

Any takers to implement this approach and close this bug once and for all? :-)

@arthurprs
Contributor

One has to provide proof that the performance hit isn't huge. I'd not be surprised if this fix yielded a ~500% hit on grow/clone and a 20+% hit on lookup when using SipHash-1-3.

@funny-falcon

There is even no need to calculate a new hash value each time.
(Sorry, I don't know Rust well, so I'll try to copy from above)

// key hash calculated once and stored
let mut state = hash_state.build_hasher();
t.hash(&mut state);
let hashvalue = SafeHash::new(state.finish());
// position calculated dependent on the storage pointer
// (example for 64-bit)
let mut poshash = hashvalue.wrapping_add(table.hash_start as u64);
poshash ^= poshash >> 32;
poshash = poshash.wrapping_mul(0xdeadcafe71febeef); // or some other "well known" odd number (from Murmur, CityHash, etc.)
poshash ^= poshash >> 32;

This is enough to fix the bad case, and there is no need to recalculate SipHash.

@alkis
Contributor

alkis commented Oct 24, 2017

@funny-falcon I do not think this will be very good. The pointer to hash_start has low entropy, and the difference between one hash_start and another can be very small. I think mixing this way will weaken the composite hash function. Let's see how it works without this optimization first and then decide if this tradeoff is worth it. Perhaps some hash function experts can weigh in as well.

@funny-falcon

@alkis, just try it. It will be just enough to fix the bad case. There is no need to be better.

I think mixing this way will weaken the composite hashfunction

No way: hashvalue is calculated with SipHash, and poshash is calculated in a bijective way (i.e. it is invertible).

Multiplying by a "random looking" odd constant and shifting provides really good results. This is a well-known technique (Knuth multiplicative hashing).

Simulation (in C):

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>

uint64_t poshash(uint64_t hv, uint64_t hs) {
	uint64_t ph;
	ph = hv + hs;
	ph ^= ph >> 32;
	ph *= 0xdeadcafe71febeeful;
	ph ^= ph >> 32;
	return ph;
}

int main(int ac, char** av) {
	int i, j;
	uint64_t hv1, hv2, hv3, hashstart;
	uint64_t ph1, ph2, ph3;
	uint64_t masks[3] = {0xffff, 0x7fff, 0x3fff};

	hv1 = 0x123456789abcdef0ul;
	hv2 = 0x98523aed81fcdef0ul;
	hv3 = 0x3fa83924dcbcdef0ul;
	hashstart = 0x00007fe342e50000ul;
	printf("\thv1\thv2\thv3\n");
	for (j=0; j<3; j++) {
		uint64_t mask = masks[j];
		printf("%lx\t%lx\t%lx\t%lx\n",
			mask, hv1 & mask, hv2 & mask, hv3 & mask);
	}
	for (i=0; i<5; i++) {
		uint64_t ph1, ph2, ph3;
		ph1 = poshash(hv1, hashstart);
		ph2 = poshash(hv2, hashstart);
		ph3 = poshash(hv3, hashstart);
		printf("\t----\t----\t----\n");
		for (j=0; j<3; j++) {
			uint64_t mask = masks[j];
			printf("%lx\t%04lx\t%04lx\t%04lx\n",
				mask, ph1 & mask, ph2 & mask, ph3 & mask);
		}
		hashstart += 0x100ul; /* next allocation */
	}
}

results:

mask hv1 hv2 hv3
ffff def0 def0 def0
7fff 5ef0 5ef0 5ef0
3fff 1ef0 1ef0 1ef0
----- ------- ------- ------
ffff 8b86 f7a3 ae2e
7fff 0b86 77a3 2e2e
3fff 0b86 37a3 2e2e
----- ------- ------- ------
ffff fd87 41a6 202d
7fff 7d87 41a6 202d
3fff 3d87 01a6 202d
----- ------- ------- ------
ffff 6f85 d3a5 522c
7fff 6f85 53a5 522c
3fff 2f85 13a5 122c
----- ------- ------- ------
ffff 518c 1ddf c42a
7fff 518c 1ddf 442a
3fff 118c 1ddf 042a
----- ------- ------- ------
ffff c38d afde 7629
7fff 438d 2fde 7629
3fff 038d 2fde 3629

@funny-falcon

funny-falcon commented Oct 24, 2017

https://en.wikipedia.org/wiki/Universal_hashing#Constructions

It is possible to use a more sophisticated "multiply-add-shift":

ph = hashvalue + (table.hash_start as usize);
ph ^= ph >> 32;
ph *= 0xacd5ad43274593b9;
ph += 0x6956ab76ed268a3d;
ph ^= ph >> 32;

But really there is no need for this use case. Just "multiply-shift" is more than enough.

@steveklabnik
Member

Triage: #56241 (comment) and #56241 (comment)

yorickpeterse pushed a commit to inko-lang/inko that referenced this issue Jun 16, 2019
The internals of HashMap, and in particular the hashing logic, were
broken. Internally the VM reused Rust's DefaultHasher type for hashing
values. This type is mutable. When storing a HashMap in a constant,
concurrent access to this HashMap could result in wrong hashes being
produced, as all threads use the same DefaultHasher.

To solve this, Inko takes a similar approach as Rust: we provide a
RandomState type, which can be used to create a DefaultHasher. A
DefaultHasher now takes two keys as arguments, used for seeding the
hasher. The RandomState type generates two keys randomly, similar to
Rust. The hash seeds are generated by taking a thread-local randomly
generated number, then incrementing it (wrapping around on overflow).
This ensures that it is very unlikely for two different HashMaps to use
the same seeds, making certain hash attacks [1] more difficult.

Random number generation is provided by the std::random module. This
module provides methods for randomly generating integers, floats, and
bytes. Integers and floats can also be generated in a given range, for
example:

    import std::random

    random.integer_between(min: 0, max: 10)

[1]: rust-lang/rust#36481 and
     https://internals.rust-lang.org/t/help-harden-hashmap-in-libstd/4138/18
@jonas-schievink jonas-schievink added the A-collections Area: `std::collection` label Aug 19, 2019
@lily-commure
Contributor

lily-commure commented Sep 25, 2019

Now that #58623 is merged, the HashMap implementation uses quadratic probing instead of linear probing with Robin Hood hashing. Is this bug still open?

@hniksic
Contributor

hniksic commented Oct 22, 2022

This issue still exists and affects developers who try out e.g. rustc_hash::FxHashMap for efficiency. Here is a simple way to reproduce it:

use std::{time::Instant, collections::hash_map::DefaultHasher, hash::BuildHasherDefault};
use rand::Rng;

//type TestHashSet<K> = std::collections::HashSet<K>;
type TestHashSet<K> = std::collections::HashSet<K, BuildHasherDefault<DefaultHasher>>;

fn main() {
    let mut rnd = rand::thread_rng();
    let nums: Vec<_> = (0..10_000_000).map(|_| rnd.gen::<u64>()).collect();

    let t0 = Instant::now();
    let h: TestHashSet<u64> = nums.into_iter().collect();
    let t1 = Instant::now();
    println!("create: {}", (t1 - t0).as_secs_f64());

    let mut h2: TestHashSet<u64> = Default::default();
    for &n in &h {
        h2.insert(n);
    }
    let t2 = Instant::now();
    println!("rebuild: {}", (t2 - t1).as_secs_f64());

    println!("{}", h2.len());
}

In this example, using a stable hash makes the rebuilding ~5x slower than the randomized one. The behavior is quadratic, in the sense that doubling the set size will approximately quadruple the time it takes to rebuild it.

@workingjubilee
Member

So using RandomState solves this and it's about the quality of the hash that is used, now? Erm... this sounds like we should close this, then? It's not news that a hash map's quality depends on the map's hash, and we allow people to replace the randomizing hash with a "non-hash" (e.g. simply taking the value of a small-enough integer).

Specifically, a common reason that people override the RandomState hash is precisely to obtain a (comparatively!) deterministic iteration order yet have the HashMap's relative performance, so further "fixes" would likely be more disruptive than helpful.

@ssokolow

ssokolow commented Feb 4, 2023

As phrased, the initial bug report appears to be about defaults.

Exposing HashMap's iteration order can cause O(n²) blowup even in innocent-looking code without the presence of an attacker.

By "using RandomState", do you mean that's now the default or do you mean it's a workaround that people still need to be aware they need?

@workingjubilee
Member

workingjubilee commented Feb 5, 2023

At the risk of being severely pedantic and repeating things everyone already knows:

HashMap is HashMap<K, V, S> where K is the key, V is the value, and S is the parameter for hashing (which somewhat awkwardly is defined as trait BuildHasher). A default std::collections::HashMap<K, V>, including the one created via HashMap::new(), will thus use RandomState for hashing. Confusingly, BuildHasherDefault also implements BuildHasher but is not actually used by default by HashMap, which is why the current example uses it as the alternative. So between landing hashbrown, RandomState being the default, and other QoL improvements, you have to try, deliberately, to walk off the golden path in order to encounter this issue, now.

I am also not noticing the blowup claimed in the current report: it seems to have linear growth, just with enough variance that doubling n can appear to quadruple t, yet multiplying n by 16 doesn't multiply t by 256 - it might multiply it by 32 (and on the next run by 16, or even 8; variance seems to be a factor of about 4). But I haven't run an exhaustive test over different parameter sizes on an otherwise quiet machine to be absolutely sure; honestly, I just eyeballed a few runs and asked because I wanted to sanity-check my initial reaction.
