About the security claims in README #17

rikusalminen · 2016-09-04T12:02:51Z

I'm not convinced that the claims about resistance to "hash flooding" algorithmic complexity DDoS attacks in the SECURITY chapter of README.md are correct. I'll address some of the claims.

Such an attack avoidance cannot not be the problem of the hash function, but the hash table collision resolution scheme.

The hash table collision resolution scheme does affect the performance but does not address the O(n) issues with hash collisions. A good resolution scheme can only help when (hash % tablesize) is equal between two hashes, but the hashes are not equal.

If two keys in the hash table give bit-to-bit identical hashes, the hash table implementation has to use strcmp() or similar to compare the entire key byte-by-byte. This will result in O(n^2) work with regards to the size of the input keys. This is bad when hashing HTTP request headers or other untrusted data.

This issue is made worse in the very simple hash functions because it's trivial to create hash collisions by appending a few bytes to the end of a key and strcmp() can't early exit before scanning the whole input.

You can attack every single hash function, even the best and most secure if you detect the seed

Indeed, the motivation for SipHash is to make recovering the key from timing information (or other side channels) difficult. It's the only hash function that I could find with any credible, peer-reviewed cryptanalysis on key recover.

Too bad that SipHash is pretty slow (2000 MB/s on my Skylake, the SSE2, SSSE3 and AVX2 implementations were slower) due to poor instruction level parallelism caused by explicit dependencies between instructions (in particular when implemented with SIMD). There are hash functions in smhasher that perform 5x to 15x faster than that.

If anyone is aware of any better performing hash functions that have been designed and analyzed with key recovery difficulty in mind, please do share.

Please discuss.

Bulat-Ziganshin · 2016-09-04T12:16:03Z

Yo can use any MAC function, in particular UMAC and VMAC are very fast on large inputs. Also look at HighwayHash. They claim MAC properties, but dont published any analysis yet.

rurban · 2016-09-04T20:45:04Z

The hash table collision resolution scheme does affect the performance but does not address the O(n) issues with hash collisions. A good resolution scheme can only help when (hash % tablesize) is equal between two hashes, but the hashes are not equal.

No. In the simplest case a non-linear search, such as logarithmic, avoids O(n), but even normal linear open addressing schemes, such as Robin Hood were recently proved to be constant in the worst case. The hash function or the table size has not much to do with that (besides that it should not be degenerated).

This issue is made worse in the very simple hash functions because it's trivial to create hash collisions by appending a few bytes to the end of a key and strcmp() can't early exit before scanning the whole input.

That's why the tests are here to find such degenerated hash functions. I.e. perl5 used such a function until last year.

Indeed, the motivation for SipHash is to make recovering the key from timing information (or other side channels) difficult. It's the only hash function that I could find with any credible, peer-reviewed cryptanalysis on key recover.

If the attacker wants to do a DDOS hash table there are much easier ways to perform that than finding the seed and generating collisions. Crypto analysis on a 32bit hash function for seed recovery is not needed because you can trivially brute force it or solve it with a simple solver. If Siphash or AES or SHA3 truncated to 32bit or any other. This is pure smoke & mirrors.

There are some proven secure hash tables schemes which cannot degenerate to O(n) in the worst case, but they are only UHASH, Robin-Hood and some slow ones, like Cuckoo or ordered tables.
Plus there are some anti-DDOS schemes, such as using random seeds, randomized iterators or adding sleep with many collisions (=attacks). But here we are only talking about hash function security, and those silly and harmful claims need to be debunked.

Bulat-Ziganshin · 2016-09-04T20:51:59Z

If the attacker wants to do a DDOS hash table there are much easier ways to perform that than finding the seed and generating collisions. Crypto analysis on a 32bit hash function for seed recovery is not needed because you can trivially brute force it or solve it with a simple solver.

Can you demonstrate that with SipHash? I.E. generation of collisions without knowing the seed faster than one success per 2^32 attempts

rurban · 2016-09-04T21:11:18Z

No, but people demonstrates that with sha1 and sha2 via z3 or smtlib, not even using faster and easier methods. So symbolizing and solving siphash is considered trivial compared to SHA2 or md5, with a bit of data and known code. I brute forced 32bit hashes in 4-5 seconds.

Bulat-Ziganshin · 2016-09-04T21:15:01Z

Can you please point us to the text demonstrating how to produce sha2-32 collisions faster than with brute force? I'm sure that you misundertood something since there are no known attacks at sha2 yet.

rurban · 2016-09-05T09:31:41Z

With some state known, i.e. the collisions or the traversal order it's easy. E.g. https://jheusser.github.io/2013/02/03/satcoin.html for SHA-256 with several solvers, and also see the links in the References section. Esp. with cbmc. Formerly people encoded it into smtlib or z3, which was quite some work.

You don't attack the hash function per se, you attack the hash table state to get the seed and then produce the collisions brute force. With a small table size you get a lot of easy to find collisions. Only ~10 rightmost bits needed.

rurban · 2016-09-05T09:39:30Z

If two keys in the hash table give bit-to-bit identical hashes, the hash table implementation has to use strcmp() or similar to compare the entire key byte-by-byte. This will result in O(n^2) work with regards to the size of the input keys. This is bad when hashing HTTP request headers or other untrusted data.

You can never trust the 32bit hash, you always have to compare the length and full keys with strcmp or better memcmp. Trusting a 32bit hash or anything <256 would be foolish.

And yes, this is bad. That's why fast HTTP parsers use better tricks than hash lookups. They usually look at the first 32 or 64bit word and cmp that in one instruction.
HTTP headers can be handled by perfect hashes much better.

rikusalminen · 2016-09-05T17:06:23Z

Thanks for the comments.

You can never trust the 32bit hash, you always have to compare the length and full keys with strcmp or better memcmp.

This is indeed the case and why hash tables with non-random keys or flaky characteristics should not be used on untrusted data.

And it's the reason why Perl, Python, Ruby etc changed from DJB33*, FNV1 and other easy to attack hashes to SipHash. In your opinion, they could have picked any hash function (e.g. Murmur3) with a random seed?

The attack on hash tables was based on easily being able to generate collisions, causing the http header hash table to hog the CPU memcmp'ing the key data while using very little network bandwidth.

You don't attack the hash function per se, you attack the hash table state to get the seed and then produce the collisions brute force.

Isn't this the reason why difficulty of key recovery is a characteristic of a good, secure hash function?

Crypto analysis on a 32bit hash function for seed recovery is not needed because you can trivially brute force it or solve it with a simple solver.

You imply that generating collisions is trivial, and not dependent on the hash function. Can you link to an attack demonstrating this?

The satcoin example was intriguing, are you aware of this being applied to hash table DDoSing?

But here we are only talking about hash function security, and those silly and harmful claims need to be debunked.

I guess this was why I opened this issue. The readme makes claims that counter research papers on the subject but don't cite any sources or give any background. You seem to be convinced that hash function security is a non-issue but could you back this up so it would be convincing to others too?

You seem to have a wealth of knowledge on the topic, I would appreciate if you'd augment the readme with some references on the topic.

Thanks.

rurban · 2016-09-05T20:44:09Z

This is indeed the case and why hash tables with non-random keys or flaky characteristics should not be used on untrusted data.

No, with a proper hash table you can guarantee constant lookup even for the worst case. But the hash function has almost nothing to do with that. (besides uhash, which uses two random simple mult hash functions, simpler than FNV1)

And it's the reason why Perl, Python, Ruby etc changed from DJB33*, FNV1 and other easy to attack hashes to SipHash. In your opinion, they could have picked any hash function (e.g. Murmur3) with a random seed?

A random seed helps, but was not enough. SipHash does not help at all. It doesn't help security, it only makes it slower. The best is to forget about linear chaining. This is the real culprit.
Murmur is also too big for hash tables.
The argument for linear chaining is faster splits (reallocation, double size), but there are still a lot of unneeded branches in this previously "fast" split code. It's easier to reuse the already calculated hash and just insert afresh.

The attack on hash tables was based on easily being able to generate collisions, causing the http header hash table to hog the CPU memcmp'ing the key data while using very little network bandwidth.

Yes, I know the attacks. There are many similar scenarios. glibc, the linux kernel, most routers, dns resolvers, all use proper hash tables, which are mostly fast. To counter a DDOS just detect it, like collisions & (collisions % 16) and sleep a while. Or limit the attack vector as PHP did (MAX_POST_SIZE). Or randomize the iterator as Perl did to hide the seed (not a good idea, but still). collisions>8 don't appear in the wild, unless you are using linear probing and a higher fill factor. Where the collisions are all in the cache and so it doesn't cost at all.
Or use a secure and fast scheme. As explained.

You don't attack the hash function per se, you attack the hash table state to get the seed and then produce the collisions brute force.

Isn't this the reason why difficulty of key recovery is a characteristic of a good, secure hash function?

A hash function cannot be secure by default. This is not possible with max 32bit, where only 7-10 bits are used to attack it.

Crypto analysis on a 32bit hash function for seed recovery is not needed because you can trivially brute force it or solve it with a simple solver.

You imply that generating collisions is trivial, and not dependent on the hash function. Can you link to an attack demonstrating this?

Nope. I don't have the time to write such a thing, academia or script kiddies will do sooner or later.
I work on the other side, fixing hash tables which are resistant against worst cases, without doing too much harm to the performance.

The satcoin example was intriguing, are you aware of this being applied to hash table DDoSing?

No. Just timing attacks and using solvers to create hash function collisions, currently MD5.
DDOS is just a kids attack tool. We should not really care about those kind of DDOS attacks in the wild, as it is much easier to DDOS someone with simpler methods. It's a more a marketing argument.

But here we are only talking about hash function security, and those silly and harmful claims need to be debunked.

I guess this was why I opened this issue. The readme makes claims that counter research papers on the subject but don't cite any sources or give any background. You seem to be convinced that hash function security is a non-issue but could you back this up so it would be convincing to others too?

It's pure logic. 32bit hash functions cannot be called secure. They can only be broken or not, or fast or not. Siphash has a prominent name behind it, but this still doesn't defy common sense. And it does harm to all those dynamic languages using it.

There are also much more silly theories in hash tables in the wild, such as using primes for the size for better module distribution, esp. with open addressing. As in glibc and in the linux kernel.
2^n exponential size is much better suited, as you can bitmask the hash index. A proper distribution just needs an uneven size (2^n - 1), and the numbers (the size and the hash internal constant) must be coprime. Which is trivial to guarantee.

rurban closed this as completed Sep 21, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

About the security claims in README #17

About the security claims in README #17

rikusalminen commented Sep 4, 2016 •

edited

Bulat-Ziganshin commented Sep 4, 2016

rurban commented Sep 4, 2016 •

edited

Bulat-Ziganshin commented Sep 4, 2016 •

edited

rurban commented Sep 4, 2016 •

edited

Bulat-Ziganshin commented Sep 4, 2016

rurban commented Sep 5, 2016 •

edited

rurban commented Sep 5, 2016

rikusalminen commented Sep 5, 2016 •

edited

rurban commented Sep 5, 2016

About the security claims in README #17

About the security claims in README #17

Comments

rikusalminen commented Sep 4, 2016 • edited

Bulat-Ziganshin commented Sep 4, 2016

rurban commented Sep 4, 2016 • edited

Bulat-Ziganshin commented Sep 4, 2016 • edited

rurban commented Sep 4, 2016 • edited

Bulat-Ziganshin commented Sep 4, 2016

rurban commented Sep 5, 2016 • edited

rurban commented Sep 5, 2016

rikusalminen commented Sep 5, 2016 • edited

rurban commented Sep 5, 2016

rikusalminen commented Sep 4, 2016 •

edited

rurban commented Sep 4, 2016 •

edited

Bulat-Ziganshin commented Sep 4, 2016 •

edited

rurban commented Sep 4, 2016 •

edited

rurban commented Sep 5, 2016 •

edited

rikusalminen commented Sep 5, 2016 •

edited