Moving to SipHash-1-3 #73596
Rust and Ruby moved from SipHash-2-4 to SipHash-1-3. rust-lang/rust#29754 SipHash-1-3 seems faster than 2-4 and secure enough. |
+1 The discussions on the Rust and Ruby lists seem persuasive. |
I'm running pyperformance. BTW, hashlib doesn't provide siphash24. So can we simply replace siphash24 with siphash13, like Ruby?
PEP-456 defines an API to add more hashing algorithms and makes the selection of the hash algorithm a compile-time option. We can easily add SipHash-1-3 and make it the default algorithm. Vendors can then select between FNV2, SipHash-1-3 and SipHash-2-4. On another note: should we add the SipHash-2-4 and 1-3 PRFs to the hashlib module?
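To make the 1-3 vs. 2-4 difference concrete: SipHash applies c compression rounds per 8-byte message block and d finalization rounds, so SipHash-2-4 means c=2, d=4 and SipHash-1-3 means c=1, d=3. Below is a minimal pure-Python sketch for illustration only; CPython's real implementation is the optimized C code in pyhash.c.

```python
def _rotl(x, b):
    """Rotate a 64-bit value left by b bits."""
    return ((x << b) | (x >> (64 - b))) & 0xFFFFFFFFFFFFFFFF

def siphash(data, k0, k1, c=1, d=3):
    """SipHash-c-d of `data` (bytes) under a 128-bit key (k0, k1).

    Defaults to SipHash-1-3; pass c=2, d=4 for SipHash-2-4.
    """
    mask = 0xFFFFFFFFFFFFFFFF
    v0 = k0 ^ 0x736F6D6570736575
    v1 = k1 ^ 0x646F72616E646F6D
    v2 = k0 ^ 0x6C7967656E657261
    v3 = k1 ^ 0x7465646279746573

    def sipround(n):
        nonlocal v0, v1, v2, v3
        for _ in range(n):
            v0 = (v0 + v1) & mask; v1 = _rotl(v1, 13); v1 ^= v0; v0 = _rotl(v0, 32)
            v2 = (v2 + v3) & mask; v3 = _rotl(v3, 16); v3 ^= v2
            v0 = (v0 + v3) & mask; v3 = _rotl(v3, 21); v3 ^= v0
            v2 = (v2 + v1) & mask; v1 = _rotl(v1, 17); v1 ^= v2; v2 = _rotl(v2, 32)

    # Compress the message as 8-byte little-endian words, c rounds each.
    full = len(data) // 8 * 8
    for i in range(0, full, 8):
        m = int.from_bytes(data[i:i + 8], "little")
        v3 ^= m
        sipround(c)
        v0 ^= m

    # Last block: remaining bytes, plus the message length in the top byte.
    b = (len(data) & 0xFF) << 56 | int.from_bytes(data[full:], "little")
    v3 ^= b
    sipround(c)
    v0 ^= b

    # Finalization: d rounds.
    v2 ^= 0xFF
    sipround(d)
    return v0 ^ v1 ^ v2 ^ v3

key0, key1 = 0x0706050403020100, 0x0F0E0D0C0B0A0908  # key bytes 00..0f
print(hex(siphash(b"hello world", key0, key1)))        # SipHash-1-3
print(hex(siphash(b"hello world", key0, key1, 2, 4)))  # SipHash-2-4
```

The only difference between the two variants is the round counts, which is why 1-3 does roughly half the mixing work per message block.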
$ ../python.default -m perf compare_to default.json patched4.json -G
Slower (2):
- scimark_sor: 479 ms +- 8 ms -> 485 ms +- 9 ms: 1.01x slower (+1%)
- genshi_xml: 196 ms +- 2 ms -> 196 ms +- 2 ms: 1.00x slower (+0%)
Faster (19):
Benchmark hidden because not significant (43)

I'm creating siphash13 from the reference implementation to double-check the output.
OK, I'll add siphash13 instead of replacing siphash24.
The SipHash implementation operates on 64-bit units.
I added SipHash13 as additional hash algorithm in https://github.com/tiran/cpython/tree/siphash13 . Still need to verify the finalizer. For hashlib I'd need to move to a different implementation of SipHash. The implementation in pyhash.c is optimized for speed and has a fixed key. |
On 01.02.2017 10:14, Christian Heimes wrote:
+1 on adding the 1-3 and making it the default; the faster [...] Reading up a bit on the Rust thread and looking at this benchmark, it seems as if it would make sense to not use a fixed [...] Is there a good hash algorithm which provides better [...]
+1 as well. |
There is an undocumented option "Py_HASH_CUTOFF" to use DJBX33A for short strings. See https://github.com/python/cpython/blob/master/Include/pyhash.h#L98
The performance of the hash algorithm shouldn't affect general benchmarks, since the hash value is cached inside the string object. Almost all dict lookups in critical parts are lookups with interned strings. But in corner cases the difference can be measurable. We should look not at the results of macrobenchmarks, but find the worst cases. There is also an alignment issue with the implementation of SipHash-2-4 (and I suppose with SipHash-1-3). See bpo-28055.
On 01.02.2017 10:50, INADA Naoki wrote:
Thanks. I found a reference in the PEP-456: """The state of small string optimization will be assessed during the beta [...]""" The mentioned speedup sounds like enabling this by default [...] Aside: I haven't been following this since the days I proposed the [...]
On 01.02.2017 11:07, Serhiy Storchaka wrote:
Good point. |
The current Py_HASH_CUTOFF implementation seems weak.
The HashSecret is only applied at the end, so a pair that collides without the secret also collides with any secret.
As a result, 8^(Py_HASH_CUTOFF-1) colliding sequences can be built. Maybe we should remove Py_HASH_CUTOFF completely?
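The structural weakness described above can be demonstrated with a toy DJBX33A. This is a sketch of the general scheme, not CPython's exact code; the point is only that a secret XORed in after the loop cannot break collisions created inside it.

```python
MASK = 0xFFFFFFFFFFFFFFFF

def djbx33a(data, suffix=0):
    """Toy DJBX33A (h = h*33 + byte) with a secret XORed in at the end.

    `suffix` stands in for the HashSecret; because it is applied only
    after the loop, any collision produced by the loop survives every
    possible secret.
    """
    h = 5381
    for byte in data:
        h = (h * 33 + byte) & MASK
    return h ^ suffix

# "Ez" and "FY" collide under times-33-plus-add ('F' is 'E'+1 and
# 'Y' is 'z'-33), and so does every concatenation of such colliding
# blocks -- regardless of the secret.
for secret in (0, 0xDEADBEEF):
    assert djbx33a(b"EzEzEz", secret) == djbx33a(b"FYEzFY", secret)
print("collisions are independent of the secret")
```

Since each colliding block can be chosen independently, the number of colliding inputs grows exponentially with the length, matching the 8^(Py_HASH_CUTOFF-1) estimate above.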
Py_HASH_CUTOFF was an experimental feature for low-performance platforms. On modern hardware the performance increase is marginal to non-existing. I planned to deprecate the feature in 3.6 but forgot. We can deprecate it now and remove it in either 3.7 or 3.8. |
On 01.02.2017 13:03, INADA Naoki wrote:
I think we ought to look for a better hash algorithm [...] Some interesting investigations on this: [...] PS: It may also be wise not to use the hash randomization [...]
We can and should still use hash randomization for short strings, but a different key than for hash randomization for longer strings. |
Py_HASH_CUTOFF already uses a different secret from SipHash. An attacker can produce a large number of collisions without knowing the secret. BTW, we have FNV already. Let's survey FNV-1 short-string collisions.
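For reference, textbook 64-bit FNV-1 (multiply by the FNV prime, then XOR in each byte) is small enough to state in full. Note this is the standard algorithm, not CPython's modified fnv variant in pyhash.c.

```python
FNV64_PRIME = 0x100000001B3
FNV64_OFFSET = 0xCBF29CE484222325  # FNV-1 64-bit offset basis

def fnv1_64(data):
    """Textbook 64-bit FNV-1: multiply first, then XOR each byte."""
    h = FNV64_OFFSET
    for byte in data:
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
        h ^= byte
    return h

print(hex(fnv1_64(b"key")))
```

Because FNV is unkeyed and each byte only perturbs the state in a simple, invertible way, collision search for short inputs is far easier than against a keyed PRF like SipHash, which is the concern raised here.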
https://emboss.github.io/blog/2012/12/14/breaking-murmur-hash-flooding-dos-reloaded/ Murmur may be safe for <16 bytes too.
Unless somebody is able to show a real-world example with a considerable performance boost (>5%), I'm strictly against any kind of special-casing for short strings. I don't want to trade security for performance. You might be able to persuade me to go for a more complex system if you can prove both security and a performance boost.
OK, let's forget Py_HASH_CUTOFF for now and focus on SipHash-13. |
Christian Heimes: Could you create pull request? https://github.com/tiran/cpython/tree/siphash13 is 404 now. |
The old repo has a different name: https://github.com/tiran/cpython-mirror I'm busy with work at the moment. I'll get to the patch later.
This issue is an optimization, but I still don't see any significant speedup :-) Would it be possible to write at least one microbenchmark showing a speedup? Maybe hash(<very long byte string>)? |
And we need microbenchmarks for different sizes, because the result likely depends on whether the data fits into the different cache levels.
microbench result: https://gist.github.com/methane/33c7b01c45ce23b67246f5ddaff9c9e7 Not significant ~ very little difference for small data. |
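For anyone who wants to reproduce this kind of microbenchmark, here is a minimal stdlib timeit version. The `s + 'a'` trick (also used later in the thread) defeats the per-object hash cache, so the hash is really recomputed on every call; the sizes and iteration count are arbitrary choices for illustration.

```python
import timeit

for size in (16, 256, 4096, 65536):
    s = "x" * size
    # Concatenation creates a fresh string object each call, so hash()
    # cannot return a cached value and must actually run the hash function.
    t = timeit.timeit("hash(s + 'a')", globals={"s": s}, number=100_000)
    print(f"{size:>6} chars: {t * 1e9 / 100_000:8.1f} ns/hash")
```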
Worth revisiting? |
I am not sure this is worth doing. Microbenchmarks:

## import time

0.6% faster.

## hash(s+'a')
Yes, this is worth doing, IMO. It adds no more code and probably reduces maintenance costs as any improvements/bug-fixes to the rust/ruby versions can be easily ported. Even if the benefit is small, the cost is basically zero. |
"1.67 us +- 0.03 us: 1.78x faster" with a bytes string of 6k bytes sounds worth it to me.

When we talk about "security" here, we are talking about a denial of service attack on the dict worst-case performance. I know that it's not a popular opinion, but I don't think that this denial of service (DoS) is important. IMO there are enough other ways to crash a server. Moreover, the initial attack vector was an HTTP request with tons of header lines. In the meanwhile, the Python http module was modified to put arbitrary limits on the number of HTTP headers and the maximum length of a single HTTP header. It's nice to limit the risk of a DoS, but I don't think that we should go too far.

If it worked for Rust and Ruby, SipHash-1-3 should be good as well for Python. I expect an even more interesting speedup with bytes strings longer than 6k bytes. And I'm quite sure that it's common for people to manipulate long strings in Python :-)

I retarget this change to Python 3.11. Please don't backport it since it changes the Python build system (configure options).
I support this change, too. SipHash-1-3 is more than good enough for our use case. It provides sufficient diffusion of str and bytes hashes and sufficient resilience against timing attacks on the PRF key. Also, 64-bit platforms are less affected by hash collision DoS attacks anyway. The issue was far greater on 32-bit systems. Nowadays virtually all production systems are 64-bit.
I recommend that you add SipHash-1-3 as an additional algorithm and make it the default. The removal of --with-hash-algorithm=siphash24 should go through regular deprecation cycle of two Python versions. |
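The algorithm selected at build time is visible at runtime through sys.hash_info, so code can check which PRF a given interpreter was compiled with (siphash13 became the default in CPython 3.11; cutoff reflects the Py_HASH_CUTOFF setting discussed above):

```python
import sys

info = sys.hash_info
print(info.algorithm)   # "siphash13", "siphash24" or "fnv"
print(info.hash_bits)   # internal output size of the hash algorithm
print(info.seed_bits)   # size of the randomized seed (PRF key)
print(info.cutoff)      # Py_HASH_CUTOFF small-string limit (0 = disabled)
```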
Hash DoS is not only about HTTP headers. Anywhere a dict is created from an untrusted source can be an attack vector.
Since the days this was discussed, a lot of new and faster hash algorithms have been developed. It may be worthwhile looking at those instead. E.g. xxHash is a lot more performant than siphash: https://github.com/Cyan4973/xxHash (the link also has a comparison of hash algorithms) |
I am not sure it's worth it. Adding an algorithm increases the maintenance cost... Will it really make someone happy?
On 07.10.2021 11:49, Inada Naoki wrote:
That's certainly true, but at the same time, just focusing on string [...] I wouldn't focus too much on this at the Python core level. IMO, it's much better to use application- and use-case-specific methods [...]
Marc-Andre, Victor, your postings are off-topic. Please move your discussion to a new BPO. |
siphash13 is a short function. I wrote and designed PEP-456 so we can easily add new hashing functions. IMHO the maintenance cost is minimal. |
On 07.10.2021 12:16, Christian Heimes wrote:
I don't quite follow. Why is it fine that you discuss DoS, but it's not [...] Your statement comes across as an attempt to shut down a possibly [...] We could open a new issue to discuss faster alternatives to siphash [...]
But this BPO is not about discussing mitigations against DoS attacks in general. It's about adding SipHash-1-3 and following the example of Rust and Ruby. If you'd like to discuss DoS attacks on hashing of numeric types or other mitigations, then please do so in a dedicated ticket. I'd like to keep this BPO focused on a single topic.
I have contacted JP Aumasson to get his feedback on this proposal. |
BTW: We already use (a slight variant of) xxHash for tuples: https://bugs.python.org/issue34751 The issues is an interesting read, in particular on how xxHash was eventually chosen, with a whole set of other hash algorithms in between. |
On 07.10.2021 12:48, Christian Heimes wrote:
The point that both Victor and I wanted to make is that we have [...] The motivation for moving to siphash 1-3 is performance, and we can [...] This broadens the discussion, yes, but that can easily be addressed [...] Since siphash is a crypto hash function, whereas xxhash (and other [...] With non-crypto hash algorithms available which exhibit good [...] More details on xxhash collision stats: [...]
SipHash is a cryptographically secure PRF, not a cryptographic hashing algorithm, https://www.python.org/dev/peps/pep-0456/#siphash I'm strongly against using a different algorithm when the rest of the world is using either SipHash-2-4 or SipHash-1-3. A minuscule performance gain is not worth the risk. If users like to use an alternative algorithm, then they are free to do so. PEP-456 gives them a hook point to override the algorithm on build time of CPython. |
JP got back to me. On 07/10/2021 14.34, Jean-Philippe Aumasson wrote:
This information disqualifies xxHash for our use case. |
On 07.10.2021 15:29, Christian Heimes wrote:
The quoted issue was for an early version of XXH3. Please see [...] The numbers are close to the expected case, meaning that [...] Looking at this older comparison, it would also make sense to [...] https://cglab.ca/~abeinges/blah/hash-rs/ Given that dictionaries often use relatively short string keys, [...] It would also have an effect on Python startup time, since all [...]
I suggest that you write a new PEP that discusses possible improvement for str and bytes hashing. This BPO is just about SipHash-1-3. As the author and implementer of PEP-456, I only approve SipHash-1-3 as new default algorithm. All other change proposals need at least separate BPOs, preferable a PEP. I don't want to have this simple feature proposal turned into a major design epic. |
Do Rust and Ruby cache the hash of the string in the string object? If they do not, then hashing speed is more important to them.
Victor:
But do they use them as dict keys? AFAIK strings aren't hashed until hash() is called on them. |
That's correct. The hash of str and bytes is calculated on demand and then cached. Frozensets also cache their hash values while tuples don't have a cache. We ran experiments with hash caching in tuples many years ago. It turned out that the increased size had an overall negative effect on performance. This may have changed with modern hardware with more RAM and much larger CPU caches, though. |
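The on-demand caching for str is easy to observe: the first hash() call on a large string takes time proportional to its length, while subsequent calls return the stored value. The timings below are illustrative and machine-dependent.

```python
import time

s = "x" * 10_000_000  # large enough that the first hash is measurably slow

t0 = time.perf_counter()
h1 = hash(s)           # first call: the hash is actually computed
t1 = time.perf_counter()
h2 = hash(s)           # second call: the cached value is returned
t2 = time.perf_counter()

assert h1 == h2
print(f"first hash():  {(t1 - t0) * 1e6:10.1f} us")
print(f"cached hash(): {(t2 - t1) * 1e6:10.1f} us")
```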