Bug report
Bug description:
This came up long ago when I was working on hardening the tuple hash, but got dropped on the floor.
It recently came up again in the context of frozendict haahing, which just copied much of the frozenset hashing code:
#149841
The problem is that shuffle_bits() only propagates changes from low-order bits to higher. Any function using only left shifts and multiplies suffers from this.
But hash codes may differ only in high-order bits to begin with. Avalanche is horrid in the absence of a stronger bit shuffler, which also propagates higher-to-lower.
It so happens that hash codes of low-precision floats do suffer from this.
>>> len(set(hash(i/256) for i in range(256))
256
So they're all different, but the variation is only in the topmsot bits:
>>> for i in range(10):
... print(hex(hash(i/256)))
...
0x0
0x20000000000000
0x40000000000000
0x60000000000000
0x80000000000000
0xa0000000000000
0xc0000000000000
0xe0000000000000
0x100000000000000
0x120000000000000
>>>
So frozenset struggles with them. First block of output from the code at the end, which analyzes the hashes of frozensets of all ways of taking 3 elements from those 256 floats (of course it doesn't matter if the same values are spelled as Fraction and/or Decimal instead).
number of possible hash codes: 18_446_744_073_709_551_616
number of hash codes total: 2_763_520
number of unique hash codes: 2_048
number of collisions: 2_761_472
mean expected collisions: 0.00
and its sdev: 0.00
z: +6069500754.14
worst pileups:
2_020
2_015
2_005
2_005
2_002
2_001
1_999
1_993
1_988
1_980
The z score is crazy high, but "pileups" matter much more than that. That's the number of distinct inputs that map to the same hash code. It's the length of collision chains the drive catastrop;ic slowdowns, not really the number of collisions.
The second block of output shows what happens instead if we just xor together the hashes of 1-tuples containing the same values:
number of possible hash codes: 18_446_744_073_709_551_616
number of hash codes total: 2_763_520
number of unique hash codes: 2_753_525
number of collisions: 9_995
mean expected collisions: 0.00
and its sdev: 0.00
z: +21968232.90
worst pileups:
4
4
3
3
3
3
3
3
3
3
The z score is still very high, but max pileup has become quite minor. Move to 4-element subsets and max pileup for frzenset zooms to over 100 thousand, while the xor of 1-tuple hashes peaks at a modest 13.
Tuple hashing doesn't do full-blown xxHash. In the full version, there's a long stretch at the end to work harder at avalanche. The per-element version tuple uses skips that. There's some useful left-to-right propagation via its rotate operation, but for 1-element tuples that's a "one and done" thing. High-order bits can't affect low-order ones further away than the one-tiee rotate count, Its effect in this context is nevertheless dramatic.
So what to do? If the problem is lack of left-to-right propagation, introduce some. Everyone else does ;-)
-
Roll our own. Even one ju8icious right shift could help a lot. Rohate (as in xxHash) even more (rotate propagates in both directions).
-
Look at adopting the tail end of xxHaah.
-
The C++ Boost library's combine_hashes() also propagates in both directions.
-
As does the "tempering" code before the Mersenne Twister delivers a result. The underlying generator is very slow at propagating bit changes across the state space, so they tacked on a "heavy avalanching sequence" to disguise that weakness. But even so it can very famously fall into "zeroland regions" (if most of state space bits happen to become zero, it cah take millions of iterations to break free of that).
As for frozendict, it doesn't appear to need any of this. The current PR uses a much stronger scheme based on copy/pasting two rounds of tuple's xxHash code, one round for the key hash and another to fold in the value hash. That's bound to be stronger than just (as in the attached code) xor'ing the hashes of 1-tuples, and it's almost certainly a complete waste of cycles to go on to call shuffle_bits() now. The propagation it does is weak (and zero of left-to-rght) compared to xxHash's core.
At a higher level, the spread of copy/pasted "mystery code" across modules is concerning. I'm old enough to know where all the bodies are buried, but the next generation isn't. If we have to duplicate code (and typically without enough comments to trace the code back to its original source), let's at least stick it in a new include file? Then there's just one place to look for comprehensive explanations and cautions - and no possibility of getting out of sync by accident. This kind of code is the opposite of "obvious" 😉.
Driving code:
from test import support
from collections import Counter
import math
def fint(n):
return f"{n:_}"
def analyze(hashcounts, nbins=1<<support.NHASHBITS):
print("number of possible hash codes:", fint(nbins))
assert all(hashcounts.values()) # counts of 0 not allowed
print()
numtotal = hashcounts.total()
numseen = len(hashcounts)
ncollision = numtotal - numseen
print(" number of hash codes total:", fint(numtotal))
print("number of unique hash codes:", fint(numseen))
print(" number of collisions:", fint(ncollision))
print()
mean, sdev = support.collision_stats(nbins, numtotal)
print("mean expected collisions:", format(mean, ".2f"))
print(" and its sdev:", format(sdev, ".2f"))
diff = ncollision - mean
z = diff / sdev
print(" z:", format(z, "+.2f"))
print()
print("worst pileups:")
for h, c in hashcounts.most_common(10):
print(" ", fint(c))
N, K = 8, 3
from itertools import combinations
def create(N=N, K=K):
c = Counter()
pow2 = 1 << N
small = tuple(i / pow2 for i in range(pow2))
for t in combinations(small, K):
c[hash(frozenset(t))] += 1
assert c.total() == math.comb(pow2, K)
return c
def create2(N=N, K=K):
c = Counter()
pow2 = 1 << N
small = tuple(i / pow2 for i in range(pow2))
for t in combinations(small, K):
h = 0
for e in t:
h ^= hash(tuple([e]))
c[h] += 1
assert c.total() == math.comb(pow2, K)
return c
analyze(create())
analyze(create2())
CPython versions tested on:
CPython main branch, 3.14, 3.13
Operating systems tested on:
No response
Bug report
Bug description:
This came up long ago when I was working on hardening the tuple hash, but got dropped on the floor.
It recently came up again in the context of frozendict haahing, which just copied much of the frozenset hashing code:
#149841
The problem is that
shuffle_bits()only propagates changes from low-order bits to higher. Any function using only left shifts and multiplies suffers from this.But hash codes may differ only in high-order bits to begin with. Avalanche is horrid in the absence of a stronger bit shuffler, which also propagates higher-to-lower.
It so happens that hash codes of low-precision floats do suffer from this.
So they're all different, but the variation is only in the topmsot bits:
So frozenset struggles with them. First block of output from the code at the end, which analyzes the hashes of frozensets of all ways of taking 3 elements from those 256 floats (of course it doesn't matter if the same values are spelled as Fraction and/or Decimal instead).
The z score is crazy high, but "pileups" matter much more than that. That's the number of distinct inputs that map to the same hash code. It's the length of collision chains the drive catastrop;ic slowdowns, not really the number of collisions.
The second block of output shows what happens instead if we just xor together the hashes of 1-tuples containing the same values:
The z score is still very high, but max pileup has become quite minor. Move to 4-element subsets and max pileup for frzenset zooms to over 100 thousand, while the xor of 1-tuple hashes peaks at a modest 13.
Tuple hashing doesn't do full-blown xxHash. In the full version, there's a long stretch at the end to work harder at avalanche. The per-element version tuple uses skips that. There's some useful left-to-right propagation via its rotate operation, but for 1-element tuples that's a "one and done" thing. High-order bits can't affect low-order ones further away than the one-tiee rotate count, Its effect in this context is nevertheless dramatic.
So what to do? If the problem is lack of left-to-right propagation, introduce some. Everyone else does ;-)
Roll our own. Even one ju8icious right shift could help a lot. Rohate (as in xxHash) even more (rotate propagates in both directions).
Look at adopting the tail end of xxHaah.
The C++ Boost library's
combine_hashes()also propagates in both directions.As does the "tempering" code before the Mersenne Twister delivers a result. The underlying generator is very slow at propagating bit changes across the state space, so they tacked on a "heavy avalanching sequence" to disguise that weakness. But even so it can very famously fall into "zeroland regions" (if most of state space bits happen to become zero, it cah take millions of iterations to break free of that).
As for frozendict, it doesn't appear to need any of this. The current PR uses a much stronger scheme based on copy/pasting two rounds of tuple's xxHash code, one round for the key hash and another to fold in the value hash. That's bound to be stronger than just (as in the attached code) xor'ing the hashes of 1-tuples, and it's almost certainly a complete waste of cycles to go on to call
shuffle_bits()now. The propagation it does is weak (and zero of left-to-rght) compared to xxHash's core.At a higher level, the spread of copy/pasted "mystery code" across modules is concerning. I'm old enough to know where all the bodies are buried, but the next generation isn't. If we have to duplicate code (and typically without enough comments to trace the code back to its original source), let's at least stick it in a new include file? Then there's just one place to look for comprehensive explanations and cautions - and no possibility of getting out of sync by accident. This kind of code is the opposite of "obvious" 😉.
Driving code:
CPython versions tested on:
CPython main branch, 3.14, 3.13
Operating systems tested on:
No response