-
-
Notifications
You must be signed in to change notification settings - Fork 30k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Weakness in tuple hash #40190
Comments
I've encountered some performance problems when The problem is as follows: for hashable values X and Y, The hash algorithm for tuples starts with a seed value, # assume 'my_mul' would multiply two numbers and
return the low 32 bits
value = seed
for item in tpl:
value = my_mul(const, value) ^ hash(item)
value = value ^ len(tpl) The tuple (X, Y) therefore has hash value:
hash(Y) ^ 2 ...and the tuple (X, (X, Y)) has hash value:
hash((X, Y)) ^ 2 The outer multiplication is repeated, and is XORed with I realise that choosing a hash function is a difficult For my own application I'm fortunate in that I can |
Logged In: YES I'll repeat the tuple hashing algorithm in a fixed-width # assume 'my_mul' would multiply two numbers and
return the low 32 bits
value = seed
for item in tpl:
value = my_mul(const, value) ^ hash(item)
value = value ^ len(tpl) |
Logged In: YES Any hash will have some collisions. If there is going to be The obvious alternatives are a more complicated hash (slows I would also expect your workaround of data rearrangement to |
Logged In: YES The OP was not referring to "some collisions"; his app Also, I am concerned about the tuple hash using the same |
Logged In: YES Wouldn't that just shift the pathological case from (x, (x, y)) |
Logged In: YES I suggest leaving the function as is, but replacing the True, there will always be the possibilities of collisions, |
Logged In: YES Note that in this case, the problem was because of nested |
Logged In: YES Hmm, you are correct. This is appears to be The correct solution is to change: value = my_mul(const, value) ^ hash(item) in Steve's pseudo-code to: value = my_mul(const, value ^ hash(item)) Of course, you still get a lot more robustness |
Logged In: YES Noting that
and
hash to the same thing today too (module -1 endcases). For >>> hash((3, (4, 8))) == hash((4, (3, 8)))
True
>>> |
Logged In: YES I can't make more time for this, so unassigned myself. This seems to make a good set of test inputs: N = whatever
base = range(N)
xp = [(i, j) for i in base for j in base]
inps = base + [(i, j) for i in base for j in xp] + \
[(i, j) for i in xp for j in base] When N == 50, inps contains 250,050 distinct elements, and Here's a hash function that yields no collisions across that def hx(x):
if isinstance(x, tuple):
result = 0x345678L
mult = 1000003
for elt in x:
y = hx(elt)
result = (((result + y) & 0xffffffffL) * mult) &
0xffffffffL
mult = (mult * 69069) & 0xffffffffL
result ^= len(x)
if result == -1:
result = -2
else:
result = hash(x)
return result In C, none of the "& 0xffffffffL" clauses are needed; in Python This does add a multiply per loop trip over what we have |
Logged In: YES Tim, your test fixture is very nice. I verified your result that there are no collisions The probability of no collisions is near zero for I ran my algorithm on your fixture set using the It is not true that I repeat multipliers (very often) - |
Logged In: YES Here is the Python code for the hash function def hx(x):
if isinstance(x, tuple):
result = 0x345678L
for n, elt in enumerate(x):
y = (tbl[n & 0x3f] + (n >> 6)) & 0xffffffffL
result = (y * (result ^ hx(elt))) & 0xffffffffL
result ^= len(x)
else:
result = hash(x)
return result
# 64 constants taken from MD5
# (see Modules/md5c.c in the Python sources)
tbl = (
-0x28955b88, -0x173848aa, 0x242070db, -0x3e423112,
-0xa83f051, 0x4787c62a, -0x57cfb9ed, -0x2b96aff,
0x698098d8, -0x74bb0851, -0xa44f, -0x76a32842,
0x6b901122, -0x2678e6d, -0x5986bc72, 0x49b40821,
-0x9e1da9e, -0x3fbf4cc0, 0x265e5a51, -0x16493856,
-0x29d0efa3, 0x2441453, -0x275e197f, -0x182c0438,
0x21e1cde6, -0x3cc8f82a, -0xb2af279, 0x455a14ed,
-0x561c16fb, -0x3105c08, 0x676f02d9, -0x72d5b376,
-0x5c6be, -0x788e097f, 0x6d9d6122, -0x21ac7f4,
-0x5b4115bc, 0x4bdecfa9, -0x944b4a0, -0x41404390,
0x289b7ec6, -0x155ed806, -0x2b10cf7b, 0x4881d05,
-0x262b2fc7, -0x1924661b, 0x1fa27cf8, -0x3b53a99b,
-0xbd6ddbc, 0x432aff97, -0x546bdc59, -0x36c5fc7,
0x655b59c3, -0x70f3336e, -0x100b83, -0x7a7ba22f,
0x6fa87e4f, -0x1d31920, -0x5cfebcec, 0x4e0811a1,
-0x8ac817e, -0x42c50dcb, 0x2ad7d2bb, -0x14792c6f
) |
Logged In: YES Nothing wrong with being "better than random" -- it's not the There's nothing in the structure of my example function About speed, the only answer is to time both, and across About the table of ints, it's always a dubious idea to multiply |
Logged In: YES Tim:
The simple change suggested by ygale (put the xor before the With this change it seems more difficult to find examples of |
Logged In: YES If 250,050 balls are tossed into 2**32 bins at random, the That said, I certainly agree that would be a major, easy, and |
Logged In: YES I uploaded my current diff and some unit tests Please look over the unit tests carefully. I am Note that besides the addition of the table containing Based on the results of the unit tests, I had to First of all, Tim is correct: constants that are divisible I ran into a problem with empty tuples nested in tuples. Finally, I moved the xor with the tuple length from |
Logged In: YES Re Tim's concern about cache hits: The trade-off is a slight risk of slight performance Anecdotally, I did not experience any qualitative Another alternative: use two constants: x = (len & 0x1 ? 0xd76aa471 : 0xe8c7b75f) * (++x ^ y); That passes all of the current unit tests, including |
Logged In: YES Applied a small, slightly cheaper (no second multiply) The new variation survives Tim's torture test which I See: |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: