
Hash collisions for tuples #78932

Closed
jdemeyer opened this issue Sep 20, 2018 · 122 comments
Labels
3.8 (EOL), interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage)

Comments

@jdemeyer
Contributor

BPO 34751
Nosy @tim-one, @rhettinger, @mdickinson, @ericvsmith, @jdemeyer, @sir-sigurd
PRs
  • bpo-34751: improved hash function for tuples #9471
  • bpo-34751: improved hash function for tuples #9534
Files
  • htest.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/tim-one'
    closed_at = <Date 2018-10-28.00:07:08.358>
    created_at = <Date 2018-09-20.13:27:27.408>
    labels = ['interpreter-core', '3.8', 'performance']
    title = 'Hash collisions for tuples'
    updated_at = <Date 2018-11-05.08:30:09.074>
    user = 'https://github.com/jdemeyer'

    bugs.python.org fields:

    activity = <Date 2018-11-05.08:30:09.074>
    actor = 'jdemeyer'
    assignee = 'tim.peters'
    closed = True
    closed_date = <Date 2018-10-28.00:07:08.358>
    closer = 'rhettinger'
    components = ['Interpreter Core']
    creation = <Date 2018-09-20.13:27:27.408>
    creator = 'jdemeyer'
    dependencies = []
    files = ['47857']
    hgrepos = []
    issue_num = 34751
    keywords = ['patch']
    message_count = 122.0
    messages = ['325870', '325883', '325891', '325936', '325949', '325950', '325953', '325955', '325956', '325957', '325959', '325960', '325964', '325981', '325982', '326016', '326018', '326021', '326023', '326024', '326026', '326029', '326032', '326047', '326067', '326081', '326083', '326086', '326110', '326115', '326117', '326118', '326119', '326120', '326147', '326173', '326193', '326194', '326198', '326200', '326203', '326212', '326214', '326217', '326218', '326219', '326236', '326278', '326290', '326302', '326306', '326309', '326331', '326332', '326333', '326334', '326337', '326352', '326375', '326396', '326435', '326505', '326507', '326508', '326510', '326512', '326583', '326602', '326603', '326615', '326640', '326653', '326667', '326753', '326765', '326820', '326835', '326872', '326874', '326877', '326878', '326883', '326892', '326893', '326894', '326898', '326903', '326905', '326906', '326907', '326908', '326910', '326917', '326918', '326927', '326929', '326947', '326949', '326961', '327005', '327014', '327038', '327042', '327043', '327044', '327045', '327055', '327061', '327063', '327078', '327258', '327260', '327298', '327311', '327319', '327380', '327391', '327394', '327423', '327460', '328665', '329287']
    nosy_count = 6.0
    nosy_names = ['tim.peters', 'rhettinger', 'mark.dickinson', 'eric.smith', 'jdemeyer', 'sir-sigurd']
    pr_nums = ['9471', '9534']
    priority = 'low'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue34751'
    versions = ['Python 3.8']

    @jdemeyer
    Contributor Author

    It's not hard to come up with a hash collision for tuples:

    >>> hash( (1,0,0) )
    2528505496374819208
    >>> hash( (1,-2,-2) )
    2528505496374819208

    The underlying reason is that the hashing code mixes ^ and *. This can create collisions because, for odd x, we have x ^ (-2) == -x and this minus operation commutes with multiplication.
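    A minimal check of that identity (a sketch; it relies only on CPython hashing small ints to themselves):

        for x in (1, 3, 5, 12345, -7):
            assert x ^ -2 == -x    # holds for every odd x (two's complement)

    In the loop step x = (x ^ y) * mult, a component hash of 0 multiplies the state by mult while -2 multiplies it by -mult (for odd state), so a second -2 flips the sign back: hence (1, 0, 0) and (1, -2, -2) collide.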

    This can be fixed by replacing ^ with +. On top of that, the hashing code for tuples looks needlessly complicated. A simple Bernstein hash suffices.

    @rhettinger
    Contributor

    It's not hard to come up with a hash collision for tuples:

Nor is it hard to come up with collisions for most simple hash functions. For typical use cases of tuples, what we have works out fine (it just combines the effects of the underlying hashes).

    The underlying reason is that the hashing code mixes ^ and *.

    Addition also has a relationship to multiplication. I think the XOR is fine.

    A simple Bernstein hash suffices.

    That is more suited to character data and small hash ranges (as opposed to 64-bit).

    On top of that, the hashing code for tuples looks
    needlessly complicated.

    The important thing is that it has passed our testing (see the comment above) and that it compiles nicely (see the tight disassembly below).

    ----------------------

    L82:
    xorq %r14, %rax
    imulq %rbp, %rax
    leaq 82518(%rbp,%r12,2), %rbp
    movq %r13, %r12
    movq %rax, %r14
    leaq -1(%r13), %rax
    cmpq $-1, %rax
    je L81
    movq %rax, %r13
    L73:
    addq $8, %rbx
    movq -8(%rbx), %rdi
    call _PyObject_Hash
    cmpq $-1, %rax
    jne L82

@rhettinger added the 3.8 (EOL) and interpreter-core labels Sep 20, 2018
@rhettinger added the performance label Sep 20, 2018
    @jdemeyer
    Contributor Author

    Nor is it hard to come up with collisions for most simple hash functions.

    The Bernstein hash algorithm is a simple algorithm for which it can be proven mathematically that collisions cannot be generated if the multiplier is unknown. That is an objective improvement over the current tuple hash. Of course, in practice the multiplier is not secret, but it suffices that it is random enough.

    Addition also has a relationship to multiplication.

    Of course, but not in a way that can be used to generate hash collisions. In fact, the simple relation between addition and multiplication makes this an actual provable mathematical statement.
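    One way to make that precise (our gloss, not wording from the thread): the Bernstein loop evaluates a polynomial in the multiplier m, with c the initial value and z_i = hash(t_i):

        h(t) = c\,m^{n} + z_1 m^{n-1} + \cdots + z_n \pmod{2^{w}}

    Two distinct inputs collide exactly when m is a root of their nonzero difference polynomial, so an input pair fixed without knowledge of a randomly drawn m collides only rarely. (Over a prime modulus this is immediate; modulo 2^w it takes more care, which is where constraints like "multiplier ≡ 3 mod 8" enter later in this thread.)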

    That is more suited to character data and small hash ranges (as opposed to 64-bit).

    Maybe you are thinking of the multiplier 33 which is classically used for character data? If you replace the multiplier 33 by a larger number like _PyHASH_MULTIPLIER == 1000003 (ideally it would be a truly random number), it works fine for arbitrary sequences and arbitrary bit lengths.

    The important thing is that it has passed our testing

    That's a silly argument. If it passed testing, it is only because the testing is insufficient.

    But really: what I am proposing is a better hash without obvious collisions which won't be slower than the existing hash. Why would you be against that? Why should we "roll our own" hash instead of using a known good algorithm?

    @jdemeyer jdemeyer reopened this Sep 20, 2018
    @rhettinger
    Contributor

    ISTM, you're being somewhat aggressive and condescending. I'll hand you off to Tim who is far better at wrestling over the details :-)

    From my point of view, you've made a bold and unsubstantiated claim that Python's existing tuple hash has collision issues with real code. Only a contrived and trivial integer example was shown.

On my end, I made an honest effort to evaluate the suggestion. I read your post, looked up the Bernstein hash, revisited the existing C code, and found that everything it does makes sense to me. There was a comment indicating that empirical tests were run to get the best possible constants. Also, I disassembled the code to verify that it is somewhat efficiently compiled.

You responded that I'm being "silly" and then you reversed my decision to close the issue (made because there is no evidence that the hash is broken). Accordingly, I'm going to hand it over to Tim, who I'm sure will know the history of the existing code, can compare and contrast the various options, evaluate them in the context of typical Python use cases, and assess whether some change is warranted.

    @tim-one
    Member

    tim-one commented Sep 21, 2018

    @jdemeyer, please define exactly what you mean by "Bernstein hash". Bernstein has authored many hashes, and none on his current hash page could possibly be called "simple":

    https://cr.yp.to/hash.html
    

    If you're talking about the teensy

        hash = hash * 33 + c;

thing from now-ancient C folklore, Bernstein is said to have later replaced the addition with xor:

    http://www.cse.yorku.ca/~oz/hash.html
    

    In any case, Python _used_ to do plain

        x = (1000003 * x) ^ y;
        
    but changed to the heart of the current scheme 14 years ago to address Source Forge bug python/cpython#40190:
        https://github.com/python/cpython/commit/41bd02256f5a2348d2de3d6e5fdfcaeb2fcaaebc#diff-cde4fb3109c243b2c2badb10eeab7fcd

    The Source Forge bug reports no longer appear to exist. If you care enough, dig through the messages with subject:

    [ python-Bugs-942952 ] Weakness in tuple hash
    

    here:

    https://mail.python.org/pipermail/python-bugs-list/2004-May/subject.html
    

    The thrust was that the simpler approach caused, in real life, hashes of nested tuples of the form

    (X, (X, Y))
    

    to return hash(Y) - the hash(X) parts across calls magically xor'ed out.

    It was also systematically the case that

    (X, (Y, Z))

    and

    (Y, (X, Z))

    hashed to the same thing.
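    A sketch of that cancellation under the simple recurrence Tim quotes (final length mixing omitted; pure-Python unbounded ints don't change the XOR cancellation):

        M, SEED = 1000003, 0x345678

        def old_hash(t):
            # pre-2004-style loop, recursing into nested tuples
            x = SEED
            for item in t:
                y = old_hash(item) if isinstance(item, tuple) else hash(item)
                x = (M * x) ^ y
            return x

        X, Y, Z = 10, 20, 30
        assert old_hash((X, (X, Y))) == hash(Y)               # hash(X) parts xor out
        assert old_hash((X, (Y, Z))) == old_hash((Y, (X, Z)))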

    So the complications were added for a reason, to address the showed-up-in-real-life pathological cases discussed in the bug report already linked to. Since I don't believe we've had other reports of real-life pathologies since, nobody is going to change this again without truly compelling reasons. Which I haven't seen here, so I'm closing this report again.

    BTW, as also explained in that report, the goal of Python's hash functions is to minimize collisions in dicts in real life. We generally couldn't care less whether they "act random", or are hard to out-guess. For example,

     assert all(hash(i) == i for i in range(1000000))
    

    has always succeeded.

    @tim-one tim-one closed this as completed Sep 21, 2018
    @ned-deily
    Member

(FWIW, I believe the Source Forge bug reports were forward-ported to bugs.python.org. This looks like the issue in question: https://bugs.python.org/issue942952 )

    @tim-one
    Member

    tim-one commented Sep 21, 2018

Ah! I see that the original SourceForge bug report got duplicated on this tracker, as bpo-942952. So clicking on that is a lot easier than digging thru the mail archive.

    One message there noted that replacing xor with addition made collision statistics much worse for the cases at issue.

    @jdemeyer
    Contributor Author

    ISTM, you're being somewhat aggressive and condescending.

    Interestingly, I felt the same with your reply, which explains my rapid reversion of your closing of the ticket. The way I see it: I pointed out a bug in tuple hashing and you just closed that, basically claiming that it's not a bug.

    I still don't understand why you are closing this so quickly. To put it differently: if I can improve the hash function, removing that obvious collision, extending the tests to catch this case, while not making it slower, why shouldn't this be done?

    @jdemeyer
    Contributor Author

    Since I don't believe we've had other reports of real-life pathologies since

    You just did in this bug report. Why doesn't that count?

    @tim-one
    Member

    tim-one commented Sep 21, 2018

    @jdemeyer, you didn't submit a patch, or give any hint that you _might_. It _looked_ like you wanted other people to do all the work, based on a contrived example and a vague suggestion.

    And we already knew from history that "a simple Bernstein hash suffices" is false for tuple hashing (for one possible meaning of "a simple Bernstein hash" - we had one of those 14 years ago).

    Neither Raymond nor I want to spend time on this ourselves, so in the absence of somebody else volunteering to do the work, there's no reason to keep the report open uselessly.

    If you're volunteering to do the work ... then do the work ;-)

    @jdemeyer
    Contributor Author

    To come back to the topic of hashing, with "Bernstein hash" for tuples I did indeed mean the simple (in Python pseudocode):

    # t is a tuple
    h = INITIAL_VALUE
    for x in t:
        z = hash(x)
        h = (h * MULTIPLIER) + z
    return h

    I actually started working on the code and there is indeed an issue with the hashes of (X, (Y, Z)) and (Y, (X, Z)) being equal that way. But that can be solved by replacing "hash(x)" in the loop by something slightly more complicated. After thinking about it, I decided to go for shifting higher to lower bits, so replacing z by z^(z >> 32) or something like that.

    This also has the benefit of mixing the high-order bits into the low-order bits, which is an additional improvement over the current code.
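    A runnable rendering of that variant (a sketch: 64-bit masking stands in for C wraparound, and the constants are illustrative placeholders, not the values from the PR):

        MASK64 = (1 << 64) - 1

        def tuple_hash_variant(t, multiplier=1000003, initial=0x345678):
            h = initial
            for x in t:
                z = hash(x) & MASK64
                z ^= z >> 32            # fold high-order bits into the low-order bits
                h = (h * multiplier + z) & MASK64
            return h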

    Since I already wrote the code anyway, I'll create a PR for it.

    @tim-one
    Member

    tim-one commented Sep 21, 2018

    You said it yourself: "It's not hard to come up with ...". That's not what "real life" means. Here:

    >>> len(set(hash(1 << i) for i in range(100_000)))
    61

    Wow! Only 61 hash codes across 100 thousand distinct integers?!

    Yup - and I don't care. I contrived the case, just like you apparently contrived yours.

    @jdemeyer
    Contributor Author

For the record: my collision is not contrived. It came up in actual code where I was hashing some custom class using tuples and found an unexpectedly high number of collisions, which I eventually traced back to the collision I reported here.

    By the way, I think the worst real-life hash collision is

    >>> hash(-1) == hash(-2)
    True

    @jdemeyer jdemeyer reopened this Sep 21, 2018
    @ericvsmith
    Member

    I agree with Raymond and Tim here: while it's inevitably true that there are many possible hash collisions, we'd need to see where the current algorithm caused real problems in real code. Pointing out a few examples where there was a collision isn't very helpful.

    So, jdemeyer, if it's possible to show (or describe) to us an example of a problem you had, such that we could repeat it, that would be helpful (and necessary) to evaluate any proposed changes. What were the inputs to hash() that caused a problem, and how did that problem manifest itself?

    @jdemeyer
    Contributor Author

    So, jdemeyer, if it's possible to show (or describe) to us an example of a problem you had, such that we could repeat it, that would be helpful (and necessary) to evaluate any proposed changes. What were the inputs to hash() that caused a problem, and how did that problem manifest itself?

    In all honesty, I don't remember. This was a while ago and at that time I didn't care enough to put up a CPython bug report. Still, this is a collision for tuples of short length (3) containing small integers (0 and -2). Why do you find that contrived?

    What prompted me to report this bug now anyway is that I discovered bad hashing practices for other classes too. For example, meth_hash in Objects/methodobject.c simply XORs two hashes and XOR tends to suffer from catastrophic cancellation. So I was hoping to fix tuple hashing as an example for other hash functions to follow.
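    The fragility of combining by plain XOR is easy to demonstrate: XOR is symmetric and self-cancelling, so swapped or equal components collide.

        a, b = hash("alpha"), hash("beta")
        assert a ^ b == b ^ a       # swapping the combined parts collides
        assert a ^ a == 0           # equal parts always cancel to zero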

    @jdemeyer
    Contributor Author

    I should also make it clear that the collision hash((1,0,0)) == hash((1,-2,-2)) that I reported is due to the algorithm, it's not due to some bad luck that 2 numbers happen to be the same. There are also many more similar hash collisions for tuples (all involving negative numbers) due to the same underlying mathematical reason.
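    A brute-force search for such pairs (a sketch; run it under a pre-3.8 CPython, since after the eventual fix these pairs no longer collide):

        from itertools import product
        from collections import defaultdict

        by_hash = defaultdict(list)
        for t in product(range(-4, 4), repeat=3):
            by_hash[hash(t)].append(t)
        collisions = [ts for ts in by_hash.values() if len(ts) > 1]
        print(len(collisions), collisions[:3])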

    I still don't understand why you don't consider it a problem. It may be a tiny problem, not really affecting the correct functioning of CPython. But, if we can fix this tiny problem, why shouldn't we?

    @ericvsmith
    Member

    I'm not one to judge the effectiveness of a new hash algorithm. But my personal concern here is making something else worse: an unintended consequence. IMO the bar for changing this would be very high, and at the least it would need to be addressing a known, not theoretical, problem.

    That said, I'll let the experts debate your proposed change.

    @rhettinger
    Contributor

ISTM the collision of "hash((1,0,0)) == hash((1,-2,-2))" also stems from integers hashing to themselves and the two's-complement arithmetic relation -x == ~x+1. Further, that example specifically exploits that "hash(-1) == hash(-2)", due to how error codes are handled. In a way, tuple hashing is incidental.

Occasional collisions aren't really a big deal -- it is normal for dicts to have some collisions and to resolve them. I would be more concerned if non-contrived datasets had many elements that gave exactly the same hash value, resulting in a pile-up in one place.

    FWIW, the current algorithm also has some advantages that we don't want to lose. In particular, the combination of the int hash and tuple hash are good at producing distinct values for non-negative numbers:

        >>> from itertools import product
        >>> len(set(map(hash, product(range(100), repeat=4))))
        100000000

    The current hash is in the pretty-darned-good category and so we should be disinclined to change battle-tested code that is known to work well across a broad range of use cases.

    @jdemeyer
    Contributor Author

    Further, that example specifically exploits that "hash(-1) == hash(-2)"

    No, it does not. There is no hashing of -1 involved in the example hash((1,0,0)) == hash((1,-2,-2)).

    @tim-one
    Member

    tim-one commented Sep 21, 2018

    For me, it's largely because you make raw assertions with extreme confidence that the first thing you think of off the top of your head can't possibly make anything else worse. When it turns out it does make some things worse, you're equally confident that the second thing you think of is (also) perfect.

    So far we don't even have a real-world test case, let alone a coherent characterization of "the problem":

    """
    all involving negative numbers) due to the same underlying mathematical reason.
    """

    is a raw assertion, not a characterization. The only "mathematical reason" you gave before is that "j odd implies j^(-2) == -j, so that m*(j^(-2)) == -m". Which is true, and a good observation, but doesn't generalize as-is beyond -2.

    Stuff like this from the PR doesn't inspire confidence either:

    """
    MULTIPLIER = (1000003)**3 + 2 = 1000009000027000029: the multiplier should be big enough and the original 20-bit number is too small for a 64-bit hash. So I took the third power. Unfortunately, this caused a sporadic failure of the testsuite on 32-bit systems. So I added 2 which solved that problem.
    """

Why do you claim the original was "too small"? Too small for what purpose? As before, we don't care whether Python hashes "act random". Why, when raising it to the third power apparently didn't work, did you pull "2" out of a hat? What was _the cause_ of the "sporadic failure" (whatever that means), and why did adding 2 fix it? Why isn't there a single word in _the code_ about where the mystery numbers came from?

    You're creating at least as many mysteries as you're claiming to solve.

    We're not going to change such heavily used code on a whim.

    That said, you could have easily enough _demonstrated_ that there's potentially a real problem with a mix of small integers of both signs:

    >>> from itertools import product
    >>> cands = list(range(-10, -1)) + list(range(9))
    >>> len(cands)
    18
    >>> _ ** 4
    104976
    >>> len(set(hash(t) for t in product(cands, repeat=4)))
    35380

    And that this isn't limited to -2 being in the mix (and noting that -1 wasn't in the mix to begin with):

    >>> cands.remove(-2)
    >>> len(cands) ** 4
    83521
    >>> len(set(hash(t) for t in product(cands, repeat=4)))
    33323

    If that's "the problem", then - sure - it _may_ be worth addressing. Which we would normally do by looking for a minimal change to code that's been working well for over a decade, not by replacing the whole thing "just because".

    BTW, continuing the last example:

>>> from collections import Counter
>>> c1 = Counter(hash(t) for t in product(cands, repeat=4))
    >>> Counter(c1.values())
    Counter({1: 11539, 2: 10964, 4: 5332, 3: 2370, 8: 1576, 6: 1298, 5: 244})

    So despite that there were many collisions, the max number of times any single hash code appeared was 8. That's unfortunate, but not catastrophic.

    Still, if a small change could repair that, fine by me.

    @tim-one
    Member

    tim-one commented Sep 21, 2018

    Oops!

    """
    "j odd implies j^(-2) == -j, so that m*(j^(-2)) == -m"
    """

    The tail end should say "m*(j^(-2)) == -m*j" instead.

    @jdemeyer
    Contributor Author

    FWIW, the current algorithm also has some advantages that we don't want to lose. In particular, the combination of the int hash and tuple hash are good at producing distinct values for non-negative numbers:

        >>> from itertools import product
        >>> len(set(map(hash, product(range(100), repeat=4))))
        100000000

FWIW: the same is true for my new hash function.

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 3, 2018

A (simplified and slightly modified) version of xxHash seems to work very well, much better than SeaHash. Just like SeaHash, xxHash can work in parallel, but I'm not doing that; I'm just using this for the loop:

        for y in t:
            y ^= y * (PRIME32_2 - 1)
            acc += y
            acc = ((acc << 13) + (acc >> 19))  # rotate left by 13 bits
            acc *= MULTIPLIER

    Plain xxHash does "y *= PRIME32_2" or equivalently "y += y * (PRIME32_2 - 1)". Replacing that += by ^= helps slightly with my new tuple test.
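    For reference, a runnable 32-bit rendering of that loop (a sketch: explicit masking stands in for C's uint32 wraparound; PRIME32_2 is xxHash's published constant, while MULTIPLIER is a stand-in for the randomly chosen value that is 3 mod 8):

        MASK32 = 0xFFFFFFFF
        PRIME32_2 = 2246822519
        MULTIPLIER = 2654435763      # placeholder; 3 (mod 8)

        def loop(hashes, acc=0x345678):
            for y in hashes:
                y &= MASK32
                y ^= (y * (PRIME32_2 - 1)) & MASK32
                acc = (acc + y) & MASK32
                acc = ((acc << 13) | (acc >> 19)) & MASK32   # rotate left by 13
                acc = (acc * MULTIPLIER) & MASK32
            return acc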

    @tim-one
    Member

    tim-one commented Oct 3, 2018

    Thanks for looking at xxHash! An advantage is that it already comes in 32- and 64-bit versions.

    A (simplified and slightly modified version of) xxHash
    seems to work very well, much better than SeaHash.

    ? I've posted several SeaHash cores that suffer no collisions at all in any of our tests (including across every "bad example" in these 100+ messages), except for "the new" tuple test. Which it also passed, most recently with 7 collisions. That was under 64-bit builds, though, and from what follows I figure you're only looking at 32-bit builds for now.

    Just like SeaHash, xxHash also works in parallel. But I'm not
    doing that and just using this for the loop:

        for y in t:
            y ^= y * (PRIME32_2 - 1)
            acc += y
            acc = ((acc << 13) + (acc >> 19))  # rotate left by 13 bits
            acc *= MULTIPLIER

    Plain xxHash does "y *= PRIME32_2" or equivalently
    "y += y * (PRIME32_2 - 1)". Replacing that += by ^= helps
    slightly with my new tuple test.

    Taking an algorithm in wide use that's already known to get a top score on SMHasher and fiddling it to make a "slight" improvement in one tiny Python test doesn't make sense to me. Does the variant also score well under SMHasher? "I don't see why it wouldn't" is not an answer to that ;-)

    Note that the change also slows the loop a bit - "acc += y" can't begin until the multiply finishes, and the following "acc *=" can't begin until the addition finishes. Lengthening the critical path needs to buy something real to justify the extra expense. I don't see it here.

    For posterity: the 64-bit version of xxHash uses different primes, and rotates left by 31 instead of by 13.

    Note: I'm assuming that by "PRIME32_2" you mean 2246822519U, and that "MULTIPLIER" means 2654435761U. Which appear to be the actual multipliers used in, e.g.,

    https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxhash.hpp

    As always, I'm not a fan of making gratuitous changes. In any case, without knowing the actual values you're using, nobody can replicate what you're doing.

    That said, I have no reason to favor SeaHash over xxHash, and xxHash also has a real advantage in that it apparently didn't _need_ the "t ^= t << 1" permutation to clear the new tuple test's masses of replicated sign bits.

    But the more we make changes to either, the more work _we_ have to do to ensure we haven't introduced subtle weaknesses. Which the Python test suite will never be substantial enough to verify - we're testing almost nothing about hash behavior. Nor should we: people already wrote substantial test suites dedicated to that sole purpose, and we should aim to be "mere consumers" of functions that pass _those_ tests. Python's test suite is more to ensure that known problems _in Python_ don't recur over the decades.

    @tim-one
    Member

    tim-one commented Oct 3, 2018

    Here's a complete xxHash-based implementation via throwing away C++-isms from

    https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxhash.hpp

    In the 64-bit build there are no collisions across my tests except for 11 in the new tuple test.

    The 32-bit build fares worse:

    • 3 collisions in the old tuple test.
    • 51 collisions in the new tuple test.
    • 173 collisions across product([-3, 3], repeat=20)
    • 141 collisions across product([0.5, 0.25], repeat=20)
    • no other collisions

    The expected number of collisions from tossing 2**20 balls into 2**32 buckets is about 128, with standard deviation 11.3. So 141 is fine, but 173 is highly suspect. 51 for the new tuple test is way out of bounds.
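    Those numbers come from the birthday approximation: tossing n balls into m buckets yields about n**2/(2*m) collisions when n is much smaller than m, with standard deviation roughly its square root:

        from math import sqrt

        def expected_collisions(n, m):
            return n * n / (2 * m)      # birthday approximation, n << m

        print(expected_collisions(2**20, 2**32))        # ~128.0
        print(sqrt(expected_collisions(2**20, 2**32)))  # ~11.3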

    But I don't much care. None of these are anywhere near disasters, and - as I've already said - I know of no non-crypto strength hash that doesn't suffer "worse than random" behaviors in some easily-provoked cases. All you have to do is keep trying.

    Much as with SeaHash, adding

    t ^= t << 1;
    

    at the top helps a whole lot in the "bad cases", cutting the new test's collisions to 7 in the 32-bit build. It also cuts the product([-3, 3], repeat=20) collisions to 109. But boosts the old tuple test's collisions to 12. I wouldn't do it: it adds cycles for no realistic gain. We don't really care whether the hash "is random" - we do care about avoiding catastrophic pile-ups, and there are none with an unmodified xxHash.

    BTW, making that change also _boosts_ the number of "new test" collisions to 23 in the 64-bit build (a little over its passing limit).

    #define Py_NHASHBITS (SIZEOF_PY_HASH_T * CHAR_BIT)
    #if Py_NHASHBITS >= 64
    #    define Py_HASH_MULT1 (Py_uhash_t)11400714785074694791ULL
    #    define Py_HASH_MULT2 (Py_uhash_t)14029467366897019727ULL
    #    define Py_HASH_LSHIFT 31
    #elif Py_NHASHBITS >= 32
    #    define Py_HASH_MULT1 (Py_uhash_t)2654435761ULL
    #    define Py_HASH_MULT2 (Py_uhash_t)2246822519ULL
    #    define Py_HASH_LSHIFT 13
    #else
    #    error "can't make sense of Py_uhash_t size"
    #endif
    #define Py_HASH_RSHIFT (Py_NHASHBITS - Py_HASH_LSHIFT)
    
    static Py_hash_t
    tuplehash(PyTupleObject *v)
    {
        Py_uhash_t x, t;  /* Unsigned for defined overflow behavior. */
        Py_hash_t y;
        Py_ssize_t len = Py_SIZE(v);
        PyObject **p;
        x = 0x345678UL;
        p = v->ob_item;
        while (--len >= 0) {
            y = PyObject_Hash(*p++);
            if (y == -1)
                return -1;
            t = (Py_uhash_t)y;
            x += t * Py_HASH_MULT2;
            x = (x << Py_HASH_LSHIFT) | (x >> Py_HASH_RSHIFT);
            x *= Py_HASH_MULT1;
        }
        x += 97531UL;
        if (x == (Py_uhash_t)-1)
            x = -2;
        return x;
    }
    #undef Py_NHASHBITS
    #undef Py_HASH_MULT1
    #undef Py_HASH_MULT2
    #undef Py_HASH_LSHIFT
    #undef Py_HASH_RSHIFT

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 4, 2018

    I've posted several SeaHash cores that suffer no collisions at all in any of our tests (including across every "bad example" in these 100+ messages), except for "the new" tuple test. Which it also passed, most recently with 7 collisions. That was under 64-bit builds, though, and from what follows I figure you're only looking at 32-bit builds for now.

    Note that I'm always considering parametrized versions of the hash functions that I'm testing. I'm replacing the fixed multiplier (all algorithms mentioned here have such a thing) by a random multiplier which is 3 mod 8. I keep the other constants. This allows me to look at the *probability* of passing the testsuite. That says a lot more than a simple yes/no answer for a single test. You don't want to pass the testsuite by random chance, you want to pass the testsuite because you have a high probability of passing it.

    And I'm testing 32-bit hashes indeed since we need to support that anyway and the probability of collisions is high enough to get interesting statistical data.

    For SeaHash, the probability of passing my new tuple test was only around 55%. For xxHash, this was about 85%. Adding some input mangling improved both scores, but the xxHash variant was still better than SeaHash.
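    A sketch of that parametrization (run_test is a hypothetical callable returning True when the whole test suite passes with the given multiplier):

        import random

        def random_multiplier(bits=32):
            m = random.getrandbits(bits)
            return (m & ~7) | 3         # force m = 3 (mod 8)

        def pass_rate(run_test, trials=200):
            return sum(run_test(random_multiplier()) for _ in range(trials)) / trials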

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 4, 2018

    Note: I'm assuming that by "PRIME32_2" you mean 2246822519U

    Yes indeed.

    and that "MULTIPLIER" means 2654435761U.

    No, I mean a randomly chosen multiplier which is 3 mod 8.

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 4, 2018

    people already wrote substantial test suites dedicated to that sole purpose, and we should aim to be "mere consumers" of functions that pass _those_ tests.

    There are hash functions that pass those tests which are still bad in practice when used as tuple hash function. That's really unfortunate, but it's a fact that we need to live with. It means that we may need to do some adjustments to the hash functions.

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 4, 2018

    Taking an algorithm in wide use that's already known to get a top score on SMHasher and fiddling it to make a "slight" improvement in one tiny Python test doesn't make sense to me.

    What I'm doing is the most innocent change: just applying a fixed permutation on the *input* of the hash function. I'm not changing the actual hashing algorithm.

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 4, 2018

    In the 64-bit build there are no collisions across my tests except for 11 in the new tuple test.

That's pretty bad actually. With 64 bits, you statistically expect something on the order of 10**-8 collisions. So what you're seeing is 9 orders of magnitude too many collisions.

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 4, 2018

    Taking an algorithm in wide use that's already known to get a top score on SMHasher and fiddling it to make a "slight" improvement in one tiny Python test doesn't make sense to me.

OK, I won't do that. The difference is not that much anyway (it increases the test success rate from about 85% to 90%).

    @tim-one
    Member

    tim-one commented Oct 4, 2018

    Note that I'm always considering parametrized
    versions of the hash functions that I'm testing.
    I'm replacing the fixed multiplier (all algorithms
    mentioned here have such a thing) by a random
    multiplier which is 3 mod 8.

    I've explained before in some detail that this makes NO SENSE for SeaHash: its multiplier was specially constructed to guarantee certain dispersion properties, to maximize the effectiveness of its "propagate right" step.

    You already have my explanation about that, and a link to the original Reddit thread in which it was proposed (and eagerly accepted by SeaHash's primary author). Also already explained that its multiplier appears to fail its design criteria at two specific bit positions. Which I can repair, but I would do so not by changing Python's version, but by bringing it to SeaHash's author's attention.

    OTOH, I haven't been able to find anything explaining why the xxHash authors picked what they picked. But they certainly didn't have "random" in mind - nobody does. You discovered for yourself by trying things that various properties make for bad multipliers, and people who work on these things "for real" knew that a long time ago. They almost certainly know a great deal more that we haven't thought of at all.

    @tim-one
    Member

    tim-one commented Oct 4, 2018

    > In the 64-bit build there are no collisions across my
    > tests except for 11 in the new tuple test.

    That's pretty bad actually. With 64 bits, you statistically
    expect something in the order of 10**-8 collisions. So
    what you're seeing is 9 orders of magnitude too many collisions.

    You wrote the test, so you should remember that you throw away the top 32 bits of the 64-bit hash code :-) Tossing 345130 balls into 2**32 bins has an expected mean of 13.9 collisions and sdev 3.7. 11 collisions is fine.

    If I change the test to retain all the hash bits in the 64-bit build, there are no collisions in the new tuple test.

    If I change it again to retain only the high 32 hash bits, 27 collisions. Which is statistically suspect, but still universes away from "disaster" territory.

@tim-one added the topic-XML and interpreter-core labels and removed the interpreter-core and topic-XML labels Oct 4, 2018
    @tim-one
    Member

    tim-one commented Oct 4, 2018

    > people already wrote substantial test suites dedicated
    > to that sole purpose, and we should aim to be "mere
    > consumers" of functions that pass _those_ tests.

    There are hash functions that pass those tests which
    are still bad in practice when used as tuple hash function.

    Really? Which one(s)? If you're talking about that some fail badly when you _replace_ the constants they picked, I doubt they'd accept that as proof of anything. Things like FNV and DJB score poorly on SMHasher to begin with.

    That's really unfortunate, but it's a fact that we need
    to live with.

    I'm surprised they do as well as they do using less than a handful of invertible transformations per input, using state of the same bit width as the inputs.

    I don't expect them to be immune to all easily-provoked "worse than random" cases, though. Over time, these hashes change while keeping the same name. As noted before, this is at least the second version of SeaHash. The best of the original algorithms of this kind - murmurhash - is on its 3rd version now.

    The motivation is the same: someone bumps into real-life cases where they do _way way way_ worse than "random", and so they try to repair that as cheaply as possible.

    Most of these are designed for data centers to do cheap-as-possible reasonably robust fingerprinting of giant data blobs. They could already have used, e.g., SHA-2 for highly robust fingerprinting, but they don't want to pay the very much higher runtime cost.

    If you can provoke any deviation from randomness in that kind of hash, it's considered "broken" and is never used again. But in these far cheaper hashes, it's considered part of the tradeoffs. If you read the Reddit thread about SeaHash I cited before, the person who suggested the current transformations noted that there were still weaknesses in some areas of the input space he was able to find just by thinking about how it worked, but far milder than the weaknesses he found by thinking about how the original version worked.

    That doesn't imply there aren't far worse weaknesses in areas he _didn't_ think about. So it goes.

    It means that we may need to do some adjustments to
    the hash functions.

    I'm fine with the faithful (barring errors on my part) xxHash I posted here before. And I don't care whether it works "better" or "worse" on Python's tiny test suite if its constants are replaced, or its algorithm is "tweaked". When xxHash version 2 is released, I want it to be as mindless as possible to replace the guts of Python's tuple-hash loop with the guts of xxHash version 2.

    It's easy to believe that SMHasher has nothing testing the bit patterns produced by mixing two's-complement integers of similar magnitude but opposite signs. That's not the kind of data giant data centers are likely to have much of ;-) If Appleby can be talked into _adding_ that kind of keyset data to SMHasher, then the next generations of these hashes will deal with it for us. But an objectively small number of collisions more than expected in some such cases now doesn't annoy me enough to want to bother.

    @tim-one
    Member

    tim-one commented Oct 6, 2018

    As a sanity check, I tried the 32-bit version of MurmurHash2 (version 3 doesn't have a 32-bit version). All of my Python tests had collisions, and while the new tuple test dropped to 15, product([0.5, 0.25], repeat=20) skyrocketed from 141 (32-bit xxHash) to 3848.

    I could live with that too, but xxHash does better overall on our tiny tests, and requires less code and fewer cycles.

    Here's what I used:

    static Py_hash_t
    tuplehash(PyTupleObject *v)
    {
        const Py_uhash_t m = 0x5bd1e995;
        const int r = 24;
        Py_uhash_t x, t;  /* Unsigned for defined overflow behavior. */
        Py_hash_t y;
        Py_ssize_t len = Py_SIZE(v);
        PyObject **p;
        x = 0x345678UL ^ (Py_uhash_t)len;
        p = v->ob_item;
        while (--len >= 0) {
            y = PyObject_Hash(*p++);
            if (y == -1)
                return -1;
        t = (Py_uhash_t)y;  /* no redeclaration; use the t declared above */
            t *= m;
            t ^= t >> r;
            t *= m;
            x *= m;
            x ^= t;
        }
        x ^= x >> 13;
        x *= m;
        x ^= x >> 15;
        if (x == (Py_uhash_t)-1)
            x = -2;
        return x;
    }

    @tim-one
    Member

    tim-one commented Oct 6, 2018

    Thinking about the way xxHash works prompted me to try this initialization change:

        x = 0x345678UL + (Py_uhash_t)len; // adding in the length is new

    That was a pure win in my tests, slashing the number of collisions in the new tuple test, 32-bit build, from 51 to 17. In the 64-bit build, cut from 11 to 10, but when looking at the high 32 hash bits instead from 27 to 15. There were also small improvements in other 32-bit build collision stats.

    Full-blown xxHash (& SeaHash, and murmurhash, and ...) also fold in the length, but not inside their core loops. It's useful info!

    But they fold it in _after_ their core loops, not before. They're apparently worried about this: if their internal state ever reaches 0, that's a fixed point so long as all remaining incoming data is zeroes (in Python, e.g., picture adding one or more trailing integer or float zeroes or ..., which hash to 0). Folding in the length at the end snaps it slightly out of zeroland.
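    The fixed point is easy to see in the 32-bit loop posted earlier in this thread (a sketch): once the accumulator hits 0, every remaining item hashing to 0 keeps it there.

        MASK32 = 0xFFFFFFFF
        MULT1, MULT2 = 2654435761, 2246822519

        def step(acc, t):
            acc = (acc + t * MULT2) & MASK32
            acc = ((acc << 13) | (acc >> 19)) & MASK32   # rotate left by 13
            return (acc * MULT1) & MASK32

        assert step(0, 0) == 0      # zero state plus zero input stays zero forever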

I don't really care about that. Folding in the length at the start quickly leads to large differences across iterations for tuples of different lengths (thanks to repeated multiplication and "propagate right" steps), which is especially helpful in things like the nested tuple tests where there's almost as much variation in tuple lengths as there is in the few little integers bottom-level tuples contain.

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 7, 2018

I updated PR 9471 with a tuple hash function based on xxHash. The only change w.r.t. the official xxHash specification is that I'm not using parallelism and just using 1 accumulator. Please have a look.

    @tim-one
    Member

    tim-one commented Oct 7, 2018

    Attaching htest.py so we have a common way to compare what various things do on the tests that have been suggested. unittest sucks for that. doctest too. Here's current code output from a 32-bit build; "ideally" we want "got" values not much larger than "mean" (these are counting collisions):

    range(100) by 3; 32-bit hash codes; mean 116.42 got 0
    -10 .. 8 by 4; 32-bit hash codes; mean 1.28 got 69,596
    -50 .. 50 less -1 by 3; 32-bit hash codes; mean 116.42 got 708,066
    0..99 << 60 by 3; 32-bit hash codes; mean 116.42 got 875,000
    [-3, 3] by 20; 32-bit hash codes; mean 128.00 got 1,047,552
    [0.5, 0.25] by 20; 32-bit hash codes; mean 128.00 got 1,048,568
    old tuple test; 32-bit hash codes; mean 7.43 got 6
    new tuple test; 32-bit hash codes; mean 13.87 got 102,922

    And under a 64-bit build, where the full hash code is considered, and also its lowest and highest 32 bits. Note, e.g., that the old tuple test is an utter disaster if we only look at the high 32 bits. Which is actually fine by me - the point of this is to show what happens. Judging is a different (albeit related) issue ;-)

    range(100) by 3; 64-bit hash codes; mean 0.00 got 0
    range(100) by 3; 32-bit lower hash codes; mean 116.42 got 0
    range(100) by 3; 32-bit upper hash codes; mean 116.42 got 989,670
    -10 .. 8 by 4; 64-bit hash codes; mean 0.00 got 69,596
    -10 .. 8 by 4; 32-bit lower hash codes; mean 1.28 got 69,596
    -10 .. 8 by 4; 32-bit upper hash codes; mean 1.28 got 101,438
    -50 .. 50 less -1 by 3; 64-bit hash codes; mean 0.00 got 708,066
    -50 .. 50 less -1 by 3; 32-bit lower hash codes; mean 116.42 got 708,066
    -50 .. 50 less -1 by 3; 32-bit upper hash codes; mean 116.42 got 994,287
    0..99 << 60 by 3; 64-bit hash codes; mean 0.00 got 500,000
    0..99 << 60 by 3; 32-bit lower hash codes; mean 116.42 got 875,000
    0..99 << 60 by 3; 32-bit upper hash codes; mean 116.42 got 989,824
    [-3, 3] by 20; 64-bit hash codes; mean 0.00 got 1,047,552
    [-3, 3] by 20; 32-bit lower hash codes; mean 128.00 got 1,047,552
    [-3, 3] by 20; 32-bit upper hash codes; mean 128.00 got 1,047,552
    [0.5, 0.25] by 20; 64-bit hash codes; mean 0.00 got 1,048,544
    [0.5, 0.25] by 20; 32-bit lower hash codes; mean 128.00 got 1,048,575
    [0.5, 0.25] by 20; 32-bit upper hash codes; mean 128.00 got 1,048,544
    old tuple test; 64-bit hash codes; mean 0.00 got 0
    old tuple test; 32-bit lower hash codes; mean 7.43 got 6
    old tuple test; 32-bit upper hash codes; mean 7.43 got 128,494
    new tuple test; 64-bit hash codes; mean 0.00 got 102,920
    new tuple test; 32-bit lower hash codes; mean 13.87 got 102,922
    new tuple test; 32-bit upper hash codes; mean 13.87 got 178,211

    @tim-one
    Member

    tim-one commented Oct 8, 2018

    Here's htest.py output from

    git pull https://github.com/jdemeyer/cpython.git bpo-34751

    and a variation. The number of collisions in the variation appear in square brackets at the end of each line.

    32-bit build:

    range(100) by 3; 32-bit hash codes; mean 116.42 got 0 [0]
    -10 .. 8 by 4; 32-bit hash codes; mean 1.28 got 0 [0]
    -50 .. 50 less -1 by 3; 32-bit hash codes; mean 116.42 got 0 [0]
    0..99 << 60 by 3; 32-bit hash codes; mean 116.42 got 0 [0]
    [-3, 3] by 20; 32-bit hash codes; mean 128.00 got 140 [108]
    [0.5, 0.25] by 20; 32-bit hash codes; mean 128.00 got 99 [154]
    old tuple test; 32-bit hash codes; mean 7.43 got 4 [2]
    new tuple test; 32-bit hash codes; mean 13.87 got 19 [21]

    64-bit build:

    range(100) by 3; 64-bit hash codes; mean 0.00 got 0 [0]
    range(100) by 3; 32-bit lower hash codes; mean 116.42 got 130 [0]
    range(100) by 3; 32-bit upper hash codes; mean 116.42 got 94 [6]
    -10 .. 8 by 4; 64-bit hash codes; mean 0.00 got 0 [0]
    -10 .. 8 by 4; 32-bit lower hash codes; mean 1.28 got 1 [0]
    -10 .. 8 by 4; 32-bit upper hash codes; mean 1.28 got 1 [0]
    -50 .. 50 less -1 by 3; 64-bit hash codes; mean 0.00 got 0 [0]
    -50 .. 50 less -1 by 3; 32-bit lower hash codes; mean 116.42 got 128 [0]
    -50 .. 50 less -1 by 3; 32-bit upper hash codes; mean 116.42 got 116 [8]
    0..99 << 60 by 3; 64-bit hash codes; mean 0.00 got 0 [0]
    0..99 << 60 by 3; 32-bit lower hash codes; mean 116.42 got 123 [258]
    0..99 << 60 by 3; 32-bit upper hash codes; mean 116.42 got 121 [0]
    [-3, 3] by 20; 64-bit hash codes; mean 0.00 got 0 [0]
    [-3, 3] by 20; 32-bit lower hash codes; mean 128.00 got 129 [117]
    [-3, 3] by 20; 32-bit upper hash codes; mean 128.00 got 137 [115]
    [0.5, 0.25] by 20; 64-bit hash codes; mean 0.00 got 0 [0]
    [0.5, 0.25] by 20; 32-bit lower hash codes; mean 128.00 got 126 [131]
    [0.5, 0.25] by 20; 32-bit upper hash codes; mean 128.00 got 137 [130]
    old tuple test; 64-bit hash codes; mean 0.00 got 0 [0]
    old tuple test; 32-bit lower hash codes; mean 7.43 got 12 [5]
    old tuple test; 32-bit upper hash codes; mean 7.43 got 54 [52]
    new tuple test; 64-bit hash codes; mean 0.00 got 0 [0]
    new tuple test; 32-bit lower hash codes; mean 13.87 got 10 [6]
    new tuple test; 32-bit upper hash codes; mean 13.87 got 20 [30]

    They're all fine by me. This is what the variation does:

    1. Removes all "final mix" code after the loop ends.
    2. Changes initialization to add in the length:
        Py_uhash_t acc = _PyHASH_XXPRIME_5 + (Py_uhash_t)len;

    The vast bulk of the "final mix" code applies a chain of tuple-agnostic permutations. The only exception is adding in the length, which #2 moves to the start instead.

    Applying permutations after the loop ends changes nothing about the number of full-width hash collisions, which is Python's _primary_ concern. Two full-width hash codes are the same after the final mix if and only if they're the same before the final mix starts.
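    For reference, xxHash64's published post-loop avalanche: every step is a bijection on 64-bit words (xorshifts are invertible, and the multipliers are odd), which is why dropping it cannot change full-width collision counts.

        MASK = (1 << 64) - 1

        def xxh64_avalanche(x):
            x ^= x >> 33
            x = (x * 0xC2B2AE3D27D4EB4F) & MASK   # PRIME64_2; odd, so invertible
            x ^= x >> 29
            x = (x * 0x165667B19E3779F9) & MASK   # PRIME64_3; odd, so invertible
            x ^= x >> 32
            return x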

    In xxHash they're concerned about getting close to "avalanche" perfection (each input bit affects all final output bits with probability about 0.5 for each). There aren't enough permutations _inside_ the loop to achieve that for the last input or two, so they pile up more permutations after the loop.

    While we do care about "propagate left" and "propagate right" to mix up very regular and/or sparse hash codes, "avalanche perfection" is of no particular value to us. To the contrary, in typical cases like product(range(100), repeat=3) it's better for us not to destroy all the regularity:

    range(100) by 3; 32-bit lower hash codes; mean 116.42 got 130 [0]
    range(100) by 3; 32-bit upper hash codes; mean 116.42 got 94 [6]

    "Avalanche perfection" is what drives the number of collisions up with the final mix code intact. Without it, we get "waaaaaaay better than random" numbers of collisions.

    The other reason it's attractive to drop it: the final mix code is one long critical path (no instruction-level parallelism), adding 8 serialized machine instructions including two multiplies. That's a real expense for typically-short tuples used as dict keys or set elements.

    Nothing is anywhere near disaster territory either way, although "worse than random" remains quite possible, especially when cutting a 64-bit hash in half (do note that, either way, there are no collisions in any test when retaining the full 64-bit hash code).

    @tim-one
    Member

    tim-one commented Oct 9, 2018

    We need to worry about timing too :-( I'm using this as a test. It's very heavy on using 3-tuples of little ints as dict keys. Getting hash codes for ints is relatively cheap, and there's not much else going on, so this is intended to be very sensitive to changes in the speed of tuple hashing:

    def doit():
        from time import perf_counter as now
        from itertools import product
    
        s = now()
        d = dict.fromkeys(product(range(150), repeat=3), 0)
        for k in d:
            d[k] += 1
        for k in d:
            d[k] *= 2
        f = now()
        return f - s

    I run that five times in a loop on a mostly quiet box, and pick the smallest time as "the result".
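    For example:

        print(min(doit() for _ in range(5)))    # best of five, per the procedure above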

    Compared to current master, a 64-bit release build under Visual Studio takes 20.7% longer. Ouch! That's a real hit.

    Fiddling the code a bit (see the PR) to convince Visual Studio to emit a rotate instruction instead of some shifts and an add reduces that to 19.3% longer. A little better.

    The variant I discussed last time (add in the length at the start, and get rid of all the post-loop avalanche code) reduces it to 8.88% longer. The avalanche code is fixed overhead independent of tuple length, so losing it is more valuable (for relative speed) the shorter the tuple.

I can't speed it more. These high-quality hashes have unforgiving long critical paths, and every operation appears crucial to their good results in _some_ test we already have. But "long" is relative to our current tuple hash, which does relatively little to scramble bits, and never "propagates right" at all. In its loop, it does one multiply nothing waits on, and increments the multiplier for the next iteration while the multiply is going on.

    Note: "the real" xxHash runs 4 independent streams, but it only has to read 8 bytes to get the next value to fold in. That can go on - in a single thread - while other streams are doing arithmetic (ILP). We pretty much have to "stop the world" to call PyObject_Hash() instead.

    We could call that 4 times in a row and _then_ start arithmetic. But most tuples that get hashed are probably less than 4 long, and the code to mix stream results together at the end is another bottleneck. My first stab at trying to split it into 2 streams ran substantially slower on realistic-length (i.e., small) tuples - so that was also my last stab ;-)

    I can live with the variant. When I realized we never "propagate right" now, I agreed with Jeroen that the tuple hash is fundamentally "broken", despite that people hadn't reported it as such yet, and despite that that flaw had approximately nothing to do with the issue this bug report was opened about. Switching to "a real" hash will avoid a universe of bad cases, and xxHash appears to be as cheap as a hash without glaringly obvious weaknesses gets: two multiplies, an add, and a rotate.

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 9, 2018

    Changes initialization to add in the length:

    What's the rationale for that change? You always asked me to stay as close as possible to the "official" hash function which adds in the length at the end. Is there an actual benefit from doing it in the beginning?

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 9, 2018

I pushed an update at PR 9471. I think I took into account all your comments, except for moving the length addition from the end to the beginning of the function.

    @tim-one
    Member

    tim-one commented Oct 9, 2018

    > Changes initialization to add in the length:

    What's the rationale for that change? You always asked
    me to stay as close as possible to the "official" hash function
    which adds in the length at the end. Is there an actual benefit
    from doing it in the beginning?

    The heart of xxHash is the state-updating code, neither initialization nor finalization. Likewise for SeaHash or any of the others.

    Without the post-loop avalanche code, adding the length at the end has very little effect - it will typically change only the last two bits of the final result. _With_ the avalanche code, the length can affect every bit in the result. But adding it in at the start also achieves that - same as changing any bit in the initial accumulator value.

    Adding it in at the start instead also takes the addition off the critical path. Which may or may not save a cycle or two (depending on processor and compiler), but can't hurt speed.

    I noted before that adding the length at the end _can_ break out of a zero fixed-point (accumulator becomes 0 and all remaining values hash to 0). Adding it at the start loses that.

    So there is a theoretical danger there ... OK, trying it both ways I don't see any significant differences in my test results or a timing difference outside the noise range. So I'm happy either way.

    @jdemeyer
    Contributor Author

    it will typically change only the last two bits of the final result

    Which is great if all that you care about is avoiding collisions.

    @rhettinger
    Contributor

    New changeset aeb1be5 by Raymond Hettinger (jdemeyer) in branch 'master':
    bpo-34751: improved hash function for tuples (GH-9471)

    @jdemeyer
    Contributor Author

    jdemeyer commented Nov 5, 2018

    Many thanks!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022