gh-111545: Add Py_HashDouble() function #112449

vstinner · 2023-11-27T11:51:03Z

Add again the private _PyHASH_NAN constant.
Add tests: Modules/_testcapi/hash.c and Lib/test/test_capi/test_hash.py.

Issue: Make _Py_HashDouble public again as "unstable" API #111545

📚 Documentation preview 📚: https://cpython-previews--112449.org.readthedocs.build/

vstinner · 2023-11-27T11:55:17Z

@encukou @mdickinson @serhiy-storchaka: Would you mind to review this change?

It's a new API for PyHash_Double() function: other APIs were discussed in PR #112095.

vstinner · 2023-11-30T22:34:26Z

I renamed the function from PyHash_Double() to Py_HashDouble() following #112095 (comment) discussion.

mdickinson

The API looks fine to me. I'm wondering whether we want to consider switching the meaning of the return value, so that 0 is used for non-NaNs and 1 for NaNs; I want to think of the return value as essentially an is_nan check, and from that perspective the current API has an extra negation to get my brain around.

But either way is probably fine (so long as it's documented).

Doc/c-api/hash.rst

vstinner · 2023-12-03T11:12:26Z

I rebased the PR on the main branch.

I'm wondering whether we want to consider switching the meaning of the return value, so that 0 is used for non-NaNs and 1 for NaNs

Ok, I modified the API to return 1 if the float is not-a-number (NaN), and return 0 otherwise.

encukou · 2023-12-04T10:02:48Z

In similar API, 1 means that you're getting a meaningful *result, and 0 means the result is NULL:

IMO, that's a good convention to establish (here, with 0 rather than NULL).
(The current devguide uses “lesser” and “greater”, which always seemed rather opaque to me, but “you're getting more data” does sounds like the “greater” result.)

mdickinson · 2023-12-04T11:31:14Z

IMO, that's a good convention to establish

Fair enough. That convention does feel somewhat in conflict with the common "zero-return = success", "non-zero-return = failure" API that we have elsewhere, but that's a wider issue.

mdickinson · 2023-12-04T11:32:31Z

but “you're getting more data” does sounds like the “greater” result

Agreed.

encukou · 2023-12-04T12:44:33Z

the common "zero-return = success", "non-zero-return = failure" API

If I'm reading the room correctly, we're reserving only -1 for errors -- but only the “hard” errors where an exception is set.
In the case of exceptions not being set, “success” and “failure” tend to mean different things to different people :)

vstinner · 2023-12-04T13:53:16Z

I reverted my change to move back to the previous API:

Return 0 if the argument is not-a-number (NaN)
Return 1 otherwise.

@mdickinson @encukou: Would you mind to review/approve formally the PR?

encukou

Looks good, thanks!

Other WG members might disagree, but I don't see anything too controversial any more.

vstinner · 2023-12-05T18:16:57Z

@encukou:

Other WG members might disagree, but I don't see anything too controversial any more.

I created capi-workgroup/decisions#2 for the C API Working Group.

vstinner · 2023-12-06T14:38:02Z

I merged my Py_HashPointer() PR. I rebased this PR on top on it.

The PR was rebased.

Doc/c-api/hash.rst

Doc/library/sys.rst

vstinner · 2023-12-11T21:19:16Z

I will update the PR to address the reviews, once the API will be approved by the C API Working Group.

Include/cpython/pyhash.h

* Add again the private _PyHASH_NAN constant. * Add tests: Modules/_testcapi/hash.c and Lib/test/test_capi/test_hash.py.

Lib/test/test_capi/test_hash.py

encukou

LGTM, thank you!

Doc/c-api/hash.rst

Python/pyhash.c

vstinner · 2023-12-12T15:57:37Z

I updated What's New in Python 3.13 to suggest a recipe replacing usage of the private _Py_HashDouble() function:

* Add :c:func:`Py_HashDouble` function to hash a C double number. Existing code
  using the private ``_Py_HashDouble()`` function can be updated to::

        Py_hash_t
        hash_double(PyObject *obj, double value)
        {
            Py_hash_t hash;
            if (Py_HashDouble(value, &hash) == 0) {
                hash = Py_HashPointer(obj);
            }
            return hash;
        }

  (Contributed by Victor Stinner in :gh:`111545`.)

serhiy-storchaka · 2023-12-12T16:56:36Z

I would prefer a simpler API that does not treat NaN as special at all and leave this to the user:

        Py_hash_t
        hash_double(PyObject *obj, double value)
        {
            if (Py_IS_NAN(value)) {
                return Py_HashPointer(obj);
            }
            else {
                return Py_HashDouble(value);
            }
        }

Even if it is slightly slower. But it requires less mental effort to understand and use the API.

encukou · 2023-12-14T13:51:46Z

Py_hash_t hash_double(PyObject *obj, double value)

What should an user pass in as obj?

vstinner · 2023-12-14T14:34:00Z

What should an user pass in as obj?

@serhiy-storchaka proposes the API: Py_hash_t Py_HashDouble(double value) which never fails. It returns 0 for NaN.

I would be fine with this API. It was more or less proposed earlier. But it was said that having to call Py_IS_NAN() might have an impact on performance. I supposed that Py_IS_NAN() is really cheap.

In terms of API, Py_hash_t Py_HashDouble(double value) is simpler than int Py_HashDouble(double value, Py_hash_t *result). There is no need to have to think if result is NULL or not, and there is no need to check for error.

Py_hash_t hash_double(PyObject *obj, double value) is only the example that I added to What's New in Python 3.13 doc to migrate from the private _Py_HashDouble() function.

vstinner · 2023-12-14T14:50:22Z

I wrote PR #113115 which implements Py_hash_t Py_HashDouble(double value) API so you can look at code to compare with this PR implementing the API int Py_HashDouble(double value, Py_hash_t *result).

vstinner · 2023-12-15T15:24:04Z

@serhiy-storchaka @mdickinson: So which API do you prefer?

Py_hash_t Py_HashDouble(double value)
int Py_HashDouble(double value, Py_hash_t *result)

serhiy-storchaka · 2023-12-15T15:44:50Z

I prefer your initial interface Py_hash_t Py_HashDouble(double value), but maybe without any special handling for NaN. It should be handled externally by user.

Unless there is really large performance advantage in using the second variant. I do not know how large should it be to justify more complex interface. 10% is not large.

vstinner · 2023-12-15T16:23:22Z

I prefer your initial interface Py_hash_t Py_HashDouble(double value)

Ok.

but maybe without any special handling for NaN. It should be handled externally by user.

If prefer to have a deterministic behavior and always return the same hash value (0) if value is NaN. There are legit use cases to treat NaN as hash value 0. See my comment.

mdickinson · 2023-12-16T11:06:00Z

I prefer your initial interface Py_hash_t Py_HashDouble(double value), but maybe without any special handling for NaN. It should be handled externally by user.

Ditto. I don't see a need for anything more complicated than this, and it feels wrong to me to allow minor performance differences to drive API design.

If prefer to have a deterministic behavior and always return the same hash value (0) if value is NaN. There are legit use cases to treat NaN as hash value 0.

This sounds good to me.

vstinner · 2023-12-16T12:28:23Z

Ditto. I don't see a need for anything more complicated than this, and it feels wrong to me to allow minor performance differences to drive API design.

Ok. I proposed to change C API Working Group decision on this API: capi-workgroup/decisions#2 (comment)

serhiy-storchaka · 2023-12-16T12:36:39Z

If prefer to have a deterministic behavior and always return the same hash value (0) if value is NaN.

It creates a vulnerability. CPython itself never calls this API for a NaN. If a third-party code calls it for a NaN value, it will allow to easily create non-equal objects with the same hash.

encukou · 2023-12-18T14:21:32Z

The simpler API, Py_hash_t Py_HashDouble(double value), has a footgun: if you forget to do the NaN check, all your NaNs will hash the same.
IMO, this is a case of expert blindness: it you know floats and Python's float-hashing behaviour, the required Py_IS_NAN check is second nature. But IMO, we shouldn't be designing the API for experts.
So I'd present slightly different choices:

int Py_HashDouble(double value, Py_hash_t *result), as in this PR, which makes it more obvious that different kinds of results are possible.
Py_hash_t Py_HashNonNaNDouble(double value), with an uglier name, where the necessary check is obvious from that name.

vstinner · 2023-12-18T14:31:06Z

The simpler API, Py_hash_t Py_HashDouble(double value), has a footgun: if you forget to do the NaN check, all your NaNs will hash the same.

Can this concern be resolved with better documentation? Explain in which case treating all NaN "as equal" can be an issue, or suggest a solution such as @serhiy-storchaka's recipe #112449 (comment) ?

I'm not convinced that using the same hash value (0) for all NaN numbers is a big deal, since Python was doing that until 3.9. The hash value is just an optimization to avoid the slower comparison operator, it's more about performance than about correctness.

$ python3.9
>>> nan1=float("inf")*0
>>> nan2=float("inf")*0
>>> nan2 is nan1
False

# Same hash value
>>> hash(nan1), hash(nan2)
(0, 0)

# dict works "as expected"
>>> d={nan1: 1, nan2: 2}; d
{nan: 1, nan: 2}
>>> d[nan1]
1
>>> d[nan2]
2

# where the magic happens, NaN is not equal to NaN
>>> nan2 == nan1
False

encukou · 2023-12-18T15:07:04Z

Can this concern be resolved with better documentation?

IMO, no. When doing a review, you don't check the docs of all functions you see. Just the surprising ones.

serhiy-storchaka · 2023-12-18T15:22:17Z

It is a big deal. Try to create a dict with 10000 or 100000 of NaNs.

d = {float('nan'): i for i in range(100000)}

vstinner · 2023-12-18T16:00:09Z

It is a big deal. Try to create a dict with 10000 or 100000 of NaNs.

Aha, I see: it's way faster in Python 3.10 where hash(nan) is not always 0.

# hash(nan) = 0
$ python3.9 -m timeit '{float("nan"): i for i in range(10_000)}'
1 loop, best of 5: 1.12 sec per loop

# hash(nan) = id(obj)
$ python3.10 -m timeit '{float("nan"): i for i in range(10_000)}'
100 loops, best of 5: 2.7 msec per loop

But the proposed API is for Py_HashDouble() which only takes a C double. It's up to the caller to use something else to calculate a different hash value for NaN. The question is how we can guide users to such solution.

For me, Py_HashDouble() is like a primitive used to build a hash function.

vstinner · 2023-12-18T16:01:55Z

@encukou:

IMO, no. When doing a review, you don't check the docs of all functions you see. Just the surprising ones.

I'm talking about Py_HashDouble() documentation. I'm asking if we can write doc in a way that users avoid the hash collision issue if their numbers can be NaN.

encukou · 2023-12-19T11:59:36Z

I'm talking about Py_HashDouble() documentation. I'm asking if we can write doc in a way that users avoid the hash collision issue if their numbers can be NaN.

I'm also talking about the documentation. Users won't read the docs. If the function has surprising behaviour, adding a hint that you should look at the docs goes a long way.

vstinner · 2023-12-20T11:02:50Z

I created PR #112095 more than 1 month ago. I spent time to run benchmark, implement different APIs, try to collect feedback on each API, and discuss in length advantages and disadvantages of each API. Sadly, we failed to reach a consensus on the API. Now another API is being discussed. The API looks simple to me, I didn't expect to spend more than one month on a single function.

I need to take a break from that topic. I don't have the energy to dig into these discussions. I prefer to close the PR for now.

vstinner requested a review from tiran as a code owner November 27, 2023 11:51

bedevere-app bot added the awaiting core review label Nov 27, 2023

bedevere-app bot mentioned this pull request Nov 15, 2023

Make _Py_HashDouble public again as "unstable" API #111545

Open

vstinner mentioned this pull request Nov 27, 2023

gh-111545: Add PyHash_Double() function #112095

Closed

vstinner force-pushed the hash_double3 branch from ae3843d to e69656a Compare November 30, 2023 22:31

vstinner changed the title ~~gh-111545: Add PyHash_Double() function~~ gh-111545: Add Py_HashDouble() function Nov 30, 2023

mdickinson approved these changes Dec 3, 2023

View reviewed changes

Doc/c-api/hash.rst Outdated Show resolved Hide resolved

bedevere-app bot added awaiting merge and removed awaiting core review labels Dec 3, 2023

vstinner force-pushed the hash_double3 branch from e69656a to 8e29a02 Compare December 3, 2023 11:10

vstinner force-pushed the hash_double3 branch from 49bae56 to 5be68f0 Compare December 4, 2023 13:50

encukou previously approved these changes Dec 4, 2023

View reviewed changes

vstinner mentioned this pull request Dec 5, 2023

Add Py_HashDouble() function capi-workgroup/decisions#2

Closed

4 tasks

vstinner mentioned this pull request Dec 6, 2023

gh-111545: Add Py_HashPointer() function #112096

Merged

vstinner force-pushed the hash_double3 branch from 5be68f0 to 155764a Compare December 6, 2023 14:37

iritkatriel reviewed Dec 11, 2023

View reviewed changes

Doc/c-api/hash.rst Outdated Show resolved Hide resolved

Doc/c-api/hash.rst Outdated Show resolved Hide resolved

Doc/c-api/hash.rst Show resolved Hide resolved

Doc/library/sys.rst Outdated Show resolved Hide resolved

Doc/library/sys.rst Outdated Show resolved Hide resolved

encukou reviewed Dec 12, 2023

View reviewed changes

Include/cpython/pyhash.h Outdated Show resolved Hide resolved

pythongh-111545: Add Py_HashDouble() function

250c3e1

* Add again the private _PyHASH_NAN constant. * Add tests: Modules/_testcapi/hash.c and Lib/test/test_capi/test_hash.py.

Update Doc/c-api/hash.rst

44d3bd9

vstinner commented Dec 12, 2023

View reviewed changes

Lib/test/test_capi/test_hash.py Outdated Show resolved Hide resolved

Update Lib/test/test_capi/test_hash.py

6df0183

encukou approved these changes Dec 12, 2023

View reviewed changes

vstinner commented Dec 12, 2023

View reviewed changes

Doc/c-api/hash.rst Show resolved Hide resolved

Python/pyhash.c Outdated Show resolved Hide resolved

vstinner added 2 commits December 12, 2023 15:47

Update Python/pyhash.c

62ecc19

Suggest how to replace private _Py_HashDouble()

95a8f09

vstinner mentioned this pull request Dec 14, 2023

C API: Add PyObject_GenericHash() function #113024

Closed

skirpichev mentioned this pull request Dec 15, 2023

gh-111389: expose _PyHASH_INF/BITS/MODULUS/IMAG macros as public #111418

Merged

vstinner closed this Dec 20, 2023

vstinner deleted the hash_double3 branch December 20, 2023 11:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gh-111545: Add Py_HashDouble() function #112449

gh-111545: Add Py_HashDouble() function #112449

vstinner commented Nov 27, 2023 •

edited by github-actions bot

vstinner commented Nov 27, 2023

vstinner commented Nov 30, 2023

mdickinson left a comment

vstinner commented Dec 3, 2023

encukou commented Dec 4, 2023 •

edited

mdickinson commented Dec 4, 2023

mdickinson commented Dec 4, 2023

encukou commented Dec 4, 2023

vstinner commented Dec 4, 2023

encukou left a comment

vstinner commented Dec 5, 2023

vstinner commented Dec 6, 2023

vstinner commented Dec 11, 2023

encukou left a comment

vstinner commented Dec 12, 2023

serhiy-storchaka commented Dec 12, 2023

encukou commented Dec 14, 2023

vstinner commented Dec 14, 2023

vstinner commented Dec 14, 2023

vstinner commented Dec 15, 2023

serhiy-storchaka commented Dec 15, 2023

vstinner commented Dec 15, 2023

mdickinson commented Dec 16, 2023

vstinner commented Dec 16, 2023

serhiy-storchaka commented Dec 16, 2023

encukou commented Dec 18, 2023

vstinner commented Dec 18, 2023

encukou commented Dec 18, 2023

serhiy-storchaka commented Dec 18, 2023

vstinner commented Dec 18, 2023

vstinner commented Dec 18, 2023

encukou commented Dec 19, 2023

vstinner commented Dec 20, 2023

gh-111545: Add Py_HashDouble() function #112449

gh-111545: Add Py_HashDouble() function #112449

Conversation

vstinner commented Nov 27, 2023 • edited by github-actions bot

vstinner commented Nov 27, 2023

vstinner commented Nov 30, 2023

mdickinson left a comment

Choose a reason for hiding this comment

vstinner commented Dec 3, 2023

encukou commented Dec 4, 2023 • edited

mdickinson commented Dec 4, 2023

mdickinson commented Dec 4, 2023

encukou commented Dec 4, 2023

vstinner commented Dec 4, 2023

encukou left a comment

Choose a reason for hiding this comment

vstinner commented Dec 5, 2023

vstinner commented Dec 6, 2023

vstinner commented Dec 11, 2023

encukou left a comment

Choose a reason for hiding this comment

vstinner commented Dec 12, 2023

serhiy-storchaka commented Dec 12, 2023

encukou commented Dec 14, 2023

vstinner commented Dec 14, 2023

vstinner commented Dec 14, 2023

vstinner commented Dec 15, 2023

serhiy-storchaka commented Dec 15, 2023

vstinner commented Dec 15, 2023

mdickinson commented Dec 16, 2023

vstinner commented Dec 16, 2023

serhiy-storchaka commented Dec 16, 2023

encukou commented Dec 18, 2023

vstinner commented Dec 18, 2023

encukou commented Dec 18, 2023

serhiy-storchaka commented Dec 18, 2023

vstinner commented Dec 18, 2023

vstinner commented Dec 18, 2023

encukou commented Dec 19, 2023

vstinner commented Dec 20, 2023

vstinner commented Nov 27, 2023 •

edited by github-actions bot

encukou commented Dec 4, 2023 •

edited