-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make hash values the same width as a pointer (or Py_ssize_t) #53987
Comments
Currently, Python produces hash values with fit in a C "long". This is fine at first sight, but in the context of dict and set implementations, it means that
A future-proof option would be to change all hash values to be of Py_ssize_t values rather than C longs. Either directly, or by defining a new dedicated alias Py_hash_t. This would also impact the ABI, I suppose. |
As an example of padding behaviour (under Win64 with 32-bit longs and 64-bit pointers): Python 3.2a1+ (py3k, Sep 4 2010, 22:50:10) [MSC v.1500 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" for more information.
>>> import struct
>>> struct.calcsize("l")
4
>>> struct.calcsize("lP")
16
>>> struct.calcsize("lPP")
24 |
+1 for PyObject_Hash returning a Py_ssize_t. Nothing good can come from having a hash table larger than the range of a hash value. Tests may pass but performance would degrade catastrophically. |
Correct. So it either needs to happen before 3.2, or wait until 4.0, |
Shouldn't there be a provision for ABI versioning? |
Am 05.09.2010 13:12, schrieb Antoine Pitrou:
There is certainly support for ABI versioning; the ABI version defined
No, vice versa. The PEP promises that the ABI won't change until Python |
I would suggest making this change before the ABI freeze starts, then. |
Yes, please! (Provided this change can go in before 3.2.) For the numeric types at least, this should be a straightforward adjustment. And while we're at it, we could clean up some of the undefined behaviour from integer overflow that's in the current hash functions (tuple.__hash__, for example). |
Ping? |
Not that anybody needs my input on this, but... Given the range of people advocating for this change, this looks to me |
I think this is a bit exagerated. The performance issues will only |
I'd like to repeat that it will not be impossible to fix this after the |
Le dimanche 17 octobre 2010 à 17:40 +0000, Martin v. Löwis a écrit :
At the cost of very annoying complications, though. |
Can't we just do it now, and be done with it regardless of the stable ABI? |
Sure. Somebody needs to implement it (and consider what consequences |
I'm the maintainer of a third-party library (gmpy) that would be impacted by this and I'm definately in favor of this change. With ever increasing amounts of memory becoming standard in computers, more users will encounter performance issues with large dictionaries. I see 3.2 as the version of Python 3.x that gains wide-spread use. Regardless of the timing of a stable ABI, it's popularity will make it more difficult change the hash size in the future. |
Assume Python would make such a change, and users would build released |
Here's a patch. Please review. |
A few comments:
Otherwise, it looks fine. |
Dropping the cast was inadvertent. I've now added versionchanged and committed it in r85664. |
I added a placeholder to the What's New document. Since it will affect extension module authors it should be mentioned at a higher level than just Misc/NEWS. |
The patch does not address that "unsigned long" is still used to calculate the hash values. This breaks numeric hashing and leads to incorrect values. Python 3.2a3+ (py3k, Oct 17 2010, 19:03:38) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.hash_info
sys.hash_info(width=64, modulus=-1, inf=314159, nan=0, imag=1000003)
>>> Would "size_t" be the appropriate type to use? |
Good point. Here's another patch: |
Wouldn't it be better to use Py_hash_t instead of size_t for calculating the hash values in those hash functions ? |
And related to my previous comment: shouldn't Py_hash_t map to size_t instead of Py_ssize_t ? |
Some quick comments on the latest patch.
|
On Mon, Oct 18, 2010 at 8:45 AM, Benjamin Peterson
Why? As far as I can tell, negative values are only used as sentinels |
Le lundi 18 octobre 2010 à 13:56 +0000, Alexander Belopolsky a écrit :
You can, except that changing the sentinel value will probably break a |
On Mon, Oct 18, 2010 at 9:59 AM, Antoine Pitrou <report@bugs.python.org> wrote:
AFAICT, a change from (Py_ssize_t)-1 to (size_t)-1 is less likely to |
Benjamin Peterson wrote:
Ah, right... we use -1 as error indicator. |
Also, note that hash(-12) is -12. |
That's true.
I don't understand what you mean by "native" and "unrelated". Signed |
On Mon, Oct 18, 2010 at 11:27 AM, Antoine Pitrou <report@bugs.python.org> wrote:
Sorry, I could have been clearer indeed. Consider the following code: static Py_hash_t
long_hash(PyLongObject *v)
{
unsigned long x;
...
x = x * sign;
if (x == (unsigned long)-1)
x = (unsigned long)-2;
return (Py_hash_t)x;
} Wouldn't it be cleaner if x was the same type as hash? Note that |
The calculation of long_hash assumes an unsigned temporary type to get correct results for the bit shifting and masking. The calculation is done on the absolute value of the long and then the sign is applied. We either needed to (1) add an unsigned Py_hash_t type or (2) just use size_t and Py_ssize_t. |
On Mon, Oct 18, 2010 at 1:25 PM, Case Van Horsen <report@bugs.python.org> wrote:
Option (2) may actually be preferable because dict and set |
My expectation is that Py_hash_t is the same as Py_ssize_t. The goal is to make hashes match the range of possible table sizes. |
Am 18.10.2010 17:27, schrieb Antoine Pitrou:
I don't think it is. Code that is changed to use the new return type For code that *doesn't* get adjusted, I'm still uncertain what will Please correct me if I'm wrong. |
But this is an absolute requirement, a guarantee that we provide |
Please read the patch: typedef Py_ssize_t Py_hash_t; |
Yes, exactly.
I like (2); the use of Py_hash_t suggests to me that the type used for the hash is configurable independently of Py_ssize_t, which isn't true. Also, with Py_hash_t it's no longer clear that printing a hash value (e.g., using PyErr_Format and friends) should use the '%zd' modifier. |
I've attached a patch that fixes hashing for numerical types, sys.hash_info is now correct, fixes typeobject.c/wrap_hashfunc and tupleobject.c/tuplehash to use Py_ssize_t instead of long, and uses Py_ssize_t instead of Py_hash_t. I think it is clearer to use Py_ssize_t instead of Py_hash_t. I found two occurances where PyLong_FromLong needed to be replaced by PyLong_FromSsize_t and I think bugs like that would be easier to catch if Py_ssize_t is used. |
I maintain gmpy and it needs to calculate hash values for integers, floats, and rationals. I converted my hash calculations to use Py_ssize_t in a 64-bit Windows enviroment. All my tests pass when I build Python with my previous patch. In hindsight, I think I made a mistake in my previous patch by eliminating Py_hash_t and using Py_ssize_t/size_t. I ended up defining Py_hash_t and Py_uhash_t in gmpy to simplify the code and to easily support older versions of Python. I will work on a patch that defines Py_hash_t and Py_uhash_t and upload it later this evening. |
I've uploaded a patch against the current svn trunk that:
|
Thank you. Applied in r85803. |
Case's patch fixes test_builtin and test_complex failures on Windows 7 64-bit. But there's still a failure in test_dictviews: ====================================================================== Traceback (most recent call last):
File "Z:\__svn__\lib\test\test_dictviews.py", line 153, in test_items_set_operations
{('a', 1), ('b', 2)})
AssertionError: Items in the first set but not the second:
('a', 1)
('b', 2) Which boils down to the following issue: >>> d1 = {'a': 1, 'b': 2}
>>> d1.items()
dict_items([('a', 1), ('b', 2)])
>>> set(d1.items())
{('a', 1), ('b', 2)}
>>> d1.items() | set(d1.items())
{('a', 1), ('a', 1), ('b', 2), ('b', 2)} There are also a bunch of possibly related failures in test_weakset: ====================================================================== Traceback (most recent call last):
File "Y:\py3k\__svn__\lib\test\test_weakset.py", line 293, in test_inplace_on_
self
self.assertEqual(t, self.s)
AssertionError: <_weakrefset.WeakSet object at 0x000000000283CDC0> != <_weakrefs
et.WeakSet object at 0x000000000283B2C8> ====================================================================== Traceback (most recent call last):
File "Y:\py3k\__svn__\lib\test\test_weakset.py", line 72, in test_or
self.assertEqual(self.s | set(self.items2), i)
AssertionError: <_weakrefset.WeakSet object at 0x000000000285B400> != <_weakrefs
et.WeakSet object at 0x000000000285B260> ====================================================================== Traceback (most recent call last):
File "Y:\py3k\__svn__\lib\test\test_weakset.py", line 110, in test_symmetric_d
ifference
self.assertEqual(c in i, (c in self.d) ^ (c in self.items2))
AssertionError: False != True ====================================================================== Traceback (most recent call last):
File "Y:\py3k\__svn__\lib\test\test_weakset.py", line 61, in test_union
self.assertEqual(c in u, c in self.d or c in self.items2)
AssertionError: False != True ====================================================================== Traceback (most recent call last):
File "Y:\py3k\__svn__\lib\test\test_weakset.py", line 117, in test_xor
self.assertEqual(self.s ^ set(self.items2), i)
AssertionError: <_weakrefset.WeakSet object at 0x000000000292CA18> != <_weakrefs
et.WeakSet object at 0x000000000292C878> Another bunch of test_pyclbr failures: ====================================================================== Traceback (most recent call last):
File "Y:\py3k\__svn__\lib\test\test_pyclbr.py", line 152, in test_decorators
self.checkModule('test.pyclbr_input', ignore=['om'])
File "Y:\py3k\__svn__\lib\test\test_pyclbr.py", line 101, in checkModule
self.assertListEq(real_bases, pyclbr_bases, ignore)
File "Y:\py3k\__svn__\lib\test\test_pyclbr.py", line 28, in assertListEq
self.fail("%r missing" % missing.pop())
AssertionError: 'object' missing ====================================================================== Traceback (most recent call last):
File "Y:\py3k\__svn__\lib\test\test_pyclbr.py", line 142, in test_easy
self.checkModule('pyclbr')
File "Y:\py3k\__svn__\lib\test\test_pyclbr.py", line 101, in checkModule
self.assertListEq(real_bases, pyclbr_bases, ignore)
File "Y:\py3k\__svn__\lib\test\test_pyclbr.py", line 28, in assertListEq
self.fail("%r missing" % missing.pop())
AssertionError: 'object' missing ====================================================================== Traceback (most recent call last):
File "Y:\py3k\__svn__\lib\test\test_pyclbr.py", line 158, in test_others
cm('random', ignore=('Random',)) # from _random import Random as CoreGenera
tor
File "Y:\py3k\__svn__\lib\test\test_pyclbr.py", line 101, in checkModule
self.assertListEq(real_bases, pyclbr_bases, ignore)
File "Y:\py3k\__svn__\lib\test\test_pyclbr.py", line 28, in assertListEq
self.fail("%r missing" % missing.pop())
AssertionError: 'Random' missing And a test_sys failure which is probably easy to fix: File "Z:\svn\lib\test\test_sys.py", line 831, in test_objecttypes All these tests pass in 32-bit mode. |
This patch seems to fix all aforementioned failures. |
These changes also look all reasonable to me. |
Ok, I've committed them in r85808. |
On Win64, I get two unexpected failures: I terminated test_capi after a Windows exception box popped up. [ 37/349] test_capi
test test_capi failed -- Traceback (most recent call last):
File "C:\svn\py3k\lib\test\test_capi.py", line 50, in test_no_FatalError_infinite_loop
b'Fatal Python error:'
AssertionError: b"Fatal Python error: PyThreadState_Get: no current thread\r\n\r\nThis application has requested the Runtime to term
inate it in an unusual way.\nPlease contact the application's support team for more information." != b'Fatal Python error: PyThreadS
tate_Get: no current thread'
[ 67/349] test_concurrent_futures
test test_concurrent_futures failed -- Traceback (most recent call last):
File "C:\svn\py3k\lib\test\test_concurrent_futures.py", line 442, in test_timeout
future1]), finished)
AssertionError: Items in the second set but not the first:
<Future at 0x5510fd0 state=finished raised AssertionError> The dictviews test passes successfully. |
The test_capi problem is not 64-bit-specific (see bpo-9116). As for test_concurrent_futures, I also have a failure (not the same one) in both 32-bit and 64-bit builds here. I'm gonna open a separate issue. |
I recommend to declare this issue closed, and keep it closed unless somebody wants to propose to revert the change (widening the hash type) completely. Any remaining issues that people want to attribute to this change (correctly or incorrectly) should be reported as separate issues. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: