Deprecate ob_shash in BytesObject #91020
Code objects have more and more bytes attributes now. Sets and dicts have their own hash cache.
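For reference, ob_shash is the per-object hash cache field of PyBytesObject. A simplified view of the struct (as defined in CPython's Include/cpython/bytesobject.h before the field's removal):

typedef struct {
    PyObject_VAR_HEAD
    Py_hash_t ob_shash;   /* cached hash of the bytes, or -1 if not computed yet */
    char ob_sval[1];      /* the byte data, allocated inline after the header */
} PyBytesObject;

This is the extra 8 bytes per instance discussed below.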
I think it is a legacy of Python 2. Attributes and variable names are Unicode strings in Python 3, so the main reason for this optimization is no longer relevant. But some programs still work with encoded bytes instead of strings; in particular, os.environ and os.environb are implemented as dicts of bytes on non-Windows.
This change doesn't affect os.environ. os.environ[key] does ... On the other hand, this change will affect ...
When ob_shash is removed:
I will reconsider the removal before removing the cache.
I'm getting tons of deprecation warnings from the deepfreeze script. Please add _Py_COMP_DIAG_IGNORE_DEPR_DECL to Tools/scripts/deepfreeze.py.
I'm sorry. Maybe ccache hid the warnings from me.
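For context, the usual way such deprecation warnings are silenced in CPython's own C code (and in code it generates) is the push/ignore/pop macro pattern from pyport.h; a minimal sketch, with the deprecated field access shown only as an illustration:

_Py_COMP_DIAG_PUSH
_Py_COMP_DIAG_IGNORE_DEPR_DECLS
/* direct access to the deprecated field; no warning is emitted inside this block */
((PyBytesObject *)obj)->ob_shash = -1;
_Py_COMP_DIAG_POP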
If I run this code, would it be slower?
bytes_hash = hash(bytes_data)
bytes_hash = hash(bytes_data)  # get hash twice
Since Python 3.13, yes. It will be a bit slower.
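Roughly, that is because the current implementation consults the cached field before hashing the buffer; a simplified sketch of bytes_hash() from Objects/bytesobject.c:

static Py_hash_t
bytes_hash(PyBytesObject *a)
{
    if (a->ob_shash == -1) {
        /* not cached yet: hash the buffer once and remember the result */
        a->ob_shash = _Py_HashBytes(a->ob_sval, Py_SIZE(a));
    }
    return a->ob_shash;
}

Without ob_shash, the second hash() call has to hash the whole buffer again.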
Since hash() is a public function, maybe some users use the hash value to manage bytes objects in their own way, and then there may be a performance regression. A rough example, dispatching data to 16 servers:
h = hash(b)
sendto(server_number=h & 0xF, data=b)
Since the hash is randomized, using hash(bytes) for such a use case is not recommended; users should use stable hash functions instead. I agree that there are a few use cases where this change causes a performance regression. But they are really few compared to the overhead of adding 8 bytes to every bytes instance.
RAM is now relatively cheaper than CPU.
Average RAM capacity doesn't grow as fast as CPU core counts. Bytes objects are used for co_code, which is hot, so cache efficiency is important. Would you give us a more realistic (or real-world) example where caching the bytes hash is important?
If you put a bytes object into multiple dicts/sets, the hash needs to be computed multiple times. This seems like a common usage. bytes is a very basic type, and users may use it in various ways. Unskilled users may check the same bytes object against dicts/sets many times. FYI, timings for 1 GiB of data (function / seconds):
First of all, this is just deprecating direct access of ob_shash; RAM and cache efficiency is not the only motivation for this. But if we stop using bytes objects in code objects by Python 3.13, there is no need to remove ob_shash.
Doesn't it lose only a few milliseconds?
I guess there's not much difference in benchmarks. Is it possible to let code objects use other types? In addition to ob_shash, maybe the extra \x00 byte at the end could be saved.
I don't think calculating hash() for large bytes objects is a common use case. Balance is important. A microbenchmark for a specific case doesn't guarantee good balance.
Of course, it is possible. But it needs a large refactoring around code objects, including the pyc cache file format.
Just a note: as of this week (GH-31888), we no longer use bytes objects to store bytecode. Instead, the instructions are stored as part of the PyCodeObject struct. (Of course, we still use bytes objects for the various exception handling and debugging tables attached to code objects, but these are very cold fields by comparison.) Not to say this work isn't worthwhile, but it probably isn't worth doing for ...
OK. Cache efficiency is dropped from the motivations list.
I'm closing this issue for now, because this issue is just for making direct access of ob_shash deprecated. After Python 3.12 becomes beta, we will reconsider whether we should remove ob_shash or keep it.
This has been brought up in NumPy: NumPy subclasses bytes, and I am not immediately sure how best to initialize the hash value. We can't really call ... Would it make sense to move the ...
Would you give me a link to the code?
How about tp_init?
@methane the code is here: ... with the initialization a bit later. The point being, this currently creates a bytes subclass the manual way. I am not condoning that, but I am also not sure there is an obvious replacement for creating bytes subclass instances. EDIT: The point about init/new is that they are Python-level API, and as such expect args/kwargs. I don't really think that would be ideal here either. I don't mind this requiring a bit of a hackish solution, but right now I am not sure what the solution is at all.
Thank you. tp_alloc seems good.
What are the problems with calling ...?
@serhiy-storchaka it requires passing a Python object to create the new bytes (subclass) instance. NumPy does not have such an object available. We would have to create it just to call ... Now, I don't care about using "unstable" API. So if the category of "unstable" is created (and the discussion and Petr's last mail sound a bit like it), then it would make sense to me if this ABI detail were available only through that? Otherwise a custom ...
Current ideas:
BTW, since the bytes object is a PyVarObject, every subclass won't extend PyBytesObject, am I right?
NumPy certainly does not; I suppose you could dynamically add more "items" at the end in a custom ...
The happy path of the code in question is:

PyTypeObject *type = /* get the type object */;
int itemsize = /* get the string 'length' */;
PyObject *obj;
void *destptr;

obj = type->tp_alloc(type, itemsize);
destptr = PyBytes_AS_STRING(obj);
((PyBytesObject *)obj)->ob_shash = -1;
memcpy(destptr, data, itemsize);
return obj;

How should NumPy refactor that?
#if PY_VERSION_HEX < 0x030b00b0
((PyBytesObject *)obj)->ob_shash = -1;
#endif
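Putting the two snippets together, a minimal sketch of the refactored happy path (same hypothetical type, itemsize, and data variables as in the snippet above; a NULL check is added, otherwise the logic is unchanged):

obj = type->tp_alloc(type, itemsize);
if (obj == NULL) {
    return NULL;
}
destptr = PyBytes_AS_STRING(obj);
#if PY_VERSION_HEX < 0x030b00b0
/* only touch ob_shash on versions where direct access is not yet deprecated */
((PyBytesObject *)obj)->ob_shash = -1;
#endif
memcpy(destptr, data, itemsize);
return obj;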
I recommend keeping the hash cache. The caching benefit doesn't happen often, but when it does, the benefits are significant. Crypto hashes are always stored as bytes and we use them to check for duplicates (i.e. the way git works). The output of str.encode() is bytes, and it is legitimate to use this as a cache key (with lru_cache, for example). If I recall correctly, DNA sequence fragments get stored as bytes and are used for repeated dict/set lookups.

This code has been stable for a very long time and it is almost certain that some code in the wild depends on cached hash values for efficient computation. ISTM the deprecation damages that code for no good reason. Looking back over the thread, it seems that the original motivations are mostly gone or are insignificant. Unless someone really wants this, I request that the field be undeprecated.
Memory saving is not the only reason for removing the hash cache. Additionally, this issue doesn't remove the hash cache.
The hash cache should remain. If you want to block direct access to the structure member, that is reasonable. For code that does use the hash cache, it is performance critical. The feature has been in place a long time, and code in the wild almost certainly relies on it (and reasonably so) for good performance.

Saying "CPUs are fast" is dismissive and ignores the essential fact that bytes can be of any size. At some size (likely somewhat small), a fast CPU computation loses to a field lookup, especially for a field that is likely already loaded in the cache line. In contrast, the data array is much less likely to be in the current cache line, so its memory accesses will be expensive.

I've read your "reasons" and still object to removing the hash cache. Please undo the deprecation.