-
-
Notifications
You must be signed in to change notification settings - Fork 29.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up and fix OrderedDict #69596
Comments
Proposed patch cleans up and fixes minor bugs in C implementation of OrderedDict.
Also applied other cleanups. The size of sources is decreased by 105 lines. Objects/odictobject.c | 194 +-----------------------------!!!!!!!!!!!!!!!!!! |
These all look good except for perhaps #5 which I need to look at a bit more for its effect on OD subclasses. |
Thanks for working on this, Serhiy. I've left a review. As to the points you outlined, I have concerns with the impact of #3 and #5 on subclasses. Notably od.__class__ is not necessarily the same as type(od). Also #7 may introduce an unhandled re-entrancy, causing potentially incorrect outcomes. Also note that I was extremely careful to (almost) exactly match the pure Python implementation. Not only did this guarantee equivalent behavior, but it simplified the porting effort. I'm not opposed to deviating from the pure Python implementation as long as the behavior remains exactly the same. So if you want to change the behavior of OrderedDict you must be sure to make the equivalent change in the pure Python implementation (with the associated backward-compatibility constraints). Thanks again for working on this though. |
Thank you for your review Eric. As for using Py_TYPE(self) instead of the __class__ attribute in #3, this is consistent with the rest of Python core and stdlib. All C implementations use Py_TYPE(self) for repr and pickling, even if Python implementations can use __class__. >>> class S(set): __class__ = str
...
>>> s = S()
>>> s.__class__
<class 'str'>
>>> s
S()
>>> s.__reduce_ex__(2)
(<class '__main__.S'>, ([],), {})
>>> s+''
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'S' and 'str' Note that you can't just set s.__class__ = str, this is forbidden (bpo-24912). You should set __class__ in class definition or make it a property, and this doesn't have affect object's behavior, the only visible effect is the value of the __class__ attribute itself. One possible argument for Py_TYPE(self) (besides simplicity and historical reasons) is that it is more reliable. It doesn't cause executing arbitrary Python code and therefore is thread-safe and reentrant. It returns the true type of the object, that can be important for debugging. We should not care about exactly matching Python implementation, but rather about consistency with the rest of Python. If such type of mismatching is considered a bug, Python is full of such bugs. About #5, be sure that the new code is exact semantic equivalence to the old code besides copying the dict that is not needed now. I just dropped an iteration on always empty dict and related code. I don't see re-entrancy problem in #7. Could you please provide an example? |
Regarding Py_TYPE(od) vs. od.__class__, there is a difference for subclasses, as you demonstrated in your example. [1] Thanks for explaining your rationale. I now understand your argument about using PyTYPE() for repr and pickle in C types. I still have concerns, though, regarding parity between the two OrderedDict implementations. The key difference is that we're talking about an after-the-fact C port of an existing pure Python implementation. The two implementations should behave identically in nearly every case. The only place I would expect a deviation to not matter is for anything where Python-as-a-whole does not guarantee behavior. However there are *very* few of those when all is said and done. Any other variation should not be made casually and only if we are willing to accept that there may be code out there that relies on the pure Python behavior which the C implementation will break. So like I said before, as a rule I'm absolutely okay with changing the behavior as long as the pure Python implementation is changed to match and OrderedDict remains backward-compatible (and the change is meaningful, e.g. efficiency, consistency). Otherwise my concerns remain and we have to have sufficient justification for the change. It may be worth taking this to python-dev to get a clearer consensus on both "Py_TYPE(obj) vs. obj.__class__", as well as about parity between dual pure-Python/C implementations in the stdlib, regardless of the outcome of this issue. Both are points about which we should be consistent throughout Python. The type() vs. __class__ question may deserve an entry in the language reference and both may deserve a PEP. ----- For this particular case, I think we should still aim for compatibility with the pure Python implementation. To that effect, we could use Py_TYPE(od) only if PyODict_CheckExact() returns true (as a fast path) and use od.__class__ otherwise. That fast path would be safe for the C implementation since code can't change OrderedDict().__class__ (due to bpo-24912). That particular difference in the implementations (i.e. you *can* change od.__class__ for the pure Python one) is an acceptable compatibility break since it's unlikely anyone is changing od.__class__ *and* if they are then they can just switch to a simple subclass that wraps OrderedDict: # before:
from collections import OrderedDict
od = OrderedDict()
od.__class__ = SomethingElse
# after:
import collections
class OrderedDict(collections.OrderedDict):
pass
od = OrderedDict()
od.__class__ = SomethingElse If we *do* continue supporting "type(od) != od.__class__" in repr then I'd say why bother with a fast path for PyOdict_CheckExact(). That sort of efficiency isn't necessary for repr. If we stop supporting a differing od.__class__ then I'm fine simply using Py_TYPE() throughout. Likewise, if this is not a case we want to support then we must accept that we may break some code out there, however unlikely that is. In that case perhaps we could be more clear in the documentation that OrderedDict().__class__ should not be changed, though such an addition to the OrderedDict docs might just be clutter. A general FAQ or other broader doc entry about not assigning to obj.__class__ for stdlib types might be more appropriate. But that is where clarification from python-dev would help. [1] There is also a difference between type(obj) and obj.__class__ in the case of proxies (e.g. see bpo-16251), but that is less of an issue here. |
Regarding #5, you're right about OrderedDict().__dict__ being empty for the C implementation. (Nice observation!) So I'm okay with ripping all that code out of odict_reduce(). Since we're still accessing od.__dict__ through _PyObject_GetAttrId() that should not impact subclassing. |
Regarding #7, I see what you did now. That looks fine to me. |
There is no a difference. io, pickle, ElementTree, bz2, virtually all Backward compatibility related to __class__ assignment was already broken in C >>> from collections import *
>>> class foo(OrderedDict):
... def bark(self): return "spam"
...
>>> class bar(OrderedDict):
... pass
...
>>> od = bar()
>>> od.__class__ = foo
>>> od.bark()
'spam' In 3.5 it doesn't.
No, this assignment is forbidden (due to bpo-24912). You can't set __class_ for
Could you please raise a discussion on Python-Dev? You will formulate the |
Updated patch addresses Eric's comments. Changes related to #3 are reverted. We will return to this after discussing on Python-Dev. |
new patch LGTM |
Depending on what feedback we get from python-dev, that may need to be
Ah, I see. So the solution to that issue has *forced* a compatibility break.
I will. |
New changeset b6e33798f82a by Serhiy Storchaka in branch '3.5': New changeset 741ef17e9b86 by Serhiy Storchaka in branch 'default': |
Here is a patch that makes both implementations to use type(self) instead of self.__class__ in __repr__(), __reduce__() and copy(). There is a difference between current implementations. Python implementation uses self.__class__ in copy(), C implementation uses type(self). |
Seems there is a leak in _odict_add_new_node() when PyObject_Hash(key) fails. Here is a fix. |
both patches* LGTM
|
And thanks again, Serhiy, for taking the time on this. :) |
New changeset 93f948120773 by Serhiy Storchaka in branch '3.5': New changeset c3cec0f77eff by Serhiy Storchaka in branch 'default': |
Thank you for your reviews and discussions (and for your appreciated C acceleration of OrderedDict of course) Eric. I just want to make the code a little cleaner and reliable. As for odict_type.patch, I would prefer to commit only C part of the patch and left Python implementation unchanged. There few not very strong arguments for __class__ against type() in Python code.
|
Since the python-dev discussion about __class__, leaving the Python implementation alone is fine with me. |
New changeset a42c0c1c5133 by Serhiy Storchaka in branch '3.5': New changeset 10b965d59b49 by Serhiy Storchaka in branch 'default': |
Thanks Eric. A comment in PyODict_SetItem suggests to revert setting the value on the dict if adding new node failed. Following patch implements this suggestion. After committing the patch in bpo-25462 PyDict_DelItem can be replaced with _PyDict_DelItem_KnownHash and it will be always successful. |
Following patch makes the code for the iterating a little simpler by merging common code. |
In very rare circumstances it is possible that after a series of modifications using raw dict methods, ma_keys becomes to point to the same address, but allocated keys array has different size. But the guard in _odict_get_index uses only the address as resize sentinel. This can cause segmentation error if found key index is larger than the size of od_fast_nodes. Following patch adds the value of keys->dk_size as a sentinel. |
Could you please make a review of last three patches Eric? |
I will review those patches soon. |
All 3 patches look fine to me. In "odict_resize_sentinel.patch", it would be nice if you could accomplish that with a single sentinel. However, fixing the bug is more critical. |
New changeset 45ce4c6b4f36 by Serhiy Storchaka in branch '3.5': New changeset c16af48153a4 by Serhiy Storchaka in branch 'default': |
Thank you for your review Eric. I made error in commit messages, that is why they are non shown here. odict_revert_setting_on_error.patch and odict_iternext_simpler.patch were committed in 1594c23d8c2f and ad44d551c13c. od_resize_sentinel2 in odict_resize_sentinel.patch was renamed to od_fast_nodes_size. Now I see that this is not enough. It is possible that ma_keys is located on the same place, has the same size, but has different layout for keys with matched hashes. I'm trying to write more reliable checks. |
Following code prints X([(1, 1), (3, 3)]) on 3.4 and X([(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]) on 3.5+. from collections import OrderedDict
class X(OrderedDict):
def __iter__(self):
for k in OrderedDict.__iter__(self):
if k % 2:
yield k
od = X((i, i) for i in range(5))
print(od.copy()) |
And even simpler example: list(od.keys()) is [1, 3] in 3.4 and [0, 1, 2, 3, 4] in 3.5. |
Proposed patch makes OrderedDict.copy() more consistent between implementations. |
Is this issue still relevant? |
It might be more appropriate to start a new issue for this, but I'll leave that decision to somehow who would know for sure. Anyway, basically the entire dict/PyDictObject api functions do not appear work at all with OrderedDict. Or rather, OrderedDict doesn't seem to be able to recognize the changes the dict api makes to an object. This is present in both 3.6.0 and 3.7.0 by the way. from operator import setitem
from collections import OrderedDict
from pprint import pprint
class thing:
def __init__(self):
ns = OrderedDict(a='od.__init__')
vars(__class__)['__dict__'].__set__(self, ns)
dict.__setitem__(ns, 'b', 'dict.__setitem__')
self.c = 'PyObject_SetAttr'
OrderedDict.__setitem__(ns, 'd', 'od.__setitem__')
ns.update(e='od.update')
object.__setattr__(self, 'f', 'PyObject_GenericSetAttr')
setattr(self, 'f', 'PyObject_SetAttr')
setitem(ns, 'g', 'od.__setitem__')
dict.update(ns, h='dict.update')
dict.setdefault(ns, 'i', 'i')
self = thing()
ns = self.__dict__
real_ns = {**ns}
missing = {k: ns[k] for k in real_ns.keys() - ns.keys()}
pprint(ns)
pprint(missing, width=len(f'{missing}')-1)
print(f'"b" in {ns.keys()} == {"b" in ns.keys()}')
print(f'"b" in {*ns.keys(),} == {"b" in [*ns.keys()]}')
del ns['a']
del ns['b']
print(f"ns.get('c', KeyError('c')) == {ns.get('c', KeyError('c'))}")
print(f"ns.pop('c', KeyError('c')) == {ns.pop('c', KeyError('c'))!r}")
ns.get('i')
ns.pop('i') Maybe it's considered undefined behavior for a subclass to use |
In general, it's true that if OrderedDict is a subclass of dict, then it would have no defense against someone making a direct call to the dict base class. Such a call should be expected to violate the OrderedDicts invariants.
Not really. The CPython source is supposed to only call PyDict_SetItem when the target is known to be an exact dictionary. If you find a case where that isn't true, please file a bug report and we'll fix it.
No need. We've known about this sort of problem for years. See https://bugs.python.org/issue10977 for example. There isn't really much we could do about it without causing other issues that would be worse. FWIW, this doesn't seem to be a problem in practice. Further, OrderedDict is expected to become less relevant now that regular dicts are order preserving. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: