-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add OrderedDict written in C #61195
Comments
Here's an initial stab at writing OrderedDict in C. Though, the implementation is not heavily optimized and isn't super subclass-friendly, my hope is that it's relatively close to the final version. I'm getting this up now to get some eyes on it. The spot in the builtins is mostly for convenience, but I expect it will need to be exposed somewhere (perhaps _collections?). My experience with the C-API is relatively limited and my C-fu is not at a professional level. However, I'm pretty sure that I have most everything correct. The ultimate goal for this type is to use it for **kwargs. Note: this first patch has some reference leaks that I'm tracking down. |
Nit: This really not be exposed through _collections rather than hacked into builtins. |
The tests should probably be updated to use the PEP-399 idiom in order to test both the implementations. |
Here's a cleanup of test.test_collections that helps keep the subsequent patch (still forthcoming) cleaner. |
What's the reason for moving the OrderedDict tests in a separate file? |
Following the precedent of collections.deque and collections.defaultdict:
|
These are indeed good reasons. |
Here's an updated patch. I still have some ref-counting issues, but the patch is much closer to what I expect will be the final version. At this point it passes all the main unit tests (segfaults in some of the supplemental Mapping tests). One key thing is that for now the various iterators are faked using lists. Once everything is sorted out I'll plug my implementation of the iterators back in (which I'd already mostly finished). I see room for efficiency improvements in a couple spots. As well, I plan on improving the subclass-ability of the type once it's otherwise happy. Any feedback at the point would be helpful. Regardless, I'm being careful to get this right and I'm no expert with the C-API, so it's taking a while. :) Some other considerations:
|
Looks like I didn't get the patch lined up to tip so the review link isn't showing up. I'll have to fix that tomorrow. |
Are there any objections to this unittest cleanup patch? I'd like to commit it separately prior to anything that will go in for the C OrderedDict. |
Incidently, I'm just about done with the OrdereDict patch. I'm attaching the current one. At present it passes the unit tests, but I have two memory-related issues and a number of open questions (that I don't think are critical). The patch has the unittest cleanup merged in just so it will show up in reviewboard. |
Why did you copy test_collections.py instead of just creating a new file? |
To preserve the history of changes to the bulk of the code in test_ordereddict.py. |
I should clarify, odict.diff passes test_ordereddict. Regardless, it's pretty close. I'm going to figure out why the review link hates me on this issue, but until then any extra eyes here would be appreciated. The memory-related issues are pushing well past my experience. My goal is to get this in before PyCon and to have a reasonable plan at least for implementing **kwargs as OrderedDict. My intuition is that writing OrderedDict in C is the hard part. |
My current patch ends up with O(n) deletion, which won't fly, so I'm refactoring. While I'm at it I'm also planning on using the BSD queue.h doubly-linked list rather than the one that I rolled. I'm also going to pull the ordered dict implementation into it's own source file. However, these things should not have much of an impact on most of the code I've already written. I anticipate that the changes won't translate into a further large volume of work. In talking to Raymond, he emphasized the importance of making sure we avoid reentrancy problems. I'll be double-checking that and likely making use of the GIL in a couple spots. While the bulk of the implementation is complete, the remaining work to do here is what I've described above, along with more testing. An orthogonal problem is addressing the problem of the concrete dict API. I'll bring that up separately when this issue is basically done. |
I just looked at the test-collections patch and don't want it committed. It is full of trivial edits that make it hard to review and doesn't make the test suite better. It is okay to add some new tests, but I don't want to rearrange the existing code with minor edits. That will just make it more difficult to maintain the code across versions (it is no fun to backport a fix and then have unnecessary merge conflicts). |
It's worth noting that bpo-10977 is pretty closely related to this isssue. |
Thanks for the feedback, Raymond. I agree that there are a number of trivial edits there that shouldn't be there. I'll fix those. |
Raymond, do you have any objections to splitting the OrderedDict-related tests into their own module. Doing so allows for testing all the those test classes at once, which is handy when working on OrderedDict. This is more relevant now with the extra PEP-399-motivated tests. |
I finally had some time so here's an updated patch (including O(1) deletions). I've also added a bunch of notes at the top of odictobject.c. Some notable things left to do:
My goal here is to get an effective OrderedDict that we are happy with, while recognizing that there is room for optimization. However, I won't be okay with faultiness, so the implementation must be complete. This has been my mindset throughout. |
Without too many optimzations, the C implementation of OrderedDict is basically between 1x and 4x the performance of raw dict. This contrasts with the pure Python OrderedDict, which falls in roughly between 4x and 100x. I've attached an addition to pybench, not intended to be committed, that compares the 3. Run any of these commands to get timing: ./python Tools/pybench/pybench.py -t ODict I tuned the # rounds for each test such that the raw dict results all come out to just about the same value (2ms on my computer). That way it's easy to compare the pure Python or C results to the dict results. |
Here's an updated patch that has fixes for recursive pickles and for a couple memory-related segfaults that I missed before. That leaves the following to be done:
|
Here's one solution to the deletion-during-iteration problem. It makes the iterators track the key rather than the node, at the expense of a sliver of speed on __iter__() and keys(). |
The concern about reference cycles here lies in their existence in the linked-list. To see what I mean, check out the use of weakrefs in the pure Python implementation at Lib/collections/init.py. As the current implementation does not use PyObjects for the linked-list, I'm going to call this a non-issue. |
Here's what I feel is a nearly complete patch. The only outstanding issues that I feel need to be answered are 4 spots where calls into the interpreter may result in unexpected changes to the object or make the current function state out-of-date.
Once I feel comfortable with some resolution for those, I'm going to consider the patch ready to go, pending reviews. |
I figured out what I hope were the last memory-related issues. Apparently tp_traverse is not inherited if tp_flags is set. I had it set on all the view types and all the iterator types. So during GC it would blow up when it tried to call tp_traverse. Everything is looking pretty good so I'm attaching the latest diff. |
I agree with all Stephan's comments on Rietveld. See also bpo-24115 that fixed bugs similar to found in C implementation of OrderedDict. |
New changeset 0a7380984e37 by Eric Snow in branch '3.5': |
Thanks for pointing this out. I've fixed both dictobject.h and odictobject.h.
This isn't needed for 3.5, right? I'm not opposed to adding more |
I'm getting the following linker errors on Windows 8.1 for 32 and 64 bit debug and release builds. unresolved external symbol _PyODict_Type in C:\cpython\PCbuild\_collectionsmodule.obj, and _PyODictIter_Type, |
I've also merged this into default: changeset: 96384:10eabadba316b6f2034db4c076a60c63d25d8fc6 |
Hmm. I'm not too familiar with how things work for Windows. I'm also |
You already added public name Py_ODict_GetItemId. It uses private _Py_Identifier API and shouldn't be public. |
As for Windows issue, perhaps these names should be enumerated in PC/python3.def? See also bpo-23903. |
Ah. I'll remove it. |
New changeset c9404fba02ba by Eric Snow in branch '3.5': New changeset 951a3ef82180 by Eric Snow in branch '3.5': |
New changeset 7117e9b0f595 by Eric Snow in branch '3.5': |
If it's just a matter of adding the definitions then here's a patch. Does that look correct? |
That last message is about building on Windows. |
New changeset 030205f1e716 by Eric Snow in branch '3.5': |
New changeset fbe74badb0c6 by Eric Snow in branch 'default': |
New changeset 1d851112922f by Eric Snow in branch '3.5': |
I'm checking a fix for Windows against a buildbot (http://buildbot.python.org/all/builders/AMD64%20Windows8%203.x). |
New changeset 3ba1fa925f17 by Eric Snow in branch 'default': |
New changeset aea27164d207 by Eric Snow in branch '3.5': |
Well, that last one has everything compiling again. I expect it should be okay now. I'll watch the results. |
Opening again. I have too many questions. :) |
crash-1.py is due to an unchecked return value from _odictnode_VALUE(). We should probably use PyDict_GetItemWithError(), also in other I normally try to steer clear of stylistic remarks, but the Also, they're very inconvenient in a debugger. |
crash-2.py is due to the fact that _PyDict_Pop() deletes a reference The INCREF(key) in popitem should take place before calling _odict_popkey(). Again, I don't see the point of INCREF/DECREF *inside* of _odict_popkey(). |
@jim and Stefan, Thanks for thorough reviews! @Stefan, I'll take a look at those crashers and other suggestions ASAP. I really appreciate you taking the time. Now that the patch has been landed, would you mind opening new issues for each problem you find? That will help keep individual problems from getting lost. Thanks! |
Coverity has found an issue in odict, too: *** CID 1302699: Null pointer dereferences (NULL_RETURNS)
/Objects/odictobject.c: 1316 in odict_copy()
1310 od_copy = PyObject_CallFunctionObjArgs((PyObject *)Py_TYPE(od), NULL);
1311 if (od_copy == NULL)
1312 return NULL;
1313
1314 if (PyODict_CheckExact(od)) {
1315 _odict_FOREACH(od, node) {
>>> CID 1302699: Null pointer dereferences (NULL_RETURNS)
>>> Dereferencing a pointer that might be null "PyDict_GetItem((PyObject *)(PyObject *)od, node->key)" when calling "PyODict_SetItem".
1316 int res = PyODict_SetItem((PyObject *)od_copy,
1317 _odictnode_KEY(node),
1318 _odictnode_VALUE(node, od));
1319 if (res != 0)
1320 goto fail;
1321 } |
It's fine to open other issues, but I'm not happy with the resize()/get_index() situation. Currently I can't come up even with an informal "proof" that it'll always work (and see bpo-24361). I think these functions really *need* 100% code coverage. |
Addressing the concerns with resize()/get_index() is next on my list. I had meant to open up an issue on it last night but it was getting pretty late for me and it slipped my mind. I've opened bpo-24362 to track that work. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: