property_descr_get reuse argument tuple #68098
I am working on implementing namedtuple in C, and I am almost there. However, on the way to full compatibility I ran into a refcount issue somewhere. Hopefully someone can help me work this out. To describe the issue: when I run the collections tests, I most frequently segfault in a non-namedtuple test. Sometimes the suite runs (and passes), but that probably means I am using an unowned object somewhere. I am new to the CPython API, so I would love to go through this with someone to learn more about how it all works. I am currently at PyCon and would absolutely love to speak to someone in person. I will be at the CPython sprints for at least one day. |
What's the motivating use case for this? |
Ideally, namedtuple is used to make your code cleaner, since "magic" indices are less clear than named indices in a lot of cases. Because namedtuple is mainly there to make code more readable, I don't think it should have an impact on the runtime performance of the code. This means that namedtuple should be a very thin wrapper around tuple. Currently, named item access on a namedtuple is much slower than element-wise access. I have this as a standalone package (the diff I posted has some changes to achieve full backwards compatibility) here https://pypi.python.org/pypi/cnamedtuple/0.1.5, which shows some profiling runs of this code. The notable one to me is named item access being around twice as fast. Another issue we ran into at work is that we found ways to get the exec call in namedtuple to execute arbitrary code. Given how we got this to happen, it would not be a concern for most people, but we wanted to look at other ways of writing namedtuple. I looked through some older discussion and found that non-exec solutions were rejected in the past for performance reasons. |
Have you thought of just exposing Object/structseq.c? Before you put much time into this, I'd get Raymond's acceptance of whatever approach you want to take. It might be best to raise it on python-ideas. |
Would the idea be to deprecate namedtuple in favor of a public structseq exposed through collections, or to change structseq to fit the namedtuple API? |
I haven't thought it through; it just seems very similar to a C namedtuple. |
I stripped down the patch to only the descriptor like we had discussed. |
Can you post before-and-after timings of the patch? |
# Original version / new python implementation
# C implementation
The fallback is the same implementation that is currently used, so this should have no effect on pypi.
sorry, I meant pypy |
I am updating the patch to include an entry in Misc/NEWS. |
FWIW, the current property(itemgetter(index)) code has a Python creation step, but the actual attribute lookup and dispatch is done entirely in C (no pure Python steps around the eval loop). Rather than making a user-visible C hack directly to namedtuple, any optimization effort should be directed at improving the speed of property() and itemgetter(). Here are some quick notes to get you started:
* The overhead of property() is almost nothing.
* The code for property_descr_get() is in Objects/descrobject.c
* It has two nearly zero cost predictable branches
* So the only real work is a call to
    PyObject_CallFunctionObjArgs(gs->prop_get, obj, NULL);
* which then calls both
    objargs_mktuple(vargs)
  and
    PyObject_Call(callable, args, NULL);
* That can be sped up by using
    PyTuple_New(1)
  and a direct call to PyObject_Call()
* The logic in PyObject_Call can potentially be tightened
  in the context of a property(itemgetter(index)) call.
  Look to see whether the recursion check is necessary
  (itemgetter on a tuple subclass that doesn't extend __getitem__
  is non-recursive)
* If so, then the entire call to PyObject_Call() in property
  can potentially be simplified to:
    result = (*call)(func, arg, kw);

I haven't looked too closely at this, but I think you get the general idea. If the speed of property(itemgetter(index)) is the bottleneck in your code, the starting point is to unwind the exact series of C steps performed to see if any of them can be simplified. For the most part, property() and itemgetter() were implemented using the simplest parts of the C API rather than the fastest. The chain of calls isn't specialized for the common use case (i.e. property_get() needing exactly 1 argument rather than a variable-length arg tuple, and itemgetter doing a known integer offset on a list or tuple rather than the less common case of generic types and a possible tuple of indices).

We should start by optimizing what we've already got. That will have a benefit beyond named tuples (faster itemgetters for sorting and faster property gets for the entire language). It also helps us avoid making the NT code less familiar (using custom private APIs rather than generic, well-known components). It also reduces the risk of breaking code that relies on the published implementation of named tuple attribute lookups (for example, I've seen deployed code that customizes the attribute docstrings like this): |
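To make the variable-length versus fixed-arity point concrete, here is a toy model in plain C. It does not use Python.h: `toy_tuple`, `call_objargs()`, and `call_one_arg()` are hypothetical stand-ins, and the generic path assumes that objargs_mktuple() counts the NULL-terminated arguments before filling the tuple. That path walks a copy of the va_list once to count and again to fill; the specialized path knows it needs exactly one slot.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a tuple of object pointers (NOT CPython's PyTupleObject). */
typedef struct {
    size_t size;
    void *items[];               /* C99 flexible array member */
} toy_tuple;

typedef void *(*toy_fn)(toy_tuple *args);

static toy_tuple *toy_tuple_new(size_t n)
{
    toy_tuple *t = malloc(sizeof(toy_tuple) + n * sizeof(void *));
    if (t != NULL)
        t->size = n;
    return t;
}

/* Generic path (assumed shape of objargs_mktuple()): one pass over a copy of
   the NULL-terminated va_list to count arguments, a second pass to fill. */
static toy_tuple *mktuple_from_va(va_list va)
{
    va_list count_va;
    size_t n = 0, i;
    toy_tuple *t;

    va_copy(count_va, va);
    while (va_arg(count_va, void *) != NULL)
        ++n;
    va_end(count_va);

    t = toy_tuple_new(n);
    if (t == NULL)
        return NULL;
    for (i = 0; i < n; ++i)
        t->items[i] = va_arg(va, void *);
    return t;
}

/* Roughly PyObject_CallFunctionObjArgs(fn, obj, NULL): build a tuple from
   the variadic arguments, call, then release the tuple. */
static void *call_objargs(toy_fn fn, ...)
{
    va_list va;
    toy_tuple *args;
    void *result;

    va_start(va, fn);
    args = mktuple_from_va(va);
    va_end(va);
    if (args == NULL)
        return NULL;
    result = fn(args);
    free(args);
    return result;
}

/* Specialized path: a property getter always receives exactly one
   argument, so allocate a 1-tuple directly and skip the counting pass. */
static void *call_one_arg(toy_fn fn, void *obj)
{
    toy_tuple *args = toy_tuple_new(1);
    void *result;

    if (args == NULL)
        return NULL;
    args->items[0] = obj;
    result = fn(args);
    free(args);
    return result;
}

/* A trivial "getter" that returns its first argument. */
static void *first_arg(toy_tuple *args)
{
    assert(args->size >= 1);
    return args->items[0];
}
```

Both toy paths still allocate and free a tuple on every call; removing that allocation is what the cached-tuple idea discussed later in the thread is about.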
One other thought: the itemgetter.__call__ method is already pretty thin:
But you could add a special case for a single integer index applied to a known sequence. Extract the Py_ssize_t index in itemgetter_new and store it in the itemgetterobject. Then add a fast path in itemgetter.__call__. Roughly something like this:

    /* ig->index is -1 unless itemgetter was built from a single integer */
    if (ig->index != -1 &&
        PyTuple_Check(obj) &&
        nitems == 1 &&
        PyTuple_GET_SIZE(obj) > ig->index) {
        /* Direct indexed read; return a new reference like PyObject_GetItem */
        result = PySequence_Fast_GET_ITEM(obj, ig->index);
        Py_INCREF(result);
        return result;
    }
Perhaps also add a check to make sure the tuple subclass hasn't overridden the __getitem__ method (see an example of how to do this in the code for Modules/_collectionsmodule.c::_count_elements()). Something along these lines would allow all the steps to be inlined and would eliminate all the usual indirections inherent in the abstract API. Another alternative is to use the PySequence API instead of the PyTuple API. That trades away a little of the speed-up in return for speeding-up itemgetter on lists as well as tuples. |
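A toy model of the guarded fast path, again in plain C without Python.h. `toy_seq`, `toy_itemgetter`, and the `overrides_getitem` flag are hypothetical: the flag stands in for the "subclass hasn't overridden __getitem__" check, and `generic_getitem()` stands in for the abstract PyObject_GetItem() route. The sketch only shows that every guard must hold before the direct indexed read is taken; otherwise the call falls back to the generic path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a sequence object (NOT a CPython struct). */
typedef struct {
    int is_tuple;            /* plays the role of PyTuple_Check(obj) */
    int overrides_getitem;   /* a subclass that defines its own __getitem__ */
    size_t size;
    const long *items;
} toy_seq;

/* Hypothetical itemgetter: index is precomputed at construction time,
   or -1 when the getter was not built from a single integer. */
typedef struct {
    long index;
    size_t nitems;
} toy_itemgetter;

static int generic_calls = 0;   /* counts trips through the slow path */

/* Stand-in for the generic PyObject_GetItem() route. A subclass with its
   own __getitem__ is simulated by adding 100 to the stored value. */
static long generic_getitem(const toy_seq *seq, long i)
{
    ++generic_calls;
    /* Toy limitation: only a valid single index is modeled (no IndexError). */
    assert(i >= 0 && (size_t)i < seq->size);
    return seq->overrides_getitem ? seq->items[i] + 100 : seq->items[i];
}

static long toy_itemgetter_call(const toy_itemgetter *ig, const toy_seq *obj)
{
    /* Fast path: single precomputed index, plain tuple, index in range. */
    if (ig->index != -1 &&
        ig->nitems == 1 &&
        obj->is_tuple &&
        !obj->overrides_getitem &&
        (size_t)ig->index < obj->size) {
        return obj->items[ig->index];
    }
    /* Slow path: defer to the generic lookup. */
    return generic_getitem(obj, ig->index);
}
```

The extra guards are exactly the per-call cost the next comment worries about, so whether the fast path wins has to be measured.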
I was unable to see a performance increase by playing with the itemgetter.__call__ code; however, updating the property code seemed to show a small improvement. I think that for simple indexing, the cost of checking whether the object is a sequence outweighs the faster dispatch (when using PySequence_GetItem). I can play with this further.
|
If you have a chance, run a C profiler so we can see whether most of the time is being spent in an attribute lookup for the current property(itemgetter) approach versus your nt-indexer approach. Without a profile, I have only my instincts that the overhead is a mix of indirections and function call overhead (both solvable by in-lining), over-generalization for all PyObject_GetItem() (solvable by specialization to a tuple subclass), and variable-length argument lists (solvable by using PyTuple_New(1)). Ideally, I would like something that speeds up broader swaths of the language and doesn't obfuscate the otherwise clean generated code. ISTM that the C code for both property() and itemgetter() still has room to optimize the common case. |
This was very exciting; I had never run gprof before, so just to make sure I did this correctly, I will list my steps:
Here is default:
Each sample counts as 0.01 seconds.
Here is my patch:
Each sample counts as 0.01 seconds.
It looks like you were correct that PyObject_CallFunctionObjArgs was eating up a lot of time. |
Hmm, the presence of _PyTuple_DebugMallocStats, repeat_traverse, and visit_decref suggests that the profile may have been run with debugging code enabled and GC enabled. The property patch looks good. Depending on how far you want to go with this, you could save your length-1 tuple statically and reuse it on successive calls if its refcnt doesn't grow (see the code for zip() for an example of how to do this). That would save the PyTuple_New and tupledealloc calls. Going further, you could potentially in-line some of the code in PyObject_Call, caching the call site and its NULL check, and looking at the other steps to see if they are all necessary in the context of property_descr_get(). |
I switched to the static tuple. |
I don't think that I can cache the __call__ of the fget object because it might be an instance of a heaptype, and if someone changed the __class__ of the object in between calls to another heaptype that had a different __call__, you would still get the __call__ from the first type. I also don't know if this is supported behavior or just something that works by accident. I read through PyObject_Call, and all the code is needed assuming we are not caching the tp_call value. |
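The concern about caching tp_call can be illustrated with a small C model (plain C; `toy_type`, `toy_object`, and `call_site` are hypothetical names). Python does allow assigning `__class__` between layout-compatible heap types; the toy models that as reassigning the object's type pointer. A call site that cached the slot on its first call keeps dispatching to the old type's function.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_object toy_object;
typedef int (*toy_call_fn)(toy_object *self);

/* Toy "type": only a call slot, standing in for tp_call. */
typedef struct {
    toy_call_fn tp_call;
} toy_type;

/* Toy object whose type pointer can be reassigned, like obj.__class__ = ... */
struct toy_object {
    const toy_type *ob_type;
};

static int call_a(toy_object *self) { (void)self; return 1; }
static int call_b(toy_object *self) { (void)self; return 2; }

static const toy_type type_a = { call_a };
static const toy_type type_b = { call_b };

/* Always looks up the slot through the object's current type. */
static int call_uncached(toy_object *obj)
{
    return obj->ob_type->tp_call(obj);
}

/* Hypothetical call site that caches the slot on first use. */
typedef struct {
    toy_call_fn cached;
} call_site;

static int call_cached(call_site *site, toy_object *obj)
{
    if (site->cached == NULL)
        site->cached = obj->ob_type->tp_call;
    return site->cached(obj);   /* stale if obj->ob_type changed later */
}
```

A real cache would have to be keyed on the type (and invalidated when it changes), which adds back a comparison on every call.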
What kind of speed improvement have you gotten? |
I am currently on a different machine so these numbers are not relative to the others posted earlier.
|
That's pretty good for a small patch :-) For the pre-computed 1-tuple, I think you need to check for a refcnt of 1 and fall back to PyTuple_New if the tuple is in use (i.e. a property that calls another property). See Objects/enumobject.c::105 for an example. Also, consider providing a way to clean up that tuple on shutdown; for example, look at what is done with the various freelists. An easier way is to make the premade tuple part of the property object struct so that it gets freed when the property is deallocated. Adding Serhiy to the nosy list; he can help with cleaning up the patch so that it is ready to apply. |
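A minimal sketch of the refcount-checked reuse described above, as a toy in plain C rather than real CPython code: `toy_tuple1`, `descr_get()`, the getters, and `clear_cached_args()` are hypothetical; refcounts are plain longs, item references are not counted, and GC tracking is ignored entirely. The cached tuple is taken out of its slot while in use, so a nested call allocates its own; afterwards it is put back only if nobody else holds a reference and the slot is still empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy refcounted 1-tuple (NOT CPython's tuple). */
typedef struct {
    long refcnt;
    void *item;
} toy_tuple1;

static toy_tuple1 *cached_args = NULL;   /* the "static" premade tuple */
static int allocations = 0;              /* counts fresh tuple allocations */

static toy_tuple1 *tuple1_new(void)
{
    toy_tuple1 *t = malloc(sizeof *t);
    if (t != NULL) {
        t->refcnt = 1;
        t->item = NULL;
        ++allocations;
    }
    return t;
}

static void tuple1_decref(toy_tuple1 *t)
{
    if (--t->refcnt == 0)
        free(t);
}

typedef void *(*getter_fn)(toy_tuple1 *args);

static void *descr_get(getter_fn fget, void *obj)
{
    toy_tuple1 *args = cached_args;
    void *result;

    cached_args = NULL;            /* check the tuple out while it is in use */
    if (args == NULL) {
        args = tuple1_new();       /* fallback: cache empty or in use */
        if (args == NULL)
            return NULL;
    }
    args->item = obj;
    result = fget(args);
    if (args->refcnt == 1 && cached_args == NULL) {
        args->item = NULL;         /* don't keep obj alive through the cache */
        cached_args = args;        /* nobody else holds it: reuse next time */
    } else {
        tuple1_decref(args);       /* someone kept a reference: drop ours */
    }
    return result;
}

/* Shutdown hook, along the lines of the freelist clean-up suggested above. */
static void clear_cached_args(void)
{
    if (cached_args != NULL) {
        tuple1_decref(cached_args);
        cached_args = NULL;
    }
}

static toy_tuple1 *retained = NULL;

static void *plain_getter(toy_tuple1 *args) { return args->item; }

/* Hypothetical getter that stashes its argument tuple. */
static void *retaining_getter(toy_tuple1 *args)
{
    ++args->refcnt;
    retained = args;
    return args->item;
}
```

Since the committed optimization was later removed because of crashes (bpo-30156), treat this only as an illustration of the refcount check, not as evidence that the pattern is safe in CPython.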
I don't think that we need to worry about reusing the single argument tuple in a recursive situation because we never need the value after we start the call. We also just write our new value and then clean up with a NULL to make sure that we don't blow up when we dealloc the tuple. For example:

>>> class C(object):
...     @property
...     def f(self):
...         return D().f
...
>>> class D(object):
...     @property
...     def f(self):
...         return 1
...
>>> C().f
1

This works with recursive properties. I am also updating the title and headers because this issue is no longer about namedtuple. |
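The "write, call, reset to NULL" scheme from this comment can be modeled in a few lines of plain C (hypothetical names; no Python.h). The C().f / D().f case works because the outer getter reads its argument before the nested property get. The model also shows the assumption the scheme relies on: a getter that reads its argument tuple again after a nested get sees the slot already reset to NULL by the inner call.

```c
#include <assert.h>
#include <stddef.h>

/* Toy 1-tuple shared by every call (NOT CPython's tuple). */
typedef struct {
    void *item;
} toy_tuple1;

static toy_tuple1 shared_args = { NULL };

typedef void *(*getter_fn)(toy_tuple1 *args);

/* Always reuses the one shared tuple: write, call, reset. */
static void *descr_get_shared(getter_fn fget, void *obj)
{
    void *result;

    shared_args.item = obj;        /* write the new argument */
    result = fget(&shared_args);
    shared_args.item = NULL;       /* clean up so no stale pointer lingers */
    return result;
}

static int one = 1;

/* Like D.f: returns a constant. */
static void *d_f(toy_tuple1 *args)
{
    (void)args;
    return &one;
}

/* Like C.f: extracts its argument first, then does a nested property get. */
static void *c_f(toy_tuple1 *args)
{
    void *self = args->item;
    return descr_get_shared(d_f, self);   /* stand-in for D().f */
}

/* Hypothetical getter that reads its argument tuple after a nested get. */
static void *late_reader(toy_tuple1 *args)
{
    descr_get_shared(d_f, &one);
    return args->item;             /* already reset by the inner call */
}
```

Whether a real getter can observe the shared tuple after a nested call depends on the callable, which is why the refcnt check was suggested in the previous comment.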
New changeset 661cdbd617b8 by Raymond Hettinger in branch 'default': |
This optimization caused multiple crashes, so it has been decided to remove it :-( See bpo-30156. |