Armin's method cache optimization updated for Python 2.6 #44842
I've forward ported and slightly cleaned up Armin's method cache patch (see bpo-1685986). I've attempted to clarify and tighten up several loose ends in the code, so hopefully I haven't mucked anything up. My performance results are not quite as good as Armin's, though still very encouraging. I see a typical speedup of 10%, with heavily object-oriented application code seeing speedups of 15-20%. Given these rather significant results, the major task is to verify correctness.
I tried this test on parrotbench (b0.py in particular) and I'm not sure I could distinguish an improvement over the noise (best case was about 5%). The variability is pretty high on my box (dual amd64 Opterons, Ubuntu, gcc 4.1.2-ish). At first it seemed a bit slower with the original patch, which uses unsigned ints. I tried changing to unsigned longs. It seemed a little better, though I'm not sure the change was really responsible. I think unsigned longs should be used for 64-bit boxes. Did you use a benchmark/program that is open source? I don't know that I have anything decent to test this with. Raymond probably does, though. Also, what platform did you test on?
I benchmarked using pystone, pybench, and a bunch of my local scientific applications that have tight computational kernels still in pure Python. I tested on a 64-bit Linux box, defining the version_tag as either int or unsigned long with no appreciable difference in performance. I'm trying to get parrotbench to run, but it doesn't seem to have been updated for modern Python versions.
A further minor improvement could be achieved by re-inlining the fast path of _PyType_Lookup() at its call site in object.c.
I tried re-inlining the fast path from _PyType_Lookup() in object.c and found no measurable improvement on the simple benchmarks I tried. I've also stress-tested the patch by disabling the fast-path return, always performing the slow-path lookup, and asserting that the cached result matches the slow-path result. I then ran that modified interpreter on the Python test-suite, various benchmarks, and a range of my own applications. While not a formal proof of correctness, it was encouraging that the cache remained consistent.
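To make that stress-test concrete, here is a minimal sketch of the consistency check described above; the helper names (cache_peek, find_name_in_mro) are hypothetical stand-ins for the patch's actual internals:

#include <assert.h>
#include "Python.h"

/* Hypothetical stand-ins for the patch's cache internals. */
PyObject *cache_peek(PyTypeObject *type, PyObject *name);       /* NULL on miss */
PyObject *find_name_in_mro(PyTypeObject *type, PyObject *name); /* slow path */

/* Stress-test variant of the lookup: never trust the cache, but assert
   that it agrees with the slow path whenever it claims to have an answer. */
PyObject *
lookup_checked(PyTypeObject *type, PyObject *name)
{
    PyObject *cached = cache_peek(type, name);      /* the would-be fast path */
    PyObject *slow = find_name_in_mro(type, name);  /* the full MRO walk */

    if (cached != NULL)
        assert(cached == slow);  /* any disagreement means a stale entry */
    return slow;
}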
Why is CPython copying functions to inline them? This is not only code duplication that makes the code easy to break; it also bloats the caller function beyond readability. I believe that using a private .c #include file with an inline keyword, or at worst #include-ing the function code directly, is a better solution. Is there any reason to use the method currently used by CPython?
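For what it's worth, the suggestion would look roughly like this; the file and function names here are hypothetical:

/* --- fastpath.h: a private header, #included by each interested .c file.
   Defining the helper as static inline gives every includer its own
   inlined copy without any textual duplication. */
#include "Python.h"

static inline PyObject *
lookup_fastpath(PyTypeObject *type, PyObject *name)
{
    /* the shared fast-path body would live here, written exactly once */
    return NULL;
}

/* --- object.c (or any other caller) --- */
/* #include "fastpath.h" and call lookup_fastpath(); the compiler inlines
   it at each call site, with no copy-pasted bodies to drift apart. */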
First off, all versions of this patch do away with the rather aggressively repeated inline code. My previous comment about refactoring and testing an inlined form was purely an experiment with null results. That aside, you do raise a good question. However, given the current patch, it is unfortunately off-topic and irrelevant to the consideration of this patch. Please feel free to pursue it elsewhere, since I worry that it will only serve as a negative distraction from the much more interesting aims of the mro optimization. Since you are clearly worried about the performance of attribute lookup, please try it out and report your findings. I'll be happy to review the results from your benchmarks and any suggestions you have.
FWIW, I'm still reviewing this code.
I may be missing something trivial here, but why is the cache global, and not per-type-object? As I saw it, the mro cache was supposed to be a mapping (type, name) -> (descriptor/classattr) that's based on the __mro__ of the type and all of the dicts in the type and its bases. I think the right way to do it is to use another Pythonic dict object in each type that holds this cache (name -> (descriptor/classattr)). Am I missing something here?
The cache is global to make it fast, have a deterministic and low memory footprint, and be easy to invalidate. It is distinct from the standard Python dict implementation since it need not be as general nor fully associative: cache collisions are not subject to re-probing. Neither does it even go through the motions of memory management, which would be rather hard to get the standard dict implementation to emulate. It is kept drastically minimal in the interests of performance and goes to great pains to hide and minimize its impact on the rest of the interpreter. In this regard it seems quite successful, as it even removes the need for the manually inlined code you commented on before. It also survives harsh stress-testing and presents no detectable performance regressions in any of my benchmarks (to the contrary, many operations are significantly faster). While a per-type-object cache sounds feasible, it is hard for me to see how it could be as efficient in the general case -- for one, the dict lookup fast path is just so much heavier. Please feel free to take that as a challenge to produce a patch that contradicts my intuition.
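To make the design concrete, here is a minimal sketch of the kind of global cache described above, assuming lookups are keyed by the type's version_tag and the name's hash; sizes and identifiers are illustrative, and the reference counting and validity checks a real patch needs are omitted:

#include "Python.h"

#define MCACHE_SIZE 4096    /* power of two, so we can mask instead of mod */
#define MCACHE_HASH(version, name_hash) \
    ((((unsigned int)(version)) ^ ((unsigned int)(name_hash))) \
     & (MCACHE_SIZE - 1))

/* One fixed global table: deterministic memory footprint, and trivially
   cheap to invalidate -- bumping a type's version_tag orphans its slots. */
static struct {
    unsigned int version;
    PyObject *name;     /* compared by identity; assumes interned strings */
    PyObject *value;
} mro_cache[MCACHE_SIZE];

PyObject *
cached_lookup(PyTypeObject *type, PyObject *name,
              PyObject *(*slow_lookup)(PyTypeObject *, PyObject *))
{
    unsigned int h = MCACHE_HASH(type->tp_version_tag, PyObject_Hash(name));
    PyObject *value;

    /* Fast path: two compares, no dict probe, no collision chain. */
    if (mro_cache[h].version == type->tp_version_tag
        && mro_cache[h].name == name)
        return mro_cache[h].value;

    /* Miss or collision: do the full MRO walk, then overwrite the slot
       unconditionally -- collisions are never re-probed. */
    value = slow_lookup(type, name);
    mro_cache[h].version = type->tp_version_tag;
    mro_cache[h].name = name;
    mro_cache[h].value = value;
    return value;
}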
peaker: we tried several approaches (including ones similar to what you describe) in PyPy, and the one in this patch is the one that worked best there. There is of course no guarantee that the same experiments would have given exactly the same results in CPython, so feel free to try them all again - I won't :-)
I've attached the microbenchmarks I was using for my own version of this patch.
Attribute access that happens only once or very infrequently probably won't benefit much from the cache.
Okay, this patch is accepted. We'll likely do additional tweaks after it goes in. Please close the other tracker items for this same issue. If no one applies this, I'll do it when I return from the holidays.
Committed to trunk as r59931. Leaving open if you want to make more changes.
Backed out again in r59940 -- test_ctypes fails in test_incomplete.py. Armin reports that with his original patch on 2.4, this test passes.
All tests passed when I first ported Armin's patch to 2.6. I'll have a look.
Couldn't resist looking into this more and I must admit that I am a bit
More so, if I selectively don't cache the name "next" (with the caching
I'm beginning to suspect there is some black magic going on. None of
Wish I had more time to debug this, but I'm already overdrawn until
After some debug work, I found an explanation:
I have come up with a patch which corrects the problem, and all ctypes tests pass:

Index: Modules/_ctypes/stgdict.c
===================================================================
--- Modules/_ctypes/stgdict.c (revision 59939)
+++ Modules/_ctypes/stgdict.c (working copy)
@@ -470,7 +470,7 @@
         Py_DECREF(pair);
         return -1;
     }
-    if (-1 == PyDict_SetItem(realdict, name, prop)) {
+    if (-1 == PyObject_SetAttr(type, name, prop)) {
         Py_DECREF(prop);
         Py_DECREF(pair);
         return -1;
Index: Modules/_ctypes/_ctypes.c
===================================================================
--- Modules/_ctypes/_ctypes.c (revision 59939)
+++ Modules/_ctypes/_ctypes.c (working copy)
@@ -410,7 +410,7 @@
 StructType_setattro(PyObject *self, PyObject *key, PyObject *value)
 {
     /* XXX Should we disallow deleting _fields_? */
-    if (-1 == PyObject_GenericSetAttr(self, key, value))
+    if (-1 == Py_TYPE(self)->tp_base->tp_setattro(self, key, value))
         return -1;
I think that these changes are sensible: the first one deals with the
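Reading the first hunk in light of the cache design discussed above, a sketch of why it matters, assuming invalidation is driven by the type's attribute-setting slot via the version_tag (the helper here is hypothetical):

#include "Python.h"

/* Hypothetical contrast of the two calls in the first hunk. */
static int
install_field(PyTypeObject *type, PyObject *name, PyObject *prop)
{
    /* Writing into tp_dict directly mutates the type silently: the
       version_tag is never bumped, so a stale cached lookup for `name`
       can keep returning the old value:
       return PyDict_SetItem(type->tp_dict, name, prop);  */

    /* Going through the normal attribute protocol reaches the type's
       tp_setattro, which updates tp_dict and also invalidates cached
       lookups for the type. */
    return PyObject_SetAttr((PyObject *)type, name, prop);
}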
Nice analysis. Please apply.
OK, I'll apply my two lines first, then the original patch.
For the record:
Looks like this was re-applied in r59943 and r59944. Thanks for taking care of this.