
property_descr_get reuse argument tuple #68098

Closed
llllllllll mannequin opened this issue Apr 10, 2015 · 26 comments
Assignees
Labels
interpreter-core Interpreter core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments


llllllllll mannequin commented Apr 10, 2015

BPO 23910
Nosy @warsaw, @rhettinger, @vstinner, @ericvsmith, @ericsnowcurrently, @serhiy-storchaka, @llllllllll
Files
  • namedtuple.patch: namedtuple in C.
  • namedtuple-indexer.patch
  • namedtuple-indexer-update.patch
  • namedtuple-indexer-with-news.patch
  • property.patch
  • with-static-tuple.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2015-04-30.15:20:53.514>
    created_at = <Date 2015-04-10.20:10:37.345>
    labels = ['interpreter-core', 'type-feature']
    title = 'property_descr_get reuse argument tuple'
    updated_at = <Date 2018-10-01.10:11:02.588>
    user = 'https://github.com/llllllllll'

    bugs.python.org fields:

    activity = <Date 2018-10-01.10:11:02.588>
    actor = 'vstinner'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2015-04-30.15:20:53.514>
    closer = 'rhettinger'
    components = ['Interpreter Core']
    creation = <Date 2015-04-10.20:10:37.345>
    creator = 'llllllllll'
    dependencies = []
    files = ['38892', '38896', '38905', '38994', '39210', '39216']
    hgrepos = []
    issue_num = 23910
    keywords = ['patch']
    message_count = 26.0
    messages = ['240447', '240448', '240449', '240450', '240451', '240452', '240465', '240535', '240565', '240568', '240934', '242082', '242085', '242089', '242091', '242093', '242103', '242112', '242114', '242123', '242124', '242234', '242235', '242273', '243981', '326783']
    nosy_count = 8.0
    nosy_names = ['barry', 'rhettinger', 'vstinner', 'eric.smith', 'python-dev', 'eric.snow', 'serhiy.storchaka', 'llllllllll']
    pr_nums = []
    priority = 'low'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue23910'
    versions = ['Python 3.5']


    llllllllll mannequin commented Apr 10, 2015

    I am working on implementing namedtuple in C; I am almost there. However, on the path to full compatibility, I ran into a refcount issue somewhere. Hopefully someone can help me work this out.

    To describe the issue: when I run the collections tests, I most frequently segfault in a non-namedtuple test. Sometimes the suite runs (and passes), but that probably means I am using an unowned object somewhere. I am new to the CPython API, so I would love to go through this with someone to learn more about how it all works.

    I am currently at PyCon and would absolutely love to speak to someone in person. I will be at the CPython sprints for at least one day.

    @llllllllll llllllllll mannequin added extension-modules C modules in the Modules dir type-feature A feature request or enhancement labels Apr 10, 2015

    ericvsmith commented Apr 10, 2015

    What's the motivating use case for this?


    llllllllll mannequin commented Apr 10, 2015

    Ideally, namedtuple is used to make your code cleaner; using "magic" indices is less clear than using a named index in many cases. Because namedtuple exists mainly to make code more readable, I don't think it should have an impact on runtime performance, which means namedtuple should be a very thin wrapper around tuple. Currently, namedtuple named-item access is much slower than element-wise access. I have this as a standalone package (with some changes in the posted diff to achieve full backwards compatibility) at https://pypi.python.org/pypi/cnamedtuple/0.1.5, which shows some profiling runs of this code. The notable result to me is named-item access being around twice as fast.

    Another issue we ran into at work is that we found ways to get the exec call in namedtuple to execute arbitrary code; while this would not be a concern for most people given how we got it to happen, we wanted to look at other ways of writing namedtuple. I looked through some older discussion and found that non-exec solutions were rejected in the past for performance reasons.
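The performance gap described above can be seen with a quick measurement; the Point type and loop count below are my own illustration, not from the issue:

```python
import timeit
from collections import namedtuple

# Point is a hypothetical example type; absolute timings vary by machine.
Point = namedtuple("Point", ["x", "y"])
p = Point(1, 2)

# Named access dispatches through property(itemgetter(index)), while
# indexed access is a direct tuple subscript, so p[0] is typically faster.
named = timeit.timeit("p.x", globals={"p": p}, number=100_000)
indexed = timeit.timeit("p[0]", globals={"p": p}, number=100_000)
print(f"p.x:  {named:.4f}s")
print(f"p[0]: {indexed:.4f}s")
```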


    ericvsmith commented Apr 10, 2015

    Have you thought of just exposing Object/structseq.c?

    Before you put much time into this, I'd get Raymond's acceptance of whatever approach you want to take. It might be best to raise it on python-ideas.


    llllllllll mannequin commented Apr 10, 2015

    Would the idea be to deprecate namedtuple in favor of a public structseq exposed through collections, or to change structseq to fit the namedtuple API?
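For context, struct sequences (Objects/structseq.c) already power tuple-like C types in the stdlib, and they behave much like namedtuples; a small demonstration with time.struct_time:

```python
import time

# time.struct_time is a C struct sequence: a real tuple subclass whose
# fields are also reachable by name, much like a namedtuple instance.
t = time.localtime()
print(isinstance(t, tuple))  # True
print(t.tm_year == t[0])     # True: named and positional access agree
```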


    ericvsmith commented Apr 10, 2015

    I haven't thought it through; it just seems very similar to a C namedtuple.

    @rhettinger rhettinger self-assigned this Apr 11, 2015

    llllllllll mannequin commented Apr 11, 2015

    I stripped down the patch to only the descriptor like we had discussed.


    rhettinger commented Apr 12, 2015

    Can you post before-and-after timings of the patch?


    llllllllll mannequin commented Apr 12, 2015

    # Original version / new python implementation
    ./python -m timeit -s "from collections import namedtuple;a = namedtuple('a', 'a b c')(1, 2, 3)" "a.a"
    10000000 loops, best of 3: 0.07 usec per loop

    # C implementation
    ./python -m timeit -s "from collections import namedtuple;a = namedtuple('a', 'a b c')(1, 2, 3)" "a.a"
    10000000 loops, best of 3: 0.028 usec per loop

    The fallback is the same implementation that is currently used, so this should have no effect on pypi.


    llllllllll mannequin commented Apr 12, 2015

    sorry, I meant pypy


    llllllllll mannequin commented Apr 14, 2015

    I am updating the patch to include an entry in Misc/NEWS.


    rhettinger commented Apr 26, 2015

    FWIW, the current property(itemgetter(index)) code has a Python creation step, but the actual attribute lookup and dispatch is done entirely in C (no pure-Python steps around the lookup).

    Rather than making a user-visible C hack directly to namedtuple, any optimization effort should be directed at improving the speed of property() and itemgetter().

    Here are some quick notes to get you started:

    * The overhead of property() is almost nothing.
    * The code is property_descr_get() in Objects/descrobject.c.
    * It has two nearly zero-cost predictable branches.
    * So the only real work is a call to
          PyObject_CallFunctionObjArgs(gs->prop_get, obj, NULL);
      which then calls both
          objargs_mktuple(vargs)
      and
          PyObject_Call(callable, args, NULL);
    * That can be sped up by using
          PyTuple_New(1)
      and a direct call to PyObject_Call().
    * The logic in PyObject_Call() can potentially be tightened in the
      context of a property(itemgetter(index)) call. Look to see whether
      the recursion check is necessary (itemgetter on a tuple subclass
      that doesn't extend __getitem__ is non-recursive).
    * If so, then the entire call to PyObject_Call() in property can
      potentially be simplified to:
          result = (*call)(func, arg, kw);

    I haven't looked too closely at this, but I think you get the general idea. If the speed of property(itemgetter(index)) is the bottleneck in your code, the starting point is to unwind the exact series of C steps performed to see if any of them can be simplified. For the most part, the code for both property() and itemgetter() was implemented using the simplest parts of the C API rather than the fastest. The chain of calls isn't specialized for the common use case (i.e. property_descr_get() needing exactly one argument rather than a variable-length argument tuple, and itemgetter doing a known integer offset on a list or tuple rather than the less common case of generic types and a possible tuple of indices).

    We should start by optimizing what we've already got. That will have a benefit beyond named tuples (faster itemgetters for sorting and faster property gets for the entire language).

    It also helps us avoid making the NT code less familiar (using custom private APIs rather than generic, well-known components).

    It also reduces the risk of breaking code that relies on the published implementation of named tuple attribute lookups (for example, I've seen deployed code that customizes the attribute docstrings like this):
    Point = namedtuple('Point', ['x', 'y'])
    Point.x = property(Point.x.fget, doc='abscissa')
    Point.y = property(Point.y.fget, doc='ordinate')
    coordinate = Point(x=250, y=43)
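
The published implementation being relied on above boils down to generated code along these lines (a simplified sketch, not the actual generated source):

```python
from operator import itemgetter

class Point(tuple):
    """Simplified sketch of what namedtuple('Point', ['x', 'y']) generates."""

    def __new__(cls, x, y):
        return tuple.__new__(cls, (x, y))

    # Attribute access is property(itemgetter(index)): creation happens in
    # Python, but lookup and dispatch run entirely in C.
    x = property(itemgetter(0), doc='Alias for field number 0')
    y = property(itemgetter(1), doc='Alias for field number 1')

p = Point(x=250, y=43)
print(p.x, p[0])  # 250 250
```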


    rhettinger commented Apr 26, 2015

    One other thought: the itemgetter.__call__ method is already pretty thin:

    if (!PyArg_UnpackTuple(args, "itemgetter", 1, 1, &obj))
        return NULL;
    if (nitems == 1)
        return PyObject_GetItem(obj, ig->item);

    But you could add a special case for single integer index being applied to a known sequence. Extract the Py_ssize_t index in itemgetter_new and store it in the itemgetterobject. Then add a fast path in itemgetter.__call__. Roughly something like this:

      if (ig->index != -1 &&
          PyTuple_Check(obj) && 
          nitems == 1 &&
          PyTuple_GET_SIZE(obj) > ig->index) {
           result = PySequence_Fast_GET_ITEM(obj, ig->index);
           Py_INCREF(result);
           return result;
      }
     
    Perhaps also add a check to make sure the tuple subclass hasn't overridden the __getitem__ method (see an example of how to do this in the code for Modules/_collectionsmodule.c::_count_elements()).

    Something along these lines would allow all the steps to be inlined and would eliminate all the usual indirections inherent in the abstract API.

    Another alternative is to use the PySequence API instead of the PyTuple API. That trades away a little of the speed-up in return for speeding-up itemgetter on lists as well as tuples.
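
The __getitem__-override check matters because an inlined PyTuple_GET_ITEM fast path would silently bypass a subclass's method; MyTuple below is a made-up example showing the difference:

```python
class MyTuple(tuple):
    def __getitem__(self, index):
        # A tuple subclass that extends __getitem__; a C fast path that
        # reads the tuple slots directly would never call this.
        return "intercepted"

t = MyTuple((1, 2, 3))
print(t[0])                     # intercepted
print(tuple.__getitem__(t, 0))  # 1 -- what the C fast path would return
```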


    llllllllll mannequin commented Apr 27, 2015

    I was unable to see a performance increase by playing with the itemgetter.__call__ code; however, updating the property code seemed to show a small improvement. I think that for simple indexing the cost of checking whether the object is a sequence outweighs the faster dispatch (when using PySequence_GetItem). I can play with this further.

    • default
      [joejev@Sheila cpython]$ ./python -m timeit -s "from collections import namedtuple as n;a = n('n', 'a b c')(1, 2, 3)" "a.a"
      10000000 loops, best of 3: 0.101 usec per loop

    • patch
      [joejev@Sheila cpython]$ ./python -m timeit -s "from collections import namedtuple as n;a = n('n', 'a b c')(1, 2, 3)" "a.a"
      10000000 loops, best of 3: 0.0942 usec per loop


    rhettinger commented Apr 27, 2015

    If you have a chance, run a C profiler so we can see whether most of the time is being spent in an attribute lookup for the current property(itemgetter) approach versus your nt-indexer approach. Without a profile, I have only my instincts that the overhead is a mix of indirections and function-call overhead (both solvable by in-lining), over-generalization in PyObject_GetItem() (solvable by specializing to a tuple subclass), and variable-length argument lists (solvable by using PyTuple_New(1)).

    Ideally, I would like something that speeds-up broader swaths of the language and doesn't obfuscate the otherwise clean generated code. ISTM that the C code for both property() and itemgetter() still have room to optimize the common case.


    llllllllll mannequin commented Apr 27, 2015

    This was very exciting; I had never run gprof before. Just to make sure I did this correctly, I will list my steps:

    1. compile with the -pg flag set
    2. run the test with ./python -m timeit ...
    3. $ gprof python gmon.out > profile.out

    Here is default:

    Each sample counts as 0.01 seconds.
    % time   cumulative seconds   self seconds   name
    22.48 0.87 0.87 PyEval_EvalFrameEx
    9.82 1.25 0.38 PyObject_CallFunctionObjArgs
    6.85 1.52 0.27 PyObject_GenericGetAttr
    6.46 1.77 0.25 tupledealloc
    5.56 1.98 0.22 PyArg_UnpackTuple
    5.17 2.18 0.20 PyNumber_AsSsize_t
    5.17 2.38 0.20 tuplesubscript
    5.04 2.58 0.20 PyObject_Call
    4.91 2.77 0.19 _PyType_Lookup
    4.65 2.95 0.18 PyTuple_New
    4.65 3.13 0.18 PyObject_GetItem
    4.39 3.30 0.17 itemgetter_call
    1.94 3.37 0.08 PyObject_GetAttr
    1.81 3.44 0.07 PyObject_GC_UnTrack
    1.81 3.51 0.07 _PyTuple_DebugMallocStats
    1.03 3.55 0.04 PyErr_Occurred
    1.03 3.59 0.04 property_descr_set
    1.03 3.63 0.04 tuplerepr
    0.78 3.66 0.03 PyLong_AsSsize_t
    0.78 3.69 0.03 PyObject_GC_Track
    0.52 3.71 0.02 property_descr_get
    0.52 3.73 0.02 repeat_next
    0.52 3.75 0.02 repeat_traverse
    0.39 3.77 0.02 PyArg_ValidateKeywordArguments
    0.39 3.78 0.02 _Py_CheckFunctionResult
    0.26 3.79 0.01 PyCFunction_NewEx
    0.26 3.80 0.01 PyCode_New
    0.26 3.81 0.01 PyErr_Restore
    0.26 3.82 0.01 PyType_GetSlot
    0.26 3.83 0.01 _PyObject_CallMethodIdObjArgs
    0.26 3.84 0.01 attrgetter_new
    0.26 3.85 0.01 update_one_slot
    0.13 3.86 0.01 _PyObject_GenericGetAttrWithDict
    0.13 3.86 0.01 _PyObject_SetAttrId
    0.13 3.87 0.01 gc_isenabled
    0.13 3.87 0.01 visit_decref

    Here is my patch:

    Each sample counts as 0.01 seconds.
    % time   cumulative seconds   self seconds   name
    26.63 1.02 1.02 PyEval_EvalFrameEx
    8.88 1.36 0.34 PyObject_GenericGetAttr
    7.83 1.66 0.30 tupledealloc
    6.27 1.90 0.24 PyObject_Call
    5.74 2.12 0.22 PyTuple_New
    5.48 2.33 0.21 property_descr_get
    5.22 2.53 0.20 _PyType_Lookup
    4.44 2.70 0.17 tuplesubscript
    4.18 2.86 0.16 PyArg_UnpackTuple
    3.92 3.01 0.15 PyNumber_AsSsize_t
    3.66 3.15 0.14 PyObject_GetItem
    2.87 3.26 0.11 itemgetter_call
    2.61 3.36 0.10 PyObject_GC_UnTrack
    1.70 3.43 0.07 PyObject_GetAttr
    1.31 3.48 0.05 repeat_next
    1.31 3.53 0.05 repeat_traverse
    1.04 3.57 0.04 attrgetter_new
    1.04 3.61 0.04 property_descr_set
    0.78 3.64 0.03 PyErr_Occurred
    0.78 3.67 0.03 PyErr_Restore
    0.78 3.70 0.03 PyLong_AsSsize_t
    0.78 3.73 0.03 PyType_GetSlot
    0.52 3.75 0.02 PyObject_GC_Track
    0.26 3.76 0.01 PyArg_ValidateKeywordArguments
    0.26 3.77 0.01 PyDict_GetItem
    0.26 3.78 0.01 _PyObject_GenericGetAttrWithDict
    0.26 3.79 0.01 _PyTuple_DebugMallocStats
    0.26 3.80 0.01 _Py_CheckFunctionResult
    0.26 3.81 0.01 convertitem
    0.26 3.82 0.01 r_object
    0.26 3.83 0.01 tuplerepr
    0.13 3.83 0.01 _PyObject_SetAttrId

    It looks like you were correct that PyObject_CallFunctionObjArgs was eating up a lot of time.


    rhettinger commented Apr 27, 2015

    Hmm, the presence of _PyTuple_DebugMallocStats, repeat_traverse, and visit_decref suggests that the profile may have been run with debugging code and GC enabled.

    The property patch looks good. Depending on how far you want to go with this, you could save your tuple of length-1 statically and reuse it on successive calls if its refcnt doesn't grow (see the code for zip() for an example of how to do this). That would save the PyTuple_New and tupledealloc calls.

    Going further, you could potentially in-line some of the code in PyObject_Call, caching the call site and its NULL check, and look at the other steps to see whether they are all necessary in the context of a property_descr_get().


    llllllllll mannequin commented Apr 27, 2015

    I switched to the static tuple.


    llllllllll mannequin commented Apr 27, 2015

    I don't think that I can cache the __call__ of the fget object because it might be an instance of a heaptype, and if someone changed the __class__ of the object in between calls to another heaptype that had a different __call__, you would still get the __call__ from the first type. I also don't know if this is supported behavior or just something that works by accident.
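
That hazard is easy to demonstrate from Python; A and B below are made-up heap types:

```python
class A:
    def __call__(self):
        return "A"

class B:
    def __call__(self):
        return "B"

obj = A()
print(obj())       # A
obj.__class__ = B  # swap the heap type between calls
print(obj())       # B -- a tp_call cached from A's type would now be stale
```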

    I read through PyObject_Call, and all the code is needed assuming we are not caching the tp_call value.


    rhettinger commented Apr 27, 2015

    What kind of speed improvement have you gotten?


    llllllllll mannequin commented Apr 27, 2015

    I am currently on a different machine so these numbers are not relative to the others posted earlier.

    • default
      ./python -m timeit -s "from collections import namedtuple as n;a = n('n', 'a b c')(1, 2, 3)" "a.a"
      10000000 loops, best of 3: 0.0699 usec per loop

    • patch
      ./python -m timeit -s "from collections import namedtuple as n;a = n('n', 'a b c')(1, 2, 3)" "a.a"
      10000000 loops, best of 3: 0.0468 usec per loop


    rhettinger commented Apr 29, 2015

    0.0699 usec per loop --> 0.0468

    That's pretty good for a small patch :-)

    For the pre-computed 1-tuple, I think you need to check for a refcnt of 1 and fallback to PyTuple_New if the tuple is in use (i.e. a property that calls another property). See Objects/enumobject.c::105 for an example.

    Also, consider providing a way to clean-up that tuple on shutdown. For example, look at what is done with the various freelists. An easier way is to make the premade tuple part of the property object struct so that it gets freed when the property is deallocated.

    Adding Serhiy to the nosy list, he can help with cleaning-up the patch so that it is ready to apply.


    llllllllll mannequin commented Apr 29, 2015

    I don't think we need to worry about reusing the single-argument tuple in a recursive situation, because we never need the value after we start the call. We just write the new value in and then reset the slot to NULL so that we don't blow up when the tuple is deallocated. For example:

    >>> class C(object):
    ...     @property
    ...     def f(self):
    ...         return D().f
    ... 
    >>> class D(object):
    ...     @property
    ...     def f(self):
    ...         return 1
    ... 
    >>> C().f
    1

    This works with recursive properties.

    I also think it is getting cleaned up on shutdown: if I put a pointer to garbage in the tuple, the interpreter blows up on shutdown, which makes me think tuple_dealloc is being called somewhere. As for putting the tuple on the property instance, that would be nice for memory management; however, it increases the memory overhead of each property. It also means we would only get the faster lookup after a given property has been accessed once; this is fine, but the current implementation makes all properties faster after any of them is looked up once. I could be wrong about the cleanup, though.

    I am also updating the title and headers because this issue is no longer about namedtuple.

    @llllllllll llllllllll mannequin added interpreter-core Interpreter core (Objects, Python, Grammar, and Parser dirs) and removed extension-modules C modules in the Modules dir labels Apr 29, 2015
    @llllllllll llllllllll mannequin changed the title C implementation of namedtuple (WIP) property_descr_get reuse argument tuple Apr 29, 2015

    python-dev mannequin commented Apr 30, 2015

    New changeset 661cdbd617b8 by Raymond Hettinger in branch 'default':
    Issue bpo-23910: Optimize property() getter calls. Patch by Joe Jevnik
    https://hg.python.org/cpython/rev/661cdbd617b8


    serhiy-storchaka commented May 24, 2015

    This patch caused crash with C implementation of lru_cache (bpo-14373). Opened bpo-24276 for fixing this.


    vstinner commented Oct 1, 2018

    This optimization caused multiple crashes, so it has been decided to remove it :-( See bpo-30156.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022