Add _PyObject_FastCall() #71315

Closed
vstinner opened this issue May 26, 2016 · 32 comments
Labels
performance Performance or resource usage

Comments

@vstinner
Member

BPO 27128
Nosy @scoder, @vstinner, @serhiy-storchaka, @1st1, @ztane
Files
  • fastcall.patch
  • default-May26-13-36-33.log
  • fastcall-2.patch
  • fast_call_alt.patch
  • fastcall-3.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2016-09-01.13:13:54.081>
    created_at = <Date 2016-05-26.10:15:56.728>
    labels = ['performance']
    title = 'Add _PyObject_FastCall()'
    updated_at = <Date 2016-09-01.13:13:54.079>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2016-09-01.13:13:54.079>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2016-09-01.13:13:54.081>
    closer = 'vstinner'
    components = []
    creation = <Date 2016-05-26.10:15:56.728>
    creator = 'vstinner'
    dependencies = []
    files = ['43011', '43014', '44041', '44079', '44128']
    hgrepos = []
    issue_num = 27128
    keywords = ['patch']
    message_count = 32.0
    messages = ['266422', '266424', '266429', '266430', '268057', '268059', '272138', '272139', '272197', '272264', '272268', '272479', '272503', '272884', '272887', '272888', '272891', '273136', '273140', '273143', '273144', '273153', '273166', '273175', '273349', '273350', '273351', '273365', '273371', '273388', '273402', '274123']
    nosy_count = 6.0
    nosy_names = ['scoder', 'vstinner', 'python-dev', 'serhiy.storchaka', 'yselivanov', 'ztane']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue27128'
    versions = ['Python 3.6']

    @vstinner
    Member Author

    Since the issue bpo-26814 proved that avoiding the creation of temporary tuples to call Python and C functions makes Python faster (between 2% and 29% depending on the benchmark), I extracted a first "minimal" patch to start merging this work.

    The first patch adds new functions:

    • PyObject_CallNoArg(func) and PyObject_CallArg1(func, arg): public functions
    • _PyObject_FastCall(func, args, nargs, kwargs): private function

    I hesitate between the C types "int" and "Py_ssize_t" for nargs. I read once that using "int" can cause performance issues on a loop using "i++" and "data[i]" because the compiler has to handle integer overflow of the int type.

    The "int" type is also annoying on Windows 64-bit, it causes compiler warnings on downcast like PyTuple_GET_SIZE(co->co_argcount) stored into a C int.

    _PyObject_FastCall() avoids the creation of a tuple for:

    • All Python functions (PyFunction_Check)
    • C functions using METH_NOARGS or METH_O

    The patch removes the "cache tuple" optimization from property_descr_get(); it uses PyObject_CallArg1() instead. This means that the optimization is (currently) missed in some cases compared to the current code, but the code is safer and simpler.

    The patch adds Python/pystack.c which currently only contains _PyStack_AsTuple(), but will contain more code later.

    I tried to write the smallest patch, but I started to use PyObject_CallNoArg() and PyObject_CallArg1() when the code already created a tuple at each call: PyObject_CallObject(), call_function_tail() and PyEval_CallObjectWithKeywords().

    In the patch, keywords are not used in fast calls, but they will be used later. I prefer to start directly with keywords rather than changing the calling convention once again later.

    --

    Later, I will propose other patches to:

    • add METH_FASTCALL calling convention for C functions
    • modify Argument Clinic to use METH_FASTCALL

    So the fast call will be taken in more cases.

    --

    The long term plan is to slowly use the new FASTCALL calling convention "everywhere". The tricky points are the tp_new, tp_init and tp_call attributes of type objects. In the issue bpo-26814, I wrote a patch adding Py_TPFLAGS_FASTNEW, Py_TPFLAGS_FASTINIT and Py_TPFLAGS_FASTCALL flags to use the FASTCALL calling convention for tp_new, tp_init and tp_call. The problem is that calling these methods directly looks common. If we change the calling convention of these methods, it will break the C API. I propose to discuss that later ;-)

    An alternative is to add a tp_fastcall method to PyTypeObject and use a wrapper for tp_call for backward compatibility. This option also has drawbacks. Again, I propose to discuss this later, and first focus on the changes that don't break anything ;-)

    @vstinner vstinner added the performance Performance or resource usage label May 26, 2016
    @vstinner
    Member Author

    Quick & dirty microbenchmark: I ran bench_fast-2.py of the issue bpo-26814. It looks like everything is slower :-p In fact, I already noticed this issue and I think that it is fixed with better compilation options: use "./configure --with-lto" and "make profile-opt". See my article:
    https://haypo.github.io/journey-to-stable-benchmark-deadcode.html

    ----------------------------------+-------------+---------------
    Tests                             | original    | fastcall
    ----------------------------------+-------------+---------------
    filter                            | 76.2 us (*) | 116 us (+52%)
    map                               | 73.6 us (*) | 102 us (+38%)
    sorted(list, key=lambda x: x)     | 82 us (*)   | 121 us (+48%)
    sorted(list)                      | 14.7 us (*) | 17.3 us (+18%)
    b=MyBytes(); bytes(b)             | 182 ns (*)  | 243 ns (+33%)
    namedtuple.attr                   | 802 ns (*)  | 1.44 us (+80%)
    object.__setattr__(obj, "x", 1)   | 133 ns (*)  | 166 ns (+25%)
    object.__getattribute__(obj, "x") | 116 ns (*)  | 142 ns (+22%)
    getattr(1, "real")                | 76 ns (*)   | 95 ns (+25%)
    bounded_pymethod(1, 2)            | 72 ns (*)   | 102 ns (+42%)
    unbound_pymethod(obj, 1, 2)       | 71 ns (*)   | 99 ns (+38%)
    func()                            | 57 ns (*)   | 81 ns (+41%)
    func(1, 2, 3)                     | 72 ns (*)   | 100 ns (+39%)
    ----------------------------------+-------------+---------------
    Total                             | 248 us (*)  | 358 us (+44%)
    ----------------------------------+-------------+---------------

    At least, we have a starting point ;-)

    @vstinner
    Member Author

    default-May26-13-36-33.log: CPython benchmark suite run using stable config.

    Faster (15):

    • regex_effbot: 1.26x faster
    • telco: 1.08x faster
    • unpack_sequence: 1.07x faster
    • mako_v2: 1.05x faster
    • meteor_contest: 1.05x faster
    • chaos: 1.04x faster
    • nbody: 1.04x faster
    • call_method_slots: 1.04x faster
    • etree_iterparse: 1.04x faster
    • etree_parse: 1.04x faster
    • call_method: 1.03x faster
    • raytrace: 1.03x faster
    • nqueens: 1.03x faster
    • call_method_unknown: 1.03x faster
    • formatted_logging: 1.02x faster

    Slower (8):

    • etree_generate: 1.05x slower
    • etree_process: 1.03x slower
    • call_simple: 1.03x slower
    • chameleon_v2: 1.02x slower
    • pathlib: 1.02x slower
    • float: 1.02x slower
    • silent_logging: 1.02x slower
    • json_load: 1.02x slower

    @vstinner
    Member Author

    Updated bench_fast-2.py result with Python compiled with PGO+LTO, with benchmark.py fixed to compute average + standard deviation. Only namedtuple.attr really seems slower (getattr() is actually faster):

    ----------------------------------+-----------------------+--------------------------
    Tests                             | original              | fastcall
    ----------------------------------+-----------------------+--------------------------
    filter                            | 75.8 us +- 0.1 us (*) | 78.1 us +- 0.1 us
    map                               | 72.6 us +- 0.1 us (*) | 71.4 us +- 0.0 us
    sorted(list, key=lambda x: x)     | 83.7 us +- 0.1 us (*) | 82.3 us +- 0.3 us
    sorted(list)                      | 14.9 us +- 0.0 us (*) | 14.7 us +- 0.0 us
    b=MyBytes(); bytes(b)             | 199 ns +- 2 ns (*)    | 194 ns +- 1 ns
    namedtuple.attr                   | 830 ns +- 20 ns (*)   | 1.09 us +- 0.01 us (+31%)
    object.__setattr__(obj, "x", 1)   | 133 ns +- 0 ns (*)    | 134 ns +- 1 ns
    object.__getattribute__(obj, "x") | 117 ns +- 0 ns (*)    | 115 ns +- 1 ns
    getattr(1, "real")                | 93.2 ns +- 0.9 ns (*) | 76.9 ns +- 0.7 ns (-17%)
    bounded_pymethod(1, 2)            | 73.4 ns +- 0.6 ns (*) | 70.7 ns +- 0.4 ns
    unbound_pymethod(obj, 1, 2)       | 74.5 ns +- 0.2 ns (*) | 71.8 ns +- 0.6 ns
    func()                            | 60.2 ns +- 0.4 ns (*) | 59.3 ns +- 0.1 ns
    func(1, 2, 3)                     | 74.6 ns +- 0.4 ns (*) | 72.2 ns +- 0.3 ns
    ----------------------------------+-----------------------+--------------------------
    Total                             | 249 us (*)            | 248 us
    ----------------------------------+-----------------------+--------------------------
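
For readers who want to reproduce this kind of "mean +- standard deviation" figure, here is a minimal sketch using only the standard library. It is not bench_fast-2.py or benchmark.py from bpo-26814 (their contents are not shown in this issue); the helper names `bench` and `format_ns` and the trivial `func` are assumptions made for illustration, and the numbers it prints depend entirely on the machine and build.

```python
import statistics
import timeit


def func(*args):
    """Trivial Python function, so that call overhead dominates the timing."""
    return None


def bench(stmt, repeat=5, number=10_000):
    """Return (mean, stdev) of the per-call time in seconds (hypothetical helper)."""
    runs = timeit.repeat(stmt, globals={"func": func}, repeat=repeat, number=number)
    per_call = [run / number for run in runs]
    return statistics.mean(per_call), statistics.stdev(per_call)


def format_ns(mean, stdev):
    """Format a timing given in seconds as '<mean> ns +- <stdev> ns'."""
    return f"{mean * 1e9:.1f} ns +- {stdev * 1e9:.1f} ns"


if __name__ == "__main__":
    # Three of the call patterns listed in the table above.
    for stmt in ("func()", "func(1, 2, 3)", 'getattr(1, "real")'):
        print(f"{stmt:20} {format_ns(*bench(stmt))}")
```

A handful of in-process repetitions like this is much noisier than the multi-process runs discussed later in the thread, so small differences should not be trusted.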

    @serhiy-storchaka
    Member

    See bpo-27213. Maybe fast call with keyword arguments would avoid the creation of a dict.

    @vstinner
    Member Author

    vstinner commented Jun 9, 2016

    Serhiy Storchaka added the comment:

    See bpo-27213. Maybe fast call with keyword arguments would avoid the creation of a dict.

    In a first version of my implementation, I used dictionary items
    stored as a list of (key, value) tuples in the same PyObject* C array
    as positional parameters.

    But in practice, it's very rare in the C code base to have to call a
    function with keyword parameters, but most functions expect keyword
    parameters as a dict. They are implemented with
    PyArg_ParseTupleAndKeywords() which expects a dict.

    @vstinner
    Member Author

    vstinner commented Aug 8, 2016

    Rebased patch.

    @vstinner
    Member Author

    vstinner commented Aug 8, 2016

    (Oops, I removed a broken fastcall-2.patch which didn't include new pystack.c/pystack.h files. It's now fixed in the new fastcall-2.patch.)

    @vstinner
    Member Author

    vstinner commented Aug 8, 2016

    I spent the last 3 months making the CPython benchmark suite more stable and enhancing my procedure for running benchmarks so that the results are more stable.

    See my articles:
    https://haypo-notes.readthedocs.io/microbenchmark.html#my-articles

    I forked and enhanced the benchmark suite to use my perf module to run benchmarks in multiple processes:
    https://hg.python.org/sandbox/benchmarks_perf

    I ran this better benchmark suite on fastcall-2.patch on my laptop. The result is quite good:
    ----------------

    $ python3 -m perf compare_to ref.json fastcall.json -G  --min-speed=5
    Slower (4):
    - fastpickle/pickle_dict: 326 us +- 15 us -> 350 us +- 29 us: 1.07x slower
    - regex_effbot: 49.4 ms +- 1.3 ms -> 53.0 ms +- 1.2 ms: 1.07x slower
    - fastpickle/pickle: 432 us +- 8 us -> 457 us +- 10 us: 1.06x slower
    - pybench.ComplexPythonFunctionCalls: 838 ns +- 11 ns -> 884 ns +- 8 ns: 1.05x slower

    Faster (13):
    - spectral_norm: 289 ms +- 6 ms -> 250 ms +- 5 ms: 1.16x faster
    - pybench.SimpleIntFloatArithmetic: 622 ns +- 9 ns -> 559 ns +- 10 ns: 1.11x faster
    - pybench.SimpleIntegerArithmetic: 621 ns +- 10 ns -> 560 ns +- 9 ns: 1.11x faster
    - pybench.SimpleLongArithmetic: 891 ns +- 12 ns -> 816 ns +- 10 ns: 1.09x faster
    - pybench.DictCreation: 852 ns +- 13 ns -> 788 ns +- 16 ns: 1.08x faster
    - pybench.ForLoops: 10.8 ns +- 0.3 ns -> 9.99 ns +- 0.23 ns: 1.08x faster
    - pybench.NormalClassAttribute: 1.85 us +- 0.02 us -> 1.72 us +- 0.04 us: 1.08x faster
    - pybench.SpecialClassAttribute: 1.86 us +- 0.02 us -> 1.73 us +- 0.03 us: 1.07x faster
    - pybench.NestedForLoops: 21.9 ns +- 0.3 ns -> 20.7 ns +- 0.3 ns: 1.05x faster
    - pybench.SimpleListManipulation: 501 ns +- 4 ns -> 476 ns +- 5 ns: 1.05x faster
    - elementtree/process: 192 ms +- 3 ms -> 183 ms +- 2 ms: 1.05x faster
    - elementtree/generate: 225 ms +- 5 ms -> 214 ms +- 4 ms: 1.05x faster
    - hexiom2/level_25: 21.3 ms +- 0.3 ms -> 20.3 ms +- 0.1 ms: 1.05x faster

    Benchmark hidden because not significant (84): (...)
    ----------------

    Most benchmarks are not significant, which is expected since fastcall-2.patch is really the simplest patch to start the work on "FASTCALL": it doesn't really implement any optimization, it only adds new infrastructure to implement new optimizations.

    A few benchmarks are faster (only benchmarks at least 5% faster are shown using --min-speed=5).
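
To make the "1.16x faster" style of figure and the 5% cut-off concrete, here is a small sketch of the arithmetic. It is an assumption about how such a threshold could be applied, not the perf module's actual implementation (which works on the raw samples, so borderline cases may come out differently); `speed_change` and `is_shown` are hypothetical helper names.

```python
def speed_change(ref, new):
    """Return (ratio, label): how many times faster or slower `new` is than `ref`."""
    if new <= ref:
        return ref / new, "faster"
    return new / ref, "slower"


def is_shown(ref, new, min_speed=5.0):
    """True if the change is at least `min_speed` percent in either direction."""
    ratio, _ = speed_change(ref, new)
    return (ratio - 1.0) * 100.0 >= min_speed


if __name__ == "__main__":
    # Mean timings taken from the comparison output above (both in ms).
    for name, ref, new in [("spectral_norm", 289.0, 250.0),
                           ("regex_effbot", 49.4, 53.0)]:
        ratio, label = speed_change(ref, new)
        print(f"{name}: {ratio:.2f}x {label}, shown={is_shown(ref, new)}")
```

With the means quoted above, spectral_norm comes out as about 1.16x faster and regex_effbot as about 1.07x slower, matching the report.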

    4 benchmarks are slower, but the slowdown should be temporary: new optimizations should make these benchmarks faster. See the issue bpo-26814 for a more concrete implementation and a lot of benchmark results if you don't trust me :-)

    I consider that benchmarks proved that there is no major slowdown, so fastcall-2.patch can be merged to be able to start working on real optimizations.

    @serhiy-storchaka
    Member

    Benchmarking results look nice, but despite the fact that this patch is only a small part of bpo-26814, it looks to me larger than it could be.

    1. The patch includes two parts: adding _PyObject_FastCall() and adding PyObject_CallNoArg() and PyObject_CallArg1(). How large is the role of the latter functions in the speedup? Can we first just add _PyObject_FastCall() and measure the effect of adding PyObject_CallNoArg() and PyObject_CallArg1() separately? Can the existing function PyObject_Call() be optimized to achieve a comparable benefit?

    2. I think that supporting keyword arguments in _PyObject_FastCall() doesn't make much sense now. Calling with keyword arguments adds so much overhead that it dwarfs the benefit of avoiding the creation of one tuple. I think that the patch can be simpler if we drop support for keyword arguments.

    3. The patch adds two files for one function _PyStack_AsTuple(). I would prefer something like _PyTuple_FromArray(). It could be used in other places, not just in argument parsing.

    @vstinner
    Member Author

    vstinner commented Aug 9, 2016

    Benchmarking results look nice, but despite the fact that this patch is
    only a small part of bpo-26814, it looks to me larger than it could be.

    Oh, I failed to express my intent. This initial patch is not expected to
    introduce any speedup. In fact, I noticed major performance regressions on
    the CPython benchmark suite using my full fastcall patch. It took me time
    to understand that they were issues with the benchmarks rather than with my
    work. This minimal patch only adds new functions but doesn't really use them.
    I patched a few functions to show how the new functions can be used. I spent
    most of my time just ensuring that the minimal patch doesn't introduce a
    performance regression.

    1. The patch includes two parts: adding _PyObject_FastCall() and adding
      PyObject_CallNoArg() and PyObject_CallArg1(). How large is the role of the
      latter functions in the speedup?

    See my remark above, no speedup is expected.

    Do you suggest to not add these 2 new functions? Since they are well
    defined and simple, I chose to make them public. Their API is nicer than
    _PyObject_Call().

    Can existing function PyObject_Call() be optimized to achieve a
    comparable benefit?

    Sorry, I don't understand. This function requires a tuple. The whole
    purpose of my patch is to avoid temporary tuples.

    In my full patch, PyObject_Call() calls _PyObject_FastCall() in most cases.

    2. I think that supporting keyword arguments in _PyObject_FastCall()
      doesn't make much sense now.

    Well, I can add support for keyword arguments later and start with an
    assertion (fail if they are used). But I really need them in the API, and I
    don't want to change the API later.

    I plan to add a new METH_FASTCALL calling convention for C functions. I
    would prefer to not have two new calling conventions, but use Argument
    Clinic to emit efficient code to parse arguments.

    Calling with keyword arguments adds so much overhead that it dwarfs
    the benefit of avoiding the creation of one tuple. I think that the patch
    can be simpler if we drop support for keyword arguments.

    Keyword arguments are optional. Having support for them costs nothing when
    they are not used.

    3. The patch adds two files for one function _PyStack_AsTuple(). I would
      prefer something like _PyTuple_FromArray(). It could be used in other
      places, not just in argument parsing.

    I really want to have a "pystack" API. In this patch, the new file looks
    useless, but in the full patch there are many functions including a few
    complex functions. I prefer to add the file now and complete it later.

    I'm limited by Mercurial and our workflow (tools). It would be much easier
    to explain my work using a patch series, but it's not possible to publish a
    patch series...

    @serhiy-storchaka
    Member

    Do you suggest to not add these 2 new functions?

    Yes, I suggest not adding them. The API for calling is already too large.
    Internally we can directly use _PyObject_FastCall(), and third party code
    should benefit from an optimized PyObject_CallFunctionObjArgs().

    > Can existing function PyObject_Call() be optimized to achieve a
    > comparable benefit?
    Sorry, I don't understand. This function requires a tuple. The whole
    purpose of my patch is to avoid temporary tuples.

    Sorry, I meant PyObject_CallFunctionObjArgs() and the like.

    Keyword arguments are optional. Having support for them costs nothing when
    they are not used.

    My point is that if keyword arguments are used, this is not a fast call, and
    it should use the old calling protocol. The overhead of creating a tuple for
    args is dwarfed by the overhead of creating a dict for kwargs and parsing it.

    I really want to have a "pystack" API. In this patch, the new file looks
    useless, but in the full patch there are many functions including a few
    complex functions. I prefer to add the file now and complete it later.

    But for now there is no "pystack" API. What do you want to add? Can it be
    added with the prefixes PyDict_, PyArg_ or PyEval_? On the other hand, other
    code can benefit from using _PyTuple_FromArray().

    Here is alternative simplified patch.

    1. _PyStack_AsTuple() is renamed to _PyTuple_FromArray() (-2 new files).
    2. Optimized PyObject_CallFunctionObjArgs(), PyObject_CallMethodObjArgs() and
      _PyObject_CallMethodIdObjArgs().
    3. Removed PyObject_CallNoArg() and PyObject_CallArg1(). Invocations are
      replaced by PyObject_CallFunctionObjArgs().
    4. Removed support of keyword arguments in _PyObject_FastCall() (saved about
      20 lines and few runtime checks in _PyCFunction_FastCall).
    5. Reverted changes in Objects/descrobject.c. They added a regression in
      namedtuple attributes access.

    @ztane
    Mannequin

    ztane mannequin commented Aug 12, 2016

    About "I hesitate between the C types "int" and "Py_ssize_t" for nargs. I read once that using "int" can cause performance issues on a loop using "i++" and "data[i]" because the compiler has to handle integer overflow of the int type."

    This is true because of -fwrapv, but I believe it is true also for Py_ssize_t which is also of signed type. However, there would be a speed-up achievable by disabling -fwrapv, because only then the i++; data[i] can be safely optimized into *(++data)

    @vstinner
    Member Author

    Serhiy Storchaka added the comment:

    > Do you suggest to not add these 2 new functions?

    Yes, I suggest not adding them. The API for calling is already too large.
    Internally we can directly use _PyObject_FastCall(), and third party code
    should benefit from an optimized PyObject_CallFunctionObjArgs().

    Well, we can start without them, and see later if it's worth it.

    I didn't propose to add new functions to make the code faster, but to
    make the API simpler.

    I dislike PyEval_CallObjectWithKeywords(func, arg, kw) because it has
    a special case if arg is a tuple. If arg is a tuple, the tuple is
    unpacked. It already led to a very complex bug in the
    implementation of generators! See the issue bpo-21209. I'm not sure that
    such a use case is well known and understood by everyone...
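
The pitfall described here can be illustrated with a toy Python model. This is only a sketch of the behaviour as stated in the comment above (a tuple argument is unpacked, anything else is passed as a single argument); `call_object` and `describe` are hypothetical names, and the code is not a transcription of any CPython C function.

```python
def call_object(func, arg):
    """Toy model: a tuple is unpacked as the argument list, anything else is one argument."""
    if isinstance(arg, tuple):
        return func(*arg)
    return func(arg)


def describe(value):
    """One-argument callee used to expose the ambiguity."""
    return f"got {value!r}"


if __name__ == "__main__":
    print(call_object(describe, [1, 2]))       # the list is passed as one argument
    print(call_object(describe, ((1, 2),)))    # a tuple must be wrapped by the caller
    try:
        call_object(describe, (1, 2))          # unpacked into two arguments
    except TypeError as exc:
        print("TypeError:", exc)
```

A caller who wants to pass a tuple as the single argument has to remember to wrap it in another tuple, which is easy to forget when the value comes from user code.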

    It's common to call a function with no argument or just one argument,
    so I proposed to add an obvious and simple API for these common cases.
    Well, again, I will open a new issue to discuss that.

    > > Can existing function PyObject_Call() be optimized to achieve a
    > > comparable benefit?
    > Sorry, I don't understand. This function requires a tuple. The whole
    > purpose of my patch is to avoid temporary tuples.

    Sorry, I meant PyObject_CallFunctionObjArgs() and the like.

    Yes, my full patch does optimize these functions:
    https://hg.python.org/sandbox/fastcall/file/2dc558e01e66/Objects/abstract.c#l2523

    > Keyword arguments are optional. Having support for them costs nothing when
    > they are not used.

    My point is that if keyword arguments are used, this is not a fast call, and
    it should use the old calling protocol. The overhead of creating a tuple for
    args is dwarfed by the overhead of creating a dict for kwargs and parsing it.

    I'm not sure that I understand your point.

    For example, in my full patch, I have a METH_FASTCALL calling
    convention for C functions. With this calling convention, a function
    accepts positional arguments and keyword arguments. If you don't pass
    keyword arguments, the call should be faster according to my
    benchmarks.

    How do you want to implement METH_FASTCALL if you cannot pass keyword
    arguments? Does it mean that METH_FASTCALL can only be used by the
    functions which don't accept keyword arguments at all?

    It's ok if passing keyword arguments is not faster, but simply as fast
    as before, if the "positional arguments only" case is faster, no?

    > I really want to have a "pystack" API. In this patch, the new file looks
    > useless, but in the full patch there are many functions including a few
    > complex functions. I prefer to add the file now and complete it later.

    But for now there is no "pystack" API. What do you want to add?

    See my fastcall branch:

    https://hg.python.org/sandbox/fastcall/file/2dc558e01e66/Include/pystack.h
    https://hg.python.org/sandbox/fastcall/file/2dc558e01e66/Python/pystack.c

    All these functions are private. They are used internally to implement
    all functions of the Python C API to call functions.

    On the other hand, other code can benefit from using _PyTuple_FromArray().

    Ah? Maybe you should open a different issue for that.

    I prefer to have an API specific to build a "stack" to call functions.

    Here is alternative simplified patch.

    1. _PyStack_AsTuple() is renamed to _PyTuple_FromArray() (-2 new files).
    2. Optimized PyObject_CallFunctionObjArgs(), PyObject_CallMethodObjArgs() and
      _PyObject_CallMethodIdObjArgs().

    My full patch does optimize "everything", it's deliberate to start
    with something useless but short.

    5. Reverted changes in Objects/descrobject.c. They added a regression in
      namedtuple attributes access.

    Ah? What is the regression?

    @python-dev
    Mannequin

    python-dev mannequin commented Aug 16, 2016

    New changeset 288ec55f1912 by Victor Stinner in branch 'default':
    Issue bpo-27128: Cleanup _PyEval_EvalCodeWithName()
    https://hg.python.org/cpython/rev/288ec55f1912

    New changeset e615718a6455 by Victor Stinner in branch 'default':
    Use Py_ssize_t in _PyEval_EvalCodeWithName()
    https://hg.python.org/cpython/rev/e615718a6455

    @vstinner
    Member Author

    Patch version 3: simpler and shorter patch

    • _PyObject_FastCall() keeps its kwargs parameter, but it must always be NULL. Support for keyword arguments will be added later.
    • I removed PyObject_CallNoArg() and PyObject_CallArg1()
    • I moved _PyStack_AsTuple() to Objects/abstract.c. This is a temporary home until the API grows enough to require its own file (Python/pystack.c).

    I also pushed some changes unrelated to fastcall in Python/ceval.c to simplify the patch.

    Very few functions are modified (directly or indirectly) to use _PyObject_FastCall():

    • PyEval_CallObjectWithKeywords()
    • PyObject_CallFunction()
    • PyObject_CallMethod()
    • _PyObject_CallMethodId()

    Much more will come in following patches.

    @vstinner
    Member Author

    5. Reverted changes in Objects/descrobject.c. They added a regression in
      namedtuple attributes access.

    Oh, I now understand. The change makes "namedtuple.attr" slower. With fastcall-3.patch attached to this issue, the fast path is not taken on this benchmark, and so you lose the removed optimization (tuple cached in the modified descriptor function).

    In fact, you need the "full" fastcall change to make this attribute lookup *faster*:
    https://bugs.python.org/issue26814#msg263999

    So yeah, it's better to wait until more changes are merged.

    @python-dev
    Mannequin

    python-dev mannequin commented Aug 19, 2016

    New changeset a1a29d20f52d by Victor Stinner in branch 'default':
    Add _PyObject_FastCall()
    https://hg.python.org/cpython/rev/a1a29d20f52d

    New changeset 89e4ad001f3d by Victor Stinner in branch 'default':
    PyEval_CallObjectWithKeywords() uses fast call
    https://hg.python.org/cpython/rev/89e4ad001f3d

    New changeset 7cd479573de9 by Victor Stinner in branch 'default':
    call_function_tail() uses fast call
    https://hg.python.org/cpython/rev/7cd479573de9

    New changeset 34af2edface9 by Victor Stinner in branch 'default':
    Cleanup call_function_tail()
    https://hg.python.org/cpython/rev/34af2edface9

    New changeset adceb14cab96 by Victor Stinner in branch 'default':
    Cleanup callmethod()
    https://hg.python.org/cpython/rev/adceb14cab96

    New changeset 10f1a4910adb by Victor Stinner in branch 'default':
    PEP-7: add {...} around null_error() in abstract.c
    https://hg.python.org/cpython/rev/10f1a4910adb

    New changeset 5cf9524f2923 by Victor Stinner in branch 'default':
    Avoid call_function_tail() for empty format str
    https://hg.python.org/cpython/rev/5cf9524f2923

    New changeset f1ad6f64a11e by Victor Stinner in branch 'default':
    Fix PyObject_Call() parameter names
    https://hg.python.org/cpython/rev/f1ad6f64a11e

    @python-dev
    Mannequin

    python-dev mannequin commented Aug 19, 2016

    New changeset 2da6dc1c30d8 by Victor Stinner in branch 'default':
    contains and rich compare slots use fast call
    https://hg.python.org/cpython/rev/2da6dc1c30d8

    New changeset 2d4d40da2aba by Victor Stinner in branch '3.5':
    Fix a refleak in call_method()
    https://hg.python.org/cpython/rev/2d4d40da2aba

    New changeset 5b1ed48aedef by Victor Stinner in branch '2.7':
    Fix a refleak in call_method()
    https://hg.python.org/cpython/rev/5b1ed48aedef

    New changeset df4efc23ab18 by Victor Stinner in branch '3.5':
    Fix a refleak in call_maybe()
    https://hg.python.org/cpython/rev/df4efc23ab18

    New changeset 7669fb39a9ce by Victor Stinner in branch '2.7':
    Fix a refleak in call_maybe()
    https://hg.python.org/cpython/rev/7669fb39a9ce

    @python-dev
    Mannequin

    python-dev mannequin commented Aug 19, 2016

    New changeset 73b00fb1dc9d by Victor Stinner in branch 'default':
    Cleanup call_method() and call_maybe()
    https://hg.python.org/cpython/rev/73b00fb1dc9d

    New changeset 8e085070ab28 by Victor Stinner in branch 'default':
    call_method() and call_maybe() now use fast call
    https://hg.python.org/cpython/rev/8e085070ab28

    New changeset 2d2bc1906b5b by Victor Stinner in branch 'default':
    Issue bpo-27128: Cleanup slot_sq_item()
    https://hg.python.org/cpython/rev/2d2bc1906b5b

    New changeset 6eb586b85fa1 by Victor Stinner in branch 'default':
    Issue bpo-27128: slot_sq_item() uses fast call
    https://hg.python.org/cpython/rev/6eb586b85fa1

    New changeset 605a42a50496 by Victor Stinner in branch 'default':
    Issue bpo-27128: Cleanup slot_nb_bool()
    https://hg.python.org/cpython/rev/605a42a50496

    New changeset 6a21b6599692 by Victor Stinner in branch 'default':
    slot_nb_bool() now uses fast call
    https://hg.python.org/cpython/rev/6a21b6599692

    New changeset 45d2b5c12b19 by Victor Stinner in branch 'default':
    slot_tp_iter() now uses fast call
    https://hg.python.org/cpython/rev/45d2b5c12b19

    New changeset 124d5d0ef81f by Victor Stinner in branch 'default':
    calliter_iternext() now uses fast call
    https://hg.python.org/cpython/rev/124d5d0ef81f

    New changeset 71c22e592a9b by Victor Stinner in branch 'default':
    keyobject_richcompare() now uses fast call
    https://hg.python.org/cpython/rev/71c22e592a9b

    @python-dev
    Mannequin

    python-dev mannequin commented Aug 19, 2016

    New changeset 3ab32f7add6e by Victor Stinner in branch 'default':
    Issue bpo-27128: _pickle uses fast call
    https://hg.python.org/cpython/rev/3ab32f7add6e

    @vstinner
    Member Author

    Ok, I updated the most simple forms of function calls.

    I will open new issues for more complex calls and more sensitive parts of
    the code like ceval.c.

    Buildbots seem to be happy.

    @python-dev
    Mannequin

    python-dev mannequin commented Aug 19, 2016

    New changeset c2af917bde71 by Victor Stinner in branch 'default':
    PyFile_WriteObject() now uses fast call
    https://hg.python.org/cpython/rev/c2af917bde71

    New changeset 0da1ce362d15 by Victor Stinner in branch 'default':
    import_name() now uses fast call
    https://hg.python.org/cpython/rev/0da1ce362d15

    New changeset e5b24f595235 by Victor Stinner in branch 'default':
    PyErr_PrintEx() now uses fast call
    https://hg.python.org/cpython/rev/e5b24f595235

    New changeset 154f78d387f9 by Victor Stinner in branch 'default':
    call_trampoline() now uses fast call
    https://hg.python.org/cpython/rev/154f78d387f9

    New changeset 351b987d6d1c by Victor Stinner in branch 'default':
    sys_pyfile_write_unicode() now uses fast call
    https://hg.python.org/cpython/rev/351b987d6d1c

    New changeset abb93035ebb7 by Victor Stinner in branch 'default':
    _elementtree: deepcopy() now uses fast call
    https://hg.python.org/cpython/rev/abb93035ebb7

    New changeset 2954d2aa4c90 by Victor Stinner in branch 'default':
    pattern_subx() now uses fast call
    https://hg.python.org/cpython/rev/2954d2aa4c90

    @vstinner
    Member Author

    I created two new issues:

    • issue bpo-27809: _PyObject_FastCall(): add support for keyword arguments
    • issue bpo-27810: Add METH_FASTCALL: new calling convention for C functions

    @scoder
    Contributor

    scoder commented Aug 22, 2016

    FYI: I copied your (no-kwargs) implementation over into Cython and I get around 17% faster calls to Python functions with 2 positional arguments.

    @vstinner
    Member Author

    FYI: I copied your (no-kwargs) implementation over into Cython and I get around 17% faster calls to Python functions with 2 positional arguments.

    Hey, cool! It's always cool to get a performance enhancement without having to break the C API or modify source code :-)

    What do you mean by "I copied your (no-kwargs) implementation"? The whole giant patch? Or just a few changes? Which changes?

    @python-dev
    Mannequin

    python-dev mannequin commented Aug 22, 2016

    New changeset 7dd85b19c873 by Victor Stinner in branch 'default':
    Optimize call to Python function without argument
    https://hg.python.org/cpython/rev/7dd85b19c873

    @serhiy-storchaka
    Member

    The problem is that passing keyword arguments as a dict is not the most efficient way, due to the overhead of creating a dict. For now, keyword arguments are pushed on the stack as an interlaced array of keyword names and values. It may be more efficient to push values and names as continuous arrays (bpo-27213). PyArg_ParseTupleAndKeywords() accepts a tuple and a dict, but the private function _PyArg_ParseTupleAndKeywordsFast() (bpo-27574) can be changed to accept positional and keyword arguments as continuous arrays: (int nargs, PyObject **args, int nkwargs, PyObject **kwnames, PyObject **kwargs). Therefore we will be forced either to change the signature of _PyObject_FastCall() and the meaning of METH_FASTCALL, or to add new _PyObject_FastCallKw() and METH_FASTCALLKW to support fast passing of keyword arguments. Or maybe also add _PyObject_FastCallNoKw() for faster passing of only positional arguments without the overhead of _PyObject_FastCall(), and make the older _PyObject_FastCall() and METH_FASTCALL obsolete.

    There is yet another possibility. Argument Clinic can generate a dict that maps keyword argument names to indices of arguments and tie it to a function. External code should map names to indices using this dictionary and pass arguments as just a continuous array to a function with METH_FASTCALL (raising an error if some argument is passed as both positional and keyword, or if a keyword-only argument is passed as positional, etc). In that case the kwargs parameter of _PyObject_FastCall() becomes obsolete too.

    @vstinner
    Member Author

    Serhiy: I "moved" your msg273365 to the issue bpo-27809.

    @scoder
    Contributor

    scoder commented Aug 22, 2016

    What do you mean by "I copied your (no-kwargs) implementation"?

    I copied what you committed into CPython for _PyFunction_FastCall():

    cython/cython@8f3d3bd

    and then enabled its usage in a couple of places:

    cython/cython@a3cfec8

    especially for all function/method calls that we generate for user code:

    cython/cython@a51df33

    Note that PyMethod objects get unpacked into function+self right before the PyFunction_Check(), so the tuple avoidance optimisation also applies to Python method calls.

    @vstinner
    Member Author

    Ok, I see much better with concrete commits. I'm really happy that
    Cython also benefits from these enhancements.

    Note: handling keywords is likely to change quickly ;-)

    @vstinner
    Member Author

    vstinner commented Sep 1, 2016

    The main feature (_PyFunction_FastCall()) has been merged. Supporting keyword arguments is now handled by another issue (see issue bpo-27830). I'm closing this issue.

    @vstinner vstinner closed this as completed Sep 1, 2016
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022