Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-98003: Inline call frames for CALL_FUNCTION_EX #98004

Merged
merged 20 commits into from Apr 30, 2023

Conversation

Fidget-Spinner
Copy link
Member

@Fidget-Spinner Fidget-Spinner commented Oct 6, 2022

@brandtbucher
Copy link
Member

Looks like not much difference:

Benchmarks with tag 'apps':
===========================

Slower (2):
- html5lib: 59.3 ms +- 2.4 ms -> 60.4 ms +- 2.4 ms: 1.02x slower
- tornado_http: 92.9 ms +- 1.3 ms -> 93.4 ms +- 1.4 ms: 1.01x slower

Faster (1):
- 2to3: 247 ms +- 1 ms -> 246 ms +- 0 ms: 1.00x faster

Benchmark hidden because not significant (1): chameleon

Geometric mean: 1.01x slower

Benchmarks with tag 'math':
===========================

Slower (1):
- pidigits: 193 ms +- 0 ms -> 199 ms +- 0 ms: 1.03x slower

Faster (2):
- nbody: 96.2 ms +- 2.0 ms -> 94.9 ms +- 1.9 ms: 1.01x faster
- float: 71.7 ms +- 0.9 ms -> 71.3 ms +- 1.0 ms: 1.01x faster

Geometric mean: 1.00x slower

Benchmarks with tag 'regex':
============================

Slower (2):
- regex_effbot: 3.33 ms +- 0.02 ms -> 3.39 ms +- 0.02 ms: 1.02x slower
- regex_v8: 21.1 ms +- 0.1 ms -> 21.3 ms +- 0.1 ms: 1.01x slower

Faster (2):
- regex_dna: 200 ms +- 1 ms -> 198 ms +- 1 ms: 1.01x faster
- regex_compile: 128 ms +- 1 ms -> 127 ms +- 1 ms: 1.01x faster

Geometric mean: 1.00x slower

Benchmarks with tag 'serialize':
================================

Slower (2):
- pickle_list: 3.92 us +- 0.03 us -> 4.11 us +- 0.10 us: 1.05x slower
- pickle_dict: 30.3 us +- 0.1 us -> 31.1 us +- 0.1 us: 1.03x slower

Faster (3):
- xml_etree_iterparse: 105 ms +- 2 ms -> 101 ms +- 1 ms: 1.03x faster
- pickle: 10.3 us +- 0.3 us -> 10.1 us +- 0.1 us: 1.02x faster
- xml_etree_process: 53.2 ms +- 0.5 ms -> 52.4 ms +- 0.5 ms: 1.01x faster

Benchmark hidden because not significant (8): json_dumps, json_loads, pickle_pure_python, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_generate

Geometric mean: 1.00x faster

Benchmarks with tag 'startup':
==============================

Slower (2):
- python_startup_no_site: 6.08 ms +- 0.01 ms -> 6.08 ms +- 0.01 ms: 1.00x slower
- python_startup: 8.45 ms +- 0.01 ms -> 8.45 ms +- 0.01 ms: 1.00x slower

Geometric mean: 1.00x slower

Benchmarks with tag 'template':
===============================

Slower (3):
- mako: 9.36 ms +- 0.05 ms -> 9.69 ms +- 0.04 ms: 1.04x slower
- genshi_xml: 48.5 ms +- 0.4 ms -> 49.5 ms +- 0.6 ms: 1.02x slower
- genshi_text: 20.9 ms +- 0.3 ms -> 21.2 ms +- 0.3 ms: 1.02x slower

Faster (1):
- django_template: 32.5 ms +- 0.4 ms -> 32.2 ms +- 0.3 ms: 1.01x faster

Geometric mean: 1.02x slower

All benchmarks:
===============

Slower (31):
- pickle_list: 3.92 us +- 0.03 us -> 4.11 us +- 0.10 us: 1.05x slower
- unpack_sequence: 45.7 ns +- 0.9 ns -> 47.9 ns +- 0.7 ns: 1.05x slower
- mako: 9.36 ms +- 0.05 ms -> 9.69 ms +- 0.04 ms: 1.04x slower
- deepcopy_reduce: 2.87 us +- 0.04 us -> 2.97 us +- 0.04 us: 1.04x slower
- go: 132 ms +- 1 ms -> 136 ms +- 1 ms: 1.03x slower
- deepcopy_memo: 33.2 us +- 0.3 us -> 34.3 us +- 0.4 us: 1.03x slower
- pidigits: 193 ms +- 0 ms -> 199 ms +- 0 ms: 1.03x slower
- pickle_dict: 30.3 us +- 0.1 us -> 31.1 us +- 0.1 us: 1.03x slower
- json: 4.62 ms +- 0.08 ms -> 4.73 ms +- 0.12 ms: 1.03x slower
- genshi_xml: 48.5 ms +- 0.4 ms -> 49.5 ms +- 0.6 ms: 1.02x slower
- html5lib: 59.3 ms +- 2.4 ms -> 60.4 ms +- 2.4 ms: 1.02x slower
- regex_effbot: 3.33 ms +- 0.02 ms -> 3.39 ms +- 0.02 ms: 1.02x slower
- logging_silent: 90.3 ns +- 0.9 ns -> 91.9 ns +- 1.8 ns: 1.02x slower
- richards: 43.0 ms +- 1.3 ms -> 43.6 ms +- 1.1 ms: 1.02x slower
- genshi_text: 20.9 ms +- 0.3 ms -> 21.2 ms +- 0.3 ms: 1.02x slower
- generators: 73.3 ms +- 0.5 ms -> 74.3 ms +- 0.2 ms: 1.01x slower
- crypto_pyaes: 73.0 ms +- 0.6 ms -> 73.9 ms +- 0.5 ms: 1.01x slower
- deepcopy: 324 us +- 3 us -> 327 us +- 4 us: 1.01x slower
- sqlglot_parse: 1.32 ms +- 0.01 ms -> 1.33 ms +- 0.01 ms: 1.01x slower
- logging_format: 6.32 us +- 0.07 us -> 6.38 us +- 0.12 us: 1.01x slower
- regex_v8: 21.1 ms +- 0.1 ms -> 21.3 ms +- 0.1 ms: 1.01x slower
- thrift: 736 us +- 13 us -> 743 us +- 17 us: 1.01x slower
- chaos: 67.4 ms +- 0.8 ms -> 67.9 ms +- 0.9 ms: 1.01x slower
- sympy_sum: 162 ms +- 2 ms -> 163 ms +- 1 ms: 1.01x slower
- tornado_http: 92.9 ms +- 1.3 ms -> 93.4 ms +- 1.4 ms: 1.01x slower
- coroutines: 23.6 ms +- 0.2 ms -> 23.8 ms +- 0.1 ms: 1.01x slower
- aiohttp: 1.00 ms +- 0.00 ms -> 1.01 ms +- 0.00 ms: 1.01x slower
- dulwich_log: 60.9 ms +- 0.4 ms -> 61.1 ms +- 0.4 ms: 1.00x slower
- gunicorn: 1.09 ms +- 0.01 ms -> 1.09 ms +- 0.01 ms: 1.00x slower
- python_startup_no_site: 6.08 ms +- 0.01 ms -> 6.08 ms +- 0.01 ms: 1.00x slower
- python_startup: 8.45 ms +- 0.01 ms -> 8.45 ms +- 0.01 ms: 1.00x slower

Faster (29):
- mdp: 2.70 sec +- 0.01 sec -> 2.53 sec +- 0.01 sec: 1.07x faster
- fannkuch: 382 ms +- 4 ms -> 366 ms +- 5 ms: 1.05x faster
- nqueens: 80.4 ms +- 1.0 ms -> 77.7 ms +- 1.2 ms: 1.03x faster
- xml_etree_iterparse: 105 ms +- 2 ms -> 101 ms +- 1 ms: 1.03x faster
- scimark_sparse_mat_mult: 4.10 ms +- 0.08 ms -> 4.00 ms +- 0.05 ms: 1.03x faster
- spectral_norm: 94.1 ms +- 2.9 ms -> 91.8 ms +- 1.2 ms: 1.03x faster
- telco: 6.50 ms +- 0.11 ms -> 6.36 ms +- 0.12 ms: 1.02x faster
- scimark_fft: 311 ms +- 5 ms -> 305 ms +- 3 ms: 1.02x faster
- pickle: 10.3 us +- 0.3 us -> 10.1 us +- 0.1 us: 1.02x faster
- xml_etree_process: 53.2 ms +- 0.5 ms -> 52.4 ms +- 0.5 ms: 1.01x faster
- pathlib: 17.7 ms +- 0.2 ms -> 17.5 ms +- 0.3 ms: 1.01x faster
- nbody: 96.2 ms +- 2.0 ms -> 94.9 ms +- 1.9 ms: 1.01x faster
- sympy_str: 282 ms +- 6 ms -> 279 ms +- 4 ms: 1.01x faster
- regex_dna: 200 ms +- 1 ms -> 198 ms +- 1 ms: 1.01x faster
- scimark_sor: 103 ms +- 2 ms -> 102 ms +- 1 ms: 1.01x faster
- async_tree_none: 517 ms +- 10 ms -> 512 ms +- 9 ms: 1.01x faster
- django_template: 32.5 ms +- 0.4 ms -> 32.2 ms +- 0.3 ms: 1.01x faster
- sympy_expand: 453 ms +- 7 ms -> 450 ms +- 6 ms: 1.01x faster
- async_tree_io: 1.31 sec +- 0.01 sec -> 1.30 sec +- 0.01 sec: 1.01x faster
- meteor_contest: 103 ms +- 1 ms -> 102 ms +- 2 ms: 1.01x faster
- regex_compile: 128 ms +- 1 ms -> 127 ms +- 1 ms: 1.01x faster
- scimark_monte_carlo: 66.2 ms +- 1.0 ms -> 65.8 ms +- 0.9 ms: 1.01x faster
- float: 71.7 ms +- 0.9 ms -> 71.3 ms +- 1.0 ms: 1.01x faster
- sqlite_synth: 2.59 us +- 0.02 us -> 2.58 us +- 0.03 us: 1.00x faster
- 2to3: 247 ms +- 1 ms -> 246 ms +- 0 ms: 1.00x faster
- hexiom: 5.99 ms +- 0.04 ms -> 5.97 ms +- 0.04 ms: 1.00x faster
- sqlglot_normalize: 104 ms +- 1 ms -> 104 ms +- 1 ms: 1.00x faster
- sympy_integrate: 20.3 ms +- 0.1 ms -> 20.3 ms +- 0.1 ms: 1.00x faster
- sqlglot_optimize: 50.9 ms +- 0.3 ms -> 50.8 ms +- 0.2 ms: 1.00x faster

Benchmark hidden because not significant (26): async_tree_cpu_io_mixed, async_tree_memoization, chameleon, coverage, deltablue, flaskblogging, json_dumps, json_loads, logging_simple, mypy, pickle_pure_python, pprint_safe_repr, pprint_pformat, pycparser, pyflate, pylint, raytrace, scimark_lu, sqlalchemy_declarative, sqlalchemy_imperative, sqlglot_transpile, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_generate

Geometric mean: 1.00x slower

Perhaps we should try inlining all CALL_FUNCTION_EX, rather than specializing? Your call.

@Fidget-Spinner Fidget-Spinner changed the title gh-98003: Specialize CALL_FUNCTION_EX gh-98003: Inline call frames for CALL_FUNCTION_EX Oct 7, 2022
@Fidget-Spinner
Copy link
Member Author

@markshannon gentle ping for a review please.

@markshannon
Copy link
Member

The approach I would prefer is to:

  1. Refactor PyFunction_Type.tp_call so that it creates a frame, then calls PyEval_EvalFrame()
  2. Use the factored out code for creating the frame in CALL_FUNCTION_EX

There is a lot unnecessary dismantling of tuples and dicts in the current approach. This is not a problem with this PR, but I think it needs to be sorted out before "inlining" the call in CALL_FUNCTION_EX

Ideally CALL_FUNCTION_EX would look something like this:

CALL_FUNCTION_EX:
   if (Py_IS_TYPE(func, &PyFunction_Type) &&
        tstate->interp->eval_frame == NULL &&
        ((PyFunctionObject *)func)->vectorcall == _PyFunction_Vectorcall) {
                _PyInterpreterFrame *new_frame = _PyEvalFramePushAndInit_TD(
                    tstate, (PyFunctionObject *)function, locals, callargs, kwargs);
                );
                Py_DECREF(callargs);
                Py_DECREF(kwargs);
                if (new_frame == NULL) {
                    goto error;
                }
                /* Enter frame */
               ...
   }
   ...

Where _PyEvalFramePushAndInit_TD pushes the frame, and calls the factored out function for initializing the frame from *args, **kwargs.

@Fidget-Spinner
Copy link
Member Author

The approach I would prefer is to:

  1. Refactor PyFunction_Type.tp_call so that it creates a frame, then calls PyEval_EvalFrame()
  2. Use the factored out code for creating the frame in CALL_FUNCTION_EX

I'm confused. The first step is to create a frame, but then in your example code you show it creating another frame again. Why does it do that?

@markshannon
Copy link
Member

Because we don't really create a frame. We push a frame and then initialize it.

Pushing a frame is simple. It is initializing that is complex and we want to factor out.

@brandtbucher
Copy link
Member

Just curious, has this stalled out for now? At the very least, the changes in ceval.c will need to move to the new bytecodes.c, and you'll probably want to use the new DISPATCH_INLINED macro for inlining the call.

Totally get it if you're busy with school or something right now though! :) Just making sure that you aren't blocked on reviews or design discussions. If so, maybe we can move this forward (I'd really like to see it in)!

@Fidget-Spinner
Copy link
Member Author

Totally get it if you're busy with school or something right now though! :) Just making sure that you aren't blocked on reviews or design discussions. If so, maybe we can move this forward (I'd really like to see it in)!

Yeah my exams end next week. I plan to start contributing actively again after then!

@Fidget-Spinner
Copy link
Member Author

@brandtbucher and @markshannon I think this is ready for your reviews (when you have time of course).

Copy link
Member

@markshannon markshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach creates a lot of intermediate objects.
I'm not sure if that is a result of the transformations in this PR, or if the original code also did created those intermediate objects.

If the objects were already being created, then maybe we should streamline it in another PR.
@Fidget-Spinner what do you think?

int code_flags = ((PyCodeObject *)PyFunction_GET_CODE(func))->co_flags;
PyObject *locals = code_flags & CO_OPTIMIZED ? NULL : Py_NewRef(PyFunction_GET_GLOBALS(func));

_PyInterpreterFrame *new_frame = _PyEvalFramePushAndInit_Ex(tstate,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to move the unpacking of the tuple into _PyEvalFramePushAndInit_Ex.

_PyInterpreterFrame *new_frame = _PyEvalFramePushAndInit_Ex(tstate, func, locals, callargs, kwargs).

CALL_FUNCTION_EX is quite a large instruction. Pushing more code into _PyEvalFramePushAndInit_Ex would keep it smaller.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unpacking which tuple?

If you mean move the code above in, that would break convention with the rest of the code base. The other inline calls do these

                int code_flags = ((PyCodeObject *)PyFunction_GET_CODE(func))->co_flags;
                PyObject *locals = code_flags & CO_OPTIMIZED ? NULL : Py_NewRef(PyFunction_GET_GLOBALS(func));

before creating the new frame.

Py_INCREF(PyTuple_GET_ITEM(callargs, i));
}
}
_PyInterpreterFrame *new_frame = _PyEvalFramePushAndInit(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than transforming callargs and kwargs into a shape suitable for _PyEvalFramePushAndInit, it might make sense to push the frame, then unpack the tuple and dict into the the frame without the intermediate objects.
It complicates the errror handling a bit, and the frame will need to be cleaned up if there is an error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can work on this in another PR.

@markshannon
Copy link
Member

Thanks for this PR. Avoiding C stack consumption for these calls is definitely something we want.

@Fidget-Spinner
Copy link
Member Author

This approach creates a lot of intermediate objects.
I'm not sure if that is a result of the transformations in this PR, or if the original code also did created those intermediate objects.

If the objects were already being created, then maybe we should streamline it in another PR.
@Fidget-Spinner what do you think?

Yes these are the semantics of the original CALL_FUNCTION_EX. Most of these were copy pasted. I was surprised at how allocation heavy CALL_FUNCTION_EX is and have been thinking of ways to improve it.

Copy link
Member

@brandtbucher brandtbucher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an expert on the call sequence, but the code generally looks good to me! I'd be excited to see this in 3.12.

I'm not sure how this interacts with PEP 669, but I'm sure @markshannon can answer that pretty easily.

Also, I'd kinda like to see buildbots (for correctness) and benchmarks (for curiousity) on this before merging. I can run the benchmarks for you whenever you think it's ready.

Python/ceval.c Outdated Show resolved Hide resolved
Python/ceval.c Outdated
PyObject *const *newargs = has_dict
? _PyStack_UnpackDict(tstate, _PyTuple_ITEMS(callargs),
nargs, kwargs, &kwnames)
: &PyTuple_GET_ITEM(callargs, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this safe if len(callargs) == 0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be. When len(callargs) == 0 the args shouldn't be used by the vectorcall protocol.

Fidget-Spinner and others added 2 commits April 27, 2023 17:19
@Fidget-Spinner
Copy link
Member Author

Not sure why all the #line directives in the generated_cases.c.h file were removed after I ran the script. Does anyone know?

@Fidget-Spinner
Copy link
Member Author

Also, I'd kinda like to see buildbots (for correctness) and benchmarks (for curiousity) on this before merging. I can run the benchmarks for you whenever you think it's ready.

You can run the benchmarks but I expect no change sadly. Extended function calls are used rarely in pyperformance.

@Fidget-Spinner Fidget-Spinner added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Apr 27, 2023
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @Fidget-Spinner for commit dfd6b01 🤖

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Apr 27, 2023
@brandtbucher
Copy link
Member

Not sure why all the #line directives in the generated_cases.c.h file were removed after I ran the script. Does anyone know?

The regen script needs to be run with --emit-line-directives. I think the Makefile was updated to use the new option, but maybe the Windows build scripts weren't updated with the new option (which is off by default). Maybe you know the right place to add it?

You can run the benchmarks but I expect no change sadly. Extended function calls are used rarely in pyperformance.

Understood. I just kicked off a run anyways. :)

Python/bytecodes.c Outdated Show resolved Hide resolved
Co-authored-by: Carl Meyer <carl@oddbird.net>
@brandtbucher
Copy link
Member

Yep, no real change on pyperformance (as expected).

@arhadthedev
Copy link
Member

Removing sprint because this issue missed the ones in 2022 and 2023.

@Fidget-Spinner
Copy link
Member Author

Removing sprint because this issue missed the ones in 2022 and 2023.

Yeah this issue was added in the 2022 CPython core dev sprint which us some ways back now.

@Fidget-Spinner Fidget-Spinner merged commit ed95e8c into python:main Apr 30, 2023
20 checks passed
@Fidget-Spinner Fidget-Spinner deleted the call_function_ex_inline_3 branch April 30, 2023 13:08
@Fidget-Spinner
Copy link
Member Author

Thanks everyone!

carljm added a commit to carljm/cpython that referenced this pull request May 1, 2023
* main: (26 commits)
  pythongh-104028: Reduce object creation while calling callback function from gc (pythongh-104030)
  pythongh-104036: Fix direct invocation of test_typing (python#104037)
  pythongh-102213: Optimize the performance of `__getattr__` (pythonGH-103761)
  pythongh-103895: Improve how invalid `Exception.__notes__` are displayed (python#103897)
  Adjust expression from `==` to `!=` in alignment with the meaning of the paragraph. (pythonGH-104021)
  pythongh-88496: Fix IDLE test hang on macOS (python#104025)
  Improve int test coverage (python#104024)
  pythongh-88773: Added teleport method to Turtle library (python#103974)
  pythongh-104015: Fix direct invocation of `test_dataclasses` (python#104017)
  pythongh-104012: Ensure test_calendar.CalendarTestCase.test_deprecation_warning consistently passes (python#104014)
  pythongh-103977: compile re expressions in platform.py only if required (python#103981)
  pythongh-98003: Inline call frames for CALL_FUNCTION_EX (pythonGH-98004)
  Replace Netlify with Read the Docs build previews (python#103843)
  Update name in acknowledgements and add mailmap (python#103696)
  pythongh-82054: allow test runner to split test_asyncio to execute in parallel by sharding. (python#103927)
  Remove non-existing tools from Sundry skiplist (python#103991)
  pythongh-103793: Defer formatting task name (python#103767)
  pythongh-87092: change assembler to use instruction sequence instead of CFG (python#103933)
  pythongh-103636: issue warning for deprecated calendar constants (python#103833)
  Various small fixes to dis docs (python#103923)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants