gh-115754: Get singletons via function calls #116572

vstinner · 2024-03-10T20:11:04Z

Issue: C API: Implement singletons as functions calls in the stable ABI 3.13 #115754

vstinner · 2024-03-10T20:11:44Z

PR created to measure the impact on performance of getting the 5 singletons via function calls.

vstinner · 2024-03-11T08:01:48Z

Patch adding a benchmark function:

diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index b6536045e6..03dcbd0cb9 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -3284,6 +3284,39 @@ test_weakref_capi(PyObject *Py_UNUSED(module), PyObject *Py_UNUSED(args))
     _Py_COMP_DIAG_POP
 }
 
+static PyObject *
+bench_none(PyObject *Py_UNUSED(module), PyObject *args)
+{
+    Py_ssize_t loops;
+    if (!PyArg_ParseTuple(args, "n", &loops)) {
+        return NULL;
+    }
+
+    PyTime_t t1;
+    (void)PyTime_PerfCounter(&t1);
+    for (Py_ssize_t i=0; i < loops; i++) {
+        PyObject *obj;
+        int res = 0;
+
+        obj = Py_None;
+        res += Py_IsNone(obj) + Py_IsFalse(obj) + Py_IsTrue(obj);
+
+        obj = Py_False;
+        res += Py_IsNone(obj) + Py_IsFalse(obj) + Py_IsTrue(obj);
+
+        obj = Py_True;
+        res += Py_IsNone(obj) + Py_IsFalse(obj) + Py_IsTrue(obj);
+
+        if (res != 3) {
+            abort();
+        }
+    }
+    PyTime_t t2;
+    (void)PyTime_PerfCounter(&t2);
+
+    return PyFloat_FromDouble(PyTime_AsSecondsDouble(t2 - t1));
+}
+
 
 static PyMethodDef TestMethods[] = {
     {"set_errno",               set_errno,                       METH_VARARGS},
@@ -3423,6 +3456,7 @@ static PyMethodDef TestMethods[] = {
     {"function_set_kw_defaults", function_set_kw_defaults, METH_VARARGS, NULL},
     {"check_pyimport_addmodule", check_pyimport_addmodule, METH_VARARGS},
     {"test_weakref_capi", test_weakref_capi, METH_NOARGS},
+    {"bench_none", bench_none, METH_VARARGS},
     {NULL, NULL} /* sentinel */
 };

Benchmark script:

import pyperf
import _testcapi
runner = pyperf.Runner()
runner.bench_time_func('bench', _testcapi.bench_none)

Results of the microbenchmark with CPU isolation on Linux. Python built with -O3 without PGO nor LTO. PGO/LTO doesn't matter here, since _testcapi is built as a shared library: Py_None becomes a regular function call, there is nothing to optimize.

$ python3 -m pyperf  compare_to ref.json func.json 
Mean +- std dev: [ref] 25.8 ns +- 0.0 ns -> [func] 56.3 ns +- 0.1 ns: 2.18x slower

Converting Py_None constant to a function call makes accessing this variable around 2x slower. Well, that's not surprising, and that's the maximum overhead on a microbenchmark.

vstinner · 2024-03-11T08:09:13Z

I also ran pyperformance 1.11.0, this time with PGO+LTO and CPU isolation on Linux. My shell script run to run pyperformance:

set -e -x

cd main
git clean -fdx
rm -rf /opt/bench_python

./configure --enable-optimizations --with-lto --prefix /opt/bench_python
make
make install
cd ..

/opt/bench_python/bin/python3 -m pip install pyperformance
/opt/bench_python/bin/python3 -m pyperformance run -o bench.json

Results.

Benchmarks with tag 'apps':
===========================

+----------------+----------+------------------------+
| Benchmark      | ref      | func                   |
+================+==========+========================+
| 2to3           | 429 ms   | 434 ms: 1.01x slower   |
+----------------+----------+------------------------+
| chameleon      | 11.0 ms  | 11.0 ms: 1.01x slower  |
+----------------+----------+------------------------+
| docutils       | 4.24 sec | 4.23 sec: 1.00x faster |
+----------------+----------+------------------------+
| tornado_http   | 194 ms   | 195 ms: 1.01x slower   |
+----------------+----------+------------------------+
| Geometric mean | (ref)    | 1.01x slower           |
+----------------+----------+------------------------+

Benchmark hidden because not significant (1): html5lib

Benchmarks with tag 'asyncio':
==============================

+-------------------------------+----------+------------------------+
| Benchmark                     | ref      | func                   |
+===============================+==========+========================+
| async_tree_io_tg              | 1.61 sec | 1.62 sec: 1.01x slower |
+-------------------------------+----------+------------------------+
| async_tree_eager_cpu_io_mixed | 669 ms   | 674 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| async_tree_io                 | 1.60 sec | 1.62 sec: 1.01x slower |
+-------------------------------+----------+------------------------+
| async_tree_eager_io_tg        | 1.67 sec | 1.69 sec: 1.01x slower |
+-------------------------------+----------+------------------------+
| async_tree_eager_tg           | 124 ms   | 125 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| async_tree_none_tg            | 655 ms   | 665 ms: 1.02x slower   |
+-------------------------------+----------+------------------------+
| async_tree_memoization_tg     | 797 ms   | 811 ms: 1.02x slower   |
+-------------------------------+----------+------------------------+
| Geometric mean                | (ref)    | 1.01x slower           |
+-------------------------------+----------+------------------------+

Benchmark hidden because not significant (9): async_tree_eager_cpu_io_mixed_tg, async_tree_eager, async_tree_eager_memoization_tg, async_tree_eager_memoization, async_tree_cpu_io_mixed, async_tree_cpu_io_mixed_tg, async_tree_eager_io, async_tree_memoization, async_tree_none

Benchmarks with tag 'math':
===========================

+----------------+--------+----------------------+
| Benchmark      | ref    | func                 |
+================+========+======================+
| nbody          | 132 ms | 128 ms: 1.03x faster |
+----------------+--------+----------------------+
| float          | 125 ms | 126 ms: 1.01x slower |
+----------------+--------+----------------------+
| Geometric mean | (ref)  | 1.01x faster         |
+----------------+--------+----------------------+

Benchmark hidden because not significant (1): pidigits

Benchmarks with tag 'regex':
============================

+----------------+---------+-----------------------+
| Benchmark      | ref     | func                  |
+================+=========+=======================+
| regex_v8       | 37.5 ms | 37.1 ms: 1.01x faster |
+----------------+---------+-----------------------+
| Geometric mean | (ref)   | 1.00x faster          |
+----------------+---------+-----------------------+

Benchmark hidden because not significant (3): regex_effbot, regex_compile, regex_dna

Benchmarks with tag 'serialize':
================================

+-------------------+---------+-----------------------+
| Benchmark         | ref     | func                  |
+===================+=========+=======================+
| unpickle_list     | 7.63 us | 7.53 us: 1.01x faster |
+-------------------+---------+-----------------------+
| xml_etree_process | 97.8 ms | 98.6 ms: 1.01x slower |
+-------------------+---------+-----------------------+
| json_loads        | 40.8 us | 41.3 us: 1.01x slower |
+-------------------+---------+-----------------------+
| xml_etree_parse   | 238 ms  | 241 ms: 1.01x slower  |
+-------------------+---------+-----------------------+
| json_dumps        | 15.5 ms | 16.5 ms: 1.06x slower |
+-------------------+---------+-----------------------+
| pickle            | 17.6 us | 19.7 us: 1.12x slower |
+-------------------+---------+-----------------------+
| pickle_dict       | 43.1 us | 50.2 us: 1.17x slower |
+-------------------+---------+-----------------------+
| pickle_list       | 6.64 us | 7.87 us: 1.19x slower |
+-------------------+---------+-----------------------+
| Geometric mean    | (ref)   | 1.04x slower          |
+-------------------+---------+-----------------------+

Benchmark hidden because not significant (6): unpickle, tomli_loads, unpickle_pure_python, xml_etree_generate, pickle_pure_python, xml_etree_iterparse

Benchmarks with tag 'startup':
==============================

+------------------------+---------+-----------------------+
| Benchmark              | ref     | func                  |
+========================+=========+=======================+
| python_startup         | 15.4 ms | 15.5 ms: 1.00x slower |
+------------------------+---------+-----------------------+
| python_startup_no_site | 13.6 ms | 13.6 ms: 1.01x slower |
+------------------------+---------+-----------------------+
| Geometric mean         | (ref)   | 1.00x slower          |
+------------------------+---------+-----------------------+

Benchmarks with tag 'template':
===============================

+-----------------+---------+-----------------------+
| Benchmark       | ref     | func                  |
+=================+=========+=======================+
| mako            | 16.1 ms | 16.2 ms: 1.01x slower |
+-----------------+---------+-----------------------+
| genshi_xml      | 83.8 ms | 84.6 ms: 1.01x slower |
+-----------------+---------+-----------------------+
| django_template | 56.5 ms | 57.4 ms: 1.01x slower |
+-----------------+---------+-----------------------+
| genshi_text     | 35.6 ms | 36.7 ms: 1.03x slower |
+-----------------+---------+-----------------------+
| Geometric mean  | (ref)   | 1.02x slower          |
+-----------------+---------+-----------------------+

All benchmarks:
===============

+-------------------------------+----------+------------------------+
| Benchmark                     | ref      | func                   |
+===============================+==========+========================+
| coverage                      | 499 ms   | 152 ms: 3.29x faster   |
+-------------------------------+----------+------------------------+
| dask                          | 885 ms   | 628 ms: 1.41x faster   |
+-------------------------------+----------+------------------------+
| logging_silent                | 166 ns   | 155 ns: 1.07x faster   |
+-------------------------------+----------+------------------------+
| nbody                         | 132 ms   | 128 ms: 1.03x faster   |
+-------------------------------+----------+------------------------+
| telco                         | 12.6 ms  | 12.3 ms: 1.03x faster  |
+-------------------------------+----------+------------------------+
| scimark_fft                   | 522 ms   | 511 ms: 1.02x faster   |
+-------------------------------+----------+------------------------+
| unpickle_list                 | 7.63 us  | 7.53 us: 1.01x faster  |
+-------------------------------+----------+------------------------+
| regex_v8                      | 37.5 ms  | 37.1 ms: 1.01x faster  |
+-------------------------------+----------+------------------------+
| pprint_pformat                | 2.45 sec | 2.42 sec: 1.01x faster |
+-------------------------------+----------+------------------------+
| logging_simple                | 9.51 us  | 9.43 us: 1.01x faster  |
+-------------------------------+----------+------------------------+
| pprint_safe_repr              | 1.20 sec | 1.19 sec: 1.01x faster |
+-------------------------------+----------+------------------------+
| generators                    | 45.4 ms  | 45.1 ms: 1.01x faster  |
+-------------------------------+----------+------------------------+
| coroutines                    | 35.4 ms  | 35.2 ms: 1.01x faster  |
+-------------------------------+----------+------------------------+
| docutils                      | 4.24 sec | 4.23 sec: 1.00x faster |
+-------------------------------+----------+------------------------+
| sympy_integrate               | 32.3 ms  | 32.4 ms: 1.00x slower  |
+-------------------------------+----------+------------------------+
| python_startup                | 15.4 ms  | 15.5 ms: 1.00x slower  |
+-------------------------------+----------+------------------------+
| sympy_sum                     | 254 ms   | 255 ms: 1.00x slower   |
+-------------------------------+----------+------------------------+
| dulwich_log                   | 129 ms   | 129 ms: 1.00x slower   |
+-------------------------------+----------+------------------------+
| sqlglot_optimize              | 86.5 ms  | 86.9 ms: 1.00x slower  |
+-------------------------------+----------+------------------------+
| sqlglot_normalize             | 172 ms   | 173 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| richards_super                | 83.0 ms  | 83.4 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| python_startup_no_site        | 13.6 ms  | 13.6 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| chameleon                     | 11.0 ms  | 11.0 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| asyncio_tcp_ssl               | 2.36 sec | 2.37 sec: 1.01x slower |
+-------------------------------+----------+------------------------+
| asyncio_tcp                   | 689 ms   | 694 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| async_tree_io_tg              | 1.61 sec | 1.62 sec: 1.01x slower |
+-------------------------------+----------+------------------------+
| mako                          | 16.1 ms  | 16.2 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| scimark_monte_carlo           | 99.3 ms  | 100.0 ms: 1.01x slower |
+-------------------------------+----------+------------------------+
| async_tree_eager_cpu_io_mixed | 669 ms   | 674 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| float                         | 125 ms   | 126 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| xml_etree_process             | 97.8 ms  | 98.6 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| deepcopy                      | 553 us   | 557 us: 1.01x slower   |
+-------------------------------+----------+------------------------+
| async_generators              | 686 ms   | 692 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| tornado_http                  | 194 ms   | 195 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| mdp                           | 3.95 sec | 3.99 sec: 1.01x slower |
+-------------------------------+----------+------------------------+
| genshi_xml                    | 83.8 ms  | 84.6 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| go                            | 222 ms   | 224 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| async_tree_io                 | 1.60 sec | 1.62 sec: 1.01x slower |
+-------------------------------+----------+------------------------+
| async_tree_eager_io_tg        | 1.67 sec | 1.69 sec: 1.01x slower |
+-------------------------------+----------+------------------------+
| async_tree_eager_tg           | 124 ms   | 125 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| nqueens                       | 121 ms   | 122 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| hexiom                        | 8.78 ms  | 8.87 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| sympy_expand                  | 751 ms   | 759 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| 2to3                          | 429 ms   | 434 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| json_loads                    | 40.8 us  | 41.3 us: 1.01x slower  |
+-------------------------------+----------+------------------------+
| richards                      | 74.0 ms  | 74.9 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| bench_thread_pool             | 1.42 ms  | 1.43 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| xml_etree_parse               | 238 ms   | 241 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| raytrace                      | 402 ms   | 408 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| django_template               | 56.5 ms  | 57.4 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| deepcopy_reduce               | 4.89 us  | 4.96 us: 1.01x slower  |
+-------------------------------+----------+------------------------+
| async_tree_none_tg            | 655 ms   | 665 ms: 1.02x slower   |
+-------------------------------+----------+------------------------+
| crypto_pyaes                  | 103 ms   | 104 ms: 1.02x slower   |
+-------------------------------+----------+------------------------+
| deepcopy_memo                 | 57.9 us  | 58.8 us: 1.02x slower  |
+-------------------------------+----------+------------------------+
| sqlglot_transpile             | 2.50 ms  | 2.54 ms: 1.02x slower  |
+-------------------------------+----------+------------------------+
| sqlglot_parse                 | 1.99 ms  | 2.02 ms: 1.02x slower  |
+-------------------------------+----------+------------------------+
| async_tree_memoization_tg     | 797 ms   | 811 ms: 1.02x slower   |
+-------------------------------+----------+------------------------+
| spectral_norm                 | 155 ms   | 158 ms: 1.02x slower   |
+-------------------------------+----------+------------------------+
| pathlib                       | 26.5 ms  | 27.2 ms: 1.02x slower  |
+-------------------------------+----------+------------------------+
| scimark_sor                   | 199 ms   | 204 ms: 1.03x slower   |
+-------------------------------+----------+------------------------+
| genshi_text                   | 35.6 ms  | 36.7 ms: 1.03x slower  |
+-------------------------------+----------+------------------------+
| sqlite_synth                  | 3.73 us  | 3.90 us: 1.05x slower  |
+-------------------------------+----------+------------------------+
| typing_runtime_protocols      | 172 us   | 180 us: 1.05x slower   |
+-------------------------------+----------+------------------------+
| json_dumps                    | 15.5 ms  | 16.5 ms: 1.06x slower  |
+-------------------------------+----------+------------------------+
| pickle                        | 17.6 us  | 19.7 us: 1.12x slower  |
+-------------------------------+----------+------------------------+
| pickle_dict                   | 43.1 us  | 50.2 us: 1.17x slower  |
+-------------------------------+----------+------------------------+
| pickle_list                   | 6.64 us  | 7.87 us: 1.19x slower  |
+-------------------------------+----------+------------------------+
| Geometric mean                | (ref)    | 1.01x faster           |
+-------------------------------+----------+------------------------+

Benchmark hidden because not significant (35): unpickle, regex_effbot, gc_traversal, tomli_loads, meteor_contest, scimark_lu, logging_format, comprehensions, async_tree_eager_cpu_io_mixed_tg, regex_compile, async_tree_eager, unpack_sequence, bench_mp_pool, unpickle_pure_python, async_tree_eager_memoization_tg, pyflate, xml_etree_generate, deltablue, asyncio_websockets, fannkuch, pidigits, regex_dna, scimark_sparse_mat_mult, create_gc_cycles, html5lib, async_tree_eager_memoization, pickle_pure_python, xml_etree_iterparse, async_tree_cpu_io_mixed, sympy_str, chaos, async_tree_cpu_io_mixed_tg, async_tree_eager_io, async_tree_memoization, async_tree_none

In the limited C API version 3.13, getting Py_None, Py_False, Py_True, Py_Ellipsis and Py_NotImplemented singletons is now implemented as function calls at the stable ABI level to hide implementation details. Getting these constants still return borrowed references.

vstinner · 2024-03-11T12:44:44Z

@encukou @zooba: Would you mind to review my PR?

zooba · 2024-03-11T16:02:52Z

Is that benchmark using the macro Py_IsNone or the exported Py_IsNone? Do we see a difference between the two? Not that I'm at all concerned by the benchmark - the operation is trivially optimised out of a tight loop. But having the check occur on our side means we can more easily optimise for comparison if we ever change how "get" works.

Did you try with the single Py_GetSingleton(int) approach as well? I'd expect that to be slower again, but far more forwards/backwards compatible than individual functions. And behind the macro it still looks the same to users.

vstinner · 2024-03-11T16:32:45Z

Is that benchmark using the macro Py_IsNone or the exported Py_IsNone?

My microbenchmark used the macro.

vstinner · 2024-03-11T16:33:56Z

Did you try with the single Py_GetSingleton(int) approach as well?

I didn't try. I dislike Py_GetSingleton(int) since we need a slower switch/case, the function can fail (for unknown constant and invalid values), and we need to implement new constants (identifiers for these constants).

zooba · 2024-03-11T17:18:24Z

I think it's best to have limited API functions that do the checks inside of Python, so that callers who have an unknown object would pass it in and we would return a bool saying if it is/isn't that singleton (or in the Py_GetSingleton approach, they could get the singleton id for an object). That allows us to do faster checks, completely avoids the borrowed/stolen reference count issues, and avoids any potential not-really-singleton issues that may arise in the future or in other implementations. (This is similar to _PyUnicode_EqualToASCIIString which is an API that I like.)

Getting another instance of the singleton and running a normal comparison is like converting your char * to PyUnicodeObject and doing the comparison. Technically it's correct and more flexible, but it's also easy and faster for us to do the check on our side.

Edit So what I'm saying is, hopefully the code that is calling the new getters is also calling the function comparisons, not the macros. I assume the benchmarks would look the same right now since the comparison is still cheap, though.

vstinner · 2024-03-15T23:10:35Z

Alternative: PR gh-116883 "Add Py_GetConstantRef() function".

vstinner · 2024-03-18T21:25:59Z

Alternative: PR #116883 "Add Py_GetConstantRef() function".

Apparently, this approach is preferred. I close this PR.

bedevere-app bot mentioned this pull request Mar 10, 2024

C API: Implement singletons as functions calls in the stable ABI 3.13 #115754

Closed

vstinner force-pushed the getnone branch from 54b6f83 to f8fc1c5 Compare March 11, 2024 08:40

vstinner marked this pull request as ready for review March 11, 2024 08:40

vstinner requested a review from encukou as a code owner March 11, 2024 08:40

bedevere-app bot added the awaiting core review label Mar 11, 2024

vstinner changed the title ~~WIP: gh-115754: Get singletons via function calls~~ gh-115754: Get singletons via function calls Mar 11, 2024

vstinner mentioned this pull request Mar 11, 2024

gh-115754: Expose constants (ex: Py_None) as public ABI symbols (stable ABI) capi-workgroup/decisions#16

Closed

vstinner force-pushed the getnone branch from f8fc1c5 to 880278d Compare March 11, 2024 10:08

vstinner mentioned this pull request Mar 11, 2024

gh-115754: Export Py_None, Py_False, Py_True as symbols #115755

Closed

Fix the stable ABI

060283f

vstinner requested a review from a team as a code owner March 11, 2024 12:11

ericsnowcurrently self-requested a review March 11, 2024 16:00

vstinner closed this Mar 18, 2024

vstinner deleted the getnone branch March 18, 2024 21:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gh-115754: Get singletons via function calls #116572

gh-115754: Get singletons via function calls #116572

vstinner commented Mar 10, 2024 •

edited by bedevere-app bot

vstinner commented Mar 10, 2024

vstinner commented Mar 11, 2024 •

edited

vstinner commented Mar 11, 2024 •

edited

vstinner commented Mar 11, 2024

zooba commented Mar 11, 2024

vstinner commented Mar 11, 2024

vstinner commented Mar 11, 2024

zooba commented Mar 11, 2024 •

edited

vstinner commented Mar 15, 2024

vstinner commented Mar 18, 2024

gh-115754: Get singletons via function calls #116572

gh-115754: Get singletons via function calls #116572

Conversation

vstinner commented Mar 10, 2024 • edited by bedevere-app bot

vstinner commented Mar 10, 2024

vstinner commented Mar 11, 2024 • edited

vstinner commented Mar 11, 2024 • edited

vstinner commented Mar 11, 2024

zooba commented Mar 11, 2024

vstinner commented Mar 11, 2024

vstinner commented Mar 11, 2024

zooba commented Mar 11, 2024 • edited

vstinner commented Mar 15, 2024

vstinner commented Mar 18, 2024

vstinner commented Mar 10, 2024 •

edited by bedevere-app bot

vstinner commented Mar 11, 2024 •

edited

vstinner commented Mar 11, 2024 •

edited

zooba commented Mar 11, 2024 •

edited