@vstinner
Member

@vstinner vstinner commented Nov 17, 2025

Add PyDict_FromKeysAndValues() and PyDict_FromItems() functions.

API:

PyObject* PyDict_FromKeysAndValues(PyObject *const *keys,
                                   PyObject *const *values,
                                   Py_ssize_t length)

PyObject* PyDict_FromItems(PyObject *const *items, Py_ssize_t length)
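As a rough mental model of the two signatures, here is a pure-Python sketch. The function names are hypothetical stand-ins, not part of any API. The flat key/value layout of `items` and the meaning of `length` as a number of pairs are taken from the benchmark code and the discussion in this PR. Letting a later duplicate key win is an assumption of the sketch.

```python
def dict_from_keys_and_values(keys, values, length):
    """Model of PyDict_FromKeysAndValues(): pair keys[i] with values[i]."""
    return {keys[i]: values[i] for i in range(length)}


def dict_from_items(items, length):
    """Model of PyDict_FromItems(): `items` holds `length` key/value pairs,
    laid out as key1, value1, key2, value2, ..."""
    return {items[2 * i]: items[2 * i + 1] for i in range(length)}


keys = ["a", "b", "c"]
values = [1, 2, 3]
items = ["a", 1, "b", 2, "c", 3]

d1 = dict_from_keys_and_values(keys, values, len(keys))
d2 = dict_from_items(items, len(items) // 2)  # length counts pairs, not objects
print(d1, d2)
```

Both calls build the same dictionary; only the layout of the input arrays differs.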

📚 Documentation preview 📚: https://cpython-previews--141682.org.readthedocs.build/

@vstinner vstinner changed the title from "gh-139772: Add PyDict_FromItems() function" to "gh-139772: Add PyDict_FromKeysAndValues() function" Nov 17, 2025
@vstinner
Member Author

Benchmark on dict creation with Unicode strings:

  • ref: PyDict_New() + PyDict_SetItem()
  • keys_and_values: PyDict_FromKeysAndValues() (including the time to create the two arrays, keys, and values)
  • from_items: PyDict_FromItems() (including the time to create the array, keys, and values)

| Benchmark | ref | keys_and_values | from_items |
|---|---|---|---|
| dict-1 | 488 ns | 466 ns: 1.05x faster | 469 ns: 1.04x faster |
| dict-10 | 3.71 us | 3.30 us: 1.12x faster | 3.19 us: 1.16x faster |
| dict-100 | 31.8 us | 27.8 us: 1.15x faster | 28.1 us: 1.13x faster |
| dict-1,000 | 295 us | 249 us: 1.18x faster | 254 us: 1.16x faster |
| dict-10,000 | 2.93 ms | 2.53 ms: 1.16x faster | 2.53 ms: 1.16x faster |
| Geometric mean | (ref) | 1.13x faster | 1.13x faster |

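The geometric-mean row can be cross-checked from the rounded timings reported above. This is only a sanity check: pyperf works from the full measurements, so its value may differ in later digits. The `timings` dictionary is just a transcription of the reported numbers.

```python
from statistics import geometric_mean

# (ref, keys_and_values, from_items) per benchmark; the unit is the same
# within each row, so each ratio is unit-free.
timings = {
    "dict-1": (488, 466, 469),        # ns
    "dict-10": (3.71, 3.30, 3.19),    # us
    "dict-100": (31.8, 27.8, 28.1),   # us
    "dict-1,000": (295, 249, 254),    # us
    "dict-10,000": (2.93, 2.53, 2.53),  # ms
}

speedup_keys_and_values = geometric_mean(
    [ref / kv for ref, kv, _ in timings.values()]
)
speedup_from_items = geometric_mean(
    [ref / fi for ref, _, fi in timings.values()]
)

# Both values round to 1.13, matching the reported geometric mean.
print(f"{speedup_keys_and_values:.2f}x, {speedup_from_items:.2f}x")
```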
diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index c14f925b4e7..9987bfa41ba 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -2595,6 +2595,133 @@ create_managed_weakref_nogc_type(PyObject *self, PyObject *Py_UNUSED(args))
 }
 
 
+static PyObject *
+bench_dict_new(PyObject *ob, PyObject *args)
+{
+    Py_ssize_t loops, size;  // pyperf calls func(loops, *args): loops comes first
+    if (!PyArg_ParseTuple(args, "nn", &loops, &size)) {
+        return NULL;
+    }
+
+    PyTime_t t1, t2;
+    PyTime_PerfCounterRaw(&t1);
+    for (Py_ssize_t loop=0; loop < loops; loop++) {
+        PyObject *d = PyDict_New();
+        if (d == NULL) {
+            return NULL;
+        }
+
+        for (Py_ssize_t i=0; i < size; i++) {
+            PyObject *key = PyUnicode_FromFormat("%zi", i);
+            assert(key != NULL);
+
+            PyObject *value = PyLong_FromLong(i);
+            assert(value != NULL);
+
+            assert(PyDict_SetItem(d, key, value) == 0);
+            Py_DECREF(key);
+            Py_DECREF(value);
+        }
+
+        assert(PyDict_Size(d) == size);
+        Py_DECREF(d);
+    }
+    PyTime_PerfCounterRaw(&t2);
+
+    return PyFloat_FromDouble(PyTime_AsSecondsDouble(t2 - t1));
+}
+
+
+static PyObject *
+bench_dict_fromkeysandvalues(PyObject *ob, PyObject *args)
+{
+    Py_ssize_t loops, size;  // pyperf calls func(loops, *args): loops comes first
+    if (!PyArg_ParseTuple(args, "nn", &loops, &size)) {
+        return NULL;
+    }
+
+    PyTime_t t1, t2;
+    PyTime_PerfCounterRaw(&t1);
+    for (Py_ssize_t loop=0; loop < loops; loop++) {
+        PyObject **keys = (PyObject **)PyMem_Malloc(size * sizeof(PyObject*));
+        PyObject **values = (PyObject **)PyMem_Malloc(size * sizeof(PyObject*));
+        if (keys == NULL || values == NULL) {
+            // PyMem_Malloc() sets no exception; PyMem_Free(NULL) is a no-op.
+            PyMem_Free(keys);
+            PyMem_Free(values);
+            return PyErr_NoMemory();
+        }
+
+        for (Py_ssize_t i=0; i < size; i++) {
+            PyObject *key = PyUnicode_FromFormat("%zi", i);
+            assert(key != NULL);
+
+            PyObject *value = PyLong_FromLong(i);
+            assert(value != NULL);
+
+            keys[i] = key;
+            values[i] = value;
+        }
+
+        PyObject *d = PyDict_FromKeysAndValues(keys, values, size);
+        assert(d != NULL);
+        Py_DECREF(d);
+
+        for (Py_ssize_t i=0; i < size; i++) {
+            Py_DECREF(keys[i]);
+            Py_DECREF(values[i]);
+        }
+        PyMem_Free(keys);
+        PyMem_Free(values);
+    }
+    PyTime_PerfCounterRaw(&t2);
+
+    return PyFloat_FromDouble(PyTime_AsSecondsDouble(t2 - t1));
+}
+
+
+static PyObject *
+bench_dict_fromitems(PyObject *ob, PyObject *args)
+{
+    Py_ssize_t loops, size;  // pyperf calls func(loops, *args): loops comes first
+    if (!PyArg_ParseTuple(args, "nn", &loops, &size)) {
+        return NULL;
+    }
+
+    PyTime_t t1, t2;
+    PyTime_PerfCounterRaw(&t1);
+    for (Py_ssize_t loop=0; loop < loops; loop++) {
+        PyObject **items = (PyObject **)PyMem_Malloc(size * 2 * sizeof(PyObject*));
+        if (items == NULL) {
+            return PyErr_NoMemory();  // PyMem_Malloc() sets no exception
+        }
+
+        for (Py_ssize_t i=0; i < size; i++) {
+            PyObject *key = PyUnicode_FromFormat("%zi", i);
+            assert(key != NULL);
+
+            PyObject *value = PyLong_FromLong(i);
+            assert(value != NULL);
+
+            items[i * 2    ] = key;
+            items[i * 2 + 1] = value;
+        }
+
+        PyObject *d = PyDict_FromItems(items, size);
+        assert(d != NULL);
+        Py_DECREF(d);
+
+        for (Py_ssize_t i=0; i < size * 2; i++) {
+            Py_DECREF(items[i]);
+        }
+        PyMem_Free(items);
+    }
+    PyTime_PerfCounterRaw(&t2);
+
+    return PyFloat_FromDouble(PyTime_AsSecondsDouble(t2 - t1));
+}
+
+
 static PyMethodDef TestMethods[] = {
     {"set_errno",               set_errno,                       METH_VARARGS},
     {"test_config",             test_config,                     METH_NOARGS},
@@ -2691,6 +2818,9 @@ static PyMethodDef TestMethods[] = {
     {"toggle_reftrace_printer", toggle_reftrace_printer, METH_O},
     {"create_managed_weakref_nogc_type",
         create_managed_weakref_nogc_type, METH_NOARGS},
+    {"bench_dict_new", bench_dict_new, METH_VARARGS},
+    {"bench_dict_fromkeysandvalues", bench_dict_fromkeysandvalues, METH_VARARGS},
+    {"bench_dict_fromitems", bench_dict_fromitems, METH_VARARGS},
     {NULL, NULL} /* sentinel */
 };
 

Script:

import pyperf
import _testcapi

runner = pyperf.Runner()
for size in (1, 10, 100, 1_000, 10_000):
    runner.bench_time_func(f'dict-{size:,}', _testcapi.bench_dict_new, size)
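For readers unfamiliar with `bench_time_func()`: pyperf calls the time function as `func(loops, *args)` and expects the total elapsed time, in seconds, for all `loops` iterations. A pure-Python stand-in for the reference benchmark might look like the sketch below. The function name is hypothetical, and it measures the Python-level equivalent rather than the C API calls.

```python
import time


def bench_dict_new_py(loops, size):
    """Time `loops` constructions of a dict with `size` str -> int items."""
    t1 = time.perf_counter()
    for _ in range(loops):
        d = {}
        for i in range(size):
            d[str(i)] = i
    # Return the total elapsed time; pyperf divides it by `loops` itself.
    return time.perf_counter() - t1


elapsed = bench_dict_new_py(loops=100, size=10)
print(f"{elapsed:.6f} s for 100 loops")
```

With pyperf installed, it would be registered the same way as the C functions: `runner.bench_time_func('dict-10', bench_dict_new_py, 10)`.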

@vstinner
Member Author

@scoder @davidhewitt: Do these 2 APIs fit your needs to create a dictionary?

@scoder
Contributor

scoder commented Nov 18, 2025

> Do these 2 APIs fit your needs to create a dictionary?

In Cython, we'd probably know up-front whether the keys are all str or not, so the type checking loop may be redundant in many cases and a flag option could avoid it. However, given branch prediction, first time memory load overhead, etc., I doubt that the added time is going to be visible for reasonably sized literal dicts in the real world. The main gain is from avoiding the repeated back and forth through the C-API barrier, which is certainly worth it.
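The "type checking loop" can be pictured as one pass over the keys before the dict is built. A minimal Python sketch follows, assuming the check is for exact `str` keys. The helper name is hypothetical and the actual C code may differ.

```python
def all_keys_are_str(keys):
    """Return True if every key is exactly a str (subclasses excluded)."""
    # An exact type check matters because a str subclass may override
    # __hash__ or __eq__.
    return all(type(key) is str for key in keys)


class MyStr(str):
    pass


print(all_keys_are_str(["a", "b"]))         # exact str keys only
print(all_keys_are_str(["a", 1]))           # mixed key types
print(all_keys_are_str(["a", MyStr("b")]))  # contains a str subclass
```

A caller such as Cython that already knows the answer at compile time would gain only by skipping this single pass, which is the point made above about the cost being hard to observe.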

@davidhewitt
Contributor

davidhewitt commented Nov 18, 2025

Is the array in FromItems an array of alternating keys & values? I think my comment in #139963 (comment) as to why the offsets might be necessary for guaranteeing correct interpretation of the layout of Rust tuples still stands, at least for PyO3 user code if we wanted to avoid rewriting their inputs into a C-style array.

For PyO3's internal creation of dictionaries, we should be able to use any of these functions fine 👍

@vstinner
Member Author

> Is the array in FromItems an array of alternating keys & values?

Yes: key1, value1, key2, value2, ..., keyN, valueN.

> I think my comment in #139963 (comment) as to why the offsets might be necessary for guaranteeing correct interpretation of the layout of Rust tuples still stands, at least for PyO3 user code if we wanted to avoid rewriting their inputs into a C-style array.

Can't you modify your code to produce a flat items vector as FromItems() expects? Or produce two arrays, keys and values, for FromKeysAndValues()?
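To illustrate the two rearrangements suggested here, the sketch below starts from a list of `(key, value)` pairs. The `pairs` list is only a stand-in for whatever pair-like structure a caller holds; it does not model the memory layout of Rust tuples.

```python
pairs = [("x", 10), ("y", 20), ("z", 30)]

# Flat layout for FromItems(): key1, value1, key2, value2, ...
flat_items = [obj for pair in pairs for obj in pair]

# Two parallel arrays for FromKeysAndValues().
keys = [key for key, _ in pairs]
values = [value for _, value in pairs]

print(flat_items)
print(keys, values)
```

Either rearrangement costs one extra pass over the input, which is the rewriting step that offset arguments would avoid.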

@vstinner
Member Author

vstinner commented Nov 18, 2025

@scoder:

> In Cython, we'd probably know up-front whether the keys are all str or not, so the type checking loop may be redundant in many cases and a flag option could avoid it.

The problem is that the PyObject* PyDict_NewPresized(Py_ssize_t size, int unicode_keys) API didn't convince the C API Working Group. So I propose higher level APIs, PyDict_FromKeysAndValues() and PyDict_FromItems(), which don't expose implementation details such as unicode_keys.

> However, given branch prediction, first time memory load overhead, etc., I doubt that the added time is going to be visible for reasonably sized literal dicts in the real world. The main gain is from avoiding the repeated back and forth through the C-API barrier, which is certainly worth it.

Right. In the end, the proposed APIs are about 1.13x faster (geometric mean) than calling PyDict_New() + PyDict_SetItem().

@davidhewitt
Contributor

davidhewitt commented Nov 18, 2025

> Can't you modify your code to produce a flat items vector as FromItems() expects? Or produce two arrays, keys and values, for FromKeysAndValues()?

For PyO3 internal code, yes. However, we might want to expose this for users of PyO3. While designing that API, I think I've decided that they will need to do some rearranging of the objects anyway. So while I still think having the flexibility to have offsets is nice, please don't block this on me.

@vstinner
Member Author

@methane @encukou: What do you think of these functions? I propose adding these functions instead of adding PyDict_NewPresized().

@encukou
Member

encukou commented Nov 20, 2025

I'd like to consider:

  • as per the discussion here: adding stride arguments, like the offset ones in the current _PyDict_FromItems? This would mean we only need a single function (as well as enable keys & values in arrays of more complex structures than pairs of PyObject*, which I guess is less important)
  • making this an update operation, with a new dict created if you pass NULL as the dict to update? That would allow creating a dict from several chunks

@vstinner
Copy link
Member Author

> as per the discussion here: adding stride arguments, like the offset ones in the current _PyDict_FromItems?

#139963 implements such an API:

PyObject* PyDict_FromItems(
    PyObject *const *keys,
    Py_ssize_t keys_offset,
    PyObject *const *values,
    Py_ssize_t values_offset,
    Py_ssize_t length)

Such an API is harder to use (more error-prone) and requires more checks. I prefer a simpler API for the two most common use cases.
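To make the trade-off concrete, here is a Python model of the offset-based signature. It assumes each offset is the step, counted in array elements, between consecutive entries. That reading is an assumption, since the code of #139963 is not shown here, and the function name is hypothetical.

```python
def dict_from_strided(keys, keys_offset, values, values_offset, length):
    """Model of an offset-based constructor: entry i is read from
    keys[i * keys_offset] and values[i * values_offset]."""
    return {
        keys[i * keys_offset]: values[i * values_offset]
        for i in range(length)
    }


keys = ["a", "b"]
values = [1, 2]
items = ["a", 1, "b", 2]

# Two separate arrays: both offsets are 1.
separate = dict_from_strided(keys, 1, values, 1, 2)

# One interleaved array: both offsets are 2. In C the values pointer would be
# `items + 1`; a list slice stands in for that pointer here.
interleaved = dict_from_strided(items, 2, items[1:], 2, 2)

print(separate, interleaved)
```

Under this reading the two simpler functions are the special cases `(1, 1)` and `(2, 2)`, and the extra parameters are where a caller can pass a wrong offset or base pointer.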

@vstinner
Member Author

> making this an update operation, with a new dict created if you pass NULL as the dict to update? That would allow creating a dict from several chunks

Aha, like a batch of PyDict_SetItem() calls, interesting. I'm not sure if it's a good idea to have a single function to create a dictionary or update a dictionary. I think that I would prefer separate functions for that. For example, PyDict_Merge() has an override parameter to decide what to do if a key already exists.

There are already PyDict_Update() and PyDict_Merge(), which accept a dictionary or a collection.
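A Python model of the combined create-or-update idea shows why the duplicate-key policy has to be decided somewhere. The function name and the `override` flag are hypothetical, with the flag borrowed from the PyDict_Merge() behaviour mentioned above; `None` plays the role of passing NULL.

```python
def update_from_items(d, items, length, override=True):
    """Insert `length` key/value pairs from a flat `items` list into `d`.

    A new dict is created when `d` is None.
    """
    if d is None:
        d = {}
    for i in range(length):
        key, value = items[2 * i], items[2 * i + 1]
        if override:
            d[key] = value            # replace an existing entry
        else:
            d.setdefault(key, value)  # keep an existing entry
    return d


# Build a dict from two chunks; the second chunk does not override "b".
d = update_from_items(None, ["a", 1, "b", 2], 2)
update_from_items(d, ["b", 20, "c", 3], 2, override=False)
print(d)
```

A pure constructor has no pre-existing keys to worry about, which is one argument for keeping creation and update as separate functions.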
