Properly handle non-interned keyword argument names #469
Conversation
This change looks reasonable, but it's a bit on the complex end with the added double loop in an already very complicated function.
Could I ask you to explore two more alternatives and benchmark them as well?
- Simplest: what if we replace the pointer equality by `PyUnicode_Compare`? (Essentially what the PyPy code path is already doing.)
- What if we preprocess the keyword argument names at the beginning of the function to ensure that they are interned? I think I'm liking this option better because it avoids the need to call `PyTuple_GetItem` many times during overload resolution when nanobind is targeting the stable ABI.
```cpp
// Ensure that keyword argument names are interned
PyObject **kwnames = (PyObject **) alloca(nkwargs_in * sizeof(PyObject *));
for (size_t i = 0; i < nkwargs_in; ++i) {
    PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i),
             *key_interned = key;
    Py_INCREF(key_interned);
    PyUnicode_InternInPlace(&key_interned);
    if (NB_LIKELY(key == key_interned)) // string was already interned
        Py_DECREF(key_interned);
    else
        cleanup.append(key_interned);
    kwnames[i] = key_interned;
}
```
This would go at line ~528, just after the other `alloca` calls. Then, the rest of the code can use `kwnames`.
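For intuition, the same preprocessing can be sketched at the Python level with `sys.intern`, which (like `PyUnicode_InternInPlace`) is idempotent and maps each string value to a single canonical object. This is an illustrative analogue, not nanobind code; the helper name is made up:

```python
import sys

def intern_kwnames(kwnames):
    """Return a list where every keyword name is the canonical interned
    object, so later comparisons can be plain identity checks."""
    return [sys.intern(k) for k in kwnames]

# A keyword name built at runtime is generally a distinct object from the
# compile-time constant "down", even though the contents match.
runtime_key = "".join(["do", "wn"])
canonical = intern_kwnames([runtime_key])[0]
assert canonical == "down"                 # contents match
assert canonical is sys.intern("down")     # identity after interning
```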
For benchmarking, what you really want to do is to try this on a function with many keyword arguments (e.g. 10-15, which is not unreasonable). That's when the O(n^2) argument comparison loop starts to become pricey and some of these details will begin to matter more.
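As a rough model of why this matters, the argument-matching inner loop can be sketched in Python: with interned names, the comparison per pair is an O(1) identity check, but the loop over parameters times keywords is still quadratic. This is an illustrative sketch, not nanobind's actual overload-resolution code:

```python
import sys

def match_keywords(param_names, kwnames):
    """Map each declared parameter name to the index of the caller's
    keyword argument, comparing by identity. Identity comparison is only
    valid because both sides are interned first."""
    params = [sys.intern(p) for p in param_names]
    names = [sys.intern(k) for k in kwnames]
    used = [False] * len(names)
    out = {}
    for p in params:                    # O(n) parameters ...
        for j, k in enumerate(names):   # ... times O(n) keywords
            if not used[j] and k is p:  # pointer-style comparison
                out[p] = j
                used[j] = True
                break
    return out

print(match_keywords(["down", "up"], ["up", "down"]))  # {'down': 1, 'up': 0}
```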
src/nb_func.cpp (Outdated)

```cpp
    // Skip this overload if any arguments were unavailable
    if (i != nargs_step1)
        continue;

    // Check for non-interned keyword matches if applicable
    if (any_arg_deferred) {
```
Duplicating the double loop over parameters here makes things too complicated IMO.
Test functions:
Microbenchmark script:
Python 3.12, regular ABI:
- Stock nanobind:
- This PR as uploaded: pretty much a wash with stock nanobind, though var-kwargs + filling in defaults gets somewhat slower
- Always compare contents: really bad
- Intern first unconditionally: somewhat worse but maybe acceptable
- Intern first if the string object isn't already interned: gets back much of the difference
Fascinating! Clear signal: we do not want to run the comparisons by default.
I don't follow what you did here; can you share the diff for this strategy?

One more request: could you run this in a limited API build? Should be as simple as adding

I am thinking that the "intern first" strategy will win in that situation due to the considerably lower number of tuple indexing operations.
Just added one line before InternInPlace:
Stand by for results there!
One thing to note about the PR as currently implemented is that we don't actually enter the second double-loop unless there are unused keywords. That will only happen when there is actually a non-interned keyword, or the function takes
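The two-pass design described here can be sketched in Python: a first pass matches keywords by identity, and only if some keywords remain unmatched does a second pass fall back to content comparison. A made-up illustration, not the actual nb_func.cpp logic; the identity pass assumes the declared parameter names are interned:

```python
def resolve(params, kwnames):
    """Two-pass keyword matching: identity first, then (only if needed)
    a second pass that compares string contents."""
    used = [False] * len(kwnames)
    out = {}
    for p in params:
        for j, k in enumerate(kwnames):
            if not used[j] and k is p:   # succeeds only if both interned
                out[p], used[j] = j, True
                break
    if not all(used):                    # leftover keywords: slow path
        for p in params:
            if p in out:
                continue
            for j, k in enumerate(kwnames):
                if not used[j] and k == p:
                    out[p], used[j] = j, True
                    break
    return out
```

Passing a runtime-built (hence non-interned) keyword name exercises the second pass, while interned names are resolved entirely in the first.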
FWIW I don't think that the
Stable ABI benchmarks:
- Stock nanobind:
- PR as uploaded:
- Always compare:
- Intern first:
Great! What about something like this to further accelerate the non-stable ABI variant? This potentially avoids the

(Untested, may need some syntax tweaks.)

```cpp
// Ensure that keyword argument names are interned
bool interned;
#if !defined(PYPY_VERSION) && !defined(Py_LIMITED_API)
interned = true;
for (size_t i = 0; i < nkwargs_in; ++i) {
    PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i);
    interned &= ((PyASCIIObject *) key)->state.interned;
}
#else
interned = false;
#endif

PyObject **kwnames = nullptr;
if (NB_LIKELY(interned)) {
    kwnames = kwargs_in;
} else {
    kwnames = (PyObject **) alloca(nkwargs_in * sizeof(PyObject *));
    for (size_t i = 0; i < nkwargs_in; ++i) {
        PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i),
                 *key_interned = key;
        Py_INCREF(key_interned);
        PyUnicode_InternInPlace(&key_interned);
        if (NB_LIKELY(key == key_interned)) // string was already interned
            Py_DECREF(key_interned);
        else
            cleanup.append(key_interned);
        kwnames[i] = key_interned;
    }
}
```

By the way, is it valid to assume that
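The `key == key_interned` fast path above relies on interning being idempotent: interning an already-interned string hands back the same object, and interning a fresh string with known contents hands back the existing canonical one. `sys.intern` exposes the same behavior at the Python level:

```python
import sys

# Force a canonical interned object for the contents "key".
key = sys.intern("".join(["ke", "y"]))
assert sys.intern(key) is key              # already interned: same object back

# A fresh, equal-but-distinct string maps to the same canonical object.
fresh = "".join(["ke", "y"])
assert sys.intern(fresh) is key
```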
ah nvm, fix incoming
I updated the non-stable-ABI benchmarks with the new var-kwargs tests, and updated the comment above to show my current benchmarking script.

I think "intern first" clearly wins on stable ABI. It's less clear on the regular ABI; it depends whether you'd rather penalize var-kwargs functions with unspecified defaulted args a lot, or everyone a little bit. I'll try your suggested tweak and see if we can close the gap there. I think "intern first" wins overall because it's so much simpler; guessing you feel the same?
PyCompactUnicodeObject adds additional fields past the end of PyASCIIObject, so this is safe. I also just noticed there's a macro
This works apparently:
NVM, I see now that the fancier versions all build on the ASCII object.
How about this?

```cpp
// Ensure that keyword argument names are interned. That makes it faster
// to compare them against pre-interned argument names in the overload chain.
PyObject **kwnames = nullptr;

#if !defined(PYPY_VERSION) && !defined(Py_LIMITED_API)
bool kwnames_interned = true;
for (size_t i = 0; i < nkwargs_in; ++i) {
    PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i);
    kwnames_interned &= ((PyASCIIObject *) key)->state.interned != 0;
}

if (NB_LIKELY(kwnames_interned)) {
    kwnames = ((PyTupleObject *) kwargs_in)->ob_item;
    goto traverse_overloads;
}
#endif

kwnames = (PyObject **) alloca(nkwargs_in * sizeof(PyObject *));
for (size_t i = 0; i < nkwargs_in; ++i) {
    PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i),
             *key_interned = key;
    Py_INCREF(key_interned);
    PyUnicode_InternInPlace(&key_interned);
    if (NB_LIKELY(key == key_interned)) // string was already interned
        Py_DECREF(key_interned);
    else
        cleanup.append(key_interned);
    kwnames[i] = key_interned;
}

traverse_overloads:
```
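The control flow above, checking whether everything is already interned before falling back to the copy-and-intern loop, can be mirrored in Python. There is no public Python-level equivalent of the `PyASCIIObject` interned flag, so this sketch approximates the check with `sys.intern` (which may intern the string as a side effect, fine for this purpose); the helper name is made up:

```python
import sys

def ensure_interned(kwnames):
    """Return a sequence of canonical (interned) keyword names.

    Fast path: if every name is already the canonical object, reuse the
    input unchanged, with no copy. Slow path: build a fresh list of
    canonical objects, mirroring the alloca-and-copy branch.
    """
    if all(k is sys.intern(k) for k in kwnames):
        return kwnames
    return [sys.intern(k) for k in kwnames]
```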
Perfect.

=== 3.12 ABI ===
stock:
intern before:
Verdict: no noticeable difference.

=== Stable ABI ===
stock:
intern before:
Verdict: clear improvement on a few benchmarks due to reducing the number of tuple calls; cost on others is not awful.
Great! Can you update the commit with this one and add a changelog entry?
Force-pushed from 2f68e1e to f7066a3
Force-pushed from f7066a3 to 857dde1
Should be all set! Thanks for your help in figuring out a better approach here.
Thanks 👍
Fixes #468.
Size impact on my laptop: nb_func.cpp.o grows by 368 bytes (text size 22565 before, 22933 after).
Performance impact based on some timeit microbenchmarks (run twice in each configuration):
- test_functions_ext.test_02(3, 5) goes from 32.4/32.3 ns to 32.0/31.8 ns
- test_functions_ext.test_02(down=3, up=5) goes from 35.3/35.1 ns to 34.4/34.4 ns
- test_functions_ext.test_02(down=3) goes from 33.3/33.5 ns to 34.0/34.0 ns

Which feels like basically a wash to me.
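For reference, measurements of this kind can be reproduced with a `timeit` loop along these lines. The `test_02` here is a plain-Python stand-in for the real `test_functions_ext.test_02`, which is a compiled nanobind binding, so the absolute numbers will differ:

```python
import timeit

def test_02(down=1, up=8):
    """Hypothetical stand-in for the C++ test binding."""
    return up - down

for call in ("test_02(3, 5)", "test_02(down=3, up=5)", "test_02(down=3)"):
    # Best of 5 repeats of 100k calls, reported as ns per call.
    secs = min(timeit.repeat(call, globals=globals(), number=100_000, repeat=5))
    print(f"{call}: {secs / 100_000 * 1e9:.1f} ns/call")
```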
I didn't bother clearing the `cast_flags::deferred` flags used for this bookkeeping before calling the function, because that takes time and it seemed easy enough to just tell type casters to ignore the new flag.