Properly handle non-interned keyword argument names #469

oremanj · 2024-03-12T05:33:38Z

Fixes #468.

Size impact on my laptop: nb_func.cpp.o grows by 368 bytes (text size 22565 before, 22933 after).

Performance impact based on some timeit microbenchmarks (run twice in each configuration):

test_functions_ext.test_02(3, 5) goes from 32.4/32.3 ns to 32.0/31.8 ns
test_functions_ext.test_02(down=3, up=5) goes from 35.3/35.1 ns to 34.4/34.4 ns
test_functions_ext.test_02(down=3) goes from 33.3/33.5 ns to 34.0/34.0 ns

Which feels like basically a wash to me.

I didn't bother clearing the cast_flags::deferred flags used for this bookkeeping before calling the function, because that takes time and it seemed easy enough to just tell type casters to ignore the new flag.

wjakob

This change looks reasonable, but it's a bit on the complex end with the added double loop in an already very complicated function.

Could I ask you to explore two more alternatives and benchmark them as well?

Simplest: what if we replace the pointer equality by PyUnicode_Compare? (essentially what the PyPy code path is already doing).
What if we preprocess the keyword argument names at the beginning of the function to ensure that they are interned? I think I'm liking this option better because it avoids the need to call PyTuple_GetItem many times during overload resolution when nanobind is targeting the stable ABI.

    // Ensure that keyword argument names are interned
    PyObject **kwnames = (PyObject **) alloca(nkwargs_in * sizeof(PyObject *));
    for (size_t i = 0; i < nkwargs_in; ++i) {
        PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i),
                 *key_interned = key;
        Py_INCREF(key_interned);

        PyUnicode_InternInPlace(&key_interned);

        if (NB_LIKELY(key == key_interned)) // string was already interned
            Py_DECREF(key_interned);
        else
            cleanup.append(key_interned);
        kwnames[i] = key_interned;
    }

This would go to line ~528 just after the other alloca calls. Then, the rest of the code can use kwnames.

For benchmarking, what you really want to do is to try this on a function with many keyword arguments (e.g. 10-15, which is not unreasonable). That's when the O(n^2) argument comparison loop starts to become pricey and some of these details will begin to matter more.

wjakob · 2024-03-12T08:44:43Z

src/nb_func.cpp

            }

            // Skip this overload if any arguments were unavailable
            if (i != nargs_step1)
                continue;

+            // Check for non-interned keyword matches if applicable
+            if (any_arg_deferred) {


Duplicating the double loop over parameters here makes things too complicated IMO.

oremanj · 2024-03-12T11:13:53Z

Test functions:

    m.def("test_many",
          [](int, int, int, int, int, int, int, int,
             int, int, int, int, int, int, int, int) {},
          "param01"_a = 1, "param02"_a = 1, "param03"_a = 3, "param04"_a = 4,
          "param05"_a = 5, "param06"_a = 6, "param07"_a = 7, "param08"_a = 8,
          "param09"_a = 9, "param10"_a = 10, "param11"_a = 11, "param12"_a = 12,
          "param13"_a = 13, "param14"_a = 14, "param15"_a = 15, "param16"_a = 16);

    m.def("test_kwargs", [](int, nb::kwargs) {}, "param99"_a = 1, "kwargs"_a);

Microbenchmark script:

from timeit import timeit
import test_functions_ext
ns = {"fn": test_functions_ext.test_many, "fnvar": test_functions_ext.test_kwargs}

def params_str(nums):
    return ",".join(f"param{n:02d}={n}" for n in nums)

results = {}

def run(desc, stmt):
    nsec = timeit(stmt, number=3_000_000, globals=ns) / 3_000_000 * 1_000_000_000
    results.setdefault(desc, []).append(nsec)

# consistent randomly-chosen parameter order
ORDER = [15, 5, 3, 1, 10, 9, 16, 4, 14, 11, 6, 7, 8, 2, 13, 12]

for _ in range(2):
    run(
        "All positional:",
        "fn(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)",
    )
    run(
        "All keyword:",
        "fn({})".format(params_str(ORDER)),
    )
    run(
        "Some keyword, rest defaulted:",
        "fn({})".format(params_str(ORDER[:2])),
    )
    run(
        "All defaulted:",
        "fn()",
    )
    run(
        "Var-kwargs, 16 unmatched:",
        "fnvar({})".format(params_str(ORDER)),
    )
    run(
        "Var-kwargs, 2 unmatched:",
        "fnvar({})".format(params_str(ORDER[:2])),
    )

for desc, times in results.items():
    print(f"{desc:>40}", " / ".join(f"{nsec:.1f}ns" for nsec in times))

Python 3.12, regular ABI:

Stock nanobind:

                         All positional: 104.7ns / 102.4ns
                            All keyword: 493.2ns / 493.5ns
           Some keyword, rest defaulted: 91.6ns / 91.5ns
                          All defaulted: 77.3ns / 77.3ns
               Var-kwargs, 16 unmatched: 668.1ns / 668.7ns
                Var-kwargs, 2 unmatched: 70.0ns / 70.0ns

This PR as uploaded: pretty much a wash with stock nanobind, though var-kwargs + filling in defaults gets somewhat slower

                         All positional: 104.2ns / 101.7ns
                            All keyword: 482.3ns / 476.3ns
           Some keyword, rest defaulted: 92.8ns / 90.6ns
                          All defaulted: 73.9ns / 73.8ns
               Var-kwargs, 16 unmatched: 752.7ns / 752.6ns
                Var-kwargs, 2 unmatched: 85.7ns / 85.6ns

Always compare contents: really bad

                         All positional: 104.4ns / 102.7ns
                            All keyword: 1268.2ns / 1264.2ns
           Some keyword, rest defaulted: 276.7ns / 275.7ns
                          All defaulted: 78.9ns / 77.6ns
               Var-kwargs, 16 unmatched: 786.4ns / 775.2ns
                Var-kwargs, 2 unmatched: 82.2ns / 82.1ns

Intern first unconditionally: somewhat worse but maybe acceptable

                         All positional: 107.7ns / 104.5ns
                            All keyword: 543.4ns / 540.4ns
           Some keyword, rest defaulted: 100.6ns / 100.9ns
                          All defaulted: 79.1ns / 79.5ns
               Var-kwargs, 16 unmatched: 706.0ns / 706.3ns
                Var-kwargs, 2 unmatched: 79.9ns / 80.0ns

Intern first if the string object isn't already interned: gets back much of the difference

                         All positional: 106.3ns / 104.0ns
                            All keyword: 501.4ns / 500.4ns
           Some keyword, rest defaulted: 96.0ns / 95.5ns
                          All defaulted: 79.1ns / 78.9ns
               Var-kwargs, 16 unmatched: 663.3ns / 662.0ns
                Var-kwargs, 2 unmatched: 75.1ns / 74.8ns

wjakob · 2024-03-12T11:27:38Z

Fascinating! Clear signal, we do not want to run the comparisons by default.

Intern first if the string object isn't already interned

I don't follow what you did here, can you share the diff for this strategy?

One more request: could you run this in a limited API build? Should be as simple as adding STABLE_ABI to the cmake command and ensuring that you have Python 3.12 installed. Be careful to delete the old .so binding file, otherwise Python will prefer the version-specific one.

I am thinking that the "intern first" strategy will win in that situation due to the considerably lower number of tuple indexing operations.

oremanj · 2024-03-12T11:55:34Z

I don't follow what you did here, can you share the diff for this strategy?

Just added one line before InternInPlace:

#if !defined(PYPY_VERSION) && !defined(Py_LIMITED_API)
        if (PyUnicode_Check(key) && !(((PyASCIIObject *)key)->state.interned))
#endif
            PyUnicode_InternInPlace(&key_interned);

One more request: could you run this in a limited API build?

Stand by for results there!

I am thinking that the "intern first" strategy will win in that situation due to the considerably lower number of tuple indexing operations.

One thing to note about the PR as currently implemented is that we don't actually enter the second double-loop unless there are unused keywords. That will only happen when there is actually a non-interned keyword, or the function takes **kwargs, or we're about to return failure. My benchmark doesn't capture that case but I can come up with another one that does.

wjakob · 2024-03-12T12:48:37Z

FWIW I don't think that the PyUnicode_Check(key) could ever fail. The vector call protocol spec says that keywords is, quote: "Either NULL or a tuple with the names of the keyword arguments" (source)

oremanj · 2024-03-12T13:05:45Z

Stable ABI benchmarks:

Stock nanobind:

                         All positional: 126.2ns / 123.8ns
                            All keyword: 706.7ns / 702.7ns
           Some keyword, rest defaulted: 150.5ns / 150.5ns
                          All defaulted: 99.2ns / 99.2ns
               Var-kwargs, 16 unmatched: 727.5ns / 712.5ns
                Var-kwargs, 2 unmatched: 87.7ns / 87.4ns

PR as uploaded:

                         All positional: 126.7ns / 123.9ns
                            All keyword: 692.0ns / 692.2ns
           Some keyword, rest defaulted: 165.3ns / 165.6ns
                          All defaulted: 101.3ns / 101.2ns
               Var-kwargs, 16 unmatched: 858.6ns / 839.5ns
                Var-kwargs, 2 unmatched: 105.3ns / 105.3ns

Always compare:

                         All positional: 126.4ns / 123.8ns
                            All keyword: 1494.5ns / 1474.5ns
           Some keyword, rest defaulted: 327.8ns / 327.4ns
                          All defaulted: 99.2ns / 99.3ns
               Var-kwargs, 16 unmatched: 807.9ns / 809.1ns
                Var-kwargs, 2 unmatched: 98.6ns / 98.6ns

Intern first:

                         All positional: 128.4ns / 126.4ns
                            All keyword: 649.9ns / 651.8ns
           Some keyword, rest defaulted: 131.1ns / 130.9ns
                          All defaulted: 102.0ns / 102.2ns
               Var-kwargs, 16 unmatched: 821.0ns / 822.0ns
                Var-kwargs, 2 unmatched: 98.9ns / 98.8ns

wjakob · 2024-03-12T13:20:51Z

Great! What about something like this to further accelerate the non-stable ABI variant? (This potentially avoids the alloca and reference counting)

(Untested, may need some syntax tweaks)

// Ensure that keyword argument names are interned
bool interned;
#if !defined(PYPY_VERSION) && !defined(Py_LIMITED_API)
    interned = true;
    for (size_t i = 0; i < nkwargs_in; ++i) {
        PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i);
        interned &= ((PyASCIIObject *)key)->state.interned;
    }
#else
    interned = false;
#endif

PyObject **kwnames = nullptr;
if (NB_LIKELY(interned)) {
    kwnames = kwargs_in;
} else {
    kwnames = (PyObject **) alloca(nkwargs_in * sizeof(PyObject *));
    for (size_t i = 0; i < nkwargs_in; ++i) {
        PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i),
                 *key_interned = key;
        Py_INCREF(key_interned);

        PyUnicode_InternInPlace(&key_interned);

        if (NB_LIKELY(key == key_interned)) // string was already interned
            Py_DECREF(key_interned);
        else
            cleanup.append(key_interned);
        kwnames[i] = key_interned;
    }
}

By the way, is it valid to assume that key is always a PyASCIIObject? I think it is always a unicode object, but IIRC those can be represented by two different data structures depending on whether or not the string is ASCII or not.

wjakob · 2024-03-12T13:23:42Z

ah nvm, fix incoming

oremanj · 2024-03-12T13:24:43Z

I updated the non-stableabi benchmarks with the new var-kwargs tests also, and updated the comment above to show my current benchmarking script.

I think "intern first" clearly wins on stable ABI. It's less clear on the regular ABI; it depends whether you'd rather penalize var-kwargs functions with unspecified defaulted args a lot, or everyone a little bit. I'll try your suggested tweak and see if we can close the gap there. I think "intern first" wins overall because it's so much simpler; guessing you feel the same?

By the way, is it valid to assume that key is always a PyASCIIObject? I think it is always a unicode object, but IIRC those can be represented by two different data structures depending on whether or not the string is ASCII or not.

CompactUnicodeObject adds additional fields past the end of ASCIIObject, so this is safe. I also just noticed there's a macro PyUnicode_CHECK_INTERNED which I'll use for better clarity here.

wjakob · 2024-03-12T13:26:47Z

This works apparently:

def f(**kwargs):
    print(kwargs)

f(ğ=1)

wjakob · 2024-03-12T13:28:16Z

NVM, I see now that the fancier versions all build on the ascii object.

wjakob · 2024-03-12T13:37:37Z

How about this?

    // Ensure that keyword argument names are interned. That makes it faster
    // to compare them against pre-interned argument names in the overload chain.
    PyObject **kwnames = nullptr;

#if !defined(PYPY_VERSION) && !defined(Py_LIMITED_API)
    bool kwnames_interned = true;
    for (size_t i = 0; i < nkwargs_in; ++i) {
        PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i);
        kwnames_interned &= ((PyASCIIObject *) key)->state.interned != 0;
    }
    if (NB_LIKELY(kwnames_interned)) {
        kwnames = ((PyTupleObject *) kwargs_in)->ob_item;
        goto traverse_overloads;
    }
#endif

    kwnames = (PyObject **) alloca(nkwargs_in * sizeof(PyObject *));
    for (size_t i = 0; i < nkwargs_in; ++i) {
        PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i),
                 *key_interned = key;
        Py_INCREF(key_interned);

        PyUnicode_InternInPlace(&key_interned);

        if (NB_LIKELY(key == key_interned)) // string was already interned
            Py_DECREF(key_interned);
        else
            cleanup.append(key_interned);
        kwnames[i] = key_interned;
    }

traverse_overloads:

oremanj · 2024-03-12T13:51:45Z

Perfect.

=== 3.12 ABI ===

stock:

                         All positional: 104.9ns / 102.2ns
                            All keyword: 495.5ns / 494.8ns
           Some keyword, rest defaulted: 91.5ns / 91.4ns
                          All defaulted: 77.3ns / 77.3ns
               Var-kwargs, 16 unmatched: 659.9ns / 660.4ns
                Var-kwargs, 2 unmatched: 69.9ns / 70.0ns

intern before:

                         All positional: 105.3ns / 102.7ns
                            All keyword: 489.0ns / 490.5ns
           Some keyword, rest defaulted: 93.1ns / 92.9ns
                          All defaulted: 77.2ns / 77.3ns
               Var-kwargs, 16 unmatched: 658.5ns / 657.5ns
                Var-kwargs, 2 unmatched: 70.6ns / 70.3ns

Verdict: no noticeable difference.

=== Stable ABI ===

stock:

                         All positional: 126.2ns / 123.8ns
                            All keyword: 713.0ns / 715.9ns
           Some keyword, rest defaulted: 150.5ns / 150.7ns
                          All defaulted: 99.2ns / 99.3ns
               Var-kwargs, 16 unmatched: 737.5ns / 738.9ns
                Var-kwargs, 2 unmatched: 88.7ns / 88.5ns

intern before:

                         All positional: 128.7ns / 126.4ns
                            All keyword: 614.0ns / 615.5ns
           Some keyword, rest defaulted: 131.1ns / 131.1ns
                          All defaulted: 102.1ns / 102.1ns
               Var-kwargs, 16 unmatched: 757.9ns / 758.5ns
                Var-kwargs, 2 unmatched: 98.7ns / 98.8ns

Verdict: clear improvement on a few benchmarks due to reducing the number of tuple calls; cost on others is not awful.

wjakob · 2024-03-12T13:53:39Z

Great! Can you update the commit with this one and add a changelog entry?

oremanj · 2024-03-12T14:01:35Z

Should be all set! Thanks for your help in figuring out a better approach here.

wjakob · 2024-03-12T14:04:59Z

Thanks 👍

wjakob reviewed Mar 12, 2024

View reviewed changes

oremanj force-pushed the noninterned-kwargs branch from 2f68e1e to f7066a3 Compare March 12, 2024 14:00

Properly handle non-interned keyword argument names

857dde1

oremanj force-pushed the noninterned-kwargs branch from f7066a3 to 857dde1 Compare March 12, 2024 14:01

wjakob merged commit fa873d1 into wjakob:master Mar 12, 2024
24 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Properly handle non-interned keyword argument names #469

Properly handle non-interned keyword argument names #469

oremanj commented Mar 12, 2024 •

edited

wjakob left a comment •

edited

wjakob Mar 12, 2024

oremanj commented Mar 12, 2024 •

edited

wjakob commented Mar 12, 2024 •

edited

oremanj commented Mar 12, 2024

wjakob commented Mar 12, 2024

oremanj commented Mar 12, 2024

wjakob commented Mar 12, 2024 •

edited

wjakob commented Mar 12, 2024

oremanj commented Mar 12, 2024

wjakob commented Mar 12, 2024

wjakob commented Mar 12, 2024

wjakob commented Mar 12, 2024 •

edited

oremanj commented Mar 12, 2024

wjakob commented Mar 12, 2024

oremanj commented Mar 12, 2024

wjakob commented Mar 12, 2024

Properly handle non-interned keyword argument names #469

Properly handle non-interned keyword argument names #469

Conversation

oremanj commented Mar 12, 2024 • edited

wjakob left a comment • edited

Choose a reason for hiding this comment

wjakob Mar 12, 2024

Choose a reason for hiding this comment

oremanj commented Mar 12, 2024 • edited

wjakob commented Mar 12, 2024 • edited

oremanj commented Mar 12, 2024

wjakob commented Mar 12, 2024

oremanj commented Mar 12, 2024

wjakob commented Mar 12, 2024 • edited

wjakob commented Mar 12, 2024

oremanj commented Mar 12, 2024

wjakob commented Mar 12, 2024

wjakob commented Mar 12, 2024

wjakob commented Mar 12, 2024 • edited

oremanj commented Mar 12, 2024

wjakob commented Mar 12, 2024

oremanj commented Mar 12, 2024

wjakob commented Mar 12, 2024

oremanj commented Mar 12, 2024 •

edited

wjakob left a comment •

edited

oremanj commented Mar 12, 2024 •

edited

wjakob commented Mar 12, 2024 •

edited

wjakob commented Mar 12, 2024 •

edited

wjakob commented Mar 12, 2024 •

edited