Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly handle non-interned keyword argument names #469

Merged
merged 1 commit into from Mar 12, 2024

Conversation

oremanj
Copy link
Contributor

@oremanj oremanj commented Mar 12, 2024

Fixes #468.

Size impact on my laptop: nb_func.cpp.o grows by 368 bytes (text size 22565 before, 22933 after).

Performance impact based on some timeit microbenchmarks (run twice in each configuration):

  • test_functions_ext.test_02(3, 5) goes from 32.4/32.3 ns to 32.0/31.8 ns
  • test_functions_ext.test_02(down=3, up=5) goes from 35.3/35.1 ns to 34.4/34.4 ns
  • test_functions_ext.test_02(down=3) goes from 33.3/33.5 ns to 34.0/34.0 ns

Which feels like basically a wash to me.

I didn't bother clearing the cast_flags::deferred flags used for this bookkeeping before calling the function, because that takes time and it seemed easy enough to just tell type casters to ignore the new flag.

Copy link
Owner

@wjakob wjakob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks reasonable, but it's a bit on the complex end with the added double loop in an already very complicated function.

Could I ask you to explore two more alternatives and benchmark them as well?

  1. Simplest: what if we replace the pointer equality by PyUnicode_Compare? (essentially what the PyPy code path is already doing).

  2. What if we preprocess the keyword argument names at the beginning of the function to ensure that they are interned? I think I'm liking this option better because it avoids the need to call PyTuple_GetItem many times during overload resolution when nanobind is targeting the stable ABI.

    // Ensure that keyword argument names are interned
    PyObject **kwnames = (PyObject **) alloca(nkwargs_in * sizeof(PyObject *));
    for (size_t i = 0; i < nkwargs_in; ++i) {
        PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i),
                 *key_interned = key;
        Py_INCREF(key_interned);

        PyUnicode_InternInPlace(&key_interned);

        if (NB_LIKELY(key == key_interned)) // string was already interned
            Py_DECREF(key_interned);
        else
            cleanup.append(key_interned);
        kwnames[i] = key_interned;
    }

This would go to line ~528 just after the other alloca calls. Then, the rest of the code can use kwnames.

For benchmarking, what you really want to do is to try this on a function with many keyword arguments (e.g. 10-15, which is not unreasonable). That's when the O(n^2) argument comparison loop starts to become pricey and some of these details will begin to matter more.

src/nb_func.cpp Outdated
}

// Skip this overload if any arguments were unavailable
if (i != nargs_step1)
continue;

// Check for non-interned keyword matches if applicable
if (any_arg_deferred) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicating the double loop over parameters here makes things too complicated IMO.

@oremanj
Copy link
Contributor Author

oremanj commented Mar 12, 2024

Test functions:

    m.def("test_many",
          [](int, int, int, int, int, int, int, int,
             int, int, int, int, int, int, int, int) {},
          "param01"_a = 1, "param02"_a = 1, "param03"_a = 3, "param04"_a = 4,
          "param05"_a = 5, "param06"_a = 6, "param07"_a = 7, "param08"_a = 8,
          "param09"_a = 9, "param10"_a = 10, "param11"_a = 11, "param12"_a = 12,
          "param13"_a = 13, "param14"_a = 14, "param15"_a = 15, "param16"_a = 16);

    m.def("test_kwargs", [](int, nb::kwargs) {}, "param99"_a = 1, "kwargs"_a);

Microbenchmark script:

from timeit import timeit
import test_functions_ext
ns = {"fn": test_functions_ext.test_many, "fnvar": test_functions_ext.test_kwargs}

def params_str(nums):
    return ",".join(f"param{n:02d}={n}" for n in nums)

results = {}

def run(desc, stmt):
    nsec = timeit(stmt, number=3_000_000, globals=ns) / 3_000_000 * 1_000_000_000
    results.setdefault(desc, []).append(nsec)

# consistent randomly-chosen parameter order
ORDER = [15, 5, 3, 1, 10, 9, 16, 4, 14, 11, 6, 7, 8, 2, 13, 12]

for _ in range(2):
    run(
        "All positional:",
        "fn(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)",
    )
    run(
        "All keyword:",
        "fn({})".format(params_str(ORDER)),
    )
    run(
        "Some keyword, rest defaulted:",
        "fn({})".format(params_str(ORDER[:2])),
    )
    run(
        "All defaulted:",
        "fn()",
    )
    run(
        "Var-kwargs, 16 unmatched:",
        "fnvar({})".format(params_str(ORDER)),
    )
    run(
        "Var-kwargs, 2 unmatched:",
        "fnvar({})".format(params_str(ORDER[:2])),
    )

for desc, times in results.items():
    print(f"{desc:>40}", " / ".join(f"{nsec:.1f}ns" for nsec in times))

Python 3.12, regular ABI:

Stock nanobind:

                         All positional: 104.7ns / 102.4ns
                            All keyword: 493.2ns / 493.5ns
           Some keyword, rest defaulted: 91.6ns / 91.5ns
                          All defaulted: 77.3ns / 77.3ns
               Var-kwargs, 16 unmatched: 668.1ns / 668.7ns
                Var-kwargs, 2 unmatched: 70.0ns / 70.0ns

This PR as uploaded: pretty much a wash with stock nanobind, though var-kwargs + filling in defaults gets somewhat slower

                         All positional: 104.2ns / 101.7ns
                            All keyword: 482.3ns / 476.3ns
           Some keyword, rest defaulted: 92.8ns / 90.6ns
                          All defaulted: 73.9ns / 73.8ns
               Var-kwargs, 16 unmatched: 752.7ns / 752.6ns
                Var-kwargs, 2 unmatched: 85.7ns / 85.6ns

Always compare contents: really bad

                         All positional: 104.4ns / 102.7ns
                            All keyword: 1268.2ns / 1264.2ns
           Some keyword, rest defaulted: 276.7ns / 275.7ns
                          All defaulted: 78.9ns / 77.6ns
               Var-kwargs, 16 unmatched: 786.4ns / 775.2ns
                Var-kwargs, 2 unmatched: 82.2ns / 82.1ns

Intern first unconditionally: somewhat worse but maybe acceptable

                         All positional: 107.7ns / 104.5ns
                            All keyword: 543.4ns / 540.4ns
           Some keyword, rest defaulted: 100.6ns / 100.9ns
                          All defaulted: 79.1ns / 79.5ns
               Var-kwargs, 16 unmatched: 706.0ns / 706.3ns
                Var-kwargs, 2 unmatched: 79.9ns / 80.0ns

Intern first if the string object isn't already interned: gets back much of the difference

                         All positional: 106.3ns / 104.0ns
                            All keyword: 501.4ns / 500.4ns
           Some keyword, rest defaulted: 96.0ns / 95.5ns
                          All defaulted: 79.1ns / 78.9ns
               Var-kwargs, 16 unmatched: 663.3ns / 662.0ns
                Var-kwargs, 2 unmatched: 75.1ns / 74.8ns

@wjakob
Copy link
Owner

wjakob commented Mar 12, 2024

Fascinating! Clear signal, we do not want to run the comparisons by default.

Intern first if the string object isn't already interned

I don't follow what you did here, can you share the diff for this strategy?

One more request: could you run this in a limited API build? Should be as simple as adding STABLE_ABI to the cmake command and ensuring that you have Python 3.12 installed. Be careful to delete the old .so binding file, otherwise Python will prefer the version-specific one.

I am thinking that the "intern first" strategy will win in that situation due to the considerably lower number of tuple indexing operations.

@oremanj
Copy link
Contributor Author

oremanj commented Mar 12, 2024

I don't follow what you did here, can you share the diff for this strategy?

Just added one line before InternInPlace:

#if !defined(PYPY_VERSION) && !defined(Py_LIMITED_API)
        if (PyUnicode_Check(key) && !(((PyASCIIObject *)key)->state.interned))
#endif
            PyUnicode_InternInPlace(&key_interned);

One more request: could you run this in a limited API build?

Stand by for results there!

I am thinking that the "intern first" strategy will win in that situation due to the considerably lower number of tuple indexing operations.

One thing to note about the PR as currently implemented is that we don't actually enter the second double-loop unless there are unused keywords. That will only happen when there is actually a non-interned keyword, or the function takes **kwargs, or we're about to return failure. My benchmark doesn't capture that case but I can come up with another one that does.

@wjakob
Copy link
Owner

wjakob commented Mar 12, 2024

FWIW I don't think that the PyUnicode_Check(key) could ever fail. The vector call protocol spec says that keywords is, quote: "Either NULL or a tuple with the names of the keyword arguments" (source)

@oremanj
Copy link
Contributor Author

oremanj commented Mar 12, 2024

Stable ABI benchmarks:

Stock nanobind:

                         All positional: 126.2ns / 123.8ns
                            All keyword: 706.7ns / 702.7ns
           Some keyword, rest defaulted: 150.5ns / 150.5ns
                          All defaulted: 99.2ns / 99.2ns
               Var-kwargs, 16 unmatched: 727.5ns / 712.5ns
                Var-kwargs, 2 unmatched: 87.7ns / 87.4ns

PR as uploaded:

                         All positional: 126.7ns / 123.9ns
                            All keyword: 692.0ns / 692.2ns
           Some keyword, rest defaulted: 165.3ns / 165.6ns
                          All defaulted: 101.3ns / 101.2ns
               Var-kwargs, 16 unmatched: 858.6ns / 839.5ns
                Var-kwargs, 2 unmatched: 105.3ns / 105.3ns

Always compare:

                         All positional: 126.4ns / 123.8ns
                            All keyword: 1494.5ns / 1474.5ns
           Some keyword, rest defaulted: 327.8ns / 327.4ns
                          All defaulted: 99.2ns / 99.3ns
               Var-kwargs, 16 unmatched: 807.9ns / 809.1ns
                Var-kwargs, 2 unmatched: 98.6ns / 98.6ns

Intern first:

                         All positional: 128.4ns / 126.4ns
                            All keyword: 649.9ns / 651.8ns
           Some keyword, rest defaulted: 131.1ns / 130.9ns
                          All defaulted: 102.0ns / 102.2ns
               Var-kwargs, 16 unmatched: 821.0ns / 822.0ns
                Var-kwargs, 2 unmatched: 98.9ns / 98.8ns

@wjakob
Copy link
Owner

wjakob commented Mar 12, 2024

Great! What about something like this to further accelerate the non-stable ABI variant? (This potentially avoids the alloca and reference counting)

(Untested, may need some syntax tweaks)

// Ensure that keyword argument names are interned
bool interned;
#if !defined(PYPY_VERSION) && !defined(Py_LIMITED_API)
    interned = true;
    for (size_t i = 0; i < nkwargs_in; ++i) {
        PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i);
        interned &= ((PyASCIIObject *)key)->state.interned;
    }
#else
    interned = false;
#endif

PyObject **kwnames = nullptr;
if (NB_LIKELY(interned)) {
    kwnames = kwargs_in;
} else {
    kwnames = (PyObject **) alloca(nkwargs_in * sizeof(PyObject *));
    for (size_t i = 0; i < nkwargs_in; ++i) {
        PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i),
                 *key_interned = key;
        Py_INCREF(key_interned);

        PyUnicode_InternInPlace(&key_interned);

        if (NB_LIKELY(key == key_interned)) // string was already interned
            Py_DECREF(key_interned);
        else
            cleanup.append(key_interned);
        kwnames[i] = key_interned;
    }
}

By the way, is it valid to assume that key is always a PyASCIIObject? I think it is always a unicode object, but IIRC those can be represented by two different data structures depending on whether or not the string is ASCII or not.

@wjakob
Copy link
Owner

wjakob commented Mar 12, 2024

ah nvm, fix incoming

@oremanj
Copy link
Contributor Author

oremanj commented Mar 12, 2024

I updated the non-stableabi benchmarks with the new var-kwargs tests also, and updated the comment above to show my current benchmarking script.

I think "intern first" clearly wins on stable ABI. It's less clear on the regular ABI; it depends whether you'd rather penalize var-kwargs functions with unspecified defaulted args a lot, or everyone a little bit. I'll try your suggested tweak and see if we can close the gap there. I think "intern first" wins overall because it's so much simpler; guessing you feel the same?

By the way, is it valid to assume that key is always a PyASCIIObject? I think it is always a unicode object, but IIRC those can be represented by two different data structures depending on whether or not the string is ASCII or not.

CompactUnicodeObject adds additional fields past the end of ASCIIObject, so this is safe. I also just noticed there's a macro PyUnicode_CHECK_INTERNED which I'll use for better clarity here.

@wjakob
Copy link
Owner

wjakob commented Mar 12, 2024

This works apparently:

def f(**kwargs):
    print(kwargs)

f(ğ=1)

@wjakob
Copy link
Owner

wjakob commented Mar 12, 2024

NVM, I see now that the fancier versions all build on the ascii object.

@wjakob
Copy link
Owner

wjakob commented Mar 12, 2024

How about this?

    // Ensure that keyword argument names are interned. That makes it faster
    // to compare them against pre-interned argument names in the overload chain.
    PyObject **kwnames = nullptr;

#if !defined(PYPY_VERSION) && !defined(Py_LIMITED_API)
    bool kwnames_interned = true;
    for (size_t i = 0; i < nkwargs_in; ++i) {
        PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i);
        kwnames_interned &= ((PyASCIIObject *) key)->state.interned != 0;
    }
    if (NB_LIKELY(kwnames_interned)) {
        kwnames = ((PyTupleObject *) kwargs_in)->ob_item;
        goto traverse_overloads;
    }
#endif

    kwnames = (PyObject **) alloca(nkwargs_in * sizeof(PyObject *));
    for (size_t i = 0; i < nkwargs_in; ++i) {
        PyObject *key = NB_TUPLE_GET_ITEM(kwargs_in, i),
                 *key_interned = key;
        Py_INCREF(key_interned);

        PyUnicode_InternInPlace(&key_interned);

        if (NB_LIKELY(key == key_interned)) // string was already interned
            Py_DECREF(key_interned);
        else
            cleanup.append(key_interned);
        kwnames[i] = key_interned;
    }

traverse_overloads:

@oremanj
Copy link
Contributor Author

oremanj commented Mar 12, 2024

Perfect.

=== 3.12 ABI ===

stock:

                         All positional: 104.9ns / 102.2ns
                            All keyword: 495.5ns / 494.8ns
           Some keyword, rest defaulted: 91.5ns / 91.4ns
                          All defaulted: 77.3ns / 77.3ns
               Var-kwargs, 16 unmatched: 659.9ns / 660.4ns
                Var-kwargs, 2 unmatched: 69.9ns / 70.0ns

intern before:

                         All positional: 105.3ns / 102.7ns
                            All keyword: 489.0ns / 490.5ns
           Some keyword, rest defaulted: 93.1ns / 92.9ns
                          All defaulted: 77.2ns / 77.3ns
               Var-kwargs, 16 unmatched: 658.5ns / 657.5ns
                Var-kwargs, 2 unmatched: 70.6ns / 70.3ns

Verdict: no noticeable difference.

=== Stable ABI ===

stock:

                         All positional: 126.2ns / 123.8ns
                            All keyword: 713.0ns / 715.9ns
           Some keyword, rest defaulted: 150.5ns / 150.7ns
                          All defaulted: 99.2ns / 99.3ns
               Var-kwargs, 16 unmatched: 737.5ns / 738.9ns
                Var-kwargs, 2 unmatched: 88.7ns / 88.5ns

intern before:

                         All positional: 128.7ns / 126.4ns
                            All keyword: 614.0ns / 615.5ns
           Some keyword, rest defaulted: 131.1ns / 131.1ns
                          All defaulted: 102.1ns / 102.1ns
               Var-kwargs, 16 unmatched: 757.9ns / 758.5ns
                Var-kwargs, 2 unmatched: 98.7ns / 98.8ns

Verdict: clear improvement on a few benchmarks due to reducing the number of tuple calls; cost on others is not awful.

@wjakob
Copy link
Owner

wjakob commented Mar 12, 2024

Great! Can you update the commit with this one and add a changelog entry?

@oremanj
Copy link
Contributor Author

oremanj commented Mar 12, 2024

Should be all set! Thanks for your help in figuring out a better approach here.

@wjakob
Copy link
Owner

wjakob commented Mar 12, 2024

Thanks 👍

@wjakob wjakob merged commit fa873d1 into wjakob:master Mar 12, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG]: kwargs dispatch fails when receiving a non-interned version of a keyword
2 participants