Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[smart_holder] Add a new return value policy return_as_bytes #3838

Merged
merged 15 commits into from
Apr 15, 2022

Conversation

wangxf123456
Copy link
Contributor

Description

Add a new return value policy return_as_bytes to make C++ functions return bytes to Python instead of str.

We can convert return values to bytes by applying py::bytes, but this might be hard when dealing with nested types.

@wangxf123456 wangxf123456 marked this pull request as ready for review March 31, 2022 00:23
@Skylion007
Copy link
Collaborator

This isn't necessary though as we already have bindings for 'py::bytesas well aspy::bytearray` and they both have constructors for std::string. Adding an additional return_value_policy just for this specific use case seems ill-advised and will make maintenance much more difficult. All one needs to do is wrap the std::string or std::stringview with the py::bytes constructor to achieve the desired result.

// test return_value_policy::return_as_bytes
m.def(
"invalid_utf8_string_as_bytes",
[]() { return std::string("\xba\xd0\xba\xd0"); },
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[]() { return std::string("\xba\xd0\xba\xd0"); },
[]() { return py::bytes(std::string("\xba\xd0\xba\xd0")); },

is all that was ever needed.

@Skylion007 Skylion007 closed this Mar 31, 2022
// test return_value_policy::return_as_bytes
m.def(
"invalid_utf8_string_array_as_bytes",
[]() { return std::array<std::string, 1>{{"\xba\xd0\xba\xd0"}}; },
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[]() { return std::array<std::string, 1>{{"\xba\xd0\xba\xd0"}}; },
[]() { return std::array<py::bytes, 1>{{"\xba\xd0\xba\xd0"}}; },

would solve this use case, no?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better yet, you could return the py::array directly or use py::memoryview::from_buffer

@Skylion007
Copy link
Collaborator

If you are every unhappy with the behavior of casters, you can always just return the python objects themselves to the same effect.

@rwgk
Copy link
Collaborator

rwgk commented Mar 31, 2022

If you are every unhappy with the behavior of casters, you can always just return the python objects themselves to the same effect.

This is for the PyCLIF pybind11 code generator. You're right, for a simple std::string return we could generate a lambda with py::bytes, but for nested types that's not possible, as far as we can see, without very complex logic somehow unraveling the nested types, to then generated nested new code. Concretely:

 -        []() { return std::array<std::string, 1>{{"\xba\xd0\xba\xd0"}}; },
 +       []() { return std::array<py::bytes, 1>{{"\xba\xd0\xba\xd0"}}; },

That is a transformation we'd have to make automatically and recursively.

In contrast, the core of this PR is super simple:

  • adds one enum (cost essentially zero)
  • adds one if in the caster

It got us from ~97.5% success to ~98.5%, i.e. it was the top-most-important fix by far that we needed. (We're in the "long tail" phase of the project.)

@rwgk
Copy link
Collaborator

rwgk commented Mar 31, 2022

We really need this — essentially 6-line change.

@rwgk rwgk reopened this Mar 31, 2022
@Skylion007
Copy link
Collaborator

My biggest concern is that this is to other return_value_policy. IE, we may want to combine this flag with reference_internal etc at some point in the future. Also, this is only valid for a single underlying type (std::string caster), which seems wrong. I agree we probably need a better way to handle this, but hacking the caster like this doesn't seem right. Maybe we need a proxy wrapper that automatically triggers this modified caster behavior through a templated function?

@Skylion007
Copy link
Collaborator

Skylion007 commented Mar 31, 2022

What we really need way is the ability to override the value_conv and key_conv behavior of these templated types. Abusing return_value_policy for this has terrible code smell.

@rwgk
Copy link
Collaborator

rwgk commented Mar 31, 2022

What we really need way is the ability to override the value_conv and key_conv behavior of these templated types. Abusing return_value_policy for this has terrible code smell.

Hm ... could you explain more?

This PR is taking a very simple path to achieve the desired behavior. Could it be even simpler?

@henryiii
Copy link
Collaborator

Taking a simple path is not always the correct path. If you have a clear meaning for return value policy (which is not to control the conversion of types), then abusing it to do what you want here can spell disaster down the road. It might make it impossible to refactor, for example, is hard to document and confusing to newcomers. I think it's worth investigating to see if there's a way around it without making it a return_value_policy.

@Skylion007
Copy link
Collaborator

Skylion007 commented Mar 31, 2022

º> This PR is taking a very simple path to achieve the desired behavior. Could it be even simpler?
Simpler is not necessarily the issue here.
Here is a trivial use case the current API design would not cover:

       []() { return std::array<py::bytes, 1>{{"\xba\xd0\xba\xd0"}}; },

and specify to have it return by reference, by value, or by copy. Since we return value policy is an Enum and not a flag, we cannot combine this return value policies easily with the other ones which are mutually exclusive to one another.

The real issue here is that we have no way disambiguate casters except by doing the cast ourselves in the lambda. This is normally trivial, but becomes non-trivial for container or other variant types. We could of course fix that with another caster which changes the behavior of current caster.
The main issue is that we cannot disambiguate return types that multiple valid C++ casts. Strings actually have three:

  1. An immutable unicode or other codec variant string
  2. An immutable byte object
  3. A mutable bytearray (Do you suggest we add yet ANOTHER return value policy for bytearrays which are nested?

This exposes three issues with our current :

  1. We cannot pass any extra arguments to our C++ to Python casters.
  2. We cannot pass any arguments to the value and key casters in our STL casters (std::vector, std::array, and std::unordered_map, etc...) (EXCEPT FOR return_value_policy, and the parent handle).
  3. We cannot specify a custom type caster to be used only for a single function, we have to reimplement that type caster as part of the lambda.

We could just add a special wrapper that triggers an alternative version of the stl_casters, but I don't think that would solve your problem since presumably your issue is actually ABSL or other container types?

The easiest solution is probably to call the caster directly in the lambda with some modified optional args.
The messiest but robust solution would be to do some STL magic to rip apart these nested casters and convert them to byte strings and automatically has a templated getter that does the casting as it is accessed: https://stackoverflow.com/a/14280396/2444240

We could abstract the list_caster, array_caster, map_caster, and set_caster further to allow for templates which modify the key_conv and value_conv as well though.

Another solution though would probably be to make this another extra arg that specified in the def() block that specializes the casters or allows the user to specify their own.

@henryiii @wjakob I would love to hear your thoughts on how to best solve this use case / ambiguity.

handle s = decode_utfN(buffer, nbytes);
handle s;
if (policy == return_value_policy::return_as_bytes) {
s = PYBIND11_BYTES_FROM_STRING_AND_SIZE(buffer, nbytes);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyBytes_FromStringAndSize

The old macro is unfinished Python 2 cleanup. For new code like this it's best to use the Python 3 C API directly.

reference_internal
reference_internal,

/** Use this policy to make C++ functions return bytes to Python instead of
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of word-smithing:

With this policy, C++ string types are converted to Python bytes, instead of str. This is most useful when a C++ function returns a container-like type with nested C++ string types, and py::bytes cannot be applied easily. Note that this return_value_policy is not concerned with lifetime/ownership semantics, like the other policies, but the purpose of return_as_bytes is certain to be orthogonal, because C++ strings are always copied to Python bytes or str.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ strings are always copied to Python bytes or str.

What about a dictionary that has byte strings as keys and references to a C++ Object as values? This still would not work even with the wordsmithing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agreed.
But see my longer comment from a couple minutes ago.

@rwgk
Copy link
Collaborator

rwgk commented Mar 31, 2022

º> This PR is taking a very simple path to achieve the desired behavior. Could it be even simpler? Simpler is not necessarily the issue here. Here is a trivial use case the current API design would not cover:

       []() { return std::array<py::bytes, 1>{{"\xba\xd0\xba\xd0"}}; },

and specify to have it return by reference, by value, or by copy. Since we return value policy is an Enum and not a flag, we cannot combine this return value policies easily with the other ones which are mutually exclusive to one another.

The real issue here is that we have no way disambiguate casters except by doing the cast ourselves in the lambda. This is normally trivial, but becomes non-trivial for container or other variant types. We could of course fix that with another caster which changes the behavior of current caster. The main issue is that we cannot disambiguate return types that multiple valid C++ casts. Strings actually have three:

  1. An immutable unicode or other codec variant string
  2. An immutable byte object
  3. A mutable bytearray (Do you suggest we add yet ANOTHER return value policy for bytearrays which are nested?

This exposes three issues with our current :

  1. We cannot pass any extra arguments to our C++ to Python casters.
  2. We cannot pass any arguments to the value and key casters in our STL casters (std::vector, std::array, and std::unordered_map, etc...) (EXCEPT FOR return_value_policy, and the parent handle).
  3. We cannot specify a custom type caster to be used only for a single function, we have to reimplement that type caster as part of the lambda.

We could just add a special wrapper that triggers an alternative version of the stl_casters, but I don't think that would solve your problem since presumably your issue is actually ABSL or other container types?

The easiest solution is probably to call the caster directly in the lambda with some modified optional args. The messiest but robust solution would be to do some STL magic to rip apart these nested casters and convert them to byte strings and automatically has a templated getter that does the casting as it is accessed: https://stackoverflow.com/a/14280396/2444240

We could abstract the list_caster, array_caster, map_caster, and set_caster further to allow for templates which modify the key_conv and value_conv as well though.

Another solution though would probably be to make this another extra arg that specified in the def() block that specializes the casters or allows the user to specify their own.

@henryiii @wjakob I would love to hear your thoughts on how to best solve this use case / ambiguity.

This is a very good analysis, thanks Aaron! For strategic reasons, we cannot afford to make this a huge general project at the moment, we have to approach this with a long-term view: we still have to prove that pybind11 actually works for Google, by successfully integrating it into PyCLIF, i.e. we have to get from 98.5% to 100%. Once we've made that hurdle, we can devote more time on bigger projects for pybind11 itself, like generalizing the return_value_policy concept.

return_value_policy as-is is certainly not capable of covering complex cases in general. The bytes/str issue is just a more (the most?) practically important example. In that sense we're not making anything worse, we're just brining a fundamental limitation to light. — Assume the enum type wasn't called return_value_policy, but return_value_ownership_policy, clearly bytes/str would be a mismatch. But the name is return_value_policy, and bytes/str totally makes sense, it's just an orthogonal dimension that wasn't used before.

Will anyone ever need something more general? Will that just be over-engineering? I don't know.

With one enum and if we can cover what we need. We can generalize from there, as and if needed. I don't think that one extra enum targeted at (currently) one if in one caster can lead to terrible problems down the road.

@Skylion007
Copy link
Collaborator

This is a very good analysis, thanks Aaron! For strategic reasons, we cannot afford to make this a huge general project at the moment, we have to approach this with a long-term view: we still have to prove that pybind11 actually works for Google, by successfully integrating it into PyCLIF, i.e. we have to get from 98.5% to 100%. Once we've made that hurdle, we can devote more time on bigger projects for pybind11 itself, like generalizing the return_value_policy concept.

Maybe it's best for this to live in another branch and merge in once it's cleaned up and up to 100%?

@Skylion007
Copy link
Collaborator

I would also be okay if it's hidden by some IFDEF flag. I just really don't want this becoming a part of the public API that we have to support later like the PYTHON2 compatability macros.

@rwgk
Copy link
Collaborator

rwgk commented Mar 31, 2022

This is a very good analysis, thanks Aaron! For strategic reasons, we cannot afford to make this a huge general project at the moment, we have to approach this with a long-term view: we still have to prove that pybind11 actually works for Google, by successfully integrating it into PyCLIF, i.e. we have to get from 98.5% to 100%. Once we've made that hurdle, we can devote more time on bigger projects for pybind11 itself, like generalizing the return_value_policy concept.

Maybe it's best for this to live in another branch and merge in

I could easily maintain this in the smart_holder branch if you prefer. pybind11 will not work for PyCLIF without this, just like pybind11 won't work for PyCLIF without py::smart_holder. It's a pretty good fit there.

once it's cleaned up and up to 100%?

My totally personal bet: nobody will ever have enough motivation (time/money) to generalize return_value_policy further, to address the general limitations you pointed out. Note that Dave Abrahams invented it 20+ years ago. It's limited, but does the job for the most part. After literally two decades it seems unlikely that anyone would seriously want to tear this up.

@Skylion007
Copy link
Collaborator

Actually @wangxf123456 for internal usages, does this return_value_policy works with dicts? Or does it only work with sequences and sets?

be applied easily. Note that this return_value_policy is not concerned
with lifetime/ownership semantics, like the other policies, but the
purpose of return_as_bytes is certain to be orthogonal, because C++
strings are always copied to Python bytes or str. */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we mark this as experimental and likely to change in the future?

Copy link
Collaborator

@Skylion007 Skylion007 Mar 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wangxf123456 Please add that this is for sequences only too: #3838 (comment) here too

@wangxf123456
Copy link
Contributor Author

Actually @wangxf123456 for internal usages, does this return_value_policy works with dicts? Or does it only work with sequences and sets?

This should only work with sequences and sets. For example, return types like Dict[str, bytes] does not work. But there are only a few cases like this internally and can be easily fixed. For more complex cases, I have not done enough tests. But there are only less than 10 cases like Dict[bytes, object] internally.

@Skylion007
Copy link
Collaborator

@wangxf123456 @rwgk I also found some C++ template magic to rebind the type of (STL) container types: https://stackoverflow.com/a/32214640/2444240 We might be able to use a trick like this to recursively change the Return type parameter to disambiguate how we should return std::string.

The TLDR of this is we need to have a variant of std::string caster that prefers to output bytes and we need a way to signal that should be called.

One idea is to change the behavior of out_cast with a special tag like we for is_operator. This is actually pretty easy, but would involve setting a static variable that I would like to avoid. We need some way to pass that info into the type caster or query that extra arg from inside the std::string caster.

There also may be a way to abuse the polymorphic_type_hook to do this.

I am by no means an expert on C++ templating idioms, but I feel like there has to be a better way to do this than abusing the return type.

@henryiii
Copy link
Collaborator

henryiii commented Apr 5, 2022

I'd much rather @Skylion007's suggestion be attempted. This is adding a misusage of the return policies that doesn't even cover all cases like dict's, or complex types. It does not combine with other return types (since they are disjoint conceptual features). If we add it, it will be impossible to pull out (just see our "private" compatibility macros!). Just because it's easy don't make it right. If it is really, really needed and attempts to do it properly have failed, then the name should start with an underscore, and probably have a warning in the code that it is not guarantied to be kept in the future.

@rwgk
Copy link
Collaborator

rwgk commented Apr 5, 2022

FYI: I'm systematically combing through include/pybind11 to see how the one new enum + if could disturb things, or set up traps. I'll report here when I'm done. Could take a few more days. (Our focus is on larger scale issues that we need to sort out to get to 100% success rate for PyCLIF + pybind11.)

@henryiii
Copy link
Collaborator

henryiii commented Apr 5, 2022

If we add it, it will be impossible to pull out (just see our "private" compatibility macros!). If it is really, really needed, then the name should start with an underscore, and probably have a warning in the code that it is not guarantied to be kept in the future.

I’m not worried about the current code. I’m worried about breaking the mental and programmatic model of return value policies and casters.

@henryiii
Copy link
Collaborator

henryiii commented Apr 5, 2022

I'm systematically combing through include/pybind11 … Could take a few more days.

That time might be better spent trying @Skylion007’s ideas above.

@henryiii
Copy link
Collaborator

henryiii commented Apr 5, 2022

What c++ standard are you targeting in PyCLIF?

@rwgk
Copy link
Collaborator

rwgk commented Apr 5, 2022

I'm systematically combing through include/pybind11 … Could take a few more days.

That time might be better spent trying @Skylion007’s ideas above.

Data first. Also priorities. I don't want the big project die the death of a thousand cuts, getting sidetracked with too many side issues.

What c++ standard are you targeting in PyCLIF?

C++17 required (already).

But I want to keep the smart_holder branch compatible with master. (I spent many hours keeping it compatible with all the old compilers.)

@rwgk rwgk changed the base branch from master to smart_holder April 15, 2022 14:28
@henryiii
Copy link
Collaborator

If it helps: I've already said I'd be okay with it if it has a private name (_return_as_bytes) and a warning that it might be removed. This does need to be fixed properly soon, but a quick solution is okay as long as no one users it beyond users who can move to the new solution quickly once it's ready.

@rwgk rwgk changed the title Add a new return value policy return_as_bytes [smart_holder] Add a new return value policy return_as_bytes Apr 15, 2022
Ralf W. Grosse-Kunstleve added 2 commits April 15, 2022 08:51
@rwgk
Copy link
Collaborator

rwgk commented Apr 15, 2022

If it helps: I've already said I'd be okay with it if it has a private name (_return_as_bytes) and a warning that it might be removed. This does need to be fixed properly soon, but a quick solution is okay as long as no one users it beyond users who can move to the new solution quickly once it's ready.

Thanks, I added the underscore. I'll merge this now on the smart_holder branch, to keep the PyCLIF-pybind11 integration work on track. If @wjakob supports having this on master, I'll back-port.

@rwgk rwgk merged commit 7064d43 into pybind:smart_holder Apr 15, 2022
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Apr 15, 2022
@rwgk rwgk removed the needs changelog Possibly needs a changelog entry label Apr 15, 2022
rwgk pushed a commit to rwgk/pybind11 that referenced this pull request Jun 11, 2024
pybind#3838)"

This reverts commit 7064d43.

Conflicts resolved in:
    include/pybind11/eigen.h
    tests/test_builtin_casters.cpp
rwgk pushed a commit that referenced this pull request Jun 11, 2024
#3838)"

This reverts commit 7064d43.

Conflicts resolved in:
    include/pybind11/eigen.h
    tests/test_builtin_casters.cpp
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants