Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode fixes and docs #624

Merged
merged 5 commits into from
Feb 14, 2017
Merged

Conversation

jagerman
Copy link
Member

As discussed in #591, we currently don't fail gracefully when encountering invalid unicode. This PR propagates the unicode error, while also adding support for std::u16string, std::u32string and the associated char16_t/char32_t types.

std::is_same<CharT, char>::value ? 8 :
std::is_same<CharT, char16_t>::value ? 16 :
std::is_same<CharT, char32_t>::value ? 32 :
(sizeof(CharT) == 2 ? 16 : 32); /* std::wstring is UTF-16 on Windows, UTF-32 everywhere else */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this assumption be removed using something like sizeof(wchar_t)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UTF_N = sizeof(CharT)*8?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The char16_t and char32_t aren't guaranteed to be 2 and 4 bytes, only that they are at least 2/4 bytes (they are more like std::uint_least16_t). They are guaranteed to be unique types, however, hence checking the actual types rather than just the sizes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That all makes sense then.

}
// Helper class for UTF-{8,16,32} strings:
template <typename CharT, class Traits, class Allocator>
struct type_caster<std::basic_string<CharT, Traits, Allocator>, enable_if_t<is_std_char_type<CharT>::value>> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SFINAE potentially overkill: is there ever a situation where we could get a std::basic_string that uses a non-character type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory you could create a std::basic_string<int>. We aren't likely to see it, but I figured better to be explicit in what we support here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok!


object utfNbytes = reinterpret_steal<object>(PyUnicode_AsEncodedString(
load_src.ptr(),
UTF_N == 8 ? "utf8" : UTF_N == 16 ? "utf16" : "utf32",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be moved out of the function call into a constexpr const char *

Copy link
Member Author

@jagerman jagerman Feb 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't it be treated that way anyway, given that the UTF_N is a constexpr? Never mind, I see it's used more than once.

static handle cast(const StringType &src, return_value_policy /* policy */, handle /* parent */) {
const char *buffer = reinterpret_cast<const char *>(src.c_str());
ssize_t nbytes = ssize_t(src.size() * sizeof(CharT));
handle s = PyUnicode_Decode(buffer, nbytes, UTF_N == 8 ? "utf8" : UTF_N == 16 ? "utf16" : "utf32", nullptr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

@wjakob
Copy link
Member

wjakob commented Feb 1, 2017

This looks great -- I added a few minor comments.

@jagerman
Copy link
Member Author

jagerman commented Feb 1, 2017

Rebased onto master (and squashed in the constexpr const char *encoding change).

return PyUnicode_FromWideChar(wstr, 1);
static handle cast(CharT src, return_value_policy policy, handle parent) {
if (std::is_same<char, CharT>::value) {
handle s = PyUnicode_DecodeLatin1((const char *) &src, 1, nullptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is DecodeLatin1 used here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we have to do something with a single char in the 128-255 range, and since Latin1 codepoints are identical to unicode codepoints, this seemed the most logical choice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense.

@jbarlow83
Copy link
Contributor

jbarlow83 commented Feb 3, 2017

@jagerman Have you checked what happens in this PR if a Python str is cast to a char and the Unicode code point is U+0100 and above, or wchar16_t for U+01_0000 and above?

I found that in master this currently truncates the code point which is probably incorrect (and maybe endian unsafe too). Since you changed char casting perhaps your changes address this already? If not I will write up the issue.

@jagerman
Copy link
Member Author

jagerman commented Feb 4, 2017

(Edit: ignore all these; see my later comment below).

For all of the char types, it decodes to the UTF-n encoding in native byte order, then returns the first char/char16_t/etc. element. (For UTF-16 and -32, the BOM that Python always adds to the encoded string is skipped).

For char you're right that this is going to be wrong (not only for U+100 and above, but also for U+80 and above): the first byte is going to be just the char value of the initial byte of a multibyte UTF-8 sequence. We should probably fix it to return the unicode codepoint for 0-255, and throw for anything above 255. (Though char is signed, so the 128-255 need to become the appropriate negative, i.e. via a static_cast<char>((unsigned char) value)). Moreover I think we should do this before we tell Python to encode as UTF-8.

For char16_t (and wchar_t on Windows) we should probably be throwing the same sort of error for >U+FFFF values. Endian issues aren't a concern because we're never converting to bytes, but just keeping everything in native 16-bit integers. (Or to put it another way, endianness is a concern for the std::u16string user, above the pybind11 level).

Now that I think about it, I'm not sure why we aren't also checking the src string length: I'd expect an exception if I tried to call a function taking a char with a string, but the current behaviour is to just take the first char value. (because, of course, this is the caster for a char * as well)

@jbarlow83
Copy link
Contributor

Okay, I will open a new issue to describe this behavior.

@jagerman
Copy link
Member Author

jagerman commented Feb 4, 2017

No need, your comment here suffices.

@jagerman
Copy link
Member Author

jagerman commented Feb 4, 2017

Actually, scratch all that I said above. The char caster needs to be the way it is because it gets invoked for char * values, too (and, in fact, this is its primary purpose). That's something of a limitation in how pybind deals with pointers, but I'm not sure it's a huge problem: accepting a single char is pretty rare in practice.

So I guess my question: is this merely a theoretical problem, or is this a problem you're running into in actual code?

@jbarlow83
Copy link
Contributor

It's not a problem I have, it's a corner case I discovered because @wjakob asked for my Unicode documentation to mention character literals.

I agree that modern C++11 code is unlikely to accept char literals but there's lots of old stuff out there. I found a practical example. I contribute to tesseract-ocr which was written in ancient C++ but is gradually being modernized. Take a quick look at its internal STRING class:

https://github.com/tesseract-ocr/tesseract/blob/d55f462c9c57b9fe5c3c146e126b50ed72090a3d/ccutil/strngs.h

The interface has a several functions like STRING::split() and operator+ that accept char literals. Even though this is ancient C++, if I were going to modernize this code, my strategy would be to wrap it with pybind11, write some test cases, modernize the implementation to wrap std::string internally, and then modernize the interface to phase it out.

We've established that the pybind11 behavior is incorrect in that it mangles UTF-8, but it is a corner case. I agree it may not even be worth fully fixing, given that char* is much more important.

Do you think it would be possible to have a static_assert that rejects functions with a char or 16-bit wide single character in their signature? 32-bit wide characters work fine, so functions that bind to Python could use that as the literal type and implement a little wrapper to convert in whatever way the caller thinks is appropriate.

@jagerman
Copy link
Member Author

jagerman commented Feb 5, 2017

I just pushed a change to split up the char * and char casters into separate classes. This required going back to the pybind11 assumption that T and T* are always equivalent from a type caster point of view, amending it to T and T* are always equivalent except when T is a char/wchar_t/etc.

@jagerman jagerman force-pushed the unicode-fixes-and-docs branch 3 times, most recently from a0a98d4 to a79b5f6 Compare February 5, 2017 07:54
@jagerman
Copy link
Member Author

jagerman commented Feb 5, 2017

Will fight with MSVC and figure out what to do with PyPy tomorrow.

@jagerman
Copy link
Member Author

jagerman commented Feb 5, 2017

Pushed a rebased/squashed version with MSVC/PyPy fixes, and a cleanup to the pointer type_caster implementation.

@jagerman
Copy link
Member Author

jagerman commented Feb 5, 2017

... and again to fix the conflict.


static PYBIND11_DESCR name() { return type_descr(_(PYBIND11_STRING_NAME)); }
// chr()/unichr() isn't technically a type, but it should get the point across:
PYBIND11_TYPE_CASTER(CharT, _(PYBIND11_CHR_NAME "(") + _<(max <= 0xffff)>(_("<=") + _<max>(), _("")) + _(")"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a quite strange type annotation (explicitly specifying the range). Perhaps just chr/unichr?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's a bit strange, but I have a feeling it's going to be rather confusing to see an error about invalid function arguments with a unichar accepted, but only certain unichar actually accepted. (I guess this is more or less the same issue with #651).

I suppose, bigger picture, it would be nice to be able to override or supplement the call failure message so that a caster could return an argument failure with a reason (e.g. "Only char values <= 255 are accepted") and have that reason displayed in the exception (but, I suppose, only if the function isn't overloaded).

Copy link
Member

@wjakob wjakob Feb 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you insist on having an informative type annotation, could we call it unichr16 or something like that? Or the type annotation could be more pythonic so that type signature parsers won't be thrown off (e.g. unichr[encoding='utf16'] or unichr[max=0xffff]), but I think that's less readable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't insist; the latest push changed it to just chr()/unichr()`. I'm not sure of a better choice here, since the type is inherently non-pythonic.

Re: unichar[encoding='utf16'] is a bit misleading: we don't care about the specific encoding. I propose we leave it without the range and just leave it up to the binder to mention it in the function description.

Are the brackets in chr() going to cause problems--i.e. should it be shortened to just chr?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I'm sort of ambivalent about what to report here, but I suppose I'd marginally rank them as char[max=0xffff] > chr() > char16 > chr. (With "uni" prepended for python 2.7).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round brackets will cause problems. Parameterized types use square brackets in type annotations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(ok, please feel free to take your top pick then.)

@dean0x7d
Copy link
Member

dean0x7d commented Feb 8, 2017

I'm a bit skeptical about the spun-off char caster. The implementation adds make_caster_type as a gatekeeper for all type_casters with a specialization and SFINAE which will need to be evaluated for every single caster, but it'll only ever be useful for a remote corner case. Python doesn't even know about single character types and any API which implements moving single characters across language boundaries is going to be slow (to put it mildly) and most likely bad design. So why not leave the simpler unified caster and just acknowledge the rare corner case in the docs?

@jagerman
Copy link
Member Author

jagerman commented Feb 8, 2017

The main point is that the unified caster is broken in its current state for char values: it isn't valid to send the first byte of a UTF-8 string as a char value, but rather this should be a conversion failure (along the same lines as trying to pass -1 to a function taking an unsigned int). I think the alternative would need to be explicitly not allowing binding char arguments rather than silently truncating python unicode string input to a single utf-8 byte.

@jagerman
Copy link
Member Author

jagerman commented Feb 8, 2017

The single-gatekeeper make_caster_type has separate potential advantage, though: it would let us change the define-your-own-caster behaviour to get it out of the pybind::detail namespace, e.g. with something like:

class MyCaster { ... };
DECLARE_CUSTOM_CASTER(MyType, MyCaster);

rather than having custom casters declare themselves in the pybind11::detail namespace.

(But of course, this doesn't necessarily mean we need the char caster support).

@dean0x7d
Copy link
Member

dean0x7d commented Feb 8, 2017

The main point is that the unified caster is broken in its current state for char values

I get the issue, I'm just wondering if the expense is worth it for a pretty obscure case. The alternative being documenting it as a limitation which isn't even likely to come up.

The single-gatekeeper make_caster_type has separate potential advantage, though: it would let us change the define-your-own-caster behaviour to get it out of the pybind::detail namespace

Wouldn't just moving type_caster out of detail achieve the same effect?

@jagerman
Copy link
Member Author

jagerman commented Feb 8, 2017

Wouldn't just moving type_caster out of detail achieve the same effect?

I don't see how it could be done without breaking backwards compatibility (which, I'm guessing, is the main reason its in detail in the first place). But even in pybind it would still have external code working in the pybind11 namespace.

But basically, this was just an afterthought: I didn't write it with that intention in the first place.

@jbarlow83
Copy link
Contributor

@dean0x7d I provided an example of it in use further. It was not difficult to find working C++ that accepts char as input; I imagine others can be found. Another example would be people experimenting with strchr and several other core libc functions. Sometimes experimental code finds its way into production. In one project I use pybind11 to access C++ code that is not available to Python, to save development time, not for performance. I'd rather use a pure Python library, but it doesn't exist. And because I want to work with an existing installed library, I can't make any changes to the C++ API.

I think refusing to compile a direct binding to fn(char c) is acceptable, but silently corrupting data at runtime is never acceptable.

@dean0x7d
Copy link
Member

dean0x7d commented Feb 8, 2017

I think an informative error is quite doable:

// the unified char caster with potential silent failure for a single char
operator CharT&() { return value[0]; }
// unified caster with loud failure
operator CharT&() { 
    if (value.size() == 1)
        return value[0]; 
    else
        throw InformativeError();
}

@wjakob
Copy link
Member

wjakob commented Feb 8, 2017

@dean0x7d: Oh wow, I'm really surprised that it wasn't like that before. That's definitely a very severe oversight.

I also agree with @dean0x7d's point that adding a layer of indirection to every type caster just to special-case char variants might not be worth it in the big picture (unless there are some other compelling reasons to do so)

If returning a std::string with invalid utf-8 data, we currently fail
with an uninformative TypeError instead of propagating the
UnicodeDecodeError that Python sets on failure.
This adds support for wchar{16,32}_t character literals and the
associated std::u{16,32}string types.  It also folds the
character/string conversion into a single type_caster template, since
the type casters for string and wstring were mostly the same anyway.
@jagerman
Copy link
Member Author

jagerman commented Feb 9, 2017

I've reverted the caster changes, and did the loud failure during load-casting as @dean0x7d suggested. (Though a little more complex to distinguish between codepoint-too-large and too-many-character errors).

With this commit, when casting to a single character, as opposed to a
C-style string, we make sure the input wasn't a multi-character string
or a single character with codepoint too large for the character type.

This also changes the character cast op to CharT instead of CharT& (we
need to be able to return a temporary decoded char value, but also
because there's little gained by bothering with an lvalue return here).

Finally it changes the char caster to 'has-a-string-caster' instead of
'is-a-string-caster' because, with the cast_op change above, there's
nothing at all gained from inheritance.  This also lets us remove the
`success` from the string caster (which was only there for the char
caster) into the char caster itself.  (I also renamed it to 'none' and
inverted its value to better reflect its purpose).  The None -> nullptr
loading also now takes place only under a `convert = true` load pass.
Although it's unlikely that a function taking a char also has overloads
that can take a None, it seems marginally more correct to treat it as a
conversion.

This commit simplifies the size assumptions about character sizes with
static_asserts to back them up.
Fixes clang's sign conversion warnings.
@wjakob
Copy link
Member

wjakob commented Feb 14, 2017

This looks really great -- thank you for working out all the nitty gritty details regarding character casts.

@wjakob wjakob merged commit 11a337f into pybind:master Feb 14, 2017
jagerman added a commit to jagerman/pybind11 that referenced this pull request Feb 23, 2017
The string conversion logic added in PR pybind#624 for all std::basic_strings
was using the old std::wstring logic, but that was underused and turns
out to have hade a bug in accepting almost anything convertible to
the previous std::string logic by only accepting unicode or byte/string
(Python 3/2) types.

Fixes pybind#685.
jagerman added a commit to jagerman/pybind11 that referenced this pull request Feb 23, 2017
The string conversion logic added in PR pybind#624 for all std::basic_strings
was derived from the old std::wstring logic, but that was underused and
turns out to have had a bug in accepting almost anything convertible to
unicode, while the previous std::string logic was much stricter.  This
restores the previous std::string logic by only allowing actual unicode
or string types.

Fixes pybind#685.
wjakob pushed a commit that referenced this pull request Feb 24, 2017
* Make string conversion stricter

The string conversion logic added in PR #624 for all std::basic_strings
was derived from the old std::wstring logic, but that was underused and
turns out to have had a bug in accepting almost anything convertible to
unicode, while the previous std::string logic was much stricter.  This
restores the previous std::string logic by only allowing actual unicode
or string types.

Fixes #685.

* Added missing 'requires numpy' decorator

(I forgot that the change to a global decorator here is in the
not-yet-merged Eigen PR)
@jagerman jagerman modified the milestone: v2.1 Mar 19, 2017
jagerman added a commit to jagerman/pybind11 that referenced this pull request Apr 26, 2017
The Unicode support added in 2.1 (PR pybind#624) inadvertently broke accepting
`bytes` as std::string/char* arguments.  This restores it with a
separate path that does a plain conversion (i.e. completely bypassing
all the encoding/decoding code), but only for single-byte string types.
jagerman added a commit to jagerman/pybind11 that referenced this pull request Apr 26, 2017
The Unicode support added in 2.1 (PR pybind#624) inadvertently broke accepting
`bytes` as std::string/char* arguments.  This restores it with a
separate path that does a plain conversion (i.e. completely bypassing
all the encoding/decoding code), but only for single-byte string types.
jagerman added a commit that referenced this pull request Apr 28, 2017
The Unicode support added in 2.1 (PR #624) inadvertently broke accepting
`bytes` as std::string/char* arguments.  This restores it with a
separate path that does a plain conversion (i.e. completely bypassing
all the encoding/decoding code), but only for single-byte string types.
@rwgk rwgk mentioned this pull request Feb 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants