Unicode fixes and docs #624

jagerman · 2017-01-30T04:45:25Z

As discussed in #591, we currently don't fail gracefully when encountering invalid unicode. This PR propagates the unicode error, while also adding support for std::u16string, std::u32string and the associated char16_t/char32_t types.

wjakob · 2017-02-01T10:15:53Z

include/pybind11/cast.h

+        std::is_same<CharT, char>::value     ? 8 :
+        std::is_same<CharT, char16_t>::value ? 16 :
+        std::is_same<CharT, char32_t>::value ? 32 :
+        (sizeof(CharT) == 2 ? 16 : 32); /* std::wstring is UTF-16 on Windows, UTF-32 everywhere else */


Could this assumption be removed using something like sizeof(wchar_t)?

UTF_N = sizeof(CharT)*8?

The char16_t and char32_t aren't guaranteed to be 2 and 4 bytes, only that they are at least 2/4 bytes (they are more like std::uint_least16_t). They are guaranteed to be unique types, however, hence checking the actual types rather than just the sizes.

That all makes sense then.

wjakob · 2017-02-01T10:18:03Z

include/pybind11/cast.h

-    }
+// Helper class for UTF-{8,16,32} strings:
+template <typename CharT, class Traits, class Allocator>
+struct type_caster<std::basic_string<CharT, Traits, Allocator>, enable_if_t<is_std_char_type<CharT>::value>> {


SFINAE potentially overkill: is there ever a situation where we could get a std::basic_string that uses a non-character type?

In theory you could create a std::basic_string<int>. We aren't likely to see it, but I figured better to be explicit in what we support here.

wjakob · 2017-02-01T10:20:24Z

include/pybind11/cast.h

+
+        object utfNbytes = reinterpret_steal<object>(PyUnicode_AsEncodedString(
+            load_src.ptr(),
+            UTF_N == 8 ? "utf8" : UTF_N == 16 ? "utf16" : "utf32",


could be moved out of the function call into a constexpr const char *

~~Won't it be treated that way anyway, given that the UTF_N is a constexpr?~~ Never mind, I see it's used more than once.

wjakob · 2017-02-01T10:21:05Z

include/pybind11/cast.h

+    static handle cast(const StringType &src, return_value_policy /* policy */, handle /* parent */) {
+        const char *buffer = reinterpret_cast<const char *>(src.c_str());
+        ssize_t nbytes = ssize_t(src.size() * sizeof(CharT));
+        handle s = PyUnicode_Decode(buffer, nbytes, UTF_N == 8 ? "utf8" : UTF_N == 16 ? "utf16" : "utf32", nullptr);


wjakob · 2017-02-01T10:22:34Z

This looks great -- I added a few minor comments.

jagerman · 2017-02-01T14:57:40Z

Rebased onto master (and squashed in the constexpr const char *encoding change).

jbarlow83 · 2017-02-02T01:49:48Z

include/pybind11/cast.h

-        return PyUnicode_FromWideChar(wstr, 1);
+    static handle cast(CharT src, return_value_policy policy, handle parent) {
+        if (std::is_same<char, CharT>::value) {
+            handle s = PyUnicode_DecodeLatin1((const char *) &src, 1, nullptr);


Why is DecodeLatin1 used here?

Because we have to do something with a single char in the 128-255 range, and since Latin1 codepoints are identical to unicode codepoints, this seemed the most logical choice.

That makes sense.

jbarlow83 · 2017-02-03T23:17:38Z

@jagerman Have you checked what happens in this PR if a Python str is cast to a char and the Unicode code point is U+0100 and above, or wchar16_t for U+01_0000 and above?

I found that in master this currently truncates the code point which is probably incorrect (and maybe endian unsafe too). Since you changed char casting perhaps your changes address this already? If not I will write up the issue.

jagerman · 2017-02-04T00:15:53Z

(Edit: ignore all these; see my later comment below).

For all of the char types, it decodes to the UTF-n encoding in native byte order, then returns the first char/char16_t/etc. element. (For UTF-16 and -32, the BOM that Python always adds to the encoded string is skipped).

For char you're right that this is going to be wrong (not only for U+100 and above, but also for U+80 and above): the first byte is going to be just the char value of the initial byte of a multibyte UTF-8 sequence. We should probably fix it to return the unicode codepoint for 0-255, and throw for anything above 255. (Though char is signed, so the 128-255 need to become the appropriate negative, i.e. via a static_cast<char>((unsigned char) value)). Moreover I think we should do this before we tell Python to encode as UTF-8.

For char16_t (and wchar_t on Windows) we should probably be throwing the same sort of error for >U+FFFF values. Endian issues aren't a concern because we're never converting to bytes, but just keeping everything in native 16-bit integers. (Or to put it another way, endianness is a concern for the std::u16string user, above the pybind11 level).

Now that I think about it, I'm not sure why we aren't also checking the src string length: I'd expect an exception if I tried to call a function taking a char with a string, but the current behaviour is to just take the first char value. (because, of course, this is the caster for a char * as well)

jbarlow83 · 2017-02-04T00:28:11Z

Okay, I will open a new issue to describe this behavior.

jagerman · 2017-02-04T00:34:24Z

No need, your comment here suffices.

jagerman · 2017-02-04T04:35:34Z

Actually, scratch all that I said above. The char caster needs to be the way it is because it gets invoked for char * values, too (and, in fact, this is its primary purpose). That's something of a limitation in how pybind deals with pointers, but I'm not sure it's a huge problem: accepting a single char is pretty rare in practice.

So I guess my question: is this merely a theoretical problem, or is this a problem you're running into in actual code?

jbarlow83 · 2017-02-04T22:34:09Z

It's not a problem I have, it's a corner case I discovered because @wjakob asked for my Unicode documentation to mention character literals.

I agree that modern C++11 code is unlikely to accept char literals but there's lots of old stuff out there. I found a practical example. I contribute to tesseract-ocr which was written in ancient C++ but is gradually being modernized. Take a quick look at its internal STRING class:

https://github.com/tesseract-ocr/tesseract/blob/d55f462c9c57b9fe5c3c146e126b50ed72090a3d/ccutil/strngs.h

The interface has a several functions like STRING::split() and operator+ that accept char literals. Even though this is ancient C++, if I were going to modernize this code, my strategy would be to wrap it with pybind11, write some test cases, modernize the implementation to wrap std::string internally, and then modernize the interface to phase it out.

We've established that the pybind11 behavior is incorrect in that it mangles UTF-8, but it is a corner case. I agree it may not even be worth fully fixing, given that char* is much more important.

Do you think it would be possible to have a static_assert that rejects functions with a char or 16-bit wide single character in their signature? 32-bit wide characters work fine, so functions that bind to Python could use that as the literal type and implement a little wrapper to convert in whatever way the caller thinks is appropriate.

jagerman · 2017-02-05T07:44:08Z

I just pushed a change to split up the char * and char casters into separate classes. This required going back to the pybind11 assumption that T and T* are always equivalent from a type caster point of view, amending it to T and T* are always equivalent except when T is a char/wchar_t/etc.

jagerman · 2017-02-05T07:56:58Z

Will fight with MSVC and figure out what to do with PyPy tomorrow.

jagerman · 2017-02-05T22:50:36Z

Pushed a rebased/squashed version with MSVC/PyPy fixes, and a cleanup to the pointer type_caster implementation.

jagerman · 2017-02-05T22:55:50Z

... and again to fix the conflict.

wjakob · 2017-02-08T07:50:57Z

include/pybind11/cast.h


-    static PYBIND11_DESCR name() { return type_descr(_(PYBIND11_STRING_NAME)); }
+    // chr()/unichr() isn't technically a type, but it should get the point across:
+    PYBIND11_TYPE_CASTER(CharT, _(PYBIND11_CHR_NAME "(") + _<(max <= 0xffff)>(_("<=") + _<max>(), _("")) + _(")"));


This seems like a quite strange type annotation (explicitly specifying the range). Perhaps just chr/unichr?

I agree that it's a bit strange, but I have a feeling it's going to be rather confusing to see an error about invalid function arguments with a unichar accepted, but only certain unichar actually accepted. (I guess this is more or less the same issue with #651).

I suppose, bigger picture, it would be nice to be able to override or supplement the call failure message so that a caster could return an argument failure with a reason (e.g. "Only char values <= 255 are accepted") and have that reason displayed in the exception (but, I suppose, only if the function isn't overloaded).

If you insist on having an informative type annotation, could we call it unichr16 or something like that? Or the type annotation could be more pythonic so that type signature parsers won't be thrown off (e.g. unichr[encoding='utf16'] or unichr[max=0xffff]), but I think that's less readable

I don't insist; the latest push changed it to just chr()/unichr()`. I'm not sure of a better choice here, since the type is inherently non-pythonic.

Re: unichar[encoding='utf16'] is a bit misleading: we don't care about the specific encoding. I propose we leave it without the range and just leave it up to the binder to mention it in the function description.

Are the brackets in chr() going to cause problems--i.e. should it be shortened to just chr?

Hmm. I'm sort of ambivalent about what to report here, but I suppose I'd marginally rank them as char[max=0xffff] > chr() > char16 > chr. (With "uni" prepended for python 2.7).

Round brackets will cause problems. Parameterized types use square brackets in type annotations.

(ok, please feel free to take your top pick then.)

dean0x7d · 2017-02-08T21:10:50Z

I'm a bit skeptical about the spun-off char caster. The implementation adds make_caster_type as a gatekeeper for all type_casters with a specialization and SFINAE which will need to be evaluated for every single caster, but it'll only ever be useful for a remote corner case. Python doesn't even know about single character types and any API which implements moving single characters across language boundaries is going to be slow (to put it mildly) and most likely bad design. So why not leave the simpler unified caster and just acknowledge the rare corner case in the docs?

jagerman · 2017-02-08T21:25:38Z

The main point is that the unified caster is broken in its current state for char values: it isn't valid to send the first byte of a UTF-8 string as a char value, but rather this should be a conversion failure (along the same lines as trying to pass -1 to a function taking an unsigned int). I think the alternative would need to be explicitly not allowing binding char arguments rather than silently truncating python unicode string input to a single utf-8 byte.

jagerman · 2017-02-08T21:32:35Z

The single-gatekeeper make_caster_type has separate potential advantage, though: it would let us change the define-your-own-caster behaviour to get it out of the pybind::detail namespace, e.g. with something like:

class MyCaster { ... };
DECLARE_CUSTOM_CASTER(MyType, MyCaster);

rather than having custom casters declare themselves in the pybind11::detail namespace.

(But of course, this doesn't necessarily mean we need the char caster support).

dean0x7d · 2017-02-08T21:45:28Z

The main point is that the unified caster is broken in its current state for char values

I get the issue, I'm just wondering if the expense is worth it for a pretty obscure case. The alternative being documenting it as a limitation which isn't even likely to come up.

The single-gatekeeper make_caster_type has separate potential advantage, though: it would let us change the define-your-own-caster behaviour to get it out of the pybind::detail namespace

Wouldn't just moving type_caster out of detail achieve the same effect?

jagerman · 2017-02-08T21:51:37Z

Wouldn't just moving type_caster out of detail achieve the same effect?

I don't see how it could be done without breaking backwards compatibility (which, I'm guessing, is the main reason its in detail in the first place). But even in pybind it would still have external code working in the pybind11 namespace.

But basically, this was just an afterthought: I didn't write it with that intention in the first place.

jbarlow83 · 2017-02-08T22:03:18Z

@dean0x7d I provided an example of it in use further. It was not difficult to find working C++ that accepts char as input; I imagine others can be found. Another example would be people experimenting with strchr and several other core libc functions. Sometimes experimental code finds its way into production. In one project I use pybind11 to access C++ code that is not available to Python, to save development time, not for performance. I'd rather use a pure Python library, but it doesn't exist. And because I want to work with an existing installed library, I can't make any changes to the C++ API.

I think refusing to compile a direct binding to fn(char c) is acceptable, but silently corrupting data at runtime is never acceptable.

dean0x7d · 2017-02-08T22:33:30Z

I think an informative error is quite doable:

// the unified char caster with potential silent failure for a single char
operator CharT&() { return value[0]; }

// unified caster with loud failure
operator CharT&() { 
    if (value.size() == 1)
        return value[0]; 
    else
        throw InformativeError();
}

wjakob · 2017-02-08T22:38:58Z

@dean0x7d: Oh wow, I'm really surprised that it wasn't like that before. That's definitely a very severe oversight.

I also agree with @dean0x7d's point that adding a layer of indirection to every type caster just to special-case char variants might not be worth it in the big picture (unless there are some other compelling reasons to do so)

If returning a std::string with invalid utf-8 data, we currently fail with an uninformative TypeError instead of propagating the UnicodeDecodeError that Python sets on failure.

This adds support for wchar{16,32}_t character literals and the associated std::u{16,32}string types. It also folds the character/string conversion into a single type_caster template, since the type casters for string and wstring were mostly the same anyway.

jagerman · 2017-02-09T19:15:53Z

I've reverted the caster changes, and did the loud failure during load-casting as @dean0x7d suggested. (Though a little more complex to distinguish between codepoint-too-large and too-many-character errors).

With this commit, when casting to a single character, as opposed to a C-style string, we make sure the input wasn't a multi-character string or a single character with codepoint too large for the character type. This also changes the character cast op to CharT instead of CharT& (we need to be able to return a temporary decoded char value, but also because there's little gained by bothering with an lvalue return here). Finally it changes the char caster to 'has-a-string-caster' instead of 'is-a-string-caster' because, with the cast_op change above, there's nothing at all gained from inheritance. This also lets us remove the `success` from the string caster (which was only there for the char caster) into the char caster itself. (I also renamed it to 'none' and inverted its value to better reflect its purpose). The None -> nullptr loading also now takes place only under a `convert = true` load pass. Although it's unlikely that a function taking a char also has overloads that can take a None, it seems marginally more correct to treat it as a conversion. This commit simplifies the size assumptions about character sizes with static_asserts to back them up.

Fixes clang's sign conversion warnings.

wjakob · 2017-02-14T10:07:59Z

This looks really great -- thank you for working out all the nitty gritty details regarding character casts.

The string conversion logic added in PR pybind#624 for all std::basic_strings was using the old std::wstring logic, but that was underused and turns out to have hade a bug in accepting almost anything convertible to the previous std::string logic by only accepting unicode or byte/string (Python 3/2) types. Fixes pybind#685.

The string conversion logic added in PR pybind#624 for all std::basic_strings was derived from the old std::wstring logic, but that was underused and turns out to have had a bug in accepting almost anything convertible to unicode, while the previous std::string logic was much stricter. This restores the previous std::string logic by only allowing actual unicode or string types. Fixes pybind#685.

* Make string conversion stricter The string conversion logic added in PR #624 for all std::basic_strings was derived from the old std::wstring logic, but that was underused and turns out to have had a bug in accepting almost anything convertible to unicode, while the previous std::string logic was much stricter. This restores the previous std::string logic by only allowing actual unicode or string types. Fixes #685. * Added missing 'requires numpy' decorator (I forgot that the change to a global decorator here is in the not-yet-merged Eigen PR)

The Unicode support added in 2.1 (PR pybind#624) inadvertently broke accepting `bytes` as std::string/char* arguments. This restores it with a separate path that does a plain conversion (i.e. completely bypassing all the encoding/decoding code), but only for single-byte string types.

The Unicode support added in 2.1 (PR #624) inadvertently broke accepting `bytes` as std::string/char* arguments. This restores it with a separate path that does a plain conversion (i.e. completely bypassing all the encoding/decoding code), but only for single-byte string types.

jagerman mentioned this pull request Jan 30, 2017

Documentation request: Unicode conversions page #591

Closed

wjakob reviewed Feb 1, 2017

View reviewed changes

jagerman force-pushed the unicode-fixes-and-docs branch from df6e1e8 to a2f41b4 Compare February 1, 2017 14:57

jagerman force-pushed the unicode-fixes-and-docs branch from a2f41b4 to 6fe53d2 Compare February 1, 2017 21:02

jbarlow83 reviewed Feb 2, 2017

View reviewed changes

jagerman force-pushed the unicode-fixes-and-docs branch 3 times, most recently from a0a98d4 to a79b5f6 Compare February 5, 2017 07:54

jagerman force-pushed the unicode-fixes-and-docs branch from a79b5f6 to 49625f4 Compare February 5, 2017 22:48

jagerman force-pushed the unicode-fixes-and-docs branch from 49625f4 to 31c4c78 Compare February 5, 2017 22:55

wjakob reviewed Feb 8, 2017

View reviewed changes

jagerman force-pushed the unicode-fixes-and-docs branch from fb174cf to d1964f9 Compare February 9, 2017 19:02

jagerman added 2 commits February 9, 2017 14:03

Propagate unicode conversion failure

d8c7dd7

If returning a std::string with invalid utf-8 data, we currently fail with an uninformative TypeError instead of propagating the UnicodeDecodeError that Python sets on failure.

jagerman force-pushed the unicode-fixes-and-docs branch from d1964f9 to 24df2d9 Compare February 9, 2017 19:14

jagerman force-pushed the unicode-fixes-and-docs branch from 24df2d9 to 86ca73f Compare February 9, 2017 19:19

jagerman added 2 commits February 9, 2017 16:45

flake8 style fix

09147cb

clang warning fixes

f076666

Fixes clang's sign conversion warnings.

wjakob merged commit 11a337f into pybind:master Feb 14, 2017

jagerman mentioned this pull request Feb 16, 2017

Question about cast std::wstring with Python3 #674

Closed

jagerman mentioned this pull request Feb 23, 2017

Make string conversion stricter #695

Merged

jagerman modified the milestone: v2.1 Mar 19, 2017

jagerman mentioned this pull request Apr 26, 2017

Fix Python 3 bytes conversion to std::string/char* #817

Merged

jagerman mentioned this pull request May 8, 2017

variant_caster cant handle c++ type nullptr_t #839

Closed

rwgk mentioned this pull request Feb 9, 2023

FWD pybind11 google/pybind11k#624

Closed

Unicode fixes and docs #624

Unicode fixes and docs #624

Conversation

jagerman commented Jan 30, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jagerman Feb 1, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wjakob commented Feb 1, 2017

jagerman commented Feb 1, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jbarlow83 commented Feb 3, 2017 • edited Loading

jagerman commented Feb 4, 2017 • edited Loading

jbarlow83 commented Feb 4, 2017

jagerman commented Feb 4, 2017

jagerman commented Feb 4, 2017

jbarlow83 commented Feb 4, 2017

jagerman commented Feb 5, 2017

jagerman commented Feb 5, 2017

jagerman commented Feb 5, 2017

jagerman commented Feb 5, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wjakob Feb 8, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dean0x7d commented Feb 8, 2017 • edited Loading

jagerman commented Feb 8, 2017

jagerman commented Feb 8, 2017 • edited Loading

dean0x7d commented Feb 8, 2017

jagerman commented Feb 8, 2017

jbarlow83 commented Feb 8, 2017

dean0x7d commented Feb 8, 2017

wjakob commented Feb 8, 2017

jagerman commented Feb 9, 2017

wjakob commented Feb 14, 2017

jagerman Feb 1, 2017 •

edited

Loading

jbarlow83 commented Feb 3, 2017 •

edited

Loading

jagerman commented Feb 4, 2017 •

edited

Loading

jagerman commented Feb 5, 2017 •

edited

Loading

wjakob Feb 8, 2017 •

edited

Loading

dean0x7d commented Feb 8, 2017 •

edited

Loading

jagerman commented Feb 8, 2017 •

edited

Loading