Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A correct codecvt facet that works with basic_filebuf can't do UTF conversions #33

Open
BillyONeal opened this issue Jul 24, 2018 · 8 comments
Labels
bug Something isn't working help wanted Extra attention is needed paper needed A paper proposing a specific solution is needed

Comments

@BillyONeal
Copy link

UTF-8 <-> UTF-16 is prohibited by http://eel.is/c++draft/locale.codecvt#virtuals-3, since 4 UTF-8 encoding units can produce 2 UTF-16 encoding units.

Unfortunately it is not so easy to fix as we would need the counterpart to do_unshift on the "wide" side of the interface which would be massively ABI breaking.

@cubbimew
Copy link

cubbimew commented Jul 24, 2018

I've seen opensource use of std::codecvt_utf8_utf16 with file streams -- maybe the authors are unaware that their code is non-portable, since it works with libstdc++ (fails with libc++ and MSVC). Here's a demo: http://coliru.stacked-crooked.com/a/dd397ba886d9bbff

Can't the standard acknowledge that an implementation is allowed to do the right thing with "otherwise, the behavior is implementation-defined"? Or, perhaps introduce std::u16stream and std::u16filebuf that aren't tied to this restriction?

@tahonermann
Copy link
Member

I've been aware of this limitation for some time, but I've never researched why the limitation exists. Is there something fundamental about std::basic_filebufthat doesn't work for MxN transcoders? I assume not since libstdc++ apparently works as expected with such transcoders. Perhaps the limitation exists in the standard simply because the desired behavior doesn't match (some) existing implementations?

Regardless, my answer for this is to deprecate std::codecvt in favor of some yet-to-be-determined replacement. My inclination is towards something like text_view, though that doesn't fully suffice as a replacement since text_view doesn't support writeable views (it supports read-only views and write-only iterators), nor can a stream be imbued with such a view (I'm so far unconvinced that a replacement for imbue() functionality is needed or desirable).

@BillyONeal
Copy link
Author

When people say "libstdc++ apparently works" have people actually verified it does the right thing when libstdc++ needs to cross a buffer boundary, or needs to seek in the file, etc.? I think these edge cases are solvable but unless they go out of their way to document that they support it I would be cautious. Triggering a surrogate pair split right on a buffer boundary, for example, needs to be handled correctly for correct behavior but is, at the very least, tricky.

text_view looks interesting, but it looks like it operates on a codepoint-by-codepoint basis rather than through bulk submission, so I'd expect performance to be poor. High quality UTF encoders use techniques to process more than one encoding unit at a time (e.g. most UTF-8 texts have long strings of 7 bit ASCII in them, and a vectorized version of that to UTF-16 is easy-ish to write). The paper claims codecvt is expensive due to the virtual function call, but so long as the virtual call is done on whole buffers rather than on single codepoints that virtual call should be noise; as UTF decoding itself is a fairly expensive operation.

Honestly, I think the ideal interface looks very much like codecvt with a better way of reporting errors and removal of the 1:N restriction. Unfortunately lifting the 1:N restriction is an ABI break for MSVC++ because we store only a single wchar_t inside the basic_filebuf when a codecvt facet is engaged :(

@tahonermann
Copy link
Member

When people say "libstdc++ apparently works" have people actually verified it does the right thing when libstdc++ needs to cross a buffer boundary, or needs to seek in the file, etc.? I think these edge cases are solvable but unless they go out of their way to document that they support it I would be cautious. Triggering a surrogate pair split right on a buffer boundary, for example, needs to be handled correctly for correct behavior but is, at the very least, tricky.

Good question, @jwakely, any comments on this?

text_view looks interesting, but it looks like it operates on a codepoint-by-codepoint basis rather than through bulk submission, so I'd expect performance to be poor.

This is definitely a concern. I've long intended to provide transcoding interfaces that would enable optimizations like you described when invoked with a contiguous range (or segmented contiguous ranges). Alas, I've made no progress on that front.

@BillyONeal
Copy link
Author

Too bad it's easier for people like me to just complain about the problem rather than actually solve it, right? :P Here are some benchmarks if it helps https://gist.github.com/BillyONeal/72dcde394758d4f9d82324774b8107e4

@tahonermann
Copy link
Member

Too bad it's easier for people like me to just complain about the problem rather than actually solve it, right? :P

I think we all play that role more often than not ;)

Here are some benchmarks if it helps https://gist.github.com/BillyONeal/72dcde394758d4f9d82324774b8107e4

Ohhh, neat, thanks! I'm assuming you're familiar with Bob Steagall's recent work?

@jwakely
Copy link

jwakely commented Jul 25, 2018

If libstdc++ gets it right it's not my doing, and I can't really comment on that code. It's indistinguishable from magic as far as I'm concerned.

@BillyONeal
Copy link
Author

Ohhh, neat, thanks! I'm assuming you're familiar with Bob Steagall's recent work?

I've seen the talk but not attempted implementing his algorithm yet. I wrote these a long time ago when I was trying to replace the guts of our codecvt facets for ABI-broken-world, but ABI-broken-world seems to only get further way so I had to stop that work for some time.

@tahonermann tahonermann added help wanted Extra attention is needed paper needed A paper proposing a specific solution is needed bug Something isn't working labels Aug 6, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed paper needed A paper proposing a specific solution is needed
Development

No branches or pull requests

4 participants