scalar_value_sequence[_view] #16
Comments
It is true that it does not define what a character is. It defines "abstract character", for example. However, the vast majority of the uses of the word "character" in the Unicode Standard refer to concepts similar to code points, not similar to grapheme clusters. I think the only time the Unicode Standard uses the word to mean something more like a grapheme cluster than a code point, it does so with scare quotes ('“character”', or 'end-user “character”'), and those uses are restricted to discussions contrasting code point-like concepts with grapheme-like concepts. I would just change the specification to say "as a sequence of code points". |
But why would we then define that the One True Way to iterate Unicode text is code points? We are repeating the same mistake Remember that algorithms usually use just |
This is a different discussion, though. (And it's one that we will need to have when we start seriously working on this). "code point" is the intended meaning of the proposal as is, so this change would fix the ambiguity of terms without changing the intended meaning. |
Even if the paper would use the term "code point", the name of the class is ambiguous. And iterating code points is not that common actually. For file and network I/O you need the number of code units. For rendering you need the number of grapheme clusters. |
Ok, that's fair enough. Small correction, though: the number of grapheme clusters is actually irrelevant for rendering. The number of glyphs may be useful, but that doesn't really map to grapheme clusters and is information that cannot be retrieved without a font at hand. Note that grapheme clusters aren't even useful to separate a string before doing glyph lookup, as glyphs may span grapheme cluster boundaries. Grapheme clusters are useful for user interaction, like selection, using the left/right arrows, or using the Delete key (but not Backspace! for that you are more likely to want code points) |
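To make the levels mentioned above concrete, here is a minimal sketch that counts UTF-8 code units and code points for the same text. The function names `count_code_units` and `count_code_points` are made up for illustration and are not part of text_view or any proposal; the code point count assumes well-formed UTF-8 input. Counting grapheme clusters is deliberately left out, since it needs the Unicode segmentation rules and data tables.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: the number of UTF-8 code units is just the byte count.
constexpr std::size_t count_code_units(std::string_view utf8) {
    return utf8.size();
}

// Hypothetical helper: the number of code points, assuming well-formed UTF-8.
// Each code point contributes exactly one byte that is not a
// continuation byte (continuation bytes have the form 10xxxxxx).
constexpr std::size_t count_code_points(std::string_view utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For "e" followed by U+0301 COMBINING ACUTE ACCENT (the bytes `"e\xCC\x81"`), this gives 3 code units and 2 code points, while a user would see a single grapheme cluster.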
text_view has been due for some updates for a long time now; I just haven't had time to get to it. We've acknowledged that there are use cases for all of code point, (extended) grapheme cluster, word, sentence, etc... enumeration. I'm quite sure we'll end up providing views for code points and (extended) grapheme clusters. What names will be proposed for those is TBD. I think there is a lot of support for types named |
But if the user doesn't understand the distinction, we will be left with buggy code. The current design makes it easy to use the API incorrectly. |
I agree. I think the current consensus is that we'll want to provide grapheme clusters as the default "character" that users work with, but provide access to the code point (and code unit) sequence in other ways. @tzlaine's Boost.Text work prototypes this approach. |
I'd say |
I appreciate the clarity of intent in providing such an interface, but in practice, I think it would make for a cumbersome type to use. Programmers that aren't experts in Unicode don't want to have to worry about these distinctions; and in fact, worrying about these could be a distraction from what they are actually trying to get done. I also worry about what a I believe, that to reach most programmers, we need to provide simple types that, for most purposes, just do the right thing by default, but expose the underlying data as needed for experts. Think of a need to search some text for a particular "character". Let's say the character to match is a member of the basic source character set, 'X' for example. If the programmer has to be aware that 'X' can have combining code points and that the grapheme cluster interfaces must therefore be used unless matching a base character with combining character(s) is desired, then we've already lost. We need to ensure that the result of something like Within SG16, consensus has been moving towards making |
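The 'X' search concern can be sketched roughly as follows. This is only an illustration under loud assumptions: `combining_mark_at` and `find_standalone` are hypothetical names, the input is assumed to be well-formed UTF-8, and "combining" is approximated by the single block U+0300..U+036F rather than by real grapheme cluster segmentation (UAX #29).

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: true if a code point in U+0300..U+036F
// (Combining Diacritical Marks) starts at byte offset `pos`.
// In UTF-8 that block is encoded as CC 80..CC BF and CD 80..CD AF.
// Simplification: real grapheme segmentation covers far more than this block.
inline bool combining_mark_at(std::string_view utf8, std::size_t pos) {
    if (pos + 1 >= utf8.size()) {
        return false;
    }
    const unsigned char b0 = static_cast<unsigned char>(utf8[pos]);
    const unsigned char b1 = static_cast<unsigned char>(utf8[pos + 1]);
    return (b0 == 0xCC && b1 >= 0x80 && b1 <= 0xBF) ||
           (b0 == 0xCD && b1 >= 0x80 && b1 <= 0xAF);
}

// Hypothetical helper: find the ASCII character `ch` only where it is not
// immediately followed by a combining mark from the block above.
inline std::size_t find_standalone(std::string_view utf8, char ch) {
    for (std::size_t pos = utf8.find(ch); pos != std::string_view::npos;
         pos = utf8.find(ch, pos + 1)) {
        if (!combining_mark_at(utf8, pos + 1)) {
            return pos;
        }
    }
    return std::string_view::npos;
}
```

With the text `"X\xCC\x8C and X"` (an X followed by U+030C COMBINING CARON, then a plain X), a plain `find('X')` reports offset 0, inside the accented character, whereas `find_standalone` reports offset 8. A default that iterates grapheme clusters would give the second behavior without the programmer having to think about it.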
Yes, iterating grapheme clusters would be the least surprising behavior to novice programmers. This would be the rare occurrence of a string type not being broken.
Converting to the lower level is trivial, while converting to the upper level is not. We will need helper functions to do this.
Consider std algorithms such as
No, the programmer lost. I don't want to hide bugs until a later time. Look at what raw pointers and basic_string have done: an infinite number of bugs that cost an insane amount of money and manpower to maintain. Again, yes, if I've implemented CodePointSequence for my purposes and, after looking at Boost.Text and the text_view paper, I think this design is the most promising:
template <TextEncoding ET, std::endian Endianness = std::endian::native, typename Allocator = std::allocator<std::byte>>
class code_unit_sequence;
template <TextEncoding ET, std::endian Endianness = std::endian::native>
class code_unit_sequence_view;
template <typename T>
concept bool CodeUnitSequence();
template <typename T>
concept bool CodeUnitSequenceView();
template <CodeUnitSequence Container, TextEncoding ET = default_encoding_type_t<Container>>
class code_point_sequence;
template <CodeUnitSequenceView VT, TextEncoding ET = default_encoding_type_t<VT>>
class code_point_sequence_view;
I think having separate big-endian and little-endian encodings is not useful. Endianness matters only at the byte level, so there should be class templates that handle it.
|
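As a rough illustration of the point that endianness only matters at the byte level, the sketch below serializes UTF-16 code units to bytes in a chosen byte order. Everything here is an assumption made for the example: `byte_order` is a local stand-in for `std::endian` (so the sketch also compiles before C++20), and `to_bytes` is a hypothetical function, not part of the design quoted above.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Local stand-in for std::endian, so that the sketch does not require C++20.
enum class byte_order { little, big };

// Hypothetical helper: the code units themselves are plain 16-bit values;
// byte order only appears once they are written out as bytes.
inline std::vector<std::byte> to_bytes(std::u16string_view units, byte_order order) {
    std::vector<std::byte> out;
    out.reserve(units.size() * 2);
    for (char16_t unit : units) {
        const auto hi = static_cast<std::byte>(unit >> 8);
        const auto lo = static_cast<std::byte>(unit & 0xFF);
        if (order == byte_order::big) {
            out.push_back(hi);
            out.push_back(lo);
        } else {
            out.push_back(lo);
            out.push_back(hi);
        }
    }
    return out;
}
```

The same code unit sequence yields different byte sequences depending on `order`, while code point iteration on top of the code units would not need to know about byte order at all.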
Funnily enough, Swift does this, and their string type is currently broken because of it https://bugs.swift.org/browse/SR-375. |
I think that a string type that has .characters.count is fundamentally broken. Also, I wanted to say that I would like to implement code_unit_sequence and code_point_sequence and produce a formal paper. I just want a blessing. |
Well, that's just the most trivial way of demonstrating how it's broken. Iterating over Swift strings also produces similarly broken results.
Come join us on Slack https://cpplang.slack.com/messages/sg16-unicode, or the mailing list http://www.open-std.org/mailman/listinfo/unicode, or even join our next teleconference http://www.open-std.org/pipermail/unicode/2018-June/000037.html |
@Lyberta No blessing is necessary of course. We (SG16) have been wrestling with this question for a while now, but haven't made any decisions one way or another. The next pre-meeting mailing is quite some time away and I do plan on scheduling time to discuss code points vs grapheme clusters at our meetings in the not too distant future. So, I'll echo what Martinho said; join us on Slack, the mailing list, and our telecons (invite info is on the mailing list and I can send you an invite on request if you like). You'll get more immediate feedback and be better able to contribute to our direction than by writing a paper (at least in the short term). I do think we'll want to write a paper on this subject at some point, but I think it would be great if it were collaboratively developed within SG16 with a goal of presenting an agreed upon approach with pros/cons to the rest of the committee. |
I'm closing this issue as non-actionable since there does not appear to be consensus for a particular direction. The concerns raised will need to be addressed as part of #31. Anyone wishing to propose a specific solution is encouraged to open a new issue or to submit a paper. |
I've read the text_view proposal and I think it uses very ambiguous terminology such as:
As far as I know, the Unicode Standard never defines what a character is. The closest term for a character is grapheme cluster.
Second, there are many ways to iterate over Unicode data such as:
Yet text_view gives only a single pair of begin and end functions. I think we should standardize code_point_sequence_view because it has an unambiguous name. After that we can standardize grapheme_cluster_sequence_view and higher level stuff.