Join GitHub today
GitHub is home to over 31 million developers working together to host and review code, manage projects, and build software together.
Sign upUnicodeStrSlice: add width() and graphemes() methods #15619
Conversation
bluss
reviewed
Jul 11, 2014
| #[inline] | ||
| fn words(&self) -> Words<'a> { | ||
| self.split(u_char::is_whitespace).filter(|s| !s.is_empty()) | ||
| } |
This comment has been minimized.
This comment has been minimized.
bluss
Jul 11, 2014
Contributor
Just as an aside, we note that this is not an UAX #29 algorithm for word boundaries
This comment has been minimized.
This comment has been minimized.
kwantam
Jul 11, 2014
Author
Contributor
Good point. I'll add that to my list of things to think about.
This comment has been minimized.
This comment has been minimized.
lilyball
Jul 11, 2014
Contributor
Offhand, I'm inclined to say that either this should be a UAX #29 algorithm for word boundaries, or it should be removed entirely. It's easy enough for users to write that same words() function themselves if they need it, and the existence of this function precludes adding the UAX #29 version (at least, without avoiding confusion down the road).
This comment has been minimized.
This comment has been minimized.
kwantam
Jul 12, 2014
Author
Contributor
That's a good point.
On the other hand, the UAX#29 word breaking algorithm is a bit meaty, and often people don't need something that complicated. Maybe we should leave words() as is, and have a function with a different name that implements the UAX#29 algorithm? I don't want people to use words() and end up having a lot more complication than necessary.
Maybe this deserves its own issue?
This comment has been minimized.
This comment has been minimized.
lilyball
Jul 12, 2014
Contributor
I can certainly understand why you don't want to implement the UAX#29 algorithm right now, and so I think you should just drop words() entirely. It's easy for someone else to write the whitespace-splitting version if they decide that's what they want, but I don't think it's appropriate to call that "words" within the context of a unicode-focused library.
This comment has been minimized.
This comment has been minimized.
bluss
Jul 12, 2014
Contributor
words() is a separate issue. It was just moved in this commit, but it should be addressed (renamed or removed).
This comment has been minimized.
This comment has been minimized.
huonw
Jul 12, 2014
Member
I opened #15628, but this behaviour of &str.words has been around for a while, so can be addressed later.
bluss
reviewed
Jul 11, 2014
| /// If `is_extended` is true, the iterator is over the *extended grapheme clusters*; | ||
| /// otherwise, the iterator is over the *legacy grapheme clusters*. | ||
| /// [UAX#28](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) | ||
| /// recommends extended grapheme cluster boundaries for general processing. |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
Using |
huonw
reviewed
Jul 12, 2014
| /// let v: Vec<&str> = some_words.words().collect(); | ||
| /// assert_eq!(v, vec!["Mary", "had", "a", "little", "lamb"]); | ||
| /// ``` | ||
| fn words(&self) -> Words<'a>; |
This comment has been minimized.
This comment has been minimized.
huonw
Jul 12, 2014
Member
Is there any particular reason that these functions are moving around? (Would it be possible to leave them where they were, to minimise the noise in the diff?)
This comment has been minimized.
This comment has been minimized.
kwantam
Jul 12, 2014
Author
Contributor
It would be. I just wanted to sort the functions in a more sensible order, because it seemed like they were pretty haphazardly thrown together. (...which is my fault; I did put them there, after all!)
I'll swap them back to their original order.
huonw
reviewed
Jul 12, 2014
| /// [Unicode Standard Annex #11](http://www.unicode.org/reports/tr11/) | ||
| /// recommends that these characters be treated as 1 column (i.e., | ||
| /// `is_cjk` = `false`) if the locale is unknown. | ||
| //fn width(&self, is_cjk: bool) -> uint; |
This comment has been minimized.
This comment has been minimized.
huonw
Jul 12, 2014
Member
(FWIW, commenting this out, but leaving the doccomment above, caused the docs to leak into the trim docs; all fixed now, though.)
This comment has been minimized.
This comment has been minimized.
kwantam
Jul 12, 2014
Author
Contributor
Yes, I noticed this. I'd intended to implement the width() function in the initial PR for libunicode, but apparently it slipped my mind. :\
huonw
reviewed
Jul 12, 2014
| /// assert!("".is_whitespace()); | ||
| /// | ||
| /// assert!( !"abc".is_whitespace()); | ||
| /// assert!("a\r\nb".graphemes(true).count() == 3); |
This comment has been minimized.
This comment has been minimized.
huonw
reviewed
Jul 12, 2014
| /// | ||
| /// assert!( !"abc".is_whitespace()); | ||
| /// assert!("a\r\nb".graphemes(true).count() == 3); | ||
| /// assert!("\U0001f1f7\U0001f1fa\U0001f1f8".graphemes(true).count() == 1); |
This comment has been minimized.
This comment has been minimized.
huonw
Jul 12, 2014
Member
Maybe there could be a more... visible example here (either replacing this one, or in addition to this one, your choice), e.g.
let s = "a\u0310e\u0301o\u0308\u0332"; // åéö̲
let graphemes = s.graphemes(true).collect::<Vec<&str>>();
assert_eq!(graphemes.as_slice(), &["a\u0310", "e\u0301", "o\u0308\u0332"]);
huonw
reviewed
Jul 12, 2014
| ]; | ||
|
|
||
| for &(s, l) in testdata.iter() { | ||
| assert!(l == s.graphemes(true).count()); |
This comment has been minimized.
This comment has been minimized.
huonw
reviewed
Jul 12, 2014
| @@ -2070,6 +2079,138 @@ String::from_str("\u1111\u1171\u11b6")); | |||
| } | |||
|
|
|||
| #[test] | |||
| fn test_graphemes() { | |||
This comment has been minimized.
This comment has been minimized.
huonw
Jul 12, 2014
Member
Is it feasible for this to be testing that the computed graphemes are correct? And can this test .graphemes(false) too?
This comment has been minimized.
This comment has been minimized.
kwantam
Jul 12, 2014
Author
Contributor
Good idea. I'll do that.
The Unicode provided tests are either for extended mode or extension agnostic. I'll come up with some alternative tests for legacy mode.
This comment has been minimized.
This comment has been minimized.
kwantam
Jul 13, 2014
Author
Contributor
Added tests for extended and legacy mode, and now testing that the graphemes are correct.
huonw
reviewed
Jul 12, 2014
| } | ||
|
|
||
| self.cat = if take_next { | ||
| let CharRange { ch: _, next } = self.string.char_range_at(idx); |
This comment has been minimized.
This comment has been minimized.
huonw
reviewed
Jul 12, 2014
| idx += 1; // rule GB3 | ||
| } | ||
|
|
||
| // we cheat by precomputing the next idx above, so we have to |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
kwantam
Jul 12, 2014
Author
Contributor
Yes, but this is a slight optimization for the \r and \r\n cases, since we know both characters are 1 byte and we have to special case it anyway.
This comment has been minimized.
This comment has been minimized.
huonw
reviewed
Jul 12, 2014
| continue; | ||
| } | ||
|
|
||
| match state { |
This comment has been minimized.
This comment has been minimized.
huonw
Jul 12, 2014
Member
This whole match could be
state = match state {
Start if ch == '\r' => { /* handle special case */ }
Start => match cat {
gr::GC_Control => break,
// ...
};rather than having state = in each branch. Also, theoretically, you could write it as a match on (state, cat), something like
state = match (state, ch) {
(Start, _) if ch == '\r' => { .. }
(Start, gr::GC_Control) => break,
// ...
(FindExtend, _) => { take_next = false; break; }
(HangulL, gr::GR_L) => continue,
// ...
}although repeating the state element might get annoying (this would allow you to fold identical arms together, e.g. have a trailing arm that's (FindExtend, _) | (HangulL, _) | (HangulLV, _) | ... => { take_next = false; break }.
This comment has been minimized.
This comment has been minimized.
kwantam
Jul 12, 2014
Author
Contributor
I think stylistically I prefer the first to the second, but I'll take a closer look at both.
This comment has been minimized.
This comment has been minimized.
huonw
Jul 13, 2014
Member
Yes, you should choose as you wish, the second looks like it might make the code harder to understand.
huonw
reviewed
Jul 12, 2014
| break; // rule GB4 | ||
| } | ||
| state = match cat { | ||
| gr::GC_Control => break, |
This comment has been minimized.
This comment has been minimized.
huonw
Jul 12, 2014
Member
Is this the only circumstance in which take_next is true? However, if I'm reading this correctly, it seems that this means that the first grapheme in \nFoo... is \nF (i.e. it's taking the next character after idx?).
This comment has been minimized.
This comment has been minimized.
huonw
Jul 12, 2014
Member
(If this is the only circumstance in which take_next is true, maybe false would be a better default, or maybe it could be removed altogether, although this may not be worth it?)
This comment has been minimized.
This comment has been minimized.
kwantam
Jul 12, 2014
Author
Contributor
take_next should also be true when the char_indices() iterator returns None. That's the reason for the default.
This comment has been minimized.
This comment has been minimized.
huonw
Jul 13, 2014
Member
Ah, ok, I think I understand my confusion about \nFoo: the take_next variable really means "take current", i.e. whether to include the current codepoint ch in the grapheme or not. Could you rename it to take_current? (Similarly the next in for (next, ch) in ... { idx = next; ... is actually the offset of the current codepoint ch, not the next one, could you adjust that to curr or current or whatever too?)
huonw
reviewed
Jul 12, 2014
| @@ -350,6 +385,43 @@ def emit_conversions_module(f, lowerupper, upperlower): | |||
| sorted(lowerupper.iteritems(), key=operator.itemgetter(0)), is_pub=False) | |||
| f.write("}\n\n") | |||
|
|
|||
| def emit_grapheme_module(f, grapheme_table, grapheme_cats): | |||
| f.write("pub mod grapheme {\n") | |||
This comment has been minimized.
This comment has been minimized.
huonw
Jul 12, 2014
Member
You can use a multiline string here.
f.write("""pub mod grapheme
use core::option::{Some, None};
...
fn bsearch_range_value_table(...
...
#[allow(non_camel_case_types)]
#[deriving(Clone)]
pub enum GraphemeCat {
""")
for cat in ...
# ...
f.write(" }\n\n")
huonw
reviewed
Jul 12, 2014
| # Regional Indicator | ||
| grapheme_cats["RegionalIndicator"] = grapheme_regional_indicator | ||
|
|
||
| # Prepend - "Curerently there are no characters with this value" |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
kwantam
Jul 12, 2014
Author
Contributor
Ah, good catch.
I should attribute the quote; it's from UAX29. Just wanted to note it lest someone think I've forgotten the Prepend case when looking at the code in the future.
huonw
reviewed
Jul 12, 2014
| @@ -155,6 +185,11 @@ def ungroup_cat(cat): | |||
| lo += 1 | |||
| return cat_out | |||
|
|
|||
| def gen_unassigned(assigned): | |||
| all_set = set(ungroup_cat([(0, 0xd7ff), (0xe000, 0x10ffff)])) | |||
This comment has been minimized.
This comment has been minimized.
huonw
Jul 12, 2014
Member
Is this really the most efficient way to do this? e.g.
assigned = set(assigned)
return [i for i in range(0, 0xd800) if i not in assigned] + [i for i in range(0xe000, 0x10ffff + 1) if i not in assigned]might be better.
This comment has been minimized.
This comment has been minimized.
kwantam
Jul 12, 2014
Author
Contributor
I wasn't really worrying about efficiency for this script since it's run once in a blue moon, but it's true that yours is more reasonable.
huonw
reviewed
Jul 12, 2014
| @@ -167,7 +202,7 @@ def format_table_content(f, content, indent): | |||
| line = " "*indent | |||
| first = True | |||
| for chunk in content.split(","): | |||
| if len(line) + len(chunk) < 98: | |||
| if len(line) + len(chunk) < 96: | |||
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
huonw
Jul 12, 2014
Member
Ok, I checked locally: reverting this change reduces the diffstat to
5 files changed, 1335 insertions(+), 58 deletions(-)
...
src/libunicode/tables.rs | 885 +++++++++++++++++++++++++++++++++++++++++++++-
rather than
5 files changed, 4104 insertions(+), 2767 deletions(-)
...
src/libunicode/tables.rs | 6363 ++++++++++++++++++++++++++-------------------
as it is now. Please revert it, or at least do it in its own separate commit, to keep things as clean as possible.
This comment has been minimized.
This comment has been minimized.
kwantam
Jul 12, 2014
Author
Contributor
Ah yes, meant to revert that, was playing with it for something else. Good catch.
huonw
reviewed
Jul 12, 2014
|
|
||
| # Control | ||
| grapheme_cats["Control"] = set() | ||
| for cat in ["Zl", "Zp", "Cc", "Cf"]: |
This comment has been minimized.
This comment has been minimized.
huonw
Jul 12, 2014
Member
Could you note that Cs is unnecessary since Rust's chars (and strings) hold scalar values only (i.e. no surrogates can be encountered ever)?
huonw
reviewed
Jul 12, 2014
|
|
||
| # SpacingMark | ||
| grapheme_cats["SpacingMark"] = group_cat(list( | ||
| set(ungroup_cat(gencats["Mc"])).difference(ungroup_cat(grapheme_cats["Extend"])) |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
kwantam
Jul 12, 2014
Author
Contributor
Converted them to operator form and forgot this one. Good catch.
This comment has been minimized.
This comment has been minimized.
|
It seems that |
Kimundi
reviewed
Jul 12, 2014
| } | ||
|
|
||
| impl<'a> UnicodeStrSlice<'a> for &'a str { | ||
| #[inline] | ||
| fn words(&self) -> Words<'a> { | ||
| self.split(u_char::is_whitespace).filter(|s| !s.is_empty()) | ||
| fn graphemes(&self, is_extended: bool) -> Graphemes<'a> { |
This comment has been minimized.
This comment has been minimized.
Kimundi
Jul 12, 2014
Member
I think it would make sense to provide grapheme_indices() as well, to complement char_indices()
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
huonw
reviewed
Jul 13, 2014
| // We do this because most of the time we would end up | ||
| // looking up each character twice. | ||
| cat = match self.cat { | ||
| None => grapheme_category(ch), |
This comment has been minimized.
This comment has been minimized.
huonw
Jul 13, 2014
Member
Functions are generally used module-qualified, i.e. grapheme::grapheme_category (or gr::grapheme_category(...)).
This comment has been minimized.
This comment has been minimized.
|
Now that I think about it, probably it would also be nice to have a |
This comment has been minimized.
This comment has been minimized.
|
OK, we have Is there a nicer way of reversing a static slice than |
This comment has been minimized.
This comment has been minimized.
|
You can just test equality of two iterators using |
This comment has been minimized.
This comment has been minimized.
|
|
huonw
reviewed
Jul 14, 2014
| /// # Example | ||
| /// | ||
| /// ```rust | ||
| /// let gr1 = "a̐éö̲".graphemes(true).collect::<Vec<&str>>(); |
This comment has been minimized.
This comment has been minimized.
huonw
Jul 14, 2014
Member
I think using explicit code-point notation (as I used previously) may be better here, because the interesting thing with this iterator is how it groups multiple codepoints into a single entity.
When written in short form, it's not clear that these aren't individual codepoints.
huonw
reviewed
Jul 14, 2014
| /// External iterator for grapheme clusters and byte offsets. | ||
| #[deriving(Clone)] | ||
| pub struct GraphemeIndices<'a> { | ||
| string: &'a str, |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
kwantam
Jul 14, 2014
Author
Contributor
Ah, yes. Good point.
(CharOffsets in core::str can do the same)
huonw
reviewed
Jul 14, 2014
| _ => self.catb.take_unwrap() | ||
| }; | ||
|
|
||
| state = match state { |
This comment has been minimized.
This comment has been minimized.
huonw
Jul 14, 2014
Member
To be clear, this table is effectively running http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules in reverse, right?
This comment has been minimized.
This comment has been minimized.
kwantam
Jul 14, 2014
Author
Contributor
Yes. It implements the same splits, but running from the end of the string rather than the front.
huonw
reviewed
Jul 14, 2014
| } | ||
| }, | ||
| HangulLV => match cat { | ||
| gr::GC_V => continue, |
This comment has been minimized.
This comment has been minimized.
huonw
Jul 14, 2014
Member
Is this correct? I don't see a V × LV grapheme production in the table? (And similarly, for the next line, I don't see a LV × LV one?)
This comment has been minimized.
This comment has been minimized.
kwantam
Jul 14, 2014
Author
Contributor
The semantics of the Hangul states are slightly different here than in the forward iterator. Specifically, HangulL means that we have a HangulL as the character to the right (that is, the previous character). HangulLV means that the character to the right is a HangulV, and HangulLVT means the character to the right is HangulT. (So look at the last character to figure out what state we're in, basically.)
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
kwantam
Jul 14, 2014
Author
Contributor
I'll add some more comments to make it clearer. Thanks for asking for clarification.
huonw
reviewed
Jul 14, 2014
| gr::GC_RegionalIndicator => Regional, | ||
| _ => break, | ||
| }, | ||
| Start => break, // GC_Control |
This comment has been minimized.
This comment has been minimized.
huonw
Jul 14, 2014
Member
This and the case below could be handled by
match cat {
...
gr::GC_Control => { take_curr = state == Start; break }
}in the previous line.
This comment has been minimized.
This comment has been minimized.
kwantam
added some commits
Jul 11, 2014
This comment has been minimized.
This comment has been minimized.
|
@alexcrichton @huonw should be good to go now. Also realized that I needed |
This comment has been minimized.
This comment has been minimized.
|
saw approval from huonw |
This comment has been minimized.
This comment has been minimized.
|
merging kwantam/rust/master = cf432b8 into auto |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
fast-forwarding master to auto = 2692ae1 |
bors
added a commit
that referenced
this pull request
Jul 15, 2014
bors
closed this
Jul 16, 2014
bors
merged commit cf432b8
into
rust-lang:master
Jul 16, 2014
2 checks passed
This comment has been minimized.
This comment has been minimized.
|
How is that still Unstable? Should I open a new issue to stabilize this (at least |
This comment has been minimized.
This comment has been minimized.
|
@ariasuni It's not included even as experimental feature yet -- so the first step you can do is to get it into Rust as an unstable feature (write a PR, discuss on the forum or even write an RFC, which is probably not needed). |
This comment has been minimized.
This comment has been minimized.
mamciek
commented
Dec 28, 2017
|
FYI: It was removed from master |
kwantam commentedJul 11, 2014
width()computes the displayed width of a string, ignoring the width of control characters.graphemes()returns aGraphemesstruct, which implements an iterator over the grapheme clusters of a &str.unicode.pyCnakaNot_Assignedcloses #7043