Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnicodeStrSlice: add width() and graphemes() methods #15619

Merged
merged 2 commits into from Jul 16, 2014

Conversation

Projects
None yet
8 participants
@kwantam
Copy link
Contributor

kwantam commented Jul 11, 2014

  • width() computes the displayed width of a string, ignoring the width of control characters.
    • arguably we might do something else for control characters, but the question is, what?
    • users who want to do something else can iterate over chars()
  • graphemes() returns a Graphemes struct, which implements an iterator over the grapheme clusters of a &str.
  • added code to generate additionial categories in unicode.py
    • Cn aka Not_Assigned
    • categories necessary for grapheme cluster breaking
  • tidied up the exports from libunicode
    • all exports are exposed through a module rather than directly at crate root.
    • std::prelude imports UnicodeChar and UnicodeStrSlice from std::char and std::str rather than directly from libunicode

closes #7043

@kwantam kwantam referenced this pull request Jul 11, 2014

Closed

Unicode grapheme support #7043

#[inline]
fn words(&self) -> Words<'a> {
self.split(u_char::is_whitespace).filter(|s| !s.is_empty())
}

This comment has been minimized.

@bluss

bluss Jul 11, 2014

Contributor

Just as an aside, we note that this is not an UAX #29 algorithm for word boundaries

This comment has been minimized.

@kwantam

kwantam Jul 11, 2014

Author Contributor

Good point. I'll add that to my list of things to think about.

This comment has been minimized.

@lilyball

lilyball Jul 11, 2014

Contributor

Offhand, I'm inclined to say that either this should be a UAX #29 algorithm for word boundaries, or it should be removed entirely. It's easy enough for users to write that same words() function themselves if they need it, and the existence of this function precludes adding the UAX #29 version (at least, without avoiding confusion down the road).

This comment has been minimized.

@kwantam

kwantam Jul 12, 2014

Author Contributor

That's a good point.

On the other hand, the UAX#29 word breaking algorithm is a bit meaty, and often people don't need something that complicated. Maybe we should leave words() as is, and have a function with a different name that implements the UAX#29 algorithm? I don't want people to use words() and end up having a lot more complication than necessary.

Maybe this deserves its own issue?

This comment has been minimized.

@lilyball

lilyball Jul 12, 2014

Contributor

I can certainly understand why you don't want to implement the UAX#29 algorithm right now, and so I think you should just drop words() entirely. It's easy for someone else to write the whitespace-splitting version if they decide that's what they want, but I don't think it's appropriate to call that "words" within the context of a unicode-focused library.

This comment has been minimized.

@bluss

bluss Jul 12, 2014

Contributor

words() is a separate issue. It was just moved in this commit, but it should be addressed (renamed or removed).

This comment has been minimized.

@huonw

huonw Jul 12, 2014

Member

I opened #15628, but this behaviour of &str.words has been around for a while, so can be addressed later.

/// If `is_extended` is true, the iterator is over the *extended grapheme clusters*;
/// otherwise, the iterator is over the *legacy grapheme clusters*.
/// [UAX#28](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
/// recommends extended grapheme cluster boundaries for general processing.

This comment has been minimized.

@bluss

bluss Jul 11, 2014

Contributor

It says 28 instead of 29 here. Reviewing the important stuff :)

@kwantam

This comment has been minimized.

Copy link
Contributor Author

kwantam commented Jul 11, 2014

@bluss

This comment has been minimized.

Copy link
Contributor

bluss commented Jul 12, 2014

Using char_indices() is a great change 👍

/// let v: Vec<&str> = some_words.words().collect();
/// assert_eq!(v, vec!["Mary", "had", "a", "little", "lamb"]);
/// ```
fn words(&self) -> Words<'a>;

This comment has been minimized.

@huonw

huonw Jul 12, 2014

Member

Is there any particular reason that these functions are moving around? (Would it be possible to leave them where they were, to minimise the noise in the diff?)

This comment has been minimized.

@kwantam

kwantam Jul 12, 2014

Author Contributor

It would be. I just wanted to sort the functions in a more sensible order, because it seemed like they were pretty haphazardly thrown together. (...which is my fault; I did put them there, after all!)

I'll swap them back to their original order.

/// [Unicode Standard Annex #11](http://www.unicode.org/reports/tr11/)
/// recommends that these characters be treated as 1 column (i.e.,
/// `is_cjk` = `false`) if the locale is unknown.
//fn width(&self, is_cjk: bool) -> uint;

This comment has been minimized.

@huonw

huonw Jul 12, 2014

Member

(FWIW, commenting this out, but leaving the doccomment above, caused the docs to leak into the trim docs; all fixed now, though.)

This comment has been minimized.

@kwantam

kwantam Jul 12, 2014

Author Contributor

Yes, I noticed this. I'd intended to implement the width() function in the initial PR for libunicode, but apparently it slipped my mind. :\

/// assert!("".is_whitespace());
///
/// assert!( !"abc".is_whitespace());
/// assert!("a\r\nb".graphemes(true).count() == 3);

This comment has been minimized.

@huonw

huonw Jul 12, 2014

Member

Would it be possible to actually list the graphemes for these examples?

///
/// assert!( !"abc".is_whitespace());
/// assert!("a\r\nb".graphemes(true).count() == 3);
/// assert!("\U0001f1f7\U0001f1fa\U0001f1f8".graphemes(true).count() == 1);

This comment has been minimized.

@huonw

huonw Jul 12, 2014

Member

Maybe there could be a more... visible example here (either replacing this one, or in addition to this one, your choice), e.g.

let s = "a\u0310e\u0301o\u0308\u0332"; // åéö̲
let graphemes = s.graphemes(true).collect::<Vec<&str>>();
assert_eq!(graphemes.as_slice(), &["a\u0310", "e\u0301", "o\u0308\u0332"]);
];

for &(s, l) in testdata.iter() {
assert!(l == s.graphemes(true).count());

This comment has been minimized.

@huonw

huonw Jul 12, 2014

Member

fwiw, using assert_eq!(s.graphemes(true).count(), expected) is better.

@@ -2070,6 +2079,138 @@ String::from_str("\u1111\u1171\u11b6"));
}

#[test]
fn test_graphemes() {

This comment has been minimized.

@huonw

huonw Jul 12, 2014

Member

Is it feasible for this to be testing that the computed graphemes are correct? And can this test .graphemes(false) too?

This comment has been minimized.

@kwantam

kwantam Jul 12, 2014

Author Contributor

Good idea. I'll do that.

The Unicode provided tests are either for extended mode or extension agnostic. I'll come up with some alternative tests for legacy mode.

This comment has been minimized.

@kwantam

kwantam Jul 13, 2014

Author Contributor

Added tests for extended and legacy mode, and now testing that the graphemes are correct.

}

self.cat = if take_next {
let CharRange { ch: _, next } = self.string.char_range_at(idx);

This comment has been minimized.

@huonw

huonw Jul 12, 2014

Member

This could be idx = self.string.char_range_at(idx).next;

idx += 1; // rule GB3
}

// we cheat by precomputing the next idx above, so we have to

This comment has been minimized.

@huonw

huonw Jul 12, 2014

Member

Wouldn't this be simpler if the next idx wasn't precomputed?

This comment has been minimized.

@kwantam

kwantam Jul 12, 2014

Author Contributor

Yes, but this is a slight optimization for the \r and \r\n cases, since we know both characters are 1 byte and we have to special case it anyway.

This comment has been minimized.

@huonw

huonw Jul 13, 2014

Member

I'm not sure we need to micro-optimise that case a lot? shrug

continue;
}

match state {

This comment has been minimized.

@huonw

huonw Jul 12, 2014

Member

This whole match could be

state = match state {
    Start if ch == '\r' => { /* handle special case */ }
    Start => match cat { 
         gr::GC_Control => break,
    // ...
};

rather than having state = in each branch. Also, theoretically, you could write it as a match on (state, cat), something like

state = match (state, ch) {
    (Start, _) if ch == '\r' => { .. }
    (Start, gr::GC_Control) => break,
    // ...
    (FindExtend, _) => { take_next = false; break; }
    (HangulL, gr::GR_L) => continue,
    // ...
}

although repeating the state element might get annoying (this would allow you to fold identical arms together, e.g. have a trailing arm that's (FindExtend, _) | (HangulL, _) | (HangulLV, _) | ... => { take_next = false; break }.

This comment has been minimized.

@kwantam

kwantam Jul 12, 2014

Author Contributor

I think stylistically I prefer the first to the second, but I'll take a closer look at both.

This comment has been minimized.

@huonw

huonw Jul 13, 2014

Member

Yes, you should choose as you wish, the second looks like it might make the code harder to understand.

break; // rule GB4
}
state = match cat {
gr::GC_Control => break,

This comment has been minimized.

@huonw

huonw Jul 12, 2014

Member

Is this the only circumstance in which take_next is true? However, if I'm reading this correctly, it seems that this means that the first grapheme in \nFoo... is \nF (i.e. it's taking the next character after idx?).

This comment has been minimized.

@huonw

huonw Jul 12, 2014

Member

(If this is the only circumstance in which take_next is true, maybe false would be a better default, or maybe it could be removed altogether, although this may not be worth it?)

This comment has been minimized.

@kwantam

kwantam Jul 12, 2014

Author Contributor

take_next should also be true when the char_indices() iterator returns None. That's the reason for the default.

This comment has been minimized.

@huonw

huonw Jul 13, 2014

Member

Ah, ok, I think I understand my confusion about \nFoo: the take_next variable really means "take current", i.e. whether to include the current codepoint ch in the grapheme or not. Could you rename it to take_current? (Similarly the next in for (next, ch) in ... { idx = next; ... is actually the offset of the current codepoint ch, not the next one, could you adjust that to curr or current or whatever too?)

@@ -350,6 +385,43 @@ def emit_conversions_module(f, lowerupper, upperlower):
sorted(lowerupper.iteritems(), key=operator.itemgetter(0)), is_pub=False)
f.write("}\n\n")

def emit_grapheme_module(f, grapheme_table, grapheme_cats):
f.write("pub mod grapheme {\n")

This comment has been minimized.

@huonw

huonw Jul 12, 2014

Member

You can use a multiline string here.

f.write("""pub mod grapheme
    use core::option::{Some, None};
    ...
    fn bsearch_range_value_table(...

    ...
    #[allow(non_camel_case_types)]
    #[deriving(Clone)]
    pub enum GraphemeCat {
""")
     for cat in ...
         # ...
    f.write("     }\n\n")
# Regional Indicator
grapheme_cats["RegionalIndicator"] = grapheme_regional_indicator

# Prepend - "Curerently there are no characters with this value"

This comment has been minimized.

@huonw

huonw Jul 12, 2014

Member

"Currently"

This comment has been minimized.

@kwantam

kwantam Jul 12, 2014

Author Contributor

Ah, good catch.

I should attribute the quote; it's from UAX29. Just wanted to note it lest someone think I've forgotten the Prepend case when looking at the code in the future.

@@ -155,6 +185,11 @@ def ungroup_cat(cat):
lo += 1
return cat_out

def gen_unassigned(assigned):
all_set = set(ungroup_cat([(0, 0xd7ff), (0xe000, 0x10ffff)]))

This comment has been minimized.

@huonw

huonw Jul 12, 2014

Member

Is this really the most efficient way to do this? e.g.

assigned = set(assigned)
return [i for i in range(0, 0xd800) if i not in assigned] + [i for i in range(0xe000, 0x10ffff + 1) if i not in assigned]

might be better.

This comment has been minimized.

@kwantam

kwantam Jul 12, 2014

Author Contributor

I wasn't really worrying about efficiency for this script since it's run once in a blue moon, but it's true that yours is more reasonable.

@@ -167,7 +202,7 @@ def format_table_content(f, content, indent):
line = " "*indent
first = True
for chunk in content.split(","):
if len(line) + len(chunk) < 98:
if len(line) + len(chunk) < 96:

This comment has been minimized.

@huonw

huonw Jul 12, 2014

Member

Any particular reason for this change?

This comment has been minimized.

@huonw

huonw Jul 12, 2014

Member

Ok, I checked locally: reverting this change reduces the diffstat to

5 files changed, 1335 insertions(+), 58 deletions(-)
 ...
 src/libunicode/tables.rs  | 885 +++++++++++++++++++++++++++++++++++++++++++++-

rather than

5 files changed, 4104 insertions(+), 2767 deletions(-)
 ...
 src/libunicode/tables.rs  | 6363 ++++++++++++++++++++++++++-------------------

as it is now. Please revert it, or at least do it in its own separate commit, to keep things as clean as possible.

This comment has been minimized.

@kwantam

kwantam Jul 12, 2014

Author Contributor

Ah yes, meant to revert that, was playing with it for something else. Good catch.


# Control
grapheme_cats["Control"] = set()
for cat in ["Zl", "Zp", "Cc", "Cf"]:

This comment has been minimized.

@huonw

huonw Jul 12, 2014

Member

Could you note that Cs is unnecessary since Rust's chars (and strings) hold scalar values only (i.e. no surrogates can be encountered ever)?


# SpacingMark
grapheme_cats["SpacingMark"] = group_cat(list(
set(ungroup_cat(gencats["Mc"])).difference(ungroup_cat(grapheme_cats["Extend"]))

This comment has been minimized.

@huonw

huonw Jul 12, 2014

Member

Why difference here, but - elsewhere?

This comment has been minimized.

@kwantam

kwantam Jul 12, 2014

Author Contributor

Converted them to operator form and forgot this one. Good catch.

@huonw

This comment has been minimized.

Copy link
Member

huonw commented Jul 12, 2014

It seems that unicode.py has r = "unicode.rs", while it should be r = "tables.rs".

}

impl<'a> UnicodeStrSlice<'a> for &'a str {
#[inline]
fn words(&self) -> Words<'a> {
self.split(u_char::is_whitespace).filter(|s| !s.is_empty())
fn graphemes(&self, is_extended: bool) -> Graphemes<'a> {

This comment has been minimized.

@Kimundi

Kimundi Jul 12, 2014

Member

I think it would make sense to provide grapheme_indices() as well, to complement char_indices()

This comment has been minimized.

@kwantam

kwantam Jul 12, 2014

Author Contributor

Ah yes, I like that idea.

This comment has been minimized.

@kwantam

kwantam Jul 13, 2014

Author Contributor

Done.

// We do this because most of the time we would end up
// looking up each character twice.
cat = match self.cat {
None => grapheme_category(ch),

This comment has been minimized.

@huonw

huonw Jul 13, 2014

Member

Functions are generally used module-qualified, i.e. grapheme::grapheme_category (or gr::grapheme_category(...)).

@kwantam

This comment has been minimized.

Copy link
Contributor Author

kwantam commented Jul 13, 2014

Now that I think about it, probably it would also be nice to have a DoubleEndedIterator implementation for Graphemes and GraphemeIndices. I'll implement that tonight.

@kwantam

This comment has been minimized.

Copy link
Contributor Author

kwantam commented Jul 14, 2014

OK, we have DoubleEndedIterator for Graphemes and GraphemeIndices.

Is there a nicer way of reversing a static slice than slice.iter().rev().collect()? I'm not so worried about the efficiency of the test code, but that doesn't mean I want to be doing something boneheaded.

@huonw

This comment has been minimized.

Copy link
Member

huonw commented Jul 14, 2014

You can just test equality of two iterators using std::iter::order::equals and so avoid the collect entirely.

@kwantam

This comment has been minimized.

Copy link
Contributor Author

kwantam commented Jul 14, 2014

👍 thanks

/// # Example
///
/// ```rust
/// let gr1 = "a̐éö̲".graphemes(true).collect::<Vec<&str>>();

This comment has been minimized.

@huonw

huonw Jul 14, 2014

Member

I think using explicit code-point notation (as I used previously) may be better here, because the interesting thing with this iterator is how it groups multiple codepoints into a single entity.

When written in short form, it's not clear that these aren't individual codepoints.

/// External iterator for grapheme clusters and byte offsets.
#[deriving(Clone)]
pub struct GraphemeIndices<'a> {
string: &'a str,

This comment has been minimized.

@huonw

huonw Jul 14, 2014

Member

Can this just store start_offset: uint, rather than the whole string?

This comment has been minimized.

@kwantam

kwantam Jul 14, 2014

Author Contributor

Ah, yes. Good point.

(CharOffsets in core::str can do the same)

_ => self.catb.take_unwrap()
};

state = match state {

This comment has been minimized.

@huonw

huonw Jul 14, 2014

Member

To be clear, this table is effectively running http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules in reverse, right?

This comment has been minimized.

@kwantam

kwantam Jul 14, 2014

Author Contributor

Yes. It implements the same splits, but running from the end of the string rather than the front.

}
},
HangulLV => match cat {
gr::GC_V => continue,

This comment has been minimized.

@huonw

huonw Jul 14, 2014

Member

Is this correct? I don't see a V × LV grapheme production in the table? (And similarly, for the next line, I don't see a LV × LV one?)

This comment has been minimized.

@kwantam

kwantam Jul 14, 2014

Author Contributor

The semantics of the Hangul states are slightly different here than in the forward iterator. Specifically, HangulL means that we have a HangulL as the character to the right (that is, the previous character). HangulLV means that the character to the right is a HangulV, and HangulLVT means the character to the right is HangulT. (So look at the last character to figure out what state we're in, basically.)

This comment has been minimized.

@huonw

huonw Jul 14, 2014

Member

Ah, maybe you could add some more comments, or even use a new enum?

This comment has been minimized.

@kwantam

kwantam Jul 14, 2014

Author Contributor

I'll add some more comments to make it clearer. Thanks for asking for clarification.

gr::GC_RegionalIndicator => Regional,
_ => break,
},
Start => break, // GC_Control

This comment has been minimized.

@huonw

huonw Jul 14, 2014

Member

This and the case below could be handled by

match cat {
    ...
    gr::GC_Control => { take_curr = state == Start; break }
}

in the previous line.

This comment has been minimized.

@kwantam

kwantam Jul 14, 2014

Author Contributor

👍

kwantam added some commits Jul 11, 2014

add Graphemes iterator; tidy unicode exports
- Graphemes and GraphemeIndices structs implement iterators over
  grapheme clusters analogous to the Chars and CharOffsets for chars in
  a string. Iterator and DoubleEndedIterator are available for both.

- tidied up the exports for libunicode. crate root exports are now moved
  into more appropriate module locations:
  - UnicodeStrSlice, Words, Graphemes, GraphemeIndices are in str module
  - UnicodeChar exported from char instead of crate root
  - canonical_combining_class is exported from str rather than crate root

Since libunicode's exports have changed, programs that previously relied
on the old export locations will need to change their `use` statements
to reflect the new ones. See above for more information on where the new
exports live.

closes #7043
[breaking-change]
@kwantam

This comment has been minimized.

Copy link
Contributor Author

kwantam commented Jul 14, 2014

@alexcrichton @huonw should be good to go now.

Also realized that I needed [breaking-change] in the commit message because I tidied the export locations for libunicode, and programs that link directly to it would break. (Since it's pretty young, I doubt there are any such programs yet...)

@bors

This comment has been minimized.

Copy link
Contributor

bors commented on cf432b8 Jul 15, 2014

saw approval from huonw
at kwantam@cf432b8

This comment has been minimized.

Copy link
Contributor

bors replied Jul 15, 2014

merging kwantam/rust/master = cf432b8 into auto

This comment has been minimized.

Copy link
Contributor

bors replied Jul 15, 2014

kwantam/rust/master = cf432b8 merged ok, testing candidate = 2692ae1

This comment has been minimized.

Copy link
Contributor

bors replied Jul 16, 2014

fast-forwarding master to auto = 2692ae1

bors added a commit that referenced this pull request Jul 15, 2014

auto merge of #15619 : kwantam/rust/master, r=huonw
- `width()` computes the displayed width of a string, ignoring the width of control characters.
    - arguably we might do *something* else for control characters, but the question is, what?
    - users who want to do something else can iterate over chars()

- `graphemes()` returns a `Graphemes` struct, which implements an iterator over the grapheme clusters of a &str.
    - fully compliant with [UAX#29](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
    - passes all [Unicode-supplied tests](http://www.unicode.org/reports/tr41/tr41-15.html#Tests29)

- added code to generate additionial categories in `unicode.py`
    - `Cn` aka `Not_Assigned`
    - categories necessary for grapheme cluster breaking

- tidied up the exports from libunicode
  - all exports are exposed through a module rather than directly at crate root.
  - std::prelude imports UnicodeChar and UnicodeStrSlice from std::char and std::str rather than directly from libunicode

closes #7043

@bors bors closed this Jul 16, 2014

@bors bors merged commit cf432b8 into rust-lang:master Jul 16, 2014

2 checks passed

continuous-integration/travis-ci The Travis CI build passed
Details
default all tests passed
@ariasuni

This comment has been minimized.

Copy link
Contributor

ariasuni commented Jun 14, 2017

How is that still Unstable? Should I open a new issue to stabilize this (at least graphemes()?)

@bluss

This comment has been minimized.

Copy link
Contributor

bluss commented Aug 3, 2017

@ariasuni It's not included even as experimental feature yet -- so the first step you can do is to get it into Rust as an unstable feature (write a PR, discuss on the forum or even write an RFC, which is probably not needed).

@mamciek

This comment has been minimized.

Copy link

mamciek commented Dec 28, 2017

FYI: It was removed from master
"Remove all unstable deprecated functionality"
8d90d3f

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.