Join GitHub today
GitHub is home to over 31 million developers working together to host and review code, manage projects, and build software together.
Sign up&str.words should be using the unicode word boundary algorithm #15628
Comments
huonw
referenced this issue
Jul 12, 2014
Merged
UnicodeStrSlice: add width() and graphemes() methods #15619
huonw
added
the
A-libs
label
Jul 12, 2014
This comment has been minimized.
This comment has been minimized.
|
Note that this gives some possibly unintuitive results.
|
This comment has been minimized.
This comment has been minimized.
|
It occurs to me that we could take a |
This comment has been minimized.
This comment has been minimized.
|
I think taking a bool for this is bad design. It should be a separate method. |
This comment has been minimized.
This comment has been minimized.
|
This has to be decided before 1.0. |
This comment has been minimized.
This comment has been minimized.
|
Here is the implementation I've just written for a program I'm working on: fn word_at(&self, idx: usize) -> &str {
let s = &self[idx..];
let mut take_prev = true;
let mut take_cur = true;
let mut state = WState::Start;
let mut prev_idx = 0;
let mut idx = 0;
for (curr, c) in s.char_indices() {
prev_idx = idx;
idx = curr;
let cat = word::category(c);
if cat == WCat::Extend || cat == WCat::Format {
if state == WState::Start {
break;
} else {
continue;
}
}
state = match state {
WState::Start if c == '\r' => {
if idx + 1 != s.len() && s.char_at(idx + 1) == '\n' {
idx += 1;
}
break;
},
WState::Start => match cat {
WCat::ALetter => WState::ALetter,
WCat::HebrewLetter => WState::HLetter,
WCat::Numeric => WState::Numeric,
WCat::Katakana => WState::Katakana,
WCat::ExtendNumLet => WState::ExtendNumLet,
WCat::RegionalIndicator => WState::RegionalIndicator,
_ => break,
},
WState::ALetter => match cat {
WCat::ALetter => WState::ALetter,
WCat::HebrewLetter => WState::HLetter,
WCat::MidLetter | WCat::MidNumLet |
WCat::SingleQuote => WState::ALetterExt,
WCat::Numeric => WState::Numeric,
WCat::ExtendNumLet => WState::ExtendNumLet,
_ => {
take_cur = false;
break;
}
},
WState::ALetterExt => match cat {
WCat::ALetter => WState::ALetter,
WCat::HebrewLetter => WState::HLetter,
_ => {
take_cur = false;
take_prev = false;
break;
}
},
WState::HLetter => match cat {
WCat::ALetter => WState::ALetter,
WCat::HebrewLetter => WState::HLetter,
WCat::MidLetter | WCat::MidNumLet => WState::HLetterExt(0),
WCat::SingleQuote => WState::HLetterExt(1),
WCat::DoubleQuote => WState::HLetterExt(2),
WCat::Numeric => WState::Numeric,
WCat::ExtendNumLet => WState::ExtendNumLet,
_ => {
take_cur = false;
break;
}
},
WState::HLetterExt(n) => match cat {
WCat::ALetter if n < 2 => WState::ALetter,
WCat::HebrewLetter => WState::HLetter,
_ => {
take_cur = false;
take_prev = n == 1;
break;
}
},
WState::Numeric => match cat {
WCat::Numeric => WState::Numeric,
WCat::ALetter => WState::ALetter,
WCat::HebrewLetter => WState::HLetter,
WCat::MidNum | WCat::MidNumLet |
WCat::SingleQuote => WState::NumericExt,
WCat::ExtendNumLet => WState::ExtendNumLet,
_ => {
take_cur = false;
break;
}
},
WState::NumericExt => match cat {
WCat::Numeric => WState::Numeric,
_ => {
take_cur = false;
take_prev = false;
break;
}
},
WState::Katakana => match cat {
WCat::Katakana => WState::Katakana,
WCat::ExtendNumLet => WState::ExtendNumLet,
_ => {
take_cur = false;
break;
}
},
WState::ExtendNumLet => match cat {
WCat::ALetter => WState::ALetter,
WCat::HebrewLetter => WState::HLetter,
WCat::Numeric => WState::Numeric,
WCat::Katakana => WState::Katakana,
WCat::ExtendNumLet => WState::ExtendNumLet,
_ => {
take_cur = false;
break;
}
},
WState::RegionalIndicator => match cat {
WCat::RegionalIndicator => WState::RegionalIndicator,
_ => {
take_cur = false;
break;
}
},
WState::SingleQuote => unreachable!(),
}
}
if take_cur {
idx += s.char_at(idx).len_utf8();
} else if !take_prev {
idx = prev_idx;
}
&s[..idx]
}
fn word_at_reverse(&self, idx: usize) -> &str {
let s = &self[..idx];
let mut take_prev = true;
let mut take_cur = true;
let mut state = WState::Start;
let mut prev_prev_idx = s.len();
let mut prev_idx = s.len();
let mut idx = s.len();
for (curr, c) in s.char_indices().rev() {
let cat = word::category(c);
if cat == WCat::Extend || cat == WCat::Format {
continue;
}
prev_prev_idx = prev_idx;
prev_idx = idx;
idx = curr;
state = match state {
WState::Start if c == '\n' => {
if idx > 0 && s.char_at_reverse(idx) == '\r' {
idx -= 1;
}
break;
},
WState::Start => match cat {
WCat::ALetter => WState::ALetter,
WCat::HebrewLetter => WState::HLetter,
WCat::SingleQuote => WState::SingleQuote,
WCat::Numeric => WState::Numeric,
WCat::Katakana => WState::Katakana,
WCat::ExtendNumLet => WState::ExtendNumLet,
WCat::RegionalIndicator => WState::RegionalIndicator,
_ => break,
},
WState::ALetter => match cat {
WCat::ALetter => WState::ALetter,
WCat::HebrewLetter => WState::HLetter,
WCat::MidLetter | WCat::MidNumLet |
WCat::SingleQuote => WState::ALetterExt,
WCat::Numeric => WState::Numeric,
WCat::ExtendNumLet => WState::ExtendNumLet,
_ => {
take_cur = false;
break;
}
},
WState::ALetterExt => match cat {
WCat::ALetter => WState::ALetter,
WCat::HebrewLetter => WState::HLetter,
_ => {
take_cur = false;
take_prev = false;
break;
}
},
WState::HLetter => match cat {
WCat::ALetter => WState::ALetter,
WCat::HebrewLetter => WState::HLetter,
WCat::MidLetter | WCat::MidNumLet => WState::HLetterExt(0),
WCat::DoubleQuote => WState::HLetterExt(2),
WCat::Numeric => WState::Numeric,
WCat::ExtendNumLet => WState::ExtendNumLet,
_ => {
take_cur = false;
break;
}
},
WState::HLetterExt(n) => match cat {
WCat::ALetter if n < 2 => WState::ALetter,
WCat::HebrewLetter => WState::HLetter,
_ => {
take_cur = false;
take_prev = false;
break;
}
},
WState::SingleQuote => match cat {
WCat::HebrewLetter => WState::HLetter,
_ => {
take_cur = false;
break;
}
},
WState::Numeric => match cat {
WCat::Numeric => WState::Numeric,
WCat::ALetter => WState::ALetter,
WCat::HebrewLetter => WState::HLetter,
WCat::MidNum | WCat::MidNumLet |
WCat::SingleQuote => WState::NumericExt,
WCat::ExtendNumLet => WState::ExtendNumLet,
_ => {
take_cur = false;
break;
}
},
WState::NumericExt => match cat {
WCat::Numeric => WState::Numeric,
_ => {
take_cur = false;
take_prev = false;
break;
}
},
WState::Katakana => match cat {
WCat::Katakana => WState::Katakana,
WCat::ExtendNumLet => WState::ExtendNumLet,
_ => {
take_cur = false;
break;
}
},
WState::ExtendNumLet => match cat {
WCat::ALetter => WState::ALetter,
WCat::HebrewLetter => WState::HLetter,
WCat::Numeric => WState::Numeric,
WCat::Katakana => WState::Katakana,
WCat::ExtendNumLet => WState::ExtendNumLet,
_ => {
take_cur = false;
break;
}
},
WState::RegionalIndicator => match cat {
WCat::RegionalIndicator => WState::RegionalIndicator,
_ => {
take_cur = false;
break;
}
},
}
}
if idx == s.len() {
idx = 0;
} else if !take_prev {
idx = prev_prev_idx;
} else if !take_cur {
idx = prev_idx;
}
&s[idx..]
} |
This comment has been minimized.
This comment has been minimized.
|
It's still not clear to me that we want "words" to be UAX#29 vs "normal" splitting. Most people expect to break on whitespace based on their experiences in other languages. That's not a reason to do or not do it; I'm just pointing out that some thought would be good here. I'm convinced that it's worthwhile to have a UAX#29 implementation available in the library, since there is good reason to have strong Unicode support. I'm just unclear on how the naming should be handled. |
bstrie
added
the
I-nominated
label
Jan 12, 2015
This comment has been minimized.
This comment has been minimized.
|
I think This seems like a nice function though. I suppose it's quite specific so it could have a longer name... I can give some suggestions: And if you wanted to be ultra precise/unambiguous/mean you could remove |
This comment has been minimized.
This comment has been minimized.
One might want to read the documentation then.
It's clear that no thought went into the current implementation. A function that uses a degenerate algorithm which barely finds words in ascii does not belong in a unicode library. I'm not sure what purpose this algorithm serves at all. Chrome certainly doesn't use it to find word boundaries and neither does vim. If someone actually needs this behavior then they can get it with the one-liner in the first post. |
This comment has been minimized.
This comment has been minimized.
|
This is a fair observation. In that case, probably we should
This works for me if it works for you. From a practical standpoint: are you interested in putting together a pull request to integrate the above into libunicode? If not, I can probably find some time to do it over the weekend. |
This comment has been minimized.
This comment has been minimized.
|
I don't know python so I won't do it. Besides, the current python implementation seems bad because it doesn't use the normative file for the grapheme classes. Also, the names of the functions should reflect what they are actually doing. |
brson
added this to the 1.0 beta milestone
Jan 15, 2015
brson
added
the
P-backcompat-libs
label
Jan 15, 2015
This comment has been minimized.
This comment has been minimized.
|
Don't need to do as described. Just need to decide something. |
alexcrichton
removed
the
I-nominated
label
Jan 15, 2015
pnkfelix
self-assigned this
Feb 12, 2015
huonw
assigned
huonw
and unassigned
pnkfelix
Feb 12, 2015
This comment has been minimized.
This comment has been minimized.
|
FWIW, it seems totally reasonable to mark this |
huonw
added a commit
to huonw/rust
that referenced
this issue
Feb 13, 2015
This comment has been minimized.
This comment has been minimized.
|
I can find exactly one and a half other languages that implements similar functionality (rather than just using plain split). Haskell has PR for unstabilsation, #22253. |
huonw
added a commit
to huonw/rust
that referenced
this issue
Feb 13, 2015
This comment has been minimized.
This comment has been minimized.
|
The two algorithms are different enough that they should use different method names IMO. Whitespace split is popular and useful (as a poor man's lex for command lines, for example?). |
steveklabnik
added a commit
to steveklabnik/rust
that referenced
this issue
Feb 13, 2015
steveklabnik
added a commit
to steveklabnik/rust
that referenced
this issue
Feb 14, 2015
steveklabnik
added a commit
to steveklabnik/rust
that referenced
this issue
Feb 15, 2015
bors
added a commit
that referenced
this issue
Apr 7, 2015
Manishearth
added a commit
to Manishearth/rust
that referenced
this issue
Apr 7, 2015
kwantam
added a commit
to kwantam/rust
that referenced
this issue
Apr 7, 2015
This comment has been minimized.
This comment has been minimized.
|
RFC’d rust-lang/rfcs#1054 |
This comment has been minimized.
This comment has been minimized.
|
I've become convinced that, while it might be useful to offer UAX#29 breaking as part of libunicode, it's definitely not what most people will want most of the time, so it should probably not be the default I'm almost ready to submit a PR to add a UAX#29 word breaking implementation to libunicode, but probably having a short function name that performs the standard "split on whitespace" is something that people kind of expect from the stdlib. Perhaps renaming the current function per @SimonSapin's suggestion is the right way to go here. By the way, @mahkoh, I think your reverse iterator above does not correctly handle Format and Extend characters. For example, I think your implementation might group Format or Extend characters with the wrong word, and I'm pretty sure it doesn't do the right thing in the pathological case, namely, when there are Format or Extend characters at the beginning of a string. |
This comment has been minimized.
This comment has been minimized.
|
@kwantam that’s great! However I’d encourage you to publish this code on crates.io rather than try to get it into libunicode. That way it can evolve independently of the compiler, and the API does not need to make as strong compatibility promises as |
This comment has been minimized.
This comment has been minimized.
|
That's a not unreasonable opinion, but I think it's probably worth having the discussion in the PR rather than here. If it's ultimately rejected in favor of moving to a separate library that's fine. A larger question is whether more of the functionality currently in libunicode should be moved elsewhere, e.g., into a standalone crate. For example, it seems slightly arbitrary to have grapheme clustering, normalizations, etc. in stdlib (as is currently the case) and keeping other functionality elsewhere. |
This comment has been minimized.
This comment has been minimized.
Yes, definitely. I kinda volunteered to do this a while back, but haven’t gotten to it yet. |
This comment has been minimized.
This comment has been minimized.
|
I worry that having a Unicode is hard and complicated, so I fear many people will just treat strings like ASCII if a lot of Unicode functionality is moved out of the stdlib. The best way to encourage people to handle non-ASCII data correctly is, in my opinion, to make it very easy to do so, and that can be accomplished effectively by keeping all Unicode functionality in-tree. |
This comment has been minimized.
This comment has been minimized.
|
“More correct” really depends on what you’re doing with it. While I agree with the general sentiment that good Unicode support is good, just because it’s says “Unicode” in the name doesn’t mean that UAX#29 is always the right thing. Let’s pretend for a minute that The Python equivalent is The only code I ever wrote that was actually manipulating words in prose text was line breaking in a text layout engine, but 1. that’s UAX#14, not #29, 2. it’s specialized enough that I want lots of control: I don’t expect to just use a single standard library method. |
This comment has been minimized.
This comment has been minimized.
blankname
commented
Apr 11, 2015
|
Note: I'm out of my depth here and late for bed, so hopefully I'm not just adding noise. Personally I'd love to have a words iterator that works along the same lines as the "Counting the Words in a UTF-8 String" example in this icu guide http://userguide.icu-project.org/strings/utext So use UAX#29 word breaking, but only return words and numbers; not whitespace or punctuation. I don't think that this behaviour would be surprising. The only downsides I see are complexity of implementation and the fact that it would be less speedy than just splitting by whitespace. As far as complexity goes, I think that this type of complexity argues for it being implemented in the standard library, where it will get a well scrutinized and high quality implementation that is available to anyone using the standard library. As for performance, it would be easy for a programmer to opt out. If they know what they're doing and what they're doing doesn't need the more complex word splitting algorithm then splitting by whitespace is only a one-liner away. It appears to me to be a suitable default. This seems like another place that the standard library can help programmers avoid tripping up. It would be more complex for implementor not user. For the user it would just work, even in cases the programmer never thought about. |
This comment has been minimized.
This comment has been minimized.
I disagree with the claim that putting something in |
This comment has been minimized.
This comment has been minimized.
|
@SimonSapin I agree with your concerns about stability and domain knowledge, but I fear you might be overselling them slightly. My understanding is that marking it As far as domain knowledge goes, in general I agree but in this case it's mostly just a matter of implementing the slightly gross rules (the backward iterator is the gross one) in a more-or-less linear time automaton with reasonable constant factors and acceptable code quality. The deep domain knowledge is required for writing UAX#29, surely, but not so much for implementing it. @blankname I think what you're after is effectively |
huonw commentedJul 12, 2014
Currently it's just splitting on whitespace (i.e. is equivalent to splitting with
regex!("\s+")). Preferably it should be using http://www.unicode.org/reports/tr29/#Word_Boundaries(The old behaviour is easy to replicate with the above regex, or with
s.split(|c: char| c.is_whitespace()).filter(|s| !s.is_empty()).)