Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename or replace str::words to side-step the ambiguity of “a word”. #1054

Merged
merged 2 commits into from
Apr 17, 2015

Conversation

SimonSapin
Copy link
Contributor

View.

@steveklabnik
Copy link
Member

👍 from me

kwantam added a commit to kwantam/rust that referenced this pull request Apr 12, 2015
This patch does the following:

1. Adds three new structs in libunicode/str.rs:

   a. UnicodeWords: a filter on the UWordBounds iterator that yields only
      the "words" of a string as defined in Section 4 of Unicode Standard
      Annex rust-lang#29 (UAX#29), http://unicode.org/reports/tr29/#Word_Boundaries

   b. UWordBounds: an iterator that segments a string on its word
      boundaries as defined in UAX#29. Note that this *only* segments
      the string, and does *not* drop whitespace and other non-word
      pieces of the text (that's what UnicodeWords does).

      Note that UWordBounds has both a forward and backward iterator
      that have total running time (that is, to segment the entire
      string) linear in the size of the string. It should be noted that
      with pathological inputs the reverse iterator could be about 2x less
      efficient than the forward iterator, but on reasonable inputs
      their costs are similar.

   c. UWordBoundIndices: the above iterator, but returning tuples of
      (offset, &str).

2. Adds three new functions in the `UnicodeStr` trait:

   a. words_unicode(): returns a UnicodeWords iterator.

   b. split_words_uax29(): returns a UWordBounds iterator.

   c. split_words_uax29_indices(): returns a UWordBoundIndices iterator.

3. Updates the `src/etc/unicode.py` script to generate tables necessary
   for running the UWordBounds iterators.

4. Adds a new script, `src/etc/unicode_gen_breaktests.py`,
   which processes the grapheme and word break tests published
   by the Unicode consortium into a format for inclusion in
   libcollectionstest.

5. Adds new impls in libcollections's `str` corresponding to the
   `UnicodeStr` functions of (2).

   Note that this new functionality is gated with `feature(unicode)`.

6. Adds tests in libcollectionstest to exercise this new functionality.

   In addition, updates the test data for the graphemes test to
   correspond to the output from the script of (4). (Note that at the
   moment this change is primarily cosmetic.)

This patch does not settle the question raised by @huonw in rust-lang#15628;
rather, it introduces a new function alongside `words()` that follows
UAX#29.

In addition, it does not address the concerns that @SimonSapin raises in
rust-lang/rfcs#1054 since it leaves `words()`
alone.
kwantam added a commit to kwantam/rust that referenced this pull request Apr 12, 2015
This patch does the following:

1. Adds three new structs in libunicode/str.rs:

   a. UnicodeWords: a filter on the UWordBounds iterator that yields only
      the "words" of a string as defined in Section 4 of Unicode Standard
      Annex rust-lang#29 (UAX#29), http://unicode.org/reports/tr29/#Word_Boundaries

   b. UWordBounds: an iterator that segments a string on its word
      boundaries as defined in UAX#29. Note that this *only* segments
      the string, and does *not* drop whitespace and other non-word
      pieces of the text (that's what UnicodeWords does).

      Note that UWordBounds has both a forward and backward iterator
      that have total running time (that is, to segment the entire
      string) linear in the size of the string. It should be noted that
      with pathological inputs the reverse iterator could be about 2x less
      efficient than the forward iterator, but on reasonable inputs
      their costs are similar.

   c. UWordBoundIndices: the above iterator, but returning tuples of
      (offset, &str).

2. Adds three new functions in the `UnicodeStr` trait:

   a. words_unicode(): returns a UnicodeWords iterator.

   b. split_words_uax29(): returns a UWordBounds iterator.

   c. split_words_uax29_indices(): returns a UWordBoundIndices iterator.

3. Updates the `src/etc/unicode.py` script to generate tables necessary
   for running the UWordBounds iterators.

4. Adds a new script, `src/etc/unicode_gen_breaktests.py`,
   which processes the grapheme and word break tests published
   by the Unicode consortium into a format for inclusion in
   libcollectionstest.

5. Adds new impls in libcollections's `str` corresponding to the
   `UnicodeStr` functions of (2).

   Note that this new functionality is gated with `feature(unicode)`.

6. Adds tests in libcollectionstest to exercise this new functionality.

   In addition, updates the test data for the graphemes test to
   correspond to the output from the script of (4). (Note that at the
   moment this change is primarily cosmetic.)

This patch does not settle the question raised by @huonw in rust-lang#15628;
rather, it introduces a new function alongside `words()` that follows
UAX#29.

In addition, it does not address the concerns that @SimonSapin raises in
rust-lang/rfcs#1054 since it leaves `words()`
alone.
@SimonSapin
Copy link
Contributor Author

Some relevant discussion starting here: rust-lang/rust#15628 (comment)

@nagisa
Copy link
Member

nagisa commented Apr 12, 2015

I like the propsed approach. This provides a low-cost convenience wrapper to something many people will be getting wrong otherwise.

How wrong you ask? Splitting over the ASCII space only. Python does not provide any proper convenience wrapper in its standard library and in all code I’ve seen literally all word splitting is done over the ASCII space.

@SimonSapin
Copy link
Contributor Author

“Wrong” depends on what you’re doing. Sometimes, you might be e.g. parsing a space-separated file formet and splitting on non-breaking spaces or other non-ASCII whitespace is not what you want. But s.split(&[' ', '\t'][..]) or similar is available for that situation, and hopefully users will know the difference.

That said, I’m becoming less and less convinced that this should be included at all. I’d like to see use cases. (rust-lang/rust#15628 (comment))

@SimonSapin
Copy link
Contributor Author

And if you’re in a situation where considering only ASCII is wrong, isn’t considering only whitespace also wrong and shouldn’t you be looking for Unicode word boundaries? rust-lang/rust#15628

@nagisa
Copy link
Member

nagisa commented Apr 12, 2015

you might be e.g. parsing a space-separated file formet and splitting on non-breaking spaces or other non-ASCII whitespace is not what you want

you shouldn’t use words/split_whitespace in this case anyway, because it is not split_whitespace’s purpose to split columns in ASCII space-delimited files.

And if you’re in a situation where considering only ASCII is wrong, isn’t considering only whitespace also wrong and shouldn’t you be looking for Unicode word boundaries?

Indeed. Since I’ve previously been doing web development almost exclusively, I found that people think words in prose will always have an ASCII space between them. Considering they will be separated by some unicode whitespace instead is just as wrong.

On the other hand, nothing in split_whitespace suggest it will produce a list of words. So, either just remove words completely, or deprecate it in favour of split_whitespace.

@iopq
Copy link
Contributor

iopq commented Apr 12, 2015

I think the standard library should only implement things that have an obvious "best" implementation. If there are too many complexities and questions about what IS best, it should be sorted out in the crates.io ecosystem.

@lilyball
Copy link
Contributor

👍 on renaming to split_whitespace.

Just 2 days ago I wrote some code that needed to split a line on whitespace. Having str::split_whitespace would have been very useful to me. I actually forgot about str::words so I ended up writing s.split(" "), which bugged me for the reasons you already listed (only split on space, and didn't collapse sequences of whitespace), although in my particular use-case it was acceptable.

FWIW I think I forgot about str::words because it's not grouped with either the split_* methods or with lines() (though it is grouped with graphemes). So in addition to renaming it, I think it should be moved to be adjacent to the other splitting methods for ease of discovery.

@lilyball
Copy link
Contributor

More generally, I think people expect that splitting on whitespace should be an easy thing to do. And I agree with them, it really should. It's not always the correct thing to do, but it's a common enough thing to want.

@kwantam
Copy link

kwantam commented Apr 13, 2015

I agree that splitting on whitespace is a common enough thing that providing a convenience function for it is a good idea, and that calling said function words() is a bit too imprecise.

Moving this function from libunicode into libcollections and renaming it split_whitespace() is the right approach here, I think. 👍

@lilyball
Copy link
Contributor

Why move it out of libunicode?

@kwantam
Copy link

kwantam commented Apr 13, 2015

No particularly strong reason, but it seems to me that since it doesn't directly rely on any Unicode tables (though it does indirectly, though the char::is_whitespace() function; but this is itself exported), it can just live in libcollections instead.

@lilyball
Copy link
Contributor

The problem with moving it into libcollections is that means crates that only use libunicode/libcore (but not liballoc) lose access to this function.

Note that most of libcollections's str methods are actually just wrappers that call into libcore or libunicode. The only str methods implemented in libcollections directly are ones that require allocation (e.g. the methods that return String).

@kwantam
Copy link

kwantam commented Apr 13, 2015

Good point. 👍

@tafia
Copy link

tafia commented Apr 14, 2015

I personally think using split / regex is enough. It is almost a one-line expression that could easily be found with a simple google search in the future.

@lilyball
Copy link
Contributor

A regex is waaaay overkill for this problem, and split() doesn't collapse sequences of adjacent whitespace, so if we say "you have to use split()" then everyone has to remember "and add .filter(|x| !x.is_empty())"

@tafia
Copy link

tafia commented Apr 14, 2015

You won't say to use split "only". You'll let them google it and then find
the one line split + is_empty (or whatever). For instance I couldn't easily
find an equivalent in c#, I google and quickly found simple answers).

I think std should

  • remain as simple as possible
  • focus on basic functionalities with not so basic implementations

But I am clearly not the best adviser, I just wanted to give my opinion.
On 14 Apr 2015 14:35, "Kevin Ballard" notifications@github.com wrote:

A regex is waaaay overkill for this problem, and split() doesn't collapse
sequences of adjacent whitespace, so if we say "you have to use split()"
then everyone has to remember "and add .filter(|x| !x.is_empty())"


Reply to this email directly or view it on GitHub
#1054 (comment).

@tshepang
Copy link
Member

@kballard s.split_whitespace() is easier on the eyes than s.split(char::is_whitespace).filter(|s| !s.is_empty()). It's also very common functionality.

@nikomatsakis
Copy link
Contributor

👍 Ship it!

@alexcrichton alexcrichton self-assigned this Apr 16, 2015
@kwantam
Copy link

kwantam commented Apr 16, 2015

Slight sticking point: while words() is marked unstable, the Words struct is marked stable. So probably a deprecation / transition plan is required here.

@huonw
Copy link
Member

huonw commented Apr 16, 2015

I wonder if this wouldn't be better be tackled with a pattern API as mentioned in the alternatives, e.g. s.split(std::str::Whitespaces), which would collapse runs of whitespace into one.

One could probably even use an associated constant/type on the str type to avoid having to import more things: s.split(str::Whitespaces) would work anywhere that the str type itself is in scope.

@alexcrichton
Copy link
Member

Thanks again for the RFC @SimonSapin! The feedback on this has been quite positive, so I'm going to merge this. The fact that Words is #[stable] (as @kwantam pointed out) I believe was just a mistake and we should be fine to revert the stability. I also think that @huonw's idea is a plausible option, but it is probably more ergonomic to have an inherent method anyway (less imports). Regardless we'll have a bit more time after this lands as #[unstable] to tweak it for a bit as well.

@lilyball
Copy link
Contributor

Just to note, while I like @huonw's general idea of providing some pre-baked patterns for common things, I don't think that is actually a suitable replacement for str::words() because it won't be able to drop the empty first/last elements when the string has leading/trailing space (e.g. " a b ".split(std::str::Whitespaces) would yield ["", "a", "b", ""]).

@kwantam
Copy link

kwantam commented Apr 22, 2015

It should be possible to provide a pattern for match() that does the right thing, though.

@lilyball
Copy link
Contributor

@kwantam Ah hmm, you're right. I actually didn't realize we had a matches() method now, though it seems like an obvious thing to have in hindsight.

SimonSapin added a commit to SimonSapin/rust that referenced this pull request Jun 6, 2015
But not str::to_titlecase which would require UAX#29 Unicode Text Segmentation
which we decided not to include in of `std`:
rust-lang/rfcs#1054
bors added a commit to rust-lang/rust that referenced this pull request Jun 9, 2015
* Add “complex” mappings to `char::to_lowercase` and `char::to_uppercase`, making them yield sometimes more than on `char`: #25800. `str::to_lowercase` and `str::to_uppercase` are affected as well.
* Add `char::to_titlecase`, since it’s the same algorithm (just different data). However this does **not** add `str::to_titlecase`, as that would require UAX#29 Unicode Text Segmentation which we decided not to include in of `std`: rust-lang/rfcs#1054 I made `char::to_titlecase` immediately `#[stable]`, since it’s so similar to `char::to_uppercase` that’s already stable. Let me know if it should be `#[unstable]` for a while.
* Add a special case for upper-case Sigma in word-final position in `str::to_lowercase`: #26035. This is the only language-independent conditional mapping currently in `SpecialCasing.txt`.
* Stabilize `str::to_lowercase` and `str::to_uppercase`. The `&self -> String` on `str` signature seems straightforward enough, and the only relevant issue I’ve found is #24536 about naming. But `char` already has stable methods with the same name, and deprecating them for a rename doesn’t seem worth it.

r? @alexcrichton
@Centril Centril added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Nov 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet