Unicode case folding, caseless matching, and iterator methods #19277

Closed
SimonSapin opened this issue Nov 24, 2014 · 4 comments

@SimonSapin
Contributor

I made https://github.com/SimonSapin/rust-casefold for Servo, because the HTML spec requires “compatibility caseless matching”. Some of it might be interesting to have in libunicode/libcollections. @aturon, @alexcrichton, how much do you think is appropriate to include? I’d like your input before I prepare a PR (and have to deal with Rust bootstrapping).

zip_all and iter_eq are two generic functions (independent of Unicode) that could be default methods of Iterator. The former is like i.zip(j).all(f), but also returns false if the two iterators have different lengths. The latter (which uses the former) checks that the iterators have the same content. That is, it is equivalent to i.collect::<Vec<_>>() == j.collect::<Vec<_>>(), but compares elements one by one and does not allocate. (It also stops at the first difference rather than consuming both iterators until the end.)
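As a rough illustration of the two helpers described above, here is a minimal sketch in current Rust. It is written as free functions rather than Iterator default methods, and the exact signatures are an assumption; only the behaviour (lock-step comparison, early exit, `false` on a length mismatch, no allocation) comes from the description.

```rust
/// Like `a.zip(b).all(f)`, but also returns `false` when one side
/// runs out before the other.
fn zip_all<A, B, F>(a: A, b: B, mut f: F) -> bool
where
    A: IntoIterator,
    B: IntoIterator,
    F: FnMut(A::Item, B::Item) -> bool,
{
    let mut a = a.into_iter();
    let mut b = b.into_iter();
    loop {
        match (a.next(), b.next()) {
            // Both sides still have an element: stop at the first failure.
            (Some(x), Some(y)) => {
                if !f(x, y) {
                    return false;
                }
            }
            // Both sides ended together: every pair passed.
            (None, None) => return true,
            // Exactly one side ended: the lengths differ.
            _ => return false,
        }
    }
}

/// Element-by-element equality, built on `zip_all`. Nothing is collected.
fn iter_eq<A, B>(a: A, b: B) -> bool
where
    A: IntoIterator,
    B: IntoIterator,
    A::Item: PartialEq<B::Item>,
{
    zip_all(a, b, |x, y| x == y)
}
```

For example, `iter_eq("abc".chars(), vec!['a', 'b', 'c'])` holds, while `iter_eq(1..4, 1..3)` does not because the second range is shorter.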

Case folding is fairly straightforward. The data could be generated with src/etc/unicode.py and kept in src/libunicode/tables.rs, like existing Unicode data.
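To make the table-driven idea concrete, here is a small sketch in current Rust. The table name, its layout, and the function names are hypothetical, and the table holds only a handful of hand-picked entries from Unicode's CaseFolding.txt; a real table would be generated by a script and cover every mapping.

```rust
/// Hypothetical excerpt of a generated case-folding table, sorted by
/// code point so that it can be binary-searched. Only five entries are
/// listed here; characters not listed are passed through unchanged.
static FOLD_TABLE: &[(char, &[char])] = &[
    ('A', &['a']),
    ('Z', &['z']),
    ('\u{00DF}', &['s', 's']), // LATIN SMALL LETTER SHARP S (full folding)
    ('\u{017F}', &['s']),      // LATIN SMALL LETTER LONG S
    ('\u{212A}', &['k']),      // KELVIN SIGN
];

/// Folds one character, yielding one or more characters.
fn case_fold_char(c: char) -> impl Iterator<Item = char> {
    let mapped: &'static [char] = match FOLD_TABLE.binary_search_by_key(&c, |&(k, _)| k) {
        Ok(i) => FOLD_TABLE[i].1,
        Err(_) => &[],
    };
    // No table entry means the character folds to itself.
    let unchanged = if mapped.is_empty() { Some(c) } else { None };
    mapped.iter().copied().chain(unchanged)
}

/// Folds any stream of characters lazily, without building a String.
fn case_fold<I: Iterator<Item = char>>(chars: I) -> impl Iterator<Item = char> {
    chars.flat_map(case_fold_char)
}
```

Because a single character can fold to several (as with the sharp s), the per-character function returns an iterator rather than a `char`.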

Caseless matching, however, is more complex: there are different variants of it. Other than the “default” variant, they require NFD and NFKD normalization. libunicode already has nfd_chars and nfkd_chars methods on &str, but using them here would require allocating an intermediate String. So, in the same spirit as #19042, it might be useful to expose another API for Unicode normalization (all four variants of it, while we’re at it) that works from a generic Iterator<char> rather than just &str / Chars.
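A sketch of why a generic iterator API helps, in current Rust: the “default” variant can be written as a comparison of two lazily folded character streams, with no intermediate String. Two assumptions to note: `char::to_lowercase` is used here only as a stand-in for real case folding (the two differ for some characters, such as the sharp s and the final sigma), and the function names are hypothetical. The normalizing variants would wrap the same streams in NFD/NFKD adapters, which are not shown.

```rust
/// Stand-in for case folding over any stream of chars.
/// ASSUMPTION: lowercasing approximates folding; it is not identical.
fn fold<I: Iterator<Item = char>>(chars: I) -> impl Iterator<Item = char> {
    chars.flat_map(char::to_lowercase)
}

/// "Default" caseless match: fold both sides lazily and compare them
/// element by element, stopping at the first difference.
fn default_caseless_match<A, B>(a: A, b: B) -> bool
where
    A: Iterator<Item = char>,
    B: Iterator<Item = char>,
{
    fold(a).eq(fold(b))
}
```

Since both arguments are plain `Iterator<Item = char>` values, the inputs could come from `str::chars`, from a decoder, or from another adapter, which is the flexibility the paragraph above asks for.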

Thoughts?

Nothing urgent here, but consider this when stabilizing libunicode.

@ghost

ghost commented Nov 24, 2014

I can certainly see zip_all being useful, but with a rather different signature that more closely aligns with the existing zip: fn zip_longer<U, I>(&self, it: I) -> ZipLonger<T, U> where I: Iterator<U>, with ZipLonger implementing Iterator<(Option<T>, Option<U>)>. I'd be tempted to see it tightened even further by having it use an enumerated type with three variants (Left, Right, Both) instead of a tuple of two Options.

The proposed zip_all unnecessarily fuses what is a regular zip with an all combinator.

Here's a suggested implementation:

use std::iter::{Chain, Map, repeat, Repeat, TakeWhile, Zip};

type ZipLonger<'a, 'b, 'c, T, U, I1, I2> = std::iter::TakeWhile<
    'a,
    (Option<T>, Option<U>),
    Zip<
        Chain<
            Map<'b, T, Option<T>, I1>,
            Repeat<Option<T>>
        >,
        Chain<
            Map<'c, U, Option<U>, I2>,
            Repeat<Option<U>>
        >
    >
>;

fn zip_longer<'a, 'b, 'c, T: Clone, U: Clone, I1, I2>(a: I1, b: I2)
    -> ZipLonger<'a, 'b, 'c, T, U, I1, I2>
    where I1: Iterator<T>, 
          I2: Iterator<U> {
    a.map(Some).chain(repeat(None))
        .zip(b.map(Some).chain(repeat(None)))
        .take_while(|&(ref left, ref right)| left.is_some() || right.is_some())
}

(as an Iterator method rather than a free-standing function)
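For readers on current Rust: the snippet above targets the pre-1.0 iterator API (`Iterator<T>`, lifetime parameters on `Map` and `TakeWhile`) and does not compile as written today. A rough modern equivalent might look like the sketch below. It keeps the same combinator chain, uses `repeat_with` so that no `Clone` bound is needed, and adds the three-variant enum mentioned earlier. The names `Pairing` and `zip_pairing` are hypothetical and chosen only for illustration.

```rust
use std::iter::repeat_with;

/// Hypothetical three-variant alternative to a pair of Options.
#[derive(Debug, PartialEq)]
enum Pairing<T, U> {
    Left(T),
    Right(U),
    Both(T, U),
}

/// Zips two iterators, continuing until BOTH are exhausted.
/// The shorter side is padded with `None`.
fn zip_longer<I1, I2>(
    a: I1,
    b: I2,
) -> impl Iterator<Item = (Option<I1::Item>, Option<I2::Item>)>
where
    I1: Iterator,
    I2: Iterator,
{
    a.map(Some)
        .chain(repeat_with(|| None))
        .zip(b.map(Some).chain(repeat_with(|| None)))
        // Stop once neither side has anything left.
        .take_while(|(left, right)| left.is_some() || right.is_some())
}

/// Same traversal, but with the impossible (None, None) state removed
/// from the item type.
fn zip_pairing<I1, I2>(a: I1, b: I2) -> impl Iterator<Item = Pairing<I1::Item, I2::Item>>
where
    I1: Iterator,
    I2: Iterator,
{
    zip_longer(a, b).map(|pair| match pair {
        (Some(l), Some(r)) => Pairing::Both(l, r),
        (Some(l), None) => Pairing::Left(l),
        (None, Some(r)) => Pairing::Right(r),
        (None, None) => unreachable!("take_while stops before both sides are None"),
    })
}
```

The enum version makes the "both exhausted" case unrepresentable for callers, which is the tightening suggested above.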

As for iter_eq, IMHO, since it's a fairly trivial composition of existing combinators (well, including zip_longer), it may be hard to justify including it in core despite its high utility.

@SimonSapin
Contributor Author

@jakub- Indeed, zip_longer is a better primitive that other things can be built on, including iter_eq. I’ll submit a PR for it.

@SimonSapin
Contributor Author

zip_longest PR: #19283

@SimonSapin
Contributor Author

Closing:

  • It turns out iter_eq already exists as std::iter::order::eq.
  • zip_longest is now in its own crate and also in rust-itertools.
  • rust-casefold is now its own crate. There is no urgency to have it in std; we can revisit later.
