RFC: Needle API (née Pattern API) #2500

Open
wants to merge 9 commits into
base: master
from

@kennytm
Member

kennytm commented Jul 14, 2018

@Centril Centril added the T-libs label Jul 15, 2018

@Centril

Contributor

Centril commented Jul 15, 2018

🎉

Happy to see you got some use out of my link to https://crates.io/crates/galil-seiferas and cc @bluss who's the author of that crate.

@gilescope

gilescope commented Jul 18, 2018

SharedHaystack seems like a general concept. Could we call it CheapClone or something like that - I'm sure there's lots of other places we'd like to know if a clone is expensive or not.

@kennytm

Member

kennytm commented Jul 18, 2018

@gilescope Interesting, though if we generalize the concept it would raise the question of what is meant by "cheap", e.g. is the Clone of u32 cheap? [u32; 1]? [u32; 65536]? Box<Rc<[u32]>>?

Suppose we do introduce the ShallowClone marker trait:

pub trait ShallowClone: Clone {}
impl<'a, T: ?Sized + 'a> ShallowClone for &'a T {}
impl<T: ?Sized> ShallowClone for Rc<T> {}
impl<T: ?Sized> ShallowClone for Arc<T> {}

the SharedHaystack bound could be changed to a trait alias and everything should work...

#[deprecated]
trait SharedHaystack = Haystack + ShallowClone;

... as long as SharedHaystack is still unstable. If it has become stable then we could not change anything (a third party crate could impl SharedHaystack for MyRef<MyHay> with MyRef<T>: !ShallowClone). So if we intend to take ShallowClone seriously we should have separate stabilization tracks between the whole Pattern API and SharedHaystack.

Anyway ShallowClone should belong to another RFC. I've added this to unresolved questions.
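
A minimal sketch of how such a marker could be consumed by generic code. `ShallowClone` is the hypothetical trait from the comment above; the `share` helper is an illustrative assumption and not part of the RFC or of any crate.

```rust
use std::rc::Rc;
use std::sync::Arc;

/// Hypothetical marker: cloning only copies a pointer or bumps a reference count.
pub trait ShallowClone: Clone {}

impl<'a, T: ?Sized + 'a> ShallowClone for &'a T {}
impl<T: ?Sized> ShallowClone for Rc<T> {}
impl<T: ?Sized> ShallowClone for Arc<T> {}

/// Hypothetical helper: returns two handles to the same underlying data.
/// The bound documents that the clone inside is expected to be cheap.
pub fn share<H: ShallowClone>(handle: H) -> (H, H) {
    (handle.clone(), handle)
}
```

Types such as `Vec<T>` or `[u32; 65536]` simply would not implement the marker, so `share` rejects them at compile time. The trait itself cannot verify that a clone is actually cheap; that remains a promise made by whoever writes the impl.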

### Consumer
A consumer provides the `.consume()` method to implement `starts_with()` and `trim_start()`. It

@shepmaster

shepmaster Jul 19, 2018

Member

to implement starts_with()

Is that missing from this or outdated?

@kennytm

kennytm Jul 19, 2018

Member

This is the starts_with algorithm, not the trait method :)

@shepmaster

Member

shepmaster commented Jul 19, 2018

It's the most 🚲🏚 thing, but I don't find Hay evocative. I think that it's supposed to be "a haystack that's bigger than the haystack we are currently looking at", so maybe something like HayField?

@shepmaster

Member

shepmaster commented Jul 19, 2018

Experience report — implementing Pattern for searching &str

See my branch (permalink).

Background

Jetscii is a library that implements "find the first of any of these 16 bytes in a string". Its stable interface is fn find(&self, haystack: &str) -> Option<usize>. It has a feature flag allowing it to implement the current unstable Pattern API.

Thoughts

  • There are a lot of types and traits that fit together in subtle ways
  • Searcher and Consumer feel like they will be redundant, based on their descriptions
  • It is unclear what Searcher and Consumer map to in algorithm-function-space. This is important to know in order to test the implementations.
  • I did not think of my pattern in the terms of what the Searcher / Consumer primitives offer
    • Implementing Consumer feels very non-performant for my case
  • From my implementation, it's unclear why Searcher and Consumer are different traits. It's also unclear why Pattern is itself a different trait. For me, these were all implemented by a single type, the type I already had. I'm sure there's a reason for the indirection, but justifying that upfront in the docs will be highly useful.

Implementation failures

I implemented Pattern and friends and it passed all my tests. I then had @kennytm take a look at it and we identified multiple issues with my implementation:

  • I didn't properly return the offset in the hay, but only in the haystack. While documented, this is a subtle interaction that doesn't break anything until you start using the second result of find (or one of the iterators).
  • My implementation of consume was completely broken, but was not tested at all. As mentioned above, I don't think in terms of these two primitives, so I didn't test an algorithm that used consume at first.
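
The first failure above can be illustrated with a simplified model. The sketch below assumes a span is just "the full string plus the sub-range still to be searched"; `MiniSpan` and `search` are hypothetical stand-ins, not the pattern_3 types. The point is that a match found inside the sub-slice has to be translated back into the coordinates of the full string before it is returned.

```rust
use std::ops::Range;

/// Simplified stand-in for a span over `&str`: the full string plus the
/// sub-range that is still to be searched. (Hypothetical type.)
#[derive(Clone, Debug)]
pub struct MiniSpan<'h> {
    pub hay: &'h str,
    pub range: Range<usize>,
}

/// Finds `needle` inside `span.range` and reports the match in the
/// coordinates of `span.hay`, not of the sub-slice that was scanned.
pub fn search(span: &MiniSpan<'_>, needle: &str) -> Option<Range<usize>> {
    let window = &span.hay[span.range.clone()];
    let local = window.find(needle)?;
    // Translate the window-relative offset back into an offset in `hay`.
    let start = span.range.start + local;
    Some(start..start + needle.len())
}
```

Returning `local` directly would still pass any test that searches from offset 0, which matches the observation that the bug only shows up once the second result of a search is used.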

Looking forward

One exciting thing is that this allows implementing a pattern for &[T], which is what Jetscii actually does. I plan on trying that and will post another comment for that.

@kennytm

Member

kennytm commented Jul 19, 2018

Thanks @shepmaster! I'll update the descriptions in the pattern_3 crate for the unaddressed points (why Searcher and Consumer are different traits is explained in the RFC but not in the crate documentation).

There are a lot of types and traits that fit together in subtle ways

Could you elaborate on what you mean by "subtle"?

Implementing Consumer feels very non-performant for my case

Yes this problem also exists for my regex implementation which does an unanchored search.

unsafe impl<'p> Consumer<str> for RegexSearcher<'p> {
    fn consume(&mut self, span: Span<&str>) -> Option<usize> {
        let (hay, range) = span.into_parts();
        let m = self.regex.find_at(hay, range.start)?;
        if m.start() == range.start {
            Some(m.end())
        } else {
            None
        }
    }
}

I don't know if this could be improved performance-wise other than asking the Searcher implementor to provide such primitive too.

(If we disregard the performance issue we could default impl a Consumer in terms of a Searcher and vice-versa, but most of the time this isn't a good idea.)
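
A sketch of what "a Consumer in terms of a Searcher and vice versa" could look like, and why it tends to be slow. The traits here are deliberately simplified assumptions (plain `&str` plus a start offset instead of `Span`, safe instead of `unsafe`); `MiniSearcher`, `MiniConsumer`, the two adapters, and `Literal` are hypothetical names.

```rust
use std::ops::Range;

/// Simplified searcher: find the first match anywhere in `hay[from..]`.
pub trait MiniSearcher {
    fn search(&mut self, hay: &str, from: usize) -> Option<Range<usize>>;
}

/// Simplified consumer: match only at position `from`, returning the end.
pub trait MiniConsumer {
    fn consume(&mut self, hay: &str, from: usize) -> Option<usize>;
}

/// Adapter deriving `consume` from `search`.
pub struct ConsumeViaSearch<S>(pub S);

impl<S: MiniSearcher> MiniConsumer for ConsumeViaSearch<S> {
    fn consume(&mut self, hay: &str, from: usize) -> Option<usize> {
        // May scan the whole hay before learning the match is not anchored.
        let m = self.0.search(hay, from)?;
        if m.start == from { Some(m.end) } else { None }
    }
}

/// Adapter deriving `search` from `consume`.
pub struct SearchViaConsume<C>(pub C);

impl<C: MiniConsumer> MiniSearcher for SearchViaConsume<C> {
    fn search(&mut self, hay: &str, from: usize) -> Option<Range<usize>> {
        // Retries `consume` at every char boundary: a brute-force scan.
        (from..=hay.len())
            .filter(|&i| hay.is_char_boundary(i))
            .find_map(|i| self.0.consume(hay, i).map(|end| i..end))
    }
}

/// Substring needle used to exercise both adapters.
pub struct Literal<'p>(pub &'p str);

impl<'p> MiniSearcher for Literal<'p> {
    fn search(&mut self, hay: &str, from: usize) -> Option<Range<usize>> {
        let start = from + hay[from..].find(self.0)?;
        Some(start..start + self.0.len())
    }
}

impl<'p> MiniConsumer for Literal<'p> {
    fn consume(&mut self, hay: &str, from: usize) -> Option<usize> {
        if hay[from..].starts_with(self.0) { Some(from + self.0.len()) } else { None }
    }
}
```

Both adapters are correct for leftmost-match searchers, but each throws away work: the first may scan far past the anchor, the second cannot use any skipping strategy the underlying algorithm might offer.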

@shepmaster

Member

shepmaster commented Jul 19, 2018

Experience report — implementing Pattern for searching &[u8]

See my branch (permalink).

Background

Previous comment

Thoughts

  • I'm very happy that Span::as_bytes exists. I expected there to be a more general map though.
  • For my particular case, the &[u8] implementation is simpler because that's my core algorithm. My &str Pattern can actually delegate to the &[u8] one.

Looking forward

I've needed to write custom consumers of a Pattern before as well; I hope that to be the next comment.

Merge Searcher and Consumer.
Explained why Pattern cannot be merged into Searcher.

Block on RFC 1672.
@kennytm

Member

kennytm commented Jul 24, 2018

Update

Addressing #2500 (comment).

  1. Merged Searcher and Consumer into a single trait, to reduce the number of types. The concept of "consumer" still exists where the Searcher impl can be an enum to choose between a search-optimized or consume-optimized structure. Microbenchmarks show that this runtime selection doesn't incur much slowdown.

  2. Added more documentation about Searcher and Pattern into the pattern-3 crate. Unfortunately not available in docs.rs yet until they have fixed that #![feature(extern_prelude)] bug 😛

  3. Changed some Pattern impls not blocked by #1672 to use a "blanket" impl.


I'm very happy that Span::as_bytes exists. I expected there to be a more general map though.

A general map cannot safely exist, as you could write span.map(|h| "") and that would produce nonsense.

An unsafe version can be done as

let (hay, range) = span.into_parts();
let hay = hay.as_inner();
unsafe { Span::from_parts(hay, range) }
@gereeter

This is a generally great RFC and I'm overall quite happy with the API and definitely happy with the demonstrated performance improvements.

I would be interested in seeing more detail and benchmarks in regards to the behaviour with owned haystacks. From what I can tell, this makes split on Vec take quadratic time, since the trisection needs to copy the entire tail of the Vec on every split. This seems like a hard problem, and it would be bad to back ourselves into a corner.

One (verbose) solution would be to introduce an intermediate data structure, a PartialVec<T> that owns the memory of a whole Vec but only owns a small range worth of elements. It would be possible to convert it into a Vec by shifting the elements to the start. Then, split_around would be turned into two variants (where PartialVec would presumably be specified in another associated type), split_around_forward(...) -> (Vec, Vec, PartialVec) and split_around_backward(...) -> (PartialVec, Vec, Vec). split would use split_around_forward and rsplit would use split_around_backward.
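
To make the cost concern concrete, here is a sketch that assumes an owned trisection is built on `Vec::split_off` (an assumption for illustration; the pattern_3 implementation may differ). `split_around_vec` and `split_owned` are hypothetical helpers.

```rust
use std::ops::Range;

/// Splits an owned vector into the parts before, inside, and after `range`.
/// Panics if the range is out of bounds.
pub fn split_around_vec<T>(mut v: Vec<T>, range: Range<usize>) -> [Vec<T>; 3] {
    // `Vec::split_off` allocates a new vector and moves the tail into it,
    // so both calls cost time proportional to the elements after the cut.
    let right = v.split_off(range.end);
    let middle = v.split_off(range.start);
    [v, middle, right]
}

/// Naive owned split built on the trisection above.
pub fn split_owned<T: PartialEq>(v: Vec<T>, delim: &T) -> Vec<Vec<T>> {
    let mut pieces = Vec::new();
    let mut rest = v;
    while let Some(pos) = rest.iter().position(|x| x == delim) {
        let [left, _delim, right] = split_around_vec(rest, pos..pos + 1);
        pieces.push(left);
        rest = right; // the whole remaining tail was just moved again
    }
    pieces.push(rest);
    pieces
}
```

With k delimiters in a vector of n elements, the tail is moved once per delimiter, which gives roughly O(n·k) element moves; that is the quadratic behaviour described above when delimiters are frequent.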

A hay can be *borrowed* from a haystack.
```rust
pub trait Haystack: Deref<Target: Hay> + Sized {
```

@gereeter

gereeter Aug 2, 2018

I think that, due to its unsafe methods being called from safe code (e.g. in trim_start), Haystack needs to be an unsafe trait. Otherwise, without ever writing an unsafe block and therefore promising to uphold the invariants of safe code, an invalid implementation of Haystack could violate memory safety. Marking split_around and split_unchecked as unsafe captures the fact that the caller, to preserve memory safety, must pass in valid indices, but it does nothing to prevent the callee from doing arbitrary bad behaviour even if the indices are valid.

Hay probably also needs to be an unsafe trait. It looks like in practice, Searchers are implemented for specific Hay types, indicating trust of just those implementations, so it may not be strictly necessary. Additionally, one of the requirements of a valid Haystack implementation could be the validity of the associated Hay type. However, with the proposed impl<'a, H: Hay> Haystack for &'a H, this is impossible to promise, and I think it would be necessary for Hay to be an unsafe trait.

@alercah

alercah Aug 3, 2018

Contributor

I agree.

@kennytm

kennytm Aug 4, 2018

Member

Fixed, both are now unsafe.

A `SharedHaystack` is a marker sub-trait which tells the compiler this haystack can be
cheaply cloned (i.e. shared), e.g. a `&H` or `Rc<H>`. Implementing this trait alters some behavior
of the `Span` structure discussed in the next section.

@gereeter

gereeter Aug 2, 2018

I'm somewhat uncomfortable with the use of specialization to modify the behaviour of Span instead of just providing more optimized functions. Admittedly, this changed behaviour seems hard to directly observe, since into_parts is only available on shared spans. This definitely isn't a big deal.

with invalid ranges. Implementations of these methods often start with:
```rust
fn search(&mut self, span: SharedSpan<&A>) -> Option<Range<A::Index>> {
```

@gereeter

gereeter Aug 2, 2018

Is SharedSpan a relic of a previous version of this proposal? I don't see it defined anywhere and it sounds like Span<H> where H: SharedHaystack.

@kennytm

kennytm Aug 3, 2018

Member

Fixed. (In an ancient version there was SharedSpan<H> and UniqueSpan<H> where Haystack has an associated type Span to determine which span to use. The resulting code was quite ugly.)

let span = unsafe { Span::from_parts("CDEFG", 3..8) };
// we can find "CD" at the start of the span.
assert_eq!("CD".into_searcher().search(span.clone()), Some(3..5));
assert_eq!("CD".into_searcher().consume(span.clone()), Some(5));

@gereeter

gereeter Aug 2, 2018

Should this (and the other examples calling .into_searcher().consume(...)) be .into_consumer().consume(...)?

@kennytm

kennytm Aug 3, 2018

Member

Fixed. Copy-and-paste error.

let mut searcher = pattern.into_searcher();
let mut rest = Span::from(haystack);
while let Some(range) = searcher.search(rest.borrow()) {
let [left, _, right] = unsafe { rest.split_around(range) };

@gereeter

gereeter Aug 2, 2018

It seems very common to call split_around and then throw away one or more of the components. For owned containers like Vec, at least, this involves allocating a vector for the ignored elements, copying them to their new location, then finally dropping and deallocating. Would it be possible to add more methods to Haystack that only return some of the parts? They could have default definitions in terms of split_around, so they shouldn't cause any more difficulty for implementers, but owned containers would be able to override them for better performance.

It also occurs to me that slice_unchecked is actually one of these specialized methods, returning only the middle component.

@kennytm

kennytm Aug 3, 2018

Member

We'll need 3 more names for these 😝 ([left, middle, _], [left, _, right], [_, middle, right])
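
A sketch of the suggestion above: partial-split methods with default definitions in terms of `split_around`, which owned containers can override. Everything here is a simplified assumption: `MiniHaystack` is a safe stand-in with the index fixed to `usize`, and the method names `split_outer` and `middle` are placeholders, since the thread has not settled on names.

```rust
use std::ops::Range;

/// Simplified, safe stand-in for the RFC's `Haystack` trait.
pub trait MiniHaystack: Sized {
    /// Splits into the parts before, inside, and after `range`.
    fn split_around(self, range: Range<usize>) -> [Self; 3];

    /// Hypothetical extra method: keep only the two outer parts.
    fn split_outer(self, range: Range<usize>) -> (Self, Self) {
        let [left, _, right] = self.split_around(range);
        (left, right)
    }

    /// Middle part only, mirroring the proposed `slice_unchecked` default.
    fn middle(self, range: Range<usize>) -> Self {
        let [_, middle, _] = self.split_around(range);
        middle
    }
}

impl<'a> MiniHaystack for &'a str {
    fn split_around(self, range: Range<usize>) -> [Self; 3] {
        [&self[..range.start], &self[range.clone()], &self[range.end..]]
    }
}

impl<T> MiniHaystack for Vec<T> {
    fn split_around(mut self, range: Range<usize>) -> [Self; 3] {
        let right = self.split_off(range.end);
        let middle = self.split_off(range.start);
        [self, middle, right]
    }

    // Override: drop the middle elements in place instead of moving them
    // into a freshly allocated vector first.
    fn split_outer(mut self, range: Range<usize>) -> (Self, Self) {
        let right = self.split_off(range.end);
        self.truncate(range.start);
        (self, right)
    }
}
```

Borrowed haystacks get the defaults for free, while `Vec<T>` saves one allocation and one copy per call by overriding `split_outer`.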

pub trait Haystack: Deref<Target: Hay> + Sized {
fn empty() -> Self;
unsafe fn split_around(self, range: Range<Self::Target::Index>) -> [Self; 3];
unsafe fn slice_unchecked(self, range: Range<Self::Target::Index>) -> Self;

@gereeter

gereeter Aug 2, 2018

Could slice_unchecked have a default implementation as follows?

unsafe fn slice_unchecked(self, range: Range<Self::Target::Index>) -> Self {
    let [_, middle, _] = self.split_around(range);
    middle
}

@kennytm

kennytm Aug 3, 2018

Member

Good idea. Added.

* Implement `Hay` to `str`, `[T]` and `OsStr`.
* Implement `Haystack` to `∀H: Hay. &H`, `&mut str` and `&mut [T]`.

@gereeter

gereeter Aug 2, 2018

The pattern_3 crate also has an implementation for Vec<T> (though not String or OsString). Are those owned implementations intended eventually? Is that just out of scope for this particular RFC?

@kennytm

kennytm Aug 3, 2018

Member

I don't intend to add these into the standard library, due to the efficiency concern you've raised.

pattern_3 does implement it for Vec<T>, just to illustrate that it can transfer owned data types correctly.

@gereeter

gereeter commented Aug 2, 2018

I separated this comment out because it is far more questionable than the rest. I know that I personally tend to go overboard with squeezing out tiny and inconsequential bits of runtime performance at the expense of compile time and ergonomics. That said,

Merged Searcher and Consumer into a single trait, to reduce the number of types. The concept of "consumer" still exists where the Searcher impl can be an enum to choose between a search-optimized or consume-optimized structure. Microbenchmarks show that this runtime selection doesn't incur much slowdown.

This doesn't feel like the right trade-off to me. Reducing the number of types is definitely useful for implementers of patterns that don't have a special optimization for consume and may make comprehension easier for users of the API. However,

  • Implementations are probably going to be much more rare than uses. Complicating the implementation a small amount for the sake of performance seems like a good thing. Yes, the runtime selection doesn't incur much slowdown, but it seems wrong to force an unnecessary performance penalty, no matter how small.
  • I actually find the split types easier to read (as a user). It seems confusing that there are two functions that return identical types, but might (or might not, depending on the searcher) panic if I call the wrong function on their result. With the split types, the type system tells me to call search on a Searcher and consume on a Consumer and will yell at me if I get it wrong. If I only test with haystacks that have the same searcher and consumer, I might not notice the mistake, meaning I could have a less efficient implementation that also breaks when given certain types. This is a little far-fetched, I admit, since [T] has a different searcher and consumer, but that should be an implementation detail. It shouldn't be easy to run into internal points like that.

If getting the implementation right is an issue, there could just be a wrapper type along the lines of

pub struct SearcherConsumer<S> {
    inner: S
}

impl<H, S: Searcher<H>> Consumer<H> for SearcherConsumer<S> {
    // ...
}

This could then be the default type for Pattern::Consumer. If there isn't a particularly performant way to implement Consumer, then just using the default should be painless and correct.

@kennytm

Member

kennytm commented Aug 3, 2018

@gereeter

Implementations are probably going to be much more rare than uses.

I disagree with this. People seldom use Searcher/Consumer directly, unless they are implementing a new generic algorithm. The standard matches/split/starts_with/etc. methods already cover all the common cases you could handle with a generic pattern.

Yes, the runtime selection doesn't incur much slowdown, but it seems wrong to force an unnecessary performance penalty, no matter how small.

There is zero performance penalty in string matching caused by merging consumer and searcher, because we already have a runtime selection between empty and non-empty needle.

In trim_start and starts_with, I believe LLVM is able to recognize there's a loop invariant (not tested).

@Kimundi

Member

Kimundi commented Aug 8, 2018

@kennytm: Thank you, thank you, thank you! 🎉🎉🎉 This is the kind of end state that I attempted to reach with the various Pattern API sketches, but never could put enough effort into.

In general huge 👍 from me, but I had a few thoughts when reading through the RFC and comments right now.

  • Probably way past the point where it would be sensible, but your comparison of Pattern and Searcher to IntoIterator and Iterator made me think of whether it wouldn't have been best to call Pattern IntoSearcher from the start. Mainly because its name is horribly confusing with Rust's pattern matching support, especially given that pattern matching also supports "string patterns" and "slice patterns".
  • I also somewhat agree with @gereeter that the split of Searcher/Consumer while using the same Self::Searcher type seems somewhat confusing.
@kennytm

Member

kennytm commented Aug 11, 2018

Okay, so Searcher and Consumer are separate again then 🙃.

@Kimundi If Searcher and Consumer are two different traits, renaming Pattern to IntoSearcher would be misleading as it omits the Consumer part. Also, we cannot have a blanket impl unlike IntoIterator:

impl<H, S> Pattern<H> for S
where
    H: Haystack,
    S: Searcher<H::Target> + Consumer<H::Target>,
{
    type Searcher = Self;
    type Consumer = Self;
    fn into_searcher(self) -> Self { self }
    fn into_consumer(self) -> Self { self }
}

because due to backward compatibility we have a different blanket impl to support:

impl<'h, F> Pattern<&'h str> for F
where
    F: FnMut(char) -> bool,
{ ... }

and these two will conflict when a type implements (FnMut(char) -> bool) + Searcher<str> + Consumer<str>. (I've added this in the latest commit.)

Given these details I'm mildly against renaming Pattern.

@shepmaster

Member

shepmaster commented Aug 11, 2018

Experience report — consuming Pattern

My goal is to split a string into an iterator of delimiter / not delimiter values.

Background

Previous comment

I've implemented this code previously using the current Pattern API.

Thoughts

  • While my original case is focused on splitting strings with delimiters, I am interested in making the code as generic as possible. I would usually start by making the code using a concrete type (e.g. &str) and then making it more generic.

  • I don't understand what &str would be in the terms of the new API. Is it a Haystack? Hay? Should I actually be using str instead?

  • The relationships between Haystack, Hay, and Span are unclear.

  • Span is poorly introduced / motivated in the documentation.

  • I don't like the terms "Haystack" and "Hay" (or perhaps the things that they are applied to?). In my mind, I search a haystack, but the code actually searches inside of a Span.

  • I usually think of searching a haystack for a needle, but I haven't seen mention of a "needle". Perhaps that's a useful word to use, somewhere?

  • I seemingly cannot search a slice for a single value:

    ext::find(b"alpha", b'a'); // no
    ext::find(b"alpha", b"a"); // no
    ext::find(b"alpha", &b"a"[..]); // no
    ext::find(&b"alpha"[..], b'a'); // no
    ext::find(&b"alpha"[..], b"a"); // no
    ext::find(&b"alpha"[..], b"a"[..]); // no

    I think if this is merged, people will expect a lot of feature parity between slices and strings.

Working code

extern crate pattern_3;

use pattern_3::{Hay, Haystack, Pattern, Searcher, Span};
use std::ops::Deref;

#[derive(Copy, Clone, Debug, PartialEq)]
pub enum SplitType<T> {
    Piece(T),
    Delimiter(T),
}

pub struct SplitKeepingDelimiter<Thing, S>
where
    Thing: Haystack + Deref,
    Thing::Target: Hay,
{
    thing: Span<Thing>,
    searcher: S,
    saved_delimiter: Option<Thing>,
}

impl<Thing, S> Iterator for SplitKeepingDelimiter<Thing, S>
where
    Thing: Haystack + Deref,
    Thing::Target: Hay,
    S: Searcher<Thing::Target>,
{
    type Item = SplitType<Thing>;

    fn next(&mut self) -> Option<Self::Item> {
        if let Some(saved_delimiter) = self.saved_delimiter.take() {
            return Some(SplitType::Delimiter(saved_delimiter));
        }

        let thing = self.thing.take();
        // Search for the next occurrence of the delimiter
        match self.searcher.search(thing.borrow()) {
            Some(idx) => {
                // We found a delimiter
                let [l, m, r] = unsafe { thing.split_around(idx) };

                if l.is_empty() {
                    // The delimiter starts the remainder of the string
                    self.thing = r;
                    Some(SplitType::Delimiter(m.into()))
                } else {
                    // There's something before the delimiter
                    self.saved_delimiter = Some(m.into());
                    self.thing = r;
                    Some(SplitType::Piece(l.into()))
                }
            }
            None => {
                // There are no more delimiters
                if thing.is_empty() {
                    // And there's no more string to search
                    None
                } else {
                    // One last piece to return
                    Some(SplitType::Piece(thing.into()))
                }
            }
        }
    }
}

pub trait SplitKeepingDelimiterExt {
    fn split_keeping_delimiter<P>(self, pattern: P) -> SplitKeepingDelimiter<Self, P::Searcher>
    where
        Self: Haystack,
        Self::Target: Hay,
        P: Pattern<Self>;
}

impl<H> SplitKeepingDelimiterExt for H
where
    H: Haystack,
    H::Target: Hay,
{
    fn split_keeping_delimiter<P>(self, pattern: P) -> SplitKeepingDelimiter<Self, P::Searcher>
    where
        P: Pattern<Self>,
    {
        SplitKeepingDelimiter {
            thing: Span::from(self),
            searcher: pattern.into_searcher(),
            saved_delimiter: None,
        }
    }
}

#[cfg(test)]
mod test {
    use super::SplitKeepingDelimiterExt;

    #[test]
    fn split_with_delimiter() {
        use super::SplitType::*;
        let delims = &[',', ';'][..];
        let items: Vec<_> = "alpha,beta;gamma".split_keeping_delimiter(delims).collect();
        assert_eq!(
            &items,
            &[
                Piece("alpha"),
                Delimiter(","),
                Piece("beta"),
                Delimiter(";"),
                Piece("gamma")
            ]
        );
    }

    #[test]
    fn split_with_delimiter_allows_consecutive_delimiters() {
        use super::SplitType::*;
        let delims = &[',', ';'][..];
        let items: Vec<_> = ",;".split_keeping_delimiter(delims).collect();
        assert_eq!(&items, &[Delimiter(","), Delimiter(";")]);
    }

    #[test]
    fn split_with_delimiter_bytes() {
        use super::SplitType::*;

        let items: Vec<_> = b"comma,separated,data,".split_keeping_delimiter(|&c: &u8| c == b',').collect();
        assert_eq!(
            &items,
            &[
                Piece(&b"comma"[..]),
                Delimiter(b","),
                Piece(b"separated"),
                Delimiter(b","),
                Piece(b"data"),
                Delimiter(b","),
            ]
        );
    }
}

I'm not happy with the Thing generic type name; I really wanted to call it "haystack", but it's not a Haystack so...

Overall thoughts

I'm very optimistic about this API. I'm hoping that a diverse set of eyes on the code can help hammer out the naming as well as adding more comprehensive documentation.

@kennytm

Member

kennytm commented Aug 12, 2018

Thanks for the report @shepmaster !


I usually think of searching a haystack for a needle, but I haven't seen mention of a "needle". Perhaps that's a useful word to use, somewhere?

We could rename the trait Pattern to Needle. wdyt?

cc @Centril (1) and @Kimundi (2) who want to rename Pattern to something else

The name Haystack is fine because we get

fn contains<H, P>(haystack: H, needle: P) -> bool
where
    H: Haystack,
    P: Needle<H>;

so for those not directly working with Searcher we are indeed "searching for a needle in a haystack".

In my mind, I search a haystack, but the code actually searches inside of a Span.

maybe reading it as "search inside of a ____ of haystack" is better?


I seemingly cannot search a slice for a single value:

This is unfortunately impossible because it will conflict with:

impl<'h, T, F> Pattern<&'h [T]> for F
where 
    F: FnMut(&T) -> bool,

as we could impl FnMut(&Foo) -> bool for &Foo for a third-party type Foo (this is possible because & is #[fundamental]).

@Centril

Contributor

Centril commented Aug 12, 2018

We could rename the trait Pattern to Needle. wdyt?

My thinking is that anything is better than Pattern. ;)

Other than that I don't have any strong opinions (or any opinions at all).
I think Needle is fine. Have you considered any naming based on the word "predicate"?

@kennytm

Member

kennytm commented Aug 12, 2018

@Centril "Predicate" feels too generic and I'd expect Predicate<T> is exactly FnMut(&T) -> bool like C#.

@Centril

Contributor

Centril commented Aug 12, 2018

@kennytm I buy that :) Needle will have to do barring a better name.

@Kimundi

Member

Kimundi commented Aug 13, 2018

👍 for Needle - though then the question is whether we now refer to it as the "haystack API" or the "needle API" 😄

Re: Ability to search a single element: We could provide a newtype-like wrapper type:

ext::find(&b"alpha"[..], Needle(b'a'));

Would be kind of ugly, but at least be doable.
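
A sketch of how the wrapper idea could coexist with a closure blanket impl. It uses a deliberately reduced trait (`ElemNeedle`, one element at a time) rather than the real `Needle`/`Searcher` machinery, and the wrapper is called `Elem` here only to avoid clashing with the trait name; all of these names are hypothetical.

```rust
/// Simplified stand-in for a slice needle: reports whether one element matches.
pub trait ElemNeedle<T> {
    fn matches(&mut self, item: &T) -> bool;
}

/// Closure needles, mirroring the existing `FnMut(&T) -> bool` blanket impl.
impl<T, F: FnMut(&T) -> bool> ElemNeedle<T> for F {
    fn matches(&mut self, item: &T) -> bool {
        (*self)(item)
    }
}

/// Hypothetical newtype wrapper marking "search for exactly this element".
pub struct Elem<T>(pub T);

// No overlap with the blanket impl: `Elem<T>` is a local struct that does
// not implement `FnMut`, so the compiler can tell the two impls apart.
impl<T: PartialEq> ElemNeedle<T> for Elem<T> {
    fn matches(&mut self, item: &T) -> bool {
        self.0 == *item
    }
}

/// Returns the index of the first element accepted by `needle`.
pub fn find<T, N: ElemNeedle<T>>(haystack: &[T], mut needle: N) -> Option<usize> {
    haystack.iter().position(|item| needle.matches(item))
}
```

With this shape, `find(&b"alpha"[..], Elem(b'p'))` and `find(&b"alpha"[..], |c: &u8| *c == b'h')` both type-check. Implementing the trait directly for `&T` or `T` is what would collide with the closure impl, as explained earlier in the thread.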

Alternatively, we put the FnMut pattern behind a newtype wrapper, and live with the inconsistency to strings (though I'm not sure if this fixes the issue).

@kennytm

Member

kennytm commented Aug 14, 2018

"haystack API" or the "needle API"

This reminds me that the module name core::pattern may also need to be changed 😄

Alternatively, we put the FnMut pattern behind a newtype wrapper, and live with the inconsistency to strings (though I'm not sure if this fixes the issue).

This unfortunately will conflict with the stabilized APIs like <[T]>::split; for this to work we'll need to instead introduce .split_matches etc. (OTOH we can keep .contains and remove .contains_match.)

@kennytm kennytm changed the title from RFC: Pattern API to RFC: Needle API (née Pattern API) Aug 25, 2018

@kennytm

Member

kennytm commented Aug 25, 2018

Update

Renamed "Pattern" to "Needle". The concept is changed to "Needle API" and module to core::needle, because "Haystack" is more difficult to pronounce.

Added more docs (still not available on docs.rs since it is still using an ancient version of the compiler).

@scottmcm

Member

scottmcm commented Aug 26, 2018

All I can think of reading this is that std will now have Pins and Needles 😆

@Kimundi

Member

Kimundi commented Nov 4, 2018

We discussed this RFC in the last libs meeting, and decided to propose to merge it. The discussions and existing implementation so far show that this seems to be an improvement over the current unstable API, and a sensible step forward in any case.

Re-reading everything just now, there are a few things that we probably want to revisit at some point, like the naming and documentation concerns, and whether there might be some way to reduce the API complexity a bit, but that's what the stabilization period is for. 😃

@rfcbot fcp merge

@kennytm

Member

kennytm commented Nov 4, 2018

@Kimundi Thanks! Looks like rfcbot doesn't respond to edits though.

@Centril

Contributor

Centril commented Nov 4, 2018

@kennytm Indeed it does not. :)

@kennytm

Member

kennytm commented Nov 5, 2018

Could anyone in @rust-lang/libs execute @rfcbot fcp merge again? 😂

@alexcrichton

Member

alexcrichton commented Nov 5, 2018

@rfcbot fcp merge

@rfcbot

rfcbot commented Nov 5, 2018

Team member @alexcrichton has proposed to merge this. The next step is review by the rest of the tagged teams:

Concerns:

Once a majority of reviewers approve (and none object), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@Kimundi

Member

Kimundi commented Nov 6, 2018

Ah sorry, just assumed it takes a while to respond. 😄

@bluss

bluss commented Nov 12, 2018

This paper lists a few pragmatic (instead of algorithmically perfect), simple, and fast improvements on "naive" brute force search for the substring search problem where we only have an equality function. Its algorithm Quite-Naive is probably the best suited for being a reversible searcher.

https://www.dmi.unict.it/~faro/papers/conference/faro6.pdf

@SimonSapin

Contributor

SimonSapin commented Nov 14, 2018

@rfcbot concern blocked on disjointness

Stabilization of this RFC is blocked by RFC 1672 (disjointness based on associated types) which is postponed.

This seems somewhat serious. Are we very confident that some form of RFC 1672 will definitely be accepted in the future?

What exactly is blocked? Is the plan to implement this RFC and get generalized and additional methods on slices, but only keep the traits unstable?

@SimonSapin

Contributor

SimonSapin commented Nov 14, 2018

@rfcbot concern double-ended vs reverse

(Probably only need clarification in the RFC.)

The DoubleEndedSearcher and DoubleEndedConsumer are not documented or explained at all. How are they used? What does it mean to implement them or not? How do they differ from ReverseSearcher and ReverseConsumer respectively?

@SimonSapin

Contributor

SimonSapin commented Nov 14, 2018

@rfcbot concern yagni

We allow a hay to customize the Index type. While str, [T] and OsStr all use usize as the index, we do want the Needle API to support other linear structures like LinkedList<T>, where a cursor/pointer would be more suitable for allowing sub-linear splitting.

Is this really an important goal? Is there a concrete case where someone actually plans to do this?

I feel that this RFC already "spends" a very high amount of complexity budget and API surface in order to be very general and support many scenarios. Maybe this is an area where we can simplify it, and not sacrifice much in practice? (Then again maybe this simplification wouldn’t help a lot either.)

@SimonSapin

Contributor

SimonSapin commented Nov 14, 2018

To add a more positive note than just a series of concerns (which are all in the details), I really like this RFC overall! Thank you for going through all that design process and juggling all those sometimes-conflicting goals.

kennytm added some commits Nov 14, 2018

@kennytm

Member

kennytm commented Nov 14, 2018

@SimonSapin Thanks!

disjointness

Without #1672 some third-party types are not covered by blanket Needle impls due to conflicts. If we stabilize without #1672, those third-party types could impl Needle themselves, meaning the standard library cannot add the blanket impls later.

This is fine if we only focus on built-in needle types like &str and ignore third-party types like maybe QtStr. IMO #1672 is needed before stabilization because I don't like having obvious holes with a known solution 😉

double-ended vs reverse

Updated the RFC. These are explained in detail in the library docs:

Previously docs.rs didn't show them due to an outdated compiler, but that has just been fixed and they are now visible 🎉
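
For readers without the docs at hand, the distinction can be illustrated with the string methods already stable in std. In the existing `core::str::pattern` API, a reverse searcher can scan from the back, while a double-ended searcher additionally promises that forward and backward scans find the same matches. The sketch below assumes the RFC keeps that meaning; `forward` and `backward` are hypothetical helpers built on `match_indices` and `rmatch_indices`.

```rust
/// Start offsets of matches found scanning from the front.
pub fn forward(hay: &str, needle: &str) -> Vec<usize> {
    hay.match_indices(needle).map(|(i, _)| i).collect()
}

/// Start offsets of matches found scanning from the back.
pub fn backward(hay: &str, needle: &str) -> Vec<usize> {
    hay.rmatch_indices(needle).map(|(i, _)| i).collect()
}
```

Searching `"aaa"` for `"aa"` gives `[0]` from the front but `[1]` from the back, so a substring needle supports reverse search without being double-ended. A single-character needle, by contrast, yields the same matches in both directions, just in opposite order.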

I feel that this RFC already "spends" a very high amount of complexity budget and API surface in order to be very general and support many scenarios. Maybe this is an area where we can simplify it, and not sacrifice much in practice? (Then again maybe this simplification wouldn’t help a lot either.)

The simplification of ignoring LinkedList<T> would be removing the Hay::Index associated type, forcing it to be usize, making interfaces like https://docs.rs/pattern-3/0.5.0/pattern_3/haystack/trait.Haystack.html#tymethod.split_around probably easier to read. But there's no impact on the amount of traits or methods otherwise (except trivial ones like Hay::start_index).
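
A rough sketch of what the associated index type costs in signatures. `MiniHay` and `full_range` are simplified, hypothetical stand-ins for `Hay` and generic code written against it; only `start_index` and the `Index` associated type are taken from the discussion above.

```rust
use std::ops::Range;

/// Simplified stand-in for `Hay` with a customizable index type.
pub trait MiniHay {
    type Index: Copy + PartialEq;
    fn start_index(&self) -> Self::Index;
    fn end_index(&self) -> Self::Index;
}

impl MiniHay for str {
    type Index = usize;
    fn start_index(&self) -> usize { 0 }
    fn end_index(&self) -> usize { self.len() }
}

/// Generic code has to spell ranges as `Range<H::Index>` instead of
/// `Range<usize>`, and cannot assume that the first index is 0.
pub fn full_range<H: MiniHay + ?Sized>(hay: &H) -> Range<H::Index> {
    hay.start_index()..hay.end_index()
}
```

Fixing the index to `usize` would turn `Range<H::Index>` into `Range<usize>` and make `start_index` redundant, which is the readability gain described above; the number of traits stays the same.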
