RFC: Needle API (née Pattern API) #2500
Centril added the T-libs label Jul 15, 2018
Happy to see you got some use out of my link to https://crates.io/crates/galil-seiferas and cc @bluss who's the author of that crate.
kennytm referenced this pull request Jul 17, 2018: Tracking issue for RFC 2295, "Extend Pattern API to OsStr" #49802 (Open)
gilescope commented Jul 18, 2018
SharedHaystack seems like a general concept. Could we call it CheapClone or something like that? I'm sure there are lots of other places where we'd like to know whether a clone is expensive or not.
@gilescope Interesting, though if we generalize the concept it would raise the question of what is meant by "cheap", e.g. is the
Suppose we do introduce the
    pub trait ShallowClone: Clone {}
    impl<'a, T: ?Sized + 'a> ShallowClone for &'a T {}
    impl<T: ?Sized> ShallowClone for Rc<T> {}
    impl<T: ?Sized> ShallowClone for Arc<T> {}
the
    #[deprecated]
    trait SharedHaystack = Haystack + ShallowClone;
... as long as
Anyway
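To make the marker-trait idea above concrete, here is a minimal sketch, not the RFC's actual API. The `ShallowClone` impls are copied from the comment; the `share` helper is a hypothetical name added only for illustration. It shows that a generic function can require "cloning is just a pointer copy or a reference-count bump" through the trait bound alone.

```rust
use std::rc::Rc;
use std::sync::Arc;

/// Marker trait: cloning only copies a pointer or bumps a reference count.
pub trait ShallowClone: Clone {}

impl<'a, T: ?Sized + 'a> ShallowClone for &'a T {}
impl<T: ?Sized> ShallowClone for Rc<T> {}
impl<T: ?Sized> ShallowClone for Arc<T> {}

/// Hypothetical helper (not part of the RFC): hands out two handles to the
/// same underlying data. The bound documents that the clone is cheap.
pub fn share<H: ShallowClone>(handle: H) -> (H, H) {
    (handle.clone(), handle)
}
```

A `Vec<T>` or `String` would be rejected by `share`, since neither implements the marker; whether types such as small `Copy` values should count as "cheap" is exactly the open question raised in the comment.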
shepmaster reviewed Jul 19, 2018
### Consumer
A consumer provides the `.consume()` method to implement `starts_with()` and `trim_start()`. It
It's the most
Experience report — implementing
Thanks @shepmaster! I'll update the descriptions in the
Could you elaborate on what you mean by "subtle"?
Yes, this problem also exists for my
    unsafe impl<'p> Consumer<str> for RegexSearcher<'p> {
        fn consume(&mut self, span: Span<&str>) -> Option<usize> {
            let (hay, range) = span.into_parts();
            // Search from the start of the span; the match only counts
            // if it begins exactly at that position.
            let m = self.regex.find_at(hay, range.start)?;
            if m.start() == range.start {
                Some(m.end())
            } else {
                None
            }
        }
    }
I don't know if this could be improved performance-wise other than by asking the Searcher implementor to provide such a primitive too. (If we disregard the performance issue we could default impl a
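The performance concern can be illustrated with plain `&str` operations. The sketch below is only an analogy under stated assumptions: the function names `consume_via_search` and `consume_direct` are hypothetical, and `str::find` stands in for a general searcher such as a regex engine. Both functions return the same answer, but the search-based one may scan far past the starting position before discovering that the match is not anchored there.

```rust
/// Consume by running a full search and then checking where the match began.
/// If the needle does not start at `start`, `find` may still walk through the
/// rest of the haystack before the result is discarded.
fn consume_via_search(hay: &str, start: usize, needle: &str) -> Option<usize> {
    let offset = hay[start..].find(needle)?;
    if offset == 0 {
        Some(start + needle.len())
    } else {
        None
    }
}

/// Dedicated primitive: only inspects the bytes beginning at `start`.
fn consume_direct(hay: &str, start: usize, needle: &str) -> Option<usize> {
    if hay[start..].starts_with(needle) {
        Some(start + needle.len())
    } else {
        None
    }
}
```

For a literal needle the dedicated version is trivial to write; for something like a regex, an anchored-match primitive has to come from the underlying engine, which is the dependency the comment points out.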
Experience report — implementing
kennytm referenced this pull request Jul 23, 2018: Change single char str patterns to chars #52646 (Merged)
Update
Addressing #2500 (comment).
A general
An unsafe version can be done as
    let (hay, range) = span.into_parts();
    let hay = hay.as_inner();
    unsafe { Span::from_parts(hay, range) }
gereeter reviewed Aug 2, 2018
gereeter left a comment
This is a generally great RFC and I'm overall quite happy with the API and definitely happy with the demonstrated performance improvements. I would be interested in seeing more detail and benchmarks with regard to the behaviour with owned haystacks. From what I can tell, this makes
One (verbose) solution would be to introduce an intermediate data structure, a
A hay can be *borrowed* from a haystack.

```rust
pub trait Haystack: Deref<Target: Hay> + Sized {
```
gereeter Aug 2, 2018
I think that, due to its unsafe methods being called from safe code (e.g. in trim_start), Haystack needs to be an unsafe trait. Otherwise, without ever writing an unsafe block and therefore promising to uphold the invariants of safe code, an invalid implementation of Haystack could violate memory safety. The fact that split_around and split_unchecked are unsafe captures the fact that the caller, to preserve memory safety, must pass in valid indices, but it does nothing to prevent the callee from doing arbitrary bad behaviour even if the indices are valid.
Hay probably also needs to be an unsafe trait. It looks like in practice, Searchers are implemented for specific Hay types, indicating trust of just those implementations, and , so it may not be strictly necessary. Additionally, one of the requirements of a valid Haystack implementation could be the validity of the associated Hay type. However, with the proposed impl<'a, H: Hay> Haystack for &'a H, this is impossible to promise, and I think it would be necessary for Hay to be an unsafe trait.
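The argument for `unsafe trait` can be sketched outside the RFC's types. In the example below every name (`PrefixSource`, `prefix`, `FirstHalf`) is hypothetical and chosen only for illustration. The point is that the safe function `prefix` performs an unchecked slice based on what the implementor reports, so the implementor, not the caller, carries the proof obligation, and `unsafe impl` is how that obligation is made visible.

```rust
/// Implementors promise that `prefix_len()` never exceeds `bytes().len()`.
/// The trait is `unsafe` because the safe function below relies on that promise.
pub unsafe trait PrefixSource {
    fn bytes(&self) -> &[u8];
    fn prefix_len(&self) -> usize;
}

/// Safe to call: the caller writes no `unsafe`, so soundness rests entirely
/// on the contract accepted by whoever wrote the `unsafe impl`.
pub fn prefix<S: PrefixSource>(source: &S) -> &[u8] {
    let len = source.prefix_len();
    // SAFETY: `len <= bytes().len()` is guaranteed by the `PrefixSource` contract.
    unsafe { source.bytes().get_unchecked(..len) }
}

/// Example implementor that exposes the first half of a byte slice.
pub struct FirstHalf<'a>(pub &'a [u8]);

// SAFETY: `len / 2 <= len` holds for every slice length.
unsafe impl<'a> PrefixSource for FirstHalf<'a> {
    fn bytes(&self) -> &[u8] {
        self.0
    }
    fn prefix_len(&self) -> usize {
        self.0.len() / 2
    }
}
```

If `PrefixSource` were a safe trait, an implementation returning an oversized `prefix_len()` would cause undefined behaviour in `prefix` without anyone having written the word `unsafe` on the implementing side, which is the situation the comment describes for `Haystack`.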
A `SharedHaystack` is a marker sub-trait which tells the compiler this haystack can be
cheaply cloned (i.e. shared), e.g. a `&H` or `Rc<H>`. Implementing this trait alters some behavior
of the `Span` structure discussed in the next section.
gereeter Aug 2, 2018
I'm somewhat uncomfortable with the use of specialization to modify the behaviour of Span instead of just providing more optimized functions. Admittedly, this changed behaviour seems hard to directly observe, since into_parts is only available on shared spans. This definitely isn't a big deal.
with invalid ranges. Implementations of these methods often start with:

```rust
fn search(&mut self, span: SharedSpan<&A>) -> Option<Range<A::Index>> {
```
gereeter Aug 2, 2018
Is SharedSpan a relic of a previous version of this proposal? I don't see it defined anywhere and it sounds like Span<H> where H: SharedHaystack.
kennytm Aug 3, 2018 (Author, Member)
Fixed. (In an ancient version there was SharedSpan<H> and UniqueSpan<H> where Haystack has an associated type Span to determine which span to use. The resulting code was quite ugly.)
    let span = unsafe { Span::from_parts("CDEFG", 3..8) };
    // we can find "CD" at the start of the span.
    assert_eq!("CD".into_searcher().search(span.clone()), Some(3..5));
    assert_eq!("CD".into_searcher().consume(span.clone()), Some(5));
gereeter Aug 2, 2018
Should this (and the other examples calling .into_searcher().consume(...)) be .into_consumer().consume(...)?
    let mut searcher = pattern.into_searcher();
    let mut rest = Span::from(haystack);
    while let Some(range) = searcher.search(rest.borrow()) {
        let [left, _, right] = unsafe { rest.split_around(range) };
gereeter Aug 2, 2018
It seems very common to call split_around and then throw away one or more of the components. For owned containers like Vec, at least, this involves allocating a vector for the ignored elements, copying them to their new location, then finally dropping and deallocating. Would it be possible to add more methods to Haystack that only return some of the parts? They could have default definitions in terms of split_around, so they shouldn't cause any more difficulty for implementers, but owned containers would be able to override them for better performance.
It also occurs to me that slice_unchecked is actually one of these specialized methods, returning only the middle component.
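The suggestion above can be sketched with a simplified trait. This is an illustration under stated assumptions, not the RFC's `Haystack`: the trait `SplitHaystack` and the method `split_left_right` are hypothetical names, indices are plain `usize`, and the methods are safe (they panic on invalid ranges) to keep the example short. The default method is defined through `split_around`, and the owned `Vec<u8>` implementation overrides it to avoid allocating a vector for the discarded middle part.

```rust
use std::ops::Range;

/// Simplified stand-in for a haystack that can be split around a range.
trait SplitHaystack: Sized {
    fn split_around(self, range: Range<usize>) -> [Self; 3];

    /// Hypothetical partial split: defaults to `split_around` and drops the middle.
    fn split_left_right(self, range: Range<usize>) -> (Self, Self) {
        let [left, _, right] = self.split_around(range);
        (left, right)
    }
}

/// Borrowed slices split for free, so the default is already optimal.
impl<'a> SplitHaystack for &'a [u8] {
    fn split_around(self, range: Range<usize>) -> [Self; 3] {
        [&self[..range.start], &self[range.clone()], &self[range.end..]]
    }
}

/// Owned vectors must allocate for each piece that is split off.
impl SplitHaystack for Vec<u8> {
    fn split_around(mut self, range: Range<usize>) -> [Self; 3] {
        let right = self.split_off(range.end);
        let middle = self.split_off(range.start);
        [self, middle, right]
    }

    /// Override: the middle is truncated in place instead of being moved
    /// into a freshly allocated vector that is dropped immediately.
    fn split_left_right(mut self, range: Range<usize>) -> (Self, Self) {
        let right = self.split_off(range.end);
        self.truncate(range.start);
        (self, right)
    }
}
```

Implementers that do not care can ignore the extra method, while owned containers recover one allocation and one copy per split.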
kennytm Aug 3, 2018 (Author, Member)
We'll need 3 more names for these ([left, middle, _], [left, _, right], [_, middle, right])
    pub trait Haystack: Deref<Target: Hay> + Sized {
        fn empty() -> Self;
        unsafe fn split_around(self, range: Range<Self::Target::Index>) -> [Self; 3];
        unsafe fn slice_unchecked(self, range: Range<Self::Target::Index>) -> Self;
gereeter Aug 2, 2018
Could slice_unchecked have a default implementation as follows?
    unsafe fn slice_unchecked(self, range: Range<Self::Target::Index>) -> Self {
        let [_, middle, _] = self.split_around(range);
        middle
    }
* Implement `Hay` to `str`, `[T]` and `OsStr`.
* Implement `Haystack` to `∀H: Hay. &H`, `&mut str` and `&mut [T]`.
gereeter Aug 2, 2018
The pattern_3 crate also has an implementation for Vec<T> (though not String or OsString). Are those owned implementations intended eventually? Is that just out of scope for this particular RFC?
kennytm Aug 3, 2018 (Author, Member)
I don't intend to add these into the standard library, due to the efficiency concern you've raised.
pattern_3 does implement it for Vec<T>, just to illustrate that it can transfer owned data types correctly.
gereeter commented Aug 2, 2018
I separated this comment out because it is far more questionable than the rest. I know that I personally tend to go overboard with squeezing out tiny and inconsequential bits of runtime performance at the expense of compile time and ergonomics. That said,
This doesn't feel like the right trade-off to me. Reducing the number of types is definitely useful for implementers of patterns that don't have a special optimization for
If getting the implementation right is an issue, there could just be a wrapper type along the lines of
    pub struct SearcherConsumer<S> {
        inner: S,
    }
    impl<H, S: Searcher<H>> Consumer<H> for SearcherConsumer<S> {
        // ...
    }
This could then be the default type for
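One way such a wrapper could look is sketched below. This is an assumption-laden toy, not the RFC's API: the `Searcher` and `Consumer` traits here are simplified to work on `&str` with `usize` ranges, and `Literal` is a hypothetical searcher added so the adapter has something to wrap. Only the `SearcherConsumer` name and its `inner` field come from the comment.

```rust
use std::ops::Range;

/// Toy searcher: finds the next match inside `hay[range]`, in absolute indices.
trait Searcher {
    fn search(&mut self, hay: &str, range: Range<usize>) -> Option<Range<usize>>;
}

/// Toy consumer: succeeds only if a match begins exactly at `range.start`.
trait Consumer {
    fn consume(&mut self, hay: &str, range: Range<usize>) -> Option<usize>;
}

/// Adapter that derives a consumer from any searcher.
struct SearcherConsumer<S> {
    inner: S,
}

impl<S: Searcher> Consumer for SearcherConsumer<S> {
    fn consume(&mut self, hay: &str, range: Range<usize>) -> Option<usize> {
        let start = range.start;
        let found = self.inner.search(hay, range)?;
        // Accept the match only when it is anchored at the start of the span.
        if found.start == start {
            Some(found.end)
        } else {
            None
        }
    }
}

/// Hypothetical literal-substring searcher used to exercise the adapter.
struct Literal<'p>(&'p str);

impl<'p> Searcher for Literal<'p> {
    fn search(&mut self, hay: &str, range: Range<usize>) -> Option<Range<usize>> {
        let offset = hay[range.clone()].find(self.0)?;
        let start = range.start + offset;
        Some(start..start + self.0.len())
    }
}
```

The adapter gets the semantics right for free, at the cost of running a full search; a pattern with a cheaper anchored check would supply its own consumer type instead of this default.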
This I disagree. People seldom use
There is zero performance penalty in string matching caused by merging consumer and searcher, because we already have a runtime selection between empty and non-empty needle. In
@kennytm: Thank you, thank you, thank you! In general huge
Okay, so
@Kimundi
If
    impl<H, S> Pattern<H> for S
    where
        H: Haystack,
        S: Searcher<H::Target> + Consumer<H::Target>,
    {
        type Searcher = Self;
        type Consumer = Self;
        fn into_searcher(self) -> Self { self }
        fn into_consumer(self) -> Self { self }
    }
because due to backward compatibility we have a different blanket impl to support:
    impl<'h, F> Pattern<&'h str> for F
    where
        F: FnMut(char) -> bool,
    { ... }
and these two will conflict when a type implements
Given these details I'm mildly against renaming
Experience report — consuming
Thanks for the report @shepmaster!
We could rename the trait
cc @Centril (1) and @Kimundi (2) who want to rename
The name
    fn contains<H, P>(haystack: H, needle: P) -> bool
    where
        H: Haystack,
        P: Needle<H>;
so for those not directly working with Searcher we are indeed "searching for a needle in a haystack".
Maybe reading it as "search inside of a ____ of haystack" is better?
This is unfortunately impossible because it will conflict with:
    impl<'h, T, F> Pattern<&'h [T]> for F
    where
        F: FnMut(&T) -> bool,
as we could
My thinking is that anything is better than
Other than that I don't have any strong opinions (or any opinions at all).
@kennytm I buy that :)
Re: Ability to search a single element: We could provide a newtype-like wrapper type:
    ext::find(&b"alpha"[..], Needle(b'a'));
It would be kind of ugly, but at least it would be doable. Alternatively, we put the
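A rough sketch of how such a wrapper sidesteps the blanket-impl conflict follows. Everything here is a simplified assumption: `SliceNeedle` is a toy trait standing in for the needle trait, `Elem` is a hypothetical wrapper name (the comment calls it `Needle`), and `find` is a stand-in for `ext::find`. Because `Elem<T>` is a local type that is not a closure, its impl can coexist with the closure blanket impl, whereas implementing the trait directly for every `T: PartialEq` would overlap with it.

```rust
/// Toy stand-in for the needle trait, specialised to slices of `T`.
pub trait SliceNeedle<T> {
    fn matches(&mut self, item: &T) -> bool;
}

/// Existing-style blanket impl: any predicate closure is a needle.
impl<T, F: FnMut(&T) -> bool> SliceNeedle<T> for F {
    fn matches(&mut self, item: &T) -> bool {
        (*self)(item)
    }
}

/// Hypothetical wrapper meaning "search for exactly this element".
pub struct Elem<T>(pub T);

impl<T: PartialEq> SliceNeedle<T> for Elem<T> {
    fn matches(&mut self, item: &T) -> bool {
        self.0 == *item
    }
}

/// Returns the index of the first item accepted by the needle.
pub fn find<T, N: SliceNeedle<T>>(haystack: &[T], mut needle: N) -> Option<usize> {
    haystack.iter().position(|item| needle.matches(item))
}
```

Call sites then read `find(&b"alpha"[..], Elem(b'a'))` for an element and `find(&b"alpha"[..], |b: &u8| *b == b'p')` for a predicate, which matches the "ugly but doable" trade-off described above.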
This reminds me that the module name
This unfortunately will conflict with the stabilized APIs like
bluss commented Nov 12, 2018
This paper lists a few pragmatic (instead of algorithmically perfect), simple, and fast improvements on "naive" brute force search for the substring search problem where we only have an equality function. Its algorithm Quite-Naive is probably the best suited for being a reversible searcher.
@rfcbot concern blocked on disjointness
This seems somewhat serious. Are we very confident that some form of RFC 1672 will definitely be accepted in the future? What exactly is blocked? Is the plan to implement this RFC and get generalized and additional methods on slices, but only keep the traits unstable?
@rfcbot concern double-ended vs reverse
(Probably only needs clarification in the RFC.)
The
@rfcbot concern yagni
Is this really an important goal? Is there a concrete case where someone actually plans to do this? I feel that this RFC already "spends" a very high amount of complexity budget and API surface in order to be very general and support many scenarios. Maybe this is an area where we can simplify it, and not sacrifice much in practice? (Then again, maybe this simplification wouldn't help a lot either.)
To add a more positive note than just a series of concerns (which are all in the details), I really like this RFC overall! Thank you for going through all that design process and juggling all those sometimes-conflicting goals.
kennytm added some commits Nov 14, 2018
@SimonSapin Thanks!
Without #1672 some third-party types are not covered by blanket
This is fine if we only focus on built-in needle types like
Updated the RFC. These are explained in detail in the library docs:
Previously docs.rs didn't show them due to an outdated compiler, but this has just been fixed and they are now visible.
The simplification of ignoring
Has the
It might be interesting to experiment with designing the entire API on generative typing, but without ATC it can be trickier, and there might not be a solution to containers where not every index is valid (e.g.
@eddyb I tried it before (as something like
I also don't think with today's Rust one could safely use
@kennytm I've prototyped an existential wrapper solution for the
AFAIK it's sound when used with
I'll have to look into making the construction limited to calling
@rfcbot resolve double-ended vs reverse
@rfcbot resolve yagni
I still feel that this is an unprecedented amount of complexity for a standard library feature. However, stabilizing some set of traits to "explain" the existing behavior of
rfcbot added the final-comment-period label Nov 19, 2018
rfcbot commented Nov 19, 2018
rfcbot removed the proposed-final-comment-period label Nov 19, 2018
Minor nitpick, but I think a case could be made that tasks and futures are similarly complex.
Since
Centril added A-needle, A-traits-libstd, A-types-libstd labels Nov 22, 2018
rfcbot added the finished-final-comment-period label Nov 29, 2018
rfcbot commented Nov 29, 2018
The final comment period, with a disposition to merge, as per the review above, is now complete.
rfcbot removed the final-comment-period label Nov 29, 2018
Centril referenced this pull request Nov 29, 2018: Tracking issue for RFC 2500, "Needle API (née Pattern API)" #56345 (Open)
Centril merged commit ef572c3 into rust-lang:master Nov 29, 2018
Tracking issue: rust-lang/rust#56345
kennytm commented Jul 14, 2018 (edited by Centril)