Tracking issue for string patterns #27721
String Patterns RFC tracking issue: #22477 Stabilization not only impacts implementing the pattern traits, but also of course detailed use of the Searcher trait, |
I'm having trouble seeing the purpose of the |
@jneem the StrSearcher uses the same code for |
@bluss I think that helps make my point: AFAICT, all the users of Searcher only call |
Oops, I missed one: Pattern uses |
This is useful, I'd like to use it on stable. |
cc @BurntSushi, Regex is the major pattern user and it would be great for everyone if it was stable because of that. |
@BurntSushi Are the Pattern traits sufficient for Regex? Any design issues? |
I wrote the impl for One thing that I don't think is representable with the |
Nominating for discussion. Not sure whether this is ready for stabilization, but I'd like for the team to dig into it a bit. |
Unfortunately the libs team didn't get a chance to talk about this in terms of stabilization for 1.8, but I'm going to leave the nominated tag as I think we should chat about this regardless. |
Today I investigated a bit what would need to be done to generalize the Pattern API to arbitrary slice types. As part of that I also took a more pragmatic route to the involved types and interfaces based on these assumptions:
A very rough sketch of the result can be seen below. Note that this covers only the Pattern traits themselves, without actual integration into the std lib or a comprehensive reimplementation of the existing types. https://github.com/Kimundi/pattern_api_sketch/blob/master/src/v5.rs The core changes are:
Changes in regard to the open questions:
|
Is the focus of Pattern for slices going to be specific for |
@bluss: I only used The |
I'm currently trying out some runtime scanners for

let_scan!("before:after", (let word_0 <| until(":"), ":", let word_1: Everything));
assert_eq!(word_0, "before");
assert_eq!(word_1, "after");

The problem I'm having is that it's really painful to actually do this. The interface as it exists seems to assume that ownership of the pattern can be passed to the method which will use it, and as a result, can only be used exactly once. This doesn't make much sense to me. Searching for a pattern should not (in general) consume the pattern. What I want is The more I think about this, the more it comes down to the I can work around this by requiring |
Oh, another thing I just realised: there doesn't appear to be any way to find out how long a match is given a pattern and, say, |
You can use .next_reject for that. |
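On stable Rust, where `Searcher` cannot be called directly, one way to recover the length of each match is `str::match_indices`, which yields the matched slice alongside its start index. The following is only a sketch; the helper name `match_spans` is made up for illustration and is not part of std.

```rust
/// Byte ranges `(start, end)` of every match of a char predicate in `haystack`.
/// `match_spans` is a hypothetical helper, not a std API.
fn match_spans(haystack: &str, pred: impl FnMut(char) -> bool) -> Vec<(usize, usize)> {
    haystack
        // `match_indices` yields `(start, matched_slice)` pairs.
        .match_indices(pred)
        // The match length is the byte length of the matched slice.
        .map(|(start, matched)| (start, start + matched.len()))
        .collect()
}
```

For a multi-byte character the span is wider than one byte, which is the information a bare `find` does not give you.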
Hey, I don't know a lot about this API, but I noticed that
|
@withoutboats A slice of chars represents a set of possibilities, so it's not like a string; either of the chars can be matched by themselves. |
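A small sketch of the difference, using only stable `str::split`: a `&[char]` pattern matches any one of its characters, while a `&str` pattern matches the whole sequence. The function names are illustrative.

```rust
/// Splits on any single char from `separators` (set semantics).
fn split_on_any<'a>(text: &'a str, separators: &[char]) -> Vec<&'a str> {
    text.split(separators).collect()
}

/// Splits only where the whole `separator` string occurs (sequence semantics).
fn split_on_sequence<'a>(text: &'a str, separator: &str) -> Vec<&'a str> {
    text.split(separator).collect()
}
```

With the input `"a,b;c"`, the char set `[',', ';']` splits at both separators, whereas the string `",;"` never occurs as a sequence and so splits nothing.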
@bluss That makes sense. The semantics of |
Can we say that, after returning |
Implement 1581 (FusedIterator) * [ ] Implement on patterns. See #27721 (comment). * [ ] Handle OS Iterators. A bunch of iterators (`Args`, `Env`, etc.) in libstd wrap platform specific iterators. The current ones all appear to be well-behaved but can we assume that future ones will be? * [ ] Does someone want to audit this? On first glance, all of the iterators on which I implemented `FusedIterator` appear to be well-behaved but there are a *lot* of them so a second pair of eyes would be nice. * I haven't touched rustc internal iterators (or the internal rand) because rustc doesn't actually call `fuse()`. * `FusedIterator` can't be implemented on `std::io::{Bytes, Chars}`. Closes: #35602 (Tracking Issue) Implements: rust-lang/rfcs#1581
I'm a big fan of extending this to be more generic than just str, needed/wanted it for [u8] quite frequently. Unfortunately, there's an API inconsistency already - str::split takes a Pattern, whereas slice::split takes a predicate function that only looks at a single T in isolation. |
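To make the inconsistency concrete, here is a hedged sketch. `str::split` accepts a multi-character pattern, while `slice::split` only sees one element at a time, so splitting a slice on a sub-slice has to be written by hand. `split_on_subslice` is a hypothetical helper with a naive scan, not a proposal for the std API.

```rust
/// `str::split` accepts any pattern, including a multi-char string.
fn split_str(text: &str) -> Vec<&str> {
    text.split(", ").collect()
}

/// `slice::split` takes a predicate over a single element.
fn split_bytes(bytes: &[u8]) -> Vec<&[u8]> {
    bytes.split(|b| *b == b',').collect()
}

/// Hypothetical helper: split a slice on a multi-element needle (naive scan).
fn split_on_subslice<'a, T: PartialEq>(haystack: &'a [T], needle: &[T]) -> Vec<&'a [T]> {
    assert!(!needle.is_empty(), "needle must be non-empty");
    let mut parts = Vec::new();
    let mut start = 0; // start of the piece currently being collected
    let mut i = 0; // current scan position
    while i + needle.len() <= haystack.len() {
        if &haystack[i..i + needle.len()] == needle {
            parts.push(&haystack[start..i]);
            i += needle.len();
            start = i;
        } else {
            i += 1;
        }
    }
    // Whatever follows the last match is the final piece.
    parts.push(&haystack[start..]);
    parts
}
```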
FWIW, I would probably be opposed to implementing the "or" variant in std on strings. The simple implementation of this is a performance footgun as soon as it gets beyond a few patterns. We would invariably find ourselves recommending people "not use it" unless the inputs are small. The alternative, to make it fast, is a lot of work that probably shouldn't live inside std: https://github.com/BurntSushi/aho-corasick/blob/master/DESIGN.md |
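For illustration, the "simple implementation" might look roughly like the sketch below. It runs one independent search per needle, so the work grows with the number of needles times the haystack length. `find_first_of` is a made-up name, and this is not how the aho-corasick crate works.

```rust
/// Naive multi-needle search: returns the byte range of the earliest match
/// of any needle. Every needle triggers its own search over the haystack,
/// which is where the cost comes from.
fn find_first_of(haystack: &str, needles: &[&str]) -> Option<(usize, usize)> {
    needles
        .iter()
        .filter_map(|n| haystack.find(*n).map(|start| (start, start + n.len())))
        // Earliest start wins; ties go to the shorter match.
        .min()
}
```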
This might be obvious for some of you, but it has not been mentioned yet: Before stabilizing the API (seems like this might take some time) one should consider waiting for #44265:

use core::str::pattern::{ReverseSearcher, SearchStep, Searcher};
pub trait Pattern {
type Searcher<'a>: Searcher<'a>;
// lifetime can be elided:
fn into_searcher(self, haystack: &str) -> Self::Searcher<'_>;
//
fn is_contained_in(self, haystack: &str) -> bool;
fn is_prefix_of(self, haystack: &str) -> bool;
fn is_suffix_of<'a>(self, haystack: &'a str) -> bool
where
Self::Searcher<'a>: ReverseSearcher<'a>;
fn strip_prefix_of(self, haystack: &str) -> Option<&'_ str>;
fn strip_suffix_of<'a>(self, haystack: &'a str) -> Option<&'a str>
where
Self::Searcher<'a>: ReverseSearcher<'a>;
}

I think using an associated lifetime for the I cannot really think of any practical benefits of this approach, except for lifetime elision and patterns without lifetimes:

pub struct SomePattern;
impl Pattern for SomePattern {
type Searcher<'a> = SomeSearcher<'a>;
fn into_searcher(self, haystack: &str) -> Self::Searcher<'_> {
SomeSearcher(haystack)
}
}
pub struct SomeSearcher<'a>(&'a str);
unsafe impl<'a> Searcher<'a> for SomeSearcher<'a> {
fn haystack(&self) -> &'a str {
self.0
}
fn next(&mut self) -> SearchStep {
unimplemented!()
}
} |
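Since `core::str::pattern` is unstable, the GAT-based shape above can still be tried on stable Rust by declaring stand-in traits locally. The sketch below does that. All names (`SearchStep`, `Searcher`, `Pattern`, `AsciiDigit`, `DigitSearcher`, `find`) are local stand-ins rather than the std items, and the local `Searcher` is a safe trait for simplicity, unlike the `unsafe` trait in std.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum SearchStep {
    Match(usize, usize),
    Reject(usize, usize),
    Done,
}

// Local stand-in for the std trait (which is `unsafe` in std).
pub trait Searcher<'a> {
    fn haystack(&self) -> &'a str;
    fn next(&mut self) -> SearchStep;
}

// The pattern itself carries no lifetime; the searcher type is a GAT.
pub trait Pattern {
    type Searcher<'a>: Searcher<'a>;
    fn into_searcher(self, haystack: &str) -> Self::Searcher<'_>;
}

/// Example pattern: matches any single ASCII digit.
pub struct AsciiDigit;

pub struct DigitSearcher<'a> {
    haystack: &'a str,
    pos: usize, // always on a char boundary
}

impl Pattern for AsciiDigit {
    type Searcher<'a> = DigitSearcher<'a>;
    fn into_searcher(self, haystack: &str) -> Self::Searcher<'_> {
        DigitSearcher { haystack, pos: 0 }
    }
}

impl<'a> Searcher<'a> for DigitSearcher<'a> {
    fn haystack(&self) -> &'a str {
        self.haystack
    }
    fn next(&mut self) -> SearchStep {
        match self.haystack[self.pos..].chars().next() {
            None => SearchStep::Done,
            Some(c) => {
                let start = self.pos;
                self.pos += c.len_utf8();
                if c.is_ascii_digit() {
                    SearchStep::Match(start, self.pos)
                } else {
                    SearchStep::Reject(start, self.pos)
                }
            }
        }
    }
}

/// Generic consumer: no lifetime parameter is needed on the `P: Pattern` bound.
pub fn find<P: Pattern>(haystack: &str, pattern: P) -> Option<usize> {
    let mut searcher = pattern.into_searcher(haystack);
    loop {
        match searcher.next() {
            SearchStep::Match(start, _) => return Some(start),
            SearchStep::Reject(..) => continue,
            SearchStep::Done => return None,
        }
    }
}
```

The point of interest is the signature of `find`: the bound is plain `P: Pattern`, with the haystack lifetime appearing only on the searcher.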
|
@TonalidadeHidrica I wouldn't rely on that myself. To me that's not an idiomatic use case; I'd expect the function to be pure. |
Why has this trait been implemented for From the docs of
|
This seems like something that should be documented in the docs of the |
I agree that this should be documented. I guess why this function is |
Would it be better to ban the mutation, and if it is needed, use the internal mutation ( |
Not possible due to backward compatibility. |
what is the holdup for making this stable? |
I'm a little confused by the API; the API docs for

This trait provides methods for searching for non-overlapping matches of a pattern starting from the front (left) of a string.

I see the following possible interpretations of this, and I want to be sure which is in use to prevent any ambiguity in implementations.

Greedy approach

If you're looking for the string
Match(0,2)
Match(2,4)
Reject(4,5)
Done

Starts at the start of the string, but skips some letters because why not?

Greedy is overrated. Let's skip the first letter and match on the rest!
Reject(0,1)
Match(1,3)
Match(3,5)
Done

Getting the most matches is so overrated, how about we skip some?

The API definition doesn't require that the maximal number of matches be returned, so we could just ignore some matching sub-strings.
Reject(0,1)
Reject(1,2)
Match(2,4)
Reject(4,5)
Done

Suggestions for documentation improvements.

I'd like to suggest that matching is always greedy and always maximal. Roughly the following pseudo-code (don't use this in production, it will overflow your stack, and finite state machines are faster anyways):

// `results` is empty when this function is first called.
// `index` is 0 when first called.
fn string_matcher(pattern: &str, haystack: &str, results: &mut Vec<SearchStep>, index: usize) {
if haystack.len() < pattern.len() {
if haystack.len() > 0 {
results.push(SearchStep::Reject(index, index + haystack.len()));
}
results.push(SearchStep::Done);
} else if pattern == &haystack[0..pattern.len()] {
results.push(SearchStep::Match(index, index + pattern.len()));
string_matcher(
pattern,
&haystack[pattern.len()..],
results,
index + pattern.len(),
);
} else {
results.push(SearchStep::Reject(index, index + 1));
string_matcher(pattern, &haystack[1..], results, index + 1);
}
} |
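The recursive pseudo-code above can also be written as a loop, which avoids the stack growth. The sketch below is one possible reading of the "leftmost, then continue after the match" rule, not the std implementation. It uses a local `Step` enum instead of the unstable `SearchStep`, rejects one whole character at a time (so the tail may be reported as several `Reject`s rather than one), and leaves empty-pattern behaviour undefined.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Step {
    Match(usize, usize),
    Reject(usize, usize),
    Done,
}

/// Leftmost, non-overlapping matching: at each position either report a
/// match and jump past it, or reject one char and move on.
fn leftmost_steps(pattern: &str, haystack: &str) -> Vec<Step> {
    assert!(!pattern.is_empty(), "this sketch does not define empty-pattern behaviour");
    let mut steps = Vec::new();
    let mut index = 0; // always on a char boundary
    while index < haystack.len() {
        if haystack[index..].starts_with(pattern) {
            steps.push(Step::Match(index, index + pattern.len()));
            index += pattern.len();
        } else {
            // Advance by one whole char so slicing never splits a code point.
            let width = haystack[index..].chars().next().map_or(1, char::len_utf8);
            steps.push(Step::Reject(index, index + width));
            index += width;
        }
    }
    steps.push(Step::Done);
    steps
}
```

Assuming a haystack of `"aaaaa"` and a pattern of `"aa"` (my assumption; the original inputs are not shown above), this yields the step sequence listed under the first interpretation.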
Also, what about when you want overlapping matches? I can see cases where I would want all overlapping matches in addition to what the Searcher API currently provides. |
The docs could certainly be improved. I'm not sure if "greedy" or "maximal" are the right words. Overlapping matches is a bit of a niche case and I don't think there is a compelling reason for the standard library to support them. Overlapping searches are available in the |
@BurntSushi the main reason for the overlapping case is because then you can say that the searcher needs to return all matches, even the overlapping ones. The user is then responsible for deciding which overlapping case is the interesting one(s). If the searcher implements the Iterator trait, then you can use filtering to get the parts you want. |
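As a rough illustration of that idea outside std, overlapping matches can be produced by a plain iterator on stable Rust and then filtered by the caller. Both function names are hypothetical, and the scan is deliberately naive (it re-checks the needle at every character position).

```rust
/// Yields the byte range of every occurrence of `needle`, including
/// occurrences that overlap each other.
fn overlapping_matches<'h>(
    haystack: &'h str,
    needle: &'h str,
) -> impl Iterator<Item = (usize, usize)> + 'h {
    haystack
        .char_indices()
        .filter(move |&(start, _)| !needle.is_empty() && haystack[start..].starts_with(needle))
        .map(move |(start, _)| (start, start + needle.len()))
}

/// One possible caller-side policy: keep the leftmost matches that do not
/// overlap an already kept match.
fn keep_non_overlapping(matches: impl Iterator<Item = (usize, usize)>) -> Vec<(usize, usize)> {
    let mut kept = Vec::new();
    let mut next_free = 0; // first byte not covered by a kept match
    for (start, end) in matches {
        if start >= next_free {
            kept.push((start, end));
            next_free = end;
        }
    }
    kept
}
```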
I don't think that's worth doing and likely has deep performance implications. |
I disagree, though I do think that it should be completely separate from the current API (different function, different trait, whatever is deemed best)
Hah! I agree 110% with you on this! And it's the reason why having it as a separate API is likely the best way to do it. |
Overlapping searches are way way way too niche to put into std. If you want to convince folks otherwise, I would recommend giving more compelling reasons for why it should be in std. |
The best example I can give you off the top of my head is very niche, and likely not applicable to

I sometimes have to decode streams of bytes coming in from a receiver that can make errors because of clock skew, mismatched oscillator frequencies, and noise in general. These can show up as bit flips, missing bits, or extra bits. Despite this, I want to know when a legitimate byte stream is starting. The normal (and fast) way is to define some kind of known pattern that signals that a frame is starting, which I'm going to call the start of frame pattern. To make your own life simple, this pattern is going to be chosen to be highly unlikely to occur by accident in your environment, but it's also really, really easy to look for. One example might be just to have a stream of bits like

Now here is where things get interesting; while you could use some form of forward error correction (FEC) code to encode the start of frame pattern, continuously decoding all incoming bits to look for the pattern is energy intensive, which means battery life goes down. What you want to do is find the probable start of a frame, and then start the computationally (and therefore power) expensive process of decoding bits only when you are pretty sure you've found the start of a frame. So, you don't bother with proper FEC of the frame pattern. Instead, you make your pattern simple, and your pattern matcher will be just as simple. If it sees a pattern that looks like it could be a start of frame, you turn on your full FEC decoder and start decoding bits until you either decide that you made a mistake, or you have a frame (checksums, etc. come later).

The issue is that the noise I mentioned earlier can show up anywhere, including at the head of the start of frame pattern. So instead of looking for the full All of that makes good sense in a byte stream, and that is where the
|
Yeah I totally grant that there exist use cases for overlapping search. That's not really what I'm looking for, although I appreciate you outlining your use case. What I'm trying to get at here is that they are not frequent enough to be in std. Frequency isn't our only criterion, but I can't see any other reason why std should care at all about overlapping search. If you want it to be in std, you really need to answer the question, "why can't you use a crate for it?" with a specific reason for this particular problem. (i.e., Not general complaints like "I don't want to add dependencies to my project.") |
You're right, on all counts. I don't have a good enough reason for why it should be in std and not some crate, so I'm fine with it being dropped. That said, I would like to see the documentation clarified on which non-overlapping patterns need to be returned. I'm fine with the docs stating that you can return an arbitrary set of non-overlapping matches, I just want it to be 100% clear as to what is expected of implementors. |
Could (note: i do not know how |
@Fishrock123 It is doable, but substring search algorithms are usually not amenable to being adapted straight-forwardly to support case insensitivity. (This isn't true for all of them, but I think is likely true for Two-Way at least, which is the algorithm currently used for substring search.) So in order to support such a flag, you'd probably need to dispatch to an entirely different algorithm. Another option is the |
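To illustrate the "entirely different algorithm" point, a case-insensitive search on stable Rust could fall back to a naive window scan such as the sketch below. It is ASCII-only, quadratic in the worst case, and unrelated to Two-Way. The function name is made up for this example.

```rust
/// Naive ASCII case-insensitive substring search; returns the byte offset
/// of the first match. Non-ASCII bytes must match exactly.
fn find_ascii_case_insensitive(haystack: &str, needle: &str) -> Option<usize> {
    if needle.is_empty() {
        // `windows(0)` would panic; an empty needle matches at the start.
        return Some(0);
    }
    haystack
        .as_bytes()
        .windows(needle.len())
        .position(|window| window.eq_ignore_ascii_case(needle.as_bytes()))
}
```

Full Unicode case folding is a larger problem that this sketch does not attempt.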
This issue has been open for over 7 years now; I, and likely a lot of other people, would like to use this on stable. Would it be possible to move the unstable attributes from the root into the API methods instead? Then at least one could export it in a stable way, as '&str' already does (in stable). Example (illustration only, I leave the working bits out):
|
I agree, it is sad to see such an issue being abandoned. |
Is GAT in its current state suitable for this? I know it has some limitations. If it is, then I imagine it must be worth considering this API while we are still unstable? Related to this, I'm writing a function that ultimately searches a &str, the obvious (to me) signature was: fn until(pattern: impl Pattern); But I guess it would need some lifetime generic with the current API. |
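Until `Pattern` can be named on stable, one workaround for a function like `until` is a small local trait implemented for the needle types you need. The sketch below assumes a hypothetical `Needle` trait that delegates to `str::find`, and it changes the signature to take the haystack explicitly. It is not part of std and covers only `&str` and `char`.

```rust
/// Hypothetical stand-in for `Pattern`, limited to what `until` needs.
pub trait Needle {
    /// Byte offset of the first match in `haystack`, if any.
    fn find_in(&self, haystack: &str) -> Option<usize>;
}

impl Needle for &str {
    fn find_in(&self, haystack: &str) -> Option<usize> {
        haystack.find(*self)
    }
}

impl Needle for char {
    fn find_in(&self, haystack: &str) -> Option<usize> {
        haystack.find(*self)
    }
}

/// Returns the part of `haystack` before the first match, or all of it
/// when nothing matches.
pub fn until<'h>(haystack: &'h str, needle: impl Needle) -> &'h str {
    match needle.find_in(haystack) {
        Some(end) => &haystack[..end],
        None => haystack,
    }
}
```

Note that no lifetime parameter is needed on the needle here, only on the haystack.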
I think the summary here is that this API needs rework and somebody to champion it. There was some good discussion at #71780, including a rough proposal from @withoutboats in #71780 (comment). This kinda sorta echoes @Luro02's outline in #27721 (comment) (it seems like GATs provide us with a more ergonomic solution in any case). Another thing to keep in mind is that slice patterns were removed (#76901 (comment)) but we may want some way to work with So the next steps forward, probably:
It is unfortunate that we more or less have to go back to square one with stabilization, but there have been a lot of lessons learned and better ways to do things since the 2014 RFC (a decade!). Really this is probably just in need of somebody to take charge of the redesign and push everything forward. |
All I'd really asked for above is to stabilize the existence of the Pattern API; that would already address a lot of problems by removing unstable bits from the stable Rust stdlib API. If the API or the implementation behind it needs more work, that's OK. But honestly, after that many years and with many people relying on patterns, overly big changes would be quite surprising. |
That would of course be nice, but we don’t want to do that until knowing for sure that we won’t need to change generics from what there currently is (a single lifetime). Probably unlikely, but there’s no way of knowing without a concrete proposal.
I think it’s the opposite: all the discussion here, the very long time with no stabilization, and the proposed replacements I linked in #27721 (comment) seem to indicate that nobody is happy enough with this API as-is. This feature needs a champion who is willing to experiment and push things along. |
(Link to original RFC: rust-lang/rfcs#528)
This is a tracking issue for the unstable `pattern` feature in the standard library. We have many APIs which support the ability to search with any number of patterns generically within a string (e.g. substrings, characters, closures, etc), but implementing your own pattern (e.g. a regex) is not stable. It would be nice if these implementations could indeed be stable!

Some open questions are:
cc @Kimundi