New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issue for string patterns #27721

Open
alexcrichton opened this Issue Aug 12, 2015 · 28 comments

Comments

Projects
None yet
@alexcrichton
Member

alexcrichton commented Aug 12, 2015

(Link to original RFC: rust-lang/rfcs#528)

This is a tracking issue for the unstable pattern feature in the standard library. We have many APIs which support the ability to search with any number of patterns generically within a string (e.g. substrings, characters, closures, etc), but implementing your own pattern (e.g. a regex) is not stable. It would be nice if these implementations could indeed be stable!

Some open questions are:

  • Have these APIs been audited for naming and consistency?
  • Are we sure these APIs are as conservative as they need to be?
  • Are we sure that these APIs are as performant as they can be?
  • Are we sure that these APIs can be used to implement all the necessary forms of searching?

cc @Kimundi

@bluss

This comment has been minimized.

Contributor

bluss commented Aug 16, 2015

String Patterns RFC tracking issue: #22477

Stabilization not only impacts implementing the pattern traits, but also of course detailed use of the Searcher trait, .next_match(), .next_reject() and so on.

@jneem

This comment has been minimized.

Contributor

jneem commented Sep 17, 2015

I'm having trouble seeing the purpose of the next() method; in all the examples I look at, it's faster and cleaner to just implement next_match() and next_reject() individually. For example, these two functions can be implemented for CharEqSearcher as one-liners using Iterator::find. Moreover, if you want an optimized implementation of Searcher for an ASCII char then in implementing next() you need to choose between an implementation that returns rejections quickly and an implementation that skips quickly over the input (e.g. using SIMD) in order to quickly find a match.

@bluss

This comment has been minimized.

Contributor

bluss commented Sep 17, 2015

@jneem the StrSearcher uses the same code for next() and next_match(), but specialized for each case, so that the next_match() case is much faster. It works well.

@jneem

This comment has been minimized.

Contributor

jneem commented Sep 17, 2015

@bluss I think that helps make my point: AFAICT, all the users of Searcher only call next_match() and next_reject(). Therefore, the only purpose of next() AFAICT is to make implementing Searcher easier -- you only need to implement one function instead of two. But that benefit isn't born out in practice. StrSearcher implements two methods anyway, and it would be simpler to implement next_reject() instead of next(). CharEqSearcher implements next() only, but it would be simpler and cleaner to optimize if it implemented next_match() and next_reject() instead.

@jneem

This comment has been minimized.

Contributor

jneem commented Sep 18, 2015

Oops, I missed one: Pattern uses Searcher::next() for the default implementation of is_prefix_of.

@mkpankov

This comment has been minimized.

Contributor

mkpankov commented Nov 17, 2015

This is useful, I'd like to use it on stable.

@bluss

This comment has been minimized.

Contributor

bluss commented Nov 17, 2015

cc @BurntSushi, Regex is the major pattern user and it would be great for everyone if it was stable because of that.

@bluss

This comment has been minimized.

Contributor

bluss commented Nov 17, 2015

@BurntSushi Are the Pattern traits sufficient for Regex? Any design issues?

@BurntSushi

This comment has been minimized.

Member

BurntSushi commented Nov 17, 2015

I wrote the impl for Regex a while back, which does indeed only implement next: https://github.com/rust-lang-nursery/regex/blob/master/src/re.rs#L1104 I seem to recall that it was tricky to get the corner cases right, but that isn't too surprising. I otherwise found it pretty straight forward.

One thing that I don't think is representable with the Pattern API is fast retrieval of all matches from a suffix table. In particular, matches are reported as part of lexicographically sorted suffixes rather than order of occurrence in the string, so it can't satisfy the contract of Searcher without an additional cost. (I've already talked to @Kimundi about this, and I don't think it's a major issue. Suffix tables are a pretty niche data structure.)

@aturon

This comment has been minimized.

Member

aturon commented Jan 27, 2016

Nominating for discussion. Not sure whether this is ready for stabilization, but I'd like for the team to dig into it a bit.

@aturon aturon added the I-nominated label Jan 27, 2016

@alexcrichton

This comment has been minimized.

Member

alexcrichton commented Jan 29, 2016

Unfortunately the libs team didn't get a chance to talk about this in terms of stabilization for 1.8, but I'm going to leave the nominated tag as I think we should chat about this regardless.

@Kimundi

This comment has been minimized.

Member

Kimundi commented Feb 17, 2016

Today I investigated a bit what would need to be done to generalize the Pattern API to arbitrary slice types. As part of that I also took a more pragmatic route to the involved types and interfaces based on these assumptions:

  • Generally, the API should suit the std library, with more complex scenarios being handled by external libraries.
  • Thus, providing the same feature set for libstd slice types like str, [T] or OsStr is more important than having a API that can accommodate the most general kind of "slice type".
  • Having an more simple to maintain API suitable for the existing consumers like find, split and match_indices is a higher priority than having a more complicated API that could accommodate exotic consumers.
  • Being able to reuse the API for mutable slices is worth a moderate increase in API verbosity and unsafety-based interface surfaces.

A very rough sketch of the result can be seen below. Note that this is only the Pattern traits itself, without actual integration in the std lib or comprehensive re implementation of the existing types.

https://github.com/Kimundi/pattern_api_sketch/blob/master/src/v5.rs

The core changes are:

  • The Pattern trait now has an input parameter for the slice type. This turns Pattern<'a> into Pattern<&'a str> and allows for Pattern<&'a mut [T]>.
  • next() got removed, leaving only next_{match/reject}(). This has been done because none of the existing types could make use of the return value of next(), and none of the existing implementations benefited from needing to implement it.
    • As part of that change, the Pattern::is_{prefix/suffix}_of() methods are no longer default-implemented. But seeing how the manual implementation for a given pattern-slice combination is usually simple and straight forward, this does not appear to be a problem.
  • In order to allow for mutable slices and shared code of general iterators like match_indices between slice types,
    the return values of next_*() are now associated types of the slice type.
    They have an API that allows them to be used as abstract raw pointer types, which means generic code can use them relatively freely to create slices at or between matches, at the cost of requiring unsafe code and needing to follow aliasing rules.

Changes in regard to the open questions:

  • In addition to the existing interface, the new SearchPtrs trait has gained new ad-hoc named elements which would require additional name audition. Also, Pattern continues to be a confusing name in a language with pattern matching...
  • The sketched API change is somewhat more conservative by getting rid of the next() methods,
    concentrating instead on the pure search loops.
  • The prior point should also make the performance aspect more validateable.
    However, should the existing slice type-specific iterators types like split() be replaced by shared, generic ones there might be some regressions due to optimizations possibly not carrying over. This is optional though.
  • We should probably specify the Pattern API more closely as only being intended for iterator-like linear search operations.
@bluss

This comment has been minimized.

Contributor

bluss commented Feb 18, 2016

Is the focus of Pattern for slices going to be specific for &[u8] only? I think that's ok, I'm unsure how to really extend "substring search" further to generic &[T]. Technically, the two-way algorithm that we use for str::find(&str) can be made to work on any ordered alphabet, i.e. T: Ord is enough, but I don't know how performant or realistic this is.

@Kimundi

This comment has been minimized.

Member

Kimundi commented Feb 18, 2016

@bluss: I only used &[u8] as a quick proof of concept that the traits work with different slice types. I'm assuming that there is some sensible way to make it work with generic slices.

The &[u8] case seems like a good candidate for a specialization impl though.

@DanielKeep

This comment has been minimized.

Contributor

DanielKeep commented Mar 4, 2016

I'm currently trying out some runtime scanners for scan-rules. Basically: I want users to be able to parse text based on Patterns (e.g. accept anything that matches this pattern, consume input until this pattern is matched, etc.). The current example is being able to do something like (where until accepts a Pattern):

let_scan!("before:after", (let word_0 <| until(":"), ":", let word_1: Everything));
assert_eq!(word_0, "before");
assert_eq!(word_1, "after");

The problem I'm having is that it's really painful to actually do this. The interface as it exists seems to assume that ownership of the pattern can be passed to the method which will use it, and as a result, can only be used exactly once. This doesn't make much sense to me. Searching for a pattern should not (in general) consume the pattern.

What I want is P: Patternfor<P: Pattern> &P: Pattern. Currently, it's only true for &str. If I had this, I could take ownership of the pattern from the caller, store it, and loan it to the underlying find calls. If the caller wants to use the pattern in multiple places, and it's expensive to construct, they can instead pass my function a borrow, which will also work.

The more I think about this, the more it comes down to the FnMut implementation. The only closure kind that would allow for the kind of interface I want is Fn... but that excludes closures that test based on accumulated state, though I wonder how often that's even desireable.

I can work around this by requiring Copy patterns, but again, callables are the most useful kind of pattern (until(char::is_whitespace), etc.), so it seems a deep shame to exclude them.

@DanielKeep

This comment has been minimized.

Contributor

DanielKeep commented Mar 4, 2016

Oh, another thing I just realised: there doesn't appear to be any way to find out how long a match is given a pattern and, say, str::starts_with.

@bluss

This comment has been minimized.

Contributor

bluss commented Mar 4, 2016

You can use .next_reject for that.

@withoutboats

This comment has been minimized.

Contributor

withoutboats commented May 15, 2016

Hey, I don't know a lot about this API, but I noticed that StrSearcher is not a double ended pattern, but CharSliceSearcher is. Is this an error? The explanation of why StrSearcher is not double ended seems to apply equally well to CharSliceSearcher:

(&str)::Searcher is not a DoubleEndedSearcher because the pattern "aa" in the haystack "aaa" matches as either "[aa]a" or "a[aa]", depending from which side it is searched.

@bluss

This comment has been minimized.

Contributor

bluss commented May 15, 2016

@withoutboats A slice of chars represents a set of possibilities, so it's not like a string; either of the chars can be matched by themselves.

@withoutboats

This comment has been minimized.

Contributor

withoutboats commented May 15, 2016

@bluss That makes sense. The semantics of CharSliceSearcher are not documented; I assumed the char slice was treated as an ordered sequence rather than a set.

@aturon aturon removed the I-nominated label Jun 9, 2016

@Stebalien

This comment has been minimized.

Contributor

Stebalien commented Aug 13, 2016

Can we say that, after returning Done once, a Searcher must continue to return Done forever more? That is, can I make this assumption when implementing FusedIterator?

@Stebalien Stebalien referenced this issue Aug 13, 2016

Merged

Implement 1581 (FusedIterator) #35656

0 of 3 tasks complete

bors added a commit that referenced this issue Aug 23, 2016

Auto merge of #35656 - Stebalien:fused, r=alexcrichton
Implement 1581 (FusedIterator)

* [ ] Implement on patterns. See #27721 (comment).
* [ ] Handle OS Iterators. A bunch of iterators (`Args`, `Env`, etc.) in libstd wrap platform specific iterators. The current ones all appear to be well-behaved but can we assume that future ones will be?
* [ ] Does someone want to audit this? On first glance, all of the iterators on which I implemented `FusedIterator` appear to be well-behaved but there are a *lot* of them so a second pair of eyes would be nice.
* I haven't touched rustc internal iterators (or the internal rand) because rustc doesn't actually call `fuse()`.
* `FusedIterator` can't be implemented on `std::io::{Bytes, Chars}`.

Closes: #35602 (Tracking Issue)
Implements: rust-lang/rfcs#1581

cuviper pushed a commit to cuviper/rayon that referenced this issue Mar 28, 2017

Auto merge of #35656 - Stebalien:fused, r=alexcrichton
Implement 1581 (FusedIterator)

* [ ] Implement on patterns. See rust-lang/rust#27721 (comment).
* [ ] Handle OS Iterators. A bunch of iterators (`Args`, `Env`, etc.) in libstd wrap platform specific iterators. The current ones all appear to be well-behaved but can we assume that future ones will be?
* [ ] Does someone want to audit this? On first glance, all of the iterators on which I implemented `FusedIterator` appear to be well-behaved but there are a *lot* of them so a second pair of eyes would be nice.
* I haven't touched rustc internal iterators (or the internal rand) because rustc doesn't actually call `fuse()`.
* `FusedIterator` can't be implemented on `std::io::{Bytes, Chars}`.

Closes: #35602 (Tracking Issue)
Implements: rust-lang/rfcs#1581
@shahn

This comment has been minimized.

Contributor

shahn commented Apr 12, 2017

I'm a big fan of extending this to be more generic than just str, needed/wanted it for [u8] quite frequently. Unfortunately, there's an API inconsistency already - str::split takes a Pattern, whereas slice::split takes a predicate function that only looks at a single T in isolation.

@RReverser

This comment has been minimized.

Contributor

RReverser commented Dec 20, 2017

@shahn That shouldn't be a big problem - if Pattern is implemented for such function, then slice::split could support Pattern without breaking backwards compatibility.

@Lisoph

This comment has been minimized.

Lisoph commented Jan 11, 2018

I strongly agree with widening the API to slice and maybe Vec.

@bstrie

This comment has been minimized.

Contributor

bstrie commented May 13, 2018

What's the status here? The original RFC is quite old -- from 2014, before 1.0 -- do we need to completely revisit the design? What are the current blockers?

@shepmaster

This comment has been minimized.

Member

shepmaster commented May 13, 2018

With SIMD stabilizing in 1.27, Jetscii will be available on stable, but its Pattern support won't — I never expected that SIMD would arrive before this! 😇

@jeekobu

This comment has been minimized.

jeekobu commented May 14, 2018

Would be nice to see these for OsStr, CStr, [u8] and similar, as mentioned a year ago. Non-UTF in Rust feels very much a second class citizen and this would be a start.

@SimonSapin

This comment has been minimized.

Contributor

SimonSapin commented May 14, 2018

@jeekobu #49802 is accepted but not yet implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment