regex 1.0 #1620
BurntSushi
added
the
T-libs
label
May 18, 2016
huonw
reviewed
May 18, 2016
| ### Expansion concerns | ||
| [expansion-concerns]: #expansion-concerns | ||
|
|
||
| There are a few possible avenues for expansion, and we take measures to make |
huonw
May 18, 2016
Member
Another possibility here is "compatibility" flags, e.g. if there is a breaking change for whatever reason, people can opt-in to it with some sort of flag like (?X), ala (?i).
huonw
reviewed
May 18, 2016
| ``` | ||
|
|
||
| And five core search methods. All searching completes in worst case linear | ||
| time. |
huonw
May 18, 2016
Member
(Not directly relevant, but is this linear with respect to both pattern and text, i.e., O(n + m)? Or maybe linear with respect to each while holding the other constant, i.e., O(nm)?)
BurntSushi
force-pushed the
BurntSushi:regex-1.0
branch
from
ce2dfb6
to
5e233fa
May 18, 2016
I wonder if They could return

```rust
pub struct Found<'a> {
    string: &'a str,
    start: usize,
    end: usize,
}

impl<'a> Found<'a> {
    fn as_str(&self) -> &'a str { self.string }
    fn as_indices(&self) -> (usize, usize) { (self.start, self.end) }
}
```

(I'm not happy with the names.)
huonw
reviewed
May 18, 2016
| `String`. | ||
| * The `Error` type no longer has the `InvalidSet` variant, since the error is | ||
| no longer possible. Its `Syntax` variant was also modified to wrap a `String` | ||
| instead of a `regex_syntax::Error`. If you need access to specific parse |
huonw
May 18, 2016
Member
Maybe it could wrap a type defined in regex that internally contains a regex_syntax::Error (the type may just implement Error, Display etc. for now), so we can choose to expose more detailed info in future if we want.
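One way to read this suggestion, as a minimal sketch only: the crate exposes its own opaque error type and keeps the parser's error in a private field. The names `SyntaxError`, `ParseError`, and `check_pattern` are hypothetical (`ParseError` stands in for `regex_syntax::Error`); none of them are the crate's actual API.

```rust
use std::error::Error as StdError;
use std::fmt;

/// Hypothetical stand-in for `regex_syntax::Error`.
#[derive(Debug)]
struct ParseError {
    position: usize,
    message: String,
}

/// Opaque wrapper: the inner error is private, so more detailed
/// accessors could be added later without a breaking change.
#[derive(Debug)]
pub struct SyntaxError {
    inner: ParseError,
}

impl fmt::Display for SyntaxError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "syntax error at offset {}: {}",
            self.inner.position, self.inner.message
        )
    }
}

impl StdError for SyntaxError {}

/// Toy check for illustration only (ignores escapes): reports the
/// last-opened group that is never closed.
pub fn check_pattern(pattern: &str) -> Result<(), SyntaxError> {
    let mut open_positions = Vec::new();
    for (i, c) in pattern.char_indices() {
        match c {
            '(' => open_positions.push(i),
            ')' => {
                open_positions.pop();
            }
            _ => {}
        }
    }
    match open_positions.pop() {
        Some(position) => Err(SyntaxError {
            inner: ParseError {
                position,
                message: "unclosed group".to_string(),
            },
        }),
        None => Ok(()),
    }
}
```

Callers see only `Display` and `std::error::Error` on the wrapper, which is what leaves room to expose more later.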
|
@huonw I do like I think I can buy that Another alternative is adding a new method with a different return type, but I've tried hard to resist that urge. Another alternative is to just have |
|
Hmm, neither the byte nor string based regex implementations would be suited for matching on It would be nice if the actual matching code, and all FSM related code could be generic over an element type. That way the regular expression syntax would be specific to the use (so it might only define regular expression syntax for unicode strings and byte sequences initially) but the rest of the algorithm is reusable, such that external crates could define their own "front ends" to the matching engine. It would also allow future expansions to other string-like types (such as defining a regex syntax for matching |
The failure I was imagining is that code with several strings (possibly several that get regex'd) could accidentally mix up which strings are being sliced: Of course, you're in closer touch with the API/the crate than me so you're likely to have a better sense for the trade-offs here. |
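The mix-up described in this comment can be sketched with the standard library alone. The helpers `find_indices`, `find_slice`, and `mixed_up` are hypothetical, and `str::find` stands in for a regex search; the only point is how the return type affects which string ends up being sliced.

```rust
/// Returns byte offsets, like a `find` that yields `(usize, usize)`.
fn find_indices(haystack: &str, needle: &str) -> Option<(usize, usize)> {
    haystack
        .find(needle)
        .map(|start| (start, start + needle.len()))
}

/// Returns the matched slice itself, tied to `haystack` by its lifetime.
fn find_slice<'a>(haystack: &'a str, needle: &str) -> Option<&'a str> {
    find_indices(haystack, needle).map(|(start, end)| &haystack[start..end])
}

/// Applies offsets from one string to another by mistake, then does
/// the same lookup through the slice-returning helper.
fn mixed_up() -> (String, String) {
    let first = "id=42;";
    let second = "name=x";

    let (start, end) = find_indices(first, "42").unwrap();
    // Bug: this compiles, but the offsets came from `first`.
    let wrong = &second[start..end];

    // The slice-returning form cannot be pointed at the wrong string.
    let right = find_slice(first, "42").unwrap();

    (wrong.to_string(), right.to_string())
}
```

In this sketch the bare offsets silently produce text from the wrong string, whereas the slice-returning helper has no offsets to misapply.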
Nope. The FSM implementation is currently very much coupled to UTF-8.
I think this is a valuable thing, but is out of scope for this RFC. This RFC is not for breaking new ground and inventing a general purpose regular expression interface. I perceive this RFC to be for stabilizing well trodden ground. I'll be more clear: I personally don't want to design a regex interface. It's hard work, requires lots of experimentation, lots of use and at least two real implementations of said interface before I could believe it is good enough to stabilize. |
Your point is totally valid and your suggestion clearly has more safety. But that little bit of extra indirection is a touch annoying. I honestly don't think there is a clear right answer. Perhaps others can weigh in! |
What about |
Veedrac
commented
May 19, 2016
|
I was on the (likely unfounded) assumption that the |
There is an implementation of |
|
@eddyb I don't follow. Could you explain a touch more please? |
Veedrac
commented
May 19, 2016
|
I wasn't saying that we should block anything, just that I don't think we need to return anything more than I'm curious as to what you think about the "janky API", though. Do you think the |
I don't know. I'm not sure what "second class citizen" means in this context. It seems like it depends on how you're using it. I said "janky" at least partially because I misunderstood your suggestion. For example, if we removed the If you're not advocating removing |
@eddyb clarified on IRC that he was talking about changing

```rust
pub fn find(&self, text: &str) -> Option<::std::ops::Range<usize>>;
```

Which is kind of nice, but I don't think actually addresses @huonw's concern. :-/
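A range-returning `find` of the kind mentioned here can be sketched without the regex crate. The functions `find_range` and `matched_text` are hypothetical, and `str::find` again stands in for the regex search.

```rust
use std::ops::Range;

/// Hypothetical `find` variant returning a byte range instead of a tuple.
fn find_range(text: &str, needle: &str) -> Option<Range<usize>> {
    text.find(needle).map(|start| start..start + needle.len())
}

/// A `Range<usize>` indexes a `&str` directly, with no destructuring.
fn matched_text<'a>(text: &'a str, needle: &str) -> Option<&'a str> {
    find_range(text, needle).map(|range| &text[range])
}
```

The range is more convenient to slice with than a tuple, but it still carries no reference to the string it came from, so it can be applied to a different string just as easily.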
steveklabnik
reviewed
May 19, 2016
| http://doc.rust-lang.org/regex/regex/index.html#syntax | ||
|
|
||
| To my knowledge, the evolution as proposed in this RFC has been followed since | ||
| `regex` was created. The syntax has largely remain unchanged with few |
Dr-Emann
commented on text/0000-regex-1.0.md in 5e233fa
May 19, 2016
|
Do'h. Fixed! Thanks! |
RustPowers
commented
May 19, 2016
|
Why the |
mawww
commented
May 19, 2016
|
I'd suggest to really think about the trait based version, or any alternative that allows the use of arbitrary iterators. Regex matching on non |
This RFC need not define the new API, but it needs to consider the current API in the context of future expansion: will the current API make it difficult to add these features in future without breaking backwards compatibility? |
robinst
reviewed
May 19, 2016
| 2. It was faster. | ||
|
|
||
| Today, (1) can be simulated in practice with the use of a Clippy lint and (2) | ||
| is no longer true. In fact, `regex!` is at least one order of magnitude faster |
BurntSushi
added some commits
May 19, 2016
@RustPowers Could you explain why it is much better? Cloning the pattern string is insignificant (and always will be). |
@Diggsey The current API is defined to match on UTF-8 encoded text (or "ASCII compatible" text when the Unicode flag is disabled in a |
I don't want to split the regex API over multiple traits. I want a single cohesive interface because I think that is easiest to consume. I feel strongly about this. The current regex crate API has been out in the wild and in use for two years at this point, there's no hurry.
I don't necessarily agree. If mandatory arguments are in the constructor, then there's no way to misuse the API by "forgetting" to call a method before compilation.
Yes, this was brought up above and I agreeish.
Because
What are the semantics of
Yes, this is in the RFC. It is being removed.
This is out of scope IMO. If someone can write a meaningful benchmark that would benefit from those methods, then I think I'd be happy to oblige, but we don't need to resolve this for 1.0.
Because reading the original string of the regex that was compiled is occasionally useful and doesn't really cost us anything.
Do you mean reverse searching? Maybe one day, but it's out of scope for 1.0.
Yes, I'm going to go over the iterator names and make sure they line up with conventions in |
|
@RustPowers Thanks for the feedback! I purposefully didn't respond to a few of your concerns because they were about the implementation. If you'd like to talk about those, I'd be happy to, but please open an issue for each on the regex repo. |
BurntSushi
added some commits
Aug 5, 2016
|
Sorry everyone for the delay, but summer hit and it was hard for me to find my way back. I think I'm now finally caught up with everyone's feedback. Here are some changes:
The implementation has also been updated to reflect these changes. |
|
The @rust-lang/libs team is inclined to merge this RFC, given the recent updates @BurntSushi has made. @rust-lang/libs if you could check off your name below or write a comment below to the contrary: |
alexcrichton
added
the
final-comment-period
label
Aug 23, 2016
leodasvacas
commented
Aug 23, 2016
|
Is implementing |
ahmedcharles
commented
Aug 23, 2016
|
Just an amusing comment, since I've had this discussion about other languages in the past. Most languages (which support functions on objects) have a regex API where the regex is an object and searching/matching is done with methods on those objects. C++ seems to stand out in that the regex class has very few methods and searching/matching is done through free functions. The motivation is that regexes don't actually perform searching or matching, they are only an input into a searching/matching algorithm. I'm mildly curious if this was ever considered? (While it would likely be easy to change, given Rust's privacy model, it's probably too late in the process to have it be worth it.) |
|
@ahmedcharles Not really. The Rust ecosystem generally eschews free functions in favor of types with methods. I also disagree that regexes don't actually perform searching or matching. Much of the API is very closely coupled with regexes. (For example, only a subset of the |
|
Also, a |
|
I feel like allocator support and the synchronization problem are closely tied together---an API to explicitly preallocate and optionally share scratch space solves this nicely. I'm also wary of the bytes duplication / want to extend bytes to supports iterators or arbitrary sized types (streamed, fix-sized atoms). Most importantly, I think the recent acceptance of procedural macros 1.1 can put the wind back into the sails of regex-macros. I assume that the vast majority of use-cases use statically known regexes and thus are more appropriately handled with such an interface (I don't like the smell of lint + unwrap). Given all this, I think the time will soon be ripe for a big overhaul, in a way it hasn't been since regex was first created. If making a 1.0 doesn't impede progress on a 2.0 at all, sure, go ahead. Otherwise |
|
Improving The Rust ecosystem hasn't solved the problem of allocator support yet. If it makes sense to build regex on top of that, and if that absolutely requires breaking changes, then a 2.0 release seems fine to me. |
|
@BurntSushi do you know where the maintenance policy for stable rust-lang crates is laid out? If 2.0 can be released not because breaking changes are "absolutely required" but just because they're nice to have, that would be good to know. |
|
@Ericson2314 That would probably come under the purview of the rust-lang crates RFC. Specifically:
To specifically answer your question, my interpretation is that subsequent major version bumps are largely left up to the discretion of the maintainers, community, libs team and the willingness of someone to write an RFC for it. |
|
@BurntSushi Thanks for the link and quote. I either crates.io can support per-release versions, or prototypes can be disseminated via git and git dependencies (master vs stable branch for example). Does that sound good to you as maintainer? [N.B. I'm reminded that bitflags should probably do this for experimenting with associated constants.] |
|
@Ericson2314 It doesn't sound unreasonable to me, but I can't really know what we'll do because I can't predict what exactly will provoke a new major version upgrade. |
|
That's reasonable. |
|
Just wanted to say: this is a beautiful and inspiring RFC. Basically every question I had while reading it was answered in the next paragraph. I'm My biggest worry is about the bytes/utf8 split, but I pretty much buy your reasoning that the proposed API is the simplest, most ergonomic way to deal with it and that there aren't a lot of advantages for trying to be more clever. We've talked sometimes about "inherent trait impls", where you implement inherent methods and a trait at the same time. That might eventually provide a way to ensure a consistent API across the two variants while retaining the simplicity and ergonomics of the current setup. But that's something to explore later down the line. |
|
Ok, the decision of the libs team was to merge and accept, so I will do so! Thanks again for the RFC @BurntSushi! |
alexcrichton
merged commit 8434b82
into
rust-lang:master
Sep 12, 2016
BurntSushi
deleted the
BurntSushi:regex-1.0
branch
Sep 12, 2016
|
"rendered" link should now point here https://github.com/rust-lang/rfcs/blob/master/text/1620-regex-1.0.md |
|
@ciphergoth Thanks! Updated! |
BurntSushi commented May 18, 2016
rendered
Pretty much all of the RFC is implemented and can be tracked at this PR: rust-lang/regex#230
rustdoc output of the corresponding API is also available: http://burntsushi.net/stuff/regex-rfc/regex/