New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regex 1.0 #1620

Merged
merged 8 commits into from Sep 12, 2016

Conversation

Projects
None yet
@BurntSushi
Member

BurntSushi commented May 18, 2016

rendered

Pretty much all of the RFC is implemented and can be tracked at this PR: rust-lang/regex#230

rustdoc output of the corresponding API is also available: http://burntsushi.net/stuff/regex-rfc/regex/

@BurntSushi BurntSushi added the T-libs label May 18, 2016

### Expansion concerns
[expansion-concerns]: #expansion-concerns
There are a few possible avenues for expansion, and we take measures to make

This comment has been minimized.

@huonw

huonw May 18, 2016

Member

Another possibility here is "compatibility" flags, e.g. if there is a breaking change for whatever reason, people can opt-in to it with some sort of flag like (?X), ala (?i).

This comment has been minimized.

@BurntSushi

BurntSushi May 18, 2016

Member

Yup, good idea, added!

```
And five core search methods. All searching completes in worst case linear
time.

This comment has been minimized.

@huonw

huonw May 18, 2016

Member

(Not directly relevant, but is this linear with respect to both pattern and text, i.e., O(n + m)? Or maybe linear with respect to each while holding the other constant, i.e., O(nm)?)

This comment has been minimized.

@BurntSushi

BurntSushi May 18, 2016

Member

It is indeed O(mn). I've updated the text to be clearer.

@BurntSushi BurntSushi force-pushed the BurntSushi:regex-1.0 branch from ce2dfb6 to 5e233fa May 18, 2016

@huonw

This comment has been minimized.

Member

huonw commented May 18, 2016

I wonder if (usize, usize) is the best signature for find/find_iter? It seems to me that "forcing" users to slice themselves is annoying and somewhat error-prone in find, and doubly so for find_iter where one is less likely to slice to manually use the tail for something (the find_iter is already using the tail).

They could return &str, but this makes it harder to manually slice when that is desired (although, the standard library could provide a function on str like the old subslice_offset or maybe fn prefix_suffix(&self, substring: &str) -> (&str, &str) that requires substring be an in-memory substring of self and then returns the pieces before and after that substring). Thus, maybe they could return:

pub struct Found<'a> {
     string: &'a str,
     start: usize,
     end: usize,
}
impl<'a> Found<'a> {
     fn as_str(&self) -> &'a str { self.string }
     fn as_indices(&self) -> (usize, usize) { (self.start, self.end) }
}

(I'm not happy with the names.)

`String`.
* The `Error` type no longer has the `InvalidSet` variant, since the error is
no longer possible. Its `Syntax` variant was also modified to wrap a `String`
instead of a `regex_syntax::Error`. If you need access to specific parse

This comment has been minimized.

@huonw

huonw May 18, 2016

Member

Maybe it could wrap a type defined in regex that internally contains a regex_syntax::Error (the type may just implement Error, Display etc. for now), so we can choose to expose more detailed info in future if we want.

This comment has been minimized.

@BurntSushi

BurntSushi May 18, 2016

Member

That is a universally better idea.

This comment has been minimized.

@BurntSushi
@BurntSushi

This comment has been minimized.

Member

BurntSushi commented May 18, 2016

@huonw I do like (usize, usize) because it's very straight-forward. Returning something like your Found struct is certainly plausible (&str on its own is no good for the reasons you stated), but the extra indirection as defined seems unfortunate. I suppose we could impl Deref on Found to &str, but that also feels unfortunate.

I think I can buy that (usize, usize) is mildly annoying, but I'm not sure how error prone it is if all you want is the matched text. You typically don't need to do any arithmetic; something like re.find(text).map(|(s, e)| &text[s..e]) will work.

Another alternative is adding a new method with a different return type, but I've tried hard to resist that urge.

Another alternative is to just have find return a Captures that only has the first capturing group, but that's less ergonomic then your Found type.

@Diggsey

This comment has been minimized.

Contributor

Diggsey commented May 18, 2016

Hmm, neither the byte or string based regex implementations would be suited for matching on OsStrs.

It would be nice if the actual matching code, and all FSM related code could be generic over an element type. That way the regular expression syntax would be specific to the use (so it might only define regular expression syntax for unicode strings and byte sequences initially) but the rest of the algorithm is reusable, such that external crates could define their own "front ends" to the matching engine.

It would also allow future expansions to other string-like types (such as defining a regex syntax for matching OsStrs, where the regex itself might be an OsStr, so that it's capable of matching even non-unicode sequences, without depending on the platform-specific representation)

@huonw

This comment has been minimized.

Member

huonw commented May 19, 2016

I think I can buy that (usize, usize) is mildly annoying, but I'm not sure how error prone it is if all you want is the matched text. You typically don't need to do any arithmetic; something like re.find(text).map(|(s, e)| &text[s..e]) will work.

The failure I was imagining is that code with several strings (possibly several that get regex'd) could accidentally mix up which strings are being sliced: re.find(text).map(|(s, e)| &text2[s..e]). The API as currently written has a disconnect between the input string and the output value, that requires the user to connect them back together right, but by providing direct access to the &str---which most people will be interested in---we completely side-step that.

Of course, you're in closer touch with the API/the crate than me so you're likely to have a better sense for the trade-offs here.

@BurntSushi

This comment has been minimized.

Member

BurntSushi commented May 19, 2016

Hmm, neither the byte or string based regex implementations would be suited for matching on OsStrs.

Nope. The FSM implementation is currently very much coupled to UTF-8.

such that external crates could define their own "front ends" to the matching engine

I think this is a valuable thing, but is out of scope for this RFC. This RFC is not for breaking new ground and inventing a general purpose regular expression interface. I perceive this RFC to be for stabilizing well trodden ground. I'll be more clear: I personally don't want to design a regex interface. It's hard work, requires lots of experimentation, lots of use and at least two real implementations of said interface before I could believe it is good enough to stabilize.

@BurntSushi

This comment has been minimized.

Member

BurntSushi commented May 19, 2016

Of course, you're in closer touch with the API/the crate than me so you're likely to have a better sense for the trade-offs here.

Your point is totally valid and your suggestion clearly has more safety. But that little bit of extra indirection is a touch annoying. I honestly don't think there is a clear right answer. Perhaps others can weigh in!

@eddyb

This comment has been minimized.

Member

eddyb commented May 19, 2016

something like re.find(text).map(|(s, e)| &text[s..e]) will work.

What about Range<usize>? That seems to have more semantic information, maybe just enough to make it palatable.

@Veedrac

This comment has been minimized.

Veedrac commented May 19, 2016

I was on the (likely unfounded) assumption that the -> &str API was to be dealt with by implementing Pattern when that stabilizes.

@BurntSushi

This comment has been minimized.

Member

BurntSushi commented May 19, 2016

I was on the (likely unfounded) assumption

There is an implementation of Pattern in the regex crate, but it's only available on a nightly. The Pattern API doesn't fit all the use cases of Regex matching. I suppose we could toss out any methods strictly covered by Pattern (and also block regex 1.0 on Pattern stabilizing), but: 1) that makes for a very janky API experience, where some methods have the haystack as the receiver and others have the regex as a receiver and 2) it seems very unfortunate to block this RFC on stabilization of Pattern. The value of the Pattern impl IMO is that a Regex could be used in a more generic context.

@BurntSushi

This comment has been minimized.

Member

BurntSushi commented May 19, 2016

@eddyb I don't follow. Could you explain a touch more please?

@Veedrac

This comment has been minimized.

Veedrac commented May 19, 2016

I wasn't saying that we should block anything, just that I don't think we need to return anything more than (usize, usize) when &str-returning methods are already going to be available through other convenience APIs.

I'm curious as to what you think about the "janky API", though. Do you think the Pattern API is going to be a bit of a second-class citizen for regex?

@BurntSushi

This comment has been minimized.

Member

BurntSushi commented May 19, 2016

I'm curious as to what you think about the "janky API", though. Do you think the Pattern API is going to be a bit of a second-class citizen for regex?

I don't know. I'm not sure what "second class citizen" means in this context. It seems like it depends on how you're using it. I said "janky" at least partially because I misunderstood your suggestion. For example, if we removed the find method from Regex under the logic that it is provided by Pattern, then the public API of Regex would be janky because a critical feature of Regex (find) is over there in this other interface with a method signature different than the other methods on Regex.

If you're not advocating removing find, then my comment doesn't really apply.

@BurntSushi

This comment has been minimized.

Member

BurntSushi commented May 19, 2016

What about Range? That seems to have more semantic information, maybe just enough to make it palatable.

@eddyb clarified on IRC that he was talking about changing find to:

pub fn find(&self, text: &str) -> Option<::std::ops::Range<usize>>;

Which is kind of nice, but I don't think actually addresses @huonw's concern. :-/

http://doc.rust-lang.org/regex/regex/index.html#syntax
To my knowledge, the evolution as proposed in this RFC has been followed since
`regex` was created. The syntax has largely remain unchanged with few

This comment has been minimized.

@steveklabnik

steveklabnik May 19, 2016

Member

typo: "remained"

This comment has been minimized.

@BurntSushi

BurntSushi May 19, 2016

Member

Fixed! Thanks!

@Dr-Emann

This comment has been minimized.

Dr-Emann commented on text/0000-regex-1.0.md in 5e233fa May 19, 2016

s/faster/slower/?

This comment has been minimized.

Owner

BurntSushi replied May 19, 2016

Do'h. Fixed! Thanks!

@RustPowers

This comment has been minimized.

RustPowers commented May 19, 2016

Why the RegexBuilder is using move semantics? This option looks much better: https://is.gd/0aLEJq. But if it is wrong, then why not clone the data only 1 time in the method compile?

@mawww

This comment has been minimized.

mawww commented May 19, 2016

I'd suggest to really think about the trait based version, or any alternative that allow the use of arbitrary iterators. Regex matching on non &str is an important feature, think about the text editor use case that wont be storing its whole buffer content in a contiguous memory buffer (be it an array of lines, a rope, a gap buffer or anything else). Currently almost every regex engines except the C++ ones (std::regex and boost::regex namely) require a contiguous buffer and are a pain to work with for these cases.

@Diggsey

This comment has been minimized.

Contributor

Diggsey commented May 19, 2016

I think this is a valuable thing, but is out of scope for this RFC.

This RFC need not define the new API, but it needs to consider the current API in the context of future expansion: will the current API make it difficult to add these features in future without breaking backwards compatibility?

2. It was faster.
Today, (1) can be simulated in practice with the use of a Clippy lint and (2)
is no longer true. In fact, `regex!` is at least one order of magnitude faster

This comment has been minimized.

@robinst

robinst May 19, 2016

Contributor

faster -> slower?

@BurntSushi

This comment has been minimized.

Member

BurntSushi commented May 19, 2016

This option looks much better

@RustPowers Could you explain why it is much better? Cloning the pattern string is insignificant (and always will be).

@BurntSushi

This comment has been minimized.

Member

BurntSushi commented May 19, 2016

This RFC need not define the new API, but it needs to consider the current API in the context of future expansion: will the current API make it difficult to add these features in future without breaking backwards compatibility?

@Diggsey The current API is defined to match on UTF-8 encoded text (or "ASCII compatible" text when the Unicode flag is disabled in a bytes::Regex). And no, I do not think it is the role of this RFC to design an interface for expanding a regex implementation to arbitrary text encodings.

@BurntSushi

This comment has been minimized.

Member

BurntSushi commented Jul 26, 2016

Well, then I suggest a compromise: fresh the API and release a 0.2 version. During the year, Pattern accurately standardize, and then you can release the cherished 1.0. Why wait so long? There's no hurry, besides, libraries which experienced several major version seem unreliable.

I don't want to split the regex API over multiple traits. I want a single cohesive interface because I think that is easiest to consume. I feel strongly about this.

The current regex crate API has been out in the wild and in use for two years at this point, there's no hurry.

Mandatory arguments better to be in the build function rather than in the constructor.

I don't necessarily agree. If mandatory arguments are in the constructor, then there's no way to misuse the API by "forgetting" to call a method before compilation.

It is advisable to implement the borrow chaining.

Yes, this was brought up above and I agreeish.

Why RegexBuilder::new() takes &str regexp pattern, converts it to String and adds to Vec?

Because &str is the most friendly API. There's no compelling reason I'm aware of to make it generic.

I look at the Regex::find_Iter and expect to see Regex::find_iter_mut, but it is not, so the current name is wrong. Maybe should name similar to the Pattern methods?

What are the semantics of find_iter_mut on Regex? Why does its absence imply the name find_iter is wrong?

The constructor Regex::with_size_limit should be remove, for this using RegexBuilder.

Yes, this is in the RFC. It is being removed.

Maybe should add *_unchecked methods to the Captures?

This is out of scope IMO. If someone can write a meaningful benchmark that would benefit from those methods, then I think I'd be happy to oblige, but we don't need to resolve this for 1.0.

Method Regex::as_str seems odd, because why in the compiled regex keep its textual representation?

Because reading the original string of the regex that was compiled is occasionally useful and doesn't really cost us anything.

Needs rightmost methods like a .find, .rfind, .is_match, .is_rmatch, etc.

Do you mean reverse searching? Maybe one day, but it's out of scope for 1.0.

I think it is worth to rename some of the methods and add more

Yes, I'm going to go over the iterator names and make sure they line up with conventions in std. I struggle with whether to use the Iter suffix because the suffix makes it really easy to distinguish which types are iterators and which aren't when reading the documentation. Although, I suppose there are other ways to solve that problem that don't involve the Iter suffix.

@BurntSushi

This comment has been minimized.

Member

BurntSushi commented Jul 26, 2016

@RustPowers Thanks for the feedback! I purposefully didn't respond to a few of your concerns because they were about the implementation. If you'd like to talk about those, I'd be happy to, but please open an issue for each on the regex repo.

@BurntSushi

This comment has been minimized.

Member

BurntSushi commented Aug 21, 2016

Sorry everyone for the delay, but summer hit and it was hard for me to find my way back. I think I'm now finally caught up with everyone's feedback. Here are some changes:

  • find and find_iter now return a Match.
  • The iterators defined on Captures have been removed in favor of one capture_names method defined on Regex.
  • RegexBuilder now uses &mut self instead of self and is eliminated using build instead of compile.
  • Most iterator types have been renamed to match the names found in std::str.
  • Upgrading Unicode definitions shouldn't qualify as a breaking change.

The implementation has also been updated to reflect these changes.

@alexcrichton

This comment has been minimized.

Member

alexcrichton commented Aug 23, 2016

🔔 This RFC is now entering its final comment period 🔔

The @rust-lang/libs team is inclined to merge this RFC, given the recent updates @BurntSushi has made. @rust-lang/libs if you could check off your name below or write a comment below to the contrary:

@leodasvacas

This comment has been minimized.

leodasvacas commented Aug 23, 2016

Is implementing Deref for Match an unresolved question?

@ahmedcharles

This comment has been minimized.

ahmedcharles commented Aug 23, 2016

Just an amusing comment, since I've had this discussion about other languages in the past.

Most languages (which support functions on objects) have a regex api where the regex is an object and searching/matching is done with methods on those objects. C++ seems to stand out in that, the regex class has very few methods and searching/matching is done through free functions.

The motivation is that regex's don't actually perform searching or matching, they are only an input into a searching/matching algorithm. I'm mildly curious if this was ever considered? (While it would likely be easy to change, given Rust's privacy model, it's probably too late in the process to have it be worth it.)

@BurntSushi

This comment has been minimized.

Member

BurntSushi commented Aug 23, 2016

@ahmedcharles Not really. The Rust ecosystem generally eschews free functions in favor of types with methods. I also disagree that regexes don't actually perform searching or matching. Much of the API is very closely coupled with regexes. (For example, only a subset of the Regex API methods are available via the generic str::Pattern trait.) Defining a bunch of free functions like fn is_match(re: &Regex, haystack: &str) -> bool would generally be unidiomatic IMO.

@Diggsey

This comment has been minimized.

Contributor

Diggsey commented Aug 23, 2016

Also, a Regex is not just the pattern to match, but is a compiled representation of that pattern which is directly tied to how the matching algorithm is implemented.

@Ericson2314

This comment has been minimized.

Contributor

Ericson2314 commented Aug 25, 2016

I feel like allocator support and the synchronization problem are closely tied together---an API to explicitly preallocate and optionally share scratch space solves this nicely. I'm also wary of the bytes duplication / want to extend bytes to supports iterators or arbitrary sized types (streamed, fix-sized atoms). Most importantly, I think the recent acceptance of procedural macros 1.1 can put the wind back into the sails of regex-macros. I assume that the vast majority of use-cases use statically known regexes and thus are more appropriately handled with such an interface (I don't like the smell of lint + unwrap).

Given all this, I think the time will soon be ripe for a big overhaul, in a way it hasn't been since regex was first created. If making a 1.0 doesn't impede progress on a 2.0 at all, sure, go ahead. Otherwise 👎 from me.

@BurntSushi

This comment has been minimized.

Member

BurntSushi commented Aug 25, 2016

Improving regex_macros such that it is on par with the dynamic implementation (which, on the contrary, has been overhauled quite a bit since regex was first created) is a significant project and I strongly disagree with that blocking a 1.0 release in any form. The really important point here is that it's not clear how much faster such an implementation could even be, and it probably comes with significant costs such as code bloat.

The Rust ecosystem hasn't solved the problem of allocator support yet. If it makes sense to build regex on top of that, and if that absolutely requires breaking changes, than a 2.0 release seems fine to me.

@Ericson2314

This comment has been minimized.

Contributor

Ericson2314 commented Aug 25, 2016

@BurntSushi do you know where the maintenance policy for stable rust-lang crates is laid out? If 2.0 can released not because breaking changes are "absolutely required" but just because they're nice to have, that would be good to know.

@BurntSushi

This comment has been minimized.

Member

BurntSushi commented Aug 25, 2016

@Ericson2314 That would probably come under the purview of the rust-lang crates RFC. Specifically:

Crates in rust-lang may issue new major versions, just like any other crates, though such changes should go through the RFC process. While the library subteam is responsible for major decisions about the library after 1.0, its original author(s) will of course wield a great deal of influence, and their objections will be given due weight in the consensus process.

To specifically answer your question, my interpretation is that subsequent major version bumps are largely left up to the discretion of the maintainers, community, libs team and the willingness of someone to write an RFC for it.

@Ericson2314

This comment has been minimized.

Contributor

Ericson2314 commented Aug 27, 2016

@BurntSushi Thanks for the link and quote. I either crates.io can support per-release versions, or prototypes can be disseminated via git and git dependencies (master vs stable branch for example). Does that sounds good to you as maintainer?

[N.B. I'm reminded that bitflags should probably do this for experimenting with associated constants.]

@BurntSushi

This comment has been minimized.

Member

BurntSushi commented Aug 27, 2016

@Ericson2314 It doesn't sound unreasonable to me, but I can't really know what we'll do because I can't predict what exactly will provoke a new major version upgrade.

@Ericson2314

This comment has been minimized.

Contributor

Ericson2314 commented Aug 27, 2016

That's reasonable.

@aturon

This comment has been minimized.

Member

aturon commented Aug 29, 2016

Just wanted to say: this is a beautiful and inspiring RFC. Basically every question I had while reading it was answered in the next paragraph.

I'm 👍 on landing this!

My biggest worry is about the bytes/utf8 split, but I pretty much buy your reasoning that the proposed API is the simplest, most ergonomic way to deal with it and that there aren't a lot of advantages for trying to be more clever.

We've talked sometimes about "inherent trait impls", where you implement inherent methods and a trait at the same time. That might eventually provide a way to ensure a consistent API across the two variants while retaining the simplicity and ergonomics of the current setup. But that's something to explore later down the line.

@alexcrichton

This comment has been minimized.

Member

alexcrichton commented Sep 12, 2016

Ok, the decision of the libs team was to merge and accept, so I will do so! Thanks again for the RFC @BurntSushi!

@alexcrichton alexcrichton merged commit 8434b82 into rust-lang:master Sep 12, 2016

@BurntSushi BurntSushi deleted the BurntSushi:regex-1.0 branch Sep 12, 2016

@ciphergoth

This comment has been minimized.

Contributor

ciphergoth commented Sep 15, 2016

@BurntSushi

This comment has been minimized.

Member

BurntSushi commented Sep 15, 2016

@ciphergoth Thanks! Updated!

@chriskrycho chriskrycho referenced this pull request Feb 10, 2017

Closed

Document all features in the reference #38643

0 of 17 tasks complete

@chriskrycho chriskrycho referenced this pull request Mar 11, 2017

Closed

Document all features #9

18 of 48 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment