Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Storing mapping from names to group indices into Regex #158

Merged
merged 5 commits into from
Feb 19, 2016

Conversation

defuz
Copy link
Contributor

@defuz defuz commented Feb 1, 2016

The main goal I wanted to achieve is to get rid of creation HashMap that storing group name to group index mapping every time when creating a new instance of Captures.

Now this map is stored directly into Regex. For Regex::Dynamic it is stored as HashMap and for Regex::Native it is stored as sorted vector Vec<(&'static str, usize)>, where the first element of the tuple is the group name, and the second element is group index.

This change doesn't add backward incompatible changes, but changes lifetime parameters for some structs, like Captures<'r, 't>, where 'r refers to Regex instance.

Also, this PR is lacking documentation updates and I'm not sure about the correctness of lifetime parameters (I think we can reduce counts for some structs), so please consider this request more like a draft.

@BurntSushi what you think?

@BurntSushi
Copy link
Member

@defuz I like it. It kind of makes me wonder why I didn't do it this way initially---recreating a hashmap on each match seems a little crazy! My best guess is that I thought it'd be nice to not create the hash map if we didn't need it (e.g., never ask for captures or don't have any named groups), but compatively speaking, one (usually small) hashmap seems pretty harmless to me. Regardless, we could lazily build it once using interior mutability, but I don't think that's necessary.

With all that said, this is undoubtedly breaking and backwards incompatible. You're pretty much guaranteed this when you add a lifetime parameter to a public type, because any downstream use of that type now requires specifying the additional lifetime. (Unless, of course, lifetime elision kicks in.) In this case, I doubt there would be many breaking changes since one rarely utters the types you've changed explicitly, and instead they're usually inferred.

What do you think of keeping this PR open and putting it on the 1.0 milestone? There is already at least one other breaking change cooking (#151), so I like the idea of batching them all up to reduce churn in the ecosystem.

In any case, thank you!

@BurntSushi
Copy link
Member

I think one thing this makes impossible is keeping captures around after the regex has gone out of scope, but that seems a little weird to me.

@defuz
Copy link
Contributor Author

defuz commented Feb 2, 2016

@BurntSushi

You're pretty much guaranteed this when you add a lifetime parameter to a public type because any downstream use of that type now requires specifying the additional lifetime.

You're right, I was too optimistic about backward compatibility. :)

What do you think of keeping this PR open and putting it on the 1.0 milestone?

This sounds reasonable to me.

I think one thing this makes impossible is keeping captures around after the regex has gone out of scope, but that seems a little weird to me.

Yes, this is the price for the fact that we do not create HashMap every time. Although, we can simply clone it every time when we create Captures.

Moreover, I think we can use Vec<(&str, usize)> instead of HashMap<String, usize> for Regex::Dynamic, so that we can get rid of dynamic dispatch into SubCapturesNamed and Captures::name. And binary search is much faster than a hashmap for small collections.

@BurntSushi BurntSushi added this to the 1.0 milestone Feb 2, 2016
@BurntSushi
Copy link
Member

@defuz Sounds good to me. Moving the HashMap into Regex seems like an obvious win, but I'd strongly recommend setting up some micro-benchmarks before going further than that. :-)

N.B. This branch will need to be rebased before 1.0 because of changes I'm making. I'll ping you then or I can do it! Thanks!

@pczarn
Copy link

pczarn commented Feb 15, 2016

I think this improvement can be added backwards compatibly with Rc instead of borrowing.

@defuz
Copy link
Contributor Author

defuz commented Feb 17, 2016

@pczarn Probably we will need Arc, since Regex and Captures are Send/Sync.

@BurntSushi I thought that if this is a backward incompatible change, can we go further? We can separate features of Captures between the two structures Captures and NamedCaptures:

impl Regex {
    ...
    pub fn captures<'t>(&self, text: &'t str) -> Option<Captures<'t>> { ... }
    pub fn named_captures<'r, 't>(&'r self, text: &'t str) -> Option<NamedCaptures<'r, 't>> { ... }
    ...
}

impl Captures<'t> {
    // Like today, but there are no more field `named` and methods `names` and `iter_names`
}

struct NamedCaptures<'r, 't> {
    regex: &'r Regex,
    captures: Captures<'t>
}

impl<'r, 't> NamedCaptures<'r, 't> {
     fn captures(&self) -> &Captures<'t> { ... }
     fn name(&self, name: &str) -> Option<&'t str> { ... }
     fn iter(&self) -> SubCapturesNamed<'r, 't> { ... }
}

The main thing that we are doing here is moving methods name and iter_named from Captures to new NamedCaptures. This solution offers the most flexibility:

  • Captures is not dependent on Regex, like today, but loses methods name and iter_named
  • NamedCaptures is depended on Regex (lifetime 'r) and allows you to work with names, borrowing them from Regex.

@BurntSushi
Copy link
Member

@defuz I am honestly not a huge fan of increasing the API surface. IMO, there are already too many methods with too many decisions that the caller needs to make. I would be fine with an Arc for now. (And nice catch about Rc. It is important for Regex to be Sync!)

@defuz
Copy link
Contributor Author

defuz commented Feb 18, 2016

I made some updates:

The problems that I see now:

  • Documentation about lifetimes should be updated
  • It makes sense to add NamedGroups::Empty variant to avoid useless copying of Arc<HashMap<...>> in a case when the regular expression does not have the named group.

@BurntSushi what you think?

I also do not quite understand the reason of duplicate original, cap_names and named_groups fields of Program inside Exec struct. Can we store these fields directly in the Exec?

use std::fmt;
use std::ops::Index;
#[cfg(feature = "pattern")]
use std::str::pattern::{Pattern, Searcher, SearchStep};
use std::str::FromStr;
use std::collections::HashMap;
use std::sync::Arc;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit, but I try to order imports by:

  1. Category. (Group by std, then group by third party crates and then group by internal imports.)
  2. Alphabetical.

@BurntSushi
Copy link
Member

@defuz Lookin' pretty good! I had some nits, and I would like to leave out breaking changes so that we can get the Arc stuff merged now.

I also do not quite understand the reason of duplicate original, cap_names and named_groups fields of Program inside Exec struct. Can we store these fields directly in the Exec?

Hmm, I don't think any of those fields are in the Exec struct? In my multi branch, I've moved cap_names to the Insts type, and would probably also move named_groups there.

There may be some other refactoring to be done here, but I'd prefer holding off because my multi branch is going to mess around with it in quite a big way. (To support matching multiple regexes and reporting overlapping matches in a single pass.)

@defuz
Copy link
Contributor Author

defuz commented Feb 19, 2016

@BurntSushi Perhaps we can release fixes for lifetimes before 1.0? According to RFC 1105 (API evolution) generalizing of function is minor change:

The type of an argument to a function, or its return value, can be generalized to use generics, including by introducing a new type parameter (as long as it can be instantiated to the original type).
...
Introducing generics in this way can potentially create type inference failures, but these are considered acceptable per the principles of the RFC: they only require local annotations that could have been inserted in advance.

@BurntSushi
Copy link
Member

@defuz Right, but we're not just changing function parameters. It's adding a new lifetime parameter to a public type. In theory, it's possible to add a new type parameter if we give it a default, but you can't do that with lifetime parameters.

My goal is to march towards 1.0 so that we can fix these and other issues. Hopefully it happens in the next couple months.

@defuz
Copy link
Contributor Author

defuz commented Feb 19, 2016

@BurntSushi Oh, you're totally right, it seems that I'm blindness to incompatible changes. :(

I reverted backward incompatible changes but keep renaming of the lifetimes and clarifying in the documentation. Now everything looks good to me -- let me know if you have any comments.

@BurntSushi
Copy link
Member

@defuz This looks exquisite! Really nice work! As soon as CI comes back this is good to merge. Thank you. :-)

BurntSushi added a commit that referenced this pull request Feb 19, 2016
Storing mapping from names to group indices into Regex
@BurntSushi BurntSushi merged commit aa124b1 into rust-lang:master Feb 19, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants