
add regexp crate to Rust distribution (implements RFC 7) #13700

Merged
merged 3 commits into rust-lang:master from BurntSushi:regexp on Apr 25, 2014

Conversation

@BurntSushi (Member) commented Apr 23, 2014

Implements RFC 7 and will hopefully resolve #3591. The crate is marked as experimental. It includes a syntax extension for compiling regexps to native Rust code.

Embeds and passes the basic, nullsubexpr and repetition tests from Glenn Fowler's testregex test suite (slightly modified by Russ Cox for leftmost-first semantics). I've also hand-written a plethora of other tests that exercise Unicode support, the parser, the public API, etc. Also includes a regex-dna benchmark for the shootout.

I know the addition looks huge at first, but consider these things:

  1. More than half of the lines are dedicated to Unicode character classes.
  2. Of the ~4,500 lines remaining, 1,225 of them are comments.
  3. Another ~800 are tests.
  4. That leaves ~2,500 lines for the meat. The parser is ~850 of them. The public API, compiler, dynamic VM and code generator (for regexp!) make up the rest.
@UtherII commented Apr 23, 2014

Maybe a silly question, but wouldn't it make sense to put Unicode character class support into the standard Rust string library?

@BurntSushi (Author, Member) commented Apr 23, 2014

Possibly. But I'm not sure. What would they be used for in std::str in their current form?

Note that the matching algorithm depends on those Unicode classes being available in sorted, non-overlapping order, so that they are amenable to binary search.

One possible path forward is to leave them in regexp and rip them out if and when std::str (or something else) wants them.
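To illustrate why the sorted, non-overlapping layout matters, here is a minimal sketch (in present-day Rust syntax, with a hypothetical in_class helper rather than the crate's actual code) of class membership as a binary search over a range table:

```rust
use std::cmp::Ordering;

// Membership test against a sorted, non-overlapping table of inclusive ranges.
fn in_class(ranges: &[(char, char)], c: char) -> bool {
    ranges
        .binary_search_by(|&(lo, hi)| {
            if c < lo {
                Ordering::Greater // this range lies entirely after `c`
            } else if c > hi {
                Ordering::Less // this range lies entirely before `c`
            } else {
                Ordering::Equal // `c` falls inside this range
            }
        })
        .is_ok()
}

fn main() {
    // Hypothetical excerpt of a class table (hex digits), already sorted.
    let hex = [('0', '9'), ('a', 'f')];
    assert!(in_class(&hex, 'b'));
    assert!(!in_class(&hex, 'g'));
}
```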

//!
//! ## Matching one character
//!
//! <pre class="rust">

@alexcrichton (Member) Apr 23, 2014

We've generally tried not to use HTML tags in our documentation. Is this done to avoid running the test runner/lexer over the contents? You may be able to get away with a notrust tag after three backticks.

@BurntSushi (Author, Member) Apr 23, 2014

Actually, the reasoning is more insidious: I was unable to write a plain \ character in a fenced code block, so I resorted to the simpler solution of just writing the HTML. (I wasn't able to determine if this was a bug in the sundown parser or elsewhere...)

@alexcrichton (Member) Apr 23, 2014

Oh well, it was worth a try!

//!
//! <pre class="rust">
//! (exp) numbered capture group (indexed by opening parenthesis)
//! (?P&lt;name&gt;exp) named (also numbered) capture group (allowed chars: [_0-9a-zA-Z])

@alexcrichton (Member) Apr 23, 2014

You may want to double check this, but I don't think the html-escapes are necessary if this is in a backtick-enclosed block.

html_root_url = "http://static.rust-lang.org/doc/master")]

#![feature(macro_rules, phase)]
#![deny(missing_doc)]

/// syntax extension. Do not rely on it.
///
/// See the comments for the `program` module in `lib.rs` for a more
/// detailed explanation for what `regexp!` requires.

@alexcrichton (Member) Apr 23, 2014

In an ideal world we could mark each field as #[experimental] to have the compiler generate warnings. This is certainly ok for now though.

}
}

impl Regexp {

@alexcrichton (Member) Apr 23, 2014

Could you merge these two impl blocks?

@BurntSushi (Author, Member) Apr 23, 2014

Ah, whoops. Fixed. Remnant from a bygone era...

/// ```rust
/// # #![feature(phase)]
/// # extern crate regexp; #[phase(syntax)] extern crate regexp_macros;
/// # use regexp::NoExpand; fn main() {

@alexcrichton (Member) Apr 23, 2014

Could you uncomment the import of NoExpand here?

@BurntSushi (Author, Member) Apr 23, 2014

Done.

new.push_str(rep.reg_replace(&cap).as_slice());
last_match = e;
}
new.push_str(unsafe { raw::slice_bytes(text, last_match, text.len()) });

@alexcrichton (Member) Apr 23, 2014

Did you see a good perf improvement from using unsafe slice_bytes methods? I would have figured that the allocation going on would dominate the bounds checking.

@BurntSushi (Author, Member) Apr 23, 2014

I suspect you're right. I think I had that in there because I was mimicking the std replace, but that's not a good reason.

I just removed them and I cannot produce a benchmark that can tell the difference.

They've been removed now. Less unsafe. Woohoo.
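For reference, a minimal sketch of the shape of the loop once the unsafe slicing is gone (written in present-day Rust syntax with plain string types rather than the PR's ~str/StrBuf; the match positions and replacement string are stand-ins):

```rust
// Build the output by copying unmatched text, then each replacement,
// then the tail after the final match. Plain slicing is bounds-checked;
// the allocation of `new` dominates the cost anyway.
fn replace_all(text: &str, matches: &[(usize, usize)], rep: &str) -> String {
    let mut new = String::with_capacity(text.len());
    let mut last_match = 0;
    for &(s, e) in matches {
        new.push_str(&text[last_match..s]); // text between the previous and current match
        new.push_str(rep);                  // the replacement for this match
        last_match = e;
    }
    new.push_str(&text[last_match..]);      // everything after the last match
    new
}

fn main() {
    assert_eq!(replace_all("a1b2c", &[(1, 2), (3, 4)], "_"), "a_b_c");
}
```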

// The following is based on the code in slice::from_iter, but
// shortened since we know we're dealing with bytes. The key is that
// we already have a Vec<u8>, so there's no reason to re-collect it
// (which is what from_iter currently does).

@alexcrichton (Member) Apr 23, 2014

Can you tag this with a FIXME pointing at #12938? You may also want to mention that this should look exactly like:

new.into_owned()

@BurntSushi (Author, Member) Apr 23, 2014

Fixed.

@thestinger (Contributor) Apr 23, 2014

I don't understand why ~str is being returned here at all. It will have overhead when DST lands too, as the capacity is being lost and it will need to shrink the allocation. The convention in Rust is to return the type you have directly, rather than making non-free boxing choices for your caller that they can do themselves.

@BurntSushi (Author, Member) Apr 23, 2014

Well, the reason why it's returning ~str is that std::str::replace also returns ~str, so I figured it'd be good to stay consistent.

Also, if it returned a StrBuf, it would be difficult for the caller to safely and efficiently transform it into a ~str. And I don't mean avoiding the shrinking, but avoiding the redundant collect in the from_iter implementation of ~[].

Could we leave it as ~str with the note to revisit it once DST happens?

@thestinger (Contributor) Apr 23, 2014

The StrBuf type is more useful to the caller than ~str. It has the same functionality available along with the ability to be resized. The only reason std::str::replace returns ~str is that it's a legacy function. There's no need to be consistent with legacy design decisions pre-dating StrBuf.

This will be inefficient and unidiomatic when the DST changes happen too. You have a StrBuf internally, so you should be returning it to the caller to do as they wish with it. There's no advantage to discarding the capacity and forcing shrinking of the allocation. It's the same anti-pattern as ~T when the callee has T internally. If the caller wants to lose the excess capacity, they can do it themselves.

@thestinger (Contributor) Apr 23, 2014

It's totally unnecessary because it can return StrBuf here. This avoids the unsafe code and will avoid other costs in the future from dropping the excess capacity. I don't think it's acceptable to sneak in unsafe code to push your view on the string/vector issue. I'm strongly against this and will do everything I can to stop this from landing in the current form. There's a clear and simple way to do it without any unsafe code and you're only in favour of using it because it enshrines returning ~str in the API.

@alexcrichton (Member) Apr 23, 2014

You are singling out one use case where a StrBuf could be returned but isn't. In today's Rust, it is consistent to return ~str, not StrBuf. Regardless of your opinion about what it should be, that is the current state of affairs.

If you would like to change return values to StrBuf, then I recommend you do so in a separate issue or PR which discusses all return values, not just this one use case in an experimental library that hasn't been merged yet. Focusing on this one case is not very helpful.

I can understand you being strongly against returning ~str where a StrBuf is available, but I do not believe that this is the PR in which to make that decision.

Please do not take my comments as an endorsement of returning ~str. That is a misconception of what I am saying. I would like to merge this library because it will have significant benefit to all users of rust. Blocking this over an ongoing discussion which has no current resolution is not really helping anyone.

@thestinger (Contributor) Apr 23, 2014

I'm not singling out one use case. I'm reviewing a pull request adding new code to the standard library, and am strongly opposed to merging it while it has completely unnecessary unsafe code doing the opposite of optimization. There is no rationale for why this unsafe code is used rather than returning the StrBuf and there is no rationale for why a performance hit should be taken in the future to return it. The burden of proof rests on the person proposing we add more unsafe code, not me.

> Please do not take my comments as an endorsement of returning ~str.

You already endorsed it by misrepresenting your view on the topic as the established consensus in your post to the mailing list.

> That is a misconception of what I am saying. I would like to merge this library because it will have significant benefit to all users of rust. Blocking this over an ongoing discussion which has no current resolution is not really helping anyone.

It should not go in as long as it's going to great lengths with unsafe code to back up a minority opinion on the Vec<T> issue. It could simply return the StrBuf it has internally instead of using a convoluted unsafe workaround.

@BurntSushi (Author, Member) Apr 24, 2014

The only reason it's returning a ~str is because other parts of std also return ~str even when a StrBuf could be returned. I made this decision because it's consistent. The unsafe code followed from that. I did not make this decision because of an opinion on the Vec<T> issue.

With that said, I'm happy to change to StrBuf. (I'd also change the regex-dna benchmark to use a StrBuf. This would actually avoid a copy for each replacement done, so it'd probably improve performance.)

I'm not familiar with your governance model, so I'll otherwise keep quiet. But I just wanted to make sure that my point of view was clear.

@thestinger (Contributor) Apr 24, 2014

The rest of the standard library uses ~str because StrBuf did not exist until recently and ~str used to be resizeable. If this case had a choice between ~str and StrBuf without requiring a conversion between them, then I wouldn't have mentioned anything.

However, at the moment it's going to great lengths to avoid simply returning the StrBuf that's inside the function. It's hurting performance in the caller and it's adding unnecessary unsafety.

/// The `'a` lifetime refers to the lifetime of a borrowed string when
/// a new owned string isn't needed (e.g., for `NoExpand`).
fn reg_replace<'a>(&'a self, caps: &Captures) -> MaybeOwned<'a>;
}

@alexcrichton (Member) Apr 23, 2014

Due to #12224, implementing this trait for a closure will require this to use &mut self instead of &self. It looks like it'd be a pretty easy drop-in adjustment though (see modifications in #13686 to the CharEq trait for an example)

@BurntSushi (Author, Member) Apr 23, 2014

Easy peasy. Fixed.
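A minimal sketch of what that &mut self adjustment looks like (simplified stand-in types in present-day syntax, not the crate's real Captures/MaybeOwned signatures):

```rust
struct Captures; // stand-in for the real capture-group type

trait Replacer {
    // `&mut self` (rather than `&self`) is what lets closures implement the trait.
    fn reg_replace(&mut self, caps: &Captures) -> String;
}

// Any FnMut closure over the captures is a Replacer.
impl<F: FnMut(&Captures) -> String> Replacer for F {
    fn reg_replace(&mut self, caps: &Captures) -> String {
        (*self)(caps)
    }
}

fn main() {
    let mut count = 0;
    let mut rep = |_caps: &Captures| {
        count += 1; // stateful closures are fine because only `&mut self` is needed
        format!("match #{}", count)
    };
    assert_eq!(rep.reg_replace(&Captures), "match #1");
}
```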

/// Returns an iterator over all the non-overlapping capture groups matched
/// in `text`. This is operationally the same as `find_iter` (except it
/// yields information about submatches).
pub fn captures_iter<'r, 't>(&'r self, text: &'t str)

@alexcrichton (Member) Apr 23, 2014

If you have time (certainly not a blocker), could you add a small example to this and the above few methods? I think I understand how to use them, but examples are always super helpful!

(again, not a blocker, just a cherry on top)

@BurntSushi (Author, Member) Apr 23, 2014

I hate cherries, but they sure look nice. Added!
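For flavor, a usage sketch of the kind of example requested, shown against the present-day regex crate API (where the crate and type are called regex/Regex; in this PR they were still regexp/Regexp):

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r"(\w+)-(\d+)").unwrap();
    // Each iteration yields the capture groups for one non-overlapping match.
    for caps in re.captures_iter("alpha-1 beta-22 gamma-333") {
        // Group 0 is the whole match; groups 1 and 2 are the parenthesized ones.
        println!("name = {}, id = {}", &caps[1], &caps[2]);
    }
}
```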

self.text.slice(s, e)
}
}
}

@alexcrichton (Member) Apr 23, 2014

I'm curious if this type could one day implement the Index trait (basically leverage the foo[bar] syntax). Do you know which of these methods would be most appropriate for that?

It seems a bit odd to me that pos has the same return value for an empty match and an out-of-bounds index (and that kind of leaks over to at as well). Did you find precedent in other regex engines? Just something to think about; I'm ok with it as-is due to the len() method being available.

@BurntSushi (Author, Member) Apr 23, 2014

Honestly, I stayed away from the Index trait because there seems to be a lot of buzz about it being removed or substantially changed. I figured that if we're going to live with Vec not having index notation, then we should probably also live with Captures not having it either. (For the time being.) It would be nice if it could support caps[1] and caps["name"] (corresponding to the at and name methods), but I don't think that's currently possible? I didn't dig too much.

RE pos: Yes, it is a bit odd. The only alternatives I can think of are to assert that the index is in range or to encode the failure in the type. I don't think I really checked precedent for this in other libraries. Python seems to raise an IndexError. Similarly for asking for a named capture group that doesn't exist.

At the moment, I'm thinking that handling out-of-bounds like the rest of the standard lib does might be the best way to go (and this would, e.g., be consistent with Python). My least favorite option is to encode the failure into the return type.

@thestinger (Contributor) Apr 23, 2014

I don't think any new Index implementations should be added, because they're likely all going to need to be removed before landing the new traits. It's just going to create unnecessary churn.

@alexcrichton (Member) Apr 23, 2014

Oh no, I do not think that this should implement Index now; I was merely wondering about the future and how this may leverage it.

Let's leave these as-is. This is why the crate is experimental; I was just musing.
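As a thought experiment only (the thread agrees not to do this now), an Index impl for a simplified, hypothetical Captures might look like the following; today's regex crate does ship impls along these lines:

```rust
use std::ops::Index;

// Simplified stand-in: the real type borrows the text and stores group offsets.
struct Captures<'t> {
    text: &'t str,
    // (start, end) byte offsets per group; None means the group did not participate.
    positions: Vec<Option<(usize, usize)>>,
}

impl<'t> Index<usize> for Captures<'t> {
    type Output = str;
    fn index(&self, i: usize) -> &str {
        let (s, e) = self.positions[i].expect("no group at that index");
        &self.text[s..e]
    }
}

fn main() {
    let caps = Captures {
        text: "abc-12",
        positions: vec![Some((0, 6)), Some((0, 3)), Some((4, 6))],
    };
    assert_eq!(&caps[1], "abc");
    assert_eq!(&caps[2], "12");
}
```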

Some(i) => self.at(i).to_owned(),
}
});
text.replace("$$", "$")

@alexcrichton (Member) Apr 23, 2014

regexes used to implement regexes! (I thought bootstrapping a compiler was hard!)

@BurntSushi (Author, Member) Apr 23, 2014

:P

DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

@alexcrichton (Member) Apr 23, 2014

cc @brson, just confirming this is ok

@alexcrichton (Member) Apr 23, 2014

I got confirmation this was ok.

return StepMatch
}
Submatches => {
unsafe { groups.copy_memory(caps) }

@alexcrichton (Member) Apr 23, 2014

Did you see that manual loops didn't optimize to a memcpy? I would expect something like this to optimize to a memcpy:

for (slot, val) in groups.mut_iter().zip(caps.iter()) {
    *slot = *val;
}

@BurntSushi (Author, Member) Apr 23, 2014

I did not realize that! Awesome. I can't seem to produce any significant and consistent change in benchmark results. I've removed all 4 unsafe blocks for using copy_memory.

}
(false, Submatches) => unsafe {
t.groups.as_mut_slice().copy_memory(groups)
}

@alexcrichton (Member) Apr 23, 2014

As above, I would be curious about the usage of unsafe here.

@BurntSushi (Author, Member) Apr 23, 2014

Removed. See #13700 (comment)

v.set_len(elts);
::std::ptr::copy_nonoverlapping_memory(
v.as_mut_ptr(), groups.as_ptr(), elts);
}

@alexcrichton (Member) Apr 23, 2014

I'm curious why unsafe was used here rather than an iterator and a collect?

@BurntSushi (Author, Member) Apr 23, 2014

Removed! Same as before: no difference in benchmarks when using, e.g., groups.iter().map(|x| *x).collect(). Awesome.

There are now only three uses of unsafe: two for using uninitialized memory for sparse sets and one for reducing allocation in string replacement. (Which will hopefully be removed at some point.)

return StepMatch
}
Submatches => {
unsafe { groups.copy_memory(caps.as_slice()) }

@alexcrichton (Member) Apr 23, 2014

This seems to generate a good bit of unsafe blocks. Did you not see the common idioms optimized to essentially what the unsafe blocks are doing?

If necessary, it would be nice to have some comments about why unsafe is necessary in these locations.

}
}

#[inline(always)]

@alexcrichton (Member) Apr 23, 2014

We're generally trying to avoid inline(always) annotations; did you run into problems when these were tagged with #[inline]?

@BurntSushi (Author, Member) Apr 23, 2014

I have no idea why I used inline(always).

Changed all of them to inline. No perf difference. Fixed.

@alexcrichton (Member) commented Apr 23, 2014

This looks even better than I thought it was going to be, amazing work, and thank you so much!

@alexcrichton (Member) commented Apr 23, 2014

Ah, one more small thing: we're trying to ensure that commits can be traced back to the RFC they implemented, so could you make sure that this shows up at the bottom of the first commit message? (You can wait to rebase until later.)

RFC: 0007-regexps
None => "",
Some(ref h) => {
match h.find(&name.to_owned()) {
None => "",

@zkamsler (Contributor) Apr 23, 2014

Could you use h.find_equiv(name) here in order to avoid allocating an owned string?

@BurntSushi (Author, Member) Apr 23, 2014

Indeed I can. Fixed.
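The same allocation-free lookup in today's Rust, for context (find_equiv is long gone; HashMap::get goes through Borrow, so a &str can query a String-keyed map directly):

```rust
use std::collections::HashMap;

// Look up a named capture group without building an owned copy of the name.
fn group_index(groups: &HashMap<String, usize>, name: &str) -> Option<usize> {
    groups.get(name).copied() // no `name.to_owned()` required
}

fn main() {
    let mut groups = HashMap::new();
    groups.insert("year".to_string(), 1);
    assert_eq!(group_index(&groups, "year"), Some(1));
    assert_eq!(group_index(&groups, "month"), None);
}
```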

@BurntSushi (Author, Member) commented Apr 23, 2014

@alexcrichton Thanks! And thanks very much for all your comments so far. Very helpful. I will make sure to add RFC: 0007-regexps to the commit message.

Also, when I rebase, won't it change my commit history? I assume I'll have to force push. (Just want to make sure that's what's expected.)

@thestinger (Contributor) commented Apr 23, 2014

@BurntSushi: Yeah, you'll have to force push.

// except according to those terms.

// ignore-stage1
// ignore-cross-compile #12102

@alexcrichton (Member) Apr 23, 2014

The most recently landed PR actually makes it so that ignore-cross-compile isn't necessary. The stack of commits will need to get rebased anyway, so this is just something to include in the rebasing.

Threads {
which: which,
queue: unsafe { ::std::mem::uninit() },
sparse: unsafe { ::std::mem::uninit() },

@alexcrichton (Member) Apr 23, 2014

How come this uninit is needed? It seems quite unsafe. If it's necessary for perf, can you add a comment explaining why?

@BurntSushi (Author, Member) Apr 23, 2014

They are not needed, but these unsafe blocks actually do make a performance difference. The trick being used here is to represent sparse sets using uninitialized memory. It's described in more detail here: http://research.swtch.com/sparse

In this case, I can actually produce evidence. The first column is without unsafe and the second column is the code as you see it:

anchored_literal_long_match             264 ns/iter (+/- 2)                    165 ns/iter (+/- 4)
anchored_literal_long_non_match        5867 ns/iter (+/- 8)                   5822 ns/iter (+/- 45)
anchored_literal_short_match            232 ns/iter (+/- 8)                    161 ns/iter (+/- 2)
anchored_literal_short_non_match        495 ns/iter (+/- 1)                    424 ns/iter (+/- 3)
easy0_1K                               1808 ns/iter (+/- 111) = 566 MB/s      1277 ns/iter (+/- 170) = 801 MB/s
easy0_32                                330 ns/iter (+/- 2) = 96 MB/s          276 ns/iter (+/- 3) = 115 MB/s
easy0_32K                             48878 ns/iter (+/- 650) = 670 MB/s     33323 ns/iter (+/- 968) = 983 MB/s
easy1_1K                               1881 ns/iter (+/- 556) = 544 MB/s      1794 ns/iter (+/- 684) = 570 MB/s
easy1_32                                391 ns/iter (+/- 93) = 81 MB/s         341 ns/iter (+/- 70) = 93 MB/s
easy1_32K                             49735 ns/iter (+/- 2484) = 658 MB/s    48367 ns/iter (+/- 2864) = 677 MB/s
hard_1K                               47163 ns/iter (+/- 268) = 21 MB/s      35070 ns/iter (+/- 169) = 29 MB/s
hard_32                                1840 ns/iter (+/- 38) = 17 MB/s        1389 ns/iter (+/- 17) = 23 MB/s
hard_32K                            1497950 ns/iter (+/- 5921) = 21 MB/s   1112845 ns/iter (+/- 2605) = 29 MB/s
literal                                 142 ns/iter (+/- 2)                    131 ns/iter (+/- 0)
match_class                            1403 ns/iter (+/- 6)                   1394 ns/iter (+/- 6)
match_class_in_range                   1448 ns/iter (+/- 3)                   1347 ns/iter (+/- 4)
medium_1K                             17310 ns/iter (+/- 255) = 59 MB/s      17475 ns/iter (+/- 166) = 58 MB/s
medium_32                               888 ns/iter (+/- 29) = 36 MB/s         835 ns/iter (+/- 34) = 38 MB/s
medium_32K                           542510 ns/iter (+/- 2595) = 60 MB/s    550793 ns/iter (+/- 2491) = 59 MB/s
no_exponential                       274104 ns/iter (+/- 466)               278257 ns/iter (+/- 906)
not_literal                            1104 ns/iter (+/- 5)                   1080 ns/iter (+/- 4)
one_pass_long_prefix                    548 ns/iter (+/- 5)                    379 ns/iter (+/- 3)
one_pass_long_prefix_not                520 ns/iter (+/- 2)                    409 ns/iter (+/- 2)
one_pass_short_a                       1326 ns/iter (+/- 18)                  1291 ns/iter (+/- 8)
one_pass_short_a_not                   1945 ns/iter (+/- 21)                  1585 ns/iter (+/- 29)
one_pass_short_b                        913 ns/iter (+/- 3)                    816 ns/iter (+/- 8)
one_pass_short_b_not                   1242 ns/iter (+/- 7)                   1401 ns/iter (+/- 9)
replace_all                            1353 ns/iter (+/- 13)                  1291 ns/iter (+/- 11)               

My guess as to what's happening, particularly in the hard benchmarks, is that the mem::uninit saves a lot of time by not initializing threads that never need to be initialized, especially with larger regexps (like the hard benchmark) that have a lot of instructions.

I've included a justification in a comment and a link to Russ Cox's article.
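For readers following along, here is a compact, safe sketch of the sparse-set trick from that article (this version zero-initializes the backing storage; the PR's version gets its extra speedup by leaving it uninitialized via mem::uninit, which is what the benchmarks above measure):

```rust
// Sparse set over integers in [0, capacity): O(1) insert, membership test and clear.
struct SparseSet {
    dense: Vec<usize>,  // members, packed in insertion order
    sparse: Vec<usize>, // sparse[v] = position of v in `dense`, if v is a member
}

impl SparseSet {
    fn new(capacity: usize) -> SparseSet {
        SparseSet { dense: Vec::with_capacity(capacity), sparse: vec![0; capacity] }
    }

    fn contains(&self, v: usize) -> bool {
        // A stale sparse entry is harmless: it is validated against `dense`.
        let i = self.sparse[v];
        i < self.dense.len() && self.dense[i] == v
    }

    fn insert(&mut self, v: usize) {
        if !self.contains(v) {
            self.sparse[v] = self.dense.len();
            self.dense.push(v);
        }
    }

    fn clear(&mut self) {
        // O(1): no need to touch `sparse` at all.
        self.dense.clear();
    }
}

fn main() {
    let mut set = SparseSet::new(8);
    set.insert(3);
    set.insert(5);
    assert!(set.contains(3) && set.contains(5) && !set.contains(4));
    set.clear();
    assert!(!set.contains(3));
}
```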

@alexcrichton (Member) commented Apr 23, 2014

Just a few small nits left, and otherwise this looks fantastic. After a rebasing, I think this is good to go!

@chris-morgan (Member) commented Apr 24, 2014

Argh, I didn't notice when RFC 7 was accepted that it kept the name regexp rather than shifting to regex. Can we fix that? (Citation for why we should change it: regex is the name everyone uses.)

@chris-morgan (Member) commented Apr 24, 2014

@BurntSushi There still remains the question of Regex vs. RegEx.

@BurntSushi (Author, Member) commented Apr 24, 2014

I prefer Regex. It just looks nicer to me.

@seanmonstar (Contributor) commented Apr 24, 2014

Rust convention is CamelCase for types.

@BurntSushi (Author, Member) commented Apr 24, 2014

@seanmonstar Depends on whether you consider regex a word all by itself. :P

@BurntSushi (Author, Member) commented Apr 24, 2014

If we have RegEx on the grounds that type names are CamelCase, then I guess we'd either need to use re! for the macro or reg_ex!. (Since I believe underscores delimit words in function/macro names.)

@blaenk (Contributor) commented Apr 24, 2014

> If I switch the name of the crate, type and macro to Regex, Regex and regex!, respectively, will that make everyone reasonably happy?

Count me in for that one, Regex that is. I don't like RegEx; that uppercase 'E' adds a break in the flow of typing it, and reg_ex is just plain ugly. I also agree re is too short, and adding a p at the end of regex just makes it weird.

I think Regex is the best.

@liigo (Contributor) commented Apr 24, 2014

regex +1, since it's shorter and readable.

@chris-morgan (Member) commented Apr 24, 2014

> If I switch the name of the crate, type and macro to Regex, Regex and regex!, respectively, will that make everyone reasonably happy?

You meant the crate as regex rather than Regex. Given that, +1.

@BurntSushi (Author, Member) commented Apr 24, 2014

@chris-morgan yes absolutely! Nice catch. Edited.

@BurntSushi (Author, Member) commented Apr 24, 2014

OK, I've changed the name of the crate to regex, the type to Regex and the macro to regex!.

r? @alexcrichton @thestinger

bors added a commit that referenced this pull request Apr 24, 2014
auto merge of #13700 : BurntSushi/rust/regexp, r=alexcrichton
Implements [RFC 7](https://github.com/rust-lang/rfcs/blob/master/active/0007-regexps.md) and will hopefully resolve #3591. The crate is marked as experimental. It includes a syntax extension for compiling regexps to native Rust code.

Embeds and passes the `basic`, `nullsubexpr` and `repetition` tests from [Glenn Fowler's (slightly modified by Russ Cox for leftmost-first semantics) testregex test suite](http://www2.research.att.com/~astopen/testregex/testregex.html). I've also hand written a plethora of other tests that exercise Unicode support, the parser, public API, etc. Also includes a `regex-dna` benchmark for the shootout.

I know the addition looks huge at first, but consider these things:

1. More than half the number of lines is dedicated to Unicode character classes.
2. Of the ~4,500 lines remaining, 1,225 of them are comments.
3. Another ~800 are tests.
4. That leaves 2500 lines for the meat. The parser is ~850 of them. The public API, compiler, dynamic VM and code generator (for `regexp!`) make up the rest.
};

/// For the `regex!` syntax extension. Do not use.
#[macro_registrar]

@sfackler (Member) Apr 24, 2014

I'd just mark this as #[doc(hidden)]

@BurntSushi (Author, Member) Apr 24, 2014

Fixed. Thanks!

bors added a commit that referenced this pull request Apr 24, 2014
auto merge of #13700 : BurntSushi/rust/regexp, r=alexcrichton
bors added a commit that referenced this pull request Apr 24, 2014
auto merge of #13700 : BurntSushi/rust/regexp, r=alexcrichton
bors added a commit that referenced this pull request Apr 24, 2014
auto merge of #13700 : BurntSushi/rust/regexp, r=alexcrichton
BurntSushi added 2 commits Apr 25, 2014
Add a regex crate to the Rust distribution.
Also adds a regex_macros crate, which provides natively compiled
regular expressions with a syntax extension.

Closes #3591.

RFC: 0007-regexps
@alexcrichton commented on 09a8b38, Apr 25, 2014

r+

@bors (Contributor) commented on 09a8b38, Apr 25, 2014

saw approval from alexcrichton
at BurntSushi@09a8b38

@bors (Contributor) replied Apr 25, 2014

merging BurntSushi/rust/regexp = 09a8b38 into auto

@bors (Contributor) replied Apr 25, 2014

BurntSushi/rust/regexp = 09a8b38 merged ok, testing candidate = dd2a48f

bors added a commit that referenced this pull request Apr 25, 2014
auto merge of #13700 : BurntSushi/rust/regexp, r=alexcrichton
@alexcrichton commented on 7269bc7, Apr 25, 2014

r+

@bors (Contributor) commented on 7269bc7, Apr 25, 2014

saw approval from alexcrichton
at BurntSushi@7269bc7

@bors (Contributor) replied Apr 25, 2014

merging BurntSushi/rust/regexp = 7269bc7 into auto

@bors (Contributor) replied Apr 25, 2014

BurntSushi/rust/regexp = 7269bc7 merged ok, testing candidate = eea4909

@bors (Contributor) replied Apr 25, 2014

fast-forwarding master to auto = eea4909

bors added a commit that referenced this pull request Apr 25, 2014
auto merge of #13700 : BurntSushi/rust/regexp, r=alexcrichton

@bors bors closed this Apr 25, 2014

@bors bors merged commit 7269bc7 into rust-lang:master Apr 25, 2014

2 checks passed

continuous-integration/travis-ci: The Travis CI build passed
default: all tests passed

@BurntSushi BurntSushi deleted the BurntSushi:regexp branch Apr 25, 2014

@alexcrichton (Member) commented Apr 25, 2014

Nice work @BurntSushi!

1 similar comment
@pyros2097 commented Nov 21, 2015

Nice work @BurntSushi!
