add regexp crate to Rust distribution (implements RFC 7) #13700

Merged
merged 3 commits
@BurntSushi
Collaborator

Implements RFC 7 and will hopefully resolve #3591. The crate is marked as experimental. It includes a syntax extension for compiling regexps to native Rust code.

Embeds and passes the basic, nullsubexpr and repetition tests from Glenn Fowler's testregex test suite (slightly modified by Russ Cox for leftmost-first semantics). I've also hand-written a plethora of other tests that exercise Unicode support, the parser, the public API, etc. Also includes a regex-dna benchmark for the shootout.

I know the addition looks huge at first, but consider these things:

  1. More than half of the lines are dedicated to Unicode character classes.
  2. Of the ~4,500 lines remaining, 1,225 are comments.
  3. Another ~800 are tests.
  4. That leaves ~2,500 lines for the meat. The parser is ~850 of them; the public API, compiler, dynamic VM and code generator (for `regexp!`) make up the rest.
@UtherII

Maybe a silly question, but wouldn't it make sense to put Unicode character classes support into the standard rust string library?

@BurntSushi
Collaborator

Possibly. But I'm not sure. What would they be used for in std::str in their current form?

Note that the matching algorithm depends on those Unicode classes being available in sorted, non-overlapping order, so that they are amenable to binary search.

One possible path forward is to leave them in regexp and rip them out if and when std::str (or something else) wants them.
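The sorted, non-overlapping invariant mentioned above is what makes class membership a binary search over codepoint ranges. A minimal sketch in modern Rust syntax (the function name and the toy ranges are illustrative, not the crate's actual tables):

```rust
// Sketch: class membership via binary search over sorted,
// non-overlapping, inclusive (start, end) codepoint ranges.
// The ranges below form a tiny hypothetical class, not real Unicode data.
fn in_class(ranges: &[(char, char)], c: char) -> bool {
    ranges
        .binary_search_by(|&(lo, hi)| {
            if c < lo {
                std::cmp::Ordering::Greater // range lies above c
            } else if c > hi {
                std::cmp::Ordering::Less // range lies below c
            } else {
                std::cmp::Ordering::Equal // lo <= c <= hi: found
            }
        })
        .is_ok()
}

fn main() {
    // A toy "class": ASCII digits plus Greek lowercase alpha..omega.
    let class = [('0', '9'), ('α', 'ω')];
    assert!(in_class(&class, '5'));
    assert!(in_class(&class, 'β'));
    assert!(!in_class(&class, 'x'));
    println!("ok");
}
```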

@BurntSushi BurntSushi referenced this pull request in BurntSushi/regexp
Closed

Expand macros before looking for string literal #2

src/libregexp/lib.rs
((182 lines not shown))
+//! # #![feature(phase)]
+//! # extern crate regexp; #[phase(syntax)] extern crate regexp_macros;
+//! # fn main() {
+//! let re = regexp!(r"[\pN\p{Greek}\p{Cherokee}]+");
+//! assert_eq!(re.find("abcΔᎠβⅠᏴγδⅡxyz"), Some((3, 23)));
+//! # }
+//! ```
+//!
+//! # Syntax
+//!
+//! The syntax supported in this crate is in almost exact correspondence
+//! with the syntax supported by RE2.
+//!
+//! ## Matching one character
+//!
+//! <pre class="rust">
@alexcrichton Owner

We've generally tried not to use HTML tags in our documentation. Is this done to keep the test runner/lexer from processing the contents? You may be able to get away with a notrust tag after the three backticks.

@BurntSushi Collaborator

Actually, the reasoning is more insidious: I was unable to write a plain \ character in a fenced code block, so I resorted to the simpler solution of just writing the HTML. (I wasn't able to determine if this was a bug in the sundown parser or elsewhere...)

@alexcrichton Owner

Oh well, it was worth a try!

src/libregexp/lib.rs
((240 lines not shown))
+//! ## Empty matches
+//!
+//! <pre class="rust">
+//! ^ the beginning of text (or start-of-line with multi-line mode)
+//! $ the end of text (or end-of-line with multi-line mode)
+//! \A only the beginning of text (even with multi-line mode enabled)
+//! \z only the end of text (even with multi-line mode enabled)
+//! \b a Unicode word boundary (\w on one side and \W, \A, or \z on the other)
+//! \B not a Unicode word boundary
+//! </pre>
+//!
+//! ## Grouping and flags
+//!
+//! <pre class="rust">
+//! (exp) numbered capture group (indexed by opening parenthesis)
+//! (?P&lt;name&gt;exp) named (also numbered) capture group (allowed chars: [_0-9a-zA-Z])
@alexcrichton Owner

You may want to double check this, but I don't think the html-escapes are necessary if this is in a backtick-enclosed block.

src/libregexp/lib.rs
((350 lines not shown))
+//! The story is a bit better with untrusted search text, since this crate's
+//! implementation provides `O(nm)` search where `n` is the number of
+//! characters in the search text and `m` is the number of instructions in a
+//! compiled expression.
+
+#![crate_id = "regexp#0.11-pre"]
+#![crate_type = "rlib"]
+#![crate_type = "dylib"]
+#![experimental]
+#![license = "MIT/ASL2"]
+#![doc(html_logo_url = "http://www.rust-lang.org/logos/rust-logo-128x128-blk-v2.png",
+ html_favicon_url = "http://www.rust-lang.org/favicon.ico",
+ html_root_url = "http://static.rust-lang.org/doc/master")]
+
+#![feature(macro_rules, phase)]
+#![deny(missing_doc)]
@alexcrichton Owner

\o/

src/libregexp/re.rs
((100 lines not shown))
+/// ```
+///
+/// Given an incorrect regular expression, `regexp!` will cause the Rust
+/// compiler to produce a compile-time error.
+/// Note that `regexp!` will compile the expression to native Rust code, which
+/// makes it much faster when searching text.
+/// More details about the `regexp!` macro can be found in the `regexp` crate
+/// documentation.
+#[deriving(Clone)]
+#[allow(visible_private_types)]
+pub struct Regexp {
+ /// The representation of `Regexp` is exported to support the `regexp!`
+ /// syntax extension. Do not rely on it.
+ ///
+ /// See the comments for the `program` module in `lib.rs` for a more
+ /// detailed explanation for what `regexp!` requires.
@alexcrichton Owner

In an ideal world we could mark each field #[experimental] to have the compiler generate warnings. This is certainly ok for now though.

src/libregexp/re.rs
((145 lines not shown))
+impl Regexp {
+ /// Compiles a dynamic regular expression. Once compiled, it can be
+ /// used repeatedly to search, split or replace text in a string.
+ ///
+ /// When possible, you should prefer the `regexp!` macro since it is
+ /// safer and always faster.
+ ///
+ /// If an invalid expression is given, then an error is returned.
+ pub fn new(re: &str) -> Result<Regexp, parse::Error> {
+ let ast = try!(parse::parse(re));
+ let (prog, names) = Program::new(ast);
+ Ok(Regexp { original: re.to_owned(), names: names, p: Dynamic(prog) })
+ }
+}
+
+impl Regexp {
@alexcrichton Owner

Could you merge these two impl blocks?

@BurntSushi Collaborator

Ah, whoops. Fixed. Remnant from a bygone era...

src/libregexp/re.rs
((325 lines not shown))
+ /// let result = re.replace("Springsteen, Bruce", "$first $last");
+ /// assert_eq!(result, ~"Bruce Springsteen");
+ /// # }
+ /// ```
+ ///
+ /// Note that using `$2` instead of `$first` or `$1` instead of `$last`
+ /// would produce the same result. To write a literal `$` use `$$`.
+ ///
+ /// Finally, sometimes you just want to replace a literal string with no
+ /// submatch expansion. This can be done by wrapping a string with
+ /// `NoExpand`:
+ ///
+ /// ```rust
+ /// # #![feature(phase)]
+ /// # extern crate regexp; #[phase(syntax)] extern crate regexp_macros;
+ /// # use regexp::NoExpand; fn main() {
@alexcrichton Owner

Could you uncomment the import of NoExpand here?

@BurntSushi Collaborator

Done.

src/libregexp/re.rs
((369 lines not shown))
+ let mut last_match = 0u;
+ let mut i = 0;
+ for cap in self.captures_iter(text) {
+ // It'd be nicer to use the 'take' iterator instead, but it seemed
+ // awkward given that '0' => no limit.
+ if limit > 0 && i >= limit {
+ break
+ }
+ i += 1;
+
+ let (s, e) = cap.pos(0).unwrap(); // captures only reports matches
+ new.push_str(unsafe { raw::slice_bytes(text, last_match, s) });
+ new.push_str(rep.reg_replace(&cap).as_slice());
+ last_match = e;
+ }
+ new.push_str(unsafe { raw::slice_bytes(text, last_match, text.len()) });
@alexcrichton Owner

Did you see a good perf improvement from using unsafe slice_bytes methods? I would have figured that the allocation going on would dominate the bounds checking.

@BurntSushi Collaborator

I suspect you're right. I think I had that in there because I was mimicking the std replace, but that's not a good reason.

I just removed them and I cannot produce a benchmark that can tell the difference.

They've been removed now. Less unsafe. Woohoo.
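With `slice_bytes` gone, the loop is just safe slicing. A self-contained sketch of the same shape (using a literal pattern via `match_indices` rather than the crate's capture iterator, and keeping the PR's convention that `limit == 0` means no limit; the function name is my own):

```rust
// Sketch of the replace loop in safe code: slice the haystack around
// each match and append, with `limit == 0` meaning "no limit".
// Uses a literal pattern instead of regex captures for self-containment.
fn replace_limited(text: &str, pat: &str, rep: &str, limit: usize) -> String {
    let mut new = String::with_capacity(text.len());
    let mut last_match = 0;
    for (i, (s, m)) in text.match_indices(pat).enumerate() {
        if limit > 0 && i >= limit {
            break;
        }
        new.push_str(&text[last_match..s]); // safe slicing, no raw::slice_bytes
        new.push_str(rep);
        last_match = s + m.len();
    }
    new.push_str(&text[last_match..]); // tail after the final match
    new
}

fn main() {
    assert_eq!(replace_limited("a-b-c", "-", "+", 0), "a+b+c");
    assert_eq!(replace_limited("a-b-c", "-", "+", 1), "a+b-c");
    println!("ok");
}
```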

src/libregexp/re.rs
((381 lines not shown))
+ new.push_str(rep.reg_replace(&cap).as_slice());
+ last_match = e;
+ }
+ new.push_str(unsafe { raw::slice_bytes(text, last_match, text.len()) });
+
+ // The lengths we will go to avoid allocation.
+ // This has a *dramatic* effect on the regex-dna benchmark (and indeed,
+ // any code that uses 'replace' on a large corpus of text multiple
+ // times). The trick is to avoid the obscene amount of allocation
+ // currently done in slice::from_iter. I've been promised that DST will
+ // fix this.
+ //
+ // The following is based on the code in slice::from_iter, but
+ // shortened since we know we're dealing with bytes. The key is that
+ // we already have a Vec<u8>, so there's no reason to re-collect it
+ // (which is what from_iter currently does).
@alexcrichton Owner

Can you tag this with a FIXME pointing at #12938? You may also want to mention that this should look exactly like:

new.into_owned()
@BurntSushi Collaborator

Fixed.

I don't understand why ~str is being returned here at all. It will have overhead when DST lands too, as the capacity is being lost and it will need to shrink the allocation. The convention in Rust is to return the type you have directly, rather than making non-free boxing choices for your caller that they can do themselves.

@BurntSushi Collaborator

Well, the reason it's returning ~str is that std::str::replace also returns ~str, so I figured it'd be good to stay consistent.

Also, if it returned a StrBuf, it would be difficult for the caller to safely and efficiently transform it into a ~str. And I don't mean avoiding the shrinking, but avoiding the redundant collect in the from_iter implementation of ~[].

Could we leave it as ~str with the note to revisit it once DST happens?

The StrBuf type is more useful to the caller than ~str. It has the same functionality available along with the ability to be resized. The only reason std::str::replace returns ~str is that it's a legacy function. There's no need to be consistent with legacy design decisions pre-dating StrBuf.

This will be inefficient and unidiomatic when the DST changes happen too. You have a StrBuf internally, so you should be returning it to the caller to do as they wish with it. There's no advantage to discarding the capacity and forcing shrinking of the allocation. It's the same anti-pattern as ~T when the callee has T internally. If the caller wants to lose the excess capacity, they can do it themselves.

@alexcrichton Owner

StrBuf vs ~str is still in flux. Let's leave this as is and revisit it later. This is not a reason to block this crate in its entirety.

Whether or not it's in flux doesn't apply here. It uses StrBuf internally and should be returning StrBuf. It's reason enough to block the pull request for me, because I'm certain the existing API of various modules is going to used as an argument against changing it later. I have no problem putting an r- on this over it and turning it into a long mailing list discussion.

In the current language, it's also even more inefficient than it would be with DST. So not only is it a dubious API design for the future, but it's going out of the way to use an opinionated API even though it doesn't work well today.

Anyway, it seems pretty odd to use unsafe code for micro-optimizations when it could be just using safe code that will remain faster than into_owned in the future. Since when is using totally unnecessary unsafe code acceptable?

@alexcrichton Owner

This is not totally unnecessary unsafe code, it is clearly documented as being necessary for some extra perf on the shootout benchmark. Additionally, it is clearly documented as to what exactly it is replacing, and why the current machinery doesn't quite cut it.

Once DST lands (soon), this unsafe block will go away and will no longer be necessary; there is a clear path forward. Today, foo.move_iter().collect() allocates a temporary Vec<T>, then allocates a ~[T], then copies the data. This is why into_owned() is so slow today.

It's totally unnecessary because it can return StrBuf here. This avoids the unsafe code and will avoid other costs in the future from dropping the excess capacity. I don't think it's acceptable to sneak in unsafe code to push your view on the string/vector issue. I'm strongly against this and will do everything I can to stop this from landing in the current form. There's a clear and simple way to do it without any unsafe code and you're only in favour of using it because it enshrines returning ~str in the API.

@alexcrichton Owner

You are singling out one use case where StrBuf could be returned but isn't. In today's Rust, it is consistent to return ~str, not StrBuf. Regardless of your opinion about what it should be, that is the current state of affairs.

If you would like to change values to returning StrBuf, then I recommend you do so in a separate issue or PR which discusses all return values, not just this one use case in an experimental library that hasn't been merged yet. Focusing on this one case is not very helpful.

I can understand you being strongly against returning ~str where a StrBuf is available, but I do not believe that this is the PR to make that decision.

Please do not take my comments as an endorsement of returning ~str. That is a misconception of what I am saying. I would like to merge this library because it will have significant benefit to all users of rust. Blocking this over an ongoing discussion which has no current resolution is not really helping anyone.

I'm not singling out one use case. I'm reviewing a pull request adding new code to the standard library, and am strongly opposed to merging it while it has completely unnecessary unsafe code doing the opposite of optimization. There is no rationale for why this unsafe code is used rather than returning the StrBuf and there is no rationale for why a performance hit should be taken in the future to return it. The burden of proof rests on the person proposing we add more unsafe code, not me.

Please do not take my comments as an endorsement of returning ~str.

You already endorsed it by misrepresenting your view on the topic as the established consensus in your post to the mailing list.

That is a misconception of what I am saying. I would like to merge this library because it will have significant benefit to all users of rust. Blocking this over an ongoing discussion which has no current resolution is not really helping anyone.

It should not go in as long as it's going to great lengths with unsafe code to back up a minority opinion on the Vec<T> issue. It could simply return the StrBuf it has internally instead of using a convoluted unsafe workaround.

@BurntSushi Collaborator

The only reason it's returning a ~str is because other parts of std also return ~str even when a StrBuf could be returned. I made this decision because it's consistent. The unsafe code followed from that. I did not make this decision because of an opinion on the Vec<T> issue.

With that said, I'm happy to change to StrBuf. (I'd also change the regex-dna benchmark to use a StrBuf. This would actually avoid a copy for each replacement done, so it'd probably improve performance.)

I'm not familiar with your governance model, so I'll otherwise keep quiet. But I just wanted to make sure that my point of view was clear.

The rest of the standard library uses ~str because StrBuf never existed until recently and ~str used to be resizeable. If this case had a choice between ~str and StrBuf without requiring a conversion between them, then I wouldn't have mentioned anything.

However, at the moment it's going to great lengths to avoid simply returning the StrBuf that's inside the function. It's hurting performance in the caller and it's adding unnecessary unsafety.
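For reference, the pattern being asked for is simple in today's terms (StrBuf became String, and ~str is roughly Box<str>): return the buffer you built and let the caller shrink it if they care. The function name below is illustrative:

```rust
// Sketch: return the growable buffer (StrBuf, today's String) directly;
// the caller can drop the excess capacity themselves if they want to.
fn build_reply(name: &str) -> String {
    let mut buf = String::with_capacity(64);
    buf.push_str("hello, ");
    buf.push_str(name);
    buf // no conversion, no shrinking, no unsafe
}

fn main() {
    let s = build_reply("rust");
    assert_eq!(s, "hello, rust");
    // The caller opts in to shrinking (the ~str conversion did this implicitly):
    let boxed: Box<str> = s.into_boxed_str();
    assert_eq!(&*boxed, "hello, rust");
    println!("ok");
}
```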

src/libregexp/re.rs
((414 lines not shown))
+/// It can be used with `replace` and `replace_all` to do a literal
+/// string replacement without expanding `$name` to their corresponding
+/// capture groups.
+///
+/// `'t` is the lifetime of the literal text.
+pub struct NoExpand<'t>(pub &'t str);
+
+/// Replacer describes types that can be used to replace matches in a string.
+pub trait Replacer {
+ /// Returns a possibly owned string that is used to replace the match
+/// corresponding to the `caps` capture group.
+ ///
+ /// The `'a` lifetime refers to the lifetime of a borrowed string when
+ /// a new owned string isn't needed (e.g., for `NoExpand`).
+ fn reg_replace<'a>(&'a self, caps: &Captures) -> MaybeOwned<'a>;
+}
@alexcrichton Owner

Due to #12224, implementing this trait for a closure will require this to use &mut self instead of &self. It looks like it'd be a pretty easy drop-in adjustment though (see modifications in #13686 to the CharEq trait for an example)

@BurntSushi Collaborator

Easy peasy. Fixed.

src/libregexp/re.rs
((193 lines not shown))
+ /// Returns the capture groups corresponding to the leftmost-first
+ /// match in `text`. Capture group `0` always corresponds to the entire
+ /// match. If no match is found, then `None` is returned.
+ ///
+ /// You should only use `captures` if you need access to submatches.
+ /// Otherwise, `find` is faster for discovering the location of the overall
+ /// match.
+ pub fn captures<'t>(&self, text: &'t str) -> Option<Captures<'t>> {
+ let caps = exec(self, Submatches, text);
+ Captures::new(self, text, caps)
+ }
+
+ /// Returns an iterator over all the non-overlapping capture groups matched
+ /// in `text`. This is operationally the same as `find_iter` (except it
+ /// yields information about submatches).
+ pub fn captures_iter<'r, 't>(&'r self, text: &'t str)
@alexcrichton Owner

If you have time (certainly not a blocker), could you add a small example to this and the above few methods? I think I understand how to use them, but examples are always super helpful!

(again, not a blocker, just a cherry on top)

@BurntSushi Collaborator

I hate cherries, but they sure look nice. Added!

src/libregexp/re.rs
((564 lines not shown))
+ return None
+ }
+ Some((self.locs.get(s).unwrap(), self.locs.get(e).unwrap()))
+ }
+
+ /// Returns the matched string for the capture group `i`.
+ /// If `i` isn't a valid capture group or didn't match anything, then the
+ /// empty string is returned.
+ pub fn at(&self, i: uint) -> &'t str {
+ match self.pos(i) {
+ None => "",
+ Some((s, e)) => {
+ self.text.slice(s, e)
+ }
+ }
+ }
@alexcrichton Owner

I'm curious if this type could one day implement the Index trait (basically leverage the foo[bar] syntax). Do you know which of these methods would be most appropriate for that?

It seems a bit odd to me that pos has the same return value for an empty match and an out-of-bounds index (and that kind of leaks over to at as well). Did you find precedent in other regex engines? Just something to think about; I'm ok with it as-is due to the len() method being available.

@BurntSushi Collaborator

Honestly, I stayed away from the Index trait because there seems to be a lot of buzz about it being removed or substantially changed. I figured that if we're going to live with Vec not having index notation, then we should probably also live with Captures not having it either. (For the time being.) It would be nice if it could support caps[1] and caps["name"] (corresponding to the at and name methods), but I don't think that's currently possible? I didn't dig too much.

RE pos: Yes, it is a bit odd. The only alternatives I can think of are to assert that the index is in range or to encode the failure in the type. I don't think I really checked precedent for this in other libraries. Python seems to raise an IndexError. Similarly for asking for a named capture group that doesn't exist.

At the moment, I'm thinking that handling out-of-bounds like the rest of the standard lib does might be the best way to go (and this would, e.g., be consistent with Python). My least favorite option is to encode the failure into the return type.

I don't think any new Index implementations should be added, because they're likely all going to need to be removed before landing the new traits. It's just going to create unnecessary churn.

@alexcrichton Owner

Oh no, I do not think that this should implement Index now, I was merely wondering about the future and how this may leverage it.

Let's leave these as-is. This is why the crate is experimental, I was just musing.
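The trade-off discussed above (empty string vs. encoding failure in the type) can be made concrete with a toy capture holder; the type and field names here are hypothetical, not the PR's actual definitions:

```rust
// Toy illustration of the API choice discussed above: `pos` encodes
// "no such group / group didn't match" as None, and `at` flattens
// both cases to the empty string.
struct Caps<'t> {
    text: &'t str,
    locs: Vec<Option<(usize, usize)>>,
}

impl<'t> Caps<'t> {
    fn pos(&self, i: usize) -> Option<(usize, usize)> {
        // Out-of-bounds index and non-participating group both yield None.
        self.locs.get(i).copied().flatten()
    }

    fn at(&self, i: usize) -> &'t str {
        match self.pos(i) {
            None => "", // empty match and bad index are indistinguishable here
            Some((s, e)) => &self.text[s..e],
        }
    }
}

fn main() {
    let caps = Caps { text: "abc", locs: vec![Some((0, 3)), None] };
    assert_eq!(caps.at(0), "abc");
    assert_eq!(caps.at(1), ""); // group didn't participate
    assert_eq!(caps.at(9), ""); // out of bounds: same answer
    println!("ok");
}
```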

src/libregexp/re.rs
((617 lines not shown))
+ /// If `name` isn't a valid capture group (whether the name doesn't exist or
+ /// isn't a valid index), then it is replaced with the empty string.
+ ///
+ /// To write a literal `$` use `$$`.
+ pub fn expand(&self, text: &str) -> ~str {
+ // How evil can you get?
+ // FIXME: Don't use regexes for this. It's completely unnecessary.
+ let re = Regexp::new(r"(^|[^$]|\b)\$(\w+)").unwrap();
+ let text = re.replace_all(text, |refs: &Captures| -> ~str {
+ let (pre, name) = (refs.at(1), refs.at(2));
+ pre + match from_str::<uint>(name) {
+ None => self.name(name).to_owned(),
+ Some(i) => self.at(i).to_owned(),
+ }
+ });
+ text.replace("$$", "$")
@alexcrichton Owner

regexes used to implement regexes! (I thought bootstrapping a compiler was hard!)

@BurntSushi Collaborator

:P
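What the FIXME asks for, a hand-rolled scanner instead of a regex, might look like the following sketch. The grammar is simplified (`$$` escapes a literal `$`, a name is a maximal run of alphanumerics/underscores), and the function name and map-based lookup are my own, not the PR's code:

```rust
use std::collections::HashMap;

// Sketch of `$name` expansion with a hand-rolled scanner instead of a
// regex. `$$` produces a literal `$`; unknown names expand to "".
fn expand(template: &str, groups: &HashMap<&str, &str>) -> String {
    let mut out = String::new();
    let mut chars = template.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '$' {
            out.push(c);
            continue;
        }
        if chars.peek() == Some(&'$') {
            chars.next();
            out.push('$'); // `$$` escapes a literal dollar sign
            continue;
        }
        // Collect the maximal run of word-ish characters as the group name.
        let mut name = String::new();
        while let Some(&n) = chars.peek() {
            if n.is_alphanumeric() || n == '_' {
                name.push(n);
                chars.next();
            } else {
                break;
            }
        }
        // Unknown names expand to the empty string, as in the PR.
        out.push_str(groups.get(name.as_str()).copied().unwrap_or(""));
    }
    out
}

fn main() {
    let mut g = HashMap::new();
    g.insert("first", "Bruce");
    g.insert("last", "Springsteen");
    assert_eq!(expand("$first $last", &g), "Bruce Springsteen");
    assert_eq!(expand("$$5 for $first", &g), "$5 for Bruce");
    println!("ok");
}
```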

src/libregexp/testdata/LICENSE
((4 lines not shown))
+copy of THIS SOFTWARE FILE (the "Software"), to deal in the Software
+without restriction, including without limitation the rights to use,
+copy, modify, merge, publish, distribute, and/or sell copies of the
+Software, and to permit persons to whom the Software is furnished to do
+so, subject to the following disclaimer:
+
+THIS SOFTWARE IS PROVIDED BY AT&T ``AS IS'' AND ANY EXPRESS OR IMPLIED
+WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+IN NO EVENT SHALL AT&T BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@alexcrichton Owner

cc @brson, just confirming this is ok

@alexcrichton Owner

I got confirmation this was ok.

src/libregexp/vm.rs
((194 lines not shown))
+ fn step(&self, groups: &mut [Option<uint>], nlist: &mut Threads,
+ caps: &mut [Option<uint>], pc: uint)
+ -> StepState {
+ match *self.prog.insts.get(pc) {
+ Match => {
+ match self.which {
+ Exists => {
+ return StepMatchEarlyReturn
+ }
+ Location => {
+ groups[0] = caps[0];
+ groups[1] = caps[1];
+ return StepMatch
+ }
+ Submatches => {
+ unsafe { groups.copy_memory(caps) }
@alexcrichton Owner

Did you see that manual loops didn't optimize to a memcpy? I would expect something like this to optimize to a memcpy:

for (slot, val) in groups.mut_iter().zip(caps.iter()) {
    *slot = *val;
}
@BurntSushi Collaborator

I did not realize that! Awesome. I can't seem to produce any significant and consistent change in benchmark results. I've removed all 4 unsafe blocks for using copy_memory.
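In modern syntax, both the zip loop suggested above and today's shorthand compile down to a memcpy for `Copy` element types; the function name is illustrative:

```rust
// Safe equivalents of the unsafe `copy_memory` call: a zip loop (what
// the reviewer suggested, modulo renamed iterator methods) and today's
// `copy_from_slice`, both of which LLVM lowers to a memcpy for `Copy` types.
fn copy_zip(dst: &mut [Option<usize>], src: &[Option<usize>]) {
    for (slot, val) in dst.iter_mut().zip(src.iter()) {
        *slot = *val;
    }
}

fn main() {
    let src = [Some(3), None, Some(7)];

    let mut groups = [None; 3];
    copy_zip(&mut groups, &src);
    assert_eq!(groups, src);

    // Modern shorthand (panics if the lengths differ):
    let mut groups2 = [None; 3];
    groups2.copy_from_slice(&src);
    assert_eq!(groups2, src);
    println!("ok");
}
```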

src/libregexp/vm.rs
((459 lines not shown))
+ size: 0,
+ }
+ }
+
+ fn add(&mut self, pc: uint, groups: &[Option<uint>], empty: bool) {
+ let t = self.queue.get_mut(self.size);
+ t.pc = pc;
+ match (empty, self.which) {
+ (_, Exists) | (true, _) => {},
+ (false, Location) => {
+ *t.groups.get_mut(0) = groups[0];
+ *t.groups.get_mut(1) = groups[1];
+ }
+ (false, Submatches) => unsafe {
+ t.groups.as_mut_slice().copy_memory(groups)
+ }
@alexcrichton Owner

As above, I would be curious about the usage of unsafe here.

@BurntSushi Collaborator

Removed. See #13700 (comment)

src/libregexp_macros/lib.rs
((201 lines not shown))
+ i += 1;
+ }
+ ::std::mem::swap(&mut clist, &mut nlist);
+ nlist.empty();
+ }
+ match self.which {
+ Exists if matched => vec![Some(0u), Some(0u)],
+ Exists => vec![None, None],
+ Location | Submatches => {
+ let elts = groups.len();
+ let mut v = Vec::with_capacity(elts);
+ unsafe {
+ v.set_len(elts);
+ ::std::ptr::copy_nonoverlapping_memory(
+ v.as_mut_ptr(), groups.as_ptr(), elts);
+ }
@alexcrichton Owner

I'm curious why unsafe was used here rather than an iterator and a collect?

@BurntSushi Collaborator

Removed! Same as before: no difference in benchmarks when using, e.g., groups.iter().map(|x| *x).collect(). Awesome.

There are now only three uses of unsafe: two for using uninitialized memory for sparse sets and one for reducing allocation in string replacement. (Which will hopefully be removed at some point.)
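The collect-based replacement mentioned above is one line; `copied()` is the modern spelling of `.map(|x| *x)`. A sketch (the function name is my own):

```rust
// Safe replacement for the with_capacity/set_len/copy_nonoverlapping
// pattern: a single collect (or `to_vec`) allocates exactly once here.
fn snapshot(groups: &[Option<usize>]) -> Vec<Option<usize>> {
    groups.iter().copied().collect() // equivalent to groups.to_vec()
}

fn main() {
    let groups = [Some(0), Some(4), None];
    let v = snapshot(&groups);
    assert_eq!(v, vec![Some(0), Some(4), None]);
    println!("ok");
}
```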

src/libregexp_macros/lib.rs
((424 lines not shown))
+ let arms = self.prog.insts.iter().enumerate().map(|(pc, inst)| {
+ let nextpc = pc + 1;
+ let body = match *inst {
+ Match => {
+ quote_expr!(self.cx, {
+ match self.which {
+ Exists => {
+ return StepMatchEarlyReturn
+ }
+ Location => {
+ groups[0] = caps[0];
+ groups[1] = caps[1];
+ return StepMatch
+ }
+ Submatches => {
+ unsafe { groups.copy_memory(caps.as_slice()) }
@alexcrichton Owner

This seems to generate a good number of unsafe blocks. Did the common idioms not optimize to essentially what the unsafe blocks are doing?

If necessary, it would be nice to have some comments about why unsafe is necessary in these locations.

@BurntSushi Collaborator

Removed. #13700 (comment)

src/libregexp_macros/lib.rs
((247 lines not shown))
+ queue: [Thread, ..$num_insts],
+ sparse: [uint, ..$num_insts],
+ size: uint,
+ }
+
+ impl Threads {
+ fn new(which: MatchKind) -> Threads {
+ Threads {
+ which: which,
+ queue: unsafe { ::std::mem::uninit() },
+ sparse: unsafe { ::std::mem::uninit() },
+ size: 0,
+ }
+ }
+
+ #[inline(always)]
@alexcrichton Owner

We're generally trying to avoid inline(always) annotations; did you run into problems if these were tagged with #[inline]?

@BurntSushi Collaborator

I have no idea why I used inline(always).

Changed all of them to inline. No perf difference. Fixed.

@alexcrichton

This looks even better than I thought it was going to be, amazing work, and thank you so much!

@alexcrichton

Ah, one more small thing: we're trying to ensure that commits can be traced back to the RFC they implemented, so could you make sure that this shows up at the bottom of the first commit message? (You can wait to rebase until later.)

RFC: 0007-regexps
src/libregexp/re.rs
((692 lines not shown))
+ None => "",
+ Some((s, e)) => {
+ self.text.slice(s, e)
+ }
+ }
+ }
+
+ /// Returns the matched string for the capture group named `name`.
+ /// If `name` isn't a valid capture group or didn't match anything, then
+ /// the empty string is returned.
+ pub fn name(&self, name: &str) -> &'t str {
+ match self.named {
+ None => "",
+ Some(ref h) => {
+ match h.find(&name.to_owned()) {
+ None => "",

Could you use h.find_equiv(name) here in order to avoid allocating an owned string?

@BurntSushi Collaborator

Indeed I can. Fixed.
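`find_equiv` is long gone; in today's std the same no-allocation lookup falls out of the `Borrow` bound on `HashMap::get`, which accepts a `&str` against `String` keys. A sketch (the function name is illustrative):

```rust
use std::collections::HashMap;

// Modern equivalent of `find_equiv`: `HashMap::<String, _>::get` takes
// any borrowed form of the key (`&str` via `Borrow<str>`), so no owned
// `String` needs to be allocated just to perform the lookup.
fn group_index(named: &HashMap<String, usize>, name: &str) -> Option<usize> {
    named.get(name).copied() // no `name.to_owned()` required
}

fn main() {
    let mut named = HashMap::new();
    named.insert("year".to_string(), 2);
    assert_eq!(group_index(&named, "year"), Some(2));
    assert_eq!(group_index(&named, "month"), None);
    println!("ok");
}
```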

@BurntSushi
Collaborator

@alexcrichton Thanks! And thanks very much for all your comments so far. Very helpful. I will make sure to add RFC: 0007-regexps to the commit message.

Also, when I rebase, won't it change my commit history? I assume I'll have to force push. (Just want to make sure that's what's expected.)

@thestinger

@BurntSushi: Yeah, you'll have to force push.

src/test/compile-fail/syntax-extension-regexp-invalid.rs
@@ -0,0 +1,27 @@
+// Copyright 2014 The Rust Project Developers. See the COPYRIGHT
+// file at the top-level directory of this distribution and at
+// http://rust-lang.org/COPYRIGHT.
+//
+// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
+// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
+// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
+// option. This file may not be copied, modified, or distributed
+// except according to those terms.
+
+// ignore-stage1
+// ignore-cross-compile #12102
@alexcrichton Owner

The most recently landed PR actually makes it so ignore-cross-compile isn't necessary. The stack of commits will need to get rebased anyway, so this is just something to include in the rebasing.

src/libregexp_macros/lib.rs
((233 lines not shown))
+ groups: Captures,
+ }
+
+ struct Threads {
+ which: MatchKind,
+ queue: [Thread, ..$num_insts],
+ sparse: [uint, ..$num_insts],
+ size: uint,
+ }
+
+ impl Threads {
+ fn new(which: MatchKind) -> Threads {
+ Threads {
+ which: which,
+ queue: unsafe { ::std::mem::uninit() },
+ sparse: unsafe { ::std::mem::uninit() },
@alexcrichton Owner

How come this uninit is needed? It seems quite unsafe. If it's necessary for perf, can you add a comment explaining why?

@BurntSushi Collaborator

They are not needed, but these unsafe blocks actually do make a performance difference. The trick being used here is to represent sparse sets using uninitialized memory. It's described in more detail here: http://research.swtch.com/sparse

In this case, I can actually produce evidence. The first column is without unsafe and the second column is the code as you see it:

anchored_literal_long_match             264 ns/iter (+/- 2)                    165 ns/iter (+/- 4)
anchored_literal_long_non_match        5867 ns/iter (+/- 8)                   5822 ns/iter (+/- 45)
anchored_literal_short_match            232 ns/iter (+/- 8)                    161 ns/iter (+/- 2)
anchored_literal_short_non_match        495 ns/iter (+/- 1)                    424 ns/iter (+/- 3)
easy0_1K                               1808 ns/iter (+/- 111) = 566 MB/s      1277 ns/iter (+/- 170) = 801 MB/s
easy0_32                                330 ns/iter (+/- 2) = 96 MB/s          276 ns/iter (+/- 3) = 115 MB/s
easy0_32K                             48878 ns/iter (+/- 650) = 670 MB/s     33323 ns/iter (+/- 968) = 983 MB/s
easy1_1K                               1881 ns/iter (+/- 556) = 544 MB/s      1794 ns/iter (+/- 684) = 570 MB/s
easy1_32                                391 ns/iter (+/- 93) = 81 MB/s         341 ns/iter (+/- 70) = 93 MB/s
easy1_32K                             49735 ns/iter (+/- 2484) = 658 MB/s    48367 ns/iter (+/- 2864) = 677 MB/s
hard_1K                               47163 ns/iter (+/- 268) = 21 MB/s      35070 ns/iter (+/- 169) = 29 MB/s
hard_32                                1840 ns/iter (+/- 38) = 17 MB/s        1389 ns/iter (+/- 17) = 23 MB/s
hard_32K                            1497950 ns/iter (+/- 5921) = 21 MB/s   1112845 ns/iter (+/- 2605) = 29 MB/s
literal                                 142 ns/iter (+/- 2)                    131 ns/iter (+/- 0)
match_class                            1403 ns/iter (+/- 6)                   1394 ns/iter (+/- 6)
match_class_in_range                   1448 ns/iter (+/- 3)                   1347 ns/iter (+/- 4)
medium_1K                             17310 ns/iter (+/- 255) = 59 MB/s      17475 ns/iter (+/- 166) = 58 MB/s
medium_32                               888 ns/iter (+/- 29) = 36 MB/s         835 ns/iter (+/- 34) = 38 MB/s
medium_32K                           542510 ns/iter (+/- 2595) = 60 MB/s    550793 ns/iter (+/- 2491) = 59 MB/s
no_exponential                       274104 ns/iter (+/- 466)               278257 ns/iter (+/- 906)
not_literal                            1104 ns/iter (+/- 5)                   1080 ns/iter (+/- 4)
one_pass_long_prefix                    548 ns/iter (+/- 5)                    379 ns/iter (+/- 3)
one_pass_long_prefix_not                520 ns/iter (+/- 2)                    409 ns/iter (+/- 2)
one_pass_short_a                       1326 ns/iter (+/- 18)                  1291 ns/iter (+/- 8)
one_pass_short_a_not                   1945 ns/iter (+/- 21)                  1585 ns/iter (+/- 29)
one_pass_short_b                        913 ns/iter (+/- 3)                    816 ns/iter (+/- 8)
one_pass_short_b_not                   1242 ns/iter (+/- 7)                   1401 ns/iter (+/- 9)
replace_all                            1353 ns/iter (+/- 13)                  1291 ns/iter (+/- 11)               

My guess as to what's happening, particularly in the hard benchmarks, is that mem::uninit saves a lot of time by not initializing threads that never need to be initialized, especially for larger regexps (like the hard benchmark) with a lot of instructions.

I've included a justification in a comment and a link to Russ Cox's article.

@alexcrichton

Just a few small nits left, and otherwise this looks fantastic. After a rebasing, I think this is good to go!

@chris-morgan

Argh, I didn't notice when RFC 7 was accepted that it kept the name regexp rather than shifting to regex. Can we fix that? (Citation for why we should change it: regex is the name everyone uses.)

@BurntSushi
Collaborator

r? @thestinger (I've changed the return type of the replace functions from ~str to StrBuf.)

@BurntSushi
Collaborator

@chris-morgan There didn't seem to be any strong consensus on the name so I just went with what I had. There's precedent for naming a regular expression library regexp.

@alexcrichton

While I prefer the name regex over regexp, Google Trends may not be the definitive source of information on this topic.

@chris-morgan

@alexcrichton as the trends demonstrate, regexp used to be popular. That was when ones like JavaScript and Ruby got their names. Go using Regexp puzzles me a little. The trend has shifted, though, and I think regex is now very definitely the one that is winning. See also Google Ngram Viewer on regex,regexp,regexpr, where regex is also winning over regexp (and regexpr is nowhere).

@thestinger

Python just calls the module re and that would fit in with Rust's love of short names :P. Of course a longer name for the type is still needed.

@chris-morgan

@thestinger I like re (especially for the macro), but I can see the case for being just a little more verbose. Rust is not scared of longer, more meaningful identifiers.

@thestinger

Well I would like re::RegEx/re::RegExp and re!(...) but it's not at all an important issue.

@chris-morgan

Whoa! Look at the ngrams with re added! And the trends! Obviously everyone's actually using re as the name.

@seanmonstar

I just saw after a reload. Deleted my comment.

@huonw
Owner

C++ uses regex too.

@BurntSushi
Collaborator

I don't really like re because I think it's too short. Note that Python's original regexp module was called regex. (re was added in Python 1.5 and regex wasn't removed until Python 2.5, so they had to coexist.) I do like the notion of writing re::Regexp, but I also think it's useful for the name of the crate to stand on its own.

I would not be opposed to naming the macro re! though, it seemed like there was some value in making it the same name as the crate. I don't feel strongly about it.

  • Languages that use some variation of regexp: Go, Ruby, JavaScript
  • Languages that use some variation of regex: C++, Java, Python, OCaml, Haskell

(The .NET crowd is notably missing, but they call their module RegularExpressions. Objective C calls theirs NSRegularExpression.)

I don't know what it means to choose one name over another based on Google Trends telling me that there is a 0.0000098675% difference between the two in 2008.

There seems to be a slight overall preference toward Regex and there are more language libraries using Regex as the name of their module which provides regexps. If I switch the name of the crate, type and macro to regex, Regex and regex!, respectively, will that make everyone reasonably happy?

@chris-morgan

@BurntSushi There still remains the question of Regex vs. RegEx.

@BurntSushi
Collaborator

I prefer Regex. It just looks nicer to me.

@seanmonstar

Rust convention is CamelCase for types.

@BurntSushi
Collaborator

@seanmonstar Depends on whether you consider regex a word all by itself. :P

@BurntSushi
Collaborator

If we have RegEx on the grounds that type names are CamelCase, then I guess we'd either need to use re! for the macro or reg_ex!. (Since I believe underscores delimit words in function/macro names.)

@blaenk

If I switch the name of the crate, type and macro to Regex, Regex and regex!, respectively, will that make everyone reasonably happy?

Count me for that one, Regex that is. I don't like RegEx; that uppercase 'E' adds a break in the flow of typing it, and reg_ex is just plain ugly. I also agree re is too short, and adding a p at the end of regex just makes it weird.

I think Regex is the best.

@liigo

regex +1; it's shorter and more readable.

@chris-morgan

If I switch the name of the crate, type and macro to Regex, Regex and regex!, respectively, will that make everyone reasonably happy?

You meant the crate as regex rather than Regex. Given that, +1.

@BurntSushi
Collaborator

@chris-morgan yes absolutely! Nice catch. Edited.

@BurntSushi
Collaborator

OK, I've changed the name of the crate to regex, the type to Regex and the macro to regex!.

r? @alexcrichton @thestinger

@bors bors referenced this pull request from a commit
@bors bors auto merge of #13700 : BurntSushi/rust/regexp, r=alexcrichton
Implements [RFC 7](https://github.com/rust-lang/rfcs/blob/master/active/0007-regexps.md) and will hopefully resolve #3591. The crate is marked as experimental. It includes a syntax extension for compiling regexps to native Rust code.

Embeds and passes the `basic`, `nullsubexpr` and `repetition` tests from [Glenn Fowler's (slightly modified by Russ Cox for leftmost-first semantics) testregex test suite](http://www2.research.att.com/~astopen/testregex/testregex.html). I've also hand written a plethora of other tests that exercise Unicode support, the parser, public API, etc. Also includes a `regex-dna` benchmark for the shootout.

I know the addition looks huge at first, but consider these things:

1. More than half the number of lines is dedicated to Unicode character classes.
2. Of the ~4,500 lines remaining, 1,225 of them are comments.
3. Another ~800 are tests.
4. That leaves 2500 lines for the meat. The parser is ~850 of them. The public API, compiler, dynamic VM and code generator (for `regexp!`) make up the rest.
1c91b13
@sfackler sfackler commented on the diff
src/libregex_macros/lib.rs
((32 lines not shown))
+ NormalTT, BasicMacroExpander,
+};
+use syntax::parse;
+use syntax::parse::token;
+use syntax::print::pprust;
+
+use regex::Regex;
+use regex::native::{
+ OneChar, CharClass, Any, Save, Jump, Split,
+ Match, EmptyBegin, EmptyEnd, EmptyWordBoundary,
+ Program, Dynamic, Native,
+ FLAG_NOCASE, FLAG_MULTI, FLAG_DOTNL, FLAG_NEGATED,
+};
+
+/// For the `regex!` syntax extension. Do not use.
+#[macro_registrar]
@sfackler Collaborator

I'd just mark this as #[doc(hidden)]

@BurntSushi Collaborator

Fixed. Thanks!

@bors bors referenced this pull request from a commit
@bors bors auto merge of #13700 : BurntSushi/rust/regexp, r=alexcrichton
d988040
@bors bors referenced this pull request from a commit
@bors bors auto merge of #13700 : BurntSushi/rust/regexp, r=alexcrichton
c251f5b
@bors bors referenced this pull request from a commit
@bors bors auto merge of #13700 : BurntSushi/rust/regexp, r=alexcrichton
4ee2b4e
BurntSushi added some commits
@BurntSushi BurntSushi Add a regex crate to the Rust distribution.
Also adds a regex_macros crate, which provides natively compiled
regular expressions with a syntax extension.

Closes #3591.

RFC: 0007-regexps
b8b7484
@BurntSushi BurntSushi mk: Copy fewer libraries into the host artifacts
09a8b38
@bors bors referenced this pull request from a commit
@bors bors auto merge of #13700 : BurntSushi/rust/regexp, r=alexcrichton
dd2a48f
@BurntSushi BurntSushi Ignore regex tests (regular, cfail and benchmark) on Windows (for now).
7269bc7
@alexcrichton

r+

@bors Collaborator

saw approval from alexcrichton at BurntSushi@7269bc7

@bors Collaborator

merging BurntSushi/rust/regexp = 7269bc7 into auto

@bors Collaborator

BurntSushi/rust/regexp = 7269bc7 merged ok, testing candidate = eea4909

@bors Collaborator

fast-forwarding master to auto = eea4909

@bors bors referenced this pull request from a commit
@bors bors auto merge of #13700 : BurntSushi/rust/regexp, r=alexcrichton
eea4909
@bors bors closed this
@bors bors merged commit 7269bc7 into rust-lang:master
@BurntSushi BurntSushi deleted the BurntSushi:regexp branch
@alexcrichton

Nice work @BurntSushi!

Commits on Apr 24, 2014
  1. @BurntSushi — Add a regex crate to the Rust distribution.
     Also adds a regex_macros crate, which provides natively compiled
     regular expressions with a syntax extension.
     Closes #3591.
     RFC: 0007-regexps
  2. @BurntSushi — mk: Copy fewer libraries into the host artifacts
  3. @BurntSushi — Ignore regex tests (regular, cfail and benchmark) on Windows (for now).