add regexp crate to Rust distribution (implements RFC 7) #13700

Merged
merged 3 commits
@BurntSushi
Collaborator

Implements RFC 7 and will hopefully resolve #3591. The crate is marked as experimental. It includes a syntax extension for compiling regexps to native Rust code.

Embeds and passes the basic, nullsubexpr and repetition tests from Glenn Fowler's testregex test suite (slightly modified by Russ Cox for leftmost-first semantics). I've also hand-written a plethora of other tests that exercise Unicode support, the parser, the public API, etc. Also includes a regex-dna benchmark for the shootout.

I know the addition looks huge at first, but consider these things:

  1. More than half of the lines are dedicated to Unicode character classes.
  2. Of the ~4,500 lines remaining, 1,225 are comments.
  3. Another ~800 are tests.
  4. That leaves ~2,500 lines for the meat. The parser is ~850 of them; the public API, compiler, dynamic VM and code generator (for `regexp!`) make up the rest.
@UtherII

Maybe a silly question, but wouldn't it make sense to put Unicode character classes support into the standard rust string library?

@BurntSushi
Collaborator

Possibly. But I'm not sure. What would they be used for in std::str in their current form?

Note that the matching algorithm depends on those Unicode classes being available in sorted, non-overlapping order, so that they are amenable to binary search.

One possible path forward is to leave them in regexp and rip them out if and when std::str (or something else) wants them.
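The sorted, non-overlapping invariant mentioned above is what makes class membership a binary search over codepoint ranges. A minimal sketch in modern Rust syntax (the function name and the toy ranges are illustrative, not the crate's actual tables):

```rust
// Sketch: class membership via binary search over sorted,
// non-overlapping, inclusive (start, end) codepoint ranges.
// The ranges below form a tiny hypothetical class, not real Unicode data.
fn in_class(ranges: &[(char, char)], c: char) -> bool {
    ranges
        .binary_search_by(|&(lo, hi)| {
            if c < lo {
                std::cmp::Ordering::Greater // range lies above c
            } else if c > hi {
                std::cmp::Ordering::Less // range lies below c
            } else {
                std::cmp::Ordering::Equal // lo <= c <= hi: found
            }
        })
        .is_ok()
}

fn main() {
    // A toy "class": ASCII digits plus Greek lowercase alpha..omega.
    let class = [('0', '9'), ('α', 'ω')];
    assert!(in_class(&class, '5'));
    assert!(in_class(&class, 'β'));
    assert!(!in_class(&class, 'x'));
    println!("ok");
}
```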

@BurntSushi BurntSushi referenced this pull request in BurntSushi/regexp
Closed

Expand macros before looking for string literal #2

src/libregexp/lib.rs
((182 lines not shown))
+//! # #![feature(phase)]
+//! # extern crate regexp; #[phase(syntax)] extern crate regexp_macros;
+//! # fn main() {
+//! let re = regexp!(r"[\pN\p{Greek}\p{Cherokee}]+");
+//! assert_eq!(re.find("abcΔᎠβⅠᏴγδⅡxyz"), Some((3, 23)));
+//! # }
+//! ```
+//!
+//! # Syntax
+//!
+//! The syntax supported in this crate is in almost exact correspondence
+//! with the syntax supported by RE2.
+//!
+//! ## Matching one character
+//!
+//! <pre class="rust">
@alexcrichton Owner

We've generally tried not to use HTML tags in our documentation. Is this done to keep the test runner/lexer from processing the contents? You may be able to get away with a notrust tag after the three backticks.

@BurntSushi Collaborator

Actually, the reasoning is more insidious: I was unable to write a plain \ character in a fenced code block, so I resorted to the simpler solution of just writing the HTML. (I wasn't able to determine if this was a bug in the sundown parser or elsewhere...)

@alexcrichton Owner

Oh well, it was worth a try!

src/libregexp/lib.rs
((240 lines not shown))
+//! ## Empty matches
+//!
+//! <pre class="rust">
+//! ^ the beginning of text (or start-of-line with multi-line mode)
+//! $ the end of text (or end-of-line with multi-line mode)
+//! \A only the beginning of text (even with multi-line mode enabled)
+//! \z only the end of text (even with multi-line mode enabled)
+//! \b a Unicode word boundary (\w on one side and \W, \A, or \z on the other)
+//! \B not a Unicode word boundary
+//! </pre>
+//!
+//! ## Grouping and flags
+//!
+//! <pre class="rust">
+//! (exp) numbered capture group (indexed by opening parenthesis)
+//! (?P&lt;name&gt;exp) named (also numbered) capture group (allowed chars: [_0-9a-zA-Z])
@alexcrichton Owner

You may want to double check this, but I don't think the html-escapes are necessary if this is in a backtick-enclosed block.

src/libregexp/lib.rs
((350 lines not shown))
+//! The story is a bit better with untrusted search text, since this crate's
+//! implementation provides `O(nm)` search where `n` is the number of
+//! characters in the search text and `m` is the number of instructions in a
+//! compiled expression.
+
+#![crate_id = "regexp#0.11-pre"]
+#![crate_type = "rlib"]
+#![crate_type = "dylib"]
+#![experimental]
+#![license = "MIT/ASL2"]
+#![doc(html_logo_url = "http://www.rust-lang.org/logos/rust-logo-128x128-blk-v2.png",
+ html_favicon_url = "http://www.rust-lang.org/favicon.ico",
+ html_root_url = "http://static.rust-lang.org/doc/master")]
+
+#![feature(macro_rules, phase)]
+#![deny(missing_doc)]
@alexcrichton Owner

\o/

src/libregexp/re.rs
((100 lines not shown))
+/// ```
+///
+/// Given an incorrect regular expression, `regexp!` will cause the Rust
+/// compiler to produce a compile-time error.
+/// Note that `regexp!` will compile the expression to native Rust code, which
+/// makes it much faster when searching text.
+/// More details about the `regexp!` macro can be found in the `regexp` crate
+/// documentation.
+#[deriving(Clone)]
+#[allow(visible_private_types)]
+pub struct Regexp {
+ /// The representation of `Regexp` is exported to support the `regexp!`
+ /// syntax extension. Do not rely on it.
+ ///
+ /// See the comments for the `program` module in `lib.rs` for a more
+ /// detailed explanation for what `regexp!` requires.
@alexcrichton Owner

In an ideal world we could mark each field #[experimental] to have the compiler generate warnings. This is certainly ok for now though.

src/libregexp/re.rs
((145 lines not shown))
+impl Regexp {
+ /// Compiles a dynamic regular expression. Once compiled, it can be
+ /// used repeatedly to search, split or replace text in a string.
+ ///
+ /// When possible, you should prefer the `regexp!` macro since it is
+ /// safer and always faster.
+ ///
+ /// If an invalid expression is given, then an error is returned.
+ pub fn new(re: &str) -> Result<Regexp, parse::Error> {
+ let ast = try!(parse::parse(re));
+ let (prog, names) = Program::new(ast);
+ Ok(Regexp { original: re.to_owned(), names: names, p: Dynamic(prog) })
+ }
+}
+
+impl Regexp {
@alexcrichton Owner

Could you merge these two impl blocks?

@BurntSushi Collaborator

Ah, whoops. Fixed. Remnant from a bygone era...

src/libregexp/re.rs
((325 lines not shown))
+ /// let result = re.replace("Springsteen, Bruce", "$first $last");
+ /// assert_eq!(result, ~"Bruce Springsteen");
+ /// # }
+ /// ```
+ ///
+ /// Note that using `$2` instead of `$first` or `$1` instead of `$last`
+ /// would produce the same result. To write a literal `$` use `$$`.
+ ///
+ /// Finally, sometimes you just want to replace a literal string with no
+ /// submatch expansion. This can be done by wrapping a string with
+ /// `NoExpand`:
+ ///
+ /// ```rust
+ /// # #![feature(phase)]
+ /// # extern crate regexp; #[phase(syntax)] extern crate regexp_macros;
+ /// # use regexp::NoExpand; fn main() {
@alexcrichton Owner

Could you uncomment the import of NoExpand here?

@BurntSushi Collaborator

Done.

src/libregexp/re.rs
((369 lines not shown))
+ let mut last_match = 0u;
+ let mut i = 0;
+ for cap in self.captures_iter(text) {
+ // It'd be nicer to use the 'take' iterator instead, but it seemed
+ // awkward given that '0' => no limit.
+ if limit > 0 && i >= limit {
+ break
+ }
+ i += 1;
+
+ let (s, e) = cap.pos(0).unwrap(); // captures only reports matches
+ new.push_str(unsafe { raw::slice_bytes(text, last_match, s) });
+ new.push_str(rep.reg_replace(&cap).as_slice());
+ last_match = e;
+ }
+ new.push_str(unsafe { raw::slice_bytes(text, last_match, text.len()) });
@alexcrichton Owner

Did you see a good perf improvement from using unsafe slice_bytes methods? I would have figured that the allocation going on would dominate the bounds checking.

@BurntSushi Collaborator

I suspect you're right. I think I had that in there because I was mimicking the std replace, but that's not a good reason.

I just removed them and I cannot produce a benchmark that can tell the difference.

They've been removed now. Less unsafe. Woohoo.
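With `slice_bytes` gone, the loop is just safe slicing. A self-contained sketch of the same shape (using a literal pattern via `match_indices` rather than the crate's capture iterator, and keeping the PR's convention that `limit == 0` means no limit; the function name is my own):

```rust
// Sketch of the replace loop in safe code: slice the haystack around
// each match and append, with `limit == 0` meaning "no limit".
// Uses a literal pattern instead of regex captures for self-containment.
fn replace_limited(text: &str, pat: &str, rep: &str, limit: usize) -> String {
    let mut new = String::with_capacity(text.len());
    let mut last_match = 0;
    for (i, (s, m)) in text.match_indices(pat).enumerate() {
        if limit > 0 && i >= limit {
            break;
        }
        new.push_str(&text[last_match..s]); // safe slicing, no raw::slice_bytes
        new.push_str(rep);
        last_match = s + m.len();
    }
    new.push_str(&text[last_match..]); // tail after the final match
    new
}

fn main() {
    assert_eq!(replace_limited("a-b-c", "-", "+", 0), "a+b+c");
    assert_eq!(replace_limited("a-b-c", "-", "+", 1), "a+b-c");
    println!("ok");
}
```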

src/libregexp/re.rs
((381 lines not shown))
+ new.push_str(rep.reg_replace(&cap).as_slice());
+ last_match = e;
+ }
+ new.push_str(unsafe { raw::slice_bytes(text, last_match, text.len()) });
+
+ // The lengths we will go to avoid allocation.
+ // This has a *dramatic* effect on the regex-dna benchmark (and indeed,
+ // any code that uses 'replace' on a large corpus of text multiple
+ // times). The trick is to avoid the obscene amount of allocation
+ // currently done in slice::from_iter. I've been promised that DST will
+ // fix this.
+ //
+ // The following is based on the code in slice::from_iter, but
+ // shortened since we know we're dealing with bytes. The key is that
+ // we already have a Vec<u8>, so there's no reason to re-collect it
+ // (which is what from_iter currently does).
@alexcrichton Owner

Can you tag this with a FIXME pointing at #12938? You may also want to mention that this should look exactly like:

new.into_owned()
@BurntSushi Collaborator

Fixed.

I don't understand why ~str is being returned here at all. It will have overhead when DST lands too, as the capacity is being lost and it will need to shrink the allocation. The convention in Rust is to return the type you have directly, rather than making non-free boxing choices for your caller that they can do themselves.

@BurntSushi Collaborator

Well, the reason it's returning ~str is that std::str::replace also returns ~str, so I figured it'd be good to stay consistent.

Also, if it returned a StrBuf, it would be difficult for the caller to safely and efficiently transform it into a ~str. And I don't mean avoiding the shrinking, but avoiding the redundant collect in the from_iter implementation of ~[].

Could we leave it as ~str with the note to revisit it once DST happens?

The StrBuf type is more useful to the caller than ~str. It has the same functionality available along with the ability to be resized. The only reason std::str::replace returns ~str is that it's a legacy function. There's no need to be consistent with legacy design decisions pre-dating StrBuf.

This will be inefficient and unidiomatic when the DST changes happen too. You have a StrBuf internally, so you should be returning it to the caller to do as they wish with it. There's no advantage to discarding the capacity and forcing shrinking of the allocation. It's the same anti-pattern as ~T when the callee has T internally. If the caller wants to lose the excess capacity, they can do it themselves.

@alexcrichton Owner

StrBuf vs ~str is still in flux. Let's leave this as is and revisit it later. This is not a reason to block this crate in its entirety.

Whether or not it's in flux doesn't apply here. It uses StrBuf internally and should be returning StrBuf. It's reason enough to block the pull request for me, because I'm certain the existing API of various modules is going to used as an argument against changing it later. I have no problem putting an r- on this over it and turning it into a long mailing list discussion.

In the current language, it's also even more inefficient than it would be with DST. So not only is it a dubious API design for the future, but it's going out of the way to use an opinionated API even though it doesn't work well today.

Anyway, it seems pretty odd to use unsafe code for micro-optimizations when it could be just using safe code that will remain faster than into_owned in the future. Since when is using totally unnecessary unsafe code acceptable?

@alexcrichton Owner

This is not totally unnecessary unsafe code, it is clearly documented as being necessary for some extra perf on the shootout benchmark. Additionally, it is clearly documented as to what exactly it is replacing, and why the current machinery doesn't quite cut it.

Once DST lands (soon), this unsafe block will go away and will no longer be necessary; there is a clear path forward. Today, foo.move_iter().collect() allocates a temporary Vec<T>, then allocates a ~[T], then copies the data. This is why into_owned() is so slow today.

It's totally unnecessary because it can return StrBuf here. This avoids the unsafe code and will avoid other costs in the future from dropping the excess capacity. I don't think it's acceptable to sneak in unsafe code to push your view on the string/vector issue. I'm strongly against this and will do everything I can to stop this from landing in the current form. There's a clear and simple way to do it without any unsafe code and you're only in favour of using it because it enshrines returning ~str in the API.

@alexcrichton Owner

You are singling out one use case where StrBuf could be returned but isn't. In today's Rust, it is consistent to return ~str, not StrBuf. Regardless of your opinion about what it should be, that is the current state of affairs.

If you would like to change values to returning StrBuf, then I recommend you do so in a separate issue or PR which discusses all return values, not just this one use case in an experimental library that hasn't been merged yet. Focusing on this one case is not very helpful.

I can understand you being strongly against returning ~str where a StrBuf is available, but I do not believe that this is the PR to make that decision.

Please do not take my comments as an endorsement of returning ~str. That is a misconception of what I am saying. I would like to merge this library because it will have significant benefit to all users of rust. Blocking this over an ongoing discussion which has no current resolution is not really helping anyone.

I'm not singling out one use case. I'm reviewing a pull request adding new code to the standard library, and am strongly opposed to merging it while it has completely unnecessary unsafe code doing the opposite of optimization. There is no rationale for why this unsafe code is used rather than returning the StrBuf and there is no rationale for why a performance hit should be taken in the future to return it. The burden of proof rests on the person proposing we add more unsafe code, not me.

Please do not take my comments as an endorsement of returning ~str.

You already endorsed it by misrepresenting your view on the topic as the established consensus in your post to the mailing list.

That is a misconception of what I am saying. I would like to merge this library because it will have significant benefit to all users of rust. Blocking this over an ongoing discussion which has no current resolution is not really helping anyone.

It should not go in as long as it's going to great lengths with unsafe code to back up a minority opinion on the Vec<T> issue. It could simply return the StrBuf it has internally instead of using a convoluted unsafe workaround.

@BurntSushi Collaborator

The only reason it's returning a ~str is because other parts of std also return ~str even when a StrBuf could be returned. I made this decision because it's consistent. The unsafe code followed from that. I did not make this decision because of an opinion on the Vec<T> issue.

With that said, I'm happy to change to StrBuf. (I'd also change the regex-dna benchmark to use a StrBuf. This would actually avoid a copy for each replacement done, so it'd probably improve performance.)

I'm not familiar with your governance model, so I'll otherwise keep quiet. But I just wanted to make sure that my point of view was clear.

The rest of the standard library uses ~str because StrBuf never existed until recently and ~str used to be resizeable. If this case had a choice between ~str and StrBuf without requiring a conversion between them, then I wouldn't have mentioned anything.

However, at the moment it's going to great lengths to avoid simply returning the StrBuf that's inside the function. It's hurting performance in the caller and it's adding unnecessary unsafety.
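For reference, the pattern being asked for is simple in today's terms (StrBuf became String, and ~str is roughly Box<str>): return the buffer you built and let the caller shrink it if they care. The function name below is illustrative:

```rust
// Sketch: return the growable buffer (StrBuf, today's String) directly;
// the caller can drop the excess capacity themselves if they want to.
fn build_reply(name: &str) -> String {
    let mut buf = String::with_capacity(64);
    buf.push_str("hello, ");
    buf.push_str(name);
    buf // no conversion, no shrinking, no unsafe
}

fn main() {
    let s = build_reply("rust");
    assert_eq!(s, "hello, rust");
    // The caller opts in to shrinking (the ~str conversion did this implicitly):
    let boxed: Box<str> = s.into_boxed_str();
    assert_eq!(&*boxed, "hello, rust");
    println!("ok");
}
```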

src/libregexp/re.rs
((414 lines not shown))
+/// It can be used with `replace` and `replace_all` to do a literal
+/// string replacement without expanding `$name` to their corresponding
+/// capture groups.
+///
+/// `'t` is the lifetime of the literal text.
+pub struct NoExpand<'t>(pub &'t str);
+
+/// Replacer describes types that can be used to replace matches in a string.
+pub trait Replacer {
+ /// Returns a possibly owned string that is used to replace the match
+/// corresponding to the `caps` capture group.
+ ///
+ /// The `'a` lifetime refers to the lifetime of a borrowed string when
+ /// a new owned string isn't needed (e.g., for `NoExpand`).
+ fn reg_replace<'a>(&'a self, caps: &Captures) -> MaybeOwned<'a>;
+}
@alexcrichton Owner

Due to #12224, implementing this trait for a closure will require this to use &mut self instead of &self. It looks like it'd be a pretty easy drop-in adjustment though (see modifications in #13686 to the CharEq trait for an example)

@BurntSushi Collaborator

Easy peasy. Fixed.

src/libregexp/re.rs
((193 lines not shown))
+ /// Returns the capture groups corresponding to the leftmost-first
+ /// match in `text`. Capture group `0` always corresponds to the entire
+ /// match. If no match is found, then `None` is returned.
+ ///
+ /// You should only use `captures` if you need access to submatches.
+ /// Otherwise, `find` is faster for discovering the location of the overall
+ /// match.
+ pub fn captures<'t>(&self, text: &'t str) -> Option<Captures<'t>> {
+ let caps = exec(self, Submatches, text);
+ Captures::new(self, text, caps)
+ }
+
+ /// Returns an iterator over all the non-overlapping capture groups matched
+ /// in `text`. This is operationally the same as `find_iter` (except it
+ /// yields information about submatches).
+ pub fn captures_iter<'r, 't>(&'r self, text: &'t str)
@alexcrichton Owner

If you have time (certainly not a blocker), could you add a small example to this and the above few methods? I think I understand how to use them, but examples are always super helpful!

(again, not a blocker, just a cherry on top)

@BurntSushi Collaborator

I hate cherries, but they sure look nice. Added!

src/libregexp/re.rs
((564 lines not shown))
+ return None
+ }
+ Some((self.locs.get(s).unwrap(), self.locs.get(e).unwrap()))
+ }
+
+ /// Returns the matched string for the capture group `i`.
+ /// If `i` isn't a valid capture group or didn't match anything, then the
+ /// empty string is returned.
+ pub fn at(&self, i: uint) -> &'t str {
+ match self.pos(i) {
+ None => "",
+ Some((s, e)) => {
+ self.text.slice(s, e)
+ }
+ }
+ }
@alexcrichton Owner

I'm curious if this type could one day implement the Index trait (basically leverage the foo[bar] syntax). Do you know which of these methods would be most appropriate for that?

It seems a bit odd to me that pos has the same return value for an empty match and an out-of-bounds index (and that kind of leaks over to at as well). Did you find precedent in other regex engines? Just something to think about; I'm ok with it as-is due to the len() method being available.

@BurntSushi Collaborator

Honestly, I stayed away from the Index trait because there seems to be a lot of buzz about it being removed or substantially changed. I figured that if we're going to live with Vec not having index notation, then we should probably also live with Captures not having it either. (For the time being.) It would be nice if it could support caps[1] and caps["name"] (corresponding to the at and name methods), but I don't think that's currently possible? I didn't dig too much.

RE pos: Yes, it is a bit odd. The only alternatives I can think of are to assert that the index is in range or to encode the failure in the type. I don't think I really checked precedent for this in other libraries. Python seems to raise an IndexError. Similarly for asking for a named capture group that doesn't exist.

At the moment, I'm thinking that handling out-of-bounds like the rest of the standard lib does might be the best way to go (and this would, e.g., be consistent with Python). My least favorite option is to encode the failure into the return type.

I don't think any new Index implementations should be added, because they're likely all going to need to be removed before landing the new traits. It's just going to create unnecessary churn.

@alexcrichton Owner

Oh no, I do not think that this should implement Index now, I was merely wondering about the future and how this may leverage it.

Let's leave these as-is. This is why the crate is experimental, I was just musing.
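The trade-off discussed above (empty string vs. encoding failure in the type) can be made concrete with a toy capture holder; the type and field names here are hypothetical, not the PR's actual definitions:

```rust
// Toy illustration of the API choice discussed above: `pos` encodes
// "no such group / group didn't match" as None, and `at` flattens
// both cases to the empty string.
struct Caps<'t> {
    text: &'t str,
    locs: Vec<Option<(usize, usize)>>,
}

impl<'t> Caps<'t> {
    fn pos(&self, i: usize) -> Option<(usize, usize)> {
        // Out-of-bounds index and non-participating group both yield None.
        self.locs.get(i).copied().flatten()
    }

    fn at(&self, i: usize) -> &'t str {
        match self.pos(i) {
            None => "", // empty match and bad index are indistinguishable here
            Some((s, e)) => &self.text[s..e],
        }
    }
}

fn main() {
    let caps = Caps { text: "abc", locs: vec![Some((0, 3)), None] };
    assert_eq!(caps.at(0), "abc");
    assert_eq!(caps.at(1), ""); // group didn't participate
    assert_eq!(caps.at(9), ""); // out of bounds: same answer
    println!("ok");
}
```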

src/libregexp/re.rs
((617 lines not shown))
+ /// If `name` isn't a valid capture group (whether the name doesn't exist or
+ /// isn't a valid index), then it is replaced with the empty string.
+ ///
+ /// To write a literal `$` use `$$`.
+ pub fn expand(&self, text: &str) -> ~str {
+ // How evil can you get?
+ // FIXME: Don't use regexes for this. It's completely unnecessary.
+ let re = Regexp::new(r"(^|[^$]|\b)\$(\w+)").unwrap();
+ let text = re.replace_all(text, |refs: &Captures| -> ~str {
+ let (pre, name) = (refs.at(1), refs.at(2));
+ pre + match from_str::<uint>(name) {
+ None => self.name(name).to_owned(),
+ Some(i) => self.at(i).to_owned(),
+ }
+ });
+ text.replace("$$", "$")
@alexcrichton Owner

regexes used to implement regexes! (I thought bootstrapping a compiler was hard!)

@BurntSushi Collaborator

:P
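What the FIXME asks for, a hand-rolled scanner instead of a regex, might look like the following sketch. The grammar is simplified (`$$` escapes a literal `$`, a name is a maximal run of alphanumerics/underscores), and the function name and map-based lookup are my own, not the PR's code:

```rust
use std::collections::HashMap;

// Sketch of `$name` expansion with a hand-rolled scanner instead of a
// regex. `$$` produces a literal `$`; unknown names expand to "".
fn expand(template: &str, groups: &HashMap<&str, &str>) -> String {
    let mut out = String::new();
    let mut chars = template.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '$' {
            out.push(c);
            continue;
        }
        if chars.peek() == Some(&'$') {
            chars.next();
            out.push('$'); // `$$` escapes a literal dollar sign
            continue;
        }
        // Collect the maximal run of word-ish characters as the group name.
        let mut name = String::new();
        while let Some(&n) = chars.peek() {
            if n.is_alphanumeric() || n == '_' {
                name.push(n);
                chars.next();
            } else {
                break;
            }
        }
        // Unknown names expand to the empty string, as in the PR.
        out.push_str(groups.get(name.as_str()).copied().unwrap_or(""));
    }
    out
}

fn main() {
    let mut g = HashMap::new();
    g.insert("first", "Bruce");
    g.insert("last", "Springsteen");
    assert_eq!(expand("$first $last", &g), "Bruce Springsteen");
    assert_eq!(expand("$$5 for $first", &g), "$5 for Bruce");
    println!("ok");
}
```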

src/libregexp/testdata/LICENSE
((4 lines not shown))
+copy of THIS SOFTWARE FILE (the "Software"), to deal in the Software
+without restriction, including without limitation the rights to use,
+copy, modify, merge, publish, distribute, and/or sell copies of the
+Software, and to permit persons to whom the Software is furnished to do
+so, subject to the following disclaimer:
+
+THIS SOFTWARE IS PROVIDED BY AT&T ``AS IS'' AND ANY EXPRESS OR IMPLIED
+WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+IN NO EVENT SHALL AT&T BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@alexcrichton Owner

cc @brson, just confirming this is ok

@alexcrichton Owner

I got confirmation this was ok.

src/libregexp/vm.rs
((194 lines not shown))
+ fn step(&self, groups: &mut [Option<uint>], nlist: &mut Threads,
+ caps: &mut [Option<uint>], pc: uint)
+ -> StepState {
+ match *self.prog.insts.get(pc) {
+ Match => {
+ match self.which {
+ Exists => {
+ return StepMatchEarlyReturn
+ }
+ Location => {
+ groups[0] = caps[0];
+ groups[1] = caps[1];
+ return StepMatch
+ }
+ Submatches => {
+ unsafe { groups.copy_memory(caps) }
@alexcrichton Owner

Did you see that manual loops didn't optimize to a memcpy? I would expect something like this to optimize to a memcpy:

for (slot, val) in groups.mut_iter().zip(caps.iter()) {
    *slot = *val;
}
@BurntSushi Collaborator

I did not realize that! Awesome. I can't seem to produce any significant and consistent change in benchmark results. I've removed all 4 unsafe blocks for using copy_memory.
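In modern syntax, both the zip loop suggested above and today's shorthand compile down to a memcpy for `Copy` element types; the function name is illustrative:

```rust
// Safe equivalents of the unsafe `copy_memory` call: a zip loop (what
// the reviewer suggested, modulo renamed iterator methods) and today's
// `copy_from_slice`, both of which LLVM lowers to a memcpy for `Copy` types.
fn copy_zip(dst: &mut [Option<usize>], src: &[Option<usize>]) {
    for (slot, val) in dst.iter_mut().zip(src.iter()) {
        *slot = *val;
    }
}

fn main() {
    let src = [Some(3), None, Some(7)];

    let mut groups = [None; 3];
    copy_zip(&mut groups, &src);
    assert_eq!(groups, src);

    // Modern shorthand (panics if the lengths differ):
    let mut groups2 = [None; 3];
    groups2.copy_from_slice(&src);
    assert_eq!(groups2, src);
    println!("ok");
}
```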

src/libregexp/vm.rs
((459 lines not shown))
+ size: 0,
+ }
+ }
+
+ fn add(&mut self, pc: uint, groups: &[Option<uint>], empty: bool) {
+ let t = self.queue.get_mut(self.size);
+ t.pc = pc;
+ match (empty, self.which) {
+ (_, Exists) | (true, _) => {},
+ (false, Location) => {
+ *t.groups.get_mut(0) = groups[0];
+ *t.groups.get_mut(1) = groups[1];
+ }
+ (false, Submatches) => unsafe {
+ t.groups.as_mut_slice().copy_memory(groups)
+ }
@alexcrichton Owner

As above, I would be curious about the usage of unsafe here.

@BurntSushi Collaborator

Removed. See #13700 (comment)

src/libregexp_macros/lib.rs
((201 lines not shown))
+ i += 1;
+ }
+ ::std::mem::swap(&mut clist, &mut nlist);
+ nlist.empty();
+ }
+ match self.which {
+ Exists if matched => vec![Some(0u), Some(0u)],
+ Exists => vec![None, None],
+ Location | Submatches => {
+ let elts = groups.len();
+ let mut v = Vec::with_capacity(elts);
+ unsafe {
+ v.set_len(elts);
+ ::std::ptr::copy_nonoverlapping_memory(
+ v.as_mut_ptr(), groups.as_ptr(), elts);
+ }
@alexcrichton Owner

I'm curious why unsafe was used here rather than an iterator and a collect?

@BurntSushi Collaborator

Removed! Same as before: no difference in benchmarks when using, e.g., groups.iter().map(|x| *x).collect(). Awesome.

There are now only three uses of unsafe: two for using uninitialized memory for sparse sets and one for reducing allocation in string replacement. (Which will hopefully be removed at some point.)
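The collect-based replacement mentioned above is one line; `copied()` is the modern spelling of `.map(|x| *x)`. A sketch (the function name is my own):

```rust
// Safe replacement for the with_capacity/set_len/copy_nonoverlapping
// pattern: a single collect (or `to_vec`) allocates exactly once here.
fn snapshot(groups: &[Option<usize>]) -> Vec<Option<usize>> {
    groups.iter().copied().collect() // equivalent to groups.to_vec()
}

fn main() {
    let groups = [Some(0), Some(4), None];
    let v = snapshot(&groups);
    assert_eq!(v, vec![Some(0), Some(4), None]);
    println!("ok");
}
```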

src/libregexp_macros/lib.rs
((424 lines not shown))
+ let arms = self.prog.insts.iter().enumerate().map(|(pc, inst)| {
+ let nextpc = pc + 1;
+ let body = match *inst {
+ Match => {
+ quote_expr!(self.cx, {
+ match self.which {
+ Exists => {
+ return StepMatchEarlyReturn
+ }
+ Location => {
+ groups[0] = caps[0];
+ groups[1] = caps[1];
+ return StepMatch
+ }
+ Submatches => {
+ unsafe { groups.copy_memory(caps.as_slice()) }
@alexcrichton Owner

This seems to generate a good number of unsafe blocks. Did the common idioms not optimize to essentially what the unsafe blocks are doing?

If necessary, it would be nice to have some comments about why unsafe is necessary in these locations.

@BurntSushi Collaborator

Removed. #13700 (comment)

src/libregexp_macros/lib.rs
((247 lines not shown))
+ queue: [Thread, ..$num_insts],
+ sparse: [uint, ..$num_insts],
+ size: uint,
+ }
+
+ impl Threads {
+ fn new(which: MatchKind) -> Threads {
+ Threads {
+ which: which,
+ queue: unsafe { ::std::mem::uninit() },
+ sparse: unsafe { ::std::mem::uninit() },
+ size: 0,
+ }
+ }
+
+ #[inline(always)]
@alexcrichton Owner

We're generally trying to avoid inline(always) annotations; did you run into problems if these were tagged with #[inline]?

@BurntSushi Collaborator

I have no idea why I used inline(always).

Changed all of them to inline. No perf difference. Fixed.

@alexcrichton

This looks even better than I thought it was going to be, amazing work, and thank you so much!

@alexcrichton

Ah, one more small thing: we're trying to ensure that commits can be traced back to the RFC they implemented, so could you make sure that this shows up at the bottom of the first commit message? (You can wait to rebase until later.)

RFC: 0007-regexps
src/libregexp/re.rs
((692 lines not shown))
+ None => "",
+ Some((s, e)) => {
+ self.text.slice(s, e)
+ }
+ }
+ }
+
+ /// Returns the matched string for the capture group named `name`.
+ /// If `name` isn't a valid capture group or didn't match anything, then
+ /// the empty string is returned.
+ pub fn name(&self, name: &str) -> &'t str {
+ match self.named {
+ None => "",
+ Some(ref h) => {
+ match h.find(&name.to_owned()) {
+ None => "",

Could you use h.find_equiv(name) here in order to avoid allocating an owned string?

@BurntSushi Collaborator

Indeed I can. Fixed.
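`find_equiv` is long gone; in today's std the same no-allocation lookup falls out of the `Borrow` bound on `HashMap::get`, which accepts a `&str` against `String` keys. A sketch (the function name is illustrative):

```rust
use std::collections::HashMap;

// Modern equivalent of `find_equiv`: `HashMap::<String, _>::get` takes
// any borrowed form of the key (`&str` via `Borrow<str>`), so no owned
// `String` needs to be allocated just to perform the lookup.
fn group_index(named: &HashMap<String, usize>, name: &str) -> Option<usize> {
    named.get(name).copied() // no `name.to_owned()` required
}

fn main() {
    let mut named = HashMap::new();
    named.insert("year".to_string(), 2);
    assert_eq!(group_index(&named, "year"), Some(2));
    assert_eq!(group_index(&named, "month"), None);
    println!("ok");
}
```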

@BurntSushi
Collaborator

@alexcrichton Thanks! And thanks very much for all your comments so far. Very helpful. I will make sure to add RFC: 0007-regexps to the commit message.

Also, when I rebase, won't it change my commit history? I assume I'll have to force push. (Just want to make sure that's what's expected.)

@thestinger

@BurntSushi: Yeah, you'll have to force push.

src/test/compile-fail/syntax-extension-regexp-invalid.rs
@@ -0,0 +1,27 @@
+// Copyright 2014 The Rust Project Developers. See the COPYRIGHT
+// file at the top-level directory of this distribution and at
+// http://rust-lang.org/COPYRIGHT.
+//
+// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
+// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
+// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
+// option. This file may not be copied, modified, or distributed
+// except according to those terms.
+
+// ignore-stage1
+// ignore-cross-compile #12102
@alexcrichton Owner

The most recently landed PR actually makes it so ignore-cross-compile isn't necessary. The stack of commits will need to get rebased anyway, so this is just something to include in the rebasing.

src/libregexp_macros/lib.rs
((233 lines not shown))
+ groups: Captures,
+ }
+
+ struct Threads {
+ which: MatchKind,
+ queue: [Thread, ..$num_insts],
+ sparse: [uint, ..$num_insts],
+ size: uint,
+ }
+
+ impl Threads {
+ fn new(which: MatchKind) -> Threads {
+ Threads {
+ which: which,
+ queue: unsafe { ::std::mem::uninit() },
+ sparse: unsafe { ::std::mem::uninit() },
@alexcrichton Owner

How come this uninit is needed? It seems quite unsafe. If it's necessary for perf, can you add a comment explaining why?

@BurntSushi Collaborator

They are not needed, but these unsafe blocks actually do make a performance difference. The trick being used here is to represent sparse sets using uninitialized memory. It's described in more detail here: http://research.swtch.com/sparse

In this case, I can actually produce evidence. The first column is without unsafe and the second column is the code as you see it:

anchored_literal_long_match             264 ns/iter (+/- 2)                    165 ns/iter (+/- 4)
anchored_literal_long_non_match        5867 ns/iter (+/- 8)                   5822 ns/iter (+/- 45)
anchored_literal_short_match            232 ns/iter (+/- 8)                    161 ns/iter (+/- 2)
anchored_literal_short_non_match        495 ns/iter (+/- 1)                    424 ns/iter (+/- 3)
easy0_1K                               1808 ns/iter (+/- 111) = 566 MB/s      1277 ns/iter (+/- 170) = 801 MB/s
easy0_32                                330 ns/iter (+/- 2) = 96 MB/s          276 ns/iter (+/- 3) = 115 MB/s
easy0_32K                             48878 ns/iter (+/- 650) = 670 MB/s     33323 ns/iter (+/- 968) = 983 MB/s
easy1_1K                               1881 ns/iter (+/- 556) = 544 MB/s      1794 ns/iter (+/- 684) = 570 MB/s
easy1_32                                391 ns/iter (+/- 93) = 81 MB/s         341 ns/iter (+/- 70) = 93 MB/s
easy1_32K                             49735 ns/iter (+/- 2484) = 658 MB/s    48367 ns/iter (+/- 2864) = 677 MB/s
hard_1K                               47163 ns/iter (+/- 268) = 21 MB/s      35070 ns/iter (+/- 169) = 29 MB/s
hard_32                                1840 ns/iter (+/- 38) = 17 MB/s        1389 ns/iter (+/- 17) = 23 MB/s
hard_32K                            1497950 ns/iter (+/- 5921) = 21 MB/s   1112845 ns/iter (+/- 2605) = 29 MB/s
literal                                 142 ns/iter (+/- 2)                    131 ns/iter (+/- 0)
match_class                            1403 ns/iter (+/- 6)                   1394 ns/iter (+/- 6)
match_class_in_range                   1448 ns/iter (+/- 3)                   1347 ns/iter (+/- 4)
medium_1K                             17310 ns/iter (+/- 255) = 59 MB/s      17475 ns/iter (+/- 166) = 58 MB/s
medium_32                               888 ns/iter (+/- 29) = 36 MB/s         835 ns/iter (+/- 34) = 38 MB/s
medium_32K                           542510 ns/iter (+/- 2595) = 60 MB/s    550793 ns/iter (+/- 2491) = 59 MB/s
no_exponential                       274104 ns/iter (+/- 466)               278257 ns/iter (+/- 906)
not_literal                            1104 ns/iter (+/- 5)                   1080 ns/iter (+/- 4)
one_pass_long_prefix                    548 ns/iter (+/- 5)                    379 ns/iter (+/- 3)
one_pass_long_prefix_not                520 ns/iter (+/- 2)                    409 ns/iter (+/- 2)
one_pass_short_a                       1326 ns/iter (+/- 18)                  1291 ns/iter (+/- 8)
one_pass_short_a_not                   1945 ns/iter (+/- 21)                  1585 ns/iter (+/- 29)
one_pass_short_b                        913 ns/iter (+/- 3)                    816 ns/iter (+/- 8)
one_pass_short_b_not                   1242 ns/iter (+/- 7)                   1401 ns/iter (+/- 9)
replace_all                            1353 ns/iter (+/- 13)                  1291 ns/iter (+/- 11)               

My guess as to what's happening, particularly in the hard benchmarks, is that mem::uninit saves a lot of time by not initializing threads that never need to be initialized, especially for larger regexps (like the hard benchmark) with a lot of instructions.

I've included a justification in a comment and a link to Russ Cox's article.

@alexcrichton

Just a few small nits left, and otherwise this looks fantastic. After a rebasing, I think this is good to go!

@chris-morgan

Argh, I didn't notice when RFC 7 was accepted that it kept the name regexp rather than shifting to regex. Can we fix that? (Citation for why we should change it: regex is the name everyone uses.)

@BurntSushi
Collaborator

r? @thestinger (I've changed the return type of the replace functions from ~str to StrBuf.)

@BurntSushi
Collaborator

@chris-morgan There didn't seem to be any strong consensus on the name so I just went with what I had. There's precedent for naming a regular expression library regexp.

@alexcrichton

While I prefer the name regex over regexp, Google Trends may not be the definitive source of information on this topic.

@chris-morgan

@alexcrichton as the trends demonstrate, regexp used to be popular. That was when ones like JavaScript and Ruby got their names. Go using Regexp puzzles me a little. The trend has shifted, though, and I think regex is now very definitely the one that is winning. See also Google Ngram Viewer on regex,regexp,regexpr, where regex is also winning over regexp (and regexpr is nowhere).

@thestinger

Python just calls the module re and that would fit in with Rust's love of short names :P. Of course a longer name for the type is still needed.

@chris-morgan

@thestinger I like re (especially for the macro), but I can see the case for being just a little more verbose. Rust is not scared of longer, more meaningful identifiers.

@thestinger

Well I would like re::RegEx/re::RegExp and re!(...) but it's not at all an important issue.

@chris-morgan

Whoa! Look at the ngrams with re added! And the trends! Obviously everyone's actually using re as the name.

@seanmonstar

I just saw after a reload. Deleted my comment.

@huonw
Owner

C++ uses regex too.

@BurntSushi
Collaborator

I don't really like re because I think it's too short. Note that Python's original regexp module was called regex. (re was added in Python 1.5 and regex wasn't removed until Python 2.5, so they had to coexist.) I do like the notion of writing re::Regexp, but I also think it's useful for the name of the crate to stand on its own.

I would not be opposed to naming the macro re! though, it seemed like there was some value in making it the same name as the crate. I don't feel strongly about it.

  • Languages that use some variation of regexp: Go, Ruby, JavaScript
  • Languages that use some variation of regex: C++, Java, Python, OCaml, Haskell

(The .NET crowd is notably missing, but they call their module RegularExpressions. Objective C calls theirs NSRegularExpression.)

I don't know what it means to choose one name over another based on Google Trends telling me that there is a 0.0000098675% difference between the two in 2008.

There seems to be a slight overall preference toward Regex and there are more language libraries using Regex as the name of their module which provides regexps. If I switch the name of the crate, type and macro to regex, Regex and regex!, respectively, will that make everyone reasonably happy?

@chris-morgan

@BurntSushi There still remains the question of Regex vs. RegEx.

@BurntSushi
Collaborator

I prefer Regex. It just looks nicer to me.

@seanmonstar

Rust convention is CamelCase for types.

@BurntSushi
Collaborator

@seanmonstar Depends on whether you consider regex a word all by itself. :P

@BurntSushi
Collaborator

If we have RegEx on the grounds that type names are CamelCase, then I guess we'd either need to use re! for the macro or reg_ex!. (Since I believe underscores delimit words in function/macro names.)

@blaenk

If I switch the name of the crate, type and macro to Regex, Regex and regex!, respectively, will that make everyone reasonably happy?

Count me for that one, Regex that is. I don't like RegEx; that uppercase 'E' adds a break in the flow of typing it, and reg_ex is just plain ugly. I also agree re is too short, and adding a p at the end of regex just makes it weird.

I think Regex is the best.

@liigo

regex +1; it's shorter and more readable.

@chris-morgan

If I switch the name of the crate, type and macro to Regex, Regex and regex!, respectively, will that make everyone reasonably happy?

You meant the crate as regex rather than Regex. Given that, +1.

@BurntSushi
Collaborator

@chris-morgan yes absolutely! Nice catch. Edited.

@BurntSushi
Collaborator

OK, I've changed the name of the crate to regex, the type to Regex and the macro to regex!.

r? @alexcrichton @thestinger

@bors bors referenced this pull request from a commit
@bors bors auto merge of #13700 : BurntSushi/rust/regexp, r=alexcrichton
Implements [RFC 7](https://github.com/rust-lang/rfcs/blob/master/active/0007-regexps.md) and will hopefully resolve #3591. The crate is marked as experimental. It includes a syntax extension for compiling regexps to native Rust code.

Embeds and passes the `basic`, `nullsubexpr` and `repetition` tests from [Glenn Fowler's (slightly modified by Russ Cox for leftmost-first semantics) testregex test suite](http://www2.research.att.com/~astopen/testregex/testregex.html). I've also hand written a plethora of other tests that exercise Unicode support, the parser, public API, etc. Also includes a `regex-dna` benchmark for the shootout.

I know the addition looks huge at first, but consider these things:

1. More than half the number of lines is dedicated to Unicode character classes.
2. Of the ~4,500 lines remaining, 1,225 of them are comments.
3. Another ~800 are tests.
4. That leaves 2500 lines for the meat. The parser is ~850 of them. The public API, compiler, dynamic VM and code generator (for `regexp!`) make up the rest.
1c91b13
@sfackler sfackler commented on the diff
src/libregex_macros/lib.rs
((32 lines not shown))
+ NormalTT, BasicMacroExpander,
+};
+use syntax::parse;
+use syntax::parse::token;
+use syntax::print::pprust;
+
+use regex::Regex;
+use regex::native::{
+ OneChar, CharClass, Any, Save, Jump, Split,
+ Match, EmptyBegin, EmptyEnd, EmptyWordBoundary,
+ Program, Dynamic, Native,
+ FLAG_NOCASE, FLAG_MULTI, FLAG_DOTNL, FLAG_NEGATED,
+};
+
+/// For the `regex!` syntax extension. Do not use.
+#[macro_registrar]
@sfackler Collaborator

I'd just mark this as #[doc(hidden)]

@BurntSushi Collaborator

Fixed. Thanks!

@bors bors referenced this pull request from a commit
@bors bors auto merge of #13700 : BurntSushi/rust/regexp, r=alexcrichton
d988040
@bors bors referenced this pull request from a commit
@bors bors auto merge of #13700 : BurntSushi/rust/regexp, r=alexcrichton
c251f5b
@bors bors referenced this pull request from a commit
@bors bors auto merge of #13700 : BurntSushi/rust/regexp, r=alexcrichton
4ee2b4e
BurntSushi added some commits
@BurntSushi BurntSushi Add a regex crate to the Rust distribution.
Also adds a regex_macros crate, which provides natively compiled
regular expressions with a syntax extension.

Closes #3591.

RFC: 0007-regexps
b8b7484
@BurntSushi BurntSushi mk: Copy fewer libraries into the host artifacts
09a8b38
@bors bors referenced this pull request from a commit
@bors bors auto merge of #13700 : BurntSushi/rust/regexp, r=alexcrichton
dd2a48f
@BurntSushi BurntSushi Ignore regex tests (regular, cfail and benchmark) on Windows (for now).
7269bc7
@alexcrichton

r+

@bors Collaborator

saw approval from alexcrichton at BurntSushi@7269bc7

@bors Collaborator

merging BurntSushi/rust/regexp = 7269bc7 into auto

@bors Collaborator

BurntSushi/rust/regexp = 7269bc7 merged ok, testing candidate = eea4909

@bors Collaborator

fast-forwarding master to auto = eea4909

@bors bors referenced this pull request from a commit
@bors bors auto merge of #13700 : BurntSushi/rust/regexp, r=alexcrichton
eea4909
@bors bors closed this
@bors bors merged commit 7269bc7 into rust-lang:master
@BurntSushi BurntSushi deleted the BurntSushi:regexp branch
@alexcrichton

Nice work @BurntSushi!

Commits on Apr 24, 2014
  1. @BurntSushi — Add a regex crate to the Rust distribution.
     Also adds a regex_macros crate, which provides natively compiled
     regular expressions with a syntax extension.
     Closes #3591.
     RFC: 0007-regexps
  2. @BurntSushi — mk: Copy fewer libraries into the host artifacts
  3. @BurntSushi — Ignore regex tests (regular, cfail and benchmark) on Windows (for now).