add regexp crate to Rust distribution (implements RFC 7) #13700

Merged: 3 commits, Apr 25, 2014


BurntSushi
Member

Implements RFC 7 and will hopefully resolve #3591. The crate is marked as experimental. It includes a syntax extension for compiling regexps to native Rust code.

Embeds and passes the basic, nullsubexpr and repetition tests from Glenn Fowler's (slightly modified by Russ Cox for leftmost-first semantics) testregex test suite. I've also written a plethora of other tests by hand that exercise Unicode support, the parser, the public API, etc. Also includes a regex-dna benchmark for the shootout.

I know the addition looks huge at first, but consider these things:

  1. More than half the number of lines is dedicated to Unicode character classes.
  2. Of the ~4,500 lines remaining, 1,225 of them are comments.
  3. Another ~800 are tests.
  4. That leaves ~2,500 lines for the meat. The parser is ~850 of them. The public API, compiler, dynamic VM and code generator (for regexp!) make up the rest.

@UtherII

UtherII commented Apr 23, 2014

Maybe a silly question, but wouldn't it make sense to put Unicode character class support into the standard Rust string library?

@BurntSushi
Member Author

Possibly. But I'm not sure. What would they be used for in std::str in their current form?

Note that the matching algorithm depends on those Unicode classes being available in sorted, non-overlapping order, so that they are amenable to binary search.

One possible path forward is to leave them in regexp and rip them out if and when std::str (or something else) wants them.
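
To illustrate why sorted, non-overlapping order matters, here is a minimal sketch (not the PR's actual code) of a class lookup by binary search. The table `ASCII_ALNUM` and the function `class_contains` are hypothetical names, and the table is a tiny stand-in for the much larger Unicode tables; the sketch is written against current Rust rather than the 2014 compiler.

```rust
use std::cmp::Ordering;

/// Hypothetical character class: inclusive (start, end) ranges,
/// sorted by start and non-overlapping.
const ASCII_ALNUM: &[(char, char)] = &[('0', '9'), ('A', 'Z'), ('a', 'z')];

/// Returns true if `c` falls inside one of the ranges.
/// The result is only meaningful when `ranges` is sorted and
/// non-overlapping, which is what binary search relies on.
fn class_contains(ranges: &[(char, char)], c: char) -> bool {
    ranges
        .binary_search_by(|&(lo, hi)| {
            if c < lo {
                Ordering::Greater // this range lies after `c`
            } else if c > hi {
                Ordering::Less // this range lies before `c`
            } else {
                Ordering::Equal // `c` is inside this range
            }
        })
        .is_ok()
}
```

With this layout a membership test costs O(log n) comparisons in the number of ranges, instead of a linear scan over every range in the class.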

//!
//! ## Matching one character
//!
//! <pre class="rust">
Member

We've generally tried not to use HTML tags in our documentation. Is this done to avoid running the test/lexer over the contents? You may be able to get away with a notrust tag after three backticks.

Member Author

Actually, the reasoning is more insidious: I was unable to write a plain \ character in a fenced code block, so I resorted to the simpler solution of just writing the HTML. (I wasn't able to determine if this was a bug in the sundown parser or elsewhere...)

Member

Oh well, it was worth a try!

@alexcrichton
Member

This looks even better than I thought it was going to be, amazing work, and thank you so much!

@alexcrichton
Member

Ah, one more small thing: we're trying to ensure that commits can be traced back to the RFC they implemented, so could you make sure that this shows up at the bottom of the first commit message (you can wait to rebase until later):

RFC: 0007-regexps

None => "",
Some(ref h) => {
    match h.find(&name.to_owned()) {
        None => "",
Contributor

Could you use h.find_equiv(name) here in order to avoid allocating an owned string?

Member Author

Indeed I can. Fixed.
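
For readers on current Rust, where `find` and `find_equiv` no longer exist, the same allocation-free lookup can be sketched with `HashMap::get`, which accepts a borrowed `&str` key for a `String`-keyed map. This is an illustration only: the function `lookup`, its parameter names, and the `String` value type are assumptions, since the PR excerpt does not show what the map stores.

```rust
use std::collections::HashMap;

/// Hypothetical helper: looks up a named group and returns "" when the
/// map is absent or the name is unknown.
fn lookup<'a>(named: &'a Option<HashMap<String, String>>, name: &str) -> &'a str {
    match named {
        None => "",
        // `get` takes `&str` directly because `String: Borrow<str>`,
        // so no owned `String` is allocated just to perform the lookup.
        Some(h) => match h.get(name) {
            None => "",
            Some(s) => s.as_str(),
        },
    }
}
```

The point of the review comment carries over unchanged: building an owned string with `name.to_owned()` only to hash and compare it is wasted work when the map can be queried with the borrowed form.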

@BurntSushi
Member Author

@alexcrichton Thanks! And thanks very much for all your comments so far. Very helpful. I will make sure to add RFC: 0007-regexps to the commit message.

Also, when I rebase, won't it change my commit history? I assume I'll have to force push. (Just want to make sure that's what's expected.)

@thestinger
Contributor

@BurntSushi: Yeah, you'll have to force push.

// except according to those terms.

// ignore-stage1
// ignore-cross-compile #12102
Member

The most recently landed PR actually makes this so ignore-cross-compile isn't necessary. The stack of commits will need to get rebased anyway, so this is just something to include in the rebasing.

@alexcrichton
Member

Just a few small nits left, and otherwise this looks fantastic. After a rebase, I think this is good to go!

@chris-morgan
Member

Argh, I didn't notice that when RFC 7 was accepted, it kept the name regexp rather than shifting to regex. Can we fix that? (Citation for why we should change it: regex is the name everyone uses.)

@chris-morgan
Member

Whoa! Look at the ngrams with re added! And the trends! Obviously everyone's actually using re as the name.

@seanmonstar
Contributor

I just saw after a reload. Deleted my comment.

@huonw
Member

huonw commented Apr 24, 2014

C++ uses regex too.

@BurntSushi
Member Author

I don't really like re because I think it's too short. Note that Python's original regexp module was called regex. (re was added in Python 1.5 and regex wasn't removed until Python 2.5, so they had to coexist.) I do like the notion of writing re::Regexp, but I also think it's useful for the name of the crate to stand on its own.

I would not be opposed to naming the macro re!, though it seemed like there was some value in making it the same name as the crate. I don't feel strongly about it.

  • Languages that use some variation of regexp: Go, Ruby, JavaScript
  • Languages that use some variation of regex: C++, Java, Python, OCaml, Haskell

(The .NET crowd is notably missing, but they call their module RegularExpressions. Objective-C calls theirs NSRegularExpression.)

I don't know what it means to choose one name over another based on Google Trends telling me that there is a 0.0000098675% difference between the two in 2008.

There seems to be a slight overall preference toward Regex and there are more language libraries using Regex as the name of their module which provides regexps. If I switch the name of the crate, type and macro to regex, Regex and regex!, respectively, will that make everyone reasonably happy?

@chris-morgan
Member

@BurntSushi There still remains the question of Regex vs. RegEx.

@BurntSushi
Member Author

I prefer Regex. It just looks nicer to me.

@seanmonstar
Contributor

Rust convention is CamelCase for types.

@BurntSushi
Member Author

@seanmonstar Depends on whether you consider regex a word all by itself. :P

@BurntSushi
Member Author

If we have RegEx on the grounds that type names are CamelCase, then I guess we'd either need to use re! for the macro or reg_ex!. (Since I believe underscores delimit words in function/macro names.)

@blaenk
Contributor

blaenk commented Apr 24, 2014

If I switch the name of the crate, type and macro to Regex, Regex and regex!, respectively, will that make everyone reasonably happy?

Count me for that one, Regex that is. I don't like RegEx; that uppercase 'E' adds a break in the flow of typing it, and reg_ex is just plain ugly. I also agree re is too short, and adding a p at the end of regex just makes it weird.

I think Regex is the best.

@liigo
Contributor

liigo commented Apr 24, 2014

regex +1, since it's shorter and readable.

@chris-morgan
Member

If I switch the name of the crate, type and macro to Regex, Regex and regex!, respectively, will that make everyone reasonably happy?

You meant the crate as regex rather than Regex. Given that, +1.

@BurntSushi
Member Author

@chris-morgan yes absolutely! Nice catch. Edited.

@BurntSushi
Member Author

OK, I've changed the name of the crate to regex, the type to Regex and the macro to regex!.

r? @alexcrichton @thestinger

};

/// For the `regex!` syntax extension. Do not use.
#[macro_registrar]
Member

I'd just mark this as #[doc(hidden)]

Member Author

Fixed. Thanks!
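
As a small sketch of the suggestion above: `#[doc(hidden)]` keeps a public item out of the generated documentation while leaving it callable. The function below is a hypothetical stand-in; the real item in the PR is a `#[macro_registrar]` function, an attribute from the 2014 compiler that current Rust no longer has.

```rust
/// For the `regex!` syntax extension. Do not use.
#[doc(hidden)]
pub fn registrar_placeholder() -> &'static str {
    // Hypothetical body; it only shows that the item stays public and
    // usable even though rustdoc omits it from the rendered docs.
    "regex"
}
```

The doc comment is still worth keeping for anyone reading the source, since the attribute only affects what rustdoc renders.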

Also adds a regex_macros crate, which provides natively compiled
regular expressions with a syntax extension.

Closes rust-lang#3591.

RFC: 0007-regexps
bors added a commit that referenced this pull request Apr 25, 2014
@bors bors closed this Apr 25, 2014
@bors bors merged commit 7269bc7 into rust-lang:master Apr 25, 2014
@BurntSushi BurntSushi deleted the regexp branch April 25, 2014 07:58
@alexcrichton
Member

Nice work @BurntSushi!

@pyrossh

pyrossh commented Nov 21, 2015

Nice work @BurntSushi!
