add a regexp crate to the Rust distribution #42
Conversation
BurntSushi
referenced this pull request
Apr 13, 2014
Closed
Add a regular expressions library to the distribution #3591
huonw
reviewed
Apr 13, 2014
> A nice implementation strategy to support Unicode is to implement a VM that
> matches characters instead of bytes. Indeed, my implementation does this.
> However, the public API of a regular expression library should expose *byte
> indices* corresponding to match locations (which ought to be guaranteed to be
huonw
Apr 13, 2014
Member
(The APIs in std::str expose byte indices too, so this is well supported in Rust-land.)
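(A quick illustration of the point, using the present-day `std::str` API rather than 2014's, so take the exact method name as an assumption: search returns *byte* offsets, which slice cleanly even when multi-byte characters precede the match.)

```rust
// Sketch: std::str reports byte offsets, not char offsets, so a match
// position can be used directly for slicing.
fn main() {
    let s = "héllo world";
    // 'é' occupies two bytes in UTF-8, so "world" begins at byte 7,
    // even though it starts at *char* index 6.
    let idx = s.find("world").expect("substring present");
    assert_eq!(idx, 7);
    assert_eq!(&s[idx..], "world");
}
```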
huonw
reviewed
Apr 13, 2014
active/0000-regexps.md
> found this difficult to do with zero-runtime cost. Either way, the ability to
> statically declare a regexp is pretty cool I think.
> Note that the syntax extension is the reason for the `regexp_re` crate. It's
huonw
Apr 13, 2014
Member
We probably should have a convention for crates and their syntax extension pairs, e.g. for a crate `foo`, have `foo_macros` or `foo_synext` or something. (I'd personally be ok with `foo_macros`, e.g. `regexp_macros` in this case.)
alexcrichton
Apr 13, 2014
Member
In the future all that will be necessary is `#[phase(syntax, link)] extern crate regexp;`. The compiler will automatically dynamically load the appropriate syntax extension crate, and then it will link to the target crate. Note that this is all far off, and is another reason why `phase` is feature gated.
Essentially, I wouldn't worry too much about the name.
BurntSushi
Apr 13, 2014
Member
OK. I changed the name for now to `regexp_macros`. Even if it isn't necessary, I think it's probably a better name on its own than `regexp_re`. Happy to comply with anything though.
alexcrichton
reviewed
Apr 13, 2014
active/0000-regexps.md
> include some kind of support for regular expressions in its standard library.
> The outcome of this RFC is to include a regular expression library in the Rust
> distribution.
BurntSushi
Apr 13, 2014
Member
OK, I added a reference to the issue. (Please let me know if that wasn't what you meant by cc.)
alexcrichton
reviewed
Apr 13, 2014
active/0000-regexps.md
> [#3591](https://github.com/mozilla/rust/issues/3591) and it seems like people
> favor a native Rust implementation if it's to be included in the Rust
> distribution. (Does the `re!` macro require it? If so, that's a huge
> advantage.)
alexcrichton
Apr 13, 2014
Member
Another small downside of binding to an existing library is that it's not necessarily as portable as Rust code. Libraries written in Rust are maximally portable because they'll go wherever Rust goes.
This looks amazing, fantastic work!
BurntSushi
added some commits
Apr 13, 2014
chris-morgan
Apr 13, 2014
Member
> In the future all that will be necessary is `#[phase(syntax, link)] extern crate regexp;`. The compiler will automatically dynamically load the appropriate syntax extension crate, and then it will link to the target crate. Note that this is all far off, and is another reason why `phase` is feature gated.

But doesn't `#[phase(syntax, link)] extern crate regexp;` work? That's the recommended invocation for log, and how it is being used.

I certainly don't want any public `_macros` convention; I want it to Just Work™. (Ideally without needing to specify `#[phase]` at all, but I'll live with it for the moment.)
huonw
Apr 13, 2014
Member
If a procedural* macro is defined in a crate, then it will want to pull in `syntax` as a (dynamic) dependency. Hence having the procedural macro defined in the `regexp` crate will mean everything that uses `regexp` will need `syntax` at runtime, which is entirely unacceptable. Having it in a separate crate allows you to depend on that crate only at compile time, with no runtime effect (there are possibly bugs with this ATM).
*It doesn't affect liblog because all its macros are `macro_rules`, which don't need libsyntax to be linked in.
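(For contrast, a rough sketch — in today's syntax, with a made-up macro name — of why `macro_rules` macros carry no such runtime baggage: they expand to ordinary code at the call site, so the macro machinery is needed only at compile time.)

```rust
// A declarative (macro_rules) macro: expansion happens entirely at
// compile time, leaving only plain code in the binary.
macro_rules! square {
    ($x:expr) => {
        $x * $x
    };
}

fn main() {
    // After expansion this is just `3 * 3`; no macro support code is linked in.
    assert_eq!(square!(3), 9);
}
```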
chris-morgan
Apr 13, 2014
Member
@huonw That should be able to be fixed by dead-code removal during link-time optimisation (would it be, in practice?) but I get the point now. Thanks for the explanation.
chris-morgan
Apr 13, 2014
Member
A couple of other ideas that I have had with regards to regular expressions are:
- Truly compiled regular expressions: as in, no dependency on a `regexp` library at runtime at all, but rather expanding it to approximately what a person might have written by hand without a regular expressions library.
- Create anonymous structs for matches, with direct field access (or indexed access) for groups.
I would expect that these would lead to somewhat larger compiled code, but to code that should run more efficiently. I'm not sure if it's a good trade-off or not.
Anyway, these lead to something like this:
```rust
re!(FancyIdentifier, r"^(?P<letters>[a-z]+)(?P<numbers>[0-9]+)?$")
```
expanding to something approximating this, plus quite a bit more (I recognise that it isn't a valid expansion in a static value and has various other issues, but it gives the general idea of what I think would be really nice):
```rust
struct FancyIdentifier<'a> {
    all: &'a str,
    letters: &'a str,
    numbers: Option<&'a str>,
}

impl<'a> Index<uint, Option<&'a str>> for FancyIdentifier<'a> {
    fn index(&'a self, index: &uint) -> Option<&'a str> {
        if *index == 0u {
            Some(self.all)
        } else if *index == 1u {
            Some(self.letters)
        } else if *index == 2u {
            self.numbers
        } else {
            fail!("no such group {}", *index);
        }
    }
}

impl<'a> FancyIdentifier<'a> {
    pub fn captures<'t>(text: &'t str) -> Option<FancyIdentifier<'t>> {
        let mut chars = text.chars();
        loop {
            // go through, byte/char by byte/char, keeping track of position
            if b < 'a' || b > 'z' {
                return None;
            }
        }
        loop {
            // … get numbers in much the same way …
        }
        Some(FancyIdentifier {
            all: text,
            letters: letters,
            numbers: numbers,
        })
    }
}
```
This allows nicer usage:
```rust
let foo12 = FancyIdentifier::captures("foo12").unwrap();
assert_eq!(foo12.letters, "foo");
assert_eq!(foo12.numbers, Some("12"));
assert_eq!(foo12[0], Some("foo12"));
assert_eq!(foo12[2], Some("12"));
```
I expect this would be rather difficult to implement, too. Still, just thought I'd toss the idea into the ring as I haven't seen it suggested, but it's been sitting in my mind the whole time the discussion has gone on.
huonw
reviewed
Apr 13, 2014
active/0000-regexps.md
> case---but I'm just hazarding a guess here. (If we go this route, then we'd
> probably also have to expose the regexp parser and AST and possibly the
> compiler and instruction set to make writing your own backend easier. That
> sounds restrictive with respect to making performance improvements in the
huonw
Apr 13, 2014
Member
We could expose it as an #[unstable] or even #[experimental] interface: i.e. subject to change, but it's possible to use if you really need it.
BurntSushi
Apr 13, 2014
Member
@chris-morgan That's a really interesting idea. I hadn't thought of it. (I spoke with @eddyb about it on IRC.)
I think it's something worth trying and could potentially increase performance dramatically, but I also think it's complex enough that it be thrown in the bin of future work. @eddyb and I both agree that it would require a specialization of the Pike VM, which I think is doable (without allocation even). A more naive implementation is difficult because (I think) it would rely on recursion for handling non-determinism, which would pretty easily result in stack overflows. (In fact, most of the complexity of the Pike VM is a direct result of manually managing a queue of states. Any recursion in the VM is strictly bounded to the number of instructions in the regexp.)
If this sounds OK to you, I'll add it to the RFC as possible future work. (Since it may require an API change, I think the #[unstable] and #[experimental] would make that OK.)
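(For readers following along, here is a heavily simplified sketch of the Pike-VM idea under discussion; the instruction set and names are illustrative, not the crate's actual code. The core trick is exactly what the comment describes: non-determinism is handled by advancing a *queue* of threads in lock-step over the input, so the only recursion is over the program's epsilon-transitions, bounded by the number of instructions.)

```rust
// Minimal Pike-VM-style matcher: a "thread" is just an instruction
// pointer, and all live threads advance together over each input char.
#[derive(Clone, Copy)]
enum Inst {
    Char(char),          // consume one matching character
    Split(usize, usize), // fork: try both targets (non-determinism)
    Jmp(usize),          // unconditional jump
    Match,               // success
}

// Follow epsilon transitions (Split/Jmp) until reaching a consuming or
// matching instruction. Recursion depth is bounded by program length
// (assuming an acyclic epsilon-graph, which compilers arrange).
fn add_thread(prog: &[Inst], pc: usize, list: &mut Vec<usize>) {
    if list.contains(&pc) {
        return;
    }
    match prog[pc] {
        Inst::Split(a, b) => {
            add_thread(prog, a, list);
            add_thread(prog, b, list);
        }
        Inst::Jmp(t) => add_thread(prog, t, list),
        _ => list.push(pc),
    }
}

// Anchored match: does `prog` match all of `text`?
fn is_match(prog: &[Inst], text: &str) -> bool {
    let mut clist = Vec::new();
    add_thread(prog, 0, &mut clist);
    for c in text.chars() {
        let mut nlist = Vec::new();
        for &pc in &clist {
            if let Inst::Char(ic) = prog[pc] {
                if ic == c {
                    add_thread(prog, pc + 1, &mut nlist);
                }
            }
        }
        clist = nlist; // state stays O(program size), regardless of input
    }
    clist.iter().any(|&pc| matches!(prog[pc], Inst::Match))
}

fn main() {
    // Program for `a+b`:
    let prog = [Inst::Char('a'), Inst::Split(0, 2), Inst::Char('b'), Inst::Match];
    assert!(is_match(&prog, "aaab"));
    assert!(!is_match(&prog, "ba"));
}
```

Note how the VM consumes `char`s, not bytes, matching the Unicode strategy quoted at the top of this thread.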
alexcrichton
Apr 13, 2014
Member
@chris-morgan, sadly `#[phase(syntax, link)] extern crate regexp;` will not work because this is using procedural macros rather than `macro_rules` macros (two separate systems).
sfackler
Apr 13, 2014
Member
@alexcrichton well, it'll work, but introduce a runtime dependency on libsyntax :P
brendanzab
Apr 13, 2014
Member
@alexcrichton Is that a current wart that should be fixed, or will that remain the same?
alexcrichton
Apr 13, 2014
Member
Yes, that is the wart that will be fixed. In the future world, there will be no need to manually compile two crates, there will be no runtime dependency on libsyntax, and the syntax will be `#[phase(syntax, link)] extern crate my_crate_with_syntax_extensions;`.
brendanzab
Apr 13, 2014
Member
@BurntSushi Awesome work. Have you by any chance seen D's compile time regex with templates? (Scroll down to "Regular Expression Compiler"). I'm guessing you are probably using a very different method to statically compile things though, but its still interesting.
Also, could you explain why you chose the specific identifier for your library and types? Here are some choices you could have made:
- `Re`, `libre`: too ambiguous? but consistent with `re!`
- `Regex`, `libregex`: the shortening most people use in conversation
- `Regexp`, `libregexp`: current proposal
- `Regexpr`, `libregexpr`: rust uses `expr` in the `macro_rules` thing - more consistent maybe?
- `RegExp`, `libreg_exp`: might be more consistent with the accepted identifier style
- `RegExpr`, `libreg_expr`: see above
We could bikeshed this forever, but I do think it deserves at least some passing consideration before we pull the trigger.
BurntSushi
Apr 13, 2014
Member
@bjz Thanks! I did look at D's regexes this morning. From what I understand D provides something similar to my current `re!` macro, which compiles a regexp at compile time but still relies on a general implementation to do matching. The example I found here (toward the bottom) is: `static r = regex("Boo-hoo");`. D also supports compiling a regexp to native code with `ctRegex`, which is what @chris-morgan suggested above. I'm not exactly sure why they are separate in the public API though.
Also, for the name, I didn't put much thought into it. If you pressed me, I'd say I used it simply because that's the name of the package in Go's standard library. (Which isn't that good of a reason.)
re is what Python uses, but I agree with you that it might be too ambiguous.
I'd also be happy with Regex and libregex. I'm less a fan of the other suggestions, just because they look more ugly to me. Also, I think people tend to refer to them as either "regexes" or "regexps" rather than "regexprs", so maybe that's another reason to stick with regex/regexp.
Meta: should the RFC include a discussion/justification of the name?
brendanzab
Apr 13, 2014
Member
Regarding D, cool to hear your impressions! A while back I heard Andrei make some very bold claims about D's regex performance compared to other libs, and I would wonder how yours would compare. I realise however that this is an RFC regarding the public API, and the internals could be improved later.
I would make a mention of the naming in the RFC – I think it is important to show you have considered alternatives and precedents rather than jumping on the first one that came to mind. The bike shedding is inevitable (and sometimes necessary), but at least it helps to focus the debate.
BurntSushi
Apr 13, 2014
Member
@bjz RE D: Yeah, I think it would be very exciting to see what @chris-morgan's suggestion would do to performance. That along with implementing a DFA are two major optimizations for future work. (Along with a few other minor ones, like a one-pass NFA.)
I've added some stuff about the name to the RFC.
BurntSushi
Apr 13, 2014
Member
@bjz I agree. I would actually prefer that it be called regex! or regexp! (whatever the crate name is I guess). I called it re! because that's what people had been writing (when talking about a hypothetical macro).
chris-morgan
Apr 14, 2014
Member
I personally prefer crate re and macro re!. But then, I come from a Python background, so don't trust me.
chris-morgan
Apr 14, 2014
Member
Google trends for the regexpr, regexp and regex:
- "regexpr" is basically never used;
- "regexp" is steadily declining in usage;
- "regex" has been the preferred form for at least ten years.
lfairy
reviewed
Apr 14, 2014
active/0000-regexps.md
> The `\w` character class and the zero-width word boundary assertion `\b` are
> defined in terms of the ASCII character set. I'm not aware of any
> implementation that defines these in terms of proper Unicode character classes.
> Do we want to be the first?
lfairy
Apr 14, 2014
Contributor
`\w` and `\d` and `\s` all default to Unicode under Python 3. So there's a little bit of precedent.
BurntSushi
Apr 14, 2014
Member
Ah! I actually think D also does it. I'd say that's probably enough precedent to go with Unicode. (For word boundaries too, I think.)
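(A rough illustration of what a Unicode-aware `\w` buys, approximating the class with `char::is_alphanumeric` plus underscore; a real engine would use Unicode word-character properties, so treat the exact definition here as an assumption.)

```rust
// ASCII \w is [A-Za-z0-9_]; a Unicode-aware \w also accepts letters and
// digits outside ASCII. Approximation for illustration only.
fn is_word_char_unicode(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

fn is_word_char_ascii(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

fn main() {
    assert!(is_word_char_unicode('é'));  // matched under Unicode semantics
    assert!(!is_word_char_ascii('é'));   // missed under ASCII semantics
    assert!(is_word_char_unicode('_') && is_word_char_ascii('_'));
}
```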
lfairy
Apr 14, 2014
Contributor
DFA compilation would be great, though probably as an option. The main advantage is performance: it matches in O(n) time and O(1) memory (zero allocations!). It has no runtime dependencies. Plus, since it's effectively a finite state machine, it's straightforward to translate to LLVM.
It's not a free lunch though -- a DFA matcher has worst case exponential code size, which can make it impractical for complex expressions.
If a DFA compiler is implemented, we can either tuck it under a flag, or enable it by default but fall back if the code becomes too large. Either way, I think finishing Unicode support (especially case folding) is a higher priority.
(As for the name, I vote for regex. The 'p' in regexp doesn't add anything semantically; we might as well take it out.)
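(To make the O(n)/O(1) claim concrete, here is a toy hand-written DFA for the pattern `a+b`, roughly the shape of code a DFA compiler would emit; the function name is made up. The entire match is a table walk with no allocation.)

```rust
// DFA for `a+b` (anchored): each input char triggers exactly one state
// transition, so matching is O(n) time and O(1) extra space.
fn dfa_match_a_plus_b(text: &str) -> bool {
    // states: 0 = start, 1 = seen one or more 'a', 2 = accept, 3 = dead
    let mut state = 0u8;
    for c in text.chars() {
        state = match (state, c) {
            (0, 'a') | (1, 'a') => 1,
            (1, 'b') => 2,
            _ => 3, // any other input can never lead to acceptance
        };
    }
    state == 2
}

fn main() {
    assert!(dfa_match_a_plus_b("aaab"));
    assert!(!dfa_match_a_plus_b("b"));
    assert!(!dfa_match_a_plus_b("aabx"));
}
```

The exponential-size worry mentioned above shows up in the number of states such a table needs for some patterns, not in the per-character work.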
BurntSushi
Apr 14, 2014
Member
@lfairy Note that a compiled NFA should also have zero (heap) allocations. (The generalized NFA simulator has O(m) (heap) space complexity, where m is the number of instructions in the regexp.)
I believe RE2/C++ tries to use a DFA and falls back to an NFA if the state cache gets flushed too frequently (I think).
huonw
Apr 14, 2014
Member
Maybe the DFA approach could be performed by hooking up Ragel to generate Rust AST (this may be tricky). (cc https://github.com/erickt/ragel, which is generating Rust code as text.)
seanmonstar
Apr 14, 2014
Contributor
I find it slightly odd that `Regexp::new()` returns a `Result`. I've come to assume that `new()` will always return that object, and that it's safe to do so.

Would `Regexp::compile(str) -> Result` feel nicer?
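(A sketch of the API shape being debated, with made-up internals: a fallible constructor surfaces an invalid pattern as a value-level error instead of a panic. The "parser" here just checks paren balance as a stand-in.)

```rust
// Illustrative only: the shape of a fallible Regexp constructor.
#[derive(Debug, PartialEq)]
struct Error(String);

struct Regexp; // the compiled program would live here

impl Regexp {
    fn new(pattern: &str) -> Result<Regexp, Error> {
        // Stand-in validation: reject obviously unbalanced parentheses.
        let opens = pattern.matches('(').count();
        let closes = pattern.matches(')').count();
        if opens != closes {
            Err(Error(format!("unbalanced parentheses in {:?}", pattern)))
        } else {
            Ok(Regexp)
        }
    }
}

fn main() {
    assert!(Regexp::new("(a+)b").is_ok());
    assert!(Regexp::new("(a+b").is_err());
}
```

Whether it is spelled `new` or `compile`, the `Result` type forces callers to handle the bad-pattern case.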
lfairy
Apr 14, 2014
Contributor
@BurntSushi The VM does allocate, to create a list of running threads -- but given the allocation only happens once, I see where you're coming from.
> I believe RE2/C++ tries to use a DFA and falls back to an NFA if the state cache gets flushed too frequently (I think).
That sounds right.
@huonw Thanks for the link. I suspect people who need the performance/expressiveness of Ragel would use it directly though.
This leaves the DFA approach in an awkward spot, methinks -- simple cases work well with re2-style state caching, and advanced cases can use a lexer generator or something magical like Ragel. Looks like Russ Cox had it right all along ;)
huonw
Apr 14, 2014
Member
> I find it slightly odd that `Regexp::new()` returns a `Result`. I've come to assume that `new()` will always return that object, and that it's safe to do so. Would `Regexp::compile(str) -> Result` feel nicer?
The types mean that you can never accidentally use the return of `::new()` incorrectly (I'm personally fine with using `new` for this reason: strong types).
> I suspect people who need the performance/expressiveness of Ragel would use it directly though.
That doesn't preclude using Ragel just as a step in an efficient regex -> native code translator. (That is, writing a regex syntax -> ragel syntax translator and an output-to-Rust-AST mode for ragel may be easier than writing a direct regex syntax -> Rust-AST translator that results in equally good code. Of course, adding a ragel dependency to the core distribution would be a no-go.)
BurntSushi
Apr 18, 2014
Member
I've been working on what @chris-morgan suggested: real compilation to native Rust with the re! macro. I'm sure there are more performance gains to be had, but I'm at a reasonable place right now. One bummer is that it can no longer be declared statically. Instead, it can be used anywhere an expression can be used. But, it is also indistinguishable from a regexp compiled at runtime. Internally, the representation looks like this:
```rust
pub enum MaybeNative {
    Dynamic(~[Inst]),
    Native(fn(MatchKind, &str, uint, uint) -> ~[Option<uint>]),
}
```

This makes runtime and compiled regexps have a completely identical API. It means it might not be as nice as the API that @chris-morgan suggested, but I think a consistent API between both is probably more valuable.
The doco for the updated code is at a different URL. There are some examples using the new macro: http://burntsushi.net/rustdoc/exp/regexp/index.html (But other portions of the doco are rightfully unchanged.)
The benchmarks are also very encouraging. Left column is for dynamic regexps and the right column is for natively compiled regexps.
literal 422 ns/iter (+/- 2) 120 ns/iter (+/- 17)
not_literal 1904 ns/iter (+/- 7) 931 ns/iter (+/- 642)
match_class 2452 ns/iter (+/- 6) 1276 ns/iter (+/- 336)
match_class_in_range 2559 ns/iter (+/- 91) 1298 ns/iter (+/- 433)
replace_all 5221 ns/iter (+/- 529) 1216 ns/iter (+/- 648)
anchored_literal_short_non_match 939 ns/iter (+/- 10) 420 ns/iter (+/- 168)
anchored_literal_long_non_match 8979 ns/iter (+/- 64) 5407 ns/iter (+/- 1982)
anchored_literal_short_match 576 ns/iter (+/- 5) 126 ns/iter (+/- 92)
anchored_literal_long_match 553 ns/iter (+/- 8) 150 ns/iter (+/- 103)
one_pass_short_a 2039 ns/iter (+/- 20) 1036 ns/iter (+/- 397)
one_pass_short_a_not 2698 ns/iter (+/- 8) 1365 ns/iter (+/- 623)
one_pass_short_b 1457 ns/iter (+/- 14) 710 ns/iter (+/- 495)
one_pass_short_b_not 2037 ns/iter (+/- 13) 974 ns/iter (+/- 552)
one_pass_long_prefix 1188 ns/iter (+/- 7) 383 ns/iter (+/- 117)
one_pass_long_prefix_not 1217 ns/iter (+/- 7) 344 ns/iter (+/- 196)
easy0_32 564 ns/iter (+/- 14) = 56 MB/s 44 ns/iter (+/- 12) = 727 MB/s
easy0_1K 2389 ns/iter (+/- 167) = 428 MB/s 1903 ns/iter (+/- 390) = 538 MB/s
easy0_32K 59404 ns/iter (+/- 882) = 551 MB/s 59889 ns/iter (+/- 34128) = 547 MB/s
easy1_32 543 ns/iter (+/- 145) = 58 MB/s 55 ns/iter (+/- 58) = 581 MB/s
easy1_1K 3495 ns/iter (+/- 829) = 292 MB/s 1629 ns/iter (+/- 601) = 628 MB/s
easy1_32K 92901 ns/iter (+/- 5203) = 352 MB/s 48938 ns/iter (+/- 8302) = 669 MB/s
medium_32 1611 ns/iter (+/- 61) = 19 MB/s 526 ns/iter (+/- 60) = 60 MB/s
medium_1K 33457 ns/iter (+/- 621) = 30 MB/s 14541 ns/iter (+/- 5849) = 70 MB/s
medium_32K 1044635 ns/iter (+/- 19853) = 31 MB/s 472571 ns/iter (+/- 177623) = 69 MB/s
hard_32 2447 ns/iter (+/- 129) = 13 MB/s 1025 ns/iter (+/- 516) = 31 MB/s
hard_1K 54844 ns/iter (+/- 297) = 18 MB/s 30248 ns/iter (+/- 11665) = 33 MB/s
hard_32K 1744529 ns/iter (+/- 40267) = 18 MB/s 993564 ns/iter (+/- 455100) = 32 MB/s
There's also a similarly big jump in performance on the regex-dna benchmark. Old. New.
BurntSushi
Apr 18, 2014
Member
I've updated the RFC to use natively compiled regexps and simplified some sections based on discussion here. And the implementation is now Unicode friendly for Perl character classes and word boundaries.
Aside from the name of the crate (how is that decided?), I think I've incorporated all feedback given.
sfackler
Apr 18, 2014
Member
How does the codegen size compare between the old and new syntax extension implementations? Will a binary with a lot of regexes need to avoid native compilation because it would bloat the binary too much?
seanmonstar
Apr 18, 2014
Contributor
Indeed, I was thinking similarly: if I use regexp! several times, won't it generate a lot of redundant code? I imagine the repeatable parts can be put into a regexp::native module, and the macro can just expand to calls into those functions with the expanded values.
BurntSushi
Apr 19, 2014
Member
@sfackler I don't (yet) have a comparison with the old regexp! macro, but with native compilation, my test binary with roughly 434 regexps (many of which are pretty big) is 17MB compiled without optimization. Compiled with -O, the binary shrinks to 6.7MB. Compiled with --opt-level=3 -Z lto, the binary is 5.3MB.
As a baseline, if I compile test using dynamic regexps, then the binary sizes are 6MB, 4.3MB and 2.7MB, respectively.
These sizes seem pretty reasonable to me, since I think 400+ regexps is a pretty extreme case.
@seanmonstar That is indeed possible and I'm already doing it for some pieces. There is more that could be done though. Any piece that has knowledge of types like [T, ..N] has to be specialized though.
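The "shared runtime plus thin generated shim" split in this exchange could look roughly like the following. This is a hypothetical sketch in modern Rust, not the crate's actual expansion: shared_exec and step_a are invented names, and a const generic parameter stands in for the [T, ..N] specialization mentioned above. Only the small, size-specialized step function is emitted per regexp!, while the scanning machinery lives once in the library.

```rust
// Hypothetical sketch of factoring macro-generated code: the heavy,
// size-agnostic machinery lives in a shared module, and the per-regexp
// expansion is reduced to a tiny step function plus one call.

mod native_runtime {
    // Generic over the number of capture slots N; only this thin wrapper
    // is monomorphized per slot count, the loop logic is written once.
    pub fn shared_exec<const N: usize>(
        step: fn(&mut [Option<usize>; N], usize, char) -> bool,
        text: &str,
    ) -> [Option<usize>; N] {
        let mut slots = [None; N];
        for (i, c) in text.char_indices() {
            // `step` returns false once the regexp-specific logic matched.
            if !step(&mut slots, i, c) {
                break;
            }
        }
        slots
    }
}

// What an expansion of something like regexp!("a") might boil down to:
// only this step function is regexp-specific.
fn step_a(slots: &mut [Option<usize>; 2], at: usize, c: char) -> bool {
    if c == 'a' {
        slots[0] = Some(at);
        slots[1] = Some(at + c.len_utf8());
        return false; // matched; stop scanning
    }
    true
}

fn main() {
    let caps = native_runtime::shared_exec::<2>(step_a, "xxab");
    assert_eq!(caps, [Some(2), Some(3)]);
    println!("ok");
}
```

This keeps the per-regexp codegen small without changing the observable behavior, which is the trade-off the binary-size numbers above are probing.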
BurntSushi
Apr 19, 2014
Member
Here's another perspective. Given a minimal binary with a single small regexp (that prints all capture groups), compiling without optimization increases the binary size by 54KB (comparing dynamic vs. native regexp). Compiling with -O increases size by 9KB. Compiling with --opt-level=3 -Z lto decreases size by 184KB. (That seems wicked. Maybe the optimizer knows to leave out Regexp::new and all of its requisite machinery? E.g., the VM, parser and compiler.)
alexcrichton merged commit c250f8b into rust-lang:master on Apr 22, 2014
alexcrichton
Apr 22, 2014
Member
We discussed this in today's meeting and decided to merge it.
The only caveat we'd like to attach is that the entire crate is #[experimental] for now (so we can get some traction first). Other than that though, we're all looking forward to being able to use regular expressions!
BurntSushi deleted the BurntSushi:regexps branch on Apr 23, 2014
chris-morgan referenced this pull request on Apr 24, 2014: "add regexp crate to Rust distribution (implements RFC 7)" #13700 (merged)
bearophile
Apr 27, 2014
@BurntSushi: "I'm not exactly sure why they are separate in the public API though."
I think it's because, until someone patches D's Compile Time Function Execution interpreter to make it more memory-efficient, you sometimes want to avoid ctRegex in order to use less compilation memory.
BurntSushi commented Apr 13, 2014
Links to an existing implementation, documentation and benchmarks are in the RFC. This RFC is meant to resolve issue #3591.
I apologize in advance if I've made any amateur mistakes. I'm still fairly new to the Rust world (~1 month), so I'm sure I still have some misunderstandings about the language lurking somewhere.