add a regexp crate to the Rust distribution #42

Merged
merged 12 commits on Apr 22, 2014

9 participants
@BurntSushi
Member

BurntSushi commented Apr 13, 2014

Links to an existing implementation, documentation and benchmarks are in the RFC. This RFC is meant to resolve issue #3591.

I apologize in advance if I've made any amateur mistakes. I'm still fairly new to the Rust world (~1 month), so I'm sure I still have some misunderstandings about the language lurking somewhere.

+A nice implementation strategy to support Unicode is to implement a VM that
+matches characters instead of bytes. Indeed, my implementation does this.
+However, the public API of a regular expression library should expose *byte
+indices* corresponding to match locations (which ought to be guaranteed to be

@huonw

huonw Apr 13, 2014

Member

(The APIs in std::str expose byte indices too, so this is well supported in Rust-land.)
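As a quick illustration of that convention (a minimal sketch in modern Rust syntax, not the 2014-era API under discussion), the std::str methods report byte offsets, not character counts:

```rust
fn main() {
    let s = "héllo";
    // char_indices() yields *byte* offsets: 'é' is 2 bytes in UTF-8,
    // so the first 'l' starts at byte 3, not character index 2.
    let idxs: Vec<(usize, char)> = s.char_indices().collect();
    assert_eq!(idxs[1], (1, 'é'));
    assert_eq!(idxs[2], (3, 'l'));
    // find() also returns a byte index, suitable for slicing:
    assert_eq!(s.find('l'), Some(3));
    assert_eq!(&s[3..4], "l");
}
```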

@BurntSushi

BurntSushi Apr 13, 2014

Member

Nice catch! Fixed.

@huonw

active/0000-regexps.md
+found this difficult to do with zero-runtime cost. Either way, the ability to
+statically declare a regexp is pretty cool I think.
+
+Note that the syntax extension is the reason for the `regexp_re` crate. It's

@huonw

huonw Apr 13, 2014

Member

We probably should have a convention for crates and their syntax extension pairs, e.g. for a crate foo, have foo_macros or foo_synext or something. (I'd personally be ok with foo_macros, e.g. regexp_macros in this case.)

@BurntSushi

BurntSushi Apr 13, 2014

Member

I like foo_macros too. (I'll change this once there's a consensus?)

@sfackler

sfackler Apr 13, 2014

Member

I've used foo_mac but foo_macros seems fine.

@alexcrichton

alexcrichton Apr 13, 2014

Member

In the future all that will be necessary is #[phase(syntax, link)] extern crate regexp;. The compiler will automatically dynamically load the appropriate syntax extension crate, and then it will link to the target crate. Note that this is all far off, and is another reason why phase is feature gated.

Essentially, I wouldn't worry too much about the name.

@BurntSushi

BurntSushi Apr 13, 2014

Member

OK. I changed the name for now to regexp_macros. Even if it isn't necessary, I think it's probably a better name on its own than regexp_re. Happy to comply with anything though.

@alexcrichton

active/0000-regexps.md
+include some kind of support for regular expressions in its standard library.
+
+The outcome of this RFC is to include a regular expression library in the Rust
+distribution.

@alexcrichton

alexcrichton Apr 13, 2014

Member

Could you cc the issue in the rust repository here as well?

@BurntSushi

BurntSushi Apr 13, 2014

Member

OK, I added a reference to the issue. (Please let me know if that wasn't what you meant by cc.)

@alexcrichton

active/0000-regexps.md
+[#3591](https://github.com/mozilla/rust/issues/3591) and it seems like people
+favor a native Rust implementation if it's to be included in the Rust
+distribution. (Does the `re!` macro require it? If so, that's a huge
+advantage.)

@alexcrichton

alexcrichton Apr 13, 2014

Member

Another small downside of binding to an existing library is that it's not necessarily as portable as rust code. Libraries written in rust are maximally portable because they'll go wherever rust goes.

@BurntSushi

BurntSushi Apr 13, 2014

Member

Ah, right. Fixed.

@alexcrichton

alexcrichton Apr 13, 2014

Member

This looks amazing, fantastic work!

@chris-morgan

chris-morgan Apr 13, 2014

Member

@alexcrichton:

In the future all that will be necessary is #[phase(syntax, link)] extern crate regexp;. The compiler will automatically dynamically load the appropriate syntax extension crate, and then it will link to the target crate. Note that this is all far off, and is another reason why phase is feature gated.

But doesn't #[phase(syntax, link)] extern crate regexp; work? That's the recommended invocation for log, and how it is being used.

I certainly don't want any public _macros convention—I want it to Just Work™. (Ideally without needing to specify #[phase] at all, but I'll live with it for the moment.)

@huonw

huonw Apr 13, 2014

Member

If a procedural* macro is defined in a crate, then it will want to pull in syntax as a (dynamic) dependency. Hence having the procedural macro defined in the regexp crate will mean everything that uses regexp will need syntax at runtime, which is entirely unacceptable. Having it in a separate crate allows you to depend on that crate only at compile time, with no runtime effect (there are possibly bugs with this ATM).

*It doesn't affect liblog because all its macros are macro_rules, which don't need libsyntax to be linked in.

@chris-morgan

chris-morgan Apr 13, 2014

Member

@huonw That should be fixable by dead-code removal during link-time optimisation (would it be in practice?) but I get the point now. Thanks for the explanation.

@chris-morgan

chris-morgan Apr 13, 2014

Member

A couple of other ideas that I have had with regards to regular expressions are:

  • Truly compiled regular expressions: as in, no dependency on a regexp library at runtime at all, but rather expanding it to approximately what a person might have written by hand without a regular expressions library.
  • Create anonymous structs for matches, with direct field access (or indexed access) for groups.

I would expect that these would lead to somewhat larger compiled code, but to code that should run more efficiently. I'm not sure if it's a good trade-off or not.

Anyway, these lead to something like this:

re!(FancyIdentifier, r"^(?P<letters>[a-z]+)(?P<numbers>[0-9]+)?$")

expanding to something approximating this, plus quite a bit more (I recognise that it isn't a valid expansion in a static value and has various other issues, but it gives the general idea of what I think would be really nice):

struct FancyIdentifier<'a> {
    all: &'a str,
    letters: &'a str,
    numbers: Option<&'a str>,
}

impl<'a> Index<uint, Option<&'a str>> for FancyIdentifier<'a> {
    fn index(&'a self, index: &uint) -> Option<&'a str> {
        if *index == 0u {
            Some(self.all)
        } else if *index == 1u {
            Some(self.letters)
        } else if *index == 2u {
            self.numbers
        } else {
            fail!("no such group {}", *index);
        }
    }
}

impl<'a> FancyIdentifier<'a> {
    pub fn captures<'t>(text: &'t str) -> Option<FancyIdentifier<'t>> {
        let mut chars = text.chars();
        loop {
            // go through, byte/char by byte/char, keeping track of position
            if b < 'a' || b > 'z' {
                return None;
            }
        }
        loop {
            // … get numbers in much the same way …
        }
        Some(FancyIdentifier {
            all: text,
            letters: letters,
            numbers: numbers,
        })
    }
}

This allows nicer usage:

let foo12 = FancyIdentifier::captures("foo12");
assert_eq!(foo12.letters, "foo");
assert_eq!(foo12.numbers, Some("12"));
assert_eq!(foo12[0], Some("foo12"));
assert_eq!(foo12[2], Some("12"));

I expect this would be rather difficult to implement, too. Still, just thought I'd toss the idea into the ring as I haven't seen it suggested, but it's been sitting in my mind the whole time the discussion has gone on.
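For what it's worth, the core of such an expansion can be written as compiling, runnable code in modern Rust syntax. This is a sketch only: the struct and `captures` helper are hypothetical approximations of the idea, and the `Index` impl and full anchoring machinery are omitted.

```rust
// Hand-written matcher approximating ^(?P<letters>[a-z]+)(?P<numbers>[0-9]+)?$
struct FancyIdentifier<'a> {
    all: &'a str,
    letters: &'a str,
    numbers: Option<&'a str>,
}

fn captures(text: &str) -> Option<FancyIdentifier<'_>> {
    // find where the [a-z]+ run ends
    let boundary = text
        .find(|c: char| !c.is_ascii_lowercase())
        .unwrap_or(text.len());
    if boundary == 0 {
        return None; // need at least one letter
    }
    let (letters, rest) = text.split_at(boundary);
    if rest.is_empty() {
        // the (numbers)? group is optional
        return Some(FancyIdentifier { all: text, letters, numbers: None });
    }
    // everything after the letters must be digits for the $ anchor to hold
    if !rest.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    Some(FancyIdentifier { all: text, letters, numbers: Some(rest) })
}

fn main() {
    let m = captures("foo12").unwrap();
    assert_eq!((m.all, m.letters, m.numbers), ("foo12", "foo", Some("12")));
    assert!(captures("12foo").is_none());
}
```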

@huonw

active/0000-regexps.md
+case---but I'm just hazarding a guess here. (If we go this route, then we'd
+probably also have to expose the regexp parser and AST and possibly the
+compiler and instruction set to make writing your own backend easier. That
+sounds restrictive with respect to making performance improvements in the

@huonw

huonw Apr 13, 2014

Member

We could expose it as an #[unstable] or even #[experimental] interface: i.e. subject to change, but it's possible to use if you really need it.

@BurntSushi

BurntSushi Apr 13, 2014

Member

@chris-morgan That's a really interesting idea. I hadn't thought of it. (I spoke with @eddyb about it on IRC.)

I think it's something worth trying and could potentially increase performance dramatically, but I also think it's complex enough that it be thrown in the bin of future work. @eddyb and I both agree that it would require a specialization of the Pike VM, which I think is doable (without allocation even). A more naive implementation is difficult because (I think) it would rely on recursion for handling non-determinism, which would pretty easily result in stack overflows. (In fact, most of the complexity of the Pike VM is a direct result of manually managing a queue of states. Any recursion in the VM is strictly bounded to the number of instructions in the regexp.)

If this sounds OK to you, I'll add it to the RFC as possible future work. (Since it may require an API change, I think the #[unstable] and #[experimental] would make that OK.)
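To make the queue-of-states point concrete, here is a toy Pike-VM-style simulator in modern Rust. It is not the RFC's implementation, and the instruction set and program are illustrative only: non-determinism is handled by keeping a list of live instruction pointers and advancing them in lockstep over the input, and the only recursion (following epsilon transitions in `add`) is bounded by the number of instructions.

```rust
// Minimal Pike-VM-style bytecode and simulator for anchored matching.
#[derive(Clone, Copy)]
#[allow(dead_code)]
enum Inst {
    Char(char),          // consume one matching character
    Split(usize, usize), // fork into two threads
    Jmp(usize),          // unconditional jump
    Match,               // accept
}

fn is_match(prog: &[Inst], input: &str) -> bool {
    // Add a thread at `pc`, following epsilon transitions (Split/Jmp).
    // Recursion depth is bounded by prog.len() thanks to the dedup check.
    fn add(prog: &[Inst], list: &mut Vec<usize>, pc: usize) {
        if list.contains(&pc) {
            return;
        }
        match prog[pc] {
            Inst::Split(a, b) => {
                add(prog, list, a);
                add(prog, list, b);
            }
            Inst::Jmp(t) => add(prog, list, t),
            _ => list.push(pc),
        }
    }
    let mut clist = Vec::new();
    add(prog, &mut clist, 0);
    for c in input.chars() {
        // advance every live thread by one input character
        let mut nlist = Vec::new();
        for &pc in &clist {
            if let Inst::Char(want) = prog[pc] {
                if want == c {
                    add(prog, &mut nlist, pc + 1);
                }
            }
        }
        clist = nlist;
    }
    clist.iter().any(|&pc| matches!(prog[pc], Inst::Match))
}

fn main() {
    // program for the regex a+b:
    // 0: Char('a')  1: Split(0, 2)  2: Char('b')  3: Match
    let prog = [Inst::Char('a'), Inst::Split(0, 2), Inst::Char('b'), Inst::Match];
    assert!(is_match(&prog, "ab"));
    assert!(is_match(&prog, "aaab"));
    assert!(!is_match(&prog, "b"));
    assert!(!is_match(&prog, "abb"));
}
```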

@alexcrichton

alexcrichton Apr 13, 2014

Member

@chris-morgan, sadly #[phase(syntax, link)] extern crate regexp; will not work because this is using procedural macros rather than macro_rules macros (two separate systems).

@sfackler

sfackler Apr 13, 2014

Member

@alexcrichton well, it'll work, but introduce a runtime dependency on libsyntax :P

@brendanzab

brendanzab Apr 13, 2014

Member

@alexcrichton Is that a current wart that should be fixed, or will that remain the same?

@alexcrichton

alexcrichton Apr 13, 2014

Member

Yes, that is the wart that will be fixed. In the future world, there will be no need to manually compile two crates, and there will be no runtime dependency on libsyntax, and the syntax will be #[phase(syntax, link)] extern crate my_crate_with_syntax_extensions;

@brendanzab

brendanzab Apr 13, 2014

Member

@BurntSushi Awesome work. Have you by any chance seen D's compile time regex with templates? (Scroll down to "Regular Expression Compiler"). I'm guessing you are probably using a very different method to statically compile things though, but its still interesting.

Also, could you explain why you chose the specific identifier for your library and types? Here are some choices you could have made:

  • Re, libre: too ambiguous? but consistent with re!
  • Regex, libregex: the shortening most people use in conversation
  • Regexp, libregexp: current proposal
  • Regexpr, libregexpr: rust uses expr in the macro_rules thing - more consistent maybe?
  • RegExp, libreg_exp: might be more consistent with the accepted identifier style
  • RegExpr, libreg_expr: see above

We could bikeshed this forever, but I do think it deserves at least some passing consideration before we pull the trigger.

@BurntSushi

BurntSushi Apr 13, 2014

Member

@bjz Thanks! I did look at D's regexes this morning. From what I understand D provides something similar to my current re! macro, which compiles a regexp at compile time but still relies on a general implementation to do matching. The example I found here (toward the bottom) is: static r = regex("Boo-hoo");. D also supports compiling a regexp to native code with ctRegex, which is what @chris-morgan suggested above. I'm not exactly sure why they are separate in the public API though.

Also, for the name, I didn't put much thought into it. If you pressed me, I'd say I used it simply because that's the name of the package in Go's standard library. (Which isn't that good of a reason.)

re is what Python uses, but I agree with you that it might be too ambiguous.

I'd also be happy with Regex and libregex. I'm less a fan of the other suggestions, just because they look more ugly to me. Also, I think people tend to refer to them as either "regexes" or "regexps" rather than "regexprs", so maybe that's another reason to stick with regex/regexp.

@BurntSushi

BurntSushi Apr 13, 2014

Member

Meta: should the RFC include a discussion/justification of the name?

@brendanzab

brendanzab Apr 13, 2014

Member

Regarding D, cool to hear your impressions! A while back I heard Andrei make some very bold claims about D's regex performance compared to other libs, and I would wonder how yours would compare. I realise however that this is an RFC regarding the public API, and the internals could be improved later.

I would make a mention of the naming in the RFC – I think it is important to show you have considered alternatives and precedents rather than jumping on the first one that came to mind. The bike shedding is inevitable (and sometimes necessary), but at least it helps to focus the debate.

@BurntSushi

BurntSushi Apr 13, 2014

Member

@bjz RE D: Yeah, I think it would be very exciting to see what @chris-morgan's suggestion would do to performance. That along with implementing a DFA are two major optimizations for future work. (Along with a few other minor ones, like a one-pass NFA.)

I've added some stuff about the name to the RFC.

@BurntSushi

BurntSushi Apr 13, 2014

Member

@bjz I agree. I would actually prefer that it be called regex! or regexp! (whatever the crate name is I guess). I called it re! because that's what people had been writing (when talking about a hypothetical macro).

@chris-morgan

chris-morgan Apr 14, 2014

Member

I personally prefer crate re and macro re!. But then, I come from a Python background, so don't trust me.

@chris-morgan

chris-morgan Apr 14, 2014

Member

Google Trends for regexpr, regexp and regex:

  • "regexpr" is basically never used;
  • "regexp" is steadily declining in usage;
  • "regex" has been the preferred form for at least ten years.
@lfairy

active/0000-regexps.md
+The `\w` character class and the zero-width word boundary assertion `\b` are
+defined in terms of the ASCII character set. I'm not aware of any
+implementation that defines these in terms of proper Unicode character classes.
+Do we want to be the first?

@lfairy

lfairy Apr 14, 2014

Contributor

\w and \d and \s all default to Unicode under Python 3. So there's a little bit of precedent.
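Rust's own char methods already draw the same ASCII-versus-Unicode line, which is essentially what an ASCII-only versus Unicode-aware \w or \d comes down to (a small sketch using today's std names, which postdate this discussion):

```rust
fn main() {
    // 'é' is a word character under Unicode semantics but not under ASCII \w
    assert!('é'.is_alphanumeric());
    assert!(!'é'.is_ascii_alphanumeric());
    // '٣' (Arabic-Indic digit three) is numeric but not an ASCII digit
    assert!('٣'.is_numeric());
    assert!(!'٣'.is_ascii_digit());
}
```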

@BurntSushi

BurntSushi Apr 14, 2014

Member

Ah! I actually think D also does it. I'd say that's probably enough precedent to go with Unicode. (For word boundaries too, I think.)

@lfairy

lfairy Apr 14, 2014

Contributor

DFA compilation would be great, though probably as an option. The main advantage is performance: it matches in O(n) time and O(1) memory (zero allocations!). It has no runtime dependencies. Plus, since it's effectively a finite state machine, it's straightforward to translate to LLVM.

It's not a free lunch though -- a DFA matcher has worst-case exponential code size, which can make it impractical for complex expressions.

If a DFA compiler is implemented, we can either tuck it under a flag, or enable it by default but fall back if the code becomes too large. Either way, I think finishing Unicode support (especially case folding) is a higher priority.

(As for the name, I vote for regex. The 'p' in regexp doesn't add anything semantically; we might as well take it out.)
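As a concrete picture of what "compile to a state machine with no runtime dependencies" means, a DFA is just a state enum and a match over (state, character) pairs. This hand-written sketch (the regex and names are invented for illustration) matches ^a[bc]*d$ with no regex engine at runtime:

```rust
// Hand-compiled DFA for the regex ^a[bc]*d$.
fn matches_abcd(input: &str) -> bool {
    #[derive(Clone, Copy, PartialEq)]
    enum State {
        Start,  // expecting the leading 'a'
        Middle, // inside the [bc]* run, or ready for the final 'd'
        Accept, // saw the final 'd'
        Dead,   // no recovery possible
    }
    let mut s = State::Start;
    for ch in input.chars() {
        s = match (s, ch) {
            (State::Start, 'a') => State::Middle,
            (State::Middle, 'b') | (State::Middle, 'c') => State::Middle,
            (State::Middle, 'd') => State::Accept,
            _ => State::Dead, // any input after Accept also kills the match
        };
    }
    s == State::Accept
}

fn main() {
    assert!(matches_abcd("abcbd"));
    assert!(matches_abcd("ad"));
    assert!(!matches_abcd("abdd"));
    assert!(!matches_abcd("bd"));
}
```

The worst-case exponential blowup mentioned above shows up as the number of enum variants this kind of expansion would need for less friendly expressions.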

@BurntSushi

BurntSushi Apr 14, 2014

Member

@lfairy Note that a compiled NFA should also have zero (heap) allocations. (The generalized NFA simulator has O(m) (heap) space complexity, where m is the number of instructions in the regexp.)

I believe RE2/C++ tries to use a DFA and falls back to an NFA if the state cache gets flushed too frequently (I think).

@huonw

huonw Apr 14, 2014

Member

Maybe the DFA approach could be performed by hooking up Ragel to generate Rust AST (this may be tricky). (cc https://github.com/erickt/ragel, which is generating Rust code as text.)

@seanmonstar

seanmonstar Apr 14, 2014

Contributor

I find it slightly odd that Regexp::new() returns a Result. I've come to assume that new() will always return that object, and that it's safe to do so.

Would Regexp::compile(str) -> Result feel nicer?
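Either way the shape is the same: a fallible constructor returning Result, differing only in the name. A minimal sketch of the two spellings (the Error type and the paren-balance "validation" are invented placeholders, not the RFC's API):

```rust
#[derive(Debug)]
struct Error(String);

struct Regexp {
    pattern: String,
}

impl Regexp {
    // spelling A: `new` is fallible
    fn new(pat: &str) -> Result<Regexp, Error> {
        Regexp::compile(pat)
    }

    // spelling B: `compile` makes the parse/compile step explicit
    fn compile(pat: &str) -> Result<Regexp, Error> {
        // stand-in for real regex parsing: just check that parens balance
        let mut depth = 0i32;
        for c in pat.chars() {
            match c {
                '(' => depth += 1,
                ')' => depth -= 1,
                _ => {}
            }
            if depth < 0 {
                return Err(Error(format!("unmatched `)` in `{}`", pat)));
            }
        }
        if depth != 0 {
            return Err(Error(format!("unmatched `(` in `{}`", pat)));
        }
        Ok(Regexp { pattern: pat.to_string() })
    }
}

fn main() {
    assert_eq!(Regexp::new("(a|b)+").unwrap().pattern, "(a|b)+");
    assert!(Regexp::compile("(a|b").is_err());
}
```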

@lfairy

lfairy Apr 14, 2014

Contributor

@BurntSushi The VM does allocate, to create a list of running threads -- but given the allocation only happens once, I see where you're coming from.

I believe RE2/C++ tries to use a DFA and falls back to an NFA if the state cache gets flushed too frequently (I think).

That sounds right.

@huonw Thanks for the link. I suspect people who need the performance/expressiveness of Ragel would use it directly though.

This leaves the DFA approach in an awkward spot, methinks -- simple cases work well with re2-style state caching, and advanced cases can use a lexer generator or something magical like Ragel. Looks like Russ Cox had it right all along ;)
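The "re2-style state caching" mentioned here can be illustrated with a toy lazy DFA. This is my own sketch in modern Rust, not RE2 or the proposed crate: a DFA state is a set of NFA program counters, and a transition is computed by one step of NFA simulation only on a cache miss, then memoized.

```rust
use std::collections::HashMap;

// Toy lazy-DFA sketch (illustrative): only the DFA states the input
// actually visits are ever materialized, and repeat transitions hit
// the cache instead of re-running the NFA step.
#[derive(Clone, Copy)]
enum Inst {
    Char(char),          // consume one character
    Split(usize, usize), // epsilon edge to both targets
    Match,               // accept
}

// Epsilon closure: expand Split edges until the state set is stable.
fn closure(prog: &[Inst], pcs: &mut Vec<usize>) {
    let mut i = 0;
    while i < pcs.len() {
        if let Inst::Split(a, b) = prog[pcs[i]] {
            for t in [a, b] {
                if !pcs.contains(&t) {
                    pcs.push(t);
                }
            }
        }
        i += 1;
    }
    pcs.sort_unstable(); // canonical order, so sets compare equal
}

fn is_match(prog: &[Inst], input: &str) -> bool {
    // The "state cache": (DFA state, input char) -> next DFA state.
    let mut cache: HashMap<(Vec<usize>, char), Vec<usize>> = HashMap::new();
    let mut state = vec![0];
    closure(prog, &mut state);
    for c in input.chars() {
        let key = (state, c);
        if !cache.contains_key(&key) {
            // Cache miss: one step of NFA simulation, then remember it.
            let mut next = Vec::new();
            for &pc in &key.0 {
                if let Inst::Char(want) = prog[pc] {
                    if want == c {
                        next.push(pc + 1);
                    }
                }
            }
            closure(prog, &mut next);
            cache.insert(key.clone(), next);
        }
        state = cache[&key].clone();
    }
    state.iter().any(|&pc| matches!(prog[pc], Inst::Match))
}

fn main() {
    // Hand-compiled NFA for the anchored regex `a*b`.
    let prog = [
        Inst::Split(1, 3), // 0: enter loop or skip to 'b'
        Inst::Char('a'),   // 1
        Inst::Split(1, 3), // 2: loop again or move on
        Inst::Char('b'),   // 3
        Inst::Match,       // 4
    ];
    assert!(is_match(&prog, "aaab"));
    assert!(!is_match(&prog, "aaac"));
    println!("ok");
}
```

RE2's actual fallback behavior addresses what this toy ignores: if the cache grows without bound on adversarial inputs, it is flushed, and if flushes happen too often the engine reverts to plain NFA simulation.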

@huonw (Member) commented Apr 14, 2014

> I find it slightly odd that Regexp::new() returns a Result. I've come to assume that new() will always return that object, and that it's safe to do so. Would Regexp::compile(str) -> Result feel nicer?

The types mean that you can never accidentally use the return of ::new() incorrectly (I'm personally fine with using new for this reason: strong types).

> I suspect people who need the performance/expressiveness of Ragel would use it directly though.

That doesn't preclude using Ragel just as a step in an efficient regex -> native code translator. (That is, writing a regex syntax -> ragel syntax translator and an output-to-Rust-AST mode for ragel may be easier than writing a direct regex syntax -> Rust-AST translator that results in equally good code. Of course, adding a ragel dependency to the core distribution would be a no-go.)

@BurntSushi (Member) commented Apr 18, 2014

I've been working on what @chris-morgan suggested: real compilation to native Rust with the re! macro. I'm sure there are more performance gains to be had, but I'm at a reasonable place right now. One bummer is that it can no longer be declared statically. Instead, it can be used anywhere an expression can be used. But, it is also indistinguishable from a regexp compiled at runtime. Internally, the representation looks like this:

pub enum MaybeNative {
    Dynamic(~[Inst]),
    Native(fn(MatchKind, &str, uint, uint) -> ~[Option<uint>]),
}

This makes runtime and compiled regexps have a completely identical API. It means it might not be as nice as the API that @chris-morgan suggested, but I think a consistent API between both is probably more valuable.
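For illustration, a rough modern-Rust rendition of this dispatch idea might look like the following. The `Inst` variants and the trivial interpreter are invented here for the sketch; only the two-variant enum shape comes from the comment above.

```rust
// Sketch (assumptions mine) of the MaybeNative idea: one enum lets
// runtime-compiled and statically generated regexps share an
// identical public API.
enum Inst {
    Char(char),
    Match,
}

enum MaybeNative {
    // Bytecode interpreted by the VM at runtime.
    Dynamic(Vec<Inst>),
    // A function a regexp! macro might have generated at compile time.
    Native(fn(&str) -> bool),
}

impl MaybeNative {
    fn is_match(&self, input: &str) -> bool {
        match self {
            // Trivial interpreter: anchored literal-sequence match.
            MaybeNative::Dynamic(prog) => {
                let mut chars = input.chars();
                for inst in prog {
                    match inst {
                        Inst::Char(c) => {
                            if chars.next() != Some(*c) {
                                return false;
                            }
                        }
                        Inst::Match => return chars.next().is_none(),
                    }
                }
                false
            }
            MaybeNative::Native(f) => f(input),
        }
    }
}

fn main() {
    let dynamic = MaybeNative::Dynamic(vec![Inst::Char('a'), Inst::Char('b'), Inst::Match]);
    // Stand-in for what a macro expansion might produce.
    let native = MaybeNative::Native(|s: &str| s == "ab");
    assert!(dynamic.is_match("ab"));
    assert!(!dynamic.is_match("abc"));
    assert_eq!(dynamic.is_match("ab"), native.is_match("ab"));
    println!("ok");
}
```

Because both variants sit behind the same `is_match` method, callers cannot tell (and need not care) whether a regexp was compiled at runtime or by the macro.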

The doco for the updated code is at a different URL. There are some examples using the new macro: http://burntsushi.net/rustdoc/exp/regexp/index.html (But other portions of the doco are rightfully unchanged.)

The benchmarks are also very encouraging. Left column is for dynamic regexps and the right column is for natively compiled regexps.

literal                                 422 ns/iter (+/- 2)                     120 ns/iter (+/- 17)              
not_literal                            1904 ns/iter (+/- 7)                     931 ns/iter (+/- 642)
match_class                            2452 ns/iter (+/- 6)                    1276 ns/iter (+/- 336)
match_class_in_range                   2559 ns/iter (+/- 91)                   1298 ns/iter (+/- 433)
replace_all                            5221 ns/iter (+/- 529)                  1216 ns/iter (+/- 648)
anchored_literal_short_non_match        939 ns/iter (+/- 10)                    420 ns/iter (+/- 168)
anchored_literal_long_non_match        8979 ns/iter (+/- 64)                   5407 ns/iter (+/- 1982)
anchored_literal_short_match            576 ns/iter (+/- 5)                     126 ns/iter (+/- 92)
anchored_literal_long_match             553 ns/iter (+/- 8)                     150 ns/iter (+/- 103)
one_pass_short_a                       2039 ns/iter (+/- 20)                   1036 ns/iter (+/- 397)
one_pass_short_a_not                   2698 ns/iter (+/- 8)                    1365 ns/iter (+/- 623)
one_pass_short_b                       1457 ns/iter (+/- 14)                    710 ns/iter (+/- 495)
one_pass_short_b_not                   2037 ns/iter (+/- 13)                    974 ns/iter (+/- 552)
one_pass_long_prefix                   1188 ns/iter (+/- 7)                     383 ns/iter (+/- 117)
one_pass_long_prefix_not               1217 ns/iter (+/- 7)                     344 ns/iter (+/- 196)
easy0_32                                564 ns/iter (+/- 14) = 56 MB/s           44 ns/iter (+/- 12) = 727 MB/s
easy0_1K                               2389 ns/iter (+/- 167) = 428 MB/s       1903 ns/iter (+/- 390) = 538 MB/s
easy0_32K                             59404 ns/iter (+/- 882) = 551 MB/s      59889 ns/iter (+/- 34128) = 547 MB/s
easy1_32                                543 ns/iter (+/- 145) = 58 MB/s          55 ns/iter (+/- 58) = 581 MB/s
easy1_1K                               3495 ns/iter (+/- 829) = 292 MB/s       1629 ns/iter (+/- 601) = 628 MB/s
easy1_32K                             92901 ns/iter (+/- 5203) = 352 MB/s     48938 ns/iter (+/- 8302) = 669 MB/s
medium_32                              1611 ns/iter (+/- 61) = 19 MB/s          526 ns/iter (+/- 60) = 60 MB/s
medium_1K                             33457 ns/iter (+/- 621) = 30 MB/s       14541 ns/iter (+/- 5849) = 70 MB/s
medium_32K                          1044635 ns/iter (+/- 19853) = 31 MB/s    472571 ns/iter (+/- 177623) = 69 MB/s
hard_32                                2447 ns/iter (+/- 129) = 13 MB/s        1025 ns/iter (+/- 516) = 31 MB/s
hard_1K                               54844 ns/iter (+/- 297) = 18 MB/s       30248 ns/iter (+/- 11665) = 33 MB/s
hard_32K                            1744529 ns/iter (+/- 40267) = 18 MB/s    993564 ns/iter (+/- 455100) = 32 MB/s

There's also a similarly big jump in performance on the regex-dna benchmark. Old. New.

@BurntSushi (Member) commented Apr 18, 2014

I've updated the RFC to use natively compiled regexps and simplified some sections based on discussion here. And the implementation is now Unicode friendly for Perl character classes and word boundaries.

Aside from the name of the crate (how is that decided?), I think I've incorporated all feedback given.

@sfackler (Member) commented Apr 18, 2014

How does the codegen size compare between the old and new syntax extension implementations? Will a binary with a lot of regexes need to avoid native compilation because it would bloat the binary too much?

@seanmonstar (Contributor) commented Apr 18, 2014

Indeed, I was thinking similarly: if I use regexp! several times, won't it generate a lot of redundant code? I imagine the repeatable part can be put into a regexp::native module, and the macro can just expand to calls into those functions with the expanded values.
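The factoring suggested here can be sketched as follows. The names (`regexp_native`, `run_literal`) and the literal-only matcher are hypothetical; the point is only the split between shared machinery and per-call-site data.

```rust
// The shared part: one generic runner, compiled once per crate
// regardless of how many regexp! call sites exist.
mod regexp_native {
    // Toy "program": a sequence of literal characters to match exactly.
    pub fn run_literal(prog: &[char], input: &str) -> bool {
        let mut chars = input.chars();
        prog.iter().all(|&c| chars.next() == Some(c)) && chars.next().is_none()
    }
}

// What a hypothetical regexp!("ab") expansion might reduce to:
// only this static table is per-call-site; the logic above is shared.
static PROG_AB: [char; 2] = ['a', 'b'];

fn main() {
    assert!(regexp_native::run_literal(&PROG_AB, "ab"));
    assert!(!regexp_native::run_literal(&PROG_AB, "abc"));
    println!("ok");
}
```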

@BurntSushi (Member) commented Apr 19, 2014

@sfackler I don't (yet) have a comparison with the old regexp! macro, but with native compilation, my test binary with roughly 434 regexps (many of which are pretty big) is 17MB compiled without optimization. Compiled with -O, the binary shrinks to 6.7MB. Compiled with --opt-level=3 -Z lto, the binary is 5.3MB.

As a baseline, if I compile test using dynamic regexps, then the binary sizes are 6MB, 4.3MB and 2.7MB, respectively.

These sizes seem pretty reasonable to me, since I think 400+ regexps is a pretty extreme case.

@seanmonstar That is indeed possible and I'm already doing it for some pieces. There is more that could be done though. Any piece that has knowledge of types like [T, ..N] has to be specialized though.

@BurntSushi (Member) commented Apr 19, 2014

Here's another perspective. Given a minimal binary with a single small regexp (that prints all capture groups), compiling without optimization increases the binary size by 54KB (comparing dynamic vs. native regexp). Compiling with -O increases size by 9KB. Compiling with --opt-level=3 -Z lto decreases size by 184KB. (That seems wicked. Maybe the optimizer knows to leave out Regexp::new and all of its requisite machinery? e.g., The VM, parser and compiler.)

@alexcrichton alexcrichton merged commit c250f8b into rust-lang:master Apr 22, 2014

@alexcrichton (Member) commented Apr 22, 2014

We discussed this in today's meeting and decided to merge it.

The only caveat we'd like to attach is that the entire crate is #[experimental] for now (so we can get some traction first). Other than that though, we're all looking forward to being able to use regular expressions!

@BurntSushi BurntSushi deleted the BurntSushi:regexps branch Apr 23, 2014

@bearophile commented Apr 27, 2014

@BurntSushi: > I'm not exactly sure why they are separate in the public API though.

I think it's because, until someone patches D's Compile Time Function Execution (CTFE) interpreter to make it more memory-efficient, you sometimes want to avoid ctRegex to keep compilation memory usage down.

@chriskrycho chriskrycho referenced this pull request in rust-lang/rust Dec 30, 2016

Closed

Document all features in the reference #38643

withoutboats pushed a commit to withoutboats/rfcs that referenced this pull request Jan 15, 2017

Merge pull request #42 from SimonSapin/patch-4
Fix links to promise function vs Promise type

@chriskrycho chriskrycho referenced this pull request in rust-lang-nursery/reference Mar 11, 2017

Closed

Document all features #9
