Permalink
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
972 lines (789 sloc) 39.5 KB

Table of contents

Summary

This RFC proposes a 1.0 API for the regex crate and therefore a move out of the rust-lang-nursery organization and into the rust-lang organization. Since the API of regex has largely remained unchanged since its inception 2 years ago, significant emphasis is placed on retaining the existing API. Some minor breaking changes are proposed.

Motivation

Regular expressions are a widely used tool and most popular programming languages either have an implementation of regexes in their standard library, or there exists at least one widely used third party implementation. It therefore seems reasonable for Rust to do something similar.

The regex crate specifically serves many use cases, most of which are somehow related to searching strings for patterns. Describing regular expressions in detail is beyond the scope of this RFC, but briefly, these core use cases are supported in the main API:

  1. Testing whether a pattern matches some text.
  2. Finding the location of a match of a pattern in some text.
  3. Finding the location of a match of a pattern---and locations of all its capturing groups---in some text.
  4. Iterating over successive non-overlapping matches of (2) and (3).

The expected outcome is that the regex crate should be the preferred default choice for matching regular expressions when writing Rust code. This is already true today; this RFC formalizes it.

Detailed design

Syntax

Evolution

The public API of a regex library includes the syntax of a regular expression. A change in the semantics of the syntax can cause otherwise working programs to break, yet, we'd still like the option to expand the syntax if necessary. Thus, this RFC proposes:

  1. Any change that causes a previously invalid regex pattern to become valid is not a breaking change. For example, the escape sequence \y is not a valid pattern, but could become one in a future release without a major version bump.
  2. Any change that causes a previously valid regex pattern to become invalid is a breaking change.
  3. Any change that causes a valid regex pattern to change its matching semantics is a breaking change. (For example, changing \b from "word boundary assertion" to "backspace character.")

Bug fixes and Unicode upgrades are exceptions to both (2) and (3).

Another interesting exception to (2) is that compiling a regex can fail if the entire compiled object would exceed some pre-defined user configurable size. In particular, future changes to the compiler could cause certain instructions to use more memory, or indeed, the representation of the compiled regex could change completely. This could cause a regex that fit under the size limit to no longer fit, and therefore fail to compile. These cases are expected to be extremely rare in practice. Notably, the default size limit is 10MB.

Concrete syntax

The syntax is exhaustively documented in the current public API documentation: http://doc.rust-lang.org/regex/regex/index.html#syntax

To my knowledge, the evolution as proposed in this RFC has been followed since regex was created. The syntax has largely remained unchanged with few additions.

Expansion concerns

There are a few possible avenues for expansion, and we take measures to make sure they are possible with respect to API evolution.

  • Escape sequences are often blessed with special semantics. For example, \d is a Unicode character class that matches any digit and \b is a word boundary assertion. We may one day like to add more escape sequences with special semantics. For this reason, any unrecognized escape sequence makes a pattern invalid.
  • If we wanted to expand the syntax with various look-around operators, then it would be possible since most common syntax is considered an invalid pattern today. In particular, all of the syntactic forms listed here are invalid patterns in regex.
  • Character class sets are another potentially useful feature that may be worth adding. Currently, various forms of set notation are treated as valid patterns, but this RFC proposes making them invalid patterns before 1.0.
  • Additional named Unicode classes or codepoints may be desirable to add. Today, any pattern of the form \p{NAME} where NAME is unrecognized is considered invalid, which leaves room for expansion.
  • If all else fails, we can introduce new flags that enable new features that conflict with stable syntax. This is possible because using an unrecognized flag results in an invalid pattern.

Core API

The core API of the regex crate is the Regex type:

pub struct Regex(_);

It has one primary constructor:

impl Regex {
  /// Creates a new regular expression. If the pattern is invalid or otherwise
  /// fails to compile, this returns an error.
  pub fn new(pattern: &str) -> Result<Regex, Error>;
}

And five core search methods. All searching completes in worst case linear time with respect to the search text (the size of the regex is taken as a constant).

impl Regex {
  /// Returns true if and only if the text matches this regex.
  pub fn is_match(&self, text: &str) -> bool;

  /// Returns the leftmost-first match of this regex in the text given. If no
  /// match exists, then None is returned.
  ///
  /// The leftmost-first match is defined as the first match that is found
  /// by a backtracking search.
  pub fn find<'t>(&self, text: &'t str) -> Option<Match<'t>>;

  /// Returns an iterator of successive non-overlapping matches of this regex
  /// in the text given.
  pub fn find_iter<'r, 't>(&'r self, text: &'t str) -> Matches<'r, 't>;

  /// Returns the leftmost-first match of this regex in the text given with
  /// locations for all capturing groups that participated in the match.
  pub fn captures(&self, text: &str) -> Option<Captures>;

  /// Returns an iterator of successive non-overlapping matches with capturing
  /// group information in the text given.
  pub fn captures_iter<'r, 't>(&'r self, text: &'t str) -> CaptureMatches<'r, 't>;
}

(N.B. The captures method can technically replace all uses of find and is_match, but is potentially slower. Namely, the API reflects a performance trade off: the more you ask for, the harder the regex engine has to work.)

There is one additional, but idiosyncratic, search method:

impl Regex {
  /// Returns the end location of a match if one exists in text.
  ///
  /// This may return a location preceding the end of a proper leftmost-first
  /// match. In particular, it may return the location at which a match is
  /// determined to exist. For example, matching `a+` against `aaaaa` will
  /// return `1` while the end of the leftmost-first match is actually `5`.
  ///
  /// This has the same performance characteristics as `is_match`.
  pub fn shortest_match(&self, text: &str) -> Option<usize>;
}

And two methods for splitting:

impl Regex {
  /// Returns an iterator of substrings of `text` delimited by a match of
  /// this regular expression. Each element yielded by the iterator corresponds
  /// to text that *isn't* matched by this regex.
  pub fn split<'r, 't>(&'r self, text: &'t str) -> Split<'r, 't>;

  /// Returns an iterator of at most `limit` substrings of `text` delimited by
  /// a match of this regular expression. Each element yielded by the iterator
  /// corresponds to text that *isn't* matched by this regex. The remainder of
  /// `text` that is not split will be the last element yielded by the
  /// iterator.
  pub fn splitn<'r, 't>(&'r self, text: &'t str, limit: usize) -> SplitN<'r, 't>;
}

And three methods for replacement. Replacement is discussed in more detail in a subsequent section.

impl Regex {
  /// Replaces matches of this regex in `text` with `rep`. If no matches were
  /// found, then the given string is returned unchanged, otherwise a new
  /// string is allocated.
  ///
  /// `replace` replaces the first match only. `replace_all` replaces all
  /// matches. `replacen` replaces at most `limit` matches.
  fn replace<'t, R: Replacer>(&self, text: &'t str, rep: R) -> Cow<'t, str>;
  fn replace_all<'t, R: Replacer>(&self, text: &'t str, rep: R) -> Cow<'t, str>;
  fn replacen<'t, R: Replacer>(&self, text: &'t str, limit: usize, rep: R) -> Cow<'t, str>;
}

And lastly, three simple accessors:

impl Regex {
  /// Returns the original pattern string.
  pub fn as_str(&self) -> &str;

  /// Returns an iterator over all capturing group in the pattern in the order
  /// they were defined (by position of the leftmost parenthesis). The name of
  /// the group is yielded if it has a name, otherwise None is yielded.
  pub fn capture_names(&self) -> CaptureNames;

  /// Returns the total number of capturing groups in the pattern. This
  /// includes the implicit capturing group corresponding to the entire
  /// pattern.
  pub fn captures_len(&self) -> usize;
}

Finally, Regex impls the Send, Sync, Display, Debug, Clone and FromStr traits from the standard library.

Error

The Error enum is an extensible enum, similar to std::io::Error, corresponding to the different ways that regex compilation can fail. In particular, this means that adding a new variant to this enum is not a breaking change. (Removing or changing an existing variant is still a breaking change.)

pub enum Error {
  /// A syntax error.
  Syntax(SyntaxError),
  /// The compiled program exceeded the set size limit.
  /// The argument is the size limit imposed.
  CompiledTooBig(usize),
  /// Hints that destructuring should not be exhaustive.
  ///
  /// This enum may grow additional variants, so this makes sure clients
  /// don't count on exhaustive matching. (Otherwise, adding a new variant
  /// could break existing code.)
  #[doc(hidden)]
  __Nonexhaustive,
}

Note that the Syntax variant could contain the Error type from the regex-syntax crate, but this couples regex-syntax to the public API of regex. We sidestep this hazard by defining a newtype in regex that internally wraps regex_syntax::Error. This also enables us to selectively expose more information in the future.

RegexBuilder

In most cases, the construction of a regex is done with Regex::new. There are however some options one might want to tweak. This can be done with a RegexBuilder:

impl RegexBuilder {
  /// Creates a new builder from the given pattern.
  pub fn new(pattern: &str) -> RegexBuilder;

  /// Compiles the pattern and all set options. If successful, a Regex is
  /// returned. Otherwise, if compilation failed, an Error is returned.
  ///
  /// N.B. `RegexBuilder::new("...").compile()` is equivalent to
  /// `Regex::new("...")`.
  pub fn build(&self) -> Result<Regex, Error>;

  /// Set the case insensitive flag (i).
  pub fn case_insensitive(&mut self, yes: bool) -> &mut RegexBuilder;

  /// Set the multi line flag (m).
  pub fn multi_line(&mut self, yes: bool) -> &mut RegexBuilder;

  /// Set the dot-matches-any-character flag (s).
  pub fn dot_matches_new_line(&mut self, yes: bool) -> &mut RegexBuilder;

  /// Set the swap-greedy flag (U).
  pub fn swap_greed(&mut self, yes: bool) -> &mut RegexBuilder;

  /// Set the ignore whitespace flag (x).
  pub fn ignore_whitespace(&mut self, yes: bool) -> &mut RegexBuilder;

  /// Set the Unicode flag (u).
  pub fn unicode(&mut self, yes: bool) -> &mut RegexBuilder;

  /// Set the approximate size limit (in bytes) of the compiled regular
  /// expression.
  ///
  /// If compiling a pattern would approximately exceed this size, then
  /// compilation will fail.
  pub fn size_limit(&mut self, limit: usize) -> &mut RegexBuilder;

  /// Set the approximate size limit (in bytes) of the cache used by the DFA.
  ///
  /// This is a per thread limit. Once the DFA fills the cache, it will be
  /// wiped and refilled again. If the cache is wiped too frequently, the
  /// DFA will quit and fall back to another matching engine.
  pub fn dfa_size_limit(&mut self, limit: usize) -> &mut RegexBuilder;
}

Captures

A Captures value stores the locations of all matching capturing groups for a single match. It provides convenient access to those locations indexed by either number, or, if available, name.

The first capturing group (index 0) is always unnamed and always corresponds to the entire match. Other capturing groups correspond to groups in the pattern. Capturing groups are indexed by the position of their leftmost parenthesis in the pattern.

Note that Captures is a type constructor with a single parameter: the lifetime of the text searched by the corresponding regex. In particular, the lifetime of Captures is not tied to the lifetime of a Regex.

impl<'t> Captures<'t> {
  /// Returns the match associated with the capture group at index `i`. If
  /// `i` does not correspond to a capture group, or if the capture group
  /// did not participate in the match, then `None` is returned.
  pub fn get(&self, i: usize) -> Option<Match<'t>>;

  /// Returns the match for the capture group named `name`. If `name` isn't a
  /// valid capture group or didn't match anything, then `None` is returned.
  pub fn name(&self, name: &str) -> Option<Match<'t>>;

  /// Returns the number of captured groups. This is always at least 1, since
  /// the first unnamed capturing group corresponding to the entire match
  /// always exists.
  pub fn len(&self) -> usize;

  /// Expands all instances of $name in the text given to the value of the
  /// corresponding named capture group. The expanded string is written to
  /// dst.
  ///
  /// The name in $name may be integer corresponding to the index of a capture
  /// group or it can be the name of a capture group. If the name isn't a valid
  /// capture group, then it is replaced with an empty string.
  ///
  /// The longest possible name is used. e.g., $1a looks up the capture group
  /// named 1a and not the capture group at index 1. To exert more precise
  /// control over the name, use braces, e.g., ${1}a.
  ///
  /// To write a literal $, use $$.
  pub fn expand(&self, replacement: &str, dst: &mut String);
}

The Captures type impls Debug, Index<usize> (for numbered capture groups) and Index<str> (for named capture groups). A downside of the Index impls is that the return value is bounded to the lifetime of Captures instead of the lifetime of the actual text searched because of how the Index trait is defined. Callers can work around that limitation if necessary by using an explicit method such as get or name.

Replacer

The Replacer trait is a helper trait to make the various replace methods on Regex more ergonomic. In particular, it makes it possible to use either a standard string as a replacement, or a closure with more explicit access to a Captures value.

pub trait Replacer {
  /// Appends text to dst to replace the current match.
  ///
  /// The current match is represents by caps, which is guaranteed to have a
  /// match at capture group 0.
  ///
  /// For example, a no-op replacement would be
  /// dst.extend(caps.at(0).unwrap()).
  fn replace_append(&mut self, caps: &Captures, dst: &mut String);

  /// Return a fixed unchanging replacement string.
  ///
  /// When doing replacements, if access to Captures is not needed, then
  /// it can be beneficial from a performance perspective to avoid finding
  /// sub-captures. In general, this is called once for every call to replacen.
  fn no_expansion<'r>(&'r mut self) -> Option<Cow<'r, str>> {
    None
  }
}

Along with this trait, there is also a helper type, NoExpand that implements Replacer like so:

pub struct NoExpand<'t>(pub &'t str);

impl<'t> Replacer for NoExpand<'t> {
    fn replace_append(&mut self, _: &Captures, dst: &mut String) {
        dst.push_str(self.0);
    }

    fn no_expansion<'r>(&'r mut self) -> Option<Cow<'r, str>> {
        Some(Cow::Borrowed(self.0))
    }
}

This permits callers to use NoExpand with the replace methods to guarantee that the replacement string is never searched for $group replacement syntax.

We also provide two more implementations of the Replacer trait: &str and FnMut(&Captures) -> String.

quote

There is one free function in regex:

/// Escapes all regular expression meta characters in `text`.
///
/// The string returned may be safely used as a literal in a regex.
pub fn quote(text: &str) -> String;

RegexSet

A RegexSet represents the union of zero or more regular expressions. It is a specialized machine that can match multiple regular expressions simultaneously. Conceptually, it is similar to joining multiple regexes as alternates, e.g., re1|re2|...|reN, with one crucial difference: in a RegexSet, multiple expressions can match. This means that each pattern can be reasoned about independently. A RegexSet is ideal for building simpler lexers or an HTTP router.

Because of their specialized nature, they can only report which regexes match. They do not report match locations. In theory, this could be added in the future, but is difficult.

pub struct RegexSet(_);

impl RegexSet {
  /// Constructs a new RegexSet from the given sequence of patterns.
  ///
  /// The order of the patterns given is used to assign increasing integer
  /// ids starting from 0. Namely, matches are reported in terms of these ids.
  pub fn new<I, S>(patterns: I) -> Result<RegexSet, Error>
      where S: AsRef<str>, I: IntoIterator<Item=S>;

  /// Returns the total number of regexes in this set.
  pub fn len(&self) -> usize;

  /// Returns true if and only if one or more regexes in this set match
  /// somewhere in the given text.
  pub fn is_match(&self, text: &str) -> bool;

  /// Returns the set of regular expressions that match somewhere in the given
  /// text.
  pub fn matches(&self, text: &str) -> SetMatches;
}

RegexSet impls the Debug and Clone traits.

The SetMatches type is queryable and implements IntoIterator.

pub struct SetMatches(_);

impl SetMatches {
  /// Returns true if this set contains 1 or more matches.
  pub fn matched_any(&self) -> bool;

  /// Returns true if and only if the regex identified by the given id is in
  /// this set of matches.
  ///
  /// This panics if the id given is >= the number of regexes in the set that
  /// these matches came from.
  pub fn matched(&self, id: usize) -> bool;

  /// Returns the total number of regexes in the set that created these
  /// matches.
  pub fn len(&self) -> usize;

  /// Returns an iterator over the ids in the set that correspond to a match.
  pub fn iter(&self) -> SetMatchesIter;
}

SetMatches impls the Debug and Clone traits.

Note that a builder is not proposed for RegexSet in this RFC; however, it is likely one will be added at some point in a backwards compatible way.

The bytes submodule

All of the above APIs have thus far been explicitly for searching text where text has type &str. While this author believes that suits most use cases, it should also be possible to search a regex on arbitrary bytes, i.e., &[u8]. One particular use case is quickly searching a file via a memory map. If regexes could only search &str, then one would have to verify it was UTF-8 first, which could be costly. Moreover, if the file isn't valid UTF-8, then you either can't search it, or you have to allocate a new string and lossily copy the contents. Neither case is particularly ideal. It would instead be nice to just search the &[u8] directly.

This RFC including a bytes submodule in the crate. The API of this submodule is a clone of the API described so far, except with &str replaced by &[u8] for the search text (patterns are still &str). The clone includes Regex itself, along with all supporting types and traits such as Captures, Replacer, FindIter, RegexSet, RegexBuilder and so on. (This RFC describes some alternative designs in a subsequent section.)

Since the API is a clone of what has been seen so far, it is not written out again. Instead, we'll discuss the key differences.

Again, the first difference is that a bytes::Regex can search &[u8] while a Regex can search &str.

The second difference is that a bytes::Regex can completely disable Unicode support and explicitly match arbitrary bytes. The details:

  1. The u flag can be disabled even when disabling it might cause the regex to match invalid UTF-8. When the u flag is disabled, the regex is said to be in "ASCII compatible" mode.
  2. In ASCII compatible mode, neither Unicode codepoints nor Unicode character classes are allowed.
  3. In ASCII compatible mode, Perl character classes (\w, \d and \s) revert to their typical ASCII definition. \w maps to [[:word:]], \d maps to [[:digit:]] and \s maps to [[:space:]].
  4. In ASCII compatible mode, word boundaries use the ASCII compatible \w to determine whether a byte is a word byte or not.
  5. Hexadecimal notation can be used to specify arbitrary bytes instead of Unicode codepoints. For example, in ASCII compatible mode, \xFF matches the literal byte \xFF, while in Unicode mode, \xFF is a Unicode codepoint that matches its UTF-8 encoding of \xC3\xBF. Similarly for octal notation.
  6. . matches any byte except for \n instead of any Unicode codepoint. When the s flag is enabled, . matches any byte.

An interesting property of the above is that while the Unicode flag is enabled, a bytes::Regex is guaranteed to match only valid UTF-8 in a &[u8]. Like Regex, the Unicode flag is enabled by default.

N.B. The Unicode flag can also be selectively disabled in a Regex, but not in a way that permits matching invalid UTF-8.

Drawbacks

Guaranteed linear time matching

A significant contract in the API of the regex crate is that all searching has worst case O(n) complexity, where n ~ length(text). (The size of the regular expression is taken as a constant.) This contract imposes significant restrictions on both the implementation and the set of features exposed in the pattern language. A full analysis is beyond the scope of this RFC, but here are the highlights:

  1. Unbounded backtracking can't be used to implement matching. Backtracking can be quite fast in practice (indeed, the current implementation uses bounded backtracking in some cases), but has worst case exponential time.
  2. Permitting backreferences in the pattern language can cause matching to become NP-complete, which (probably) can't be solved in linear time.
  3. Arbitrary look around is probably difficult to fit into a linear time guarantee in practice.

The benefit to the linear time guarantee is just that: no matter what, all searching completes in linear time with respect to the search text. This is a valuable guarantee to make, because it means that one can execute arbitrary regular expressions over arbitrary input and be absolutely sure that it will finish in some "reasonable" time.

Of course, in practice, constants that are omitted from complexity analysis actually matter. For this reason, the regex crate takes a number of steps to keep constants low. For example, by placing a limit on the size of the regular expression or choosing an appropriate matching engine when another might result in higher constant factors.

This particular drawback segregates Rust's regular expression library from most other regular expression libraries that programmers may be familiar with. Languages such as Java, Python, Perl, Ruby, PHP and C++ support more flavorful regexes by default. Go is the only language this author knows of whose standard regex implementation guarantees linear time matching. Of course, RE2 is also worth mentioning, which is a C++ regex library that guarantees linear time matching. There are other implementations of regexes that guarantee linear time matching (TRE, for example), but none of them are particularly popular.

It is also worth noting that since Rust's FFI is zero cost, one can bind to existing regex implementations that provide more features (bindings for both PCRE1 and Oniguruma exist today).

Allocation

The regex API assumes that the implementation can dynamically allocate memory. Indeed, the current implementation takes advantage of this. A regex library that has no requirement on dynamic memory allocation would look significantly different than the one that exists today. Dynamic memory allocation is utilized pervasively in the parser, compiler and even during search.

The benefit of permitting dynamic memory allocation is that it makes the implementation and API simpler. This does make use of the regex crate in environments that don't have dynamic memory allocation impossible.

This author isn't aware of any regex library that can work without dynamic memory allocation.

With that said, regex may want to grow custom allocator support when the corresponding traits stabilize.

Synchronization is implicit

Every Regex value can be safely used from multiple threads simultaneously. Since a Regex has interior mutable state, this implies that it must do some kind of synchronization in order to be safe.

There are some reasons why we might want to do synchronization automatically:

  1. Regex exposes an immutable API. That is, from looking at its set of methods, none of them borrow the Regex mutably (or otherwise claim to mutate the Regex). This author claims that since there is no observable mutation of a Regex, it not being thread safe would violate the principle of least surprise.
  2. Often, a Regex should be compiled once and reused repeatedly in multiple searches. To facilitate this, lazy_static! can be used to guarantee that compilation happens exactly once. lazy_static! requires its types to be Sync. A user of Regex could work around this by wrapping a Regex in a Mutex, but this would make misuse too easy. For example, locking a Regex in one thread would prevent simultaneous searching in another thread.

Synchronization has overhead, although it is extremely small (and dwarfed by general matching overhead). The author has ad hoc benchmarked the regex implementation with GNU Grep, and per match overhead is comparable in single threaded use. It is this author's opinion, that it is good enough. If synchronization overhead across multiple threads is too much, callers may elect to clone the Regex so that each thread gets its own copy. Cloning a Regex is no more expensive than what would be done internally automatically, but it does eliminate contention.

An alternative is to increase the API surface and have types that are synchronized by default and types that aren't synchronized. This was discussed at length in this thread. My conclusion from this thread is that we either expand the surface of the API, or we break the current API or we keep implicit synchronization as-is. In this author's opinion, neither expanding the API or breaking the API is worth avoiding negligible synchronization overhead.

The implementation is complex

Regular expression engines have a lot of moving parts and it often requires quite a bit of context on how the whole library is organized in order to make significant contributions. Therefore, moving regex into rust-lang is a maintenance hazard. This author has tried to mitigate this hazard somewhat by doing the following:

  1. Offering to mentor contributions. Significant contributions have thus far fizzled, but minor contributions---even to complex code like the DFA---have been successful.
  2. Documenting not just the API, but the internals. The DFA is, for example, heavily documented.
  3. Wrote a HACKING.md guide that gives a sweeping overview of the design.
  4. Significant test and benchmark suites.

With that said, there is still a lot more that could be done to mitigate the maintenance hazard. In this author's opinion, the interaction between the three parts of the implementation (parsing, compilation, searching) is not documented clearly enough.

Alternatives

Big picture

The most important alternative is to decide not to bless a particular implementation of regular expressions. We might want to go this route for any number of reasons (see: Drawbacks). However, the regex crate is already widely used, which provides at least some evidence that some set of programmers find it good enough for general purpose regex searching.

The impact of not moving regex into rust-lang is, plainly, that Rust won't have an "officially blessed" regex implementation. Many programmers may appreciate the complexity of a regex implementation, and therefore might insist that one be officially maintained. However, to be honest, it isn't quite clear what would happen in practice. This author is speculating.

bytes::Regex

This RFC proposes stabilizing the bytes sub-module of the regex crate in its entirety. The bytes sub-module is a near clone of the API at the crate level with one important difference: it searches &[u8] instead of &str. This design was motivated by a similar split in std, but there are alternatives.

A regex trait

One alternative is designing a trait that looks something like this:

trait Regex {
  type Text: ?Sized;

  fn is_match(&self, text: &Self::Text) -> bool;
  fn find(&self, text: &Self::Text) -> Option<Match>;
  fn find_iter<'r, 't>(&'r self, text: &'t Self::Text) -> Matches<'r, 't, Self::Text>;
  // and so on
}

However, there are a couple problems with this approach. First and foremost, the use cases of such a trait aren't exactly clear. It does make writing generic code that searches either a &str or a &[u8] possible, but the semantics of searching &str (always valid UTF-8) or &[u8] are quite a bit different with respect to the original Regex. Secondly, the trait isn't obviously implementable by others. For example, some of the methods return iterator types such as Matches that are typically implemented with a lower level API that isn't exposed. This suggests that a straight-forward traitification of the current API probably isn't appropriate, and perhaps, a better trait needs to be more fundamental to regex searching.

Perhaps the strongest reason to not adopt this design for regex 1.0 is that we don't have any experience with it and there hasn't been any demand for it. In particular, it could be prototyped in another crate.

Reuse some types

In the current proposal, the bytes submodule completely duplicates the top-level API, including all iterator types, Captures and even the Replacer trait. We could parameterize many of those types over the type of the text searched. For example, the proposed Replacer trait looks like this:

trait Replacer {
    fn replace_append(&mut self, caps: &Captures, dst: &mut String);

    fn no_expansion<'r>(&'r mut self) -> Option<Cow<'r, str>> {
        None
    }
}

We might add an associated type like so:

trait Replacer {
    type Text: ToOwned + ?Sized;

    fn replace_append(
        &mut self,
        caps: &Captures<Self::Text>,
        dst: &mut <Self::Text as ToOwned>::Owned,
    );

    fn no_expansion<'r>(&'r mut self) -> Option<Cow<'r, Self::Text>> {
        None
    }
}

But parameterizing the Captures type is a little bit tricky. Namely, methods like get want to slice the text at match offsets, but this can't be done safely in generic code without introducing another public trait.

The final death knell in this idea is that these two implementations cannot co-exist:

impl<F> Replacer for F where F: FnMut(&Captures) -> String {
    type Text = str;

    fn replace_append(&mut self, caps: &Captures, dst: &mut String) {
        dst.push_str(&(*self)(caps));
    }
}

impl<F> Replacer for F where F: FnMut(&Captures) -> Vec<u8> {
    type Text = [u8];

    fn replace_append(&mut self, caps: &Captures, dst: &mut Vec<u8>) {
        dst.extend(&(*self)(caps));
    }
}

Perhaps there is a path through this using yet more types or more traits, but without a really strong motivating reason to find it, I'm not convinced it's worth it. Duplicating all of the types is unfortunate, but it's simple.

Unresolved questions

The regex repository has more than just the regex crate.

regex-syntax

This crate exposes a regular expression parser and abstract syntax that is completely divorced from compilation or searching. It is not part of regex proper since it may experience more frequent breaking changes and is far less frequently used. It is not clear whether this crate will ever see 1.0, and if it does, what criteria would be used to judge it suitable for 1.0. Nevertheless, it is a useful public API, but it is not part of this RFC.

regex-capi

Recently, regex-capi was built to provide a C API to this regex library. It has been used to build cgo bindings to this library for Go. Given its young age, it is not part of this proposal but will be maintained as a pre-1.0 crate in the same repository.

regex_macros

The regex! compiler plugin is a macro that can compile regular expressions when your Rust program compiles. Stated differently, regex!("...") is transformed into Rust code that executes a search of the given pattern directly. It was written two years ago and largely hasn't changed since. When it was first written, it had two major benefits:

  1. If there was a syntax error in your regex, your Rust program would not compile.
  2. It was faster.

Today, (1) can be simulated in practice with the use of a Clippy lint and (2) is no longer true. In fact, regex! is at least one order of magnitude slower than the standard Regex implementation.

The future of regex_macros is not clear. In one sense, since it is a compiler plugin, there hasn't been much interest in developing it further since its audience is necessarily limited. In another sense, it's not entirely clear what its implementation path is. It would take considerable work for it to beat the current Regex implementation (if it's even possible). More discussion on this is out of scope.

Dependencies

As of now, regex has several dependencies:

  • aho-corasick
  • memchr
  • thread_local
  • regex-syntax
  • utf8-ranges

All of them except for thread_local were written by this author, and were primarily motivated for use in the regex crate. They were split out because they seem generally useful.

There may be other things in regex (today or in the future) that may also be helpful to others outside the strict context of regex. Is it beneficial to split such things out and create a longer list of dependencies? Or should we keep regex as tight as possible?

Exposing more internals

It is conceivable that others might find interest in the regex compiler or more lower level access to the matching engines. We could do something similar to regex-syntax and expose some internals in a separate crate. However, there isn't a pressing desire to do this at the moment, and would probably require a good deal of work.

Breaking changes

This section of the RFC lists all breaking changes between regex 0.1 and the API proposed in this RFC.

  • find and find_iter now return values of type Match instead of (usize, usize). The Match type has start and end methods which can be used to recover the original offsets, as well as an as_str method to get the matched text.
  • The Captures type no longer has any iterators defined. Instead, callers should use the Regex::capture_names method.
  • bytes::Regex enables the Unicode flag by default. Previously, it disabled it by default. The flag can be disabled in the pattern with (?-u).
  • The definition of the Replacer trait was completely re-worked. Namely, its API inverts control of allocation so that the caller must provide a String to write to. Previous implementors will need to examine the new API. Moving to the new API should be straight-forward.
  • The is_empty method on Captures was removed since it always returns false (because every Captures has at least one capture group corresponding to the entire match).
  • The PartialEq and Eq impls on Regex were removed. If you need this functionality, add a newtype around Regex and write the corresponding PartialEq and Eq impls.
  • The lifetime parameters for the iter and iter_named methods on Captures were fixed. The corresponding iterator types, SubCaptures and SubCapturesNamed, grew an additional lifetime parameter.
  • The constructor, Regex::with_size_limit, was removed. It can be replaced with use of RegexBuilder.
  • The is_match free function was removed. Instead, compile a Regex explicitly and call the is_match method.
  • Many iterator types were renamed. (e.g., RegexSplits to SplitsIter.)
  • Replacements now return a Cow<str> instead of a String. Namely, the subject text doesn't need to be copied if there are no replacements. Callers may need to add into_owned() calls to convert the Cow<str> to a proper String.
  • The Error type no longer has the InvalidSet variant, since the error is no longer possible. Its Syntax variant was also modified to wrap a String instead of a regex_syntax::Error. If you need access to specific parse error information, use the regex-syntax crate directly.
  • To allow future growth, some character classes may no longer compile to make room for possibly adding class set notation in the future.
  • Various iterator types have been renamed.
  • The RegexBuilder type now takes an &mut self on most methods instead of self. Additionally, the final build step now uses build() instead of compile().