plan: moving regex engines to regex-automata

In discussions in #524, it became clearer to me that it might be good to say more about what my short/long term plans are for the `regex` crate.

TL;DR - Push all regex engines down into the [`regex-automata`](https://github.com/BurntSushi/regex-automata) crate, and bring that in as an internal-but-published dependency of `regex` as an official part of this repository. It will become a second required dependency (with the first being `regex-syntax`).

The main motivation for doing this is to clean up the regex internals, specifically with the intent to make it easier to add more optimizations. Currently, the infrastructure is a bit of a hodge podge and it can be quite difficult to know where and when a particular regex engine is being used. On top of that, since every regex engine is strictly internal, testing each engine independently is quite annoying, as it requires exporting undocumented APIs. As a result, bugs have cropped up where certain engines are inconsistent with other engines. Similarly, benchmarking each engine independent of the other is made difficult, which in turn makes it difficult to focus on optimizing certain scenarios that come up in the course of regex execution.

On top of all of this, putting the regex engines into their own crate with a proper API means that folks can use those engines directly. It also provides an outlet of sorts to provide lower level APIs that would I otherwise wouldn't want cluttering up the main `regex` crate too much. The main thing on my mind here are streaming APIs (see #425), but there may be other things of use.

I've actually been working towards this goal for a long time now. Arguably, i started around 3 years ago when I began rewriting the `regex-syntax` crate. I then moved to rewriting `memchr` and `aho-corasick` as well. In the process, I pushed complexity down as much as I could. For example, the HIR in `regex-syntax` no longer exposes any regex flags (such as case insensitivity), which makes regex compilers simpler. As another example, `aho-corasick` now supports leftmost-first matching, which matches the regex crate's match semantics precisely, and now makes it easier to defer to `aho-corasick` for an alternation of literals. Yet another example is the Teddy algorithm that uses SIMD for finding matches of multiple literals very quickly. That has been pushed down into `aho-corasick` as well, which means more people get to benefit from it and it no longer needs to be maintained inside of `regex` proper.

A little over a year ago, I started work on [`regex-automata`](https://github.com/BurntSushi/regex-automata). The main use case for `regex-automata` was to generate fully compiled DFAs that could be cheaply embedded into a program and deserialized cheaply for easy matching. (For example, they power the grapheme, word and sentence iterators in the [`bstr` crate](https://github.com/BurntSushi/bstr).)

In the course of building `regex-automata`, I had to write another HIR -> NFA compiler, with the intent that this would one day become the compiler inside the `regex` crate. It has many improvements over the existing one. Firstly, it is much simpler by doing away with the "instruction hole" mechanism used in the current compiler. Secondly, it is also simpler in that it does away with dichotomy between a Unicode NFA and a byte-based NFA, which has been the cause of many bugs and increases the complexity of the matching engines. Thirdly, by getting rid of the Unicode/bytes dichotomy in favor of just using bytes, we will be able to compile one fewer NFAs than what the `regex` crate currently does (one Unicode NFA for the Pike VM and the backtracker, 1 forwards byte NFA and 1 backwards byte NFA for the lazy DFA). Fourthly, it is more aggressive about epsilon removal and does away with the binary "alt" instructions in favor of flattening the byte code, which should decrease overhead substantially when matching. Finally, it use's Daciuk's algorithm (the same one used in the `fst` crate) to compiler nearly minimal UTF-8 automata in linear time, which _substantially_ reduces the size of the corresponding NFAs and DFAs, while also making them faster to execute by virtue of having fewer epsilon transitions.

`regex-automata` also contains new test infrastructure, and most of the tests in the `regex` crate have already been ported to it. This test infrastructure should make it much easier to add new tests and also much faster to run while testing more engines. Today's infrastructure is a terrible macro soup, and just building it takes forever because every test is compiled for every regex engine that we test. The new test infrastructure instead loads tests from TOML files and runs them using its own harness. The downside is that this pushes more things outside of Rust's standard unit test harness, but it's well worth it.

As of a couple months ago, I've begun preparing `regex-automata` to absorb the matching engines from `regex` proper. This will include the lazy DFA, the Pike VM and the backtracker. `regex-automata` also provides its own ahead-of-time-compiled DFAs, which I hope to use as yet another engine for cases where the regex is small and full DFA compilation is cheap. This organization should drastically decrease the complexity of adding new regex engines. I hope this will allow me to finally bring in things like one-pass DFA (see #467), as well as some other ideas I've had for years, such as a bit-parallel NFA and even a JIT powered by [Cranelift](https://github.com/bytecodealliance/wasmtime/tree/master/cranelift), basically with the goal of fixing one of the weakest parts of the `regex` crate right now: the performance of matching with capturing groups. (N.B. If a JIT were added, it would be an optional _opt-in_ feature.)

Once `regex-automata` has absorbed the regex engines inside of `regex`, my plan at that point is to move `regex-automata` into this repository and start integrating it into `regex` proper. At that point, the main responsibility of `regex` will be to glue everything together, plan the appropriate optimizations, including literal oriented optimizations and provide the higher level API that it does today. At this point in time, I do not plan on making any breaking changes to `regex` during all of this.

As a stretch goal, one of the things I'd like to do is to bring the cheap deserialization of regex-automata's DFAs to the `regex` crate proper. This would permit users to generate pre-compiled `Regex` objects that can be embedded into any program. This would allow one to completely avoid paying for compilation time. This does require a lot of work though, so I don't know if this will happen initially. (Doing this basically implies that every single representation of a finite state machine, including all of the literal prefilters, need to be able to work directly from a more primitive representation such as a `&[u8]`.)

Once this Great Regex Engine Refactor is complete, the result will have been a complete rewrite of the `regex` crate.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

plan: moving regex engines to regex-automata #656

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

plan: moving regex engines to regex-automata #656

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions