From https://lwn.net/Articles/589009/
The handling of more complex alternations is a known (relatively) weak point of jrep (more precisely of rejit) in need of improvement. Grep uses a smart Boyer-Moore algorithm. To look for aaa|bbb|ccc at position p, it looks up the character at p + 2, and if it is not a, b, or c, knows it can jump three characters ahead to p + 3 (and then look at the character at p + 5).
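To make the skipping idea concrete, here is a rough sketch of a set variant of Horspool's simplification of Boyer-Moore. It is not what grep actually does internally, and the name `find_any_skip` is made up for illustration. The shift table is built from the first `m` bytes of every literal, where `m` is the length of the shortest one.

```rust
/// Leftmost position at which any of `literals` occurs in `haystack`.
/// Hypothetical helper: a set-Horspool sketch, not grep's implementation.
fn find_any_skip(haystack: &str, literals: &[&str]) -> Option<usize> {
    let hay = haystack.as_bytes();
    // Window length is the shortest literal; no literals means no match.
    let m = literals.iter().map(|l| l.len()).min()?;
    if m == 0 {
        return Some(0); // an empty literal matches immediately
    }
    // shift[b]: how far the window may move when its last byte is `b`.
    // Bytes that appear in no literal allow a full jump of `m`.
    let mut shift = [m; 256];
    for lit in literals {
        let bytes = lit.as_bytes();
        for (i, &b) in bytes[..m - 1].iter().enumerate() {
            let s = m - 1 - i;
            if s < shift[b as usize] {
                shift[b as usize] = s;
            }
        }
    }
    let mut p = 0;
    while p + m <= hay.len() {
        // Verify every literal at the current window position.
        if literals.iter().any(|lit| hay[p..].starts_with(lit.as_bytes())) {
            return Some(p);
        }
        // Look at the last byte of the window (p + 2 when m == 3) and skip.
        p += shift[hay[p + m - 1] as usize];
    }
    None
}
```

For aaa|bbb|ccc, any byte other than a, b, or c in the last window slot moves the window three bytes ahead, which is the jump described above.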
On the other hand, like for single strings, rejit handles alternations simply: it applies brute force. But it does so relatively efficiently, so the performance is still good. To search for aaa|bbb|ccc at some position p in the text, rejit performs operations like:
    loop:
      find 'aaa' at position p
      if found goto match
      find 'bbb' at position p
      if found goto match
      find 'ccc' at position p
      if found goto match
      increment position and goto loop
    match:
The complexity is proportional to the number of alternations. Worse, when the number of alternated expressions exceeds a threshold (i.e. when the compiler cannot allocate a register per alternated expression), rejit falls back to some slow default code. This is what happens for the two regexps with eight or more alternated strings. The code generation should be fixed to allow an arbitrary number of alternated strings.
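For comparison, the brute-force strategy described in the quote might look roughly like this in Rust. This is only a sketch of the idea, not rejit's generated code, and `find_alternation_naive` is a hypothetical name.

```rust
/// Brute force: at each position, try every literal in order.
/// Returns (position, index of the literal that matched).
fn find_alternation_naive(haystack: &str, literals: &[&str]) -> Option<(usize, usize)> {
    let hay = haystack.as_bytes();
    for p in 0..=hay.len() {
        for (i, lit) in literals.iter().enumerate() {
            if hay[p..].starts_with(lit.as_bytes()) {
                return Some((p, i));
            }
        }
    }
    None
}
```

The inner loop is where the cost proportional to the number of alternated strings comes from: every position pays for every literal.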
In other words, optimizing an alternation of literals is a way to bypass the regex machinery and degrade to a simple substring search.
In addition to being a common case to optimize, it should also give a small bump to the regex-dna benchmark because one of the regexes is just an alternation of literals: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=regexdna&lang=rust&id=1. The rest contain character classes, which complicates things somewhat.
The easy part of this optimization is the actual searching of literal strings and jumping ahead in the input (there is precedent for this already in the code with literal prefixes). The harder part, I think, is analyzing the regex to find where the optimization can be applied. The issue is that an alternation is compiled to a series of split and jump instructions, so the literal structure is no longer obvious in the compiled program. It is easiest to discover the opportunity to optimize by analyzing the AST of the regex, but there will need to be a way to carry that information through to the VM.
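To illustrate why the structure is hard to see after compilation, here is a sketch of how an alternation of literals might be lowered to split and jump instructions. The `Inst` enum and `compile_alternation` are assumptions in the style of a Pike VM, not this crate's actual instruction set.

```rust
/// Hypothetical instruction set, in the style of a Pike VM.
#[derive(Debug, Clone, PartialEq)]
enum Inst {
    Byte(u8),            // match one byte, then go to pc + 1
    Split(usize, usize), // try the first target, then the second
    Jump(usize),         // unconditional jump
    Match,
}

/// Compile `l1|l2|...|ln` into a chain of Split/Jump instructions.
fn compile_alternation(literals: &[&str]) -> Vec<Inst> {
    let mut prog = Vec::new();
    let mut jumps = Vec::new(); // indices of Jump instructions to patch later
    for (i, lit) in literals.iter().enumerate() {
        let last = i + 1 == literals.len();
        if !last {
            // Either run this branch (pc + 1) or skip past its bytes and its Jump.
            let start = prog.len();
            prog.push(Inst::Split(start + 1, start + 1 + lit.len() + 1));
        }
        prog.extend(lit.bytes().map(Inst::Byte));
        if !last {
            jumps.push(prog.len());
            prog.push(Inst::Jump(0)); // placeholder target, patched below
        }
    }
    let end = prog.len();
    for j in jumps {
        prog[j] = Inst::Jump(end);
    }
    prog.push(Inst::Match);
    prog
}
```

Under these assumptions, a|b|c becomes `Split(1, 3), Byte(a), Jump(7), Split(4, 6), Byte(b), Jump(7), Byte(c), Match`. Recovering "this is just three literals" from that listing is possible but awkward, which is why the analysis is easier on the AST.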
One approach might be to tag pieces of the syntax with possible optimizations (this is hopefully the first of many). Then when the AST is compiled to instructions, that information can be stored and indexed by the current program counter. The VM can then ask, "Do there exist any optimizations for this PC?" The rest is gravy.
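A minimal sketch of what "indexed by the current program counter" could look like. Every name here (`Optimization`, `OptTable`, `skip_to`) is hypothetical, and the literal search is the simplest thing that works rather than a tuned one.

```rust
use std::collections::HashMap;

/// Hypothetical set of optimizations the compiler can attach to a PC.
#[derive(Debug, Clone, PartialEq)]
enum Optimization {
    LiteralAlternation(Vec<String>),
}

/// Side table filled in during compilation and consulted by the VM.
#[derive(Default)]
struct OptTable {
    by_pc: HashMap<usize, Optimization>,
}

impl OptTable {
    fn record(&mut self, pc: usize, opt: Optimization) {
        self.by_pc.insert(pc, opt);
    }

    /// "Do there exist any optimizations for this PC?"
    fn at(&self, pc: usize) -> Option<&Optimization> {
        self.by_pc.get(&pc)
    }
}

/// Returns the next candidate position if an optimization is registered for
/// `pc` and one of its literals occurs at or after `pos`. Returns None
/// otherwise, in which case the caller runs the normal instructions.
fn skip_to(table: &OptTable, pc: usize, haystack: &str, pos: usize) -> Option<usize> {
    match table.at(pc)? {
        Optimization::LiteralAlternation(lits) => {
            let rest = haystack.get(pos..)?;
            lits.iter()
                .filter_map(|l| rest.find(l.as_str()))
                .min()
                .map(|off| pos + off)
        }
    }
}
```

A hash map keeps the sketch short. If most PCs carried an optimization, a `Vec<Option<Optimization>>` parallel to the instruction list would avoid hashing on the hot path.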
N.B. This only works for a regex that is of the form a|b|c|... It might be possible to generalize this to other cases, but it seems tricky.
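Detecting that restricted form on the AST could be as small as the following sketch. The `Ast` type is a simplified stand-in made up for illustration, not the crate's real syntax tree.

```rust
/// Simplified, hypothetical AST; the real one has many more variants.
#[derive(Debug, Clone, PartialEq)]
enum Ast {
    Literal(String),
    Class(Vec<char>),
    Concat(Vec<Ast>),
    Alternate(Vec<Ast>),
}

/// If the whole regex is `a|b|c|...` with only literal branches, return the
/// literals. Any other shape (classes, nesting, concatenation) returns None.
fn literal_alternation(ast: &Ast) -> Option<Vec<String>> {
    match ast {
        Ast::Alternate(branches) => branches
            .iter()
            .map(|branch| match branch {
                Ast::Literal(s) => Some(s.clone()),
                _ => None,
            })
            .collect(), // a single None branch makes the whole result None
        _ => None,
    }
}
```

Anything with a character class in a branch is rejected here, which matches the caveat above: only the pure-literal regex qualifies.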