Add speculative Unicode word boundary support to the lazy DFA.
Hooray! The DFA will now try to interpret Unicode word boundaries as if
they were ASCII word boundaries. If the DFA comes across a non-ASCII byte,
then it will give up and fall back to the slower NFA simulation.
Nevertheless, this prevents us from degrading to very slow matching in a
large number of cases.

Thanks very much to @raphlinus who had the essential idea of
"speculative matching."
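
The speculate-then-fall-back idea described above can be sketched in isolation. This is only a toy under stated assumptions, not the code in this commit: `FastResult`, `fast_first_word_start`, `slow_first_word_start` and `first_word_start` are hypothetical names, the "fast path" stands in for the lazy DFA, the "slow path" stands in for the NFA simulation, and `char::is_alphanumeric` only approximates the regex crate's Unicode `\w`.

```rust
/// Result of the fast scan. `Quit` means the scan met input it cannot
/// handle correctly, so the caller should run the slow path instead.
#[derive(Debug, PartialEq)]
enum FastResult {
    Match(usize),
    NoMatch,
    Quit,
}

/// ASCII notion of a word byte: `[0-9A-Za-z_]`.
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Fast path (stand-in for the DFA): finds the byte offset where the first
/// word starts, speculating that every byte is ASCII. It quits as soon as
/// it sees a non-ASCII byte.
fn fast_first_word_start(haystack: &[u8]) -> FastResult {
    for (i, &b) in haystack.iter().enumerate() {
        if b >= 0x80 {
            return FastResult::Quit;
        }
        if is_ascii_word_byte(b) {
            return FastResult::Match(i);
        }
    }
    FastResult::NoMatch
}

/// Slow path (stand-in for the NFA simulation): decodes whole characters.
/// Assumption: `is_alphanumeric` is close enough to `\w` for illustration.
fn slow_first_word_start(haystack: &str) -> Option<usize> {
    haystack
        .char_indices()
        .find(|&(_, c)| c.is_alphanumeric() || c == '_')
        .map(|(i, _)| i)
}

/// Speculate with the fast path and fall back only when it quits.
fn first_word_start(haystack: &str) -> Option<usize> {
    match fast_first_word_start(haystack.as_bytes()) {
        FastResult::Match(i) => Some(i),
        FastResult::NoMatch => None,
        FastResult::Quit => slow_first_word_start(haystack),
    }
}
```

In this sketch, `first_word_start("  hello")` is answered entirely by the fast path, while `first_word_start("  éa")` makes the fast path quit at the first byte of `é` and is answered by the slow path.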
BurntSushi committed Apr 14, 2016
1 parent 8d81a54 commit 175761e
Showing 4 changed files with 64 additions and 32 deletions.
29 changes: 19 additions & 10 deletions PERFORMANCE.md
@@ -200,17 +200,19 @@ just by examing the first (or last) three bytes of the haystack.
**Advice**: Literals can reduce the work that the regex engine needs to do. Use
them if you can, especially as prefixes.

## Unicode word boundaries prevent the DFA from being used
## Unicode word boundaries may prevent the DFA from being used

It's a sad state of the current implementation. It's not clear when or if
Unicode word boundaries will be salvaged, but as it stands right now, using
them automatically disqualifies use of the DFA, which can mean an order of
magnitude slowdown in search time. There are two ways to ameliorate this:
It's a sad state of the current implementation. At the moment, the DFA will try
to interpret Unicode word boundaries as if they were ASCII word boundaries.
If the DFA comes across any non-ASCII byte, it will quit and fall back to an
alternative matching engine that can handle Unicode word boundaries correctly.
The alternative matching engine is generally quite a bit slower (perhaps by an
order of magnitude). If necessary, this can be ameliorated in two ways.

The first way is to add some number of literal prefixes to your regular
expression. Even though the DFA won't be used, specialized routines will still
kick in to find prefix literals quickly, which limits how much work the NFA
simulation will need to do.
expression. Even though the DFA may not be used, specialized routines will
still kick in to find prefix literals quickly, which limits how much work the
NFA simulation will need to do.

The second way is to give up on Unicode and use an ASCII word boundary instead.
One can use an ASCII word boundary by disabling Unicode support. That is,
@@ -221,11 +223,18 @@ to a syntax error if the regex could match arbitrary bytes. For example, if one
wrote `(?-u)\b.+\b`, then a syntax error would be returned because `.` matches
any *byte* when the Unicode flag is disabled.

The second way isn't appreciably different from just using a Unicode word
boundary in the first place, since the DFA will speculatively interpret it as
an ASCII word boundary anyway. The key difference is that if an ASCII word
boundary is used explicitly, then the DFA won't quit in the presence of
non-ASCII UTF-8 bytes. This results in giving up correctness in exchange for
more consistent performance.
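
To make the correctness trade-off concrete, here is a minimal standard-library sketch of the two boundary definitions. It does not use the regex crate; `ascii_boundary` and `unicode_boundary` are hypothetical helpers, and `char::is_alphanumeric` is assumed to be a reasonable approximation of Unicode `\w`.

```rust
/// ASCII notion of a word byte: `[0-9A-Za-z_]`.
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// ASCII word boundary at byte offset `at`: exactly one side is a word byte.
fn ascii_boundary(text: &[u8], at: usize) -> bool {
    let before = at > 0 && is_ascii_word_byte(text[at - 1]);
    let after = at < text.len() && is_ascii_word_byte(text[at]);
    before != after
}

/// Approximation of a Unicode word character (assumption, not the crate's table).
fn is_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Unicode word boundary at byte offset `at`, which must lie on a
/// character boundary (slicing panics otherwise).
fn unicode_boundary(text: &str, at: usize) -> bool {
    let before = text[..at].chars().next_back().map_or(false, is_word_char);
    let after = text[at..].chars().next().map_or(false, is_word_char);
    before != after
}
```

For the haystack `café`, offset 3 sits between `f` and `é`. The ASCII definition reports a boundary there because the first byte of `é` is not an ASCII word byte, while the Unicode definition does not. On pure ASCII input such as `cafe`, the two definitions agree at every offset.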

N.B. When using `bytes::Regex`, Unicode support is disabled by default, so one
can simply write `\b` to get an ASCII word boundary.

**Advice**: Use `(?-u:\b)` instead of `\b` if you care about performance more
than correctness.
**Advice**: In most cases, `\b` should work well. If not, use `(?-u:\b)`
instead of `\b` if you care about consistent performance more than correctness.

## Excessive counting can lead to exponential state blow up in the DFA

2 changes: 2 additions & 0 deletions src/compile.rs
@@ -269,10 +269,12 @@ impl Compiler {
self.c_empty_look(prog::EmptyLook::EndText)
}
WordBoundary => {
self.compiled.has_unicode_word_boundary = true;
self.byte_classes.set_word_boundary();
self.c_empty_look(prog::EmptyLook::WordBoundary)
}
NotWordBoundary => {
self.compiled.has_unicode_word_boundary = true;
self.byte_classes.set_word_boundary();
self.c_empty_look(prog::EmptyLook::NotWordBoundary)
}
62 changes: 40 additions & 22 deletions src/dfa.rs
@@ -73,7 +73,6 @@ const CACHE_LIMIT: usize = 2 * (1<<20);
/// of tracking multi-byte assertions in the DFA.
pub fn can_exec(insts: &Program) -> bool {
use prog::Inst::*;
use prog::EmptyLook::*;
// If for some reason we manage to allocate a regex program with more
// than STATE_MAX instructions, then we can't execute the DFA because we
// use 32 bit pointers with some of the bits reserved for special use.
@@ -83,14 +82,7 @@ pub fn can_exec(insts: &Program) -> bool {
for inst in insts {
match *inst {
Char(_) | Ranges(_) => return false,
EmptyLook(ref inst) => {
match inst.look {
WordBoundary | NotWordBoundary => return false,
WordBoundaryAscii | NotWordBoundaryAscii => {}
StartLine | EndLine | StartText | EndText => {}
}
}
Match(_) | Save(_) | Split(_) | Bytes(_) => {}
EmptyLook(_) | Match(_) | Save(_) | Split(_) | Bytes(_) => {}
}
}
true
@@ -296,17 +288,22 @@ const STATE_UNKNOWN: StatePtr = 1<<31;
/// once it is entered, no match can ever occur.
const STATE_DEAD: StatePtr = 1<<30;

/// A quit state means that the DFA came across some input that it doesn't
/// know how to process correctly. The DFA should quit and another matching
/// engine should be run in its place.
const STATE_QUIT: StatePtr = 1<<29;

/// A start state is a state that the DFA can start in.
///
/// Note that unlike unknown and dead states, start states have their lower
/// bits set to a state pointer.
const STATE_START: StatePtr = 1<<29;
const STATE_START: StatePtr = 1<<28;

/// A match state means that the regex has successfully matched.
///
/// Note that unlike unknown and dead states, match states have their lower
/// bits set to a state pointer.
const STATE_MATCH: StatePtr = 1<<28;
const STATE_MATCH: StatePtr = 1<<27;

/// The maximum state pointer.
const STATE_MAX: StatePtr = STATE_MATCH - 1;
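
The flag layout after this change can be exercised on its own. The sketch below copies the constant values from the new side of the diff; the `Kind` enum and `classify` function are hypothetical helpers added only for illustration, and the order in which the real search loop tests these bits may differ.

```rust
type StatePtr = u32;

// Sentinel states: these carry no state pointer in their lower bits.
const STATE_UNKNOWN: StatePtr = 1 << 31;
const STATE_DEAD: StatePtr = 1 << 30;
const STATE_QUIT: StatePtr = 1 << 29;
// Flags that are OR-ed onto a real state pointer.
const STATE_START: StatePtr = 1 << 28;
const STATE_MATCH: StatePtr = 1 << 27;
const STATE_MAX: StatePtr = STATE_MATCH - 1;

/// Hypothetical decoded view of a state pointer.
#[derive(Debug, PartialEq)]
enum Kind {
    Unknown,
    Dead,
    Quit,
    Match(StatePtr),
    Start(StatePtr),
    Plain(StatePtr),
}

/// Decodes a state pointer. A single `>= STATE_QUIT` comparison separates
/// the three sentinels from every pointer-carrying value, because the
/// largest flagged pointer, `STATE_START | STATE_MATCH | STATE_MAX`, is
/// still below `1 << 29`.
fn classify(si: StatePtr) -> Kind {
    if si >= STATE_QUIT {
        if si & STATE_QUIT > 0 {
            Kind::Quit
        } else if si & STATE_DEAD > 0 {
            Kind::Dead
        } else {
            Kind::Unknown
        }
    } else if si & STATE_MATCH > 0 {
        Kind::Match(si & STATE_MAX)
    } else if si & STATE_START > 0 {
        Kind::Start(si & STATE_MAX)
    } else {
        Kind::Plain(si)
    }
}
```

Moving `STATE_START` and `STATE_MATCH` down by one bit is what frees `1 << 29` for the quit sentinel, at the cost of halving `STATE_MAX`.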
@@ -591,7 +588,10 @@ impl<'a> Fsm<'a> {
None => return Result::NoMatch,
Some(i) => i,
};
} else if next_si >= STATE_DEAD {
} else if next_si >= STATE_QUIT {
if next_si & STATE_QUIT > 0 {
return Result::Quit;
}
// Finally, this corresponds to the case where the transition
// entered a state that can never lead to a match or a state
// that hasn't been computed yet. The latter being the "slow"
@@ -697,7 +697,10 @@ impl<'a> Fsm<'a> {
if self.at < cur {
result = Result::Match(self.at + 2);
}
} else if next_si >= STATE_DEAD {
} else if next_si >= STATE_QUIT {
if next_si & STATE_QUIT > 0 {
return Result::Quit;
}
let byte = Byte::byte(text[self.at]);
prev_si &= STATE_MAX;
next_si = match self.next_state(qcur, qnext, prev_si, byte) {
@@ -986,10 +989,15 @@ impl<'a> Fsm<'a> {
NotWordBoundaryAscii if flags.not_word_boundary => {
self.cache.stack.push(inst.goto as InstPtr);
}
WordBoundary if flags.word_boundary => {
self.cache.stack.push(inst.goto as InstPtr);
}
NotWordBoundary if flags.not_word_boundary => {
self.cache.stack.push(inst.goto as InstPtr);
}
StartLine | EndLine | StartText | EndText => {}
WordBoundaryAscii | NotWordBoundaryAscii => {}
// The DFA doesn't support Unicode word boundaries. :-(
WordBoundary | NotWordBoundary => unreachable!(),
WordBoundary | NotWordBoundary => {}
}
}
Save(ref inst) => self.cache.stack.push(inst.goto as InstPtr),
@@ -1057,7 +1065,12 @@ impl<'a> Fsm<'a> {

// OK, now there's enough room to push our new state.
// We do this even if the cache size is set to 0!
let trans = Transitions::new(self.num_byte_classes());
let mut trans = Transitions::new(self.num_byte_classes());
if self.prog.has_unicode_word_boundary {
for b in 128..256 {
trans[self.byte_class(Byte::byte(b as u8))] = STATE_QUIT;
}
}
let si = usize_to_u32(self.cache.states.len());
self.cache.states.push(State {
insts: key.insts.clone(),
@@ -1120,15 +1133,14 @@ impl<'a> Fsm<'a> {
state_flags.set_empty();
insts.push(ip);
}
WordBoundaryAscii => {
WordBoundary | WordBoundaryAscii => {
state_flags.set_empty();
insts.push(ip);
}
NotWordBoundaryAscii => {
NotWordBoundary | NotWordBoundaryAscii => {
state_flags.set_empty();
insts.push(ip);
}
WordBoundary | NotWordBoundary => unreachable!(),
}
}
Match(_) => {
@@ -1226,7 +1238,12 @@ impl<'a> Fsm<'a> {
return si;
}
let si = usize_to_u32(self.cache.states.len());
let trans = Transitions::new(self.num_byte_classes());
let mut trans = Transitions::new(self.num_byte_classes());
if self.prog.has_unicode_word_boundary {
for b in 128..256 {
trans[self.byte_class(Byte::byte(b as u8))] = STATE_QUIT;
}
}
self.cache.states.push(state);
self.cache.trans.push(trans);
self.cache.compiled.insert(key, si);
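
The quit-seeding step that appears in both state-creation paths can be sketched as a standalone function. Everything here is a simplification under stated assumptions: the row is a plain `Vec` instead of the `Transitions` type, fresh entries are assumed to start as `STATE_UNKNOWN`, `new_transitions` and `toy_byte_classes` are hypothetical names, and no byte class is assumed to mix ASCII and non-ASCII bytes.

```rust
type StatePtr = u32;

const STATE_UNKNOWN: StatePtr = 1 << 31;
const STATE_QUIT: StatePtr = 1 << 29;

/// Builds a fresh transition row with one slot per byte class. When the
/// program contains a Unicode word boundary, every class reachable from a
/// non-ASCII byte (128..=255) is pre-filled with the quit sentinel, so the
/// search bails out before it ever computes a state for such a byte.
fn new_transitions(
    byte_classes: &[u8; 256],
    num_classes: usize,
    has_unicode_word_boundary: bool,
) -> Vec<StatePtr> {
    let mut trans = vec![STATE_UNKNOWN; num_classes];
    if has_unicode_word_boundary {
        for b in 128..256 {
            trans[byte_classes[b] as usize] = STATE_QUIT;
        }
    }
    trans
}

/// Toy byte-class map (assumption): class 0 = ASCII non-word bytes,
/// class 1 = ASCII word bytes, class 2 = all non-ASCII bytes.
fn toy_byte_classes() -> [u8; 256] {
    let mut classes = [0u8; 256];
    for b in 0..256usize {
        classes[b] = if b >= 128 {
            2
        } else if (b as u8).is_ascii_alphanumeric() || b == b'_' as usize {
            1
        } else {
            0
        };
    }
    classes
}
```

Because the sentinel lives in the transition table itself, the hot loop needs no extra per-byte check for non-ASCII input; it only notices the quit state when a lookup returns it.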
@@ -1257,8 +1274,9 @@ impl<'a> Fsm<'a> {
}
match self.cache.trans[si as usize][self.byte_class(b)] {
STATE_UNKNOWN => self.exec_byte(qcur, qnext, si, b),
STATE_DEAD => return Some(STATE_DEAD),
nsi => return Some(nsi),
STATE_QUIT => None,
STATE_DEAD => Some(STATE_DEAD),
nsi => Some(nsi),
}
}

3 changes: 3 additions & 0 deletions src/prog.rs
@@ -52,6 +52,8 @@ pub struct Program {
pub is_anchored_start: bool,
/// Whether the regex must match at the end of the input.
pub is_anchored_end: bool,
/// Whether this program contains a Unicode word boundary instruction.
pub has_unicode_word_boundary: bool,
/// A possibly empty machine for very quickly matching prefix literals.
pub prefixes: LiteralSearcher,
}
@@ -73,6 +75,7 @@ impl Program {
is_reverse: false,
is_anchored_start: false,
is_anchored_end: false,
has_unicode_word_boundary: false,
prefixes: LiteralSearcher::empty(),
}
}