Add a onepass DFA.
This patch adds a onepass matcher, which is a DFA that
has all the abilities of an NFA. There are many
expressions that a onepass matcher can't handle, namely
those where the regex contains non-determinism.

The general approach we take is as follows:

    1. Check if a regex is onepass using `src/analysis.rs::is_onepass`.
    2. Compile a new regex program using the compiler with the bytes
        flag set.
    3. Compile a onepass DFA from the program produced in step 2. We
        will roughly map each instruction to a state in the DFA, though
        instructions like `split` don't get states.
        a. Make a new transition table for the first instruction.
        b. For each child of the first instruction:
            - If it is a bytes instruction, add a transition to
                the table for every byte class in the instruction.
            - If it is an instruction which consumes zero input
                (like `EmptyLook` or `Save`), emit a job to a DAG asking to
                forward the first instruction state to the state for
                the non-consuming instruction.
            - Push the child instruction to a queue of instructions to
                process.
        c. Peel off an instruction from the queue and go back to
            step a, processing the instruction as if it was the
            first instruction. If the queue is empty, continue with
            step d.
        d. Topologically sort the forwarding jobs, and shuffle
            the transitions from the forwarding targets to the
            forwarding sources in topological order.
        e. Bake the intermediary transition tables down into a single
            flat vector. States which require some action (`EmptyLook`
            and `Save`) get an extra entry in the baked transition table
            that contains metadata instructing them on how to perform
            their actions.
    4. Wait for the user to give us some input.
    5. Execute the DFA:
        - The inner loop is basically:
            while at < text.len():
                state_ptr = baked_table[state_ptr + text[at]]
                at += 1
        - There is a lot of window dressing to handle special states.
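
The inner loop in step 5 can be sketched in Rust as follows. This is only
an illustration under simplifying assumptions, not the code in this patch:
`FlatDfa`, `STRIDE`, `DEAD`, and `ab_star_c` are hypothetical names, each
state gets one slot per byte rather than per byte class, and the special
`EmptyLook` and `Save` states are left out.

```rust
/// Number of table slots per state: one per possible input byte.
const STRIDE: usize = 256;
/// Sentinel pointer meaning "no transition on this byte".
const DEAD: usize = usize::MAX;

/// A toy flat-table DFA (hypothetical). `table[state_ptr + byte]` holds
/// the pointer, i.e. the offset into `table`, of the next state.
struct FlatDfa {
    table: Vec<usize>,
    accepting: Vec<bool>, // indexed by state_ptr / STRIDE
}

impl FlatDfa {
    fn new(n_states: usize) -> FlatDfa {
        FlatDfa {
            table: vec![DEAD; n_states * STRIDE],
            accepting: vec![false; n_states],
        }
    }

    fn add_transition(&mut self, from: usize, byte: u8, to: usize) {
        self.table[from * STRIDE + byte as usize] = to * STRIDE;
    }

    /// Anchored match against the whole input.
    fn is_match(&self, text: &[u8]) -> bool {
        let mut state_ptr = 0;
        let mut at = 0;
        while at < text.len() {
            // One table lookup per input byte.
            state_ptr = self.table[state_ptr + text[at] as usize];
            if state_ptr == DEAD {
                return false;
            }
            at += 1;
        }
        self.accepting[state_ptr / STRIDE]
    }
}

/// Hand-built DFA for the regex `ab*c`.
fn ab_star_c() -> FlatDfa {
    let mut dfa = FlatDfa::new(3);
    dfa.add_transition(0, b'a', 1);
    dfa.add_transition(1, b'b', 1);
    dfa.add_transition(1, b'c', 2);
    dfa.accepting[2] = true;
    dfa
}

fn main() {
    let dfa = ab_star_c();
    assert!(dfa.is_match(b"abbbc"));
    assert!(!dfa.is_match(b"abb"));
}
```

Storing the next state as an offset into the flat vector, rather than as a
state index, keeps the loop body to a single add and a single load.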

The idea of a onepass matcher comes from Russ Cox and
his RE2 library. I haven't been as good about reading
the RE2 source as I should have, but I've gotten the
impression that the RE2 onepass matcher is more in the
spirit of an NFA simulation without threads than a DFA.
Ethan Pailes committed Jul 10, 2018
1 parent 87cfe7e commit cc907b8
Showing 10 changed files with 1,488 additions and 71 deletions.
5 changes: 5 additions & 0 deletions Cargo.toml
@@ -114,6 +114,11 @@ name = "backtrack-bytes"
path = "tests/test_crates_regex.rs"
name = "crates-regex"

# Run the test suite on the onepass engine.
[[test]]
path = "tests/test_onepass.rs"
name = "onepass"

[profile.release]
debug = true

130 changes: 77 additions & 53 deletions src/analysis.rs
@@ -50,20 +50,15 @@ impl IsOnePassVisitor {
let mut empty_run = vec![];

for e in NestedConcat::new(es) {
// TODO(ethan):yakshaving factor the determination of when
// a regex accepts_empty out into a separate function,
// so that we don't compute the whole first set when we
// don't need to.
let fset = fset_of(e);
let is_rep = match e.kind() {
&HirKind::Repetition(_) => true,
_ => false,
};

empty_run.push(e);
if !(fset.accepts_empty || is_rep) {
// this is the last one in the run
break;
if !(accepts_empty(e) || is_rep) {
self.0 = self.0 && !fsets_clash(&empty_run);
empty_run.clear();
}
}

@@ -76,7 +71,7 @@ impl IsOnePassVisitor {
self.0 = self.0 && !fsets_clash(&es.iter().collect::<Vec<_>>());
}

// Unicode classes are really big alternatives from the byte
// Unicode classes are really just big alternatives from the byte
// oriented point of view.
//
// This function translates a unicode class into the
@@ -99,7 +94,7 @@ impl IsOnePassVisitor {
}
}
}
_ => {} // FALLTHROUGH
_ => {}
}
}

@@ -115,16 +110,6 @@ fn fsets_clash(es: &[&Hir]) -> bool {
let mut fset = fset_of(e1);
let fset2 = fset_of(e2);

// For the regex /a|()+/, we don't have a way to
// differentiate the branches, so we are not onepass.
//
// We might be able to loosen this restriction by
// considering the expression after the alternative
// if there is one.
if fset.is_empty() || fset2.is_empty() {
return true;
}

fset.intersect(&fset2);
if ! fset.is_empty() {
return true;
@@ -138,14 +123,14 @@ fn fsets_clash(es: &[&Hir]) -> bool {

/// Compute the first set of a given regular expression.
///
/// The first set of a regular expression is the set of all characters
/// The first set of a regular expression is the set of all bytes
/// which might begin it. This is a less general version of the
/// notion of a regular expression preview (the first set can be
/// thought of as the 1-preview of a regular expression).
///
/// Note that first sets are byte-oriented because the DFA is
/// byte oriented. This means an expression like /Δ|δ/ is actually not
/// one-pass, even though there is clearly no non-determinism inherent
/// onepass, even though there is clearly no non-determinism inherent
/// to the regex at a unicode code point level (big delta and little
/// delta start with the same byte).
fn fset_of(expr: &Hir) -> FirstSet {
@@ -155,7 +140,9 @@ fn fset_of(expr: &Hir) -> FirstSet {
f
}

match expr.kind() {
// First compute the set of characters that might begin
// the expression (ignoring epsilon for now).
let mut f_char_set = match expr.kind() {
&HirKind::Empty => FirstSet::epsilon(),
&HirKind::Literal(ref lit) => {
match lit {
@@ -191,29 +178,13 @@ fn fset_of(expr: &Hir) -> FirstSet {
// that such an emptylook could potentially match on any character.
&HirKind::Anchor(_) | &HirKind::WordBoundary(_) => FirstSet::anychar(),

&HirKind::Repetition(ref rep) => {
let mut f = fset_of(&*rep.hir);
match rep.kind {
RepetitionKind::ZeroOrOne => f.accepts_empty = true,
RepetitionKind::ZeroOrMore => f.accepts_empty = true,
RepetitionKind::OneOrMore => {},
RepetitionKind::Range(ref range) => {
match range {
&RepetitionRange::Exactly(0)
| &RepetitionRange::AtLeast(0)
| &RepetitionRange::Bounded(0, _) =>
f.accepts_empty = true,
_ => {}
}
}
}
f
},
&HirKind::Repetition(ref rep) => fset_of(&rep.hir),
&HirKind::Group(ref group) => fset_of(&group.hir),

// The most involved case. We need to strip leading empty-looks
// as well as take the union of the first sets of the first n+1
// expressions where n is the number of leading repetitions.
// expressions where n is the number of leading expressions which
// accept the empty string.
&HirKind::Concat(ref es) => {
let mut fset = FirstSet::empty();
for (i, e) in es.iter().enumerate() {
@@ -229,13 +200,9 @@ fn fset_of(expr: &Hir) -> FirstSet {
let inner_fset = fset_of(e);
fset.union(&inner_fset);

if !inner_fset.accepts_empty() {
if !accepts_empty(e) {
// We can stop accumulating after we stop seeing
// first sets which contain epsilon.
// Also, a concatenation which is terminated by
// one or more expressions which do not accept
// epsilon does not itself accept epsilon.
fset.accepts_empty = false;
break;
}
}
@@ -250,13 +217,68 @@ fn fset_of(expr: &Hir) -> FirstSet {
}
fset
}
};

f_char_set.accepts_empty = accepts_empty(expr);
f_char_set
}

fn accepts_empty(expr: &Hir) -> bool {
match expr.kind() {
&HirKind::Empty => true,
&HirKind::Literal(_) => false,
&HirKind::Class(_) => false,

// A naked empty look is a pretty weird thing because we
// normally strip them from the beginning of concatenations.
// We are just going to treat them like `.`
&HirKind::Anchor(_) | &HirKind::WordBoundary(_) => false,

&HirKind::Repetition(ref rep) => {
match rep.kind {
RepetitionKind::ZeroOrOne => true,
RepetitionKind::ZeroOrMore => true,
RepetitionKind::OneOrMore => accepts_empty(&rep.hir),
RepetitionKind::Range(ref range) => {
match range {
&RepetitionRange::Exactly(0)
| &RepetitionRange::AtLeast(0)
| &RepetitionRange::Bounded(0, _) => true,
_ => accepts_empty(&rep.hir),
}
}
}
}

&HirKind::Group(ref group) => accepts_empty(&group.hir),

&HirKind::Concat(ref es) => {
let mut accepts: bool = true;
for e in es.iter() {
match e.kind() {
&HirKind::Anchor(_) | &HirKind::WordBoundary(_) => {
// Ignore any leading emptylooks.
}
_ => {
accepts = accepts && accepts_empty(&e);
}
}

if !accepts {
break;
}
}
accepts
}

&HirKind::Alternation(ref es) => es.iter().any(accepts_empty)
}
}

/// The first byte of a unicode code point.
///
/// We only ever care about the first byte of a particular character,
/// because the onepass DFA is implemented in the byte space, not the
/// We only ever care about the first byte of a particular character
/// because the onepass DFA is implemented in the byte space not the
/// character space. This means, for example, that a branch between
/// lowercase delta and uppercase delta is actually non-deterministic.
fn first_byte(c: char) -> u8 {
@@ -323,10 +345,6 @@ impl FirstSet {
fn is_empty(&self) -> bool {
self.bytes.is_empty() && !self.accepts_empty
}

fn accepts_empty(&self) -> bool {
self.accepts_empty
}
}

/// An iterator over a concatenation of expressions which
@@ -544,4 +562,10 @@ mod tests {
assert!(!is_onepass(&e1));
assert!(!is_onepass(&e2));
}

#[test]
fn is_onepass_clash_in_middle_of_concat() {
let e = Parser::new().parse(r"ab?b").unwrap();
assert!(!is_onepass(&e));
}
}
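
The run-based clash check exercised by the `ab?b` test above can be
sketched in isolation. This is a simplified illustration, not the code in
`src/analysis.rs`: the `Re` enum is a hypothetical stand-in for `Hir`, only
single bytes, `?`, concatenation, and alternation are modelled, and empty
looks and Unicode classes are ignored.

```rust
use std::collections::BTreeSet;

/// Toy regex AST (hypothetical stand-in for the `Hir` type).
enum Re {
    Byte(u8),
    Optional(Box<Re>), // e?
    Concat(Vec<Re>),
    Alt(Vec<Re>),
}

/// Can `e` match the empty string?
fn accepts_empty(e: &Re) -> bool {
    match e {
        Re::Byte(_) => false,
        Re::Optional(_) => true,
        Re::Concat(es) => es.iter().all(accepts_empty),
        Re::Alt(es) => es.iter().any(accepts_empty),
    }
}

/// The set of bytes that might begin a match of `e`.
fn first_set(e: &Re) -> BTreeSet<u8> {
    match e {
        Re::Byte(b) => {
            let mut set = BTreeSet::new();
            set.insert(*b);
            set
        }
        Re::Optional(inner) => first_set(inner),
        Re::Concat(es) => {
            let mut set = BTreeSet::new();
            for e in es {
                set.extend(first_set(e));
                // Stop once an element must consume input.
                if !accepts_empty(e) {
                    break;
                }
            }
            set
        }
        Re::Alt(es) => es.iter().flat_map(first_set).collect(),
    }
}

/// Do any two expressions in `run` share a possible first byte?
fn clash(run: &[&Re]) -> bool {
    for (i, a) in run.iter().enumerate() {
        let fa = first_set(a);
        for b in &run[i + 1..] {
            if first_set(b).iter().any(|x| fa.contains(x)) {
                return true;
            }
        }
    }
    false
}

fn is_onepass(e: &Re) -> bool {
    match e {
        Re::Byte(_) => true,
        Re::Optional(inner) => is_onepass(inner),
        Re::Alt(es) => {
            let branches: Vec<&Re> = es.iter().collect();
            !clash(&branches) && es.iter().all(is_onepass)
        }
        Re::Concat(es) => {
            // A run is a maximal sequence of empty-accepting elements
            // plus the element that terminates it.
            let mut run: Vec<&Re> = Vec::new();
            for e in es {
                run.push(e);
                if !accepts_empty(e) {
                    if clash(&run) {
                        return false;
                    }
                    run.clear();
                }
            }
            !clash(&run) && es.iter().all(is_onepass)
        }
    }
}

fn byte(b: u8) -> Re { Re::Byte(b) }
fn opt(e: Re) -> Re { Re::Optional(Box::new(e)) }

fn main() {
    // `ab?b`: after `a`, both `b?` and `b` can start with `b`.
    let e = Re::Concat(vec![byte(b'a'), opt(byte(b'b')), byte(b'b')]);
    assert!(!is_onepass(&e));
}
```

Clearing the run after each element that must consume input, instead of
stopping at the first one, is what lets the check catch a clash in the
middle of a concatenation.
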
2 changes: 1 addition & 1 deletion src/backtrack.rs
@@ -245,7 +245,7 @@ impl<'a, 'm, 'r, 's, I: Input> Bounded<'a, 'm, 'r, 's, I> {
ip = inst.goto1;
}
EmptyLook(ref inst) => {
if self.input.is_empty_match(at, inst) {
if self.input.is_empty_match(at, inst.look) {
ip = inst.goto;
} else {
return false;