Add a onepass DFA.

This patch adds a onepass matcher, which is a DFA that has all the
abilities of an NFA! There are lots of expressions that a onepass
matcher can't handle, namely those cases where a regex contains
non-determinism.

The general approach we take is as follows:

1. Check if a regex is onepass using `src/onepass.rs::is_onepass`.
2. Compile a new regex program using the compiler with the bytes
   flag set.
3. Compile a onepass DFA from the program produced in step 2. We
   will roughly map each instruction to a state in the DFA, though
   instructions like `split` don't get states.
   a. Make a new transition table for the first instruction.
   b. For each child of the first instruction:
      - If it is a bytes instruction, add a transition to the
        table for every byte class in the instruction.
      - If it is an instruction which consumes zero input
        (like `EmptyLook` or `Save`), emit a job to a DAG
        asking to forward the first instruction state to
        the state for the non-consuming instruction.
      - Push the child instruction to a queue of instructions
        to process.
   c. Peel off an instruction from the queue and go back to
      step a, processing the instruction as if it was the
      first instruction. If the queue is empty, continue with
      step d.
   d. Topologically sort the forwarding jobs, and shuffle the
      transitions from the forwarding targets to the forwarding
      sources in topological order.
   e. Bake the intermediary transition tables down into a single
      flat vector. States which require some action (`EmptyLook`
      and `Save`) get an extra entry in the baked transition table
      that contains metadata instructing them on how to perform
      their actions.
4. Wait for the user to give us some input.
5. Execute the DFA:
   - The inner loop is basically:

         while at < text.len():
             state_ptr = baked_table[text[at]]
             at += 1

   - There is a lot of window dressing to handle special states.

The idea of a onepass matcher comes from Russ Cox and his RE2
library. I haven't been as good about reading the RE2 source as I
should have, but I've gotten the impression that the RE2 onepass
matcher is more in the spirit of an NFA simulation without threads
than a DFA.
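
To make the "baked" table and the inner loop from the description above concrete, here is a minimal sketch in Rust. It is not the code in this patch: `BakedDfa`, `STRIDE`, `DEAD`, and `ab_star_c` are hypothetical names, the table uses one entry per raw byte rather than per byte class, the match is anchored at both ends, and the special `EmptyLook`/`Save` states are left out entirely.

```rust
/// Hypothetical layout: one row of 256 entries per state, all rows
/// stored back to back in a single flat vector.
const STRIDE: usize = 256;
/// Sentinel pointer meaning "no transition" (an assumption of this sketch).
const DEAD: usize = usize::MAX;

struct BakedDfa {
    /// table[state_ptr + byte] holds the next state pointer.
    table: Vec<usize>,
    /// accepting[state_ptr / STRIDE] says whether a state is a match state.
    accepting: Vec<bool>,
}

impl BakedDfa {
    fn new(num_states: usize) -> Self {
        BakedDfa {
            table: vec![DEAD; num_states * STRIDE],
            accepting: vec![false; num_states],
        }
    }

    /// Record a transition between state indices; the stored value is a
    /// pointer (row offset) so the hot loop needs no multiplication.
    fn add_transition(&mut self, from: usize, byte: u8, to: usize) {
        self.table[from * STRIDE + byte as usize] = to * STRIDE;
    }

    /// Anchored, whole-input match: exactly one table lookup per byte.
    fn is_match(&self, text: &[u8]) -> bool {
        let mut state_ptr = 0; // state 0 is the start state
        let mut at = 0;
        while at < text.len() {
            state_ptr = self.table[state_ptr + text[at] as usize];
            if state_ptr == DEAD {
                return false;
            }
            at += 1;
        }
        self.accepting[state_ptr / STRIDE]
    }
}

/// Hand-built DFA for the onepass regex `ab*c`.
fn ab_star_c() -> BakedDfa {
    let mut dfa = BakedDfa::new(3);
    dfa.add_transition(0, b'a', 1);
    dfa.add_transition(1, b'b', 1);
    dfa.add_transition(1, b'c', 2);
    dfa.accepting[2] = true;
    dfa
}
```

Calling `ab_star_c().is_match(b"abbbc")` walks states 0, 1, 1, 1, 1, 2 and accepts, while `b"abca"` hits a dead transition on the final `a`. Storing row offsets instead of state indices is one plausible reason to call the value a `state_ptr`, but that reading is a guess.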
rust-lang · Jul 10, 2018 · cc907b8
1 parent 87cfe7e

Showing 10 changed files with 1,488 additions and 71 deletions.
diff --git a/Cargo.toml b/Cargo.toml
@@ -114,6 +114,11 @@ name = "backtrack-bytes"
 path = "tests/test_crates_regex.rs"
 name = "crates-regex"
 
+# Run the test suite on the onepass engine.
+[[test]]
+path = "tests/test_onepass.rs"
+name = "onepass"
+
 [profile.release]
 debug = true
 

diff --git a/src/analysis.rs b/src/analysis.rs
@@ -50,20 +50,15 @@ impl IsOnePassVisitor {
         let mut empty_run = vec![];
 
         for e in NestedConcat::new(es) {
-            // TODO(ethan):yakshaving factor the determination of when
-            //     a regex accepts_empty out into a separate function,
-            //     so that we don't compute the whole first set when we
-            //     don't need to.
-            let fset = fset_of(e);
             let is_rep = match e.kind() {
                 &HirKind::Repetition(_) => true,
                 _ => false,
             };
 
             empty_run.push(e);
-            if !(fset.accepts_empty || is_rep) {
-                // this is the last one in the run
-                break;
+            if !(accepts_empty(e) || is_rep) {
+                self.0 = self.0 && !fsets_clash(&empty_run);
+                empty_run.clear();
             }
         }
 
@@ -76,7 +71,7 @@ impl IsOnePassVisitor {
         self.0 = self.0 && !fsets_clash(&es.iter().collect::<Vec<_>>());
     }
 
-    // Unicode classes are really big alternatives from the byte
+    // Unicode classes are really just big alternatives from the byte
     // oriented point of view.
     //
     // This function translates a unicode class into the 
@@ -99,7 +94,7 @@ impl IsOnePassVisitor {
                     }
                 }
             }
-            _ => {} // FALLTHROUGH
+            _ => {}
         }
     }
 
@@ -115,16 +110,6 @@ fn fsets_clash(es: &[&Hir]) -> bool {
                 let mut fset = fset_of(e1);
                 let fset2 = fset_of(e2);
 
-                // For the regex /a|()+/, we don't have a way to
-                // differentiate the branches, so we are not onepass.
-                //
-                // We might be able to loosen this restriction by
-                // considering the expression after the alternative
-                // if there is one.
-                if fset.is_empty() || fset2.is_empty() {
-                    return true;
-                }
-
                 fset.intersect(&fset2);
                 if ! fset.is_empty() {
                     return true;
@@ -138,14 +123,14 @@ fn fsets_clash(es: &[&Hir]) -> bool {
 
 /// Compute the first set of a given regular expression.
 ///
-/// The first set of a regular expression is the set of all characters
+/// The first set of a regular expression is the set of all bytes
 /// which might begin it. This is a less general version of the
 /// notion of a regular expression preview (the first set can be
 /// thought of as the 1-preview of a regular expression).
 ///
 /// Note that first sets are byte-oriented because the DFA is
 /// byte oriented. This means an expression like /Δ|δ/ is actually not
-/// one-pass, even though there is clearly no non-determinism inherent
+/// onepass, even though there is clearly no non-determinism inherent
 /// to the regex at a unicode code point level (big delta and little
 /// delta start with the same byte).
 fn fset_of(expr: &Hir) -> FirstSet {
@@ -155,7 +140,9 @@ fn fset_of(expr: &Hir) -> FirstSet {
         f
     }
 
-    match expr.kind() {
+    // First compute the set of characters that might begin
+    // the expression (ignoring epsilon for now).
+    let mut f_char_set = match expr.kind() {
         &HirKind::Empty => FirstSet::epsilon(),
         &HirKind::Literal(ref lit) => {
             match lit {
@@ -191,29 +178,13 @@ fn fset_of(expr: &Hir) -> FirstSet {
         // that such an emptylook could potentially match on any character.
         &HirKind::Anchor(_) | &HirKind::WordBoundary(_) => FirstSet::anychar(),
 
-        &HirKind::Repetition(ref rep) => {
-            let mut f = fset_of(&*rep.hir);
-            match rep.kind {
-                RepetitionKind::ZeroOrOne => f.accepts_empty = true,
-                RepetitionKind::ZeroOrMore => f.accepts_empty = true,
-                RepetitionKind::OneOrMore => {},
-                RepetitionKind::Range(ref range) => {
-                    match range {
-                        &RepetitionRange::Exactly(0)
-                        | &RepetitionRange::AtLeast(0)
-                        | &RepetitionRange::Bounded(0, _) =>
-                            f.accepts_empty = true,
-                        _ => {}
-                    }
-                }
-            }
-            f
-        },
+        &HirKind::Repetition(ref rep) => fset_of(&rep.hir),
         &HirKind::Group(ref group) => fset_of(&group.hir),
 
         // The most involved case. We need to strip leading empty-looks
         // as well as take the union of the first sets of the first n+1
-        // expressions where n is the number of leading repetitions.
+        // expressions where n is the number of leading expressions which
+        // accept the empty string.
         &HirKind::Concat(ref es) => {
             let mut fset = FirstSet::empty();
             for (i, e) in es.iter().enumerate() {
@@ -229,13 +200,9 @@ fn fset_of(expr: &Hir) -> FirstSet {
                         let inner_fset = fset_of(e);
                         fset.union(&inner_fset);
 
-                        if !inner_fset.accepts_empty() {
+                        if !accepts_empty(e) {
                             // We can stop accumulating after we stop seeing
                             // first sets which contain epsilon.
-                            // Also, a contatination which terminated by
-                            // one or more expressions which do not accept
-                            // epsilon itself does not acceept epsilon.
-                            fset.accepts_empty = false;
                             break;
                         }
                     }
@@ -250,13 +217,68 @@ fn fset_of(expr: &Hir) -> FirstSet {
             }
             fset
         }
+    };
+
+    f_char_set.accepts_empty = accepts_empty(expr);
+    f_char_set
+}
+
+fn accepts_empty(expr: &Hir) -> bool {
+    match expr.kind() {
+        &HirKind::Empty => true,
+        &HirKind::Literal(_) => false,
+        &HirKind::Class(_) => false,
+
+        // A naked empty look is a pretty weird thing because we
+        // normally strip them from the beginning of concatinations.
+        // We are just going to treat them like `.`
+        &HirKind::Anchor(_) | &HirKind::WordBoundary(_) => false,
+
+        &HirKind::Repetition(ref rep) => {
+            match rep.kind {
+                RepetitionKind::ZeroOrOne => true,
+                RepetitionKind::ZeroOrMore => true,
+                RepetitionKind::OneOrMore => accepts_empty(&rep.hir),
+                RepetitionKind::Range(ref range) => {
+                    match range {
+                        &RepetitionRange::Exactly(0)
+                        | &RepetitionRange::AtLeast(0)
+                        | &RepetitionRange::Bounded(0, _) => true,
+                        _ => accepts_empty(&rep.hir),
+                    }
+                }
+            }
+        }
+
+        &HirKind::Group(ref group) => accepts_empty(&group.hir),
+
+        &HirKind::Concat(ref es) => {
+            let mut accepts: bool = true;
+            for e in es.iter() {
+                match e.kind() {
+                    &HirKind::Anchor(_) | &HirKind::WordBoundary(_) => {
+                        // Ignore any leading emptylooks.
+                    }
+                    _ => {
+                        accepts = accepts && accepts_empty(&e);
+                    }
+                }
+
+                if !accepts {
+                    break;
+                }
+            }
+            accepts
+        }
+
+        &HirKind::Alternation(ref es) => es.iter().any(accepts_empty)
     }
 }
 
 /// The first byte of a unicode code point.
 ///
-/// We only ever care about the first byte of a particular character,
-/// because the onepass DFA is implemented in the byte space, not the
+/// We only ever care about the first byte of a particular character
+/// because the onepass DFA is implemented in the byte space not the
 /// character space. This means, for example, that a branch between
 /// lowercase delta and uppercase delta is actually non-deterministic.
 fn first_byte(c: char) -> u8 {
@@ -323,10 +345,6 @@ impl FirstSet {
     fn is_empty(&self) -> bool {
         self.bytes.is_empty() && !self.accepts_empty
     }
-
-    fn accepts_empty(&self) -> bool {
-        self.accepts_empty
-    }
 }
 
 /// An iterator over a concatenation of expressions which
@@ -544,4 +562,10 @@ mod tests {
         assert!(!is_onepass(&e1));
         assert!(!is_onepass(&e2));
     }
+
+    #[test]
+    fn is_onepass_clash_in_middle_of_concat() {
+        let e = Parser::new().parse(r"ab?b").unwrap();
+        assert!(!is_onepass(&e));
+    }
 }
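
The `ab?b` test above exercises the clash check in the middle of a concatenation. The following is a rough, self-contained sketch of that idea on a toy expression type. `Expr`, `first_bytes`, `run_clashes`, and `concat_is_onepass` are hypothetical stand-ins, not the `Hir`, `fset_of`, `fsets_clash`, or visitor code from this patch. Checking the trailing run after the loop is also an assumption of the sketch, and nested concatenations are not descended into.

```rust
use std::collections::BTreeSet;
use std::iter;

/// Toy stand-in for `Hir` (hypothetical; only what the example needs).
enum Expr {
    Byte(u8),
    Optional(Box<Expr>),
    Concat(Vec<Expr>),
}

/// Can the expression match the empty string?
fn accepts_empty(e: &Expr) -> bool {
    match e {
        Expr::Byte(_) => false,
        Expr::Optional(_) => true,
        Expr::Concat(es) => es.iter().all(accepts_empty),
    }
}

/// The set of bytes that might begin the expression.
fn first_bytes(e: &Expr) -> BTreeSet<u8> {
    match e {
        Expr::Byte(b) => iter::once(*b).collect(),
        Expr::Optional(inner) => first_bytes(inner),
        Expr::Concat(es) => {
            let mut set = BTreeSet::new();
            for e in es {
                set.extend(first_bytes(e));
                // Stop once an element must consume input.
                if !accepts_empty(e) {
                    break;
                }
            }
            set
        }
    }
}

/// Do any two expressions in the run share a possible first byte?
fn run_clashes(run: &[&Expr]) -> bool {
    for (i, e1) in run.iter().enumerate() {
        for e2 in &run[i + 1..] {
            if !first_bytes(e1).is_disjoint(&first_bytes(e2)) {
                return true;
            }
        }
    }
    false
}

/// Collect runs of skippable elements plus the element that ends each
/// run, and reject the concatenation if any run clashes.
fn concat_is_onepass(es: &[Expr]) -> bool {
    let mut run: Vec<&Expr> = Vec::new();
    for e in es {
        run.push(e);
        if !accepts_empty(e) {
            if run_clashes(&run) {
                return false;
            }
            run.clear();
        }
    }
    // Assumption of this sketch: a trailing run is checked as well.
    !run_clashes(&run)
}
```

For `ab?b`, the run `[b?, b]` has overlapping first sets, so after reading `a` a single byte of lookahead cannot decide whether the next `b` belongs to `b?` or to the final `b`. Replacing the last element with `c` makes the run `[b?, c]` disjoint, and the sketch accepts it.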
diff --git a/src/backtrack.rs b/src/backtrack.rs
@@ -245,7 +245,7 @@ impl<'a, 'm, 'r, 's, I: Input> Bounded<'a, 'm, 'r, 's, I> {
                     ip = inst.goto1;
                 }
                 EmptyLook(ref inst) => {
-                    if self.input.is_empty_match(at, inst) {
+                    if self.input.is_empty_match(at, inst.look) {
                         ip = inst.goto;
                     } else {
                         return false;