Skip to content

RE2 lookbehinds fork assessment

Eugene Lazutkin edited this page May 12, 2026 · 1 revision

A code-level assessment of epfl-systemf/re2-lookbehinds, the RE2 fork by EPFL's SYSTEMF lab that implements linear-time non-backtracking lookbehinds (algorithm from the PLDI 2024 paper by Aurèle Barrière, Clément Pit-Claudel et al.). Recorded 2026-05-12. The fork is referenced from node-re2#199; this page captures the research that explains why we are watching but not vendoring.

What was analyzed

Commit 00477b05f8 "feat(lookbehind): match captureless lookbehinds" — the single feature commit on top of upstream RE2 release 2024-07-02. Diff stats: +267 / −21 across 15 files, of which the meaningful changes touch re2/{parse,compile,prog,nfa,onepass,re2,regexp,simplify,prefilter}.cc. The remaining commits on the branch are cleanup, README, and dev-shell infrastructure. Last push 2024-08-28; the fork has been static since.

Code-quality assessment

Reads as a research-quality proof-of-concept, not a contribution-ready patch.

  1. lb_ field added outside the Inst union (re2/prog.h):

    int32_t lb_;           // opcode == kInstLBCheck or kInstLBWrite
    uint32_t out_opcode_;
    union { uint32_t out1_; uint32_t cap_; ... };

    The original Inst struct is packed (12-16 bytes); lb_ could go in the union without aliasing — the Init functions only set lb_ and out_opcode_, never the union members. Putting it outside bloats every Inst by 4 bytes. Direct cache-footprint cost on every RE2 program, not just lookbehind-bearing ones.

  2. Commented-out anchor cases with a candid TODO (re2/compile.cc):

    // case kRegexpPLB:  // @eg TODO: understand what this does, but not necessary for working
    // case kRegexpNLB:

    IsAnchorStart / IsAnchorEnd are RE2's anchor-detection optimizations that determine whether the DotStar prefix can be skipped. Skipping these cases without analysis is the kind of thing Russ Cox will push back on.

  3. Dormant copy-paste bug in set_out1 (re2/prog.h):

    void set_out1(int out) {
      out1_ = (out<<5) | (last()<<4) | opcode();
    }

    out1_ holds a plain instruction id, not a bit-packed out_opcode_ encoding. Function appears unused but is a clear signal the patch did not go through careful review.

  4. lb_add_start uses O(n) insertion with the right code commented out:

    void lb_add_start(int pos) {
      lb_starts.insert(lb_starts.begin(), pos);
      // lb_starts.push_back(pos);
    }

    No documented reason for prepending. Doesn't matter at typical sizes; reads as leftover experimentation.

  5. Minimal test surface — ~11 ad-hoc assertions in re2/testing/re2_test.cc. RE2's normal bar is property-based testing via RE2::Random, exhaustive enumeration in testing/exhaustive*.cc, and fuzzer integration. None of these is updated.

  6. Bit-layout shift in out_opcode_ to accommodate 4-bit opcodes: the out field drops from 28 to 27 bits, halving max program size from 268M to 134M instructions. Not a practical limit but symptomatic of an invasive change — lb_ in the union would have avoided the layout disruption.

The algorithm itself is sound (peer-reviewed PLDI'24). The implementation needs a polish pass.

Performance impact

Split by whether the pattern uses lookbehinds.

Patterns without lookbehinds — minor regression

  • DFA path: unchanged. prog_->has_lookbehind() defaults to false; only set on kInstLBCheck.
  • BitState, OnePass, prefix-accel: all preserved on the existing fast paths.
  • NFA inner switch dispatch: two new case labels in a jump table — zero cost.
  • New per-byte thread-spawning loop in Search: gated on prog_->lb_starts.size() > 0; iterates zero times when empty.
  • Inst-struct bloat (#1 above) does apply — every RE2 program pays ~25% more memory per Inst. Single-digit percent regression expected on cache-bound benchmarks across the board.

Patterns with lookbehinds — substantial cost

Three sources stack:

  1. Prefix-accel disabled (re2/nfa.cc):

    if (!prog_->has_lookbehind() && !anchored && runq->size() == 0 && ...)
      p = ... PrefixAccel(...);

    RE2's biggest single optimization for unanchored search (memchr/SIMD skip-ahead). Losing it can mean 5-10× slowdown vs. the same pattern without lookbehinds.

  2. DFA / BitState / OnePass all disabled when has_lookbehind is true (re2/re2.cc). Falls through to general NFA — the slowest engine.

  3. Per-byte thread-spawning loop (re2/nfa.cc): for every input byte and every lookbehind in the pattern, allocate a Thread + CopyCapture + AddToThreadq + Decref. CPU and allocator churn proportional to input_length × num_lookbehinds. This is the cost of "lookbehinds can start matching at any byte position."

The blog post mentions these tradeoffs qualitatively; the magnitude is not measured there.

Why it has not been accepted

Two independent reasons.

  1. No PR was ever opened to google/re2. The fork is a "we did this, take it if you want" gesture: paper + blog post + mailing-list note. RE2's review process requires a champion who'll iterate through Russ Cox's review until landing. Nobody is playing that role:

    • EPFL has redirected to rust-lang/regex via PR #1266 (opened 2025-05-15, still in review as of 2025-11-10).
    • Erik Giorgis (the implementing author) was a student whose research project shipped.
    • A different author (furkankoykiran) opened google/re2#585 on 2025-11-26 with a separate bounded lookaround POC (lookbehind + lookahead, 255-char bound) — different design entirely, also unowned by maintainers.
  2. If a PR were opened today, it would need significant rework: fix the Inst-struct bloat, address the anchor-detection TODOs, add proper test/fuzzer coverage, write benchmarks proving the cost story, justify or revert the bit-layout change. Weeks of work by someone who knows RE2's internals.

Future in RE2

The algorithm: probably yes, eventually. The PLDI'24 result is solid and the design space is well-understood. But it will likely arrive via someone other than this fork — Russ Cox re-implementing from the paper, or a different external contributor doing the polish work. No movement expected in 2026.

This fork specifically: probably no. It has fulfilled its research purpose. EPFL's continued effort is on rust-lang/regex.

Implications for node-re2 — the invariant break

Stock RE2 maintains an invariant valuable to node-re2: if a regex parses, it executes safely (no ReDoS) with correct capture semantics. Patterns with backreferences, lookahead, or lookbehind throw SyntaxError at parse time, which we surface to callers.

The fork's parser (re2/parse.cc) accepts any (?<=...) or (?<!...):

case '(':
  if (t.size() > 4 && t[1] == '?' && t[2] == '<' && (t[3] == '=' || t[3] == '!')) {
    if (t[3] == '=') ps.DoPosLookBehind();
    else ps.DoNegLookBehind();
    ...

There is no parse-time check for capture groups inside the lookbehind. The "captureless" restriction is a property of the algorithm (lookbehind sub-automaton tracks position only, not capture state), not a property of the parser. Accepted/rejected matrix under the fork:

Construct Stock RE2 Fork
(?=...) lookahead rejected (SyntaxError) rejected (SyntaxError)
(?!...) negative lookahead rejected (SyntaxError) rejected (SyntaxError)
\1 backreference rejected (SyntaxError) rejected (SyntaxError)
(?<=foo)bar captureless lookbehind rejected (SyntaxError) accepted, correct
(?<=(foo))bar captureful lookbehind rejected (SyntaxError) accepted, silently incorrect captures

The captureful case is the one that breaks our invariant. (foo) inside the lookbehind allocates a normal capture slot via the regular ncap_ counter, and the lookbehind-spawning Search loop runs threads through that capture machinery — but threads spawn at every byte position, so capture slot 1 gets stomped on by lookbehind-side writes that the main pattern then reads. There is no runtime error; the pattern parses, compiles, runs, returns wrong captures.

Were RE2 to absorb this fork as-is, node-re2 would need a parse-time pre-check that walks the regex AST and rejects any capture inside any lookbehind — restoring our "parses ⇒ safe" invariant at the cost of new surface area to maintain. Tractable, but a real reason to wait for a more polished upstream landing rather than vendor early.

References

See also

Clone this wiki locally