-
-
Notifications
You must be signed in to change notification settings - Fork 59
RE2 lookbehinds fork assessment
A code-level assessment of epfl-systemf/re2-lookbehinds, the RE2 fork by EPFL's SYSTEMF lab that implements linear-time non-backtracking lookbehinds (algorithm from the PLDI 2024 paper by Aurèle Barrière, Clément Pit-Claudel et al.). Recorded 2026-05-12. The fork is referenced from node-re2#199; this page captures the research that explains why we are watching but not vendoring.
Commit 00477b05f8 "feat(lookbehind): match captureless lookbehinds" — the single feature commit on top of upstream RE2 release 2024-07-02. Diff stats: +267 / −21 across 15 files, of which the meaningful changes touch re2/{parse,compile,prog,nfa,onepass,re2,regexp,simplify,prefilter}.cc. The remaining commits on the branch are cleanup, README, and dev-shell infrastructure. Last push 2024-08-28; the fork has been static since.
Reads as a research-quality proof-of-concept, not a contribution-ready patch.
-
lb_field added outside theInstunion (re2/prog.h):int32_t lb_; // opcode == kInstLBCheck or kInstLBWrite uint32_t out_opcode_; union { uint32_t out1_; uint32_t cap_; ... };
The original
Inststruct is packed (12-16 bytes);lb_could go in the union without aliasing — the Init functions only setlb_andout_opcode_, never the union members. Putting it outside bloats everyInstby 4 bytes. Direct cache-footprint cost on every RE2 program, not just lookbehind-bearing ones. -
Commented-out anchor cases with a candid TODO (
re2/compile.cc):// case kRegexpPLB: // @eg TODO: understand what this does, but not necessary for working // case kRegexpNLB:
IsAnchorStart/IsAnchorEndare RE2's anchor-detection optimizations that determine whether the DotStar prefix can be skipped. Skipping these cases without analysis is the kind of thing Russ Cox will push back on. -
Dormant copy-paste bug in
set_out1(re2/prog.h):void set_out1(int out) { out1_ = (out<<5) | (last()<<4) | opcode(); }
out1_holds a plain instruction id, not a bit-packedout_opcode_encoding. Function appears unused but is a clear signal the patch did not go through careful review. -
lb_add_startuses O(n) insertion with the right code commented out:void lb_add_start(int pos) { lb_starts.insert(lb_starts.begin(), pos); // lb_starts.push_back(pos); }
No documented reason for prepending. Doesn't matter at typical sizes; reads as leftover experimentation.
-
Minimal test surface — ~11 ad-hoc assertions in
re2/testing/re2_test.cc. RE2's normal bar is property-based testing viaRE2::Random, exhaustive enumeration intesting/exhaustive*.cc, and fuzzer integration. None of these is updated. -
Bit-layout shift in
out_opcode_to accommodate 4-bit opcodes: theoutfield drops from 28 to 27 bits, halving max program size from 268M to 134M instructions. Not a practical limit but symptomatic of an invasive change —lb_in the union would have avoided the layout disruption.
The algorithm itself is sound (peer-reviewed PLDI'24). The implementation needs a polish pass.
Split by whether the pattern uses lookbehinds.
-
DFA path: unchanged.
prog_->has_lookbehind()defaults to false; only set onkInstLBCheck. - BitState, OnePass, prefix-accel: all preserved on the existing fast paths.
- NFA inner switch dispatch: two new case labels in a jump table — zero cost.
-
New per-byte thread-spawning loop in
Search: gated onprog_->lb_starts.size() > 0; iterates zero times when empty. - Inst-struct bloat (#1 above) does apply — every RE2 program pays ~25% more memory per Inst. Single-digit percent regression expected on cache-bound benchmarks across the board.
Three sources stack:
-
Prefix-accel disabled (
re2/nfa.cc):if (!prog_->has_lookbehind() && !anchored && runq->size() == 0 && ...) p = ... PrefixAccel(...);
RE2's biggest single optimization for unanchored search (memchr/SIMD skip-ahead). Losing it can mean 5-10× slowdown vs. the same pattern without lookbehinds.
-
DFA / BitState / OnePass all disabled when
has_lookbehindis true (re2/re2.cc). Falls through to general NFA — the slowest engine. -
Per-byte thread-spawning loop (
re2/nfa.cc): for every input byte and every lookbehind in the pattern, allocate a Thread + CopyCapture + AddToThreadq + Decref. CPU and allocator churn proportional toinput_length × num_lookbehinds. This is the cost of "lookbehinds can start matching at any byte position."
The blog post mentions these tradeoffs qualitatively; the magnitude is not measured there.
Two independent reasons.
-
No PR was ever opened to google/re2. The fork is a "we did this, take it if you want" gesture: paper + blog post + mailing-list note. RE2's review process requires a champion who'll iterate through Russ Cox's review until landing. Nobody is playing that role:
- EPFL has redirected to rust-lang/regex via PR #1266 (opened 2025-05-15, still in review as of 2025-11-10).
- Erik Giorgis (the implementing author) was a student whose research project shipped.
- A different author (furkankoykiran) opened google/re2#585 on 2025-11-26 with a separate bounded lookaround POC (lookbehind + lookahead, 255-char bound) — different design entirely, also unowned by maintainers.
-
If a PR were opened today, it would need significant rework: fix the Inst-struct bloat, address the anchor-detection TODOs, add proper test/fuzzer coverage, write benchmarks proving the cost story, justify or revert the bit-layout change. Weeks of work by someone who knows RE2's internals.
The algorithm: probably yes, eventually. The PLDI'24 result is solid and the design space is well-understood. But it will likely arrive via someone other than this fork — Russ Cox re-implementing from the paper, or a different external contributor doing the polish work. No movement expected in 2026.
This fork specifically: probably no. It has fulfilled its research purpose. EPFL's continued effort is on rust-lang/regex.
Stock RE2 maintains an invariant valuable to node-re2: if a regex parses, it executes safely (no ReDoS) with correct capture semantics. Patterns with backreferences, lookahead, or lookbehind throw SyntaxError at parse time, which we surface to callers.
The fork's parser (re2/parse.cc) accepts any (?<=...) or (?<!...):
case '(':
if (t.size() > 4 && t[1] == '?' && t[2] == '<' && (t[3] == '=' || t[3] == '!')) {
if (t[3] == '=') ps.DoPosLookBehind();
else ps.DoNegLookBehind();
...There is no parse-time check for capture groups inside the lookbehind. The "captureless" restriction is a property of the algorithm (lookbehind sub-automaton tracks position only, not capture state), not a property of the parser. Accepted/rejected matrix under the fork:
| Construct | Stock RE2 | Fork |
|---|---|---|
(?=...) lookahead |
rejected (SyntaxError) | rejected (SyntaxError) |
(?!...) negative lookahead |
rejected (SyntaxError) | rejected (SyntaxError) |
\1 backreference |
rejected (SyntaxError) | rejected (SyntaxError) |
(?<=foo)bar captureless lookbehind |
rejected (SyntaxError) | accepted, correct |
(?<=(foo))bar captureful lookbehind |
rejected (SyntaxError) | accepted, silently incorrect captures |
The captureful case is the one that breaks our invariant. (foo) inside the lookbehind allocates a normal capture slot via the regular ncap_ counter, and the lookbehind-spawning Search loop runs threads through that capture machinery — but threads spawn at every byte position, so capture slot 1 gets stomped on by lookbehind-side writes that the main pattern then reads. There is no runtime error; the pattern parses, compiles, runs, returns wrong captures.
Were RE2 to absorb this fork as-is, node-re2 would need a parse-time pre-check that walks the regex AST and rejects any capture inside any lookbehind — restoring our "parses ⇒ safe" invariant at the cost of new surface area to maintain. Tractable, but a real reason to wait for a more polished upstream landing rather than vendor early.
- PLDI 2024 paper — Linear Matching of JavaScript Regular Expressions
- SYSTEMF blog — Adding linear-time lookbehinds to RE2 (2024-08-23)
- Feature commit
00477b05f8 - SYSTEMF blog — Adding lookbehinds to rust-lang/regex (2025-07-15)
- rust-lang/regex#1266 — open since 2025-05-15
- google/re2#156 — original 2017 lookaround request, rejected
- google/re2#585 — 2025-11 bounded-lookaround POC (different design)
- node-re2#199 — the issue this research informs
- V8 non-backtracking RegExp engine — related research on the upstream linear-regex landscape
Installing
Troubleshooting
- Problem: ABI mismatch in Electron
- Problem: build vs prod environments
- Problem: non-ASCII install path
Developers
Research and notes
- Notes on building alternatives
- Related projects
- FilteredRE2 evaluation
- RE2 lookbehinds fork assessment
- V8 Fast API assessment
- V8 non-backtracking RegExp engine
Project
Repository · README · prebuilds via install-artifact-from-github