-
-
Notifications
You must be signed in to change notification settings - Fork 59
FilteredRE2 evaluation
A benchmark-driven evaluation of whether re2::FilteredRE2 (filtered_re2.h, filtered_re2.cc) is worth exposing through node-re2, as proposed in issue #232. Recorded 2026-05-12. Conclusion: the general-purpose case for FilteredRE2 has weakened materially once RE2.Set's scaling is measured. A narrow niche remains for users above RE2.Set's default-max_mem ceiling (N ≈ 13,500 random-prefix patterns), but a far cheaper alternative — exposing max_mem as a RE2.Set constructor option — would absorb most of that niche.
A wrapper around multiple RE2 instances that uses a prefilter to skip patterns whose required literal substrings ("atoms") don't appear in the haystack. From filtered_re2.h:
By design, it does not include a string matching engine. This is to allow the user of the class to use their favorite string matching engine. The overall flow is:
Addall the regexps … thenCompile…Compilereturns strings that need to be matched. … the caller does the string matching using the returned strings.
API:
-
Add(pattern, opts, &id)— register a pattern, receive an id. -
Compile(&atoms)— returns avector<string>of distinct lowercased literal substrings derived from the patterns. Defaultmin_atom_len = 3. - Caller searches the haystack for any of those atoms — case-insensitively or against a lowercased haystack — and collects the indices of atoms found.
-
FirstMatch(text, atoms)/AllMatches(text, atoms, &matches)— runs full RE2 only on patterns whose atom-set survived the prefilter.
So node-re2 would need to embed a multi-pattern case-insensitive substring search engine for step 3. The classical algorithm fit is Aho-Corasick; alternatives surveyed in the vault topic and project queue.
Three workload shapes were benchmarked. All measurements used nano-benchmark (95% CI, 100 samples, bootstrap), Node.js 26.1.0, Linux x86-64. Each input is processed end-to-end inside the iteration body; setup costs are measured outside and reported separately. Bench files live in bench/:
-
filtered-rules.mjs— Shape 1: ~30 security-style patterns × 1 MiB haystack -
filtered-many.mjs— Shape 4: 2,000 patterns × 200 short inputs -
filtered-scale.mjs— Shape 5: 1K–13K patterns × 200 short inputs (scaling curve)
Three contestants in each:
-
RegExp— V8 builtin regex looped over each pattern -
RE2— individualnew RE2(pattern)instances looped -
RE2.Set— node-re2's union-DFA multi-pattern matcher (since 1.23.0)
A future RE2.Filtered entrant slots in as a fourth case in each file.
30 case-insensitive patterns drawn from common security-rule families (PHP RCE, XSS, path traversal, SQL injection, secrets like AWS/GitHub tokens, LDAP injection, SSRF metadata URLs, Log4Shell templates). 1 MiB of synthetic log-like content with sparse vulnerable-looking sprinkles seeded so exactly one pattern hits.
| Engine | Median per scan | vs RegExp
|
|---|---|---|
RE2.Set |
1.69 ms | 6.77× faster |
RegExp (V8 loop) |
11.43 ms | baseline |
RE2 (loop, individual) |
19.13 ms | 1.67× slower |
The headline: RE2.Set already crushes the obvious baseline at this scale. Its union DFA scans the haystack once and tracks which patterns matched in DFA state — paying the UTF-8 boundary cost once instead of 30 times. The looped-RE2 case is the worst of both worlds — full boundary cost per pattern, no DFA factoring.
Domain-blocklist-style patterns (fast-site<i>\.com, quick-site<i>\.net, …) with shared adjective prefixes — favorable to DFA factoring. 200 short log-like inputs; ~30% contain a literal that matches one of the patterns.
| Engine | Median per 200-input batch | vs RegExp
|
|---|---|---|
RE2.Set |
340 μs | 45.9× faster |
RegExp (V8 loop) |
15.60 ms | baseline |
RE2 (loop, individual) |
629 ms | 40.3× slower |
Setup costs (one-time): V8 ~2 ms (lazy), RE2 loop 14.8 ms, RE2.Set 8.6 ms. The RE2 loop catastrophically suffers from paying the JS↔C++ UTF-8 boundary 2,000 × 200 = 400,000 times. RE2.Set is also faster to construct than 2,000 individual RE2 instances.
Pseudo-random 8-char unique-prefix literal patterns — adversarial to DFA factoring; no shared structure for the union to compress. Same 200 short inputs (~30% containing a token drawn from the full pool, so higher-N sets see more hits).
| N | Compile | Per-batch (200 inputs) |
|---|---|---|
| 1,000 | 2.9 ms | 283 μs |
| 5,000 | 13.9 ms | 289 μs |
| 10,000 | 29.1 ms | 284 μs |
| 13,000 | 38.1 ms | 298 μs |
| ≥ 14,000 | construction fails | n/a |
Two findings:
-
Runtime is essentially flat across the working range. 13× more patterns produces only a 5% rise in per-batch time. The union DFA pays per byte of input, not per pattern, once it's been built. Compile time grows linearly at ~3 μs per pattern.
-
A hard ceiling at N ≈ 13,500 for these patterns.
new RE2.Set(...)throwsError: RE2.Set could not be compiled.above that threshold. The cause isRE2::Options::max_mem— the default 8 MiB DFA memory limit. Adversarial patterns (no shared structure) blow the budget. For patterns with shared structure (Shape 4's domain blocklist), the ceiling is much higher — Shape 4's 2,000 patterns compile easily and could likely scale to many more.
The ceiling is not algorithmic; it is the default-configuration cap on the DFA's working memory.
The original framing in #232 — "FilteredRE2 greatly improves the performance on large strings when multiple regular expressions should be matched" — is true relative to looping individual RE2 instances. But it is largely already-delivered by RE2.Set, which has shipped since 1.23.0 and which the benchmarks above show is dramatically faster than alternatives across every shape tested.
The remaining theoretical wins for FilteredRE2 over RE2.Set are narrow:
-
N above
RE2.Set'smax_memceiling. With unique-prefix patterns and the 8 MiB default, that's ~13,500 patterns. With shared-structure patterns it can be much higher. The Aho-Corasick prefilter scales linearly with atom-count and would absorb this case, but at the cost of running surviving patterns through individualRE2instances (paying the loop cost surfaced in Shapes 1 and 4) for verification. -
FirstMatchsemantics with selective prefilters.RE2.Setalways does a full DFA pass; FilteredRE2 can short-circuit at first prefilter hit. Untested here — would need a fourth bench shape — but the upside is bounded becauseRE2.Set's pass is already cheap in absolute terms (microseconds, not milliseconds).
The Shape 5 ceiling is a configuration limit, not an algorithmic one. RE2's RE2::Options::set_max_mem(int64_t) controls the DFA memory budget; node-re2 exposes it as a constructor option:
const set = new RE2.Set(patterns, {maxMem: 64 * 1024 * 1024});
set.maxMem; // 67108864maxMem is a positive integer (bytes). Default is 8 MiB (8 << 20). Raising it pushes the ceiling from ~13,500 random-prefix patterns to ~100,000 or more, absorbing the bulk of FilteredRE2's would-be users with a one-line C++ change instead of an Aho-Corasick implementation.
-
FilteredRE2 parked in the project queue. The headline motivation is delivered by
RE2.Set; themaxMemoption absorbs the remaining niche. -
Keep FilteredRE2 documented as a next-resort tool for users beyond the extended-
maxMemceiling. That population is small and largely overlaps with specialized engines like Vectorscan — see discussions/#218 for the broader engine-survey direction.
git clone --recursive https://github.com/uhop/node-re2.git
cd node-re2
npm install
npx nano-bench bench/filtered-rules.mjs
npx nano-bench bench/filtered-many.mjs
npx nano-bench bench/filtered-scale.mjsNumbers in this page were collected on Node.js 26.1.0, Linux x86-64. Absolute milliseconds will vary with hardware; the relative rankings should be stable on any modern system.
Installing
Troubleshooting
- Problem: ABI mismatch in Electron
- Problem: build vs prod environments
- Problem: non-ASCII install path
Developers
Research and notes
- Notes on building alternatives
- Related projects
- FilteredRE2 evaluation
- RE2 lookbehinds fork assessment
- V8 Fast API assessment
- V8 non-backtracking RegExp engine
Project
Repository · README · prebuilds via install-artifact-from-github