Skip to content

FilteredRE2 evaluation

Eugene Lazutkin edited this page May 13, 2026 · 2 revisions

A benchmark-driven evaluation of whether re2::FilteredRE2 (filtered_re2.h, filtered_re2.cc) is worth exposing through node-re2, as proposed in issue #232. Recorded 2026-05-12. Conclusion: the general-purpose case for FilteredRE2 has weakened materially once RE2.Set's scaling is measured. A narrow niche remains for users above RE2.Set's default-max_mem ceiling (N ≈ 13,500 random-prefix patterns), but a far cheaper alternative — exposing max_mem as a RE2.Set constructor option — would absorb most of that niche.

What FilteredRE2 is

A wrapper around multiple RE2 instances that uses a prefilter to skip patterns whose required literal substrings ("atoms") don't appear in the haystack. From filtered_re2.h:

By design, it does not include a string matching engine. This is to allow the user of the class to use their favorite string matching engine. The overall flow is: Add all the regexps … then CompileCompile returns strings that need to be matched. … the caller does the string matching using the returned strings.

API:

  1. Add(pattern, opts, &id) — register a pattern, receive an id.
  2. Compile(&atoms) — returns a vector<string> of distinct lowercased literal substrings derived from the patterns. Default min_atom_len = 3.
  3. Caller searches the haystack for any of those atoms — case-insensitively or against a lowercased haystack — and collects the indices of atoms found.
  4. FirstMatch(text, atoms) / AllMatches(text, atoms, &matches) — runs full RE2 only on patterns whose atom-set survived the prefilter.

So node-re2 would need to embed a multi-pattern case-insensitive substring search engine for step 3. The classical algorithm fit is Aho-Corasick; alternatives surveyed in the vault topic and project queue.

Methodology

Three workload shapes were benchmarked. All measurements used nano-benchmark (95% CI, 100 samples, bootstrap), Node.js 26.1.0, Linux x86-64. Each input is processed end-to-end inside the iteration body; setup costs are measured outside and reported separately. Bench files live in bench/:

Three contestants in each:

  • RegExp — V8 builtin regex looped over each pattern
  • RE2 — individual new RE2(pattern) instances looped
  • RE2.Set — node-re2's union-DFA multi-pattern matcher (since 1.23.0)

A future RE2.Filtered entrant slots in as a fourth case in each file.

Shape 1 — security rules over a MB-scale haystack

30 case-insensitive patterns drawn from common security-rule families (PHP RCE, XSS, path traversal, SQL injection, secrets like AWS/GitHub tokens, LDAP injection, SSRF metadata URLs, Log4Shell templates). 1 MiB of synthetic log-like content with sparse vulnerable-looking sprinkles seeded so exactly one pattern hits.

Engine Median per scan vs RegExp
RE2.Set 1.69 ms 6.77× faster
RegExp (V8 loop) 11.43 ms baseline
RE2 (loop, individual) 19.13 ms 1.67× slower

The headline: RE2.Set already crushes the obvious baseline at this scale. Its union DFA scans the haystack once and tracks which patterns matched in DFA state — paying the UTF-8 boundary cost once instead of 30 times. The looped-RE2 case is the worst of both worlds — full boundary cost per pattern, no DFA factoring.

Shape 4 — 2,000 patterns over moderate inputs

Domain-blocklist-style patterns (fast-site<i>\.com, quick-site<i>\.net, …) with shared adjective prefixes — favorable to DFA factoring. 200 short log-like inputs; ~30% contain a literal that matches one of the patterns.

Engine Median per 200-input batch vs RegExp
RE2.Set 340 μs 45.9× faster
RegExp (V8 loop) 15.60 ms baseline
RE2 (loop, individual) 629 ms 40.3× slower

Setup costs (one-time): V8 ~2 ms (lazy), RE2 loop 14.8 ms, RE2.Set 8.6 ms. The RE2 loop catastrophically suffers from paying the JS↔C++ UTF-8 boundary 2,000 × 200 = 400,000 times. RE2.Set is also faster to construct than 2,000 individual RE2 instances.

Shape 5 — scaling curve to find the ceiling

Pseudo-random 8-char unique-prefix literal patterns — adversarial to DFA factoring; no shared structure for the union to compress. Same 200 short inputs (~30% containing a token drawn from the full pool, so higher-N sets see more hits).

N Compile Per-batch (200 inputs)
1,000 2.9 ms 283 μs
5,000 13.9 ms 289 μs
10,000 29.1 ms 284 μs
13,000 38.1 ms 298 μs
≥ 14,000 construction fails n/a

Two findings:

  1. Runtime is essentially flat across the working range. 13× more patterns produces only a 5% rise in per-batch time. The union DFA pays per byte of input, not per pattern, once it's been built. Compile time grows linearly at ~3 μs per pattern.

  2. A hard ceiling at N ≈ 13,500 for these patterns. new RE2.Set(...) throws Error: RE2.Set could not be compiled. above that threshold. The cause is RE2::Options::max_mem — the default 8 MiB DFA memory limit. Adversarial patterns (no shared structure) blow the budget. For patterns with shared structure (Shape 4's domain blocklist), the ceiling is much higher — Shape 4's 2,000 patterns compile easily and could likely scale to many more.

The ceiling is not algorithmic; it is the default-configuration cap on the DFA's working memory.

What this means for FilteredRE2

The original framing in #232 — "FilteredRE2 greatly improves the performance on large strings when multiple regular expressions should be matched" — is true relative to looping individual RE2 instances. But it is largely already-delivered by RE2.Set, which has shipped since 1.23.0 and which the benchmarks above show is dramatically faster than alternatives across every shape tested.

The remaining theoretical wins for FilteredRE2 over RE2.Set are narrow:

  • N above RE2.Set's max_mem ceiling. With unique-prefix patterns and the 8 MiB default, that's ~13,500 patterns. With shared-structure patterns it can be much higher. The Aho-Corasick prefilter scales linearly with atom-count and would absorb this case, but at the cost of running surviving patterns through individual RE2 instances (paying the loop cost surfaced in Shapes 1 and 4) for verification.
  • FirstMatch semantics with selective prefilters. RE2.Set always does a full DFA pass; FilteredRE2 can short-circuit at first prefilter hit. Untested here — would need a fourth bench shape — but the upside is bounded because RE2.Set's pass is already cheap in absolute terms (microseconds, not milliseconds).

A cheaper alternative — maxMem on RE2.Set (shipped)

The Shape 5 ceiling is a configuration limit, not an algorithmic one. RE2's RE2::Options::set_max_mem(int64_t) controls the DFA memory budget; node-re2 exposes it as a constructor option:

const set = new RE2.Set(patterns, {maxMem: 64 * 1024 * 1024});
set.maxMem; // 67108864

maxMem is a positive integer (bytes). Default is 8 MiB (8 << 20). Raising it pushes the ceiling from ~13,500 random-prefix patterns to ~100,000 or more, absorbing the bulk of FilteredRE2's would-be users with a one-line C++ change instead of an Aho-Corasick implementation.

Decision

  • FilteredRE2 parked in the project queue. The headline motivation is delivered by RE2.Set; the maxMem option absorbs the remaining niche.
  • Keep FilteredRE2 documented as a next-resort tool for users beyond the extended-maxMem ceiling. That population is small and largely overlaps with specialized engines like Vectorscan — see discussions/#218 for the broader engine-survey direction.

Reproducing

git clone --recursive https://github.com/uhop/node-re2.git
cd node-re2
npm install
npx nano-bench bench/filtered-rules.mjs
npx nano-bench bench/filtered-many.mjs
npx nano-bench bench/filtered-scale.mjs

Numbers in this page were collected on Node.js 26.1.0, Linux x86-64. Absolute milliseconds will vary with hardware; the relative rankings should be stable on any modern system.

Clone this wiki locally