FilteredRE2 evaluation

A benchmark-driven evaluation of whether re2::FilteredRE2 (filtered_re2.h, filtered_re2.cc) is worth exposing through node-re2, as proposed in issue #232. Recorded 2026-05-12. Conclusion: the general-purpose case for FilteredRE2 has weakened materially once RE2.Set's scaling is measured. A narrow niche remains for users above RE2.Set's default-max_mem ceiling (N ≈ 13,500 random-prefix patterns), but a far cheaper alternative — exposing max_mem as a RE2.Set constructor option — would absorb most of that niche.

What FilteredRE2 is

A wrapper around multiple RE2 instances that uses a prefilter to skip patterns whose required literal substrings ("atoms") don't appear in the haystack. From filtered_re2.h:

By design, it does not include a string matching engine. This is to allow the user of the class to use their favorite string matching engine. The overall flow is: Add all the regexps … then Compile … Compile returns strings that need to be matched. … the caller does the string matching using the returned strings.

API:

Add(pattern, opts, &id) — register a pattern, receive an id.
Compile(&atoms) — returns a vector<string> of distinct lowercased literal substrings derived from the patterns. Default min_atom_len = 3.
Caller searches the haystack for any of those atoms — case-insensitively or against a lowercased haystack — and collects the indices of atoms found.
FirstMatch(text, atoms) / AllMatches(text, atoms, &matches) — runs full RE2 only on patterns whose atom-set survived the prefilter.

So node-re2 would need to embed a multi-pattern case-insensitive substring search engine for step 3. The classical algorithm fit is Aho-Corasick; alternatives surveyed in the vault topic and project queue.

Methodology

Three workload shapes were benchmarked. All measurements used nano-benchmark (95% CI, 100 samples, bootstrap), Node.js 26.1.0, Linux x86-64. Each input is processed end-to-end inside the iteration body; setup costs are measured outside and reported separately. Bench files live in bench/:

filtered-rules.mjs — Shape 1: ~30 security-style patterns × 1 MiB haystack
filtered-many.mjs — Shape 4: 2,000 patterns × 200 short inputs
filtered-scale.mjs — Shape 5: 1K–13K patterns × 200 short inputs (scaling curve)

Three contestants in each:

RegExp — V8 builtin regex looped over each pattern
RE2 — individual new RE2(pattern) instances looped
RE2.Set — node-re2's union-DFA multi-pattern matcher (since 1.23.0)

A future RE2.Filtered entrant slots in as a fourth case in each file.

Shape 1 — security rules over a MB-scale haystack

30 case-insensitive patterns drawn from common security-rule families (PHP RCE, XSS, path traversal, SQL injection, secrets like AWS/GitHub tokens, LDAP injection, SSRF metadata URLs, Log4Shell templates). 1 MiB of synthetic log-like content with sparse vulnerable-looking sprinkles seeded so exactly one pattern hits.

Engine	Median per scan	vs `RegExp`
`RE2.Set`	1.69 ms	6.77× faster
`RegExp` (V8 loop)	11.43 ms	baseline
`RE2` (loop, individual)	19.13 ms	1.67× slower

The headline: RE2.Set already crushes the obvious baseline at this scale. Its union DFA scans the haystack once and tracks which patterns matched in DFA state — paying the UTF-8 boundary cost once instead of 30 times. The looped-RE2 case is the worst of both worlds — full boundary cost per pattern, no DFA factoring.

Shape 4 — 2,000 patterns over moderate inputs

Domain-blocklist-style patterns (fast-site<i>\.com, quick-site<i>\.net, …) with shared adjective prefixes — favorable to DFA factoring. 200 short log-like inputs; ~30% contain a literal that matches one of the patterns.

Engine	Median per 200-input batch	vs `RegExp`
`RE2.Set`	340 μs	45.9× faster
`RegExp` (V8 loop)	15.60 ms	baseline
`RE2` (loop, individual)	629 ms	40.3× slower

Setup costs (one-time): V8 ~2 ms (lazy), RE2 loop 14.8 ms, RE2.Set 8.6 ms. The RE2 loop catastrophically suffers from paying the JS↔C++ UTF-8 boundary 2,000 × 200 = 400,000 times. RE2.Set is also faster to construct than 2,000 individual RE2 instances.

Shape 5 — scaling curve to find the ceiling

Pseudo-random 8-char unique-prefix literal patterns — adversarial to DFA factoring; no shared structure for the union to compress. Same 200 short inputs (~30% containing a token drawn from the full pool, so higher-N sets see more hits).

N	Compile	Per-batch (200 inputs)
1,000	2.9 ms	283 μs
5,000	13.9 ms	289 μs
10,000	29.1 ms	284 μs
13,000	38.1 ms	298 μs
≥ 14,000	construction fails	n/a

Two findings:

Runtime is essentially flat across the working range. 13× more patterns produces only a 5% rise in per-batch time. The union DFA pays per byte of input, not per pattern, once it's been built. Compile time grows linearly at ~3 μs per pattern.
A hard ceiling at N ≈ 13,500 for these patterns. new RE2.Set(...) throws Error: RE2.Set could not be compiled. above that threshold. The cause is RE2::Options::max_mem — the default 8 MiB DFA memory limit. Adversarial patterns (no shared structure) blow the budget. For patterns with shared structure (Shape 4's domain blocklist), the ceiling is much higher — Shape 4's 2,000 patterns compile easily and could likely scale to many more.

The ceiling is not algorithmic; it is the default-configuration cap on the DFA's working memory.

What this means for FilteredRE2

The original framing in #232 — "FilteredRE2 greatly improves the performance on large strings when multiple regular expressions should be matched" — is true relative to looping individual RE2 instances. But it is largely already-delivered by RE2.Set, which has shipped since 1.23.0 and which the benchmarks above show is dramatically faster than alternatives across every shape tested.

The remaining theoretical wins for FilteredRE2 over RE2.Set are narrow:

N above RE2.Set's max_mem ceiling. With unique-prefix patterns and the 8 MiB default, that's ~13,500 patterns. With shared-structure patterns it can be much higher. The Aho-Corasick prefilter scales linearly with atom-count and would absorb this case, but at the cost of running surviving patterns through individual RE2 instances (paying the loop cost surfaced in Shapes 1 and 4) for verification.
FirstMatch semantics with selective prefilters. RE2.Set always does a full DFA pass; FilteredRE2 can short-circuit at first prefilter hit. Untested here — would need a fourth bench shape — but the upside is bounded because RE2.Set's pass is already cheap in absolute terms (microseconds, not milliseconds).

A cheaper alternative — `maxMem` on `RE2.Set` (shipped)

The Shape 5 ceiling is a configuration limit, not an algorithmic one. RE2's RE2::Options::set_max_mem(int64_t) controls the DFA memory budget; node-re2 exposes it as a constructor option:

const set = new RE2.Set(patterns, {maxMem: 64 * 1024 * 1024});
set.maxMem; // 67108864

maxMem is a positive integer (bytes). Default is 8 MiB (8 << 20). Raising it pushes the ceiling from ~13,500 random-prefix patterns to ~100,000 or more, absorbing the bulk of FilteredRE2's would-be users with a one-line C++ change instead of an Aho-Corasick implementation.

Decision

FilteredRE2 parked in the project queue. The headline motivation is delivered by RE2.Set; the maxMem option absorbs the remaining niche.
Keep FilteredRE2 documented as a next-resort tool for users beyond the extended-maxMem ceiling. That population is small and largely overlaps with specialized engines like Vectorscan — see discussions/#218 for the broader engine-survey direction.

Reproducing

git clone --recursive https://github.com/uhop/node-re2.git
cd node-re2
npm install
npx nano-bench bench/filtered-rules.mjs
npx nano-bench bench/filtered-many.mjs
npx nano-bench bench/filtered-scale.mjs

Numbers in this page were collected on Node.js 26.1.0, Linux x86-64. Absolute milliseconds will vary with hardware; the relative rankings should be stable on any modern system.

🔍 Search the wiki

node-re2

Installing

Troubleshooting

Developers

Working on this project

Research and notes

Project

Release history

Repository · README · prebuilds via install-artifact-from-github

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

FilteredRE2 evaluation

What FilteredRE2 is

Methodology

Shape 1 — security rules over a MB-scale haystack

Shape 4 — 2,000 patterns over moderate inputs

Shape 5 — scaling curve to find the ceiling

What this means for FilteredRE2

A cheaper alternative — `maxMem` on `RE2.Set` (shipped)

Decision

Reproducing

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

node-re2

Clone this wiki locally

Uh oh!

FilteredRE2 evaluation

What FilteredRE2 is

Methodology

Shape 1 — security rules over a MB-scale haystack

Shape 4 — 2,000 patterns over moderate inputs

Shape 5 — scaling curve to find the ceiling

What this means for FilteredRE2

A cheaper alternative — maxMem on RE2.Set (shipped)

Decision

Reproducing

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

node-re2

Clone this wiki locally

A cheaper alternative — `maxMem` on `RE2.Set` (shipped)