xcull is a security-aware URL canonicalization engine. In a recon pipeline it sits at the front of the flow: it takes the raw URLs harvested for every asset in scope and reduces them to the working set that scanners, fuzzers, and testers actually process. That one stage sets three things for the whole program: how many assets a worker can process per hour, how much memory each worker costs, and whether a sensitive endpoint survives to be tested at all.
This benchmark measures xcull against the four deduplicators most teams already
run (urldedupe, uro, urless, uddup) on one labeled set of 780,200
URLs where the correct answer is known exactly. Every number in this
report (throughput, peak RAM, completion time, surface retained, and false
merge rate) is measured on that same input, so no metric is taken from a
different input than any other.
Result in one line: xcull is first on completion time, first on throughput, first on peak RAM, and first on false merge rate in the same run.
These are the properties a recon program actually buys when it picks a deduplicator, ordered the way a platform owner weighs them.
| # | Property | Why it decides the program | xcull on D_unified.full |
|---|---|---|---|
| 1 | Completion time | Wall clock per asset on one core | 1.73 s, fastest measured |
| 2 | Throughput (URLs/s) | Sets continuous-monitoring capacity per worker | 451,000 URLs/s, fastest measured |
| 3 | Peak memory | Sets cost per worker and how many run in parallel | 22.6 MB, lowest measured |
| 4 | Attack surface retained | Did the tool keep every distinct endpoint? | 100 %, tied with passthrough only |
| 5 | False merge rate | A wrong merge silently removes an endpoint | 0 %, the only real deduplicator at zero |
| 6 | Streaming | Constant-memory stdin to stdout fits any pipeline | yes (-k / -x) |
| 7 | Reduction ratio | How much redundant scanner work is removed | 85 % fewer lines |
| 8 | CPU efficiency | Single-core cost of the run | 1.74 CPU-seconds |
Properties 1 through 3 are capacity and cost. Properties 4 and 5 are the security questions, and they are what justifies running a deduplicator at all: a tool that merges two genuinely distinct endpoints into one removes the second from every scan that follows. Everything below is built around proving 4 and 5 without giving up 1 through 3.
Same machine, same input, each tool in its documented default mode, pinned to one core, page cache primed, best of five timed trials.
| Metric | xcull | urldedupe | uro | urless | uddup |
|---|---|---|---|---|---|
| Completion time | 1.73 s | 2.27 s | 7.36 s | 8.83 s | DNF (>600 s) |
| Throughput | 451 k URLs/s | 343 k URLs/s | 106 k URLs/s | 88 k URLs/s | DNF |
| Peak RAM | 22.6 MB | 193.8 MB | 27.6 MB | 40.5 MB | DNF |
| Output lines | 115,764 | 380,650 | 64,667 | 64,138 | DNF |
| Surface retained | 100 % | 100 % (passthrough) | 97.77 % | 96.82 % | DNF |
| False merge rate | 0 % | 0 % (passthrough) | 2.23 % | 3.18 % | DNF |
How to read it:
- xcull is first on completion time, first on throughput, first on peak RAM, and first on false merge rate in the same run. The next-fastest finisher (urldedupe) is 1.31x slower; the next-smallest RAM (uro) is 1.22x heavier and reaches it by deleting whole endpoint classes.
- urldedupe reaches 0 % false merges only because it barely deduplicates. It removes exact byte duplicates and keeps every value, locale, and session-token variant, so it emits 3.3x xcull's output and uses 8.6x the RAM. It cannot drop a real endpoint because it folds almost nothing. That is a passthrough, not a deduplicator.
- uro and urless produce a short, tidy list by deleting whole endpoint classes (every JSESSIONID, every TITLE_SLUG, every UUID in uro's case). Those deleted endpoints are exactly the ones a scanner then never sees.
- uddup does not finish a target this size. Its cost grows with the square of the input and it stops completing past roughly 50,000 URLs.
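The derived figures quoted above (throughput, the "x slower" and "x heavier" ratios, and the reduction ratio) follow from the table by simple arithmetic. The sketch below recomputes them from the published values only; the dictionary layout and function names are illustrative and are not part of the xcull harness. Because the published times are rounded to two decimals, a recomputed throughput can differ from the table by one unit in the last digit.

```python
# Values copied from the results table; the structure is illustrative only.
TOTAL_URLS = 780_200

RESULTS = {
    "xcull":     {"seconds": 1.73, "ram_mb": 22.6,  "lines": 115_764},
    "urldedupe": {"seconds": 2.27, "ram_mb": 193.8, "lines": 380_650},
    "uro":       {"seconds": 7.36, "ram_mb": 27.6,  "lines": 64_667},
    "urless":    {"seconds": 8.83, "ram_mb": 40.5,  "lines": 64_138},
}


def throughput(tool: str) -> float:
    """URLs processed per second: input size divided by completion time."""
    return TOTAL_URLS / RESULTS[tool]["seconds"]


def ratio(metric: str, tool: str, baseline: str = "xcull") -> float:
    """How many times larger `tool` is than `baseline` on one metric."""
    return RESULTS[tool][metric] / RESULTS[baseline][metric]


def reduction(tool: str) -> float:
    """Fraction of input lines removed by the tool."""
    return 1 - RESULTS[tool]["lines"] / TOTAL_URLS


if __name__ == "__main__":
    for name in RESULTS:
        print(f"{name:10s} {throughput(name) / 1000:7.1f} k URLs/s")
    print(f"urldedupe time vs xcull:   {ratio('seconds', 'urldedupe'):.2f}x")
    print(f"uro RAM vs xcull:          {ratio('ram_mb', 'uro'):.2f}x")
    print(f"urldedupe output vs xcull: {ratio('lines', 'urldedupe'):.1f}x")
    print(f"urldedupe RAM vs xcull:    {ratio('ram_mb', 'urldedupe'):.1f}x")
    print(f"xcull reduction:           {reduction('xcull'):.0%}")
```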
D_unified.full is generated by harness/synth_gen.py from a fixed set of
canonical endpoint groups whose correct groupings are recorded in
data/D_unified.truth.json, so a merge that destroys a group can be
counted exactly. False merge rate is the fraction of canonical groups
that have zero surviving URLs in the tool's output, so a lower number
means fewer endpoints silently removed from scope.
| Tool | Canonical groups | Destroyed | False merge rate | Reads as |
|---|---|---|---|---|
| xcull | 55,920 | 0 | 0.00 % | preserves 100 % of distinct groups |
| urldedupe | 55,920 | 0 | 0.00 % (passthrough) | keeps 380,650 lines for 55,920 groups, so it folds almost nothing |
| uro | 55,920 | 1,248 | 2.23 % | destroys every JSESSIONID, every TITLE_SLUG, every UUID class |
| urless | 55,920 | 1,777 | 3.18 % | destroys every JSESSIONID and every TITLE_SLUG class |
| uddup | — | — | DNF | quadratic; does not finish 780k URLs |
xcull reaches 0 % false merges and a real 85 % reduction at the same time,
which is the combination the other tools each miss. The per-class detail,
including which endpoint classes uro and urless deleted in full, is in
BENCHMARK.md Section 4.2 and the CSV
raw/synth_eval.csv.
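The false merge rate defined above can be sketched in a few lines. This is a minimal illustration, not the harness code. It assumes the truth data can be loaded as a mapping from a group identifier to the set of input URLs in that group, and that each tool emits surviving URLs verbatim. The actual layout of data/D_unified.truth.json is not described here, and the function name, group names, and example.test URLs are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def false_merge_rate(
    groups: Dict[str, Set[str]], output_lines: Iterable[str]
) -> Tuple[float, List[str]]:
    """Return (rate, destroyed_group_ids).

    A group counts as destroyed when none of its member URLs appears in
    the tool's output. Assumes output lines are verbatim input URLs.
    """
    if not groups:
        raise ValueError("truth mapping is empty")
    survivors = {line.strip() for line in output_lines if line.strip()}
    destroyed = sorted(
        gid for gid, members in groups.items() if survivors.isdisjoint(members)
    )
    return len(destroyed) / len(groups), destroyed


# Toy truth set: two object-ID endpoints are distinct groups, while the
# two search URLs are variants of one group.
TOY_GROUPS = {
    "order-1001": {"https://example.test/order/1001"},
    "order-1002": {"https://example.test/order/1002"},
    "search": {
        "https://example.test/search?q=a",
        "https://example.test/search?q=b",
    },
}

# A keep-biased output retains one survivor per group.
KEEP_BIASED = [
    "https://example.test/order/1001",
    "https://example.test/order/1002",
    "https://example.test/search?q=a",
]

# An ID-folding output collapses /order/1002 into /order/1001.
ID_FOLDING = [
    "https://example.test/order/1001",
    "https://example.test/search?q=a",
]

if __name__ == "__main__":
    print(false_merge_rate(TOY_GROUPS, KEEP_BIASED))
    print(false_merge_rate(TOY_GROUPS, ID_FOLDING))
```

Under this definition, dropping redundant variants inside a group (the second search URL) costs nothing, while folding /order/1002 into /order/1001 destroys one of three groups.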
- More assets per worker. The fastest completion time and the lowest RAM in the same run mean a single worker covers more scope per cycle, and more workers fit on the same hardware.
- Fewer missed findings. xcull is the only deduplicator that keeps every canonical group, including the object-ID endpoints (/order/1001, /order/1002) where broken object-level authorization and IDOR bugs live. A lower false merge rate is a direct reduction in endpoints that never get scanned.
- Large assets complete. Bounded memory and linear time mean a target with hundreds of thousands of URLs still finishes in under two seconds, where the alternatives either exhaust memory or never return.
xcull is keep-biased. When a URL is ambiguous, for example when it carries
an object ID, a session token, or an opaque hash, the default keeps it
instead of folding it away, because that is where access-control bugs hide.
The cost is a larger output than the most aggressive folders produce
(115,764 lines against roughly 64,000 for uro and urless). The trade is
deliberate: redundant lines that a scanner absorbs quickly, in exchange
for not silently dropping a testable endpoint. Teams that want a smaller
list can fold object IDs with -F. Every number in this report is the
shipping default, and the per-class data is published unedited under
raw/.
- BENCHMARK.md: the full report. How each tool was run and measured, the labeled URLs, per-class quality, the trade-offs stated plainly, and the reproduce recipe.
- AUDIT.md: a per-line security audit of xcull's most aggressive id-folding mode. Every removed URL is classified by hand to confirm it removed redundancy, not surface. The shipping default removes a strict subset of those lines, so the finding carries over.
- COMPARISON.md: a 99-row side-by-side demo of the kinds of differences a recon engineer notices at a glance (object IDs, session tokens, slug folding, query keyset merges). Not a benchmark, just a quick visual contrast against the four baselines.
- raw/: the underlying measurement data. raw/trials.csv is the per-trial cost detail; raw/synth_prf.csv and raw/synth_eval.csv are the per-class quality detail. raw/outputs/ holds each tool's full output so the quality numbers can be recomputed without re-running the tools.
The URLs are generated deterministically (random seed fixed) and
checksummed in raw/datasets.csv, so every number
here is reproducible by running harness/synth_gen.py and then
harness/bench.sh.
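Before re-running the benchmark, it is worth confirming that the regenerated input is byte-identical to the one measured. The sketch below hashes a file and compares the digest with a published value. It assumes SHA-256, which is a guess: the algorithm and the column layout of raw/datasets.csv are not stated here, so the expected digest is passed in by the caller instead of being parsed from the CSV. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large inputs stay cheap."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex: str) -> bool:
    """True when the file's digest equals the published one.

    ASSUMPTION: the published checksum is a hex SHA-256 string.
    """
    return file_sha256(path) == published_hex.strip().lower()
```

If the published checksum uses a different algorithm, swapping `hashlib.sha256` for the matching `hashlib` constructor is the only change needed.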
| URLs file | URL count | Canonical groups | What it is |
|---|---|---|---|
| D_unified.full | 780,200 | 55,920 | one labeled known-answer URL set designed to match the shape of a real recon capture (heavy templated bulk + long-tail distinct endpoints + small enumerable IDOR surface) |
This is the only set of URLs used. The previous benchmark mixed a real Wayback
capture (for cost/reach without ground truth) with a smaller controlled
set (for false merge rate with ground truth). That meant the false merge
rate row came from a different input than the throughput row, which made
the report harder to verify by hand. D_unified.full supports every metric
from one input.
AGPL-3.0. See LICENSE.