xcull benchmark: URL deduplication for recon at scale

xcull is a security-aware URL canonicalization engine. In a recon pipeline it sits at the front of the flow: it takes the raw URLs harvested for every asset in scope and reduces them to the working set that scanners, fuzzers, and testers actually process. That one stage sets three things for the whole program: how many assets a worker can process per hour, how much memory each worker costs, and whether a sensitive endpoint survives to be tested at all.

This benchmark measures xcull against the four deduplicators most teams already run (urldedupe, uro, urless, uddup) on one labeled set of 780,200 URLs where the correct answer is known exactly. Every number in this report (throughput, peak RAM, completion time, surface retained, and false merge rate) is measured on that same input, so no metric is compared across different inputs.

Result in one line: xcull is first on completion time, first on throughput, first on peak RAM, and first on false merge rate in the same run.

What this stage has to deliver, in priority order

These are the properties a recon program actually buys when it picks a deduplicator, ordered the way a platform owner weighs them.

| # | Property | Why it decides the program | xcull on `D_unified.full` |
|---|---|---|---|
| 1 | Completion time | Wall clock per asset on one core | 1.73 s, fastest measured |
| 2 | Throughput (URLs/s) | Sets continuous-monitoring capacity per worker | 451,000 URLs/s, fastest measured |
| 3 | Peak memory | Sets cost per worker and how many run in parallel | 22.6 MB, lowest measured |
| 4 | Attack surface retained | Did the tool keep every distinct endpoint? | 100 %, tied only with the urldedupe passthrough |
| 5 | False merge rate | A wrong merge silently removes an endpoint | 0 %, the only real deduplicator at zero |
| 6 | Streaming | Constant-memory stdin to stdout fits any pipeline | yes (`-k` / `-x`) |
| 7 | Reduction ratio | How much redundant scanner work is removed | 85 % fewer lines |
| 8 | CPU efficiency | Single-core cost of the run | 1.74 CPU-seconds |

Properties 1 through 3 are capacity and cost. Properties 4 and 5 are the security questions, and they are what justifies running a deduplicator at all: a tool that merges two genuinely distinct endpoints into one removes the second from every scan that follows. Everything below is built around proving 4 and 5 without giving up 1 through 3.

Headline results: `D_unified.full`, 780,200 URLs

Same machine, same input, each tool in its documented default mode, pinned to one core, page cache primed, best of five timed trials.

| Metric | xcull | urldedupe | uro | urless | uddup |
|---|---|---|---|---|---|
| Completion time | 1.73 s | 2.27 s | 7.36 s | 8.83 s | DNF (>600 s) |
| Throughput | 451 k URLs/s | 343 k URLs/s | 106 k URLs/s | 88 k URLs/s | DNF |
| Peak RAM | 22.6 MB | 193.8 MB | 27.6 MB | 40.5 MB | DNF |
| Output lines | 115,764 | 380,650 | 64,667 | 64,138 | DNF |
| Surface retained | 100 % | 100 % (passthrough) | 97.77 % | 96.82 % | DNF |
| False merge rate | 0 % | 0 % (passthrough) | 2.23 % | 3.18 % | DNF |

How to read it:

xcull is first on completion time, first on throughput, first on peak RAM, and first on false merge rate in the same run. The next-fastest finisher (urldedupe) is 1.31x slower; the next-smallest RAM (uro) is 1.22x heavier and reaches it by deleting whole endpoint classes.
urldedupe reaches 0 % false merges only because it barely deduplicates. It removes exact byte duplicates and keeps every value, locale, and session-token variant, so it emits 3.3x xcull's output and uses 8.6x the RAM. It cannot drop a real endpoint because it folds almost nothing. That is a passthrough, not a deduplicator.
uro and urless produce a short, tidy list by deleting whole endpoint classes (every JSESSIONID, every TITLE_SLUG, every UUID in uro's case). Those deleted endpoints are exactly the ones a scanner then never sees.
uddup does not finish a target this size. Its cost grows with the square of the input and it stops completing past roughly 50,000 URLs.
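
The ratios quoted above can be rechecked from the headline table with a few lines of arithmetic. The sketch below copies the published figures into a dictionary (the names `results`, `throughput`, `reduction`, and `ratio` are illustrative, not part of the harness). Because the published times are rounded to two decimals, throughput recomputed this way can differ from the table by about 1 k URLs/s.

```python
# Figures copied from the headline results table; uddup is omitted (DNF).
URLS = 780_200

results = {
    "xcull":     {"seconds": 1.73, "ram_mb": 22.6,  "lines": 115_764},
    "urldedupe": {"seconds": 2.27, "ram_mb": 193.8, "lines": 380_650},
    "uro":       {"seconds": 7.36, "ram_mb": 27.6,  "lines": 64_667},
    "urless":    {"seconds": 8.83, "ram_mb": 40.5,  "lines": 64_138},
}


def throughput(tool: str) -> float:
    """URLs processed per second of wall-clock time."""
    return URLS / results[tool]["seconds"]


def reduction(tool: str) -> float:
    """Fraction of input lines that do not appear in the output."""
    return 1 - results[tool]["lines"] / URLS


def ratio(metric: str, tool: str, baseline: str = "xcull") -> float:
    """How many times larger `tool` is than `baseline` on `metric`."""
    return results[tool][metric] / results[baseline][metric]


if __name__ == "__main__":
    for tool in results:
        print(f"{tool:10s} {throughput(tool) / 1000:6.1f} k URLs/s, "
              f"{reduction(tool):.1%} fewer lines")
    print(f"urldedupe time vs xcull:   {ratio('seconds', 'urldedupe'):.2f}x")
    print(f"uro peak RAM vs xcull:     {ratio('ram_mb', 'uro'):.2f}x")
    print(f"urldedupe lines vs xcull:  {ratio('lines', 'urldedupe'):.1f}x")
    print(f"urldedupe RAM vs xcull:    {ratio('ram_mb', 'urldedupe'):.1f}x")
```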

The gold metric: false merge rate on known ground truth

D_unified.full is generated by harness/synth_gen.py from a fixed set of canonical endpoint groups whose correct groupings are recorded in data/D_unified.truth.json, so a merge that destroys a group can be counted exactly. The false merge rate is the fraction of canonical groups that have no surviving URL in the tool's output; a lower number means fewer endpoints silently removed from scope.

| Tool | Canonical groups | Destroyed | False merge rate | Reads as |
|---|---|---|---|---|
| xcull | 55,920 | 0 | 0.00 % | preserves 100 % of distinct groups |
| urldedupe | 55,920 | 0 | 0.00 % (passthrough) | keeps 380,650 lines for 55,920 groups, so it folds almost nothing |
| uro | 55,920 | 1,248 | 2.23 % | destroys every JSESSIONID, every TITLE_SLUG, every UUID class |
| urless | 55,920 | 1,777 | 3.18 % | destroys every JSESSIONID and every TITLE_SLUG class |
| uddup | n/a | n/a | DNF | quadratic; does not finish 780k URLs |

xcull reaches 0 % false merges and a real 85 % reduction at the same time, which is the combination the other tools each miss. The per-class detail, including which endpoint classes uro and urless deleted in full, is in BENCHMARK.md Section 4.2 and the CSV raw/synth_eval.csv.
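
The false merge rate defined above reduces to a set-disjointness check per canonical group. The following is a minimal sketch under two assumptions that the report does not confirm: that the ground truth can be loaded as a mapping from a group identifier to its member URLs (the real layout of data/D_unified.truth.json may differ), and that a tool emits surviving URLs verbatim from its input. The function name and the example URLs are hypothetical.

```python
from typing import Dict, Iterable, Set


def false_merge_rate(groups: Dict[str, Set[str]], output_urls: Iterable[str]) -> float:
    """Fraction of canonical groups with no surviving member in the output.

    `groups` maps a group id to the set of input URLs that belong to it
    (an assumed shape for the ground truth, not the documented one).
    """
    survivors = set(output_urls)
    destroyed = sum(1 for members in groups.values() if survivors.isdisjoint(members))
    return destroyed / len(groups)


# Toy ground truth: three canonical groups, two URLs each.
groups = {
    "order-1001": {"https://shop.example/order/1001",
                   "https://shop.example/order/1001?utm_source=mail"},
    "order-1002": {"https://shop.example/order/1002",
                   "https://shop.example/order/1002?utm_source=mail"},
    "login":      {"https://shop.example/login",
                   "https://shop.example/login?lang=en"},
}

# A tool that folded /order/1001 and /order/1002 into one line
# leaves the order-1002 group with zero survivors.
output = ["https://shop.example/order/1001", "https://shop.example/login"]

if __name__ == "__main__":
    print(f"false merge rate: {false_merge_rate(groups, output):.2%}")
```

Applied to the counts in the table, the same division gives 1,248 / 55,920 for uro and 1,777 / 55,920 for urless. A tool that rewrites URLs before emitting them would need the membership test done on a normalized form instead of on raw strings.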

What this means for a recon program

More assets per worker. The fastest completion time and the lowest peak RAM in the same run mean a single worker covers more scope per cycle, and more workers fit on the same hardware.
Fewer missed findings. xcull is the only deduplicator that keeps every canonical group, including the object-ID endpoints (/order/1001, /order/1002) where broken object level authorization (BOLA) and IDOR bugs live. A lower false merge rate is a direct reduction in endpoints that never get scanned.
Large assets complete. Bounded memory and linear time mean a target with hundreds of thousands of URLs still finishes in under two seconds (1.73 s for 780,200 URLs here), where uddup never returns and urldedupe needs 8.6x the memory.

The one trade xcull makes on purpose

xcull is keep-biased. When a URL is ambiguous, for example it carries an object ID, a session token, or an opaque hash, the default keeps it instead of folding it away, because that is where access-control bugs hide. The cost is a larger output than the most aggressive folders produce: 115,764 lines on this input against 64,667 for uro. The trade is deliberate: extra lines that a scanner can absorb, in exchange for not silently dropping a testable endpoint. Teams that want a smaller list can fold object IDs with `-F`. Every number in this report is the shipping default, and the per-class data is published unedited under raw/.
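
To make the trade concrete, here is a toy illustration of keep-biased versus aggressive folding. It is not xcull's algorithm, and neither key function reflects the actual rules of any tool in this benchmark; `keep_biased_key`, `aggressive_key`, and `dedupe` are hypothetical names chosen for the sketch. The keep-biased key merges only URLs that differ in query-parameter order, while the aggressive key also replaces all-digit path segments with a placeholder, which is the kind of object-ID folding that merges /order/1001 with /order/1002.

```python
from urllib.parse import parse_qsl, urlsplit


def keep_biased_key(url: str) -> tuple:
    """Toy key: ignore query-parameter order, keep every path segment verbatim."""
    parts = urlsplit(url)
    params = tuple(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return (parts.netloc.lower(), parts.path, params)


def aggressive_key(url: str) -> tuple:
    """Toy key: additionally fold all-digit path segments into a placeholder."""
    host, path, params = keep_biased_key(url)
    folded = "/".join("{id}" if seg.isdigit() else seg for seg in path.split("/"))
    return (host, folded, params)


def dedupe(urls, key):
    """Keep the first URL seen for each key, preserving input order."""
    seen, kept = set(), []
    for url in urls:
        k = key(url)
        if k not in seen:
            seen.add(k)
            kept.append(url)
    return kept


urls = [
    "https://shop.example/order/1001",
    "https://shop.example/order/1002",
    "https://shop.example/search?q=a&page=1",
    "https://shop.example/search?page=1&q=a",
]

if __name__ == "__main__":
    print("keep-biased:", dedupe(urls, keep_biased_key))
    print("aggressive: ", dedupe(urls, aggressive_key))
```

With the keep-biased key both order URLs survive and only the reordered search URL is dropped. With the aggressive key the second order URL disappears, so any access-control test against /order/1002 never runs.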

How to trust these numbers

BENCHMARK.md: the full report. How each tool was run and measured, the labeled URLs, per-class quality, the trade-offs stated plainly, and the reproduce recipe.
AUDIT.md: a per-line security audit of xcull's most aggressive id-folding mode. Every removed URL is classified by hand to confirm it removed redundancy, not surface. The shipping default removes a strict subset of those lines, so the finding carries over.
COMPARISON.md: a 99-row side-by-side demo of the kinds of differences a recon engineer notices at a glance (object IDs, session tokens, slug folding, query keyset merges). Not a benchmark, just a quick visual contrast against the four baselines.
raw/: the underlying measurement data. raw/trials.csv is the per-trial cost detail; raw/synth_prf.csv and raw/synth_eval.csv are the per-class quality detail. raw/outputs/ holds each tool's full output so the quality numbers can be recomputed without re-running the tools.
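
For readers who want to spot-check a timing without the full harness, the procedure described for the headline results (one core, page cache primed, best of five trials) can be approximated as below. This is a sketch under assumptions, not the code in harness/bench.sh: it assumes a POSIX system, feeds the input on stdin, discards stdout, and pins the child only where `os.sched_setaffinity` exists (Linux). The function names are illustrative, and peak RAM is not measured here.

```python
import os
import subprocess
import time


def _pin_to_one_core():
    """Runs in the child before exec; restricts it to one allowed core (Linux only)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {min(os.sched_getaffinity(0))})


def best_of(cmd, input_path, trials=5):
    """Return the fastest wall-clock time in seconds over `trials` runs of `cmd`."""
    # Read the input once so later runs are served from the page cache.
    with open(input_path, "rb") as src:
        while src.read(1 << 20):
            pass

    timings = []
    for _ in range(trials):
        with open(input_path, "rb") as src:
            start = time.perf_counter()
            subprocess.run(
                cmd,
                stdin=src,
                stdout=subprocess.DEVNULL,
                check=True,
                preexec_fn=_pin_to_one_core,  # POSIX only
            )
            timings.append(time.perf_counter() - start)
    return min(timings)
```

A call such as `best_of(["some-tool"], "urls.txt")` returns seconds for the fastest trial; the command and file name are placeholders. The measured time includes process start-up, which matters for tools with a heavy interpreter.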

The URLs are generated deterministically (random seed fixed) and checksummed in raw/datasets.csv, so every number here is reproducible by running harness/synth_gen.py and then harness/bench.sh.
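
Before comparing results, it helps to confirm that the regenerated input matches the published one byte for byte. The sketch below hashes a file in chunks so the digest can be compared with the value recorded in raw/datasets.csv. SHA-256 is an assumption here, since the report does not name the checksum algorithm or the CSV columns; substitute whichever algorithm that file actually uses.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to keep memory flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """True when the file's digest equals the recorded one (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```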

URLs file

| URLs file | URL count | Canonical groups | What it is |
|---|---|---|---|
| `D_unified.full` | 780,200 | 55,920 | one labeled known-answer URL set designed to match the shape of a real recon capture (heavy templated bulk + long-tail distinct endpoints + small enumerable IDOR surface) |

This is the only set of URLs used. The previous benchmark mixed a real Wayback capture (for cost/reach without ground truth) with a smaller controlled set (for false merge rate with ground truth). That meant the false merge rate row came from a different input than the throughput row, which made the report harder to verify by hand. D_unified.full supports every metric from one input.

License

AGPL-3.0. See LICENSE.
