xcull/xcull-benchmark

xcull benchmark: URL deduplication for recon at scale

xcull is a security-aware URL canonicalization engine. In a recon pipeline it sits at the front of the flow: it takes the raw URLs harvested for every asset in scope and reduces them to the working set that scanners, fuzzers, and testers actually process. That one stage sets three things for the whole program: how many assets a worker can process per hour, how much memory each worker costs, and whether a sensitive endpoint survives to be tested at all.

This benchmark measures xcull against the four deduplicators most teams already run (urldedupe, uro, urless, uddup) on one labeled set of 780,200 URLs where the correct answer is known exactly. Every number in this report (throughput, peak RAM, completion time, surface retained, and false merge rate) is measured on that same input, so no metric is compared across different inputs.

Result in one line: xcull is first on completion time, first on throughput, first on peak RAM, and first on false merge rate in the same run.

What this stage has to deliver, in priority order

These are the properties a recon program actually buys when it picks a deduplicator, ordered the way a platform owner weighs them.

| # | Property | Why it decides the program | xcull on D_unified.full |
|---|----------|----------------------------|-------------------------|
| 1 | Completion time | Wall clock per asset on one core | 1.73 s, fastest measured |
| 2 | Throughput (URLs/s) | Sets continuous-monitoring capacity per worker | 451,000 URLs/s, fastest measured |
| 3 | Peak memory | Sets cost per worker and how many run in parallel | 22.6 MB, lowest measured |
| 4 | Attack surface retained | Did the tool keep every distinct endpoint? | 100 %, tied with passthrough only |
| 5 | False merge rate | A wrong merge silently removes an endpoint | 0 %, the only real deduplicator at zero |
| 6 | Streaming | Constant-memory stdin to stdout fits any pipeline | yes (-k / -x) |
| 7 | Reduction ratio | How much redundant scanner work is removed | 85 % fewer lines |
| 8 | CPU efficiency | Single-core cost of the run | 1.74 CPU-seconds |

Properties 1 through 3 are capacity and cost. Properties 4 and 5 are the security questions, and they are what justifies running a deduplicator at all: a tool that merges two genuinely distinct endpoints into one removes the second from every scan that follows. Everything below is built around proving 4 and 5 without giving up 1 through 3.

Headline results: D_unified.full, 780,200 URLs

Same machine, same input, each tool in its documented default mode, pinned to one core, page cache primed, best of five timed trials.

| Metric | xcull | urldedupe | uro | urless | uddup |
|--------|-------|-----------|-----|--------|-------|
| Completion time | 1.73 s | 2.27 s | 7.36 s | 8.83 s | DNF (>600 s) |
| Throughput | 451 k URLs/s | 343 k URLs/s | 106 k URLs/s | 88 k URLs/s | DNF |
| Peak RAM | 22.6 MB | 193.8 MB | 27.6 MB | 40.5 MB | DNF |
| Output lines | 115,764 | 380,650 | 64,667 | 64,138 | DNF |
| Surface retained | 100 % | 100 % (passthrough) | 97.77 % | 96.82 % | DNF |
| False merge rate | 0 % | 0 % (passthrough) | 2.23 % | 3.18 % | DNF |

How to read it:

  • xcull is first on completion time, throughput, and peak RAM, and ties for the lowest false merge rate (0 %), all in the same run. The next-fastest finisher (urldedupe) takes 1.31x as long; the next-smallest RAM footprint (uro) is 1.22x larger, and uro reaches that footprint by deleting whole endpoint classes.
  • urldedupe reaches 0 % false merges only because it barely deduplicates. It removes exact byte duplicates and keeps every value, locale, and session-token variant, so it emits 3.3x xcull's output and uses 8.6x the RAM. It cannot drop a real endpoint because it folds almost nothing. That is a passthrough, not a deduplicator.
  • uro and urless produce a short, tidy list by deleting whole endpoint classes (every JSESSIONID, every TITLE_SLUG, every UUID in uro's case). Those deleted endpoints are exactly the ones a scanner then never sees.
  • uddup does not finish a target this size. Its cost grows with the square of the input and it stops completing past roughly 50,000 URLs.
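The ratios quoted in the list above follow directly from the headline table. As a quick sanity check, the short sketch below recomputes them from the published figures. The dictionary and function names are illustrative only and are not part of the benchmark harness. Because the published completion times are rounded to two decimals, a recomputed throughput can differ from the table in the last digit.

```python
# Figures copied from the headline table (D_unified.full, 780,200 URLs).
TOTAL_URLS = 780_200

completion_s = {"xcull": 1.73, "urldedupe": 2.27, "uro": 7.36, "urless": 8.83}
peak_ram_mb = {"xcull": 22.6, "urldedupe": 193.8, "uro": 27.6, "urless": 40.5}
output_lines = {"xcull": 115_764, "urldedupe": 380_650, "uro": 64_667, "urless": 64_138}


def throughput(tool: str) -> float:
    """URLs processed per second of wall-clock time."""
    return TOTAL_URLS / completion_s[tool]


def ratio(table: dict, tool: str, baseline: str = "xcull") -> float:
    """How many times larger `tool`'s value is than the baseline's."""
    return table[tool] / table[baseline]


def reduction(tool: str) -> float:
    """Fraction of input lines the tool removed."""
    return 1 - output_lines[tool] / TOTAL_URLS


if __name__ == "__main__":
    for tool in completion_s:
        print(f"{tool:10s} {throughput(tool) / 1000:6.1f} k URLs/s")
    print(f"urldedupe time vs xcull:   {ratio(completion_s, 'urldedupe'):.2f}x")
    print(f"uro RAM vs xcull:          {ratio(peak_ram_mb, 'uro'):.2f}x")
    print(f"urldedupe output vs xcull: {ratio(output_lines, 'urldedupe'):.1f}x")
    print(f"urldedupe RAM vs xcull:    {ratio(peak_ram_mb, 'urldedupe'):.1f}x")
    print(f"xcull reduction:           {reduction('xcull'):.0%}")
```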

The gold metric: false merge rate on known ground truth

D_unified.full is generated by harness/synth_gen.py from a fixed set of canonical endpoint groups whose correct groupings are recorded in data/D_unified.truth.json. That means a merge that destroys a group can be counted exactly. False merge rate is the fraction of canonical groups that have zero surviving URLs in the tool's output, so a lower number means fewer endpoints silently removed from scope.

| Tool | Canonical groups | Destroyed | False merge rate | Reads as |
|------|------------------|-----------|------------------|----------|
| xcull | 55,920 | 0 | 0.00 % | preserves 100 % of distinct groups |
| urldedupe | 55,920 | 0 | 0.00 % (passthrough) | keeps 380,650 lines for 55,920 groups, so it folds almost nothing |
| uro | 55,920 | 1,248 | 2.23 % | destroys every JSESSIONID, every TITLE_SLUG, every UUID class |
| urless | 55,920 | 1,777 | 3.18 % | destroys every JSESSIONID and every TITLE_SLUG class |
| uddup | DNF | | | quadratic; does not finish 780k URLs |

xcull reaches 0 % false merges and a real 85 % reduction at the same time, which is the combination the other tools each miss. The per-class detail, including which endpoint classes uro and urless deleted in full, is in BENCHMARK.md Section 4.2 and the CSV raw/synth_eval.csv.
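To make the metric concrete, here is a minimal sketch of how a false merge rate could be computed from a ground-truth grouping. It is not the harness's code. It assumes the truth data can be loaded as a mapping from group ID to member URLs (the actual schema of data/D_unified.truth.json is not shown in this README) and that a tool's output lines are input URLs emitted verbatim. The function name `false_merge_rate` is hypothetical.

```python
from typing import Dict, Iterable, List


def false_merge_rate(truth: Dict[str, List[str]], output_urls: Iterable[str]) -> float:
    """Fraction of canonical groups with no member URL left in the output.

    `truth` maps a group ID to the URLs that belong to it (assumed layout).
    """
    if not truth:
        raise ValueError("truth must contain at least one canonical group")

    # Reverse index: which group does each known URL belong to?
    url_to_group = {url: gid for gid, members in truth.items() for url in members}

    # A group survives if at least one of its members appears in the output.
    surviving = {url_to_group[u] for u in output_urls if u in url_to_group}

    destroyed = len(truth) - len(surviving)
    return destroyed / len(truth)


if __name__ == "__main__":
    # Toy data: three groups, and an output that lost the session-token group.
    toy_truth = {
        "order-1001": ["https://a.example/order/1001"],
        "order-1002": ["https://a.example/order/1002"],
        "session": ["https://a.example/app;jsessionid=abc"],
    }
    toy_output = ["https://a.example/order/1001", "https://a.example/order/1002"]
    print(f"{false_merge_rate(toy_truth, toy_output):.2%}")

    # The published rates follow the same arithmetic: destroyed / total groups.
    print(f"uro:    {1248 / 55920:.2%}")
    print(f"urless: {1777 / 55920:.2%}")
```

A tool that rewrites URLs rather than emitting them verbatim would need a different attribution step, so the lookup above should be read as a simplification.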

What this means for a recon program

  • More assets per worker. The fastest completion time and the lowest RAM in the same run mean a single worker covers more scope per cycle, and more workers fit on the same hardware.
  • Fewer missed findings. xcull is the only deduplicator that keeps every canonical group, including the object-ID endpoints (/order/1001, /order/1002) where broken-object-level-authorization and IDOR bugs live. A lower false merge rate is a direct reduction in endpoints that never get scanned.
  • Large assets complete. Bounded memory and linear time mean a target with hundreds of thousands of URLs still finishes in under two seconds. Of the alternatives measured here, uddup never returns at this size and urldedupe needs 8.6x the memory.

The one trade xcull makes on purpose

xcull is keep-biased. When a URL is ambiguous, for example it carries an object ID, a session token, or an opaque hash, the default keeps it instead of folding it away, because that is where access-control bugs hide. The cost is a larger output than the most aggressive folders produce. The trade is deliberate: a few thousand redundant lines a scanner absorbs in seconds, in exchange for not silently dropping a testable endpoint. Teams that want a smaller list can fold object IDs with -F. Every number in this report is the shipping default, and the per-class data is published unedited under raw/.
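The difference between keeping and folding object IDs can be shown with a toy example. The sketch below is not xcull's algorithm and does not model the behavior of -F. The functions `keep_ids_key`, `fold_ids_key`, and `dedupe` are hypothetical names invented for illustration. It only shows why a key that replaces numeric path segments with a placeholder produces a shorter list while dropping a distinct object endpoint.

```python
from typing import Callable, Hashable, Iterable, List
from urllib.parse import urlsplit


def keep_ids_key(url: str) -> Hashable:
    """Keep-biased key: host and path are compared exactly as given."""
    parts = urlsplit(url)
    return (parts.netloc, parts.path)


def fold_ids_key(url: str) -> Hashable:
    """Folding key: every all-digit path segment collapses to '{id}'."""
    parts = urlsplit(url)
    segments = ["{id}" if s.isdigit() else s for s in parts.path.split("/")]
    return (parts.netloc, "/".join(segments))


def dedupe(urls: Iterable[str], key_fn: Callable[[str], Hashable]) -> List[str]:
    """Return the first URL seen for each distinct key, in input order."""
    seen = set()
    kept = []
    for url in urls:
        key = key_fn(url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept


if __name__ == "__main__":
    urls = [
        "https://shop.example/order/1001",
        "https://shop.example/order/1002",
        "https://shop.example/order/1001",  # exact duplicate
    ]
    print(dedupe(urls, keep_ids_key))  # both order endpoints are kept
    print(dedupe(urls, fold_ids_key))  # /order/1002 is folded away
```

With the keep-biased key, only the exact duplicate is removed. With the folding key, /order/1002 never reaches the scanner, which is the kind of loss the default is designed to avoid.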

How to trust these numbers

  • BENCHMARK.md: the full report. How each tool was run and measured, the labeled URLs, per-class quality, the trade-offs stated plainly, and the reproduce recipe.
  • AUDIT.md: a per-line security audit of xcull's most aggressive id-folding mode. Every removed URL is classified by hand to confirm it removed redundancy, not surface. The shipping default removes a strict subset of those lines, so the finding carries over.
  • COMPARISON.md: a 99-row side-by-side demo of the kinds of differences a recon engineer notices at a glance (object IDs, session tokens, slug folding, query keyset merges). Not a benchmark, just a quick visual contrast against the four baselines.
  • raw/: the underlying measurement data. raw/trials.csv is the per-trial cost detail; raw/synth_prf.csv and raw/synth_eval.csv are the per-class quality detail. raw/outputs/ holds each tool's full output so the quality numbers can be recomputed without re-running the tools.

The URLs are generated deterministically (random seed fixed) and checksummed in raw/datasets.csv, so every number here is reproducible by running harness/synth_gen.py and then harness/bench.sh.
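Before comparing results, it helps to confirm that a regenerated URL set is byte-identical to the one measured here. The sketch below shows one way to do that check. It assumes a SHA-256 digest, which is a guess: this README does not state which checksum algorithm or column layout raw/datasets.csv uses, so the expected digest is passed in by the caller rather than parsed from that file. The helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large URL sets fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected_hex: str) -> bool:
    """True if the file's digest equals the published one (case-insensitive)."""
    return file_sha256(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    import sys

    # Usage (assumed): python verify.py <path-to-url-set> <expected-hex-digest>
    dataset, expected = Path(sys.argv[1]), sys.argv[2]
    print("OK" if matches_checksum(dataset, expected) else "MISMATCH")
```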

URLs file

| URLs file | URL count | Canonical groups | What it is |
|-----------|-----------|------------------|------------|
| D_unified.full | 780,200 | 55,920 | one labeled known-answer URL set designed to match the shape of a real recon capture (heavy templated bulk + long-tail distinct endpoints + small enumerable IDOR surface) |

This is the only set of URLs used. The previous benchmark mixed a real Wayback capture (for cost/reach without ground truth) with a smaller controlled set (for false merge rate with ground truth). That meant the false merge rate row came from a different input than the throughput row, which made the report harder to verify by hand. D_unified.full supports every metric from one input.

License

AGPL-3.0. See LICENSE.
