Drop simplecrawler in favour of an in-tree fetch-based crawler by soulgalore · Pull Request #4745 · sitespeedio/sitespeed.io

soulgalore · 2026-05-16T12:16:47Z

simplecrawler@1.1.9 hasn't been updated since 2020 and pulls in its
own URL parser, async helper, robots parser and iconv-lite — about a
thousand lines of transitive code for a feature we only use ~10% of:
BFS to a depth cap, maxPages limit, include/exclude regex, image/PDF
skip, content-type filter, robots.txt, and cookies/basicAuth/userAgent
passthrough.

The new lib/plugins/crawler/crawl.js does all of that in ~285 lines
on top of Node's built-in fetch and a hand-rolled robots.txt parser, with no new
dependency. Parity was verified against simplecrawler across four
real-site runs (sitespeed.io/, /blog/, /documentation/sitespeed.io/,
/sitespeed.io-40.0/) at different depths and page caps — the
discovered URL sets matched in every case, and the natural-completion
run finished ~6x faster (1.4s vs 8.3s).
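
To make "hand-rolled robots.txt parser" concrete, here is a minimal sketch of what such a parser can look like. It is not the code in crawl.js: the names parseRobots and isAllowed are hypothetical, and the sketch assumes plain prefix rules only (no wildcards, no Crawl-delay), with the longest matching rule winning.

```javascript
// Hypothetical helper: turn robots.txt text into groups of
// { agents: [...], rules: [{ allow, path }] }.
function parseRobots(text) {
  const groups = [];
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue; // blank or malformed line
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((field === 'allow' || field === 'disallow') && current) {
      // An empty value (e.g. "Disallow:") restricts nothing, so skip it.
      if (value !== '') {
        current.rules.push({ allow: field === 'allow', path: value });
      }
      lastWasAgent = false;
    } else {
      lastWasAgent = false; // Sitemap, Crawl-delay, etc. are ignored here
    }
  }
  return groups;
}

// Hypothetical helper: decide whether pathname may be fetched by userAgent.
function isAllowed(groups, userAgent, pathname) {
  const ua = userAgent.toLowerCase();
  // Prefer a group that names this agent; fall back to the "*" group.
  const group =
    groups.find(g => g.agents.some(a => a && a !== '*' && ua.includes(a))) ||
    groups.find(g => g.agents.includes('*'));
  if (!group) return true;
  let best = null;
  for (const rule of group.rules) {
    if (!pathname.startsWith(rule.path)) continue;
    // Longest matching path wins; on a tie, Allow wins.
    if (
      !best ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.allow)
    ) {
      best = rule;
    }
  }
  return best ? best.allow : true;
}
```

A crawler would fetch /robots.txt once per origin, parse it, and call isAllowed before queueing each URL, skipping the check entirely when --crawler.ignoreRobotsTxt is set.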

Behaviour kept the same: same BFS semantics, depth-counted from the
start URL, same skipped extensions, same regex include/exclude rules,
respect robots.txt by default (--crawler.ignoreRobotsTxt opts out),
crawl pinned to whatever origin the start URL lands on after
redirects. The crawl is strictly serial — there's no real-world page
count where parallelism matters here, and serial is much simpler to
reason about. Easy to add later if it ever does.
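
The traversal described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of crawl.js: crawl and fetchPage are hypothetical names, fetchPage is injected so the sketch runs without network access, and the start URL is assumed to count as depth 1. Include/exclude regexes, extension skipping and robots.txt checks are left out to keep the focus on BFS order, the depth and page caps, and origin pinning.

```javascript
// Serial breadth-first crawl sketch.
// fetchPage(url) is a hypothetical injected function that resolves to
// { finalUrl, links }: the URL after redirects and the hrefs found on the page.
async function crawl(startUrl, { maxDepth = 1, maxPages = Infinity, fetchPage }) {
  const start = new URL(startUrl).href; // normalise before de-duplicating
  const queue = [{ url: start, depth: 1 }];
  const seen = new Set([start]);
  const pages = [];
  let origin = null;

  while (queue.length > 0 && pages.length < maxPages) {
    const { url, depth } = queue.shift(); // FIFO queue gives breadth-first order
    const { finalUrl, links } = await fetchPage(url); // one request at a time

    // Pin the crawl to the origin the start URL lands on after redirects.
    if (origin === null) origin = new URL(finalUrl).origin;
    if (new URL(finalUrl).origin !== origin) continue;

    pages.push(finalUrl);
    seen.add(new URL(finalUrl).href);
    if (depth >= maxDepth) continue; // depth cap: do not follow links further

    for (const href of links) {
      let next;
      try {
        next = new URL(href, finalUrl); // resolve relative links
      } catch {
        continue; // skip hrefs that are not valid URLs
      }
      next.hash = ''; // fragments point at the same page
      if (next.origin !== origin || seen.has(next.href)) continue;
      seen.add(next.href);
      queue.push({ url: next.href, depth: depth + 1 });
    }
  }
  return pages;
}
```

Because each fetch is awaited before the next one starts, the visit order is deterministic, which is what makes comparing discovered URL sets between two crawler implementations straightforward.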

Co-authored-by: Claude <noreply@anthropic.com>

soulgalore force-pushed the crawler-modern branch from 22cb071 to 9483421 on May 16, 2026 12:48

soulgalore merged commit cc21f93 into main May 16, 2026
15 checks passed

soulgalore deleted the crawler-modern branch May 16, 2026 14:07
