Drop simplecrawler in favour of an in-tree fetch-based crawler #4745

Merged
soulgalore merged 1 commit into main from crawler-modern
May 16, 2026
@soulgalore
Member

simplecrawler@1.1.9 hasn't been updated since 2020 and pulls in its
own URL parser, async helper, robots parser and iconv-lite — about a
thousand lines of transitive code for a feature we only use ~10% of:
BFS to a depth cap, maxPages limit, include/exclude regex, image/PDF
skip, content-type filter, robots.txt, and cookies/basicAuth/userAgent
passthrough.
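
For readers who want a concrete picture of the include/exclude regex and image/PDF skip, here is a minimal sketch. The name `shouldCrawl`, the option names and the extension list are assumptions made for illustration; they are not taken from simplecrawler or from the new crawl.js.

```javascript
// Hypothetical URL filter. The extension list and option names are assumptions,
// not the values used in lib/plugins/crawler/crawl.js.
const SKIPPED_EXTENSIONS = new Set([
  '.png', '.jpg', '.jpeg', '.gif', '.svg', '.webp', '.pdf'
]);

function shouldCrawl(url, { include = [], exclude = [] } = {}) {
  const { pathname } = new URL(url);
  const dot = pathname.lastIndexOf('.');
  const extension = dot === -1 ? '' : pathname.slice(dot).toLowerCase();

  // Skip images and PDFs based on the path extension.
  if (SKIPPED_EXTENSIONS.has(extension)) return false;
  // Any matching exclude pattern rejects the URL.
  if (exclude.some(pattern => pattern.test(url))) return false;
  // With no include patterns, everything not excluded passes.
  return include.length === 0 || include.some(pattern => pattern.test(url));
}
```

The patterns are expected to be RegExp objects without the `g` flag, since `test` is stateful on global regexes.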

The new lib/plugins/crawler/crawl.js does all of that in ~285 lines
on top of Node's built-in fetch and a hand-rolled robots.txt parser, with no new
dependency. Parity was verified against simplecrawler across four
real-site runs (sitespeed.io/, /blog/, /documentation/sitespeed.io/,
/sitespeed.io-40.0/) at different depths and page caps — the
discovered URL sets matched in every case, and the natural-completion
run finished ~6x faster (1.4s vs 8.3s).
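
To give a sense of how small a hand-rolled robots.txt parser can be, here is a simplified sketch. It is not the parser from crawl.js: `parseRobots` and `isAllowed` are hypothetical names, only `User-agent`, `Allow` and `Disallow` are handled, paths are matched as plain prefixes (no `*` or `$` wildcards), and rules from the `*` group are merged with rules from any group whose name matches the user agent.

```javascript
// Simplified robots.txt handling; an illustration under the assumptions above,
// not the implementation in lib/plugins/crawler/crawl.js.
function parseRobots(text, userAgent = '*') {
  const agent = userAgent.toLowerCase();
  const rules = [];
  let groupAgents = [];
  let inRules = false; // true once the current group has started listing rules

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const colon = line.indexOf(':');
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();

    if (field === 'user-agent') {
      // A User-agent line that follows rules starts a new group.
      if (inRules) {
        groupAgents = [];
        inRules = false;
      }
      groupAgents.push(value.toLowerCase());
    } else if (field === 'allow' || field === 'disallow') {
      inRules = true;
      const applies = groupAgents.some(
        name => name === '*' || (name !== '' && agent.includes(name))
      );
      // An empty Disallow disallows nothing, so it adds no rule.
      if (applies && value !== '') {
        rules.push({ allow: field === 'allow', path: value });
      }
    }
  }
  return rules;
}

function isAllowed(rules, pathname) {
  let best = null;
  for (const rule of rules) {
    if (!pathname.startsWith(rule.path)) continue;
    // The longest matching path wins; Allow wins a tie.
    const longer = best === null || rule.path.length > best.path.length;
    const tie = best !== null && rule.path.length === best.path.length && rule.allow;
    if (longer || tie) best = rule;
  }
  return best === null ? true : best.allow;
}
```

A caller would fetch `/robots.txt` once, keep the returned rules, and call `isAllowed(rules, new URL(link).pathname)` before queueing each link.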

Behaviour kept the same: same BFS semantics, depth-counted from the
start URL, same skipped extensions, same regex include/exclude rules,
respect robots.txt by default (--crawler.ignoreRobotsTxt opts out),
crawl pinned to whatever origin the start URL lands on after
redirects. The crawl is strictly serial — there's no real-world page
count where parallelism matters here, and serial is much simpler to
reason about. Parallelism would be easy to add later if that changes.
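
The behaviour described above (serial BFS, a depth cap, a page cap, origin pinned after redirects) can be sketched roughly as follows. This is an illustration, not the code in crawl.js: `crawl`, `extractLinks` and the `fetchImpl` option are hypothetical names, links are found with a naive `href` regex, the start URL is counted as depth 1 by assumption, and robots.txt and include/exclude handling are left out.

```javascript
// Naive link extraction with a regex; an assumption made for brevity.
function extractLinks(html, baseUrl) {
  const links = [];
  for (const match of html.matchAll(/href\s*=\s*["']([^"']+)["']/gi)) {
    try {
      const url = new URL(match[1], baseUrl);
      url.hash = ''; // fragments do not identify separate pages
      links.push(url.href);
    } catch {
      // Ignore hrefs that do not parse as URLs.
    }
  }
  return links;
}

// Serial breadth-first crawl. fetchImpl defaults to the global fetch and can
// be replaced with a stub for testing.
async function crawl(startUrl, { depth = 2, maxPages = 10, fetchImpl = fetch } = {}) {
  const queue = [{ url: startUrl, level: 1 }];
  const seen = new Set([startUrl]);
  const pages = [];
  let origin = null;

  while (queue.length > 0 && pages.length < maxPages) {
    const { url, level } = queue.shift();
    const response = await fetchImpl(url); // one request at a time
    const finalUrl = response.url || url;
    // Pin the crawl to the origin the start URL lands on after redirects.
    if (origin === null) origin = new URL(finalUrl).origin;
    seen.add(finalUrl);

    const type = response.headers.get('content-type') || '';
    if (!response.ok || !type.includes('text/html')) continue;
    pages.push(finalUrl);
    if (level >= depth) continue; // depth cap reached, do not follow links

    const html = await response.text();
    for (const link of extractLinks(html, finalUrl)) {
      if (new URL(link).origin !== origin || seen.has(link)) continue;
      seen.add(link);
      queue.push({ url: link, level: level + 1 });
    }
  }
  return pages;
}
```

Because the queue is drained one URL at a time, the order of `pages` is deterministic for a given site, which makes comparing discovered URL sets between two crawlers straightforward.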

Co-authored-by: Claude <noreply@anthropic.com>

@soulgalore soulgalore merged commit cc21f93 into main May 16, 2026
15 checks passed
@soulgalore soulgalore deleted the crawler-modern branch May 16, 2026 14:07