Drop simplecrawler in favour of an in-tree fetch-based crawler #4745
Merged
simplecrawler@1.1.9 hasn't been updated since 2020 and pulls in its
own URL parser, async helper, robots parser and iconv-lite: about a
thousand lines of transitive code for a library we only use ~10% of.
What we actually use is BFS to a depth cap, a maxPages limit,
include/exclude regex, image/PDF skip, a content-type filter,
robots.txt, and cookies/basicAuth/userAgent passthrough.

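To make the filtering part of that list concrete, here is a minimal sketch of how include/exclude regexes, the image/PDF skip and the content-type filter could fit together. The function names and the extension list are assumptions for illustration, not taken from crawl.js.

```javascript
// Hypothetical skip list: the extensions crawl.js actually skips are not shown in this PR text.
const SKIPPED_EXTENSIONS = /\.(png|jpe?g|gif|webp|svg|ico|pdf)$/i;

/**
 * Decide whether a discovered URL should be queued (hypothetical helper).
 * `include` and `exclude` are optional RegExp objects tested against the full URL.
 * They should not carry the `g` flag, since `test()` is stateful on global regexes.
 */
function shouldVisit(url, { include, exclude } = {}) {
  const { pathname } = new URL(url);
  if (SKIPPED_EXTENSIONS.test(pathname)) return false; // image/PDF skip
  if (exclude && exclude.test(url)) return false; // exclude wins over include
  if (include && !include.test(url)) return false; // must match include when one is given
  return true;
}

/** Content-type filter (hypothetical helper): only HTML responses are parsed for links. */
function isHtml(contentType) {
  return /^text\/html\b/i.test(contentType || '');
}
```

Checking the extension against `pathname` rather than the full URL keeps a query string such as `?v=2` from hiding a `.png` suffix.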
The new lib/plugins/crawler/crawl.js does all of that in ~285 lines
on top of Node's built-in fetch and a hand-rolled robots.txt parser,
with no new dependency. Parity was verified against simplecrawler
across four real-site runs (sitespeed.io/, /blog/,
/documentation/sitespeed.io/, /sitespeed.io-40.0/) at different depths
and page caps. The discovered URL sets matched in every case, and the
natural-completion run finished ~6x faster (1.4s vs 8.3s).

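For readers wondering what a hand-rolled robots.txt parser involves, a rough sketch follows. It is not the parser in crawl.js: `parseRobots` and `isAllowed` are hypothetical names, matching is plain prefix matching with the longest rule winning (Allow wins ties), and `*` / `$` wildcards inside paths are deliberately not handled.

```javascript
/**
 * Parse robots.txt text and return the rules that apply to `userAgent`
 * as an array of { allow: boolean, path: string }. Illustrative only.
 */
function parseRobots(text, userAgent) {
  const ua = userAgent.toLowerCase();
  const groups = []; // each group: { agents: string[], rules: [] }
  let current = null;
  let lastWasAgent = false;

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // strip comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((field === 'allow' || field === 'disallow') && current) {
      // An empty value restricts nothing, so it is skipped.
      if (value !== '') current.rules.push({ allow: field === 'allow', path: value });
      lastWasAgent = false;
    }
    // Other fields (Sitemap, Crawl-delay, ...) are ignored in this sketch.
  }

  // Prefer groups naming this agent; otherwise fall back to the "*" groups.
  const specific = groups.filter((g) => g.agents.some((a) => a && a !== '*' && ua.includes(a)));
  const chosen = specific.length > 0 ? specific : groups.filter((g) => g.agents.includes('*'));
  return chosen.flatMap((g) => g.rules);
}

/** Longest matching prefix wins; on equal length, Allow wins. No match means allowed. */
function isAllowed(rules, path) {
  let best = null;
  for (const rule of rules) {
    if (!path.startsWith(rule.path)) continue;
    const longer = best === null || rule.path.length > best.path.length;
    const tieAllow = best !== null && rule.path.length === best.path.length && rule.allow;
    if (longer || tieAllow) best = rule;
  }
  return best === null ? true : best.allow;
}
```

A crawler would typically fetch `/robots.txt` once from the pinned origin, call `parseRobots` with its user agent, and then consult `isAllowed(rules, pathname)` before each request.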
Behaviour is kept the same: same BFS semantics with depth counted from
the start URL, same skipped extensions, same regex include/exclude
rules, robots.txt respected by default (--crawler.ignoreRobotsTxt opts
out), and the crawl pinned to whatever origin the start URL lands on
after redirects. The crawl is strictly serial: there's no real-world
page count where parallelism matters here, and serial is much simpler
to reason about. Parallelism is easy to add later if that ever changes.

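The traversal described above can be sketched roughly as follows. This is an illustration, not the code in crawl.js: `crawl` and the injected `fetchPage(url)` (resolving to `{ finalUrl, links }`) are hypothetical, and the sketch assumes the start page counts as depth 1.

```javascript
/**
 * Strictly serial BFS with a depth cap and a page cap (illustrative sketch).
 * `fetchPage` is injected so the traversal can be exercised without a network.
 */
async function crawl(startUrl, { maxDepth = 1, maxPages = Infinity, fetchPage }) {
  const seen = new Set([startUrl]);
  const found = [];
  const queue = [{ url: startUrl, depth: 1 }]; // assumption: the start page is depth 1
  let origin = null;

  while (queue.length > 0 && found.length < maxPages) {
    const { url, depth } = queue.shift(); // FIFO order gives breadth-first traversal
    const { finalUrl, links } = await fetchPage(url); // one request at a time

    // Pin the crawl to the origin the start URL lands on after redirects.
    if (origin === null) origin = new URL(finalUrl).origin;
    if (new URL(finalUrl).origin !== origin) continue; // redirected off-site

    found.push(finalUrl);
    if (depth >= maxDepth) continue; // depth cap: do not follow links from this page

    for (const link of links) {
      const next = new URL(link, finalUrl); // resolve relative links
      next.hash = ''; // fragments do not identify separate pages
      if (next.origin !== origin || seen.has(next.href)) continue;
      seen.add(next.href);
      queue.push({ url: next.href, depth: depth + 1 });
    }
  }
  return found;
}
```

Because each iteration awaits a single `fetchPage` call, the visit order is deterministic. Adding parallelism later would mostly mean draining several queue entries per iteration while keeping the `seen` set as the single source of truth.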
Co-authored-by: Claude <noreply@anthropic.com>