
Add support for crawler exclude patterns #2319

Merged
merged 3 commits into from Feb 17, 2019

Conversation

2 participants
@fholzer (Contributor) commented Feb 17, 2019

  • I'm making a big change or adding functionality, so I've already opened an issue proposing the change to other contributors and got feedback on the idea before I took the time to write precious code
  • Check that your change/fix has corresponding unit tests (if applicable)
  • Squash commits so it looks sane
  • Update the documentation https://github.com/sitespeedio/sitespeed.io/tree/master/docs in another PR
  • Verify that the tests work by running npm test and check linting by running npm run lint

Description

Add support for crawler exclude patterns. Multiple patterns may be provided. Resolves #1929.

At the moment the patterns are matched against the full URL. Feedback is appreciated. I'll squash and add documentation if you're happy with this PR.

if (!Array.isArray(crawler.exclude)) {
  crawler.exclude = [crawler.exclude];
}
crawler.exclude = crawler.exclude.map(e => new RegExp(e));
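For reference, a self-contained sketch of how this normalization and matching plays out. The helper names here are illustrative, not the actual sitespeed.io internals; the only behavior taken from the PR is that string options become RegExp objects tested against the full URL:

```javascript
// Normalize a single string or an array of strings into RegExp objects,
// mirroring the snippet above (illustrative helper, not sitespeed.io code).
function compileExcludes(exclude) {
  const patterns = Array.isArray(exclude) ? exclude : [exclude];
  return patterns.map(e => new RegExp(e));
}

// A URL is crawled only if no exclude pattern matches the full URL.
function shouldCrawl(url, excludes) {
  return !excludes.some(re => re.test(url));
}

const excludes = compileExcludes(['/logout', '\\.pdf$']);
console.log(shouldCrawl('https://example.com/about', excludes));        // true
console.log(shouldCrawl('https://example.com/logout?next=/', excludes)); // false
```

Note that because the patterns are regular expressions rather than globs, characters like `.` and `?` in a URL pattern need escaping.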

@fholzer (Author) commented Feb 17, 2019

Wasn't sure where to put this. At least here it would fail early in case the user provided an invalid regular expression.
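The fail-early behavior comes from the RegExp constructor itself, which throws a SyntaxError for an invalid pattern, so mapping user input to RegExp objects at startup surfaces bad input immediately instead of mid-crawl. A quick illustration (hypothetical helper name):

```javascript
// new RegExp(p) throws a SyntaxError if p is not a valid pattern,
// which is what makes the eager .map(e => new RegExp(e)) fail early.
function compilePatterns(patterns) {
  return patterns.map(p => new RegExp(p));
}

try {
  compilePatterns(['[unclosed']); // invalid: unterminated character class
} catch (err) {
  console.log(err instanceof SyntaxError); // true
}
```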

@@ -68,7 +89,7 @@ module.exports = {

   if (pageCount >= maxPages) {
     log.info('Crawler stopped after %d urls', pageCount);
-    crawler.stop();
+    crawler.stop(true);

@fholzer (Author) commented Feb 17, 2019

Even before my changes, when specifying e.g. --crawler.maxPages 5, most of the time I would get 5 pages in the report, but sometimes I'd get 6. I guess that's because the crawler had already discovered the 6th URL by the time crawler.stop() was called. I hope adding true will fix that. See https://github.com/simplecrawler/simplecrawler/blob/be994f01bc2b055aabb192290babe647f7a08ca1/lib/crawler.js#L1829

If that's not good enough, an additional pageCount check needs to be added at the top of the crawler.on('fetchcomplete', ...) callback, I guess.

@soulgalore (Member) commented Feb 17, 2019

Hi @fholzer, wow, so nice that you spent time fixing this! I haven't been using the crawler much the last 3-4 years, so I haven't given it much love.

Thanks for the PR!

@soulgalore soulgalore merged commit 3c5ccc3 into sitespeedio:master Feb 17, 2019

1 check was pending

continuous-integration/travis-ci/pr The Travis CI build is in progress

@fholzer fholzer deleted the fholzer:fix-1929 branch Feb 17, 2019
