Add support for crawler exclude patterns #2319

Merged: 3 commits, Feb 17, 2019
+39 −4

@@ -478,6 +478,11 @@ module.exports.parseCommandLine = function parseCommandLine() {
describe: 'The max number of pages to test. Default is no limit.',
group: 'Crawler'
})
.option('crawler.exclude', {
describe:
'Exclude URLs matching the provided regular expression (ex: "/some/path/", "://some\\.domain/"). Can be provided multiple times.',
group: 'Crawler'
})
/**
Grafana cli option
*/
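
For context, a sketch of why the coerce step in the next hunk normalises to an array (assuming stock yargs dot-notation handling; the invocation below is hypothetical and not part of the diff):

const yargs = require('yargs');

// Used once, the flag arrives as a plain string...
const single = yargs.parse(['--crawler.exclude', '/logout/']);
console.log(single.crawler.exclude); // '/logout/'

// ...used twice, it arrives as an array of strings.
const repeated = yargs.parse([
  '--crawler.exclude', '/logout/',
  '--crawler.exclude', '://ads\\.example\\.com/'
]);
console.log(repeated.crawler.exclude); // ['/logout/', '://ads\\.example\\.com/']
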
@@ -983,6 +988,15 @@ module.exports.parseCommandLine = function parseCommandLine() {
);
}
})
.coerce('crawler', crawler => {
if (crawler.exclude) {
if (!Array.isArray(crawler.exclude)) {
crawler.exclude = [crawler.exclude];
}
crawler.exclude = crawler.exclude.map(e => new RegExp(e));

fholzer (Author, Contributor) commented on Feb 17, 2019:
Wasn't sure where to put this. At least here it would fail early in case the user provided an invalid regular expression.
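
For illustration (not part of the diff): compiling the pattern here means an invalid expression throws straight away, so yargs reports it while the arguments are being parsed rather than mid-crawl.

try {
  new RegExp('://some\\.domain/'); // valid, compiles fine
  new RegExp('[unterminated');     // throws SyntaxError: invalid regular expression
} catch (err) {
  console.error('Invalid --crawler.exclude pattern:', err.message);
}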

}
return crawler;
})
.coerce('plugins', plugins => {
if (plugins.add && !Array.isArray(plugins.add)) {
plugins.add = plugins.add.split(',');
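
To sketch what the compiled exclude patterns do once they reach the crawler module below (the URLs are made up for illustration, not taken from the PR):

const exclude = ['/logout/', '://ads\\.example\\.com/'].map(e => new RegExp(e));
const skip = url => exclude.some(e => e.test(url));

console.log(skip('https://www.example.com/logout/'));     // true, matches /logout/
console.log(skip('https://ads.example.com/banner.html')); // true, matches ://ads\.example\.com/
console.log(skip('https://www.example.com/products/'));   // false, crawled as usual
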
@@ -36,10 +36,27 @@ module.exports = {
crawler.downloadUnsupported = false;
crawler.allowInitialDomainChange = true;
crawler.parseHTMLComments = false;
crawler.addFetchCondition(function(parsedURL) {
const extension = path.extname(parsedURL.path);
crawler.addFetchCondition(queueItem => {
const extension = path.extname(queueItem.path);
// Don't try to download these, based on file name.
return ['png', 'jpg', 'gif', 'pdf'].indexOf(extension) === -1;
if (['png', 'jpg', 'gif', 'pdf'].indexOf(extension) !== -1) {
return false;
}

if (this.options.exclude) {
for (let e of this.options.exclude) {
if (e.test(queueItem.url)) {
log.verbose(
'Crawler skipping %s, matches exclude pattern %s',
queueItem.url,
e
);
return false;
}
}
}

return true;
});

if (this.basicAuth) {
@@ -49,6 +66,10 @@ module.exports = {
crawler.authPass = userAndPassword[1];
}

crawler.on('fetchconditionerror', (queueItem, err) => {
log.warn('An error occurred in the fetchCondition callback: %s', err);
});

crawler.on('fetchredirect', (queueItem, parsedURL, response) => {
redirectedUrls.add(response.headers.location);
});
@@ -68,7 +89,7 @@ module.exports = {

if (pageCount >= maxPages) {
log.info('Crawler stopped after %d urls', pageCount);
crawler.stop();
crawler.stop(true);

fholzer (Author, Contributor) commented on Feb 17, 2019:
Even before my changes, when specifying e.g. --crawler.maxPages 5 I would get 5 pages in the report most of the time, but sometimes I'd get 6. I guess that's because the crawler had already discovered the 6th URL by the time crawler.stop() was called. I hope passing true fixes that. See https://github.com/simplecrawler/simplecrawler/blob/be994f01bc2b055aabb192290babe647f7a08ca1/lib/crawler.js#L1829

If that's not good enough, an additional pageCount check would need to be added at the top of the crawler.on('fetchcomplete', ...) callback, I guess.
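
Something along these lines would be a minimal version of that fallback (a sketch only, not part of this PR; pageCount and maxPages are the variables this module already uses):

crawler.on('fetchcomplete', (queueItem, responseBuffer, response) => {
  // Ignore anything fetched after the limit was reached, in case a URL was
  // already queued before crawler.stop(true) took effect.
  if (pageCount >= maxPages) {
    return;
  }
  // ... existing fetchcomplete handling ...
});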

return resolve();
}
} else {