
addFetchCondition and addDownloadCondition not working correctly anymore #360

Closed

JessyCat92 opened this issue Mar 16, 2017 · 6 comments

@JessyCat92 commented Mar 16, 2017

What happened?

I just updated to version 1.1.0 and also changed addFetchCondition and addDownloadCondition to work asynchronously (I also tried them synchronously; same behaviour). If I use any download condition, the crawler gets stuck at the download stage, and if I use only a fetch condition, it skips linked documents.

What should have happened?

It should work like it did in 1.0.3.

Steps to reproduce the problem

I think it's best if I explain what we are doing. We have a test HTML page containing a link to an XML page. In version 1.0.3 everything was crawled if I simply used the following (this code was only to isolate the problem; normally we have regexp logic there):

crawler.addFetchCondition(function (queueItem, referrerQueueItem) {
    // in version 1.1.0 I changed this to callback(true)
    return true;
});

but after adding the fetch condition, it only crawls the HTML page and completely ignores the link to the XML file (so crawler.on("fetchstart", function (queueItem) { console.log(queueItem.url); }); is no longer executed for the XML file either).

When using a download condition instead, I added

crawler.addDownloadCondition(function (queueItem, referrerQueueItem) {
    // in version 1.1.0 I changed this to callback(true)
    return true;
});

and after the update to 1.1.0 we only get fetchstart console output for the HTML file; after that output, nothing happens anymore (so no fetchcomplete).

@choerl commented Mar 16, 2017

Behavior is reproducible

@fredrikekelund (Collaborator)

Thanks for reporting an issue, @geramy92! This looks to have been an issue in the documentation. The callbacks use the Node.js convention of taking any potential error as the first argument and the result as the second.

Could you try to change your code to callback(null, true) and see if it works as expected?
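
Something like this, roughly (a minimal sketch; I'm assuming the async form where the callback is passed as the third argument to the condition function):

crawler.addFetchCondition(function (queueItem, referrerQueueItem, callback) {
    // Error-first callback: null for the error, then the boolean result
    callback(null, true);
});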

@JessyCat92 (Author)

Yes, with this variant it works.
So the documentation and examples are incorrect. But I think that if the first parameter is an error (per the usual Node.js convention), it should be handled, or at least surfaced somehow, instead of silently running into issues like this.

@fredrikekelund (Collaborator)

You're absolutely right that the documentation was faulty; I've already updated it to keep others from running into the same issue.

We've added two new events to deal with potential errors from fetch conditions and download conditions: fetchconditionerror and downloadconditionerror. If an error is encountered, simplecrawler will treat it the same as the condition having returned false (i.e. it won't add the item to the queue or download the resource). Does that make sense?
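
As a rough sketch, you could listen for them like this (I'm assuming each event passes the queue item and the error, in that order):

crawler.on("fetchconditionerror", function (queueItem, error) {
    // Treated as if the condition returned false: the item isn't queued
    console.error("Fetch condition error for " + queueItem.url, error);
});

crawler.on("downloadconditionerror", function (queueItem, error) {
    // Treated as if the condition returned false: the body isn't downloaded
    console.error("Download condition error for " + queueItem.url, error);
});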

@JessyCat92 (Author)

Yes, I think that's a good way to solve it.

@fredrikekelund (Collaborator)

Cool! Let me know if you run into any other issues around this.
