This repository has been archived by the owner on Mar 7, 2021. It is now read-only.
Oldest unfetched item duplicates #443
Comments
kbychkov pushed a commit to kbychkov/simplecrawler that referenced this issue on Dec 7, 2018
kbychkov pushed a commit to kbychkov/simplecrawler that referenced this issue on Dec 20, 2018
kbychkov pushed a commit to kbychkov/simplecrawler that referenced this issue on Dec 21, 2018
kbychkov pushed a commit to kbychkov/simplecrawler that referenced this issue on Dec 24, 2018
kbychkov pushed a commit to kbychkov/simplecrawler that referenced this issue on Jan 17, 2019
kbychkov pushed a commit to kbychkov/simplecrawler that referenced this issue on Feb 20, 2019
Hello!
First of all, thank you for the useful library. I'm working on a MongoDB queue implementation for Simplecrawler and ran into an issue related to DB performance, specifically the update operation. That part isn't a Simplecrawler problem, but while investigating it I found a weak point in the crawler itself, and I think the two should be addressed together. The problem is as follows.

Simplecrawler calls the `oldestUnfetchedItem` function every 250 ms, as defined by the `crawler.interval` property. It then calls `fetchQueueItem`, which updates the status of the queue item to `spooled`. This is probably the weak point: if for some reason the update takes longer than 250 ms, the next call to `oldestUnfetchedItem` returns the same queue item. I've added some log statements to show what happens.

This is not an issue for your queue implementation because it is memory based. To emulate the behaviour, I've wrapped the update in `setTimeout`. The full example is available in the oldestUnfetchedItem branch of the fork.

As a solution, I'd suggest adding a new hidden property, `_spooled` for example. When the oldest unfetched item is retrieved, we check whether it already exists in the `_spooled` array, and we remove it from the array once the status update completes. Perhaps you see another way.
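
The race and the suggested guard can be illustrated with a small, self-contained sketch. This is not Simplecrawler's code and not the example from the fork: `SlowQueue`, `tick`, `run` and `flush` are hypothetical names made up for illustration, the slow database update is modelled by deferring writes until `flush()` is called instead of using real timers, and a `Set` stands in for the proposed `_spooled` array.

```javascript
// Hypothetical stand-in for a slow, database-backed queue (not the
// Simplecrawler queue API). Status updates are deferred until flush()
// runs, which models an update that outlasts crawler.interval.
class SlowQueue {
    constructor(urls) {
        this.items = urls.map((url, id) => ({ id, url, status: "queued" }));
        this.pendingUpdates = [];
    }

    // Returns the first item whose stored status is still "queued".
    oldestUnfetchedItem() {
        return this.items.find(item => item.status === "queued") || null;
    }

    // Defers the write; onDone runs once the write has landed.
    update(id, status, onDone) {
        this.pendingUpdates.push(() => {
            this.items[id].status = status;
            onDone();
        });
    }

    // Applies every deferred write, i.e. "the database finally answers".
    flush() {
        const updates = this.pendingUpdates;
        this.pendingUpdates = [];
        updates.forEach(apply => apply());
    }
}

// One crawler tick: take the oldest unfetched item and mark it as spooled.
// If `spooled` (a Set of in-flight item ids) is passed, an item whose
// status update is still pending is skipped instead of fetched again.
function tick(queue, fetched, spooled) {
    const item = queue.oldestUnfetchedItem();
    if (!item) return;
    if (spooled) {
        if (spooled.has(item.id)) return;
        spooled.add(item.id);
    }
    fetched.push(item.url);
    queue.update(item.id, "spooled", () => {
        if (spooled) spooled.delete(item.id);
    });
}

// Runs two ticks per slow update, with or without the guard.
function run(useGuard) {
    const queue = new SlowQueue(["/a", "/b"]);
    const fetched = [];
    const spooled = useGuard ? new Set() : null;
    tick(queue, fetched, spooled);
    tick(queue, fetched, spooled); // fires before the first update lands
    queue.flush();
    tick(queue, fetched, spooled);
    tick(queue, fetched, spooled);
    queue.flush();
    return fetched;
}

const withoutGuard = run(false); // ["/a", "/a", "/b", "/b"]
const withGuard = run(true);     // ["/a", "/b"]
console.log(withoutGuard, withGuard);
```

In this sketch the guard simply skips a tick while an update is in flight, so a slow update costs one idle interval rather than a duplicate fetch. Whether the crawler should instead look for the next unfetched item in that case is a separate design choice.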