This repository has been archived by the owner on Mar 7, 2021. It is now read-only.

Oldest unfetched item duplicates #443

Closed
kbychkov opened this issue Nov 13, 2018 · 0 comments · Fixed by #450

Comments

@kbychkov
Contributor

Hello!

First of all, thank you for the useful library. I'm working on a MongoDB queue implementation for Simplecrawler and ran into an issue with DB performance, specifically with the update operation. That isn't a problem in Simplecrawler itself, but while investigating it I found a weak point in the crawler, and I think the matter should be resolved comprehensively. The problem is as follows.

Simplecrawler calls the oldestUnfetchedItem function every 250ms, as defined by the crawler.interval property. It then calls fetchQueueItem, which updates the status of the queue item to spooled. This is probably the weak point: if for some reason the update takes longer than 250ms, the next call to oldestUnfetchedItem returns the same queue item. I've added some log statements to show what happens.

[Image: oldestunfetcheditem]

This is not an issue for your queue implementation because it is memory-based. To emulate the behaviour, I wrapped update in setTimeout. The full example is available in the oldestUnfetchedItem branch of the fork.
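A minimal sketch of that emulation follows. It is not the code from the fork: `SlowQueue`, `tick`, the example URLs, and the 400ms delay are all hypothetical names and values chosen for illustration, and the two back-to-back `tick()` calls stand in for two crawler intervals that fire before a slow update has finished.

```javascript
// Toy in-memory queue (hypothetical; not simplecrawler's own queue class).
class SlowQueue {
    constructor(urls, updateDelayMs) {
        this.items = urls.map((url, id) => ({ id, url, status: "queued" }));
        this.updateDelayMs = updateDelayMs;
    }

    // Hands the first item still marked "queued" (or null) to the callback.
    oldestUnfetchedItem(callback) {
        const item = this.items.find((i) => i.status === "queued") || null;
        callback(null, item);
    }

    // Applies the update only after updateDelayMs, like a slow database write.
    update(id, updates, callback) {
        setTimeout(() => {
            Object.assign(this.items[id], updates);
            callback(null, this.items[id]);
        }, this.updateDelayMs);
    }
}

const queue = new SlowQueue(["http://example.com/a", "http://example.com/b"], 400);
const picked = [];

// One crawler tick: pick the oldest unfetched item, then mark it as spooled.
function tick() {
    queue.oldestUnfetchedItem((err, item) => {
        if (err || !item) return;
        picked.push(item.id);
        queue.update(item.id, { status: "spooled" }, () => {});
    });
}

tick(); // picks item 0; its status update is still pending
tick(); // runs before the update lands, so item 0 is picked again
console.log(picked); // [ 0, 0 ]
```

The second entry in `picked` is the duplicate: the item is still `queued` when the next tick asks for the oldest unfetched item.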

As a solution, I'd suggest adding a new hidden property, _spooled for example. When we get the oldest unfetched item, we check whether it already exists in the _spooled array, and we remove it from the array once the status update completes. Perhaps you see another way.
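The suggestion could be sketched roughly as below. This is only one possible reading, not the change that was eventually merged: `DelayedQueue`, `Spooler`, and the 400ms delay are hypothetical, `_spooled` is kept as a `Set` rather than an array for cheaper lookups, and a tick that sees an already-spooled item is simply skipped.

```javascript
// Toy queue with a slow update (hypothetical; not simplecrawler's own queue class).
class DelayedQueue {
    constructor(urls, updateDelayMs) {
        this.items = urls.map((url, id) => ({ id, url, status: "queued" }));
        this.updateDelayMs = updateDelayMs;
    }

    oldestUnfetchedItem(callback) {
        callback(null, this.items.find((i) => i.status === "queued") || null);
    }

    update(id, updates, callback) {
        setTimeout(() => {
            Object.assign(this.items[id], updates);
            callback(null, this.items[id]);
        }, this.updateDelayMs);
    }
}

// Hypothetical stand-in for the crawler side of the fix.
class Spooler {
    constructor(queue) {
        this.queue = queue;
        this._spooled = new Set(); // ids whose status update is still in flight
        this.fetched = [];         // ids handed off for fetching
    }

    tick() {
        this.queue.oldestUnfetchedItem((err, item) => {
            // Skip the tick if the same item is already being spooled.
            if (err || !item || this._spooled.has(item.id)) return;

            this._spooled.add(item.id);
            this.fetched.push(item.id);
            this.queue.update(item.id, { status: "spooled" }, () => {
                // The status is now persisted, so the guard is no longer needed.
                this._spooled.delete(item.id);
            });
        });
    }
}

const spooler = new Spooler(new DelayedQueue(["http://example.com/a"], 400));
spooler.tick(); // spools item 0
spooler.tick(); // item 0 is still in _spooled, so this tick does nothing
console.log(spooler.fetched); // [ 0 ]
```

One trade-off of this reading is that ticks are wasted while an update is in flight, since the queue keeps returning the same item until its status changes. The crawler no longer fetches duplicates, but throughput is bounded by the update latency.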
