Message loss in spider feed #109

Open · sibiryakov opened this issue Feb 9, 2016 · 2 comments

sibiryakov (Member) commented Feb 9, 2016

From https://github.com/scrapinghub/distributed-frontera/issues/24#issuecomment-181386301:

Another issue I noticed recently is that my DW keeps pushing to all partitions even though no spider is running. When I then start my spiders, they wait until the DW has pushed a new batch (even though it pushed multiple times before that). This means that running the DW for a while without any spider running depletes the queue until there is nothing left to crawl. I have to add new seeds (not the same seed URLs, since duplicates get dropped) for the spiders to start again and for the DW to push new requests.


sibiryakov (Member) commented Feb 9, 2016

@lljrsr This is mostly done because there is no clear way to control the consumption rate on the producer side in ZeroMQ. Moreover, the contents of the queue on the ZeroMQ side, and the way it applies the high-water mark and starts dropping messages, are implicit; the client application has no way to track any of this.
We could partially solve this problem by disabling new batch generation when there are no heartbeats from spiders. Currently, spiders send their offsets once per get_next_requests call, so we could try disabling new batch generation until the DBW gets the first offset. At the moment, all partitions are marked ready right after the DBW starts.
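A minimal sketch of what that gating could look like, assuming the offsets spiders already send are treated as heartbeats. This is hypothetical, not Frontera's actual DB worker code; the `PartitionReadiness` class and the `heartbeat_timeout` parameter are assumptions for illustration:

```python
import time


class PartitionReadiness(object):
    """Tracks which partitions have sent an offset 'heartbeat' recently.

    Hypothetical sketch: instead of marking all partitions ready at DBW
    start, a partition becomes ready only after its spider reports an
    offset, and goes stale if no offset arrives within the timeout.
    """

    def __init__(self, partitions, heartbeat_timeout=60.0):
        self.heartbeat_timeout = heartbeat_timeout
        # No partition is ready until its spider reports an offset.
        self.last_offset_seen = {p: None for p in partitions}

    def on_offset(self, partition_id, offset):
        # Called whenever a spider sends its offset (once per
        # get_next_requests call, per the comment above).
        self.last_offset_seen[partition_id] = time.time()

    def ready_partitions(self):
        now = time.time()
        return [
            p for p, seen in self.last_offset_seen.items()
            if seen is not None and now - seen < self.heartbeat_timeout
        ]


# In the DBW's batch loop, batches would then only be generated for
# partitions that are actually being consumed, e.g.:
#
#   for partition in readiness.ready_partitions():
#       generate_batch(partition)
```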


lljrsr (Contributor) commented Feb 9, 2016

Yes, I think that would be a good idea.
It may increase the startup time for spiders, but it would make Frontera easier to use because this issue would no longer exist.

UPDATE: Actually, I think a better approach would be to delete requests from the queue only once a spider has actually looked at them. The DW would then push indefinitely (even when it doesn't need to), and some duplicate requests might get pushed to ZMQ, but at least there would be no actual message loss (see the sketch below).
You could add the heartbeat feature in addition. But it is much more important that every request actually gets executed at least once.
A polling approach could also be nice: the DW checks frequently for spider availability and pushes only when spiders explicitly request it.
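A sketch of that ack-before-delete idea, i.e. at-least-once delivery: requests stay in the persistent queue until a spider acknowledges them, so messages dropped by ZMQ at the high-water mark cause duplicates rather than losses. All names here (`backend_queue`, `producer`, `visibility_timeout`, the `AckingDispatcher` class itself) are assumptions for illustration, not Frontera's API:

```python
import time


class AckingDispatcher(object):
    """Hypothetical dispatcher: delete a request from the backend queue
    only after a spider has acknowledged processing it."""

    def __init__(self, backend_queue, visibility_timeout=300.0):
        self.queue = backend_queue
        self.visibility_timeout = visibility_timeout
        self.in_flight = {}  # request_id -> time it was pushed

    def push_batch(self, producer, batch):
        for request in batch:
            producer.send(request)  # may be silently dropped by ZMQ at HWM
            self.in_flight[request.id] = time.time()

    def on_ack(self, request_id):
        # The spider has actually processed the request: only now is it
        # safe to delete it from the persistent queue.
        self.in_flight.pop(request_id, None)
        self.queue.delete(request_id)

    def requeue_expired(self):
        # Anything not acked within the timeout is pushed again later;
        # duplicates are possible, message loss is not.
        now = time.time()
        for request_id, pushed_at in list(self.in_flight.items()):
            if now - pushed_at > self.visibility_timeout:
                del self.in_flight[request_id]
                self.queue.requeue(request_id)
```

The trade-off matches the comment above: the producer side stays simple and keeps pushing, at the cost of occasional duplicate requests reaching the spiders.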
