Make requests via message queues #3477
Did you figure this out? I'm curious about something similar: I have 20M seed URLs that I want to queue, and I haven't found an example of someone implementing something like this.
@nsherron90 Yeah, I've been digging deeper through this and finally found out that you can't combine a twisted app, i.e. […], with Scrapy. So, still waiting for a fix and working on it.

I tried several other things as well, like adding a class field to the spider to store the messages as they come and then fetching from it in […] (see the sketch below). There's one possible solution using […].
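The class-field buffer idea, roughly (a sketch; the pika-style callback signature and where the buffer gets drained are assumptions):

```python
import scrapy


class BufferedSpider(scrapy.Spider):
    name = "buffered"
    messages = []  # class field acting as an in-process buffer

    def on_message(self, channel, method, properties, body):
        # queue callback: stash the payload so the spider can turn it
        # into a Request later
        self.messages.append(body.decode())
```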
@octohedron After staying up 24 hours straight trying to work this out, I decided to go with a Redis queue. See scrapy-redis. It was fairly easy to set up, and it uses Redis for deduplication and as the scheduler (queue), with persistence. I set it up on a Google Cloud Compute Engine instance and set up redis-server as a daemon.
Then I started the spider with nohup (which prevents logging out of the SSH session from killing the spider).
Once the spider is running, just add seed URLs to Redis and it should start sending them to the crawl instance you started with nohup. Here's how I did it.
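For example, from Python (a sketch assuming redis-py, a spider named `myspider`, and scrapy-redis's default `<spider_name>:start_urls` key; the seed file name is a placeholder):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# scrapy-redis spiders pop their start URLs from "<spider_name>:start_urls"
# by default, so pushing here feeds the crawl that is already running
with open("seed_urls.txt") as f:
    for url in f:
        r.lpush("myspider:start_urls", url.strip())
```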
It's been running for over 24 hours now, and I have tested the persistence bit; it all seems to be working just as expected. I still need to find out what happens when the queue is finished and you try adding more, and how to set up parallel runs, but I will be digging into that next. Hope this helps.
@nsherron90 Glad you made it work for your use case, although that's exactly the issue we're trying to solve, i.e. not using […].

We've been digging deeper into it, and it seems that this is not the […].

So, what I'm going to do is set up a […]. This will of course introduce other challenges, like feeding the buffer, slow disk I/O, etc.

Another project to look into that does similar things is scrapy-cluster.

I'm leaving the issue open, because maybe someone with a deeper understanding of […] can help.
Hi @octohedron, wondering if Scrapy RT may be useful for your task.
Hello @octohedron. One of the problems I see in the above snippets is that the […]. You could use the signals feature to enqueue more requests when the spider is idle. Consider you have a list of URLs.
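Say, something like this (a stand-in for messages arriving from an external source; the URLs are placeholders):

```python
urls = [
    "https://example.org/page/1",
    "https://example.org/page/2",
    "https://example.org/page/3",
]
```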
And a spider:
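A sketch of the idle-signal pattern (the two-argument `engine.crawl(request, spider)` call matches the Scrapy API of this era; newer versions take only the request):

```python
import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class ExampleSpider(scrapy.Spider):
    name = "example"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # run spider_idle() whenever the engine runs out of requests
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def spider_idle(self):
        # feed the engine directly, draining the `urls` list defined above
        while urls:
            self.crawler.engine.crawl(
                scrapy.Request(urls.pop(), callback=self.parse), self
            )
        # keep the spider alive so later messages can still be scheduled
        raise DontCloseSpider

    def parse(self, response):
        self.logger.info("Visited %s", response.url)
```

With this in place, the spider never closes on its own; whatever feeds `urls` (a queue consumer, a file reader, etc.) controls when new requests get scheduled.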
More information about scheduling requests directly from the engine can be found in #542.
Thanks for your reply, @elacuesta.
That's the first thing we tried. If you could provide a fully working example with pika or any other RabbitMQ/AMQP client, it would be great 👍 (it should only take a few minutes if you know what you are doing). Feel free to use/modify this project, https://github.com/octohedron/pikatest, which is a minimal Scrapy and pika/RabbitMQ integration setup.
Has anyone tried to integrate pika's consumer implementation on Twisted? https://github.com/pika/pika/blob/master/examples/twisted_service.py
I'm trying to pass requests to the spider externally, via message queues, and keep it running forever.
I found some projects made by others, but none of them work with the current version of Scrapy, so I'm trying to fix their issues or just find a way to do it myself.
So far, I found that others got a reference to the scheduler from the `spider` in the middleware.
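Roughly like this (a sketch of the pattern; `engine.slot.scheduler` is an undocumented Scrapy internal, and the attribute path varies between versions):

```python
class SchedulerReferenceMiddleware:
    """Spider middleware that hands the spider a reference to the scheduler."""

    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_spider_input(self, response, spider):
        # expose the live scheduler on the spider (internal, version-dependent)
        spider.scheduler = self.crawler.engine.slot.scheduler
```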
"Scheduler"
And then, in the spider.
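For instance (a sketch; messages are assumed to be plain URL strings):

```python
import scrapy


class MQSpider(scrapy.Spider):
    name = "mq_spider"

    def make_request(self, message):
        # wrap a queue message into a request for the custom scheduler
        return scrapy.Request(message, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # this is the callback that, as described below, never gets called
        self.logger.info("Parsed %s", response.url)
```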
This is the simplest example I could put together, but you get the idea; the code in reality is very messy, which makes it harder to fix.
The issue is that, in theory, it should work. I don't know when that `next_request` method is called, but it evidently is, because it calls the `make_request` method in the spider; the only problem is that it never gets to `parse` or the callback method in the spider, and I don't know why.

I also tried connecting to the message queue directly in the spider, which should work, but it doesn't.
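For example (a sketch using pika's blocking connection; the host and queue name are placeholders):

```python
import pika
import scrapy


class QueueSpider(scrapy.Spider):
    name = "queue_spider"

    def start_requests(self):
        connection = pika.BlockingConnection(
            pika.ConnectionParameters(host="localhost")
        )
        channel = connection.channel()
        channel.queue_declare(queue="urls")
        channel.basic_consume(
            queue="urls", on_message_callback=self.on_message, auto_ack=True
        )
        channel.start_consuming()  # blocks here, consuming forever

    def on_message(self, channel, method, properties, body):
        print(body)  # the message body prints as expected
```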
Up to there, everything is working fine: the body of the message received from the queue gets printed.
But if we try to make or yield a request from the callback method in the spider, it won't work.
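For example (same sketch, changing only the callback):

```python
    # in QueueSpider, replacing the callback above
    def on_message(self, channel, method, properties, body):
        print(body)  # still prints
        # `yield` turns this method into a generator; pika just calls it,
        # gets a generator object back, and discards it, so the Request
        # never reaches the Scrapy engine
        yield scrapy.Request(body.decode(), callback=self.parse)
```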
Unfortunately, yielding a generator or a request from the callback does not result in a spider request being made.
As you can see, I've tried several things without luck. All I need is to be able to make requests from the messages in the message queue. I'm not sure if there's a bug in Scrapy or if there's something I can fix in my code, but I would love to have some input on this before I start digging deeper into the Scrapy code myself.