Complete a redis queue demo with scrapy spider #4416
base: master
Conversation
- This demo is to get feedback on the implementation details from upstream.
- The redis queue class is present in demo_queue.
- Warning: run the demo against an empty Redis database, as it will write key:value pairs.
- To run the demo, go to the [Location of scrapy]/scrapy/msg_que folder and run `python demo.py`.
- The demo can also be run through `import scrapy.msg_que.demo_spider` if this version is installed.
- redis-py needs to be installed beforehand.
- Add folder "msg_queues" within scrapy for the demo_spider files.

On branch msg_queues
Changes to be committed:
  new file: scrapy/msg_que/__init__.py
  new file: scrapy/msg_que/demo_queue.py
  new file: scrapy/msg_que/demo_spider.py
  new file: scrapy/msg_que/items.json
  new file: scrapy/msg_que/requirements.txt
  new file: scrapy/msg_que/trial_data.py

End of Message
scrapy/msg_que/demo_queue.py
```python
import random


class redis_spider(scrapy.Spider):
```
A queue class should not inherit from Spider. They do not need to inherit from any class. And they should be enabled through settings: https://docs.scrapy.org/en/latest/topics/settings.html#scheduler-disk-queue

See @whalebot-helmsman's latest message on the topic here: #4326 (comment)
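For illustration, here is a minimal sketch of what the reviewer describes: a stand-alone queue class (no Spider inheritance) selected purely through scheduler settings. The class name, module path, and in-memory storage are hypothetical; the method set mirrors the queues in scrapy.squeues.

```python
# Hypothetical sketch: a queue class that inherits from nothing and is
# wired in via settings, as the review comment suggests.

class RedisRequestQueue:
    """Stand-alone queue; the scheduler only needs push/pop/close/__len__."""

    def __init__(self, path):
        # the scheduler passes a per-spider path/key; a real
        # implementation would open its Redis connection here
        self.path = path
        self._items = []

    def push(self, obj):
        self._items.append(obj)      # real version: write to Redis

    def pop(self):
        return self._items.pop() if self._items else None

    def close(self):
        pass                         # real version: release the connection

    def __len__(self):
        return len(self._items)


# settings.py -- enabled the same way the built-in queues are
# (module path is hypothetical):
# SCHEDULER_DISK_QUEUE = 'myproject.queues.RedisRequestQueue'
```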
- This demo is to get feedback on the implementation details from upstream.
- The redis queue class is present in demo_queue.
- Warning: run the demo against an empty Redis database, as it will write key:value pairs.
- To run the demo, go to the [Location of scrapy]/scrapy/msg_que folder and run `python demo.py`.
- The demo can also be run through `import scrapy.msg_que.demo_spider` if this version is installed.
- Include redis-py in setup.py so it gets installed.
- Add folder "msg_queues" within scrapy for the demo_spider files.

On branch msg_queues
Changes to be committed:
  new file: scrapy/msg_que/__init__.py
  new file: scrapy/msg_que/demo_queue.py
  new file: scrapy/msg_que/demo_spider.py
  new file: scrapy/msg_que/items.json
  new file: scrapy/msg_que/requirements.txt
  new file: scrapy/msg_que/trial_data.py
  modified: setup.py

End of Message
@Gallaecio thanks a lot for that instruction; it clarifies the queue's entry point in the overall flow. This demo implementation is based on those suggestions.
You are right here.

This, while trivial, still makes no sense to me. Let's assume external message queues are used only as persistent queues.
- Updating the local files with the recent developments.
- Prototype queue as per the feedback mentioned in https://github.com/scrapy/scrapy/pull/4416/files#r391510607
- Implement both memory and disk queues for Redis (file: redis_queue.py).
- To test, run the demonstration in scrapy/msg_que/demo_spider.py. Prerequisite: a redis-server running on localhost:6379 (use `sudo apt-get install redis-server` on Debian-based Linux).

Things which are left/uninitiated:
- Creating a custom spider state for the redis queue.
- A better serializing method instead of using id(crawler).
- Getting the redis initiation settings from the crawler's settings file.

Changes to be committed:
  modified: scrapy/msg_que/demo_spider.py -- runs a demo spider to test the working of the redis queue
  new file: scrapy/msg_que/redis_queue.py -- contains the interface queue implementation with redis
  new file: scrapy/msg_que/serialize.py -- contains the serializing and deserializing functions for requests
  modified: scrapy/settings/default_settings.py -- add the following settings to run the redis queue test:
    SCHEDULER_DISK_QUEUE = 'scrapy.squeues.LifoRedisQueue_disk'
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoRedisQueue'
  modified: scrapy/squeues.py -- add the interface variables of the redis queue:
    FifoRedisQueue = _scrapy_non_serialization_queue(redis_queue.Fifo_queue_)
    LifoRedisQueue = _scrapy_non_serialization_queue(redis_queue.Lifo_queue_)
    FifoRedisQueue_disk = _scrapy_serialization_queue(redis_queue.Fifo_queue_disk)
    LifoRedisQueue_disk = _scrapy_serialization_queue(redis_queue.Lifo_queue_disk)
  modified: setup.py -- add dependency 'redis>=3.4.1'
  modified: .gitignore -- ignore .vscode files

Note: this commit message is taken from c8edbd1, to carry the commit changes and messages from trial_branch over to here.
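To make the commit's description concrete, here is a minimal sketch of what the Fifo_queue_/Lifo_queue_ classes in redis_queue.py could look like, built on redis-py list commands. The key scheme and constructor arguments are assumptions; the actual file isn't shown here.

```python
# Minimal sketch (not the PR's actual code) of a Redis-backed queue pair:
# LPUSH feeds one end of a Redis list, LPOP/RPOP drain it.
import redis


class Fifo_queue_:
    def __init__(self, key, host='localhost', port=6379):
        self._redis = redis.Redis(host=host, port=port)
        self._key = key                       # one Redis list per queue

    def push(self, data):
        # `data` is expected to be bytes/str, i.e. an already-serialized request
        self._redis.lpush(self._key, data)

    def pop(self):
        # FIFO: pop from the end opposite to LPUSH; returns None when empty
        return self._redis.rpop(self._key)

    def __len__(self):
        return self._redis.llen(self._key)

    def close(self):
        # a disk-style (persistent) variant leaves the key in Redis so a
        # later run can resume; a memory-style variant could delete it here
        self._redis.connection_pool.disconnect()


class Lifo_queue_(Fifo_queue_):
    def pop(self):
        # LIFO: pop from the same end that LPUSH writes to
        return self._redis.lpop(self._key)
```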
```diff
@@ -27,7 +27,8 @@ def parse(self, response):

 process = CrawlerProcess(settings = {
     'FEED_FORMAT' : "json",
-    "FEED_URI" : "items.json"
+    "FEED_URI" : "items.json",
+    # "JOBDIR":"crawl_dir"
```
Uncommenting this line switches the test to the persistent type queue.
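For context, the hunk above boils down to the following self-contained setup; DemoSpider is a placeholder for whatever spider demo_spider.py actually defines.

```python
# Cleaned-up version of the diff above. Uncommenting JOBDIR switches
# the run from the memory queue to the persistent (disk) queue.
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',
    'FEED_URI': 'items.json',      # note the comma the diff adds
    # 'JOBDIR': 'crawl_dir',       # uncomment for the persistent queue
})
# process.crawl(DemoSpider)        # DemoSpider is hypothetical here
# process.start()
```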
@Gallaecio @whalebot-helmsman, thanks for the detailed suggestions. I made changes according to the feedback; kindly review this version. Please point out whether the progress is in the right direction, and whether you still see any inconsistency in this implementation in terms of your requirements. The changes included are described in the commit message above.
There are only 5 days left before the end of the proposal acceptance period. I would advise you to focus on writing a proposal (https://github.com/python-gsoc/python-gsoc.github.io/blob/master/2019/application2019.md) and pay less attention to the draft implementation.
@whalebot-helmsman Thanks for the heads up. I have submitted a draft proposal on the GSoC portal.
@rindhane The idea behind this project is to handle all supported message queues as uniformly as possible. Is there an opportunity to add state persistence to all supported queues?

P.S. Today is the last day of the student application period.
@whalebot-helmsman Could you further elaborate on the uniform handling of message queues? With respect to the proposed scope, where do you see such a discrepancy? The proposal indicates that all the queues will provide LIFO/FIFO/random/sharded styles of transactions with Scrapy through a similarly structured class with operating functions such as push(), pop(), open(), close(), etc., so that the Scrapy scheduler can work with any of them. This is similar to the queuelib implementation, which can provide persistent storage in the file system as well as in an SQLite database. The only difference among the queues will be in their internal transaction methodology, which will depend on the chosen interface library (e.g. redis-py for Redis).

To elaborate, Kafka and RabbitMQ will be utilized as message brokers (pub-sub), while Redis is preferred as a database rather than via pub-sub. This preference is just to pick the inherently more efficient method for each backend. Since it cannot presently be known which method will work better, all the possible methods (in the case of Redis) will be built and provided, and the selection will be made based on testing results.

Also, to keep the state of the spider in the message queue rather than in a file (the present condition), it is proposed to add a new middleware extension called QueueState (it will work like scrapy/utils/spiderstate.py but store details in the downstream queue), along with some changes to scrapy/core/scheduler.py and JobDir in scrapy/utils/job.py. This modification will store the state in the persistent storage provided by the message queue rather than in the file system, and it will work across all the downstream queues.

I hope the above explanation made sense. I will work on improving the clarity of details in the proposal.
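To illustrate the "similarly structured class" described above, here is a sketch of the common contract every backend (Redis, Kafka, RabbitMQ) would implement. The class name and exact signatures are assumptions drawn from the comment, not from the proposal itself.

```python
# Hypothetical uniform interface: the scheduler talks only to these
# methods, so the backends stay interchangeable.
from abc import ABC, abstractmethod


class ExternalQueue(ABC):
    @abstractmethod
    def open(self):
        """Connect to the backend and prepare the queue/topic."""

    @abstractmethod
    def push(self, request):
        """Enqueue a (serialized) request."""

    @abstractmethod
    def pop(self):
        """Dequeue the next request, or return None when empty."""

    @abstractmethod
    def close(self):
        """Release connections; persistent backends keep their data."""

    @abstractmethod
    def __len__(self):
        """Number of pending requests."""
```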