Complete a redis queue demo with scrapy spider #4416

Draft · wants to merge 9 commits into base: master
Conversation

rindhane

  • This demo is to get feedback on the implementation details from mentors @whalebot-helmsman @Gallaecio for the General Message Queues as Storage for Requests #4326 idea in GSoC 2020.
  • The Redis queue class is in demo_queue.
  • Warning: run the demo against an empty Redis database, as it will write some key:value pairs.
  • To run the demo, go to the [Location of scrapy]/scrapy/msg_que folder and run python demo.py.
  • The demo can also be run through import scrapy.msg_que.demo_spider if this version is installed.
  • Prerequisites for the demo are redis-py and a running Redis service.
  • Add folder "msg_queues" within scrapy for demo_spider files
    On branch msg_queues
    Changes to be committed:
    new file: scrapy/msg_que/__init__.py
    new file: scrapy/msg_que/demo_queue.py
    new file: scrapy/msg_que/demo_spider.py
    new file: scrapy/msg_que/items.json
    new file: scrapy/msg_que/requirements.txt
    new file: scrapy/msg_que/trial_data.py
    End of Message

rindhane and others added 2 commits March 10, 2020 13:26
- This demo is to get feedback on the implementation details from upstream
- The redis queue class is present in demo_queue
- Warning: Run the demo with an empty redis database in service, as it will write key:value pairs
- To run the demo, just go to the [Location of scrapy]/scrapy/msg_que folder and run python demo.py
- Demo can be run through import scrapy.msg_que.demo_spider if this version is installed
- redis-py needs to be installed beforehand
- Add folder "msg_queues" within scrapy for demo_spider files
On branch msg_queues
Changes to be committed:
	new file:   scrapy/msg_que/__init__.py
	new file:   scrapy/msg_que/demo_queue.py
	new file:   scrapy/msg_que/demo_spider.py
	new file:   scrapy/msg_que/items.json
	new file:   scrapy/msg_que/requirements.txt
	new file:   scrapy/msg_que/trial_data.py
End of Message
import random


class redis_spider(scrapy.Spider):
Member

A queue class should not inherit from Spider. They do not need to inherit from any class. And they should be enabled through settings: https://docs.scrapy.org/en/latest/topics/settings.html#scheduler-disk-queue

See @whalebot-helmsman’s latest message on the topic here: #4326 (comment)
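For illustration, a minimal sketch of a standalone queue class along these lines (this is not the PR's code; the class name, Redis key, and constructor signature are assumptions, and the exact signature the scheduler expects is not covered here):

```python
import pickle

import redis


class RedisRequestQueue:
    """Standalone FIFO request queue backed by Redis; note it inherits from nothing."""

    def __init__(self, host='localhost', port=6379, key='scrapy:requests'):
        self.server = redis.Redis(host=host, port=port)
        self.key = key

    def push(self, obj):
        # Append the serialized object (e.g. an encoded request) to the Redis list.
        self.server.rpush(self.key, pickle.dumps(obj))

    def pop(self):
        # Remove and return the oldest entry, or None when the queue is empty.
        data = self.server.lpop(self.key)
        return pickle.loads(data) if data is not None else None

    def __len__(self):
        return self.server.llen(self.key)

    def close(self):
        # redis-py manages a connection pool, so there is nothing to tear down here.
        pass
```

Such a class would then be enabled purely through settings, e.g. SCHEDULER_DISK_QUEUE = 'myproject.queues.RedisRequestQueue' (a hypothetical import path), rather than by being wired into a spider.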

    - This demo is to get feedback on the implementation details from upstream
    - The redis queue class is present in demo_queue
    - Warning: Run the demo with an empty redis database in service, as it will write key:value pairs
    - To run the demo, just go to the [Location of scrapy]/scrapy/msg_que folder and run python demo.py
    - Demo can be run through import scrapy.msg_que.demo_spider if this version is installed
    - Include in setup.py to install redis-py
    - Add folder "msg_queues" within scrapy for demo_spider files
    On branch msg_queues
    Changes to be committed:
            new file:   scrapy/msg_que/__init__.py
            new file:   scrapy/msg_que/demo_queue.py
            new file:   scrapy/msg_que/demo_spider.py
            new file:   scrapy/msg_que/items.json
            new file:   scrapy/msg_que/requirements.txt
            new file:   scrapy/msg_que/trial_data.py
            modified:   setup.py
    End of Message
@rindhane
Author

@Gallaecio, thanks a lot for that instruction; it clarifies the queue's entry point in the overall workflow.
To take the discussion and development forward, I will bring the conversation from the associated email here. I request both @Gallaecio and @whalebot-helmsman to correct or suggest from here onwards.

This demo implementation was created based on those suggestions.
Presently, the implementation covers only the Redis in-memory queue. So the queries are:

  1. Firstly, is this the right approach to constructing the queue class, i.e. are all the required class functions that are called by the scheduler present (barring from_crawler(), which has been borrowed from the parent Spider class)? If not, which ones are missing? Presently, I am able to test this queue by running a spider from a script. [demo_spider]
  2. In this construction, it is presumed that the Redis queue class just maintains the transaction or movement of items to and from the Scrapy spider. The Scrapy scheduler drives this movement, i.e. no separate worker system needs to be created to run this queue. Is this presumption OK?
  3. If the above presumption is not OK, could you elaborate on where this library will be integrated into Scrapy's operation cycle? From my understanding, it will be coupled with a spider to maintain a connection with the external queue. My understanding of the queue's position in the operation is depicted by this diagram. Is this understanding appropriate?
  4. The queue class created presently functions in FIFO mode and is persisted in the Redis db file. I could read here that the other requirements are FIFO/LIFO and disk/memory. Are there other desired types that have to be implemented?
  5. Is there a minimum number of downstream messaging queues (like Kafka, RabbitMQ) that must be implemented for GSoC 2020?

@Gallaecio
Member

  1. Creating a persistent (“disk”) queue class is the right approach; however, your implementation should not inherit from Spider, as explained above.

  2. Your Redis-based disk queue class would behave both as a caller (pushing messages) and as a worker (reading messages). No additional workers (or callers) are needed.

  3. No, it will not be coupled with a spider. It will instead be handled on the scheduler side. You will define a disk queue class that will keep the connection to Redis (and additional queue classes for other queue systems, one per system). To use that class, it should be enough to point the SCHEDULER_DISK_QUEUE setting to the import path of your class.

     Parameters needed by the queue (e.g. Redis connection parameters) may be specified through custom settings, which users would specify the regular way (in settings.py or in Spider.custom_settings), and your queue class would read them from crawler.settings (you get a crawler object to read these from in from_crawler). See the sketch after this list.

  4. I assume that Redis and other message queues will behave as FIFO or LIFO (or persist data to disk or not) based on server-side settings, in which case I do not think the corresponding disk queue classes should be either FIFO or LIFO; they will behave however the corresponding service is configured.

     On disk vs memory: I think message queues are meant to work as disk queues (we say disk, but we mean persistent). However, I guess they could be used as memory queues as well. It should be trivial to support both types of queues for each service (e.g. Redis, etc.), but @whalebot-helmsman might have a clearer vision here.

     In cases where dispatching order (FIFO/LIFO/other) can be decided by the service caller or worker, this could be exposed through custom Scrapy settings, as mentioned above, or separate queue classes could be created for each type.

  5. We have only stated Redis as the minimum so far, although we do expect more than one queue to be implemented; once you implement support for the first service, it should be much easier to implement support for additional services.

     We are leaving it up to the student to decide which other queue services to support. For your GSoC proposal, you need to evaluate which services there are, which ones are the most interesting, which ones have ready-to-use Python bindings, which ones fit the queue class approach of Scrapy, and which ones you should be able to complete within GSoC.

     You can choose the minimum that you are sure you can implement in time, and set a list of additional ones as stretch goals, to be completed as time permits.
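As a hedged illustration of point 3: the queue class could obtain its connection parameters from crawler.settings in from_crawler. The REDIS_HOST/REDIS_PORT setting names and the class itself are assumptions for this sketch, not part of Scrapy:

```python
import redis


class RedisQueue:
    """Sketch of a disk-queue class that keeps the connection to Redis."""

    def __init__(self, host, port, key='scrapy:requests'):
        self.server = redis.Redis(host=host, port=port)
        self.key = key

    @classmethod
    def from_crawler(cls, crawler):
        # Connection parameters are read from the regular settings mechanism
        # (settings.py or Spider.custom_settings); the REDIS_* names are hypothetical.
        settings = crawler.settings
        return cls(
            host=settings.get('REDIS_HOST', 'localhost'),
            port=settings.getint('REDIS_PORT', 6379),
        )
```

Pointing SCHEDULER_DISK_QUEUE at the import path of such a class (e.g. 'myproject.queues.RedisQueue', a hypothetical path) would then be the only wiring needed.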

@whalebot-helmsman
Contributor

I think message queues are meant to work as disk queues (we say disk, but we mean persistent).

You are right here

However, I guess they could be used as memory queues as well. It should be trivial to support both types of queues for each service (e.g. Redis, etc.)

Even if this is trivial, it still makes no sense to me. Let's assume external message queues are used only as persistent queues.

rindhane and others added 6 commits March 13, 2020 23:35
 - Updating the local files with the recent developments.
 - Prototype queue as per the feedback mentioned in https://github.com/scrapy/scrapy/pull/4416/files#r391510607
 - Implement both memory and disk queue for redis. (File: redis_queue.py)
 - To test, run the demonstration in scrapy/msg_que/demo_spider.py
	(Pre-requisite is to have redis-server installed on localhost:6379;
	use sudo apt-get install redis-server on Debian Linux)

Things which are left/not yet started:
      - creating a custom spiderstate for the redis queue.
      - a better serializing method instead of using id(crawler)
      - getting redis initiation settings from the crawler's settings file.
 Changes to be committed:
	modified:   scrapy/msg_que/demo_spider.py
		    -- It runs a demo spider to test the working of redis queue

	new file:   scrapy/msg_que/redis_queue.py
		    -- This file contains the interface queue implementation with redis

	new file:   scrapy/msg_que/serialize.py
		    -- This file contains the serializing and deserializing functions for requests (a sketch of such helpers follows this commit message)

	modified:   scrapy/settings/default_settings.py
		    -- Add following settings to run the test of redis queue
			SCHEDULER_DISK_QUEUE = 'scrapy.squeues.LifoRedisQueue_disk'
			SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoRedisQueue'

	modified:   scrapy/squeues.py
		    --Add the interface variables of redis queue
			FifoRedisQueue = _scrapy_non_serialization_queue(redis_queue.Fifo_queue_)
			LifoRedisQueue = _scrapy_non_serialization_queue(redis_queue.Lifo_queue_)
			FifoRedisQueue_disk = _scrapy_serialization_queue(redis_queue.Fifo_queue_disk)
			LifoRedisQueue_disk = _scrapy_serialization_queue(redis_queue.Lifo_queue_disk)
	modified:   setup.py
			Add dependency 'redis>= 3.4.1'
 Changes not staged for commit:
	modified:   .gitignore
		     -- To ignore .vscode files
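The serialize.py module listed above is described as holding the request (de)serializing functions. A minimal sketch of what such helpers might look like, assuming the scrapy.utils.reqser helpers available in Scrapy at the time (the function names here are hypothetical, not the PR's):

```python
import pickle

from scrapy.utils.reqser import request_to_dict, request_from_dict


def serialize_request(request, spider=None):
    # Convert the Request (including callback references) to a plain dict,
    # then pickle it so it can be stored as a Redis value.
    return pickle.dumps(request_to_dict(request, spider))


def deserialize_request(data, spider=None):
    # Reverse of serialize_request: unpickle and rebuild the Request object.
    return request_from_dict(pickle.loads(data), spider)
```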
Note: a later commit reuses this commit message (taken from c8edbd1) to bring the commit changes and messages from trial_branch over here.
@@ -27,7 +27,8 @@ def parse(self,response):

 process = CrawlerProcess(settings = {
 'FEED_FORMAT' :"json",
-"FEED_URI" : "items.json"
+"FEED_URI" : "items.json",
+# "JOBDIR":"crawl_dir"
Author

Uncommenting this line switches the test to the persistent type queue.
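For context, the demo runner around this diff presumably looks roughly like the snippet below. The spider body and start URL are stand-ins; only the settings dict mirrors the diff above:

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class redis_spider(scrapy.Spider):
    # Stand-in for the spider in demo_spider.py; the real parse logic is omitted.
    name = 'redis_demo'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}


process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',
    'FEED_URI': 'items.json',
    # Uncommenting JOBDIR switches the demo to the persistent (disk) queue:
    # 'JOBDIR': 'crawl_dir',
})
process.crawl(redis_spider)
process.start()
```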

@rindhane
Author

@Gallaecio @whalebot-helmsman, thanks for the detailed suggestions. I have made changes according to the feedback. Kindly review this version and please point out whether the progress is in the right direction.

Could you indicate whether you still see any inconsistency in this implementation with respect to your requirements?

Changes included are:

  • In scrapy/msg_que/redis_queue.py, created four classes, namely (Fifo_queue_, Lifo_queue_) and (Fifo_queue_disk, Lifo_queue_disk), for the in-memory and persistent types respectively, as the Redis queue interface, in line with the suggestion provided here.

  • Access to these classes as the interface library for Scrapy during operation is made available via the following variables in scrapy/squeues.py:
    FifoRedisQueue = _scrapy_non_serialization_queue(redis_queue.Fifo_queue_)
    LifoRedisQueue = _scrapy_non_serialization_queue(redis_queue.Lifo_queue_)
    FifoRedisQueue_disk = _scrapy_serialization_queue(redis_queue.Fifo_queue_disk)
    LifoRedisQueue_disk = _scrapy_serialization_queue(redis_queue.Lifo_queue_disk)

  • Made the following changes in scrapy/settings/default_settings.py to test out the Redis queue:
    SCHEDULER_DISK_QUEUE = 'scrapy.squeues.LifoRedisQueue_disk'
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoRedisQueue'

  • The operation of this interface is being tested by running a demo spider.

    1. To test either the in-memory or the persistent queue, toggle between commenting and uncommenting the JOBDIR key/value pair.
    2. Have the Redis server running in the background at localhost:6379.
  • Pending issues are:

    1. Modification of scrapy/extensions/spiderstate.py to maintain the state with the Redis server.
    2. Passing the Redis server connection settings to the crawler from the settings file.
  • If you could direct me to some existing tests that this implementation should adhere to, it would be really helpful.

  • Note: presently, I have kept the in-memory type queue interface (through a quirky modification of the serialization method in scrapy/utils/reqser.py) because it may be required for a pub-sub implementation. I am not entirely sure right now; once the main development is through, it will be easier to make the decision.

@whalebot-helmsman
Contributor

There are only 5 days before the end of the proposal acceptance period. I would advise you to focus on writing a proposal (https://github.com/python-gsoc/python-gsoc.github.io/blob/master/2019/application2019.md) and pay less attention to the draft implementation.

@rindhane
Author

There are only 5 days before the end of the proposal acceptance period. I would advise you to focus on writing a proposal (https://github.com/python-gsoc/python-gsoc.github.io/blob/master/2019/application2019.md) and pay less attention to the draft implementation.

@whalebot-helmsman Thanks for the heads-up. I have submitted a draft proposal on the GSoC portal.
Could @Gallaecio and @whalebot-helmsman please go through it and give some initial advice?

@whalebot-helmsman
Contributor

whalebot-helmsman commented Mar 31, 2020

@rindhane The idea behind this project is to handle all supported message queues as uniformly as possible. Is there an opportunity to add state persistence to all supported queues?

P.S. Today is the last day of the student application period.

@rindhane
Author

@rindhane The idea behind this project is to handle all supported message queues as uniformly as possible. Is there an opportunity to add state persistence to all supported queues?

P.S. Today is the last day of the student application period.

@whalebot-helmsman Could you further elaborate on the uniform handling of message queues? With respect to the proposed scope, where do you see such a discrepancy?

In the proposal, it is indicated that all the queues will be able to provide LIFO/FIFO/random/sharded styles of transactions with Scrapy through similarly structured classes with operating functions such as push(), pop(), open(), close(), etc., so that the Scrapy scheduler can work with any of them. This is similar to the queuelib implementation, which can provide persistent storage in the file system as well as in an SQLite database.
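As a hedged sketch of the "similarly structured class" described in the previous paragraph (the interface name and docstrings are illustrative, not an agreed design):

```python
from abc import ABC, abstractmethod


class ExternalQueue(ABC):
    """Uniform interface that each backend (Redis, Kafka, RabbitMQ, ...) would implement,
    so the Scrapy scheduler can work with any of them interchangeably."""

    @abstractmethod
    def open(self):
        """Establish the connection to the external queue service."""

    @abstractmethod
    def push(self, request):
        """Store one (serialized) request in the external queue."""

    @abstractmethod
    def pop(self):
        """Return the next request, or None if the queue is empty."""

    @abstractmethod
    def close(self):
        """Flush pending data and close the connection."""

    @abstractmethod
    def __len__(self):
        """Number of pending requests (used by the scheduler to know when it is done)."""
```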

The only difference among the queues will be in their internal working with respect to the transaction methodology, which will depend on the chosen interface library (e.g. redis-py for Redis). To elaborate, Kafka and RabbitMQ will be utilized as message brokers (pub-sub), while Redis is preferred to be utilized as a database rather than via the pub-sub method. This preference is just to use the inherently more efficient method. Since it cannot presently be known which method will work better, all the possible methods (in the case of Redis) will be built and provided, and the selection will be made based on the testing results.

Also, to keep the state of the spider in the message queue rather than in a file (the present condition), it is proposed to have a new extension called QueueState (which will work like scrapy/extensions/spiderstate.py but store details in the downstream queue), along with some changes to scrapy/core/scheduler.py and to job_dir in scrapy/utils/job.py. This modification will store the state in the persistent storage provided by the message queue rather than on the file system, and it will work across all the downstream queues.
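A minimal sketch of what the proposed QueueState extension might look like, modelled on the built-in SpiderState extension but persisting to Redis; the class, Redis key format, and REDIS_* setting names are assumptions:

```python
import pickle

import redis
from scrapy import signals


class QueueState:
    """Store spider.state in Redis instead of the JOBDIR file, as proposed above."""

    def __init__(self, host='localhost', port=6379):
        self.server = redis.Redis(host=host, port=port)

    @classmethod
    def from_crawler(cls, crawler):
        # Hypothetical settings for the Redis connection.
        ext = cls(
            host=crawler.settings.get('REDIS_HOST', 'localhost'),
            port=crawler.settings.getint('REDIS_PORT', 6379),
        )
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        # Restore any previously stored state, or start with an empty dict.
        data = self.server.get('%s:state' % spider.name)
        spider.state = pickle.loads(data) if data else {}

    def spider_closed(self, spider):
        # Persist the state back to Redis when the crawl stops.
        self.server.set('%s:state' % spider.name, pickle.dumps(spider.state))
```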

I hope the above explanation made sense.

I think I will work on improving the clarity of details in the proposal.
