Complete a redis queue demo with scrapy spider #4416

Draft · wants to merge 9 commits into base: master
Conversation

rindhane

  • This demo is to get feedback on the implementation details from mentors @whalebot-helmsman @Gallaecio for the General Message Queues as Storage for Requests #4326 idea in GSoC 2020.
  • The Redis queue class is in demo_queue.
  • Warning: run the demo against an empty Redis database, as it will write some key:value pairs.
  • To run the demo, go to the [Location of scrapy]/scrapy/msg_que folder and run python demo.py.
  • The demo can also be run through import scrapy.msg_que.demo_spider if this version is installed.
  • Prerequisites for the demo are redis-py and a running Redis service.
  • Add folder "msg_queues" within scrapy for demo_spider files
    On branch msg_queues
    Changes to be committed:
    new file: scrapy/msg_que/__init__.py
    new file: scrapy/msg_que/demo_queue.py
    new file: scrapy/msg_que/demo_spider.py
    new file: scrapy/msg_que/items.json
    new file: scrapy/msg_que/requirements.txt
    new file: scrapy/msg_que/trial_data.py
    End of Message

rindhane and others added 2 commits March 10, 2020 13:26
- This demo is to get feedback on the implementation details from upstream
- The redis queue class is present in demo_queue
- Warning: Run the demo with an empty redis database in service, as it will write key:value pairs
- To run the demo, just go to the [Location of scrapy]/scrapy/msg_que folder and run python demo.py
- Demo can be run through import scrapy.msg_que.demo_spider if this version is installed
- redis-py needs to be installed beforehand
- Add folder "msg_queues" within scrapy for demo_spider files
On branch msg_queues
Changes to be committed:
	new file:   scrapy/msg_que/__init__.py
	new file:   scrapy/msg_que/demo_queue.py
	new file:   scrapy/msg_que/demo_spider.py
	new file:   scrapy/msg_que/items.json
	new file:   scrapy/msg_que/requirements.txt
	new file:   scrapy/msg_que/trial_data.py
End of Message
import random


class redis_spider(scrapy.Spider):
Member

A queue class should not inherit from Spider. They do not need to inherit from any class. And they should be enabled through settings: https://docs.scrapy.org/en/latest/topics/settings.html#scheduler-disk-queue

See @whalebot-helmsman’s latest message on the topic here: #4326 (comment)
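For illustration, a minimal sketch of a standalone queue class along these lines (this is not the PR's code; the class name, Redis key, and constructor signature are assumptions, and the exact signature the scheduler expects is not covered here):

```python
import pickle

import redis


class RedisRequestQueue:
    """Standalone FIFO request queue backed by Redis; note it inherits from nothing."""

    def __init__(self, host='localhost', port=6379, key='scrapy:requests'):
        self.server = redis.Redis(host=host, port=port)
        self.key = key

    def push(self, obj):
        # Append the serialized object (e.g. an encoded request) to the Redis list.
        self.server.rpush(self.key, pickle.dumps(obj))

    def pop(self):
        # Remove and return the oldest entry, or None when the queue is empty.
        data = self.server.lpop(self.key)
        return pickle.loads(data) if data is not None else None

    def __len__(self):
        return self.server.llen(self.key)

    def close(self):
        # redis-py manages a connection pool, so there is nothing to tear down here.
        pass
```

Such a class would then be enabled purely through settings, e.g. SCHEDULER_DISK_QUEUE = 'myproject.queues.RedisRequestQueue' (a hypothetical import path), rather than by being wired into a spider.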

    - This demo is to get feedback on the implementation details from upstream
    - The redis queue class is present in demo_queue
    - Warning: Run the demo with an empty redis database in service, as it will write key:value pairs
    - To run the demo, just go to the [Location of scrapy]/scrapy/msg_que folder and run python demo.py
    - Demo can be run through import scrapy.msg_que.demo_spider if this version is installed
    - Include in setup.py to install redis-py
    - Add folder "msg_queues" within scrapy for demo_spider files
    On branch msg_queues
    Changes to be committed:
            new file:   scrapy/msg_que/__init__.py
            new file:   scrapy/msg_que/demo_queue.py
            new file:   scrapy/msg_que/demo_spider.py
            new file:   scrapy/msg_que/items.json
            new file:   scrapy/msg_que/requirements.txt
            new file:   scrapy/msg_que/trial_data.py
            modified:   setup.py
    End of Message
@rindhane
Author

@Gallaecio, thanks a lot for that instruction; it clarifies the queue's entry point in the overall workflow.
To take the discussion and development forward, I will bring the conversation from the associated email here. I request both @Gallaecio and @whalebot-helmsman to correct or suggest from here onwards.

This demo implementation was created based on those suggestions.
Presently, the implementation covers only the Redis in-memory queue. So the queries are:

  1. Firstly, is this the right approach to constructing the queue class, i.e. are all the required class functions that are called by the scheduler present (barring from_crawler(), which has been borrowed from the parent Spider class)? If not, which ones are missing? Presently, I am able to test this queue by running a spider from a script. [demo_spider]
  2. In this construction, it is presumed that the Redis queue class just maintains the transaction or movement of items to and from the Scrapy spider. The Scrapy scheduler drives this movement, i.e. no separate worker system needs to be created to run this queue. Is this presumption OK?
  3. If the above presumption is not OK, could you elaborate on where this library will be integrated into Scrapy's operation cycle? From my understanding, it will be coupled with a spider to maintain a connection with the external queue. My understanding of the queue's position in the operation is depicted by this diagram. Is this understanding appropriate?
  4. The queue class created presently functions in FIFO mode and is persisted in the Redis db file. I could read here that the other requirements are FIFO/LIFO and disk/memory. Are there other desired types that have to be implemented?
  5. Is there a minimum number of downstream messaging queues (like Kafka, RabbitMQ) that must be implemented for GSoC 2020?

@Gallaecio
Member

  1. Creating a persistent (“disk”) queue class is the right approach; however, your implementation should not inherit from Spider, as explained above.

  2. Your Redis-based disk queue class would behave both as a caller (pushing messages) and as a worker (reading messages). No additional workers (or callers) are needed.

  3. No, it will not be coupled with a spider. It will instead be handled on the scheduler side. You will define a disk queue class that will keep the connection to Redis (and additional queue classes for other queue systems, one per system). To use that class, it should be enough to point the SCHEDULER_DISK_QUEUE setting to the import path of your class.

     Parameters needed by the queue (e.g. Redis connection parameters) may be specified through custom settings, which users would specify the regular way (in settings.py or in Spider.custom_settings), and your queue class would read them from crawler.settings (you get a crawler object to read these from in from_crawler). See the sketch after this list.

  4. I assume that Redis and other message queues will behave as FIFO or LIFO (or persist data to disk or not) based on server-side settings, in which case I do not think the corresponding disk queue classes should be either FIFO or LIFO; they will behave however the corresponding service is configured.

     On disk vs memory: I think message queues are meant to work as disk queues (we say disk, but we mean persistent). However, I guess they could be used as memory queues as well. It should be trivial to support both types of queues for each service (e.g. Redis, etc.), but @whalebot-helmsman might have a clearer vision here.

     In cases where dispatching order (FIFO/LIFO/other) can be decided by the service caller or worker, this could be exposed through custom Scrapy settings, as mentioned above, or separate queue classes could be created for each type.

  5. We have only stated Redis as the minimum so far, although we do expect more than one queue to be implemented; once you implement support for the first service, it should be much easier to implement support for additional services.

     We are leaving it up to the student to decide which other queue services to support. For your GSoC proposal, you need to evaluate which services there are, which ones are the most interesting, which ones have ready-to-use Python bindings, which ones fit the queue class approach of Scrapy, and which ones you should be able to complete within GSoC.

     You can choose the minimum that you are sure you can implement in time, and set a list of additional ones as stretch goals, to be completed as time permits.
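As a hedged illustration of point 3: the queue class could obtain its connection parameters from crawler.settings in from_crawler. The REDIS_HOST/REDIS_PORT setting names and the class itself are assumptions for this sketch, not part of Scrapy:

```python
import redis


class RedisQueue:
    """Sketch of a disk-queue class that keeps the connection to Redis."""

    def __init__(self, host, port, key='scrapy:requests'):
        self.server = redis.Redis(host=host, port=port)
        self.key = key

    @classmethod
    def from_crawler(cls, crawler):
        # Connection parameters are read from the regular settings mechanism
        # (settings.py or Spider.custom_settings); the REDIS_* names are hypothetical.
        settings = crawler.settings
        return cls(
            host=settings.get('REDIS_HOST', 'localhost'),
            port=settings.getint('REDIS_PORT', 6379),
        )
```

Pointing SCHEDULER_DISK_QUEUE at the import path of such a class (e.g. 'myproject.queues.RedisQueue', a hypothetical path) would then be the only wiring needed.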

@whalebot-helmsman
Contributor

I think message queues are meant to work as disk queues (we say disk, but we mean persistent).

You are right here

However, I guess they could be used as memory queues as well. It should be trivial to support both types of queues for each service (e.g. Redis, etc.)

Even if this is trivial, it still makes no sense to me. Let's assume external message queues are used only as persistent queues.

rindhane and others added 6 commits March 13, 2020 23:35
 - Updating the local files with the recent developments.
 - Prototype queue as per the feedback mentioned in https://github.com/scrapy/scrapy/pull/4416/files#r391510607
 - Implement both memory and disk queue for redis. (File: redis_queue.py)
 - To test, run the demonstration in scrapy/msg_que/demo_spider.py
	(Pre-requisite is to have redis-server installed on localhost:6379;
	use sudo apt-get install redis-server on Debian Linux)

Things which are left/not yet started:
      - creating a custom spiderstate for the redis queue.
      - a better serializing method instead of using id(crawler)
      - getting redis initiation settings from the crawler's settings file.
 Changes to be committed:
	modified:   scrapy/msg_que/demo_spider.py
		    -- It runs a demo spider to test the working of redis queue

	new file:   scrapy/msg_que/redis_queue.py
		    -- This file contains the interface queue implementation with redis

	new file:   scrapy/msg_que/serialize.py
		    -- This file contains the serializing and deserializing functions for requests (a sketch of such helpers follows this commit message)

	modified:   scrapy/settings/default_settings.py
		    -- Add following settings to run the test of redis queue
			SCHEDULER_DISK_QUEUE = 'scrapy.squeues.LifoRedisQueue_disk'
			SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoRedisQueue'

	modified:   scrapy/squeues.py
		    --Add the interface variables of redis queue
			FifoRedisQueue = _scrapy_non_serialization_queue(redis_queue.Fifo_queue_)
			LifoRedisQueue = _scrapy_non_serialization_queue(redis_queue.Lifo_queue_)
			FifoRedisQueue_disk = _scrapy_serialization_queue(redis_queue.Fifo_queue_disk)
			LifoRedisQueue_disk = _scrapy_serialization_queue(redis_queue.Lifo_queue_disk)
	modified:   setup.py
			Add dependency 'redis>= 3.4.1'
 Changes not staged for commit:
	modified:   .gitignore
		     -- To ignore .vscode files
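The serialize.py module listed above is described as holding the request (de)serializing functions. A minimal sketch of what such helpers might look like, assuming the scrapy.utils.reqser helpers available in Scrapy at the time (the function names here are hypothetical, not the PR's):

```python
import pickle

from scrapy.utils.reqser import request_to_dict, request_from_dict


def serialize_request(request, spider=None):
    # Convert the Request (including callback references) to a plain dict,
    # then pickle it so it can be stored as a Redis value.
    return pickle.dumps(request_to_dict(request, spider))


def deserialize_request(data, spider=None):
    # Reverse of serialize_request: unpickle and rebuild the Request object.
    return request_from_dict(pickle.loads(data), spider)
```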
Note: a later commit reuses this commit message (taken from c8edbd1) to bring the commit changes and messages from trial_branch over here.
@@ -27,7 +27,8 @@ def parse(self,response):

 process = CrawlerProcess(settings = {
 'FEED_FORMAT' :"json",
-"FEED_URI" : "items.json"
+"FEED_URI" : "items.json",
+# "JOBDIR":"crawl_dir"
Author

Uncommenting this line switches the test to the persistent type queue.
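For context, the demo runner around this diff presumably looks roughly like the snippet below. The spider body and start URL are stand-ins; only the settings dict mirrors the diff above:

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class redis_spider(scrapy.Spider):
    # Stand-in for the spider in demo_spider.py; the real parse logic is omitted.
    name = 'redis_demo'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}


process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',
    'FEED_URI': 'items.json',
    # Uncommenting JOBDIR switches the demo to the persistent (disk) queue:
    # 'JOBDIR': 'crawl_dir',
})
process.crawl(redis_spider)
process.start()
```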

@rindhane
Author

@Gallaecio @whalebot-helmsman, thanks for the detailed suggestions. I have made changes according to the feedback. Kindly review this version and please point out whether the progress is in the right direction.

Could you indicate whether you still see any inconsistency in this implementation with respect to your requirements?

Changes included are:

  • In scrapy/msg_que/redis_queue.py, created four classes, namely (Fifo_queue_, Lifo_queue_) and (Fifo_queue_disk, Lifo_queue_disk), for the in-memory and persistent types respectively, as the Redis queue interface, in line with the suggestion provided here.

  • Access to these classes as the interface library for Scrapy during operation is made available via the following variables in scrapy/squeues.py:
    FifoRedisQueue = _scrapy_non_serialization_queue(redis_queue.Fifo_queue_)
    LifoRedisQueue = _scrapy_non_serialization_queue(redis_queue.Lifo_queue_)
    FifoRedisQueue_disk = _scrapy_serialization_queue(redis_queue.Fifo_queue_disk)
    LifoRedisQueue_disk = _scrapy_serialization_queue(redis_queue.Lifo_queue_disk)

  • Made the following changes in scrapy/settings/default_settings.py to test out the Redis queue:
    SCHEDULER_DISK_QUEUE = 'scrapy.squeues.LifoRedisQueue_disk'
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoRedisQueue'

  • The operation of this interface is being tested by running a demo spider.

    1. To test either the in-memory or the persistent queue, toggle between commenting and uncommenting the JOBDIR key/value pair.
    2. Have the Redis server running in the background at localhost:6379.
  • Pending issues are:

    1. Modification of scrapy/extensions/spiderstate.py to maintain the state with the Redis server.
    2. Passing the Redis server connection settings to the crawler from the settings file.
  • If you could direct me to some existing tests that this implementation should adhere to, it would be really helpful.

  • Note: presently, I have kept the in-memory type queue interface (through a quirky modification of the serialization method in scrapy/utils/reqser.py) because it may be required for a pub-sub implementation. I am not entirely sure right now; once the main development is through, it will be easier to make the decision.

@whalebot-helmsman
Contributor

There are only 5 days before the end of the proposal acceptance period. I would advise you to focus on writing a proposal (https://github.com/python-gsoc/python-gsoc.github.io/blob/master/2019/application2019.md) and pay less attention to the draft implementation.

@rindhane
Author

There are only 5 days before the end of the proposal acceptance period. I would advise you to focus on writing a proposal (https://github.com/python-gsoc/python-gsoc.github.io/blob/master/2019/application2019.md) and pay less attention to the draft implementation.

@whalebot-helmsman Thanks for the heads-up. I have submitted a draft proposal on the GSoC portal.
Could @Gallaecio and @whalebot-helmsman please go through it and give some initial advice?

@whalebot-helmsman
Contributor

whalebot-helmsman commented Mar 31, 2020

@rindhane The idea behind this project is to handle all supported message queues as uniformly as possible. Is there an opportunity to add state persistence to all supported queues?

P.S. Today is the last day of the student application period.

@rindhane
Author

@rindhane The idea behind this project is to handle all supported message queues as uniformly as possible. Is there an opportunity to add state persistence to all supported queues?

P.S. Today is the last day of the student application period.

@whalebot-helmsman Could you further elaborate on the uniform handling of message queues? With respect to the proposed scope, where do you see such a discrepancy?

In the proposal, it is indicated that all the queues will be able to provide LIFO/FIFO/random/sharded styles of transactions with Scrapy through similarly structured classes with operating functions such as push(), pop(), open(), close(), etc., so that the Scrapy scheduler can work with any of them. This is similar to the queuelib implementation, which can provide persistent storage in the file system as well as in an SQLite database.
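As a hedged sketch of the "similarly structured class" described in the previous paragraph (the interface name and docstrings are illustrative, not an agreed design):

```python
from abc import ABC, abstractmethod


class ExternalQueue(ABC):
    """Uniform interface that each backend (Redis, Kafka, RabbitMQ, ...) would implement,
    so the Scrapy scheduler can work with any of them interchangeably."""

    @abstractmethod
    def open(self):
        """Establish the connection to the external queue service."""

    @abstractmethod
    def push(self, request):
        """Store one (serialized) request in the external queue."""

    @abstractmethod
    def pop(self):
        """Return the next request, or None if the queue is empty."""

    @abstractmethod
    def close(self):
        """Flush pending data and close the connection."""

    @abstractmethod
    def __len__(self):
        """Number of pending requests (used by the scheduler to know when it is done)."""
```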

The only difference among the queues will be in their internal working with respect to the transaction methodology, which will depend on the chosen interface library (e.g. redis-py for Redis). To elaborate, Kafka and RabbitMQ will be utilized as message brokers (pub-sub), while Redis is preferred to be utilized as a database rather than via the pub-sub method. This preference is just to use the inherently more efficient method. Since it cannot presently be known which method will work better, all the possible methods (in the case of Redis) will be built and provided, and the selection will be made based on the testing results.

Also, to keep the state of the spider in the message queue rather than in a file (the present condition), it is proposed to have a new extension called QueueState (which will work like scrapy/extensions/spiderstate.py but store details in the downstream queue), along with some changes to scrapy/core/scheduler.py and to job_dir in scrapy/utils/job.py. This modification will store the state in the persistent storage provided by the message queue rather than on the file system, and it will work across all the downstream queues.
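A minimal sketch of what the proposed QueueState extension might look like, modelled on the built-in SpiderState extension but persisting to Redis; the class, Redis key format, and REDIS_* setting names are assumptions:

```python
import pickle

import redis
from scrapy import signals


class QueueState:
    """Store spider.state in Redis instead of the JOBDIR file, as proposed above."""

    def __init__(self, host='localhost', port=6379):
        self.server = redis.Redis(host=host, port=port)

    @classmethod
    def from_crawler(cls, crawler):
        # Hypothetical settings for the Redis connection.
        ext = cls(
            host=crawler.settings.get('REDIS_HOST', 'localhost'),
            port=crawler.settings.getint('REDIS_PORT', 6379),
        )
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        # Restore any previously stored state, or start with an empty dict.
        data = self.server.get('%s:state' % spider.name)
        spider.state = pickle.loads(data) if data else {}

    def spider_closed(self, spider):
        # Persist the state back to Redis when the crawl stops.
        self.server.set('%s:state' % spider.name, pickle.dumps(spider.state))
```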

I hope the above explanation made sense.

I think I will work on improving the clarity of details in the proposal.
