Document async signal handlers #4295

Closed
wRAR opened this issue Jan 31, 2020 · 11 comments · Fixed by #4390

@wRAR (Contributor) commented Jan 31, 2020

Right now there are 7 signals whose handlers can return Deferreds: https://docs.scrapy.org/en/latest/topics/signals.html?highlight=deferred

With #4271 these signal handlers can also be async def coroutines.

AFAIK the docs don't explain anywhere what this means or how it can be used. I looked at the code and the situation seems to be this:

item_scraped, item_dropped, item_error: I'm not 100% sure, as I couldn't fully untangle the web of callbacks and Deferreds inside Scraper, but it looks like a Request isn't counted as processed until the handlers for the items it produced have finished, which means spider_idle isn't called and so the spider isn't closed; it may also affect some queue-related things. I traced the callback chain down to finish_scraping inside Scraper.enqueue_scrape, which also calls a _scrape_next method.

engine_started, engine_stopped, spider_opened, spider_closed: these are part of the spider/engine startup/shutdown processes, and the process waits for the handlers to finish before proceeding to the next step.
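For illustration (not part of the original report), a minimal sketch of the two handler styles in question; the handler names are made up, deferLater merely stands in for a real async operation, and the coroutine form assumes a Scrapy version with #4271 merged:

from twisted.internet import reactor
from twisted.internet.task import deferLater

def spider_closed_deferred(spider):
    # Deferred-style handler: Scrapy waits for the returned Deferred
    # to fire before finishing the spider close process.
    return deferLater(reactor, 1, spider.logger.info, 'spider_closed handler done')

async def spider_closed_coro(spider):
    # Coroutine handler: equivalent behavior, written as async def.
    await deferLater(reactor, 1, lambda: None)
    spider.logger.info('spider_closed handler done')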

@wRAR added the docs label on Jan 31, 2020
@adityaa30 (Contributor) commented Feb 5, 2020

Hey! Shall I work on this issue?

@wRAR (Contributor, Author) commented Feb 5, 2020

@adityaa30 yes please. Feel free to ask for clarifications if you need them.

@adityaa30 (Contributor) commented Feb 5, 2020

@wRAR Okay. Thanks!

@adityaa30 (Contributor) commented Feb 7, 2020

So far, I have figured out that:

  1. All the signal handlers use the send_catch_log_deferred method in scrapy/signalhelper.py, which in turn calls send_catch_log_deferred in scrapy/utils/signal.py; that returns a Deferred that fires once the Deferreds of all the signal handlers have fired (a conceptual sketch follows this list)
  2. engine_started, engine_stopped, spider_opened, spider_closed are sent from scrapy/core/engine.py
  3. item_scraped, item_dropped, item_error are sent from the _itemproc_finished method of the Scraper class in scrapy/core/scraper.py
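Not Scrapy's actual code, but a conceptual sketch of the aggregation pattern described in point 1 (the function and argument names are made up):

from twisted.internet import defer

def send_signal_and_wait(handlers, **named):
    # Call every connected handler, wrap each result in a Deferred
    # (maybeDeferred also handles plain return values), and return a
    # Deferred that fires only after all of them have fired.
    dlist = [defer.maybeDeferred(handler, **named) for handler in handlers]
    return defer.DeferredList(dlist)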

However, if I write a simple program using any of the 7 signals that support returning Deferreds, and have the handler return a Deferred, the program stops right after the signal handler runs. For example:

from scrapy import Spider, signals
from scrapy.crawler import CrawlerProcess
from twisted.internet import defer


def show_values(*args, **kwargs):
    print(f'\nDeferCallback => \nArgs:\n{args}\nKwargs:\n{kwargs}\n')


class ExampleSpider(Spider):
    name = "example"
    start_urls = ["http://quotes.toscrape.com/"]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(ExampleSpider, cls).from_crawler(crawler, *args, **kwargs)
        assert isinstance(spider, ExampleSpider)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        return spider

    def spider_opened(self):
        print("spider_opened")
        d = defer.Deferred()
        d.addCallback(show_values)
        d.addErrback(show_values)
        return d

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                'title': quote.css("span.text::text").extract_first(),
                'author': quote.css(".author::text").extract_first(),
                'tags': quote.css(".tag::text").extract(),
            }

On running scrapy crawl example:

2020-02-07 13:09:27 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: scrapybot)
2020-02-07 13:09:27 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.6.9 (default, Aug 31 2019, 15:02:57) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-5.3.0-29-generic-x86_64-with-debian-buster-sid
2020-02-07 13:09:27 [scrapy.crawler] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2020-02-07 13:09:27 [scrapy.extensions.telnet] INFO: Telnet Password: 350d3196e83197d8
2020-02-07 13:09:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-02-07 13:09:27 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-02-07 13:09:27 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-02-07 13:09:27 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-02-07 13:09:27 [scrapy.core.engine] INFO: Spider opened
2020-02-07 13:09:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
spider_opened

and after this, the program does not go any further. However, if nothing is returned, it works fine.
@wRAR What could be the reason here?

@wRAR (Contributor, Author) commented Feb 7, 2020

@adityaa30 this is the correct behavior: the spider waits for that Deferred to fire. In real-world cases the handler would return a Deferred obtained from some async operation, and it would fire when that operation finishes. In tests you'll either need to do the same or fire the Deferred manually.

@adityaa30 (Contributor) commented Feb 11, 2020

Okay. So, firing the Deferred manually works by changing the handler to:

from twisted.internet import reactor  # needed for reactor.callLater

def spider_opened(self):
    print("spider_opened")
    d = defer.Deferred()
    # Fire the Deferred manually after 2 seconds
    reactor.callLater(2, d.callback, "Hello from spider_opened")
    d.addCallback(show_values)
    d.addErrback(show_values)
    return d

A general use case can be:

def signal_handler(self):
    d = getDeferredFromSomewhere()  # a Deferred from some async operation
    # Add callbacks here, e.g.:
    d.addCallback(success_callback)
    d.addErrback(error_callback)

    return d  # Scrapy waits for this Deferred to fire before moving on

I have gathered a basic understanding of how Deferred works and how it is used in Scrapy. What shall I do now, @wRAR?

@wRAR (Contributor, Author) commented Feb 22, 2020

@adityaa30 I think you should expand the "Deferred signal handlers" section in docs/topics/signals.rst with:

  • an explanation of why someone would want to return a Deferred (to signal that the processing of this signal will finish later);
  • a clarification that Scrapy will wait for the handlers of engine_started/stopped and spider_opened/closed before proceeding with the respective startup/shutdown process;
  • a similar clarification about the item_* signals, namely that the spider will wait for their handlers; it's better to spell out what exactly that means (which I'm not sure about, as I said in the original comment), but even a brief explanation is better than what we have now (a sketch of such a handler follows this list).
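For reference, a minimal sketch of the kind of handler the new docs section could show; the handler name, the 2-second delay, and the use of deferLater as a stand-in for real async work are placeholders, and it assumes a Scrapy version that accepts coroutine handlers (#4271):

from twisted.internet import reactor
from twisted.internet.task import deferLater

async def item_scraped_handler(item, response, spider):
    # Stand-in for real async work (e.g. pushing the item to an external
    # service); Scrapy will not consider the item fully processed until
    # this coroutine finishes.
    await deferLater(reactor, 2, lambda: None)
    spider.logger.info('item_scraped handler finished for %r', item)

# Connected the usual way, e.g. in from_crawler:
#     crawler.signals.connect(item_scraped_handler, signal=signals.item_scraped)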

@adityaa30 (Contributor) commented Feb 26, 2020

@wRAR Okay thanks a lot! I am working on it.

@adityaa30 (Contributor) commented Feb 26, 2020

@wRAR I think that, since at most settings['CONCURRENT_ITEMS'] items (per response) are processed in parallel by the item processor (see scraper.py), many Deferreds are returned in parallel and are aggregated with a DeferredList (see signal.py). Because of that, after one batch of scraped items has been processed, the next batch waits for the DeferredList to fire, and only then do the item_* signal handlers run for the next batch of scraped items.
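Roughly, the batching behavior described above could be pictured like this (a conceptual sketch only, not Scrapy's implementation; handle_item is a made-up stand-in for the per-item processing callback, and 100 is the default CONCURRENT_ITEMS value):

from twisted.internet import defer

@defer.inlineCallbacks
def process_scraped_items(items, handle_item, concurrent_items=100):
    # Process up to `concurrent_items` items in parallel and wait for
    # all of their Deferreds (including item_* signal handlers) before
    # moving on to the next batch.
    for i in range(0, len(items), concurrent_items):
        batch = [defer.maybeDeferred(handle_item, item)
                 for item in items[i:i + concurrent_items]]
        yield defer.DeferredList(batch)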
Let me know if I am moving in the correct direction?

@wRAR (Contributor, Author) commented Feb 27, 2020

@adityaa30 yeah, I think this is correct.

@adityaa30 (Contributor) commented Feb 28, 2020

@wRAR Thanks. I have sent PR #4390 to fix this issue.
