Summary

An option to limit RAM and disk usage by the scheduler queue would make the engine take new requests from the spider only while there is space available.
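To make the proposal concrete, here is a minimal sketch of how such a limit could hook into the engine's existing backout check. The setting name `SCHEDULER_MAX_DISK_QUEUE_SIZE` and the `disk_queue_size()` method are invented for illustration; only `_needs_backout()` and the scraper-side `needs_backout()` exist in Scrapy today:

```python
# Hypothetical sketch, not existing Scrapy code:
# SCHEDULER_MAX_DISK_QUEUE_SIZE and scheduler.disk_queue_size() are invented
# names that show where a byte-based limit could plug into the engine.
def _needs_backout(self) -> bool:
    return (
        # existing condition: too much response data being processed
        self.scraper.slot.needs_backout()
        # proposed condition: the on-disk scheduler queue is too large,
        # so stop taking new requests from the spider until it shrinks
        or self.slot.scheduler.disk_queue_size()
        > self.settings.getint("SCHEDULER_MAX_DISK_QUEUE_SIZE")
    )
```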
Motivation
Recently we've run into the issue of high disk usage by the scheduler queue. We are going through a company registry and making a lot of requests, and these are the only requests the spider makes. Sample code:
```python
from collections.abc import Iterator

from scrapy import Request


# Excerpt from the spider class.
# crn - company registration number
def start(self, /) -> Iterator:
    # Reset generators to start from 0 after restarts
    for generator in self.generators:
        generator.reset()
    it = ((gen, crn) for gen in self.generators for crn in gen)
    for generator, crn in it:
        self.requests_in_progress += 1
        yield Request(
            self.url_pattern_check_crn.format(crn),
            self.check_crn_status,
            dont_filter=True,
            errback=self.check_crn_status_failed,
            cb_kwargs=dict(crn=crn, generator=generator),
        )
```
Currently, the generators produce 142,685,210 unique CRNs. Each request ends with either 404 (company not found) or 200 (company found).
After ~110k successful requests, the disk queue occupies 10 GB. Meanwhile, RAM usage does not exceed 200 MB.
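A rough projection, assuming a few hundred bytes per serialized request on disk (our estimate, not a measurement): if most of the CRN space ends up queued faster than it can be downloaded, 142,685,210 requests × ~300 B ≈ 43 GB, and at ~1 KB per serialized request roughly 143 GB of disk queue.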
Describe alternatives you've considered
Currently, we have worked around the issue by increasing CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN to a large enough number, but this only helps if the site can afford that many connections and does not apply any rate limiting.
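For concreteness, the workaround is just a pair of real Scrapy settings; the values below are illustrative:

```python
# settings.py
# Raise concurrency so requests are taken off the queue roughly as fast as
# the spider yields them; the right numbers depend on the target site.
CONCURRENT_REQUESTS = 500
CONCURRENT_REQUESTS_PER_DOMAIN = 500
```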
Additional context
There is the SCRAPER_SLOT_MAX_ACTIVE_SIZE setting, which is described as follows:

> Soft limit (in bytes) for response data being processed.
> While the sum of the sizes of all responses being processed is above this value, Scrapy does not process new requests.
"Scrapy does not process new requests" means Scrapy does not take new requests from the spider or does not put already scheduled requests to the downloader?
We are also using a FIFO queue for this spider, but I do not think this matters.
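(For reference, FIFO queues are selected with the following real Scrapy settings; the defaults are the LIFO variants.)

```python
# settings.py - switch the scheduler from the default LIFO queues to FIFO
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"
```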
"Scrapy does not process new requests" means Scrapy does not take new requests from the spider or does not put already scheduled requests to the downloader?
The code (self._needs_backout(), which calls self.scraper.slot.needs_backout()) is in scrapy/core/engine.py.
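Roughly, that check looks like this (paraphrased from scrapy/core/engine.py and scrapy/core/scraper.py; a sketch, not an exact copy of the current source):

```python
# scrapy/core/engine.py (paraphrased): while any of these conditions holds,
# the engine neither takes new requests from the spider nor feeds already
# scheduled requests to the downloader.
def _needs_backout(self) -> bool:
    return (
        not self.running
        or bool(self.slot.closing)
        or self.downloader.needs_backout()
        or self.scraper.slot.needs_backout()
    )

# scrapy/core/scraper.py (paraphrased): the scraper slot asks for backout
# while the total size of responses being processed exceeds
# SCRAPER_SLOT_MAX_ACTIVE_SIZE.
def needs_backout(self) -> bool:
    return self.active_size > self.max_active_size
```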