scrapy.pqueues.ScrapyPriorityQueue #6266
Comments
Do you have a disk queue corrupted by e.g. a prior spider crash? Otherwise you need to make a minimal reproducible example of this problem.
No, it should exist there because of how the class code is written.
@wRAR I clear the request cache directory before starting a new run, so starting a fresh crawl with a corrupt queue is not possible. Regarding the reproducible example, the error does not occur every time or for every crawl, so reproducing it is very difficult.
@Gallaecio I clear both.
The spiders run fine, but I get a KeyError for some spiders. Also, this error does not occur on every run.
@wRAR / @Gallaecio can you please advise how I can find and fix the issue?
No direct ideas, but I would start with logging additions and removals for
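The logging suggestion above can be sketched as a dict subclass that records every addition and removal, so a later KeyError can be traced back to the operation that removed the key. This is a hypothetical debugging aid, not part of Scrapy; the idea of wrapping the queue's internal dict is an assumption about how one might instrument it.

```python
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("pqueue-debug")


class LoggingDict(dict):
    """A dict that logs key additions and removals for debugging."""

    def __setitem__(self, key, value):
        logger.debug("added key %r", key)
        super().__setitem__(key, value)

    def __delitem__(self, key):
        logger.debug("removed key %r", key)
        super().__delitem__(key)

    def pop(self, key, *default):
        logger.debug("popped key %r", key)
        return super().pop(key, *default)
```

One could then (hypothetically) swap the priority queue's internal dict for a `LoggingDict` at startup and compare the log against the traceback when the KeyError next appears.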
@wRAR @Gallaecio I'm still looking for a solution and trying to debug the KeyError; however, I have started seeing another issue.
The spider never completes, and I have to kill the process and rerun.
Description
The ScrapyPriorityQueue raises a builtins.KeyError.
Steps to Reproduce
The error occurs randomly, and I am not sure how to reproduce it.
Versions
2.6.2
Resolution:
Since Python throws a KeyError when the key does not exist in the queue (which is a dict here), shouldn't we check whether the key exists first?
`if self.curprio is None` can be replaced with `if self.curprio is None or self.curprio not in self.queues`.
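To illustrate the proposed guard, here is a minimal sketch (not Scrapy's actual class) of a priority queue that keys per-priority queues in a dict: if `curprio` ever points at a priority that is no longer a key, indexing raises KeyError, while the combined check fails safe. The class and method names here are simplified stand-ins.

```python
class TinyPriorityQueue:
    """Minimal sketch of a dict-keyed priority queue (lower number = higher priority)."""

    def __init__(self):
        self.queues = {}     # priority -> list of pending requests
        self.curprio = None  # the priority currently being drained

    def push(self, request, priority):
        self.queues.setdefault(priority, []).append(request)
        if self.curprio is None or priority < self.curprio:
            self.curprio = priority

    def pop(self):
        # The guarded check proposed above: besides curprio being unset,
        # also verify it still exists as a key before indexing, so a stale
        # curprio (e.g. after a crash-corrupted queue) cannot raise KeyError.
        if self.curprio is None or self.curprio not in self.queues:
            return None
        q = self.queues[self.curprio]
        request = q.pop(0)
        if not q:
            # Drop the emptied bucket and advance to the next priority.
            del self.queues[self.curprio]
            self.curprio = min(self.queues) if self.queues else None
        return request
```

With the plain `if self.curprio is None` check, a stale `curprio` would hit `self.queues[self.curprio]` and raise the KeyError described in this issue; the extended condition turns that into a harmless empty pop.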