
First Spider Middleware does not process exception for generator callback #4260

Closed
StasDeep opened this issue Dec 30, 2019 · 4 comments · Fixed by #4272
StasDeep (Contributor) commented Dec 30, 2019

Description

The process_spider_exception method of a spider middleware is ignored when that middleware is the first one in the chain and the spider callback is a generator.

Steps to Reproduce

  1. Create a spider middleware with process_spider_exception defined
  2. Put it into SPIDER_MIDDLEWARES with a number greater than 900 (to make it first)
  3. Raise an exception in a spider callback
  4. UPD: also yield an item from the callback, making it a generator (see the sketch below)

Expected behavior: process_spider_exception is called for this exception

Actual behavior: process_spider_exception is not called
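
A minimal sketch of the setup described above (hypothetical names; this is an illustration of the steps, not code from the report):

import scrapy


class FirstMiddleware:
    # Hypothetical middleware placed first in the chain (number > 900).
    def process_spider_exception(self, response, exception, spider):
        # Expected to run for the exception raised in parse() below,
        # but it is skipped when the callback is a generator.
        print('Exception caught:', exception)
        return []


class ReproSpider(scrapy.Spider):
    name = 'repro'
    start_urls = ['https://example.org']
    custom_settings = {
        'SPIDER_MIDDLEWARES': {__name__ + '.FirstMiddleware': 901},
    }

    def parse(self, response):
        yield {}  # the yield makes this callback a generator, triggering the bug
        raise Exception('boom')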

Versions

Scrapy       : 1.7.1
lxml         : 4.2.5.0
libxml2      : 2.9.8
cssselect    : 1.0.3
parsel       : 1.5.1
w3lib        : 1.19.0
Twisted      : 18.9.0
Python       : 3.6.8 (default, May  8 2019, 05:35:00) - [GCC 6.3.0 20170516]
pyOpenSSL    : 18.0.0 (OpenSSL 1.1.0j  20 Nov 2018)
cryptography : 2.4.2
Platform     : Linux-4.9.184-linuxkit-x86_64-with-debian-9.9
elacuesta (Member) commented Dec 31, 2019

Could you provide some code and logs? I could not reproduce the issue, either with 1.7.1 (your version) or with 1.8.0 (the latest version). I assigned both 1 and 901 to the middleware, and the exception is caught regardless (please see the diagram in this comment; the last process_spider_exception method to be executed is the one from the middleware with the lowest number).

import scrapy


class ExceptionHandlerMiddleware:
    def process_spider_exception(self, response, exception, spider):
        print('Exception caught:', exception)
        return []


class ExceptionSpider(scrapy.Spider):
    name = 'exception_spider'
    start_urls = ['https://example.org']
    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            __name__ + '.ExceptionHandlerMiddleware': 1,
        }
    }

    def parse(self, response):
        raise Exception('foo')
$ scrapy runspider lastmw.py
(...)
2019-12-31 01:32:25 [scrapy.core.engine] INFO: Spider opened
2019-12-31 01:32:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-12-31 01:32:25 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-12-31 01:32:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None)
Exception caught: foo
2019-12-31 01:32:25 [scrapy.core.engine] INFO: Closing spider (finished)
(...)
$ scrapy version --verbose
Scrapy       : 1.7.1
lxml         : 4.2.6.0
libxml2      : 2.9.9
cssselect    : 1.0.3
parsel       : 1.5.1
w3lib        : 1.19.0
Twisted      : 18.9.0
Python       : 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018, 19:50:54) - [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
pyOpenSSL    : 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018)
cryptography : 2.3.1
Platform     : Darwin-18.7.0-x86_64-i386-64bit
$ scrapy version --verbose
Scrapy       : 1.8.0
lxml         : 4.2.6.0
libxml2      : 2.9.9
cssselect    : 1.0.3
parsel       : 1.5.1
w3lib        : 1.19.0
Twisted      : 18.9.0
Python       : 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018, 19:50:54) - [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
pyOpenSSL    : 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018)
cryptography : 2.3.1
Platform     : Darwin-18.7.0-x86_64-i386-64bit

StasDeep (Contributor, Author) commented Jan 3, 2020

If you add yield {} to the callback (it doesn't matter whether before or after the exception), the exception won't be caught.
It is caught, however, when the middleware's number is not the highest one (e.g. 899 works, but 901 does not).
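
The difference comes down to generator semantics: a generator callback's body does not run when the callback is called, only when its output is iterated. A minimal plain-Python illustration (made-up names, no Scrapy involved):

def plain_callback():
    # a plain function: the body runs, and raises, as soon as it is called
    raise Exception('foo')

def generator_callback():
    # a generator: calling it only creates an iterator; the body runs later
    yield {}
    raise Exception('foo')

try:
    plain_callback()
except Exception as exc:
    print('plain callback raises at call time:', exc)

result = generator_callback()   # no exception here: the body has not run yet

try:
    for item in result:
        print(item)             # prints {}
except Exception as exc:
    print('generator raises at iteration time:', exc)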

StasDeep changed the title from "Last Spider Middleware does not process exception" to "First Spider Middleware does not process exception for generator callback" on Jan 3, 2020
elacuesta (Member) commented
Ok, that changes things. Indeed, this is what I'm getting:

import scrapy


class ExceptionHandlerMiddleware:
    def process_spider_exception(self, response, exception, spider):
        print('Exception caught:', exception)
        return []


class ExceptionSpider(scrapy.Spider):
    name = 'exception_spider'
    start_urls = ['https://example.org']
    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            __name__ + '.ExceptionHandlerMiddleware': 901,
        }
    }

    def parse(self, response):
        yield
        raise Exception('foo')
$ scrapy runspider lastmw.py
(...)
2020-01-03 12:21:09 [scrapy.core.engine] INFO: Spider opened
2020-01-03 12:21:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-01-03 12:21:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-01-03 12:21:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None)
2020-01-03 12:21:10 [scrapy.core.scraper] ERROR: Spider error processing <GET https://example.org> (referer: None)
Traceback (most recent call last):
  File "/.../scrapy/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/.../scrapy/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/.../scrapy/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/.../scrapy/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/.../scrapy/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/.../scrapy/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/.../scrapy/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/.../scrapy/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/.../scrapy/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/.../lastmw.py", line 21, in parse
    raise Exception('foo')
Exception: foo
2020-01-03 12:21:10 [scrapy.core.engine] INFO: Closing spider (finished)
(...)

elacuesta (Member) commented Jan 5, 2020

It seems to me like this is an edge case of #220. By the time the middleware that's closest to the spider is executed, the iterable from the spider callback has not been evaluated yet. The exception is raised in the first process_spider_output method, but the execution chain jumps from one middleware to the next (the previous one in numerical order), so that first middleware's process_spider_exception method is never invoked.
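In plain Python, the mechanism looks roughly like this (a simplified sketch of the lazy wrapping, not Scrapy's actual middleware-manager code):

def callback():
    # stands in for the spider callback (a generator)
    yield {}
    raise Exception('foo')

def process_spider_output(result):
    # Stands in for a middleware's process_spider_output: it wraps the
    # iterable lazily and returns immediately, without consuming it.
    return (r for r in result)

iterable = callback()
iterable = process_spider_output(iterable)  # "first" middleware (901): returns at once
iterable = process_spider_output(iterable)  # next middleware (900), and so on

try:
    for item in iterable:
        print(item)                          # prints {} first
except Exception as exc:
    # The exception only surfaces here, in the outermost consumer, after the
    # first middleware's process_spider_output has long since returned.
    print('raised during final iteration:', exc)
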
I can think of two workarounds while this issue is still present:

  1. Do not assign the highest priority to the middleware in question
  2. Put a dummy middleware right in between the spider and the middleware manager:
import scrapy


class ExceptionHandlerMiddleware:
    def process_spider_exception(self, response, exception, spider):
        print('Exception caught:', exception)
        return []


class DummyMiddleware:
    def process_spider_output(self, response, result, spider):
        yield from result


class ExceptionSpider(scrapy.Spider):
    name = 'exception_spider'
    start_urls = ['https://example.org']
    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            __name__ + '.DummyMiddleware': 902,
            __name__ + '.ExceptionHandlerMiddleware': 901,
        }
    }

    def parse(self, response):
        yield
        raise Exception('foo')
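
(The dummy presumably does the trick because its yield from result consumes the callback's output inside the dummy's own process_spider_output generator, so the exception is raised one middleware further away from the spider, and the exception-handling chain that starts at the following middleware now includes ExceptionHandlerMiddleware.)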
