
[MRG+1] process_spider_exception on generators #2061

Merged
merged 46 commits into from Apr 1, 2019

Conversation

Member

@elacuesta elacuesta commented Jun 17, 2016

This PR is a starting point to fix #220.
It could probably use some more test cases, mostly to figure out what exactly is the desired behaviour when processing the exceptions.
I can't take much credit for this: if it breaks, blame @dangra 😛

@elacuesta elacuesta changed the title Process spider exception generator process_spider_exception on generators Jun 17, 2016
@elacuesta elacuesta force-pushed the process_spider_exception_generator branch from 9c950e6 to 42c4ad7 Compare July 25, 2016 17:46
@codecov-io

codecov-io commented Jul 25, 2016

Codecov Report

Merging #2061 into master will increase coverage by 0.17%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2061      +/-   ##
==========================================
+ Coverage   84.48%   84.66%   +0.17%     
==========================================
  Files         167      167              
  Lines        9405     9454      +49     
  Branches     1397     1408      +11     
==========================================
+ Hits         7946     8004      +58     
+ Misses       1201     1195       -6     
+ Partials      258      255       -3
Impacted Files Coverage Δ
scrapy/core/scraper.py 88.51% <ø> (ø) ⬆️
scrapy/exceptions.py 91.3% <100%> (+0.82%) ⬆️
scrapy/core/downloader/middleware.py 100% <100%> (ø) ⬆️
scrapy/utils/python.py 83.68% <100%> (+1%) ⬆️
scrapy/core/spidermw.py 97.53% <100%> (+9.53%) ⬆️
scrapy/utils/defer.py 96.49% <0%> (+3.5%) ⬆️
scrapy/dupefilters.py 96.15% <0%> (+5.95%) ⬆️

@elacuesta
Member Author

elacuesta commented Jul 26, 2016

All checks passing now, but I have two concerns:

  • When process_spider_exception returns None the exception appears in the job stats, but it does not appear if the method returns an iterable. This was not introduced by this PR, so it should be addressed in a separate PR IMHO.
  • If a Response subclass object is returned as part of the iterable, Scrapy complains with ERROR: Spider must return Request, BaseItem, dict or None, got 'HtmlResponse'. As stated in this doc page responses might be returned as part of that iterable, but I'm not sure about the desired behaviour: is this a problem with Scrapy or with the docs?
    (Update: reading the docs carefully, I think it doesn't make sense for process_spider_exception to return responses as part of the iterable, since that iterable will be passed to the process_spider_output chain, which is supposed to receive requests, dicts or items; I believe the error is in the docs).
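The contract described in that update can be sketched with a minimal, hypothetical middleware (class and field names are illustrative, not the PR's actual code): `process_spider_exception` returns either `None` (keep propagating the exception) or an iterable of requests/items that feeds the `process_spider_output` chain, which is why returning responses doesn't fit.

```python
# Hypothetical sketch: per the docs' contract, process_spider_exception
# should return None or an iterable of requests/items -- not Response
# objects, since its output feeds the process_spider_output chain.

class RecoverMiddleware:
    def process_spider_exception(self, response, exception, spider):
        # Yielding items/requests stops the exception from propagating;
        # they continue down the remaining process_spider_output methods.
        yield {
            "error": repr(exception),
            "url": getattr(response, "url", None),
        }

mw = RecoverMiddleware()
items = list(mw.process_spider_exception(None, ValueError("parse failed"), None))
print(items[0]["error"])
# ValueError('parse failed')
```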

@elacuesta elacuesta force-pushed the process_spider_exception_generator branch from a65ebfa to d01c702 Compare July 26, 2016 15:58
Member

@kmike kmike left a comment

Thanks for the PR! 👍 for adding more tests. Code looks good, but for me it is not clear what happens if middleware raises an error instead of a spider, and what are the consequences. Could you please add docs and tests for that?

tests/test_spidermiddleware.py (three outdated review threads, resolved)
for method in self.methods['process_spider_output']:
-    result = method(response=response, result=result, spider=spider)
+    result = wrapper(method(response=response, result=result, spider=spider))
     assert _isiterable(result), \
         'Middleware %s must returns an iterable object, got %s ' % \
         (fname(method), type(result))
Member

Could you please check that this assertion is still active? I haven't tried the code, but it seems the loop in the wrapper method will fail before the _isiterable check.

Member Author

My answer is below, sorry. Seems like I don't fully get this new GitHub review thingy yet :-P

        for r in result_iterable:
            yield r
    except Exception as ex:
        exception_result = process_spider_exception(Failure(ex))
Member

Docs for process_spider_exception method say:

This method is called when a spider or process_spider_input() method (from other spider middleware) raises an exception.

After this change the method starts to fire when process_spider_output of a previous middleware raises an error; I'm not sure what the implications are, but it should be documented, and it could be backwards incompatible.

Member Author

Exceptions from process_spider_output are handled only when the method returns a generator. There are two cases:

  • process_spider_output is not a generator. In that case, if an exception is raised, the function's return value is not passed along, and since there is no result_iterable the exception handler does not get called. This is exactly the same as the current behaviour and documentation.
  • process_spider_output is a generator. In that case, the exception is not raised right away; it is actually raised when iterating over result_iterable, and the exception handler is called. This is what could break backwards compatibility, but since it's the very thing that's currently broken, I don't think anyone is relying on this functionality.

@kmike Please check the above to see if it makes any sense :-P
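The two cases above can be illustrated with plain generators (hypothetical names, independent of Scrapy's actual implementation): an exception inside a generator surfaces only while iterating, which is why wrapping the iteration lets a handler catch it.

```python
# Plain-Python sketch (hypothetical names) of the two cases above:
# a generator's exception is raised lazily, during iteration, so a
# wrapper around the iteration can route it to an exception handler.

def process_spider_output(result):
    """Middleware output method written as a generator."""
    for r in result:
        yield r
    raise ValueError("boom")  # not raised until iteration reaches it

def exception_wrapper(iterable, handler):
    """Iterate a result, routing any raised exception to a handler."""
    try:
        for item in iterable:
            yield item
    except Exception as exc:
        for item in handler(exc):
            yield item

def handler(exc):
    yield {"recovered": str(exc)}

# Calling the generator function raises nothing yet...
gen = process_spider_output(iter([1, 2]))
# ...the exception fires mid-iteration, where the wrapper recovers:
print(list(exception_wrapper(gen, handler)))
# [1, 2, {'recovered': 'boom'}]
```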

Member

Hm, this makes sense, but it is a bit tricky, and it could be easy to break by changing the current implementation. Are there tests for that?

Member Author

@elacuesta elacuesta Nov 22, 2016

I believe that

This method is called when a spider or process_spider_input() method (from other spider middleware) raises an exception

means that exceptions from previous process_spider_output methods should be handled by process_spider_exception; the way of getting a spider's result (either items, requests or exceptions) is through the process_spider_output chain. Maybe the docs should be updated to reflect that?
I'm rewriting some tests and adding some more to make this clear.

@elacuesta elacuesta force-pushed the process_spider_exception_generator branch 3 times, most recently from 31c1bb1 to 949766d Compare October 7, 2016 12:14
-    assert _isiterable(result), \
-        'Middleware %s must returns an iterable object, got %s ' % \
-        (fname(method), type(result))
+    if _isiterable(result):
Member Author

@elacuesta elacuesta Oct 7, 2016

@kmike: At first I tried to find a way of checking that the result is iterable without exposing the AssertionError exception to be caught by some middleware's process_spider_exception, but then I thought: since any exception from process_spider_input should be passed to process_spider_exception (from the docs), it wouldn't be wrong if the result of process_spider_output were passed too.

That being said, I'm getting a bit confused about what the desired/expected behavior is/should be, please help me :-)

Member

I think it doesn't make sense to catch this error in process_spider_exception - this is a programmer error, not a problem with data. So it'd be nice if process_spider_exception were still unable to catch this error.

Another question: what happens if an error from process_spider_output is handled by process_spider_exception? For process_spider_input or for the spider itself it is documented that the process_spider_output chain is invoked, but for process_spider_exception it doesn't make sense to start process_spider_output from the beginning.

I think executing process_spider_output methods which are not executed yet makes sense in this case, but this should be tested and documented.
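A minimal sketch of that suggestion (hypothetical helper, not Scrapy's API): the iterable recovered by process_spider_exception re-enters the output chain at the first method that has not yet run, instead of restarting from the beginning.

```python
# Hypothetical sketch of resuming the output chain: if middleware N's
# process_spider_output raises and a process_spider_exception handler
# returns an iterable, only the not-yet-executed methods (N+1 onward)
# should process the recovered iterable.

def run_output_chain(methods, result, start_index=0):
    """Apply the remaining output methods, starting at start_index."""
    for method in methods[start_index:]:
        result = method(result)
    return result

double = lambda result: [x * 2 for x in result]
add_one = lambda result: [x + 1 for x in result]

# Normal run: both methods execute.
print(run_output_chain([double, add_one], [1, 2]))
# [3, 5]

# Recovery run: `double` already ran (and raised mid-way), so the
# recovered iterable re-enters the chain at index 1.
print(run_output_chain([double, add_one], [10, 20], start_index=1))
# [11, 21]
```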

scrapy/core/spidermw.py (outdated review thread, resolved)
@@ -11,6 +11,11 @@ class NotConfigured(Exception):
     """Indicates a missing configuration situation"""
     pass

+class InvalidValue(TypeError):
Member Author

@kmike I named the exception like this, but suggestions are welcome of course :-)
Regardless of the name, I think this same exception should also be raised by the downloader middleware manager when a returned value is invalid, but that falls outside the scope of this PR; I will open a separate one.

Member

I think it makes sense to make this exception private (add underscore, undocument it) because it is a Scrapy implementation detail - why would user want to use it?

InvalidValue sounds a lot like ValueError, but it is inherited from TypeError; what about _InvalidOutput or something like that?

Member Author

I like the alternative name, but I think it shouldn't be private. This exception will be raised if a user-implemented middleware's process_request or process_response returns an invalid value; it's supposed to appear in the user log.

Member Author

I renamed the exception to InvalidOutput, but kept it public because of the reasons above. Also, I think subclassing from TypeError ("Passing arguments of the wrong type (e.g. passing a list when an int is expected) should result in a TypeError") is more appropriate than ValueError ("Raised when a built-in operation or function receives an argument that has the right type but an inappropriate value")

@elacuesta elacuesta force-pushed the process_spider_exception_generator branch from ffbf09f to 533c799 Compare February 9, 2017 13:20
@elacuesta
Member Author

Current status of this PR is:

  • Catch exceptions from previous process_spider_output (also documented)
  • Execute only non already called process_spider_output methods when process_spider_exception returns an iterable
  • Raise InvalidValue exception (scrapy-specific, name suggestions are welcome) when an invalid value is returned from a spider middleware's processing method

@kmike I think that addresses your latest concerns, please let me know if there's anything more I can do to get this PR moving. Thanks!

scrapy/core/spidermw.py (two outdated review threads, resolved)
""" return value is NOT a generator """
name = 'not_a_generator'
def parse(self, response):
raise AssertionError
Member

Not returning a generator sounds like a separate case from raising an exception

Member Author

@elacuesta elacuesta Feb 22, 2017

I'm afraid I don't understand what you mean 😕

What I'm trying to do there is catch exceptions whether or not the result of a callback is a generator. This particular test (i.e. exceptions from callbacks which do not return generators) would pass even before the modifications from this PR.

Let me know if I didn't answer your question :-)

@kmike kmike added this to the v1.4 milestone Feb 22, 2017
@elacuesta elacuesta force-pushed the process_spider_exception_generator branch from 533c799 to 7705466 Compare February 22, 2017 18:04
@elacuesta elacuesta force-pushed the process_spider_exception_generator branch from 7705466 to 706ed0e Compare March 1, 2017 15:02
@redapple
Contributor

redapple commented Mar 6, 2017

@kmike , what do you think of this PR now?

@@ -11,6 +11,11 @@ class NotConfigured(Exception):
     """Indicates a missing configuration situation"""
     pass

+class InvalidOutput(TypeError):
Member

+1 to inherit it from TypeError

indicate that some method returned a value not supported by the processing
chain.
See :ref:`topics-spider-middleware` and :ref:`topics-downloader-middleware`
for a list of supported output values.
Member

@elacuesta in #2061 (comment) you argued that it is good to have this exception public & documented because it may appear in logs. But I still don't quite like making it public :) There are a lot of exceptions which can appear in logs (from Scrapy, from Twisted, other Python exceptions), and we don't document them all. To make an exception in the logs readable one can use a readable error message, which is already the case.

There is a non-zero overhead to documenting an exception and making it public - a user may read the docs and wonder how this exception can be used. All other documented Scrapy exceptions are useful for users - a user may want to either raise them or catch them; unlike all the others, this exception shouldn't be raised or caught in user code.

The docs say that "This exception can be raised by a downloader or spider middleware", but this information can be misleading for users - we document that exceptions raised in middlewares can be caught in the process_exception method, and we document that InvalidOutput can be raised in a middleware - but in fact this particular exception can't be caught in process_exception.

For me this exception still looks like an implementation detail with no use for end users; documenting it in the Scrapy docs doesn't help Scrapy users, it only makes the docs longer to read and a bit more confusing. For example, users may think they should raise this exception if they validate items yielded by a spider and validation fails ("value not supported by the processing chain") - but they shouldn't raise this exception themselves.

I think information you've written here is valuable for people who modify Scrapy itself; it can be great to have something like that in the exception docstring, to make it easier for people to figure out how Scrapy works internally.

Member Author

I see your point, there is no case in which a user would raise or catch this exception, it's only something to make them check their code if they write a bad middleware. I'll remove the docs about it. Thanks for the feedback 👍

Member Author

I removed the exception from the docs and added an underscore to the name. I also modified the DownloaderMiddlewareManager class to raise _InvalidOutput too; IMHO it's not a good idea to raise a custom exception from the spider middleware but an AssertionError from the downloader middleware.

scrapy/core/spidermw.py (outdated review thread, resolved)
-This method is called when a spider or :meth:`process_spider_input`
-method (from other spider middleware) raises an exception.
+This method is called when a spider or :meth:`process_spider_output`
+method (from a previous spider middleware) raises an exception.
Member

👍

if hasattr(mw, 'process_start_requests'):
    self.methods['process_start_requests'].insert(0, mw.process_start_requests)
self.methods['process_spider_output'].insert(0, getattr(mw, 'process_spider_output', None))
self.methods['process_spider_exception'].insert(0, getattr(mw, 'process_spider_exception', None))
Member

@kmike kmike Oct 9, 2018

To check my understanding: are you adding None to the list to make start_index (and slicing) work properly?

Member Author

Yes, precisely. They are skipped by the if method is None: continue blocks when iterating over the process_spider_output and process_spider_exception methods.
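The placeholder trick can be sketched in a few lines (simplified, hypothetical middleware classes): storing None for middlewares that don't define a method keeps every middleware at a stable index, so slicing the method list by a start index stays aligned with the middleware order.

```python
# Sketch of the None-placeholder approach explained above (hypothetical
# classes): every middleware occupies a fixed slot in the methods list,
# and missing methods are skipped at call time.

class AddOne:
    def process_spider_output(self, result):
        return [x + 1 for x in result]

class NoOutputMethod:
    pass  # defines no process_spider_output

middlewares = [AddOne(), NoOutputMethod(), AddOne()]
methods = [getattr(mw, 'process_spider_output', None) for mw in middlewares]

result = [1, 2]
for method in methods:
    if method is None:
        continue  # placeholder: this middleware lacks the method
    result = method(result)
print(result)
# [3, 4]
```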

@elacuesta
Member Author

Hi there @kmike 👋
Thanks for the feedback, I updated the PR based on your comments.

Member

@kmike kmike left a comment

Thanks @elacuesta! I don't have any further feedback, thanks for carefully addressing everything :)

I think this is good to go, but as the code is quite complex, I think it is better to merge it after the 1.6 release, and after a review from another Scrapy committer (as usual).

🎉 🎂 🎈

@kmike kmike added this to the v1.7 milestone Oct 25, 2018
@elacuesta
Member Author

This is awesome, a million thanks @kmike 🙌

Looking forward to 1.7 then, thanks again!

@kmike kmike changed the title process_spider_exception on generators [MRG+1] process_spider_exception on generators Mar 22, 2019
Member

@Gallaecio Gallaecio left a comment

It looks good to me, though I’m hesitant to merge it myself.

@kmike
Member

kmike commented Mar 26, 2019

Heh, poor @elacuesta - at least 4 Scrapy committers looked at it in some way; there are two +1's now, but we're not brave enough to merge it. Please don't take it personally - that's the price of tackling hard issues :) @dangra @lopuhin do you want to take another look? If not, I can merge it this week.

Member

@lopuhin lopuhin left a comment

Hey, I checked the code and docs, didn't check the tests. I think I understand the code, and I think it matches the updated docs. The way execution jumps between process_spider_output and process_spider_exception makes it tricky, but I don't see how this could be improved.

+1 to merge from me, I can click the green button unless @dangra also wants to check it.

@lopuhin lopuhin merged commit b5c552d into scrapy:master Apr 1, 2019
@lopuhin
Member

lopuhin commented Apr 1, 2019

Many thanks @elacuesta and thanks @kmike and @Gallaecio for review 👍


Successfully merging this pull request may close these issues.

process_spider_exception() not invoked for generators
6 participants