[MRG+1] process_spider_exception on generators #2061
Conversation
Force-pushed from 9c950e6 to 42c4ad7
Codecov Report
@@ Coverage Diff @@
## master #2061 +/- ##
==========================================
+ Coverage 84.48% 84.66% +0.17%
==========================================
Files 167 167
Lines 9405 9454 +49
Branches 1397 1408 +11
==========================================
+ Hits 7946 8004 +58
+ Misses 1201 1195 -6
+ Partials 258 255 -3
All checks passing now, but I have two concerns:
Force-pushed from a65ebfa to d01c702
Thanks for the PR! 👍 for adding more tests. Code looks good, but for me it is not clear what happens if middleware raises an error instead of a spider, and what are the consequences. Could you please add docs and tests for that?
scrapy/core/spidermw.py
Outdated
    for method in self.methods['process_spider_output']:
    -   result = method(response=response, result=result, spider=spider)
    +   result = wrapper(method(response=response, result=result, spider=spider))
        assert _isiterable(result), \
            'Middleware %s must returns an iterable object, got %s ' % \
            (fname(method), type(result))
Could you please check that this assertion is still active? I haven't tried the code, but it seems the loop in the wrapper method will fail before the _isiterable check.
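The concern can be reduced to a standalone sketch (hypothetical names, not the PR's actual code): once a value is wrapped in a generator, an "is it iterable?" assertion on the wrapper always passes, and the underlying TypeError only surfaces later, when the wrapper is iterated.

```python
import collections.abc

def wrapper(result):
    # Mimics the PR's wrapper: iterates the wrapped result lazily
    # inside a generator function.
    def _evaluate():
        for r in result:
            yield r
    return _evaluate()

wrapped = wrapper(42)  # 42 is not iterable, but no error is raised yet

# The assertion now sees a generator object, so it can no longer catch
# the bad return value at this point:
assert isinstance(wrapped, collections.abc.Iterable)  # passes

try:
    next(wrapped)  # the TypeError only appears on iteration
except TypeError as ex:
    print("deferred error:", ex)
```

This is why the assertion on the wrapped result is effectively dead code: the non-iterable case no longer fails at the assert site.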
My answer is below, sorry. Seems like I don't fully get this new GitHub review thingy yet :-P
scrapy/core/spidermw.py
Outdated
    for r in result_iterable:
        yield r
    except Exception as ex:
        exception_result = process_spider_exception(Failure(ex))
Docs for the process_spider_exception method say:
"This method is called when a spider or process_spider_input() method (from other spider middleware) raises an exception."
After this change the method starts to fire when the process_spider_output of a previous middleware raises an error; I'm not sure what the implications are, but it should be documented, and it could be backwards incompatible.
Exceptions from process_spider_output are handled only when the method returns a generator. There are two cases:
1. process_spider_output is not a generator. In that case, if an exception is raised, the function's return value is not passed along, and since there is no result_iterable the exception handler does not get called. This is exactly the same as the current behaviour and documentation.
2. process_spider_output is a generator. In that case, the exception is not raised right away; it actually raises when iterating over result_iterable, and the exception handler is called. This is what could break backwards compatibility, but since it's the very thing that's currently broken, I don't think anyone is relying on this functionality.
@kmike Please check the above to see if it makes any sense :-P
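The distinction between the two cases comes down to how Python evaluates generators. A minimal illustration (standalone sketch, not Scrapy code): an exception in a plain function escapes at call time, before any result exists, while an exception in a generator is deferred until iteration, where a handler wrapped around the loop can run.

```python
def plain():
    # Case 1: a regular function - the exception escapes immediately.
    raise ValueError("raised at call time")

def gen():
    # Case 2: a generator - the body does not run until iteration.
    raise ValueError("raised only when iterated")
    yield  # unreachable, but makes this function a generator

# Case 1: the call itself fails; there is never a result to iterate.
try:
    result = plain()
except ValueError as ex:
    print("caught at call:", ex)

# Case 2: calling the generator function succeeds...
result = gen()  # no exception here
# ...and the exception only surfaces inside the iteration loop,
# which is exactly where the PR's handler is installed.
try:
    for item in result:
        pass
except ValueError as ex:
    print("caught at iteration:", ex)
```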
Hm, this makes sense, but it is a bit tricky, and it could be easy to break by changing the current implementation. Are there tests for that?
I believe that
"This method is called when a spider or process_spider_input() method (from other spider middleware) raises an exception"
means that exceptions from previous process_spider_output methods should be handled by process_spider_exception; the way of getting a spider's result (either items, requests or exceptions) is through the process_spider_output chain. Maybe the docs should be updated to reflect that?
I'm rewriting some tests and adding some more to make this clear.
Force-pushed from 31c1bb1 to 949766d
    -   assert _isiterable(result), \
    -       'Middleware %s must returns an iterable object, got %s ' % \
    -       (fname(method), type(result))
    +   if _isiterable(result):
@kmike: At first I tried to find a way of checking that the result is iterable without exposing the AssertionError exception to be caught by some middleware's process_spider_exception, but then I thought, since any exception from process_spider_input should be passed to process_spider_exception (from the docs), it wouldn't be wrong if the result of process_spider_output passed too.
That being said, I'm getting a bit confused about what the desired/expected behavior is or should be, please help me :-)
I think it doesn't make sense to catch this error in process_spider_exception - this is a programmer error, not a problem with data. So it'd be nice if process_spider_exception still wasn't able to catch this error.
Another question: what happens if an error from process_spider_output is handled by process_spider_exception? For process_spider_input or for the spider itself it is documented that the process_spider_output chain is invoked, but for process_spider_exception it doesn't make sense to start process_spider_output from the beginning.
I think executing the process_spider_output methods which have not been executed yet makes sense in this case, but this should be tested and documented.
scrapy/exceptions.py
Outdated
    @@ -11,6 +11,11 @@ class NotConfigured(Exception):
        """Indicates a missing configuration situation"""
        pass

    +class InvalidValue(TypeError):
@kmike I named the exception like this, but suggestions are welcome of course :-)
Regardless of the name, I think this same exception should also be raised by the downloader middleware manager when a returned value is invalid, but that falls out of the scope of this PR, I will open a separate one.
I think it makes sense to make this exception private (add underscore, undocument it) because it is a Scrapy implementation detail - why would user want to use it?
InvalidValue sounds a lot like ValueError, but it is inherited from TypeError; what about _InvalidOutput or something like that?
I like the alternative name, but I think it shouldn't be private. This exception will be raised if a user-implemented middleware's process_request or process_response returns an invalid value; it's supposed to appear in the user log.
I renamed the exception to InvalidOutput, but kept it public for the reasons above. Also, I think subclassing from TypeError ("Passing arguments of the wrong type (e.g. passing a list when an int is expected) should result in a TypeError") is more appropriate than ValueError ("Raised when a built-in operation or function receives an argument that has the right type but an inappropriate value").
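A standalone sketch of the subclassing choice (the class body and helper here are illustrative, not the PR's actual implementation): because the exception derives from TypeError, a wrong *type* of return value raises something that is still catchable as a TypeError, matching the quoted guidance.

```python
class InvalidOutput(TypeError):
    """Illustrative: raised when a middleware returns a value of an
    unsupported type (a wrong type, hence TypeError, not ValueError)."""

def check_output(result):
    # Hypothetical validation helper: middleware output must be iterable.
    if not hasattr(result, '__iter__'):
        raise InvalidOutput(
            'Middleware must return an iterable, got %s'
            % type(result).__name__)
    return result

try:
    check_output(None)  # None is not an acceptable output type
except TypeError as ex:
    # Catchable as plain TypeError thanks to the subclassing choice.
    print(ex)
```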
Force-pushed from ffbf09f to 533c799
Current status of this PR is:
@kmike I think that addresses your latest concerns, please let me know if there's anything more I can do to get this PR moving. Thanks!
tests/test_spidermiddleware.py
Outdated
    """ return value is NOT a generator """
    name = 'not_a_generator'
    def parse(self, response):
        raise AssertionError
Not returning a generator sounds like a separate case from raising an exception
I'm afraid I don't understand what you mean 😕
What I'm trying to do there is catch exceptions whether or not the result of a callback is a generator. This particular test (i.e. exceptions from callbacks which do not return generators) would pass even before the modifications from this PR.
Let me know if I didn't answer your question :-)
Force-pushed from 533c799 to 7705466
Force-pushed from 7705466 to 706ed0e
@kmike, what do you think of this PR now?
scrapy/exceptions.py
Outdated
    @@ -11,6 +11,11 @@ class NotConfigured(Exception):
        """Indicates a missing configuration situation"""
        pass

    +class InvalidOutput(TypeError):
+1 to inherit it from TypeError
docs/topics/exceptions.rst
Outdated
    indicate that some method returned a value not supported by the processing
    chain.
    See :ref:`topics-spider-middleware` and :ref:`topics-downloader-middleware`
    for a list of supported output values.
@elacuesta in #2061 (comment) you argued that it is good to have this exception public & documented because it may appear in logs. But I still don't quite like making it public :) There are a lot of exceptions which can appear in logs (from Scrapy, from Twisted, other Python exceptions), and we don't document them all. To make an exception readable in logs one can use a readable error message, which is already the case.
There is a non-zero overhead to documenting an exception and making it public - a user may read the docs and wonder how this exception can be used. All other documented Scrapy exceptions are useful for users - a user may want to either raise them or catch them; unlike all the others, this exception shouldn't be raised or caught in user code.
The docs say that "This exception can be raised by a downloader or spider middleware", but this information can be misleading for users - we document that exceptions raised in middlewares can be caught in the process_exception method, and we document that InvalidOutput can be raised in a middleware - but in fact this particular exception can't be caught in process_exception.
For me this exception still looks like an implementation detail with no use for end users; documenting it in the Scrapy docs doesn't help Scrapy users, it only makes the docs longer to read and a bit more confusing. For example, users may think they should raise this exception if they validate items yielded by a spider and validation fails ("value not supported by the processing chain") - but they shouldn't raise this exception themselves.
I think the information you've written here is valuable for people who modify Scrapy itself; it could be great to have something like that in the exception docstring, to make it easier for people to figure out how Scrapy works internally.
I see your point, there is no case in which a user would raise or catch this exception, it's only something to make them check their code if they write a bad middleware. I'll remove the docs about it. Thanks for the feedback 👍
I removed the exception from the docs and added an underscore to the name. I also modified the DownloaderMiddlewareManager class to raise _InvalidOutput too; IMHO it's not a good idea to raise a custom exception from the spider middleware but an AssertionError from the downloader middleware.
    -This method is called when a spider or :meth:`process_spider_input`
    -method (from other spider middleware) raises an exception.
    +This method is called when a spider or :meth:`process_spider_output`
    +method (from a previous spider middleware) raises an exception.
👍
scrapy/core/spidermw.py
Outdated
    if hasattr(mw, 'process_start_requests'):
        self.methods['process_start_requests'].insert(0, mw.process_start_requests)
    self.methods['process_spider_output'].insert(0, getattr(mw, 'process_spider_output', None))
    self.methods['process_spider_exception'].insert(0, getattr(mw, 'process_spider_exception', None))
To check my understanding: are you adding None to the list to make start_index (and slicing) work properly?
Yes, precisely. They are skipped by the "if method is None: continue" blocks when iterating over the process_spider_output and process_spider_exception methods.
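The placeholder scheme can be sketched in isolation (hypothetical middleware classes, not the PR's actual code): every middleware contributes exactly one slot per method list, possibly None, so positions line up across lists and slicing by a middleware's index still addresses the right entries.

```python
class MwA:
    def process_spider_output(self, result):
        return [x + "-A" for x in result]

class MwB:
    # Defines no process_spider_output at all.
    pass

middlewares = [MwA(), MwB()]

methods = []
for mw in middlewares:
    # getattr(..., None) keeps one slot per middleware even when
    # the method is missing, so indices stay aligned across lists.
    methods.insert(0, getattr(mw, 'process_spider_output', None))

result = ['item']
for method in methods:
    if method is None:  # skipped placeholder
        continue
    result = method(result)
print(result)  # ['item-A']
```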
Hi there @kmike 👋
Thanks @elacuesta! I don't have any further feedback, thanks for carefully addressing everything :)
I think this is good to go, but as the code is quite complex, I think it is better to merge it after the 1.6 release, and after a review from another Scrapy committer (as usual).
🎉 🎂 🎈
This is awesome, a million thanks @kmike 🙌 Looking forward to 1.7 then, thanks again!
It looks good to me, though I’m hesitant to merge it myself.
Heh, poor @elacuesta - at least 4 Scrapy committers looked at it in some way; there are two +1's now, but we're not brave enough to merge it. Please don't take it personally - that's the price of tackling hard issues :) @dangra @lopuhin do you want to take another look? If not, I can merge it this week.
Hey, I checked the code and docs, didn't check the tests. I think I understand the code, and I think it matches the updated docs. The way execution jumps between process_spider_output and process_spider_exception makes it tricky, but I don't see how this could be improved.
+1 to merge from me, I can click the green button unless @dangra also wants to check it.
Many thanks @elacuesta and thanks @kmike and @Gallaecio for the review 👍
This PR is a starting point to fix #220.
It could probably use some more test cases, mostly to figure out what exactly is the desired behaviour when processing the exceptions.
I can't take much credit for this: if it breaks, blame @dangra 😛