
[MRG+1] process_spider_exception on generators #2061

Merged
merged 46 commits into scrapy:master on Apr 1, 2019

Conversation

@elacuesta
Member

@elacuesta elacuesta commented Jun 17, 2016

This PR is a starting point to fix #220.
It could probably use some more test cases, mostly to figure out what exactly is the desired behaviour when processing the exceptions.
I can't take much credit for this: if it breaks, blame @dangra 😛

@elacuesta elacuesta changed the title Process spider exception generator process_spider_exception on generators Jun 17, 2016
@elacuesta elacuesta force-pushed the process_spider_exception_generator branch from 9c950e6 to 42c4ad7 Jul 25, 2016
@codecov-io

@codecov-io codecov-io commented Jul 25, 2016

Codecov Report

Merging #2061 into master will increase coverage by 0.17%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2061      +/-   ##
==========================================
+ Coverage   84.48%   84.66%   +0.17%
==========================================
  Files         167      167
  Lines        9405     9454      +49
  Branches     1397     1408      +11
==========================================
+ Hits         7946     8004      +58
+ Misses       1201     1195       -6
+ Partials      258      255       -3

Impacted Files                          Coverage Δ
scrapy/core/scraper.py                  88.51% <ø> (ø) ⬆️
scrapy/exceptions.py                    91.3% <100%> (+0.82%) ⬆️
scrapy/core/downloader/middleware.py    100% <100%> (ø) ⬆️
scrapy/utils/python.py                  83.68% <100%> (+1%) ⬆️
scrapy/core/spidermw.py                 97.53% <100%> (+9.53%) ⬆️
scrapy/utils/defer.py                   96.49% <0%> (+3.5%) ⬆️
scrapy/dupefilters.py                   96.15% <0%> (+5.95%) ⬆️

@elacuesta
Member Author

@elacuesta elacuesta commented Jul 26, 2016

All checks passing now, but I have two concerns:

  • When process_spider_exception returns None the exception appears in the job stats, but it does not appear if the method returns an iterable. This was not introduced by this PR, so it should be addressed in a separate PR IMHO.
  • If a Response subclass object is returned as part of the iterable, Scrapy complains with ERROR: Spider must return Request, BaseItem, dict or None, got 'HtmlResponse'. As stated in this doc page responses might be returned as part of that iterable, but I'm not sure about the desired behaviour: is this a problem with Scrapy or with the docs?
    (Update: reading the docs carefully, I think it doesn't make sense for process_spider_exception to return responses as part of the iterable, since that iterable will be passed to the process_spider_output chain, which is supposed to receive requests, dicts or items; I believe the error is in the docs).
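The point in the update above, that an iterable returned from `process_spider_exception` is handed to the `process_spider_output` chain, can be sketched in plain Python. The `Request` stand-in, the middleware class, and the URLs below are made up for illustration; only the hook name and its `(response, exception, spider)` signature follow Scrapy's documented spider middleware interface:

```python
class Request:
    """Stand-in for scrapy.Request, just enough for this sketch."""
    def __init__(self, url):
        self.url = url


class HandleErrorsMiddleware:
    """Hypothetical spider middleware that recovers from callback errors."""

    def process_spider_exception(self, response, exception, spider):
        # Returning an iterable (instead of None) swallows the exception;
        # these objects are then fed to the process_spider_output chain,
        # which expects requests, dicts or items - not responses.
        yield {'error': repr(exception), 'url': response.url}
        yield Request('https://example.com/retry')


class FakeResponse:
    """Stand-in for the response whose callback raised."""
    url = 'https://example.com/broken'


mw = HandleErrorsMiddleware()
result = list(
    mw.process_spider_exception(FakeResponse(), ValueError('boom'), spider=None)
)
```

Yielding a response object from this hook would then fail the output chain's type check, which is why the docs (rather than Scrapy) look wrong here.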

@elacuesta elacuesta force-pushed the process_spider_exception_generator branch from a65ebfa to d01c702 Jul 26, 2016
Member

@kmike kmike left a comment

Thanks for the PR! 👍 for adding more tests. The code looks good, but it is not clear to me what happens if a middleware raises an error instead of the spider, and what the consequences are. Could you please add docs and tests for that?

  for method in self.methods['process_spider_output']:
-     result = method(response=response, result=result, spider=spider)
+     result = wrapper(method(response=response, result=result, spider=spider))
      assert _isiterable(result), \
          'Middleware %s must returns an iterable object, got %s ' % \
          (fname(method), type(result))
Member

@kmike kmike Oct 5, 2016

Could you please check that this assertion is still active? I haven't tried the code, but it seems the loop in wrapper method will fail before _isiterable check.

Member Author

@elacuesta elacuesta Oct 7, 2016

My answer is below, sorry. Seems like I don't fully get this new GitHub review thingy yet :-P

        for r in result_iterable:
            yield r
    except Exception as ex:
        exception_result = process_spider_exception(Failure(ex))
Member

@kmike kmike Oct 5, 2016

Docs for process_spider_exception method say:

This method is called when a spider or process_spider_input() method (from other spider middleware) raises an exception.

After this change method starts to fire when process_spider_output of a previous middleware raises an error; I'm not sure what are the implications, but it should be documented, and it could be backwards incompatible.

Member Author

@elacuesta elacuesta Oct 7, 2016

Exceptions from process_spider_output are handled only when the method returns a generator. There are two cases:

  • process_spider_output is not a generator. In that case, if an exception is raised the function's return value is not passed along, and since there is no result_iterable the exception handler does not get called. This is exactly the same as the current behaviour and documentation.
  • process_spider_output is a generator. In that case, the exception is not raised right away; it actually raises while iterating over result_iterable, and the exception handler is called. This is what could break backwards compatibility, but since it's the very thing that's currently broken, I don't think anyone is relying on this functionality.

@kmike Please check the above to see if it makes any sense :-P
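The second case hinges on a basic property of Python generators: the body of a generator function does not run until iteration begins, so an exception raised inside it surfaces only when the result is consumed. A minimal, standalone illustration:

```python
def process_spider_output():
    # Illustrative generator standing in for a middleware method that fails;
    # the name is borrowed from the discussion, not Scrapy's actual code.
    raise ValueError('boom')
    yield  # the yield makes this function a generator function

result = process_spider_output()  # no exception yet: the body has not started
caught = None
try:
    for _ in result:              # the body runs now, and the error surfaces
        pass
except ValueError as ex:
    caught = ex
```

This is why the wrapper around result_iterable, and not the call site, is where the exception handler has to live.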

Member

@kmike kmike Nov 21, 2016

Hm, this makes sense, but it is a bit tricky, and it could be easy to break by changing the current implementation. Are there tests for that?

Member Author

@elacuesta elacuesta Nov 22, 2016

I believe that

This method is called when a spider or process_spider_input() method (from other spider middleware) raises an exception

means that exceptions from previous process_spider_output methods should be handled by process_spider_exception; the way of getting a spider's result (either items, requests or exceptions) is through the process_spider_output chain. Maybe the docs should be updated to reflect that?
I'm rewriting some tests and adding some more to make this clear.

@elacuesta elacuesta force-pushed the process_spider_exception_generator branch 3 times, most recently from 31c1bb1 to 949766d Oct 7, 2016
-     assert _isiterable(result), \
-         'Middleware %s must returns an iterable object, got %s ' % \
-         (fname(method), type(result))
+     if _isiterable(result):
Member Author

@elacuesta elacuesta Oct 7, 2016

@kmike: At first I tried to find a way of checking that the result is iterable without exposing the AssertionError to be caught by some middleware's process_spider_exception, but then I thought: since any exception from process_spider_input should be passed to process_spider_exception (per the docs), it wouldn't be wrong if exceptions from process_spider_output were passed too.

That being said, I'm getting a bit confused about what the desired/expected behavior is/should be, please help me :-)

Member

@kmike kmike Nov 21, 2016

I think it doesn't make sense to catch this error in process_spider_exception - this is a programmer error, not a problem with data. So it'd be nice if process_spider_exception still won't be able to catch this error.

Another question: what happens if an error from process_spider_output is handled by process_spider_exception? For process_spider_input or for spider itself it is documented that process_spider_output chain is invoked, but for process_spider_exception it doesn't make sense to start process_spider_output from the beginning.

I think executing process_spider_output methods which are not executed yet makes sense in this case, but this should be tested and documented.

@@ -11,6 +11,11 @@ class NotConfigured(Exception):
     """Indicates a missing configuration situation"""
     pass

+class InvalidValue(TypeError):
Member Author

@elacuesta elacuesta Nov 28, 2016

@kmike I named the exception like this, but suggestions are welcome of course :-)
Regardless of the name, I think this same exception should also be raised by the downloader middleware manager when a returned value is invalid, but that falls out of the scope of this PR, I will open a separate one.

Member

@kmike kmike Feb 22, 2017

I think it makes sense to make this exception private (add an underscore, leave it undocumented) because it is a Scrapy implementation detail - why would a user want to use it?

InvalidValue sounds a lot like ValueError, but it is inherited from TypeError; what about _InvalidOutput or something like that?

Member Author

@elacuesta elacuesta Feb 22, 2017

I like the alternative name, but I think it shouldn't be private. This exception is raised when a user-implemented middleware's process_request or process_response returns an invalid value; it's supposed to appear in the user's log.

Member Author

@elacuesta elacuesta Mar 1, 2017

I renamed the exception to InvalidOutput, but kept it public because of the reasons above. Also, I think subclassing from TypeError ("Passing arguments of the wrong type (e.g. passing a list when an int is expected) should result in a TypeError") is more appropriate than ValueError ("Raised when a built-in operation or function receives an argument that has the right type but an inappropriate value")

@elacuesta elacuesta force-pushed the process_spider_exception_generator branch from ffbf09f to 533c799 Feb 9, 2017
@elacuesta
Member Author

@elacuesta elacuesta commented Feb 9, 2017

Current status of this PR is:

  • Catch exceptions from previous process_spider_output (also documented)
  • Execute only the process_spider_output methods that have not already been called, when process_spider_exception returns an iterable
  • Raise InvalidValue exception (scrapy-specific, name suggestions are welcome) when an invalid value is returned from a spider middleware's processing method

@kmike I think that addresses your latest concerns, please let me know if there's anything more I can do to get this PR moving. Thanks!

    """ return value is NOT a generator """
    name = 'not_a_generator'

    def parse(self, response):
        raise AssertionError
Member

@kmike kmike Feb 22, 2017

Not returning a generator sounds like a separate case from raising an exception

Member Author

@elacuesta elacuesta Feb 22, 2017

I'm afraid I don't understand what you mean 😕

What I'm trying to do there is catch exceptions whether or not the result of a callback is a generator. This particular test (i.e. exceptions from callbacks which do not return generators) would pass even before the modifications from this PR.

Let me know if I didn't answer your question :-)

@kmike kmike added this to the v1.4 milestone Feb 22, 2017
@elacuesta elacuesta force-pushed the process_spider_exception_generator branch from 533c799 to 7705466 Feb 22, 2017
@elacuesta elacuesta force-pushed the process_spider_exception_generator branch from 7705466 to 706ed0e Mar 1, 2017
@redapple
Contributor

@redapple redapple commented Mar 6, 2017

@kmike , what do you think of this PR now?

@@ -11,6 +11,11 @@ class NotConfigured(Exception):
     """Indicates a missing configuration situation"""
     pass

+class InvalidOutput(TypeError):
Member

@kmike kmike Mar 7, 2017

+1 to inherit it from TypeError

    indicate that some method returned a value not supported by the processing
    chain.
    See :ref:`topics-spider-middleware` and :ref:`topics-downloader-middleware`
    for a list of supported output values.
Member

@kmike kmike Mar 9, 2017

@elacuesta in #2061 (comment) you argued that it is good to have this exception public & documented because it may appear in logs. But I still don't quite like making it public :) There are a lot of exceptions which can appear in logs (from Scrapy, from Twisted, other Python exceptions), and we don't document them all. To make an exception in the logs readable one can use a readable error message, which is already the case.

There is a non-zero overhead of documenting an exception and making it public - user may read docs and wonder how can this exception be used. All other Scrapy documented exceptions are useful for users - user may want to either raise them or catch them; unlike all other exceptions, this exception shouldn't be raised or caught in user code.

The docs say that "This exception can be raised by a downloader or spider middleware", but this information can be misleading for users - we document that exceptions raised in middlewares can be caught in the process_exception method, and we document that InvalidOutput can be raised in a middleware - but in fact this particular exception can't be caught in process_exception.

For me this exception still looks like an implementation detail with no use for end users; documenting it in the Scrapy docs doesn't help Scrapy users, it only makes these docs longer to read and a bit more confusing. For example, users may think they should raise this exception if they validate items yielded by a spider and validation fails ("value not supported by the processing chain") - but they shouldn't raise this exception themselves.

I think information you've written here is valuable for people who modify Scrapy itself; it can be great to have something like that in the exception docstring, to make it easier for people to figure out how Scrapy works internally.

Member Author

@elacuesta elacuesta Mar 9, 2017

I see your point, there is no case in which a user would raise or catch this exception, it's only something to make them check their code if they write a bad middleware. I'll remove the docs about it. Thanks for the feedback 馃憤

Member Author

@elacuesta elacuesta Mar 10, 2017

I removed the exception from the docs and added an underscore to the name. I also modified the DownloaderMiddlewareManager class to raise _InvalidOutput too, not a good idea to raise a custom exception from the spider middleware and AssertionError from the downloader middleware IMHO.
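For illustration, here is a rough sketch of the kind of explicit type check that replaces the old AssertionError in both middleware managers. The helper name, the allowed types, and the message format below are made up, not Scrapy's actual implementation:

```python
class _InvalidOutput(TypeError):
    """A middleware method returned a value of an unsupported type.

    Internal to the framework: users are not expected to raise or
    catch this exception themselves, only to see it in their logs.
    """

def _check_output(result, allowed_types, mw_name):
    # Hypothetical helper: validate a middleware's return value and
    # raise _InvalidOutput with a readable message if it is invalid.
    if not isinstance(result, allowed_types):
        raise _InvalidOutput(
            'Middleware %s returned %s, expected one of %s'
            % (mw_name, type(result).__name__,
               [t.__name__ for t in allowed_types]))
    return result

ok = _check_output({'item': 1}, (dict,), 'ExampleMiddleware')
error_message = None
try:
    _check_output('bad', (dict,), 'ExampleMiddleware')
except _InvalidOutput as ex:
    error_message = str(ex)
```

A dedicated exception, unlike a bare AssertionError, survives `python -O` (which strips assert statements) and makes the failure mode explicit in the log.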

-        This method is called when a spider or :meth:`process_spider_input`
-        method (from other spider middleware) raises an exception.
+        This method is called when a spider or :meth:`process_spider_output`
+        method (from a previous spider middleware) raises an exception.
Copy link
Member

@kmike kmike Oct 9, 2018

👍

        if hasattr(mw, 'process_start_requests'):
            self.methods['process_start_requests'].insert(0, mw.process_start_requests)

        self.methods['process_spider_output'].insert(0, getattr(mw, 'process_spider_output', None))
        self.methods['process_spider_exception'].insert(0, getattr(mw, 'process_spider_exception', None))
Member

@kmike kmike Oct 9, 2018

To check my understanding: are you adding None to the list to make start_index (and slicing) work properly?

Member Author

@elacuesta elacuesta Oct 11, 2018

Yes, precisely. They are skipped by the if method is None: continue blocks when iterating over the process_spider_output and process_spider_exception methods.
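A toy version of that bookkeeping, with made-up middleware functions standing in for Scrapy's actual code, shows how the None placeholders keep positional slicing intact:

```python
def mw1_output(result):
    # hypothetical process_spider_output of the first middleware
    return ['mw1'] + result

def mw3_output(result):
    # hypothetical process_spider_output of the third middleware
    return ['mw3'] + result

# The second middleware lacks the hook, so it contributes None; the
# placeholder keeps each middleware's position in the list stable.
methods = [mw1_output, None, mw3_output]

def run_remaining(result, start_index):
    # Resume the chain from a given position (e.g. after an exception
    # was handled part-way through), skipping the None placeholders.
    for method in methods[start_index:]:
        if method is None:
            continue
        result = method(result)
    return result

full = run_remaining([], 0)     # all hooks run
resumed = run_remaining([], 2)  # only the third middleware runs
```

Without the placeholders, `start_index` computed against one list would point at the wrong method in the other.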

@elacuesta
Member Author

@elacuesta elacuesta commented Oct 11, 2018

Hi there @kmike 👋
Thanks for the feedback, I updated the PR based on your comments.

kmike
kmike approved these changes Oct 25, 2018
Member

@kmike kmike left a comment

Thanks @elacuesta! I don't have any further feedback, thanks for carefully addressing everything :)

I think this is good to go, but as the code is quite complex, I think it is better to merge it after the 1.6 release, and after a review from another Scrapy committer (as usual).

🎉 🎊 🎈

@kmike kmike added this to the v1.7 milestone Oct 25, 2018
@elacuesta
Member Author

@elacuesta elacuesta commented Oct 29, 2018

This is awesome, a million thanks @kmike 🙌

Looking forward to 1.7 then, thanks again!

@kmike kmike changed the title process_spider_exception on generators [MRG+1] process_spider_exception on generators Mar 22, 2019
Member

@Gallaecio Gallaecio left a comment

It looks good to me, though I'm hesitant to merge it myself.

@kmike
Member

@kmike kmike commented Mar 26, 2019

Heh, poor @elacuesta - at least 4 Scrapy committers looked at it in some way; there are two +1's now, but we're not brave enough to merge it. Please don't take it personally - that's the price of tackling hard issues :) @dangra @lopuhin do you want to take another look? If not, I can merge it this week.

Member

@lopuhin lopuhin left a comment

Hey, I checked the code and docs, didn't check the tests. I think I understand the code, and I think it matches the updated docs. The way execution jumps between process_spider_output and process_spider_exception makes it tricky, but I don't see how this could be improved.

+1 to merge from me, I can click the green button unless @dangra also wants to check it.

@lopuhin lopuhin merged commit b5c552d into scrapy:master Apr 1, 2019
3 checks passed
@lopuhin
Member

@lopuhin lopuhin commented Apr 1, 2019

Many thanks @elacuesta and thanks @kmike and @Gallaecio for review 👍
