
[MRG+1] process_spider_exception on generators #2061

Merged
merged 46 commits into from Apr 1, 2019

Conversation

Member

@elacuesta elacuesta commented Jun 17, 2016

This PR is a starting point to fix #220.
It could probably use some more test cases, mostly to figure out what exactly is the desired behaviour when processing the exceptions.
I can't take much credit for this: if it breaks, blame @dangra 😛

@elacuesta elacuesta changed the title Process spider exception generator process_spider_exception on generators Jun 17, 2016
@elacuesta elacuesta force-pushed the process_spider_exception_generator branch from 9c950e6 to 42c4ad7 Compare July 25, 2016 17:46
@codecov-io

codecov-io commented Jul 25, 2016

Codecov Report

Merging #2061 into master will increase coverage by 0.17%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2061      +/-   ##
==========================================
+ Coverage   84.48%   84.66%   +0.17%     
==========================================
  Files         167      167              
  Lines        9405     9454      +49     
  Branches     1397     1408      +11     
==========================================
+ Hits         7946     8004      +58     
+ Misses       1201     1195       -6     
+ Partials      258      255       -3
Impacted Files Coverage Δ
scrapy/core/scraper.py 88.51% <ø> (ø) ⬆️
scrapy/exceptions.py 91.3% <100%> (+0.82%) ⬆️
scrapy/core/downloader/middleware.py 100% <100%> (ø) ⬆️
scrapy/utils/python.py 83.68% <100%> (+1%) ⬆️
scrapy/core/spidermw.py 97.53% <100%> (+9.53%) ⬆️
scrapy/utils/defer.py 96.49% <0%> (+3.5%) ⬆️
scrapy/dupefilters.py 96.15% <0%> (+5.95%) ⬆️

@elacuesta
Member Author

elacuesta commented Jul 26, 2016

All checks passing now, but I have two concerns:

  • When process_spider_exception returns None the exception appears in the job stats, but it does not appear if the method returns an iterable. This was not introduced by this PR, so it should be addressed in a separate PR IMHO.
  • If a Response subclass object is returned as part of the iterable, Scrapy complains with ERROR: Spider must return Request, BaseItem, dict or None, got 'HtmlResponse'. As stated in this doc page responses might be returned as part of that iterable, but I'm not sure about the desired behaviour: is this a problem with Scrapy or with the docs?
    (Update: reading the docs carefully, I think it doesn't make sense for process_spider_exception to return responses as part of the iterable, since that iterable will be passed to the process_spider_output chain, which is supposed to receive requests, dicts or items; I believe the error is in the docs).
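The contract described in that update can be sketched with a minimal, hypothetical middleware (class and field names are illustrative, not the PR's actual code): `process_spider_exception` returns either `None` (keep propagating the exception) or an iterable of requests/items that feeds the `process_spider_output` chain, which is why returning responses doesn't fit.

```python
# Hypothetical sketch: per the docs' contract, process_spider_exception
# should return None or an iterable of requests/items -- not Response
# objects, since its output feeds the process_spider_output chain.

class RecoverMiddleware:
    def process_spider_exception(self, response, exception, spider):
        # Yielding items/requests stops the exception from propagating;
        # they continue down the remaining process_spider_output methods.
        yield {
            "error": repr(exception),
            "url": getattr(response, "url", None),
        }

mw = RecoverMiddleware()
items = list(mw.process_spider_exception(None, ValueError("parse failed"), None))
print(items[0]["error"])
# ValueError('parse failed')
```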

@elacuesta elacuesta force-pushed the process_spider_exception_generator branch from a65ebfa to d01c702 Compare July 26, 2016 15:58
Member

@kmike kmike left a comment

Thanks for the PR! 👍 for adding more tests. Code looks good, but for me it is not clear what happens if middleware raises an error instead of a spider, and what are the consequences. Could you please add docs and tests for that?

tests/test_spidermiddleware.py (three outdated review threads, resolved)
for method in self.methods['process_spider_output']:
-    result = method(response=response, result=result, spider=spider)
+    result = wrapper(method(response=response, result=result, spider=spider))
     assert _isiterable(result), \
         'Middleware %s must returns an iterable object, got %s ' % \
         (fname(method), type(result))
Member

Could you please check that this assertion is still active? I haven't tried the code, but it seems the loop in the wrapper method will fail before the _isiterable check.

Member Author

My answer is below, sorry. Seems like I don't fully get this new GitHub review thingy yet :-P

        for r in result_iterable:
            yield r
    except Exception as ex:
        exception_result = process_spider_exception(Failure(ex))
Member

Docs for process_spider_exception method say:

This method is called when a spider or process_spider_input() method (from other spider middleware) raises an exception.

After this change the method starts to fire when process_spider_output of a previous middleware raises an error; I'm not sure what the implications are, but it should be documented, and it could be backwards incompatible.

Member Author

Exceptions from process_spider_output are handled only when the method returns a generator. There are two cases:

  • process_spider_output is not a generator. In that case, if an exception is raised, the function's return value is not passed along, and since there is no result_iterable the exception handler does not get called. This is exactly the same as the current behaviour and documentation.
  • process_spider_output is a generator. In that case, the exception is not raised right away; it is actually raised when iterating over result_iterable, and the exception handler is called. This is what could break backwards compatibility, but since it's the very thing that's currently broken, I don't think anyone is relying on this functionality.

@kmike Please check the above to see if it makes any sense :-P
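The two cases above can be illustrated with plain generators (hypothetical names, independent of Scrapy's actual implementation): an exception inside a generator surfaces only while iterating, which is why wrapping the iteration lets a handler catch it.

```python
# Plain-Python sketch (hypothetical names) of the two cases above:
# a generator's exception is raised lazily, during iteration, so a
# wrapper around the iteration can route it to an exception handler.

def process_spider_output(result):
    """Middleware output method written as a generator."""
    for r in result:
        yield r
    raise ValueError("boom")  # not raised until iteration reaches it

def exception_wrapper(iterable, handler):
    """Iterate a result, routing any raised exception to a handler."""
    try:
        for item in iterable:
            yield item
    except Exception as exc:
        for item in handler(exc):
            yield item

def handler(exc):
    yield {"recovered": str(exc)}

# Calling the generator function raises nothing yet...
gen = process_spider_output(iter([1, 2]))
# ...the exception fires mid-iteration, where the wrapper recovers:
print(list(exception_wrapper(gen, handler)))
# [1, 2, {'recovered': 'boom'}]
```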

Member

Hm, this makes sense, but it is a bit tricky, and it could be easy to break by changing the current implementation. Are there tests for that?

Member Author

@elacuesta elacuesta Nov 22, 2016

I believe that

This method is called when a spider or process_spider_input() method (from other spider middleware) raises an exception

means that exceptions from previous process_spider_output methods should be handled by process_spider_exception; the way of getting a spider's result (either items, requests or exceptions) is through the process_spider_output chain. Maybe the docs should be updated to reflect that?
I'm rewriting some tests and adding some more to make this clear.

@elacuesta elacuesta force-pushed the process_spider_exception_generator branch 3 times, most recently from 31c1bb1 to 949766d Compare October 7, 2016 12:14
-    assert _isiterable(result), \
-        'Middleware %s must returns an iterable object, got %s ' % \
-        (fname(method), type(result))
+    if _isiterable(result):
Member Author

@elacuesta elacuesta Oct 7, 2016

@kmike: At first I tried to find a way of checking that the result is iterable without exposing the AssertionError exception to be caught by some middleware's process_spider_exception, but then I thought: since any exception from process_spider_input should be passed to process_spider_exception (from the docs), it wouldn't be wrong if the result of process_spider_output were passed too.

That being said, I'm getting a bit confused about what the desired/expected behavior is/should be, please help me :-)

Member

I think it doesn't make sense to catch this error in process_spider_exception - this is a programmer error, not a problem with data. So it'd be nice if process_spider_exception were still unable to catch this error.

Another question: what happens if an error from process_spider_output is handled by process_spider_exception? For process_spider_input or for the spider itself it is documented that the process_spider_output chain is invoked, but for process_spider_exception it doesn't make sense to start process_spider_output from the beginning.

I think executing process_spider_output methods which are not executed yet makes sense in this case, but this should be tested and documented.
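A minimal sketch of that suggestion (hypothetical helper, not Scrapy's API): the iterable recovered by process_spider_exception re-enters the output chain at the first method that has not yet run, instead of restarting from the beginning.

```python
# Hypothetical sketch of resuming the output chain: if middleware N's
# process_spider_output raises and a process_spider_exception handler
# returns an iterable, only the not-yet-executed methods (N+1 onward)
# should process the recovered iterable.

def run_output_chain(methods, result, start_index=0):
    """Apply the remaining output methods, starting at start_index."""
    for method in methods[start_index:]:
        result = method(result)
    return result

double = lambda result: [x * 2 for x in result]
add_one = lambda result: [x + 1 for x in result]

# Normal run: both methods execute.
print(run_output_chain([double, add_one], [1, 2]))
# [3, 5]

# Recovery run: `double` already ran (and raised mid-way), so the
# recovered iterable re-enters the chain at index 1.
print(run_output_chain([double, add_one], [10, 20], start_index=1))
# [11, 21]
```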

scrapy/core/spidermw.py (outdated review thread, resolved)
@@ -11,6 +11,11 @@ class NotConfigured(Exception):
     """Indicates a missing configuration situation"""
     pass

+class InvalidValue(TypeError):
Member Author

@kmike I named the exception like this, but suggestions are welcome of course :-)
Regardless of the name, I think this same exception should also be raised by the downloader middleware manager when a returned value is invalid, but that falls outside the scope of this PR; I will open a separate one.

Member

I think it makes sense to make this exception private (add underscore, undocument it) because it is a Scrapy implementation detail - why would user want to use it?

InvalidValue sounds a lot like ValueError, but it is inherited from TypeError; what about _InvalidOutput or something like that?

Member Author

I like the alternative name, but I think it shouldn't be private. This exception will be raised if a user-implemented middleware's process_request or process_response returns an invalid value; it's supposed to appear in the user log.

Member Author

I renamed the exception to InvalidOutput, but kept it public because of the reasons above. Also, I think subclassing from TypeError ("Passing arguments of the wrong type (e.g. passing a list when an int is expected) should result in a TypeError") is more appropriate than ValueError ("Raised when a built-in operation or function receives an argument that has the right type but an inappropriate value")

@elacuesta elacuesta force-pushed the process_spider_exception_generator branch from ffbf09f to 533c799 Compare February 9, 2017 13:20
@elacuesta
Member Author

Current status of this PR is:

  • Catch exceptions from previous process_spider_output (also documented)
  • Execute only non already called process_spider_output methods when process_spider_exception returns an iterable
  • Raise InvalidValue exception (scrapy-specific, name suggestions are welcome) when an invalid value is returned from a spider middleware's processing method

@kmike I think that addresses your latest concerns, please let me know if there's anything more I can do to get this PR moving. Thanks!

scrapy/core/spidermw.py (two outdated review threads, resolved)
""" return value is NOT a generator """
name = 'not_a_generator'
def parse(self, response):
raise AssertionError
Member

Not returning a generator sounds like a separate case from raising an exception

Member Author

@elacuesta elacuesta Feb 22, 2017

I'm afraid I don't understand what you mean 😕

What I'm trying to do there is catch exceptions whether or not the result of a callback is a generator. This particular test (i.e. exceptions from callbacks which do not return generators) would pass even before the modifications from this PR.

Let me know if I didn't answer your question :-)

@kmike kmike added this to the v1.4 milestone Feb 22, 2017
@elacuesta elacuesta force-pushed the process_spider_exception_generator branch from 533c799 to 7705466 Compare February 22, 2017 18:04
@elacuesta elacuesta force-pushed the process_spider_exception_generator branch from 7705466 to 706ed0e Compare March 1, 2017 15:02
@redapple
Contributor

redapple commented Mar 6, 2017

@kmike , what do you think of this PR now?

@@ -11,6 +11,11 @@ class NotConfigured(Exception):
     """Indicates a missing configuration situation"""
     pass

+class InvalidOutput(TypeError):
Member

+1 to inherit it from TypeError

indicate that some method returned a value not supported by the processing
chain.
See :ref:`topics-spider-middleware` and :ref:`topics-downloader-middleware`
for a list of supported output values.
Member

@elacuesta in #2061 (comment) you argued that it is good to have this exception public & documented because it may appear in logs. But I still don't quite like making it public :) There are a lot of exceptions which can appear in logs (from Scrapy, from Twisted, other Python exceptions), and we don't document them all. To make an exception in the logs readable one can use a readable error message, which is already the case.

There is a non-zero overhead to documenting an exception and making it public - a user may read the docs and wonder how this exception can be used. All other documented Scrapy exceptions are useful for users - a user may want to either raise them or catch them; unlike all the others, this exception shouldn't be raised or caught in user code.

The docs say that "This exception can be raised by a downloader or spider middleware", but this information can be misleading for users - we document that exceptions raised in middlewares can be caught in the process_exception method, and we document that InvalidOutput can be raised in a middleware - but in fact this particular exception can't be caught in process_exception.

For me this exception still looks like an implementation detail with no use for end users; documenting it in the Scrapy docs doesn't help Scrapy users, it only makes the docs longer to read and a bit more confusing. For example, users may think they should raise this exception if they validate items yielded by a spider and validation fails ("value not supported by the processing chain") - but they shouldn't raise this exception themselves.

I think information you've written here is valuable for people who modify Scrapy itself; it can be great to have something like that in the exception docstring, to make it easier for people to figure out how Scrapy works internally.

Member Author

I see your point, there is no case in which a user would raise or catch this exception, it's only something to make them check their code if they write a bad middleware. I'll remove the docs about it. Thanks for the feedback 👍

Member Author

I removed the exception from the docs and added an underscore to the name. I also modified the DownloaderMiddlewareManager class to raise _InvalidOutput too; IMHO it's not a good idea to raise a custom exception from the spider middleware but an AssertionError from the downloader middleware.

scrapy/core/spidermw.py (outdated review thread, resolved)
-This method is called when a spider or :meth:`process_spider_input`
-method (from other spider middleware) raises an exception.
+This method is called when a spider or :meth:`process_spider_output`
+method (from a previous spider middleware) raises an exception.
Member

👍

if hasattr(mw, 'process_start_requests'):
    self.methods['process_start_requests'].insert(0, mw.process_start_requests)
self.methods['process_spider_output'].insert(0, getattr(mw, 'process_spider_output', None))
self.methods['process_spider_exception'].insert(0, getattr(mw, 'process_spider_exception', None))
Member

@kmike kmike Oct 9, 2018

To check my understanding: are you adding None to the list to make start_index (and slicing) work properly?

Member Author

Yes, precisely. They are skipped by the if method is None: continue blocks when iterating over the process_spider_output and process_spider_exception methods.
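The placeholder trick can be sketched in a few lines (simplified, hypothetical middleware classes): storing None for middlewares that don't define a method keeps every middleware at a stable index, so slicing the method list by a start index stays aligned with the middleware order.

```python
# Sketch of the None-placeholder approach explained above (hypothetical
# classes): every middleware occupies a fixed slot in the methods list,
# and missing methods are skipped at call time.

class AddOne:
    def process_spider_output(self, result):
        return [x + 1 for x in result]

class NoOutputMethod:
    pass  # defines no process_spider_output

middlewares = [AddOne(), NoOutputMethod(), AddOne()]
methods = [getattr(mw, 'process_spider_output', None) for mw in middlewares]

result = [1, 2]
for method in methods:
    if method is None:
        continue  # placeholder: this middleware lacks the method
    result = method(result)
print(result)
# [3, 4]
```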

@elacuesta
Member Author

Hi there @kmike 👋
Thanks for the feedback, I updated the PR based on your comments.

Member

@kmike kmike left a comment

Thanks @elacuesta! I don't have any further feedback, thanks for carefully addressing everything :)

I think this is good to go, but as the code is quite complex, I think it is better to merge it after the 1.6 release, and after a review from another Scrapy committer (as usual).

🎉 🎂 🎈

@kmike kmike added this to the v1.7 milestone Oct 25, 2018
@elacuesta
Member Author

This is awesome, a million thanks @kmike 🙌

Looking forward to 1.7 then, thanks again!

@kmike kmike changed the title process_spider_exception on generators [MRG+1] process_spider_exception on generators Mar 22, 2019
Member

@Gallaecio Gallaecio left a comment

It looks good to me, though I’m hesitant to merge it myself.

@kmike
Member

kmike commented Mar 26, 2019

Heh, poor @elacuesta - at least 4 Scrapy committers looked at it in some way; there are two +1's now, but we're not brave enough to merge it. Please don't take it personally - that's the price of tackling hard issues :) @dangra @lopuhin do you want to take another look? If not, I can merge it this week.

Member

@lopuhin lopuhin left a comment

Hey, I checked the code and docs, didn't check the tests. I think I understand the code, and I think it matches the updated docs. The way execution jumps between process_spider_output and process_spider_exception makes it tricky, but I don't see how this could be improved.

+1 to merge from me, I can click the green button unless @dangra also wants to check it.

@lopuhin lopuhin merged commit b5c552d into scrapy:master Apr 1, 2019
@lopuhin
Member

lopuhin commented Apr 1, 2019

Many thanks @elacuesta and thanks @kmike and @Gallaecio for review 👍


Successfully merging this pull request may close these issues.

process_spider_exception() not invoked for generators
6 participants