Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conditional request attribute binding for responses/failures #4632

Merged
merged 15 commits into from Aug 17, 2020

Conversation

elacuesta
Copy link
Member

@elacuesta elacuesta commented Jun 15, 2020

Fixes #4529, closes #4618.

Additional tests to understand the current and proposed behaviour in #4618.

@codecov
Copy link

codecov bot commented Jun 15, 2020

Codecov Report

Merging #4632 into master will increase coverage by 0.08%.
The diff coverage is 85.18%.

@@            Coverage Diff             @@
##           master    #4632      +/-   ##
==========================================
+ Coverage   86.23%   86.31%   +0.08%     
==========================================
  Files         160      160              
  Lines        9669     9678       +9     
  Branches     1419     1422       +3     
==========================================
+ Hits         8338     8354      +16     
+ Misses       1069     1063       -6     
+ Partials      262      261       -1     
Impacted Files Coverage Δ
scrapy/core/scraper.py 87.11% <82.60%> (-0.86%) ⬇️
scrapy/core/engine.py 87.50% <100.00%> (+0.05%) ⬆️
scrapy/utils/curl.py 100.00% <0.00%> (ø)
scrapy/pipelines/images.py 91.81% <0.00%> (ø)
scrapy/utils/python.py 85.71% <0.00%> (+0.25%) ⬆️
scrapy/utils/trackref.py 85.71% <0.00%> (+2.85%) ⬆️
scrapy/utils/spider.py 77.77% <0.00%> (+11.11%) ⬆️
scrapy/utils/py36.py 100.00% <0.00%> (+80.00%) ⬆️

@elacuesta elacuesta force-pushed the request-attribute-binding branch from e6a66f9 to 5f28400 Compare Jun 16, 2020
Copy link
Member

@Gallaecio Gallaecio left a comment

These changes prevent an overridden response.request from being overwritten by Scrapy.

However, shouldn’t we modify the rest of the code of those two methods to also use response.request instead of the original request argument? I’m guessing someone overriding the request of a response may intend for a different callback to be used, for example.

scrapy/core/scraper.py Outdated Show resolved Hide resolved
@elacuesta
Copy link
Member Author

elacuesta commented Jun 16, 2020

shouldn’t we modify the rest of the code of those two methods to also use response.request instead of the original request argument?

Very good point, thanks for noticing!

@Gallaecio
Copy link
Member

Gallaecio commented Jun 18, 2020

What about adding tests that cover de4094c ?

@elacuesta elacuesta added this to the v2.3 milestone Jun 25, 2020
@Gallaecio
Copy link
Member

Gallaecio commented Jun 29, 2020

I wonder if it wouldn’t make sense to merge test_crawled_log_message into test_response_received_signal.

And I’m guessing test_crawled_log_message only tests 1 of the 2 changes. It would be great if we could have a second test which covers the other change.

@elacuesta
Copy link
Member Author

elacuesta commented Jun 29, 2020

Thanks for the suggestion, I merged 3 similar tests. Regarding the missing test (about using callback/errback from the overriden request, just to make sure we're talking about the same thing), it's in my plans to add it, I'll probably do it this week.

@Gallaecio
Copy link
Member

Gallaecio commented Jun 29, 2020

about using callback/errback from the overriden request, just to make sure we're talking about the same thing

Actually, I think we are talking about different things.

I meant that the code changes involve 2 changes, or better put changes in 2 places: engine and scraper. I believe we need to have separate tests that check that each of those changes work, as I assume the current test would pass even if one of those changes was reverted.

Of course, testing the callbacks and errbacks would be awesome. But I had already forgotten about that 🤦

@amarynets
Copy link

amarynets commented Jun 29, 2020

Hi, @elacuesta thanks a lot for work that you already have done. I just checked my issue #4633 with changes that you added and it does now work correctly.

  1. The id of my first request is 140571516279888(with original URL)
  2. Then I make the second request with id 140571516280784
  3. In the process_response, I have the following: response.request = request.replace(url=request.mata['original_url'])
    In callback I have responce.request.url not from origin URL but from the last request

@elacuesta
Copy link
Member Author

elacuesta commented Jul 9, 2020

Pushed a new test to check overriden callbacks. I tried the same with errbacks, but realized it doesn't really make sense: the method I'm mostly testing here is process_response, if I return a Response with a request that overrides the errback, the only way to send it to the errback is by raising an exception in a later middleware. However, later process_exception methods do not get the overriden response. Of course I might be missing something, please make sure this makes sense.

@Gallaecio Regarding #4632 (comment), the component that logs the "crawled" message is the engine, not the scraper, I'm not sure what else can be tested. Please let me know about any ideas you might have.

@Gallaecio
Copy link
Member

Gallaecio commented Jul 16, 2020

The current tests continue to pass if we separately revert the changes in 3 lines:

scrapy/core/scraper.py:154: result.request → request
scrapy/core/scraper.py:156: result.request → request
scrapy/core/scraper.py:157: result.request → request

I think we need to figure out a way to make the tests break when any of these changes is made.

I’ll give it some thought and get back to you.

@elacuesta
Copy link
Member Author

elacuesta commented Aug 3, 2020

Regarding the discussed lines:

warn_on_generator_with_return_value(spider, result.request.errback)
dfd.addCallbacks(callback=callback,
errback=result.request.errback,
callbackKeywords=result.request.cb_kwargs)

  • 154 is not really important, it's just a warning and it doesn't affect the request/response flow at all
  • 156 is the one I think cannot be tested because it never changes actual behaviour. This change is about the process_response method, which handles successful responses, so the errback is not invoked.
  • 157 is tested in e5b5d0d

Because of this, it might actually make sense to revert the changes to 154 and 156.

@Gallaecio
Copy link
Member

Gallaecio commented Aug 5, 2020

@elacuesta I finally understood the reason why those 2 line changes have no effect 😅

I think my main issue was the way call_spider is written. What would everyone think about refactoring it like this?

@Gallaecio Gallaecio removed this from the v2.3 milestone Aug 5, 2020
Copy link
Member

@Gallaecio Gallaecio left a comment

PS: Missing coverage is about scenarios of the previous implementation. While it would be great to cover them, I don’t think it’s something that should block merging these changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bind request to response only if not bound already
3 participants