Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

headers_received signal #4897

Merged
merged 7 commits into from Mar 11, 2021
Merged

Conversation

elacuesta
Copy link
Member

@elacuesta elacuesta commented Nov 19, 2020

Closes #1772

A new headers_received signal that fires when the headers for a response are received, before starting the download of the body. Handlers for this signal can stop the download of a response by raising a scrapy.exceptions.StopDownload exception, just like the handlers for the bytes_received signal.

Tasks:

  • Implementation
  • Tests
  • Docs
  • Fix failing tests

@elacuesta elacuesta changed the title Headers received signal headers_received signal Nov 19, 2020
@elacuesta elacuesta marked this pull request as ready for review November 20, 2020 17:39
Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

(not merging yet due to on-going tests)

@codecov
Copy link

codecov bot commented Jan 4, 2021

Codecov Report

Merging #4897 (d60b697) into master (ea92b49) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4897      +/-   ##
==========================================
+ Coverage   87.82%   87.83%   +0.01%     
==========================================
  Files         158      158              
  Lines        9723     9736      +13     
  Branches     1433     1435       +2     
==========================================
+ Hits         8539     8552      +13     
  Misses        929      929              
  Partials      255      255              
Impacted Files Coverage Δ
scrapy/core/downloader/handlers/http11.py 92.69% <100.00%> (+0.22%) ⬆️
scrapy/signals.py 100.00% <100.00%> (ø)
scrapy/utils/curl.py 100.00% <100.00%> (ø)

Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So glad you managed to find the issue affecting the tests! 🎉 I had made some attempts myself but I had not been able to.

docs/topics/exceptions.rst Outdated Show resolved Hide resolved
docs/topics/signals.rst Outdated Show resolved Hide resolved
scrapy/core/downloader/handlers/http11.py Show resolved Hide resolved
scrapy/core/downloader/handlers/http11.py Show resolved Hide resolved
@elacuesta
Copy link
Member Author

@Gallaecio please review again after 84e91b6 😄

@Gallaecio
Copy link
Member

Gallaecio commented Mar 1, 2021

I’m undecided on the right approach to provide the content length.

My initial reaction was: we should first fix Scrapy responses missing the Content-Length header. Then I saw https://github.com/Almad/twisted/blob/d9dccb23ffa63184e376e98bebe6a5cee32953f3/twisted/web2/channel/http.py#L202 and I thought it made sense, Twisted is just following the standard. Then I thought of cases where the header and the content length do not actually match, and how in Scrapy we allow our users to still process those requests; I think exposing the Content-Length header to users would go in line with that, Scrapy users some times may need that low-level stuff.

So I’m +0.5 on this. I think the code looks great, I’m just not sure about the length parameter. Adding it as a parameter would require a messy deprecation later if we eventually remove it. But the only alternative I can think of would be to add the header to responses, which I think would be out of the scope of this change, and I’m unsure whether it would be the easiest thing (pick the content length like you did here and set that into the header) or require Twisted monkey patching. If it’s not easy, it’s probably best to have a messy deprecation later than to postpone this further.

@elacuesta elacuesta added this to the 2.5 milestone Mar 4, 2021
@elacuesta elacuesta merged commit 0c16088 into scrapy:master Mar 11, 2021
@elacuesta elacuesta deleted the headers_received_signal branch March 11, 2021 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

More possibilities to cancel downloads inside HTTP downloader handler
2 participants