
More possibilities to cancel downloads inside HTTP downloader handler #1772

Closed
Djayb6 opened this issue Feb 8, 2016 · 7 comments · Fixed by #4897

Comments

@Djayb6

Djayb6 commented Feb 8, 2016

Hello,

Currently, a download is cancelled in the HTTP downloader if the expected size of the response is greater than the DOWNLOAD_MAXSIZE setting. However, there is no way to cancel a download based on other conditions, such as the value of a specific header, after the headers are received and before the body is downloaded, and I can see cases where that would be useful.

For instance, one cannot rely on LinkExtractor to filter out media links (images, videos, etc.), since a link without a media extension could still point to a media file. By having a way to obtain the headers of a response when they are received, one could check the value of the Content-Type header and cancel the download if necessary.

I thought about an implementation and came up with this. The main idea is that when the headers of the response are received, the downloader handler sends a headers_received signal with the txresponse and the request, and cancels the download based on the return value of the first receiver's callback. It is a quick hack but it is not very intrusive. The main drawback is that when connecting to this signal in a spider, one must specify sender=Any, as the crawler's signal manager is not available in the downloader handler.
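
For illustration, a receiver under that proposal might look roughly like the sketch below. The headers_received signal, its (txresponse, request) payload and the cancel-on-True behaviour all come from the branch described above and are not part of released Scrapy:

```python
from pydispatch.dispatcher import Any

from scrapy import Spider, signals  # assumes signals.headers_received exists in the patched branch


class MediaFilterSpider(Spider):
    name = "media_filter"
    start_urls = ["https://example.com"]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # sender=Any is the workaround mentioned above: the downloader handler
        # does not send through the crawler's signal manager, so the default
        # sender filtering would drop the signal.
        crawler.signals.connect(
            spider.headers_received, signal=signals.headers_received, sender=Any
        )
        return spider

    def headers_received(self, txresponse, request):
        # Returning True asks the downloader handler to cancel the body download.
        content_type = txresponse.headers.getRawHeaders(b"Content-Type", [b""])[0]
        return content_type.startswith((b"image/", b"video/"))
```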

I'm waiting for your remarks and ideas.

Many thanks,

JB.

@kmike
Member

kmike commented Feb 16, 2016

Yeah, we need a public & documented way to cancel downloads. It's not good that users can't implement an alternative to DOWNLOAD_WARNSIZE / DOWNLOAD_MAXSIZE themselves.

Signals look like a reasonable way to do that. The current signal dispatching implementation is slow though, so we need to check that this new signal won't slow anything down. I'm not sure it is a good idea to expose the Twisted response directly; it is better to design the Scrapy API in a way that hides Twisted. sender=Any is not good because there may be several spiders in the same process, and a user may want to filter out requests from one spider but not from another.

@sibiryakov
Contributor

IMO, triggering signals on every HTTP request is very expensive. It would be nice to have an option for disabling that.

@Djayb6
Author

Djayb6 commented Feb 17, 2016

@kmike @sibiryakov I took into account what you said about signal dispatching and came up with this.

This approach does not use signal dispatching. Instead, an optional on_headers_received callback has been added to scrapy.http.Request, which is called in the downloader handler if it was set.
A scrapy.http.Response is created with the URL and the headers in order to hide Twisted's API, and is passed to the callback along with the original request. Then, as before, if the callback returns True the request is cancelled.
I think this approach is much less expensive and more user-friendly. What do you think?
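
For reference, the callback-based variant might be used roughly as follows. The on_headers_received argument and its (response, request) signature are taken from the proof of concept described above, not from released Scrapy:

```python
import scrapy


class NoMediaSpider(scrapy.Spider):
    name = "no_media"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/maybe-a-video",
            callback=self.parse,
            # Proposed optional callback; not accepted by stock scrapy.Request.
            on_headers_received=self.check_headers,
        )

    def check_headers(self, response, request):
        # `response` is a scrapy.http.Response built only from the URL and the
        # received headers; returning True cancels the download of the body.
        content_type = response.headers.get(b"Content-Type", b"")
        return content_type.startswith(b"video/")

    def parse(self, response):
        self.logger.info("Downloaded %s", response.url)
```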

JB

@kmike
Member

kmike commented Feb 17, 2016

Hey @Djayb6,

With an on_headers_received callback it becomes hard to implement reusable extensions which use this feature - e.g. an extension to limit the max response size, or an extension to filter out responses based on their content type. I'd love to see an extension point that allows implementing e.g. the DOWNLOAD_WARNSIZE feature without hardcoding it in a downloader handler.

I think signals are preferable, but we should measure their impact. I'd say a single signal per response should be fine; a signal per downloaded chunk is much riskier. But maybe the slowness can be fixed by moving to a different signal implementation (see #8), or maybe the overhead is not that big.
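
To make the "reusable extension" idea concrete, something along these lines could replace the hardcoded check, assuming a per-response headers signal existed; the signal name and its payload here are illustrative, not an existing Scrapy API:

```python
from scrapy import signals
from scrapy.exceptions import NotConfigured


class WarnSizeFromHeaders:
    """Warn when Content-Length exceeds DOWNLOAD_WARNSIZE, outside the downloader handler."""

    def __init__(self, warnsize, signal_manager):
        self.warnsize = warnsize
        # `signals.headers_received` is the hypothetical signal discussed in this thread.
        signal_manager.connect(self.headers_received, signal=signals.headers_received)

    @classmethod
    def from_crawler(cls, crawler):
        warnsize = crawler.settings.getint("DOWNLOAD_WARNSIZE")
        if not warnsize:
            raise NotConfigured
        return cls(warnsize, crawler.signals)

    def headers_received(self, headers, request, spider):
        expected = int(headers.get(b"Content-Length", b"0"))
        if expected > self.warnsize:
            spider.logger.warning(
                "Expected response size (%d) larger than DOWNLOAD_WARNSIZE (%d) for %s",
                expected, self.warnsize, request.url,
            )
```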

@Djayb6
Author

Djayb6 commented Mar 19, 2016

@kmike see my new implementation here. It is based on signals (one signal per download), hides the Twisted API, and the signal can be disabled via a setting (for @sibiryakov).
You can find an extension that cancels downloads based on pre-defined conditions here.

FYI, regarding the signal dispatching implementation, the Celery project extracted Django's fork of PyDispatcher and modified it for its needs.
JB

@kmike
Member

kmike commented Aug 21, 2020

//cc @elacuesta - is it fixed by the signals you introduced?

@elacuesta
Member

Hey @kmike. Interesting, I wasn't aware of this thread; thanks for pointing it out. #4205 adds a way to stop downloads, but this issue still seems valid to me because the headers are not sent as arguments in the signal. They are available though; it would be a matter of doing something like:

diff --git scrapy/core/downloader/handlers/http11.py scrapy/core/downloader/handlers/http11.py
index fb04d1fb..c419b195 100644
--- scrapy/core/downloader/handlers/http11.py
+++ scrapy/core/downloader/handlers/http11.py
@@ -513,6 +513,7 @@ class _ResponseReader(protocol.Protocol):
             data=bodyBytes,
             request=self._request,
             spider=self._crawler.spider,
+            headers=Headers(self._txresponse.headers.getAllRawHeaders()),
         )
         for handler, result in bytes_received_result:
             if isinstance(result, Failure) and isinstance(result.value, StopDownload):

Or adding a new signal as originally proposed, or both; I don't really have a strong preference either way. A good thing is that the sender issue should no longer be a problem, now that the download handler has had access to the crawler instance since #4205.
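
With a change like the one in the diff above, a spider could then stop a download from a bytes_received handler based on the headers. The bytes_received signal and the StopDownload exception are the APIs added by #4205; the headers argument is the addition sketched in the diff, so this only works with that patch applied:

```python
from scrapy import Spider, signals
from scrapy.exceptions import StopDownload


class SkipMediaSpider(Spider):
    name = "skip_media"
    start_urls = ["https://example.com"]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_bytes_received, signal=signals.bytes_received)
        return spider

    def on_bytes_received(self, data, request, spider, headers):
        # `headers` is only delivered if the patch above is applied.
        if headers.get(b"Content-Type", b"").startswith(b"video/"):
            # fail=False hands the partially downloaded response to the request
            # callback instead of calling the errback.
            raise StopDownload(fail=False)
```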
