More possibilities to cancel downloads inside HTTP downloader handler #1772
Comments
Yeah, we need a public & documented way to cancel downloads. It's not good that users can't implement an alternative to DOWNLOAD_WARNSIZE / DOWNLOAD_MAXSIZE themselves. Signals look like a reasonable way to do that. The current signal dispatching implementation is slow though, so we need to check that this new signal won't slow anything down. I'm not sure it is a good idea to expose the Twisted response directly; it is better to design the Scrapy API in a way that hides Twisted.
IMO, triggering signals on every HTTP request is very expensive. It would be nice to have an option for disabling that.
@kmike @sibiryakov I took into account what you said about signal dispatching and came up with this. This approach does not use signal dispatching; instead, it relies on an optional callback. JB
Hey @Djayb6, I think signals are preferable, but we should measure their impact. I'd say a single signal per response should be fine; a signal per downloaded chunk is much more risky. But maybe the slowness can be fixed by moving to a different signal implementation (see #8), or maybe the overhead is not that big.
@kmike see my new implementation here. It is based on signals (one signal per download), hides the Twisted API, and makes it possible to disable the signal via a setting (for @sibiryakov). FYI, regarding the signal dispatching implementation, the celery project extracted Django's fork of pydispatcher and modified it for its needs.
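(The linked implementation is not shown in this thread. A minimal sketch of the setting-gated idea described above, in a hypothetical downloader handler; the setting name, the signal object, and the handler shape are illustrative assumptions, not Scrapy API:)

```python
# Hypothetical sketch only: a per-download signal that can be disabled via a setting.
# HEADERS_RECEIVED_SIGNAL_ENABLED and the signal object below are illustrative names.
headers_received = object()  # Scrapy-style signal sentinel (hypothetical)


class SettingGatedHandler:
    def __init__(self, settings, crawler):
        # A single boolean setting controls whether the signal is sent at all,
        # so crawls that do not need it pay no dispatch cost.
        self.signal_enabled = settings.getbool("HEADERS_RECEIVED_SIGNAL_ENABLED", True)
        self.crawler = crawler

    def _on_headers(self, request, headers, spider):
        if not self.signal_enabled:
            return
        # send_catch_log is the SignalManager call used throughout Scrapy
        self.crawler.signals.send_catch_log(
            signal=headers_received,
            request=request,
            headers=headers,
            spider=spider,
        )
```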
//cc @elacuesta - is it fixed by the signals you introduced?
Hey @kmike. Interesting, I wasn't aware of this thread, thanks for pointing it out. #4205 adds a way to stop downloads, but this issue still seems valid to me because headers are not sent as arguments in the signal. They are available though; it would be a matter of doing something like:

```diff
diff --git scrapy/core/downloader/handlers/http11.py scrapy/core/downloader/handlers/http11.py
index fb04d1fb..c419b195 100644
--- scrapy/core/downloader/handlers/http11.py
+++ scrapy/core/downloader/handlers/http11.py
@@ -513,6 +513,7 @@ class _ResponseReader(protocol.Protocol):
             data=bodyBytes,
             request=self._request,
             spider=self._crawler.spider,
+            headers=Headers(self._txresponse.headers.getAllRawHeaders()),
         )
         for handler, result in bytes_received_result:
             if isinstance(result, Failure) and isinstance(result.value, StopDownload):
```

Or adding a new signal as originally proposed, or both; I don't really have a strong preference either way. A good thing is that the sender issue should not be a problem anymore, now that the download handler has access to the crawler instance since #4205.
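(For context, a minimal sketch of how a spider can already stop a download with the mechanism referenced above, assuming the bytes_received signal and StopDownload exception added in #4205; the size threshold and the bytes_so_far meta key are just illustrations:)

```python
import scrapy
from scrapy import signals
from scrapy.exceptions import StopDownload


class StopEarlySpider(scrapy.Spider):
    name = "stop_early"
    start_urls = ["https://example.com/"]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # bytes_received fires for every chunk of response body that arrives
        crawler.signals.connect(spider.on_bytes_received, signal=signals.bytes_received)
        return spider

    def on_bytes_received(self, data, request, spider):
        # data is only the chunk that just arrived, so keep a running total
        # per request (illustrative meta key) and stop once it exceeds 1 MiB.
        received = request.meta.get("bytes_so_far", 0) + len(data)
        request.meta["bytes_so_far"] = received
        if received > 1024 * 1024:
            # fail=False hands the partial response to the normal callback
            # instead of sending a Failure to the errback.
            raise StopDownload(fail=False)

    def parse(self, response):
        self.logger.info("Got %d bytes from %s", len(response.body), response.url)
```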
Hello,

Currently, a download is cancelled in the HTTP downloader handler if the expected size of the response is greater than the DOWNLOAD_MAXSIZE setting. However, there is no way to cancel a download after the headers are received and before the body is downloaded based on other conditions, such as the value of a specific header, and I see some cases where that would be useful. For instance, one cannot rely on LinkExtractor to filter out media links (images, videos, etc.), since a link without a media extension could still point to a media file. With a way to obtain the headers of a response as soon as they are received, one could check the value of the Content-Type header and trigger the cancellation of the download if necessary.

I thought about an implementation and came up with this. The main idea is that when the headers of the response are received, the downloader handler sends a headers_received signal with the txresponse and the request, and cancels the download based on the return value of the first receiver's callback. It is a quick hack but it is not very intrusive. The main drawback is that when connecting to this signal in a spider, one must specify sender=Any, as the crawler's signal manager is not available in the downloader handler.

I'm waiting for your remarks and ideas.

Many thanks,
JB.
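(Not part of the original proposal, but for illustration: a sketch of how the spider-side usage described above could look with the headers_received signal and StopDownload exception available in recent Scrapy versions; the Content-Type check mirrors the media-link use case, and the signal names and signature are assumptions relative to the code discussed in this thread:)

```python
import scrapy
from scrapy import signals
from scrapy.exceptions import StopDownload


class SkipMediaSpider(scrapy.Spider):
    name = "skip_media"
    start_urls = ["https://example.com/"]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # headers_received fires once per response, after the headers arrive
        # and before the body is downloaded
        crawler.signals.connect(
            spider.on_headers_received, signal=signals.headers_received
        )
        return spider

    def on_headers_received(self, headers, body_length, request, spider):
        content_type = headers.get("Content-Type", b"")
        # Cancel the download of anything that is not HTML; fail=False hands
        # the (empty-body) response to the normal callback instead of the errback.
        if not content_type.startswith(b"text/html"):
            raise StopDownload(fail=False)

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```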