Control download based on response's MIME type #1312
Comments
I like the idea. The signal should fire as soon as possible after HTTP headers are received.
To clarify: I think this signal (and/or the callback) should be fired only once.
Links no longer work. Where?
@sardok I wonder if you could resurrect those old links from your 2015 post above? They don't work. I'm having a bear of a time trying to intercept a situation where the link extractor finds links whose URLs do not contain PDF (correctly adhering to deny_extension), but which are then redirected to a PDF and downloaded, creating many errors in my crawl. In another case, a link without those same extensions renders a PDF dynamically even though the link extractor correctly passed it along. I am considering this approach (which @kmike seconded), where a signal is fired to stop the download, since in all my cases the PDF (or other document) is identifiable from the headers while it is being downloaded. Is this somewhere I can follow and try?
This is the part of the handler code that cancels the download if the expected response size exceeds the configured maximum: scrapy/scrapy/core/downloader/handlers/http11.py, lines 453 to 467 in f2fb476.
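A simplified, stand-alone sketch of that size guard follows. The real implementation lives in Scrapy's HTTP/1.1 handler and works on Twisted response objects; the `exceeds_maxsize` helper and plain-dict headers here are illustrative only:

```python
# Illustrative stand-in for the maxsize check in
# scrapy/core/downloader/handlers/http11.py: if the declared
# Content-Length exceeds the limit, the download is cancelled before
# the body is fetched. Not the actual Scrapy code.

def exceeds_maxsize(headers, maxsize):
    """Return True if the response declares a size above ``maxsize``."""
    length = headers.get("Content-Length")
    if length is None:
        return False  # unknown size: let the download proceed
    return int(length) > maxsize

print(exceeds_maxsize({"Content-Length": "10485760"}, maxsize=1048576))  # True
```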
I assume the original proposal had a similar update in the handler that cancels the download based on the content type received from the headers.
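The content-type analogue of that guard could look like the sketch below. The `should_cancel` function and its arguments are hypothetical names, not Scrapy API; the point is only that the check needs just the headers, so it can run before the body arrives:

```python
# Hypothetical helper: decide whether to abort a download based on the
# Content-Type header, mirroring how the maxsize check aborts oversized
# responses. Names are illustrative, not part of Scrapy.

def should_cancel(headers, deny_content_types):
    """Return True if the response's Content-Type matches a denied type."""
    content_type = headers.get("Content-Type", "")
    # Strip parameters such as "; charset=utf-8" before comparing.
    mime_type = content_type.split(";", 1)[0].strip().lower()
    return mime_type in deny_content_types

deny = {"application/pdf", "application/zip"}
print(should_cancel({"Content-Type": "application/pdf; charset=binary"}, deny))  # True
print(should_cancel({"Content-Type": "text/html; charset=utf-8"}, deny))         # False
```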
@leeprevost Sorry, it seems that, i deleted the scrapy repository in my github account. Checked my computers but no luck. |
@sardok thanks for trying.
@GeorgeA92 thanks for steering me to that.
This issue arose during development of a generic crawler that was supposed to follow particular rules for extracting and visiting links, as well as collecting some statistics about each visited page.
As the websites in the batch differ a lot from each other, the defined rules for link extraction started to break and yielded unwanted results; in other words, they returned many false positives.
The main issue was with links to implicit binary files. By default, binary files are ignored by the link extractor; however, not every such link carries a proper filename and file extension.
I did some experimental work that introduces a parameter called DENY_CONTENT_TYPE in the HTTP agent. The agent cancels downloading the response if the response's content type matches one of the types given in DENY_CONTENT_TYPE. You can find this implementation attempt here: https://github.com/sardok/scrapy/commit/cb1d941d8cf0f32b9eaac043a17411920a830f61
Then, Shane suggested giving the spider control over response downloading, which would allow the spider to cancel the download operation if needed. This is another attempt to fix the issue: https://github.com/sardok/scrapy/commits/download-control-callback . I chose to use signals here because the download agent and the spider have no direct relation and no easy way of passing information between each other.
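The signal-based idea can be sketched stand-alone as below. The tiny `Signal` dispatcher, the `headers_received` signal name, and the `CancelDownload` exception are all hypothetical, chosen only to show the shape of the proposal: the handler fires a signal once when headers arrive, and a connected receiver (e.g. the spider) may veto the download by raising:

```python
# Minimal stand-alone sketch of the signal-based proposal. None of
# these names are Scrapy API; they only illustrate the mechanism.

class CancelDownload(Exception):
    """Raised by a receiver to veto an in-progress download."""

class Signal:
    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def send(self, **kwargs):
        # A receiver raises CancelDownload to abort; the handler lets
        # that exception propagate and stops fetching the body.
        for receiver in self._receivers:
            receiver(**kwargs)

headers_received = Signal()

def spider_callback(headers, url):
    if headers.get("Content-Type", "").startswith("application/pdf"):
        raise CancelDownload(f"refusing PDF at {url}")

headers_received.connect(spider_callback)

try:
    headers_received.send(headers={"Content-Type": "application/pdf"},
                          url="http://example.com/doc")
    print("downloaded")
except CancelDownload as exc:
    print("cancelled:", exc)
```

For reference, later Scrapy releases shipped a built-in mechanism along these lines: the `headers_received` and `bytes_received` signals, with `scrapy.exceptions.StopDownload` raised from a handler to abort the download.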
What is the proper way of doing this, and how much refactoring is needed? Any other ideas about the matter?
thanks.