
Control download due to response's mime type. #1312

Open · sardok opened this issue Jun 22, 2015 · 9 comments

Comments

@sardok commented Jun 22, 2015

This issue came up during development of a generic crawler that was supposed to follow particular rules for extracting and visiting links, as well as collect some statistics about each visited page.
Because the websites in the batch differ a lot from each other, the defined link-extraction rules started to break and yielded unwanted results, that is, many false positives.
The main issue was links to implicit binary files. By default, binary files are ignored by the link extractor; however, not every link points to a proper filename and file extension.

I did some experimental work that introduces a parameter called DENY_CONTENT_TYPE in the HTTP agent. The HTTP agent cancels downloading the response if the response's content type matches one of those given in DENY_CONTENT_TYPE. You may find this implementation attempt here: https://github.com/sardok/scrapy/commit/cb1d941d8cf0f32b9eaac043a17411920a830f61

Then Shane suggested giving the spider control over response downloading, which would allow the spider to cancel the download operation if needed. This is another attempt to fix the issue: https://github.com/sardok/scrapy/commits/download-control-callback . I chose to use signals here, as the download agent and the spider have no direct relation and no easy way of passing information between each other.

What is the proper way of doing this, and how much refactoring would be needed? Any other ideas about the matter?

Thanks.
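
For concreteness, the proposed parameter would be configured like any other setting. This is a hypothetical sketch, since DENY_CONTENT_TYPE is not a setting Scrapy ships with:

```python
# settings.py -- hypothetical: DENY_CONTENT_TYPE is the parameter
# proposed in the linked commit, not an existing Scrapy setting.
DENY_CONTENT_TYPE = [
    "application/pdf",
    "application/zip",
    "application/octet-stream",
]
```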

@kmike (Member) commented Jun 22, 2015

I like the idea. The signal should fire as soon as possible after the HTTP headers are received.
Another option is to extend the Request interface: in addition to callback and errback, add a third callback, on_headers_received or something like this.

@kmike (Member) commented Jun 22, 2015

To clarify: I think this signal (and/or the callback) should be fired only once.
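
A rough sketch of how that extended Request interface might look from a spider author's point of view. This is purely hypothetical: on_headers_received is the proposed third callback, not an existing Scrapy API:

```python
import scrapy


class MySpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            "http://example.com/some-link",
            callback=self.parse,
            errback=self.on_error,
            # Hypothetical: fired exactly once, as soon as the HTTP
            # headers are received and before the body is downloaded.
            on_headers_received=self.check_headers,
        )

    def check_headers(self, headers):
        # Returning False (or raising) would cancel the download.
        return headers.get("Content-Type", "").startswith("text/html")

    def parse(self, response):
        pass

    def on_error(self, failure):
        pass
```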

@leeprevost commented:

Issues #2303 and #6159 seem to be related to this. In #2303, I considered another approach: a custom redirect downloader middleware. The DENY_CONTENT_TYPE logic could possibly be added to that; see the sketch below.
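
A minimal sketch of what that middleware could look like, assuming the DENY_CONTENT_TYPE setting proposed above. Note that a downloader middleware only sees the response after the body has already been downloaded, so unlike the handler-level approach it drops the response rather than cancelling the transfer:

```python
from scrapy.exceptions import IgnoreRequest


class DenyContentTypeMiddleware:
    """Drop responses whose Content-Type is listed in DENY_CONTENT_TYPE."""

    def __init__(self, denied_types):
        self.denied_types = set(denied_types)

    @classmethod
    def from_crawler(cls, crawler):
        # DENY_CONTENT_TYPE is the hypothetical setting from this issue.
        return cls(crawler.settings.getlist("DENY_CONTENT_TYPE"))

    def process_response(self, request, response, spider):
        content_type = (
            response.headers.get(b"Content-Type", b"")
            .decode("latin-1")
            .split(";")[0]
            .strip()
            .lower()
        )
        if content_type in self.denied_types:
            raise IgnoreRequest(f"Denied content type {content_type!r}: {request.url}")
        return response
```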

@leeprevost commented:

The links in the original post no longer work. Where can these be found now?

> you may find this implementation attempt here: https://github.com/sardok/scrapy/commit/cb1d941d8cf0f32b9eaac043a17411920a830f61

> This is another attempt to fix the issue: https://github.com/sardok/scrapy/commits/download-control-callback

@leeprevost commented:

@sardok I wonder if you could resurrect those old links from your 2015 post above? They don't work.

I'm having a bear of a time trying to intercept a situation where the link extractor finds links that do not have .pdf in the link URL (correctly adhering to deny_extensions), but which are then redirected to a PDF and downloaded, creating many errors in my crawl.

There is also another case where a link without those extensions renders a PDF dynamically, even though the link extractor correctly let it through.

I am considering the approach discussed here (which @kmike seconded), where a signal is fired to stop the download, since in all my cases the PDF (or other document type) is announced in the headers while the response is being downloaded. Is this somewhere I can follow and try?
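
For what it's worth, newer Scrapy releases appear to provide exactly this hook: a headers_received signal (added in Scrapy 2.5, if I am reading the release notes right) whose handler can raise StopDownload to abort the body transfer. A minimal sketch of how that could be used for the PDF case:

```python
import scrapy
from scrapy import signals
from scrapy.exceptions import StopDownload


class NoPdfSpider(scrapy.Spider):
    name = "no_pdf"
    start_urls = ["https://example.com/"]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(
            spider.on_headers_received, signal=signals.headers_received
        )
        return spider

    def on_headers_received(self, headers, body_length, request, spider):
        content_type = headers.get(b"Content-Type", b"").decode("latin-1")
        if content_type.startswith("application/pdf"):
            # fail=False hands the (empty-bodied) response to the normal
            # callback instead of routing it through the errback.
            raise StopDownload(fail=False)

    def parse(self, response):
        self.logger.info("Got %s (%s bytes)", response.url, len(response.body))
```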

@GeorgeA92 (Contributor) commented:

This is the part of the download handler code that cancels downloading if the expected response size is above the value of the DOWNLOAD_MAXSIZE setting:

```python
if maxsize and expected_size > maxsize:
    warning_msg = (
        "Cancelling download of %(url)s: expected response "
        "size (%(size)s) larger than download max size (%(maxsize)s)."
    )
    warning_args = {
        "url": request.url,
        "size": expected_size,
        "maxsize": maxsize,
    }
    logger.warning(warning_msg, warning_args)
    txresponse._transport.loseConnection()
    raise defer.CancelledError(warning_msg % warning_args)
```

I assume the original proposal made a similar update in the handler, cancelling the download based on the content type received in the headers.
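
For illustration, a minimal sketch of what that content-type variant might have looked like at the same point in the handler. This assumes the same request and txresponse variables are in scope, and that the hypothetical DENY_CONTENT_TYPE setting (a list of MIME types) has already been read into deny_content_types:

```python
# Hypothetical continuation of the handler above; DENY_CONTENT_TYPE
# is the setting proposed in this issue, not actual Scrapy code.
raw = txresponse.headers.getRawHeaders(b"content-type", [b""])[0]
mime_type = raw.split(b";")[0].strip().lower().decode("latin-1")
if mime_type in deny_content_types:
    warning_args = {"url": request.url, "mime": mime_type}
    logger.warning(
        "Cancelling download of %(url)s: denied content type (%(mime)s).",
        warning_args,
    )
    txresponse._transport.loseConnection()
    raise defer.CancelledError(
        "Cancelling download of %(url)s: denied content type (%(mime)s)."
        % warning_args
    )
```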

@sardok (Author) commented Nov 30, 2023

@leeprevost Sorry, it seems that I deleted the scrapy repository in my GitHub account. I checked my computers, but no luck.

@leeprevost commented:

@sardok thanks for trying.

@leeprevost commented:

@GeorgeA92 thanks for steering me to that.
