
Control download due to response's mime type. #1312

Open · sardok opened this issue Jun 22, 2015 · 9 comments

Comments

@sardok commented Jun 22, 2015

This issue came up during development of a generic crawler that was supposed to follow particular rules for extracting and visiting links, as well as collect some statistics about each visited page.
Because the websites in the batch differ a lot from each other, the defined link-extraction rules started to break and yielded unwanted results, that is, many false positives.
The main issue was links to implicit binary files. By default, binary files are ignored by the link extractor; however, not every link points to a proper filename and file extension.

I did some experimental work that introduces a parameter called DENY_CONTENT_TYPE in the HTTP agent. The HTTP agent cancels downloading the response if the response's content type matches one of those given in DENY_CONTENT_TYPE. You may find this implementation attempt here: https://github.com/sardok/scrapy/commit/cb1d941d8cf0f32b9eaac043a17411920a830f61

Then Shane suggested giving the spider control over response downloading, which would allow the spider to cancel the download operation if needed. This is another attempt to fix the issue: https://github.com/sardok/scrapy/commits/download-control-callback . I chose to use signals here, as the download agent and the spider have no direct relation and no easy way of passing information between each other.

What is the proper way of doing this, and how much refactoring would be needed? Any other ideas about the matter?

Thanks.
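
For concreteness, the proposed parameter would be configured like any other setting. This is a hypothetical sketch, since DENY_CONTENT_TYPE is not a setting Scrapy ships with:

```python
# settings.py -- hypothetical: DENY_CONTENT_TYPE is the parameter
# proposed in the linked commit, not an existing Scrapy setting.
DENY_CONTENT_TYPE = [
    "application/pdf",
    "application/zip",
    "application/octet-stream",
]
```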

@kmike (Member) commented Jun 22, 2015

I like the idea. The signal should fire as soon as possible after the HTTP headers are received.
Another option is to extend the Request interface: in addition to callback and errback, add a third callback, on_headers_received or something like this.

@kmike (Member) commented Jun 22, 2015

To clarify: I think this signal (and/or the callback) should be fired only once.
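
A rough sketch of how that extended Request interface might look from a spider author's point of view. This is purely hypothetical: on_headers_received is the proposed third callback, not an existing Scrapy API:

```python
import scrapy


class MySpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            "http://example.com/some-link",
            callback=self.parse,
            errback=self.on_error,
            # Hypothetical: fired exactly once, as soon as the HTTP
            # headers are received and before the body is downloaded.
            on_headers_received=self.check_headers,
        )

    def check_headers(self, headers):
        # Returning False (or raising) would cancel the download.
        return headers.get("Content-Type", "").startswith("text/html")

    def parse(self, response):
        pass

    def on_error(self, failure):
        pass
```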

@leeprevost commented:

Issues #2303 and #6159 seem to be related to this. In #2303, I considered another approach: a custom redirect downloader middleware. The DENY_CONTENT_TYPE logic could possibly be added to that; see the sketch below.
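
A minimal sketch of what that middleware could look like, assuming the DENY_CONTENT_TYPE setting proposed above. Note that a downloader middleware only sees the response after the body has already been downloaded, so unlike the handler-level approach it drops the response rather than cancelling the transfer:

```python
from scrapy.exceptions import IgnoreRequest


class DenyContentTypeMiddleware:
    """Drop responses whose Content-Type is listed in DENY_CONTENT_TYPE."""

    def __init__(self, denied_types):
        self.denied_types = set(denied_types)

    @classmethod
    def from_crawler(cls, crawler):
        # DENY_CONTENT_TYPE is the hypothetical setting from this issue.
        return cls(crawler.settings.getlist("DENY_CONTENT_TYPE"))

    def process_response(self, request, response, spider):
        content_type = (
            response.headers.get(b"Content-Type", b"")
            .decode("latin-1")
            .split(";")[0]
            .strip()
            .lower()
        )
        if content_type in self.denied_types:
            raise IgnoreRequest(f"Denied content type {content_type!r}: {request.url}")
        return response
```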

@leeprevost commented:

The links in the original post no longer work. Where can these be found now?

> you may find this implementation attempt here: https://github.com/sardok/scrapy/commit/cb1d941d8cf0f32b9eaac043a17411920a830f61

> This is another attempt to fix the issue: https://github.com/sardok/scrapy/commits/download-control-callback

@leeprevost commented:

@sardok I wonder if you could resurrect those old links from your 2015 post above? They don't work.

I'm having a bear of a time trying to intercept a situation where the link extractor finds links that do not have .pdf in the link URL (correctly adhering to deny_extensions), but which are then redirected to a PDF and downloaded, creating many errors in my crawl.

There is also another case where a link without those extensions renders a PDF dynamically, even though the link extractor correctly let it through.

I am considering the approach discussed here (which @kmike seconded), where a signal is fired to stop the download, since in all my cases the PDF (or other document type) is announced in the headers while the response is being downloaded. Is this somewhere I can follow and try?
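
For what it's worth, newer Scrapy releases appear to provide exactly this hook: a headers_received signal (added in Scrapy 2.5, if I am reading the release notes right) whose handler can raise StopDownload to abort the body transfer. A minimal sketch of how that could be used for the PDF case:

```python
import scrapy
from scrapy import signals
from scrapy.exceptions import StopDownload


class NoPdfSpider(scrapy.Spider):
    name = "no_pdf"
    start_urls = ["https://example.com/"]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(
            spider.on_headers_received, signal=signals.headers_received
        )
        return spider

    def on_headers_received(self, headers, body_length, request, spider):
        content_type = headers.get(b"Content-Type", b"").decode("latin-1")
        if content_type.startswith("application/pdf"):
            # fail=False hands the (empty-bodied) response to the normal
            # callback instead of routing it through the errback.
            raise StopDownload(fail=False)

    def parse(self, response):
        self.logger.info("Got %s (%s bytes)", response.url, len(response.body))
```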

@GeorgeA92 (Contributor) commented:

This is the part of the download handler code that cancels downloading if the expected response size is above the value of the DOWNLOAD_MAXSIZE setting:

```python
if maxsize and expected_size > maxsize:
    warning_msg = (
        "Cancelling download of %(url)s: expected response "
        "size (%(size)s) larger than download max size (%(maxsize)s)."
    )
    warning_args = {
        "url": request.url,
        "size": expected_size,
        "maxsize": maxsize,
    }
    logger.warning(warning_msg, warning_args)
    txresponse._transport.loseConnection()
    raise defer.CancelledError(warning_msg % warning_args)
```

I assume the original proposal made a similar update in the handler, cancelling the download based on the content type received in the headers.
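
For illustration, a minimal sketch of what that content-type variant might have looked like at the same point in the handler. This assumes the same request and txresponse variables are in scope, and that the hypothetical DENY_CONTENT_TYPE setting (a list of MIME types) has already been read into deny_content_types:

```python
# Hypothetical continuation of the handler above; DENY_CONTENT_TYPE
# is the setting proposed in this issue, not actual Scrapy code.
raw = txresponse.headers.getRawHeaders(b"content-type", [b""])[0]
mime_type = raw.split(b";")[0].strip().lower().decode("latin-1")
if mime_type in deny_content_types:
    warning_args = {"url": request.url, "mime": mime_type}
    logger.warning(
        "Cancelling download of %(url)s: denied content type (%(mime)s).",
        warning_args,
    )
    txresponse._transport.loseConnection()
    raise defer.CancelledError(
        "Cancelling download of %(url)s: denied content type (%(mime)s)."
        % warning_args
    )
```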

@sardok (Author) commented Nov 30, 2023

@leeprevost Sorry, it seems that I deleted the scrapy repository in my GitHub account. I checked my computers, but no luck.

@leeprevost commented:

@sardok thanks for trying.

@leeprevost commented:

@GeorgeA92 thanks for steering me to that.
