Add more extensions to IGNORED_EXTENSIONS #1837

kmike · 2016-03-02T11:23:04Z

A follow-up to #1835. For the record, in my last project I had to also add the following extensions to ignored:

{
    '7z', '7zip', 'xz', 'gz', 'tar', 'bz2',   # archives
    'cdr',  # Corel Draw files
    'apk', # Android packages
}

cdr and apk look safe; I wonder if we can add .gz: Scrapy decompresses gz responses automatically, so if we add .gz to IGNORED_EXTENSIONS spider may stop following e.g. sitemap.xml.gz links.

On the other hand, for broad crawls most .gz files are archives, and we're filtering out .zip files already.

The text was updated successfully, but these errors were encountered:

nyov · 2016-03-03T17:08:42Z

I would exclude gz, that seems too fickle.
Serving gzipped content (with or without the file extension) is perfectly valid.

I think to make that case safe, the link would at least need to be pre-fetched with a HEAD request, to check if a Content-Disposition: attachment header is set, in which case it can be ignored.
If a crawl is seeing many downloadable gz archives, that means as many useless requests, but it won't break in strange ways.

kmike · 2016-03-03T18:42:07Z

I think to make that case safe, the link would at least need to be pre-fetched with a HEAD request, to check if a Content-Disposition: attachment header is set, in which case it can be ignored.
If a crawl is seeing many downloadable gz archives, that means as many useless requests, but it won't break in strange ways.

it looks more like #1772

djunzu · 2016-03-06T00:47:57Z

My two cents...

I think anything that could generate the error AttributeError: 'Response' object has no attribute 'body_as_unicode' should not be allowed. So all archive/compression extensions should be added to IGNORED_EXTENSIONS by default.

But most important, I think it should exist a easy way to extend/override IGNORED_EXTENSIONS. Currently it is possible to completely override the values but not to extend them. So the user must inevitably dive into the code when it is needed to add one extension. If there were get/set methods to IGNORED_EXTENSIONS, it would be very easier to add/remove one or more extensions!

nyov · 2016-03-06T10:36:53Z

At the time a response can throw an AttributeError for body_as_unicode, the whole response must have been downloaded and decompressed. That's what IGNORED_EXTENSIONS is trying to avoid in the first place: downloading the resource. Otherwise we could just use MIME detection on the response content.

IGNORED_EXTENSIONS are probably defined in LinkExtractor (instead of settings) because the module was formerly in contrib.
Depending on whether it will stay now, or extracted into another package, the settings could be merged into the default_settings for easier overriding. Still, it's not that hard to do.

# in the spider or in your settings:
from scrapy.linkextractors import IGNORED_EXTENSIONS
IGNORED_EXTENSIONS += ['my', 'custom', 'extensions']
# if added in settings:
#from myproject.settings import IGNORED_EXTENSIONS
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class MySpider(CrawlSpider):
    lx = LinkExtractor(deny_extensions=IGNORED_EXTENSIONS, allow=(), restrict_xpaths=())
    rules = (lx,)
    # ...

nyov · 2016-03-06T10:58:12Z

@kmike: Is cancelling a response/download the right way though? It galls me that there is no better way. Sending two requests for every resource ending in .gz is ugly (HEAD first, then possibly GET).

But RSTing the connection seems like it shouldn't be a "default action", either.
The server will be sending the response body already at that time. A RST on the connection might impact things like HTTP pipelining (Without research that's just a guess, though, not sure if twisted even supports it.)
Depending on the number of such aborted requests, that might be considered a DoS attack, since it asks the server for the full response first, then aborts when the server did work to fetch or compute the ressource (possibly on-the-fly compression of something big).
Since that's a special case only for .gz extensions, it might not happen that often, but we should be aware of that.

anatolykazantsev · 2016-06-18T01:02:50Z

I vote for adding .cdr extension.

kmike added the discuss label Mar 2, 2016

anatolykazantsev mentioned this issue Jun 21, 2016

Updated the list of ignored extension in the linkextractor component #2067

Closed

akhterwahab mentioned this issue Nov 14, 2019

Add .dmg, .iso & .apk to ignored other extensions #4066

Merged

wRAR closed this as completed in #4066 Nov 14, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add more extensions to IGNORED_EXTENSIONS #1837

Add more extensions to IGNORED_EXTENSIONS #1837

kmike commented Mar 2, 2016

nyov commented Mar 3, 2016

kmike commented Mar 3, 2016

djunzu commented Mar 6, 2016

nyov commented Mar 6, 2016

nyov commented Mar 6, 2016

anatolykazantsev commented Jun 18, 2016

Add more extensions to IGNORED_EXTENSIONS #1837

Add more extensions to IGNORED_EXTENSIONS #1837

Comments

kmike commented Mar 2, 2016

nyov commented Mar 3, 2016

kmike commented Mar 3, 2016

djunzu commented Mar 6, 2016

nyov commented Mar 6, 2016

nyov commented Mar 6, 2016

anatolykazantsev commented Jun 18, 2016