Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more extensions to IGNORED_EXTENSIONS #1837

Closed
kmike opened this issue Mar 2, 2016 · 6 comments · Fixed by #4066
Closed

Add more extensions to IGNORED_EXTENSIONS #1837

kmike opened this issue Mar 2, 2016 · 6 comments · Fixed by #4066
Labels

Comments

@kmike
Copy link
Member

kmike commented Mar 2, 2016

A follow-up to #1835. For the record, in my last project I had to also add the following extensions to ignored:

{
    '7z', '7zip', 'xz', 'gz', 'tar', 'bz2',   # archives
    'cdr',  # Corel Draw files
    'apk', # Android packages
}

cdr and apk look safe; I wonder if we can add .gz: Scrapy decompresses gz responses automatically, so if we add .gz to IGNORED_EXTENSIONS spider may stop following e.g. sitemap.xml.gz links.

On the other hand, for broad crawls most .gz files are archives, and we're filtering out .zip files already.

@kmike kmike added the discuss label Mar 2, 2016
@nyov
Copy link
Contributor

nyov commented Mar 3, 2016

I would exclude gz, that seems too fickle.
Serving gzipped content (with or without the file extension) is perfectly valid.

I think to make that case safe, the link would at least need to be pre-fetched with a HEAD request, to check if a Content-Disposition: attachment header is set, in which case it can be ignored.
If a crawl is seeing many downloadable gz archives, that means as many useless requests, but it won't break in strange ways.

@kmike
Copy link
Member Author

kmike commented Mar 3, 2016

I think to make that case safe, the link would at least need to be pre-fetched with a HEAD request, to check if a Content-Disposition: attachment header is set, in which case it can be ignored.
If a crawl is seeing many downloadable gz archives, that means as many useless requests, but it won't break in strange ways.

it looks more like #1772

@djunzu
Copy link
Contributor

djunzu commented Mar 6, 2016

My two cents...

I think anything that could generate the error AttributeError: 'Response' object has no attribute 'body_as_unicode' should not be allowed. So all archive/compression extensions should be added to IGNORED_EXTENSIONS by default.

But most important, I think it should exist a easy way to extend/override IGNORED_EXTENSIONS. Currently it is possible to completely override the values but not to extend them. So the user must inevitably dive into the code when it is needed to add one extension. If there were get/set methods to IGNORED_EXTENSIONS, it would be very easier to add/remove one or more extensions!

@nyov
Copy link
Contributor

nyov commented Mar 6, 2016

At the time a response can throw an AttributeError for body_as_unicode, the whole response must have been downloaded and decompressed. That's what IGNORED_EXTENSIONS is trying to avoid in the first place: downloading the resource. Otherwise we could just use MIME detection on the response content.

IGNORED_EXTENSIONS are probably defined in LinkExtractor (instead of settings) because the module was formerly in contrib.
Depending on whether it will stay now, or extracted into another package, the settings could be merged into the default_settings for easier overriding. Still, it's not that hard to do.

# in the spider or in your settings:
from scrapy.linkextractors import IGNORED_EXTENSIONS
IGNORED_EXTENSIONS += ['my', 'custom', 'extensions']
# if added in settings:
#from myproject.settings import IGNORED_EXTENSIONS
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class MySpider(CrawlSpider):
    lx = LinkExtractor(deny_extensions=IGNORED_EXTENSIONS, allow=(), restrict_xpaths=())
    rules = (lx,)
    # ...

@nyov
Copy link
Contributor

nyov commented Mar 6, 2016

@kmike: Is cancelling a response/download the right way though? It galls me that there is no better way. Sending two requests for every resource ending in .gz is ugly (HEAD first, then possibly GET).

But RSTing the connection seems like it shouldn't be a "default action", either.
The server will be sending the response body already at that time. A RST on the connection might impact things like HTTP pipelining (Without research that's just a guess, though, not sure if twisted even supports it.)
Depending on the number of such aborted requests, that might be considered a DoS attack, since it asks the server for the full response first, then aborts when the server did work to fetch or compute the ressource (possibly on-the-fly compression of something big).
Since that's a special case only for .gz extensions, it might not happen that often, but we should be aware of that.

@anatolykazantsev
Copy link

I vote for adding .cdr extension.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants