Add more extensions to IGNORED_EXTENSIONS #1837
I would exclude gz; that seems too fickle. I think to make that case safe, the link would at least need to be pre-fetched with a HEAD request, to check if a …
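A minimal sketch of that pre-fetch idea, assuming current Scrapy APIs; the spider, the .gz filter, and the Content-Type check are illustrative, not an agreed design:

import scrapy

class GzProbeSpider(scrapy.Spider):
    name = 'gzprobe'
    start_urls = ['http://example.com/']

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            url = response.urljoin(href)
            if url.endswith('.gz'):
                # Probe ambiguous .gz links with a cheap HEAD request first;
                # no response body is transferred.
                yield scrapy.Request(url, method='HEAD', callback=self.check_head)

    def check_head(self, response):
        ctype = response.headers.get('Content-Type') or b''
        # Download the body only if the server claims text content,
        # e.g. a gzipped sitemap served as application/xml.
        if b'xml' in ctype or b'html' in ctype:
            yield scrapy.Request(response.url, callback=self.parse)

The cost is exactly the one raised later in the thread: every .gz link now takes two round trips, and servers do not always answer HEAD requests correctly.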
It looks more like #1772.
My two cents... I think anything that could generate the error … But most importantly, I think there should be an easy way to extend/override IGNORED_EXTENSIONS. At the time, a response can throw an …
# in the spider or in your settings:
from scrapy.linkextractors import IGNORED_EXTENSIONS

IGNORED_EXTENSIONS += ['my', 'custom', 'extensions']

# if added in settings:
# from myproject.settings import IGNORED_EXTENSIONS
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = 'myspider'
    lx = LinkExtractor(deny_extensions=IGNORED_EXTENSIONS, allow=(), restrict_xpaths=())
    rules = (Rule(lx),)
    # ...
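Overriding in the other direction works the same way; a sketch that re-allows an extension from the default list ('zip' here, since the thread notes .zip is already filtered), without mutating the module-level list in place:

from scrapy.linkextractors import IGNORED_EXTENSIONS, LinkExtractor

# Start from the defaults, but re-allow .zip links for this extractor only.
deny = [ext for ext in IGNORED_EXTENSIONS if ext != 'zip']
lx = LinkExtractor(deny_extensions=deny)

Leaving IGNORED_EXTENSIONS itself untouched is arguably safer than the in-place += above when several spiders share a project.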
@kmike: Is cancelling a response/download the right way, though? It galls me that there is no better way. Sending two requests for every resource ending in … But …
I vote for adding the .cdr extension.
A follow-up to #1835. For the record, in my last project I had to also add the following extensions to the ignored list: …

cdr and apk look safe; I wonder if we can add .gz: Scrapy decompresses gz responses automatically, so if we add .gz to IGNORED_EXTENSIONS the spider may stop following e.g. sitemap.xml.gz links. On the other hand, for broad crawls most .gz files are archives, and we're filtering out .zip files already.
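For concreteness, a small demo of that sitemap concern, assuming the stock LinkExtractor; the HTML snippet and URLs are made up:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import IGNORED_EXTENSIONS, LinkExtractor

body = b'''
<a href="/sitemap.xml.gz">sitemap</a>
<a href="/dump.tar.gz">archive</a>
<a href="/page.html">page</a>
'''
response = HtmlResponse('http://example.com/', body=body, encoding='utf-8')

# With gz in the deny list, both .gz links are dropped, the sitemap included.
lx = LinkExtractor(deny_extensions=list(IGNORED_EXTENSIONS) + ['gz'])
print([link.url for link in lx.extract_links(response)])
# ['http://example.com/page.html']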