Updated the list of ignored extension in the linkextractor component #2067
Current coverage is 83.33%
+1, but without the
I've already reasoned this in the mentioned #1837 issue: In the name of sane defaults, I would rather see these URLs in the log and fix my LE on a case-by-case basis, if they happen to be archives, than have them silently dropped and perhaps miss a whole lot of actual website content.
I would add 'tar.gz' however.
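For illustration, a minimal sketch of why a compound suffix like 'tar.gz' needs whole-suffix matching: a check that looks only at the last extension component (as `os.path.splitext` does) sees `.gz` for both a tarball and a gzipped page. The list `IGNORED_SUFFIXES` and the helper `has_ignored_suffix` are hypothetical names for this example, not the project's actual defaults or API.

```python
from urllib.parse import urlparse

# Hypothetical ignore list for illustration; not the project's actual defaults.
IGNORED_SUFFIXES = (".tar.gz", ".zip", ".pdf")


def has_ignored_suffix(url, suffixes=IGNORED_SUFFIXES):
    """Return True if the URL path ends with one of the given suffixes."""
    # Only the path is inspected, so query strings do not affect the result.
    path = urlparse(url).path.lower()
    # str.endswith accepts a tuple, so a compound suffix such as ".tar.gz"
    # is compared as a whole rather than only by its last component.
    return path.endswith(tuple(suffixes))
```

With this list, `http://example.com/dump.tar.gz` is skipped, while `http://example.com/sitemap.xml.gz` is still followed because a bare `.gz` is not listed.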
@nyov so do you propose to remove .gz, .bz2, .xz, but add tar.(gz|bz2|xz) ?
No, only Deflate/gzip/zlib are well-known for use with on-the-fly encoding (afaik) -> RFC. A few more codings have made it into the official IANA list, and others are used unofficially (such as

So, '.gz' alone is fickle but '.tar.gz' is 99% a tarball archive; all other compression file-endings are reasonably certain to not contain browser-interpreted content.

Considering this is a default setting, that is as far as I'd go with the presets. The only sure-fire way, however, is to handle things like a browser, not assume content from URI names. Checking HTTP headers, detecting file MIME magic -- don't judge the book by its cover:
N.B.: default Apache
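The "handle things like a browser" idea could be sketched roughly as follows. This is only an illustration under stated assumptions: `headers` is a plain dict with exactly these key spellings, `body` holds the first raw bytes of the response, and the names `ARCHIVE_TYPES`, `sniff_archive` and `looks_like_archive` are made up for this example; the MIME type list is not exhaustive.

```python
# Leading "magic" bytes of a few common archive/compression formats.
MAGIC_PREFIXES = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
    b"PK\x03\x04": "zip",
}

# Assumed (non-exhaustive) set of MIME types treated as archives.
ARCHIVE_TYPES = {
    "application/gzip",
    "application/x-gzip",
    "application/zip",
    "application/x-tar",
    "application/x-bzip2",
    "application/x-xz",
}


def sniff_archive(body):
    """Return a format name if the body starts with known magic bytes."""
    for magic, name in MAGIC_PREFIXES.items():
        if body.startswith(magic):
            return name
    return None


def looks_like_archive(headers, body):
    """Guess from headers and leading bytes, not from the URL's name."""
    content_type = headers.get("Content-Type", "").split(";")[0].strip().lower()
    content_encoding = headers.get("Content-Encoding", "").strip().lower()
    if content_encoding in ("gzip", "x-gzip", "deflate"):
        # Compression is an on-the-fly content coding here, so gzip magic
        # in the raw body is expected; only the declared type decides.
        return content_type in ARCHIVE_TYPES
    return content_type in ARCHIVE_TYPES or sniff_archive(body) is not None
```

Under these assumptions, a gzip-encoded HTML page is not treated as an archive, while an `application/octet-stream` response whose body starts with gzip magic bytes is.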
See #1837 issue for details and discussion.