Updated the list of ignored extension in the linkextractor component #2067


@anatolykazantsev anatolykazantsev commented Jun 21, 2016

  • Added the following extensions to the list: .7z, .7zip, .bz2, .gz, .tar, .xz, .cdr, .ico, .apk
  • Moved archives into a separate section

See issue #1837 for details and discussion.


@codecov-io codecov-io commented Jun 21, 2016

Current coverage is 83.33%

Merging #2067 into master will not change coverage

Powered by Codecov. Last updated by 80c296e...55f3240

Contributor

@nyov nyov commented Jul 2, 2016

+1, but without the gz extension

I've already argued this in the mentioned issue #1837:
zlib and gzip are recognized and accepted by browsers for on-the-fly decompression.
Whether a gzipped file (e.g. *.html.gz or *.xml.z) is treated as a download or rendered as a page depends on the headers served with the resource.
You can't assume beforehand, from the link alone, that it is a download.
To determine that reliably, you need at least a HEAD request to the resource to read the Content-Type header: "application/x-gzip" for a download, "text/html" or similar for an "inline resource".
Filtering all of these links out in the default LinkExtractor setup is bad, IMO.

In the name of sane defaults, I would rather see these URLs in the log and fix my LE on a case-by-case basis if they turn out to be archives, than have them silently dropped and perhaps miss a whole lot of actual website content.
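
The HEAD-request check described above could look roughly like the following. This is only a sketch using the Python standard library, not anything Scrapy does by default; the names `ARCHIVE_MEDIA_TYPES`, `looks_like_download`, and `head_content_type` are made up for illustration, and the set of media types is an assumption that would need tuning per site.

```python
import urllib.request

# Assumed set of media types that indicate a download rather than a page.
ARCHIVE_MEDIA_TYPES = frozenset({
    "application/gzip",
    "application/x-gzip",
    "application/x-bzip2",
    "application/x-xz",
    "application/x-tar",
})


def looks_like_download(content_type):
    """Return True if a Content-Type header value names an archive type."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    return media_type in ARCHIVE_MEDIA_TYPES


def head_content_type(url, timeout=10):
    """Issue a HEAD request and return the Content-Type header, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")
```

With this, `looks_like_download(head_content_type(url))` gives a per-URL answer at the cost of one extra request, which is exactly why it is not something a link filter can decide from the URL alone.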

Contributor

@nyov nyov commented Jul 4, 2016

I would add 'tar.gz' however.

Author

@anatolykazantsev anatolykazantsev commented Jul 4, 2016

@nyov so are you proposing to remove .gz, .bz2, .xz, but add .tar.(gz|bz2|xz)?

Contributor

@nyov nyov commented Jul 5, 2016

No, only .gz -> .tar.gz.
And I would probably veto .z entirely, had you included it in your list.

Deflate/gzip/zlib are well known for use with on-the-fly encoding (afaik) -> RFC.
The others were (back in '97, as per the RFC dates) likely too memory-intensive or CPU-heavy to be considered for on-the-fly decompression in browsers.

A few more codings have made it into the official IANA list, and others are used unofficially (such as bzip2, lzma, xz - Wikipedia article), but these are much less common.
I think they can be exempted from consideration here.

So '.gz' alone is ambiguous, but '.tar.gz' is a tarball archive 99% of the time; all other compression file endings are reasonably certain not to contain browser-interpreted content (.bz2 might on Apache, see mime.conf).

Considering this is a default setting, that is as far as I'd go with the presets.
I think it's much harder to debug why your crawler does not see and fetch these friggin' URLs than to add them to the list on a case-by-case basis when they get mishandled.
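
To make the '.gz' versus '.tar.gz' distinction concrete, here is a rough sketch of a suffix check on the URL path. It is plain Python and makes no claim about how Scrapy's LinkExtractor matches extensions internally; `DENY_SUFFIXES` and `is_probably_archive` are hypothetical names, and the list only mirrors the archive endings discussed in this thread.

```python
from urllib.parse import urlparse

# Hypothetical deny list following the proposal above: a bare ".gz" is
# deliberately absent, while ".tar.gz" is present.
DENY_SUFFIXES = (".tar.gz", ".tar", ".bz2", ".xz", ".7z", ".7zip")


def is_probably_archive(url, deny_suffixes=DENY_SUFFIXES):
    """Return True if the URL path ends with one of the denied suffixes."""
    # Only look at the path, so query strings and fragments are ignored.
    path = urlparse(url).path.lower()
    return path.endswith(tuple(deny_suffixes))
```

Under these assumptions a link such as `/dump.tar.gz` would be skipped, while `/page.html.gz` would still be followed and show up in the log.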


The only sure-fire way, however, is to handle things like a browser and not assume content from URI names. Checking HTTP headers, detecting file MIME magic -- don't judge the book by its cover:

<IfModule mod_mime.c>
<Files "*.html.ohandfuckscrapy.xz">
        RemoveType .xz
        AddType text/html .xz
</Files>

<Location "/noscrapedocs">
    # *.whocares.about.file.endings
    ForceType text/html
</Location>
</IfModule>

N.B.: default Apache mime.conf:

<IfModule mod_mime.c>

        # AddEncoding allows you to have certain browsers uncompress
        # information on the fly. Note: Not all browsers support this.
        # Despite the name similarity, the following Add* directives have
        # nothing to do with the FancyIndexing customization directives above.
        #
        #AddEncoding x-compress .Z
        #AddEncoding x-gzip .gz .tgz
        #AddEncoding x-bzip2 .bz2
        #
        # If the AddEncoding directives above are commented-out, then you
        # probably should define those extensions to indicate media types:
        #
        AddType application/x-compress .Z
        AddType application/x-gzip .gz .tgz
        AddType application/x-bzip2 .bz2
[...]
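
As an illustration of the "file MIME magic" idea mentioned earlier, the leading bytes of a response body can be compared against well-known signatures. This is a minimal sketch, not an existing Scrapy feature; `MAGIC_SIGNATURES` and `sniff_compression` are hypothetical names, and only three formats are covered. Note that it identifies the container only: a gzip stream may still wrap an HTML page.

```python
# Leading-byte signatures for the compression formats discussed here.
MAGIC_SIGNATURES = (
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
)


def sniff_compression(head):
    """Return the compression format name for the given leading bytes."""
    for magic, name in MAGIC_SIGNATURES:
        if head.startswith(magic):
            return name
    # No known signature: treat the body as uncompressed or unknown.
    return None
```

Only the first handful of bytes of the body is needed, so a check like this could run on a partial download before deciding whether to fetch the rest.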


5 participants