
Check but don't actually download links to some classes of files #7691

Closed

markcmiller86 opened this issue May 18, 2020 · 4 comments

@markcmiller86

Is your feature request related to a problem? Please describe.
Yes. I am running the linkcheck builder on docs that contain numerous references to large .tar.gz and/or .zip data files, and sphinx-build appears to stall on these. I believe it is downloading those files in their entirety instead of simply confirming that the links to them are valid.

Describe the solution you'd like
Apparently both wget and curl have the ability to test a link without downloading the associated file, so I believe what I am asking for is possible.

In wget the relevant command-line argument is --spider, and in curl it is --head (if the server supports HEAD requests) or -r 0-0 (requesting a zero-length byte range). Both commands return different exit codes depending on whether the file exists at the given address, which gives me hope that the same can be achieved from within Python.

I would like sphinx-build -b linkcheck to support a new option, linkcheck_dont_download = [], which would be a list of regular expressions matching URIs that should not be downloaded. When a link to be checked matches one of these regular expressions, the linkcheck builder should adjust its logic to match the behavior of wget --spider or curl -r 0-0.
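A minimal sketch of what such a check could look like in Python, using only the standard library. The `linkcheck_dont_download` name mirrors the option proposed above; it is not an actual Sphinx setting, and `check_link` is a hypothetical helper, not linkcheck's real implementation:

```python
import re
import urllib.error
import urllib.request

# Hypothetical setting mirroring the proposed linkcheck_dont_download option:
# URIs matching any of these patterns are checked with HEAD only.
linkcheck_dont_download = [r".*\.tar\.gz$", r".*\.zip$"]

def check_link(uri: str) -> bool:
    """Return True if the URI is reachable.

    For URIs matching linkcheck_dont_download, issue a HEAD request
    (analogous to ``wget --spider``) so the body is never transferred.
    """
    head_only = any(re.match(pat, uri) for pat in linkcheck_dont_download)
    method = "HEAD" if head_only else "GET"
    req = urllib.request.Request(uri, method=method)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status < 400
    except urllib.error.URLError:
        # Covers HTTPError (4xx/5xx) and connection failures alike.
        return False
```

Like the wget/curl invocations above, the distinction is purely in the request method: a HEAD response carries headers only, so even a multi-gigabyte target costs one round trip.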

Describe alternatives you've considered
I am currently using linkcheck_ignore to skip these links entirely, but in a set of docs with many such links, that approach pretty much defeats the whole purpose of link checking.

Additional context
I was waiting many minutes for sphinx-build -b linkcheck to complete, whereas if I ignore such links it completes in less than a minute.

@markcmiller86 markcmiller86 added the type:enhancement enhance or introduce a new feature label May 18, 2020
@htgoebel

I support this feature request. (While I would actually call it a bug that all the data is downloaded.)

In my case a Linux distribution's installation DVD image is linked, which is about 4 GB in size.

@markcmiller86
Author

While I would actually call it a bug that all the data is downloaded.

I think I agree. To be honest, I think it would be best if the linkcheck builder did not download such content at all by default, and some action were required to force a download, such as specifying a linkcheck_download option in conf.py with regexes that force downloads on matching URIs when desired.
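Under that opt-in design, the configuration might look like the fragment below. Both the option name `linkcheck_download` and the pattern are assumptions taken from this comment, not actual Sphinx settings:

```python
# conf.py -- hypothetical opt-in variant proposed above:
# nothing is downloaded in full unless a URI matches one of these patterns.
linkcheck_download = [
    r"https://example\.org/small-files/.*",
]
```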

@NickVolynkin
Contributor

It would be nice to have an option to check all links with HEAD requests. We have somewhat large docs, and the linkcheck stage can take several minutes. (A usual html build takes ~30s; sphinx-build -M dummy takes ~15s.)

@tk0miya
Member

tk0miya commented Jul 10, 2020

I thought the linkcheck builder sent a HEAD request first, but it unexpectedly uses a GET request instead of HEAD. This must be a bug in the linkcheck builder. Could you check whether #7936 resolves this issue, please?

Note: the linkcheck builder is implemented so that it does not download the whole content when the URL does not contain an anchor (#). So I guess #7936 will fix this issue entirely.
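The HEAD-first strategy described in this comment can be sketched as follows. This is an illustrative stdlib version under assumed names (`check_uri` is hypothetical), not the actual code in #7936:

```python
import urllib.error
import urllib.request

def check_uri(uri: str) -> int:
    """Return the final HTTP status for *uri*, trying HEAD before GET.

    Minimal sketch of the HEAD-first strategy: servers that reject HEAD
    (405 Method Not Allowed / 501 Not Implemented) are retried with GET,
    so the response body is transferred only as a last resort.
    """
    status = 0
    for method in ("HEAD", "GET"):
        req = urllib.request.Request(uri, method=method)
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            status = err.code
            if method == "HEAD" and err.code in (405, 501):
                continue  # server does not implement HEAD; retry with GET
            return status
    return status
```

The fallback matters because some servers answer HEAD with 405 or 501 even though the resource exists; without it, a HEAD-only checker would report such links as broken.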

@tk0miya tk0miya added type:bug builder and removed type:enhancement enhance or introduce a new feature labels Jul 10, 2020
@tk0miya tk0miya added this to the 3.2.0 milestone Jul 10, 2020
tk0miya added a commit that referenced this issue Jul 11, 2020
Fix #7691: linkcheck: HEAD requests are not used for checking
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 22, 2021
4 participants