Linkcheck fails on GitHub Doc URLs #10343

Closed
kmpaul opened this issue Apr 12, 2022 · 6 comments
Comments

@kmpaul

kmpaul commented Apr 12, 2022

Describe the bug

Sphinx linkcheck fails with a 403 Client Error: Forbidden error on every GitHub Docs site URL, even though the links are correct and work fine.

How to Reproduce

  1. Create a new sphinx project with sphinx-quickstart.
  2. Add a docs.github.com URL to the index.rst file, like so:
    `GitHub Hello World <https://docs.github.com/en/get-started/quickstart/hello-world>`_
  3. Run make linkcheck

And you can verify that the above link works fine:

https://docs.github.com/en/get-started/quickstart/hello-world

Expected behavior

There should be no broken link error.

Your project

https://foundations.projectpythia.org/

Screenshots

No response

OS

Mac OS 11.6.5 and linux

Python version

3.10.4

Sphinx version

4.2.0+

Sphinx extensions

No response

Extra tools

No response

Additional context

No response

@dopplershift

I don't think this is a bug in Sphinx since I see it even when trying to access the pages using cURL (see github/docs#17042). I AM able to get it to work by spoofing a user agent in the linkcheck_request_headers:

# conf.py: send a browser-like User-Agent only for URLs under docs.github.com
linkcheck_request_headers = {
    r'https://docs.github.com/': {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; '
                                                'rv:24.0) Gecko/20100101 Firefox/24.0'}
}
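
For readers unfamiliar with this option, here is a simplified sketch of how a per-URL header mapping like the one above could be resolved for a given link. It is an illustration only, not Sphinx's actual code: the helper name `headers_for` and the matching order (site base URL, then the full URL, then a `"*"` catch-all) are assumptions made for this example.

```python
from urllib.parse import urlparse

BROWSER_UA = ('Mozilla/5.0 (X11; Ubuntu; Linux i686; '
              'rv:24.0) Gecko/20100101 Firefox/24.0')

# Same shape as linkcheck_request_headers in conf.py.
request_headers = {
    'https://docs.github.com/': {'User-Agent': BROWSER_UA},
}


def headers_for(uri, mapping):
    """Return the extra headers configured for ``uri`` (hypothetical helper)."""
    url = urlparse(uri)
    base = f'{url.scheme}://{url.netloc}'
    # Try the site base (without and with a trailing slash), then the exact
    # URL, and finally a catch-all entry.
    for candidate in (base, base + '/', uri, '*'):
        if candidate in mapping:
            return dict(mapping[candidate])
    return {}


# A docs.github.com link picks up the spoofed User-Agent...
github_headers = headers_for(
    'https://docs.github.com/en/get-started/quickstart/hello-world',
    request_headers,
)
# ...while links to other sites get no extra headers.
other_headers = headers_for('https://www.sphinx-doc.org/', request_headers)
```

Under these assumptions, only links on the configured site are sent with the browser-like header; every other link is still checked with the tool's default User-Agent.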

@kmpaul
Author

kmpaul commented Apr 13, 2022

Fantastic find, @dopplershift! Thanks for looking into this.

@francoisfreitag
Contributor

Thanks @dopplershift for the investigation and solving the issue!
It appears to be an “issue” with the GitHub docs server, which refuses to serve requests from certain User-Agents. I don’t think linkcheck’s behavior should change; the tools to bypass the checks already exist.
Closing the issue. Feel free to reopen if needed.

@kmpaul
Author

kmpaul commented Apr 13, 2022

@francoisfreitag: I'm not entirely sure that I agree with your conclusion. I agree that @dopplershift's investigation is great, and that it is wonderful to have a work-around for the problem. However, what is the purpose of linkcheck if not to determine if clicking a link will work? Clicking a link in your browser retrieves the web page with an approved User-Agent matching your browser. Isn't that the expected interaction with any link that linkcheck checks? Perhaps I am wrong, and I'm open to your interpretation.

However, I would argue that this issue shows Sphinx linkcheck should spoof a valid User-Agent by default. What do you think?

@francoisfreitag
Contributor

> Clicking a link in your browser retrieves the web page with an approved User-Agent matching your browser.

The “approved User-Agent” depends on the server linkcheck is connecting to, and your browser might not be the same as my browser. In this case, Firefox is accepted. In another, only Chromium might be accepted, or even a specific version of a browser might be accepted (IE 11 👋).
The point is, there’s no single spoofed value that would be accepted by all servers. Besides, reporting which tool is being used in the User-Agent is the best way to let the server tailor its response based on the request. It’s also good for transparency and server-side statistics (knowing where the traffic is coming from).

Leaving the default User-Agent is the most sensible thing to do, so that good citizens of the internet can have a reasonable idea where their traffic is coming from. For servers which behave unexpectedly, Sphinx provides mechanisms to implement workarounds and still get the benefits of linkcheck.
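
As a rough illustration of the mechanisms mentioned above, a `conf.py` might handle such a server in one of the following ways. The option names (`linkcheck_request_headers`, `linkcheck_ignore`, `user_agent`) are existing Sphinx configuration values; the patterns and header strings are example values only, and which option fits a given project is a judgment call.

```python
# conf.py (sketch; one of these approaches is usually enough)

# Option 1: send a browser-like User-Agent only to the affected site.
linkcheck_request_headers = {
    'https://docs.github.com/': {
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; '
                      'rv:24.0) Gecko/20100101 Firefox/24.0',
    },
}

# Option 2: skip the affected site entirely. Entries are regular
# expressions matched against each URI, so links on this site are
# no longer checked at all.
linkcheck_ignore = [
    r'https://docs\.github\.com/.*',
]

# Option 3 (left disabled here): override the User-Agent for every request
# Sphinx makes. The value is elided on purpose; choose your own.
# user_agent = '...'
```

Option 1 keeps the links checked and leaves the default User-Agent in place for all other servers, which matches the reasoning above; option 2 trades coverage for simplicity.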

@kmpaul
Author

kmpaul commented Apr 13, 2022

Yeah. I realize that an "approved User-Agent" cannot be predicted by Sphinx. I completely agree with you on that. And it shouldn't be Sphinx's responsibility to "guess" the approved User-Agent. And I agree with you that it is probably outside the scope of linkcheck to verify that the link can be opened with even a subset of browsers and browser versions. I guess I'm just puzzled why GitHub would require specific User-Agent headers, when the User-Agent header is entirely optional...

Sigh. Ok. Thanks for the discussion @francoisfreitag! I'll track @dopplershift's question on the GitHub docs repo (github/docs#17042).
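
For anyone who wants to check whether the server-side behavior has changed since this discussion, a small probe along the following lines could compare the status codes returned for different User-Agent values, independently of Sphinx. This is a sketch using only the Python standard library; the helper names `build_request` and `status_for` and the tool-like User-Agent string `linkcheck-probe/0.1` are made up for this example, and the status codes you see depend entirely on the server at the time you run it.

```python
import urllib.error
import urllib.request

URL = 'https://docs.github.com/en/get-started/quickstart/hello-world'
BROWSER_UA = ('Mozilla/5.0 (X11; Ubuntu; Linux i686; '
              'rv:24.0) Gecko/20100101 Firefox/24.0')
TOOL_UA = 'linkcheck-probe/0.1'  # hypothetical tool-like User-Agent


def build_request(url, user_agent):
    """Build a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def status_for(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends for this User-Agent."""
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Error responses (such as 403 Forbidden) arrive as exceptions.
        return err.code


# Example usage (requires network access):
#   print('tool-like   ', status_for(URL, TOOL_UA))
#   print('browser-like', status_for(URL, BROWSER_UA))
```

If both calls return the same success code, the per-site header workaround may no longer be needed.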

github-actions bot locked as resolved and limited conversation to collaborators May 14, 2022