Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for rate limit detection / retries @ linkcheck #7388

Closed
webknjaz opened this issue Mar 27, 2020 · 18 comments
Closed

Add support for rate limit detection / retries @ linkcheck #7388

webknjaz opened this issue Mar 27, 2020 · 18 comments
Labels
help wanted type:enhancement enhance or introduce a new feature
Milestone

Comments

@webknjaz
Copy link
Contributor

Is your feature request related to a problem? Please describe.

When there's a lot of references to GitHub issues and Pull Requests, linkcheck goes over them quite quickly and at some point, sometimes, it hits their rate limit.

Describe the solution you'd like

linkcheck needs some sort of a setting for rate-limiting (leaky bucket?) and retries (exponential backoff?). These settings could be per-domain similar to #7247 (comment) for flexibility.

Also, maybe there could be another setting for a callable supplied by users. This'd allow inspecting the response and returning True/False from such a callback specifying if the failure is temporary and the request should be retried later (similar to giveup callback in https://pypi.org/p/backoff/ library — maybe it could even be integrated?).

Describe alternatives you've considered

N/A

Additional context

Basically, this fails in CI:

�[31;01mWarning, treated as error:�[39;49;00m
/home/zuul/src/github.com/cherrypy/cheroot/docs/history.rst.rst:241:broken link: https://github.com/cherrypy/cheroot/pull/80 (429 Client Error: too many requests for url: https://github.com/cherrypy/cheroot/pull/80)
ERROR: InvocationError for command /home/zuul/src/github.com/cherrypy/cheroot/.tox/linkcheck-docs/bin/python -m sphinx -j auto -a -n -W -b linkcheck --color -d /home/zuul/src/github.com/cherrypy/cheroot/.tox/docs_doctree . /home/zuul/src/github.com/cherrypy/cheroot/.tox/docs_out (exited with code 2)

(https://zuul.vexxhost.dev/t/cherrypy/build/e02b806392eb41c7a4030ec4bae735c3/console)

@webknjaz webknjaz added the type:enhancement enhance or introduce a new feature label Mar 27, 2020
@tk0miya
Copy link
Member

tk0miya commented Mar 28, 2020

+1; Reasonable. I'm not familiar with rate-limit detection. But it would be helpful.
Also +1 for exponential back-off (or retries).

@tk0miya tk0miya added this to the some future version milestone Mar 28, 2020
@blink1073
Copy link

I implemented similar logic here for reference: jupyterlab/pytest-check-links#7

@mgeier
Copy link
Contributor

mgeier commented Sep 27, 2020

See also #6629.

@francoisfreitag
Copy link
Contributor

Should be solved by #8467.

francoisfreitag added a commit to francoisfreitag/sphinx that referenced this issue Nov 22, 2020
Follow the Retry-After header if present, otherwise use an exponential
back-off.

Close sphinx-doc#7388
@tk0miya tk0miya modified the milestones: some future version, 3.4.0 Nov 25, 2020
@tk0miya
Copy link
Member

tk0miya commented Nov 28, 2020

@francoisfreitag Could I close this?

@francoisfreitag
Copy link
Contributor

I think so, even if the issue asks for more configuration that has been made available in #8467.

Let’s start with the simple solution first. We can consider more advanced configuration if it is not sufficient. Perhaps using an event as suggested in #8467 (comment).

@webknjaz
Copy link
Contributor Author

Honestly, I'd prefer the backoff lib to be integrated and the callbacks for determining whether the retry is necessary to be exposed...

@tk0miya
Copy link
Member

tk0miya commented Nov 29, 2020

@webknjaz Please let me know why do you want callbacks (flexibility)? Do you have any troublesome links? I still don't understand why you need such a feature.

@webknjaz
Copy link
Contributor Author

webknjaz commented Dec 6, 2020

I think that it's a good idea for Sphinx to implement support for some common standardized headers. But OTOH there's plenty of services reinventing the wheel, using something beyond the standard, providing more context with extra headers like GitHub does (https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#rate-limiting) or not playing by the rules, basically lying about when to retry.
There's also Backoff: https://cliquet.readthedocs.io/en/latest/api/backoff.html.
Also, in some cases, the user would know better when they want to increase the interval or treat other conditions like a JSON response with the retry info instead of headers.

This is why I'm convinced that there should be an interface for providing a more flexibile configuration. Having Sphinx core special-case each and every possible unconventional case is not a sustainable solution while empowering users to deal with this is.

I can't provide any URLs at the moment, I guess, GHA badges had problems at some point so I added them to the ignore list, I'll need to re-check my ignores across projects to see if there's anything interesting among those. I vaguely recall something about codecov returning 4xx code w/o any extra metadata but I can't confirm this right now.

@francoisfreitag
Copy link
Contributor

francoisfreitag commented Dec 6, 2020

Thanks for documenting why the additional complexity might be worth it.

Some web servers definitely misbehave. The GitHub API rate limits is probably not a realistic example, I doubt people do GitHub API calls in their documentation. An example noticed during the development of the current version was GitHub web returning Retry-After: 1m0, not a valid value. Coincidentally, it aligns with the default wait time of 60 seconds 🙂.

Currrent situation

Good

I hope the majority of cases. Most web servers won’t rate-limit, or specify a correct Retry-After header.

Suboptimal

These cases should eventually work:

  • Retry-After header not specified (or custom headers, or retry after details in the response body). linkcheck may wait more than necessary. It uses its backoff policy up to the configurable user limit. e.g. if the rate limit is applied for 10 minutes, linkcheck will wait 1 minute + 2 minutes + 4 minutes + 8 minutes, total 15 minutes.
  • web servers may have different rate limits for different paths. The current implementation stops checking the entire domain after a rate limit is issued.

Poor

  • An incorrect Retry-After is not supported at all. The only options I see are to ignore these URLs, or hope that the builder waits long enough to get the response.
  • A rate limit with an incorrect status code does not cause linkcheck to retry.

Improvements

Giving user control allows them to improve the wait time by specifying the optimal delay, or reading a part of the response to know when to retry. It also allows overriding a lying web server. Definitely, both are improvements over the current situation.

Having Sphinx core special-case each and every possible unconventional case is not a sustainable solution

I agree.

empowering users to deal with this is [a sustainable solution]

I find acceptable that linkcheck does not handle misbehaving web servers. Similarly, it may fail due to transient issues: connectivity error, the occasional configuration issue (HTTP statuses 500, 502, 504), etc.

I’m concerned that only a small number of users would be willing to take the time to observe misbehaving web servers (either to figure out where the rate limits are specified or observe the wait time experimentally) and write a retry policy for it.

Delegating control to users increases complexity:

  • Users need to understand how to connect their callbacks to events, filter responses, possibly maintain some state (e.g. last wait time for a URL), and the interface to communicate with linkcheck (how to indicate to wait? how to indicate a broken link? how to indicate a working link?). Also, because linkcheck checks URLs in threads, the user callback must be thread safe.

  • Maintainers have more API surface to maintain and evolve. For example, what attributes are available on a response? The builder should provide a default behavior for the common case (which it currently does), plus delegate to users and handle their errors, such as failing callbacks, incorrect return types, etc. The interface to communicate with the builder must be documented and follow the deprecation policy to evolve.

Possible implementation

To anyone interested in attempting an implementation, an earlier version of the patch was offering a lambda called with the last wait time and the 429 response. It gave users control of the time to wait before to retry. It was removed because it breaks Sphinx incremental build feature #8467 (comment). tk0miya suggested using events (I believe https://www.sphinx-doc.org/en/master/extdev/appapi.html#emitting-events, listeners being declared with https://www.sphinx-doc.org/en/master/extdev/appapi.html#sphinx.application.Sphinx.connect). That would allow users to implement their own rate limit backoff algorithms in conf.py, possibly using the backoff library or the tool of their choice.

Going further

Perhaps not only rate limits need to be configurable, but users would want to inspect all responses and decide what to do (consider the link working, broken, redirected, rate-limited, etc). linkcheck could keep the current implementation as the default, and give a chance to users to inspect each response and set the result themselves. At that point though, perhaps linkcheck should just dump the links to a text file and users can chose their tools to check them.

@francoisfreitag
Copy link
Contributor

In short, my opinion is to wait for a couple releases, see if the current implementation solves most issues. Once remaining problematic behaviors are clearly identified, decide on the best course of action.

Perhaps add events, allowing users to have complete control over a link validity and behavior on rate limits (and other kind of errors).
Perhaps dump all links in a text file for users who want complete control over how they are checked.

@webknjaz
Copy link
Contributor Author

webknjaz commented Dec 7, 2020

@francoisfreitag this is not related to rate-limits but here's one case when allowing to amend the validity detection mechanism would help.

I've added a link to https://github.com/python/cpython/blob/c39b52f/Lib/poplib.py#L297-L302 and now I get:

broken link: python/cpython@c39b52f/Lib/poplib.py#L297-L302 (Anchor 'L297-L302' not found)

When you look at the source, it becomes obvious that there are anchors for L297 and L302 but not for the range — it's something that is JS-based.

Ref: https://github.com/cherrypy/cheroot/runs/1511163391?check_suite_focus=true#step:9:603

@francoisfreitag
Copy link
Contributor

In an ideal world, linkcheck would be able to run the JavaScript and figure out that the anchor exists. This issue will be more frequent as sites are increasingly implemented in full JS, and not all of them have server-side rendering.

Realistically, linkcheck_anchors_ignore is a reasonable alternative IMO. The linkcheck builder will verify the page exists, but won’t check the anchor.

@webknjaz
Copy link
Contributor Author

webknjaz commented Dec 7, 2020

Sounds fair.

@webknjaz
Copy link
Contributor Author

webknjaz commented Dec 7, 2020

Did you mean to link some other issue? You linked this one.

@tk0miya tk0miya modified the milestones: 3.4.0, 3.5.0 Dec 20, 2020
@tk0miya
Copy link
Member

tk0miya commented Dec 28, 2020

It seems there is no remaining tasks for this issue. Can I close this?

@francoisfreitag
Copy link
Contributor

This issue has turned to be about control over the retry policy (or even linkcheck behavior in general) for power users. Delegating control increases complexity, both in user configuration and on Sphinx side.

My preferred way forward is to close this issue. I suggest opening a new issue when limitations of the current implementation cause an actual problem. That would help narrowing the space of possible issues and focusing the discussion.

@tk0miya
Copy link
Member

tk0miya commented Jan 11, 2021

Thank you for your comment. Let's close this :-)

@tk0miya tk0miya closed this as completed Jan 11, 2021
felixxm pushed a commit to ngnpope/django that referenced this issue May 17, 2021
We explicitly ignore checking anchors for line-range anchors on GitHub
which are dynamically generated and, otherwise, show up as broken links.

See sphinx-doc/sphinx#7388 (comment).

We also ignore links to resources that require authentication.
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 15, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
help wanted type:enhancement enhance or introduce a new feature
Projects
None yet
Development

No branches or pull requests

5 participants