Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable async capabilities for HTTP requests #8391

Open
4 tasks
francoisfreitag opened this issue Nov 8, 2020 · 10 comments
Open
4 tasks

Enable async capabilities for HTTP requests #8391

francoisfreitag opened this issue Nov 8, 2020 · 10 comments

Comments

@francoisfreitag
Copy link
Contributor

francoisfreitag commented Nov 8, 2020

Is your feature request related to a problem? Please describe.
From #6629, the linkcheck command is currently using threads to concurrently check the status of links in the documentation. Threads are not the most efficient way to concurrently check links: once all threads are busy waiting, the work queue stops being consumed.

Describe the solution you'd like

Using an event loop allows the URL verifier to yield control to another coroutine until it gets a response. That means a single thread is able to send multiple requests concurrently and process the response as they arrive. It also facilitates handling rate-limiting, because a coroutine can be scheduled to run in the future.

The first step toward using asynchronous concurrency in linkchecker is to replace the requests library uses with an async-compatible HTTP library. The aiohttp library has an API pretty similar to that of requests and is well-established and under active development, it seems like a good choice.

Describe alternatives you've considered

Tried handling rate limits with a PriorityQueue as described in #6629 (comment).

TODO

  • Increase test coverage of the existing code
  • Use a wrapper that makes async code synchronous for sync use cases (get event loop, queue the request, run event loop until complete).
  • Adapt existing code that expects a requests.Response to use an aiohttp.ClientResponse. Both look pretty similar.
  • Make a compatibility wrapper for arguments where a requests object was expected and aiohttp expects a different input. Consider REQUESTS_CA_BUNDLE, tls_cacerts, auth_info for the linkcheck_auth setting.
@francoisfreitag
Copy link
Contributor Author

I’m hitting an issue with this work: HTTP authentication.

linkcheck_auth documents its auth_info parameter as

anything that is understood by the requests library (see requests Authentication for details).

The Requests library supports advanced authentication schemes, such as Digest auth or potentially even NTLM through https://github.com/requests/requests-ntlm. Both require multiple HTTP request-response cycle for authentication.

aiohttp only supports HTTP basic auth, and the discussion to support more advanced authentication protocols seems to have reached a standstill. It is unclear whether more authentication methods will be supported the future.

Based on the PR that introduced the linkcheck_auth, I believe basic auth was the intended use case. But with the documentation “anything that is understood by the requests library”, I would be surprised if dropping support for the digest auth or NTLM authentication method does not break someone’s doc.

Is dropping support for complex authentication an option?
I doubt many users hide their doc behind complex authentication schemes, but someone somewhere certainly does that.
FWIW, intersphinx (the only other extension to use Requests) only mentions basic authentication.
What do you think?

@francoisfreitag
Copy link
Contributor Author

An option could be to use aiohttp for most cases, public URLs or URLs protected with basic auth, and make HTTP requests that require a complex authentication scheme synchronously with requests.
That means maintaining integration of two similar but slightly different HTTP libraries in the project. I don’t think if it’s worth the complexity, but throwing the idea out there for discussion.

@tk0miya
Copy link
Member

tk0miya commented Nov 15, 2020

Personally, I don't object to drop support them (I'd forgotten we have such a feature!). But I can't say nobody uses some complex authentication. I guess nobody uses it, but...

@francoisfreitag
Copy link
Contributor Author

francoisfreitag commented Nov 21, 2020

Another backward incompatibility: AIOHTTP does not return a requests.Response object. Extensions that expect such objects are likely to break.

A wrapper can be made for the most common bits (status code, url, encoding, content, etc.), but it cannot cover all methods from requests.Response, in particular not chunked reads:

The current plan for synchronous requests with AIOHTTP is to schedule a coroutine that makes the HTTP request on a new event loop, blocks until the event loop completes and returns the result.
The AIOHTTP response object is only available during the lifespan of the event loop, that makes chunked reads impossible in a synchronous context.

@AA-Turner
Copy link
Member

@jayaddison would you be interested in looking into this? httpx may be a good candidate as I believe it looks to keep the interface of Requests.

Alternativley, we create a new sphinx.util._async_http helper module instead of replacing util.requests.

To François' point above, I don't know how AIO-HTTP has developed w.r.t. authentication in the last three years.

A

@francoisfreitag
Copy link
Contributor Author

I’ve been using httpx for a while now, and it’s been an easy switch from requests at $DAYJOB. That would probably ease the transition. 👍

@jayaddison
Copy link
Contributor

@AA-Turner:

@jayaddison would you be interested in looking into this? httpx may be a good candidate as I believe it looks to keep the interface of Requests.

Thanks; I'm tempted - I think I should try to learn more about async programming Python beforehand, though.

I'm also not sure I'm quite there in terms of familiarity with the existing linkchecker. Learning, but not confident that I understand it well enough to make significant changes yet.

@francoisfreitag:

TODO

The first task about expanding test coverage rings true. I'd suggest performance testing (throughput and resource usage) measurement too, so that we learn about any regressions.

What I'm less certain about and still thinking through is what the best path might be, after further coverage is available:

  • Migrate directly to an async approach?
  • Refactor and simplify the code first, and then migrate to async?
  • Develop an additional, async linkchecker using the experience gained from the existing synchronous one?

(my sense of my development experience recently is that I tend to lean more towards debugging, maintenance and fixes, and so that'd lead me towards the second option. but I don't know how much time I can commit, and don't want to get in the way of other approaches)

@francoisfreitag
Copy link
Contributor Author

The test coverage has been expanded quite a bit since I wrote that issue. The goal at the time was to better handle rate limiting (esp. from GitHub), and async provides a nicer way to schedule the next check time than the current threads waking up to verify nothing needs doing, then sleeping some.
Refs #6629

I would probably go the second route as well. Because of the my findings 2/3 years ago, I decided to opt for the threaded version, which didn’t need to change requests. It can totally be reconsidered, since httpx seems like a viable alternative that supports async.
I’m not sure how much compatibility there is between httpx and requests plugins, and how much efforts should be put in preserving the backward compatibility with complex auth (e.g. NTLM).

@Ousret
Copy link

Ousret commented Apr 6, 2024

There's another candidate that aim to be a drop-in replacement for requests, see niquests. plugins should work as is and allow a progressive migration. if needed, I can assist.

@Ousret
Copy link

Ousret commented Apr 8, 2024

The multiplexing aspect of Niquests may come in handy to avoid converting the whole code into async immediately and make a significant gain.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants