Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

trio.socket.getaddrinfo may invoke a worker thread when it doesn't need to #1249

Open
zackw opened this issue Oct 14, 2019 · 3 comments
Open

Comments

@zackw
Copy link
Contributor

zackw commented Oct 14, 2019

trio.socket.getaddrinfo tries to avoid invoking a worker thread when it knows that stdlib getaddrinfo won't block: the host is already an IP address and the port is already numeric. It does this by calling stdlib getaddrinfo with AI_NUMERICHOST|AI_NUMERICSERV added to the flags. If that call succeeds, it returns immediately, otherwise it calls getaddrinfo in a worker thread.

The problem is that trio assumes that a failure of this initial call is because we need to do a host and/or service lookup. This is not necessarily true. The most important case is when the caller of trio.socket.getaddrinfo already set AI_NUMERICHOST|AI_NUMERICSERV in the flags; in that case the call in the worker thread is going to fail in the exact same way that the call in the main thread failed, and there's no point going through the thread. Failures for unrelated reasons (e.g. inappropriate address family) will also fail the same way in the worker thread Failures are only retried in the worker thread if the getaddrinfo error code is EAI_NONAME, so we only need to worry about the cases where that can happen, but there are still several types of invalid "host" and/or "port" argument that will produce EAI_NONAME and there's no point retrying, even if caller didn't set AI_NUMERICHOST|AI_NUMERICSERV.

To fix this, getaddrinfo just needs to pay a little more attention to why the initial call failed and what the caller's flags were, and not retry calls that will provably fail the same way in the worker. This will also let us tell the authors of custom HostResolver implementations that they will not be called when both the host and the service were already numeric (but it's still possible for them to get an IP address with a service keyword, or a domain name with a numeric port) and they can rely on the rest of the arguments to have been validated already -- in particular they do not need to worry about the address family being something other than AF_INET, AF_INET6, or AF_UNSPEC.

I plan to work on this myself but I am filing this issue in advance to get advice about testing. Is there a good way to test "this operation does not do anything in a worker thread" already? I suppose I could define a test HostResolver that bombs out if called, since the logic is "call custom HR or else call stdlib getaddrinfo in a worker" if we get to that point, but maybe someone has a better idea?

@njsmith
Copy link
Member

njsmith commented Oct 14, 2019

Huh. That's some impressive attention to edge cases. I approve :-). Is also appreciate hearing more about what kind of use case makes this detail important!

Our getaddrinfo tests already monkeypatch the stdlib socket.getaddrinfo in order to check that our trio.socket.getaddrinfo is invoking it the way we expect. We could use that strategy to test this too, I think. (For example, install a fake socket.getaddrinfo that records which thread it was called in.)

@zackw
Copy link
Contributor Author

zackw commented Oct 14, 2019

Is also appreciate hearing more about what kind of use case makes this detail important!

It's not actually about the threads, for me. I was writing a HostResolver implementation and I was surprised to get called when I had passed AI_NUMERICHOST|AI_NUMERICSERV along with a domain name (from the tests).

The larger goal is to substitute dnspython and trio-native networking for the stdlib getaddrinfo, in order to log exact details of the DNS traffic as it comes on and off the wire, and also to send every query to a bunch of different DNS servers at the same time and compare the responses. (The even larger goal is Internet censorship monitoring, see https://arxiv.org/abs/1907.04245 .)

@graingert graingert self-assigned this Jun 18, 2022
@brandyn
Copy link

brandyn commented May 7, 2023

Not sure if this is related, but I am seeing Trio freeze up completely in "await socket.getnameinfo((ip, 0), 0)" until it eventually throws a "transient failure" exception. (By freeze up completely, I mean--presumably--something that should have gone to a worker thread hasn't: My entire application hangs for a second or two until the call bails.) Note this call normally succeeds and everything works fine but once in a blue moon it hangs and ultimately fails.

[Update: More like 5-10 seconds. And the exact error is "[Errno -3] Temporary failure in name resolution"]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants