Skip to content
This repository was archived by the owner on Nov 23, 2017. It is now read-only.

Skip getaddrinfo if "host" is already an IP address. #302

Closed
wants to merge 16 commits into from

Conversation

ajdavis
Copy link

@ajdavis ajdavis commented Dec 9, 2015

elif type == socket.SOCK_DGRAM:
proto = socket.IPPROTO_UDP
else:
return None
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure of this section, it reproduces more of getaddrinfo's own logic than I'm comfortable with.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be fine, though. AFAIK you can't really combine SOCK_STREAM with anything other than IPPROTO_TCP, same for UDP. And they only make sense for AF_INET and AF_INET6.

return None

try:
addr = ipaddress.ip_address(host)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put an LRU cache around this call?

Also, wrap _ipaddr_infos in an LRU cache.
@ajdavis
Copy link
Author

ajdavis commented Dec 12, 2015

I've added another patch now, thanks for your comments.

@@ -533,9 +535,37 @@ def _getaddrinfo_debug(self, host, port, family, type, proto, flags):
logger.debug(msg)
return addrinfo

@functools.lru_cache()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. I don't want lru_cache internals to hold references to event loops. How about we factor _ipaddr_infos method into a top-level function in base_events.py, returning None or a tuple. This way we'll also share this cache between different loops. I'd also bump the cache size to 1024 or something.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point; I've added another commit that moves this to module scope and increases the cache size.

Allows loops to share the _ipaddr_infos cache, and prevents the cache from
holding references to loops.
@1st1
Copy link
Member

1st1 commented Dec 12, 2015

LGTM, but please wait for @gvanrossum's final approval.

If you're going to commit this yourself, please squash all commits into one and merge by hand. Don't forget to update the CPython repo!

@ajdavis
Copy link
Author

ajdavis commented Dec 12, 2015

Thanks so much Yury. I'll squash and commit here once we hear from Guido, but I don't think I have permissions to update CPython.

return None
else:
af = socket.AF_INET if addr.version == 4 else socket.AF_INET6
return [(af, type, proto, '', (host, port))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something makes me worry that a recipient of this function could mutate this list and mess up the cached value. Maybe this should return a tuple and the call site could call list() on the result.

@gvanrossum
Copy link
Member

My suggested change should be simple enough. I have no other comments so you can commit with the suggested fix after either Yury or myself approves.

@ajdavis
Copy link
Author

ajdavis commented Dec 13, 2015

New patch implements @gvanrossum's suggestion: _ipaddr_infos now returns a tuple and the call site calls list() on the result, to prevent inadvertent mutation of the value in the LRU cache.

@ajdavis
Copy link
Author

ajdavis commented Dec 13, 2015

Done: _ipaddr_infos is now _ipaddr_info.

@1st1
Copy link
Member

1st1 commented Dec 13, 2015

@ajdavis Looks good! Please merge the PR (and I'll sync CPython repo myself).

@@ -535,7 +559,12 @@ def _getaddrinfo_debug(self, host, port, family, type, proto, flags):

def getaddrinfo(self, host, port, *,
family=0, type=0, proto=0, flags=0):
if self._debug:
info = _ipaddr_info(host, port, family, type, proto)
if info:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd use is not None here.

@gvanrossum
Copy link
Member

Cool to see the solution. LGTM (except maybe the is not None) issue. Whoopie!

@vstinner
Copy link
Member

Sorry, I didn't read the history of this pull request.

_check_resolved_address() already has code to check if a string is a valid IPv4 or IPv6 address. I didn't remember why exactly, but I remember that we choose to not use the ipaddress module for this function. Recently, the call to _check_resolved_address() was made optional (only done in debug mode) for performance reason. But socket.inet_pton() is now used if available, instead of getaddrinfo(), for performance reason.

I would prefer to not use ipaddress or use ipaddress in both functions (_check_resolved_address() & _ipaddr_info()).

@functools.lru_cache(maxsize=1024)

Is it worth to use a cache on such simple function?

@gvanrossum
Copy link
Member

The reason why _check_resolved_address() exists is that otherwise people
might be calling loop.sock_connect() with a hostname, at which point the
connect() call in the _socket extension module will do the host lookup,
potentially hanging the thread until the DNS resolution comes back, and
also acquiring the getaddrinfo lock in certain cases (though it tries to
use inet_pton() if it can).

The most worrisome part of this code (from a quick glance) seems to be that
getaddrinfo('localhost', 0) may return IPv6 addresses with '%' in them,
which always cause a call to getaddrinfo() in connect()...

I do like the fact that the ipaddress module never calls getaddrinfo() -
it's pure text manipulation. But I don't know if it handles IPv6 address
like 'fe80::1%lo0' (which is one of the ones returned for localhost on my
Mac).

@1st1
Copy link
Member

1st1 commented Dec 13, 2015

_check_resolved_address() already has code to check if a string is a valid IPv4 or IPv6 address. I didn't remember why exactly, but I remember that we choose to not use the ipaddress module for this function. Recently, the call to _check_resolved_address() was made optional (only done in debug mode) for performance reason. But socket.inet_pton() is now used if available, instead of getaddrinfo(), for performance reason.

IIRC Guido didn't want asyncio to depend on ipaddress, which was a relatively new (and provisional) module in the stdlib back then. I think it's OK now.

I like the idea of using socket.inet_pton though. Maybe we should rewrite the patch to use it when it's available, and fallback to ipaddress when it's not?

Here's what I propose:

  1. Update _ipaddr_info to use socket.inet_pton. If it's not available (in 3.3 on Windows), we fallback to use the ipaddress module. (I don't like using AI_NUMERICHOST because we have to call socket.getaddrinfo again).
  2. _check_resolved_address should be updated to use _ipaddr_info.
  3. We can probably start calling _check_resolved_address in release mode too.

@functools.lru_cache(maxsize=1024)
Is it worth to use a cache on such simple function?

The function is simple, the instantiation of IPv4Address or IPv6Address address isn't (LRU cache in 3.5 is written in C, it's quite fast to check the cache compared to parsing of addr string).

@gvanrossum
Copy link
Member

Sounds good.

On Sun, Dec 13, 2015 at 12:34 PM, Yury Selivanov notifications@github.com
wrote:

_check_resolved_address() already has code to check if a string is a valid
IPv4 or IPv6 address. I didn't remember why exactly, but I remember that we
choose to not use the ipaddress module for this function. Recently, the
call to _check_resolved_address() was made optional (only done in debug
mode) for performance reason. But socket.inet_pton() is now used if
available, instead of getaddrinfo(), for performance reason.

IIRC Guido didn't want asyncio to depend on ipaddress, which was a
relatively new module in the stdlib back then. I think it's OK now.

I like the idea of using socket.inet_pton though. Maybe we should rewrite
the patch to use it when it's available, and fallback to ipaddress when
it's not?

Here's what I propose:

Update _ipaddr_info to use socket.inet_pton. If it's not available (in
3.3 on Windows), we fallback to use the ipaddress module. (I don't
like using AI_NUMERICHOST because we have to call socket.getaddrinfo
again).
2.

_check_resolved_address should be updated to use _ipaddr_info.
3.

We can probably start calling _check_resolved_address in release mode
too.

@functools.lru_cache(maxsize=1024)
Is it worth to use a cache on such simple function?

The function is simple, the instantiation of IPv4Address or IPv6Address
address isn't (LRU cache in 3.5 is written in C, it's quite fast to check
the cache compared to parsing of addr string).


Reply to this email directly or view it on GitHub
#302 (comment).

--Guido van Rossum (python.org/~guido)

@1st1
Copy link
Member

1st1 commented Dec 14, 2015

I do like the fact that the ipaddress module never calls getaddrinfo() --
it's pure text manipulation. But I don't know if it handles IPv6 address
like 'fe80::1%lo0' (which is one of the ones returned for localhost on my
Mac).

Interesting. %lo0 is a zone index (more about it here), and it looks like ipaddress module doesn't support them. We can probably use the following algorithm (as a fallback for socket.inet_pton):

  1. Try IPv4Address(address)
  2. Try IPv6Address( re.sub(r'%\w+$', '', address) )

@gvanrossum @Haypo what do you think?

@gvanrossum
Copy link
Member

FWIW inet_pton() does support those (even though the man page on my Mac
doesn't mention it). Possibly the return value from getaddrinfo() varies by
system though? In my case I found this in my /etc/hosts, explaining why I
get it:

127.0.0.1    localhost
::1             localhost
fe80::1%lo0    localhost

I swear I didn't put that there, so it's probably due to some default Mac
initialization (or the Dropbox IT folks).

I don't know if it's acceptable to just ignore the % stuff if we don't have
inet_pton() but I feel strongly about not calling getaddrinfo() as a
callback (since its lock is what started this whole thing) so I'm willing
to give it a try and see if anyone screams.

--Guido

On Mon, Dec 14, 2015 at 10:06 AM, Yury Selivanov notifications@github.com
wrote:

I do like the fact that the ipaddress module never calls getaddrinfo() --
it's pure text manipulation. But I don't know if it handles IPv6 address
like 'fe80::1%lo0' (which is one of the ones returned for localhost on my
Mac).

Interesting. '%lo0 is a zone index (more about it here
https://en.wikipedia.org/wiki/IPv6_address#Link-local_addresses_and_zone_indices),
and it looks like ipaddress module doesn't support them. We can probably
use the following algorithm (as a fallback for socket.inet_pton):

  1. Try IPv4Address(address)
  2. Try IPv6Address( re.sub(r'%\w+$', '', address) )

@gvanrossum https://github.com/gvanrossum @Haypo
https://github.com/haypo what do you think?


Reply to this email directly or view it on GitHub
#302 (comment).

--Guido van Rossum (python.org/~guido)

@ajdavis
Copy link
Author

ajdavis commented Dec 14, 2015

This is all clear, thanks - I'll send an updated PR in the next day or two.

@1st1
Copy link
Member

1st1 commented Dec 14, 2015

[..] but I feel strongly about not calling getaddrinfo() as a
fallback (since its lock is what started this whole thing) [..]

@ajdavis Please reflect this in a comment on why we use ipaddress module as a fallback. I was re-reading the issue, and forgot why we shouldn't try to use getaddrinfo/AI_NUMERICHOST until I saw that Guido's message.

@ajdavis
Copy link
Author

ajdavis commented Dec 15, 2015

This has certainly been educational. Thanks for the simplifying suggestions; I've uploaded a new patch.

try:
addr = ipaddress.IPv4Address(host)
except ValueError:
host = host.partition('%')[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't modify the host variable -- it should be returned as-is.

@1st1
Copy link
Member

1st1 commented Dec 15, 2015

This has certainly been educational. Thanks for the simplifying suggestions; I've uploaded a new patch.

For me too :) I've left a few more comments, but we're almost there.

# If we happen to make an invalid address look valid, the code
# will fail later on sock.connect or sock.bind.
if af == socket.AF_INET6:
host = host.partition('%')[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't necessary, socket.inet_pton supports zone indexes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Mac, inet_pton accepts "::1%lo", and on Windows it accepts "::1%0". On Linux I can't get it to accept any address with a zone index, I don't think Linux's inet_pton supports zone indexes. I think we should keep stripping the zone index before passing to inet_pton, in order to reliably check whether the address is resolved or not. But you're right, I should return the original "host" parameter, not the one with zone index stripped.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, I just checked and couldn't get it to parse zone index on Linux too :( Then it's worth adding a comment there.

@ajdavis
Copy link
Author

ajdavis commented Dec 16, 2015

New patch.

if proto not in {0, socket.IPPROTO_TCP, socket.IPPROTO_UDP} or host is None:
return None

if type & socket.SOCK_STREAM:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a comment that on Linux, socket.type can be a bit-mask. That was something new to me, for instance.

@1st1
Copy link
Member

1st1 commented Dec 16, 2015

LGTM. @gvanrossum, please take a final look.

elif family == socket.AF_INET6:
host, port = address[:2]
else:
# Even though getaddrinfo with AI_NUMERICHOST would be non-blocking, it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also move this comment ("Even though ...") to line 113, before we use the ipaddress module? That will explain why we use it instead of AI_NUMERICHOST.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, new patch.

# On Windows, socket.inet_pton() is only available since Python 3.4.
if hasattr(socket, 'inet_pton'):
if family == socket.AF_UNSPEC:
afs = socket.AF_INET, socket.AF_INET6
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally would make this a list (and also on line 96 below). Tuples ought to be used for "record"-like structures, and only rarely for read-only sequences. Here there's no particular need for the sequence to be read-only, and the cost of creating a 2-item list vs. a 2-item tuple isn't really worth worrying about given all the other things we do here.

@gvanrossum
Copy link
Member

Admirable work! I have a few style nits and one platform question.

@gvanrossum
Copy link
Member

Thanks! All looks good now. FWIW I'd prefer not to see your local revision history in the asyncio logs.

@ajdavis ajdavis closed this Dec 17, 2015
@ajdavis
Copy link
Author

ajdavis commented Dec 17, 2015

Squashed and committed in 39c135b

@1st1
Copy link
Member

1st1 commented Dec 17, 2015

akheron pushed a commit to akheron/cpython that referenced this pull request Dec 17, 2015
getaddrinfo takes an exclusive lock on some platforms, causing clients to queue
up waiting for the lock if many names are being resolved concurrently. Users
may want to handle name resolution in their own code, for the sake of caching,
using an alternate resolver, or to measure DNS duration separately from
connection duration. Skip getaddrinfo if the "host" passed into
create_connection is already resolved.

See python/asyncio#302 for details.

Patch by A. Jesse Jiryu Davis.

--HG--
branch : 3.4
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants