transient socket failure to connect to 'localhost' #56021
test test_telnetlib failed -- Traceback (most recent call last):
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_telnetlib.py", line 45, in testBasic
telnet = telnetlib.Telnet(HOST, self.port)
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\telnetlib.py", line 209, in __init__
self.open(host, port, timeout)
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\telnetlib.py", line 225, in open
self.sock = socket.create_connection((host, port), timeout)
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\socket.py", line 407, in create_connection
raise err
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\socket.py", line 398, in create_connection
sock.connect(sa)
socket.error: [Errno 10061] No connection could be made because the target machine actively refused it |
What do you propose for a fix?
Something like 2 would seem like a good idea for all tests dependent on a resource out of developers' control. |
Well, if you find a more reliable host than "localhost", why not ;-) |
With a bit of searching, HOST == support.HOST == 'localhost'. Looking at the traceback, it is socket that fails, not telnetlib or its test. Hence the clearer title. I am still curious what you propose: catch and skip or something else? For Windows, I consider a one-time event like this a routine random glitch to be ignored at least until it repeats ;-). |
I only saw the failure on test_telnetlib, not in other tests using sockets. I think this issue is specific to test_telnetlib (not even telnetlib). It may be a race condition: the code that waits until the server is active may not be correct. GeneralTests.setUp() waits until the server has called serv.listen(5). I don't know whether the server must be waiting in serv.accept() on Windows (using Cygwin?). The last instruction of GeneralTests.setUp() is a time.sleep(.1): an ugly synchronization hack to work around a race condition?? |
Does the failure occur on other buildbots? If not, it may be something specific to this Windows 7 machine: a local firewall or something like that? Can we use 127.0.0.1 instead of "localhost"? I don't know if it would change anything. Note: the TCP server of test_telnetlib doesn't use SO_REUSEADDR even though it starts and stops very quickly. We could do something like: serv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) |
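A minimal sketch of the SO_REUSEADDR suggestion above. The helper name `make_test_server` is hypothetical and is not the code used by test_telnetlib. Note that SO_REUSEADDR has different semantics on Windows than on POSIX systems, so treat this as an illustration of where the call goes rather than a recommended fix:

```python
import socket


def make_test_server(host="127.0.0.1", backlog=5):
    """Return a listening TCP socket on an OS-chosen port (hypothetical helper)."""
    serv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option has to be set before bind() to influence the bind.
    serv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    serv.bind((host, 0))  # port 0 lets the OS pick a free port
    serv.listen(backlog)
    return serv


if __name__ == "__main__":
    serv = make_test_server()
    print("listening on", serv.getsockname())
    serv.close()
```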
Oh, the last failure of the buildbot "x86 Windows7 3.x" is on test_ftplib, not test_telnetlib!
======================================================================
Traceback (most recent call last):
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_ftplib.py", line 948, in testTimeoutConnect
ftp.connect(HOST, timeout=30)
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\ftplib.py", line 148, in connect
source_address=self.source_address)
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\socket.py", line 407, in create_connection
raise err
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\socket.py", line 398, in create_connection
sock.connect(sa)
socket.timeout: timed out
======================================================================
Traceback (most recent call last):
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_ftplib.py", line 920, in testTimeoutDefault
ftp = ftplib.FTP("localhost")
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\ftplib.py", line 114, in __init__
self.connect(host)
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\ftplib.py", line 148, in connect
source_address=self.source_address)
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\socket.py", line 407, in create_connection
raise err
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\socket.py", line 398, in create_connection
sock.connect(sa)
socket.timeout: timed out
======================================================================
Traceback (most recent call last):
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_ftplib.py", line 955, in testTimeoutDifferentOrder
ftp.connect(HOST)
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\ftplib.py", line 148, in connect
source_address=self.source_address)
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\socket.py", line 407, in create_connection
raise err
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\socket.py", line 398, in create_connection
sock.connect(sa)
socket.timeout: timed out
======================================================================
Traceback (most recent call last):
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_ftplib.py", line 963, in testTimeoutDirectAccess
ftp.connect(HOST)
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\ftplib.py", line 148, in connect
source_address=self.source_address)
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\socket.py", line 407, in create_connection
raise err
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\socket.py", line 398, in create_connection
sock.connect(sa)
socket.timeout: timed out
======================================================================
Traceback (most recent call last):
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_ftplib.py", line 932, in testTimeoutNone
ftp = ftplib.FTP("localhost", timeout=None)
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\ftplib.py", line 114, in __init__
self.connect(host)
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\ftplib.py", line 148, in connect
source_address=self.source_address)
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\socket.py", line 407, in create_connection
raise err
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\socket.py", line 398, in create_connection
sock.connect(sa)
socket.error: [Errno 10061] No connection could be made because the target machine actively refused it |
Some tests of test_ftplib and test_telnetlib use HOST or 'localhost' directly instead of getting the host from the server socket. As for the test_ftplib failures, only the tests that explicitly use 'localhost' fail. The attached patch reads the name of the server socket instead of using HOST or 'localhost'. You can try it on Linux by changing the support.bind_port() function: replace the default host value, HOST, with '127.0.0.2'. If you do that, test_ftplib and test_telnetlib fail without my patch and pass with it. By the way, why do we use 'localhost' instead of '127.0.0.1' for support.HOST? '127.0.0.1' doesn't depend on the DNS configuration of the host (especially its "hosts" file; even Windows has such a file). |
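The idea of reading the address from the server socket can be sketched roughly as follows. This is not the attached patch; `start_server` and `connect_to` are hypothetical names used only for illustration:

```python
import socket


def start_server(host="127.0.0.1"):
    """Bind a listening socket on an OS-chosen port (hypothetical helper)."""
    serv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    serv.bind((host, 0))
    serv.listen(5)
    return serv


def connect_to(serv, timeout=10):
    """Connect to whatever address the server actually bound to."""
    # getsockname() reports the resolved address, so the client does not
    # need to resolve 'localhost' (or any other name) a second time.
    host, port = serv.getsockname()[:2]
    return socket.create_connection((host, port), timeout)
```

With this shape, changing the host the server binds to does not require touching the client side, since the client never hard-codes a name.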
Perhaps Michael or Ezio have an idea of whether 'reason' or 'happenstance' is the answer to your questions. |
I am seeing this failure from time to time in OpenIndiana buildbots. For instance http://www.python.org/dev/buildbot/all/builders/AMD64%20OpenIndiana%203.x/builds/1751/steps/test/logs/stdio Seems a clear race condition. |
This might be a good idea. Depending on the DNS setup, it could lead to a latency which might explain such failures.
The code looks correct: a threading.Event is set by the server once it called listen(), point at which incoming connections should be queued (SYN/ACK is sent before accept()). |
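A small sketch of that claim, assuming a typical TCP stack: once listen() has been called, a client can complete its connect() before the server ever calls accept(). The function name is hypothetical:

```python
import socket


def connect_before_accept(host="127.0.0.1"):
    """Return True if a client connected before accept() is handed to accept()."""
    serv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    serv.bind((host, 0))
    serv.listen(5)
    serv.settimeout(10)  # avoid hanging forever if something goes wrong
    try:
        # accept() has not been called yet; the kernel completes the
        # handshake and keeps the connection in the listen queue.
        client = socket.create_connection(serv.getsockname(), timeout=10)
        conn, peer = serv.accept()
        same_peer = (peer == client.getsockname())
        conn.close()
        client.close()
        return same_peer
    finally:
        serv.close()
```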
Any progress on this? I still see frequent failures on the OpenIndiana buildbots because of this. Is anybody actively working on this? Should I get involved?
I don't think so.
Sure, if you have access to a machine on which you can reliably reproduce the problem, it'll be much easier. I would bet on a deficient name resolution service: using 127.0.0.1 instead of 'localhost' for support.HOST could help. It could also be due to a firewall setting (e.g., dropping incoming connection requests when the connection rate is too high). |
I explain a reliable method to reproduce this issue on Linux (it may work on other OSes) in msg138882. |
It's a way to reproduce the symptom (i.e. connection refused because you're trying to connect to 127.0.0.2 while the server is listening on 127.0.0.1), but not the cause: if the server binds to 'localhost' and the client connects to 'localhost', it should work. |
Checking the testsuite source code, I see several issues: The server thread only waits 3 seconds for the connection. If a connection is not made within 3 seconds, the server thread exits, and when the connection is then attempted, it fails. This probably explains why the problem is sporadic and seems to depend on name resolution. If the DNS resolver is "slow", we have a problem. Also, the event is signaled twice in the server, and the client does a wait and a clear. If the thread scheduler is lucky, the server signals twice and THEN the client waits (returning immediately) and clears, completely missing the second signal and hanging the client in the next wait (in the teardown). So, I would propose:
Opinions? I am assigning the issue to myself. Please provide feedback and I will create and apply the patch. I have seen this issue in 2.7 too, on my buildbots (OpenIndiana). You can reproduce the issue easily by changing "self.sock.settimeout(3)" to "self.sock.settimeout(0.01)", for instance. PS: I see test.support.HOST used in the testsuite, but that attribute is not documented anywhere. :-?? """
Python 3.2.2 (default, Sep 5 2011, 01:49:10)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from test import support
>>> support.HOST
'localhost'
""" Should it be 127.0.0.1, or not be used at all, since it is not documented? PPS: I only checked telnetlib, not ftplib. |
About "support.HOST": changing it from "localhost" to "127.0.0.1" could be problematic on servers without IPv4 support (IPv6-only servers). I guess this is a theoretical problem so far, and that when we hit it the exception would be pretty obvious... Opinions? What about documenting "support.HOST"? |
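The reproduction suggested earlier in the thread (shortening the server's accept timeout) can be sketched outside the test suite like this. It is only an approximation of the test server, with hypothetical names; `client_delay` stands in for whatever slows the client down (name resolution, a loaded machine):

```python
import socket
import threading
import time


def server(evt, serv, accept_timeout):
    """Accept a single connection, giving up after accept_timeout seconds."""
    serv.listen(5)
    evt.set()  # signal once: clients may connect from now on
    serv.settimeout(accept_timeout)
    try:
        conn, _ = serv.accept()
    except socket.timeout:
        pass  # nobody connected in time; the listening socket is closed below
    else:
        conn.close()
    finally:
        serv.close()


def client_can_connect(accept_timeout, client_delay):
    """Return True if a client arriving client_delay seconds late still connects."""
    serv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    serv.bind(("127.0.0.1", 0))
    addr = serv.getsockname()
    evt = threading.Event()
    thread = threading.Thread(target=server, args=(evt, serv, accept_timeout))
    thread.start()
    evt.wait()
    time.sleep(client_delay)  # simulated delay before the client connects
    try:
        socket.create_connection(addr, timeout=5).close()
        return True
    except OSError:
        # Typically "connection refused": the server already gave up.
        return False
    finally:
        thread.join()
```

For example, `client_can_connect(0.01, 0.5)` lets the server time out before the client arrives, while `client_can_connect(5, 0)` leaves plenty of margin.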
Indeed, but 3 seconds to resolve localhost is not "slow", it's really
As noted, this might break on IPv6-only hosts. Not sure this will be a
Sounds like a recipe for masking bugs. Having a dangling thread is
OK.
Yeah, this should be removed.
Please check your name resolution setting :-) |
Charles-François: The only way for the server thread to still be around would be if the test fails badly, without calling teardown (I would do a fake TCP connection to the server in the teardown, followed by a thread.join). In this case, the thread (being "daemon") will die when the tests are done anyway, and the test will be shown as failed in the buildbots, so somebody should take care of it. I would not mask anything. About using getsockname(), the bind would bind to all IPs of the machine, including real network interfaces. I would rather keep the bind private and not accessible from the external network :-). I really think that IPv6-only hosts are a non-issue just now. And the failure would be quite self-evident. In Solaris I get this exception: "socket.error: [Errno 126] Cannot assign requested address". Quite clear. Anyway, we can keep using "localhost", but just delete the socket timeout in the server. About the name resolution speed, it is actually happening. It could also be a busy machine. I know that some buildbots are running inside a virtual machine, sometimes with little real RAM and possibly paging a lot. The fact is that the 3 second delay is actually happening, and I don't think that the solution should be to increase the timeout. That only masks the issue. |
Please don't. Any problem might then hang the whole test suite.
How so?
>>> import socket
>>> sock = socket.socket()
>>> sock.bind(("localhost", 0))
>>> sock.getsockname()
('127.0.0.1', 60919) |
Uhmmmm.... doing a fake connection in the teardown would be problematic if the socket is reused for something else in the meantime. The kernel is supposed to keep the socket in the "not reuse" state for a while, but... I am seeing support.HOST and "localhost" mixed too liberally. That should be unified. I need consensus about making test.support.HOST = "127.0.0.1". What do you think? test_ftplib needs love too, but changes are far more intrusive there. I will wait until you approve the changes to telnetlib.py. Please review the attached changeset. |
Antoine: Deleting the socket timeout doesn't hang the test if we set the thread to "daemon" and do not do a thread.join() (unneeded in the normal situation, since garbage collecting the test instance will collect the thread too). If you don't like this, I can do a fake connection in teardown (look at the proposed changeset). The problem with that is OS port reuse. Quite safe, but only "quite". If thread.join had a timeout, we could wait for a while and, if the thread is still active, do a fake connection and another join. A bit overkill for a test, I guess :-). I stand corrected about getsockname(). I am neutral on it, although we are still involving DNS. I would prefer a direct "127.0.0.1". |
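For what it's worth, threading.Thread.join() does accept an optional timeout argument; it returns None either way, so the caller checks is_alive() afterwards. A minimal sketch of the "wait for a while, then check" idea, with a hypothetical helper name:

```python
import threading


def join_with_timeout(thread, timeout):
    """Wait up to timeout seconds; return True if the thread has finished."""
    thread.join(timeout)  # returns None whether or not the thread ended
    return not thread.is_alive()


if __name__ == "__main__":
    stop = threading.Event()
    worker = threading.Thread(target=stop.wait, daemon=True)
    worker.start()
    print(join_with_timeout(worker, 0.05))  # still blocked on the event
    stop.set()
    print(join_with_timeout(worker, 5))
```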
Doesn't look acceptable to me. |
What's wrong with a socket timeout exactly? Everything you're proposing |
Antoine, the problem with this test is the timeout. We can set an arbitrary timeout, but how big is big enough? My change doesn't need a timeout at all. Problem solved. The only "cosmetic" problem is the risk of "leaking" a thread. But it would not affect the testsuite if it is a daemon thread, and we would only "leak" if the test fails, not under normal circumstances. The complexities suggested are heroic efforts to manage that thread when simply ignoring it would be acceptable. I could set a timeout of 5 minutes just to satisfy you, but by that time the test should already be done, and the thread collected anyway. I see that as more of a hack than actually setting the thread to daemon and forgetting it, knowing that it will automatically die when done. |
I would say answering this question is your task, since you have access
This is not cosmetic, the thread might be keeping all kinds of resources
The issue is not to satisfy me, it's to satisfy the buildbots. If you say If you say that replacing "localhost" with "127.0.0.1" would fix the |
Antoine: Then you would be satisfied if I increase the timeout from 3 seconds to 60 seconds and clean up the event signaling? The current event signaling code has a few race conditions with potential deadlocks. |
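The signaling race described earlier in the thread (the server sets the event twice, the client waits and then clears) can be shown deterministically with a bare threading.Event. This sketch only models the unlucky ordering; it is not the test's actual code, and the function name is hypothetical:

```python
import threading


def second_signal_is_seen(wait_timeout=0.1):
    """Model the ordering where both set() calls happen before the first wait()."""
    evt = threading.Event()
    evt.set()    # server: "I am listening"
    evt.set()    # server: "I am done" (the flag is already set, so a no-op)
    evt.wait()   # client: returns immediately
    evt.clear()  # client: this also discards the second signal
    # Client waits for the "done" signal, which will never arrive again.
    return evt.wait(wait_timeout)
```

Here the final wait times out and returns False; without a timeout it would block forever, which is the hang in teardown mentioned above. Using one event per signal, or signaling only once, avoids this ordering problem.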
Yes! |
Consider too that if something goes badly enough in the test to skip the teardown method, the thread will be alive for a while, possibly contaminating other tests, as you commented. This is actually unsolvable, I think. Code that NEEDS to be executed with no other threads around could/should check the thread count and fail with a clear error message. That way we can at least point to the real culprit. |
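A rough sketch of such a thread-count check, using only threading.active_count() and threading.enumerate(). The helper name and the idea of passing a baseline are assumptions made for illustration:

```python
import threading


def check_no_stray_threads(baseline):
    """Raise AssertionError if more threads are alive than the given baseline."""
    count = threading.active_count()
    if count > baseline:
        names = sorted(t.name for t in threading.enumerate())
        raise AssertionError(
            "expected at most %d live threads, found %d: %s"
            % (baseline, count, ", ".join(names))
        )
```

A test that must run without other threads could record `threading.active_count()` at a known-clean point and call the check before starting, so the error message names the leftover threads.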
Such as? tearDown is normally like a "finally" block, it always gets |
Please review 71ab454bfe19.diff. I am not satisfied with the timeout approach, since the timeout value is arbitrary. I would rather do the fake connection in teardown, to be sure the server died. Anyway, this seems to be the minimal patch to solve the problem at hand "most of the time" (if we know the test is failing sporadically with a timeout of 3 seconds, I hope it fails once per year with a timeout of 60 seconds). |
Stupid mistake. Please, review b93657b239a5.diff (erroneous "sock.close()" deleted) |
Looks good to me, thanks. |
New changeset 76b6b85e4b78 by Jesus Cea in branch '2.7':
New changeset 554802e562fa by Jesus Cea in branch '2.7':
New changeset f94533c9229d by Jesus Cea in branch '3.2':
New changeset 3b9f58f85d3e by Jesus Cea in branch '3.2':
New changeset 85c10a905424 by Jesus Cea in branch 'default':
New changeset ca8a0dfb2176 by Jesus Cea in branch 'default': |
Seems to be fixed now. |