Skip to content
This repository has been archived by the owner on Nov 23, 2017. It is now read-only.

Tulip server stops serving when exceeding fd limit #78

Closed
GoogleCodeExporter opened this issue Apr 10, 2015 · 19 comments
Closed

Tulip server stops serving when exceeding fd limit #78

GoogleCodeExporter opened this issue Apr 10, 2015 · 19 comments

Comments

@GoogleCodeExporter
Copy link

Linux machines generally have a default per-process limit of 1024 file 
descriptors. When a Tulip server receives an inbound connection, but the fd 
limit is exceeded (it can't return a new socket), the listener is closed 
entirely. Subsequent connection attempts receive "Connection refused"

The server barfs out the following:

Accept failed
Traceback (most recent call last):
  File "/home/marti/src/tulip/asyncio/selector_events.py", line 99, in _accept_connection
    conn, addr = sock.accept()
  File "/usr/lib64/python3.3/socket.py", line 135, in accept
    fd, addr = self._accept()
OSError: [Errno 24] Too many open files

This error probably shouldn't be fatal, as the server would recover once some 
other clients disconnect. Should there be a new API in tulip to let the 
application know of such transient errors?

This can be reproduced easily with the echo.py server from 
https://bitbucket.org/intgr/tulip-echo (I'm sure there's a bazillion things 
wrong with this code, but it works for my purposes :)

In one shell:
  ulimit -n 7
  ./echo.py

In another shell:
  ./client.py 10


Original issue reported on code.google.com by ma...@juffo.org on 23 Oct 2013 at 9:46

@GoogleCodeExporter
Copy link
Author

Issue 33 has been merged into this issue.

Original comment by gvanrossum@gmail.com on 23 Oct 2013 at 3:07

@GoogleCodeExporter
Copy link
Author

Good one. Agreed we need to handle this better. Is it just the accept() or are 
there other places where we should handle this? (I imagine the socket creations 
in create_connection(), create_server() and create_datagram_endpoint() may also 
raise this, as would the pipe() and socketpair() calls in the subprocess 
management stuff. But in a sense accept() is more important because dying there 
breaks the whole service for good until it is restarted, rather than just 
breaking one connection.

Original comment by gvanrossum@gmail.com on 23 Oct 2013 at 3:10

  • Changed state: Accepted

@GoogleCodeExporter
Copy link
Author

This issue was closed by revision 1395b867e5e2.

Original comment by gvanrossum@gmail.com on 23 Oct 2013 at 3:51

  • Changed state: Fixed

@GoogleCodeExporter
Copy link
Author

I didn't realize this before, but Linux actually keeps the new connection in 
the queue even after returning ENFILE. So the listener socket is immediately 
reported as active in the next epoll loop and it will try to accept() again.

So after 1395b867e5e2, this condition causes tulip to burn 100% CPU and spam 
"Out of FDs accepting connections" as fast as possible.

I guess this is a little better than before, but I wouldn't consider it "fixed" 
:)

We're not the only ones having this problem: 
https://trac.torproject.org/projects/tor/ticket/925

Original comment by ma...@juffo.org on 23 Oct 2013 at 4:07

@GoogleCodeExporter
Copy link
Author

So I guess we need something like pause_reading() on listening sockets?
How do we know when it's safe to accept() again, just use an arbitrary timer? 
What a PITA

Original comment by ma...@juffo.org on 23 Oct 2013 at 4:10

@GoogleCodeExporter
Copy link
Author

Actually I'm not sure that this really fixed it. Are there other errors from 
accept() that we may have to log and ignore? ("System temporarily out of 
resources"???) At the very least we should also fix the accept() loop in 
proactor_events().

Looking at man accept (on OSX) it seems that the following might also need to 
be interpret as temporary:

     [ECONNABORTED]     The connection to socket has been aborted.
     [EMFILE]           The per-process descriptor table is full.
     [ENFILE]           The system file table is full.
     [ENOMEM]           Insufficient memory was available to complete the
                        operation.

Original comment by gvanrossum@gmail.com on 23 Oct 2013 at 4:14

  • Changed state: Started

@GoogleCodeExporter
Copy link
Author

Here is a summary of various caveats with accept():
http://bugs.python.org/issue6706#msg91579

In any case, I think errors should be logged and the loop should proceed.

Original comment by antoine.pitrou on 23 Oct 2013 at 4:20

@GoogleCodeExporter
Copy link
Author

So maybe I should just do this:

        try:
            conn, addr = sock.accept()
            conn.setblocking(False)
        except (BlockingIOError, InterruptedError):
            pass  # False alarm.                                                
        except Exception:
            # There's nowhere to send the error, so just log it.                
            # TODO: Someone will want an error handler for this.                
            logger.exception('Accept failed')
        else:

And similar in proactor_event()?  (Maybe move the setblocking() call out of the 
try block, I don't see how it could fail.)

Original comment by gvanrossum@gmail.com on 23 Oct 2013 at 4:32

@GoogleCodeExporter
Copy link
Author

According to http://bugs.python.org/issue6706#msg91579 you might also want to 
catch at least ConnectionAbortedError (for ECONNABORTED) and TypeError in case 
sock.accept() returns None.

In pyftpdlib I managed to come up with a test case which should solicit at 
least some of these peculiar error conditions:
https://code.google.com/p/pyftpdlib/source/browse/trunk/test/test_ftpd.py?spec=s
vn1233&r=1232#3225

According to Twisted other error codes may arise:
http://twistedmatrix.com/trac/browser/trunk/twisted/internet/tcp.py?rev=36856#L1
030

There are also cases in which the returned socket will fail at the next send(), 
recv() or get[sock/peer]name() call so asyncio should take that into account 
and execute that call via call_soon() or similar.

It's always a shame to complicate the API but given the severity of the event I 
agree someone will eventually come up and demand a callback handler for EMFILE.

Original comment by g.rodola on 23 Oct 2013 at 8:20

@GoogleCodeExporter
Copy link
Author

So, hm... Given that even Twisted doesn't have a custom error handler for this 
case yet, it really seems that the thing that makes sense is to log *all* 
errors coming out of accept(), with the exception of EWOUDLBLOCK and EAGAIN 
which we already silence.  The only question is, do we want to log the 
traceback or not? (The traceback wouldn't tell us much, so I am actually 
tempted to drop it, in favor of the type and str() of the exception.)

I don't see a path in the current accept() code that would ever return None 
instead of a (conn, addr) pair, but I do see a path where addr might be None -- 
however I don't see how that would break the accept code currently in 
selector_events.py, it would just set the 'peername' extra field to None, which 
the app can deal with.

Original comment by gvanrossum@gmail.com on 23 Oct 2013 at 8:54

@GoogleCodeExporter
Copy link
Author

This issue was closed by revision 1d43c66090ab.

Original comment by gvanrossum@gmail.com on 24 Oct 2013 at 11:44

  • Changed state: Fixed

@GoogleCodeExporter
Copy link
Author

Did you read comment #4 and comment #5? The EMFILE error doesn't just go away 
after you handle the error from accept(), it will continue to trigger the epoll 
loop until you actually have file descriptors to accept the connection.

As it stands the server will consume 100% CPU and spam errors on Linux.

Original comment by ma...@juffo.org on 25 Oct 2013 at 8:03

@GoogleCodeExporter
Copy link
Author

I'm sorry, I did not read those comments carefully enough. I'll verify that 
behavior, and I'll see if a timer is reasonable.

Original comment by gvanrossum@gmail.com on 25 Oct 2013 at 4:15

@GoogleCodeExporter
Copy link
Author

Hm, you seem to be right. I don't see how Twisted deals with this though -- 
they break out of their loop but they don't seem to remove the I/O handler.

What do you think of this fix?  http://codereview.appspot.com/17180043

Maybe I should take the list of errno's from Twisted and treat those the same?

Original comment by gvanrossum@gmail.com on 25 Oct 2013 at 5:22

  • Changed state: Accepted

@GoogleCodeExporter
Copy link
Author

Also, maybe I should distinguish between OSError (== socket.error) and 
Exception?  E.g. if Exception is raised, *do* close the server socket?

Original comment by gvanrossum@gmail.com on 25 Oct 2013 at 5:26

@GoogleCodeExporter
Copy link
Author

> What do you think of this fix?  http://codereview.appspot.com/17180043

Yeah, seems reasonable and confirmed that it fixes the issue.

I'd prefer to define the 1-second delay as a constant instead of specifying it 
inline.

> distinguish between OSError (== socket.error) and Exception? E.g. if 
Exception is raised, *do* close the server socket?

I don't think there is much that can happen besides OSError? Perhaps 
MemoryError in some circumstances? Though I've never seen that one in practice 
anywhere.

But regardless of what happens, I think closing the socket is the worst 
possible failure mode for servers. Even segfaulting is preferable because then 
it's clear that the server has failed and can be restarted (automatically or 
otherwise). But if it continues serving old connections and just doesn't accept 
new ones, it might not be obvious even to a monitoring system that it's broken.

And if an attacker can figure out how to provoke that, then it's a DoS 
vulnerability.

Original comment by ma...@juffo.org on 26 Oct 2013 at 11:32

@GoogleCodeExporter
Copy link
Author

New version of the patch on http://codereview.appspot.com/17180043 -- catch 
some more errors, and make the delay a constant (not sure if I like that, 
constants.py is used inconsistently).  I did away with the exception logging if 
we don't understand the error -- the event loop will already take care of that. 
 I've also added a test.

Original comment by gvanrossum@gmail.com on 29 Oct 2013 at 11:37

@GoogleCodeExporter
Copy link
Author

I'm just going to commit that.

Original comment by gvanrossum@gmail.com on 30 Oct 2013 at 6:02

@GoogleCodeExporter
Copy link
Author

This issue was closed by revision 4d54f7ebda0f.

Original comment by gvanrossum@gmail.com on 30 Oct 2013 at 6:10

  • Changed state: Fixed

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

1 participant