New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
abortConnection() method for transports #7153
Comments
|
|
|
|
|
|
|
|
|
|
(In [17458]) An abortConnection() function for transport, and some tests. Refs #7153 |
This is actually a pretty evil security hole. I logged into a server to see why it was out of memory, and found 1434 open connections with no activity. A malicious user was opening hundreds of connections to a twisted server but not reading from them. Since loseConnection() won't close the sockets unless the write buffer is empty, I have no way to fight this attack at the application level. We have had a twisted IRC server and an entire twisted live video cluster (93 nodes) get destroyed by this exact exploit once a certain hacking group figured it out. In the mean time I'm patching in abortConnection2.diff, but I thought this real world example might be useful information for people trying to fix this. |
I'm torn between happiness at live video clustering and sadness about the exploit. Upping priority, changing to defect. |
I'm a little concerned about the form of the solution here. It means that every protocol that is not specifically designed to be resistant to this particular exploit (in other words, all the protocols in Twisted) will continue to be a problem. Perhaps a better way to deal with the exploit would be to have a default timeout on loseConnection if no data is being consumed by the client? Or an easy way to set up administrative limits on the number of simultaneous connections from a single IP? I haven't thought too hard about the solution yet, but I definitely don't think we could consider this form of DoS dealt with if you have to hand-patch some code for every port that you want to open. |
Mechanism separate from policy. |
Obviously we should have abortConnection that is separate from loseConnection. The question is, should loseConnection additionally have some sort of a default timeout case where it calls abortConnection. loseConnection essentially is a policy, since it can be implemented in terms of producer API + abortConnection. |
I was not trying to make a point about the validity of this ticket. I was responding to the comment by kvogt that "This is actually a pretty evil security hole". This ticket isn't going to fix any security hole (or, more accurately, DoS). It may be a necessary prerequisite, but we should probably have a separate issue to deal with the larger problem. |
Why the tests fail on Windows: On Windows, [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. Sometimes you see This incorrect Hopefully someone who knows how PyOpenSSL and OpenSSL work will comment :-) |
Could you open a ticket for the windows connectionlost/connectiondone issue you discovered? With a failing test or minimal example if at all possible. Thanks! |
Shouldn't we just fix this patch? The ticket is still open, after all... |
Oh, I hadn't realized the bug was in the patch. Good plan! |
Lo and behold!
Requirmentes are win32eventreactor and SSL connections! |
I've been looking at this again. I noticed that The 20091225 patch makes Like the previous patches, this patch patch breaks the IOCP reactor, and still has the OpenSSL problem mentioned above (so it breaks many tests that use SSL on Windows). It also needs a test for a server aborting a client, not just a client aborting. There's also likely an untested codepath in |
Pavel, could you take a look at the IOCP code and see if it makes sense to you? Thanks! |
Ready for review, I guess. Some notes:
|
Anyone can review! |
|
Re #11338... not really sure what you're suggesting. None of the new tests open many connections... Re #14, quite possibly there is no test covering it. But I've written enough convoluted tests. If you really feel this is a problem, I can revert that fix :) All other comments have been addressed. I just have to figure out why merging forward breaks a test. |
(In [32604]) Branching to 'abortConnection-78-9' |
Noooooooooooooooooooooooooooooooooo f u cfreactor: [ERROR] twisted.internet.test.test_tls.AbortSSLConnectionTest_CFReactor.test_fullWriteBufferAfterByteExchange Doesn't happen on the other OS X builder, strangely. Can I just ignore that? Please? Or mark it failing? I don't want to have to deal with another crappy broken reactor :( There's still also random mystery errors elsewhere (item 1) -- suggestions? http://buildbot.twistedmatrix.com/boxes-supported?branch=/branches/abortConnection-78-9 |
|
Replying to exarkun:
For those who might see this and not be hip enough to have our entire ticket database memorized, exarkun's talking about: http://twistedmatrix.com/trac/ticket/4482 |
Or... no, that's a different bug? Hrm. I can't find a bug report about FD leaks in gtkreactor. I thought I knew about this though. Anyone have a reference handy? |
I assert that GTK does not leak file descriptors when simply running the reactor, and I have attached a file which I believe proves it. Perhaps this isn't true of our rhel6 buildbot, so someone can run this file there and let me know if there's a problem. However, I've run this program on every Ubuntu release back to Hardy and none of them exhibit the described behavior. |
Not exactly what I thought I was talking about, but it seems likely to be related anyway: https://bugzilla.gnome.org/show_bug.cgi?id=579406. I think you didn't see the problem on Ubuntu because eventually Ubuntu backported the fix, but maybe RHEL did not. |
OK, I think I know what to do now. Taking off from review until I can get buildbot in better shape. |
http://buildbot.twistedmatrix.com/boxes-supported?branch=/branches/abortConnection-78-9 We're down to failures that are even more mysterious and not-my-faulty. Only unaddressed review comment is 14; see comments above. I can, if you want, remove the fix and open new ticket (Jerub claims to have seen bugs caused by it, so possibly writing a test isn't impossible). |
Thanks. |
I got rid of the minimize thing, it's not strictly necessary and the tests are still reasonable without it, and also dealt with 6 and 7. Re 8, I will file a ticket after this branch is merged. As far as 9, I think it was the bug fix, not the loop, so I lowered the number down to 100 (30MB write) As far as 10 goes, the above means it's faster and writes less... but I think there may actually be a bottleneck in TLS. I don't think crypto is so slow that it should add add 2 CPU seconds, and it's much slower if I transfer same number of bytes with a smaller number of (larger) writes. Anyway, it's down to 3 seconds on my computer now. I don't want to lower it much more, because the idea is to fill up buffers. Since I've addressed all review comments, I'll put this up for review once I can merge forward to include the fix for #5301. |
(In [32824]) Branching to 'abortConnection-78-10' |
Can you also add |
Good point, will do. And maybe rename the abort mixin to start with an _ if it doesn't already. |
Added @SInCE, ready for review: http://buildbot.twistedmatrix.com/boxes-supported?branch=/branches/abortConnection-78-10 |
Just for refrence regarding SO_LINGER: Portability note 1: Some implementations of the BSD socket API do not implement SO_LINGER at all. On such systems, applying SO_LINGER either fails with EINVAL or is (silently) ignored. Having SO_LINGER defined in the headers is no guarantee that SO_LINGER is actually implemented. Portability note 2: Since the BSD documentation on SO_LINGER is sparse and inadequate, it is not surprising to find the various implementations interpreting the effect of SO_LINGER differently. For instance, the effect of SO_LINGER on non-blocking sockets is not mentioned at all in BSD documentation, and is consequently treated differently on different platforms. Taking case 3 for example: Some implementations behave as described above. With others, a non-blocking socket close() succeed immediately leaving the rest to a background process. Others ignore non-blocking'ness and behave as if the socket were blocking. Yet others behave as if SO_LINGER wasn't in effect [as if the case 1, the default, was in effect], or ignore linger->l_linger [case 3 is treated as case 2]. Given the lack of /Event. Replying to naked:
|
Thanks for the info. So it sounds like there's a problem case where certain BSDs (which?) will block on close() if SO_LINGER is set to 0, even for non-blocking sockets? Or am I misunderstanding? |
Attachments:
Searchable metadata
The text was updated successfully, but these errors were encountered: