slowloris protection can be triggered by legitimate use of chromium #146

Closed
joeyh opened this issue Mar 9, 2013 · 30 comments

@joeyh
Contributor

joeyh commented Mar 9, 2013

I have a yesod web app that, through a particular pattern of activity, I can fairly reliably cause to appear to freeze. I.e., I click on a link in the web browser and it spins forever. Once this has happened, any link I click on exhibits the same behavior, and even pages that are a simple static hamlet template with no other IO freeze. However, opening any link in a new tab works fine. It's only the http connection used by one particular chromium tab that has somehow gotten stuck.

(There is a mildly embarrassing video of this happening to me in the middle of a demo, here: http://mirror.linux.org.au/linux.conf.au/2013/mp4/gitannex.mp4 around 20 minutes in.)

I got a tcpdump of this happening and verified that chromium was sending a complete http request, and never getting an answer back, no matter how long I wait. Here is a tcpdump that shows a full session from app startup time to hang: http://downloads.kitenet.net/misc/git-annex-webapp-hang.tcpdump.bz2

I instrumented my use of warp, making settingsOnException, settingsOnOpen, and settingsOnClose do debug logging. From this I could see that ThreadKilled exceptions were being thrown. Only one place in warp calls killThread: registerKillThread does so when the timeout is exceeded. I added a print there and verified that it was indeed responsible for killing threads. Here is a server log for this issue showing the http requests and debug output: http://downloads.kitenet.net/misc/daemon.log
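
A minimal sketch of that kind of instrumentation (not the actual patch; it assumes the Settings record fields named above are exported by this Warp version as plain hooks, which may not hold for other versions):

    import Network.Wai.Handler.Warp (Settings (..), defaultSettings)
    import System.IO (hPutStrLn, stderr)

    -- Sketch only: hook debug logging into the settings fields named above.
    -- The field names are taken from the comment; their exact types are an
    -- assumption and may differ between Warp versions.
    debugSettings :: Settings
    debugSettings = defaultSettings
        { settingsOnOpen      = hPutStrLn stderr "connection opened"
        , settingsOnClose     = hPutStrLn stderr "connection closed"
        , settingsOnException = \e -> hPutStrLn stderr ("exception: " ++ show e)
        }

    -- used as: runSettings debugSettings app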

I commented out the killThread call, and my app seems fixed; no more hangs.

The timeout is apparently only supposed to be triggered by the slowloris attack. Ironically, since my webapp only listens for connections from localhost, it doesn't need DoS protection at all. But I cannot find a way to disable the slowloris protection in warp's API.

Adding a knob to disable killing threads would be sufficient for my needs, although I suspect you might want to take further action.

warp version: 1.3.7.4

@joeyh
Contributor Author

joeyh commented Mar 9, 2013

I've added a lot of debug logging to the timeout manager code to try to gather more data around when this happens. Using this patch http://downloads.kitenet.net/misc/warp-instrumented.patch I'm seeing logs like this:

[2013-03-09 13:55:50 JEST] open connection ThreadId 99
[2013-03-09 13:55:50 JEST] ("paused","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:55:50 JEST] 127.0.0.1 GET /static/js/bootstrap-collapse.js Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.68 Safari/537.17 ThreadId 99
[2013-03-09 13:55:50 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:55:50 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:56:19 JEST] ("state was",Active,"now",Inactive,"ThreadId 99")
[2013-03-09 13:56:34 JEST] ("paused","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:56:34 JEST] 127.0.0.1 GET /static/css/bootstrap-responsive.css Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.68 Safari/537.17 ThreadId 99
[2013-03-09 13:56:34 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:56:34 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:56:34 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:56:34 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:56:49 JEST] ("state was",Active,"now",Inactive,"ThreadId 99")
[2013-03-09 13:57:20 JEST] ("state was",Inactive,"now",Inactive,"ThreadId 99")
[2013-03-09 13:57:20 JEST] ("calling onTimeout","ThreadId 99")
[2013-03-09 13:57:20 JEST] ("registerKillThread fired, killing","ThreadId 99")

Note that the timeout checker ran twice for thread 99: first setting it to Inactive, and then, 30 seconds later, seeing it was still Inactive and calling its onTimeout.

Since I have the killThread commented out, we can see this thread go on to successfully process another request!

[2013-03-09 13:57:36 JEST] ("paused","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:57:36 JEST] 127.0.0.1 POST /filebrowser Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.68 Safari/537.17 ThreadId 99
[2013-03-09 13:57:36 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:57:36 JEST] ("tickle","ThreadId 99","by",ThreadId 99)

So, chromium is opening a connection, using it to request two files within seconds, and then, a full minute later, reusing the connection to load another page. This is clear evidence that the 30-second slowloris timeout is triggered by a regular browser doing nothing special.
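
(For illustration, a simplified sketch of the two-sweep scheme visible in the log above; this is not Warp's actual timeout code, just the behavior described: each sweep demotes Active handles to Inactive, and a handle found still Inactive on the next sweep has its onTimeout fired.)

    import Control.Concurrent (threadDelay)
    import Control.Monad (forever, forM_)
    import Data.IORef

    data State = Active | Inactive

    data Handle = Handle
        { hState     :: IORef State
        , hOnTimeout :: IO ()    -- e.g. kill the worker thread
        }

    -- Each request "tickles" its handle back to Active.
    tickle :: Handle -> IO ()
    tickle h = writeIORef (hState h) Active

    -- The checker wakes up once per interval; a handle therefore has to sit
    -- untouched through two consecutive sweeps before its onTimeout fires,
    -- which matches the two checker runs for ThreadId 99 above.
    reaper :: Int -> IORef [Handle] -> IO ()
    reaper interval handlesRef = forever $ do
        threadDelay interval
        hs <- readIORef handlesRef
        forM_ hs $ \h -> do
            st <- readIORef (hState h)
            case st of
                Active   -> writeIORef (hState h) Inactive  -- first sweep
                Inactive -> hOnTimeout h                    -- second sweep: timed out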

Question: Why does the connection not get closed when the slowloris timeout kills the thread? If I'm reading runSettingsConnectionMaker correctly, it only catches IO exceptions and closes the connection. The async ThreadKilled exception is not handled there (see the FIXME in the code). Perhaps that is the real bug.

I suppose a workaround is to set the timeout as high as it will go. Since it's an Int holding microseconds, the highest timeout that can be set is around 35 minutes. Probably high enough for chromium to drop the connection.
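
(A sketch of that workaround; it assumes the settingsTimeout field is expressed in seconds and gets multiplied into a microsecond Int internally, which would be where the roughly 35-minute ceiling comes from.)

    import Network.Wai.Handler.Warp (Settings (..), defaultSettings)

    -- Workaround sketch: push the slowloris timeout close to its ceiling.
    -- settingsTimeout is assumed to be in seconds here; 2000 seconds is about
    -- 33 minutes, just under the ~35-minute overflow point mentioned above.
    longTimeoutSettings :: Settings
    longTimeoutSettings = defaultSettings { settingsTimeout = 2000 }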

@snoyberg
Member

Thank you for filing this report. I've been hunting a bug in yesod devel which I think may be related, though until now I thought it was from a completely different library.

I think the Slowloris approach is still correct, even if it kills some Chromium connections. At a certain point, there's no way to distinguish between legitimate client connections and attacks. 30 seconds is a somewhat arbitrary cutoff, but the cutoff can be customized by the user. And if everything worked as it should, this wouldn't be a problem.

I don't think the FIXME is relevant to this issue. The code:

serve `onException` cleanup

should in fact cause the connection to be closed on any exception. The getConnLoop code should not be relevant. However, I can see from your logs that the problem really is that cleanup is never getting called: notice that your logs never mention a call to cancel. Unfortunately, I'm not able to reproduce that behavior on my system.
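
(A small standalone illustration of that point, not Warp code: onException runs its second action for asynchronous exceptions such as ThreadKilled too, and then rethrows, so the cleanup should fire even when the timeout manager kills the handler thread.)

    import Control.Concurrent (forkIO, killThread, threadDelay)
    import Control.Exception (onException)

    main :: IO ()
    main = do
        tid <- forkIO $
            threadDelay 10000000 `onException` putStrLn "cleanup ran"
        threadDelay 100000   -- let the child start blocking
        killThread tid       -- delivers ThreadKilled asynchronously
        threadDelay 100000   -- give the handler time to print "cleanup ran"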

Do you have a minimal example showing the problem? And can you provide more information on OS, GHC version, compile flags, and runtime flags?

snoyberg added a commit that referenced this issue Mar 11, 2013
@snoyberg
Member

I've implemented a possible fix, though I'm not sure if it actually solves the problem (it seems to make yesod devel better behaved in my testing). Would you be able to test the newest Github code?

@kazu-yamamoto
Contributor

cleanup should always call connClose, shouldn't it?

@snoyberg
Member

I would think this holds for both the old and the new code; I can't see any reason why it wouldn't. Nonetheless, @joeyh's log definitely implies that connClose is not being run for some reason. I'm not sure if the patch I pushed fixes the problem, but IMO it's a better way to write the function anyway.

@kazu-yamamoto
Contributor

I will investigate this tomorrow.

@kazu-yamamoto
Contributor

BTW, would you explain why mask & restore are used around here?

@kazu-yamamoto
Contributor

A thread created by forkIO inherits the masking state. So, reading the code, it seems to me that the child threads cannot receive asynchronous exceptions...
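
(A small standalone demonstration of that inheritance, not Warp code: a child forked with plain forkIO under mask_ starts in the masked state, while forkIOWithUnmask hands the child a function that restores the unmasked state.)

    import Control.Concurrent
    import Control.Exception

    main :: IO ()
    main = do
        v1 <- newEmptyMVar
        v2 <- newEmptyMVar
        mask_ $ do
            -- Plain forkIO: the child inherits the parent's masked state.
            _ <- forkIO $ getMaskingState >>= putMVar v1
            -- forkIOWithUnmask: the child also starts masked, but receives a
            -- function that restores the unmasked state for a chosen region.
            _ <- forkIOWithUnmask $ \unmask ->
                     unmask getMaskingState >>= putMVar v2
            return ()
        takeMVar v1 >>= print   -- MaskedInterruptible
        takeMVar v2 >>= print   -- Unmasked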

@kazu-yamamoto
Contributor

As the documentation for Control.Exception says, we should not use mask & restore ourselves. We should use bracket instead if necessary.

@snoyberg
Member

The reason for the usage of mask is to avoid an exception getting thrown between getConnLoop and mkConn, thereby possibly leaking a connection. There are two places where async exceptions can be thrown:

  1. In the main loop in the allowInterrupt call.
  2. In the child thread's serveConnection call, which has been restored.

I think we may want to restore getConnLoop as well.

@meteficha
Member

If the whole getConnLoop was restored, wouldn't it be possible for an async exception to be thrown after getConn but before exiting the restore? I guess it would be better to restore just the right branch of getConnLoop's catch. However, I don't think it would solve @joeyh's problem either way.

@snoyberg
Member

Good point; perhaps we should pass restore to the callback function and allow it to restore things as it wants. But I agree that this is unlikely to be related to the bug at hand.

@kazu-yamamoto
Contributor

What about this logic?

    forever . mask_ $ do
        allowInterrupt
        (mkConn, addr) <- getConnLoop
        void . forkIOWithUnmask $ \unmask -> do
            th <- T.registerKillThread tm
            conn <- mkConn
#if SENDFILEFD
            let cleaner = Cleaner th fc
#else
            let cleaner = Cleaner th
#endif
            let serve = onOpen >> serveConnection set cleaner port app conn addr
                cleanup = connClose conn >> T.cancel th >> onClose
            unmask serve `finally` cleanup `catch` onE

@kazu-yamamoto
Contributor

I confirmed that the code above calls cleanup both in normal cases and error cases.

Note that I also confirmed that the cleanup of the original Warp 1.3.7.4 is called in error cases. I don't know where @joeyh's problem comes from.

@snoyberg
Member

I'm not sure that this actually changes anything. You're performing the masking inside the forever loop, so on each loop iteration we break out of the mask and therefore have an opportunity to receive an async exception. However, allowInterrupt already achieves that goal. And I'm not sure why forkIOWithUnmask would behave any differently than getting a restore function from mask.

@kazu-yamamoto
Contributor

My code is based on the explanation in the Control.Concurrent documentation. From my point of view, this code is clearer than the current one.

Anyway, I would like @joeyh to test both versions of the code, and I would like to see his results.

@meteficha
Member

@kazu-yamamoto Shouldn't it be mask_ . forever? Also, did you put onOpen inside unmask on purpose? If so, why didn't you unmask onClose as well?

I find @kazu-yamamoto's code easier to read, but I'd add parentheses around the finally on the last line in order to be explicit about catch's scope.

@kazu-yamamoto
Contributor

I don't know which is better, mask_ . forever or forever . mask_. Maybe you are right. @meteficha

I don't know the expected behavior of onOpen and onClose. Would you explain them? @snoyberg

I have no objection to enclosing unmask serve and cleanup. :-)

@snoyberg
Member

onOpen and onClose allow us to run user-specified code before and after each connection is served. One possible use case is logging; I'm not sure of others. And actually, looking at the code again, both the old implementation and @kazu-yamamoto's new implementation are vulnerable to a connection leak if onOpen throws an exception. However, I believe the new bracket-based code is immune to such problems.

I'm not sure if there's really a good motivation for not using the new bracket code. The only tweak I'd consider making is applying restore to the entire body of the bracket. That code would look like:

    mask $ \restore -> forever $ do
        allowInterrupt
        (mkConn, addr) <- getConnLoop
        void . forkIO $ do
            th <- T.registerKillThread tm
            bracket mkConn connClose $ \conn -> restore $ do
#if SENDFILEFD
                let cleaner = Cleaner th fc
#else
                let cleaner = Cleaner th
#endif
                let serve = do
                        onOpen
                        serveConnection set cleaner port app conn addr
                        cleanup
                    cleanup = T.cancel th >> onClose
                handle onE (serve `onException` cleanup)

What this says is:

  1. Block all async exceptions, then loop forever, doing the following.
  2. At the beginning of the loop, check for async exceptions.
  3. Get a new connection maker, then fork a thread to handle it.
  4. Inside that new thread, register a timeout handler.
  5. Use bracket to acquire a connection from the connection maker and register a cleanup action for it, then run the inner action.
  6. Allow async exceptions to be thrown in the inner action.

Since no async exceptions can be thrown in steps 3-5, it should be impossible to leak the connection there. And once the bracket function is called, any async exceptions in step 6 will properly trigger cleanup.

@meteficha
Member

So why not:

    mask_ . forever $ do
        allowInterrupt
        (mkConn, addr) <- getConnLoop
        void . forkIOWithUnmask $ \unmask -> 
            bracket (T.registerKillThread tm) T.cancel $ \th ->
            bracket mkConn connClose $ \conn ->
#if SENDFILEFD
                let cleaner = Cleaner th fc
#else
                let cleaner = Cleaner th
#endif
                in unmask .
                   handle onE .
                   bracket_ onOpen onClose $
                   serveConnection set cleaner port app conn addr

I've put T.cancel and onClose inside brackets since that's what they are anyway.

EDIT: Removed unnecessary do.

@snoyberg
Member

Yes, that looks even better. @kazu-yamamoto any objections to that formulation?

@joeyh
Contributor Author

joeyh commented Mar 15, 2013

Michael Snoyman wrote:

I've implemented a possible fix, though I'm not sure if it actually solves the
problem (it seems to make yesod devel better behaved in my testing). Would you
be able to test the newest Github code?

I'm sorry, didn't see this thread until today.

Yes, I can try testing. I have a fairly good recipe to reproduce the
bug. But I sometimes have to do the manual series of clicks a dozen times
to get it to pop up. So I can't test a wide variety of solutions, but
if you think you've settled on one I will test it as extensively as I can.

see shy jo

@joeyh
Contributor Author

joeyh commented Mar 15, 2013

I'm using ghc 7.4.1-4 from Debian unstable. Linux 3.2.0. Using the threaded RTS, no other special RTS flags or build flags.

Sadly I do not have a minimal test case.

@kazu-yamamoto
Contributor

No objection. Please leave a comment near this code referring to this issue.

@snoyberg
Member

I've just pushed a commit to make the code change. The only change I made versus @meteficha's code was to swap the two bracket calls: there's no point registering a timeout handler before we've acquired the connection, since async exceptions will be blocked during connection making anyway.

@meteficha and @kazu-yamamoto: Please review this. If you think it's good, we should ask @joeyh to test the new code.

@snoyberg
Member

OK @joeyh, I think the code is ready to be tested. Can you give it a shot?

@joeyh
Contributor Author

joeyh commented Mar 18, 2013

Michael Snoyman wrote:

OK @joeyh, I think the code is ready to be tested. Can you give it a shot?

I set the slowloris timeout down to just 10 seconds, and instrumented
warp so I can tell when threads time out. I've been running my app like
this for a while and while quite a lot of threads have timed out and
been killed, I've yet to see chromium freeze.

This isn't quite conclusive, but it's looking likely to be fixed.
I will continue running in this configuration for a while.

I would still appreciate a setting to entirely disable the slowloris
protection, since it is completely unnecessary in my application, which
only accepts connections from localhost.

see shy jo

@snoyberg
Member

I've just released a new version of Warp with this fix; thanks, everyone, for working on it.

I'm hesitant to include the option to disable timeouts, simply because it will likely introduce a bunch of conditionals into some tight loops. I'll leave the issue open for now.

@kazu-yamamoto
Contributor

Does "I've yet to see chromium freeze" mean "chromium still freezes sometime"?

@gertcuykens
Contributor

It means it did not freeze.

snoyberg closed this as completed Sep 8, 2013
bgamari pushed a commit to bgamari/git-annex that referenced this issue Dec 24, 2013
…hat caused regular browsers to stall when they reuse a connection after leaving it idle for 30 seconds. (See yesodweb/wai#146)