
slowloris protection can be triggered by legitimate use of chromium #146

Closed
joeyh opened this Issue · 30 comments

5 participants

Joey Hess, Michael Snoyman, Kazu Yamamoto, Felipe Lessa, Gert Cuykens
Joey Hess
joeyh commented

I have a yesod web app that I'm able to fairly reliably, through a particular pattern of activity, cause to seemingly freeze. I.e., I click on a link in the web browser and it spins forever. Once this has happened, any link I click on will exhibit the same behavior, and even pages that are just a simple static hamlet template with no other IO will freeze. However, open any link in a new tab and it works fine. It's only the http connection used by one particular chromium tab that has gotten stuck somehow.

(There is a mildly embarrassing video of this happening to me in the middle of a demo, here: http://mirror.linux.org.au/linux.conf.au/2013/mp4/gitannex.mp4 around 20 minutes in.)

I got a tcpdump of this happening and verified that chromium was sending a complete http request, and never getting an answer back, no matter how long I wait. Here is a tcpdump that shows a full session from app startup time to hang: http://downloads.kitenet.net/misc/git-annex-webapp-hang.tcpdump.bz2

I instrumented my use of warp, making settingsOnException, settingsOnOpen, and settingsOnClose do debug logging. From this I could see that ThreadKilled exceptions were being thrown. Only one place in warp calls killThread: registerKillThread does so when the timeout is exceeded. I added a print there and verified that it was indeed responsible for killing threads. Here is a server log relevant to this issue, showing the http requests and debug output: http://downloads.kitenet.net/misc/daemon.log
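For reference, a rough sketch of that kind of instrumentation. The field types here are an assumption based on the Warp 1.3.x Settings record (settingsOnException taking a SomeException, and the open/close hooks as plain IO actions); later Warp versions changed these hooks, so adjust accordingly.

    -- Assumed Warp 1.3.x hook types; illustrative only, not the exact
    -- instrumentation used in this report.
    import Control.Exception (SomeException)
    import Network.Wai.Handler.Warp

    debugSettings :: Settings
    debugSettings = defaultSettings
        { settingsOnException = \e -> putStrLn ("exception: " ++ show (e :: SomeException))
        , settingsOnOpen      = putStrLn "connection opened"
        , settingsOnClose     = putStrLn "connection closed"
        }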

I commented out the killThread call, and my app seems fixed; no more hangs.

The timeout is apparently only supposed to be triggered by the slowloris attack. Ironically, since my webapp only listens for connections from localhost, it doesn't need DOS protection at all. But I cannot find a way to disable the slowloris protection in warp's API.

Adding a knob to disable killing threads would be sufficient for my needs, although I suspect you might want to take further action.

warp version: 1.3.7.4

Joey Hess
joeyh commented

I've added a lot of debug logging to the timeout manager code to try to gather more data around when this happens. Using this patch http://downloads.kitenet.net/misc/warp-instrumented.patch I'm seeing logs like this:

[2013-03-09 13:55:50 JEST] open connection ThreadId 99
[2013-03-09 13:55:50 JEST] ("paused","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:55:50 JEST] 127.0.0.1 GET /static/js/bootstrap-collapse.js Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.68 Safari/537.17 ThreadId 99
[2013-03-09 13:55:50 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:55:50 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:56:19 JEST] ("state was",Active,"now",Inactive,"ThreadId 99")
[2013-03-09 13:56:34 JEST] ("paused","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:56:34 JEST] 127.0.0.1 GET /static/css/bootstrap-responsive.css Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.68 Safari/537.17 ThreadId 99
[2013-03-09 13:56:34 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:56:34 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:56:34 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:56:34 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:56:49 JEST] ("state was",Active,"now",Inactive,"ThreadId 99")
[2013-03-09 13:57:20 JEST] ("state was",Inactive,"now",Inactive,"ThreadId 99")
[2013-03-09 13:57:20 JEST] ("calling onTimeout","ThreadId 99")
[2013-03-09 13:57:20 JEST] ("registerKillThread fired, killing","ThreadId 99")

Note that the timeout checker ran twice for thread 99: first setting it Inactive, and then, 30 seconds later, seeing it was still Inactive and calling its onTimeout.
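(A minimal sketch of this two-sweep scheme, to make the log easier to follow; this is illustrative only, not Warp's actual timeout manager code.)

    -- Two-sweep timeout sketch: requests "tickle" their handle back to
    -- Active; a reaper wakes up every interval, demotes Active handles to
    -- Inactive, and fires the timeout action for handles found still
    -- Inactive on the following pass.
    import Control.Concurrent (threadDelay)
    import Control.Monad (forever, forM_)
    import Data.IORef

    data State = Active | Inactive

    data Handle = Handle { hState :: IORef State, hTimeout :: IO () }

    tickle :: Handle -> IO ()
    tickle h = writeIORef (hState h) Active

    reaper :: Int -> IORef [Handle] -> IO ()
    reaper interval handles = forever $ do
        threadDelay interval                      -- e.g. 30 seconds, in microseconds
        hs <- readIORef handles
        forM_ hs $ \h -> do
            st <- readIORef (hState h)
            case st of
                Active   -> writeIORef (hState h) Inactive  -- first sweep: demote
                Inactive -> hTimeout h                      -- second sweep: time out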

Since I have the killThread commented out, we can see this thread go on to successfully process another request!

[2013-03-09 13:57:36 JEST] ("paused","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:57:36 JEST] 127.0.0.1 POST /filebrowser Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.68 Safari/537.17 ThreadId 99
[2013-03-09 13:57:36 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:57:36 JEST] ("tickle","ThreadId 99","by",ThreadId 99)

So, chromium is opening a connection, using it to request two files within seconds, and then, a full minute later, reusing the connection to load another page. This is clear evidence that the 30-second slowloris timeout is triggered by a regular browser doing nothing special.

Question: Why does the connection not get closed when the slowloris timeout kills the thread? If I'm reading runSettingsConnectionMaker correctly, it only catches IO exceptions and closes the connection. The async ThreadKilled exception is not handled there; see the FIXME in the code. Perhaps that is the real bug.

I suppose a workaround is to set the timeout as high as it will go. Since it's an Int holding microseconds, the highest timeout that can be set is around 35 minutes. Probably high enough for chromium to drop the connection.
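A sketch of that workaround, assuming the Warp Settings record's settingsTimeout field (given in seconds); myApp here is a stand-in, not a real application:

    -- Workaround sketch: raise Warp's slowloris timeout towards its ceiling.
    -- settingsTimeout is in seconds; internally it becomes microseconds in an
    -- Int, so on a 32-bit GHC roughly 2100 seconds (~35 minutes) is the limit.
    import Network.Wai.Handler.Warp (runSettings, defaultSettings, settingsTimeout)

    main :: IO ()
    main = runSettings defaultSettings { settingsTimeout = 2000 } myApp
      where
        myApp = error "stand-in for the real WAI Application"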

Michael Snoyman
Owner

Thank you for filing this report. I've been hunting a bug in yesod devel which I think may be related, though until now I thought it was from a completely different library.

I think the Slowloris approach is still correct, even if it kills some Chromium connections. At a certain point, there's no way to distinguish between legitimate client connections and attacks. 30 seconds is a somewhat arbitrary cutoff, but the cutoff can be customized by the user. And if everything worked as it should, this wouldn't be a problem.

I don't think the FIXME is relevant to this issue. The code:

    serve `onException` cleanup

should in fact be triggering the connection to be closed on any exception. The getConnLoop code should not be relevant. However, I can see from your logs that the problem really is that cleanup is never getting called: notice how your logs never mention a call to cancel. Unfortunately, I'm not able to reproduce that behavior on my system.
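(A tiny standalone check of that expectation, using only base: onException runs its cleanup even for an asynchronous ThreadKilled, which is what makes the missing connClose so puzzling.)

    -- Killing the child thread below still prints "cleanup ran" before the
    -- thread dies, because onException also fires for async exceptions.
    import Control.Concurrent (forkIO, killThread, threadDelay)
    import Control.Exception (onException)

    main :: IO ()
    main = do
        tid <- forkIO $ threadDelay 10000000 `onException` putStrLn "cleanup ran"
        threadDelay 100000      -- let the child block in threadDelay
        killThread tid          -- deliver ThreadKilled asynchronously
        threadDelay 100000      -- give the cleanup a moment to print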

Do you have a minimal example showing the problem? And can you provide more information on OS, GHC version, compile flags, and runtime flags?

Michael Snoyman snoyberg referenced this issue from a commit
Michael Snoyman snoyberg Possible fix for #146 5dca62e
Michael Snoyman
Owner

I've implemented a possible fix, though I'm not sure if it actually solves the problem (it seems to make yesod devel better behaved in my testing). Would you be able to test the newest GitHub code?

Kazu Yamamoto
Collaborator

cleanup should always call connClose, shouldn't it?

Michael Snoyman
Owner

I would think that with both the old and new code this would be true; I can't see any reason why it wouldn't be. Nonetheless, @joeyh's log definitely implies that connClose is not being run for some reason. I'm not sure if the patch I pushed fixes the problem, but IMO it's a better way to write the function anyway.

Kazu Yamamoto
Collaborator

I will investigate this tomorrow.

Kazu Yamamoto
Collaborator

BTW, would you explain why mask & restore are used around here?

Kazu Yamamoto
Collaborator

A thread created by forkIO inherits the mask state. So, reading the code, it seems to me that child threads cannot receive asynchronous exceptions...

Kazu Yamamoto
Collaborator

As the documentation of Control.Exception says, we should not use mask & restore ourselves; we should use bracket instead if necessary.

Michael Snoyman
Owner

The reason for the usage of mask is to avoid an exception getting thrown between getConnLoop and mkConn, thereby possibly leaking a connection. There are two places where async exceptions can be thrown:

  1. In the main loop in the allowInterrupt call.
  2. In the child thread's serveConnection call, which has been restored.

I think we may want to restore getConnLoop as well.
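(A stripped-down sketch of that shape, illustrative rather than Warp's actual loop: stay masked from accepting the connection through the fork, and only restore inside the child where interruption is safe.)

    -- Async exceptions are only deliverable at allowInterrupt in the parent
    -- and inside restore in the child, so the freshly accepted connection
    -- cannot be leaked in between.
    import Control.Concurrent (forkIO)
    import Control.Exception (allowInterrupt, mask)
    import Control.Monad (forever, void)

    acceptLoop :: IO conn -> (conn -> IO ()) -> IO ()
    acceptLoop accept serve =
        mask $ \restore -> forever $ do
            allowInterrupt                        -- interruption point in the loop
            conn <- accept                        -- still masked: no leak window
            void . forkIO $ restore (serve conn)  -- child inherits mask, then restores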

Felipe Lessa
Owner

If the whole getConnLoop was restored, wouldn't it be possible for an async exception to be thrown after getConn but before exiting the restore? I guess it would be better to restore just the right branch of getConnLoop's catch. However, I don't think it would solve @joeyh's problem either way.

Michael Snoyman
Owner

Good point; perhaps we should pass restore to the callback function and allow it to restore things as it wants. But I agree that this is unlikely to be related to the bug at hand.

Kazu Yamamoto
Collaborator

What about this logic?

    forever . mask_ $ do
        allowInterrupt
        (mkConn, addr) <- getConnLoop
        void . forkIOWithUnmask $ \unmask -> do
            th <- T.registerKillThread tm
            conn <- mkConn
#if SENDFILEFD
            let cleaner = Cleaner th fc
#else
            let cleaner = Cleaner th
#endif
            let serve = onOpen >> serveConnection set cleaner port app conn addr
                cleanup = connClose conn >> T.cancel th >> onClose
            unmask serve `finally` cleanup `catch` onE
Kazu Yamamoto
Collaborator

I confirmed that the code above calls cleanup both in normal cases and error cases.

Note that I also confirmed that the cleanup of the original Warp 1.3.7.4 is called in error cases. I don't know where @joeyh's problem comes from.

Michael Snoyman
Owner

I'm not sure that this actually changes anything. You're performing the masking inside the forever loop, so on each loop iteration we break out of the mask and therefore have an opportunity to receive an async exception. However, allowInterrupt already achieves that goal. And I'm not sure why forkIOWithUnmask would behave differently from getting a restore function from mask.

Kazu Yamamoto
Collaborator

My code is based on the explanation in the documentation of Control.Concurrent. This code is clearer than the current one from my point of view.

Anyway, I would like @joeyh to test both versions of the code and see his results.

Felipe Lessa
Owner

@kazu-yamamoto Shouldn't it be mask_ . forever? Also, did you put onOpen inside unmask on purpose? If so, why didn't you unmask onClose as well?

I find @kazu-yamamoto's code easier to read, but I'd add parentheses around the finally on the last line in order to be explicit about catch's scope.

Kazu Yamamoto
Collaborator

I don't know which is better, mask_ . forever or forever . mask_. Maybe you are right. @meteficha

I don't know the expected behavior of onOpen and onClose. Would you explain them? @snoyberg

I have no objection to enclosing unmask serve and cleanup in parentheses. :-)

Michael Snoyman
Owner

onOpen and onClose allow us to run user-specified code before and after each connection is served. One possible use case is logging; I'm not sure of others. And actually, looking at the code again, both the old implementation and @kazu-yamamoto's new implementation are vulnerable to a connection leak if onOpen throws an exception. However, I believe the new bracket-based code is immune to such problems.

I'm not sure if there's really a good motivation for not using the new bracket code. The only tweak I'd consider making is restoring the entire body of the bracket. That code would look like:

    mask $ \restore -> forever $ do
        allowInterrupt
        (mkConn, addr) <- getConnLoop
        void . forkIO $ do
            th <- T.registerKillThread tm
            bracket mkConn connClose $ \conn -> restore $ do
#if SENDFILEFD
                let cleaner = Cleaner th fc
#else
                let cleaner = Cleaner th
#endif
                let serve = do
                        onOpen
                        serveConnection set cleaner port app conn addr
                        cleanup
                    cleanup = T.cancel th >> onClose
                handle onE (serve `onException` cleanup)

What this says is:

  1. Block all async exceptions, then loop forever, doing the following.
  2. At the beginning of the loop, check for async exceptions.
  3. Get a new connection maker, then fork a thread to handle it.
  4. Inside that new thread, register a timeout handler.
  5. Use bracket to acquire a connection from the connection maker and register a cleanup action for it, then run the inner action.
  6. Allow async exceptions to be thrown in the inner action.

Since no async exceptions can be thrown in steps 3-5, it should be impossible to leak the connection there. And once the bracket function is called, any async exceptions in step 6 will properly trigger cleanup.

Felipe Lessa
Owner

So why not:

    mask_ . forever $ do
        allowInterrupt
        (mkConn, addr) <- getConnLoop
        void . forkIOWithUnmask $ \unmask -> 
            bracket (T.registerKillThread tm) T.cancel $ \th ->
            bracket mkConn connClose $ \conn ->
#if SENDFILEFD
                let cleaner = Cleaner th fc
#else
                let cleaner = Cleaner th
#endif
                in unmask .
                   handle onE .
                   bracket_ onOpen onClose $
                   serveConnection set cleaner port app conn addr

I've put T.cancel and onClose inside brackets since that's what they are anyway.

EDIT: Removed unnecessary do.

Michael Snoyman
Owner

Yes, that looks even better. @kazu-yamamoto any objections to that formulation?

Joey Hess
joeyh commented

I'm using ghc 7.4.1-4 from Debian unstable. Linux 3.2.0. Using the threaded RTS, no other special RTS flags or build flags.

Sadly I do not have a minimal test case.

Kazu Yamamoto
Collaborator

No objection. Please leave a comment around this code referring to this issue.

Michael Snoyman
Owner

I've just pushed a commit to make the code change. The only change I made versus @meteficha's code was to swap the two bracket calls: there's no point registering a timeout handler before we've acquired the connection, since async exceptions will be blocked during connection making anyway.
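For reference, what that swap might look like applied to @meteficha's snippet above, with the SENDFILEFD CPP branch omitted (a sketch; the committed code may differ in details):

    -- Sketch only: the two brackets from the earlier snippet, with the
    -- connection acquired first and the timeout handler registered second.
    mask_ . forever $ do
        allowInterrupt
        (mkConn, addr) <- getConnLoop
        void . forkIOWithUnmask $ \unmask ->
            bracket mkConn connClose $ \conn ->
            bracket (T.registerKillThread tm) T.cancel $ \th ->
                let cleaner = Cleaner th
                in unmask .
                   handle onE .
                   bracket_ onOpen onClose $
                   serveConnection set cleaner port app conn addr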

@meteficha and @kazu-yamamoto: Please review this. If you think it's good, we should ask @joeyh to test the new code.

Michael Snoyman
Owner

OK @joeyh, I think the code is ready to be tested. Can you give it a shot?

Joey Hess
joeyh commented
Michael Snoyman
Owner

I've just released a new version of Warp with this fix; thanks, everyone, for working on it.

I'm hesitant to include the option to disable timeouts, simply because it will likely introduce a bunch of conditionals into some tight loops. I'll leave the issue open for now.

Kazu Yamamoto
Collaborator

Does "I've yet to see chromium freeze" mean "chromium still freezes sometime"?

Gert Cuykens

It means it did not freeze.

Ben Gamari bgamari referenced this issue from a commit in bgamari/git-annex
Joey Hess joeyh webapp: Work around bug in Warp's slowloris attack prevention code, that caused regular browsers to stall when they reuse a connection after leaving it idle for 30 seconds. (See yesodweb/wai#146)
3e32bb4