slowloris protection can be triggered by legitimate use of chromium #146

Closed
joeyh opened this issue Mar 9, 2013 · 30 comments

@joeyh
Contributor

joeyh commented Mar 9, 2013

I have a yesod web app that, through a particular pattern of activity, I can fairly reliably cause to appear to freeze. I.e., I click on a link in the web browser and it spins forever. Once this has happened, any link I click on exhibits the same behavior, and even pages that are a simple static hamlet template with no other IO freeze. However, opening any link in a new tab works fine. It's only the http connection used by one particular chromium tab that has somehow gotten stuck.

(There is a mildly embarrassing video of this happening to me in the middle of a demo, here: http://mirror.linux.org.au/linux.conf.au/2013/mp4/gitannex.mp4 around 20 minutes in.)

I got a tcpdump of this happening and verified that chromium was sending a complete http request, and never getting an answer back, no matter how long I wait. Here is a tcpdump that shows a full session from app startup time to hang: http://downloads.kitenet.net/misc/git-annex-webapp-hang.tcpdump.bz2

I instrumented my use of warp, making settingsOnException, settingsOnOpen, and settingsOnClose do debug logging. From this I could see that ThreadKilled exceptions were being thrown. Only one place in warp calls killThread: registerKillThread does so when the timeout is exceeded. I added a print there and verified that it was indeed responsible for killing threads. Here is a server log for this issue showing the http requests and debug output: http://downloads.kitenet.net/misc/daemon.log
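
A minimal sketch of that kind of instrumentation (not the actual patch; it assumes the Settings record fields named above are exported by this Warp version as plain hooks, which may not hold for other versions):

    import Network.Wai.Handler.Warp (Settings (..), defaultSettings)
    import System.IO (hPutStrLn, stderr)

    -- Sketch only: hook debug logging into the settings fields named above.
    -- The field names are taken from the comment; their exact types are an
    -- assumption and may differ between Warp versions.
    debugSettings :: Settings
    debugSettings = defaultSettings
        { settingsOnOpen      = hPutStrLn stderr "connection opened"
        , settingsOnClose     = hPutStrLn stderr "connection closed"
        , settingsOnException = \e -> hPutStrLn stderr ("exception: " ++ show e)
        }

    -- used as: runSettings debugSettings app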

I commented out the killThread call, and my app seems fixed; no more hangs.

The timeout is apparently only supposed to be triggered by the slowloris attack. Ironically, since my webapp only listens for connections from localhost, it doesn't need DoS protection at all. But I cannot find a way to disable the slowloris protection in warp's API.

Adding a knob to disable killing threads would be sufficient for my needs, although I suspect you might want to take further action.

warp version: 1.3.7.4

@joeyh
Contributor Author

joeyh commented Mar 9, 2013

I've added a lot of debug logging to the timeout manager code to try to gather more data around when this happens. Using this patch http://downloads.kitenet.net/misc/warp-instrumented.patch I'm seeing logs like this:

[2013-03-09 13:55:50 JEST] open connection ThreadId 99
[2013-03-09 13:55:50 JEST] ("paused","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:55:50 JEST] 127.0.0.1 GET /static/js/bootstrap-collapse.js Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.68 Safari/537.17 ThreadId 99
[2013-03-09 13:55:50 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:55:50 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:56:19 JEST] ("state was",Active,"now",Inactive,"ThreadId 99")
[2013-03-09 13:56:34 JEST] ("paused","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:56:34 JEST] 127.0.0.1 GET /static/css/bootstrap-responsive.css Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.68 Safari/537.17 ThreadId 99
[2013-03-09 13:56:34 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:56:34 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:56:34 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:56:34 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:56:49 JEST] ("state was",Active,"now",Inactive,"ThreadId 99")
[2013-03-09 13:57:20 JEST] ("state was",Inactive,"now",Inactive,"ThreadId 99")
[2013-03-09 13:57:20 JEST] ("calling onTimeout","ThreadId 99")
[2013-03-09 13:57:20 JEST] ("registerKillThread fired, killing","ThreadId 99")

Note that the timeout checker ran twice for thread 99: first setting it to Inactive, and then, 30 seconds later, seeing it was still Inactive and calling its onTimeout.

Since I have the killThread commented out, we can see this thread go on to successfully process another request!

[2013-03-09 13:57:36 JEST] ("paused","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:57:36 JEST] 127.0.0.1 POST /filebrowser Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.68 Safari/537.17 ThreadId 99
[2013-03-09 13:57:36 JEST] ("tickle","ThreadId 99","by",ThreadId 99)
[2013-03-09 13:57:36 JEST] ("tickle","ThreadId 99","by",ThreadId 99)

So, chromium is opening a connection, using it to request two files within seconds, and then, a full minute later, reusing the connection to load another page. This is clear evidence that the 30-second slowloris timeout is triggered by a regular browser doing nothing special.
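
(For illustration, a simplified sketch of the two-sweep scheme visible in the log above; this is not Warp's actual timeout code, just the behavior described: each sweep demotes Active handles to Inactive, and a handle found still Inactive on the next sweep has its onTimeout fired.)

    import Control.Concurrent (threadDelay)
    import Control.Monad (forever, forM_)
    import Data.IORef

    data State = Active | Inactive

    data Handle = Handle
        { hState     :: IORef State
        , hOnTimeout :: IO ()    -- e.g. kill the worker thread
        }

    -- Each request "tickles" its handle back to Active.
    tickle :: Handle -> IO ()
    tickle h = writeIORef (hState h) Active

    -- The checker wakes up once per interval; a handle therefore has to sit
    -- untouched through two consecutive sweeps before its onTimeout fires,
    -- which matches the two checker runs for ThreadId 99 above.
    reaper :: Int -> IORef [Handle] -> IO ()
    reaper interval handlesRef = forever $ do
        threadDelay interval
        hs <- readIORef handlesRef
        forM_ hs $ \h -> do
            st <- readIORef (hState h)
            case st of
                Active   -> writeIORef (hState h) Inactive  -- first sweep
                Inactive -> hOnTimeout h                    -- second sweep: timed out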

Question: Why does the connection not get closed when the slowloris timeout kills the thread? If I'm reading runSettingsConnectionMaker correctly, it only catches IO exceptions and closes the connection. The async ThreadKilled exception is not handled there (see the FIXME in the code). Perhaps that is the real bug.

I suppose a workaround is to set the timeout as high as it will go. Since it's an Int holding microseconds, the highest timeout that can be set is around 35 minutes. Probably high enough for chromium to drop the connection.
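
(A sketch of that workaround; it assumes the settingsTimeout field is expressed in seconds and gets multiplied into a microsecond Int internally, which would be where the roughly 35-minute ceiling comes from.)

    import Network.Wai.Handler.Warp (Settings (..), defaultSettings)

    -- Workaround sketch: push the slowloris timeout close to its ceiling.
    -- settingsTimeout is assumed to be in seconds here; 2000 seconds is about
    -- 33 minutes, just under the ~35-minute overflow point mentioned above.
    longTimeoutSettings :: Settings
    longTimeoutSettings = defaultSettings { settingsTimeout = 2000 }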

@snoyberg
Member

Thank you for filing this report. I've been hunting a bug in yesod devel which I think may be related, though until now I thought it was from a completely different library.

I think the Slowloris approach is still correct, even if it kills some Chromium connections. At a certain point, there's no way to distinguish between legitimate client connections and attacks. 30 seconds is a somewhat arbitrary cutoff, but the cutoff can be customized by the user. And if everything worked as it should, this wouldn't be a problem.

I don't think the FIXME is relevant to this issue. The code:

serve `onException` cleanup

should in fact cause the connection to be closed on any exception. The getConnLoop code should not be relevant. However, I can see from your logs that the problem really is that cleanup is never getting called: notice that your logs never mention a call to cancel. Unfortunately, I'm not able to reproduce that behavior on my system.
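
(A small standalone illustration of that point, not Warp code: onException runs its second action for asynchronous exceptions such as ThreadKilled too, and then rethrows, so the cleanup should fire even when the timeout manager kills the handler thread.)

    import Control.Concurrent (forkIO, killThread, threadDelay)
    import Control.Exception (onException)

    main :: IO ()
    main = do
        tid <- forkIO $
            threadDelay 10000000 `onException` putStrLn "cleanup ran"
        threadDelay 100000   -- let the child start blocking
        killThread tid       -- delivers ThreadKilled asynchronously
        threadDelay 100000   -- give the handler time to print "cleanup ran"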

Do you have a minimal example showing the problem? And can you provide more information on OS, GHC version, compile flags, and runtime flags?

snoyberg added a commit that referenced this issue Mar 11, 2013
@snoyberg
Member

I've implemented a possible fix, though I'm not sure if it actually solves the problem (it seems to make yesod devel better behaved in my testing). Would you be able to test the newest Github code?

@kazu-yamamoto
Contributor

cleanup should always call connClose, shouldn't it?

@snoyberg
Member

I would think this holds for both the old and the new code; I can't see any reason why it wouldn't. Nonetheless, @joeyh's log definitely implies that connClose is not being run for some reason. I'm not sure if the patch I pushed fixes the problem, but IMO it's a better way to write the function anyway.

@kazu-yamamoto
Contributor

I will investigate this tomorrow.

@kazu-yamamoto
Contributor

BTW, would you explain why mask & restore are used around here?

@kazu-yamamoto
Contributor

A thread created by forkIO inherits the masking state. So, reading the code, it seems to me that the child threads cannot receive asynchronous exceptions...
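
(A small standalone demonstration of that inheritance, not Warp code: a child forked with plain forkIO under mask_ starts in the masked state, while forkIOWithUnmask hands the child a function that restores the unmasked state.)

    import Control.Concurrent
    import Control.Exception

    main :: IO ()
    main = do
        v1 <- newEmptyMVar
        v2 <- newEmptyMVar
        mask_ $ do
            -- Plain forkIO: the child inherits the parent's masked state.
            _ <- forkIO $ getMaskingState >>= putMVar v1
            -- forkIOWithUnmask: the child also starts masked, but receives a
            -- function that restores the unmasked state for a chosen region.
            _ <- forkIOWithUnmask $ \unmask ->
                     unmask getMaskingState >>= putMVar v2
            return ()
        takeMVar v1 >>= print   -- MaskedInterruptible
        takeMVar v2 >>= print   -- Unmasked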

@kazu-yamamoto
Contributor

As the documentation for Control.Exception says, we should not use mask & restore ourselves. We should use bracket instead if necessary.

@snoyberg
Member

The reason for the usage of mask is to avoid an exception getting thrown between getConnLoop and mkConn, thereby possibly leaking a connection. There are two places where async exceptions can be thrown:

  1. In the main loop in the allowInterrupt call.
  2. In the child thread's serveConnection call, which has been restored.

I think we may want to restore getConnLoop as well.

@meteficha
Member

If the whole getConnLoop was restored, wouldn't it be possible for an async exception to be thrown after getConn but before exiting the restore? I guess it would be better to restore just the right branch of getConnLoop's catch. However, I don't think it would solve @joeyh's problem either way.

@snoyberg
Member

Good point; perhaps we should pass restore to the callback function and allow it to restore things as it wants. But I agree that this is unlikely to be related to the bug at hand.

@kazu-yamamoto
Contributor

What about this logic?

    forever . mask_ $ do
        allowInterrupt
        (mkConn, addr) <- getConnLoop
        void . forkIOWithUnmask $ \unmask -> do
            th <- T.registerKillThread tm
            conn <- mkConn
#if SENDFILEFD
            let cleaner = Cleaner th fc
#else
            let cleaner = Cleaner th
#endif
            let serve = onOpen >> serveConnection set cleaner port app conn addr
                cleanup = connClose conn >> T.cancel th >> onClose
            unmask serve `finally` cleanup `catch` onE

@kazu-yamamoto
Contributor

I confirmed that the code above calls cleanup both in normal cases and error cases.

Note that I also confirmed that the cleanup of the original Warp 1.3.7.4 is called in error cases. I don't know where @joeyh's problem comes from.

@snoyberg
Member

I'm not sure that this actually changes anything. You're performing the masking inside the forever loop, so on each loop iteration we break out of the mask and therefore have an opportunity to receive an async exception. However, allowInterrupt already achieves that goal. And I'm not sure why forkIOWithUnmask would behave any differently than getting a restore function from mask.

@kazu-yamamoto
Contributor

My code is based on the explanation in the Control.Concurrent documentation. From my point of view, this code is clearer than the current one.

Anyway, I would like @joeyh to test both versions of the code, and I would like to see his results.

@meteficha
Member

@kazu-yamamoto Shouldn't it be mask_ . forever? Also, did you put onOpen inside unmask on purpose? If so, why didn't you unmask onClose as well?

I find @kazu-yamamoto's code easier to read, but I'd add parentheses around the finally on the last line in order to be explicit about catch's scope.

@kazu-yamamoto
Contributor

I don't know which is better, mask_ . forever or forever . mask_. Maybe you are right. @meteficha

I don't know the expected behavior of onOpen and onClose. Would you explain them? @snoyberg

I have no objection to enclosing unmask serve and cleanup. :-)

@snoyberg
Member

onOpen and onClose allow us to run user-specified code before and after each connection is served. One possible use case is logging; I'm not sure of others. And actually, looking at the code again, both the old implementation and @kazu-yamamoto's new implementation are vulnerable to a connection leak if onOpen throws an exception. However, I believe the new bracket-based code is immune to such problems.

I'm not sure if there's really a good motivation for not using the new bracket code. The only tweak I'd consider making is applying restore to the entire body of the bracket. That code would look like:

    mask $ \restore -> forever $ do
        allowInterrupt
        (mkConn, addr) <- getConnLoop
        void . forkIO $ do
            th <- T.registerKillThread tm
            bracket mkConn connClose $ \conn -> restore $ do
#if SENDFILEFD
                let cleaner = Cleaner th fc
#else
                let cleaner = Cleaner th
#endif
                let serve = do
                        onOpen
                        serveConnection set cleaner port app conn addr
                        cleanup
                    cleanup = T.cancel th >> onClose
                handle onE (serve `onException` cleanup)

What this says is:

  1. Block all async exceptions, then loop forever, doing the following.
  2. At the beginning of the loop, check for async exceptions.
  3. Get a new connection maker, then fork a thread to handle it.
  4. Inside that new thread, register a timeout handler.
  5. Use bracket to acquire a connection from the connection maker and register a cleanup action for it, then run the inner action.
  6. Allow async exceptions to be thrown in the inner action.

Since no async exceptions can be thrown in steps 3-5, it should be impossible to leak the connection there. And once the bracket function is called, any async exceptions in step 6 will properly trigger cleanup.

@meteficha
Member

So why not:

    mask_ . forever $ do
        allowInterrupt
        (mkConn, addr) <- getConnLoop
        void . forkIOWithUnmask $ \unmask -> 
            bracket (T.registerKillThread tm) T.cancel $ \th ->
            bracket mkConn connClose $ \conn ->
#if SENDFILEFD
                let cleaner = Cleaner th fc
#else
                let cleaner = Cleaner th
#endif
                in unmask .
                   handle onE .
                   bracket_ onOpen onClose $
                   serveConnection set cleaner port app conn addr

I've put T.cancel and onClose inside brackets since that's what they are anyway.

EDIT: Removed unnecessary do.

@snoyberg
Member

Yes, that looks even better. @kazu-yamamoto any objections to that formulation?

@joeyh
Contributor Author

joeyh commented Mar 15, 2013

Michael Snoyman wrote:

I've implemented a possible fix, though I'm not sure if it actually solves the
problem (it seems to make yesod devel better behaved in my testing). Would you
be able to test the newest Github code?

I'm sorry, didn't see this thread until today.

Yes, I can try testing. I have a fairly good recipe to reproduce the
bug. But I sometimes have to do the manual series of clicks a dozen times
to get it to pop up. So I can't test a wide variety of solutions, but
if you think you've settled on one I will test it as extensively as I can.

see shy jo

@joeyh
Contributor Author

joeyh commented Mar 15, 2013

I'm using ghc 7.4.1-4 from Debian unstable. Linux 3.2.0. Using the threaded RTS, no other special RTS flags or build flags.

Sadly I do not have a minimal test case.

@kazu-yamamoto
Contributor

No objection. Please leave a comment near this code referring to this issue.

@snoyberg
Member

I've just pushed a commit to make the code change. The only change I made versus @meteficha's code was to swap the two bracket calls: there's no point registering a timeout handler before we've acquired the connection, since async exceptions will be blocked during connection making anyway.

@meteficha and @kazu-yamamoto: Please review this. If you think it's good, we should ask @joeyh to test the new code.

@snoyberg
Member

OK @joeyh, I think the code is ready to be tested. Can you give it a shot?

@joeyh
Contributor Author

joeyh commented Mar 18, 2013

Michael Snoyman wrote:

OK @joeyh, I think the code is ready to be tested. Can you give it a shot?

I set the slowloris timeout down to just 10 seconds, and instrumented
warp so I can tell when threads time out. I've been running my app like
this for a while and while quite a lot of threads have timed out and
been killed, I've yet to see chromium freeze.

This isn't quite conclusive, but it's looking likely to be fixed.
I will continue running in this configuration for a while.

I would still appreciate a setting to entirely disable the slowloris
protection, since it is completely unnecessary in my application, which
only accepts connections from localhost.

see shy jo

@snoyberg
Member

I've just released a new version of Warp with this fix; thanks, everyone, for working on it.

I'm hesitant to include the option to disable timeouts, simply because it will likely introduce a bunch of conditionals into some tight loops. I'll leave the issue open for now.

@kazu-yamamoto
Contributor

Does "I've yet to see chromium freeze" mean "chromium still freezes sometime"?

@gertcuykens
Contributor

It means it did not freeze.

snoyberg closed this as completed Sep 8, 2013
bgamari pushed a commit to bgamari/git-annex that referenced this issue Dec 24, 2013
…hat caused regular browsers to stall when they reuse a connection after leaving it idle for 30 seconds. (See yesodweb/wai#146)