Timeout left running after use of sourceRequestBody #351

Closed · adamgundry opened this issue Mar 19, 2015 · 26 comments

@adamgundry

I'm debugging a Warp/conduit server that is failing with “Thread killed by Warp's timeout reaper” when a request causes a slow operation in the application. It seems that after a use of sourceRequestBody, the timeout is left running when control returns to the application, even if the client sends a complete request. Here's a minimal example:

import Control.Concurrent         ( threadDelay )
import Data.Aeson                 ( json' )
import Data.ByteString.Lazy.Char8 ( pack )
import Data.Conduit               ( ($$) )
import Data.Conduit.Attoparsec    ( sinkParser )
import Network.HTTP.Types.Status  ( ok200 )
import Network.Wai                ( responseLBS )
import Network.Wai.Conduit        ( sourceRequestBody )
import Network.Wai.Handler.Warp

main :: IO ()
main = runSettings stgs app
  where
    stgs = setTimeout 1 $ setPort 2000 $ setOnExceptionResponse err defaultSettings
    err e = responseLBS ok200 [] (pack $ show e)

    app rq k = do v <- sourceRequestBody rq $$ sinkParser json'
                  print v
                  threadDelay 2000000
                  k $ responseLBS ok200 [] (pack "OK")

If you POST a JSON object (e.g. {}) to this server, it will be printed, but the thread will be killed after a second. Replacing the conduit code with strictRequestBody makes the problem go away.
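
For reference, a minimal sketch of what I mean by the strictRequestBody variant (the handler name and the use of aeson's decode are illustrative, not my exact code):

import Control.Concurrent         ( threadDelay )
import Data.Aeson                 ( Value, decode )
import Data.ByteString.Lazy.Char8 ( pack )
import Network.HTTP.Types.Status  ( ok200 )
import Network.Wai                ( responseLBS, strictRequestBody )

-- Read the whole body strictly, then parse it, instead of streaming it
-- through a conduit parser. With this handler the timeout is paused as
-- expected and the thread is not killed.
appStrict rq k = do
    body <- strictRequestBody rq
    print (decode body :: Maybe Value)
    threadDelay 2000000
    k $ responseLBS ok200 [] (pack "OK")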

I'm a little suspicious of timeoutBody because it seems to read but never write an IORef, but I don't know if that is the direct culprit.

Thanks for taking a look at this, and apologies if I've missed something obvious!

@snoyberg
Member

Haven't tested this yet, but just looking at timeoutBody, I think your suspicion about it being buggy is correct.

@snoyberg
Member

I tested it; I had to increase the threadDelay to 5 seconds, but I did reproduce the problem.

@snoyberg
Member

OK, I've figured out what's going on here... and it's tricky. There's a subtle difference between the conduit and non-conduit version. The former is feeding data into the parser. Once the parser is done, the conduit stops. Termination therefore happens as soon as the JSON message is fully consumed. This means that the code ends up making just a single call to requestBody, resulting in a ByteString containing {}, and then stops.

With strictRequestBody, we consume everything until we get an empty ByteString. At that point, we know that slowloris protection is no longer needed, and therefore the timeout is paused. In conduit, however, we never consume that empty ByteString, and therefore the protection remains active.

There's an easy workaround for this; after getting your JSON value, you can run:

sourceRequestBody rq $$ sinkNull

This will force the rest of the body (in this case, nothing) to be consumed, and the timeout protection will be disabled. To be clear: this is a workaround, because warp is not handling this situation correctly.
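
For example, applied to the handler from your report (sinkNull here being the one from Data.Conduit.List; this is a sketch, not tested against your exact code):

import qualified Data.Conduit.List as CL ( sinkNull )

app rq k = do v <- sourceRequestBody rq $$ sinkParser json'
              sourceRequestBody rq $$ CL.sinkNull  -- drain whatever remains of the body
              print v
              threadDelay 2000000
              k $ responseLBS ok200 [] (pack "OK")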

I need to think a bit more carefully about the slowloris protection to be certain, but there may not be a good resolution to this problem. If we simply pause the timeout handler each time we get a chunk, we could be attacked by having a slow stream of 1-byte chunks. So this may just need to be a "known deficiency."

To start off with though: can you try out my workaround and see if it solves your actual problem?

Thanks for the very clear bug report and reproducing case.

@adamgundry
Author

Once again, your response time is amazing, thanks! I wondered if there might be something like this going on, and it does seem tricky to resolve cleanly. In the meantime, I can confirm that your workaround solves the original problem.

snoyberg added a commit that referenced this issue Mar 20, 2015:

This addresses the common case of a non-chunked request body. Previously, we would wait until a zero-length `ByteString` is returned, but that is suboptimal for some cases. For more information, see issue #351.

@snoyberg
Member

OK, I came up with a 95% solution to the problem. For the case of non-chunked request bodies (which is the usual case), we can determine that the request is done without an extra read, which I've now implemented. I added a note about the general problem of partially-consumed request bodies to the documentation.
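
Conceptually, the idea looks something like the following sketch (this is not Warp's actual code; getChunk and pause stand in for internals): when the body length is known up front, count the bytes handed to the application and pause the timeout as soon as the count reaches the declared length, with no extra read required.

import Control.Monad   ( when )
import Data.IORef      ( newIORef, readIORef, writeIORef )
import qualified Data.ByteString as S

-- Sketch only: wrap a chunk reader for a body of known length so that the
-- slowloris timeout is paused the moment the last byte has been delivered.
mkKnownLengthBody :: Int -> IO S.ByteString -> IO () -> IO (IO S.ByteString)
mkKnownLengthBody total getChunk pause = do
    remaining <- newIORef total
    return $ do
        left <- readIORef remaining
        if left <= 0
            then return S.empty
            else do
                bs <- getChunk
                let left' = left - S.length bs
                writeIORef remaining left'
                when (left' <= 0) pause  -- body fully consumed: stop protecting
                return bs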

One final thing I could do is add a function to the vault to pause the timeout to give users more explicit control. Since that's likely not too hard, I'll give it a shot.

@snoyberg
Member

OK, pauseTimeout function is now available as well, and I've tested that it seems to resolve the issue here too. I think this is an acceptable tradeoff now, and I'll go ahead and release to Hackage.
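
In a handler like the one above it would look roughly like this (assuming the exported signature pauseTimeout :: Request -> IO (); just a sketch):

import Network.Wai.Handler.Warp ( pauseTimeout )

app rq k = do v <- sourceRequestBody rq $$ sinkParser json'
              pauseTimeout rq  -- explicitly end slowloris protection before the slow work
              print v
              threadDelay 2000000
              k $ responseLBS ok200 [] (pack "OK")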

Again, thanks for the solid bug report and for testing.

@spl
Contributor

spl commented Mar 20, 2015

I just upgraded to Warp 3.0.10, which includes this change. I'm already using Network.Wai.strictRequestBody, and I'm seeing “Thread killed by Warp's timeout reaper” in my Warp exception handler.

I'm not sure if my problem is related to this issue. I just encountered it and haven't looked into it deeply. Just wanted to check in with you to see what you thought.

@snoyberg
Member

Were you getting that behavior with the older Warp as well, or is this a new regression?

@spl
Contributor

spl commented Mar 20, 2015

@snoyberg TBH, I'm not sure. Several things have changed simultaneously, one of those changes being the addition of the exception handler.

I'm building the same app with 3.0.9.3 right now to test it. It should be fairly easy to determine if the issue was introduced in 3.0.10 because the reaper is now coming for regular visits with my current test setup.

@spl
Contributor

spl commented Mar 20, 2015

Sadly (for me), it's not a problem introduced by this change. The response is not taking longer than 1 second in most cases, and the exception comes much later. There's a fork in the middle. Any suggestions on how to debug this?

@spl
Contributor

spl commented Mar 20, 2015

Okay, I may be missing something obvious, too. I'm able to reproduce my problem with a very simple program, adapted from @adamgundry's code:

module Test (main) where

import Data.ByteString.Lazy.Char8 ( pack )
import Data.Time.Clock            ( getCurrentTime )
import Network.HTTP.Types.Status  ( ok200 )
import Network.Wai                ( responseLBS, strictRequestBody )
import Network.Wai.Handler.Warp

main :: IO ()
main = runSettings stgs app
  where
    stgs = setTimeout 2 $ setPort 3000 $ setOnException err defaultSettings
    err _ e = do t <- getCurrentTime
                 putStrLn $ show t ++ "\tError: " ++ show e

    app rq k = do v <- strictRequestBody rq
                  t <- getCurrentTime
                  putStrLn $ show t ++ "\tRequest: " ++ show v
                  k $ responseLBS ok200 [] (pack "OK")

I POSTed a simple input:

$ runghc Test.hs
2015-03-20 21:17:06.614678 UTC  Request: "key=value"
2015-03-20 21:17:10.617833 UTC  Error: Thread killed by Warp's timeout reaper

This is using warp-3.0.10.

@snoyberg
Member

I'm unable to reproduce this; I just get a result immediately. I don't see anything that would cause the server to block long enough for the 2-second timeout to ever be triggered. How are you sending the POST? I'm using the following command:

curl -d key=value http://localhost:3000

@spl
Contributor

spl commented Mar 21, 2015

Interesting. I was using the Postman plugin for Chrome. I just tried it with a GET request from both Chrome and Firefox, and the same reap happens. But with curl, the thread does not get reaped. The issue came up with an AWS EC2 worker receiving messages from SQS via some local server on the worker instance (nginx, I'd guess).

@spl
Contributor

spl commented Mar 21, 2015

Also, the request is always printed immediately no matter what I do. The reaper appears later, presumably after the timeout.

@spl
Contributor

spl commented Mar 21, 2015

I modified the code to print the full request.

# curl -d key=value http://localhost:3000
2015-03-21 21:07:24.959385 UTC  Request: Request {requestMethod = "POST", httpVersion = HTTP/1.1, rawPathInfo = "/", rawQueryString = "", requestHeaders = [("User-Agent","curl/7.37.1"),("Host","localhost:3000"),("Accept","*/*"),("Content-Length","9"),("Content-Type","application/x-www-form-urlencoded")], isSecure = False, remoteHost = 10.211.55.1:52115, pathInfo = [], queryString = [], requestBody = <IO ByteString>, vault = <Vault>, requestBodyLength = KnownLength 9, requestHeaderHost = Just "localhost:3000", requestHeaderRange = Nothing}
# Chrome/Postman
2015-03-21 21:07:38.778795 UTC  Request: Request {requestMethod = "POST", httpVersion = HTTP/1.1, rawPathInfo = "/", rawQueryString = "", requestHeaders = [("Host","localhost:3000"),("Connection","keep-alive"),("Content-Length","9"),("Accept","application/json"),("Cache-Control","no-cache"),("Origin","chrome-extension://fdmmgilgnpjigdojojpjoooidkmcomcm"),("User-Agent","Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36"),("Content-Type","application/x-www-form-urlencoded"),("Accept-Encoding","gzip, deflate"),("Accept-Language","en-US,en;q=0.8")], isSecure = False, remoteHost = 10.211.55.1:52117, pathInfo = [], queryString = [], requestBody = <IO ByteString>, vault = <Vault>, requestBodyLength = KnownLength 9, requestHeaderHost = Just "localhost:3000", requestHeaderRange = Nothing}
2015-03-21 21:07:42.783402 UTC  Error: Thread killed by Warp's timeout reaper

@spl
Contributor

spl commented Mar 21, 2015

It's not the HTTP headers: I used the Chrome headers with curl, and there was no reaping.

@snoyberg
Member

In your real application do you actually have such a low timeout value? Are you able to reproduce the error with the normal timeout? I have a feeling that the failures you're experiencing right now just have to do with chrome keeping a persistent connection.

@spl
Contributor

spl commented Mar 22, 2015

In your real application do you actually have such a low timeout value?

No, we use the default. I only used the low timeout here to reduce the wait before the reaper.

Are you able to reproduce the error with the normal timeout?

Yes. This is the result of using Chrome and removing the setTimeout from the above code:

2015-03-22 08:54:25.642874 UTC  Request: "key=value"
2015-03-22 08:55:25.673206 UTC  Error: Thread killed by Warp's timeout reaper

This reflects approximately what I saw in the real application as well: about a minute of delay before the reaper arrived.

I have a feeling that the failures you're experiencing right now just have to do with chrome keeping a persistent connection.

Okay... I have reproduced the same problem with netcat by forcing a persistent connection (sleeping for 70 seconds after sending the request):

$ (printf "GET / HTTP/1.1\n\n"; sleep 70) | netcat localhost 3000

The result:

2015-03-22 09:13:03.124894 UTC  Request: ""
2015-03-22 09:14:03.161658 UTC  Error: Thread killed by Warp's timeout reaper

I don't understand the issue well enough to know what the problem is. But I have now reproduced it with four different HTTP clients (nginx, Chrome, Firefox, and netcat), so I do think there is a problem somewhere.

Should the timeout handler be independent of the connection persistence? Why is it still hanging around when the request has been fully processed and the response has been sent?

@snoyberg
Member

Are you actually experiencing any negative effects from this, or is it just an error message? This sounds to me like Warp is just letting you know that a persistent connection was killed by the timeout reaper, but that no problem actually exists. I just realized that you ignored the first parameter to the exception handler, which would indeed let you know if this exception happened during a request or just to a persistent connection.
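
Something along these lines (the handler name is illustrative):

import Control.Exception ( SomeException )
import Network.Wai       ( Request )

-- For setOnException: Just means the exception happened while a request was
-- being processed; Nothing typically means an idle persistent connection was
-- reaped.
logWarpException :: Maybe Request -> SomeException -> IO ()
logWarpException (Just _req) e = putStrLn $ "Exception during a request: " ++ show e
logWarpException Nothing     e = putStrLn $ "Exception outside any request (e.g. idle connection): " ++ show e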

@spl
Contributor

spl commented Mar 22, 2015

I just realized that you ignored the first parameter to the exception handler, which would indeed let you know if this exception happened during a request or just to a persistent connection.

I added printing the Maybe Request parameter, and it's a Nothing for the reaper. I don't know how the exception is thrown in relation to the request, so I wouldn't know the significance of that parameter being Nothing or Just.

Are you actually experiencing any negative effects from this, or is it just an error message?

I don't see any obvious direct negative effects due to the exception.

This sounds to me like Warp is just letting you know that a persistent connection was killed by the timeout reaper, but that no problem actually exists.

First, have you reproduced the issue? Or is it only me?

If there is no real error, I think no exception should be thrown. There is a problem with having an exception that isn't actually an error. I can choose to “ignore” it by catching it and doing nothing. But what if, later, there is an actual error, and I miss it?

Is it possible to change the timeout handler? I don't claim to understand much related to protecting against the slowloris attack. But, again, I wonder (and these may be meaningless questions for lack of understanding): Should the timeout handler be independent of connection persistence? Does the timeout exception mean that the connection was closed? Why is the timeout still around when the request has been fully processed and the response sent?

If it is not possible to change the timeout handler, then I would like to disable it. In my case, I know exactly who's making the request and the server is firewalled against everything else.

@snoyberg
Member

If you look at the default exception handler, you'll see that these exceptions are by default not displayed. They do not indicate an error in all cases, but instead indicate that "an exception was thrown." The exception mechanism is the only means by which we can preemptively kill a thread.

I don't really see what should be done here, beyond possibly improving documentation. The defaults are correct, and we certainly should make it possible for users to get more details about threads being killed by the reaper if they need to know that.

@spl
Contributor

spl commented Mar 22, 2015

Fair enough.

So, I currently see the following options:

  1. Catch TimeoutThread if I only want to ignore the timeout.
  2. Use defaultShouldDisplayException to ignore exceptions, including TimeoutThread, that do not always indicate errors (a sketch follows below).
  3. Use Network.Wai.Handler.Warp.Timeout.stopManager to disable the timeout if I don't want to generate a TimeoutThread exception in any case.
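
For example, option (2) could look roughly like this (the logger name is mine, and I haven't tested it):

import Control.Exception        ( SomeException )
import Network.Wai              ( Request )
import Network.Wai.Handler.Warp ( defaultShouldDisplayException )

-- Only log exceptions that defaultShouldDisplayException considers worth
-- showing; TimeoutThread and friends are dropped silently.
quietLogger :: Maybe Request -> SomeException -> IO ()
quietLogger _ e
  | defaultShouldDisplayException e = putStrLn $ "Warp exception: " ++ show e
  | otherwise                       = return ()

-- installed with: setOnException quietLogger defaultSettings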

@spl
Contributor

spl commented Mar 22, 2015

Use Network.Wai.Handler.Warp.Timeout.stopManager to disable the timeout if I don't want to generate a TimeoutThread exception in any case.

Or does this not even do what I would expect?

-- ...
import qualified Network.Wai.Handler.Warp.Timeout as Timeout

main :: IO ()
main = do
    mgr <- Timeout.initialize 30000000
    Timeout.stopManager mgr
    runSettings (stgs mgr) app
  where
    stgs mgr = setManager mgr $ setTimeout 2 $ setPort 3000 $ setOnException err defaultSettings
-- ...

I still get a visit from the reaper.

@snoyberg
Member

I've never tried to implement (3), so I'm not surprised it doesn't work. I'd recommend instead just lengthening the timeout duration beyond what most clients use for persistent connections. However, (1) and (2) sound much better to me.

@anacrolix

I don't think pauseTimeout works as intended. It sets the pauseTimeoutKey value to Paused, which is promptly lost on resume and tickle, and those occur outside the caller's control.

@anacrolix

I can still confirm that pauseTimeout does not work :)
