Timeout left running after use of sourceRequestBody #351

Closed · adamgundry opened this issue Mar 19, 2015 · 26 comments

@adamgundry

I'm debugging a Warp/conduit server that is failing with “Thread killed by Warp's timeout reaper” when a request causes a slow operation in the application. It seems that after a use of sourceRequestBody, the timeout is left running when control returns to the application, even if the client sends a complete request. Here's a minimal example:

import Control.Concurrent         ( threadDelay )
import Data.Aeson                 ( json' )
import Data.ByteString.Lazy.Char8 ( pack )
import Data.Conduit               ( ($$) )
import Data.Conduit.Attoparsec    ( sinkParser )
import Network.HTTP.Types.Status  ( ok200 )
import Network.Wai                ( responseLBS )
import Network.Wai.Conduit        ( sourceRequestBody )
import Network.Wai.Handler.Warp

main :: IO ()
main = runSettings stgs app
  where
    stgs = setTimeout 1 $ setPort 2000 $ setOnExceptionResponse err defaultSettings
    err e = responseLBS ok200 [] (pack $ show e)

    app rq k = do v <- sourceRequestBody rq $$ sinkParser json'
                  print v
                  threadDelay 2000000
                  k $ responseLBS ok200 [] (pack "OK")

If you POST a JSON object (e.g. {}) to this server, it will be printed, but the thread will be killed after a second. Replacing the conduit code with strictRequestBody makes the problem go away.
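
For reference, a minimal sketch of what I mean by the strictRequestBody variant (the handler name and the use of aeson's decode are illustrative, not my exact code):

import Control.Concurrent         ( threadDelay )
import Data.Aeson                 ( Value, decode )
import Data.ByteString.Lazy.Char8 ( pack )
import Network.HTTP.Types.Status  ( ok200 )
import Network.Wai                ( responseLBS, strictRequestBody )

-- Read the whole body strictly, then parse it, instead of streaming it
-- through a conduit parser. With this handler the timeout is paused as
-- expected and the thread is not killed.
appStrict rq k = do
    body <- strictRequestBody rq
    print (decode body :: Maybe Value)
    threadDelay 2000000
    k $ responseLBS ok200 [] (pack "OK")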

I'm a little suspicious of timeoutBody because it seems to read but never write an IORef, but I don't know if that is the direct culprit.

Thanks for taking a look at this, and apologies if I've missed something obvious!

@snoyberg
Member

Haven't tested this yet, but just looking at timeoutBody, I think your suspicion about it being buggy is correct.

@snoyberg
Member

I tested it; I had to increase the threadDelay to 5 seconds, but I did reproduce the problem.

@snoyberg
Member

OK, I've figured out what's going on here... and it's tricky. There's a subtle difference between the conduit and non-conduit version. The former is feeding data into the parser. Once the parser is done, the conduit stops. Termination therefore happens as soon as the JSON message is fully consumed. This means that the code ends up making just a single call to requestBody, resulting in a ByteString containing {}, and then stops.

With strictRequestBody, we consume everything until we get an empty ByteString. At that point, we know that slowloris protection is no longer needed, and therefore the timeout is paused. In conduit, however, we never consume that empty ByteString, and therefore the protection remains active.

There's an easy workaround for this; after getting your JSON value, you can run:

sourceRequestBody rq $$ sinkNull

This will force the rest of the body (in this case, nothing) to be consumed, and the timeout protection will be disabled. To be clear: this is a workaround, because warp is not handling this situation correctly.
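
For example, applied to the handler from your report (sinkNull here being the one from Data.Conduit.List; this is a sketch, not tested against your exact code):

import qualified Data.Conduit.List as CL ( sinkNull )

app rq k = do v <- sourceRequestBody rq $$ sinkParser json'
              sourceRequestBody rq $$ CL.sinkNull  -- drain whatever remains of the body
              print v
              threadDelay 2000000
              k $ responseLBS ok200 [] (pack "OK")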

I need to think a bit more carefully about the slowloris protection to be certain, but there may not be a good resolution to this problem. If we simply pause the timeout handler each time we get a chunk, we could be attacked by having a slow stream of 1-byte chunks. So this may just need to be a "known deficiency."

To start off with though: can you try out my workaround and see if it solves your actual problem?

Thanks for the very clear bug report and reproducing case.

@adamgundry
Author

Once again, your response time is amazing, thanks! I wondered if there might be something like this going on, and it does seem tricky to resolve cleanly. In the meantime, I can confirm that your workaround solves the original problem.

snoyberg added a commit that referenced this issue Mar 20, 2015:

This addresses the common case of a non-chunked request body. Previously, we would wait until a zero-length `ByteString` is returned, but that is suboptimal for some cases. For more information, see issue #351.

@snoyberg
Member

OK, I came up with a 95% solution to the problem. For the case of non-chunked request bodies (which is the usual case), we can determine that the request is done without an extra read, which I've now implemented. I added a note about the general problem of partially-consumed request bodies to the documentation.
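
Conceptually, the idea looks something like the following sketch (this is not Warp's actual code; getChunk and pause stand in for internals): when the body length is known up front, count the bytes handed to the application and pause the timeout as soon as the count reaches the declared length, with no extra read required.

import Control.Monad   ( when )
import Data.IORef      ( newIORef, readIORef, writeIORef )
import qualified Data.ByteString as S

-- Sketch only: wrap a chunk reader for a body of known length so that the
-- slowloris timeout is paused the moment the last byte has been delivered.
mkKnownLengthBody :: Int -> IO S.ByteString -> IO () -> IO (IO S.ByteString)
mkKnownLengthBody total getChunk pause = do
    remaining <- newIORef total
    return $ do
        left <- readIORef remaining
        if left <= 0
            then return S.empty
            else do
                bs <- getChunk
                let left' = left - S.length bs
                writeIORef remaining left'
                when (left' <= 0) pause  -- body fully consumed: stop protecting
                return bs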

One final thing I could do is add a function to the vault to pause the timeout to give users more explicit control. Since that's likely not too hard, I'll give it a shot.

@snoyberg
Member

OK, pauseTimeout function is now available as well, and I've tested that it seems to resolve the issue here too. I think this is an acceptable tradeoff now, and I'll go ahead and release to Hackage.
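
In a handler like the one above it would look roughly like this (assuming the exported signature pauseTimeout :: Request -> IO (); just a sketch):

import Network.Wai.Handler.Warp ( pauseTimeout )

app rq k = do v <- sourceRequestBody rq $$ sinkParser json'
              pauseTimeout rq  -- explicitly end slowloris protection before the slow work
              print v
              threadDelay 2000000
              k $ responseLBS ok200 [] (pack "OK")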

Again, thanks for the solid bug report and for testing.

@spl
Contributor

spl commented Mar 20, 2015

I just upgraded to Warp 3.0.10, which includes this change. I'm already using Network.Wai.strictRequestBody, and I'm seeing “Thread killed by Warp's timeout reaper” in my Warp exception handler.

I'm not sure if my problem is related to this issue. I just encountered it and haven't looked into it deeply. Just wanted to check in with you to see what you thought.

@snoyberg
Member

Were you getting that behavior with the older Warp as well, or is this a new regression?

@spl
Contributor

spl commented Mar 20, 2015

@snoyberg TBH, I'm not sure. Several things have changed simultaneously, one of those changes being the addition of the exception handler.

I'm building the same app with 3.0.9.3 right now to test it. It should be fairly easy to determine if the issue was introduced in 3.0.10 because the reaper is now coming for regular visits with my current test setup.

@spl
Contributor

spl commented Mar 20, 2015

Sadly (for me), it's not a problem introduced by this change. The response is not taking longer than 1 second in most cases, and the exception comes much later. There's a fork in the middle. Any suggestions on how to debug this?

@spl
Contributor

spl commented Mar 20, 2015

Okay, I may be missing something obvious, too. I'm able to reproduce my problem with a very simple program, adapted from @adamgundry's code:

module Test (main) where

import Data.ByteString.Lazy.Char8 ( pack )
import Data.Time.Clock            ( getCurrentTime )
import Network.HTTP.Types.Status  ( ok200 )
import Network.Wai                ( responseLBS, strictRequestBody )
import Network.Wai.Handler.Warp

main :: IO ()
main = runSettings stgs app
  where
    stgs = setTimeout 2 $ setPort 3000 $ setOnException err defaultSettings
    err _ e = do t <- getCurrentTime
                 putStrLn $ show t ++ "\tError: " ++ show e

    app rq k = do v <- strictRequestBody rq
                  t <- getCurrentTime
                  putStrLn $ show t ++ "\tRequest: " ++ show v
                  k $ responseLBS ok200 [] (pack "OK")

I POSTed a simple input:

$ runghc Test.hs
2015-03-20 21:17:06.614678 UTC  Request: "key=value"
2015-03-20 21:17:10.617833 UTC  Error: Thread killed by Warp's timeout reaper

This is using warp-3.0.10.

@snoyberg
Member

I'm unable to reproduce this; I just get a result immediately. I don't see anything that would cause the server to block long enough for the 2-second timeout to ever be triggered. How are you sending the POST? I'm using the following command:

curl -d key=value http://localhost:3000

@spl
Contributor

spl commented Mar 21, 2015

Interesting. I was using the Postman plugin for Chrome. I just tried it with a GET request from both Chrome and Firefox, and the same reap happens. But with curl, the thread does not get reaped. The issue came up with an AWS EC2 worker receiving messages from SQS via some local server on the worker instance (nginx, I'd guess).

@spl
Contributor

spl commented Mar 21, 2015

Also, the request is always printed immediately no matter what I do. The reaper appears later, presumably after the timeout.

@spl
Contributor

spl commented Mar 21, 2015

I modified the code to print the full request.

# curl -d key=value http://localhost:3000
2015-03-21 21:07:24.959385 UTC  Request: Request {requestMethod = "POST", httpVersion = HTTP/1.1, rawPathInfo = "/", rawQueryString = "", requestHeaders = [("User-Agent","curl/7.37.1"),("Host","localhost:3000"),("Accept","*/*"),("Content-Length","9"),("Content-Type","application/x-www-form-urlencoded")], isSecure = False, remoteHost = 10.211.55.1:52115, pathInfo = [], queryString = [], requestBody = <IO ByteString>, vault = <Vault>, requestBodyLength = KnownLength 9, requestHeaderHost = Just "localhost:3000", requestHeaderRange = Nothing}
# Chrome/Postman
2015-03-21 21:07:38.778795 UTC  Request: Request {requestMethod = "POST", httpVersion = HTTP/1.1, rawPathInfo = "/", rawQueryString = "", requestHeaders = [("Host","localhost:3000"),("Connection","keep-alive"),("Content-Length","9"),("Accept","application/json"),("Cache-Control","no-cache"),("Origin","chrome-extension://fdmmgilgnpjigdojojpjoooidkmcomcm"),("User-Agent","Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36"),("Content-Type","application/x-www-form-urlencoded"),("Accept-Encoding","gzip, deflate"),("Accept-Language","en-US,en;q=0.8")], isSecure = False, remoteHost = 10.211.55.1:52117, pathInfo = [], queryString = [], requestBody = <IO ByteString>, vault = <Vault>, requestBodyLength = KnownLength 9, requestHeaderHost = Just "localhost:3000", requestHeaderRange = Nothing}
2015-03-21 21:07:42.783402 UTC  Error: Thread killed by Warp's timeout reaper

@spl
Contributor

spl commented Mar 21, 2015

It's not the HTTP headers: I used the Chrome headers with curl, and there was no reaping.

@snoyberg
Member

In your real application do you actually have such a low timeout value? Are you able to reproduce the error with the normal timeout? I have a feeling that the failures you're experiencing right now just have to do with chrome keeping a persistent connection.

@spl
Contributor

spl commented Mar 22, 2015

In your real application do you actually have such a low timeout value?

No, we use the default. I only used the low timeout here to reduce the wait before the reaper.

Are you able to reproduce the error with the normal timeout?

Yes. This is the result of using Chrome and removing the setTimeout from the above code:

2015-03-22 08:54:25.642874 UTC  Request: "key=value"
2015-03-22 08:55:25.673206 UTC  Error: Thread killed by Warp's timeout reaper

This reflects approximately what I saw in the real application as well: about a minute of delay before the reaper arrived.

I have a feeling that the failures you're experiencing right now just have to do with chrome keeping a persistent connection.

Okay... I have reproduced the same problem with netcat by forcing a persistent connection (sleeping for 70 seconds after sending the request):

$ (printf "GET / HTTP/1.1\n\n"; sleep 70) | netcat localhost 3000

The result:

2015-03-22 09:13:03.124894 UTC  Request: ""
2015-03-22 09:14:03.161658 UTC  Error: Thread killed by Warp's timeout reaper

I don't understand the issue well enough to know what the problem is. But I have now reproduced it with four different HTTP clients (nginx, Chrome, Firefox, and netcat), so I do think there is a problem somewhere.

Should the timeout handler be independent of the connection persistence? Why is it still hanging around when the request has been fully processed and the response has been sent?

@snoyberg
Member

Are you actually experiencing any negative effects from this, or is it just an error message? This sounds to me like Warp is just letting you know that a persistent connection was killed by the timeout reaper, but that no problem actually exists. I just realized that you ignored the first parameter to the exception handler, which would indeed let you know if this exception happened during a request or just to a persistent connection.
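
Something along these lines (the handler name is illustrative):

import Control.Exception ( SomeException )
import Network.Wai       ( Request )

-- For setOnException: Just means the exception happened while a request was
-- being processed; Nothing typically means an idle persistent connection was
-- reaped.
logWarpException :: Maybe Request -> SomeException -> IO ()
logWarpException (Just _req) e = putStrLn $ "Exception during a request: " ++ show e
logWarpException Nothing     e = putStrLn $ "Exception outside any request (e.g. idle connection): " ++ show e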

@spl
Contributor

spl commented Mar 22, 2015

I just realized that you ignored the first parameter to the exception handler, which would indeed let you know if this exception happened during a request or just to a persistent connection.

I added printing the Maybe Request parameter, and it's a Nothing for the reaper. I don't know how the exception is thrown in relation to the request, so I wouldn't know the significance of that parameter being Nothing or Just.

Are you actually experiencing any negative effects from this, or is it just an error message?

I don't see any obvious direct negative effects due to the exception.

This sounds to me like Warp is just letting you know that a persistent connection was killed by the timeout reaper, but that no problem actually exists.

First, have you reproduced the issue? Or is it only me?

If there is no real error, I think no exception should be thrown. There is a problem with having an exception that isn't actually an error. I can choose to “ignore” it by catching it and doing nothing. But what if, later, there is an actual error, and I miss it?

Is it possible to change the timeout handler? I don't claim to understand much related to protecting against the slowloris attack. But, again, I wonder (and these may be meaningless questions for lack of understanding): Should the timeout handler be independent of connection persistence? Does the timeout exception mean that the connection was closed? Why is the timeout still around when the request has been fully processed and the response sent?

If it is not possible to change the timeout handler, then I would like to disable it. In my case, I know exactly who's making the request and the server is firewalled against everything else.

@snoyberg
Member

If you look at the default exception handler, you'll see that these exceptions are by default not displayed. They do not indicate an error in all cases, but instead indicate that "an exception was thrown." The exception mechanism is the only means by which we can preemptively kill a thread.

I don't really see what should be done here, beyond possibly improving documentation. The defaults are correct, and we certainly should make it possible for users to get more details about threads being killed by the reaper if they need to know that.

@spl
Contributor

spl commented Mar 22, 2015

Fair enough.

So, I currently see the following options:

  1. Catch TimeoutThread if I only want to ignore the timeout.
  2. Use defaultShouldDisplayException to ignore exceptions, including TimeoutThread, that do not always indicate errors (a sketch follows below).
  3. Use Network.Wai.Handler.Warp.Timeout.stopManager to disable the timeout if I don't want to generate a TimeoutThread exception in any case.
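
For example, option (2) could look roughly like this (the logger name is mine, and I haven't tested it):

import Control.Exception        ( SomeException )
import Network.Wai              ( Request )
import Network.Wai.Handler.Warp ( defaultShouldDisplayException )

-- Only log exceptions that defaultShouldDisplayException considers worth
-- showing; TimeoutThread and friends are dropped silently.
quietLogger :: Maybe Request -> SomeException -> IO ()
quietLogger _ e
  | defaultShouldDisplayException e = putStrLn $ "Warp exception: " ++ show e
  | otherwise                       = return ()

-- installed with: setOnException quietLogger defaultSettings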

@spl
Contributor

spl commented Mar 22, 2015

Use Network.Wai.Handler.Warp.Timeout.stopManager to disable the timeout if I don't want to generate a TimeoutThread exception in any case.

Or does this not even do what I would expect?

-- ...
import qualified Network.Wai.Handler.Warp.Timeout as Timeout

main :: IO ()
main = do
    mgr <- Timeout.initialize 30000000
    Timeout.stopManager mgr
    runSettings (stgs mgr) app
  where
    stgs mgr = setManager mgr $ setTimeout 2 $ setPort 3000 $ setOnException err defaultSettings
-- ...

I still get a visit from the reaper.

@snoyberg
Member

I've never tried to implement (3), so I'm not surprised it doesn't work. I'd recommend instead just lengthening the timeout duration beyond what most clients use for persistent connections. However, (1) and (2) sound much better to me.

@anacrolix

I don't think pauseTimeout works as intended. It sets the pauseTimeoutKey value to Paused, which is promptly lost on resume and tickle, and those occur outside the caller's control.

@anacrolix

I can still confirm that pauseTimeout does not work :)
