Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

accept(2) error handling #553

Merged
merged 4 commits into from May 24, 2016

Conversation

Projects
None yet
4 participants
@reiddraper
Copy link
Contributor

commented May 18, 2016

This PR contains three commits which deal with an issue we've been having with our production Warp services. Currently, Warp treats all exceptions that can be raised by accept as unrecoverable (other than its special handling of ResourceExhausted, which I've removed for reasons described in 3011cad). However, some errors from the accept syscall are recoverable, and do not imply the entire listening socket is closed or in a bad state. Specifically, accept(2) and accept4(2) will return ECONNABORTED if a connection request is aborted between entering the socket's listen queue, and actually being accepted. In this case, a subsequent call to accept will simply return the next socket on the queue.

Since this was treated as a fatal error before, when this would happen to us in production, the webserver would begin its graceful shutdown routine, but since we use long-lived connections, graceful shutdown would wait indefinitely for HTTP requests that will never finish (we leave an HTTP connection open and stream data). This manifested itself as the entire application hanging indefinitely. In aebd8d4, I've added an option to the Warp Settings for an optional graceful shutdown timeout, so that if there is a fatal error raised by accept, the webserver won't simply hang forever, waiting to shut down.

I've kept this in three commits, so that it's easier to review, and if you only want some of this functionality, it can be cherry-picked.

reiddraper added some commits May 18, 2016

Do not special-handle ResourceExhausted
There is no guarantee that running out of file-descriptors is a
temporary condition, and simply sleeping for a second is a naive
solution. In the general case, it is better to scream loudly and alert
an administrator to this case.
Ignore ECONNABORTED errors during accept
ECONNABORTED is a recoverable error, and only applies to a particular
socket in the accept(2) queue. It is safe to simply try accept'ing the
next socket on the queue.
Add an optional graceful shutdown timeout
For servers with long-lived (or infinite lived) connections, waiting for
graceful shutdown can block the server indefinitely. A new setting is
added, which allows for a time (in seconds) to be applied when
commencing a graceful shutdown.
@reiddraper

This comment has been minimized.

Copy link
Contributor Author

commented May 19, 2016

The failure is a timeout on the BUILD=cabal GHCVER=7.6.3 CABALVER=1.16 HAPPYVER=1.19.5 ALEXVER=3.1.7 build. The other builds all pass.

@@ -347,6 +353,13 @@ setLogger :: (Request -> H.Status -> Maybe Integer -> IO ())
-> Settings -> Settings
setLogger lgr y = y { settingsLogger = lgr }

-- | Set the graceful shutdown timeout. A timeout of `Nothing' will
-- wait indefinitely, and a number, if provided, will be treated as seconds
-- to wait for requests to finish, before shutting down the server entirely.

This comment has been minimized.

Copy link
@MaxGabriel

MaxGabriel May 19, 2016

Member

It'd be good to add -- @since x.x.x or -- Since x.x.x like the other APIs have. I think the next version would be 3.2.8 (Current is 3.2.7, and this PR adds new APIs. Not sure if the ResourceExhausted thing would be considered a breaking change, though)

This comment has been minimized.

Copy link
@reiddraper

reiddraper May 19, 2016

Author Contributor

Added in 504fa37.

@reiddraper

This comment has been minimized.

Copy link
Contributor Author

commented May 20, 2016

I've force-pushed fce3103 to address the formatting issue in the documentation.

@snoyberg

This comment has been minimized.

Copy link
Member

commented May 20, 2016

Looks fine to me. @kazu-yamamoto do you want to review before merge?

@kazu-yamamoto

This comment has been minimized.

Copy link
Contributor

commented May 24, 2016

I support this changes.

@snoyberg snoyberg merged commit fce3103 into yesodweb:master May 24, 2016

1 check failed

continuous-integration/travis-ci/pr The Travis CI build could not complete due to an error
Details
@snoyberg

This comment has been minimized.

Copy link
Member

commented May 24, 2016

Thanks!

@reiddraper reiddraper deleted the helium:bugfix/accept-error-handling branch May 24, 2016

@reiddraper reiddraper restored the helium:bugfix/accept-error-handling branch May 25, 2016

@kazu-yamamoto

This comment has been minimized.

Copy link
Contributor

commented Jul 4, 2016

Warp 3.2.7 has been released.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.