Propagate Connection Closed Information up to top-level (fix #465) #545

sebsto · 2025-08-03T17:03:11Z

This PR implements a mechanism to propagate connection loss information from the Lambda runtime client to the runtime loop, enabling termination without backtrace when the connection to the Lambda control plane (or a Mock Server) is lost.

The changes are:

When the connection is lost, ChannelHandlerDelegate.channelInactive() now correctly calls resume(throwing:) on the pending continuation, for all states (.waitingForNextInvocation and .sentResponse). This eliminates the hangs on connection loss.
I added top-level error handling in LambdaRuntime._run().
I added a unit test that checks that either LambdaRuntimeError.connectionToControlPlaneLost, a ChannelError, or an IOError is thrown when the server closes the connection.
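To illustrate the idea of top-level error handling, here is a minimal, self-contained sketch. It is not the code in LambdaRuntime._run(); `SketchRuntimeError`, `SketchOutcome`, and `runTopLevel` are hypothetical names, and only the case name `connectionToControlPlaneLost` is borrowed from this PR. The point is that an error thrown by the loop is converted into a value at one place, rather than escaping as an uncaught error.

```swift
// Hypothetical stand-in for the runtime error; only the case name
// `connectionToControlPlaneLost` comes from this PR.
enum SketchRuntimeError: Error, Equatable {
    case connectionToControlPlaneLost
}

// Hypothetical summary of how the run loop ended.
enum SketchOutcome: Equatable {
    case finished
    case connectionLost
    case failed(String)
}

/// Runs `loop` and maps any thrown error to an outcome, so the caller can
/// log it and stop instead of crashing on an uncaught error.
func runTopLevel(_ loop: () throws -> Void) -> SketchOutcome {
    do {
        try loop()
        return .finished
    } catch SketchRuntimeError.connectionToControlPlaneLost {
        // Expected when the control plane (or a mock server) goes away.
        return .connectionLost
    } catch {
        // Any other top-level error is reported, not rethrown.
        return .failed(String(describing: error))
    }
}
```

Handling the error in a single wrapper is what lets both entry points share the same behavior.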

================================================
The below is the original PR description. It is kept here for history.

This PR implements a mechanism to propagate connection loss information from the Lambda runtime client to the runtime loop using EventLoopFuture, enabling termination without backtrace when the connection to the Lambda control plane (or a Mock Server) is lost.

This PR fixes #465

Changes Made
Core Changes:

LambdaRuntimeClient: Added futureConnectionClosed property to signal connection loss via EventLoopFuture
Lambda.swift: Modified runLoop to monitor connection status and throw an error on connection loss
LambdaRuntime and ServiceLifecycle Integration: Modified to properly handle connection loss (and other top-level) errors and to prevent the application from crashing with a backtrace.

Termination without a backtrace

When connection to Lambda control plane is lost, the runtime exits cleanly
Prevents hanging or zombie processes in testing/development environments
Maintains compatibility with AWS Lambda service behavior (new runtime environment creation)

Use Cases

Performance Testing: Prevents hanging when testing against MockServer
Development: Cleaner shutdown behavior during local testing
Production: Aligns with AWS Lambda service expectations for connection loss scenarios

adam-fowler

I think you'd be better to use the state machine to work out state, instead of adding additional variables, outside of the state machine.

Why don't you just check for the disconnected state in LambdaRuntimeClient.nextInvocation and if it is set throw a LambdaRuntimeError(code: .connectionToControlPlaneLost)

sebsto · 2025-08-06T16:21:10Z

Why don't you just check for the disconnected state in LambdaRuntimeClient.nextInvocation and if it is set throw a LambdaRuntimeError(code: .connectionToControlPlaneLost)

@adam-fowler This was already implemented (not by me). But nextInvocation() is not called while the client is blocked on sendNextRequest(). Basically, when the client is waiting for a call to GET /next to return, the code path does not go through nextInvocation().

sebsto · 2025-08-06T16:28:36Z

I think you'd be better to use the state machine to work out state, instead of adding additional variables, outside of the state machine.

That doesn't work in this context because I need to propagate the error from a callback inside LambdaRuntimeClient up to Lambda. The states are private to the LambdaRuntimeClient and the LambdaChannelHandler and cannot be accessed from Lambda.

Throwing errors is not an option either, because the only place where I can trap the connectionClosed event while the client is blocked on GET /next is inside a callback, channel.closeFuture.whenComplete { result ... }. This block cannot throw.
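For readers less familiar with this constraint, here is a minimal sketch of the general technique of surfacing an error from a non-throwing callback through a continuation. It does not use SwiftNIO; `observeClose`, `SketchConnectionLost`, and `sketchNextInvocation` are hypothetical names, and the callback fires immediately only to keep the example deterministic.

```swift
struct SketchConnectionLost: Error {}

/// Hypothetical callback-style API: `onClosed` cannot throw, it can only
/// be invoked. For this sketch the "peer" closes the connection right away.
func observeClose(_ onClosed: @escaping @Sendable () -> Void) {
    onClosed()
}

/// Suspends until the close callback fires, then throws to the caller.
func sketchNextInvocation() async throws -> String {
    try await withCheckedThrowingContinuation { (continuation: CheckedContinuation<String, Error>) in
        observeClose {
            // The callback itself cannot throw, so the error travels
            // through the continuation to the suspended caller instead.
            continuation.resume(throwing: SketchConnectionLost())
        }
    }
}
```

The continuation must be resumed exactly once, which is why the ownership of a pending continuation matters so much in the rest of this discussion.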

adam-fowler · 2025-08-07T08:38:30Z

@adam-fowler This was already implemented (not by me). But nextInvocation() is not called while the client is blocked on sendNextRequest(). Basically, when the client is waiting for a call to GET /next to return, the code path does not go through nextInvocation().

You are checking for the existence of the futureConnectionClosed just before calling nextInvocation. I don't see why you can't check for the .disconnected state inside nextInvocation and then throw your LambdaRuntimeError(code: .connectionToControlPlaneLost) at that point.

sebsto · 2025-08-07T11:00:18Z

Because connectionState is private in LambdaRuntimeClient, it's not exposed on the protocol. Making it package would require opening up a lot of internal types (such as ConnectionState, the ChannelDelegate, ...). It would also require removing the @usableFromInline on Lambda.Run() and dancing with the actor isolation. I'll send a modification to let you see the amount of changes that are needed (unless I'm missing something).

adam-fowler · 2025-08-07T13:58:10Z

Can you not add this check at the top of LambdaRuntimeClient.nextInvocation?

    @usableFromInline
    func nextInvocation() async throws -> (Invocation, Writer) {
        if case .disconnected = self.connectionState {
            throw LambdaRuntimeError(code: .connectionToControlPlaneLost)
        }
        return try await withTaskCancellationHandler {
            ...

sebsto · 2025-08-07T14:09:37Z

~~Yes! I was trying to check that in the Lambda.loop, which caused all types of challenges to get this value outside of the isolated context. You're correct, your approach is much easier.~~

Nope, I have to think more about it. The first time we arrive in this method, we are in a disconnected state :-)

sebsto · 2025-08-07T14:51:10Z

Here is a new version.

I introduced a new connectionState : lostConnection (because .disconnected also means "not yet connected")
I moved the check back to Lambda because if we do the check in LambdaRuntimeClient.nextInvocation(), the method still hangs on lost connections.
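To make the ambiguity concrete, here is a small sketch (not the PR's code) of why a single .disconnected state cannot be used for the check: the very first call also starts from .disconnected. `SketchConnectionState`, `SketchClient`, and `SketchClientError` are hypothetical names used only for illustration.

```swift
enum SketchClientError: Error, Equatable {
    case connectionToControlPlaneLost
}

enum SketchConnectionState: Equatable {
    case disconnected     // not connected yet
    case connected
    case lostConnection   // was connected, then the peer closed
}

struct SketchClient {
    var state: SketchConnectionState = .disconnected

    /// Connects lazily on first use, but refuses to continue once the
    /// connection has been lost.
    mutating func nextInvocation() throws -> String {
        switch state {
        case .lostConnection:
            throw SketchClientError.connectionToControlPlaneLost
        case .disconnected:
            state = .connected   // first call: "not yet connected" is fine
        case .connected:
            break
        }
        return "invocation"
    }

    /// Called when the peer closes an established connection.
    mutating func peerClosed() {
        if state == .connected { state = .lostConnection }
    }
}
```

With only .disconnected, the first branch would throw on the initial call, which is the problem described above.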

sebsto · 2025-08-07T22:12:35Z

Here is a new version.

I introduced a new connectionState : lostConnection (because .disconnected also means "not yet connected")
The check for a lost connection is in LambdaRuntimeClient.nextInvocation().
When the connection is lost, the callback now correctly calls resume(throwing:) for the pending continuation, eliminating the hangs I was observing from time to time.
DRY: I moved the top-level error handling to the LambdaRuntime._run() method instead of duplicating it in the two run() functions (one for service lifecycle and one regular).

sebsto · 2025-08-24T12:03:37Z

I further simplified by removing an unnecessary extra state I introduced in the previous commit.
I removed other unnecessary code to keep this PR as short as possible.

Core changes are

LambdaRuntimeClient: lines 366 - 378
LambdaChannelHandler: line 900
LambdaRuntime.swift: lines 109 - 119

23 lines of code.
The rest are empty lines and logger statements.

adam-fowler · 2025-08-27T09:34:20Z

Sources/AWSLambdaRuntime/LambdaRuntimeClient.swift

+
+                    // resume any pending continuation on the handler
+                    if case .connected(_, let handler) = runtimeClient.connectionState {
+                        if case .connected(_, let lambdaState) = handler.state {


This shouldn't be necessary, as LambdaChannelHandler.channelInactive will have resumed the continuation (with error ioOnClosedChannel). None of the tests trigger this code either.

LambdaChannelHandler.channelInactive is indeed called when the server closes a connection, but not in the state waitingForNextInvocation.

When the server closes the connection, the state in the LambdaChannelHandler is connected and LambdaState.idle.

To reproduce what I observe:

MAX_INVOCATIONS=3 MODE=json PORT=7777 LOG_LEVEL=trace swift run MockServer

From another terminal, on the sebsto/shutdown_on_lost_connection branch:

cd Examples/HelloJSON ; LAMBDA_USE_LOCAL_DEPS=../.. LOG_LEVEL=trace AWS_LAMBDA_RUNTIME_API=127.0.0.1:7777 swift run

The MockServer will start and will shutdown after three invocations.
The runtime will pull and process three events from the MockServer, and then will receive a connection closed event when trying to fetch the next event.

You will see that sometimes the runtime catches the closed connection and shuts down gracefully with Connection refused (errno: 61), and sometimes it hangs. In both cases, LambdaChannelHandler.channelInactive() is called with self.state == .connected(_, .idle) and lastError == nil.

The change to .idle happens after the response has been sent on this line

It looks like we have two behaviors, depending on when the client detects that the connection is closed.

Either the response has been sent, LambdaChannelHandler is in the .idle state, and the runtime detects that the connection is closed before nextInvocation() has a chance to change the state and send the next request. In that case, nextInvocation() correctly reports Connection refused (errno: 61) (we can trap the error by adding a do catch block here).

Or nextInvocation() has already changed the state: it created a new promise and switched the state again to .waitingForNextInvocation (on this line).

This is where we have a problem, because this new promise is never fulfilled. We are past the call to LambdaChannelHandler.channelInactive(), and there are no throwing functions that we can wrap in a do catch to trap the error.

Instead of capturing the new unfulfilled promise in channel.closeFuture.whenComplete, I can capture it in LambdaRuntimeClient.channelClosed() which is called by the latter. It's not a big difference.

Anyway, when we detect a connection is closed in the LambdaRuntimeClient, sometimes it happens after the call to LambdaChannelHandler.channelInactive()

And this is, IMHO, the code that allows the promise to be fulfilled.

    if case .connected(_, let handler) = self.connectionState {
        if case .connected(_, let lambdaState) = handler.state {
            if case .waitingForNextInvocation(let continuation) = lambdaState {
                continuation.resume(throwing: LambdaRuntimeError(code: .connectionToControlPlaneLost))
            }
        }
    }

Can you write a test that replicates this code being called?

Done, b0234f5

As you noticed, the connectionToControlPlaneLost error was not triggered by the previous commit. This commit adds a test to make sure the two types of errors are triggered: 156e6c7

Hello @adam-fowler, can you check the updated test? It covers the two behaviours I'm observing (connection lost in the .idle state and in the .waitingForNextInvocation state).

OK, I had a look at the code while running your new test and found what I think is an issue. The channelInactive only cleans up if the state is .connected(_, .waitingForNextInvocation). I think the code should be the following. It is better to do the cleanup in channelInactive than here. The runtime client shouldn't be cleaning up after the channel handler state machine; that is up to the channel handler.

    func channelInactive(context: ChannelHandlerContext) {
        // fail any pending responses with last error or assume peer disconnected
        switch self.state {
        case .connected(_, let lambdaState):
            switch lambdaState {
            case .waitingForNextInvocation(let continuation):
                continuation.resume(throwing: self.lastError ?? ChannelError.ioOnClosedChannel)
            case .sentResponse(let continuation):
                continuation.resume(throwing: self.lastError ?? ChannelError.ioOnClosedChannel)
            case .idle, .sendingResponse, .waitingForResponse:
                break
            }
            self.state = .disconnected
        default:
            break
        }

        // we don't need to forward channelInactive to the delegate, as the delegate observes the
        // closeFuture
        context.fireChannelInactive()
    }
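The behavior of the suggestion above can be modeled without SwiftNIO. The following is a simplified, standalone sketch: a closure stands in for the Swift continuation, the state enums carry fewer associated values than the real ones, and every `Sketch…` name is hypothetical. It only illustrates that a pending continuation is failed exactly once and that the handler ends up disconnected.

```swift
enum SketchChannelError: Error, Equatable {
    case ioOnClosedChannel
}

// Stand-in for a continuation: a closure that receives the outcome.
typealias SketchContinuation = (Result<String, Error>) -> Void

enum SketchLambdaState {
    case idle
    case waitingForNextInvocation(SketchContinuation)
    case sendingResponse
    case sentResponse(SketchContinuation)
}

enum SketchHandlerState {
    case disconnected
    case connected(SketchLambdaState)
}

struct SketchHandler {
    var state: SketchHandlerState = .connected(.idle)
    var lastError: Error? = nil

    /// Fails any pending continuation, then moves to `.disconnected`.
    mutating func channelInactive() {
        guard case .connected(let lambdaState) = state else { return }
        let error: Error = lastError ?? SketchChannelError.ioOnClosedChannel
        switch lambdaState {
        case .waitingForNextInvocation(let continuation),
             .sentResponse(let continuation):
            continuation(.failure(error))
        case .idle, .sendingResponse:
            break   // nobody is suspended, nothing to resume
        }
        state = .disconnected
    }
}
```

Because the state is set to `.disconnected` after the first call, a second `channelInactive()` cannot resume the same continuation again.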

@adam-fowler Good catch. I moved the promise fulfilment code to channelInactive using the code you suggested. You're correct: this addresses the root cause.

sebsto · 2025-08-28T06:24:39Z

In French, we say "la nuit porte conseil", which roughly means "good advice comes during the night".
I refactored the test to use a timeout and a CancellationError.
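As an illustration of the timeout idea, here is a generic sketch built on a throwing task group. It is not the helper used in this PR's test; `withSketchTimeout` is a hypothetical name, and throwing CancellationError on expiry is an assumption based on the comment above.

```swift
/// Runs `operation` and throws `CancellationError` if it has not finished
/// within `nanoseconds`. Whichever child task finishes first wins.
func withSketchTimeout<T: Sendable>(
    nanoseconds: UInt64,
    _ operation: @escaping @Sendable () async throws -> T
) async throws -> T {
    try await withThrowingTaskGroup(of: T.self) { group in
        // Child 1: the real work.
        group.addTask { try await operation() }
        // Child 2: the timer, which always ends by throwing.
        group.addTask {
            try await Task.sleep(nanoseconds: nanoseconds)
            throw CancellationError()
        }
        // Cancel the slower child once the first one has completed.
        defer { group.cancelAll() }
        guard let first = try await group.next() else {
            throw CancellationError()
        }
        return first
    }
}
```

A test can then wrap the call that is expected to fail in this helper, so a hang turns into a test failure rather than a stuck test run.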

adam-fowler

Looks good

sebsto · 2025-09-01T10:21:17Z

Thank you @adam-fowler for your patience and your continuous mentoring. This is highly appreciated.

sebsto added 3 commits August 1, 2025 10:49

return HTTP accepted on error

2020d6b

force exit() when we loose connection to Lambda service

6e01c6e

propagate the connection closed info through a Future

166cd46

sebsto added this to the 2.0 milestone Aug 3, 2025

sebsto self-assigned this Aug 3, 2025

sebsto added the 🆕 semver/minor Adds new public API. label Aug 3, 2025

fix typos

a69ed54

sebsto changed the title ~~Propagate Connection Closed Information Through Future, up to top-level~~ [WIP] Propagate Connection Closed Information Through Future, up to top-level Aug 3, 2025

sebsto marked this pull request as draft August 3, 2025 17:44

sebsto changed the title ~~[WIP] Propagate Connection Closed Information Through Future, up to top-level~~ [WIP] Propagate Connection Closed Information up to top-level Aug 3, 2025

sebsto added 2 commits August 3, 2025 20:07

fix unit tests

04d9fc7

Merge branch 'main' into sebsto/shutdown_on_lost_connection

092da82

sebsto changed the title ~~[WIP] Propagate Connection Closed Information up to top-level~~ Propagate Connection Closed Information up to top-level Aug 3, 2025

sebsto requested a review from adam-fowler August 3, 2025 18:12

sebsto marked this pull request as ready for review August 3, 2025 18:13

sebsto changed the title ~~Propagate Connection Closed Information up to top-level~~ Propagate Connection Closed Information up to top-level (fix #465) Aug 3, 2025

sebsto added 4 commits August 4, 2025 08:26

Merge branch 'main' into sebsto/shutdown_on_lost_connection

5efb706

Merge branch 'main' into sebsto/shutdown_on_lost_connection

1d98a7c

Merge branch 'main' into sebsto/shutdown_on_lost_connection

4b23d4f

Merge branch 'main' into sebsto/shutdown_on_lost_connection

5822e0a

adam-fowler reviewed Aug 5, 2025

simplify by checking connection state in the nextInvocation() call

025a0e5

introducing a new connection state "lostConnection"

ce8b567

sebsto added 4 commits August 7, 2025 23:51

fix a case where continuation was resumed twice

9dcb4b3

fix unit test

f2d94a2

swift format

852391e

remove comment on max payload size

b13bf5c

sebsto added 3 commits August 24, 2025 13:38

Merge branch 'main' into sebsto/shutdown_on_lost_connection

1aa07b1

further simplify by removing the new state lostConnection

9c283c3

remove unecessary code

a620a2f

restrict access to channel handler state variable

f671228

adam-fowler reviewed Aug 27, 2025

sebsto added 9 commits August 27, 2025 16:50

add a unit test to verify that an error is thrown when the connection…

b0234f5

… is closed, whatever is the state of the LambdaChannelHandler

swift-format

482e09e

make sure connectionToControlPlaneLost error is triggered by the test

156e6c7

give more time to the server to close the connection

a0b1d57

add catch for IOError

54459e0

swift-format

89cd259

remove compilation warning

6b07955

improve test with a timeout

86ccbf4

remove debugging print statements

cbf9c5e

sebsto mentioned this pull request Aug 28, 2025

[core] - move Timeout code from example to core #553

Closed

sebsto added 3 commits August 28, 2025 09:34

add logger trace

9a70915

fulfill continuation in channelInactive() rather than channel.closeFu…

c54cb2f

…ture.whenComplete

remove private(set) on LambdaChannelHandler.state

ccb3d14

adam-fowler approved these changes Sep 1, 2025

sebsto merged commit efc4cd1 into swift-server:main Sep 1, 2025
34 of 35 checks passed

sebsto deleted the sebsto/shutdown_on_lost_connection branch September 1, 2025 10:21
