Contributor

@sebsto sebsto commented Aug 3, 2025

This PR implements a mechanism to propagate connection loss information from the Lambda runtime client to the runtime loop, enabling termination without backtrace when the connection to the Lambda control plane (or a Mock Server) is lost.

The changes are:

  • When the connection is lost, the channel.closeFuture.whenComplete callback now correctly calls resume(throwing:) for the pending continuation, eliminating the hangs.

  • I added top-level error handling in LambdaRuntime._run().
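
As a rough illustration of the second change, the sketch below routes every error from a run loop through one top-level wrapper that reports it and returns, instead of letting it escape to the process entry point. `RuntimeError`, `runLoop(maxInvocations:)`, and `runTopLevel(_:)` are hypothetical stand-ins for this example, not the library's types or signatures.

```swift
// Hypothetical stand-in for LambdaRuntimeError; not the library's definition.
enum RuntimeError: Error, Equatable {
    case connectionToControlPlaneLost
}

// Hypothetical run loop: handles a few invocations, then the connection drops.
func runLoop(maxInvocations: Int) throws {
    for _ in 0..<maxInvocations {
        // ... fetch and process one invocation ...
    }
    throw RuntimeError.connectionToControlPlaneLost
}

// Single top-level wrapper that every entry point can share.
// It reports the error and returns it, so nothing propagates past this point.
func runTopLevel(_ body: () throws -> Void) -> Error? {
    do {
        try body()
        return nil
    } catch {
        print("Runtime stopped: \(error)")
        return error
    }
}
```

Calling `runTopLevel { try runLoop(maxInvocations: 3) }` returns the connection-loss error to the caller rather than terminating the process with an uncaught error.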

================================================
The below is the original PR description. It is kept here for history.

This PR implements a mechanism to propagate connection loss information from the Lambda runtime client to the runtime loop using EventLoopFuture, enabling termination without backtrace when the connection to the Lambda control plane (or a Mock Server) is lost.

This PR fixes #465

Changes Made
Core Changes:

  • LambdaRuntimeClient: Added futureConnectionClosed property to signal connection loss via EventLoopFuture
  • Lambda.swift: Modified runLoop to monitor connection status and throw an error on connection loss
  • LambdaRuntime and ServiceLifecycle Integration: Modified to properly handle connection loss (and other top-level) errors and prevent the application from crashing with a backtrace.

Termination without a backtrace

  • When connection to Lambda control plane is lost, the runtime exits cleanly
  • Prevents hanging or zombie processes in testing/development environments
  • Maintains compatibility with AWS Lambda service behavior (new runtime environment creation)

Use Cases

  • Performance Testing: Prevents hanging when testing against MockServer
  • Development: Cleaner shutdown behavior during local testing
  • Production: Aligns with AWS Lambda service expectations for connection loss scenarios

@sebsto sebsto added this to the 2.0 milestone Aug 3, 2025
@sebsto sebsto self-assigned this Aug 3, 2025
@sebsto sebsto added the 🆕 semver/minor Adds new public API. label Aug 3, 2025
@sebsto sebsto changed the title Propagate Connection Closed Information Through Future, up to top-level [WIP] Propagate Connection Closed Information Through Future, up to top-level Aug 3, 2025
@sebsto sebsto marked this pull request as draft August 3, 2025 17:44
@sebsto sebsto changed the title [WIP] Propagate Connection Closed Information Through Future, up to top-level [WIP] Propagate Connection Closed Information up to top-level Aug 3, 2025
@sebsto sebsto changed the title [WIP] Propagate Connection Closed Information up to top-level Propagate Connection Closed Information up to top-level Aug 3, 2025
@sebsto sebsto requested a review from adam-fowler August 3, 2025 18:12
@sebsto sebsto marked this pull request as ready for review August 3, 2025 18:13
@sebsto sebsto changed the title Propagate Connection Closed Information up to top-level Propagate Connection Closed Information up to top-level (fix #465) Aug 3, 2025
Member

@adam-fowler adam-fowler left a comment

I think you'd be better off using the state machine to work out state, instead of adding additional variables outside of the state machine.

Why don't you just check for the disconnected state in LambdaRuntimeClient.nextInvocation and, if it is set, throw a LambdaRuntimeError(code: .connectionToControlPlaneLost)?

@sebsto
Contributor Author

sebsto commented Aug 6, 2025

Why don't you just check for the disconnected state in LambdaRuntimeClient.nextInvocation and, if it is set, throw a LambdaRuntimeError(code: .connectionToControlPlaneLost)?

@adam-fowler This was already implemented (not by me). But nextInvocation() is not called while blocked on sendNextRequest(). Basically, while the client is waiting for a call to GET /next to return, the code path does not go through nextInvocation().

@sebsto
Contributor Author

sebsto commented Aug 6, 2025

I think you'd be better off using the state machine to work out state, instead of adding additional variables outside of the state machine.

That doesn't work in this context because I need to propagate the error from a callback inside LambdaRuntimeClient up to Lambda. The states are private to the LambdaRuntimeClient and the LambdaChannelHandler and cannot be accessed from Lambda.

Throwing errors is not an option either, because the only place where I can trap the connectionClosed event while the client is blocked on GET /next is inside a callback: channel.closeFuture.whenComplete { result ... }. This block cannot throw.
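
To make the constraint concrete, here is a minimal sketch of how a non-throwing callback can still deliver an error to a suspended caller by resuming a continuation. `FakeChannel`, `whenClosed(_:)`, `waitForNextInvocation(on:)`, and `RuntimeError` are hypothetical names for this example. The sketch is single-threaded and registers the callback at wait time, which is a simplification and not how the runtime client is wired.

```swift
// Hypothetical stand-in for LambdaRuntimeError; not the library's definition.
enum RuntimeError: Error {
    case connectionToControlPlaneLost
}

// Hypothetical, single-threaded stand-in for a channel and its close future.
// Like a future's completion callback, the closure it accepts cannot throw.
final class FakeChannel {
    private var isClosed = false
    private var callbacks: [() -> Void] = []

    func whenClosed(_ callback: @escaping () -> Void) {
        if isClosed { callback() } else { callbacks.append(callback) }
    }

    func close() {
        isClosed = true
        let pending = callbacks
        callbacks.removeAll()
        for callback in pending { callback() }
    }
}

// Suspends until the channel closes, then surfaces the loss as a thrown error.
// A real client would also resume with a value when a response arrives.
func waitForNextInvocation(on channel: FakeChannel) async throws -> String {
    try await withCheckedThrowingContinuation { (continuation: CheckedContinuation<String, Error>) in
        channel.whenClosed {
            // `throw` is not allowed in this closure; resuming the continuation
            // is how the error reaches the suspended caller.
            continuation.resume(throwing: RuntimeError.connectionToControlPlaneLost)
        }
    }
}

// Returns true when the waiting caller observed the connection loss.
func demoConnectionLoss() async -> Bool {
    let channel = FakeChannel()
    channel.close()  // the server has gone away
    do {
        _ = try await waitForNextInvocation(on: channel)
        return false
    } catch {
        return error is RuntimeError
    }
}
```

The continuation must be resumed exactly once, so a real implementation would also need to clear the stored continuation when it resumes it.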

@adam-fowler
Member

@adam-fowler This was already implemented (not by me). But nextInvocation() is not called while blocked on sendNextRequest(). Basically, while the client is waiting for a call to GET /next to return, the code path does not go through nextInvocation().

You are checking for the existence of the futureConnectionClosed just before calling nextInvocation. I don't see why you can't check for the .disconnected state inside nextInvocation and then throw your LambdaRuntimeError(code: .connectionToControlPlaneLost) at that point.

@sebsto
Contributor Author

sebsto commented Aug 7, 2025

Because connectionState is private in LambdaRuntimeClient, it's not exposed on the protocol. Making it package obliges us to open up a lot of internal types (such as ConnectionState, the ChannelDelegate, ...). It also obliges us to remove the @usableFromInline on Lambda.Run() and dance with the actor isolation. I'll send a modification to let you see the amount of changes needed (unless I'm missing something).

@adam-fowler
Member

Can you not add this to the top of LambdaRuntimeClient.nextInvocation?

    @usableFromInline
    func nextInvocation() async throws -> (Invocation, Writer) {
        if case .disconnected = self.connectionState {
            throw LambdaRuntimeError(code: .connectionToControlPlaneLost)
        }
        return try await withTaskCancellationHandler {
            ...

@sebsto
Contributor Author

sebsto commented Aug 7, 2025

Yes! I was trying to check that in Lambda.loop, which caused all kinds of challenges in getting this value outside of the isolated context. You're correct, your approach is much easier.

Nope, I have to think more about it. The first time we arrive in this method, we are in a disconnected state :-)

@sebsto
Contributor Author

sebsto commented Aug 7, 2025

Here is a new version.

  • I introduced a new connectionState: lostConnection (because .disconnected also means "not yet connected")
  • I moved the check back to Lambda because if we do the check in LambdaRuntimeClient.nextInvocation(), the method still hangs on lost connections

@sebsto
Contributor Author

sebsto commented Aug 7, 2025

Here is a new version.

  • I introduced a new connectionState: lostConnection (because .disconnected also means "not yet connected")
  • The check for a lost connection is in LambdaRuntimeClient.nextInvocation()
  • When the connection is lost, the callback now correctly calls resume(throwing:) for the pending continuation, eliminating the hangs I was observing from time to time.
  • DRY: I moved the top-level error handling into the LambdaRuntime._run() method instead of the two run() functions (one for service lifecycle and one regular)
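
A minimal sketch of the first two points, assuming a simplified state enum without associated values: a dedicated lostConnection case lets nextInvocation() tell "never connected yet" apart from "was connected, then dropped". `Client`, `channelClosed()`, and `RuntimeError` are hypothetical names for this example; the real connection states carry channel and handler values.

```swift
// Hypothetical stand-in for LambdaRuntimeError; not the library's definition.
enum RuntimeError: Error, Equatable {
    case connectionToControlPlaneLost
}

// Simplified connection states (the real ones carry associated values).
enum ConnectionState: Equatable {
    case disconnected    // not connected yet: connecting is still allowed
    case connected
    case lostConnection  // was connected and the server closed the connection
}

struct Client {
    private(set) var state: ConnectionState = .disconnected

    // Fails fast once the connection has been lost; connects lazily otherwise.
    mutating func nextInvocation() throws -> String {
        switch state {
        case .lostConnection:
            throw RuntimeError.connectionToControlPlaneLost
        case .disconnected:
            state = .connected  // first call: establish the connection
        case .connected:
            break
        }
        return "invocation"  // placeholder payload
    }

    // Invoked when the channel reports that it was closed.
    mutating func channelClosed() {
        if state == .connected { state = .lostConnection }
    }
}
```

With a single .disconnected case, the very first call would be indistinguishable from a call made after the server went away, which is the problem noted earlier in the thread.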

@sebsto
Contributor Author

sebsto commented Aug 24, 2025

I further simplified by removing an unnecessary extra state I introduced in the previous commit.
I removed other unnecessary code to keep this PR as short as possible.

Core changes are

  • LambdaRuntimeClient: lines 366 - 378
  • LambdaChannelHandler: line 900
  • LambdaRuntime.swift: lines 109 - 119

23 lines of code.
The rest are empty lines and logger statements.


// resume any pending continuation on the handler
if case .connected(_, let handler) = runtimeClient.connectionState {
if case .connected(_, let lambdaState) = handler.state {
Member

This shouldn't be necessary, as LambdaChannelHandler.channelInactive will have resumed the continuation (with error ioOnClosedChannel). None of the tests trigger this code either.

Contributor Author

@sebsto sebsto Aug 27, 2025

LambdaChannelHandler.channelInactive is indeed called when the server closes a connection, but not in the state waitingForNextInvocation.

When the server closes the connection, the state in the LambdaChannelHandler is
connected and LambdaState.idle

To reproduce what I observe:

  1. MAX_INVOCATIONS=3 MODE=json PORT=7777 LOG_LEVEL=trace swift run MockServer
  2. From another terminal, on the sebsto/shutdown_on_lost_connection branch: cd Examples/HelloJSON ; LAMBDA_USE_LOCAL_DEPS=../.. LOG_LEVEL=trace AWS_LAMBDA_RUNTIME_API=127.0.0.1:7777 swift run

The MockServer will start and will shut down after three invocations.
The runtime will pull and process three events from the MockServer, and will then receive a connection closed event when trying to fetch the next event.

You will see that sometimes the runtime catches the closed connection and shuts down gracefully with Connection refused (errno: 61), and sometimes it hangs. In both cases, LambdaChannelHandler.channelInactive() is called with self.state == .connected(_, .idle) and lastError == nil.

Contributor Author

@sebsto sebsto Aug 27, 2025

The change to .idle happens after the response has been sent on this line

Looks like we have two behaviors, depending on when the client detects that the connection is closed.

Either the response has been sent, LambdaChannelHandler is in the .idle state, and the runtime detects the closed connection before nextInvocation() has a chance to change the state and send the next request. In that case, nextInvocation() correctly reports Connection refused (errno: 61) (we can trap the error by adding a do catch block here).

Or nextInvocation() has already changed the state: it created a new promise and switched the state again to .waitingForNextInvocation (on this line).

This is where we have a problem, because this new promise is never fulfilled. We are past the call to LambdaChannelHandler.channelInactive() and there are no throwing functions that we can wrap in a do catch to trap the error.
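
The second ordering can be simulated with a small, single-threaded sketch. It uses a plain closure in place of the promise, and `Handler`, `Waiter`, `channelClosed()`, `problemOrdering(applyFix:)`, and `RuntimeError` are hypothetical names for this example. It is a simplified model of the ordering described above, not the handler's actual code.

```swift
// Hypothetical stand-ins; the library's errors and states carry more detail.
enum RuntimeError: Error, Equatable {
    case ioOnClosedChannel
    case connectionToControlPlaneLost
}

// A plain closure plays the role of the pending promise or continuation.
typealias Waiter = (Result<String, RuntimeError>) -> Void

enum LambdaState {
    case idle
    case waitingForNextInvocation(Waiter)
}

final class Handler {
    private(set) var state: LambdaState = .idle

    // Called when the channel goes inactive: fails the waiter, if there is one.
    func channelInactive() {
        if case .waitingForNextInvocation(let waiter) = state {
            state = .idle
            waiter(.failure(.ioOnClosedChannel))
        }
    }

    // Registers a new waiter for the next invocation.
    func nextInvocation(_ waiter: @escaping Waiter) {
        state = .waitingForNextInvocation(waiter)
    }

    // A later close notification: fails a waiter that was registered
    // after channelInactive() had already run.
    func channelClosed() {
        if case .waitingForNextInvocation(let waiter) = state {
            state = .idle
            waiter(.failure(.connectionToControlPlaneLost))
        }
    }
}

// Replays the problematic ordering; nil means the waiter was never resumed.
func problemOrdering(applyFix: Bool) -> RuntimeError? {
    let handler = Handler()
    var received: RuntimeError?
    handler.channelInactive()  // state is .idle, so there is nothing to resume
    handler.nextInvocation { result in
        if case .failure(let error) = result { received = error }
    }
    if applyFix { handler.channelClosed() }
    return received
}
```

In this model, `problemOrdering(applyFix: false)` returns nil (the hang), while `problemOrdering(applyFix: true)` returns `.connectionToControlPlaneLost`.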

Contributor Author

@sebsto sebsto Aug 27, 2025

Instead of capturing the new unfulfilled promise in channel.closeFuture.whenComplete, I can capture it in LambdaRuntimeClient.channelClosed() which is called by the latter. It's not a big difference.

Anyway, when we detect that a connection is closed in the LambdaRuntimeClient, it sometimes happens after the call to LambdaChannelHandler.channelInactive().

And this is, IMHO, the code that allows us to fulfill the promise.

      if case .connected(_, let handler) = self.connectionState {
          if case .connected(_, let lambdaState) = handler.state {
              if case .waitingForNextInvocation(let continuation) = lambdaState {
                  continuation.resume(throwing: LambdaRuntimeError(code: .connectionToControlPlaneLost))
              }
          }
      }

Member

Can you write a test that replicates this code being called?

Contributor Author

Done, b0234f5

Contributor Author

As you noticed, the connectionToControlPlaneLost error was not triggered by the previous commit. This commit adds a test to make sure the two types of errors are triggered: 156e6c7

Labels
🆕 semver/minor Adds new public API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[core] LambdaRuntimeClient do nothing when server closes the connection
2 participants