tl;dr: during stratus startup, there is a tiny window where a socket message will go to the owning process/supervisor instead of the stratus actor.
Backstory
I have a project that talks to Home Assistant over a websocket. After upgrading to Stratus 2.0, my project stopped working. I eventually traced it back to not receiving the initial auth_required message from HA, which should be sent immediately upon connecting. I couldn't figure out the cause, so I let an LLM grind on it for quite a while, and eventually it found a race condition: there is a tiny window of time where a message will go to the owning process/supervisor instead of the stratus actor. My local Home Assistant instance must be responding fast enough to fit in that window.
I will send a PR which fixes the issue for me shortly. It's small, but not exactly elegant. :)
The LLM's summary:
Root Cause
There is a race condition in stratus.start/1 (in build/packages/stratus/src/stratus.gleam) between:
- Handshake completion - The WebSocket handshake is performed by the calling process (in our case, the OTP supervisor context)
- Socket data arriving - Home Assistant sends auth_required immediately after handshake
- Socket ownership transfer - controlling_process/2 transfers socket ownership to the new actor
The timeline is:
1. Handshake completes (socket owned by supervisor context)
2. Home Assistant sends auth_required message
3. SSL socket message arrives in SUPERVISOR's mailbox
4. controlling_process() transfers socket to actor
5. Actor starts with selector, but auth_required is already in supervisor's mailbox
6. Actor never receives auth_required, stays in AuthPending state forever
7. Dependent actors time out waiting for HA responses
8. Supervisor crashes with init_failed
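To make the timeline concrete, here is a toy model of it. It is only a sketch: it does not use Stratus or a real socket, the two subjects merely stand in for the caller's and the actor's mailboxes, and simulate_race is a made-up name. It assumes the gleam_erlang process module.

```gleam
import gleam/erlang/process
import gleam/io

// Toy model only: two subjects stand in for the two mailboxes.
// Returns #(what the actor sees, what the caller sees).
pub fn simulate_race() -> #(Result(String, Nil), Result(String, Nil)) {
  let caller_mailbox = process.new_subject()
  let actor_mailbox = process.new_subject()

  // Steps 1-3: the caller still owns the socket, so the early
  // auth_required data is delivered to the caller's mailbox.
  process.send(caller_mailbox, "auth_required")

  // Step 4: ownership moves to the actor. Only data arriving from now on
  // would reach actor_mailbox; what was already delivered stays put.

  // Steps 5-6: the actor waits (50 ms here) and never sees the message.
  let actor_sees = process.receive(actor_mailbox, 50)
  let caller_sees = process.receive(caller_mailbox, 0)
  #(actor_sees, caller_sees)
}

pub fn main() {
  let #(actor_sees, caller_sees) = simulate_race()
  case actor_sees {
    Ok(message) -> io.println("actor got: " <> message)
    Error(Nil) -> io.println("actor got nothing")
  }
  case caller_sees {
    Ok(message) -> io.println("caller got: " <> message)
    Error(Nil) -> io.println("caller got nothing")
  }
}
```

In this model the message is not lost, it is just sitting in a mailbox that nobody is reading for socket data, which matches the "unexpected message" the supervisor logs.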
Evidence
From debug output:
[DEBG] Handshake successful
"HA init: stratus.selecting called"
[EROR] Supervisor received unexpected message: [Ssl(Sslsocket(...), <<compressed data>>)]
The SSL socket message with the auth_required data went to the supervisor instead of the HA actor.
The Bug Location
In stratus.gleam lines ~490-508:
  |> actor.start
  |> result.map_error(ActorFailed)
  |> result.try(fn(started) {
    // PROBLEM: Socket is still owned by calling process at this point!
    // Any incoming data goes to calling process's mailbox.
    case transport {
      Tcp -> tcp.controlling_process(handshake_response.socket, started.pid)
      Ssl -> ssl.controlling_process(handshake_response.socket, started.pid)
    }
    // ...
  })
The socket ownership transfer happens after actor.start returns, but by that time the server may have already sent data that ends up in the wrong mailbox.
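One general way to close a window like this is for the old owner to hand over anything that already landed in its mailbox once ownership has moved. The sketch below shows that idea on the same kind of toy mailboxes. It is not the change in my PR and not Stratus code; forward_pending is a hypothetical helper and the subjects only stand in for real process mailboxes. It assumes the gleam_erlang process module.

```gleam
import gleam/erlang/process.{type Subject}
import gleam/io

// Hypothetical helper: move everything already queued on `from` to `to`.
// A timeout of 0 means "take what is there now, do not wait".
pub fn forward_pending(from: Subject(String), to: Subject(String)) -> Nil {
  case process.receive(from, 0) {
    Ok(message) -> {
      process.send(to, message)
      forward_pending(from, to)
    }
    Error(Nil) -> Nil
  }
}

pub fn main() {
  let caller_mailbox = process.new_subject()
  let actor_mailbox = process.new_subject()

  // Early data arrives while the caller still owns the socket.
  process.send(caller_mailbox, "auth_required")

  // The ownership transfer would happen here. Afterwards the caller
  // forwards whatever it received in the meantime.
  forward_pending(caller_mailbox, actor_mailbox)

  case process.receive(actor_mailbox, 50) {
    Ok(message) -> io.println("actor got: " <> message)
    Error(Nil) -> io.println("actor got nothing")
  }
}
```

With a real socket, forwarding alone does not guarantee ordering against data that arrives during the handover, so a real fix would likely also need to keep the socket from delivering messages until the new owner is in place.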