-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use a state machine to manage session state transitions #220
Conversation
if (!parsed.payload.expectedSessionState) { | ||
// TODO: remove once we have upgraded all clients. | ||
({ session, isTransparentReconnect } = this.getOrCreateSession({ | ||
to: parsed.from, | ||
conn, | ||
sessionId: parsed.payload.sessionId, | ||
propagationCtx: parsed.tracing, | ||
})); | ||
} else if (parsed.payload.expectedSessionState.reconnect) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess python-client is deployed with this change replit/river-python#32
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i figured we've marinated long enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense at face-value, I would've thought that the transitions are automatically handled within the sessions (e.g. the state machine automatically goes from handshaking to connected when the handshaking session gets a successful handshake), but I guess the transport layer eeds to do some verification in some cases and act as a controller. I think this is fine, but that was my intuition.
* │ server-side ▼ │ connection drop | ||
* └───────────────► 3. SessionConnected ────────────┘ heartbeat misses | ||
*/ | ||
abstract class StateMachineState { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason to split CommonSession
and this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wanted to split out the state machine machinery and the session machinery
…ect automatically
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a very thorough look, mainly nitpicks/bikeshedding and questions
transport/server.ts
Outdated
// 2. cliest is reconnecting to an existing session but we don't have it | ||
// reject this handshake, there's nothing we can do to salvage it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe not for this PR, but should we have a mechanism for negotiating creating a new session instead? or is the client just expected to retry and request a new session explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wanted to keep the handshake fixed-length (e.g. 2-way handshake for now), the intended way is to have the client retry these (we now distinguish retriable and fatal handshake responses so thats the goal there)
export async function advanceFakeTimersByDisconnectGrace() { | ||
for (let i = 0; i < testingSessionOptions.heartbeatsUntilDead; i++) { | ||
// wait for heartbeat interval to elapse | ||
await vi.runOnlyPendingTimersAsync(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
turns out runOnlyPendingTimersAsync
introduced incorrect fake timer behaviour so I did my best to remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
round 1/2 done (pausing for lunch and continuing afterwards)
await advanceFakeTimersBySessionGrace(); | ||
|
||
expect(clientTransport.sessions.size).toEqual(0); | ||
await waitFor(() => expect(numberOfConnections(clientTransport)).toBe(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider using toEqual
here for consistency (also primitives don't have identity-matching defined). same elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i believe vitest + jest both allow toBe on primitives https://vitest.dev/api/expect.html#tobe
(it just allowed me to shorten it into a single line because toEqual pushes it to 3 😭)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rofl. i guess we should have consistency even if that causes prettier to do stuff that ain't as pretty.
*/ | ||
export const SessionStateGraph = { | ||
entrypoints: { | ||
NoConnection( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be fun to mention that this is the client entrypoint and the WaitingForHandshake
is the server entrypoint. in other words, these two entrypoints don't need to have the same name as the state they represent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
go forth! please prioritize any comments about correctness / invariants. everything else can be done later.
@@ -614,4 +200,106 @@ export abstract class Transport<ConnType extends Connection> { | |||
getStatus(): TransportStatus { | |||
return this.status; | |||
} | |||
|
|||
protected updateSession<S extends Session<ConnType>>(session: S): S { | |||
const activeSession = this.sessions.get(session.to); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm i thought that we were going to simplify things so that session.to
was always identical to session.id
(in other words, each client creates a session id upfront and then all participants refer to that as the id for fewer invariants, and we lose the distinction between client and session IDs, maybe even connection IDs too).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not yet, i wanted to push that to a future PR (removing transport client ids entirely in favour of session ids)
this.conn.addCloseListener(listeners.onConnectionClosed); | ||
} | ||
|
||
onHandshakeData = (msg: Uint8Array) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how are we guaranteeing that once this is called, the handshake timeout will not fire?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great question, right now we are relying on the transport to listen for the onHandshake and then transition out of this state (which clears the handshake timeout)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'll add a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other state transitions don't clear the other listeners as soon as an even is emitted so this is consistent with it (the listener of the event is expected to do the transition which does cleanup)
* - Handshaking -> NoConnection (on close) | ||
* - Handshaking -> Connected (on handshake) | ||
*/ | ||
export class SessionHandshaking< |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this state is only valid for clients (since the corresponding server-side change is SessionWaitingForHandshake
) should we add a Client
to this name and a Server
to the other state?
clearInterval(this.heartbeatHandle); | ||
this.heartbeatHandle = undefined; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clearInterval(this.heartbeatHandle); | |
this.heartbeatHandle = undefined; | |
if (this.heartbeatHandle) { | |
clearInterval(this.heartbeatHandle); | |
this.heartbeatHandle = undefined; | |
} |
idempotency ftw
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't clearInterval
already idempotent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's putting a lot of trust that node / random browsers will never, ever for any circumstance accidentally reuse an interval ID.
|
||
// if we are not actively heartbeating, we are in passive | ||
// heartbeat mode and should send a response to the ack | ||
if (!this.isActivelyHeartbeating) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way to create a new state for this? it feels like a new state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm we could but i dont think it would add much value, this to me seems like a parameter of a state (like a substate?)
this.protocolError(ProtocolError.RetriesExceeded, errMsg); | ||
return; | ||
} | ||
// check budget |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it worth having the "sleeping before reconnecting" state be distinct from "no connection"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, i thought about this as well but not sure if it gets us anything (behaviorally this is identical to SessionConnecting)
## Why We removed the client being required to send heartbeat on interval due to browser timer throttling in #220. That makes sense, since we don't want the server to expect anything from the client that relies on client timers, making the timer server authoritative makes sense. That said, this entirely removed our ability to detect bad servers or phantom disconnects on the client. ## What changed Add a client timer to expect a heartbeat from the server every `options.heartbeatsUntilDead * options.heartbeatIntervalMs` , if we don't see a heartbeat, we close the connection. This is fine in the context of browser throttling timers since in the worst case of throttling we're holding on a connection for too long, and in the best case we can detect bad connections. ## Versioning - [ ] Breaking protocol change - [ ] Breaking ts/js API change Internal change. <!-- Kind reminder to add tests and updated documentation if needed --> --------- Co-authored-by: Jacky Zhao <j.zhao2k19@gmail.com>
Why
Keeping track of what handlers are registered when and knowing when to cleanup/register new callbacks and fields in a session's lifecycle is really difficult. Let's encode valid states via a state machine and make illegal states unrepresentable.
This PR also addresses two other problems that are heavily tied to the session lifecycle:
nextExpectedSeq
in the handshake which is useful to detect desyncs in server to client direction (that is, if the server was about to send a message that would cause the client to receive an out-of-order message, we should just close the session there instead of continuing). However, we never did this check in the other direction.setInterval
to send heartbeat messages to the server and the server would kill the client if it didn't send heartbeats frequently enough. Turns out that Chromium heavily throttles intervals when backgrounded (source: https://source.chromium.org/chromium/chromium/src/+/main:third_party/blink/renderer/platform/scheduler/main_thread/page_scheduler_impl.h;l=54?q=kDefaultThrottledWakeUpInterval&ss=chromium%2Fchromium%2Fsrc) so asetInterval
for, say 3s, could take up to 60s to actually fire. This meant connections were getting killed far too frequently. However, if we make the timers server authoritative, the client can always respond to server heartbeats as data-listener-based wakeups are not throttled :)What changed
reconnect
, this can be inferred a combination ofnextExpectedSeq
andnextSentSeq
sessionId
to generate the ID on the session on the server side so the IDs of the client session and the server session matchnextSentSeq
. We do some more comparison here to check for mismatched session state before calling the handshake ok. Specifically, there are two cases where we cannot accept a session:client.nextSentSeq > server.ack
)server.seq > client.nextExpectedSeq
)numberOfConnections(transport)
andcloseAllConnections(transport)
which covers our existing use cases.Versioning