Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a state machine to manage session state transitions #220

Merged
merged 26 commits into from
Jul 8, 2024

Conversation

jackyzha0
Copy link
Member

@jackyzha0 jackyzha0 commented Jun 25, 2024

Why

Keeping track of what handlers are registered when and knowing when to cleanup/register new callbacks and fields in a session's lifecycle is really difficult. Let's encode valid states via a state machine and make illegal states unrepresentable.

This PR also addresses two other problems that are heavily tied to the session lifecycle:

  • Message order synchronization during handshake: we used to synchronize nextExpectedSeq in the handshake which is useful to detect desyncs in server to client direction (that is, if the server was about to send a message that would cause the client to receive an out-of-order message, we should just close the session there instead of continuing). However, we never did this check in the other direction.
  • Relying on browser timers: we used to rely on a client-side setInterval to send heartbeat messages to the server and the server would kill the client if it didn't send heartbeats frequently enough. Turns out that Chromium heavily throttles intervals when backgrounded (source: https://source.chromium.org/chromium/chromium/src/+/main:third_party/blink/renderer/platform/scheduler/main_thread/page_scheduler_impl.h;l=54?q=kDefaultThrottledWakeUpInterval&ss=chromium%2Fchromium%2Fsrc) so a setInterval for, say 3s, could take up to 60s to actually fire. This meant connections were getting killed far too frequently. However, if we make the timers server authoritative, the client can always respond to server heartbeats as data-listener-based wakeups are not throttled :)

What changed

  • Implement a state machine that supports async-transitions via a callback-based system. We use a proxy to ensure that session states we've transitioned from cannot be used again.
    • Each state registers what to do on state exit and on close (exiting from the state machine entirely, e.g. closing the session object) so we have consistent cleanups that don't leak. This has been the source of quite a few bugs.
  • Made heartbeats server authoritative.
    • Clients no longer schedule their own heartbeats. Instead, it responds directly to any server-sent heartbeats.
    • Servers are responsible for closing a connection on too many heartbeat misses.
    • This is ok even in the case of phantom disconnects as closing the underlying TCP conn will still propagate in the case of a WebSocket transport (and UDS cannot have phantom disconnects)
  • Update the handshake messages
    • No longer sends reconnect, this can be inferred a combination of nextExpectedSeq and nextSentSeq
    • Now relies on sessionId to generate the ID on the session on the server side so the IDs of the client session and the server session match
    • Now sends nextSentSeq. We do some more comparison here to check for mismatched session state before calling the handshake ok. Specifically, there are two cases where we cannot accept a session:
      • the client is in the future (client.nextSentSeq > server.ack)
      • server is in the future (server.seq > client.nextExpectedSeq)
  • Distinguish retriable handshake failures from fatal ones.
    • Only a session state mismatch is retriable (i.e. kill the session and start over), everything else should be 'fatal' and stop retrying.
  • Testing infra changes
    • Made constructing a transport not automatically connect by default. We had cases where we registered listeners after connecting and would miss some events.
    • No longer expose connections to the transport (it doesn't manage these! it shouldn't be able to introspect connections). I pulled these out into test helpers numberOfConnections(transport) and closeAllConnections(transport) which covers our existing use cases.

Versioning

  • Breaking protocol change
  • Breaking ts/js API change

Comment on lines -323 to -331
if (!parsed.payload.expectedSessionState) {
// TODO: remove once we have upgraded all clients.
({ session, isTransparentReconnect } = this.getOrCreateSession({
to: parsed.from,
conn,
sessionId: parsed.payload.sessionId,
propagationCtx: parsed.tracing,
}));
} else if (parsed.payload.expectedSessionState.reconnect) {
Copy link
Member

@masad-frost masad-frost Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess python-client is deployed with this change replit/river-python#32

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i figured we've marinated long enough

Copy link
Member

@masad-frost masad-frost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense at face-value, I would've thought that the transitions are automatically handled within the sessions (e.g. the state machine automatically goes from handshaking to connected when the handshaking session gets a successful handshake), but I guess the transport layer eeds to do some verification in some cases and act as a controller. I think this is fine, but that was my intuition.

transport/sessionStateMachine/SessionConnected.ts Outdated Show resolved Hide resolved
* │ server-side ▼ │ connection drop
* └───────────────► 3. SessionConnected ────────────┘ heartbeat misses
*/
abstract class StateMachineState {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason to split CommonSession and this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wanted to split out the state machine machinery and the session machinery

transport/sessionStateMachine/common.ts Outdated Show resolved Hide resolved
transport/sessionStateMachine/common.ts Outdated Show resolved Hide resolved
transport/sessionStateMachine/common.ts Outdated Show resolved Hide resolved
@jackyzha0 jackyzha0 changed the title checkpoint state machine work Use a state machine to manage session state transitions Jul 5, 2024
@jackyzha0 jackyzha0 marked this pull request as ready for review July 5, 2024 21:26
@jackyzha0 jackyzha0 requested a review from a team as a code owner July 5, 2024 21:26
@jackyzha0 jackyzha0 requested review from Monkatraz and lhchavez and removed request for a team July 5, 2024 21:26
Copy link
Contributor

@Monkatraz Monkatraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a very thorough look, mainly nitpicks/bikeshedding and questions

transport/sessionStateMachine/transitions.ts Outdated Show resolved Hide resolved
transport/sessionStateMachine/common.ts Show resolved Hide resolved
transport/sessionStateMachine/common.ts Show resolved Hide resolved
router/server.ts Show resolved Hide resolved
transport/connection.ts Outdated Show resolved Hide resolved
Comment on lines 250 to 251
// 2. cliest is reconnecting to an existing session but we don't have it
// reject this handshake, there's nothing we can do to salvage it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe not for this PR, but should we have a mechanism for negotiating creating a new session instead? or is the client just expected to retry and request a new session explicitly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wanted to keep the handshake fixed-length (e.g. 2-way handshake for now), the intended way is to have the client retry these (we now distinguish retriable and fatal handshake responses so thats the goal there)

export async function advanceFakeTimersByDisconnectGrace() {
for (let i = 0; i < testingSessionOptions.heartbeatsUntilDead; i++) {
// wait for heartbeat interval to elapse
await vi.runOnlyPendingTimersAsync();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turns out runOnlyPendingTimersAsync introduced incorrect fake timer behaviour so I did my best to remove it

Copy link
Contributor

@lhchavez lhchavez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

round 1/2 done (pausing for lunch and continuing afterwards)

await advanceFakeTimersBySessionGrace();

expect(clientTransport.sessions.size).toEqual(0);
await waitFor(() => expect(numberOfConnections(clientTransport)).toBe(0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider using toEqual here for consistency (also primitives don't have identity-matching defined). same elsewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe vitest + jest both allow toBe on primitives https://vitest.dev/api/expect.html#tobe

(it just allowed me to shorten it into a single line because toEqual pushes it to 3 😭)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rofl. i guess we should have consistency even if that causes prettier to do stuff that ain't as pretty.

__tests__/disconnects.test.ts Outdated Show resolved Hide resolved
__tests__/disconnects.test.ts Outdated Show resolved Hide resolved
router/server.ts Outdated Show resolved Hide resolved
router/server.ts Show resolved Hide resolved
transport/sessionStateMachine/transitions.ts Show resolved Hide resolved
transport/sessionStateMachine/transitions.ts Show resolved Hide resolved
*/
export const SessionStateGraph = {
entrypoints: {
NoConnection(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be fun to mention that this is the client entrypoint and the WaitingForHandshake is the server entrypoint. in other words, these two entrypoints don't need to have the same name as the state they represent.

transport/sessionStateMachine/transitions.ts Show resolved Hide resolved
transport/sessionStateMachine/transitions.ts Show resolved Hide resolved
Copy link
Contributor

@lhchavez lhchavez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

go forth! please prioritize any comments about correctness / invariants. everything else can be done later.

transport/connection.ts Outdated Show resolved Hide resolved
transport/transport.ts Outdated Show resolved Hide resolved
transport/transport.ts Outdated Show resolved Hide resolved
@@ -614,4 +200,106 @@ export abstract class Transport<ConnType extends Connection> {
getStatus(): TransportStatus {
return this.status;
}

protected updateSession<S extends Session<ConnType>>(session: S): S {
const activeSession = this.sessions.get(session.to);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm i thought that we were going to simplify things so that session.to was always identical to session.id (in other words, each client creates a session id upfront and then all participants refer to that as the id for fewer invariants, and we lose the distinction between client and session IDs, maybe even connection IDs too).

Copy link
Member Author

@jackyzha0 jackyzha0 Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not yet, i wanted to push that to a future PR (removing transport client ids entirely in favour of session ids)

this.conn.addCloseListener(listeners.onConnectionClosed);
}

onHandshakeData = (msg: Uint8Array) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how are we guaranteeing that once this is called, the handshake timeout will not fire?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great question, right now we are relying on the transport to listen for the onHandshake and then transition out of this state (which clears the handshake timeout)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll add a comment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other state transitions don't clear the other listeners as soon as an even is emitted so this is consistent with it (the listener of the event is expected to do the transition which does cleanup)

* - Handshaking -> NoConnection (on close)
* - Handshaking -> Connected (on handshake)
*/
export class SessionHandshaking<
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this state is only valid for clients (since the corresponding server-side change is SessionWaitingForHandshake) should we add a Client to this name and a Server to the other state?

Comment on lines +177 to +178
clearInterval(this.heartbeatHandle);
this.heartbeatHandle = undefined;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
clearInterval(this.heartbeatHandle);
this.heartbeatHandle = undefined;
if (this.heartbeatHandle) {
clearInterval(this.heartbeatHandle);
this.heartbeatHandle = undefined;
}

idempotency ftw

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't clearInterval already idempotent?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's putting a lot of trust that node / random browsers will never, ever for any circumstance accidentally reuse an interval ID.


// if we are not actively heartbeating, we are in passive
// heartbeat mode and should send a response to the ack
if (!this.isActivelyHeartbeating) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a way to create a new state for this? it feels like a new state.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm we could but i dont think it would add much value, this to me seems like a parameter of a state (like a substate?)

this.protocolError(ProtocolError.RetriesExceeded, errMsg);
return;
}
// check budget
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it worth having the "sleeping before reconnecting" state be distinct from "no connection"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, i thought about this as well but not sure if it gets us anything (behaviorally this is identical to SessionConnecting)

@jackyzha0 jackyzha0 merged commit 3d14dab into main Jul 8, 2024
4 checks passed
@jackyzha0 jackyzha0 deleted the jackyzha0/state-machine branch July 8, 2024 22:39
jackyzha0 added a commit that referenced this pull request Jul 12, 2024
## Why

We removed the client being required to send heartbeat on interval due
to browser timer throttling in #220. That makes sense, since we don't
want the server to expect anything from the client that relies on client
timers, making the timer server authoritative makes sense.

That said, this entirely removed our ability to detect bad servers or
phantom disconnects on the client.

## What changed

Add a client timer to expect a heartbeat from the server every
`options.heartbeatsUntilDead * options.heartbeatIntervalMs` , if we
don't see a heartbeat, we close the connection.

This is fine in the context of browser throttling timers since in the
worst case of throttling we're holding on a connection for too long, and
in the best case we can detect bad connections.


## Versioning

- [ ] Breaking protocol change
- [ ] Breaking ts/js API change

Internal change.

<!-- Kind reminder to add tests and updated documentation if needed -->

---------

Co-authored-by: Jacky Zhao <j.zhao2k19@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants