relay: 30-second grace period on binary disconnect before releasing server-id #8

@ilmoniemi

User Story

As the relay, I want to hold a server-id slot for 30 seconds after the binary disconnects (instead of releasing immediately), so that transient network blips don't cause a brief unavailability window followed by phones receiving 4404 when they reconnect.

Context

Implements the grace-period semantics from pyrycode/pyrycode/docs/protocol-mobile.md § Authentication → Binary → relay: "When the holding connection drops (close, ping timeout, network error), the server-id is released after a 30-second grace period."

The bare registry (#3) has immediate-release semantics; this ticket adds the time-based deferral on top.

Acceptance Criteria

  • Modify the binary-side disconnect path in the /v1/server handler (added in #4, the WS upgrade for /v1/server that accepts the binary connection, validates headers, and claims the server-id): on WS close, call a new registry.ScheduleRelease(serverID, 30*time.Second) method instead of calling registry.ReleaseServer(serverID) immediately.
  • New method (r *Registry) ScheduleRelease(serverID string, after time.Duration) in internal/relay/registry.go:
    • Marks the server-id as "scheduled for release at T + after."
    • During the grace window:
      • BinaryFor(serverID) continues to return the (now-closed) binary conn — but Send() on that conn errors immediately. Forwarders see the error and clean up.
      • Phones connecting via /v1/client during the grace window get registered normally (no 4404). Their first frames will fail to forward (binary's Send errors), but the phone connection is held until either:
        • A new binary claims the same server-id (registry.ClaimServer cancels the pending release and replaces the binary conn). Existing phones continue to work seamlessly with the new binary.
        • The 30s expires with no new binary → all queued/pending phones get WS close 1011 with reason "binary did not reconnect", server-id is fully released, subsequent connects get 4404.
    • (r *Registry) ClaimServer modified: if a server-id is in the grace window, the new claim cancels the pending release and replaces the binary conn atomically. Returns success, NOT ErrServerIDConflict — the grace window is exactly the path that allows reconnection.
  • Tests in internal/relay/registry_test.go:
    • ScheduleRelease + immediate ClaimServer → no release fires; new binary conn is the active one.
    • ScheduleRelease + 30s pass with no claim → server-id fully released; RegisterPhone returns ErrNoServer until a new binary claims the server-id, and a subsequent ClaimServer succeeds.
    • Phone registered during grace window + 30s pass with no claim → the phone is closed (WS close 1011) and is no longer in the registry. The test verifies this explicitly.
    • Race: rapid disconnect/reclaim cycles don't leak goroutines or pending timers.

Technical Notes

  • Use time.AfterFunc for the deferred release. Track the timer per-server-id so reclaim can cancel it.
  • The grace window is 30 seconds (spec). Constant for now; revisit if real-world reconnect patterns suggest a different value.
  • Out of scope:
    • Phone-side grace period on phone disconnect (not in spec; phones come and go).
    • Per-server-id custom grace durations (one global value is fine).
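
One detail of time.AfterFunc that matters for the reclaim path: Timer.Stop reports whether it prevented the callback, and it returns false once the timer has already fired. A reclaim that races with expiry therefore cannot rely on Stop alone and likely needs its own guard (for example, re-checking state under the registry lock). The small demonstration below uses only the standard library; the stopResults helper is hypothetical and exists purely for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// stopResults returns what Timer.Stop reports after the callback has fired
// and before it has fired, respectively.
func stopResults() (afterFire, beforeFire bool) {
	fired := make(chan struct{})
	t1 := time.AfterFunc(10*time.Millisecond, func() { close(fired) })
	<-fired // the callback has run, so the timer has expired
	afterFire = t1.Stop()

	t2 := time.AfterFunc(time.Hour, func() {})
	beforeFire = t2.Stop() // cancelled long before it could fire
	return afterFire, beforeFire
}

func main() {
	afterFire, beforeFire := stopResults()
	fmt.Println(afterFire)  // false: too late, the release callback already ran
	fmt.Println(beforeFire) // true: the pending release was cancelled
}
```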

Size Estimate

S — ~80 LOC + ~120 LOC tests (timing tests + race tests for the reclaim path).

Depends on

Labels

security-sensitive: touches auth, crypto, or internet-exposed input paths