relay: single-instance constraint — doc + startup self-check (registry is in-memory) #39

@ilmoniemi

Why

The connection registry (internal/relay/registry.go) is in-memory per process. Two replicas = two disjoint registries: a phone connected to replica A cannot reach a binary connected to replica B (registry-by-server-id lookup misses). The relay's v1 architecture supports single-instance only.
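A minimal sketch of why two replicas cannot see each other's connections. The real shape of internal/relay/registry.go is not shown in this ticket, so the `Registry` type, its methods, and the string connection handle below are hypothetical stand-ins for a process-local map keyed by server id.

```go
package main

import (
	"fmt"
	"sync"
)

// Registry is a hypothetical stand-in for the relay's in-memory registry:
// a mutex-guarded map from server id to a connection handle.
// The handle is simplified to a string for illustration.
type Registry struct {
	mu    sync.RWMutex
	conns map[string]string
}

// NewRegistry returns an empty registry. Each process (replica) would own one.
func NewRegistry() *Registry {
	return &Registry{conns: make(map[string]string)}
}

// Register records the connection for a server id in this process only.
func (r *Registry) Register(serverID, conn string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.conns[serverID] = conn
}

// Lookup reports whether this process holds a connection for the server id.
func (r *Registry) Lookup(serverID string) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	c, ok := r.conns[serverID]
	return c, ok
}

func main() {
	// Two replicas means two independent maps with no shared state.
	replicaA := NewRegistry()
	replicaB := NewRegistry()

	// The binary connects to replica B.
	replicaB.Register("srv-1", "binary-conn")

	// A phone routed to replica A looks up the same server id and misses.
	_, okA := replicaA.Lookup("srv-1")
	_, okB := replicaB.Lookup("srv-1")
	fmt.Println("replica A finds srv-1:", okA)
	fmt.Println("replica B finds srv-1:", okB)
}
```

Running this prints `false` for replica A and `true` for replica B, which is the lookup miss described above.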

Docker + Fly.io makes scaling out a one-line command (fly scale count 3). Without a guard, someone (operator, AI agent, future-self) could scale out reflexively and silently break server-id routing for every phone/binary pair that lands on different replicas.

Belt-and-suspenders per [[PROJECT-MEMORY#Behavioral / Instruction-Design]] applies: the doc is the stochastic guard; the startup self-check is the deterministic backstop.

What

Two parts, both ship in this ticket:

  1. Doc: docs/architecture.md gains an explicit "v1 = single-instance" section explaining the constraint, the registry shape, and what would have to change for multi-instance (a shared registry via Redis pub/sub or NATS subjects, or sticky sessions keyed on server-id at the LB layer).

  2. Startup self-check: if the host platform exposes a replica-count signal, refuse to start when count > 1. Mechanism depends on host:

    • Fly.io: running fly machines list --app with FLY_APP_NAME from inside the container is heavy. Cheaper: use FLY_MACHINE_ID plus a read-only check that no sibling machines exist for the same app. Simpler still: an env-var contract, PYRYCODE_RELAY_ASSERT_SINGLE_INSTANCE=true, set by the deploy manifest and verified at startup, with the relay refusing to start in production mode if it is NOT set.
    • Hetzner / generic: less standard signal — the env-var contract pattern works regardless.
    • Architect to choose — env-var contract is cheaper + portable; platform-API check is more robust but per-host.
    • Failure mode is loud: refuse to start and log "ERROR: multi-instance deploy detected, relay registry is in-memory and cannot share state across replicas. Set PYRYCODE_RELAY_SINGLE_INSTANCE=1 to bypass (NOT recommended for production)." Document the bypass for emergencies and future migration windows.
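The env-var contract option could be sketched roughly as follows. This is an illustration under stated assumptions, not the chosen design: `checkSingleInstance` is a hypothetical helper, and `PYRYCODE_RELAY_ENV` is an invented way to detect production mode, since the ticket does not say how production is identified. Only `PYRYCODE_RELAY_ASSERT_SINGLE_INSTANCE` comes from the ticket. The bypass variable is left out because its exact semantics are still for the architect to settle.

```go
package main

import (
	"fmt"
	"os"
)

// assertVar is the contract variable named in this ticket.
const assertVar = "PYRYCODE_RELAY_ASSERT_SINGLE_INSTANCE"

// checkSingleInstance is a hypothetical startup check. In production mode it
// returns an error unless the deploy manifest has set assertVar to "true".
// getenv is injected so the check can be exercised without touching the
// real process environment.
func checkSingleInstance(production bool, getenv func(string) string) error {
	if !production {
		// Outside production the contract is not enforced.
		return nil
	}
	if getenv(assertVar) == "true" {
		return nil
	}
	return fmt.Errorf(
		"%s is not set to \"true\": relay registry is in-memory and cannot share state across replicas",
		assertVar,
	)
}

func main() {
	// ASSUMPTION: production mode is signalled by PYRYCODE_RELAY_ENV=production.
	// This variable is invented for the sketch and is not part of the ticket.
	production := os.Getenv("PYRYCODE_RELAY_ENV") == "production"

	if err := checkSingleInstance(production, os.Getenv); err != nil {
		// Loud failure: report on stderr and refuse to start.
		fmt.Fprintln(os.Stderr, "ERROR:", err)
		os.Exit(1)
	}
	fmt.Println("single-instance check passed")
}
```

Passing `getenv` as a parameter keeps the check deterministic under test and would let a platform-API variant be swapped in behind the same call site if the architect prefers that route.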

Out of scope

  • Implementing shared-registry support (Redis / NATS / sticky-session): that is the next-architecture ticket if v1 outgrows single-instance. To be filed separately once there is evidence of demand.
  • Health-check integration with platform autoscalers: deferred until the architecture supports multi-instance.
