Why
The connection registry (`internal/relay/registry.go`) is in-memory per process. Two replicas = two disjoint registries: a phone connected to replica A cannot reach a binary connected to replica B (the registry-by-server-id lookup misses). The relay's v1 architecture supports single-instance only.
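The registry shape assumed here is roughly a process-local map keyed by server id; the names below are hypothetical, not the actual `internal/relay/registry.go` API:

```go
package main

import (
	"fmt"
	"sync"
)

// Conn stands in for a live connection (details elided).
type Conn struct{ serverID string }

// Registry is a process-local map from server id to connection,
// guarded by a mutex. Hypothetical sketch, not the real API.
type Registry struct {
	mu    sync.RWMutex
	conns map[string]*Conn
}

func NewRegistry() *Registry {
	return &Registry{conns: make(map[string]*Conn)}
}

func (r *Registry) Register(serverID string, c *Conn) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.conns[serverID] = c
}

// Lookup misses when the server registered on a different replica:
// there is no shared state to consult.
func (r *Registry) Lookup(serverID string) (*Conn, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	c, ok := r.conns[serverID]
	return c, ok
}

func main() {
	replicaA, replicaB := NewRegistry(), NewRegistry()
	replicaA.Register("srv-1", &Conn{serverID: "srv-1"})

	_, okA := replicaA.Lookup("srv-1")
	_, okB := replicaB.Lookup("srv-1") // disjoint registry: miss
	fmt.Println(okA, okB)              // true false
}
```

Two `Registry` values here model two replicas: the same server id resolves on A and misses on B, which is exactly the broken-routing scenario the guard exists to prevent.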
Docker + Fly.io makes `fly scale count 3` a one-line command. Without a guard, someone (operator, AI agent, future self) could scale out reflexively and silently break server-id routing for half the connections.
Belt-and-suspenders per [[PROJECT-MEMORY#Behavioral / Instruction-Design]] applies: the doc is the stochastic guard; the startup self-check is the deterministic backstop.
What
Two parts, both ship in this ticket:
- Doc: `docs/architecture.md` adds an explicit "v1 = single-instance" section explaining the constraint, the registry shape, and what would have to change for multi-instance (shared registry via Redis pub/sub, NATS subjects, or sticky-session-on-server-id at the LB layer).
- Startup self-check: if the host platform exposes a replica-count signal, refuse to start when count > 1. Mechanism depends on host:
  - Fly.io: `FLY_APP_NAME` + `fly machines list --app` from inside the container is heavy. Cheaper: `FLY_MACHINE_ID` + a read-only check that no sibling machines exist for the same app. Simpler still: an env-var contract, `PYRYCODE_RELAY_ASSERT_SINGLE_INSTANCE=true`, set by the deploy manifest and verified at startup, refusing to start if it is NOT set in production mode.
  - Hetzner / generic: no standard signal, but the env-var contract pattern works regardless.
  - Architect to choose: the env-var contract is cheaper and portable; the platform-API check is more robust but per-host.
- Failure mode is loud: refuse to start and log `ERROR: multi-instance deploy detected, relay registry is in-memory and cannot share state across replicas. Set PYRYCODE_RELAY_SINGLE_INSTANCE=1 to bypass (NOT recommended for production).` Document the bypass for emergency situations and future migration windows.
Out of scope
- Implementing shared-registry support (Redis / NATS / sticky-session): that's the next-architecture ticket if v1 outgrows single-instance. File it separately when there's evidence of demand.
- Health-check integration with platform autoscalers — once the architecture supports it.