relay: single-instance constraint — doc + startup self-check (registry is in-memory)

## Why

The connection registry (`internal/relay/registry.go`) is in-memory per process. Two replicas = two disjoint registries: a phone connected to replica A cannot reach a binary connected to replica B (registry-by-server-id lookup misses). The relay's v1 architecture supports **single-instance only**.
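To make the lookup miss concrete, here is a minimal sketch of two replicas that each hold their own in-memory map. The `Registry` type, its methods, and the `server-123` id are hypothetical stand-ins for illustration; the real code in `internal/relay/registry.go` may be shaped differently.

```go
package main

import (
	"fmt"
	"sync"
)

// Registry is a hypothetical stand-in for the relay's in-memory registry.
// It maps a server-id to a connection; a string label replaces the real conn.
type Registry struct {
	mu    sync.RWMutex
	conns map[string]string
}

// NewRegistry returns an empty registry, as each process would at startup.
func NewRegistry() *Registry {
	return &Registry{conns: make(map[string]string)}
}

// Register records a connection under its server-id in this process only.
func (r *Registry) Register(serverID, conn string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.conns[serverID] = conn
}

// Lookup reports whether this process knows the given server-id.
func (r *Registry) Lookup(serverID string) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	c, ok := r.conns[serverID]
	return c, ok
}

func main() {
	// Two replicas means two independent maps with no shared state.
	replicaA := NewRegistry()
	replicaB := NewRegistry()

	// The binary connects to replica B and registers there.
	replicaB.Register("server-123", "binary-conn")

	// The phone lands on replica A, which has never seen that server-id.
	_, okA := replicaA.Lookup("server-123")
	_, okB := replicaB.Lookup("server-123")
	fmt.Println("replica A finds server-123:", okA)
	fmt.Println("replica B finds server-123:", okB)
}
```

In this sketch replica A misses and replica B hits, which is the routing failure described above.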

Docker + Fly.io makes `fly scale count 3` a one-line command. Without a guard, someone (operator, AI agent, future-self) could scale out reflexively and silently break server-id routing for half the connections.

The belt-and-suspenders principle from [[PROJECT-MEMORY#Behavioral / Instruction-Design]] applies: the doc is the stochastic guard; the startup self-check is the deterministic backstop.

## What

**Two parts plus a failure-mode requirement, all shipping in this ticket:**

1. **Doc:** `docs/architecture.md` gains an explicit "v1 = single-instance" section explaining the constraint, the registry shape, and what would have to change for multi-instance (shared registry via Redis pub/sub, NATS subjects, or sticky-session-on-server-id at the LB layer).

2. **Startup self-check:** if the host platform exposes a replica-count signal, refuse to start when count > 1. Mechanism depends on host:
   - **Fly.io:** running `fly machines list --app` (keyed on `FLY_APP_NAME`) from inside the container is heavy. Cheaper: use `FLY_MACHINE_ID` plus a read-only check that no sibling machines exist for the same app. Simpler still: an env-var contract, where the deploy manifest sets `PYRYCODE_RELAY_ASSERT_SINGLE_INSTANCE=true` and the relay refuses to start in production mode if it is NOT set.
   - **Hetzner / generic:** less standard signal — the env-var contract pattern works regardless.
   - **Architect to choose** — env-var contract is cheaper + portable; platform-API check is more robust but per-host.

3. **Failure mode is loud:** refuse to start and log `ERROR: multi-instance deploy detected, relay registry is in-memory and cannot share state across replicas. Set PYRYCODE_RELAY_SINGLE_INSTANCE=1 to bypass (NOT recommended for production)`. Document the bypass for emergencies and future migration windows.
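As a rough illustration of the env-var contract option, the sketch below refuses to start in production mode unless `PYRYCODE_RELAY_ASSERT_SINGLE_INSTANCE` is set to `true`. It is not a chosen design: the `checkSingleInstance` helper, the `RELAY_ENV` variable used to detect production mode, and the error wording are assumptions made for the example, and the bypass variable from the log message is not modeled here.

```go
package main

import (
	"fmt"
	"os"
)

// assertVar is the contract variable named in this ticket.
const assertVar = "PYRYCODE_RELAY_ASSERT_SINGLE_INSTANCE"

// checkSingleInstance is a hypothetical helper. In production mode it returns
// an error unless the deploy manifest has set assertVar to "true". The getenv
// parameter makes the check testable without touching the real environment.
func checkSingleInstance(production bool, getenv func(string) string) error {
	if !production {
		return nil // development runs are not gated in this sketch
	}
	if getenv(assertVar) != "true" {
		return fmt.Errorf("%s is not set to \"true\": relay registry is "+
			"in-memory and cannot share state across replicas", assertVar)
	}
	return nil
}

func main() {
	// RELAY_ENV is an assumed way to detect production mode; the ticket
	// does not say how the relay determines this.
	production := os.Getenv("RELAY_ENV") == "production"

	if err := checkSingleInstance(production, os.Getenv); err != nil {
		fmt.Fprintln(os.Stderr, "ERROR:", err)
		os.Exit(1) // refuse to start, loudly
	}
	fmt.Println("single-instance check passed")
}
```

Passing `getenv` as a parameter keeps the check a pure function, so a platform-API check could later replace it behind the same call site if the architect prefers that route.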

## Out of scope

- Implementing shared-registry support (Redis / NATS / sticky-session): that's the next-architecture ticket if v1 outgrows single-instance. It will be filed separately once there is evidence of demand.
- Health-check integration with platform autoscalers: deferred until the architecture supports multi-instance.
