Merged
28 changes: 28 additions & 0 deletions .github/workflows/ci.yml
@@ -82,3 +82,31 @@ jobs:
          # Trivy audits what physically shipped in the image.
          vuln-type: os,library
          exit-code: '1'

  deploy:
    # Runs only on push to main. PRs (including from forks) never satisfy
    # this condition, so `FLY_API_TOKEN` is structurally unreachable from
    # any job that runs untrusted PR code.
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    needs: [test, security, image-scan]
    runs-on: ubuntu-latest
    # Belt-and-suspenders: the workflow-level permissions block above
    # already grants only `contents: read`, but a future top-level
    # escalation would silently widen this job's scope. Pin the minimum
    # here so the deploy job stays read-only against the repo even if
    # the header drifts. `flyctl` itself authenticates to Fly via
    # `FLY_API_TOKEN`, not via GitHub permissions.
    permissions:
      contents: read
    steps:
      - uses: actions/checkout@v6
      # Tracks: superfly/flyctl-actions/setup-flyctl@1.6
      # Pinned by commit SHA so a tag-swap upstream cannot change what
      # holds `FLY_API_TOKEN` during the next deploy. Refresh the comment
      # in lockstep with the SHA. Same convention as the Trivy pin in
      # `image-scan` (#68) and the govulncheck pin in `security` (#41).
      - uses: superfly/flyctl-actions/setup-flyctl@ed8efb33836e8b2096c7fd3ba1c8afe303ebbff1
      - name: flyctl deploy
        run: flyctl deploy --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
8 changes: 8 additions & 0 deletions docs/architecture.md
@@ -62,6 +62,14 @@ Intended uses:

**NOT recommended for production.** Setting it permanently silences a check whose entire purpose is to catch the silent-routing-failure mode described above.

## Hosting

Production deploys to a single **Fly.io** machine in one region. TLS terminates in the relay binary (autocert, #9) — Fly runs the substrate in raw-TCP passthrough mode on `:80` and `:443`, with a dedicated IPv4 so Let's Encrypt's HTTP-01 challenge resolves deterministically. The autocert cache lives on a Fly volume at `/var/lib/relay/autocert`.

The single-machine cap is platform-enforced via `min_machines_running = 1`, `auto_start_machines = false`, and `auto_stop_machines = "off"` in `fly.toml`, and binary-enforced via the `PYRYCODE_RELAY_SINGLE_INSTANCE` self-check (#65). Multi-instance scaling is out of scope for v1 — see § *Single-instance constraint* above.
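
The platform-side half of the cap can be linted before a deploy. The following is a minimal sketch, not something the repo is known to ship: `check_setting` and `check_single_machine_cap` are hypothetical helper names, and the check assumes each setting sits on its own `key = value` line in `fly.toml`, optionally followed by a comment.

```shell
#!/bin/sh
# Sketch only: these helpers are illustrative and not part of the repo.

# Succeeds if the manifest ($1) contains the line `key = value`, where
# $2 is the key and $3 is the expected TOML value written literally.
check_setting() {
  grep -Eq "^[[:space:]]*$2[[:space:]]*=[[:space:]]*$3[[:space:]]*(#.*)?\$" "$1"
}

# Succeeds only if all three single-machine settings named above are present
# with the expected values; reports each mismatch on stderr.
check_single_machine_cap() {
  manifest=${1:-fly.toml}
  status=0
  check_setting "$manifest" min_machines_running 1 ||
    { echo "expected: min_machines_running = 1" >&2; status=1; }
  check_setting "$manifest" auto_start_machines false ||
    { echo "expected: auto_start_machines = false" >&2; status=1; }
  check_setting "$manifest" auto_stop_machines '"off"' ||
    { echo 'expected: auto_stop_machines = "off"' >&2; status=1; }
  return "$status"
}
```

Calling `check_single_machine_cap fly.toml` returns non-zero if any of the three values has drifted. It does not replace the in-binary self-check, which remains the backstop at runtime.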

Bootstrap and rollback procedures: [`docs/deploy.md`](deploy.md). The manifest itself: [`fly.toml`](../fly.toml). CI deploy job: [`.github/workflows/ci.yml`](../.github/workflows/ci.yml).

## Threat model

Wire-protocol threats live in the protocol spec's [Security model](https://github.com/pyrycode/pyrycode/blob/main/docs/protocol-mobile.md#security-model). Operational threats specific to the relay binary as a deployed process — deploy, supply chain, DoS, log hygiene, cert handling, TLS config, error-leakage — live in [`docs/threat-model.md`](./threat-model.md).
75 changes: 75 additions & 0 deletions docs/deploy.md
@@ -0,0 +1,75 @@
# Deploy

The relay deploys to a single [Fly.io](https://fly.io) machine. CI deploys on every merge to
`main`; manual deploys are needed only for the one-time bootstrap and for
rollbacks.

See [`docs/architecture.md` § *Hosting*](architecture.md#hosting) for the
hosting + TLS-termination decision record, and
[`docs/threat-model.md` § *Deploy security*](threat-model.md#deploy-security--vps-compromise)
for the operational threat surface the substrate sits on.

## One-time bootstrap (per environment)

Done once per Fly app — typically only the production app.

1. Edit `fly.toml`: replace `__REGION__` with a Fly region code (e.g.
`ams`, `arn`, `fra` — see `flyctl platform regions`) and `__DOMAIN__`
with the public domain (e.g. `relay.pyrycode.dev`). These are
placeholders by design — the first deploy fails loudly if either is
left unset, which is preferable to a silently-misconfigured production
relay.
2. `flyctl apps create pyrycode-relay` (must match `app =` in `fly.toml`).
3. `flyctl ips allocate-v4 --app pyrycode-relay` — a dedicated IPv4 is
**required**, not optional, for autocert's HTTP-01 challenge to
resolve deterministically to the running machine on port 80. Shared
IPv4 + TCP passthrough is not a supported combination on Fly. The dedicated
address is billable; call it out at provisioning review.
4. `flyctl volumes create relay_autocert --region <region> --size 1
--app pyrycode-relay` — the volume name must match `source =` in the
`[[mounts]]` block of `fly.toml`. The autocert cache lives here and
survives machine recycles, avoiding repeated Let's Encrypt
re-issuance.
5. DNS: point the production domain (A record) at the IPv4 from step 3.
Let's Encrypt validates the domain via the HTTP-01 challenge on first
deploy; without DNS in place, the first WSS request hangs for minutes
while autocert retries.
6. GitHub repo secret: `FLY_API_TOKEN` = output of `flyctl auth token`.
Settings → Secrets and variables → Actions → New repository secret.
The token grants deploy access to the entire Fly org — scope it to a
`pyrycode-relay`-only deploy token if Fly's tokens UI offers that at
bootstrap time.
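
Steps 1 to 4 can be sketched as a small shell helper. This is a minimal sketch under stated assumptions, not a script the repo is known to contain: `placeholders_remaining`, `run`, `bootstrap`, and the `DRY_RUN` variable are hypothetical names introduced here, while the `flyctl` invocations are the ones listed in the steps above. It prints the commands by default instead of executing them, so the billable IPv4 allocation can be reviewed first.

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are illustrative, not part of the repo.

# Step 1 guard: succeeds (and prints the matching lines) if the manifest
# still contains an unreplaced __REGION__ or __DOMAIN__ placeholder.
placeholders_remaining() {
  grep -nE '__(REGION|DOMAIN)__' "$1"
}

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Steps 2-4: create the app, allocate a dedicated IPv4, create the volume.
# $1 = Fly app name (must match `app =` in fly.toml), $2 = Fly region code.
bootstrap() {
  app=$1
  region=$2
  run flyctl apps create "$app" &&
    run flyctl ips allocate-v4 --app "$app" &&
    run flyctl volumes create relay_autocert --region "$region" --size 1 --app "$app"
}
```

For example, `placeholders_remaining fly.toml` lists any leftover placeholders, and `bootstrap pyrycode-relay ams` prints the three commands; setting `DRY_RUN=0` executes them. Steps 5 and 6 (DNS and the GitHub secret) stay manual in this sketch.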

## Steady-state flow

1. Open a PR. CI runs `test`, `security`, and `image-scan` on the PR HEAD.
2. Merge to `main`. CI re-runs the three jobs against `main`, then runs
`deploy` (gated on all three passing).
3. `deploy` invokes `flyctl deploy --remote-only`. Fly's remote builder
rebuilds the image from `Dockerfile` and replaces the single machine
in place via the `immediate` deploy strategy.

Observability:

- `flyctl status` — machine health.
- `flyctl logs --app pyrycode-relay` — relay stderr in real time.
- The deploy job's GitHub Actions log records the build + roll output.

## Rollback

Two paths, in increasing order of disruption.

1. **By image digest (preferred).** `flyctl releases list` shows recent
release digests. `flyctl deploy --image <prior-digest> --remote-only`
pins to the prior image without rebuilding. The autocert cache
persists across rollbacks (it's on the volume), so no Let's Encrypt
re-issuance is triggered.
2. **By release number.** `flyctl releases rollback` rolls back the
*most recent* release. Use when the prior release's digest isn't to
hand and a rollback is needed immediately.
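
The preferred path can be wrapped so the image reference is never omitted by accident. This is a minimal sketch: `rollback_to_image`, `run`, and `DRY_RUN` are hypothetical names, and the only `flyctl` call is the `deploy --image ... --remote-only` form described in path 1, with `--app` added on the assumption that the app name is passed explicitly.

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are illustrative, not part of the repo.

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Redeploy a previously released image without rebuilding.
# $1 = Fly app name, $2 = image reference (digest) of the prior release.
rollback_to_image() {
  app=${1:-}
  image=${2:-}
  if [ -z "$app" ] || [ -z "$image" ]; then
    echo "usage: rollback_to_image <app> <prior-image-digest>" >&2
    return 2
  fi
  run flyctl deploy --app "$app" --image "$image" --remote-only
}
```

Running `rollback_to_image pyrycode-relay <prior-digest>` prints the deploy command for review; `DRY_RUN=0` performs the rollback.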

A rollback does **not** revert the `main` commit. To keep CI's next
deploy from immediately shipping the broken release again, either
revert the offending PR before the next merge to `main`, or temporarily
disable the `deploy` job by editing `.github/workflows/ci.yml` in a
revert PR.
4 changes: 3 additions & 1 deletion docs/knowledge/INDEX.md
@@ -4,6 +4,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Features

- [Fly.io deploy](features/fly-deploy.md) — production host wiring: `fly.toml` declares TCP-passthrough on `:80`/`:443` (no Fly HTTP proxy, no Fly-managed certs) so TLS keeps terminating in the relay via autocert (#9), persistent Fly volume `relay_autocert` mounted at `/var/lib/relay/autocert`, and a single-machine hard cap encoded via `min_machines_running=1` + `auto_start_machines=false` + `auto_stop_machines="off"` + `[deploy] strategy="immediate"` (Fly Apps v2 has no `max_machines` key; the in-binary `PYRYCODE_RELAY_SINGLE_INSTANCE` self-check from #65 is the backstop). CI `deploy` job in `.github/workflows/ci.yml` runs `flyctl deploy --remote-only` on push to `main`, gated by branch-condition + `needs: [test, security, image-scan]` + `permissions: contents: read` so `FLY_API_TOKEN` is structurally unreachable from PR code; `superfly/flyctl-actions/setup-flyctl` pinned by commit SHA with `# Tracks:` comment (same convention as #68 / #41). Dedicated IPv4 is required (not optional) for autocert's HTTP-01 challenge; TCP passthrough preserves the real socket peer IP that #34's rate limiter reads. `__REGION__` / `__DOMAIN__` ship as placeholders that fail loud on first deploy (#38).
- [Connection-count gauges](features/connection-count-gauges.md) — `pyrycode_relay_connected_binaries` and `pyrycode_relay_connected_phones` exposed via a pull-based `prometheus.Collector` reading `Registry.Counts()` on each scrape; zero edits to `registry.go`; scalar (no labels) by design — `{server="..."}` would carry the attacker-influenced `x-pyrycode-server` header onto the metrics surface, which threat-model § Log hygiene forbids; stale grace-expiry fires can't move the gauge because the pointer-identity guard (ADR-0006) keeps the maps unchanged and the gauge IS the map size; race-tested against 16 mutator goroutines + a tight-loop scraper under `-race`. First collector wired into the #59 seam (#61).
- [Metrics registry (scaffolding)](features/metrics-registry.md) — private `*prometheus.Registry` + `NewMetricsHandler` factory wrapping `promhttp.HandlerFor` (text format only; OpenMetrics off; `HandlerOpts.Registry: reg` keeps `promhttp_metric_handler_*` off `DefaultRegisterer`). Seam shape for siblings: per-concern collector struct in its own file, constructed by a helper taking `prometheus.Registerer` (no mega-struct, no package-level vars) — first instantiated by #61's `connectionsCollector`. Listener still pending (#60). Structural defence against default-registry leaks via `TestMetricsRegistry_NoGlobalRegistrarLeak` (#59).
- [Docker image](features/docker-image.md) — portable OCI artifact: multi-stage `Dockerfile` builds a fully-static binary (`CGO_ENABLED=0`, `-trimpath -s -w`) into `distroless/static-debian12:nonroot`; both base images digest-pinned with `# Tracks:` comments; exposes `:80`/`:443` and declares `/var/lib/relay/autocert` volume; host-specific wiring (TLS policy, ports, volumes, healthcheck) is #38's problem (#32). PR-time Trivy CVE scan against the just-built image lives in CI as the `image-scan` job, fails on **fixable** CRITICAL/HIGH only (`ignore-unfixed: true`), action pinned by commit SHA with `# Tracks: <tag>` comment mirroring the Dockerfile pin convention; intentional overlap with `govulncheck` (source-reachability vs. shipped-artifact) (#68). Both scanners are also re-run daily against `main` via `.github/workflows/security-scan.yml` (cron + `workflow_dispatch`) so disclosed CVEs against unchanged deps surface within ≤24h rather than staying invisible until the next bump (#72); a red cron run also opens a `security-sensitive`-labelled GitHub issue via the workflow's `file-issue` job (artifact-handoff privilege split keeps `issues: write` off the scanners and out of workflow scope; deterministic-title dedup via `gh issue list --search 'in:title …'`) so regressions land as tracked work-items rather than passive Actions rows (#73).
@@ -31,7 +32,8 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Architecture

- [System overview](../architecture.md) — top-level: stateless WS router between phones and pyry binaries. Names the v1 single-instance constraint (in-memory registry → two replicas hold disjoint registries → silent `4404`), the two multi-instance paths documented as future work (shared registry; sticky-on-`x-pyrycode-server`), and the `PYRYCODE_RELAY_SINGLE_INSTANCE=1` bypass for the #65 startup self-check (#64). (Lives at `docs/architecture.md`; not yet split into `architecture/`.)
- [System overview](../architecture.md) — top-level: stateless WS router between phones and pyry binaries. Names the v1 single-instance constraint (in-memory registry → two replicas hold disjoint registries → silent `4404`), the two multi-instance paths documented as future work (shared registry; sticky-on-`x-pyrycode-server`), and the `PYRYCODE_RELAY_SINGLE_INSTANCE=1` bypass for the #65 startup self-check (#64). § *Hosting* records the Fly.io + relay-terminated-TLS + single-machine-via-fly.toml decision (#38). (Lives at `docs/architecture.md`; not yet split into `architecture/`.)
- [Deploy procedures](../deploy.md) — operator-facing bootstrap (one-time: `flyctl apps create` / `flyctl ips allocate-v4` / `flyctl volumes create relay_autocert` / DNS / `FLY_API_TOKEN` secret), steady-state (PR → merge to `main` → CI deploys), and rollback (`flyctl deploy --image <prior-digest>` preferred; `flyctl releases rollback` as fallback). Dedicated IPv4 is required for autocert's HTTP-01 challenge — not optional, billable (#38).

## Cross-cutting
