Commit 784decc

feat(api): Phase 7b foundation — init, datetime-TZ, /readiness, status cache, proxy widening (#75)
* docs(api): Phase 7b foundation implementation plan
  Bite-sized TDD plan derived from the merged Phase 7b spec, covering all 9 clusters in dependency order: datetime-TZ → printer identity → lifespan init-order → alembic verify → /readiness → status cache → frontend proxy → README → verification.
  Refs #22

* feat(api): add serialize_datetime_utc helper for RFC3339 with Z
  Go frontend oapi-codegen rejects naive datetimes. Helper normalises any datetime to a timezone-aware ISO string before serialisation.
  Refs #22

* fix(api): TemplateRead emits RFC3339 datetimes with Z suffix
  Go oapi-codegen client rejected naive datetimes from /api/templates with `parsing time "..." cannot parse "" as "Z07:00"`. Apply the new serialize_datetime_utc helper via @field_serializer.
  Refs #22

* refactor(api): hoist api_client_with_seed fixture into conftest
  Centralises the API integration test fixture so Phase 7b Task B3 (PrinterRead and JobRead) can reuse it without duplication or cross-file imports.
  Refs #22

* fix(api): PrinterRead + JobRead emit RFC3339 datetimes with Z suffix
  Same Go-oapi-codegen contract fix as TemplateRead. JobRead.started_at and finished_at each get their own serializer that handles the nullable case. conftest.py re-discovers IntegrationRegistry after lifespan shutdown so the api_client_with_seed fixture works for all tests in sequence, not just the first one.
  Refs #22

* refactor(api): SQLAlchemy datetime columns are timezone-aware UTC
  Every model column (templates/printers/jobs/presets/printer_state/printer_status_cache) now uses DateTime(timezone=True) with default_factory=lambda: datetime.now(UTC). Fresh inserts write tz-aware values that survive the SQLite roundtrip. Existing rows are migrated by the Phase 7b alembic data migration in Task B5.
  Refs #22

* fix(api): alembic data migration normalises naive datetimes to UTC
  Existing rows from Phase 5 inserts contain naive datetimes that break the Go frontend's RFC3339 parser. Migration appends '+00:00' to any value without an explicit TZ marker across templates/printers/jobs/presets/printer_state/printer_status_cache. Idempotent via WHERE NOT LIKE '%+%' AND NOT LIKE '%Z'. SQLite is dynamically typed so no ALTER TABLE is needed — the new column types from the previous commit only affect new inserts via the SQLAlchemy layer.
  Refs #22

* fix(integration): suppress alembic fileConfig in migration test to restore caplog
  Alembic's command.upgrade() calls logging.config.fileConfig() which, by default, uses disable_existing_loggers=True. This marks every logger not explicitly named in alembic.ini — including app.integrations — as logger.disabled=True. Any _logger.error()/_logger.exception() call on a disabled logger silently drops the record, breaking caplog assertions in test_discovery.py tests that ran after the migration tests. The fix mirrors the guard already present in app/db/lifespan.py: set cfg.attributes["configure_logger"] = False so alembic skips its logging reconfiguration entirely. The four previously failing caplog assertions now pass in all orderings.
  Refs #22

* feat(api): derive_printer_id helper for deterministic UUIDv5
  Lifespan can now compute a stable printer.id from env config (model, host, port) so the runtime printer and the DB row share the same id across restarts. Phase 7b Cluster 1b prep work.
  Refs #22

* feat(api): upsert_runtime_printer lifespan helper
  Creates or refreshes one DB Printer row from env config, keyed by the deterministic UUIDv5 from derive_printer_id(model, host, port). Returns None for the mock backend so the lifespan can no-op when no printer is configured. Idempotent across restarts.
  Refs #22

* refactor(api): driver.make_queue_printer accepts optional printer_id
  Lifespan can now hand the DB-deterministic UUID (from upsert_runtime_printer) to the in-memory queue printer so app.state.printer_id matches the DB row. Backwards compatible — omitting the parameter falls back to uuid4(). _PrinterLike.id and Job.printer_id promoted from str to UUID throughout the in-memory queue stack (print_queue, job_lifecycle, print_service) to maintain type consistency end-to-end.
  Refs #22

* fix(api): seed_templates aborts on empty loader cache instead of silent no-op
  Catches the Phase 7a bug pattern where lifespan called seed_templates before TemplateLoader.load_dir() — cache empty, 0 rows upserted, no error, UI shows no templates. The defensive RuntimeError surfaces the misordering at startup so it cannot reach production silently.
  Refs #22

* fix(api): re-order lifespan — load_dir before seed_templates + upsert printer
  Calls plugin discovery and TemplateLoader.load_dir() before seed_templates(), and adds upsert_runtime_printer(s, settings) between seed_templates and ensure_printer_state. Hands the resulting DB UUID to driver.make_queue_printer so app.state.printer_id matches the DB row. Closes the Phase 7a bug where a fresh deploy showed 0 templates and 0 printers in the UI. Removes the now-unnecessary D1 monkey-patches in test fixtures.
  Refs #22

* feat(api): verify_alembic_at_head fails fast on revision drift
  Lifespan calls verify_alembic_at_head(settings) right after run_migrations(). If the DB revision deviates from the script head (e.g. partial migration, downgrade, missing script file) the lifespan raises with a clear message before any ORM query runs. Takes settings explicitly (C2/D2 testability pattern) so unit tests can verify against ad-hoc DBs without monkey-patching get_settings(). Sync alembic work runs inside asyncio.to_thread to keep the event loop unblocked. configure_logger=False prevents alembic from clobbering pytest caplog handlers (Phase 7b B6 learning). Fixtures in test_lifespan.py and tests/integration/conftest.py extended to patch verify_alembic_at_head to a no-op alongside run_migrations, because create_all() does not populate alembic_version.
  Refs #22

* feat(api): readiness response schema (CheckStatus + ReadinessResponse)
  Frozen Pydantic models for the new /readiness deep-check endpoint introduced by Phase 7b Cluster 1e.
  Refs #22

* feat(api): readiness aggregator — database/alembic/templates/printer_runtime
  First four checks for the /readiness deep-check endpoint plus the ready/degraded/not-ready aggregation. Endpoint wiring lands in F4; remaining 4 checks (printer_db_sync, snmp_discovery, print_queue, sse_bus) land in F3.
  Refs #22

* feat(api): readiness aggregator — remaining 4 checks
  printer_db_sync, snmp_discovery (<90s ok / <600s stale / else fail), print_queue worker liveness, sse_bus subscriber capacity. Completes Cluster 1e aggregator. F4 wires the FastAPI route.
  Refs #22

* feat(api): expose /readiness deep-check endpoint
  Returns HTTP 200 with body.status in {ready, degraded} when the critical checks pass; 503 with status=not-ready when database/alembic/template_seed fail. Pangolin can switch its healthcheck.path to /readiness — Docker keeps polling /healthz for liveness-only.
  Refs #22

* test(api): regression guard — /healthz must answer 200 even when DB broken
  Locks in the Cluster 1e contract: liveness probe is restart-relevant (must NOT touch the DB), readiness probe owns the deep checks. Prevents accidental DB queries sneaking back into /healthz.
  Refs #22

* feat(status): StatusProbeProducer persists printer_status_cache rows
  Every probe success writes parsed JSON + captured_at; SNMP timeouts persist online=False + last_error while preserving the prior parsed snapshot. No schema change — uses Phase 5 columns.
  Refs #22

* feat(status): PrinterStatus carries cache freshness + offline reason
  Adds captured_at, last_probe_age_s, last_error, note to the response of /api/printers/{id}/status so the UI can render staleness and offline reasons instead of guessing.
  Refs #22

* fix(status): /api/printers/{id}/status reads from cache, no sync SNMP
  Eliminates the 5-second block when the printer is offline. The probe worker keeps printer_status_cache fresh in the background; this endpoint returns whatever is there in <10ms.
  Refs #22

* feat(ui): proxy /docs, /openapi.json, /redoc to the backend
  Swagger UI and the raw OpenAPI document are now reachable behind the public domain (which sits behind Pangolin SSO + the Basic-Auth bypass). Closes the 404 reported in the hhdocker02 production smoke test.
  Refs #22

* docs(api): document /healthz vs /readiness contract in the README
  Explains the liveness/readiness split introduced in Phase 7b Cluster 1e and links to the spec for the full check list. Recommends using /readiness for reverse-proxy routing checks while keeping /healthz on Docker container healthchecks.
  Refs #22

* fix(api): readiness sse_bus check supports real EventBus + Settings cap
  The real EventBus exposes distinct_subscriber_count() (no zero-arg subscriber_count) and reads its cap from settings.sse_max_subscribers (no max_subscribers attribute). Probe both surfaces so production and unit-test fakes both report correct subscriber counts and caps.
  Refs #22

* fix(api): populate PrinterStatus.tape_loaded + error_state from cache
  Bot reviews (Copilot + Gemini, identical HIGH-priority finding on PR #75) flagged that the G3 endpoint rewrite stopped populating the schema's tape_loaded and error_state fields — they were always null. Map the cache JSON: loaded_tape_mm=12 → tape_loaded="12mm", error_flags=[...] → error_state="flag1, flag2". Existing test test_status_endpoint_returns_cached_tape_data extended to lock the contract. Also sanitises two private hostname references in the plan file that tripped the Privacy / secret scan workflow.
  Refs #22
1 parent c5a7964 commit 784decc
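
The commit message describes `serialize_datetime_utc` as normalising any datetime to a timezone-aware RFC3339 string with a `Z` suffix, but the helper body is not part of the visible diff. The following is a minimal sketch of how such a helper could look, assuming naive values are interpreted as UTC (the same assumption the data migration makes); the real implementation may differ.

```python
from datetime import datetime, timedelta, timezone


def serialize_datetime_utc(value: datetime) -> str:
    """Return an RFC3339 string in UTC with a trailing 'Z'.

    Assumption: a naive datetime is treated as already being in UTC.
    """
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)
    else:
        value = value.astimezone(timezone.utc)
    # isoformat() renders UTC as '+00:00'; swap it for the 'Z' form.
    return value.isoformat().replace("+00:00", "Z")


naive = datetime(2026, 5, 17, 12, 0, 0)
aware = datetime(2026, 5, 17, 14, 0, 0, tzinfo=timezone(timedelta(hours=2)))
print(serialize_datetime_utc(naive))
print(serialize_datetime_utc(aware))
```

Both calls print the same instant, which is the property the Go `oapi-codegen` client needs: every serialised value carries an explicit zone.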

59 files changed

Lines changed: 5358 additions & 365 deletions

README.md

Lines changed: 16 additions & 0 deletions
@@ -92,6 +92,22 @@ curl http://localhost:8080/healthz # frontend → backend_reachable: true
 | `POST` | `/jobs/{job_id}/resume` | Resume a job paused by tape mismatch (after the user changed the tape physically) ||
 | `POST` | `/printer/resume` | Resume the printer queue after a recoverable error halted it (tape empty / cover open / offline) ||
 | `GET` | `/healthz` | Liveness probe for orchestrators ||
+| `GET` | `/readiness` | Readiness probe — deep check for reverse-proxy routing ||
+
+### Health Probes
+
+The backend exposes two HTTP probes with different semantics:
+
+| Endpoint | Purpose | What it answers |
+|----------|---------|-----------------|
+| `GET /healthz` | Liveness — Docker / Kubernetes container restart signal | "the process and the event loop are alive" |
+| `GET /readiness` | Readiness — reverse-proxy routing signal | "the process can serve traffic right now": database connectable, alembic at head, templates seeded, runtime printer matches DB, SNMP probe fresh, queue worker alive, SSE bus capacity ok |
+
+`/readiness` returns HTTP 200 with `status` of `ready` (all checks ok) or `degraded` (non-critical checks failing — still routable), and HTTP 503 with `not-ready` when a critical check (database, alembic, template_seed) fails.
+
+Pangolin's `targets[0].healthcheck.path` can use `/readiness` for deep checks instead of `/healthz`; Docker container healthchecks should stay on `/healthz` to avoid restart loops on transient DB failures.
+
+See `docs/superpowers/specs/2026-05-17-phase-7b-foundation-design.md` for the full check list and rationale.
 
 ### `POST /print` request body
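
The README text above defines three overall states and two HTTP codes. The rule can be sketched as below; `aggregate_readiness` and the per-check result strings are hypothetical names chosen for illustration, since the aggregator module itself is not part of the visible diff. Only the critical check names (`database`, `alembic`, `template_seed`) come from the README.

```python
# Critical checks as named in the README; a failure here means "do not route".
CRITICAL_CHECKS = frozenset({"database", "alembic", "template_seed"})


def aggregate_readiness(checks: dict[str, str]) -> tuple[str, int]:
    """Map per-check results to (overall status, HTTP status code).

    Assumption: each check reports the string "ok" on success and any
    other value (for example "stale" or "fail") otherwise.
    """
    failing = {name for name, result in checks.items() if result != "ok"}
    if failing & CRITICAL_CHECKS:
        return "not-ready", 503
    if failing:
        return "degraded", 200
    return "ready", 200


print(aggregate_readiness({"database": "ok", "alembic": "ok", "sse_bus": "ok"}))
print(aggregate_readiness({"database": "ok", "snmp_discovery": "stale"}))
print(aggregate_readiness({"database": "fail", "snmp_discovery": "ok"}))
```

Keeping `degraded` on HTTP 200 is what lets a reverse proxy keep routing while a non-critical dependency such as the SNMP probe is stale.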

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+"""Phase 7b — normalise existing datetime rows to timezone-aware ISO strings.
+
+Existing rows from Phase 5 inserts contain naive datetimes (no TZ suffix)
+that break the Go frontend's RFC3339 parser. This migration appends
+`+00:00` to any value that does NOT already contain `+` or end with `Z`.
+SQLite is dynamically typed so no ALTER TABLE is required — the new column
+type from B4 only affects new inserts via the SQLAlchemy layer.
+
+Revision ID: 20260517_phase7b_datetime_tz
+Revises: b2668b6e8845
+Create Date: 2026-05-17
+"""
+
+from alembic import op
+
+# revision identifiers, used by Alembic.
+revision = "20260517_phase7b_datetime_tz"
+down_revision = "b2668b6e8845"
+branch_labels = None
+depends_on = None
+
+
+_TABLES_DT = [
+    ("templates", ["created_at", "updated_at"]),
+    ("printers", ["created_at", "updated_at"]),
+    ("jobs", ["created_at", "updated_at", "started_at", "finished_at"]),
+    ("presets", ["created_at", "updated_at"]),
+    ("printer_state", ["updated_at"]),
+    ("printer_status_cache", ["captured_at", "updated_at"]),
+]
+
+
+def upgrade() -> None:
+    for table, cols in _TABLES_DT:
+        for col in cols:
+            op.execute(
+                f"UPDATE {table} SET {col} = {col} || '+00:00' "
+                f"WHERE {col} IS NOT NULL "
+                f"AND {col} NOT LIKE '%+%' "
+                f"AND {col} NOT LIKE '%Z'"
+            )
+
+
+def downgrade() -> None:
+    # The naive-datetime state being reverted to is exactly the bug we
+    # are fixing. Downgrade is intentionally a no-op.
+    pass
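
The idempotency claim for this migration can be checked outside Alembic. The sketch below runs the same `UPDATE ... WHERE` statement twice against an in-memory SQLite database; the single-column `jobs` table and the sample values are made up for the demonstration and are not the project's schema.

```python
import sqlite3

# Same statement shape as the migration, fixed to one table and column.
UPDATE_SQL = (
    "UPDATE jobs SET created_at = created_at || '+00:00' "
    "WHERE created_at IS NOT NULL "
    "AND created_at NOT LIKE '%+%' "
    "AND created_at NOT LIKE '%Z'"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO jobs (id, created_at) VALUES (?, ?)",
    [
        (1, "2026-05-01 10:00:00.000000"),        # naive: receives the suffix
        (2, "2026-05-01 10:00:00.000000+00:00"),  # already has an offset
        (3, "2026-05-01T10:00:00Z"),              # already ends with Z
        (4, None),                                # nullable column stays NULL
    ],
)

# Running the statement twice must give the same result as running it once.
for _ in range(2):
    conn.execute(UPDATE_SQL)

rows = dict(conn.execute("SELECT id, created_at FROM jobs ORDER BY id"))
for row_id, value in rows.items():
    print(row_id, value)
conn.close()
```

Only row 1 changes, and only once. Note that the filter recognises `+` offsets and a trailing `Z`; a stored value with a negative offset such as `-05:00` would match neither pattern and would also get `+00:00` appended, so the statement relies on earlier inserts having been naive or UTC.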

backend/app/api/routes/print.py

Lines changed: 2 additions & 1 deletion
@@ -4,6 +4,7 @@
 
 import logging
 from typing import Any
+from uuid import UUID
 
 from fastapi import APIRouter, HTTPException, Request, status
 from fastapi.responses import JSONResponse
@@ -32,7 +33,7 @@
 class _PrinterResumeResponse(BaseModel):
     """200 response body for POST /printer/resume."""
 
-    printer_id: str
+    printer_id: UUID | str
     state: str
 

backend/app/api/routes/printers.py

Lines changed: 30 additions & 45 deletions
@@ -22,7 +22,6 @@
 
 from __future__ import annotations
 
-import asyncio
 import dataclasses
 import logging
 from datetime import UTC, datetime
@@ -166,65 +165,51 @@ def _error_label(block: Any) -> str | None:
 @router.get(
     "/{printer_id}/status",
     response_model=PrinterStatus,
-    summary="Force a fresh printer status probe",
+    summary="Return the latest cached printer status",
     description=(
-        "Sends an ESC i S command to the printer over TCP/9100. "
-        "The result is written back to ``printer_status_cache`` and returned. "
-        "Returns 503 when the printer is unreachable."
+        "Returns the most recent status written by the background SNMP probe worker. "
+        "The response is served from ``printer_status_cache`` — no synchronous SNMP "
+        "probe is performed, so the response always returns in <10 ms. "
+        "When no probe has completed yet ``online`` is ``null`` and ``note`` explains why. "
+        "Returns 404 when the printer is not registered."
     ),
 )
 async def get_printer_status(
     printer_id: UUID,
     session: SessionDep,
 ) -> PrinterStatus:
-    """Probe the printer and update the cache."""
-    printer = await _get_printer_or_404(session, printer_id)
+    """Return the latest cached status for a printer; no sync SNMP probe."""
+    await _get_printer_or_404(session, printer_id)
 
-    host: str | None = printer.connection.get("host") if printer.connection else None
-    if not host:
-        raise HTTPException(
-            status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
-            detail=f"printer {printer_id} has no 'host' in connection config",
+    row = await cache_repo.get(session, printer_id)
+    if row is None or row.captured_at is None:
+        return PrinterStatus(
+            printer_id=printer_id,
+            online=None,
+            captured_at=None,
+            note="No probe yet — wait up to 30s for first probe cycle",
         )
 
-    port: int = int(printer.connection.get("port", 9100))
-
-    try:
-        result = await asyncio.to_thread(_probe_status_sync, host, port)
-    except OSError as exc:
-        raise HTTPException(
-            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
-            detail=f"printer {printer_id} unreachable: {exc}",
-        ) from exc
+    parsed = row.parsed or {}
+    captured = row.captured_at
+    if captured.tzinfo is None:
+        captured = captured.replace(tzinfo=UTC)
+    age_s = int((datetime.now(UTC) - captured).total_seconds())
 
-    block = result["block"]
-    raw: bytes = result["raw"]
-    now = datetime.now(UTC)
-
-    parsed: dict[str, Any] = {
-        "media_width_mm": block.media_width_mm,
-        "media_type": block.media_type.name,
-        "status_type": block.status_type.name,
-        "phase_type": block.phase_type.name,
-        "errors": int(block.errors),
-        "tape_color": block.tape_color.name,
-        "text_color": block.text_color.name,
-    }
+    loaded_tape_mm = parsed.get("loaded_tape_mm")
+    tape_loaded = f"{loaded_tape_mm}mm" if loaded_tape_mm else None
 
-    await cache_repo.upsert(
-        session,
-        printer_id,
-        raw_block=raw,
-        parsed=parsed,
-        captured_at=now,
-    )
+    error_flags = parsed.get("error_flags") or []
+    error_state = ", ".join(error_flags) if error_flags else None
 
     return PrinterStatus(
         printer_id=printer_id,
-        online=True,
-        tape_loaded=_tape_label(block),
-        error_state=_error_label(block),
-        captured_at=now,
+        online=parsed.get("online"),
+        tape_loaded=taped_loaded_placeholder,
+        error_state=error_state,
+        captured_at=row.captured_at,
+        last_probe_age_s=age_s,
+        last_error=parsed.get("last_error"),
     )
 
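
The cache-to-response mapping in the rewritten endpoint can be exercised without FastAPI or a database. The sketch below lifts that mapping into a plain function; `summarise_cached_status` is a hypothetical name, and it returns a dict rather than the project's `PrinterStatus` model so the example stays self-contained. The flag names in the sample call are invented.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def summarise_cached_status(
    parsed: dict[str, Any],
    captured_at: datetime,
    now: datetime,
) -> dict[str, Optional[Any]]:
    """Derive the display fields from one cached probe snapshot."""
    # Rows written before the timezone migration may still be naive;
    # treat them as UTC, as the endpoint does.
    if captured_at.tzinfo is None:
        captured_at = captured_at.replace(tzinfo=timezone.utc)

    loaded_tape_mm = parsed.get("loaded_tape_mm")
    error_flags = parsed.get("error_flags") or []
    return {
        "online": parsed.get("online"),
        "tape_loaded": f"{loaded_tape_mm}mm" if loaded_tape_mm else None,
        "error_state": ", ".join(error_flags) if error_flags else None,
        "last_probe_age_s": int((now - captured_at).total_seconds()),
        "last_error": parsed.get("last_error"),
    }


now = datetime(2026, 5, 17, 12, 0, 30, tzinfo=timezone.utc)
snapshot = {"online": True, "loaded_tape_mm": 12, "error_flags": ["cover_open", "no_media"]}
result = summarise_cached_status(snapshot, datetime(2026, 5, 17, 12, 0, 0), now)
print(result)
```

Passing `now` in as a parameter, rather than calling `datetime.now()` inside, is a choice made here so the age calculation is deterministic in tests.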

backend/app/db/lifespan.py

Lines changed: 126 additions & 3 deletions
@@ -7,15 +7,28 @@
 
 Call order in main.py lifespan:
 1. run_migrations() — apply pending Alembic revisions
-2. recover_inflight_jobs() — mark stale QUEUED/PRINTING jobs as failed_restart
-3. seed_templates() — upsert YAML seed templates into DB
-4. ensure_printer_state() — create missing printer_state rows
+1b. verify_alembic_at_head() — assert DB revision == script head (fail fast)
+2. _discover_plugins() — register integration + model plugins (idempotent)
+3. TemplateLoader.load_dir() — populate in-memory template cache (Cluster 1a)
+4. recover_inflight_jobs() — mark stale QUEUED/PRINTING jobs as failed_restart
+5. seed_templates() — YAML → DB upsert (defensive check on cache)
+6. upsert_runtime_printer() — env → DB Printer row (Cluster 1b)
+7. ensure_printer_state() — create missing printer_state rows per Printer
+
+Note: steps 2 and 3 must precede step 5 — TemplateLoader.load_dir() validates
+templates against IntegrationRegistry (populated in step 2), and seed_templates()
+reads from the cache that load_dir() populates in step 3.
 """
 
 from __future__ import annotations
 
+from uuid import UUID
+
 from sqlalchemy.ext.asyncio import AsyncSession
 
+from app.config import Settings
+from app.models.printer import Printer
+from app.services.printer_identity import derive_printer_id
 from app.services.template_loader import TemplateLoader
 
 
@@ -49,6 +62,55 @@ def _upgrade() -> None:
     await asyncio.to_thread(_upgrade)
 
 
+async def verify_alembic_at_head(settings: Settings) -> None:
+    """Raise RuntimeError if the DB's alembic revision does not match the script head.
+
+    Lifespan calls this right after run_migrations() so a half-applied or
+    corrupted DB fails startup loudly with a clear log line, instead of
+    crashing later inside ORM queries with cryptic schema errors.
+
+    Takes settings explicitly so unit tests can verify against ad-hoc DBs
+    without monkey-patching the get_settings() lru_cache singleton — that's
+    the C2/D2 testability pattern.
+    """
+    import asyncio
+    from pathlib import Path as _Path
+
+    from alembic.config import Config
+    from alembic.runtime.migration import MigrationContext
+    from alembic.script import ScriptDirectory
+    from sqlalchemy import create_engine
+
+    # backend/app/db/lifespan.py → parents[2] = backend/
+    ini_path = _Path(__file__).resolve().parents[2] / "alembic.ini"
+
+    def _check() -> tuple[str | None, str | None]:
+        cfg = Config(str(ini_path))
+        # Prevent alembic from calling logging.config.fileConfig() which would
+        # reconfigure the root logger and break pytest caplog fixtures.
+        cfg.attributes["configure_logger"] = False
+        script = ScriptDirectory.from_config(cfg)
+        head_rev = script.get_current_head()
+
+        # SQLAlchemy's synchronous engine: strip the async driver suffix
+        sync_url = settings.database_url.replace("+aiosqlite", "")
+        engine = create_engine(sync_url)
+        try:
+            with engine.connect() as conn:
+                ctx = MigrationContext.configure(conn)
+                current_rev = ctx.get_current_revision()
+        finally:
+            engine.dispose()
+
+        return current_rev, head_rev
+
+    current_rev, head_rev = await asyncio.to_thread(_check)
+    if current_rev != head_rev:
+        raise RuntimeError(
+            f"Alembic migration drift detected: DB at {current_rev!r}, expected head {head_rev!r}"
+        )
+
+
 async def recover_inflight_jobs(session: AsyncSession) -> int:
     """Mark any QUEUED or PRINTING jobs as FAILED_RESTART.
 
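
The control flow of `verify_alembic_at_head` (blocking lookup pushed to a worker thread, then a fail-fast comparison) can be shown without Alembic installed. In this sketch `read_revisions` is a hypothetical stand-in for the inner `_check()` function; injecting it as a parameter is an illustration device, not how the commit structures the code.

```python
from __future__ import annotations

import asyncio
from collections.abc import Callable


async def verify_at_head(
    read_revisions: Callable[[], tuple[str | None, str | None]],
) -> None:
    """Raise RuntimeError when the DB revision differs from the script head.

    ``read_revisions`` is a blocking callable returning
    ``(current_rev, head_rev)``; it runs in a worker thread so the
    event loop is not blocked while the database is queried.
    """
    current_rev, head_rev = await asyncio.to_thread(read_revisions)
    if current_rev != head_rev:
        raise RuntimeError(
            f"Alembic migration drift detected: DB at {current_rev!r}, "
            f"expected head {head_rev!r}"
        )


# A database at head passes silently.
asyncio.run(verify_at_head(lambda: ("20260517_phase7b_datetime_tz", "20260517_phase7b_datetime_tz")))

# A database with no recorded revision is reported as drift.
try:
    asyncio.run(verify_at_head(lambda: (None, "20260517_phase7b_datetime_tz")))
except RuntimeError as exc:
    print(exc)
```

A `current_rev` of `None` is what a database without an `alembic_version` row reports, which is why the commit message notes that fixtures built with `create_all()` have to patch the check out.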
@@ -70,8 +132,16 @@ async def seed_templates(session: AsyncSession, loader: type[TemplateLoader]) ->
     main.py can call by name, and is the natural seam for unit tests that
     want to inject a mock loader without touching the real registry.
 
+    Raises RuntimeError if the loader cache is empty — calling seed_templates
+    without first running TemplateLoader.load_dir() is a lifespan-ordering bug.
+
     Returns the count of rows touched (inserted or updated).
     """
+    if not loader._cache:
+        raise RuntimeError(
+            "seed_templates called with empty TemplateLoader cache — "
+            "TemplateLoader.load_dir() must run before seed_templates()."
+        )
     return await loader.seed_db(session)
 
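
The ordering guard added to `seed_templates` is easy to reproduce in isolation. The sketch below uses a hypothetical `FakeTemplateLoader` with a class-level `_cache`, and a synchronous `seed_templates` without a DB session, purely to show the failure mode and the fix; the real loader and its `seed_db` signature are not shown in this diff beyond the call site.

```python
class FakeTemplateLoader:
    """Hypothetical stand-in for TemplateLoader with a class-level cache."""

    _cache: dict[str, dict] = {}

    @classmethod
    def load_dir(cls, templates: dict[str, dict]) -> None:
        # The real loader reads YAML from disk; here the data is passed in.
        cls._cache = dict(templates)

    @classmethod
    def seed_db(cls) -> int:
        # Pretend every cached template is upserted; return the row count.
        return len(cls._cache)


def seed_templates(loader: type[FakeTemplateLoader]) -> int:
    """Refuse to seed from an empty cache instead of silently writing 0 rows."""
    if not loader._cache:
        raise RuntimeError(
            "seed_templates called with empty loader cache; "
            "load_dir() must run first."
        )
    return loader.seed_db()


# Wrong order: seeding before loading now fails loudly.
try:
    seed_templates(FakeTemplateLoader)
except RuntimeError as exc:
    print(f"startup aborted: {exc}")

# Correct order: load first, then seed.
FakeTemplateLoader.load_dir({"address-label": {}, "cable-tag": {}})
seeded = seed_templates(FakeTemplateLoader)
print(seeded)
```

One consequence of this design is that a deployment with a genuinely empty template directory also fails at startup, so the guard treats "no templates" as a configuration error rather than a valid state.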

@@ -102,3 +172,56 @@ async def ensure_printer_state(session: AsyncSession) -> int:
     await session.commit()
 
     return created
+
+
+async def upsert_runtime_printer(
+    session: AsyncSession,
+    settings: Settings,
+) -> UUID | None:
+    """Materialise one Printer row from env config; return its deterministic id.
+
+    Returns ``None`` when the environment does NOT declare a printer host
+    (e.g. mock backend in CI). The lifespan calls this between
+    ``seed_templates`` and ``ensure_printer_state`` so every restart
+    keeps the single runtime printer row consistent with the current env.
+
+    The Printer row is keyed by the deterministic UUIDv5 produced by
+    ``derive_printer_id(model, host, port)`` — the same id that the
+    print-queue driver uses, so the DB row and the in-memory printer share
+    one stable identity across restarts.
+    """
+    model: str = settings.printer_model
+    # Resolve host: pt750w takes precedence, ql820 is the fallback.
+    host: str = settings.pt750w_host or settings.ql820_host or ""
+    port: int = settings.pt750w_port if settings.pt750w_host else settings.ql820_port
+
+    if not (model and host and port):
+        return None
+
+    printer_id: UUID = derive_printer_id(model, host, port)
+    connection: dict[str, object] = {
+        "host": host,
+        "port": port,
+        "snmp": settings.printer_discover_via_snmp,
+        "snmp_community": settings.printer_snmp_community,
+    }
+    name: str = f"{model} ({host})"
+
+    existing = await session.get(Printer, printer_id)
+    if existing is not None:
+        existing.name = name
+        existing.connection = connection
+        existing.enabled = True
+    else:
+        session.add(
+            Printer(
+                id=printer_id,
+                name=name,
+                model=model.lower(),
+                backend=settings.printer_backend,
+                connection=connection,
+                enabled=True,
+            )
+        )
+    await session.flush()
+    return printer_id