-
Notifications
You must be signed in to change notification settings - Fork 440
Open
Labels
bugSomething isn't workingSomething isn't working
Description
Summary
pi.ruv.io (Cloud Run service ruvbrain) was returning 504 upstream request timeouts on all endpoints. Every HTTP request timed out at exactly 300s (Cloud Run max timeout) while background cognitive cycles continued running normally.
Root Cause
The mcp_brain_server cognitive cycle (runs every ~3 min) processes heavy workloads on the same async runtime as HTTP handlers:
- 200 auto-votes per cycle
- LoRA auto-submissions from SONA patterns
- Inference processing (11-21 inferences per cycle)
- Strange loop calculations
This appears to starve the HTTP handler, preventing it from serving any requests. The process stays alive (cognitive logs continue) but all HTTP responses hang until the 300s Cloud Run timeout.
Evidence
- All requests returned
504withlatency: 299.97s(Cloud Run timeout ceiling) - App logs showed healthy cognitive cycles (
Cognitive cycle #22-24) running every ~3 min SIGTERMgraceful shutdown observed at2026-03-27T13:58:50Z, followed by restart- After restart, cognitive cycles resumed but HTTP remained unresponsive
- Two instance IDs visible in logs — both instances affected
curlconfirmed TCP connect succeeds (0.003s) but app never sends response
Resolution (Temporary)
Redeployed fresh revision (ruvbrain-00159-cjl) — service immediately responsive (77ms).
Permanent Fix Needed
- Move cognitive cycle to a separate tokio task with budget/yield — ensure HTTP handlers are never starved by background processing
- Add a
/healthzliveness probe that Cloud Run can use to detect and restart unresponsive instances - Consider
tokio::task::spawn_blockingfor heavy compute (auto-votes, LoRA) to avoid blocking the async executor - Add request timeout middleware (e.g., 30s) so individual requests fail fast instead of hanging to 300s
Environment
- Service:
ruvbrain/us-central1 - Image:
gcr.io/ruv-dev/ruvbrain:latest - Revision:
ruvbrain-00158-2lb(affected),ruvbrain-00159-cjl(redeployed) - Config: 2 CPU, 2Gi memory, minScale=1, maxScale=10, session affinity enabled
Labels
- priority: high
- component: mcp-brain-server
- type: bug
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working