
Decouple sessions from specific agents for resilience and failover #236

@JoshuaAFerguson


Problem

Sessions are currently tightly coupled to the specific agent that created them. This creates operational fragility and prevents key features:

Current Architecture:

Session Record:
  agent_id: "k8s-prod-cluster-abc123"  ← Hardcoded to specific agent instance
  state: "running"

Impact of Tight Coupling:

  1. Agent Failure = Session Loss

  2. No Session Migration

    • Can't move running sessions between agents
    • Can't rebalance load
    • Can't drain agent for maintenance
  3. Blocks Auto-Scaling (Issue #234: Agent auto-registration for Kubernetes auto-scaling)

    • Agent scale-down orphans sessions
    • Sessions can't reconnect to new agent instances
  4. Resource Waste

    • Orphaned Kubernetes resources continue running
    • Database shows incorrect state
    • Manual cleanup required

Observed Issues:

  • admin-brave-f7b5e0f5 stuck in "terminating" after its agent was replaced
  • admin-brave-fa96fa51 stuck in "pending", waiting for a specific agent
  • All sessions require manual intervention whenever the agent changes

Proposed Architecture: Agent Pool Model

Design Principles

  1. Logical Agent Groups: Sessions bind to agent pools, not specific instances
  2. Platform-Based Routing: Route by platform/region, not agent ID
  3. Dynamic Assignment: Agents claim sessions at runtime
  4. Graceful Failover: Sessions automatically reassign on agent failure

Database Schema Changes

Before (Current):

sessions:
  agent_id: "k8s-prod-cluster-abc123"  -- Specific instance

After (Proposed):

sessions:
  platform: "kubernetes"        -- Platform type
  region: "us-east-1"           -- Deployment region  
  agent_pool: "k8s-prod"        -- Logical pool
  assigned_agent_id: "..."      -- Current handler (nullable, dynamic)
  platform_resource_id: "..."   -- K8s pod name, Docker container ID, etc.
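
As a rough illustration, the proposed columns could map to an in-memory record like the one below. The field names follow the schema above; the Go types, the pointer used to model the nullable assigned_agent_id, and the Handler and Release helpers are assumptions made for this sketch, not part of the existing codebase.

```go
package main

import "fmt"

// Session mirrors the proposed columns. Go types are assumptions.
type Session struct {
	ID                 string
	Platform           string  // e.g. "kubernetes"
	Region             string  // e.g. "us-east-1"
	AgentPool          string  // logical pool, e.g. "k8s-prod"
	AssignedAgentID    *string // nil while no agent currently handles the session
	PlatformResourceID string  // K8s pod name, Docker container ID, etc.
	State              string
}

// Handler (hypothetical helper) returns the current agent ID and
// whether one is assigned at all.
func (s Session) Handler() (string, bool) {
	if s.AssignedAgentID == nil {
		return "", false
	}
	return *s.AssignedAgentID, true
}

// Release (hypothetical helper) clears the assignment, for example
// after the handling agent disappears. The session itself survives.
func (s *Session) Release() { s.AssignedAgentID = nil }

func main() {
	agent := "k8s-prod-cluster-abc123"
	s := Session{
		ID:              "admin-brave-f7b5e0f5",
		Platform:        "kubernetes",
		Region:          "us-east-1",
		AgentPool:       "k8s-prod",
		AssignedAgentID: &agent,
		State:           "running",
	}
	s.Release()
	_, assigned := s.Handler()
	fmt.Println("pool:", s.AgentPool, "assigned:", assigned)
}
```

The point of the nullable field is that losing the handler only clears one column; the pool, region, and resource ID remain, so another agent has enough information to take over.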

Implementation Plan

Phase 1: Database Schema (v2.1.0)

  • Add platform, region, agent_pool columns to sessions
  • Add assigned_agent_id (nullable, replaces agent_id)
  • Add platform_resource_id for K8s pod name / Docker container ID
  • Migration script to populate from existing agent_id

Phase 2: Agent Pool Registry (v2.1.0)

  • AgentHub tracks agents by pool: map[poolID][]Agent
  • Agent registration includes pool membership
  • Pool-based agent selection
  • Load balancing across pool members
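
A minimal sketch of the pool registry described in Phase 2, assuming a least-loaded selection policy. Only the map-by-pool shape comes from the proposal; the Agent fields, method names, ErrNoAgents, and the choice of "fewest active sessions" as the balancing rule are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Agent is a hypothetical view of a registered agent.
type Agent struct {
	ID             string
	Pool           string
	ActiveSessions int
}

// AgentHub tracks agents by pool, as in map[poolID][]Agent.
type AgentHub struct {
	mu    sync.RWMutex
	pools map[string][]*Agent
}

// ErrNoAgents is returned when a pool has no registered members.
var ErrNoAgents = errors.New("no agents available in pool")

func NewAgentHub() *AgentHub {
	return &AgentHub{pools: make(map[string][]*Agent)}
}

// Register adds the agent to the pool named in its registration.
func (h *AgentHub) Register(a *Agent) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.pools[a.Pool] = append(h.pools[a.Pool], a)
}

// Unregister removes an agent, e.g. on disconnect or scale-down.
func (h *AgentHub) Unregister(pool, id string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	members := h.pools[pool]
	for i, a := range members {
		if a.ID == id {
			h.pools[pool] = append(members[:i], members[i+1:]...)
			return
		}
	}
}

// Select returns the pool member with the fewest active sessions.
func (h *AgentHub) Select(pool string) (*Agent, error) {
	h.mu.RLock()
	defer h.mu.RUnlock()
	var best *Agent
	for _, a := range h.pools[pool] {
		if best == nil || a.ActiveSessions < best.ActiveSessions {
			best = a
		}
	}
	if best == nil {
		return nil, ErrNoAgents
	}
	return best, nil
}

func main() {
	hub := NewAgentHub()
	hub.Register(&Agent{ID: "agent-a", Pool: "k8s-prod", ActiveSessions: 3})
	hub.Register(&Agent{ID: "agent-b", Pool: "k8s-prod", ActiveSessions: 1})
	if a, err := hub.Select("k8s-prod"); err == nil {
		fmt.Println("selected:", a.ID)
	}
}
```

In this sketch ActiveSessions is only read under the hub lock; a real implementation would also need to update it under the same lock, or derive load from the database.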

Phase 3: Dynamic Session Assignment (v2.1.0)

  • Command dispatcher tries assigned agent first
  • Falls back to any other agent in the pool if the assigned agent is gone
  • Updates assigned_agent_id on successful dispatch
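
The three dispatch steps above can be sketched as a single function. This is an illustration under stated assumptions: the Sender callback, the ErrNoAgent error, and the use of an empty string for "unassigned" are hypothetical, and pool members are tried in the order given rather than by load.

```go
package main

import (
	"errors"
	"fmt"
)

// Session holds only the fields the dispatcher needs in this sketch.
type Session struct {
	ID              string
	AgentPool       string
	AssignedAgentID string // empty means unassigned
}

// Sender (hypothetical) delivers a command to one agent and returns
// an error if that agent cannot be reached.
type Sender func(agentID, command string) error

// ErrNoAgent is returned when no candidate accepted the command.
var ErrNoAgent = errors.New("no reachable agent in pool")

// Dispatch tries the assigned agent first, then the remaining pool
// members, and records whichever agent accepted the command.
func Dispatch(s *Session, pool []string, cmd string, send Sender) error {
	candidates := make([]string, 0, len(pool)+1)
	if s.AssignedAgentID != "" {
		candidates = append(candidates, s.AssignedAgentID)
	}
	for _, id := range pool {
		if id != s.AssignedAgentID {
			candidates = append(candidates, id)
		}
	}
	for _, id := range candidates {
		if err := send(id, cmd); err != nil {
			continue // agent gone or unreachable: try the next one
		}
		s.AssignedAgentID = id // update only on successful dispatch
		return nil
	}
	return ErrNoAgent
}

func main() {
	alive := map[string]bool{"agent-b": true}
	send := func(agentID, command string) error {
		if !alive[agentID] {
			return fmt.Errorf("agent %s unreachable", agentID)
		}
		return nil
	}
	s := &Session{ID: "admin-brave-fa96fa51", AgentPool: "k8s-prod", AssignedAgentID: "agent-a"}
	err := Dispatch(s, []string{"agent-a", "agent-b"}, "start", send)
	fmt.Println("handler:", s.AssignedAgentID, "err:", err)
}
```

Leaving the assignment untouched when every candidate fails keeps the record honest: the session stays pointed at its last known handler until some agent actually accepts a command.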

Phase 4: Platform Resource Tracking (v2.1.0)

  • Agents report platform_resource_id when creating sessions
  • Stored in database for cross-agent operations
  • Any agent can query/cleanup using resource ID
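
To show why the stored resource ID matters, here is a hedged sketch of cleanup performed by an agent that did not create the session. The Platform interface, the in-memory fakePlatform, the pod name, and the "terminated" state name are all assumptions for illustration; a real agent would call the Kubernetes or Docker API instead.

```go
package main

import "fmt"

// Session holds only the fields needed for cleanup in this sketch.
type Session struct {
	ID                 string
	PlatformResourceID string // K8s pod name, Docker container ID, etc.
	State              string
}

// Platform (hypothetical) abstracts the platform API an agent talks to.
type Platform interface {
	Exists(resourceID string) bool
	Delete(resourceID string) error
}

// Cleanup removes the session's platform resource using only the stored
// resource ID, so any agent in the pool can run it. The "terminated"
// state name is an assumption.
func Cleanup(s *Session, p Platform) error {
	if s.PlatformResourceID != "" && p.Exists(s.PlatformResourceID) {
		if err := p.Delete(s.PlatformResourceID); err != nil {
			return fmt.Errorf("delete %s: %w", s.PlatformResourceID, err)
		}
	}
	s.State = "terminated"
	return nil
}

// fakePlatform is an in-memory stand-in used only for this example.
type fakePlatform map[string]bool

func (f fakePlatform) Exists(id string) bool  { return f[id] }
func (f fakePlatform) Delete(id string) error { delete(f, id); return nil }

func main() {
	platform := fakePlatform{"example-pod": true}
	s := &Session{ID: "admin-brave-f7b5e0f5", PlatformResourceID: "example-pod", State: "terminating"}
	if err := Cleanup(s, platform); err != nil {
		fmt.Println("cleanup failed:", err)
		return
	}
	fmt.Println("state:", s.State, "resource exists:", platform.Exists("example-pod"))
}
```

Because Cleanup treats an already-missing resource as success, it is safe to retry from a different agent after a failover.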

Benefits

  • ✅ Agent pod restart doesn't orphan sessions
  • ✅ Sessions automatically failover to healthy agents
  • ✅ No manual intervention for agent changes
  • ✅ Seamless agent auto-scaling
  • ✅ VNC connections survive agent restarts

Related Issues

Priority: P1 | Timeline: v2.1.0
