Sessions stuck in 'terminating' state when agent disconnects #235

@JoshuaAFerguson

Problem

Sessions can get stuck in 'terminating' state when the agent that created them is disconnected or replaced before processing the stop command.

Reproduction:

  1. Agent creates a session (state: running)
  2. User terminates session via UI → database state: terminating
  3. Agent pod is deleted/recreated before processing stop command
  4. New agent has different ID and isn't approved yet
  5. Kubernetes resources (Deployment, Service, Pod) remain running indefinitely
  6. Database shows "terminating" but resources never cleaned up

Observed Behavior:

  • Session admin-brave-f7b5e0f5 stuck in "terminating" for 48+ minutes
  • Kubernetes resources still healthy and running
  • Manual cleanup required: kubectl delete deployment,service -l session=<id>

Root Cause:
No reconciliation mechanism for orphaned sessions when agent availability changes.

Impact

  • User Experience: Sessions appear stuck, UI shows incorrect state
  • Resource Waste: Pods/services continue running and consuming resources
  • Operational Burden: Requires manual Kubernetes cleanup
  • Scale Issues: In auto-scaling scenarios (see #234, "Agent auto-registration for Kubernetes auto-scaling"), this becomes worse

Proposed Solutions

Solution 1: Session Reconciliation Loop (Recommended for v2.0-beta.1)

Add a background goroutine in the API that periodically checks for stuck sessions:

// Runs every 60 seconds
func (s *SessionReconciler) ReconcileStuckSessions() {
    // Find sessions in "terminating" for > 5 minutes
    stuckSessions := db.Query(`
        SELECT id, agent_id, state, updated_at
        FROM sessions
        WHERE state = 'terminating'
          AND updated_at < NOW() - INTERVAL '5 minutes'
    `)

    for _, session := range stuckSessions {
        agent := agentHub.GetConnection(session.AgentID)

        if agent != nil {
            // Agent is connected - retry stop command
            log.Printf("Retrying stop for session %s", session.ID)
            dispatcher.DispatchCommand(session.AgentID, "stop_session", session.ID)
            continue
        }

        // Agent is gone - only force cleanup once the session has been stuck
        // for > 10 minutes (see Acceptance Criteria); until then, wait for the agent
        if time.Since(session.UpdatedAt) <= 10*time.Minute {
            continue
        }

        log.Printf("Force-terminating orphaned session %s", session.ID)
        // The state guard avoids overwriting a session that changed state in the meantime
        db.Exec(`UPDATE sessions SET state = 'terminated' WHERE id = $1 AND state = 'terminating'`, session.ID)

        // Emit warning event: only the database row is updated, so manual Kubernetes cleanup may be required
        auditLog.Warn("Session %s orphaned - manual K8s cleanup may be required", session.ID)
    }
}

Acceptance Criteria:

  • Background reconciliation loop runs every 60 seconds
  • Detects sessions in "terminating" state for > 5 minutes
  • Retries stop command if agent is available
  • Force-marks as "terminated" if agent is gone for > 10 minutes
  • Logs warnings for manual cleanup needed
  • Metrics: sessions_stuck_terminating counter

Solution 2: Agent Disconnect Cleanup (Future: v2.1.0)

When an agent disconnects, automatically handle its sessions:

func (h *AgentHub) UnregisterAgent(agentID string) {
    // Get all active sessions for this agent
    sessions := db.Query(`
        SELECT id FROM sessions 
        WHERE agent_id = $1 AND state IN ('running', 'hibernated')
    `, agentID)
    
    for _, session := range sessions {
        // Mark as "terminated" with reason "agent_disconnected"
        db.Exec(`
            UPDATE sessions 
            SET state = 'terminated', 
                termination_reason = 'agent_disconnected',
                terminated_at = NOW()
            WHERE id = $1
        `, session.ID)
    }
}

Acceptance Criteria:

  • Agent disconnect triggers session cleanup
  • Sessions marked as "terminated" with reason
  • Audit log event created
  • Optional: Grace period for agent reconnection (30s)

Solution 3: Kubernetes Resource Garbage Collection (Future: v2.1.0+)

Add a K8s controller or CronJob that cleans up orphaned resources:

# kubernetes-cleaner cronjob
schedule: "*/15 * * * *"  # Every 15 minutes
command: |
  # Delete deployments without corresponding session records
  # Or where session.state = 'terminated'

Acceptance Criteria:

  • CronJob runs every 15 minutes
  • Queries database for terminated sessions
  • Deletes corresponding K8s resources
  • Logs cleanup actions
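
The selection step of such a cleaner could look roughly like the sketch below. It assumes each Deployment carries a `session=<id>` label (as in the manual `kubectl` command above) and that the session states have already been loaded from the database; `OrphanedSessions` and its inputs are hypothetical, and the actual Kubernetes and database calls are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// OrphanedSessions returns the session IDs whose Kubernetes resources should be
// deleted: those with no session record, or whose record is "terminated".
// deployed holds the `session` label value of each running Deployment;
// states maps session ID to its database state.
func OrphanedSessions(deployed []string, states map[string]string) []string {
	var orphans []string
	for _, id := range deployed {
		state, found := states[id]
		if !found || state == "terminated" {
			orphans = append(orphans, id)
		}
	}
	sort.Strings(orphans) // stable order keeps cleanup logs easy to compare
	return orphans
}

func main() {
	deployed := []string{"s3", "s1", "s2"}
	states := map[string]string{"s1": "running", "s2": "terminated"}
	for _, id := range OrphanedSessions(deployed, states) {
		fmt.Printf("kubectl delete deployment,service -l session=%s\n", id)
	}
}
```

Deleting resources that have no session record at all may race with session creation, so a real cleaner would probably also want a minimum resource age before deleting.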

Recommended Implementation Plan

v2.0-beta.1 (P1 - MUST FIX):

  • Implement Solution 1 (Session Reconciliation Loop)
  • Fixes immediate operational issue
  • Low complexity, high impact

v2.1.0 (P2):

  • Implement Solution 2 (Agent Disconnect Cleanup)
  • Prevents issue from occurring
  • Graceful degradation

v2.2.0 (P3):

  • Implement Solution 3 (K8s Garbage Collection)
  • Belt-and-suspenders approach
  • Handles edge cases

Files to Modify

  • api/internal/services/session_reconciler.go (new)
  • api/cmd/main.go - Start reconciler goroutine
  • api/internal/websocket/hub.go - Agent disconnect cleanup
  • chart/templates/kubernetes-cleaner-cronjob.yaml (future)

Testing

  1. Create session with agent A
  2. Terminate session
  3. Delete agent pod before it processes stop
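  4. Wait until the session has been in "terminating" for more than 10 minutes (the force-termination threshold in the acceptance criteria)
  5. Verify reconciler force-terminates session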
  6. Verify audit log entry created
  7. Verify metrics updated

Labels: P1, bug
