Member

@deanq deanq commented Nov 19, 2025

Summary

Implements a comprehensive flash undeploy command to manage and delete RunPod serverless endpoints tracked by the Flash CLI.

Changes

Core Implementation

  • New Command: src/tetra_rp/cli/commands/undeploy.py (~572 lines)
    • Multiple interaction modes: list, by name, --all, --interactive, --cleanup-stale
    • Rich formatted tables and panels for UI
    • Questionary for interactive checkbox selection
    • Health status checking for all endpoints
  • GraphQL Fix: Fixed delete_endpoint success detection in runpod.py
    • Changed from checking null value to checking key presence
    • Handles GraphQL's {"deleteEndpoint": null} success response
  • Session Management: Implemented async context manager for RunpodGraphQLClient
  • ResourceManager Extensions: Added list_all_resources() and find_resources_by_name()
  • CLI Integration: Registered undeploy command in main.py
  • Tests: Comprehensive unit tests with async mocking (~355 lines)
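As a rough illustration of the two ResourceManager extensions, here is a minimal sketch. The `TrackedResource` and `ResourceTracker` classes are hypothetical stand-ins, and the return types are assumptions; the actual signatures in resource_manager.py may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TrackedResource:
    """Hypothetical stand-in for a tracked endpoint resource."""
    name: str
    endpoint_id: str


class ResourceTracker:
    """Illustrative tracker only; not the actual ResourceManager."""

    def __init__(self) -> None:
        # Maps a resource ID to the tracked resource.
        self._resources: Dict[str, TrackedResource] = {}

    def add(self, resource_id: str, resource: TrackedResource) -> None:
        self._resources[resource_id] = resource

    def list_all_resources(self) -> Dict[str, TrackedResource]:
        # Return a copy so callers cannot mutate the tracking state.
        return dict(self._resources)

    def find_resources_by_name(self, name: str) -> List[Tuple[str, TrackedResource]]:
        # Several tracked resources may share a name, so return every match.
        return [
            (rid, res) for rid, res in self._resources.items() if res.name == name
        ]
```

Returning (resource ID, resource) pairs lets a caller delete by ID even when names collide.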

Build System

  • Makefile: Removed .tetra_resources.pkl cleanup from make clean
    • Tracking file now persists across builds
    • Use flash undeploy commands to manage endpoints

Documentation

  • README.md: Added flash undeploy section with examples
  • flash-undeploy.md: Comprehensive 280+ line documentation
    • All usage modes with examples
    • Status indicator explanations
    • Troubleshooting guide
    • Integration with the @remote decorator

Usage Examples

List all endpoints with status

flash undeploy list

Shows a table with the columns Name, Endpoint ID, Status (🟢 Active/🔴 Inactive/❓ Unknown), Type, and Resource ID

Undeploy specific endpoint

flash undeploy my-api

Undeploy all endpoints (double confirmation)

flash undeploy --all

Interactive selection

flash undeploy --interactive

Clean up stale tracking

flash undeploy --cleanup-stale

Removes tracking for endpoints deleted externally (via RunPod UI/API)

Features

Status Checking

  • 🟢 Active: Endpoint exists and health check succeeds
  • 🔴 Inactive: Tracking exists but endpoint deleted externally
  • ❓ Unknown: Exception during health check
  • Performed via health check API calls (1 per endpoint)
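The three status values can be sketched as a small classifier. This is a sketch under assumptions: `check_health` is a hypothetical coroutine that returns False when the endpoint no longer exists, which may not match how undeploy.py actually detects externally deleted endpoints.

```python
import asyncio
from typing import Awaitable, Callable

ACTIVE = "🟢 Active"
INACTIVE = "🔴 Inactive"
UNKNOWN = "❓ Unknown"


async def get_status(check_health: Callable[[], Awaitable[bool]]) -> str:
    """Map one health check call to a status label."""
    try:
        healthy = await check_health()
    except Exception:
        # Any exception during the health check is reported as Unknown.
        return UNKNOWN
    # Assumption: a False result means the endpoint was deleted externally.
    return ACTIVE if healthy else INACTIVE
```

Each call to `get_status` makes exactly one health check, which matches the "1 per endpoint" cost noted above.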

Safety Features

  • Confirmation prompts before all deletions
  • Double confirmation for --all (yes/no + type "DELETE ALL")
  • Keyboard interrupt handling (Ctrl+C to cancel)
  • "Cannot be undone" warnings
  • Detailed error reporting per endpoint
  • Success/failure counts in summary
  • Continues processing remaining endpoints on failure
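A minimal sketch of two of these behaviors: the double confirmation for --all, and a batch delete that keeps going after a failure and reports counts. The function names and the `delete_one` callback are hypothetical; the real command uses Rich and questionary prompts rather than plain arguments.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


def confirm_delete_all(answered_yes: bool, typed_phrase: str) -> bool:
    """Both the yes/no answer and the exact phrase are required."""
    return answered_yes and typed_phrase == "DELETE ALL"


async def delete_many(
    names: List[str],
    delete_one: Callable[[str], Awaitable[None]],
) -> Tuple[List[str], Dict[str, str]]:
    """Delete each endpoint in turn; a failure does not stop the loop."""
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    for name in names:
        try:
            await delete_one(name)
        except Exception as exc:
            # Record the error for this endpoint and continue with the rest.
            failed[name] = str(exc)
            continue
        succeeded.append(name)
    return succeeded, failed
```

The lengths of `succeeded` and `failed` give the success/failure counts for the summary, and `failed` carries the per-endpoint error text.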

Cleanup Stale Tracking

  • Identifies endpoints deleted via RunPod UI/API
  • Removes orphaned tracking entries
  • No API deletion (endpoints already gone)
  • Prevents stale .tetra_resources.pkl file
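The cleanup step can be sketched as a filter over the tracking data. Here the tracking state is simplified to a name-to-endpoint-ID dict and `is_inactive` is a hypothetical predicate; the real tracking file stores resource objects, and the confirmation prompt is omitted.

```python
from typing import Callable, Dict, List


def cleanup_stale(
    tracked: Dict[str, str],
    is_inactive: Callable[[str], bool],
) -> List[str]:
    """Drop tracking entries whose endpoint is gone; return the removed names."""
    stale = [name for name, endpoint_id in tracked.items() if is_inactive(endpoint_id)]
    for name in stale:
        # Only the local tracking entry is removed. No delete call is sent,
        # because the endpoint no longer exists remotely.
        del tracked[name]
    return stale
```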

Technical Details

GraphQL deleteEndpoint Behavior

  • Returns {"deleteEndpoint": null} on success (HTTP 200)
  • Success determined by key presence, not value
  • Exceptions thrown by _execute_graphql on failure
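To make the difference concrete, the sketch below contrasts a value check with a key-presence check on the response shape described above. The function names are illustrative, and the value-based version is only a guess at the general shape of the earlier logic, not a copy of it.

```python
from typing import Any, Dict


def delete_succeeded_by_value(result: Dict[str, Any]) -> bool:
    # Treats null as failure, so a successful delete is misreported.
    return bool(result.get("deleteEndpoint"))


def delete_succeeded_by_key(result: Dict[str, Any]) -> bool:
    # Success is signalled by the key being present, whatever its value.
    return "deleteEndpoint" in result


response = {"deleteEndpoint": None}  # shape of a successful response
print(delete_succeeded_by_value(response))  # False
print(delete_succeeded_by_key(response))    # True
```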

Async Session Management

  • RunpodGraphQLClient implements __aenter__/__aexit__
  • Properly closes aiohttp sessions
  • Prevents "Unclosed client session" warnings
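The pattern looks roughly like the sketch below. `FakeSession` stands in for `aiohttp.ClientSession` so the example runs without network access, and `GraphQLClientSketch` is a hypothetical class, not the actual RunpodGraphQLClient.

```python
import asyncio
from typing import Optional


class FakeSession:
    """Stand-in for aiohttp.ClientSession (assumption, for a runnable sketch)."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


class GraphQLClientSketch:
    """Illustrative client that owns a session and closes it on exit."""

    def __init__(self) -> None:
        self.session: Optional[FakeSession] = None

    async def __aenter__(self) -> "GraphQLClientSketch":
        self.session = FakeSession()
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Close the session on both normal exit and exceptions.
        if self.session is not None and not self.session.closed:
            await self.session.close()
        # Returning None lets any exception propagate to the caller.


async def demo() -> bool:
    async with GraphQLClientSketch() as client:
        session = client.session
    return session.closed
```

Because `__aexit__` runs even when the body raises, the session is closed on the error path too.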

Tracking File Protection

  • .tetra_resources.pkl excluded from make clean
  • Contains deployment state for the @remote decorator
  • Already in .gitignore

Test Coverage

  • List command (no endpoints, with endpoints)
  • Undeploy by name (cancelled, success, nonexistent)
  • Undeploy --all (wrong confirmation, success)
  • Delete endpoint function (success, API failure, exception)
  • Helper functions (status checking, resource type formatting)
  • Async context manager mocking

Testing Checklist

  • Unit tests pass (289 tests)
  • Quality checks pass (format, lint, coverage 37.37%)
  • GraphQL deletion fix verified
  • Async session cleanup verified
  • All interaction modes tested
  • Error handling works as expected
  • Confirmation flows prevent accidental deletions
  • Documentation accuracy verified

Related Issues

  • Fixes endpoint deletion bug (GraphQL null handling)
  • Fixes unclosed aiohttp session warnings
  • Protects tracking file from accidental deletion
  • Enables cleanup of externally deleted endpoints

@deanq deanq changed the title feat(cli): Add flash undeploy command for endpoint management feat(cli): Add flash undeploy command for endpoint management Nov 19, 2025
@deanq deanq requested a review from Copilot November 22, 2025 10:36
Contributor

Copilot AI left a comment

Pull request overview

This PR implements a comprehensive flash undeploy command to manage and delete RunPod serverless endpoints tracked by the Flash CLI. The command provides multiple interaction modes (list, by name, --all, --interactive, --cleanup-stale) with safety features including confirmation prompts and detailed error reporting.

Key Changes:

  • New undeploy command with multiple interaction modes and safety confirmations
  • Extended ResourceManager with list_all_resources() and find_resources_by_name() methods
  • Comprehensive test coverage for all command modes and helper functions

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/tetra_rp/cli/commands/undeploy.py | Implements the undeploy command with list, interactive, and deletion logic |
| src/tetra_rp/core/resources/resource_manager.py | Adds methods to list and find resources by name |
| src/tetra_rp/core/api/runpod.py | Updates delete_endpoint success detection logic |
| src/tetra_rp/cli/main.py | Registers the undeploy command |
| tests/unit/resources/test_resource_manager.py | Tests for new ResourceManager methods |
| tests/unit/cli/test_undeploy.py | Comprehensive tests for undeploy command |
| src/tetra_rp/cli/docs/flash-undeploy.md | Complete documentation for the undeploy command |
| src/tetra_rp/cli/docs/README.md | Updates main CLI docs with undeploy command |
| tests/unit/cli/__init__.py | Initializes CLI test module |
| Makefile | Removes .pkl file deletion from clean target |

Implements comprehensive error handling for missing API keys with
actionable guidance for users.

When RUNPOD_API_KEY is missing, users now receive helpful error
messages that include:
- Documentation URL for obtaining API keys
- Three setup methods (env var, .env file, shell profile)
- Context about which operation requires the key
- Troubleshooting guidance for .env file location

Implementation:
- Created RunpodAPIKeyError with helpful default message
- Added validate_api_key() helper functions
- Updated API clients to use custom exception
- Added resource deployment error context
- Enhanced flash init with API key documentation link
- Added 15 comprehensive tests (all passing)
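A minimal sketch of the validation described here. The names `RunpodAPIKeyError` and `validate_api_key` come from the commit message, but the signature, the environment fallback, and the default message text are assumptions; the real message also includes a documentation URL, which is left out here.

```python
import os
from typing import Optional

# Assumed wording; the actual default message is longer.
DEFAULT_MESSAGE = (
    "RUNPOD_API_KEY is missing. Set it as an environment variable, "
    "in a .env file, or in your shell profile."
)


class RunpodAPIKeyError(Exception):
    """Raised when no usable API key is available."""

    def __init__(self, message: Optional[str] = None) -> None:
        super().__init__(message or DEFAULT_MESSAGE)


def validate_api_key(api_key: Optional[str] = None) -> str:
    """Return the key if it is usable, otherwise raise RunpodAPIKeyError."""
    key = api_key if api_key is not None else os.environ.get("RUNPOD_API_KEY")
    # Rejects None, the empty string, and whitespace-only values.
    if not key or not key.strip():
        raise RunpodAPIKeyError()
    return key
```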

Users previously saw generic "Runpod API key is required" errors.
Now they get clear, actionable steps to resolve the issue.

Simplify validation logic and extract duplicated error handling:

- Simplify API key validation condition in validation.py
  Changed `api_key.strip() == ""` to `not api_key.strip()` for clarity

- Extract duplicated error handling in resource_manager.py
  Created `_deploy_with_error_context` helper method to handle
  RunpodAPIKeyError with resource context, eliminating code duplication

All tests pass (231 unit tests).

Implements comprehensive undeploy command with multiple interaction modes:
- List all tracked endpoints with status indicators
- Delete specific endpoint by name
- Delete all endpoints with double confirmation
- Interactive checkbox selection for batch deletion

Changes:
- Add undeploy.py command with Rich/questionary UI
- Extend ResourceManager with list_all_resources() and find_resources_by_name()
- Integrate undeploy command into CLI
- Add comprehensive unit tests with mocking
- Fix duplicate import in resource_manager.py

Safety features include confirmation prompts, keyboard interrupt handling,
and detailed error reporting per endpoint.

Fixes two critical bugs preventing flash undeploy from working correctly:

1. GraphQL deleteEndpoint success detection
   - RunPod API returns {"deleteEndpoint": null} on success
   - Changed from checking null value to checking key presence
   - Added detailed comments explaining GraphQL response pattern

2. Unclosed aiohttp session
   - Wrapped RunpodGraphQLClient in async context manager
   - Ensures proper session cleanup after deletion
   - Eliminates "Unclosed client session" warnings

Also updated test mocks to support async context manager usage.

Tested with real endpoint deletion - successfully removes endpoint
from both RunPod and local .tetra_resources.pkl tracking file.

Removes automatic deletion of .pkl files from 'make clean' target to prevent
accidental loss of endpoint tracking state.

Rationale:
- .pkl files like .tetra_resources.pkl contain critical state for tracking
  deployed RunPod endpoints
- Deleting this file orphans deployed endpoints (still running, still costing
  money) without ability to manage them via CLI
- Users should explicitly manage endpoints via 'flash undeploy' command
- .pkl files are state/cache files, not build artifacts

Impact:
- make clean still removes build artifacts (dist, build, egg-info, pycache)
- .tetra_resources.pkl persists across clean operations
- Users must use 'flash undeploy' for proper endpoint cleanup
- Prevents accidental resource leaks and unexpected cloud costs

Adds ability to clean up stale endpoint entries from .tetra_resources.pkl
when endpoints have been deleted externally (via RunPod UI/API).

Changes:
- Added --cleanup-stale flag to flash undeploy command
- New _cleanup_stale_endpoints() function that:
  - Checks all tracked endpoints for inactive status
  - Lists inactive endpoints for user review
  - Prompts for confirmation before removal
  - Removes only from tracking (endpoints already deleted remotely)
- Added imports: Dict, DeployableResource, Confirm

Use case:
When users delete endpoints via RunPod UI/API instead of flash CLI,
the tracking file (.tetra_resources.pkl) becomes stale. This flag
identifies and removes those orphaned tracking entries.

Usage:
  flash undeploy --cleanup-stale

Note: The "Status" column in `flash undeploy list` makes a health check
API call for each endpoint to determine Active/Inactive state. While this
adds latency (6 endpoints = 6 API calls), it's valuable for identifying
stale entries that need cleanup.

Added documentation for the new flash undeploy command covering all usage
modes and features.

Changes:
- Updated src/tetra_rp/cli/docs/README.md with flash undeploy section
  - Command synopsis and options
  - Usage examples for all modes
  - Status indicator explanations
- Created src/tetra_rp/cli/docs/flash-undeploy.md
  - Detailed documentation (240+ lines)
  - All usage modes: list, by name, --all, --interactive, --cleanup-stale
  - Status check explanation and value proposition
  - Safety features and confirmations
  - Integration with the @remote decorator
  - Troubleshooting guide
  - Examples and workflows

Documentation covers:
- List endpoints with health status
- Undeploy by name with confirmation
- Undeploy all with double confirmation
- Interactive checkbox selection
- Cleanup stale tracking (--cleanup-stale flag)
- Status indicators (Active/Inactive/Unknown)
- Tracking file management
- Error handling and troubleshooting

@deanq deanq force-pushed the deanq/ae-1482-flash-cli-undeploy branch from 25265cc to 00ad37d Compare November 22, 2025 10:38

- Remove debugging print statements from test_undeploy.py
- Use Tuple from typing module instead of lowercase tuple for Python 3.9 consistency
- Update type hints in undeploy.py and resource_manager.py
- Add Tuple to imports in both files

All 289 tests pass, quality checks pass.

@deanq deanq requested a review from jhcipar November 22, 2025 11:14
@deanq deanq marked this pull request as ready for review November 22, 2025 11:14
@deanq deanq requested a review from justinwlin November 22, 2025 11:15

        Dict with success status and message
    """
    try:
        async with RunpodGraphQLClient() as client:
Contributor

nit, but would it be cleaner to have the delete-endpoint logic self-contained somewhere, either in the resource manager or in another layer? It feels a little weird to have this live inside the CLI, but it's pretty lightweight, and that could well be over-abstracting before it's needed. I.e., could _delete_endpoint be a generally useful util for flash?

Member Author

Good call. I'll do a follow-up PR for this refactor.

@deanq deanq merged commit cd32ffc into main Nov 25, 2025
7 checks passed
@deanq deanq deleted the deanq/ae-1482-flash-cli-undeploy branch November 25, 2025 08:02