feat: Workspace environment persisted in the network volume #10

Merged
deanq merged 44 commits into main from deanq/ae-894-sandbox-env
Aug 5, 2025

@deanq commented Jul 23, 2025

This build is available for testing at runpod/tetra-rp:local and runpod/tetra-rp-cpu:local

This PR implements persistent workspace management with RunPod network volumes and introduces a complete architectural refactor. The changes include endpoint-specific workspaces, virtual environments, shared package caching, concurrency safety, and structured logging throughout the codebase.

Key Changes Summary

  • Complete architectural refactor with a modular executor pattern
    • Introduced a BaseExecutor pattern for extensible execution strategies
  • Moved from a single handler.py to a modular structure in the src/ directory
    • 8 new specialized modules with clear separation of concerns
    • Workspace management, execution, and dependency handling are kept separate
  • Added persistent volume workspace management with endpoint isolation
    • Network volume detection and graceful fallback to container volume
    • Endpoint isolation prevents conflicts between different serverless endpoints
    • Shared caching (/runpod-volume/.uv-cache, /runpod-volume/.hf-cache) optimizes resource usage
    • Differential package installation reduces redundant downloads
  • Concurrency safety
    • File-based locking with proper file descriptor management
    • Configurable timeouts and atomic operations
    • Error handling and recovery mechanisms
  • Implemented a testing framework
    • Test suite with 85%+ coverage
    • Both unit and integration tests
    • CI pipeline tests all test_*.json files automatically
    • Quality gates with linting, formatting, and type checking
  • CI/CD pipeline with automated testing and quality checks
  • Replaced print statements with structured logging throughout
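
The executor pattern named in the list above could look roughly like the sketch below. BaseExecutor, FunctionExecutor, the workspace_manager parameter, _setup_execution_environment, and setup_python_path are names that appear in this PR's commit messages; the execute() method, the request field names (function_code, function_name, args, kwargs), and the response shape are assumptions made for illustration and are not taken from the repository.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class BaseExecutor(ABC):
    """Shared setup for every execution strategy (illustrative sketch)."""

    def __init__(self, workspace_manager: Optional[Any] = None) -> None:
        self.workspace_manager = workspace_manager

    def _setup_execution_environment(self) -> None:
        # Assumption: the workspace manager exposes setup_python_path().
        if self.workspace_manager is not None:
            self.workspace_manager.setup_python_path()

    @abstractmethod
    def execute(self, request: Dict[str, Any]) -> Dict[str, Any]:
        """Run the request and return a response dictionary."""


class FunctionExecutor(BaseExecutor):
    """Executes a function shipped as source code in the request."""

    def execute(self, request: Dict[str, Any]) -> Dict[str, Any]:
        self._setup_execution_environment()
        try:
            namespace: Dict[str, Any] = {}
            # Hypothetical request fields: function_code, function_name, args, kwargs.
            exec(request["function_code"], namespace)
            func = namespace[request["function_name"]]
            result = func(*request.get("args", []), **request.get("kwargs", {}))
            return {"success": True, "result": result}
        except Exception as exc:
            # Report failures in the response instead of raising.
            return {"success": False, "error": str(exc)}
```

A ClassExecutor would subclass BaseExecutor the same way and reuse _setup_execution_environment, which is the point of pulling the setup into the base class.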

@deanq changed the base branch from main to deanq/ae-884-testing-pipeline July 23, 2025 14:58
Base automatically changed from deanq/ae-884-testing-pipeline to main July 23, 2025 16:15
deanq added 13 commits July 23, 2025 15:41
Establishes testing infrastructure and protocol validation tests. Creates shared fixtures and validates the FunctionRequest/FunctionResponse data models that will be extended for volume workspace functionality.
Tests volume detection, virtual environment creation, file-based locking for concurrency, and timeout handling mechanisms.
Validates that functions execute in the volume workspace, can access persistent packages, and fall back gracefully when the volume is unavailable.
Tests complete request workflows, concurrent access safety, mixed execution scenarios, and realistic error handling patterns.
Adds volume detection logic, workspace initialization with file-based locking, virtual environment creation, and timeout handling to make some tests pass.
Implement smart dependency installation that only installs missing packages. Optimizes performance by leveraging persistent volume storage and avoiding redundant package installations.
…uration

Enable functions to execute in volume workspace with access to persistent packages. Configures Python path, environment variables, and UV cache to utilize volume storage effectively.
Update existing tests to work with new volume workspace functionality. Ensures backward compatibility and validates that all existing functionality continues to work with the new volume-aware implementation.
deanq added 9 commits July 31, 2025 17:51
- Move all Python modules to src/ for better organization
- Update Docker files to copy from src/ directory
- Update pyproject.toml with src/ in pythonpath
- Update Makefile to copy remote_execution.py to src/
- All tests pass with new structure
- Add _validate_virtual_environment() method to WorkspaceManager with symlink chain resolution using os.path.realpath()
- Add _remove_broken_virtual_environment() cleanup method
- Enhance initialize_workspace() with validation checks and automatic repair
- Add validation calls in setup_python_path() and dependency installer
- Update Docker files to work with src/ directory structure
- Update tests to mock new validation methods
- Fix pyproject.toml pythonpath configuration for tests

This resolves broken virtual environment symlinks when different endpoints
create venvs with different Python interpreter paths on shared volumes.
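
A minimal sketch of the kind of check described above, assuming a POSIX venv layout with the interpreter at bin/python. The function name is hypothetical; only the use of os.path.realpath() to resolve the symlink chain comes from the commit message.

```python
import os


def is_virtual_environment_valid(venv_path: str) -> bool:
    """Return True if the venv's interpreter symlink resolves to a real executable."""
    python_link = os.path.join(venv_path, "bin", "python")
    # lexists() is True even for a dangling symlink, so a missing entry
    # and a broken link can be told apart if needed.
    if not os.path.lexists(python_link):
        return False
    # realpath() follows the entire symlink chain to its final target.
    resolved = os.path.realpath(python_link)
    return os.path.isfile(resolved) and os.access(resolved, os.X_OK)
```

A venv that fails this check could then be removed and recreated, which matches the automatic repair step listed in the commit.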
- Add RUNPOD_ENDPOINT_ID environment variable support for endpoint isolation
- Workspace paths now: /runpod-volume/runtimes/{endpoint_id}
- Shared UV cache at volume root: /runpod-volume/.uv-cache
- Add RUNTIMES_DIR_NAME constant for endpoint workspace organization
- Update WorkspaceManager to create endpoint-specific workspace paths
- Add comprehensive endpoint isolation tests
- Update integration tests for new workspace structure
- Resolve merge conflicts from virtual environment validation features
- Add HF_CACHE_DIR_NAME constant for .hf-cache directory
- Implement _configure_huggingface_cache() method in WorkspaceManager
- Set HF environment variables (HF_HOME, TRANSFORMERS_CACHE, etc.) to use volume paths
- Update unit and integration tests to mock os.makedirs calls
- Fix "No space left on device" errors when downloading HF models
- Add make test-handler command that tests all test_*.json files locally
- Update CI to use make test-handler for consistency between local and CI testing
- Ensure local development environment matches CI testing exactly
- Remove code duplication between Makefile and CI configuration
- Support cross-platform testing (handles timeout command availability)
- Update CLAUDE.md documentation with new testing commands
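
The endpoint-specific workspace layout and cache configuration described in the commits above can be sketched as follows. The paths, RUNPOD_ENDPOINT_ID, RUNTIMES_DIR_NAME, HF_CACHE_DIR_NAME, HF_HOME, and TRANSFORMERS_CACHE come from the commit messages. The function names, the UV_CACHE_DIR_NAME constant, the "default" fallback endpoint ID, and the exact values assigned to the Hugging Face variables are assumptions for illustration.

```python
import os
from typing import Dict, Mapping, Optional

VOLUME_ROOT = "/runpod-volume"
RUNTIMES_DIR_NAME = "runtimes"
UV_CACHE_DIR_NAME = ".uv-cache"  # hypothetical constant name
HF_CACHE_DIR_NAME = ".hf-cache"


def resolve_workspace(
    env: Mapping[str, str], volume_root: str = VOLUME_ROOT
) -> Optional[Dict[str, str]]:
    """Return workspace and cache paths, or None when no volume is mounted."""
    if not os.path.isdir(volume_root):
        return None  # caller falls back to container-local storage
    # Assumption: use a fixed fallback name when the variable is unset.
    endpoint_id = env.get("RUNPOD_ENDPOINT_ID", "default")
    return {
        # Each endpoint gets its own workspace directory.
        "workspace": os.path.join(volume_root, RUNTIMES_DIR_NAME, endpoint_id),
        # The caches sit at the volume root and are shared by all endpoints.
        "uv_cache": os.path.join(volume_root, UV_CACHE_DIR_NAME),
        "hf_cache": os.path.join(volume_root, HF_CACHE_DIR_NAME),
    }


def huggingface_cache_env(hf_cache: str) -> Dict[str, str]:
    """Environment variables pointing Hugging Face downloads at the volume."""
    return {
        "HF_HOME": hf_cache,
        # Assumption: a subdirectory of the shared cache.
        "TRANSFORMERS_CACHE": os.path.join(hf_cache, "transformers"),
    }
```

Keeping the caches outside the per-endpoint directory is what lets separate endpoints reuse the same downloaded wheels and models while still having isolated virtual environments.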
@deanq marked this pull request as ready for review August 2, 2025 18:34
deanq added 5 commits August 2, 2025 16:26
…ization

- Add configurable timeout constants (WORKSPACE_INIT_TIMEOUT, WORKSPACE_LOCK_POLL_INTERVAL)
- Implement atomic lock file operations with proper file descriptor management
- Enhance lock file cleanup with comprehensive error handling
- Add workspace directory validation before lock acquisition
- Fix race condition in workspace functionality checks by making them atomic
- Add comprehensive timeout and edge case tests for concurrency scenarios
- Improve error messages and fallback behavior for various failure modes
- Maintain backward compatibility while significantly improving reliability
- Add BaseExecutor abstract base class with common functionality
- Update FunctionExecutor and ClassExecutor to inherit from BaseExecutor
- Standardize execution environment setup via _setup_execution_environment
- Update ClassExecutor constructor to accept workspace_manager parameter
- Fix ClassExecutor tests to mock workspace_manager dependency
- Add logging support to DependencyInstaller and WorkspaceManager
- Replace print calls with appropriate log levels (info, warning, error)
- Improve debugging and monitoring capabilities
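
The file-based locking described in the concurrency commit above might be sketched like this. WORKSPACE_INIT_TIMEOUT and WORKSPACE_LOCK_POLL_INTERVAL are constant names from the commit message, but their values here are placeholders, and the workspace_lock helper is hypothetical. The sketch relies on os.open with O_CREAT | O_EXCL, which fails if the file already exists, so creating the lock file and taking the lock are a single step.

```python
import os
import time
from contextlib import contextmanager

WORKSPACE_INIT_TIMEOUT = 30.0  # seconds; placeholder value
WORKSPACE_LOCK_POLL_INTERVAL = 0.5  # seconds; placeholder value


@contextmanager
def workspace_lock(
    lock_path: str,
    timeout: float = WORKSPACE_INIT_TIMEOUT,
    poll_interval: float = WORKSPACE_LOCK_POLL_INTERVAL,
):
    """Hold an exclusive lock file for the duration of the with-block."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if another holder already made the file.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"Timed out waiting for lock {lock_path}")
            time.sleep(poll_interval)
    try:
        # Record the owner to help debug stale locks.
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        # Always close the descriptor and remove the lock file.
        os.close(fd)
        try:
            os.unlink(lock_path)
        except FileNotFoundError:
            pass
```

Workspace initialization would run inside `with workspace_lock(path):`, and a caller that hits the TimeoutError can fall back to container-local storage. Whether this scheme is adequate depends on the filesystem backing the volume, and the sketch does not detect locks left behind by a crashed process.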
@deanq requested a review from pandyamarut August 4, 2025 22:37
deanq added 2 commits August 4, 2025 16:27
This fix resolves the vLLM subprocess errors while maintaining full compatibility with existing functionality. When deployed to RunPod with volumes, libraries like vLLM that hardcode /app/.venv paths will use the volume's virtual environment.
- Add symlink from /app/.venv to volume venv to handle hardcoded paths
- Configure PYTHONPATH environment variable for subprocess compatibility
- Ensure libraries like vLLM can spawn subprocesses that find installed packages
- Add comprehensive test coverage for symlink functionality
- Maintain backward compatibility when no volume is present
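
A sketch of the symlink and PYTHONPATH handling listed above, under stated assumptions. The /app/.venv path and the use of PYTHONPATH come from the commit message; the function names, the lib/pythonX.Y/site-packages layout, and the choice to leave an existing real directory untouched are assumptions and may not match the PR.

```python
import os
import sys
from typing import Dict, Mapping


def link_app_venv(volume_venv: str, app_venv: str = "/app/.venv") -> bool:
    """Point app_venv at volume_venv; return True if the link is in place."""
    if not os.path.isdir(volume_venv):
        return False  # no volume venv: keep whatever the container shipped
    if os.path.islink(app_venv):
        if os.path.realpath(app_venv) == os.path.realpath(volume_venv):
            return True  # already linked to the right place
        os.unlink(app_venv)  # stale link pointing somewhere else
    elif os.path.exists(app_venv):
        # Assumption: never delete a real directory in this sketch.
        return False
    os.symlink(volume_venv, app_venv)
    return True


def subprocess_env(volume_venv: str, base_env: Mapping[str, str]) -> Dict[str, str]:
    """Copy base_env with the venv's site-packages prepended to PYTHONPATH."""
    version_dir = f"python{sys.version_info.major}.{sys.version_info.minor}"
    site_packages = os.path.join(volume_venv, "lib", version_dir, "site-packages")
    env = dict(base_env)
    existing = env.get("PYTHONPATH")
    env["PYTHONPATH"] = (
        site_packages if not existing else site_packages + os.pathsep + existing
    )
    return env
```

When no volume is mounted, link_app_venv returns False without touching anything, which is one way to preserve the no-volume behavior the last bullet calls for.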

@pandyamarut left a comment

/LGTM

@deanq merged commit 6675ec1 into main Aug 5, 2025
10 checks passed
@deanq deleted the deanq/ae-894-sandbox-env branch August 5, 2025 19:00
2 participants