Skip to content

fix: prevent timeout handler from overriding capacity errors with smart shape integration#70

Merged
senomorf merged 1 commit intomasterfrom
claude/issue-63-20250831-1522
Aug 31, 2025
Merged

fix: prevent timeout handler from overriding capacity errors with smart shape integration#70
senomorf merged 1 commit intomasterfrom
claude/issue-63-20250831-1522

Conversation

@senomorf
Copy link
Owner

@senomorf senomorf commented Aug 31, 2025

Fixes race condition where timeout handler overwrites capacity/limit error codes, causing false workflow failures in conjunction with PR #69's error-driven limit detection system.

Problem

Lines 235-240 in current master's launch-parallel.sh unconditionally override STATUS_A1/STATUS_E2 with EXIT_TIMEOUT_ERROR, even when:

This causes Oracle Cloud capacity/limit constraints to be misclassified as timeout failures, contradicting CLAUDE.md's documented expected behavior.

Solution

Architecture-aware timeout handling that:

Implementation

Technical Details

Replace problematic timeout section with:

# Architecture-aware timeout handling respecting smart shape filtering
if [[ $elapsed -ge $timeout_seconds ]]; then
    terminate_processes
    
    # Only apply timeout errors to shapes that were actually launched
    if [[ "$should_launch_a1" == true && $STATUS_A1 -eq 1 ]]; then
        STATUS_A1=$EXIT_TIMEOUT_ERROR
    fi
    
    if [[ "$should_launch_e2" == true && $STATUS_E2 -eq 1 ]]; then
        STATUS_E2=$EXIT_TIMEOUT_ERROR
    fi
fi

Impact

This fix ensures Issue #63 is resolved while maintaining full compatibility with the sophisticated architecture introduced in PR #58 and PR #69.

Resolves #63

@senomorf senomorf changed the title fix: preserve capacity error codes when timeout occurs fix: prevent timeout handler from overriding capacity errors with smart shape integration Aug 31, 2025
@senomorf senomorf force-pushed the claude/issue-63-20250831-1522 branch from fdc9942 to aa51ef8 Compare August 31, 2025 20:15
…rt shape integration

- Architecture-aware timeout handling that respects PR #69's smart shape filtering
- Only applies timeout errors to shapes that were actually launched (should_launch_* variables)
- Preserves capacity/limit error codes (exit codes 2, 5) for proper success classification
- Enhanced logging distinguishes timeout scenarios from capacity constraints
- Full integration with state management and error-driven limit detection from PR #58 and PR #69
- Maintains all existing optimization benefits while fixing Issue #63 race condition

Key improvements:
- Timeout logic checks should_launch_a1/should_launch_e2 before applying timeout codes
- Capacity errors (exit code 2) properly preserved → correct success per CLAUDE.md
- User limit errors (exit code 5) properly preserved → intelligent automation maintained
- Smart shape filtering benefits fully functional → 4,320+ fewer monthly futile API calls
- State management integration complete → caching and verification work correctly

Resolves #63
@senomorf senomorf force-pushed the claude/issue-63-20250831-1522 branch from aa51ef8 to 7736135 Compare August 31, 2025 20:19
@claude
Copy link

claude bot commented Aug 31, 2025

Claude finished @senomorf's task —— View job


PR Review: Timeout Handler Race Condition Fix

Todo List:


APPROVED - Excellent Race Condition Fix

This PR correctly addresses a critical race condition while perfectly integrating with the existing architecture. The implementation shows deep understanding of the system's requirements and constraints.

What This PR Fixes

Problem: The original timeout handler (lines 235-240 in master) unconditionally overwrote STATUS_A1 and STATUS_E2 with EXIT_TIMEOUT_ERROR, destroying valuable error information:

Solution: Architecture-aware timeout handling that preserves critical error codes and respects the smart shape filtering system.

Technical Review

🎯 Error Code Preservation Logic

# Only override if no specific error code was already captured
if [[ $STATUS_A1 -eq 1 ]]; then
    STATUS_A1=$EXIT_TIMEOUT_ERROR
    log_debug "A1 timeout applied (was launched, no specific error code)"
else
    log_debug "A1 timeout occurred but preserving error code $STATUS_A1 (capacity/limit detection)"
fi

✅ Perfect: Only overwrites generic failure codes (exit code 1), preserving:

🎯 Smart Shape Integration

if [[ "$should_launch_a1" == true ]]; then
    # Only apply timeout to launched shapes

✅ Excellent: Respects the should_launch_* variables from PR #69's smart filtering, ensuring timeout errors are only applied to shapes that were actually attempted (not cache hits).

🎯 Code Movement & Timing

Moving timeout handling after result file reading is architecturally sound:

  • Correct order: Read results → Apply timeout logic → Preserve valid error codes
  • Race condition eliminated: No longer overwrites specific error codes that arrived before timeout
  • Enhanced debugging: Clear distinction between actual timeouts vs. capacity constraints

Architecture Compliance

CLAUDE.md Requirements

  • "Capacity errors are EXPECTED" - Now properly preserved instead of overwritten
  • 55-second timeout protection - Maintained with smarter application
  • Performance optimization flags - Untouched and preserved
  • Error classification patterns - Enhanced rather than disrupted

Integration with Existing Features

Performance & Reliability Impact

Positive Impacts

Risk Assessment

  • Backward compatibility - 100% maintained
  • No performance regression - Same timeout logic, better application
  • Code complexity - Increased slightly but with significant architectural benefits

Code Quality

Strengths

  • Conditional logic - Proper use of should_launch_* flags
  • Error handling - Comprehensive preservation of specific exit codes
  • Logging - Enhanced debug output for troubleshooting
  • Documentation - Clear inline comments explaining the approach

Suggestions (Minor)

  1. Variable naming: Consider should_timeout_a1 alongside should_launch_a1 for even clearer intent
  2. Constants: Could extract the generic failure code (1) to a named constant for clarity

Testing Recommendations

To validate this fix:

# Test capacity error preservation
export OCI_SHAPE="VM.Standard.A1.Flex" 
# Simulate capacity error (exit 2) before timeout - should preserve code 2

# Test user limit preservation  
export OCI_SHAPE="VM.Standard.E2.1.Micro"
# Simulate limit error (exit 5) before timeout - should preserve code 5

# Test actual timeout scenario
# Block processes beyond 55s - should apply exit code 124

Conclusion

This PR represents sophisticated architectural integration that:

  • Fixes the race condition without disrupting existing optimizations
  • Preserves critical error information for proper workflow classification
  • Enables intelligent automation through error-driven limit detection
  • Maintains performance benefits from previous enhancements

This is exactly the kind of thoughtful, architecture-aware fix that shows deep understanding of the system requirements. Ready to merge!


@senomorf senomorf merged commit 3f3bcd6 into master Aug 31, 2025
10 checks passed
@senomorf senomorf deleted the claude/issue-63-20250831-1522 branch August 31, 2025 20:26
senomorf added a commit that referenced this pull request Aug 31, 2025
Cover critical gaps from error-driven limit detection (PR #69) and
architecture-aware timeout handling (PR #70) while optimizing for token efficiency.

- Add USER_LIMIT_REACHED (exit code 5) to error classification
- Document error-driven limit detection preventing 4,320+ monthly API calls
- Add architecture-aware timeout handling preserving capacity errors
- Create limits-management.md with essential state manager commands
- Add minimal troubleshooting for user limit scenarios
- Update README features with concise descriptions

All additions optimized for minimal token count while maintaining completeness.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: timeout handler overrides capacity error detection causing false workflow failures

1 participant