fix: prevent timeout handler from overriding capacity errors with smart shape integration by senomorf · Pull Request #70 · senomorf/OracleInstanceCreator

senomorf · 2025-08-31T19:28:06Z

Fixes race condition where timeout handler overwrites capacity/limit error codes, causing false workflow failures in conjunction with PR #69's error-driven limit detection system.

Problem

Lines 235-240 in current master's launch-parallel.sh unconditionally override STATUS_A1/STATUS_E2 with EXIT_TIMEOUT_ERROR, even when:

Result files contain valid capacity errors (exit code 2) or user limit errors (exit code 5)
Smart shape filtering (PR feat: implement error-driven limit detection to prevent 4,320+ monthly futile API calls #69) has already detected limits via cached state
Processes completed with specific error codes before timeout was reached

This causes Oracle Cloud capacity/limit constraints to be misclassified as timeout failures, contradicting CLAUDE.md's documented expected behavior.

Solution

Architecture-aware timeout handling that:

Integrates with PR feat: implement error-driven limit detection to prevent 4,320+ monthly futile API calls #69's smart shape filtering (should_launch_* variables)
Only applies timeout errors to shapes that were actually launched and timed out
Preserves capacity/limit error codes (exit codes 2, 5) that indicate expected Oracle Cloud behavior
Respects error-driven limit detection from PR feat: implement error-driven limit detection to prevent 4,320+ monthly futile API calls #69's caching system
Enhanced logging to distinguish timeout scenarios from capacity constraints

Implementation

Uses PR feat: implement error-driven limit detection to prevent 4,320+ monthly futile API calls #69's should_launch_* flags to determine which shapes were actually attempted
Only overrides exit codes for shapes that were launched AND have generic failure codes
Preserves all existing PR feat: Reduce Oracle API calls by 80-90% with state management caching #58 and PR feat: implement error-driven limit detection to prevent 4,320+ monthly futile API calls #69 optimization benefits
Maintains backward compatibility with existing error handling

Technical Details

Replace problematic timeout section with:

# Architecture-aware timeout handling respecting smart shape filtering
if [[ $elapsed -ge $timeout_seconds ]]; then
    terminate_processes
    
    # Only apply timeout errors to shapes that were actually launched
    if [[ "$should_launch_a1" == true && $STATUS_A1 -eq 1 ]]; then
        STATUS_A1=$EXIT_TIMEOUT_ERROR
    fi
    
    if [[ "$should_launch_e2" == true && $STATUS_E2 -eq 1 ]]; then
        STATUS_E2=$EXIT_TIMEOUT_ERROR
    fi
fi

Impact

Capacity errors properly preserved → correct success classification per CLAUDE.md
Smart shape filtering benefits maintained → 4,320+ fewer monthly futile API calls
Error-driven limit detection fully functional → intelligent automation from PR feat: implement error-driven limit detection to prevent 4,320+ monthly futile API calls #69
Enhanced debugging → clear distinction between actual timeouts and capacity limits

This fix ensures Issue #63 is resolved while maintaining full compatibility with the sophisticated architecture introduced in PR #58 and PR #69.

Resolves #63

…rt shape integration - Architecture-aware timeout handling that respects PR #69's smart shape filtering - Only applies timeout errors to shapes that were actually launched (should_launch_* variables) - Preserves capacity/limit error codes (exit codes 2, 5) for proper success classification - Enhanced logging distinguishes timeout scenarios from capacity constraints - Full integration with state management and error-driven limit detection from PR #58 and PR #69 - Maintains all existing optimization benefits while fixing Issue #63 race condition Key improvements: - Timeout logic checks should_launch_a1/should_launch_e2 before applying timeout codes - Capacity errors (exit code 2) properly preserved → correct success per CLAUDE.md - User limit errors (exit code 5) properly preserved → intelligent automation maintained - Smart shape filtering benefits fully functional → 4,320+ fewer monthly futile API calls - State management integration complete → caching and verification work correctly Resolves #63

claude · 2025-08-31T20:20:08Z

Claude finished @senomorf's task —— View job

PR Review: Timeout Handler Race Condition Fix

Todo List:

Read CLAUDE.md for context and requirements
Examine current launch-parallel.sh implementation
Compare with origin/master to understand changes
Review timeout handling logic for correctness
Verify integration with PR feat: implement error-driven limit detection to prevent 4,320+ monthly futile API calls #69's smart shape filtering
Check error code preservation logic
Assess adherence to documented patterns
Provide comprehensive review feedback

✅ APPROVED - Excellent Race Condition Fix

This PR correctly addresses a critical race condition while perfectly integrating with the existing architecture. The implementation shows deep understanding of the system's requirements and constraints.

What This PR Fixes

Problem: The original timeout handler (lines 235-240 in master) unconditionally overwrote STATUS_A1 and STATUS_E2 with EXIT_TIMEOUT_ERROR, destroying valuable error information:

Capacity errors (exit code 2) → Lost, causing false workflow failures
User limit errors (exit code 5) → Lost, breaking PR feat: implement error-driven limit detection to prevent 4,320+ monthly futile API calls #69's intelligent caching
Smart shape filtering benefits → Negated by incorrect error classification

Solution: Architecture-aware timeout handling that preserves critical error codes and respects the smart shape filtering system.

Technical Review

🎯 Error Code Preservation Logic

# Only override if no specific error code was already captured
if [[ $STATUS_A1 -eq 1 ]]; then
    STATUS_A1=$EXIT_TIMEOUT_ERROR
    log_debug "A1 timeout applied (was launched, no specific error code)"
else
    log_debug "A1 timeout occurred but preserving error code $STATUS_A1 (capacity/limit detection)"
fi

✅ Perfect: Only overwrites generic failure codes (exit code 1), preserving:

Capacity errors (exit code 2) - per CLAUDE.md requirement: "Capacity errors are EXPECTED (treat as success)"
User limit errors (exit code 5) - critical for PR feat: implement error-driven limit detection to prevent 4,320+ monthly futile API calls #69's limit detection cache

🎯 Smart Shape Integration

if [[ "$should_launch_a1" == true ]]; then
    # Only apply timeout to launched shapes

✅ Excellent: Respects the should_launch_* variables from PR #69's smart filtering, ensuring timeout errors are only applied to shapes that were actually attempted (not cache hits).

🎯 Code Movement & Timing

Moving timeout handling after result file reading is architecturally sound:

✅ Correct order: Read results → Apply timeout logic → Preserve valid error codes
✅ Race condition eliminated: No longer overwrites specific error codes that arrived before timeout
✅ Enhanced debugging: Clear distinction between actual timeouts vs. capacity constraints

Architecture Compliance

CLAUDE.md Requirements ✅

✅ "Capacity errors are EXPECTED" - Now properly preserved instead of overwritten
✅ 55-second timeout protection - Maintained with smarter application
✅ Performance optimization flags - Untouched and preserved
✅ Error classification patterns - Enhanced rather than disrupted

Integration with Existing Features ✅

✅ PR feat: Reduce Oracle API calls by 80-90% with state management caching #58 optimizations - Fully preserved
✅ PR feat: implement error-driven limit detection to prevent 4,320+ monthly futile API calls #69 smart filtering - Seamlessly integrated via should_launch_* flags
✅ State management - Error-driven limit detection now functions correctly
✅ Notification system - Will receive correct error classifications

Performance & Reliability Impact

Positive Impacts ✅

4,320+ fewer monthly API calls - PR feat: implement error-driven limit detection to prevent 4,320+ monthly futile API calls #69's benefits now fully realized
Correct workflow status - Capacity constraints no longer cause false failures
Enhanced debugging - Clear logging distinguishes timeout vs. capacity scenarios
Intelligent automation - Error-driven limit detection works as intended

Risk Assessment ✅

Backward compatibility - 100% maintained
No performance regression - Same timeout logic, better application
Code complexity - Increased slightly but with significant architectural benefits

Code Quality

Strengths ✅

Conditional logic - Proper use of should_launch_* flags
Error handling - Comprehensive preservation of specific exit codes
Logging - Enhanced debug output for troubleshooting
Documentation - Clear inline comments explaining the approach

Suggestions (Minor)

Variable naming: Consider should_timeout_a1 alongside should_launch_a1 for even clearer intent
Constants: Could extract the generic failure code (1) to a named constant for clarity

Testing Recommendations

To validate this fix:

# Test capacity error preservation
export OCI_SHAPE="VM.Standard.A1.Flex" 
# Simulate capacity error (exit 2) before timeout - should preserve code 2

# Test user limit preservation  
export OCI_SHAPE="VM.Standard.E2.1.Micro"
# Simulate limit error (exit 5) before timeout - should preserve code 5

# Test actual timeout scenario
# Block processes beyond 55s - should apply exit code 124

Conclusion

This PR represents sophisticated architectural integration that:

✅ Fixes the race condition without disrupting existing optimizations
✅ Preserves critical error information for proper workflow classification
✅ Enables intelligent automation through error-driven limit detection
✅ Maintains performance benefits from previous enhancements

This is exactly the kind of thoughtful, architecture-aware fix that shows deep understanding of the system requirements. Ready to merge!

Cover critical gaps from error-driven limit detection (PR #69) and architecture-aware timeout handling (PR #70) while optimizing for token efficiency. - Add USER_LIMIT_REACHED (exit code 5) to error classification - Document error-driven limit detection preventing 4,320+ monthly API calls - Add architecture-aware timeout handling preserving capacity errors - Create limits-management.md with essential state manager commands - Add minimal troubleshooting for user limit scenarios - Update README features with concise descriptions All additions optimized for minimal token count while maintaining completeness.

senomorf changed the title ~~fix: preserve capacity error codes when timeout occurs~~ fix: prevent timeout handler from overriding capacity errors with smart shape integration Aug 31, 2025

senomorf force-pushed the claude/issue-63-20250831-1522 branch from fdc9942 to aa51ef8 Compare August 31, 2025 20:15

senomorf force-pushed the claude/issue-63-20250831-1522 branch from aa51ef8 to 7736135 Compare August 31, 2025 20:19

senomorf merged commit 3f3bcd6 into master Aug 31, 2025
10 checks passed

senomorf deleted the claude/issue-63-20250831-1522 branch August 31, 2025 20:26

senomorf mentioned this pull request Aug 31, 2025

docs: add concise documentation for issues 63 and 64 implementations #78

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: prevent timeout handler from overriding capacity errors with smart shape integration#70

fix: prevent timeout handler from overriding capacity errors with smart shape integration#70
senomorf merged 1 commit intomasterfrom
claude/issue-63-20250831-1522

senomorf commented Aug 31, 2025 •

edited

Loading

Uh oh!

claude bot commented Aug 31, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

senomorf commented Aug 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Solution

Implementation

Technical Details

Impact

Uh oh!

claude bot commented Aug 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Timeout Handler Race Condition Fix

Todo List:

✅ APPROVED - Excellent Race Condition Fix

What This PR Fixes

Technical Review

🎯 Error Code Preservation Logic

🎯 Smart Shape Integration

🎯 Code Movement & Timing

Architecture Compliance

CLAUDE.md Requirements ✅

Integration with Existing Features ✅

Performance & Reliability Impact

Positive Impacts ✅

Risk Assessment ✅

Code Quality

Strengths ✅

Suggestions (Minor)

Testing Recommendations

Conclusion

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

senomorf commented Aug 31, 2025 •

edited

Loading

claude bot commented Aug 31, 2025 •

edited

Loading