fix(aspire): fail fast on Finished-state crashes during startup (#6342)#6346
Conversation
The fail-fast watcher (IsFailureState) only treated `Exited` + non-zero as a
failure. A crashed .NET project/executable reaches `Finished` (not `Exited`),
so its crash went unseen: it never becomes healthy, and the readiness wait
blocked for the entire ResourceTimeout instead of aborting immediately (the
reporter saw 480s for a resource that died at 41s).
- IsFailureState now also flags `Finished` + non-zero exit; clean (code 0)
exits stay excluded so one-shot migration/seeder resources aren't
mis-reported. Refactored to a pure (state, exitCode) overload for testing.
- DecodeExitCode/FormatExitCode turn opaque codes into causes (0xE0434352 ->
".NET unhandled exception", 0xC0000005 -> access violation, 137 -> SIGKILL/OOM,
etc.), surfaced in the concise message and the diagnostics block.
- FindAwaiters reports the reverse WaitFor graph ("Awaited by: media, web").
- Concise message now reports elapsed-to-exit via the existing timeline.
Adds pure unit tests for IsFailureState, DecodeExitCode/FormatExitCode,
DescribeState decoding, and FindAwaiters.
Not up to standards ⛔🔴 Issues
|
| Category | Results |
|---|---|
| CodeStyle | 1 minor |
🟢 Metrics 33 complexity
Metric Results Complexity 33
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Review
Verified the fix by building TUnit.Aspire.Core (net8/9/10, all succeed) and running the new/existing AspireDiagnosticsTests — all 39 pass.
Root cause & fix are correct. Finished genuinely is the DCP terminal state for projects/executables (vs. Exited for containers), so widening IsFailureState to check both terminal states + non-zero exit code is the right fix, and keeping the code-0 exclusion (documented in the PR description) correctly avoids mis-flagging one-shot migration/seeder resources.
Good engineering choices:
- Refactoring
IsFailureStateto a pure(string? state, int? exitCode)overload made it directly unit-testable without needing a liveCustomResourceSnapshot— nice pattern, and the parameterized[Arguments]test table covers the meaningful boundary cases (Finished/Exited × zero/non-zero, FailedToStart, null) concisely. FindAwaiterscorrectly mirrors the existingAppendDependencyChain's use ofTryGetAnnotationsOfType<WaitAnnotation>— reusing the established pattern for the reverse direction rather than inventing a new one.TerminalElapsedFor's origin-relative elapsed calculation follows the sametransitions[0].Elapsedconvention already used inFormatTimeline, and I checkedTrimTimeline— it always preserves the firsthead(80) entries, so the origin stays valid even after trimming a long-running flapping resource's timeline.- The defensive
try/catch (ObjectDisposedException)aroundGetRequiredService<DistributedApplicationModel>()inBuildDiagnosticsAndAttachAsyncis a sensible, narrowly-scoped guard given this runs during/after teardown — better to silently omit "Awaited by" than mask the real failure with an unrelated disposal exception. - Reused the existing
FakeComputeResourcetest helper rather than adding a new one.
No functional issues found. One very minor nit (non-blocking): the exit-code decode table (DecodeExitCode) is a reasonable, well-commented start, but as a hardcoded switch it will inevitably be incomplete for other common signals/HRESULTs. Not a concern for this PR — just something to keep in mind if this list grows, a small lookup table/dictionary might read more easily than an ever-expanding switch expression.
Approving — clean root-cause fix, thoughtful reuse of existing conventions, solid test coverage.
Fixes #6342.
Root cause
AspireFixture's fail-fast watcher (IsFailureState) only treatedExited+ non-zero exit as a failure. But a crashed .NET project/executable reachesFinished, notExited(confirmed against Aspire.Hosting:Finishedis the DCP project/exe terminal state,Exitedis the container path). So the crash went unseen — the resource never became healthy and the readiness wait blocked for the entireResourceTimeoutinstead of aborting immediately (the report showed 480s for a resource that died at ~41s).Changes
IsFailureStatenow also flagsFinished+ non-zero exit. Clean (code 0) exits stay excluded so one-shot migration/seeder resources aren't mis-reported. Refactored to a pure(string? state, int? exitCode)overload so it's unit-testable.DecodeExitCode/FormatExitCodeturn opaque codes into human-readable causes:0xE0434352→ ".NET unhandled exception",0xC0000005→ access violation,137→ SIGKILL/OOM,139SIGSEGV,134SIGABRT, etc. Surfaced in both the concise message and the diagnostics block.FindAwaitersreports the reverseWaitForgraph — the downstream resources that were blocked ("Awaited by: media, web").Decision: code-0 exits
The issue's acceptance criteria also asked for clean (code 0) exits to be flagged as failures during startup. This was intentionally not done — the existing code excludes code 0 precisely so one-shot resources (migration runners, seeders) that finish successfully aren't mis-reported. Flipping that would regress those scenarios. Only non-zero terminal exits fail fast.
Tests
Added pure (no-Docker) unit tests to
AspireDiagnosticsTests:IsFailureState(Finished/Exited × zero/non-zero, FailedToStart, null),DecodeExitCode/FormatExitCode,DescribeStatedecoding, andFindAwaiters. All 39 tests in the class pass locally;TUnit.Aspire.Corebuilds clean across net8/net9/net10.Example message