fix(aspire): fail fast on Finished-state crashes during startup (#6342) #6346

Merged: thomhurst merged 2 commits into main from fix/6342-aspire-finished-state-failfast on Jul 1, 2026

@thomhurst

Owner

Fixes #6342.

Root cause

AspireFixture's fail-fast watcher (IsFailureState) only treated Exited + non-zero exit as a failure. But a crashed .NET project/executable reaches Finished, not Exited (confirmed against Aspire.Hosting: Finished is the DCP project/exe terminal state, Exited is the container path). So the crash went unseen — the resource never became healthy and the readiness wait blocked for the entire ResourceTimeout instead of aborting immediately (the report showed 480s for a resource that died at ~41s).

Changes

  • IsFailureState now also flags Finished + non-zero exit. Clean (code 0) exits stay excluded so one-shot migration/seeder resources aren't mis-reported. Refactored to a pure (string? state, int? exitCode) overload so it's unit-testable.
  • DecodeExitCode / FormatExitCode turn opaque codes into human-readable causes: 0xE0434352 → ".NET unhandled exception", 0xC0000005 → access violation, 137 → SIGKILL/OOM, 139 → SIGSEGV, 134 → SIGABRT, etc. Surfaced in both the concise message and the diagnostics block.
  • FindAwaiters reports the reverse WaitFor graph — the downstream resources that were blocked ("Awaited by: media, web").
  • The concise message now reports elapsed-to-exit ("after 41.3s") by reusing the timeline the background monitor already records.
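As a rough illustration of the first change, the pure overload could look something like the sketch below. This is not the merged code: the state names are written as plain strings, and the handling of FailedToStart and of a missing exit code are assumptions made for the example.

```csharp
using System;

// Hypothetical sketch of a pure failure check; the real TUnit.Aspire code may differ.
static bool IsFailureState(string? state, int? exitCode) => state switch
{
    // Assumption: FailedToStart counts as a failure regardless of exit code.
    "FailedToStart" => true,
    // Finished (projects/executables) and Exited (containers) fail only on a
    // non-zero exit code. Assumption: an unknown (null) exit code is not a failure.
    "Finished" or "Exited" => exitCode.HasValue && exitCode.Value != 0,
    // Any other state (Running, Starting, null, ...) is not a terminal failure.
    _ => false,
};

Console.WriteLine(IsFailureState("Finished", -532462766)); // True
Console.WriteLine(IsFailureState("Finished", 0));          // False
Console.WriteLine(IsFailureState("Exited", 137));          // True
Console.WriteLine(IsFailureState(null, null));             // False
```

Because the function takes only a state string and an exit code, it can be exercised from a parameterized test table without a live resource snapshot.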

Decision: code-0 exits

The issue's acceptance criteria also asked for clean (code 0) exits to be flagged as failures during startup. This was intentionally not done — the existing code excludes code 0 precisely so one-shot resources (migration runners, seeders) that finish successfully aren't mis-reported. Flipping that would regress those scenarios. Only non-zero terminal exits fail fast.

Tests

Added pure (no-Docker) unit tests to AspireDiagnosticsTests: IsFailureState (Finished/Exited × zero/non-zero, FailedToStart, null), DecodeExitCode/FormatExitCode, DescribeState decoding, and FindAwaiters. All 39 tests in the class pass locally; TUnit.Aspire.Core builds clean across net8/net9/net10.

Example message

Aspire startup aborted: resource 'chat' failed during startup.
  chat: Finished, exit code -532462766 (.NET unhandled exception) after 41.3s
     Awaited by: media, web
     Hint: The process exited with a non-zero code on startup. The logs usually show the cause...
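The negative number in the message and the hex code in the change list are the same 32 bits read as signed versus unsigned. The short sketch below works through that conversion, along with the common 128 + signal convention behind codes such as 137; the variable names are illustrative only.

```csharp
using System;

// The exit code from the example message, as a signed 32-bit integer.
int exitCode = -532462766;

// Reinterpret the same 32 bits as unsigned to recover the hex form.
uint raw = unchecked((uint)exitCode);
Console.WriteLine($"0x{raw:X8}"); // 0xE0434352

// On Linux, a process killed by a signal is conventionally reported as 128 + signal number.
int posixCode = 137;
int signal = posixCode - 128;
Console.WriteLine(signal); // 9, i.e. SIGKILL
```

The same subtraction gives 11 (SIGSEGV) for 139 and 6 (SIGABRT) for 134.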

The fail-fast watcher (IsFailureState) only treated `Exited` + non-zero as a
failure. A crashed .NET project/executable reaches `Finished` (not `Exited`),
so its crash went unseen: it never becomes healthy, and the readiness wait
blocked for the entire ResourceTimeout instead of aborting immediately (the
reporter saw 480s for a resource that died at 41s).

- IsFailureState now also flags `Finished` + non-zero exit; clean (code 0)
  exits stay excluded so one-shot migration/seeder resources aren't
  mis-reported. Refactored to a pure (state, exitCode) overload for testing.
- DecodeExitCode/FormatExitCode turn opaque codes into causes (0xE0434352 ->
  ".NET unhandled exception", 0xC0000005 -> access violation, 137 -> SIGKILL/OOM,
  etc.), surfaced in the concise message and the diagnostics block.
- FindAwaiters reports the reverse WaitFor graph ("Awaited by: media, web").
- Concise message now reports elapsed-to-exit via the existing timeline.

Adds pure unit tests for IsFailureState, DecodeExitCode/FormatExitCode,
DescribeState decoding, and FindAwaiters.
@codacy-production

codacy-production Bot commented Jul 1, 2026

Not up to standards ⛔

🔴 Issues 1 minor

Alerts:
⚠ 1 issue (≤ 0 issues of at least minor severity)

Results:
1 new issue

Category Results
CodeStyle 1 minor

🟢 Metrics 33 complexity

Metric Results
Complexity 33


@claude claude Bot left a comment

Contributor

Review

Verified the fix by building TUnit.Aspire.Core (net8/9/10, all succeed) and running the new/existing AspireDiagnosticsTests — all 39 pass.

Root cause & fix are correct. Finished genuinely is the DCP terminal state for projects/executables (vs. Exited for containers), so widening IsFailureState to check both terminal states + non-zero exit code is the right fix, and keeping the code-0 exclusion (documented in the PR description) correctly avoids mis-flagging one-shot migration/seeder resources.

Good engineering choices:

  • Refactoring IsFailureState to a pure (string? state, int? exitCode) overload made it directly unit-testable without needing a live CustomResourceSnapshot — nice pattern, and the parameterized [Arguments] test table covers the meaningful boundary cases (Finished/Exited × zero/non-zero, FailedToStart, null) concisely.
  • FindAwaiters correctly mirrors the existing AppendDependencyChain's use of TryGetAnnotationsOfType<WaitAnnotation> — reusing the established pattern for the reverse direction rather than inventing a new one.
  • TerminalElapsedFor's origin-relative elapsed calculation follows the same transitions[0].Elapsed convention already used in FormatTimeline, and I checked TrimTimeline — it always preserves the first head (80) entries, so the origin stays valid even after trimming a long-running flapping resource's timeline.
  • The defensive try/catch (ObjectDisposedException) around GetRequiredService<DistributedApplicationModel>() in BuildDiagnosticsAndAttachAsync is a sensible, narrowly-scoped guard given this runs during/after teardown — better to silently omit "Awaited by" than mask the real failure with an unrelated disposal exception.
  • Reused the existing FakeComputeResource test helper rather than adding a new one.
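For readers unfamiliar with the reverse WaitFor lookup mentioned above, here is a minimal stand-alone sketch. It does not use the Aspire annotation APIs: the dictionary model and this `FindAwaiters` signature are hypothetical stand-ins, and it reports direct awaiters only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model: each resource maps to the resources it waits for.
var waitsFor = new Dictionary<string, string[]>
{
    ["chat"] = Array.Empty<string>(),
    ["media"] = new[] { "chat" },
    ["web"] = new[] { "chat", "media" },
};

// Reverse lookup: which resources directly wait on the failed one?
static IReadOnlyList<string> FindAwaiters(
    IReadOnlyDictionary<string, string[]> graph, string failed) =>
    graph.Where(kv => kv.Value.Contains(failed))
         .Select(kv => kv.Key)
         .OrderBy(name => name, StringComparer.Ordinal)
         .ToList();

Console.WriteLine("Awaited by: " + string.Join(", ", FindAwaiters(waitsFor, "chat")));
// Awaited by: media, web
```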

No functional issues found. One very minor nit (non-blocking): the exit-code decode table (DecodeExitCode) is a reasonable, well-commented start, but as a hardcoded switch it will inevitably be incomplete for other common signals/HRESULTs. Not a concern for this PR, just something to keep in mind: if this list grows, a small lookup table/dictionary might read more easily than an ever-expanding switch expression.
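A dictionary-based variant of that idea might look like the sketch below. The entries are the codes named in the PR description; the `knownExitCodes` table and this `FormatExitCode` signature are hypothetical and not taken from the merged code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical lookup table keyed by the signed 32-bit exit code.
var knownExitCodes = new Dictionary<int, string>
{
    [unchecked((int)0xE0434352)] = ".NET unhandled exception",
    [unchecked((int)0xC0000005)] = "access violation",
    [137] = "SIGKILL/OOM",
    [139] = "SIGSEGV",
    [134] = "SIGABRT",
};

// Append the decoded cause when the code is known; otherwise print the raw code.
string FormatExitCode(int code) =>
    knownExitCodes.TryGetValue(code, out var cause) ? $"{code} ({cause})" : code.ToString();

Console.WriteLine(FormatExitCode(-532462766)); // -532462766 (.NET unhandled exception)
Console.WriteLine(FormatExitCode(3));          // 3
```

Adding a new code then becomes a one-line table entry rather than another switch arm.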

Approving — clean root-cause fix, thoughtful reuse of existing conventions, solid test coverage.

@thomhurst thomhurst enabled auto-merge (squash) July 1, 2026 16:52
@thomhurst thomhurst merged commit f9d0bea into main Jul 1, 2026
12 of 14 checks passed
@thomhurst thomhurst deleted the fix/6342-aspire-finished-state-failfast branch July 1, 2026 17:10
This was referenced Jul 2, 2026

Development

Successfully merging this pull request may close these issues.

[TUnit.Aspire] Fail fast (and attributed) when a resource crashes during fixture startup

1 participant