Skip to content

Inject run config store into WorkloadFromContainerInfo#4342

Open
JAORMX wants to merge 3 commits intomainfrom
fix-flaky-workload-from-container-info
Open

Inject run config store into WorkloadFromContainerInfo#4342
JAORMX wants to merge 3 commits intomainfrom
fix-flaky-workload-from-container-info

Conversation

@JAORMX
Copy link
Collaborator

@JAORMX JAORMX commented Mar 24, 2026

Summary

  • WorkloadFromContainerInfo called loadRunConfigFields which internally created a real state.LocalStore hitting the XDG filesystem on every call. In CI, parallel tests could create or truncate runconfig files, causing intermittent EOF errors from json.Decode — making TestRuntimeStatusManager_GetWorkload flaky.
  • Add a state.Store parameter to WorkloadFromContainerInfo and loadRunConfigFields so callers inject the store, matching the existing DI pattern in fileStatusManager. Add a runConfigStore field to runtimeStatusManager for parity.
  • Fix file_status_test.go mock readers to use DoAndReturn for fresh readers on each call, preventing EOF when the store is read more than once per test.

Fixes #4341

Type of change

  • Bug fix

Test plan

  • Unit tests (task test)

Changes

File Change
pkg/workloads/types/types.go Add state.Store param to WorkloadFromContainerInfo and loadRunConfigFields; rewrite loadRunConfigFields to use injected store directly
pkg/workloads/types/workload_test.go Pass existing store to updated WorkloadFromContainerInfo
pkg/workloads/statuses/status.go Add runConfigStore field to runtimeStatusManager; update NewStatusManagerFromRuntime to return error; pass store at 2 call sites
pkg/workloads/statuses/file_status.go Pass f.runConfigStore at 6 call sites; name receiver on mergeHealthyWorkloadData
pkg/workloads/statuses/status_test.go Inject mock store into runtimeStatusManager; set up Exists expectations
pkg/workloads/statuses/file_status_test.go Change Return(mockReader, nil).AnyTimes() to DoAndReturn(...) for fresh readers

Does this introduce a user-facing change?

No

Special notes for reviewers

The loadRunConfigFields rewrite switches from state.LoadRunConfig (which creates a new store internally) to direct store.Exists + store.GetReader calls. The semantics are identical: returns empty config when not found, propagates errors otherwise — but no longer creates a store on every invocation.

Generated with Claude Code

WorkloadFromContainerInfo called loadRunConfigFields which internally
created a real state.LocalStore hitting the XDG filesystem on every
call. This caused flaky tests when parallel tests created or truncated
runconfig files, leading to intermittent EOF errors from json.Decode.

Add a state.Store parameter to WorkloadFromContainerInfo and
loadRunConfigFields so callers inject the store, matching the existing
dependency injection pattern in fileStatusManager. Add a runConfigStore
field to runtimeStatusManager for parity.

Fix file_status_test.go mock readers to use DoAndReturn for fresh
readers on each call, preventing EOF when the store is read more than
once per test.

Fixes #4341

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Mar 24, 2026
@codecov
Copy link

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 52.63158% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.90%. Comparing base (df1a5cd) to head (ff81f7d).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
pkg/workloads/types/types.go 37.50% 6 Missing and 4 partials ⚠️
pkg/workloads/statuses/status.go 61.53% 3 Missing and 2 partials ⚠️
pkg/workloads/statuses/file_status.go 66.66% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4342      +/-   ##
==========================================
- Coverage   69.08%   68.90%   -0.18%     
==========================================
  Files         478      479       +1     
  Lines       48432    48517      +85     
==========================================
- Hits        33457    33431      -26     
- Misses      12314    12324      +10     
- Partials     2661     2762     +101     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Check reader.Close error return and break long function signature line
to satisfy errcheck and lll linters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Mar 24, 2026
Copy link
Contributor

@yrobla yrobla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DI approach is architecturally correct and cleanly fixes the flaky test root cause. The DoAndReturn changes are exactly right — sharing a single pre-consumed reader across .AnyTimes() calls was the actual bug.

Two things need addressing before merge (marked blocker); two more are nits that can be tackled in a follow-up.


Blockers

  1. loadRunConfigFields: reader.Close() error is silently discarded — diverges from established project pattern
  2. loadRunConfigFields: TOCTOU between store.Exists and store.GetReader — a concurrent deletion between the two calls returns an error that was previously handled gracefully as "not found"

Nits (follow-up PR OK)
3. store.Exists and store.GetReader errors are returned without workload name context, making debugging harder
4. TestNewStatusManagerFromRuntime hits the real XDG filesystem via NewRunConfigStore, inconsistent with the mock-based pattern used everywhere else in the suite

if err != nil {
return nil, err
}
defer func() { _ = reader.Close() }()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker: reader.Close() error is silently discarded with _ =, but the established project pattern (see state/runconfig.go:93-95, statuses/file_status.go:99-101) is to log it. A file-backed reader can return flush/OS errors on Close() that are worth observing.

Suggested change
defer func() { _ = reader.Close() }()
defer func() {
if err := reader.Close(); err != nil {
slog.Warn("failed to close run config reader", "workload", name, "error", err)
}
}()

return runConfig, nil
reader, err := store.GetReader(ctx, name)
if err != nil {
return nil, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker: TOCTOU between store.Exists (line 27) and store.GetReader here. If a workload is deleted between the two calls (e.g. a concurrent thv delete), GetReader returns an error — but the old code via state.LoadRunConfig handled this case gracefully by returning an empty config.

The PR description says "semantics are identical", but that only holds in the absence of concurrent writes. The fix is to treat a not-found error from GetReader the same as !exists:

Suggested change
return nil, err
if state.IsNotFound(err) {
return &minimalRunConfig{}, nil
}
return nil, err

(adjust the not-found check to match the actual sentinel/error type the store returns)

if errors.Is(err, werr.ErrRunConfigNotFound) {
return &minimalRunConfig{}, nil
}
return nil, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (follow-up OK): both store.Exists and store.GetReader errors are returned without wrapping the workload name, making it hard to tell which workload triggered a ListWorkloads failure. Consider:

Suggested change
return nil, err
return nil, fmt.Errorf("failed to check run config for workload %q: %w", name, err)

Same pattern applies to the GetReader error a few lines below.


mockRuntime := rtmocks.NewMockRuntime(ctrl)
manager := NewStatusManagerFromRuntime(mockRuntime)
manager, err := NewStatusManagerFromRuntime(mockRuntime)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (follow-up OK): NewStatusManagerFromRuntime internally calls state.NewRunConfigStoreos.MkdirAll on the real XDG state directory. Every other test in this file injects a stateMocks.NewMockStore. This one is inconsistent and could flake in sandboxed CI environments.

The simplest fix is t.Setenv("XDG_STATE_HOME", t.TempDir()) before the call, or restructure NewStatusManagerFromRuntime to accept the store as a parameter (following the DI pattern this PR establishes).

…elds

The Exists+GetReader sequence had a race where a concurrent workload
deletion between the two calls would return an error instead of the
intended empty config. Remove the Exists call entirely and handle
not-found from GetReader directly via httperr status code.

Also log reader.Close() errors with slog.Warn to match the established
project pattern, and wrap errors with workload name for debuggability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Mar 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/S Small PR: 100-299 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: WorkloadFromContainerInfo hits real XDG filesystem

2 participants