@adityamaru adityamaru commented Nov 15, 2025

Problem

The step-checker feature was causing warnings and preventing sticky disk commits in container jobs:

Warning: Unable to check for previous step failures: _diag directory not found at /__w/web/web/_diag
Warning: Skipping sticky disk commit due to ambiguity in failure detection

Root Cause:

  • Container jobs mount the workspace at /__w/ inside the container
  • The _diag directory exists on the host at /home/runner/_diag but is not mounted into containers
  • The step-checker tried to access _diag and failed, returning an error
  • This prevented sticky disk commits from happening
  • Modern kernels use cgroup v2, where /proc/1/cgroup inside the container reads "0::/" and contains no docker/containerd segment, so a cgroup-only check cannot detect the container
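
To illustrate the cgroup point, here is a minimal sketch of a substring check against /proc/1/cgroup content. The helper name cgroupMentionsContainer and the v1 sample line are assumptions made for illustration; only the "0::/" value comes from the verification in this PR.

```typescript
// Hypothetical helper (not from the PR): true when the cgroup text
// names docker or containerd anywhere in its paths.
function cgroupMentionsContainer(cgroupText: string): boolean {
  return /docker|containerd/.test(cgroupText);
}

// Illustrative cgroup v1 style line; the container ID is made up.
const v1Sample = "12:memory:/docker/abc123";

// cgroup v2 content as reported in this PR's container verification.
const v2Sample = "0::/";

console.log(cgroupMentionsContainer(v1Sample)); // marker present
console.log(cgroupMentionsContainer(v2Sample)); // no marker to match
```

With the v2 sample, the check has nothing to match, which is why a second and third signal are needed.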

Impact:

  • Affecting ~300-400 runs per day since November 1st, 2025 when step-checker was deployed
  • Only affects container jobs (jobs with container: in workflow config)

Solution

Add multi-layered container detection that works with both old and new systems:

  1. Check for /.dockerenv file (docker-specific indicator)
  2. Check /proc/1/cgroup for docker/containerd (cgroup v1)
  3. Check if working directory starts with /__w/ (GitHub Actions container mount - works with cgroup v2)

When any of these conditions is met, the step-checker is skipped gracefully, since _diag is not accessible from inside the container.

No changes to existing path detection logic - only adds the container check at the beginning.
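
The three checks above could be sketched roughly as follows. This is not the code from the PR: the names ContainerProbes, defaultProbes, and isRunningInContainer are hypothetical, and the probes are injected only so the logic can be exercised without a real container.

```typescript
import * as fs from "fs";

// Hypothetical seam for the three signals; not part of the PR's API.
interface ContainerProbes {
  fileExists: (path: string) => boolean;
  readText: (path: string) => string;
  cwd: () => string;
}

// Default wiring to the real filesystem and process.
const defaultProbes: ContainerProbes = {
  fileExists: (p) => fs.existsSync(p),
  readText: (p) => fs.readFileSync(p, "utf8"),
  cwd: () => process.cwd(),
};

function isRunningInContainer(
  probes: ContainerProbes = defaultProbes,
): boolean {
  // 1. Docker-specific marker file.
  if (probes.fileExists("/.dockerenv")) {
    return true;
  }

  // 2. cgroup v1: the path names docker or containerd.
  try {
    const cgroup = probes.readText("/proc/1/cgroup");
    if (cgroup.includes("docker") || cgroup.includes("containerd")) {
      return true;
    }
  } catch {
    // File missing or unreadable: fall through to the next signal.
  }

  // 3. GitHub Actions container mount; also works under cgroup v2.
  return probes.cwd().startsWith("/__w/");
}
```

Ordering the checks from most specific to most general means the working-directory prefix only decides the outcome when the other two signals are absent.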

Testing

  • ✅ All tests pass with updated mocks
  • ✅ Build successful with no linter errors
  • ✅ Verified detection works in actual container (pwd starts with /__w/ ✓)
  • ✅ Container jobs will now skip step-checker and allow sticky disk commits
  • ✅ Non-container jobs continue to work exactly as before

Verification in VM

Tested in actual container environment:

✓ Has /.dockerenv
✗ Not in cgroup (cgroup v2: "0::/")
✓ PWD starts with /__w/
✓ CONTAINER DETECTED - step-checker will be skipped

Related

Based on ClickHouse analysis, this has been affecting the useblacksmith/web repo's container verification jobs since the step-checker feature was deployed on November 1st.


Note

Adds container detection (/.dockerenv, /proc/1/cgroup, cwd prefix) to skip step-checker when _diag isn’t accessible, and updates tests accordingly.

  • Step-checker:
    • Add container detection using /.dockerenv, /proc/1/cgroup (docker/containerd), and process.cwd().startsWith('/__w/').
    • When detected, skip checking _diag and return no failures; add debug logs for detected paths.
    • Preserve existing runner root detection and log parsing logic.
  • Tests:
    • Update src/step-checker.test.ts to mock container/non-container signals and verify behaviors (no _diag, no logs, JSON/text parsing, error handling, boolean check).
  • Build:
    • Regenerate dist/index.js and source map.
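
The skip behavior summarized above (return no failures when a container is detected, otherwise fall through to the existing _diag logic) might look like this sketch. StepCheckResult, checkPreviousStepFailures, and the scanDiag callback are hypothetical names standing in for the real implementation, which is not shown in this PR description.

```typescript
// Hypothetical result shape; the real step-checker may differ.
interface StepCheckResult {
  hasFailures: boolean;
  skipped: boolean;
}

function checkPreviousStepFailures(
  inContainer: boolean,
  scanDiag: () => boolean, // stands in for the existing _diag log parsing
  debug: (msg: string) => void = () => {},
): StepCheckResult {
  if (inContainer) {
    // _diag lives on the host and is not mounted into the container,
    // so report no failures instead of raising an error.
    debug("Container detected; skipping step-checker");
    return { hasFailures: false, skipped: true };
  }
  // Non-container jobs keep the original behavior.
  return { hasFailures: scanDiag(), skipped: false };
}
```

Returning a normal result rather than throwing is what lets the sticky disk commit proceed in container jobs.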

Written by Cursor Bugbot for commit e1196de.

The step-checker was causing warnings in container jobs because the _diag
directory exists on the host but is not mounted into containers.

Changes:
- Check for /.dockerenv file (docker-specific indicator)
- Check cgroup for container indicators (works with cgroup v1)
- Check if working directory starts with /__w/ (GitHub Actions container mount)
- Skip step-checker gracefully when any of these conditions are met

This handles both cgroup v1 and v2 formats and allows sticky disk commits
to proceed normally for container jobs.

Affects container jobs only - regular jobs continue to work as before.
@adityamaru adityamaru force-pushed the fix/step-checker-container-support branch from f89efcf to e1196de Compare November 15, 2025 22:29
@adityamaru adityamaru merged commit 4023ba4 into main Nov 17, 2025
10 checks passed