Summary
Our nightly CI runs execute a comprehensive matrix of tests across multiple .NET versions, server configurations, and host environments. When these scheduled runs fail, it's currently difficult to quickly identify and track recurring issues across the matrix dimensions. We need an automated system to aggregate failures from all matrix jobs and create/update a single tracking issue for better visibility and follow-up.
Why This is Needed
- Scattered failure information: Matrix failures are spread across multiple job logs, making it hard to get a unified view
- Manual tracking overhead: Currently requires manual review of each failed matrix combination
- Lost context: Intermittent failures may not be noticed or tracked consistently
- Nightly-specific focus: We want to track systematic issues from full matrix runs without noise from PR/push failures
High-Level Approach
- Separate reporter workflow: Create a `workflow_run`-triggered workflow that activates only after scheduled runs of the "C# tests" workflow
- Artifact-based aggregation: Collect structured failure data (JSON/TRX) from all matrix job artifacts
- Single rolling issue: Maintain one open issue per repository with a dedicated label (`ci-failure`) that gets updated with new failures
- Smart deduplication: Avoid duplicate comments for the same run and intelligently merge failure information
Key Requirements
Triggering and Scope
- Activate for `schedule`-triggered runs of the "C# tests" workflow (not PR or push events)
- Activate for `workflow_dispatch`-triggered runs where the full-matrix input is true
- Only run when the triggering workflow has `conclusion == 'failure'`
- Use the `workflow_run` event to trigger after the main test workflow completes
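The triggering rules above can be sketched as a small predicate over the `workflow_run` event payload. This is a minimal illustration, not the reporter's actual implementation: `should_report` and the `full_matrix` parameter are hypothetical names, and the flag is passed in separately because this sketch assumes the dispatch input cannot be read from the payload itself.

```python
from typing import Any, Mapping

# Name of the workflow whose runs the reporter follows (taken from the issue text).
TARGET_WORKFLOW = "C# tests"


def should_report(event: Mapping[str, Any], full_matrix: bool = False) -> bool:
    """Return True when a workflow_run event should produce a failure report.

    `event` is the parsed workflow_run payload. `full_matrix` is a hypothetical
    flag that the caller resolves for workflow_dispatch runs.
    """
    run = event.get("workflow_run", {})

    # Only follow the main test workflow.
    if run.get("name") != TARGET_WORKFLOW:
        return False

    # Only failed runs are reported.
    if run.get("conclusion") != "failure":
        return False

    trigger = run.get("event")
    if trigger == "schedule":
        return True

    # Manual runs count only when the full matrix was requested.
    return trigger == "workflow_dispatch" and full_matrix
```

Keeping the decision in one pure function makes the PR/push exclusion easy to unit test before wiring it into the workflow.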
Data Collection and Aggregation
- Download artifacts from all matrix jobs in the failed run
- Parse test results from TRX files or structured JSON artifacts
- Aggregate failures across all matrix dimensions (dotnet version, server type/version, host OS/arch)
- Include matrix context (framework, server, host) for each failure
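As a rough sketch of the parsing step, the snippet below extracts failed tests from a TRX document with Python's standard library and attaches the matrix context to each one. The `Failure` dataclass and `parse_trx_failures` are hypothetical names, and how the framework, server, and host values are recovered for each artifact (for example from the artifact name) is left as an assumption.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# XML namespace used by Visual Studio TRX result files.
TRX_NS = {"t": "http://microsoft.com/schemas/VisualStudio/TeamTest/2010"}


@dataclass
class Failure:
    test_name: str
    message: str
    stack_trace: str
    framework: str
    server: str
    host: str


def parse_trx_failures(trx_xml: str, framework: str, server: str, host: str) -> list[Failure]:
    """Return one Failure per failed UnitTestResult, tagged with matrix context.

    Malformed XML yields an empty list so one bad artifact does not abort
    aggregation for the rest of the matrix.
    """
    try:
        root = ET.fromstring(trx_xml)
    except ET.ParseError:
        return []

    failures = []
    for result in root.iterfind(".//t:UnitTestResult", TRX_NS):
        if result.get("outcome") != "Failed":
            continue
        message = result.findtext("t:Output/t:ErrorInfo/t:Message", default="", namespaces=TRX_NS)
        stack = result.findtext("t:Output/t:ErrorInfo/t:StackTrace", default="", namespaces=TRX_NS)
        failures.append(
            Failure(
                test_name=result.get("testName", "<unknown>"),
                message=message.strip(),
                stack_trace=stack.strip(),
                framework=framework,
                server=server,
                host=host,
            )
        )
    return failures
```

Calling this once per downloaded artifact and concatenating the lists gives the aggregated view across all matrix dimensions.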
Issue Management
- Search for existing open issues with the `ci-failure` label
- If no open issue exists, create a new one with aggregated failure details
- If open issue exists, add a comment with new failure information
- Include direct links to the failed workflow run and individual job logs
- Prevent duplicate comments for the same workflow run ID
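One way to express the create/comment/skip decision is a pure function over data already fetched from the GitHub API. The sketch below is illustrative only: `plan_action`, `run_marker`, and the hidden HTML comment marker format are assumptions, not something the issue prescribes, and issues are modeled as plain dictionaries with a `number` and a list of label names.

```python
from typing import Optional, Sequence

LABEL = "ci-failure"


def run_marker(run_id: int) -> str:
    """Hidden marker embedded in every report so a run can be recognized later.

    The marker format is a hypothetical convention for this sketch.
    """
    return f"<!-- ci-failure-run:{run_id} -->"


def plan_action(
    open_issues: Sequence[dict],
    existing_bodies: Sequence[str],
    run_id: int,
) -> tuple[str, Optional[int]]:
    """Decide what to do for a failed run.

    Returns ("create", None), ("comment", issue_number) or ("skip", issue_number).
    `existing_bodies` should hold the tracking issue body plus its comment bodies.
    """
    tracking = [issue for issue in open_issues if LABEL in issue.get("labels", [])]
    if not tracking:
        return ("create", None)

    # If several labeled issues are open, reuse the oldest one.
    issue_number = min(issue["number"] for issue in tracking)

    marker = run_marker(run_id)
    if any(marker in body for body in existing_bodies):
        return ("skip", issue_number)
    return ("comment", issue_number)
```

The issue body is included in the duplicate check because the first report for a run may have been written when the issue was created rather than as a comment.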
Failure Reporting Format
- Clear summary at the top: total failed tests, affected matrix combinations
- Organized sections per matrix combination with failed test details
- Collapsible sections for readability when there are many failures
- Include error messages and relevant context for each failed test
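The format described above could be rendered along these lines. This is a sketch under assumptions: `render_report` is a hypothetical helper, failures are plain dictionaries with `framework`, `server`, `host`, `test`, and `message` keys, and HTML `<details>` blocks are used for the collapsible sections.

```python
from collections import defaultdict


def render_report(failures: list[dict], run_url: str) -> str:
    """Render a Markdown report: a summary line, then one collapsible
    section per matrix combination listing its failed tests."""
    by_combo: dict[tuple, list[dict]] = defaultdict(list)
    for failure in failures:
        key = (failure["framework"], failure["server"], failure["host"])
        by_combo[key].append(failure)

    lines = [
        f"**{len(failures)} failed test(s) across {len(by_combo)} matrix combination(s)**",
        f"Run: {run_url}",
        "",
    ]
    for (framework, server, host), items in sorted(by_combo.items()):
        title = f"{framework} / {server} / {host} ({len(items)} failed)"
        lines.append(f"<details><summary>{title}</summary>")
        lines.append("")  # blank line so Markdown inside <details> is rendered
        for item in items:
            lines.append(f"- `{item['test']}`: {item['message']}")
        lines.append("")
        lines.append("</details>")
    return "\n".join(lines)
```

Sorting the combinations keeps the section order stable between runs, which makes successive comments easier to compare.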
Acceptance Criteria
Core Functionality
- Reporter workflow only triggers for scheduled "C# tests" runs that fail
- Successfully aggregates failures from all matrix job artifacts
- Creates a single issue when no open `ci-failure` issue exists
- Adds comments to existing open `ci-failure` issue for new failures
- Includes workflow run ID and prevents duplicate comments for the same run
Data Quality
- Reports include matrix context (dotnet version, server, host) for each failure
- Failed test names, error messages, and stack traces are captured accurately
- Direct links to workflow run and job logs are included
- Summary statistics (total failures, affected combinations) are accurate
Reliability and Security
- Uses least-privilege permissions (`actions: read`, `contents: read`, `issues: write`)
- Handles cases where artifacts are missing or malformed gracefully
- Respects GitHub comment size limits (chunk or summarize if needed)
- Works correctly in the valkey-io organization context
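For the comment size requirement, a line-based splitter is one possible approach. The sketch below assumes a limit of 65,536 characters per comment body, which should be verified against current GitHub behavior; `chunk_comment` and `MAX_COMMENT_CHARS` are hypothetical names.

```python
# Assumed maximum length of a single comment body; verify before relying on it.
MAX_COMMENT_CHARS = 65_536


def chunk_comment(body: str, limit: int = MAX_COMMENT_CHARS) -> list[str]:
    """Split `body` into pieces of at most `limit` characters.

    Splits happen at line boundaries where possible; a single line longer
    than `limit` is cut hard. Concatenating the pieces restores `body`.
    """
    chunks: list[str] = []
    current = ""
    for line in body.splitlines(keepends=True):
        # A line that cannot fit in any chunk is cut into limit-sized pieces.
        while len(line) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:limit])
            line = line[limit:]
        # Start a new chunk when adding this line would overflow the current one.
        if len(current) + len(line) > limit:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks
```

Note that a split can land inside a collapsible section, so a real reporter would probably chunk per matrix combination or fall back to a summary instead.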
Testing and Validation
- Dry-run capability for testing without creating real issues
- Tested scenarios: new issue creation, existing issue commenting, duplicate run detection
- Verified on actual failed scheduled runs before merging
Implementation Notes
- Reference implementation guide available in `.vs/error-workflow.md`
- Will require minor updates to existing `tests.yml` to ensure TRX output and structured failure artifacts
- Consider using `actions/github-script` for GitHub API interactions and artifact processing
- May need `dawidd6/action-download-artifact` or similar to fetch cross-workflow artifacts
Related Files
- `.github/workflows/tests.yml` (main test workflow)
- `.github/workflows/report-failures.yml` (new reporter workflow to be created)
- `.vs/error-workflow.md` (detailed implementation checklist)