Automated Matrix CI Failure Reporting for Nightly Builds #71

@jbrinkman

Summary

Our nightly CI runs execute a comprehensive matrix of tests across multiple .NET versions, server configurations, and host environments. When these scheduled runs fail, it's currently difficult to quickly identify and track recurring issues across the matrix dimensions. We need an automated system to aggregate failures from all matrix jobs and create/update a single tracking issue for better visibility and follow-up.

Why This is Needed

  • Scattered failure information: Matrix failures are spread across multiple job logs, making it hard to get a unified view
  • Manual tracking overhead: Currently requires manual review of each failed matrix combination
  • Lost context: Intermittent failures may not be noticed or tracked consistently
  • Nightly-specific focus: We want to track systematic issues from full matrix runs without noise from PR/push failures

High-Level Approach

  1. Separate reporter workflow: Create a workflow_run-triggered workflow that activates only after scheduled runs of the "C# tests" workflow
  2. Artifact-based aggregation: Collect structured failure data (JSON/TRX) from all matrix job artifacts
  3. Single rolling issue: Maintain one open issue per repository with a dedicated label (ci-failure) that gets updated with new failures
  4. Smart deduplication: Avoid duplicate comments for the same run and intelligently merge failure information

Key Requirements

Triggering and Scope

  • Activate for schedule-triggered runs of the "C# tests" workflow (not PR or push events)
  • Activate for workflow_dispatch-triggered runs where the full-matrix input is true
  • Only run when the triggering workflow has conclusion == 'failure'
  • Use the workflow_run event to trigger after the main test workflow completes
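
The gating rules above can be sketched as a small predicate. This is an illustration only, written in Python rather than as workflow YAML. It assumes the `workflow_run` payload exposes `name`, `event`, and `conclusion`; the `full_matrix` argument is hypothetical, since the payload may not carry the dispatch inputs and the flag might have to be recovered another way (for example from an artifact written by the test workflow).

```python
from typing import Any, Mapping

# Name of the workflow whose runs we report on (taken from this issue).
TARGET_WORKFLOW = "C# tests"


def should_report(workflow_run: Mapping[str, Any], full_matrix: bool = False) -> bool:
    """Return True when a completed run qualifies for failure reporting.

    `full_matrix` is a hypothetical flag: the caller is assumed to know
    whether a manually dispatched run had the full-matrix input enabled.
    """
    if workflow_run.get("name") != TARGET_WORKFLOW:
        return False
    if workflow_run.get("conclusion") != "failure":
        return False

    event = workflow_run.get("event")
    if event == "schedule":
        return True
    if event == "workflow_dispatch":
        return full_matrix
    # PR, push, and any other trigger are out of scope.
    return False
```

Keeping the decision in one function makes the trigger rules easy to unit test separately from the workflow itself.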

Data Collection and Aggregation

  • Download artifacts from all matrix jobs in the failed run
  • Parse test results from TRX files or structured JSON artifacts
  • Aggregate failures across all matrix dimensions (dotnet version, server type/version, host OS/arch)
  • Include matrix context (framework, server, host) for each failure
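
As a rough sketch of the parsing step, the following Python function extracts failed tests from one TRX document and attaches matrix context to each. It assumes the standard TRX namespace and the `UnitTestResult` / `Output` / `ErrorInfo` layout; the `Failure` record and the way `framework`, `server`, and `host` are passed in are hypothetical and would depend on how artifacts are named.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# Namespace used by Visual Studio TRX result files.
TRX_NS = {"t": "http://microsoft.com/schemas/VisualStudio/TeamTest/2010"}


@dataclass
class Failure:
    """One failed test plus the matrix combination it came from (hypothetical shape)."""
    test_name: str
    message: str
    stack_trace: str
    framework: str
    server: str
    host: str


def parse_trx_failures(trx_xml: str, framework: str, server: str, host: str) -> List[Failure]:
    """Return the failed tests in one TRX document; malformed XML yields an empty list."""
    try:
        root = ET.fromstring(trx_xml)
    except ET.ParseError:
        return []  # treat a malformed artifact as "no data" instead of crashing

    failures = []
    for result in root.iterfind(".//t:UnitTestResult", TRX_NS):
        if result.get("outcome") != "Failed":
            continue
        info = result.find("t:Output/t:ErrorInfo", TRX_NS)
        message = stack = ""
        if info is not None:
            message = info.findtext("t:Message", default="", namespaces=TRX_NS)
            stack = info.findtext("t:StackTrace", default="", namespaces=TRX_NS)
        failures.append(
            Failure(result.get("testName", ""), message, stack, framework, server, host)
        )
    return failures
```

A real reporter would probably also record that an artifact was missing or unreadable, so that a silent parse failure is not mistaken for a clean job.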

Issue Management

  • Search for an existing open issue with the ci-failure label
  • If no open issue exists, create a new one with aggregated failure details
  • If an open issue exists, add a comment with the new failure information
  • Include direct links to the failed workflow run and individual job logs
  • Prevent duplicate comments for the same workflow run ID
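
One possible way to implement the create/comment/skip decision is to embed a hidden marker containing the run ID in every body the reporter posts, then look for it before posting again. The sketch below is in Python for illustration; the marker format and the function names are assumptions, not something specified in this issue.

```python
import re
from typing import Iterable, Optional, Set

# Hypothetical hidden marker embedded in every issue body/comment we post.
MARKER_PATTERN = re.compile(r"<!-- ci-failure-run:(\d+) -->")


def run_marker(run_id: int) -> str:
    """Build the HTML comment that tags a body with its workflow run ID."""
    return f"<!-- ci-failure-run:{run_id} -->"


def reported_run_ids(bodies: Iterable[str]) -> Set[int]:
    """Collect every run ID already mentioned in the given bodies."""
    ids = set()
    for body in bodies:
        ids.update(int(match) for match in MARKER_PATTERN.findall(body))
    return ids


def plan_action(existing_bodies: Optional[Iterable[str]], run_id: int) -> str:
    """Decide what to do for a failed run.

    `existing_bodies` is None when no open ci-failure issue exists; otherwise
    it holds the issue body plus all comment bodies of the open issue.
    """
    if existing_bodies is None:
        return "create"
    if run_id in reported_run_ids(existing_bodies):
        return "skip"  # this run was already reported
    return "comment"
```

Because the marker is an HTML comment, it stays invisible in the rendered issue while still being searchable through the API.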

Failure Reporting Format

  • Clear summary at the top: total failed tests, affected matrix combinations
  • Organized sections per matrix combination with failed test details
  • Collapsible sections for readability when there are many failures
  • Include error messages and relevant context for each failed test
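
The format described above might be rendered along these lines. This Python sketch assumes each failure is a dict with hypothetical keys `framework`, `server`, `host`, `test`, and `message`; the exact wording of the summary line is illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def render_report(failures: List[Dict[str, str]], run_url: str) -> str:
    """Render a Markdown report: summary first, one collapsible section per combination."""
    groups = defaultdict(list)
    for failure in failures:
        key = (failure["framework"], failure["server"], failure["host"])
        groups[key].append(failure)

    lines = [
        f"**{len(failures)} failed test(s)** across "
        f"**{len(groups)} matrix combination(s)**",
        f"Workflow run: {run_url}",
        "",
    ]
    for (framework, server, host), items in sorted(groups.items()):
        # <details>/<summary> collapses the section in GitHub's Markdown rendering.
        lines.append("<details>")
        lines.append(
            f"<summary>{framework} / {server} / {host} ({len(items)} failed)</summary>"
        )
        lines.append("")
        for item in items:
            lines.append(f"- `{item['test']}`: {item['message']}")
        lines.append("")
        lines.append("</details>")
    return "\n".join(lines)
```

Sorting the combinations keeps the output stable between runs, which makes successive comments easier to compare.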

Acceptance Criteria

Core Functionality

  • Reporter workflow triggers only for failed "C# tests" runs that were scheduled or dispatched with the full-matrix input set to true
  • Successfully aggregates failures from all matrix job artifacts
  • Creates a single issue when no open ci-failure issue exists
  • Adds a comment to the existing open ci-failure issue for new failures
  • Includes the workflow run ID and prevents duplicate comments for the same run

Data Quality

  • Reports include matrix context (dotnet version, server, host) for each failure
  • Failed test names, error messages, and stack traces are captured accurately
  • Direct links to workflow run and job logs are included
  • Summary statistics (total failures, affected combinations) are accurate

Reliability and Security

  • Uses least-privilege permissions (actions: read, contents: read, issues: write)
  • Handles cases where artifacts are missing or malformed gracefully
  • Respects GitHub comment size limits (chunk or summarize if needed)
  • Works correctly in the valkey-io organization context
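
For the comment size requirement, one simple option is to split an oversized report on line boundaries and post the pieces as consecutive comments. The Python sketch below assumes a limit of 65536 characters per comment body; that figure should be verified against GitHub's current behavior before relying on it.

```python
from typing import List

# Assumed maximum comment body length; verify before relying on it.
MAX_COMMENT_CHARS = 65536


def chunk_report(body: str, limit: int = MAX_COMMENT_CHARS) -> List[str]:
    """Split `body` into pieces of at most `limit` characters, preferring line breaks."""
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for line in body.splitlines(keepends=True):
        # A single line longer than the limit has to be cut mid-line.
        while len(line) > limit:
            if current:
                chunks.append("".join(current))
                current, size = [], 0
            chunks.append(line[:limit])
            line = line[limit:]
        # Start a new chunk when this line would overflow the current one.
        if size + len(line) > limit:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks
```

Note that a split can land inside a collapsible section, so a production version would likely chunk per matrix combination or fall back to a summary when the report is too large.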

Testing and Validation

  • Dry-run capability for testing without creating real issues
  • Tested scenarios: new issue creation, existing issue commenting, duplicate run detection
  • Verified on actual failed scheduled runs before merging

Implementation Notes

  • Reference implementation guide available in .vs/error-workflow.md
  • Will require minor updates to existing tests.yml to ensure TRX output and structured failure artifacts
  • Consider using actions/github-script for GitHub API interactions and artifact processing
  • May need dawidd6/action-download-artifact or similar to fetch cross-workflow artifacts

Related Files

  • .github/workflows/tests.yml (main test workflow)
  • .github/workflows/report-failures.yml (new reporter workflow to be created)
  • .vs/error-workflow.md (detailed implementation checklist)
