
ROX-34387: Initial work on CI triage step #2

Merged
mtodor merged 1 commit into main from mtodor/ROX-34387-add-CI-failure-triage
May 5, 2026

Conversation

mtodor (Collaborator) commented May 4, 2026

Description

Extends the acs-triage workflow to perform deep CI failure root cause analysis for issues classified as CI_FAILURE. When a CI failure is triaged, the workflow now automatically reads the stackrox-ci-failure-investigator agent methodology from the cloned stackrox repository and performs a thorough investigation (4-5 minutes per issue), producing root cause, failure category, affected components, risk assessment, and a proposed fix.

The deep analysis is informational only — it enriches JIRA triage comments and reports but does not influence team assignment. To accommodate the deeper investigation, the workflow timeout has been increased from 300s to 1800s and the max issues per run reduced from 20 to 5.
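The new limits can be sanity-checked with simple arithmetic. The sketch below assumes issues are analyzed one after another and uses the upper end of the 4-5 minute estimate; the constant names are illustrative and are not taken from reference/constants.md.

```python
# Illustrative names only; the real constants live in reference/constants.md.
WORKFLOW_TIMEOUT_S = 1800      # raised from 300
MAX_ISSUES_PER_RUN = 5         # lowered from 20
DEEP_ANALYSIS_MAX_S = 5 * 60   # upper end of the 4-5 minute estimate


def worst_case_analysis_seconds(issues: int, per_issue_s: int) -> int:
    """Worst-case time spent on deep analysis, assuming sequential processing."""
    return issues * per_issue_s


worst_case = worst_case_analysis_seconds(MAX_ISSUES_PER_RUN, DEEP_ANALYSIS_MAX_S)
headroom = WORKFLOW_TIMEOUT_S - worst_case
print(worst_case, headroom)  # 1500 300
```

Under these assumptions, five CI failures fit into the 1800s timeout with about 300s left for the remaining phases. With the old 300s timeout, a single 5-minute investigation would have used the entire budget.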

Changes

  • reference/constants.md — Timeout 300→1800s, max issues 20→5, added deep analysis constants and failure category enum
  • .claude/commands/triage.md — Phase 1a verifies investigator agent availability; Phase 4a restructured into Stage 1 (shallow pattern matching) + Stage 2 (deep root cause analysis); Phase 5 unchanged
  • templates/jira-comment.md — Added conditional "CI Failure Root Cause Analysis" section with example
  • templates/triage-report.md — Extended CI Failures section with deep analysis fields
  • .ambient/ambient.json — Updated config (timeout, maxIssues) and systemPrompt
  • CLAUDE.md — Documented deep analysis feature, updated constraints, added external dependency
  • FIELD_REFERENCE.md — Added ci_analysis.deep_analysis.* field definitions
  • .claude/commands/comment-issues.md — Updated CI failure comment example with deep analysis
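As a rough illustration of how the deep analysis output and the conditional comment section fit together, here is a minimal Python sketch. The workflow itself is driven by Markdown command files and templates, not Python. The field names mirror the outputs listed in the description (root cause, failure category, affected components, risk assessment, proposed fix), but the exact ci_analysis.deep_analysis.* field names, the enum values, and the comment layout are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class FailureCategory(Enum):
    # Placeholder values; the real enum is defined in reference/constants.md.
    INFRASTRUCTURE = "infrastructure"
    FLAKY_TEST = "flaky_test"
    PRODUCT_BUG = "product_bug"


@dataclass
class DeepAnalysis:
    # Hypothetical stand-in for the ci_analysis.deep_analysis.* fields.
    root_cause: str
    failure_category: FailureCategory
    affected_components: List[str]
    risk_assessment: str
    proposed_fix: str


def render_comment(summary: str, deep: Optional[DeepAnalysis]) -> str:
    """Build a triage comment; the analysis section is added only when present."""
    lines = [summary]
    if deep is not None:
        lines += [
            "",
            "CI Failure Root Cause Analysis",
            f"Root cause: {deep.root_cause}",
            f"Category: {deep.failure_category.value}",
            f"Affected components: {', '.join(deep.affected_components)}",
            f"Risk: {deep.risk_assessment}",
            f"Proposed fix: {deep.proposed_fix}",
        ]
    return "\n".join(lines)


# Made-up example data, for illustration only.
example = DeepAnalysis(
    root_cause="Test timed out waiting for a pod to become ready",
    failure_category=FailureCategory.FLAKY_TEST,
    affected_components=["example-component"],
    risk_assessment="low",
    proposed_fix="Increase the readiness wait in the test",
)
print(render_comment("Classified as CI_FAILURE.", example))
```

Note that nothing in the sketch feeds the analysis back into team assignment, which matches the "informational only" constraint described above.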

Validation

  • Test in Ambient: use Custom Workflow to test /triage against a known CI_FAILURE issue

AI-assisted development prompts

This PR was developed with AI assistance. Below are the prompts used:

Prompt: "We have currently workflow this defined in 'acs-triage'. You can read what that workflow does and how it works to understand it better. What I would like to add is actually if we have identified the ticket type is a CI failure, to also triage that failure and add in a comment also analysis of a failure. You can find some information about triaging or analysis of a failure in acs-ci-triage workflow. So what I would like to achieve is to actually extend a acs-triage workflow to support such capability."
Response: Explored both workflows in parallel, understood acs-triage's 7-phase pipeline and acs-ci-triage's deep analysis using the stackrox CI failure investigator agent. Designed an integration plan extending Phase 4a with deep root cause analysis.

Prompt: "Read agent file at runtime" (clarification on methodology approach)
Response: Changed the design from embedding a distilled methodology to reading the investigator agent file at runtime from the cloned stackrox repo at /tmp/triage/stackrox/.claude/agents/stackrox-ci-failure-investigator.md.
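The runtime lookup described in this response, together with the Phase 1a availability check, could look roughly like the sketch below. It is written in Python purely for illustration, since the actual workflow is defined in Markdown command files. The path is the one quoted above; returning None when the file is missing or empty is an assumption, as the PR does not say how the workflow handles that case.

```python
from pathlib import Path
from typing import Optional

# Path quoted in the PR conversation.
AGENT_FILE = Path(
    "/tmp/triage/stackrox/.claude/agents/stackrox-ci-failure-investigator.md"
)


def load_investigator_methodology(path: Path = AGENT_FILE) -> Optional[str]:
    """Return the agent file's text, or None if it is missing or empty.

    Returning None (rather than raising) is an assumption of this sketch;
    it lets a caller skip deep analysis and keep the shallow result.
    """
    if not path.is_file():
        return None
    text = path.read_text(encoding="utf-8")
    return text if text.strip() else None


methodology = load_investigator_methodology()
if methodology is None:
    print("Investigator agent not available; deep analysis skipped.")
else:
    print(f"Loaded methodology ({len(methodology)} characters).")
```

Reading the file on every run means the workflow picks up whatever version of the methodology is in the cloned stackrox repository, at the cost of an external dependency that has to be verified up front.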

Prompt: "Always-on, adjust limits" (clarification on deep analysis timing)
Response: Made deep CI failure analysis always-on for CI_FAILURE issues, with timeout increased and max issues reduced to accommodate 4-5 minutes per issue.

Prompt: "We need to change plan a bit: 1c - do not assign team based on failure analysis. 1e - give more time to CI failure analysis 4-5 mins should be ok. If we need to reduce something to avoid long running, let's limit number of issues processed to 5 or so."
Response: Updated the plan: deep analysis is informational only (no influence on Phase 5 team assignment), time budget increased to 4-5 minutes per issue, max issues reduced to 5, timeout increased to 1800s.

@janisz janisz self-requested a review May 5, 2026 15:33
@mtodor mtodor marked this pull request as ready for review May 5, 2026 16:22
@mtodor mtodor merged commit 2700f59 into main May 5, 2026
1 check passed