fix(infra): grafana 12 expression migration + alert message improvements#3530

Merged: sunghyun1000 merged 4 commits into main from fix/alert-expressions on Apr 27, 2026

Conversation

@manamana32321 (Member) commented Apr 5, 2026

Description

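Grafana 12.x removed the classic_condition expression type, which left every alert rule firing in an error state. This PR fixes that and improves the quality of the alert messages.

Changes:

  1. Expression type migration (required fix)

    • Replace the condition of every alert rule with the reduce (refId B) + threshold (refId C) pattern
  2. Differentiated notification policies (spam prevention)

    | severity          | repeat_interval | intent                                                |
    |-------------------|-----------------|-------------------------------------------------------|
    | critical          | 1h              | frequent notifications while an incident is ongoing   |
    | warning (default) | 24h             | once per day while the threshold is exceeded          |
    | error state       | 7d              | prevent spam when the rule itself is broken           |
  3. Korean alert messages + remediation guidance

    • Unify title/summary/description in Korean (uid stays in English)
    • Human-friendly unit conversion (humanize1024 for bytes → human-readable)
    • Add remediation guidance (kubectl top pod -A, df -h, etc.)
  4. for duration adjustments

    • Node Not Ready: 5m → 2m (faster detection of a critical condition)
    • RDS DatabaseConnections: 5m → 10m (absorb traffic fluctuations)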

Additional context

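  • Error message: invalid command type in expression 'C': 'classic_condition' is not a recognized expression type
  • The problem was confirmed through Discord notifications after deploying to the stage environment
  • A spam pattern of 6 rules × N repeats every 4 hours was observed directly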

Before submitting the PR, please make sure you do the following

Grafana 12.x removed classic_condition expression type. Replace all
alert rule conditions with reduce (refId B) + threshold (refId C)
pattern as required by modern Grafana.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
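
For readers unfamiliar with the pattern named in this commit, a minimal sketch of one provisioned rule using reduce (refId B) + threshold (refId C) might look like the following. The uid, title, group and folder names, datasource UID, PromQL query, and the threshold of 80 are illustrative assumptions and are not taken from this PR's values.yaml; any Helm-specific wrapping or template escaping is also left out.

```yaml
# Hypothetical rule in Grafana's file-provisioning format.
# uid, title, datasourceUid, expr, and the threshold are placeholders.
apiVersion: 1
groups:
  - orgId: 1
    name: example-group
    folder: example-folder
    interval: 1m
    rules:
      - uid: example-high-memory
        title: Example high memory usage
        condition: C                  # the rule fires on the result of refId C
        for: 5m
        data:
          - refId: A                  # raw query against the datasource
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: prometheus # assumed datasource UID
            model:
              refId: A
              expr: '100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)'
          - refId: B                  # reduce each series from A to a single number
            datasourceUid: __expr__
            model:
              refId: B
              type: reduce
              expression: A
              reducer: last
          - refId: C                  # compare the reduced value against a threshold
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [80]
```

The split into three refIds keeps the query, the reduction to one value per series, and the comparison as separate steps, which is what replaces the single classic_condition expression.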
@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request migrates Grafana alert configurations from classic_condition to the Unified Alerting engine by introducing reduce and threshold expressions across the production and stage environments. While the structural changes are correct, the alert descriptions in the annotations still reference the raw range query ($values.A) instead of the reduced result ($values.B). This could lead to incorrect or missing values in alert notifications, so it is recommended to update these references to ensure proper data reporting.
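
As a sketch of what this suggestion amounts to, an annotation that reads the reduced value could be written roughly as below. The summary and description wording are hypothetical, and refId B is assumed to be the reduce expression, as in the PR description.

```yaml
# Hypothetical annotations block; the wording is a placeholder.
annotations:
  summary: Example disk usage alert
  # $values.B holds the single value produced by the reduce expression.
  # humanize1024 formats a number with 1024-based prefixes (Ki, Mi, Gi).
  description: 'Current value: {{ humanize1024 $values.B.Value }}'
```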

Comment thread infra/k8s/monitoring/grafana/overlays/production/alerting/values.yaml Outdated
Comment thread infra/k8s/monitoring/grafana/overlays/stage/alerting/values.yaml Outdated
@manamana32321 manamana32321 marked this pull request as ready for review April 26, 2026 16:19
@manamana32321 (Member, Author) commented

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add severity-based notification policy routes:
  * critical: 1h repeat
  * warning (default): 24h repeat
  * error state: 7d repeat (avoid spam from rule errors)
- Translate alert titles, summaries, descriptions to Korean
- Add action guidance to descriptions (commands to investigate)
- Display current value with proper formatting (humanize1024 for bytes)
- Adjust 'for' duration:
  * Node Not Ready: 2m (faster critical detection)
  * RDS DatabaseConnections: 10m (absorb traffic spikes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
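
A rough sketch of how severity-based routes like these could be expressed in a provisioned notification policy follows. The receiver name, the group_by labels, and especially the matcher used for the error-state route are assumptions: the PR's actual matchers are not shown in this conversation, and the sketch relies on Grafana's default behavior of reporting rule evaluation errors under the alert name DatasourceError.

```yaml
# Hypothetical notification policy in Grafana's file-provisioning format.
apiVersion: 1
policies:
  - orgId: 1
    receiver: discord-alerts          # placeholder receiver name
    group_by: ['grafana_folder', 'alertname']
    repeat_interval: 24h              # default route: warning, once per day
    routes:
      # Listed first so that error-state alerts match this route
      # even if they also carry a severity label.
      - receiver: discord-alerts
        object_matchers:
          - ['alertname', '=', 'DatasourceError']  # assumed error-state matcher
        repeat_interval: 7d
      - receiver: discord-alerts
        object_matchers:
          - ['severity', '=', 'critical']
        repeat_interval: 1h
```

Routes are evaluated in order and the first match wins unless continue is set, so the ordering of the two child routes matters.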
@manamana32321 changed the title from "fix(infra): replace classic_condition with reduce+threshold expressions" to "fix(infra): grafana 12 expression migration + alert message improvements" on Apr 26, 2026
@manamana32321 (Member, Author) commented

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

… syntax

- Remove unquoted matcher values per grafana provisioning docs format
- Fix node-not-ready: drop the `== 0` filter from the PromQL and use threshold lt 1.
  The previous query returned series only when value=0 (NotReady), so the
  threshold gt 0 check evaluated to false and the alert never fired

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
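
To make the node-not-ready fix concrete, a sketch of the corrected data section might look like this. The metric name, label selector, time range, and datasource UID are assumptions, since the PR's actual query is not shown here.

```yaml
# Hypothetical data section for a node-not-ready rule.
data:
  - refId: A
    relativeTimeRange:
      from: 300
      to: 0
    datasourceUid: prometheus         # assumed datasource UID
    model:
      refId: A
      # No trailing "== 0": every node returns a series,
      # with value 1 when Ready and 0 when NotReady.
      expr: 'kube_node_status_condition{condition="Ready",status="true"}'
  - refId: B
    datasourceUid: __expr__
    model:
      refId: B
      type: reduce
      expression: A
      reducer: last
  - refId: C
    datasourceUid: __expr__
    model:
      refId: C
      type: threshold
      expression: B
      conditions:
        - evaluator:
            type: lt                  # fire when the reduced value is below 1
            params: [1]
```

With the `== 0` filter in place, the only values that ever reached the threshold were zeros, which can never satisfy gt 0. Comparing the unfiltered value with lt 1 fires exactly when a node reports 0.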
@sunghyun1000 (Member) left a comment

Looks good!

@sunghyun1000 sunghyun1000 added this pull request to the merge queue Apr 27, 2026
Merged via the queue into main with commit 669acb0 Apr 27, 2026
21 checks passed
@sunghyun1000 sunghyun1000 deleted the fix/alert-expressions branch April 27, 2026 02:46