feat(infra): add rabbitmq monitoring with prometheus and grafana dashboard #3508

Merged
manamana32321 merged 3 commits into main from feat/rabbitmq-monitoring on Apr 2, 2026
manamana32321 commented Mar 26, 2026

Description

Adds Prometheus monitoring and a Grafana dashboard for RabbitMQ to both stage and production.

  • RabbitmqCluster: explicitly enable the rabbitmq_prometheus plugin (:15692 endpoint)
  • ServiceMonitor: Prometheus scrapes RabbitMQ directly (15s interval, release: prometheus label)
  • Grafana: provision the RabbitMQ Overview dashboard (#10991) on stage and production
    • explicitly map the datasource input variable to prevent a uid mismatch
  • OTel Collector: remove the redundant prometheus/rabbitmq receiver, now replaced by the ServiceMonitor

Data flow (after):

RabbitMQ (:15692) → ServiceMonitor → Prometheus → Grafana (dashboard #10991)

Additional context

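  • Previously, the OTel Collector scraped RabbitMQ (pull) and re-exported the metrics through its Prometheus exporter, which was an indirect path. This PR switches to direct scraping via a ServiceMonitor, which improves dashboard compatibility and simplifies the architecture.
  • To verify after deployment:
    • RabbitMQ pods restart cleanly after the stage ArgoCD sync
    • The rabbitmq ServiceMonitor appears in the Prometheus targets
    • The "RabbitMQ Overview" dashboard is provisioned automatically in Grafana
    • The dashboard displays metric data correctly
    • The OTel Collector logs contain no rabbitmq-related errors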
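
The Prometheus targets check in the list above could be scripted roughly as follows. This is a minimal sketch, not part of the PR: it assumes the JSON returned by Prometheus's `GET /api/v1/targets` endpoint has already been fetched, and it assumes the scrape pool created for the ServiceMonitor contains the string `rabbitmq`. The function name, pool names, and addresses are hypothetical.

```python
def rabbitmq_target_health(targets_response: dict) -> dict:
    """Map each RabbitMQ scrape URL to its reported health.

    `targets_response` is the decoded JSON body of GET /api/v1/targets.
    Matching on "rabbitmq" in the scrape pool name is an assumption about
    how the ServiceMonitor is named, not something the PR states.
    """
    active = targets_response.get("data", {}).get("activeTargets", [])
    return {
        target["scrapeUrl"]: target["health"]
        for target in active
        if "rabbitmq" in target.get("scrapePool", "")
    }


# Hypothetical response excerpt; pool names and addresses are made up.
sample_response = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "scrapePool": "serviceMonitor/rabbitmq/rabbitmq/0",
                "scrapeUrl": "http://10.0.0.12:15692/metrics",
                "health": "up",
            },
            {
                "scrapePool": "serviceMonitor/monitoring/grafana/0",
                "scrapeUrl": "http://10.0.0.20:3000/metrics",
                "health": "up",
            },
        ]
    },
}

health = rabbitmq_target_health(sample_response)
print(health)
```

An empty result would suggest Prometheus has not picked up the ServiceMonitor (for example, because of a label selector mismatch), while a `down` entry would point at the :15692 endpoint itself.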

🤖 Generated with Claude Code

…board

- enable rabbitmq_prometheus plugin on RabbitmqCluster
- add ServiceMonitor for direct prometheus scraping (15s interval)
- provision RabbitMQ Overview dashboard (grafana.com #10991) on stage and production
- remove redundant prometheus/rabbitmq receiver from otel-collector

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

To improve RabbitMQ monitoring, this PR replaces the existing indirect scraping through the OTel Collector with direct scraping via a Prometheus ServiceMonitor. This simplifies the monitoring architecture and integrates the RabbitMQ Overview dashboard into Grafana so that RabbitMQ's state can be visualized more effectively.

Highlights

  • RabbitmqCluster configuration update: the rabbitmq_prometheus plugin is explicitly enabled on the RabbitmqCluster so that it exposes the Prometheus metrics endpoint (:15692).
  • Prometheus ServiceMonitor added: a ServiceMonitor resource is added so that Prometheus can scrape RabbitMQ directly. It monitors the RabbitMQ instances in the rabbitmq namespace at a 15-second interval.
  • Grafana dashboard provisioning: the official RabbitMQ Overview dashboard (#10991) is provisioned in Grafana for both the stage and production environments to visualize RabbitMQ metrics.
  • OTel Collector configuration cleanup: the prometheus/rabbitmq receiver that the OTel Collector used to scrape RabbitMQ is removed, eliminating the duplicate monitoring path and simplifying the setup.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request integrates RabbitMQ monitoring into the Kubernetes infrastructure. Key changes include enabling the rabbitmq_prometheus plugin, configuring a ServiceMonitor for Prometheus to directly scrape RabbitMQ metrics, and adding a new Grafana dashboard for RabbitMQ overview. The previous RabbitMQ scraping configuration via the OTel collector has been removed. The review comments point out a potential issue with the Grafana dashboard's datasource configuration in both production and stage environments, suggesting that the current datasource: Prometheus might cause a UID mismatch and prevent the dashboard from correctly finding its data source. An explicit mapping of the datasource input variable is recommended to resolve this.

Comment thread infra/k8s/monitoring/grafana/overlays/production/values.yaml Outdated
Comment thread infra/k8s/monitoring/grafana/overlays/stage/values.yaml Outdated
The gnetId 10991 dashboard uses 'datasource' as its input variable name.
Using a plain string could cause UID mismatch with the provisioned datasource.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@manamana32321 manamana32321 enabled auto-merge April 1, 2026 22:32
github-merge-queue Bot pushed a commit that referenced this pull request Apr 1, 2026
### Description

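Adds a k6 load test for the grading request pipeline (`API → RabbitMQ → Iris`).

- **150 VUs**: 140 normal + 10 villain (malicious code that exhausts resources)
- **Per-language payloads**: C, Cpp, Java, Python3, PyPy3, with 5 payloads each for normal and villain
- **Dual result output**: Prometheus remote write (for Grafana) + local JSON (for reports)
- **Environment-variable driven**: `BASE_URL`, `USERNAME`, `PASSWORD`, `PROBLEM_ID` are supplied at run time
- **Error logging**: response bodies are truncated to 500 characters to avoid excessive logging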

```
k6 run \
  -e BASE_URL=https://codedang.com/api \
  -e USERNAME=... -e PASSWORD=... -e PROBLEM_ID=... \
  --out experimental-prometheus-rw \
  --out "json=results/$(date +%Y%m%d-%H%M%S).json" \
  -e K6_PROMETHEUS_RW_SERVER_URL=https://prometheus.codedang.com/api/v1/write \
  submission-load.js
```

### Stage validation results

The script was validated on the stage environment with a reduced VU count (2 normal + 1 villain, 40 seconds).

```
✓ login succeeded
✓ submission accepted
✓ http_req_failed: 0.00% (0 out of 43)
✓ submission_errors: 0
✓ p(95) = 79ms < 5000ms threshold

submission_duration: avg=36.68ms med=33.89ms p(95)=60.42ms
iterations: 42 (1 login + 42 submissions)
```

Metrics during the test were also checked visually on the RabbitMQ Overview dashboard (Grafana #10991) added in #3508:

<img width="1893" height="788" alt="image"
src="https://github.com/user-attachments/assets/bd10d564-e907-4f9b-8181-b76636daa292"
/>

- Incoming messages/s held at ~0.87 throughout the test, a normal inflow
- Ready messages 0, Unacknowledged 0: messages were consumed normally with no queue backlog
- Publishers 8, Consumers 8, Queues 4: the expected topology

### Additional context

- Related: #3508 (adds the RabbitMQ monitoring dashboard)
- k6 metrics (`k6_*`) and RabbitMQ/Iris metrics can be compared on the same time axis in Grafana
- Results can be filtered by the `user_type` (`normal`/`villain`) and `language` tags
- Script syntax validated with `k6 inspect`

---

### Before submitting the PR, please make sure you do the following

- [x] Read the [Contributing
Guidelines](https://github.com/skkuding/next/blob/main/CONTRIBUTING.md)
- [x] Read the [Contributing
Guidelines](https://github.com/skkuding/next/blob/main/CONTRIBUTING.md#pr-and-branch)
and follow the [Commit
Convention](https://github.com/skkuding/next/blob/main/CONTRIBUTING.md#commit-convention)
- [x] Provide a description in this PR that addresses **what** the PR is
solving, or reference the issue that it solves (e.g. `fixes #123`).
- [ ] Ideally, include relevant tests that fail without this PR but pass
with it.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tasoo-oos commented Apr 2, 2026

The RabbitMQ Grafana dashboard (gnetId: 10991, revision: 12) uses the imported input DS_PROMETHEUS, so with Grafana chart 10.5.13 a mapping with name: datasource does not replace the dashboard placeholder.

Both the stage and production overlays have been updated as follows:

  • name: DS_PROMETHEUS
  • value: Prometheus
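
Why the input name matters can be illustrated with a small model of placeholder substitution. This is a sketch under an assumption, not the chart's actual implementation: it assumes each `name`/`value` pair replaces the literal token `${name}` in the downloaded dashboard JSON. The function name and the panel excerpt are hypothetical.

```python
import json


def apply_datasource_inputs(dashboard_json: str, inputs: list) -> str:
    """Replace every ${name} token with its mapped value (simplified model)."""
    for item in inputs:
        token = "${" + item["name"] + "}"
        dashboard_json = dashboard_json.replace(token, item["value"])
    return dashboard_json


# Hypothetical one-panel excerpt of a dashboard that declares DS_PROMETHEUS.
panel = json.dumps({"title": "Ready messages", "datasource": "${DS_PROMETHEUS}"})

# Mapping under the wrong input name leaves the placeholder untouched.
wrong = apply_datasource_inputs(panel, [{"name": "datasource", "value": "Prometheus"}])
# Mapping under the dashboard's declared input name resolves it.
right = apply_datasource_inputs(panel, [{"name": "DS_PROMETHEUS", "value": "Prometheus"}])

print(json.loads(wrong)["datasource"])  # ${DS_PROMETHEUS}
print(json.loads(right)["datasource"])  # Prometheus
```

In this model, a panel whose datasource is still the unresolved `${DS_PROMETHEUS}` token has nothing to bind to, which is consistent with the fix of renaming the input to DS_PROMETHEUS.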

@tasoo-oos tasoo-oos left a comment

LGTM!

@manamana32321 manamana32321 added this pull request to the merge queue Apr 2, 2026
Merged via the queue into main with commit 88e2021 Apr 2, 2026
11 checks passed
@manamana32321 manamana32321 deleted the feat/rabbitmq-monitoring branch April 2, 2026 18:10