fix(infra): split promtail journal scrape into 3 jobs per matches limitation by manamana32321 · Pull Request #3545 · skkuding/codedang

manamana32321 · 2026-04-27T04:31:47Z

Description

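Hotfix for the stage cluster's promtail, which went into CrashLoop immediately after #3542 was merged. It also resolves the duplicate-ingest problem for sudo-inside-sshd entries that the bot reviews flagged.

Symptom: right after the merge (2026-04-27 02:19 KST), the stage promtail-stage ApplicationSet auto-synced → the DaemonSet attempted a rolling update → the new pod on node skkuding-1 died immediately on startup and restarted 24+ times.

Error log (stage promtail-5hqsh):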

{"level":"error","msg":"error creating promtail",
 "error":"failed to make journal target manager: Error parsing journal reader 'matches' config value"}

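Two causes:

1. promtail's matches parser requires every token to be in FIELD=VALUE form. The + token contains no = and is therefore rejected. (See journalTargetWithReader in grafana/loki's journaltarget.go.)
2. Removing + still breaks the semantics. sd-journal combines matches on different fields with AND: putting _SYSTEMD_UNIT=foo and _TRANSPORT=kernel in the same matches makes it look for entries where "unit=foo and transport=kernel", which matches 0 entries.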
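
The two causes can be illustrated with a small Python model. This is a simplified sketch, not promtail's or systemd's actual code: `parse_matches` and `entry_matches` are hypothetical helpers and the sample journal entries are made up. It assumes whitespace-separated tokens and the sd-journal rule that matches on the same field are ORed while matches on different fields are ANDed.

```python
from collections import defaultdict


def parse_matches(matches: str) -> dict[str, set[str]]:
    """Parse a whitespace-separated matches string into {field: {values}}.

    Models the failure described above: any token that is not exactly
    FIELD=VALUE (such as a bare '+') is rejected.
    """
    parsed: dict[str, set[str]] = defaultdict(set)
    for token in matches.split():
        parts = token.split("=")
        if len(parts) != 2:
            raise ValueError(f"cannot parse matches token: {token!r}")
        field, value = parts
        parsed[field].add(value)
    return dict(parsed)


def entry_matches(entry: dict[str, str], parsed: dict[str, set[str]]) -> bool:
    """Same field: OR across values. Different fields: AND."""
    return all(entry.get(field) in values for field, values in parsed.items())


# Made-up sample entries; the kernel entry carries no _SYSTEMD_UNIT here.
ENTRIES = [
    {"_SYSTEMD_UNIT": "k3s.service", "_TRANSPORT": "stdout"},
    {"_SYSTEMD_UNIT": "containerd.service", "_TRANSPORT": "stdout"},
    {"_TRANSPORT": "kernel"},
]

# Cause 1: the '+' token has no '=' and fails to parse.
try:
    parse_matches("_SYSTEMD_UNIT=k3s.service + _TRANSPORT=kernel")
    plus_rejected = False
except ValueError:
    plus_rejected = True

# Cause 2: without '+', the two fields are ANDed and nothing matches.
combined = parse_matches("_SYSTEMD_UNIT=k3s.service _TRANSPORT=kernel")
combined_hits = [e for e in ENTRIES if entry_matches(e, combined)]
```

In this model `plus_rejected` ends up true and `combined_hits` is empty, which corresponds to the crash at startup and to the zero-match behavior respectively.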

Fix: split the journal scrape into one job per field family. There are two jobs: systemd-journal-units and systemd-journal-kernel. Both get the same static label, labels.job: systemd-journal → every LogQL example in the original PR description ({job="systemd-journal"}, {transport="kernel"}, etc.) keeps working unchanged.

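Security signals (sshd, sudo) are intentionally excluded: a sudo entry run inside an sshd session matches both jobs and is stored twice in Loki (flagged by both the gemini and codex bots), so they will be handled in a separate L4 security PR together with the k8s API audit log. This PR focuses on L1-L3.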
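
The duplicate-ingest problem, and the effect of dropping the sudo job, can be sketched with a small Python model. This is a hypothetical illustration, not promtail code: the job name `systemd-journal-sudo`, the exact match lists, and the sample entries are assumptions made for the example.

```python
def job_hits(entry, job_matches):
    """One journal job: same-field values are ORed, different fields are ANDed."""
    return all(entry.get(field) in values for field, values in job_matches.items())


def ingest_counts(entries, jobs):
    """Return, per entry, how many jobs would ship it to Loki."""
    return [sum(job_hits(e, m) for m in jobs.values()) for e in entries]


# Unit whitelist taken from the collection-scope table in this PR.
UNITS = {
    "k3s.service",
    "k3s-agent.service",
    "containerd.service",
    "systemd-logind.service",
}

# Assumed shape of the earlier three-job layout (units job still lists sshd).
THREE_JOBS = {
    "systemd-journal-units": {"_SYSTEMD_UNIT": UNITS | {"sshd.service"}},
    "systemd-journal-kernel": {"_TRANSPORT": {"kernel"}},
    "systemd-journal-sudo": {"_COMM": {"sudo"}},
}

# Two-job layout after removing the security signals.
TWO_JOBS = {
    "systemd-journal-units": {"_SYSTEMD_UNIT": UNITS},
    "systemd-journal-kernel": {"_TRANSPORT": {"kernel"}},
}

# Made-up sample entries.
ENTRIES = [
    {"_SYSTEMD_UNIT": "k3s.service", "_COMM": "k3s"},
    {"_TRANSPORT": "kernel"},
    {"_SYSTEMD_UNIT": "sshd.service", "_COMM": "sudo"},  # sudo inside an sshd session
]

three = ingest_counts(ENTRIES, THREE_JOBS)
two = ingest_counts(ENTRIES, TWO_JOBS)
```

Under these assumptions the sshd+sudo entry is shipped twice by the three-job layout and not at all by the two-job layout, while the k3s and kernel entries are shipped exactly once in both. That is the trade-off this PR accepts until the dedicated security PR.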

Collection scope (as of this PR)

Layer	Unit
L2 Platform	`k3s.service`, `k3s-agent.service`, `containerd.service`
L3 OS	kernel transport, `systemd-logind.service`

L4-cheap (sshd, sudo) will be handled in a separate PR, together with the K8s audit log and a Falco evaluation.

Additional context

Why this was only caught on stage: kubectl run + helm template validate YAML syntax only, not promtail's actual runtime parser. Stage did exactly its job as the first integration check. The fix in this PR was verified before merge with a docker dry-run:

docker run --rm \
  -v <rendered>:/cfg/promtail.yaml:ro \
  -v /var/log/journal:/var/log/journal:ro \
  -v /etc/machine-id:/etc/machine-id:ro \
  --user root grafana/promtail:3.5.1 \
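  -config.file=/cfg/promtail.yaml

Confirmed: zero fatal errors such as "Error parsing journal reader 'matches'", the journal target started normally, and promtail attempted to push to Loki (i.e. data was read).

Safety caveat from the original PR: stage is the auto-deploy channel, so prod is unaffected (promotion to prod is done manually by the user). After this hotfix merges: stage recovers automatically → verification → the user decides on prod promotion.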

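Review points

The matches: whitelist is written as a folded scalar >- (the units job has 4 entries; the kernel job has 1, so it stays inline)
Memory stays at 256Mi: 2 jobs × 1 cursor each, and the journal volume is small, so this is sufficient
All 4 bot review comments are resolved (3 by removing the sudo job, 1 by applying the folded scalar)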

Verification plan (on stage, after this PR merges):

stage promtail DaemonSet 3/3 Ready, RestartCount stable
{job="systemd-journal"} returns results (both jobs combined)
{job="systemd-journal", unit="k3s-agent.service"} (verifies the units job)
{job="systemd-journal", transport="kernel"} (verifies the kernel job)
promtail pod memory stays within 256 MiB
No regression in the existing kubernetes-pods scrape

Before submitting the PR, please make sure you do the following

Read the Contributing Guidelines
Read the Contributing Guidelines and follow the Commit Convention
Provide a description in this PR that addresses what the PR is solving, or reference the issue that it solves (e.g. fixes #123).
Ideally, include relevant tests that fail without this PR but pass with it. (Runtime verification was done before merge via docker dry-run. Adding a GHA workflow to prevent regressions will be tracked as separate work.)

🤖 Generated with Claude Code

…itation

The single-job design with 'matches: "... + _TRANSPORT=kernel + _COMM=sudo"' crashed promtail at startup with "Error parsing journal reader 'matches' config value". Two distinct issues:

1. promtail's matches parser only accepts FIELD=VALUE tokens. The '+' token has no '=' and is rejected outright (see journaltarget.go in grafana/loki, function journalTargetWithReader).
2. Even after stripping '+', semantics break: sd-journal AND-combines matches across different fields. _SYSTEMD_UNIT=foo combined with _TRANSPORT=kernel matches zero entries (no entry has both).

Fix: one journal job per field family (units, kernel, sudo). All three share `labels.job: systemd-journal` so a single LogQL selector `{job="systemd-journal"}` still works for users.

Stage cluster has been in CrashLoop since the original PR merged (02:19 KST). 1/3 nodes new pod failing, 2/3 still on old config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request refactors the Promtail configuration by splitting the systemd-journal scraping into three separate jobs—units, kernel, and sudo—to address limitations in the matches parser's ability to handle OR logic across different fields. Review feedback highlights a risk of log duplication in Loki due to overlapping match criteria and differing labels. Additionally, improvements were suggested regarding YAML formatting for better readability and ensuring label consistency across all journal jobs by adding the transport label to the sudo job.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fe4a850224

The 3-job split introduced double-ingestion for sudo invocations inside sshd sessions: the entry carries both _SYSTEMD_UNIT=sshd.service and _COMM=sudo, so the units job and the sudo job each ingested it under different label sets that Loki cannot dedupe.

L4-cheap (sshd, sudo) is intentionally removed from this PR's scope; defer to a dedicated security PR alongside k8s API audit log so the audit pipeline can be designed once.

Result: 3 jobs -> 2 (units + kernel), zero duplicates.

Also fold the units job 'matches' into a >- block scalar for readability (noted by gemini-code-assist).

Validated via docker dry-run with host /var/log/journal + /etc/machine-id mounted: promtail starts, journal target opens, no parser errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tasoo-oos

LGTM!

gemini-code-assist Bot reviewed Apr 27, 2026

Comment thread infra/k8s/monitoring/promtail/values.yaml Outdated

Comment thread infra/k8s/monitoring/promtail/values.yaml Outdated

Comment thread infra/k8s/monitoring/promtail/values.yaml Outdated

chatgpt-codex-connector Bot reviewed Apr 27, 2026

Comment thread infra/k8s/monitoring/promtail/values.yaml Outdated

manamana32321 requested review from doyoon323, sunghyun1000 and tasoo-oos April 27, 2026 08:17

manamana32321 self-assigned this Apr 27, 2026

manamana32321 added 🪰 bug Something isn't working ⛳️ team-infra labels Apr 27, 2026

tasoo-oos approved these changes May 2, 2026

tasoo-oos added this pull request to the merge queue May 2, 2026

Merged via the queue into main with commit 563f04f May 2, 2026
21 checks passed

tasoo-oos deleted the fix/promtail-journal-matches-multi-job branch May 2, 2026 15:09

tasoo-oos merged 2 commits into main from fix/promtail-journal-matches-multi-job
