fix(infra): split promtail journal scrape into 3 jobs per matches limitation #3545
The single-job design with 'matches: "... + _TRANSPORT=kernel + _COMM=sudo"'
crashed promtail at startup with "Error parsing journal reader 'matches'
config value". Two distinct issues:
1. promtail's matches parser only accepts FIELD=VALUE tokens. The '+'
token has no '=' and is rejected outright (see journaltarget.go in
grafana/loki, function journalTargetWithReader).
2. Even after stripping '+', semantics break: sd-journal AND-combines
matches across different fields. _SYSTEMD_UNIT=foo combined with
_TRANSPORT=kernel matches zero entries (no entry has both).
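
The two issues above can be reproduced with a small Python model. This is a sketch under stated assumptions, not promtail's actual code: `parse_matches` is a hypothetical stand-in for the FIELD=VALUE check, and `entry_matches` models the sd-journal rule that matches on the same field are OR-combined while matches on different fields are AND-combined. The sample entries are made up for illustration.

```python
from collections import defaultdict


def parse_matches(matches: str) -> list[tuple[str, str]]:
    """Split a matches string into (field, value) pairs.

    Hypothetical stand-in for promtail's parser: every whitespace-separated
    token must look like FIELD=VALUE, so a bare '+' is rejected.
    """
    pairs = []
    for token in matches.split():
        field, sep, value = token.partition("=")
        if not sep or not field:
            raise ValueError(f"not a FIELD=VALUE token: {token!r}")
        pairs.append((field, value))
    return pairs


def entry_matches(entry: dict[str, str], pairs: list[tuple[str, str]]) -> bool:
    """Model of sd-journal matching: OR within a field, AND across fields."""
    allowed = defaultdict(set)
    for field, value in pairs:
        allowed[field].add(value)
    # Every field that appears in the matches must be satisfied by the entry.
    return all(entry.get(field) in values for field, values in allowed.items())


# Illustrative entries: a unit log line and a kernel log line.
unit_entry = {"_SYSTEMD_UNIT": "foo", "_TRANSPORT": "stdout"}
kernel_entry = {"_TRANSPORT": "kernel"}

# Issue 2: with '+' stripped, the combined matches accept neither entry.
combined = parse_matches("_SYSTEMD_UNIT=foo _TRANSPORT=kernel")
```

With `combined`, both `entry_matches(unit_entry, combined)` and `entry_matches(kernel_entry, combined)` are false, which is why the single-job design would have collected nothing even if it had started.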
Fix: one journal job per field family — units, kernel, sudo. All three
share `labels.job: systemd-journal` so a single LogQL selector
`{job="systemd-journal"}` still works for users.
Stage cluster has been in CrashLoop since the original PR merged
(02:19 KST). 1/3 nodes new pod failing, 2/3 still on old config.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code Review
This pull request refactors the Promtail configuration by splitting the systemd-journal scraping into three separate jobs—units, kernel, and sudo—to address limitations in the matches parser's ability to handle OR logic across different fields. Review feedback highlights a risk of log duplication in Loki due to overlapping match criteria and differing labels. Additionally, improvements were suggested regarding YAML formatting for better readability and ensuring label consistency across all journal jobs by adding the transport label to the sudo job.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fe4a850224
The 3-job split introduced double-ingestion for sudo invocations inside
sshd sessions: the entry carries both _SYSTEMD_UNIT=sshd.service and
_COMM=sudo, so the units job and the sudo job each ingested it under
different label sets that Loki cannot dedupe.
L4-cheap (sshd, sudo) is intentionally removed from this PR's scope and
deferred to a dedicated security PR alongside the k8s API audit log, so
the audit pipeline can be designed once.
Result: 3 jobs -> 2 (units + kernel), zero duplicates.
Also fold the units job 'matches' into a >- block scalar for readability
(noted by gemini-code-assist).
Validated via docker dry-run with host /var/log/journal + /etc/machine-id
mounted: promtail starts, journal target opens, no parser errors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
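
The double ingestion described in this commit can be illustrated with a short Python model. It is only a sketch: the job definitions, the `relabel` mapping, and the `comm` label are hypothetical, since the real scrape config is not shown here. The assumption is that each job ingests every entry its matches accept and that Loki treats each distinct label set as a separate stream.

```python
def accepts(entry, matches):
    """sd-journal rule: OR within one field, AND across different fields."""
    by_field = {}
    for field, value in matches:
        by_field.setdefault(field, set()).add(value)
    return all(entry.get(f) in values for f, values in by_field.items())


def ingest(entry, jobs):
    """Return the label sets under which `entry` would be stored."""
    streams = []
    for job in jobs:
        if accepts(entry, job["matches"]):
            labels = {"job": "systemd-journal"}  # shared static label
            for field, label in job["relabel"].items():
                if field in entry:
                    labels[label] = entry[field]
            streams.append(frozenset(labels.items()))
    return streams


def units_job(units):
    """Hypothetical units job: one _SYSTEMD_UNIT match per whitelisted unit."""
    return {
        "matches": [("_SYSTEMD_UNIT", u) for u in units],
        "relabel": {"_SYSTEMD_UNIT": "unit"},
    }


KERNEL_JOB = {"matches": [("_TRANSPORT", "kernel")], "relabel": {"_TRANSPORT": "transport"}}
SUDO_JOB = {"matches": [("_COMM", "sudo")], "relabel": {"_COMM": "comm"}}  # label name assumed

CORE_UNITS = ["k3s.service", "k3s-agent.service", "containerd.service", "systemd-logind.service"]

three_jobs = [units_job(CORE_UNITS + ["sshd.service"]), KERNEL_JOB, SUDO_JOB]
two_jobs = [units_job(CORE_UNITS), KERNEL_JOB]

# A sudo invocation inside an sshd session carries both fields.
sudo_in_sshd = {"_SYSTEMD_UNIT": "sshd.service", "_COMM": "sudo"}

dup_streams = ingest(sudo_in_sshd, three_jobs)  # two different label sets
no_streams = ingest(sudo_in_sshd, two_jobs)     # out of scope after this commit
```

In this model the 3-job layout stores the same entry under two label sets, while the 2-job layout never lets one entry match more than one job, because a kernel entry carries no whitelisted `_SYSTEMD_UNIT`.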
Description
This is a hotfix for the stage cluster promtail, which went into CrashLoop immediately after #3542 was merged. It also cleans up the duplicate ingestion of sudo-inside-sshd entries that the bot reviews found.
Symptom: right after the merge (2026-04-27 02:19 KST), the stage `promtail-stage` ApplicationSet auto-synced and attempted a DaemonSet rolling update. The new pod on node skkuding-1 died immediately on startup and restarted 24+ times.
Error log (stage promtail-5hqsh):
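Two causes:
1. Each `matches` token must be in FIELD=VALUE form. The `+` token has no `=`, so it is rejected. (See `journalTargetWithReader` in grafana/loki journaltarget.go.)
2. Even without `+`, the semantics break. sd-journal AND-combines matches across different fields, so putting `_SYSTEMD_UNIT=foo` and `_TRANSPORT=kernel` in the same `matches` looks for entries that have unit=foo and transport=kernel at once, which matches 0 entries.
Fix: split the journal scrape into one job per field family, two in total: `systemd-journal-units` and `systemd-journal-kernel`. Both set the same static label `labels.job: systemd-journal`, so every LogQL example in the original PR description (`{job="systemd-journal"}`, `{transport="kernel"}`, etc.) keeps working.
Security signals (sshd, sudo) are intentionally excluded. A sudo entry run inside an sshd session matched both the units job and the sudo job and was stored twice in Loki (flagged by both the gemini and codex bots), so they will be handled in a separate L4 security PR together with the k8s API audit log. This PR focuses on L1-L3.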
Collection scope (as of this PR)
- k3s.service, k3s-agent.service, containerd.service
- systemd-logind.service
- L4-cheap (sshd, sudo): handled in a separate PR together with the K8s audit log and a Falco review.
Additional context
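Why this was only caught on stage: `kubectl run` + `helm template` validate YAML syntax only; they do not exercise the promtail runtime parser. Stage did exactly what it is for, acting as the first integration check. The fix in this PR was validated before merge with a docker dry-run: zero fatal errors such as "Error parsing journal reader 'matches'", the journal target started normally, and Loki push attempts were observed (meaning data was read).
Safety caveat from the original PR: stage is the auto-deploy channel, so prod is unaffected (the user promotes to prod manually). After this hotfix is merged, stage recovers automatically, then it is verified, then the user decides on prod promotion.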
Review points
- The `matches:` whitelist is written as a folded scalar `>-` (the units job has 4 entries; the kernel job has 1, so it stays inline).
Verification plan (on stage, after this PR is merged):
- `{job="systemd-journal"}` returns results (both jobs combined)
- `{job="systemd-journal", unit="k3s-agent.service"}` (units job check)
- `{job="systemd-journal", transport="kernel"}` (kernel job check)
- No regression in the `kubernetes-pods` scrape
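
The verification selectors can also be checked from a script. The sketch below only builds request URLs for Loki's `/loki/api/v1/query_range` HTTP endpoint; the base URL is a placeholder and the `CHECKS` names are hypothetical, so adjust both to the stage environment before sending any request.

```python
from urllib.parse import urlencode

# Placeholder address; the real stage Loki endpoint is not given in this PR.
LOKI_BASE_URL = "http://loki.example.invalid:3100"

# Selectors taken from the verification plan.
CHECKS = {
    "both jobs": '{job="systemd-journal"}',
    "units job": '{job="systemd-journal", unit="k3s-agent.service"}',
    "kernel job": '{job="systemd-journal", transport="kernel"}',
}


def query_range_url(selector: str, limit: int = 10, base_url: str = LOKI_BASE_URL) -> str:
    """Build a query_range URL for one LogQL selector (no request is sent)."""
    params = urlencode({"query": selector, "limit": limit})
    return f"{base_url}/loki/api/v1/query_range?{params}"


urls = {name: query_range_url(selector) for name, selector in CHECKS.items()}
```

Each URL in `urls` can be fetched with any HTTP client; a non-empty result for all three selectors indicates that both journal jobs are shipping logs.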
🤖 Generated with Claude Code