fix(infra): split promtail journal scrape into 3 jobs per matches limitation #3545

Merged: tasoo-oos merged 2 commits into main from fix/promtail-journal-matches-multi-job on May 2, 2026

@manamana32321 commented Apr 27, 2026

Description

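This is a hotfix for the stage cluster promtail, which went into CrashLoop immediately after #3542 was merged. It also cleans up the duplicate-ingest problem for sudo run inside sshd that the bot reviews found.

Symptom: right after the merge (2026-04-27 02:19 KST), the stage promtail-stage ApplicationSet auto-synced and attempted a DaemonSet rolling update. The new pod on node skkuding-1 died immediately on startup and restarted 24+ times.

Error log (stage promtail-5hqsh):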

{"level":"error","msg":"error creating promtail",
 "error":"failed to make journal target manager: Error parsing journal reader 'matches' config value"}

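Two causes:

  1. promtail's matches parser requires every token to be in FIELD=VALUE form. The + token has no = and is rejected. (journalTargetWithReader in grafana/loki journaltarget.go)
  2. Even without the +, the semantics break. sd-journal combines matches on different fields with AND: putting _SYSTEMD_UNIT=foo _TRANSPORT=kernel in the same matches looks for entries where "unit=foo and transport=kernel" both hold, which matches 0 entries.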
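
The two causes above can be modelled in a few lines. This is a minimal Python sketch, not promtail's or systemd's actual code: `parse_matches` and `entry_matches` are hypothetical helpers that only mimic the rules described here (every token must be FIELD=VALUE; values for the same field are OR-ed, different fields are AND-ed), and the sample entries are made up.

```python
from collections import defaultdict


def parse_matches(matches: str) -> dict[str, set[str]]:
    """Group whitespace-separated FIELD=VALUE tokens by field.

    Hypothetical model of cause 1: a token without '=' (such as '+')
    is rejected.
    """
    grouped: dict[str, set[str]] = defaultdict(set)
    for token in matches.split():
        field, sep, value = token.partition("=")
        if not sep or not field:
            raise ValueError(f"invalid match token: {token!r}")
        grouped[field].add(value)
    return dict(grouped)


def entry_matches(entry: dict[str, str], matches: str) -> bool:
    """Model of cause 2: OR within one field, AND across fields."""
    grouped = parse_matches(matches)
    return all(entry.get(field) in values for field, values in grouped.items())


# Made-up journal entries for illustration.
unit_entry = {"_SYSTEMD_UNIT": "k3s.service", "_TRANSPORT": "stdout"}
kernel_entry = {"_TRANSPORT": "kernel"}
entries = (unit_entry, kernel_entry)

# Cause 1: the '+' token is rejected before any entry is read.
try:
    parse_matches("_SYSTEMD_UNIT=k3s.service + _TRANSPORT=kernel")
    plus_rejected = False
except ValueError:
    plus_rejected = True

# Cause 2: without '+', both fields must hold for the same entry.
combined = "_SYSTEMD_UNIT=k3s.service _TRANSPORT=kernel"
combined_hits = [e for e in entries if entry_matches(e, combined)]

# One matches string per field family lets each entry match its own job.
units_hits = [e for e in entries if entry_matches(e, "_SYSTEMD_UNIT=k3s.service")]
kernel_hits = [e for e in entries if entry_matches(e, "_TRANSPORT=kernel")]

print(plus_rejected, len(combined_hits), len(units_hits), len(kernel_hits))
```

In this model the combined string matches nothing, while each single-field string matches exactly the entry it was meant for, which is the motivation for splitting the scrape into separate jobs.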

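Fix: split the journal job per field family. There are 2 jobs: systemd-journal-units and systemd-journal-kernel. Both get the same static label labels.job: systemd-journal, so every LogQL example in the original PR description ({job="systemd-journal"}, {transport="kernel"}, etc.) keeps working unchanged.

Security signals (sshd, sudo) are intentionally excluded. A sudo entry run inside sshd matches both the units job and the sudo job and is stored in Loki twice (flagged by both the gemini and codex bots), so these signals will be handled in a separate L4 security PR together with the k8s API audit log. This PR focuses on L1-L3.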
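
The duplicate-ingest problem can be illustrated with a small model. This is a minimal Python sketch under stated assumptions: the `JOBS` table, the `streams_for` helper, and the `comm` label on the sudo job are hypothetical, since the real label sets live in values.yaml and are not shown in this description.

```python
# Hypothetical job table: name -> (match field, allowed values, extra label).
JOBS = {
    "systemd-journal-units": ("_SYSTEMD_UNIT", {"sshd.service", "k3s.service"}, "unit"),
    "systemd-journal-sudo": ("_COMM", {"sudo"}, "comm"),
}


def streams_for(entry: dict[str, str]) -> list[dict[str, str]]:
    """Return the label set of every job whose match accepts the entry."""
    streams = []
    for field, values, label in JOBS.values():
        if entry.get(field) in values:
            streams.append({"job": "systemd-journal", label: entry[field]})
    return streams


# A sudo invocation inside an sshd session carries both fields.
sudo_in_sshd = {"_SYSTEMD_UNIT": "sshd.service", "_COMM": "sudo"}
streams = streams_for(sudo_in_sshd)

# A plain k3s entry is only accepted by the units job.
k3s_streams = streams_for({"_SYSTEMD_UNIT": "k3s.service"})

print(len(streams), len(k3s_streams))
```

Because the two label sets differ, the same log line ends up in two separate streams in this model, which is why removing the sudo job from scope eliminates the duplicate.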

Collection scope (as of this PR)

| Layer | Unit |
| --- | --- |
| L2 Platform | k3s.service, k3s-agent.service, containerd.service |
| L3 OS | kernel transport, systemd-logind.service |

L4-cheap (sshd, sudo) will be handled in a separate PR, together with an evaluation of K8s audit log + Falco.

Additional context

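Why this was only caught on stage: kubectl run + helm template only validate YAML syntax; they do not exercise promtail's actual runtime parser. Stage did exactly its job as the first integration check. The fix in this PR was validated before merge with a docker dry-run:

docker run --rm \
  -v <rendered>:/cfg/promtail.yaml:ro \
  -v /var/log/journal:/var/log/journal:ro \
  -v /etc/machine-id:/etc/machine-id:ro \
  --user root grafana/promtail:3.5.1 \
  -config.file=/cfg/promtail.yaml

Result: zero fatal errors such as "Error parsing journal reader 'matches'", the journal target started normally, and a Loki push was attempted (i.e. data was read).

Safety caveat from the original PR: stage is the auto-deploy channel, so prod is not affected (the user handles prod promotion manually). After this hotfix merges: stage recovers automatically → verification → the user decides on prod promotion.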

Review points

  • The matches: allowlist is written as a folded scalar >- (the units job has 4 entries; the kernel job has 1, so it stays inline)
  • Memory stays at 256Mi: 2 jobs × 1 cursor each, and the journal volume is small, so this is sufficient
  • All 4 bot review comments are resolved (3 by removing the sudo job, 1 by applying the folded scalar)

Verification plan (on stage, after this PR merges):

  • stage promtail DaemonSet is 3/3 Ready and RestartCount has stabilized
  • {job="systemd-journal"} returns results (both jobs combined)
  • {job="systemd-journal", unit="k3s-agent.service"} (verifies the units job)
  • {job="systemd-journal", transport="kernel"} (verifies the kernel job)
  • promtail pod memory stays within 256 MiB
  • No regression in the existing kubernetes-pods scrape

Before submitting the PR, please make sure you do the following

  • Read the Contributing Guidelines
  • Read the Contributing Guidelines and follow the Commit Convention
  • Provide a description in this PR that addresses what the PR is solving, or reference the issue that it solves (e.g. fixes #123).
  • Ideally, include relevant tests that fail without this PR but pass with it. (Runtime validation was done before merge with a docker dry-run. Adding a GHA workflow to prevent regressions will be tracked as separate work.)

🤖 Generated with Claude Code

…itation

The single-job design with 'matches: "... + _TRANSPORT=kernel + _COMM=sudo"'
crashed promtail at startup with "Error parsing journal reader 'matches'
config value". Two distinct issues:

1. promtail's matches parser only accepts FIELD=VALUE tokens. The '+'
   token has no '=' and is rejected outright (see journaltarget.go in
   grafana/loki, function journalTargetWithReader).

2. Even after stripping '+', semantics break: sd-journal AND-combines
   matches across different fields. _SYSTEMD_UNIT=foo combined with
   _TRANSPORT=kernel matches zero entries (no entry has both).

Fix: one journal job per field family — units, kernel, sudo. All three
share `labels.job: systemd-journal` so a single LogQL selector
`{job="systemd-journal"}` still works for users.

Stage cluster has been in CrashLoop since the original PR merged
(02:19 KST). 1/3 nodes new pod failing, 2/3 still on old config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@gemini-code-assist (bot) left a comment

Code Review

This pull request refactors the Promtail configuration by splitting the systemd-journal scraping into three separate jobs—units, kernel, and sudo—to address limitations in the matches parser's ability to handle OR logic across different fields. Review feedback highlights a risk of log duplication in Loki due to overlapping match criteria and differing labels. Additionally, improvements were suggested regarding YAML formatting for better readability and ensuring label consistency across all journal jobs by adding the transport label to the sudo job.

3 comment threads on infra/k8s/monitoring/promtail/values.yaml (all outdated)

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fe4a850224

Comment thread infra/k8s/monitoring/promtail/values.yaml (outdated)

The 3-job split introduced double-ingestion for sudo invocations inside
sshd sessions: the entry carries both _SYSTEMD_UNIT=sshd.service and
_COMM=sudo, so the units job and the sudo job each ingested it under
different label sets that Loki cannot dedupe.

L4-cheap (sshd, sudo) is intentionally removed from this PR's scope —
defer to a dedicated security PR alongside k8s API audit log so the
audit pipeline can be designed once. Result: 3 jobs -> 2 (units +
kernel), zero duplicates.

Also fold the units job 'matches' into a >- block scalar for readability
(noted by gemini-code-assist).

Validated via docker dry-run with host /var/log/journal + /etc/machine-id
mounted: promtail starts, journal target opens, no parser errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@tasoo-oos left a comment

LGTM!

@tasoo-oos tasoo-oos added this pull request to the merge queue May 2, 2026
Merged via the queue into main with commit 563f04f May 2, 2026
21 checks passed
@tasoo-oos tasoo-oos deleted the fix/promtail-journal-matches-multi-job branch May 2, 2026 15:09
Labels: 🪰 bug (Something isn't working), ⛳️ team-infra

2 participants