feat: add cgroup CFS throttling metrics to OTel meter #1114

Merged
fenos merged 1 commit into supabase:master from mcollina:feat/cgroup-cfs-throttling-metrics
May 21, 2026
@mcollina
Contributor

Summary

  • Add installCgroupCpuMetrics(meter) in src/internal/monitoring/cgroup-cpu-metrics.ts. Detects cgroup v1 vs v2 at startup, parses cpu.stat (normalising the throttled-time field to nanoseconds), and registers four observable instruments on the provided OTel Meter:
    • process.cpu.cfs.periods — total CFS periods elapsed (counter, {period})
    • process.cpu.cfs.throttled_periods — periods the cgroup was throttled (counter, {period})
    • process.cpu.cfs.throttled_time — total throttled time (counter, ns)
    • process.cpu.cfs.throttled_ratio — fraction of recent periods throttled (gauge, 1), delta-based with a divide-by-zero guard on first sample and zero-period intervals.
  • Wire it into otel-metrics.ts right after metrics.setGlobalMeterProvider(meterProvider).
  • Reads happen inside the OTel observable callback, so there is no recurring setInterval. On non-Linux platforms, or when cpu.stat is missing or unreadable at startup, installation short-circuits cleanly with a single debug log. Read failures later in the process lifetime are caught and logged once per process at warn level, so they don't surface as SDK warnings.
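
For readers unfamiliar with the two cpu.stat layouts, the unit normalisation described above can be sketched roughly as follows. This is not the code from the PR: `CfsStats` and `parseCpuStat` are hypothetical names chosen for illustration. It relies on the kernel reporting `throttled_time` in nanoseconds under cgroup v1 and `throttled_usec` in microseconds under cgroup v2.

```typescript
// Hypothetical shape for the parsed counters (illustrative names only).
interface CfsStats {
  periods: number; // nr_periods
  throttledPeriods: number; // nr_throttled
  throttledTimeNs: number; // throttled time, normalised to nanoseconds
}

// Parse the text content of a cgroup cpu.stat file.
// Returns null when the CFS bandwidth fields are absent.
function parseCpuStat(text: string): CfsStats | null {
  const fields = new Map<string, number>();
  for (const line of text.split("\n")) {
    const [key, raw] = line.trim().split(/\s+/);
    if (!key || raw === undefined) continue; // blank or malformed line
    const value = Number(raw);
    if (Number.isFinite(value)) fields.set(key, value);
  }

  const periods = fields.get("nr_periods");
  const throttledPeriods = fields.get("nr_throttled");
  if (periods === undefined || throttledPeriods === undefined) return null;

  // cgroup v2 uses microseconds, cgroup v1 uses nanoseconds.
  const usec = fields.get("throttled_usec");
  const nsec = fields.get("throttled_time");
  let throttledTimeNs: number;
  if (usec !== undefined) {
    throttledTimeNs = usec * 1000;
  } else if (nsec !== undefined) {
    throttledTimeNs = nsec;
  } else {
    return null;
  }

  return { periods, throttledPeriods, throttledTimeNs };
}
```

Taking the file content as a string keeps the parser independent of how and when the file is read, which suits a design where the read happens inside the observable callback.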

Motivation

The app runs in containers (ECS / Kubernetes), and we want to correlate intermittent GC pauses of several hundred milliseconds with CFS throttling. These metrics are project-local: they are not yet part of the OTel semantic conventions, as the comment at the top of the module documents.

Test plan

  • npx tsc --noEmit clean
  • npx biome check clean on touched files
  • npm run test:unit -- otel-metrics.test.ts cgroup-cpu-metrics.test.ts → 9/9 passing
  • Verify in a Linux container that the four metrics appear in the OTLP / Prometheus export and are non-zero under load

🤖 Generated with Claude Code

@mcollina mcollina requested a review from a team as a code owner May 20, 2026 06:53
Copilot AI review requested due to automatic review settings May 20, 2026 06:53
Contributor

Copilot AI left a comment

Pull request overview

Adds Linux cgroup CFS throttling metrics to the project’s OpenTelemetry metrics setup so container CPU throttling can be correlated with runtime behavior (e.g., GC pauses).

Changes:

  • Introduces installCgroupCpuMetrics(meter) to detect cgroup v1/v2 and register four observable CFS instruments based on cpu.stat.
  • Wires cgroup metrics installation into OTel meter provider initialization.
  • Adds unit tests for cpu.stat parsing and the new metric installation behavior; updates existing OTel metrics tests to mock the new module.
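
As a rough illustration of what "observable instruments" means here, the sketch below registers one counter whose value is read only when the SDK collects. It is not the PR's implementation. `MeterLike`, `ObservableLike`, `ObservableResultLike`, and `installPeriodsCounter` are hypothetical names, with the interfaces loosely modelled on the `createObservableCounter` / `addCallback` / `observe` calls of `@opentelemetry/api` so that the example has no dependencies. Silently skipping the observation on a failed read is an assumption.

```typescript
// Minimal stand-ins for the parts of the OTel API used below (hypothetical).
interface ObservableResultLike {
  observe(value: number): void;
}
interface ObservableLike {
  addCallback(cb: (result: ObservableResultLike) => void): void;
}
interface MeterLike {
  createObservableCounter(
    name: string,
    options?: { description?: string; unit?: string },
  ): ObservableLike;
}

// Register process.cpu.cfs.periods. `readPeriods` is invoked on every
// collection, so no timer is needed. Failures are logged once and
// otherwise swallowed (an assumption about the desired behaviour).
function installPeriodsCounter(
  meter: MeterLike,
  readPeriods: () => number,
  warn: (message: string) => void,
): void {
  let warned = false;
  const counter = meter.createObservableCounter("process.cpu.cfs.periods", {
    description: "Total CFS periods elapsed",
    unit: "{period}",
  });
  counter.addCallback((result) => {
    try {
      result.observe(readPeriods());
    } catch (err) {
      if (!warned) {
        warned = true;
        warn(`cpu.stat read failed: ${String(err)}`);
      }
    }
  });
}
```

In a real setup `readPeriods` would read and parse cpu.stat, and `meter` would come from the configured MeterProvider.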

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File | Description
src/internal/monitoring/otel-metrics.ts | Installs cgroup CFS metrics immediately after setting the global meter provider.
src/internal/monitoring/otel-metrics.test.ts | Mocks the new cgroup metrics module and updates MeterProvider mock to include getMeter.
src/internal/monitoring/cgroup-cpu-metrics.ts | Implements cgroup detection, cpu.stat parsing, and observable metric registration.
src/internal/monitoring/cgroup-cpu-metrics.test.ts | Adds coverage for parsing and observable metric behavior (including throttled ratio guards).
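
The throttled ratio guards mentioned for the tests can be pictured with a small delta tracker like the one below. This is a sketch under assumptions, not the PR's code. `createThrottledRatio` is a hypothetical name, and reporting 0 for the first sample, for zero-period intervals, and after an apparent counter reset is an assumption about the desired behaviour.

```typescript
// Returns a function that turns cumulative (periods, throttled) counters
// into the fraction of periods throttled since the previous call.
function createThrottledRatio(): (periods: number, throttled: number) => number {
  let prev: { periods: number; throttled: number } | undefined;

  return (periods, throttled) => {
    const last = prev;
    prev = { periods, throttled };

    // First sample: there is no earlier reading to diff against.
    if (last === undefined) return 0;

    const deltaPeriods = periods - last.periods;
    const deltaThrottled = throttled - last.throttled;

    // No periods elapsed (would divide by zero), or the counters went
    // backwards (for example after a cgroup was recreated).
    if (deltaPeriods <= 0 || deltaThrottled < 0) return 0;

    return deltaThrottled / deltaPeriods;
  };
}

// Usage: one tracker instance per process, fed from each cpu.stat read.
const ratio = createThrottledRatio();
ratio(100, 10); // first sample, reports 0
ratio(200, 35); // 25 of the last 100 periods were throttled, reports 0.25
```

Because the ratio is computed from deltas between collections, its value depends on the export interval: a short interval shows bursts of throttling that a long interval would average out.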

Comment thread src/internal/monitoring/cgroup-cpu-metrics.ts
Comment thread src/internal/monitoring/cgroup-cpu-metrics.ts
@coveralls

coveralls commented May 20, 2026

Coverage Report for CI Build 26211996717

Coverage increased (+0.07%) to 74.943%

Details

  • Coverage increased (+0.07%) from the base build.
  • Patch coverage: 9 uncovered changes across 1 file (63 of 72 lines covered, 87.5%).
  • No coverage regressions found.

Uncovered Changes

File | Changed | Covered | %
src/internal/monitoring/cgroup-cpu-metrics.ts | 71 | 62 | 87.32%

Coverage Regressions

No coverage regressions found.


Coverage Stats

Relevant Lines: 10365
Covered Lines: 8181
Line Coverage: 78.93%
Relevant Branches: 5994
Covered Branches: 4079
Branch Coverage: 68.05%
Branches in Coverage %: Yes
Coverage Strength: 411.54 hits per line

@fenos fenos enabled auto-merge (squash) May 21, 2026 07:28
Register four observable instruments on the existing MeterProvider to
surface container CPU bandwidth-control state so GC pauses can be
correlated with throttling in our observability backend:

- process.cpu.cfs.periods (counter)
- process.cpu.cfs.throttled_periods (counter)
- process.cpu.cfs.throttled_time (counter, ns)
- process.cpu.cfs.throttled_ratio (gauge, delta-based)

Supports cgroup v1 and v2, short-circuits cleanly on non-Linux and when
cpu.stat is unreadable, and reads the file from the OTel observable
callback (no separate timer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
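
The commit message does not say how cgroup v1 and v2 are told apart. One common approach, sketched here purely as an assumption, is to check for `cgroup.controllers`, which exists only in a cgroup v2 (unified) hierarchy, and otherwise probe the usual v1 cpu controller mount points. `locateCpuStat` and `CpuStatLocation` are hypothetical names, and the paths assume the conventional `/sys/fs/cgroup` mount as seen from inside a container.

```typescript
type CgroupVersion = "v1" | "v2";

interface CpuStatLocation {
  version: CgroupVersion;
  path: string;
}

// `platform` and `exists` are injected so the logic can be exercised
// without touching the filesystem.
function locateCpuStat(
  platform: string,
  exists: (path: string) => boolean,
): CpuStatLocation | null {
  // CFS bandwidth control is a Linux feature; short-circuit elsewhere.
  if (platform !== "linux") return null;

  // cgroup.controllers is a cgroup v2 interface file.
  if (exists("/sys/fs/cgroup/cgroup.controllers")) {
    const path = "/sys/fs/cgroup/cpu.stat";
    return exists(path) ? { version: "v2", path } : null;
  }

  // Typical cgroup v1 mount points for the cpu controller.
  const v1Candidates = [
    "/sys/fs/cgroup/cpu/cpu.stat",
    "/sys/fs/cgroup/cpu,cpuacct/cpu.stat",
  ];
  for (const path of v1Candidates) {
    if (exists(path)) return { version: "v1", path };
  }
  return null;
}
```

In Node.js this could be called as `locateCpuStat(process.platform, fs.existsSync)` once at startup, with a `null` result meaning the metrics are not installed.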
@ferhatelmas ferhatelmas force-pushed the feat/cgroup-cfs-throttling-metrics branch from 1f318b8 to 6e4791e Compare May 21, 2026 07:29
@fenos fenos merged commit ca2e918 into supabase:master May 21, 2026
17 of 18 checks passed
5 participants