pkg/election: add keepalive tick interval metric, live TTL gauge#10649
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughElection lease metrics refactored: gauge-based local TTL replaced by a scrape-time collector; new histograms for keepalive request duration and tick interval added; leases register/unregister with the collector; Lease ID API made internal and propagated to keepalive, leadership, Grafana, and tests. ChangesLease Keepalive Metrics and TTL Tracking
Sequence Diagram(s)sequenceDiagram
participant Lease
participant Collector as LocalTTLRemainingCollector
participant Prometheus as PrometheusScraper
Lease->>Collector: register(lease) / unregister(lease)
Prometheus->>Collector: Collect()
Collector->>Collector: compute time.Until(lease.expireTime) per purpose
Collector->>Prometheus: emit local_ttl_remaining_seconds samples
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/retest |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #10649 +/- ##
==========================================
- Coverage 79.06% 79.03% -0.03%
==========================================
Files 535 535
Lines 72823 72929 +106
==========================================
+ Hits 57575 57637 +62
- Misses 11187 11217 +30
- Partials 4061 4075 +14
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
| Name: "keepalive_request_duration_seconds", | ||
| Help: "Duration of etcd Lease.KeepAliveOnce requests observed by PD, by purpose and result.", | ||
| Name: "keepalive_tick_interval_seconds", | ||
| Help: "Interval between consecutive iterations of the lease keepalive worker loop, by purpose. Spikes correlate with the `the interval between keeping alive lease is too long` warning.", | ||
| Buckets: prometheus.ExponentialBuckets(0.001, 2, 16), |
There was a problem hiding this comment.
The tick interval histogram is capped at ~32.768s. Is it enough?
There was a problem hiding this comment.
I think this is enough, because we can‘t make a lease renewal interval reach such a long time.
| type localTTLRemainingCollector struct { | ||
| desc *prometheus.Desc | ||
| mu sync.Mutex | ||
| leases map[string]*Lease // purpose -> active Lease |
There was a problem hiding this comment.
This collector keys active leases only by purpose, but one process can hold multiple active leases with the same purpose, for example TSO expected-primary leases across keyspace groups use "<service> expected primary" without the group ID. Later registrations overwrite earlier ones, so the collector does not actually emit every active lease.
There was a problem hiding this comment.
Currently, all purpose values that have observational significance have been changed to be unique.
Regarding the "expect primary" value you mentioned, it has already been filtered out from the Grafana and lacks any practical observational significance. Therefore, we can leave it as is for now.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/election/lease.go (1)
241-243: ⚡ Quick winInclude
purpose(and the measuredtick-interval) in the long-tick warning for consistency.This Warn is the only log statement in
keepAliveWorkerthat wasn't updated to includepurpose, while every other Info/Warn nearby now carries it. Since you already computetickIntervalimmediately above and the new histogram is per-purpose, propagating both into the log makes the warning correlatable with the new metric without grepping bylast-time.♻️ Proposed diff
- if tickInterval > interval*2 { - log.Warn("the interval between keeping alive lease is too long", zap.Time("last-time", lastTime)) - } + if tickInterval > interval*2 { + log.Warn("the interval between keeping alive lease is too long", + zap.String("purpose", l.purpose), + zap.Duration("tick-interval", tickInterval), + zap.Duration("expected-interval", interval), + zap.Time("last-time", lastTime)) + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/election/lease.go` around lines 241 - 243, The warning in keepAliveWorker currently logs only lastTime when tickInterval > interval*2; update the log.Warn call to include purpose and the measured tickInterval (the tickInterval variable) alongside lastTime so it matches nearby logs and the per-purpose histogram. Locate the check using tickInterval, interval, lastTime and change the log.Warn invocation (log.Warn(...)) to add zap.String("purpose", purpose) and zap.Duration("tick-interval", tickInterval) to the structured fields.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/election/lease.go`:
- Around line 241-243: The warning in keepAliveWorker currently logs only
lastTime when tickInterval > interval*2; update the log.Warn call to include
purpose and the measured tickInterval (the tickInterval variable) alongside
lastTime so it matches nearby logs and the per-purpose histogram. Locate the
check using tickInterval, interval, lastTime and change the log.Warn invocation
(log.Warn(...)) to add zap.String("purpose", purpose) and
zap.Duration("tick-interval", tickInterval) to the structured fields.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4435061c-7a59-48b9-8195-c52cc03c592e
📥 Commits
Reviewing files that changed from the base of the PR and between 69bc1d5b94c23f78a3e26d8a733014ac8aec9b54 and d93393639e108b5263011ba45953f593105b5ec6.
📒 Files selected for processing (5)
pkg/election/leadership.gopkg/election/lease.gopkg/election/metrics.gopkg/election/metrics_test.gopkg/mcs/utils/expected_primary.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/election/metrics.go
|
@YuhaoZhang00: adding LGTM is restricted to approvers and reviewers in OWNERS files. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/test pull-unit-test-next-gen-2 |
|
/test pull-unit-test-next-gen-3 |
2 similar comments
|
/test pull-unit-test-next-gen-3 |
|
/test pull-unit-test-next-gen-3 |
|
/retest |
- Add pd_lease_keepalive_tick_interval_seconds histogram in keepAliveWorker, recording start.Sub(lastTime) on every loop iteration. Same source signal as the 'the interval between keeping alive lease is too long' warning, but exposed as a quantitative, per-purpose distribution. - Replace the pd_lease_local_ttl_remaining_seconds GaugeVec with a custom Prometheus collector that computes time.Until(expireTime) at scrape time. Active leases register on Grant and unregister on Close, so the metric reflects live lease decay (sawtooth between renewals) instead of being pinned at leaseTimeout. Closed leases no longer emit phantom zero series. - Grafana: add 'Lease keepalive tick interval' panel to the Leader row (P99 by job/purpose); shuffle 'Lease renewal terminations' to keep the layout grid clean. ref tikv#9389 Signed-off-by: JmPotato <github@ipotato.me>
- Make `Lease.Purpose`/`Lease.ID` private (`purpose`/`id`); add a public `GetID()` accessor that hides the underlying `atomic.Value`, and a `setID` helper. Update `pkg/mcs/utils` to use `GetID()`. - In Close(), unregister from the metrics collector before resetting state and revoking the lease so the order matches Grant(). - Add register/unregister logging (with lease-id) in `localTTLRemainingCollector` to make duplicate or stale entries observable, and add `TestLocalTTLRemainingCollector` covering the per-purpose dedup/skip paths. - Add lease-id to keepalive/close/revoke logs and split long zap lines for readability. ref tikv#9389 Signed-off-by: JmPotato <github@ipotato.me>
Previously `KeepExpectedPrimaryAlive` derived the lease purpose only from `msParam.ServiceName`, so a single TSO process serving N keyspace groups ended up granting N leases that all shared the purpose "tso expected primary". This made the new `localTTLRemainingCollector` drop all but the first lease (the `skipped registering a duplicate lease` warning fires once per extra group) and collapsed N keepalive workers onto one Prometheus series, so per-keyspace-group expected-primary observability was effectively blind. - `pkg/mcs/utils`: append `%05d` group id to the purpose for the TSO service. Scheduling and resource manager keep the existing `<service> expected primary` string since they only ever own one expected primary lease, so the suffix would be pure noise. - `pkg/election`: enrich the existing 'interval between keeping alive lease is too long' warning with the lease purpose and observed/expected intervals; without these fields the warning was untraceable to a specific lease. - `metrics/grafana/pd.json`: relax the `purpose!~".+ expected primary"` filter to `purpose!~".*expected primary.*"` so the 5 lease panels still exclude expected primary leases under the new suffixed format. Verified end-to-end with a microservice playground (4 keyspace groups split out): each TSO keyspace group now has its own `tso expected primary 0000X` series, no more duplicate-register warnings, and the Grafana filter still excludes all expected primary variants. Signed-off-by: JmPotato <github@ipotato.me>
Address review comment from @okJiang: drop sub-10ms granularity for the keepalive request duration and tick interval histograms, since neither metric is expected to land in that range. Use ExponentialBuckets(0.01, 2, 13), covering 10ms to ~40s, which comfortably spans the typical leaseTimeout (5s) and a 30s upper bound for outlier spikes. Signed-off-by: JmPotato <github@ipotato.me>
Signed-off-by: JmPotato <github@ipotato.me>
Previously `lastTime` was only set once on the first iteration, so every subsequent tickInterval computed `start − firstStart` instead of `start − previousStart`. After a few minutes all observations exceeded the histogram's largest bucket (40.96s), pinning the P99 at ~41s. Now `lastTime` is updated on every iteration and the first tick is skipped to avoid recording the zero-time delta. Also update Grafana panel descriptions for the lease metrics rows so on-call engineers can interpret them without reading the code. Signed-off-by: JmPotato <ghzpotato@gmail.com> Signed-off-by: JmPotato <github@ipotato.me>
26b0b8d to
9f7a53d
Compare
…C latency Signed-off-by: JmPotato <github@ipotato.me>
Use atomic.Value.Swap for lastTime to eliminate the race between concurrently running KeepAliveOnce goroutines. Separate tick boundary time (start) from actual request time (requestStart) for clearer semantics. Use logger.Warn/Error consistently instead of bare log to retain purpose/lease-id/interval fields. Update tick interval metric description to match the new measurement semantics. Signed-off-by: JmPotato <ghzpotato@gmail.com> Signed-off-by: JmPotato <github@ipotato.me>
|
/retest |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: lhy1024, okJiang, YuhaoZhang00 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
|
/test pull-unit-test-next-gen-3 |
1 similar comment
|
/test pull-unit-test-next-gen-3 |
…v#10649) ref tikv#10653\n\n- Add `pd_lease_keepalive_tick_interval_seconds` histogram in `keepAliveWorker`, recording `start.Sub(lastTime)` on every loop iteration. Same source signal as the `the interval between keeping alive lease is too long` warning, but exposed as a quantitative, per-`purpose` distribution. - Replace the `pd_lease_local_ttl_remaining_seconds` GaugeVec with a custom Prometheus collector that computes `time.Until(expireTime)` at scrape time. Active leases register on `Grant` and unregister on `Close`, so the metric reflects live lease decay (sawtooth between renewals) instead of being pinned at `leaseTimeout`. Closed leases no longer emit phantom zero series. - Grafana: add `Lease keepalive tick interval` panel to the Leader row (P99 by job/purpose); shuffle `Lease renewal terminations` to keep the layout grid clean.\n\nSigned-off-by: JmPotato <github@ipotato.me>\nSigned-off-by: JmPotato <ghzpotato@gmail.com> (cherry picked from commit f6653ed) Signed-off-by: JmPotato <github@ipotato.me>


What problem does this PR solve?
Issue Number: ref #10653.
The lease keepalive worker emits a
the interval between keeping alive lease is too longwarning when its main-loop tick interval exceeds2 × interval(seepkg/election/lease.go). This signal is currently only observable via log scraping (e.g. Loki), which makes it hard to alert on, hard to compare across keyspace groups, and impossible to surface in Grafana.The existing
pd_lease_local_ttl_remaining_secondsgauge is also misleading: it is onlySet()on each keepalive response, so every Prometheus scrape sees a value pinned at ~leaseTimeout. Real lease decay between renewals is invisible, defeating the original intent of the metric.What is changed and how does it work?
Check List
Tests
Manual test: started a
tiup playgroundcluster with PD microservices mode (--pd.mode ms --tso 2 --scheduling 1 --resource-manager 1) and 3 keyspace groups. Verified thatpd_lease_local_ttl_remaining_secondssamples vary in[3.33s, 5s]across consecutive 1s scrapes (sawtooth of live lease decay), andpd_lease_keepalive_tick_interval_secondsbuckets are populated on every PD/TSO instance.Side effects
pd_lease_local_ttl_remaining_secondsis preserved for dashboard backward compatibility; the value semantics shift from "frozen at TTL" to "live remaining TTL", which strictly improves observability without affecting any existing< thresholdalerts (steady-state minimum is stillleaseTimeout - leaseTimeout/3).Release note
Summary by CodeRabbit
New Features
Improvements
Tests
Refactor