pkg/election: add keepalive tick interval metric, live TTL gauge #10649

Merged
ti-chi-bot[bot] merged 8 commits into tikv:master from JmPotato:pkg-election-tick-interval-live-ttl
May 12, 2026
Conversation

@JmPotato
Member

@JmPotato JmPotato commented May 9, 2026

What problem does this PR solve?

Issue Number: ref #10653.

The lease keepalive worker emits a `the interval between keeping alive lease is too long` warning when its main-loop tick interval exceeds 2 × interval (see `pkg/election/lease.go`). This signal is currently observable only via log scraping (e.g. Loki), which makes it hard to alert on, hard to compare across keyspace groups, and impossible to surface in Grafana.

The existing `pd_lease_local_ttl_remaining_seconds` gauge is also misleading: it is updated via `Set()` only on each keepalive response, so every Prometheus scrape sees a value pinned at roughly `leaseTimeout`. The real lease decay between renewals is invisible, which defeats the original intent of the metric.

What is changed and how does it work?

- Add `pd_lease_keepalive_tick_interval_seconds` histogram in
  `keepAliveWorker`, recording `start.Sub(lastTime)` on every loop
  iteration. Same source signal as the `the interval between keeping
  alive lease is too long` warning, but exposed as a quantitative,
  per-`purpose` distribution.
- Replace the `pd_lease_local_ttl_remaining_seconds` GaugeVec with a
  custom Prometheus collector that computes `time.Until(expireTime)`
  at scrape time. Active leases register on `Grant` and unregister on
  `Close`, so the metric reflects live lease decay (sawtooth between
  renewals) instead of being pinned at `leaseTimeout`. Closed leases
  no longer emit phantom zero series.
- Grafana: add `Lease keepalive tick interval` panel to the Leader row
  (P99 by job/purpose); shuffle `Lease renewal terminations` to keep
  the layout grid clean.
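
The scrape-time idea in the second bullet can be sketched with the standard library alone. This is only an illustration under stated assumptions: the `registry` and `lease` types, the `at` helper, and the "leader election" purpose string are hypothetical stand-ins rather than the code in `pkg/election/metrics.go`, and `collect` takes `now` as a parameter (where a real collector would call `time.Until(expireTime)` inside its `Collect` method) purely to keep the example deterministic.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// lease is a hypothetical stand-in for an election lease; only the fields
// needed for the TTL computation are modeled.
type lease struct {
	purpose    string
	expireTime time.Time
}

// registry tracks active leases by purpose (illustrative, not PD's type).
type registry struct {
	mu     sync.Mutex
	leases map[string]*lease // purpose -> active lease
}

func newRegistry() *registry {
	return &registry{leases: make(map[string]*lease)}
}

// register is what a lease would call after a successful grant.
func (r *registry) register(l *lease) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.leases[l.purpose] = l
}

// unregister is what a lease would call on close, so that a closed lease
// stops producing samples instead of reporting a stale or zero value.
func (r *registry) unregister(l *lease) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.leases[l.purpose] == l {
		delete(r.leases, l.purpose)
	}
}

// collect computes the remaining TTL in seconds at call time, which is the
// role a Prometheus collector's Collect method would play on each scrape.
func (r *registry) collect(now time.Time) map[string]float64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make(map[string]float64, len(r.leases))
	for purpose, l := range r.leases {
		out[purpose] = l.expireTime.Sub(now).Seconds()
	}
	return out
}

// at returns a fixed timestamp sec seconds after the Unix epoch so the
// example does not depend on the wall clock.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	r := newRegistry()
	l := &lease{purpose: "leader election", expireTime: at(5)}
	r.register(l)
	for _, sec := range []int64{1, 2, 3} {
		fmt.Printf("scrape at t=%ds: %.0fs remaining\n", sec, r.collect(at(sec))[l.purpose])
	}
	r.unregister(l)
	fmt.Println("series after close:", len(r.collect(at(4))))
}
```

Because the value is derived at read time, successive scrapes see the TTL decay even when no keepalive response arrives in between, which a gauge updated only on responses cannot show.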

Check List

Tests

  • Unit test
  • Manual test (add detailed scripts or steps below)

Manual test: started a tiup playground cluster in PD microservices mode (`--pd.mode ms --tso 2 --scheduling 1 --resource-manager 1`) with 3 keyspace groups. Verified that `pd_lease_local_ttl_remaining_seconds` samples vary within [3.33s, 5s] across consecutive 1s scrapes (the sawtooth of live lease decay), and that `pd_lease_keepalive_tick_interval_seconds` buckets are populated on every PD/TSO instance.

Side effects

  • None. The metric name `pd_lease_local_ttl_remaining_seconds` is preserved for dashboard backward compatibility. The value semantics shift from "frozen at TTL" to "live remaining TTL", which strictly improves observability without affecting any existing `< threshold` alerts (the steady-state minimum is still `leaseTimeout - leaseTimeout/3`).
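
The steady-state bound quoted above can be checked with a few lines of arithmetic. The sketch assumes a renewal fires every `leaseTimeout/3` and instantly resets the remaining TTL to `leaseTimeout`, ignoring request latency; `ttlBounds` is a hypothetical helper, not a function in PD.

```go
package main

import "fmt"

// ttlBounds returns the lowest and highest remaining TTL (in seconds) that a
// scrape could observe in steady state, assuming a renewal fires every
// leaseTimeout/3 and resets the remaining TTL to leaseTimeout instantly.
func ttlBounds(leaseTimeout float64) (lo, hi float64) {
	renewEvery := leaseTimeout / 3
	return leaseTimeout - renewEvery, leaseTimeout
}

func main() {
	// A 5s lease timeout, consistent with the [3.33s, 5s] range seen in the manual test.
	lo, hi := ttlBounds(5)
	fmt.Printf("remaining TTL stays within [%.2fs, %.2fs]\n", lo, hi)
}
```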

Release note

None.

Summary by CodeRabbit

  • New Features

    • Added a "Lease keepalive tick interval" panel to the Grafana dashboard.
    • Added histograms for keepalive request durations and tick intervals.
  • Improvements

    • Switched to scrape-time TTL reporting for active leases for more accurate remaining-TTL metrics.
    • Dashboard queries updated to consistently exclude "expected primary" series.
  • Tests

    • Added tests validating the TTL reporting collector behavior.
  • Refactor

    • Internal lease handling updated to support the new metrics and instrumentation.

@ti-chi-bot ti-chi-bot Bot added dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 9, 2026
@coderabbitai

coderabbitai Bot commented May 9, 2026

Walkthrough

Election lease metrics refactored: gauge-based local TTL replaced by a scrape-time collector; new histograms for keepalive request duration and tick interval added; leases register/unregister with the collector; Lease ID API made internal and propagated to keepalive, leadership, Grafana, and tests.

Changes

Lease Keepalive Metrics and TTL Tracking

| Layer | File(s) | Summary |
|---|---|---|
| Metric definitions & Custom Collector | `pkg/election/metrics.go` | Adds `keepalive_tick_interval_seconds` and `keepalive_request_duration_seconds` histograms; replaces gauge `local_ttl_remaining_seconds` with `LocalTTLRemainingCollector` that computes TTL at scrape time from lease expire times. |
| Lease struct API changes | `pkg/election/lease.go` | Make `Purpose`/`ID` unexported (`purpose`/`id`); add internal `setID` and `GetID`; update `NewLease`. |
| Lease Grant / Close lifecycle | `pkg/election/lease.go` | `Grant()` stores ID via `setID`, sets `expireTime`, and registers the lease with the TTL collector; `Close()` unregisters before clearing `expireTime` and revokes using `GetID()`. |
| KeepAlive response & timeout paths | `pkg/election/lease.go` | Remove direct TTL observations from KeepAlive success and timeout branches; update deferred logging to use unexported `purpose`. |
| Keepalive worker instrumentation & goroutine | `pkg/election/lease.go` | Compute and record `tickInterval` once per loop, use it for the interval-too-long check, and obtain lease ID via `GetID()` for keepalive calls. |
| Leadership lease usage update | `pkg/election/leadership.go` | Attach lease to leader key using `newLease.GetID()` in Campaign transaction. |
| Grafana dashboard update | `metrics/grafana/pd.json` | Update PromQL purpose filters, reposition "Lease renewal terminations" panel, and add "Lease keepalive tick interval" panel showing the 0.99 quantile of the tick-interval histogram. |
| Collector unit tests | `pkg/election/metrics_test.go` | Adds tests exercising `LocalTTLRemainingCollector` register/unregister behavior and test-lease helper. |
| Expected-primary utils | `pkg/mcs/utils/expected_primary.go` | Use `lease.GetID()` when marking expected-primary flag. |

Sequence Diagram(s)

sequenceDiagram
  participant Lease
  participant Collector as LocalTTLRemainingCollector
  participant Prometheus as PrometheusScraper
  Lease->>Collector: register(lease) / unregister(lease)
  Prometheus->>Collector: Collect()
  Collector->>Collector: compute time.Until(lease.expireTime) per purpose
  Collector->>Prometheus: emit local_ttl_remaining_seconds samples

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

  • tikv/pd#10622: Modifies pkg/election metrics and lease keepalive instrumentation; strongly related.

Suggested labels

size/XXL, lgtm

Suggested reviewers

  • okJiang
  • lhy1024
  • rleungx

Poem

🐰 I count the ticks between each little lease,

Histograms hum softly as monitors cease,
The collector watches expiry on cue,
Keepalives steady, IDs tucked true,
Dashboards wink — the metrics snug and new.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main changes: adding keepalive tick interval metric and converting TTL gauge to a live metric. |
| Description check | ✅ Passed | The PR description is comprehensive, covering the problem statement (issue reference), detailed explanation of changes with commit message, test coverage, and side effects analysis. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@ti-chi-bot ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 9, 2026
@JmPotato
Member Author

JmPotato commented May 9, 2026

/retest

@codecov

codecov Bot commented May 9, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.03%. Comparing base (f4813f3) to head (a261601).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10649      +/-   ##
==========================================
- Coverage   79.06%   79.03%   -0.03%     
==========================================
  Files         535      535              
  Lines       72823    72929     +106     
==========================================
+ Hits        57575    57637      +62     
- Misses      11187    11217      +30     
- Partials     4061     4075      +14     
Flag Coverage Δ
unittests 79.03% <83.33%> (-0.03%) ⬇️

Comment thread pkg/election/metrics.go Outdated
Name: "keepalive_request_duration_seconds",
Help: "Duration of etcd Lease.KeepAliveOnce requests observed by PD, by purpose and result.",
Name: "keepalive_tick_interval_seconds",
Help: "Interval between consecutive iterations of the lease keepalive worker loop, by purpose. Spikes correlate with the `the interval between keeping alive lease is too long` warning.",
Buckets: prometheus.ExponentialBuckets(0.001, 2, 16),
Contributor

The tick interval histogram is capped at ~32.768s. Is it enough?

Member Author

I think this is enough, because a lease renewal interval can't realistically get that long.
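
The ~32.768s figure follows directly from the bucket parameters. The sketch below re-implements the arithmetic of `prometheus.ExponentialBuckets` (each bound is the previous one times the factor) with a local, hypothetical `exponentialBuckets` helper so that it runs without the client library.

```go
package main

import "fmt"

// exponentialBuckets mirrors the arithmetic of prometheus.ExponentialBuckets:
// it returns count upper bounds, beginning at start, each one factor times
// the previous bound.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	b := exponentialBuckets(0.001, 2, 16)
	// The last finite bound is 0.001 * 2^15; anything larger lands in +Inf.
	fmt.Printf("largest finite bucket: %.3fs\n", b[len(b)-1])
}
```

Observations above the largest finite bound are still counted, but only in the `+Inf` bucket, so quantile estimates saturate at that bound.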

Comment thread pkg/election/metrics.go
type localTTLRemainingCollector struct {
desc *prometheus.Desc
mu sync.Mutex
leases map[string]*Lease // purpose -> active Lease
Contributor

This collector keys active leases only by purpose, but one process can hold multiple active leases with the same purpose, for example TSO expected-primary leases across keyspace groups use "<service> expected primary" without the group ID. Later registrations overwrite earlier ones, so the collector does not actually emit every active lease.

Member Author

Currently, all purpose values that have observational significance have been made unique.

As for the "expected primary" value you mentioned, it is already filtered out of the Grafana panels and has no practical observational significance, so we can leave it as is for now.
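
The collision the reviewer describes is ordinary map semantics. A minimal illustration, assuming a registry keyed only by the purpose string; the `activeByPurpose` type and the lease IDs 101 and 102 are invented for the example:

```go
package main

import "fmt"

// activeByPurpose is a hypothetical registry keyed only by purpose, like the
// map in the struct under review; the int64 value stands in for a lease ID.
type activeByPurpose map[string]int64

// register stores the lease under its purpose, silently replacing any
// earlier lease that used the same purpose string.
func (m activeByPurpose) register(purpose string, leaseID int64) {
	m[purpose] = leaseID
}

func main() {
	m := activeByPurpose{}
	m.register("tso expected primary", 101) // e.g. one keyspace group
	m.register("tso expected primary", 102) // another group, same purpose
	fmt.Println("tracked leases:", len(m), "surviving ID:", m["tso expected primary"])
}
```

Two active leases collapse into one entry, so only one of them can ever be reported at scrape time unless the purpose strings are made unique.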

@ti-chi-bot ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 9, 2026
@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
pkg/election/lease.go (1)

241-243: ⚡ Quick win

Include purpose (and the measured tick-interval) in the long-tick warning for consistency.

This Warn is the only log statement in keepAliveWorker that wasn't updated to include purpose, while every other Info/Warn nearby now carries it. Since you already compute tickInterval immediately above and the new histogram is per-purpose, propagating both into the log makes the warning correlatable with the new metric without grepping by last-time.

♻️ Proposed diff
-			if tickInterval > interval*2 {
-				log.Warn("the interval between keeping alive lease is too long", zap.Time("last-time", lastTime))
-			}
+			if tickInterval > interval*2 {
+				log.Warn("the interval between keeping alive lease is too long",
+					zap.String("purpose", l.purpose),
+					zap.Duration("tick-interval", tickInterval),
+					zap.Duration("expected-interval", interval),
+					zap.Time("last-time", lastTime))
+			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/election/lease.go` around lines 241 - 243, The warning in keepAliveWorker
currently logs only lastTime when tickInterval > interval*2; update the log.Warn
call to include purpose and the measured tickInterval (the tickInterval
variable) alongside lastTime so it matches nearby logs and the per-purpose
histogram. Locate the check using tickInterval, interval, lastTime and change
the log.Warn invocation (log.Warn(...)) to add zap.String("purpose", purpose)
and zap.Duration("tick-interval", tickInterval) to the structured fields.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4435061c-7a59-48b9-8195-c52cc03c592e

📥 Commits

Reviewing files that changed from the base of the PR and between 69bc1d5b94c23f78a3e26d8a733014ac8aec9b54 and d93393639e108b5263011ba45953f593105b5ec6.

📒 Files selected for processing (5)
  • pkg/election/leadership.go
  • pkg/election/lease.go
  • pkg/election/metrics.go
  • pkg/election/metrics_test.go
  • pkg/mcs/utils/expected_primary.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/election/metrics.go

@ti-chi-bot
Contributor

ti-chi-bot Bot commented May 11, 2026

@YuhaoZhang00: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Comment thread pkg/election/metrics.go Outdated
Comment thread pkg/election/metrics.go Outdated
@JmPotato
Member Author

/test pull-unit-test-next-gen-2

@okJiang
Member

okJiang commented May 11, 2026

“Lease keepalive response interval” and “Lease keepalive tick interval” look exactly the same. Are they duplicates?

@JmPotato
Member Author

JmPotato commented May 11, 2026

> “Lease keepalive response interval” and “Lease keepalive tick interval” look exactly the same. Are they duplicates?

These two metrics measure different points in the keepalive pipeline:

  1. Tick interval — the interval between consecutive KeepAliveOnce requests being dispatched to etcd (recorded at the top of the keepAliveWorker loop).
  2. Response interval — the interval between receiving successful renewal responses in the main keepalive loop.

Under normal conditions they are nearly identical. Under runtime pressure (scheduler delay, etcd latency, GC pauses), the gap widens — tick interval captures how often we attempt renewal, while response interval captures how often renewal actually completes.

@JmPotato JmPotato requested a review from okJiang May 11, 2026 07:13
@JmPotato
Member Author

/test pull-unit-test-next-gen-3

@JmPotato
Member Author

/retest

@okJiang
Member

okJiang commented May 11, 2026

> Tick interval: the interval between consecutive KeepAliveOnce requests being dispatched to etcd (recorded at the top of the keepAliveWorker loop).

If the metric is meant to be "the interval between consecutive KeepAliveOnce requests being dispatched to etcd", should the timestamp be taken right before KeepAliveOnce, to avoid the impact of goroutine scheduling?

> Under normal conditions they are nearly identical. Under runtime pressure (scheduler delay, etcd latency, GC pauses), the gap widens: tick interval captures how often we attempt renewal, while response interval captures how often renewal actually completes.

Perhaps you could add descriptions suggesting when each one should be used, rather than just explaining what they are, so that an on-call engineer can understand them without having to pore over the code. Their names are too easy to confuse.

JmPotato added 2 commits May 11, 2026 17:38
- Add pd_lease_keepalive_tick_interval_seconds histogram in
  keepAliveWorker, recording start.Sub(lastTime) on every loop
  iteration. Same source signal as the 'the interval between keeping
  alive lease is too long' warning, but exposed as a quantitative,
  per-purpose distribution.
- Replace the pd_lease_local_ttl_remaining_seconds GaugeVec with a
  custom Prometheus collector that computes time.Until(expireTime)
  at scrape time. Active leases register on Grant and unregister on
  Close, so the metric reflects live lease decay (sawtooth between
  renewals) instead of being pinned at leaseTimeout. Closed leases
  no longer emit phantom zero series.
- Grafana: add 'Lease keepalive tick interval' panel to the Leader row
  (P99 by job/purpose); shuffle 'Lease renewal terminations' to keep
  the layout grid clean.

ref tikv#9389

Signed-off-by: JmPotato <github@ipotato.me>
- Make `Lease.Purpose`/`Lease.ID` private (`purpose`/`id`); add a public
  `GetID()` accessor that hides the underlying `atomic.Value`, and a
  `setID` helper. Update `pkg/mcs/utils` to use `GetID()`.
- In Close(), unregister from the metrics collector before resetting
  state and revoking the lease so the order matches Grant().
- Add register/unregister logging (with lease-id) in
  `localTTLRemainingCollector` to make duplicate or stale entries
  observable, and add `TestLocalTTLRemainingCollector` covering the
  per-purpose dedup/skip paths.
- Add lease-id to keepalive/close/revoke logs and split long zap
  lines for readability.

ref tikv#9389

Signed-off-by: JmPotato <github@ipotato.me>
JmPotato added 4 commits May 11, 2026 17:38
Previously `KeepExpectedPrimaryAlive` derived the lease purpose only from
`msParam.ServiceName`, so a single TSO process serving N keyspace groups
ended up granting N leases that all shared the purpose
"tso expected primary". This made the new
`localTTLRemainingCollector` drop all but the first lease (the
`skipped registering a duplicate lease` warning fires once per extra
group) and collapsed N keepalive workers onto one Prometheus series, so
per-keyspace-group expected-primary observability was effectively
blind.

- `pkg/mcs/utils`: append `%05d` group id to the purpose for the TSO
  service. Scheduling and resource manager keep the existing
  `<service> expected primary` string since they only ever own one
  expected primary lease, so the suffix would be pure noise.
- `pkg/election`: enrich the existing 'interval between keeping alive
  lease is too long' warning with the lease purpose and observed/expected
  intervals; without these fields the warning was untraceable to a
  specific lease.
- `metrics/grafana/pd.json`: relax the `purpose!~".+ expected primary"`
  filter to `purpose!~".*expected primary.*"` so the 5 lease panels still
  exclude expected primary leases under the new suffixed format.

Verified end-to-end with a microservice playground (4 keyspace groups
split out): each TSO keyspace group now has its own
`tso expected primary 0000X` series, no more duplicate-register
warnings, and the Grafana filter still excludes all expected primary
variants.

Signed-off-by: JmPotato <github@ipotato.me>
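
Why the Grafana filter had to be relaxed can be reproduced with Go's `regexp` package, which uses the same RE2 syntax as Prometheus. This is a sketch under assumptions: the purpose format `tso expected primary %05d` is inferred from the commit message and the series name quoted above, and both helpers are hypothetical. Prometheus anchors label-matcher regexes at both ends, which `promMatch` emulates by wrapping the pattern in `^(?:...)$`.

```go
package main

import (
	"fmt"
	"regexp"
)

// tsoExpectedPrimaryPurpose builds a per-keyspace-group purpose label.
// The exact format is an assumption based on the commit message.
func tsoExpectedPrimaryPurpose(groupID uint32) string {
	return fmt.Sprintf("tso expected primary %05d", groupID)
}

// promMatch reports whether pattern matches the whole value, emulating the
// full anchoring that Prometheus applies to =~ and !~ label matchers.
func promMatch(pattern, value string) bool {
	return regexp.MustCompile("^(?:" + pattern + ")$").MatchString(value)
}

func main() {
	purpose := tsoExpectedPrimaryPurpose(1)
	fmt.Println("purpose:", purpose)
	// The old pattern requires the label to end in "expected primary".
	fmt.Println("old filter matches:", promMatch(".+ expected primary", purpose))
	// The relaxed pattern tolerates the group-id suffix.
	fmt.Println("new filter matches:", promMatch(".*expected primary.*", purpose))
}
```

With `!~`, a series is excluded only when the pattern matches, so the old pattern would have let the suffixed expected-primary series back into the lease panels.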
Address review comment from @okJiang: drop sub-10ms granularity for the
keepalive request duration and tick interval histograms, since neither
metric is expected to land in that range. Use ExponentialBuckets(0.01,
2, 13), covering 10ms to ~40s, which comfortably spans the typical
leaseTimeout (5s) and a 30s upper bound for outlier spikes.

Signed-off-by: JmPotato <github@ipotato.me>
Previously `lastTime` was only set once on the first iteration, so
every subsequent tickInterval computed `start − firstStart` instead of
`start − previousStart`. After a few minutes all observations exceeded
the histogram's largest bucket (40.96s), pinning the P99 at ~41s.

Now `lastTime` is updated on every iteration and the first tick is
skipped to avoid recording the zero-time delta.

Also update Grafana panel descriptions for the lease metrics rows so
on-call engineers can interpret them without reading the code.

Signed-off-by: JmPotato <ghzpotato@gmail.com>
Signed-off-by: JmPotato <github@ipotato.me>
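
The effect of the `lastTime` bug is easy to see on a fixed sequence of tick times. The function below is a hypothetical model of the loop, not the worker itself: with `updateLast` set to false it keeps `lastTime` at the first tick, as described for the old behavior, and with true it advances `lastTime` on every iteration.

```go
package main

import "fmt"

// tickIntervals returns the interval observed at each tick after the first,
// given tick start times in seconds. With updateLast=false it models the
// described bug, where lastTime stays at the first tick forever.
func tickIntervals(starts []float64, updateLast bool) []float64 {
	var intervals []float64
	var lastTime float64
	for i, start := range starts {
		if i == 0 {
			lastTime = start // first tick: nothing to measure yet
			continue
		}
		intervals = append(intervals, start-lastTime)
		if updateLast {
			lastTime = start
		}
	}
	return intervals
}

func main() {
	starts := []float64{0, 1, 2, 3, 4} // one tick per second
	fmt.Println("buggy:", tickIntervals(starts, false))
	fmt.Println("fixed:", tickIntervals(starts, true))
}
```

In the buggy variant each observation equals the time elapsed since the worker started, so it grows without bound and eventually exceeds every finite histogram bucket.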
@JmPotato JmPotato force-pushed the pkg-election-tick-interval-live-ttl branch from 26b0b8d to 9f7a53d Compare May 11, 2026 11:50
@JmPotato
Member Author

My latest commit includes two changes:

  1. Improved the descriptions for each lease-related panel so that even those unfamiliar with the codebase can understand the meaning of each monitor.
  2. Fixed the measurement scope so that the metric accurately records the interval between consecutive KeepAliveOnce calls.

JmPotato added 2 commits May 11, 2026 20:00
…C latency

Signed-off-by: JmPotato <github@ipotato.me>
Use atomic.Value.Swap for lastTime to eliminate the race between
concurrently running KeepAliveOnce goroutines. Separate tick boundary
time (start) from actual request time (requestStart) for clearer
semantics. Use logger.Warn/Error consistently instead of bare log
to retain purpose/lease-id/interval fields. Update tick interval
metric description to match the new measurement semantics.

Signed-off-by: JmPotato <ghzpotato@gmail.com>
Signed-off-by: JmPotato <github@ipotato.me>
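
One way to read the `atomic.Value.Swap` change is sketched below. The commit message only states that `Swap` is used for `lastTime`; the `tickTracker` type, its `observe` method, and the `at` helper are assumptions made for illustration. `Swap` (available since Go 1.17) stores the new timestamp and returns the previous one in a single atomic step, so two goroutines can never both read the same old value.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// tickTracker remembers when the previous tick happened (hypothetical type).
type tickTracker struct {
	last atomic.Value // always holds a time.Time once set
}

// observe atomically replaces the stored tick time with now and returns the
// time elapsed since the previous tick. ok is false on the very first call,
// when there is no previous tick to compare against.
func (t *tickTracker) observe(now time.Time) (interval time.Duration, ok bool) {
	prev, ok := t.last.Swap(now).(time.Time)
	if !ok {
		return 0, false
	}
	return now.Sub(prev), true
}

// at returns a fixed timestamp sec seconds after the Unix epoch so the
// example does not depend on the wall clock.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	var tracker tickTracker
	for _, sec := range []int64{0, 1, 4} {
		if d, ok := tracker.observe(at(sec)); ok {
			fmt.Println("tick interval:", d)
		}
	}
}
```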
@JmPotato
Member Author

/retest

@ti-chi-bot
Contributor

ti-chi-bot Bot commented May 12, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lhy1024, okJiang, YuhaoZhang00

@ti-chi-bot ti-chi-bot Bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels May 12, 2026
@ti-chi-bot
Contributor

ti-chi-bot Bot commented May 12, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-05-09 10:56:07.997761909 +0000 UTC m=+9005.223228583: ☑️ agreed by lhy1024.
  • 2026-05-12 06:55:59.971745884 +0000 UTC m=+162328.504525223: ☑️ agreed by okJiang.

@JmPotato
Member Author

/retest

@JmPotato
Member Author

/test pull-unit-test-next-gen-3

@ti-chi-bot ti-chi-bot Bot merged commit f6653ed into tikv:master May 12, 2026
50 of 54 checks passed
@JmPotato JmPotato deleted the pkg-election-tick-interval-live-ttl branch May 12, 2026 11:21
JmPotato added a commit to JmPotato/pd that referenced this pull request May 18, 2026
…v#10649)

ref tikv#10653

- Add `pd_lease_keepalive_tick_interval_seconds` histogram in
  `keepAliveWorker`, recording `start.Sub(lastTime)` on every loop
  iteration. Same source signal as the `the interval between keeping
  alive lease is too long` warning, but exposed as a quantitative,
  per-`purpose` distribution.
- Replace the `pd_lease_local_ttl_remaining_seconds` GaugeVec with a
  custom Prometheus collector that computes `time.Until(expireTime)`
  at scrape time. Active leases register on `Grant` and unregister on
  `Close`, so the metric reflects live lease decay (sawtooth between
  renewals) instead of being pinned at `leaseTimeout`. Closed leases
  no longer emit phantom zero series.
- Grafana: add `Lease keepalive tick interval` panel to the Leader row
  (P99 by job/purpose); shuffle `Lease renewal terminations` to keep
  the layout grid clean.

Signed-off-by: JmPotato <github@ipotato.me>
Signed-off-by: JmPotato <ghzpotato@gmail.com>

(cherry picked from commit f6653ed)
Signed-off-by: JmPotato <github@ipotato.me>
Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

4 participants