tso, server: add debug logs for TSO sync, closure, and forwarding paths #10439
ti-chi-bot[bot] merged 10 commits into tikv:master
Conversation
Add observability to TSO operations that were previously silent:

- Log each successful etcd timestamp save on the normal TSO update path (debug)
- Clarify the Reset(true) path logs to indicate leadership resignation
- Distinguish embedded TSO allocator closure from microservice Close()
- Log when TSO dynamic switching is disabled and the service check is skipped
- Log when PD forwards a TSO stream to the independent TSO service

Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
Hi @YuhaoZhang00. Thanks for your PR. I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Welcome @YuhaoZhang00!
Note: Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings. Use the following commands to manage reviews:
📝 Walkthrough

Adds structured persistence logs for TSO operations, tracks TSO dynamic-switching state with an atomic in RaftCluster to suppress redundant messages and skip checks when disabled, increments a new Prometheus counter when forwarding TSO streams, and updates the Grafana layout with a panel for forwarded TSO streams.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant GrpcServer
    participant RaftCluster
    participant TSOService
    participant Etcd
    Client->>GrpcServer: open TSO stream
    GrpcServer->>RaftCluster: checkTSOService()
    RaftCluster->>RaftCluster: atomic read/update tsoDynamicSwitchingState
    alt dynamic switching disabled
        RaftCluster-->>GrpcServer: log "skipping TSO service checks"
        GrpcServer->>TSOService: forward stream (independent) / Inc metric
    else dynamic switching enabled
        RaftCluster-->>GrpcServer: log "resuming TSO service checks"
        GrpcServer->>TSOService: forward or serve embedded / Inc metric
    end
    TSOService->>Etcd: persist TSO window (save)
    Etcd-->>TSOService: ack
    TSOService-->>GrpcServer: stream responses
    GrpcServer-->>Client: stream responses
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 2 passed
/retest
@YuhaoZhang00: Cannot trigger testing until a trusted user reviews the PR and leaves an `/ok-to-test` message.
/ok-to-test
/retest |
Codecov Report

❌ Patch coverage report. Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master   #10439      +/-   ##
==========================================
+ Coverage   78.86%   78.89%    +0.03%
==========================================
  Files         529      532        +3
  Lines       71102    71883      +781
==========================================
+ Hits        56072    56714      +642
- Misses      11014    11130      +116
- Partials     4016     4039       +23
```

Flags with carried forward coverage won't be shown.
/retest |
pkg/tso/tso.go (Outdated)

```diff
-t.metrics.updateSaveDuration.Observe(time.Since(start).Seconds())
+saveDuration := time.Since(start)
+t.metrics.updateSaveDuration.Observe(saveDuration.Seconds())
+log.Debug("saved timestamp to etcd",
```
What do you think about only logging this and setting it to the INFO level when the saveDuration exceeds a certain threshold?
We don't really need to know every time it saves; we are much more concerned with how long the saving process actually takes.
server/cluster/cluster.go (Outdated)

```diff
 	c.SetServiceIndependent(constant.TSOServiceName)
 }
 if !c.opt.GetMicroserviceConfig().IsTSODynamicSwitchingEnabled() {
+	log.Debug("TSO dynamic switching is disabled, skipping TSO service check")
```
It isn't on a hot path here, and it won't be printed that frequently, so I think INFO-level logging is fine; it can give us some information about the timing.
Under the isKeyspaceGroupEnabled = true and isTSODynamicSwitchingEnabled = false config, the log will be printed once per 100ms. It may be worth keeping it at Debug level, but it would be better to eliminate it. For this method, logging on successful/failed state transitions is enough.
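That transition-only idea can be sketched with an atomic tri-state. The names below mirror the `tsoDynamicSwitchingState` concept but are illustrative, not PD's actual code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Illustrative tri-state; PD's real field and constants may differ.
const (
	stateUnknown int32 = iota // zero value: state not yet observed
	stateDisabled
	stateEnabled
)

type cluster struct {
	tsoDynamicSwitchingState atomic.Int32 // zero value is stateUnknown
}

// observeSwitching returns a log message only when the state actually
// changes, so a check that runs every 100ms stays silent in steady state.
func (c *cluster) observeSwitching(enabled bool) string {
	next := stateDisabled
	if enabled {
		next = stateEnabled
	}
	if prev := c.tsoDynamicSwitchingState.Swap(next); prev == next {
		return "" // no transition: nothing to log
	}
	if enabled {
		return "TSO dynamic switching is enabled, resuming TSO service checks"
	}
	return "TSO dynamic switching is disabled, skipping TSO service checks"
}

func main() {
	var c cluster
	fmt.Println(c.observeSwitching(false)) // Unknown→Disabled: logs once
	fmt.Println(c.observeSwitching(false)) // steady state: silent
	fmt.Println(c.observeSwitching(true))  // Disabled→Enabled: logs once
}
```

A single atomic `Swap` both records the new state and tells the caller what the previous state was, so no separate lock is needed around the check.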
server/grpc_service.go (Outdated)

```diff
 	defer done()
 }
 if s.IsServiceIndependent(constant.TSOServiceName) {
+	log.Debug("forwarding TSO stream to independent TSO service")
```
I feel this indicator would be better suited as a Grafana metric. This way, we can observe the count on a monitoring dashboard.
🧹 Nitpick comments (2)

pkg/tso/tso.go (1)

289-305: Minor: `getLastSavedTime()` called twice.

`t.getLastSavedTime()` is called once in the condition (line 289) and again at line 290. Consider capturing it once before the condition check to avoid the redundant atomic load, though this is a minor point since user-reset is infrequent.

♻️ Suggested refactor

```diff
 // save into etcd only if nextPhysical is close to lastSavedTime
+lastSavedTime := t.getLastSavedTime()
-if typeutil.SubRealTimeByWallClock(t.getLastSavedTime(), nextPhysical) <= updateTimestampGuard {
-	lastSavedTime := t.getLastSavedTime()
+if typeutil.SubRealTimeByWallClock(lastSavedTime, nextPhysical) <= updateTimestampGuard {
 	save := nextPhysical.Add(t.saveInterval)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/tso/tso.go` around lines 289-305: the code calls `t.getLastSavedTime()` twice (in the condition using `typeutil.SubRealTimeByWallClock` and immediately after), causing redundant atomic loads; capture the value once into a local variable (e.g., `lastSavedTime := t.getLastSavedTime()`) before evaluating the condition, use that local in the `SubRealTimeByWallClock` comparison and later when logging/updating, and keep the existing logic that computes `save := nextPhysical.Add(t.saveInterval)` and calls `t.saveTimestamp(save)` and `t.lastSavedTime.Store(save)`.

server/cluster/cluster.go (1)
472-497: Good state transition logging, minor asymmetry on startup.

The implementation correctly addresses the concern about logging every 100ms by only logging on state transitions. However, there is a subtle asymmetry:

- Startup with dynamic switching enabled: no log (Unknown→Enabled doesn't trigger)
- Startup with dynamic switching disabled: logs "skipping TSO service checks" (Unknown→Disabled triggers)

This may be intentional, since logging when checks are being skipped is useful information. If symmetric behavior is desired, the enabled path could also log on Unknown→Enabled:

♻️ Optional: symmetric startup logging

```diff
 prev := c.tsoDynamicSwitchingState.Swap(tsoDynamicSwitchingStateEnabled)
-if prev == tsoDynamicSwitchingStateDisabled {
+if prev != tsoDynamicSwitchingStateEnabled {
 	log.Info("TSO dynamic switching is enabled, resuming TSO service checks")
 }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@server/cluster/cluster.go` around lines 472-497: the logging is asymmetric on startup: when swapping to `tsoDynamicSwitchingStateEnabled` you only log if `prev == tsoDynamicSwitchingStateDisabled`, but when swapping to `tsoDynamicSwitchingStateDisabled` you log for any non-disabled previous state (including unknown); make the enabled path symmetric by treating the unknown state the same as disabled for logging, so startup Unknown→Enabled emits a log message similar to Unknown→Disabled.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0b66f8a8-61f3-4177-923b-6f6fdc57001f
📒 Files selected for processing (2)
- pkg/tso/tso.go
- server/cluster/cluster.go
Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
pkg/tso/tso.go (Outdated)

```diff
 t.metrics.syncOKEvent.Inc()
-log.Info("sync and save timestamp",
+log.Info("persisted tso window to etcd (sync)",
```
This is a critical path log that has existed for a long time, and I tend to avoid changing all the fields, as that might affect some alerting systems.
/hold
Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
/unhold
Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
/retest
/retest |
/retest |
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bufferflies, JmPotato

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
[LGTM Timeline notifier] Timeline:
What problem does this PR solve?
Issue Number: ref #10329
When investigating TSO fallback and dual-writer incidents such as #10329, the existing logs and
metrics are not enough to reconstruct:
What is changed and how does it work?
This PR improves observability for TSO persistence and forwarding paths.
Add persistence-focused logs for successful TSO window writes to etcd:
- `syncTimestamp`
- `updateTimestamp`
- `resetUserTimestamp`

Improve TSO switching related logs:

- when `UpdateTSO` fails and the allocator resigns leadership
- when closing the embedded TSO allocator
- when TSO dynamic switching is disabled or re-enabled
Replace the forwarded TSO stream debug log with a Prometheus metric: `pd_server_tso_forward_stream_total`

Add a Grafana panel for the new metric in the PD dashboard: "Forward TSO streams count"

All new logs on hot paths use debug level to avoid overwhelming logs under burst requests.
Check List
Tests
Code changes
N/A
Side effects
Related changes
N/A
Release note
Summary by CodeRabbit