*: add region heartbeat duration breakdown metrics #7871

Merged
merged 9 commits into from Mar 12, 2024

Conversation

nolouch
Contributor

@nolouch nolouch commented Mar 4, 2024

What problem does this PR solve?

Issue Number: Close #7868

What is changed and how does it work?

*: add region heartbeat duration breakdown metrics
- add a tracer during the heartbeat process
- record lock wait time statistics in RegionsInfo

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

  • No code

Code changes

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

Release note

None.

Contributor

ti-chi-bot bot commented Mar 4, 2024

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • JmPotato
  • rleungx

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot bot requested review from JmPotato and rleungx March 4, 2024 09:20
@ti-chi-bot ti-chi-bot bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 4, 2024

codecov bot commented Mar 4, 2024

Codecov Report

Merging #7871 (1bf72a3) into master (c1eabda) will increase coverage by 0.17%.
The diff coverage is 95.45%.

❗ Current head 1bf72a3 differs from pull request most recent head e6d9cd9. Consider uploading reports for the commit e6d9cd9 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7871      +/-   ##
==========================================
+ Coverage   73.32%   73.49%   +0.17%     
==========================================
  Files         435      435              
  Lines       48195    48225      +30     
==========================================
+ Hits        35337    35442     +105     
+ Misses       9784     9728      -56     
+ Partials     3074     3055      -19     
Flag Coverage Δ
unittests 73.49% <95.45%> (+0.17%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 4, 2024
Signed-off-by: nolouch <nolouch@gmail.com>
pkg/core/metrics.go (outdated)
Comment on lines +163 to +164
preCheckDurationSum.Add(h.preCheckDuration.Seconds())
preCheckCount.Inc()
Member

Is it possible to flush these metrics asynchronously? It appears that setting the metrics in the middle of processing may affect the performance.

Contributor Author

@nolouch nolouch Mar 7, 2024

I think it's OK because all of them use atomic operations, and we can dynamically disable the tracing with the config option enable-heartbeat-breakdown-metrics.
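
To make the argument above concrete, here is a minimal sketch of metrics updates that use only atomic operations and can be switched off at runtime. It uses the standard library rather than a metrics client, and every name in it (`floatSum`, `breakdownMetrics`, `observePreCheck`) is an assumption made for illustration. The `enabled` flag merely plays the role that the enable-heartbeat-breakdown-metrics option is described as playing.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"sync/atomic"
)

// floatSum is a lock-free float64 accumulator (illustrative, not the PR's type).
type floatSum struct{ bits atomic.Uint64 }

// Add retries a compare-and-swap on the float's bit pattern until it wins.
func (s *floatSum) Add(v float64) {
	for {
		old := s.bits.Load()
		next := math.Float64bits(math.Float64frombits(old) + v)
		if s.bits.CompareAndSwap(old, next) {
			return
		}
	}
}

func (s *floatSum) Value() float64 { return math.Float64frombits(s.bits.Load()) }

// breakdownMetrics groups a duration sum and a count behind a runtime switch.
type breakdownMetrics struct {
	enabled       atomic.Bool // hypothetical stand-in for the config option
	preCheckSum   floatSum
	preCheckCount atomic.Uint64
}

// observePreCheck records one observation, or returns early when disabled.
func (m *breakdownMetrics) observePreCheck(seconds float64) {
	if !m.enabled.Load() {
		return
	}
	m.preCheckSum.Add(seconds)
	m.preCheckCount.Add(1)
}

func main() {
	var m breakdownMetrics
	m.enabled.Store(true)

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				m.observePreCheck(0.5)
			}
		}()
	}
	wg.Wait()
	fmt.Println("count:", m.preCheckCount.Load(), "sum:", m.preCheckSum.Value())
}
```

No mutex is taken on the recording path, so concurrent heartbeats never block each other on the metrics themselves. The remaining cost is the atomic instructions and, under contention, the occasional compare-and-swap retry.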

Signed-off-by: nolouch <nolouch@gmail.com>
Signed-off-by: nolouch <nolouch@gmail.com>
Contributor Author

nolouch commented Mar 8, 2024

ptal @JmPotato @rleungx

Member

@rleungx rleungx left a comment

BTW, will it cause too much lock contention?

@@ -1653,6 +1695,42 @@ func (r *RegionsInfo) GetRegionSizeByRange(startKey, endKey []byte) int64 {
return size
}

// metrics default poll interval
const magicCount = 15 * time.Second
Member

How about just DefaultPollInterval?

Contributor Author

nolouch commented Mar 11, 2024

@rleungx This PR does not introduce any additional lock or unlock operations. All trace operations are added inside the existing critical sections. Since the metrics updates are atomic operations, I think it's okay, although it may slightly increase the CPU time spent while holding the locks. That extra time should be small compared to the other parts.

@ti-chi-bot ti-chi-bot bot added the status/LGT1 Indicates that a PR has LGTM 1. label Mar 11, 2024
Contributor Author

nolouch commented Mar 12, 2024

ptal @JmPotato

@ti-chi-bot ti-chi-bot bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Mar 12, 2024
@JmPotato
Member

Should we conduct a benchmark test on this feature to assess any potential impact on performance?

Contributor Author

nolouch commented Mar 12, 2024

@JmPotato Here is a benchmark result with the following code:

func BenchmarkRandomAtomicPutRegion(b *testing.B) {
	// Build a tree of one million regions with three peers each.
	regions := NewRegionsInfo()
	var items []*RegionInfo
	const treeSize = 1000000
	for i := 0; i < treeSize; i++ {
		peers := []*metapb.Peer{
			{StoreId: uint64((i % 20) + 1), Id: uint64(i*3 + 1)},
			{StoreId: uint64((i % 20) + 2), Id: uint64(i*3 + 2)},
			{StoreId: uint64((i % 20) + 3), Id: uint64(i*3 + 3)}}
		region := NewRegionInfo(&metapb.Region{
			Id:       uint64(i + 3000001),
			Peers:    peers,
			StartKey: []byte(fmt.Sprintf("%20d", i)),
			EndKey:   []byte(fmt.Sprintf("%20d", i+1)),
		}, peers[0])
		origin, overlaps, rangeChanged := regions.SetRegion(region)
		regions.UpdateSubTree(region, origin, overlaps, rangeChanged)
		items = append(items, region)
	}
	// Visit the regions in random order and exclude the setup from the timing.
	order := mrand.Perm(treeSize)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		tracer := NewNoopHeartbeatProcessTracer()
		idx := order[i%treeSize]
		item := items[idx]
		// Change the statistics and the leader so that each put is a real update.
		item.approximateKeys += int64(200000)
		item.approximateSize += int64(20)
		item.leader = item.meta.Peers[2]
		regions.AtomicCheckAndPutRegion(item, tracer)
	}
}

Result

Noop tracer

➜  pd git:(add-lock-metrics) ✗ go test -bench ^BenchmarkRandomAtomicPutRegion$ -benchtime=1000000x -count=3  github.com/tikv/pd/pkg/core -cpu=16
goos: linux
goarch: amd64
pkg: github.com/tikv/pd/pkg/core
cpu: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
BenchmarkRandomAtomicPutRegion-16        1000000              2221 ns/op
BenchmarkRandomAtomicPutRegion-16        1000000              2046 ns/op
BenchmarkRandomAtomicPutRegion-16        1000000              2030 ns/op
PASS

Metrics Tracer

➜  pd git:(add-lock-metrics) ✗ go test -bench ^BenchmarkRandomAtomicPutRegion$ -benchtime=1000000x -count=3  github.com/tikv/pd/pkg/core -cpu=16
goos: linux
goarch: amd64
pkg: github.com/tikv/pd/pkg/core
cpu: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
BenchmarkRandomAtomicPutRegion-16        1000000              2238 ns/op
BenchmarkRandomAtomicPutRegion-16        1000000              2213 ns/op
BenchmarkRandomAtomicPutRegion-16        1000000              2252 ns/op
PASS
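
For a quick reading of the two runs above, the snippet below averages the three ns/op samples of each configuration and reports the relative difference. The numbers are copied from the output above, and the helper names `mean` and `overheadPercent` are made up for this example. With only three samples per configuration this is a rough comparison rather than a statistically rigorous one.

```go
package main

import "fmt"

// mean returns the arithmetic mean of the samples.
func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// overheadPercent returns how much slower withFeature is than base, in percent.
func overheadPercent(base, withFeature []float64) float64 {
	b := mean(base)
	return (mean(withFeature) - b) / b * 100
}

func main() {
	noop := []float64{2221, 2046, 2030}    // ns/op, noop tracer runs
	metrics := []float64{2238, 2213, 2252} // ns/op, metrics tracer runs

	fmt.Printf("noop mean:    %.0f ns/op\n", mean(noop))
	fmt.Printf("metrics mean: %.0f ns/op\n", mean(metrics))
	fmt.Printf("overhead:     %.1f%%\n", overheadPercent(noop, metrics))
}
```

The means come out at roughly 2099 ns/op and 2234 ns/op, a difference of about 135 ns per operation, or around 6%.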

Contributor Author

nolouch commented Mar 12, 2024

already approved by @easonn7

/merge

Contributor

ti-chi-bot bot commented Mar 12, 2024

@nolouch: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

Contributor

ti-chi-bot bot commented Mar 12, 2024

This pull request has been accepted and is ready to merge.

Commit hash: 1bf72a3

@ti-chi-bot ti-chi-bot bot added the status/can-merge Indicates a PR has been approved by a committer. label Mar 12, 2024
@ti-chi-bot ti-chi-bot bot merged commit 96590de into tikv:master Mar 12, 2024
22 checks passed
@nolouch nolouch deleted the add-lock-metrics branch March 12, 2024 12:42
Member

HuSharp commented Mar 14, 2024

Why not add the metrics to grafana/pd.json?

Labels
release-note-none size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Heartbeat trace statistics help analyze performance bottlenecks
4 participants