
Store heartbeat cannot be consumed, heartbeat storm in big cluster #15184

Closed
nolouch opened this issue Jul 24, 2023 · 0 comments · Fixed by #15191

nolouch commented Jul 24, 2023

Development Task

When PD is under very heavy pressure, store heartbeat latency can become high because of heavy lock contention. However, the current retry mechanism is not reasonable: it retries up to 10 times, and every retry round adds more pressure. On the other hand, a new store heartbeat is produced every 10 seconds, so this turns into a vicious circle.
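
As a rough, assumption-laden illustration (numbers only, not PD/TiKV code), the sketch below estimates how per-heartbeat retries can multiply the request rate PD has to absorb once handling slows down:

// Back-of-envelope sketch: how per-heartbeat retries amplify the request
// rate PD sees when it is too slow to answer in time. All numbers are
// taken from the description above (200 stores, 10s interval, 10 retries).
package main

import "fmt"

func main() {
	const (
		stores     = 200  // stores in the cluster
		intervalS  = 10.0 // store heartbeat interval in seconds
		maxRetries = 10   // retries per failed heartbeat
	)
	baseRPS := float64(stores) / intervalS
	// Worst case while PD cannot keep up: each heartbeat is retried
	// maxRetries times, so each interval carries (1 + maxRetries) requests per store.
	stormRPS := baseRPS * float64(1+maxRetries)
	fmt.Printf("normal: %.1f heartbeat req/s, retry storm worst case: %.1f req/s\n", baseRPS, stormRPS)
}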

Reproduce

Add a failpoint that sleeps for 4s in the PD server when handling the store heartbeat, then run a cluster:

tiup playground v7.1.0  --tiflash=0 --pd.binpath=./pd-server
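
For reference, a minimal sketch of how such a delay might be injected with the pingcap/failpoint library; the failpoint name, handler function, and enable path here are illustrative assumptions, not PD's actual code:

// Illustrative only: "slowStoreHeartbeat" and this handler are assumptions
// for the sketch, not PD's real code.
package main

import (
	"fmt"
	"time"

	"github.com/pingcap/failpoint"
)

// handleStoreHeartbeat stands in for PD's real store heartbeat handler.
func handleStoreHeartbeat() {
	failpoint.Inject("slowStoreHeartbeat", func() {
		time.Sleep(4 * time.Second) // simulate slow handling under pressure
	})
	fmt.Println("store heartbeat handled")
}

func main() {
	// Enable with the failpoint tooling, e.g. after `failpoint-ctl enable`,
	// set GO_FAILPOINTS='<package-path>/slowStoreHeartbeat=return(true)'.
	handleStoreHeartbeat()
}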

We can see the effect in the monitoring (screenshot omitted), and logs like:

[2023/07/24 17:14:37.755 +08:00] [ERROR] [util.rs:462] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"]
[2023/07/24 17:14:38.025 +08:00] [INFO] [util.rs:604] ["connecting to PD endpoint"] [endpoints=http://127.0.0.1:2379]
[2023/07/24 17:14:38.026 +08:00] [INFO] [util.rs:604] ["connecting to PD endpoint"] [endpoints=http://127.0.0.1:2379]
[2023/07/24 17:14:38.027 +08:00] [INFO] [util.rs:769] ["connected to PD member"] [endpoints=http://127.0.0.1:2379]
[2023/07/24 17:14:38.027 +08:00] [INFO] [util.rs:225] ["heartbeat sender and receiver are stale, refreshing ..."]
[2023/07/24 17:14:38.027 +08:00] [INFO] [util.rs:238] ["buckets sender and receiver are stale, refreshing ..."]
[2023/07/24 17:14:38.027 +08:00] [INFO] [util.rs:266] ["update pd client"] [via=] [leader=http://127.0.0.1:2379] [prev_via=] [prev_leader=http://127.0.0.1:2379]
[2023/07/24 17:14:38.027 +08:00] [INFO] [util.rs:400] ["trying to update PD client done"] [spend=2.380167ms]
...

If there are many stores, this increases the pressure on the PD side and may cause an OOM issue.

Others

A user has a large cluster with 200 stores, and we found many goroutines blocked on locks in the store heartbeat path. The blocked goroutines look like this:

goroutine 7557968 [semacquire, 14 minutes]:
sync.runtime_SemacquireMutex(0x3404c50?, 0x0?, 0xc03afc8e40?)
	/usr/local/go/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc0003c0000)
	/usr/local/go/src/sync/mutex.go:171 +0x165
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:90
sync.(*RWMutex).Lock(0xc0dc562b00?)
	/usr/local/go/src/sync/rwmutex.go:147 +0x36
github.com/tikv/pd/server/cluster.(*RaftCluster).HandleStoreHeartbeat(0xc0003c0000, 0xc064e16280)
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/cluster/cluster.go:669 +0x6e
github.com/tikv/pd/server.(*GrpcServer).StoreHeartbeat(0xc001f4a118, {0x340b958?, 0xc0e864ba70?}, 0xc0f29cc7c0)
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/grpc_service.go:604 +0x26f
github.com/pingcap/kvproto/pkg/pdpb._PD_StoreHeartbeat_Handler.func1({0x340b958, 0xc0e864ba70}, {0x26d2cc0?, 0xc0f29cc7c0})
	/go/pkg/mod/github.com/pingcap/kvproto@v0.0.0-20220510035547-0e2f26c0a46a/pkg/pdpb/pdpb.pb.go:7467 +0x7b
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1({0x340b958?, 0xc0e864ba70?}, {0x26d2cc0?, 0xc0f29cc7c0?})
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.0.1-0.20190118093823-f849b5445de4/chain.go:31 +0x89
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1({0x340b958, 0xc0e864ba70}, {0x26d2cc0, 0xc0f29cc7c0}, 0x24f7640?, 0xc0a46906e0)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107 +0x87
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1({0x340b958?, 0xc0e864ba70?}, {0x26d2cc0?, 0xc0f29cc7c0?})
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.0.1-0.20190118093823-f849b5445de4/chain.go:34 +0x6f
go.etcd.io/etcd/etcdserver/api/v3rpc.newUnaryInterceptor.func1({0x340b958, 0xc0e864ba70}, {0x26d2cc0?, 0xc0f29cc7c0}, 0x0?, 0xc0a46906e0)
	/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/api/v3rpc/interceptor.go:64 +0x1b6
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1({0x340b958?, 0xc0e864ba70?}, {0x26d2cc0?, 0xc0f29cc7c0?})
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.0.1-0.20190118093823-f849b5445de4/chain.go:34 +0x6f
go.etcd.io/etcd/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1({0x340b958, 0xc0e864ba70}, {0x26d2cc0, 0xc0f29cc7c0}, 0xc0ebb75da0, 0xc0a46906e0)
	/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/api/v3rpc/interceptor.go:71 +0xc3
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1({0x340b958, 0xc0e864ba70}, {0x26d2cc0, 0xc0f29cc7c0}, 0xc0ebb75da0, 0xc0e8d451b8)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.0.1-0.20190118093823-f849b5445de4/chain.go:39 +0x1a3
github.com/pingcap/kvproto/pkg/pdpb._PD_StoreHeartbeat_Handler({0x274f6c0?, 0xc001f4a118}, {0x340b958, 0xc0e864ba70}, 0xc0a135c540, 0xc0006fc6f0)
	/go/pkg/mod/github.com/pingcap/kvproto@v0.0.0-20220510035547-0e2f26c0a46a/pkg/pdpb/pdpb.pb.go:7469 +0x138
google.golang.org/grpc.(*Server).processUnaryRPC(0xc001581080, {0x3418060, 0xc098e90780}, 0xc065221e00, 0xc0022fd470, 0x43cc048, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:1024 +0xd2f
google.golang.org/grpc.(*Server).handleStream(0xc001581080, {0x3418060, 0xc098e90780}, 0xc065221e00, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:1313 +0xa16
google.golang.org/grpc.(*Server).serveStreams.func1.1()
	/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:722 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:720 +0xea
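
The shape of the problem is many concurrent StoreHeartbeat handlers serializing on one RaftCluster mutex. A minimal standalone sketch (plain Go, not PD code) that reproduces this shape:

// Minimal standalone sketch (not PD code): many heartbeat handlers
// serializing on one mutex with a slow critical section, so most
// goroutines sit in semacquire like the dump above.
package main

import (
	"fmt"
	"sync"
	"time"
)

type cluster struct {
	mu sync.RWMutex
}

func (c *cluster) handleStoreHeartbeat() {
	c.mu.Lock() // exclusive lock, as in RaftCluster.HandleStoreHeartbeat
	defer c.mu.Unlock()
	time.Sleep(50 * time.Millisecond) // stand-in for slow processing
}

func main() {
	c := &cluster{}
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < 200; i++ { // ~200 stores sending heartbeats at once
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.handleStoreHeartbeat()
		}()
	}
	wg.Wait()
	// With a single exclusive lock, total time is roughly 200 * 50ms, i.e. fully serialized.
	fmt.Printf("200 concurrent heartbeats took %v\n", time.Since(start))
}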
@nolouch nolouch added the type/enhancement Type: Issue - Enhancement label Jul 24, 2023
@nolouch nolouch self-assigned this Jul 25, 2023
@nolouch nolouch changed the title Store heartbeat cannot be consumed Store heartbeat cannot be consumed, heartbeat storm in big cluster Jul 25, 2023
ti-chi-bot bot added a commit that referenced this issue Jul 28, 2023
…15191)

ref tikv/pd#6556, close #15184

The store heartbeat will report periodically, no need to do retries
- do not retry the store heartbeat
- change `remain_reconnect_count` to `remain_request_count`
- fix some metrics

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
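
The idea behind the fix, sketched generically (the actual change is in TiKV's Rust PD client, not this code): a periodic report can simply drop a failed attempt, because the next scheduled report carries fresher state anyway.

// Generic sketch of the fix's idea (not the actual TiKV Rust code):
// periodic reports are fire-and-forget, so a failure is logged and the
// next tick supersedes it instead of retrying and amplifying load.
package main

import (
	"errors"
	"log"
	"math/rand"
	"time"
)

// sendStoreHeartbeat stands in for the real RPC; it fails randomly so the
// no-retry path is exercised.
func sendStoreHeartbeat() error {
	if rand.Intn(3) == 0 {
		return errors.New("deadline exceeded")
	}
	return nil
}

func main() {
	// Store heartbeats are periodic (10s in TiKV; shortened here for the demo),
	// so a failed report is superseded by the next tick instead of retried.
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	for i := 0; i < 10; i++ {
		<-ticker.C
		if err := sendStoreHeartbeat(); err != nil {
			log.Printf("store heartbeat failed, reporting again next tick: %v", err)
			continue
		}
		log.Println("store heartbeat sent")
	}
}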
ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this issue Jul 28, 2023
ref tikv/pd#6556, close tikv#15184

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this issue Jul 28, 2023
ref tikv/pd#6556, close tikv#15184

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this issue Jul 28, 2023
ref tikv/pd#6556, close tikv#15184

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot bot pushed a commit that referenced this issue Aug 14, 2023
…15191) (#15231)

ref tikv/pd#6556, close #15184

The store heartbeat will report periodically, no need to do retries
- do not retry the store heartbeat
- change `remain_reconnect_count` to `remain_request_count`
- fix some metrics

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: nolouch <nolouch@gmail.com>

Co-authored-by: ShuNing <nolouch@gmail.com>
Co-authored-by: nolouch <nolouch@gmail.com>
ti-chi-bot bot pushed a commit that referenced this issue Aug 31, 2023
…15191) (#15232)

ref tikv/pd#6556, close #15184

The store heartbeat will report periodically, no need to do retries
- do not retry the store heartbeat
- change `remain_reconnect_count` to `remain_request_count`
- fix some metrics

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: nolouch <nolouch@gmail.com>

Co-authored-by: ShuNing <nolouch@gmail.com>
Co-authored-by: nolouch <nolouch@gmail.com>
@jebter jebter removed the affects-7.5 label Nov 8, 2023
ti-chi-bot bot added a commit that referenced this issue Nov 24, 2023
…erval and reduce retry times (#15837)

ref #15184

- The min-resolved-ts will report periodically, no need to do retries
- support dynamic change `min-resolved-ts` report interval

Signed-off-by: husharp <jinhao.hu@pingcap.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
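
A generic sketch of the dynamic report interval idea (not TiKV's actual implementation): the interval is re-read on every round, so a config change takes effect on the next report without restarting the loop.

// Generic sketch: a periodically reported value whose interval can be
// changed at runtime, as the min-resolved-ts change above describes.
package main

import (
	"log"
	"sync/atomic"
	"time"
)

// reportIntervalMs is re-read every round, so a config change takes effect
// on the next report without restarting the loop.
var reportIntervalMs atomic.Int64

func main() {
	reportIntervalMs.Store(200) // pretend default interval (ms), shortened for the demo
	go func() {
		time.Sleep(1 * time.Second)
		reportIntervalMs.Store(500) // simulate a dynamic config update
	}()
	for i := 0; i < 10; i++ {
		log.Println("report min-resolved-ts (no retry; the next round supersedes a failure)")
		time.Sleep(time.Duration(reportIntervalMs.Load()) * time.Millisecond)
	}
}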
ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this issue Nov 24, 2023
ref tikv#15184

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this issue Nov 24, 2023
ref tikv#15184

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this issue Nov 24, 2023
ref tikv#15184

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this issue Nov 24, 2023
ref tikv#15184

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>