PD cannot elect a new leader, causing QPS to drop to zero, when a 500ms IO delay is injected on the PD leader for 5 minutes #7251

Closed
Lily2025 opened this issue Oct 25, 2023 · 5 comments · Fixed by #7301
Labels: affects-7.3, affects-7.4, affects-7.5, affects-7.6, severity/major, type/bug

Lily2025 commented Oct 25, 2023

Bug Report

What did you do?

1. Run a workload.
2. Inject a 500ms IO delay on the PD leader for 5 minutes, using the following Chaos Mesh spec:
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: kv-timeout-data
  namespace: testbed-xxx
spec:
  action: latency
  mode: one
  selector:
    namespaces:
      - testbed-xxx
    labelSelectors:
      statefulset.kubernetes.io/pod-name: tc-pd-0
  volumePath: /var/lib/pd
  path: "/var/lib/pd/data/**/*"
  delay: "500ms"
  percent: 100
  duration: "300s"

What did you expect to see?

1. PD elects a new leader when a 500ms IO delay is injected on the PD leader for 5 minutes.
2. QPS recovers within 2 minutes after the 500ms IO delay is injected on the PD leader.

What did you see instead?

PD cannot elect a new leader while a 500ms IO delay is injected on the PD leader for 5 minutes.
Write QPS drops to zero.
[Screenshot: 3bf7daa4-ac99-4ae7-8651-52b59d4ea66e]
[Screenshot: 46775ab0-f274-434b-a4b8-b61f1b71e352]

What version of PD are you using (pd-server -V)?

./pd-server -V
Release Version: v6.5.0-nightly
Edition: Community
Git Commit Hash: 77d6f5b
Git Branch: heads/refs/tags/v6.5.0-nightly
UTC Build Time: 2023-10-07 10:21:09
2023-10-19T09:33:53.650+0800

@Lily2025 Lily2025 added the type/bug The issue is confirmed as a bug. label Oct 25, 2023
@github-actions github-actions bot added this to Need Triage in Questions and Bug Reports Oct 25, 2023
Lily2025 (Author) commented

/type bug
/severity major
/assign HuSharp

Lily2025 changed the title from "pd can not select new leader when inject pdleader io delay 500ms last for 5mins" to "pd can not select new leader which lead qps drop to zero when inject pdleader io delay 500ms last for 5mins" on Oct 27, 2023
HuSharp (Member) commented Nov 1, 2023

Situation

[Screenshot: img_v3_024k_d7092279-28a0-464a-be0e-36b77cb7bd1g]

It looks like pd0 kept campaigning for leader and timing out for five minutes. Finally, pd2 was elected.

  • It looks like pd0 keeps campaigning for leader because applying is slow (IO delay).
  • The etcd leader changed at the end of pd0's campaigning.
  • etcd changed its leader because heartbeats were slow.
  • How long QPS stays at zero depends on how long etcd takes to change its leader.

[Screenshot: img_v3_024k_30afc0c3-d479-44ee-b4fc-10e1ceacb27g]

Solution

Maybe we need to maintain a slice that records election times.

Questions and Bug Reports automation moved this from Need Triage to Closed Nov 16, 2023
ti-chi-bot bot added a commit that referenced this issue Nov 16, 2023
close #7251, ref #7377

When the PD leader campaigns for leadership frequently but the etcd leader does not change,
we need to stop this PD leader from campaigning and have it resign in favor of another member.

Signed-off-by: husharp <jinhao.hu@pingcap.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Lily2025 (Author) commented

/open

Lily2025 (Author) commented

The problem still occurs.

JmPotato (Member) commented

Fixed by #7737.

Questions and Bug Reports automation moved this from Need Triage to Closed Jan 30, 2024
ti-chi-bot bot added a commit that referenced this issue Feb 2, 2024
close #7251, ref #7377

When the PD leader campaigns for leadership frequently but the etcd leader does not change,
we need to stop this PD leader from campaigning and have it resign in favor of another member.

Signed-off-by: husharp <jinhao.hu@pingcap.com>

Co-authored-by: husharp <jinhao.hu@pingcap.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot pushed a commit that referenced this issue Feb 4, 2024
close #7251

Cherry-pick the etcd client health checker improvements from #7725, #7727, #7743, #7737, and #7779.

Signed-off-by: JmPotato <ghzpotato@gmail.com>