pd can not select new leader which lead qps drop to zero when inject pdleader io delay 500ms last for 5mins #7251

Lily2025 · 2023-10-25T09:44:58Z

Bug Report

What did you do?

1. Run a workload.
2. Inject a 500ms I/O delay on the PD leader for 5 minutes:
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: kv-timeout-data
  namespace: testbed-xxx
spec:
  action: latency
  mode: one
  selector:
    namespaces:
      - testbed-xxx
    labelSelectors:
      statefulset.kubernetes.io/pod-name: tc-pd-0
  volumePath: /var/lib/pd
  path: "/var/lib/pd/data/**/*"
  delay: "500ms"
  percent: 100
  duration: "300s"

What did you expect to see?

1. PD elects a new leader when a 500ms I/O delay is injected on the PD leader for 5 minutes.
2. QPS recovers within 2 minutes when the 500ms I/O delay is injected on the PD leader.

What did you see instead?

PD cannot elect a new leader while a 500ms I/O delay is injected on the PD leader for 5 minutes.
Write QPS drops to zero.

What version of PD are you using (`pd-server -V`)?

./pd-server -V
Release Version: v6.5.0-nightly
Edition: Community
Git Commit Hash: 77d6f5b
Git Branch: heads/refs/tags/v6.5.0-nightly
UTC Build Time: 2023-10-07 10:21:09
2023-10-19T09:33:53.650+0800


Lily2025 · 2023-10-25T09:45:40Z

/type bug
/severity major
/assign HuSharp

HuSharp · 2023-11-01T08:15:35Z

Situation

It looks like pd0 kept campaigning for leader and timing out for five minutes. Finally, pd2 was elected.

pd0 appears to keep campaigning for leader because applies are slow under the I/O delay.
The etcd leader only changed at the end of pd0's campaigning.
etcd changed its leader because heartbeats were slow.
How long QPS stays at zero depends on how long etcd takes to change its leader.

Solution

Maybe we need to maintain a slice that records the times of recent leader campaigns.

close #7251, ref #7377. When the PD leader frequently campaigns for leader but the etcd leader does not change, we need to stop this PD leader from campaigning and have it resign to another member. Signed-off-by: husharp <jinhao.hu@pingcap.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>

Lily2025 · 2023-11-27T03:54:26Z

/open

Lily2025 · 2023-11-27T03:58:19Z

The phenomenon still exists

JmPotato · 2024-01-30T10:11:16Z

Fixed by #7737.

close #7251, ref #7377. When the PD leader frequently campaigns for leader but the etcd leader does not change, we need to stop this PD leader from campaigning and have it resign to another member. Signed-off-by: husharp <jinhao.hu@pingcap.com> Co-authored-by: husharp <jinhao.hu@pingcap.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>

close #7251. Cherry-pick the etcd client health checker improvements from #7725, #7727, #7743, #7737 and #7779. Signed-off-by: JmPotato <ghzpotato@gmail.com>

Lily2025 added the type/bug The issue is confirmed as a bug. label Oct 25, 2023

github-actions bot added this to Need Triage in Questions and Bug Reports Oct 25, 2023

ti-chi-bot bot assigned HuSharp Oct 25, 2023

ti-chi-bot bot added severity/major The issue's severity is major. may-affects-5.3 may-affects-5.4 may-affects-6.1 may-affects-6.5 may-affects-7.1 may-affects-7.5 labels Oct 25, 2023

Lily2025 changed the title ~~pd can not select new leader when inject pdleader io delay 500ms last for 5mins~~ pd can not select new leader which lead qps drop to zero when inject pdleader io delay 500ms last for 5mins Oct 27, 2023

HuSharp mentioned this issue Nov 2, 2023

member: avoid frequent campaign times #7301

Merged

HuSharp removed may-affects-5.3 may-affects-5.4 may-affects-6.1 may-affects-6.5 may-affects-7.1 may-affects-7.5 labels Nov 13, 2023

ti-chi-bot bot closed this as completed in #7301 Nov 16, 2023

Questions and Bug Reports automation moved this from Need Triage to Closed Nov 16, 2023

jebter reopened this Nov 27, 2023

Questions and Bug Reports automation moved this from Closed to Need Triage Nov 27, 2023

JmPotato mentioned this issue Dec 6, 2023

Reduce the impact of etcd leader on the availability of PD leader #7499

Closed

This was referenced Jan 12, 2024

server/*: add trend to check etcd healthy #7682

Closed

lease: keep alive Leadership immediately after finish the election of leader #7707

Closed

jebter added the affects-7.6 label Jan 24, 2024

JmPotato closed this as completed Jan 30, 2024

Questions and Bug Reports automation moved this from Need Triage to Closed Jan 30, 2024

JmPotato added affects-7.3 affects-7.4 affects-7.5 labels Jan 30, 2024

ti-chi-bot mentioned this issue Feb 1, 2024

member: avoid frequent campaign times (#7301) #7790

Merged

JmPotato mentioned this issue Feb 2, 2024

*: cherry-pick the etcd client health checker improvements #7793

Merged

HuSharp mentioned this issue Feb 2, 2024

member: return err when meet frequently campaign leader #7566

Merged

niubell mentioned this issue Feb 26, 2024

add release note for v7.5.1 pingcap/docs-cn#16660

Merged

18 tasks
