
Detect leader health and automatically do failover #6403

Closed
nolouch opened this issue May 4, 2023 · 1 comment · Fixed by #6447
Assignees
Labels
affects-6.5 affects-7.1 severity/major The issue's severity is major. type/bug The issue is confirmed as a bug. type/enhancement The issue belongs to an enhancement.

Comments

nolouch (Contributor) commented May 4, 2023

Feature Request

Describe the problem related to your feature request

PD binds the PD leader to the etcd leader to reduce the cognitive burden for users. Because of this behavior, we hit a problem where the PD leader lease is lost but the etcd leader is not, and the previous PD member cannot be elected leader again due to problems with the leader election. We can simulate the problem by dropping all packets coming from the relevant connection:

sudo iptables -A INPUT -p tcp --sport 44438 -j DROP

The etcd raft heartbeats that maintain etcd leadership go to the other nodes, while the PD leader lease keepalive goes directly to the local peer advertise address, so they use completely different connections. In this case the PD leader is lost, but the other followers cannot elect a new leader because the etcd leader is still the old node. PD then cannot serve, and the cluster stays unavailable for a long time until the etcd leader changes.

Describe the feature you'd like

Reduce the unavailable time of the cluster.

Describe alternative solutions you've considered

PD leader health detection

Because all followers watch the PD leader's key, every member actually knows who the current leader is. We can store the leader member ID and the update time in the memory of all members. Once the leader's lease is lost, the leader key is deleted because the lease expired; all members learn this by watching the key, clear the leader ID, and record the updated time, resetting it once a new leader is elected.

The leader record struct could look like:

type leaderEvent struct {
	leader      *pdpb.Member
	updatedTime time.Time
}

Members can watch the leader key and handle the event accordingly here:

pd/server/server.go

Lines 1428 to 1430 in 46fdd96

log.Info("start to watch pd leader", zap.Stringer("pd-leader", leader))
// WatchLeader will keep looping and never return unless the PD leader has changed.
leader.Watch(s.serverLoopCtx)

Resign the etcd leader if there is no PD leader for a long time

Knowing the PD leader and its updated time, PD members can decide to fail over by letting etcd run a new election, based on how long the leader has been lost. We can add this logic here:

pd/server/server.go

Lines 1567 to 1582 in 46fdd96

func (s *Server) etcdLeaderLoop() {
	defer logutil.LogPanic()
	defer s.serverLoopWg.Done()
	ctx, cancel := context.WithCancel(s.serverLoopCtx)
	defer cancel()
	for {
		select {
		case <-time.After(s.cfg.LeaderPriorityCheckInterval.Duration):
			s.member.CheckPriority(ctx)
		case <-ctx.Done():
			log.Info("server is closed, exit etcd leader loop")
			return
		}
	}
}

Once we detect that there is an etcd leader but no PD leader for a long time (such as 10 * etcdElectionTimeout), we can let the first follower member, sorted by member ID, trigger an etcd re-election. The interface can use:

func (m *EmbeddedEtcdMember) MoveEtcdLeader(ctx context.Context, old, new uint64) error {

ETA

A fix on master within about a week.

@nolouch nolouch added the type/feature-request The issue belongs to a feature request. label May 4, 2023
@nolouch nolouch self-assigned this May 5, 2023
@nolouch nolouch added type/bug The issue is confirmed as a bug. type/enhancement The issue belongs to an enhancement. and removed type/feature-request The issue belongs to a feature request. labels May 5, 2023
@github-actions github-actions bot added this to Need Triage in Questions and Bug Reports May 5, 2023
@nolouch nolouch added affects-6.5 affects-7.1 severity/major The issue's severity is major. labels May 5, 2023
ti-chi-bot bot pushed a commit that referenced this issue May 12, 2023
ref #6403

Signed-off-by: Ryan Leung <rleungx@gmail.com>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue May 12, 2023
ref tikv#6403

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue May 12, 2023
ref tikv#6403

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot bot added a commit that referenced this issue May 12, 2023
…d leader intact (#6447)

close #6403

server: fix the leader cannot election after pd leader lost while etcd leader intact

Signed-off-by: nolouch <nolouch@gmail.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Questions and Bug Reports automation moved this from Need Triage to Closed May 12, 2023
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue May 12, 2023
close tikv#6403

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue May 12, 2023
close tikv#6403

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot bot added a commit that referenced this issue May 15, 2023
…d leader intact (#6447) (#6461)

close #6403, ref #6447

server: fix the leader cannot election after pd leader lost while etcd leader intact

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: nolouch <nolouch@gmail.com>

Co-authored-by: ShuNing <nolouch@gmail.com>
Co-authored-by: nolouch <nolouch@gmail.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot added a commit that referenced this issue May 15, 2023
ref #6403, ref #6409

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: Ryan Leung <rleungx@gmail.com>

Co-authored-by: Ryan Leung <rleungx@gmail.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot added a commit that referenced this issue May 24, 2023
…d leader intact (#6447) (#6460)

close #6403, ref #6447

server: fix the leader cannot election after pd leader lost while etcd leader intact

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: nolouch <nolouch@gmail.com>

Co-authored-by: ShuNing <nolouch@gmail.com>
Co-authored-by: nolouch <nolouch@gmail.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot added a commit that referenced this issue May 24, 2023
ref #6403, ref #6409

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: Ryan Leung <rleungx@gmail.com>

Co-authored-by: Ryan Leung <rleungx@gmail.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
nolouch (Contributor, Author) commented Aug 25, 2023

ref #6403, ref #6556
