Regions get stuck in 2 voters, 1 down peer, 1 learner state #6559

Closed
v01dstar opened this issue Jun 6, 2023 · 1 comment · Fixed by #6831
Assignees
nolouch
Labels
affects-6.1, affects-6.5, affects-7.1, report/customer (Customers have encountered this bug), severity/critical (The issue's severity is critical), type/bug (The issue is confirmed as a bug)

Comments

v01dstar (Contributor) commented Jun 6, 2023

Bug Report

What did you do?

In a 3-node cluster, replace a broken store with a new one.

What did you expect to see?

The cluster returns to normal after the operation.

What did you see instead?

The TiKVRegionPendingPeerTooLong alarm is fired.

There are 3 regions that have had the "pending-peer" problem for 2 days. They all have 4 peers: 2 regular healthy voters, 1 healthy learner (on the new store 2751139), and 1 down peer (on the manually deleted store 4).

Example region info:
{
  "id": 55929554,
  "epoch": {
    "conf_ver": 6,
    "version": 109399
  },
  "peers": [
    {
      "id": 55929555,
      "store_id": 1,
      "role_name": "Voter"
    },
    {
      "id": 55929556,
      "store_id": 4,
      "role_name": "Voter"
    },
    {
      "id": 55929557,
      "store_id": 5,
      "role_name": "Voter"
    },
    {
      "id": 55929558,
      "store_id": 2751139,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    }
  ],
  "leader": {
    "id": 55929555,
    "store_id": 1,
    "role_name": "Voter"
  },
  "down_peers": [
    {
      "down_seconds": 40307,
      "peer": {
        "id": 55929556,
        "store_id": 4,
        "role_name": "Voter"
      }
    }
  ],
  "pending_peers": [
    {
      "id": 55929556,
      "store_id": 4,
      "role_name": "Voter"
    }
  ],
  "cpu_usage": 0,
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 1,
  "approximate_keys": 40960
}

This state is probably due to an unfinished recovery process. Usually, PD can resolve such an intermediate state automatically in one of two ways:

  • This state does not comply with the 3-replica rule, so PD tries to remove one replica, preferring the peer with an "unusual role" (here, the learner). However, this operation requires all other peers to be healthy, which is not the case here, so it is skipped. This can be confirmed by the PD metric "skip-remove-orphan-peer".
  • This state does not comply with the "no down peer" rule, so PD tries to remove the down peer and add a new one. This is done in three steps: 1. add a learner; 2. promote the learner and demote the voter through joint consensus; 3. remove the demoted learner. But since this cluster has only 3 stores and all of them already hold a peer of these regions, there is no store to place the new peer on, so this operation cannot proceed either. This can be confirmed by the PD metrics "replace-down" and "no-store-replace". (A sketch of both checks follows this list.)
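Below is a minimal Go sketch of the two checks above, assuming simplified, illustrative types (peer, regionState); the metric names are the ones from the report, but this is not PD's actual rule_checker code. It only shows why both paths bail out for the reported region.

// Hypothetical, simplified sketch of the two checker paths described above.
// The types and function names are illustrative; they do not match PD's
// actual rule_checker implementation.
package main

import "fmt"

type peer struct {
	storeID   uint64
	isLearner bool
	isDown    bool
}

type regionState struct {
	peers      []peer
	storeCount int // stores available in the cluster
}

// tryRemoveOrphanPeer mirrors path 1: the extra (orphan) peer can only be
// removed when every other peer is healthy.
func tryRemoveOrphanPeer(r regionState) string {
	for _, p := range r.peers {
		if p.isDown {
			return "skip-remove-orphan-peer" // an unhealthy peer blocks removal
		}
	}
	return "remove-orphan-peer"
}

// tryReplaceDownPeer mirrors path 2: replacing the down peer needs a store
// that does not already hold a peer of this region.
func tryReplaceDownPeer(r regionState) string {
	if len(r.peers) >= r.storeCount {
		return "no-store-replace" // every store already holds a peer
	}
	return "replace-down"
}

func main() {
	// The stuck region from the report: 4 peers (one down, one learner)
	// in a cluster that now has only 3 live stores.
	r := regionState{
		peers: []peer{
			{storeID: 1},
			{storeID: 4, isDown: true},
			{storeID: 5},
			{storeID: 2751139, isLearner: true},
		},
		storeCount: 3,
	}
	fmt.Println(tryRemoveOrphanPeer(r), tryReplaceDownPeer(r))
	// Output: skip-remove-orphan-peer no-store-replace
}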

Because of the above constraints, these 3 regions get stuck in this state.

PD should be able to handle this case: e.g., when it finds a region with 4 peers (2 healthy voters + 1 down peer + 1 learner), it could promote the learner to a voter and remove the down peer.
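A rough Go sketch of this proposal, close in spirit to the "replace unhealthy peer with orphan peer" logic that #6831 later added; the types (peerInfo, step) and step names here are illustrative, not PD's actual operator API.

// Hypothetical sketch of the proposed handling: when a region has
// 2 healthy voters + 1 down voter + 1 healthy learner, promote the
// learner and remove the down peer.
package main

import "fmt"

type peerInfo struct {
	storeID   uint64
	isLearner bool
	isDown    bool
}

type step struct {
	kind    string // "promote-learner" or "remove-peer"
	storeID uint64
}

// fixStuckRegion returns the steps for a region shaped like the report
// (2 healthy voters + 1 down voter + 1 healthy learner), or nil otherwise.
func fixStuckRegion(peers []peerInfo) []step {
	var downVoter, healthyLearner *peerInfo
	healthyVoters := 0
	for i := range peers {
		p := &peers[i]
		switch {
		case p.isLearner && !p.isDown:
			healthyLearner = p
		case !p.isLearner && p.isDown:
			downVoter = p
		case !p.isLearner:
			healthyVoters++
		}
	}
	if healthyVoters == 2 && downVoter != nil && healthyLearner != nil {
		return []step{
			{kind: "promote-learner", storeID: healthyLearner.storeID},
			{kind: "remove-peer", storeID: downVoter.storeID},
		}
	}
	return nil
}

func main() {
	peers := []peerInfo{
		{storeID: 1},
		{storeID: 4, isDown: true},
		{storeID: 5},
		{storeID: 2751139, isLearner: true},
	}
	fmt.Println(fixStuckRegion(peers))
	// Output: [{promote-learner 2751139} {remove-peer 4}]
}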

What version of PD are you using (pd-server -V)?

6.5.0

@v01dstar v01dstar added the type/bug label Jun 6, 2023
@jebter jebter added the severity/critical label Jul 7, 2023
@nolouch nolouch self-assigned this Jul 13, 2023
ti-chi-bot bot added a commit that referenced this issue Jul 26, 2023
close #6559

add logic try to replace unhealthy peer with orphan peer

Signed-off-by: nolouch <nolouch@gmail.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Jul 26, 2023
close tikv#6559

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Jul 26, 2023
close tikv#6559

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot bot pushed a commit that referenced this issue Jul 26, 2023
close #6559

add logic try to replace unhealthy peer with orphan peer

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: nolouch <nolouch@gmail.com>

Co-authored-by: ShuNing <nolouch@gmail.com>
Co-authored-by: nolouch <nolouch@gmail.com>
ti-chi-bot bot pushed a commit that referenced this issue Aug 2, 2023
close #6559

add logic try to replace unhealthy peer with orphan peer

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: nolouch <nolouch@gmail.com>

Co-authored-by: ShuNing <nolouch@gmail.com>
Co-authored-by: nolouch <nolouch@gmail.com>
@seiya-annie

/found customer

@ti-chi-bot ti-chi-bot bot added the report/customer label Jun 4, 2024