Down-peer-region can't recover when enable placement-rule policy #7808

Closed
Smityz opened this issue Feb 5, 2024 · 5 comments · Fixed by #7996
Labels
affects-6.5, affects-7.1, affects-7.5, affects-8.1, severity/moderate (The issue's severity is moderate.), type/bug (The issue is confirmed as a bug.)

Comments

Smityz (Contributor) commented Feb 5, 2024

Bug Report

What version of TiKV are you using?

v6.5.3

Steps to reproduce

  1. Set the placement rules:
[
  {
    "group_id": "pd",
    "id": "1",
    "start_key": "",
    "end_key": "",
    "role": "voter",
    "is_witness": false,
    "count": 2,
    "label_constraints": [
      {
        "key": "disk_type",
        "op": "in",
        "values": [
          "ssd"
        ]
      }
    ],
    "location_labels": [
      "host"
    ],
    "isolation_level": "host",
    "create_timestamp": 1706696121
  },
  {
    "group_id": "pd",
    "id": "2",
    "start_key": "",
    "end_key": "",
    "role": "follower",
    "is_witness": false,
    "count": 1,
    "label_constraints": [
      {
        "key": "disk_type",
        "op": "in",
        "values": [
          "mix"
        ]
      }
    ],
    "location_labels": [
      "host"
    ],
    "isolation_level": "host",
    "create_timestamp": 1706696121
  }
]
  2. Start 8 TiKV nodes, where 3 nodes are labeled disk_type=mix and the other 5 nodes are labeled disk_type=ssd. Then load data and observe that there is no leader on the mix nodes, which is expected.
  3. Force scale-in one mix node, and find that there are lots of down peers that do not recover.
  4. Disable placement-rule; the down peers then start to recover (see the pd-ctl sketch below).
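Roughly, these steps map to pd-ctl commands like the ones below (a hedged sketch: the PD address, the rules file name, and the store ID are placeholders, and the force scale-in step is only approximated here with store delete):

# step 1: enable placement rules and load the two rules above
pd-ctl -u http://<pd-addr>:2379 config placement-rules enable
pd-ctl -u http://<pd-addr>:2379 config placement-rules save --in=rules.json

# step 3: remove one disk_type=mix store
pd-ctl -u http://<pd-addr>:2379 store delete <mix-store-id>

# step 4: disabling placement rules lets the down peers start to recover
pd-ctl -u http://<pd-addr>:2379 config placement-rules disable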

Is this caused by a mistake in my placement policy, or is it a bug in placement-rule?

Smityz (Contributor, Author) commented Feb 5, 2024

Once I delete the isolation_level field from the rules, the peers can recover.
The logic behind placement rules is quite complex. Can someone help me debug it?
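Concretely (a sketch; the file name is just a placeholder), the two rules can be re-saved with the isolation_level field removed:

pd-ctl -u http://<pd-addr>:2379 config placement-rules save --in=rules_without_isolation_level.json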

rleungx (Member) commented Feb 6, 2024

Maybe you can take a look at https://docs.pingcap.com/tidb/stable/schedule-replicas-by-topology-labels#pd-schedules-based-on-topology-label to see how isolation_level works.

Smityz (Contributor, Author) commented Feb 6, 2024

I have familiarized myself with the isolation rules, but I am not certain that I fully understand them.
In my opinion, under this rule, as long as there is at least one follower node whose host differs from the leader node's host, these down peers should be able to recover.
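One way to sanity-check this (assuming the jq binary is installed; the PD address is a placeholder) is to count the distinct host label values across the stores, since isolation_level set to host asks PD to place the rule's replicas on different hosts:

# count distinct host label values among all stores
pd-ctl -u http://<pd-addr>:2379 store | jq '[.stores[].store.labels[]? | select(.key=="host") | .value] | unique | length'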

rleungx (Member) commented Mar 12, 2024

Can you also paste your store labels?

Smityz (Contributor, Author) commented Mar 12, 2024

Can you also paste your store labels?

In our cluster, 40+ nodes are labeled mix and 100+ nodes are labeled ssd, and the labels can be queried with the `pd-ctl store` command.
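For example (assuming the jq binary is available for filtering; the PD address is a placeholder, and the snippet below keeps only the relevant fields of the output):

# dump each store's id, labels, and state
pd-ctl -u http://<pd-addr>:2379 store | jq '.stores[].store | {id, labels, state_name}'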

  {
    "store": {
      "id": 801704197,
      "address": "x",
      "labels": [
        {
          "key": "host",
          "value": "x"
        },
        {
          "key": "disk_type",
          "value": "mix"
        }
      ],
      "version": "6.5.3",
      "peer_address": "x",
      "status_address": "x",
      "git_hash": "fd5f88a7fdda1bf70dcb0d239f60137110c54d46",
      "start_timestamp": 1696908201,
      "deploy_path": "x",
      "last_heartbeat": 1710229078287679784,
      "state_name": "Up"
    }
  },
  {
    "store": {
      "id": 7534286,
      "address": "x",
      "labels": [
        {
          "key": "host",
          "value": "x"
        },
        {
          "key": "disk_type",
          "value": "ssd"
        }
      ],
      "version": "6.5.3",
      "peer_address": "x",
      "status_address": "x",
      "git_hash": "fd5f88a7fdda1bf70dcb0d239f60137110c54d46",
      "start_timestamp": 1696908176,
      "deploy_path": "x",
      "last_heartbeat": 1710229083389078969,
      "state_name": "Up"
    }
  },

And this bug can be easily reproduced.

ti-chi-bot bot pushed a commit that referenced this issue Apr 10, 2024
close #7808

Signed-off-by: Ryan Leung <rleungx@gmail.com>

Co-authored-by: Yongbo Jiang <cabinfeveroier@gmail.com>
ti-chi-bot bot pushed a commit that referenced this issue May 28, 2024
close #7808

Signed-off-by: Ryan Leung <rleungx@gmail.com>

Co-authored-by: Ryan Leung <rleungx@gmail.com>
Co-authored-by: lhy1024 <liuhanyang@pingcap.com>
ti-chi-bot bot added a commit that referenced this issue May 28, 2024
close #7808

Signed-off-by: Ryan Leung <rleungx@gmail.com>

Co-authored-by: Ryan Leung <rleungx@gmail.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot pushed a commit that referenced this issue Jun 4, 2024
close #7808

Signed-off-by: Ryan Leung <rleungx@gmail.com>

Co-authored-by: Ryan Leung <rleungx@gmail.com>