[DocDB] LB: Allow adding peers to Tservers which have pending deletes #21806
Labels
area/docdb
YugabyteDB core features
kind/bug
This issue is a bug
priority/medium
Medium priority issue
druzac added a commit that referenced this issue on Apr 12, 2024

Summary:
The load balancer has validation logic to prevent adding tablet replicas to tablet servers that are currently deleting a tablet replica. This logic slows down load balancer actions when many tablet replica moves across the cluster are happening, such as during node additions/removals. We've also seen cases where the catalog manager is unable to delete a tablet replica and this pending delete state gets wedged for a TS, preventing the load balancer from ever adding another tablet replica to this TS.

I did some digging and the original diff which introduced this is D408. There is not much motivation given there besides deflaking a unit test. I validated that unit test is not flaked by my change in this diff. This passes:
```
% ybd --with-tests --cxx-test-filter-re 'raft_consensus-itest' --cxx-test raft_consensus-itest --gtest_filter 'RaftConsensusITest.TestMasterReplacesEvictedFollowers' -n 100
```
Jira: DB-10699

Test Plan:
```
ybd --with-tests --cxx-test-filter-re 'load_balancer_mocked-test' --cxx-test load_balancer_mocked-test --gtest_filter 'LoadBalancerMockedTest.TestAddReplicaToTSWithPendingDelete'
```

Reviewers: asrivastava, jhe
Reviewed By: jhe
Subscribers: ybase, slingam
Differential Revision: https://phorge.dev.yugabyte.com/D33994
druzac pushed a commit that referenced this issue on Apr 18, 2024

…o the same tablet

Summary:
Original commit: 4f98051 / D33994

The load balancer has validation logic to prevent adding tablet replicas to tablet servers that are currently deleting a tablet replica. This logic slows down load balancer actions when many tablet replica moves across the cluster are happening, such as during node additions/removals. We've also seen cases where the catalog manager is unable to delete a tablet replica and this pending delete state gets wedged for a TS, preventing the load balancer from ever adding another tablet replica to this TS.

This diff tightens the validation logic to only block tablet replica additions to a tserver which has a pending delete against that same tablet.

This backport has a different description because the original diff's description refers to a previous version of the diff and doesn't reflect the committed changes.
Jira: DB-10699

Test Plan:
```
ybd --with-tests --cxx-test-filter-re 'load_balancer_mocked-test' --cxx-test load_balancer_mocked-test --gtest_filter 'LoadBalancerMockedTest.TestAddReplicaToTSWithPendingDelete'
```

Reviewers: asrivastava, jhe
Reviewed By: asrivastava
Subscribers: slingam, ybase
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D34136
druzac added a commit that referenced this issue on Apr 18, 2024

… to the same tablet

Summary:
Original commit: 4f98051 / D33994

The load balancer has validation logic to prevent adding tablet replicas to tablet servers that are currently deleting a tablet replica. This logic slows down load balancer actions when many tablet replica moves across the cluster are happening, such as during node additions/removals. We've also seen cases where the catalog manager is unable to delete a tablet replica and this pending delete state gets wedged for a TS, preventing the load balancer from ever adding another tablet replica to this TS.

This diff tightens the validation logic to only block tablet replica additions to a tserver which has a pending delete against that same tablet.

This backport has a different description because the original diff's description refers to a previous version of the diff and doesn't reflect the committed changes.
Jira: DB-10699

Test Plan:
```
ybd --with-tests --cxx-test-filter-re 'load_balancer_mocked-test' --cxx-test load_balancer_mocked-test --gtest_filter 'LoadBalancerMockedTest.TestAddReplicaToTSWithPendingDelete'
ybd --with-tests --cxx-test-filter-re 'load_balancer_mocked-test' --cxx-test load_balancer_mocked-test --gtest_filter 'LoadBalancerMockedTest.TestAddReplicaToTSWithPendingDeleteForSameTablet'
```

Reviewers: asrivastava, jhe
Reviewed By: asrivastava
Subscribers: slingam, ybase
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D34132
druzac added a commit that referenced this issue on Apr 18, 2024

…o the same tablet

Summary:
Original commit: 4f98051 / D33994

The load balancer has validation logic to prevent adding tablet replicas to tablet servers that are currently deleting a tablet replica. This logic slows down load balancer actions when many tablet replica moves across the cluster are happening, such as during node additions/removals. We've also seen cases where the catalog manager is unable to delete a tablet replica and this pending delete state gets wedged for a TS, preventing the load balancer from ever adding another tablet replica to this TS.

This diff tightens the validation logic to only block tablet replica additions to a tserver which has a pending delete against that same tablet.

This backport has a different description because the original diff's description refers to a previous version of the diff and doesn't reflect the committed changes.
Jira: DB-10699

Test Plan:
```
ybd --with-tests --cxx-test-filter-re 'load_balancer_mocked-test' --cxx-test load_balancer_mocked-test --gtest_filter 'LoadBalancerMockedTest.TestAddReplicaToTSWithPendingDelete'
ybd --with-tests --cxx-test-filter-re 'load_balancer_mocked-test' --cxx-test load_balancer_mocked-test --gtest_filter 'LoadBalancerMockedTest.TestAddReplicaToTSWithPendingDeleteForSameTablet'
```

Reviewers: asrivastava, jhe
Reviewed By: asrivastava
Subscribers: slingam, ybase
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D34131
druzac pushed a commit that referenced this issue on Apr 18, 2024

…o the same tablet

Summary:
Original commit: 4f98051 / D33994

The load balancer has validation logic to prevent adding tablet replicas to tablet servers that are currently deleting a tablet replica. This logic slows down load balancer actions when many tablet replica moves across the cluster are happening, such as during node additions/removals. We've also seen cases where the catalog manager is unable to delete a tablet replica and this pending delete state gets wedged for a TS, preventing the load balancer from ever adding another tablet replica to this TS.

This diff tightens the validation logic to only block tablet replica additions to a tserver which has a pending delete against that same tablet.
Jira: DB-10699

Test Plan:
```
ybd --with-tests --cxx-test-filter-re 'load_balancer_mocked-test' --cxx-test load_balancer_mocked-test --gtest_filter 'LoadBalancerMockedTest.TestAddReplicaToTSWithPendingDelete'
```

Reviewers: asrivastava, jhe
Reviewed By: asrivastava
Subscribers: ybase, slingam
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D34148
Jira Link: DB-10699
Description
Currently, if any tablet on a T-server has a pending delete, the load balancer skips adding any more peers to that T-server. This is too restrictive and causes the load balancer to get stuck with this message:
cluster_balance_util.cc:451] tablet server 71ac175813a14cb89714f4619b681c31 has a pending delete. Not allowing it to take more tablets
Proposed fix: Allow adding more peers to a T-server even if it has a pending delete, as long as the peers being added are not for a tablet that has a pending delete.
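
To make the difference between the current check and the proposed one concrete, here is a minimal C++ sketch. It is not the actual code in cluster_balance_util.cc: the `PendingDeletes` map and the functions `CanAddReplicaCurrent` and `CanAddReplicaProposed` are hypothetical names invented for illustration, and the sketch assumes pending deletes can be tracked as a set of tablet IDs per T-server.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical types for illustration only; not YugabyteDB's real data structures.
using TabletId = std::string;
using TabletServerId = std::string;

// Assumed bookkeeping: for each T-server, the tablets whose delete is still pending.
using PendingDeletes =
    std::unordered_map<TabletServerId, std::unordered_set<TabletId>>;

// Current behavior as described in this issue: any pending delete on the
// T-server blocks every replica addition to it, whatever the tablet.
bool CanAddReplicaCurrent(const PendingDeletes& pending,
                          const TabletServerId& ts,
                          const TabletId& /*tablet*/) {
  auto it = pending.find(ts);
  return it == pending.end() || it->second.empty();
}

// Proposed behavior: block the addition only when the T-server has a
// pending delete for the same tablet that is being added.
bool CanAddReplicaProposed(const PendingDeletes& pending,
                           const TabletServerId& ts,
                           const TabletId& tablet) {
  auto it = pending.find(ts);
  return it == pending.end() || it->second.count(tablet) == 0;
}
```

Under these assumptions, a T-server with a wedged pending delete for one tablet would still be rejected as a target for that tablet, but it could accept replicas of every other tablet, so the load balancer would no longer stall on it.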
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information