[DocDB] LB: Allow adding peers to Tservers which have pending deletes #21806

Closed
1 task done
lingamsandeep opened this issue Apr 3, 2024 · 0 comments
Labels: area/docdb (YugabyteDB core features), kind/bug (This issue is a bug), priority/medium (Medium priority issue)

lingamsandeep commented Apr 3, 2024

Jira Link: DB-10699

Description

Currently, if any tablet on a T-server has a pending delete, the load balancer skips adding any more peers to that T-server. This is too restrictive and causes the load balancer to get stuck with this message:

cluster_balance_util.cc:451] tablet server 71ac175813a14cb89714f4619b681c31 has a pending delete. Not allowing it to take more tablets

Proposed fix: Allow adding more peers to a T-server even if it has a pending delete, as long as the peer being added is not for a tablet that has a pending delete on that T-server.
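
The difference between the current check and the proposed one can be sketched as below. This is a minimal illustration, not the actual code in cluster_balance_util.cc: the type aliases, the `PendingDeletes` map, and both function names are hypothetical and exist only for this example.

```cpp
#include <cassert>  // only needed for assert-based checks like the ones in the note below
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical types for illustration; not the real YugabyteDB classes.
using TabletId = std::string;
using TabletServerId = std::string;

// Tablets whose delete is still outstanding, keyed by tablet server.
using PendingDeletes =
    std::unordered_map<TabletServerId, std::unordered_set<TabletId>>;

// Behavior described in this issue: any pending delete on the tablet
// server blocks every replica addition to it, whatever the tablet.
bool CanAddReplicaOld(const PendingDeletes& pending,
                      const TabletServerId& ts,
                      const TabletId& /*tablet*/) {
  auto it = pending.find(ts);
  return it == pending.end() || it->second.empty();
}

// Proposed behavior: block the addition only when the pending delete
// on this tablet server is for the same tablet.
bool CanAddReplicaProposed(const PendingDeletes& pending,
                           const TabletServerId& ts,
                           const TabletId& tablet) {
  auto it = pending.find(ts);
  return it == pending.end() || it->second.count(tablet) == 0;
}
```

With a pending delete for `tablet-a` on `ts1`, `CanAddReplicaOld(pending, "ts1", "tablet-b")` returns false, while `CanAddReplicaProposed` returns true for `tablet-b` and false for `tablet-a`.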

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
lingamsandeep added the area/docdb (YugabyteDB core features) and status/awaiting-triage (Issue awaiting triage) labels Apr 3, 2024
yugabyte-ci added the kind/bug (This issue is a bug) and priority/medium (Medium priority issue) labels Apr 3, 2024
yugabyte-ci removed the status/awaiting-triage (Issue awaiting triage) label Apr 3, 2024
druzac added a commit that referenced this issue Apr 12, 2024
Summary:
The load balancer has validation logic to prevent adding tablet replicas to tablet servers that are currently deleting a tablet replica. This logic slows down load balancer actions when many tablet replica moves across the cluster are happening, such as during node additions/removals. We've also seen cases where the catalog manager is unable to delete a tablet replica and this pending delete state gets wedged for a TS, preventing the load balancer from ever adding another tablet replica to this TS.

I did some digging, and the original diff that introduced this check is D408. It gives little motivation beyond deflaking a unit test. I validated that my change in this diff does not make that unit test flaky again. This passes:
```
% ybd --with-tests --cxx-test-filter-re 'raft_consensus-itest' --cxx-test raft_consensus-itest --gtest_filter 'RaftConsensusITest.TestMasterReplacesEvictedFollowers' -n 100
```
Jira: DB-10699

Test Plan:
```
ybd --with-tests --cxx-test-filter-re 'load_balancer_mocked-test' --cxx-test load_balancer_mocked-test --gtest_filter 'LoadBalancerMockedTest.TestAddReplicaToTSWithPendingDelete'
```

Reviewers: asrivastava, jhe

Reviewed By: jhe

Subscribers: ybase, slingam

Differential Revision: https://phorge.dev.yugabyte.com/D33994
druzac pushed a commit that referenced this issue Apr 18, 2024
…o the same tablet

Summary:
Original commit: 4f98051 / D33994
The load balancer has validation logic to prevent adding tablet replicas to tablet servers that are currently deleting a tablet replica. This logic slows down load balancer actions when many tablet replica moves across the cluster are happening, such as during node additions/removals. We've also seen cases where the catalog manager is unable to delete a tablet replica and this pending delete state gets wedged for a TS, preventing the load balancer from ever adding another tablet replica to this TS.

This diff tightens the validation logic to only block tablet replica additions to a tserver which has a pending delete against that same tablet.
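
To illustrate why the blanket check can stall balancing while many moves are in flight, here is a rough sketch that filters candidate destination tservers under both rules. It is an assumption-laden model, not the load balancer's real code: `EligibleDestinations`, the `PendingDeletes` alias, and the `same_tablet_only` flag are all hypothetical names introduced for this example.

```cpp
#include <cassert>  // only needed for assert-based checks like the ones in the note below
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical model: tserver id -> ids of tablets with an outstanding delete.
using PendingDeletes =
    std::unordered_map<std::string, std::unordered_set<std::string>>;

// Returns the candidates that may receive a new replica of `tablet`.
// same_tablet_only == false models the old blanket rule (any pending
// delete blocks the tserver); true models the tightened rule (only a
// pending delete for the same tablet blocks it).
std::vector<std::string> EligibleDestinations(
    const std::vector<std::string>& candidates,
    const PendingDeletes& pending,
    const std::string& tablet,
    bool same_tablet_only) {
  std::vector<std::string> eligible;
  for (const auto& ts : candidates) {
    bool blocked = false;
    auto it = pending.find(ts);
    if (it != pending.end()) {
      blocked = same_tablet_only ? it->second.count(tablet) > 0
                                 : !it->second.empty();
    }
    if (!blocked) {
      eligible.push_back(ts);
    }
  }
  return eligible;
}
```

If every candidate has some pending delete, the blanket rule leaves no eligible destination for any tablet, and a delete that never completes keeps its tserver blocked indefinitely. Under the tightened rule, only the tserver holding a pending delete for the tablet being added is excluded.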

This backport has a different description because the original diff's description refers to a previous version of the diff and doesn't reflect the committed changes.

Jira: DB-10699

Test Plan:
```
ybd --with-tests --cxx-test-filter-re 'load_balancer_mocked-test' --cxx-test load_balancer_mocked-test --gtest_filter 'LoadBalancerMockedTest.TestAddReplicaToTSWithPendingDelete'
```

Reviewers: asrivastava, jhe

Reviewed By: asrivastava

Subscribers: slingam, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34136
druzac added a commit that referenced this issue Apr 18, 2024
… to the same tablet

Summary:
Original commit: 4f98051 / D33994

The load balancer has validation logic to prevent adding tablet replicas to tablet servers that are currently deleting a tablet replica. This logic slows down load balancer actions when many tablet replica moves across the cluster are happening, such as during node additions/removals. We've also seen cases where the catalog manager is unable to delete a tablet replica and this pending delete state gets wedged for a TS, preventing the load balancer from ever adding another tablet replica to this TS.

This diff tightens the validation logic to only block tablet replica additions to a tserver which has a pending delete against that same tablet.

This backport has a different description because the original diff's description refers to a previous version of the diff and doesn't reflect the committed changes.
Jira: DB-10699

Test Plan:
```
ybd --with-tests --cxx-test-filter-re 'load_balancer_mocked-test' --cxx-test load_balancer_mocked-test --gtest_filter 'LoadBalancerMockedTest.TestAddReplicaToTSWithPendingDelete'
ybd --with-tests --cxx-test-filter-re 'load_balancer_mocked-test' --cxx-test load_balancer_mocked-test --gtest_filter 'LoadBalancerMockedTest.TestAddReplicaToTSWithPendingDeleteForSameTablet'
```

Reviewers: asrivastava, jhe

Reviewed By: asrivastava

Subscribers: slingam, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34132
druzac added a commit that referenced this issue Apr 18, 2024
…o the same tablet

Summary:
Original commit: 4f98051 / D33994

The load balancer has validation logic to prevent adding tablet replicas to tablet servers that are currently deleting a tablet replica. This logic slows down load balancer actions when many tablet replica moves across the cluster are happening, such as during node additions/removals. We've also seen cases where the catalog manager is unable to delete a tablet replica and this pending delete state gets wedged for a TS, preventing the load balancer from ever adding another tablet replica to this TS.

This diff tightens the validation logic to only block tablet replica additions to a tserver which has a pending delete against that same tablet.

This backport has a different description because the original diff's description refers to a previous version of the diff and doesn't reflect the committed changes.
Jira: DB-10699

Test Plan:
```
ybd --with-tests --cxx-test-filter-re 'load_balancer_mocked-test' --cxx-test load_balancer_mocked-test --gtest_filter 'LoadBalancerMockedTest.TestAddReplicaToTSWithPendingDelete'
ybd --with-tests --cxx-test-filter-re 'load_balancer_mocked-test' --cxx-test load_balancer_mocked-test --gtest_filter 'LoadBalancerMockedTest.TestAddReplicaToTSWithPendingDeleteForSameTablet'
```

Reviewers: asrivastava, jhe

Reviewed By: asrivastava

Subscribers: slingam, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34131
druzac pushed a commit that referenced this issue Apr 18, 2024
…o the same tablet

Summary:
Original commit: 4f98051 / D33994
The load balancer has validation logic to prevent adding tablet replicas to tablet servers that are currently deleting a tablet replica. This logic slows down load balancer actions when many tablet replica moves across the cluster are happening, such as during node additions/removals. We've also seen cases where the catalog manager is unable to delete a tablet replica and this pending delete state gets wedged for a TS, preventing the load balancer from ever adding another tablet replica to this TS.

This diff tightens the validation logic to only block tablet replica additions to a tserver which has a pending delete against that same tablet.

Jira: DB-10699

Test Plan:
```
ybd --with-tests --cxx-test-filter-re 'load_balancer_mocked-test' --cxx-test load_balancer_mocked-test --gtest_filter 'LoadBalancerMockedTest.TestAddReplicaToTSWithPendingDelete'
```

Reviewers: asrivastava, jhe

Reviewed By: asrivastava

Subscribers: ybase, slingam

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34148