remove node sometimes doesn't wait for graceful decommission if master leader fails over at same time #2453
Sequence:
and then YugaWare waits for the data move:
but then it abruptly gives up around here and stops the TServer too early:
node08: initially the "get move percent completed" keeps running against this master which is leader:
node09 is the new leader, and YugaWare correctly asks the new leader for blacklisted servers load:
But the initial load for the blacklisted servers is still 0 on the new leader, because it has not yet received a full tablet report from the blacklisted server. This causes this code in catalog_manager.cc to incorrectly report that all tablets have been moved.
The impact of this issue should be limited to cases where a master leader failover happens at the same time. Discussed with @rajukumaryb: one small safety check we'll add is that if the blacklist count (554 above) is greater than the initial load (0), we reset the initial load to 554. But this isn't a bulletproof fix. Ideally, either yb-master needs to wait until it has heard at least one heartbeat from the blacklisted server before responding to a GetLoadMoveCompletionPercent(), or YugaWare should check directly with the blacklisted yb-tserver to make sure its tablet count has dropped to 0.
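To make the proposed safety check concrete, here is a minimal sketch of the idea in Python. It is not the actual catalog_manager.cc code: the class `BlacklistProgress`, its field `initial_load`, and the method `completion_percent` are hypothetical names, and the percentage formula is an assumption based on the description above.

```python
from dataclasses import dataclass


@dataclass
class BlacklistProgress:
    """Hypothetical model of the leader's view of a blacklisted server."""

    # Replicas on blacklisted servers when the snapshot was taken.
    # A freshly elected leader starts at 0 because it has no snapshot.
    initial_load: int = 0

    def completion_percent(self, current_load: int) -> float:
        # Safety check: if more replicas are reported now than the
        # snapshot recorded, the snapshot is stale, so reset it.
        if current_load > self.initial_load:
            self.initial_load = current_load
        # Assumed behavior: with nothing recorded to move, report "done".
        if self.initial_load == 0:
            return 100.0
        moved = self.initial_load - current_load
        return 100.0 * moved / self.initial_load


new_leader = BlacklistProgress()          # failover: snapshot lost
first = new_leader.completion_percent(554)    # snapshot reset to 554
halfway = new_leader.completion_percent(277)
done = new_leader.completion_percent(0)
```

The check only helps once the new leader has seen a non-zero load. If it is asked before any tablet report arrives, both values are 0 and it still reports completion, which is why this is not a bulletproof fix.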
…tserver blacklisting

Summary: When a tserver is blacklisted, master leader snapshots the number of replicas to move so as to allow computation of progress as a percentage. When master leader fails, this initial snapshot of count of tablets to move is not available at the new leader. So reinitialize the count of tablets to move at the new master leader.

Follow on tasks - #2552 #2553 #2554

Test Plan: ./yb_build.sh debug --scb --java-test org.yb.loadtester.TestClusterExpandShrink#testClusterExpandAndShrinkWithKillMasterLeader

Reviewers: rahuldesirazu, ram, hector, amitanand, bogdan
Reviewed By: bogdan
Subscribers: kannan, nicolas, ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D7323
Platform remove node did not perform a graceful decommission: it stopped the node before the wait for data migration had completed.
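The orchestrator-side alternative, confirming with the blacklisted yb-tserver before stopping it, could look roughly like the sketch below. This is not YugaWare code: `wait_for_decommission` and both callables are hypothetical stand-ins for whatever RPCs the platform actually uses to read the master's move percentage and the tserver's tablet count.

```python
import time
from typing import Callable


def wait_for_decommission(
    get_move_percent: Callable[[], float],       # hypothetical: asks master leader
    get_tserver_tablet_count: Callable[[], int],  # hypothetical: asks the tserver
    poll_interval_s: float = 1.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True only when both sources agree the data move is done."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        # Do not trust the master's percentage alone: a new leader may
        # report 100% before it has a tablet report from the tserver.
        if get_move_percent() >= 100.0 and get_tserver_tablet_count() == 0:
            return True
        sleep(poll_interval_s)
    return False  # timed out; the caller should not stop the node


# Simulated failover: the master wrongly says 100% from the start,
# while the tserver still holds tablets for the first two polls.
remaining = iter([554, 100, 0])
polls = []


def tablet_count() -> int:
    polls.append(1)
    return next(remaining)


safe_to_stop = wait_for_decommission(
    lambda: 100.0, tablet_count, sleep=lambda _s: None
)
```

Requiring both conditions means a premature 100% from a newly elected leader only delays the check rather than triggering an early stop.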