
remove node sometimes doesn't wait for graceful decommission if master leader fails over at same time #2453

Closed
ramkumarvs opened this issue Sep 27, 2019 · 2 comments
Assignees
Labels
area/docdb (YugabyteDB core features), area/platform (Yugabyte Platform), priority/high (High Priority)

Comments

@ramkumarvs

Platform remove node did not perform a graceful decommission: instead of waiting for the data migration to complete, it went ahead and stopped the node.

2019-09-27 21:21:27,706 [INFO] from com.yugabyte.yw.scheduler.Scheduler in application-akka.actor.default-dispatcher-5276 - Running scheduler
2019-09-27 21:21:27,708 [INFO] from com.yugabyte.yw.commissioner.HealthChecker in application-akka.actor.default-dispatcher-5299 - Skipping customer c49f14c6-bb09-4534-be64-351a910e50a5 due to missing alerting config...
2019-09-27 21:21:30,932 [INFO] from com.yugabyte.yw.commissioner.tasks.subtasks.WaitForDataMove in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Info: iters=6600, percent=21.52777777777777, numErrors=0.
2019-09-27 21:21:41,597 [INFO] from com.yugabyte.yw.commissioner.tasks.subtasks.WaitForDataMove in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Info: iters=6700, percent=22.08333333333333, numErrors=0.
2019-09-27 21:21:52,247 [INFO] from com.yugabyte.yw.commissioner.tasks.subtasks.WaitForDataMove in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Info: iters=6800, percent=22.361111111111114, numErrors=0.
2019-09-27 21:22:02,960 [INFO] from com.yugabyte.yw.commissioner.tasks.subtasks.WaitForDataMove in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Info: iters=6900, percent=22.77777777777777, numErrors=0.
2019-09-27 21:22:13,588 [INFO] from com.yugabyte.yw.commissioner.tasks.subtasks.WaitForDataMove in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Info: iters=7000, percent=23.055555555555557, numErrors=0.
2019-09-27 21:22:19,797 [INFO] from com.yugabyte.yw.common.services.LocalYBClientService in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Closing client masters=10.44.127.171:7100,10.44.127.172:7100,10.44.127.173:7100.
2019-09-27 21:22:19,818 [INFO] from com.yugabyte.yw.commissioner.SubTaskGroup in TaskPool-1 - Running task list AnsibleClusterServerCtl.
2019-09-27 21:22:19,832 [INFO] from com.yugabyte.yw.common.DevopsBase in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - Command to run: [bin/ybcloud.sh onprem --region p16 --zone sjc --node_metadata {"ip":"10.44.127.174","sshPort":22,"sshUser":"yugabyte","region":"p16","zone":"sjc","instanceType":"ybnode","nodeName":"yb-dev-QualysDemo-n4"} instance control tserver stop --vars_file /opt/yugabyte/yugaware/data/keys/20dc9218-8b82-46c0-bdc7-3250ab5fd77f/qualysp16-key.vault --vault_password_file /opt/yugabyte/yugaware/data/keys/20dc9218-8b82-46c0-bdc7-3250ab5fd77f/qualysp16-key.vault_password --private_key_file /opt/yugabyte/yugaware/data/keys/20dc9218-8b82-46c0-bdc7-3250ab5fd77f/qualysp16-key.pem yb-dev-QualysDemo-n4]
2019-09-27 21:22:38,834 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - 
2019-09-27 21:22:38,834 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - PLAY [Perform command stop on YB tserver] **************************************
2019-09-27 21:22:38,834 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - 
2019-09-27 21:22:38,834 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - TASK [setup] *******************************************************************
2019-09-27 21:22:38,834 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - Friday 27 September 2019  21:22:21 +0000 (0:00:00.217)       0:00:00.217 ****** 
2019-09-27 21:22:38,834 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - ok: [10.44.127.174]
2019-09-27 21:22:38,834 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - 
2019-09-27 21:22:38,834 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - TASK [Call the ctl script with appropriate args] *******************************
2019-09-27 21:22:38,834 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - Friday 27 September 2019  21:22:28 +0000 (0:00:07.704)       0:00:07.921 ****** 
2019-09-27 21:22:38,834 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - changed: [10.44.127.174]
2019-09-27 21:22:38,834 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - 
2019-09-27 21:22:38,834 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - PLAY RECAP *********************************************************************
2019-09-27 21:22:38,834 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - 10.44.127.174              : ok=2    changed=1    unreachable=0    failed=0   
2019-09-27 21:22:38,835 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - 
2019-09-27 21:22:38,835 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - Friday 27 September 2019  21:22:38 +0000 (0:00:09.875)       0:00:17.796 ****** 
2019-09-27 21:22:38,835 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - =============================================================================== 
2019-09-27 21:22:38,835 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - Call the ctl script with appropriate args ------------------------------- 9.88s
2019-09-27 21:22:38,835 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - setup ------------------------------------------------------------------- 7.70s
2019-09-27 21:22:38,835 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - 2019-09-27 21:22:20,478 INFO: Found onprem cloud credentials in env.
2019-09-27 21:22:38,835 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - 2019-09-27 21:22:20,479 INFO: Running ctl command stop for process: tserver in instance: yb-dev-QualysDemo-n4
2019-09-27 21:22:38,835 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - 2019-09-27 21:22:20,479 INFO: Running ansible command ["ansible-playbook" "/opt/yugabyte/devops/yb-server-ctl.yml" "--vault-password-file" "/opt/yugabyte/yugaware/data/keys/20dc9218-8b82-46c0-bdc7-3250ab5fd77f/qualysp16-key.vault_password" "--private-key" "/opt/yugabyte/yugaware/data/keys/20dc9218-8b82-46c0-bdc7-3250ab5fd77f/qualysp16-key.pem" "--user" "yugabyte" "-i" "10.44.127.174," "-c" "ssh" "--extra-vars" "{\"ansible_port\": 22, \"cloud_type\": \"onprem\", \"server_type\": \"cluster-server\", \"process\": \"tserver\", \"ssd_size_gb\": 250, \"instance_search_pattern\": \"all\", \"yb_server_ssh_user\": \"yugabyte\", \"instance_name\": \"yb-dev-QualysDemo-n4\", \"command\": \"stop\", \"vars_file\": \"/opt/yugabyte/yugaware/data/keys/20dc9218-8b82-46c0-bdc7-3250ab5fd77f/qualysp16-key.vault\", \"cloud_zone\": \"sjc\", \"disk_iops\": 1000, \"ssh_user\": \"yugabyte\", \"yb_ansible_host\": \"10.44.127.174\", \"user_name\": \"yugabyte\", \"cloud_region\": \"p16\"}"]
2019-09-27 21:22:38,835 [INFO] from com.yugabyte.yw.commissioner.AbstractTaskBase in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - [AnsibleClusterServerCtl(0550653b-0cd0-4e28-9262-c0fb78e9524d, yb-dev-QualysDemo-n4)(yb-dev-QualysDemo-n4, tserver: stop)] STDOUT: 'PLAY [Perform command stop on YB tserver] **************************************

TASK [setup] *******************************************************************
Friday 27 September 2019  21:22:21 +0000 (0:00:00.217)       0:00:00.217 ****** 
ok: [10.44.127.174]

TASK [Call the ctl script with appropriate args] *******************************
Friday 27 September 2019  21:22:28 +0000 (0:00:07.704)       0:00:07.921 ****** 
changed: [10.44.127.174]

PLAY RECAP *********************************************************************
10.44.127.174              : ok=2    changed=1    unreachable=0    failed=0   

Friday 27 September 2019  21:22:38 +0000 (0:00:09.875)       0:00:17.796 ****** 
=============================================================================== 
Call the ctl script with appropriate args ------------------------------- 9.88s
setup ------------------------------------------------------------------- 7.70s'
@ramkumarvs ramkumarvs added the area/platform Yugabyte Platform label Sep 27, 2019
@ramkumarvs ramkumarvs self-assigned this Sep 27, 2019
@kmuthukk

Sequence:

2019-09-27 21:09:49,899 [DEBUG] from com.yugabyte.yw.commissioner.AbstractTaskBase in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-0 - Changing node yb-dev-QualysDemo-n4 state from Live to Removing in universe 0550653b-0cd0-4e28-9262-c0fb78e9524d

and then yugaware waits for data move:

2019-09-27 21:09:50,027 [INFO] from com.yugabyte.yw.common.services.LocalYBClientService in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-1 - Closing client masters=10.44.127.171:7100,10.44.127.172:7100,10.44.127.173:7100.
2019-09-27 21:09:50,038 [INFO] from com.yugabyte.yw.commissioner.SubTaskGroup in TaskPool-1 - Running task list WaitForDataMove.
2019-09-27 21:09:50,042 [INFO] from com.yugabyte.yw.commissioner.tasks.subtasks.WaitForDataMove in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Running WaitForDataMove on masterAddress = 10.44.127.171:7100,10.44.127.172:7100,10.44.127.173:7100.
2019-09-27 21:09:50,080 [INFO] from com.yugabyte.yw.commissioner.tasks.subtasks.WaitForDataMove in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Leader Master UUID=2fb4f66704b6421e902197a40e41eaf7.
2019-09-27 21:09:50,096 [INFO] from org.yb.client.AsyncYBClient in New I/O worker #5657 - Discovered tablet YB Master for table YB Master with partition ["", "")
2019-09-27 21:09:53,132 [INFO] from com.yugabyte.yw.common.DevopsBase in application-akka.actor.default-dispatcher-5276 - Command to run: [bin/ybcloud.sh aws query current-host --metadata_types instance-id vpc-id privateIp region]
2019-09-27 21:09:53,564 [INFO] from org.yb.client.AsyncYBClient in New I/O worker #5722 - Discovered tablet YB Master for table YB Master with partition ["", "")
2019-09-27 21:09:53,570 [INFO] from com.yugabyte.yw.common.services.LocalYBClientService in application-akka.actor.default-dispatcher-5302 - Closing client masters=10.44.127.171:7100,10.44.127.172:7100,10.44.127.173:7100.
2019-09-27 21:09:56,824 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in application-akka.actor.default-dispatcher-5276 - {"error": "Unable to fetch host metadata"}
2019-09-27 21:09:56,824 [INFO] from com.yugabyte.yw.common.DevopsBase in application-akka.actor.default-dispatcher-5276 - Command to run: [bin/ybcloud.sh gcp query current-host]
2019-09-27 21:09:58,047 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in application-akka.actor.default-dispatcher-5276 - {"error": "Host not in GCP."}
2019-09-27 21:09:58,710 [INFO] from com.yugabyte.yw.common.services.LocalYBClientService in application-akka.actor.default-dispatcher-5276 - Closing client masters=10.44.127.171:7100,10.44.127.172:7100,10.44.127.173:7100.
2019-09-27 21:09:58,714 [INFO] from com.yugabyte.yw.common.services.LocalYBClientService in application-akka.actor.default-dispatcher-5276 - Closing client masters=10.44.127.171:7100,10.44.127.172:7100,10.44.127.173:7100.
2019-09-27 21:09:58,714 [WARN] from com.yugabyte.yw.common.services.LocalYBClientService in application-akka.actor.default-dispatcher-5276 - Closing client with masters=10.44.127.171:7100,10.44.127.172:7100,10.44.127.173:7100 hit error Cannot proceed, the client to [10.44.127.171:7100, 10.44.127.172:7100, 10.44.127.173:7100] has already been closed.
2019-09-27 21:10:00,634 [INFO] from com.yugabyte.yw.commissioner.tasks.subtasks.WaitForDataMove in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Info: iters=100, percent=0.13888888888888573, numErrors=0.
2019-09-27 21:10:11,164 [INFO] from com.yugabyte.yw.commissioner.tasks.subtasks.WaitForDataMove in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Info: iters=200, percent=0.27777777777777146, numErrors=0.
2019-09-27 21:10:21,725 [INFO] from com.yugabyte.yw.commissioner.tasks.subtasks.WaitForDataMove in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Info: iters=300, percent=0.4166666666666714, numErrors=0.
2019-09-27 21:10:32,278 [INFO] from com.yugabyte.yw.commissioner.tasks.subtasks.WaitForDataMove in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Info: iters=400, percent=0.5555555555555571, numErrors=0.
2019-09-27 21:10:42,895 [INFO] from com.yugabyte.yw.commissioner.tasks.subtasks.WaitForDataMove in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Info: iters=500, percent=1.1111111111111143, numErrors=0.
2019-09-27 21:10:53,528 [INFO] from com.yugabyte.yw.commissioner.tasks.subtasks.WaitForDataMove in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Info: iters=600, percent=2.0833333333333286, numErrors=0.
2019-09-27 21:11:04,107 [INFO] from com.yugabyte.yw.commissioner.tasks.subtasks.WaitForDataMove in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Info: iters=700, percent=2.0833333333333286, numErrors=0.
2019-09-27 21:11:14,702 [INFO] from com.yugabyte.yw.commissioner.tasks.subtasks.WaitForDataMove in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Info: iters=800, percent=2.5, numErrors=0.

but then abruptly around here gives up and decides to stop the TServer too early:

2019-09-27 21:21:41,597 [INFO] from com.yugabyte.yw.commissioner.tasks.subtasks.WaitForDataMove in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Info: iters=6700, percent=22.08333333333333, numErrors=0.
2019-09-27 21:21:52,247 [INFO] from com.yugabyte.yw.commissioner.tasks.subtasks.WaitForDataMove in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Info: iters=6800, percent=22.361111111111114, numErrors=0.
2019-09-27 21:22:02,960 [INFO] from com.yugabyte.yw.commissioner.tasks.subtasks.WaitForDataMove in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Info: iters=6900, percent=22.77777777777777, numErrors=0.
2019-09-27 21:22:13,588 [INFO] from com.yugabyte.yw.commissioner.tasks.subtasks.WaitForDataMove in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Info: iters=7000, percent=23.055555555555557, numErrors=0.
2019-09-27 21:22:19,797 [INFO] from com.yugabyte.yw.common.services.LocalYBClientService in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-2 - Closing client masters=10.44.127.171:7100,10.44.127.172:7100,10.44.127.173:7100.
2019-09-27 21:22:19,818 [INFO] from com.yugabyte.yw.commissioner.SubTaskGroup in TaskPool-1 - Running task list AnsibleClusterServerCtl.
2019-09-27 21:22:19,832 [INFO] from com.yugabyte.yw.common.DevopsBase in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - Command to run: [bin/ybcloud.sh onprem --region p16 --zone sjc --node_metadata {"ip":"10.44.127.174","sshPort":22,"sshUser":"yugabyte","region":"p16","zone":"sjc","instanceType":"ybnode","nodeName":"yb-dev-QualysDemo-n4"} instance control tserver stop --vars_file /opt/yugabyte/yugaware/data/keys/20dc9218-8b82-46c0-bdc7-3250ab5fd77f/qualysp16-key.vault --vault_password_file /opt/yugabyte/yugaware/data/keys/20dc9218-8b82-46c0-bdc7-3250ab5fd77f/qualysp16-key.vault_password --private_key_file /opt/yugabyte/yugaware/data/keys/20dc9218-8b82-46c0-bdc7-3250ab5fd77f/qualysp16-key.pem yb-dev-QualysDemo-n4]
2019-09-27 21:22:38,834 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 -
2019-09-27 21:22:38,834 [INFO] from com.yugabyte.yw.common.ShellProcessHandler in TaskPool-RemoveNodeFromUniverse(0550653b-0cd0-4e28-9262-c0fb78e9524d)-3 - PLAY [Perform command stop on YB tserver] **************************************

@kmuthukk kmuthukk added the priority/high High Priority label Sep 27, 2019
@kmuthukk kmuthukk changed the title Platform remove node on didn't do a graceful decommission Platform remove node sometimes doesn't wait for graceful decommission if master leader fails over at same time Sep 28, 2019
@kmuthukk kmuthukk assigned rajukumaryb and unassigned ramkumarvs and Arnav15 Sep 28, 2019
@kmuthukk

kmuthukk commented Sep 28, 2019

node08:

Initially, the "get move percent completed" query keeps running against this master, which is the leader:

I0927 21:22:17.395844 442554 catalog_manager.cc:6177] Blacklisted count 554 in 1290 tablets, across 1 servers, with initial load 720
I0927 21:22:17.500135  5255 catalog_manager.cc:6177] Blacklisted count 554 in 1290 tablets, across 1 servers, with initial load 720
....
I0927 21:22:18.575513 413574 raft_consensus.cc:2910] T 00000000000000000000000000000000 P 2fb4f66704b6421e902197a40e41eaf7 [term 77 LEADER]: Stepping down as leader of term 77 since new term is 78

node09 is the new leader, and YugaWare correctly asks the new leader for blacklisted servers load:

I0927 21:22:18.890250 85419 catalog_manager.cc:5956] Set blacklist size = 1 with load 720 for num_tablets = 1290
...
I0927 21:22:19.677361 356277 catalog_manager.cc:6177] Blacklisted count 554 in 1290 tablets, across 1 servers, with initial load 0

But the initial load (for blacklisted servers) still being 0 on the new leader (because it hasn't yet received a full tablet report from the blacklisted server) causes this code in catalog_manager.cc to incorrectly report that all tablets have been moved.

  // Case when a blacklisted server did not have any starting load.
  if (state.initial_load_ == 0) {
    resp->set_percent(100);
    return Status::OK();
  }

The impact of this issue should be limited to cases where a master leader failover happens at the same time.

Discussed with @rajukumaryb -- one small safety check we'll add: if the blacklisted count (554 above) is greater than the initial load (0), reset the initial load to 554. But this isn't a bulletproof fix. Ideally, either yb-master should wait until it has received at least one heartbeat from the blacklisted server before responding to GetLoadMoveCompletionPercent(), or YugaWare should check directly with the blacklisted yb-tserver to make sure its tablet count has dropped to 0.

@kmuthukk kmuthukk changed the title Platform remove node sometimes doesn't wait for graceful decommission if master leader fails over at same time remove node sometimes doesn't wait for graceful decommission if master leader fails over at same time Sep 28, 2019
@kmuthukk kmuthukk added the area/docdb YugabyteDB core features label Sep 28, 2019
@bmatican bmatican added this to To Do in YBase features via automation Oct 8, 2019
@bmatican bmatican moved this from To Do to In progress in YBase features Oct 8, 2019
rajukumaryb added a commit that referenced this issue Oct 9, 2019
…tserver blacklisting

Summary:
  When a tserver is blacklisted, master leader snapshots the number of replicas to move
  so as to allow computation of progress as a percentage. When master leader fails, this
  initial snapshot of count of tablets to move is not available at the new leader. So
  reinitialize the count of tablets to move at the new master leader.

Follow on tasks -

#2552
#2553
#2554

Test Plan: ./yb_build.sh debug --scb --java-test org.yb.loadtester.TestClusterExpandShrink#testClusterExpandAndShrinkWithKillMasterLeader

Reviewers: rahuldesirazu, ram, hector, amitanand, bogdan

Reviewed By: bogdan

Subscribers: kannan, nicolas, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D7323
YBase features automation moved this from In progress to Done Oct 9, 2019