[yugabyted] yugabyted node fails to restart when correct leader master is not known. #14440

nchandrappa · 2022-10-12T15:19:16Z

Issue: Leader master changes need to be handled transparently

Scenario 1:

Rolling update/ rolling upgrade

T0: 5 node cluster with --join 127.0.0.1

T1: bring down first node, 127.0.0.1

the leader master will change, say to 127.0.0.3

Rolling update of the leader master

T2: make required changes and bring back 127.0.0.1

./bin/yugabyted start
this fails ?

Yugabyted logs:

About to start master with cmd /Users/nikhil/Software/yugabyte-2.17.0.0/bin/yb-master --stop_on_parent_termination --undefok=stop_on_parent_termination --fs_data_dirs=/Users/nikhil/var/data --webserver_interface=127.0.0.1 --metrics_snapshotter_tserver_metrics_whitelist=handler_latency_yb_tserver_TabletServerService_Read_count,handler_latency_yb_tserver_TabletServerService_Write_count,handler_latency_yb_tserver_TabletServerService_Read_sum,handler_latency_yb_tserver_TabletServerService_Write_sum,disk_usage,cpu_usage,node_up --yb_num_shards_per_tserver=1 --ysql_num_shards_per_tserver=1 --placement_cloud=cloud1 --placement_region=datacenter1 --placement_zone=rack1 --rpc_bind_addresses=127.0.0.1:7100 --server_broadcast_addresses=127.0.0.1:7100 --replication_factor=1 --use_initial_sys_catalog_snapshot --server_dump_info_path=/Users/nikhil/var/data/master-info --master_enable_metrics_snapshotter=true --webserver_port=7000 --default_memory_limit_to_ram_ratio=0.35 --instance_uuid_override=e1ca780af5754747945b84afe363e254 --master_addresses=127.0.0.1:7100 --cluster_uuid=62bbd491-cc8c-4385-929a-a45941793b23
[yugabyted start] 2022-11-09 17:33:42,812 INFO:  | 0.1s | master started running with PID 39242.
[yugabyted start] 2022-11-09 17:33:42,813 INFO:  | 0.1s | Node was a member of some cluster before. Skipping master setup
[yugabyted start] 2022-11-09 17:33:42,813 INFO:  | 0.1s | Querying for all masters in cluster
[yugabyted start] 2022-11-09 17:33:42,814 INFO:  | 0.1s | Waiting to get the full master addrs list from master
[yugabyted start] 2022-11-09 17:33:42,814 INFO:  | 0.1s | run_process: cmd: [u'/Users/nikhil/Software/yugabyte-2.17.0.0/bin/yb-admin', u'--master_addresses', u'127.0.0.1:7100', u'list_all_masters']

Solution:

Yugabyted can be updated to build a list of known masters and pass that list to the yb-admin command instead of a single address. This may not require any change to yb-admin itself.
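A minimal sketch of that idea, assuming yugabyted has some list of previously seen master addresses available at restart. The function name `build_list_masters_cmd` and its parameters are hypothetical and are not taken from the yugabyted source. It relies only on yb-admin accepting a comma-separated value for `--master_addresses`, so a restart does not depend on one particular node still being reachable.

```python
def build_list_masters_cmd(yb_admin_path, known_masters):
    """Build the argv for `yb-admin list_all_masters` from every known master.

    known_masters is an iterable of "host:port" strings (hypothetical input,
    e.g. addresses remembered from an earlier successful start).
    """
    # De-duplicate while keeping the original order.
    unique = []
    for addr in known_masters:
        if addr and addr not in unique:
            unique.append(addr)
    if not unique:
        raise ValueError("no master addresses known")
    # yb-admin takes a comma-separated list of master RPC addresses.
    return [
        yb_admin_path,
        "--master_addresses",
        ",".join(unique),
        "list_all_masters",
    ]
```

With the two masters from the logs, this yields a single `--master_addresses 127.0.0.1:7100,127.0.0.3:7100` argument rather than only `127.0.0.1:7100`.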

Rolling update of other nodes (non-leader or non-master nodes)

T3: master leader has changed, so the ip-address given in the --join flag is no longer the leader master

- ./bin/yugabyted stop

T4: start the node back up
- ./bin/yugabyted start --join=
- is the --join flag persisted?
- we need to provide the new master leader IP address for this to work

Scenario 2:

Description:

T0: 5 node cluster with --join 127.0.0.1

T1: master leader, 127.0.0.1 fails
- a new leader master gets elected, say 127.0.0.3

T2: on node 127.0.0.2

Running ./bin/yugabyted status fails.

The yb-admin command is used to find the list of all masters. However, as the logs below show, yugabyted already obtains a list of masters, so the list of IP addresses can be built from that.

Yugabyted logs:

[yugabyted start] 2022-11-09 17:15:52,317 INFO:  | 1.8s | run_process returned 0:
OUT >>
Master UUID                             RPC Host/Port           State           Role    Broadcast Host/Port
e1ca780af5754747945b84afe363e254        127.0.0.1:7100          ALIVE           LEADER  127.0.0.1:7100
0282f7e09beb4be18507c63aa3c227e2        127.0.0.3:7100          ALIVE           FOLLOWER        127.0.0.3:7100

<< ERR >>

<<
[yugabyted start] 2022-11-09 17:15:52,317 INFO:  | 1.8s | Got all masters: [u'127.0.0.1:7100', u'127.0.0.3:7100']
[yugabyted status] 2022-11-09 17:21:49,729 INFO:  | 0.0s | cmd = status using config file: /Users/nikhil/yugabyte-2.15.1.0/node3/conf/yugabyted.conf (args.config=None)
[yugabyted status] 2022-11-09 17:21:49,730 INFO:  | 0.0s | Found directory /Users/nikhil/Software/yugabyte-2.17.0.0/bin for file gen_certs.sh
[yugabyted status] 2022-11-09 17:21:49,730 INFO:  | 0.0s | Found directory /Users/nikhil/Software/yugabyte-2.17.0.0/bin for file yb-admin
[yugabyted status] 2022-11-09 17:21:49,740 INFO:  | 0.0s | Waiting to get the full master addrs list from master
[yugabyted status] 2022-11-09 17:21:49,740 INFO:  | 0.0s | run_process: cmd: [u'/Users/nikhil/Software/yugabyte-2.17.0.0/bin/yb-admin', u'--master_addresses', u'127.0.0.1:7100', u'list_all_masters']

Solution:

Yugabyted can be updated to keep a list of the current masters and pass that list to the yb-admin command. This may not require any change to yb-admin itself.
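As a rough illustration, the `list_all_masters` table shown in the logs can be turned into a list of addresses and roles with a few lines of Python. This is a sketch only: the function name `parse_list_all_masters` is hypothetical, and the column layout (UUID, RPC host/port, state, role, broadcast host/port) is assumed from the log output above rather than from any documented format.

```python
import re

# A master UUID in the log output is 32 lowercase hex characters.
_UUID_RE = re.compile(r"[0-9a-f]{32}")


def parse_list_all_masters(output):
    """Return a list of (rpc_address, role) tuples from yb-admin output.

    Header lines and log markers such as "OUT >>" are skipped because
    their first field is not a UUID.
    """
    masters = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and _UUID_RE.fullmatch(fields[0]):
            # fields: UUID, RPC host:port, state, role, [broadcast host:port]
            masters.append((fields[1], fields[3]))
    return masters
```

The addresses returned here are the ones a node could remember and reuse on its next `start` or `status` call, so the command no longer depends on the original `--join` address being alive.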


…n flag during cluster creation

Summary: Code changes for handling multi-node cluster deployment using yugabyted. This diff handles the following scenarios in a yugabyted multi-node deployment:
- Any node's IP address can be used for cluster creation. Code changes are made to use the t-server api/v1/masters endpoint to get the active list of masters for cluster creation.
- Leader master failure scenarios.
- Rolling upgrade of the cluster without needing to update the current leader master on each node.

Test Plan: yugabyted tests
Reviewers: sgarg-yb
Reviewed By: sgarg-yb
Subscribers: nikhil
Differential Revision: https://phabricator.dev.yugabyte.com/D22116

Summary: Code changes for [#14440] broke the behavior of EAR. Code changes to fix the bug.

Test Plan: no tests
Reviewers: sgarg-yb
Reviewed By: sgarg-yb
Subscribers: nikhil
Differential Revision: https://phabricator.dev.yugabyte.com/D22316

nchandrappa · 2023-01-23T16:43:26Z

Code changes landed.

nchandrappa added the area/ossexp and area/db-usability labels Oct 12, 2022

nchandrappa mentioned this issue Oct 12, 2022

[YugabyteD] [QA] Yugabyted fails to start after a host restart. #14111

Open

nchandrappa added this to To-do in DB Usability Oct 12, 2022

nchandrappa added the priority/high High Priority label Oct 12, 2022

nchandrappa moved this from To-do to In-progress in DB Usability Nov 17, 2022

nchandrappa changed the title ~~[yugabyted] yugabyted nodes fails to restart when correct leader master is not known.~~ [yugabyted] yugabyted node fails to restart when correct leader master is not known. Nov 18, 2022

nchandrappa closed this as completed Jan 23, 2023

nchandrappa moved this from In-progress to Done in DB Usability Jan 23, 2023
