[yugabyted] yugabyted node fails to restart when correct leader master is not known. #14440

Closed
nchandrappa opened this issue Oct 12, 2022 · 1 comment
Labels
area/db-usability issue related to DB usability project. Including yugabyted cli and yugabyted ui issues. area/ossexp DB usability Project priority/high High Priority

Comments

@nchandrappa
Contributor

nchandrappa commented Oct 12, 2022

Issue: Leader master changes need to be handled transparently

Scenario 1:

Rolling update/ rolling upgrade

T0: 5 node cluster with --join 127.0.0.1

T1: bring down first node, 127.0.0.1

  • the leader master changes, say to 127.0.0.3

Rolling update of the leader master

T2: make required changes and bring back 127.0.0.1

  • ./bin/yugabyted start
  • this fails

Yugabyted logs:

About to start master with cmd /Users/nikhil/Software/yugabyte-2.17.0.0/bin/yb-master --stop_on_parent_termination --undefok=stop_on_parent_termination --fs_data_dirs=/Users/nikhil/var/data --webserver_interface=127.0.0.1 --metrics_snapshotter_tserver_metrics_whitelist=handler_latency_yb_tserver_TabletServerService_Read_count,handler_latency_yb_tserver_TabletServerService_Write_count,handler_latency_yb_tserver_TabletServerService_Read_sum,handler_latency_yb_tserver_TabletServerService_Write_sum,disk_usage,cpu_usage,node_up --yb_num_shards_per_tserver=1 --ysql_num_shards_per_tserver=1 --placement_cloud=cloud1 --placement_region=datacenter1 --placement_zone=rack1 --rpc_bind_addresses=127.0.0.1:7100 --server_broadcast_addresses=127.0.0.1:7100 --replication_factor=1 --use_initial_sys_catalog_snapshot --server_dump_info_path=/Users/nikhil/var/data/master-info --master_enable_metrics_snapshotter=true --webserver_port=7000 --default_memory_limit_to_ram_ratio=0.35 --instance_uuid_override=e1ca780af5754747945b84afe363e254 --master_addresses=127.0.0.1:7100 --cluster_uuid=62bbd491-cc8c-4385-929a-a45941793b23
[yugabyted start] 2022-11-09 17:33:42,812 INFO:  | 0.1s | master started running with PID 39242.
[yugabyted start] 2022-11-09 17:33:42,813 INFO:  | 0.1s | Node was a member of some cluster before. Skipping master setup
[yugabyted start] 2022-11-09 17:33:42,813 INFO:  | 0.1s | Querying for all masters in cluster
[yugabyted start] 2022-11-09 17:33:42,814 INFO:  | 0.1s | Waiting to get the full master addrs list from master
[yugabyted start] 2022-11-09 17:33:42,814 INFO:  | 0.1s | run_process: cmd: [u'/Users/nikhil/Software/yugabyte-2.17.0.0/bin/yb-admin', u'--master_addresses', u'127.0.0.1:7100', u'list_all_masters']

Solution:

yugabyted can be updated to build a list of all known masters and pass that list to the yb-admin command instead of a single address. This may not require any change to yb-admin itself.
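
A minimal sketch of the proposed solution, assuming (as the yb-admin documentation describes) that `--master_addresses` accepts a comma-separated list of `host:port` pairs. The helper name `build_list_masters_cmd` and its parameters are hypothetical and are not taken from the yugabyted source:

```python
def build_list_masters_cmd(yb_admin_path, known_masters):
    """Return the argv for `yb-admin list_all_masters` using every known master.

    Hypothetical helper: `known_masters` is whatever list of "host:port"
    strings the node has remembered from an earlier run.
    """
    # Deduplicate while preserving order.
    seen = set()
    unique = []
    for addr in known_masters:
        if addr not in seen:
            seen.add(addr)
            unique.append(addr)
    if not unique:
        raise ValueError("at least one master address is required")
    # Assumption: yb-admin takes a comma-separated master list, so it can
    # reach the cluster even when the first address is not the leader.
    return [yb_admin_path, "--master_addresses", ",".join(unique), "list_all_masters"]


cmd = build_list_masters_cmd(
    "bin/yb-admin",
    ["127.0.0.1:7100", "127.0.0.3:7100", "127.0.0.1:7100"],
)
print(cmd)
```

With the full list on the command line, the node no longer has to guess which master is currently the leader.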

Rolling update of other nodes (non-leader or non-master nodes)

T3: the leader master has changed, so the IP address given in the --join flag no longer points to the leader master

- ./bin/yugabyted stop

T4: start the node back up
- ./bin/yugabyted start --join=
- is the --join flag persisted?
- for the start to work, we currently have to provide the IP address of the new leader master

Scenario 2:

Description:

T0: 5 node cluster with --join 127.0.0.1

T1: the leader master, 127.0.0.1, fails
- a new leader master gets elected, say 127.0.0.3

T2: on node 127.0.0.2

  • run ./bin/yugabyted status - this fails

The yb-admin command is used to find the list of all masters. However, based on the logs below, we can build the list of IP addresses from the list of masters already available to yugabyted.

Yugabyted logs:

[yugabyted start] 2022-11-09 17:15:52,317 INFO:  | 1.8s | run_process returned 0:
OUT >>
Master UUID                             RPC Host/Port           State           Role    Broadcast Host/Port
e1ca780af5754747945b84afe363e254        127.0.0.1:7100          ALIVE           LEADER  127.0.0.1:7100
0282f7e09beb4be18507c63aa3c227e2        127.0.0.3:7100          ALIVE           FOLLOWER        127.0.0.3:7100

<< ERR >>

<<
[yugabyted start] 2022-11-09 17:15:52,317 INFO:  | 1.8s | Got all masters: [u'127.0.0.1:7100', u'127.0.0.3:7100']
[yugabyted status] 2022-11-09 17:21:49,729 INFO:  | 0.0s | cmd = status using config file: /Users/nikhil/yugabyte-2.15.1.0/node3/conf/yugabyted.conf (args.config=None)
[yugabyted status] 2022-11-09 17:21:49,730 INFO:  | 0.0s | Found directory /Users/nikhil/Software/yugabyte-2.17.0.0/bin for file gen_certs.sh
[yugabyted status] 2022-11-09 17:21:49,730 INFO:  | 0.0s | Found directory /Users/nikhil/Software/yugabyte-2.17.0.0/bin for file yb-admin
[yugabyted status] 2022-11-09 17:21:49,740 INFO:  | 0.0s | Waiting to get the full master addrs list from master
[yugabyted status] 2022-11-09 17:21:49,740 INFO:  | 0.0s | run_process: cmd: [u'/Users/nikhil/Software/yugabyte-2.17.0.0/bin/yb-admin', u'--master_addresses', u'127.0.0.1:7100', u'list_all_masters']
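
The `list_all_masters` output captured in the logs above already contains every master address and its role. A rough sketch of recovering that list, assuming the whitespace-separated column layout shown in this log (the layout may differ between versions); the function name `parse_list_all_masters` is hypothetical:

```python
SAMPLE_OUTPUT = """\
Master UUID                             RPC Host/Port           State           Role    Broadcast Host/Port
e1ca780af5754747945b84afe363e254        127.0.0.1:7100          ALIVE           LEADER  127.0.0.1:7100
0282f7e09beb4be18507c63aa3c227e2        127.0.0.3:7100          ALIVE           FOLLOWER        127.0.0.3:7100
"""


def parse_list_all_masters(output):
    """Return (all_master_rpc_addresses, leader_rpc_address_or_None)."""
    masters = []
    leader = None
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, which starts with "Master".
        if len(fields) < 4 or fields[0] == "Master":
            continue
        # Data rows: UUID, RPC host/port, state, role, broadcast host/port.
        _uuid, rpc_addr, _state, role = fields[:4]
        masters.append(rpc_addr)
        if role == "LEADER":
            leader = rpc_addr
    return masters, leader


masters, leader = parse_list_all_masters(SAMPLE_OUTPUT)
print(masters)  # ['127.0.0.1:7100', '127.0.0.3:7100']
print(leader)   # 127.0.0.1:7100
```

Saving this list after a successful start would give later `start` and `status` calls more than one address to try.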

Solution:

yugabyted can be updated to keep a list of the current masters and pass that list to the yb-admin command. This may not require any change to yb-admin itself.
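
If the masters are instead queried one at a time, the same idea can be sketched as a fallback loop: try each known master until one answers. This is an illustration only, not the yugabyted implementation. `query_any_master` and the `query` callable are hypothetical; in practice `query` might wrap a yb-admin call for a single address.

```python
def query_any_master(candidates, query):
    """Call query(addr) on each candidate in order; return (addr, result).

    Raises RuntimeError if every candidate fails.
    """
    errors = {}
    for addr in candidates:
        try:
            return addr, query(addr)
        except Exception as exc:  # broad on purpose: any failure means "try the next one"
            errors[addr] = exc
    raise RuntimeError("no master responded: %r" % (errors,))


def fake_query(addr):
    # Stand-in for a real call; simulates the old leader being down.
    if addr == "127.0.0.1:7100":
        raise ConnectionError("master unreachable")
    return "masters listed via " + addr


addr, result = query_any_master(["127.0.0.1:7100", "127.0.0.3:7100"], fake_query)
print(addr, result)
```

In this simulated run the failed former leader is skipped and the second known master answers, which is the behavior Scenario 2 is missing.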

@nchandrappa nchandrappa added area/ossexp DB usability Project area/db-usability issue related to DB usability project. Including yugabyted cli and yugabyted ui issues. labels Oct 12, 2022
@nchandrappa nchandrappa added this to To-do in DB Usability Oct 12, 2022
@nchandrappa nchandrappa added the priority/high High Priority label Oct 12, 2022
@nchandrappa nchandrappa moved this from To-do to In-progress in DB Usability Nov 17, 2022
@nchandrappa nchandrappa changed the title [yugabyted] yugabyted nodes fails to restart when correct leader master is not known. [yugabyted] yugabyted node fails to restart when correct leader master is not known. Nov 18, 2022
nchandrappa added a commit that referenced this issue Jan 13, 2023
…n flag during cluster creation

Summary:
Code changes for handling multi-node cluster deployment using yugabyted. This diff handles the
following scenarios in a yugabyted multi-node deployment:

- Any node's IP address can be used for cluster creation. The code now uses the t-server
api/v1/masters endpoint to get the active list of masters for cluster creation.
- Leader master failure scenarios.
- Rolling upgrade of the cluster without needing to update the current leader master
on each node.

Test Plan: yugabyted tests

Reviewers: sgarg-yb

Reviewed By: sgarg-yb

Subscribers: nikhil

Differential Revision: https://phabricator.dev.yugabyte.com/D22116
nchandrappa added a commit that referenced this issue Jan 19, 2023
Summary: The code changes for [#14440] broke the behavior of EAR. This diff fixes the bug.

Test Plan: no tests

Reviewers: sgarg-yb

Reviewed By: sgarg-yb

Subscribers: nikhil

Differential Revision: https://phabricator.dev.yugabyte.com/D22316
@nchandrappa
Contributor Author

Code changes landed.

@nchandrappa nchandrappa moved this from In-progress to Done in DB Usability Jan 23, 2023