[DocDB] Reject heartbeats if universe_id of sender does not match receiver's universe_id #17904
Closed, 1 task done
Labels: area/docdb (YugabyteDB core features), kind/enhancement (This is an enhancement of an existing feature), priority/high (High Priority)
lingamsandeep added the area/docdb (YugabyteDB core features) and status/awaiting-triage (Issue awaiting triage) labels on Jun 22, 2023
yugabyte-ci added the kind/bug (This issue is a bug), priority/medium (Medium priority issue), and priority/high (High Priority) labels and removed the status/awaiting-triage (Issue awaiting triage) and priority/medium (Medium priority issue) labels on Jun 22, 2023
charleswang234 added a commit that referenced this issue on Jul 20, 2023

…om YBA during master/tserver startup

Summary:
Allow the `--cluster_uuid` gflag to be populated on both tserver and master processes. Previously, YBA set this gflag only on the master nodes. We add this change so future YBDB versions can use this gflag to check inter-node RPCs (for this issue: #17904).

With these changes, the `--cluster_uuid` gflag is populated whenever we perform any action that configures gflags or installs software. So upon software upgrade, every node running at least one tserver or master process will have this gflag set in its conf file (except for read replicas, where `master/conf/server.conf` will not be set).

Example of a tserver's `server.conf` file (located under `~/tserver/conf/server.conf` on the node):
```
--allow_insecure_connections=false
--callhome_collection_level=medium
--cert_node_filename=10.150.2.146
--certs_dir=/home/yugabyte/yugabyte-tls-config
--certs_for_cdc_dir=/home/yugabyte/yugabyte-tls-producer
--certs_for_client_dir=/home/yugabyte/yugabyte-client-tls-config
--cluster_uuid=4ea76522-acfe-42ef-8832-a46fb7929768
--cql_proxy_bind_address=10.150.2.146:9042
--cql_proxy_webserver_port=12000
--enable_ysql=true
--fs_data_dirs=/mnt/d0
--max_log_size=256
--metric_node_name=yb-admin-cwang-4-node-universe-n2
--pgsql_proxy_bind_address=10.150.2.146:5433
--pgsql_proxy_webserver_port=13000
--placement_cloud=gcp
--placement_region=us-west1
--placement_uuid=4e3a2744-618b-46ec-8aff-0e4b7170062b
--placement_zone=us-west1-c
--redis_proxy_bind_address=10.150.2.146:6379
--replication_factor=3
--rpc_bind_addresses=10.150.2.146:9100
--server_broadcast_addresses=
--start_cql_proxy=true
--start_redis_proxy=false
--tserver_master_addrs=10.150.2.144:7100,10.150.2.146:7100,10.150.2.155:7100
--txn_table_wait_min_ts_count=4
--undefok=enable_ysql
--use_cassandra_authentication=true
--use_client_to_server_encryption=true
--use_node_to_node_encryption=true
--webserver_ca_certificate_file=/home/yugabyte/yugabyte-tls-config/ca.crt
--webserver_certificate_file=/home/yugabyte/yugabyte-tls-config/node.10.150.2.146.crt
--webserver_interface=10.150.2.146
--webserver_port=9000
--webserver_private_key_file=/home/yugabyte/yugabyte-tls-config/node.10.150.2.146.key
--webserver_redirect_http_to_https=true
--ysql_enable_auth=true
--ysql_hba_conf_csv=local all yugabyte trust
```
Note:
- The `cluster_uuid` gflag is equivalent to `universe_uuid` from the YBA standpoint.

Test Plan:
Tested the following scenarios:
1. Create a new universe with read replicas: 6 nodes in the primary cluster with a replication factor of 3, plus a read-replica cluster with 3 nodes. Make sure that upon successful creation, all nodes in the primary cluster have the `cluster_uuid` gflag set in both their master and tserver gflags conf files (see summary for example). For the read-replica cluster, check that the tserver's conf file contains the `cluster_uuid` gflag. The master conf file will not contain the gflag (RR will never have a master process).
2. Have an existing universe, created before this change was added, with the same configuration as (1). At that point only master nodes will have the gflag set for the master/tserver conf files. Perform a software upgrade. After the software upgrade, all nodes (considering the RR edge case) should have both their master and tserver gflags conf files populated with the `cluster_uuid` gflag.

Reviewers: sanketh, nbhatia, hzare, nsingh, yshchetinin
Reviewed By: yshchetinin
Subscribers: yugaware
Differential Revision: https://phorge.dev.yugabyte.com/D26413
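The check described in the test plan above (does a node's `server.conf` carry a `cluster_uuid` gflag?) can be sketched in a few lines of Python. This is only an illustration, not YBA's verification code: it assumes the conf file holds one `--name=value` flag per line, and the helper names `parse_gflags_conf` and `has_valid_cluster_uuid` are hypothetical.

```python
import uuid


def parse_gflags_conf(text: str) -> dict[str, str]:
    """Parse lines of the form --name=value into a dict (assumed file layout)."""
    flags: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("--"):
            continue  # skip blank lines and anything that is not a flag
        # Split on the first '=' only, so values may contain '=' or spaces.
        name, _, value = line[2:].partition("=")
        flags[name] = value
    return flags


def has_valid_cluster_uuid(text: str) -> bool:
    """Return True if cluster_uuid is present and parses as a UUID."""
    value = parse_gflags_conf(text).get("cluster_uuid")
    if not value:
        return False
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


# A shortened sample taken from the server.conf example in the commit message.
SAMPLE = """\
--cluster_uuid=4ea76522-acfe-42ef-8832-a46fb7929768
--enable_ysql=true
--server_broadcast_addresses=
--ysql_hba_conf_csv=local all yugabyte trust
"""

if __name__ == "__main__":
    print(has_valid_cluster_uuid(SAMPLE))
```

Running the same check against both `master/conf/server.conf` and `tserver/conf/server.conf` on each node would cover the two scenarios in the test plan; on read-replica nodes only the tserver file is expected to exist.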
charleswang234 added a commit that referenced this issue on Jul 28, 2023

…ers/tservers from YBA during master/tserver startup

Summary:
Original Commit: 05f97ef / D26413

The summary, example `server.conf`, note, and test plan are identical to those of the original commit (D26413) above.

Reviewers: sanketh, hzare, yshchetinin, nbhatia
Reviewed By: sanketh
Subscribers: yugaware
Differential Revision: https://phorge.dev.yugabyte.com/D27304
charleswang234 added a commit that referenced this issue on Jul 28, 2023

…ers/tservers from YBA during master/tserver startup

Summary:
Original Commit: 05f97ef / D26413

The summary, example `server.conf`, note, and test plan are identical to those of the original commit (D26413) above.

Reviewers: sanketh, nbhatia, yshchetinin, hzare
Reviewed By: sanketh
Subscribers: yugaware
Differential Revision: https://phorge.dev.yugabyte.com/D27309
charleswang234 added a commit that referenced this issue on Aug 20, 2023

…to all masters/tservers from YBA during master/tserver startup

Summary:
Also backports `[PLAT-9408]Manage tags and skip_tags with Ansible script better`, as it is needed for this change.

Original commits:
05f97ef / D26413
6e0129a / D26555

The rest of the summary, the example `server.conf`, the note, and the test plan are identical to those of the original commit (D26413) above.

Reviewers: vpatibandla, sanketh, hzare
Reviewed By: hzare
Subscribers: yugaware
Differential Revision: https://phorge.dev.yugabyte.com/D27413
yugabyte-ci added the kind/enhancement (This is an enhancement of an existing feature) label and removed the kind/bug (This issue is a bug) label on Aug 23, 2023
charleswang234 added a commit that referenced this issue on Sep 25, 2023

…to all masters/tservers from YBA during master/tserver startup

Summary:
Original Commit: 05f97ef / D26413

Allow the `--cluster_uuid` gflag to be populated on both tserver and master processes. Previously, YBA set this gflag only on the master nodes. We add this change so future YBDB versions can use this gflag to check inter-node RPCs (for this issue: #17904).

With these changes, the `--cluster_uuid` gflag is populated whenever we perform any action that configures gflags (except for read replicas, where `master/conf/server.conf` will not be set). The 2.14 backport does not contain the changes that let a software upgrade populate the `cluster_uuid` field.

The example `server.conf` and the note on `cluster_uuid` versus `universe_uuid` are identical to those of the original commit (D26413) above.

Test Plan:
Tested the following scenarios:
1. Create a new universe with read replicas: 6 nodes in the primary cluster with a replication factor of 3, plus a read-replica cluster with 3 nodes. Make sure that upon successful creation, all nodes in the primary cluster have the `cluster_uuid` gflag set in both their master and tserver gflags conf files (see summary for example). For the read-replica cluster, check that the tserver's conf file contains the `cluster_uuid` gflag. The master conf file will not contain the gflag (RR will never have a master process).
2. Have an existing universe, created before this change was added, with the same configuration as (1). At that point only master nodes will have the gflag set for the master/tserver conf files. Perform a gflags upgrade. After the gflags upgrade, all nodes (considering the RR edge case) should have both their master and tserver gflags conf files populated with the `cluster_uuid` gflag.

Reviewers: sanketh, yshchetinin, hzare, nbhatia
Reviewed By: sanketh
Subscribers: yugaware
Differential Revision: https://phorge.dev.yugabyte.com/D27310
rahuldesirazu added a commit that referenced this issue on Sep 27, 2023

…erent universe

Summary:
Currently, if a tserver heartbeats to a master leader in a different universe, it can successfully register even though it is part of a different universe. This can happen, for example, if the tserver's `--tserver_master_addrs` are incorrectly set, or if a master is wiped and added to a new cluster but not properly removed from the existing cluster. This can result in data loss, as tasks will be triggered to clean up orphaned tablets on these tservers.

We introduce a new cluster config field `universe_uuid` that is generated only by the master leader (as opposed to `cluster_uuid`, which can be passed in as a flag). The master leader will set `universe_uuid` on the VisitSysCatalog path (newly elected leader), and set `universe_uuid` in the cluster config if not already set. We also add a similarly named field `universe_uuid` to the tserver instance metadata, indicating which cluster this tserver belongs to.

On the heartbeat path, the tserver sets `universe_uuid` in the request if it is set in its instance metadata. Otherwise, it is left unset. The master leader checks the passed-in uuid against what's in the cluster config. Here are the scenarios:
1. If the master's uuid is unset, return a TryAgain error to the tserver so it retries until the uuid is set.
2. If both are set but mismatch, then fail, since the tserver is heartbeating to the wrong cluster.
3. If the master's uuid is set but the tserver's is unset, then return the uuid to the tserver so it can set state in the instance metadata.

The master heartbeat path will now wait for the tserver to set `universe_uuid` before proceeding with any logic.

####Upgrade Implications
This feature is gated by a kLocalPersisted autoflag `master_enable_universe_uuid_heartbeat_check = true`. When this flag is enabled, the master both enables the uuid check on heartbeat and sets the `universe_uuid` as part of catalog manager bg tasks. We need an autoflag here to guard against the following situation:
1. Master leader M1 on a newer version replicates `universe_uuid` to followers on an older version.
2. Older version master M2 becomes leader.
3. M2 replicates a cluster config change. `universe_uuid` is unset.

####Backport Plan
We want to backport this change down to the 2.14 line. Due to the usage of autoflags, we require a different backport plan for each line:
2.18+: Autoflags exist with autopromotion in YBA, so we will backport the change as is.
2.16: Autoflags exist, but there is no autopromotion in YBA. We will backport the change as is, but the user will have to manually promote this flag post-upgrade.
2.14: Autoflags do not exist at all. We will need to backport a change that just uses a gflag set to false. After upgrade, the user will have to manually set this flag to true.

Jira: DB-6983

Test Plan:
ybd debug --cxx-test master-test --gtest_filter *UniverseUuid*
ybd debug --cxx-test master_heartbeat-itest --gtest_filter *PreventHeartbeatWrongCluster*

Reviewers: hsunder, asrivastava, zdrudi
Reviewed By: hsunder, asrivastava, zdrudi
Subscribers: ybase, bogdan
Differential Revision: https://phorge.dev.yugabyte.com/D27858
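The three heartbeat scenarios in the commit summary above can be sketched as a small decision function. This is a hedged Python illustration of the described behavior, not the actual C++ master code: the names `Outcome`, `HeartbeatResult`, and `check_heartbeat` are hypothetical, and the "both set and equal" case and the flag-disabled case are assumptions inferred from the summary rather than stated in it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    TRY_AGAIN = "try_again"      # scenario 1: master has no universe_uuid yet
    REJECT_MISMATCH = "reject"   # scenario 2: both set, values differ
    ASSIGN_UUID = "assign"       # scenario 3: master set, tserver unset
    ACCEPT = "accept"            # assumed: both set and equal, or check disabled


@dataclass
class HeartbeatResult:
    outcome: Outcome
    # uuid handed back to the tserver so it can persist it in instance metadata
    universe_uuid: Optional[str] = None


def check_heartbeat(
    master_uuid: Optional[str],
    tserver_uuid: Optional[str],
    check_enabled: bool = True,
) -> HeartbeatResult:
    """Model the master leader's universe_uuid check on the heartbeat path."""
    if not check_enabled:
        # Stands in for master_enable_universe_uuid_heartbeat_check being off.
        return HeartbeatResult(Outcome.ACCEPT)
    if master_uuid is None:
        return HeartbeatResult(Outcome.TRY_AGAIN)
    if tserver_uuid is None:
        return HeartbeatResult(Outcome.ASSIGN_UUID, master_uuid)
    if tserver_uuid != master_uuid:
        return HeartbeatResult(Outcome.REJECT_MISMATCH)
    return HeartbeatResult(Outcome.ACCEPT)


if __name__ == "__main__":
    # A tserver that belongs to universe "u-2" heartbeats to a master in "u-1".
    print(check_heartbeat("u-1", "u-2").outcome)
```

Checking the master's own uuid first mirrors scenario 1: until the leader has generated and persisted `universe_uuid`, no tserver can be validated or assigned a value, so every heartbeat is asked to retry.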
rahuldesirazu added a commit that referenced this issue on Sep 27, 2023

…eader in a different universe

Summary:
Original Commit: fb98e56 / D27858

The summary, upgrade implications, backport plan, and test plan are identical to those of the original commit (D27858) above.

Jira: DB-6983

Reviewers: hsunder, asrivastava, zdrudi
Reviewed By: hsunder
Subscribers: bogdan, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D28842
rahuldesirazu added a commit that referenced this issue on Oct 23, 2023

…eader in a different universe

Summary:
The summary, upgrade implications, backport plan, and test plan repeat the text of the first rahuldesirazu commit (D27858) above, without the section headings.

Jira: DB-6983

Reviewers: hsunder, asrivastava, zdrudi
Reviewed By: hsunder
Subscribers: bogdan, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D28869
rahuldesirazu added a commit that referenced this issue on Oct 31, 2023

…eader in a different universe

Summary:
Original Commit: fb98e56/D27858

The summary, upgrade implications, backport plan, and test plan repeat the text of the original commit (D27858) above, without the section headings.

Jira: DB-6983

Reviewers: hsunder, asrivastava, zdrudi
Reviewed By: zdrudi
Subscribers: ybase, bogdan
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D29567
rahuldesirazu
added a commit
that referenced
this issue
Nov 6, 2023
…eader in a different universe

Summary: Same summary and backport plan as the preceding backport commit.

Jira: DB-6983
Original Commit: fb98e56/D27858

Test Plan:
ybd debug --cxx-test master-test --gtest_filter *UniverseUuid*
ybd debug --cxx-test master_heartbeat-itest --gtest_filter *PreventHeartbeatWrongCluster*

Reviewers: hsunder, asrivastava, zdrudi
Reviewed By: hsunder
Subscribers: bogdan, ybase
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D29676
Backported down to 2.14.
Jira Link: DB-6983
Description
Currently, it is possible for tservers to heartbeat to masters in a different universe if the master or tserver nodes are misconfigured.
In such cases, the receiver should reject those heartbeat RPCs because the universe IDs do not match.