[DocDB] Reject heartbeats if universe_id of sender does not match receiver's universe_id #17904

lingamsandeep · 2023-06-22T23:21:13Z

Jira Link: DB-6983

Description

Currently, a t-server can heartbeat to masters in a different universe if the master or t-server nodes are misconfigured.
In such cases, the receiver should reject those heartbeat RPCs because the universe IDs do not match.

Warning: Please confirm that this issue does not contain any sensitive information

I confirm this issue does not contain any sensitive information.

…om YBA during master/tserver startup

Summary:
Allow populating the `--cluster_uuid` gflag on both tserver and master processes. Previously, YBA set this gflag only on the master nodes. We add this change so future YBDB versions can use this gflag to check inter-node RPCs (for this issue: #17904).

With these changes, the `--cluster_uuid` gflag is populated once we perform any action that configures gflags or installs software. So upon software upgrade, every node running at least one tserver or master process will have this gflag set in its conf file (excluding read replicas, where `master/conf/server.conf` will not be set).

Example of a tserver's `server.conf` file (located under `~/tserver/conf/server.conf` on the node):

```
--allow_insecure_connections=false
--callhome_collection_level=medium
--cert_node_filename=10.150.2.146
--certs_dir=/home/yugabyte/yugabyte-tls-config
--certs_for_cdc_dir=/home/yugabyte/yugabyte-tls-producer
--certs_for_client_dir=/home/yugabyte/yugabyte-client-tls-config
--cluster_uuid=4ea76522-acfe-42ef-8832-a46fb7929768
--cql_proxy_bind_address=10.150.2.146:9042
--cql_proxy_webserver_port=12000
--enable_ysql=true
--fs_data_dirs=/mnt/d0
--max_log_size=256
--metric_node_name=yb-admin-cwang-4-node-universe-n2
--pgsql_proxy_bind_address=10.150.2.146:5433
--pgsql_proxy_webserver_port=13000
--placement_cloud=gcp
--placement_region=us-west1
--placement_uuid=4e3a2744-618b-46ec-8aff-0e4b7170062b
--placement_zone=us-west1-c
--redis_proxy_bind_address=10.150.2.146:6379
--replication_factor=3
--rpc_bind_addresses=10.150.2.146:9100
--server_broadcast_addresses=
--start_cql_proxy=true
--start_redis_proxy=false
--tserver_master_addrs=10.150.2.144:7100,10.150.2.146:7100,10.150.2.155:7100
--txn_table_wait_min_ts_count=4
--undefok=enable_ysql
--use_cassandra_authentication=true
--use_client_to_server_encryption=true
--use_node_to_node_encryption=true
--webserver_ca_certificate_file=/home/yugabyte/yugabyte-tls-config/ca.crt
--webserver_certificate_file=/home/yugabyte/yugabyte-tls-config/node.10.150.2.146.crt
--webserver_interface=10.150.2.146
--webserver_port=9000
--webserver_private_key_file=/home/yugabyte/yugabyte-tls-config/node.10.150.2.146.key
--webserver_redirect_http_to_https=true
--ysql_enable_auth=true
--ysql_hba_conf_csv=local all yugabyte trust
```

Note:
- The `cluster_uuid` gflag is equivalent to `universe_uuid` from the YBA standpoint.

Test Plan:
Tested the following scenarios:
1. Create a new universe with read replicas: a primary cluster of 6 nodes with a replication factor of 3, plus a read-replica cluster with 3 nodes. Make sure that upon successful creation, all nodes in the primary cluster have the `cluster_uuid` gflag set in both their master and tserver gflags conf files (see summary for example). For the read-replica cluster, check that the tserver's conf file contains the `cluster_uuid` gflag. The master conf file will not contain the gflag (RR will never have a master process).
2. Start from an existing universe created before this change, with the same configuration as (1). At that point only master nodes will have the gflag set for the master/tserver conf files. Perform a software upgrade. After the software upgrade, all nodes (considering the RR edge case) should have the `cluster_uuid` gflag populated in both their master and tserver gflags conf files.

Reviewers: sanketh, nbhatia, hzare, nsingh, yshchetinin
Reviewed By: yshchetinin
Subscribers: yugaware
Differential Revision: https://phorge.dev.yugabyte.com/D26413
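The test plan above checks by hand that each `server.conf` contains the `cluster_uuid` gflag. A minimal sketch of automating that check is shown here, assuming the conf file uses the one-flag-per-line `--name=value` layout from the example. `parse_flagfile` and `has_cluster_uuid` are hypothetical helper names for illustration and are not part of YBA.

```python
def parse_flagfile(text):
    """Parse gflags-style lines of the form --name=value into a dict."""
    flags = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("--"):
            continue  # skip blank lines and anything that is not a flag
        # Split on the first '=' only, so values may contain spaces or '='.
        name, _, value = line[2:].partition("=")
        flags[name] = value
    return flags


def has_cluster_uuid(conf_text, expected_uuid=None):
    """Return True if cluster_uuid is set (and equals expected_uuid, if given)."""
    value = parse_flagfile(conf_text).get("cluster_uuid", "")
    if not value:
        return False
    return expected_uuid is None or value == expected_uuid


# A shortened conf taken from the example above.
sample_conf = """\
--cluster_uuid=4ea76522-acfe-42ef-8832-a46fb7929768
--enable_ysql=true
--server_broadcast_addresses=
--ysql_hba_conf_csv=local all yugabyte trust
"""

print(has_cluster_uuid(sample_conf))
```

In practice the text would be read from `~/tserver/conf/server.conf` (and the master's conf file on primary-cluster nodes), and `expected_uuid` would be the universe UUID known to YBA.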

…ers/tservers from YBA during master/tserver startup

Summary:
Original Commit: 05f97ef / D26413

Reviewers: sanketh, hzare, yshchetinin, nbhatia
Reviewed By: sanketh
Subscribers: yugaware
Differential Revision: https://phorge.dev.yugabyte.com/D27304

…ers/tservers from YBA during master/tserver startup

Summary:
Original Commit: 05f97ef / D26413

Reviewers: sanketh, nbhatia, yshchetinin, hzare
Reviewed By: sanketh
Subscribers: yugaware
Differential Revision: https://phorge.dev.yugabyte.com/D27309

…to all masters/tservers from YBA during master/tserver startup

Summary:
Also backports `[PLAT-9408]Manage tags and skip_tags with Ansible script better`, as it is needed for this change.
Original commits: 05f97ef / D26413, 6e0129a / D26555

Reviewers: vpatibandla, sanketh, hzare
Reviewed By: hzare
Subscribers: yugaware
Differential Revision: https://phorge.dev.yugabyte.com/D27413

…to all masters/tservers from YBA during master/tserver startup

Summary:
Original Commit: 05f97ef / D26413

Allow populating the `--cluster_uuid` gflag on both tserver and master processes. Previously, YBA set this gflag only on the master nodes. We add this change so future YBDB versions can use this gflag to check inter-node RPCs (for this issue: #17904).

With these changes, the `--cluster_uuid` gflag is populated once we perform any action that configures gflags (excluding read replicas, where `master/conf/server.conf` will not be set). The 2.14 backport does not contain the changes that let a software upgrade populate the `cluster_uuid` field.

Example of a tserver's `server.conf` file (located under `~/tserver/conf/server.conf` on the node):

```
--allow_insecure_connections=false
--callhome_collection_level=medium
--cert_node_filename=10.150.2.146
--certs_dir=/home/yugabyte/yugabyte-tls-config
--certs_for_cdc_dir=/home/yugabyte/yugabyte-tls-producer
--certs_for_client_dir=/home/yugabyte/yugabyte-client-tls-config
--cluster_uuid=4ea76522-acfe-42ef-8832-a46fb7929768
--cql_proxy_bind_address=10.150.2.146:9042
--cql_proxy_webserver_port=12000
--enable_ysql=true
--fs_data_dirs=/mnt/d0
--max_log_size=256
--metric_node_name=yb-admin-cwang-4-node-universe-n2
--pgsql_proxy_bind_address=10.150.2.146:5433
--pgsql_proxy_webserver_port=13000
--placement_cloud=gcp
--placement_region=us-west1
--placement_uuid=4e3a2744-618b-46ec-8aff-0e4b7170062b
--placement_zone=us-west1-c
--redis_proxy_bind_address=10.150.2.146:6379
--replication_factor=3
--rpc_bind_addresses=10.150.2.146:9100
--server_broadcast_addresses=
--start_cql_proxy=true
--start_redis_proxy=false
--tserver_master_addrs=10.150.2.144:7100,10.150.2.146:7100,10.150.2.155:7100
--txn_table_wait_min_ts_count=4
--undefok=enable_ysql
--use_cassandra_authentication=true
--use_client_to_server_encryption=true
--use_node_to_node_encryption=true
--webserver_ca_certificate_file=/home/yugabyte/yugabyte-tls-config/ca.crt
--webserver_certificate_file=/home/yugabyte/yugabyte-tls-config/node.10.150.2.146.crt
--webserver_interface=10.150.2.146
--webserver_port=9000
--webserver_private_key_file=/home/yugabyte/yugabyte-tls-config/node.10.150.2.146.key
--webserver_redirect_http_to_https=true
--ysql_enable_auth=true
--ysql_hba_conf_csv=local all yugabyte trust
```

Note:
- The `cluster_uuid` gflag is equivalent to `universe_uuid` from the YBA standpoint.

Test Plan:
Tested the following scenarios:
1. Create a new universe with read replicas: a primary cluster of 6 nodes with a replication factor of 3, plus a read-replica cluster with 3 nodes. Make sure that upon successful creation, all nodes in the primary cluster have the `cluster_uuid` gflag set in both their master and tserver gflags conf files (see summary for example). For the read-replica cluster, check that the tserver's conf file contains the `cluster_uuid` gflag. The master conf file will not contain the gflag (RR will never have a master process).
2. Start from an existing universe created before this change, with the same configuration as (1). At that point only master nodes will have the gflag set for the master/tserver conf files. Perform a gflags upgrade. After the gflags upgrade, all nodes (considering the RR edge case) should have the `cluster_uuid` gflag populated in both their master and tserver gflags conf files.

Reviewers: sanketh, yshchetinin, hzare, nbhatia
Reviewed By: sanketh
Subscribers: yugaware
Differential Revision: https://phorge.dev.yugabyte.com/D27310

…erent universe

Summary:
Currently, if a tserver heartbeats to a master leader in a different universe, it can successfully register even though it is part of a different universe. This can happen, for example, if the tserver's `--tserver_master_addrs` is set incorrectly, or if a master is wiped and added to a new cluster but not properly removed from the existing cluster. This can result in data loss, as tasks will be triggered to clean up orphaned tablets on these tservers.

We introduce a new cluster config field, `universe_uuid`, that is generated only by the master leader (as opposed to `cluster_uuid`, which can be passed in as a flag). The master leader will set `universe_uuid` on the VisitSysCatalog path (newly elected leader), and set `universe_uuid` in the cluster config if not already set.

We also add a similarly named field, `universe_uuid`, to the tserver instance metadata, indicating which cluster this tserver belongs to.

On the heartbeat path, the tserver sets `universe_uuid` in the request if it is set in its instance metadata. Otherwise, it is left unset. The master leader checks the value of this passed-in uuid against what is in the cluster config. Here are the scenarios:
1. If the master's uuid is unset, return a TryAgain error to the tserver so it retries until the uuid is set.
2. If both are set but do not match, fail, since the tserver is heartbeating to the wrong cluster.
3. If the master's uuid is set but the tserver's is unset, return the uuid to the tserver so it can set state in the instance metadata.

The master heartbeat path will now wait for the tserver to set `universe_uuid` before proceeding with any logic.

#### Upgrade Implications
This feature is gated by a kLocalPersisted autoflag `master_enable_universe_uuid_heartbeat_check = true`. When this flag is enabled, the master both enables the uuid check on heartbeat and sets `universe_uuid` as part of the catalog manager bg tasks. We need an autoflag here to guard against the following situation:
1. Master leader M1 on the newer version replicates `universe_uuid` to followers on the older version.
2. Older-version master M2 becomes leader.
3. M2 replicates a cluster config change. `universe_uuid` is unset.

#### Backport Plan
We want to backport this change down to the 2.14 line. Due to the usage of autoflags, we require a different backport plan for each line:
- 2.18+: Autoflags exist with autopromotion in YBA, so we will backport the change as is.
- 2.16: Autoflags exist, but there is no autopromotion in YBA. We will backport the change as is, but the user will have to manually promote this flag post-upgrade.
- 2.14: Autoflags do not exist at all. We will need to backport a change that just uses a gflag set to false. After upgrade, the user will have to manually set this flag to true.

Jira: DB-6983

Test Plan:
ybd debug --cxx-test master-test --gtest_filter *UniverseUuid*
ybd debug --cxx-test master_heartbeat-itest --gtest_filter *PreventHeartbeatWrongCluster*

Reviewers: hsunder, asrivastava, zdrudi
Reviewed By: hsunder, asrivastava, zdrudi
Subscribers: ybase, bogdan
Differential Revision: https://phorge.dev.yugabyte.com/D27858
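The three heartbeat scenarios described in the summary above can be sketched as a small decision function. This is an illustrative Python model, not the actual C++ master code: `check_heartbeat`, `Outcome`, and `HeartbeatResult` are hypothetical names, and treating a disabled `master_enable_universe_uuid_heartbeat_check` as "skip the check entirely" is an assumption drawn from the flag's description.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    OK = "ok"                # heartbeat proceeds normally
    TRY_AGAIN = "try_again"  # scenario 1: master has no universe_uuid yet
    REJECT = "reject"        # scenario 2: uuids are both set and differ
    ASSIGN = "assign"        # scenario 3: master hands its uuid to the tserver


@dataclass
class HeartbeatResult:
    outcome: Outcome
    # Only populated for ASSIGN: the uuid the tserver should persist.
    universe_uuid: Optional[str] = None


def check_heartbeat(master_uuid, tserver_uuid, check_enabled=True):
    """Model of the master-side universe_uuid check on the heartbeat path."""
    if not check_enabled:
        # Assumption: with the flag off, heartbeats are not checked at all.
        return HeartbeatResult(Outcome.OK)
    if not master_uuid:
        return HeartbeatResult(Outcome.TRY_AGAIN)
    if not tserver_uuid:
        return HeartbeatResult(Outcome.ASSIGN, master_uuid)
    if tserver_uuid != master_uuid:
        return HeartbeatResult(Outcome.REJECT)
    return HeartbeatResult(Outcome.OK)


print(check_heartbeat("uuid-a", "uuid-b").outcome)
```

Checking the master's own uuid first mirrors scenario 1: until the leader has generated `universe_uuid`, no comparison is meaningful, so the tserver is simply asked to retry.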

…eader in a different universe

Jira: DB-6983
Original Commit: fb98e56 / D27858

Test Plan:
ybd debug --cxx-test master-test --gtest_filter *UniverseUuid*
ybd debug --cxx-test master_heartbeat-itest --gtest_filter *PreventHeartbeatWrongCluster*

Reviewers: hsunder, asrivastava, zdrudi
Reviewed By: hsunder
Subscribers: bogdan, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D28842

…eader in a different universe

Jira: DB-6983

Test Plan:
ybd debug --cxx-test master-test --gtest_filter *UniverseUuid*
ybd debug --cxx-test master_heartbeat-itest --gtest_filter *PreventHeartbeatWrongCluster*

Reviewers: hsunder, asrivastava, zdrudi
Reviewed By: hsunder
Subscribers: bogdan, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D28869

…eader in a different universe

Jira: DB-6983
Original Commit: fb98e56/D27858

Test Plan:
ybd debug --cxx-test master-test --gtest_filter *UniverseUuid*
ybd debug --cxx-test master_heartbeat-itest --gtest_filter *PreventHeartbeatWrongCluster*

Reviewers: hsunder, asrivastava, zdrudi
Reviewed By: zdrudi
Subscribers: ybase, bogdan
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D29567

…eader in a different universe

Jira: DB-6983
Original Commit: fb98e56/D27858

Test Plan:
ybd debug --cxx-test master-test --gtest_filter *UniverseUuid*
ybd debug --cxx-test master_heartbeat-itest --gtest_filter *PreventHeartbeatWrongCluster*

Reviewers: hsunder, asrivastava, zdrudi
Reviewed By: hsunder
Subscribers: bogdan, ybase
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D29676

rahuldesirazu · 2023-11-06T23:09:10Z

Backported down to 2.14

lingamsandeep added the area/docdb and status/awaiting-triage labels Jun 22, 2023

yugabyte-ci added the kind/bug, priority/medium, and priority/high labels and removed the status/awaiting-triage and priority/medium labels Jun 22, 2023

yugabyte-ci assigned lingamsandeep and rahuldesirazu and unassigned lingamsandeep Jun 28, 2023

yugabyte-ci added the kind/enhancement label and removed the kind/bug label Aug 23, 2023

druzac mentioned this issue Aug 24, 2023 in [DocDB] Persist tserver's list of master addresses #18846 (Open, 1 task)

yugabyte-ci closed this as completed Oct 16, 2023

yugabyte-ci reopened this Oct 16, 2023

rahuldesirazu closed this as completed Nov 6, 2023
