[DocDB] Reject heartbeats if universe_id of sender does not match receiver's universe_id #17904

Closed
1 task done
lingamsandeep opened this issue Jun 22, 2023 · 1 comment
Labels
area/docdb YugabyteDB core features kind/enhancement This is an enhancement of an existing feature priority/high High Priority

Comments

@lingamsandeep
Contributor

lingamsandeep commented Jun 22, 2023

Jira Link: DB-6983

Description

Currently, t-servers can heartbeat to masters in a different universe if the master or t-server nodes are misconfigured.
In such cases, the receiver should reject those heartbeat RPCs because the universe IDs do not match.

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@lingamsandeep lingamsandeep added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Jun 22, 2023
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue priority/high High Priority and removed status/awaiting-triage Issue awaiting triage priority/medium Medium priority issue labels Jun 22, 2023
charleswang234 added a commit that referenced this issue Jul 20, 2023
…om YBA during master/tserver startup

Summary:
Allow YBA to populate the `--cluster_uuid` gflag on both tserver and master processes. Previously, YBA set this gflag only on the master nodes. We add this change so that future YBDB versions can use this gflag to check inter-node RPCs (for this issue: #17904).

With these changes, the `--cluster_uuid` gflag is populated whenever we perform any action that configures gflags or installs software. So after a software upgrade, every node running at least one tserver or master process will have this gflag set in its conf file (except on read-replica nodes, where `master/conf/server.conf` will not be set).

Example of a tserver's `server.conf` file (located at `~/tserver/conf/server.conf` on the node):
```
--allow_insecure_connections=false
--callhome_collection_level=medium
--cert_node_filename=10.150.2.146
--certs_dir=/home/yugabyte/yugabyte-tls-config
--certs_for_cdc_dir=/home/yugabyte/yugabyte-tls-producer
--certs_for_client_dir=/home/yugabyte/yugabyte-client-tls-config
--cluster_uuid=4ea76522-acfe-42ef-8832-a46fb7929768
--cql_proxy_bind_address=10.150.2.146:9042
--cql_proxy_webserver_port=12000
--enable_ysql=true
--fs_data_dirs=/mnt/d0
--max_log_size=256
--metric_node_name=yb-admin-cwang-4-node-universe-n2
--pgsql_proxy_bind_address=10.150.2.146:5433
--pgsql_proxy_webserver_port=13000
--placement_cloud=gcp
--placement_region=us-west1
--placement_uuid=4e3a2744-618b-46ec-8aff-0e4b7170062b
--placement_zone=us-west1-c
--redis_proxy_bind_address=10.150.2.146:6379
--replication_factor=3
--rpc_bind_addresses=10.150.2.146:9100
--server_broadcast_addresses=
--start_cql_proxy=true
--start_redis_proxy=false
--tserver_master_addrs=10.150.2.144:7100,10.150.2.146:7100,10.150.2.155:7100
--txn_table_wait_min_ts_count=4
--undefok=enable_ysql
--use_cassandra_authentication=true
--use_client_to_server_encryption=true
--use_node_to_node_encryption=true
--webserver_ca_certificate_file=/home/yugabyte/yugabyte-tls-config/ca.crt
--webserver_certificate_file=/home/yugabyte/yugabyte-tls-config/node.10.150.2.146.crt
--webserver_interface=10.150.2.146
--webserver_port=9000
--webserver_private_key_file=/home/yugabyte/yugabyte-tls-config/node.10.150.2.146.key
--webserver_redirect_http_to_https=true
--ysql_enable_auth=true
--ysql_hba_conf_csv=local all yugabyte trust
```

Note:
- `cluster_uuid` gflag is equivalent to `universe_uuid` from YBA standpoint.
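
A conf file like the one above is a plain list of `--name=value` lines, so checking that `cluster_uuid` is present takes only a few lines. The following is a minimal sketch, not part of YBA: the helper names `parse_gflags_conf` and `cluster_uuid_of` are hypothetical, and it assumes every flag is written in the `--name=value` form shown in the example.

```python
import uuid
from typing import Dict, Optional


def parse_gflags_conf(text: str) -> Dict[str, str]:
    """Parse lines of the form --name=value into a dict (hypothetical helper)."""
    flags: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("--"):
            continue  # skip blank lines and anything that is not a flag line
        # Split on the first "=" only, so values may themselves contain "=".
        name, _, value = line[2:].partition("=")
        flags[name] = value
    return flags


def cluster_uuid_of(text: str) -> Optional[uuid.UUID]:
    """Return the parsed cluster_uuid, or None if the flag is absent or empty."""
    value = parse_gflags_conf(text).get("cluster_uuid")
    return uuid.UUID(value) if value else None


# A shortened copy of the example conf file above.
sample = """\
--cluster_uuid=4ea76522-acfe-42ef-8832-a46fb7929768
--placement_uuid=4e3a2744-618b-46ec-8aff-0e4b7170062b
--server_broadcast_addresses=
--ysql_hba_conf_csv=local all yugabyte trust
"""

print(cluster_uuid_of(sample))
```

Parsing the value with `uuid.UUID` also catches a malformed value, since it raises `ValueError` when the string is not a valid UUID.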

Test Plan:
Tested the following scenarios:

1. Create a new universe with read replicas: 6 nodes in the primary cluster with a replication factor of 3, plus a read-replica cluster with 3 nodes. After successful creation, make sure that every node in the primary cluster has the `cluster_uuid` gflag set in both its master and tserver gflags conf files (see the summary for an example). For the read-replica cluster, check that the tserver's conf file contains the `cluster_uuid` gflag. The master conf file will not contain the gflag (an RR never has a master process).

2. Start with an existing universe that was created before this change, with the same configuration as (1). In it, only master nodes have the gflag set in the master/tserver conf files. Perform a software upgrade. After the software upgrade, all nodes (allowing for the RR edge case) should have the `cluster_uuid` gflag in both their master and tserver gflags conf files.
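
The per-node expectation in the two scenarios above can be written down as a small check. This is only an illustrative sketch of the rule stated in the test plan, not the test code used for this change: `has_cluster_uuid` and `check_node` are hypothetical names, and the conf file contents are assumed to have already been read into strings (or `None` when the file does not exist).

```python
from typing import Dict, List, Optional

FLAG_PREFIX = "--cluster_uuid="


def has_cluster_uuid(conf_text: Optional[str]) -> bool:
    """True if some line of the conf text sets a non-empty --cluster_uuid."""
    if conf_text is None:
        return False
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith(FLAG_PREFIX) and len(line) > len(FLAG_PREFIX):
            return True
    return False


def check_node(is_read_replica: bool, confs: Dict[str, Optional[str]]) -> List[str]:
    """Return a list of problems for one node; an empty list means it passes.

    confs maps "master" / "tserver" to the conf file text, or None if absent.
    """
    problems: List[str] = []
    # Every node is expected to carry the flag in its tserver conf.
    if not has_cluster_uuid(confs.get("tserver")):
        problems.append("tserver conf is missing cluster_uuid")
    if is_read_replica:
        # Read-replica nodes never run a master, so the flag must not appear there.
        if has_cluster_uuid(confs.get("master")):
            problems.append("unexpected cluster_uuid in read-replica master conf")
    elif not has_cluster_uuid(confs.get("master")):
        problems.append("master conf is missing cluster_uuid")
    return problems
```

For example, `check_node(True, {"master": None, "tserver": "--cluster_uuid=abc"})` returns an empty list, which matches the read-replica edge case described in scenario 1.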

Reviewers: sanketh, nbhatia, hzare, nsingh, yshchetinin

Reviewed By: yshchetinin

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D26413
charleswang234 added a commit that referenced this issue Jul 28, 2023
…ers/tservers from YBA during master/tserver startup

Summary:
Original Commit: 05f97ef / D26413

Allow YBA to populate the `--cluster_uuid` gflag on both tserver and master processes. Previously, YBA set this gflag only on the master nodes. We add this change so that future YBDB versions can use this gflag to check inter-node RPCs (for this issue: #17904).

With these changes, the `--cluster_uuid` gflag is populated whenever we perform any action that configures gflags or installs software. So after a software upgrade, every node running at least one tserver or master process will have this gflag set in its conf file (except on read-replica nodes, where `master/conf/server.conf` will not be set).

Example of a tserver's `server.conf` file (located at `~/tserver/conf/server.conf` on the node):
```
--allow_insecure_connections=false
--callhome_collection_level=medium
--cert_node_filename=10.150.2.146
--certs_dir=/home/yugabyte/yugabyte-tls-config
--certs_for_cdc_dir=/home/yugabyte/yugabyte-tls-producer
--certs_for_client_dir=/home/yugabyte/yugabyte-client-tls-config
--cluster_uuid=4ea76522-acfe-42ef-8832-a46fb7929768
--cql_proxy_bind_address=10.150.2.146:9042
--cql_proxy_webserver_port=12000
--enable_ysql=true
--fs_data_dirs=/mnt/d0
--max_log_size=256
--metric_node_name=yb-admin-cwang-4-node-universe-n2
--pgsql_proxy_bind_address=10.150.2.146:5433
--pgsql_proxy_webserver_port=13000
--placement_cloud=gcp
--placement_region=us-west1
--placement_uuid=4e3a2744-618b-46ec-8aff-0e4b7170062b
--placement_zone=us-west1-c
--redis_proxy_bind_address=10.150.2.146:6379
--replication_factor=3
--rpc_bind_addresses=10.150.2.146:9100
--server_broadcast_addresses=
--start_cql_proxy=true
--start_redis_proxy=false
--tserver_master_addrs=10.150.2.144:7100,10.150.2.146:7100,10.150.2.155:7100
--txn_table_wait_min_ts_count=4
--undefok=enable_ysql
--use_cassandra_authentication=true
--use_client_to_server_encryption=true
--use_node_to_node_encryption=true
--webserver_ca_certificate_file=/home/yugabyte/yugabyte-tls-config/ca.crt
--webserver_certificate_file=/home/yugabyte/yugabyte-tls-config/node.10.150.2.146.crt
--webserver_interface=10.150.2.146
--webserver_port=9000
--webserver_private_key_file=/home/yugabyte/yugabyte-tls-config/node.10.150.2.146.key
--webserver_redirect_http_to_https=true
--ysql_enable_auth=true
--ysql_hba_conf_csv=local all yugabyte trust
```

Note:
- `cluster_uuid` gflag is equivalent to `universe_uuid` from YBA standpoint.

Test Plan:
Tested the following scenarios:

1. Create a new universe with read replicas: 6 nodes in the primary cluster with a replication factor of 3, plus a read-replica cluster with 3 nodes. After successful creation, make sure that every node in the primary cluster has the `cluster_uuid` gflag set in both its master and tserver gflags conf files (see the summary for an example). For the read-replica cluster, check that the tserver's conf file contains the `cluster_uuid` gflag. The master conf file will not contain the gflag (an RR never has a master process).

2. Start with an existing universe that was created before this change, with the same configuration as (1). In it, only master nodes have the gflag set in the master/tserver conf files. Perform a software upgrade. After the software upgrade, all nodes (allowing for the RR edge case) should have the `cluster_uuid` gflag in both their master and tserver gflags conf files.

Reviewers: sanketh, hzare, yshchetinin, nbhatia

Reviewed By: sanketh

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D27304
charleswang234 added a commit that referenced this issue Jul 28, 2023
…ers/tservers from YBA during master/tserver startup

Summary:
Original Commit: 05f97ef / D26413

Allow YBA to populate the `--cluster_uuid` gflag on both tserver and master processes. Previously, YBA set this gflag only on the master nodes. We add this change so that future YBDB versions can use this gflag to check inter-node RPCs (for this issue: #17904).

With these changes, the `--cluster_uuid` gflag is populated whenever we perform any action that configures gflags or installs software. So after a software upgrade, every node running at least one tserver or master process will have this gflag set in its conf file (except on read-replica nodes, where `master/conf/server.conf` will not be set).

Example of a tserver's `server.conf` file (located at `~/tserver/conf/server.conf` on the node):
```
--allow_insecure_connections=false
--callhome_collection_level=medium
--cert_node_filename=10.150.2.146
--certs_dir=/home/yugabyte/yugabyte-tls-config
--certs_for_cdc_dir=/home/yugabyte/yugabyte-tls-producer
--certs_for_client_dir=/home/yugabyte/yugabyte-client-tls-config
--cluster_uuid=4ea76522-acfe-42ef-8832-a46fb7929768
--cql_proxy_bind_address=10.150.2.146:9042
--cql_proxy_webserver_port=12000
--enable_ysql=true
--fs_data_dirs=/mnt/d0
--max_log_size=256
--metric_node_name=yb-admin-cwang-4-node-universe-n2
--pgsql_proxy_bind_address=10.150.2.146:5433
--pgsql_proxy_webserver_port=13000
--placement_cloud=gcp
--placement_region=us-west1
--placement_uuid=4e3a2744-618b-46ec-8aff-0e4b7170062b
--placement_zone=us-west1-c
--redis_proxy_bind_address=10.150.2.146:6379
--replication_factor=3
--rpc_bind_addresses=10.150.2.146:9100
--server_broadcast_addresses=
--start_cql_proxy=true
--start_redis_proxy=false
--tserver_master_addrs=10.150.2.144:7100,10.150.2.146:7100,10.150.2.155:7100
--txn_table_wait_min_ts_count=4
--undefok=enable_ysql
--use_cassandra_authentication=true
--use_client_to_server_encryption=true
--use_node_to_node_encryption=true
--webserver_ca_certificate_file=/home/yugabyte/yugabyte-tls-config/ca.crt
--webserver_certificate_file=/home/yugabyte/yugabyte-tls-config/node.10.150.2.146.crt
--webserver_interface=10.150.2.146
--webserver_port=9000
--webserver_private_key_file=/home/yugabyte/yugabyte-tls-config/node.10.150.2.146.key
--webserver_redirect_http_to_https=true
--ysql_enable_auth=true
--ysql_hba_conf_csv=local all yugabyte trust
```

Note:
- `cluster_uuid` gflag is equivalent to `universe_uuid` from YBA standpoint.

Test Plan:
Tested the following scenarios:

1. Create a new universe with read replicas: 6 nodes in the primary cluster with a replication factor of 3, plus a read-replica cluster with 3 nodes. After successful creation, make sure that every node in the primary cluster has the `cluster_uuid` gflag set in both its master and tserver gflags conf files (see the summary for an example). For the read-replica cluster, check that the tserver's conf file contains the `cluster_uuid` gflag. The master conf file will not contain the gflag (an RR never has a master process).

2. Start with an existing universe that was created before this change, with the same configuration as (1). In it, only master nodes have the gflag set in the master/tserver conf files. Perform a software upgrade. After the software upgrade, all nodes (allowing for the RR edge case) should have the `cluster_uuid` gflag in both their master and tserver gflags conf files.

Reviewers: sanketh, nbhatia, yshchetinin, hzare

Reviewed By: sanketh

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D27309
charleswang234 added a commit that referenced this issue Aug 20, 2023
…to all masters/tservers from YBA during master/tserver startup

Summary:
Also, backports `[PLAT-9408]Manage tags and skip_tags with Ansible script better` as it is needed for this change

Original commits:
05f97ef / D26413
6e0129a / D26555

Allow YBA to populate the `--cluster_uuid` gflag on both tserver and master processes. Previously, YBA set this gflag only on the master nodes. We add this change so that future YBDB versions can use this gflag to check inter-node RPCs (for this issue: #17904).

With these changes, the `--cluster_uuid` gflag is populated whenever we perform any action that configures gflags or installs software. So after a software upgrade, every node running at least one tserver or master process will have this gflag set in its conf file (except on read-replica nodes, where `master/conf/server.conf` will not be set).

Example of a tserver's `server.conf` file (located at `~/tserver/conf/server.conf` on the node):
```
--allow_insecure_connections=false
--callhome_collection_level=medium
--cert_node_filename=10.150.2.146
--certs_dir=/home/yugabyte/yugabyte-tls-config
--certs_for_cdc_dir=/home/yugabyte/yugabyte-tls-producer
--certs_for_client_dir=/home/yugabyte/yugabyte-client-tls-config
--cluster_uuid=4ea76522-acfe-42ef-8832-a46fb7929768
--cql_proxy_bind_address=10.150.2.146:9042
--cql_proxy_webserver_port=12000
--enable_ysql=true
--fs_data_dirs=/mnt/d0
--max_log_size=256
--metric_node_name=yb-admin-cwang-4-node-universe-n2
--pgsql_proxy_bind_address=10.150.2.146:5433
--pgsql_proxy_webserver_port=13000
--placement_cloud=gcp
--placement_region=us-west1
--placement_uuid=4e3a2744-618b-46ec-8aff-0e4b7170062b
--placement_zone=us-west1-c
--redis_proxy_bind_address=10.150.2.146:6379
--replication_factor=3
--rpc_bind_addresses=10.150.2.146:9100
--server_broadcast_addresses=
--start_cql_proxy=true
--start_redis_proxy=false
--tserver_master_addrs=10.150.2.144:7100,10.150.2.146:7100,10.150.2.155:7100
--txn_table_wait_min_ts_count=4
--undefok=enable_ysql
--use_cassandra_authentication=true
--use_client_to_server_encryption=true
--use_node_to_node_encryption=true
--webserver_ca_certificate_file=/home/yugabyte/yugabyte-tls-config/ca.crt
--webserver_certificate_file=/home/yugabyte/yugabyte-tls-config/node.10.150.2.146.crt
--webserver_interface=10.150.2.146
--webserver_port=9000
--webserver_private_key_file=/home/yugabyte/yugabyte-tls-config/node.10.150.2.146.key
--webserver_redirect_http_to_https=true
--ysql_enable_auth=true
--ysql_hba_conf_csv=local all yugabyte trust
```

Note:
- `cluster_uuid` gflag is equivalent to `universe_uuid` from YBA standpoint.

Test Plan:
Tested the following scenarios:

1. Create a new universe with read replicas: 6 nodes in the primary cluster with a replication factor of 3, plus a read-replica cluster with 3 nodes. After successful creation, make sure that every node in the primary cluster has the `cluster_uuid` gflag set in both its master and tserver gflags conf files (see the summary for an example). For the read-replica cluster, check that the tserver's conf file contains the `cluster_uuid` gflag. The master conf file will not contain the gflag (an RR never has a master process).

2. Start with an existing universe that was created before this change, with the same configuration as (1). In it, only master nodes have the gflag set in the master/tserver conf files. Perform a software upgrade. After the software upgrade, all nodes (allowing for the RR edge case) should have the `cluster_uuid` gflag in both their master and tserver gflags conf files.

Reviewers: vpatibandla, sanketh, hzare

Reviewed By: hzare

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D27413
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature and removed kind/bug This issue is a bug labels Aug 23, 2023
charleswang234 added a commit that referenced this issue Sep 25, 2023
…to all masters/tservers from YBA during master/tserver startup

Summary:
Original Commit:
05f97ef / D26413

Allow YBA to populate the `--cluster_uuid` gflag on both tserver and master processes. Previously, YBA set this gflag only on the master nodes. We add this change so that future YBDB versions can use this gflag to check inter-node RPCs (for this issue: #17904).

With these changes, the `--cluster_uuid` gflag is populated whenever we perform any action that configures gflags (except on read-replica nodes, where `master/conf/server.conf` will not be set). The 2.14 backport will not contain the changes that let a software upgrade populate the `cluster_uuid` field.

Example of a tserver's `server.conf` file (located at `~/tserver/conf/server.conf` on the node):
```
--allow_insecure_connections=false
--callhome_collection_level=medium
--cert_node_filename=10.150.2.146
--certs_dir=/home/yugabyte/yugabyte-tls-config
--certs_for_cdc_dir=/home/yugabyte/yugabyte-tls-producer
--certs_for_client_dir=/home/yugabyte/yugabyte-client-tls-config
--cluster_uuid=4ea76522-acfe-42ef-8832-a46fb7929768
--cql_proxy_bind_address=10.150.2.146:9042
--cql_proxy_webserver_port=12000
--enable_ysql=true
--fs_data_dirs=/mnt/d0
--max_log_size=256
--metric_node_name=yb-admin-cwang-4-node-universe-n2
--pgsql_proxy_bind_address=10.150.2.146:5433
--pgsql_proxy_webserver_port=13000
--placement_cloud=gcp
--placement_region=us-west1
--placement_uuid=4e3a2744-618b-46ec-8aff-0e4b7170062b
--placement_zone=us-west1-c
--redis_proxy_bind_address=10.150.2.146:6379
--replication_factor=3
--rpc_bind_addresses=10.150.2.146:9100
--server_broadcast_addresses=
--start_cql_proxy=true
--start_redis_proxy=false
--tserver_master_addrs=10.150.2.144:7100,10.150.2.146:7100,10.150.2.155:7100
--txn_table_wait_min_ts_count=4
--undefok=enable_ysql
--use_cassandra_authentication=true
--use_client_to_server_encryption=true
--use_node_to_node_encryption=true
--webserver_ca_certificate_file=/home/yugabyte/yugabyte-tls-config/ca.crt
--webserver_certificate_file=/home/yugabyte/yugabyte-tls-config/node.10.150.2.146.crt
--webserver_interface=10.150.2.146
--webserver_port=9000
--webserver_private_key_file=/home/yugabyte/yugabyte-tls-config/node.10.150.2.146.key
--webserver_redirect_http_to_https=true
--ysql_enable_auth=true
--ysql_hba_conf_csv=local all yugabyte trust
```

Note:
- `cluster_uuid` gflag is equivalent to `universe_uuid` from YBA standpoint.

Test Plan:
Tested the following scenarios:

1. Create a new universe with read replicas: 6 nodes in the primary cluster with a replication factor of 3, plus a read-replica cluster with 3 nodes. After successful creation, make sure that every node in the primary cluster has the `cluster_uuid` gflag set in both its master and tserver gflags conf files (see the summary for an example). For the read-replica cluster, check that the tserver's conf file contains the `cluster_uuid` gflag. The master conf file will not contain the gflag (an RR never has a master process).

2. Start with an existing universe that was created before this change, with the same configuration as (1). In it, only master nodes have the gflag set in the master/tserver conf files. Perform a gflags upgrade. After the gflags upgrade, all nodes (allowing for the RR edge case) should have the `cluster_uuid` gflag in both their master and tserver gflags conf files.

Reviewers: sanketh, yshchetinin, hzare, nbhatia

Reviewed By: sanketh

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D27310
rahuldesirazu added a commit that referenced this issue Sep 27, 2023
…erent universe

Summary:
Currently, a tserver that heartbeats to a master leader in a different universe can register successfully. This can happen, for example, if the tserver's `--tserver_master_addrs` flag is set incorrectly, or if a master is wiped and added to a new cluster but not properly removed from the existing cluster. This can lead to data loss, as tasks will be triggered to clean up orphaned tablets on these tservers.

We introduce a new cluster config field, `universe_uuid`, that is generated only by the master leader (as opposed to `cluster_uuid`, which can be passed in as a flag). The master leader sets `universe_uuid` on the VisitSysCatalog path (newly elected leader), and sets `universe_uuid` in the cluster config if it is not already set.

We also add a similarly named field `universe_uuid` to the tserver instance metadata, indicating which cluster this tserver belongs to.

On the heartbeat path, the tserver sets `universe_uuid` in the request if it is set in its instance metadata. Otherwise, it is left unset. The master leader checks this passed-in UUID against what's in the cluster config. Here are the scenarios:
1. If the master's UUID is unset, return a TryAgain error so that the tserver retries until the UUID is set.
2. If both are set but do not match, fail, since the tserver is heartbeating to the wrong cluster.
3. If the master's UUID is set but the tserver's is unset, return the UUID to the tserver so that it can record it in its instance metadata. The master heartbeat path now waits for the tserver to set `universe_uuid` before proceeding with any logic.
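
The three scenarios above amount to a small decision table on the master side. The sketch below is a model of that table in Python, not the actual master code: `HeartbeatResult` and `check_heartbeat` are hypothetical names, and the fourth outcome (both UUIDs set and equal, so the heartbeat is accepted) is inferred rather than stated in the summary.

```python
from enum import Enum
from typing import Optional, Tuple


class HeartbeatResult(Enum):
    TRY_AGAIN = "try_again"   # scenario 1: master has no universe_uuid yet
    REJECT = "reject"         # scenario 2: both set, values differ
    SEND_UUID = "send_uuid"   # scenario 3: master set, tserver unset
    OK = "ok"                 # inferred: both set and equal


def check_heartbeat(
    master_uuid: Optional[str], tserver_uuid: Optional[str]
) -> Tuple[HeartbeatResult, Optional[str]]:
    """Return the outcome and, for SEND_UUID, the UUID handed back to the tserver."""
    if master_uuid is None:
        # The master cannot validate anything until its own UUID exists.
        return HeartbeatResult.TRY_AGAIN, None
    if tserver_uuid is None:
        # First contact: the tserver persists this value in its instance metadata.
        return HeartbeatResult.SEND_UUID, master_uuid
    if tserver_uuid != master_uuid:
        # The tserver belongs to another universe.
        return HeartbeatResult.REJECT, None
    return HeartbeatResult.OK, None
```

Checking the master's own UUID first means an unset master always yields a retry, even when the tserver already carries a UUID, which matches scenario 1 as written.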

#### Upgrade Implications

This feature is gated by a kLocalPersisted autoflag, `master_enable_universe_uuid_heartbeat_check = true`. When this flag is enabled, the master both enables the UUID check on heartbeat and sets `universe_uuid` as part of the catalog manager background tasks. We need an autoflag here to guard against the following situation:
1. Master leader M1 on the newer version replicates `universe_uuid` to followers on the older version.
2. Older-version master M2 becomes leader.
3. M2 replicates a cluster config change in which `universe_uuid` is unset.
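
The gating described above can be pictured as a background step that does nothing until the flag is on and that never overwrites an existing value. This is a simplified sketch under stated assumptions: the cluster config is modeled as a plain dict, `background_task_tick` is a hypothetical name, and `uuid.uuid4()` merely stands in for however the master actually generates the value.

```python
import uuid
from typing import Dict


def background_task_tick(cluster_config: Dict[str, str], flag_enabled: bool) -> bool:
    """Set universe_uuid if the gate is on and the field is unset.

    Returns True only when this call wrote a new value.
    """
    if not flag_enabled:
        # Gate is off (for example, not every master has been upgraded yet).
        return False
    if cluster_config.get("universe_uuid"):
        # Already set: keep the existing identity of the universe.
        return False
    cluster_config["universe_uuid"] = str(uuid.uuid4())
    return True


config: Dict[str, str] = {}
background_task_tick(config, flag_enabled=False)  # no-op while the flag is off
background_task_tick(config, flag_enabled=True)   # generates the UUID once
background_task_tick(config, flag_enabled=True)   # no-op, value is kept
```

Keeping the flag off until all masters run the newer version avoids the M1/M2 sequence listed above, because no master writes `universe_uuid` while an older master could still become leader.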

#### Backport Plan

We want to backport this change down to the 2.14 line. Because the change uses autoflags, each line needs a different backport plan:

2.18+: Autoflags exist with autopromotion in YBA, so we will backport the change as is.
2.16: Autoflags exist, but there is no autopromotion in YBA. We will backport the change as is, but the user will have to promote this flag manually after the upgrade.
2.14: Autoflags do not exist at all. We will need to backport a change that just uses a gflag set to false. After the upgrade, the user will have to set this flag to true manually.

Jira: DB-6983

Test Plan:
ybd debug --cxx-test master-test --gtest_filter *UniverseUuid*
ybd debug --cxx-test master_heartbeat-itest --gtest_filter *PreventHeartbeatWrongCluster*

Reviewers: hsunder, asrivastava, zdrudi

Reviewed By: hsunder, asrivastava, zdrudi

Subscribers: ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D27858
rahuldesirazu added a commit that referenced this issue Sep 27, 2023
…eader in a different universe

Summary:
Currently, a tserver that heartbeats to a master leader in a different universe can register successfully. This can happen, for example, if the tserver's `--tserver_master_addrs` flag is set incorrectly, or if a master is wiped and added to a new cluster but not properly removed from the existing cluster. This can lead to data loss, as tasks will be triggered to clean up orphaned tablets on these tservers.

We introduce a new cluster config field, `universe_uuid`, that is generated only by the master leader (as opposed to `cluster_uuid`, which can be passed in as a flag). The master leader sets `universe_uuid` on the VisitSysCatalog path (newly elected leader), and sets `universe_uuid` in the cluster config if it is not already set.

We also add a similarly named field `universe_uuid` to the tserver instance metadata, indicating which cluster this tserver belongs to.

On the heartbeat path, the tserver sets `universe_uuid` in the request if it is set in its instance metadata. Otherwise, it is left unset. The master leader checks this passed-in UUID against what's in the cluster config. Here are the scenarios:
1. If the master's UUID is unset, return a TryAgain error so that the tserver retries until the UUID is set.
2. If both are set but do not match, fail, since the tserver is heartbeating to the wrong cluster.
3. If the master's UUID is set but the tserver's is unset, return the UUID to the tserver so that it can record it in its instance metadata. The master heartbeat path now waits for the tserver to set `universe_uuid` before proceeding with any logic.

#### Upgrade Implications

This feature is gated by a kLocalPersisted autoflag, `master_enable_universe_uuid_heartbeat_check = true`. When this flag is enabled, the master both enables the UUID check on heartbeat and sets `universe_uuid` as part of the catalog manager background tasks. We need an autoflag here to guard against the following situation:
1. Master leader M1 on the newer version replicates `universe_uuid` to followers on the older version.
2. Older-version master M2 becomes leader.
3. M2 replicates a cluster config change in which `universe_uuid` is unset.

#### Backport Plan

We want to backport this change down to the 2.14 line. Because the change uses autoflags, each line needs a different backport plan:

2.18+: Autoflags exist with autopromotion in YBA, so we will backport the change as is.
2.16: Autoflags exist, but there is no autopromotion in YBA. We will backport the change as is, but the user will have to promote this flag manually after the upgrade.
2.14: Autoflags do not exist at all. We will need to backport a change that just uses a gflag set to false. After the upgrade, the user will have to set this flag to true manually.

Jira: DB-6983

Original Commit: fb98e56 / D27858

Test Plan:
ybd debug --cxx-test master-test --gtest_filter *UniverseUuid*
ybd debug --cxx-test master_heartbeat-itest --gtest_filter *PreventHeartbeatWrongCluster*

Reviewers: hsunder, asrivastava, zdrudi

Reviewed By: hsunder

Subscribers: bogdan, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D28842
@yugabyte-ci yugabyte-ci reopened this Oct 16, 2023
rahuldesirazu added a commit that referenced this issue Oct 23, 2023
…eader in a different universe

Summary:
Currently, a tserver that heartbeats to a master leader in a different universe can register successfully. This can happen, for example, if the tserver's `--tserver_master_addrs` flag is set incorrectly, or if a master is wiped and added to a new cluster but not properly removed from the existing cluster. This can lead to data loss, as tasks will be triggered to clean up orphaned tablets on these tservers.

We introduce a new cluster config field, `universe_uuid`, that is generated only by the master leader (as opposed to `cluster_uuid`, which can be passed in as a flag). The master leader sets `universe_uuid` on the VisitSysCatalog path (newly elected leader), and sets `universe_uuid` in the cluster config if it is not already set.

We also add a similarly named field `universe_uuid` to the tserver instance metadata, indicating which cluster this tserver belongs to.

On the heartbeat path, the tserver sets `universe_uuid` in the request if it is set in its instance metadata. Otherwise, it is left unset. The master leader checks this passed-in UUID against what's in the cluster config. Here are the scenarios:
1. If the master's UUID is unset, return a TryAgain error so that the tserver retries until the UUID is set.
2. If both are set but do not match, fail, since the tserver is heartbeating to the wrong cluster.
3. If the master's UUID is set but the tserver's is unset, return the UUID to the tserver so that it can record it in its instance metadata. The master heartbeat path now waits for the tserver to set `universe_uuid` before proceeding with any logic.

This feature is gated by a kLocalPersisted autoflag, `master_enable_universe_uuid_heartbeat_check = true`. When this flag is enabled, the master both enables the UUID check on heartbeat and sets `universe_uuid` as part of the catalog manager background tasks. We need an autoflag here to guard against the following situation:
1. Master leader M1 on the newer version replicates `universe_uuid` to followers on the older version.
2. Older-version master M2 becomes leader.
3. M2 replicates a cluster config change in which `universe_uuid` is unset.

We want to backport this change down to the 2.14 line. Because the change uses autoflags, each line needs a different backport plan:

2.18+: Autoflags exist with autopromotion in YBA, so we will backport the change as is.
2.16: Autoflags exist, but there is no autopromotion in YBA. We will backport the change as is, but the user will have to promote this flag manually after the upgrade.
2.14: Autoflags do not exist at all. We will need to backport a change that just uses a gflag set to false. After the upgrade, the user will have to set this flag to true manually.

Jira: DB-6983

Test Plan:
ybd debug --cxx-test master-test --gtest_filter *UniverseUuid*
ybd debug --cxx-test master_heartbeat-itest --gtest_filter *PreventHeartbeatWrongCluster*

Reviewers: hsunder, asrivastava, zdrudi

Reviewed By: hsunder

Subscribers: bogdan, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D28869
rahuldesirazu added a commit that referenced this issue Oct 31, 2023
…eader in a different universe

Summary:
Currently, a tserver that heartbeats to a master leader in a different universe can register successfully. This can happen, for example, if the tserver's `--tserver_master_addrs` flag is set incorrectly, or if a master is wiped and added to a new cluster but not properly removed from the existing cluster. This can lead to data loss, as tasks will be triggered to clean up orphaned tablets on these tservers.

We introduce a new cluster config field, `universe_uuid`, that is generated only by the master leader (as opposed to `cluster_uuid`, which can be passed in as a flag). The master leader sets `universe_uuid` on the VisitSysCatalog path (newly elected leader), and sets `universe_uuid` in the cluster config if it is not already set.

We also add a similarly named field `universe_uuid` to the tserver instance metadata, indicating which cluster this tserver belongs to.

On the heartbeat path, the tserver sets `universe_uuid` in the request if it is set in its instance metadata. Otherwise, it is left unset. The master leader checks this passed-in UUID against what's in the cluster config. Here are the scenarios:
1. If the master's UUID is unset, return a TryAgain error so that the tserver retries until the UUID is set.
2. If both are set but do not match, fail, since the tserver is heartbeating to the wrong cluster.
3. If the master's UUID is set but the tserver's is unset, return the UUID to the tserver so that it can record it in its instance metadata. The master heartbeat path now waits for the tserver to set `universe_uuid` before proceeding with any logic.

This feature is gated by a kLocalPersisted autoflag `master_enable_universe_uuid_heartbeat_check = true`. When this flag is enabled, master both enables the uuid check on heartbeat, and sets the `universe_uuid` as part of catalog manager bg tasks. We need an auto flag here to guard against the following situation:
1. Master leader M1 on newer version replicates `universe_uuid` to followers on older version.
2. Older version master M2 becomes leader.
3. M2 replicates a cluster config change. `universe_uuid` is unset.
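
The situation above can be illustrated with a toy simulation. This is a rough Python sketch under loud assumptions: the `Master` class, the field names `version` and `blacklist`, and the idea that an older master rewrites the config keeping only the fields it knows are all hypothetical simplifications, not the real sys catalog behaviour.

```python
import uuid
from typing import Dict, FrozenSet, Tuple

# Hypothetical field sets: the older version does not know universe_uuid.
OLD_FIELDS: FrozenSet[str] = frozenset({"version", "blacklist"})
NEW_FIELDS: FrozenSet[str] = OLD_FIELDS | {"universe_uuid"}


class Master:
    def __init__(self, known_fields: FrozenSet[str], flag_enabled: bool):
        self.known_fields = known_fields
        self.flag_enabled = flag_enabled

    def become_leader(self, config: Dict[str, object]) -> Dict[str, object]:
        """Generate universe_uuid only if the field is known, the flag is on, and it is unset."""
        config = dict(config)
        if ("universe_uuid" in self.known_fields and self.flag_enabled
                and not config.get("universe_uuid")):
            config["universe_uuid"] = uuid.uuid4().hex
        return config

    def change_config(self, config: Dict[str, object], **changes) -> Dict[str, object]:
        """Assumed behaviour: the rewrite keeps only fields this version knows."""
        merged = {**config, **changes}
        return {k: v for k, v in merged.items() if k in self.known_fields}


def run_mixed_version(flag_on_new_master: bool) -> Tuple[dict, dict]:
    m1 = Master(NEW_FIELDS, flag_on_new_master)  # newer version
    m2 = Master(OLD_FIELDS, False)               # older version
    config = {"version": 1, "blacklist": []}
    after_m1 = m1.become_leader(config)              # step 1
    after_m2 = m2.become_leader(after_m1)            # step 2
    after_m2 = m2.change_config(after_m2, version=2)  # step 3
    return after_m1, after_m2
```

With the flag on during a mixed-version window, `run_mixed_version(True)` generates a uuid under M1 and loses it under M2. With the flag held off until every master is upgraded, `run_mixed_version(False)` never generates a uuid, so there is nothing to lose.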

We want to backport this change down to the 2.14 line. Because the change uses autoflags, each line requires a different backport plan:

2.18+: Autoflags exist with autopromotion in YBA, so we will backport the change as is.
2.16: Autoflags exist, but there is no autopromotion in YBA. We will backport the change as is, but the user will have to manually promote this flag post-upgrade.
2.14: Autoflags do not exist at all. We will need to backport a change that just uses a gflag set to false. After upgrade, the user will have to manually set this flag to true.

Jira: DB-6983

Original Commit: fb98e56/D27858

Test Plan:
ybd debug --cxx-test master-test --gtest_filter *UniverseUuid*
ybd debug --cxx-test master_heartbeat-itest --gtest_filter *PreventHeartbeatWrongCluster*

Reviewers: hsunder, asrivastava, zdrudi

Reviewed By: zdrudi

Subscribers: ybase, bogdan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29567
rahuldesirazu added a commit that referenced this issue Nov 6, 2023
…eader in a different universe

Summary:
Currently, a tserver that heartbeats to a master leader in a different universe can register successfully, even though it does not belong to that universe. This can happen, for example, if the tserver's `--tserver_master_addrs` is set incorrectly, or if a master is wiped and added to a new cluster but not properly removed from the existing one. This can result in data loss, as tasks will be triggered to clean up orphaned tablets on these tservers.

We introduce a new cluster config field, `universe_uuid`, that is generated only by the master leader (as opposed to `cluster_uuid`, which can be passed in as a flag). The master leader will set `universe_uuid` on the VisitSysCatalog path (newly elected leader), and set `universe_uuid` in the cluster config if it is not already set.

We also add a similarly named field `universe_uuid` to the tserver instance metadata, indicating which cluster this tserver belongs to.

On the heartbeat path, the tserver sets `universe_uuid` in the request if it is set in its instance metadata. Otherwise, it is left unset. The master leader checks the passed-in UUID against what is in the cluster config. Here are the scenarios:
1. If master's uuid is unset, return a TryAgain error to the tserver to retry until the uuid is set.
2. If both are set but mismatch, then fail since tserver is heartbeating to the wrong cluster.
3. If master is set but tserver is unset, then return the uuid to the tserver so it can set state in the instance metadata. The master heartbeat path will now wait for the tserver to set `universe_uuid` before proceeding with any logic.
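
The tserver side of this exchange can be sketched as follows. This is a minimal Python model, not the real tserver code: `InstanceMetadata`, `HeartbeatRequest`, `HeartbeatResponse`, `build_request`, and `handle_response` are hypothetical names, and never overwriting an already-set uuid is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstanceMetadata:
    universe_uuid: Optional[str] = None  # unset until a master hands one out


@dataclass
class HeartbeatRequest:
    universe_uuid: Optional[str] = None


@dataclass
class HeartbeatResponse:
    universe_uuid: Optional[str] = None  # set when the master returns its uuid


def build_request(meta: InstanceMetadata) -> HeartbeatRequest:
    """Copy the uuid into the request only if the metadata has one."""
    return HeartbeatRequest(universe_uuid=meta.universe_uuid)


def handle_response(meta: InstanceMetadata, resp: HeartbeatResponse) -> bool:
    """Store the master's uuid if none is set locally. Returns True on change."""
    if resp.universe_uuid and meta.universe_uuid is None:
        meta.universe_uuid = resp.universe_uuid
        return True
    # Assumption: an existing uuid is never overwritten by a response.
    return False
```

A fresh tserver sends an empty request, stores the uuid from the first response that carries one, and includes it in every later heartbeat.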

This feature is gated by a kLocalPersisted autoflag `master_enable_universe_uuid_heartbeat_check = true`. When this flag is enabled, the master both enables the uuid check on heartbeat and sets `universe_uuid` as part of the catalog manager background tasks. We need an autoflag here to guard against the following situation:
1. Master leader M1 on newer version replicates `universe_uuid` to followers on older version.
2. Older version master M2 becomes leader.
3. M2 replicates a cluster config change. `universe_uuid` is unset.

We want to backport this change down to the 2.14 line. Because the change uses autoflags, each line requires a different backport plan:

2.18+: Autoflags exist with autopromotion in YBA, so we will backport the change as is.
2.16: Autoflags exist, but there is no autopromotion in YBA. We will backport the change as is, but the user will have to manually promote this flag post-upgrade.
2.14: Autoflags do not exist at all. We will need to backport a change that just uses a gflag set to false. After upgrade, the user will have to manually set this flag to true.

Jira: DB-6983

Original Commit: fb98e56/D27858

Test Plan:
ybd debug --cxx-test master-test --gtest_filter *UniverseUuid*
ybd debug --cxx-test master_heartbeat-itest --gtest_filter *PreventHeartbeatWrongCluster*

Reviewers: hsunder, asrivastava, zdrudi

Reviewed By: hsunder

Subscribers: bogdan, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29676
@rahuldesirazu
Contributor

Backported down to 2.14
