[DocDB] Improve MetaCache leader change detection #20234

Open
1 task done
mdbridge opened this issue Dec 7, 2023 · 0 comments
Labels: area/docdb (YugabyteDB core features), kind/enhancement (This is an enhancement of an existing feature), priority/medium (Medium priority issue)

Comments


mdbridge commented Dec 7, 2023

Jira Link: DB-9194

Description

When the tablet leader changes, the meta cache guesses the new leader based on the list of peers it has. When there are many peers, or the peers are in a far-away geography, this process can take a long time to find the leader and sometimes causes the RPC to fail.

The current implementation also handles nodes that go down poorly. When we try to contact a tablet peer on a down node, we can lose many seconds waiting for the connection to time out. We have some protection today (we will not retry the same node for 60 seconds after a connection to it fails), but in practice this has been insufficient when the node is permanently down.

This is in part because if we can find a leader given the current stale contents of the cache (even after connection failures), we do not refresh the cache, allowing the staleness and connection failures to persist indefinitely.


This task is to improve how we do this. Various ideas are being discussed.

Capturing some:

- When the tablet leader changes, the meta cache guesses the leader based on the list of peers it has. When there are many peers, or the peers are in a far-away geography, this process can take a long time to find the leader and sometimes causes the RPC to fail. Most often the follower knows who the new leader is, so instead of just returning a LEADER_NOT_FOUND error it should also send back the location of the new leader. Only when the follower no longer hosts the tablet does the meta cache have to try a different node. And if the second node also does not have the tablet, we should reach out to the master immediately; this is very rare and indicates that a major tablet move happened. This, combined with meta cache LRUs, should help reduce the cost of having a stale meta cache. (A sketch of this idea follows the list.)

- Don't put any delay between pinging the peers.

- The most efficient option is for the follower to act as a proxy for the request and include the new leader info in the response so that the next request goes directly to the leader. But this would be more work and would increase load on the follower for a short time (we have to get some numbers to see if it is negligible).

- We have node blacklist info, which we can propagate to TServers and use to clear the cache.

- I think it's a good idea to piggyback the leadership information from the tablet (in case it is a follower and already knows the leader).
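The following is a minimal C++ sketch of the first idea above: a follower that rejects a leader-only request attaches a hint naming the peer it currently believes is the leader, and the client updates its cached leader from that hint instead of probing the remaining peers one by one. All of the names here (NotTheLeaderError, CachedTablet, HandleNotTheLeader) are hypothetical illustrations, not the actual YugabyteDB client API.

```cpp
// Minimal sketch of the "follower returns the new leader's location" idea.
// All names are hypothetical, not the actual YugabyteDB client API.
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical error payload a follower could return instead of a bare
// LEADER_NOT_FOUND / NOT_THE_LEADER error.
struct NotTheLeaderError {
  std::optional<std::string> suspected_leader_uuid;  // unset if the follower has no idea
};

// Hypothetical per-tablet meta cache entry.
struct CachedTablet {
  std::string leader_uuid;
  std::unordered_map<std::string, std::string> peer_addresses;  // uuid -> host:port
};

// Returns the address to retry against, updating the cached leader when the
// follower supplied a usable hint; returns nullopt when the client should
// fall back to refreshing from the master.
std::optional<std::string> HandleNotTheLeader(CachedTablet& tablet,
                                              const NotTheLeaderError& err) {
  if (err.suspected_leader_uuid &&
      tablet.peer_addresses.count(*err.suspected_leader_uuid) > 0) {
    tablet.leader_uuid = *err.suspected_leader_uuid;  // trust the hint
    return tablet.peer_addresses[tablet.leader_uuid];
  }
  return std::nullopt;
}
```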

Issue Type

kind/enhancement

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
mdbridge added the area/docdb (YugabyteDB core features) and status/awaiting-triage (Issue awaiting triage) labels on Dec 7, 2023
yugabyte-ci added the kind/enhancement (This is an enhancement of an existing feature) and priority/medium (Medium priority issue) labels and removed the status/awaiting-triage (Issue awaiting triage) label on Dec 7, 2023
SeanSong25 added a commit that referenced this issue Apr 4, 2024
…ById response returns

Summary:
Problem Background:
In our system, when a client needs to perform an operation on a specific tablet, it first needs to find out which server is currently responsible for that operation. If the operation is a WriteRpc for example, it must find the tablet leader server. However, the system's current method of figuring out the tablet leader is not very efficient. It tries to guess the leader based on a list of potential servers (peers), but this guessing game can be slow, especially when there are many servers or when the servers are located far apart geographically. This inefficiency can lead to operations failing because the leader wasn't found quickly enough.

Additionally, the system doesn't handle server failures well. If a server is down, it might take a long time for the system to stop trying to connect to it, wasting valuable seconds on each attempt. While there's a mechanism to avoid retrying a failed server for 60 seconds, it's not very effective when a server is permanently out of service. One reason for this inefficiency is that the system's information about who the leaders are (stored in something called the meta cache) can become outdated, and it doesn't get updated if the system can still perform its tasks with the outdated information, even if doing so results in repeated connection failures.

Solution Introduction:
This ticket introduces a preliminary change aimed at improving how the system tracks the current leader for each piece of data. The idea is to add a new piece of information to the meta cache called "raft_config_opid," which records the latest confirmed leadership configuration for each tablet. This way, when the system receives new information about the leadership configuration (which can happen during normal operations from other servers), it can check this new information against what it already knows. If the new information is more up-to-date, the system can update its meta cache, potentially avoiding wasted efforts on trying to connect to servers that are no longer leaders or are down.

In simpler terms, the goal is to make the system smarter about who it tries to talk to when performing operations, which should help it avoid unnecessary delays and failures caused by outdated information or attempts to communicate with servers that aren't currently leading or are unavailable. This change sets the stage for further improvements that will rely on the new "raft_config_opid_index" information to keep the meta cache up-to-date.
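Below is a minimal sketch of the "only accept newer config" rule described above, using made-up names (IncomingConsensusInfo, CachedTabletEntry) rather than the real meta cache types: the cache remembers the raft config OpId index it last applied, and incoming consensus information is used only if its committed config index is strictly newer.

```cpp
// Minimal sketch, with made-up names, of the "only accept newer config" rule.
#include <cstdint>
#include <string>
#include <vector>

struct IncomingConsensusInfo {
  int64_t committed_config_opid_index;
  std::string leader_uuid;
  std::vector<std::string> peer_uuids;
};

struct CachedTabletEntry {
  int64_t raft_config_opid_index = -1;  // -1: no confirmed config seen yet
  std::string leader_uuid;
  std::vector<std::string> peer_uuids;

  // Returns true if the cache entry was refreshed.
  bool MaybeUpdate(const IncomingConsensusInfo& info) {
    if (info.committed_config_opid_index <= raft_config_opid_index) {
      return false;  // stale or duplicate information; keep what we have
    }
    raft_config_opid_index = info.committed_config_opid_index;
    leader_uuid = info.leader_uuid;
    peer_uuids = info.peer_uuids;
    return true;
  }
};
```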

**Upgrade/Rollback safety:**
The added field in the response is not persisted on disk; it is guarded by protobuf's backward compatibility.
Jira: DB-9194

Test Plan: Manual testing by inspecting the meta-cache endpoint and verifying that the raft_config_opId_index field is actually being updated.

Reviewers: mlillibridge, arybochkin

Reviewed By: mlillibridge, arybochkin

Subscribers: hsunder, arybochkin, ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D33197
SeanSong25 added a commit that referenced this issue Apr 8, 2024
…base

Summary:
Previously, our system used a feature called "capabilities" to track what each server in our distributed database could do. However, since we've introduced a more dynamic way of handling these capabilities through "auto flags," the old capabilities system has become outdated and unnecessary.

In our ongoing project to improve the meta-cache (which helps our system efficiently locate which server is currently leading for a specific piece of data), we've encountered a problem. The meta-cache still holds a list of these outdated capabilities for each server, but we no longer have a straightforward way to update or use this capabilities information effectively. This is because the capabilities of the servers involved in the consensus process (raft peers) can't be easily determined by querying a single server.

To solve this issue, this change removes the outdated capabilities information from the entire codebase. We're also removing any tests that were specifically checking for this capabilities data. By doing this, we streamline the meta-cache structure and remove unnecessary complexity, laying better groundwork for further improvements in how we track and update leadership information in our distributed system.

This simplification helps us focus on more efficient and up-to-date methods of ensuring our database operations are directed to the right servers, enhancing overall performance and reliability.

**Upgrade/Rollback safety:**
The last capability was added in 2.6, and all customers are on versions higher than this, so it is safe to remove this feature.
Jira: DB-9194

Test Plan: Build and test with Jenkins

Reviewers: mlillibridge, hsunder, sergei

Reviewed By: mlillibridge, hsunder, sergei

Subscribers: bogdan, yql, ycdcxcluster, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D33598
SeanSong25 added a commit that referenced this issue Apr 25, 2024
…fo when received NOT_THE_LEADER error

Summary:
Problem Background:
In our system, when a client needs to perform an operation on a specific tablet, it first needs to find out which server is currently responsible for that operation. If the operation is a WriteRpc for example, it must find the tablet leader server. However, the system's current method of figuring out the tablet leader is not very efficient. It tries to guess the leader based on a list of potential servers (peers), but this guessing game can be slow, especially when there are many servers or when the servers are located far apart geographically. This inefficiency can lead to operations failing because the leader wasn't found quickly enough.

Additionally, the system doesn't handle server failures well. If a server is down, it might take a long time for the system to stop trying to connect to it, wasting valuable seconds on each attempt. While there's a mechanism to avoid retrying a failed server for 60 seconds, it's not very effective when a server is permanently out of service. One reason for this inefficiency is that the system's information about who the leaders are (stored in something called the meta cache) can become outdated, and it doesn't get updated if the system can still perform its tasks with the outdated information, even if doing so results in repeated connection failures.

Solution Introduction:
This ticket introduces a preliminary change aimed at improving how the system tracks the current leader for each piece of data. The idea is to add a new piece of information to the meta cache called "raft_config_opid," which records the latest confirmed leadership configuration for each tablet. This way, when the system receives new information about the leadership configuration (which can happen during normal operations from other servers), it can check this new information against what it already knows. If the new information is more up-to-date, the system can update its meta cache, potentially avoiding wasted efforts on trying to connect to servers that are no longer leaders or are down.
This diff, combined with D33197 and D33598, updates the meta-cache using the TabletConsensusInfo piggybacked on a Write/Read/GetChanges/GetTransactionStatus ResponsePB when we send a request that requires a leader to a non-leader. These frequent RPCs should keep our meta-cache sufficiently up to date to avoid the situation that caused the CE.
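Here is a self-contained sketch of the client-side flow this diff describes, with hypothetical names and a trivial stub transport (not the real TabletInvoker): a leader-only request lands on a stale cached leader, the NOT_THE_LEADER response piggybacks consensus info naming the new leader, the meta-cache is refreshed from it, and the single retry reaches the real leader.

```cpp
// Self-contained sketch with a trivial stub transport (not the real
// TabletInvoker): refresh the cached leader from a piggybacked hint and retry.
#include <optional>
#include <string>

struct ConsensusInfoSketch {
  std::string leader_uuid;
};

struct RpcResult {
  bool ok = false;
  std::optional<ConsensusInfoSketch> consensus_info;  // set on NOT_THE_LEADER
};

struct MetaCacheSketch {
  std::string cached_leader = "ts-1";  // stale: the leader has moved to ts-2

  // Stub transport: only "ts-2" is the current leader; any other peer rejects
  // the request and names the leader it knows about.
  RpcResult SendToPeer(const std::string& peer) {
    if (peer == "ts-2") return {true, std::nullopt};
    return {false, ConsensusInfoSketch{"ts-2"}};
  }

  bool WriteWithOneRefreshRetry() {
    RpcResult first = SendToPeer(cached_leader);
    if (first.ok) return true;
    if (!first.consensus_info) return false;            // no hint: old probing path
    cached_leader = first.consensus_info->leader_uuid;  // refresh the meta-cache
    return SendToPeer(cached_leader).ok;                // second attempt succeeds
  }
};
```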

Upgrade/Rollback safety:
The added field in the ResponsePBs is not persisted on disk; it is guarded by protobuf's backward compatibility.
Jira: DB-9194

Test Plan:
Unit Testing:
1. ClientTest.TestMetacacheRefreshWhenSentToWrongLeader: Changes the leadership of a Raft group after the meta-cache is already filled in. This introduces a discrepancy between the information in the meta-cache and the actual cluster configuration, so the caller gets back a NOT_THE_LEADER error. Normally this would prompt the TabletInvoker to try the next-in-line replica's tablet server, and with our test setup the TabletInvoker would retry the RPC at least 3 times. Because this diff refreshes the meta-cache right away after a NOT_THE_LEADER error, we should instead observe the RPC succeed in 2 tries: the first attempt piggybacks the TabletConsensusInfo and updates the meta-cache, and the second attempt uses the refreshed meta-cache to find the correct leader to send the request to.
2. CDCServiceTestMultipleServersOneTablet.TestGetChangesRpcTabletConsensusInfo: Since the GetChanges code path for updating the meta-cache has diverged sufficiently from the other RPC types, this test explicitly checks that when a CDC proxy receives a NOT_THE_LEADER error, its meta-cache is refreshed.

Reviewers: mlillibridge, xCluster, hsunder

Reviewed By: mlillibridge

Subscribers: yql, jason, ycdcxcluster, hsunder, ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D33533
SeanSong25 added a commit that referenced this issue Apr 26, 2024
…C responds

Summary:
Background:
As detailed in https://phorge.dev.yugabyte.com/D33533, the YBClient currently fails to update its meta-cache when there are changes in a raft group's configuration. This lapse can lead to inefficiencies such as persistent follower reads not recognizing the addition of closer followers. Consequently, even if there are suitable nearer followers, the system continues to rely on more distant followers as long as the RPCs are successful.

Solution:
This update proposes enhancing RPC mechanisms (Read, Write, GetChanges, GetTransactionStatus) by appending the current raft_config_opid_index to each request. Upon receiving a request, if the raft_config_opid_index is outdated compared to the committed raft config opid index on the TabletPeer handling the request, the Tablet Server will include updated raft consensus state information in its response. This change aims to ensure that the meta-cache remains current, thus improving the system's efficiency in recognizing and utilizing the optimal server configurations for processing requests. This adjustment is part of a series of updates (alongside D33197 and D33598) designed to keep the meta-cache sufficiently current, thereby preventing the inefficiencies previously caused by outdated cache information. A flag, enable_metacache_partial_refresh, is added to turn the feature on; it is off by default right now.
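A minimal sketch of the server-side check described above, with hypothetical names rather than the actual tserver code: the client piggybacks the raft config OpId index it has cached, and the server attaches fresh consensus state to the response only when that index is behind the committed one on the handling TabletPeer.

```cpp
// Minimal sketch of the server-side staleness check; names are hypothetical.
#include <cstdint>
#include <optional>
#include <string>

struct TabletConsensusInfoSketch {
  int64_t committed_config_opid_index;
  std::string leader_uuid;
};

// What the handling TabletPeer currently knows.
struct PeerConsensusState {
  int64_t committed_config_opid_index;
  std::string leader_uuid;
};

// Returns the info to piggyback on the response, or nullopt if the client's
// cached index is already current (keeping the common-case response small).
std::optional<TabletConsensusInfoSketch> MaybeAttachConsensusInfo(
    int64_t client_cached_opid_index, const PeerConsensusState& state) {
  if (client_cached_opid_index >= state.committed_config_opid_index) {
    return std::nullopt;
  }
  return TabletConsensusInfoSketch{state.committed_config_opid_index,
                                   state.leader_uuid};
}
```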

Upgrade/Rollback safety:
The additional field in the Response and Request Protobufs is temporary and will not be stored on disk, maintaining compatibility and safety during potential system upgrades or rollbacks.
Jira: DB-9194

Test Plan:
Jenkins: urgent
Full Coverage Testing:
Added a test flag, FLAGS_TEST_always_return_consensus_Info_for_succeeded_rpc, which is turned on in debug mode. This flag prompts the GetRaftConfigOpidIndex method on RemoteTablet to always return an OpId index of -2, so when the tablet server is about to send back a successful response, it finds that the request's piggybacked OpId index is stale and therefore piggybacks a TabletConsensusInfo onto the response. When we receive the response in the aforementioned RPCs and this flag is turned on, a DCHECK verifies that if the RPC response can contain a TabletConsensusInfo and the response was successful, then the TabletConsensusInfo actually exists in the response. This essentially lets us leverage all the existing tests in the code base that exercise these RPCs to DCHECK our code path.
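As a rough illustration of that debug-only verification (hypothetical names; a plain assert stands in for the real DCHECK): when the test flag forces the server to always attach consensus info to successful responses, the client asserts that any successful response capable of carrying a TabletConsensusInfo actually has one, which turns every existing RPC test into a check of this code path.

```cpp
// Rough illustration of the debug-only check; names are hypothetical and
// assert stands in for the real DCHECK.
#include <cassert>

struct ResponseSketch {
  bool ok = false;
  bool can_carry_consensus_info = false;
  bool has_consensus_info = false;
};

void VerifyConsensusInfoCoverage(const ResponseSketch& resp, bool test_flag_on) {
  if (test_flag_on && resp.ok && resp.can_carry_consensus_info) {
    assert(resp.has_consensus_info);  // every existing RPC test exercises this
  }
}
```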

Unit testing:
Added metacache_refresh_itest.cc, which contains the following tests:
TestMetacacheRefreshFromFollowerRead:
1. Sets up an external mini-cluster.
2. Fills in the meta-cache by issuing a write op.
3. Changes the raft configuration of the tablet group by blacklisting a node and adding a node.
4. Verifies, using a sync point, that the next ConsistentPrefix read successfully refreshes the meta-cache.

TestMetacacheNoRefreshFromWrite:
1. Turns off FLAGS_TEST_always_return_consensus_Info_for_succeeded_rpc.
2. Fills in the meta-cache by issuing a write op.
3. Issues another write op and observes that no refresh happens.

Reviewers: mlillibridge, xCluster, hsunder

Reviewed By: mlillibridge

Subscribers: bogdan, ybase, ycdcxcluster

Differential Revision: https://phorge.dev.yugabyte.com/D34272
svarnau pushed a commit that referenced this issue May 25, 2024
…fo when received NOT_THE_LEADER error

svarnau pushed a commit that referenced this issue May 25, 2024
…C responds
