[CDCSDK] Deleting stream IDs leads to stale entries in the cdc_state table causing tserver crash #13653

Closed
vaibhav-yb opened this issue Aug 17, 2022 · 0 comments
Labels: area/cdcsdk (CDC SDK), kind/bug (This issue is a bug), priority/medium (Medium priority issue)
Projects: CDC


vaibhav-yb commented Aug 17, 2022

Jira Link: DB-3219
What’s happening is that when a stream_id is deleted, its metadata is not cleared from the cdc_state table. Even if a new stream is then created, the stale entries for the previous (deleted) stream interfere with its operation, ultimately crashing the tserver.

Steps to reproduce (a command-line sketch follows the list):

  1. Create a table
  2. Create a stream ID using yb-admin create_change_data_stream ysql.<namespace-name>
  3. Deploy the DBZ connector config
  4. Delete the DBZ connector config
  5. Delete the stream ID using yb-admin delete_change_data_stream <stream-id>
  6. Create another table
  7. Create a new stream ID
  8. Deploy the connector config using the stream created in step 7 - the tserver crashes
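
For reference, a minimal sketch of the yb-admin calls in steps 2 and 5 (the master address and the ysql.yugabyte namespace are placeholder assumptions; <stream-id> is the ID printed by the create call):

```sh
# Step 2: create a CDCSDK stream for the YSQL namespace (prints the stream ID)
yb-admin --master_addresses 127.0.0.1:7100 create_change_data_stream ysql.yugabyte

# Step 5: delete the stream -- the bug is that this leaves the stream's
# rows behind in the cdc_state table
yb-admin --master_addresses 127.0.0.1:7100 delete_change_data_stream <stream-id>
```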

However, a workaround has been identified to unblock users:

  1. Set the GFlag yb_system_namespace_readonly to false
  2. Using ycqlsh, delete the cdc_state entries corresponding to the stream ID deleted in step 5 (see the sketch below)
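
A sketch of that workaround via ycqlsh, assuming cdc_state's usual schema (rows keyed by tablet_id and stream_id); the IDs shown are placeholders:

```sh
# List the stale rows left behind by the deleted stream
ycqlsh -e "SELECT tablet_id, stream_id, checkpoint FROM system.cdc_state;"

# Delete each row belonging to the deleted stream
# (writes to the system keyspace require yb_system_namespace_readonly=false)
ycqlsh -e "DELETE FROM system.cdc_state WHERE tablet_id = '<tablet-id>' AND stream_id = '<deleted-stream-id>';"
```
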
@vaibhav-yb vaibhav-yb created this issue from a note in CDC (To do) Aug 17, 2022
@vaibhav-yb vaibhav-yb moved this from To do to In progress in CDC Aug 17, 2022
@vaibhav-yb vaibhav-yb added the area/cdcsdk CDC SDK label Aug 17, 2022
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Aug 17, 2022
samiahmedsiddiqui pushed a commit to samiahmedsiddiqui/yugabyte-db that referenced this issue Aug 19, 2022
… the cdc_state table causing tserver crash

Summary:
During the analysis we found that when a stream_id is deleted, its metadata is not cleared from the cdc_state table; even if a new stream is then created, the stale entries for the previous (deleted) stream interfere with its operation, ultimately crashing the tserver.

To fix this, we will ignore the deleted streams' metadata entries in //setCDCCheckpoint//, and will remove those entries once the //UpdatePeersAndMetrics// thread is enabled again.

Test Plan: Running all the C and Java test cases

Reviewers: srangavajjula, vkushwaha, abharadwaj, aagrawal, skumar

Reviewed By: aagrawal, skumar

Subscribers: ycdcxcluster

Differential Revision: https://phabricator.dev.yugabyte.com/D18986
@sureshdash2022-yb sureshdash2022-yb moved this from In progress to Done in CDC Aug 19, 2022
sureshdash2022-yb added a commit that referenced this issue Aug 20, 2022
…ntries in the cdc_state table causing tserver crash

Summary:
"Original commit:
 - 2787d62/D18882
 - 86a78b7/D18986"
During the analysis we found that when a stream_id is deleted, its metadata is not cleared from the cdc_state table; even if a new stream is then created, the stale entries for the previous (deleted) stream interfere with its operation, ultimately crashing the tserver.

To fix this, we will ignore the deleted streams' metadata entries in //setCDCCheckpoint//, and will remove those entries once the //UpdatePeersAndMetrics// thread is enabled again.

[#13693] CDCSDK: Stale entry in CDC Cache causes Stream Expiration.

Consider a cluster with 3 tservers (TS1, TS2, TS3) and a table with a single tablet. Today we maintain and track the stream's active time in a per-tserver cache. Initially TS1 is the tablet LEADER, so TS1 holds a cache entry tracking the tablet's active time. After some time TS2 becomes the tablet LEADER, so an entry is created in TS2's cache to track the tablet's active time. If TS1 later becomes LEADER again, after the cdc_intent_retention_ms expiration window has passed, its existing cache entry is out of sync, so a GetChanges call will find the stream expired.
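
For context, the expiration window mentioned here is the tserver flag cdc_intent_retention_ms; one way to check its current value on a running tserver (the local address and default web UI port 9000 are assumptions):

```sh
# Read the flag's current value from the tserver's /varz page
curl -s http://127.0.0.1:9000/varz | grep cdc_intent_retention_ms
```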

To handle this, the LEADER sends an RPC request, as part of the //UpdatePeersAndMetrics// thread, to all the FOLLOWERs so they update //last_active_time// in their CDC Service cache, keeping the LEADER and FOLLOWERs in sync.

Test Plan: Jenkins: urgent

Reviewers: srangavajjula, vkushwaha, abharadwaj, aagrawal, skumar

Reviewed By: skumar

Subscribers: ycdcxcluster

Differential Revision: https://phabricator.dev.yugabyte.com/D19054
sureshdash2022-yb added a commit that referenced this issue Aug 20, 2022
…_state table causing tserver crash

Summary:
During the analysis we found that when a stream_id is deleted, its metadata is not cleared from the cdc_state table; even if a new stream is then created, the stale entries for the previous (deleted) stream interfere with its operation, ultimately crashing the tserver.

To fix this, we will ignore the deleted streams' metadata entries in //setCDCCheckpoint//, and will remove those entries once the //UpdatePeersAndMetrics// thread is enabled again.

Test Plan: Running all the C and Java test cases

Reviewers: srangavajjula, vkushwaha, abharadwaj, aagrawal, skumar

Reviewed By: aagrawal, skumar

Subscribers: ycdcxcluster

Differential Revision: https://phabricator.dev.yugabyte.com/D18986
sureshdash2022-yb added a commit that referenced this issue Aug 20, 2022
…ries in the cdc_state table causing tserver crash

Summary:
"Original commit:
 - 2787d62/D18882
  -  86a78b7/D18986"
During the analysis we found that when a stream_id is deleted, its metadata is not cleared from the cdc_state table; even if a new stream is then created, the stale entries for the previous (deleted) stream interfere with its operation, ultimately crashing the tserver.

To fix this, we will ignore the deleted streams' metadata entries in //setCDCCheckpoint//, and will remove those entries once the //UpdatePeersAndMetrics// thread is enabled again.

[#13693] CDCSDK: Stale entry in CDC Cache causes Stream Expiration.

Consider a cluster with 3 tservers (TS1, TS2, TS3) and a table with a single tablet. Today we maintain and track the stream's active time in a per-tserver cache. Initially TS1 is the tablet LEADER, so TS1 holds a cache entry tracking the tablet's active time. After some time TS2 becomes the tablet LEADER, so an entry is created in TS2's cache to track the tablet's active time. If TS1 later becomes LEADER again, after the cdc_intent_retention_ms expiration window has passed, its existing cache entry is out of sync, so a GetChanges call will find the stream expired.

To handle this, the LEADER sends an RPC request, as part of the //UpdatePeersAndMetrics// thread, to all the FOLLOWERs so they update //last_active_time// in their CDC Service cache, keeping the LEADER and FOLLOWERs in sync.

Test Plan:
Jenkins: urgent
Running all the C and Java test cases

Reviewers: srangavajjula, vkushwaha, abharadwaj, aagrawal, skumar

Reviewed By: skumar

Subscribers: ycdcxcluster

Differential Revision: https://phabricator.dev.yugabyte.com/D19056
@yugabyte-ci yugabyte-ci added the status/awaiting-triage Issue awaiting triage label Aug 22, 2022
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Aug 22, 2022