[CDCSDK] Deleting stream IDs leads to stale entries in the cdc_state table causing tserver crash #13653

Closed
vaibhav-yb opened this issue Aug 17, 2022 · 0 comments
Labels: area/cdcsdk (CDC SDK), kind/bug (This issue is a bug), priority/medium (Medium priority issue)
Projects: CDC


vaibhav-yb commented Aug 17, 2022

Jira Link: DB-3219
What’s happening is that when a stream_id is deleted, its metadata is not cleared from the cdc_state table. Even if a new stream is then created, the stale entries for the previous (deleted) stream interfere with its operation, ultimately crashing the tserver.

Steps to reproduce (a command-line sketch follows the list):

  1. Create a table
  2. Create a stream ID using yb-admin create_change_data_stream ysql.<namespace-name>
  3. Deploy the DBZ connector config
  4. Delete the DBZ connector config
  5. Delete the stream ID using yb-admin delete_change_data_stream <stream-id>
  6. Create another table
  7. Create a new stream ID
  8. Deploy the connector config using the stream created in step 7 - the tserver crashes
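
For reference, a minimal sketch of the yb-admin calls in steps 2 and 5 (the master address and the ysql.yugabyte namespace are placeholder assumptions; <stream-id> is the ID printed by the create call):

```sh
# Step 2: create a CDCSDK stream for the YSQL namespace (prints the stream ID)
yb-admin --master_addresses 127.0.0.1:7100 create_change_data_stream ysql.yugabyte

# Step 5: delete the stream -- the bug is that this leaves the stream's
# rows behind in the cdc_state table
yb-admin --master_addresses 127.0.0.1:7100 delete_change_data_stream <stream-id>
```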

However, a workaround has been identified to unblock users:

  1. Set the GFlag yb_system_namespace_readonly to false
  2. Using ycqlsh, delete the cdc_state entries corresponding to the stream ID deleted in step 5 (see the sketch below)
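
A sketch of that workaround via ycqlsh, assuming cdc_state's usual schema (rows keyed by tablet_id and stream_id); the IDs shown are placeholders:

```sh
# List the stale rows left behind by the deleted stream
ycqlsh -e "SELECT tablet_id, stream_id, checkpoint FROM system.cdc_state;"

# Delete each row belonging to the deleted stream
# (writes to the system keyspace require yb_system_namespace_readonly=false)
ycqlsh -e "DELETE FROM system.cdc_state WHERE tablet_id = '<tablet-id>' AND stream_id = '<deleted-stream-id>';"
```
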
@vaibhav-yb vaibhav-yb created this issue from a note in CDC (To do) Aug 17, 2022
@vaibhav-yb vaibhav-yb moved this from To do to In progress in CDC Aug 17, 2022
@vaibhav-yb vaibhav-yb added the area/cdcsdk CDC SDK label Aug 17, 2022
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Aug 17, 2022
samiahmedsiddiqui pushed a commit to samiahmedsiddiqui/yugabyte-db that referenced this issue Aug 19, 2022
… the cdc_state table causing tserver crash

Summary:
During the analysis we found that when a stream_id is deleted, its metadata is not cleared from the cdc_state table; even if a new stream is then created, the stale entries for the previous (deleted) stream interfere with its operation, ultimately crashing the tserver.

To fix this, we will ignore the deleted streams' metadata entries in //setCDCCheckpoint//, and will remove those entries once the //UpdatePeersAndMetrics// thread is enabled again.

Test Plan: Running all the C and Java test cases

Reviewers: srangavajjula, vkushwaha, abharadwaj, aagrawal, skumar

Reviewed By: aagrawal, skumar

Subscribers: ycdcxcluster

Differential Revision: https://phabricator.dev.yugabyte.com/D18986
@sureshdash2022-yb sureshdash2022-yb moved this from In progress to Done in CDC Aug 19, 2022
sureshdash2022-yb added a commit that referenced this issue Aug 20, 2022
…ntries in the cdc_state table causing tserver crash

Summary:
"Original commit:
 - 2787d62/D18882
 - 86a78b7/D18986"
During the analysis we found that when a stream_id is deleted, its metadata is not cleared from the cdc_state table; even if a new stream is then created, the stale entries for the previous (deleted) stream interfere with its operation, ultimately crashing the tserver.

To fix this, we will ignore the deleted streams' metadata entries in //setCDCCheckpoint//, and will remove those entries once the //UpdatePeersAndMetrics// thread is enabled again.

[#13693] CDCSDK: Stale entry in CDC Cache causes Stream Expiration.

Consider a cluster with 3 tservers (TS1, TS2, TS3) and a table with a single tablet. Today we maintain and track the stream's active time in a per-tserver cache. Initially TS1 is the tablet LEADER, so TS1 holds a cache entry tracking the tablet's active time. After some time TS2 becomes the tablet LEADER, so an entry is created in TS2's cache to track the tablet's active time. If TS1 later becomes LEADER again, after the cdc_intent_retention_ms expiration window has passed, its existing cache entry is out of sync, so a GetChanges call will find the stream expired.
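
For context, the expiration window mentioned here is the tserver flag cdc_intent_retention_ms; one way to check its current value on a running tserver (the local address and default web UI port 9000 are assumptions):

```sh
# Read the flag's current value from the tserver's /varz page
curl -s http://127.0.0.1:9000/varz | grep cdc_intent_retention_ms
```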

To handle this, the LEADER sends an RPC request, as part of the //UpdatePeersAndMetrics// thread, to all the FOLLOWERs so they update //last_active_time// in their CDC Service cache, keeping the LEADER and FOLLOWERs in sync.

Test Plan: Jenkins: urgent

Reviewers: srangavajjula, vkushwaha, abharadwaj, aagrawal, skumar

Reviewed By: skumar

Subscribers: ycdcxcluster

Differential Revision: https://phabricator.dev.yugabyte.com/D19054
sureshdash2022-yb added a commit that referenced this issue Aug 20, 2022
…_state table causing tserver crash

Summary:
During the analysis we found that when a stream_id is deleted, its metadata is not cleared from the cdc_state table; even if a new stream is then created, the stale entries for the previous (deleted) stream interfere with its operation, ultimately crashing the tserver.

To fix this, we will ignore the deleted streams' metadata entries in //setCDCCheckpoint//, and will remove those entries once the //UpdatePeersAndMetrics// thread is enabled again.

Test Plan: Running all the C and Java test cases

Reviewers: srangavajjula, vkushwaha, abharadwaj, aagrawal, skumar

Reviewed By: aagrawal, skumar

Subscribers: ycdcxcluster

Differential Revision: https://phabricator.dev.yugabyte.com/D18986
sureshdash2022-yb added a commit that referenced this issue Aug 20, 2022
…ries in the cdc_state table causing tserver crash

Summary:
"Original commit:
 - 2787d62/D18882
  -  86a78b7/D18986"
During the analysis we found that when a stream_id is deleted, its metadata is not cleared from the cdc_state table; even if a new stream is then created, the stale entries for the previous (deleted) stream interfere with its operation, ultimately crashing the tserver.

To fix this, we will ignore the deleted streams' metadata entries in //setCDCCheckpoint//, and will remove those entries once the //UpdatePeersAndMetrics// thread is enabled again.

[#13693] CDCSDK: Stale entry in CDC Cache causes Stream Expiration.

Consider a cluster with 3 tservers (TS1, TS2, TS3) and a table with a single tablet. Today we maintain and track the stream's active time in a per-tserver cache. Initially TS1 is the tablet LEADER, so TS1 holds a cache entry tracking the tablet's active time. After some time TS2 becomes the tablet LEADER, so an entry is created in TS2's cache to track the tablet's active time. If TS1 later becomes LEADER again, after the cdc_intent_retention_ms expiration window has passed, its existing cache entry is out of sync, so a GetChanges call will find the stream expired.

To handle this, the LEADER sends an RPC request, as part of the //UpdatePeersAndMetrics// thread, to all the FOLLOWERs so they update //last_active_time// in their CDC Service cache, keeping the LEADER and FOLLOWERs in sync.

Test Plan:
Jenkins: urgent
Running all the C and Java test cases

Reviewers: srangavajjula, vkushwaha, abharadwaj, aagrawal, skumar

Reviewed By: skumar

Subscribers: ycdcxcluster

Differential Revision: https://phabricator.dev.yugabyte.com/D19056
@yugabyte-ci yugabyte-ci added the status/awaiting-triage Issue awaiting triage label Aug 22, 2022
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Aug 22, 2022