Handle Race Condition between UpdatePeersAndMetrics revising retention barriers and createStream setting initial retention barriers #20145
Labels
jira-originated
kind/new-feature
This is a request for a completely new feature
priority/low
Low priority
Comments
yugabyte-ci added the jira-originated, kind/new-feature, priority/low, and status/awaiting-triage labels and removed the status/awaiting-triage label on Dec 4, 2023
asrinivasanyb added a commit that referenced this issue on Dec 18, 2023
Summary:
This is related to the CDC Consistent Snapshot feature (GH #18508).

The retention barriers are set and revised by multiple threads. These include the update_peers_and_metrics thread and the threads in CDCService that handle RPC requests. As part of creating a consistent snapshot stream, the ChangeMetadataOperation will now also set retention barriers.

The slowest consumer is determined by examining the entries of the cdc_state table. While a stream creation is in progress, the update_peers_and_metrics thread may determine the slowest consumer of the tablet from a cdc_state table that has not yet been updated with the checkpoint of the stream being created. This could lead to a situation where the retention requirements of the newly created stream are overridden.

This race condition is addressed by preventing forward revisions of the retention barrier for a period of time after a new stream sets the retention barrier. A gflag "retention_barrier_no_revision_interval_secs" is introduced to configure this duration. The idea is that the outcome of the stream creation will become known within this duration. If the creation succeeds, the (stream_id, tablet_id) entry will be inserted into the cdc_state table and will be considered from that point on when determining the slowest consumer.

UPGRADE/ROLLBACK SAFETY:
Upgrade safe
Rollback safe
PB changes
1) yb::cdc::UpdateCdcReplicatedIndexRequestPB - addition of 1 optional field. This is used in an RPC request.
This feature is currently disabled. Changes are guarded by the TEST flag yb_enable_cdc_consistent_snapshot_streams, which is set to false by default. This same flag will be redefined as an autoflag (LocalPersisted) once all the support for this feature is in place.
Jira: DB-9083

Test Plan:
Jenkins: test regex: .*XCluster.*|.*CDCSDK.*
New / Modified Unit Tests -
./yb_build.sh --cxx-test integration-tests_cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestRetentionBarrierSettingRace
./yb_build.sh --cxx-test integration-tests_cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestReleaseResourcesWhenNoStreamsOnTablet
./yb_build.sh --cxx-test integration-tests_xcluster-test --gtest_filter XClusterTest.CdcCheckpointPeerMove

Reviewers: sergei, skumar, xCluster, hsunder, jhe
Reviewed By: sergei, skumar
Subscribers: stiwary, ycdcxcluster, bogdan, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D30889
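
The no-revision window described in the summary can be pictured with a small toy model. This is only an illustrative sketch in Python, not the actual YugabyteDB C++ implementation: the class RetentionBarrier, its methods, and the 120-second value are hypothetical names and numbers chosen for the example, and the barrier is reduced to a single op index. Time is passed in explicitly so the behavior is deterministic. The sketch also assumes that backward revisions are always allowed, since the summary only mentions blocking forward ones.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetentionBarrier:
    """Toy model of a per-tablet retention barrier (hypothetical, for illustration)."""

    # Stands in for the retention_barrier_no_revision_interval_secs gflag.
    no_revision_interval_secs: float
    # Op index below which data may be garbage collected; None means no barrier yet.
    op_index: Optional[int] = None
    # Forward revisions are refused before this timestamp.
    hold_until: float = 0.0

    def set_initial(self, op_index: int, now: float) -> None:
        """Stream creation path: pin the barrier and start the no-revision window."""
        # A new stream can only require older data, so never move forward here.
        if self.op_index is None or op_index < self.op_index:
            self.op_index = op_index
        self.hold_until = now + self.no_revision_interval_secs

    def revise(self, slowest_consumer_index: int, now: float) -> bool:
        """Periodic updater path: returns True if the barrier was changed."""
        if self.op_index is None:
            self.op_index = slowest_consumer_index
            return True
        moving_forward = slowest_consumer_index > self.op_index
        if moving_forward and now < self.hold_until:
            # The updater may have read a cdc_state view that does not yet
            # include the stream being created, so refuse to release data.
            return False
        if slowest_consumer_index == self.op_index:
            return False
        self.op_index = slowest_consumer_index
        return True


barrier = RetentionBarrier(no_revision_interval_secs=120)  # 120 is an assumed value
barrier.set_initial(op_index=100, now=0.0)                 # stream creation in progress
blocked = barrier.revise(slowest_consumer_index=250, now=30.0)   # inside the window
allowed = barrier.revise(slowest_consumer_index=250, now=200.0)  # window has expired
print(blocked, allowed, barrier.op_index)
```

In this model the first revision attempt is refused because it falls inside the window opened by the new stream, which is the case the race condition is about. Once the window expires, the updater is trusted again, on the assumption that by then the new stream's entry is either present in cdc_state or the creation has failed.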
asrinivasanyb added a commit that referenced this issue on Dec 20, 2023
…ting retention barriers

Summary:
**Backport Description**
There was 1 simple merge conflict in # cdc_service.cc

**Original Description**
Original commit: 3760b39 / D30889
This is related to the CDC Consistent Snapshot feature (GH #18508).

The retention barriers are set and revised by multiple threads. These include the update_peers_and_metrics thread and the threads in CDCService that handle RPC requests. As part of creating a consistent snapshot stream, the ChangeMetadataOperation will now also set retention barriers.

The slowest consumer is determined by examining the entries of the cdc_state table. While a stream creation is in progress, the update_peers_and_metrics thread may determine the slowest consumer of the tablet from a cdc_state table that has not yet been updated with the checkpoint of the stream being created. This could lead to a situation where the retention requirements of the newly created stream are overridden.

This race condition is addressed by preventing forward revisions of the retention barrier for a period of time after a new stream sets the retention barrier. A gflag "retention_barrier_no_revision_interval_secs" is introduced to configure this duration. The idea is that the outcome of the stream creation will become known within this duration. If the creation succeeds, the (stream_id, tablet_id) entry will be inserted into the cdc_state table and will be considered from that point on when determining the slowest consumer.

UPGRADE/ROLLBACK SAFETY:
Upgrade safe
Rollback safe
PB changes
1) yb::cdc::UpdateCdcReplicatedIndexRequestPB - addition of 1 optional field. This is used in an RPC request.
This feature is currently disabled. Changes are guarded by the TEST flag yb_enable_cdc_consistent_snapshot_streams, which is set to false by default. This same flag will be redefined as an autoflag (LocalPersisted) once all the support for this feature is in place.
Jira: DB-9083, DB-9341

Test Plan:
All tests
New / Modified Unit Tests -
./yb_build.sh --cxx-test integration-tests_cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestRetentionBarrierSettingRace
./yb_build.sh --cxx-test integration-tests_cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestReleaseResourcesWhenNoStreamsOnTablet
./yb_build.sh --cxx-test integration-tests_xcluster-test --gtest_filter XClusterTest.CdcCheckpointPeerMove

Reviewers: sergei, skumar, xCluster, hsunder, jhe
Reviewed By: skumar
Subscribers: ybase, bogdan, ycdcxcluster, stiwary
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D31161
Jira Link: DB-9083