Handle Race Condition between UpdatePeersAndMetrics revising retention barriers and createStream setting initial retention barriers #20145

Closed
yugabyte-ci opened this issue Dec 4, 2023 · 0 comments
Labels
jira-originated kind/new-feature This is a request for a completely new feature priority/low Low priority

yugabyte-ci commented Dec 4, 2023

Jira Link: DB-9083

@yugabyte-ci yugabyte-ci added jira-originated kind/new-feature This is a request for a completely new feature priority/low Low priority status/awaiting-triage Issue awaiting triage and removed status/awaiting-triage Issue awaiting triage labels Dec 4, 2023
asrinivasanyb added a commit that referenced this issue Dec 18, 2023
Summary:
This is related to the CDC Consistent Snapshot feature (GH #18508).

The retention barriers are set and revised by multiple threads.
These include the update_peers_and_metrics thread and the CDCService threads that handle RPC requests.
With consistent snapshot streams, the ChangeMetadataOperation also sets retention barriers
as part of stream creation.

The slowest consumer of a tablet is determined by examining the entries of the cdc_state table.
While a stream creation is in progress, the update_peers_and_metrics thread may determine the slowest
consumer from a cdc_state table that does not yet contain the checkpoint of the new stream.
The retention barrier could then be revised forward, overriding the retention requirements of the
newly created stream.
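
To illustrate the race described above, here is a minimal sketch of a slowest-consumer computation. It is not the code from this commit: the CdcStateEntry struct, its fields, and the SlowestConsumerIndex function are hypothetical simplifications of the cdc_state table, and a single integer checkpoint stands in for the actual retention barriers.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical, simplified view of one cdc_state row (not the real schema).
struct CdcStateEntry {
  std::string stream_id;
  std::string tablet_id;
  int64_t checkpoint_index;  // How far this stream has consumed on the tablet.
};

// Returns the smallest checkpoint among the rows for tablet_id, i.e. the
// position of the slowest consumer. Returns the int64_t maximum when no
// stream has an entry for the tablet.
int64_t SlowestConsumerIndex(const std::vector<CdcStateEntry>& rows,
                             const std::string& tablet_id) {
  int64_t min_index = std::numeric_limits<int64_t>::max();
  for (const auto& row : rows) {
    if (row.tablet_id == tablet_id) {
      min_index = std::min(min_index, row.checkpoint_index);
    }
  }
  return min_index;
}
```

If a new stream needs data from index 50 on tablet "t1" but its row has not been inserted yet, a scan over existing rows with checkpoints 100 and 80 yields 80, which is ahead of what the new stream requires. Once the row is present, the same scan yields 50.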

This race condition is addressed by preventing forward revisions of the retention barrier for a period of time
after a new stream sets the retention barrier. A gflag "retention_barrier_no_revision_interval_secs"
is introduced to configure this duration. The expectation is that the outcome of the stream creation becomes known
within this duration. If creation succeeds, the (stream_id, tablet_id) entry is inserted into the cdc_state table and
is considered from that point on when determining the slowest consumer.
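
The guard described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation in this commit: the RetentionBarrier class and its method names are hypothetical, a single integer stands in for the real barriers, and the interval is passed to the constructor where the real code would read the retention_barrier_no_revision_interval_secs gflag.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>

// Hypothetical model of one tablet's retention barrier.
class RetentionBarrier {
 public:
  using Clock = std::chrono::steady_clock;

  explicit RetentionBarrier(std::chrono::seconds no_revision_interval)
      : no_revision_interval_(no_revision_interval) {
    assert(no_revision_interval.count() >= 0);
  }

  // Stream creation path: sets the barrier and opens the window during
  // which forward revisions are refused.
  void SetByNewStream(int64_t index, Clock::time_point now) {
    std::lock_guard<std::mutex> lock(mutex_);
    index_ = index;
    no_revision_until_ = now + no_revision_interval_;
  }

  // Periodic revision path. A forward move (larger index) is rejected while
  // the window is open; backward moves are always applied.
  // Returns true if the revision was applied.
  bool TryRevise(int64_t index, Clock::time_point now) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (index > index_ && now < no_revision_until_) {
      return false;
    }
    index_ = index;
    return true;
  }

  int64_t index() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return index_;
  }

 private:
  const std::chrono::seconds no_revision_interval_;
  mutable std::mutex mutex_;
  int64_t index_ = 0;
  // No window is open until a new stream sets the barrier.
  Clock::time_point no_revision_until_ = Clock::time_point::min();
};
```

Passing the current time in as a parameter keeps the sketch deterministic to test. With an arbitrary 120-second interval, a forward revision attempted 10 seconds after SetByNewStream is refused, while the same revision attempted after 121 seconds is applied.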

UPGRADE/ROLLBACK SAFETY:
Upgrade safe
Rollback safe

PB changes

1) yb::cdc::UpdateCdcReplicatedIndexRequestPB - addition of 1 optional field. This is used in an RPC request.

This feature is currently disabled. Changes are guarded by the TEST flag
yb_enable_cdc_consistent_snapshot_streams which is set to false by default.
This same flag will be redefined as an autoflag (LocalPersisted) once all
the support for this feature is in place.
Jira: DB-9083

Test Plan:
Jenkins: test regex: .*XCluster.*|.*CDCSDK.*

New / Modified Unit Tests -
    ./yb_build.sh --cxx-test integration-tests_cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestRetentionBarrierSettingRace
    ./yb_build.sh --cxx-test integration-tests_cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestReleaseResourcesWhenNoStreamsOnTablet
    ./yb_build.sh --cxx-test integration-tests_xcluster-test --gtest_filter XClusterTest.CdcCheckpointPeerMove

Reviewers: sergei, skumar, xCluster, hsunder, jhe

Reviewed By: sergei, skumar

Subscribers: stiwary, ycdcxcluster, bogdan, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D30889
asrinivasanyb added a commit that referenced this issue Dec 20, 2023
…ting retention barriers

Summary:
**Backport Description**
There was 1 simple merge conflict in

  # cdc_service.cc

**Original Description**
Original commit: 3760b39 / D30889
This is related to the CDC Consistent Snapshot feature (GH #18508).

The retention barriers are set and revised by multiple threads.
These include the update_peers_and_metrics thread and the CDCService threads that handle RPC requests.
With consistent snapshot streams, the ChangeMetadataOperation also sets retention barriers
as part of stream creation.

The slowest consumer of a tablet is determined by examining the entries of the cdc_state table.
While a stream creation is in progress, the update_peers_and_metrics thread may determine the slowest
consumer from a cdc_state table that does not yet contain the checkpoint of the new stream.
The retention barrier could then be revised forward, overriding the retention requirements of the
newly created stream.

This race condition is addressed by preventing forward revisions of the retention barrier for a period of time
after a new stream sets the retention barrier. A gflag "retention_barrier_no_revision_interval_secs"
is introduced to configure this duration. The expectation is that the outcome of the stream creation becomes known
within this duration. If creation succeeds, the (stream_id, tablet_id) entry is inserted into the cdc_state table and
is considered from that point on when determining the slowest consumer.

UPGRADE/ROLLBACK SAFETY:
Upgrade safe
Rollback safe

PB changes

1) yb::cdc::UpdateCdcReplicatedIndexRequestPB - addition of 1 optional field. This is used in an RPC request.

This feature is currently disabled. Changes are guarded by the TEST flag
yb_enable_cdc_consistent_snapshot_streams which is set to false by default.
This same flag will be redefined as an autoflag (LocalPersisted) once all
the support for this feature is in place.

Jira: DB-9083, DB-9341

Test Plan:
All tests

New / Modified Unit Tests -
    ./yb_build.sh --cxx-test integration-tests_cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestRetentionBarrierSettingRace
    ./yb_build.sh --cxx-test integration-tests_cdcsdk_consistent_snapshot-test --gtest_filter CDCSDKConsistentSnapshotTest.TestReleaseResourcesWhenNoStreamsOnTablet
    ./yb_build.sh --cxx-test integration-tests_xcluster-test --gtest_filter XClusterTest.CdcCheckpointPeerMove

Reviewers: sergei, skumar, xCluster, hsunder, jhe

Reviewed By: skumar

Subscribers: ybase, bogdan, ycdcxcluster, stiwary

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31161