Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CDCSDK] Remove existing non user tables from existing CDC streams #22835

Open
yugabyte-ci opened this issue Jun 12, 2024 · 0 comments
Open

[CDCSDK] Remove existing non user tables from existing CDC streams #22835

yugabyte-ci opened this issue Jun 12, 2024 · 0 comments

Comments

@yugabyte-ci
Copy link
Contributor

yugabyte-ci commented Jun 12, 2024

Jira Link: DB-11733

There is a bug where non-user tables, such as indexes created after the CDC stream creation, are incorrectly added to the CDC stream metadata. Their corresponding tablet entries are also added to the CDC state table. Consequently, if any tablets of these non-user table splits, their child tablets are also added to the CDC state table unless the non-user table is removed from the CDC stream metadata.

@yugabyte-ci yugabyte-ci added area/cdcsdk CDC SDK jira-originated kind/bug This issue is a bug priority/highest Highest priority issue labels Jun 12, 2024
@yugabyte-ci yugabyte-ci changed the title [CDCSDK] Remove existing non user tables from CDC stream metadata and cdc state table [CDCSDK] Remove existing non user tables from existing CDC streams Jun 12, 2024
siddharth2411 added a commit that referenced this issue Jun 26, 2024
…rom existing CDCSDK stream

Summary:
Some non-eligible tables like indexes etc. created after creation of a CDC stream were getting added to the CDC stream due to [[ https://phorge.dev.yugabyte.com/D35856 | this missing logic in addition of dynamic tables codepath ]]. We do not hold retention barriers on tablets of such tables until and unless they split, in which case, we start holding retention barriers on the children tablets. This leads to heavy resource usage over time and since the CDC stream never polls on tables of such tables, retention barriers are not lifted until the active time of these tablets exceed `cdc_intent_retention_ms`.

Therefore, to prevent resource consumption from such tables, we want to achieve the following:
1. Remove these non-eligible tables from the stream metadata so that any further tablet splitting on these tables do not lead to addition of children tablets in cdc state table.
2. Release retention barriers on the existing tablets that are part of the cdc state table and finally remove these state table entries.

We have followed the same pattern of achieving the above tasks via a background thread, exactly similar to addition of dynamic table in CDC streams.

**Working:**

  - //FindAllNonUserTablesInCDCSDKStream// - On a master restart/leadership change, while loading CDCSDK streams into memory, we will compute the set difference between tables present in stream metadata and tables in the namespace that are eligible for a CDC stream. This set difference will give us the set of non-eligible tables that were not supposed to get added to the CDC stream, but got added because of the above mentioned bug. These non-eligible tables will be added to `namespace_to_cdcsdk_non_user_table_map_` which is further processed in catalog manager background thread by //FindCDCSDKStreamsForNonUserTables//.

 The bg thread of catalog manager (CatalogManagerBgTasks), with the following methods handles the actual table removal:

  - //FindCDCSDKStreamsForNonUserTables//: This method is run in every subsequent iteration of the bg thread of catalog manager. It scans the cdc_stream_map_ and finds all streams in ACTIVE/DELETING METADATA state which have the non-eligible table entry in stream metadata, and collects the details to be further processed by //RemoveNonUserTablesForCDCSDKStreams//.
  - //RemoveNonUserTablesForCDCSDKStreams//: This method is run after FindCDCSDKStreamsForNonUserTables and does the following for each stream that contains the non-eligible table entry:
     1. Update the checkpoint of cdc state entries related to non-eligible table to OpId max. Incase of colocated tables, entries with a colocated_table_id will be deleted.
     2. Removes the table from stream metadata and cdcsdk_tables_to_stream_map_.
    Once the table is removed from all relevant CDC streams, then we remove the table entry from `namespace_to_cdcsdk_non_user_table_map_`.

Note:
1. To enable this cleanup of non-eligible tables, user has to set the master flag `enable_cleanup_of_non_eligible_tables_from_cdcsdk_stream`.
2. In single iteration of the bg thread, we only process two non-eligible tables across all namespaces. This processing limit is configurable and we are reusing the existing flag `cdcsdk_table_processing_limit_per_run`.

Additionally, in the tablet split codepath, before adding cdc state entries for children tables, we will now check if the table is a non-eligible table for CDC stream or not. This also helps in preventing a race condition when a tablet of a non-eligible is split and concurrently, there was a master restart/leadership changes and we are trying to remove the table from stream metadata.
Jira: DB-11778, DB-11733, DB-11676

Test Plan:
Jenkins: urgent
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromNonConsistentSnapshotCDCStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromConsistentSnapshotCDCStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToConsistentSnapshotStream

Reviewers: asrinivasan, stiwary, skumar

Reviewed By: stiwary

Subscribers: ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36031
siddharth2411 added a commit that referenced this issue Jun 26, 2024
… tables for CDC from existing CDCSDK stream

Summary:
**Backport Description:**
No merge conflicts.

**Original Description:**
Original commit: 4e9a81c / D36031
Some non-eligible tables like indexes etc. created after creation of a CDC stream were getting added to the CDC stream due to [[ https://phorge.dev.yugabyte.com/D35856 | this missing logic in addition of dynamic tables codepath ]]. We do not hold retention barriers on tablets of such tables until and unless they split, in which case, we start holding retention barriers on the children tablets. This leads to heavy resource usage over time and since the CDC stream never polls on tables of such tables, retention barriers are not lifted until the active time of these tablets exceed `cdc_intent_retention_ms`.

Therefore, to prevent resource consumption from such tables, we want to achieve the following:
1. Remove these non-eligible tables from the stream metadata so that any further tablet splitting on these tables do not lead to addition of children tablets in cdc state table.
2. Release retention barriers on the existing tablets that are part of the cdc state table and finally remove these state table entries.

We have followed the same pattern of achieving the above tasks via a background thread, exactly similar to addition of dynamic table in CDC streams.

**Working:**

  - //FindAllNonUserTablesInCDCSDKStream// - On a master restart/leadership change, while loading CDCSDK streams into memory, we will compute the set difference between tables present in stream metadata and tables in the namespace that are eligible for a CDC stream. This set difference will give us the set of non-eligible tables that were not supposed to get added to the CDC stream, but got added because of the above mentioned bug. These non-eligible tables will be added to `namespace_to_cdcsdk_non_user_table_map_` which is further processed in catalog manager background thread by //FindCDCSDKStreamsForNonUserTables//.

 The bg thread of catalog manager (CatalogManagerBgTasks), with the following methods handles the actual table removal:

  - //FindCDCSDKStreamsForNonUserTables//: This method is run in every subsequent iteration of the bg thread of catalog manager. It scans the cdc_stream_map_ and finds all streams in ACTIVE/DELETING METADATA state which have the non-eligible table entry in stream metadata, and collects the details to be further processed by //RemoveNonUserTablesForCDCSDKStreams//.
  - //RemoveNonUserTablesForCDCSDKStreams//: This method is run after FindCDCSDKStreamsForNonUserTables and does the following for each stream that contains the non-eligible table entry:
     1. Update the checkpoint of cdc state entries related to non-eligible table to OpId max. Incase of colocated tables, entries with a colocated_table_id will be deleted.
     2. Removes the table from stream metadata and cdcsdk_tables_to_stream_map_.
    Once the table is removed from all relevant CDC streams, then we remove the table entry from `namespace_to_cdcsdk_non_user_table_map_`.

Note:
1. To enable this cleanup of non-eligible tables, user has to set the master flag `enable_cleanup_of_non_eligible_tables_from_cdcsdk_stream`.
2. In single iteration of the bg thread, we only process two non-eligible tables across all namespaces. This processing limit is configurable and we are reusing the existing flag `cdcsdk_table_processing_limit_per_run`.

Additionally, in the tablet split codepath, before adding cdc state entries for children tables, we will now check if the table is a non-eligible table for CDC stream or not. This also helps in preventing a race condition when a tablet of a non-eligible is split and concurrently, there was a master restart/leadership changes and we are trying to remove the table from stream metadata.
Jira: DB-11778, DB-11733, DB-11676

Test Plan:
Jenkins: urgent
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromNonConsistentSnapshotCDCStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromConsistentSnapshotCDCStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToConsistentSnapshotStream

Reviewers: asrinivasan, stiwary, skumar, xCluster, hsunder

Reviewed By: asrinivasan

Subscribers: ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36172
jasonyb pushed a commit that referenced this issue Jun 27, 2024
Summary:
 411a32e [DB-11813] Rename ysql_conn_mgr_idle_or_pending_clients metric name
 c2e13ef [#15682] YSQL: Fix stack_is_too_deep function in ASAN
 ef31455 [PLAT-14188] Fixing upgrade disk availability check
 db6b1b7 [#23004] YCQL: Fix tserver crash due to NULL pointer dereference
 0ada80a [PLAT-14433] Use correct kubeconfig for edit provider validation
 eccbc10 [PLAT-14414] Enable Kubernetes provider validation by default
 199f679 [PLAT-14324]: Move all node agent based flags from BETA to INTERNAL in Provider Conf keys file
 86a865d [PLAT-14443] YBA Installer wait for ready time configurable.
 ac184a8 [#22882] YSQL: Fix deadlock in DDL atomicity
 a4218fb [Docs] Sort feature to tables (Where fulfills the criteria) (#22836)
 2f267ca [#22996] xCluster: Add SOURCE_UNREACHABLE and SYSTEM_ERROR enums

Skipped due to conflict:
dee7691 [#21534] docdb: Set owner correctly for cloned databases
34632ba [PLAT-14495] Set up the node_exporter for ynp
7c99ff9 [#22876][#22773] CDCSDK: Add new yb-admin command to remove user table from CDCSDK stream
4e9a81c [#22876][#22835][#22773] CDCSDK: Remove non-eligible tables for CDC from existing CDCSDK stream
f2e574e [#23013] xClusterDDLRepl: Allow table_ids for GetXClusterStreams

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: tfoucher, sanketh, jenkins-bot

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D36184
siddharth2411 added a commit that referenced this issue Jun 27, 2024
… tables for CDC from existing CDCSDK stream

Summary:
**Backport Description:**
Faced merge conflicts because of changed method names & missing/changing of flags related to replication commands and
replica identity.

**Original Description:**
Original commit: 4e9a81c / D36031
Some non-eligible tables like indexes etc. created after creation of a CDC stream were getting added to the CDC stream due to [[ https://phorge.dev.yugabyte.com/D35856 | this missing logic in addition of dynamic tables codepath ]]. We do not hold retention barriers on tablets of such tables until and unless they split, in which case, we start holding retention barriers on the children tablets. This leads to heavy resource usage over time and since the CDC stream never polls on tables of such tables, retention barriers are not lifted until the active time of these tablets exceed `cdc_intent_retention_ms`.

Therefore, to prevent resource consumption from such tables, we want to achieve the following:
1. Remove these non-eligible tables from the stream metadata so that any further tablet splitting on these tables do not lead to addition of children tablets in cdc state table.
2. Release retention barriers on the existing tablets that are part of the cdc state table and finally remove these state table entries.

We have followed the same pattern of achieving the above tasks via a background thread, exactly similar to addition of dynamic table in CDC streams.

**Working:**

  - //FindAllNonUserTablesInCDCSDKStream// - On a master restart/leadership change, while loading CDCSDK streams into memory, we will compute the set difference between tables present in stream metadata and tables in the namespace that are eligible for a CDC stream. This set difference will give us the set of non-eligible tables that were not supposed to get added to the CDC stream, but got added because of the above mentioned bug. These non-eligible tables will be added to `namespace_to_cdcsdk_non_user_table_map_` which is further processed in catalog manager background thread by //FindCDCSDKStreamsForNonUserTables//.

 The bg thread of catalog manager (CatalogManagerBgTasks), with the following methods handles the actual table removal:

  - //FindCDCSDKStreamsForNonUserTables//: This method is run in every subsequent iteration of the bg thread of catalog manager. It scans the cdc_stream_map_ and finds all streams in ACTIVE/DELETING METADATA state which have the non-eligible table entry in stream metadata, and collects the details to be further processed by //RemoveNonUserTablesForCDCSDKStreams//.
  - //RemoveNonUserTablesForCDCSDKStreams//: This method is run after FindCDCSDKStreamsForNonUserTables and does the following for each stream that contains the non-eligible table entry:
     1. Update the checkpoint of cdc state entries related to non-eligible table to OpId max. Incase of colocated tables, entries with a colocated_table_id will be deleted.
     2. Removes the table from stream metadata and cdcsdk_tables_to_stream_map_.
    Once the table is removed from all relevant CDC streams, then we remove the table entry from `namespace_to_cdcsdk_non_user_table_map_`.

Note:
1. To enable this cleanup of non-eligible tables, user has to set the master flag `enable_cleanup_of_non_eligible_tables_from_cdcsdk_stream`.
2. In single iteration of the bg thread, we only process two non-eligible tables across all namespaces. This processing limit is configurable and we are reusing the existing flag `cdcsdk_table_processing_limit_per_run`.

Additionally, in the tablet split codepath, before adding cdc state entries for children tables, we will now check if the table is a non-eligible table for CDC stream or not. This also helps in preventing a race condition when a tablet of a non-eligible is split and concurrently, there was a master restart/leadership changes and we are trying to remove the table from stream metadata.
Jira: DB-11778, DB-11733, DB-11676

Test Plan:
Jenkins: urgent
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromNonConsistentSnapshotCDCStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromConsistentSnapshotCDCStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToConsistentSnapshotStream

Reviewers: asrinivasan, stiwary, skumar

Reviewed By: asrinivasan

Subscribers: ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36175
jasonyb pushed a commit that referenced this issue Jun 28, 2024
…rom existing CDCSDK stream

Summary:
Some non-eligible tables like indexes etc. created after creation of a CDC stream were getting added to the CDC stream due to [[ https://phorge.dev.yugabyte.com/D35856 | this missing logic in addition of dynamic tables codepath ]]. We do not hold retention barriers on tablets of such tables until and unless they split, in which case, we start holding retention barriers on the children tablets. This leads to heavy resource usage over time and since the CDC stream never polls on tables of such tables, retention barriers are not lifted until the active time of these tablets exceed `cdc_intent_retention_ms`.

Therefore, to prevent resource consumption from such tables, we want to achieve the following:
1. Remove these non-eligible tables from the stream metadata so that any further tablet splitting on these tables do not lead to addition of children tablets in cdc state table.
2. Release retention barriers on the existing tablets that are part of the cdc state table and finally remove these state table entries.

We have followed the same pattern of achieving the above tasks via a background thread, exactly similar to addition of dynamic table in CDC streams.

**Working:**

  - //FindAllNonUserTablesInCDCSDKStream// - On a master restart/leadership change, while loading CDCSDK streams into memory, we will compute the set difference between tables present in stream metadata and tables in the namespace that are eligible for a CDC stream. This set difference will give us the set of non-eligible tables that were not supposed to get added to the CDC stream, but got added because of the above mentioned bug. These non-eligible tables will be added to `namespace_to_cdcsdk_non_user_table_map_` which is further processed in catalog manager background thread by //FindCDCSDKStreamsForNonUserTables//.

 The bg thread of catalog manager (CatalogManagerBgTasks), with the following methods handles the actual table removal:

  - //FindCDCSDKStreamsForNonUserTables//: This method is run in every subsequent iteration of the bg thread of catalog manager. It scans the cdc_stream_map_ and finds all streams in ACTIVE/DELETING METADATA state which have the non-eligible table entry in stream metadata, and collects the details to be further processed by //RemoveNonUserTablesForCDCSDKStreams//.
  - //RemoveNonUserTablesForCDCSDKStreams//: This method is run after FindCDCSDKStreamsForNonUserTables and does the following for each stream that contains the non-eligible table entry:
     1. Update the checkpoint of cdc state entries related to non-eligible table to OpId max. Incase of colocated tables, entries with a colocated_table_id will be deleted.
     2. Removes the table from stream metadata and cdcsdk_tables_to_stream_map_.
    Once the table is removed from all relevant CDC streams, then we remove the table entry from `namespace_to_cdcsdk_non_user_table_map_`.

Note:
1. To enable this cleanup of non-eligible tables, user has to set the master flag `enable_cleanup_of_non_eligible_tables_from_cdcsdk_stream`.
2. In single iteration of the bg thread, we only process two non-eligible tables across all namespaces. This processing limit is configurable and we are reusing the existing flag `cdcsdk_table_processing_limit_per_run`.

Additionally, in the tablet split codepath, before adding cdc state entries for children tables, we will now check if the table is a non-eligible table for CDC stream or not. This also helps in preventing a race condition when a tablet of a non-eligible is split and concurrently, there was a master restart/leadership changes and we are trying to remove the table from stream metadata.
Jira: DB-11778, DB-11733, DB-11676

Test Plan:
Jenkins: urgent
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromNonConsistentSnapshotCDCStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromConsistentSnapshotCDCStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToConsistentSnapshotStream

Reviewers: asrinivasan, stiwary, skumar

Reviewed By: stiwary

Subscribers: ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36031
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants