-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CDCSDK] Remove existing non user tables from existing CDC streams #22835
Labels
2.20.4.1_blocker
2.20.5_blocker
2024.1.1_blocker
area/cdcsdk
CDC SDK
jira-originated
kind/bug
This issue is a bug
priority/highest
Highest priority issue
Comments
siddharth2411
added a commit
that referenced
this issue
Jun 26, 2024
…rom existing CDCSDK stream Summary: Some non-eligible tables like indexes etc. created after creation of a CDC stream were getting added to the CDC stream due to [[ https://phorge.dev.yugabyte.com/D35856 | this missing logic in addition of dynamic tables codepath ]]. We do not hold retention barriers on tablets of such tables until and unless they split, in which case, we start holding retention barriers on the children tablets. This leads to heavy resource usage over time and since the CDC stream never polls on tables of such tables, retention barriers are not lifted until the active time of these tablets exceed `cdc_intent_retention_ms`. Therefore, to prevent resource consumption from such tables, we want to achieve the following: 1. Remove these non-eligible tables from the stream metadata so that any further tablet splitting on these tables do not lead to addition of children tablets in cdc state table. 2. Release retention barriers on the existing tablets that are part of the cdc state table and finally remove these state table entries. We have followed the same pattern of achieving the above tasks via a background thread, exactly similar to addition of dynamic table in CDC streams. **Working:** - //FindAllNonUserTablesInCDCSDKStream// - On a master restart/leadership change, while loading CDCSDK streams into memory, we will compute the set difference between tables present in stream metadata and tables in the namespace that are eligible for a CDC stream. This set difference will give us the set of non-eligible tables that were not supposed to get added to the CDC stream, but got added because of the above mentioned bug. These non-eligible tables will be added to `namespace_to_cdcsdk_non_user_table_map_` which is further processed in catalog manager background thread by //FindCDCSDKStreamsForNonUserTables//. The bg thread of catalog manager (CatalogManagerBgTasks), with the following methods handles the actual table removal: - //FindCDCSDKStreamsForNonUserTables//: This method is run in every subsequent iteration of the bg thread of catalog manager. It scans the cdc_stream_map_ and finds all streams in ACTIVE/DELETING METADATA state which have the non-eligible table entry in stream metadata, and collects the details to be further processed by //RemoveNonUserTablesForCDCSDKStreams//. - //RemoveNonUserTablesForCDCSDKStreams//: This method is run after FindCDCSDKStreamsForNonUserTables and does the following for each stream that contains the non-eligible table entry: 1. Update the checkpoint of cdc state entries related to non-eligible table to OpId max. Incase of colocated tables, entries with a colocated_table_id will be deleted. 2. Removes the table from stream metadata and cdcsdk_tables_to_stream_map_. Once the table is removed from all relevant CDC streams, then we remove the table entry from `namespace_to_cdcsdk_non_user_table_map_`. Note: 1. To enable this cleanup of non-eligible tables, user has to set the master flag `enable_cleanup_of_non_eligible_tables_from_cdcsdk_stream`. 2. In single iteration of the bg thread, we only process two non-eligible tables across all namespaces. This processing limit is configurable and we are reusing the existing flag `cdcsdk_table_processing_limit_per_run`. Additionally, in the tablet split codepath, before adding cdc state entries for children tables, we will now check if the table is a non-eligible table for CDC stream or not. This also helps in preventing a race condition when a tablet of a non-eligible is split and concurrently, there was a master restart/leadership changes and we are trying to remove the table from stream metadata. Jira: DB-11778, DB-11733, DB-11676 Test Plan: Jenkins: urgent ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromNonConsistentSnapshotCDCStream ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromConsistentSnapshotCDCStream ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToNonConsistentSnapshotStream ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToConsistentSnapshotStream Reviewers: asrinivasan, stiwary, skumar Reviewed By: stiwary Subscribers: ycdcxcluster Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D36031
siddharth2411
added a commit
that referenced
this issue
Jun 26, 2024
… tables for CDC from existing CDCSDK stream Summary: **Backport Description:** No merge conflicts. **Original Description:** Original commit: 4e9a81c / D36031 Some non-eligible tables like indexes etc. created after creation of a CDC stream were getting added to the CDC stream due to [[ https://phorge.dev.yugabyte.com/D35856 | this missing logic in addition of dynamic tables codepath ]]. We do not hold retention barriers on tablets of such tables until and unless they split, in which case, we start holding retention barriers on the children tablets. This leads to heavy resource usage over time and since the CDC stream never polls on tables of such tables, retention barriers are not lifted until the active time of these tablets exceed `cdc_intent_retention_ms`. Therefore, to prevent resource consumption from such tables, we want to achieve the following: 1. Remove these non-eligible tables from the stream metadata so that any further tablet splitting on these tables do not lead to addition of children tablets in cdc state table. 2. Release retention barriers on the existing tablets that are part of the cdc state table and finally remove these state table entries. We have followed the same pattern of achieving the above tasks via a background thread, exactly similar to addition of dynamic table in CDC streams. **Working:** - //FindAllNonUserTablesInCDCSDKStream// - On a master restart/leadership change, while loading CDCSDK streams into memory, we will compute the set difference between tables present in stream metadata and tables in the namespace that are eligible for a CDC stream. This set difference will give us the set of non-eligible tables that were not supposed to get added to the CDC stream, but got added because of the above mentioned bug. These non-eligible tables will be added to `namespace_to_cdcsdk_non_user_table_map_` which is further processed in catalog manager background thread by //FindCDCSDKStreamsForNonUserTables//. The bg thread of catalog manager (CatalogManagerBgTasks), with the following methods handles the actual table removal: - //FindCDCSDKStreamsForNonUserTables//: This method is run in every subsequent iteration of the bg thread of catalog manager. It scans the cdc_stream_map_ and finds all streams in ACTIVE/DELETING METADATA state which have the non-eligible table entry in stream metadata, and collects the details to be further processed by //RemoveNonUserTablesForCDCSDKStreams//. - //RemoveNonUserTablesForCDCSDKStreams//: This method is run after FindCDCSDKStreamsForNonUserTables and does the following for each stream that contains the non-eligible table entry: 1. Update the checkpoint of cdc state entries related to non-eligible table to OpId max. Incase of colocated tables, entries with a colocated_table_id will be deleted. 2. Removes the table from stream metadata and cdcsdk_tables_to_stream_map_. Once the table is removed from all relevant CDC streams, then we remove the table entry from `namespace_to_cdcsdk_non_user_table_map_`. Note: 1. To enable this cleanup of non-eligible tables, user has to set the master flag `enable_cleanup_of_non_eligible_tables_from_cdcsdk_stream`. 2. In single iteration of the bg thread, we only process two non-eligible tables across all namespaces. This processing limit is configurable and we are reusing the existing flag `cdcsdk_table_processing_limit_per_run`. Additionally, in the tablet split codepath, before adding cdc state entries for children tables, we will now check if the table is a non-eligible table for CDC stream or not. This also helps in preventing a race condition when a tablet of a non-eligible is split and concurrently, there was a master restart/leadership changes and we are trying to remove the table from stream metadata. Jira: DB-11778, DB-11733, DB-11676 Test Plan: Jenkins: urgent ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromNonConsistentSnapshotCDCStream ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromConsistentSnapshotCDCStream ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToNonConsistentSnapshotStream ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToConsistentSnapshotStream Reviewers: asrinivasan, stiwary, skumar, xCluster, hsunder Reviewed By: asrinivasan Subscribers: ycdcxcluster Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D36172
jasonyb
pushed a commit
that referenced
this issue
Jun 27, 2024
Summary: 411a32e [DB-11813] Rename ysql_conn_mgr_idle_or_pending_clients metric name c2e13ef [#15682] YSQL: Fix stack_is_too_deep function in ASAN ef31455 [PLAT-14188] Fixing upgrade disk availability check db6b1b7 [#23004] YCQL: Fix tserver crash due to NULL pointer dereference 0ada80a [PLAT-14433] Use correct kubeconfig for edit provider validation eccbc10 [PLAT-14414] Enable Kubernetes provider validation by default 199f679 [PLAT-14324]: Move all node agent based flags from BETA to INTERNAL in Provider Conf keys file 86a865d [PLAT-14443] YBA Installer wait for ready time configurable. ac184a8 [#22882] YSQL: Fix deadlock in DDL atomicity a4218fb [Docs] Sort feature to tables (Where fulfills the criteria) (#22836) 2f267ca [#22996] xCluster: Add SOURCE_UNREACHABLE and SYSTEM_ERROR enums Skipped due to conflict: dee7691 [#21534] docdb: Set owner correctly for cloned databases 34632ba [PLAT-14495] Set up the node_exporter for ynp 7c99ff9 [#22876][#22773] CDCSDK: Add new yb-admin command to remove user table from CDCSDK stream 4e9a81c [#22876][#22835][#22773] CDCSDK: Remove non-eligible tables for CDC from existing CDCSDK stream f2e574e [#23013] xClusterDDLRepl: Allow table_ids for GetXClusterStreams Test Plan: Jenkins: rebase: pg15-cherrypicks Reviewers: tfoucher, sanketh, jenkins-bot Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D36184
siddharth2411
added a commit
that referenced
this issue
Jun 27, 2024
… tables for CDC from existing CDCSDK stream Summary: **Backport Description:** Faced merge conflicts because of changed method names & missing/changing of flags related to replication commands and replica identity. **Original Description:** Original commit: 4e9a81c / D36031 Some non-eligible tables like indexes etc. created after creation of a CDC stream were getting added to the CDC stream due to [[ https://phorge.dev.yugabyte.com/D35856 | this missing logic in addition of dynamic tables codepath ]]. We do not hold retention barriers on tablets of such tables until and unless they split, in which case, we start holding retention barriers on the children tablets. This leads to heavy resource usage over time and since the CDC stream never polls on tables of such tables, retention barriers are not lifted until the active time of these tablets exceed `cdc_intent_retention_ms`. Therefore, to prevent resource consumption from such tables, we want to achieve the following: 1. Remove these non-eligible tables from the stream metadata so that any further tablet splitting on these tables do not lead to addition of children tablets in cdc state table. 2. Release retention barriers on the existing tablets that are part of the cdc state table and finally remove these state table entries. We have followed the same pattern of achieving the above tasks via a background thread, exactly similar to addition of dynamic table in CDC streams. **Working:** - //FindAllNonUserTablesInCDCSDKStream// - On a master restart/leadership change, while loading CDCSDK streams into memory, we will compute the set difference between tables present in stream metadata and tables in the namespace that are eligible for a CDC stream. This set difference will give us the set of non-eligible tables that were not supposed to get added to the CDC stream, but got added because of the above mentioned bug. These non-eligible tables will be added to `namespace_to_cdcsdk_non_user_table_map_` which is further processed in catalog manager background thread by //FindCDCSDKStreamsForNonUserTables//. The bg thread of catalog manager (CatalogManagerBgTasks), with the following methods handles the actual table removal: - //FindCDCSDKStreamsForNonUserTables//: This method is run in every subsequent iteration of the bg thread of catalog manager. It scans the cdc_stream_map_ and finds all streams in ACTIVE/DELETING METADATA state which have the non-eligible table entry in stream metadata, and collects the details to be further processed by //RemoveNonUserTablesForCDCSDKStreams//. - //RemoveNonUserTablesForCDCSDKStreams//: This method is run after FindCDCSDKStreamsForNonUserTables and does the following for each stream that contains the non-eligible table entry: 1. Update the checkpoint of cdc state entries related to non-eligible table to OpId max. Incase of colocated tables, entries with a colocated_table_id will be deleted. 2. Removes the table from stream metadata and cdcsdk_tables_to_stream_map_. Once the table is removed from all relevant CDC streams, then we remove the table entry from `namespace_to_cdcsdk_non_user_table_map_`. Note: 1. To enable this cleanup of non-eligible tables, user has to set the master flag `enable_cleanup_of_non_eligible_tables_from_cdcsdk_stream`. 2. In single iteration of the bg thread, we only process two non-eligible tables across all namespaces. This processing limit is configurable and we are reusing the existing flag `cdcsdk_table_processing_limit_per_run`. Additionally, in the tablet split codepath, before adding cdc state entries for children tables, we will now check if the table is a non-eligible table for CDC stream or not. This also helps in preventing a race condition when a tablet of a non-eligible is split and concurrently, there was a master restart/leadership changes and we are trying to remove the table from stream metadata. Jira: DB-11778, DB-11733, DB-11676 Test Plan: Jenkins: urgent ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromNonConsistentSnapshotCDCStream ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromConsistentSnapshotCDCStream ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToNonConsistentSnapshotStream ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToConsistentSnapshotStream Reviewers: asrinivasan, stiwary, skumar Reviewed By: asrinivasan Subscribers: ycdcxcluster Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D36175
jasonyb
pushed a commit
that referenced
this issue
Jun 28, 2024
…rom existing CDCSDK stream Summary: Some non-eligible tables like indexes etc. created after creation of a CDC stream were getting added to the CDC stream due to [[ https://phorge.dev.yugabyte.com/D35856 | this missing logic in addition of dynamic tables codepath ]]. We do not hold retention barriers on tablets of such tables until and unless they split, in which case, we start holding retention barriers on the children tablets. This leads to heavy resource usage over time and since the CDC stream never polls on tables of such tables, retention barriers are not lifted until the active time of these tablets exceed `cdc_intent_retention_ms`. Therefore, to prevent resource consumption from such tables, we want to achieve the following: 1. Remove these non-eligible tables from the stream metadata so that any further tablet splitting on these tables do not lead to addition of children tablets in cdc state table. 2. Release retention barriers on the existing tablets that are part of the cdc state table and finally remove these state table entries. We have followed the same pattern of achieving the above tasks via a background thread, exactly similar to addition of dynamic table in CDC streams. **Working:** - //FindAllNonUserTablesInCDCSDKStream// - On a master restart/leadership change, while loading CDCSDK streams into memory, we will compute the set difference between tables present in stream metadata and tables in the namespace that are eligible for a CDC stream. This set difference will give us the set of non-eligible tables that were not supposed to get added to the CDC stream, but got added because of the above mentioned bug. These non-eligible tables will be added to `namespace_to_cdcsdk_non_user_table_map_` which is further processed in catalog manager background thread by //FindCDCSDKStreamsForNonUserTables//. The bg thread of catalog manager (CatalogManagerBgTasks), with the following methods handles the actual table removal: - //FindCDCSDKStreamsForNonUserTables//: This method is run in every subsequent iteration of the bg thread of catalog manager. It scans the cdc_stream_map_ and finds all streams in ACTIVE/DELETING METADATA state which have the non-eligible table entry in stream metadata, and collects the details to be further processed by //RemoveNonUserTablesForCDCSDKStreams//. - //RemoveNonUserTablesForCDCSDKStreams//: This method is run after FindCDCSDKStreamsForNonUserTables and does the following for each stream that contains the non-eligible table entry: 1. Update the checkpoint of cdc state entries related to non-eligible table to OpId max. Incase of colocated tables, entries with a colocated_table_id will be deleted. 2. Removes the table from stream metadata and cdcsdk_tables_to_stream_map_. Once the table is removed from all relevant CDC streams, then we remove the table entry from `namespace_to_cdcsdk_non_user_table_map_`. Note: 1. To enable this cleanup of non-eligible tables, user has to set the master flag `enable_cleanup_of_non_eligible_tables_from_cdcsdk_stream`. 2. In single iteration of the bg thread, we only process two non-eligible tables across all namespaces. This processing limit is configurable and we are reusing the existing flag `cdcsdk_table_processing_limit_per_run`. Additionally, in the tablet split codepath, before adding cdc state entries for children tables, we will now check if the table is a non-eligible table for CDC stream or not. This also helps in preventing a race condition when a tablet of a non-eligible is split and concurrently, there was a master restart/leadership changes and we are trying to remove the table from stream metadata. Jira: DB-11778, DB-11733, DB-11676 Test Plan: Jenkins: urgent ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromNonConsistentSnapshotCDCStream ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromConsistentSnapshotCDCStream ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToNonConsistentSnapshotStream ./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToConsistentSnapshotStream Reviewers: asrinivasan, stiwary, skumar Reviewed By: stiwary Subscribers: ycdcxcluster Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D36031
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
2.20.4.1_blocker
2.20.5_blocker
2024.1.1_blocker
area/cdcsdk
CDC SDK
jira-originated
kind/bug
This issue is a bug
priority/highest
Highest priority issue
Jira Link: DB-11733
There is a bug where non-user tables, such as indexes created after the CDC stream creation, are incorrectly added to the CDC stream metadata. Their corresponding tablet entries are also added to the CDC state table. Consequently, if any tablets of these non-user table splits, their child tablets are also added to the CDC state table unless the non-user table is removed from the CDC stream metadata.
The text was updated successfully, but these errors were encountered: