CDC fails with Tablet Splitting is ON and new Table is added with same Stream ID #14846

Closed
shamanthchandra-yb opened this issue Nov 4, 2022 · 0 comments
Assignees
Labels
2.16.0_blocker 2.16.0 Release blocker defects area/cdc Change Data Capture area/cdcsdk CDC SDK kind/bug This issue is a bug priority/high High Priority qa_automation Bugs identified via itest-system, LST, Stress automation or causing automation failures

shamanthchandra-yb commented Nov 4, 2022

Jira Link: DB-4143

Description

Env Used:

Yugabyte Universe Build: 2.16.0.0-b20
quay.io/yugabyte/debezium-connector:1.9.5.y.1 
debezium/kafka:1.9 
debezium/zookeeper:1.9

Steps:

  1. Create a universe with below flags:
Master:
{"enable_automatic_tablet_splitting": "true",
"tablet_split_high_phase_shard_count_per_node": 10000,
"tablet_split_high_phase_size_threshold_bytes": 10485760,     # 10 MB
"tablet_split_low_phase_size_threshold_bytes": 1048576, # 1 MB
"tablet_split_low_phase_shard_count_per_node": 16,
"db_write_buffer_size":102400}

Tserver:
{"enable_automatic_tablet_splitting":"true",
"yb_num_shards_per_tserver":1,        # For YCQL
"ysql_num_shards_per_tserver":1,       # For YSQL
"db_write_buffer_size":102400}
  2. Ran the sample app.
  3. Set up a CDC pipeline (at the time of setup, the table already had about 10 tablets).
  4. Waited for about an hour, but no new tablets were added. Not sure why; I believe the load was enough to trigger a split.
  5. So, ran another sample app on the same database, which created a new table 'cdc_test2'.
  6. Quickly deployed a new connector for this table. Note: Used the earlier stream itself, as Vaibhav said in the above thread.
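
The flag sets in step 1 are written as JSON with inline `#` comments, which a strict JSON parser would reject. As a minimal sketch, the same values can be kept as Python dicts and rendered in the usual `--name=value` form. The helper name `to_gflag_args` is hypothetical and only illustrates the idea; how the flags actually reach the master and tserver processes depends on the deployment tooling.

```python
# Flag names and values are copied from step 1 of the report.
MASTER_FLAGS = {
    "enable_automatic_tablet_splitting": "true",
    "tablet_split_high_phase_shard_count_per_node": 10000,
    "tablet_split_high_phase_size_threshold_bytes": 10 * 1024 * 1024,  # 10 MB
    "tablet_split_low_phase_size_threshold_bytes": 1024 * 1024,        # 1 MB
    "tablet_split_low_phase_shard_count_per_node": 16,
    "db_write_buffer_size": 102400,
}

TSERVER_FLAGS = {
    "enable_automatic_tablet_splitting": "true",
    "yb_num_shards_per_tserver": 1,    # For YCQL
    "ysql_num_shards_per_tserver": 1,  # For YSQL
    "db_write_buffer_size": 102400,
}


def to_gflag_args(flags):
    """Hypothetical helper: render a flag dict as --name=value strings."""
    return [f"--{name}={value}" for name, value in flags.items()]


if __name__ == "__main__":
    print("master: ", " ".join(to_gflag_args(MASTER_FLAGS)))
    print("tserver:", " ".join(to_gflag_args(TSERVER_FLAGS)))
```

Writing the thresholds as arithmetic keeps the "10 MB" and "1 MB" annotations checkable against the byte values.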

Observation:

  1. Initially a few rows (65+) were loaded, but after that no new rows were seen in postgres for cdc_test2. The errors below were seen in the connect log.
  2. After quite some time, a few more rows were seen (about 9k+). Then it stopped again.
  3. The cdc_test table is still going fine, but we need to wait and see its behaviour after another tablet split happens. Currently the tablet count of both tables has reached 10.

The issues below were seen in the connect log:
1.

org.yb.client.CDCErrorException: Server[147f906737984fb2b9aca8bbf936bd63] INTERNAL_ERROR[code 21]: CDCSDK Trying to fetch already GCed intents for transaction 4db46e4b-ecf5-429c-a70f-2bfe55454bf7
        at org.yb.client.TabletClient.dispatchCDCErrorOrReturnException(TabletClient.java:506)
        at org.yb.client.TabletClient.decode(TabletClient.java:437)
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510)
        at io.netty.handler.codec.ReplayingDecoder.callDecode(ReplayingDecoder.java:366)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
2022-11-04 09:31:33,361 WARN   YugabyteDB|dbserver1|streaming  GetChangesCDCSDK got Errback    [org.yb.client.AsyncYBClient]
org.yb.client.CDCErrorException: Server[2949228ddd2e4048be5c7f2eb7f43d0f] NOT_FOUND[code 1]: LookupByIdRpc(tablet: 16b4fd4bdf1343c3a9ac3677977fa458, num_attempts: 2) failed: Tablet deleted: Not serving tablet deleted upon request at 2022-11-04 09:30:57 UTC
        at org.yb.client.TabletClient.dispatchCDCErrorOrReturnException(TabletClient.java:506)
        at org.yb.client.TabletClient.decode(TabletClient.java:437)
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510)
        at io.netty.handler.codec.ReplayingDecoder.callDecode(ReplayingDecoder.java:366)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
org.yb.client.CDCErrorException: Server[147f906737984fb2b9aca8bbf936bd63] LEADER_NOT_READY_TO_SERVE[code 23]: Not ready to serve
        at org.yb.client.TabletClient.dispatchCDCErrorOrReturnException(TabletClient.java:506)
        at org.yb.client.TabletClient.decode(TabletClient.java:437)
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510)
        at io.netty.handler.codec.ReplayingDecoder.callDecode(ReplayingDecoder.java:366)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
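
The three traces above share the same stack and differ only in their first line, so a short script can help when triaging a long connect log. The following is a rough sketch that pulls the server ID, status name, and code out of each `CDCErrorException` header line. The regular expression is inferred only from the lines shown in this report, and the function name `parse_cdc_errors` is hypothetical.

```python
import re

# Pattern inferred from the exception header lines quoted in this issue.
CDC_ERROR = re.compile(
    r"org\.yb\.client\.CDCErrorException: "
    r"Server\[(?P<server>[0-9a-f]+)\] "
    r"(?P<status>[A-Z_]+)\[code (?P<code>\d+)\]: (?P<message>.*)"
)


def parse_cdc_errors(log_text):
    """Return one dict per CDCErrorException header line found in log_text."""
    errors = []
    for line in log_text.splitlines():
        match = CDC_ERROR.search(line)
        if match:
            error = match.groupdict()
            error["code"] = int(error["code"])
            errors.append(error)
    return errors


# Header lines copied from the report, with one stack frame kept as noise.
SAMPLE = """\
org.yb.client.CDCErrorException: Server[147f906737984fb2b9aca8bbf936bd63] INTERNAL_ERROR[code 21]: CDCSDK Trying to fetch already GCed intents for transaction 4db46e4b-ecf5-429c-a70f-2bfe55454bf7
        at org.yb.client.TabletClient.decode(TabletClient.java:437)
org.yb.client.CDCErrorException: Server[2949228ddd2e4048be5c7f2eb7f43d0f] NOT_FOUND[code 1]: LookupByIdRpc(tablet: 16b4fd4bdf1343c3a9ac3677977fa458, num_attempts: 2) failed: Tablet deleted: Not serving tablet deleted upon request at 2022-11-04 09:30:57 UTC
org.yb.client.CDCErrorException: Server[147f906737984fb2b9aca8bbf936bd63] LEADER_NOT_READY_TO_SERVE[code 23]: Not ready to serve
"""

if __name__ == "__main__":
    for error in parse_cdc_errors(SAMPLE):
        print(error["status"], error["code"], error["server"])
```

In the sample, the `NOT_FOUND` entry with "Tablet deleted" is the one that matches the behaviour the later fix addresses: a parent tablet removed after a split while the stream still needed it.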

Kafka connect log:

connector_log.zip

More detailed steps performed can be found here: 2.16 CDC Tablet Split testcase

@shamanthchandra-yb shamanthchandra-yb added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Nov 4, 2022
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Nov 4, 2022
@shamanthchandra-yb shamanthchandra-yb added area/cdc Change Data Capture priority/high High Priority 2.16.0_blocker 2.16.0 Release blocker defects and removed area/docdb YugabyteDB core features priority/medium Medium priority issue labels Nov 4, 2022
@shamanthchandra-yb shamanthchandra-yb changed the title CDC fails with Tablet Splitting ON CDC fails with Tablet Splitting is ON and new Table is added with same Stream ID Nov 4, 2022
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Nov 4, 2022
adithya-kb pushed a commit that referenced this issue Nov 7, 2022
Summary:
Adds support for streaming changes through a CDCSDK stream when tablets of newly added tables are split.

In the method AddTabletEntriesToCDCSDKStreamsForNewTables, when we add new table details to the stream's metadata and the cdc_state table, we now also add the table to 'cdcsdk_tables_to_stream_map_'.
This ensures that tablets belonging to the new table are not deleted directly after a successful tablet split, and are instead hidden, as needed.

Test Plan:
Added ctests:
TestTabletSplitOnAddedTableForCDC
TestTabletSplitOnAddedTableForCDCWithMasterRestart

Reviewers: skumar, sdash

Reviewed By: sdash

Subscribers: bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D20841
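
The effect described in the commit summary above can be illustrated with a toy model. This is only a sketch and not YugabyteDB code: the class `ToyCatalog`, its methods, and the state names are hypothetical, and it assumes (based on the summary) that the split path decides between hiding and deleting a parent tablet by checking whether the tablet's table appears in a table-to-streams map playing the role of 'cdcsdk_tables_to_stream_map_'.

```python
from collections import defaultdict


class ToyCatalog:
    """Toy stand-in for the master-side bookkeeping; all names are hypothetical."""

    def __init__(self, register_new_tables_in_map):
        # True models the behaviour after the fix, False the behaviour before it.
        self.register_new_tables_in_map = register_new_tables_in_map
        # Plays the role of 'cdcsdk_tables_to_stream_map_': table -> stream IDs.
        self.tables_to_streams = defaultdict(set)
        # Plays the role of the stream metadata: stream ID -> tables.
        self.stream_tables = defaultdict(set)
        self.tablet_table = {}  # tablet ID -> table name
        self.tablet_state = {}  # tablet ID -> "RUNNING", "HIDDEN" or "DELETED"

    def create_tablet(self, table, tablet_id):
        self.tablet_table[tablet_id] = table
        self.tablet_state[tablet_id] = "RUNNING"

    def add_new_table_to_stream(self, table, stream_id):
        # The stream metadata is updated in both variants.
        self.stream_tables[stream_id].add(table)
        # Only the fixed variant also records the table in the map.
        if self.register_new_tables_in_map:
            self.tables_to_streams[table].add(stream_id)

    def split_tablet(self, parent, children):
        table = self.tablet_table[parent]
        for child in children:
            self.create_tablet(table, child)
        # Assumed rule: a parent is hidden only if its table is known to be streamed.
        streamed = bool(self.tables_to_streams.get(table))
        self.tablet_state[parent] = "HIDDEN" if streamed else "DELETED"
        return self.tablet_state[parent]


def run_scenario(register_new_tables_in_map):
    catalog = ToyCatalog(register_new_tables_in_map)
    catalog.create_tablet("cdc_test2", "parent")
    catalog.add_new_table_to_stream("cdc_test2", "stream-1")
    return catalog.split_tablet("parent", ["child-a", "child-b"])


if __name__ == "__main__":
    print("before fix:", run_scenario(False))  # DELETED
    print("after fix:", run_scenario(True))    # HIDDEN
```

In the "before" variant the stream metadata knows about the table but the map does not, so the parent tablet is deleted, which matches the "Tablet deleted" error in the connect log.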
adithya-kb pushed a commit that referenced this issue Nov 7, 2022
…y added tables

Summary:
Original commit:  / D20841
Adds support for streaming changes through a CDCSDK stream when tablets of newly added tables are split.

In the method AddTabletEntriesToCDCSDKStreamsForNewTables, when we add new table details to the stream's metadata and the cdc_state table, we now also add the table to 'cdcsdk_tables_to_stream_map_'.
This ensures that tablets belonging to the new table are not deleted directly after a successful tablet split, and are instead hidden, as needed.

Test Plan:
Added ctests:
TestTabletSplitOnAddedTableForCDC
TestTabletSplitOnAddedTableForCDCWithMasterRestart

Reviewers: skumar, sdash

Reviewed By: sdash

Subscribers: bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D20847
adithya-kb pushed a commit that referenced this issue Nov 7, 2022
…y added tables

Summary:
Original commit: db2b77a / D20841
Adds support for streaming changes through a CDCSDK stream when tablets of newly added tables are split.

In the method AddTabletEntriesToCDCSDKStreamsForNewTables, when we add new table details to the stream's metadata and the cdc_state table, we now also add the table to 'cdcsdk_tables_to_stream_map_'.
This ensures that tablets belonging to the new table are not deleted directly after a successful tablet split, and are instead hidden, as needed.

Test Plan:
Added ctests:
TestTabletSplitOnAddedTableForCDC
TestTabletSplitOnAddedTableForCDCWithMasterRestart

Reviewers: skumar, sdash

Reviewed By: sdash

Subscribers: bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D20848
jayant07-yb pushed a commit to jayant07-yb/yugabyte-db that referenced this issue Dec 7, 2022
… tables

Summary:
Adds support for streaming changes through a CDCSDK stream when tablets of newly added tables are split.

In the method AddTabletEntriesToCDCSDKStreamsForNewTables, when we add new table details to the stream's metadata and the cdc_state table, we now also add the table to 'cdcsdk_tables_to_stream_map_'.
This ensures that tablets belonging to the new table are not deleted directly after a successful tablet split, and are instead hidden, as needed.

Test Plan:
Added ctests:
TestTabletSplitOnAddedTableForCDC
TestTabletSplitOnAddedTableForCDCWithMasterRestart

Reviewers: skumar, sdash

Reviewed By: sdash

Subscribers: bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D20841
@shamanthchandra-yb shamanthchandra-yb added the qa_automation Bugs identified via itest-system, LST, Stress automation or causing automation failures label Feb 16, 2023