[Bug] Flink CDC 2.3.0 set startupOptions = specificOffset set specificOffsetFile and specificOffsetPos then can not start from checkpoint #1944

lufzhangzitao · 2023-02-24T09:23:13Z

Search before asking

I searched in the issues and found nothing similar.

Flink version

1.16.0

Flink CDC version

2.3.0

Database and its version

MySQL 5.7

Minimal reproduce step

If I restart the job from a checkpoint, it fails to start:

2023-02-21 14:08:29,713 INFO com.ververica.cdc.connectors.mysql.debezium.task.context.StatefulTaskContext [] - MySQL current GTID set 106a4bb6-ec0d-11ec-a2d4-00163e279211:1-204182899,7aec1281-719c-11eb-afcf-00163e06a35c:1-147359662 does contain the GTID set 106a4bb6-ec0d-11ec-a2d4-00163e279211:203495054-204182173 required by the connector.
2023-02-21 14:08:29,801 INFO com.ververica.cdc.connectors.mysql.debezium.task.context.StatefulTaskContext [] - Server has already purged 106a4bb6-ec0d-11ec-a2d4-00163e279211:1-203495053,7aec1281-719c-11eb-afcf-00163e06a35c:1-147359662 GTIDs
2023-02-21 14:08:29,802 WARN com.ververica.cdc.connectors.mysql.debezium.task.context.StatefulTaskContext [] - Some of the GTIDs needed to replicate have been already purged
2023-02-21 14:08:29,803 ERROR org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager [] - Received uncaught exception.
java.lang.RuntimeException: SplitFetcher thread 0 received unexpected exception while polling the records
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:150) ~[flink-connector-files-1.16.1.jar:1.16.1]
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.run(SplitFetcher.java:105) [flink-connector-files-1.16.1.jar:1.16.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_362]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_362]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_362]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_362]
at java.lang.Thread.run(Thread.java:750) [?:1.8.0_362]
Caused by: java.lang.IllegalStateException: The connector is trying to read binlog starting at Struct{version=1.6.4.Final,connector=mysql,name=mysql_binlog_source,ts_ms=1676959709803,db=,server_id=0,file=mysql-bin.005893,pos=15052069,row=0}, but this is no longer available on the server. Reconfigure the connector to use a snapshot when needed.
at com.ververica.cdc.connectors.mysql.debezium.task.context.StatefulTaskContext.loadStartingOffsetState(StatefulTaskContext.java:194) ~[blob_p-384c1d850f997d73acba83fc72dc8cfb8bead162-07d02e9eebcf14d5793c784b1d89c6cc:?]
at com.ververica.cdc.connectors.mysql.debezium.task.context.StatefulTaskContext.configure(StatefulTaskContext.java:117) ~[blob_p-384c1d850f997d73acba83fc72dc8cfb8bead162-07d02e9eebcf14d5793c784b1d89c6cc:?]
at com.ververica.cdc.connectors.mysql.debezium.reader.BinlogSplitReader.submitSplit(BinlogSplitReader.java:103) ~[blob_p-384c1d850f997d73acba83fc72dc8cfb8bead162-07d02e9eebcf14d5793c784b1d89c6cc:?]
at com.ververica.cdc.connectors.mysql.debezium.reader.BinlogSplitReader.submitSplit(BinlogSplitReader.java:71) ~[blob_p-384c1d850f997d73acba83fc72dc8cfb8bead162-07d02e9eebcf14d5793c784b1d89c6cc:?]
at com.ververica.cdc.connectors.mysql.source.reader.MySqlSplitReader.checkSplitOrStartNext(MySqlSplitReader.java:159) ~[blob_p-384c1d850f997d73acba83fc72dc8cfb8bead162-07d02e9eebcf14d5793c784b1d89c6cc:?]
at com.ververica.cdc.connectors.mysql.source.reader.MySqlSplitReader.fetch(MySqlSplitReader.java:71) ~[blob_p-384c1d850f997d73acba83fc72dc8cfb8bead162-07d02e9eebcf14d5793c784b1d89c6cc:?]
at org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:58) ~[flink-connector-files-1.16.1.jar:1.16.1]
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:142) ~[flink-connector-files-1.16.1.jar:1.16.1]
... 6 more

GTIDs in the checkpoint: "file":"mysql-bin.005893","pos":"15052069","kind":"SPECIFIC","gtids":"106a4bb6-ec0d-11ec-a2d4-00163e279211:203495054-204182173"
GTIDs from show master status: 106a4bb6-ec0d-11ec-a2d4-00163e279211:1-204479617,
7aec1281-719c-11eb-afcf-00163e06a35c:1-147359662
show binary logs:

What did you expect to see?

After first starting with specificOffset, the job should be able to restart from a checkpoint.

What did you see instead?

After first starting with specificOffset, the job cannot restart from a checkpoint.

Anything else?

No response

Are you willing to submit a PR?

I'm willing to submit a PR!

wekin · 2023-03-03T09:58:07Z

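Besides the mysql-cdc package, which other packages need to be included? My environment is the same as yours, but mine reports that a Kafka class is missing: org.apache.kafka.connect.data.Struct. Which other JARs are in your lib directory?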

lufzhangzitao · 2023-03-06T01:53:47Z

@wekin flink-connector-kafka

wallkop · 2023-03-13T16:36:12Z

The bug is caused by an incomplete GTID set being saved in the new checkpoint after the job starts from a specificOffset. In this issue, the GTID set in the checkpoint is 106a4bb6-ec0d-11ec-a2d4-00163e279211:203495054-204182173, which clearly results from the user setting a specificOffset. Although the GTID set 106a4bb6-ec0d-11ec-a2d4-00163e279211:203495054-204182173 that the user set in specificOffset is not reasonable, I believe the program should be able to correct it automatically, so that the GTIDs written to the checkpoint become 106a4bb6-ec0d-11ec-a2d4-00163e279211:1-204182173, 7aec1281-719c-11eb-afcf-00163e06a35c:1-147359662.
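The correction described above can be sketched roughly as follows. This is only an illustration under assumptions: `parse_gtid_set`, `format_gtid_set`, and `fix_restored_gtids` are hypothetical helper names, not the connector's API and not necessarily what a fix would look like. The idea is that, for every UUID the server knows, the stored set is extended back to the start of the server's range and cut off at the last GTID the job has actually consumed.

```python
def parse_gtid_set(text):
    """Parse 'uuid:a-b:c-d,uuid2:e-f' into {uuid: [(start, end), ...]}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *ranges = part.split(":")
        intervals = result.setdefault(uuid, [])
        for r in ranges:
            start, _, end = r.partition("-")
            intervals.append((int(start), int(end or start)))
    return result


def format_gtid_set(gtids):
    """Render {uuid: [(start, end), ...]} back into MySQL's text form."""
    def fmt(s, e):
        return f":{s}" if s == e else f":{s}-{e}"
    return ",".join(
        uuid + "".join(fmt(s, e) for s, e in intervals)
        for uuid, intervals in sorted(gtids.items())
    )


def fix_restored_gtids(server, restored):
    """Hypothetical correction: fill in the GTIDs before the restored end."""
    fixed = {}
    for uuid, server_intervals in server.items():
        if uuid not in restored:
            # Assumption: a UUID missing from the offset is treated as consumed.
            fixed[uuid] = list(server_intervals)
            continue
        restored_end = max(end for _, end in restored[uuid])
        clipped = [(s, min(e, restored_end))
                   for s, e in server_intervals if s <= restored_end]
        if clipped:
            fixed[uuid] = clipped
    # Keep UUIDs that appear only in the restored offset.
    for uuid, intervals in restored.items():
        fixed.setdefault(uuid, list(intervals))
    return fixed


server = parse_gtid_set(
    "106a4bb6-ec0d-11ec-a2d4-00163e279211:1-204479617,"
    "7aec1281-719c-11eb-afcf-00163e06a35c:1-147359662"
)
restored = parse_gtid_set(
    "106a4bb6-ec0d-11ec-a2d4-00163e279211:203495054-204182173"
)
fixed_text = format_gtid_set(fix_restored_gtids(server, restored))
print(fixed_text)
```

With the values from this issue, the sketch yields the corrected set quoted above: 1-204182173 for the first UUID and the full 1-147359662 range for the second.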

To find the cause of this problem, we can walk through how CDC handles GTIDs:

1. Obtain the available GTIDs, i.e., the output of show master status.

106a4bb6-ec0d-11ec-a2d4-00163e279211:1-204479617,
7aec1281-719c-11eb-afcf-00163e06a35c:1-147359662

2. Obtain the checkpoint GTIDs.

106a4bb6-ec0d-11ec-a2d4-00163e279211:203495054-204182173

3. Obtain the purged GTIDs, i.e., @@global.gtid_purged.

106a4bb6-ec0d-11ec-a2d4-00163e279211:1-203495053,
7aec1281-719c-11eb-afcf-00163e06a35c:1-147359662

4. Obtain the GTIDs to replicate, i.e., the available GTIDs minus the checkpoint GTIDs.

106a4bb6-ec0d-11ec-a2d4-00163e279211:1-203495053,
106a4bb6-ec0d-11ec-a2d4-00163e279211:204182174-204479617,
7aec1281-719c-11eb-afcf-00163e06a35c:1-147359662

5. Obtain the non-purged GTIDs to replicate, i.e., the GTIDs to replicate minus the purged GTIDs.

106a4bb6-ec0d-11ec-a2d4-00163e279211:204182174-204479617

6. Finally, compare the GTIDs to replicate with the non-purged GTIDs to replicate. If they differ, some of the GTIDs that still need to be synchronized are considered to have been purged. In this issue they are clearly not equal, because the GTIDs recorded in the checkpoint are incomplete: only 106a4bb6-ec0d-11ec-a2d4-00163e279211:203495054-204182173 is considered to have been consumed, so the following GTID ranges are treated as still needing synchronization:

106a4bb6-ec0d-11ec-a2d4-00163e279211:1-203495053,
106a4bb6-ec0d-11ec-a2d4-00163e279211:204182174-204479617,
7aec1281-719c-11eb-afcf-00163e06a35c:1-147359662

This is not what we expect. In this scenario, the only GTID range the user expects to be synchronized is:

106a4bb6-ec0d-11ec-a2d4-00163e279211:204182174-204479617
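Steps 1 to 6 above can be reproduced with a small, self-contained sketch. It is only an illustration of the set arithmetic using the values from this issue; `parse_gtid_set` and `subtract` are hypothetical helpers written for this example, not the connector's actual classes.

```python
def parse_gtid_set(text):
    """Parse 'uuid:a-b:c-d,uuid2:e-f' into {uuid: [(start, end), ...]}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *ranges = part.split(":")
        intervals = result.setdefault(uuid, [])
        for r in ranges:
            start, _, end = r.partition("-")
            intervals.append((int(start), int(end or start)))
    return result


def subtract(a, b):
    """Return the GTIDs in a that are not in b (interval difference per UUID)."""
    out = {}
    for uuid, intervals in a.items():
        remaining = []
        for start, end in intervals:
            pieces = [(start, end)]
            for b_start, b_end in b.get(uuid, []):
                next_pieces = []
                for s, e in pieces:
                    if b_end < s or b_start > e:
                        next_pieces.append((s, e))  # no overlap, keep as is
                        continue
                    if s < b_start:
                        next_pieces.append((s, b_start - 1))  # part before b
                    if e > b_end:
                        next_pieces.append((b_end + 1, e))  # part after b
                pieces = next_pieces
            remaining.extend(pieces)
        if remaining:
            out[uuid] = sorted(remaining)
    return out


A = "106a4bb6-ec0d-11ec-a2d4-00163e279211"
B = "7aec1281-719c-11eb-afcf-00163e06a35c"

available = parse_gtid_set(f"{A}:1-204479617,{B}:1-147359662")   # step 1
checkpoint = parse_gtid_set(f"{A}:203495054-204182173")           # step 2
purged = parse_gtid_set(f"{A}:1-203495053,{B}:1-147359662")       # step 3

to_replicate = subtract(available, checkpoint)                    # step 4
non_purged = subtract(to_replicate, purged)                       # step 5

# Step 6: any difference is reported as "needed GTIDs already purged".
purge_detected = to_replicate != non_purged
print(to_replicate)
print(non_purged)
print(purge_detected)
```

Because the checkpoint set does not cover 1-203495053 or the second UUID, those ranges show up in `to_replicate`, disappear after subtracting the purged set, and the comparison in step 6 fails even though the user never needed them.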

Hi @PatrickRen, could you pay attention to this issue? If you don't have time, I am willing to submit a PR to fix it.

ruanhang1993 · 2023-03-21T03:34:44Z

@wallkop Thanks a lot for the analysis. I found that starting from the earliest offset has this problem too.

ruanhang1993 · 2023-03-21T03:35:59Z

@wallkop Please feel free to fix this issue. I have assigned it to you. Thanks ~

wallkop · 2023-03-24T03:21:21Z

Got it, thanks.

wallkop · 2023-04-02T08:51:26Z

@ruanhang1993 Hi, I have submitted a PR and would appreciate it if you could review it when you have time.

PR: #2220

lufzhangzitao added the bug label Feb 24, 2023

ruanhang1993 assigned wallkop Mar 21, 2023

leonardBang added this to the V2.4.0 milestone Mar 21, 2023

wallkop mentioned this issue Apr 1, 2023

[mysql] Fix issue #1944: Initialize complete GTIDs to ensure subsequent recovery from checkpoint #2063

Closed

wallkop mentioned this issue Jun 17, 2023

[mysql] Fix issue #1944: Fix GTIDs on startup to correctly recover from checkpoint #2220

Merged

PatrickRen closed this as completed in #2220 Jun 20, 2023
