[CDCSDK] [PG Parity] Consistency broke when run with nemesis: Received restart LSN 1 is less than the last received restart LSN 185267 #21950

Closed
1 task done
shamanthchandra-yb opened this issue Apr 12, 2024 · 0 comments
Assignees
Labels
area/cdcsdk (CDC SDK), kind/bug (This issue is a bug), priority/high (High Priority)

Comments

shamanthchandra-yb commented Apr 12, 2024

Jira Link: DB-10866

Description

Please refer to the Jira ticket for the stress run link.

Profile (8)

AssertionError: Some error, check test log. Reasons: ["Total sum is not as expected, consistency broke! Expected=1000000, Actual=998011. Thread: 0"]
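The assertion above reports a total that does not match the expected value. The test's implementation is not shown in this issue, so the following is only a minimal sketch of how such a sum-invariant check might work, assuming a workload that moves value between rows without changing the overall total. The event shape and the names `replay` and `check_total` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical shape of a change event: (row key, new value after the update).
Event = Tuple[int, int]


def replay(initial: Dict[int, int], events: Iterable[Event]) -> Dict[int, int]:
    """Apply change events, in order, to a copy of the initial state."""
    state = dict(initial)
    for key, new_value in events:
        state[key] = new_value
    return state


def check_total(state: Dict[int, int], expected_total: int) -> List[str]:
    """Return a list of failure reasons; empty if the invariant holds."""
    actual = sum(state.values())
    if actual != expected_total:
        return [
            "Total sum is not as expected, consistency broke! "
            f"Expected={expected_total}, Actual={actual}."
        ]
    return []


# A transfer of 30 from row 1 to row 2 produces two update events.
initial = {1: 100, 2: 100}
events = [(1, 70), (2, 130)]

all_delivered = check_total(replay(initial, events), expected_total=200)
one_lost = check_total(replay(initial, events[:1]), expected_total=200)
```

With every event delivered the check passes. If the second event is never delivered, the replayed total falls short and the check reports a reason string similar to the one in the assertion.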

Connector log ERROR:

2024-04-18 12:00:14,997 ERROR  Postgres|db_cdc|streaming  Producer failure   [io.debezium.pipeline.ErrorHandler]
com.yugabyte.util.PSQLException: ERROR: UpdateAndPersistLSN failed for stream_id: a191317acc156086df4e3c78531e1487: Received restart LSN 1 is less than the last received restart LSN 185267
	at com.yugabyte.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2675)
	at com.yugabyte.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:1263)
	at com.yugabyte.core.v3.QueryExecutorImpl.readFromCopy(QueryExecutorImpl.java:1163)
	at com.yugabyte.core.v3.CopyDualImpl.readFromCopy(CopyDualImpl.java:44)
	at com.yugabyte.core.v3.replication.V3PGReplicationStream.receiveNextData(V3PGReplicationStream.java:160)
	at com.yugabyte.core.v3.replication.V3PGReplicationStream.readInternal(V3PGReplicationStream.java:125)
	at com.yugabyte.core.v3.replication.V3PGReplicationStream.readPending(V3PGReplicationStream.java:82)
	at io.debezium.connector.postgresql.connection.PostgresReplicationConnection$1.readPending(PostgresReplicationConnection.java:622)
	at io.debezium.connector.postgresql.PostgresStreamingChangeEventSource.processMessages(PostgresStreamingChangeEventSource.java:221)
	at io.debezium.connector.postgresql.PostgresStreamingChangeEventSource.execute(PostgresStreamingChangeEventSource.java:183)
	at io.debezium.connector.postgresql.PostgresStreamingChangeEventSource.execute(PostgresStreamingChangeEventSource.java:37)
	at io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:271)
	at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:194)
	at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:137)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
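The error message in this trace reads like a monotonicity guard: a restart LSN is rejected when it is lower than the one received before it. The server-side code is not shown in this issue, so the sketch below only illustrates that kind of guard. The class `SlotState`, its method, and the exception type are hypothetical names; only the message text is taken from the log.

```python
class RestartLsnRegression(Exception):
    """Raised when a restart LSN moves backwards (hypothetical exception type)."""


class SlotState:
    """Toy model of per-stream state that only accepts non-decreasing restart LSNs."""

    def __init__(self, stream_id: str) -> None:
        self.stream_id = stream_id
        self.last_restart_lsn = 0

    def update_and_persist_lsn(self, restart_lsn: int) -> None:
        # Reject any value lower than the last accepted one.
        if restart_lsn < self.last_restart_lsn:
            raise RestartLsnRegression(
                f"UpdateAndPersistLSN failed for stream_id: {self.stream_id}: "
                f"Received restart LSN {restart_lsn} is less than the last "
                f"received restart LSN {self.last_restart_lsn}"
            )
        self.last_restart_lsn = restart_lsn


slot = SlotState("a191317acc156086df4e3c78531e1487")
slot.update_and_persist_lsn(185267)
try:
    slot.update_and_persist_lsn(1)
    message = ""
except RestartLsnRegression as exc:
    message = str(exc)
```

In this model the rejected update leaves the stored value untouched, so a client that keeps sending the lower value fails repeatedly until it computes a correct one.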

Source connector version

fourpointfour/ybdb-debezium:0.3

Connector configuration

add connector connector_name='yugabyte_connector' stream_id='8ca0f8fd4bd4a48b82496f61ad154960' db_name='cdc_648b7a' connector_host='172.151.26.0' table_list=['test_cdc_f61ed2_0', 'test_cdc_f61ed2_1', 'test_cdc_f61ed2_2', 'test_cdc_f61ed2_3', 'test_cdc_f61ed2_4']
{'name': 'yugabyte_connector',
 'config': {
  'database.master.addresses': '172.151.29.120:7100,172.151.23.105:7100,172.151.31.178:7100',
  'database.hostname': '172.151.29.120:5433,172.151.23.105:5433,172.151.31.178:5433',
  'database.port': 5433,
  'database.masterhost': '172.151.23.105',
  'database.masterport': '7100',
  'database.user': 'yugabyte',
  'database.password': 'yugabyte',
  'database.dbname': 'cdc_648b7a',
  'snapshot.mode': 'never',
  'admin.operation.timeout.ms': 600000,
  'socket.read.timeout.ms': 300000,
  'max.connector.retries': '10',
  'operation.timeout.ms': 600000,
  'topic.creation.default.compression.type': 'lz4',
  'topic.creation.default.cleanup.policy': 'delete',
  'topic.creation.default.partitions': 1,
  'topic.creation.default.replication.factor': '1',
  'tasks.max': '1',
  'connector.class': 'io.debezium.connector.postgresql.PostgresConnector',
  'topic.prefix': 'db_cdc',
  'plugin.name': 'pgoutput',
  'slot.name': '8ca0f8fd4bd4a48b82496f61ad154960_from_con',
  'publication.autocreate.mode': 'filtered',
  'publication.name': 'pn_yugabyte_connector',
  'table.include.list': 'public.test_cdc_f61ed2_0,public.test_cdc_f61ed2_1,public.test_cdc_f61ed2_2,public.test_cdc_f61ed2_3,public.test_cdc_f61ed2_4',
  'transforms': 'Reroute',
  'transforms.Reroute.topic.regex': '(.*)',
  'transforms.Reroute.topic.replacement': 'db_cdc_all_events',
  'transforms.Reroute.type': 'io.debezium.transforms.ByLogicalTableRouter',
  'transforms.Reroute.key.field.regex': 'db_cdc(.*)',
  'transforms.Reroute.key.field.replacement': '$1',
  'max.poll.interval.ms': '5000',
  'transaction.ordering': 'true',
  'provide.transaction.metadata': 'true'}}

YugabyteDB version

2.23.0.0-b131

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
yugabyte-ci added the kind/bug, priority/medium, and priority/high labels and removed the priority/medium label on Apr 12, 2024
shamanthchandra-yb changed the title from "[CDCSDK] [PG Parity] Consistency broke when run with nemesis" to "[CDCSDK] [PG Parity] Consistency broke when run with nemesis: Received restart LSN 1 is less than the last received restart LSN 185267" on Apr 18, 2024
yugabyte-ci assigned dr0pdb and unassigned asrinivasanyb on Apr 18, 2024
dr0pdb added a commit that referenced this issue Apr 23, 2024
…eipt of commit record

Summary:
We maintain a list of unacked transactions to calculate the restart_lsn from the confirmed_flush LSN. Prior to this revision, this list also included, at the end, a transaction for which we had not yet received the COMMIT record from the CDC service. Such a transaction was stored with commit_lsn set to InvalidXLogRecPtr (0), which led to errors in the calculation of the restart_lsn.

This revision updates the logic to put a transaction into the list only on the receipt of the commit record.
Jira: DB-10866

Test Plan: Jenkins: test regex: .*ReplicationSlot.*

Reviewers: asrinivasan

Reviewed By: asrinivasan

Subscribers: ycdcxcluster, yql

Differential Revision: https://phorge.dev.yugabyte.com/D34327
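The commit summary above can be illustrated with a small model. This is a simplified sketch, not the walsender's actual algorithm: the rule used here (the restart LSN is the commit LSN of the last transaction acknowledged by confirmed_flush) is an assumption chosen to show why an entry with commit_lsn 0 is harmful, and the names `UnackedTxn`, `UnackedTxnTracker`, and `demo` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

INVALID_LSN = 0  # stands in for InvalidXLogRecPtr


@dataclass
class UnackedTxn:
    xid: int
    begin_lsn: int
    commit_lsn: int = INVALID_LSN


class UnackedTxnTracker:
    def __init__(self, add_on_begin: bool) -> None:
        # add_on_begin=True models the behaviour before the revision,
        # add_on_begin=False models the behaviour after it.
        self.add_on_begin = add_on_begin
        self.unacked: List[UnackedTxn] = []
        self._pending: Optional[UnackedTxn] = None

    def on_begin(self, xid: int, begin_lsn: int) -> None:
        self._pending = UnackedTxn(xid, begin_lsn)
        if self.add_on_begin:
            self.unacked.append(self._pending)  # commit_lsn is still 0 here

    def on_commit(self, commit_lsn: int) -> None:
        assert self._pending is not None, "COMMIT without BEGIN"
        self._pending.commit_lsn = commit_lsn
        if not self.add_on_begin:
            self.unacked.append(self._pending)  # listed only once committed
        self._pending = None

    def restart_lsn(self, confirmed_flush: int) -> int:
        # Assumed rule: walk the list in order and remember the commit LSN
        # of the last transaction at or below confirmed_flush.
        restart = INVALID_LSN
        for txn in self.unacked:
            if txn.commit_lsn > confirmed_flush:
                break
            restart = txn.commit_lsn
        return restart


def demo(add_on_begin: bool) -> int:
    tracker = UnackedTxnTracker(add_on_begin)
    tracker.on_begin(xid=1, begin_lsn=10)
    tracker.on_commit(20)
    tracker.on_begin(xid=2, begin_lsn=21)
    tracker.on_commit(30)
    tracker.on_begin(xid=3, begin_lsn=31)  # COMMIT not yet received
    return tracker.restart_lsn(confirmed_flush=30)
```

In this model, `demo(True)` treats the pending transaction as acknowledged because 0 is below any confirmed_flush value, and the computed restart LSN drops back to 0. `demo(False)` never sees the pending transaction and returns 30.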
yugabyte-ci removed the status/awaiting-triage (Issue awaiting triage) label on Apr 23, 2024
dr0pdb added a commit that referenced this issue Apr 29, 2024
…bug logs and unacked txn fix

Summary:
##### Backport Description
All merges were clean. No conflicts.

##### Original Description
Original commits:
1b902ba / D34327
48a0279 / D34483
5e24eff / D34530

###### YSQL: Insert transaction into unacked txn list only upon receipt of commit record
We maintain a list of unacked transactions to calculate the restart_lsn from the confirmed_flush LSN. Prior to this revision, this list also included, at the end, a transaction for which we had not yet received the COMMIT record from the CDC service. Such a transaction was stored with commit_lsn set to InvalidXLogRecPtr (0), which led to errors in the calculation of the restart_lsn.

This revision updates the logic to put a transaction into the list only on the receipt of the commit record.
Jira: DB-10866

###### YSQL: Log the time taken in converting from QLValuePB to PG datum in walsender
The walsender spends a considerable amount of time converting QLValuePB to PG datum in ybc_pggate. Add a VLOG(1) that logs the time taken by this operation.
Jira: DB-11059

###### YSQL: Log the time taken in yb_decode and reorder buffer
This revision adds computation and logging of the time taken by the walsender in yb_decode and the reorder buffer while processing a single batch from the CDC service.
Jira: DB-11071

Test Plan: Jenkins: test regex: .*ReplicationSlot.*

Reviewers: asrinivasan, skumar

Reviewed By: skumar

Subscribers: yql, ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34535
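The two logging changes in the backport above measure the time spent in individual stages of batch processing. The real changes live in the walsender and use VLOG(1); the sketch below only illustrates the general pattern in Python, with debug-level logging as a rough stand-in. The helper `log_elapsed`, the function `process_batch`, and the placeholder work inside each stage are hypothetical.

```python
import logging
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List, Tuple

logger = logging.getLogger("walsender_sketch")


@contextmanager
def log_elapsed(stage: str, timings: Dict[str, float]) -> Iterator[None]:
    """Accumulate and log the wall-clock time, in microseconds, of one stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_us = (time.perf_counter() - start) * 1e6
        timings[stage] = timings.get(stage, 0.0) + elapsed_us
        logger.debug("%s took %.0f us", stage, elapsed_us)


def process_batch(records: List[str]) -> Tuple[List[str], Dict[str, float]]:
    """Process one batch and report how long each stage took."""
    timings: Dict[str, float] = {}
    with log_elapsed("yb_decode", timings):
        decoded = [r.upper() for r in records]  # placeholder for decoding
    with log_elapsed("reorderbuffer", timings):
        ordered = sorted(decoded)  # placeholder for reordering
    return ordered, timings
```

Accumulating per-stage totals for a batch keeps the log volume at one line per stage per batch rather than one line per record.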
svarnau pushed a commit that referenced this issue May 25, 2024
…eipt of commit record

Summary:
We maintain a list of unacked transactions to calculate the restart_lsn from the confirmed_flush LSN. Prior to this revision, this list also included, at the end, a transaction for which we had not yet received the COMMIT record from the CDC service. Such a transaction was stored with commit_lsn set to InvalidXLogRecPtr (0), which led to errors in the calculation of the restart_lsn.

This revision updates the logic to put a transaction into the list only on the receipt of the commit record.
Jira: DB-10866

Test Plan: Jenkins: test regex: .*ReplicationSlot.*

Reviewers: asrinivasan

Reviewed By: asrinivasan

Subscribers: ycdcxcluster, yql

Differential Revision: https://phorge.dev.yugabyte.com/D34327