[CDCSDK] CDC nemesis case fails with org.apache.kafka.connect.errors.ConnectException: Unable to obtain valid replication slot. #21780

Closed
1 task done
shamanthchandra-yb opened this issue Apr 2, 2024 · 0 comments
Assignees
Labels
area/cdcsdk CDC SDK kind/bug This issue is a bug priority/high High Priority

shamanthchandra-yb commented Apr 2, 2024

Jira Link: DB-10655

Description

The link to the stress run is in the Jira description.

A nemesis was triggered during the run, and we observed the following error at that time:

ERROR: Shutdown connection

However, after that, we saw the following sequence of log messages:

2024-04-01 18:33:12,311 INFO   Postgres|ybconnector_cdc_1822a3_test_cdc_abc95e|postgres-connector-task  Unable to find confirmed_flushed_lsn, falling back to restart_lsn   [io.debezium.connector.postgresql.connection.PostgresConnection]
2024-04-01 18:33:12,311 WARN   Postgres|ybconnector_cdc_1822a3_test_cdc_abc95e|postgres-connector-task  Cannot obtain valid replication slot 'rs_cdc_1822a3_7be8_from_con' for plugin 'pgoutput' and database 'cdc_1822a3' [during attempt 90 out of 90, concurrent tx probably blocks taking snapshot.   [io.debezium.connector.postgresql.connection.PostgresConnection]
2024-04-01 18:33:12,367 INFO   Postgres|ybconnector_cdc_1822a3_test_cdc_f56770|postgres-connector-task  Failed to obtain valid replication slot, confirmed flush lsn is null   [io.debezium.connector.postgresql.connection.PostgresConnection]

and the task finally failed with:

2024-04-01 18:33:14,068 ERROR  ||  WorkerSourceTask{id=ybconnector_cdc_1822a3_test_cdc_9b79d0-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted   [org.apache.kafka.connect.runtime.WorkerTask]
org.apache.kafka.connect.errors.ConnectException: Unable to obtain valid replication slot. Make sure there are no long-running transactions running in parallel as they may hinder the allocation of the replication slot when starting this connector
        at io.debezium.connector.postgresql.connection.PostgresConnection.readReplicationSlotInfo(PostgresConnection.java:358)
        at io.debezium.connector.postgresql.connection.PostgresConnection.getReplicationSlotState(PostgresConnection.java:281)
        at io.debezium.connector.postgresql.PostgresConnectorTask.start(PostgresConnectorTask.java:136)
        at io.debezium.connector.common.BaseSourceTask.startIfNeededAndPossible(BaseSourceTask.java:268)
        at io.debezium.connector.common.BaseSourceTask.poll(BaseSourceTask.java:178)
        at org.apache.kafka.connect.runtime.AbstractWorkerSourceTask.poll(AbstractWorkerSourceTask.java:469)
        at org.apache.kafka.connect.runtime.AbstractWorkerSourceTask.execute(AbstractWorkerSourceTask.java:357)
        at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:204)
        at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:259)
        at org.apache.kafka.connect.runtime.AbstractWorkerSourceTask.run(AbstractWorkerSourceTask.java:77)
        at org.apache.kafka.connect.runtime.isolation.Plugins.lambda$withClassLoader$1(Plugins.java:236)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
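
For triaging similar runs, the WARN line shown earlier can be picked out of a connector log with a small script. This is a minimal sketch, not part of the connector or the test framework: the names `SLOT_WARN`, `SlotFailure`, and `parse_slot_failure` are hypothetical, and the pattern is derived only from the log excerpt in this issue, so other Debezium versions may word the message differently.

```python
import re
from typing import NamedTuple, Optional

# Pattern derived from the WARN line in this issue's log excerpt (an assumption:
# the wording may differ in other connector versions).
SLOT_WARN = re.compile(
    r"Cannot obtain valid replication slot '(?P<slot>[^']+)' "
    r"for plugin '(?P<plugin>[^']+)' and database '(?P<database>[^']+)' "
    r"\[during attempt (?P<attempt>\d+) out of (?P<max_attempts>\d+)"
)


class SlotFailure(NamedTuple):
    """Hypothetical record for one 'Cannot obtain valid replication slot' line."""
    slot: str
    plugin: str
    database: str
    attempt: int
    max_attempts: int

    @property
    def exhausted(self) -> bool:
        # True when the logged attempt was the last one allowed.
        return self.attempt >= self.max_attempts


def parse_slot_failure(line: str) -> Optional[SlotFailure]:
    """Return the parsed failure, or None if the line is not a slot WARN line."""
    match = SLOT_WARN.search(line)
    if match is None:
        return None
    return SlotFailure(
        slot=match["slot"],
        plugin=match["plugin"],
        database=match["database"],
        attempt=int(match["attempt"]),
        max_attempts=int(match["max_attempts"]),
    )
```

Filtering a log for entries where `exhausted` is true would list the slots whose retries ran out, which is the state that precedes the ConnectException above.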

Profile (5)

Source connector version

fourpointfour/ybdb-debezium:0.2

Connector configuration

add connector connector_name='ybconnector_cdc_1822a3_test_cdc_9b79d0' stream_id='rs_cdc_1822a3_e22a' db_name='cdc_1822a3' connector_host='172.151.20.132' table_list=['test_cdc_9b79d0']

{'name': 'ybconnector_cdc_1822a3_test_cdc_9b79d0',
 'config': {
   'database.master.addresses': '172.151.29.64:7100,172.151.25.56:7100,172.151.24.69:7100',
   'database.hostname': '172.151.29.64:5433,172.151.25.56:5433,172.151.24.69:5433',
   'database.port': 5433,
   'database.masterhost': '172.151.25.56',
   'database.masterport': '7100',
   'database.user': 'yugabyte',
   'database.password': 'yugabyte',
   'database.dbname': 'cdc_1822a3',
   'snapshot.mode': 'initial',
   'admin.operation.timeout.ms': 600000,
   'socket.read.timeout.ms': 300000,
   'max.connector.retries': '10',
   'operation.timeout.ms': 600000,
   'topic.creation.default.compression.type': 'lz4',
   'topic.creation.default.cleanup.policy': 'delete',
   'topic.creation.default.partitions': 2,
   'topic.creation.default.replication.factor': '1',
   'tasks.max': '5',
   'connector.class': 'io.debezium.connector.postgresql.PostgresConnector',
   'topic.prefix': 'ybconnector_cdc_1822a3_test_cdc_9b79d0',
   'plugin.name': 'pgoutput',
   'slot.name': 'rs_cdc_1822a3_e22a_from_con',
   'publication.autocreate.mode': 'filtered',
   'publication.name': 'pn_ybconnector_cdc_1822a3_test_cdc_9b79d0',
   'table.include.list': 'public.test_cdc_9b79d0'}}
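
For anyone reproducing this outside the stress framework, a configuration of this shape is normally submitted to Kafka Connect through its REST API (`POST /connectors`). The sketch below only builds the request. It assumes the Connect worker listens on port 8083 on the connector host, which the issue does not state, and `build_register_request` is a hypothetical helper name. The config shown is an abbreviated subset of the one reported above.

```python
import json
import urllib.request


def build_register_request(connect_url: str, name: str, config: dict) -> urllib.request.Request:
    """Build (but do not send) a Kafka Connect 'create connector' request."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{connect_url.rstrip('/')}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Abbreviated subset of the configuration reported in this issue.
config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.dbname": "cdc_1822a3",
    "plugin.name": "pgoutput",
    "slot.name": "rs_cdc_1822a3_e22a_from_con",
    "publication.name": "pn_ybconnector_cdc_1822a3_test_cdc_9b79d0",
    "table.include.list": "public.test_cdc_9b79d0",
    "snapshot.mode": "initial",
}

# Port 8083 is an assumption (the usual Kafka Connect default), not taken from the issue.
request = build_register_request(
    "http://172.151.20.132:8083",
    "ybconnector_cdc_1822a3_test_cdc_9b79d0",
    config,
)
# urllib.request.urlopen(request) would submit it; omitted so the sketch runs without a cluster.
```

Building the request separately from sending it keeps the payload easy to inspect when comparing what was submitted against the slot name seen in the connector logs.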

YugabyteDB version

2.23.0.0-b86

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@shamanthchandra-yb shamanthchandra-yb added priority/high High Priority area/cdcsdk CDC SDK status/awaiting-triage Issue awaiting triage labels Apr 2, 2024
@yugabyte-ci yugabyte-ci added the kind/bug This issue is a bug label Apr 2, 2024
@yugabyte-ci yugabyte-ci assigned dr0pdb and unassigned asrinivasanyb Apr 8, 2024
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Apr 8, 2024
dr0pdb added a commit that referenced this issue Apr 11, 2024
Summary:
This revision adds more logs in various phases for ease of debugging of stress runs for CDC. Most of them are VLOGs. The only INFO log is the Walsender
startup log which is a one-time log and shouldn't have any performance implications.
Jira: DB-10655

Test Plan:
Jenkins: test regex: .*ReplicationSlot.*

Ran tests and checked log manually.

Reviewers: asrinivasan

Reviewed By: asrinivasan

Subscribers: ybase, ycdcxcluster, yql, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D33989
dr0pdb added a commit that referenced this issue Apr 23, 2024
…er fixes

Summary:
##### Backport Description
Clean merges. No conflicts.

##### Original Description
Original commits:
6bd88e6 / D33989
868d626 / D34162
ab43084 / D34320

###### [#21780] YSQL: Introduce more debug logs for better debuggability
This revision adds more logs in various phases for ease of debugging of stress runs for CDC. Most of them are VLOGs. The only INFO log is the Walsender
startup log which is a one-time log and shouldn't have any performance implications.
Jira: DB-10655

###### [#21519] YSQL: Skip RollbackToSubTransaction RPC to local tserver proxy if not using a distributed transaction
Before this revision, every RollbackToSubTransaction operation in PG would lead to a corresponding RPC call to the local tserver. The local tserver used to
return early in case there was no distributed transaction.

This revision adds the logic in the PG layer (pg_session) to skip sending the RPC if the transaction is read-only or a fast-path transaction i.e., has NON_TRANSACTIONAL isolation level. Note that we were already doing that for transaction commit/aborts but weren't skipping the RPC for rollback of sub-transaction.

This change was proposed as part of the implementation of the PG compatible logical replication support. While streaming the changes to the Walsender, it starts and aborts transactions for every transaction that gets streamed. This is required for reading PG catalog tables. As a result, we were seeing a lot of unnecessary RPC calls to the local tserver.
Jira: DB-10402
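
The skip condition described above can be illustrated with a small decision function. This is only a sketch of the rule as stated in the commit message, written in Python for brevity. The actual change lives in the PG layer (pg_session), and every name here (`IsolationLevel`, `TxnState`, `needs_rollback_rpc`) is hypothetical except NON_TRANSACTIONAL, which the commit message mentions.

```python
from dataclasses import dataclass
from enum import Enum, auto


class IsolationLevel(Enum):
    # NON_TRANSACTIONAL is named in the commit message; the other member is illustrative.
    NON_TRANSACTIONAL = auto()
    SNAPSHOT_ISOLATION = auto()


@dataclass
class TxnState:
    """Hypothetical view of the session's current transaction."""
    read_only: bool
    isolation: IsolationLevel


def needs_rollback_rpc(txn: TxnState) -> bool:
    """Return True only when a RollbackToSubTransaction RPC is worth sending."""
    if txn.read_only:
        # Read-only transactions have no distributed transaction to roll back.
        return False
    if txn.isolation is IsolationLevel.NON_TRANSACTIONAL:
        # Fast-path transactions are not distributed either.
        return False
    return True
```

Under this rule, the read-only transactions that the Walsender starts and aborts for catalog reads would return False, so no RPC reaches the local tserver for them.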

###### [#21652] YSQL: Add more debug logs in the ListReplicationSlots function for debugging
This revision introduces more VLOG statements in the ListReplicationSlots function in pg_client_service. This will aid us in debugging the issues observed
while reading the CDC state table from the tablet server.
Jira: DB-10546

Test Plan:
Jenkins: test regex: .*ReplicationSlot.*

Reviewers: asrinivasan

Reviewed By: asrinivasan

Subscribers: bogdan, yql, ycdcxcluster, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34410
@yugabyte-ci yugabyte-ci assigned siddharth2411 and unassigned dr0pdb May 10, 2024
ZhenYongFan pushed a commit to ZhenYongFan/yugabyte-db that referenced this issue Jun 15, 2024
…L: Backport misc walsender fixes
