[CDCSDK] [PG Parity] Memory Leakage in Walsender Process and a node where connector is deployed goes OOM (Unreachable) #21530
shamanthchandra-yb added the priority/high (High Priority), area/cdcsdk (CDC SDK), and status/awaiting-triage (Issue awaiting triage) labels on Mar 15, 2024
I looked at the test logs; there is a definite increase in the walsender memory usage.
dr0pdb added a commit that referenced this issue on Mar 18, 2024
Summary: The walsender while streaming the records to the client fetches the records in a batch from the CDC service. Once a batch is fully streamed, it is freed and a new batch is fetched again. The logic to free was a simple `pfree` call which wasn't completely freeing the previous batch. This revision fixes the freeing logic by doing a deep free of all the palloc'd members.
Jira: DB-10412
Test Plan: Ran a manual load test with a Java application and noted the memory usage.
Reviewers: asrinivasan, skumar
Reviewed By: asrinivasan
Subscribers: ycdcxcluster, yql
Differential Revision: https://phorge.dev.yugabyte.com/D33246
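The difference between a single free call and a deep free can be sketched in plain C. This is an illustrative sketch only: `RecordBatch`, `Record`, `Column`, and every function name below are hypothetical stand-ins rather than the walsender's actual types, and `malloc`/`free` are used in place of `palloc`/`pfree` so that the block compiles on its own. A counter of live allocations makes the leak visible.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the walsender's batch structures. */
typedef struct Column {
    char *name;   /* separately allocated string */
    char *value;  /* separately allocated string */
} Column;

typedef struct Record {
    int     ncols;
    Column *cols;     /* separately allocated array */
} Record;

typedef struct RecordBatch {
    int     nrecords;
    Record *records;  /* separately allocated array */
} RecordBatch;

/* Number of allocations not yet freed; used only to make leaks visible. */
static size_t live_allocs = 0;

void *tracked_alloc(size_t size)
{
    void *p = malloc(size);
    assert(p != NULL);
    live_allocs++;
    return p;
}

void tracked_free(void *p)
{
    if (p != NULL) {
        live_allocs--;
        free(p);
    }
}

char *tracked_strdup(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = tracked_alloc(len);
    memcpy(copy, s, len);
    return copy;
}

/* Build a batch in which every member is its own allocation. */
RecordBatch *build_batch(int nrecords, int ncols)
{
    assert(nrecords > 0 && ncols > 0);
    RecordBatch *batch = tracked_alloc(sizeof(RecordBatch));
    batch->nrecords = nrecords;
    batch->records = tracked_alloc(sizeof(Record) * (size_t) nrecords);
    for (int i = 0; i < nrecords; i++) {
        Record *rec = &batch->records[i];
        rec->ncols = ncols;
        rec->cols = tracked_alloc(sizeof(Column) * (size_t) ncols);
        for (int j = 0; j < ncols; j++) {
            rec->cols[j].name = tracked_strdup("col");
            rec->cols[j].value = tracked_strdup("val");
        }
    }
    return batch;
}

/* Shallow free: releases only the top-level struct and leaks the rest. */
void shallow_free_batch(RecordBatch *batch)
{
    tracked_free(batch);
}

/* Deep free: releases every member bottom-up, then the batch itself. */
void deep_free_batch(RecordBatch *batch)
{
    if (batch == NULL)
        return;
    for (int i = 0; i < batch->nrecords; i++) {
        Record *rec = &batch->records[i];
        for (int j = 0; j < rec->ncols; j++) {
            tracked_free(rec->cols[j].name);
            tracked_free(rec->cols[j].value);
        }
        tracked_free(rec->cols);
    }
    tracked_free(batch->records);
    tracked_free(batch);
}
```

With this layout, a batch of 2 records with 3 columns each makes 16 allocations. `shallow_free_batch` releases one of them, whereas `deep_free_batch` brings the counter back to zero.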
asrinivasanyb pushed a commit to asrinivasanyb/yugabyte-db that referenced this issue on Mar 18, 2024
Summary: The walsender while streaming the records to the client fetches the records in a batch from the CDC service. Once a batch is fully streamed, it is freed and a new batch is fetched again. The logic to free was a simple `pfree` call which wasn't completely freeing the previous batch. This revision fixes the freeing logic by doing a deep free of all the palloc'd members.
Jira: DB-10412
Test Plan: Ran a manual load test with a Java application and noted the memory usage.
Reviewers: asrinivasan, skumar
Reviewed By: asrinivasan
Subscribers: ycdcxcluster, yql
Differential Revision: https://phorge.dev.yugabyte.com/D33246
dr0pdb added a commit that referenced this issue on Mar 18, 2024
…ontext in walsender
Summary: The walsender stores the begin/commit lsn of each unacked transaction in the unacked_transactions list in yb_virtual_wal_client. This was introduced in https://phorge.dev.yugabyte.com/D32941. This revision moves the allocations of the list to a separate memory context for ease of tracking. Also fixed a small memory leak in `YBCGetTables`.
Jira: DB-10412, DB-10291
Test Plan: No behavior changes.
Reviewers: asrinivasan, skumar
Reviewed By: asrinivasan
Subscribers: ycdcxcluster, yql
Differential Revision: https://phorge.dev.yugabyte.com/D33245
dr0pdb added a commit that referenced this issue on Mar 21, 2024
Summary: The walsender while streaming records to the client, fetches a batch of records from the CDC services and streams them one by one. Once a batch is streamed completely, it is freed and the next batch is fetched. A record batch contains the pg datum values representing the DML column values. They are not that easy to free. An easier way is to store them in a separate memory context and free it whenever needed. This revision does that. We now store the record batch in a separate memory context and reset it before making the next call to the CDC service.
Jira: DB-10412
Test Plan: Jenkins: test regex: .*ReplicationSlot.* Local load testing. No behavioural changes
Reviewers: asrinivasan, skumar
Reviewed By: asrinivasan
Subscribers: ycdcxcluster, yql
Differential Revision: https://phorge.dev.yugabyte.com/D33432
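The idea of keeping a batch in its own memory context and resetting it before the next fetch can be illustrated with a small bump allocator. This is a sketch under stated assumptions, not the revision's code: `Arena` and its functions are hypothetical names, and a fixed-size `malloc`'d region stands in for a PostgreSQL memory context (where a reset would go through `MemoryContextReset`). The source does not show which calls the actual change uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical fixed-size arena standing in for a dedicated memory context. */
typedef struct Arena {
    char  *base;      /* start of the backing region */
    size_t capacity;  /* total bytes available */
    size_t used;      /* bytes handed out since the last reset */
} Arena;

/* Returns 0 on success, -1 if the backing region cannot be allocated. */
int arena_init(Arena *a, size_t capacity)
{
    assert(a != NULL);
    a->base = malloc(capacity);
    if (a->base == NULL)
        return -1;
    a->capacity = capacity;
    a->used = 0;
    return 0;
}

/* Hands out 16-byte-aligned chunks; returns NULL when the arena is full. */
void *arena_alloc(Arena *a, size_t size)
{
    if (size > SIZE_MAX - 15)
        return NULL;
    size = (size + 15) & ~(size_t) 15;
    if (size > a->capacity - a->used)
        return NULL;
    void *p = a->base + a->used;
    a->used += size;
    return p;
}

/*
 * Releases everything allocated from the arena in one step. Individual
 * objects never need to be walked or freed, which is the point of storing
 * a hard-to-free batch in its own context.
 */
void arena_reset(Arena *a)
{
    a->used = 0;
}

void arena_destroy(Arena *a)
{
    free(a->base);
    a->base = NULL;
    a->capacity = 0;
    a->used = 0;
}
```

In this model, the streaming loop would call `arena_reset` just before requesting the next batch, so the memory held by the previous batch is reclaimed regardless of how deeply nested its values are.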
dr0pdb added a commit that referenced this issue on Apr 15, 2024
…walsender
Summary: Original commit: 9aea6ae / D33246
The walsender while streaming the records to the client fetches the records in a batch from the CDC service. Once a batch is fully streamed, it is freed and a new batch is fetched again. The logic to free was a simple `pfree` call which wasn't completely freeing the previous batch. This revision fixes the freeing logic by doing a deep free of all the palloc'd members.
Jira: DB-10412
Test Plan: Ran a manual load test with a Java application and noted the memory usage.
Reviewers: asrinivasan, skumar
Reviewed By: skumar
Subscribers: yql, ycdcxcluster
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D34106
dr0pdb added a commit that referenced this issue on Apr 16, 2024
…separate memory context in walsender
Summary:
#### Backport Description
No merge conflicts.
#### Original Description
Original commit: cc088c0 / D33245
The walsender stores the begin/commit lsn of each unacked transaction in the unacked_transactions list in yb_virtual_wal_client. This was introduced in https://phorge.dev.yugabyte.com/D32941. This revision moves the allocations of the list to a separate memory context for ease of tracking. Also fixed a small memory leak in `YBCGetTables`.
Jira: DB-10412, DB-10291
Test Plan: No behavior changes.
Reviewers: asrinivasan, skumar
Reviewed By: asrinivasan
Subscribers: yql, ycdcxcluster
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D34126
dr0pdb added a commit that referenced this issue on Apr 16, 2024
…xes for PG compatible logical replication support
Summary:
##### Backport Description
No merge conflicts.
##### Original Description
Original commits:
1604f1b / D33432
5c023f3 / D33156
2327b8d / D33538
###### [#21530] YSQL: Store cached record batch in a separate context
The walsender while streaming records to the client, fetches a batch of records from the CDC services and streams them one by one. Once a batch is streamed completely, it is freed and the next batch is fetched. A record batch contains the pg datum values representing the DML column values. They are not that easy to free. An easier way is to store them in a separate memory context and free it whenever needed. This revision does that. We now store the record batch in a separate memory context and reset it before making the next call to the CDC service.
Jira: DB-10412
###### [#21400] YSQL: Handle START_REPLICATION with start_lsn > confirmed_flush_lsn
In the `START_REPLICATION` command, the client can specify the LSN from which they want the records to be streamed from. This revision introduces support for cases when start_lsn > confirmed_flush_lsn. Before this, the value of start_lsn was being ignored and all records from confirmed_flush_lsn onwards were streamed in case of restarts. Now, the records which belong to a transaction whose commit_lsn < start_lsn are skipped in the decoding phase (yb_decode.c) similar to how it is done in vanilla PG (decode.c).
Jira: DB-10292
###### [#21651] YSQL: Handle RPC errors as warning while destroying virtual wal
Once the LogicalReplication ends either when a client ends it or due to an error, the walsender tries to cleanup the virtual wal instance associated with it. If there is any error in this cleanup logic, it is better to log it as a warning. This is because:
1. We already have a backup to clean the virtual wal instance
2. The error can mask an actual error due to which the cleanup was required. #21651 is one such instance.
This revision updates Walsender, to treat any errors received during the DestroyVirtualWALForCDC as a warning.
Jira: DB-10545
Test Plan: Jenkins: test regex: .*ReplicationSlot.* Local load testing. No behavioural changes
./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testStartLsnValues'
./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testWalsenderGracefulShutdownWithCDCServiceError'
Reviewers: asrinivasan, skumar, siddharth.shah
Reviewed By: siddharth.shah
Subscribers: yql, ycdcxcluster
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D34163
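The start_lsn rule described in the #21400 part of the commit message above (skip a transaction when its commit_lsn is below the requested start_lsn) can be written as a small predicate. The sketch below is only an illustration of that comparison: `Lsn`, `Txn`, `should_skip_txn`, and `count_streamed` are hypothetical names, and the actual logic in yb_decode.c is not shown in the source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical LSN type; a plain 64-bit integer is assumed here. */
typedef uint64_t Lsn;

/* Hypothetical per-transaction bookkeeping. */
typedef struct Txn {
    Lsn begin_lsn;
    Lsn commit_lsn;
} Txn;

/* A transaction is skipped when it committed before the requested start. */
bool should_skip_txn(Lsn commit_lsn, Lsn start_lsn)
{
    return commit_lsn < start_lsn;
}

/* Counts the transactions that would still be streamed to the client. */
size_t count_streamed(const Txn *txns, size_t n, Lsn start_lsn)
{
    assert(txns != NULL || n == 0);
    size_t streamed = 0;
    for (size_t i = 0; i < n; i++) {
        if (!should_skip_txn(txns[i].commit_lsn, start_lsn))
            streamed++;
    }
    return streamed;
}
```

Because the comparison is strict, a transaction whose commit_lsn equals start_lsn is still streamed in this sketch. The decision is made per transaction on its commit LSN, so records of a skipped transaction are dropped together rather than one by one.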
Jira Link: DB-10412
Description
While running one of the master cases, test_cdc_comply_advantage_without_tablet_splitting, with the PG connector, the node on which the connector is deployed goes down after a few iterations.
Please find runs in JIRA.
Every 2 minutes, the test runs pg_log_backend_memory_contexts and captures a sorted ps listing; both can be found in the test log.
Test log with walsender info
Source connector version
PG connector
Connector configuration
YugabyteDB version
2.21.1.0-b267
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information