Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CDCSDK] [PG Parity] Memory Leakage in Walsender Process and a node where connector is deployed goes OOM (Unreachable) #21530

Closed
1 task done
shamanthchandra-yb opened this issue Mar 15, 2024 · 1 comment
Assignees
Labels
area/cdcsdk CDC SDK kind/bug This issue is a bug priority/high High Priority

Comments

@shamanthchandra-yb
Copy link

shamanthchandra-yb commented Mar 15, 2024

Jira Link: DB-10412

Description

Running one of the master case test_cdc_comply_advantage_without_tablet_splitting using PG connector

A node which is used to deploy connector goes down after few iterations.

  • We create replication slots.
  • Use them to deploy connector
  • We have 2 tables, setup for CDC streaming.

Please find runs in JIRA.

Every 2 mins, we are doing pg_log_backend_memory_contexts and sorted list of ps, which can be found in test log.

| USER     | PID   | %CPU | %MEM | VSZ     | RSS     | TTY | STAT | START | TIME  | COMMAND                                                                |
|----------|-------|------|------|---------|---------|-----|------|-------|-------|------------------------------------------------------------------------|
| yugabyte | 36892 | 8.1  | 36.3 | 7700392 | 5777364 | ?   | Ssl  | 17:36 | 0:54  | postgres: walsender yugabyte 172.151.29.149(55540) idle               |
| yugabyte | 31930 | 86.8 | 32.3 | 11212616| 5149680 | ?   | Sl   | 17:30 | 15:42 | /home/yugabyte/tserver/bin/yb-tserver --flagfile /home/yugabyte/tserver/conf/server.conf |
| yugabyte | 36406 | 4.5  | 21.0 | 5603240 | 3355524 | ?   | Ssl  | 17:36 | 0:31  | postgres: walsender yugabyte 172.151.18.162(34048) idle               |
| yugabyte | 52453 | 0.0  | 0.0  | 12144   | 1200    | ?   | S    | 17:48 | 0:00  | grep -e walsender -e yb-tserver                                      |
| yugabyte | 52430 | 0.0  | 0.0  | 12732   | 3248    | ?   | Ss   | 17:48 | 0:00  | bash -c ps aux | head -1; ps aux | grep -e walsender -e yb-tserver | sort -rnk 4 2>&1 |

Screenshot 2024-03-15 at 11 26 23 PM

Test log with walsender info

Source connector version

PG connector

Connector configuration

add connector connector_name='ybconnector_cdc_a5c062_test_cdc_112244' stream_id='rs_cdc_a5c062_a7be' db_name='cdc_a5c062' connector_host='172.151.29.149' table_list=['test_cdc_112244'] {'name': 'ybconnector_cdc_a5c062_test_cdc_112244', 'config': {'database.master.addresses': '172.151.29.44:7100,172.151.16.49:7100,172.151.17.253:7100', 'database.port': 5433, 'database.masterhost': '172.151.17.253', 'database.masterport': '7100', 'database.user': 'yugabyte', 'database.password': 'yugabyte', 'database.dbname': 'cdc_a5c062', 'snapshot.mode': 'initial', 'admin.operation.timeout.ms': 600000, 'socket.read.timeout.ms': 300000, 'max.connector.retries': '10', 'operation.timeout.ms': 600000, 'topic.creation.default.compression.type': 'lz4', 'topic.creation.default.cleanup.policy': 'delete', 'topic.creation.default.partitions': 2, 'topic.creation.default.replication.factor': '1', 'tasks.max': '5', 'connector.class': 'io.debezium.connector.postgresql.PostgresConnector', 'topic.prefix': 'db_cdc', 'database.hostname': '172.151.17.253', 'plugin.name': 'pgoutput', 'slot.name': 'rs_cdc_a5c062_a7be', 'publication.autocreate.mode': 'filtered', 'publication.name': 'pn_ybconnector_cdc_a5c062_test_cdc_112244', 'table.include.list': 'public.test_cdc_112244'}}
add connector connector_name='ybconnector_cdc_a5c062_test_cdc_749ec3' stream_id='rs_cdc_a5c062_7c2d' db_name='cdc_a5c062' connector_host='172.151.18.162' table_list=['test_cdc_749ec3'] {'name': 'ybconnector_cdc_a5c062_test_cdc_749ec3', 'config': {'database.master.addresses': '172.151.29.44:7100,172.151.16.49:7100,172.151.17.253:7100', 'database.port': 5433, 'database.masterhost': '172.151.17.253', 'database.masterport': '7100', 'database.user': 'yugabyte', 'database.password': 'yugabyte', 'database.dbname': 'cdc_a5c062', 'snapshot.mode': 'initial', 'admin.operation.timeout.ms': 600000, 'socket.read.timeout.ms': 300000, 'max.connector.retries': '10', 'operation.timeout.ms': 600000, 'topic.creation.default.compression.type': 'lz4', 'topic.creation.default.cleanup.policy': 'delete', 'topic.creation.default.partitions': 2, 'topic.creation.default.replication.factor': '1', 'tasks.max': '5', 'connector.class': 'io.debezium.connector.postgresql.PostgresConnector', 'topic.prefix': 'db_cdc', 'database.hostname': '172.151.17.253', 'plugin.name': 'pgoutput', 'slot.name': 'rs_cdc_a5c062_7c2d', 'publication.autocreate.mode': 'filtered', 'publication.name': 'pn_ybconnector_cdc_a5c062_test_cdc_749ec3', 'table.include.list': 'public.test_cdc_749ec3'}}

YugabyteDB version

2.21.1.0-b267

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@shamanthchandra-yb shamanthchandra-yb added priority/high High Priority area/cdcsdk CDC SDK status/awaiting-triage Issue awaiting triage labels Mar 15, 2024
@yugabyte-ci yugabyte-ci added the kind/bug This issue is a bug label Mar 15, 2024
@dr0pdb
Copy link
Contributor

dr0pdb commented Mar 17, 2024

Looked at the test logs, there is a definite increase in the walsender memory usage.

2024-03-15 17:35:53,278:DEBUG: < 172.151.17.253 < USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
yugabyte   31930  2.7  1.1 5264816 182836 ?      Sl   17:30   0:09 /home/yugabyte/tserver/bin/yb-tserver --flagfile /home/yugabyte/tserver/conf/server.conf
yugabyte   35273  0.0  0.3 2457512 49560 ?       Ssl  17:34   0:00 postgres: walsender yugabyte 172.151.18.162(43586) idle
yugabyte   35211  0.0  0.3 2457512 49340 ?       Ssl  17:34   0:00 postgres: walsender yugabyte 172.151.29.149(52990) idle


2024-03-15 17:37:55,746:DEBUG: < 172.151.17.253 < USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
yugabyte   31930 38.4 19.3 7714396 3085756 ?     Sl   17:30   3:02 /home/yugabyte/tserver/bin/yb-tserver --flagfile /home/yugabyte/tserver/conf/server.conf
yugabyte   36406  7.6  4.4 2457512 710600 ?      Ssl  17:36   0:05 postgres: walsender yugabyte 172.151.18.162(34048) idle
yugabyte   36892  9.0  3.8 2457512 611216 ?      Ssl  17:36   0:05 postgres: walsender yugabyte 172.151.29.149(55540) idle


2024-03-15 17:39:58,022:DEBUG: < 172.151.17.253 < USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
yugabyte   31930 55.5 25.1 8910500 4009780 ?     Sl   17:30   5:31 /home/yugabyte/tserver/bin/yb-tserver --flagfile /home/yugabyte/tserver/conf/server.conf
yugabyte   36406  8.8 12.7 3506088 2033460 ?     Ssl  17:36   0:17 postgres: walsender yugabyte 172.151.18.162(34048) idle
yugabyte   36892  8.4 10.9 3506088 1743812 ?     Ssl  17:36   0:15 postgres: walsender yugabyte 172.151.29.149(55540) idle


2024-03-15 17:44:02,512:DEBUG: < 172.151.17.253 < USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
yugabyte   31930 78.9 32.4 9926356 5156536 ?     Sl   17:30  11:04 /home/yugabyte/tserver/bin/yb-tserver --flagfile /home/yugabyte/tserver/conf/server.conf
yugabyte   36892  8.4 24.2 5603240 3856764 ?     Ssl  17:36   0:35 postgres: walsender yugabyte 172.151.29.149(55540) idle
yugabyte   36406  7.0 21.2 5603240 3383232 ?     Ssl  17:36   0:31 postgres: walsender yugabyte 172.151.18.162(34048) idle


2024-03-15 17:46:05,112:DEBUG: < 172.151.17.253 < USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
yugabyte   31930 82.2 32.8 11237204 5234964 ?    Sl   17:30  13:12 /home/yugabyte/tserver/bin/yb-tserver --flagfile /home/yugabyte/tserver/conf/server.conf
yugabyte   36892  8.0 29.7 6651816 4741156 ?     Ssl  17:36   0:43 postgres: walsender yugabyte 172.151.29.149(55540) idle
yugabyte   36406  5.5 21.2 5603240 3382240 ?     Ssl  17:36   0:31 postgres: walsender yugabyte 172.151.18.162(34048) idle

dr0pdb added a commit that referenced this issue Mar 18, 2024
Summary:
The walsender while streaming the records to the client fetches the records in a batch from the CDC service. Once a batch is fully streamed, it is freed and a
new batch is fetched again.

The logic to free was a simple `pfree` call which wasn't completely freeing the previous batch.

This revision fixes the freeing logic by doing a deep free of all the palloc'd members.
Jira: DB-10412

Test Plan: Ran a manual load test with a Java application and noted the memory usage.

Reviewers: asrinivasan, skumar

Reviewed By: asrinivasan

Subscribers: ycdcxcluster, yql

Differential Revision: https://phorge.dev.yugabyte.com/D33246
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Mar 18, 2024
asrinivasanyb pushed a commit to asrinivasanyb/yugabyte-db that referenced this issue Mar 18, 2024
Summary:
The walsender while streaming the records to the client fetches the records in a batch from the CDC service. Once a batch is fully streamed, it is freed and a
new batch is fetched again.

The logic to free was a simple `pfree` call which wasn't completely freeing the previous batch.

This revision fixes the freeing logic by doing a deep free of all the palloc'd members.
Jira: DB-10412

Test Plan: Ran a manual load test with a Java application and noted the memory usage.

Reviewers: asrinivasan, skumar

Reviewed By: asrinivasan

Subscribers: ycdcxcluster, yql

Differential Revision: https://phorge.dev.yugabyte.com/D33246
dr0pdb added a commit that referenced this issue Mar 18, 2024
…ontext in walsender

Summary:
The walsender stores the begin/commit lsn of each unacked transaction in the unacked_transactions list in yb_virtual_wal_client. This was introduced in
https://phorge.dev.yugabyte.com/D32941.

This revision moves the allocations of the list to a separate memory context for ease of tracking.

Also fixed a small memory leak in `YBCGetTables`.
Jira: DB-10412, DB-10291

Test Plan: No behavior changes.

Reviewers: asrinivasan, skumar

Reviewed By: asrinivasan

Subscribers: ycdcxcluster, yql

Differential Revision: https://phorge.dev.yugabyte.com/D33245
dr0pdb added a commit that referenced this issue Mar 21, 2024
Summary:
The walsender while streaming records to the client, fetches a batch of records from the CDC services and streams them one by one. Once a
batch is streamed completely, it is freed and the next batch is fetched. A record batch contains the pg datum values representing the DML column
values. They are not that easy to free.

An easier way is to store them in a separate memory context and free it whenever needed.

This revision does that. We now store the record batch in a separate memory context and reset it before making the next call to the CDC service.
Jira: DB-10412

Test Plan:
Jenkins: test regex: .*ReplicationSlot.*

Local load testing. No behavioural changes

Reviewers: asrinivasan, skumar

Reviewed By: asrinivasan

Subscribers: ycdcxcluster, yql

Differential Revision: https://phorge.dev.yugabyte.com/D33432
dr0pdb added a commit that referenced this issue Apr 15, 2024
…walsender

Summary:
Original commit: 9aea6ae / D33246
The walsender while streaming the records to the client fetches the records in a batch from the CDC service. Once a batch is fully streamed, it is freed and a
new batch is fetched again.

The logic to free was a simple `pfree` call which wasn't completely freeing the previous batch.

This revision fixes the freeing logic by doing a deep free of all the palloc'd members.
Jira: DB-10412

Test Plan: Ran a manual load test with a Java application and noted the memory usage.

Reviewers: asrinivasan, skumar

Reviewed By: skumar

Subscribers: yql, ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34106
dr0pdb added a commit that referenced this issue Apr 16, 2024
…separate memory context in walsender

Summary:
#### Backport Description
No merge conflicts.

#### Original Description
Original commit: cc088c0 / D33245
The walsender stores the begin/commit lsn of each unacked transaction in the unacked_transactions list in yb_virtual_wal_client. This was introduced in
https://phorge.dev.yugabyte.com/D32941.

This revision moves the allocations of the list to a separate memory context for ease of tracking.

Also fixed a small memory leak in `YBCGetTables`.
Jira: DB-10412, DB-10291

Test Plan: No behavior changes.

Reviewers: asrinivasan, skumar

Reviewed By: asrinivasan

Subscribers: yql, ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34126
dr0pdb added a commit that referenced this issue Apr 16, 2024
…xes for PG compatible logical replication support

Summary:
##### Backport Description
No merge conflicts.

##### Original Description
Original commits:
1604f1b / D33432
5c023f3 / D33156
2327b8d / D33538

###### [#21530] YSQL: Store cached record batch in a separate context

The walsender while streaming records to the client, fetches a batch of records from the CDC services and streams them one by one. Once a
batch is streamed completely, it is freed and the next batch is fetched. A record batch contains the pg datum values representing the DML column
values. They are not that easy to free.

An easier way is to store them in a separate memory context and free it whenever needed.

This revision does that. We now store the record batch in a separate memory context and reset it before making the next call to the CDC service.
Jira: DB-10412

###### [#21400] YSQL: Handle START_REPLICATION with start_lsn > confirmed_flush_lsn

In the `START_REPLICATION` command, the client can specify the LSN from which they want the records to be streamed from.

This revision introduces support for cases when start_lsn > confirmed_flush_lsn. Before this, the value of start_lsn was being ignored and all records from confirmed_flush_lsn onwards were streamed in case of restarts.

Now, the records which belong to a transaction whose commit_lsn < start_lsn are skipped in the decoding phase (yb_decode.c) similar to how it is done in vanilla PG
(decode.c).
Jira: DB-10292

###### [#21651] YSQL: Handle RPC errors as warning while destroying virtual wal

Once the LogicalReplication ends either when a client ends it or due to an error, the walsender tries to cleanup the virtual wal instance associated with it.

If there is any error in this cleanup logic, it is better to log it as a warning. This is because:
1. We already have a backup to clean the virtual wal instance
2. The error can mask an actual error due to which the cleanup was required. #21651 is one such instance.

This revision updates Walsender, to treat any errors received during the DestroyVirtualWALForCDC as a warning.
Jira: DB-10545

Test Plan:
Jenkins: test regex: .*ReplicationSlot.*

Local load testing. No behavioural changes

./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testStartLsnValues'
./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testWalsenderGracefulShutdownWithCDCServiceError'

Reviewers: asrinivasan, skumar, siddharth.shah

Reviewed By: siddharth.shah

Subscribers: yql, ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34163
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cdcsdk CDC SDK kind/bug This issue is a bug priority/high High Priority
Projects
None yet
Development

No branches or pull requests

4 participants