
[xCluster] Slow BootstrapProducer requests when using bootstrap_cdc_producer on many tables #10065

Closed
bmatican opened this issue Sep 21, 2021 · 2 comments
Labels
area/cdc Change Data Capture area/docdb YugabyteDB core features kind/bug This issue is a bug priority/high High Priority

Comments

@bmatican
Contributor

bmatican commented Sep 21, 2021

Jira Link: DB-8241
Test from @Arjun-yb based on #9946

Error running bootstrap_cdc_producer: Timed out (yb/rpc/outbound_call.cc:512): Unable to bootstrap CDC producer: BootstrapProducer RPC (request call id 5) to 172.161.17.6:9100 timed out after 120.000s
Steps:
Created 50 tables and 50 indexes on the producer side.
Started a workload on the producer side and waited for some time so that most of the tables had some rows in them.
Ran the command:
bin/yb-admin -master_addresses 172.151.16.134,172.161.22.108,172.151.37.96 bootstrap_cdc_producer <comma separated tables/indexes ids> -certs_dir_name <certs_dir_path>

Notes from Slack:

I0921 12:07:44.721004 18780 cdc_service.cc:1026] Received BootstrapProducer request [...]
...
W0921 12:17:07.176775 18780 yb_rpc.cc:426] Call yb.cdc.CDCService.BootstrapProducer 172.151.16.134:51266 => 172.161.17.6:9100 (request call id 5) took 562455ms (client timeout 120000ms).

@nspiegelberg if you're more familiar with this, it might need a second pair of eyes, as there's a lot of TServer-side log spew of various forms, from ChangeMetadataOperations

I0921 12:07:50.788729 18814 change_metadata_operation.cc:189] T 85b0bed9888f40c0bfe81385f865937c 0x000000001730c4a0 -> ChangeMetadataOperation

to some cdc_service metadata changes?

I0921 12:07:51.318403 18780 cdc_service.cc:461] Modifying remote peer { uuid: 04d9ffd224844b47a63a0d01b251da63 private: [host: "172.151.33.209" port: 9100] cloud_info: placement_cloud: "aws" placement_region: "us-west-2" placement_zone: "us-west-2b"
I0921 12:07:51.369906 18780 cdc_service.cc:461] Modifying remote peer { uuid: a67adfbdadca4a84b56478596d002b25 private: [host: "172.151.20.147" port: 9100] cloud_info: placement_cloud: "aws" placement_region: "us-west-2" placement_zone: "us-west-2a"
I0921 12:07:51.420486 18780 cdc_service.cc:461] Modifying remote peer { uuid: 6802604e6e614c138021280450ca2024 private: [host: "172.161.17.6" port: 9100] cloud_info: placement_cloud: "aws" placement_region: "us-east-2" placement_zone: "us-east-2a"
I0921 12:07:51.420614 18783 tablet_peer.cc:932] T 4c184fb032d8452fbc9ade26dc8a7fab P 6802604e6e614c138021280450ca2024 [state=RUNNING]: Setting cdc min replicated index to 189

cc @rahuldesirazu @hulien22

@bmatican bmatican added kind/bug This issue is a bug area/docdb YugabyteDB core features priority/high High Priority area/cdc Change Data Capture labels Sep 21, 2021
@bmatican bmatican added this to To do in xCluster replication via automation Sep 21, 2021
@bmatican bmatican added this to Backlog in YBase features via automation Sep 21, 2021
@nspiegelberg
Contributor

Notes:

bootstrap_cdc_producer flow:

  • ClusterAdminClient::BootstrapProducer
  • CDCServiceImpl::BootstrapProducer
  • Serial: OpenTable + CreateCDCStream
    • CreateCDCStream: AlterTableRequest

Task Breakdown:

  1. Unit test to benchmark (CDCServiceTestDurableMinReplicatedIndex, TestBootstrapProducer). Go through the flow and collect timing.
     • Start with 10 tables, work up to 100.
     • Be sure to log timing for both CreateTable time for 100 tables and bootstrap_cdc_producer, since you are only optimizing the latter.
  2. Turn CreateCDCStream into a batch request (1 IOP).
     • Parallelize AlterTableRequest. (SetupUniverseReplication does pipelining.)
  3. Parallelize TabletLeaderLatestEntryOpId.
  4. Batch the cdc_state table update, so there is only one ApplyAndFlush.
  5. Move bootstrap_cdc_producer to CatalogManager (local RPCs)? [Maybe skip this.]

Handy notes on parallelizing (see the sketch below):

  • async API with callback
  • ThreadPool
  • Big-O ≈ O(API_calls / thread_parallelism)
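
Below is a minimal sketch of the parallelization idea from these notes, using std::async as a stand-in for an internal ThreadPool; BootstrapTable and BootstrapAllTables are hypothetical names, not the actual YugabyteDB API (the real change landed via D14160 below).

// Hypothetical stand-in for the real per-table work:
// OpenTable + CreateCDCStream + TabletLeaderLatestEntryOpId.
#include <future>
#include <string>
#include <vector>

struct Status {
  bool ok = true;
  std::string msg;
};

Status BootstrapTable(const std::string& table_id) {
  return Status{};  // pretend the table bootstrapped cleanly
}

Status BootstrapAllTables(const std::vector<std::string>& table_ids) {
  std::vector<std::future<Status>> pending;
  pending.reserve(table_ids.size());
  // Fan out: one async task per table instead of a serial loop, so wall-clock
  // time is roughly O(num_tables / parallelism) instead of O(num_tables).
  for (const auto& id : table_ids) {
    pending.push_back(std::async(std::launch::async, BootstrapTable, id));
  }
  // Fan in: wait for every task, remembering the first failure.
  Status result;
  for (auto& f : pending) {
    Status s = f.get();
    if (!s.ok && result.ok) result = s;
  }
  return result;
}

Batching (items 2 and 4) is complementary: rather than one master RPC per CreateCDCStream and one cdc_state write per tablet, requests are coalesced so a single round trip covers many tables.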

@tverona1 tverona1 added the status/awaiting-triage Issue awaiting triage label Apr 6, 2022
nspiegelberg added a commit that referenced this issue Apr 12, 2022
Summary: The Bootstrap Producer step of XCluster is timing out when run with a larger number of tables (50+). Added thread parallelism and batching to improve throughput for large-volume configurations.

Test Plan:
TwoDCTest.BootstrapAndSetupLargeTableCount -n 4

5 Servers, 10 Tables, 10 tablets/T
- Batched: 10 sec
- Normal: 13 sec

Reviewers: yli, bogdan, jhe, rahuldesirazu

Reviewed By: jhe, rahuldesirazu

Subscribers: hector, zyu, kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D14160
nspiegelberg added a commit that referenced this issue Apr 29, 2022
…Performance

Summary:
The Bootstrap Producer step of XCluster is timing out when run with a larger number of tables (50+). Added thread parallelism and batching to improve throughput for large-volume configurations.

Original commits:
  6cc0ddd / D14160
  cfe5beb / D16521

Test Plan:
TwoDCTest.BootstrapAndSetupLargeTableCount -n 4

5 Servers, 10 Tables, 10 tablets/T
- Batched: 10 sec
- Normal: 13 sec

Reviewers: rahuldesirazu, slingam, jhe

Reviewed By: jhe

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D16735
YBase features automation moved this from Backlog to Done Apr 29, 2022
xCluster replication automation moved this from To do to Done Apr 29, 2022
@nspiegelberg
Contributor

@Arjun-yb:
I validated [@nicolas](https://yugabyte.slack.com/team/ULQK50RML)'s perf fix in a 12-node simulation cluster with the master branch (2.13.2.0-b68) and it gave me a nice result.
With 20 tables the bootstrap cmd took:
  20 tables - 0m2.835s     GFLAG_parallelize_bootstrap_producer = true
  20 tables - 0m19.028s    GFLAG_parallelize_bootstrap_producer = false
With 80 tables the bootstrap cmd took:
  80 tables - 0m3.547s     GFLAG_parallelize_bootstrap_producer = true
  80 tables - 1m6.086s     GFLAG_parallelize_bootstrap_producer = false
Setup replication took:
  20 tables - 0m2.019s
  80 tables - 0m5.987s
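
As a rough sanity check against the Big-O note above (time ≈ API_calls / thread_parallelism): for 80 tables, going from 1m6.086s serial to 0m3.547s parallel is about a 19x speedup, consistent with the per-table work fanning out across the bootstrap threads.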

@rthallamko3 rthallamko3 removed the status/awaiting-triage Issue awaiting triage label Oct 9, 2023
manav-yb pushed a commit that referenced this issue Apr 22, 2024
Summary:
This diff fixes memory leaks in Ysql Connection Manager that were caused by the following:

  # In upstream Odyssey, a client's io object is never deallocated when the client is freed; it is now explicitly freed.
  # In the connection manager, while reading a packet from the socket during authentication (in yb_auth_passthrough.c), a machine msg object is created but never freed once it has served its purpose; this diff frees it explicitly.
  # Memory allocated to store the name of a GUC variable (in var.h) is never released; this is fixed by using automatic allocation on the stack, so the memory is released when it goes out of scope (see the sketch below).
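
A minimal sketch of the third fix, with hypothetical names (handle_guc_variable and the 256-byte bound are invented for illustration, not the actual connection-manager code): a heap buffer that was never freed is replaced by an automatic stack array, which is reclaimed on every return path.

#include <cstddef>
#include <cstdio>
#include <cstring>

// Leaky pattern being fixed, for contrast:
//   char* name = (char*)malloc(len + 1);
//   memcpy(name, src, len); name[len] = '\0';
//   ... use name ...   // no matching free() -> one leak per GUC variable
void handle_guc_variable(const char* src, size_t len) {
  char name[256];  // hypothetical upper bound on a GUC variable name's length
  if (len >= sizeof(name)) return;  // reject oversized names rather than truncate
  std::memcpy(name, src, len);
  name[len] = '\0';
  std::printf("guc variable: %s\n", name);  // stand-in for the real use of the name
}  // `name` has automatic storage: released here, no free() needed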

Jira: DB-8241

Test Plan:
  # All Ysql Connection Manager tests are passing.

  # Ensured that a long-running app in yb-stress-test doesn't cause ysql conn mgr's memory to grow continuously while the app is running; instead, it stabilizes after some time.

Reviewers: janand, nkumar, rbarigidad

Reviewed By: janand, rbarigidad

Subscribers: rbarigidad, nkumar, mihnea, yql

Differential Revision: https://phorge.dev.yugabyte.com/D33190