
[xCluster] Slow BootstrapProducer requests when using bootstrap_cdc_producer on many tables #10065

Closed
bmatican opened this issue Sep 21, 2021 · 2 comments
Labels
area/cdc Change Data Capture area/docdb YugabyteDB core features kind/bug This issue is a bug priority/high High Priority

Comments

@bmatican
Contributor

bmatican commented Sep 21, 2021

Jira Link: DB-8241
Test from @Arjun-yb based on #9946

Error running bootstrap_cdc_producer: Timed out (yb/rpc/outbound_call.cc:512): Unable to bootstrap CDC producer: BootstrapProducer RPC (request call id 5) to 172.161.17.6:9100 timed out after 120.000s
Steps:
Created 50 tables and 50 indexes on the producer side.
Started a workload on the producer side and waited for some time so that most of the tables had some rows in them.
Ran the command:
bin/yb-admin -master_addresses 172.151.16.134,172.161.22.108,172.151.37.96 bootstrap_cdc_producer <comma separated tables/indexes ids> -certs_dir_name <certs_dir_path>

Notes from Slack:

I0921 12:07:44.721004 18780 cdc_service.cc:1026] Received BootstrapProducer request [...]
...
W0921 12:17:07.176775 18780 yb_rpc.cc:426] Call yb.cdc.CDCService.BootstrapProducer 172.151.16.134:51266 => 172.161.17.6:9100 (request call id 5) took 562455ms (client timeout 120000ms).

@nspiegelberg if you're more familiar with this, it might need a second pair of eyes, as there's a lot of TServer-side log spew of various forms, from ChangeMetadataOperations

I0921 12:07:50.788729 18814 change_metadata_operation.cc:189] T 85b0bed9888f40c0bfe81385f865937c 0x000000001730c4a0 -> ChangeMetadataOperation

to some cdc_service metadata changes?

I0921 12:07:51.318403 18780 cdc_service.cc:461] Modifying remote peer { uuid: 04d9ffd224844b47a63a0d01b251da63 private: [host: "172.151.33.209" port: 9100] cloud_info: placement_cloud: "aws" placement_region: "us-west-2" placement_zone: "us-west-2b"
I0921 12:07:51.369906 18780 cdc_service.cc:461] Modifying remote peer { uuid: a67adfbdadca4a84b56478596d002b25 private: [host: "172.151.20.147" port: 9100] cloud_info: placement_cloud: "aws" placement_region: "us-west-2" placement_zone: "us-west-2a"
I0921 12:07:51.420486 18780 cdc_service.cc:461] Modifying remote peer { uuid: 6802604e6e614c138021280450ca2024 private: [host: "172.161.17.6" port: 9100] cloud_info: placement_cloud: "aws" placement_region: "us-east-2" placement_zone: "us-east-2a"
I0921 12:07:51.420614 18783 tablet_peer.cc:932] T 4c184fb032d8452fbc9ade26dc8a7fab P 6802604e6e614c138021280450ca2024 [state=RUNNING]: Setting cdc min replicated index to 189

cc @rahuldesirazu @hulien22

@bmatican bmatican added kind/bug This issue is a bug area/docdb YugabyteDB core features priority/high High Priority area/cdc Change Data Capture labels Sep 21, 2021
@bmatican bmatican added this to To do in xCluster replication via automation Sep 21, 2021
@bmatican bmatican added this to Backlog in YBase features via automation Sep 21, 2021
@nspiegelberg
Contributor

Notes:

bootstrap_cdc_producer flow:

  • ClusterAdminClient::BootstrapProducer
  • CDCServiceImpl::BootstrapProducer
  • Serial: OpenTable + CreateCDCStream
    • CreateCDCStream: AlterTableRequest

Task Breakdown:

  1. Unit test to benchmark (CDCServiceTestDurableMinReplicatedIndex, TestBootstrapProducer). Go through the flow and collect timing.
     • Start with 10 tables, work up to 100.
     • Be sure to log timing for both CreateTable time for 100 tables and bootstrap_cdc_producer, since you are only optimizing the latter.
  2. Turn CreateCDCStream into a batch request (1 IOP).
     • Parallelize AlterTableRequest. (SetupUniverseReplication does pipelining.)
  3. Parallelize TabletLeaderLatestEntryOpId.
  4. Batch the cdc_state table update, so there is only one ApplyAndFlush.
  5. Move bootstrap_cdc_producer to CatalogManager (local RPCs)? [Maybe skip this.]

Handy notes on parallelizing (see the sketch below):

  • async API with callback
  • ThreadPool
  • Big-O ≈ O(API_calls / thread_parallelism)
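
Below is a minimal sketch of the parallelization idea from these notes, using std::async as a stand-in for an internal ThreadPool; BootstrapTable and BootstrapAllTables are hypothetical names, not the actual YugabyteDB API (the real change landed via D14160 below).

// Hypothetical stand-in for the real per-table work:
// OpenTable + CreateCDCStream + TabletLeaderLatestEntryOpId.
#include <future>
#include <string>
#include <vector>

struct Status {
  bool ok = true;
  std::string msg;
};

Status BootstrapTable(const std::string& table_id) {
  return Status{};  // pretend the table bootstrapped cleanly
}

Status BootstrapAllTables(const std::vector<std::string>& table_ids) {
  std::vector<std::future<Status>> pending;
  pending.reserve(table_ids.size());
  // Fan out: one async task per table instead of a serial loop, so wall-clock
  // time is roughly O(num_tables / parallelism) instead of O(num_tables).
  for (const auto& id : table_ids) {
    pending.push_back(std::async(std::launch::async, BootstrapTable, id));
  }
  // Fan in: wait for every task, remembering the first failure.
  Status result;
  for (auto& f : pending) {
    Status s = f.get();
    if (!s.ok && result.ok) result = s;
  }
  return result;
}

Batching (items 2 and 4) is complementary: rather than one master RPC per CreateCDCStream and one cdc_state write per tablet, requests are coalesced so a single round trip covers many tables.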

@tverona1 tverona1 added the status/awaiting-triage Issue awaiting triage label Apr 6, 2022
nspiegelberg added a commit that referenced this issue Apr 12, 2022
Summary: The Bootstrap Producer step of XCluster is timing out when run with a larger number of tables (50+). Added thread parallelism and batching to improve throughput for large-volume configurations.

Test Plan:
TwoDCTest.BootstrapAndSetupLargeTableCount -n 4

5 Servers, 10 Tables, 10 tablets/T
- Batched: 10 sec
- Normal: 13 sec

Reviewers: yli, bogdan, jhe, rahuldesirazu

Reviewed By: jhe, rahuldesirazu

Subscribers: hector, zyu, kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D14160
nspiegelberg added a commit that referenced this issue Apr 29, 2022
…Performance

Summary:
The Bootstrap Producer step of XCluster is timing out when run with a larger number of tables (50+). Added thread parallelism and batching to improve throughput for large-volume configurations.

Original commits:
  6cc0ddd / D14160
  cfe5beb / D16521

Test Plan:
TwoDCTest.BootstrapAndSetupLargeTableCount -n 4

5 Servers, 10 Tables, 10 tablets/T
- Batched: 10 sec
- Normal: 13 sec

Reviewers: rahuldesirazu, slingam, jhe

Reviewed By: jhe

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D16735
YBase features automation moved this from Backlog to Done Apr 29, 2022
xCluster replication automation moved this from To do to Done Apr 29, 2022
@nspiegelberg
Contributor

@Arjun-yb:
I validated [@nicolas](https://yugabyte.slack.com/team/ULQK50RML)'s perf fix in a 12-node simulation cluster with the master branch (2.13.2.0-b68) and it gave me a nice result.
With 20 tables the bootstrap cmd took:
  20 tables - 0m2.835s     GFLAG_parallelize_bootstrap_producer = true
  20 tables - 0m19.028s    GFLAG_parallelize_bootstrap_producer = false
With 80 tables the bootstrap cmd took:
  80 tables - 0m3.547s     GFLAG_parallelize_bootstrap_producer = true
  80 tables - 1m6.086s     GFLAG_parallelize_bootstrap_producer = false
Setup replication took:
  20 tables - 0m2.019s
  80 tables - 0m5.987s
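
As a rough sanity check against the Big-O note above (time ≈ API_calls / thread_parallelism): for 80 tables, going from 1m6.086s serial to 0m3.547s parallel is about a 19x speedup, consistent with the per-table work fanning out across the bootstrap threads.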

@rthallamko3 rthallamko3 removed the status/awaiting-triage Issue awaiting triage label Oct 9, 2023
manav-yb pushed a commit that referenced this issue Apr 22, 2024
Summary:
This diff fixes memory leaks in Ysql Connection Manager that were caused by the following:

  # In upstream Odyssey, a client's io object is never deallocated when the client is freed; it is now explicitly freed.
  # In the connection manager, while reading a packet from the socket during authentication (in yb_auth_passthrough.c), a machine msg object is created but never freed once it has served its purpose; this diff frees it explicitly.
  # Memory allocated to store the name of a GUC variable (in var.h) is never released; this is fixed by using automatic allocation on the stack, so the memory is released when it goes out of scope (see the sketch below).
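
A minimal sketch of the third fix, with hypothetical names (handle_guc_variable and the 256-byte bound are invented for illustration, not the actual connection-manager code): a heap buffer that was never freed is replaced by an automatic stack array, which is reclaimed on every return path.

#include <cstddef>
#include <cstdio>
#include <cstring>

// Leaky pattern being fixed, for contrast:
//   char* name = (char*)malloc(len + 1);
//   memcpy(name, src, len); name[len] = '\0';
//   ... use name ...   // no matching free() -> one leak per GUC variable
void handle_guc_variable(const char* src, size_t len) {
  char name[256];  // hypothetical upper bound on a GUC variable name's length
  if (len >= sizeof(name)) return;  // reject oversized names rather than truncate
  std::memcpy(name, src, len);
  name[len] = '\0';
  std::printf("guc variable: %s\n", name);  // stand-in for the real use of the name
}  // `name` has automatic storage: released here, no free() needed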

Jira: DB-8241

Test Plan:
  # All Ysql Connection Manager tests are passing.

  # Ensured that a long-running app in yb-stress-test doesn't cause ysql conn mgr's memory to grow continuously while the app is running; instead, it stabilizes after some time.

Reviewers: janand, nkumar, rbarigidad

Reviewed By: janand, rbarigidad

Subscribers: rbarigidad, nkumar, mihnea, yql

Differential Revision: https://phorge.dev.yugabyte.com/D33190