[DocDB] Backups: Reuse SSH connections in yb_backup #11465

bmatican · 2022-02-12T17:45:23Z

Description

Currently, the yb_backup script opens a large number of SSH connections to the servers in the cluster, on the order of 6 x num_tablets. Many of these go to the same tablet servers, so the script could benefit from keeping a connection alive and reusing it for subsequent commands.
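For illustration only, here is a minimal sketch of how connection reuse could look with OpenSSH multiplexing. It is not the yb_backup implementation: the helper name `build_multiplexed_ssh_cmd`, the socket directory, and the persist timeout are assumptions. Only the `ControlMaster`, `ControlPath`, and `ControlPersist` options are standard OpenSSH client settings.

```python
import os


def build_multiplexed_ssh_cmd(user, host, remote_cmd,
                              control_dir="/tmp", persist="60s"):
    """Build an ssh argv that shares one master connection per user/host/port.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    # %r, %h and %p are expanded by ssh to remote user, host and port,
    # so every target gets its own control socket.
    control_path = os.path.join(control_dir, "ssh-mux-%r@%h:%p")
    return [
        "ssh",
        # Reuse an existing master connection, or become the master if none exists.
        "-o", "ControlMaster=auto",
        "-o", "ControlPath={}".format(control_path),
        # Keep the master connection open in the background after the command exits.
        "-o", "ControlPersist={}".format(persist),
        "{}@{}".format(user, host),
        remote_cmd,
    ]
```

With options like these, the first command to a host pays the connection setup cost, and later commands to the same host run over the existing connection as long as the master is still alive.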

cc @tylarb

… large number of tablets

Summary:
This diff adds a variety of speedups that help in all cases, but especially when there is a large number of tablets.
- Add ssh multiplexing to run_ssh_cmd
  - This allows us to reuse ssh connections, so we don't incur the ssh startup cost on every command
- Combine chain of ssh commands to single command
  - This reduces the number of requests we're sending, and also allows us to do retries on the entire command chain
- Add parallelism to find_tablet_replicas
  - Previously we made `yb-admin list_tablet_servers` calls sequentially for each tablet, which could take a long time. This now uses half of the `--parallelism` flag value (with a max of 16 to not overload the master)

This also fixes the issue of not retrying on checksum failures, as we now retry the entire command chain.

Test Plan:
Tested on a large setup with 10 nodes, rf3, 100 tables with 100 tablets each. Previously a restore took 11.5 hours; with these improvements, it took under 2 hours. Saw a similar improvement at a different scale, with a single table with 100 tablets, which went from 11 minutes to 2 minutes.

Generic backup tests:
ybd --cxx-test tools_yb-backup-test_ent
ybd --java-test org.yb.pgsql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.ParameterizedTestYbBackup --tp 1

Reviewers: oleg
Reviewed By: oleg
Subscribers: skedia, kkg, asrivastava, jenkins-bot, bogdan, oleg
Differential Revision: https://phabricator.dev.yugabyte.com/D15306
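The other two speedups in the summary (chaining commands so the whole chain is retried, and bounded parallelism for replica lookups) can be sketched roughly as follows. This is a hedged illustration rather than the code from the diff: all function names are hypothetical, the retry count and the lower bound of one worker are assumptions, and only the "half of `--parallelism`, capped at 16" rule comes from the commit message.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def chain_commands(cmds):
    """Join shell commands so the chain stops at the first failing step."""
    return " && ".join(cmds)


def run_with_retries(argv, attempts=3, delay_sec=0.0, runner=subprocess.run):
    """Run argv, retrying the whole command (and thus the whole chain) on failure.

    The attempts/delay_sec defaults are assumptions, not values from the diff.
    """
    last_error = None
    for _ in range(attempts):
        result = runner(argv, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        last_error = result.stderr
        time.sleep(delay_sec)
    raise RuntimeError(
        "command failed after {} attempts: {}".format(attempts, last_error))


def lookup_parallelism(parallelism_flag):
    """Half of the --parallelism value, capped at 16 (floor of 1 is an assumption)."""
    return max(1, min(16, parallelism_flag // 2))


def map_in_parallel(fn, items, parallelism_flag):
    """Apply fn to every item using a bounded thread pool, preserving order."""
    workers = lookup_parallelism(parallelism_flag)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))
```

Because a chain joined with `&&` exits non-zero as soon as any step fails, a failed verification step (such as a checksum comparison placed at the end of the chain) causes the entire chain to be retried, which matches the behavior the summary describes.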

…mprovements for large number of tablets

Summary:
This diff adds a variety of speedups that help in all cases, but especially when there is a large number of tablets.
- Add ssh multiplexing to run_ssh_cmd
  - This allows us to reuse ssh connections, so we don't incur the ssh startup cost on every command
- Combine chain of ssh commands to single command
  - This reduces the number of requests we're sending, and also allows us to do retries on the entire command chain
- Add parallelism to find_tablet_replicas
  - Previously we made `yb-admin list_tablet_servers` calls sequentially for each tablet, which could take a long time. This now uses half of the `--parallelism` flag value (with a max of 16 to not overload the master)

This also fixes the issue of not retrying on checksum failures, as we now retry the entire command chain.

Original diff: https://phabricator.dev.yugabyte.com/D15306
Original commit: 29d2c2c

Test Plan:
Tested on a large setup with 10 nodes, rf3, 100 tables with 100 tablets each. Previously a restore took 11.5 hours; with these improvements, it took under 2 hours. Saw a similar improvement at a different scale, with a single table with 100 tablets, which went from 11 minutes to 2 minutes.

Generic backup tests:
ybd --cxx-test tools_yb-backup-test_ent
ybd --java-test org.yb.pgsql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.ParameterizedTestYbBackup --tp 1

Reviewers: oleg, bogdan
Reviewed By: bogdan
Subscribers: oleg, bogdan, jenkins-bot
Differential Revision: https://phabricator.dev.yugabyte.com/D15486

…improvements for large number of tablets

Summary:
This diff adds a variety of speedups that help in all cases, but especially when there is a large number of tablets.
- Add ssh multiplexing to run_ssh_cmd
  - This allows us to reuse ssh connections, so we don't incur the ssh startup cost on every command
- Combine chain of ssh commands to single command
  - This reduces the number of requests we're sending, and also allows us to do retries on the entire command chain
- Add parallelism to find_tablet_replicas
  - Previously we made `yb-admin list_tablet_servers` calls sequentially for each tablet, which could take a long time. This now uses half of the `--parallelism` flag value (with a max of 16 to not overload the master)

This also fixes the issue of not retrying on checksum failures, as we now retry the entire command chain.

Original diff: https://phabricator.dev.yugabyte.com/D15306
Original commit: 29d2c2c

Test Plan:
Tested on a large setup with 10 nodes, rf3, 100 tables with 100 tablets each. Previously a restore took 11.5 hours; with these improvements, it took under 2 hours. Saw a similar improvement at a different scale, with a single table with 100 tablets, which went from 11 minutes to 2 minutes.

Generic backup tests:
ybd --cxx-test tools_yb-backup-test_ent
ybd --java-test org.yb.pgsql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.ParameterizedTestYbBackup --tp 1

Reviewers: oleg
Reviewed By: oleg
Subscribers: oleg, bogdan, jenkins-bot
Differential Revision: https://phabricator.dev.yugabyte.com/D15494

…mprovements for large number of tablets

Summary:
This diff adds a variety of speedups that help in all cases, but especially when there is a large number of tablets.
- Add ssh multiplexing to run_ssh_cmd
  - This allows us to reuse ssh connections, so we don't incur the ssh startup cost on every command
- Combine chain of ssh commands to single command
  - This reduces the number of requests we're sending, and also allows us to do retries on the entire command chain
- Add parallelism to find_tablet_replicas
  - Previously we made `yb-admin list_tablet_servers` calls sequentially for each tablet, which could take a long time. This now uses half of the `--parallelism` flag value (with a max of 16 to not overload the master)

This also fixes the issue of not retrying on checksum failures, as we now retry the entire command chain.

Original diff: https://phabricator.dev.yugabyte.com/D15306
Original commit: 29d2c2c

Test Plan:
Tested on a large setup with 10 nodes, rf3, 100 tables with 100 tablets each. Previously a restore took 11.5 hours; with these improvements, it took under 2 hours. Saw a similar improvement at a different scale, with a single table with 100 tablets, which went from 11 minutes to 2 minutes.

Generic backup tests:
ybd --cxx-test tools_yb-backup-test_ent
ybd --java-test org.yb.pgsql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.ParameterizedTestYbBackup --tp 1

Reviewers: oleg
Reviewed By: oleg
Subscribers: oleg, bogdan, jenkins-bot
Differential Revision: https://phabricator.dev.yugabyte.com/D15493

bmatican added the area/docdb YugabyteDB core features label Feb 12, 2022

bmatican assigned hulien22 Feb 12, 2022

bmatican added this to Backlog in YBase features via automation Feb 12, 2022

bmatican added this to To do in Backups via automation Feb 12, 2022

bmatican mentioned this issue Feb 12, 2022

[docdb] Backup hardening: Tracking issue #8895

Open

bmatican closed this as completed Feb 22, 2022

YBase features automation moved this from Backlog to Done Feb 22, 2022

Backups automation moved this from To do to Done Feb 22, 2022
