[DocDB] Backups: Reuse SSH connections in yb_backup #11465
Comments
hulien22 added a commit that referenced this issue on Feb 15, 2022
… large number of tablets

Summary:
This diff adds a variety of speedups that help in all cases, but especially when there is a large number of tablets.
- Add ssh multiplexing to run_ssh_cmd
  - This allows us to reuse ssh connections, so we don't incur the ssh startup cost on every command
- Combine chain of ssh commands to single command
  - This reduces the number of requests we're sending, and also allows us to do retries on the entire command chain
- Add parallelism to find_tablet_replicas
  - Previously we made `yb-admin list_tablet_servers` calls sequentially for each tablet, which could take a long time. This change uses half of the value set by the `--parallelism` flag (with a max of 16 to not overload the master)

This also fixes the issue of not retrying on checksum failures, as we now retry the entire command chain.

Test Plan:
Tested on a large setup with 10 nodes rf3, 100 tables with 100 tablets each. Previously a restore took 11.5 hours; with these improvements, it took under 2 hours. Also saw similar numbers at a different scale with a single table with 100 tablets, which went from 11 minutes to 2 minutes.

Generic backup tests:
ybd --cxx-test tools_yb-backup-test_ent
ybd --java-test org.yb.pgsql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.ParameterizedTestYbBackup --tp 1

Reviewers: oleg
Reviewed By: oleg
Subscribers: skedia, kkg, asrivastava, jenkins-bot, bogdan, oleg
Differential Revision: https://phabricator.dev.yugabyte.com/D15306
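The first two items in the summary above (ssh multiplexing and combining a chain of commands) can be illustrated with a minimal sketch. This is not the actual `run_ssh_cmd` code from yb_backup. The helper names, the control directory, and the 60-second persist time are all assumptions made for illustration. The only real pieces are the OpenSSH client options `ControlMaster`, `ControlPath`, and `ControlPersist`.

```python
def ssh_multiplex_options(control_dir, persist_seconds=60):
    """Return OpenSSH client options that enable connection multiplexing.

    control_dir and persist_seconds are hypothetical parameters for this sketch.
    """
    return [
        # Reuse an existing master connection, or become the master if none exists.
        "-o", "ControlMaster=auto",
        # Socket path for the shared connection; %C expands to a hash of
        # the local host, remote host, port, and user.
        "-o", "ControlPath={}/%C".format(control_dir),
        # Keep the master connection open in the background after the command exits.
        "-o", "ControlPersist={}".format(persist_seconds),
    ]


def combine_commands(commands):
    """Join shell commands into one chain that stops at the first failure."""
    return " && ".join(commands)


def build_ssh_command(user, host, commands, control_dir):
    """Build one ssh invocation that runs the whole command chain remotely."""
    return (
        ["ssh"]
        + ssh_multiplex_options(control_dir)
        + ["{}@{}".format(user, host), combine_commands(commands)]
    )
```

Because the chain is sent as a single remote command, a caller can retry the whole chain as one unit when any step fails, which is the behavior the summary describes for checksum failures.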
hulien22 added a commit that referenced this issue on Feb 16, 2022
…mprovements for large number of tablets

Summary:
This diff adds a variety of speedups that help in all cases, but especially when there is a large number of tablets.
- Add ssh multiplexing to run_ssh_cmd
  - This allows us to reuse ssh connections, so we don't incur the ssh startup cost on every command
- Combine chain of ssh commands to single command
  - This reduces the number of requests we're sending, and also allows us to do retries on the entire command chain
- Add parallelism to find_tablet_replicas
  - Previously we made `yb-admin list_tablet_servers` calls sequentially for each tablet, which could take a long time. This change uses half of the value set by the `--parallelism` flag (with a max of 16 to not overload the master)

This also fixes the issue of not retrying on checksum failures, as we now retry the entire command chain.

Original diff: https://phabricator.dev.yugabyte.com/D15306
Original commit: 29d2c2c

Test Plan:
Tested on a large setup with 10 nodes rf3, 100 tables with 100 tablets each. Previously a restore took 11.5 hours; with these improvements, it took under 2 hours. Also saw similar numbers at a different scale with a single table with 100 tablets, which went from 11 minutes to 2 minutes.

Generic backup tests:
ybd --cxx-test tools_yb-backup-test_ent
ybd --java-test org.yb.pgsql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.ParameterizedTestYbBackup --tp 1

Reviewers: oleg, bogdan
Reviewed By: bogdan
Subscribers: oleg, bogdan, jenkins-bot
Differential Revision: https://phabricator.dev.yugabyte.com/D15486
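The parallelism rule in the summary above (half of `--parallelism`, capped at 16) can be sketched as follows. This is an illustration, not the actual `find_tablet_replicas` code. The function names, the `lookup` callback, and the floor of one worker are assumptions that the commit message does not state.

```python
from concurrent.futures import ThreadPoolExecutor


def replica_lookup_parallelism(parallelism, cap=16):
    """Half of the --parallelism value, capped so the master is not overloaded.

    The floor of 1 worker is an assumption added for this sketch.
    """
    return max(1, min(parallelism // 2, cap))


def find_replicas_parallel(tablet_ids, lookup, parallelism):
    """Run a per-tablet lookup concurrently instead of one tablet at a time.

    lookup is a hypothetical callable standing in for the per-tablet
    yb-admin call; it takes a tablet id and returns that tablet's replicas.
    """
    workers = replica_lookup_parallelism(parallelism)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as tablet_ids.
        return dict(zip(tablet_ids, pool.map(lookup, tablet_ids)))
```

A thread pool fits here on the assumption that each lookup spends most of its time waiting on a remote call rather than using local CPU.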
hulien22 added a commit that referenced this issue on Feb 22, 2022
…improvements for large number of tablets

Summary:
This diff adds a variety of speedups that help in all cases, but especially when there is a large number of tablets.
- Add ssh multiplexing to run_ssh_cmd
  - This allows us to reuse ssh connections, so we don't incur the ssh startup cost on every command
- Combine chain of ssh commands to single command
  - This reduces the number of requests we're sending, and also allows us to do retries on the entire command chain
- Add parallelism to find_tablet_replicas
  - Previously we made `yb-admin list_tablet_servers` calls sequentially for each tablet, which could take a long time. This change uses half of the value set by the `--parallelism` flag (with a max of 16 to not overload the master)

This also fixes the issue of not retrying on checksum failures, as we now retry the entire command chain.

Original diff: https://phabricator.dev.yugabyte.com/D15306
Original commit: 29d2c2c

Test Plan:
Tested on a large setup with 10 nodes rf3, 100 tables with 100 tablets each. Previously a restore took 11.5 hours; with these improvements, it took under 2 hours. Also saw similar numbers at a different scale with a single table with 100 tablets, which went from 11 minutes to 2 minutes.

Generic backup tests:
ybd --cxx-test tools_yb-backup-test_ent
ybd --java-test org.yb.pgsql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.ParameterizedTestYbBackup --tp 1

Reviewers: oleg
Reviewed By: oleg
Subscribers: oleg, bogdan, jenkins-bot
Differential Revision: https://phabricator.dev.yugabyte.com/D15494
hulien22 added a commit that referenced this issue on Feb 22, 2022
…mprovements for large number of tablets

Summary:
This diff adds a variety of speedups that help in all cases, but especially when there is a large number of tablets.
- Add ssh multiplexing to run_ssh_cmd
  - This allows us to reuse ssh connections, so we don't incur the ssh startup cost on every command
- Combine chain of ssh commands to single command
  - This reduces the number of requests we're sending, and also allows us to do retries on the entire command chain
- Add parallelism to find_tablet_replicas
  - Previously we made `yb-admin list_tablet_servers` calls sequentially for each tablet, which could take a long time. This change uses half of the value set by the `--parallelism` flag (with a max of 16 to not overload the master)

This also fixes the issue of not retrying on checksum failures, as we now retry the entire command chain.

Original diff: https://phabricator.dev.yugabyte.com/D15306
Original commit: 29d2c2c

Test Plan:
Tested on a large setup with 10 nodes rf3, 100 tables with 100 tablets each. Previously a restore took 11.5 hours; with these improvements, it took under 2 hours. Also saw similar numbers at a different scale with a single table with 100 tablets, which went from 11 minutes to 2 minutes.

Generic backup tests:
ybd --cxx-test tools_yb-backup-test_ent
ybd --java-test org.yb.pgsql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.ParameterizedTestYbBackup --tp 1

Reviewers: oleg
Reviewed By: oleg
Subscribers: oleg, bogdan, jenkins-bot
Differential Revision: https://phabricator.dev.yugabyte.com/D15493
jayant07-yb pushed a commit to jayant07-yb/yugabyte-db that referenced this issue on Mar 8, 2022
…ormance improvements for large number of tablets

Summary:
This diff adds a variety of speedups that help in all cases, but especially when there is a large number of tablets.
- Add ssh multiplexing to run_ssh_cmd
  - This allows us to reuse ssh connections, so we don't incur the ssh startup cost on every command
- Combine chain of ssh commands to single command
  - This reduces the number of requests we're sending, and also allows us to do retries on the entire command chain
- Add parallelism to find_tablet_replicas
  - Previously we made `yb-admin list_tablet_servers` calls sequentially for each tablet, which could take a long time. This change uses half of the value set by the `--parallelism` flag (with a max of 16 to not overload the master)

This also fixes the issue of not retrying on checksum failures, as we now retry the entire command chain.

Test Plan:
Tested on a large setup with 10 nodes rf3, 100 tables with 100 tablets each. Previously a restore took 11.5 hours; with these improvements, it took under 2 hours. Also saw similar numbers at a different scale with a single table with 100 tablets, which went from 11 minutes to 2 minutes.

Generic backup tests:
ybd --cxx-test tools_yb-backup-test_ent
ybd --java-test org.yb.pgsql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.ParameterizedTestYbBackup --tp 1

Reviewers: oleg
Reviewed By: oleg
Subscribers: skedia, kkg, asrivastava, jenkins-bot, bogdan, oleg
Differential Revision: https://phabricator.dev.yugabyte.com/D15306
Description
Currently, the yb_backup script opens a significant number of SSH connections to the servers in the cluster, on the order of 6 x num_tablets. Many of these go to the same tablet servers, so the script could benefit from keeping a connection alive and reusing it for later work.
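To put the 6 x num_tablets estimate in perspective, here is a rough back-of-the-envelope sketch. The function names are hypothetical, and the best case of one long-lived connection per server is an assumption. The example figures (10 nodes, 100 tables with 100 tablets each) come from the test setup described in the commit messages on this issue.

```python
def connections_without_reuse(num_tablets, per_tablet=6):
    """Approximate SSH connections when every command opens a new one."""
    return per_tablet * num_tablets


def connections_with_reuse(num_servers):
    """Best case with reuse: one long-lived connection per server (assumed)."""
    return num_servers


num_tablets = 100 * 100  # 100 tables with 100 tablets each
print(connections_without_reuse(num_tablets))  # 60000
print(connections_with_reuse(10))              # 10
```

In practice the number with reuse would be higher if connections time out or if work runs in parallel, so this should be read as an order-of-magnitude comparison only.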
cc @tylarb