[DocDB] Backups: Reuse SSH connections in yb_backup #11465

Closed
bmatican opened this issue Feb 12, 2022 · 0 comments
Assignees
Labels
area/docdb YugabyteDB core features

Comments

@bmatican
Contributor

Description

Currently, the yb_backup script opens a large number of SSH connections to the servers in the cluster, on the order of 6 x num_tablets. Many of these go to the same tablet servers, so the script could benefit from keeping a connection alive and reusing it for later commands.
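
One common way to get this kind of reuse is OpenSSH connection multiplexing. The sketch below is illustrative only and is not taken from yb_backup: the helper names `build_ssh_cmd` and `run_remote`, the `yugabyte` user, the socket directory, and the 60s persist time are all assumptions. Only the `ControlMaster`, `ControlPath`, and `ControlPersist` options are standard OpenSSH client settings.

```python
import os
import subprocess


def build_ssh_cmd(host, remote_cmd, user="yugabyte", control_dir="/tmp/ssh_mux"):
    """Build an ssh argv that shares one master connection per user/host/port.

    Hypothetical helper: the user name and control_dir are assumptions.
    """
    # %r, %h and %p are expanded by ssh to remote user, host and port, so
    # every command to the same server maps to the same control socket.
    control_path = os.path.join(control_dir, "%r@%h:%p")
    return [
        "ssh",
        "-o", "ControlMaster=auto",        # reuse a master if present, else create one
        "-o", "ControlPath=" + control_path,
        "-o", "ControlPersist=60s",        # keep the idle master alive for later commands
        "{}@{}".format(user, host),
        remote_cmd,
    ]


def run_remote(host, remote_cmd):
    """Run one remote command; only the first call per host pays the full handshake."""
    return subprocess.check_output(build_ssh_cmd(host, remote_cmd))
```

With settings like these, the first command to a tablet server creates the master connection and subsequent commands to the same server attach to it, which is where the savings would come from when many of the roughly 6 x num_tablets commands target the same hosts.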

cc @tylarb

@bmatican bmatican added the area/docdb YugabyteDB core features label Feb 12, 2022
@bmatican bmatican added this to Backlog in YBase features via automation Feb 12, 2022
@bmatican bmatican added this to To do in Backups via automation Feb 12, 2022
hulien22 added a commit that referenced this issue Feb 15, 2022
… large number of tablets

Summary:
This diff adds several speedups that help in all cases, but especially when there are a large number of tablets.

- Add ssh multiplexing to run_ssh_cmd
  - This lets us reuse ssh connections, so we don't incur the ssh startup cost on every command
- Combine chains of ssh commands into a single command
  - This reduces the number of requests we're sending, and also allows us to do retries on the entire command chain
- Add parallelism to find_tablet_replicas
  - Previously we made `yb-admin list_tablet_servers` calls sequentially for each tablet, which could take a long time. This change runs them in parallel, using half of the `--parallelism` flag value (with a max of 16 to not overload the master)
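
The parallel lookup described in the last bullet could be sketched roughly as follows. This is not the code from the diff: `lookup_pool_size`, `find_replicas_parallel`, and the `lookup_fn` callback are hypothetical names, and rounding the halved value down with a floor of 1 is an assumption. Only the "half of `--parallelism`, capped at 16" rule comes from the summary.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_LOOKUP_THREADS = 16  # cap from the summary, to avoid overloading the master


def lookup_pool_size(parallelism):
    """Half of the --parallelism value, at least 1 and at most 16 (rounding is assumed)."""
    return max(1, min(parallelism // 2, MAX_LOOKUP_THREADS))


def find_replicas_parallel(tablet_ids, lookup_fn, parallelism):
    """Call lookup_fn(tablet_id) for every tablet on a bounded thread pool.

    lookup_fn stands in for whatever issues the per-tablet yb-admin call.
    """
    tablet_ids = list(tablet_ids)
    with ThreadPoolExecutor(max_workers=lookup_pool_size(parallelism)) as pool:
        # pool.map preserves input order, so results line up with tablet_ids.
        results = list(pool.map(lookup_fn, tablet_ids))
    return dict(zip(tablet_ids, results))
```

Threads fit here because each lookup spends its time waiting on an external process, and the cap bounds how many requests reach the master at once.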

This also fixes the issue of not retrying on checksum failures, as we now retry the entire command chain.
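
As a rough illustration of why chaining helps with retries, the sketch below joins several shell commands with `&&` and retries the whole chain when any step fails. The helper names, the default attempt count, and the delay are assumptions for illustration and do not reflect the actual yb_backup implementation.

```python
import subprocess
import time


def chain_commands(commands):
    """Join shell commands so the chain stops at the first failing step."""
    return " && ".join("({})".format(c) for c in commands)


def run_with_retries(argv, attempts=3, delay_sec=1.0, runner=subprocess.check_output):
    """Run argv, retrying the entire command on a non-zero exit status.

    attempts and delay_sec are illustrative defaults, not values from the diff.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return runner(argv)
        except subprocess.CalledProcessError as err:
            last_error = err
            if attempt + 1 < attempts:
                time.sleep(delay_sec)
    raise last_error
```

If a copy step and a checksum verification step are chained this way, a checksum mismatch makes the chain exit non-zero, so the retry repeats the copy as well as the verification instead of only the failing step.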

Test Plan:
Tested on a large setup with 10 nodes at rf3 and 100 tables with 100 tablets each. Previously a restore took 11.5 hours; with these improvements, it took under 2 hours.

Also saw a similar improvement at a different scale with a single table with 100 tablets, where the restore went from 11 minutes to 2 minutes.

Generic backup tests:
ybd --cxx-test tools_yb-backup-test_ent
ybd --java-test org.yb.pgsql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.TestYbBackup --tp 1
ybd --java-test org.yb.cql.ParameterizedTestYbBackup --tp 1

Reviewers: oleg

Reviewed By: oleg

Subscribers: skedia, kkg, asrivastava, jenkins-bot, bogdan, oleg

Differential Revision: https://phabricator.dev.yugabyte.com/D15306
hulien22 added a commit that referenced this issue Feb 16, 2022
…mprovements for large number of tablets

Original diff: https://phabricator.dev.yugabyte.com/D15306
Original commit: 29d2c2c

Reviewers: oleg, bogdan

Reviewed By: bogdan

Subscribers: oleg, bogdan, jenkins-bot

Differential Revision: https://phabricator.dev.yugabyte.com/D15486
YBase features automation moved this from Backlog to Done Feb 22, 2022
Backups automation moved this from To do to Done Feb 22, 2022
hulien22 added a commit that referenced this issue Feb 22, 2022
…improvements for large number of tablets

Original diff: https://phabricator.dev.yugabyte.com/D15306
Original commit: 29d2c2c

Reviewers: oleg

Reviewed By: oleg

Subscribers: oleg, bogdan, jenkins-bot

Differential Revision: https://phabricator.dev.yugabyte.com/D15494
hulien22 added a commit that referenced this issue Feb 22, 2022
…mprovements for large number of tablets

Original diff: https://phabricator.dev.yugabyte.com/D15306
Original commit: 29d2c2c

Reviewers: oleg

Reviewed By: oleg

Subscribers: oleg, bogdan, jenkins-bot

Differential Revision: https://phabricator.dev.yugabyte.com/D15493
jayant07-yb pushed a commit to jayant07-yb/yugabyte-db that referenced this issue Mar 8, 2022
…ormance improvements for large number of tablets

Reviewers: oleg

Reviewed By: oleg

Subscribers: skedia, kkg, asrivastava, jenkins-bot, bogdan, oleg

Differential Revision: https://phabricator.dev.yugabyte.com/D15306
Projects
Backups
  
Done
Development

No branches or pull requests

2 participants