[DocDB] Slow remotebootstraps during multi-region setup #11868

Closed
amitanandaiyer opened this issue Mar 23, 2022 · 1 comment

amitanandaiyer commented Mar 23, 2022

Description

On one cross-region cluster, we were getting only about 2.8MBps of remote-bootstrap bandwidth, even though:

  • the remote bootstrap rate was set to 256MBps and
  • scp was able to get about 25MBps

After temporarily enabling --v=1 for 10 seconds, we found that each round trip was fetching only a very small amount of data.
With remote_bootstrap_max_chunk_size defaulting to 1_MB, the network round trip needed to fetch each chunk was significantly reducing the overall bandwidth achieved.

@amitanandaiyer amitanandaiyer added the area/docdb YugabyteDB core features label Mar 23, 2022
@amitanandaiyer amitanandaiyer self-assigned this Mar 23, 2022
@amitanandaiyer
Contributor Author

It seems we already have an adaptive mechanism here, which should have automatically adjusted the fetch size to anything up to (max_time_slot = 100ms) * rate = ~25MB.

However, remote_bootstrap_max_chunk_size, which defaults to 1MB, was capping the effective fetch size back down to 1MB.

There are also some additional factors (to be investigated) that cause each fetch to take about 300ms end to end.

Overall, we were able to achieve 100MBps after bumping remote_bootstrap_max_chunk_size up to about 50_MB.
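
The capping effect described in this comment can be sketched roughly as below. This is only an illustration, not the YugabyteDB code: `EffectiveFetchBytes` and `kMB` are hypothetical names, the parameters simply mirror the quantities mentioned in this thread, and 1 MB is assumed to mean 1024 * 1024 bytes.

```cpp
#include <algorithm>
#include <cstdint>

// Assumption: 1 MB = 1024 * 1024 bytes.
constexpr std::int64_t kMB = 1024 * 1024;

// Hypothetical helper, not the actual implementation: the size of one fetch
// is the rate-limiter budget for a single time slot, shared across the
// concurrent bootstraps, and then capped by the max chunk size.
constexpr std::int64_t EffectiveFetchBytes(std::int64_t max_chunk_bytes,
                                           std::int64_t rate_bytes_per_sec,
                                           std::int64_t max_time_slot_ms,
                                           std::int64_t num_concurrent_bootstraps) {
  return std::min(
      max_chunk_bytes,
      rate_bytes_per_sec * max_time_slot_ms / 1000 / num_concurrent_bootstraps);
}
```

Under these assumptions, `EffectiveFetchBytes(1 * kMB, 256 * kMB, 100, 1)` returns 1 MB (the chunk-size cap wins), while `EffectiveFetchBytes(50 * kMB, 256 * kMB, 100, 1)` returns about 25.6 MB (the rate budget wins), which matches the ~25MB figure above.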

amitanandaiyer added a commit that referenced this issue Apr 5, 2022
Summary:
For multi-region setups with a large network latency, having a small `remote_bootstrap_max_chunk_size`
artificially reduces the overall bandwidth that can be achieved. This causes remote bootstraps to
take a long time.

In one customer case, the ping latency across nodes was ~150ms, and the round trip to fetch a block, from the bootstrapping node's point of view, was about 300-350ms.
This meant that they could achieve at most 1000ms / 300ms * std::min{ 1_MB (remote_bootstrap_max_chunk_size), 100ms (max_time_slot) / 1000 * 256_MB (bootstrap_rate) / 1 (number of concurrent bootstraps at that time) } = ~3.3MBps. In practice, they were seeing about 2.8MBps.
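
As a rough check of that arithmetic: with the old default, the `std::min` term evaluates to 1_MB, so the bound reduces to one 1_MB fetch per round trip. The sketch below assumes fetches are issued strictly one at a time and that 1 MB is 1024 * 1024 bytes; `MaxThroughputBytesPerSec` and `kMB` are hypothetical names used only for illustration.

```cpp
#include <cstdint>

// Assumption: 1 MB = 1024 * 1024 bytes.
constexpr std::int64_t kMB = 1024 * 1024;

// Hypothetical helper: upper bound on bytes per second when every fetch of
// `fetch_bytes` costs one full round trip and fetches do not overlap.
constexpr std::int64_t MaxThroughputBytesPerSec(std::int64_t fetch_bytes,
                                                std::int64_t round_trip_ms) {
  return fetch_bytes * 1000 / round_trip_ms;
}
```

With a 1_MB fetch, a 300ms round trip gives about 3.33MBps and a 350ms round trip gives about 2.86MBps, so the observed ~2.8MBps sits near the slow end of the reported 300-350ms range.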

After bumping up remote_bootstrap_max_chunk_size, we were able to get a much higher throughput.

This change sets better defaults so that we do not artificially limit the bootstrap rate
below the configured value.

Test Plan:
The increase in `remote_bootstrap_max_chunk_size` is what unblocked a customer whose cross-region remote bootstrap was limited to 3MBps. After increasing the flag, we were able to get better throughput.

Eyeball

Reviewers: bogdan, timur

Reviewed By: timur

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D16205
amitanandaiyer added a commit that referenced this issue Apr 15, 2022
Summary:
Original Revision: https://phabricator.dev.yugabyte.com/D16205
Original Commit: f036bc3

Reviewers: timur, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D16536
amitanandaiyer added a commit that referenced this issue Apr 15, 2022
Summary:
Original Revision: https://phabricator.dev.yugabyte.com/D16205
Original Commit: f036bc3

Reviewers: timur, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D16537
amitanandaiyer added a commit that referenced this issue Apr 15, 2022
Summary:
Original Revision: https://phabricator.dev.yugabyte.com/D16205
Original Commit: f036bc3

Reviewers: timur, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D16538