[DocDB] Slow remotebootstraps during multi-region setup #11868
Comments
It seems we already have an adaptive mechanism here that should have automatically adjusted the fetch size up to (max_time_slot = 100 ms) * rate ≈ 25 MB. However, there are also additional factors (to be investigated) causing each fetch to take about 300 ms end to end. Overall, we were able to achieve 100 MBps after bumping up `remote_bootstrap_max_chunk_size`.
Summary: For multi-region setups with large network latency, having a small `remote_bootstrap_max_chunk_size` artificially reduces the overall bandwidth that can be achieved. This causes remote bootstraps to take a long time. In one customer case, the ping latency across nodes was ~150 ms and the round trip to fetch a block, from the bootstrapping node's point of view, was about 300-350 ms. This meant that they could only achieve 1000 / 300 ms * min{ 1 MB (remote_bootstrap_max_chunk_size), 100 ms (max_time_slot) / 1000 * 256 MBps (bootstrap rate) / 1 (concurrent bootstraps at that time) } = ~3.3 MBps. In practice they were seeing about 2.8 MBps. After bumping up remote_bootstrap_max_chunk_size, we were able to get much higher throughput. This change sets better defaults so that we do not artificially limit the bootstrap rate to less than what is configured. Test Plan: The increase in `remote_bootstrap_max_chunk_size` is what unblocked a customer whose cross-region remote bootstrap was limited to 3 MBps. After increasing the flag, we were able to get better throughput. Eyeball Reviewers: bogdan, timur Reviewed By: timur Subscribers: ybase Differential Revision: https://phabricator.dev.yugabyte.com/D16205
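The throughput bound described in the summary can be sketched as a short calculation. This is a model built from the numbers in the report, not YugabyteDB source code; the function name and parameters are hypothetical:

```python
def effective_throughput_mbps(rtt_ms, max_chunk_size_mb, max_time_slot_ms,
                              bootstrap_rate_mbps, num_concurrent):
    """Model of effective remote-bootstrap bandwidth (sketch, not yb code)."""
    # Data fetched per round trip: limited by both the chunk-size cap and
    # the rate limiter's per-time-slot budget, split across concurrent sessions.
    per_fetch_mb = min(
        max_chunk_size_mb,
        (max_time_slot_ms / 1000.0) * bootstrap_rate_mbps / num_concurrent,
    )
    # Fetches per second are bounded by the end-to-end round-trip time.
    return (1000.0 / rtt_ms) * per_fetch_mb

# Customer case: ~300 ms round trip, 1 MB chunk cap, 256 MBps rate limit,
# one concurrent bootstrap.
print(effective_throughput_mbps(300, 1, 100, 256, 1))  # ≈ 3.33 MBps
```

With the 1 MB default, the chunk cap is the binding term; raising it lets the rate limiter's per-slot budget (here 25.6 MB) become the limit instead.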
Summary: Backports of the above change to release branches, with the same summary and test plan. Original Revision: https://phabricator.dev.yugabyte.com/D16205 Original Commit: f036bc3 Eyeball Reviewers: timur, bogdan Reviewed By: bogdan Subscribers: ybase Differential Revisions: https://phabricator.dev.yugabyte.com/D16536, https://phabricator.dev.yugabyte.com/D16537, https://phabricator.dev.yugabyte.com/D16538
Description
On one cross-region cluster, we were only getting about 2.8 MBps of remote-bootstrap bandwidth. After enabling --v=1 temporarily for 10 seconds, we found that we were fetching very small amounts of data on each trip. With `remote_bootstrap_max_chunk_size` defaulting to 1 MB, the network round trip required for each fetch was causing a significant reduction in the overall bandwidth achieved.
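To illustrate the effect described above: when the round trip dominates, each fetch delivers at most the chunk cap, so the chunk size scales achievable bandwidth roughly linearly. A quick comparison under the reported ~300 ms round trip (chunk sizes chosen for illustration):

```python
# Bandwidth ≈ fetches-per-second * chunk size, when the chunk cap binds.
rtt_ms = 300.0
for chunk_mb in (1, 4, 16, 32):
    mbps = 1000.0 / rtt_ms * chunk_mb
    print(f"{chunk_mb:>3} MB chunks -> {mbps:.1f} MBps")
# 1 MB chunks give ~3.3 MBps, matching the ~2.8 MBps observed in practice.
```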