Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vreplication: dynamic packet sizing #7933

Merged
merged 8 commits into from
Apr 26, 2021

Conversation

vmg
Copy link
Collaborator

@vmg vmg commented Apr 22, 2021

Description

☠️ Warning: Spicy PR ahead ☠️

As discussed on last week's VReplication performance improvements, I noticed that table copying performance was basically being bottlenecked by MySQL bulk insert performance. After some testing on staging environments, I found out that the simplest way to improve MySQL insert performance was tweaking the packet size (-vstream_packet_size) on the vttablet source: depending on the layout and size of the tables being copied, careful tuning of this configuration option can make table copying up to 30% faster.

This is good, because it means advanced Vitess users can configure their clusters to squeeze more performance out of VReplication, but it is also bad, because it means you need be an advanced user and be able to understand how to tweak the packet sizes based on your use case. So, in the spirit of PlanetScale, it occurred to me: can we make Vitess more magical so it always does the right thing?

And it turns out we can! This PR introduces a new configuration option -vstream_dynamic_packet_size, which turns on an extension to the VReplication controller that measures the throughput of the replication stream and uses statistical analysis to adjust its packet size for optimal performance. The idea is good, but the devil is in the details, of course: to ensure this real-time optimization is as effective as it can be, the dynamic packet sizer performs repeated experiments to sample the effects of packet size changes. It keeps track of the response times from previous stream calls and aggregates them into statistical samples by removing any outliers and performing a two-sample Welch T-Test to ensure changes in packet size are actually faster. Tthis is the same statistical analysis that I usually perform when analysing benchmark results, to ensure optimizations are meaningfully faster.

To accomplish all this, I've had to vendor some of the maths logic from golang.org/x/perf/benchstat into Vitess; the meaningful math code is in an internal package so it cannot be used as a dependency directly. I believe this is fine because these kinds of statistic analysis functions are probably not being changed or improved anytime soon. Relatedly: I decided to use a Welch T-Test instead of the default U-test that the benchstat package uses because the later runs in O(n^2) memory, while the T-Test uses constant memory. I don't think the accuracy difference between the tests is meaningful, or more accurately, Wikipedia thinks this is fine. Both tests should be statistically meaningful and the T-test doesn't allocate, sooo...

Testing

I've started by verifying in the end-to-end performance harness that this change meaningfully improves table copying performance. It does for all testing I've been able to do locally -- it seems to consistently improve on the default 25000 packet size setting from Vitess.

Besides that, as you may guess, this is a complex feature to unit test. In order to give some sanity to the implementation, I've implemented a simulation as a unit test: basically, I've graphed the performance of several MySQL bulk insert behaviors and curve-fitted the response time graphs into several polynomials. The unit tests are using this curve fitting to simulate the response times of a MySQL server and to ensure that the packet size adjustments give a meaningful performance impact.

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

vmg added 4 commits April 21, 2021 16:01
Signed-off-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Vicent Marti <vmg@strn.cat>
@vmg
Copy link
Collaborator Author

vmg commented Apr 22, 2021

To note: right now the performance measures are fully client side. It measures the roundtrip time for each insert batch as seen for the client. This could easily be improved to make the server collaborate and report the actual insert timing, but I think there's value in this client-side performance metric, because it takes into account network throughput and latency, something that I believe is also meaningful when tuning the packet sizes.

Signed-off-by: Vicent Marti <vmg@strn.cat>
Copy link
Contributor

@rohit-nayak-ps rohit-nayak-ps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Magical PR as usual :)

I don't understand the statistical stuff but the changes to the vreplication code look good.

I suggest adding some per-stream vstreamer metrics to externalize the packet size (current, max, average). This will help provide more insights into how the algorithm is working in production setups both for users and (maybe) for us to tune it in specific circumstances. Examples here: go/vt/vttablet/tabletserver/vstreamer/engine.go:108

@vitessio vitessio deleted a comment Apr 23, 2021
@vitessio vitessio deleted a comment Apr 23, 2021
Signed-off-by: Vicent Marti <vmg@strn.cat>
Copy link
Contributor

@rohit-nayak-ps rohit-nayak-ps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@vitessio vitessio deleted a comment Apr 26, 2021
@vitessio vitessio deleted a comment Apr 26, 2021
vmg added 2 commits April 26, 2021 10:42
Signed-off-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Vicent Marti <vmg@strn.cat>
@rohit-nayak-ps rohit-nayak-ps merged commit c5cd306 into vitessio:master Apr 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants