Initial thoughts about VReplication optimization #7997


Closed
shlomi-noach opened this issue Apr 29, 2021 · 2 comments
Assignees
Labels
Component: VReplication Type: Enhancement Logical improvement (somewhere between a bug and feature) Type: Performance

Comments

@shlomi-noach
Contributor

shlomi-noach commented Apr 29, 2021

At this point in time I'm starting to look more in depth into Online DDL via VReplication, to make sure it's on par with, and above, gh-ost's capabilities. One of the first things I wanted to do is compare migration speed. I got my hands on an external server, and here goes.

My initial tests are on an example/local (unsharded, one primary, two replicas) cluster, and a large table (20M rows, ~1.7GB), with no traffic. My intention was to first check the speed of vcopier. This follows an internal discussion with @rohit-nayak-ps and @vmg about how we should address packet size. I won't produce the wonderful graphs @vmg does, first because I'm still not so good at benchmarking, and second because things took a turn.

The good news

With the default vstream_packet_size (250000), with even smaller packet sizes (65536), and marginally even as low as 16384, VReplication migration is faster than gh-ost:

  • gh-ost migration: consistently 310s - 320s
  • online (vreplication) migration: consistently ~200s

With a lower vstream_packet_size (e.g. 4096), VReplication runtime for the same table jumps as high as ~20min == 1200s.

Problem statement

The above is really great, but only in isolation. I see two immediate problems with VReplication's implementation that create a problem in a non-isolated environment:

1. Packet size <=> copy size

This was discussed internally, as well: VReplication fetches data up to n bytes (vstream_packet_size) from the streamer, then applies that data via INSERT onto the target table. This means a 500KB write per transaction. For 50-byte rows that means 10,000 rows; for 1KB rows it means 500 rows.

The numbers are not unreasonable. However:

  • Those are still big writes. We need to make sure binlog_cache_size is at least as high as the packet size, or else every write is guaranteed to exceed the binlog cache size, which means the write goes down to disk on a temporary file associated with the session, before being read back and actually written (again, to disk) into the binlog. A quick way to check for this spill is sketched right after this list.
  • For comparison, at my previous place, where we developed gh-ost, we settled on smaller sizes. We counted by rows, not by bytes, and limited ourselves to 50 rows at a time, no more than 100. Experimenting with higher values rarely produced any speedup, but did immediately impact performance.
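As an illustration only (not something this issue prescribes), one way to check whether these per-transaction writes spill the binlog cache to disk, and to size the cache accordingly, is:

```sql
-- Compare disk spills to in-memory cache use; the numbers here are illustrative.
SHOW GLOBAL STATUS LIKE 'Binlog_cache%';         -- Binlog_cache_disk_use vs. Binlog_cache_use
SHOW GLOBAL VARIABLES LIKE 'binlog_cache_size';
SET GLOBAL binlog_cache_size = 524288;           -- e.g. 512KB, above the ~500KB write described above
```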

We have discussed internally decoupling the packet size (and the amount of data transferred over the wire) from the number of rows written per transaction, so I don't have much to add here; we are likely to experiment with that and see what comes out. A rough sketch of the idea follows.
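As a minimal sketch only (applyPacket, applyBatch, and Row are illustrative assumptions, not the actual vcopier API): the streamer still sends large packets, but the applier splits each packet into smaller target-side transactions.

```go
package vcopiersketch

// Row stands in for one streamed row; the real code works on sqltypes values.
type Row []interface{}

// maxRowsPerTxn caps the target-side transaction size independently of
// vstream_packet_size, which would only govern the wire transfer.
const maxRowsPerTxn = 100

// applyPacket splits one streamer packet into several small INSERT
// transactions, applied via the supplied applyBatch callback.
func applyPacket(rows []Row, applyBatch func([]Row) error) error {
	for start := 0; start < len(rows); start += maxRowsPerTxn {
		end := start + maxRowsPerTxn
		if end > len(rows) {
			end = len(rows)
		}
		if err := applyBatch(rows[start:end]); err != nil {
			return err
		}
	}
	return nil
}
```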

2. Open query

It was actually not clear to me until now, completely my ignorance: the entire copy phase uses a single query on the source server. So if I ALTER a 1.7GB table, there's a single SELECT col1, col2, col3, ... <all columns> FROM the_table ORDER BY <list of PK columns>.

This single query runs for the duration of the migration/stream. Here's why this is a problem in a prod environment:

  1. By holding an open transaction for a long time, you interfere with InnoDB's MVCC mechanism. Because a transaction is open, InnoDB must keep a "version" of the data somewhere: any write to a row you're selecting must keep a version of the old data available. This in turn increases the history list length. Baron Schwartz wrote a great explanation of what history list length is.
  2. By holding an open transaction on all table rows, you effectively ensure that all, or almost all (I say "or" because there's an implementation detail I'm uncertain of), future writes have to keep a version aside. On a busy server this means more IO for the duration of the open query, and a lot more IO when it completes.
  3. My experience was such that we would set alerts whenever a server's history list length exceeded 1,000,000 (or was it 5,000,000?). In my past experience such a large history list length was a cause for outage. A way to watch this metric is shown after this list.
  4. It's interesting to consider DELETEs. When you DELETE a row, it can't actually be deleted from the table, because the massive SELECT query still holds on to that data.
  • To dwell a bit more on DELETEs: once we've copied some rows, we don't mind if they're subsequently deleted. But the query still forces them to exist.
  • If we haven't read some rows yet (say we're only halfway through the table), and the rows are deleted by the app, we are really not interested in reading them anymore. The data is already stale. But the query forces us to read those rows nonetheless, possibly hours after they've been deleted.
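For reference (not part of the original discussion), the history list length can be watched like this; the information_schema query assumes the corresponding InnoDB metrics counter is enabled on your server:

```sql
-- The TRANSACTIONS section of the InnoDB status output reports
-- "History list length" directly:
SHOW ENGINE INNODB STATUS\G

-- Alternatively, if the counter is enabled:
SELECT name, count
FROM information_schema.innodb_metrics
WHERE name = 'trx_rsegs_history_len';
```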

So running this query causes a high load on a primary production server, except for the smallest of tables. I think this is something we must work on.

So I would like to look into breaking this apart. For comparison, gh-ost "walks" the source table by reading n rows at a time in PK ascending order, while keeping track of which PKs have been read. Each step continues exactly where the previous step ended, and it's OK for the app to meanwhile insert or delete rows in the next chunk, or the previous chunk, or in between chunks. In this method, if the app deletes future rows, then gh-ost never gets to process them (by the time it reaches that part of the table, there are no rows to process). On the other hand, the app is also free to delete rows already processed by gh-ost, and there's no transaction forcing those rows to stay in place. A sketch of such a chunked query follows.
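As a sketch only, assuming a table the_table with a single integer PK column id (the real implementation would have to handle composite PKs), the chunked walk amounts to repeated short-lived queries like:

```sql
-- Copy the next chunk of up to 100 rows, starting right after the last
-- PK value copied so far; each chunk runs in its own short transaction,
-- so no long-lived read view is held on the source.
SET @last_copied_id := 0;   -- updated after every chunk

SELECT col1, col2, col3
FROM the_table
WHERE id > @last_copied_id
ORDER BY id ASC
LIMIT 100;
```

The copier would remember the highest id returned by each chunk and use it as the lower bound for the next one, until a chunk comes back empty.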

The current architecture of vreplication does not support breaking up the select query on the source side. I'll look into what it takes to do so.

cc @vitessio/ps-vitess

@askdba askdba added Component: VReplication Severity 4 Type: Enhancement Logical improvement (somewhere between a bug and feature) labels Apr 30, 2021
@rohit-nayak-ps rohit-nayak-ps removed their assignment May 2, 2021
@shlomi-noach
Contributor Author

Discussion follows on #8056

@shlomi-noach
Contributor Author

This can be closed as outdated. There have been numerous optimizations to VReplication by now (with more experimentation ongoing), and there's nothing keeping this issue open.
