Initial thoughts about VReplication optimization #7997
Labels
Component: VReplication
Type: Enhancement
Logical improvement (somewhere between a bug and feature)
Type: Performance
At this point in time I'm starting to look more in depth into Online DDL via VReplication, to make sure it's on par with, and above, `gh-ost`'s capabilities. One of the first things I wanted to do is compare migration speed. I got my hands on an external server, and here goes.

My initial tests are on an `examples/local` (unsharded, one primary, two replicas) cluster, and a large table (20M rows, ~1.7GB), with no traffic. My intention was to first check the speed of `vcopier`. This follows an internal discussion with @rohit-nayak-ps and @vmg about how we should address packet size. I won't produce the wonderful graphs @vmg does, first because I'm still not so good at benchmarking, and second because things took a turn.

The good news
With the default `vstream_packet_size` (`250000`), with even smaller packet sizes (`65536`), and marginally as low as `16384`, VReplication migration is faster than `gh-ost`:

- `gh-ost` migration: consistently `310s`-`320s`
- `online` (VReplication) migration: consistently `~200s`

With a lower `vstream_packet_size` (e.g. `4096`), VReplication runtime for the same table jumps as high as `~20min` == `1200s`.
Problem statement
The above is really great, but only in isolation. I see two immediate problems with VReplication's implementation that surface in a non-isolated environment:
1. Packet size <=> copy size
This was discussed internally as well: VReplication fetches up to `n` bytes (`vstream_packet_size`) from the streamer, then applies that data via `INSERT` onto the target table. This means a `500KB` write per transaction. For `50`-byte rows that means `10,000` rows; for `1KB` rows it means `500` rows.

The numbers are not unreasonable. However:
- We should make sure `binlog_cache_size` is at least as high as the packet size, or else every write is guaranteed to exceed the binlog cache size, which means the write goes down to disk in a temporary file associated with the session, before being read back and actually written (again, to disk) into the binlog.
- In `gh-ost`, we settled on smaller sizes. We counted by rows, not by bytes, and limited ourselves to `50` rows at a time, no more than `100`. Experimenting with higher values rarely produced any speedup, but did immediately impact performance.

We have discussed internally decoupling the packet size (and the amount of data transferred over the wire) from the number of rows written per transaction, so I don't have much to add here; we are likely to experiment with that and see what comes out. A rough sketch of the idea is below.
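To make that concrete, here is a minimal sketch of the decoupling idea, in Go. This is not actual `vcopier` code: `applyStream`, the `packets` channel, `row`, and `maxRowsPerTxn` are all invented for illustration, under the assumption that rows arrive in `vstream_packet_size`-bounded packets and are applied with parameterized `INSERT`s.

```go
package sketch

import "database/sql"

// row stands in for one decoded row event from the stream.
type row []interface{}

// maxRowsPerTxn caps rows per target transaction (gh-ost-style),
// independently of how many bytes each wire packet carries.
const maxRowsPerTxn = 100

// applyStream drains packets (each up to ~vstream_packet_size bytes of
// rows) and applies them in small transactions: a 500KB packet of
// 50-byte rows becomes ~100 transactions of 100 rows each, every one
// of them well under binlog_cache_size, instead of one 500KB write.
func applyStream(db *sql.DB, packets <-chan []row, insertStmt string) error {
	var pending []row
	flush := func() error {
		if len(pending) == 0 {
			return nil
		}
		tx, err := db.Begin()
		if err != nil {
			return err
		}
		for _, r := range pending {
			if _, err := tx.Exec(insertStmt, r...); err != nil {
				tx.Rollback()
				return err
			}
		}
		pending = pending[:0]
		return tx.Commit()
	}
	for pkt := range packets {
		for _, r := range pkt {
			pending = append(pending, r)
			if len(pending) >= maxRowsPerTxn {
				if err := flush(); err != nil {
					return err
				}
			}
		}
	}
	return flush() // apply any tail shorter than the cap
}
```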
2. Open query
It was actually not clear to me until now, completely my ignorance: the entire copy phase uses a single query on the source server. So if I ALTER a `1.7GB` table, there's a single `SELECT col1, col2, col3, ... <all columns> FROM the_table ORDER BY <list of PK columns>`.

This single query runs for the duration of the migration/stream. Here's why this is a problem in a prod environment:

- The long-running query pins an old read view, so InnoDB's purge cannot proceed and the history list length accumulates. In my tests it grew to `1,000,000` (or was it `5,000,000`?). In my past experience such a large history list length was a cause for outage.
- When the app issues a `DELETE` for a row, it can't actually be deleted from the table, because the massive `SELECT` query still holds on to that data.

So running this query causes high load on a `primary` production server, except for the smallest of tables. I think this is something we must work on.

So I would like to look into breaking this apart.
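(Side note: one way to watch this purge lag, sketched in Go. The helper itself is made up, but `information_schema.INNODB_METRICS` and its `trx_rseg_history_len` counter are standard in MySQL 5.7+.)

```go
package sketch

import "database/sql"

// historyListLength reads InnoDB's history list length, which keeps
// growing while the long-running copy SELECT pins its read view and
// blocks purge. Assumes MySQL 5.7+, where the trx_rseg_history_len
// counter is enabled by default in information_schema.INNODB_METRICS.
func historyListLength(db *sql.DB) (int64, error) {
	var n int64
	err := db.QueryRow(
		`SELECT count FROM information_schema.INNODB_METRICS
		 WHERE name = 'trx_rseg_history_len'`).Scan(&n)
	return n, err
}
```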
For comparison, `gh-ost` "walks" the source table by reading `n` rows at a time in PK ascending order, while keeping track of which PKs have been read. Each step continues exactly where the previous step ended, and it's OK for the app to meanwhile `INSERT` or `DELETE` rows in the next chunk, the previous chunk, or in between chunks. In this method, if the app deletes future rows, then `gh-ost` never gets to process them (by the time it reaches that part of the table, there are no rows to process). On the other hand, the app is also free to delete rows already processed by `gh-ost`, and there's no transaction forcing those rows to stay in place. A sketch of this technique follows.
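Here is a sketch of that walking technique, in Go. It is my generic rendition of keyset pagination, not `gh-ost`'s actual code; `the_table`, the `payload` column, and the single-column integer PK `id` are all assumptions for brevity.

```go
package sketch

import "database/sql"

// copyByChunks walks the source table in PK-ascending chunks: read up
// to chunkSize rows strictly after the last PK seen, process them, and
// resume from there. Each chunk is a short, independent read, so no
// long-lived transaction pins undo history; rows deleted behind us
// stay deleted, and rows deleted ahead of us never show up.
func copyByChunks(src *sql.DB, chunkSize int, process func(id int64, payload string) error) error {
	lastPK := int64(-1) // start below the smallest possible PK
	for {
		rows, err := src.Query(
			`SELECT id, payload FROM the_table
			 WHERE id > ? ORDER BY id ASC LIMIT ?`,
			lastPK, chunkSize)
		if err != nil {
			return err
		}
		n := 0
		for rows.Next() {
			var id int64
			var payload string
			if err := rows.Scan(&id, &payload); err != nil {
				rows.Close()
				return err
			}
			if err := process(id, payload); err != nil {
				rows.Close()
				return err
			}
			lastPK = id // the next chunk resumes exactly here
			n++
		}
		rows.Close()
		if err := rows.Err(); err != nil {
			return err
		}
		if n < chunkSize {
			return nil // short chunk: we've reached the end
		}
	}
}
```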
The current architecture of VReplication does not support breaking up the `SELECT` query on the source side. I'll look into what it takes to do so.

cc @vitessio/ps-vitess