I was looking into VReplication behavior as part of ongoing improvement to VReplication for Online DDL. It led me into a better understanding of the VReplication flow, and what I believe to be bottlenecks we can solve. I did some benchmarking in a no-traffic scenario (traffic scenario soon to come) to better understand the relationship between the different components. I'd like to share my findings. This will be long, so TL;DR:
- The major bottleneck is gRPC
- Current design is to use large thresholds, large timeouts, large queries, in order to reduce gRPC traffic
- There are advantages to using smaller thresholds/queries, and it is possible to decouple those sizes from the gRPC traffic.
- Consistent snapshots and accurate GTID tracking can be replaced with a more relaxed algorithm
- I offer alternate design(s)
- Different use cases may require different designs
Sample PR to demonstrate potential code changes: #8044
- Yes, most VReplication tests fail on this PR. Ongoing work.
Disclaimer: some opinions expressed here are based on my personal past experience in production. I realize some may not hold true everywhere. Also, tests are failing, so obviously I haven't solved this yet.
But let's begin with a general explanation of the VReplication flow (also see Life of a Stream). I'll use an Online DDL flow, which is mostly simple: copy shared columns from source to target. Much like CopyTables.
The general VReplication flow
The current flow uses perfect accuracy by tracking down GTIDs, locking tables and getting consistent read snapshots. For simplicity, we illustrate the flow for a single table operation. It goes like this:
- There are two sides to the flow: the source (vstreamer) and target (vreplicator)
- There are two data dimensions to the flow: existing table data, and incoming binlog changes on that table
- vstreamer (rowstreamer) is responsible for reading & streaming table data from source to target
- vreplicator has two main components:
  - vcopier: reads rows streamed from vstreamer and INSERTs them to target table
  - vplayer: tails the binary logs and applies relevant events to the target table
These building blocks are essential. What's interesting is how they interact. We identify three phases: Copy, Catch-up, Fast-forward
Copy
- vplayer will only apply changes to rows already copied, or anything before those queries.
- So we begin with vplayer doing nothing.
- rowstreamer begins work.
- rowstreamer runs lock table my_table READ (writes are blocked on the table), opens a transaction with consistent snapshot, and reads the current GTID.
  Which means, a locking & blocking operation, which gets us a GTID and a transaction that is guaranteed to read data associated with that exact GTID.
- In that transaction, rowstreamer runs a select <all-relevant-columns> from <my_table> order by <pk-columns>.
  This is an attempt to read basically all columns, all rows from a table in a single transaction, and stream those results back.
- rowstreamer sends down the GTID value (to be intercepted by vcopier).
- rowstreamer reads row by row; when the data it has accumulated from the query exceeds -vstream_packet_size, it sends down the accumulated rows, runs some housekeeping, and continues to accumulate further rows.
- Meanwhile, vcopier got the GTID, takes note.
- vcopier now gets the first batch of rows. It writes them all at once, in a single transaction, to MySQL.
  In that same transaction, it updates the _vt.copy_state table with identities of written rows (identified by last PK).
- The flow then switches to Catchup.
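The packet accumulation step above can be illustrated with a small simulation. This is only a sketch of the idea, not the rowstreamer code: stream_packets, row_size and the way a row's size is approximated are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, str, str]  # (id, sig, c), shaped like the benchmark table


def row_size(row: Row) -> int:
    """Approximate payload size of a row in bytes (an assumption for this sketch)."""
    return sum(len(str(col)) for col in row)


def stream_packets(rows: Iterable[Row], packet_size: int) -> Iterator[List[Row]]:
    """Accumulate rows; emit a packet once the accumulated size exceeds packet_size."""
    batch: List[Row] = []
    accumulated = 0
    for row in rows:
        batch.append(row)
        accumulated += row_size(row)
        if accumulated > packet_size:
            yield batch
            batch, accumulated = [], 0
    if batch:
        yield batch  # final, partially filled packet


rows = [(i, "x" * 40, "y" * 8) for i in range(1, 1001)]
small = list(stream_packets(rows, packet_size=1_000))
large = list(stream_packets(rows, packet_size=64_000))
print(len(small), len(large))  # many small packets vs. a single large one
```

Every packet is one message from source to target, so a smaller threshold directly translates into more messages for the same data.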
Catchup
vplayer processes events from the binary log. By now the binary logs have accumulated quite a few events.
- It discards anything that is not related to our table(s)
- It discards any event that has a larger PK than the ones we already copied
- It applies any remaining events:
  - insert/update/delete on our table rows
  - In the same transaction where we apply the event, we update the _vt.vreplication table with the associated GTID
- It goes on without a specific limit; it stops when replication lag is small enough, which means we've worked through almost all of the binlog (read: backlog)
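The filtering rules above can be sketched as follows. The Event type, the catch_up function and the in-memory dictionary standing in for the target table are hypothetical; the real vplayer works on binlog events and MySQL transactions, and this sketch simplifies position tracking to "GTID of the last applied event".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

PK = Tuple[int, str]  # composite key (id, c), as in the benchmark table


@dataclass
class Event:
    gtid: str
    table: str
    kind: str                  # "insert", "update" or "delete"
    pk: PK
    row: Optional[dict] = None


def catch_up(events: List[Event], table: str, last_pk: PK,
             target: Dict[PK, dict]) -> Optional[str]:
    """Apply relevant events to target; return the GTID of the last applied event."""
    position = None
    for ev in events:
        if ev.table != table:
            continue               # not related to our table
        if ev.pk > last_pk:
            continue               # row not copied yet; rowcopy will read it later
        if ev.kind == "delete":
            target.pop(ev.pk, None)
        else:
            target[ev.pk] = ev.row
        position = ev.gtid         # conceptually written in the same transaction
    return position


target = {(1, "a"): {"sig": "s1"}, (2, "b"): {"sig": "s2"}}
events = [
    Event("uuid:1", "other_table", "insert", (1, "z"), {"sig": "x"}),
    Event("uuid:2", "stress_test_pk", "update", (2, "b"), {"sig": "s2-new"}),
    Event("uuid:3", "stress_test_pk", "insert", (9, "q"), {"sig": "s9"}),
    Event("uuid:4", "stress_test_pk", "delete", (1, "a")),
]
position = catch_up(events, "stress_test_pk", last_pk=(2, "b"), target=target)
print(target, position)
```

Only the update on (2, "b") and the delete on (1, "a") are applied: the first event belongs to another table and the third has a PK beyond the last copied row.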
Side notes
- If the table is small enough, we might actually suffice with the flow thus far. Remember, vstreamer was actually selecting all rows from the table. So if all went well, we copied all rows, we caught up with binlog events, and we're good to cut-over or complete the workflow.
- However, if the table is large enough, there are situations where we are not done.
  - Any network issue may interrupt vstreamer from sending the rows.
  - vreplicator has a copyTimeout of 1 hour. When that timeout expires, vreplicator sends an io.EOF to vstreamer, which aborts the operation.
  In either of these scenarios, vreplicator has the last processed GTID and last read PK, and can resume work by:
  - fast forward (follows)
  - copy again
  - catchup again
  - (repeat)
Fast forward
This is actually a merger between both Copy & Catchup:
- We call on vstreamer to prepare another snapshot
- This time vreplicator tells vstreamer: "start with this PK"
- vstreamer creates a new transaction with consistent snapshot. It prepares a new query: select <all-relevant-columns> from <my_table> where <pk-columns> > :lastPK order by <pk-columns>
  So it attempts to read all table rows, starting from the given position (rows before that position are already handled by vcopier)
- vstreamer sends down the GTID for the transaction
- cut scene
- vcopier does not proceed to read rows from vstreamer. In the time it took to create the new consistent snapshot, a few more events have appeared in the binary log. Because the flow keeps strict ordering of transactions, the flow wants to first apply all transactions in the binlog up to the GTID just received from vstreamer.
- vplayer applies those events
- now that we're caught up, we return to vstreamer
- From this point on we loop back into the start of the flow. vstreamer reads and streams more rows, and either manages to read till the end of the table or it does not. vcopier writes rows, vplayer catches up with events, and so forth.
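For a composite primary key such as (id, c), the shorthand <pk-columns> > :lastPK stands for a lexicographic comparison. The sketch below spells that comparison out; resume_predicate and after_last_pk are hypothetical helpers, and the exact SQL that vstreamer generates is not shown in this write-up, so treat the rendered predicate as one possible form.

```python
from typing import List, Sequence, Tuple


def resume_predicate(pk_columns: Sequence[str]) -> str:
    """Expand `<pk-columns> > :lastPK` into a lexicographic OR-of-ANDs predicate."""
    clauses: List[str] = []
    for i, col in enumerate(pk_columns):
        equal_prefix = [f"{c} = :last_{c}" for c in pk_columns[:i]]
        clauses.append("(" + " AND ".join(equal_prefix + [f"{col} > :last_{col}"]) + ")")
    return " OR ".join(clauses)


def after_last_pk(row_pk: Tuple, last_pk: Tuple) -> bool:
    """The same rule evaluated in Python, column by column."""
    for value, last in zip(row_pk, last_pk):
        if value != last:
            return value > last
    return False  # equal to last_pk: that row was already copied


print(resume_predicate(["id", "c"]))
# (id > :last_id) OR (id = :last_id AND c > :last_c)
```

Note that in MySQL the comparison of string columns follows the column collation, which this Python stand-in ignores.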
Benchmark & analysis of impact of factors
Let's identify some factors affecting the flow:
- vstream_packet_size: the smaller it is, the more back-and-forth gRPC traffic we will see between vstreamer and vcopier: there will be more (and smaller) batches of data sent from source to target.
- copyTimeout: the smaller it is, the more often we will interrupt the mega-query rowstreamer attempts to run. Theoretically, we could set that value to near infinite, in the hope of delivering the entire table in one single sweep. But as mentioned before, network failures can happen anyway.
  So with a small value, we will:
  - Have more gRPCs (from target to source: "get me more rows please")
  - Cancel more transactions on source
  - Take more locks (LOCK TABLES my_table READ)
  - Create more transactions on source
- While not tunable in master branch, I've made it possible for the source query to have LIMIT <number-of-rows>. What if we voluntarily close the copy cycle on the vstreamer side?
  See these changes to support the new behavior:
  https://github.com/vitessio/vitess/pull/8044/files#diff-a1cffc790e352be31a3f600180d968b6d96f3bd90acf4bfa0a49e3a66611558cR227-R332
  https://github.com/vitessio/vitess/pull/8044/files#diff-862152928bc2f7cafae8ed7dd0fa49608a7f2f60615dc6131059876cd42c0087R206
  Spoiler: smaller reads lead to more gRPC communication. This time vstreamer terminates the communication. vcopier says "well I want more rows" and initiates the next copy. Again, this means:
  - Have more gRPCs (from target to source: "get me more rows please")
  - Cancel more transactions on source
  - Take more locks (LOCK TABLES my_table READ)
  - Create more transactions on source
- To be discussed later, I've also experimented with the overhead of the consistent snapshot and of the table lock.
The below benchmarks some of these params. It is notable that I did not include a vstream_packet_size tuning in the below.
- On my benchmarks, any vstream_packet_size >= 64K seems to be good enough and shows similar behavior
- Performance degraded very quickly on lower values
- It is apparent that keeping vstream_packet_size high is desirable. But, what's interesting is how the flow is coupled with that value, to be discussed later.
- In the below benchmark I used a value of 640000 (640k)
The benchmark is to run an Online DDL on a large table. In Online DDL both source and target are the same vttablet, so obviously the same host, and there's no cross-host network latency. It's noteworthy that there's still network involved: the source and target communicate via gRPC even though they're the same process.
The table:
CREATE TABLE `stress_test_pk` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`sig` varchar(40) NOT NULL,
`c` char(8) NOT NULL DEFAULT '',
PRIMARY KEY (`id`,`c`)
) ENGINE=InnoDB;
mysql> select * from stress_test_pk limit 10;
+----+------------------------------------------+----------+
| id | sig | c |
+----+------------------------------------------+----------+
| 1 | af15f1c7691cf684244f610e3b668e9de1dd83d6 | af15f1c7 |
| 2 | 1ccffab9bfa41d8e13fe4a8efa4764a532e6a605 | 1ccffab9 |
| 3 | a45617e5d2158c1a82fb2f98fda0106e15fe4dd2 | a45617e5 |
| 4 | fcffd065bf6e44950206617ce182b59a523986cc | fcffd065 |
| 6 | 10be967cf937218cbba69fd333020e2f1fdddd4f | 10be967c |
| 7 | 13804bb5f9e5cf8e08cc9d363790f339b761907f | 13804bb5 |
| 8 | cdb426bdddfdf660344a9bf875db911f84b30ff2 | cdb426bd |
| 9 | 7ad21b2c9261a85b558b2cdaf684ff616cad002e | 7ad21b2c |
| 13 | 94a02951925b5823fc0566a9dacbda0cad2b16cf | 94a02951 |
| 14 | d6fba9f908b703604a1b428f89d9ea6648ffe0bc | d6fba9f9 |
+----+------------------------------------------+----------+
mysql> select count(*) from stress_test_pk;
+----------+
| count(*) |
+----------+
| 16777216 |
+----------+
*************************** 1. row ***************************
Name: stress_test_pk
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 15974442
Avg_row_length: 82
Data_length: 1325400064
Max_data_length: 0
Index_length: 0
Data_free: 7340032
Auto_increment: 20578168
Create_time: 2021-05-05 08:22:03
Update_time: NULL
Check_time: NULL
Collation: utf8_general_ci
Checksum: NULL
Create_options:
Comment:
The table takes 1.4GB on disk.
So this table is quite simple, and with some text content to make it somewhat fat.
| With snapshot? | Traffic? | SELECT ... LIMIT | copyTimeout (seconds) | runtime (seconds) | comments |
|---|---|---|---|---|---|
| FALSE | FALSE | unlimited | 3600 | 140 | |
| TRUE | FALSE | unlimited | 3600 | 140 | |
| FALSE | FALSE | unlimited | 5 | 181 | |
| TRUE | FALSE | unlimited | 5 | 185 | |
| FALSE | FALSE | 1000000 | 3600 | 166 | ~10sec per round |
| TRUE | FALSE | 1000000 | 3600 | 167 | ~10sec per round |
| FALSE | FALSE | 1000000 | 60 | 165 | |
| TRUE | FALSE | 1000000 | 60 | 165 | |
| FALSE | FALSE | 1000000 | 5 | 180 | |
| TRUE | FALSE | 1000000 | 5 | 182 | |
| FALSE | FALSE | 100000 | 3600 | 350 | ~1.5sec per round |
| TRUE | FALSE | 100000 | 3600 | 352 | ~1.5sec per round |
| FALSE | FALSE | 20000 | 60 | infinity | like, ETA=hours |
| TRUE | FALSE | 20000 | 60 | infinity | like, ETA=hours |
For now, only consider rows With Snapshot=TRUE.
This is a subset of some more experiments I ran. The results are consistent up to 2-3 seconds across executions. What can we learn?
- The obvious winner (faster) is the existing flow in master branch. No LIMIT to the SELECT query. 1 hour timeout (longer than it takes to copy the table)
- small LIMITs are catastrophic
- small timeouts are likewise catastrophic
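To put the LIMIT values in perspective, here is the number of copy rounds each one implies for the benchmark table, assuming every LIMITed SELECT is one round that vcopier has to request again. copy_rounds is a hypothetical helper for this arithmetic only.

```python
import math

TABLE_ROWS = 16_777_216  # count(*) of stress_test_pk in the benchmark


def copy_rounds(limit: int, table_rows: int = TABLE_ROWS) -> int:
    """Number of LIMITed SELECTs needed to read the whole table."""
    return math.ceil(table_rows / limit)


for limit in (1_000_000, 100_000, 20_000):
    print(limit, copy_rounds(limit))
# 1000000 17
# 100000 168
# 20000 839
```

Each of those rounds repeats the per-cycle work listed earlier: a new request from target to source, a new lock, and a new transaction on the source.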
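Now, I began by blaming this on gRPC. However, the above is not enough information to necessarily blame gRPC. Smaller timeout == more calls to vstreamer. Smaller LIMIT == more calls to vstreamer. It makes sense to first suspect the performance of a vstreamer cycle. Is it the LOCK TABLES, perhaps? Is it the transaction WITH CONSISTENT SNAPSHOT? Anything else about vstreamer? The following indicator gives us the answer:
- Runtimes are practically the same whether With Snapshot is TRUE or FALSE.
In the first iteration of the PR, I just created a new streamWithoutSnapshot function:
https://github.com/vitessio/vitess/pull/8044/files#diff-efb6bf2b0113d05b4466e1c54f4abf7ccad01637cc8ad1f01ceb2bb916dd67fdR47-R64
At this time I'm not even trying to justify that it is correct. It is generally not. But, we're in a no-traffic scenario, which means there's no need to lock the table, no need for consistent snapshot. So basically, this function reads the rows the same way, without the lock & snapshot overhead.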
Further development, which I'll discuss later, actually creates smaller LIMIT queries, but continuously streaming. I'll present it shortly, but the conclusion is clear: if you LIMIT without creating new gRPC calls, performance remains stable. BTW, @vmg is similarly looking into decoupling the value of vstream_packet_size from INSERTs on vcopier side; we can batch smaller chunks of INSERTs from one large vstream_packet_size.
But, but, ... What's wrong with the current flow?
It looks like a winner. BTW it is also 2 minutes faster than a gh-ost migration on same table! So is there anything to fix?
A few things:
- The current flow only runs one table at a time. That is, it can operate on multiple tables, but will only copy+catchup+ff a single table at a time. We are looking to parallelize that. Flows like Resharding or MoveTables could see massive gains.
- I claimed earlier that a single SELECT <all-relevant-columns> FROM my_table ORDER BY <pk-columns>, in a CONSISTENT SNAPSHOT transaction is unsustainable. Evidently (and discussed internally) quite a few users have been using this without reporting an issue. Either that executed on a replica, or their workload is somehow different, or my claim is invalid. Or in some gray area in between. I just can't shake off my experience where a high history list length predicts an imminent outage in production.
- Anyway, if we are to read and stream multiple tables at once, we must abandon the Grand Select approach. If only because we can't SELECT ... from multiple tables at once in a single transaction... A transaction requires us to serialize our queries.
So what I'm looking into now, is how to break down the Grand Select into multiple smaller selects, while:
1. Not breaking the correctness of the flow, and:
2. Not degrading performance.
(1) is easy, right? Just use LIMIT 10000. But we see how that leads to terrible run times.
Other disadvantages to the current flow:
- It's "unbalanced". You can spend 1 hour copying rows and then 30min applying binlogs. The direct incentive I had in looking into this was to support an Online DDL progress % & ETA. The current flow makes it difficult to predict an ETA. We can maybe predict how long it's going to take for rowcopy to complete (based on number of table rows and number of copied rows), but we can't predict how long it will take to catch up with the binary logs. In gh-ost, there is a frequent switch between rowcopy and binlog catchup, and so the progress becomes more predictable. I'll discuss the gh-ost logic shortly.
- As per performance traits above, it is essential in our flow that we perform large bulks of row copy and large bulks of binlog catchup, and there is no easy way to interleave both.
- It is wasteful for deleted rows. The Grand Select will read rows even if they're deleted seconds after the snapshot began. vcopier will copy and write down those rows, and vplayer will later delete them.
- The logic is complex.
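The rowcopy part of the estimate is plain extrapolation, sketched below. rowcopy_eta_seconds is a hypothetical helper, and the numbers in the usage line are made up for illustration. The row count itself is typically an estimate (in the table status above, Rows reads 15974442 while count(*) is 16777216), so the sketch caps progress at 100%.

```python
from typing import Optional


def rowcopy_eta_seconds(rows_copied: int, table_rows: int,
                        elapsed_seconds: float) -> Optional[float]:
    """Extrapolate remaining rowcopy time from the progress made so far."""
    if rows_copied <= 0 or table_rows <= 0:
        return None  # nothing to extrapolate from yet
    progress = min(rows_copied / table_rows, 1.0)
    return elapsed_seconds * (1.0 - progress) / progress


print(rowcopy_eta_seconds(4_000_000, 16_000_000, 35.0))  # 105.0
```

This number says nothing about the binlog backlog that still has to be applied afterwards, which is exactly the part the current flow cannot predict.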
Of course, let's not forget about the advantages of the current flow:
- It works! It answers multiple use cases with one single logic. It's amazing.
- It is accurate. The way we wait for a GTID, the way we create a consistent snapshot, the way we catchup and fast-forward, is precise, and the results are predictable at every single transaction.
- I like how the catchup does not bother to apply events > :lastPK. That is, the flow requires this behavior, but a side effect is that we only process relevant rows.
Anyway. If only for the sake of parallelism, I'm looking into eliminating the consistent snapshot and breaking down the Grand Select query into smaller, LIMITed queries.
The gh-ost flow
I'd like to explain how gh-ost runs migrations, without locking and without consistent snapshots or GTID tracking. gh-ost was written when we were not even using GTIDs, but to simplify the discussion and to make it more comparable to the current VReplication flow, I'll describe gh-ost's flow as if it were using GTIDs. It's just syntactic sugar.
I think we can apply the gh-ost flow to Vitess/VReplication, either fully or partially, to achieve parallelism.
The flow has two components: rowcopy and binlog catchup. There is no fast forward. Binlog catchup is prioritized over rowcopy. Do notice that gh-ost rowcopy operates on a single machine, so it runs something like INSERT INTO... SELECT rather than the VReplication two step of SELECT into memory & INSERT read values.
The steps:
- Begin. Create target table etc.
- Mark current GTID.
- Begin tailing the binary logs starting at said GTID
  - We actually begin applying binlog events at this time! But more on this shortly.
- On source table, evaluate min PK and max PK.
  - We will only copy rows in that range.
- Binlog catchup is prioritized. Are there binlog events to consume? If yes:
  - Apply binary logs!
  - Wait. We haven't even copied a single row. What does it mean to apply a binlog event if it's a DELETE? If it's an UPDATE?
  - If it's a DELETE, we delete the row, whether it actually exists or not.
  - If it's an INSERT, we convert it into a REPLACE INTO and execute.
  - If it's an UPDATE, we convert it into a REPLACE INTO and execute. This means an UPDATE will actually create a new row.
  - Continue applying binlog events, one at a time, until we've consumed the binary logs.
- Now that we've consumed the binary log (for now), we can switch to rowcopy.
  - This is our first iteration. Evaluate the PK range of the next 1000 rows to copy.
    1000 is an example; we only copy between 10 and 10000 rows at a time. These limits are hard coded into gh-ost.
  - Issue an INSERT IGNORE INTO _my_table_gho SELECT <relevant-columns> FROM my_table WHERE <pk in computed range>
    Notice that we INSERT IGNORE.
Methodic break. We prioritize binlog events over rowcopy. This is done by making the binlog/catchup writes a REPLACE INTO (always succeeds and overwrites), and making rowcopy INSERT IGNORE (always yields in case of conflict).
Consider that UPDATE we encountered in the catchup phase, when we hadn't even copied a single row. That UPDATE was converted to a REPLACE INTO, creating a new row. Later, in the future, rowcopy will reach that row and attempt to copy it. But it will use INSERT IGNORE so it will yield to the data applied by the binlog event. If no transaction ever changed that row again, then the two (the REPLACE INTO and the INSERT IGNORE for the specific row) will have had the same data anyway.
It is difficult to mathematically formalize the correctness of the algorithm, and it's something I wanted to do a while back. Maybe some day. In any case, it's the same algorithm created in oak-online-schema-change, copied by pt-online-schema-change, and continued by fb-osc and of course gh-ost. It is well tested and has been in production for a decade.
Back to the flow:
- Once done copying the range, we switch back to binlog/catchup. Are there more events? If so, apply them.
- No more events? Copy next range.
- Repeat, repeat, repeat...
- ...Until we've copied the last rows (we've reached maxPK which we evaluated at the beginning of the migration)
- We catchup with the remaining binlog events
- Keep on doing so until replication lag is low or otherwise things look healthy
- Attempt cut-over
- If successful, great
- If not, we used some short timeouts, continue applying binlog events, try cut-over again in next opportunity.
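The interplay of REPLACE INTO (binlog events always win) and INSERT IGNORE (rowcopy always yields) can be tried out in a toy simulation. Everything here is hypothetical scaffolding: tables are dictionaries keyed by an integer PK, binlog events carry the full new value, and traffic is a scripted list of event batches. It is a sketch of the precedence rule, not of gh-ost itself, and cut-over is left out.

```python
from typing import Dict, List, Tuple

Table = Dict[int, str]         # pk -> row value
Event = Tuple[str, int, str]   # (kind, pk, new value)


def apply_event(table: Table, event: Event) -> None:
    kind, pk, value = event
    if kind == "delete":
        table.pop(pk, None)    # delete, whether the row exists or not
    else:
        table[pk] = value      # INSERT and UPDATE both act as REPLACE INTO


def copy_chunk(source: Table, target: Table, lo: int, hi: int) -> None:
    """INSERT IGNORE ... SELECT for lo <= pk <= hi: rows already in target win."""
    for pk in range(lo, hi + 1):
        if pk in source:
            target.setdefault(pk, source[pk])


def migrate(source: Table, traffic: List[List[Event]], chunk: int) -> Table:
    target: Table = {}
    backlog: List[Event] = []
    if not source:
        return target
    lo, max_pk = min(source), max(source)   # evaluated once, at the start
    step = 0
    while lo <= max_pk or backlog or step < len(traffic):
        while backlog:                       # binlog catchup is prioritized
            apply_event(target, backlog.pop(0))
        if step < len(traffic):              # traffic hits the source meanwhile
            for ev in traffic[step]:
                apply_event(source, ev)
                backlog.append(ev)
            step += 1
        if lo <= max_pk:                     # then copy the next range
            copy_chunk(source, target, lo, min(lo + chunk - 1, max_pk))
            lo += chunk
    return target


source = {1: "a", 2: "b", 3: "c", 4: "d"}
traffic = [
    [("update", 3, "c2"), ("delete", 1, "")],
    [("insert", 5, "e"), ("update", 2, "b2")],
    [("delete", 3, ""), ("insert", 6, "f")],
]
target = migrate(source, traffic, chunk=2)
print(target == source)  # True
```

In this run the UPDATE on row 3 reaches the target before rowcopy does, and rows 5 and 6, which lie beyond the initial max PK, arrive only through binlog events.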
Some characteristics of the gh-ost flow:
- It's simple(r). It does not require coordination between rowcopy and binlog/catchup
- rowcopy does not require LOCK TABLES (only the final cut-over phase does)
- rowcopy does not require consistent snapshot transactions
- rowcopy can work in small chunks (in fact, our experiments in production showed little change in overall performance between 50 row chunk size and 1000 row chunk size; for some workloads it could make a difference)
- binlog/catchup does not filter events; it just applies whatever it sees, whether the row is already copied or not.
- It is accurate
  - Except for this particular scenario: adding a new UNIQUE KEY.
  - In this scenario, and because of the INSERT IGNORE and REPLACE INTO queries, when you add a new UNIQUE KEY, gh-ost will silently drop duplicate rows while copying the data or while applying binlog events to the new table.
  - To clarify, the operation will be successful, and the resulting table will be consistent.
  - But people expect a different thing: they expect the migration to fail if the table (or incoming traffic) does not comply with the intended UNIQUE constraint.
  - The VReplication flow, in comparison, will fail the operation.
- The operation is more balanced
  - There is a frequent switch between rowcopy and binlog/catchup
    We prioritise binlog/catchup just because we want to avoid a log backlog and the risk of binlog events being purged, but logically we could switch any time we like.
  - Which means the rowcopy progress is a good predictor of overall progress
  - Again, consider that by definition we keep replication lag low, and binlog backlog low. So at all times we've got a good estimate of the remaining tasks.
  - We get a reliable progress/ETA for the process.
- The operation is more wasteful:
  - gh-ost applies binlog events for rows we haven't copied yet. We could skip those events like VReplication does, at the cost of tracking PK for the table.
  - It is in a sense the opposite of the VReplication wasteful scenario. VReplication is wasteful for deletes, gh-ost is wasteful for INSERTs and UPDATEs.
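The UNIQUE KEY caveat is easy to reproduce. The sketch below uses an in-memory SQLite database as a stand-in for MySQL, so the syntax differs slightly (INSERT OR IGNORE instead of INSERT IGNORE), and the my_table schema with an email column is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
    INSERT INTO my_table VALUES
        (1, 'a@example.com'), (2, 'b@example.com'), (3, 'a@example.com');
    -- The migration adds a UNIQUE constraint that the existing data violates.
    CREATE TABLE _my_table_gho (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE);
""")

# Rowcopy in the gh-ost style: conflicting rows are skipped without an error.
conn.execute(
    "INSERT OR IGNORE INTO _my_table_gho SELECT id, email FROM my_table ORDER BY id"
)
copied = conn.execute("SELECT id, email FROM _my_table_gho ORDER BY id").fetchall()
print(copied)  # [(1, 'a@example.com'), (2, 'b@example.com')]

# A strict copy (plain INSERT) fails on the duplicate instead.
conn.execute("DELETE FROM _my_table_gho")
try:
    conn.execute("INSERT INTO _my_table_gho SELECT id, email FROM my_table ORDER BY id")
    strict_failed = False
except sqlite3.IntegrityError:
    strict_failed = True
print(strict_failed)  # True
```

Row 3 disappears silently in the first variant, which is the behavior users do not expect; the second variant corresponds to a flow that fails the operation.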
Where/how we can merge gh-ost flow into VReplication
I'll begin by again stressing:
- The gh-ost flow cannot solve the issue of adding a UNIQUE KEY. The resulting table is consistent, but the user might expect an error if duplicates are found.
- I'm not sure yet about Materialization with aggregate functions.
Otherwise, the process is sound. The PR actually incorporates gh-ost logic now:
- We keep:
  - only apply binlog events to rows already copied
  - UPDATEs remain UPDATEs for now
- We change:
- Broken logic:
  - We kinda race to the end of the table. Because there's no traffic in this experiment, the table doesn't grow while we copy it. But with iterative SELECTs, we can keep chasing the end of the table as it keeps growing. We need to apply a "last PK" approach like gh-ost does.
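The "chasing the end of the table" problem and the bounded alternative can be sketched as follows. A sorted Python list stands in for the table's primary keys, select_chunk mimics a LIMITed range query, and the simulated traffic appends as many rows per iteration as we copy; all of these are assumptions for illustration.

```python
import bisect
from typing import List, Optional


def select_chunk(table: List[int], last_pk: Optional[int],
                 max_pk: int, limit: int) -> List[int]:
    """Stand-in for: SELECT pk FROM t WHERE pk > :last_pk AND pk <= :max_pk
    ORDER BY pk LIMIT :limit."""
    start = 0 if last_pk is None else bisect.bisect_right(table, last_pk)
    end = bisect.bisect_right(table, max_pk)
    return table[start:min(start + limit, end)]


def copy_bounded(table: List[int], limit: int) -> List[int]:
    copied: List[int] = []
    if not table:
        return copied
    max_pk = table[-1]        # evaluated once, before the first chunk
    last_pk: Optional[int] = None
    while True:
        chunk = select_chunk(table, last_pk, max_pk, limit)
        if not chunk:
            return copied     # reached max_pk: rowcopy is done
        copied.extend(chunk)
        last_pk = chunk[-1]
        # Simulated traffic: the table grows by `limit` rows while we copy.
        next_pk = table[-1] + 1
        table.extend(range(next_pk, next_pk + limit))


table = list(range(1, 101))
copied = copy_bounded(table, limit=10)
print(len(copied), len(table))  # 100 200
```

Without the max_pk bound this loop would never terminate, since the table grows as fast as it is read. Rows inserted beyond max_pk are not lost in the gh-ost flow: they reach the target through binlog events.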
benchmark results
This works for Online DDL; as mentioned above other tests are failing, and there's still some termination condition to handle. We'll get this, but the proof of concept:
| With snapshot? | Traffic? | SELECT ... LIMIT | copyTimeout (seconds) | runtime (seconds) | comments |
|---|---|---|---|---|---|
| FALSE | FALSE | 10000 | 3600 | 140 | |
The above shows we've managed to break the source query into multiple smaller queries, and remain at exact same performance.
It is also the final proof that our bottleneck was gRPC to begin with; not the queries, not database performance.
How can the new flow help us?
- It can help us because we can read from multiple tables, concurrently.
  - We will stream the data sequentially
  - But on the vcopier side, we could split it again, and write 1000 rows of this table, and 1000 of that table, concurrently
- Applying binlog events is immaterial to which table has been copied or has not been copied
  - If we go the full gh-ost flow, we don't even need to track PK per table; this simplifies the logic.
Different approaches?
At the cost of maintaining two different code paths, we could choose to:
- Use the new flow for MoveTables, OnlineDDLs, Resharding, import/export
- Keep the old flow for Materialize
  I'm still unsure about aggregation logic/impact, and @sougou suggested Materialized Views are in particular sensitive to chunk size and prefer mass queries.
- Keep the old flow for Online DDL that adds a new UNIQUE KEY constraint
To be continued
I'll need to fix outstanding issues with the changed flow, and then run a benchmark under load. I want to say I can predict how the new flow will be so much better - but I know reality will have to prove me wrong.
Thoughts are welcome.
cc @vitessio/ps-vitess @rohit-nayak-ps @sougou @deepthi @vmg
Further development, which I'll discuss later, actually creates smaller LIMIT queries, but continuously streaming. I'll present it shortly, but the conclusion is clear: if you LIMIT without creating new gRPC calls, performance remains stable. BTW, @vmg is similarly looking into decoupling the value of vstream_packet_size from INSERTs on the vcopier side; we can batch smaller chunks of INSERTs from one large vstream_packet_size.
But, but, ... What's wrong with the current flow?
It looks like a winner. BTW it is also 2 minutes faster than a gh-ost migration on the same table! So is there anything to fix?
A few things:
- SELECT <all-relevant-columns> FROM my_table ORDER BY <pk-columns>, in a CONSISTENT SNAPSHOT transaction is unsustainable. Evidently (and discussed internally) quite a few users have been using this without reporting an issue. Either that was executed on a replica, or their workload is somehow different, or my claim is invalid. Or it sits in some gray area in between. I just can't shake off my experience where a high history list length predicts an imminent outage in production.
- SELECT ... from multiple tables at once in a single transaction... A transaction requires us to serialize our queries.
So what I'm looking into now is how to break down the Grand Select into multiple smaller selects, while:
(1) is easy, right? Just use LIMIT 10000. But we see how that leads to terrible run times.
Other disadvantages to the current flow:
- gh-ost, there is frequent switch between rowcopy and binlog catchup, and so the progress becomes more predictable. I'll discuss the gh-ost logic shortly.
- vcopier will copy, write down those rows, and vplayer will later delete them.
Of course, let's not forget about the advantages of the current flow:
- > :lastPK. That is, the flow requires this behavior, but a side effect is that we only process relevant rows.
Anyway. If only for the sake of parallelism, I'm looking into eliminating the consistent snapshot and breaking down the Grand Select query into smaller, LIMITed queries.
The gh-ost flow
I'd like to explain how gh-ost runs migrations, without locking and without consistent snapshots or GTID tracking. gh-ost was written when we were not even using GTIDs, but to simplify the discussion and to make it more comparable to the current VReplication flow, I'll describe gh-ost's flow as if it were using GTIDs. It's just syntactic sugar.
I think we can apply the gh-ost flow to Vitess/VReplication, either fully or partially, to achieve parallelism.
The flow has two components: rowcopy and binlog catchup. There is no fast forward. Binlog catchup is prioritized over rowcopy. Do notice that gh-ost rowcopy operates on a single machine, so it runs something like INSERT INTO... SELECT rather than the VReplication two-step of SELECT into memory & INSERT of the read values.
The steps:
- DELETE? If it's an UPDATE?
- DELETE, we delete the row, whether it actually exists or not.
- INSERT, we convert it into a REPLACE INTO and execute.
- UPDATE, we convert it into a REPLACE INTO and execute.
This means an UPDATE will actually create a new row.
- 1000 rows to copy
- 1000 as an example, we only copy between 10 and 10000 rows at a time. These limits are hard coded into gh-ost.
- INSERT IGNORE INTO _my_table_gho SELECT <relevant-columns> FROM my_table WHERE <pk in computed range>
Notice that we INSERT IGNORE.
Methodic break. We prioritize binlog events over rowcopy. This is done by making the binlog/catchup writes a REPLACE INTO (always succeeds and overwrites), and making rowcopy INSERT IGNORE (always yields in case of conflict).
Consider that UPDATE we encountered in the catchup phase, where we hadn't even copied a single row. That UPDATE was converted to a REPLACE INTO, creating a new row. Later, rowcopy will reach that row and attempt to copy it. But it will use INSERT IGNORE, so it will yield to the data applied by the binlog event. If no transaction ever changed that row again, then the two (the REPLACE INTO and the INSERT IGNORE for the specific row) will have had the same data anyway.
It is difficult to mathematically formalize the correctness of the algorithm, and it's something I wanted to do a while back. Maybe some day. In any case, it's the same algorithm created in oak-online-schema-change, copied by pt-online-schema-change, and continued by fb-osc and of course gh-ost. It is well tested and has been in production for a decade.
Back to the flow:
Some characteristics of the gh-ost flow:
- LOCK TABLES (only the final cut-over phase does)
- 50 row chunk size and 1000 row chunk size; for some workloads it could make a difference)
- UNIQUE KEY.
- INSERT IGNORE and REPLACE INTO queries, when you add a new UNIQUE KEY, gh-ost will silently drop duplicate rows while copying the data or while applying binlog events to the new table.
- UNIQUE constraint.
We prioritise binlog/catchup just because we want to avoid a log backlog and the risk of binlog events being purged, but logically we could switch any time we like.
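The conflict rule described earlier (binlog/catchup writes as REPLACE INTO, rowcopy as INSERT IGNORE) and the UNIQUE KEY caveat can both be reproduced in a few lines. The following is a minimal sketch, not gh-ost's or the PR's code. It assumes Python with an in-memory SQLite database standing in for MySQL, so INSERT OR IGNORE is SQLite's spelling of INSERT IGNORE. The table names src and ghost and all helper functions are hypothetical.

```python
import sqlite3


def apply_binlog_event(conn, kind, row_id, val=None):
    """Catchup side: DELETE stays a DELETE; INSERT and UPDATE become REPLACE INTO."""
    if kind == "DELETE":
        # Deletes the row whether it actually exists on the target or not.
        conn.execute("DELETE FROM ghost WHERE id = ?", (row_id,))
    else:
        # Always succeeds and overwrites, so binlog data wins over rowcopy data.
        conn.execute("REPLACE INTO ghost (id, val) VALUES (?, ?)", (row_id, val))


def rowcopy_chunk(conn, lo, hi):
    """Rowcopy side: copy one PK range, yielding to rows already on the target."""
    conn.execute(
        "INSERT OR IGNORE INTO ghost (id, val) "
        "SELECT id, val FROM src WHERE id BETWEEN ? AND ?",
        (lo, hi),
    )


def simulate_migration():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src (id INTEGER PRIMARY KEY, val TEXT)")
    conn.execute("CREATE TABLE ghost (id INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany(
        "INSERT INTO src VALUES (?, ?)", [(i, f"v{i}") for i in range(1, 7)]
    )

    rowcopy_chunk(conn, 1, 3)  # first chunk copied before any traffic

    # Traffic on the source: one change behind the copy position, two ahead of it.
    conn.execute("UPDATE src SET val = 'v2-new' WHERE id = 2")
    conn.execute("UPDATE src SET val = 'v5-new' WHERE id = 5")
    conn.execute("DELETE FROM src WHERE id = 6")
    events = [("UPDATE", 2, "v2-new"), ("UPDATE", 5, "v5-new"), ("DELETE", 6, None)]
    for kind, row_id, val in events:  # binlog catchup runs before the next chunk
        apply_binlog_event(conn, kind, row_id, val)

    rowcopy_chunk(conn, 4, 6)  # row 5 already exists on the target and is skipped

    def dump(table):
        return conn.execute(f"SELECT id, val FROM {table} ORDER BY id").fetchall()

    return dump("src"), dump("ghost")


def copy_with_new_unique_key():
    """Rowcopy into a table that gained a UNIQUE key: duplicates vanish silently."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute("CREATE TABLE ghost (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
    conn.executemany("INSERT INTO src VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "a")])
    conn.execute("INSERT OR IGNORE INTO ghost SELECT id, email FROM src ORDER BY id")
    return conn.execute("SELECT id, email FROM ghost ORDER BY id").fetchall()
```

In this toy run the UPDATE on row 5 creates the row on the target before rowcopy reaches it, and the later rowcopy simply yields. The second function shows why a new UNIQUE KEY is the problematic case: no error is raised, the duplicate row is just not copied.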
gh-ost applies binlog events for rows we haven't copied yet. We could skip those events like VReplication does, at the cost of tracking PK for the table.
It is in a sense the opposite of the VReplication wasteful scenario. VReplication is wasteful for deletes, gh-ost is wasteful for INSERTs and UPDATEs.
Where/how we can merge gh-ost flow into VReplication
I'll begin by again stressing:
- gh-ost flow cannot solve the issue of adding a UNIQUE KEY. The resulting table is consistent, but the user might expect an error if duplicates are found.
Otherwise, the process is sound. The PR actually incorporates gh-ost logic now:
- UPDATEs remain UPDATEs for now
- INSERT into REPLACE INTO:
https://github.com/vitessio/vitess/pull/8044/files#diff-944642971422bb00ed68466f784f8b586f5e800f57e0c46002d6194c517cb1a0R531
- vstreamer runs multiple small SELECTs (with LIMIT), but without terminating streaming; the results are all concatenated and streamed back:
https://github.com/vitessio/vitess/pull/8044/files#diff-862152928bc2f7cafae8ed7dd0fa49608a7f2f60615dc6131059876cd42c0087R268-R286
https://github.com/vitessio/vitess/pull/8044/files#diff-862152928bc2f7cafae8ed7dd0fa49608a7f2f60615dc6131059876cd42c0087R166-R197
https://github.com/vitessio/vitess/pull/8044/files#diff-efb6bf2b0113d05b4466e1c54f4abf7ccad01637cc8ad1f01ceb2bb916dd67fdR47-R64
- gh-ost does.
Benchmark results
This works for Online DDL; as mentioned above, other tests are failing, and there's still some termination condition to handle. We'll get this, but the proof of concept:
The above shows we've managed to break the source query into multiple smaller queries, and remain at exact same performance.
It is also the final proof that our bottleneck was gRPC to begin with; not the queries, not database performance.
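The idea of running several small LIMITed SELECTs while keeping a single stream open can be sketched as a generator. This is a minimal illustration in Python over an in-memory SQLite table, not the vstreamer code from the PR. The names stream_rows and make_demo_table, the stats counter, and the positive integer id column are all hypothetical.

```python
import sqlite3
from typing import Dict, Iterator, Tuple


def make_demo_table(n_rows: int) -> sqlite3.Connection:
    """Build a small in-memory table to stream from (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany(
        "INSERT INTO my_table VALUES (?, ?)",
        [(i, f"row-{i}") for i in range(1, n_rows + 1)],
    )
    return conn


def stream_rows(
    conn: sqlite3.Connection, chunk_size: int, stats: Dict[str, int]
) -> Iterator[Tuple[int, str]]:
    """Yield all rows in PK order using many small queries but one open stream.

    The consumer iterates once; it never sees the chunk boundaries, so the
    query size is decoupled from the number of calls the consumer makes.
    """
    last_pk = 0  # assumes positive integer primary keys
    while True:
        rows = conn.execute(
            "SELECT id, val FROM my_table WHERE id > ? ORDER BY id LIMIT ?",
            (last_pk, chunk_size),
        ).fetchall()
        stats["queries"] += 1
        yield from rows
        if len(rows) < chunk_size:
            return  # a short (or empty) chunk means the table is exhausted
        last_pk = rows[-1][0]
```

A caller would do something like `for row in stream_rows(conn, 1000, stats): ...` once, regardless of how many chunks the source ends up reading. Each chunk resumes from the last primary key seen, so no long-running transaction is held across chunks in this sketch.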
How can the new flow help us?
- vcopier side, we could split it again, and write 1000 rows of this table, and 1000 of that table, concurrently
- gh-ost flow, we don't even need to track PK per table, this simplifies the logic.
Different approaches?
At the cost of maintaining two different code paths, we could choose to:
I'm still unsure about aggregation logic/impact, and @sougou suggested Materialized Views are in particular sensitive to chunk size and prefer mass queries.
- UNIQUE KEY constraint
To be continued
I'll need to fix outstanding issues with the changed flow, and then run a benchmark under load. I want to say I can predict how the new flow will be so much better - but I know reality will have to prove me wrong.
Thoughts are welcome.
cc @vitessio/ps-vitess @rohit-nayak-ps @sougou @deepthi @vmg