vreplication: performance & benchmarks (table copying) #7881

Merged: 6 commits, Apr 21, 2021

Conversation

@vmg vmg (Collaborator) commented Apr 16, 2021

Description

Over the upcoming weeks I'm going to be working on vreplication performance. It's a pretty complex part of Vitess' codebase and it hasn't received a lot of performance tuning in the past, so I'm hoping to get up to speed with the implementation and give it some perf attention.

Let's take a look at the changes in this PR, commit by commit:

  • endtoend: improve vreplication tests: Critically, there are currently no benchmarks for vreplication, and it turns out the system is complex enough (with interdependencies between vtctl, vtgate, vttablet and mysqld) that writing microbenchmarks for it is not feasible. Instead, to get a starting point for performance tuning, I've written some end-to-end performance tests, which give us benchmarking estimates and let us look at performance profiles for all the moving parts in the system. The first test included here tackles the simplest (possibly?) and most performance-sensitive part of vreplication: the initial table copy.

    For those following along at home, I might as well explain some vreplication basics:

    A vreplication stream links two vttablet instances by copying one or more tables between the underlying mysqld instances of each tablet and keeping them in sync. To accomplish this, the stream is divided into three phases: first, there is a table copying phase (what we're profiling and optimizing in this PR), which does bulk reads from the source tablet and sends the rows to the target for bulk inserting. Afterwards, there is a catchup phase, which applies any row changes that happened on the source while the initial copy was running. Once the stream has caught up and there's no significant replication lag between the two tablets, we switch to the final phase, where we replicate rows in real time from the MySQL binary log.

    The table copying phase is, for many vreplication workflows, the most performance-intensive part of the process. For instance, if you're using vreplication to re-shard a cluster, you'll want to copy the tables as fast as possible, and once the target has caught up, switch traffic to the new shards. Hence, you'd spend the vast majority of the time doing the copy and very little time actually streaming. Other uses, such as materializing views, make much more use of streaming, but streaming can usually keep up with the changes in the source cluster because most clusters are not write-heavy, so optimizing it doesn't feel as urgent to me. We'll look at streaming in the following weeks.

  • pprof: do not create profiles until we start profiling: this is a small cleanup to the profiling handlers we wrote for vtgate and vttablet. These handlers don't start profiling in their respective processes until they receive a USR2 signal, but they were previously creating an on-disk profile file as soon as the process started. Since our integration suite doesn't profile all the available tablets in the integration cluster, this resulted in several zero-sized profiles on disk for the tablets that were not being profiled.

  • endtoend: allow bulk-loading databases: another important cleanup for our endtoend suite. We need very large datasets in MySQL in order to gather realistic profiling benchmarks, and building these for each test run can be very expensive (expensive enough to make the profile-optimize-build cycle very painful). To speed up this operation, this commit improves Vitess' mysqlctl helper so that spawned mysqld instances can have a configurable secure-file-priv path. This flag allows us to use MySQL's LOAD DATA INFILE to bulk-load millions of rows from CSV files on disk.

  • vreplication: improve copy performance: this is the first actual optimization derived from the CPU profiles of our copy test. It tackles table copy performance on the target side (i.e. the tablet receiving the rows to be inserted locally). The result is very good: it reduces CPU usage by 60% and memory allocations by 90%. Unfortunately, the effective improvement in wall-clock time is not as large: between 15% and 20% faster runtime. (A rough sketch of the technique follows after this list.)

    Some thoughts on this: one of the issues we've seen in large production clusters that use vreplication is that, since the replication streams connect two vttablets, the replication process is competing for CPU cycles with the actual query serving code, which also runs in the vttablets. From this point of view, the massive CPU reduction in this optimization is a very good thing, because it means more spare CPU cycles to continue serving MySQL queries from the tablet. Why does it translate into only a 15% reduction in runtime? If we look at a syscall blocking profile of the tablet, the answer is clear: the optimization of the insert queries is comprehensive enough that it brings the runtime of the table copying process all the way down to MySQL insert performance. The copying process cannot go any faster because MySQL doesn't ingest our rows any faster.

    Not all is lost, of course: there are still things we can do to make MySQL ingest the rows faster, and I intend to look into that next week. Another side effect of the receiving vttablet now being bottlenecked on MySQL insert performance is that, in theory, there are no gains to be had by optimizing row generation in the source vttablet. In practice, the same issue we've just discussed also applies to the source tablet: the replication flow competes for CPU cycles with the actual query load, so any optimizations we make there, while not reducing the wall-clock runtime of the replication process, will result in more query throughput on production servers. I intend to tackle that next week too.
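To make the shape of that change concrete, here is a minimal sketch of the general technique (render a whole batch of rows into a single multi-value INSERT using a reusable buffer), under the assumption that per-row statement building was the allocation hot spot. The rowBuffer, writeValue and buildBulkInsert names are illustrative only, not the actual vreplication code:

package main

import (
	"fmt"
	"strings"
)

// rowBuffer is a minimal stand-in for the reusable Buffer type added in this
// PR (shown later in the review): Reset keeps the backing array, so once the
// buffer has grown to fit a batch, subsequent batches stop allocating.
type rowBuffer struct{ bytes []byte }

func (b *rowBuffer) Reset()               { b.bytes = b.bytes[:0] }
func (b *rowBuffer) WriteByte(c byte)     { b.bytes = append(b.bytes, c) }
func (b *rowBuffer) WriteString(s string) { b.bytes = append(b.bytes, s...) }
func (b *rowBuffer) String() string       { return string(b.bytes) }

// writeValue is a stand-in for per-type SQL encoding; the real code uses
// sqltypes.Value encoders that escape according to the column type.
func writeValue(buf *rowBuffer, val string) {
	buf.WriteByte('\'')
	buf.WriteString(strings.ReplaceAll(val, "'", "''"))
	buf.WriteByte('\'')
}

// buildBulkInsert renders one multi-value INSERT for a whole batch of rows
// instead of one statement per row, reusing buf across calls.
func buildBulkInsert(buf *rowBuffer, table string, columns []string, rows [][]string) string {
	buf.Reset()
	buf.WriteString("INSERT INTO ")
	buf.WriteString(table)
	buf.WriteString(" (")
	buf.WriteString(strings.Join(columns, ", "))
	buf.WriteString(") VALUES ")
	for i, row := range rows {
		if i > 0 {
			buf.WriteByte(',')
		}
		buf.WriteByte('(')
		for j, val := range row {
			if j > 0 {
				buf.WriteByte(',')
			}
			writeValue(buf, val)
		}
		buf.WriteByte(')')
	}
	return buf.String()
}

func main() {
	var buf rowBuffer
	rows := [][]string{{"1", "alice"}, {"2", "bob"}}
	fmt.Println(buildBulkInsert(&buf, "customer", []string{"id", "name"}, rows))
	// INSERT INTO customer (id, name) VALUES ('1','alice'),('2','bob')
}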

For now, let's wrap this up by taking a look at the impact of this optimization on the flame graphs for the vttablet process:

Before: (flame graph of the vttablet CPU profile before the optimization)

After: (flame graph of the vttablet CPU profile after the optimization)

We can see two clear columns in both profiles: the Go runtime (on the left) and our own code (on the right). The profile of our own code is further divided into two clear columns: gRPC streaming reads (left) and query preparation for the MySQL bulk inserts (right). The result of optimizing the query preparation step is significant, and it shows in how the relative CPU usage of gRPC, which used to be roughly 25% of the runtime, now becomes 75% of the total runtime. Obviously, the actual CPU usage of gRPC is constant between the two profiles, so the fact that it now dominates our CPU profile means that query preparation is using only residual amounts of CPU. The next possible optimization here is Protobuf (de)serialization for the incoming rows, which I intend to tackle next week.
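(As a rough back-of-the-envelope check: if gRPC accounted for about 25 units out of 100 in the old profile and its absolute cost is unchanged, then its being ~75% of the new profile implies the new total is around 33 units, i.e. the query-preparation side dropped from roughly 75 units to single digits, which is consistent with the ~60% CPU reduction quoted above.)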

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

@deepthi deepthi (Member) left a comment

🚀

Comment on lines 64 to 75
func (buf *Buffer) StringUnsafe() string {
	return *(*string)(unsafe.Pointer(&buf.bytes))
}

func (buf *Buffer) Reset() {
	buf.bytes = buf.bytes[:0]
}

Member

I'm surprised the linter didn't catch this. Exported methods need comments.

@@ -284,6 +284,36 @@ func (v Value) EncodeSQL(b BinWriter) {
}
}

// EncodeSQLStringBuilder is identical to EncodeSQL but it takes a strings.Builder
// as its writer, so it can be inlined for performance.
func (v Value) EncodeSQLStringBuilder(b *strings.Builder) {
Member

We have two sets of duplicated code in this file that are crying out for generics.
I don't suppose it is worth generating them (since it is only two copies).
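For illustration only, here is a sketch of the kind of type-parameterized helper this is alluding to. It is purely hypothetical: Go generics were not yet in a stable release when this PR was written, and the sqlWriter / encodeSQLString names are not from the Vitess codebase:

package main

import (
	"fmt"
	"strings"
)

// sqlWriter captures just the methods the encoder needs. Both *strings.Builder
// and a bytes-backed buffer type could satisfy it, and using it as a type
// parameter (rather than an interface value) lets the compiler specialize and
// potentially inline each instantiation.
type sqlWriter interface {
	WriteString(s string) (int, error)
	WriteByte(c byte) error
}

// encodeSQLString is a hypothetical shared implementation; the real encoders
// escape according to the column type rather than always quoting a string.
func encodeSQLString[W sqlWriter](val string, w W) {
	w.WriteByte('\'')
	w.WriteString(strings.ReplaceAll(val, "'", "''"))
	w.WriteByte('\'')
}

func main() {
	var sb strings.Builder
	encodeSQLString("O'Reilly", &sb)
	fmt.Println(sb.String()) // prints 'O''Reilly'
}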

@@ -211,7 +207,7 @@ func (vc *VitessCluster) AddKeyspace(t *testing.T, cells []*Cell, ksName string,
keyspace.VSchema = vschema
for _, cell := range cells {
if len(cell.Vtgates) == 0 {
fmt.Println("Starting vtgate")
t.Logf("Starting vtgate")
Member

Do we still get logs to console with this change?
I usually run go test -alsologtostderr -v.

@shlomi-noach shlomi-noach (Contributor) commented Apr 19, 2021

Noting that I'm going to review this, probably slowly 😛

@vmg vmg force-pushed the vmg/vrperf branch 3 times, most recently from b59e211 to 5df0ff8 on April 20, 2021 at 10:01
vmg added 5 commits on April 21, 2021, each signed off by Vicent Marti <vmg@strn.cat>.
}
defer conn.Close()

query := fmt.Sprintf("LOAD DATA INFILE '%s' INTO TABLE `%s` FIELDS TERMINATED BY ',' ENCLOSED BY '\"'", tmpbulk.Name(), table)
Contributor

This assumes you are importing a relation that exactly matches the target table. Just making sure: LOAD DATA INFILE will not work this way if:

  1. the target table is a materialized view and has more columns than the original table, or
  2. this is an Online DDL operation, and the target table may look altogether different from the source table

Collaborator Author

Right, this is only meant for loading data in the unit tests, so the author of the test must make sure that the CSV file it writes actually fits the target table. I'm not sure how we could enforce that programmatically.
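For readers unfamiliar with the pattern, here is a rough sketch of how a test might bulk-load a CSV this way, using the standard database/sql package rather than Vitess' own mysqlctl helpers. The DSN, file path and helper name are placeholders; the CSV must already sit under the server's secure-file-priv directory, and its columns must line up with the target table:

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // registers the "mysql" driver with database/sql
)

// bulkLoadCSV loads a CSV file that already lives under the server's
// secure-file-priv directory into the given table via LOAD DATA INFILE.
func bulkLoadCSV(db *sql.DB, csvPath, table string) error {
	query := fmt.Sprintf(
		"LOAD DATA INFILE '%s' INTO TABLE `%s` FIELDS TERMINATED BY ',' ENCLOSED BY '\"'",
		csvPath, table,
	)
	_, err := db.Exec(query)
	return err
}

func main() {
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:3306)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := bulkLoadCSV(db, "/var/lib/mysql-files/customers.csv", "customer"); err != nil {
		log.Fatal(err)
	}
}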
