
perf: vreplication client CPU usage #7951

Merged · 6 commits · Apr 27, 2021

Conversation

@vmg (Collaborator) commented Apr 26, 2021

Description

Hi all! This week we're doing performance work on the vreplication client. As discussed in our previous PR, the replication process is already bottlenecked on MySQL insert performance, so any optimizations we perform on the client at this point are not going to show up in wall time, but they're still very relevant because the replicator runs inside the vttablets, so reducing CPU usage here allows us to serve more queries concurrently with the replication process.

For this round of optimizations, I've simply chipped away at all the warm points in the profile for a vttablet client performing replication. Let's break down the individual optimizations in this PR, starting with the whole profile:

[annotated flame graph of the full vreplication client profile]

I've annotated the warm spots in the profile in green. Let's go through them from left to right.

1. mysql.(*Conn).readOnePacket

There is an interesting "glitch" in this part of the flame graph that caught my attention. You can see that most of the calls into the runtime's allocation code are warranted (we are allocating a byte slice for each packet because the slice is later re-used as backing storage for the parsed rows -- this is efficient enough), but there's a significant amount of CPU time spent allocating memory in the middle column, under mysql.(*Conn).readHeaderFrom.

Looking at the code for this function, it's clear that it was designed to not allocate memory, with a tiny buffer on the stack, and yet it's a constant source of allocations:

vitess/go/mysql/conn.go, lines 326 to 355 at c5cd306:

func (c *Conn) readHeaderFrom(r io.Reader) (int, error) {
	var header [packetHeaderSize]byte
	// Note io.ReadFull will return two different types of errors:
	// 1. if the socket is already closed, and the go runtime knows it,
	//    then ReadFull will return an error (different than EOF),
	//    something like 'read: connection reset by peer'.
	// 2. if the socket is not closed while we start the read,
	//    but gets closed after the read is started, we'll get io.EOF.
	if _, err := io.ReadFull(r, header[:]); err != nil {
		// The special casing of propagating io.EOF up
		// is used by the server side only, to suppress an error
		// message if a client just disconnects.
		if err == io.EOF {
			return 0, err
		}
		if strings.HasSuffix(err.Error(), "read: connection reset by peer") {
			return 0, io.EOF
		}
		return 0, vterrors.Wrapf(err, "io.ReadFull(header size) failed")
	}
	sequence := uint8(header[3])
	if sequence != c.sequence {
		return 0, vterrors.Errorf(vtrpc.Code_INTERNAL, "invalid sequence, expected %v got %v", c.sequence, sequence)
	}
	c.sequence++
	return int(uint32(header[0]) | uint32(header[1])<<8 | uint32(header[2])<<16), nil
}

The issue can be found trivially with escape analysis (running the compiler with go build -gcflags="-m"): the 4-byte array that is supposed to live on the stack is actually being allocated on the heap for each function call. This is because the Go compiler really struggles to perform escape analysis across interface calls (such as io.Reader): it can't tell that the subslice into the array is not being kept by the called code, so it has to place the array on the heap.

Unfortunately, we cannot monomorphise the io.Reader argument to help the compiler with escape analysis (because the Conn can be either a buffered or an unbuffered connection), so we resort to the second-best option: if we move the small array into the Conn itself, it will only be allocated once per Conn instead of once per packet. e0b90da
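
To illustrate the pattern, here is a minimal, self-contained sketch (my own names, not the actual Vitess code): the same fixed-size buffer escapes to the heap when declared locally and passed through an io.Reader interface call, but is allocated only once when it lives on the struct.

package main

import (
	"bytes"
	"fmt"
	"io"
)

const packetHeaderSize = 4

// Before: the local array escapes to the heap on every call, because the
// compiler cannot prove that r (an interface) doesn't retain header[:].
func readHeaderEscaping(r io.Reader) (int, error) {
	var header [packetHeaderSize]byte
	if _, err := io.ReadFull(r, header[:]); err != nil {
		return 0, err
	}
	return int(uint32(header[0]) | uint32(header[1])<<8 | uint32(header[2])<<16), nil
}

// After: the buffer lives on the connection, so it is allocated once per conn.
type conn struct {
	header [packetHeaderSize]byte
}

func (c *conn) readHeader(r io.Reader) (int, error) {
	if _, err := io.ReadFull(r, c.header[:]); err != nil {
		return 0, err
	}
	return int(uint32(c.header[0]) | uint32(c.header[1])<<8 | uint32(c.header[2])<<16), nil
}

func main() {
	// go build -gcflags="-m" flags the local header array as escaping to the
	// heap in the first version only.
	c := &conn{}
	n, _ := c.readHeader(bytes.NewReader([]byte{0x05, 0x00, 0x00, 0x01}))
	fmt.Println(n) // 5
}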

Before / After: [profile screenshots omitted]

2. mysql.(*Conn).parseRow

The parseRow function in Vitess' Conn was already optimized very significantly in a previous PR: we stopped copying the contents from the row's backing storage when converting them into individual Values. However, we can still see in the flame graph that there is a significant amount of CPU usage in runtime allocation code. What is being allocated here is the actual slice that contains the Values, i.e. result []Value, and this is a particularly wasteful allocation because the slice is always the exact same size for a query, so it could be reused over and over again while reading rows.

Commit d6674b6 does just that, while taking special care to keep the API backwards compatible. The parseRow function now takes an optional target slice where the Values will be stored, but still gracefully handles the case where the target slice is nil by allocating one of the required size. This allows users of the API to keep the old allocating behavior by passing nil instead of an existing slice. It also allows ergonomic usage by keeping the argument and the return value of the function the same type:

https://github.com/vitessio/vitess/blob/d6674b66779f3e02c32b045218b055927d0868e9/go/vt/vttablet/tabletserver/vstreamer/rowstreamer.go#L261-L264

With this calling design, the first time FetchNext is called, its argument will be nil, so a new slice will be allocated to store the rows, but in subsequent calls, the slice will be automatically reused.
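
As a rough illustration of that calling convention, here is a sketch with made-up types (parseRowInto and Value below are mine, not the real mysql.(*Conn).parseRow code): the function accepts an optional destination slice, allocates one only when it is nil, and returns it so the caller can feed it back in on the next call.

package main

import "fmt"

type Value struct{ raw []byte }

// parseRowInto mimics the nil-tolerant reuse API described above: if dst is
// nil, a slice of the right size is allocated; otherwise dst is reused.
func parseRowInto(packet [][]byte, dst []Value) []Value {
	if dst == nil {
		dst = make([]Value, len(packet))
	}
	for i, col := range packet {
		dst[i] = Value{raw: col}
	}
	return dst
}

func main() {
	packets := [][][]byte{
		{[]byte("1"), []byte("alice")},
		{[]byte("2"), []byte("bob")},
	}
	var row []Value // nil on the first call: allocated once, then reused
	for _, p := range packets {
		row = parseRowInto(p, row)
		fmt.Println(string(row[0].raw), string(row[1].raw))
	}
}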

Before / After: [profile screenshots omitted]

3. sqltypes.RowToProto3

The conversion of Vitess row data ([]Value) into the ProtoBuf form that will be sent over the wire is the most expensive part of the whole row streaming process. This is bad, particularly because there is actually no conversion to be performed here: it's just copying the raw data of the rows into the equivalent ProtoBuf-specified struct. As seen on the flame graph, the massive cost comes from allocating the same Row objects over and over again.

There are many ways to work around this performance issue. The most obvious one would be using an object pool to reuse Row objects between calls. This would be simple to implement, but it has a shortcoming: a shared pool of Row objects would contain pooled objects of very different sizes, because the average size of an individual row varies wildly depending on the specific table currently being copied. We can do better than a global pool if we scope down memory reuse to our local copying operation, where all the Row objects will be similarly sized, as they contain the exact same number of columns.

Commit dafc1cb performs the local memory reuse without a memory pool: instead, it uses a slice of Row objects, which is faster than a pool, is bounded in size (because there is a maximum size for the packets that the streamer will send), and can be reused directly in the Response object. A new sqltypes.RowToProto3Inplace function has been implemented to allow conversion without allocating a new Row object, and to minimize duplication, the old RowToProto3 has been reimplemented on top of the new API.
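
A minimal sketch of the idea follows (the Value and Row types and function names below are stand-ins; the actual sqltypes.RowToProto3Inplace signature in the commit may differ): the in-place variant writes into an existing Row, and the old allocating entry point just wraps it.

package main

import "fmt"

// Stand-ins for sqltypes.Value and the proto Row (lengths + flattened values).
type Value struct{ raw []byte }
type Row struct {
	Lengths []int64
	Values  []byte
}

// rowToProto3InPlace fills an existing Row, reusing its backing arrays.
func rowToProto3InPlace(row []Value, out *Row) {
	out.Lengths = out.Lengths[:0]
	out.Values = out.Values[:0]
	for _, v := range row {
		out.Lengths = append(out.Lengths, int64(len(v.raw)))
		out.Values = append(out.Values, v.raw...)
	}
}

// rowToProto3 keeps the old allocating behavior on top of the in-place API.
func rowToProto3(row []Value) *Row {
	out := &Row{}
	rowToProto3InPlace(row, out)
	return out
}

func main() {
	// A bounded slice of rows, sized by the streamer's packet limit, can be
	// reused across responses instead of pooling individual Row objects.
	reusable := make([]Row, 2)
	rowToProto3InPlace([]Value{{[]byte("a")}, {[]byte("bc")}}, &reusable[0])
	rowToProto3InPlace([]Value{{[]byte("xyz")}}, &reusable[1])
	fmt.Println(reusable[0].Lengths, string(reusable[0].Values))
	fmt.Println(rowToProto3([]Value{{[]byte("q")}}))
}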

Before / After: [profile screenshots omitted]

4. throttle.(*Client).ThrottleCheckOK

Ooh, here's a spicy one. It's not often that one sees a significant amount of CPU time spent in wall-clock time acquisition (i.e. runtime.nanotime for Go programs). Getting the system's time has been an extremely fast operation on Linux systems for many years; back in the day, clock access was gated behind a syscall, with the consequent overhead of context switching into kernel space. This is not the case anymore: time access for userland processes is now available via a vDSO, wherein the kernel maps the relevant clock data directly into our process' memory space, so there is essentially no overhead when accessing it from our programs. Of course, the Go runtime takes advantage of this optimization (manually -- the Go runtime has to re-implement a lot of the vDSO logic that exists in libc because the Go runtime doesn't use libc 🙃), and yet we are spending a lot of CPU time in our Throttler implementation just measuring the current time.

Why is this? Well, the Throttler is calling time.Since to make sure that we don't call the external throttler service too often. The throttler is effectively throttled at one call per 250ms. How ironic. But it turns out that we do a lot of throttler checks -- we attempt to throttle once per row, which results in millions of checks per second. We need a more efficient way to ensure the throttler only runs once every 250ms.

In a more advanced systems language, I believe the correct approach would be to use a coarse clock source for this time calculation (i.e. CLOCK_MONOTONIC_COARSE, https://linux.die.net/man/2/clock_gettime), but no such interface is exposed by the Go runtime. Commit f5af133 works around this manually by creating a global ticker, shared between all the Throttlers in a process, that increments a counter every 250ms. This way, we can delay all the throttle checks simply by ensuring that the current "tick" is newer than the "tick" when the last throttle check was performed. ✨
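
A minimal sketch of the shared-ticker approach (simplified, with my own names; not the exact throttle.Client code): one background goroutine advances a global tick counter every 250ms, and each check compares integers instead of reading the clock per row.

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

var globalTick int64

func init() {
	// One ticker for the whole process, shared by every throttler.
	go func() {
		for range time.Tick(250 * time.Millisecond) {
			atomic.AddInt64(&globalTick, 1)
		}
	}()
}

type throttlerClient struct {
	lastCheckTick int64
}

// shouldCheck is true at most once per tick, i.e. at most every 250ms,
// without calling time.Since / runtime.nanotime on the hot path.
func (c *throttlerClient) shouldCheck() bool {
	tick := atomic.LoadInt64(&globalTick)
	if tick <= c.lastCheckTick {
		return false
	}
	c.lastCheckTick = tick
	return true
}

func main() {
	c := &throttlerClient{}
	checks := 0
	for i := 0; i < 10_000_000; i++ { // one attempt per row
		if c.shouldCheck() {
			checks++ // only here would we call the external throttler
		}
	}
	fmt.Println("expensive checks performed:", checks)
}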

Before: [profile screenshot omitted]. After: ThrottleCheckOK is now missing from the profile because it uses essentially no CPU time.

5. vstreamer.(*Plan).filter

Last, but not least, the filter functionality of the replication engine allocates a very significant amount of memory to perform the filtering. Why? Because it copies the filtered Values into a newly allocated slice on every call, instead of filtering in place.

func (plan *Plan) filter(values []sqltypes.Value) (bool, []sqltypes.Value, error) {
	for _, filter := range plan.Filters {
		switch filter.Opcode {
		case Equal:
			result, err := evalengine.NullsafeCompare(values[filter.ColNum], filter.Value)
			if err != nil {
				return false, nil, err
			}
			if result != 0 {
				return false, nil, nil
			}
		case VindexMatch:
			ksid, err := getKeyspaceID(values, filter.Vindex, filter.VindexColumns)
			if err != nil {
				return false, nil, err
			}
			if !key.KeyRangeContains(filter.KeyRange, ksid) {
				return false, nil, nil
			}
		}
	}
	result := make([]sqltypes.Value, len(plan.ColExprs))
	for i, colExpr := range plan.ColExprs {
		if colExpr.ColNum == -1 {
			result[i] = colExpr.FixedValue
			continue
		}
		if colExpr.ColNum >= len(values) {
			return false, nil, fmt.Errorf("index out of range, colExpr.ColNum: %d, len(values): %d", colExpr.ColNum, len(values))
		}
		if colExpr.Vindex == nil {
			result[i] = values[colExpr.ColNum]
		} else {
			ksid, err := getKeyspaceID(values, colExpr.Vindex, colExpr.VindexColumns)
			if err != nil {
				return false, nil, err
			}
			result[i] = sqltypes.MakeTrusted(sqltypes.VarBinary, []byte(ksid))
		}
	}
	return true, result, nil
}

If we look at the actual filtering code, we learn that the trivial optimization, filtering in place, is not feasible, because the ordering of the resulting slice is not always linear with the input (see result[i] = values[colExpr.ColNum] in the second loop above). Our only choice is, again, to pass in a backing slice to store the results so that it can be reused between calls. Not the most elegant refactoring, but one with a significant impact on the profile:
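
Roughly, the shape of that refactor looks like the sketch below (simplified types and names of my own; not the actual vstreamer change): the caller owns a backing slice, the function truncates and refills it, and returns it so it can be passed back in for the next row.

package main

import "fmt"

type Value string

type colExpr struct {
	ColNum     int // -1 means "emit FixedValue"
	FixedValue Value
}

type plan struct{ ColExprs []colExpr }

// filterInto reuses the caller-provided backing slice instead of allocating a
// fresh result slice per row.
func (p *plan) filterInto(values, result []Value) []Value {
	result = result[:0]
	for _, ce := range p.ColExprs {
		if ce.ColNum == -1 {
			result = append(result, ce.FixedValue)
			continue
		}
		result = append(result, values[ce.ColNum])
	}
	return result
}

func main() {
	p := &plan{ColExprs: []colExpr{{ColNum: 1}, {ColNum: -1, FixedValue: "fixed"}, {ColNum: 0}}}
	var buf []Value // allocated on the first append, reused for every later row
	for _, row := range [][]Value{{"a", "b"}, {"c", "d"}} {
		buf = p.filterInto(row, buf)
		fmt.Println(buf)
	}
}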

Before / After: [profile screenshots omitted]

Conclusion

These 5 commits bring down the total CPU usage for our vreplication benchmark from 17.21s to 8.43s, cutting it roughly in half (the new total is 49% of the original).

[profile after this round of optimizations]

The profile after this round of optimizations is now dominated by MySQL client reads and gRPC serialization. There is very little we can do about MySQL, because we need to read from MySQL and that's not free, and there's very little we can do about gRPC without forking it, so as far as I'm concerned this is a pretty flat profile.

Again, and sadly, these optimizations have no visible impact on wall times because of the bottleneck in MySQL insertion speed, but they have a massive impact on vttablet throughput as a whole, because we have halved the number of CPU cycles that the replication process uses. The key takeaway here, for those following at home, is that reducing memory allocations in a GC'd language doesn't just save memory, it mostly saves CPU time.

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

vmg added 5 commits April 26, 2021 12:30
Signed-off-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Vicent Marti <vmg@strn.cat>
The 4-byte header slice which we pass to io.ReadFull is actually being
allocated on the heap because the compiler can't tell through the
io.Reader interface that the slice doesn't escape.

If we move it to our Conn struct, it'll no longer be allocated once for
each read. This is a meaningful performance improvement!

Signed-off-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Vicent Marti <vmg@strn.cat>
@derekperkins (Member) commented:

This is awesome. Without digging into the code, how many of these optimizations are shared across standard vttablet query serving? It seems like at least 1-3 are going to positively impact non-vreplication too.

  for c.fields != nil {
-     rows, err := c.FetchNext()
+     rows, err := c.FetchNext(row[:0])
A collaborator commented on this diff:

help me understand - why do we need to do this slice subset of a just created slice?

@vmg (Author) replied:

We actually do not; I had this in place from an older iteration of the interface. Right now it doesn't make an actual difference, as the changes to the slice's contents inside of FetchNext are not reflected on the original slice, so this is redundant.

@systay (Collaborator) left a comment:

Looks great to me. Should still be reviewed by someone that knows vrepl better, I think.

@shlomi-noach (Contributor) left a comment:

Love it. Thank you for the throttler fix.

Signed-off-by: Vicent Marti <vmg@strn.cat>
@harshit-gangal harshit-gangal merged commit 37c09d3 into vitessio:master Apr 27, 2021
@vmg (Collaborator, Author) commented Apr 27, 2021

Without digging into the code, how many of these optimizations are shared across standard vttablet query serving?

@derekperkins: 1 and 2 are going to have an impact; number 3 could have an impact if I moved more places of the codebase to use the new Inplace API -- I just haven't found any of those places that are actually hot!
