
perf: vreplication client CPU usage #7951

Merged · 6 commits · Apr 27, 2021

Conversation

@vmg (Collaborator) commented Apr 26, 2021

Description

Hi all! This week we're doing performance work on the vreplication client. As discussed in our previous PR, the replication process is already bottlenecked on MySQL insert performance, so any optimizations we perform on the client at this point are not going to show up in wall time, but they're still very relevant because the replicator runs inside the vttablets, so reducing CPU usage here allows us to serve more queries concurrently with the replication process.

For this round of optimizations, I've simply chipped away at all the warm points in the profile for a vttablet client performing replication. Let's break down the individual optimizations in this PR, starting with the whole profile:

[annotated flame graph of the full vreplication client profile]

I've annotated the warm spots in the profile in green. Let's go through them from left to right.

1. mysql.(*Conn).readOnePacket

There is an interesting "glitch" in this part of the flame graph that caught my attention. You can see that most of the calls into the runtime's allocation code are warranted (we are allocating a byte slice for each packet because the slice is later re-used as backing storage for the parsed rows -- this is efficient enough), but there's a significant amount of CPU time spent allocating memory in the middle column, under mysql.(*Conn).readHeaderFrom.

Looking at the code for this function, it's clear that it was designed to not allocate memory, with a tiny buffer on the stack, and yet it's a constant source of allocations:

vitess/go/mysql/conn.go, lines 326 to 355 at c5cd306:

func (c *Conn) readHeaderFrom(r io.Reader) (int, error) {
	var header [packetHeaderSize]byte
	// Note io.ReadFull will return two different types of errors:
	// 1. if the socket is already closed, and the go runtime knows it,
	//    then ReadFull will return an error (different than EOF),
	//    something like 'read: connection reset by peer'.
	// 2. if the socket is not closed while we start the read,
	//    but gets closed after the read is started, we'll get io.EOF.
	if _, err := io.ReadFull(r, header[:]); err != nil {
		// The special casing of propagating io.EOF up
		// is used by the server side only, to suppress an error
		// message if a client just disconnects.
		if err == io.EOF {
			return 0, err
		}
		if strings.HasSuffix(err.Error(), "read: connection reset by peer") {
			return 0, io.EOF
		}
		return 0, vterrors.Wrapf(err, "io.ReadFull(header size) failed")
	}
	sequence := uint8(header[3])
	if sequence != c.sequence {
		return 0, vterrors.Errorf(vtrpc.Code_INTERNAL, "invalid sequence, expected %v got %v", c.sequence, sequence)
	}
	c.sequence++
	return int(uint32(header[0]) | uint32(header[1])<<8 | uint32(header[2])<<16), nil
}

The issue can be found trivially with escape analysis (running the compiler with go build -gcflags="-m"): the 4-byte array that is supposed to live on the stack is actually being allocated on the heap for each function call. This is because the Go compiler really struggles to perform escape analysis across interface calls (such as io.Reader): it can't tell that the subslice into the array is not being kept by the called code, so it has to place the array on the heap.

Unfortunately, we cannot monomorphise the io.Reader argument to help the compiler with escape analysis (because the Conn can be either a buffered or an unbuffered connection), so we resort to the second-best option: if we move the small array into the Conn itself, it will only be allocated once per Conn instead of once per packet. e0b90da
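
To illustrate the pattern, here is a minimal, self-contained sketch (my own names, not the actual Vitess code): the same fixed-size buffer escapes to the heap when declared locally and passed through an io.Reader interface call, but is allocated only once when it lives on the struct.

package main

import (
	"bytes"
	"fmt"
	"io"
)

const packetHeaderSize = 4

// Before: the local array escapes to the heap on every call, because the
// compiler cannot prove that r (an interface) doesn't retain header[:].
func readHeaderEscaping(r io.Reader) (int, error) {
	var header [packetHeaderSize]byte
	if _, err := io.ReadFull(r, header[:]); err != nil {
		return 0, err
	}
	return int(uint32(header[0]) | uint32(header[1])<<8 | uint32(header[2])<<16), nil
}

// After: the buffer lives on the connection, so it is allocated once per conn.
type conn struct {
	header [packetHeaderSize]byte
}

func (c *conn) readHeader(r io.Reader) (int, error) {
	if _, err := io.ReadFull(r, c.header[:]); err != nil {
		return 0, err
	}
	return int(uint32(c.header[0]) | uint32(c.header[1])<<8 | uint32(c.header[2])<<16), nil
}

func main() {
	// go build -gcflags="-m" flags the local header array as escaping to the
	// heap in the first version only.
	c := &conn{}
	n, _ := c.readHeader(bytes.NewReader([]byte{0x05, 0x00, 0x00, 0x01}))
	fmt.Println(n) // 5
}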

Before / After: [profile screenshots omitted]

2. mysql.(*Conn).parseRow

The parseRow function in Vitess' Conn was already optimized very significantly in a previous PR: we stopped copying the contents from the row's backing storage when converting them into individual Values. However, we can still see in the flame graph that there is a significant amount of CPU usage in runtime allocation code. What is being allocated here is the actual slice that contains the Values, i.e. result []Value, and this is a particularly wasteful allocation because the slice is always the exact same size for a query, so it could be reused over and over again while reading rows.

Commit d6674b6 does just that, while taking special care to keep the API backwards compatible. The parseRow function now takes an optional target slice where the Values will be stored, but still gracefully handles the case where the target slice is nil by allocating one of the required size. This allows users of the API to keep the old allocating behavior by passing nil instead of an existing slice. It also allows ergonomic usage by keeping the argument and the return value of the function the same type:

https://github.com/vitessio/vitess/blob/d6674b66779f3e02c32b045218b055927d0868e9/go/vt/vttablet/tabletserver/vstreamer/rowstreamer.go#L261-L264

With this calling design, the first time FetchNext is called, its argument will be nil, so a new slice will be allocated to store the rows, but in subsequent calls, the slice will be automatically reused.
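
As a rough illustration of that calling convention, here is a sketch with made-up types (parseRowInto and Value below are mine, not the real mysql.(*Conn).parseRow code): the function accepts an optional destination slice, allocates one only when it is nil, and returns it so the caller can feed it back in on the next call.

package main

import "fmt"

type Value struct{ raw []byte }

// parseRowInto mimics the nil-tolerant reuse API described above: if dst is
// nil, a slice of the right size is allocated; otherwise dst is reused.
func parseRowInto(packet [][]byte, dst []Value) []Value {
	if dst == nil {
		dst = make([]Value, len(packet))
	}
	for i, col := range packet {
		dst[i] = Value{raw: col}
	}
	return dst
}

func main() {
	packets := [][][]byte{
		{[]byte("1"), []byte("alice")},
		{[]byte("2"), []byte("bob")},
	}
	var row []Value // nil on the first call: allocated once, then reused
	for _, p := range packets {
		row = parseRowInto(p, row)
		fmt.Println(string(row[0].raw), string(row[1].raw))
	}
}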

Before / After: [profile screenshots omitted]

3. sqltypes.RowToProto3

The conversion of Vitess row data ([]Value) into the ProtoBuf form that will be sent over the wire is the most expensive part of the whole row streaming process. This is bad, particularly because there is actually no conversion to be performed here: it's just copying the raw data of the rows into the equivalent ProtoBuf-specified struct. As seen on the flame graph, the massive cost comes from allocating the same Row objects over and over again.

There are many ways to work around this performance issue. The most obvious one would be using an object pool to reuse Row objects between calls. This would be simple to implement, but it has a shortcoming: a shared pool of Row objects would contain pooled objects of very different sizes, because the average size of an individual row varies wildly depending on the specific table currently being copied. We can do better than a global pool if we scope down memory reuse to our local copying operation, where all the Row objects will be similarly sized, as they contain the exact same number of columns.

Commit dafc1cb performs the local memory reuse without a memory pool: instead, it uses a slice of Row objects, which is faster than a pool, is bounded in size (because there is a maximum size for the packets that the streamer will send), and can be reused directly in the Response object. A new sqltypes.RowToProto3Inplace function has been implemented to allow conversion without allocating a new Row object, and to minimize duplication, the old RowToProto3 has been reimplemented on top of the new API.
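
A minimal sketch of the idea follows (the Value and Row types and function names below are stand-ins; the actual sqltypes.RowToProto3Inplace signature in the commit may differ): the in-place variant writes into an existing Row, and the old allocating entry point just wraps it.

package main

import "fmt"

// Stand-ins for sqltypes.Value and the proto Row (lengths + flattened values).
type Value struct{ raw []byte }
type Row struct {
	Lengths []int64
	Values  []byte
}

// rowToProto3InPlace fills an existing Row, reusing its backing arrays.
func rowToProto3InPlace(row []Value, out *Row) {
	out.Lengths = out.Lengths[:0]
	out.Values = out.Values[:0]
	for _, v := range row {
		out.Lengths = append(out.Lengths, int64(len(v.raw)))
		out.Values = append(out.Values, v.raw...)
	}
}

// rowToProto3 keeps the old allocating behavior on top of the in-place API.
func rowToProto3(row []Value) *Row {
	out := &Row{}
	rowToProto3InPlace(row, out)
	return out
}

func main() {
	// A bounded slice of rows, sized by the streamer's packet limit, can be
	// reused across responses instead of pooling individual Row objects.
	reusable := make([]Row, 2)
	rowToProto3InPlace([]Value{{[]byte("a")}, {[]byte("bc")}}, &reusable[0])
	rowToProto3InPlace([]Value{{[]byte("xyz")}}, &reusable[1])
	fmt.Println(reusable[0].Lengths, string(reusable[0].Values))
	fmt.Println(rowToProto3([]Value{{[]byte("q")}}))
}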

Before / After: [profile screenshots omitted]

4. throttle.(*Client).ThrottleCheckOK

Ooh, here's a spicy one. It's not often that one sees a significant amount of CPU time spent in wall-clock time acquisition (i.e. runtime.nanotime for Go programs). Getting the system's time has been an extremely fast operation on Linux systems for many years; back in the day, clock access was gated behind a syscall, with the consequent overhead of context switching into kernel space. This is not the case anymore: time access for userland processes is now available via a vDSO, wherein the kernel maps the relevant clock data directly into our process' memory space, so there is essentially no overhead when accessing it from our programs. Of course, the Go runtime takes advantage of this optimization (manually -- the Go runtime has to re-implement a lot of the vDSO logic that exists in libc because the Go runtime doesn't use libc 🙃), and yet we are spending a lot of CPU time in our Throttler implementation just measuring the current time.

Why is this? Well, the Throttler is calling time.Since to make sure that we don't call the external throttler service too often. The throttler is effectively throttled at one call per 250ms. How ironic. But it turns out that we do a lot of throttler checks -- we attempt to throttle once per row, which results in millions of checks per second. We need a more efficient way to ensure the throttler only runs once every 250ms.

In a more advanced systems language, I believe the correct approach would be to use a coarse clock source for this time calculation (i.e. CLOCK_MONOTONIC_COARSE, https://linux.die.net/man/2/clock_gettime), but no such interface is exposed by the Go runtime. Commit f5af133 works around this manually by creating a global ticker, shared between all the Throttlers in a process, that increments a counter every 250ms. This way, we can delay all the throttle checks simply by ensuring that the current "tick" is newer than the "tick" when the last throttle check was performed. ✨
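
A minimal sketch of the shared-ticker approach (simplified, with my own names; not the exact throttle.Client code): one background goroutine advances a global tick counter every 250ms, and each check compares integers instead of reading the clock per row.

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

var globalTick int64

func init() {
	// One ticker for the whole process, shared by every throttler.
	go func() {
		for range time.Tick(250 * time.Millisecond) {
			atomic.AddInt64(&globalTick, 1)
		}
	}()
}

type throttlerClient struct {
	lastCheckTick int64
}

// shouldCheck is true at most once per tick, i.e. at most every 250ms,
// without calling time.Since / runtime.nanotime on the hot path.
func (c *throttlerClient) shouldCheck() bool {
	tick := atomic.LoadInt64(&globalTick)
	if tick <= c.lastCheckTick {
		return false
	}
	c.lastCheckTick = tick
	return true
}

func main() {
	c := &throttlerClient{}
	checks := 0
	for i := 0; i < 10_000_000; i++ { // one attempt per row
		if c.shouldCheck() {
			checks++ // only here would we call the external throttler
		}
	}
	fmt.Println("expensive checks performed:", checks)
}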

Before: [profile screenshot omitted]. After: ThrottleCheckOK is now missing from the profile because it uses essentially no CPU time.

5. vstreamer.(*Plan).filter

Last, but not least, the filter functionality of the replication engine allocates a very significant amount of memory to perform the filtering. Why? Because it copies the filtered Values into a newly allocated slice on every call, instead of filtering in place.

func (plan *Plan) filter(values []sqltypes.Value) (bool, []sqltypes.Value, error) {
	for _, filter := range plan.Filters {
		switch filter.Opcode {
		case Equal:
			result, err := evalengine.NullsafeCompare(values[filter.ColNum], filter.Value)
			if err != nil {
				return false, nil, err
			}
			if result != 0 {
				return false, nil, nil
			}
		case VindexMatch:
			ksid, err := getKeyspaceID(values, filter.Vindex, filter.VindexColumns)
			if err != nil {
				return false, nil, err
			}
			if !key.KeyRangeContains(filter.KeyRange, ksid) {
				return false, nil, nil
			}
		}
	}
	result := make([]sqltypes.Value, len(plan.ColExprs))
	for i, colExpr := range plan.ColExprs {
		if colExpr.ColNum == -1 {
			result[i] = colExpr.FixedValue
			continue
		}
		if colExpr.ColNum >= len(values) {
			return false, nil, fmt.Errorf("index out of range, colExpr.ColNum: %d, len(values): %d", colExpr.ColNum, len(values))
		}
		if colExpr.Vindex == nil {
			result[i] = values[colExpr.ColNum]
		} else {
			ksid, err := getKeyspaceID(values, colExpr.Vindex, colExpr.VindexColumns)
			if err != nil {
				return false, nil, err
			}
			result[i] = sqltypes.MakeTrusted(sqltypes.VarBinary, []byte(ksid))
		}
	}
	return true, result, nil
}

If we look at the actual filtering code, we learn that the trivial optimization, filtering in place, is not feasible, because the ordering of the resulting slice is not always linear with the input (see result[i] = values[colExpr.ColNum] in the second loop above). Our only choice is, again, to pass in a backing slice to store the results so that it can be reused between calls. Not the most elegant refactoring, but one with a significant impact on the profile:
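
Roughly, the shape of that refactor looks like the sketch below (simplified types and names of my own; not the actual vstreamer change): the caller owns a backing slice, the function truncates and refills it, and returns it so it can be passed back in for the next row.

package main

import "fmt"

type Value string

type colExpr struct {
	ColNum     int // -1 means "emit FixedValue"
	FixedValue Value
}

type plan struct{ ColExprs []colExpr }

// filterInto reuses the caller-provided backing slice instead of allocating a
// fresh result slice per row.
func (p *plan) filterInto(values, result []Value) []Value {
	result = result[:0]
	for _, ce := range p.ColExprs {
		if ce.ColNum == -1 {
			result = append(result, ce.FixedValue)
			continue
		}
		result = append(result, values[ce.ColNum])
	}
	return result
}

func main() {
	p := &plan{ColExprs: []colExpr{{ColNum: 1}, {ColNum: -1, FixedValue: "fixed"}, {ColNum: 0}}}
	var buf []Value // allocated on the first append, reused for every later row
	for _, row := range [][]Value{{"a", "b"}, {"c", "d"}} {
		buf = p.filterInto(row, buf)
		fmt.Println(buf)
	}
}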

Before / After: [profile screenshots omitted]

Conclusion

These 5 commits bring down the total CPU usage for our vreplication benchmark from 17.21s to 8.43s, cutting it roughly in half (the new total is 49% of the original).

[profile after this round of optimizations]

The profile after this round of optimizations is now dominated by MySQL client reads and gRPC serialization. There is very little we can do about MySQL, because we need to read from MySQL and that's not free, and there's very little we can do about gRPC without forking it, so as far as I'm concerned this is a pretty flat profile.

Again, and sadly, these optimizations have no visible impact on wall times because of the bottleneck in MySQL insertion speed, but they have a massive impact on vttablet throughput as a whole, because we have halved the number of CPU cycles that the replication process uses. The key takeaway here, for those following at home, is that reducing memory allocations in a GC'd language doesn't just save memory, it mostly saves CPU time.

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

vmg added 5 commits April 26, 2021 12:30
Signed-off-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Vicent Marti <vmg@strn.cat>
The 4-byte header slice which we pass to io.ReadFull is actually being
allocated on the heap because the compiler can't tell through the
io.Reader interface that the slice doesn't escape.

If we move it to our Conn struct, it'll no longer be allocated once for
each read. This is a meaningful performance improvement!

Signed-off-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Vicent Marti <vmg@strn.cat>
@derekperkins (Member) commented:

This is awesome. Without digging into the code, how many of these optimizations are shared across standard vttablet query serving? It seems like at least 1-3 are going to positively impact non-vreplication too.

  for c.fields != nil {
-     rows, err := c.FetchNext()
+     rows, err := c.FetchNext(row[:0])
A collaborator commented on this diff:

help me understand - why do we need to do this slice subset of a just created slice?

@vmg (Author) replied:

We actually do not; I had this in place from an older iteration of the interface. Right now it doesn't make an actual difference, as the changes to the slice's contents inside of FetchNext are not reflected on the original slice, so this is redundant.

@systay (Collaborator) left a comment:

Looks great to me. Should still be reviewed by someone that knows vrepl better, I think.

@shlomi-noach (Contributor) left a comment:

Love it. Thank you for the throttler fix.

Signed-off-by: Vicent Marti <vmg@strn.cat>
@harshit-gangal harshit-gangal merged commit 37c09d3 into vitessio:master Apr 27, 2021
@vmg (Collaborator, Author) commented Apr 27, 2021

Without digging into the code, how many of these optimizations are shared across standard vttablet query serving?

@derekperkins: 1 and 2 are going to have an impact; number 3 could have an impact if I moved more places of the codebase to use the new Inplace API -- I just haven't found any of those places that are actually hot!
