perf: vreplication client CPU usage #7951
Conversation
Signed-off-by: Vicent Marti <vmg@strn.cat>
The 4-byte header slice which we pass to ReadAll is actually being allocated on the heap, because the compiler can't tell through the io.Reader interface that the slice doesn't escape. If we move it to our Conn struct, it'll stop being allocated once for each read. This is a meaningful performance improvement!
This is awesome. Without digging into the code, how many of these optimizations are shared across standard vttablet query serving? It seems like at least 1-3 are going to positively impact non-vreplication too.
```diff
 for c.fields != nil {
-	rows, err := c.FetchNext()
+	rows, err := c.FetchNext(row[:0])
```
Help me understand: why do we need to take this subslice of a just-created slice?
We actually do not; I had this in place from an older iteration of the interface. Right now it doesn't make an actual difference: the changes to the slice's contents inside of `FetchNext` are not reflected in the original slice, so this is redundant.
Looks great to me. Should still be reviewed by someone that knows vrepl better, I think.
Love it. Thank you for the throttler fix.
@derekperkins: 1 and 2 are going to have an impact; number 3 could have an impact if I moved more places in the codebase to use the new
## Description
Hi all! This week we're doing performance work for the `vreplication` client. As discussed in our previous PR, the replication process is already bottlenecked on MySQL insert performance, so any optimizations we perform on the client at this point are not going to show up in wall time, but they're still very relevant: the replicator runs inside of the `vttablet`s, so reducing CPU usage here allows us to serve more queries concurrently with the replication process.

For this round of optimizations, I've simply chipped away at all the warm points in the profile for a `vttablet` client performing replication. Let's break down the individual optimizations in this PR, starting with the whole profile. I've annotated the warm spots in the profile in green. Let's go through them from left to right.
### 1. `mysql.(*Conn).readOnePacket`
There is an interesting "glitch" in this part of the flame graph that caught my attention. You can see that most of the calls into the runtime's allocation code are warranted (we are allocating a byte slice for each packet because the slice is later re-used as backing storage for the parsed rows -- this is efficient enough), but there's a significant amount of CPU time spent allocating memory in the middle column, under `mysql.(*Conn).readHeaderFrom`.

Looking at the code for this function, it's clear that it was designed not to allocate memory, with a tiny buffer on the stack, and yet it's a constant source of allocations:

vitess/go/mysql/conn.go, lines 326 to 355 in c5cd306
The issue can be found trivially with escape analysis (running the compiler with `go build -gcflags="-m"`): the 4-byte array that is supposed to live on the stack is actually being allocated on the heap for each function call. This is because the Go compiler really struggles to perform escape analysis across interface calls (such as `io.Reader`): it can't tell that the subslice into the array is not being kept by the called code, so it has to place it on the heap.

Unfortunately, we cannot monomorphise the `io.Reader` argument to help the compiler with escape analysis (because the `Conn` can be either a buffered or unbuffered conn), so we resort to the second best option: if we move the small array into `Conn` itself, it'll only be allocated once per `Conn` as opposed to once per packet (e0b90da).
### 2. `mysql.(*Conn).parseRow`
The `parseRow` function in Vitess' `Conn` was already optimized very significantly in a previous PR: we stopped copying the contents from the row's backing storage when converting them into individual `Value`s. However, we can still see in the flame graph that there is a significant amount of CPU usage in runtime allocation code. What is being allocated here is the actual slice that contains the `Value`s, i.e. `result []Value`, and this is a particularly wasteful allocation because the slice is always the exact same size for a query, so it could be reused over and over again while reading rows.

Commit d6674b6 does just that, while taking special care with backwards compatibility in the API. The `parseRow` function now takes an optional target slice where the `Value`s will be stored, but still gracefully handles the case where the target slice is `nil` by allocating it to the required size. This allows users of the API to continue using the allocating version by passing `nil` instead of an existing slice. It also allows ergonomic usage by keeping the argument and return value of the function the same:

https://github.com/vitessio/vitess/blob/d6674b66779f3e02c32b045218b055927d0868e9/go/vt/vttablet/tabletserver/vstreamer/rowstreamer.go#L261-L264
With this calling design, the first time `FetchNext` is called its argument will be `nil`, so a new slice is allocated to store the rows; in subsequent calls, the slice is automatically reused.
### 3. `sqltypes.RowToProto3`
The conversion of Vitess row data (`[]Value`) into the ProtoBuf form that will be sent over the wire is the most expensive part of the whole row streaming process. This is bad, particularly because there is actually no conversion to be performed here: it's just copying the raw data of the rows into the equivalent ProtoBuf-specified struct. As seen in the flame graph, the massive cost comes from allocating the same `Row` objects over and over again.

There are many ways to work around this performance issue. The most obvious one would be using an object pool to reuse `Row` objects between calls. This would be simple to implement, but it has a shortcoming: a shared pool of `Row` objects would contain pooled objects of very different sizes, because the average size of an individual row varies wildly depending on the specific table that is currently being copied. We can do better than a global pool if we scope memory reuse down to our local copying operation, where all the `Row` objects will be similarly sized, as they contain the exact same number of columns.

Commit dafc1cb performs the local memory reuse without using a memory pool: instead, it uses a slice of `Row` objects, which is faster than a pool, bounded (because there is a maximum size for the packets that the streamer will send), and reusable directly in the `Response` object. A new `sqltypes.RowToProto3Inplace` has been implemented to allow conversion without allocating a new `Row` object, and to minimize duplication, the old `RowToProto3` has been reimplemented on top of the new API.
### 4. `throttle.(*Client).ThrottleCheckOK`
Ooh, here's a spicy one. It's not often that one sees a significant amount of CPU time spent in wall-clock time acquisition (i.e. `runtime.nanotime` for Go programs). Getting the system's time has been an extremely fast operation on Linux for many years; back in the day, clock access was gated behind a syscall, with the consequent overhead of context switching into kernel space. This is not the case anymore: time access for userland processes is now available via a vDSO, wherein the kernel maps the relevant clock data directly into our process' memory space, so there is essentially no overhead when accessing it from our programs. Of course, the Go runtime takes advantage of this optimization (manually -- the Go runtime has to re-implement a lot of the vDSO logic that exists in `libc`, because the Go runtime doesn't use `libc` 🙃), and yet we are spending a lot of CPU time in our Throttler implementation just measuring the current time.

Why is this? Well, the Throttler is calling `time.Since` to make sure that we don't call the external throttler service too often. The throttler is effectively throttled at one call per 250ms. How ironic. But it turns out that we do a lot of throttler checks -- we attempt to throttle once per row, which results in millions of checks per second. We need a more efficient way to ensure the throttler only runs once every 250ms.

In a more advanced systems language, I believe the correct approach would be to use a coarse clock source for this time calculation (i.e. `CLOCK_MONOTONIC_COARSE`, https://linux.die.net/man/2/clock_gettime), but such an interface is not exposed by the Go runtime. Commit f5af133 works around this manually by creating a global ticker, shared between all the Throttlers in a process, that increments a counter once per 250ms. This way, we can delay all the throttle checks simply by ensuring that the current "tick" is newer than the "tick" when the last throttle was performed. ✨ `ThrottleCheckOK` is now missing from the profile because it uses essentially no CPU time.
### 5. `vstreamer.(*Plan).filter`
Last, but not least, the `filter` functionality of the replication engine seems to be allocating a very significant amount of memory to perform the filtering. Why? Because it stores the resulting slice of `Value`s into a new slice of `Value`s, instead of filtering in place.

vitess/go/vt/vttablet/tabletserver/vstreamer/planbuilder.go, lines 116 to 158 in c5cd306
If we look at the actual filtering code, we learn that the trivial optimization, filtering in place, is not feasible, because the ordering of the resulting slice is not always linear with the input (see `result[i] = values[colExpr.ColNum]` on line 148). Our only choice is, again, to pass in a backing slice to store the results so that it can be reused between calls. Not the most elegant refactoring, but one with a significant impact on the profile.
### Conclusion

These 5 commits bring down the total CPU usage for our `vreplication` benchmark from 17.21s to 8.43s, a 51% decrease.

The profile after this round of optimizations is now dominated by MySQL client reads and GRPC serialization. There is very little we can do about MySQL, because we need to read from MySQL and that's not free; and there's very little we can do about GRPC without forking it, so as far as I'm concerned this is a pretty flat profile.
Again, and sadly, these optimizations have no visible impact on wall times because of the bottleneck in MySQL insertion speed, but they have a massive impact on `vttablet` throughput as a whole, because we have halved the number of CPU cycles the replication process uses. The key takeaway here, for those following along at home, is that reducing memory allocations in a GC'd language doesn't just save memory; it mostly saves CPU time.

## Related Issue(s)
## Checklist

## Deployment Notes

## Impacted Areas in Vitess

Components that this PR will affect: