
vttablet: stream consolidation #7752

Merged 12 commits into vitessio:master from vmg/stream-consolidation on Apr 7, 2021
Conversation

@vmg (Collaborator) commented Mar 26, 2021

Description

Happy Friday everyone! Here's the first iteration of a feature I believe is going to have a very significant impact in some of Vitess' largest deployments.

Here's the problem we're trying to solve: thundering herds. That's basically it. Sometimes databases suffer a usage pattern called a "thundering herd" where hundreds/thousands of queries happen at the same time, and all these queries are identical. Most of the time, this is caused either by a deployment, or by a cache that has been cleared or has expired.

Vitess already does a very good job of handling thundering herds in vttablet with our query consolidation engine, which prevents identical point queries from going to the upstream MySQL server: only one of the queries goes upstream, and all the others wait for the initial query to return. The issue is that our current consolidation engine does not support streaming queries -- understandable, because this is a hard problem to solve, but also particularly bad, because streaming queries are by definition more expensive than point queries, so consolidating them is particularly beneficial to performance. In practice, these queries often "block" one of our MySQL streaming connections for a long time, and since we have a limited number of connections from vttablet to MySQL, this results in a snowball effect where every query in the herd takes longer than the previous one, because it needs to wait for all the previous queries to pick up a connection and release it back to the pool.
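For readers unfamiliar with the existing engine, the point-query consolidation idea boils down to something like the following minimal sketch (illustrative names only, not Vitess's actual code): the first caller of a given SQL string becomes the leader and goes to MySQL, while identical concurrent callers wait and share its result.

    // A minimal sketch of point-query consolidation. Illustrative only.
    package consolidate

    import "sync"

    // Result is a stand-in for a real result set.
    type Result struct {
        Rows [][]string
    }

    type pending struct {
        wg     sync.WaitGroup
        result *Result
        err    error
    }

    // Consolidator deduplicates identical in-flight point queries.
    type Consolidator struct {
        mu       sync.Mutex
        inFlight map[string]*pending
    }

    func NewConsolidator() *Consolidator {
        return &Consolidator{inFlight: make(map[string]*pending)}
    }

    // Do executes exec for sql at most once concurrently: the first caller
    // (the leader) goes to MySQL; identical concurrent callers block and
    // receive the leader's result instead of issuing their own query.
    func (c *Consolidator) Do(sql string, exec func() (*Result, error)) (*Result, error) {
        c.mu.Lock()
        if p, ok := c.inFlight[sql]; ok {
            c.mu.Unlock()
            p.wg.Wait() // follower: wait for the leader to finish
            return p.result, p.err
        }
        p := &pending{}
        p.wg.Add(1)
        c.inFlight[sql] = p
        c.mu.Unlock()

        p.result, p.err = exec() // leader: the only call that reaches MySQL

        c.mu.Lock()
        delete(c.inFlight, sql)
        c.mu.Unlock()
        p.wg.Done()
        return p.result, p.err
    }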

So, let's try to solve this problem: this PR implements a consolidation engine for streaming queries. I'm not sure whether this is a novel algorithm, but I haven't found any prior art on it, so I had to cook it up myself. It was tricky to get right, and it's definitely complex, but the results seem worth it.

The consolidation engine for streaming queries is configurable with a given amount of memory, because it essentially acts as a cache. When a streaming query starts, the engine keeps track of it and of all the intermediate rows it's returning from MySQL. If another identical stream shows up, it automatically catches up to the original stream with the data that has been cached in memory, and any new rows streamed from MySQL are fanned out to it. Once the original stream has grown large enough in memory, we clear its intermediate cache and new queries can no longer catch up to it, but all the existing followers keep receiving new rows as they come in.

In practice this works very well: with the default settings of 128MB total to be used for consolidation, and 2MB max for any consolidation stream, we can capture massive thundering herds even if they come with a latency differential of up to 250ms. The consolidation engine is smart, so when an identical query comes in that can no longer catch up to the existing consolidation, it automatically promotes itself to consolidation leader for any follow-up queries. A herd of 10,000 identical queries spread over 4s can be handled with only 4 MySQL streams, as opposed to, huh, 10,000.
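A rough sketch of the leader/follower mechanics described above follows; the type names, the chunk shape, and the buffer accounting are my own simplification, not the code in this PR.

    // Illustrative sketch of stream consolidation: a leader streams chunks
    // from MySQL, keeps an in-memory catch-up buffer until it grows too
    // large, and fans every new chunk out to its followers.
    package streamconsolidate

    import "sync"

    // Chunk is a stand-in for a partial streaming result set.
    type Chunk struct {
        Rows [][]string
    }

    type streamInFlight struct {
        mu          sync.Mutex
        buffer      []*Chunk // catch-up cache; nil once it grows too large
        buffered    int      // rows buffered so far (stand-in for a byte budget)
        maxBuffered int
        followers   []chan *Chunk
        done        bool
    }

    // join replays the catch-up buffer through callback and registers a
    // channel for future chunks. It returns false when the stream can no
    // longer be caught up with; the caller then becomes a new leader.
    func (s *streamInFlight) join(callback func(*Chunk) error) (<-chan *Chunk, bool) {
        s.mu.Lock()
        defer s.mu.Unlock()
        if s.done || s.buffer == nil {
            return nil, false
        }
        for _, c := range s.buffer {
            _ = callback(c) // replay what the leader has already streamed
        }
        ch := make(chan *Chunk, 16)
        s.followers = append(s.followers, ch)
        return ch, true
    }

    // publish is called by the leader for every chunk received from MySQL.
    func (s *streamInFlight) publish(c *Chunk) {
        s.mu.Lock()
        defer s.mu.Unlock()
        if s.buffer != nil {
            s.buffer = append(s.buffer, c)
            s.buffered += len(c.Rows)
            if s.buffered > s.maxBuffered {
                s.buffer = nil // stop accepting new followers; existing ones stay
            }
        }
        for i, ch := range s.followers {
            if ch == nil {
                continue
            }
            select {
            case ch <- c: // fan out the new chunk
            default:
                // Follower can't keep up; the real engine disconnects it
                // with an explicit error instead of blocking the leader.
                close(ch)
                s.followers[i] = nil
            }
        }
    }

    // finish is called by the leader once MySQL has no more rows.
    func (s *streamInFlight) finish() {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.done = true
        for _, ch := range s.followers {
            if ch != nil {
                close(ch)
            }
        }
    }

In this sketch, a query whose join fails simply starts its own stream and becomes the leader that later arrivals consolidate onto, which is the promotion behavior described above.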

The performance impact of this optimization is hard to graph, because the previous behavior is very pathological and the new behavior is very straightforward. As a sample:

[Graph: per-query latency of the thundering-herd benchmark, current vttablet (orange) vs. stream consolidation (blue)]

This is a thundering herd pattern that is performing an expensive query (1100 rows taking up to 2MB in total) from up to a thousand simultaneous connections. The latency pattern for the current version of vttablet (seen in orange) is all over the place: depending on the ordering that the scheduler gave to the incoming queries, the earliest queries take 30ms to finish, while the latest ones go all the way to 1.4 seconds -- this is because, again, every query needs to wait for all the previous ones to release the borrowed MySQL connections.

The consolidation engine (seen in blue), well, it consolidates the 1000 queries into a single stream, so they all finish simultaneously between 40 and 50ms.

This is it for now: I'm putting this up early for review/testing while I'm away next week for Easter. The code may have bugs, but the testing is comprehensive, with integration tests for all error behaviors (including clients that lag behind -- that was tricky to get right) and end-to-end tests, with full -race coverage.

Outstanding Issues

Nothing big. My main concern is that we're allocating more memory, because the existing MySQL streaming client was reusing the rows of every Result set after yielding it (it's no longer safe to do this, because the result set is kept in memory temporarily to allow other queries to catch up). I've worked around the issue conservatively; an ideal approach would switch the MySQL connection to using a memory pool for row reuse.
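For illustration, the memory-pool idea could look roughly like the sketch below, using sync.Pool and stand-in types; this is not the change that eventually landed.

    // Sketch of pooling row buffers so the streaming connection can reuse
    // allocations instead of allocating a fresh slice per chunk.
    package rowpool

    import "sync"

    // Row is a stand-in for sqltypes.Row.
    type Row []string

    var rowBufPool = sync.Pool{
        New: func() interface{} { return make([]Row, 0, 256) },
    }

    // getRows returns an empty slice with pooled capacity for the next chunk.
    func getRows() []Row {
        return rowBufPool.Get().([]Row)[:0]
    }

    // putRows recycles a slice. The important invariant for the consolidator:
    // a buffer must not be recycled while it is still cached for catch-up or
    // referenced by a follower that hasn't consumed it yet.
    func putRows(rows []Row) {
        rowBufPool.Put(rows)
    }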

The other outstanding issue is that the results of the stream callbacks are now immutable -- understandably, as they're now shared between clients. I had to lift a keyspace-replacement mutation up the stack so it doesn't happen more than once across the consolidated queries, and the result was clean, but if somebody introduces another of these mutations higher in the stack in the future, it'll break things. The Go type system does not let us model this pointer as immutable because it uses type theory from 1973, so we'll have to keep an eye on the race detector.
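To make the constraint concrete, the safe pattern is "mutate once, then share": any rewrite of the result happens before fan-out, and every callback afterwards treats the pointer as read-only. The types and names below are illustrative, not the PR's code.

    // Illustrative sketch of the mutate-once rule for consolidated callbacks.
    package fanout

    // Result stands in for the real shared result type; ReplaceKeyspace
    // stands in for the keyspace rewrite mentioned above.
    type Result struct {
        Keyspace string
    }

    func (r *Result) ReplaceKeyspace(ks string) { r.Keyspace = ks }

    // fanOut applies the single allowed mutation before the result is shared.
    // From that point on every callback sees the same pointer and must not
    // modify it; the race detector is the only guard rail Go gives us here.
    func fanOut(r *Result, replaceKeyspace string, callbacks []func(*Result) error) error {
        if replaceKeyspace != "" {
            r.ReplaceKeyspace(replaceKeyspace) // happens exactly once
        }
        for _, cb := range callbacks {
            if err := cb(r); err != nil {
                return err
            }
        }
        return nil
    }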

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

@vmg (Collaborator, Author) commented Mar 26, 2021

Unfortunate hiccup: the CI server doesn't have enough concurrency to handle the 1000-client thundering herd in the stress tests, either with or without consolidation. I need to ponder this a bit; maybe the test needs to be simplified for CI altogether.

@dweitzman (Member) commented Mar 27, 2021

Seems like a generally good thing. The edge case I wonder about is what happens if two gates are caught up and streaming the same query, then one of them isn't able to receive results quickly (maybe the application is thrashing and paging through results very slowly).

I haven't looked through the code, but I'm assuming that a single slow or stuck consolidated gate would fill up some buffer and then potentially block other gates from receiving results too?

If that's happening in production, seems like someone would be able to mitigate by setting ConsolidatorStreamTotalSize to 0 and restarting the tablets (to keep OLTP query consolidation and disable OLAP consolidation).

@vmg (Collaborator, Author) commented Apr 5, 2021

Back from my holiday!

The edge case I wonder about is what happens if two gates are caught up and streaming the same query, then one of them isn't able to receive results quickly

This is explicitly handled and tested! Any clients that fall behind so much that they cannot catch up will eventually be disconnected with a specific error message. https://github.com/vitessio/vitess/pull/7752/files#diff-5fc9fba5ef5ce1a6a2744d9d42243b740ac8f76b29b06c16abcd805bb416673cR245-R282

vmg force-pushed the vmg/stream-consolidation branch from 1d8bae5 to 11621b2 on April 5, 2021
vmg added 5 commits April 5, 2021 12:01
@vmg (Collaborator, Author) commented Apr 5, 2021

The tests are now reliably green. It turned out to be quite hard to get these concurrent tests to always pass when run under the race detector, because it slows them down.

@vmg (Collaborator, Author) commented Apr 5, 2021

I've worked around the issue conservatively; an ideal approach would switch the MySQL connection to using a memory pool for row reuse.

I've now implemented this feature in aa59b88, which should reduce memory allocations both when using and when not using the consolidator.

vmg added 3 commits April 5, 2021 18:16
@systay (Collaborator) left a comment

Always a learning experience reading through your code :)

systay merged commit dadfdd5 into vitessio:master on Apr 7, 2021
Comment on lines +301 to +304
if replaceKeyspace != "" {
result.ReplaceKeyspace(replaceKeyspace)
}
return callback(result)
@harshit-gangal (Member) commented:

Replace keyspace does not need to be called always, as not all queries will be for keyspace tables; it can be an information_schema query or a _vt query.

There is a similar check in the Execute method, where we check whether dbname is different from the keyspace name, and only replace with the keyspace name if field.Database == dbname.
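For clarity, the kind of guard being described is roughly the following; this is a paraphrase with a stand-in Field type, not the actual Execute code.

    package sketch

    // Field stands in for the field metadata of a result set.
    type Field struct {
        Database string
    }

    // replaceKeyspaceIfNeeded only rewrites the database name when the
    // backing db name differs from the keyspace name, and only for fields
    // that actually belong to that db, so information_schema or _vt
    // results are left alone.
    func replaceKeyspaceIfNeeded(fields []*Field, dbName, keyspaceName string) {
        if dbName == keyspaceName {
            return // nothing to rewrite
        }
        for _, f := range fields {
            if f.Database == dbName {
                f.Database = keyspaceName
            }
        }
    }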

@vmg (Collaborator, Author) replied:

@harshit-gangal this is the same behavior that we had before the PR; I've just refactored it by lifting it from the caller. I'm pretty sure we're only replacing the keyspace when it's required (see https://github.com/vitessio/vitess/pull/7752/files#diff-799d4762d9e64026cb623db2defeb53209df055a99c1106f535006a89e7b812aR282-R285, which I've lifted from https://github.com/vitessio/vitess/pull/7752/files#diff-28fc9df9d28d69ca625b85e18a6fa652d61bfaf6de03012ce2d8d0338363d70eL808-L817). Can you double-check?

@vmg (Collaborator, Author) replied:

(unless the behavior before the PR was already broken, in which case I'll gladly fix it now 😄)

@harshit-gangal (Member) replied:

I realised that; it was broken before this PR.

Comment on lines +333 to +335
if replaceKeyspace != "" {
result.ReplaceKeyspace(replaceKeyspace)
}
@harshit-gangal (Member) commented:

Same as above.
