Reducing CPU usage for Vitess #4191

Open
sougou opened this Issue Sep 9, 2018 · 8 comments

@sougou
Member

sougou commented Sep 9, 2018

Problem statement

As Vitess starts to get used on a variety of platforms, one commonly voiced request has been to reduce its CPU utilization. This is particularly relevant for cloud users, for whom CPU is relatively more expensive than other resources.

Additionally, the network in certain environments is not very fast or reliable. This results in elevated tail latencies.

Approach

The proposed design is inspired by the existing vtcombo tool, but is different from it. VTCombo is a testing tool that encapsulates all of vitess in a single process. Using a tool like this in place of vtgate will allow it to directly send traffic to mysqls without going through vttablet. This will provide many savings:

  • grpc cost of vtgate
  • grpc cost of vttablet
  • network hop from vtgate to vttablet

Note that the cost of the network hop is mainly relevant if vttablet is also connecting to mysql from an external machine. This is the case when vitess is deployed against already running mysqls, or managed instances like RDS. In cases where vttablet connects to mysql through a socket, only the CPU costs are saved.

However, vtcombo itself will not work in a distributed production setup because it cannot track the changing lifecycles of tablets and keyspaces. Also, vtcombo cannot perform other housekeeping tasks like backup/restore, resharding, etc.

Instead, we'll preserve most of the existing pieces of vitess as is. This means that vttablets will still remain part of the infrastructure. The only difference, as mentioned above, is that vtgates will send queries directly to mysql.

Implementation strategy

VTGate has an abstraction layer called Gateway. Underneath this, there is an implementation called DiscoveryGateway that keeps track of vttablets that are going up and down. This layer also contains connections to vttablets.

The connection to vttablet can be replaced by the QueryService of vttablet. It doesn’t even have to be wrapped because the connection API and QueryService API are identical. An example of how to instantiate just the query serving part of vttablet is shown in go/vt/vttablet/endtoend, which tests that part.

Trade-offs

This change doesn't come for free. Some features derived from going through a single vttablet will be lost. For example:

  • Sequences
  • Hot row protection
  • Transaction throttling
  • 2PC
  • Message Queue

Setup changes

The setup changes will also be significant. The connection pools will have to be redistributed across multiple vtgates. The overall number of connections will have to be increased to accommodate unbalanced traffic.

Some of the monitoring variables that were previously associated with a query service will have to be exported differently at the vtgate level.

VTTablet’s QueryService can remain enabled. This will allow tools like vtworker to send queries to it. VTTablet will continue to perform other housekeeping tasks like vreplication and health check reporting.

We can consider moving some of the functionality around in such a way that a vttablet will not be required for normal query serving. That way, we spin them up only when housekeeping work needs to be performed.

@xhh1989

Contributor

xhh1989 commented Sep 10, 2018

Awesome, this is a very valuable feature. Connections are best placed in an LRU cache so that each vtgate or vttablet uses as few connections as possible.

@sougou

Member

sougou commented Sep 10, 2018

We used to have LRU in the past. The problem with that type of scheme is that a spike in traffic causes all connections to be opened at once. Sometimes mysql can't handle it, and sometimes, the container goes OOM, etc. So, we decided to stick to round-robin, which was more steady on resource demands.

But maybe we can reconsider.

@alainjobart

Contributor

alainjobart commented Sep 10, 2018

The connection pools are just one feature that vttablet has, but how do you disable the query service consistently on all vtgates? How do you implement Vitess queues safely? Or is this meant to have a smaller number of vtgates?

@sougou

Member

sougou commented Sep 10, 2018

Message queue is another feature we'll have to disable (I'll add it to the list).

As for disabling the query service, the most critical cases are reparenting and migrating served types for the master, right? In such cases, we'll have to rely on mysql's read-only mode. This is something we are already used to in the case of an externally managed mysql.

Can you think of any other workflows where we'd need to disable query service?

@mpawliszyn

Collaborator

mpawliszyn commented Sep 10, 2018

Where does this leave ID generation? I guess this might mean more gaps?

@alainjobart

Contributor

alainjobart commented Sep 10, 2018

On the migrate served types, we also use the query service disabling on the slaves, to prevent using a replica tablet on the wrong shard / keyspace. Right now we ensure no one reads from the wrong slave tablet by disabling the query service, or by adding table blacklists. This is less critical, as it's more to keep the vtgates in sync with the tablets right now, and you'll have that covered.

ID generation: more gaps, and no truly increasing IDs. We used to guarantee increasing IDs, with holes. Now we can only guarantee uniqueness.

@tirsen

Contributor

tirsen commented Sep 11, 2018

Surely IDs can still be generated in the vttablet? At least as an option.

I don't think we can give up on increasing IDs. Our apps are built on that assumption.

@sougou

Member

sougou commented Sep 16, 2018

You can track progress of this implementation here: https://github.com/sougou/vitess/tree/vtdirect.

The embedded TabletServer additionally encapsulates a connection to vttablet and proxies its health while also mimicking the corresponding state changes.

As for IDs, I'm thinking of (identifying and) forwarding those requests to the vttablet instead. This approach should work for messages also.
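A minimal sketch of that forwarding decision might look like this. The request kinds, destinations, and the `route` function are all hypothetical names for illustration; how vtgate would actually identify such requests is not described in this thread.

```go
package main

import "fmt"

// requestKind is a hypothetical classification of incoming requests.
type requestKind int

const (
	plainQuery   requestKind = iota // ordinary reads and writes
	sequenceNext                    // ID generation from a sequence
	messageOp                       // message queue operations
)

// destination names where a request is sent.
type destination string

const (
	directMySQL destination = "mysql"
	viaVTTablet destination = "vttablet"
)

// route forwards requests that need a single coordinating vttablet and
// sends everything else straight to mysql.
func route(k requestKind) destination {
	switch k {
	case sequenceNext, messageOp:
		return viaVTTablet
	default:
		return directMySQL
	}
}

func main() {
	for _, k := range []requestKind{plainQuery, sequenceNext, messageOp} {
		fmt.Println(k, "->", route(k))
	}
}
```

Only the requests that depend on a single vttablet pay the extra hop; ordinary queries keep the direct path.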
