
Efficient Kubernetes API Updates using Informers #75

Conversation

rjackson90
Contributor

We noticed about a month ago that our AWS bill was higher than expected, and when we looked into it we saw that the config reloader was making a large number of requests to our Kubernetes API server. These requests were generating a lot of cross-AZ traffic, both on their own and from our API servers to our etcd nodes in other availability zones. Lowering the config reloader's update frequency from once every 5 seconds to once every 10 minutes addressed the immediate cost problem, at the price of introducing significant lag in our config updates.

I took a look at the config reloader's source code and saw an opportunity to significantly reduce the number of requests made to the Kubernetes API server by using Informers, a facility shipped with the Kubernetes Go client SDK for efficiently tracking updates to API resources. I've been running this updated implementation in our production Kubernetes cluster for several weeks and it has worked really well. There was an immediate reduction in API request volume from the config reloader as soon as we switched to our fork, and we were able to return the update interval to 5 seconds with no impact on the number of requests made to the API server. We don't have very many compute nodes, 20-30 at any given time, but I imagine the polling behavior would be a serious scalability issue for larger clusters, or for other people who, like us, run on AWS and have to worry about the cost of cross-AZ traffic between HA cluster components.

I've left more detailed explanations of the "why" and "how" in my commit messages and in comments in the source. This fork has been well behaved over the few weeks we've run it, but we only ever used the MultiMap datasource, so I can't really vouch for compatibility with the behavior of the default Kubernetes datasource. I don't think it would be difficult to make the new datasource compatible enough with both existing datasources that they could be removed entirely; indeed, one of my commits does precisely that, but I'm open to leaving them in if incompatibilities are discovered in the new datasource.
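For reviewers who haven't used Informers before, here is a minimal sketch of the client-go pattern the new datasource relies on. This is illustrative only, not code from this PR: it assumes in-cluster credentials and uses ConfigMaps as the watched resource. The factory performs one initial LIST, then keeps an in-memory cache current over a WATCH, so event handlers and cache reads don't generate further API requests.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// In-cluster credentials; kubeconfig loading is omitted for brevity.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The factory performs one initial LIST per resource type and then
	// keeps its local cache current via a WATCH stream.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	cmInformer := factory.Core().V1().ConfigMaps().Informer()

	// Handlers fire from the watch stream, so changes show up almost
	// immediately without extra requests to the API server.
	cmInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			cm := obj.(*corev1.ConfigMap)
			fmt.Printf("configmap added: %s/%s\n", cm.Namespace, cm.Name)
		},
		UpdateFunc: func(_, newObj interface{}) {
			cm := newObj.(*corev1.ConfigMap)
			fmt.Printf("configmap updated: %s/%s\n", cm.Namespace, cm.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)

	// Block until the initial sync completes so reads see a full view.
	if !cache.WaitForCacheSync(stop, cmInformer.HasSynced) {
		panic("informer cache failed to sync")
	}

	// Reads are served from the in-memory cache, not the API server.
	cms, err := factory.Core().V1().ConfigMaps().Lister().List(labels.Everything())
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d configmaps in cache\n", len(cms))

	select {} // keep the watch running
}
```

Note that the resync period passed to the factory only controls how often cached objects are re-delivered to handlers from the local store; it does not cause periodic re-LISTs against the API server.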

@vmwclabot
Member

@rjackson90, you must sign every commit in this pull request acknowledging our Developer Certificate of Origin before your changes are merged. This can be done by adding Signed-off-by: John Doe <john.doe@email.org> to the last line of each Git commit message. The e-mail address used to sign must match the e-mail address of the Git author. Click here to view the Developer Certificate of Origin agreement.

@rjackson90 force-pushed the feature/efficient-kubernetes-datasource-updates branch from 137c264 to 05b8db0 on September 30, 2019 16:40

Richard Jackson added 3 commits on September 30, 2019 11:43
Within the Go API client, the Kubernetes project has created tools for
keeping local copies of API resources up to date with the server
state. By creating an Informer, the Kubernetes API client can perform
an initial sync of the objects we are interested in and then watch for
updates to those objects instead of polling the server on a regular
interval.

Changing the access pattern of the config reloader is important for
scaling Kubernetes clusters. If the config reloader is set to poll for
updates on a short interval (such as 5 seconds), it can generate a
significant amount of traffic on the API server even in relatively
modest clusters of a couple dozen compute nodes. This extra request
volume provides no value to the config reloader, and on AWS it is a
potentially expensive side effect if some nodes are in a different
availability zone from the Kubernetes API server(s).

With this new Datasource, the config reloader can be configured with
an arbitrarily short update interval without drowning the Kubernetes
API server in a flood of pointless requests. The Informer datasource
maintains watches on the important API objects in a separate
goroutine, and makes changes available to config reloader updates very
quickly, usually on the update immediately following the change in the
upstream API.

The new Datasource has been designed to be able to replace both existing
Kubernetes Datasources, but has only been tested against the behavior of
the Kubernetes Multimap Datasource.

Signed-off-by: Richard Jackson <richard.jackson@55places.com>
Update the controller to use only the new Informer-based datasource,
replacing both existing Kubernetes datasource implementations.

Note: This change has been tested in multiple production-quality
clusters for several weeks, but only in a configuration that formerly
used the Multimap datasource. Only cursory testing has been done to
verify that the new datasource is compatible with the default Kubernetes
datasource.

Signed-off-by: Richard Jackson <richard.jackson@55places.com>
The new Informer-based datasource for Kubernetes has clear benefits in
terms of scalability, and should be compatible with the behavior of both
existing datasources.

Signed-off-by: Richard Jackson <richard.jackson@55places.com>
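To make the read path described in the commit messages above concrete, here is a hedged sketch of how an informer-backed datasource can serve reload ticks from the local cache. The type and method names below (informerDatasource, GetConfigMaps) are hypothetical illustrations, not identifiers from this PR; the lister is assumed to come from a shared informer factory like the one sketched earlier in this thread.

```go
package datasource

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	listerscorev1 "k8s.io/client-go/listers/core/v1"
)

// informerDatasource holds a lister that is populated once at startup by a
// shared informer factory and kept current by the informer's watch running
// in a background goroutine.
type informerDatasource struct {
	cmLister listerscorev1.ConfigMapLister
}

// GetConfigMaps would be called on every reload tick. It never contacts the
// API server: the lister reads from the informer's already-synced local
// cache, so an arbitrarily short update interval adds no API traffic.
func (d *informerDatasource) GetConfigMaps(namespace string) ([]*corev1.ConfigMap, error) {
	return d.cmLister.ConfigMaps(namespace).List(labels.Everything())
}
```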
@rjackson90 force-pushed the feature/efficient-kubernetes-datasource-updates branch from 05b8db0 to 7f9d9be on September 30, 2019 16:44
@rjackson90
Contributor Author

Hi @jvassev, do you have any questions about this proposed change that I can answer? Is there anything I can do to make review easier?

@jvassev
Contributor

jvassev commented Nov 10, 2019

Hi @rjackson90, this looks like a great improvement! Since you've tested mainly against the MultiMap datasource, wouldn't it be safer to change only how the MultiMap datasource is handled and leave the default datasource path untouched?

@jvassev
Contributor

jvassev commented Nov 15, 2019

On second thought, let's merge it as is, since it is definitely a huge improvement. Thanks for making it happen!
If there is a regression, reverting the default datasource path would be trivial.

@jvassev merged commit 8ade99e into vmware:master on Nov 15, 2019
@rjackson90
Contributor Author

I'm happy to help; if any problems come up, an immediate revert is wise. I was torn on whether to submit the PR with the default datasource modified, but I landed in roughly the same place you did: it seemed sub-optimal for users to see significantly different API call efficiency based on a relatively minor configuration choice.
