
Reducing hcloud API calls for hcloudmachines that are up and running #1336

Open
janiskemper opened this issue Jun 11, 2024 · 2 comments

Comments

@janiskemper
Contributor

janiskemper commented Jun 11, 2024

/kind proposal

Sometimes we hit the rate limit because the caph controller makes too many calls to the hcloud API.

Checks we do for a running hcloudmachine

  • validate that labels are still correctly set and let the machine fail if not
  • server status is off and gets switched on
  • server doesn't exist anymore, so it gets created again (this is actually a problem in itself: we don't want to re-create a server for the same machine object after the first one was deleted)
  • server is not attached to network anymore and gets attached again
  • server (control plane) is not a target of the load balancer anymore and gets added again

All of these checks react to an action of the user. The user removes a label from the server in the HCloud UI, and we can no longer validate it. The user deletes a server manually, and we notice. The user removes the server from the load balancer or the network, and we add it again.

These are potentially valid use cases - the question is whether they are so relevant that we need to keep them.

Possible ways of handling these checks

One extreme (what we do right now): Do all API calls to check everything in every reconcile loop.

The other extreme: Stop doing any API calls once the server is up and running. If something is wrong with the server, the Machine Health Checks should discover that. We don't do anything if the user actively misconfigures something and for example removes a server from the load balancer.

Middle way 1: Do specific checks and stop doing all others
For example, we could stop checking that the server is part of the network but continue checking that it is added as a target to the load balancer. Any combination of the checks that are important to us is possible.
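A minimal sketch of how such a selection could look, assuming the checks are modelled as named values that can be switched on or off. The `Check` type, the constants, and `ChecksToRun` are hypothetical names for illustration and are not taken from the caph code base:

```go
package main

import "fmt"

// Check names one verification that could be run against the hcloud API
// for a server that is already up and running. Hypothetical type.
type Check string

// Hypothetical identifiers for the checks listed in this issue.
const (
	CheckLabels       Check = "labels"
	CheckPowerState   Check = "power-state"
	CheckNetwork      Check = "network"
	CheckLoadBalancer Check = "load-balancer"
)

// CheckSet holds the checks that stay enabled for running machines.
type CheckSet map[Check]bool

// ChecksToRun returns the checks from all that are enabled,
// preserving the order of all.
func ChecksToRun(all []Check, enabled CheckSet) []Check {
	var out []Check
	for _, c := range all {
		if enabled[c] {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	all := []Check{CheckLabels, CheckPowerState, CheckNetwork, CheckLoadBalancer}

	// Keep only the load balancer check, as in the example above.
	enabled := CheckSet{CheckLoadBalancer: true}

	fmt.Println(ChecksToRun(all, enabled))
}
```

An empty `CheckSet` would correspond to the other extreme (no checks at all), and a set with every check enabled to the current behaviour.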

Middle way 2: Heavily cache API calls once a server is running
We could also use a cache to avoid calling the API regularly. If something goes wrong, we would notice it later, but we would notice it eventually.

Any thoughts?

I'm curious to hear opinions, also from people outside of Syself! The overall goal is to reduce the number of API calls, which can be rather high. Hundreds of calls per hour for a stable (not scaling) cluster are normal.

A similar question could also be asked for the general load balancer, placement group, and network configuration, which we reconcile in the hetznercluster-controller. I'm also looking forward to opinions there!

@janiskemper janiskemper changed the title Reducing API calls for hcloudmachines that are up and running? Reducing API calls for hcloudmachines that are up and running Jun 11, 2024
@guettli guettli changed the title Reducing API calls for hcloudmachines that are up and running Reducing hcloud API calls for hcloudmachines that are up and running Jun 12, 2024
@apricote
Contributor

I do think all of the above requests should be fine. My question would be how often you are triggering reconciles of the HCloudMachines controller and if that can be optimized.

I described my previous solution to investigate this here: #926 (comment)

@janiskemper
Contributor Author

I think that we can probably slightly improve the work that you have started already, @apricote.

The reason I wrote down my thoughts here is that a large use case is more likely to run into a rate limit than a smaller one.

This is in general just a question of optimizing rather than fixing a certain bug or issue.

By default, we currently reconcile all objects every three minutes. This is a controller-runtime setting, exposed as a parameter, that can also be used to reduce the number of reconciliations.
