--max-reconcile-rate doesn’t seem to be working on v0.13.0 #234
Comments
@cachaldora I do not observe any related change in the provider codebase. If we look at https://github.com/upbound/provider-terraform/blame/main/cmd/provider/main.go, the handling of --max-reconcile-rate has not changed between those versions. And if we look at https://github.com/upbound/provider-terraform/tree/main?tab=readme-ov-file#known-limitations, there is a statement recommending that the CPU available to the provider match the desired concurrency.
Is it possible that you are overloading your system with a high value of 40, and that it produces this undesirable "long reconciliation" side effect? Could you try to use the CPU limit as per the recommendation?
The description in #233 mentions having cpu.requests == 1000m, which is essentially the same as setting max-reconcile-rate to 1. If you want to use max-reconcile-rate=10 then you also have to set cpu.requests to 10, AND the node must have 10 vCPUs available to give to the pod. It's not very efficient, but it's the only way to run multiple terraform CLI commands in parallel.
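For anyone landing here, a minimal sketch of pairing the flag with matching CPU requests via a DeploymentRuntimeConfig could look like the snippet below; the resource name and the exact values are illustrative, not taken from this thread:

```yaml
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: provider-terraform-config
spec:
  deploymentTemplate:
    spec:
      template:
        spec:
          containers:
            - name: package-runtime
              args:
                - --max-reconcile-rate=10   # controller concurrency
              resources:
                requests:
                  cpu: "10"                 # roughly one vCPU per concurrent terraform invocation
                limits:
                  cpu: "10"
```

The Provider then references it through spec.runtimeConfigRef.name (on older Crossplane releases the same idea applies to a ControllerConfig referenced via controllerConfigRef).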
@bobh66 is there anything which enforces the CPU count limiting the reconcile rate (i.e. by somehow reserving 1 CPU per terraform invocation), or is your comment based on the assumption that any terraform invocation will just use 1 CPU continuously? AIUI most long-running terraform operations will be IO-bound on calls to resource APIs (or in some cases sleeping between such calls as they poll for status).
Each goroutine that gets spawned for a reconciliation calls the terraform CLI several times for workspace selection and the Observe step (terraform plan), and may call it again for either apply or delete, either of which may be a long-running process. Each invocation of the terraform CLI calls …
This may be related to the Lock behavior described in #239
It seems to be related. We also experienced the same as @toastwaffle, testing with --max-reconcile-rate=40, resources.requests.cpu=4 and resources.limits.cpu=4. We were watching the CPU load, which usually didn't go above 2, and the running processes, which seemed to be working on a single workspace at a time most of the time.
Is it still an issue with v0.16? |
@project-administrator (nice username) This should be fixed from v0.14.1 onwards - see #240 which more or less fixed #239. If you are using high concurrency, I strongly recommend setting up a PVC for your TF workspaces. |
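As a hedged sketch of that suggestion (the volume and claim names and the /tf mount path are my assumptions; check the provider README for the authoritative example), a DeploymentRuntimeConfig that puts the workspace directory on a PVC could look like:

```yaml
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: provider-terraform-config
spec:
  deploymentTemplate:
    spec:
      template:
        spec:
          containers:
            - name: package-runtime
              volumeMounts:
                - name: tf-workspaces
                  mountPath: /tf            # assumed location of the provider's TF workspaces
          volumes:
            - name: tf-workspaces
              persistentVolumeClaim:
                claimName: tf-workspaces    # pre-created PVC; name is illustrative
```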
With higher concurrency values like 10 we need to reserve an appropriate amount of RAM and CPU for the terraform pod to run multiple "terraform apply" instances. For us it's roughly 1 CPU core and 1 GB of RAM per terraform invocation.
I wonder if we can use the DeploymentRuntimeConfig replicas setting to run several instances of the provider? Has anyone tested this configuration? |
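(For reference, the setting being asked about sits under spec.deploymentTemplate.spec; shown here only as a sketch of where it lives, not as a recommendation:)

```yaml
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: provider-terraform-config
spec:
  deploymentTemplate:
    spec:
      replicas: 2   # extra replicas only stand by via leader election, as explained in the next reply
```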
@project-administrator You can run multiple replicas of the provider but it will not help with scaling. The provider is a Kubernetes controller (or multiple controllers), and by design controllers cannot run more than one active instance. There is (currently) no way to ensure that events for a specific resource instance will be processed in order by the same controller instance, so all controllers run as single instances. If multiple replicas are defined they will do leader election, and the non-leader instances will wait until the leader is unavailable before they try to become the leader and process the workload.
Seems this bug is still present in 0.17 and that's quite the issue |
Maybe this information will help someone. I was seeing very long creation times for resources in my dev installation, and investigation showed that my config for the provider was not being applied. For example, … Not long ago, …
What happened?
I’ve updated the terraform provider from version 0.11.0 to 0.13.0 to be able to use pluginCache and parallelism without concurrency issues. With version 0.13.0 I’ve enabled the plugin cache and everything seems to be working; however, reconciliation was taking too long.
As I have the -d flag enabled, I was looking into the pod logs and noticed that after the upgrade the provider seems to be slower and picks up fewer workloads. The resources used by the pod are also significantly lower.
How can we reproduce it?
Configure terraform provider:
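(The configuration block from the original report did not survive; below is a representative sketch of the kind of setup described in this thread: debug logging, the plugin cache, and a high max-reconcile-rate. All names and values are illustrative rather than copied from the reporter, and the pluginCache field is quoted from memory, so verify it against your installed CRDs.)

```yaml
# Runtime configuration: debug logging plus the concurrency flag discussed above.
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: provider-terraform-config
spec:
  deploymentTemplate:
    spec:
      template:
        spec:
          containers:
            - name: package-runtime
              args:
                - -d
                - --max-reconcile-rate=40
              resources:
                requests:
                  cpu: "4"
                limits:
                  cpu: "4"
---
# ProviderConfig enabling the plugin cache; credentials shown only to make the
# example self-contained.
apiVersion: tf.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  pluginCache: true
  credentials:
    - filename: terraform-creds.json
      source: Secret
      secretRef:
        namespace: upbound-system
        name: terraform-creds
        key: credentials
```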
Create several workspaces and take a look at the pod logs and consumed resources.
What environment did it happen in?