--max-reconcile-rate doesn’t seem to be working on v0.13.0 #234
Comments
@cachaldora I do not observe any related change in the provider codebase. If we look at https://github.com/upbound/provider-terraform/blame/main/cmd/provider/main.go, the handling of --max-reconcile-rate has not changed between those versions. And if we look at https://github.com/upbound/provider-terraform/tree/main?tab=readme-ov-file#known-limitations, there is a statement recommending that the CPU available to the provider match the desired concurrency.
Is it possible that you are overloading your system with a high value of 40, and that it produces this undesirable "long reconciliation" side effect? Could you try to use the CPU limit as per the recommendation?
The description in #233 mentions having cpu.requests == 1000m, which is essentially the same as setting max-reconcile-rate to 1. If you want to use max-reconcile-rate=10 then you also have to set cpu.requests to 10, AND the node must have 10 vCPUs available to give to the pod. It's not very efficient, but it's the only way to run multiple terraform CLI commands in parallel.
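For anyone landing here, a minimal sketch of pairing the flag with matching CPU requests via a DeploymentRuntimeConfig could look like the snippet below; the resource name and the exact values are illustrative, not taken from this thread:

```yaml
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: provider-terraform-config
spec:
  deploymentTemplate:
    spec:
      template:
        spec:
          containers:
            - name: package-runtime
              args:
                - --max-reconcile-rate=10   # controller concurrency
              resources:
                requests:
                  cpu: "10"                 # roughly one vCPU per concurrent terraform invocation
                limits:
                  cpu: "10"
```

The Provider then references it through spec.runtimeConfigRef.name (on older Crossplane releases the same idea applies to a ControllerConfig referenced via controllerConfigRef).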
@bobh66 is there anything which enforces the CPU count limiting the reconcile rate (i.e. by somehow reserving 1 CPU per terraform invocation), or is your comment based on the assumption that any terraform invocation will just use 1 CPU continuously? AIUI most long-running terraform operations will be IO-bound on calls to resource APIs (or in some cases sleeping between such calls as they poll for status).
Each goroutine that gets spawned for a reconciliation calls the terraform CLI several times for workspace selection and the Observe step (terraform plan), and may call it again for either apply or delete, either of which may be a long-running process. Each invocation of the terraform CLI calls …
This may be related to the Lock behavior described in #239
It seems to be related. We also experienced the same as @toastwaffle, testing with --max-reconcile-rate=40, resources.requests.cpu=4 and resources.limits.cpu=4. We were watching the CPU load, which usually didn't go above 2, and the running processes, which seemed to be working on a single workspace at a time most of the time.
Is it still an issue with v0.16? |
@project-administrator (nice username) This should be fixed from v0.14.1 onwards - see #240 which more or less fixed #239. If you are using high concurrency, I strongly recommend setting up a PVC for your TF workspaces. |
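As a hedged sketch of that suggestion (the volume and claim names and the /tf mount path are my assumptions; check the provider README for the authoritative example), a DeploymentRuntimeConfig that puts the workspace directory on a PVC could look like:

```yaml
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: provider-terraform-config
spec:
  deploymentTemplate:
    spec:
      template:
        spec:
          containers:
            - name: package-runtime
              volumeMounts:
                - name: tf-workspaces
                  mountPath: /tf            # assumed location of the provider's TF workspaces
          volumes:
            - name: tf-workspaces
              persistentVolumeClaim:
                claimName: tf-workspaces    # pre-created PVC; name is illustrative
```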
With higher concurrency values like 10 we need to reserve an appropriate amount of RAM and CPU for the terraform pod to run multiple "terraform apply" instances. For us it's roughly 1 CPU core and 1 GB of RAM per terraform invocation.
I wonder if we can use the DeploymentRuntimeConfig replicas setting to run several instances of the provider? Has anyone tested this configuration? |
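(For reference, the setting being asked about sits under spec.deploymentTemplate.spec; shown here only as a sketch of where it lives, not as a recommendation:)

```yaml
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: provider-terraform-config
spec:
  deploymentTemplate:
    spec:
      replicas: 2   # extra replicas only stand by via leader election, as explained in the next reply
```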
@project-administrator You can run multiple replicas of the provider but it will not help with scaling. The provider is a Kubernetes controller (or multiple controllers), and by design controllers cannot run more than one active instance. There is (currently) no way to ensure that events for a specific resource instance will be processed in order by the same controller instance, so all controllers run as single instances. If multiple replicas are defined they will do leader election, and the non-leader instances will wait until the leader is unavailable before they try to become the leader and process the workload.
Seems this bug is still present in 0.17 and that's quite the issue |
Maybe this information will help someone. I was seeing very long creation times for resources in my dev installation, and investigation showed that my config for the provider was not being applied. For example, … Not long ago, …
What happened?
I’ve updated the terraform provider from version 0.11.0 to 0.13.0 to be able to use pluginCache and parallelism without concurrency issues. With version 0.13.0 I’ve enabled the plugin cache and everything seems to be working; however, reconciliation was taking too long.
As I have the -d flag enabled, I was looking into the pod logs and noticed that after the upgrade the provider seems to be slower and picks up fewer workloads. The resources used by the pod are also significantly lower.
How can we reproduce it?
Configure terraform provider:
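(The configuration block from the original report did not survive; below is a representative sketch of the kind of setup described in this thread: debug logging, the plugin cache, and a high max-reconcile-rate. All names and values are illustrative rather than copied from the reporter, and the pluginCache field is quoted from memory, so verify it against your installed CRDs.)

```yaml
# Runtime configuration: debug logging plus the concurrency flag discussed above.
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: provider-terraform-config
spec:
  deploymentTemplate:
    spec:
      template:
        spec:
          containers:
            - name: package-runtime
              args:
                - -d
                - --max-reconcile-rate=40
              resources:
                requests:
                  cpu: "4"
                limits:
                  cpu: "4"
---
# ProviderConfig enabling the plugin cache; credentials shown only to make the
# example self-contained.
apiVersion: tf.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  pluginCache: true
  credentials:
    - filename: terraform-creds.json
      source: Secret
      secretRef:
        namespace: upbound-system
        name: terraform-creds
        key: credentials
```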
Create several workspaces and take a look at the pod logs and consumed resources.
What environment did it happen in?