Drop sync-wave from CRBS #103

Merged
mbaldessari merged 1 commit into validatedpatterns:main from mbaldessari:clustergroup_race
Mar 19, 2026
Conversation

@mbaldessari
Contributor

In commit 688ddb6 (Force rolebindings
as early as possible) we set the RBACs to sync-wave: -100 to create
them as early as possible. Without that we'd end up with the following
error on the spokes:

Failed sync attempt to : one or more objects failed to apply, reason:
serviceaccounts is forbidden: User
"system:serviceaccount:openshift-gitops:openshift-gitops-argocd-application-controller"
cannot create resource "serviceaccounts" in API group "" in the namespace
"imperative" due to application controller sync timeout. Retrying attempt
#1 at 6:38PM. 20 minutes ago (Wed Mar 18 2026 19:39:04 GMT+0100)

The problem is that by using these sync-waves, we seem to trigger an
ArgoCD bug where selfHeal simply stops trying:
https://www.github.com/argoproj/argo-cd/issues/18442

It does not always happen but it is certainly frequent enough to be
noticed.

To avoid this bug (and potentially others) we fully drop the
sync-waves around the CRBs. To avoid the original problem of the
openshift-gitops-argocd-application-controller service account being
unable to create SAs, we also precreate that CRB via ACM. This way we
avoid the ArgoCD issue and still get everything working.
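
For illustration, the annotation being dropped looks roughly like the sketch below. The binding name and the `cluster-admin` role are assumptions made for this example and are not copied from the chart; only the service account, its namespace, and the `-100` wave come from the description above.

```yaml
# Hypothetical ClusterRoleBinding as it might have looked before this change.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: example-argocd-controller-crb   # hypothetical name
  annotations:
    # The sync-wave annotation that this PR removes from the CRBs.
    argocd.argoproj.io/sync-wave: "-100"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin                    # assumed role, for illustration only
subjects:
  - kind: ServiceAccount
    name: openshift-gitops-argocd-application-controller
    namespace: openshift-gitops
```

With the annotation removed, the binding falls into ArgoCD's default wave (0) along with every other resource that carries no sync-wave annotation.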

Tested on 6 separate MCG hub/spoke installations without any issues.
Previously I would hit the issue at least 4 times.

Closes: #63

mbaldessari added a commit to mbaldessari/acm-chart that referenced this pull request Mar 19, 2026
This is to fix an issue on spokes. See
validatedpatterns/clustergroup-chart#103
for the full reasoning.

TLDR: we need to drop the sync-waves from the CRBs in clustergroup to
avoid an ArgoCD bug, but without them the SA will never have the right
permissions to create another service account, so we precreate the
CRB via the acm-chart.
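
One way such a precreation could be expressed is an ACM `Policy` that enforces the binding on the spokes. This is only a sketch under assumptions: the actual acm-chart change is not shown in this conversation, and the policy names, the namespace, and the `cluster-admin` role below are hypothetical. A `Placement` and `PlacementBinding` would also be needed to target the spoke clusters and are omitted here.

```yaml
# Hypothetical ACM policy that creates the CRB on managed clusters
# before ArgoCD on the spoke needs it.
apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: example-precreate-crb            # hypothetical name
  namespace: open-cluster-management     # assumed namespace
spec:
  remediationAction: enforce
  disabled: false
  policy-templates:
    - objectDefinition:
        apiVersion: policy.open-cluster-management.io/v1
        kind: ConfigurationPolicy
        metadata:
          name: example-precreate-crb-config   # hypothetical name
        spec:
          remediationAction: enforce
          severity: medium
          object-templates:
            # "musthave" makes ACM create the object if it is missing.
            - complianceType: musthave
              objectDefinition:
                apiVersion: rbac.authorization.k8s.io/v1
                kind: ClusterRoleBinding
                metadata:
                  name: example-argocd-controller-crb   # hypothetical name
                roleRef:
                  apiGroup: rbac.authorization.k8s.io
                  kind: ClusterRole
                  name: cluster-admin                   # assumed role
                subjects:
                  - kind: ServiceAccount
                    name: openshift-gitops-argocd-application-controller
                    namespace: openshift-gitops
```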

Closes: validatedpatterns/clustergroup-chart#63
@mbaldessari mbaldessari merged commit 6079d50 into validatedpatterns:main Mar 19, 2026
4 checks passed
Development

Successfully merging this pull request may close these issues.

gitops on the spokes sometimes is stuck
