
Common update #99

Merged
mbaldessari merged 138 commits into validatedpatterns:main from day0hero:commonupdate on Jul 20, 2023
Conversation

@day0hero
Contributor

Updated common to the latest.

claudiol and others added 30 commits March 21, 2023 10:06
…annotations:

      labels:
        openshift.io/node-selector: ""
      annotations:
        openshift.io/cluster-monitoring: "true"
When running the `pattern.sh` script multiple times, a lot of exited
podman containers are left on the machine. Adding the `--rm` parameter
to `podman run` makes podman delete exited containers automatically,
leaving the machine cleaner.
…ontainers

Avoid exited container proliferation when using the pattern.sh script
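A minimal sketch of the flag in context, assuming a typical podman invocation in pattern.sh (the other flags and variable names shown are illustrative; only --rm is the actual change):

  # --rm removes the container once it exits, so repeated runs do not
  # accumulate stopped containers on the host
  podman run -it --rm \
    -v "${HOME}:${HOME}" \
    "${PATTERN_UTILITY_CONTAINER}" \
    "$@"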
Generating the ICSP and allowing insecure registries is best done prior
to helm upgrade, and requires VPN access to registry-proxy.engineering.redhat.com
WIP: Add labels and annotations support for namespaces (RFE)
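A values-file sketch of what a namespace entry with labels and annotations could look like under this change (the namespace name and keys are illustrative, matching the metadata shown above):

  namespaces:
    - open-cluster-management:
        labels:
          openshift.io/node-selector: ""
        annotations:
          openshift.io/cluster-monitoring: "true"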
This is especially useful when multiple people are working on a pattern
and have been using different names:

    $ make help |grep Pattern:
    Pattern: multicloud-gitops
    $ make NAME=foobar help |grep Pattern:
    Pattern: foobar
Allow overriding the pattern's name
After merging validatedpatterns/patterns-operator@235b303
it is now effectively possible to pick a different catalogSource for
the gitops operator. This is needed in order to allow CI to install
the gitops operator from an IIB.
This variable can be set in order to pass additional helm arguments from
the command line, i.e. we can set things without having to tweak values
files. So it is now possible to run something like the following:

  ./pattern.sh make install \
  EXTRA_HELM_OPTS="--set main.gitops.operatorSource=iib-49232"
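The same override can also live in a values file instead of being passed on the command line; this snippet simply mirrors the --set path used above:

  main:
    gitops:
      operatorSource: iib-49232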
mbaldessari and others added 29 commits July 7, 2023 11:23
This reverts commit ab5532a.
ArgoCD will retry 5 times by default to sync an application in case of
errors and then will give up. So if an application contains a reference
to a CRD that has not been installed yet (say because it will be
installed by another application), it will error out and retry later.
This happens by default for a maximum of 5 times [1]. After those 5 times
the application will give up and will stay in Degraded mode and
eventually move to Failed. In this case a manual sync will usually fix
the application just fine (i.e. as long as the missing CRD has been
installed in the meantime).

Now to solve this issue we can add complex preSync Jobs that wait for
the needed resources, but this fundamentally breaks the simplicity of
things and introduces unneeded dependencies. In this change we just
increase the default retry limit to something larger (20) that should
cover most cases. The retry limit functionality is currently mostly
undocumented, but it is defined at [2] and also shown at [3].

In our patterns' case the concrete issue happened as follows:
1. ESO ClusterSecrets were often not synced/degraded
2. We introduced a Job in a preSync hook for the ESO chart that would
   wait on vault to be ready before applying the rest of ESO
3. MCG started failing because the config-demo app had already tried to
   sync 5 times and failed every time because the ESO CRDs were not
   installed yet (due to ESO waiting on vault)

So instead of adding yet another job, let's just try a lot more often.
We picked 20 as a sane default because that should have Argo retry for
about 60 minutes (3 minutes is the default maximum backoff duration).

Tested with two MCG installations (with the ESO Job hook included) and
both worked out of the box. Whereas before I managed to get three
failures out of three installs.

[1] https://github.com/argoproj/argo-cd/blob/master/controller/appcontroller.go#L1680
[2] https://github.com/argoproj/argo-cd/blob/master/manifests/crds/application-crd.yaml#L1476
[3] https://github.com/argoproj/argo-cd/blob/master/docs/operator-manual/application.yaml#L202C18-L202C100
Revert ESO/Vault Job and add a default higher number of retries
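For reference, the relevant part of the Application spec with the raised limit looks roughly like this (the backoff values are Argo CD's documented defaults, shown only for context):

  syncPolicy:
    retry:
      limit: 20          # raised from the default of 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m  # retries therefore span roughly an hour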
We can only split out bits of yaml that reference $.* variables. This is
because these snippets in _helpers.tpl are passed a single context,
either $ or ., and cannot use both the way the top-level templates can.
I only remove the variables that are defined identically in the
ApplicationSet and in the helper, leaving the other ones as-is, since
why they are there is not entirely clear to me and I do not want to
risk breaking things.
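An illustrative Helm snippet of the constraint (the template and value names are made up): inside the top-level template a range block can mix the item context `.` with the root `$`, while a snippet in _helpers.tpl only receives whichever single context the caller passes it.

  {{- range .Values.clusterGroup.applications }}
  name: {{ .name }}                        # the range item (.)
  pattern: {{ $.Values.global.pattern }}   # the chart root ($)
  {{- end }}

  {{/* a helper can only be handed one of the two contexts: */}}
  {{ include "clustergroup.app.globals" $ }}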
When HOME is set we replace it with '~' in this debug message, because
when run from inside the container HOME is /pattern-home, which is
confusing for users. Printing '~' at the start of the string instead is
less confusing.

Before:
ok: [localhost] => {
    "msg": "/home/michele/.config/hybrid-cloud-patterns/values-secret-multicloud-gitops.yaml"
}

After:
ok: [localhost] => {
    "msg": "~/.config/hybrid-cloud-patterns/values-secret-multicloud-gitops.yaml"
}
Tweak the load secret debug message to be clearer
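A minimal Ansible sketch of the substitution being described (the variable name is illustrative):

  - name: Print the values-secret path with HOME shortened to '~'
    ansible.builtin.debug:
      msg: "{{ values_secret_path | regex_replace('^' + lookup('env', 'HOME'), '~') }}"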
If it is somewhere under /tmp or otherwise outside of the HOME folder,
bail out explaining why. This has caused a few silly situations where
the user would save the KUBECONFIG file under /tmp. Since bind-mounting
/tmp seems like the wrong thing to do in general, we at least bail out
with a clear error message. To do this we rely on a bash feature, so
let's just switch the script to bash.

Tested as follows:
export KUBECONFIG=/tmp/kubeconfig
./scripts/pattern-util.sh make help
/tmp/kubeconfig is pointing outside of the HOME folder, this will make it unavailable from the container.
Please move it somewhere inside your /home/michele folder, as that is what gets bind-mounted inside the container

export KUBECONFIG=~/kubeconfig
./scripts/pattern-util.sh make help
Pattern: common

Usage:
  make <target>
...
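A sketch of the kind of bash check involved; the exact variable handling and wording in pattern-util.sh may differ, but the prefix match is the bash feature referred to above:

  if [ -n "${KUBECONFIG}" ] && [[ "${KUBECONFIG}" != "${HOME}"/* ]]; then
    echo "${KUBECONFIG} is pointing outside of the HOME folder, this will make it unavailable from the container."
    echo "Please move it somewhere inside your ${HOME} folder, as that is what gets bind-mounted inside the container"
    exit 1
  fi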
Sanely handle cluster pools with no clusters (yet)
We currently have a small inconsistency where we use common/clustergroup
in order to point Argo CD to this chart, but the name inside the chart
is 'pattern-clustergroup'.

This inconsistency is currently irrelevant, but in the future when
migrating to helm charts inside proper helm repos, this becomes
problematic. So let's fix the name to be the same as the folder.

Tested on MCG successfully.
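A sketch of the rename, assuming it is the name field in the chart's Chart.yaml that changes to match the folder:

  # common/clustergroup/Chart.yaml
  apiVersion: v2
  name: clustergroup   # previously pattern-clustergroup
  version: 0.0.1       # illustrative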
Check if the KUBECONFIG file is inside /tmp
Currently with the following values snippet:

  managedClusterGroups:
    exampleRegion:
      name: group-one
      acmlabels:
      - name: clusterGroup
        value: group-one
      helmOverrides:
      - name: clusterGroup.isHubCluster
        value: false
      clusterPools:
        exampleAWSPool:
          size: 1
          name: aws-ap-bandini
          openshiftVersion: 4.12.24
          baseDomain: blueprints.rhecoeng.com
          controlPlane:
            count: 1
            platform:
              aws:
                type: m5.2xlarge
          workers:
            count: 0
          platform:
            aws:
              region: ap-southeast-2
          clusters:
          - One

You will get a clusterClaim that is pointing to the wrong Pool:
NAMESPACE                 NAME                       POOL
open-cluster-management   one-group-one              aws-ap-bandini

This is wrong because the clusterPool name will be generated using the
pool name + "-" + the group name:

  {{- $pool := . }}
  {{- $poolName := print .name "-" $group.name }}

But the clusterPoolName inside the clusterClaim only uses "$pool.name",
which makes the clusterClaim ineffective, as a pool with that name
does not exist.

Switch to using the same poolName that is being used when creating the
clusterPool.
Fix the clusterPoolName in clusterClaims
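A sketch of the corrected clusterClaim template, reusing the same derived pool name; the $cluster variable and metadata layout are illustrative:

  {{- $poolName := print .name "-" $group.name }}
  apiVersion: hive.openshift.io/v1
  kind: ClusterClaim
  metadata:
    name: {{ print $cluster "-" $group.name | lower }}   # e.g. one-group-one
    namespace: open-cluster-management
  spec:
    clusterPoolName: {{ $poolName }}   # previously {{ $pool.name }}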
git-subtree-dir: common
git-subtree-mainline: 3f09479
git-subtree-split: e0d7954
mbaldessari merged commit 41059b0 into validatedpatterns:main on Jul 20, 2023
