Environment
Device and OS: Nutanix VM RHEL8
App version: 0.36.1
Kubernetes distro being used: RKE2
Steps to reproduce
Run zarf init with the registry deployment's replica count set to more than one, or with HPA enabled.
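For example, a minimal sketch of the kind of command that triggers this (the REGISTRY_HPA_ENABLE / REGISTRY_HPA_MIN / REGISTRY_HPA_MAX variable names are assumed from the upstream zarf init package and may differ between versions):

# enable the registry HPA during init so it scales beyond one replica
zarf init --set REGISTRY_HPA_ENABLE=true --set REGISTRY_HPA_MIN=2 --set REGISTRY_HPA_MAX=5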
Expected result
Zarf to successfully initialize in the cluster.
Actual Result
Image pull backoff on some of the permanent registry pods, for the reasons explained in the additional context below.
Visual Proof (screenshots, videos, text, etc)
Need to produce... standby
Severity/Priority
Additional Context
We don't have a reproducible test case outside one of our Nutanix deployments right now. We believe this can happen when the registry runs multiple replicas and persistent volume provisioning is slow.
We use the custom zarf init package below for our Nutanix CSI driver; it imports the seed registry and permanent registry from the zarf project.
What we believe happens is that the wait on the permanent registry succeeds immediately and moves on, because the seed registry uses the same name as the permanent registry. As a result, the permanent registry may not actually be ready, with all of the images pushed to its persistent volume.
When the permanent registry replicas start to come up, some of the initial pods succeed because they pull their images from the temporary seed registry. Because the wait never actually waited for the permanent registry, the seed registry is torn down while the permanent registry's persistent volumes may not yet be ready or fully populated with the images the seed registry held; any remaining registry pods then can't pull images and stay in an image pull backoff. Scaling down to 1 registry pod allows the zarf init to continue and finish successfully.
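A minimal sketch of that scale-down workaround, assuming the registry Deployment is named zarf-docker-registry in the zarf namespace (matching the PVC name used in the package below):

./zarf tools kubectl scale deployment zarf-docker-registry -n zarf --replicas=1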
Custom zarf init:
kind: ZarfInitConfig
metadata:
  name: init
  description: "Nutanix CSI Driver Custom Zarf Init Package"
  architecture: amd64
  version: "0.0.1" # This version is not used by zarf, but is used for tracking with the published versions

variables:
  - name: DYNAMIC_FILE_STORE_NAME
    description: "Name of Nutanix File Server to use for Dynamic File storageclass. Should match the name value for the file server in Prism."
  - name: PRISM_ENDPOINT
    description: "IP or hostname of Prism Element."
  - name: PRISM_USERNAME
    description: "Username of prism user to use for Nutanix CSI driver."
  - name: PRISM_PASSWORD
    description: "Password for prism user to use for Nutanix CSI driver."
  - name: STORAGE_CONTAINER
    description: "Name of Nutanix Storage Container for CSI driver to create volumes in."

components:
  # (Optional) Deploys a k3s cluster
  - name: k3s
    import:
      url: oci://ghcr.io/defenseunicorns/packages/init:v0.36.1

  # This package moves the injector & registries binaries
  - name: zarf-injector
    required: true
    import:
      url: oci://ghcr.io/defenseunicorns/packages/init:v0.36.1

  # Creates the temporary seed-registry
  - name: zarf-seed-registry
    required: true
    import:
      url: oci://ghcr.io/defenseunicorns/packages/init:v0.36.1
    charts:
      - name: docker-registry
        valuesFiles:
          - values/registry-values.yaml
    # On upgrades ensure we retain the existing PV
    actions:
      onDeploy:
        before:
          - description: Set persistence for upgrade seed registry
            cmd: ./zarf tools kubectl get pvc zarf-docker-registry -n zarf >/dev/null 2>&1 && echo true || echo false
            mute: true
            setVariables:
              - name: UPGRADE_PERSISTENCE
          - description: Set env vars for upgrade seed registry
            mute: true
            cmd: |
              ./zarf tools kubectl get pvc zarf-docker-registry -n zarf >/dev/null 2>&1 && \
              echo "" || \
              echo "- name: REGISTRY_STORAGE_FILESYSTEM_ROOTDIRECTORY
                value: \"/var/lib/registry\""
            setVariables:
              - name: UPGRADE_ENV_VARS
                autoIndent: true

  # Push nutanix csi images to seed-registry
  - name: nutanix-csi-images-initial
    required: true
    description: Push nutanix images to the zarf registry
    images:
      - registry.k8s.io/sig-storage/snapshot-controller:v8.0.1
      - registry.k8s.io/sig-storage/snapshot-validation-webhook:v8.0.1
      - quay.io/karbon/ntnx-csi:v2.6.10
      - registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.11.1
      - registry.k8s.io/sig-storage/csi-provisioner:v5.0.1
      - registry.k8s.io/sig-storage/csi-snapshotter:v8.0.1
      - registry.k8s.io/sig-storage/csi-resizer:v1.11.2
      - registry.k8s.io/sig-storage/livenessprobe:v2.13.1
      - registry1.dso.mil/ironbank/opensource/velero/velero-plugin-for-csi:v0.7.1
      - registry1.dso.mil/ironbank/opensource/velero/velero-plugin-for-aws:v1.10.0

  - name: nutanix-csi-storage
    required: true
    charts:
      # renovate: datasource=helm
      - name: nutanix-csi-storage
        url: https://github.com/defenseunicorns/nutanix-helm.git # fork containing fix for imagepullsecrets needed for pods to pull images from zarf registry
        version: v2.6.10-modified
        gitPath: charts/nutanix-csi-storage
        namespace: ntnx-system
        valuesFiles:
          - values/nutanix-storage-values.yaml
    actions:
      onDeploy:
        before:
          - description: Delete Storage Classes
            cmd: ./zarf tools kubectl delete sc nutanix-volume --ignore-not-found=true

  - name: nutanix-dynamicfile-manifests
    required: true
    manifests:
      - name: nutanix-dynamicfile-manifests
        namespace: ntnx-system
        files:
          - nutanix-dynamicfile.yaml
    actions:
      onDeploy:
        before:
          - description: Delete Storage Classes
            cmd: ./zarf tools kubectl delete sc nutanix-dynamicfile --ignore-not-found=true

  # Creates the permanent registry
  - name: zarf-registry
    required: true
    import:
      url: oci://ghcr.io/defenseunicorns/packages/init:v0.36.1

  # Push nutanix csi (and registry) images to permanent registry
  - name: nutanix-csi-images
    required: true
    description: Push nutanix csi images to the zarf registry
    images:
      - registry.k8s.io/sig-storage/snapshot-controller:v8.0.1
      - registry.k8s.io/sig-storage/snapshot-validation-webhook:v8.0.1
      - quay.io/karbon/ntnx-csi:v2.6.10
      - registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.11.1
      - registry.k8s.io/sig-storage/csi-provisioner:v5.0.1
      - registry.k8s.io/sig-storage/csi-snapshotter:v8.0.1
      - registry.k8s.io/sig-storage/csi-resizer:v1.11.2
      - registry.k8s.io/sig-storage/livenessprobe:v2.13.1
      - registry1.dso.mil/ironbank/opensource/velero/velero-plugin-for-csi:v0.7.1
      - registry1.dso.mil/ironbank/opensource/velero/velero-plugin-for-aws:v1.10.0
      - "###ZARF_PKG_TMPL_REGISTRY_IMAGE_DOMAIN######ZARF_PKG_TMPL_REGISTRY_IMAGE###:###ZARF_PKG_TMPL_REGISTRY_IMAGE_TAG###"

  # Creates the pod+git mutating webhook
  - name: zarf-agent
    required: true
    import:
      url: oci://ghcr.io/defenseunicorns/packages/init:v0.36.1

  # (Optional) Adds a git server to the cluster
  - name: git-server
    import:
      url: oci://ghcr.io/defenseunicorns/packages/init:v0.36.1
Thank you for the issue @anthonywendt. This can be solved by updating the component to use health checks once #2718 is introduced. You are correct that the root of the issue is that the wait check on the permanent registry succeeds immediately. Health checks will use kstatus rather than kubectl wait under the hood, and kstatus waits for all of the pods to reach the updated state before evaluating the resource as ready.
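A rough sketch of what the registry component could look like once health checks land; the healthChecks field and the resource names below are illustrative assumptions, not the final API from #2718:

components:
  - name: zarf-registry
    required: true
    import:
      url: oci://ghcr.io/defenseunicorns/packages/init:v0.36.1
    healthChecks:
      # wait on the permanent registry Deployment via kstatus before moving on
      - apiVersion: apps/v1
        kind: Deployment
        name: zarf-docker-registry
        namespace: zarf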