BundleDeployments Status not being properly updated in Fleet local with Rancher 2.8-devel #2128

Closed
1 task done
mmartin24 opened this issue Feb 7, 2024 · 11 comments

@mmartin24
Collaborator

mmartin24 commented Feb 7, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When deploying fleet-examples with path simple on the local Fleet cluster on Rancher 2.8-devel, it can take more than 20 minutes for clusters to reach the Ready state. This does not seem to be only a UI issue, as the backend reports the error as well:

➜  ~ k get gitrepos -A -w
NAMESPACE     NAME               REPO                                        COMMIT                                     BUNDLEDEPLOYMENTS-READY   STATUS
fleet-local   time-health-test   https://github.com/rancher/fleet-examples   44f4634747e3dd6c9d1ad6c9d402b430f7aae20b   0/0                       
fleet-local   time-health-test   https://github.com/rancher/fleet-examples   44f4634747e3dd6c9d1ad6c9d402b430f7aae20b   0/0                       
fleet-local   time-health-test   https://github.com/rancher/fleet-examples   44f4634747e3dd6c9d1ad6c9d402b430f7aae20b   0/0                       WaitApplied(1) [Bundle time-health-test-simple]
fleet-local   time-health-test   https://github.com/rancher/fleet-examples   44f4634747e3dd6c9d1ad6c9d402b430f7aae20b   0/1                       WaitApplied(1) [Bundle time-health-test-simple]
fleet-local   time-health-test   https://github.com/rancher/fleet-examples   44f4634747e3dd6c9d1ad6c9d402b430f7aae20b   0/1                       NotReady(1) [Bundle time-health-test-simple]
fleet-local   time-health-test   https://github.com/rancher/fleet-examples   44f4634747e3dd6c9d1ad6c9d402b430f7aae20b   0/1                       NotReady(1) [Bundle time-health-test-simple]; deployment.apps default/frontend [progressing] Deployment does not have minimum availability., Available: 0/3; deployment.apps default/redis-master [progressing] Deployment does not have minimum availability., Available: 0/1; deployment.apps default/redis-slave [progressing] Deployment does not have minimum availability., Available: 0/2
fleet-local   time-health-test   https://github.com/rancher/fleet-examples   44f4634747e3dd6c9d1ad6c9d402b430f7aae20b   0/1                       NotReady(1) [Bundle time-health-test-simple]; deployment.apps default/frontend [progressing] Deployment does not have minimum availability., Available: 0/3; deployment.apps default/redis-master [progressing] Deployment does not have minimum availability., Available: 0/1; deployment.apps default/redis-slave [progressing] Deployment does not have minimum availability., Available: 0/2
fleet-local   time-health-test   https://github.com/rancher/fleet-examples   44f4634747e3dd6c9d1ad6c9d402b430f7aae20b   0/1                       NotReady(1) [Bundle time-health-test-simple]; deployment.apps default/frontend [progressing] Deployment does not have minimum availability., Available: 0/3; deployment.apps default/redis-master [progressing] Deployment does not have minimum availability., Available: 0/1; deployment.apps default/redis-slave [progressing] Deployment does not have minimum availability., Available: 0/2
fleet-local   time-health-test   https://github.com/rancher/fleet-examples   44f4634747e3dd6c9d1ad6c9d402b430f7aae20b   0/1                       NotReady(1) [Bundle time-health-test-simple]; deployment.apps default/frontend [progressing] Deployment does not have minimum availability., Available: 0/3; deployment.apps default/redis-master [progressing] Deployment does not have minimum availability., Available: 0/1; deployment.apps default/redis-slave [progressing] Deployment does not have minimum availability., Available: 0/2
fleet-local   time-health-test   https://github.com/rancher/fleet-examples   44f4634747e3dd6c9d1ad6c9d402b430f7aae20b   0/1                       NotReady(1) [Bundle time-health-test-simple]; deployment.apps default/frontend [progressing] Deployment does not have minimum availability., Available: 0/3; deployment.apps default/redis-master [progressing] Deployment does not have minimum availability., Available: 0/1; deployment.apps default/redis-slave [progressing] Deployment does not have minimum availability., Available: 0/2

[Screenshot: 20_mins_not_ok]

Expected Behavior

Clusters should be ready within a few minutes (<5 mins).

Workaround: using "Force update" fixes the issue, as can be seen here:

Screencast.from.06-02-24.14.18.05.mov

Steps To Reproduce

  1. Install a single-node cluster with k3s 1.26.10+k3s2
  2. Install Rancher v2.8-head. I used:
helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
helm repo update

helm upgrade --install rancher rancher-latest/rancher \
  --namespace cattle-system --create-namespace \
  --set global.cattle.psp.enabled=false \
  --set rancherImageTag=v2.8-head \
  --set hostname=$SYSTEM_DOMAIN \
  --set bootstrapPassword=password \
  --set replicas=1 \
  --wait
  3. Go to Continuous Delivery > toggle to fleet local > Gitjob > Add Repository
  4. Add:
  5. Click Add, then Create

Environment

- Architecture: amd
- Fleet Version: 
- Cluster:
  - Provider: k3s
  - Options: single cluster
  - Kubernetes Version: `1.26.10+k3s2`

Logs

No response

Anything else?

No response

@manno
Member

manno commented Feb 7, 2024

This looks like it's related to the chart. Can you provide more debug information?
Maybe from the fleet-agent logs, the Gitrepo/Bundle status and maybe the helm release on the downstream cluster, too?

@mmartin24
Collaborator Author

This looks like it's related to the chart. Can you provide more debug information? Maybe from the fleet-agent logs, the Gitrepo/Bundle status and maybe the helm release on the downstream cluster, too?

Thanks for looking at it @manno. Sure here you go:

➜  ~ kubectl logs -l app=fleet-agent -n cattle-fleet-local-system 
time="2024-02-07T10:57:54Z" level=info msg="getting history for release test-health-2-simple"
time="2024-02-07T10:57:54Z" level=info msg="getting history for release test-health-2-simple"
time="2024-02-07T10:57:54Z" level=error msg="bundle test-health-2-simple: deployment.apps default/redis-master [progressing] Deployment does not have minimum availability., Available: 0/1"
time="2024-02-07T10:57:54Z" level=info msg="getting history for release test-health-2-simple"
time="2024-02-07T10:57:56Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/test-health-2-simple"
time="2024-02-07T10:57:56Z" level=info msg="getting history for release test-health-2-simple"
time="2024-02-07T10:57:56Z" level=info msg="getting history for release test-health-2-simple"
time="2024-02-07T10:57:56Z" level=info msg="getting history for release test-health-2-simple"
time="2024-02-07T10:57:56Z" level=error msg="bundle test-health-2-simple: deployment.apps default/redis-master [progressing] Deployment does not have minimum availability., Available: 0/1"
time="2024-02-07T10:57:56Z" level=info msg="getting history for release test-health-2-simple"

➜  ~ kubectl get bundle -n fleet-local fleet-agent-local -o=jsonpath={.status} | jq
{
  "conditions": [
    {
      "lastUpdateTime": "2024-02-06T13:49:39Z",
      "status": "True",
      "type": "Ready"
    },
    {
      "lastUpdateTime": "2024-02-06T13:49:39Z",
      "status": "True",
      "type": "Processed"
    }
  ],
  "display": {
    "readyClusters": "1/1"
  },
  "maxNew": 50,
  "maxUnavailable": 1,
  "maxUnavailablePartitions": 0,
  "observedGeneration": 1,
  "partitions": [
    {
      "count": 1,
      "maxUnavailable": 1,
      "name": "All",
      "summary": {
        "desiredReady": 1,
        "ready": 1
      }
    }
  ],
  "resourceKey": [
    {
      "apiVersion": "apps/v1",
      "kind": "Deployment",
      "name": "fleet-agent",
      "namespace": "cattle-fleet-local-system"
    },
    {
      "apiVersion": "networking.k8s.io/v1",
      "kind": "NetworkPolicy",
      "name": "default-allow-all",
      "namespace": "cattle-fleet-local-system"
    },
    {
      "apiVersion": "rbac.authorization.k8s.io/v1",
      "kind": "ClusterRole",
      "name": "cattle-fleet-local-system-fleet-agent-role"
    },
    {
      "apiVersion": "rbac.authorization.k8s.io/v1",
      "kind": "ClusterRoleBinding",
      "name": "cattle-fleet-local-system-fleet-agent-role-binding"
    },
    {
      "apiVersion": "v1",
      "kind": "ServiceAccount",
      "name": "default",
      "namespace": "cattle-fleet-local-system"
    },
    {
      "apiVersion": "v1",
      "kind": "ServiceAccount",
      "name": "fleet-agent",
      "namespace": "cattle-fleet-local-system"
    }
  ],
  "summary": {
    "desiredReady": 1,
    "ready": 1
  },
  "unavailable": 0,
  "unavailablePartitions": 0
}

➜  ~ kubectl get gitrepo -n fleet-local test-health-2  -o=jsonpath={.status} | jq
{
  "commit": "44f4634747e3dd6c9d1ad6c9d402b430f7aae20b",
  "conditions": [
    {
      "lastUpdateTime": "2024-02-07T10:57:54Z",
      "message": "NotReady(1) [Bundle test-health-2-simple]; deployment.apps default/frontend [progressing] Deployment does not have minimum availability., Available: 0/3; deployment.apps default/redis-master [progressing] Deployment does not have minimum availability., Available: 0/1; deployment.apps default/redis-slave [progressing] Deployment does not have minimum availability., Available: 0/2",
      "status": "False",
      "type": "Ready"
    },
    {
      "lastUpdateTime": "2024-02-07T11:10:44Z",
      "status": "True",
      "type": "Accepted"
    },
    {
      "lastUpdateTime": "2024-02-07T10:57:41Z",
      "status": "True",
      "type": "ImageSynced"
    },
    {
      "lastUpdateTime": "2024-02-07T10:57:51Z",
      "status": "False",
      "type": "Reconciling"
    },
    {
      "lastUpdateTime": "2024-02-07T10:57:51Z",
      "status": "False",
      "type": "Stalled"
    },
    {
      "lastUpdateTime": "2024-02-07T11:10:44Z",
      "status": "True",
      "type": "Synced"
    }
  ],
  "desiredReadyClusters": 1,
  "display": {
    "readyBundleDeployments": "0/1",
    "state": "NotReady"
  },
  "gitJobStatus": "Current",
  "lastSyncedImageScanTime": null,
  "observedGeneration": 1,
  "readyClusters": 0,
  "resourceCounts": {
    "desiredReady": 6,
    "missing": 0,
    "modified": 0,
    "notReady": 3,
    "orphaned": 0,
    "ready": 3,
    "unknown": 0,
    "waitApplied": 0
  },
  "resources": [
    {
      "apiVersion": "apps/v1",
      "id": "default/frontend",
      "kind": "Deployment",
      "message": "Deployment does not have minimum availability.; Available: 0/3",
      "name": "frontend",
      "namespace": "default",
      "perClusterState": [
        {
          "clusterId": "fleet-local/local",
          "message": "Deployment does not have minimum availability.; Available: 0/3",
          "state": "updating",
          "transitioning": true
        }
      ],
      "state": "updating",
      "transitioning": true,
      "type": "apps.deployment"
    },
    {
      "apiVersion": "apps/v1",
      "id": "default/redis-master",
      "kind": "Deployment",
      "message": "Deployment does not have minimum availability.; Available: 0/1",
      "name": "redis-master",
      "namespace": "default",
      "perClusterState": [
        {
          "clusterId": "fleet-local/local",
          "message": "Deployment does not have minimum availability.; Available: 0/1",
          "state": "updating",
          "transitioning": true
        }
      ],
      "state": "updating",
      "transitioning": true,
      "type": "apps.deployment"
    },
    {
      "apiVersion": "apps/v1",
      "id": "default/redis-slave",
      "kind": "Deployment",
      "message": "Deployment does not have minimum availability.; Available: 0/2",
      "name": "redis-slave",
      "namespace": "default",
      "perClusterState": [
        {
          "clusterId": "fleet-local/local",
          "message": "Deployment does not have minimum availability.; Available: 0/2",
          "state": "updating",
          "transitioning": true
        }
      ],
      "state": "updating",
      "transitioning": true,
      "type": "apps.deployment"
    },
    {
      "apiVersion": "v1",
      "id": "default/frontend",
      "kind": "Service",
      "name": "frontend",
      "namespace": "default",
      "state": "Ready",
      "type": "service"
    },
    {
      "apiVersion": "v1",
      "id": "default/redis-master",
      "kind": "Service",
      "name": "redis-master",
      "namespace": "default",
      "state": "Ready",
      "type": "service"
    },
    {
      "apiVersion": "v1",
      "id": "default/redis-slave",
      "kind": "Service",
      "name": "redis-slave",
      "namespace": "default",
      "state": "Ready",
      "type": "service"
    }
  ],
  "summary": {
    "desiredReady": 1,
    "nonReadyResources": [
      {
        "bundleState": "NotReady",
        "name": "test-health-2-simple",
        "nonReadyStatus": [
          {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "redis-master",
            "namespace": "default",
            "summary": {
              "message": [
                "Deployment does not have minimum availability.",
                "Available: 0/1"
              ],
              "state": "updating",
              "transitioning": true
            },
            "uid": "47074a6e-1a55-4a11-b482-3b1bf38b9258"
          },
          {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "redis-slave",
            "namespace": "default",
            "summary": {
              "message": [
                "Deployment does not have minimum availability.",
                "Available: 0/2"
              ],
              "state": "updating",
              "transitioning": true
            },
            "uid": "57fa1a5d-7245-4791-abea-7644629a150d"
          },
          {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "frontend",
            "namespace": "default",
            "summary": {
              "message": [
                "Deployment does not have minimum availability.",
                "Available: 0/3"
              ],
              "state": "updating",
              "transitioning": true
            },
            "uid": "e23816d5-383b-4f64-9026-2d812f645e73"
          }
        ]
      }
    ],
    "notReady": 1,
    "ready": 0
  }
}
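
A status dump like the one above is long, so it can help to reduce it to one line per resource that is not yet Ready. The following is a minimal sketch, not part of Fleet: the function name `not_ready_resources` is hypothetical, and the field names (`resources`, `kind`, `id`, `state`, `message`) are taken from the output above. The sample data is a trimmed copy of that output.

```python
from typing import Any


def not_ready_resources(status: dict[str, Any]) -> list[str]:
    """Return one summary line per resource whose state is not "Ready".

    `status` is the parsed `.status` object of a GitRepo, as printed above.
    """
    lines = []
    for res in status.get("resources", []):
        state = res.get("state")
        if state == "Ready":
            continue
        line = f"{res.get('kind')} {res.get('id')}: {state}"
        message = res.get("message")
        if message:
            line += f" ({message})"
        lines.append(line)
    return lines


# Trimmed copy of the GitRepo status shown above (two of the six resources).
SAMPLE_STATUS = {
    "resources": [
        {
            "kind": "Deployment",
            "id": "default/frontend",
            "state": "updating",
            "message": "Deployment does not have minimum availability.; Available: 0/3",
        },
        {"kind": "Service", "id": "default/frontend", "state": "Ready"},
    ],
}

for summary_line in not_ready_resources(SAMPLE_STATUS):
    print(summary_line)
```

To use it against a live cluster, the same `.status` object can be obtained with the `kubectl get gitrepo ... -o=jsonpath={.status}` command shown above and parsed with `json.loads`.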

Component   Version
Rancher     v2.8-4716667b53c4b76e82fe34207899da46470c5e0d-head
Dashboard   release-2.8 0aaba7166
Helm        v2.16.8-rancher2
Machine     v0.15.0-rancher109

@thehejik

I consider this a critical issue. It behaves the same on today's build of Rancher v2.8-882d1e32f9e5093f61aba4822fb2361dd61d66e6-head with the fleet[-crd]:103.1.1+up0.9.1-rc.5 application (pre)installed.

When adding a private HTTPS repo with basic auth for the first time, whether on the local or a downstream cluster, it fails: the status of the GitRepo is Not Ready even though the resources are actually provisioned on the target cluster(s).

The status can be hotfixed by selecting the GitRepo and pressing Force Update.

I could reproduce this with a simple nginx deployment stored in private Azure, Bitbucket and GitLab repositories. Surprisingly, it works fine with a private GitHub repo.

@thehejik

Here you can see that 15 Pods are up in the default namespace, but the Fleet bundle reports Available: 0/15:
image

@manno manno added this to the v2.8.3 milestone Feb 27, 2024
@manno
Member

manno commented Feb 27, 2024

This only happens with basic auth credentials for the git repo?
And I guess, it does not affect Github repos, since they don't allow basic auth anymore?

@thehejik

This only happens with basic auth credentials for the git repo? And I guess, it does not affect Github repos, since they don't allow basic auth anymore?

Unfortunately, I could also reproduce it with a private Azure DevOps SSH repo and keys on local rancher:2.8.3-rc1 with these Fleet images:

rancher/fleet:v0.9.1-rc.5
rancher/gitjob:v0.9.3
rancher/fleet-agent:v0.9.1-rc.5

image

$ kubectl get bundles -A
NAMESPACE     NAME                BUNDLEDEPLOYMENTS-READY   STATUS
fleet-local   fleet-agent-local   1/1                       
fleet-local   azure-private-ssh   0/1                       NotReady(1) [Cluster fleet-local/local]; deployment.apps default/fleet-nginx [progressing] Deployment does not have minimum availability., Available: 0/15

The pods are up, and Force Update helped this time as well.

@thehejik

Private Github HTTPS repo is not affected.

@thehejik

To me it looks like the fleet-agent pod checks the deployment from the GitRepo, and the availability of its pods, 4 times and then gives up just before the pods are scheduled.

cattle-fleet-local-system fleet-agent-75f9c4fd7d-flpd2 fleet-agent time="2024-02-28T13:02:59Z" level=error msg="bundle azure-private-https: deployment.apps default/fleet-nginx [progressing] Deployment does not have minimum availability., Available: 0/15"

Logs from all pods/jobs and namespaces here
Logs from all pods/jobs and namespaces after pressing Force update here
(produced by: stern -A . --since 1s)

@mmartin24 mmartin24 changed the title Clusters not ready for a long time when deploying Git Repos in Fleet local with Rancher 2.8-devel Authentication issues when deploying Repos (except Github Private) in Fleet local with Rancher 2.8-devel Feb 29, 2024
@kkaempf kkaempf modified the milestones: v2.8.3, v2.8-Next1 Mar 1, 2024
@p-se p-se self-assigned this Mar 4, 2024
@kkaempf kkaempf modified the milestones: v2.8-Next1, v2.8-Next2 Mar 4, 2024
@p-se p-se removed their assignment Mar 4, 2024
@0xavi0 0xavi0 self-assigned this Mar 22, 2024
@0xavi0 0xavi0 changed the title Authentication issues when deploying Repos (except Github Private) in Fleet local with Rancher 2.8-devel BundleDeployments Status not being properly updated in Fleet local with Rancher 2.8-devel Mar 26, 2024
@0xavi0
Contributor

0xavi0 commented Mar 26, 2024

I've updated the subject of the issue, as I could reproduce it using almost any GitRepo; there is no need to use private repos.

While investigating, I realised the cause: the Modify events that deployments send when they become ready were being filtered out, so the BundleDeployment (and the Bundle) kept being reported as Not Ready. GitRepos take their status from Bundles, which is why they showed up as Not Ready as well.
At some later point (after 30 minutes or so) there is a refresh in the cluster, and that is when we finally see the real Ready status.
The issue was introduced in this PR, which was reverted recently.

The fix will be available when Rancher 2.8.3 is released, so no stable release should be affected.
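
The failure mode described above can be illustrated with a toy model. This is not Fleet's code, and the thread does not say which filter was involved. The sketch assumes, purely for illustration, a filter that only passes update events whose `metadata.generation` changed. Kubernetes bumps a Deployment's generation on spec changes but not on status changes, so the event announcing that the Deployment became available is dropped, and the cached state stays stale until a full resync. All class and function names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    name: str
    generation: int  # bumped on spec changes only
    available: int   # status field: available replicas
    desired: int


def generation_changed(old: Deployment, new: Deployment) -> bool:
    """Assumed filter: pass only updates that changed the spec."""
    return old.generation != new.generation


def pass_all(old: Deployment, new: Deployment) -> bool:
    """No filtering: every update event is processed."""
    return True


class StatusCache:
    """Tracks the last Deployment state that made it through the filter."""

    def __init__(self, update_filter):
        self.update_filter = update_filter
        self.seen = {}

    def on_add(self, obj: Deployment) -> None:
        self.seen[obj.name] = obj

    def on_update(self, old: Deployment, new: Deployment) -> None:
        if self.update_filter(old, new):
            self.seen[new.name] = new

    def resync(self, objs) -> None:
        # A periodic full refresh bypasses the event filter.
        for obj in objs:
            self.seen[obj.name] = obj

    def ready(self, name: str) -> bool:
        dep = self.seen[name]
        return dep.available >= dep.desired


def simulate(update_filter):
    """Return (ready before resync, ready after resync)."""
    created = Deployment("frontend", generation=1, available=0, desired=3)
    became_available = Deployment("frontend", generation=1, available=3, desired=3)
    cache = StatusCache(update_filter)
    cache.on_add(created)
    cache.on_update(created, became_available)  # status-only change
    before = cache.ready("frontend")
    cache.resync([became_available])
    return before, cache.ready("frontend")


print("generation filter:", simulate(generation_changed))
print("no filter:", simulate(pass_all))
```

With the assumed generation filter, the model reports Not Ready until the resync, which mirrors the delayed status seen in this issue; without the filter it reports Ready as soon as the update event arrives.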

@0xavi0
Contributor

0xavi0 commented Mar 27, 2024

QA Template

Solution

The solution was to roll back the PR that caused the issue.

Testing

Deploy any GitRepo with a Fleet version from v0.9.1-rc.2 to v0.9.1-rc.6.
For example:

apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: sample
  namespace: fleet-local
spec:
  repo: "https://github.com/0xavi0/fleet-examples"
  branch: master
  paths:
  - single-cluster/helm

The GitRepo, Bundle and BundleDeployment don't reach the READY state, although all related deployments (and pods) are ready.

The fix should be in the upcoming version v0.9.1-rc.7, which should be released with Rancher 2.8.3.
With the fixed version everything should be back to normal.
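
For a scripted version of this check, one option is to poll the GitRepo until its `Ready` condition is `True` or a deadline passes. The following is a rough sketch under assumptions: the helper names are hypothetical, the 300-second default reflects the "<5 mins" expectation stated in this issue, and the condition layout (`type`, `status`) is the one shown in the status output earlier in the thread. It shells out to `kubectl get gitrepo ... -o json`, so `kubectl` must be on the PATH and pointed at the right cluster.

```python
import json
import subprocess
import time


def kubectl_gitrepo_status(name: str, namespace: str = "fleet-local") -> dict:
    """Fetch the .status object of a GitRepo via kubectl."""
    out = subprocess.run(
        ["kubectl", "get", "gitrepo", "-n", namespace, name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out).get("status", {})


def wait_until_ready(
    fetch_status,
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep=time.sleep,
    clock=time.monotonic,
) -> bool:
    """Poll fetch_status() until the Ready condition is True.

    Returns True once Ready is seen, False if the timeout is reached first.
    `sleep` and `clock` are injectable so the loop can be tested offline.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        for cond in status.get("conditions", []):
            if cond.get("type") == "Ready" and cond.get("status") == "True":
                return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

For the GitRepo in the example above, a call could look like `wait_until_ready(lambda: kubectl_gitrepo_status("sample"))`. On an affected version this would be expected to return False after the timeout even though the pods are running.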

Additional info

@mmartin24
Collaborator Author

Fix noted.

Checked in Rancher 2.8.3-rc6 with Fleet version 103.1.2+up0.9.2
Tested using: https://github.com/rancher/fleet-examples/tree/master/simple

kind: GitRepo
apiVersion: fleet.cattle.io/v1alpha1
metadata:
  name: simple
  namespace: fleet-local
spec:
  repo: https://github.com/rancher/fleet-examples
  paths:
  - simple

The example deployed correctly, showing 6 resources deployed within seconds:
[Screenshot: 2024-03-27_15-44_1]

This is in contrast to Rancher v2.8.0 with Fleet version 103.1.0+up0.9.0, where it fails:
[Screenshot: 2024-03-27_15-44_1]

This test has been added to our UI e2e regression test Fleet-62 here.

@zube zube bot added the [zube]: Done label Mar 27, 2024
0xavi0 added a commit to 0xavi0/fleet that referenced this issue Apr 2, 2024
Related to: rancher#2128

Signed-off-by: Xavi Garcia <xavi.garcia@suse.com>
0xavi0 added a commit to 0xavi0/fleet that referenced this issue Apr 2, 2024
Related to: rancher#2128

Signed-off-by: Xavi Garcia <xavi.garcia@suse.com>
0xavi0 added a commit to 0xavi0/fleet that referenced this issue Apr 2, 2024
Related to: rancher#2128

Signed-off-by: Xavi Garcia <xavi.garcia@suse.com>
0xavi0 added a commit to 0xavi0/fleet that referenced this issue Apr 3, 2024
Related to: rancher#2128

Signed-off-by: Xavi Garcia <xavi.garcia@suse.com>
manno pushed a commit that referenced this issue Apr 10, 2024
Related to: #2128

Signed-off-by: Xavi Garcia <xavi.garcia@suse.com>