
[SURE-7413] Fleet Repo doesn't show any error when there is an issue #2065

Closed
kkaempf opened this issue Jan 12, 2024 · 9 comments

@kkaempf
Collaborator

kkaempf commented Jan 12, 2024

SURE-7413

Issue Description:

When updating a bundle in a repo to a Helm chart version that does not exist, Fleet silently ignores it and the fleet agent job's pod keeps restarting. There is no proper indication of the error, and the bundle shows as Active in the Rancher UI.

Business impact:

Developers are using Rancher to update the bundle and cannot see any error for a failed deployment, which makes it difficult to manage the repo.

Troubleshooting steps:

Multiple developers use Rancher only for deployment purposes. They have read-only permission at the Rancher level to check the result after a commit to the Git repo. The issue we observed is that the Rancher UI does not show any error even though there is a problem with the commit. The customer is looking for a solution where the user can see from the Rancher UI whether the commit has failed.

Repro steps:

Rancher 2.7.9
Create a GitRepo in the Continuous Delivery section of Rancher. Make sure the GitRepo is in an Active state.
Create a Git commit with any Helm chart (use the LH chart for testing).
The chart is deployed without any issues.

Now go back to the Git repo, edit the Helm chart reference, and change the version to one that is not available.
Go to the Rancher UI and check the repo status: it is still Active and no error is shown.
Now check the gitjob pod logs: the "version is not available" error is visible there.

The issue is that a Rancher user with limited access to the clusters has no way to identify the status of the last commit if there are any issues.

Workaround:

Is a workaround available and implemented? NO

Actual behaviour:

The Rancher UI does not show the Error if the last Git commit failed when using the Rancher-provided Continuous Delivery.

Expected behaviour:

The Rancher UI should show the Error if the last Git commit failed when using the Rancher-provided Continuous Delivery.

@kkaempf kkaempf added this to the v2.8-Next2 milestone Jan 12, 2024
@manno
Member

manno commented Jan 15, 2024

The job controller in gitjob should collect the job's output from a Failed job. If I remember correctly, the error is propagated from the job to the gitjob status, then to the gitrepo status. The UI finally reads it from the gitrepo status.

Does the error from the "bundlereader" not result in a Failed job? Does the controller fail to pick up the state, or does propagation fail?
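For illustration, the first step would look roughly like this: lifting the error message off a Failed batch/v1 Job so it can be written into the gitjob status. A minimal sketch in Go, not the actual gitjob code:

package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// failedJobMessage returns the message recorded on a Job's Failed condition, if any.
func failedJobMessage(job *batchv1.Job) (string, bool) {
	for _, cond := range job.Status.Conditions {
		if cond.Type == batchv1.JobFailed && cond.Status == corev1.ConditionTrue {
			return cond.Message, true
		}
	}
	return "", false
}

func main() {
	// Hypothetical job object; in the controller this would come from the API server.
	job := &batchv1.Job{
		Status: batchv1.JobStatus{
			Conditions: []batchv1.JobCondition{{
				Type:    batchv1.JobFailed,
				Status:  corev1.ConditionTrue,
				Reason:  "BackoffLimitExceeded",
				Message: "no chart version found for rancher-logging-45.5.0",
			}},
		},
	}
	if msg, failed := failedJobMessage(job); failed {
		// In gitjob this message would be copied into the GitJob status (e.g. a
		// Stalled condition), mirrored to the GitRepo status, and read by the UI.
		fmt.Println("propagate to status:", msg)
	}
}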

@Martin-Weiss

+1

One situation where we ran into this "problem" was when the Helm credentials used to fetch an OCI Helm chart were not valid - the Rancher UI for Continuous Delivery showed the gitrepo with a green / ok status even though the job failed to fetch the Helm chart (only checking the logs of the fleet container revealed the problem).

@khushalchandak17

+1

I have seen similar behavior when an invalid path is provided in the gitrepo.

I have created a few scenarios to illustrate this issue in detail:
Scenario 1: gitrepo (name: failbranch) with the wrong branch which shows the expected result failed on gitrepo.
Scenario 2: gitrepo (name: test) with the wrong path.
Scenario 3: gitrepo (name: logapp) with an invalid chart version.

In scenario 1, I do see that gitrepo ends up with a failed status with the error reported as “No commit for branch: fakebranch,” which is the expected result.

In scenario 2, the gitrepo remains active even though an invalid directory path has been provided. However, for a fraction of a second the UI shows the error "no resource found at the following path to deploy:[<Path>]" with the gitrepo status as 'Git Updating.' From the terminal we can see a similar error in the gitjob status, but it only stays there for a few seconds; then, I guess, it reconciles and puts the gitrepo back in the active state, flushing the error from the UI.

In scenario 3, even when an invalid chart version is provided in fleet.yaml, the gitrepo again remains in the active state. But for a fraction of a second the error "no chart version found for <chart-version>" is reported on the UI, and a similar error is visible in the gitjob and gitrepo. The status of the gitrepo was 'Git Updating,' but after reconciling it changes back to active.

The expected result in scenarios 2 & 3 was for the gitrepo status to be set to failed and the error to be shown, rather than reconciling back to active.

I have attached screenshots for the error captured over the UI for a fraction of a second in the second and third scenarios.
(screenshots: gitjob status; scenarios 2 & 3)

@Martin-Weiss

For debugging this is really annoying - especially because the failing pods (where the fleet container fails) are deleted so fast that getting the logs is not easy. As a workaround I use a bash for loop to grab the logs of the fleet container as soon as the new pod is launched.
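For reference, a rough Go equivalent of that loop using client-go: watch for new job pods by label and stream the fleet container's logs as soon as each pod shows up. The namespace and label selector below are assumptions; adjust them to your setup.

package main

import (
	"context"
	"fmt"
	"io"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ns := "cattle-fleet-system" // assumption: namespace of the gitjob-created pods
	selector := "app=fleet-job" // assumption: label carried by those pods

	watcher, err := client.CoreV1().Pods(ns).Watch(context.TODO(),
		metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		panic(err)
	}
	for event := range watcher.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		fmt.Fprintf(os.Stderr, "tailing %s\n", pod.Name)
		req := client.CoreV1().Pods(ns).GetLogs(pod.Name,
			&corev1.PodLogOptions{Container: "fleet", Follow: true})
		stream, err := req.Stream(context.TODO())
		if err != nil {
			continue // pod may not be running yet; a later watch event will retry
		}
		io.Copy(os.Stdout, stream)
		stream.Close()
	}
}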

@skanakal

It appears to be functioning as intended, but the process is exceptionally swift, making it challenging to capture the information effectively.

I think the job is continually being deleted and retried, likely due to the fatal error condition detected in the GitJob status. It seems the GitJob controller is designed to respond to such errors by deleting the job to initiate a retry...

https://github.com/rancher/gitjob/blob/release/fleet/v0.9/pkg/controller/gitjob/gitjobs.go#L125

{
  "commit": "4ff289ba5a9108502f83ee41fb17208d84bf2bb0",
  "conditions": [
    {
      "lastUpdateTime": "2024-01-23T07:21:19Z",
      "status": "False",
      "type": "Reconciling"
    },
    {
      "lastUpdateTime": "2024-01-23T07:21:47Z",
      "message": "time=\"2024-01-23T07:21:44Z\" level=fatal msg=\"no chart version found for rancher-logging-45.5.0\"\n",
      "reason": "Stalled",
      "status": "True",
      "type": "Stalled"
    },
    {
      "lastUpdateTime": "2024-01-23T07:21:27Z",
      "status": "True",
      "type": "Synced"
    }
  ],
  "jobStatus": "Failed",
  "lastSyncedTime": "2024-01-23T07:21:27Z",
  "observedGeneration": 5,
  "updateGeneration": 11
}

time="2024-01-23T07:21:19Z" level=info msg="Deleting failed job to trigger retry fleet-local/loggin-final-1c010 due to: time="2024-01-23T07:21:16Z" level=fatal msg="no chart version found for rancher-logging-45.5.0"\n"

time="2024-01-23T07:22:20Z" level=info msg="Deleting failed job to trigger retry fleet-local/loggin-final-1c010 due to: time="2024-01-23T07:22:17Z" level=fatal msg="no chart version found for rancher-logging-45.5.0"\n"

I was able to see them in gitjob pod logs...
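For illustration, that delete-to-retry pattern would look roughly like this with a plain client-go clientset (a sketch, not the actual gitjob controller code):

package sketch

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// retryFailedJob deletes a job that carries a Failed condition so that the
// owning controller recreates it on the next reconcile.
func retryFailedJob(ctx context.Context, client kubernetes.Interface, job *batchv1.Job) error {
	for _, cond := range job.Status.Conditions {
		if cond.Type == batchv1.JobFailed && cond.Status == corev1.ConditionTrue {
			// Background propagation also removes the job's pods.
			policy := metav1.DeletePropagationBackground
			return client.BatchV1().Jobs(job.Namespace).Delete(ctx, job.Name,
				metav1.DeleteOptions{PropagationPolicy: &policy})
		}
	}
	return nil
}

If the failed job's pods are removed along with it, that would also explain why the pods and their logs disappear so quickly, as Martin-Weiss noted above.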

@manno
Member

manno commented Jan 24, 2024

For debugging this is really annoying - especially because the failing pods (where the fleet container fails) are deleted so fast that getting the logs is not easy. As a workaround I use a bash for loop to grab the logs of the fleet container as soon as the new pod is launched.

Yes, you can also try "stern". If you know how to match the pod, e.g. by label, you can run stern -n cattle-fleet-system -l "app=fleet-job" and it will tail any output from jobs like that.

@Martin-Weiss

Could Fleet and the Rancher UI be extended so that in the UI one can see that a specific git repo is constantly failing?

@manno
Member

manno commented May 7, 2024

Could Fleet and the Rancher UI be extended so that in the UI one can see that a specific git repo is constantly failing?

How would you define "constantly failing"? Like a retry counter, which we reset on a successful deployment?

@0xavi0 0xavi0 assigned 0xavi0 and unassigned 0xavi0 May 14, 2024
@0xavi0
Contributor

0xavi0 commented May 15, 2024

This is working as expected in fleet v0.10.0-rc.13 (Rancher 2.9-head)

I've tested it with Rancher 2.7.9 and, although I can see all the job pods trying to fetch an invalid version for a Helm chart, the GitRepo still shows up as Active in Rancher.

I see this:

NAME                    READY   STATUS   RESTARTS   AGE
supertest-512fe-p4gms   0/2     Error    0          31s
supertest-512fe-jcqnz   0/2     Error    0          23s
supertest-512fe-xxsmw   0/2     Error    0          5s

But Rancher is still showing this:
(screenshot: the GitRepo still shown as Active)

If we test the same scenario with Rancher 2.9-head we can see:

NAME                      READY   STATUS      RESTARTS   AGE
supertest29-0ea0d-htsmw   0/1     Completed   0          2m9s
supertest29-160cf-d62f4   0/1     Error       0          69s
supertest29-160cf-gmzdv   0/1     Error       0          63s
supertest29-160cf-k9zxd   0/1     Error       0          48s

And after a few seconds we can see the error in Rancher (and the error persists):

(screenshot: the error shown on the GitRepo)

I'm closing this because its milestone is 2.9.0 and it works as expected there.
