
[SURE-7413] Fleet Repo doesn't show any error when there is an issue #2065

Closed
kkaempf opened this issue Jan 12, 2024 · 9 comments

@kkaempf
Collaborator

kkaempf commented Jan 12, 2024

SURE-7413

Issue Description:

When updating a bundle in a repo to a Helm chart version that does not exist, Fleet silently ignores it and the fleet agent job's pod keeps restarting. There is no proper indication of the error, and the bundle shows as Active in the Rancher UI.

Business impact:

Developers are using Rancher to update the bundle and cannot see any error for a failed deployment, which makes it difficult to manage the repo.

Troubleshooting steps:

Multiple developers use Rancher only for deployment purposes. They have read-only permission at the Rancher level to check the result after a commit to the Git repo. The issue we observed is that the Rancher UI does not show any error even though there is a problem with the commit. The customer is looking for a solution where the user can see from the Rancher UI whether the commit has failed.

Repro steps:

Rancher 2.7.9
Create a GitRepo in the Continuous Delivery section of Rancher. Make sure the GitRepo is in an Active state.
Create a Git commit with any Helm chart (use the LH chart for testing).
The chart is deployed without any issues.

Now go back to the Git repo, edit the Helm chart reference, and change the version to one that is not available.
Go to the Rancher UI and check the repo status: it is still Active and no error is shown.
Now check the gitjob pod logs: the "version is not available" error is visible there.

The issue is that a Rancher user with limited access to the clusters has no way to identify the status of the last commit if there are any issues.

Workaround:

Is a workaround available and implemented? NO

Actual behaviour:

The Rancher UI does not show the Error if the last Git commit failed when using the Rancher-provided Continuous Delivery.

Expected behaviour:

The Rancher UI should show the Error if the last Git commit failed when using the Rancher-provided Continuous Delivery.

@kkaempf kkaempf added this to the v2.8-Next2 milestone Jan 12, 2024
@manno
Member

manno commented Jan 15, 2024

The job controller in gitjob should collect the job's output from a Failed job. If I remember correctly, the error is propagated from the job to the gitjob status, then to the gitrepo status. The UI finally reads it from the gitrepo status.

Does the error from the "bundlereader" not result in a Failed job? Does the controller fail to pick up the state, or does propagation fail?
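For illustration, the first step would look roughly like this: lifting the error message off a Failed batch/v1 Job so it can be written into the gitjob status. A minimal sketch in Go, not the actual gitjob code:

package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// failedJobMessage returns the message recorded on a Job's Failed condition, if any.
func failedJobMessage(job *batchv1.Job) (string, bool) {
	for _, cond := range job.Status.Conditions {
		if cond.Type == batchv1.JobFailed && cond.Status == corev1.ConditionTrue {
			return cond.Message, true
		}
	}
	return "", false
}

func main() {
	// Hypothetical job object; in the controller this would come from the API server.
	job := &batchv1.Job{
		Status: batchv1.JobStatus{
			Conditions: []batchv1.JobCondition{{
				Type:    batchv1.JobFailed,
				Status:  corev1.ConditionTrue,
				Reason:  "BackoffLimitExceeded",
				Message: "no chart version found for rancher-logging-45.5.0",
			}},
		},
	}
	if msg, failed := failedJobMessage(job); failed {
		// In gitjob this message would be copied into the GitJob status (e.g. a
		// Stalled condition), mirrored to the GitRepo status, and read by the UI.
		fmt.Println("propagate to status:", msg)
	}
}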

@Martin-Weiss

+1

One situation where we ran into this "problem" was when the Helm credentials used to fetch an OCI Helm chart were not valid - the Rancher UI for Continuous Delivery showed the gitrepo with a green / ok status even though the job failed to fetch the Helm chart (only checking the logs of the fleet container revealed the problem).

@khushalchandak17

+1

I have seen similar behavior when an invalid path is provided in the gitrepo.

I have created a few scenarios to illustrate this issue in detail:
Scenario 1: gitrepo (name: failbranch) with the wrong branch which shows the expected result failed on gitrepo.
Scenario 2: gitrepo (name: test) with the wrong path.
Scenario 3: gitrepo (name: logapp) with an invalid chart version.

In scenario 1, I do see that gitrepo ends up with a failed status with the error reported as “No commit for branch: fakebranch,” which is the expected result.

In scenario 2, the gitrepo remains active even though an invalid directory path has been provided. However, for a fraction of a second the UI shows the error "no resource found at the following path to deploy:[<Path>]" with the gitrepo status as 'Git Updating.' From the terminal we can see a similar error in the gitjob status, but it only stays there for a few seconds; then, I guess, it reconciles and puts the gitrepo back in the active state, flushing the error from the UI.

In scenario 3, even when an invalid chart version is provided in fleet.yaml, the gitrepo again remains in the active state. But for a fraction of a second the error "no chart version found for <chart-version>" is reported on the UI, and a similar error is visible in the gitjob and gitrepo. The status of the gitrepo was 'Git Updating,' but after reconciling it changes back to active.

The expected result in scenarios 2 & 3 was for the gitrepo status to be set to failed and the error to be shown, rather than reconciling back to active.

I have attached screenshots for the error captured over the UI for a fraction of a second in the second and third scenarios.
(screenshots: gitjob status; scenarios 2 & 3)

@Martin-Weiss

For debugging this is really annoying - especially because the failing pods (where the fleet container fails) are deleted so fast that getting the logs is not easy. As a workaround I use a bash for loop to grab the logs of the fleet container as soon as the new pod is launched.
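For reference, a rough Go equivalent of that loop using client-go: watch for new job pods by label and stream the fleet container's logs as soon as each pod shows up. The namespace and label selector below are assumptions; adjust them to your setup.

package main

import (
	"context"
	"fmt"
	"io"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ns := "cattle-fleet-system" // assumption: namespace of the gitjob-created pods
	selector := "app=fleet-job" // assumption: label carried by those pods

	watcher, err := client.CoreV1().Pods(ns).Watch(context.TODO(),
		metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		panic(err)
	}
	for event := range watcher.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		fmt.Fprintf(os.Stderr, "tailing %s\n", pod.Name)
		req := client.CoreV1().Pods(ns).GetLogs(pod.Name,
			&corev1.PodLogOptions{Container: "fleet", Follow: true})
		stream, err := req.Stream(context.TODO())
		if err != nil {
			continue // pod may not be running yet; a later watch event will retry
		}
		io.Copy(os.Stdout, stream)
		stream.Close()
	}
}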

@skanakal

It appears to be functioning as intended, but the process is exceptionally swift, making it challenging to capture the information effectively.

I think the job is continually being deleted and retried, likely due to the fatal error condition detected in the GitJob status. It seems the GitJob controller is designed to respond to such errors by deleting the job to initiate a retry...

https://github.com/rancher/gitjob/blob/release/fleet/v0.9/pkg/controller/gitjob/gitjobs.go#L125

{
  "commit": "4ff289ba5a9108502f83ee41fb17208d84bf2bb0",
  "conditions": [
    {
      "lastUpdateTime": "2024-01-23T07:21:19Z",
      "status": "False",
      "type": "Reconciling"
    },
    {
      "lastUpdateTime": "2024-01-23T07:21:47Z",
      "message": "time=\"2024-01-23T07:21:44Z\" level=fatal msg=\"no chart version found for rancher-logging-45.5.0\"\n",
      "reason": "Stalled",
      "status": "True",
      "type": "Stalled"
    },
    {
      "lastUpdateTime": "2024-01-23T07:21:27Z",
      "status": "True",
      "type": "Synced"
    }
  ],
  "jobStatus": "Failed",
  "lastSyncedTime": "2024-01-23T07:21:27Z",
  "observedGeneration": 5,
  "updateGeneration": 11
}

time="2024-01-23T07:21:19Z" level=info msg="Deleting failed job to trigger retry fleet-local/loggin-final-1c010 due to: time="2024-01-23T07:21:16Z" level=fatal msg="no chart version found for rancher-logging-45.5.0"\n"

time="2024-01-23T07:22:20Z" level=info msg="Deleting failed job to trigger retry fleet-local/loggin-final-1c010 due to: time="2024-01-23T07:22:17Z" level=fatal msg="no chart version found for rancher-logging-45.5.0"\n"

I was able to see them in gitjob pod logs...
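For illustration, that delete-to-retry pattern would look roughly like this with a plain client-go clientset (a sketch, not the actual gitjob controller code):

package sketch

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// retryFailedJob deletes a job that carries a Failed condition so that the
// owning controller recreates it on the next reconcile.
func retryFailedJob(ctx context.Context, client kubernetes.Interface, job *batchv1.Job) error {
	for _, cond := range job.Status.Conditions {
		if cond.Type == batchv1.JobFailed && cond.Status == corev1.ConditionTrue {
			// Background propagation also removes the job's pods.
			policy := metav1.DeletePropagationBackground
			return client.BatchV1().Jobs(job.Namespace).Delete(ctx, job.Name,
				metav1.DeleteOptions{PropagationPolicy: &policy})
		}
	}
	return nil
}

If the failed job's pods are removed along with it, that would also explain why the pods and their logs disappear so quickly, as Martin-Weiss noted above.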

@manno
Member

manno commented Jan 24, 2024

For debugging this is really annoying - especially because the failing pods (where the fleet container fails) are deleted so fast that getting the logs is not easy. As a workaround I use a bash for loop to grab the logs of the fleet container as soon as the new pod is launched.

Yes, you can also try "stern". If you know how to match the pod, e.g. by label, you can run stern -n cattle-fleet-system -l "app=fleet-job" and it will tail any output from jobs like that.

@Martin-Weiss

Could Fleet and the Rancher UI be extended so that in the UI one can see that a specific git repo is constantly failing?

@manno
Member

manno commented May 7, 2024

Could Fleet and the Rancher UI be extended so that in the UI one can see that a specific git repo is constantly failing?

How would you define "constantly failing"? Like a retry counter, which we reset on a successful deployment?

@0xavi0 0xavi0 assigned 0xavi0 and unassigned 0xavi0 May 14, 2024
@0xavi0
Contributor

0xavi0 commented May 15, 2024

This is working as expected in fleet v0.10.0-rc.13 (Rancher 2.9-head)

I've tested it with Rancher 2.7.9 and, although I can see all the job pods trying to fetch an invalid version for a Helm chart, the GitRepo still shows up as Active in Rancher.

I see this:

NAME                    READY   STATUS   RESTARTS   AGE
supertest-512fe-p4gms   0/2     Error    0          31s
supertest-512fe-jcqnz   0/2     Error    0          23s
supertest-512fe-xxsmw   0/2     Error    0          5s

But Rancher is still showing this:
(screenshot: the GitRepo still shown as Active)

If we test the same scenario with Rancher 2.9-head we can see:

NAME                      READY   STATUS      RESTARTS   AGE
supertest29-0ea0d-htsmw   0/1     Completed   0          2m9s
supertest29-160cf-d62f4   0/1     Error       0          69s
supertest29-160cf-gmzdv   0/1     Error       0          63s
supertest29-160cf-k9zxd   0/1     Error       0          48s

And after a few seconds we can see the error in Rancher (and the error persists):

(screenshot: the error shown on the GitRepo)

I'm closing this because its milestone is 2.9.0 and it works as expected there.
