[SURE-7413] Fleet Repo doesn't show any error when there is an issue #2065
Comments
The job controller in gitjob should collect the job's output from a Does the error from the "bundlereader" not result in a
+1. We ran into this "problem" when the Helm credentials used to fetch an OCI Helm chart were not valid: the Rancher UI for Continuous Delivery showed the GitRepo with a green / OK status even though the job failed to fetch the Helm chart. Only the logs of the fleet container showed the problem.
For debugging this is really annoying, especially because the failing pods (where the fleet container fails) are deleted so quickly that getting the logs is not easy. As a workaround, I use a bash for loop to get the logs of the fleet container as soon as the new pod is launched.
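A rough sketch of that kind of polling workaround, written here in Python rather than bash. It assumes `kubectl` is on the PATH, that the pods carry the `job-name` label that Kubernetes Jobs put on their pods, and that the container is named `fleet` as described above. The function names, the polling interval, and the example namespace and job name are illustrative only.

```python
import subprocess
import time


def list_pods_command(namespace, job_name):
    # "job-name" is the label the Kubernetes Job controller sets on its pods.
    return ["kubectl", "get", "pods", "-n", namespace,
            "-l", f"job-name={job_name}", "-o", "name"]


def logs_command(pod, namespace, container="fleet"):
    # The container name "fleet" is taken from the comment above.
    return ["kubectl", "logs", pod, "-n", namespace, "-c", container]


def capture_logs(namespace, job_name, interval=0.5):
    """Poll for pods of the job and print their logs whenever they change."""
    last = {}
    while True:
        pods = subprocess.run(list_pods_command(namespace, job_name),
                              capture_output=True, text=True)
        for pod in pods.stdout.split():
            logs = subprocess.run(logs_command(pod, namespace),
                                  capture_output=True, text=True)
            # Print only when the container has produced new output.
            if logs.stdout and logs.stdout != last.get(pod):
                last[pod] = logs.stdout
                print(f"--- {pod} ---")
                print(logs.stdout)
        time.sleep(interval)


# Example call (runs until interrupted); names are placeholders:
# capture_logs("fleet-local", "loggin-final-1c010")
```

Because the pods are short-lived, a short polling interval matters more than anything else here; the loop may still miss a pod that is created and deleted between two polls.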
It appears to be functioning as intended, but the process is so fast that it is hard to capture the information. I think the job is continually being deleted and retried, most likely because of the fatal error condition detected in the GitJob status. It seems the GitJob is designed to respond to such errors by deleting the job to trigger a retry: https://github.com/rancher/gitjob/blob/release/fleet/v0.9/pkg/controller/gitjob/gitjobs.go#L125
time="2024-01-23T07:21:19Z" level=info msg="Deleting failed job to trigger retry fleet-local/loggin-final-1c010 due to: time="2024-01-23T07:21:16Z" level=fatal msg="no chart version found for rancher-logging-45.5.0"\n"
time="2024-01-23T07:22:20Z" level=info msg="Deleting failed job to trigger retry fleet-local/loggin-final-1c010 due to: time="2024-01-23T07:22:17Z" level=fatal msg="no chart version found for rancher-logging-45.5.0"\n"
I was able to see them in the gitjob pod logs...
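For anyone scraping the gitjob pod logs in the meantime, here is a minimal sketch of pulling the job name and the fatal message out of a "Deleting failed job to trigger retry" line. The regular expression is based only on the two log lines quoted in this thread and tolerates the inner quotes being escaped or not; the function name is made up for illustration and is not part of gitjob.

```python
import re

# Pattern derived from the log lines quoted above; not an official format.
RETRY_LINE = re.compile(
    r'Deleting failed job to trigger retry '
    r'(?P<namespace>[^/\s]+)/(?P<job>\S+) due to: '
    r'.*?level=fatal msg=\\?"(?P<reason>.*?)\\?"'
)


def parse_retry_line(line):
    """Return namespace, job and fatal reason, or None if the line does not match."""
    match = RETRY_LINE.search(line)
    return match.groupdict() if match else None


sample = (
    'time="2024-01-23T07:21:19Z" level=info msg="Deleting failed job to trigger '
    'retry fleet-local/loggin-final-1c010 due to: time="2024-01-23T07:21:16Z" '
    'level=fatal msg="no chart version found for rancher-logging-45.5.0"\\n"'
)
info = parse_retry_line(sample)
print(info["job"], "->", info["reason"])
```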
Yes, you can also try "stern", if you know how to match the pod, e.g. by label you can do
Could Fleet and the Rancher UI be extended so that in the UI one can see that a specific git repo is constantly failing?
How would you define "constantly failing"? Like a retry counter, which we reset on a successful deployment?
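To make the retry-counter idea concrete, here is a minimal sketch of a counter that increments on every failed job and resets on a successful deployment. It is not how Fleet or gitjob tracks status; the class name, the fields, and the threshold of 3 are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RepoHealth:
    """Hypothetical per-GitRepo failure tracker (not a Fleet type)."""
    threshold: int = 3              # assumed cut-off for "constantly failing"
    consecutive_failures: int = 0
    last_error: str = ""

    def record_failure(self, error):
        # Each failed job bumps the counter and keeps the latest message.
        self.consecutive_failures += 1
        self.last_error = error

    def record_success(self):
        # A successful deployment resets the counter and clears the error.
        self.consecutive_failures = 0
        self.last_error = ""

    @property
    def constantly_failing(self):
        return self.consecutive_failures >= self.threshold


health = RepoHealth()
for _ in range(3):
    health.record_failure("no chart version found for rancher-logging-45.5.0")
print(health.constantly_failing, health.last_error)
```

With something like this, the UI could surface both the counter and the last error message, so a user with read-only access sees why the repo keeps failing.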
SURE-7413
Issue Description:
When a bundle in a repo is updated to a Helm chart version that does not exist, Fleet silently ignores it, and the fleet agent job's pod keeps restarting. There is no proper indication of the error, and the bundle shows as active in the Rancher UI.
Business impact:
Developers are using Rancher to update the bundle and cannot see any error for a wrong deployment. This makes it difficult to manage the repo.
Troubleshooting steps:
Multiple developers are using Rancher only for deployment purposes. They only have read permission at the Rancher level, to check the result after a commit to the Git repo. The issue we observed is that the Rancher UI does not show any error even though there is an issue with the commit. The customer is looking for a solution where the user can see from the Rancher UI if the commit has failed.
Repro steps:
Rancher 2.7.9
Create a Git Repo in the Continuous Delivery section of Rancher. Make sure the GitRepo is in an active state.
Create a Git commit with any Helm chart (use the LH chart for testing).
The chart is getting deployed without any issues.
Now go to the Git repo again, edit the Helm chart, and change the version to one that is not available.
Go to the Rancher UI and check the Repo status: it is still active and no error is shown.
Now go to the gitjob pod and check the logs; they show the "version is not available" error.
The issue is that a Rancher user with limited access to the clusters has no way to identify the status of the last commit if there are any issues.
Workaround:
Is a workaround available and implemented? NO
Actual behaviour:
The Rancher UI does not show the Error if the last Git commit failed when using the Rancher-provided Continuous Delivery.
Expected behaviour:
The Rancher UI should show the Error if the last Git commit failed when using the Rancher-provided Continuous Delivery.