Better handling of terraform validation failure in digger #1286

Open · motatoes opened this issue Mar 19, 2024 · 8 comments

@motatoes (Contributor)

Hi,
it looks like Digger does not post a comment when Terraform resources are not correct. Here is an example with a pagerduty resource which is missing a rule block:

Acquiring state lock. This may take a few moments...

Error: Insufficient rule blocks

Basically, if terraform validate fails:

the run is marked as failed
the comment is not updated and remains stuck at:

👷 The following projects are impacted

🕚 pagerduty: pending...

the check is marked as failing (expected 🙂). Do you think we can add the run links as details to the check? (Linking to the failing step would be even better.)

Reported by user A.T.

@evanstachowiak (Contributor)

I'm having the same issue. If, for instance, a terraform plan errors because the role doesn't have all of the correct IAM permissions for a plan, Digger will never update its status or the comment on the PR.

@ben-of-codecraft (Contributor)

I would like to vote this up; I encountered this today as well.

@motatoes (Contributor, Author)

I can take a look at this. Are there any steps that can help me reproduce this scenario? It's probably a case of Digger bailing early but not updating the reporter.

@evanstachowiak (Contributor)

@motatoes, one scenario was the recent issue where the locks were broken.

@evanstachowiak (Contributor)

Here is another one where the status does not get updated:

Error while fetching policy: unexpected response while fetching project policy: <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
</body>
</html>
 code 504
Warning: nil passed to plan result, sending empty
Failed to run commands. error checking policy: unexpected response while fetching project policy: <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
</body>
</html>
 code 504
Error: Job failed.

@motatoes (Contributor, Author)

Hey @evanstachowiak, thanks for chiming in! To be honest, there is a solution with a PR pending for general plan and apply failures: #1579

Regarding things like failure to report to the backend and the job crashing or not starting, I was thinking that we should have a catch-all timeout on the orchestrator side: if it doesn't hear from the job within X minutes, it considers the job timed out. This would require implementing a heartbeat method where the CLI reports heartbeat pings to the orchestrator in the background every few seconds. If the orchestrator doesn't hear an initial heartbeat within 1 minute of firing the job, or subsequently doesn't get a heartbeat every 10 seconds or so, it can fail the run with a timeout, post a comment, etc.

That's not the only solution, but I think it's a catch-all for cases when the job crashes, for whatever reason. Thoughts?
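For illustration, here is a minimal sketch of what that watchdog could look like on the orchestrator side. All names here are hypothetical, not Digger's actual code; it just assumes the CLI calls some Heartbeat endpoint every few seconds:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// watchdog tracks dispatch times and the last heartbeat per job.
// Hypothetical sketch; not Digger's actual implementation.
type watchdog struct {
	mu       sync.Mutex
	started  map[string]time.Time // jobID -> when the job was dispatched
	lastSeen map[string]time.Time // jobID -> time of last heartbeat
}

func newWatchdog() *watchdog {
	return &watchdog{started: map[string]time.Time{}, lastSeen: map[string]time.Time{}}
}

// JobDispatched records that the orchestrator fired a job.
func (w *watchdog) JobDispatched(jobID string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.started[jobID] = time.Now()
}

// Heartbeat is what the CLI would call in the background every few seconds.
func (w *watchdog) Heartbeat(jobID string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.lastSeen[jobID] = time.Now()
}

// sweep fails jobs that never sent an initial heartbeat within initialGrace,
// or whose heartbeats have gone silent for longer than staleAfter.
func (w *watchdog) sweep(initialGrace, staleAfter time.Duration, fail func(jobID string)) {
	w.mu.Lock()
	defer w.mu.Unlock()
	now := time.Now()
	for jobID, startedAt := range w.started {
		seen, ok := w.lastSeen[jobID]
		switch {
		case !ok && now.Sub(startedAt) > initialGrace:
			fail(jobID) // never heard from the job at all
			delete(w.started, jobID)
		case ok && now.Sub(seen) > staleAfter:
			fail(jobID) // heartbeats stopped mid-run
			delete(w.started, jobID)
			delete(w.lastSeen, jobID)
		}
	}
}

func main() {
	w := newWatchdog()
	w.JobDispatched("job-123")
	// Sweep periodically with the numbers discussed above: 1 minute initial
	// grace, then ~10s heartbeat staleness.
	for range time.Tick(5 * time.Second) {
		w.sweep(time.Minute, 10*time.Second, func(jobID string) {
			// Here the orchestrator would post the PR comment / fail the check.
			fmt.Println("job timed out:", jobID)
		})
	}
}
```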

@evanstachowiak
Copy link
Contributor

@motatoes, shouldn't the orchestrator have the ID of the job running when it first marks it as pending? Then you don't necessarily need a heartbeat; the orchestrator can check the job status every so often, and if the job is finished with no status update it can mark it as failed.
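For reference, the run status is available from the GitHub REST API via GET /repos/{owner}/{repo}/actions/runs/{run_id}, which returns status and conclusion fields. A rough sketch of that polling loop (the reportedBack callback is a hypothetical stand-in for whatever state the orchestrator keeps about reports from the CLI):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// pollRun polls a workflow run until it completes, then flags runs that
// finished without the CLI ever reporting a status back. Sketch only.
func pollRun(owner, repo string, runID int64, token string, reportedBack func() bool) error {
	url := fmt.Sprintf("https://api.github.com/repos/%s/%s/actions/runs/%d", owner, repo, runID)
	for attempt := 0; attempt < 120; attempt++ {
		req, err := http.NewRequest("GET", url, nil)
		if err != nil {
			return err
		}
		req.Header.Set("Authorization", "Bearer "+token)
		req.Header.Set("Accept", "application/vnd.github+json")
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err
		}
		var run struct {
			Status     string `json:"status"`     // queued, in_progress, completed
			Conclusion string `json:"conclusion"` // success, failure, cancelled, ...
		}
		err = json.NewDecoder(resp.Body).Decode(&run)
		resp.Body.Close()
		if err != nil {
			return err
		}
		if run.Status == "completed" {
			if !reportedBack() {
				// The job ended but never updated the orchestrator: mark the
				// project failed and update the PR comment here.
				fmt.Println("run completed without a status report, conclusion:", run.Conclusion)
			}
			return nil
		}
		time.Sleep(30 * time.Second)
	}
	return fmt.Errorf("run %d still not completed after polling", runID)
}

func main() {
	// Hypothetical usage; 123456789 stands in for a real run ID.
	err := pollRun("diggerhq", "digger", 123456789, os.Getenv("GITHUB_TOKEN"),
		func() bool { return false })
	fmt.Println(err)
}
```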

@motatoes (Contributor, Author)

@evanstachowiak in fact it does know the ID of the job, just not when it first fires it (the GitHub API does not return the run ID on workflow dispatch); we have to query the list of running jobs and cross-match on the ID that we sent as input. We could do that in the background after firing the job, but right now we only actually learn the ID when the CLI reports back for the first time.
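For context, the workflow_dispatch endpoint returns 204 No Content with no run ID, so the cross-match has to happen afterwards. One common workaround (sketched below with plain REST calls; not necessarily exactly what Digger does, and the workflow filename is an assumption) is to pass a unique ID as a workflow input, have the workflow interpolate it into a job name, and then list recent runs and their jobs until one matches:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"strings"
	"time"
)

const api = "https://api.github.com"

func ghRequest(method, url, token string, body []byte) (*http.Response, error) {
	req, err := http.NewRequest(method, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Accept", "application/vnd.github+json")
	return http.DefaultClient.Do(req)
}

// dispatchAndFindRun fires a workflow with a unique ID as an input, then
// polls recent runs and cross-matches on a job name containing that ID.
// Assumes the workflow echoes the input into a job name; sketch only.
func dispatchAndFindRun(owner, repo, workflow, token, uniqueID string) (int64, error) {
	payload, _ := json.Marshal(map[string]any{
		"ref":    "main",
		"inputs": map[string]string{"id": uniqueID},
	})
	url := fmt.Sprintf("%s/repos/%s/%s/actions/workflows/%s/dispatches", api, owner, repo, workflow)
	resp, err := ghRequest("POST", url, token, payload)
	if err != nil {
		return 0, err
	}
	resp.Body.Close() // dispatch returns 204 No Content: no run ID to use

	for attempt := 0; attempt < 20; attempt++ {
		time.Sleep(3 * time.Second)
		runsURL := fmt.Sprintf("%s/repos/%s/%s/actions/runs?event=workflow_dispatch&per_page=10", api, owner, repo)
		resp, err := ghRequest("GET", runsURL, token, nil)
		if err != nil {
			return 0, err
		}
		var runs struct {
			WorkflowRuns []struct {
				ID int64 `json:"id"`
			} `json:"workflow_runs"`
		}
		json.NewDecoder(resp.Body).Decode(&runs)
		resp.Body.Close()
		for _, run := range runs.WorkflowRuns {
			jobsURL := fmt.Sprintf("%s/repos/%s/%s/actions/runs/%d/jobs", api, owner, repo, run.ID)
			jresp, err := ghRequest("GET", jobsURL, token, nil)
			if err != nil {
				return 0, err
			}
			var jobs struct {
				Jobs []struct {
					Name string `json:"name"`
				} `json:"jobs"`
			}
			json.NewDecoder(jresp.Body).Decode(&jobs)
			jresp.Body.Close()
			for _, j := range jobs.Jobs {
				if strings.Contains(j.Name, uniqueID) {
					return run.ID, nil
				}
			}
		}
	}
	return 0, fmt.Errorf("no run matching %s found", uniqueID)
}

func main() {
	// Hypothetical usage; the workflow filename is an assumption.
	id, err := dispatchAndFindRun("diggerhq", "digger", "digger_workflow.yml",
		os.Getenv("GITHUB_TOKEN"), "job-abc123")
	fmt.Println(id, err)
}
```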
