-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Renovate not exiting since 24.40.0 #8660
Comments
How do you find the image? Are you overriding the entry point? |
No, there's no entrypoint specified. Only an args value
|
How is ECR authentication achieved? Eg is it run on EC2 with an IAM role? |
yes, web-identity tokens using IAM Roles for Service Accounts. The pod has IAM permissions to pull from the ECR registry |
@caphrim007 Can you check which processes are running in container? When i do some local tests it will simply work for me. |
verified with |
@caphrim007 can you provide steps to pass the IAM stuff to renovate, as i don't know how to do / to test that |
@viceice some more information v24.39.3
v24.40.0
It feels like something it not being returned to dumb-init, but I don't know what would be not returning. Is it possible that the ecr-client is not closing its connection and just hanging open? There's no authentication error that I can see. Again, renovate runs just fine. Its the "what happens after its done" part that seems to be a problem. I observed this as well v24.39.3
where-as in v24.40.0 I see this
dumb-init is not reporting that it received anything from its child process. In terms of passing the IAM stuff to renovate, well, I don't know how trivial it is. The process we're using is called IAM Roles for Service Accounts. https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/ There's a role in IAM with permissions to read cloud resources if needed. That role arn is provided as an annotation on the Kubernetes Service Account. Then AWS magic will cause the Pod to have 2 environment variables set when it starts AWS_ROLE_ARN=arn:aws:iam::111111111111:role/MyServiceRoleForRenovate The content of the web identity token file is a JWT token. This token can be used to assume a role in IAM; the role in AWS_ROLE_ARN. In all situations, the way that this is best handled is to just leave it to the AWS SDK. If that env var is present, the SDK libraries will use it and get access to whatever cloud resources they need. But having said that, again, all of that "stuff" appears to be working just fine. It is able to auth to ECR, it finds new images, it sends PRs, that's all good. Its just when it goes to exit, it behaves like something is...like...still open, or waiting, or the exit code is not trickled up to dumb-init. I'm suspicious of the aws-sdk/ecr-client. I'm looking in to how I can revert just this patch here c90c506#diff-922b65de7866726a28ede70241c7f637072005518289aa3755070398bc5b1db7 and re-build a renovate image to see if that is the culprit. |
adding more notes here for reference. I added this library to the renovate image https://github.com/myndzi/wtfnode and added its usage to renovate.js within this block at the end
On the working image (39.3), this outputs the following
And then the process exits On the 40.0 image, the output is the following
Note the mention of the sockets. My hypothesis is that aws-sdk/client-ecr...or the usage of it...is keeping these open. Since they never close, this prevents the anonymous async function from exiting. I additionally tried changing the anonymous function to be the following to see if it would affect the cronjob run
The above does make the cronjob succeed. This is discussed in several forums and the nodejs notes as not recommended because """ This would seem to reasonably explain why it works though when I test its usage; ie, stopping an async operation which hasn't fully completed. https://nodejs.org/api/process.html#process_process_exit_code I am trying to understand further why the sockets remain open, if that is expected, and whether there is an appropriate way to gracefully shut them down. I've pasted the output of the following two functions which provide more insight into these sockets
https://gist.github.com/caphrim007/5ec84329d962eff26ad3f2c8d57c9f4a Will post more as I learn more |
We should check in case the library has a function to close sockets. Otherwise maybe it's a library bug/shortcoming. We can consider a hard process exit too |
@caphrim007 Are the ips the originals? Do you have a kind of proxy between ecr and your renovatebot? I remember we had some trouble with proxies. Maybe they can be the cause why the connection isn't closed. There are a lot issues in node about not closing http sockets. |
Those IPs look like the EC2 metadata endpoint: https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/instancedata-data-retrieval.html Ie used to fetch credentials, not ECR itself |
ok, so it looks like the problem only occures aws internally. make it even harder to debug. I'll check the awds sdk issues for this problem later |
btw: there was an aws sdk update a few version after 24.40. @caphrim007 did you chaecked latest renovate version too? |
I updated the ECR lib in #8680 Don't see any mention of this problem being fixed but you never know. |
Renovate v24.40 comes with aws sdk v3.2. v24.42 brings aws v3.4.1 and your pr is in renovate v24.49.7 and brings aws v3.5. That's way i ask if he tested latest renovate version too. Maybe aws sdk v3.3 makes a difference. |
@viceice I've assembled a minimal (non-renovate) repro that exhibits the behavior I've been describing. Before I post it, I looked through the issues for client-ecr and found these
Though I can't say for sure, because again, renovate doesn't have any problem querying my private ECR registry from inside the cluster...afaik...I could be convinced that issue 1808 is related to this. When I run the following code in a cronjob pod within kubernetes, I get the same behavior I described in this thread.
It's a semi-faithful repro. I also made a repro that uses the before-change code
It produced this output
So there is clearly a difference between aws v2 sdk and v3 sdk. Even if the difference between await and callbacks is functionally identical, something is not the same buried in the v3 sdk. According to the first GH issue I linked above, AWS claims an EKS web-id token fix is aimed for release in v3.6.0 scheduled for release at/around 2/18. @rarkins and @viceice if you don't mind, I'm going to temporarily table this discussion because I don't think this is renovate's problem; you're just the affected party. I'm going to reach out to my AWS TAM and provide my aws repros and have them tell me if there's a bug in their SDK. When I learn more I'll post here. In the meantime I can leave myself pinned to the 24.39.3 version or (at worst) I can build a new image with Will update this thread when I know more |
My suspicion now is that it's related to IAM code and not ECR directly. Ie it's the code that fetches your credentials that's left hanging and not the code that fetches from ECR |
The AWS team has pushed back the release of the web-id token support to work on (in their opinion) a more important missing feature. As such, this issue will be pushed back accordingly. just fyi |
FWIW we are experiencing a similar issue. our setup is a self-hosted renovatebot installed via npm and run on GitLab CI in the regular |
Please use node:14, or even better |
yeah, we just upgraded to that, but it did not solve the issue unfortunately :-/ |
Shall we add a delay followed by process.exit() until AWS fixes it? It would be nice if there was some function we could call to find out what's stopping the exit normally and log it too. |
I'm experiencing this issue w/ both the |
@narwold are you also using an ECR registry? |
@rarkins Maybe use an env variable a workaround to do a force exit when renovate is done? |
the AWS team behind the SDK, as i mentioned in a previous post, seems to have de-prioritzed (what I think) is the solution (or towards a solution?). I haven't commented back on this thread yet because they haven't made any commits related to it. just fyi. |
We have a new experimental environment variable to force a hard exit, so please have a try until the issue is fixed upstream on aws sdk.
https://docs.renovatebot.com/self-hosted-experimental/#renovate_x_hard_exit |
Maybe it's time to raise an issue at aws sdk. |
FWIW I can confirm that using |
@rarkins aws/aws-sdk-js-v3#1808 has been implemented and released as 3.10.0. Would it be possible to bump |
Updated, should be released in a few minutes |
Ref: #10527 |
Release as |
I will chime back in with results on my end once I see the image come through. |
v25.49.1 works great for me. Renovate terminates correctly (without |
I can verify what @msw-kialo says. it's working for me also without the env var. |
Thanks everyone. We'll remove the env var now and close this issue in the process |
i dont know if this is specifically related to the last release, but somewhere in my upgrade path, after i put 25.49.1 into prod, renovate autoclosed all my existing open PRs and hasn't opened any new PRs since....not sure what happened |
ahh...seems it might be related.
seems i need to retract the earlier statement about it knowing about webid tokens. the process exits completely with the removal of the env var, but I dont see web-id tokens working on my end. @msw-kialo can you provide any more detail on your env working? |
for reference i reverted back to 25.47.6, with env var to force quit, and ECR registries were found again without issue. Will need to spend time tomorrow bisecting this unless anyone has ideas from merged PRs between 25.47.6 and latest |
Can you verify v25.49.0 is working, so the issue is the SDK update in v25.49.1? |
@caphrim007 The usual setup: the serviceAccount for the cronjob/pods has the securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000 # this was crucial for us In our case the EKS cluster and the ECR repositories are in different AWS accounts (shouldn't matter) but in the same AWS region: We haven't checked whether this case is handled correctly. We did have issues with this in other software (the injected |
@caphrim007 Would this work: securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 0 As our default image user has uid=1000 and gid=0. |
@viceice @rarkins i think I figured it out...after a long day of console.log'ing my way through the code. The first thing that I got stuck on was this https://github.com/renovatebot/renovate/blob/main/lib/datasource/docker/index.ts#L97 The first time through, that appears to work fine (it skips), however all future runs of renovate (i was running it many times in a single long-running pod) would return there because the return value of .get is I looked at some other datasources and they all use docker, however, does not (among like 3 in total). Perhaps this is a bug. It certainly made debugging in run-time challenging. After I got that sorted, I realized I needed to get to the bottom of whatever was in this err I spent some time reading about ecr client, the newer web-identity packages that aws released, and I tried this solution to see if it just plain out wasn't reading the token to get a credential I finally reached a point where the aws-sdk-js sent to stdout that I was "unable to sts:AssumeRoleWithWebIdentity". This was super weird to me because...I swear...this has been working without issue for ages. There was this ticket that prompted the To answer @viceice 's question, yes, it was working with v25.49.0. I specifically tried it because you had asked. I tried the suggested securityContext tweaks as well, without success; same error I posted earlier about When I looked at the trust relationship as well as the annotation specifying the role arn in the service account in k8s, I found what @msw-kialo so presciently alluded to; they were different. My ServiceAccount annotation was After that fix, there was one more error. The existing IAM policy that I've used for ages, seems to have gained a requirement that I missed somewhere in release notes. Specifically, it now additionally needs After I added that, the service came back and the lookup errors went away. I'm going to do some more clean-up on my end to make sure this is now correct, as well as try the newer releases of renovate to see if I continue to see success. Will update the ticket here shortly once I do that. Appreciate the help that everyone on this thread provided! |
In Renovate datasources we use |
I've not seen that issue before. I've copied the same cache checking logic as is used everywhere else. We will cache a renovate/lib/util/cache/package/decorator.ts Lines 106 to 118 in 226ec07
|
I've decided to leave the env var there in case it's of use to people in future, so will close this now. |
What Renovate type, platform and version are you using?
We run the self-hosted renovate full image (renovate/renovate) in a kubernetes cronjob.
Describe the bug
Starting in version 24.40.0, the renovate process no longer exits.
It emits an exiting log message,
but the process in the container never dies. This causes the pod to never die, and as a result, the cronjob never finishes. This blocks all future running of the job.
In version 24.39.3 this problem did not exist. However I looked at the source code diff between the two versions of renovate, and there's nothing that jumps out at me as to why the process is suddenly now failing to stop.
Relevant debug logs
Have you created a minimal reproduction repository?
Please read the minimal reproductions documentation to learn how to make a good minimal reproduction repository.
Additional context
The text was updated successfully, but these errors were encountered: