
Entrypoint cannot be found in private repository image #7698

Closed
wilstdu opened this issue Feb 22, 2024 · 12 comments · Fixed by #7921
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@wilstdu
Contributor

wilstdu commented Feb 22, 2024

Expected Behavior

A Task without command and script arguments can resolve its entrypoint from the manifest of an image stored in a private repository.

Actual Behavior

PodCreationFailed.

Tekton Pipelines controller receives this error:
Failed to create task run pod for taskrun \"tester-17\": failed to create task run pod \"tester-17\": translating TaskSpec to Pod: GET https://<redacted-account-id>.dkr.ecr.<redacted-region>.amazonaws.com/v2/<redacted-repo>/<redacted>/manifests/<redacted-sha>: unexpected status code 401 Unauthorized: Not Authorized\n. Maybe missing or invalid Task default/resolve-dependencies

The issue started occurring with v0.55.0. With exactly the same system setup and an older Tekton Pipelines version, it was still working.

Steps to Reproduce the Problem

  1. Tekton deployed on AWS EKS
  2. Tekton running at least 0.55.0
  3. Image is available in private ECR repository
  4. Nodes have required IAM policies attached to pull from private ECR repositories
  5. Apply a TaskRun that references a Task having a step with an image from a private ECR repository (a minimal sketch follows below)
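
A minimal sketch of such a Task and TaskRun (the names match the error above; the image reference is a placeholder for any image in a private ECR repository):

apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: resolve-dependencies
spec:
  steps:
    - name: run
      # no command and no script, so the controller has to read the
      # entrypoint from the image config stored in ECR
      image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-repo:latest
---
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  name: tester-17
spec:
  taskRef:
    name: resolve-dependencies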

Additional Info

Discussion on Slack: https://tektoncd.slack.com/archives/CJ62C1555/p1708526794010999
Tekton is installed via the Tekton Operator with this configuration, but the operator doesn't seem to have an impact on this error:

apiVersion: operator.tekton.dev/v1alpha1
kind: TektonConfig
metadata:
  name: config
spec:
  profile: lite
  targetNamespace: tekton-pipelines
  config:
    nodeSelector:
      kubernetes.io/os: linux
    priorityClassName: system-cluster-critical
  chain:
    disabled: true
  pipeline:
    enable-api-fields: alpha
    enable-tekton-oci-bundles: true
    disable-affinity-assistant: true

• Kubernetes version: EKS 1.28
@wilstdu wilstdu added the kind/bug Categorizes issue or PR as related to a bug. label Feb 22, 2024
@afrittoli
Member

Thanks @wilstdu for reporting this.
I looked for changes in v0.55 that could have caused this change in behaviour, but nothing stood out for me.
@vdemeester @imjasonh do you have any idea about what might have broken this behaviour?

@vdemeester
Member

Essentially, the pipeline controller pod doesn't have the rights to fetch the images (really the image configuration, not the entrypoint itself).

@wilstdu with what previous version was it working? 0.54, or even earlier? Also, does it still not work with 0.56 or 0.57?

@wilstdu
Contributor Author

wilstdu commented Feb 22, 2024

@vdemeester, I was upgrading Tekton pipelines from v0.44.0 to v0.56.1.
I checked multiple Tekton Pipelines versions in between, and the last working one was 0.54; with 0.55, entrypoint retrieval no longer works. The only thing different in my setup was the tekton-pipelines pod running a different version; everything else stayed the same.

It doesn't work with 0.56, nor with 0.57.

@vdemeester
Member

@wilstdu interesting 🤔

So, the way the pipeline controller works (in that part) is that we take the imagePullSecrets from the service account attached to the PipelineRun and the imagePullSecrets from the podTemplate (taskRun.Spec.ServiceAccountName and podTemplate.ImagePullSecrets in the code), as well as some Amazon (or other cloud) specifics (the "cloud-specific" part really comes from go-containerregistry, and I am not familiar at all with what it does). Nothing in that part of the code has changed since 2022, so either something changed in go-containerregistry, or something else weird is happening 🤔
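
Roughly, that lookup looks like the sketch below using go-containerregistry (a minimal sketch, not the actual Tekton code; the namespace, service account name, and image reference are placeholder assumptions):

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/google/go-containerregistry/pkg/authn/k8schain"
	"github.com/google/go-containerregistry/pkg/name"
	"github.com/google/go-containerregistry/pkg/v1/remote"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	ctx := context.Background()

	// In-cluster client, as used by the controller pod.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Keychain built from the service account's and podTemplate's imagePullSecrets;
	// the values here are placeholders, in the controller they come from the TaskRun.
	kc, err := k8schain.New(ctx, client, k8schain.Options{
		Namespace:          "default",
		ServiceAccountName: "default",
		ImagePullSecrets:   nil,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the image config to read the entrypoint; the 401 from this issue
	// surfaces on this call when ECR credentials cannot be resolved.
	ref, err := name.ParseReference("123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-repo:latest")
	if err != nil {
		log.Fatal(err)
	}
	img, err := remote.Image(ref, remote.WithAuthFromKeychain(kc), remote.WithContext(ctx))
	if err != nil {
		log.Fatal(err)
	}
	cf, err := img.ConfigFile()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("entrypoint:", cf.Config.Entrypoint)
}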

@afrittoli
Member

I also thought that maybe it was go-containerregistry, but there was no version change between v0.54 and v0.55:

➜ git diff v0.54.0..v0.55.0 -- go.mod | grep containerreg
 	github.com/google/go-containerregistry v0.16.1
 	github.com/google/go-containerregistry/pkg/authn/k8schain v0.0.0-20230625233257-b8504803389b
 	github.com/google/go-containerregistry/pkg/authn/kubernetes v0.0.0-20230516205744-dbecb1de8cfa

@vdemeester
Member

vdemeester commented Feb 22, 2024

@afrittoli yeah, that's what makes me wonder what the hell is happening here 🙃 There are changes in the "indirect" dependencies from aws/ecr.

λ git diff v0.54.0..v0.55.0 -- go.mod | grep aws
-	github.com/sigstore/sigstore/pkg/signature/kms/aws v1.7.5
+	github.com/sigstore/sigstore/pkg/signature/kms/aws v1.7.6
-	github.com/aws/aws-sdk-go-v2/service/kms v1.24.7 // indirect
-	github.com/aws/aws-sdk-go-v2/service/ssooidc v1.17.3 // indirect
+	github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.10.3 // indirect
+	github.com/aws/aws-sdk-go-v2/service/kms v1.27.2 // indirect
+	github.com/aws/aws-sdk-go-v2/service/ssooidc v1.21.2 // indirect
-	github.com/aws/aws-sdk-go-v2 v1.21.2 // indirect
-	github.com/aws/aws-sdk-go-v2/config v1.19.1 // indirect
-	github.com/aws/aws-sdk-go-v2/credentials v1.13.43 // indirect
-	github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.13.13 // indirect
-	github.com/aws/aws-sdk-go-v2/internal/configsources v1.1.43 // indirect
-	github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.4.37 // indirect
-	github.com/aws/aws-sdk-go-v2/internal/ini v1.3.45 // indirect
+	github.com/aws/aws-sdk-go-v2 v1.23.5 // indirect
+	github.com/aws/aws-sdk-go-v2/config v1.25.11 // indirect
+	github.com/aws/aws-sdk-go-v2/credentials v1.16.9 // indirect
+	github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.14.9 // indirect
+	github.com/aws/aws-sdk-go-v2/internal/configsources v1.2.8 // indirect
+	github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.5.8 // indirect
+	github.com/aws/aws-sdk-go-v2/internal/ini v1.7.1 // indirect
 	github.com/aws/aws-sdk-go-v2/service/ecr v1.18.11 // indirect
 	github.com/aws/aws-sdk-go-v2/service/ecrpublic v1.16.2 // indirect
-	github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.9.37 // indirect
-	github.com/aws/aws-sdk-go-v2/service/sso v1.15.2 // indirect
-	github.com/aws/aws-sdk-go-v2/service/sts v1.23.2 // indirect
-	github.com/aws/smithy-go v1.15.0 // indirect
+	github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.10.8 // indirect
+	github.com/aws/aws-sdk-go-v2/service/sso v1.18.2 // indirect
+	github.com/aws/aws-sdk-go-v2/service/sts v1.26.2 // indirect
+	github.com/aws/smithy-go v1.18.1 // indirect
 	github.com/awslabs/amazon-ecr-credential-helper/ecr-login v0.0.0-20230510185313-f5e39e5f34c7 // indirect

But not in github.com/awslabs/amazon-ecr-credential-helper/ecr-login, so I am not sure if it has any impact…

@afrittoli
Member

@wilstdu if you're familiar with the process of building Tekton, you could try building v0.55 with github.com/sigstore/sigstore/pkg/signature/kms/aws pinned to v1.7.5 (I imagine that's what's pulling in all the new aws packages) and see if that works. If not, I could try to build that for you and publish it to some public artifact repo.
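
For reference, the pin would amount to a replace directive like this in go.mod (a sketch; the indirect aws-sdk-go-v2 modules may need adjusting too):

replace github.com/sigstore/sigstore/pkg/signature/kms/aws => github.com/sigstore/sigstore/pkg/signature/kms/aws v1.7.5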

@wilstdu
Contributor Author

wilstdu commented Feb 22, 2024

@afrittoli assuming I made no mistakes when building the controller image, the result is the same.

@seternate
Contributor

I ran into the same bug when trying to upgrade to the newest version v0.56.1 with the Operator. Same error message and behaviour.

One thing I managed to get working was creating an imagePullSecret, attaching it to a ServiceAccount, and using that ServiceAccount in the PipelineRun. This seems to work and the Task started, at least. But this is no solution, since AWS resets the credentials every 12 hours; it was more of a "POC" to check whether there is a permission problem.
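
The workaround amounts to something like this (a sketch; the secret, ServiceAccount, and pipeline names are placeholders, and the docker-registry secret has to be recreated from a fresh ECR token every 12 hours):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: ecr-pull-sa
imagePullSecrets:
  - name: ecr-pull-secret   # docker-registry secret created from an ECR login token
---
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: my-run
spec:
  pipelineRef:
    name: my-pipeline
  taskRunTemplate:
    serviceAccountName: ecr-pull-sa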
Once I figured that out, I double-checked everything permission-wise in my cluster/AWS account, but could not find any problem there. Even providing a ServiceAccount to the PipelineRun with the annotation "eks.amazonaws.com/role-arn: " did not work.

Any progress on this topic, or something I can help with to get it fixed?

@seternate
Contributor

seternate commented Apr 26, 2024

@afrittoli I investigated the bug a little bit further in the past few days.

I found out that updating github.com/sigstore/sigstore/pkg/signature/kms/aws from 1.7.5 to 1.7.6 on version v0.54.0 introduces the bug. After hunting and digging, I found that the issue comes from the indirect dependency github.com/aws/aws-sdk-go-v2, which introduced a breaking change in v1.23.0 that leads to the bug. They already know about that and also suggest how to fix it. See aws/aws-sdk-go-v2#2370.

All in all, every module/package of the github.com/aws/aws-sdk-go-v2 dependency needs to be updated to a version released after v1.23.0. I tried that with the newest Tekton release v0.59.0, built the images, deployed them to our AWS cluster, and everything was working again. To fix it, it was enough to pull this into go.mod:

replace (
	github.com/aws/aws-sdk-go-v2/service/ecr => github.com/aws/aws-sdk-go-v2/service/ecr v1.27.3
	github.com/aws/aws-sdk-go-v2/service/ecrpublic => github.com/aws/aws-sdk-go-v2/service/ecrpublic v1.23.3
)

I would open a PR with updated dependencies to get that out ASAP and would try to hunt down where the indirect dependency is coming from to patch that directly there (if possible).

@seternate
Contributor

@afrittoli I tracked down the dependencies.

It seems that the dependency github.com/google/go-containerregistry/pkg/authn/k8schain uses github.com/awslabs/amazon-ecr-credential-helper/ecr-login, which uses github.com/aws/aws-sdk-go-v2 with version v1.18.0 and github.com/aws/aws-sdk-go-v2/service/ecr with version v1.18.10.

github.com/sigstore/sigstore/pkg/signature/kms/aws is also using github.com/aws/aws-sdk-go-v2. But with version v1.26.0.

This leads to the problem that the resolved versions of github.com/aws/aws-sdk-go-v2 (v1.26.0) and github.com/aws/aws-sdk-go-v2/service/ecr (v1.18.10) do not match, as described in aws/aws-sdk-go-v2#2370, because v1.23.0 of github.com/aws/aws-sdk-go-v2 introduced a breaking change.

I would create a PR to fix this with a replace in the go.mod for now in Tekton.
But my question is: should I create a PR at github.com/awslabs/amazon-ecr-credential-helper/ecr-login to bump the versions of github.com/aws/aws-sdk-go-v2 and github.com/aws/aws-sdk-go-v2/service/ecr so we can get rid of the replace, or is it fine to leave it with the replace? Because from the credential-helper's point of view, everything is fine.

@vdemeester
Member

But my question is: should I create a PR at github.com/awslabs/amazon-ecr-credential-helper/ecr-login to bump the versions of github.com/aws/aws-sdk-go-v2 and github.com/aws/aws-sdk-go-v2/service/ecr so we can get rid of the replace, or is it fine to leave it with the replace? Because from the credential-helper's point of view, everything is fine.

Ideally yes 👼🏼 🙏🏼
