TestHermeticTaskRun is flakey #4567

Closed
jerop opened this issue Feb 11, 2022 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flakey test lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@jerop
Member

jerop commented Feb 11, 2022

Expected Behavior

TestHermeticTaskRun should only fail due to actual bugs

Actual Behavior

TestHermeticTaskRun flaked in:

Error waiting for TaskRun not-hermetic-run-as-root to finish: "not-hermetic-run-as-root" failed
Error executing command: fork/exec /tekton/scripts/script-0-wrvhk: permission denied
@jerop jerop added kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Feb 11, 2022
@pritidesai pritidesai added the kind/flake Categorizes issue or PR as related to a flakey test label Feb 11, 2022
@bobcatfish
Collaborator

Some more context: it looks like for #4541 it failed 3 times in a row:

https://tekton-releases.appspot.com/builds/tekton-prow/pr-logs/pull/tektoncd_pipeline/4541/pull-tekton-pipeline-alpha-integration-tests/

I think Error executing command: fork/exec /tekton/scripts/script-0-wrvhk: permission denied might be a red herring - I think that might actually be from a previous run with hermetic mode on (makes me wonder if the expected failure in hermetic mode is happening for the right reason - maybe script mode doesn't work with hermetic mode? but that's another story!)

It seems like the failing taskrun is timing out:

          podName: not-hermetic-run-as-root-pod
          startTime: "2022-02-02T10:36:41Z"
          steps:
          - container: step-access-network
            imageID: docker-pullable://ubuntu@sha256:669e010b58baf5beb2836b253c1fd5768333f0d1dbcb834f7c07a4dc93f474be
            name: access-network
            terminated:
              exitCode: 1
              finishedAt: "2022-02-02T10:37:41Z"
              reason: TaskRunTimeout
              startedAt: "2022-02-02T10:36:46Z"

And then I think we're not getting any logs b/c iirc when a TaskRun times out we have to stop the pod from executing, and I think that might involve deleting the underlying pod?? I'm getting rusty though so I'm not sure XD but if so that might explain why we aren't seeing any logs for the taskrun that is timing out:

    build_logs.go:35: Could not get logs for pod not-hermetic-run-as-root-pod: pods "not-hermetic-run-as-root-pod" not found

Looking at the test that is failing, I'm wondering if it might be that the apt-get commands sometimes take more than a minute 🤔

apt-get update
apt-get install -y curl

@bobcatfish bobcatfish self-assigned this Feb 16, 2022
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Feb 16, 2022
Also use Errorf instead of Fatalf between the two tests (the hermetic
test and the non-hermetic tests) so that if one fails the other will
still run.

In tektoncd#4567 we see that the hermetic end to end test sometimes fails,
specifically it seems to be the `not-hermetic-run-as-root` version of
the test, and it seems like the failure is hitting the 1 minute timeout.

Looking at the test, it seems to be doing an `apt-get update` which
seems like an operation that would be in grave danger of sometimes
taking a while (especially depending on what version of the latest
ubuntu image is running) so although I'm not sure that's what is causing
the problem, I want to try doing something that is less likely to take
so long but still would require network access, as well as something
that would require privileged access (which I assume is why the update
was included, to capture the combo of network access and doing something
privileged)
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Feb 16, 2022
Also use Errorf instead of Fatalf between the two tests (the hermetic
test and the non-hermetic tests) so that if one fails the other will
still run.

In tektoncd#4567 we see that the hermetic end to end test sometimes fails,
specifically it seems to be the `not-hermetic-run-as-root` version of
the test, and it seems like the failure is hitting the 1 minute timeout.

Looking at the test, it seems to be doing an `apt-get update` which
seems like an operation that would be in grave danger of sometimes
taking a while (especially depending on what version of the latest
ubuntu image is running) so although I'm not sure that's what is causing
the problem, I want to try doing something that is less likely to take
so long but still would require network access, as well as something
that would require privileged access - which I assume is why the update
was included, to capture the combo of network access and doing something
privileged. I'm still a bit confused about why both of those elements
are present - I assume both are not allowed in hermetic mode but it
would probably make more sense to test them separately to be sure they
each fail, otherwise only one is covered (i.e. either the network call
is going to fail and halt things, or the privileged operation)
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Feb 17, 2022
Also use Errorf instead of Fatalf between the two tests (the hermetic
test and the non-hermetic tests) so that if one fails the other will
still run.

In tektoncd#4567 we see that the hermetic end to end test sometimes fails,
specifically it seems to be the `not-hermetic-run-as-root` version of
the test, and it seems like the failure is hitting the 1 minute timeout.

Looking at the test, it seems to be doing an `apt-get update` which
seems like an operation that would be in grave danger of sometimes
taking a while (especially depending on what version of the latest
ubuntu image is running) so although I'm not sure that's what is causing
the problem, I want to try doing something that is less likely to take
so long but still would require network access.

I thought maybe that it was also trying to do something that required
privileged execution (i.e. running as root) but it seems like that's
not something that hermetic mode drops anyway (looking at the TEP it
seems to just be scoped to networking) so it doesn't feel like there is
actually any need for that.
tekton-robot pushed a commit that referenced this issue Mar 21, 2022
Also use Errorf instead of Fatalf between the two tests (the hermetic
test and the non-hermetic tests) so that if one fails the other will
still run.

In #4567 we see that the hermetic end to end test sometimes fails,
specifically it seems to be the `not-hermetic-run-as-root` version of
the test, and it seems like the failure is hitting the 1 minute timeout.

Looking at the test, it seems to be doing an `apt-get update` which
seems like an operation that would be in grave danger of sometimes
taking a while (especially depending on what version of the latest
ubuntu image is running) so although I'm not sure that's what is causing
the problem, I want to try doing something that is less likely to take
so long but still would require network access.

I thought maybe that it was also trying to do something that required
privileged execution (i.e. running as root) but it seems like that's
not something that hermetic mode drops anyway (looking at the TEP it
seems to just be scoped to networking) so it doesn't feel like there is
actually any need for that.
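
The Errorf vs Fatalf point in the commit message above can be sketched in Go without the real test harness. The `recorder` type below is a hypothetical stand-in for `*testing.T` that mimics only those two methods: `Errorf` records a failure and returns, while `Fatalf` records a failure and stops the calling goroutine with `runtime.Goexit`, which is what `testing.T` does. The check names and their order are illustrative, not taken from the actual test.

```go
package main

import (
	"fmt"
	"runtime"
)

// recorder is a hypothetical stand-in for *testing.T.
type recorder struct {
	failures []string // messages passed to Errorf or Fatalf
	ran      []string // names of the checks that actually started
}

// Errorf records the failure and lets the caller continue.
func (r *recorder) Errorf(format string, args ...interface{}) {
	r.failures = append(r.failures, fmt.Sprintf(format, args...))
}

// Fatalf records the failure and stops the calling goroutine.
func (r *recorder) Fatalf(format string, args ...interface{}) {
	r.failures = append(r.failures, fmt.Sprintf(format, args...))
	runtime.Goexit()
}

// runChecks runs two checks in order on their own goroutine, much like a
// test function. The first check always fails and reports through fail.
func runChecks(r *recorder, fail func(string, ...interface{})) {
	done := make(chan struct{})
	go func() {
		defer close(done) // still runs when Goexit stops this goroutine
		r.ran = append(r.ran, "not-hermetic")
		fail("not-hermetic check failed")
		r.ran = append(r.ran, "hermetic")
	}()
	<-done
}

func main() {
	withFatal := &recorder{}
	runChecks(withFatal, withFatal.Fatalf)
	fmt.Println("with Fatalf, checks run:", withFatal.ran)

	withError := &recorder{}
	runChecks(withError, withError.Errorf)
	fmt.Println("with Errorf, checks run:", withError.ran)
}
```

With `Fatalf` the second check never starts; with `Errorf` both run and the failure is still recorded.
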
@bobcatfish
Collaborator

Hopefully this is fixed by #4567 but plz re-open if it pops up again!

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 19, 2023
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 19, 2023
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
