
increase timer for TestRealRunnerTimeout #6409

Closed
wants to merge 1 commit

Conversation

Yongxuanzhang
Member

@Yongxuanzhang Yongxuanzhang commented Mar 21, 2023

Changes

This commit increases the timer for TestRealRunnerTimeout in the hope that it mitigates/fixes the flake in #4643.

Some thoughts about why #4643 happened: the flaky test fails with "step didn't timeout", which means that rr.Run doesn't return any error, including the DeadlineExceeded error. It could be that the context timeout ends up accidentally larger than the sleep time, so Run finishes before the context times out (either the effective sleep time decreased, or the effective ctx timeout increased).
If we look at those failed reports:
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/5666/pull-tekton-pipeline-unit-tests/1584556112152104960
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/5666/pull-tekton-pipeline-unit-tests/1584676793351147520
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/5465/pull-tekton-pipeline-unit-tests/1570762197682884608
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/6120/pull-tekton-pipeline-unit-tests/1623358151954796544

In those runs TestRealRunnerTimeout took 0.08s or 0.09s, but a passing run should take less time than that (code run time plus ctx timeout, plus sleep time if the process started; in the passing reports most runs take less than 0.02s). So I suspect the ctx timeout took much longer than intended, and the test finished (code run time + process sleep time) in less than the ctx timeout. I think we can increase the sleep time to 0.2s to leave more room and possibly avoid this flake.
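
For reference, a minimal, self-contained sketch of the timing relationship the test depends on (illustrative only: the durations and the run helper below are made up, not the actual TestRealRunnerTimeout code):

// Illustrative sketch only; not the actual entrypoint runner or test code.
// The test can only pass if the context deadline fires strictly before the
// child process finishes sleeping, so Run can surface DeadlineExceeded.
package main

import (
    "context"
    "errors"
    "fmt"
    "os/exec"
    "time"
)

// run is a hypothetical stand-in for rr.Run: start a command under a
// context and translate a deadline expiry into context.DeadlineExceeded.
func run(ctx context.Context, name string, args ...string) error {
    cmd := exec.CommandContext(ctx, name, args...)
    if err := cmd.Run(); err != nil {
        if ctx.Err() == context.DeadlineExceeded {
            return context.DeadlineExceeded
        }
        return err
    }
    return nil
}

func main() {
    // The ctx timeout (10ms) must stay well below the sleep time (100ms);
    // if scheduling jitter inverts that relationship, run returns nil and
    // the test reports "step didn't timeout".
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
    defer cancel()
    if err := run(ctx, "sleep", "0.1"); !errors.Is(err, context.DeadlineExceeded) {
        fmt.Println("step didn't timeout")
        return
    }
    fmt.Println("timed out as expected")
}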

/kind flake

Signed-off-by: Yongxuan Zhang yongxuanzhang@google.com

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs included if any changes are user facing
  • Has Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

NONE

This commit increases the timer for TestRealRunnerTimeout in the hope
that it reduces the flake of tektoncd#4643.

Some thoughts about why tektoncd#4643 happened. The flaky test got
"step didn't timeout", which means that rr.Run doesn't return any
errors, including the DeadlineExceeded error. It could be that the
context timeout is accidentally larger than the sleep time and Run
finishes without a context timeout. So I think we may increase the
sleep time to avoid this flake, even though it is already a rare case.

Signed-off-by: Yongxuan Zhang yongxuanzhang@google.com
@tekton-robot
Collaborator

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@tekton-robot tekton-robot added kind/flake Categorizes issue or PR as related to a flakey test release-note-none Denotes a PR that doesnt merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Mar 21, 2023
@tekton-robot tekton-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 21, 2023
@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign dibyom after the PR has been reviewed.
You can assign the PR to them by writing /assign @dibyom in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 21, 2023
@Yongxuanzhang Yongxuanzhang marked this pull request as ready for review March 21, 2023 13:08
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 21, 2023
@lbernick lbernick self-assigned this Mar 21, 2023
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 30, 2023
@tekton-robot
Collaborator

@Yongxuanzhang: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lbernick
Member

Thanks @Yongxuanzhang for looking into this!

I agree that "step didn't timeout" signals that rr.Run doesn't return any errors, but I'm not convinced by the explanation here that for whatever reason, "context timeout is accidentally larger than the sleep time".

I think this change will make it less likely that we observe test failure, but I'm concerned the problem is in the code rather than the tests. My theory of what's happening here:

The entrypoint waits for the stdout/stderr buffers here before waiting for the command to finish:

// Wait for stdout/err buffers to finish reading before returning.
wg.Wait()
// Wait for command to exit
if err := cmd.Wait(); err != nil {
    if ctx.Err() == context.DeadlineExceeded {
        return context.DeadlineExceeded
    }
    return err
}

If reading from stdout/stderr takes longer than the context timeout, I think the call to cmd.Wait will not return an error, because the command will have already exited with code 0. Reading from stdout/stderr could take extra time for the same reasons as in #5254 (comment).

This would imply a bug in the code rather than in the tests. I think it could be fixed by running the two Waits in different goroutines? The problem is that we don't have a reliable way of reproducing this bug.

I'm not sure if this is actually problematic in real scenarios or not.
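
A rough sketch of one way to read the "two Waits in different goroutines" idea (not the actual entrypoint code; runWithDeadline and its parameters are invented for illustration):

// Rough sketch, not the Tekton entrypoint code: don't let a slow
// stdout/stderr drain delay the deadline check until after the command
// has already exited 0.
package sketch

import (
    "context"
    "os/exec"
    "sync"
)

// runWithDeadline is a hypothetical helper: cmd has already been started,
// and wg tracks the stdout/stderr copy goroutines.
func runWithDeadline(ctx context.Context, cmd *exec.Cmd, wg *sync.WaitGroup) error {
    errCh := make(chan error, 1)
    go func() {
        wg.Wait()           // drain stdout/err buffers in the background
        errCh <- cmd.Wait() // then reap the process
    }()
    select {
    case <-ctx.Done():
        // The deadline (or a cancellation) fired first: surface it even
        // if the buffers are still draining.
        if ctx.Err() == context.DeadlineExceeded {
            return context.DeadlineExceeded
        }
        return ctx.Err()
    case err := <-errCh:
        if ctx.Err() == context.DeadlineExceeded {
            return context.DeadlineExceeded
        }
        return err
    }
}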

@Yongxuanzhang
Member Author

Yongxuanzhang commented Apr 3, 2023


If I run the test locally it will return here:

// Start defined command
if err := cmd.Start(); err != nil {
    if ctx.Err() == context.DeadlineExceeded {
        return context.DeadlineExceeded
    }
    return err
}

In this case it never reaches the code you mentioned, and I suspect this may also happen in our CI (I can open a test PR to check this behaviour). That's why I commented in the original issue that I'm not sure the test is testing what it was originally meant to test.

So I think maybe we should also update the context timeout value if this needs to be fixed?
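
A hedged sketch of what I see locally (not the real runner code; it assumes the command is built with exec.CommandContext, which is what the snippet above suggests): when the deadline has already expired by the time cmd.Start runs, Start itself fails and Run returns DeadlineExceeded without ever reaching cmd.Wait.

// Hedged sketch, not the real runner: an already-expired deadline makes
// cmd.Start fail, so the timeout-during-execution path is never exercised.
package main

import (
    "context"
    "fmt"
    "os/exec"
    "time"
)

func main() {
    // A deadline so short it has effectively expired before Start runs.
    ctx, cancel := context.WithTimeout(context.Background(), time.Nanosecond)
    defer cancel()
    time.Sleep(time.Millisecond) // make sure the deadline has passed

    cmd := exec.CommandContext(ctx, "sleep", "0.1")
    if err := cmd.Start(); err != nil {
        if ctx.Err() == context.DeadlineExceeded {
            fmt.Println("returned DeadlineExceeded from the Start branch")
            return
        }
        fmt.Println("Start failed:", err)
        return
    }
    fmt.Println("command started; the Wait path would be exercised")
    _ = cmd.Wait()
}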

@Yongxuanzhang
Member Author


Regarding the possible reason you mentioned: if that is true, then we should be able to reproduce this flake? We could add a sleep after the wg.Wait() and check whether no error is returned.

And I'm also not sure about this:

If reading from stdout/stderr takes longer than the context timeout, I think the call to cmd.Wait will not return an error,

If it takes a longer time, why won't cmd.Wait() return an error?

@lbernick
Member

lbernick commented Apr 4, 2023

Regarding the possible reason you mentioned: if that is true, then we should be able to reproduce this flake? We could add a sleep after the wg.Wait() and check whether no error is returned.

Good idea! I tried this with sleep = 10s, timeout = 1s, and command = 5s at commit 538fee3. Interestingly, rr.Run takes 10s (when it should time out after 1s), but then the tests pass.
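
A hedged sketch of that experiment (not the actual change at 538fee3; it just recreates the pre-#6162 ordering, with an artificial delay standing in for a slow wg.Wait()):

// Hedged sketch of the reproduction, not the change tried at 538fee3:
// a 5s command, a 1s context timeout, and a 10s delay between starting
// the command and calling cmd.Wait.
package main

import (
    "context"
    "fmt"
    "os/exec"
    "time"
)

func runOldShape(ctx context.Context) error {
    cmd := exec.CommandContext(ctx, "sleep", "5") // the "command = 5s" part
    if err := cmd.Start(); err != nil {
        if ctx.Err() == context.DeadlineExceeded {
            return context.DeadlineExceeded
        }
        return err
    }
    time.Sleep(10 * time.Second) // stand-in for wg.Wait() blocked on stdout/stderr
    if err := cmd.Wait(); err != nil {
        if ctx.Err() == context.DeadlineExceeded {
            return context.DeadlineExceeded
        }
        return err
    }
    return nil
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), time.Second) // "timeout = 1s"
    defer cancel()
    start := time.Now()
    err := runOldShape(ctx)
    // The deadline is reported roughly 10s late rather than after ~1s.
    fmt.Printf("took %v, err = %v\n", time.Since(start).Round(time.Second), err)
}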

However, #6162 removed the call to wg.Wait(), so there's nothing blocking in between the calls to cmd.Start and cmd.Wait.

My guess is that there may have been a bug, but if so it was fixed by #6162. However, it seems like the existing test doesn't reliably tell us whether the entrypoint times out correctly; I wonder if we can find a way to address this.

And I'm also not sure about this:

If reading from stdout/stderr takes longer than the context timeout, I think the call to cmd.Wait will not return an error,

If it takes a longer time, why won't cmd.Wait() return an error?

I was wrong!

@Yongxuanzhang
Member Author

Haven't seen this flake for a long time; closing this PR.
