
Fix TestDAGPipelineRun flakiness. #4419

Merged: 2 commits into tektoncd:main on Dec 14, 2021

Conversation

@mattmoor (Member) commented Dec 14, 2021

See also the linked issue for a detailed explanation of the issue this fixes.

This change alters the DAG tests in two meaningful ways:

  1. Have the tasks sleep, to actually increase the likelihood of task execution overlap,
  2. Use the sleep duration for the minimum delta in start times.

Combined, these changes should guarantee that the tasks actually executed in parallel, and the second part also makes this test less flaky on busy clusters where 5s may not be sufficient for a task to start.
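
For illustration, here is a minimal Go sketch (not this PR's actual diff) of what the two changes could look like: each task runs a script that sleeps for a fixed duration, and the ordering check uses that same duration as the minimum start-time delta between consecutive DAG stages. The helper names below are hypothetical.

```go
// Minimal sketch only: illustrates the two changes described above with
// hypothetical helper names; it is not the PR's actual diff.
package dag_test

import (
	"fmt"
	"testing"
	"time"
)

// Duration each task sleeps. It doubles as the minimum gap expected between
// the start times of tasks in consecutive DAG stages, since a dependent task
// cannot start until the task it depends on has finished sleeping.
// (Value is illustrative; the review discussion below settles on 15s.)
const sleepDuration = 15 * time.Second

// sleepScript returns a step script that sleeps, so tasks scheduled in
// parallel genuinely overlap in execution rather than finishing instantly.
func sleepScript() string {
	return fmt.Sprintf("#!/bin/sh\nsleep %d", int(sleepDuration.Seconds()))
}

// assertStageOrdering fails the test if a task in a later DAG stage started
// less than sleepDuration after a task in the preceding stage.
func assertStageOrdering(t *testing.T, earlierStart, laterStart time.Time) {
	t.Helper()
	if delta := laterStart.Sub(earlierStart); delta < sleepDuration {
		t.Errorf("start times only %v apart, want at least %v", delta, sleepDuration)
	}
}
```

Because the minimum delta is derived from the sleep itself rather than from a fixed scheduling budget, an assertion of this shape keeps holding even when pod startup takes longer than 5s on a busy cluster.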

A fun anecdote to note here is that the Kubernetes SLO for Pod startup latency is 5s at 99P, which means Tekton had effectively zero room for overhead. 😅

Fixes: #4418

/kind bug

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Docs included if any changes are user facing
  • Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Release notes block below has been filled in or deleted (only if no user facing changes)

Release Notes

NONE

@tekton-robot added the release-note-none (Denotes a PR that doesn't merit a release note.) and kind/bug (Categorizes issue or PR as related to a bug.) labels on Dec 14, 2021
linux-foundation-easycla bot commented Dec 14, 2021

CLA Signed

The committers are authorized under a signed CLA.

@tekton-robot added the size/S label (Denotes a PR that changes 10-29 lines, ignoring generated files.) on Dec 14, 2021
@mattmoor (Member, Author):

Paging @dlorenc on the CLA bit 😅

@mattmoor (Member, Author):

awesome... tabs vs. spaces 🤦

_See also the linked issue for a detailed explanation of the issue this fixes._

This change alters the DAG tests in two meaningful ways:
1. Have the tasks sleep, to actually increase the likelihood of task execution overlap,
2. Use the sleep duration for the minimum delta in start times.

These changes combined should guarantee that the tasks *actually* executed in parallel,
but the second part also enables this test to be less flaky on busy clusters where
`5s` may not be sufficient for the task to start.

A fun anecdote to note here is that the Kubernetes [SLO for Pod startup
latency](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md#definition)
is `5s` at `99P`, which means Tekton has effectively zero room for overhead.

Fixes: tektoncd#4418
@afrittoli (Member):

/test check-pr-has-kind-label

@dlorenc (Contributor) commented Dec 14, 2021

/lgtm

@tekton-robot added the lgtm label (Indicates that a PR is ready to be merged.) on Dec 14, 2021
@@ -34,6 +35,8 @@ import (
knativetest "knative.dev/pkg/test"
)

const sleepDuration = 30 * time.Second
Member:

30s feels like a long time to wait in a test?

Member Author:

Waiting for a rerun of the e2e tests is a lot longer 😉

I'm happy to lower this, but not sure what you are comfortable with. 5s is the 99P scheduling latency on K8s when there's available capacity, and when running these tests with t.Parallel() on KinD, things get busy quickly. I've seen starts 9s apart in recent memory, and I think I've seen up to 13s.

Member Author:

(To be clear, the 9s and 13s are failures with what's at HEAD.)

Member Author:

I reduced this to 15s, which is larger than I can recall seeing this fail with.
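
In code terms, the constant shown in the diff above would then read as follows (value taken from the comment; the surrounding test context is otherwise unchanged):

```go
// Reduced from 30s per review feedback; still well above the largest
// start-time gap observed in flaky runs (roughly 13s).
const sleepDuration = 15 * time.Second
```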

@tekton-robot removed the lgtm label (Indicates that a PR is ready to be merged.) on Dec 14, 2021
@imjasonh (Member):

/lgtm

@tekton-robot added the lgtm label (Indicates that a PR is ready to be merged.) on Dec 14, 2021
@dlorenc (Contributor) commented Dec 14, 2021

/approve

@tekton-robot (Collaborator):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dlorenc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Dec 14, 2021
@afrittoli (Member):

/test check-pr-has-kind-label

@tekton-robot merged commit 9a7a331 into tektoncd:main on Dec 14, 2021
Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lgtm: Indicates that a PR is ready to be merged.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • size/S: Denotes a PR that changes 10-29 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

TestDAGPipelineRun is flaky
5 participants