
Use dag execution instead of linear one #473

Merged
merged 13 commits into tektoncd:master from 168-use-dag-execution on Feb 28, 2019

Conversation

vdemeester (Member) commented on Feb 4, 2019

Closes #168

This switches build-pipeline to use the DAG code for execution instead of the current linear behavior.

  • update dag with a GetSchedulable method which returns the schedulable tasks

  • fix dag false-positive cycle detection:
    The current DAG implementation was marking some graphs as invalid even
    though they were valid. For example, the following DAG is valid.

             42
            /  \
         100    200
           \   / |
            101  |
              \  |
               102
    

    But dag.Build would report a cycle in it. This is fixed by appending
    the node currently being "checked" to the list of visited nodes for the
    current path only; that way we can reach the same task twice, as long as
    it is via different paths (see the sketch after this list).

  • use dag.Build and dag.GetSchedulable in the reconciler

  • It doesn't implement any maximum number of parallel tasks. We may want to create an issue for that and do it in a follow-up 👼
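
Below is a minimal sketch of the cycle-check idea, using a simplified Node type rather than the actual dag package structures: a node only counts as a cycle if it reappears on the current path, so diamond-shaped graphs like the one above are accepted.

```go
package main

import "fmt"

// Node is a simplified stand-in for the dag package's node type.
type Node struct {
	Name string
	Prev []*Node
}

// visit walks backwards through Prev links; path holds only the names on the
// current walk, so a node reached twice via different branches is not a cycle.
func visit(n *Node, path []string) error {
	for _, name := range path {
		if name == n.Name {
			return fmt.Errorf("cycle detected at %q", n.Name)
		}
	}
	path = append(path, n.Name) // the node currently being "checked" joins the visited path
	for _, p := range n.Prev {
		if err := visit(p, path); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// The diamond from the description: 100 and 200 depend on 42,
	// 101 depends on 100 and 200, 102 depends on 101 and 200.
	n42 := &Node{Name: "42"}
	n100 := &Node{Name: "100", Prev: []*Node{n42}}
	n200 := &Node{Name: "200", Prev: []*Node{n42}}
	n101 := &Node{Name: "101", Prev: []*Node{n100, n200}}
	n102 := &Node{Name: "102", Prev: []*Node{n101, n200}}
	fmt.Println(visit(n102, nil)) // <nil>: valid, even though 42 is reached twice
}
```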

This is still a wip:

/cc @bobcatfish @tejal29

@knative-prow-robot knative-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 4, 2019
@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Feb 4, 2019
@knative-prow-robot knative-prow-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 4, 2019
test/dag_test.go Outdated
)

const (
// :((((((
bobcatfish (Collaborator):

😅

vdemeester (Member, author):

@bobcatfish actually it's from your initial test, it should be quick now 😝

nader-ziada (Member):

@vdemeester do you think it's okay to squash some of the commits? seems like some of them can be merged.

vdemeester (Member, author):

@pivotal-nader-ziada yes I definitely intend to squash them 😉 I just want to clean things up a bit and I'll squash them when removing the wip label

bobcatfish (Collaborator) left a comment:

Couple things I think you're already aware of:

  • docs 😇
  • There are a couple of functions which could probably use their own unit tests: GetSchedulable, GetPreviousTasks 😇

Bigger thought:

Do you think Nodes map[string]*Node is the best data structure to use going forward? We can totally iterate on this (esp. since it's an implementation detail the user will hopefully never notice :D) but I wonder if we could find a data structure that would hold the Pipeline Tasks in sorted order from the very beginning (i think @tejal29 suggested a topological sort?). I also think it could be v cool if we could run the "resolution" phase on this structure, such that very early in the reconciliation, we have a data structure that:

  1. holds the pipeline tasks in order (such that getting the next ones to run is just some kind of "pop" operation)
  2. has resolved all of the references in the pipeline tasks

This is something we could do in later PRs tho too :)
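
Purely as an illustration of the topological-sort idea floated here (a sketch with a simplified stand-in task type, not code from this PR or the dag package):

```go
package main

import "fmt"

// task is a simplified stand-in for a PipelineTask: just a name and the
// names of the tasks it depends on (assumes every dependency is itself
// listed as a task).
type task struct {
	name string
	deps []string
}

// topoSort returns the task names in an order where every task appears
// after all of its dependencies (Kahn's algorithm).
func topoSort(tasks []task) ([]string, error) {
	inDegree := map[string]int{}
	dependents := map[string][]string{}
	for _, t := range tasks {
		if _, ok := inDegree[t.name]; !ok {
			inDegree[t.name] = 0
		}
		for _, d := range t.deps {
			inDegree[t.name]++
			dependents[d] = append(dependents[d], t.name)
		}
	}
	var queue, order []string
	for name, deg := range inDegree {
		if deg == 0 {
			queue = append(queue, name)
		}
	}
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		order = append(order, n)
		for _, m := range dependents[n] {
			inDegree[m]--
			if inDegree[m] == 0 {
				queue = append(queue, m)
			}
		}
	}
	if len(order) != len(inDegree) {
		return nil, fmt.Errorf("graph has a cycle")
	}
	return order, nil
}

func main() {
	order, err := topoSort([]task{
		{name: "lint"},
		{name: "build", deps: []string{"lint"}},
		{name: "unit-test", deps: []string{"build"}},
		{name: "deploy", deps: []string{"build", "unit-test"}},
	})
	fmt.Println(order, err) // [lint build unit-test deploy] <nil>
}
```

With the tasks held in such an order, finding what to run next gets closer to the "pop" operation described above.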

@@ -299,20 +310,22 @@ func (c *Reconciler) reconcile(ctx context.Context, pr *v1alpha1.PipelineRun) er
}

serviceAccount := pr.Spec.ServiceAccount
rprt := resources.GetNextTask(pr.Name, pipelineState, c.Logger)
rprts := resources.GetNextTasks(pr.Name, d, pipelineState, c.Logger)
bobcatfish (Collaborator):

what do you think about the objects pipelinestate vs. d - do you think there's any potential to use the same data structure for both, or do you think it's better with them separate?

(i wonder if some of the functionality supported for pipelinestate would make sense as methods on the dag itself?)

vdemeester (Member, author):

There is definitely some duplication between the two, and I feel like they should be "the same" at some point. I didn't want to go down that path yet (as I wanted to keep the changeset small) but this is definitely something to move toward.

bobcatfish (Collaborator):

kk sounds reasonable! :D

@@ -93,6 +93,58 @@ func (g *DAG) GetPreviousTasks(pt string) []v1alpha1.PipelineTask {
return v.getPrevTasks()
}

// GetSchedulable returns a list of PipelineTask that can be scheduled,
// given a list of "done" task.
bobcatfish (Collaborator):

can you elaborate here on what a 'list of done tasks' is? (i think t is a list of pipeline task names that have completed - successfully?)

vdemeester (Member, author):

Yes, indeed, it will only be successful tasks here: if one fails, the pipeline will be marked as failed and the taskruns too. I should also rename the variables 😅

vdemeester (Member, author):

Do you think Nodes map[string]*Node is the best data structure to use going forward? We can totally iterate on this (esp. since it's an implementation detail the user will hopefully never notice :D) but I wonder if we could find a data structure that would hold the Pipeline Tasks in sorted order from the very beginning (i think @tejal29 suggested a topological sort?). I also think it could be v cool if we could run the "resolution" phase on this structure, such that very early in the reconciliation, we have a data structure that:

1. holds the pipeline tasks in order (such that getting the next ones to run is just some kind of "pop" operation)

2. has resolved all of the references in the pipeline tasks

Most likely yes — it's definitely not "optimized", but I'm not sure of the level of optimization we need yet 😝 The map is only used to quickly look up whether a task is present or not (the map could even be map[string]struct{} I think, to take less memory).

* There are a couple of functions which could probably use their own unit tests: `GetSchedulable`, `GetPreviousTasks` 😇

Yes, GetSchedulable already has its tests but the test is… incorrectly named 🤦‍♂️

@vdemeester vdemeester force-pushed the 168-use-dag-execution branch 3 times, most recently from 4a0bb87 to 5e3a989 Compare February 6, 2019 17:00
bobcatfish (Collaborator):

Most likely yes — it's definitely not "optimized", but I'm not sure of the level of optimization we need yet

haha kk, sorry for trying to optimize prematurely XD - definitely easier to iterate on this once we get an initial implementation in :)

"d": {Task: dDependsOnA, Prev: []*Node{{Task: a}}},
"e": {Task: eDependsOnA, Prev: []*Node{{Task: a}}},
"f": {Task: fDependsOnDAndE, Prev: []*Node{{Task: dDependsOnA, Prev: []*Node{{Task: a}}}, {Task: eDependsOnA, Prev: []*Node{{Task: a}}}}},
"g": {Task: gDependOnF, Prev: []*Node{{Task: fDependsOnDAndE, Prev: []*Node{{Task: dDependsOnA, Prev: []*Node{{Task: a}}}, {Task: eDependsOnA, Prev: []*Node{{Task: a}}}}}}},
bobcatfish (Collaborator):

do you think it would be crazy to have some comments that try to depict the graphs in these tests? 😇


bobcatfish (Collaborator) left a comment:

lookin good so far! 🎉

@vdemeester vdemeester force-pushed the 168-use-dag-execution branch 2 times, most recently from 5013509 to 11128bf Compare February 7, 2019 13:26
vdemeester (Member, author):

Cleaned up a bit and rebased; there is still work to do on the e2e tests though 😅

@vdemeester vdemeester changed the title wip: Use dag execution instead of linear one Use dag execution instead of linear one Feb 7, 2019
@knative-prow-robot knative-prow-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 7, 2019
@knative-prow-robot knative-prow-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 15, 2019
bobcatfish (Collaborator):

Okay I still need to:

  • rebase
  • add webhook admission validation
  • update the examples

But the bulk of the changes are here so PTAL :D

bobcatfish and others added 6 commits February 26, 2019 17:31
Since we're going to be adding 2 ways to specify the graph (i.e. adding
`runAfter` in addition to `from`), updated this test to be agnostic to
how the graph was constructed.

Also updated the test to be table driven, to try to make it easier to
look at each case.
While starting to add `runAfter` functionality, I wanted to update the
tests so that the tests for building the graph are separate from the tests
that get the next schedulable Task. While doing that I found what I
thought was a bug: if a graph has multiple roots, GetSchedulable would
sometimes (if there are no completed tasks in the other root) only
return next tasks from one of the roots.

For example:

  b     a
  |    / \
  |   |   x
  |   | / |
  |   y   |
   \ /    z
    w

If you pass `x` to GetSchedulable, `b` won't be returned, even though
theoretically it would be ready to run as well. Also, if a Task is
done, this implies everything before it is also done, even if it isn't
explicitly provided.

Eventually (a whole day later 😅) I realized that @vdemeester had
optimized by making the assumption that if anything beyond the roots was
"done", the roots themselves must already be executing, therefore we
were explicitly handling these two cases:

1. Nothing has started, so start all the roots
2. Otherwise look at everything, and if all its previous Tasks have
   completed, it must be ready to go.

I have changed this so that we instead walk the graph from the roots and
return the first schedulable nodes we encounter (see the sketch after this commit list).

This is (only slightly) an improvement because:

1. If a root wasn't started successfully (i.e. if the controller itself
   encounters an error before starting one of the roots), we'll still
   try to start the next node on the next reconcile
2. If GetSchedulable is called with "done" Tasks that don't make sense
   (e.g. if we jump straight into the middle of the graph without
   completing the rest), we can catch that

This required adding `Next` pointers as well as `Prev` pointers - which
unfortunately meant we can't compare graphs with `cmp.Diff` because it
will have a stack overflow 😅
Now users can express graph ordering using either `from` on resources or
using `runAfter` if there is no relationship between the resources.

Also removed the `errors.go` package in this PR b/c:
1. we aren't using this approach anywhere else in our codebase
2. the error would have to be passed up completely as-is to be useful,
   but the layers involved will nest it, and so using an apierror type
   is not useful in the current approach
We should probably come back to this later and use better error typing
and nesting.
Signed-off-by: Vincent Demeester <vdemeest@redhat.com>
Recommendation from @shashwathi - it's a bit easier to interact with the
schedulable tasks if they are in a map instead of a slice
- Updated example PipelineRuns so that it uses `runAfter` (to make
  unit tests run first) in addition to using `from`
- Updated pipelinerun reconcile tests to also use `runAfter`
- Updated Helm end to end test to actually use the image it builds
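
To make the walk-from-the-roots behaviour described in the commit above concrete, here is a standalone sketch under simplified assumptions (a minimal Node type and a plain set of "done" names; this is not the actual dag package code):

```go
package main

import "fmt"

// Node is a simplified stand-in for the dag package's node type, with both
// Prev (dependencies) and Next (dependents) pointers.
type Node struct {
	Name string
	Prev []*Node
	Next []*Node
}

func isDone(n *Node, done map[string]struct{}) bool {
	_, ok := done[n.Name]
	return ok
}

// schedulable walks the graph from its roots and returns the first nodes
// encountered that are not done but whose predecessors all are.
func schedulable(roots []*Node, done map[string]struct{}) []string {
	var out []string
	seen := map[string]struct{}{}
	var walk func(n *Node)
	walk = func(n *Node) {
		if _, ok := seen[n.Name]; ok {
			return
		}
		seen[n.Name] = struct{}{}
		if !isDone(n, done) {
			for _, p := range n.Prev {
				if !isDone(p, done) {
					return // a predecessor is still pending, so this node can't run yet
				}
			}
			out = append(out, n.Name)
			return // don't look past a node that hasn't finished
		}
		for _, next := range n.Next {
			walk(next)
		}
	}
	for _, r := range roots {
		walk(r)
	}
	return out
}

func main() {
	// The graph from the commit message: roots a and b; x and y depend on a,
	// y also depends on x, z depends on x, w depends on b and y.
	a := &Node{Name: "a"}
	b := &Node{Name: "b"}
	x := &Node{Name: "x", Prev: []*Node{a}}
	y := &Node{Name: "y", Prev: []*Node{a, x}}
	z := &Node{Name: "z", Prev: []*Node{x}}
	w := &Node{Name: "w", Prev: []*Node{b, y}}
	a.Next = []*Node{x, y}
	x.Next = []*Node{y, z}
	y.Next = []*Node{w}
	b.Next = []*Node{w}
	fmt.Println(schedulable([]*Node{a, b}, map[string]struct{}{"a": {}, "x": {}}))
	// [y z b]: y and z are now runnable, and root b (never started) is returned too
}
```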
bobcatfish (Collaborator):

Okay @vdemeester everything is done i think except for the webhook validation! I could even see doing that in another PR 🤷‍♀️

vdemeester (Member, author) left a comment:

Yay, looks good to me.
@bobcatfish agreed, we can make a follow-up for the validation part 😉
I'll let @shashwathi or @pivotal-nader-ziada take another look (/me doesn't want to LGTM his own PR 😹)

bobcatfish (Collaborator):

hm looks like some of our test coverage went down - im going to see what i can do to fix that!

bobcatfish (Collaborator):

@bobcatfish agreed, we can make a follow-up for the validation part

kk, created #559 for that!

vdemeester (Member, author):

/meow boxes

knative-prow-robot:

@vdemeester: cat image

In response to this:

/meow boxes


Added unit tests for `GetNextTasks` by:
1) splitting it into two functions, one of which is now
   `SuccessfulPipelineTaskNames` on the new `PipelineRunState` type
2) Adding unit tests for `SuccessfulPipelineTaskNames`
3) Adding unit tests for `GetNextTasks` - some of these cases shouldn't
   happen (e.g. failure should actually halt the Run) but including for
   completeness

(Removed `firstFinishedState` test case because it was the same as
`oneFinishedState`).
bobcatfish (Collaborator):

Okay coverage diffs look a bit more reasonable now, PTAL @shashwathi @pivotal-nader-ziada 🎉

nader-ziada (Member) left a comment:

awesome pr!

@@ -21,9 +21,19 @@ import (
"strings"

"github.com/knative/build-pipeline/pkg/apis/pipeline/v1alpha1"
errors "github.com/knative/build-pipeline/pkg/errors"
"github.com/knative/build-pipeline/pkg/reconciler/v1alpha1/taskrun/list"
nader-ziada (Member):

the files in pipelineRun reconciler should not be using functions from taskRun reconciler. The list package should be extracted out

vdemeester (Member, author):

@pivotal-nader-ziada hmm, good point 🤔 I think the taskrun/list package should be extracted into an upper-level package (as it really doesn't depend on taskrun).
@bobcatfish do you agree? (I can take care of it 😉)

nader-ziada (Member):

makes sense @vdemeester

@@ -522,7 +522,7 @@ func TestAddResourceToBuild(t *testing.T) {
wantErr: false,
want: buildv1alpha1.BuildSpec{
Steps: []corev1.Container{{
Name: "create-dir-workspace-mz4c7",
Name: "create-dir-workspace-0-0-mz4c7",
nader-ziada (Member):

why is this 0-0 added? the container name should not have been changed by this feature

vdemeester (Member, author):

This is related to the following item of the commit message:

Make sure `create-dir` containers get a unique name (the same way `copy-` containers do)

nader-ziada (Member):

but they have a random string appended to them, why do we need the index?

vdemeester (Member, author):

@pivotal-nader-ziada ah good point, that was done "before" we had that random string… 😅 We should remove that then 👼

go test -v -count=1 -tags=e2e ./test
go test -v -tags=e2e -count=1 ./test --kubeconfig ~/special/kubeconfig --cluster myspecialcluster
go test -v -count=1 -tags=e2e -timeout=20m ./test
go test -v -count=1 -tags=e2e -timeout=20m ./test --kubeconfig ~/special/kubeconfig --cluster myspecialcluster
nader-ziada (Member):

I believe the e2e tests already take more than 20m and keep growing, can we change it to 30?

shashwathi (Contributor):

They are all running in parallel, so adding more tests should not result in a 10m increase in test time.

vdemeester (Member, author):

😓


shashwathi (Contributor) left a comment:

LGTM 👍

@pivotal-nader-ziada has comments so I will leave it up to him to approve the PR

knative-prow-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shashwathi, vdemeester


Needs approval from an approver in each of these files:
  • OWNERS [shashwathi,vdemeester]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

… code :)

Signed-off-by: Vincent Demeester <vdemeest@redhat.com>
A random suffix is generated for the name of the pod, so we don't need
to add the index as part of the name to make sure they are unique.

Signed-off-by: Vincent Demeester <vdemeest@redhat.com>
vdemeester (Member, author):

/test pull-knative-build-pipeline-go-coverage

knative-metrics-robot:

The following is the coverage report on pkg/.
Say /test pull-knative-build-pipeline-go-coverage to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/list/diff.go | Do not exist | 100.0% | |
| pkg/reconciler/v1alpha1/pipeline/resources/dag.go | 97.8% | 98.9% | 1.1 |
| pkg/reconciler/v1alpha1/pipelinerun/pipelinerun.go | 82.2% | 81.1% | -1.1 |
| pkg/reconciler/v1alpha1/pipelinerun/resources/pipelinerunresolution.go | 88.9% | 90.3% | 1.4 |
| test/builder/pipeline.go | 91.9% | 92.1% | 0.2 |

nader-ziada (Member):

looks good, thanks for the changes @vdemeester

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 28, 2019
@knative-prow-robot knative-prow-robot merged commit 19dbd0d into tektoncd:master Feb 28, 2019
@vdemeester vdemeester deleted the 168-use-dag-execution branch February 28, 2019 16:34
Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • cla: yes: Trying to make the CLA bot happy with ppl from different companies working on one commit
  • lgtm: Indicates that a PR is ready to be merged.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Replace Pipeline linear execution with graph based execution
7 participants