
Use dag execution instead of linear one #473

Merged
merged 13 commits into tektoncd:master from 168-use-dag-execution on Feb 28, 2019

Conversation

vdemeester (Member) commented on Feb 4, 2019

Closes #168

This switches build-pipeline to use the DAG code for execution instead of the current linear behavior.

  • update dag with a GetSchedulable method which returns the schedulable tasks

  • fix dag false-positive cycle detection:
    The current DAG implementation was marking some graphs as invalid even
    though they were valid. For example, the following DAG is valid.

             42
            /  \
         100    200
           \   / |
            101  |
              \  |
               102
    

    But dag.Build would report a cycle in it. This is fixed by appending
    the node currently being "checked" to the list of visited nodes for the
    current path only; that way we can reach the same task twice, as long as
    it is via different paths (see the sketch after this list).

  • use dag.Build and dag.GetSchedulable in the reconciler

  • It doesn't implement any maximum number of parallel tasks. We may want to create an issue for that and do it in a follow-up 👼
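
Below is a minimal sketch of the cycle-check idea, using a simplified Node type rather than the actual dag package structures: a node only counts as a cycle if it reappears on the current path, so diamond-shaped graphs like the one above are accepted.

```go
package main

import "fmt"

// Node is a simplified stand-in for the dag package's node type.
type Node struct {
	Name string
	Prev []*Node
}

// visit walks backwards through Prev links; path holds only the names on the
// current walk, so a node reached twice via different branches is not a cycle.
func visit(n *Node, path []string) error {
	for _, name := range path {
		if name == n.Name {
			return fmt.Errorf("cycle detected at %q", n.Name)
		}
	}
	path = append(path, n.Name) // the node currently being "checked" joins the visited path
	for _, p := range n.Prev {
		if err := visit(p, path); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// The diamond from the description: 100 and 200 depend on 42,
	// 101 depends on 100 and 200, 102 depends on 101 and 200.
	n42 := &Node{Name: "42"}
	n100 := &Node{Name: "100", Prev: []*Node{n42}}
	n200 := &Node{Name: "200", Prev: []*Node{n42}}
	n101 := &Node{Name: "101", Prev: []*Node{n100, n200}}
	n102 := &Node{Name: "102", Prev: []*Node{n101, n200}}
	fmt.Println(visit(n102, nil)) // <nil>: valid, even though 42 is reached twice
}
```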

This is still a wip:

/cc @bobcatfish @tejal29

@knative-prow-robot knative-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 4, 2019
@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Feb 4, 2019
@knative-prow-robot knative-prow-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 4, 2019
test/dag_test.go Outdated
)

const (
// :((((((
bobcatfish (Collaborator):

😅

vdemeester (Member, author):

@bobcatfish actually it's from your initial test, it should be quick now 😝

nader-ziada (Member):

@vdemeester do you think it's okay to squash some of the commits? seems like some of them can be merged.

vdemeester (Member, author):

@pivotal-nader-ziada yes I definitely intend to squash them 😉 I just want to clean things up a bit and I'll squash them when removing the wip label

bobcatfish (Collaborator) left a comment:

Couple things I think you're already aware of:

  • docs 😇
  • There are a couple of functions which could probably use their own unit tests: GetSchedulable, GetPreviousTasks 😇

Bigger thought:

Do you think Nodes map[string]*Node is the best data structure to use going forward? We can totally iterate on this (esp. since it's an implementation detail the user will hopefully never notice :D) but I wonder if we could find a data structure that would hold the Pipeline Tasks in sorted order from the very beginning (i think @tejal29 suggested a topological sort?). I also think it could be v cool if we could run the "resolution" phase on this structure, such that very early in the reconciliation, we have a data structure that:

  1. holds the pipeline tasks in order (such that getting the next ones to run is just some kind of "pop" operation)
  2. has resolved all of the references in the pipeline tasks

This is something we could do in later PRs tho too :)
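
Purely as an illustration of the topological-sort idea floated here (a sketch with a simplified stand-in task type, not code from this PR or the dag package):

```go
package main

import "fmt"

// task is a simplified stand-in for a PipelineTask: just a name and the
// names of the tasks it depends on (assumes every dependency is itself
// listed as a task).
type task struct {
	name string
	deps []string
}

// topoSort returns the task names in an order where every task appears
// after all of its dependencies (Kahn's algorithm).
func topoSort(tasks []task) ([]string, error) {
	inDegree := map[string]int{}
	dependents := map[string][]string{}
	for _, t := range tasks {
		if _, ok := inDegree[t.name]; !ok {
			inDegree[t.name] = 0
		}
		for _, d := range t.deps {
			inDegree[t.name]++
			dependents[d] = append(dependents[d], t.name)
		}
	}
	var queue, order []string
	for name, deg := range inDegree {
		if deg == 0 {
			queue = append(queue, name)
		}
	}
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		order = append(order, n)
		for _, m := range dependents[n] {
			inDegree[m]--
			if inDegree[m] == 0 {
				queue = append(queue, m)
			}
		}
	}
	if len(order) != len(inDegree) {
		return nil, fmt.Errorf("graph has a cycle")
	}
	return order, nil
}

func main() {
	order, err := topoSort([]task{
		{name: "lint"},
		{name: "build", deps: []string{"lint"}},
		{name: "unit-test", deps: []string{"build"}},
		{name: "deploy", deps: []string{"build", "unit-test"}},
	})
	fmt.Println(order, err) // [lint build unit-test deploy] <nil>
}
```

With the tasks held in such an order, finding what to run next gets closer to the "pop" operation described above.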

@@ -299,20 +310,22 @@ func (c *Reconciler) reconcile(ctx context.Context, pr *v1alpha1.PipelineRun) er
}

serviceAccount := pr.Spec.ServiceAccount
rprt := resources.GetNextTask(pr.Name, pipelineState, c.Logger)
rprts := resources.GetNextTasks(pr.Name, d, pipelineState, c.Logger)
bobcatfish (Collaborator):

what do you think about the objects pipelinestate vs. d - do you think there's any potential to use the same data structure for both, or do you think it's better with them separate?

(i wonder if some of the functionality supported for pipelinestate would make sense as methods on the dag itself?)

vdemeester (Member, author):

There is definitely some duplication between the two, and I feel like they should be "the same" at some point. I didn't want to go down that path yet (as I wanted to keep the changeset small) but this is definitely something to move toward.

bobcatfish (Collaborator):

kk sounds reasonable! :D

@@ -93,6 +93,58 @@ func (g *DAG) GetPreviousTasks(pt string) []v1alpha1.PipelineTask {
return v.getPrevTasks()
}

// GetSchedulable returns a list of PipelineTask that can be scheduled,
// given a list of "done" task.
bobcatfish (Collaborator):

can you elaborate here on what a 'list of done tasks' is? (i think t is a list of pipeline task names that have completed - successfully?)

vdemeester (Member, author):

Yes, indeed, it will only be successful tasks here: if one fails, the pipeline will be marked as failed and the taskruns too. I should also rename the variables 😅

vdemeester (Member, author):

Do you think Nodes map[string]*Node is the best data structure to use going forward? We can totally iterate on this (esp. since it's an implementation detail the user will hopefully never notice :D) but I wonder if we could find a data structure that would hold the Pipeline Tasks in sorted order from the very beginning (i think @tejal29 suggested a topological sort?). I also think it could be v cool if we could run the "resolution" phase on this structure, such that very early in the reconciliation, we have a data structure that:

1. holds the pipeline tasks in order (such that getting the next ones to run is just some kind of "pop" operation)

2. has resolved all of the references in the pipeline tasks

Most likely yes — it's definitely not "optimized", but I'm not sure of the level of optimization we need yet 😝 The map is only used to quickly look up whether a task is present or not (the map could even be map[string]struct{} I think, to take less memory).

* There are a couple of functions which could probably use their own unit tests: `GetSchedulable`, `GetPreviousTasks` 😇

Yes, GetSchedulable already has its tests but the test is… incorrectly named 🤦‍♂️

@vdemeester vdemeester force-pushed the 168-use-dag-execution branch 3 times, most recently from 4a0bb87 to 5e3a989 Compare February 6, 2019 17:00
bobcatfish (Collaborator):

Most likely yes — it's definitely not "optimized", but I'm not sure of the level of optimization we need yet

haha kk, sorry for trying to optimize prematurely XD - definitely easier to iterate on this once we get an initial implementation in :)

"d": {Task: dDependsOnA, Prev: []*Node{{Task: a}}},
"e": {Task: eDependsOnA, Prev: []*Node{{Task: a}}},
"f": {Task: fDependsOnDAndE, Prev: []*Node{{Task: dDependsOnA, Prev: []*Node{{Task: a}}}, {Task: eDependsOnA, Prev: []*Node{{Task: a}}}}},
"g": {Task: gDependOnF, Prev: []*Node{{Task: fDependsOnDAndE, Prev: []*Node{{Task: dDependsOnA, Prev: []*Node{{Task: a}}}, {Task: eDependsOnA, Prev: []*Node{{Task: a}}}}}}},
bobcatfish (Collaborator):

do you think it would be crazy to have some comments that try to depict the graphs in these tests? 😇


bobcatfish (Collaborator) left a comment:

lookin good so far! 🎉

@vdemeester vdemeester force-pushed the 168-use-dag-execution branch 2 times, most recently from 5013509 to 11128bf Compare February 7, 2019 13:26
vdemeester (Member, author):

Cleaned up a bit and rebased; there is still work to do on the e2e tests though 😅

@vdemeester vdemeester changed the title wip: Use dag execution instead of linear one Use dag execution instead of linear one Feb 7, 2019
@knative-prow-robot knative-prow-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 7, 2019
@knative-prow-robot knative-prow-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 15, 2019
bobcatfish (Collaborator):

Okay I still need to:

  • rebase
  • add webhook admission validation
  • update the examples

But the bulk of the changes are here so PTAL :D

bobcatfish and others added 6 commits February 26, 2019 17:31
Since we're going to be adding 2 ways to specify the graph (i.e. adding
`runAfter` in addition to `from`), updated this test to be agnostic to
how the graph was constructed.

Also updated the test to be table driven, to try to make it easier to
look at each case.
While starting to add `runAfter` functionality, I wanted to update the
tests so that the tests for building the graph are separate from the tests
that get the next schedulable Task. While doing that I found what I
thought was a bug: if a graph has multiple roots, GetSchedulable would
sometimes (if there are no completed tasks in the other root) only
return next tasks from one of the roots.

For example:

  b     a
  |    / \
  |   |   x
  |   | / |
  |   y   |
   \ /    z
    w

If you pass `x` to GetSchedulable, `b` won't be returned, even though
theoretically it would be ready to run as well. Also, if a Task is
done, this implies everything before it is also done, even if it isn't
explicitly provided.

Eventually (a whole day later 😅) I realized that @vdemeester had
optimized by making the assumption that if anything beyond the roots was
"done", the roots themselves must already be executing, therefore we
were explicitly handling these two cases:

1. Nothing has started, so start all the roots
2. Otherwise look at everything, and if all its previous Tasks have
   completed, it must be ready to go.

I have changed this so that we instead walk the graph from the roots and
return the first schedulable nodes we encounter (see the sketch after this commit list).

This is (only slightly) an improvement because:

1. If a root wasn't started successfully (i.e. if the controller itself
   encounters an error before starting one of the roots), we'll still
   try to start the next node on the next reconcile
2. If GetSchedulable is called with "done" Tasks that don't make sense
   (e.g. if we jump straight into the middle of the graph without
   completing the rest), we can catch that

This required adding `Next` pointers as well as `Prev` pointers - which
unfortunately meant we can't compare graphs with `cmp.Diff` because it
will have a stack overflow 😅
Now users can express graph ordering using either `from` on resources or
using `runAfter` if there is no relationship between the resources.

Also removed the `errors.go` package in this PR b/c:
1. we aren't using this approach anywhere else in our codebase
2. the error would have to be passed up completely as-is to be useful,
   but the layers involved will nest it, and so using an apierror type
   is not useful in the current approach
We should probably come back to this later and use better error typing
and nesting.
Signed-off-by: Vincent Demeester <vdemeest@redhat.com>
Recommendation from @shashwathi - it's a bit easier to interact with the
schedulable tasks if they are in a map instead of a slice
- Updated example PipelineRuns so that it uses `runAfter` (to make
  unit tests run first) in addition to using `from`
- Updated pipelinerun reconcile tests to also use `runAfter`
- Updated Helm end to end test to actually use the image it builds
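
To make the walk-from-the-roots behaviour described in the commit above concrete, here is a standalone sketch under simplified assumptions (a minimal Node type and a plain set of "done" names; this is not the actual dag package code):

```go
package main

import "fmt"

// Node is a simplified stand-in for the dag package's node type, with both
// Prev (dependencies) and Next (dependents) pointers.
type Node struct {
	Name string
	Prev []*Node
	Next []*Node
}

func isDone(n *Node, done map[string]struct{}) bool {
	_, ok := done[n.Name]
	return ok
}

// schedulable walks the graph from its roots and returns the first nodes
// encountered that are not done but whose predecessors all are.
func schedulable(roots []*Node, done map[string]struct{}) []string {
	var out []string
	seen := map[string]struct{}{}
	var walk func(n *Node)
	walk = func(n *Node) {
		if _, ok := seen[n.Name]; ok {
			return
		}
		seen[n.Name] = struct{}{}
		if !isDone(n, done) {
			for _, p := range n.Prev {
				if !isDone(p, done) {
					return // a predecessor is still pending, so this node can't run yet
				}
			}
			out = append(out, n.Name)
			return // don't look past a node that hasn't finished
		}
		for _, next := range n.Next {
			walk(next)
		}
	}
	for _, r := range roots {
		walk(r)
	}
	return out
}

func main() {
	// The graph from the commit message: roots a and b; x and y depend on a,
	// y also depends on x, z depends on x, w depends on b and y.
	a := &Node{Name: "a"}
	b := &Node{Name: "b"}
	x := &Node{Name: "x", Prev: []*Node{a}}
	y := &Node{Name: "y", Prev: []*Node{a, x}}
	z := &Node{Name: "z", Prev: []*Node{x}}
	w := &Node{Name: "w", Prev: []*Node{b, y}}
	a.Next = []*Node{x, y}
	x.Next = []*Node{y, z}
	y.Next = []*Node{w}
	b.Next = []*Node{w}
	fmt.Println(schedulable([]*Node{a, b}, map[string]struct{}{"a": {}, "x": {}}))
	// [y z b]: y and z are now runnable, and root b (never started) is returned too
}
```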
bobcatfish (Collaborator):

Okay @vdemeester everything is done i think except for the webhook validation! I could even see doing that in another PR 🤷‍♀️

vdemeester (Member, author) left a comment:

Yay, looks good to me.
@bobcatfish agreed, we can make a follow-up for the validation part 😉
I'll let @shashwathi or @pivotal-nader-ziada take another look (/me doesn't want to LGTM his own PR 😹)

bobcatfish (Collaborator):

hm looks like some of our test coverage went down - im going to see what i can do to fix that!

bobcatfish (Collaborator):

@bobcatfish agreed, we can make a follow-up for the validation part

kk, created #559 for that!

vdemeester (Member, author):

/meow boxes

knative-prow-robot:

@vdemeester: cat image

In response to this:

/meow boxes


Added unit tests for `GetNextTasks` by:
1) splitting it into two functions, one of which is now
   `SuccessfulPipelineTaskNames` on the new `PipelineRunState` type
2) Adding unit tests for `SuccessfulPipelineTaskNames`
3) Adding unit tests for `GetNextTasks` - some of these cases shouldn't
   happen (e.g. failure should actually halt the Run) but including for
   completeness

(Removed `firstFinishedState` test case because it was the same as
`oneFinishedState`).
bobcatfish (Collaborator):

Okay coverage diffs look a bit more reasonable now, PTAL @shashwathi @pivotal-nader-ziada 🎉

nader-ziada (Member) left a comment:

awesome pr!

@@ -21,9 +21,19 @@ import (
"strings"

"github.com/knative/build-pipeline/pkg/apis/pipeline/v1alpha1"
errors "github.com/knative/build-pipeline/pkg/errors"
"github.com/knative/build-pipeline/pkg/reconciler/v1alpha1/taskrun/list"
nader-ziada (Member):

the files in pipelineRun reconciler should not be using functions from taskRun reconciler. The list package should be extracted out

vdemeester (Member, author):

@pivotal-nader-ziada hmm, good point 🤔 I think the taskrun/list package should be extracted into an upper-level package (as it really doesn't depend on taskrun).
@bobcatfish do you agree? (I can take care of it 😉)

nader-ziada (Member):

makes sense @vdemeester

@@ -522,7 +522,7 @@ func TestAddResourceToBuild(t *testing.T) {
wantErr: false,
want: buildv1alpha1.BuildSpec{
Steps: []corev1.Container{{
Name: "create-dir-workspace-mz4c7",
Name: "create-dir-workspace-0-0-mz4c7",
nader-ziada (Member):

why is this 0-0 added? the container name should not have been changed by this feature

vdemeester (Member, author):

This is related to the following item of the commit message:

Make sure `create-dir` containers get a unique name (the same way `copy-` containers do)

nader-ziada (Member):

but they have a random string appended to them, why do we need the index?

vdemeester (Member, author):

@pivotal-nader-ziada ah good point, that was done "before" we had that random string… 😅 We should remove that then 👼

go test -v -count=1 -tags=e2e ./test
go test -v -tags=e2e -count=1 ./test --kubeconfig ~/special/kubeconfig --cluster myspecialcluster
go test -v -count=1 -tags=e2e -timeout=20m ./test
go test -v -count=1 -tags=e2e -timeout=20m ./test --kubeconfig ~/special/kubeconfig --cluster myspecialcluster
nader-ziada (Member):

I believe the e2e tests already take more than 20m and keep growing, can we change it to 30?

shashwathi (Contributor):

They are all running in parallel, so adding more tests should not result in a 10m increase in test time.

vdemeester (Member, author):

😓


shashwathi (Contributor) left a comment:

LGTM 👍

@pivotal-nader-ziada has comments so I will leave it up to him to approve the PR

knative-prow-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shashwathi, vdemeester


Needs approval from an approver in each of these files:
  • OWNERS [shashwathi,vdemeester]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

… code :)

Signed-off-by: Vincent Demeester <vdemeest@redhat.com>
A random suffix is generated for the name of the pod, so we don't need
to add the index as part of the name to make sure they are unique.

Signed-off-by: Vincent Demeester <vdemeest@redhat.com>
vdemeester (Member, author):

/test pull-knative-build-pipeline-go-coverage

knative-metrics-robot:

The following is the coverage report on pkg/.
Say /test pull-knative-build-pipeline-go-coverage to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/list/diff.go | Do not exist | 100.0% | |
| pkg/reconciler/v1alpha1/pipeline/resources/dag.go | 97.8% | 98.9% | 1.1 |
| pkg/reconciler/v1alpha1/pipelinerun/pipelinerun.go | 82.2% | 81.1% | -1.1 |
| pkg/reconciler/v1alpha1/pipelinerun/resources/pipelinerunresolution.go | 88.9% | 90.3% | 1.4 |
| test/builder/pipeline.go | 91.9% | 92.1% | 0.2 |

nader-ziada (Member):

looks good, thanks for the changes @vdemeester

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 28, 2019
@knative-prow-robot knative-prow-robot merged commit 19dbd0d into tektoncd:master Feb 28, 2019
@vdemeester vdemeester deleted the 168-use-dag-execution branch February 28, 2019 16:34
Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • cla: yes: Trying to make the CLA bot happy with ppl from different companies working on one commit
  • lgtm: Indicates that a PR is ready to be merged.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Replace Pipeline linear execution with graph based execution
7 participants