Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize cycle detection logic in dag #3539

Merged
merged 1 commit into from Nov 24, 2020

Conversation

pritidesai
Copy link
Member

@pritidesai pritidesai commented Nov 18, 2020

Changes

visited map holds name of the nodes being visited as keys and true as value. This map is being updated with currentName.HashKey on every iteration which results in keys such as a.d, b.d where a, b, and d are nodes of a dag. But such keys are not utilized anywhere in the same visit function. Visit function checks existence of the node just
by the name without any string concatenation.

This extra addition in the map is causing severe delay for a graph with more than 60 nodes with distinct dependencies.

visit function is called on line 131 with all the parents of the existing pipelineTask to check if any of the parents exist in the visited map. visit function is then recursively called to check each parent's parent to see if any of them exist in the visited map, so on and so forth to detect a cycle.

/kind bug

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

  • Includes tests (if functionality changed/added)
  • Includes docs (if user facing)
  • Commit messages follow commit message best practices
  • Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

NONE

@tekton-robot tekton-robot added release-note-none Denotes a PR that doesnt merit a release note. kind/bug Categorizes issue or PR as related to a bug. labels Nov 18, 2020
@tekton-robot tekton-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Nov 18, 2020
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipeline/dag/dag.go 98.9% 98.8% -0.1

@tekton-robot tekton-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 18, 2020
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipeline/dag/dag.go 98.9% 98.8% -0.1

Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments to address but I think the change is good!

pkg/reconciler/pipeline/dag/dag_test.go Outdated Show resolved Hide resolved
Tasks: v1beta1.PipelineTaskList{a, bDependsOnA, cRunsAfterA, dDependsOnBAndC, eRunsAfterD, fRunsAfterD, gDependsOnF},
},
}
if _, err := dag.Build(v1beta1.PipelineTaskList(p.Spec.Tasks)); err == nil {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might want to check the error message includes the cycle we're expecting? Otherwise we could get in to a situation like above where the error is of a different type than we want.

_, err := dag.Build
if err == nil || !strings.Contains(err.Error(), "e -> a -> b -> d -> e") {
  t.Errorf("expected cycle error but got %v", err)
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @sbwsg, there are multiple paths possible leading to a cycle. It all depends on the order in which dependency map is processed.

This test can fail with one of these paths:

	if err == nil || !(strings.Contains(err.Error(), "e -> a -> b -> d -> e") ||
		strings.Contains(err.Error(), "a -> c -> b -> d -> e -> a") ||
		strings.Contains(err.Error(), "d -> e -> a -> b -> d")) {
		t.Errorf("expected cycle error but got %v", err)
	}

I can replace the existing error checking with the above check if we decide to keep this test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running the unit test multiple times with count=1 leads to many different combination of paths including:

		"a -> c -> b -> d -> e -> a",
		"a -> b -> c -> d -> e -> a",
		"a -> b -> d -> e -> a",
		"a -> c -> d -> e -> a",
		"b -> d -> e -> a -> b",
		"c -> d -> e -> a -> c",

I walked through the graph one more time and there are much more combinations possible. I think its fine to leave it as just cycle detected with e -> a or a -> e instead of looking for a specific path in the unit test.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point, the order of traversal isn't stable across runs because the nodes are stored in a hash map. cycle detected makes sense 👍

pkg/reconciler/pipeline/dag/dag_test.go Outdated Show resolved Hide resolved
pkg/reconciler/pipeline/dag/dag_test.go Outdated Show resolved Hide resolved
pkg/reconciler/pipeline/dag/dag.go Show resolved Hide resolved
@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 18, 2020
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipeline/dag/dag.go 98.9% 98.8% -0.1

@afrittoli
Copy link
Member

afrittoli commented Nov 18, 2020

Thanks for this! I didn't manage to review yet, I'll do tomorrow 🙏

I wonder if on long term we could consider using something like https://godoc.org/github.com/twmb/algoimpl/go/graph for the DAG logic? It already detects cycles via the "strongly connected components". That at the cost of an extra dependency of course :)

@bobcatfish
Copy link
Collaborator

Hey @pritidesai !! Nice catch, it looks like this one has been around for a while :D It took me a while to wrap my head around the changes here and I had to do some investigation to understand them, so I want to try to explain these changes and can you confirm if this is how you are seeing it as well?

To try to understand what's happening I had to look back at a1f2733 when the visited list was updated to do that concatenation.

Here is my understanding of the problem that was being fixed at that time:

             42
            /  \
         100    200
           \   / 
            101 
              | 
             102
  1. We'd try to add a link between 101 + 102 via addPrevPipelineTask
  2. addPrevPipelineTask would try to "visit" all of 101's previous Tasks
  3. visit would recursively "visit" all of the previous Tasks of those Tasks

While (2) and (3) were happening, a "visited" list was being built up
When 101 was "visited" the first time, it would visit it's previous tasks, etc. adding 100 and then 42 to the "visited" list
Then 200 would be "visited" - and 42 is a previous task of 200, so it would be "visited" as well
At that point, 42 would already be in the "visited" list and would incorrectly detect a cycle

So a1f2733 attempted to fix this by changing the keys in visited.

BUT!

As @pritidesai pointed out, we may ADD the concatenated keys but WE DONT USE THEM WHEN CHECKING KEYS

I think what this means is that a1f2733 effectively made us ignore all these entries we were adding to the list.

And it works! Which is why in this PR @pritidesai can remove these additions into the visited list b/c we didn't need them in the first place.

And the REASON we don't need them is because the actual question we are trying to ask when we call visit from linkPipelineTasks is a bit obscured: we are trying to ask "is this node its own parent" (which means there is a cycle). And we ask this for every Task, so we don't need to be tracking all the Tasks all the time.

So in fact I think we could simplify linkPipelineTasks a bit to make this more clear, instead of:

       // i have a strong hunch we don't even need both of these nodes in "visited" for cycle detection to work, i think we just need the somewhat confusingly named "next" (maybe a name like "curr" would be more clear?)
        visited := map[string]bool{prev.Task.HashKey(): true, next.Task.HashKey(): true}
	if err := visit(prev.Prev, path, visited); err != nil {

Something like:

        // Look through all of prev's ancestors and see if the "next" node is present
	if err := lookForNode(prev.Prev, path, next); err != nil {

TestBuild_InvalidDAGWithCycle

I don't think we actually need this new test (maybe not the other one either) - The test TestBuild_InvalidDAGWithCycle passes even without the changes (the error message is different but a cycle is still detected) and it looks like TestBuild_Invalid is testing this case (there are 3 different versions of cycles (a has to run after z but z runs after x which runs after a i think?) (it could definitely use a comment depicting the graph - and maybe a better name)

Because I think what this PR is doing is not actually changing any functionality, but is optimizing it (we're no longer storing the concatenated keys which were being ignored anyway)

I wonder if on long term we could consider using something like https://godoc.org/github.com/twmb/algoimpl/go/graph for the DAG logic? It already detects cycles via the "strongly connected components". That at the cost of an extra dependency of course :)

I have mixed feelings! On the one hand I feel like this DAG logic is at the heart of our Pipeline execution so it feels weird to entrust it to an external lib, on the other hand we had this inefficiency for nearly 2 years and didn't notice (on the other other hand, this didn't actually cause any bugs). Might be interesting to try swapping it out and see what it's like?

@pritidesai
Copy link
Member Author

can you confirm if this is how you are seeing it as well?

yes @bobcatfish you absolutely nailed it.

And the REASON we don't need them is because the actual question we are trying to ask when we call visit from linkPipelineTasks is a bit obscured: we are trying to ask "is this node its own parent" (which means there is a cycle). And we ask this for every Task, so we don't need to be tracking all the Tasks all the time.

I agree, it is bit obscured, before adding any new link, its actually checking if this node (next) is its own parent by traversing through the prev.Prev until prev.Prev is empty or finds match (next == prev.Prev).

if err := lookForNode(prev.Prev, path, next); err != nil {

visited can be replaced with next like you pointed out. I will update the PR.

The test TestBuild_InvalidDAGWithCycle passes even without the changes

yup, since there is no functionality change 😄 This test validates a little bit more complex graph compared to existing invalid test:

    a
   / \
 b   c
   \ /
   d
  / \
e    f (a runAfter e)
     |
     g

TestBuild_Invalid is validating one single cycle in three different forms (combination of runAfter and from) but from the Build unit test perspective its the same graph:

		 a
		 |
		 x
		 |
		 z (also a depends on/runAfter z)

Might be interesting to try swapping it out and see what it's like?

I generally avoid third party dependencies for the same reasons you have mentioned, specially for the heart of the pipeline execution.

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipeline/dag/dag.go 98.9% 98.8% -0.1

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipeline/dag/dag.go 98.9% 98.8% -0.1

@vdemeester
Copy link
Member

I have mixed feelings! On the one hand I feel like this DAG logic is at the heart of our Pipeline execution so it feels weird to entrust it to an external lib, on the other hand we had this inefficiency for nearly 2 years and didn't notice (on the other other hand, this didn't actually cause any bugs). Might be interesting to try swapping it out and see what it's like?

That's an interesting though, because I would have gone the other way. The DAG is not something specific to Tekton, the way we build it and us it is, so I would think using an external library that does the "dag" well would be good enough for us and would allow us to add value to where we need to. But I don't have strong opinion on this 🙃

@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbwsg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 20, 2020
@bobcatfish
Copy link
Collaborator

This test validates a little bit more complex graph compared to existing invalid test

in order to make sure it's clear which tests are covering what to future maintainers, what do you think about changing the names of these tests to make it more clear what the line is between them? and/or maybe even combining them into one table driven test?

@bobcatfish
Copy link
Collaborator

Speaking of tests, this is a bit of scope creep, but why does dagv1beta1_test.go duplicate all the tests is in dag_test.go? Is there some way we can avoid having to maintain this duplication?

@pritidesai
Copy link
Member Author

what do you think about changing the names of these tests to make it more clear what the line is between them?

I have renamed the test.

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipeline/dag/dag.go 98.9% 98.8% -0.1

@pritidesai
Copy link
Member Author

pritidesai commented Nov 20, 2020

can we also rename TestBuild_Invalid so it's clear what relationship this test has to TestBuild_InvalidDAG_OneCyclicBranch?

TestBuild_Invalid tests DAG for other failures as well e.g. duplicate task names, invalid task name in dependency, etc. Have renamed the test to TestBuild_Invalid.

While creating a separate cleanup PR, I will rearrange test functions further e.g. local functions towards the end of the file, rearrange TestBuild and TestBuild_Invalid, etc.

@pritidesai
Copy link
Member Author

@bobcatfish if build passes, its ready for review 🙏

@pritidesai
Copy link
Member Author

/test pull-tekton-pipeline-integration-tests
/test pull-tekton-pipeline-build-tests

@vdemeester
Copy link
Member

And we ended up moving dag_test.go to v1beta1 about two months back
e356ca5#diff-eb55d5a93fabcf61a0a897c11c28ffe92fc96ec02f8c7a6bdbcaa97fd3af8cc1

I guess its safe to delete one _test.go at this point @vdemeester? I can create a separate PR to cleanup, if needed.

Yeah I think so 😉

Copy link
Collaborator

@bobcatfish bobcatfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we are almost there! thanks for your patience!

TestBuild_Invalid tests DAG for other failures as well e.g. duplicate task names, invalid task name in dependency, etc. Have renamed the test to TestBuild_Invalid.

I'd like to give one last plug for having one test for all of these cases, or for removing the invalid DAG cases from the existing function - i.e. one place to find all the invalid graphs, it doesn't seem to me like we're gaining much from the 2 separate tests at the moment, since the cases they are covering overlap. so id much prefer to see one test that covers all the cases, or a clear line between which cases each of the 2 tests cover

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipeline/dag/dag.go 98.9% 98.8% -0.1

@pritidesai
Copy link
Member Author

i think we are almost there! thanks for your patience!

Definitely, have merged them all in one.

Also, if cleanup PR #3556 is merged before this, will update this PR to drop changes to dagv1beta1_test.go.

@afrittoli
Copy link
Member

I have mixed feelings! On the one hand I feel like this DAG logic is at the heart of our Pipeline execution so it feels weird to entrust it to an external lib, on the other hand we had this inefficiency for nearly 2 years and didn't notice (on the other other hand, this didn't actually cause any bugs). Might be interesting to try swapping it out and see what it's like?

That's an interesting though, because I would have gone the other way. The DAG is not something specific to Tekton, the way we build it and us it is, so I would think using an external library that does the "dag" well would be good enough for us and would allow us to add value to where we need to. But I don't have strong opinion on this 🙃

Hey, yeah, that was my thinking exactly, but I don't have a strong feeling about this either

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 24, 2020
@bobcatfish
Copy link
Collaborator

Thanks @pritidesai ! looks good to me :D

even tho you'll need it again for the conflict, a symbolic lgtm:

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 24, 2020
visited map holds name of the nodes been visited as keys and "true"
as value. This map is being updated with currentName.HashKey on every
iteration which results in keys such as a.d, b.d where a, b, and d
are nodes of a dag. But such keys are not utilized anywhere in the
visit function. Visit function checks existence of the node just
by the name without any string concatenation.

This extra addition in the map is causing severe delay for a graph
with more than >60 nodes.
@tekton-robot tekton-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 24, 2020
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipeline/dag/dag.go 98.9% 98.8% -0.1

@bobcatfish
Copy link
Collaborator

/lgtm
/meow

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 24, 2020
@tekton-robot
Copy link
Collaborator

@bobcatfish: cat image

In response to this:

/lgtm
/meow

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pritidesai
Copy link
Member Author

/test pull-tekton-pipeline-integration-tests

@pritidesai
Copy link
Member Author

unable to access 'https://github.com/GoogleContainerTools/skaffold/': Could not resolve host: 🤔

@tekton-robot tekton-robot merged commit 5e0be29 into tektoncd:master Nov 24, 2020
@pritidesai pritidesai deleted the dag.visit.1 branch November 24, 2020 17:08
@bobcatfish bobcatfish changed the title fix cycle detection logic in dag optimize cycle detection logic in dag Dec 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesnt merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants