
WIP: Fix #549 Race condition between operator's watch and e2eutil.WaitForDeployment #550

Closed

Conversation

pratikjagrut
Contributor

Motivation

Fix #549

Changes

Add waitForDeployment inside require.Eventually().
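
A minimal sketch of the wrapped call, for illustration (the exact call site, the e2eutil.WaitForDeployment arguments such as f.KubeClient, the replica count, the 5s retry interval, the helper name, and the import paths are assumptions here; only the require.Eventually wrapping and the 120s/2s values come from this PR's diff):

import (
	"testing"
	"time"

	framework "github.com/operator-framework/operator-sdk/pkg/test"
	"github.com/operator-framework/operator-sdk/pkg/test/e2eutil"
	"github.com/stretchr/testify/require"
)

// waitForOperatorDeploymentReady is a hypothetical helper name; it retries the
// whole WaitForDeployment poll loop, checking every 2s and giving up after 120s.
func waitForOperatorDeploymentReady(t *testing.T, f *framework.Framework, namespace, name string) {
	require.Eventually(t, func() bool {
		// Assumed arguments: 1 replica, 5s retry interval, 120s timeout.
		err := e2eutil.WaitForDeployment(t, f.KubeClient, namespace, name, 1,
			5*time.Second, 120*time.Second)
		if err != nil {
			return false
		}
		return true
	}, 120*time.Second, 2*time.Second)
}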

Testing

make test-e2e

@openshift-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign baijum
You can assign the PR to them by writing /assign @baijum in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pratikjagrut marked this pull request as draft on July 10, 2020 08:26
@pratikjagrut marked this pull request as ready for review on July 10, 2020 08:28
@pratikjagrut changed the title from "Fix #549" to "Fix #549 Race condition between operator's watch and e2eutil.WaitForDeployment" on Jul 10, 2020
@pratikjagrut
Contributor Author

/test 4.5-e2e
/test e2e

@Avni-Sharma
Contributor

/retest

		return false
	}
	return true
}, 120*time.Second, 2*time.Second)
Contributor

WaitForDeployment already makes use of wait.Poll to achieve similar behavior[1]; WaitForDeployment returns after timeout seconds, which seems to be the same value require.Eventually is using. This means both would return at virtually the same time.

func waitForDeployment(t *testing.T, kubeclient kubernetes.Interface, namespace, name string, replicas int,
	retryInterval, timeout time.Duration, isOperator bool) error {
	if isOperator && test.Global.LocalOperator {
		t.Log("Operator is running locally; skip waitForDeployment")
		return nil
	}
	err := wait.Poll(retryInterval, timeout, func() (done bool, err error) {
		deployment, err := kubeclient.AppsV1().Deployments(namespace).Get(name, metav1.GetOptions{})
		if err != nil {
			if apierrors.IsNotFound(err) {
				t.Logf("Waiting for availability of %s deployment\n", name)
				return false, nil
			}
			return false, err
		}
		if int(deployment.Status.AvailableReplicas) >= replicas {
			return true, nil
		}
		t.Logf("Waiting for full availability of %s deployment (%d/%d)\n", name,
			deployment.Status.AvailableReplicas, replicas)
		return false, nil
	})
	if err != nil {
		return err
	}
	t.Logf("Deployment available (%d/%d)\n", replicas, replicas)
	return nil
}

Further investigation in assert.Eventually[2] shows that every tick (in this case 2*time.Second) the condition function will be executed inside a gofunc; in other words, this change can potentially create many go funcs that will be orphaned in the case the condition is never met in WaitForDeployment.

func Eventually(t TestingT, condition func() bool, waitFor time.Duration, tick time.Duration, msgAndArgs ...interface{}) bool {
	if h, ok := t.(tHelper); ok {
		h.Helper()
	}
	ch := make(chan bool, 1)
	timer := time.NewTimer(waitFor)
	defer timer.Stop()
	ticker := time.NewTicker(tick)
	defer ticker.Stop()
	for tick := ticker.C; ; {
		select {
		case <-timer.C:
			return Fail(t, "Condition never satisfied", msgAndArgs...)
		case <-tick:
			tick = nil
			go func() { ch <- condition() }()
		case v := <-ch:
			if v {
				return true
			}
			tick = ticker.C
		}
	}
}
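
As a side illustration of the orphaned-goroutine point above (this demo is not from the PR; the fakeT type, durations, and messages are invented for illustration): a condition that sleeps longer than waitFor shows assert.Eventually failing at its timer while the condition goroutine keeps running until its own sleep finishes.

package main

import (
	"fmt"
	"time"

	"github.com/stretchr/testify/assert"
)

// fakeT satisfies the minimal TestingT interface that assert.Eventually needs.
type fakeT struct{}

func (fakeT) Errorf(format string, args ...interface{}) { fmt.Printf(format+"\n", args...) }

func main() {
	ok := assert.Eventually(fakeT{}, func() bool {
		// Stands in for a long-running waitForDeployment call.
		time.Sleep(500 * time.Millisecond)
		fmt.Println("condition returned after Eventually had already given up")
		return true
	}, 100*time.Millisecond, 20*time.Millisecond)
	fmt.Println("Eventually returned:", ok)

	// Keep main alive long enough for the orphaned goroutine to finish.
	time.Sleep(time.Second)
}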

Additionally, given that WaitForDeployment's retryInterval (5s) is greater than the Eventually retry interval (2s), in the best scenario at least two WaitForDeployment will be racing.

Contributor Author

> Further investigation in assert.Eventually[2] shows that every tick (in this case 2*time.Second) the condition function will be executed inside a gofunc; in other words, this change can potentially create many go funcs that will be orphaned in the case the condition is never met in WaitForDeployment.

My theory was that, since Eventually creates a goroutine on each tick, waitForDeployment would be running concurrently in several goroutines, which could increase the chance of detecting deployment readiness compared to a single waitForDeployment call. There is a possibility that these goroutines could be orphaned, but they will return and exit once their waitForDeployment times out.

> Additionally, given that WaitForDeployment's retryInterval (5s) is greater than the Eventually retry interval (2s), in the best scenario at least two WaitForDeployment will be racing.

On each tick (every 2s) a new goroutine is created, so over 120s there would be up to 60 goroutines (a fairly high number), each running for up to 120s (the waitForDeployment timeout). Some goroutines could still be running after Eventually has timed out or met the condition, and, if I'm not wrong, all of them will have exited by roughly 240s (if Eventually does orphan the gofuncs). We only need one goroutine to meet the condition, but there is still a small chance that it is never met.

That was my reasoning for trying this nested-polling approach. Yes, it does seem a rather costly approach.

Alternatively, this race condition could be alleviated by #523, and we would also need to make a single write when patching and annotating the deployment.

Any suggestion for other approaches?

@pratikjagrut
Contributor Author

/test 4.5-e2e

@pratikjagrut changed the title from "Fix #549 Race condition between operator's watch and e2eutil.WaitForDeployment" to "WIP: Fix #549 Race condition between operator's watch and e2eutil.WaitForDeployment" on Jul 14, 2020
@openshift-ci-robot
Collaborator

@pratikjagrut: The following test failed, say /retest to rerun all failed tests:

Test name | Commit | Details | Rerun command
ci/prow/4.5-acceptance | 51d2711 | link | /test 4.5-acceptance

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@pratikjagrut
Contributor Author

Closing this PR in favour of the acceptance test.
