
WIP: Fix #549 Race condition between operator's watch and e2eutil.WaitForDeployment #550

Closed

Conversation

pratikjagrut
Contributor

Motivation

Fix #549

Changes

Add waitForDeployment inside require.Eventually().
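
A minimal sketch of the wrapped call, for illustration (the exact call site, the e2eutil.WaitForDeployment arguments such as f.KubeClient, the replica count, the 5s retry interval, the helper name, and the import paths are assumptions here; only the require.Eventually wrapping and the 120s/2s values come from this PR's diff):

import (
	"testing"
	"time"

	framework "github.com/operator-framework/operator-sdk/pkg/test"
	"github.com/operator-framework/operator-sdk/pkg/test/e2eutil"
	"github.com/stretchr/testify/require"
)

// waitForOperatorDeploymentReady is a hypothetical helper name; it retries the
// whole WaitForDeployment poll loop, checking every 2s and giving up after 120s.
func waitForOperatorDeploymentReady(t *testing.T, f *framework.Framework, namespace, name string) {
	require.Eventually(t, func() bool {
		// Assumed arguments: 1 replica, 5s retry interval, 120s timeout.
		err := e2eutil.WaitForDeployment(t, f.KubeClient, namespace, name, 1,
			5*time.Second, 120*time.Second)
		if err != nil {
			return false
		}
		return true
	}, 120*time.Second, 2*time.Second)
}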

Testing

make test-e2e

@openshift-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign baijum
You can assign the PR to them by writing /assign @baijum in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pratikjagrut marked this pull request as draft on July 10, 2020 08:26
@pratikjagrut marked this pull request as ready for review on July 10, 2020 08:28
@pratikjagrut changed the title from "Fix #549" to "Fix #549 Race condition between operator's watch and e2eutil.WaitForDeployment" on Jul 10, 2020
@pratikjagrut
Contributor Author

/test 4.5-e2e
/test e2e

@Avni-Sharma
Contributor

/retest

		return false
	}
	return true
}, 120*time.Second, 2*time.Second)
Contributor

WaitForDeployment already makes use of wait.Poll to achieve similar behavior[1]; WaitForDeployment returns after timeout seconds, which seems to be the same value require.Eventually is using. This means both would return at virtually the same time.

func waitForDeployment(t *testing.T, kubeclient kubernetes.Interface, namespace, name string, replicas int,
	retryInterval, timeout time.Duration, isOperator bool) error {
	if isOperator && test.Global.LocalOperator {
		t.Log("Operator is running locally; skip waitForDeployment")
		return nil
	}
	err := wait.Poll(retryInterval, timeout, func() (done bool, err error) {
		deployment, err := kubeclient.AppsV1().Deployments(namespace).Get(name, metav1.GetOptions{})
		if err != nil {
			if apierrors.IsNotFound(err) {
				t.Logf("Waiting for availability of %s deployment\n", name)
				return false, nil
			}
			return false, err
		}
		if int(deployment.Status.AvailableReplicas) >= replicas {
			return true, nil
		}
		t.Logf("Waiting for full availability of %s deployment (%d/%d)\n", name,
			deployment.Status.AvailableReplicas, replicas)
		return false, nil
	})
	if err != nil {
		return err
	}
	t.Logf("Deployment available (%d/%d)\n", replicas, replicas)
	return nil
}

Further investigation in assert.Eventually[2] shows that every tick (in this case 2*time.Second) the condition function will be executed inside a gofunc; in other words, this change can potentially create many go funcs that will be orphaned in the case the condition is never met in WaitForDeployment.

func Eventually(t TestingT, condition func() bool, waitFor time.Duration, tick time.Duration, msgAndArgs ...interface{}) bool {
	if h, ok := t.(tHelper); ok {
		h.Helper()
	}
	ch := make(chan bool, 1)
	timer := time.NewTimer(waitFor)
	defer timer.Stop()
	ticker := time.NewTicker(tick)
	defer ticker.Stop()
	for tick := ticker.C; ; {
		select {
		case <-timer.C:
			return Fail(t, "Condition never satisfied", msgAndArgs...)
		case <-tick:
			tick = nil
			go func() { ch <- condition() }()
		case v := <-ch:
			if v {
				return true
			}
			tick = ticker.C
		}
	}
}
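
As a side illustration of the orphaned-goroutine point above (this demo is not from the PR; the fakeT type, durations, and messages are invented for illustration): a condition that sleeps longer than waitFor shows assert.Eventually failing at its timer while the condition goroutine keeps running until its own sleep finishes.

package main

import (
	"fmt"
	"time"

	"github.com/stretchr/testify/assert"
)

// fakeT satisfies the minimal TestingT interface that assert.Eventually needs.
type fakeT struct{}

func (fakeT) Errorf(format string, args ...interface{}) { fmt.Printf(format+"\n", args...) }

func main() {
	ok := assert.Eventually(fakeT{}, func() bool {
		// Stands in for a long-running waitForDeployment call.
		time.Sleep(500 * time.Millisecond)
		fmt.Println("condition returned after Eventually had already given up")
		return true
	}, 100*time.Millisecond, 20*time.Millisecond)
	fmt.Println("Eventually returned:", ok)

	// Keep main alive long enough for the orphaned goroutine to finish.
	time.Sleep(time.Second)
}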

Additionally, given that WaitForDeployment's retryInterval (5s) is greater than the Eventually retry interval (2s), in the best scenario at least two WaitForDeployment will be racing.

Contributor Author

> Further investigation in assert.Eventually[2] shows that every tick (in this case 2*time.Second) the condition function will be executed inside a gofunc; in other words, this change can potentially create many go funcs that will be orphaned in the case the condition is never met in WaitForDeployment.

My theory was that, since Eventually creates a goroutine on each tick, waitForDeployment would be running concurrently in several goroutines, which could increase the chance of detecting deployment readiness compared to a single waitForDeployment call. There is a possibility that these goroutines could be orphaned, but they will return and exit once their waitForDeployment times out.

> Additionally, given that WaitForDeployment's retryInterval (5s) is greater than the Eventually retry interval (2s), in the best scenario at least two WaitForDeployment will be racing.

On each tick (every 2s) a new goroutine is created, so over 120s there would be up to 60 goroutines (a fairly high number), each running for up to 120s (the waitForDeployment timeout). Some goroutines could still be running after Eventually has timed out or met the condition, and, if I'm not wrong, all of them will have exited by roughly 240s (if Eventually does orphan the gofuncs). We only need one goroutine to meet the condition, but there is still a small chance that it is never met.

That was my reasoning for trying this nested-polling approach. Yes, it does seem a rather costly approach.

Alternatively, this race condition could be alleviated by #523, and we would also need to make a single write when patching and annotating the deployment.

Any suggestion for other approaches?

@pratikjagrut
Contributor Author

/test 4.5-e2e

@pratikjagrut changed the title from "Fix #549 Race condition between operator's watch and e2eutil.WaitForDeployment" to "WIP: Fix #549 Race condition between operator's watch and e2eutil.WaitForDeployment" on Jul 14, 2020
@openshift-ci-robot
Collaborator

@pratikjagrut: The following test failed, say /retest to rerun all failed tests:

Test name | Commit | Details | Rerun command
ci/prow/4.5-acceptance | 51d2711 | link | /test 4.5-acceptance

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@pratikjagrut
Contributor Author

Closing this PR in favour of the acceptance test.
