refactor: Fix flaky tests by using RetryOnConflict #904

Yicheng-Lu-llll · 2023-02-11T23:24:53Z

Why are these changes needed?

In a previous experiment, 86 runs succeeded and 14 runs failed with the following errors:

[Fail] Inside the default namespace When creating a raycluster [It] should update a raycluster object 
/home/lyc/Desktop/kuberay/ray-operator/controllers/ray/raycluster_controller_test.go:279

[Fail] Inside the default namespace When creating a raycluster [It] should have only 1 running worker 
/home/lyc/Desktop/kuberay/ray-operator/controllers/ray/raycluster_controller_test.go:286

[Fail] Inside the default namespace When creating a rayservice [It] should perform a zero-downtime update after a code change. 
/home/lyc/Desktop/kuberay/ray-operator/controllers/ray/rayservice_controller_test.go:354

The above errors are all caused by 409 conflicts:

A previous PR, Fix flaky tests by retrying 409 conflict error #73, used retryOnOldRevision in certain places to retry the update call when it hits a 409 conflict error. The experiment above shows that the retry strategy needs to be applied to more k8sClient.Update calls, especially those causing the errors listed above.
Compared with retryOnOldRevision, RetryOnConflict has an advantage: it provides exponential backoff, which avoids exhausting the apiserver.

So, two changes are made in this PR:

Replace retryOnOldRevision with RetryOnConflict
Apply RetryOnConflict to more k8sClient.Update calls.

If you have concerns about using RetryOnConflict to handle 409 conflict errors in tests, or need background information, see the links below:

The k8s api-conventions document describes what a client should do in the case of a conflict.
These two links [ link, link ] show how the retry strategy is implemented.
The Don’t Retry on Conflict section in this article, and this comment, advise against using a retry strategy in controllers and explain why. (Note: the concerns and drawbacks described there apply to controllers, not to tests.)

Related issue number

Closes #902

Checks

On a VM with a 2-core CPU and 7 GB of RAM (to simulate GitHub's standard Linux runner), I ran the tests 100 times:

rootPath="/home/lyc/Desktop/lessFlasky/kuberay"
cd "$rootPath/ray-operator"
for i in {1..100}; do
        echo "iteration ${i}"
        make test | tee "/home/lyc/Desktop/tmp/log${i}"
done

All 100 runs passed with no errors. The tests are stable now.

ray-operator/controllers/ray/raycluster_controller_test.go

ray-operator/controllers/ray/rayservice_controller_test.go

kevin85421 · 2023-02-14T21:40:47Z

ray-operator/controllers/ray/rayservice_controller_test.go


-			Expect(k8sClient.Update(ctx, myRayService)).Should(Succeed(), "failed to update test RayService resource")
+			err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
+				Eventually(


Can you explain why we need getResourceFunc here?

Without getResourceFunc, if a 409 conflict happens, the resourceVersion we pass to the API server on each retry is never refreshed, so it will never match the resourceVersion stored in the API server.

Yicheng-Lu-llll · 2023-02-14T23:12:44Z

Also updated the usage of the k8s.io/utils/pointer package. Some of its functions are deprecated and have new names. See here.

kevin85421

LGTM. Thank you for the fix!

Use RetryOnConflict to relieve flakiness.

Yicheng-Lu-llll added 2 commits February 11, 2023 15:11

add RetryOnConflict

e689d39

remove unused variables

b308229

kevin85421 self-requested a review February 14, 2023 20:53

kevin85421 reviewed Feb 14, 2023


Yicheng-Lu-llll added 2 commits February 14, 2023 16:55

remove redundant variables && update pointer pkg usage

872bac3

remove redundant variables && update pointer pkg usage

d0ab74c

kevin85421 approved these changes Feb 14, 2023


kevin85421 merged commit f058924 into ray-project:master Feb 14, 2023

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023

refactor: Fix flaky tests by using RetryOnConflict (ray-project#904)

95322eb

Use RetryOnConflict to relieve flakiness.
