refactor: Fix flaky tests by using RetryOnConflict #904

Merged
merged 4 commits into from
Feb 14, 2023

Conversation

Contributor

@Yicheng-Lu-llll Yicheng-Lu-llll commented Feb 11, 2023

Why are these changes needed?

In a previous experiment, 86 runs succeeded and 14 runs failed with the following errors:

[Fail] Inside the default namespace When creating a raycluster [It] should update a raycluster object 
/home/lyc/Desktop/kuberay/ray-operator/controllers/ray/raycluster_controller_test.go:279

[Fail] Inside the default namespace When creating a raycluster [It] should have only 1 running worker 
/home/lyc/Desktop/kuberay/ray-operator/controllers/ray/raycluster_controller_test.go:286

[Fail] Inside the default namespace When creating a rayservice [It] should perform a zero-downtime update after a code change. 
/home/lyc/Desktop/kuberay/ray-operator/controllers/ray/rayservice_controller_test.go:354

All of the above errors are caused by 409 conflicts:

  • In the previous PR, "Fix flaky tests by retrying 409 conflict error" (#73), retryOnOldRevision is used in certain places to retry the update call when it hits a 409 conflict error. The previous experiment shows that the retry strategy needs to be applied to more k8sClient.Update calls, especially those causing the above errors.
  • Compared with retryOnOldRevision, RetryOnConflict has an advantage: it provides exponential backoff, which avoids overloading the API server.

So, two changes are made in this PR:

If you have concerns about using RetryOnConflict to handle 409 conflict errors in tests, or need some background information, see the links below:

Related issue number

Closes #902

Checks

On a VM with a 2-core CPU and 7 GB of RAM (to simulate GitHub's standard Linux runner), I ran the tests 100 times:

rootPath="/home/lyc/Desktop/lessFlasky/kuberay"
cd "$rootPath/ray-operator"
for i in {1..100}; do
    echo "iteration ${i}"
    make test | tee "/home/lyc/Desktop/tmp/log${i}"
done

All 100 runs passed with no errors. The tests are stable now.

@kevin85421 kevin85421 self-requested a review February 14, 2023 20:53

Expect(k8sClient.Update(ctx, myRayService)).Should(Succeed(), "failed to update test RayService resource")
err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
Eventually(
Member

Can you explain why we need getResourceFunc here?

Contributor Author

Without getResourceFunc, if a 409 conflict happens, the resourceVersion we send to the API server on each retry is never refreshed, so it will never match the resourceVersion stored in the API server.

@Yicheng-Lu-llll
Contributor Author

This PR also updates the usage of the k8s.io/utils/pointer package. Some functions are deprecated and have new names. See here.

Member

@kevin85421 kevin85421 left a comment

LGTM. Thank you for the fix!

@kevin85421 kevin85421 merged commit f058924 into ray-project:master Feb 14, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
Development

Successfully merging this pull request may close these issues.

[Bug] 409 conflict error may occur when updating cr in the test