improve test stability #394

Merged


wilsonwang371
Collaborator

@wilsonwang371 wilsonwang371 commented Jul 20, 2022

Why are these changes needed?

  1. Improve compatibility test stability by increasing the initial delay of the liveness and readiness probes.
  2. Remove the unnecessary gofumpt run from the Makefile.
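
To illustrate why a longer initial delay helps a slow-starting container, here is a minimal sketch of the arithmetic. It assumes the standard Kubernetes probe fields `initialDelaySeconds`, `periodSeconds`, and `failureThreshold`; the `ProbeSettings` class, the helper name, and all numeric values are hypothetical and are not taken from this PR. Probe timeouts and kubelet jitter are ignored, so treat the result as a rough estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeSettings:
    """Hypothetical mirror of three Kubernetes probe fields (values in seconds)."""

    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int


def startup_budget_seconds(probe: ProbeSettings) -> int:
    """Approximate time a container has to become healthy before the probe gives up.

    The first probe runs after the initial delay. The container is treated as
    failed once failure_threshold consecutive probes fail, one per period.
    """
    return probe.initial_delay_seconds + (probe.failure_threshold - 1) * probe.period_seconds


# Illustrative values only, not the ones changed in this PR.
short_delay = ProbeSettings(initial_delay_seconds=10, period_seconds=5, failure_threshold=3)
long_delay = ProbeSettings(initial_delay_seconds=30, period_seconds=5, failure_threshold=3)

print(startup_budget_seconds(short_delay))  # 20
print(startup_budget_seconds(long_delay))   # 40
```

Under these assumptions, raising only the initial delay from 10 to 30 seconds doubles the time a slow container has to come up before the third consecutive failure.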

Related issue number

N/A

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@Jeffwan
Collaborator

Jeffwan commented Jul 20, 2022

Why does the test take so long to finish?

@wilsonwang371 wilsonwang371 force-pushed the wilson/improve-test-stability branch 3 times, most recently from 4829dd1 to 7770740 Compare July 20, 2022 22:52
@wilsonwang371
Collaborator Author

@Jeffwan I did some investigation. The root cause of both the failures and the long wait appears to be that the initial delay seconds is too small for our test environment. I am working on this right now.

@wilsonwang371 wilsonwang371 force-pushed the wilson/improve-test-stability branch 2 times, most recently from 52a6f15 to d4b4e90 Compare July 21, 2022 00:02
@Jeffwan
Collaborator

Jeffwan commented Jul 21, 2022

There currently seem to be two issues:

  1. Flakiness: the test sometimes succeeds and sometimes fails.
  2. It takes a long time to finish.

If we can resolve the flakiness and reduce the running time to 10 minutes, that would be great.

@wilsonwang371
Collaborator Author

> There currently seem to be two issues:
>
>   1. Flakiness: the test sometimes succeeds and sometimes fails.
>   2. It takes a long time to finish.
>
> If we can resolve the flakiness and reduce the running time to 10 minutes, that would be great.

The first issue is the more urgent one. Once it is fixed, we will work on the second.

The root cause of the flaky test is the slow Ray cluster startup in the kind environment. We need to update the probe timeout parameters.

@wilsonwang371 wilsonwang371 force-pushed the wilson/improve-test-stability branch 3 times, most recently from 8615eb0 to 612916a Compare July 21, 2022 02:17
@wilsonwang371
Collaborator Author

From what I observed, when we kill the head node of a Ray cluster in our GitHub testbed, the worker readiness probe can fail for more than 30 seconds. That causes our default readiness probe to kick in and kill the worker node.

I increased the readiness probe timeout to 40 seconds for now.

FYI: @iycheng @brucez-anyscale @Jeffwan @scarlet25151 @DmitriGekhtman
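
As a rough illustration of the reasoning above, the sketch below checks whether a probe configuration would ride out an outage of a given length. The thread does not say which probe field was raised to 40 seconds, so this assumes the tolerance window is `periodSeconds` multiplied by `failureThreshold`; the function names and all values are hypothetical, and the time each failing probe spends waiting on its own timeout is ignored.

```python
def worst_case_failed_probes(outage_seconds: int, period_seconds: int) -> int:
    """Most probes that can land inside an outage when one fires right as it begins."""
    return outage_seconds // period_seconds + 1


def tolerates_outage(outage_seconds: int, period_seconds: int, failure_threshold: int) -> bool:
    """True if the outage ends before failure_threshold consecutive probes have failed."""
    return worst_case_failed_probes(outage_seconds, period_seconds) < failure_threshold


# Illustrative values only: a 30 second outage probed every 5 seconds.
print(worst_case_failed_probes(30, 5))  # 7
print(tolerates_outage(30, 5, 3))       # False: window of about 15 seconds
print(tolerates_outage(30, 5, 8))       # True: window of about 40 seconds
```

Under these assumptions, a window of about 40 seconds leaves only one probe of headroom over a 30 second outage, so a slower testbed could still trip it.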

@wilsonwang371 wilsonwang371 force-pushed the wilson/improve-test-stability branch 2 times, most recently from 517b980 to bb0ccc1 Compare July 21, 2022 02:49
@Jeffwan Jeffwan merged commit aa447ca into ray-project:master Jul 21, 2022
@brucez-anyscale
Contributor

@iycheng I think the worker node's readiness and liveness do not depend on the head node, because if a worker node cannot connect to the head node, it still returns healthy.
Can you confirm?

@fishbone
Contributor

@brucez-anyscale
Contributor

@wilsonwang371 Can you confirm the worker issue is caused by the head node being down?

@wilsonwang371
Collaborator Author

> @wilsonwang371 Can you confirm the worker issue is caused by the head node being down?

This is because the probe timeout is not large enough for a kind cluster.

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023