Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Update wait function in test_detached_actor #635

Merged
merged 9 commits into from
Oct 24, 2022

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Oct 14, 2022

Why are these changes needed?

In KinD E2E tests, we use kubectl wait to block the process only when the system is not ready. However, it is not good to use kubectl wait --for=condition=Ready after deleting a resource. See the Example 1 section in #618 for more details.

[Example]
The test test_detached_actor kills GCS on the head pod and uses kubectl wait to make sure that the new head pod is ready. However, in my experiment, the head pod will need 60 seconds to crash after the GCS server is killed. The head pod is READY:1/1, STATUS: Running before the crash. Hence, kubectl wait cannot make sure the new head pod is ready.

# kill the gcs on head node. If fate sharing is enabled
# the whole head node pod will terminate.
utils.shell_assert_success(
'kubectl exec -it $(kubectl get pods -A| grep -e "-head" | awk "{print \\$2}") -- /bin/bash -c "ps aux | grep gcs_server | grep -v grep | awk \'{print \$2}\' | xargs kill"')
# wait for new head node getting created
time.sleep(10)
# make sure the new head is ready
utils.shell_assert_success(
'kubectl wait --for=condition=Ready pod/$(kubectl get pods -A | grep -e "-head" | awk "{print \$2}") --timeout=900s')

The workaround solution is to replace the kubectl wait with time.sleep(180) in #619. This PR implements a wait function for head pod restart.

Explanations for some changes

  • Kill the gcs_server process on the head pod

    • Replace ps aux | grep gcs_server | ... | xargs kill with pkill gcs_server. The results of pgrep gcs_server and ps aux | grep gcs_server | grep -v grep | awk '{print $2}' are the same.
  • restart_count

    • The default container restartPolicy of a Pod is Always. Hence, when GCS server is killed, the head pod will restart the old one rather than create a new one.
      • restart the old one => head pod name will not change, and restart_count will increase by 1.
      • create a new one => head pod name will change, and restart_count will become 0.
    • When GCS server is killed, it takes nearly 1 min to kill the head pod. In the minute, the head pod will still be in 'Running' and 'Ready'. Hence, we need to check restart_count to ensure the head pod has been dead.
    • HA in ray:nightly is buggy. It has a high possibility to create a new head pod instead of restarting the old one. Hence, the restart_count will become 0 and fail this test.
  • When all containers in pods are "READY" and all pods are "Running", it still takes tens of seconds to make all ray processes become ready.

    • [Solution]:
      • time.sleep(post_wait_sec)
      • retry_with_timeout(lambda: ray.init(address='ray://127.0.0.1:10001', ...), timeout = 180)
    • Does ray.init have any timeout argument?
  • Why do I pass client.CoreV1Api() as a function argument?

    k8s_api = client.CoreV1Api()
    headpods = utils.get_pod(k8s_api, namespace='default', label_selector='rayNodeType=head')

Related issue number

This PR solved a part of #618.

Checks

RAY_IMAGE=rayproject/ray:2.0.0 python3 tests/compatibility-test.py RayFTTestCase.test_detached_actor 2>&1 | tee log  
  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 kevin85421 marked this pull request as ready for review October 17, 2022 17:20
@kevin85421 kevin85421 changed the title (DRAFT) Update wait function in test_detached_actor [Bug] Update wait function in test_detached_actor Oct 17, 2022
@DmitriGekhtman
Copy link
Collaborator

This looks good! Just one request for a bit more documentation.

@DmitriGekhtman
Copy link
Collaborator

Let's wait for CI to finish.

@DmitriGekhtman DmitriGekhtman merged commit 457d67a into ray-project:master Oct 24, 2022
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
This commit implements a wait function for head pod restart in test_detached_actor.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants