[Bug] Update wait function in test_detached_actor #635

kevin85421 · 2022-10-14T01:22:58Z

Why are these changes needed?

In the KinD E2E tests, we use kubectl wait to block until the system is ready. However, kubectl wait --for=condition=Ready is unreliable right after a resource has been deleted. See the Example 1 section in #618 for more details.

[Example]
The test test_detached_actor kills the GCS server on the head pod and then uses kubectl wait to make sure that the new head pod is ready. However, in my experiment, the head pod takes about 60 seconds to crash after the GCS server is killed, and until then it still reports READY: 1/1 and STATUS: Running. kubectl wait can therefore succeed against the old, not-yet-crashed pod, so it cannot guarantee that the new head pod is ready.

kuberay/tests/compatibility-test.py

Lines 319 to 327 in ea6e8d1

    
    # kill the gcs on head node. If fate sharing is enabled
    # the whole head node pod will terminate.
    utils.shell_assert_success(
        'kubectl exec -it $(kubectl get pods -A| grep -e "-head" | awk "{print \\$2}") -- /bin/bash -c "ps aux | grep gcs_server | grep -v grep | awk \'{print \$2}\' | xargs kill"')
    # wait for new head node getting created
    time.sleep(10)
    # make sure the new head is ready
    utils.shell_assert_success(
        'kubectl wait --for=condition=Ready pod/$(kubectl get pods -A | grep -e "-head" | awk "{print \$2}") --timeout=900s')

The workaround in #619 replaces the kubectl wait with time.sleep(180). This PR implements a wait function for head pod restart.

Explanations for some changes

Kill the gcs_server process on the head pod
- Replace ps aux | grep gcs_server | ... | xargs kill with pkill gcs_server. The results of pgrep gcs_server and ps aux | grep gcs_server | grep -v grep | awk '{print $2}' are the same.
restart_count
- The default container restartPolicy of a Pod is Always. Hence, when the GCS server is killed, the container in the existing head pod is restarted rather than a new head pod being created.
  - restart the old one => the head pod name does not change, and restart_count increases by 1.
  - create a new one => the head pod name changes, and restart_count becomes 0.
- After the GCS server is killed, it takes nearly 1 minute for the head pod to crash. During that minute, the head pod is still 'Running' and 'Ready'. Hence, we need to check restart_count to make sure the head pod has actually crashed and restarted.
- HA in ray:nightly is buggy. It is likely to create a new head pod instead of restarting the old one. In that case restart_count becomes 0 and this test fails.
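
A minimal sketch of the restart_count check described above, not the code in this PR. The function name wait_for_restart, its parameters, and the get_restart_count callable are assumptions made for illustration; the callable is injected so the sketch runs without a cluster.

```python
import time
from typing import Callable


def wait_for_restart(get_restart_count: Callable[[], int],
                     old_restart_count: int,
                     timeout_sec: float = 300.0,
                     poll_interval_sec: float = 1.0) -> int:
    """Poll until the observed restart count exceeds old_restart_count.

    get_restart_count is a hypothetical callable that returns the head
    container's current restart count. If a new pod is created instead of
    the old one being restarted, the count stays at 0 and this times out.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        current = get_restart_count()
        if current > old_restart_count:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"restart count is still {current} after {timeout_sec} seconds")
        time.sleep(poll_interval_sec)
```

In a real test, the callable could read status.container_statuses[0].restart_count from the head pod returned by the Kubernetes Python client; that wiring is left out here.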
Even when all containers in the pods are "READY" and all pods are "Running", it still takes tens of seconds for all Ray processes to become ready.
- [Solution]:
  - time.sleep(post_wait_sec)
  - retry_with_timeout(lambda: ray.init(address='ray://127.0.0.1:10001', ...), timeout = 180)
- Does ray.init have any timeout argument?
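
The retry idea in the solution above can be sketched as follows. This is a hypothetical reimplementation under stated assumptions, not the helper added by this PR: only the name retry_with_timeout and the timeout argument come from the description, while the interval parameter and the exception handling are guesses.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_timeout(func: Callable[[], T],
                       timeout: float = 60.0,
                       interval: float = 1.0) -> T:
    """Call func until it stops raising or the timeout elapses.

    interval (seconds between attempts) is an assumed parameter.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return func()
        except Exception as err:
            # Give up once the deadline has passed; keep the last error as the cause.
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"no successful call within {timeout} seconds") from err
        time.sleep(interval)
```

Wrapping ray.init in a lambda, as in the bullet above, lets the connection attempt be repeated until the Ray processes are up, which avoids depending on whether ray.init itself accepts a timeout.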
Why do I pass client.CoreV1Api() as a function argument?
```
k8s_api = client.CoreV1Api()
headpods = utils.get_pod(k8s_api, namespace='default', label_selector='rayNodeType=head')
```
- If I create a new client.CoreV1Api() instance in each function call, unittest reports "ResourceWarning: unclosed SSLSocket".
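
A rough sketch of the pattern, assuming a helper shaped like the get_pod call above. The body of get_pod, the FakeCoreV1Api stand-in, and the pod name are all hypothetical; the fake exists only so the example runs without a cluster, and it mirrors the list_namespaced_pod method of the real CoreV1Api.

```python
from types import SimpleNamespace


def get_pod(k8s_api, namespace, label_selector):
    """List pods through the API object supplied by the caller.

    The helper never constructs its own client, so one instance can be
    shared across every call in a test.
    """
    return k8s_api.list_namespaced_pod(namespace=namespace,
                                       label_selector=label_selector)


class FakeCoreV1Api:
    """Hypothetical stand-in for client.CoreV1Api()."""

    def __init__(self):
        self.calls = 0

    def list_namespaced_pod(self, namespace, label_selector=None):
        self.calls += 1
        # "raycluster-head" is a made-up pod name for illustration.
        pod = SimpleNamespace(metadata=SimpleNamespace(name="raycluster-head"))
        return SimpleNamespace(items=[pod])


k8s_api = FakeCoreV1Api()  # created once, then reused
first = get_pod(k8s_api, namespace="default", label_selector="rayNodeType=head")
second = get_pod(k8s_api, namespace="default", label_selector="rayNodeType=head")
```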

Related issue number

This PR solves part of #618.

Checks

RAY_IMAGE=rayproject/ray:2.0.0 python3 tests/compatibility-test.py RayFTTestCase.test_detached_actor 2>&1 | tee log

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- This PR is not tested :(

tests/kuberay_utils/utils.py

DmitriGekhtman · 2022-10-18T20:04:38Z

This looks good! Just one request for a bit more documentation.

tests/kuberay_utils/utils.py

DmitriGekhtman · 2022-10-21T20:34:43Z

Let's wait for CI to finish.

This commit implements a wait function for head pod restart in test_detached_actor.

kevin85421 added 4 commits October 14, 2022 01:21

update

5b2e583

update dependencies

43fd4d6

remove pod name check

9da4274

update

27b22f5

kevin85421 marked this pull request as ready for review October 17, 2022 17:20

kevin85421 changed the title ~~(DRAFT) Update wait function in test_detached_actor~~ [Bug] Update wait function in test_detached_actor Oct 17, 2022

kevin85421 requested review from DmitriGekhtman and wilsonwang371 October 17, 2022 21:04

update

9d05947

DmitriGekhtman reviewed Oct 18, 2022

tests/kuberay_utils/utils.py (outdated, resolved)

add docstring

40b20c6

DmitriGekhtman reviewed Oct 20, 2022

tests/kuberay_utils/utils.py (outdated, resolved)

kevin85421 added 2 commits October 21, 2022 17:36

remove sleep

630bd62

update comment

2b878a5

DmitriGekhtman reviewed Oct 21, 2022

tests/kuberay_utils/utils.py (outdated, resolved)

DmitriGekhtman reviewed Oct 21, 2022

tests/kuberay_utils/utils.py (outdated, resolved)

update comment

56e9c27

DmitriGekhtman approved these changes Oct 21, 2022

kevin85421 mentioned this pull request Oct 23, 2022

[Bug] Misuse of Docker API and misunderstanding of Ray HA cause test_ray_serve flaky #650

Merged

4 tasks

DmitriGekhtman approved these changes Oct 24, 2022

DmitriGekhtman merged commit 457d67a into ray-project:master Oct 24, 2022

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023

[Bug] Update wait function in test_detached_actor (ray-project#635)

fc89597

This commit implements a wait function for head pod restart in test_detached_actor.
