Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Misuse of Docker API and misunderstanding of Ray HA cause test_detached_actor flaky #619

Merged
merged 6 commits into from
Oct 13, 2022

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Oct 7, 2022

Why are these changes needed?

Misuse of Docker API

As shown in the following code segment, the existing tests in compatibility-test.py has the following pattern:

  • Create a container which launches a Python REPL process by Docker API.
  • Attach a socket to the container with parameters params={'stdin': 1, 'stream': 1, 'stdout': 1, 'stderr': 1}.
  • Use the socket to interact with Python REPL.
  • Use a while loop to check the received message of the socket.

client = docker.from_env()
container = client.containers.run(ray_image, remove=True, detach=True, stdin_open=True, tty=True,
network_mode='host', command=["/bin/sh", "-c", "python"])
s = container.attach_socket(
params={'stdin': 1, 'stream': 1, 'stdout': 1, 'stderr': 1})
s._sock.setblocking(0)
s._sock.sendall(b'''
import ray
import time
def retry_with_timeout(func, count=90):
tmp = 0
err = None
while tmp < count:
try:
return func()
except Exception as e:
err = e
tmp += 1
assert err is not None
raise err
ray.init(address='ray://127.0.0.1:10001')
@ray.remote
class A:
def ready(self):
import os
return os.getpid()
a = A.options(name="a", lifetime="detached", max_restarts=-1).remote()
res1 = ray.get(a.ready.remote())
print('ready')
''')
count = 0
while count < 90:
try:
buf = s._sock.recv(4096)
logger.info(buf.decode())
if buf.decode().find('ready') != -1:
break
except Exception as e:
pass
time.sleep(1)
count += 1
if count >= 90:
raise Exception('failed to run script')

However, this is buggy because the received message of the socket includes STDIN, STDOUT, and STDERR. That is, the received message includes the messages sent by the function sendall (STDIN) in the above example. Hence, the condition buf.decode().find('ready') != -1 will always be fulfilled by L295 def ready(self):. In addition, STDOUT may also cause some bugs, e.g. #617.

Solution

Check exit_code instead.

Misunderstanding of Ray HA

Originally, the detached actor is defined as follows. First, the test will get the PID of the Actor. Next, kill the GCS server and wait for the new head pod ready. Finally, connect to the detached actor again and compare the PIDs. Check this for more details.

@ray.remote
class A:
    def ready(self):
        import os
        return os.getpid()

When the GCS server on the old head pod is killed, all the head pod and worker pods will be recreated. Hence, the PID will be definitely different. It is worth noting that Ray HA promises that the registered Actors will be launched again when the new head pod is up, but the internal state will not be restored.

Solution

Design a new TestCounter actor.

Related issue number

Closes #620
Closes #617

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 kevin85421 changed the title [Bug] Misuse of Docker API caused the wait functions for ray cluster wrong [Bug] Misuse of Docker API and misunderstanding of Ray HA cause test_detached_actor flaky Oct 7, 2022
@DmitriGekhtman
Copy link
Collaborator

When the GCS server on the old head pod is killed, all the head pod and worker pods will be recreated. Hence, the PID will be definitely different. It is worth noting that Ray HA promises that the registered Actors will be launched again when the new head pod is up, but the internal state will not be restored.

Note that worker pods being recreated is a bug. We need to track and solve that issue.

I am not sure but I think it could also be possible for the processes running actors to survive a GCS crash?

cc @iycheng @scv119 @shrekris-anyscale

Copy link
Collaborator

@DmitriGekhtman DmitriGekhtman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good.

tests/compatibility-test.py Outdated Show resolved Hide resolved
tests/compatibility-test.py Outdated Show resolved Hide resolved
tests/compatibility-test.py Outdated Show resolved Hide resolved
tests/compatibility-test.py Outdated Show resolved Hide resolved
kevin85421 and others added 2 commits October 12, 2022 13:18
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Copy link
Collaborator

@DmitriGekhtman DmitriGekhtman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Feel free to merge.

@kevin85421 kevin85421 merged commit e2a6ae8 into ray-project:master Oct 13, 2022
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…detached_actor flaky (ray-project#619)

* fix test_detached_actor

* update

* Update tests/compatibility-test.py

Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>

* update

* add link

* update comments

Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants