Ray client test hangs due to server erroneously calling client method #14756

Closed
ericl opened this issue Mar 18, 2021 · 2 comments · Fixed by #14782

ericl commented Mar 18, 2021

The following is a deadlock that I believe is test-only. The dataservicer (second stack below) is calling back into the client, which deadlocks because the server is blocking on an RPC to itself.

I think the issue is that we're launching the test server in the same process as the client.
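
To illustrate the shape of this deadlock, here is a minimal sketch, not Ray's actual code. It assumes a server with a limited number of worker threads, modelled by a `ThreadPoolExecutor`; `put_object`, `release_object`, and `handler_calls_own_server` are hypothetical names. The handler issues a request that only its own server can answer and then blocks on the reply. A timeout stands in for the hang so the example terminates.

```python
import concurrent.futures as cf


def handler_calls_own_server(workers=1, timeout_s=0.2):
    """Toy model: a request handler blocks on a request to its own server."""
    # The pool stands in for the server's worker threads (an assumption).
    pool = cf.ThreadPoolExecutor(max_workers=workers)

    def release_object():
        # Hypothetical second request that the same server has to serve.
        return "released"

    def put_object():
        # While handling one request, call back into the "client", which
        # sends another request to the same server and waits for the reply.
        inner = pool.submit(release_object)
        try:
            return inner.result(timeout=timeout_s)
        except cf.TimeoutError:
            # Without the timeout this wait would never return when the only
            # worker is the one that is currently blocked here.
            return "deadlock"

    outer = pool.submit(put_object)
    result = outer.result()
    pool.shutdown(wait=True)
    return result
```

With one worker the inner request can never be scheduled while the handler waits, so the call reports `"deadlock"`; with a second worker it completes. Running the server in a separate process from the client would avoid this particular coupling.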

Repro:
for i in $(seq 1 100); do RAY_CLIENT_MODE=1 pytest -v test_actor.py; sleep 0; done
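
Because the failure is a hang rather than an error, the shell loop simply stops making progress when it hits. A rough Python equivalent that detects the hang and leaves the process alive for inspection could look like the sketch below. The function name `run_until_hang` and the timeout value are assumptions, not part of the original repro.

```python
import subprocess


def run_until_hang(cmd, attempts=100, timeout_s=600.0, env=None):
    """Run cmd repeatedly; return (attempt, process) for the first run that
    exceeds timeout_s, or None if every run finishes in time."""
    for attempt in range(1, attempts + 1):
        proc = subprocess.Popen(cmd, env=env)
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # The process is deliberately left running so that its stacks
            # can be dumped; the caller is responsible for killing it.
            return attempt, proc
    return None
```

For this issue the call would be something like `run_until_hang(["pytest", "-v", "test_actor.py"], env={**os.environ, "RAY_CLIENT_MODE": "1"})`, followed by a stack dump of `proc.pid` and then `proc.kill()`.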

py-spy dumps:

Thread 29028 (idle): "MainThread"
    do_futex_wait.constprop.1 (libpthread-2.27.so)
    __new_sem_wait_slow.constprop.0 (libpthread-2.27.so)
    PyThread_acquire_lock_timed (python3.6)
    wait (threading.py:295)
    wait_for (threading.py:330)
    _blocking_send (ray/util/client/dataclient.py:103)
    PutObject (ray/util/client/dataclient.py:127)
    _put (ray/util/client/worker.py:248)
    <listcomp> (ray/util/client/worker.py:231)
    put (ray/util/client/worker.py:231)
    put (ray/util/client/api.py:45)
    _ensure_ref (ray/util/client/common.py:163)
    _prepare_client_task (ray/util/client/common.py:190)
    _prepare_client_task (ray/util/client/common.py:285)
    call_remote (ray/util/client/worker.py:287)
    call_remote (ray/util/client/api.py:96)
    remote (ray/util/client/common.py:292)
    _remote (ray/util/client/common.py:179)
    client_mode_convert_actor (ray/_private/client_mode_hook.py:90)
    _remote (ray/actor.py:587)
    remote (ray/actor.py:413)
    test_keyword_args (test_actor.py:328)
    pytest_pyfunc_call (_pytest/python.py:183)
    _multicall (pluggy/callers.py:187)
    <lambda> (pluggy/manager.py:87)
    _hookexec (pluggy/manager.py:93)
    __call__ (pluggy/hooks.py:286)
    runtest (_pytest/python.py:1641)
    pytest_runtest_call (_pytest/runner.py:162)
    _multicall (pluggy/callers.py:187)
    <lambda> (pluggy/manager.py:87)
    _hookexec (pluggy/manager.py:93)
    __call__ (pluggy/hooks.py:286)
    <lambda> (_pytest/runner.py:255)
    from_call (_pytest/runner.py:311)
    call_runtest_hook (_pytest/runner.py:255)
    call_and_report (_pytest/runner.py:215)
    runtestprotocol (_pytest/runner.py:126)
    pytest_runtest_protocol (_pytest/runner.py:109)
    _multicall (pluggy/callers.py:187)
    <lambda> (pluggy/manager.py:87)
    _hookexec (pluggy/manager.py:93)
    __call__ (pluggy/hooks.py:286)
    pytest_runtestloop (_pytest/main.py:348)
    _multicall (pluggy/callers.py:187)
    <lambda> (pluggy/manager.py:87)
    _hookexec (pluggy/manager.py:93)
    __call__ (pluggy/hooks.py:286)
    _main (_pytest/main.py:323)
    wrap_session (_pytest/main.py:269)
    pytest_cmdline_main (_pytest/main.py:316)
    _multicall (pluggy/callers.py:187)
    <lambda> (pluggy/manager.py:87)
    _hookexec (pluggy/manager.py:93)
    __call__ (pluggy/hooks.py:286)
    main (_pytest/config/__init__.py:163)
    console_main (_pytest/config/__init__.py:185)
    <module> (pytest:8)
Thread 31940 (idle): "ThreadPoolExecutor-8_1"
    do_futex_wait.constprop.1 (libpthread-2.27.so)
    __new_sem_wait_slow.constprop.0 (libpthread-2.27.so)
    PyThread_acquire_lock_timed (python3.6)
    wait (threading.py:295)
    wait_for (threading.py:330)
    _blocking_send (ray/util/client/dataclient.py:103)
    ReleaseObject (ray/util/client/dataclient.py:139)
    _release_server (ray/util/client/worker.py:324)
    call_release (ray/util/client/worker.py:317)
    call_release (ray/util/client/api.py:108)
    __del__ (ray/util/client/common.py:52)
    __pyx_tp_new_7msgpack_9_cmsgpack_Packer (_cmsgpack.cpp:11208)
    packb (msgpack/__init__.py:35)
    dumps (ray/_raylet.so)
    _serialize_to_msgpack (ray/serialization.py:299)
    serialize (ray/serialization.py:324)
    put_object (ray/worker.py:265)
    put (ray/worker.py:1463)
    wrapper (ray/_private/client_mode_hook.py:47)
    _put_object (ray/util/client/server/server.py:217)
    Datapath (ray/util/client/server/dataservicer.py:50)
    _take_response_from_response_iterator (grpc/_server.py:453)
    _send_message_callback_to_blocking_iterator_adapter (grpc/_server.py:607)
    _stream_response_in_pool (grpc/_server.py:593)
    run (concurrent/futures/thread.py:56)
    _worker (concurrent/futures/thread.py:69)
    run (threading.py:864)
    _bootstrap_inner (threading.py:916)
    _bootstrap (threading.py:884)
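
In the second stack, `__del__` (ray/util/client/common.py:52) is invoked from inside the msgpack `Packer` allocation (`__pyx_tp_new_7msgpack_9_cmsgpack_Packer`), which would be consistent with a garbage-collection pass running the finalizer on whichever thread happened to allocate. The sketch below shows that general Python behaviour only. `FakeClientRef` and `finalizer_thread` are hypothetical names, and the collection is triggered explicitly so the outcome is deterministic.

```python
import gc
import threading


class FakeClientRef:
    """Hypothetical stand-in for a client-side ref with a finalizer."""

    def __init__(self, log):
        self.log = log
        self.cycle = self  # reference cycle: only the cyclic GC can free it

    def __del__(self):
        # A real finalizer might make a blocking call here; this one only
        # records which thread it ran on.
        self.log.append(threading.current_thread().name)


def finalizer_thread():
    """Return the names of the threads on which the finalizer ran."""
    log = []
    was_enabled = gc.isenabled()
    gc.disable()  # keep automatic collections from racing with the demo
    try:
        FakeClientRef(log)  # unreachable at once, kept alive by its cycle
        # The collection runs on an unrelated thread, like a server handler
        # that allocates while a dead cycle is pending.
        worker = threading.Thread(target=gc.collect, name="server-handler")
        worker.start()
        worker.join()
    finally:
        if was_enabled:
            gc.enable()
    return log
```

The finalizer runs on the thread that triggered the collection, not on the thread that created or dropped the object, so a blocking `__del__` can surface inside unrelated code.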

ericl commented Mar 18, 2021

cc @AmeerHajAli

ericl changed the title from "Ray client tests flaky due to server erroneously calling client method" to "Ray client test hangs due to server erroneously calling client method" Mar 18, 2021
ericl added the P1 (Issue that should be fixed within a few weeks) label Mar 18, 2021
ericl added this to the Core Bugs milestone Mar 18, 2021
ericl self-assigned this Mar 18, 2021

ericl commented Mar 18, 2021

There are a couple of mysteries here:

  1. How does a ClientObjectRef end up on the server side? The pickling protocol should be turning those into normal object refs.
  2. Why does this only happen occasionally? There might be a race condition related to Python GC here.
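
For the first question, one common way for a pickling protocol to swap handles is `pickle`'s persistent-ID hooks. The following is a sketch under that assumption, not Ray's implementation; `ClientRef`, `ServerRef`, `ClientPickler`, `ServerUnpickler`, and `to_server` are hypothetical names.

```python
import io
import pickle


class ClientRef:
    """Hypothetical client-side handle to a remote object."""

    def __init__(self, ref_id):
        self.ref_id = ref_id


class ServerRef:
    """Hypothetical server-side object ref."""

    def __init__(self, ref_id):
        self.ref_id = ref_id


class ClientPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Emit only an ID for client handles instead of pickling them.
        if isinstance(obj, ClientRef):
            return ("ref", obj.ref_id)
        return None  # everything else is pickled normally


class ServerUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        kind, ref_id = pid
        if kind == "ref":
            # Rebuild the handle as a server-side ref.
            return ServerRef(ref_id)
        raise pickle.UnpicklingError(f"unknown persistent id: {pid!r}")


def to_server(obj):
    """Round-trip obj through the client pickler and server unpickler."""
    buf = io.BytesIO()
    ClientPickler(buf, protocol=pickle.HIGHEST_PROTOCOL).dump(obj)
    return ServerUnpickler(io.BytesIO(buf.getvalue())).load()
```

In this toy model, any object serialized without going through `ClientPickler` would skip the swap. Whether something similar happens in Ray is not established in this thread.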
