[RayClient][Proxy] BugFixes #16040

Merged
9 commits merged on May 28, 2021

ijrsvt
Contributor

@ijrsvt ijrsvt commented May 24, 2021

Why are these changes needed?

  • Fix a bug where running python -m ray.util.client.server without first starting Ray does not work.
  • Use a more robust method of ensuring that ray.util.client.server is the main process (as opposed to the shim process).

Related issue number

Closes #16067

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ijrsvt ijrsvt changed the title [WIP] Ray ClientBuilder Docs [WIP] Update RayClient Doc Examples May 25, 2021
@ijrsvt ijrsvt changed the title [WIP] Update RayClient Doc Examples [RayClient][Proxy] BugFixes May 25, 2021
"""
if self.redis_address:
    return self.redis_address
connection_tuple = ray.init(address=self.redis_address)
Contributor

Should this start a Ray driver?
If yes, why do we need to pass self.redis_address as address? If the if branch was not taken, doesn't that mean the value here is None?

Contributor Author

Ah yes, this will be None. The logic is the same, but it is quite weird!
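To make the point in this thread concrete, here is a minimal sketch of the same control flow with the redundant argument removed. It is not the code from this PR: the class name AddressResolver, the method get_address, and the injected start_ray callable are all hypothetical, and start_ray stands in for whatever call actually starts Ray and reports its address.

```python
from typing import Callable, Optional


class AddressResolver:
    """Hypothetical stand-in for the class under review (not Ray's actual code)."""

    def __init__(self, redis_address: Optional[str], start_ray: Callable[[], str]):
        self.redis_address = redis_address
        # Assumed callable that starts Ray locally and returns the new address.
        self._start_ray = start_ray

    def get_address(self) -> str:
        # An address is already known: reuse it, do not start anything.
        if self.redis_address:
            return self.redis_address
        # Past the early return, redis_address is falsy (None or empty),
        # so there is no address worth forwarding to the start call.
        self.redis_address = self._start_ray()
        return self.redis_address
```

Because the result is cached on the instance, a second call takes the early return and does not start Ray again.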

@AmeerHajAli AmeerHajAli added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 26, 2021
@ijrsvt ijrsvt requested a review from AmeerHajAli May 27, 2021 04:09
Contributor Author

ijrsvt commented May 28, 2021

macOS tests run locally:
Only these were run because they are the only tests that connect to a cluster started with the RayClient Proxy (i.e. a cluster started with ray start as opposed to with a test fixture).

  • test_client_proxy.py
  • test_namespace.py
  • test_runtime_env_complicated.py (manually tested; the test file could not be run because conda installs are broken in my local macOS configuration)
  • test_multi_node_2.py
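
For readers unfamiliar with the distinction, a rough sketch of what "started with ray start" means in practice: the test shells out to the CLI rather than building the cluster in-process. The helper names below are hypothetical; the flags are the ones visible in the parametrized test IDs in the logs that follow, and --force is the option the ray stop output suggests.

```python
import shlex
from typing import List


def ray_start_command(client_server_port: int) -> List[str]:
    """Build the head-node start command seen in the test IDs (hypothetical helper)."""
    return shlex.split(
        f"ray start --head --ray-client-server-port {client_server_port} --port 0"
    )


def ray_stop_command(force: bool = False) -> List[str]:
    """Build the matching teardown command, optionally with --force."""
    return ["ray", "stop"] + (["--force"] if force else [])
```

A fixture could hand these lists to subprocess.check_call before and after the test body; that wiring is omitted here because it requires a local Ray installation.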
(base) ianrodney@Ians-Macbook-Pro ray % python -m pytest -sv '/Users/ianrodney/Documents/ray/python/ray/tests/test_client_proxy.py'
##vso[task.logissue type=warning;]The 'junit_family' default value will change to 'xunit2' in pytest 6.0.
Add 'junit_family=xunit1' to your pytest.ini file to keep the current format in future versions of pytest and silence this warning.
Test session starts (platform: darwin, Python 3.6.10, pytest 5.4.3, pytest-sugar 0.9.4)
cachedir: .pytest_cache
rootdir: /Users/ianrodney/Documents/ray/python
plugins: azurepipelines-0.8.0, sugar-0.9.4, asyncio-0.15.1, rerunfailures-9.1.1, xdist-2.1.0, aiohttp-0.3.0, profiling-1.7.0, timeout-1.4.2, forked-1.3.0
collecting ... 2021-05-28 08:26:02,361  INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265
2021-05-28 08:26:06,125 INFO proxier.py:185 -- SpecificServer started on port: 45000 with PID: 76649 for client: client1
INFO:ray.util.client.server.server:Starting Ray Client server on 0.0.0.0:45000

 ray/tests/test_client_proxy.py::test_proxy_manager_lifecycle ✓                                                               8% ▊         2021-05-28 08:26:16,539 INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265
Could not find conda environment: conda-env-that-sadly-does-not-exist
You can list all discoverable environments with `conda info --envs`.

2021-05-28 08:26:20,711 ERROR proxier.py:176 -- SpecificServer startup failed for client: client1
2021-05-28 08:26:20,711 INFO proxier.py:185 -- SpecificServer started on port: 46000 with PID: 76728 for client: client1
2021-05-28 08:26:22,716 ERROR proxier.py:194 -- Unable to find channel for client: client1

 ray/tests/test_client_proxy.py::test_proxy_manager_bad_startup ✓                                                            15% █▋        
Stopped only 14 out of 15 Ray processes. Set `-v` to see more details.
Try running the command again, or use `--force`.
 ray/tests/test_client_proxy.py::test_multiple_clients_use_different_drivers[ray start --head --ray-client-server-port 25001 --port 0] ✓23% ██▍       
Stopped only 12 out of 14 Ray processes. Set `-v` to see more details.
Try running the command again, or use `--force`.
 ray/tests/test_client_proxy.py::test_correct_num_clients[ray start --head --ray-client-server-port 25005 --port 0] ✓        31% ███▏      
 ray/tests/test_client_proxy.py::test_prepare_runtime_init_req_fails ✓                                                       38% ███▉      
 ray/tests/test_client_proxy.py::test_prepare_runtime_init_req_no_modification ✓                                             46% ████▋     
 ray/tests/test_client_proxy.py::test_prepare_runtime_init_req_modified_job ✓                                                54% █████▍    
 ray/tests/test_client_proxy.py::test_match_running_client_server[test_case0] ✓                                              62% ██████▎   
 ray/tests/test_client_proxy.py::test_match_running_client_server[test_case1] ✓                                              69% ██████▉   
 ray/tests/test_client_proxy.py::test_match_running_client_server[test_case2] ✓                                              77% ███████▊  
 ray/tests/test_client_proxy.py::test_match_running_client_server[test_case3] ✓                                              85% ████████▌ 
 ray/tests/test_client_proxy.py::test_match_running_client_server[test_case4] ✓                                              92% █████████▎
 ray/tests/test_client_proxy.py::test_match_running_client_server[test_case5] ✓                                             100% ██████████##vso[results.publish type=JUnit;runTitle='Pytest results';]/Users/ianrodney/Documents/ray/test-output.xml
Skipping uploading of coverage data.

============================================================ warnings summary =============================================================
/Users/ianrodney/miniconda3/lib/python3.6/site-packages/_pytest/junitxml.py:417
  /Users/ianrodney/miniconda3/lib/python3.6/site-packages/_pytest/junitxml.py:417: PytestDeprecationWarning: The 'junit_family' default value will change to 'xunit2' in pytest 6.0.
  Add 'junit_family=xunit1' to your pytest.ini file to keep the current format in future versions of pytest and silence this warning.
    _issue_warning_captured(deprecated.JUNIT_XML_DEFAULT_FAMILY, config.hook, 2)

-- Docs: https://docs.pytest.org/en/latest/warnings.html
----------------------------------- generated xml file: /Users/ianrodney/Documents/ray/test-output.xml ------------------------------------

Results (53.39s):
      13 passed
(base) ianrodney@Ians-Macbook-Pro ray % python -m pytest -sv '/Users/ianrodney/Documents/ray/python/ray/tests/test_namespace.py'
##vso[task.logissue type=warning;]The 'junit_family' default value will change to 'xunit2' in pytest 6.0.
Add 'junit_family=xunit1' to your pytest.ini file to keep the current format in future versions of pytest and silence this warning.
Test session starts (platform: darwin, Python 3.6.10, pytest 5.4.3, pytest-sugar 0.9.4)
cachedir: .pytest_cache
rootdir: /Users/ianrodney/Documents/ray/python
plugins: azurepipelines-0.8.0, sugar-0.9.4, asyncio-0.15.1, rerunfailures-9.1.1, xdist-2.1.0, aiohttp-0.3.0, profiling-1.7.0, timeout-1.4.2, forked-1.3.0
collecting ... 2021-05-28 08:34:04,145  INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265

 ray/tests/test_namespace.py::test_isolation ✓                                                                               17% █▋        2021-05-28 08:34:14,575 INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265

 ray/tests/test_namespace.py::test_default_namespace ✓                                                                       33% ███▍      2021-05-28 08:34:24,441 INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265

 ray/tests/test_namespace.py::test_namespace_in_job_config ✓                                                                 50% █████     2021-05-28 08:34:30,532 INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265
2021-05-28 08:34:33,413 WARNING worker.py:1114 -- It looks like you're creating a detached actor in an anonymous namespace. In order to access this actor in the future, you will need to explicitly connect to this namespace with ray.init(namespace="35faa271-9e74-42a1-9bb9-a2fde48ce18e", ...)

 ray/tests/test_namespace.py::test_detached_warning ✓                                                                        67% ██████▋   2021-05-28 08:34:36,243 INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265
Done!!!


 ray/tests/test_namespace.py::test_namespace_client ✓                                                                        83% ████████▍ 2021-05-28 08:34:46,993 INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265

 ray/tests/test_namespace.py::test_runtime_context ✓                                                                        100% ██████████##vso[results.publish type=JUnit;runTitle='Pytest results';]/Users/ianrodney/Documents/ray/test-output.xml
Skipping uploading of coverage data.

============================================================ warnings summary =============================================================
/Users/ianrodney/miniconda3/lib/python3.6/site-packages/_pytest/junitxml.py:417
  /Users/ianrodney/miniconda3/lib/python3.6/site-packages/_pytest/junitxml.py:417: PytestDeprecationWarning: The 'junit_family' default value will change to 'xunit2' in pytest 6.0.
  Add 'junit_family=xunit1' to your pytest.ini file to keep the current format in future versions of pytest and silence this warning.
    _issue_warning_captured(deprecated.JUNIT_XML_DEFAULT_FAMILY, config.hook, 2)

-- Docs: https://docs.pytest.org/en/latest/warnings.html
----------------------------------- generated xml file: /Users/ianrodney/Documents/ray/test-output.xml ------------------------------------

Results (48.40s):
       6 passed

@ijrsvt ijrsvt added the release-blocker P0 Issue that blocks the release label May 28, 2021
Contributor Author

ijrsvt commented May 28, 2021

(base) ianrodney@Ians-Macbook-Pro ray % python -m pytest -sv python/ray/tests/test_multi_node_2.py 
##vso[task.logissue type=warning;]The 'junit_family' default value will change to 'xunit2' in pytest 6.0.
Add 'junit_family=xunit1' to your pytest.ini file to keep the current format in future versions of pytest and silence this warning.
Test session starts (platform: darwin, Python 3.6.10, pytest 5.4.3, pytest-sugar 0.9.4)
cachedir: .pytest_cache
rootdir: /Users/ianrodney/Documents/ray/python
plugins: azurepipelines-0.8.0, sugar-0.9.4, asyncio-0.15.1, rerunfailures-9.1.1, xdist-2.1.0, aiohttp-0.3.0, profiling-1.7.0, timeout-1.4.2, forked-1.3.0
collecting ... 
 ray/tests/test_multi_node_2.py::test_cluster ✓                                                                              14% █▌        
 ray/tests/test_multi_node_2.py::test_shutdown ✓                                                                             29% ██▉       2021-05-28 09:20:20,328 INFO worker.py:727 -- Connecting to existing Ray cluster at address: 192.168.1.17:6379
2021-05-28 09:20:21,743 WARNING worker.py:1114 -- The agent on node Ians-Macbook-Pro.local failed with the following error:
Traceback (most recent call last):
  File "/Users/ianrodney/Documents/ray/python/ray/new_dashboard/agent.py", line 326, in <module>
    loop.run_until_complete(agent.run())
  File "/Users/ianrodney/miniconda3/lib/python3.6/asyncio/base_events.py", line 488, in run_until_complete
    return future.result()
  File "/Users/ianrodney/Documents/ray/python/ray/new_dashboard/agent.py", line 187, in run
    agent_ip_address=self.ip))
  File "/Users/ianrodney/miniconda3/lib/python3.6/site-packages/grpc/aio/_call.py", line 286, in __await__
    self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1622218821.738018000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1622218821.738016000","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
>

(raylet) Traceback (most recent call last):
(raylet)   File "/Users/ianrodney/Documents/ray/python/ray/new_dashboard/agent.py", line 338, in <module>
(raylet)     raise e
(raylet)   File "/Users/ianrodney/Documents/ray/python/ray/new_dashboard/agent.py", line 326, in <module>
(raylet)     loop.run_until_complete(agent.run())
(raylet)   File "/Users/ianrodney/miniconda3/lib/python3.6/asyncio/base_events.py", line 488, in run_until_complete
(raylet)     return future.result()
(raylet)   File "/Users/ianrodney/Documents/ray/python/ray/new_dashboard/agent.py", line 187, in run
(raylet)     agent_ip_address=self.ip))
(raylet)   File "/Users/ianrodney/miniconda3/lib/python3.6/site-packages/grpc/aio/_call.py", line 286, in __await__
(raylet)     self._cython_call._status)
(raylet) grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
(raylet)        status = StatusCode.UNAVAILABLE
(raylet)        details = "failed to connect to all addresses"
(raylet)        debug_error_string = "{"created":"@1622218821.738018000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1622218821.738016000","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
(raylet) >
2021-05-28 09:20:22,994 WARNING worker.py:1114 -- The node with node id: 845af30575bf4c19c3b86fdb87d776edd8955393796ba019f5325b23 and ip: 192.168.1.17 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.

 ray/tests/test_multi_node_2.py::test_system_config[ray_start_cluster_head0] ✓                                               43% ████▍     2021-05-28 09:20:26,692 INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265
2021-05-28 09:20:27,717 INFO worker.py:727 -- Connecting to existing Ray cluster at address: 192.168.1.17:6379
2021-05-28 09:20:27,849 INFO monitor.py:129 -- Monitor: Started

 ray/tests/test_multi_node_2.py::test_heartbeats_single[ray_start_cluster_head0] ✓                                           57% █████▊    2021-05-28 09:20:39,823 INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265
2021-05-28 09:20:40,848 INFO worker.py:727 -- Connecting to existing Ray cluster at address: 192.168.1.17:6379
2021-05-28 09:20:40,987 INFO monitor.py:129 -- Monitor: Started

 ray/tests/test_multi_node_2.py::test_heartbeats_single[ray_start_cluster_head1] ✓                                           71% ███████▎  2021-05-28 09:20:53,032 INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265
2021-05-28 09:20:54,060 INFO worker.py:727 -- Connecting to existing Ray cluster at address: 192.168.1.17:6379

 ray/tests/test_multi_node_2.py::test_wait_for_nodes ✓                                                                       86% ████████▋ 
Stopped all 14 Ray processes.
 ray/tests/test_multi_node_2.py::test_ray_client[ray start --head --ray-client-server-port 20000 --min-worker-port=0 --max-worker-port=0 --port 0] ✓100% ██████████##vso[results.publish type=JUnit;runTitle='Pytest results';]/Users/ianrodney/Documents/ray/test-output.xml
Skipping uploading of coverage data.

============================================================ warnings summary =============================================================
/Users/ianrodney/miniconda3/lib/python3.6/site-packages/_pytest/junitxml.py:417
  /Users/ianrodney/miniconda3/lib/python3.6/site-packages/_pytest/junitxml.py:417: PytestDeprecationWarning: The 'junit_family' default value will change to 'xunit2' in pytest 6.0.
  Add 'junit_family=xunit1' to your pytest.ini file to keep the current format in future versions of pytest and silence this warning.
    _issue_warning_captured(deprecated.JUNIT_XML_DEFAULT_FAMILY, config.hook, 2)

-- Docs: https://docs.pytest.org/en/latest/warnings.html
----------------------------------- generated xml file: /Users/ianrodney/Documents/ray/test-output.xml ------------------------------------

Results (58.64s):
       7 passed

@simon-mo simon-mo merged commit 5ca1b29 into ray-project:master May 28, 2021
DmitriGekhtman pushed a commit that referenced this pull request May 28, 2021
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. release-blocker P0 Issue that blocks the release
Development

Successfully merging this pull request may close these issues.

[client] Ray client 'dataclient cannot send request due to a data channel shutting down'
4 participants