Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Serve] Fix ServeHandle serialization #13695

Merged
merged 9 commits into from
Jan 27, 2021

Conversation

architkulkarni
Copy link
Contributor

Why are these changes needed?

Adds __reduce__ methods to ServeHandle and ThreadProxiedRouter to fix serialization of ServeHandles. Upon deserialization, the constructor is simply called again, and a new router is started.

Related issue number

Closes #13180

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@architkulkarni architkulkarni changed the title Fix serialize handle [Serve] Fix ServeHandle serialization Jan 25, 2021
@architkulkarni architkulkarni added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 26, 2021
@architkulkarni
Copy link
Contributor Author

Tests pass on my laptop and on Mac CI, but on Linux CI test_user_config_update and test_task_runner_perform_batch fail. Investigating...

@@ -226,7 +226,7 @@ def batcher(requests):
query_param = make_request_param()
my_batch_sizes = await asyncio.gather(*[(
await router.assign_request.remote(query_param)) for _ in range(3)])
assert my_batch_sizes == [2, 2, 1]
assert sorted(my_batch_sizes) == [1, 2, 2]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make sure to revert this and merge master again

@architkulkarni architkulkarni added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Jan 27, 2021
@edoakes edoakes merged commit 202fbdf into ray-project:master Jan 27, 2021
@simon-mo
Copy link
Contributor

Hey @edoakes @architkulkarni this failed on Windows
image

==================== Test output for //python/ray/serve:test_handle:
============================= test session starts =============================
platform win32 -- Python 3.7.9, pytest-5.4.3, py-1.10.0, pluggy-0.13.1 -- C:\hostedtoolcache\windows\Python\3.7.9\x64\python.exe
cachedir: .pytest_cache
rootdir: C:\Users\runneradmin\AppData\Local\Temp\Bazel.runfiles_kcsiur7i\runfiles\com_github_ray_project_ray
plugins: asyncio-0.14.0, rerunfailures-9.1.1, sugar-0.9.4, timeout-1.4.2
collecting ... collected 6 items

::test_async_handle_serializable 2021-01-27 20:56:30,837	INFO services.py:1182 -- View the Ray dashboard at http://127.0.0.1:8265
PASSED
::test_sync_handle_serializable PASSED
::test_handle_in_endpoint PASSED
::test_handle_http_args PASSED
::test_handle_inject_starlette_request PASSED
::test_handle_option_chaining PASSED

============================= 6 passed in 58.48s ==============================
(pid=6088) 2021-01-27 20:56:33,762	INFO http_state.py:70 -- Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-node:10.1.0.4-0' on node 'node:10.1.0.4-0' listening on '127.0.0.1:8000'

(pid=1660) INFO:     Started server process [1660]

(pid=6088) 2021-01-27 20:56:37,897	INFO controller.py:180 -- Registering route 'None' to endpoint 'f' with methods '['GET']'.

(pid=6088) 2021-01-27 20:56:39,072	INFO controller.py:190 -- Deleting endpoint 'f'

(pid=6952) 2021-01-27 20:56:39,056	INFO router.py:249 -- Endpoint f doesn't exist, waiting for registration.

(pid=6952) Windows fatal exception: access violation
(pid=6952) 
(pid=8036) Windows fatal exception: access violation
(pid=8036) 
(pid=6088) 2021-01-27 20:56:44,941	INFO controller.py:180 -- Registering route 'None' to endpoint 'f' with methods '['GET']'.

(pid=6088) 2021-01-27 20:56:46,127	INFO controller.py:190 -- Deleting endpoint 'f'

(pid=7680) 2021-01-27 20:56:46,111	INFO router.py:249 -- Endpoint f doesn't exist, waiting for registration.

(pid=1428) Windows fatal exception: access violation
(pid=1428) 
(pid=6088) 2021-01-27 20:56:50,969	INFO controller.py:180 -- Registering route '/endpoint1' to endpoint 'endpoint1' with methods '['GET', 'POST']'.

(pid=8144) 2021-01-27 20:56:54,140	WARNING api.py:512 -- You are retrieving a sync handle inside an asyncio loop. Try getting client.get_handle(.., sync=False) to get better performance. Learn more at https://docs.ray.io/en/master/serve/advanced.html#sync-and-async-handles

(pid=6088) 2021-01-27 20:56:56,013	INFO controller.py:180 -- Registering route '/endpoint2' to endpoint 'endpoint2' with methods '['GET', 'POST']'.

(pid=6088) 2021-01-27 20:56:58,065	INFO controller.py:190 -- Deleting endpoint 'endpoint1'

(pid=6088) 2021-01-27 20:56:58,068	INFO controller.py:190 -- Deleting endpoint 'endpoint2'

(pid=1660) 2021-01-27 20:56:58,047	INFO router.py:249 -- Endpoint endpoint2 doesn't exist, waiting for registration.

(pid=7680) Windows fatal exception: access violation
(pid=7680) 
(pid=8144) Windows fatal exception: access violation
(pid=8144) 
(pid=6088) 2021-01-27 20:57:05,068	INFO controller.py:180 -- Registering route '/endpoint' to endpoint 'endpoint' with methods '['POST']'.

(pid=6088) 2021-01-27 20:57:07,115	INFO controller.py:190 -- Deleting endpoint 'endpoint'

(pid=1172) Windows fatal exception: access violation
(pid=1172) 
(pid=6088) 2021-01-27 20:57:12,111	INFO controller.py:180 -- Registering route '/echo' to endpoint 'echo' with methods '['GET']'.

(pid=6088) 2021-01-27 20:57:17,162	INFO controller.py:180 -- Registering route '/wrapper' to endpoint 'wrapper' with methods '['GET']'.

(pid=6088) 2021-01-27 20:57:19,241	INFO controller.py:190 -- Deleting endpoint 'echo'

(pid=6088) 2021-01-27 20:57:19,244	INFO controller.py:190 -- Deleting endpoint 'wrapper'

(pid=5628) 2021-01-27 20:57:19,224	WARNING api.py:512 -- You are retrieving a sync handle inside an asyncio loop. Try getting client.get_handle(.., sync=False) to get better performance. Learn more at https://docs.ray.io/en/master/serve/advanced.html#sync-and-async-handles

(pid=5628) 2021-01-27 20:57:19,225	INFO router.py:249 -- Endpoint echo doesn't exist, waiting for registration.

(pid=8052) Windows fatal exception: access violation
(pid=8052) 
(pid=5628) Windows fatal exception: access violation
(pid=5628) 
(pid=6088) 2021-01-27 20:57:26,229	INFO controller.py:180 -- Registering route 'None' to endpoint 'm' with methods '['GET']'.

(pid=6088) 2021-01-27 20:57:26,248	INFO controller.py:190 -- Deleting endpoint 'm'

(pid=4908) Windows fatal exception: access violation
(pid=4908) 
================================================================================
(20:57:32) [419 / 497] 13 / 81 tests, 1 failed; Testing //python/ray/serve:test_kv_store; 3s local
(20:59:08) [423 / 501] 17 / 81 tests, 1 failed; Testing //python/ray/serve:test_router; 21s local
(20:59:23) FAIL: //python/ray/serve:test_router (see C:/users/runneradmin/_bazel_runneradmin/vlncal46/execroot/com_github_ray_project_ray/bazel-out/x64_windows-opt/testlogs/python/ray/serve/test_router/test_attempts/attempt_1.log)

https://github.com/ray-project/ray/runs/1779615703

So I'm reverting it now. Can you open a re-revert either fixing (seems unlikely, access violation could be port conflict but this seems hard to debug) or skip?

@architkulkarni
Copy link
Contributor Author

architkulkarni commented Jan 28, 2021

So weird, thanks for catching this, and my fault for not checking the Windows CI. It looks like this PR definitely caused this issue.

I'm a little confused though, why does the run say that all 6/6 test_handle.py tests pass? And why does it say "FAIL: //python/ray/serve:test_router" but there's no such line for test_handle? Googling "Windows fatal exception: access violation" turns up a handful of pytest-related results, but also several that are not pytest-related.

The only idea I have for how to try to fix this is to use __getstate__ and __setstate__ as suggested by the Python docs instead of the lower-level __reduce__. I don't know why this would work but I can give it a try.

simon-mo added a commit that referenced this pull request Jan 28, 2021
@architkulkarni
Copy link
Contributor Author

In the log https://github.com/ray-project/ray/runs/1779615703 I see a line above the handle tests saying "TIMEOUT: //python/ray/serve:test_handle", but I can't figure out which test is timing out since all 6/6 tests pass in each of the three runs. The "Windows fatal exception" seems to be printed 1-2 times after each of the six tests, based on the endpoint names that are being printed.

architkulkarni added a commit that referenced this pull request Jan 28, 2021
fishbone pushed a commit to fishbone/ray that referenced this pull request Feb 16, 2021
fishbone pushed a commit to fishbone/ray that referenced this pull request Feb 16, 2021
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Serve] ServeHandle no longer serializable
3 participants