[Serve] Issue starting Serve with Ray Client on K8s #14056

@simon-mo

I tried to use Ray Client together with the K8s operator support. I don't think the bug is related to the operator itself; it's more likely an issue in the interaction between Serve, Ray Client, and Docker.

Using the py36 image, calling serve.start() after ray.util.connect(...) prints:

(pid=329) 2021-02-10 21:25:21,900       INFO http_state.py:70 -- Starting HTTP proxy with name 'XgvdVP:SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-node:192.168.57.184-0' on node 'node:192.168.57.184-0' listening on '127.0.0.1:8000'
(pid=332) INFO:     Started server process [332]
Got Error from data channel -- shutting down: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception iterating responses: Cloudpickle Error: Unknown type <class 'ray.serve.config.HTTPOptions'>"
        debug_error_string = "{"created":"@1613021122.279057000","description":"Error received from peer ipv4:127.0.0.1:10001","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Exception iterating responses: Cloudpickle Error: Unknown type <class 'ray.serve.config.HTTPOptions'>","grpc_status":2}"
>
Traceback (most recent call last):
  File "serve_app.py", line 9, in <module>
    client = serve.start()
  File "/Users/simonmo/Desktop/ray/ray/python/ray/serve/api.py", line 639, in start
    return Client(controller, controller_name, detached=detached)
  File "/Users/simonmo/Desktop/ray/ray/python/ray/serve/api.py", line 112, in __init__
    self._http_config = ray.get(controller.get_http_config.remote())
  File "/Users/simonmo/Desktop/ray/ray/python/ray/_private/client_mode_hook.py", line 46, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/api.py", line 35, in get
    return self.worker.get(vals, timeout=timeout)
  File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/worker.py", line 164, in get
Exception in thread Thread-6:
Traceback (most recent call last):
  File "/Users/simonmo/miniconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/Users/simonmo/miniconda3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/dataclient.py", line 87, in _data_main
    raise e
  File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/dataclient.py", line 62, in _data_main
    for response in resp_stream:
  File "/Users/simonmo/miniconda3/lib/python3.6/site-packages/grpc/_channel.py", line 416, in __next__
    return self._next()
  File "/Users/simonmo/miniconda3/lib/python3.6/site-packages/grpc/_channel.py", line 706, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception iterating responses: Cloudpickle Error: Unknown type <class 'ray.serve.config.HTTPOptions'>"
        debug_error_string = "{"created":"@1613021122.279057000","description":"Error received from peer ipv4:127.0.0.1:10001","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Exception iterating responses: Cloudpickle Error: Unknown type <class 'ray.serve.config.HTTPOptions'>","grpc_status":2}"
>

    out = [self._get(x, timeout) for x in to_get]
  File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/worker.py", line 164, in <listcomp>
    out = [self._get(x, timeout) for x in to_get]
  File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/worker.py", line 172, in _get
    data = self.data_client.GetObject(req)
  File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/dataclient.py", line 121, in GetObject
    resp = self._blocking_send(datareq)
  File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/dataclient.py", line 106, in _blocking_send
    f"cannot send request {req}: data channel shutting down")
ConnectionError: cannot send request req_id: 5
get {
  id: "W\375\307a\355\361\332{p\025\200o\036{D\316\315\255M6\002\000\000\000\001\000\000\000"
}
: data channel shutting down
Exception ignored in: <bound method Client.__del__ of <ray.serve.api.Client object at 0x7faa602812e8>>
Traceback (most recent call last):
  File "/Users/simonmo/Desktop/ray/ray/python/ray/serve/api.py", line 144, in __del__
    self.shutdown()
  File "/Users/simonmo/Desktop/ray/ray/python/ray/serve/api.py", line 157, in shutdown
    if (not self._shutdown) and ray.is_initialized():
  File "/Users/simonmo/Desktop/ray/ray/python/ray/_private/client_mode_hook.py", line 46, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/__init__.py", line 120, in __getattr__
    raise Exception("Ray Client is not connected. "
Exception: Ray Client is not connected. Please connect by calling `ray.connect`.

(I don't understand this: the HTTP proxy was evidently started, yet starting it depends on the same HTTPOptions object that the error says cannot be pickled.)

On Py37, an even wilder error appears:

(pid=279) 2021-02-10 18:33:47,086       INFO http_state.py:70 -- Starting HTTP proxy with name 'kXHKDS:SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-node:192.168.45.111-0' on node 'node:192.168.45.111-0' listening on '127.0.0.1:8000'
(pid=298) INFO:     Started server process [298]
(pid=279) 2021-02-10 18:33:47,689       INFO controller.py:190 -- Deleting endpoint 'endpoint'
(raylet) Fatal Python error: initfsencoding: Unable to get the locale encoding
(raylet) ModuleNotFoundError: No module named 'encodings'
(raylet) 
(raylet) Current thread 0x00007fa860126740 (most recent call first):
The actor or task with ID ffffffffffffffff7dad32b7e02117e675ea54dc03000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {2.000000/4.000000 CPU, 0.244141 GiB/0.244141 GiB memory, 0.048828 GiB/0.048828 GiB object_store_memory, 0.980000/1.000000 node:192.168.45.111}
. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

I'm following this guide: https://ray--14016.org.readthedocs.build/en/14016/cluster/kubernetes.html#using-ray-client-to-connect-from-outside-the-kubernetes-cluster, changing only the image tag to :nightly and nightly-py36.

And the script is just:

import ray
from ray import serve

ray.init(address="auto")
client = serve.start()
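
The description mentions calling serve.start() after ray.util.connect(...), while the script above calls ray.init(address="auto"). A minimal sketch of the client-side variant is given here, assuming the Ray Client server is reachable at 127.0.0.1:10001 (the peer address shown in the gRPC error, for example through a port-forward). The helper name and its default address are illustrative and not part of the original report, and the calls reflect the Ray API as it existed around the time of this issue.

```python
def start_serve_via_client(address="127.0.0.1:10001"):
    """Connect through Ray Client, then start Serve (hypothetical helper)."""
    # Imported lazily so that defining this helper does not itself require Ray.
    import ray
    from ray import serve

    # Connect to the Ray Client server instead of calling ray.init(address="auto").
    # The address is an assumption taken from the "peer ipv4:127.0.0.1:10001" log line.
    ray.util.connect(address)

    # serve.start() is the call that failed in the traceback above.
    return serve.start()
```

Calling start_serve_via_client() from a machine outside the cluster may reproduce the py36 output, provided the local Ray build and the image match the setup described here.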

Labels

P1: Issue that should be fixed within a few weeks
bug: Something that is supposed to be working; but isn't
needs-repro-script: Issue needs a runnable script to be reproduced