# Ray Client2 demo

This is the experimental Ray Client2. It is HTTP-based: a client talks to the Ray cluster (the Ray dashboard head) only over HTTP, and has no local raylet, core worker, or gRPC client connected.

This new client aims to solve these pain points of the existing Ray Client:

- It's huge: `ray/util/client` has more than 20,000 lines of code, with more lines scattered elsewhere
- It's intrusive: all supported Ray APIs have a `@client_mode_hook` that shows up in stack traces
- (By default) it breaks when the connection breaks, bringing down all running tasks
- Version skew: things silently break when you use a library with different versions on the client and the server.

while still keeping these merits of the old Client:

- Familiar API: Same `@ray.remote` as in job code


## Create a client

Just create a `Client`. It creates a Ray Job and runs an Actor named `ClientSupervisor` that manages objects, tasks, and actors for you.

Parameters:
- `server_addr`: dashboard address.
- `runtime_env`: string or dict. Default = None.
- `ttl_secs`: int. Default = 1 hour. If the client connection stays broken for this long, the server kills the job.

In [1]:
import ray
from ray.experimental.client2.client import Client

c = Client("http://localhost:8265")

2024-01-04 23:25:48,164	INFO client.py:459 -- client2 actor rayclient2_RMje3D5sACzuWHbH not connected, maybe the ClientSupervisor is still starting...
2024-01-04 23:25:49,197	INFO client.py:259 -- client2 actor rayclient2_RMje3D5sACzuWHbH connected!


## The active client

On connection, one thread is created to keep pinging the actor to keep it alive, and another thread is created to poll the logs and forward them to `stdout`/`stderr`.
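
The keep-alive thread can be pictured with a minimal sketch like the one below. It is not the actual client code: `start_pinger` and `fake_ping` are hypothetical names, and the ping here only records a call where a real one would presumably send an HTTP request to the dashboard.

```python
import threading


def start_pinger(ping, interval_secs: float, stop: threading.Event) -> threading.Thread:
    """Call `ping()` every `interval_secs` until `stop` is set."""

    def loop():
        # Event.wait returns True once `stop` is set, which ends the loop.
        while not stop.wait(interval_secs):
            ping()

    thread = threading.Thread(target=loop, daemon=True, name="sketch-pinger")
    thread.start()
    return thread


stop = threading.Event()
pings = []


def fake_ping():
    # Hypothetical ping; it only counts calls instead of doing network I/O.
    pings.append(1)
    if len(pings) == 3:
        stop.set()  # simulate a disconnect after three pings


pinger = start_pinger(fake_ping, interval_secs=0.01, stop=stop)
pinger.join(timeout=5)
```

Using an `Event` both as the sleep timer and as the stop signal lets a disconnect end the thread promptly instead of waiting out a full interval.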

Only one client can be active at a time; creating a second one fails. You can disconnect the current client and then create a new one, but all existing object refs and actor handles are invalidated.

The active client can be retrieved via `Client.active_client`.
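
One way to enforce the single-active-client rule is a class-level slot, roughly as sketched here. `SketchClient` is a hypothetical stand-in that models only this rule; it does not connect to anything, and the real `Client` may be implemented differently.

```python
from typing import Optional


class SketchClient:
    """Hypothetical stand-in for `Client`; models only the one-active-client rule."""

    # Class-level slot shared by all instances: at most one client is active.
    active_client: Optional["SketchClient"] = None

    def __init__(self, server_addr: str, name: str):
        current = SketchClient.active_client
        if current is not None:
            raise ValueError(
                f"Already have active client {current.name}, "
                "consider `Client.active_client.disconnect()`."
            )
        self.server_addr = server_addr
        self.name = name
        SketchClient.active_client = self

    def disconnect(self) -> None:
        # Only clear the slot if this instance is the one holding it.
        if SketchClient.active_client is self:
            SketchClient.active_client = None


first = SketchClient("http://localhost:8265", "client_a")
try:
    SketchClient("http://localhost:8265", "client_b")
    second_error = None
except ValueError as e:
    second_error = str(e)

first.disconnect()
replacement = SketchClient("http://localhost:8265", "client_c")
```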

In [2]:
# raises
another = Client("http://localhost:8265")

ValueError: Already have active client rayclient2_RMje3D5sACzuWHbH, consider `Client.active_client.disconnect()`.

In [3]:
c.disconnect()
c = Client("http://localhost:8265")

2024-01-04 23:25:54,338	INFO client.py:459 -- client2 actor rayclient2_zaciVxtWKJwMD3Bq not connected, maybe the ClientSupervisor is still starting...
2024-01-04 23:25:55,358	INFO client.py:259 -- client2 actor rayclient2_zaciVxtWKJwMD3Bq connected!


## `get` and `put`

These APIs are identical to the existing Ray APIs. Under the hood, each call is serialized and passed to the Dashboard head agent. The agent passes the serialized bytes to the actor `ClientSupervisor` verbatim; the actor deserializes them and acts accordingly.

`ObjectRef` and `ActorHandle` are treated specially; `ClientSupervisor` keeps a reference to each of them to avoid them being GC'd.
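
The round trip can be illustrated with a small sketch, assuming plain `pickle` as the serializer and a dict as the reference holder. `SketchSupervisor` and `relay` are hypothetical names; the wire format and serializer that client2 actually uses are not shown in this notebook.

```python
import pickle
import uuid


class SketchSupervisor:
    """Hypothetical stand-in for `ClientSupervisor`; stores values in a dict."""

    def __init__(self):
        # Holding every stored value here mimics "keeping a reference", so it
        # is not garbage collected while the client still holds the id.
        self.pinned = {}

    def handle(self, payload: bytes) -> bytes:
        # Only the supervisor deserializes the request.
        op, arg = pickle.loads(payload)
        if op == "put":
            ref = uuid.uuid4().hex
            self.pinned[ref] = arg
            return pickle.dumps(ref)
        if op == "get":
            return pickle.dumps(self.pinned[arg])
        raise ValueError(f"unknown op {op!r}")


def relay(supervisor: SketchSupervisor, payload: bytes) -> bytes:
    # The middle hop forwards bytes verbatim and never unpickles them.
    return supervisor.handle(payload)


supervisor = SketchSupervisor()
ref = pickle.loads(relay(supervisor, pickle.dumps(("put", 8))))
value = pickle.loads(relay(supervisor, pickle.dumps(("get", ref))))
```

Because the relay never deserializes, it does not need the user's classes to be importable; only the two endpoints do.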

In [4]:
eight = ray.put(8)
print(eight)
ray.get(eight)

ObjectRef(003155e5bdce4dc4d4c3af42d506433c3e97350f0400000001e1f505)


8

In [5]:
# Classes are OK, embedded ObjectRef are OK
class Data:
    def __init__(self, o, i):
        self.o = o
        self.i = i

d = Data(eight, 8)
dr = ray.put(d)
dr

ObjectRef(003155e5bdce4dc4d4c3af42d506433c3e97350f0400000002e1f505)

In [6]:
d2 = ray.get(dr)
print(d2.o)
print(d2.i)

ObjectRef(003155e5bdce4dc4d4c3af42d506433c3e97350f0400000001e1f505)
8


## Tasks and Actors

We still use the good old `ray.remote` APIs. The whole function or class body is serialized and passed via the agent to the actor `ClientSupervisor`. The actor performs the function invocation, class creation, or method invocation, and returns the object ref(s).

All print logs from any tasks or methods are forwarded to the client.

In [7]:
@ray.remote
def fib(i):
    # FIXME: for some reason, not all logs are reported back; if a task
    # finishes too quickly, its logs vanish. Do Ray jobs also have this?
    print(f'fib({i})')
    if i < 2:
        return 1
    ret = sum(ray.get([fib.remote(i-1), fib.remote(i-2)]))
    return ret

print(ray.get(fib.remote(5)))

(fib pid=95297, ip=127.0.0.1) fib(5)
8
(fib pid=95524, ip=127.0.0.1) fib(0)


In [8]:
@ray.remote
class Counter:
    def __init__(self, i):
        print(f"initing actor with {i}")
        self.i = i
    def increment(self, j):
        i = self.i
        self.i += j
        print(f"incrementing {i} + {j} == {self.i}")
        return i
        
a = Counter.remote(2)
two = a.increment.remote(5)
print(ray.get(two))
seven = a.increment.remote(10)
print(ray.get(seven))

2
7
(Counter pid=95546, ip=127.0.0.1) initing actor with 2
(Counter pid=95546, ip=127.0.0.1) incrementing 2 + 5 == 7
(Counter pid=95546, ip=127.0.0.1) incrementing 7 + 10 == 17


## Other Ray APIs

We want to explicitly mark every public Ray (core) API as supported or unsupported. For now we have the following lists.

In [None]:
# Redirected to the HTTP APIs
SUPPORTED_APIS = [
    "ray_get",
    "ray_put",
    "ray_remotefunction_remote",
    "ray_actorclass_remote",
    "ray_actormethod_remote",
]

# Raises exception when calling, with a help message
UNSUPPORTED_APIS = [
    ("_config", "a wrapper ray task to run in remote"),
    ("get_runtime_context", "a wrapper ray task to run in remote"),
    ("client", "this new and shiny client2!"),
    ("ClientBuilder", "this new and shiny client2!"),
    ("get_gpu_ids", "a wrapper ray task to run in remote"),
    ("init", "client = Client(addr)"),
    ("shutdown", "Client.active_client.disconnect()"),
]


# These APIs are useful in client2. They don't need server-side changes;
# on the client side, make a simple task to wrap them.
TODO_APIS = [
    "available_resources",
    "cancel",
    "cluster_resources",
    "get_actor",
    "kill",
    "nodes",
    # For this one; if filename is set, we need to save the file to the client
    # local disk. This involves some code more than a wrapper.
    "timeline",
    # For wait we need server-side support to avoid uselessly passing large data.
    "wait",
]

# No change, because they don't interact with the Ray cluster.
VERBATIM_APIS = [
    "__version__",
    "autoscaler",  # What do we do with the packages?
    "is_initialized",
    "java_actor_class",  # TODO: understand if this can run
    "java_function",  # TODO: understand if this can run
    "cpp_function",  # TODO: understand if this can run
    "Language",
    "method",
    "remote",  # Note: the decorator is not changed, the methods are changed.
    "show_in_dashboard",  # This one is actually not in `ray` package...?
    "LOCAL_MODE",
    "SCRIPT_MODE",
    "WORKER_MODE",
]


In [9]:
# Raises
ray.get_runtime_context()

ValueError: WARNING: You are using ray API: `ray.get_runtime_context` which can only be accessed within a Ray Cluster. You are in a Client context, consider using: `a wrapper ray task to run in remote`

2024-01-04 23:26:49,218	INFO client.py:608 -- Client rayclient2_RMje3D5sACzuWHbH is no longer active, stop pinging...
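
A table like `UNSUPPORTED_APIS` could be turned into raising stubs along the following lines. This is a sketch under assumptions, not the client2 implementation: `fake_ray` is a hypothetical stand-in for the `ray` module so the example runs without Ray, and `make_stub`, `install_stubs`, and `restore` are illustrative names.

```python
import types

# Hypothetical stand-in for the `ray` module, so the sketch runs without Ray.
fake_ray = types.SimpleNamespace(
    get_runtime_context=lambda: "in-cluster context",
    get_gpu_ids=lambda: [0],
)

UNSUPPORTED = [
    ("get_runtime_context", "a wrapper ray task to run in remote"),
    ("get_gpu_ids", "a wrapper ray task to run in remote"),
]


def make_stub(name: str, hint: str):
    def stub(*args, **kwargs):
        raise ValueError(
            f"You are using ray API: `ray.{name}` which can only be accessed "
            f"within a Ray Cluster. You are in a Client context, "
            f"consider using: `{hint}`"
        )
    return stub


def install_stubs(module, table):
    """Replace each listed attribute with a stub; return the originals."""
    originals = {}
    for name, hint in table:
        originals[name] = getattr(module, name)
        setattr(module, name, make_stub(name, hint))
    return originals


def restore(module, originals):
    # Put the saved attributes back, e.g. when the client disconnects.
    for name, fn in originals.items():
        setattr(module, name, fn)


saved = install_stubs(fake_ray, UNSUPPORTED)
try:
    fake_ray.get_runtime_context()
    message = None
except ValueError as e:
    message = str(e)
restore(fake_ray, saved)
```

Saving the originals makes the patching reversible, so the module behaves normally again once no client is active.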


## Limitations

Client2 does not work well with Ray Data, and presumably other Ray libraries, because they do not merely call the public Ray core APIs.

For example, even "printing" a Dataset (which shows a Jupyter widget) calls these APIs:
    
- `ray.cluster_resources()`
- `ray.available_resources()`
- `ray.util.get_current_placement_group()`
- `_get_or_create_stats_actor`, which inspects the current node ID and starts an actor with affinity to this node.

And we can't mock out all these APIs, since after all we don't want to pretend the code is in a proper in-cluster environment.

For now, we can ask users to put all code in a `@ray.remote` task and pass the datasets by object ref. But this can get clumsy, since even a `ds.schema()` call needs to be written as

```
@ray.remote
def call_schema(ds):
    return ds.schema()
ray.get(call_schema.remote(ds_ref))
```

So I provided an experimental helper method: `Client.run` (the name is subject to change). It wraps an ObjectRef and remotely invokes the method. So one can write

```
ray.get(c.run(ds_ref).schema())
```

Still, this is far from fluent, and we may want a better solution. But the old client does not support Data either, so maybe this is as far as we go.
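
The idea behind such a helper can be sketched as an attribute-forwarding proxy. Everything here is an assumption for illustration: a thread pool plays the cluster, a `Future` plays the `ObjectRef`, and `RemoteMethodProxy`, `run`, and `FakeDataset` are hypothetical names rather than the actual `Client.run` implementation.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical stand-in for the cluster: a thread pool runs "remote" calls.
_pool = ThreadPoolExecutor(max_workers=2)


class RemoteMethodProxy:
    """Wraps a future of an object; attribute access yields remote method calls."""

    def __init__(self, target: Future):
        self._target = target

    def __getattr__(self, method_name: str):
        # Called only for names not found normally, i.e. the wrapped methods.
        def call(*args, **kwargs) -> Future:
            def task():
                # Resolve the wrapped object on the "remote" side only.
                obj = self._target.result()
                return getattr(obj, method_name)(*args, **kwargs)
            return _pool.submit(task)
        return call


def run(target: Future) -> RemoteMethodProxy:
    return RemoteMethodProxy(target)


class FakeDataset:
    def schema(self):
        return {"image": "numpy.ndarray(ndim=3, dtype=uint8)"}


ds_ref = _pool.submit(FakeDataset)      # plays the role of an ObjectRef
schema = run(ds_ref).schema().result()  # plays the role of ray.get(...)
```

The wrapped object never reaches the caller; only the method's return value travels back, which is the property that matters for objects such as datasets.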

In [10]:
@ray.remote
def read_images():
    s3_uri = "s3://anonymous@air-example-data-2/imagenette2/val/"

    return ray.data.read_images(
        s3_uri, mode="RGB"
    )

ds = read_images.remote()

# write more @ray.remote tasks and run them with `ds`.

ray.get(c.run(ds).schema())

(ClientSupervisor pid=95363, ip=127.0.0.1) INFO:numexpr.utils:Note: NumExpr detected 10 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.


Column  Type
------  ----
image   numpy.ndarray(ndim=3, dtype=uint8)

## TODOs

- implement more APIs, e.g. `ray.available_resources()` (See TODO_APIS above)
- on version skew, or at least on Ray version skew, check `__version__` and issue warnings (or exceptions)
- Support large puts and args (the limit is currently 100MB in the dashboard default config)
- On large args, print helper message `you are in client! consider put and use object ref`
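
The large-args hint from the last item might look like the sketch below. It assumes `pickle` as the serializer and takes the 100MB figure above as the limit; `check_payload_size` is a hypothetical helper, not an existing API.

```python
import pickle
import warnings

# Assumed limit for this sketch, taken from the 100MB figure above.
MAX_PAYLOAD_BYTES = 100 * 1024 * 1024


def check_payload_size(value, limit: int = MAX_PAYLOAD_BYTES) -> int:
    """Serialize `value`, warn if it exceeds `limit`, and return its size."""
    size = len(pickle.dumps(value))
    if size > limit:
        warnings.warn(
            f"Argument is {size} bytes (limit {limit}). You are in client! "
            "Consider put and use object ref.",
            stacklevel=2,
        )
    return size


# A tiny limit is used here so the demo stays cheap to run.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    small = check_payload_size(b"x" * 10, limit=1024)
    large = check_payload_size(b"x" * 4096, limit=1024)
```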
