# Ray Actors Revisited

The [Ray Crash Course](../ray-crash-course/00-Ray-Crash-Course-Overview.ipynb) introduced the core concepts of Ray's API and how they parallelize work. Specifically, we learned how to define Ray _tasks_ and _actors_, run them, and retrieve the results. 

This lesson explores Ray actors in greater depth, including the following:

* Detached actors
* Specifying limits on the number of invocations and retries on failure
* Actor pools
* Actor patterns

In [1]:
import ray, time, sys, os 
import numpy as np 
sys.path.append("..")
from util.printing import pd  # convenience methods for printing results.

## Ray Namespaces

A namespace is a logical grouping of jobs and named actors. When an actor is named, its name must be unique within the namespace.
Named actors, which we discuss below, are only accessible within their namespaces.

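To set your application's namespace, specify it when you first connect to the cluster, for example with the `namespace` argument of `ray.init()`. If you omit it, as the cell below does, the job runs in its own anonymous namespace.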

In [2]:
from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster

setup_ray_cluster(
  num_worker_nodes=2,
  num_cpus_per_node=4,
  collect_log_to_path="/dbfs/path/to/ray_collected_logs"
)
ray.init()

2022-03-16 15:55:37,922 INFO services.py:1412 -- View the Ray dashboard at http://127.0.0.1:8265


{'node_ip_address': '127.0.0.1',
 'raylet_ip_address': '127.0.0.1',
 'redis_address': None,
 'object_store_address': '/tmp/ray/session_2022-03-16_15-55-35_553735_58108/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2022-03-16_15-55-35_553735_58108/sockets/raylet',
 'webui_url': '127.0.0.1:8265',
 'session_dir': '/tmp/ray/session_2022-03-16_15-55-35_553735_58108',
 'metrics_export_port': 56973,
 'gcs_address': '127.0.0.1:59103',
 'address': '127.0.0.1:59103',
 'node_id': '4a70b5a2ee284de10f8025aef3267f22f659b7f49bcb339fd63238d7'}

The Ray Dashboard URL is printed above and is also available in the output dictionary under the key `webui_url`.

## Named and Detached Actors
[Detached actors](https://docs.ray.io/en/latest/advanced.html#detached-actors) are designed to be long-lived actors that can be referenced by name and must be explicitly cleaned up. Unlike regular actors, they are not deleted automatically when references to them go out of scope.

Detached actors are useful for "services," where different tasks and actors in the application want to look up an actor and use it.

> **Note:** This is an evolving feature. Check the [documentation](https://docs.ray.io/en/latest/advanced.html#detached-actors) for the latest details.

Here is an example of a "normal" actor definition:

In [3]:
@ray.remote
class Counter:
    def __init__(self):
        self.label = 'Counter'
        self.count = 0
    def next(self):
        self.count += 1
        return self.count

Now create two detached instances of it.

In [4]:
counter1 = Counter.options(name="Counter1", lifetime="detached").remote()
counter2 = Counter.options(name="Counter2", lifetime="detached").remote()

Then we can look them up by name and use them "somewhere else":

In [5]:
c1 = ray.get_actor("Counter1")
print(ray.get([c1.next.remote() for _ in range(100)]))

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]


In [6]:
c2 = ray.get_actor("Counter2")
print(ray.get([c2.next.remote() for _ in range(100)]))

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]


See also the notes on detached actors and actor lifecycles in the lesson [03: Ray Internals](03-Ray-Internals.ipynb) and the [detached actors](https://docs.ray.io/en/latest/advanced.html#detached-actors) documentation.

To kill a detached actor, use `ray.kill()`:

In [7]:
for actor in [c1, c2]:
    ray.kill(actor)

### Limitations

This is a new feature with two limitations, both of which will be fixed in a forthcoming release of Ray.

While `ray.kill()` kills the actor, it does not currently remove the name from the registration table. Hence, it isn't possible to register a new instance with the same name.

If the actor was created with a configuration value of `max_restarts` not equal to zero (discussed in the next section), the actor will be restarted up to `max_restarts` times, which will be infinitely many times if the value was set to -1.

A `no_restart=True|False` keyword argument is being added to `ray.kill()` for this situation:

```python
c = ray.get_actor("Counter1")
ray.kill(c, no_restart=True)  # new optional keyword argument
```

Passing `no_restart=True` will be necessary for these actors.

## Limiting Actor Invocations and Retries on Failure

> **Note:** This feature may change in a future version of Ray. See the latest details in the [Ray documentation](https://docs.ray.io/en/latest/package-ref.html#ray.remote). 

Two options that you can pass to `ray.remote` when defining an actor control how often the actor is restarted and how often its tasks are retried on failure:

* `max_restarts`: This specifies the maximum number of times that the actor should be restarted when it dies unexpectedly. The minimum valid value is 0 (default), which indicates that the actor doesn't need to be restarted. A value of -1 indicates that an actor should be restarted indefinitely.
* `max_task_retries`: How many times to retry an actor task if the task fails due to a system error, e.g., the actor has died. If set to -1, the system will retry the failed task until the task succeeds or the actor has reached its `max_restarts` limit. If set to a value `n` greater than 0, the system will retry the failed task up to `n` times, after which the task will throw a `RayActorError` exception when `ray.get` attempts to retrieve a result. Note that Python exceptions are not considered system errors and will not trigger retries.

Example:

```python
@ray.remote(max_restarts=-1, max_task_retries=-1)
class Foo():
    pass
```

See the [ray.remote()](https://docs.ray.io/en/latest/package-ref.html#ray.remote) documentation for all the keyword arguments supported.

### Overriding with options()

Remote task and actor objects returned by `@ray.remote` can also be dynamically modified with the same arguments supported by `ray.remote()` using `options()` as in the following example:

```python
@ray.remote(num_cpus=2, resources={"CustomResource": 1})
class Foo:
    def method(self):
        return 1
Bar = Foo.options(num_cpus=1, resources=None)
```

## Actor Pools

The `ray.util` module contains a utility class, `ActorPool`. This class is similar to `multiprocessing.Pool` and lets you schedule Ray tasks over a fixed pool of actors.

In [8]:
from ray.util import ActorPool

@ray.remote
class Actor:
    
    def double(self, n):
        return n * 2

In [9]:
actor_pool_list = [Actor.remote() for _ in range(5)]
pool = ActorPool(actor_pool_list)

In [10]:
# pool.map(...) returns a Python generator that yields results in input order
gen = pool.map(lambda a, v: a.double.remote(v), [1, 2, 3, 4])
print([v for v in gen])

[2, 4, 6, 8]


## Tree of Actors Pattern

The tree of actors is a common pattern used in Ray libraries such as [Ray Tune](https://docs.ray.io/en/latest/tune/index.html), [Ray Train](https://docs.ray.io/en/latest/train/train.html), and [RLlib](https://docs.ray.io/en/latest/rllib/index.html) to train models in parallel or to conduct distributed hyperparameter optimization (HPO).

In this pattern, a collection of workers, implemented as actors, is managed by a supervisor actor. For example, you might want to train multiple models at the same time while still being able to checkpoint or inspect their state.

<img src="https://docs.ray.io/en/latest/_images/tree-of-actors.svg" width="40%" height="20%">

Let's implement a simple example to illustrate this pattern.

In [11]:
import random
STATES = ["RUNNING", "TRAINING", "DONE"]

class Model:

    def __init__(self, m:str):
        self._model = m
        self._grad = []

    def train(self):
        # do some training work here
        for _ in range(5):
            self._grad.append(random.random())
        time.sleep(1)

# Factory function to return an instance of a model type
def model_factory(m: str):
    return Model(m)

### Create a Worker Actor

In [12]:
@ray.remote
class Worker(object):
    def __init__(self, m:str):
        # type of a model: lr, cl, or nn
        self._model = m                  
        
    def state(self) -> str:
        return random.choice(STATES)
    
    # Do the work for this model
    def work(self) -> None:
        model_factory(self._model).train()

### Create Supervisor Actor 

In [13]:
@ray.remote
class Supervisor:
    def __init__(self):
        # Create three worker actors, one per model type
        self.workers = [Worker.remote(name) for name in ["lr", "cl", "nn"]]
                        
    def start_workers(self):
        # Start the work on every worker without waiting for it to finish
        for w in self.workers:
            w.work.remote()
        
    def terminate(self):
        for w in self.workers:
            ray.kill(w)
        
    def state(self):
        return ray.get([w.state.remote() for w in self.workers])

Create an actor instance for the supervisor and launch its workers.

In [14]:
sup = Supervisor.remote()

# Launch remote actors as workers
sup.start_workers.remote()

ObjectRef(338dfb787dc9ed0703c495e59083536fba5276c10100000001000000)

### Check status until all done

In [15]:
# check their status
while True:
    # Fetch the states of all its workers
    states = ray.get(sup.state.remote())
    print(states)
    # check if all are DONE
    result = all('DONE' == e for e in states)
    if result:
        # Note: Actor processes will be terminated automatically when the initial actor handle goes out of scope in Python. 
        # If we create an actor with actor_handle = ActorClass.remote(), then when actor_handle goes out of scope and is destructed, 
        # the actor process will be terminated. Note that this only applies to the original actor handle created for the actor 
        # and not to subsequent actor handles created by passing the actor handle to other tasks.
        
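        # Kill all of the supervisor's workers manually, only for illustration and demo.
        # ray.get() waits for terminate() to finish before the supervisor itself is killed.
        ray.get(sup.terminate.remote())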

        # kill the supervisor manually, only for illustration and demo
        ray.kill(sup)
        break

['RUNNING', 'DONE', 'DONE']
['RUNNING', 'DONE', 'RUNNING']
['DONE', 'TRAINING', 'DONE']
['DONE', 'DONE', 'RUNNING']
['TRAINING', 'RUNNING', 'TRAINING']
['RUNNING', 'RUNNING', 'TRAINING']
['TRAINING', 'TRAINING', 'RUNNING']
['DONE', 'RUNNING', 'DONE']
['RUNNING', 'RUNNING', 'DONE']
['TRAINING', 'DONE', 'RUNNING']
['DONE', 'RUNNING', 'DONE']
['TRAINING', 'TRAINING', 'DONE']
['RUNNING', 'TRAINING', 'TRAINING']
['RUNNING', 'RUNNING', 'RUNNING']
['DONE', 'RUNNING', 'DONE']
['DONE', 'RUNNING', 'RUNNING']
['TRAINING', 'DONE', 'DONE']
['RUNNING', 'DONE', 'RUNNING']
['TRAINING', 'TRAINING', 'DONE']
['TRAINING', 'DONE', 'TRAINING']
['DONE', 'TRAINING', 'TRAINING']
['DONE', 'TRAINING', 'TRAINING']
['TRAINING', 'RUNNING', 'RUNNING']
['RUNNING', 'TRAINING', 'DONE']
['TRAINING', 'DONE', 'RUNNING']
['TRAINING', 'RUNNING', 'RUNNING']
['RUNNING', 'TRAINING', 'TRAINING']
['RUNNING', 'RUNNING', 'RUNNING']
['RUNNING', 'RUNNING', 'RUNNING']
['TRAINING', 'RUNNING', 'DONE']
['TRAINING', 'RUNNING', 'RUNNING']

In [16]:
shutdown_ray_cluster()

The next lesson, [Ray Internals](03-Ray-Internals.ipynb), explores the architecture of Ray, task scheduling, the Object Store, etc.