In [1]:
import ray
import os
import time
import numpy as np
import requests

addresses = ray.init()

2020-01-24 13:34:27,582	INFO resource_spec.py:212 -- Starting Ray with 3.22 GiB memory available for workers and up to 1.62 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-01-24 13:34:27,890	INFO services.py:1093 -- View the Ray dashboard at localhost:8267


### Debug blocked actor creation tasks

If creating an actor requires more resources (e.g. CPU, GPU, or other custom resources) than are available, the actor creation task becomes infeasible. The dashboard shows such tasks in red.

In [3]:
@ray.remote(resources={"Custom": 1}, num_cpus=0)
class A(object):
    def __init__(self):
        pass
    
    def f(self):
        return 0
        
@ray.remote
class B(object):
    def __init__(self, x, y):
        self.a = A.remote()
        
    def f(self):
        return ray.get(self.a.f.remote())

b = B.remote(3, y=5)

try:
    ray.get(b.f.remote(), timeout=2)
except ray.exceptions.RayTimeoutError:
    print("Session hangs because actor A cannot be created. ")



Session hangs because actor A cannot be created. 
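The hang above comes from actor `A` asking for a `Custom` resource that no node provides. A minimal sketch of that feasibility check in plain Python follows. The helper `infeasible_resources` and the `cluster_total` values are hypothetical and only illustrate the comparison; in a live session the totals could be read from `ray.cluster_resources()`.

```python
def infeasible_resources(demand, total):
    """Return {resource: (requested, total)} for every request the cluster cannot satisfy."""
    return {
        name: (amount, total.get(name, 0))
        for name, amount in demand.items()
        if amount > total.get(name, 0)
    }


# Hypothetical cluster totals; a real session could use ray.cluster_resources().
cluster_total = {"CPU": 8, "memory": 64}

# Actor A above requests one unit of "Custom" and zero CPUs.
demand_a = {"Custom": 1, "CPU": 0}

print(infeasible_resources(demand_a, cluster_total))  # {'Custom': (1, 0)}
```

An empty result means every request fits within the cluster totals; any entry means the actor creation task can never be scheduled until such a resource is added.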


### Local memory usage

- Local memory info
    - Number of object ids in scope
    - Number of local objects
    - Used Object Memory
    
In the example below, all objects (strings) are stored in local object memory. Used local object memory grows with each call to the remote function `g`.

In [6]:
@ray.remote
def g():
    return "hello world!"

@ray.remote
class A(object):
    def f(self):
        object_ids = []
        for idx in range(50):
            ray.show_in_webui("Loop index = {}...".format(idx))
            # Keep each returned object ID in the list so it stays in scope.
            object_ids.append(g.remote())
            time.sleep(0.5)

a = A.remote()
_ = a.f.remote()

### Inspect node memory usage

- Node memory info
    - Number of object ids in scope
    - Used object store memory
    
In the example below, all objects (strings) are stored on the node. Unlike the example above, used local object memory is always zero in this case.

In [7]:
@ray.remote
class C(object):
    def __init__(self):
        self.object_ids = []
    
    def push(self):
        object_id = ray.put("test")
        self.object_ids.append(object_id)
        time.sleep(1)
        return object_id
    
    def clean_memory(self):
        del self.object_ids
        
@ray.remote
class D(object):
    def __init__(self):
        self.object_ids = []
        
    def fetch(self):
        c = C.remote()
        
        for idx in range(20):
            ray.show_in_webui("Loop index = {}...".format(idx))
            time.sleep(0.5)
            object_id = ray.get(c.push.remote())
            self.object_ids.append(object_id)  

    def clean_memory(self):
        del self.object_ids
        
d = D.remote()
_ = d.fetch.remote()

The following command clears the object IDs in scope for actor `d`: once `self.object_ids` is deleted, every object ID it held goes out of scope.

In [9]:
_ = d.clean_memory.remote()

2020-01-24 13:44:32,885	ERROR worker.py:1003 -- Possible unhandled error from worker: ray::D.clean_memory() (pid=10987, ip=192.168.1.27)
  File "python/ray/_raylet.pyx", line 650, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 623, in function_executor
  File "<ipython-input-7-e3f91fd69418>", line 30, in clean_memory
AttributeError: object_ids
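The `AttributeError` in the log above is consistent with `clean_memory` running on an actor whose `object_ids` attribute had already been deleted, for example when the cell is executed twice. That cause is an assumption; the log itself does not say. The sketch below reproduces the behavior with a plain Python class (the `Holder` class and its method names are hypothetical) and shows a reset-based alternative that can be called repeatedly.

```python
class Holder:
    """Plain-Python stand-in for an actor that keeps a list of object IDs."""

    def __init__(self):
        self.object_ids = ["id-1", "id-2"]

    def clean_with_del(self):
        # Removes the attribute itself; a second call has nothing to delete.
        del self.object_ids

    def clean_with_reset(self):
        # Drops the references but keeps the attribute, so repeated calls are safe.
        self.object_ids = []


h = Holder()
h.clean_with_del()
try:
    h.clean_with_del()
    second_call_error = None
except AttributeError as err:
    second_call_error = err
print(type(second_call_error).__name__)  # AttributeError

safe = Holder()
safe.clean_with_reset()
safe.clean_with_reset()
print(safe.object_ids)  # []
```

Either variant releases the references held in the list; the reset form simply avoids the error on repeated calls.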


### Profile a Python program with py-spy

Clicking the `profiling` button on the dashboard launches `py-spy`, which profiles your Python program. The timing information is visualized as a flame graph in a new browser tab.

In [10]:
import urllib.parse

def operation_add():
    t = time.time()
    x = 0
    while time.time() - t < 20:
        x += 1
    return x

@ray.remote
class A(object):
    def __init__(self):
        pass
    
    def test(self):
        for _ in range(10):
            operation_add()
        
    def getpid(self):
        return os.getpid()

a = A.remote()
actor_pid = ray.get(a.getpid.remote())
a.test.remote()

webui_url = addresses["webui_url"]
launch_profiling_url = "http://{}/api/launch_profiling?node_id={}&pid={}&duration=10".format(webui_url, ray.nodes()[0]["NodeID"], actor_pid)
profiling_id = requests.get(launch_profiling_url).json()
while True:
    check_status_url = "http://{}/api/check_profiling_status?profiling_id={}".format(webui_url, profiling_id)
    status = requests.get(check_status_url).json()
    if status["status"] == "finished":
        profiling_url = "http://{}/api/get_profiling_info?profiling_id={}".format(webui_url, profiling_id)
        encoded_profile_url = urllib.parse.quote(profiling_url)
        print("http://{}/speedscope/index.html#profileURL={}".format(webui_url, encoded_profile_url))
        break
    # Pause between polls so the loop does not flood the dashboard with requests.
    time.sleep(1)

http://localhost:8267/speedscope/index.html#profileURL=http%3A//localhost%3A8267/api/get_profiling_info%3Fprofiling_id%3D8420dc14-cd89-4347-b546-a6b761822a03
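The cell above does two things: it polls until the profile is ready, then builds a percent-encoded speedscope link. The sketch below isolates both steps using only the standard library. The helper names `speedscope_url` and `wait_until_finished`, the timeout parameters, and the `"pending"` status string are assumptions for illustration; the notebook only shows the `"finished"` status.

```python
import time
import urllib.parse


def speedscope_url(webui_url, profiling_id):
    """Build the speedscope link in the same format as the cell above."""
    profile_url = "http://{}/api/get_profiling_info?profiling_id={}".format(
        webui_url, profiling_id)
    # quote() keeps "/" by default and escapes ":", "?" and "=".
    return "http://{}/speedscope/index.html#profileURL={}".format(
        webui_url, urllib.parse.quote(profile_url))


def wait_until_finished(check_status, interval_s=1.0, timeout_s=60.0):
    """Call check_status() until it returns "finished" or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check_status() == "finished":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in for the HTTP status check; "pending" is a hypothetical status value.
statuses = iter(["pending", "pending", "finished"])
finished = wait_until_finished(lambda: next(statuses), interval_s=0.01)
print(finished)
print(speedscope_url("localhost:8267", "8420dc14-cd89-4347-b546-a6b761822a03"))
```

Passing the status check in as a callable keeps the polling logic testable without a running dashboard; in the notebook it would wrap the `requests.get(...)` call.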


### Examples

The logical view of the dashboard lets you track the progress of all parallel actors: training accuracy, constructor configuration, the number of tasks completed, the name of the task currently executing, memory usage, and so on.

### Example 1: Monitor distributed actors

Reference: GitHub issue #3609, @EricSteinberger

In [12]:
import torch


class NeuralNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.l = torch.nn.Linear(1000, 2048)
        self.l2 = torch.nn.Linear(2048, 2)

    def forward(self, x):
        return self.l2(self.l(x))


@ray.remote(num_cpus=1)
class TestActor:
    def __init__(self):
        self.net = NeuralNet()

    def test(self, batch_size):
        p = self.net(torch.rand((batch_size, 1000),))
        
def test(num_actors):
    actors = [TestActor.remote() for _ in range(num_actors)]

    # Start the timer once the actor creation requests have been submitted.
    t = time.time()
    for _ in range(5000//num_actors):
        ray.get([actor.test.remote(128) for actor in actors])
    
    print(f"Test: num_actors = {num_actors}, time = {time.time() - t}")


# %env OMP_NUM_THREADS=1

test(num_actors=4)



Test: num_actors = 4, time = 12.881106853485107
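The loop in `test` runs `5000 // num_actors` rounds, so the total number of `test` calls equals 5000 only when `num_actors` divides 5000 evenly. A small sketch of that arithmetic, using the elapsed time printed above (the helper `work_split` is hypothetical):

```python
def work_split(total_calls, num_actors):
    """Return (rounds, calls actually issued) for the integer-division loop above."""
    rounds = total_calls // num_actors
    return rounds, rounds * num_actors


rounds, calls = work_split(5000, 4)
elapsed_s = 12.881106853485107  # time printed by test(num_actors=4) above

# Average wall-clock time per call, in milliseconds.
print(rounds, calls, round(elapsed_s / calls * 1000, 2))  # 1250 5000 2.58

# With 3 actors the floor division drops two calls.
print(work_split(5000, 3))  # (1666, 4998)
```

When comparing runs with different actor counts, dividing the elapsed time by the calls actually issued keeps the comparison fair.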


### Example 2: Distributed network training

Reference: GitHub issue #6633, @JaeLiiin

In [13]:
%%capture

from tensorflow.keras import layers
import json


def create_keras_model():
    import tensorflow as tf
    model = tf.keras.Sequential()
    # Adds a densely-connected layer with 64 units to the model:
    model.add(layers.Dense(64, activation="relu", input_shape=(32, )))
    # Add another:
    model.add(layers.Dense(64, activation="relu"))
    # Add a softmax layer with 10 output units:
    model.add(layers.Dense(10, activation="softmax"))

    model.compile(
        optimizer=tf.keras.optimizers.RMSprop(0.01),
        loss=tf.keras.losses.categorical_crossentropy,
        metrics=[tf.keras.metrics.categorical_accuracy])
    return model


def random_one_hot_labels(shape):
    n, n_class = shape
    classes = np.random.randint(0, n_class, n)
    labels = np.zeros((n, n_class))
    labels[np.arange(n), classes] = 1
    return labels


@ray.remote
class Network(object):
    def __init__(self):
        self.model = create_keras_model()
        self.dataset = np.random.random((1000, 32))
        self.labels = random_one_hot_labels((1000, 10))

    def train(self):
        history = self.model.fit(self.dataset, self.labels, verbose=False)
        time.sleep(0.5)
        ray.show_in_webui(repr(history.history))
        return history.history

    def get_weights(self):
        return self.model.get_weights()

    def set_weights(self, weights):
        # Note that for simplicity this does not handle the optimizer state.
        self.model.set_weights(weights)

In [14]:
%%capture

result_object_ids = []
result2_object_ids = []

NetworkActor = Network.remote()
NetworkActor2 = Network.remote()

for itr in range(20):
    weights = ray.get(
        [NetworkActor.get_weights.remote(),
         NetworkActor2.get_weights.remote()])

    averaged_weights = [(layer1 + layer2) / 2
                        for layer1, layer2 in zip(weights[0], weights[1])]

    weight_id = ray.put(averaged_weights)
    for actor in [NetworkActor, NetworkActor2]:
        actor.set_weights.remote(weight_id)
    result_object_ids.append(NetworkActor.train.remote())
    result2_object_ids.append(NetworkActor2.train.remote())
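The averaging step in the loop above is written for exactly two actors. A minimal sketch of the same layer-wise mean for any number of actors follows; the helper `average_weights` is hypothetical. It relies only on `+` and `/`, so it works with the NumPy arrays returned by `get_weights()` as well as with the plain floats used here for illustration.

```python
def average_weights(weight_lists):
    """Average corresponding layers across several models' weight lists."""
    if not weight_lists:
        raise ValueError("need at least one list of weights")
    n_layers = len(weight_lists[0])
    if any(len(weights) != n_layers for weights in weight_lists):
        raise ValueError("all models must have the same number of layers")
    # zip(*weight_lists) groups the i-th layer of every model together.
    return [sum(layers) / len(weight_lists) for layers in zip(*weight_lists)]


# Plain floats stand in for per-layer weight arrays.
model_a = [1.0, 2.0]
model_b = [3.0, 6.0]
print(average_weights([model_a, model_b]))  # [2.0, 4.0]
```

With two models this reduces to `(layer1 + layer2) / 2`, matching the expression in the cell above.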

### Example 3: Monitor MNIST training with Tune

- Actor construction which exposes parameter configuration
- Task execution
    - Number of tasks executed
    - Function descriptor of currently executed task
    - Number of pending tasks listed on the task queue
- Training accuracy shown as actor message

Reference: Ray docs

In [15]:
import torch.optim as optim
from ray import tune
from ray.tune.examples.mnist_pytorch import get_data_loaders, ConvNet, train, test


def train_mnist(config):
    train_loader, test_loader = get_data_loaders()
    model = ConvNet()
    optimizer = optim.SGD(model.parameters(), lr=config["lr"])
    for i in range(100):
        train(model, optimizer, train_loader)
        acc = test(model, test_loader)
        ray.show_in_webui(str(acc))
        tune.track.log(mean_accuracy=acc)


analysis = tune.run(
    train_mnist, config={"lr": tune.grid_search([0.001, 0.01, 0.1])})

print("Best config: ", analysis.get_best_config(metric="mean_accuracy"))

# Get a dataframe for analyzing trial results.
df = analysis.dataframe()

2020-01-24 13:51:12,748	INFO function_runner.py:250 -- tune.track signature detected.
2020-01-24 13:51:12,759	ERROR logger.py:328 -- pip install 'ray[tune]' to see TensorBoard files.


Trial name,status,loc,lr
train_mnist_a365d86e,RUNNING,,
train_mnist_a365f6be,PENDING,,
train_mnist_a3661158,PENDING,,


2020-01-24 13:51:12,781	ERROR logger.py:328 -- pip install 'ray[tune]' to see TensorBoard files.
2020-01-24 13:51:12,796	ERROR logger.py:328 -- pip install 'ray[tune]' to see TensorBoard files.


Result for train_mnist_a365d86e:
  date: 2020-01-24_13-51-13
  done: false
  experiment_id: bc38c427222d4d2e8683e0c78761c68e
  experiment_tag: 0_lr=0.001
  hostname: Yunzhis-MacBook-Pro.local
  iterations_since_restore: 1
  mean_accuracy: 0.14375
  node_ip: 192.168.1.27
  pid: 11035
  time_since_restore: 0.3523221015930176
  time_this_iter_s: 0.3523221015930176
  time_total_s: 0.3523221015930176
  timestamp: 1579902673
  timesteps_since_restore: 0
  training_iteration: 0
  trial_id: a365d86e
  
Result for train_mnist_a3661158:
  date: 2020-01-24_13-51-14
  done: false
  experiment_id: 5bfd9aad111f45cfb86163c0dc263032
  experiment_tag: 2_lr=0.1
  hostname: Yunzhis-MacBook-Pro.local
  iterations_since_restore: 1
  mean_accuracy: 0.4625
  node_ip: 192.168.1.27
  pid: 11043
  time_since_restore: 0.2247622013092041
  time_this_iter_s: 0.2247622013092041
  time_total_s: 0.2247622013092041
  timestamp: 1579902674
  timesteps_since_restore: 0
  training_iteration: 0
  trial_id: a3661158
  
Res

Trial name,status,loc,lr,iter,total time (s),acc
train_mnist_a365d86e,RUNNING,192.168.1.27:11035,0.001,20,4.11773,0.325
train_mnist_a365f6be,RUNNING,192.168.1.27:11042,0.01,15,3.05858,0.75
train_mnist_a3661158,RUNNING,192.168.1.27:11043,0.1,16,3.17728,0.884375


Result for train_mnist_a365d86e:
  date: 2020-01-24_13-51-19
  done: false
  experiment_id: bc38c427222d4d2e8683e0c78761c68e
  experiment_tag: 0_lr=0.001
  hostname: Yunzhis-MacBook-Pro.local
  iterations_since_restore: 29
  mean_accuracy: 0.396875
  node_ip: 192.168.1.27
  pid: 11035
  time_since_restore: 5.485640048980713
  time_this_iter_s: 0.16683387756347656
  time_total_s: 5.485640048980713
  timestamp: 1579902679
  timesteps_since_restore: 0
  training_iteration: 28
  trial_id: a365d86e
  
Result for train_mnist_a3661158:
  date: 2020-01-24_13-51-19
  done: false
  experiment_id: 5bfd9aad111f45cfb86163c0dc263032
  experiment_tag: 2_lr=0.1
  hostname: Yunzhis-MacBook-Pro.local
  iterations_since_restore: 29
  mean_accuracy: 0.953125
  node_ip: 192.168.1.27
  pid: 11043
  time_since_restore: 5.240860939025879
  time_this_iter_s: 0.168565034866333
  time_total_s: 5.240860939025879
  timestamp: 1579902679
  timesteps_since_restore: 0
  training_iteration: 28
  trial_id: a3661158
  


Trial name,status,loc,lr,iter,total time (s),acc
train_mnist_a365d86e,RUNNING,192.168.1.27:11035,0.001,49,9.05493,0.490625
train_mnist_a365f6be,RUNNING,192.168.1.27:11042,0.01,45,8.17679,0.890625
train_mnist_a3661158,RUNNING,192.168.1.27:11043,0.1,45,8.11637,0.921875


Result for train_mnist_a365d86e:
  date: 2020-01-24_13-51-24
  done: false
  experiment_id: bc38c427222d4d2e8683e0c78761c68e
  experiment_tag: 0_lr=0.001
  hostname: Yunzhis-MacBook-Pro.local
  iterations_since_restore: 58
  mean_accuracy: 0.584375
  node_ip: 192.168.1.27
  pid: 11035
  time_since_restore: 10.498589038848877
  time_this_iter_s: 0.19450712203979492
  time_total_s: 10.498589038848877
  timestamp: 1579902684
  timesteps_since_restore: 0
  training_iteration: 57
  trial_id: a365d86e
  
Result for train_mnist_a3661158:
  date: 2020-01-24_13-51-24
  done: false
  experiment_id: 5bfd9aad111f45cfb86163c0dc263032
  experiment_tag: 2_lr=0.1
  hostname: Yunzhis-MacBook-Pro.local
  iterations_since_restore: 58
  mean_accuracy: 0.946875
  node_ip: 192.168.1.27
  pid: 11043
  time_since_restore: 10.377768993377686
  time_this_iter_s: 0.20988988876342773
  time_total_s: 10.377768993377686
  timestamp: 1579902684
  timesteps_since_restore: 0
  training_iteration: 57
  trial_id: a36611

Trial name,status,loc,lr,iter,total time (s),acc
train_mnist_a365d86e,RUNNING,192.168.1.27:11035,0.001,76,14.1421,0.7375
train_mnist_a365f6be,RUNNING,192.168.1.27:11042,0.01,71,13.104,0.934375
train_mnist_a3661158,RUNNING,192.168.1.27:11043,0.1,72,13.2076,0.95


Result for train_mnist_a365d86e:
  date: 2020-01-24_13-51-29
  done: false
  experiment_id: bc38c427222d4d2e8683e0c78761c68e
  experiment_tag: 0_lr=0.001
  hostname: Yunzhis-MacBook-Pro.local
  iterations_since_restore: 85
  mean_accuracy: 0.7125
  node_ip: 192.168.1.27
  pid: 11035
  time_since_restore: 15.64893388748169
  time_this_iter_s: 0.18424487113952637
  time_total_s: 15.64893388748169
  timestamp: 1579902689
  timesteps_since_restore: 0
  training_iteration: 84
  trial_id: a365d86e
  
Result for train_mnist_a3661158:
  date: 2020-01-24_13-51-30
  done: false
  experiment_id: 5bfd9aad111f45cfb86163c0dc263032
  experiment_tag: 2_lr=0.1
  hostname: Yunzhis-MacBook-Pro.local
  iterations_since_restore: 85
  mean_accuracy: 0.95
  node_ip: 192.168.1.27
  pid: 11043
  time_since_restore: 15.42218017578125
  time_this_iter_s: 0.16505813598632812
  time_total_s: 15.42218017578125
  timestamp: 1579902690
  timesteps_since_restore: 0
  training_iteration: 84
  trial_id: a3661158
  
Resu

Trial name,status,loc,lr,iter,total time (s),acc
train_mnist_a365d86e,TERMINATED,,0.001,99,18.4761,0.734375
train_mnist_a365f6be,RUNNING,192.168.1.27:11042,0.01,98,18.1631,0.93125
train_mnist_a3661158,RUNNING,192.168.1.27:11043,0.1,99,18.2372,0.9625


Trial name,status,loc,lr,iter,total time (s),acc
train_mnist_a365d86e,TERMINATED,,0.001,99,18.4761,0.734375
train_mnist_a365f6be,TERMINATED,,0.01,99,18.3608,0.928125
train_mnist_a3661158,TERMINATED,,0.1,99,18.2372,0.9625


2020-01-24 13:51:32,983	INFO tune.py:330 -- Returning an analysis object by default. You can call `analysis.trials` to retrieve a list of trials. This message will be removed in future versions of Tune.


Best config:  {'lr': 0.1}
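The reported best config matches the trial with the highest final `mean_accuracy` in the last status table. The sketch below reproduces that selection in plain Python from the three final rows. It is an illustration of the idea only, not Tune's implementation, and the helper `best_config` is hypothetical.

```python
# Final accuracies copied from the last status table above.
final_results = [
    {"trial": "train_mnist_a365d86e", "config": {"lr": 0.001}, "mean_accuracy": 0.734375},
    {"trial": "train_mnist_a365f6be", "config": {"lr": 0.01}, "mean_accuracy": 0.928125},
    {"trial": "train_mnist_a3661158", "config": {"lr": 0.1}, "mean_accuracy": 0.9625},
]


def best_config(results, metric):
    """Return the config of the result with the largest value for `metric`."""
    return max(results, key=lambda result: result[metric])["config"]


print("Best config: ", best_config(final_results, "mean_accuracy"))
```

Running it prints the same `{'lr': 0.1}` shown in the cell output.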
