# Dashboard Overview

This notebook walks through some examples that show the functionality of the Ray dashboard.

There are two views:
1. Machine View, which groups actors by node. It displays occupied and total memory, resources, node IP address, logs, errors, and so on.
2. Logical View, which groups actors by their hierarchical structure. Actor `A` is the parent of actor `B` if `A` creates `B`; in that case, actor `B` is displayed as a nested actor of `A`.
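
The nesting rule used by the Logical View can be sketched in a few lines. This is only an illustration under stated assumptions, not the dashboard's implementation: `build_logical_view` and the `parent_of` mapping are hypothetical names, and actors are represented by plain strings.

```python
from collections import defaultdict


def build_logical_view(parent_of):
    """Nest each actor under the actor that created it.

    parent_of maps an actor name to its creator's name, or to None for
    actors created directly by the driver (hypothetical representation).
    """
    children = defaultdict(list)
    for actor, parent in parent_of.items():
        children[parent].append(actor)

    def subtree(actor):
        # Recursively collect the actors created by `actor`.
        return {child: subtree(child) for child in sorted(children[actor])}

    return subtree(None)


# Hypothetical example: the driver creates A and C, and A creates B.
view = build_logical_view({"A": None, "B": "A", "C": None})
print(view)
```

Here `B` ends up nested under `A`, while `A` and `C` sit at the top level.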

In [3]:
import ray
import os
import time
import numpy as np
import requests

addresses = ray.init()
print("Click here to open the dashboard: http://{}".format(addresses["webui_url"]))

2020-01-24 15:21:10,546	INFO resource_spec.py:212 -- Starting Ray with 3.71 GiB memory available for workers and up to 1.88 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-01-24 15:21:10,797	INFO services.py:1093 -- View the Ray dashboard at localhost:8273


Click here to open the dashboard: http://localhost:8273


### Debug blocked actor creation tasks

If creating an actor requires resources (e.g., CPUs, GPUs, or other custom resources) that are not currently available, the actor creation task becomes infeasible. This can cause the program to hang.

To make developers aware of this issue, infeasible tasks are shown in red in the dashboard. 

![title](img/infeasible-task.png)
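
As a rough mental model, the feasibility condition amounts to comparing the requested resources against what is available. The sketch below is a simplification and does not reproduce Ray's scheduler; `is_feasible` and the `available` dictionary are hypothetical names chosen for illustration.

```python
def is_feasible(requested, available):
    """Return True if every requested resource fits within `available`."""
    return all(
        available.get(name, 0) >= amount for name, amount in requested.items()
    )


# Hypothetical machine that offers CPUs but no "Custom" resource.
available = {"CPU": 4}

needs_custom = is_feasible({"Custom": 1, "CPU": 0}, available)
needs_cpu = is_feasible({"CPU": 1}, available)
print(needs_custom, needs_cpu)
```

A request for `{"Custom": 1}` fails this check on a machine that does not offer a `Custom` resource, which is the situation shown in red above.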

In [5]:
@ray.remote(resources={"Custom": 1}, num_cpus=0)
class A(object):
    def __init__(self):
        pass
    
    def f(self):
        return 0
        
@ray.remote
class B(object):
    def __init__(self, x, y):
        self.a = A.remote()
        
    def f(self):
        return ray.get(self.a.f.remote())

b = B.remote(3, y=5)

try:
    ray.get(b.f.remote(), timeout=2)
except ray.exceptions.RayTimeoutError:
    print("Session hangs because actor A cannot be created. ")



Session hangs because actor A cannot be created. 


### Inspect local memory usage

The dashboard shows the following information about local memory usage:
- Number of object ids in scope
- Number of local objects
- Used Object Memory
    
In the example below, all objects (strings) are stored in local object memory. Used local object memory increases as the remote function `g` is repeatedly called. 

![title](img/local-memory-usage.png)

In [6]:
@ray.remote
def g():
    return "hello world!"

@ray.remote
class A(object):
    def f(self):
        object_ids = []
        for idx in range(50):
            ray.show_in_webui("Loop index = {}...".format(idx))
            object_ids.append(g.remote())
            time.sleep(0.5)

a = A.remote()
_ = a.f.remote()

### Inspect node memory usage

Unlike the example above, used local object memory will always be zero here because all objects (strings) are stored on the node.

![title](img/node-memory-usage.png)

In [7]:
@ray.remote
class C(object):
    def __init__(self):
        self.object_ids = []
    
    def push(self):
        object_id = ray.put("test")
        self.object_ids.append(object_id)
        time.sleep(1)
        return object_id
    
    def clean_memory(self):
        del self.object_ids
        
@ray.remote
class D(object):
    def __init__(self):
        self.object_ids = []
        
    def fetch(self):
        c = C.remote()
        
        for idx in range(20):
            ray.show_in_webui("Loop index = {}...".format(idx))
            time.sleep(0.5)
            object_id = ray.get(c.push.remote())
            self.object_ids.append(object_id)  

    def clean_memory(self):
        del self.object_ids
        
d = D.remote()
_ = d.fetch.remote()

The following command clears the object ids in scope for actor `d`: all object ids go out of scope once `self.object_ids` is deleted. The `NumObjectIdInScope` field on the dashboard will be set to 0.

In [9]:
_ = d.clean_memory.remote()
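
The same "out of scope" behaviour can be observed in plain Python, without Ray. The sketch below is an analogy only: `FakeObjectId` and `Holder` are hypothetical stand-ins, and the weak set plays the role of the dashboard counter. It relies on CPython reclaiming objects once the last reference disappears.

```python
import gc
import weakref


class FakeObjectId:
    """Hypothetical stand-in for an object id (supports weak references)."""


class Holder:
    def __init__(self):
        self.object_ids = [FakeObjectId() for _ in range(20)]

    def clean_memory(self):
        # Dropping the list drops the only references to the ids.
        del self.object_ids


holder = Holder()
# A WeakSet tracks the ids without keeping them alive.
live = weakref.WeakSet(holder.object_ids)
in_scope_before = len(live)

holder.clean_memory()
gc.collect()  # not required for this case in CPython, added for clarity
in_scope_after = len(live)
print(in_scope_before, in_scope_after)
```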

### Profile Python programs with py-spy

Clicking the `profiling` button on the dashboard launches `py-spy`, which times your Python program. The timing information is visualized as a flamegraph in a new browser tab.

Check out the example Learning to play Pong in the Ray documentation: https://ray.readthedocs.io/en/latest/auto_examples/plot_pong_example.html

Click `profiling`, then click `Profiling result` when it is ready. Note that the process may contain multiple threads, some of which are Ray internal threads whose timing information may not be very interesting. Click the left and right arrows at the top middle to see profiling results for different threads.

You can now see where the computation bottleneck might be. More information on how to interpret the flamegraph is available at https://github.com/jlfwong/speedscope#usage.

![title](img/profiling.png)
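
To build some intuition for what a flamegraph encodes: a sampling profiler periodically records the call stack, and the width of each frame in the graph is proportional to the number of samples in which it appears. The sketch below aggregates a handful of made-up samples in that way. It is a simplified illustration and not `py-spy`'s implementation; the function names and sample data are hypothetical.

```python
from collections import Counter


def collapse_stacks(samples):
    """Count how often each full call stack appears among the samples."""
    return Counter(";".join(stack) for stack in samples)


def time_share(samples, function):
    """Fraction of samples in which `function` is somewhere on the stack."""
    hits = sum(1 for stack in samples if function in stack)
    return hits / len(samples)


# Hypothetical stack samples, outermost frame first.
samples = [
    ["main", "train", "forward"],
    ["main", "train", "forward"],
    ["main", "train", "backward"],
    ["main", "load_data"],
]

collapsed = collapse_stacks(samples)
for stack, count in collapsed.most_common():
    print(stack, count)

train_share = time_share(samples, "train")
print(train_share)
```

In this toy data, `train` is on the stack in three of the four samples, so it would occupy three quarters of the flamegraph's width.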

### Examples

The logical view of the dashboard allows users to track the progress (training accuracy, constructor configuration, memory usage, etc.) of all parallel actors.

### Example 1: Monitor distributed actors

Reference: GitHub issue #3609, @EricSteinberger

In this example, the dashboard displays parallel actors and reports the number of tasks each actor has completed, the name of the task currently executing, and the number of tasks pending in the queue. Click `collapse` if an actor spawns too many child actors.

![title](img/example-1.png)

In [10]:
import torch


class NeuralNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.l = torch.nn.Linear(1000, 2048)
        self.l2 = torch.nn.Linear(2048, 2)

    def forward(self, x):
        return self.l2(self.l(x))


@ray.remote(num_cpus=1)
class TestActor:
    def __init__(self):
        self.net = NeuralNet()

    def test(self, batch_size):
        p = self.net(torch.rand((batch_size, 1000),))
        
def test(num_actors):
    actors = [TestActor.remote() for _ in range(num_actors)]

    # Start timing after the actor creation calls have been issued.
    t = time.time()
    for _ in range(5000 // num_actors):
        # Run one batch on every actor and wait for all of them to finish.
        ray.get([actor.test.remote(128) for actor in actors])

    print(f"Test: num_actors = {num_actors}, time = {time.time() - t}")

test(num_actors=4)

Test: num_actors = 4, time = 14.119410037994385


### Example 2: Distributed network training

Reference: GitHub issue #6633, @JaeLiiin

![title](img/example-2.png)

In [11]:
%%capture

from tensorflow.keras import layers
import json


def create_keras_model():
    import tensorflow as tf
    model = tf.keras.Sequential()
    # Adds a densely-connected layer with 64 units to the model:
    model.add(layers.Dense(64, activation="relu", input_shape=(32, )))
    # Add another:
    model.add(layers.Dense(64, activation="relu"))
    # Add a softmax layer with 10 output units:
    model.add(layers.Dense(10, activation="softmax"))

    model.compile(
        optimizer=tf.keras.optimizers.RMSprop(0.01),
        loss=tf.keras.losses.categorical_crossentropy,
        metrics=[tf.keras.metrics.categorical_accuracy])
    return model


def random_one_hot_labels(shape):
    n, n_class = shape
    classes = np.random.randint(0, n_class, n)
    labels = np.zeros((n, n_class))
    labels[np.arange(n), classes] = 1
    return labels


@ray.remote
class Network(object):
    def __init__(self):
        self.model = create_keras_model()
        self.dataset = np.random.random((1000, 32))
        self.labels = random_one_hot_labels((1000, 10))

    def train(self):
        history = self.model.fit(self.dataset, self.labels, verbose=False)
        time.sleep(0.5)
        ray.show_in_webui(repr(history.history))
        return history.history

    def get_weights(self):
        return self.model.get_weights()

    def set_weights(self, weights):
        # Note that for simplicity this does not handle the optimizer state.
        self.model.set_weights(weights)

In [12]:
%%capture

result_object_ids = []
result2_object_ids = []

NetworkActor = Network.remote()
NetworkActor2 = Network.remote()

for itr in range(20):
    weights = ray.get(
        [NetworkActor.get_weights.remote(),
         NetworkActor2.get_weights.remote()])

    averaged_weights = [(layer1 + layer2) / 2
                        for layer1, layer2 in zip(weights[0], weights[1])]

    weight_id = ray.put(averaged_weights)
    [
        actor.set_weights.remote(weight_id)
        for actor in [NetworkActor, NetworkActor2]
    ]
    result_object_ids.append(NetworkActor.train.remote())
    result2_object_ids.append(NetworkActor2.train.remote())
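
The averaging step in the loop above is written for exactly two actors. A possible generalization to any number of actors is sketched below. It is an assumption-laden illustration rather than part of the original example: `average_weights` is a hypothetical helper, and each layer is a flat list of floats standing in for the NumPy arrays that Keras returns.

```python
def average_weights(per_actor_weights):
    """Average corresponding layers across actors.

    per_actor_weights[i][j] is layer j of actor i, given here as a flat
    list of floats (a stand-in for a NumPy array).
    """
    num_actors = len(per_actor_weights)
    averaged = []
    # zip(*...) groups layer j of every actor together.
    for layers in zip(*per_actor_weights):
        # The inner zip groups the same position across actors.
        averaged.append([sum(values) / num_actors for values in zip(*layers)])
    return averaged


# Hypothetical weights for two actors with two layers each.
weights = [
    [[1.0, 2.0], [0.0]],
    [[3.0, 4.0], [2.0]],
]
averaged = average_weights(weights)
print(averaged)
```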

### Example 3: Monitor MNIST training with Tune
- Actor construction which exposes parameter configuration
- Task execution
    - Number of tasks executed
    - Function descriptor of currently executed task
    - Number of pending tasks listed on the task queue
- Training accuracy shown as actor message

![title](img/example-3.png)

In [13]:
import torch.optim as optim
from ray import tune
from ray.tune.examples.mnist_pytorch import get_data_loaders, ConvNet, train, test


def train_mnist(config):
    train_loader, test_loader = get_data_loaders()
    model = ConvNet()
    optimizer = optim.SGD(model.parameters(), lr=config["lr"])
    for i in range(100):
        train(model, optimizer, train_loader)
        acc = test(model, test_loader)
        ray.show_in_webui(str(acc))
        tune.track.log(mean_accuracy=acc)


analysis = tune.run(
    train_mnist, config={"lr": tune.grid_search([0.001, 0.01, 0.1])})

print("Best config: ", analysis.get_best_config(metric="mean_accuracy"))

# Get a dataframe for analyzing trial results.
df = analysis.dataframe()

2020-01-24 15:43:27,856	INFO function_runner.py:250 -- tune.track signature detected.
2020-01-24 15:43:27,869	ERROR logger.py:328 -- pip install 'ray[tune]' to see TensorBoard files.


Trial name,status,loc,lr
train_mnist_51d5e2c2,RUNNING,,
train_mnist_51d60950,PENDING,,
train_mnist_51d627e6,PENDING,,


2020-01-24 15:43:27,892	ERROR logger.py:328 -- pip install 'ray[tune]' to see TensorBoard files.
2020-01-24 15:43:27,913	ERROR logger.py:328 -- pip install 'ray[tune]' to see TensorBoard files.


Result for train_mnist_51d60950:
  date: 2020-01-24_15-43-31
  done: false
  experiment_id: 430041a749c3485e89b5a16dbcae3c66
  experiment_tag: 1_lr=0.01
  hostname: Yunzhis-MacBook-Pro.local
  iterations_since_restore: 1
  mean_accuracy: 0.23125
  node_ip: 192.168.1.27
  pid: 12308
  time_since_restore: 0.363037109375
  time_this_iter_s: 0.363037109375
  time_total_s: 0.363037109375
  timestamp: 1579909411
  timesteps_since_restore: 0
  training_iteration: 0
  trial_id: 51d60950
  
Result for train_mnist_51d5e2c2:
  date: 2020-01-24_15-43-31
  done: false
  experiment_id: 5e6369a0220c49359d02f5b5a19f21b8
  experiment_tag: 0_lr=0.001
  hostname: Yunzhis-MacBook-Pro.local
  iterations_since_restore: 1
  mean_accuracy: 0.065625
  node_ip: 192.168.1.27
  pid: 12309
  time_since_restore: 0.46709609031677246
  time_this_iter_s: 0.46709609031677246
  time_total_s: 0.46709609031677246
  timestamp: 1579909411
  timesteps_since_restore: 0
  training_iteration: 0
  trial_id: 51d5e2c2
  
Result fo

Trial name,status,loc,lr,iter,total time (s),acc
train_mnist_51d5e2c2,RUNNING,192.168.1.27:12309,0.001,6,1.96941,0.05
train_mnist_51d60950,RUNNING,192.168.1.27:12308,0.01,7,2.12312,0.775
train_mnist_51d627e6,RUNNING,192.168.1.27:12307,0.1,6,2.01397,0.8375


Result for train_mnist_51d60950:
  date: 2020-01-24_15-43-36
  done: false
  experiment_id: 430041a749c3485e89b5a16dbcae3c66
  experiment_tag: 1_lr=0.01
  hostname: Yunzhis-MacBook-Pro.local
  iterations_since_restore: 23
  mean_accuracy: 0.85625
  node_ip: 192.168.1.27
  pid: 12308
  time_since_restore: 5.452777147293091
  time_this_iter_s: 0.21289515495300293
  time_total_s: 5.452777147293091
  timestamp: 1579909416
  timesteps_since_restore: 0
  training_iteration: 22
  trial_id: 51d60950
  
Result for train_mnist_51d5e2c2:
  date: 2020-01-24_15-43-36
  done: false
  experiment_id: 5e6369a0220c49359d02f5b5a19f21b8
  experiment_tag: 0_lr=0.001
  hostname: Yunzhis-MacBook-Pro.local
  iterations_since_restore: 23
  mean_accuracy: 0.128125
  node_ip: 192.168.1.27
  pid: 12309
  time_since_restore: 5.5144970417022705
  time_this_iter_s: 0.21251416206359863
  time_total_s: 5.5144970417022705
  timestamp: 1579909416
  timesteps_since_restore: 0
  training_iteration: 22
  trial_id: 51d5e2c2

Trial name,status,loc,lr,iter,total time (s),acc
train_mnist_51d5e2c2,RUNNING,192.168.1.27:12309,0.001,29,7.05838,0.125
train_mnist_51d60950,RUNNING,192.168.1.27:12308,0.01,30,7.22034,0.890625
train_mnist_51d627e6,RUNNING,192.168.1.27:12307,0.1,29,7.10089,0.884375


Result for train_mnist_51d60950:
  date: 2020-01-24_15-43-41
  done: false
  experiment_id: 430041a749c3485e89b5a16dbcae3c66
  experiment_tag: 1_lr=0.01
  hostname: Yunzhis-MacBook-Pro.local
  iterations_since_restore: 45
  mean_accuracy: 0.86875
  node_ip: 192.168.1.27
  pid: 12308
  time_since_restore: 10.705771207809448
  time_this_iter_s: 0.3253171443939209
  time_total_s: 10.705771207809448
  timestamp: 1579909421
  timesteps_since_restore: 0
  training_iteration: 44
  trial_id: 51d60950
  
Result for train_mnist_51d5e2c2:
  date: 2020-01-24_15-43-41
  done: false
  experiment_id: 5e6369a0220c49359d02f5b5a19f21b8
  experiment_tag: 0_lr=0.001
  hostname: Yunzhis-MacBook-Pro.local
  iterations_since_restore: 45
  mean_accuracy: 0.24375
  node_ip: 192.168.1.27
  pid: 12309
  time_since_restore: 10.82926321029663
  time_this_iter_s: 0.33666110038757324
  time_total_s: 10.82926321029663
  timestamp: 1579909421
  timesteps_since_restore: 0
  training_iteration: 44
  trial_id: 51d5e2c2
 

Trial name,status,loc,lr,iter,total time (s),acc
train_mnist_51d5e2c2,RUNNING,192.168.1.27:12309,0.001,49,12.1771,0.25
train_mnist_51d60950,RUNNING,192.168.1.27:12308,0.01,50,12.3249,0.909375
train_mnist_51d627e6,RUNNING,192.168.1.27:12307,0.1,49,12.1976,0.9125


Result for train_mnist_51d60950:
  date: 2020-01-24_15-43-46
  done: false
  experiment_id: 430041a749c3485e89b5a16dbcae3c66
  experiment_tag: 1_lr=0.01
  hostname: Yunzhis-MacBook-Pro.local
  iterations_since_restore: 64
  mean_accuracy: 0.89375
  node_ip: 192.168.1.27
  pid: 12308
  time_since_restore: 15.811901092529297
  time_this_iter_s: 0.2762620449066162
  time_total_s: 15.811901092529297
  timestamp: 1579909426
  timesteps_since_restore: 0
  training_iteration: 63
  trial_id: 51d60950
  
Result for train_mnist_51d5e2c2:
  date: 2020-01-24_15-43-46
  done: false
  experiment_id: 5e6369a0220c49359d02f5b5a19f21b8
  experiment_tag: 0_lr=0.001
  hostname: Yunzhis-MacBook-Pro.local
  iterations_since_restore: 64
  mean_accuracy: 0.340625
  node_ip: 192.168.1.27
  pid: 12309
  time_since_restore: 15.890575170516968
  time_this_iter_s: 0.2264111042022705
  time_total_s: 15.890575170516968
  timestamp: 1579909426
  timesteps_since_restore: 0
  training_iteration: 63
  trial_id: 51d5e2c2

Trial name,status,loc,lr,iter,total time (s),acc
train_mnist_51d5e2c2,RUNNING,192.168.1.27:12309,0.001,69,17.2965,0.365625
train_mnist_51d60950,RUNNING,192.168.1.27:12308,0.01,70,17.4563,0.8875
train_mnist_51d627e6,RUNNING,192.168.1.27:12307,0.1,69,17.3068,0.921875


Result for train_mnist_51d60950:
  date: 2020-01-24_15-43-51
  done: false
  experiment_id: 430041a749c3485e89b5a16dbcae3c66
  experiment_tag: 1_lr=0.01
  hostname: Yunzhis-MacBook-Pro.local
  iterations_since_restore: 86
  mean_accuracy: 0.89375
  node_ip: 192.168.1.27
  pid: 12308
  time_since_restore: 20.95795702934265
  time_this_iter_s: 0.21814513206481934
  time_total_s: 20.95795702934265
  timestamp: 1579909431
  timesteps_since_restore: 0
  training_iteration: 85
  trial_id: 51d60950
  
Result for train_mnist_51d5e2c2:
  date: 2020-01-24_15-43-51
  done: false
  experiment_id: 5e6369a0220c49359d02f5b5a19f21b8
  experiment_tag: 0_lr=0.001
  hostname: Yunzhis-MacBook-Pro.local
  iterations_since_restore: 86
  mean_accuracy: 0.496875
  node_ip: 192.168.1.27
  pid: 12309
  time_since_restore: 21.006662845611572
  time_this_iter_s: 0.22823190689086914
  time_total_s: 21.006662845611572
  timestamp: 1579909431
  timesteps_since_restore: 0
  training_iteration: 85
  trial_id: 51d5e2c2

Trial name,status,loc,lr,iter,total time (s),acc
train_mnist_51d5e2c2,RUNNING,192.168.1.27:12309,0.001,91,22.3222,0.5125
train_mnist_51d60950,RUNNING,192.168.1.27:12308,0.01,92,22.4896,0.884375
train_mnist_51d627e6,RUNNING,192.168.1.27:12307,0.1,91,22.3347,0.946875


Trial name,status,loc,lr,iter,total time (s),acc
train_mnist_51d5e2c2,TERMINATED,,0.001,99,24.1395,0.5625
train_mnist_51d60950,TERMINATED,,0.01,99,24.054,0.89375
train_mnist_51d627e6,TERMINATED,,0.1,99,24.1102,0.95625


2020-01-24 15:43:54,943	INFO tune.py:330 -- Returning an analysis object by default. You can call `analysis.trials` to retrieve a list of trials. This message will be removed in future versions of Tune.


Best config:  {'lr': 0.1}
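
The reported best config is consistent with the final status table above. As a sanity check, the sketch below picks the learning rate with the highest final accuracy from that table. It does not show how `analysis.get_best_config` is implemented; the dictionary is simply the last `acc` column copied by hand.

```python
# Final accuracy per learning rate, copied from the last status table.
final_accuracy = {0.001: 0.5625, 0.01: 0.89375, 0.1: 0.95625}

# Select the learning rate whose trial ended with the highest accuracy.
best_lr = max(final_accuracy, key=final_accuracy.get)
best_config = {"lr": best_lr}
print("Best config: ", best_config)
```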


More examples are available at https://ray.readthedocs.io/en/latest/auto_examples/overview.html.
