# Ray Observability Part 1

<img src="../_static/assets/Generic/ray_logo.png" width="20%" loading="lazy">

## About this notebook

### Is this module right for you?

This module provides a general-purpose introduction to the most common observability tools for debugging, optimizing, and monitoring Ray applications. It is for data scientists, ML practitioners, ML engineers, and Python developers looking for ways to understand the behavior of their Ray systems.

### Prerequisites

For this notebook, you should satisfy the following minimum requirements:

-   Practical Python experience
-   Familiarity with Ray equivalent to completing these training modules:
    -   [Overview of Ray](https://github.com/ray-project/ray-educational-materials/blob/main/Introductory_modules/Overview_of_Ray.ipynb)
    -   [Ray Core](https://github.com/ray-project/ray-educational-materials/tree/main/Ray_Core)

### Learning objectives

-   Understand the major tools available for observability with Ray, namely the State API and Dashboard UI.
-   Debug a sample application and surface errors through multiple different observability points.
-   Optimize an application that contains a known anti-pattern: identify the bottleneck using the Ray Dashboard, then implement a common design pattern to address it.

### What will you do?

-   Introduction to the Ray observability toolbox
    -   Learn about what observability is and why it can be so difficult in distributed settings.
    -   Read about the State API and Dashboard UI and leverage them in common development workflows.
-   Ray observability workflows
    -   Debugging
        -   Reproduce an out of memory error and retrieve metrics and logs related to the failure.
        -   Reproduce a hanging bug and observe its behavior.
    -   Optimizing
        -   Run some `ray.get()` anti-patterns, observe the resulting performance bottlenecks, and implement the corresponding design patterns to remove them.
-   Summarize the most common observability tools and find resources for further advanced exploration.



## Introduction to observability in distributed systems

---

Effective debugging and optimizing is crucial in any system, but it becomes even more vital in [distributed settings](https://en.wikipedia.org/wiki/Distributed_computing). With a high number of inter-connected, heterogeneous components, updates can cause opaque failures, making it difficult to identify and fix issues. The goal of observability is to allow for the monitoring and understanding of the internal state of the system in real-time to ensure its stability and performance.

### What is observability?

<div class="alert alert-info">
  <strong>Observability:</strong> the ability to understand the internal state of a system from its external outputs.
</div>

Within the context of Ray, [observability](https://docs.ray.io/en/latest/ray-observability/index.html) refers to the extent of visibility into distributed applications along with the available tools for inspecting and aggregating performance data.

### Why is observability challenging in distributed systems?
To illustrate the significance of being able to access outputted information through user-friendly tools, consider the [architecture](https://docs.google.com/document/d/1tBw9A4j62ruI5omIJbMxly-la5w4q_TjyJgJL_jN2fI/preview) of a typical Ray application in the following example code snippet:

|<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/Observability_part_1/code_scalability_envelope.png" width="70%" loading="lazy">|
|:--|
|Simplified Ray application used for multi-model training on many data batches. The key takeaway is to highlight the scale of coordinating individual machines in a large distributed system, with each introducing an opportunity for failure.|

A [Ray cluster](https://docs.ray.io/en/latest/cluster/key-concepts.html) consists of a head node that manages a large number of worker nodes which execute the code of an application. As the scale of the cluster increases, so does the number of [tasks](https://docs.ray.io/en/latest/ray-core/tasks.html#ray-remote-functions), [actors](https://docs.ray.io/en/latest/ray-core/actors.html#actor-guide), and [objects](https://docs.ray.io/en/latest/ray-core/objects.html#objects-in-ray) that are concurrently executed and [stored across heterogeneous machines](https://docs.ray.io/en/latest/ray-core/scheduling/memory-management.html#memory).

|<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/Observability_part_1/actor_failure.png" width="70%" loading="lazy">|
|:--|
|Actor failure in a Ray cluster with millions of concurrent tasks among [thousands of worker nodes](https://github.com/ray-project/ray/blob/master/release/benchmarks/README.md).|

Debugging issues in this environment can be challenging. Among [thousands of nodes and tens of thousands of actors](https://github.com/ray-project/ray/blob/master/release/benchmarks/README.md), performance bottlenecks, failures, and unpredictable behavior are inevitable. For example, diagnosing a cluster with thousands of actors and millions of processes raises several non-trivial questions:

-   How do I know if an actor has failed, especially if the actor failed to initialize in the first place?
    -   Some processes may become stuck indefinitely due to incomplete scheduling, known as "hang."
-   When I become aware of actor failure(s), how do I know which one(s) caused the issue?
    -   In a set-up of tens of thousands of actors, how do I begin to identify the culprit?
-   Once I know which actor(s) to inspect, how can I find the log and fix the bug?
    -   Filtering through logs and tracing failures is impossible without robust tooling.

The [Ray runtime](https://docs.ray.io/en/latest/ray-core/starting-ray.html#what-is-the-ray-runtime) manages much of the low-level system behavior in a Ray application, which creates a unique opportunity to offer built-in performance data. Given the right tools to debug, optimize, and monitor Ray applications, developers can troubleshoot issues and improve a system's overall reliability and efficiency.

## Ray observability toolbox

---

### Observability tooling by layer

|<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/Observability_part_1/observability_stack.png" width="70%" loading="lazy">|
|:--|
|Ray offers observability tooling and third-party integrations at each layer of development that allow you to understand the Ray cluster, Ray application, and ML application.|

| Layer | Tooling | Purpose | Scenarios |
|---|---|---|---|
| **Ray Cluster** | <ul><li>State API</li><li>Ray Dashboard</li></ul> | Infrastructure observability, like [`htop`](https://htop.dev/) for a Ray cluster. Monitor the status and utilization including hardware (CPUs, GPUs, TPUs), network, and memory. | Which nodes in my Ray cluster are experiencing high CPU or memory usage so that I can optimize consumption and reduce costs? |
| **Ray Application (Core and AIR)** | <ul><li>State API</li><li>Ray Dashboard</li><li>Profiler</li><li>Debugger</li></ul> | Debug, optimize, and monitor Ray applications including the status of tasks, actors, objects, placement groups, jobs, and more. | Are my thousands of in-flight tasks and actors progressing normally, or are some failing or hanging in unintended ways? If so, which ones, and how can I access the logs easily for any given node? |
| **ML Application** | Interactive Development <ul><li>[Weights & Biases](https://www.anyscale.com/events/2023/01/19/simplify-building-scaling-tracking-and-monitoring-your-ai-ml-models)</li><li>[MLflow](https://docs.ray.io/en/latest/tune/examples/tune-mlflow.html)</li><li>[Comet](https://docs.ray.io/en/latest/tune/examples/tune-comet.html)</li></ul> Production <ul><li>[Ray Dashboard](https://docs.ray.io/en/latest/ray-observability/ray-metrics.html)</li><li>[Arize](https://www.anyscale.com/events/2023/02/07/productionizing-machine-learning-with-observability-quality-and-flexibility)</li><li>[WhyLabs](https://docs.whylabs.ai/docs/ray-integration/)</li></ul> | Monitor ML models in interactive development and production through third-party integrations. | Which hyperparameters are optimal for my model? How does model performance vary over time or across different input data? Are there any anomalies in the model’s behavior in production?  |

This section provides a solid introduction to the two main observability tools in Ray: the [State API](https://docs.ray.io/en/master/ray-observability/state/state-api.html) and the [Dashboard UI](https://docs.ray.io/en/master/ray-core/ray-dashboard.html). To demonstrate their functionality, consider the following simple example to practice on:

In [None]:
import ray
import time

In [None]:
if ray.is_initialized():
    ray.shutdown()

ray.init()

In [None]:
@ray.remote
def task():
    time.sleep(60)


@ray.remote
class Actor:
    def call(self):
        print("Actor called.")

### State API

[State APIs](https://docs.ray.io/en/latest/ray-observability/state/state-api.html#monitoring-ray-states) allow users to access the current state of Ray resources through the [CLI](https://docs.ray.io/en/latest/ray-observability/state/cli.html#state-api-cli-ref) or the [Python SDK](https://docs.ray.io/en/latest/ray-observability/state/ray-state-api-reference.html#state-api-ref).

-   **Resources** include Ray tasks, actors, objects, placement groups, and more.
-   **States** refer to the immutable metadata (e.g. actor's name) and mutable states (e.g. actor's scheduling state or pid) of Ray resources.

There are three main APIs that allow you to inspect cluster resources with varying levels of granularity:

-   **`summary`** returns a summarized view of a given resource (i.e. tasks, actors, objects).
-   **`list`** returns a list of resources filterable by type and state.
-   **`get`** returns information about a specific resource in detail.

In addition, you can also easily retrieve and filter through [ray logs](https://docs.ray.io/en/latest/ray-observability/state/cli.html#ray-logs-api-cli-ref):

-   **`logs`** returns the logs of tasks, actors, workers, or system log files.

#### Example: Inspect cluster resources.

In [None]:
task.remote()

In [None]:
!ray summary tasks

In [None]:
actor = Actor.remote()

In [None]:
actor.call.remote()

In [None]:
!ray list actors

#### Coding exercise

Access more information about the actor you just created by running `ray get actors <actor_id>` with the [CLI](https://docs.ray.io/en/latest/ray-observability/state/cli.html#ray-get) or by calling the equivalent get API in the [Python SDK](https://docs.ray.io/en/latest/ray-observability/state/ray-state-api-reference.html#get-apis).

In [None]:
### YOUR CODE HERE ###

#### Solution

In [None]:
### SAMPLE IMPLEMENTATION ###
!ray get actors ### FILL IN HERE WITH YOUR <actor_id> ###

### Dashboard UI

The [Ray Dashboard](https://docs.ray.io/en/master/ray-core/ray-dashboard.html) offers a built-in mechanism for viewing the state of cluster resources, [time series](https://en.wikipedia.org/wiki/Time_series) metrics, and other features at a glance. Information made available via the terminal with the State API is also available to view in the Dashboard UI. You can access the dashboard in a few ways:

-   Through the URL printed when Ray is initialized (the default is <http://localhost:8265>).
-   When using the [cluster launcher](https://docs.ray.io/en/master/cluster/vms/references/ray-cluster-cli.html#monitor-cluster).
-   Under the "Tools" tab in the [Anyscale Console](https://www.anyscale.com/platform) (all built-in; no setup).

Note: In the Anyscale console, time series metrics are built-in, so you don't have to set anything up. Otherwise, download and configuration instructions for integration with Prometheus and Grafana can be found [here](https://docs.ray.io/en/latest/ray-observability/ray-metrics.html#ray-metrics).

|<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/Observability_part_1/dashboard_ui.png" width="70%" loading="lazy">|
|:--|
|Ray dashboard is a web-based UI to help users monitor their cluster that acts as a central hub for the best observability tools available for Ray.|

#### Dashboard Navigation

The overview page of the dashboard offers live monitoring and quick links to common views such as metrics, nodes, jobs, and events related to [Ray job submission APIs](https://docs.ray.io/en/master/cluster/running-applications/job-submission/quickstart.html#jobs-quickstart) and the [Ray autoscaler](https://docs.ray.io/en/master/cluster/key-concepts.html#cluster-autoscaler). Along the top navigation bar, you will find five viewing displays:

1.  **Jobs** - View the status and logs of [Ray jobs](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html#jobs-overview).
2.  **Cluster** - View state, resource utilization, and logs for each node and worker.
3.  **Actors** - View information about the actors that have existed on the Ray cluster.
4.  **Metrics** - View time series metrics that automatically refresh every 15 seconds; requires [Prometheus](https://prometheus.io/) and [Grafana](https://grafana.com/oss/grafana/) running for your cluster.
5.  **Logs** - View all logs, organized by node and file name; supports filter and search.

Each view offers plenty of categories to monitor, and you can refer to the [dashboard references](https://docs.ray.io/en/latest/ray-core/ray-dashboard.html#references) for a more complete description. The later sections on Ray observability workflows investigate these views in more detail.

#### Coding exercise

Open up the Ray Dashboard from the URL provided when you called `ray.init()`.

Try running a task or creating a new actor. You can use the basic task or actor provided in the beginning of this section or set up custom ones. Monitor the updating displays to see states and access logs.\
Remember: In the Anyscale console, time series metrics are built-in, so you don't have to set anything up. If you're following along locally, download and configuration instructions for integration with Prometheus and Grafana can be found [here](https://docs.ray.io/en/latest/ray-observability/ray-metrics.html#ray-metrics).

In [None]:
### YOUR CODE HERE ###

In [None]:
task.remote()
sample_actor = Actor.remote()
sample_actor.call.remote()

### Summary

#### Key concepts

-   **State API**
    -   A way to inspect the state of resources through CLI and Python SDK.
-   **Dashboard UI** 
    -   A central way to view resources, state, and time series metrics at a glance; a great entry point to the best-supported observability tools in Ray.
-   **Other tools**
    -   Profiling
    -   Debugger

## Ray observability workflows

---

The previous two sections discussed the significance of visibility into distributed systems and introduced the State API and Dashboard UI with a simple Ray application. Having laid the groundwork, you can now walk through some common observability stories organized by typical developer workflows: debugging failures and optimizing performance bottlenecks.

### Debugging

Within the context of Ray, debugging an application means diagnosing failures that emerge in remote processes. This involves the interplay of two main APIs (which have an error handling model [very similar](https://docs.ray.io/en/latest/ray-core/actors/async_api.html#objectrefs-as-concurrent-futures-futures) to standard Python future APIs):

-   **`.remote`** - Creates a task or actor and starts a remote process; the returned object reference will hold an exception if the remote process fails.
-   **`.get`** - Retrieves the result from an object reference; raises an exception if the remote process failed.
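
Because this error model mirrors Python's own futures, it can be illustrated without a cluster. The following is a minimal standard-library sketch, not Ray code: `submit` plays the role of `.remote`, `result` plays the role of `ray.get`, and `flaky_task` is a hypothetical function invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def flaky_task(x: int) -> int:
    """Hypothetical task that fails for negative inputs."""
    if x < 0:
        raise ValueError(f"negative input: {x}")
    return x * 2


with ThreadPoolExecutor(max_workers=2) as pool:
    # Like `.remote`, `submit` returns a future right away,
    # even for the call that is going to fail.
    good = pool.submit(flaky_task, 21)
    bad = pool.submit(flaky_task, -1)

    # Like `ray.get`, `result` returns the value or re-raises the
    # exception that occurred inside the task.
    good_value = good.result()
    try:
        bad.result()
        error_message = None
    except ValueError as exc:
        error_message = str(exc)

print(good_value)
print(error_message)
```

The failure only becomes visible at the point where the result is retrieved, which is why a failed task can go unnoticed until something calls `.get` on its object reference.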

The [exceptions APIs](https://docs.ray.io/en/master/ray-core/package-ref.html#exceptions) can be grouped into a framework of three primary failure modes:

1.  **Application failures** - 
This happens when a remote task or actor fails resulting from errors in user-generated code. Exceptions thrown include `RayTaskError` and `RayActorError`.

2.  **Intentional system failures** - 
These indicate that while Ray has failed, this failure is a deliberate action. Common examples include using cancellation APIs such as `ray.cancel` for tasks or `ray.kill` for actors.

3.  **Unintended system failures** - 
These arise when a remote process has failed due to unforeseen system failures, such as process crashes or node failures. The following are typical cases:
    -  The out of memory killer randomly terminates processes.
    -  The machine is being terminated, particularly in the case of spot instances.
    -  The system is highly overloaded or stressed, leading to failure.
    -  Bugs within Ray Core (relatively infrequent).

Each of these failures necessitates a specialized approach. The following section focuses on an out of memory (OOM) example; refer to the [debugging user guides](https://docs.ray.io/en/latest/ray-observability/monitoring-debugging/troubleshoot-failures.html) for further information on the other types of failures.

#### Example: Out of memory errors.

In the context of distributed computing, an out of memory (OOM) error occurs when a node in a cluster tries to use more memory than the amount available, leading to a failure of the entire system. Ray applications [use memory](https://docs.ray.io/en/latest/ray-core/scheduling/memory-management.html#concepts) in two main ways:

1.  **System memory:** Memory used internally by Ray to manage resources and processes.
2.  **Application memory:** Memory used by your application including creating objects in the object store with `ray.put` and reading them with `ray.get`; objects will [spill to disk](https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html#id1) if the store fills up.

Consider the following example which continuously leaks memory by appending gigabyte arrays of zeros until the node runs out of memory:

<div class="alert alert-info">
  <strong>Warning:</strong> This script is designed to cause a memory leak and lead to an out of memory (OOM) error. Use caution when executing this code and only run it in a controlled environment.
</div>


In [None]:
if ray.is_initialized():
    ray.shutdown()

ray.init()

In [None]:
running = False  # Set to True to run the memory leaker.


@ray.remote(max_retries=0)
def memory_leaker():
    chunks = []
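    bytes_per_chunk = 1024 * 1024 * 1024  # 1 gigabyte.
    while running:
        # Each list element is an 8-byte reference, so divide by 8
        # to make every chunk occupy roughly one gigabyte.
        chunks.append([0] * (bytes_per_chunk // 8))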
        time.sleep(5)  # Delay to observe the leak.


ray.get(memory_leaker.remote())

In this script, you define a Ray task that leaks memory by continuously appending one-gigabyte arrays of zeros to a list. Depending on your resource configuration, memory usage will keep increasing until the OOM killer terminates the task and Ray raises the following error:

```
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
```
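
When sizing a leak like this one, it helps to know how much memory a Python list of zeros actually occupies. The sketch below assumes CPython, where a list stores one pointer per element and every element of `[0] * n` refers to the same shared integer object; `list_of_zeros_size` is a hypothetical helper written for this illustration.

```python
import struct
import sys

# Size of one pointer in bytes: 8 on a 64-bit build of CPython.
POINTER_BYTES = struct.calcsize("P")


def list_of_zeros_size(num_elements: int) -> int:
    """Return the size in bytes of `[0] * num_elements` as reported by CPython."""
    return sys.getsizeof([0] * num_elements)


num_elements = 1_000_000
size_bytes = list_of_zeros_size(num_elements)
print(f"{num_elements:,} elements -> {size_bytes / 1e6:.1f} MB")

# Number of elements needed for a chunk of roughly one gibibyte.
elements_per_gib = (1024 ** 3) // POINTER_BYTES
print(f"~1 GiB chunk needs {elements_per_gib:,} elements")
```

On a 64-bit machine, this means a list with one element per byte of the target size would occupy about eight times the intended amount.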

Note: By default, if a worker dies unexpectedly, Ray will [rerun the process](https://docs.ray.io/en/latest/ray-core/tasks/fault-tolerance.html#retries) up to 3 times. You can specify the number of `max_retries` in the `ray.remote` decorator (e.g. 0 to disable retries, -1 for infinite retries).

Effective use of observability tools is essential for [preventing OOM incidents](https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html). While you may be familiar with using [`htop`](https://htop.dev/) or [`free -h`](https://linuxhint.com/linux-free-command-examples/) to gather snapshots of a system's running processes, these provide limited utility in distributed systems. Ideally, you want a tool that summarizes resource utilization and usage per node and worker without having to specify each one individually. Through the [Ray Dashboard](https://docs.ray.io/en/master/ray-core/ray-dashboard.html#node-view), you can view a live feed of vital information about your cluster and application during OOM events:

-   **Nodes:** Offers a snapshot of memory usage per node and worker.
-   **Metrics:** Shows historical usage on an active cluster via Prometheus and Grafana; comes built-in with the Anyscale console.
-   **Logs:** Accessible, searchable, and filterable via Node and Logs view.

In this section, set up the Ray Dashboard as the focal point for collecting observability data and insights.

#### OOM on Ray Dashboard

Access the Ray Dashboard from the URL provided when calling `ray.init()`. If you haven't already, download and configuration instructions for integration with Prometheus and Grafana can be found [here](https://docs.ray.io/en/latest/ray-observability/ray-metrics.html#ray-metrics).

If you are using the Anyscale console, you can find a quick link to the Dashboard in the "Tools" menu, and all monitoring tools will be built-in for you.

Try running the memory leaker task once more, and this time pay attention to the memory usage per node as well as the historical usage graphs.

#### Coding exercise

Try monitoring another OOM script, this time with a memory leaking actor instead of a task. The workflow should be very similar, except with different components to monitor. Try checking out the "Actors" view in the Ray Dashboard to inspect the leaking actor.

In [None]:
if ray.is_initialized():
    ray.shutdown()

ray.init()

In [None]:
### SAMPLE STARTER SCRIPT ###
import math

@ray.remote
class Leaker:
    def __init__(self):
        self.leaks = []

    def allocate(self, num_bytes: int, sleep_time_s: int):
        # Each element in the array occupies 8 bytes.
        new_list = [0] * math.ceil(num_bytes / 8)
        self.leaks.append(new_list)

        time.sleep(sleep_time_s)


### YOUR CODE HERE ###

#### Example: Hanging errors.

Another common error that Ray users encounter is a hang, which refers to the situation where an object reference created by `.remote()` cannot be retrieved by `.get()`. Hangs can cause delays or failures. You can detect these issues in the following ways:

1.  **State API**
    -   `ray status` - Pending tasks and actors will surface in the "Demands" summary.
2.  **Ray Dashboard**
    -   Progress Bar - Check the [status](https://github.com/ray-project/ray/blob/de2f8da435359bed6c704c1cac288ab06fcaaeca/src/ray/protobuf/common.proto#L648) of tasks and actors.
    -   Metrics - Visualize the state of tasks and actors over time.

Consider the following example that reproduces a lightweight hang (not an indefinite, failing hang error) due to the nature of the dependencies.

#### Coding Exercise

In [None]:
import random

In [None]:
if ray.is_initialized():
    ray.shutdown()

ray.init()

In [None]:
@ray.remote
def long_running_task():
    time.sleep(random.randint(10, 60))

@ray.remote
def dependent_task(dependencies: list[ray.ObjectRef]):
    ray.get(dependencies)

dependencies = [long_running_task.remote() for _ in range(100)]
dependent_task.remote(dependencies)

Here, the `dependent_task` must wait until the dependencies (100 `long_running_task`s) return in order to resolve. Using the State API and Ray Dashboard, you can track the progress of each task and subtask.

While this example represents an engineered hang, in practice hangs typically trace back to a handful of common causes:

1.  **Application bugs**
    -   User-generated bugs or anti-patterns in a Ray application. [Stacktrace](https://docs.ray.io/en/master/ray-observability/monitoring-debugging/profiling.html#python-cpu-profiling-in-the-dashboard) via the Ray Dashboard will offer visibility.
2.  **Resource constraints**
    -   Waiting for available resources, which can be complicated by placement groups.
3.  **Object store memory insufficient**
    -   There's not enough memory to pull objects efficiently to the local node.
4.  **Pending upstream dependencies**
    -   Dependencies may not be scheduled yet or they are still running.

#### Summary

Debugging Ray applications involves detecting errors, accessing snapshots of current usage, and analyzing historical data to home in on the issue. These workflows can be facilitated by using observability tools, especially the Ray Dashboard, which acts as a central repository for collecting metrics and logs about a Ray cluster.

### Optimizing

Within the context of Ray, optimization refers to the process of improving the speed and efficiency of an application. Distributed systems can be costly to run and maintain due to multiple interconnected components needing to work together consistently and reliably. By optimizing performance, organizations can achieve better results and reduce costs.

Ray has two common types of optimization issues: resource constraints and design anti-patterns.

1.  **Resource constraints**
    -   **Number of cores:** Ray will not schedule more tasks than the number of CPUs that it automatically detects.
    -   **Physical vs. logical CPUs:** Speedup is proportional to [physical CPUs, not logical](https://www.linkedin.com/pulse/understanding-physical-logical-cpus-akshay-deshpande).

2.  **Design anti-patterns**
    -   **Small tasks:** Ray introduces overhead for managing each task, so if the task takes less than that overhead (around 10 milliseconds), you are likely to see worse performance.
    -   **Variable durations:** Calling `ray.get` on a batch of tasks with varying durations limits performance to the slowest task.
    -   **Multi-threaded libraries:** When all tasks compete for all resources, you experience contention that prevents runtime improvements.
    -   [**And much more**](https://docs.ray.io/en/latest/ray-core/patterns/index.html).
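
To see why small tasks hurt, it can help to run the numbers. The following back-of-the-envelope sketch uses a deliberately simplified cost model that is an assumption of this example, not a description of Ray's scheduler: each task pays a fixed overhead (the roughly 10 milliseconds mentioned above) on top of its work, and tasks are spread evenly across workers. The function name, workload size, and worker count are all hypothetical.

```python
import math


def estimated_runtime_ms(
    num_items: int,
    work_per_item_ms: int,
    items_per_task: int,
    overhead_per_task_ms: int,
    num_workers: int,
) -> int:
    """Estimate wall-clock time under a simplified fixed-overhead cost model."""
    num_tasks = math.ceil(num_items / items_per_task)
    task_duration_ms = overhead_per_task_ms + items_per_task * work_per_item_ms
    # Each worker runs its share of the tasks back to back.
    tasks_per_worker = math.ceil(num_tasks / num_workers)
    return tasks_per_worker * task_duration_ms


# Hypothetical workload: 10,000 items of 1 ms each on 8 workers.
serial_ms = 10_000 * 1
tiny_tasks_ms = estimated_runtime_ms(10_000, 1, 1, 10, 8)
batched_tasks_ms = estimated_runtime_ms(10_000, 1, 100, 10, 8)

print(f"serial, no Ray:      {serial_ms} ms")
print(f"one item per task:   {tiny_tasks_ms} ms")
print(f"100 items per task:  {batched_tasks_ms} ms")
```

Under these assumptions, submitting one item per task is slower than not parallelizing at all, while batching 100 items per task amortizes the overhead and recovers most of the expected speedup.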

Each use case comes with its own unique set of challenges to [troubleshoot](https://docs.ray.io/en/latest/ray-observability/monitoring-debugging/troubleshoot-performance.html). In the following section, you will walk through a known design anti-pattern and practice using the relevant observability tools to profile the bottleneck.

#### Example: Anti-pattern using `ray.get`

Consider a scenario where you have a batch of independent tasks that are submitted at the same time. Each task can take a variable amount of time to complete. That is, one task may be completed quickly while another may take a long time.

In [this anti-pattern](https://docs.ray.io/en/latest/ray-core/patterns/ray-get-submission-order.html), if you were to call `ray.get()` on the entire batch, your performance would be limited by the longest running task.

|<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/Observability_part_1/perfetto_timeline.png" width="70%" loading="lazy">|
|:--|
|The [Ray timeline](https://docs.ray.io/en/latest/ray-observability/monitoring-debugging/profiling.html#visualizing-tasks-in-the-ray-timeline) provides a high-level view of the tasks that are currently running in your application, how long they take to run, and how well the workload is distributed across all the workers in your cluster. Using `ray.get` to retrieve results from a batch of remote functions is limited by the longest running task in the batch.|

Open the Ray Dashboard, and try running the corresponding anti-pattern code below. Pay special attention to the progress bar for tasks as well as the timeline view.

In [None]:
if ray.is_initialized():
    ray.shutdown()

ray.init()

In [None]:
@ray.remote
def sleep_task(i: float) -> float:
    time.sleep(i)
    return i


def post_processing_step(new_val: float):
    time.sleep(0.5)


big_sleep_times = [25]
small_sleep_times = [random.random() for _ in range(20)]
SLEEP_TIMES = big_sleep_times + small_sleep_times

# Launch remote tasks
refs = [sleep_task.remote(i) for i in SLEEP_TIMES]
for ref in refs:
    # Blocks until this ObjectRef is ready.
    result = ray.get(ref)  # Retrieve result in submission order.
    post_processing_step(result)  # Process the result.

#### Example: Design pattern using `ray.wait()`

Instead of calling `ray.get()` on a batch with variable tasks, [use `ray.wait()`](https://docs.ray.io/en/latest/ray-core/package-ref.html#ray-wait) to get the first object reference ready to return and then use `ray.get()` to retrieve the result. In this way, you can apply the post-processing function as soon as a result becomes available instead of having the submission order potentially slow down the pipeline.
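
The benefit can be estimated before running anything. The sketch below is a pure-Python simulation rather than Ray code. It assumes that every task starts at time zero (that is, the cluster has at least as many free CPUs as tasks) and that post-processing happens sequentially. The helper name `pipeline_time` is hypothetical; the durations mirror the batch used in the anti-pattern (one 25-second task, 20 sub-second tasks, and 0.5 seconds of post-processing per result).

```python
import random
from typing import Sequence


def pipeline_time(task_durations: Sequence[float], post_processing_s: float) -> float:
    """Simulate total wall-clock time when results are consumed in the given order.

    All tasks are assumed to start at t=0 and run in parallel.
    """
    clock = 0.0
    for duration in task_durations:
        # Wait for this task if it has not finished yet, then post-process it.
        clock = max(clock, duration) + post_processing_s
    return clock


durations = [25.0] + [random.random() for _ in range(20)]

# Submission order: the 25-second task is consumed first.
submission_order_s = pipeline_time(durations, 0.5)
# Completion order: results are consumed as they finish.
completion_order_s = pipeline_time(sorted(durations), 0.5)

print(f"submission order: {submission_order_s:.1f} s")
print(f"completion order: {completion_order_s:.1f} s")
```

In this model, consuming results in submission order takes 35.5 seconds because all post-processing waits behind the slow task, whereas consuming them in completion order takes 25.5 seconds because the short results are processed while the slow task is still running. Real timings will differ with the number of available CPUs.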

#### Coding exercise

By running the anti-pattern, you may have noticed that one long-running task will block the entire process when using `ray.get()` on the entire batch.

Implement the design pattern illustrated in the diagram above which uses `ray.wait()` to process results as soon as one becomes available. With the Ray Dashboard open (especially the timeline view), observe the differences between using `ray.get()` and `ray.wait()` for pipelining data submission.

In [None]:
### YOUR CODE HERE ###

#### Solution

In [None]:
if ray.is_initialized():
    ray.shutdown()

ray.init()

In [None]:
### SAMPLE IMPLEMENTATION ###

# Launch remote tasks.
refs = [sleep_task.remote(i) for i in SLEEP_TIMES]
unfinished = refs
while unfinished:
    # Returns the first ObjectRef that is ready.
    finished, unfinished = ray.wait(unfinished, num_returns=1)
    # Retrieve the first ready result.
    result = ray.get(finished[0])
    # Process the result.
    post_processing_step(result)

#### Coding exercise

Another [common anti-pattern](https://docs.ray.io/en/latest/ray-core/tips-for-first-time.html#tip-1-delay-ray-get) involves using `ray.get()` unnecessarily, which leads to performance issues since it's a blocking call.

Run the following anti-pattern and pattern code examples with the Ray Dashboard open. Observe the differences between the two approaches using the time series metrics views.

Can you identify further improvements to this base code? Verify your results with the Ray Dashboard.

In [None]:
if ray.is_initialized():
    ray.shutdown()

ray.init()

In [None]:
@ray.remote
def f(i: int) -> int:
    return i

# Anti-pattern: no parallelism due to calling ray.get inside of the loop.
sequential_returns = []
for i in range(100):
    sequential_returns.append(ray.get(f.remote(i)))

In [None]:
### YOUR CODE HERE ###

#### Solution

In [None]:
if ray.is_initialized():
    ray.shutdown()

ray.init()

In [None]:
### SAMPLE IMPLEMENTATION ###
refs = []
for i in range(100):
    refs.append(f.remote(i))

parallel_returns = ray.get(refs)
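
The effect of delaying the blocking call is not specific to Ray. As a rough analogy, the following standard-library sketch reproduces both variants with `concurrent.futures`, where `result` stands in for `ray.get`. The function `slow_identity`, its 50-millisecond sleep, and the pool size are hypothetical choices made so that the difference is easy to measure.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_identity(i: int) -> int:
    """Hypothetical stand-in for a remote task that takes a little while."""
    time.sleep(0.05)
    return i


with ThreadPoolExecutor(max_workers=8) as pool:
    # Anti-pattern: block on each result before submitting the next task.
    start = time.perf_counter()
    sequential_returns = [pool.submit(slow_identity, i).result() for i in range(8)]
    sequential_s = time.perf_counter() - start

    # Pattern: submit everything first, then collect the results.
    start = time.perf_counter()
    futures = [pool.submit(slow_identity, i) for i in range(8)]
    parallel_returns = [future.result() for future in futures]
    parallel_s = time.perf_counter() - start

print(f"blocking inside the loop: {sequential_s:.2f} s")
print(f"blocking after the loop:  {parallel_s:.2f} s")
```

Both variants return the same values, but only the second lets the tasks overlap.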

#### Summary

The Ray Dashboard offers a range of tools to help you optimize performance and identify performance bottlenecks. It provides a timeline view, progress bar for tasks, and other profiling tools to assist you in improving the speed and efficiency of your application. With these tools, you can analyze different aspects of your application, identify areas for improvement, and resolve any performance issues, ultimately reducing costs.

## Conclusion

---

Congratulations! You have now practiced using the two main observability tools (the [State API](https://docs.ray.io/en/master/ray-observability/state/state-api.html) and the [Dashboard UI](https://docs.ray.io/en/master/ray-core/ray-dashboard.html)) for debugging and optimizing Ray applications. In the next module, you will explore how to take observability beyond interactive development and into production.

### Summary

-   Introduction to the Ray observability toolbox
    -   Observability in distributed systems is essential and challenging given the scale of coordination among [thousands of heterogeneous resources](https://github.com/ray-project/ray/blob/master/release/benchmarks/README.md).
    -   The [State API](https://docs.ray.io/en/master/ray-observability/state/state-api.html) and the [Dashboard UI](https://docs.ray.io/en/master/ray-core/ray-dashboard.html) offer visibility into Ray applications.
-   Ray observability workflows
    -   Debugging
        -   Monitor and debug errors through the Ray Dashboard where you can access logs, time series, metrics, and more.
    -   Optimizing
        -   Identify performance bottlenecks by inspecting the progress bar, timeline, and node view available in the Ray Dashboard.
-   Advanced exploration
    -   Ray Observability Part 2
    -   [Ray Observability Documentation](https://docs.ray.io/en/latest/ray-observability/index.html)
    -   [Ray Observability Roadmap (upcoming releases)](https://github.com/ray-project/ray/issues/30097)

# Connect with the Ray community

You can learn and get more involved with the Ray community of developers and researchers:

* [**Ray documentation**](https://docs.ray.io/en/latest)

* [**Official Ray site**](https://www.ray.io/)  
Browse the ecosystem and use this site as a hub to get the information that you need to get going and building with Ray.

* [**Join the community on Slack**](https://forms.gle/9TSdDYUgxYs8SA9e8)  
Find friends to discuss your new learnings in our Slack space.

* [**Use the discussion board**](https://discuss.ray.io/)  
Ask questions, follow topics, and view announcements on this community forum.

* [**Join a meetup group**](https://www.meetup.com/Bay-Area-Ray-Meetup/)  
Tune in on meet-ups to listen to compelling talks, get to know other users, and meet the team behind Ray.

* [**Open an issue**](https://github.com/ray-project/ray/issues/new/choose)  
Ray is constantly evolving to improve developer experience. Submit feature requests, bug-reports, and get help via GitHub issues.

* [**Become a Ray contributor**](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html)  
We welcome community contributions to improve our documentation and Ray framework.
