# Ray and Anyscale Observability Introduction

© 2025, Anyscale. All Rights Reserved

This notebook gives an overview of Ray and Anyscale observability features and uses an example job to compare them.

<div class="alert alert-block alert-info">
<b>What you'll learn:</b>
<ul>
    <li>Ray and Anyscale observability capabilities</li>
</ul>
</div>


## Ray Observability

In the context of Ray, observability is the ability for users to monitor and reason about the behavior of Ray applications and clusters through external signals such as logs, metrics, and events.

- **Logs**: Log messages from Ray applications running on both driver and worker processes (potentially across multiple machines).

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/1-Overview/ray_logs.png" width="70%" loading="lazy">

- **Metrics**: Physical statistics (e.g., CPU, memory, GPU, disk, and network usage of each node), internal statistics (e.g., number of actors in the cluster, number of worker failures), and custom application metrics defined by users.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/1-Overview/ray_metrics.png" width="70%" loading="lazy">

- **Events**: Chronologically ordered events associated with specific components (e.g., autoscaler or job events) that provide insights into system behavior and state changes.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/1-Overview/ray_events.png" width="70%" loading="lazy">

Ray provides the Ray Dashboard as a built-in tool for monitoring and debugging applications. To learn more, see the [Ray Dashboard documentation](https://docs.ray.io/en/latest/ray-observability/getting-started.html).

<img src="https://docs.ray.io/en/latest/_images/what-is-ray-observability.png" width="80%" loading="lazy">

## Anyscale Observability

Anyscale retains this native dashboard and extends it through its managed platform with three additional components that contextualize observability data:

- **Anyscale Metrics**: Provides persistent, system-level metrics for infrastructure and performance monitoring.
- **Anyscale Logs**: Provides centralized log storage and querying capabilities.
- **Ray Workloads**: Supports managing, tracking, and versioning individual Ray application deployments.

While there is some overlap between Ray’s open-source observability tools and Anyscale’s managed observability stack, each serves distinct roles with important differences in scope, persistence, and use case focus.

<div style="display: flex; gap: 20px; margin: 20px 0;">
<div style="flex: 1; padding: 15px; border: 2px solid #2196F3; border-radius: 8px; background-color: #E3F2FD; color: #1565C0;">
<h4 style="color: #0D47A1;">Ray Dashboard (OSS)</h4>
<strong>Where:</strong> Ray head node <br>
<strong>When:</strong> Only available while cluster is running<br>
<strong>What:</strong> Raw text logs, Metrics, Events
</div>
<div style="flex: 1; padding: 15px; border: 2px solid #FF9800; border-radius: 8px; background-color: #FFF3E0; color: #E65100;">
<h4 style="color: #BF360C;">Anyscale Observability</h4>
<strong>Where:</strong> Anyscale backend infrastructure<br>
<strong>When:</strong> Always available (persistent)<br>
<strong>What:</strong> Contextualized metrics, logs, and workload views that build on what the Ray Dashboard (OSS) provides
</div>
</div>

## Example

Contextualized observability is observability that users can easily understand based on the context they already have. Anyscale can provide it because it:

- **Knows what users' workloads are**: Workload observability can be highly contextualized.
- **Controls the full stack from telemetry to visualization**: The tools can be deeply integrated and native to Ray and its libraries.
- **Runs workloads at extremely large scale**: Operating at this scale itself creates a strong need for good observability.
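
To make the idea of "context" concrete, the sketch below attaches workload fields to every log record so that a log store could filter on them. It uses only Python's standard `logging` module and is an illustration of the general technique, not Anyscale's implementation; the `JsonFormatter` class, the `make_logger` helper, and the `job` and `stage` field names are all hypothetical.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so fields can be filtered later."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Hypothetical context fields, attached by the LoggerAdapter below.
            "job": getattr(record, "job", None),
            "stage": getattr(record, "stage", None),
        }
        return json.dumps(payload)


def make_logger(job: str, stage: str) -> logging.LoggerAdapter:
    """Return a logger that stamps every record with the given job and stage."""
    logger = logging.getLogger(f"{job}.{stage}")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logging.LoggerAdapter(logger, {"job": job, "stage": stage})


# Illustrative names only; any job and stage labels would work.
log = make_logger("synthetic-image-generator", "write_parquet")
log.warning("memory usage above 90%")
```

With the context carried in each record, a question such as "which stage of which job logged this warning?" becomes a field lookup rather than a text search across many nodes.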


Anyscale contextualizes observability data to enhance what Ray's open-source observability tools provide. The following example shows how Anyscale observability helps debug a job that uses more than 1,000 nodes.

The following script creates 1 billion (10**9) synthetic images and writes them to Parquet files (see `example/main.py` for the full code).

```python
import ray
import numpy as np
import os
import time
from typing import Dict, Any

# Each image is about 1 MB (H x W x C = 580 x 580 x 3 bytes, roughly 1 MB),
# so 1 billion images total about 1 PB (10^9 * 1 MB = 1 PB).
NUM_IMAGES = 10**9
IMAGE_WIDTH = 580
IMAGE_HEIGHT = 580
CHANNELS = 3


def generate_synthetic_image(image_id: int, width: int = 580, height: int = 580, channels: int = 3) -> Dict[str, Any]:
    """Generate one random uint8 image and return it with its ID and metadata."""
    # Random pixel values in [0, 255], shaped (height, width, channels).
    image_array = np.random.randint(0, 256, size=(height, width, channels), dtype=np.uint8)
    return {
        "image_id": image_id,
        "image_array": image_array,
        "metadata": {
            "dtype": str(image_array.dtype),
            "shape": image_array.shape,
            "generated_by": "ray_data_synthetic"
        }
    }


if __name__ == "__main__":
    # Materialize every image ID as a Python list on the driver.
    image_ids = list(range(NUM_IMAGES))
    output_path = os.path.join(os.environ["ANYSCALE_ARTIFACT_STORAGE"], "rkn/synthetic_image_output")

    ds = ray.data.from_items(image_ids)
    # Split the dataset into blocks of about 1,000 rows each.
    ds = ds.repartition(target_num_rows_per_block=1000)
    # Generate one synthetic image per ID, then write the results to Parquet.
    ds = ds.map(lambda x: generate_synthetic_image(x["item"], IMAGE_WIDTH, IMAGE_HEIGHT, CHANNELS))
    ds.write_parquet(output_path)
```
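
As a quick sanity check on the sizes quoted in the script, the arithmetic can be reproduced in plain Python. This is a rough sketch that counts only raw, uncompressed pixel bytes: it ignores the `image_id` and `metadata` fields and any Parquet compression. `ROWS_PER_BLOCK` is a name introduced here to mirror `target_num_rows_per_block=1000`.

```python
# Back-of-the-envelope sizing for the synthetic image job (raw pixel bytes only).
IMAGE_HEIGHT, IMAGE_WIDTH, CHANNELS = 580, 580, 3
NUM_IMAGES = 10**9
ROWS_PER_BLOCK = 1000  # mirrors target_num_rows_per_block in the script

bytes_per_image = IMAGE_HEIGHT * IMAGE_WIDTH * CHANNELS  # uint8: 1 byte per value
bytes_per_block = bytes_per_image * ROWS_PER_BLOCK
num_blocks = NUM_IMAGES // ROWS_PER_BLOCK
total_bytes = bytes_per_image * NUM_IMAGES

print(f"per image: {bytes_per_image / 1e6:.2f} MB")   # 1.01 MB
print(f"per block: {bytes_per_block / 1e9:.2f} GB")   # 1.01 GB
print(f"blocks:    {num_blocks:,}")                   # 1,000,000
print(f"total:     {total_bytes / 1e15:.2f} PB")      # 1.01 PB
```

Under these assumptions each block carries roughly 1 GB of pixel data, which is useful to keep in mind when reading the memory behavior described later in this section.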

The following job configuration file starts this job (see `example/job.yaml` for the full code). Replace the cloud name if needed.

```yaml
name: synthetic-image-generator

image_uri: anyscale/ray:2.48.0-py312-cu128

compute_config:
  worker_nodes:
    - instance_type: m5.12xlarge
      max_nodes: 500
      market_type: PREFER_SPOT # (Optional) Defaults to ON_DEMAND
    - instance_type: m5.16xlarge
      max_nodes: 500
      market_type: PREFER_SPOT # (Optional) Defaults to ON_DEMAND
    - instance_type: m5.24xlarge
      max_nodes: 500
```
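
To get a feel for the scale this configuration allows, the sketch below totals the maximum node count, vCPUs, and memory across the three worker groups. The vCPU and memory figures come from AWS's published m5 instance specifications, not from `job.yaml`, and the `WorkerGroup` class is a hypothetical helper introduced only for this calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerGroup:
    instance_type: str
    vcpus: int        # per node, from AWS m5 specs (not in job.yaml)
    memory_gib: int   # per node, from AWS m5 specs (not in job.yaml)
    max_nodes: int    # from job.yaml


WORKER_GROUPS = [
    WorkerGroup("m5.12xlarge", 48, 192, 500),
    WorkerGroup("m5.16xlarge", 64, 256, 500),
    WorkerGroup("m5.24xlarge", 96, 384, 500),
]

# Upper bounds if every group scales to its max_nodes limit.
max_nodes = sum(g.max_nodes for g in WORKER_GROUPS)
max_vcpus = sum(g.vcpus * g.max_nodes for g in WORKER_GROUPS)
max_memory_gib = sum(g.memory_gib * g.max_nodes for g in WORKER_GROUPS)

print(f"max nodes:  {max_nodes:,}")
print(f"max vCPUs:  {max_vcpus:,}")
print(f"max memory: {max_memory_gib:,} GiB")
print(f"memory per vCPU: {max_memory_gib / max_vcpus:.1f} GiB")
```

All three instance types work out to 4 GiB of memory per vCPU, so if roughly one task runs per vCPU (an assumption about scheduling, not something the configuration states), each task has a fairly small memory budget relative to the data it handles.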

Add the correct **Anyscale Cloud ID**, then run the job with the following command:

```python
!cd example/ && anyscale job submit -f job.yaml
```

When the job starts, memory usage increases gradually across all workers until some workers exceed 90%. As the job continues, the first out-of-memory (OOM) failures occur on the smaller instances. As more workers fail, the remaining workers become overloaded.

This job causes:

- **Memory pressure**: Some workers experience OOM failures
- **Storage I/O bottlenecks**: 1000+ workers writing simultaneously
- **Network saturation**: Data transfer between workers
- **Cost explosion**: 1000+ instances running for hours
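
With more than 1,000 nodes, even the memory-pressure symptom is hard to spot by eye. The sketch below shows the kind of reduction that helps: turning per-node memory readings into a short list of nodes at or above the 90% level mentioned above. The `nodes_under_memory_pressure` function and the sample readings are hypothetical and purely illustrative; they are not output from this job or from any Ray or Anyscale API.

```python
from typing import Dict, List

WARN_THRESHOLD = 0.90  # the "90%+" level mentioned in the text


def nodes_under_memory_pressure(
    usage_by_node: Dict[str, float], threshold: float = WARN_THRESHOLD
) -> List[str]:
    """Return names of nodes whose used-memory fraction is at or above
    the threshold, ordered from highest usage to lowest."""
    flagged = [(frac, node) for node, frac in usage_by_node.items() if frac >= threshold]
    return [node for _, node in sorted(flagged, reverse=True)]


# Hypothetical sample readings (fraction of memory in use per node).
sample = {"worker-a": 0.93, "worker-b": 0.71, "worker-c": 0.97}
print(nodes_under_memory_pressure(sample))  # ['worker-c', 'worker-a']
```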

The Ray Dashboard provides only raw metrics and logs for this job, and the overwhelming volume of data makes it difficult for users to identify actionable insights:

<div style="display: flex; gap: 20px; margin: 20px 0;">
<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/1-Overview/1000_node_raydash_metrics.png" width="45%" loading="lazy">
<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/1-Overview/1000_node_raydash_logs.png" width="45%" loading="lazy">
</div>

Additionally, the dashboard becomes completely inaccessible when the head node fails due to out-of-memory (OOM) errors.

Since Ray workers continuously send metrics and logs to Anyscale's backend infrastructure, Anyscale observability remains accessible even after the Ray head node fails. 

Anyscale contextualizes these metrics and logs, enabling users to easily monitor these issues:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/1-Overview/OOM_1.png" width="70%" loading="lazy">

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/1-Overview/OOM_2.png" width="70%" loading="lazy">

With the Ray Workloads view, users can see at which step of the data pipeline the OOM failure happens:

<div style="display: flex; gap: 20px; margin: 20px 0;">
<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/1-Overview/ray_dashboard_1.png" width="45%" loading="lazy">
<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/1-Overview/ray_dashboard_2.png" width="45%" loading="lazy">
</div>

The graphs above show that the data pipeline was stuck writing images to the Parquet files (materializing images).

Logs are also available for querying in the logs view:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/1-Overview/log_err.png" width="70%" loading="lazy">

This example shows that the Ray Dashboard provides raw metrics, status, and logs for workloads, while Anyscale observability remains available after jobs complete or fail and contextualizes this information into a more user-friendly observability experience.