## Example

Contextualized observability is observability that users can readily understand from the context they already have. Anyscale is well positioned to provide it because it:

- **Knows what the user's workloads are**: Workload observability can be highly contextualized.
- **Controls the full stack from telemetry to visualization**: The tools can be deeply integrated with, and native to, Ray and its libraries.
- **Runs workloads at extremely large scale**: Operating at this scale itself creates a strong need for good observability.


Anyscale contextualizes observability to enhance what is available from Ray OSS Observability. The following example shows how Anyscale observability helps debug a job that uses more than 1,000 nodes.

The following script creates 1 billion (10**9) synthetic images and writes them to Parquet files (see `example/main.py` for the full code).

In [None]:
import ray
import numpy as np
import os
import time
from typing import Dict, Any

"""
Each image is about 1MB. (HxWxC = 580x580x3 = 1MB)
So 1 billion images would be 1 PB. (10^9 * 1MB = 1PB)
"""
NUM_IMAGES = 10**9
IMAGE_WIDTH = 580
IMAGE_HEIGHT = 580
CHANNELS = 3


def generate_synthetic_image(image_id: int, width: int = 580, height: int = 580, channels: int = 3) -> Dict[str, Any]:
    image_array = np.random.randint(0, 256, size=(height, width, channels), dtype=np.uint8)
    return {
        "image_id": image_id,
        "image_array": image_array,
        "metadata": {
            "dtype": str(image_array.dtype),
            "shape": image_array.shape,
            "generated_by": "ray_data_synthetic"
        }
    }


if __name__ == "__main__":
    image_ids = list(range(NUM_IMAGES))
    output_path = os.path.join(os.environ["ANYSCALE_ARTIFACT_STORAGE"], "rkn/synthetic_image_output")

    ds = ray.data.from_items(image_ids)
    ds = ds.repartition(target_num_rows_per_block=1000)
    ds = ds.map(lambda x: generate_synthetic_image(x["item"], IMAGE_WIDTH, IMAGE_HEIGHT, CHANNELS))
    ds.write_parquet(output_path)
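
The sizing claim in the script's docstring can be checked with a short calculation. The following is a minimal sketch that uses only the constants from the script above and decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes). It counts raw pixel data only, so the metadata fields and any Parquet encoding overhead are ignored.

```python
# Back-of-the-envelope sizing for the synthetic image job.
# All constants mirror the script above; only raw pixel bytes are counted.
NUM_IMAGES = 10**9
IMAGE_WIDTH = 580
IMAGE_HEIGHT = 580
CHANNELS = 3
BYTES_PER_VALUE = 1    # np.uint8 stores one byte per channel value
ROWS_PER_BLOCK = 1000  # target_num_rows_per_block passed to repartition()

bytes_per_image = IMAGE_HEIGHT * IMAGE_WIDTH * CHANNELS * BYTES_PER_VALUE
total_bytes = NUM_IMAGES * bytes_per_image
num_input_blocks = NUM_IMAGES // ROWS_PER_BLOCK
pixel_bytes_per_1000_rows = ROWS_PER_BLOCK * bytes_per_image

print(f"bytes per image:         {bytes_per_image:,}")
print(f"total dataset size:      {total_bytes / 10**15:.4f} PB")
print(f"input blocks of IDs:     {num_input_blocks:,}")
print(f"pixel data per 1000 IDs: {pixel_bytes_per_1000_rows / 10**9:.2f} GB")
```

Each block of 1000 image IDs therefore expands into roughly 1 GB of pixel data once the `map` step runs, which is one reason memory usage grows quickly as the job progresses.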

The following job configuration file starts this job (see `example/job.yaml` for the full code). Replace the cloud name if needed.

In [None]:
name: synthetic-image-generator

image_uri: anyscale/ray:2.48.0-py312-cu128

compute_config:
  worker_nodes:
    - instance_type: m5.12xlarge
      max_nodes: 500
      market_type: PREFER_SPOT # (Optional) Defaults to ON_DEMAND
    - instance_type: m5.16xlarge
      max_nodes: 500
      market_type: PREFER_SPOT # (Optional) Defaults to ON_DEMAND
    - instance_type: m5.24xlarge
      max_nodes: 500
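
To put this configuration in perspective, the sketch below compares the raw dataset size with the memory the worker groups could provide if every group scaled to its `max_nodes` limit. The vCPU and memory figures are an assumption taken from the published AWS specifications for these `m5` sizes and are not part of the job configuration. The head node and the actual autoscaling behavior are ignored.

```python
# Rough capacity estimate for the worker groups in job.yaml.
# ASSUMPTION: vCPU and memory values are the published AWS specs for each
# instance type; they do not come from the job configuration itself.
INSTANCE_SPECS = {
    "m5.12xlarge": {"vcpus": 48, "memory_gib": 192},
    "m5.16xlarge": {"vcpus": 64, "memory_gib": 256},
    "m5.24xlarge": {"vcpus": 96, "memory_gib": 384},
}
MAX_NODES_PER_GROUP = 500  # max_nodes for every worker group in job.yaml
GIB = 2**30

# Raw pixel bytes for 10**9 images of 580 x 580 x 3 uint8 values.
dataset_bytes = 10**9 * 580 * 580 * 3

total_nodes = MAX_NODES_PER_GROUP * len(INSTANCE_SPECS)
total_vcpus = MAX_NODES_PER_GROUP * sum(s["vcpus"] for s in INSTANCE_SPECS.values())
total_memory_gib = MAX_NODES_PER_GROUP * sum(
    s["memory_gib"] for s in INSTANCE_SPECS.values()
)
dataset_to_memory_ratio = (dataset_bytes / GIB) / total_memory_gib

print(f"max worker nodes:        {total_nodes:,}")
print(f"max worker vCPUs:        {total_vcpus:,}")
print(f"max worker memory:       {total_memory_gib:,} GiB")
print(f"dataset / worker memory: {dataset_to_memory_ratio:.2f}x")
```

Under these assumptions the dataset is more than twice the size of the aggregate worker memory even at full scale, so the job can only succeed if data is written out as fast as it is generated.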

After you add the correct **Anyscale Cloud ID**, run this job with the following command:

In [None]:
!cd example/ && anyscale job submit -f job.yaml

When the job starts, memory usage increases gradually across all workers until some workers' memory usage exceeds 90%. As the job continues running, the first out-of-memory (OOM) failures occur on the smaller instances. As more workers fail, the remaining workers become overloaded.

This job causes:

- **Memory pressure**: Some workers experience OOM failures
- **Storage I/O bottlenecks**: 1000+ workers writing simultaneously
- **Network saturation**: Data transfer between workers
- **Cost explosion**: 1000+ instances running for hours

The Ray Dashboard provides only raw metrics and logs for this job, and the overwhelming volume of data makes it difficult for users to identify actionable insights:

<div style="display: flex; gap: 20px; margin: 20px 0;">
<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/1-Overview/1000_node_raydash_metrics.png" width="45%" loading="lazy">
<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/1-Overview/1000_node_raydash_logs.png" width="45%" loading="lazy">
</div>

Additionally, the dashboard becomes completely inaccessible when the head node fails due to out-of-memory (OOM) errors.

Since Ray workers continuously send metrics and logs to Anyscale's backend infrastructure, Anyscale observability remains accessible even after the Ray head node fails. 

Anyscale contextualizes these metrics and logs, enabling users to easily monitor these issues:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/1-Overview/OOM_1.png" width="70%" loading="lazy">

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/1-Overview/OOM_2.png" width="70%" loading="lazy">

With the Ray Workload view, users can identify the step in the data pipeline at which the OOM occurs:

<div style="display: flex; gap: 20px; margin: 20px 0;">
<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/1-Overview/ray_dashboard_1.png" width="45%" loading="lazy">
<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/1-Overview/ray_dashboard_2.png" width="45%" loading="lazy">
</div>

The graphs above show that the data pipeline was stuck writing images to the Parquet files (materializing images).

Logs are also available for querying in the logs view:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/1-Overview/log_err.png" width="70%" loading="lazy">

This example demonstrates that, while the Ray Dashboard provides raw metrics, status, and logs for workloads, Anyscale observability remains available even after jobs complete or fail and contextualizes this information to deliver a much more user-friendly observability experience.