# Ray and Anyscale Observability in Detail

© 2025, Anyscale. All Rights Reserved

Ray and Anyscale provide a few different dashboards to support observability. This notebook presents examples that show when and how to use Ray and Anyscale observability dashboards for monitoring and debugging.

### **Prerequisites**

Before beginning this course, ensure you have:

- **Basic Ray knowledge**: Familiarity with fundamental Ray Data and Ray Serve concepts and operations
- **Basic data engineering knowledge**: Familiarity with data engineering pipelines and web application backend concepts
- **Anyscale platform experience**: Previous experience using the Anyscale platform is recommended

<div class="alert alert-block alert-info">
<b> Here is the roadmap for this notebook:</b>
<ul>
    <li><b>1. </b> Data Pipeline Observability (Ray Data)
        <ul>
            <li>Run a simple data pipeline</li>
            <li>Ray Data Logs</li>
            <li>Ray Data Metrics</li>
            <li>Ray Workloads Data Dashboard</li>
        </ul>
    </li>
    <li><b>2. </b> Web Application Observability (Ray Serve)
        <ul>
            <li>Ray Serve Metrics</li>
            <li>Ray Serve Logs</li>
            <li>Ray Serve Tracing</li>
            <li>Ray Serve Alerts</li>
            <li>Anyscale Ray Serve Observability</li>
        </ul>
    </li>
</ul>
</div>

All examples are runnable on the Anyscale console, and some are also compatible with local Ray clusters (clearly labeled). This course demonstrates the examples using the Anyscale console.

💻 **Local environment**: This notebook can be run locally. Steps to launch a local Ray cluster are in the [Setup Guide](../01_general_intro_and_setup.ipynb).

🚀 **Anyscale platform**: Consider running this notebook on a Ray cluster. Register to start a cluster via the Anyscale console: [Sign up](https://console.anyscale.com/register).

## Data Pipeline Observability (Ray Data)

#### Run a simple Data Pipeline

💻 **Local environment**: Change the value of **default_cluster_storage** to a path in your local environment.

Let's start with an example pipeline that we will use to demonstrate Ray Data's observability tools.

In [None]:
%%writefile simple_pipeline.py
import ray
import time
import pyarrow.fs as fs

default_cluster_storage = "/mnt/cluster_storage/observed_data/"

# Read one month of NYC yellow taxi trip data from a public S3 bucket (anonymous access).
s3_fs = fs.S3FileSystem(anonymous=True)
ds = ray.data.read_parquet(
    "s3://anyscale-public-materials/nyc-taxi-cab/yellow_tripdata_2011-05.parquet",
    filesystem=s3_fs
)

def slow_adjust_total_amount(batch):
    # Artificial delay so the pipeline runs long enough to inspect in the dashboards.
    time.sleep(10)
    batch["adjusted_total_amount"] = batch["total_amount"] - batch["tip_amount"]
    return batch

ds = ds.map_batches(slow_adjust_total_amount)
ds.write_parquet(default_cluster_storage)

print("Done!")

Let's now execute the pipeline from the terminal:

In [None]:
# copy and run this inside a terminal
# !python simple_pipeline.py

To view the running job, open the Ray Dashboard and click the **Jobs** tab.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/experian/jobs_tab_v2.png" width="800">

Let's click on the job to see its overview.

Under the Job overview page, let's click on the **Ray Data Overview** section to see our dataset.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/experian/data_overview_collapsed_v2.png" width="800">

The dataset currently has the auto-generated name "dataset_{index}".

Let's expand the dataset to see the operators/stages in the pipeline.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/experian/data_overview_expanded_v2.png" width="800">


For each operator, you can view:
- the number of output blocks
- the state

These are the two key pieces of information for monitoring your pipeline.

#### Ray Data Logs

Next, let's look at the Ray Data logs that are printed in the terminal and stored on disk.

The logs can help us find backpressured operators, i.e. operators that can't add more tasks to process their inputs.

Here is an explanation of the Ray Data logs:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/experian/progress_bar_annotated.png" width="800">


#### Ray Data Metrics

<!-- Similar to overview + logs but timeseries along with more internal metrics. -->

Now let's take a look at the Ray Data Dashboard.

The Ray Data Dashboard provides a more detailed view of the pipeline, including a timeseries view of the dataset operators and metrics.

To open the Ray Data Dashboard, go to "Ray Dashboard > Metrics tab", click the "Open in Grafana" dropdown, and select the Ray Data Dashboard.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/experian/metrics-grafana-access.png">

You should see a dashboard similar to the one below:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/experian/data-dashboard-outlook.png" width="800">

Here we can get throughput metrics for each operator in the pipeline (e.g. number of blocks/rows/bytes processed per second).

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/experian/data-dashboard-througput.png" width="800">
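A per-second rate such as rows per second can be derived from a cumulative counter by dividing the change in the counter by the elapsed time. The sketch below shows that arithmetic only. The `(timestamp, cumulative_rows)` samples and the function name are hypothetical and are not Ray Data metric names or APIs.

```python
def per_second_rates(samples):
    """Turn cumulative (timestamp_s, total) samples into per-second rates."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order samples
        rates.append((t1, (v1 - v0) / elapsed))
    return rates


# Hypothetical samples: cumulative rows output by one operator over 30 seconds.
rows_output = [(0, 0), (10, 5000), (20, 5000), (30, 20000)]
rates = per_second_rates(rows_output)
for timestamp, rows_per_second in rates:
    print(f"t={timestamp}s: {rows_per_second} rows/s")
```

A flat stretch in the counter (here between 10s and 20s) shows up as a rate of zero, which is the kind of stall a throughput panel makes visible.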


#### Ray Workloads Data Dashboard

🚀 **Anyscale Platform**: This view only exists in Anyscale.

The Ray Workloads Data dashboard shows the detailed execution status of a Ray Data pipeline (throughput, resources, etc.):

<div style="display: flex; gap: 20px; margin: 20px 0;">
<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/3-Ray-Data/ray_data_1.png" width="80%" loading="lazy">
</div>

<div style="display: flex; gap: 20px; margin: 20px 0;">
<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/3-Ray-Data/ray_data_2.png" width="80%" loading="lazy">
</div>

You can also inspect the details of any operator in the pipeline:
<div style="display: flex; gap: 20px; margin: 20px 0;">
<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/3-Ray-Data/ray_data_3.png" width="80%" loading="lazy">
</div>

## Web Application Observability (Ray Serve)

#### Launching a Web Application using Ray Serve

**Imports**

In [None]:
import logging
import time
import requests
from ray import serve
import json
import numpy as np

💻 **Local environment**: Change the value of **local_path** to your local path.

Deploy a web application:

In [None]:
# run the app with default config
!cd scripts/ && serve run main:mnist_app --non-blocking --name app1

Now send HTTP requests by running the following script. It generates traffic to the Ray Serve web application for 60 seconds, so you can explore its observability features in the subsequent sections.

In [None]:
start = time.time()
while time.time() - start < 60:
    images = np.random.rand(2, 1, 28, 28).tolist()
    json_request = json.dumps({"image": images})
    response = requests.post("http://localhost:8000/", json=json_request)
    response.json()["predicted_label"]

#### Ray Serve Metrics

The following metrics are Ray Serve specific:

- **Throughput metrics:**
    - Queries per second (QPS)
    - Error QPS
    - Error by error code QPS

Shown are the throughput metrics for the web application above. To find them, click "Metrics" -> "VIEW IN GRAFANA" -> "Dashboards" -> "Serve Dashboard".

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-serve/throughput_per_application.png" alt="Ray Serve Metrics" width="800">

- **Latency metrics:**
    - P50, P90, P99 latencies

Shown are the latency metrics for the MNIST application.

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-serve/latency_per_application.png" alt="Ray Serve Latency Metrics" width="800">
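To make the P50, P90 and P99 labels concrete, here is a minimal sketch that computes them from raw latency samples with Python's standard library. The sample values and the function name are hypothetical. Dashboards generally estimate percentiles from aggregated metrics, so their numbers may differ slightly from a computation over raw samples like this one.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return the P50, P90 and P99 of a list of latency samples."""
    # quantiles(n=100) returns the 99 cut points P1..P99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}


# Hypothetical latencies in milliseconds: 1, 2, ..., 101.
samples = list(range(1, 102))
percentiles = latency_percentiles(samples)
print(percentiles)
```

P50 describes the typical request, while P90 and P99 describe the slow tail, which is why the dashboard plots them separately.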

- **Latency and throughput metrics are available at different levels of granularity:**
    - Per-application metrics
    - Per-deployment metrics
    - Per-replica metrics

Shown are the latency metrics on the deployment level.

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-serve/latency_per_deployment.png" alt="Ray Serve Latency Metrics" width="800">

- **Deployment-specific metrics:**
    - Number of replicas
    - Queue size (TODO - explain which queue)

Shown are the number of replicas and queue size for the MNIST application.

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-serve/replicas_per_deployment.png" alt="Ray Serve Deployment Metrics" width="400">

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-serve/queue_size_per_deploymnet.png" alt="Ray Serve Deployment Metrics" width="400">


For Anyscale users, the following metrics are also available:
- **Rollout-specific metrics:**
    - QPS per version
    - Error QPS per version
    - P90 latency per version
    - Number of replicas per version

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-serve/rollouts_per_application.png" alt="Ray Serve Rollout Metrics" width="800">

For details on how to set up custom dashboards and alerts, refer to this [guide in the Anyscale docs](https://docs.anyscale.com/monitoring/custom-dashboards-and-alerting).

#### Ray Serve Logs

💻 Local Environment

To understand system-level behavior and to surface application-level details during runtime, you can leverage Ray logging.

**Implementation:**
- Uses Python's standard logging module
- Logger name is "ray.serve"

**Log Output Locations:**
- Logs are sent to stderr
- Logs are written to disk at `/tmp/ray/session_latest/logs/serve/`
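
As a quick way to see what has been written to that directory, the following sketch lists the most recently modified files in it. Only the directory path comes from this notebook. The helper name and the `limit` parameter are hypothetical, and the file names you see depend on your Ray version and session.

```python
from pathlib import Path

# Log directory named in this notebook.
SERVE_LOG_DIR = Path("/tmp/ray/session_latest/logs/serve/")


def newest_log_files(log_dir=SERVE_LOG_DIR, limit=5):
    """Return up to `limit` files in `log_dir`, most recently modified first."""
    log_dir = Path(log_dir)
    if not log_dir.is_dir():
        return []  # nothing has been written here yet
    files = [p for p in log_dir.iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]


for path in newest_log_files():
    print(path.name)
```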

**Types of Logs Captured:**
- System-level logs (from Serve controller and proxy)
- Access logs
- Custom user logs from deployment replicas

**Development Environment Behavior:**
- Logs are streamed to the driver Ray program
- Driver program can be either:
    - Python script calling serve.run()
    - serve run CLI command

<div class="alert alert-info">

**Note:**
Because Ray Serve uses Python's standard logging module, aggressive logging inside your application incurs a performance penalty. Use logging levels to control the verbosity of your logs and to avoid this penalty when running in production.

</div>
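
The note above can be sketched with the standard `logging` module alone. The `handle` and `summarize` functions are hypothetical stand-ins for request handling code. The two techniques shown, lazy `%`-style arguments and an `isEnabledFor` guard, are standard Python logging features rather than anything specific to Ray Serve.

```python
import logging

logger = logging.getLogger("ray.serve")


def summarize(payload):
    """Stand-in for an expensive, debug-only computation (hypothetical)."""
    return f"{len(payload)} items, max={max(payload)}"


def handle(payload):
    # Lazy %-style arguments: the message is only formatted if INFO is enabled.
    logger.info("received %d items", len(payload))
    # Guard expensive work so it is skipped entirely unless DEBUG is enabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("payload summary: %s", summarize(payload))
    return sum(payload)


print(handle([1, 2, 3]))
```

With the log level at `INFO` or higher, `summarize` is never called, so the debug statement costs little more than a level check.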

Here is how to use logging in a deployment.

In [None]:
@serve.deployment()
class SayHelloDefaultLogging:
    async def __call__(self):
        logger = logging.getLogger("ray.serve")
        logger.info("hello world")


serve.run(SayHelloDefaultLogging.bind())

resp = requests.get("http://localhost:8000/")

#### Logging Configuration

💻 Local Environment

Here are the common configurations for logging.

- `enable_access_log`: Whether access logs are written to the Replica and Proxy logs. By default, it is `True`.
- `log_level`: Set the log level. By default, it is `INFO`.
- `encoding`: Set the format of the logs, `TEXT` or `JSON`. In open-source Ray Serve, the default is `TEXT`.

You can set the logging configuration:
- At the deployment level
- At the serve instance level

Both levels can be configured programmatically or via a configuration file.


In [None]:
@serve.deployment(logging_config={"log_level": "DEBUG"})
class SayHelloDebugLogging:
    async def __call__(self):
        logger = logging.getLogger("ray.serve")
        logger.debug("hello world")


serve.run(
    SayHelloDebugLogging.bind(),
    logging_config={
        "encoding": "JSON",
        "log_level": "INFO",
        "enable_access_log": False,
    },
)

resp = requests.get("http://localhost:8000/")

#### Ray Serve Tracing (Anyscale Only)

To perform end-to-end tracing of requests, you can use the Anyscale Tracing integration.

See the [tracing guide](https://docs.anyscale.com/monitoring/tracing/) for details.

After following [README.md](tracing_example/README.md) to run the example, a single request's tracing logs display the following hierarchical structure:

```
1. proxy_http_request (Root) - Duration: 245ms
   └── 2. proxy_route_to_replica (APIGateway) - Duration: 240ms
       └── 3. replica_handle_request (APIGateway) - Duration: 235ms
           └── 4. proxy_route_to_replica (UserService) - Duration: 180ms
               └── 5. replica_handle_request (UserService) - Duration: 175ms
                   └── 6. proxy_route_to_replica (DatabaseService) - Duration: 110ms
                       └── 7. replica_handle_request (DatabaseService) - Duration: 105ms
```
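
Each duration in the trace includes the time spent in the spans nested beneath it. One way to read such a trace is to subtract the child's duration from its parent's to get the time spent in each span itself. The sketch below does this for the durations listed above. It assumes a linear chain in which every span has exactly one child, which holds for this example but not for traces in general, and the function name is hypothetical.

```python
# (span name, deployment, total duration in ms), ordered from parent to child,
# copied from the example trace.
spans = [
    ("proxy_http_request", "Root", 245),
    ("proxy_route_to_replica", "APIGateway", 240),
    ("replica_handle_request", "APIGateway", 235),
    ("proxy_route_to_replica", "UserService", 180),
    ("replica_handle_request", "UserService", 175),
    ("proxy_route_to_replica", "DatabaseService", 110),
    ("replica_handle_request", "DatabaseService", 105),
]


def self_times(chain):
    """Time spent in each span, excluding its single nested child."""
    result = []
    for i, (name, deployment, duration) in enumerate(chain):
        child_duration = chain[i + 1][2] if i + 1 < len(chain) else 0
        result.append((name, deployment, duration - child_duration))
    return result


for name, deployment, ms in self_times(spans):
    print(f"{name} ({deployment}): {ms}ms")
```

In this example most of the 245ms is spent inside the three `replica_handle_request` spans, with routing adding about 5ms per hop.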

#### Ray Serve Alerts

**Alert Types:**
Grafana [can alert](https://grafana.com/docs/grafana/v7.5/alerting/) based on:
- Metric values
- Rate of change
- Metric absence
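
The three alert types can be illustrated with plain Python conditions over a time series. This is only a sketch of the logic each type expresses, not Grafana's alerting engine or its configuration format. The function names, thresholds, and sample data are all hypothetical.

```python
def value_above(points, threshold):
    """Metric value: fire when the latest value exceeds a threshold."""
    return bool(points) and points[-1][1] > threshold


def rate_of_change_above(points, max_per_second):
    """Rate of change: fire when the last two points differ too quickly."""
    if len(points) < 2:
        return False
    (t0, v0), (t1, v1) = points[-2], points[-1]
    return t1 > t0 and abs(v1 - v0) / (t1 - t0) > max_per_second


def is_absent(points, now, max_age_seconds):
    """Metric absence: fire when no point has arrived recently."""
    return not points or now - points[-1][0] > max_age_seconds


# Hypothetical P90 latency samples: (timestamp in seconds, latency in ms).
p90_latency = [(0, 120), (60, 130), (120, 400)]
print(value_above(p90_latency, threshold=300))
print(rate_of_change_above(p90_latency, max_per_second=2))
print(is_absent(p90_latency, now=600, max_age_seconds=300))
```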

**Notification Options:**
- Supports multiple [notification channels](https://grafana.com/docs/grafana/v7.5/alerting/notifications/#add-a-notification-channel) (Slack, PagerDuty, etc.)
- Email support is planned for the future
- Configurable through notification channels

**Documentation:** Full setup details are available in Grafana's [official documentation](https://grafana.com/docs/grafana/v7.5/alerting/).

#### Anyscale Ray Serve Observability

🚀 **Anyscale Platform**: This view only exists in Anyscale.

When deploying a web application to Anyscale, utilize Anyscale's advanced observability features to debug and manage the service.

The Logs view enables viewing and filtering logs:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/4-Ray-Serve/ray_serve_log_filter.png" width="80%" loading="lazy">

Since each deployment has a version ID, rolling back to an older version is easy through the **Versions** view. Zero-downtime rollouts can be performed with one click:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/4-Ray-Serve/ray_serve_versions.png" width="80%" loading="lazy">
