polish observability (o11y) docs #39069

Merged · 12 commits · Sep 8, 2023
1 change: 1 addition & 0 deletions doc/source/_toc.yml
@@ -371,6 +371,7 @@ parts:
- file: ray-observability/user-guides/debug-apps/ray-debugging
- file: ray-observability/user-guides/cli-sdk
- file: ray-observability/user-guides/configure-logging
- file: ray-observability/user-guides/profiling
- file: ray-observability/user-guides/add-app-metrics
- file: ray-observability/user-guides/ray-tracing
- file: ray-observability/reference/index
16 changes: 13 additions & 3 deletions doc/source/cluster/configure-manage-dashboard.md
@@ -209,6 +209,9 @@ Configure these settings using the `RAY_GRAFANA_HOST`, `RAY_PROMETHEUS_HOST`, `R
* Set `RAY_GRAFANA_IFRAME_HOST` to an address that the user's browsers can use to access Grafana and embed visualizations. If `RAY_GRAFANA_IFRAME_HOST` is not set, Ray Dashboard uses the value of `RAY_GRAFANA_HOST`.

For example, if the IP of the head node is 55.66.77.88 and Grafana is hosted on port 3000, set the value to `RAY_GRAFANA_HOST=http://55.66.77.88:3000`.
* If you start a single-node Ray Cluster manually, make sure these environment variables are set and accessible before you start the cluster, or pass them as a prefix to the `ray start ...` command, e.g., `RAY_GRAFANA_HOST=http://55.66.77.88:3000 ray start ...`
* If you start a Ray Cluster with {ref}`VM Cluster Launcher <cloud-vm-index>`, the environment variables should be set under `head_start_ray_commands` as a prefix to the `ray start ...` command.
* If you start a Ray Cluster with {ref}`KubeRay <kuberay-index>`, refer to this {ref}`tutorial <kuberay-prometheus-grafana>`.

If all the environment variables are set properly, you should see time-series metrics in {ref}`Ray Dashboard <observability-getting-started>`.

@@ -237,7 +240,7 @@ When both Grafana and the Ray Cluster are on the same Kubernetes cluster, set `R


#### User authentication for Grafana
When the Grafana instance requires user authentication, the following settings have to be in its `configuration file <https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/>`_ to correctly embed in Ray Dashboard:
When the Grafana instance requires user authentication, the following settings have to be in its [configuration file](https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/) to correctly embed in Ray Dashboard:

```ini
[security]
@@ -248,8 +251,15 @@ When the Grafana instance requires user authentication, the following settings h

#### Troubleshooting

##### Grafana dashboards are not embedded in the Ray dashboard
If you're getting an error that says `RAY_GRAFANA_HOST` is not setup despite having set it up, check that:
##### Dashboard message: either the Prometheus or Grafana server is not detected
If you have followed the instructions above to set up everything, run the connection checks below in your browser (or with a small script, as sketched after this list):
* Check the Head Node connection to the Prometheus server: append `api/prometheus_health` to the Ray Dashboard URL (for example: http://127.0.0.1:8265/api/prometheus_health) and visit it.
* Check the Head Node connection to the Grafana server: append `api/grafana_health` to the Ray Dashboard URL (for example: http://127.0.0.1:8265/api/grafana_health) and visit it.
* Check the browser connection to the Grafana server: visit the URL used in `RAY_GRAFANA_IFRAME_HOST`.
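
If you prefer to script these checks, here is a minimal sketch using Python's `requests` library; the Dashboard and Grafana addresses are example values, so substitute the ones from your own deployment:

```python
import requests

# Example addresses -- replace with your deployment's values.
DASHBOARD_URL = "http://127.0.0.1:8265"
GRAFANA_IFRAME_HOST = "http://55.66.77.88:3000"  # the value of RAY_GRAFANA_IFRAME_HOST

# Head Node -> Prometheus and Grafana connectivity, checked through the Dashboard API.
for endpoint in ("api/prometheus_health", "api/grafana_health"):
    response = requests.get(f"{DASHBOARD_URL}/{endpoint}", timeout=5)
    print(endpoint, response.status_code, response.text[:200])

# Browser -> Grafana connectivity; run this from the machine where the browser runs.
response = requests.get(GRAFANA_IFRAME_HOST, timeout=5)
print("RAY_GRAFANA_IFRAME_HOST reachable:", response.status_code == 200)
```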


##### Getting an error that says `RAY_GRAFANA_HOST` is not set up
If you have set up Grafana, check that:
* You've included the protocol in the URL (e.g., `http://your-grafana-url.com` instead of `your-grafana-url.com`).
* The URL doesn't have a trailing slash (e.g., `http://your-grafana-url.com` instead of `http://your-grafana-url.com/`).

Binary file modified doc/source/images/profile.png
6 changes: 6 additions & 0 deletions doc/source/ray-observability/getting-started.rst
@@ -51,6 +51,12 @@ When you start a single-node Ray Cluster on your laptop, access the dashboard wi

INFO worker.py:1487 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265.

.. note::

If you start Ray in a Docker container, ``--dashboard-host`` is a required parameter. For example, ``ray start --head --dashboard-host=0.0.0.0``.



When you start a remote Ray Cluster with the :ref:`VM Cluster Launcher <vm-cluster-quick-start>`, :ref:`KubeRay operator <kuberay-quickstart>`, or manual configuration, Ray Dashboard launches on the head node but the dashboard port may not be publicly exposed. See :ref:`configuring the dashboard <dashboard-in-browser>` for instructions on viewing the Dashboard from outside the Head Node.

.. note::
10 changes: 5 additions & 5 deletions doc/source/ray-observability/key-concepts.rst
@@ -75,13 +75,13 @@ Profiling
---------
Profiling is a way of analyzing the performance of an application by sampling its resource usage. Ray supports various profiling tools:

- CPU profiling for Worker processes, including integration with :ref:`py-spy <dashboard-profiling>` and :ref:`cProfile <dashboard-cprofile>`
- Memory profiling for Worker processes with :ref:`memray <memray-profiling>`
- Built in Task and Actor profiling tool called :ref:`ray timeline <ray-core-timeline>`
- CPU profiling for Driver and Worker processes, including integration with :ref:`py-spy <profiling-pyspy>` and :ref:`cProfile <profiling-cprofile>`
- Memory profiling for Driver and Worker processes with :ref:`memray <profiling-memray>`
- GPU profiling with :ref:`PyTorch Profiler <profiling-pytoch-profiler>`
- Built-in Task and Actor profiling tool, :ref:`Ray Timeline <profiling-timeline>`

Ray doesn't provide native integration with GPU profiling tools. Try running GPU profilers like `Pytorch Profiler`_ without Ray to identify the issues.
View :ref:`Profiling <profiling>` for more details. Note that this list isn't comprehensive; feel free to contribute to it if you find other useful tools.

.. _`Pytorch Profiler`: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html

Tracing
-------
@@ -4,8 +4,7 @@ Debugging Hangs
===============
View stack traces in Ray Dashboard
-----------------------------------
The :ref:`Ray dashboard <observability-getting-started>` lets you profile Ray worker processes by clicking on the "Stack Trace"
actions for active worker processes, actors, and job's driver process.
The :ref:`Ray dashboard <observability-getting-started>` lets you profile Ray Driver and Worker processes by clicking the "CPU profiling" or "Stack Trace" actions for active Worker processes, Tasks, Actors, and the Job's driver process.

.. image:: /images/profile.png
:align: center
@@ -19,16 +18,19 @@ trace is shown. To show native code frames, set the URL parameter ``native=1`` (
:width: 60%

.. note::
If you run Ray in a Docker container, you may run into permission errors when viewing the stack traces. Follow the `py-spy documentation`_ to resolve it.
You may run into permission errors when using py-spy inside Docker containers. To fix the issue:

* If you start Ray manually in a Docker container, follow the `py-spy documentation`_ to resolve it.
* If you are a KubeRay user, follow the :ref:`guide to configure KubeRay <kuberay-pyspy-integration>` to resolve it.

.. _`py-spy documentation`: https://github.com/benfred/py-spy#how-do-i-run-py-spy-in-docker


Use ``ray stack`` CLI command
------------------------------

You can run ``ray stack`` to dump the stack traces of all Ray Worker processes on
the current node. This requires ``py-spy`` to be installed.
Once ``py-spy`` is installed (it is installed automatically if the "Ray Dashboard" component is included when :ref:`installing Ray <installation>`), you can run ``ray stack`` to dump the stack traces of all Ray Worker processes on the current node.

This document discusses some common problems that people run into when using Ray
as well as some known problems. If you encounter other problems, please
@@ -173,7 +173,6 @@ It is also possible tasks and actors use more memory than you expect. For exampl

View the instructions below to learn how to memory profile individual actors and tasks.


.. _memray-profiling:

Memory Profiling Ray tasks and actors
@@ -115,8 +115,11 @@ not have root permissions, the Dashboard prompts with instructions on how to set
Alternatively, you can start Ray with passwordless sudo / root permissions.

.. note::
If you run Ray in a Docker container, you may run into permission errors when using py-spy. Follow the `py-spy documentation`_ to resolve it.

You may run into permission errors when using py-spy inside Docker containers. To fix the issue:

* If you start Ray manually in a Docker container, follow the `py-spy documentation`_ to resolve it.
* If you are a KubeRay user, follow the :ref:`guide to configure KubeRay <kuberay-pyspy-integration>` to resolve it.

.. _`py-spy documentation`: https://github.com/benfred/py-spy#how-do-i-run-py-spy-in-docker


@@ -324,14 +327,25 @@ Our example in total now takes only 1.5 seconds to run:
20 0.001 0.000 0.001 0.000 worker.py:514(submit_task)
...

GPU Profiling
------------------------
Ray doesn't provide native integration with GPU profiling tools. Try running GPU profilers like `Pytorch Profiler`_ without Ray to identify the issues.

If you have related feature requests, `let us know`_.
.. _performance-debugging-gpu-profiling:

GPU Profiling with PyTorch Profiler
-----------------------------------
Here are the steps to use PyTorch Profiler during training with Ray Train or batch inference with Ray Data (a minimal sketch follows the list):

* Follow the `PyTorch Profiler documentation <https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html>`_ to record events in your PyTorch code.

* Convert your PyTorch script to a :ref:`Ray Train training script <train-pytorch>` or a :ref:`Ray Data batch inference script <batch_inference_home>` (no changes to your profiler-related code are needed).

* Run your training or batch inference script.

* Collect the profiling results from all the nodes (instead of from a single node as in a non-distributed setting).

* You may want to upload the results from each node to NFS or an object store like S3 so that you don't have to fetch the results from each node individually.

* Visualize the results with tools like TensorBoard.
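
For reference, here is a minimal sketch of what a profiled Ray Train training function might look like. It assumes a recent Ray version where ``TorchTrainer`` and ``ScalingConfig`` are importable from ``ray.train``; the trace directory, model, and profiler settings are illustrative placeholders rather than prescribed values.

.. code-block:: python

    import torch
    from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func():
        # Ordinary PyTorch training code, wrapped in the PyTorch Profiler.
        model = torch.nn.Linear(10, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        with profile(
            activities=[ProfilerActivity.CPU],  # add ProfilerActivity.CUDA on GPU workers
            # Each worker writes its trace locally; collect or upload the
            # trace directories afterwards, as described in the steps above.
            on_trace_ready=tensorboard_trace_handler("/tmp/profiler_traces"),
        ):
            for _ in range(10):
                loss = model(torch.randn(8, 10)).sum()
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()

    # Run the profiled training function on two Ray Train workers.
    trainer = TorchTrainer(train_func, scaling_config=ScalingConfig(num_workers=2))
    trainer.fit()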

.. _`let us know`: https://github.com/ray-project/ray/issues
.. _`Pytorch Profiler`: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html

Profiling for Developers
------------------------
67 changes: 67 additions & 0 deletions doc/source/ray-observability/user-guides/profiling.md
@@ -0,0 +1,67 @@
(profiling)=
# Profiling
Profiling is one of the most important debugging tools for diagnosing performance issues, out-of-memory errors, hangs, and other application problems.
Here is a list of common profiling tools you may use when debugging Ray applications.
- CPU profiling
- py-spy
- Memory profiling
- memray
- GPU profiling
- PyTorch Profiler
- Ray Task / Actor timeline

If Ray doesn't work with certain profiling tools, try running them without Ray to debug the issues.

(profiling-cpu)=
## CPU profiling
Profile the CPU usage for Driver and Worker processes. This helps you understand the CPU usage by different processes and debug unexpectedly high or low usage.

(profiling-pyspy)=
### py-spy
[py-spy](https://github.com/benfred/py-spy/tree/master) is a sampling profiler for Python programs. Ray Dashboard has native integration with py-spy:

- It lets you visualize what your Python program is spending time on without restarting the program or modifying the code in any way.
- It dumps the stack trace of the running process so that you can see what the process is doing at a certain time. It is useful when a program hangs.

:::{note}
You may run into permission errors when using py-spy inside Docker containers. To fix the issue:

- If you start Ray manually in a Docker container, follow the [py-spy documentation](https://github.com/benfred/py-spy#how-do-i-run-py-spy-in-docker) to resolve it.
- If you are a KubeRay user, follow the {ref}`guide to configure KubeRay <kuberay-pyspy-integration>` to resolve it.
:::

Here are the {ref}`steps to use py-spy with Ray and Ray Dashboard <observability-debug-hangs>`.

(profiling-cprofile)=
### cProfile
cProfile is Python's native profiling module for profiling the performance of your Ray application.

Here are the {ref}`steps to use cProfile <dashboard-cprofile>`.
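
As a quick illustration of the idea (the linked guide has the full workflow), here is a minimal sketch that profiles driver-side code submitting Ray Tasks; the remote function is just a placeholder:

```python
import cProfile

import ray

@ray.remote
def square(x):
    return x * x

def run_tasks():
    # Submit Tasks and block on the results; most driver-side time typically
    # shows up in task submission and ray.get().
    return ray.get([square.remote(i) for i in range(100)])

if __name__ == "__main__":
    ray.init()
    # Print a cumulative-time-sorted profile of the driver code to stdout.
    cProfile.run("run_tasks()", sort="cumtime")
```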

(profiling-memory)=
## Memory profiling
Profile the memory usage of Driver and Worker processes. This helps you analyze memory allocations in applications, trace memory leaks, and debug high or low memory usage and out-of-memory issues.

(profiling-memray)=
### memray
memray is a memory profiler for Python. It can track memory allocations in Python code, in native extension modules, and in the Python interpreter itself.

Here are the {ref}`steps to profile the memory usage of Ray Tasks and Actors <memray-profiling>`.
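
For a taste of what that looks like, here is a minimal sketch that wraps a Task body in a `memray.Tracker`; the output path is an arbitrary example and assumes it is writable on the node where the Task runs:

```python
import memray

import ray

@ray.remote
def allocate_memory(num_chunks: int) -> int:
    # Write a memray capture file for this Task's worker process.
    # memray will not overwrite an existing capture file, so use a fresh path per run.
    with memray.Tracker(f"/tmp/memray_task_{num_chunks}.bin"):
        chunks = [bytes(1024) for _ in range(num_chunks)]
        return len(chunks)

ray.init()
print(ray.get(allocate_memory.remote(100_000)))
# Inspect the capture afterwards, for example:
#   memray flamegraph /tmp/memray_task_100000.bin
```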

(profiling-gpu)=
## GPU profiling
Profile the GPU and GRAM usage of GPU workloads like distributed training. This helps you analyze performance and debug memory issues.
- PyTorch Profiler is supported out of the box when used with Ray Train.
  [Review thread on the line above]
  Contributor: Add a link?
  Author: What link? Ray Train doesn't have a doc for it.
  Contributor: Hmm, so we don't have any example for this?
  Author: Neither Ray Train nor Ray Data has a doc about it yet. We should ask them to add it later.
- NVIDIA Nsight System is not natively supported yet. Leave your comments in this [feature request for Nsight System support](https://github.com/ray-project/ray/issues/19631).

(profiling-pytoch-profiler)=
### PyTorch Profiler
PyTorch Profiler is a tool that allows the collection of performance metrics (especially GPU metrics) during training and inference.

Here are the {ref}`steps to use PyTorch Profiler with Ray Train or Ray Data <performance-debugging-gpu-profiling>`.

(profiling-timeline)=
## Ray Task / Actor timeline
Ray Timeline profiles the execution time of Ray Tasks and Actors. This helps you analyze performance, identify stragglers, and understand the distribution of workloads.

Open your Ray Job in Ray Dashboard and follow the {ref}`instructions to download and visualize the trace files <dashboard-timeline>` generated by Ray Timeline.
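
You can also produce a trace file programmatically with `ray.timeline`, as in this minimal sketch (the output path is just an example):

```python
import ray

@ray.remote
def step(i):
    return i * i

ray.init()
ray.get([step.remote(i) for i in range(100)])

# Dump recent Task events in Chrome-tracing format.
ray.timeline(filename="/tmp/ray_timeline.json")
# Open the file in a trace viewer such as Perfetto UI or chrome://tracing.
```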