polish observability (o11y) docs #39069

Merged · 12 commits · Sep 8, 2023
1 change: 1 addition & 0 deletions doc/source/_toc.yml
@@ -371,6 +371,7 @@ parts:
- file: ray-observability/user-guides/debug-apps/ray-debugging
- file: ray-observability/user-guides/cli-sdk
- file: ray-observability/user-guides/configure-logging
- file: ray-observability/user-guides/profiling
- file: ray-observability/user-guides/add-app-metrics
- file: ray-observability/user-guides/ray-tracing
- file: ray-observability/reference/index
16 changes: 13 additions & 3 deletions doc/source/cluster/configure-manage-dashboard.md
@@ -209,6 +209,9 @@ Configure these settings using the `RAY_GRAFANA_HOST`, `RAY_PROMETHEUS_HOST`, `R
* Set `RAY_GRAFANA_IFRAME_HOST` to an address that the user's browsers can use to access Grafana and embed visualizations. If `RAY_GRAFANA_IFRAME_HOST` is not set, Ray Dashboard uses the value of `RAY_GRAFANA_HOST`.

For example, if the IP of the head node is 55.66.77.88 and Grafana is hosted on port 3000, set the value to `RAY_GRAFANA_HOST=http://55.66.77.88:3000`.
* If you start a single-node Ray Cluster manually, make sure these environment variables are set and accessible before you start the cluster, or pass them as a prefix to the `ray start ...` command, e.g., `RAY_GRAFANA_HOST=http://55.66.77.88:3000 ray start ...`
* If you start a Ray Cluster with {ref}`VM Cluster Launcher <cloud-vm-index>`, the environment variables should be set under `head_start_ray_commands` as a prefix to the `ray start ...` command.
* If you start a Ray Cluster with {ref}`KubeRay <kuberay-index>`, refer to this {ref}`tutorial <kuberay-prometheus-grafana>`.

If all the environment variables are set properly, you should see time-series metrics in {ref}`Ray Dashboard <observability-getting-started>`.

@@ -237,7 +240,7 @@ When both Grafana and the Ray Cluster are on the same Kubernetes cluster, set `R


#### User authentication for Grafana
When the Grafana instance requires user authentication, the following settings have to be in its `configuration file <https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/>`_ to correctly embed in Ray Dashboard:
When the Grafana instance requires user authentication, the following settings have to be in its [configuration file](https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/) to correctly embed in Ray Dashboard:

```ini
[security]
@@ -248,8 +251,15 @@ When the Grafana instance requires user authentication, the following settings h

#### Troubleshooting

##### Grafana dashboards are not embedded in the Ray dashboard
If you're getting an error that says `RAY_GRAFANA_HOST` is not setup despite having set it up, check that:
##### Dashboard message: either the Prometheus or Grafana server is not detected
If you have followed the instructions above to set up everything, run the connection checks below in your browser (or with a small script, as sketched after this list):
* Check the Head Node connection to the Prometheus server: append `api/prometheus_health` to the Ray Dashboard URL (for example: http://127.0.0.1:8265/api/prometheus_health) and visit it.
* Check the Head Node connection to the Grafana server: append `api/grafana_health` to the Ray Dashboard URL (for example: http://127.0.0.1:8265/api/grafana_health) and visit it.
* Check the browser connection to the Grafana server: visit the URL used in `RAY_GRAFANA_IFRAME_HOST`.
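
If you prefer to script these checks, here is a minimal sketch using Python's `requests` library; the Dashboard and Grafana addresses are example values, so substitute the ones from your own deployment:

```python
import requests

# Example addresses -- replace with your deployment's values.
DASHBOARD_URL = "http://127.0.0.1:8265"
GRAFANA_IFRAME_HOST = "http://55.66.77.88:3000"  # the value of RAY_GRAFANA_IFRAME_HOST

# Head Node -> Prometheus and Grafana connectivity, checked through the Dashboard API.
for endpoint in ("api/prometheus_health", "api/grafana_health"):
    response = requests.get(f"{DASHBOARD_URL}/{endpoint}", timeout=5)
    print(endpoint, response.status_code, response.text[:200])

# Browser -> Grafana connectivity; run this from the machine where the browser runs.
response = requests.get(GRAFANA_IFRAME_HOST, timeout=5)
print("RAY_GRAFANA_IFRAME_HOST reachable:", response.status_code == 200)
```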


##### Getting an error that says `RAY_GRAFANA_HOST` is not set up
If you have set up Grafana, check that:
* You've included the protocol in the URL (e.g., `http://your-grafana-url.com` instead of `your-grafana-url.com`).
* The URL doesn't have a trailing slash (e.g., `http://your-grafana-url.com` instead of `http://your-grafana-url.com/`).

Binary file modified doc/source/images/profile.png
6 changes: 6 additions & 0 deletions doc/source/ray-observability/getting-started.rst
@@ -51,6 +51,12 @@ When you start a single-node Ray Cluster on your laptop, access the dashboard wi

INFO worker.py:1487 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265.

.. note::

If you start Ray in a Docker container, ``--dashboard-host`` is a required parameter. For example, ``ray start --head --dashboard-host=0.0.0.0``.



When you start a remote Ray Cluster with the :ref:`VM Cluster Launcher <vm-cluster-quick-start>`, :ref:`KubeRay operator <kuberay-quickstart>`, or manual configuration, Ray Dashboard launches on the head node but the dashboard port may not be publicly exposed. See :ref:`configuring the dashboard <dashboard-in-browser>` for instructions on viewing the Dashboard from outside the Head Node.

.. note::
10 changes: 5 additions & 5 deletions doc/source/ray-observability/key-concepts.rst
@@ -75,13 +75,13 @@ Profiling
---------
Profiling is a way of analyzing the performance of an application by sampling its resource usage. Ray supports various profiling tools:

- CPU profiling for Worker processes, including integration with :ref:`py-spy <dashboard-profiling>` and :ref:`cProfile <dashboard-cprofile>`
- Memory profiling for Worker processes with :ref:`memray <memray-profiling>`
- Built in Task and Actor profiling tool called :ref:`ray timeline <ray-core-timeline>`
- CPU profiling for Driver and Worker processes, including integration with :ref:`py-spy <profiling-pyspy>` and :ref:`cProfile <profiling-cprofile>`
- Memory profiling for Driver and Worker processes with :ref:`memray <profiling-memray>`
- GPU profiling with :ref:`PyTorch Profiler <profiling-pytoch-profiler>`
- Built-in Task and Actor profiling tool, :ref:`Ray Timeline <profiling-timeline>`

Ray doesn't provide native integration with GPU profiling tools. Try running GPU profilers like `Pytorch Profiler`_ without Ray to identify the issues.
View :ref:`Profiling <profiling>` for more details. Note that this list isn't comprehensive; feel free to contribute to it if you find other useful tools.

.. _`Pytorch Profiler`: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html

Tracing
-------
@@ -4,8 +4,7 @@ Debugging Hangs
===============
View stack traces in Ray Dashboard
-----------------------------------
The :ref:`Ray dashboard <observability-getting-started>` lets you profile Ray worker processes by clicking on the "Stack Trace"
actions for active worker processes, actors, and job's driver process.
The :ref:`Ray dashboard <observability-getting-started>` lets you profile Ray Driver and Worker processes by clicking the "CPU profiling" or "Stack Trace" actions for active Worker processes, Tasks, Actors, and the Job's driver process.

.. image:: /images/profile.png
:align: center
@@ -19,16 +18,19 @@ trace is shown. To show native code frames, set the URL parameter ``native=1`` (
:width: 60%

.. note::
If you run Ray in a Docker container, you may run into permission errors when viewing the stack traces. Follow the `py-spy documentation`_ to resolve it.
You may run into permission errors when using py-spy inside Docker containers. To fix the issue:

* If you start Ray manually in a Docker container, follow the `py-spy documentation`_ to resolve it.
* If you are a KubeRay user, follow the :ref:`guide to configure KubeRay <kuberay-pyspy-integration>` to resolve it.

.. _`py-spy documentation`: https://github.com/benfred/py-spy#how-do-i-run-py-spy-in-docker


Use ``ray stack`` CLI command
------------------------------

You can run ``ray stack`` to dump the stack traces of all Ray Worker processes on
the current node. This requires ``py-spy`` to be installed.
Once ``py-spy`` is installed (it is installed automatically if the "Ray Dashboard" component is included when :ref:`installing Ray <installation>`), you can run ``ray stack`` to dump the stack traces of all Ray Worker processes on the current node.

This document discusses some common problems that people run into when using Ray
as well as some known problems. If you encounter other problems, please
@@ -173,7 +173,6 @@ It is also possible tasks and actors use more memory than you expect. For exampl

View the instructions below to learn how to memory profile individual actors and tasks.


.. _memray-profiling:

Memory Profiling Ray tasks and actors
@@ -115,8 +115,11 @@ not have root permissions, the Dashboard prompts with instructions on how to set
Alternatively, you can start Ray with passwordless sudo / root permissions.

.. note::
If you run Ray in a Docker container, you may run into permission errors when using py-spy. Follow the `py-spy documentation`_ to resolve it.

You may run into permission errors when using py-spy inside Docker containers. To fix the issue:

* If you start Ray manually in a Docker container, follow the `py-spy documentation`_ to resolve it.
* If you are a KubeRay user, follow the :ref:`guide to configure KubeRay <kuberay-pyspy-integration>` to resolve it.

.. _`py-spy documentation`: https://github.com/benfred/py-spy#how-do-i-run-py-spy-in-docker


@@ -324,14 +327,25 @@ Our example in total now takes only 1.5 seconds to run:
20 0.001 0.000 0.001 0.000 worker.py:514(submit_task)
...

GPU Profiling
------------------------
Ray doesn't provide native integration with GPU profiling tools. Try running GPU profilers like `Pytorch Profiler`_ without Ray to identify the issues.

If you have related feature requests, `let us know`_.
.. _performance-debugging-gpu-profiling:

GPU Profiling with PyTorch Profiler
-----------------------------------
Here are the steps to use PyTorch Profiler during training with Ray Train or batch inference with Ray Data (a minimal sketch follows the list):

* Follow the `PyTorch Profiler documentation <https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html>`_ to record events in your PyTorch code.

* Convert your PyTorch script to a :ref:`Ray Train training script <train-pytorch>` or a :ref:`Ray Data batch inference script <batch_inference_home>` (no changes to your profiler-related code are needed).

* Run your training or batch inference script.

* Collect the profiling results from all the nodes (instead of from a single node as in a non-distributed setting).

* You may want to upload the results from each node to NFS or an object store like S3 so that you don't have to fetch the results from each node individually.

* Visualize the results with tools like TensorBoard.
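
For reference, here is a minimal sketch of what a profiled Ray Train training function might look like. It assumes a recent Ray version where ``TorchTrainer`` and ``ScalingConfig`` are importable from ``ray.train``; the trace directory, model, and profiler settings are illustrative placeholders rather than prescribed values.

.. code-block:: python

    import torch
    from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func():
        # Ordinary PyTorch training code, wrapped in the PyTorch Profiler.
        model = torch.nn.Linear(10, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        with profile(
            activities=[ProfilerActivity.CPU],  # add ProfilerActivity.CUDA on GPU workers
            # Each worker writes its trace locally; collect or upload the
            # trace directories afterwards, as described in the steps above.
            on_trace_ready=tensorboard_trace_handler("/tmp/profiler_traces"),
        ):
            for _ in range(10):
                loss = model(torch.randn(8, 10)).sum()
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()

    # Run the profiled training function on two Ray Train workers.
    trainer = TorchTrainer(train_func, scaling_config=ScalingConfig(num_workers=2))
    trainer.fit()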

.. _`let us know`: https://github.com/ray-project/ray/issues
.. _`Pytorch Profiler`: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html

Profiling for Developers
------------------------
67 changes: 67 additions & 0 deletions doc/source/ray-observability/user-guides/profiling.md
@@ -0,0 +1,67 @@
(profiling)=
# Profiling
Profiling is one of the most important debugging tools for diagnosing performance issues, out-of-memory errors, hangs, and other application problems.
Here is a list of common profiling tools you may use when debugging Ray applications.
- CPU profiling
- py-spy
- Memory profiling
- memray
- GPU profiling
- PyTorch Profiler
- Ray Task / Actor timeline

If Ray doesn't work with certain profiling tools, try running them without Ray to debug the issues.

(profiling-cpu)=
## CPU profiling
Profile the CPU usage for Driver and Worker processes. This helps you understand the CPU usage by different processes and debug unexpectedly high or low usage.

(profiling-pyspy)=
### py-spy
[py-spy](https://github.com/benfred/py-spy/tree/master) is a sampling profiler for Python programs. Ray Dashboard has native integration with py-spy:

- It lets you visualize what your Python program is spending time on without restarting the program or modifying the code in any way.
- It dumps the stack trace of the running process so that you can see what the process is doing at a certain time. It is useful when a program hangs.

:::{note}
You may run into permission errors when using py-spy inside Docker containers. To fix the issue:

- If you start Ray manually in a Docker container, follow the [py-spy documentation](https://github.com/benfred/py-spy#how-do-i-run-py-spy-in-docker) to resolve it.
- If you are a KubeRay user, follow the {ref}`guide to configure KubeRay <kuberay-pyspy-integration>` to resolve it.
:::

Here are the {ref}`steps to use py-spy with Ray and Ray Dashboard <observability-debug-hangs>`.

(profiling-cprofile)=
### cProfile
cProfile is Python's native profiling module for profiling the performance of your Ray application.

Here are the {ref}`steps to use cProfile <dashboard-cprofile>`.
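
As a quick illustration of the idea (the linked guide has the full workflow), here is a minimal sketch that profiles driver-side code submitting Ray Tasks; the remote function is just a placeholder:

```python
import cProfile

import ray

@ray.remote
def square(x):
    return x * x

def run_tasks():
    # Submit Tasks and block on the results; most driver-side time typically
    # shows up in task submission and ray.get().
    return ray.get([square.remote(i) for i in range(100)])

if __name__ == "__main__":
    ray.init()
    # Print a cumulative-time-sorted profile of the driver code to stdout.
    cProfile.run("run_tasks()", sort="cumtime")
```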

(profiling-memory)=
## Memory profiling
Profile the memory usage of Driver and Worker processes. This helps you analyze memory allocations in applications, trace memory leaks, and debug high or low memory usage and out-of-memory issues.

(profiling-memray)=
### memray
memray is a memory profiler for Python. It can track memory allocations in Python code, in native extension modules, and in the Python interpreter itself.

Here are the {ref}`steps to profile the memory usage of Ray Tasks and Actors <memray-profiling>`.
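
For a taste of what that looks like, here is a minimal sketch that wraps a Task body in a `memray.Tracker`; the output path is an arbitrary example and assumes it is writable on the node where the Task runs:

```python
import memray

import ray

@ray.remote
def allocate_memory(num_chunks: int) -> int:
    # Write a memray capture file for this Task's worker process.
    # memray will not overwrite an existing capture file, so use a fresh path per run.
    with memray.Tracker(f"/tmp/memray_task_{num_chunks}.bin"):
        chunks = [bytes(1024) for _ in range(num_chunks)]
        return len(chunks)

ray.init()
print(ray.get(allocate_memory.remote(100_000)))
# Inspect the capture afterwards, for example:
#   memray flamegraph /tmp/memray_task_100000.bin
```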

(profiling-gpu)=
## GPU profiling
Profile the GPU and GRAM usage of GPU workloads like distributed training. This helps you analyze performance and debug memory issues.
- PyTorch Profiler is supported out of the box when used with Ray Train.
  [Review thread on the line above]
  Contributor: Add a link?
  Author: What link? Ray Train doesn't have a doc for it.
  Contributor: Hmm, so we don't have any example for this?
  Author: Neither Ray Train nor Ray Data has a doc about it yet. We should ask them to add it later.
- NVIDIA Nsight System is not natively supported yet. Leave your comments in this [feature request for Nsight System support](https://github.com/ray-project/ray/issues/19631).

(profiling-pytoch-profiler)=
### PyTorch Profiler
PyTorch Profiler is a tool that allows the collection of performance metrics (especially GPU metrics) during training and inference.

Here are the {ref}`steps to use PyTorch Profiler with Ray Train or Ray Data <performance-debugging-gpu-profiling>`.

(profiling-timeline)=
## Ray Task / Actor timeline
Ray Timeline profiles the execution time of Ray Tasks and Actors. This helps you analyze performance, identify stragglers, and understand the distribution of workloads.

Open your Ray Job in Ray Dashboard and follow the {ref}`instructions to download and visualize the trace files <dashboard-timeline>` generated by Ray Timeline.
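
You can also produce a trace file programmatically with `ray.timeline`, as in this minimal sketch (the output path is just an example):

```python
import ray

@ray.remote
def step(i):
    return i * i

ray.init()
ray.get([step.remote(i) for i in range(100)])

# Dump recent Task events in Chrome-tracing format.
ray.timeline(filename="/tmp/ray_timeline.json")
# Open the file in a trace viewer such as Perfetto UI or chrome://tracing.
```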