Discover TPU logs in Ray Dashboard #47737
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@kevin85421 can we include this in the next release if possible?
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
@kevin85421 This PR would be helpful to add, since TPU logs currently aren't discoverable from the Ray dashboard. Users have to exec into the Pod containers to view the logs relevant to their workloads.
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@kevin85421 I fixed the first round of comments and re-tested it with the unit test and by manually deploying in my cluster. We now check for the TPU logs directory before creating a symlink and adding it to the monitor path.
try_to_create_directory(self._runtime_env_dir)
# Create a symlink to the libtpu tpu_logs directory if it exists.
user_temp_dir = ray._private.utils.get_user_temp_dir()
tpu_log_dir = os.environ.get("TPU_LOG_DIR", f"{user_temp_dir}/tpu_logs")
When is {user_temp_dir}/tpu_logs created? Is it guaranteed to exist before this function is called?
In my manual testing, /tmp/tpu_logs already existed when the Ray node initialized, so I think it is created beforehand, but I'm not sure how to check whether that is guaranteed. The directory contains libtpu logs, so I believe it's created when libtpu is installed in the container.
I ran some manual tests. I created a RayCluster with the following manifest:
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-tpu-v4-singlehost
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: us-central1-docker.pkg.dev/ryanaoleary-gke-dev/ryanaoleary-ray/tpu-logs:latest
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              cpu: "500m"
              memory: "512Mi"
            requests:
              cpu: "500m"
              memory: "512Mi"
          env:
          - name: RAY_memory_monitor_refresh_ms
            value: "0"
          - name: RAY_GRAFANA_IFRAME_HOST
            value: http://${grafana_host}
          - name: RAY_GRAFANA_HOST
            value: http://grafana:80
          - name: RAY_PROMETHEUS_HOST
            value: http://frontend:9090
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          - containerPort: 8000
            name: serve
  workerGroupSpecs:
  - rayStartParams: {}
    replicas: 1
    minReplicas: 1
    maxReplicas: 1
    numOfHosts: 1
    groupName: tpu-group
    template:
      spec:
        initContainers:
        - name: check-folder
          image: busybox
          command:
          - /bin/sh
          - -c
          - 'if [ -d "/tmp/tpu_logs" ]; then echo "Folder exists"; else echo "Folder does not exist"; fi'
        containers:
        - name: ray-worker
          image: us-central1-docker.pkg.dev/ryanaoleary-gke-dev/ryanaoleary-ray/tpu-logs:latest
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              cpu: "1"
              ephemeral-storage: 20Gi
              google.com/tpu: "4"
              memory: 40G
            requests:
              cpu: "1"
              ephemeral-storage: 10Gi
              google.com/tpu: "4"
              memory: 40G
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
          cloud.google.com/gke-tpu-topology: 2x2x1
When I check the logs of check-folder, I can see that the /tmp/tpu_logs folder is not present in the initContainer:
kubectl logs raycluster-tpu-v4-singlehost-tpu-group-worker-cdwkg -c check-folder
Folder does not exist
This makes sense, since the TPU node hasn't started initializing at that point, and initialization is when it starts making calls to libtpu and writing logs. I'm not sure this is an issue, though: the folder is only added in _init_temp if the path exists (so it won't cause an error), and in manual tests it consistently shows up in the dashboard, so the libtpu calls must be occurring immediately when the Ray container is created.
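The "only link the directory if it exists" guard discussed here can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `link_tpu_logs` and its parameters are hypothetical names, and `tempfile.gettempdir()` is used as a stand-in for `ray._private.utils.get_user_temp_dir()`.

```python
import os
import tempfile
from typing import Optional


def link_tpu_logs(session_logs_dir: str, tpu_log_dir: Optional[str] = None) -> Optional[str]:
    """Symlink the TPU log directory into session_logs_dir if it exists.

    Hypothetical helper for illustration; returns the link path, or None
    when the TPU log directory is absent.
    """
    if tpu_log_dir is None:
        # Assumption: stand-in for ray._private.utils.get_user_temp_dir().
        default_dir = os.path.join(tempfile.gettempdir(), "tpu_logs")
        tpu_log_dir = os.environ.get("TPU_LOG_DIR", default_dir)

    # Skip quietly when the directory has not been created yet,
    # so a missing directory never raises an error.
    if not os.path.isdir(tpu_log_dir):
        return None

    link_path = os.path.join(session_logs_dir, "tpu_logs")
    # lexists also detects a dangling link left over from an earlier run.
    if not os.path.lexists(link_path):
        os.symlink(tpu_log_dir, link_path)
    return link_path
```

Returning `None` instead of raising matches the behavior described in this thread: a node without `/tmp/tpu_logs` simply gets no TPU entry in its logs directory.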
OK, we can delegate the responsibility to users to ensure that /tmp/tpu_logs is created before running ray start.
It's better to add documentation for it. It's not necessary to include it in this PR, but it would be helpful to document it. Would you mind opening an issue to track the progress?
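As a starting point for that documentation, the user-side step might look like the sketch below. It assumes the same `TPU_LOG_DIR` override and `tpu_logs` default shown in the diff snippet; `ensure_tpu_log_dir` is a hypothetical name, and `tempfile.gettempdir()` is only a stand-in for Ray's user temp directory.

```python
import os
import tempfile


def ensure_tpu_log_dir() -> str:
    """Create the TPU log directory if it is missing and return its path."""
    # Assumption: tempfile.gettempdir() approximates Ray's user temp dir.
    default_dir = os.path.join(tempfile.gettempdir(), "tpu_logs")
    tpu_log_dir = os.environ.get("TPU_LOG_DIR", default_dir)
    # exist_ok=True makes the call safe to repeat on every container start.
    os.makedirs(tpu_log_dir, exist_ok=True)
    return tpu_log_dir
```

Calling this from the container entrypoint before `ray start` would make sure the directory is present when the node initializes its temp directories.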
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
TPU device logs for k8s containers that request `google.com/tpu` resources are written to the `/tmp/tpu_logs` directory. This PR adds a symlink to the `/tmp/tpu_logs` directory when the `TPU_WORKER_ID` env var is set; TPU log files are then added to `monitor_log_paths`. The logs are then viewable from the Ray Dashboard.

Create a file in /tmp/tpu_logs and view symlink:


The tpu_logs directory is added to the 'Logs' tab on a TPU Ray worker:


The log file we created is ingested/viewable:


---------

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Co-authored-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Why are these changes needed?
TPU device logs for k8s containers that request `google.com/tpu` resources are written to the `/tmp/tpu_logs` directory. This PR adds a symlink to the `/tmp/tpu_logs` directory when the `TPU_WORKER_ID` env var is set; TPU log files are then added to `monitor_log_paths`. The logs are then viewable from the Ray Dashboard:

Create a file in /tmp/tpu_logs and view symlink:

The tpu_logs directory is added to the 'Logs' tab on a TPU Ray worker:

The log file we created is ingested/viewable:
