Discover TPU logs in Ray Dashboard #47737

Merged
edoakes merged 9 commits into ray-project:master from ryanaoleary:tpu-logs on Mar 6, 2025

@ryanaoleary
Contributor

Why are these changes needed?

TPU device logs for k8s containers that request google.com/tpu resources are written to the /tmp/tpu_logs directory. When the TPU_WORKER_ID env var is set, this PR adds a symlink to the /tmp/tpu_logs directory, and the TPU log files are then added to monitor_log_paths. The logs are then viewable from the Ray Dashboard:

Create a file in /tmp/tpu_logs and view symlink:
command-line-logging

The tpu_logs directory is added to the 'Logs' tab on a TPU Ray worker:
tpu_logs_dir

The log file we created is ingested/viewable:
tpu-device-log-file
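
The symlink step described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code merged in this PR: the helper name `link_tpu_logs`, its parameters, and the choice to place the link inside a session logs directory are all hypothetical. Only the `/tmp/tpu_logs` default and the `TPU_WORKER_ID` check come from the description.

```python
import os


def link_tpu_logs(session_logs_dir, tpu_log_dir="/tmp/tpu_logs", environ=None):
    """Symlink tpu_log_dir into session_logs_dir and return the link path.

    Hypothetical helper for illustration only. Returns None when the
    process is not a TPU worker or the TPU log directory is missing.
    """
    environ = os.environ if environ is None else environ
    # Assumption: TPU_WORKER_ID is only set on TPU workers.
    if "TPU_WORKER_ID" not in environ:
        return None
    # Skip quietly if the TPU log directory has not been created yet.
    if not os.path.isdir(tpu_log_dir):
        return None
    link_path = os.path.join(session_logs_dir, "tpu_logs")
    # lexists also detects a dangling link, so the call stays idempotent.
    if not os.path.lexists(link_path):
        os.symlink(tpu_log_dir, link_path)
    return link_path
```

Returning `None` instead of raising keeps startup unaffected on nodes without TPUs or without the log directory.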

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@ryanaoleary
Contributor Author

cc: @andrewsykim @kevin85421

@kevin85421 kevin85421 self-assigned this Sep 18, 2024
@ryanaoleary
Contributor Author

This is what actual log files look like when running a Serve application on TPUs:

actual-tpu-logs

@ryanaoleary
Contributor Author

@kevin85421 can we include this in the next release if possible?

@stale

stale Bot commented Feb 24, 2025

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale Bot added the "stale" label (The issue is stale. It will be closed within 7 days unless there are further conversation) Feb 24, 2025
@stale stale Bot removed the "stale" label (The issue is stale. It will be closed within 7 days unless there are further conversation) Feb 24, 2025
@ryanaoleary
Contributor Author

@kevin85421 This PR would be helpful to add since TPU logs currently aren't discoverable from the Ray dashboard, so users have to exec into the Pod containers to view the logs relevant to their workloads.

Comment thread python/ray/_private/log_monitor.py Outdated
Comment thread python/ray/_private/log_monitor.py Outdated
Comment thread python/ray/_private/log_monitor.py Outdated
Comment thread python/ray/_private/node.py Outdated
ryanaoleary and others added 2 commits February 26, 2025 02:22
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@ryanaoleary
Contributor Author

@kevin85421 I fixed the first round of comments and re-tested it with the unit test and by manually deploying in my cluster. We now check for the TPU logs directory before creating a symlink and adding it to the monitor path.
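
The "check for the directory before adding it to the monitor path" idea can be sketched as below. This is an assumption-laden illustration rather than the actual `log_monitor.py` change: the function name `find_tpu_log_files` and the `tpu_logs` subdirectory name under the logs directory are hypothetical.

```python
import os


def find_tpu_log_files(logs_dir):
    """Return sorted paths of regular files under logs_dir/tpu_logs.

    Hypothetical helper for illustration only. Returns an empty list when
    the directory (or the symlink target behind it) does not exist.
    """
    tpu_dir = os.path.join(logs_dir, "tpu_logs")
    # os.path.isdir follows symlinks, so a dangling link is treated as absent.
    if not os.path.isdir(tpu_dir):
        return []
    candidates = (os.path.join(tpu_dir, name) for name in os.listdir(tpu_dir))
    # Keep regular files only; subdirectories are ignored in this sketch.
    return sorted(path for path in candidates if os.path.isfile(path))
```

A caller could append the returned paths to whatever list of monitored log paths it maintains.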

Comment thread python/ray/_private/node.py Outdated
Comment thread python/ray/_private/node.py Outdated
try_to_create_directory(self._runtime_env_dir)
# Create a symlink to the libtpu tpu_logs directory if it exists.
user_temp_dir = ray._private.utils.get_user_temp_dir()
tpu_log_dir = os.environ.get("TPU_LOG_DIR", f"{user_temp_dir}/tpu_logs")
Member

When is {user_temp_dir}/tpu_logs created? Is it guaranteed to exist before this function is called?

Contributor Author

@ryanaoleary ryanaoleary Feb 26, 2025

In my manual testing, /tmp/tpu_logs existed when the Ray node initialized, so I think it's created beforehand, but I'm not sure how to check whether that's guaranteed. The directory contains libtpu logs, so I believe it's created when libtpu is installed in the container.

Member

any update on this?

Contributor Author

@ryanaoleary ryanaoleary Mar 4, 2025

I ran some manual tests. I created a RayCluster with the following manifest:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-tpu-v4-singlehost
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: us-central1-docker.pkg.dev/ryanaoleary-gke-dev/ryanaoleary-ray/tpu-logs:latest
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "500m"
                memory: "512Mi"
              requests:
                cpu: "500m"
                memory: "512Mi"
            env:
              - name: RAY_memory_monitor_refresh_ms
                value: "0"
              - name: RAY_GRAFANA_IFRAME_HOST
                value: http://${grafana_host}
              - name: RAY_GRAFANA_HOST
                value: http://grafana:80
              - name: RAY_PROMETHEUS_HOST
                value: http://frontend:9090
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
  workerGroupSpecs:
  - rayStartParams: {}
    replicas: 1
    minReplicas: 1
    maxReplicas: 1
    numOfHosts: 1
    groupName: tpu-group
    template:
      spec:
        initContainers:
          - name: check-folder
            image: busybox
            command:
              - /bin/sh
              - -c
              - 'if [ -d "/tmp/tpu_logs" ]; then echo "Folder exists"; else echo "Folder does not exist"; fi'
        containers:
          - name: ray-worker
            image: us-central1-docker.pkg.dev/ryanaoleary-gke-dev/ryanaoleary-ray/tpu-logs:latest
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "1"
                ephemeral-storage: 20Gi
                google.com/tpu: "4"
                memory: 40G
              requests:
                cpu: "1"
                ephemeral-storage: 10Gi
                google.com/tpu: "4"
                memory: 40G
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
          cloud.google.com/gke-tpu-topology: 2x2x1

When I check the logs of check-folder, I can see that the /tmp/tpu_logs folder is not present in the initContainer:

kubectl logs raycluster-tpu-v4-singlehost-tpu-group-worker-cdwkg -c check-folder
Folder does not exist

This makes sense since the TPU node hasn't started initializing, which is when it starts making calls to libtpu and writing logs. I'm not sure this is an issue, though: the folder is only added in _init_temp if the path exists (so it won't cause an error), and in manual tests it consistently shows up in the dashboard, so the libtpu calls must occur immediately when the Ray container is created.

Member

@kevin85421 kevin85421 Mar 5, 2025

OK, we can delegate the responsibility to users to ensure that /tmp/tpu_logs is created before running ray start.

Member

It's better to add documentation for it. It's not necessary to include it in this PR, but it would be helpful to document it. Would you mind opening an issue to track the progress?

Contributor Author

Sure, here's the issue: #51102
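
For the point in this thread about users making sure the directory exists before running ray start, one possible approach is sketched below. The helper name `ensure_tpu_log_dir` is hypothetical, and the `/tmp/tpu_logs` fallback is an assumption (the snippet under review derives its default from the user temp dir); the `TPU_LOG_DIR` variable is the one read in that snippet.

```python
import os


def ensure_tpu_log_dir(path=None):
    """Create the TPU log directory if it is missing and return its path.

    Hypothetical helper for illustration only. Falls back to TPU_LOG_DIR,
    then to /tmp/tpu_logs (an assumed default), when no path is given.
    """
    path = path or os.environ.get("TPU_LOG_DIR", "/tmp/tpu_logs")
    # exist_ok makes the call safe to repeat on every container start.
    os.makedirs(path, exist_ok=True)
    return path
```

Something like this could run in a container entrypoint before `ray start`, so the existence check at node startup finds the directory.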

ryanaoleary and others added 2 commits March 4, 2025 23:27
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@kevin85421 kevin85421 added the "go" label (add ONLY when ready to merge, run all tests) Mar 5, 2025
Member

@kevin85421 kevin85421 left a comment

LGTM.

Collaborator

@edoakes edoakes left a comment

Thanks!

@edoakes edoakes merged commit c953682 into ray-project:master Mar 6, 2025
park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
TPU device logs for k8s containers that request `google.com/tpu`
resources are written to the `/tmp/tpu_logs` directory. This PR adds a
symlink to the `/tmp/tpu_logs` directory when the `TPU_WORKER_ID` env
var is set, TPU log files are then added to `monitor_log_paths`. The
logs are then viewable from the Ray Dashboard:

Create a file in /tmp/tpu_logs and view symlink:

![command-line-logging](https://github.com/user-attachments/assets/c50915ad-8382-4af7-a398-40d5a249e8c8)

The tpu_logs directory is added to the 'Logs' tab on a TPU Ray worker:

![tpu_logs_dir](https://github.com/user-attachments/assets/394133b0-be70-4b98-9e86-dcad50c1b4fd)

The log file we created is ingested/viewable:

![tpu-device-log-file](https://github.com/user-attachments/assets/c42ab96a-f88b-4959-adf2-8650fd75c773)

---------

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Co-authored-by: Kai-Hsun Chen <kaihsun@anyscale.com>
dhakshin32 pushed a commit to dhakshin32/ray that referenced this pull request Mar 27, 2025
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Co-authored-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: Dhakshin Suriakannu <d_suriakannu@apple.com>
Labels

go (add ONLY when ready to merge, run all tests)

3 participants