Description
What happened + What you expected to happen
The commands `ray job submit` and `ray job logs -f` are supposed to keep printing a job's output until the job terminates.
Lately, though, these commands sporadically stop following the logs and exit.
Interestingly, before exiting they print the job status, the same output `ray job status` would give.
Example:
```
root@a687ff44d99c:/code# ray job submit ...
...
-------------------------------------------------------
Job 'raysubmit_aBW2LEDfWwhL1kSh' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_aBW2LEDfWwhL1kSh
  Query the status of the job:
    ray job status raysubmit_aBW2LEDfWwhL1kSh
  Request the job to be stopped:
    ray job stop raysubmit_aBW2LEDfWwhL1kSh

Tailing logs until the job exits (disable with --no-wait):
2025-03-21 18:47:42,202 INFO job_manager.py:530 -- Runtime env is setting up.
2025-03-21 18:47:45,289 INFO worker.py:1514 -- Using address 10.212.79.105:6379 set in the environment variable RAY_ADDRESS
2025-03-21 18:47:45,290 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.212.79.105:6379...
2025-03-21 18:47:45,302 INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 10.212.79.105:8265
Creating placement group...
(autoscaler +2s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +2s) Adding 100 node(s) of type t3a.large1.
Status for job 'raysubmit_aBW2LEDfWwhL1kSh': RUNNING
Status message: Job is currently running.
```
... and this is the point where `ray job submit` exits.
Note that I am running Ray via KubeRay, with `RAY_ADDRESS` set to a domain name that maps to a Kubernetes ingress routing all HTTP traffic to the head node's dashboard port (8265). I suspect this networking configuration plays a role; however, since the issue manifests in the `ray job` command, I am filing it here rather than in KubeRay.
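For reference, the ingress in front of the dashboard looks roughly like this (hostnames, resource names, and the service name are hypothetical placeholders, not my actual setup):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ray-dashboard            # hypothetical name
spec:
  rules:
    - host: ray.example.com      # hypothetical host that RAY_ADDRESS points to
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: raycluster-head-svc  # KubeRay head service (name may differ)
                port:
                  number: 8265             # Ray dashboard / Job API port
```

The `ray` CLI talks to the cluster over this HTTP route, so the log-tailing connection is long-lived and passes through the ingress.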
Versions / Dependencies
Ray: 2.43.0
KubeRay: 1.3.0
Python: 3.10
Reproduction script
Any ray job submit or ray job logs -f will do.
Issue Severity
Low: It annoys or frustrates me.