[Cluster] Ray job submit/logs sporadically stops following logs #51601

@jleben


What happened + What you expected to happen

The commands ray job submit and ray job logs -f are supposed to keep printing the output of the job until it terminates.

Lately, though, I find that these commands sporadically stop following the logs and exit.

Interestingly, before exiting, they print the job status, same as ray job status would.

Example:

root@a687ff44d99c:/code# ray job submit ...
...
-------------------------------------------------------
Job 'raysubmit_aBW2LEDfWwhL1kSh' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_aBW2LEDfWwhL1kSh
  Query the status of the job:
    ray job status raysubmit_aBW2LEDfWwhL1kSh
  Request the job to be stopped:
    ray job stop raysubmit_aBW2LEDfWwhL1kSh

Tailing logs until the job exits (disable with --no-wait):
2025-03-21 18:47:42,202	INFO job_manager.py:530 -- Runtime env is setting up.
2025-03-21 18:47:45,289	INFO worker.py:1514 -- Using address 10.212.79.105:6379 set in the environment variable RAY_ADDRESS
2025-03-21 18:47:45,290	INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.212.79.105:6379...
2025-03-21 18:47:45,302	INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 10.212.79.105:8265 
Creating placement group...
(autoscaler +2s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +2s) Adding 100 node(s) of type t3a.large1.
Status for job 'raysubmit_aBW2LEDfWwhL1kSh': RUNNING
Status message: Job is currently running.

... and here is where ray job submit has exited.

Note that I am running Ray via KubeRay, with RAY_ADDRESS set to a domain name that maps to a Kubernetes Ingress routing all HTTP traffic to the head node's dashboard port (8265). I suspect this networking configuration plays a role; however, since the issue manifests in the ray job command, I am filing it here rather than against KubeRay.
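
For reference, the client-side setup looks roughly like this (the hostname below is a placeholder for the real Ingress domain, and my_job.py stands in for the actual entrypoint):

# The Ingress behind this hostname forwards all HTTP traffic to the head node's dashboard port (8265).
export RAY_ADDRESS="http://ray.example.com"

# Jobs are then submitted through that Ingress:
ray job submit --working-dir . -- python my_job.py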

Versions / Dependencies

Ray: 2.43.0
KubeRay: 1.3.0
Python: 3.10

Reproduction script

Any ray job submit or ray job logs -f will do.
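
A hypothetical minimal example (script contents and timings are arbitrary; any job that keeps writing to stdout for a while should eventually hit this):

# Create a trivial job that prints one line per second for ten minutes.
cat > repro.py <<'EOF'
import time
for i in range(600):
    print("tick", i, flush=True)
    time.sleep(1)
EOF

# Submit it and let the CLI tail the logs.
ray job submit --working-dir . -- python repro.py

# Or re-attach to an already-submitted job and watch the log stream stop early:
ray job logs -f raysubmit_aBW2LEDfWwhL1kSh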

Issue Severity

Low: It annoys or frustrates me.

Labels

bug, community-backlog, jobs, triage
