Mid-way through a job, ray.util.queue.Queue.get() reports "Ray has not been started yet" #17153
Comments
Thanks a bunch for this, @djakubiec! Appreciate it. @ijrsvt, can you take a look when you get the chance?
No problem, thanks for the help. Oh, the stack trace above occurred at 2021-07-16 9:58:34,295. A lot of other jobs probably logged prior to that.
@ijrsvt I also reproduced this problem on a smaller test cluster we have, after a clean restart of the cluster, so these logs may be cleaner to look at. NOTE: I've also been getting the earlier DEADLINE_EXCEEDED exception below, which I didn't think was related, but maybe it is? Let me know what else I can provide, thank you!
I caught another incarnation of this problem in a different script, which printed some additional worker logs this time. I am not 100% positive the worker JSON errors are related, since there are no timestamps on those, but I think they are probably related, or at least a consequence. Note that prior to this run I had moved the cluster from the master commit referenced above to the basic 1.4.1 release.
@djakubiec Just to check, are these happening from a call in a thread? Could you try running on a more recent nightly (specifically one that contains #16731)? That PR fixes Client behavior in threads!
@ijrsvt Yes, this call was indeed from a Python thread. I'll build a cluster with a more recent nightly, but it will probably take me a day or two. Do you have a reasonably stable master commit you could recommend, or should I just pick the most recent?
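For context, the call pattern in question looks roughly like this (a minimal, self-contained sketch; the thread body, item counts, and local ray.init() are illustrative stand-ins for the actual cluster setup):

```python
import threading

import ray
from ray.util.queue import Queue

ray.init()  # the original job connected to a remote cluster; local init here for illustration

q = Queue()

def drain(q, n):
    # Queue.get() issued from a non-main Python thread -- the pattern
    # that triggered "Ray has not been started yet" on pre-1.5.0 builds.
    for _ in range(n):
        q.get(block=True)

for i in range(10):
    q.put(i)

t = threading.Thread(target=drain, args=(q, 10))
t.start()
t.join()
ray.shutdown()
```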
Hmm, I'm not sure of a specific commit. We are in the process of doing a release, so the most recent commit on that branch would be a good pick.
@ijrsvt Following up a little late on this, but 1.5.0 does appear to fix this issue, thank you!
What is the problem?
(Copied from discuss.ray.io per suggestion from @richardliaw)
I am getting this unexpected error midway through my Ray job.
@richardliaw asked me to upload log files, which are attached here: logs.zip
Ray version and other system information (Python version, TensorFlow version, OS):
Using the following Ray nightly (which had other bug fixes I needed):
https://s3-us-west-2.amazonaws.com/ray-wheels/master/f5f23448fcab7c896e478e9e5c9804a73688ec41/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
Reproduction (REQUIRED)
I start a job with 5097 tasks and make many successful Queue.get() calls. About 15% of the way through the job, Queue.get() suddenly complains that Ray has not been started yet.
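A minimal sketch of the job shape (the task body and queue wiring are illustrative; only the task count and the Queue.get() loop mirror the report):

```python
import ray
from ray.util.queue import Queue

ray.init()

queue = Queue()

@ray.remote
def task(q, i):
    # Stand-in for the real work; each task pushes one result.
    q.put(i)

refs = [task.remote(queue, i) for i in range(5097)]

for n in range(5097):
    item = queue.get(block=True)  # error surfaced around n ~= 765 (15% of 5097)
    # ... process item ...

ray.get(refs)
ray.shutdown()
```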