Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mid-way through a job, ray.util.queue.Queue.get() reports "Ray has not been started yet" #17153

Closed
1 of 2 tasks
djakubiec opened this issue Jul 16, 2021 · 8 comments
Closed
1 of 2 tasks
Labels
bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

@djakubiec
Copy link

What is the problem?

(Copied from discuss.ray.io per suggestion from @richardliaw)
I am getting this unexpected error mid-way through the my Ray job.

@richardliaw asked for me to upload log files, which are attached here: logs.zip

Traceback (most recent call last):
  File "/home/djakubiec/var/jupyter/marketNeutralEquityStrategy/test.build/mnesRanking.py", line 1765, in <module>
    rayClient.processJob(parameters.priority, [(parametersReference, tickerRegion) for tickerRegion in tickerRegions], carryForwardRequestHandler, carryForwardCompletionHandler)
  File "/ceph/var/ray/tools/python.modules/focusvq/raytools/RayClient.py", line 77, in processJob
    threadsAvailable = threadQueue.get()
  File "/ceph/var/users/djakubiec/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/queue.py", line 155, in get
    return ray.get(self.actor.get.remote(timeout))
  File "/ceph/var/users/djakubiec/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/ceph/var/users/djakubiec/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/worker.py", line 1471, in get
    worker.check_connected()
  File "/ceph/var/users/djakubiec/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/worker.py", line 218, in check_connected
    raise RaySystemError("Ray has not been started yet. You can "
ray.exceptions.RaySystemError: System error: Ray has not been started yet. You can start Ray with 'ray.init()'.

Ray version and other system information (Python version, TensorFlow version, OS):
Using the following Ray nightly (which had other bug fixes I needed):

https://s3-us-west-2.amazonaws.com/ray-wheels/master/f5f23448fcab7c896e478e9e5c9804a73688ec41/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl

Reproduction (REQUIRED)

I start a job with 5097 tasks and do many successful Queue.get() calls. About 15% of the way through the job, Queue.get() all of a sudden complains that Ray has not been started yet.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels. TODO
@djakubiec djakubiec added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 16, 2021
@richardliaw
Copy link
Contributor

thanks a bunch for this @djakubiec! appreciate it .

@ijrsvt can you take a look when you get the chance?

@djakubiec
Copy link
Author

No problem, thanks for the help.

Oh, the stack trace above occurred at 2021-07-16 9:58:34,295. Probably a lot of other jobs logged prior to that.

@djakubiec
Copy link
Author

@ijrsvt I also reproduced this problem on a smaller test cluster we have. I did this after a clean restart of the cluster. So these logs may be cleaner to look at:

logs2.zip

NOTE: I've also been getting this earlier DEADLINE_EXCEEDED exception below which I didn't think was related but maybe it is?

Let me know what else I can provide, thank you!

[2021-07-16 14:52:05,167 INFO   ] Submitting 5097 tickerRegions to workers for multi-era compute.
The autoscaler failed with the following error:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 317, in run
    self._run()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 207, in _run
    self.update_load_metrics()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 170, in update_load_metrics
    request, timeout=4)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 923, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1626461549.374300032","description":"Error received from peer ipv4:10.0.1.250:37305","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>

[2021-07-16 14:53:23,160 INFO   ] Pre-ranking is 5% complete.  Estimated completion in 15.4 minutes at 2021-07-16T15:08:47.  Estimated total duration 0:16:13.
/ceph/var/users/djakubiec/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client/worker.py:358: UserWarning: More than 1000 remote tasks have been scheduled. This can be slow on Ray Client due to communication overhead over the network. If you're running many fine-grained tasks, consider running them in a single remote function. See the section on "Too fine-grained tasks" in the Ray Design Patterns document for more details: https://docs.google.com/document/d/167rnnDFIVRhHhK4mznEIemOtj63IOhtIPvSYaPgI4Fg/edit#heading=h.f7ins22n6nyl
  UserWarning)
[2021-07-16 14:54:17,328 INFO   ] Pre-ranking is 10% complete.  Estimated completion in 15.3 minutes at 2021-07-16T15:09:37.  Estimated total duration 0:17:03.
[2021-07-16 14:55:10,171 INFO   ] Pre-ranking is 15% complete.  Estimated completion in 13.1 minutes at 2021-07-16T15:08:15.  Estimated total duration 0:15:24.
[2021-07-16 14:56:00,780 INFO   ] Pre-ranking is 20% complete.  Estimated completion in 13.4 minutes at 2021-07-16T15:09:23.  Estimated total duration 0:16:44.
[2021-07-16 14:56:51,155 INFO   ] Pre-ranking is 25% complete.  Estimated completion in 12.6 minutes at 2021-07-16T15:09:24.  Estimated total duration 0:16:44.
[2021-07-16 14:57:44,695 INFO   ] Pre-ranking is 30% complete.  Estimated completion in 12.3 minutes at 2021-07-16T15:10:00.  Estimated total duration 0:17:31.
[2021-07-16 14:58:35,995 INFO   ] Pre-ranking is 35% complete.  Estimated completion in 11.4 minutes at 2021-07-16T15:09:57.  Estimated total duration 0:17:28.
Traceback (most recent call last):
  File "/home/djakubiec/var/jupyter/marketNeutralEquityStrategy/test.build/mnesRanking.py", line 1767, in <module>
    rayClient.processJob(parameters.priority, [(parametersReference, tickerRegion) for tickerRegion in tickerRegions], carryForwardRequestHandler, carryForwardCompletionHandler)
  File "/ceph/var/ray/tools/python.modules/focusvq/raytools/RayClient.py", line 84, in processJob
    threadsAvailable = threadQueue.get()
  File "/ceph/var/users/djakubiec/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/queue.py", line 155, in get
    return ray.get(self.actor.get.remote(timeout))
  File "/ceph/var/users/djakubiec/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/ceph/var/users/djakubiec/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/worker.py", line 1471, in get
    worker.check_connected()
  File "/ceph/var/users/djakubiec/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/worker.py", line 218, in check_connected
    raise RaySystemError("Ray has not been started yet. You can "
ray.exceptions.RaySystemError: System error: Ray has not been started yet. You can start Ray with 'ray.init()'.
[2021-07-16 14:59:19,042 INFO   ] Total run time of 450.927s (0:07:31) from 2021-07-16T14:51:48-04:00 to 2021-07-16T14:59:19-04:00.

@djakubiec
Copy link
Author

djakubiec commented Jul 20, 2021

I caught another incarnation of this problem in a different script which printed some additional worker logs this time. I am not 100% positive the work JSON errors are related since there are no timestamps on those, but I think they probably are related or at least resultant.

Note that prior to this run I had moved the cluster from the master commit referenced above to a basic 1.4.1 release.

[2021-07-20 02:38:39,014 INFO   ] Submitting 245 eras to workers.
/ceph/var/users/focusvq/anaconda3/envs/ray3/lib/python3.7/site-packages/ray/util/client/worker.py:374: UserWarning: More than 10MB of messages have been created to schedule tasks on the server. This can be slow on Ray Client due to communication overhead over the network. If you're running many fine-grained tasks, consider running them inside a single remote function. See the section on "Too fine-grained tasks" in the Ray Design Patterns document for more details: https://docs.google.com/document/d/167rnnDFIVRhHhK4mznEIemOtj63IOhtIPvSYaPgI4Fg/edit#heading=h.f7ins22n6nyl. If your functions frequently use large objects, consider storing the objects remotely with ray.put. An example of this is shown in the "Closure capture of large / unserializable object" section of the Ray Design Patterns document, available here: https://docs.google.com/document/d/167rnnDFIVRhHhK4mznEIemOtj63IOhtIPvSYaPgI4Fg/edit#heading=h.1afmymq455wu
  f"{DESIGN_PATTERN_LARGE_OBJECTS_LINK}", UserWarning)
Exception in thread Thread-12:
Traceback (most recent call last):
  File "/ceph/var/users/focusvq/anaconda3/envs/ray3/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/ceph/var/users/focusvq/anaconda3/envs/ray3/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/ceph/var/ray/tools/python.modules/focusvq/raytools/RayClient.py", line 118, in resultProcessor
    ready, futures = ray.wait(futures, num_returns=len(futures), timeout=1.0)
  File "/ceph/var/users/focusvq/anaconda3/envs/ray3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/ceph/var/users/focusvq/anaconda3/envs/ray3/lib/python3.7/site-packages/ray/worker.py", line 1581, in wait
    worker.check_connected()
  File "/ceph/var/users/focusvq/anaconda3/envs/ray3/lib/python3.7/site-packages/ray/worker.py", line 222, in check_connected
    raise RaySystemError("Ray has not been started yet. You can "
ray.exceptions.RaySystemError: System error: Ray has not been started yet. You can start Ray with 'ray.init()'.

Exception in thread Thread-11:
Traceback (most recent call last):
  File "/ceph/var/users/focusvq/anaconda3/envs/ray3/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/ceph/var/users/focusvq/anaconda3/envs/ray3/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/ceph/var/ray/tools/python.modules/focusvq/raytools/RayClient.py", line 72, in heartbeat
    ray.get(self.manager.heartbeat.remote(self.rayContext['clientId']))
  File "/ceph/var/users/focusvq/anaconda3/envs/ray3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/ceph/var/users/focusvq/anaconda3/envs/ray3/lib/python3.7/site-packages/ray/worker.py", line 1465, in get
    worker.check_connected()
  File "/ceph/var/users/focusvq/anaconda3/envs/ray3/lib/python3.7/site-packages/ray/worker.py", line 222, in check_connected
    raise RaySystemError("Ray has not been started yet. You can "
ray.exceptions.RaySystemError: System error: Ray has not been started yet. You can start Ray with 'ray.init()'.

(raylet, ip=10.0.1.19) Traceback (most recent call last):
(raylet, ip=10.0.1.19)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 175, in <module>
(raylet, ip=10.0.1.19)     connect_only=True)
(raylet, ip=10.0.1.19)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/site-packages/ray/node.py", line 204, in __init__
(raylet, ip=10.0.1.19)     "metrics_agent_port", default_port=ray_params.metrics_agent_port)
(raylet, ip=10.0.1.19)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/site-packages/ray/node.py", line 635, in _get_cached_port
(raylet, ip=10.0.1.19)     ports_by_node.update(json.load(f))
(raylet, ip=10.0.1.19)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/json/__init__.py", line 296, in load
(raylet, ip=10.0.1.19)     parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
(raylet, ip=10.0.1.19)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/json/__init__.py", line 348, in loads
(raylet, ip=10.0.1.19)     return _default_decoder.decode(s)
(raylet, ip=10.0.1.19)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/json/decoder.py", line 337, in decode
(raylet, ip=10.0.1.19)     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
(raylet, ip=10.0.1.19)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/json/decoder.py", line 355, in raw_decode
(raylet, ip=10.0.1.19)     raise JSONDecodeError("Expecting value", s, err.value) from None
(raylet, ip=10.0.1.19) json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
(raylet, ip=10.0.1.38) Traceback (most recent call last):
(raylet, ip=10.0.1.38)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 175, in <module>
(raylet, ip=10.0.1.38)     connect_only=True)
(raylet, ip=10.0.1.38)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/site-packages/ray/node.py", line 204, in __init__
(raylet, ip=10.0.1.38)     "metrics_agent_port", default_port=ray_params.metrics_agent_port)
(raylet, ip=10.0.1.38)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/site-packages/ray/node.py", line 635, in _get_cached_port
(raylet, ip=10.0.1.38)     ports_by_node.update(json.load(f))
(raylet, ip=10.0.1.38)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/json/__init__.py", line 296, in load
(raylet, ip=10.0.1.38)     parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
(raylet, ip=10.0.1.38)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/json/__init__.py", line 348, in loads
(raylet, ip=10.0.1.38)     return _default_decoder.decode(s)
(raylet, ip=10.0.1.38)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/json/decoder.py", line 337, in decode
(raylet, ip=10.0.1.38)     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
(raylet, ip=10.0.1.38)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/json/decoder.py", line 355, in raw_decode
(raylet, ip=10.0.1.38)     raise JSONDecodeError("Expecting value", s, err.value) from None
(raylet, ip=10.0.1.38) json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
(raylet) Traceback (most recent call last):
(raylet)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 175, in <module>
(raylet)     connect_only=True)
(raylet)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/site-packages/ray/node.py", line 204, in __init__
(raylet)     "metrics_agent_port", default_port=ray_params.metrics_agent_port)
(raylet)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/site-packages/ray/node.py", line 635, in _get_cached_port
(raylet)     ports_by_node.update(json.load(f))
(raylet)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/json/__init__.py", line 296, in load
(raylet)     parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
(raylet)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/json/__init__.py", line 348, in loads
(raylet)     return _default_decoder.decode(s)
(raylet)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/json/decoder.py", line 337, in decode
(raylet)     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
(raylet)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/json/decoder.py", line 355, in raw_decode
(raylet)     raise JSONDecodeError("Expecting value", s, err.value) from None
(raylet) json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
(raylet, ip=10.0.1.25) Traceback (most recent call last):
(raylet, ip=10.0.1.25)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 175, in <module>
(raylet, ip=10.0.1.25)     connect_only=True)
(raylet, ip=10.0.1.25)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/site-packages/ray/node.py", line 204, in __init__
(raylet, ip=10.0.1.25)     "metrics_agent_port", default_port=ray_params.metrics_agent_port)
(raylet, ip=10.0.1.25)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/site-packages/ray/node.py", line 635, in _get_cached_port
(raylet, ip=10.0.1.25)     ports_by_node.update(json.load(f))
(raylet, ip=10.0.1.25)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/json/__init__.py", line 296, in load
(raylet, ip=10.0.1.25)     parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
(raylet, ip=10.0.1.25)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/json/__init__.py", line 348, in loads
(raylet, ip=10.0.1.25)     return _default_decoder.decode(s)
(raylet, ip=10.0.1.25)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/json/decoder.py", line 337, in decode
(raylet, ip=10.0.1.25)     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
(raylet, ip=10.0.1.25)   File "/home/ray/anaconda3/envs/dan-py37/lib/python3.7/json/decoder.py", line 355, in raw_decode
(raylet, ip=10.0.1.25)     raise JSONDecodeError("Expecting value", s, err.value) from None

@ijrsvt
Copy link
Contributor

ijrsvt commented Jul 21, 2021

@djakubiec Just to check, these are happening from a call in a thread?

Could you try running on a more recent version of nightly (specifically one that contains #16731)? That PR fixes Client behavior in threads!

@djakubiec
Copy link
Author

@ijrsvt Yes, this call was indeed from a Python thread. I'll build a cluster with a more recent nightly, but it will probably take me a day or two.

Do you have a reasonably stable master commit you could recommend, or should I just pick the most recent?

@ijrsvt
Copy link
Contributor

ijrsvt commented Jul 22, 2021

Hmm, I'm not sure of a specific commit. We are in the process of doing a release, so the most recent pick on that branch https://ray-wheels.s3.us-west-2.amazonaws.com/releases/1.5.0/8ea70c23cd031ef854ec1755768eab1e14daefd1/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl is probably a bit more stable than latest master.

@djakubiec
Copy link
Author

@ijrsvt Following up a little late on this, but it does appear that 1.5.0 does fix this issue, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component)
Projects
None yet
Development

No branches or pull requests

3 participants