
[Core] Task waiting to start due to insufficient resources even though resources are present and detected #33000

Open
dongreenberg opened this issue Mar 3, 2023 · 2 comments
Labels
  • bug: Something that is supposed to be working; but isn't
  • core: Issues that should be addressed in Ray Core
  • core-correctness: Leak, crash, hang
  • core-scheduler
  • needs-repro-script: Issue needs a runnable script to be reproduced
  • P1.5: Issues that will be fixed in a couple releases. It will be bumped once all P1s are cleared
  • stability

Comments

@dongreenberg

What happened + What you expected to happen

Our OSS library launches Ray tasks on a cluster on behalf of our users. One user (@dcavadia) reported in run-house/runhouse#9 (comment) that their task fails with a Ray error, and shared the following stack trace and Ray status:

INFO | 2023-02-27 22:59:07,628 | Reloaded module langchain.llms.self_hosted_hugging_face
(raylet) [2023-02-27 22:59:41,974 E 69555 69555] (raylet) worker_pool.cc:502: Some workers of the worker process(69854) have not registered within the timeout. The process is dead, probably it crashed during start.
(raylet) Traceback (most recent call last):
(raylet)   File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/ray/_private/workers/default_worker.py", line 8, in <module>
(raylet)     import ray
(raylet)   File "/home/ubuntu/ubuntu/.local/lib/python3.8/site-packages/ray/__init__.py", line 101, in <module>
(raylet)     _configure_system()
(raylet)   File "/home/ubuntu/ubuntu/.local/lib/python3.8/site-packages/ray/__init__.py", line 98, in _configure_system
(raylet)     CDLL(so_path, ctypes.RTLD_GLOBAL)
(raylet)   File "/home/ubuntu/miniconda3/lib/python3.10/ctypes/__init__.py", line 374, in __init__
(raylet)     self._handle = _dlopen(self._name, mode)
(raylet) OSError: /home/ubuntu/ubuntu/.local/lib/python3.8/site-packages/ray/_raylet.so: undefined symbol: _Py_CheckRecursionLimit

Status:

======== Autoscaler status: 2023-02-27 22:57:07.933530 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_0ba826e055591e93b1eedf2ca00b44c0c8e2ac28fa7b77053bca62f9
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 9.999999999976694e-05/30.0 CPU
 9.999999999998899e-05/1.0 GPU
 0.0/1.0 accelerator_type:A10
 0.00/127.547 GiB memory
 0.00/58.654 GiB object_store_memory

Demands:
 {'CPU': 0.0001, 'GPU': 0.0001}: 1+ pending tasks/actor

We start tasks with {'CPU': 0.0001, 'GPU': 0.0001} (or just the CPU fraction when no GPU is present) because we'd rather leave contention management to the OS than have Ray manage it, and because we want each task to launch in a new worker so we can collect the logs for just that task (we know this is slower, but we don't see any way to collect logs for a single task, only for a worker).
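
For reference, here's a minimal sketch of the kind of fractional request we make (using plain ray.remote rather than our library's wrapper; run_user_fn is just a placeholder for the user's function):

import ray

ray.init(address="auto")  # connect to the running cluster

# Request tiny CPU/GPU fractions so Ray's scheduler effectively never
# holds the task back on resource counts; contention is left to the OS.
@ray.remote(num_cpus=0.0001, num_gpus=0.0001)
def run_user_fn():  # placeholder for the user's function
    return "ok"

print(ray.get(run_user_fn.remote()))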

This looks like a bug to me: the resources appear to be abundantly available, but the task is being held. I may be reading it wrong, because I don't recognize the exception.

Versions / Dependencies

  • Ray: Installed via the skypilot main branch, which (prior to two days ago) should have installed 2.2.0.
  • Python: To be honest I'm a bit confused about the Python version; the stack trace suggests both 3.8 and 3.10 are in play (perhaps that's the issue, but I'm not sure how it happens); see the sketch after this list.
  • OS: Ubuntu
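
For what it's worth, a quick generic check of which interpreter and Ray installation a node actually resolves (a diagnostic sketch only, not taken from the original report):

import sys

import ray

# Print the interpreter and the Ray installation this process picks up,
# to spot a miniconda 3.10 / ~/.local 3.8 mismatch like the one in the trace.
print(sys.executable)
print(sys.version)
print(ray.__file__)
print(ray.__version__)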

Reproduction script

Copied from the script the user shared in the linked comment. Requires the latest runhouse, the latest langchain, transformers, and torch.

from langchain.llms import SelfHostedPipeline, SelfHostedHuggingFaceLLM
from langchain import PromptTemplate, LLMChain
import runhouse as rh
gpu = rh.cluster(name="rh-a10", instance_type="A100:1").save()
gpu.restart_grpc_server(restart_ray=True)
template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])
llm = SelfHostedHuggingFaceLLM(model_id="gpt2", hardware=gpu, model_reqs=["./", "transformers", "torch", "langchain"])
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What NFL team won the Super Bowl in the year Justin Beiber was born?"
llm_chain.run(question)

Issue Severity

None

@dongreenberg added the bug and triage labels Mar 3, 2023
@hora-anyscale added the core label Mar 3, 2023

jjyao commented Mar 3, 2023

Hi @dongreenberg,

Is it possible to have a simple repro without those dependencies?

@jjyao added the P1 label and removed the triage label Mar 3, 2023
@rkooo567 added the needs-repro-script, core-correctness, and P1.5 labels and removed the P1 label Mar 24, 2023

stale bot commented Aug 10, 2023

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

stale bot added the stale label Aug 10, 2023
@rkooo567 removed the stale label Aug 10, 2023