
[Core] Task waiting to start due to insufficient resources even though resources are present and detected #33000

Open
dongreenberg opened this issue Mar 3, 2023 · 2 comments
Labels
  • bug: Something that is supposed to be working; but isn't
  • core: Issues that should be addressed in Ray Core
  • core-correctness: Leak, crash, hang
  • core-scheduler
  • needs-repro-script: Issue needs a runnable script to be reproduced
  • P1.5: Issues that will be fixed in a couple releases. It will be bumped once all P1s are cleared
  • stability

Comments

@dongreenberg

What happened + What you expected to happen

Our OSS library launches Ray tasks on a cluster on behalf of our users. One user (@dcavadia) reported in run-house/runhouse#9 (comment) that their task fails with a Ray error, and shared the following stack trace and Ray status:

INFO | 2023-02-27 22:59:07,628 | Reloaded module langchain.llms.self_hosted_hugging_face
(raylet) [2023-02-27 22:59:41,974 E 69555 69555] (raylet) worker_pool.cc:502: Some workers of the worker process(69854) have not registered within the timeout. The process is dead, probably it crashed during start.
(raylet) Traceback (most recent call last):
(raylet)   File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/ray/_private/workers/default_worker.py", line 8, in <module>
(raylet)     import ray
(raylet)   File "/home/ubuntu/ubuntu/.local/lib/python3.8/site-packages/ray/__init__.py", line 101, in <module>
(raylet)     _configure_system()
(raylet)   File "/home/ubuntu/ubuntu/.local/lib/python3.8/site-packages/ray/__init__.py", line 98, in _configure_system
(raylet)     CDLL(so_path, ctypes.RTLD_GLOBAL)
(raylet)   File "/home/ubuntu/miniconda3/lib/python3.10/ctypes/__init__.py", line 374, in __init__
(raylet)     self._handle = _dlopen(self._name, mode)
(raylet) OSError: /home/ubuntu/ubuntu/.local/lib/python3.8/site-packages/ray/_raylet.so: undefined symbol: _Py_CheckRecursionLimit

Status:

======== Autoscaler status: 2023-02-27 22:57:07.933530 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_0ba826e055591e93b1eedf2ca00b44c0c8e2ac28fa7b77053bca62f9
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 9.999999999976694e-05/30.0 CPU
 9.999999999998899e-05/1.0 GPU
 0.0/1.0 accelerator_type:A10
 0.00/127.547 GiB memory
 0.00/58.654 GiB object_store_memory

Demands:
 {'CPU': 0.0001, 'GPU': 0.0001}: 1+ pending tasks/actor

We start tasks with {'CPU': 0.0001, 'GPU': 0.0001} (or just the CPU fraction when no GPU is present) because we'd rather leave contention management to the OS than have Ray manage it, and because we want each task to launch in a new worker so we can collect the logs for just that task (we know this is slower, but we don't see any way to collect logs for a single task, only for a worker).
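
For reference, here's a minimal sketch of the kind of fractional request we make (using plain ray.remote rather than our library's wrapper; run_user_fn is just a placeholder for the user's function):

import ray

ray.init(address="auto")  # connect to the running cluster

# Request tiny CPU/GPU fractions so Ray's scheduler effectively never
# holds the task back on resource counts; contention is left to the OS.
@ray.remote(num_cpus=0.0001, num_gpus=0.0001)
def run_user_fn():  # placeholder for the user's function
    return "ok"

print(ray.get(run_user_fn.remote()))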

This looks like a bug to me: the resources appear to be abundantly available, but the task is being held. I may be reading it wrong, because I don't recognize the exception.

Versions / Dependencies

  • Ray: Installed via the skypilot main branch, which (prior to two days ago) should have installed 2.2.0.
  • Python: To be honest I'm a bit confused about the Python version; the stack trace suggests both 3.8 and 3.10 are in play (perhaps that's the issue, but I'm not sure how it happens); see the sketch after this list.
  • OS: Ubuntu
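
For what it's worth, a quick generic check of which interpreter and Ray installation a node actually resolves (a diagnostic sketch only, not taken from the original report):

import sys

import ray

# Print the interpreter and the Ray installation this process picks up,
# to spot a miniconda 3.10 / ~/.local 3.8 mismatch like the one in the trace.
print(sys.executable)
print(sys.version)
print(ray.__file__)
print(ray.__version__)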

Reproduction script

Copied from the script the user shared in the linked comment. Requires the latest runhouse, the latest langchain, transformers, and torch.

from langchain.llms import SelfHostedPipeline, SelfHostedHuggingFaceLLM
from langchain import PromptTemplate, LLMChain
import runhouse as rh
gpu = rh.cluster(name="rh-a10", instance_type="A100:1").save()
gpu.restart_grpc_server(restart_ray=True)
template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])
llm = SelfHostedHuggingFaceLLM(model_id="gpt2", hardware=gpu, model_reqs=["./", "transformers", "torch", "langchain"])
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What NFL team won the Super Bowl in the year Justin Beiber was born?"
llm_chain.run(question)

Issue Severity

None

@dongreenberg added the bug and triage labels Mar 3, 2023
@hora-anyscale added the core label Mar 3, 2023

jjyao commented Mar 3, 2023

Hi @dongreenberg,

Is it possible to have a simple repro without those dependencies?

@jjyao added the P1 label and removed the triage label Mar 3, 2023
@rkooo567 added the needs-repro-script, core-correctness, and P1.5 labels and removed the P1 label Mar 24, 2023

stale bot commented Aug 10, 2023

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

stale bot added the stale label Aug 10, 2023
@rkooo567 removed the stale label Aug 10, 2023