
[Core] Introduce AcceleratorManager interface #40286

Merged
jjyao merged 21 commits into ray-project:master from jjyao/accelerator on Oct 17, 2023

Conversation

jjyao
Contributor

@jjyao jjyao commented Oct 12, 2023

Why are these changes needed?

Introduce the AcceleratorManager interface so that support for each accelerator can be added by just implementing a subclass.

Related issue number

#38504

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao changed the title [Core][WIP] Introduce Accelerator interface [Core] Introduce Accelerator interface Oct 13, 2023
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
The number of Neuron cores if any were detected, otherwise 0.
"""
nc_count: int = 0
neuron_path = "/opt/aws/neuron/bin/"
Contributor

Note: this path isn't guaranteed to exist in all environments (based on a recent Slack conversation https://ray-distributed.slack.com/archives/C01DLHZHRBJ/p1696448156026509).

Expected: find the neuron-ls command if it exists, then run it to get the core count.

Also, we did ask the AWS Neuron SDK team to expose the core information via IMDS (API driven), but there are no plans to support this.

Contributor Author

I just copy-pasted this from the existing implementation. Are you planning to fix it for other environments?

Contributor

Can we add a try/except and fix it in the current PR?
If not, it's ok to move it to an issue and I'll own it (tentative ETA: Q1 2024).
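For reference, a minimal sketch of the suggested guarded lookup: resolving neuron-ls via shutil.which instead of a hard-coded path, and wrapping the subprocess call in try/except. The JSON field name below is an assumption, not necessarily what neuron-ls emits:

```python
import json
import shutil
import subprocess


def detect_neuron_core_count() -> int:
    """Return the number of Neuron cores, or 0 if detection fails.

    Sketch only: looks up neuron-ls on PATH rather than assuming
    /opt/aws/neuron/bin/, and guards the call so nodes without the
    Neuron SDK simply report 0 cores.
    """
    neuron_ls = shutil.which("neuron-ls")
    if neuron_ls is None:
        return 0
    try:
        result = subprocess.run(
            [neuron_ls, "--json-output"],
            capture_output=True,
            check=True,
            timeout=10,
        )
        devices = json.loads(result.stdout)
        # "nc_count" (cores per device) is an assumed field name here.
        return sum(device.get("nc_count", 0) for device in devices)
    except (subprocess.SubprocessError, OSError, ValueError):
        return 0
```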

Contributor

Maybe also add a TODO?

Contributor Author

Created #40405 to track this.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Contributor

@rkooo567 rkooo567 left a comment

Generally lgtm.

python/ray/_private/accelerators/__init__.py (outdated review thread, resolved)
from ray._private.accelerators.neuron import NeuronAccelerator


def get_all_accelerators() -> Set[Accelerator]:
Contributor

These are not DeveloperAPIs, right?

Contributor Author

No, just implementation details.



@DeveloperAPI
class Accelerator(ABC):
Contributor

Suggested change
class Accelerator(ABC):
class AcceleratorUtil(ABC):

? Since it seems like it's all static methods anyway.

Contributor Author

Renamed to AcceleratorManager per ray-project/enhancements#46 (comment)
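For context, a minimal sketch of what such an all-staticmethod ABC could look like; apart from get_resource_name (discussed later in this thread), the method names here are illustrative assumptions rather than the actual Ray API:

```python
import os
from abc import ABC, abstractmethod
from typing import List


class AcceleratorManager(ABC):
    """Per-accelerator logic lives in a subclass; all methods are static."""

    @staticmethod
    @abstractmethod
    def get_resource_name() -> str:
        """Ray resource name for this accelerator, e.g. "GPU"."""

    @staticmethod
    @abstractmethod
    def get_current_node_num_accelerators() -> int:
        """Number of accelerators detected on this node (0 if none)."""

    @staticmethod
    @abstractmethod
    def set_current_process_visible_accelerator_ids(ids: List[str]) -> None:
        """Restrict the current process to the given accelerator IDs."""


class NvidiaGPUAcceleratorManager(AcceleratorManager):
    """Toy subclass: real detection (e.g. via pynvml) is omitted."""

    @staticmethod
    def get_resource_name() -> str:
        return "GPU"

    @staticmethod
    def get_current_node_num_accelerators() -> int:
        return 0  # placeholder: a real implementation queries the NVIDIA driver

    @staticmethod
    def set_current_process_visible_accelerator_ids(ids: List[str]) -> None:
        os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(ids)
```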

python/ray/_private/accelerators/accelerator.py (two outdated review threads, resolved)
"""Get the mapping from accelerator resource name
to the visible ids."""

from ray._private.accelerators import (
Contributor

why do we import here?

Contributor Author

To avoid a circular dependency.
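A minimal sketch of the deferred-import pattern being referenced; the helper function name below is illustrative, not the exact code in the PR:

```python
# Importing ray._private.accelerators at module level would create an import
# cycle, because that package in turn imports from this module.

def get_accelerator_manager_for_resource(resource_name: str):
    # Deferring the import into the function body breaks the cycle: the
    # accelerators package is only loaded when this function is first called.
    from ray._private.accelerators import get_all_accelerator_managers

    for manager in get_all_accelerator_managers():
        if manager.get_resource_name() == resource_name:
            return manager
    return None
```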

constraint_name = f"{ray_constants.RESOURCE_CONSTRAINT_PREFIX}" f"{pretty_name}"
return constraint_name
if last_set_visible_accelerator_ids.get(resource_name, None) == accelerator_ids:
continue # optimization: already set
Contributor

Is it new or did we have the same optimization before?

Contributor

Also, is it really necessary? Aren't they just setting an env var? Maybe remove this for now?

Contributor Author

Same as the old code.

I actually tried to remove it since I also think it might be unnecessary, but doing so uncovered a bug. I decided to fix the bug in a follow-up PR and then remove this optimization.
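For illustration, a minimal sketch of the kind of caching this optimization performs; the function name and signature here are assumptions, not the PR's exact code:

```python
import os
from typing import Dict, List

# Remember the last IDs set per resource so repeated calls with the same
# assignment skip the redundant env-var write.
last_set_visible_accelerator_ids: Dict[str, List[str]] = {}


def set_visible_accelerator_ids(resource_name: str, env_var: str, ids: List[str]) -> None:
    if last_set_visible_accelerator_ids.get(resource_name) == ids:
        return  # optimization: already set
    os.environ[env_var] = ",".join(ids)
    last_set_visible_accelerator_ids[resource_name] = ids


# e.g. set_visible_accelerator_ids("GPU", "CUDA_VISIBLE_DEVICES", ["0", "1"])
```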

}


def get_all_accelerator_resource_names() -> Set[str]:
Contributor

Why don't we just use enums here instead of names directly?

Contributor Author

What benefits do enums provide here?

Contributor

I think it makes sense for the top-level API to accept strings, but for internal functions it's cleaner to pass an enum around (otherwise we either rely on the implicit assumption that the input is always valid, or we have to do validation everywhere).
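A minimal sketch of the enum-at-the-boundary idea being suggested; the enum and its members are illustrative (though "neuron_cores" matches the resource name used elsewhere in this PR):

```python
from enum import Enum


class AcceleratorResource(str, Enum):
    """Internal code passes these members around; the top-level API still
    accepts the plain strings users already write."""

    GPU = "GPU"
    NEURON_CORES = "neuron_cores"


def to_accelerator_resource(name: str) -> AcceleratorResource:
    # Validation happens once at the API boundary: an unknown name raises
    # ValueError here instead of an unchecked string flowing through
    # internal functions.
    return AcceleratorResource(name)
```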

@@ -175,7 +175,7 @@ def testValidateDefaultConfigAWSMultiNodeTypes(self):
"CPU": 4,
"memory": 12025908428,
"neuron_cores": 2,
"accelerator_type:aws-neuron-core": 2,
"accelerator_type:aws-neuron-core": 1,
Contributor

Why did this change?

Contributor Author

It's a bug from the previous PR. The total quantity of the special accelerator_type resource should only be 1.
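For context, a short illustration of why a quantity of 1 is enough: as I understand it, the accelerator_type:... entry acts as a per-node marker resource rather than a count, so tasks only need a small request against it while the actual core count goes through the separate neuron_cores resource. The task below is illustrative:

```python
import ray


# accelerator_type=... becomes a request against the
# "accelerator_type:aws-neuron-core" marker resource; it only needs to exist
# on the node (hence total quantity 1), while the number of cores is
# requested via the separate "neuron_cores" resource.
@ray.remote(resources={"neuron_cores": 1}, accelerator_type="aws-neuron-core")
def on_neuron_node():
    ...
```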

@@ -685,36 +684,6 @@ def get_neuron_core_ids(neuron_cores_per_worker):
nc_f = ray.remote(resources={"neuron_cores": 2})(lambda: get_neuron_core_ids(2))
assert ray.get(nc_f.remote()) == 2

with pytest.raises(ValueError):
Contributor

why are we removing this?

Contributor Author

Because we now allow specifying both GPU and Neuron core resources for a single task.
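For example, something along these lines is now allowed (previously this combination raised a ValueError, per the removed test above); the task body is illustrative:

```python
import ray


@ray.remote(num_gpus=1, resources={"neuron_cores": 1})
def mixed_accelerator_task():
    # ray.get_gpu_ids() returns the GPU IDs assigned to this task; the Neuron
    # cores assigned to it are exposed to the process via an env var set by Ray.
    return ray.get_gpu_ids()
```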

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao requested a review from rkooo567 October 17, 2023 08:30
@jjyao jjyao changed the title [Core] Introduce Accelerator interface [Core] Introduce AcceleratorManager interface Oct 17, 2023
@jjyao jjyao merged commit 16da484 into ray-project:master Oct 17, 2023
65 of 69 checks passed
@jjyao jjyao deleted the jjyao/accelerator branch October 17, 2023 15:25

Returns:
The resource name: e.g., the resource name for Nvidia GPUs is "GPU"
"""
Contributor

I think AcceleratorManager should also provide a get_resource_type() interface: get_resource_name would return the detailed resource name, such as NvidiaGPU or IntelGPU, while get_resource_type would return the resource type, e.g. GPU for both NvidiaGPU and IntelGPU.

Contributor Author

get_resource_name returns the Ray resource name, so it should be "GPU". We can add a get_accelerator_family that returns NvidiaGPU or IntelGPU in the future if needed.

from ray._private.accelerators.neuron import NeuronAcceleratorManager


def get_all_accelerator_managers() -> Set[AcceleratorManager]:
Contributor

Can we also support getting accelerator managers from env vars? Some users may need to add accelerator manager modules that are not maintained in the Ray repo.
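A minimal sketch of what such an env-var hook could look like; RAY_EXTRA_ACCELERATOR_MANAGERS and the loader below are hypothetical, not an existing Ray feature:

```python
import importlib
import os
from typing import List


def load_extra_accelerator_managers() -> List[type]:
    """Load manager classes listed in a hypothetical env var, e.g.
    RAY_EXTRA_ACCELERATOR_MANAGERS="my_pkg.accel:MyAcceleratorManager".
    """
    managers = []
    spec = os.environ.get("RAY_EXTRA_ACCELERATOR_MANAGERS", "")
    for entry in filter(None, (part.strip() for part in spec.split(","))):
        module_name, _, class_name = entry.partition(":")
        module = importlib.import_module(module_name)
        managers.append(getattr(module, class_name))
    return managers
```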

fishbone pushed a commit that referenced this pull request Mar 5, 2024
…set (#43714)

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

#40286 accidentally changed ray.get_gpu_ids() to always return a list of int, while it should return a list of str when CUDA_VISIBLE_DEVICES is set before starting Ray.

This PR reverts to the original behavior.
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024
…set (ray-project#43714)
6 participants