[Core] Introduce AcceleratorManager interface #40286
Conversation
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
The number of Neuron cores if any were detected, otherwise 0.
"""
nc_count: int = 0
neuron_path = "/opt/aws/neuron/bin/"
Note: this path isn't guaranteed to exist in all environments (based on a recent Slack conversation: https://ray-distributed.slack.com/archives/C01DLHZHRBJ/p1696448156026509).
Expected: find the neuron-ls command if it exists, then run it to get the cores.
Also, we did ask the AWS Neuron SDK team about getting the core information via IMDS (API driven), but there are no plans to support this.
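A minimal sketch of that expectation, assuming neuron-ls is discoverable on the PATH and supports JSON output (the exact flag and output schema are assumptions here, not the final implementation):

```python
import json
import shutil
import subprocess


def _get_neuron_core_count() -> int:
    """Sketch: discover neuron-ls on PATH instead of hard-coding its path."""
    neuron_ls = shutil.which("neuron-ls")
    if neuron_ls is None:
        return 0  # SDK not installed in this environment
    try:
        # Assumes neuron-ls can emit JSON: one entry per device, each
        # reporting its core count under "nc_count".
        output = subprocess.check_output([neuron_ls, "--json-output"])
        devices = json.loads(output)
        return sum(device.get("nc_count", 0) for device in devices)
    except (OSError, subprocess.SubprocessError, ValueError):
        return 0
```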
I just copy-pasted from the existing implementation. Are you planning to fix it for other environments?
Can we add a try/catch and fix it in the current PR?
If not, it's ok to move it to an issue and I'll own it (tentative ETA: Q1 2024).
Also add a TODO maybe?
Created #40405 to track this.
Generally lgtm.
from ray._private.accelerators.neuron import NeuronAccelerator


def get_all_accelerators() -> Set[Accelerator]:
These are not DeveloperAPIs, right?
No, just implementation details.
@DeveloperAPI
class Accelerator(ABC):
Suggested change:
- class Accelerator(ABC):
+ class AcceleratorUtil(ABC):

? Since it seems like it is also all static methods.
Renamed to AcceleratorManager per ray-project/enhancements#46 (comment).
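For context, the shape under discussion is roughly an abstract base whose members are all static methods, with one subclass per accelerator type. An illustrative subset, not the full interface:

```python
from abc import ABC, abstractmethod


class AcceleratorManager(ABC):
    """Base class each accelerator type implements (illustrative subset)."""

    @staticmethod
    @abstractmethod
    def get_resource_name() -> str:
        """The Ray resource name for this accelerator, e.g. "GPU"."""

    @staticmethod
    @abstractmethod
    def get_current_node_num_accelerators() -> int:
        """Number of this accelerator type detected on the current node."""


class NeuronAcceleratorManager(AcceleratorManager):
    @staticmethod
    def get_resource_name() -> str:
        return "neuron_cores"

    @staticmethod
    def get_current_node_num_accelerators() -> int:
        return 0  # a real implementation would probe the hardware
```

Since no state is held, subclasses act as namespaced utility bundles rather than instantiable objects.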
"""Get the mapping from accelerator resource name | ||
to the visible ids.""" | ||
|
||
from ray._private.accelerators import ( |
why do we import here?
To avoid a circular dependency.
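That is, the import is deferred to call time so that two modules which reference each other can both finish loading. A generic illustration of the pattern (the enclosing function name is made up):

```python
# A module-level import would run while ray._private.accelerators is itself
# still importing this module, raising ImportError. Importing inside the
# function defers it until call time, when both modules are fully loaded.
def _detect_accelerators():  # hypothetical caller
    from ray._private.accelerators import get_all_accelerator_managers

    return get_all_accelerator_managers()
```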
constraint_name = f"{ray_constants.RESOURCE_CONSTRAINT_PREFIX}" f"{pretty_name}"
return constraint_name

if last_set_visible_accelerator_ids.get(resource_name, None) == accelerator_ids:
    continue  # optimization: already set
Is it new or did we have the same optimization before?
Also, is it really necessary? Aren't they just setting an env var? Maybe remove this for now?
Same old code.
I actually tried to remove it since I also think it might be unnecessary, but it uncovered a bug. I decided to fix the bug in a follow-up PR and then remove this optimization.
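For reference, the optimization in question amounts to a per-resource cache in front of the os.environ write, roughly as follows (the function signature is an assumption; the cache and the skip mirror the quoted snippet):

```python
import os

# Last ids written per resource; lets repeated task launches on the same
# worker skip a redundant os.environ write.
last_set_visible_accelerator_ids = {}


def set_visible_ids(resource_name: str, env_var: str, accelerator_ids: tuple):
    if last_set_visible_accelerator_ids.get(resource_name) == accelerator_ids:
        return  # optimization: already set
    os.environ[env_var] = ",".join(str(i) for i in accelerator_ids)
    last_set_visible_accelerator_ids[resource_name] = accelerator_ids
```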
}


def get_all_accelerator_resource_names() -> Set[str]:
Why don't we just use enums here instead of names directly?
What benefits do enums provide here?
I think it makes sense for the top-level API to accept strings, but for internal functions it's cleaner to pass enums around (otherwise we either rely on the implicit assumption that the input is always valid, or we have to do validation everywhere).
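A sketch of that suggestion (a hypothetical enum, not what the PR ships): the string is validated once at the public boundary, and internal code passes the enum member around:

```python
from enum import Enum


class AcceleratorResource(Enum):
    GPU = "GPU"
    NEURON_CORES = "neuron_cores"


def request_accelerator(name: str) -> AcceleratorResource:
    # Public boundary: validate the string exactly once...
    resource = AcceleratorResource(name)  # raises ValueError on unknown names
    # ...then internals receive a value that is valid by construction.
    return resource
```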
@@ -175,7 +175,7 @@ def testValidateDefaultConfigAWSMultiNodeTypes(self):
     "CPU": 4,
     "memory": 12025908428,
     "neuron_cores": 2,
-    "accelerator_type:aws-neuron-core": 2,
+    "accelerator_type:aws-neuron-core": 1,
why is it changed?
It's a bug from the previous PR. The total quantity of the special accelerator_type resource should only be 1.
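In other words, accelerator_type:&lt;name&gt; is a presence marker used for scheduling affinity rather than a device count, so its quantity stays 1 regardless of how many cores the node has. A sketch of the assumed node resource shape:

```python
# Assumed shape of the node's resource dict after the fix: device resources
# carry real counts, but the accelerator_type entry is only a marker.
node_resources = {
    "CPU": 4,
    "neuron_cores": 2,                      # two schedulable Neuron cores
    "accelerator_type:aws-neuron-core": 1,  # marker: the node has this type
}
```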
@@ -685,36 +684,6 @@ def get_neuron_core_ids(neuron_cores_per_worker):
     nc_f = ray.remote(resources={"neuron_cores": 2})(lambda: get_neuron_core_ids(2))
     assert ray.get(nc_f.remote()) == 2

-    with pytest.raises(ValueError):
why are we removing this?
Because we now allow specifying both GPU and Neuron Core resources for a single task.
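That is, after this PR a declaration like the following is accepted instead of raising ValueError (a sketch; the task body is a placeholder):

```python
import ray


@ray.remote(num_gpus=1, resources={"neuron_cores": 1})
def mixed_task():
    # Before this PR, requesting a GPU and a Neuron core together raised
    # ValueError; now both resources can be claimed by one task.
    return "scheduled"
```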
Returns:
    The resource name: e.g., the resource name for Nvidia GPUs is "GPU"
"""
I think AcceleratorManager should also provide a get_resource_type() interface: get_resource_name would provide the detailed resource name, such as NvidiaGPU or IntelGPU, while get_resource_type would provide the resource type, e.g. both NvidiaGPU and IntelGPU are of type GPU.
get_resource_name returns the Ray resource name, so it should be "GPU". We can add a get_accelerator_family that returns NvidiaGPU or IntelGPU in the future if needed.
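For clarity, that future split would look roughly like this (get_accelerator_family is hypothetical and not part of this PR):

```python
class NvidiaGPUAcceleratorManager:
    @staticmethod
    def get_resource_name() -> str:
        return "GPU"  # the Ray resource name used for scheduling

    @staticmethod
    def get_accelerator_family() -> str:
        # Hypothetical future addition discussed above, not in this PR.
        return "NvidiaGPU"
```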
from ray._private.accelerators.neuron import NeuronAcceleratorManager


def get_all_accelerator_managers() -> Set[AcceleratorManager]:
Can we also support getting accelerator managers from env vars? Some users may need to plug in accelerator manager modules that are not maintained in the Ray repo.
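A hedged sketch of one way that could work, reading module:class entries from a hypothetical RAY_EXTRA_ACCELERATOR_MANAGERS variable (no such variable exists in Ray today):

```python
import importlib
import os


def load_external_accelerator_managers() -> list:
    """Load manager classes named in a hypothetical env var, e.g.
    RAY_EXTRA_ACCELERATOR_MANAGERS="my_pkg.habana:HabanaAcceleratorManager".
    """
    managers = []
    spec = os.environ.get("RAY_EXTRA_ACCELERATOR_MANAGERS", "")
    for entry in filter(None, (e.strip() for e in spec.split(","))):
        module_name, _, class_name = entry.partition(":")
        module = importlib.import_module(module_name)
        managers.append(getattr(module, class_name))
    return managers
```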
…set (ray-project#43714)
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
ray-project#40286 accidentally changed ray.get_gpu_ids() to always return a list of int, while it should return a list of str when CUDA_VISIBLE_DEVICES is set before starting Ray. This PR reverts to the original behavior.
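The restored contract, illustrated (return types per the commit message above; the task body is only a sketch):

```python
import ray


@ray.remote(num_gpus=1)
def show_gpu_ids():
    ids = ray.get_gpu_ids()
    # If CUDA_VISIBLE_DEVICES was set before `ray start`, entries are str
    # (taken verbatim from the env var, which may hold UUIDs); otherwise
    # Ray assigns plain integer indices, so entries are int.
    return ids
```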
Why are these changes needed?
Introduce the AcceleratorManager interface so that support for each accelerator can be added by simply implementing a subclass.

Related issue number
#38504
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.